I Introduction
The Gaussian process (GP) is a nonparametric probabilistic model that can be used to model an unknown nonlinear function $f$ from observed input data $x$ and (noisy) output data $y$. No explicit form of $f$ is assumed, but some assumptions on $f$ are encoded through the GP prior: a mean function $m(\cdot)$, a covariance function $\kappa(\cdot,\cdot)$, and their so-called hyperparameters $\eta$. In mathematical terms, $f$ is a priori modeled as distributed according to
$$ f \sim \mathcal{GP}\big(m(\cdot),\, \kappa(\cdot,\cdot)\big), \qquad (1) $$
i.e., an infinite-dimensional Gaussian distribution. See [1] for a more general introduction to GPs.

The posterior distribution over $f$ given data $y$ is also a GP, due to the conjugacy property of the Gaussian distribution. The posterior is often greatly influenced by the choice of hyperparameters $\eta$, which are typically unknown. We therefore propose a method to marginalize the hyperparameters in GPs. Marginalization can be seen as averaging over the range of hyperparameters supported by the data and by the prior;
$\eta$ can be integrated out by treating it as a random variable with prior $p(\eta)$ and likelihood $p(y \mid \eta)$, giving rise to the posterior $p(\eta \mid y)$. For example, the predictive distribution $p(f^\star \mid y)$ is computed by

$$ p(f^\star \mid y) = \int p(f^\star \mid y, \eta)\, p(\eta \mid y)\, \mathrm{d}\eta, \qquad (2) $$
which unfortunately is analytically intractable. However, using a Monte Carlo method to obtain (weighted) samples $\{\eta^i, w^i\}_{i=1}^{N}$ of the distribution $p(\eta \mid y)$, the predictive distribution (2) can be approximated by

$$ p(f^\star \mid y) \approx \sum_{i=1}^{N} w^i\, p(f^\star \mid y, \eta^i), \qquad (3) $$

where the weights $w^i$ are normalized, i.e., $\sum_{i=1}^{N} w^i = 1$.
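To make the averaging in (3) concrete, the following is a minimal Python sketch (the paper's own code is in Matlab). The per-sample prediction function is a placeholder, not an actual GP predictive; only the weighted-sum structure mirrors (3).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weighted hyperparameter samples {eta^i, w^i}, standing in
# for draws from p(eta | y) produced by a posterior sampler.
eta = rng.normal(size=100)
w = rng.random(100)
w /= w.sum()                                  # normalize: sum_i w^i = 1

def predictive_mean(eta_i):
    # stand-in for E[f* | y, eta_i]; any per-sample prediction would do here
    return np.tanh(eta_i)

# Monte Carlo estimate of the marginal predictive mean, in the spirit of (3)
mean_marginal = np.sum(w * predictive_mean(eta))
```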
A common alternative to marginalization is to choose a point estimate $\hat{\eta}$ of $\eta$ using an optimization procedure maximizing the likelihood $p(y \mid \eta)$ (sometimes referred to as empirical Bayes). This may be difficult if the likelihood is multimodal. See the small toy example in Figure 3, which illustrates the robustness of marginalization compared to point estimates. There are also situations where point estimates are not sufficient and marginalization is necessary, such as the change point detection problem in Section III-C.
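As a concrete illustration of the GP prior (1), the following sketch draws functions from a zero-mean GP with a squared-exponential covariance function; the hyperparameter names (length scale `ell`, signal variance `sf2`) are a common convention, not specifics taken from this paper.

```python
import numpy as np

def sq_exp_kernel(x1, x2, ell=1.0, sf2=1.0):
    # Squared-exponential covariance kappa(x, x') with length scale ell
    # and signal variance sf2 (illustrative hyperparameter choices).
    d = x1[:, None] - x2[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
m = np.zeros_like(x)                          # zero mean function m(x) = 0
K = sq_exp_kernel(x, x) + 1e-9 * np.eye(50)   # small jitter for stability
f = rng.multivariate_normal(m, K, size=3)     # three draws from the GP prior
```

Different hyperparameter values yield visibly different function draws, which is why the choice (or marginalization) of $\eta$ matters so much for the posterior.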
Our contribution is a method for sampling from the hyperparameter posterior distribution $p(\eta \mid y)$, based on sequential Monte Carlo (SMC) samplers [2]. SMC samplers and their convergence properties are well studied [3].
Several methods have previously been proposed in the literature for marginalizing the GP hyperparameters: Bayesian Monte Carlo (BMC) [4], slice sampling [5], Hamiltonian Monte Carlo [6, 7], and adaptive importance sampling (AIS) [8]. Particle learning, which is closely related to SMC, has been proposed for this purpose by [9]. The work by [9], however, does not target the hyperparameters directly, and makes (possibly restrictive) assumptions on conjugate priors and model structure.
In this paper, we compare our proposed method to some of these methods, and apply it to two real-data problems: the first demonstrates that marginalization does not have to be more computationally demanding than finding point estimates. The second example, which deals with a fault detection problem from industry, is possible only with an efficient method for marginalization. Our proposed method (and all examples) is available as Matlab code via the first author's homepage.
From the experiments, we conclude that the advantages of the proposed method are (i) robustness towards multimodal hyperparameter posteriors, (ii) simplified tuning (compared to some other alternatives), (iii) competitive computational load, and (iv) online updating of hyperparameters as the data record grows.
II Sampling hyperparameters using SMC
For the numerical marginalization (3), we require samples, known as particles, from the posterior. In this section, we discuss how to use an SMC sampler [2] to generate such a particle system $\{\eta^i, w^i\}_{i=1}^{N}$, where $w^i$ is the weight of particle $\eta^i$. The underlying idea is to construct a sequence of probability distributions $\pi_t$ ($t = 0, \dots, K$), starting from the prior and ending up in the posterior. The particles are then 'guided' through the sequence.

To construct a sequence $\{\pi_t\}_{t=0}^{K}$, we use the fact that $p(\eta \mid y)$ depends on the data $y$: we partition the data points into $K$ disjoint batches $y_t$ and add them sequentially, so that $\pi_t(\eta) = p(\eta \mid y_{1:t})$.

To guide the particles through the smooth sequence $\{\pi_t\}$, we iteratively apply the three steps weighting, resampling, and propagation, akin to a particle filter.
In the weighting step, the 'usefulness' of each particle is evaluated. To ensure convergence properties, the particles can be weighted as [2, Section 3.3.2]

$$ w_t^i \propto w_{t-1}^i\, \frac{\pi_t(\eta_{t-1}^i)}{\pi_{t-1}(\eta_{t-1}^i)}. \qquad (4) $$
To avoid numerical problems, the particles have to be resampled. The idea is to duplicate particles with large weights, and discard particles with small weights.
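The resampling step can be sketched as follows; this shows systematic resampling, one common low-variance scheme (the paper does not specify which scheme it uses).

```python
import numpy as np

def systematic_resample(w, rng):
    # Return N ancestor indices: a particle with weight w[i] is duplicated
    # roughly w[i] * N times in expectation; light particles are discarded.
    N = len(w)
    positions = (rng.random() + np.arange(N)) / N
    return np.searchsorted(np.cumsum(w), positions)

rng = np.random.default_rng(0)
w = np.array([0.7, 0.1, 0.1, 0.1])            # normalized weights
idx = systematic_resample(w, rng)             # heavy particle 0 gets duplicated
```

After resampling, the surviving particles are assigned equal weights.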
To propagate the particles from $\pi_{t-1}$ to $\pi_t$, a Metropolis-Hastings (MH) kernel with invariant distribution $\pi_t$ can be used. The procedure of propagating $\eta_{t-1}^i$ (a sample of $\pi_{t-1}$) to $\eta_t^i$ (a sample of $\pi_t$) is as follows: (i) Sample a new particle $\eta'$ from a proposal $q(\cdot \mid \eta_{t-1}^i)$, e.g., a random walk with variance $h$. (ii) Compensate for the discrepancy between $\pi_{t-1}$ and $\pi_t$ by setting $\eta_t^i = \eta'$ with probability

$$ \alpha = \min\left(1,\; \frac{\pi_t(\eta')\, q(\eta_{t-1}^i \mid \eta')}{\pi_t(\eta_{t-1}^i)\, q(\eta' \mid \eta_{t-1}^i)}\right), \qquad (5) $$

and otherwise $\eta_t^i = \eta_{t-1}^i$. To improve the mixing, this procedure can be repeated $P$ times.
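The full weight-resample-propagate loop can be sketched on a deliberately simple target. This is not the paper's GP likelihood: as a stand-in we infer a scalar parameter with a Gaussian prior and Gaussian data, so that the sequence of partial posteriors is easy to check. All constants ($K = 8$ batches, $N = 200$ particles, $P = 3$ MH moves, random-walk standard deviation $0.5$) are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: y ~ N(eta, 1) with "hyperparameter" eta, prior eta ~ N(0, 10).
y = rng.normal(2.0, 1.0, size=40)
batches = np.array_split(y, 8)                    # K = 8 disjoint data batches

def log_prior(eta):
    return -0.5 * eta ** 2 / 10.0

def log_lik(eta, data):
    # sum of log N(y_j; eta, 1) up to constants, vectorized over particles
    return -0.5 * np.sum((data[None, :] - eta[:, None]) ** 2, axis=1)

N, P, step = 200, 3, 0.5                          # particles, MH moves, RW std
eta = rng.normal(0.0, np.sqrt(10.0), size=N)      # initialize from the prior
logw = np.zeros(N)
seen = np.empty(0)
for batch in batches:
    # weighting: for iid data, pi_t / pi_{t-1} is the new batch's likelihood
    logw += log_lik(eta, batch)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # resampling: duplicate heavy particles, discard light ones
    eta = eta[rng.choice(N, size=N, p=w)]
    logw = np.zeros(N)
    seen = np.concatenate([seen, batch])
    # propagation: P random-walk MH moves with invariant distribution pi_t
    for _ in range(P):
        prop = eta + step * rng.normal(size=N)
        log_a = (log_prior(prop) + log_lik(prop, seen)
                 - log_prior(eta) - log_lik(eta, seen))
        accept = np.log(rng.random(N)) < log_a
        eta = np.where(accept, prop, eta)
```

After the final batch, the particle cloud `eta` approximates the full posterior; in this conjugate toy case its mean should land close to the sample mean of `y`.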


We now have an SMC sampler to obtain samples from the hyperparameter posterior, summarized in Algorithm 1 and illustrated by Figure 6. From the figure, the suitability for online applications is clear: if another data point is added to the data, the sequence $\{\pi_t\}_{t=0}^{K}$ can simply be extended by a distribution $\pi_{K+1}$ including the new data point, and only the transition from $\pi_K$ to $\pi_{K+1}$ has to be performed.
We make use of the adaptive SMC sampler by [10] in the numerical examples to adapt the proposal automatically.
The computational cost of Algorithm 1 is in practice governed by the evaluations of the likelihood $p(y \mid \eta)$. Hence, it is important to choose the number of samples $N$, SMC steps $K$, and MH moves per SMC step $P$ sensibly. An idea of sensible numbers will be given along with the examples in the next section.
III Examples and results
We consider three examples to demonstrate our proposed approach: first a small simulated example, also comparing to alternative sampling methods, and thereafter two applications with real-world data. The first real-data example is a benchmark problem comparing the marginalization approach of Algorithm 1 to point estimates obtained using optimization. In the third example, we illustrate how our solution can be used within a GP-based online change point detection algorithm. This requires marginalization of the hyperparameters, so an efficient hyperparameter posterior sampler is indeed a key enabler. The online nature of the problem also fits well with the possibility to update the samples in Algorithm 1 online, as discussed in Section II.
III-A Simulated example
We begin with a small simulated problem: marginalizing out 7 hyperparameters (of the covariance and mean functions) in a GP prior given 5 data points. Here, we are interested in comparing the performance of our SMC sampler (Algorithm 1) with some popular alternative methods: BMC [4], AIS [8], and (deterministic) gridding.
The results for 15 runs are presented in Figure 7; a method is preferable if consecutive runs give similar results. The variation between runs decreases faster for Algorithm 1 than for the comparable methods. We conclude that AIS and gridding might be competitive when the GP prior has few hyperparameters. We have not managed to obtain competitive results with BMC for any problem size, but it should be noted that the computational load of BMC can be decreased substantially if the hyperparameter prior is independent between the dimensions.
The results for the conceptually different point estimates are also presented in Figure 7. The initialization point for the optimization algorithm is drawn from the prior: although optimization is a deterministic method, it is obviously very sensitive to the initialization.
III-B Learning a robot arm model
We consider the problem of learning the inverse dynamics of a seven degrees-of-freedom SARCOS anthropomorphic robot arm [11, 1]. We use the same setup as [1, Section 2.5], i.e., a nontrivial setting involving 23 hyperparameters.

To handle the size of the data set, we make use of (i) a subset of data points and (ii) a subset of regressors, as discussed in [1, Section 8.3.1]. To use our method, we sample the hyperparameters from the posterior based on a subset of $m$ data points. For comparability, we have also reproduced the point-estimate results from [1]. The results are reported in Table I. Priors were placed on the logarithms of the length scales and the signal variance, as well as on the noise variance.
Table I

Method       m      SMSE           MSLL            Time (s)

Subset of data points
Point est.   256    8.36 ± 0.80    -1.38 ± 0.04     6.8
SMC          256    8.10 ± 1.32    -1.38 ± 0.56     7.1
Point est.   512    6.36 ± 1.13    -1.51 ± 0.05    26.4
SMC          512    6.13 ± 0.91    -1.49 ± 0.04    22.3
Point est.   1024   4.31 ± 0.16    -1.66 ± 0.02   101
SMC          1024   4.54 ± 0.33    -1.61 ± 0.03    92.5
Point est.   2048   2.99 ± 0.08    -1.78 ± 0.03   423
SMC          2048   3.33 ± 0.28    -1.69 ± 0.06   405

Subset of regressors
Point est.   256    3.67 ± 0.17    -1.63 ± 0.02     6.8
SMC          256    3.55 ± 0.28    -1.65 ± 0.05     7.1
Point est.   512    2.77 ± 0.44    -1.79 ± 0.07    26.4
SMC          512    2.89 ± 0.20    -1.77 ± 0.03    22.3
Point est.   1024   2.03 ± 0.11    -1.95 ± 0.03   101
SMC          1024   2.00 (–)       -1.95 (–)       92.5
Table I presents the results in the same way as [1, Table 8.1]. SMSE is the standardized mean squared error (the mean squared error normalized by the variance of the target), and MSLL is the mean standardized log loss, which is 0 when predicting with a Gaussian density having the mean and variance of the training data, and negative if 'better'. The time refers to the time required to sample and to optimize the hyperparameters, respectively (not including the test evaluation). Numerical problems were experienced for large $m$; a dash (–) indicates runs where no interval can be reported.
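The two metrics defined above can be computed as follows; this is a generic sketch from their stated definitions, not code from the paper.

```python
import numpy as np

def smse(y_true, mu_pred):
    # standardized mean squared error: MSE normalized by the target variance
    return np.mean((y_true - mu_pred) ** 2) / np.var(y_true)

def msll(y_true, mu_pred, var_pred, y_train):
    # mean standardized log loss: predictive Gaussian negative log density
    # minus that of a trivial Gaussian fitted to the training targets,
    # so 0 means "no better than trivial" and negative means better
    nll = 0.5 * (np.log(2 * np.pi * var_pred)
                 + (y_true - mu_pred) ** 2 / var_pred)
    m0, v0 = np.mean(y_train), np.var(y_train)
    nll0 = 0.5 * (np.log(2 * np.pi * v0) + (y_true - m0) ** 2 / v0)
    return np.mean(nll - nll0)
```

By construction, predicting every test point with the training mean and variance gives SMSE 1 and MSLL 0, matching the interpretation in the text.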
Table I indicates no significant difference between the performance of our method and point estimates. It is, however, worth also noting the computational load: Algorithm 1 apparently does an equally good job of finding relevant hyperparameters as the optimization, at a comparable cost, which confirms that our proposed method is indeed a competitive alternative to point estimates even for large problems.
III-C Fault detection of oxygen sensors
We now consider data from the wastewater treatment plant Käppalaverket, Sweden. An oxygen sensor measures the dissolved oxygen (in mg/l) in a bioreactor, but the sensor gets clogged because of suspended cleaning. The identification of such events is relevant to the control of wastewater treatment plants [12]. We apply the GPbased online change point detection algorithm by [7], where the hyperparameters are marginalized using our proposed method.
The GP-based change point detection presented by [7] can be summarized as follows: if the data undergo a change at some time point, it is of interest to detect it online, i.e., to estimate the change point from the data seen so far. The algorithmic idea is a recursive message passing scheme, updating at each time $t$ the probability of the time elapsed since the last change point.
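One step of this message passing can be sketched as the standard run-length recursion used in Adams-MacKay-style online change point detection, which underlies [7]. The hazard rate and the per-run-length predictive densities are placeholders here; in the paper's setting, the predictive densities come from the GP model with marginalized hyperparameters.

```python
import numpy as np

def update_run_length(log_r, log_pred, hazard):
    # One recursive message-passing step for the run-length posterior.
    # log_r:    log probabilities of each run length at time t-1
    # log_pred: log predictive density of y_t under each current run length
    # hazard:   prior probability of a change point at each step
    grow = log_r + log_pred + np.log1p(-hazard)                      # no change
    change = np.logaddexp.reduce(log_r + log_pred) + np.log(hazard)  # reset
    log_new = np.concatenate([[change], grow])
    return log_new - np.logaddexp.reduce(log_new)                    # normalize
```

Starting from a single run length of zero and feeding in one predictive log density per observation yields the full posterior over change points, which can then be thresholded for point estimates as done in the example.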
To make predictions using a GP model, the hyperparameters either have to be fixed across all data segments, or marginalized. As fixed hyperparameters are not relevant here, an efficient sampling algorithm is a key enabler in solving this problem. The algorithm needs consecutive predictive distributions, hence our approach fits this problem well, as discussed in Section II. On average, sampling the hyperparameters, i.e., one run of Algorithm 1, took 0.55 seconds on a standard desktop computer.
The results are presented in Figure (a). The expected events, the suspension and resumption of the cleaning, are indeed indicated. An interpretation of the result is obtained by converting the results to point estimates by thresholding, and plotting the GP regression for each individual segment; see Figure (b).
Note the data-driven nature of the algorithm: no explicit model of the sensor was used at all. The tuning parameters are the covariance and mean functions, the prior on the change points, and the hyperparameter priors.
IV Conclusion
We have proposed and demonstrated an SMC-based method to marginalize hyperparameters in GP models. The observed benefits are robustness towards multimodal posteriors (Figure 3) and a competitive computational load (Section III-B), also compared to the commonly used point estimates of the hyperparameters. We have been able to cope with a hyperparameter space of dimension 23 (Section III-B), and have also observed sound convergence behavior (Section III). Finally, the online update of the hyperparameters has been shown to be useful within the industry-relevant data-driven fault detection application (Section III-C). As a future direction, it would be interesting to apply our method to the challenging GP optimization problem of system identification [13].
Acknowledgments
This work was supported by the project Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524), funded by the Swedish Research Council (VR). We would also like to thank Oscar Samuelsson and Dr. Jesús Zambrano for providing the sensor data in Section III-C.
References

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
[2] P. Del Moral, A. Doucet, and A. Jasra, "Sequential Monte Carlo samplers," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 3, pp. 411–436, 2006.
[3] N. Whiteley, "Sequential Monte Carlo samplers: error bounds and insensitivity to initial conditions," Stochastic Analysis and Applications, vol. 30, no. 5, pp. 774–798, 2012.
[4] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings, "Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes," in Proceedings of the 7th International Conference on Information Processing in Sensor Networks, St. Louis, MO, USA, Apr. 2008, pp. 109–120.
[5] D. K. Agarwal and A. E. Gelfand, "Slice sampling for simulation based fitting of spatial data models," Statistics and Computing, vol. 15, no. 1, pp. 61–69, 2005.
[6] R. M. Neal, "MCMC using Hamiltonian dynamics," in Handbook of Markov Chain Monte Carlo, S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Eds. Chapman & Hall/CRC Press, 2010.
[7] Y. Saatçi, R. D. Turner, and C. E. Rasmussen, "Gaussian process change point models," in Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, Jun. 2010, pp. 927–934.
[8] D. Petelin, M. Gašperin, and V. Šmídl, "Adaptive importance sampling for Bayesian inference in Gaussian process models," in Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, Aug. 2014, pp. 5011–5015.
[9] R. B. Gramacy and N. G. Polson, "Particle learning of Gaussian process models for sequential design and optimization," Journal of Computational and Graphical Statistics, vol. 20, no. 1, pp. 102–118, 2011.
[10] P. Fearnhead and B. M. Taylor, "An adaptive sequential Monte Carlo sampler," Bayesian Analysis, vol. 8, no. 2, pp. 411–438, 2013.
[11] S. Vijayakumar and S. Schaal, "Locally weighted projection regression: Incremental real time learning in high dimensional space," in Proceedings of the 17th International Conference on Machine Learning (ICML), Stanford, CA, USA, Jun. 2000, pp. 1079–1086.
[12] G. Olsson, B. Carlsson, J. Comas, J. Copp, K. V. Gernaey, P. Ingildsen, U. Jeppsson, C. Kim, L. Rieger, I. Rodriguez-Roda, J.-P. Steyer, I. Takács, P. A. Vanrolleghem, A. Vargas Casillas, Z. Yuan, and L. Åmand, "Instrumentation, control and automation in wastewater – from London 1973 to Narbonne 2013," Water Science and Technology, vol. 69, no. 7, pp. 1373–1385, 2014.
[13] J. Dahlin and F. Lindsten, "Particle filter-based Gaussian process optimisation for parameter inference," in Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, Aug. 2014, pp. 8675–8680.