I Introduction
Classification is an important problem with many applications, for example, in handwriting and speech recognition, and medical diagnosis [1]. In (supervised) classification, a set of training data points with their corresponding classes is available to learn the underlying structure of the problem. Based on this information, the objective is to infer the classes of new data points. This classification problem can be posed using Gaussian processes (GPs) [2, 3, 4, 5, 6, 7, 8].
In binary GP classification, it is assumed that there is a latent function, distributed as a GP [2]
, whose value at a certain data point is related to the probability that this data point belongs to one class. The GP prior has some hyperparameters that can be marginalised out
[9], or estimated by maximising the log marginal likelihood
[2]. Then, for the estimated hyperparameters, we compute the posterior distribution over the latent function evaluated at the training data points, which in turn allows us to predict the classes of new data points.

The main difficulties in GP classification are the approximations of the posterior and the log marginal likelihood. Markov chain Monte Carlo algorithms [10] can provide very accurate approximations, but they usually have a high computational burden. This is the reason why there is interest in computationally efficient approximations. We proceed to review two such approximations from the literature.

One possibility is to use the Laplace approximation [2]. A drawback of the Laplace approximation is that it cannot handle likelihood functions whose gradient is zero almost everywhere, such as the noisy threshold likelihood [4].
Expectation propagation (EP) [11] can be used with all likelihood functions and often outperforms the Laplace approximation in GP classification [12, 13, 14]
. EP is an iterative algorithm in which, at each iteration, a Gaussian approximation to the likelihood for one data point is reapproximated. In order to do this, we first remove the considered approximated likelihood from the posterior approximation, which results in the “cavity” distribution. Then, we use the true likelihood and the cavity distribution to provide a new Gaussian approximation to the likelihood by performing moment matching. EP has two drawbacks that are relevant to this paper: 1) the cavity distributions can have negative definite covariance matrices with possibly large negative eigenvalues that are not due to numerical errors
[1, 15], and 2) there is no convergence proof in the literature that indicates conditions of convergence [2]. In order to deal with 1), simple, ad hoc solutions are sometimes used, for example, processing the likelihoods in a different order to see if this resolves the issue, or arbitrarily setting the negative definite covariance to a predefined positive definite matrix [15]. More robust EP algorithms, such as damped EP or double-loop algorithms, also require pragmatic solutions to avoid negative definite covariance matrices [16].

This paper proposes an algorithm for GP classification based on posterior linearisation (PL) [17, 18]
. In PL, we compute a Gaussian approximation to the posterior distribution by linearising the conditional mean of each label, in the region where the posterior lies, and by setting the conditional covariance to a value that accounts for the linearisation error. Importantly, the selection of the linearisation parameters is done in an optimal way by minimising the mean square error of the linearisation. This optimal linearisation is given by statistical linear regression (SLR)
[19] of the conditional mean with respect to the posterior.

In practice, PL is implemented by an iterative procedure, in which we can process the likelihoods sequentially or in parallel. PL has some advantages compared to EP: 1) PL always provides positive definite covariance matrices, so ad hoc fixes are not necessary, 2) PL is simpler to implement and does not compute cavity distributions, and 3) there is a local convergence proof [18] that indicates sufficient local conditions for convergence. Experimental results show that PL generally outperforms Laplace and EP in GP classification.
II Problem formulation
We consider a set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, which contains $n$ data points $\mathbf{x}_i$, with $\mathbf{x}_i \in \mathbb{R}^{D}$, and their binary labels $y_i$, with $y_i \in \{-1, 1\}$. Then, in binary classification, given this set, we are interested in predicting the binary labels for new data points [2]. We proceed to explain how this problem is formulated using GPs.
It is assumed that the label $y_i$ of a data point only depends on the value $f_i = f(\mathbf{x}_i)$ of a latent real-valued function $f$ evaluated at $\mathbf{x}_i$. Then, there are several widely used models for the probability mass function $p(y_i \mid f_i)$, for example, the probit, logit, and noisy threshold, whose respective $p(y_i = 1 \mid f_i)$ are [4]

(1) $p(y_i = 1 \mid f_i) = \Phi(f_i)$

(2) $p(y_i = 1 \mid f_i) = \dfrac{1}{1 + \exp(-f_i)}$

(3) $p(y_i = 1 \mid f_i) = \epsilon + (1 - 2\epsilon)\, H(f_i)$

where $\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(u; 0, 1)\,\mathrm{d}u$, $\mathcal{N}(\cdot; \mu, \Sigma)$ is the Gaussian density with mean $\mu$ and covariance $\Sigma$, $\epsilon \in [0, 1/2)$, and $H(f) = 1$ if $f > 0$, and $H(f) = 0$ otherwise.
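As an illustration, the three likelihoods above can be sketched in Python as follows. The function names, the label convention $y_i \in \{-1, +1\}$, and the default flip probability `eps` are choices of this sketch, not specifications from the text:

```python
import numpy as np
from scipy.stats import norm

def probit_likelihood(y, f):
    """Probit: p(y | f) = Phi(y * f), with labels y in {-1, +1}."""
    return norm.cdf(y * f)

def logit_likelihood(y, f):
    """Logit (logistic): p(y | f) = 1 / (1 + exp(-y * f))."""
    return 1.0 / (1.0 + np.exp(-y * f))

def noisy_threshold_likelihood(y, f, eps=0.1):
    """Noisy threshold: p(y | f) = eps + (1 - 2*eps) * H(y * f),
    where H(u) = 1 if u > 0 and 0 otherwise; eps is the label-flip probability."""
    h = np.where(y * f > 0, 1.0, 0.0)
    return eps + (1.0 - 2.0 * eps) * h
```

Note that the noisy threshold likelihood is piecewise constant in `f`, which is why its gradient is zero almost everywhere and the Laplace approximation cannot handle it.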
It is further assumed that function $f$ is distributed according to a zero-mean Gaussian process with a given covariance function $k(\mathbf{x}, \mathbf{x}'; \theta)$, where $\theta$ is a vector that contains all the hyperparameters. As a result, the prior density of $\mathbf{f} = (f_1, \ldots, f_n)^{\top}$ becomes

(4) $p(\mathbf{f} \mid \theta) = \mathcal{N}(\mathbf{f}; \mathbf{0}, K)$

where $K$ is an $n \times n$ covariance matrix such that $[K]_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j; \theta)$. The posterior of $\mathbf{f}$ is

(5) $p(\mathbf{f} \mid \mathbf{y}) = \dfrac{p(\mathbf{f} \mid \theta) \prod_{i=1}^{n} p(y_i \mid f_i)}{p(\mathbf{y} \mid \theta)}$

where $\mathbf{y} = (y_1, \ldots, y_n)^{\top}$ and the marginal likelihood is

(6) $p(\mathbf{y} \mid \theta) = \int p(\mathbf{f} \mid \theta) \prod_{i=1}^{n} p(y_i \mid f_i)\,\mathrm{d}\mathbf{f}.$
The hyperparameters are usually not known and are often estimated by maximising (6) [2]. Once we have estimated $\theta$, the posterior over $\mathbf{f}_{*} = (f(\mathbf{x}_{*,1}), \ldots, f(\mathbf{x}_{*,m}))^{\top}$, which considers $m$ new data points, is computed as

(7) $p(\mathbf{f}_{*} \mid \mathbf{y}) = \int p(\mathbf{f}_{*} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{y})\,\mathrm{d}\mathbf{f}.$

Finally, all information about the class labels $\mathbf{y}_{*}$ for the new data points is given by the distribution of $\mathbf{y}_{*}$ given $\mathbf{y}$, $\mathcal{D}$, and $\theta$, which can be written as

(8) $p(\mathbf{y}_{*} \mid \mathbf{y}) = \int p(\mathbf{y}_{*} \mid \mathbf{f}_{*})\, p(\mathbf{f}_{*} \mid \mathbf{y})\,\mathrm{d}\mathbf{f}_{*}.$
Based on this distribution, we can predict the class labels, which is our main objective, for example, by computing their expected value [2]. Unfortunately, none of the densities of interest, (5)-(8), has a closed-form expression, so approximations are necessary. As in the Laplace and EP methods, in this paper, we consider Gaussian approximations of (5) and (7).
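Once the posterior over a test latent value is Gaussian, the predictive class probability is a one-dimensional integral. For the probit likelihood this integral has the well-known closed form $\Phi(\mu/\sqrt{1+\sigma^2})$ [2]; for other likelihoods a quadrature rule can be used. The sketch below (function names are ours) shows both routes:

```python
import numpy as np
from scipy.stats import norm

def probit_predictive(mu, var):
    """p(y=+1) = E[Phi(f)] for f ~ N(mu, var); closed form for the probit
    likelihood (standard Gaussian identity, see [2])."""
    return norm.cdf(mu / np.sqrt(1.0 + var))

def predictive_quadrature(mu, var, order=10):
    """Generic alternative: approximate E[p(y=+1 | f)] under N(mu, var) by
    Gauss-Hermite quadrature; shown here for the probit likelihood."""
    x, w = np.polynomial.hermite_e.hermegauss(order)
    w = w / np.sqrt(2.0 * np.pi)   # weights for a standard normal density
    return np.sum(w * norm.cdf(mu + np.sqrt(var) * x))
```

The quadrature route is what allows likelihoods without closed-form predictive integrals, such as the logit, to be handled with the same machinery.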
II-A Enabling approximation
We can obtain a Gaussian approximation to the posterior by approximating the conditional mean $\mathrm{E}[y_i \mid f_i]$ as an affine function and the conditional variance $\mathrm{Var}[y_i \mid f_i]$ as a constant:

(9) $\mathrm{E}[y_i \mid f_i] \approx a_i f_i + b_i, \qquad \mathrm{Var}[y_i \mid f_i] \approx \omega_i$

where $a_i \in \mathbb{R}$, $b_i \in \mathbb{R}$, and $\omega_i > 0$, and approximating the conditional distribution of $y_i$ given $f_i$ as a Gaussian

(10) $p(y_i \mid f_i) \approx \mathcal{N}(y_i; a_i f_i + b_i, \omega_i).$

We should note that the conditional moments $\mathrm{E}[y_i \mid f_i]$ and $\mathrm{Var}[y_i \mid f_i]$ are taken with respect to $p(y_i \mid f_i)$.
Once we make approximation (10), by GP regression [2], the posterior of $\mathbf{f}$, which is given by (5), is Gaussian with mean and covariance matrix

(11) $\mu = K A^{\top} \left(A K A^{\top} + \Omega\right)^{-1} (\mathbf{y} - \mathbf{b})$

(12) $\Sigma = K - K A^{\top} \left(A K A^{\top} + \Omega\right)^{-1} A K$

where $A = \mathrm{diag}(a_1, \ldots, a_n)$, $\mathbf{b} = (b_1, \ldots, b_n)^{\top}$, and $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$. Similarly, the posterior of $\mathbf{f}_{*}$, see (7), becomes Gaussian with mean and covariance matrix

(13) $\mu_{*} = K_{*} A^{\top} \left(A K A^{\top} + \Omega\right)^{-1} (\mathbf{y} - \mathbf{b})$

(14) $\Sigma_{*} = K_{**} - K_{*} A^{\top} \left(A K A^{\top} + \Omega\right)^{-1} A K_{*}^{\top}$

where $K_{*}$ and $K_{**}$ are $m \times n$ and $m \times m$ matrices such that $[K_{*}]_{i,j} = k(\mathbf{x}_{*,i}, \mathbf{x}_j; \theta)$ and $[K_{**}]_{i,j} = k(\mathbf{x}_{*,i}, \mathbf{x}_{*,j}; \theta)$.
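Under the affine likelihood approximation, the posterior update is the standard GP-regression (Kalman-type) update. A minimal sketch, with names and vectorised parameter layout assumed by us (a numerically stable implementation would use Cholesky factors, as in [2], rather than an explicit inverse):

```python
import numpy as np

def linearised_posterior(K, y, a, b, omega):
    """Gaussian posterior over the latent vector f when each likelihood is
    replaced by N(y_i; a_i f_i + b_i, omega_i): an illustrative sketch of the
    GP-regression update. K is the n x n prior covariance; a, b, omega are
    length-n vectors of linearisation parameters."""
    A = np.diag(a)
    S = A @ K @ A + np.diag(omega)   # covariance of the pseudo-observations
    G = K @ A @ np.linalg.inv(S)     # gain matrix
    mean = G @ (y - b)
    cov = K - G @ A @ K
    return mean, cov
```

Because `omega` is strictly positive, `S` is positive definite and the resulting posterior covariance is positive definite as well, which is the property that PL exploits.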
III Posterior linearisation of GPs
In this section, we first explain SLR using conditional moments in Section III-A. Then, we explain iterated posterior linearisation in Section III-B. We propose a method for approximating the marginal likelihood in Section III-C. Finally, a discussion of the algorithm is provided in Section III-D.
III-A Statistical linear regression of conditional moments
With SLR, we can optimally linearise the conditional mean $\mathrm{E}[y_i \mid f_i]$ in a mean square error sense, so that we can make approximation (9). In addition, SLR provides us with the linearisation error, which is used to approximate $\mathrm{Var}[y_i \mid f_i]$ as a constant, as required in (9). The SLR linearisation parameters are denoted as $(a^{+}, b^{+}, \omega^{+})$, and we proceed to explain how to obtain them.
In SLR of random variables [18], we are given a density $q(f)$ on variable $f$, whose first two moments are $m$ and $P$, and the conditional moments $\bar{y}(f) = \mathrm{E}[y \mid f]$ and $v(f) = \mathrm{Var}[y \mid f]$, which describe the relation between $y$ and $f$. Parameters $a$ and $b$ are then selected by minimising the mean square error over random variables $f$ and $y$:

(15) $(a^{+}, b^{+}) = \underset{a, b}{\arg\min}\; \mathrm{E}_{f, y}\!\left[(y - a f - b)^2\right]$

where we have highlighted that the expectation is taken with respect to variables $f$ and $y$. In SLR, the parameter $\omega^{+}$ is chosen as the resulting mean square error in (15) for the optimal values $a^{+}$ and $b^{+}$, so

$\omega^{+} = \mathrm{E}_{f, y}\!\left[(y - a^{+} f - b^{+})^2\right].$

In other words, SLR makes the best affine fit of the conditional mean in the region indicated by $q(f)$, the density of $f$, and sets $\omega^{+}$ so that the resulting mean square error is preserved. The resulting SLR is given by [18]

(16) $a^{+} = \mathrm{Cov}_q\!\left(f, \bar{y}(f)\right)\!/\,\mathrm{Var}_q(f), \qquad b^{+} = \mathrm{E}_q\!\left[\bar{y}(f)\right] - a^{+}\, \mathrm{E}_q[f]$

(17) $\omega^{+} = \mathrm{Var}_q\!\left(\bar{y}(f)\right) - (a^{+})^2\, \mathrm{Var}_q(f) + \mathrm{E}_q\!\left[v(f)\right]$

where $\mathrm{Cov}_q$ and $\mathrm{Var}_q$ denote the covariance and variance with respect to $q(f)$. We can then obtain $(a^{+}, b^{+}, \omega^{+})$ in terms of the moments of $\bar{y}(f)$ and $v(f)$ with respect to density $q(f)$. Expressions of these moments for the likelihoods (1)-(3) are given in the supplementary material.
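A scalar SLR of conditional moments can be sketched with Gauss-Hermite quadrature. The function name, signature, and quadrature order below are choices of this sketch; the moments it computes follow the SLR construction described above:

```python
import numpy as np

def slr_conditional(mean_fn, var_fn, m, P, order=20):
    """Scalar SLR of the conditional moments E[y|f] = mean_fn(f) and
    Var[y|f] = var_fn(f) with respect to q(f) = N(m, P); both callables
    must accept numpy arrays. Returns the linearisation (a, b, omega)."""
    x, w = np.polynomial.hermite_e.hermegauss(order)
    w = w / np.sqrt(2.0 * np.pi)              # weights for N(0, 1)
    f = m + np.sqrt(P) * x                    # quadrature nodes under q
    ybar = mean_fn(f)
    Ey = np.sum(w * ybar)                     # E_q[ E[y|f] ]
    C = np.sum(w * (f - m) * (ybar - Ey))     # Cov_q(f, E[y|f])
    Vy = np.sum(w * (ybar - Ey) ** 2)         # Var_q(E[y|f])
    Ev = np.sum(w * var_fn(f))                # E_q[ Var[y|f] ]
    a = C / P
    b = Ey - a * m
    omega = Vy - a * C + Ev                   # linearisation mean square error
    return a, b, omega
```

By the Cauchy-Schwarz inequality, the term `Vy - a * C` is non-negative, so `omega` is non-negative by construction: this is the mechanism behind PL's positive definite covariances.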
III-B Iterated posterior linearisation
This section explains how to use SLR to make approximations (9) in an optimal fashion. If we did not know the labels $y_1, \ldots, y_n$, the best approximation of the conditional moments would be given by SLR with respect to the prior, which is given by (4), as this density indicates the region where $\mathbf{f}$ lies. However, we know these labels, and the insight of posterior linearisation [17] is that, given these labels, the best linearisation, in a mean square error sense, and the resulting mean square error are given by SLR with respect to the posterior.
Direct application of posterior linearisation is not useful, as we need to know the posterior to select the linearisation parameters, which are in turn used to calculate the posterior. Nevertheless, posterior linearisation can be approximated by iterated SLRs, giving rise to iterated posterior linearisation. That is, since we do not have access to the posterior, we perform SLR with respect to the best available approximation of the posterior. At the end of each iteration, we expect to obtain an improved approximation of the posterior, which is used to compute an improved SLR at the next iteration. The steps of the parallel version of the method are:

1) Initialise the posterior approximation as the prior (4).

2) For each likelihood, compute the SLR parameters $(a_i, b_i, \omega_i)$ of the conditional moments with respect to the marginal of the current posterior approximation, using (16)-(17).

3) Compute the Gaussian posterior approximation with mean (11) and covariance (12) for the current linearisation.

4) Repeat from step 2 for a fixed number of iterations or until some convergence criterion is met.
It is important to notice that, in the aforementioned algorithm, we first relinearise all likelihoods and then compute the posterior with the current linearisation. This is beneficial from a computational point of view, because the linearisations can be done in parallel and we only perform one posterior update per iteration. Nevertheless, it is also possible to recompute the posterior after relinearising a single likelihood. In GP classification, the prior covariance matrix can be close to singular, so we have adapted the numerically stable implementations of the Laplace and EP algorithms in [2] to our PL implementation.
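The parallel loop above can be sketched end to end for the probit likelihood, for which the conditional moments of a label $y \in \{-1,+1\}$ are $\mathrm{E}[y \mid f] = 2\Phi(f) - 1$ and $\mathrm{Var}[y \mid f] = 1 - \mathrm{E}[y \mid f]^2$. The function name, label convention, quadrature order, and fixed iteration count are assumptions of this sketch; a production version would use the numerically stable updates of [2]:

```python
import numpy as np
from scipy.stats import norm

def iterated_pl(K, y, iters=10, order=20):
    """Parallel iterated posterior linearisation for probit GP classification
    (illustrative sketch): labels y in {-1, +1}, zero-mean prior with Gram
    matrix K. Returns the Gaussian posterior mean and covariance over f."""
    n = len(y)
    x, w = np.polynomial.hermite_e.hermegauss(order)
    w = w / np.sqrt(2.0 * np.pi)
    mean, cov = np.zeros(n), K.copy()           # step 1: start from the prior
    for _ in range(iters):
        a, b, omega = np.empty(n), np.empty(n), np.empty(n)
        for i in range(n):                       # step 2: SLR per likelihood
            m, P = mean[i], cov[i, i]
            f = m + np.sqrt(P) * x
            ybar = 2.0 * norm.cdf(f) - 1.0       # E[y|f] for the probit
            Ey = np.sum(w * ybar)
            C = np.sum(w * (f - m) * (ybar - Ey))
            a[i] = C / P
            b[i] = Ey - a[i] * m
            omega[i] = (np.sum(w * (ybar - Ey) ** 2) - a[i] * C
                        + np.sum(w * (1.0 - ybar ** 2)))
        A = np.diag(a)                           # step 3: posterior update
        S = A @ K @ A + np.diag(omega)
        G = K @ A @ np.linalg.inv(S)
        mean = G @ (y - b)
        cov = K - G @ A @ K
    return mean, cov
```

The SLRs inside the inner loop are independent of each other, which is what makes this version parallelisable.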
III-C Marginal likelihood approximation
In this section, we propose the use of sigma-points/quadrature rules to approximate the marginal likelihood, which is given by (6). Importantly, we select the quadrature points with respect to the posterior approximation, as this density has its mass in the region where the integrand is high. We therefore write (6) as

(18) $p(\mathbf{y} \mid \theta) = Z \int \dfrac{\prod_{i=1}^{n} p(y_i \mid f_i)}{\prod_{i=1}^{n} \mathcal{N}(y_i; a_i f_i + b_i, \omega_i)}\, q(\mathbf{f} \mid \mathbf{y})\,\mathrm{d}\mathbf{f}$

where $q(\mathbf{f} \mid \mathbf{y}) = \mathcal{N}(\mathbf{f}; \mu, \Sigma)$ is the Gaussian posterior approximation given by (11)-(12). Factor $Z$ corresponds to the marginal likelihood assuming that the true likelihoods are given by their Gaussian approximations, see (10). Consequently, the marginal likelihood can be seen as $Z$ times a correction factor that depends on the similarity between $p(y_i \mid f_i)$ and $\mathcal{N}(y_i; a_i f_i + b_i, \omega_i)$ in the region indicated by the posterior density.
There are some drawbacks when integrating with respect to the joint posterior density of $\mathbf{f}$ using sigma-points/quadrature rules. First, accurate and computationally efficient integration in high-dimensional spaces is more difficult than in low-dimensional spaces. Second, sigma-points/quadrature rules require the Cholesky decomposition of the covariance matrix, but this covariance can be ill-conditioned in GP classification [2], so it is not always possible to compute it. We therefore discard correlations in the posterior for approximating the correction factor in (18) such that
(19) $p(\mathbf{y} \mid \theta) \approx Z \prod_{i=1}^{n} \int \dfrac{p(y_i \mid f_i)}{\mathcal{N}(y_i; a_i f_i + b_i, \omega_i)}\, \mathcal{N}(f_i; \mu_i, \sigma_i^2)\,\mathrm{d}f_i$

where $\mu_i$ and $\sigma_i^2$ represent the posterior mean and variance of $f_i$, which are obtained from (11) and (12).
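The per-point correction integrals can be approximated with Gauss-Hermite quadrature over each marginal posterior. A sketch in log domain, with the function name, signature, and the generic likelihood callable `lik(y_i, f)` being our assumptions:

```python
import numpy as np
from scipy.stats import norm

def log_correction_factor(lik, mu, var, y, a, b, omega, order=10):
    """Logarithm of the factorised correction factor: for each training point,
    a Gauss-Hermite approximation of E[p(y_i|f_i) / N(y_i; a_i f_i + b_i,
    omega_i)] under the marginal posterior N(mu_i, var_i). `lik(y_i, f)` is
    the true likelihood evaluated at an array of latent values f."""
    x, w = np.polynomial.hermite_e.hermegauss(order)
    w = w / np.sqrt(2.0 * np.pi)
    total = 0.0
    for i in range(len(y)):
        f = mu[i] + np.sqrt(var[i]) * x
        ratio = lik(y[i], f) / norm.pdf(y[i], loc=a[i] * f + b[i],
                                        scale=np.sqrt(omega[i]))
        total += np.log(np.sum(w * ratio))
    return total  # add this to the log marginal likelihood of the Gaussian model
```

As a sanity check, if the true likelihood coincides with its Gaussian approximation, every ratio is one and the log correction is zero.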
In short, in order to compute (19), we compute the posterior moments $\mu_i$ and $\sigma_i^2$ and the resulting SLR parameters $(a_i, b_i, \omega_i)$, which are required in $Z$ and in the correction factor. Accurate approximation of (18) is quite important, as it is used to estimate the hyperparameters $\theta$. Without an accurate estimate of $\theta$, the results of a GP classifier are poor, as the GP does not model the training data properly. The results in Section IV confirm that approximation (19) is accurate for classification purposes.

III-D Discussion
The iterated SLRs of PL can be done in parallel for each likelihood or sequentially. We can also combine both types of linearisation, for example, by performing several updates sequentially and then in parallel. Importantly, two main benefits of PL compared to EP are that there is no need to compute cavity distributions and that all involved densities have positive definite covariance matrices. This is ensured by the fact that the SLR mean square error $\omega^{+}$ in (17) is non-negative by definition. Clearly, numerical inaccuracies could render it negative, but this would be easy to address, as the eigenvalues would be close to zero, or by using square-root solutions [20].
Like EP, iterated PL is not guaranteed to converge in general. Nevertheless, there is a local convergence proof, which is given in [18, Thm. 2], that states sufficient conditions for convergence.
IV Experimental results
This section assesses Laplace, EP, and PL, in their parallel and sequential forms, on six real-world data sets from [21]. One additional synthetic example, which analyses a case where EP fails, is provided in the supplementary material. In particular, we consider the data sets: breast cancer (9, 699), crab gender (6, 200), glass chemical (9, 214), ionosphere (33, 351), thyroid (5, 215) and housing (13, 506), where the number of attributes (dimension of the data points) and the number of data points in each data set are given in parentheses. The last two data sets are used for binary classification as in [4]. Data points have been normalised to have zero mean and an identity covariance matrix [12]. We use ten-fold cross-validation [2] to compute the average classification error.
We use the covariance function [10]

$k(\mathbf{x}, \mathbf{x}'; \theta) = \sigma^2 \exp\left(-\dfrac{1}{2} \sum_{d=1}^{D} \dfrac{(x_d - x'_d)^2}{\ell_d^2}\right) + \varepsilon\, \delta_{\mathbf{x}, \mathbf{x}'}$

where $\delta_{\mathbf{x}, \mathbf{x}'}$ denotes the Kronecker delta, $\varepsilon$ is set to 0.1, the hyperparameters are $\theta = (\sigma, \ell_1, \ldots, \ell_D)$, and $\ell_d > 0$ for $d = 1, \ldots, D$. EP and Laplace have been implemented as indicated in [2].
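The experimental kernel is partly lost in this copy; assuming the common squared-exponential form with an additive jitter term (the function name, default hyperparameters, and the exact kernel form are assumptions of this sketch), it can be written as:

```python
import numpy as np

def covariance_matrix(X1, X2, sigma2=1.0, ell=None, jitter=0.1):
    """Squared-exponential (ARD) covariance with an additive jitter term
    jitter * delta_ij on identical inputs (an assumed form). X1 is (n, D),
    X2 is (m, D); ell holds one length scale per input dimension."""
    n, D = X1.shape
    if ell is None:
        ell = np.ones(D)
    diff = (X1[:, None, :] - X2[None, :, :]) / ell
    K = sigma2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    if X1 is X2 or (X1.shape == X2.shape and np.array_equal(X1, X2)):
        K = K + jitter * np.eye(n)   # Kronecker-delta jitter on the diagonal
    return K
```

The jitter term improves the conditioning of the Gram matrix, which matters here because, as noted above, the prior covariance in GP classification can be close to singular.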
We report results for all the likelihoods in (1)-(3). The integrations that do not admit closed-form expressions are approximated using a Gauss-Hermite quadrature of order 10. We first estimate the hyperparameters by maximising (6) using the quasi-Newton algorithm implemented in the Matlab function fminunc. The optimisation method is run on a transformation of the hyperparameters, with a fixed initial point. Then, we approximate the posterior (5) as Gaussian and, finally, we compute the expected value of (8) to predict the label for each test data point. We consider 10 iterations for EP and PL. If the variance of a likelihood approximation is negative for EP, it is set to a small positive number to avoid negative definite covariance matrices, as suggested in [15].
The resulting average classification errors for the data sets are shown in Table I, where Prob., Log., and NT stand for probit, logit, and noisy threshold. PEP and SEP refer to parallel and sequential EP, and PPL and SPL to parallel and sequential PL. In this table, the best performing algorithm for each data set and likelihood is bolded. Also, the cases for which the classification error is considerably high are marked in red. We can see that, for the noisy threshold likelihood, parallel EP provides high errors in general. Sequential EP works well but, on the whole, PL methods work better. As mentioned before, Laplace does not work with this likelihood. For the logit model, parallel EP provides a high error in one data set. This is a numerical error that can be solved by using a Gauss-Hermite quadrature rule of order 32, which increases the computational load. Sequential PL performs best in five out of the six data sets, which makes it the best performing algorithm for this likelihood. For the probit likelihood, all methods work well, though Laplace provides slightly higher errors on the whole. Sequential PL, parallel EP, and sequential EP perform best in 3 out of the 6 experiments, so they are the best performing algorithms with this likelihood. In total, across all likelihoods, the number of times an algorithm is the best performing one in the experiments is: Laplace (3), parallel EP (4), sequential EP (9), parallel PL (6), sequential PL (11). In general, PL algorithms work better than Laplace and their EP counterparts, taking into account all likelihoods. Importantly, they are more robust than parallel EP, as their errors are not markedly higher in any of the data sets.
Finally, we provide the average computational times of our Matlab implementations (probit model and crab data set), as an indication of their computational complexities, though running times depend on the data set. With an Intel Core i7 processor, we have: 0.5 s (Laplace), 6.9 s (parallel EP), 13.4 s (sequential EP), 4.0 s (parallel PL) and 9.8 s (sequential PL). Laplace is clearly the fastest algorithm, followed by parallel PL and parallel EP, though it tends to provide slightly higher errors and cannot deal with the noisy threshold likelihood. The sequential algorithms are slower, but they usually perform better than the parallel algorithms.
TABLE I: Average classification errors.

Like.  Alg.  Cancer  Crab  Glass  Ionos.  Thyroid  Housing 

Prob.  L  0.051  0.050  0.067  0.108  0.061  0.069 
PEP  0.034  0.045  0.067  0.088  0.062  0.064  
SEP  0.034  0.045  0.067  0.088  0.062  0.064  
PPL  0.037  0.035  0.076  0.083  0.071  0.069  
SPL  0.034  0.045  0.067  0.091  0.057  0.069  
Log.  L  0.036  0.045  0.071  0.105  0.057  0.067 
PEP  0.251  0.045  0.071  0.083  0.062  0.070  
SEP  0.034  0.045  0.067  0.083  0.062  0.070  
PPL  0.039  0.040  0.071  0.088  0.071  0.067  
SPL  0.034  0.045  0.067  0.083  0.057  0.067  
NT  L  –  –  –  –  –  – 
PEP  0.202  0.085  0.071  0.384  0.454  0.124  
SEP  0.034  0.045  0.067  0.086  0.062  0.069  
PPL  0.043  0.025  0.081  0.091  0.071  0.058  
SPL  0.034  0.035  0.067  0.091  0.057  0.070 
V Conclusions
We have proposed a novel method for classification using Gaussian processes based on posterior linearisation. The proposed algorithm consists of performing iterated statistical linear regressions with respect to the current approximation to the posterior. An important benefit compared to expectation propagation is that, by construction, the proposed method does not produce negative definite covariance matrices, which implies that it does not need ad hoc procedures to solve the resulting problems. In addition, posterior linearisation is simpler to implement than expectation propagation, as it does not need the computation of cavity distributions, and has a local convergence theorem. Experimental results show that, in general, posterior linearisation performs better than EP and Laplace for a range of likelihood functions used in classification.
References
 [1] K. P. Murphy, Machine Learning: a Probabilistic Perspective. The MIT Press, 2012.
 [2] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
 [3] F. Pérez-Cruz, S. Van Vaerenbergh, J. J. Murillo-Fuentes, M. Lázaro-Gredilla, and I. Santamaría, “Gaussian processes for nonlinear signal processing: An overview of recent advances,” IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 40–50, July 2013.
 [4] H.-C. Kim and Z. Ghahramani, “Bayesian Gaussian process classification with the EM-EP algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 1948–1959, Dec. 2006.
 [5] K. Markov and T. Matsui, “Music genre classification using Gaussian process models,” in IEEE International Workshop on Machine Learning for Signal Processing, Sept 2013, pp. 1–6.

 [6] Y. Xiao, H. Wang, and W. Xu, “Hyperparameter selection for Gaussian process one-class classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 9, pp. 2182–2187, Sept. 2015.
 [7] P. Morales-Álvarez, A. Pérez-Suay, R. Molina, and G. Camps-Valls, “Remote sensing image classification with large-scale Gaussian processes,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 1103–1114, Feb. 2018.
 [8] P. M. Olmos, J. J. Murillo-Fuentes, and F. Pérez-Cruz, “Joint nonlinear channel equalization and soft LDPC decoding with Gaussian processes,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1183–1192, March 2010.
 [9] J. Vanhatalo, V. Pietiläinen, and A. Vehtari, “Approximate inference for disease mapping with sparse Gaussian processes,” Statistics in Medicine, vol. 29, July 2010.
 [10] R. M. Neal, “Regression and classification using Gaussian process priors,” in Bayesian Statistics 6, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Eds. Oxford University Press, 1998.
 [11] T. P. Minka, “Expectation propagation and approximate Bayesian inference,” in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 2001, pp. 362–369.
 [12] M. Kuss and C. E. Rasmussen, “Assessing approximate inference for binary Gaussian process classification,” Journal of Machine Learning Research, vol. 6, pp. 1679–1704, 2005.
 [13] ——, “Assessing approximations for Gaussian process classification,” in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. C. Platt, Eds., 2006, pp. 699–706.
 [14] C. Villacampa-Calvo and D. Hernández-Lobato, “Scalable multi-class Gaussian process classification using expectation propagation,” in Proceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 3550–3559.
 [15] A. Vehtari et al., “Expectation propagation as a way of life: a framework for Bayesian inference on partitioned data,” arXiv preprint, 2018. [Online]. Available: https://arxiv.org/abs/1412.4869
 [16] P. Jylänki, J. Vanhatalo, and A. Vehtari, “Robust Gaussian process regression with a Student-t likelihood,” Journal of Machine Learning Research, vol. 12, pp. 3227–3257, 2011.
 [17] A. F. García-Fernández, L. Svensson, M. R. Morelande, and S. Särkkä, “Posterior linearization filter: principles and implementation using sigma points,” IEEE Transactions on Signal Processing, vol. 63, no. 20, pp. 5561–5573, Oct. 2015.
 [18] F. Tronarp, A. F. García-Fernández, and S. Särkkä, “Iterative filtering and smoothing in nonlinear and non-Gaussian systems using conditional moments,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 408–412, March 2018.

 [19] I. Arasaratnam, S. Haykin, and R. Elliott, “Discrete-time nonlinear filtering algorithms using Gauss-Hermite quadrature,” Proceedings of the IEEE, vol. 95, no. 5, pp. 953–977, May 2007.
 [20] I. Arasaratnam and S. Haykin, “Square-root quadrature Kalman filtering,” IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2589–2593, June 2008.
 [21] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
 [22] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
 [23] S. J. Julier and J. K. Uhlmann, “Unscented filtering and nonlinear estimation,” Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, Mar. 2004.

 [24] V. Tolvanen, P. Jylänki, and A. Vehtari, “Expectation propagation for nonstationary heteroscedastic Gaussian process regression,” in IEEE International Workshop on Machine Learning for Signal Processing, Sept. 2014, pp. 1–6.
SLR for probit, logit and noisy-threshold likelihoods
In this section, we provide the expressions for the SLR moments required in (16)-(17) for the probit, logit and noisy-threshold likelihoods. For the probit model, we have
where . We have used Eq. (3.84) in [2] to obtain the last expression.
For the noisy threshold model, we have
where .
Synthetic example where EP fails
We consider a Gaussian prior over the two latent variables, with unit variances for both variables and correlation 0.8. We also consider the noisy threshold likelihood, see (3).
We run EP first on the first variable and then on the second variable. After the first round of updates, the cavity distribution over one of the variables has a variance of -117.9, which stops the EP iterations. If we process the measurements in the other order, we face the same problem. In this case, the variance of the problematic cavity distribution is -14.3. If we run EP in parallel [24], after the second update, a cavity distribution also has a variance of -117.9. These negative variances are not due to numerical errors; they are the result of the EP iterations. In particular, the problematic part is the variance of the second variable, which increases after its update, so the likelihood approximation has a negative definite variance, which creates problems at the next step of the iteration [16].
We show the contour plot of the posterior, which has been obtained using a fine grid of points, and the PL solution in Figure 1(a). Both parallel and sequential implementations of PL converge to the same solution, and they match the mode of the posterior with the highest density, which is a reasonable Gaussian approximation to a bimodal density. In the figure, we can also see the sequence of posterior means provided by the parallel PL iterations. In 6 iterations, the algorithm gets quite close to the fixed point. The Laplace approximation would not change the prior, as the gradient of the likelihood is zero almost everywhere. This example demonstrates that this type of situation can be handled satisfactorily by PL, but not by EP without fixes, nor by Laplace methods.