Classification is an important problem with a high number of applications, for example, in handwriting and speech recognition, and medical diagnosis . In (supervised) classification, a set of training data points with their corresponding classes are available to learn the underlying structure of the problem. Based on this information, the objective is to infer the classes of new data points. This classification problem can be posed using Gaussian processes (GPs) [2, 3, 4, 5, 6, 7, 8].
In binary GP classification, it is assumed that there is a latent function, distributed as a GP 9]
, or estimated by maximising the log marginal likelihood. Then, for the estimated hyperparameters, we compute the posterior distribution over the latent function evaluated at the training data points, which in turn allows us to predict the classes for new data points.
The main difficulties in GP classification are the approximations of the posterior and the log marginal likelihood. Markov chain Monte Carlo algorithms can provide very accurate approximations, but they usually have a high computational burden. This is the reason why there is interest in using computationally efficient approximations. We proceed to review two of such approximations in the literature.
One possibility is to use the Laplace approximation . A drawback of the Laplace approximation is that it cannot handle likelihood functions in which the gradient is zero almost everywhere, such as the noisy threshold likelihood .
. EP is an iterative algorithm in which, at each iteration, a Gaussian approximation to the likelihood for one data point is reapproximated. In order to do this, we first remove the considered approximated likelihood from the posterior approximation, which results in the “cavity” distribution. Then, we use the true likelihood and the cavity distribution to provide a new Gaussian approximation to the likelihood by performing moment matching. EP has two drawbacks that are relevant to this paper: 1) the cavity distributions can have negative definite covariance matrices with possibly large negative eigenvalues that are not due to numerical errors[1, 15], and 2) there is no convergence proof in the literature that indicates conditions of convergence . In order to deal with 1), simple, ad-hoc solutions are sometimes used, for example, processing the likelihoods in a different order to see if this resolves the issue, or arbitrarily setting the negative definite covariance to a predefined positive-definite matrix . More robust EP algorithms, such as damped EP or double loop algorithms, also require pragmatic solutions to avoid negative definite covariance matrices .
. In PL, we compute a Gaussian approximation to the posterior distribution by linearising the conditional mean of each label, in the region where the posterior lies, and by setting the conditional covariance to a value that accounts for the linearisation error. Importantly, the selection of the linearisation parameters is done in an optimal way by minimising the mean square error of the linearisation. This optimal linearisation is given by statistical linear regression (SLR) of the conditional mean with respect to the posterior.
In practice, PL is implemented by an iterative procedure, in which we can process the likelihoods sequentially or in parallel. PL has some advantages compared to EP: 1) PL always provides positive definite covariance matrices, so ad-hoc fixes are not necessary, 2) PL is simpler to implement and does not compute cavity distributions 3) there is a local convergence proof  that indicates sufficient local conditions of convergence. Experimental results show that PL generally outperforms Laplace and EP in GP classification.
Ii Problem formulation
We consider a set , which contains data points , with , and their binary labels , with . Then, in binary classification, given this set, we are interested in predicting the binary labels for new data points . We proceed to explain how this problem is formulated using GPs.
It is assumed that the label of a data point only depends on the value of a latent real-valued function evaluated at , . Then, there are several widely-used models for the probability mass function
, for example, the probit, logit, and noisy threshold, whose respectiveare 
where , is the Gaussian density with mean and covariance , , and if , and otherwise.
It is further assumed that function is distributed according to a zero-mean Gaussian process with a given covariance function , where
is a vector that contains all the hyperparameters. As a result, the prior density ofbecomes
where is an covariance matrix such that . The posterior of is
where the marginal likelihood is
Finally, all information about the class labels for the new data points is given by the distribution of given , and , which can be written as
Based on this distribution, we can predict the class labels, which is our main objective, for example, by computing their expected value . Unfortunately, none of the densities of interest, (5)-(8), has a closed-form expression, so approximations are necessary. As in the Laplace and EP methods, in this paper, we consider Gaussian approximations of (5) and (7).
Ii-a Enabling approximation
We can obtain a Gaussian approximation to the posterior by approximating the conditional mean
as an affine function and the conditional varianceas a constant
where , , and , and approximating the conditional distribution of given as a Gaussian
We should note that the conditional moments and are taken with respect to .
where , and . Similarly, the posterior of , see (7), becomes Gaussian with mean and covariance matrix
where and are and matrices such that and .
Iii Posterior linearisation of GPs
In this section, we first explain SLR using conditional moments in Section III-A. Then, we explain iterated posterior linearisation in Section III-B. We propose a method for approximating the marginal likelihood in Section III-C. Finally, a discussion of the algorithm is provided in Section III-D.
Iii-a Statistical linear regression of conditional moments
With SLR, we can optimally linearise in a mean square error sense, so that we can make approximation (9). In addition, SLR provides us with the linearisation error, which is used to approximate as a constant, as required in (9). The SLR linearisation parameters are denoted as and we proceed to explain to obtain them.
In SLR of random variables, we are given a density on variable , whose first two moments are and , and the conditional moments and , which describe the relation between and . Parameters and are then selected by minimising the mean square error over random variables and
where we have highlighted the expectations that are taken with respect to variables and . Therefore,
In SLR, the parameter is chosen as the resulting mean square error in (15) for the optimal values and so
In other words, SLR makes the best affine fit of the conditional mean in the region indicated by , the density of , and sets so that the resulting mean square error is preserved. The resulting is given by 
where denotes both the variance and covariance with respect to . We can then obtain in terms of moments of and with respect to density . Expressions of these moments for the likelihoods (1)-(3) are given in the supplementary material.
Iii-B Iterated posterior linearisation
This section explains how to use SLR to make approximations (9) in an optimal fashion. If we did not know the labels , the best approximation of the conditional moments would be given by SLR with respect to the prior, which is given by (4), as this density indicates the region where lies. However, we know these labels, and the insight of posterior linearisation  is that, given these labels, the best linearisation , in a mean square error sense, and the resulting mean square error are given by SLR with respect to the posterior.
Direct application of posterior linearisation is not useful as we need to know the posterior to select , which is used to calculate the posterior. Nevertheless, posterior linearisation can be approximated by using iterated SLRs, giving rise to iterated posterior linearisation. That is, since we do not have access to the posterior, we perform SLR with respect to the best available approximation of the posterior. At the end of each iteration, we expect to obtain an improved approximation of the posterior, which is used to compute an improved SLR at the next iteration. The steps of the parallel version of the method are:
Set and .
Set and repeat from step 2 for a fixed number of iterations or until some convergence criterion is met.
It is important to notice that in the aforementioned algorithm, we first relinearise all likelihoods and then compute the posterior with the current linearisation. This is beneficial from a computational point of view because the linearisations can be done in parallel and we only perform one update per iteration. Nevertheless, it is also possible to recompute the posterior after relinearising a single likelihood. In GP classification, can be close to singular so we have adapted the numerically stable implementations of Laplace and EP algorithms in  to our PL implementation.
Iii-C Marginal likelihood approximation
In this section, we propose the use of sigma-points/quadrature rules to approximate the marginal likelihood, which is given by (6). Importantly, we select the quadrature points with respect to the posterior approximation, as this density has its mass in the region where the integrand is high. We therefore write (6) as
where . Factor corresponds to the marginal likelihood assuming that the true likelihoods are given by , see (10). Consequently, the marginal likelihood can be seen as times a correction factor that depends on the similarity between and in the region indicated by the posterior density.
There are some drawbacks when integrating with respect to the joint density of using sigma-points/quadrature rules. First, accurate and computationally efficient integration in high-dimensional spaces is more difficult than in low-dimensional spaces. Second, sigma-points/quadrature rules require the Cholesky decomposition of the covariance matrix but this covariance can be ill-conditioned in GP classification , so it is not always possible to compute it. We therefore discard correlations in the posterior for approximating the correction factor in (18) such that
In short, in order to compute (19), we compute the posterior moments, and , and the resulting SLR parameters , which are required in and . Accurate approximation of (18) is quite important as it is used to estimate the hyperparameters . Without an accurate estimation of
, the results of a GP classifier are poor, as the GP does not model the training data properly. The results in SectionIV confirm that approximation (19) is accurate for classification purposes.
The iterated SLRs of PL can be done in parallel for each likelihood and sequentially. We can also combine both types of linearisation approaches, for example, by performing several updates sequentially and then in parallel. Importantly, two main benefits of PL compared to EP are that there is no need to compute cavity distributions and that all involved densities have positive definite covariance matrices. This is ensured by the fact that is positive-definite by definition. Clearly, numerical inaccuracies could render a negative-definite but this would be easy to address, as the eigenvalues would be close to zero, or using square root solutions .
As EP, iterated PL is not ensured to converge in general. Nevertheless, there is a local convergence proof, which is given in [18, Thm. 2], that states sufficient conditions for convergence.
Iv Experimental results
This section assesses Laplace, EP, and PL, in their parallel and sequential forms, in six real-world data sets from . One additional synthetic example that analyses a case where EP fails is provided in the supplementary material. In particular, we consider the data sets: breast cancer (9, 699), crab gender (6, 200), glass chemical (9, 214), ionosphere (33, 351), thyroid (5,215) and housing (13,506), where the number of attributes (dimension of data points) and data points in each data set are given in parentheses. The last two data sets are used for binary classification as in . Data points have been normalised to have zero mean and an identity covariance matrix . We use ten-fold cross-validation  to compute the average classification error.
We use the covariance function 
where denotes the Kronecker delta, is set to 0.1, the hyperparameters , and for . EP and Laplace have been implemented as indicated in .
We report results for all the likelihoods in (1)-(3). The integrations that do not admit closed-form expressions are approximated using a Gauss-Hermite quadrature of order 10. We first estimate the hyperparameters by maximising (6) using the Quasi-Newton algorithm implemented in Matlab function fminunc. The optimisation method is run on variable and the initial point is set to . Then, we approximate the posterior (5) as Gaussian and, finally, we compute the expected value of (8) to predict the label for each test data point. We consider 10 iterations for EP and PL. If the variance of a likelihood approximation is negative for EP, it is set to a small positive number to avoid negative definite covariance matrices, as suggested in .
The resulting average classification errors for the data sets are shown in Table I, where Prob. Log. and NT stand for probit, logit and noisy threshold. PEP and SEP refer to parallel and sequential EP, and PPL and SPL to parallel and sequential PL. In this table, the best performing algorithm for each data set and likelihood is bolded. Also, the cases for which the classification error is considerably high is marked in red. We can see that for the noisy threshold likelihood, parallel EP provides high errors in general. Sequential EP works well, but, on the whole PL methods work better. As mentioned before, Laplace does not work with this likelihood. For the logit model, parallel EP provides a high error in one data set. This is a numerical error that can be solved by using a Gauss-Hermite quadrature rule of order 32, which increases the computational load. Sequential PL performs best in five out of the six data sets, which makes it the best performing algorithm for this likelihood. For the probit likelihood, all methods work well, though Laplace provides slightly higher errors on the whole. Sequential PL, parallel EP and sequential EP perform best in 3 out of the 6 experiments, so they are the best performing algorithms with this likelihood. In total, across all likelihoods, the number of times an algorithm is the best performing one in the experiments is: Laplace (3), parallel EP (4), sequential EP (9), parallel PL (6), sequential PL (11). In general, PL algorithms work better than Laplace and their EP counterparts taking into account all likelihoods. Importantly, they are most robust than parallel EP, as errors are not markedly higher in any of the data sets.
Finally, we provide the average computational times of our Matlab implementations (probit model and crab data set), as a indication of their computational complexities, though running times depend on the data set. With an Intel Core i7 processor, we have: 0.5 s (Laplace), 6.9 s (parallel EP), 13.4 s (sequential EP), 4.0 s (parallel PL) and 9.8 s (sequential PL). Laplace is clearly the fastest algorithm, followed by parallel EP and PL, though it tends to provide slightly higher errors, and cannot deal with the noisy threshold likelihood. The sequential algorithms are slower, but they usually perform better than parallel algorithms.
We have proposed a novel method for classification using Gaussian processes based on posterior linearisation. The proposed algorithm consists of performing iterated statistical linear regressions with respect to the current approximation to the posterior. An important benefit compared to expectation propagation is that the proposed method does not provide negative definite covariance matrices by construction, which implies that it does not need ad-hoc procedures to solve the resulting problems. In addition, posterior linearisation is simpler to implement than expectation propagation, as it does not need the computation of cavity distributions, and has a local convergence theorem. Experimental results show that, in general, posterior linearisation performs better than EP and Laplace for a range of likelihood functions used in classification.
-  K. P. Murphy, Machine Learning: a Probabilistic Perspective. The MIT Press, 2012.
-  C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
-  F. Pérez-Cruz, S. V. Vaerenbergh, J. J. Murillo-Fuentes, M. Lázaro-Gredilla, and I. Santamaría, “Gaussian processes for nonlinear signal processing: An overview of recent advances,” IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 40–50, July 2013.
-  H.-C. Kim and Z. Ghahramani, “Bayesian Gaussian process classification with the EM-EP algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 1948–1959, Dec. 2006.
-  K. Markov and T. Matsui, “Music genre classification using Gaussian process models,” in IEEE International Workshop on Machine Learning for Signal Processing, Sept 2013, pp. 1–6.
Y. Xiao, H. Wang, and W. Xu, “Hyperparameter selection for Gaussian process
IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 9, pp. 2182–2187, Sept. 2015.
-  P. Morales-Álvarez, A. Pérez-Suay, R. Molina, and G. Camps-Valls, “Remote sensing image classification with large-scale Gaussian processes,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 1103–1114, Feb 2018.
-  P. M. Olmos, J. J. Murillo-Fuentes, and F. Pérez-Cruz, “Joint nonlinear channel equalization and soft LDPC decoding with Gaussian processes,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1183–1192, March 2010.
-  J. Vanhatalo, V. Pietiläinen, and A. Vehtari, “Approximate inference for disease mapping with sparse Gaussian processes,” Statistics in Medicine, vol. 29, July 2010.
-  R. M. Neal, “Regression and classification using Gaussian process priors,” in Bayesian Statistics 6, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Eds. Oxford University Press, 1998.
-  T. P. Minka, “Expectation propagation and approximate Bayesian inference,” in Proceedings of the 17th Conference in Uncertainty in Artifical Intelligence, 2001, pp. 362–369.
-  M. Kuss and C. E. Rasmussen, “Assessing approximate inference for binary Gaussian process classification,” Journal of Machine Learning Research, vol. 6, pp. 1679–1704, 2005.
-  ——, “Assessing approximations for Gaussian process classification,” in Advances in Neural Information Processing Systems 18, Y. Weiss, P. B. Schölkopf, and J. C. Platt, Eds., 2006, pp. 699–706.
-  C. Villacampa-Calvo and D. Hernández-Lobato, “Scalable multi-class Gaussian process classification using expectation propagation,” in Proceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 3550–3559.
-  A. Vehtari and et al, “Expectation propagation as a way of life: a framework for Bayesian inference on partitioned data,” available in arxiv, 2018. [Online]. Available: https://arxiv.org/abs/1412.4869
-  P. Jylänki, J. Vanhatalo, and A. Vehtari, “Robust Gaussian process regression with a Student- likelihood,” Journal of Machine Learning Research, vol. 12, pp. 3227–3257, 2011.
-  A. F. García-Fernández, L. Svensson, M. R. Morelande, and S. Särkkä, “Posterior linearization filter: principles and implementation using sigma points,” IEEE Transactions on Signal Processing, vol. 63, no. 20, pp. 5561–5573, Oct. 2015.
-  F. Tronarp, A. F. García-Fernández, and S. Särkkä, “Iterative filtering and smoothing in non-linear and non-Gaussian systems using conditional moments,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 408–412, March 2018.
I. Arasaratnam, S. Haykin, and R. Elliott, “Discrete-time nonlinear filtering algorithms using Gauss-Hermite quadrature,”Proceedings of the IEEE, vol. 95, no. 5, pp. 953–977, May 2007.
I. Arasaratnam and S. Haykin, “Square-root quadrature Kalman filtering,”IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2589–2593, June 2008.
-  M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
-  S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
-  S. J. Julier and J. K. Uhlmann, “Unscented filtering and nonlinear estimation,” Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, Mar. 2004.
V. Tolvanen, P. Jylänki, and A. Vehtari, “Expectation propagation for nonstationary heteroscedastic Gaussian process regression,” inIEEE International Workshop on Machine Learning for Signal Processing, Sept. 2014, pp. 1–6.
SLR for probit, logit and noisy-threshold likelihoods
where . We have used Eq. (3.84) in  to obtain the last expression.
For the noisy threshold model, we have
Synthetic example where EP fails
We consider a Gaussian prior over with mean and unit variances for both variables with correlation 0.8. We also consider the noisy threshold likelihood, see (3), with , and .
We run EP first on the first variable and then on the second variable. After the first round of updates, the cavity distribution over has a variance of -117.9, which stops the EP iterations. If we process the measurements in the other order, we face the same problem. In this case, the variance of the cavity distribution over is -14.3. If we run EP in parallel , after the second update, the cavity distribution over also has a variance of -117.9. These negative variances are not due to numerical errors, they are the result of the EP iterations. In particular, the problematic part is the variance of the second variable which increases after its update, so the likelihood approximation has a negative definite variance, which creates problems at the next step of the iteration .
We show the contour plot of the posterior, which has been obtained using a fine grid of points, and the PL solution in Figure 1(a). Both parallel and sequential implementations of PL converge to the same solution and they match the mode of the posterior with highest density, which is a reasonable Gaussian approximation to a bimodal density. In the figure, we can also see the sequence of posterior means provided by the parallel PL iterations. In 6 iterations, the algorithm gets quite close to the fixed point. Laplace approximation would not change the prior as the gradient of the likelihood is zero almost everywhere. This example demonstrates that this type of situation can be satisfactorily handled with PL, but not with EP, without fixes, and Laplace methods.