Gaussian processes (GPs) are Bayesian state-of-the-art tools for discriminative machine learning, i.e., regression [Williams96], classification [Kuss05] and dimensionality reduction [Lawrence05]. GPs were first proposed in statistics by Tony O’Hagan [OHagan78] and they are well-known to the geostatistics community as kriging. However, due to their high computational complexity they did not become widely applied tools in machine learning until the early XXI century [rasmusssen2006gaussian]
. GPs can be interpreted as a family of kernel methods with the additional advantage of providing a full conditional statistical description for the predicted variable, which can be primarily used to establish confidence intervals and to set hyper-parameters. In a nutshell, Gaussian processes assume that a Gaussian process prior governs the set of possible latent functions (which are unobserved), and the likelihood (of the latent function) and observations shape this prior to produce posterior probabilistic estimates. Consequently, the joint distribution of training and test data is a multidimensional Gaussian and the predicted distribution is estimated by conditioning on the training data.
While GPs are well-established tools in machine learning, they are not as widely used by the signal processing community as neural networks or support vector machines (SVMs) are. In our opinion, there are several explanations for the limited use of GPs in signal processing problems. First, they do not have a simple intuition for classification problems. Second, their direct implementation is computationally demanding. Third, their plain vanilla version might seem uptight and not flexible enough. Fourth, to most signal processing experts Gaussian process merely stands for a noise model and not for a flexible algorithm that they should be using.
In this paper, we present an overview on Gaussian processes explained for and by signal processing practitioners. We introduce GPs as the natural nonlinear Bayesian extension to the linear minimum mean square error (MMSE) and Wiener filtering, which are central to many signal processing algorithms and applications. We believe that GPs provide the correct approach to solve an MMSE filter nonlinearly, because they naturally extend least squares to nonlinear solutions through the kernel trick; they use a simple yet flexible prior to control the nonlinearity; and, evidence sampling or maximization allows setting the hyper-parameters without overfitting. This last feature is most interesting: by avoiding cross-validation we are able to optimize over a larger number of hyperparameters, thus increasing the available kernel expressiveness. Additionally, GP provides a full statistical description of its predictions.
The tutorial is divided in three parts. We have summarized in Figure 1 the relationship between the regression techniques introduced throughout the different sections. In the first part, Section II provides a detailed overview of Gaussian processes for regression (GPR) [Williams96]. We show that they are the natural nonlinear extension to MMSE/Wiener filtering and how they can be solved recursively. The second part of the paper focuses briefly on several key aspects of GP-based techniques. Consecutively, we review solutions to adjust the kernel function (Section III ), to tame the computational complexity of GPs (Section IV ), and to deal with non-Gaussian noise models (Section V ). In the third part, we cover additional extensions of interest to signal processing practitioners, in particular dealing with non-stationary scenarios (Section VI ) and classification problems (Section VII ). We illustrate them with relevant examples in signal processing for wireless communications. We conclude the paper with a discussion.
Ii Gaussian Processes for Machine Learning
Ii-a Minimum mean square error: a starting point
GPs can be introduced in a number of ways and we, as signal processing practitioners, find it particularly appealing to start from the MMSE solution. This is because the Wiener solution, which is obtained by minimizing the MSE criterion, is our first approach to most estimation problems and, as we show, GPs are its natural Bayesian extension.
Many signal processing problems reduce to estimating from an observed random process another related process . These two processes are related by a probabilistic, possibly unknown, model . It is well known that the unconstrained MMSE estimate,
coincides with the conditional mean estimate of given
If is jointly Gaussian, i.e. and are Gaussians and is linear in , this solution is linear. If and are zero mean, the solution yields , where
Furthermore, these expectations can be easily estimated, using the sample mean, from independently and identically distributed (iid) samples drawn from and , namely .
However, if is not linearly related to (plus Gaussian noise) or
is not Gaussian distributed, the conditional estimate ofgiven is no longer linear. Computing the nonlinear conditional mean estimate in (2) directly from either leads to overfitted solutions, because there are no convergence guarantees for general density estimation [Vapnik98], or to suboptimal solutions, if we restrict the density model to come from a narrow class of distributions. For instance, in channel equalization, although suboptimal, the sampled version of (3) is used due to its simplicity. One viable solution would be to minimize the sampled version of (1) with a restricted family of approximating functions to avoid overfitting. Kernel least squares (KLS) [PerezCruz04b] and Gaussian process regression, among others, follow such approach.
Ii-B Gaussian Processes for Regression
In its simplest form, GPR models the output nonlinearly according to
and it follows (1), without assuming that and are linearly related or that is Gaussian distributed. Nevertheless, it still considers that is Gaussian distributed, i.e., is a zero-mean Gaussian111A further relaxation to this condition is discussed in Section V.. In this way, GP can be understood as a natural nonlinear extension to MMSE estimation. Additionally, GPR does not only estimate (2) from , but it also provides a full statistical description of given , namely
GPs can be presented as a nonlinear regressor that expresses the input-output relation in (4) by assuming that a real-valued function , known as latent function, underlies the regression problem and that this function follows a Gaussian process. Before the labels are revealed, we assume this latent function has been drawn from a Gaussian process prior. GPs are characterized by their mean and covariance functions, denoted by and , respectively. Even though nonzero mean priors might be of use, working with zero-mean priors typically represents a reasonable assumption and it simplifies the notation. The covariance function explains the correlation between each pair of points in the input space and characterizes the functions that can be described by the Gaussian process. For example,
only yields linear latent functions and it is used to solve Bayesian linear regression problems, for which the mean of the posterior process coincides with the MMSE solution in (3), as shown in Section II-E. We cover the design of covariance functions in Section III.
For any finite set of inputs , a Gaussian process becomes a multidimensional Gaussian defined by its mean (zero in our case) and covariance matrix, . The Gaussian process prior becomes
where and . We want to compute the estimate for a general input , when the labels for the training examples, denoted by , are known. We can analytically compute (5
) by using the standard tools of Bayesian statistics: Bayes’ rule, marginalization and conditioning.
We first apply Bayes’ rule to obtain the posterior density for the latent function
where is the Gaussian process prior in (6) extended with a general input , is the likelihood for the latent function at the training set, in which is independent of given the latent function , and is the marginal likelihood or evidence of the model.
The likelihood function is given by a factorized model:
because the samples in are iid. In turn, for each pair , the likelihood is given by (4), therefore
A Gaussian likelihood function is conjugate to the Gaussian prior and hence the posterior in (7) is also a multidimensional Gaussian, which simplifies the computations to obtain (5). If the observation model were not Gaussian, warped Gaussian processes (see Section V ) could be used to estimate (5).
Finally, we can obtain the posterior density in (5) for a general input by conditioning on the training set and , and by marginalizing the latent function:
where222Given the training data set, takes values in as it is a vector of samples of a Gaussian process.
We have divided the marginalization in two separate equations to show the marginalization of the latent function over the training set in (11), and the marginalization of the latent function at a general input in (10). As mentioned earlier, the likelihood and the prior are Gaussians and therefore the marginalization in (10) and (11) only involves Gaussian distributions. Thereby, we can analytically compute (10) and (11) by using Gaussian conditioning and marginalization properties, leading to the following Gaussian density for the output:
The mean for is also given by (13a), i.e.,
, and its variance is
which, as expected, also accounts for the noise in the observation model.
The mean prediction of GPR in (13a
) is the solution provided by KLS, or kernel ridge regression (KRR)[PerezCruz04b], in which the covariance function takes the place of the kernel. However, unlike standard kernel methods, GPR provides error bars for each estimate in (13b) or (16) and has a natural procedure for setting the covariance/kernel by evidence sampling or maximization, as detailed in Section III. In SVM or KRR the hyper-parameters are typically adjusted by cross-validation, needing to retrain the models for different settings of the hyper-parameters on a grid search. So, typically only one or two hyper-parameters can be fitted. GPs can actually learn tens of hyper-parameter, because either sampling or evidence maximization allows setting them by a hassle-free procedure.
Ii-C An example
In Fig. 2 we include an illustrative example with 20 training points, in which we depict (12) for any between and . We used standard functions from the GPML toolbox, available at http://www.gaussianprocess.org/gpml/, to generate the GP in this figure. We have chosen a Gaussian kernel that is fixed333The kernel is typically expressed in a parametric form, see Section III. as and . In the plot, we show the mean of the process in red and the shaded area denotes the error bar for each prediction, i.e., . We also plot 5 samples from the posterior in thin blue lines.
We observe three different regions in the figure. On the right-hand side, we do not have samples and, for , the GPR provides the solution given by the prior (zero mean and ). At the center, where most of the data points lie, we have a very accurate view of the latent function with small error bars (close to ). On the left hand side, we only have two samples and we notice the mixed effect of the prior widening the error bars and the data points constraining the values of the mean to lie close to the available samples. This is the typical behavior of GPR, which provides an accurate solution where the data lies and high error bars where we do not have available information and, consequently, we presume that the prediction in that area is not accurate.
Ii-D Recursive GPs
In many signal processing applications, the samples become available sequentially and estimation algorithms should obtain the new solution every time a new datum is received. In order to keep the computational complexity low, it is more interesting to perform inexpensive recursive updates rather than to recalculate the entire batch solution. Online Gaussian Processes [csato2001sparse] fulfill these requisites as follows.
Let us assume that we have observed the first samples and that at this point the new datum is provided. We can readily compute the predicted distribution for using (13a), (13b) and (16). Furthermore, by using the formula for the inverse of a partitioned matrix and the Woodbury identity we update from
Nevertheless, for online scenarios, it is more convenient to update the predicted mean and covariance matrix for all the available samples, as it is easier to interpret how the prediction changes with each new datum. Additionally, as we will show in Section VI, this formulation makes the adaptation to non-stationary scenarios straightforward. Let us denote by and the posterior mean and covariance matrix for the samples in . By applying (13a) and (13b) we obtain
Once the new datum is observed, the updated mean and covariance matrix can be computed recursively as follows:
where . As can be observed in (19a), the mean of the new process is obtained by applying a correction term to the previous mean, proportional to the estimation error, . Because of the relation between and stated in (18b), only one of the two matrices needs to be stored and updated in an online formulation. Some authors [csato2001sparse] prefer to rely on , whereas others [vanvaerenbergh2012kernel] store and update .
The recursive update of the mean in (19a) is equivalent to what is known as kernel recursive least-squares (KRLS) in the signal processing literature (see for instance [csato2001sparse, engel2004kernel, vanvaerenbergh2012kernel] ). The unbounded growth of the involved matrices, visible in (19) and (17), is the main limitation in the KRLS formulation. Practical KRLS implementations typically either limit this growth [engel2004kernel, liu2009information] or even fix the matrix sizes [vanvaerenbergh2006sliding]. Nevertheless, the solution of KRLS is limited to the mean only and it cannot estimate confidence intervals. By using a GP framework, though, an estimate of the entire posterior distribution is obtained, including the covariance in (19b).
Ii-E Connection to MMSE: GPR with a linear latent function
If we replace in (4) with a linear model
the Gaussian process prior over becomes a spherical-Gaussian prior distribution over , .
We can now compute the posterior for , as we did for the latent function in (7)
where is the likelihood. Since the prior and likelihood are Gaussians, so it is the posterior, and its mean and covariance are given by
We can readily notice that (20a) is the sampled version of (3), when the prior variance tends to infinity (i.e., the prior has no effect of the solution). The precision matrix (the inverse covariance) is composed of two terms: the first depends on the data and the other one on the prior over . The effect of the prior in the mean and covariance fades away, as we have more available data. The estimate for a general input is computed as in (10)
which is a Gaussian distribution with mean and variance given by:
Equations (22) and (23) can be, respectively, rewritten as (13a) and (16), if we use the inner product between the multiplied by the width of the prior over , i.e. the kernel matrix is given by: . The kernel matrix must include the width of the prior over , because the kernel matrix represents the prior of the Gaussian process and is the prior of the linear Bayesian estimator. By using the Woodbury’s identity, it follows that
Now, by replacing (24) in (22) and (23), we,respectively, recover (13a) and (16). These steps connect the estimation of a Bayesian linear model and the nonlinear estimation using a kernel or covariance function without needing to explicitly indicate the nonlinear mapping.
Iii Covariance functions
In the previous section, we have assumed that the covariance functions are known, which is not typically the case. In fact, the design of a good covariance function is crucial for GPs to provide accurate nonlinear solutions. The covariance function plays the same role as the kernel function in SVMs or KLS [PerezCruz04b]. It describes the relation between the inputs and its form determines the possible solutions of the GPR. It controls how fast the function can change or how the samples in one part of the input space affect the latent function everywhere else. For most problems, we can specify a parametric kernel function that captures any available information about the problem at hand. As already discussed, unlike kernel methods, GPs can infer these parameters, the so-called hyper-parameters, from the samples in using the Bayesian framework. Instead of relying on computational intensive procedures as cross-validation [Kimeldorf71] or learning the kernel matrix [Bousquet03], as kernel methods need to.
The covariance function must be positive semi-definite, as it represents the covariance matrix of a multidimensional Gaussian distribution. The covariance can be built by adding simpler covariance matrices, weighted by a positive hyper-parameter, or by multiplying them together, as the addition and multiplication of positive definite matrices yields a positive definite matrix. In general, the design of the kernel should rely on the information that we have for each estimation problem and should be designed to get the most accurate solution with the least amount of samples. Nevertheless, the following kernel in (25) often works well in signal processing applications
where are the hyper-parameters. The first term is a radial basis kernel, also denoted as RBF or Gaussian, with a different length-scale for each input dimension. This term is universal and allows constructing a generic nonlinear regressor. If we have symmetries in our problem, we can use the same length-scale for all dimensions: for . The second term is the linear covariance function. The last term represents the noise variance , which can be treated as an additional hyper-parameter to be learned from the data. We can add other terms or other covariance functions that allow for faster transitions, like the Matérn kernel among others [rasmusssen2006gaussian].
If the hyper-parameters, , are unknown, the likelihood in (8) and the prior in (6) can, respectively, be expressed444We have dropped the subindex , as it is inconsequential and unnecessarily clutters the notation. as and , and we can proceed to integrate out as we did for the latent function, , in Section II-B. First, we compute the marginal likelihood of the hyper-parameters of the kernel given the training dataset
Second, we can define a prior for the hyper-parameters,
, that can be used to construct its posterior. Third, we integrate out the hyper-parameters to obtain the predictions. However, in this case, the marginal likelihood does not have a conjugate prior and the posterior cannot be obtained in closed form. Hence, the integration has to be done either by sampling or approximations. Although this approach is well principled, it is computational intensive and it may be not feasible for some applications. For example, Markov-Chain Monte Carlo (MCMC) methods require several hundred to several thousand samples from the posterior ofto integrate it out. Interested readers can find further details in [rasmusssen2006gaussian].
Alternatively, we can maximize the marginal likelihood in (26) to obtain its optimal setting [Williams96]. Although setting the hyper-parameters by maximum likelihood (ML) is not a purely Bayesian solution, it is fairly standard in the community and it allows using Bayesian solutions in time sensitive applications. This optimization is nonconvex [MacKay03], but, as we increase the number of training samples, the likelihood becomes a unimodal distribution around the ML hyper-parameters and the solution can be found using gradient ascent techniques. See [rasmusssen2006gaussian] for further details.
Iv Sparse GPs: Dealing with large-scale data sets
To perform inference under any GP model, the inverse of the covariance matrix must be computed. This is a costly operation, , that becomes prohibitive for large enough . Given the ever-increasing availability of large-scale databases, a lot of effort has been devoted over the last decade to the development of approximate methods that allow inference in GPs to scale linearly with the number of data points. These approximate methods are referred to as “sparse GPs”, since they approximate the full GP model using a finite-basis-set expansion. This set of bases is usually spawned by using a common functional form with different parametrizations. For instance, it is common to use bases of the type , where —known as the active set— is a subset of the input samples parametrizing the bases.
Under the unifying framework of [Candela05], it can be shown that most relevant sparse GP proposals [Seeger03, Snelson06], which were initially thought of as entirely different low-cost approximations, can be expressed as exact inference under different modifications of the original GP prior. This modified prior induces a rank- () covariance matrix —plus optional (block) diagonal correcting terms—, clarifying how the reduced cost of exact inference arises.
Among the mentioned approximations, the sparse pseudo-input GP (SPGP) [Snelson06] is generally regarded as the most efficient. Unlike other alternatives, it does not require the active set to be a subset of the training data. Instead, can be selected to lie anywhere in the input space, thus increasing the flexibility of the finite set expansion. This selection is typically performed by evidence maximization. An even more flexible option, which does not require the active set to even lie in the input domain, is presented in [Lazaro09].
Despite the success of SPGP, it is worth mentioning that increasing the number of bases in this algorithm does not yield, in general, convergence to the full GP solution because the active set is not constrained to be a subset of input data. This might lead to overfiting in some pathological cases. A recent variational sparse GP proposal that guarantees convergence to the full GP solution while still allowing the active set to be unconstrained is presented in [Titsias09].
Further approaches yielding reduced computational cost involve numerical approximations to accelerate matrix-vector multiplications and compactly supported covariance functions which set most entries of the covariance matrix to zero [Gu12].
Sparsity is often seen in online signal processing in the form of pruning, which restricts the active set to a subset of input data. The success of SPGP and its variational counterpart suggests that advanced forms of pruning may result in increased efficiency for a given sparsity level.
V Warped GPs: Beyond the standard noise model
Even though GPs are very flexible priors for the latent function, they might not be suitable to model all types of data. It is often the case that applying a logarithmic transformation to the target variable of some regression task (e.g., those involving stock prices, measured sound power, etc) can enhance the ability of GPs to model it.
In [Snelson03] it is shown that it is possible to include a non-linear preprocessing of output data (called warping function in this context) as part of the modeling process and learn it. In more detail, a parametric form for is selected, then (which depends on the parameters of ) is regarded as a GP, and finally, the parameters of are selected by maximizing the evidence of such GP (i.e., a ML approach). The authors suggest using as the parametric form of the warping function, but any option resulting in a monotonic function is valid. A non-parametric version of warped GPs using a variational approximation has been proposed in [Lazaro12].
Vi Tracking non-stationary scenarios: Learning to forget
KRLS algorithms, discussed in Section II-D, traditionally consider that the mapping function is constant throughout the whole learning process [engel2004kernel, liu2010adaptive]. However, in the signal processing domain this function (which might represent, for instance, a fading channel) is often subject to changes and the model must account for this non-stationarity. Some kernel-based algorithms have been proposed to deal with non-stationary scenarios. They include a kernelized version of the extended RLS filter [liu2010adaptive], a sliding-window KRLS approach [vanvaerenbergh2006sliding] and a family of projection-based algorithms [slavakis2008online, theodoridis2011adaptive].
In order to add adaptivity to the online GP algorithm described in Section II-D, it is necessary to make it “forget” the information contained in old samples. This becomes possible by including a “forgetting” step after each update
to shift the posterior distribution towards the prior (for ), thus effectively reducing the influence of older samples. Note that when using this formulation there is no need to store or update , see [vanvaerenbergh2012kernel] for further details. The adaptive, GP-based algorithm obtained in this manner is known as KRLS-T.
Equations (27) might seem like an ad-hoc step to enable forgetting. However, it can be shown that the whole learning procedure —including the mentioned controlled forgetting step— corresponds exactly to a principled non-stationary scheme within the GP framework, as described in [vanvaerenbergh2012estimation]. It is sufficient to consider an augmented input space that includes the time stamp of each sample and define a spatio-temporal covariance function:
where is the already-known spatial covariance function and is a temporal covariance function giving more weight to samples that are closer in time. Inference on this augmented model effectively accounts for non-stationarity in and recent samples have more impact in predictions for the current time instant. It is fairly simple to include this augmented model in the online learning process described in the previous section. When the temporal covariance is set to , inference in the augmented spatio-temporal GP model is exactly equivalent to using (27) after each update (19) in the algorithm of Section II-D, which has the added benefit of being inexpensive and online. See [lazaro2011bayesian, vanvaerenbergh2012kernel, vanvaerenbergh2012estimation] for further details.
Observe that is used here to model the speed at which varies, playing a similar rôle to that of the forgetting factor in linear adaptive filtering algorithms. When used with a linear spatial covariance, the above model reduces to linear extended RLS filtering. The selection of this parameter is usually rather ad-hoc. However, using the GP framework, we can select it in a principled manner using Type-II ML, see [vanvaerenbergh2012estimation].
In Fig. 3 we take the example of Fig. 2 and we apply a forgetting factor . The red continuous line indicates the original mean function before forgetting. After applying one forgetting update, this mean function is displaced toward zero, as indicated by the the blue dashed line. The shaded gray area represents the error bars prior to forgetting. The forgetting update expands this area into the shaded red area, which tends to the prior variance of 1.
Vi-a Tracking a time-selective nonlinear communication channel
To illustrate the validity of the adaptive filtering algorithm, we focus on the problem of tracking a nonlinear Rayleigh fading channel [sayed2003fundamentals, Chapter 7]. The used model consists of a memoryless saturating nonlinearity followed by a time-varying linear channel, as shown in Fig. 4. This model appears for instance in broadcast or satellite communications when the amplifier operates close to saturation regime [feher83digital].
In a first, simulated setup, the time-varying linear fading channel consists of randomly generated paths, and the saturating nonlinearity is chosen as . We fix the symbol rate at s, and we simulate two scenarios: one with a normalized Doppler frequency of (where denotes the Doppler spread), representing a slow-fading channel, and another one with , corresponding to a fast time-varying channel. Note that a higher Doppler frequency yields a more difficult tracking problem, as it corresponds to a channel that changes faster in time. We consider a Gaussian source signal, and we add dB of additive white Gaussian noise to the output signal. Given one input-output data pair per time instant, the tracking problem consists in estimating the received signal that corresponds to a new channel input.
Figs. 5(a,b) illustrate the tracking results obtained by KRLS-T in these scenarios. As a reference, we include the performance of several state-of-the-art adaptive filtering algorithms, whose Matlab implementations are taken from the Kernel Adaptive Filtering Toolbox, available at http://sourceforge.net/projects/kafbox/. In particular, we compare KRLS-T with normalized least mean squares (NLMS), extended RLS (EX-RLS), both of which are linear algorithms, see [sayed2003fundamentals], and quantized kernel LMS (QKLMS) [chen2012quantized], which is an efficient, kernelized version of the LMS algorithm. A Gaussian kernel is used for QKLMS and KRLS-T. In each scenario the optimal hyperparameters of KRLS-T are obtained by performing Type-II ML optimization (see Section III ) on a separate data set of test samples. The optimal parameters of the other algorithms are obtained by performing cross-validation on the test data set. To avoid an unbounded growth of the matrices involved in KRLS-T, its memory is limited to bases which are selected by pruning the least relevant bases (see [vanvaerenbergh2012kernel] for details on the pruning mechanism). The quantization parameter of QKLMS is set to yield similar memory sizes. As can be seen in Figs. 5(a,b), KRLS-T outperforms the other algorithms with a significant margin in both scenarios. By being kernel-based it is capable to deal with nonlinear identification problems, in contrast to the classical EX-RLS and NLMS algorithms. Furthermore, it shows excellent convergence speed and steady-state performance when compared to QKLMS. Additional experimental comparisons to other kernel adaptive filters can be found in [vanvaerenbergh2012kernel].
|(a) , simulated data||(b) , simulated data|
|(c) , real data||(d) measured linear channels|
|vhv , simulated||dB||dB||dB||dB|
|, real data||dB||dB||dB||dB|
In a second setup we used a wireless communication test bed that allows to evaluate the performance of digital communication systems in realistic indoor environments. This platform is composed of several transmit and receive nodes, each one including a radio-frequency front-end and baseband hardware for signal generation and acquisition. The front-end also incorporates a programmable variable attenuator to control the transmit power value and therefore the signal saturation. A more detailed description of the test bed can be found in [gutierrez2011frequency]. Using the hardware platform, we reproduced the model corresponding to Fig. 4 by transmitting clipped orthogonal frequency-division multiplexing (OFDM) signals centered at 5.4 GHz over real frequency-selective and time-varying channels. Notice that, unlike the simulated setup, several parameters such as the noise level and the variation of the channel coefficients are unknown. To have an idea about the channel characteristics, we first measured the indoor channel using the procedure described in [gutierrez2011frequency]. As an example, the variation of the four main channel coefficients is depicted in Fig. 5(d), indicating a normalized Doppler frequency around . We then transmitted periodically OFDM signals with the transmit amplifier operating close to saturation and acquired the received signals. The transmitted and received signals were used to track the nonlinear channel variations as in the simulated setup. The results, shown in Fig. 5(c), are similar to those of the simulated setup. Finally, the steady-state NMSE performances of all three scenarios, Figs. 5(a,b,c), are summarized in Table I.
Vii Gaussian Processes for Classification
For classification problems, the labels are drawn from a finite set and GPs return a probabilistic prediction for each label in the finite set, i.e., how certain is the classifier about its prediction. In this tutorial, we limit our presentation of GPs for classification (GPC) for binary classification problems, i.e.,. For GPC, we change the likelihood model for the latent function at using a response function :
The response function “squashes” the real-valued latent function to an
-interval that represents the posterior probability for[rasmusssen2006gaussian]. Standard choices for the response function are
and the cumulative density function of a standard normal distribution, used in logistic and probit regression respectively.
The integrals in (10) and (11) are now analytically intractable, because the likelihood and the prior are not conjugated. Therefore, we have to resort to numerical methods or approximations to solve them. The posterior distribution in (7) is typically single-mode and the standard methods approximate it with a Gaussian [rasmusssen2006gaussian]. Using a Gaussian approximation for (7) allows exact marginalization in (11) and we can use numerical integration for solving (10), as it involves marginalizing a single real-valued quantity. The two standard approximations are the Laplace method or expectation propagation (EP) [Minka01]. In [Kuss05], EP is shown to be a more accurate approximation.
Vii-a Probabilistic channel equalization
GPC predictive performance is similar to other nonlinear discriminative methods, such as SVMs. However, if the probabilistic output is of importance, then GPC outperforms other kernel algorithms, because it naturally incorporates the confidence interval in its predictions. In digital communication, channel decoders follow equalizers, which work optimally when accurate posterior estimates are given for each symbol. To illustrate that GPC provide accurate posterior probability estimates, we equalize a dispersive channel model like the one in Fig. 4 using GPC and SVM with a probabilistic output. These outputs are subsequently fed to a low-density parity-check (LDPC) belief-propagation based channel decoder to assess the quality of the estimated posterior probabilities. Details for the experimental set up can be found in [olmos2010joint] in which linear and nonlinear channel models are tested. We now summarize the results for the linear channel model in that paper.
In Fig. 6, we depict the posterior probability estimates versus the true posterior probability, in (a) for the GPC-based equalizer and in (b) for SVM-based equalizer, to emphasize the differences between the equalizers we use a highly noisy scenario with normalized signal-to-noise ratio of dB. If we threshold at 0.5, both equalizers provide similar error rates and we cannot tell if there is an advantage from using GPC. However, if we consider the whole probability space, GPC predictions are significantly closer to the main diagonal that represents a perfect match, hence GPC provides more accurate predictions to the channel decoder.
To further quantify the gain from using a GPC-based equalizer with accurate posterior probability estimates, we plot the bit error rate (BER) in Fig. 7 after the probabilistic channel encoder, in which the GPC-based equalizer clearly outperforms the SVM-based equalizer and is close to the optimal solution (known channel and forward-backward (BCJR) equalizer). This example is illustrative of the results that can be expected from GPC when a probabilistic output is needed to perform optimally.
In this tutorial, we have presented Gaussian Processes for Regression in detail from the point of view of MMSE/Wiener filtering, so it is amenable to signal processing practitioners. GPR provides the same mean estimate as KLS or KRR for the same kernel matrix. On the plus side, GPR provides error bars that take into account the approximation error and the error from the likelihood model, so we know the uncertainty of our model for any input point (see Fig. 2 ), while KLS assumes the error bars are given by the likelihood function (i.e., constant for the whole input space). Additionally, GPR naturally allows computing the hyper-parameters of the kernel function by sampling or maximizing the marginal likelihood, being able to set tens of hyper-parameters, while KLS or SVM need to rely on cross-validation, in which only one or two parameters can be easily tuned. On the minus side, the GP prior imposes a strong assumption on the error bars that might not be accurate, if the latent variable model does not follow a Gaussian process. Although, in any case, it is better than not having error bars.
We have also shown that some of the limitations of the standard GPR can be eased. GPs can be extended to non-Gaussian noise models and classification problems, in which GPC provides an accurate a posteriori probability estimate. The computational complexity of GPs can be reduced considerably, from cubic to linear in the number of training examples, without significantly affecting the mean and error bars prediction. Finally, we have shown the GP can be solved iteratively, with an RLS formulation that can be adapted to non-stationary environments efficiently.
Instead of covering more methods and applications in detail, our intention was to provide a tutorial paper on how to use GPs in signal processing, with a number of illustrative examples. Nevertheless, since we assume that there are several other methods and applications that are relevant to the reader, we finish with a brief list of further topics. In particular, GPs have also been applied to problems including modeling human motion [Wang08], source separation [Liutkus11], estimating chlorophyll concentration [Pasolli10], approximating stochastic differential equations [Archambeau07] and multi-user detection [Murillo09], among others.
The authors would like to thank Jesús Gutiérrez, University of Cantabria, Spain, for his assistance in capturing the data of the test bed experiment.