Multi-class classification problems involve predicting a class label that can take values in a discrete set of labels with (Murphy, 2012). For this task, one should use the information contained in the input attributes , with the dimensionality of the data. That is, we assume input attributes in the real line. In order to infer the relation between and it is assumed that one can use some training data in the form of pairs , namely, . Multi-class classification problems arise in a huge variety of fields, from industry to science. Most of the times, however, it is common to have datasets whose inputs are the result of experimental measurements. These measurements are unavoidably contaminated with noise, as a consequence of measurement error. Furthermore, in some situations, the errors in the measurement of the explaining attributes can be well determined. See (Barford, 1985) for further details. Incorporating this inductive bias, or prior knowledge about the particular characteristics of the inputs of the classification problem is expected to lead to better results when training a classifier to predict given . Conversely, ignoring errors or noise in the input measurements is expected to lead to sub-optimal results when this is indeed the case. In this research work we have empirically validated this hypothesis on several commonly used datasets extracted from the UCI repository (Bay et al., 2000), as well as on the well-known MNIST data set (LeCun et al., 1998). Even for these commonly used datasets in the machine learning community, we find better prediction results by assuming the presence (and learning the amount) of noise in some of the corresponding input variables, as we will show later.
While inference tasks on data with noisy attributes have been considered since long time in the context of regression —see for example, (Press et al., 2007), or more recently (Mchutchon and Rasmussen, 2011), in the context of Gaussian processes— the specific case of multi-class classification has received much less attention from the literature, with a few exceptions (Sáez et al., 2014). Taking into account the presence of noise in the input is, as we show below, potentially essential to better modeling the conditional distribution
giving rise to the observed labels. Considering input noise is also expected to have an important impact on the classification of points that are not far from the decision boundaries, since those are regions of the input space in which the data is more susceptible of being misclassified (at least in the case of additive noise with finite variance). Needless to say, both improving the estimated underlying predictive distribution and the better confidence in the classification ofdifficult points are two desirable properties for real-world problems in generic science applications, as for example in the case of the medical domain or in astrophysics applications (Gal, 2016).
Gaussian Processes (GPs) are machine learning methods that are inherently specified within a Bayesian framework (Rasmussen and Williams, 2006)
. Therefore, they can deliver probabilistic outputs that allow to extract uncertainty in the predictions made. This uncertainty is specified in terms of a predictive distribution for the target variable and may arise both from the modeling of intrinsic noise and also due to the lack of observed data. GPs are also non-parametric models. Therefore, their expressiveness grows as the number of data points in the training set, of size, increases. GPs are, however, expensive to train since their complexity is in . Specifically, they require the inversion of a covariance matrix of size
. These methods also suffer from the difficulty of requiring approximate techniques to compute the posterior distribution of the latent variables of the model in the case of classification tasks. This posterior distribution is required to compute the predictive distribution for the target variable. Nevertheless, in spite of this difficulties, GPs have been successfully used to address multi-class classification problems and have been shown to be competitive with other approaches such as support vector machines or neural networks(H.-C.Kim and Ghahramani, 2006; Hernández-Lobato et al., 2011; Villacampa-Calvo and Hernández-Lobato, 2017; Hensman et al., 2015c; Henao and Winther, 2012).
To alleviate the problems of scalability of GPs several approximations have been proposed in the literature. Among them, the most popular ones are based on using inducing points representations (Snelson and Ghahramani, 2006; Titsias, 2009b). These techniques consist in introducing a set of inducing points that are carefully chosen to approximate the full GP model. Specifically, the locations of the inducing points are optimized during training alongside with any other GP hyper-parameter, by maximizing a an estimate of the marginal likelihood (Villacampa-Calvo and Hernández-Lobato, 2017; Hensman et al., 2015c). The complexity of these approximations is in , which is significantly better than when . Importantly, sparse approximations based on inducing points can be combined with stochastic optimization techniques, which allow to scale GPs to very large datasets (Villacampa-Calvo and Hernández-Lobato, 2017; Hensman et al., 2015b). Nevertheless, in spite of this, to our knowledge, all current methods for multi-class GPs classification assume noiseless input attributes. In this paper we extend the framework of multi-class GP classification to account for this type of noise.
Our work is motivated by a concrete multi-class classification problem and from a dataset coming from astrophysics, dealing with measurements, by the Fermi-LAT instrument operated by NASA, of point-like sources of photons in the gamma ray energy range all over the sky (https://fermi.gsfc.nasa.gov/). The experimental measurements obtained from such a system are unavoidably contaminated with noise and, in practice, the level of noise in some of the explaining attributes is known and can be well determined. As it turns out, at present, a significant fraction of those point-like sources are not associated to or labeled as known astrophysical gamma rays sources as for example pulsars, blazars or quasars Abdollahi et al. (2019). It is thus of paramount importance for the physics related to those measurements to know whether those point-like sources belong to the standard astrophysical classes, or instead whether they are part of other more exotic kind of sources. As a study-case exercise, though, we train a fully supervised classifier using exclusively the sources which do have labels already. Further details of the dataset are given when presenting our experimental results. These show that a GP classifier which considers input noise can obtain better predictive distributions in terms of the test log-likelihood in this dataset.
To account for input noise in the context of multi-class GP classification, we describe several methods. A first approach is based on considering the actual noiseless input attributes (which we denote by
) as a latent variable, to then perform approximate Bayesian inference and compute their posterior distribution. The required computations are approximated using variational inference(Blei et al., 2017) combined with Monte Carlo methods and stochastic optimization. A second approach considers a first order Taylor expansion of the predictive mean of the GPs contained in the multi-class GP classifier (typically, one different GP per each potential class label), following (Mchutchon and Rasmussen, 2011). Under this linear approximation the input noise is simply translated into output noise, which is incorporated in the modeling process. The variance of the output noise is determined by the slope of the GP predictive mean, which can be obtained using automatic differentiation tools. Variational inference is also used in this second method to approximate the required computations. The two methods described to account for input noise in the context of multi-class GP classification are validated on several experiments, including synthetic data, datasets from the UCI repository, MNIST, and the dataset related to astrophysics that motivated this work described above. These experiments give empirical claims supporting our hypothesis that our methods can effectively deal with noise in the inputs. In particular, we have consistently observed that the predictive distribution of the methods proposed is better, in terms of the test log-likelihood, than the one of a standard multi-class GP classifier that ignores input noise. The prediction error of the proposed method is, however, similar.
The rest of the manuscript is organized as follows: We first introduce the fundamentals about multi-class GP classification and sparse GPs in Section 2. Section 3 describes the proposed models and methods to account for input noise in the observed attributes. Related work about input noise in GPs and other machine learning methods is described in Section 4. Section 5 illustrates the empirical performance of the proposed methods. Finally, Section 6 gives the conclusions of this work.
2 Multi-class Gaussian Process Classification
In this section we describe how Gaussian processes (GPs) can be used to address multi-class classification problems. We consider first a noiseless input setting. Next, in the following section, we describe how noisy inputs can be incorporated into the model. Assume a dataset consisting of instances with the observed explaining attributes and the target class labels, where and is the number of classes. The task of interest is to make predictions about the label of a new instance given the observed data and .
Following the representation introduced by Kim and Ghahramani (2006) for multi-class classification with Gaussian processes, we assume that each class label has been obtained with the labeling rule:
where , for , are different latent functions, each one of them corresponding to a different class label. Therefore the class label has been obtained simply by considering the latent function with the largest value at the data point . Let . Under this labeling rule the likelihood of the value of each latent function at a training point is given by:
where is a Heaviside step function. Other likelihood functions such as the soft-max likelihood arise simply by considering and marginalizing Gumbell noise around the latent functions (Maddison et al., 2014). Here we instead consider Gaussian noise around each , as described later on. To account for labeling errors, we consider that the actual class label associated to
could have been flipped with probabilityto some other class label (Hernández-Lobato et al., 2011; Hensman et al., 2015c). Under this setting, the likelihood becomes:
In order to address multi-class classification with GPs, a GP prior is assumed for each latent function (Rasmussen and Williams, 2006). That is, , where is a covariance function, with hyper-parameters . Popular examples of covariance functions include the squared exponential covariance function. Namely,
where is an indicator function and are the hyper-parameters. More precisely, is the amplitude parameter, are the length-scales and is the level of additive Gaussian noise around . See (Rasmussen and Williams, 2006) for further details. In practice, the hyper-parameters will be different for each latent function . We have ignored here their dependence on the latent function for the sake of readability.
In order to make predictions about the potential class label of a new data point , one would like to compute the posterior distribution of . This distribution summarizes which values of the latent functions are compatible with the observed data. The posterior distribution can be computed using Bayes’ rule as follows:
is a multi-variate Gaussian distribution with zero mean and covariance matrix, with . The denominator in the previous expression, , is just a normalization constant and is known as the marginal likelihood. It can be maximized to obtain the good values for the model hyper-parameters , for . Note that in this setting, we assume independence among the latent functions , since the prior factorizes as .
A practical problem, however, is that the non-Gaussian likelihood in (3) makes infeasible the exact computation of the marginal likelihood, so one has to make use of approximate inference methods to approximate the posterior in (5). One of the most widely used methods for approximate inference is variational inference (VI), which will be explained in detail in the following sections. See (Titsias, 2009b; Hensman et al., 2015b, c) for further details.
2.1 Sparse Gaussian Processes
A difficulty of the method described so far is that, even if the likelihood was Gaussian and exact inference was tractable, the cost of computing the posterior distribution would be in , where is the training set size. The reason for this cost is the need of inverting the prior covariance matrices . See (Rasmussen and Williams, 2006) for further details. As a consequence, multi-class GPs classification would only be applicable on a dataset of at most a few thousand data points.
A popular and successful approach to reduce the previous cost, is to consider an approximation based on sparse Gaussian processes (Titsias, 2009a). Under a sparse setting, a set of pseudo-inputs or inducing points is introduced associated to each latent function . Namely, . These points will lie in the same space as the training data. Namely, , and their locations will be specified during training, simply by maximizing an estimate of the marginal likelihood. Associated to these inducing points we will consider some inducing outputs , where . The process values at each , i.e., , are characterized by and and can then be obtained from the predictive distribution of a GP as follows:
where is a matrix of covariances of between the process values at the observed data points and the inducing points . Similarly, is the covariance matrix. Each entry in this matrix contains the covariances among the process values at the inducing points . Under the sparse approximation, the prior for each is simply the Gaussian process prior. Namely,
Importantly, now one only has to invert the matrices of size and compute the product . Therefore, the training cost will be in , which is significantly better if .
In practice, the inducing point values will be unknown and they are treated as latent variables of the model. An approximate Gaussian posterior distribution will be specified for them. Namely, for . This uncertainty about can be readily introduced in simply by marginalizing these latent variables. The result is a Gaussian distribution for with extra variance due to the randomness of . That is:
In the following sections we describe how to compute the parameters of each approximate distribution , for , using variational inference.
3 Multi-class GP Classification with Input Noise
In this section we describe the proposed approaches for dealing with noisy inputs in the context of multi-class GP classification. Let us consider that is the matrix of noisy observations in which the data patterns are contaminated with additive Gaussian noise with some mean and some variance. Again, consider that is the matrix of noiseless inputs. That is:
where is a particular observation and is a
diagonal matrix. That is, we assume independent additive Gaussian noise for the inputs. For the moment, we consider that the variance of the additive noiseis known beforehand. Later on, we describe how this parameter can be inferred from the training data. Recall that we assume input attributes in the real line.
3.1 Modeling the Input Noise using Latent Variables
A first approach for taking into account noisy inputs in the context of GP multi-class classification is based on making approximate Bayesian inference about the actual noiseless inputs , for . Importantly, these variables will be latent. The observed variables will be the ones contaminated with Gaussian noise , for . With this goal, note that the assumption made in (11) about the generation of provides a likelihood function for the actual observation. That is,
In order to make inference about the noiseless observation , we need to specify a prior distribution for that variable. In practice, however, the actual prior distribution for is specific of each classification problem and unknown, in general. Therefore, we set the prior for to be a multi-variate Gaussian with a broad variance. Namely,
is the identity matrix andis chosen to have a large value, i.e.,
, in order to make it similar to a non-informative uniform distribution. This prior has shown good results in our experiments.
3.1.1 Joint and Posterior Distribution
The first step towards making inference about
is to describe the joint distribution of all the variables of the model (observed and latent). This distribution is given by
where is the matrix of noisy observations, is the matrix of actual noiseless inputs, is the vector of observed labels, is the matrix of process values at the actual noiseless inputs (i.e., at each instead of at each ), and is the matrix of process values at the inducing points.
The posterior distribution of the latent variables, i.e., , and is obtained using Bayes’ rule:
Again, as in the case of standard multi-class classification where there is noise in the inputs, computing this posterior distribution is intractable and approximate inference will be required. To approximate this distribution we employ variational inference, as described in the next section.
3.1.2 Approximate Inference using Variational Inference
with a diagonal matrix. This posterior approximation assumes independence among the different GPs of the model and the actual noiseless inputs .
To enforce that
looks similar to the target posterior distribution, variational inference minimizes the Kullback-Leibler divergence betweenand the exact posterior , given by the distribution in (15). This is done indirectly by maximizing the evidence lower bound . See (Jordan et al., 1999) for further details. The evidence lower bound (ELBO) is given by
where is the Kullback-Leibler divergence and where we have used the fact that the factors of the form described in (6) and present in both the joint distribution and in will cancel.
One problem that arises when computing the previous expression is that the first expectation in (18), i.e. , does not have a closed form solution. It can, however, be computed using a one dimensional quadrature and Monte Carlo methods combined with the reparametrization trick (Hensman et al., 2015a; Kingma and Welling, 2014). Concerning the other factors, the second expectation, , is the expectation of the logarithm of a Gaussian distribution, so it can be computed in a closed form. We can also evaluate analytically the Kullback Leibler divergences in the lower bound, , since they involve Gaussian distributions.
Importantly, the ELBO in (18
) is expressed as a sum across the observed data points. This means that this objective is suitable for being optimized using mini-batches. The required gradients can be obtained using automatic differentiation techniques such as those implemented in frameworks such as Tensorflow(Abadi et al., 2015). Appendix A describes the details about how to obtain an unbiased noisy estimate of the ELBO, .
One last remark is that it is possible to show that
where is the Kullback-Leibler (KL) divergence between and the target posterior distribution in (15) (Jordan et al., 1999). After maximizing the ELBO, , it is expected that the KL term is fairly small and hence . Therefore, can be maximized to find good values for the model hyper-parameters. This is expected to maximize , which will correspond to a type-II maximum likelihood approach (Rasmussen and Williams, 2006). The locations of the inducing points , for , are found by maximizing , an estimate of the marginal likelihood, as in (Villacampa-Calvo and Hernández-Lobato, 2017; Hensman et al., 2015c). Note, however, that these correspond to parameters of the posterior approximation , defined in (16), and not of the described probabilistic model (Titsias, 2009a).
After the maximization of the lower bound (18), the approximate distribution is fitted to the actual posterior. The predictive distribution for the class label of a new instance can be approximated by replacing the exact posterior by the posterior approximation in the exact predictive distribution. Namely,
where is the posterior distribution of the actual attributes of the new instance given the observed attributes . This posterior is the normalized product of the prior times the likelihood. Therefore, it can be computed in closed form and is a Gaussian. Namely,
where and , and where is a diagonal matrix with the variances of the Gaussian noise around .
In general, the integral in (20) is intractable. However, we can generate samples of simply by drawing from to then compute a Monte Carlo approximation:
where is the number of samples, , with the generated sample of .
The only remaining thing is how to compute the integral in the right hand side of (22). It turns out that the integral with respect to each can be computed analytically using (8). The integral with respect to can be approximated using a one-dimensional quadrature. In particular, under the likelihood function in (3), the approximation of the predictive distribution becomes:
where , , is the cumulative probability of a standard Gaussian distribution and
for with the variance of at , the matrix of covariances between the values of at and , and the covariances of among the inducing points . Note that the integral over in (23) has no closed form solution but it can be computed using one-dimensional quadrature.
3.2 Amortized Approximate Inference
A limitation of the method described is that the approximate posterior distribution over the noiseless inputs, i.e., , demands storing in memory a number of parameters that is in , where is the number of observed instances. Of course, in the big data regime, i.e., when is in the order of thousands or even millions the memory resources needed to store those parameters can be too high. To alleviate this problem, we propose to use amortized approximate inference to reduce the number of parameters that need to be store in memory (Kingma and Welling, 2014; Shu et al., 2018).
Amortized variational inference assumes that the parameters of each distribution , i.e., the mean and the diagonal covariance matrix, can be obtained simply as a non-linear function that depends on the observed data instance . This non-linear function is set to be a neural network whose parameters are optimized during training. That is,
where both and are obtained as the output of a neural network with parameters
(we use a one-hot encoding for the label). Therefore, one only has to store in memory the neural network which has a fixed number of parameters. This number of parameters does not depend on . The neural network can be adjusted simply by maximizing w.r.t the evidence lower bound described in (18). The computational cost of the method is not changed, since the cost for the feed-forward pass of the network to obtain and is constant.
Amortized approximate inference introduces the inductive bias that points that are located in similar regions of the input space should have similar parameters in the corresponding posterior approximation . Of course, this has the benefit property of reducing the number of parameters of the approximate distribution . A second advantage is, however, that the neural network can provide a beneficial regularization that is eventually translated into better generalization results (Shu et al., 2018). More precisely, our experiments show that amortized variational inference sometimes provides better results than using an approximate distribution that has separate parameters per each data point .
Besides using this neural network to compute the parameters of , the model is not changed significantly. Prediction is done in the same way, and hyper-parameter optimization is also carried out by maximizing the evidence lower bound.
3.3 First Order Approximation
In this section we describe an alternative method to account for input noise in the context of multi-class classification with GPs. This method is inspired on work already done for regression problems (Mchutchon and Rasmussen, 2011). Consider the relation between the noisy and the noiseless input measurements given by (11). Now, consider a Taylor expansion of a latent function around the noiseless measurement . Namely,
where is the noisy observation. Since we do not have access to the noiseless input vector , we simply approximate it with the noisy one . Note that this last expression involves the derivatives of the GP. Although they can be showed to be again GPs (see (Rasmussen and Williams, 2006) for further details), we here decide to approximate these derivatives with the derivatives of the mean of GP, as in (Mchutchon and Rasmussen, 2011). This approximation corresponds to ignoring the uncertainty about the derivative. Let denote the d-dimensional vector corresponding to the derivative of the mean of the GP with respect to each input dimension at . If we expand the right hand side of (27) up to the first order terms, we get a linear model. Namely,
Therefore, the input noise can be understood as output noise whose variance is proportional to square of the derivative of the mean value of at .
The model just described can be combined with the framework for sparse GPs described in Section 2.1
to give an alternative posterior predictive distribution forto the one described in (8). That is,
where and are the parameters of , the covariance matrices are evaluated at the noisy measurements and
Therefore, is a diagonal matrix whose entries account for the extra output noise that results from the corresponding input noise. This makes sense, since the input noise is expected to have a small effect in those regions of the input space in which the latent function is expected to be constant. Importantly, in the sparse setting
This partial derivative can be easily obtained automatically in modern frameworks for implementing multi-class GP classifiers such as Tensorflow (Abadi et al., 2015). The expression in (29) can replace (8) in a standard sparse multi-class GP classifier to account for input noise. One only has to provide the corresponding input noise variances , for . Approximate inference in such a model can be carried out using variational inference, as described in (Hensman et al., 2015c).
Figure 1 shows the predictive distribution of the model described, which we refer to as (), for each latent function in a three class toy classification problem described in Section 5.1. The figure on the left shows the predictive distribution obtained when the extra term in the predictive variance that depends on the slope of the mean is ignored in (29). The figure on the right shows the resulting predictive distribution when that term is taken into account. We can observe that the produced effect is to increase the variance of the predictive distribution by some amount that is proportional to the squared value of slope of the predictive mean. This is particularly noticeable in the case of the latent function corresponding to class number 2. A bigger variance in the predictive distribution for the latent function will correspond to less confidence in the predictive distribution for the class label. See Section 5.1 for further details.
Figure 1 also shows the learned locations of the inducing points for each latent function (displayed at the bottom of each image). We observe that they tend to be placed uniformly in the case of the latent functions corresponding to class labels 0 and 1. However, in the case of the latent function corresponding to class label 2, they concentrate in specific regions of the input space. Namely, in those regions in which the latent function changes abruptly.
3.4 Learning the Level of Noise in the Inputs
The previous sections assumed that the variance of Gaussian noise associated to each input dimension, i.e., the diagonal matrix , is known before-hand. This is the case of many practical problems in which the error associated to the measurement instrument that is used to obtain the observed attributes is well-known. However, in some certain situations it can be the case that the variance of the error is unknown. In this case, it may still be possible to infer this level of error from the observed data.
Typically, in this case, one will assume that the level of noise is the same across all observed data instances. That is, . This has the advantage of reducing the number of parameters that have to be inferred. To estimate one can simply treat this parameter as a hyper-parameter of the model. Its value can be estimated simply by type-II maximum likelihood, as any other hyper-parameter of the GP (Rasmussen and Williams, 2006). Under this setting, one simply maximizes the marginal likelihood of the model, i.e.
, the denominator in Bayes’ theorem w.r.t the parameter of interest. This is precisely the approach followed in(Mchutchon and Rasmussen, 2011) for regression problems.
Because evaluating the marginal likelihood is infeasible in the models described so far, one has to use an approximation. This approximation can be the evidence lower bound described in (18), which will be similar to the marginal likelihood if the approximate distribution is an accurate posterior approximation. The maximization of (18) w.r.t can be simply done with no extra computational cost using again stochastic optimization techniques.
In general, we will assume that is known for each data instance. If that is not the case, we will infer that level of noise using the method described in this section.
3.5 Summary of the Proposed Methods to Deal with Input Noise
Below we briefly the different methods described to deal with input noise in the context of multi-class GP classification:
NIMGP: This is the method described in Section 3.1
which uses latent variables to model the original noiseless inputs. It relies on a Gaussian distribution for the actual observation with the mean being a random variable representing the noiseless input. We assume a non-informative Gaussian prior for the noiseless input. The joint posterior distribution that involves non-gaussian likelihood factors for the process values is approximated using variational inference. Quadrature and Monte Carlo methods are combined with the reparametrization trick to obtain a noisy estimate of the lower bound which is optimized using stochastic methods.
: A limitation of the previous method is the need of storing parameters for each data instance. This is a disadvantage in big data applications. To circumvent this problem, amortized approximate inference is employed in this method, as explained in Section 3.2. We use a neural network to compute the parameters of posterior approximation for the noiseless attributes of each data instance. This network reduces the number of parameters of the approximate distribution and also regularizes the model. Approximate inference is carried out as in NIMGP.
: This is the method described in Section 3.3. In this method we adapt for classification with sparse GPs an already proposed method for regression that accounts for noisy inputs using GPs. This method is based on propagating the noise found in the inputs to the variance of the GPs predictive distribution. For this, a local linear approximation of the GP is used at each input location. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP predictive mean.
4 Related Work
In the literature there are some works dealing with GPs and input points corrupted with noise in the context of regression problems (Mchutchon and Rasmussen, 2011). For this, a local linear approximation of the GP at each input point is performed. When this is done, the input noise is translated into output noise proportional to the squared gradient of the GP posterior mean. Therefore, the mentioned paper simplifies the problem of modeling input noise by assuming that the input measurements are deterministic and by inflating the corresponding output variance to compensate for the extra noise. When this operation is performed, it leads to output noise variance varying across the input space. This property is defined as heteroscedasticity
. The model presented in the that paper can hence be described as a heteroscedastic GP model. In our work we have extended the approach of those authors to address multi-class classification problems since they only considered regression problems. Furthermore, we have compared such a method () with another approach that uses latent variables to model the noiseless inputs (NIMGP) and that, in principle, does not rely on a linear approximation of the GP. We hypothesize that NIMGP circumvents the bias of the linear approximation and give some empirical evidence supporting this claim. Note that we also consider sparse GPs instead of standard GPs, which make our approach more scalable to datasets with a larger number of instances. These differences are common between our proposed methods and most of the techniques described in this section.
As described in the previous paragraph, one of the proposed methods can be understood as a GP model in which the level of output noise depends on the input location, i.e., the data can be considered to be heteroscedastic. Several works have tried to address such a problem in the context of GPs and regression tasks. In particular, Goldberg et al. (1998)
introduce a second GP to deal with the output noise level as a function of the input location. This approach uses a Markov chain Monte Carlo method to approximate the posterior noise variance, which is time-consuming. An interesting extension to the described work tries to circumvent the limitations found in the method ofGoldberg et al. (1998) by replacing the Monte Carlo method with an approximative most likely noise approach (Kersting et al., 2007). This other work learns both the hidden noise variances and the kernel parameters simultaneously. Furthermore, it significantly reduces the computational cost. Again, the domain of application is different from ours since only regression problems are addressed.
Other related work concerning heteroscedasticity in the context of regression problems is the one of Lázaro-Gredilla and Titsias (2011). This approach also relies on variational inference for approximate inference in the context of GPs. These authors explicitly take into account the input noise, and model it with GP priors. By using an exponential transformation of the noise process, they specify the variance of the output noise. Exact inference in the heteroscedastic GP is intractable and the computations need to be approximated using variational inference. Variational inference in such a model has equivalent cost to an analytically tractable homoscedastic GP. Importantly, this work focuses exclusively on regression and ignores classification tasks. Furthermore, no GP sparse approximation is considered by these authors. Therefore, the problems addressed cannot contain more than a few thousand instances.
Copula processes are another alternative to deal with input noise in the context of regression tasks and GPs (Wilson and Ghahramani, 2010). In this case, approximate inference is carried out using the Laplace approximation and Markov chain Monte Carlo methods. There also exists in the literature an approach for online heteroscedastic GP regression that tackles the incorporation of new measurements in constant runtime and makes the computation cheaper by considering online sparse GPs (Bijl et al., 2017). This approach has proven to be effective in a practical applications considering, e.g., system identification.
The work of Mchutchon and Rasmussen (2011) has been employed in several practical applications involving machine learning regression problems. For example, in a problem concerning driving assistant systems (Armand et al., 2013), where the velocity profile that the driver follows is modeled as the vehicle decelerates towards a stop intersection. Another example application can be found in the context of Bayesian optimization (Nogueira et al., 2016; Oliveira et al., 2017), where a GP models an objective function that is being optimized. In this scenario, the input space is contaminated with i.i.d Gaussian noise. Two real applications are considered: safe robot grasping and safe navigation under localization uncertainty (Nogueira et al., 2016; Oliveira et al., 2017).
Another approach that considers input noise in the context of regression problems in arbitrary models is that of Bócsi and Csató (2013). This approach corrects the bias caused by the integration of the noise. The correction is proportional to the Hessian of the learned model and to the variance of the input noise. The advantage of the method is that it works for arbitrary regression models and the disadvantage is that it does not improve prediction for high-dimensional problems, where the data are implicitly scarce, and the estimated Hessian is considerably flattened.
Input dependent noise has also been taken into account in the context of binary classification with GPs by Hernández-lobato et al. (2014). In particular, these authors describe the use of an extra GP to model the variance of additive Gaussian noise around a latent function that is used for binary classification. This latent function is modeled again using a GP. Importantly, in this work the level of noise is expected to depend on privileged information. These are extra input attributes that are only available at training time, but not at test time. The goal is to exploit that privileged information to obtain a better classifier during the training phase. Approximate inference is done in this case using expectation propagation instead of variational inference (Minka, 2001). The experiments carried out show that privileged information is indeed useful to obtain better classifiers based on GPs.
To tackle input noise in the context multi-class classification, when one does does not rely on the use of GPs, some decomposition strategies can be used. Specifically, it is possible to decompose the problem into several binary classification subproblems, reducing the complexity and, hence, dividing the effects caused by the noise into each of the subproblems (Sáez et al., 2014). There exist several of these decomposition strategies, being the one-vs-one scheme a method that can be applied for well known binary classification algorithms. The results obtained by these authors show that the one-vs-one decomposition leads to better performances and more robust classifiers than other decompositions. A problem of these decompositions, however, is that they can lead to ambiguous regions in which it is not clear what class label should be predicted (Bishop, 2006). Furthermore, they do not learn an underlying true multi-class classifier and rely on binary classifiers to solve the multi-class problem, which is expected to be sub-optimal.
. The first work can be understood as a pre-processing step in which robust features are generated and used for training with the goal of reducing the effect of background noise in a sound recognition system. These robust features are, however, specific of the application domain conspired. Namely, sounds recognition. They are not expected to be useful in other classification problems. The second work focuses exclusively on linear regression models and hence cannot address multi-class classification problems, as the ones considered in our work.
In this section we carry out several experiments to evaluate the performance of the proposed method for multi-class GP classification with input noise. More precisely, we compare the performance of a standard multi-class Gaussian process that does not consider noise in the inputs (MGP) and the three proposed methods. Namely, the approach described in Section 3.1 (NIMGP), the variant described in Section 3.2 where the parameters of the Gaussian posterior approximation are computed using a neural network () and the method proposed in Section 3.3 (), which is based on the work of Mchutchon and Rasmussen (2011), and uses a first order approximation to account for input noise. The experiments considered include both in synthetic and real data. All the experiments carried out (except those related to the MNIST dataset) involve 100 repetitions and we report average results. These are detailed in the following sections.
All the methods described have been implemented in Tensorflow (Abadi et al., 2015). The source code to reproduce all the experiments carried out is available online at https://github.com/cvillacampa/GPInputNoise. In these experiments, for each GP, we have employed a squared exponential covariance function with automatic relevance determination (Rasmussen and Williams, 2006). All hyper-parameters, including the GP amplitude parameter, the length-scales and the level of additive Gaussian noise have been tuned by maximizing the ELBO. The class noise level has been set equal to , and kept fixed during training as in (Hensman et al., 2015c). Unless indicated differently, we have set the number of inducing points for the sparse gaussian process to the minimum of and of the total number of points. For the optimization of the ELBO we have used the ADAM optimizer with learning rate equal to
, the number of epochs has been set toand the mini-batch size to (Kingma and Ba, 2015). This number of epochs seems to guarantee the convergence of the optimization process. All other ADAM parameters have been set equal to their default value. In the neural network has is set to .
5.1 Illustrative Toy Problem
Before performing a fairly complete study on more realistic examples, we show here the results of the three proposed methods on a one-dimensional synthetic dataset with three classes. This dataset is simple enough so that in can be analyzed in detail and the optimal predictive distribution provided by the Bayes classifier can be computed in closed-form. The dataset is generated by sampling the latent functions from the GP prior using specific hyper-parameter values and then applying the labeling rule in (1). The input locations have been generated randomly by drawing from a uniform distribution in the interval. Then, we add a Gaussian noise to each observation to generate
, with standard deviationfor . We consider training instances and the number of inducing points . The mini-batch size is set to points. Since in this experiment the variance of the input noise is known beforehand we do not infer its value from the observed data and rather specify its actual value in each method. The number of test points is . In the neural network is set to have 2 hidden layers with units each.
We have trained each method on this dataset and compared the resulting predictive distribution with that of the optimal Bayes classifier, which we can compute in closed form since we know the generating process of the labels. Figure 2 shows, for each method, as a function of the input value , the predicted distribution for each of the three classes111The wiggles in the prediction probability for some of the classes for the NIMGP and are produced by the Monte Carlo approximation of the predictive distribution. Here we use a Monte Carlo average across 700 samples.. The observed labels for each data point are shown at the top of each figure as small vertical bars in green, blue and red colors, depending on the class label. We observe that each method produces decision boundaries that agree with the optimal ones (i.e., all methods predict the class label that has the largest probability according to the optimal Bayes classifier). However, the predictive distributions produced differ from one method to another. More precisely, the first method, i.e., MGP (top-left), which ignores the input noise, does produce a predictive distribution that is significantly different from the optimal one, especially in regions of the input space that are close to the decision boundaries. This method produces a predictive distribution that is too confident. The closest predictive distribution to the optimal one is obtained by NIMGP (top-right), followed by (bottom-left). Finally, (bottom-right) seems to improve the results of MGP, but is far from NIMGP.
Figure 2 shows very clearly the advantage of including the input noise when estimating the predictive distribution for each class label , given a new test point . Indeed, the proposed models reproduce more closely the optimal predictive distribution of the Bayes classifier. By contrast, the model that ignores the input noise, i.e., MGP fails to produce an accurate predictive distribution. Therefore, we expect to obtained better results for the proposed methods in terms of the log-likelihood of the test labels.
Table 1 shows the prediction error of each method and the corresponding negative test log-likelihood (NLL). Standard deviations are estimated using the bootstrap. Importantly, NLL can be understood as a measure of the quality of the predictive distribution. The smaller the better. We observe that while the prediction error of each method is similar (since the decision boundaries produced are similar) the NLL of the proposed methods is significantly better than the one provided by MGP, the method that ignores input noise. These results highlight the importance of accurately modeling the input noise to obtain better predictive distributions, which can play a critical role when one is interested in the confidence on the decisions to be made in terms of the classifier output.
5.2 Synthetic Experiments
Next, we compare the methods on synthetic two-dimensional classification problems with three classes. As in the case of the one-dimensional dataset described above, these problems are generated by sampling the latent functions from a GP prior with the squared exponential function and then applying the labeling rule in (1). The GP hyper-parameters employed are , , . The input vectors are chosen uniformly in the box . Then, we add three different levels of random noise to each observation , i.e., . The interest of these experiments is to evaluate the proposed methods in a controlled setting in which the expected optimal model to explain the observed data is a multi-class GP classifier and we can control the level of input noise in the data. In the next section we carry out experiments with real datasets. The number of training and test instances, inducing points, mini-batch size and parameters of the ADAM optimizer are the same as in the previous experiment. Again, since the level of injected noise is known in these experiments, we directly codify this information in each method. Figure 3 shows a sample dataset before and after the noise injection.
The average results obtained on the 100 datasets, for each method, are displayed in Tables 2 and 3. We observe that in terms of the negative test log-likelihood, all the proposed methods improve over MGP, i.e., the standard GP multi-class classifier that ignores input noise. Among the proposed methods, the best performing one is NIMGP, closely followed by . Therefore, these experiments highlight the benefits of using a neural network to compute the parameters of the posterior approximation . Specifically, there is no performance degradation. The method that is based on the first order approximation, i.e., , also improves over MGP, but the differences are smaller. In terms of the prediction error the differences are much smaller. However, in spite of this, the proposed methods improve the results of MGP. Among them, again NIMGP is the best method, except when . In that case, performs best, but the differences are very small. Summing up, the results obtained agree with the ones obtained in the one-dimensional problem. Namely, the proposed approaches significantly improve the quality of the predictive distribution in terms of the neg. test log-likelihood. The prediction error is, however, similar.
5.3 Experiments on Datasets Extracted from the UCI Repository
Another set of experiments evaluates the proposed methods on different multi-class datasets extracted from the UCI repository (Dua and Graff, 2017). Table 4 displays the characteristics of the datasets considered. For each of these datasets, we consider splits into train and test, containing 90% and 10% of the data respectively. Unlike in the previous experiments, in these datasets, a GP multi-class classifier need not be optimal. Furthermore, the input attributes may already be contaminated with additive noise. To assess the benefits of considering such noise during the learning process, we have considered four different setups. In each one, we inject Gaussian noise in the observed attributes with different variances. Namely, , , and . This will allow to evaluate the different methods in a setting in which there may be or may be not input noise due to the particular characteristics of the problem addressed (i.e. when the variance of the injected noise is equal to ), and also for increasing levels of input noise (i.e., when the variance of the injected noise is equal to , and ). This noise injection process is done after a standardization step in which the observed attributes are normalized to have zero mean and unit variance in the training set. All methods are trained for epochs using a mini-batch size of . In these experiments, in each of the proposed methods, the level of noise is learned during the training process by maximizing the ELBO. The reason for this is that the actual level of noise in the input attributes need not be equal to the injected level of noise.
Table 5 shows the average results obtained for each method in terms of the negative test log-likelihood, for each level of noise considered. We also report the mean rank for each method across datasets and splits. If a method is always the best one, it will receive a mean rank equal to 1. Conversely, if a method is always worst, it will receive a mean rank equal to 4. Therefore, in general lower is better. We can see that on average, the proposed methods improve over MGP and the method that works the best (according to the mean rank) is , even for the case where we do not introduce noise in the inputs. This suggests that these datasets have already some noise in the inputs. Also, as we increase the noise level the mean rank for MGP and NIMGP is worsen and and both improve. The fact that and give better results than NIMGP as we increase the level of noise indicates that NIMGP may be over-fitting the training data and that the use of the neural network and the first order approximation may act as a regularizer that alleviates this (Shu et al., 2018).
Table 6 shows the average results obtained for each method in terms of the prediction error. In this case, we do not observe big differences among the different methods. Moreover, the methods that take into account the noise in the inputs do not improve the prediction error as we increase the noise level. These small differences in terms of the error can be due to each method having similar decision boundaries, even when we obtain better predictive distributions in terms of the test log-likelihood, as illustrated in Section 5.1.
Noise = 0.0
Noise = 0.1
Noise = 0.25
Noise = 0.5
Noise = 0.0
Noise = 0.1
Noise = 0.25
Noise = 0.5
5.4 Experiments on the MNIST Dataset
In this section we consider a dataset in which sparse GPs are needed in order to train a multi-class classifier based on GPs. Namely, the MNIST dataset (LeCun et al., 1998). This dataset has 10 different class labels and training instances lying in a dimensional space. The test set has