1 Introduction
Gaussian process (GP) models are flexible and powerful probabilistic models used to solve classification problems in many areas of application (Rasmussen and Williams, 2006). In the Bayesian GP setup, the latent function values and the hyperparameters involved in modeling are integrated out under chosen priors. However, the required integrals are often analytically intractable (due to various choices of likelihoods and priors) and closed-form expressions are not available. Two important problems in this context are finding good approximations for integrating over the latent function variables and over the hyperparameters. Two approaches have been reported in the literature. In the first approach, both the latent function variables and the hyperparameters are integrated out under some approximation. Williams and Barber (1998) used the Laplace approximation to integrate over the latent function variables and Hybrid Monte Carlo (HMC) to integrate over the hyperparameters. Neal (1998) used Gibbs sampling to integrate over the latent function variables and HMC to integrate over the hyperparameters; this method is more accurate, but computationally expensive. In the second approach, only the latent function variables are integrated out and the hyperparameters are optimized over some well-defined objective function. This latter problem of choosing the hyperparameters that define the model is essentially the Gaussian process model selection problem, and it is the focus of this paper.
There are two commonly used approaches to address this model selection problem: marginal likelihood (or evidence) maximization and minimization of the LOOCV based average negative logarithmic predictive probability (NLP). Both approaches are available for GP regression model selection (Rasmussen and Williams, 2006; Sundararajan and Keerthi, 2001). For GP classifier model selection, Gibbs and MacKay (2000) used a variational approximation method to integrate over the latent function values and estimated the hyperparameters by maximizing the marginal likelihood (ML). The Laplace approximation and the Expectation Propagation (EP) approximation (Rasmussen and Williams, 2006; Seeger, 2003) are other methods that have been used to integrate over the latent function variables; the marginal likelihood is then optimized using gradient information obtained with either of these approximations (Rasmussen and Williams, 2006; Seeger, 2003). Kim and Ghahramani (2006) presented an approximate Expectation-Maximization (EM) algorithm to learn the hyperparameters: in the E-step they used EP to estimate the joint density of the latent function values, and in the M-step the hyperparameters were optimized by maximizing a variational lower bound on the marginal likelihood.
In this paper, we consider the approach of using LOOCV based predictive distributions to address the GP classifier model selection problem. In a related work, Opper and Winther (2000) used a LOO error estimate to choose rough hyperparameter values by scanning a range of values. In the EP framework (Minka, 2001), cavity distributions directly provide LOO error estimates during training and have been used to assess predictive performance and select automatic relevance determination (ARD) type hyperparameters (Qi et al., 2004). The LOOCV based predictive distributions obtained from the probabilistic least squares classifier have been used in the minimization of NLP for GP classifier model selection (Rasmussen and Williams, 2006).
In practice, while measures like the marginal likelihood and the average negative logarithmic predictive probability are very useful, other measures like the F-measure (van Rijsbergen, 1974) and the Weighted Error Rate (WER) are also important, for instance in applications like medical diagnostics and image understanding, where the number of positive examples is much smaller than the number of negative examples. Several works that use such measures for hyperparameter optimization exist in the non-GP literature. Hong et al. (2007) proposed a kernel classifier construction algorithm based on regularized orthogonal weighted least squares (ROWLS) estimation, with the LOO Area Under the ROC Curve (AUC) as the model selection criterion for handling imbalanced datasets. Jansche (2005) proposed training a probabilistic classifier based on a logistic regression model by optimizing an expected F-measure. There, an approximation was made so that the F-measure is smoothed and becomes a smooth function of the model weights.
Seeger (2007) proposed a general framework for learning in probabilistic kernel classification models with a large or structured set of classes; he optimized the kernel parameters by minimizing the NLP over k folds. Keerthi et al. (2006) considered the task of tuning hyperparameters in SVM models by minimizing smooth performance validation functions like a smoothed k-fold CV error. The last three works did not use LOOCV. This paper is aimed at addressing two issues. First, the proposed method differs from the work of Opper and Winther (2000) and Qi et al. (2004) in that we optimize the hyperparameters directly in the continuous space, so any standard nonlinear optimization method can be used. It also differs from the LOOCV based probabilistic LS method (Rasmussen and Williams, 2006) in that we use the more accurate EP approximation instead of the LS approximation. Second, criteria such as the F-measure and WER, which are needed for tackling imbalanced problems, have not previously been considered in GP classifier design.
We define smoothed LOOCV based measures using predictive distributions as functions of the GP classifier model hyperparameters. Thus, the objective functions can be optimized using standard nonlinear optimization techniques. We investigate the use of LOOCV predictive distributions obtained from the expectation propagation approximation. The proposed algorithm can also be used with the Laplace approximation; however, Kuss and Rasmussen (2005) showed for binary classification problems that the EP approximation is better than the Laplace approximation. Therefore, we restrict our attention to the EP approximation here.
We conduct experiments on two criteria: the standard average negative logarithm of predictive probability (NLP) and a smoothed version of the F-measure. On the NLP criterion we compare our method (EPCV(NLP)) against the LOOCV based probabilistic least squares (LS) classifier (Rasmussen and Williams, 2006) and the standard GP classifier performing ML maximization under the EP approximation. We refer to the latter two methods as LSCV(NLP) and EP(ML) respectively; the abbreviations in parentheses indicate the type of objective function used. The experimental results on several real world benchmark datasets show that the proposed method is better than the LSCV(NLP) method and is quite competitive with the EP(ML) method. On the F-measure criterion we compare the EPCV(NLP) method with the EPCV(FM) method; in the latter we optimize the F-measure instead of the NLP measure. We also compare with a two-step method (suggested by an anonymous reviewer) where, in the first step, we optimize the hyperparameters using the NLP measure and, in the second step, we optimize only the bias hyperparameter using the F-measure. Experimental results demonstrate that this two-step method is also inferior to the EPCV(FM) method.
The paper is organized as follows. We give a brief introduction to Gaussian process classification and the ML optimization criterion in Section 2. In Section 3 we give a general set of smooth LOOCV based optimization criteria and illustrate their use with LOOCV predictive distributions. Optimization aspects and the specific algorithm are given in Section 4. In Section 5 we discuss related work on LOOCV based GPC model selection. In Section 6 we present experimental results, and we conclude the paper in Section 7.
2 Gaussian Process Classification
In binary classification problems, we are given a training data set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ composed of input-target pairs, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. The true function value at $\mathbf{x}_i$ is represented as a latent variable $f_i$. The goal is to compute the predictive distribution of the class label $y_*$ at a test location $\mathbf{x}_*$. In standard GPs for classification (Rasmussen and Williams, 2006), the latent variables $f_i = f(\mathbf{x}_i)$ are modelled as random variables in a zero mean GP indexed by $\mathbf{x}$. The prior distribution of $\mathbf{f} = (f_1, \ldots, f_n)^T$ is a zero mean multivariate joint Gaussian, denoted as $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}; \mathbf{0}, \mathbf{K})$, where $\mathbf{K}$ is the $n \times n$ covariance matrix whose $(i,j)$th element is $k(\mathbf{x}_i, \mathbf{x}_j)$. One of the most commonly used covariance functions is the squared exponential covariance function given by:
$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\Big(-\tfrac{1}{2} \sum_{m=1}^{d} \frac{(x_{i,m} - x_{j,m})^2}{\ell_m^2}\Big).$$
Here, $\sigma_f^2$ represents the signal variance and the $\ell_m$'s represent width parameters across the different input dimensions. These parameters are also known as automatic relevance determination (ARD) hyperparameters, and we call this covariance function the ARD Gaussian kernel function. Next, it is assumed that the probability over class labels as a function of $\mathbf{x}$ depends only on the latent function value $f(\mathbf{x})$. For the binary classification problem, given the value of $f_i$ the probability of the class label $y_i$ is independent of all other quantities: $p(y_i \,|\, f_i, \mathcal{D}) = p(y_i \,|\, f_i)$, where $\mathcal{D}$ is the dataset. The likelihood $p(y_i \,|\, f_i)$ can be modelled in several forms, such as a sigmoidal function or the cumulative normal $\Phi(y_i f_i)$, where $\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(t; 0, 1)\, dt$. Assuming that the examples are i.i.d., we have $p(\mathbf{y} \,|\, \mathbf{f}) = \prod_{i=1}^{n} p(y_i \,|\, f_i)$. The prior and the likelihood, along with the hyperparameters $\boldsymbol{\theta}$ (which collect the kernel parameters and any hyperparameters that characterize the likelihood), define the GP model. With these modelling assumptions, we can write the inference probability of $y_*$ as:
$$p(y_* \,|\, \mathbf{x}_*, \mathcal{D}, \boldsymbol{\theta}) = \int p(y_* \,|\, f_*)\, p(f_* \,|\, \mathbf{x}_*, \mathcal{D}, \boldsymbol{\theta})\, df_*. \qquad (1)$$
Here, the posterior predictive distribution of the latent function value $f_*$ is given by $p(f_* \,|\, \mathbf{x}_*, \mathcal{D}, \boldsymbol{\theta}) = \int p(f_* \,|\, \mathbf{f}, \mathbf{x}_*, \boldsymbol{\theta})\, p(\mathbf{f} \,|\, \mathcal{D}, \boldsymbol{\theta})\, d\mathbf{f}$, where $p(\mathbf{f} \,|\, \mathcal{D}, \boldsymbol{\theta})$ is the posterior over the latent variables. In a fully Bayesian solution, the class probability at the test point would be obtained by integrating over the hyperparameters weighted by their posterior probability $p(\boldsymbol{\theta} \,|\, \mathcal{D})$. In general there is no closed form expression available for this integral and it is expensive to compute. Therefore, instead of integrating over the hyperparameters, a single set of their values is usually estimated from the dataset by optimizing various criteria, as mentioned earlier, and then used in (1).
2.1 Marginal Likelihood Maximization
Marginal likelihood or evidence maximization (Rasmussen and Williams, 2006) is commonly used to estimate the hyperparameters during model selection. The marginal likelihood is given by:
$$p(\mathbf{y} \,|\, X, \boldsymbol{\theta}) = \int p(\mathbf{y} \,|\, \mathbf{f})\, p(\mathbf{f} \,|\, X, \boldsymbol{\theta})\, d\mathbf{f}. \qquad (2)$$
This integral cannot be calculated analytically except in special cases such as GP regression with Gaussian noise. Therefore, certain approximations are needed to compute these quantities; the Laplace approximation and the EP approximation are two popular methods used for this purpose. To gain more insight into the quality of the Laplace and EP approximations, Kuss and Rasmussen (2005) carried out comprehensive comparisons of these approximations against the (more exact) Markov Chain Monte Carlo (MCMC) sampling approach in terms of predictive performance and marginal likelihood estimates. They found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. Hence, in our study we restrict ourselves to the more accurate EP approximation, described in the next subsection. With the EP approximation, the hyperparameters are learnt by optimizing the marginal likelihood with gradient information (Rasmussen and Williams, 2006; Seeger, 2003) using standard nonlinear optimization techniques.
2.2 Expectation Propagation
The EP algorithm is an iterative algorithm for approximate Bayesian inference (Opper and Winther, 2000; Minka, 2001). It has been applied to GP classification (Rasmussen and Williams, 2006; Seeger, 2003). EP finds a Gaussian approximation $q(\mathbf{f} \,|\, \mathcal{D}) = \mathcal{N}(\mathbf{f}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ to the posterior $p(\mathbf{f} \,|\, \mathcal{D}, \boldsymbol{\theta})$ by moment matching of the approximate marginal distributions and the posterior marginals. The mean and covariance of the approximate Gaussian are given by:
$$\boldsymbol{\Sigma} = \big(\mathbf{K}^{-1} + \tilde{\boldsymbol{\Sigma}}^{-1}\big)^{-1}, \qquad \boldsymbol{\mu} = \boldsymbol{\Sigma}\, \tilde{\boldsymbol{\Sigma}}^{-1} \tilde{\boldsymbol{\mu}}, \qquad (3)$$
where $\tilde{\boldsymbol{\mu}} = (\tilde{\mu}_1, \ldots, \tilde{\mu}_n)^T$ and $\tilde{\boldsymbol{\Sigma}} = \mathrm{diag}(\tilde{\sigma}_1^2, \ldots, \tilde{\sigma}_n^2)$ are called the site function parameters. As per the approximation, the posterior is written in terms of the site functions $t_i(f_i) = \tilde{Z}_i\, \mathcal{N}(f_i; \tilde{\mu}_i, \tilde{\sigma}_i^2)$ and the prior as $q(\mathbf{f} \,|\, \mathcal{D}) \propto p(\mathbf{f}) \prod_{i=1}^{n} t_i(f_i)$.
The EP algorithm iteratively visits each site function in turn, and adjusts the site parameters to match the moments of an approximation to the posterior marginals. This process requires replacing the intractable exact cavity distribution with a tractable approximation based on the site functions, given by $q_{\backslash i}(f_i) \propto q(f_i \,|\, \mathcal{D}) / t_i(f_i)$, where $q(f_i \,|\, \mathcal{D}) = \mathcal{N}(f_i; \mu_i, \sigma_i^2)$ is the approximate posterior marginal and $\sigma_i^2 = \Sigma_{ii}$, with $\Sigma_{ii}$ denoting the $i$th diagonal entry of the matrix $\boldsymbol{\Sigma}$. Using Gaussian identities, the mean and variance of the cavity distribution $q_{\backslash i}(f_i) = \mathcal{N}(f_i; \mu_{\backslash i}, \sigma_{\backslash i}^2)$ are related to the site parameters as:
$$\sigma_{\backslash i}^2 = \big(\sigma_i^{-2} - \tilde{\sigma}_i^{-2}\big)^{-1}, \qquad \mu_{\backslash i} = \sigma_{\backslash i}^2 \big(\sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i\big). \qquad (4)$$
Then, EP adjusts the site parameters $\tilde{Z}_i$, $\tilde{\mu}_i$ and $\tilde{\sigma}_i^2$ such that the approximate posterior marginal using the exact likelihood approximates the posterior marginal based on the site function well; that is, $q_{\backslash i}(f_i)\, p(y_i \,|\, f_i) \simeq q_{\backslash i}(f_i)\, t_i(f_i)$. This is done by matching the zeroth, first and second moments on both sides. Thus, the EP algorithm iteratively updates the site parameters until convergence. Though there is no convergence proof, in practice the EP algorithm converges in most cases. See Rasmussen and Williams (2006) for more details.
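The EP iteration just described can be sketched compactly for the probit likelihood. The following is a minimal illustration (not the authors' implementation): it recomputes the full posterior once per sweep instead of using the rank-one updates of an efficient implementation, and the function and variable names are our own.

```python
import numpy as np
from math import erf, sqrt, exp, pi

def Phi(z):   # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def npdf(z):  # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def ep_binary_gpc(K, y, n_sweeps=50):
    """Sketch of EP for binary GP classification with probit likelihood.

    K: (n, n) prior covariance; y: labels in {-1, +1}.
    Returns site parameters in natural form (tau_site = 1/sigma_tilde^2,
    nu_site = mu_tilde/sigma_tilde^2) and the posterior (mu, Sigma) of eq. (3).
    """
    n = len(y)
    tau_site = np.zeros(n)
    nu_site = np.zeros(n)
    Sigma, mu = K.copy(), np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):
            # cavity parameters, eq. (4), in natural (precision) form
            tau_cav = 1.0 / Sigma[i, i] - tau_site[i]
            nu_cav = mu[i] / Sigma[i, i] - nu_site[i]
            m, v = nu_cav / tau_cav, 1.0 / tau_cav
            # moments of cavity times probit likelihood
            z = y[i] * m / sqrt(1.0 + v)
            r = npdf(z) / Phi(z)
            m_hat = m + y[i] * v * r / sqrt(1.0 + v)
            v_hat = v - v * v * r * (z + r) / (1.0 + v)
            # moment-match site i (clamp precision to stay non-negative)
            tau_site[i] = max(1.0 / v_hat - tau_cav, 1e-12)
            nu_site[i] = m_hat / v_hat - nu_cav
        # refresh the posterior of eq. (3); naive O(n^3), fine for a sketch
        Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(tau_site))
        mu = Sigma @ nu_site
    return tau_site, nu_site, mu, Sigma
```

In practice one would add jitter to $\mathbf{K}$ before inversion and use Cholesky-based updates; the sketch only shows the cavity/moment-matching structure of the algorithm.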
Next, up to a constant, the log marginal likelihood under the EP approximation (Rasmussen and Williams, 2006) is given by:
$$\log q(\mathbf{y} \,|\, X, \boldsymbol{\theta}) = -\tfrac{1}{2} \log \big|\mathbf{K} + \tilde{\boldsymbol{\Sigma}}\big| - \tfrac{1}{2}\, \tilde{\boldsymbol{\mu}}^T \big(\mathbf{K} + \tilde{\boldsymbol{\Sigma}}\big)^{-1} \tilde{\boldsymbol{\mu}} + \sum_{i=1}^{n} \log \Phi\Big(\frac{y_i \mu_{\backslash i}}{\sqrt{1 + \sigma_{\backslash i}^2}}\Big) + \tfrac{1}{2} \sum_{i=1}^{n} \log\big(\sigma_{\backslash i}^2 + \tilde{\sigma}_i^2\big) + \sum_{i=1}^{n} \frac{(\mu_{\backslash i} - \tilde{\mu}_i)^2}{2\,(\sigma_{\backslash i}^2 + \tilde{\sigma}_i^2)}, \qquad (5)$$
where $\mu_{\backslash i}$ and $\sigma_{\backslash i}^2$ are the cavity parameters of (4). The hyperparameters are optimized using gradient expressions with standard conjugate gradient or quasi-Newton type nonlinear optimization techniques.
3 LeaveOneOut Cross Validation based Optimization Criteria
In this section, we give definitions of various LOOCV based optimization criteria. In Section 4 we give details on how these measures can be optimized using standard nonlinear optimization techniques. The LOO predictive distributions $p(y_i \,|\, \mathbf{x}_i, \mathcal{D}_{\backslash i}, \boldsymbol{\theta})$, $i = 1, \ldots, n$, play a crucial role here, where $\mathcal{D}_{\backslash i}$ represents the dataset without the $i$th example. Their exact computation is expensive; in Section 4 we will also discuss how to approximate them efficiently.
3.1 NLP Measure
The average negative logarithmic predictive probability (NLP) measure is defined from the LOO predictive distributions as:
$$\mathrm{NLP}(\boldsymbol{\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \,|\, \mathbf{x}_i, \mathcal{D}_{\backslash i}, \boldsymbol{\theta}). \qquad (6)$$
The hyperparameters can then be chosen by minimizing this measure.
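Given LOO predictive means and variances (for example, the EP cavity parameters), the NLP measure can be evaluated as in the following short sketch; the probit form of the predictive probability anticipates eq. (13), and the function names are ours:

```python
import numpy as np
from math import erf, sqrt

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def loo_nlp(y, mu_loo, s2_loo, bias=0.0):
    """Average negative log LOO predictive probability, eq. (6), with
    probit predictive probabilities p(y_i | x_i, D_{\i}) as in eq. (13)."""
    p = np.array([Phi(yi * (m + bias) / sqrt(1.0 + v))
                  for yi, m, v in zip(y, mu_loo, s2_loo)])
    return -np.mean(np.log(p))
```

Confident LOO predictions with the correct sign drive the measure toward zero, while confident wrong predictions are penalized heavily, which is what makes NLP a useful model selection criterion.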
3.2 Smoothed LOO Measures
While measures such as the marginal likelihood (5) and the NLP in (6) are useful in normal situations, other measures like the F-measure and WER are important, for example, when dealing with imbalanced datasets. Let us now define these measures.
Consider the binary classification problem with class labels +1 and -1. Assume that there are $n_+$ positive examples and $n_-$ negative examples. In general, the performance of a classifier may be evaluated using the counts of data samples defined via the confusion matrix given in Table 1. Let $n_+ = a + b$ and $n_- = c + d$. The true positive rate (TP) is the proportion of positive data samples that were correctly classified (that is, true positives) and the false positive rate (FP) is the proportion of negative data samples that were incorrectly classified (that is, false positives) (Hong et al., 2007). These rates are given by: $\mathrm{TP} = \frac{a}{a+b}$ and $\mathrm{FP} = \frac{c}{c+d}$. The misclassification rate is given by: $\mathrm{MCR} = \frac{b+c}{n_+ + n_-}$. Note that the true positive rate is also known as Recall (R). Precision is another important quantity, defined as: $\mathrm{P} = \frac{a}{a+c}$. Now let us consider the imbalanced data case and assume that $n_+ \ll n_-$. In this case, if MCR is minimized then the classifier will be biased toward the negative class due to its effort to minimize the false positives (that is, $c$) more strongly than the false negatives (that is, $b$). In the worst case almost all the positive examples will be wrongly classified, that is, $a \approx 0$. This results in both $\mathrm{P} \rightarrow 0$ and $\mathrm{R} \rightarrow 0$. Thus MCR is not a good measure to use when the dataset is imbalanced. This problem can be addressed by optimizing other measures, which we discuss next.
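In code, these rates follow directly from the confusion-matrix counts; the small helper below is our own illustration:

```python
def rates(a, b, c, d):
    """TP rate (recall), FP rate, misclassification rate and precision from
    confusion-matrix counts: a = true positives, b = false negatives,
    c = false positives, d = true negatives (Table 1)."""
    tp = a / (a + b)                   # true positive rate = recall R
    fp = c / (c + d)                   # false positive rate
    mcr = (b + c) / (a + b + c + d)    # misclassification rate
    precision = a / (a + c)
    return tp, fp, mcr, precision
```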
The F-measure is one such measure and is defined (van Rijsbergen, 1974) as:
$$F_\alpha = \frac{1}{\frac{\alpha}{\mathrm{P}} + \frac{1-\alpha}{\mathrm{R}}},$$
where $\alpha \in [0, 1]$ and P and R are the precision and recall defined above.

Table 1: Confusion matrix.

                    Positive (Predicted)   Negative (Predicted)
Positive (Actual)            a                      b
Negative (Actual)            c                      d

The F-measure has been used in various applications like document retrieval (van Rijsbergen, 1974) and text classification (Joachims, 2005). It is particularly preferable over MCR when the dataset is highly imbalanced (Joachims, 2005; Jansche, 2005).
Let us get into more detail on the functioning of the F-measure. When $\alpha = 1$, we get $F_\alpha = \mathrm{P}$; optimizing the F-measure then means we are interested only in maximizing the precision. On the other hand, when $\alpha = 0$, we get $F_\alpha = \mathrm{R}$; in this case we are interested only in maximizing the recall. The user can choose an appropriate value for $\alpha$ depending on how much importance he/she wants to give to precision and recall. Thus, the F-measure combines precision and recall into a single optimization criterion by taking their weighted harmonic mean. In the imbalanced data case mentioned above, MCR minimization can potentially result in $\mathrm{P}, \mathrm{R} \rightarrow 0$; by maximizing the F-measure we can prevent the classifier from being completely biased towards the negative class. Note that $F_\alpha$ can be rewritten in terms of $a$, $b$ and $c$ as:
$$F_\alpha = \frac{a}{a + \alpha c + (1 - \alpha) b}. \qquad (7)$$
In all our experiments we set $\alpha = \frac{1}{2}$. In this case, it becomes $F_{1/2} = \frac{2a}{2a + b + c}$ and can also be written as $F_{1/2} = \frac{1}{1 + \frac{b+c}{2a}}$. Then, maximizing $F_{1/2}$ is equivalent to maximizing $\frac{a}{b+c}$. Thus, we can maximize $F_{1/2}$ both by minimizing the error (that is, $b + c$) and by maximizing the true positives. The trade-off kicks in since maximizing the true positives tends to increase the false positives; thus maximizing the F-measure controls both the true positives and the error appropriately. In general, the F-measure summarizes a classifier's ability to identify the positive class and plays an important role in the evaluation of binary classifiers. As a criterion for optimizing hyperparameters, the F-measure can be computed on an evaluation or validation dataset. However, in practical situations involving small datasets (GP models are known to be particularly valuable for problems with small datasets), it is wasteful to employ a separate evaluation set. The LOOCV approach is useful in such situations, and we show next how the F-measure can be estimated with this approach.
Hong et al. (2007) estimated TP and FP as:
$$\widehat{\mathrm{TP}} = \frac{1}{n_+} \sum_{i:\, y_i = +1} \delta(\hat{y}_i = y_i), \qquad \widehat{\mathrm{FP}} = \frac{1}{n_-} \sum_{i:\, y_i = -1} \delta(\hat{y}_i \neq y_i).$$
Here $\hat{y}_i$ represents the predicted label for the $i$th sample, and $\delta(\cdot)$ is an indicator function that takes the value 1 when its argument holds and 0 otherwise. Hong et al. (2007) used these estimates to compute $\frac{1}{2}\big(1 + \widehat{\mathrm{TP}} - \widehat{\mathrm{FP}}\big)$ as an approximation of the AUC, and used this criterion to select a subset of basis vectors in a kernel classifier model construction procedure for imbalanced datasets. Note that this definition of AUC is applicable only for a hard classifier (a fixed, non-probabilistic classifier) with binary outputs; see Hong et al. (2007) for more details. In a strict sense such a definition of AUC is not suitable for a probabilistic classifier like the GP classifier, which provides continuous probabilistic output. However, we can make use of this approach of defining $\widehat{\mathrm{TP}}$ and $\widehat{\mathrm{FP}}$ to compute the quantities $a$, $b$ and $c$ that are needed to evaluate the F-measure in (7). There are two issues associated with these estimates. The first issue is that they are not smooth (in fact, not even continuous) functions of the hyperparameters; therefore they cannot be used directly in any approach that tunes the hyperparameters with gradient-based nonlinear optimization methods. Secondly, these estimates do not use the predictive probability values, which is particularly important when we want to take the variance into account. In non-GP contexts, Jansche (2005) and Keerthi et al. (2006) addressed the first issue by defining a smoothed F-measure or other validation functions in which the indicator function is replaced with a sigmoid function, making the optimization criterion a smooth function of the hyperparameters. However, they did not consider a LOO approach and used a validation set instead; Jansche (2005) considered maximum a posteriori probabilities and Keerthi et al. (2006) used sigmoidal approximations for SVM models. Here, we propose to combine LOO based estimation with smoothed versions of the quantities, denoted $\hat{a}$, $\hat{b}$, $\hat{c}$ and $\hat{d}$. We can set
$$\hat{a} = \sum_{i:\, y_i = +1} p(y_i = +1 \,|\, \mathbf{x}_i, \mathcal{D}_{\backslash i}, \boldsymbol{\theta}). \qquad (8)$$
Since $a + b = n_+$, we can write $\hat{b} = n_+ - \hat{a}$. With $\hat{m} = \sum_{i=1}^{n} p(y = +1 \,|\, \mathbf{x}_i, \mathcal{D}_{\backslash i}, \boldsymbol{\theta})$ denoting the (smoothed) number of examples predicted as positive, we can parameterize $\hat{c}$ as $\hat{c} = \hat{m} - \hat{a}$. This can be rewritten as:
$$\hat{c} = \sum_{i:\, y_i = -1} p(y = +1 \,|\, \mathbf{x}_i, \mathcal{D}_{\backslash i}, \boldsymbol{\theta}). \qquad (9)$$
Thus, the smoothed F-measure can be defined from (7) as:
$$\hat{F}_\alpha = \frac{\hat{a}}{\hat{a} + \alpha \hat{c} + (1 - \alpha) \hat{b}}. \qquad (10)$$
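The smoothed quantities and the resulting F-measure of eqs. (8)-(10) can be computed directly from the LOO predictive probabilities, as in the following sketch (the array conventions and function name are ours):

```python
import numpy as np

def smoothed_f_measure(y, p_pos, alpha=0.5):
    """Smoothed LOO F-measure, eqs. (8)-(10).
    y: labels in {-1, +1}; p_pos[i] = p(y = +1 | x_i, D_{\i}, theta)."""
    a_hat = p_pos[y == 1].sum()         # smoothed true positives, eq. (8)
    b_hat = (y == 1).sum() - a_hat      # since a + b = n_+
    c_hat = p_pos[y == -1].sum()        # smoothed false positives, eq. (9)
    return a_hat / (a_hat + alpha * c_hat + (1.0 - alpha) * b_hat)  # eq. (10)
```

Because the probabilities enter linearly, this criterion is a smooth function of the hyperparameters whenever the LOO predictive probabilities are, which is what allows gradient-based optimization.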
Note that $\hat{d}$ can be defined in a similar fashion as $\hat{d} = n_- - \hat{c}$. Using these quantities, other derived quantities like $\widehat{\mathrm{TP}} = \hat{a}/n_+$ and $\widehat{\mathrm{FP}} = \hat{c}/n_-$ can be defined as LOO based estimates. Then, smoothed LOO estimates of WER can be obtained as shown below.
The WER measure is another useful measure for imbalanced datasets. Using the quantities defined above, its smoothed version can be written as:
$$\mathrm{WER} = \frac{(1 - \mathrm{TP}) + \gamma\, \mathrm{FP}}{1 + \gamma}, \qquad (11)$$
where $\gamma$ is the ratio of the cost of misclassifications of the negative class to that of the positive class, and TP and FP are the smoothed LOO estimates defined above (for ease of notation we have omitted the hats on TP and FP). Thus, by choosing a value of $\gamma$ suitable for a given problem and optimizing over the hyperparameters, we can design classifiers without becoming biased toward one class. Following the work of Hong et al. (2007) one can also define
$$\mathrm{AUC} = \frac{1 + \mathrm{TP} - \mathrm{FP}}{2} \qquad (12)$$
and optimize it over the hyperparameters. As mentioned earlier, such a definition is not suitable for the GP classifier. Nevertheless, it is interesting to note that it has the desirable property of trading off high TP against low FP. Also, on comparing this definition of AUC with (11) we see that they are related, in the sense that maximizing the AUC is the same as minimizing the WER when $\gamma = 1$.
Overall we see that the LOOCV predictive distributions can be used to define various criteria that are smooth functions of hyperparameters resulting in smoothed LOOCV measures. Now given that the LOOCV predictive distributions are readily available from the EP algorithm, we can optimize the various smoothed LOOCV measures directly using standard nonlinear optimization techniques.
4 EPCV Algorithm for Choosing Hyperparameters
Various criteria such as (6), (10), (11) and (12) depend on the hyperparameters via the LOO predictive distributions $p(y_i \,|\, \mathbf{x}_i, \mathcal{D}_{\backslash i}, \boldsymbol{\theta})$. With the cumulative Gaussian likelihood, they can be written as:
$$p(y_i \,|\, \mathbf{x}_i, \mathcal{D}_{\backslash i}, \boldsymbol{\theta}) = \Phi\Big(\frac{y_i (\mu_{\backslash i} + b)}{\sqrt{1 + \sigma_{\backslash i}^2}}\Big), \qquad (13)$$
where $\mu_{\backslash i}$ and $\sigma_{\backslash i}^2$ are the LOO predictive mean and variance. Note that (13) is obtained from (1) with $\mathcal{D}$ replaced by $\mathcal{D}_{\backslash i}$. The hyperparameter $b$ is referred to as the bias parameter; it helps in shifting the decision boundary away from the probability value 0.5. In general, the bias hyperparameter is very useful (Rasmussen and Williams, 2006; Seeger, 2003) and can be optimized.
EP Approximation
To compute (13) we need the LOO mean $\mu_{\backslash i}$ and variance $\sigma_{\backslash i}^2$. With the EP approximation, they can be computed using (4). Full details of the gradient calculations needed for implementing the hyperparameter optimization are given in the appendix.
We take an expectation-maximization (EM) type approach for hyperparameter optimization. This is because gradient expressions involving the implicit derivatives (with the site parameters varying as functions of the hyperparameters) are not available, due to the iterative nature of the EP algorithm. This approach results in the following algorithm.
EPCV Algorithm:
1. Initialize the hyperparameters $\boldsymbol{\theta}$.
2. E-step: Given the hyperparameters, find the site parameters $\tilde{\mu}_i$ and $\tilde{\sigma}_i^2$ and the posterior $q(\mathbf{f} \,|\, \mathcal{D})$ using the EP algorithm.
3. M-step: Find the hyperparameters by optimizing any LOOCV based measure like (6), (10), (11) or (12) using any standard gradient based optimization technique. We carry out just one line search in this optimization process; during this line search, as the hyperparameters change, the LOO quantities and the objective function are recomputed. Note that throughout this M-step the site parameters are assumed fixed at the values obtained in step 2.
4. Repeat steps 2-3 until there is no significant change in the objective function value.
This algorithm worked well in our experiments. A similar EM approach was used by Kim and Ghahramani (2006) (which they called the EM-EP algorithm) in the optimization of a lower bound on the marginal likelihood.
Since the EPCV algorithm optimizes the smoothed F-measure, it is useful to study the behavior of the true F-measure as the optimization proceeds. We do this study on two of the datasets described in Table 6 of Section 6. The optimization algorithm was terminated when there was no significant change in the smoothed F-measure value. From Figure 1 we see that the smoothed F-measure increases monotonically in both cases. Also, as expected, the true F-measure exhibits the non-smooth behavior of a discrete function, and the values of the true and smoothed F-measures are not the same. The difference arises because the smoothed F-measure is based on probabilistic scores, which can take any value between 0.5 and 1 depending on the problem (even when correct classification occurs). The important point to observe is that, in general, there is an increasing trend in the true F-measure value as the optimization progresses. In the case of the Car3 dataset (left panel) we see this clearly. A similar trend is seen for the Yeast7 dataset (right panel), except for a small dip at the 10th iteration. Though such behavior sometimes happens in early iterations, we observed that a better true F-measure value is almost always obtained as the optimization progresses.
Computational and Storage Complexities
The computational complexity of the EPCV algorithm depends on the number of ARD kernel parameters $d$; see the appendix for more details. For a given problem with $d$ fixed, the complexity is $O(n^3)$. This complexity is the same as that of the EP(ML) method (see equation (5)) and the LSCV(NLP) method given in the next section. Also, in many practical problems a single global scaling hyperparameter for all the input features is sufficient. Finally, the storage complexity of all the methods is $O(n^2)$.
5 Other LOOCV based Methods
Having discussed our approach in detail, it is useful to recall other LOOCV based GPC model selection methods and discuss them in relation to it.
Opper and Winther (2000) derived a mean field algorithm for binary classification with GPs based on the TAP approach originally proposed in the statistical physics of disordered systems. They showed that this approach yields an approximate LOO estimator for the generalization error. This estimate is equivalent to the LOOCV error estimate obtained from EP (Minka, 2001). Instead of optimizing over the hyperparameters, Opper and Winther (2000) used the LOOCV error count (using indicator functions) to choose rough hyperparameter values by scanning a range of values.
Qi et al. (2004) used the LOOCV error estimate obtained from EP to determine ARD hyperparameters. They worked with a Gaussian process classifier in which each input feature is associated with a weight parameter $w_j$ with the prior $p(w_j) = \mathcal{N}(0, \alpha_j^{-1})$. The hyperparameters $\alpha_j$ were obtained by maximizing the evidence using a fast sequential update based on the work of Faul and Tipping (2002). The outcome of this optimization is that many $\alpha_j$'s go to infinity, so that only a few nonzero weights remain. Even though the ARD hyperparameters were optimized by maximizing the evidence, to prevent overfitting Qi et al. (2004) proposed to select the final model as the one that gives the minimum LOOCV error count or probability. As the LOOCV error count is discrete, they chose the model with maximum evidence when there is a tie in the count. Compared to this approach, we work with the GP classifier model (without the weight parameters) detailed in Section 2 and optimize the hyperparameters (including ARD) directly with various LOOCV based measures (including the F-measure, WER, etc.) using gradient information.
In this context, the LOOCV based probabilistic LS classifier (Rasmussen and Williams, 2006) is a more direct LOOCV based GPC model selection approach. For the sake of completeness we give some details here, and later compare our algorithms with this approach in our experiments. This approach treats classification as a regression problem. Note that the probabilistic interpretation of the LS criterion implies a Gaussian noise model, even though the output can take only the values $+1$ or $-1$, which is slightly odd. However, this approach is simple to implement, and a probabilistic interpretation is obtained by passing the predictions through a sigmoid.
Specifically, the LOO mean $\mu_{\backslash i}$ and variance $\sigma_{\backslash i}^2$ are obtained from the LOOCV formulation of GP regression, and the predictive distributions are obtained via (13). Here, $\mu_{\backslash i}$ and $\sigma_{\backslash i}^2$ are given by:
$$\mu_{\backslash i} = y_i - \frac{[\mathbf{G}^{-1} \mathbf{y}]_i}{[\mathbf{G}^{-1}]_{ii}}, \qquad \sigma_{\backslash i}^2 = \frac{1}{[\mathbf{G}^{-1}]_{ii}}, \qquad (14)$$
where $\mathbf{G} = \mathbf{K} + \sigma_n^2 \mathbf{I}$ and $\sigma_n^2$ is a noise variance parameter. Here $\sigma_n^2$ can either be set to a small positive value or treated as a regularization hyperparameter with a small upper bound constraint; this is useful when $\mathbf{K}$ can become ill-conditioned during optimization. Finally, the hyperparameters are optimized using (6). We call this method LSCV(NLP).
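The closed-form LOO quantities of (14) can be computed from a single inverse (or factorization) of the regularized kernel matrix. A minimal sketch, assuming the form $\mathbf{G} = \mathbf{K} + \sigma_n^2 \mathbf{I}$ as above (the function name is ours):

```python
import numpy as np

def loo_ls(K, y, noise=1e-3):
    """LOO predictive mean and variance for GP regression / the probabilistic
    LS classifier, eq. (14): mu_i = y_i - [G^{-1} y]_i / [G^{-1}]_{ii} and
    s2_i = 1 / [G^{-1}]_{ii}, with G = K + noise * I (assumed form)."""
    G_inv = np.linalg.inv(K + noise * np.eye(len(y)))
    alpha = G_inv @ y
    d = np.diag(G_inv)
    return y - alpha / d, 1.0 / d
```

These identities agree with brute-force leave-one-out under the joint Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{G})$, which is what makes the criterion cheap enough to sit inside a nonlinear optimizer.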
Table 2: Summary of the methods.

Method       Description
EP(ML)       Marginal likelihood maximization within the EP approximation. That is, optimize (5) over $\boldsymbol{\theta}$.
EPCV(NLP)    Negative logarithmic predictive loss minimization within the EP approximation. That is, optimize (6) over $\boldsymbol{\theta}$. For ease of notation, this method is referred to as NLP in Table 7.
LSCV(NLP)    Negative logarithmic predictive loss minimization within the least squares approximation, as described in Section 5. Optimize (6) over $\boldsymbol{\theta}$.
EPCV(FM)     F-measure maximization within the EP approximation. That is, optimize (10) over $\boldsymbol{\theta}$. For ease of notation, this method is referred to as FM in Table 7.
NLPFM(BIAS)  In the first step, the hyperparameters are optimized using the EPCV(NLP) method. In the second step, only the bias parameter ($b$) is optimized using the EPCV(FM) method. We also refer to this method as the two-step method.
Table 3: Summary of the benchmark datasets: number of training examples, input dimension, number of test examples and number of partitions.

Dataset        # train   dim   # test   # partitions
Banana           400      2     4900        100
Breast-cancer    200      9       77        100
Diabetes         468      8      300        100
German           700     20      300        100
Heart            170     13      100        100
Image           1300     18     1010         20
Ringnorm         400     20     7000        100
Splice          1000     60     2175         20
Thyroid          140      5       75        100
Titanic          150      3     2051        100
Twonorm          400     20     7000        100
Waveform         400     21     4600        100
Table 4: NLP performance (mean ± standard deviation over the partitions).

Dataset / Method   EP(ML)          EPCV(NLP)       LSCV(NLP)
Banana             23.90 ± 0.81    24.26 ± 1.06    33.88 ± 1.89
Breast-cancer      53.57 ± 4.75    54.12 ± 5.27    55.58 ± 5.03
Diabetes           47.74 ± 1.96    47.97 ± 2.09    50.72 ± 2.13
German             48.67 ± 2.74    49.05 ± 2.76    50.51 ± 2.30
Heart              40.16 ± 5.36    40.03 ± 5.00    45.11 ± 4.91
Image               8.26 ± 1.07     8.45 ± 0.97    22.70 ± 0.66
Ringnorm           16.88 ± 0.93    16.56 ± 1.01    28.48 ± 0.75
Solar              57.25 ± 1.38    57.35 ± 1.42    59.61 ± 1.34
Splice             28.48 ± 0.88    29.60 ± 0.79    36.83 ± 0.42
Thyroid            10.21 ± 3.76     9.94 ± 3.69    25.33 ± 4.86
Titanic            66.86 ± 1.97    51.73 ± 1.73    53.78 ± 14.08
Twonorm             8.31 ± 0.88     9.08 ± 1.97    25.94 ± 0.53
Waveform           23.01 ± 0.89    22.97 ± 0.67    32.63 ± 0.59
Table 5: Test set error (in %; mean ± standard deviation over the partitions).

Dataset / Method   EP(ML)          EPCV(NLP)       LSCV(NLP)
Banana             10.41 ± 0.65    10.51 ± 0.50    10.93 ± 0.67
Breast-cancer      26.52 ± 4.89    26.61 ± 4.80    25.94 ± 4.59
Diabetes           23.28 ± 1.82    23.41 ± 1.82    24.30 ± 2.51
German             23.36 ± 2.11    23.48 ± 2.00    23.94 ± 2.33
Heart              16.65 ± 2.87    16.62 ± 3.08    17.91 ± 4.21
Image               2.82 ± 0.54     2.77 ± 0.51     2.74 ± 0.65
Ringnorm            4.41 ± 0.64     4.29 ± 0.69     5.05 ± 0.99
Solar              34.20 ± 1.75    34.27 ± 1.80    35.03 ± 1.89
Splice             11.61 ± 0.81    11.85 ± 0.83    11.83 ± 0.80
Thyroid             4.37 ± 2.19     4.20 ± 2.17     6.97 ± 3.78
Titanic            22.64 ± 1.34    22.50 ± 0.98    22.99 ± 2.81
Twonorm             3.05 ± 0.34     3.19 ± 0.51     3.43 ± 0.43
Waveform           10.10 ± 0.48     9.95 ± 0.48    11.70 ± 0.88
Table 6: Summary of the imbalanced datasets: number of training examples, input dimension, number of test examples, number of partitions and percentage of positive examples.

Dataset        # train   dim   # test   # partitions   % positives
Yeast7           297      8     1187         50             2
Yeast5           320      8     1164         50             4
Car3             350      6     1378         50             4
Ecoli5           124      7      212         50             6
Yeast4           165      8     1319         50            11
Breast-cancer    200      9       77        100            29
German           700     20      300        100            30
Diabetes         468      8      300        100            35
Table 7: Test set F-measure (in %; mean ± standard deviation) for the NLP, FM and NLPFM(bias) methods.

Dataset / Method   NLP             FM              NLPFM(bias)
Yeast7             32.24 ± 15.54   42.58 ± 7.84    40.85 ± 8.59
Yeast5             19.79 ± 12.85   32.85 ± 8.56    28.45 ± 10.57
Car3               55.89 ± 9.30    64.35 ± 8.51    62.67 ± 8.49
Ecoli5             84.41 ± 6.25    83.79 ± 5.90    84.08 ± 5.86
Yeast4             71.35 ± 3.95    73.64 ± 2.97    73.23 ± 2.81
Breast-cancer      38.42 ± 10.55   47.34 ± 6.44    46.98 ± 6.37
German             54.15 ± 4.17    57.53 ± 2.95    56.91 ± 2.87
Diabetes           62.69 ± 3.49    66.23 ± 2.87    66.12 ± 2.67
6 Experiments
We conducted two experiments with the various methods; see Table 2 for a summary. In the first experiment we compared the performance of the EPCV(NLP) method with that of the EP(ML) and LSCV(NLP) methods. In the second experiment we compared the performance of the EPCV(FM) method with that of the EPCV(NLP) and two-step methods. We used the minimize Matlab routine of the GPML Matlab code, available at http://www.gaussianprocess.org/gpml/code/matlab/doc/, for hyperparameter optimization. In all the experiments we used a single global scaling hyperparameter.
6.1 NLP Experiment
In this experiment we used the thirteen benchmark datasets available on the web (http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm) and summarized in Table 3. Let us first consider the results of the first experiment, given in Table 4 and Table 5. For the EP(ML) method we used the GPML Matlab code available on the web (http://www.gaussianprocess.org/gpml/code/matlab/doc/).
We conducted the Friedman test (Demsar, 2006) with the corresponding post-hoc tests for comparison of classifiers over multiple datasets. The comparison over multiple datasets requires a performance score of each method on each dataset (Demsar, 2006); here, we take the mean over the partitions of a given dataset as the performance score. As pointed out in Demsar (2006), it is not clear how to make use of the standard deviation information when the partitions of a dataset are not independent. The Friedman test ranks the methods for each dataset separately based on the chosen performance score (mean performance in our case): the best performing method gets rank 1, the second best rank 2, and so on, with average ranks assigned in case of ties. The test then checks whether the measured average ranks (over the datasets) differ significantly from the mean rank expected under the null hypothesis, under which all the methods are equivalent and so their average ranks should be equal.
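As an illustrative sketch (not the code used in the paper), the ranking step and the F-statistic form of the Friedman test described above can be written as follows, using the statistics from Demsar (2006):

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist

def friedman_f_test(scores):
    """Friedman test with the F-statistic correction (Demsar, 2006).
    scores: (n_datasets, n_methods) array of mean performance per
    dataset; lower is better (e.g. NLP or test set error)."""
    n, k = scores.shape
    # Rank methods within each dataset; ties receive average ranks.
    ranks = np.apply_along_axis(rankdata, 1, scores)
    avg_ranks = ranks.mean(axis=0)
    # Friedman chi-square statistic.
    chi2 = 12.0 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    # F-statistic, distributed as F(k - 1, (k - 1)(n - 1)) under the null.
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    p_value = f_dist.sf(f_stat, k - 1, (k - 1) * (n - 1))
    return avg_ranks, f_stat, p_value
```

Under the null hypothesis of equivalent methods the average ranks should all be close to (k+1)/2; a small p-value then licenses the Nemenyi post-hoc pairwise comparisons.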
In the case of the NLP performance measure, the measured average ranks for the EP(ML), EPCV(NLP) and LSCV(NLP) methods were , and respectively. With three methods and 13 datasets, the F-statistic comparison at a significance level of 0.05 rejected the null hypothesis. Since the null hypothesis was rejected, we conducted the Nemenyi post-hoc test for pairwise comparisons. This test revealed that the results of the EP(ML) and EPCV(NLP) methods are better than those of the LSCV(NLP) method at the significance level of . On the other hand, the post-hoc test did not detect any significant difference between the results of the EP(ML) and EPCV(NLP) methods. In the case of the test set error performance measure, the measured average ranks for the EP(ML), EPCV(NLP) and LSCV(NLP) methods were , and respectively. Note that the average rank of the LSCV(NLP) method improved on the test set error performance. Here again, the null hypothesis was rejected at the same significance level, and the post-hoc test did not detect any significant difference between the results of the EP(ML) and EPCV(NLP) methods. The results of the EP(ML) and EPCV(NLP) methods are better than those of the LSCV(NLP) method at the significance levels of and respectively. Thus, we can conclude that EP(ML) and EPCV(NLP) are competitive with each other. Further, both methods perform better than the LSCV(NLP) method in this experiment.
We can also make other observations from the tables. We note that the NLP performance of the LSCV(NLP) method is quite inferior on several datasets, even though its test set error performance on most of these datasets (except waveform and thyroid) is relatively close. Further, some kind of group behavior can be seen. For example, the NLP scores are high on titanic, breastcancer, diabetes, German and flare-solar. Also, the test set errors are high on these datasets. Consequently, we may consider these datasets as difficult ones. From Table 4 we observe that the NLP performance of LSCV(NLP) is closer to that of the other two methods on these datasets (compared to its performance on the other datasets). Next, we can order the remaining datasets heart, splice, banana, waveform, ringnorm, thyroid, twonorm and image in terms of descending difficulty. Note that the difference in NLP performance shows an increasing trend as the dataset becomes easier. To understand this, we looked at the predictive probabilities of these methods on both correctly and wrongly classified examples. In the case of the thyroid dataset, these average probability scores for the LSCV(NLP) method were (for correct classification) and (for wrong classification). Here, the averaging was done over all the correctly (wrongly) classified examples over all the partitions. On the other hand, the corresponding values for the EPCV(NLP) method were . In the case of the banana dataset, these scores were and for the LSCV(NLP) and EPCV(NLP) methods respectively. In the case of the German dataset, they were and . The scores for the EP(ML) method were very close to those of the EPCV(NLP) method. In general, we observed that the predictive probability estimates from the LSCV(NLP) method were relatively poor and resulted in poor NLP performance.
We looked at the hyperparameter estimates of the different methods and observed that, compared to the other two methods, the LSCV(NLP) method chooses smaller width and signal variance hyperparameter values on most of the datasets, except on the difficult datasets mentioned above. Apart from this, we did not observe any specific pattern in the hyperparameter values chosen by these methods. Looking at the hyperparameter estimates of EP(ML) and EPCV(NLP), it seems that several solutions giving close performances are possible in the space of hyperparameters.
6.2 F-measure Experiment
The datasets used in this experiment are described in Table 6. The yeast, car and ecoli datasets are multi-class datasets; we converted them into binary classification problems by taking the examples belonging to the class label indicated by the trailing number (for example, class 7 in yeast7) as the positive class and treating the rest of the examples as the negative class. These datasets are available on the web (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/). We created 50 partitions of these datasets in a stratified manner reflecting the class distributions.

Let us now consider the results from the second experiment, given in Table 7. In this experiment all the results were obtained within the EPCV framework. The first and second columns give results obtained using NLP and the smoothed F-measure (i.e., eq. (10)) as the optimization criterion respectively; the third column gives results obtained with the two-step classifier described earlier. We looked at the hyperparameter estimates of the different methods. We observed that the bias estimates of the two-step classifier were somewhat close (within ) to those of the smoothed F-measure method on the breastcancer, diabetes and German datasets; on the remaining datasets they differed by more than . The width and signal variance hyperparameter estimates were also quite different. From Table 7, we observe that the two-step method is also good and gives performance close to that of the smoothed F-measure method on several datasets. Further analysis of the performance results revealed that, even though the standard deviations are high, the smoothed F-measure method gave better performance than the two-step method on a majority of the partitions on several datasets. We believe that the larger standard deviations in the results arise from sensitivity on the datasets with fewer positive examples. To carry out the statistical significance tests, we again used the mean (over the partitions) as the performance score for each of the methods. The measured average ranks for these three methods were , and respectively. With three methods and 8 datasets, the F-statistic comparison at a significance level of 0.05 rejected the null hypothesis. The Nemenyi post-hoc test for pairwise comparisons revealed that only the results of the smoothed F-measure method are better than those of the NLP based method at the significance level of 0.05.
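The smoothed F-measure idea can be illustrated with a short sketch. The exact form of eq. (10) is given earlier in the paper; the sketch below uses the common Jansche (2005)-style smoothing, in which the hard decision counts in the F-measure are replaced by sums of predictive probabilities, so the criterion becomes a smooth function of the hyperparameters:

```python
import numpy as np

def smoothed_f_measure(p_pos, y):
    """Smoothed F-measure from LOO predictive probabilities.
    p_pos[i] approximates p(y_i = +1 | x_i, y_{-i}); y[i] is in {+1, -1}."""
    pos = (y == 1)
    soft_tp = p_pos[pos].sum()       # "soft" true-positive count
    n_pos = pos.sum()                # number of actual positive examples
    soft_pred_pos = p_pos.sum()      # "soft" predicted-positive count
    # F = 2 * TP / (actual positives + predicted positives)
    return 2.0 * soft_tp / (n_pos + soft_pred_pos)
```

With confident, correct probabilities the score approaches 1; with uninformative probabilities of 0.5 everywhere it collapses toward the class-imbalance baseline, which is what makes it a useful criterion for imbalanced data.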
In this experiment the post-hoc test did not detect any significant differences in the comparisons of the smoothed F-measure method with the two-step method, or of the two-step method with the NLP method. However, in these two comparisons the rank differences were close to the required critical differences at the significance level of 0.1. We also observed that if we were to conduct the Wilcoxon signed-rank test on these methods (as if we were comparing only two classifiers), then the results were statistically significant at the significance level of 0.05 for all three pairs. In summary, the results demonstrate the usefulness of direct optimization of the smoothed F-measure.
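For example, pairing the mean F-measure scores from the NLP and smoothed F-measure columns of Table 7, the Wilcoxon signed-rank comparison can be sketched with scipy:

```python
import numpy as np
from scipy.stats import wilcoxon

# Mean F-measure (%) per dataset from Table 7 (NLP vs. smoothed F-measure).
nlp = np.array([32.24, 19.79, 55.89, 84.41, 71.35, 38.42, 54.15, 62.69])
fm = np.array([42.58, 32.85, 64.35, 83.79, 73.64, 47.34, 57.53, 66.23])

# One-sided paired test: is the smoothed F-measure method better?
stat, p = wilcoxon(fm, nlp, alternative='greater')
print(p < 0.05)  # significant at the 0.05 level
```

The smoothed F-measure method wins on seven of the eight datasets, and the single loss (ecoli5) has the smallest absolute difference, so the signed-rank statistic is strongly in its favor.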
7 Conclusion
In this paper, we considered the problem of Gaussian process classifier model selection with different LOOCV based optimization criteria, and provided a practical algorithm that uses LOO predictive distributions with criteria like the standard NLP, the smoothed F-measure and WER to select hyperparameters. More specifically, apart from optimization of the standard NLP, we demonstrated its usefulness in direct optimization of the smoothed F-measure, which is useful in handling imbalanced data. We considered the predictive distributions arising from the Expectation Propagation (EP) approximation, derived the relevant expressions and proposed a very useful EPCV algorithm. The experimental results on several real world benchmark datasets showed NLP generalization performance (with NLP optimization) comparable to existing approaches. We also demonstrated that the smoothed F-measure optimization method significantly improves the F-measure performance. Overall, the EPCV algorithm is an excellent choice for GP classifier model selection with different LOOCV based optimization criteria.
References
 Demsar (2006) J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
 Faul and Tipping (2002) A. C. Faul and M. E. Tipping. Analysis of sparse Bayesian learning. Advances In Neural Information Processing Systems, 14, 2002.
 Gibbs and MacKay (2000) M. Gibbs and D. J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464, 2000.
 Hong et al. (2007) X. Hong, S. Chen, and C. J. Harris. A kernel based two-class classifier for imbalanced data sets. IEEE Transactions on Neural Networks, 18(1):28–41, 2007.
 Jansche (2005) M. Jansche. Maximum expected F-measure training of logistic regression models. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 692–699. Association for Computational Linguistics, NJ, USA, 2005.
 Joachims (2005) T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377–384, 2005.
 Keerthi et al. (2006) S. S. Keerthi, V. Sindhwani, and O. Chapelle. An efficient method for gradient based adaptation of hyperparameters in SVM models. Advances In Neural Information Processing Systems, 18, 2006.
 Kim and Ghahramani (2006) H. C. Kim and Z. Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1948–1959, 2006.
 Kuss and Rasmussen (2005) M. Kuss and C. E. Rasmussen. Assessing approximate inference for Bayesian Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
 Minka (2001) T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369. Morgan Kaufmann, 2001.
 Neal (1998) R. Neal. Regression and classification using Gaussian process priors. Bayesian Statistics, 6:475–501, 1998.
 Opper and Winther (2000) M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12:2655–2684, 2000.
 Qi et al. (2004) Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In ICML, pages 362–369, 2004.
 Rasmussen and Williams (2006) C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
 Seeger (2003) M. Seeger. Bayesian Gaussian process models: PAC-Bayesian generalization error bounds and sparse approximation. PhD thesis, University of Edinburgh, 2003.
 Seeger (2007) M. Seeger. Cross-validation optimization for large scale hierarchical classification kernel methods. Advances In Neural Information Processing Systems, 19, 2007.
 Sundararajan and Keerthi (2001) S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103–1118, 2001.
 van Rijsbergen (1974) C. J. van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.
 Williams and Barber (1998) C. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
Appendix: Optimization criteria, approximate predictive distributions and their derivatives
From the definitions of the various optimization criteria, such as the NLP measure (6) and the smoothed F-measure (10), we see that the chosen measure and its derivatives can be obtained from the LOO predictive distributions $p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})$ and their derivatives $\frac{\partial p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})}{\partial \theta_j}$. Here, $\theta_j$ is the $j$th component of the hyperparameter vector $\boldsymbol{\theta}$. Note that the derivative of the NLP measure (6) is given by:

$$\frac{\partial\, \mathrm{NLP}(\boldsymbol{\theta})}{\partial \theta_j} \;=\; -\frac{1}{n} \sum_{i=1}^{n} \frac{1}{p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})}\, \frac{\partial p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})}{\partial \theta_j} \qquad (15)$$

and the derivative of the smoothed F-measure (10) is given by:

$$\frac{\partial F(\boldsymbol{\theta})}{\partial \theta_j} \;=\; \sum_{i=1}^{n} \frac{\partial F}{\partial p_i}\, \frac{\partial p_i}{\partial \theta_j} \qquad (16)$$

where $p_i = p(y_i = +1 \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})$. Note that the derivatives (15) and (16) depend directly on $\frac{\partial p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})}{\partial \theta_j}$. Now, from (13) we see that to define the LOO predictive distributions we need the LOO mean $\mu_{-i}$ and variance $\sigma^2_{-i}$. In the case of the EP approximation, analytical expressions to compute these quantities are already available (see eqn. (4)). Next, we give details on how the derivatives of the predictive distributions can be obtained with these approximations.
Derivatives of Predictive Distributions
For ease of reference we recall (13) here:

$$p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta}) \;=\; \Phi(z_i), \qquad z_i \;=\; \frac{y_i\,(\mu_{-i} + b)}{\sqrt{1 + \sigma^2_{-i}}}. \qquad (13)$$

Then, with $\Phi(\cdot)$ denoting the standard normal CDF and $\phi(\cdot)$ its density, we have

$$\frac{\partial p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})}{\partial \theta_j} \;=\; \phi(z_i)\, \frac{y_i}{\sqrt{1 + \sigma^2_{-i}}} \left[ \frac{\partial \mu_{-i}}{\partial \theta_j} \;-\; \frac{\mu_{-i} + b}{2\,(1 + \sigma^2_{-i})}\, \frac{\partial \sigma^2_{-i}}{\partial \theta_j} \right].$$

Here, $\theta_j$ represents any element of $\boldsymbol{\theta}$ other than the bias $b$. Similarly, we have

$$\frac{\partial p(y_i \mid \mathbf{x}_i, \mathbf{y}_{-i}, \boldsymbol{\theta})}{\partial b} \;=\; \phi(z_i)\, \frac{y_i}{\sqrt{1 + \sigma^2_{-i}}}.$$

Thus, we need $\frac{\partial \mu_{-i}}{\partial \theta_j}$ and $\frac{\partial \sigma^2_{-i}}{\partial \theta_j}$. Below, we give details on how they can be obtained with the EP approximation.
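Assuming the LOO predictive takes the probit form $\Phi\big(y_i(\mu_{-i}+b)/\sqrt{1+\sigma^2_{-i}}\big)$ (the standard form under the EP approximation, stated here as an assumption for this sketch), the chain-rule partials with respect to the LOO mean and variance can be sanity-checked against finite differences:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def loo_pred(y, mu, var, b):
    """LOO predictive p(y_i | x_i, y_{-i}) = Phi(y (mu + b) / sqrt(1 + var))."""
    return Phi(y * (mu + b) / math.sqrt(1.0 + var))

def loo_pred_partials(y, mu, var, b):
    """Analytic partials of the predictive w.r.t. the LOO mean and variance."""
    s = math.sqrt(1.0 + var)
    z = y * (mu + b) / s
    dp_dmu = phi(z) * y / s
    dp_dvar = -phi(z) * y * (mu + b) / (2.0 * s ** 3)
    return dp_dmu, dp_dvar
```

Combined with the derivatives of $\mu_{-i}$ and $\sigma^2_{-i}$ with respect to each hyperparameter, these partials give the full gradient of the chosen criterion via the chain rule.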
Derivatives of LOO mean and variance with fixed site parameters: EP Approximation
In the case of the EP approximation, since we resort to EM type optimization, we derive expressions for $\frac{\partial \mu_{-i}}{\partial \theta_j}$ and $\frac{\partial \sigma^2_{-i}}{\partial \theta_j}$ assuming that the site parameters $(\tilde{\nu}_i, \tilde{\tau}_i)$ are fixed. From $\sigma^2_{-i} = \big(\Sigma_{ii}^{-1} - \tilde{\tau}_i\big)^{-1}$, we have

$$\frac{\partial \sigma^2_{-i}}{\partial \theta_j} \;=\; \frac{(\sigma^2_{-i})^2}{\Sigma_{ii}^2}\, \frac{\partial \Sigma_{ii}}{\partial \theta_j}.$$

From $\mu_{-i} = \sigma^2_{-i} \big(\Sigma_{ii}^{-1} \mu_i - \tilde{\nu}_i\big)$, we have

$$\frac{\partial \mu_{-i}}{\partial \theta_j} \;=\; \frac{\partial \sigma^2_{-i}}{\partial \theta_j} \left( \frac{\mu_i}{\Sigma_{ii}} - \tilde{\nu}_i \right) \;+\; \sigma^2_{-i} \left( \frac{1}{\Sigma_{ii}} \frac{\partial \mu_i}{\partial \theta_j} \;-\; \frac{\mu_i}{\Sigma_{ii}^2} \frac{\partial \Sigma_{ii}}{\partial \theta_j} \right).$$

Since $\boldsymbol{\Sigma} = \big(\mathbf{K}^{-1} + \tilde{\boldsymbol{\Sigma}}^{-1}\big)^{-1}$, we have $\frac{\partial \boldsymbol{\Sigma}}{\partial \theta_j} = \boldsymbol{\Sigma} \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta_j} \mathbf{K}^{-1} \boldsymbol{\Sigma}$. Note that $\boldsymbol{\Sigma}$ can be rewritten using the Sherman-Morrison-Woodbury formula as $\boldsymbol{\Sigma} = \mathbf{K} - \mathbf{K} (\mathbf{K} + \tilde{\boldsymbol{\Sigma}})^{-1} \mathbf{K}$, and it is useful to work with this expression to achieve improved numerical stability (Rasmussen and Williams, 2006) as it avoids inversion of $\mathbf{K}$. Then we have

$$\frac{\partial \boldsymbol{\Sigma}}{\partial \theta_j} \;=\; \big(\mathbf{I} - \mathbf{K} (\mathbf{K} + \tilde{\boldsymbol{\Sigma}})^{-1}\big)\, \frac{\partial \mathbf{K}}{\partial \theta_j}\, \big(\mathbf{I} - (\mathbf{K} + \tilde{\boldsymbol{\Sigma}})^{-1} \mathbf{K}\big).$$
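The Sherman-Morrison-Woodbury rewrite $(\mathbf{K}^{-1} + \tilde{\boldsymbol{\Sigma}}^{-1})^{-1} = \mathbf{K} - \mathbf{K}(\mathbf{K} + \tilde{\boldsymbol{\Sigma}})^{-1}\mathbf{K}$ used here can be checked numerically; a small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)               # symmetric positive definite "kernel" matrix
site_var = rng.uniform(0.5, 2.0, size=n)  # EP site variances (diagonal of Sigma~)
S = np.diag(site_var)

# Direct form: Sigma = (K^{-1} + Sigma~^{-1})^{-1} (inverts K explicitly).
Sigma_direct = np.linalg.inv(np.linalg.inv(K) + np.linalg.inv(S))
# SMW form: Sigma = K - K (K + Sigma~)^{-1} K (no explicit K^{-1}).
Sigma_smw = K - K @ np.linalg.solve(K + S, K)

print(np.allclose(Sigma_direct, Sigma_smw))  # True
```

Working with the second form replaces the explicit inverse of K by a linear solve against the better-conditioned matrix K + Sigma~.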
Note that $\frac{\partial \Sigma_{ii}}{\partial \theta_j}$ (where $i = 1, \ldots, n$) are nothing but the diagonal entries of the above expression. Note also that $\frac{\partial \mu_i}{\partial \theta_j}$ can be efficiently computed by taking advantage of the presence of the vector $\tilde{\boldsymbol{\nu}}$ (since $\boldsymbol{\mu} = \boldsymbol{\Sigma} \tilde{\boldsymbol{\nu}}$, only matrix-vector products are needed). But, to compute $\frac{\partial \Sigma_{ii}}{\partial \theta_j}$, we cannot avoid the matrix multiplication with $\frac{\partial \mathbf{K}}{\partial \theta_j}$; this results in $O(n^3)$ cost for each $\theta_j$. Finally, it is useful to rewrite $(\mathbf{K} + \tilde{\boldsymbol{\Sigma}})^{-1} = \tilde{\boldsymbol{\Sigma}}^{-1/2} \mathbf{B}^{-1} \tilde{\boldsymbol{\Sigma}}^{-1/2}$ with $\mathbf{B} = \mathbf{I} + \tilde{\boldsymbol{\Sigma}}^{-1/2} \mathbf{K} \tilde{\boldsymbol{\Sigma}}^{-1/2}$ (Rasmussen and Williams, 2006).
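The rewrite from Rasmussen and Williams (2006) expresses $(\mathbf{K} + \tilde{\boldsymbol{\Sigma}})^{-1}$ through the well-conditioned matrix $\mathbf{B} = \mathbf{I} + \tilde{\boldsymbol{\Sigma}}^{-1/2} \mathbf{K} \tilde{\boldsymbol{\Sigma}}^{-1/2}$, whose eigenvalues are all at least 1. A numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)               # SPD kernel matrix
site_var = rng.uniform(0.5, 2.0, size=n)  # diagonal of Sigma~

S_inv_half = np.diag(1.0 / np.sqrt(site_var))
# B = I + Sigma~^{-1/2} K Sigma~^{-1/2} has eigenvalues >= 1,
# so it is well conditioned and safe to factorize.
B = np.eye(n) + S_inv_half @ K @ S_inv_half
# (K + Sigma~)^{-1} = Sigma~^{-1/2} B^{-1} Sigma~^{-1/2}
inv_via_B = S_inv_half @ np.linalg.solve(B, S_inv_half)

print(np.allclose(inv_via_B, np.linalg.inv(K + np.diag(site_var))))  # True
```

Solving against B rather than K + Sigma~ directly is the standard numerically stable route in EP implementations.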