Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification

05/28/2018 ∙ by Dimitrios Milios, et al. ∙ MIT Istituto Italiano di Tecnologia 6

In this paper, we study the problem of deriving fast and accurate classification algorithms with uncertainty quantification. Gaussian process classification provides a principled approach, but the corresponding computational burden is hardly sustainable in large-scale problems and devising efficient alternatives is a challenge. In this work, we investigate if and how Gaussian process regression directly applied to the classification labels can be used to tackle this question. While in this case training time is remarkably faster, predictions need be calibrated for classification and uncertainty estimation. To this aim, we propose a novel approach based on interpreting the labels as the output of a Dirichlet distribution. Extensive experimental results show that the proposed approach provides essentially the same accuracy and uncertainty quantification of Gaussian process classification while requiring only a fraction of computational resources.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Classification is a classic machine learning task. While the most basic performance measure is classification accuracy, in practice assigning a calibrated confidence to the predictions is often crucial

[5]. For example, in image classification providing classification predictions with a calibrated score is important to avoid making over-confident decisions [6, 11, 14]

. Several classification algorithms that output a continuous score are not necessarily calibrated (e.g., support vector machines


). Popular ways to calibrate classifiers use a validation set to learn a transformation of their output score that recovers calibration; these include Platt scaling

[22] and isotonic regression [36]

. Calibration can also be achieved if a sensible loss function is employed

[12], for example the logistic/cross-entropy loss, and it is known to be positively impacted if the classifier is well regularized [6].

Bayesian approaches provide a natural framework to tackle these kinds of questions, since quantification of uncertainty is of primary interest. In particular, Gaussian Processes Classification (gpc) [23, 34, 8] combines the flexibility of Gaussian Processes (gps) [23] and the regularization stemming from their probabilistic nature, with the use of the correct likelihood for classification, that is Bernoulli or multinomial for binary or multi-class classication, respectively. While we are not aware of empirical studies on the calibration properties of gpc, our results confirm the intuition that gpc is actually calibrated. The most severe drawback of gpc, however, is that it requires carrying out several matrix factorizations, making it unattractive for large-scale problems.

In this paper, we study the question of whether Gaussian process approaches can be made efficient to find accurate and well-calibrated classification rules. A simple idea is to use Gaussian process regression directly on classification labels. This idea is quite common in non-probabilistic approaches [32, 25]

and can be grounded from a decision theoretic point of view. Indeed, the Bayes’ rule minimizing the expected least squares is the the expected conditional probability, which in classification is directly related to the conditional probabilities of each class, see e.g.

[29, 3]. Directly regressing the labels leads to fast training and excellent classification accuracies [27, 16, 10]. However, the corresponding predictions are not calibrated for uncertainty quantification. The question is then if calibration can be achieved while retaining speed.

The main contribution of our work is the proposal of a transformation of the classification labels, which turns the original problem into a regression problem without compromising calibration. For gps, this has the enormous advantage of bypassing the need for expensive posterior approximations, leading to a method that is as fast as a simple regression of the original labels. The proposed method is based on the interpretation of the labels as the output of a Dirichlet distribution, so we name it Dirichlet-based gp classification (gpd). Through an extensive experimental validation, including large-scale classification tasks, we demonstrate that gpd is calibrated and competitive in performance with state-of-the-art gpc.

2 Related work

Calibration of classifiers:

Platt scaling [22] is a popular method to calibrate the output score of classifiers, as well as isotonic regression [36]. More recently, Beta calibration [12] and temperature scaling [6]

have been proposed to extend the class of possible transformations and reduce the parameterization of the transformation, respectively. It is established that binary classifiers are calibrated when they employ the logistic loss; this is a direct consequence of the fact that the appropriate model for Bernoulli distributed variables is the one associated with this loss


. The extension to multi-class problems yields the so-called cross-entropy loss, which corresponds to the multinomial likelihood. Not necessarily, however, the right loss makes classifiers well calibrated; recent works on calibration of convolutional neural networks for image classification show that depth negatively impacts calibration due to the introduction of a large number of parameters to optimize, and that regularization is important to recover calibration


Kernel-based classification:

Performing regression on classification labels is also known as least-squares classification [32, 25]. We are not aware of works that study gp-based least-squares classification in depth; we could only find a few comments on it in [23] (Sec. 6.5). gpc is usually approached assuming a latent process, which is given a gp prior, that is transformed into a probability of class labels through a suitable squashing function [23]. Due to the non-conjugacy between the gp

prior and the non-Gaussian likelihood, applying standard Bayesian inference techniques in

gpc leads to analytical intractabilities, and it is necessary to resort to approximations. Standard ways to approximate computations include the Laplace Approximation [34] and Expectation Propagation [18]; see, e.g., [15, 20] for a detailed review of these methods. More recently, there have been advancements in works that extend “sparse” gp approximations [33] to classification [9] in order to deal with the issues of scalability with the number of observations. All these approaches require iterative refinements of the posterior distribution over the latent gp, and this implies expensive algebraic operations at each iteration.

3 Background

Consider a multi-class classification problem. Given a set of training inputs and their corresponding labels

, with one-hot encoded classes denoted by the vectors

, a classifier produces a predicted label as function of any new input .

In the literature, calibration is assessed through the Expected Calibration Error (ece) [6], which is the average of the absolute difference between accuracy and confidence:


where the test set is divided into disjoint subsets , each corresponding to a given level of confidence predicted by the classifier , while is the classification accuracy of measured on the -th subset. Other metrics used in this work to characterize the quality of a classifier are the mean negative log-likelihood (mnll) of the test set under the model, and the error rate on the test set. All metrics are defined so that lower values are better.

3.1 Kernel methods for classification

gp classification (gpc)

gp-based classification is defined by the following abstract steps:

  1. A Gaussian prior is placed over a latent function . The gp prior is characterized by the mean function and the covariance function

    . The observable (non-Gaussian) prior is obtained by transforming through a sigmoid function so that the sampled functions produce proper probability values. In the multi-class case, we consider

    independent priors over the vector of functions ; transformation to proper probabilities is achieved by applying the softmax function 111Softmax function s.t. for .

  2. The observed labels are associated with a categorical likelihood with probability components , for any .

  3. The latent posterior is obtained by means of Bayes’ theorem.

  4. The latent posterior is transformed via , to obtain a distribution over class probabilities.

Throughout this work, we consider and . This choice of covariance function is also known as the rbf kernel; it is characterized by the and hyper-parameters, interpreted as the gp

marginal variance and length-scale, respectively. The hyper-parameters are commonly selected my maximizing the marginal likelihood of the model.

The major computational challenge of gpc can be identified in Step 3 described above. The categorical likelihood implies that the posterior stochastic process is not Gaussian and it cannot be calculated analytically. Therefore, different approaches resort to Gaussian approximations of the likelihood so that the resulting approximate Gaussian posterior has the following form:


For example, in Expectation Propagation (EP, [18]) and are determined by the site parameters, which are learned via an iterative process. In the variational approach of [9], and are the variational parameters of the approximate Gaussian likelihoods. Despite being successful, approaches like these contribute significantly to the computational cost of classification, as they introduce a large number of parameters that need to be optimized. In this work, we explore a more straightforward Gaussian approximation to the likelihood that requires no significant computational overhead.

gp regression (gpr) on classification labels

A simple way to bypass the problem induced by categorical likelihoods is to perform least-squares regression on the labels by ignoring their discrete nature. This implies considering a Gaussian likelihood , where is the observation noise variance. It is well-known that if the observed labels are and , then the function that minimizes the mean squared error converges to the true class probabilities in the limit of infinite data [24]. Nevertheless, by not squashing through a softmax function, we can no longer guarantee that the resulting distribution of functions will lie within 0 and 1. For this reason, additional calibration steps are required (i.e. Platt scaling).

Kernel Ridge Regression (krr) for classification

The idea of directly regressing labels is quite common when gp estimators are applied within a frequentist context [25]

. Here they are typically derived from a non-probabilistic perspective based on empirical risk minimization, and the corresponding approach is dubbed Kernel Ridge Regression

[7]. Taking this perspective, two comments can be made. The first is that the noise and covariance parameters are viewed as regularization parameters that need to be tuned, typically by cross-validation. In our experiments, we compare this method with a canonical gpr approach. The second comment is that regressing labels with least squares can be justified from a decision theoretic point of view. The Bayes’ rule minimizing the expected least squares is the regression function (the expected conditional probability), which in binary classification is proportional to the conditional probability of one of the two classes [3] (similar reasoning applies to multi-class classification [19, 2]

). From this perspective, one could expect a least squares estimator to be self-calibrated, however this is typically not the case in practice, a feature imputed to the limited number of points and the choice of function models. In the following we breifly present Platt scaling, a simple and effective post-hoc calibration method which can be seamlessly applied to both

gpr- and krr-based learning pipelines to obtain a calibrated model.

Platt scaling

Platt scaling [22] is an effective approach to perform post-hoc calibration for different types of classifiers, such as svms [21] and neural networks [6]. Given a decision function , which is the result of a trained binary classifier, the class probabilities are given by the sigmoid transformation , where and are optimised over a separate validation set, so that the resulting model best explains the data. Although this parametric form may seem restrictive, Platt scaling has been shown to be effective for a wide range of classifiers [21].

3.2 A note on calibration properties

We advocate that two components are critical for well-calibrated classifiers: regularization and the cross-entropy loss. Previous work indicates that regularization has a positive effect on calibration [6]. Also, classifiers that rely on the cross-entropy loss are reported to be well-calibrated [21]. This form of loss function is equivalent to the negative Bernoulli log-likelihood (or categorical in the multi-class case), which is the proper interpretation of classification outcomes.

In Figure 1, we demonstrate the effects of regularization and cross-entropy empirically: we summarize classification results on four synthetic datasets of increasing size. We assume that each class label is sampled from a Bernoulli distribution with probability given by the unknown function with input space of dimensionality . For a classifier to be well-calibrated, it should accurately approximate . We fit three kinds of classifiers: a maximum likelihood (ML) classifier that relies on cross entropy loss (CE), a Bayesian classifier with MSE loss (i.e. gpr classification), and finally a Bayesian classifier that relies on CE (i.e. gpc

). We report the averages over 1000 iterations and the average standard deviation. The Bayesian classifiers that rely on the cross entropy loss converge to the true solution at a faster rate, and they are characterized by smaller variance.

Figure 1: Convergence of classifiers with different loss functions and regularization properties. Left: summary of the mean squared error (MSE) from the true function for 1000 randomly sampled training sets of different size; the Bayesian CE-based classifier is characterized by smaller variance even when the number of training inputs is small. Right: demonstration of how the averaged classifiers approximate the true function for different training sizes.

Although performing gpr on the labels induces regularization through the prior, the likelihood model is not appropriate. One possible solution is to employ meticulous likelihood approximations such as EP or variational gp classification [9], alas at an often prohibitive computational cost, especially for considerably large datasets. In the section that follows, we introduce a methodology that combines the best of both worlds. We propose to perform gp regression on labels transformed in such a way that a less crude approximation of the categorical likelihood is achieved.

4 gp regression on transformed Dirichlet variables

There is an obvious defect in gp-based least-squares classification: each point is associated with a Gaussian likelihood, which is not the appropriate noise model for Bernoulli-distributed variables. Instead of approximating the true non-Gaussian likelihood, we propose to transform the labels in a latent space where a Gaussian approximation to the likelihood is more sensible.

For a given input, the goal of a Bayesian classifier is to estimate the distribution over its class probability vector; such a distribution is naturally represented by a Dirichlet-distributed random variable. More formally, in a

-class classification problem each observation is a sample from a categorical distribution . The objective is to infer the class probabilities , for which we use a Dirichlet model: . In order to fully describe the distribution of class probabilities, we have to estimate the concentration parameters . Given an observation such that , our best guess for the values of will be: and . Note that it is necessary to add a small quantity , so as to have valid parameters for the Dirichlet distribution. Intuitively, we implicitly induce a Dirichlet prior so that before observing a data point we have the probability mass shared equally across classes; we know that we should observe exactly one count for a particular class, but we do not know which one. Most of the mass is concentrated on the corresponding class when is observed. This practice can be thought of as the categorical/Bernoulli analogue of the noisy observations in gp regression. The likelihood model is:


It is well-known that a Dirichlet sample can be generated by sampling from

independent Gamma-distributed random variables with shape parameters

and rate ; realizations of the class probabilities can be generated as follows:


Therefore, the noisy Dirichlet likelihood assumed for each observation translates to independent Gamma likelihoods with shape parameters either , if , or otherwise.

In order to construct a Gaussian likelihood in the log-space, we approximate each Gamma-distributed with

, which has the same mean and variance (i.e. moment matching):

Thus, for the parameters of the normally distributed logarithm we have:


Note that this is the first approximation to the likelihood that we have employed so far. One could argue that a log-normal approximation to a Gamma-distributed variable is reasonable, although it is not perfect for small values of the shape parameter . However, the most important implication is that we can now consider a Gaussian likelihood in the log-space. Assuming a vector of latent processes , we have:


where the class label is now denoted by in the transformed logarithmic space. It is important to note that the noise parameter is different for each observation; we have a heteroskedastic regression model. In fact, the values (as well as ) solely depend on the Dirichlet pseudo-count assumed in the prior, which only has two possible values. Given this likelihood approximation, it is straightforward to place a gp prior over and evaluate the posterior over latent processes exactly.

Remark: In the binary classification case, we still have to perform regression on two latent processes. The use of heteroskedastic noise model implies that one latent process is not a mirrored version of the other (see Figure 2), contrary to gpc.

4.1 From gp posterior to Dirichlet variables

The obtained gp posterior emulates the logarithm of a stochastic process with Gamma marginals that gives rise to the Dirichlet posterior distributions. It is straightforward to sample from the posterior log-normal marginals, which should behave as Gamma-distributed samples to generate posterior Dirichlet samples as in Equation (4). It is easy to see that this corresponds to a simple application of the softmax function on the posterior gp samples. The expectation of class probabilities will be:


which can be approximated by sampling from the Gaussian posterior .

Figure 2 is an example of Dirichlet regression for a one-dimensional binary classification problem. The left-side panels demonstrate how the gp posterior approximates the transformed data; the error bars represent the standard deviation for each data-point. Notice, that the posterior for class “0” (top) is not a mirror image of class “1” (bottom), because of the different noise terms assumed for each latent process. The right-side panels show results in the original output space, after applying softmax transformation; as expected in the binary case, one posterior process is a mirror image of the other.

Figure 2: Example of Dirichlet regression for a one-dimensional binary classification problem. Left: the latent gp posterior for class “0” (top) and class “1” (bottom). Right: the transformed posterior through softmax for class “0” (top) and class “1” (bottom).

4.2 Optimizing the Dirichlet prior

The performance of Dirichlet-based classification is affected by the choice of , in addition to the usual gphyperparameters. As approaches zero, converges to either or . It is easy to see that for the transformed “1” labels we have and in the limit. The transformed “0” labels, however, converge to infinity, and so do their variances. The role of is to make the transformed labels finite, so that it is possible to perform regression. The smaller the is, the further the transformed labels will be apart, but at the same time, the variance for the “0” label will be larger.

By increasing , the transformed labels of different classes tend to be closer. The marginal log-likelihood tends to be larger, as it is easier for a zero-mean gp prior to fit the data. However, this behavior is not desirable for classification purposes. For this reason, the Gaussian marginal log-likelihood in the transformed space is not appropriate to determine the optimal value for .

Figure 3 demonstrates the effect of on classification accuracy, as reflected by the mnll metric. Each subfigure corresponds to a different dataset; mnll is reported for different choices of between and . As a general remark, it appears that there is no globally optimal parameter across datasets. However, the reported training and testing mnll curves appear to be in agreement regarding the optimal choice for . We therefore propose to select the value that minimizes the mnll on the training data.

(a) htru2
(b) magic
(c) letter
(d) drive
Figure 3: Exploration of for 4 different datasets with respect to the mnll metric.

5 Experiments

We experimentally evaluate the methodologies discussed on the datasets outlined in Table 1. For the implementation of gp-based models, we use and extend the algorithms available in the GPFlow library [17]. More specifically, for gpc we make use of variational sparse gp [8], while for gpr we employ sparse variational gp regression [33]. The latter is also the basis for our gpd implementation: we apply adjustments so that heteroskedastic noise is admitted, as dictated by the Dirichlet mapping. Concerning krr, in order to scale it up to large-scale problems we use a subsampling-based variant named Nyström krr (nkrr) [31, 35]. Nyström-based approaches have been shown to achieve state-of-the-art accuracy on large-scale learning problems [13, 30, 26, 4, 28]. The number of inducing (subsampled) points used for each dataset is reported in Table 1.

Dataset Classes Training instances Test instances Dimensionality Inducing points
eeg 2 10980 4000 14 200
htru2 2 12898 5000 8 200
magic 2 14020 5000 10 200
miniboo 2 120064 10000 50 400
coverbin 2 522910 58102 54 500
susy 2 4000000 1000000 18 200
letter 26 15000 5000 16 200
drive 11 48509 10000 48 500
mocap 5 68095 10000 37 500
Table 1: Datasets used for evaluation, available from the UCI repository [1].

The experiments have been repeated for 10 random training/test splits. For each iteration, inducing points are chosen by applying k-means clustering on the training inputs. Exceptions are

coverbin and susy, for which we used 5 splits and inducing points chosen uniformly at random. For gpr we further split each training dataset: 80% of which is used to train the model and the remaining 20% is used for calibration with Platt scaling. nkrr uses an 80-20% split for -fold cross-validation and Platt scaling calibration, respectively. For each of the datasets, the parameter of gpd was selected according to the training mnll: we have for coverbin, for letter, drive and mocap, and for the remaining datasets.

In all experiments, we consider an isotropic rbf kernel; the kernel hyperparameters are selected by maximizing the marginal likelihood for the gp-based approaches, and by -fold cross validation for nkrr (with for all datasets except from susy, for which ). In the case of gpr, we also optimize the noise variance jointly with all kernel parameters.

The performance of gpd, gpc, gpr and nkrr is compared in terms of various error metrics, including error rate, mnll and ece for a collection of datasets. The error rate, mnll and ece values obtained are summarised in Figure 4. The gpc method tends to outperform gpr in most cases. Regarding the gpd approach, its performance tends to lie between gpc and gpr; in some instances classification performance is better than gpc and nkrr. Most importantly, this performance is obtained at a fraction of the computational time required by the gpc method. Figure 5 summarizes the speed-up achieved by the used of gpd as opposed to the variational gp classification approach.

Figure 4: Error rate, mnll and ece for the datasets considered in this work.

This dramatic difference in computational efficiency has some interesting implications regarding the applicability of gp-based classification methods on large datasets. gp-based machine learning approaches are known to be computationally expensive; their practical application on large datasets demands the use of scalable methods to perform approximate inference. The approximation quality of sparse approaches depends on the number (and the selection) of inducing points. In the case of classification, the speed-up obtained by gpd implies that the computational budget saved can be spent on a more fine-grained sparse gp approximation. In Figure 5, we explore the effect of increasing the number of inducing points for three datasets: letter, miniboo and mocap; we see that the error rate drops below the target gpc with a fixed number of inducing points, and still at a fraction of the computational effort.

(a) letter
(b) miniboo
(c) mocap
Figure 5: Left: Speed-up obtained by using gpd as opposed to gpc. Right: Error vs training time for gpd as the number of inducing points is increased for three datasets. The dashed line represents the error obtained by gpc using the same number of inducing points as the fastest gpd listed.

6 Conclusions

Most gp-based approaches to classification in the literature are characterized by a meticulous approximation of the likelihood. In this work, we experimentally show that such gp classifiers tend to be well-calibrated, meaning that they estimate correctly classification uncertainty, as this is expressed through class probabilities. Despite this desirable property, their applicability is limited to small/moderate size of datasets, due to the high computational complexity of approximating the true posterior distribution.

Least-squares classification, which may be implemented either as gpr or krr, is an established practice for more scalable classification. However, the crude approximation of a non-Gaussian likelihood with a Gaussian one has a negative impact on classification quality, especially as this is reflected by the calibration properties of the classifier.

Considering the strengths and practical limitations of gps, we proposed a classification approach that is essentially an heteroskedastic gp regression on a latent space induced by a transformation of the labels, which are viewed as Dirichlet-distributed random variables. This allowed us to convert -class classification to a problem of regression for latent processes with Gamma likelihoods. We then proposed to approximate the Gamma-distributed variables with log-normal ones, and thus we achieved a sensible Gaussian approximation in the logarithmic space. Crucially, this can be seen as a pre-processing step, that does not have to be learned, unlike gpc, where an accurate transformation is sought iteratively. Our experimental analysis shows that Dirichlet-based gp classification produces well-calibrated classifiers without the need for post-hoc calibration steps. The performance of our approach in terms of classification accuracy tends to lie between properly-approximated gpc and least-squares classification, but most importantly it is orders of magnitude faster than gpc.

As a final remark, we note that the predictive distribution of the gpd approach is different from that obtained by gpc, as can be seen in the extended results in the appendix. An extended characterization of the predictive distribution for gpd is subject of future work.


DM and PM are partially supported by KPMG. RC would like to thank Luigi Carratino for the useful exchanges concerning large-scale experiments. LR is funded by the Air Force project FA9550-17-1-0390 (European Office of Aerospace Research and Development) and the RISE project NoMADS - DLV-777826. MF gratefully acknowledges support from the AXA Research Fund.


  • [1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
  • [2] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multi-output learning via spectral filtering. Machine learning, 87(3):259–301, 2012.
  • [3] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.
  • [4] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. NYTRO: When Subsampling Meets Early Stopping. In

    Proceedings of the 19th International Conference on Artificial Intelligence and Statistics

    , pages 1403–1411, 2016.
  • [5] P. A. Flach. Classifier Calibration, pages 1–8. Springer US, Boston, MA, 2016.
  • [6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, International Convention Centre, Sydney, Australia, Aug. 2017. PMLR.
  • [7] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2001.
  • [8] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360, San Diego, California, USA, May 2015. PMLR.
  • [9] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for Variationally Sparse Gaussian Processes. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1648–1656. Curran Associates, Inc., 2015.
  • [10] P. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
  • [11] A. Kendall and Y. Gal.

    What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, Mar. 2017.

  • [12] M. Kull, T. S. Filho, and P. Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 623–631, Fort Lauderdale, FL, USA, Apr. 2017. PMLR.
  • [13] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nystrom Method. In NIPS, pages 1060–1068. Curran Associates, Inc., 2009.
  • [14] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world, Feb. 2017. arXiv:1607.02533.
  • [15] M. Kuss and C. E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
  • [16] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, and F. Sha. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. CoRR, abs/1411.4000, 2014.
  • [17] A. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman.

    GPflow: A Gaussian process library using TensorFlow.

    Journal of Machine Learning Research, 18(40):1–6, Apr. 2017.
  • [18] T. P. Minka. Expectation Propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
  • [19] Y. Mroueh, T. Poggio, L. Rosasco, and J.-J. Slotine. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems, pages 2789–2797, 2012.
  • [20] H. Nickisch and C. E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035–2078, Oct. 2008.
  • [21] A. Niculescu-Mizil and R. Caruana.

    Predicting good probabilities with supervised learning.

    In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 625–632. ACM, 2005.
  • [22] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
  • [23] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • [24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • [25] R. Rifkin, G. Yeo, T. Poggio, and Others. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
  • [26] A. Rudi, R. Camoriano, and L. Rosasco. Less is More: Nyström Computational Regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1657–1665. Curran Associates, Inc., 2015.
  • [27] A. Rudi, L. Carratino, and L. Rosasco. FALKON: An Optimal Large Scale Kernel Method. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3888–3898. Curran Associates, Inc., 2017.
  • [28] A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.
  • [29] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
  • [30] S. Si, C.-J. Hsieh, and I. S. Dhillon. Memory Efficient Kernel Approximation. In ICML, volume 32 of JMLR Proceedings, pages 701–709. JMLR.org, 2014.
  • [31] A. J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In ICML, pages 911–918. Morgan Kaufmann, 2000.
  • [32] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Process. Lett., 9(3):293–300, June 1999.
  • [33] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In D. A. Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, volume 5 of JMLR Proceedings, pages 567–574. JMLR.org, 2009.
  • [34] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1342–1351, 1998.
  • [35] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, V. Tresp, T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS, pages 682–688. MIT Press, 2000.
  • [36] B. Zadrozny and C. Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 694–699, New York, NY, USA, 2002. ACM.

Appendix A Extended calibration results

Reliability diagrams offer a visual representation of calibration properties, where accuracy is plotted as a function of confidence for the subsets . For a perfectly calibrated classifier, the accuracy function should be equal to the identity line, implying that . Large deviations from the identity line mean that the class probabilities are either underestimated or overestimated.

In Figure 6 we summarize the reliability diagrams for a number of binary classification datasets. Each row of diagrams corresponds to a particular dataset; and each column to one of the gp-based classification approaches that we have discussed in this work. In particular we consider the variational gp classification algorithm of [9] (gpc), our Dirichlet-based classification scheme (gpd), and gp regression on the labels without and with a Platt-scaling post-hot calibration step (gpr and gpr (platt)). Note that each one of these approaches produces a distribution of classifiers. Thus, in the diagrams of Figure 6

we show the reliability curve of the mean classifier (depicted as solid lines-points), along with the classifiers described by the upper and lower 95% quantiles of the predictive distribution (grey area). If a classifier is well-calibrated, then its reliability curve should be close to the identity curve (dashed line); the latter should also lie within the limits of the grey area.

For the results of Figure 6, we have considered subsets for different levels of confidence. We note that deviations from the identity curve should not deemed important, if these are not backed by a sufficient number of samples. For some datasets there are certain levels of confidence (middle section of htru2 for example) that contain very few data. In order to reflect this behavior, we also plot the histograms showing the proportion of the test set that corresponds to each confidence level.

A careful inspection of Figure 6 suggests that both gpc and gpd produce well-calibrated models. The gpr approach on the other hand tends to produce a sigmoid-shaped reliability curve, which suggests that there is underestimation of the class probabilities. This behavior is cured however by performing calibration via Platt-scaling, as we see for the gpr (platt) method.

These conclusions are further supported by Figure 7, which summarizes the reliability plots for a number of multi-class datasets. In the multi-class case, it is not obvious how to concisely summarise the effect of the predictive distribution, so we resort to simple reliability plots of the average classifier for each method.

As a final remark, we note that the predictive distribution of the gpd approach is different from that obtained by gpc. In fact, judging form the upper and lower quantile classifiers as presented in Figure 6, it appears that gpd results in a narrower predictive distribution, which is nevertheless well-calibrated. An extended characterization of the predictive distribution for gpd is subject of future work.

Figure 6: Calibration results for four different gp

-based classification approaches on four binary classification datasets. The bounds correspond to the classifiers given by the 95% confidence interval of the posterior

Figure 7: Calibration results for three multiclass classification datasets.