Faster variational inducing input Gaussian process classification

by Pavel Izmailov, et al.

Gaussian processes (GPs) provide a prior over functions and allow finding complex regularities in data. They are successfully used for classification and regression problems as well as for dimensionality reduction. In this work we consider the classification problem only. The complexity of standard methods for GP-classification scales cubically with the size of the training dataset, which makes them inapplicable to big data problems. A variety of methods have therefore been introduced to overcome this limitation. In this paper we focus on methods based on so-called inducing inputs. This approach relies on variational inference and proposes a particular lower bound on the marginal likelihood (evidence). The bound is then maximized w.r.t. the parameters of the kernel function of the Gaussian process, thus fitting the model to data. The computational complexity of this method is O(nm^2), where m is the number of inducing inputs used by the model and is assumed to be substantially smaller than the size n of the dataset. Recently, a new evidence lower bound for the GP-classification problem was introduced. It allows stochastic optimization, which makes it suitable for big data problems. However, the new lower bound depends on O(m^2) variational parameters, which makes optimization challenging when m is big. In this work we develop a new approach for training inducing input GP models for classification problems. We use a quadratic approximation of several terms in the aforementioned evidence lower bound, obtaining analytical expressions for the optimal values of most of the parameters, thus substantially reducing the dimension of the optimization space. In our experiments we achieve results as good as or better than those of the existing method. Moreover, our method doesn't require the user to manually set the learning rate, making it more practical than the existing method.





1 Tractable evidence lower bound for GP-classification

In the previous section we described the svi method. It is based on stochastic optimization of the lower bound (11) for the marginal likelihood, and the bound itself can be computed in O(nm^2) operations. However, the bound depends on O(m^2) variational parameters, which makes the optimization problem hard to solve when a large number of inducing points is needed.

For GP-regression the situation is similar. Paper [1] describes a method analogous to the svi method for classification; the only difference is that the lower bound becomes tractable in the regression case. The paper [6] addresses the problem of the large number of parameters in the algorithm of [1] in the following way. In the case of regression the lower bound (11) can be optimised analytically with respect to the variational parameters. Doing so and substituting the optimal values back into the lower bound, one obtains a new lower bound on the marginal likelihood that depends solely on the kernel hyper-parameters. This dramatically reduces the number of optimization parameters and thereby simplifies the optimization problem. Unfortunately, the new bound does not have the form of a ''sum over objects'', hence stochastic optimization methods are no longer applicable. However, in our experiments we found that even for fairly big datasets the method of [6] outperforms [1] despite the lack of stochastic optimization.

In the following subsection we devise an approach similar to the method of [6] for the case of classification. We provide a tractable evidence lower bound and analytically maximize it with respect to the variational parameters. Substituting the optimal values of these parameters back into the lower bound, we obtain a new bound that depends only on the kernel hyper-parameters.

Global evidence lower bound

In order to derive a tractable lower bound for (11), we seek a quadratic approximation to the log-logistic function log σ(x), where σ(x) = 1 / (1 + e^(−x)). The paper [4] provides a global parametric quadratic lower bound for this function:

log σ(x) ≥ log σ(ξ) + (x − ξ)/2 − λ(ξ)(x² − ξ²),

where λ(ξ) = tanh(ξ/2) / (4ξ) and ξ is a parameter of the bound. This bound is tight when x = ±ξ.
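The Jaakkola–Jordan bound can be checked numerically. The sketch below (function names are ours) assumes the standard form of the bound, log σ(x) ≥ log σ(ξ) + (x − ξ)/2 − λ(ξ)(x² − ξ²) with λ(ξ) = tanh(ξ/2)/(4ξ), and verifies that it lies below log σ everywhere and touches it at x = ±ξ:

```python
import math

def log_sigmoid(x):
    # Numerically stable log sigma(x) = -log(1 + exp(-x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def jj_lambda(xi):
    # lambda(xi) = tanh(xi / 2) / (4 xi), defined for xi != 0.
    return math.tanh(xi / 2.0) / (4.0 * xi)

def jj_bound(x, xi):
    # Global quadratic lower bound of Jaakkola and Jordan [4].
    return log_sigmoid(xi) + (x - xi) / 2.0 - jj_lambda(xi) * (x * x - xi * xi)

xi = 1.5
# Never exceeds log sigma(x) on a grid of test points ...
assert all(jj_bound(t / 10.0, xi) <= log_sigmoid(t / 10.0) + 1e-12
           for t in range(-60, 61))
# ... and is tight at x = +xi and x = -xi.
assert abs(jj_bound(xi, xi) - log_sigmoid(xi)) < 1e-12
assert abs(jj_bound(-xi, xi) - log_sigmoid(-xi)) < 1e-12
```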

Substituting this bound back into (11) with a separate value of the bound's parameter for every data point, we obtain a tractable lower bound


Differentiating with respect to the variational parameters and setting the derivatives to zero, we obtain


Substituting the optimal values of the variational parameters back into the lower bound and omitting the terms that do not depend on the remaining parameters, we obtain a compact lower bound



In the following we consider three different methods for maximizing this compact lower bound.

Require: training data, initial kernel hyper-parameters
  repeat
     for a fixed number of inner iterations do {stage 1: updating the variational and bound parameters}
        recompute the bound parameters and the variational parameters analytically {see (12), (13)}
     end for
     update the kernel hyper-parameters: minimize(compact lower bound, method='L-BFGS-B', with a bounded number of function evaluations) {stage 2: updating the hyper-parameters}
  until convergence
Algorithm 1 vi-JJ method

Note that, given the values of the bound parameters and the kernel hyper-parameters, we can maximize the bound with respect to the variational parameters analytically. The optimal values are given by


The remaining quantities were defined in (10). In our first method we use the analytical formulas (15) to recompute the values of the variational parameters and use gradient-based optimization to maximize the bound with respect to the kernel hyper-parameters. The pseudocode is given in Alg. 1. We refer to this method as vi-JJ, where JJ stands for Jaakkola and Jordan, the authors of [4]. Note that the computational complexity of one iteration of this method is O(nm^2), the same as for the svi method.
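Algorithm 1 alternates closed-form updates with gradient-based steps. The structure of such a two-stage loop can be illustrated on a toy concave objective (the objective and all names below are ours for illustration, not the paper's bound): one block of variables gets a closed-form update, the other a numerical ascent step, and the stages repeat until convergence.

```python
def toy_objective(theta, v):
    # Toy concave objective; theta and v stand in for the kernel
    # hyper-parameters and the variational parameters respectively.
    return -(v - theta) ** 2 - (theta - 3.0) ** 2

def optimal_v(theta):
    # Stage 1: closed-form maximizer of the objective over v.
    return theta

def alternating_loop(theta=0.0, lr=0.1, n_outer=200):
    for _ in range(n_outer):
        v = optimal_v(theta)                      # stage 1: analytic update
        grad = -2.0 * (theta - v) - 2.0 * (theta - 3.0)
        theta += lr * grad                        # stage 2: gradient ascent step
    return theta, optimal_v(theta)

theta, v = alternating_loop()
# The loop converges to the joint maximizer theta = v = 3.
assert abs(theta - 3.0) < 1e-6 and abs(v - 3.0) < 1e-6
```

In vi-JJ the stage-2 step is performed by L-BFGS-B rather than a plain gradient step, but the alternation pattern is the same.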

Our second method uses gradient-based optimization to maximize the compact bound with respect to both the bound parameters and the kernel hyper-parameters. Note that in this method we do not have to recompute the variational parameters at each iteration, which empirically makes the method's iterations faster for big values of m. We refer to this method as vi-JJ-full.

Finally, vi-JJ-hybrid is a combination of the two methods described above. The general scheme of this method is the same as for vi-JJ. In the vi-JJ-hybrid method we use analytical formulas to recompute the variational parameters at stage 1, as we do in the vi-JJ method, but at stage 2 we use gradient-based optimization with respect to both the bound parameters and the kernel hyper-parameters. The virtues of this method are described in the experiments section.

Tractable local approximation to the evidence lower bound

Another way to obtain a tractable approximation to the lower bound (11) is to use a local quadratic approximation of the log-logistic function log σ. Here we perform a second-order Taylor expansion of this function at chosen expansion points:


The following derivation is analogous to the derivation in the previous section. Substituting the approximation (16) into the lower bound (11), we obtain

Here the approximation involves a diagonal matrix whose entries are determined by the second derivative of the log-logistic function at the corresponding expansion points.
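The second-order Taylor expansion of the log-logistic function is easy to write down explicitly, since (log σ(x))′ = 1 − σ(x) and (log σ(x))″ = −σ(x)(1 − σ(x)). A minimal numerical sketch (function names are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_sigmoid(x):
    # Numerically stable log sigma(x).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def taylor2(x, z):
    # Second-order Taylor expansion of log sigma around the point z.
    s = sigmoid(z)
    return (log_sigmoid(z)
            + (1.0 - s) * (x - z)
            - 0.5 * s * (1.0 - s) * (x - z) ** 2)

# Exact at the expansion point and accurate nearby; unlike the
# Jaakkola-Jordan bound it is NOT a global lower bound.
z = 0.7
assert abs(taylor2(z, z) - log_sigmoid(z)) < 1e-12
assert abs(taylor2(z + 0.1, z) - log_sigmoid(z + 0.1)) < 1e-3
```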

Differentiating the approximate bound with respect to the variational parameters and the expansion points, and setting the derivatives to zero, we obtain the following formulas for the optimal values of these parameters:



and is a vector composed of

Substituting the optimal values given by (17), (18) back into the approximate bound, and omitting the terms that do not depend on the remaining parameters, we obtain the following approximate lower bound:



Note that the lower bound (19) is not a global lower bound on the log-evidence. However, locally it provides a good approximation to the evidence lower bound (11).

For maximizing the approximate lower bound (19) we consider a method analogous to vi-JJ: we simply substitute the new bound into the second stage of Alg. 1. We refer to this method as vi-Taylor. The computational complexity of one iteration of this method is once again O(nm^2).

2 Experiments

In this section we empirically compare the derived vi-JJ, vi-Taylor, vi-JJ-full and vi-JJ-hybrid methods with svi. Below we describe the setting of the experiments and discuss their results.

Experimental setting

Fig. 4: Methods performance on small datasets

In our experiments we compared the following methods for variational inducing point GP-classification:

  • svi-AdaDelta uses the AdaDelta optimization method for maximization of the lower bound (11), as is done in the paper [2];

  • vi-JJ was described in section 1;

  • vi-Taylor was described in section 1;

  • vi-JJ-full was described in section 1;

  • vi-JJ-hybrid was described in section 1.

We also tried using the deterministic L-BFGS-B optimization method for maximizing the evidence lower bound (11), but it worked substantially worse than all the other methods. Note that all the methods have the same computational complexity per epoch. Table 1 shows which variables are optimized numerically and which are optimized analytically for each method.

Method              Numerically optimized                            Analytically optimized
svi-AdaDelta        variational parameters, kernel hyper-parameters  —
vi-JJ, vi-Taylor    kernel hyper-parameters                          variational and bound parameters
vi-JJ-hybrid        kernel hyper-parameters, bound parameters        variational parameters (stage 1)
vi-JJ-full          kernel hyper-parameters, bound parameters        variational parameters
Table 1: Methods outline

In our experiments we did not optimize the lower bound with respect to the positions of the inducing points. Instead, we used the k-means clustering procedure with k equal to the number of inducing inputs and took the cluster centres as the inducing points. We used the squared exponential covariance function (see section Model adaptation) with a Gaussian noise term in all experiments.
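The inducing-input initialization described above can be sketched as follows. This is a plain Lloyd's k-means in pure Python (our own minimal implementation, not the paper's code); the resulting cluster centres serve as the inducing inputs:

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    # Plain Lloyd's algorithm; `points` is a list of equal-length tuples.
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centres[c])))
            clusters[j].append(p)
        # Recompute centres as cluster means (keep old centre if empty).
        centres = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                   else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres

# Toy 2-D dataset; the k returned centres are used as inducing inputs.
data = [(0.0, 0.1), (0.1, -0.1), (-0.1, 0.0),
        (10.0, 10.1), (10.1, 9.9), (9.9, 10.0)]
inducing_inputs = kmeans(data, k=2)
assert len(inducing_inputs) == 2
```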

The stochastic svi-AdaDelta method requires the user to manually specify the learning rate and the batch size for the optimization method. For the former we had to run the method with different learning rates and choose the value that resulted in the fastest convergence. We used learning rates from a fixed grid. In every case the method diverged for the largest value on the grid and converged more slowly for the smallest value than for some intermediate value, verifying that the optimal learning rate lay somewhere within the range. To choose the batch size we used the following convention: for the small german and svmguide datasets we used a small fixed batch size, while for the other datasets we set the batch size to a fixed fraction of the training set size.

Fig. 5: Methods performance on medium datasets

For vi-JJ, vi-Taylor and vi-JJ-hybrid, in all of the experiments we recomputed the variational and bound parameters a fixed number of times on every iteration (the inner loop in algorithm 1). To tune the kernel hyper-parameters, on every iteration we ran the L-BFGS-B optimization method constrained to a limited number of evaluations of the lower bound and its gradient. We found that these settings work well for all the datasets we experimented with.

For the svi-AdaDelta method we optimized w.r.t. the Cholesky factor of the covariance matrix in order to maintain its positive definiteness, as described in [2]. We used the AdaDelta implementation from the climin toolbox [7], as is done in the original paper.

For every dataset we experimented with a number of inducing points to verify that the results of the methods are close to optimal.

We evaluate the methods by plotting the accuracy of their predictions on the test data against time; all of the plots follow the same format.

We also preprocessed all the datasets by normalizing the features, setting the mean of every feature to 0 and the variance to 1. For datasets without an available test set, we held out part of the data for testing and used the rest for training.
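The preprocessing step above amounts to standard feature scaling; a minimal pure-Python sketch (our own helper name):

```python
def standardize(columns):
    # Scale each feature column to zero mean and unit variance.
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = var ** 0.5 or 1.0   # guard against constant features
        scaled.append([(x - mean) / std for x in col])
    return scaled

cols = standardize([[1.0, 2.0, 3.0, 4.0]])
mean = sum(cols[0]) / 4
var = sum(x ** 2 for x in cols[0]) / 4
assert abs(mean) < 1e-12 and abs(var - 1.0) < 1e-12
```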

Results and discussion

Fig. 6: Methods performance on big datasets

We compared the methods’ performance on seven datasets. Here we discuss the results.

Fig. 4 provides the results for the german and svmguide datasets. As we can see, on the small german dataset the stochastic svi-AdaDelta method struggles and takes longer to achieve the optimal quality than all the other methods, which show similar results. On the svmguide dataset vi-JJ-full and svi-AdaDelta take a little longer to converge, while the other three methods show roughly the same performance.

The results on the magic telescope and ijcnn datasets are provided in fig. 5. On the magic telescope dataset vi-JJ and vi-Taylor show poor quality on the first iterations, but still manage to converge faster than svi-AdaDelta. On both datasets vi-JJ-hybrid behaves similarly to vi-JJ and vi-Taylor, but shows better quality on the first iterations on the magic telescope data. vi-JJ-full fails to converge to a reasonable quality on either dataset.

Fig. 6 provides the results on the big cod-rna and skin_nonskin datasets. On these datasets vi-JJ-full once again fails to achieve a reasonable quality, while the other methods perform similarly.

Finally, the results on the a8a data are provided in fig. 7. Here we use a rather large number of inducing inputs. As we can see, vi-JJ-full and vi-JJ-hybrid are the fastest to achieve the optimal quality. The svi-AdaDelta method also converges reasonably fast, while vi-JJ and vi-Taylor struggle a little.

In general, the vi-JJ, vi-Taylor and vi-JJ-hybrid methods perform similarly to svi-AdaDelta. On the big low-dimensional skin_nonskin dataset vi-JJ-hybrid is a little slower than the stochastic svi-AdaDelta, but on all the other datasets it is faster. vi-Taylor and vi-JJ struggle on a8a, but are otherwise comparable to vi-JJ-hybrid. The stochastic svi-AdaDelta method performs poorly on small datasets, and even on the big skin_nonskin data it does not manage to substantially outperform the other methods, even when provided with a good learning rate. Finally, vi-JJ-full works well on small data and on a8a, but on all the other datasets it fails to achieve a reasonable quality.

Fig. 7: Methods performance on the a8a dataset

3 Conclusion

In this paper we presented a new approach to training variational inducing input Gaussian process classification models. We derived two new tractable evidence lower bounds and described several ways to maximize them. The resulting methods vi-JJ, vi-JJ-full, vi-JJ-hybrid and vi-Taylor are similar to the method of [6] for GP-regression.

We provided an experimental comparison of our methods with the current state-of-the-art method svi-AdaDelta of [2]. In our experimental setting, our approach proved to be more practical, as it converges to the optimal quality as fast as the svi-AdaDelta method without requiring the user to manually choose the parameters of the optimization method.

The four described vi methods showed similar performance, and it is hard to distinguish between them. However, note that the vi-Taylor approach is more general and can be applied to likelihood functions other than the logistic. One could also easily derive methods similar to vi-JJ-hybrid and vi-JJ-full for the non-logistic case, but this is out of the scope of this paper.

References

  • [1] J. Hensman, N. Fusi, and N. D. Lawrence. 2013. Gaussian processes for big data. Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence. P. 282–290.
  • [2] J. Hensman, G. Matthews, and Z. Ghahramani. 2015. Scalable variational Gaussian process classification. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.
  • [3] T. Jaakkola. 2000. Tutorial on variational approximation methods. Advanced Mean Field Methods: Theory and Practice. P. 129–159.
  • [4] T. Jaakkola and M. Jordan. 1997. A variational approach to Bayesian logistic regression models and their extensions. Proceedings of the 1997 Conference on Artificial Intelligence and Statistics.
  • [5] C. E. Rasmussen and C. K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
  • [6] M. Titsias. 2009. Variational learning of inducing variables in sparse Gaussian processes. Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics. Vol. 5, P. 567–574.
  • [7] Climin. URL:
  • [8] A. Smola and P. Bartlett. 2001. Sparse greedy Gaussian process regression. Neural Information Processing Systems.
  • [9] J. Quinonero-Candela and C. Rasmussen. 2005. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research. Vol. 6, P. 1939–1959.
  • [10] L. Csato and M. Opper. 2002. Sparse online Gaussian processes. Neural Computation. Vol. 14, P. 641–668.