Bayesian inference provides a way of making use of the complete information available either as data or prior knowledge. It enables us to capture the uncertainty present in the data and model assumptions, and propagate it to further tasks such as prediction and decision making. However, exact Bayesian inference is only possible whenever mathematically convenient priors are combined with particular likelihood functions (e.g. conjugacy). Deviation from such convenience, results in intractable calculations that call for approximations.
The Laplace approximation (LA) has helped the advancement of Bayesian methodology in the machine learning field and has also been an important tool for practitioners . LA produces a Gaussian approximating posterior. In doing so, it operates myopically in the sense that it determines the mean and covariance simply by looking locally around the mode instead of focusing on where the volume of the density actually lies. Despite this shortcoming, it has been found to produce good approximations in a variety of contexts. In a Gaussian process binary classification setting , LA is found to be on a par with other approximations in terms of error rate, though it performed poorer on other criteria. The work in  employs LA in order to calculate approximate intractable integrals within the Expectation Propagation algorithm . In  LA is found to perform well when compared to Expectation Propagation in a bounded regression task. In 
LA is used to formulate an approximate form of the marginal likelihood that facilitates the update of hyperparameters without the need to update the current Gaussian posterior given by LA.
Variational inference has been put forward (VI) [1, 25] as a solution to calculating posterior distributions in situations where certain expectations are not analytically tractable. Its use has been widespread in eliciting posterior densities in e.g. dimensionality reduction , classification , regression 
, density estimation and in specialised applications like in astronomy . The applicability of VI depends on choosing a posterior density form that allows the Kullback-Leibler divergence () between the approximating and the true (unnormalised) posterior to be calculated in closed form. However, such a choice may not always exist.
Whenever it is not possible to obtain a closed-form within the VI framework, approximations become necessary. One type of approximation approximates the logarithmic (unnormalised) posterior. In  the logarithmic posterior is linearised via a first-order Taylor expansion which allows then calculating the expectation with respect to the approximating posterior in the . In a similar vein, second order Taylor expansions are considered in . Interestingly,  considers multiple second order Taylor expansions at different parameter locations of the logarithmic posterior. This results in an approximate posterior density expressed as a mixture of spherical Gaussians that has the potential to capture multiple modes. A second type of approximation [18, 21, 24, 13] approximates the
as a Monte Carlo average with samples drawn from the approximating posterior. The resulting expression allows calculating the gradient with respect to the free variational parameters. An update of the variational parameters follows, typically using a small step size in a stochastic gradient descent setting, after which theis approximated with a new Monte Carlo average. In a slightly manner, [10, 6] fix the Monte Carlo average approximation of the throughout the optimisation of the variational parameters. We finally note that often in practice e.g. [18, 24], VI methods choose to work with a factorised Gaussian posterior which has the advantage of reducing the number of free variational parameters that need to be optimised, but also has the inevitable disadvantage of discarding potential parameter correlations in the posterior.
In this work, we propose a method that combines the Laplace approximation with the variational approximation. The method works on non-conjugate models, captures a-posteriori correlations and limits the number of free variational parameters. The main idea is to take the Gaussian posterior obtained from the Laplace approximation, plug it into the variational lower bound and adapt it by optimising the lower bound. The crux of the approach is to allow only a partial update of the Laplace Gaussian posterior.
2 Approximate Bayesian Inference
We briefly review methods for approximate inference as a gentle reminder and for the purpose of introducing relevant notation. We write the log-posterior as the sum of the model log-likelihood, log-prior and evidence:
where are the data, are the model parameters and . The log-likelihood and log-prior terms have hyperparameters and which are hereafter jointly summarised as in the log-posterior. In the following, the evidence is a constant which we discard. Discarding it, gives us the unnormalised log-posterior .
2.1 Laplace approximation
The Laplace approximation (LA) seeks the mode111Multiple modes may be present. of the log-posterior density where . This may be carried out with gradient-based optimisation. At the found mode, we calculate the Hessian matrix . The obtained approximating Gaussian posterior reads:
We see that the covariance of the approximating posterior is given by the local curvature of the posterior at the found mode. The approximation can be good, if the true posterior concentrates strongly around the mode.
2.2 Variational Inference
In general, VI does not require that is a Gaussian, but here we choose to work with . For this choice, the above objective now reads as:
where the second term is the Gaussian entropy. The free parameters in objective (4) are the variational parameters , and hyperparameters .
2.3 Stochastic variational inference
Stochastic variational inference  addresses the difficulty that arises when the expectation in the first term of (4) is not tractable. It does so by approximating the expectation by a Monte Carlo average with samples drawn from :
The variational parameters no longer appear in the approximation, but it is possible to reintroduce them using the reparametrisation , in terms of samples :
where is a matrix222A common choice is the Cholesky decomposition. such that . The free parameters in objective (6) are , and . We note that, typically, one chooses covariance to be a diagonal matrix (e.g. [18, 24]) in order to limit the number of free variational parameters to be optimised. In this case, is a factorised posterior.
3 Proposed method
The motivation behind this work is to apply VI on non-conjugate models using an approximating posterior that captures a-posteriori correlations but at the same time limits the number of free variational parameters that need to be optimised. To that end, we make use of the covariance of the approximating posterior obtained via LA and the approximate variational lower bound in (6). Since the proposed method combines LA with the variational lower bound, we name it mixed variational inference (MVI). In the following, we propose three ways that MVI can exploit the correlation structure present in .
3.1 Adaptation of mean only - MVI
We perform the following Cholesky decomposition:
We propose the posterior and use it in the approximate variational lower bound in (6), which results in the following objective:
The free parameters in (8) are the mean and hyperparameters . Effectively, the proposed posterior is the Laplace posterior with the added flexibility of shifting its mean while keeping its covariance fixed to . Note, that here the entropy is a constant term that can be discarded during optimisation.
3.2 Mean and scaling of covariance - MVIeig
We perform the following eigenvalue decomposition:
hold the eigenvectors and square roots of the eigenvalues respectively333Operator creates a diagonal matrix using the vector it is applied to. Notation implies raising the components of vector to the power of .. We propose and optimise:
The free parameters in (10) are , and . Note the simplification in the entropy term due to the orthogonal , i.e. . Effectively, the proposed posterior is the Laplace posterior which now has the added flexibility to shift the mean and scale the covariance matrix along its axes by adapting vector .
3.3 Mean and low rank update of covariance - MVIlr
We introduce the vectors . We use the Cholesky decomposition and form the matrix . We propose the posterior and the associated objective:
The free parameters in (11) are , , and . Effectively, the proposed posterior is the Laplace posterior which now has the added flexibility to shift the mean but also modify its covariance matrix via a low-rank update.
The proposed posteriors are summarised in Table 1.
We use the mean and optimised hyperparameters obtained from the Laplace approximation to initialise the mean in and hyperparameters in each of the three proposed objectives. We emphasize that the covariance in MVI is initialised to and fixed. Vector in MVIeig is initialised to the square root of the eigenvalues . Vectors , in MVIlr are randomly initialised by drawing them from .
Following [10, 6] we draw number of samples which we keep fixed throughout the optimisation of the objectives in (8), (10) and (11). This enables the use of scaled-conjugate gradients (SCG) as the optimisation routine in contrast to the typically employed stochastic gradient descent444We note that in  the use of stochastic gradient is additionally motivated by the desire to train with “mini-batches”.. We note that the proposed method can in principle also employ the same optimisation scheme as in . The free parameters , and the ones pertaining to the covariance in each proposed posterior are jointly optimised via SCG. In all experiments we fix the number of drawn samples to .
4 Numerical setup
The proposed work builds on LA and VI in order to improve the performance (see section 4.2) of the Laplace approximation and do better than VI when employing a factorised Gaussian posterior. Specifically, the proposed posterior for the latter reads and has a diagonal covariance matrix whose elements are specified by the vector . The associated objective reads:
The free parameters in (12) are , and . We refer to this method as VIdiag.
In the numerical experiments, we initialise the mean and hyperparameters with and . Regarding , we experimented with two initialisations: either setting the elements of equal to the diagonal elements of , or all equal to . In the experiments of Section 5 we report for VIdiag the best performance achieved by either initialisation.
4.2 Measuring performance
In the experiments of Section 5, we measure performance in terms of the logarithmic predictive density (LPD) (i.e. marginal log-likelihood) evaluated on test data:
The LPD is approximated by number of samples drawn from , where is the respective posterior obtained via LA, MVI or VIdiag. In all numerical experiments we use .
Along LPD, we also report error rates. For the regression problem we report the mean squared error (MSE). For the classification problems, the error rate is the percentage of predicted labels not matching the true labels.
We compare MVI to LA and VIdiag on a number of datasets as detailed in the corresponding sections. Each dataset is split times into a training and testing set. The algorithms are run on each split, hence, we collect samples of the algorithms’ performance in terms of LPD and error rate on the test set. For each dataset, we report the median LPD and median error rate on the test set for each algorithm. The best performance is marked with bold in the tables reporting the results.
Moreover, we attempt to detect whether the observed differences in median, over the
collected performances, are statistically significant. In the experiments, we observed that the collected performances are not normally distributed which precludes the use of a paired T-test. The Wilcoxon signed rank test is also precluded as it requires[4, Chapter 4.7] that the distribution of the difference in median of the tested pairs is symmetric. Therefore, we resort to using a sign test [4, Chapter 2.52]
, to check whether a difference in median performance exists (i.e. better or worse, but not by how much), and the confidence intervals constructed by the bootstrap.
We test whether the performance of the best algorithm (marked in the tables with bold) is statistically significantly better by checking two conditions: we pair the best algorithm with all other algorithms and perform the sign test. The first condition is satisfied if for each pair
, the sign test rejects the null hypothesis that the median of the best algorithm is equal to the median of its respective paired algorithm. The second condition is satisfied if theconfidence interval constructed by the bootstrap on the difference of the paired medians does not contain the value. That is, if the confidence interval does not contain , then is not a likely value for the difference in the true medians. If both conditions are satisfied, we declare the best performance as statistically significant and mark it with a marker in the respective tables.
We first demonstrate the behaviour of the proposed MVI posteriors on two synthetic examples. We then proceed with experiments on benchmark problems.
5.1 Illustration with 2D posterior
We illustrate the MVI posteriors on a synthetic 2D example where we specify the true, target posterior as a mixture of two Gaussian components:
Fig. 1 displays a contour plot of . We approximate with LA and the three proposed MVI posteriors and plot them in Fig. 1. The figure also reports the of each approximating posterior to the true posterior. Here is calculated numerically as there is no, at least not straightforward, closed-form expression for it. We observe that LA (black), by design, places its mean on the mode of . All other approximations place their means on alternative locations, but fairly close to one another. Even MVI (red), whose covariance is constrained to be equal to that of LA, can shift its mean to a better location so that it covers more of the target density. We also note how the axes of MVIeig (green) are parallel by design, but scaled compared to the axes of MVI (red), i.e. we see that the red ellipse of MVI is contained in the green ellipse of MVIeig. By scaling its axes, MVIeig achieves a lower . MVIlr has the flexibility of rotating its covariance and, in this case, achieves the lowest to the true posterior .
5.2 Robust regression on synthetic task
We experiment with a regression task with input-target pairs , . Inputs are drawn uniformly in . Targets are generated through the expression
and corrupted with i.i.d. noise drawn from the uniform distribution with support. This is a regression task where adopting a Gaussian likelihood would lead to poor results as it cannot adequately explain the noise. We adopt a Cauchy density instead. The unnormalised log-posterior reads:
where is the Cauchy density. Additionally, we have calculated a set of radial basis functions on the data inputs:
where . The last element in (17) serves as a bias term. Hence, and with .
In this numerical experiment, we generate datasets with training data items using (15). We also generate test data items in precisely the same way. We report the median log-predictive density (LPD) on test data in Table 2. Best performances are marked with bold. We see that the MVIeig approximation performs the best, hence we mark it in bold. When looking at the results, we established that MVIeig performs statistically significantly better than LA, MVI and VIdiag. However, as MVIeig does not outperform MVIlr with statistical significance, we do not mark it additionally with marker. In terms of error rate, we see that MVIlr performs best and marginally better than MVIeig. Finally, in figure 2 we show the true underlying curve, specified in (15), the observed training data as filled circles along with the regressions induced by MVI and MVIeig.
5.3 Logistic Regression
Median LPD on test data for logistic regression overruns on the datasets (higher is better).
We experiment with logistic regression [2, Chapter ]. The data are input-label pairs with . Just like in Section 5.2, equation (17), we calculate a set of radial basis functions on the data inputs . The weights are given by with . The unnormalised log-posterior reads:
To avoid inadvertently selecting single datasets on which the proposed algorithm performs well, we experiment with the entire collection of datasets preprocessed by Rätsch et al555http://www.raetschlab.org/Members/raetsch/benchmark. Each dataset has been standardised and split into training and testing instances, except for Image and Splice that have 20 splits. We approximate the log-posterior in (18) with LA, the MVI posteriors and VIdiag. To initialise the hyperparameters in Laplace, we proceed as follows: per dataset, we run LA for iterations for each combination of and randomly drawn pairs , , i.e. a total of combinations. The centres
are determined by K-means for each choice of. The combination with the highest lower bound (we are maximising) is declared the winner and used to initialise LA which is then run for a maximum of iterations. The MVI posteriors and hyperparameters in the respective objectives are initialised using the optimised Laplace posterior and hyperparameters, as described in Section 3.4.
We report the median log-predictive density (LPD) on test data in Table 3, along with details about the datasets. Best performances are marked with bold. Best performances that differ in a statistically significant way to all other performances (see Section 4.2) are additionally marked with a marker. Table 3 reveals that, in general, the proposed MVI posteriors perform better than LA or VIdiag. In particular, we see that MVIlr scores better on a number of datasets and that the difference in performance is often statistically significant. Table 4 displays the results on error rates. We see that all methods achieved more or less the same error rates with no performance being statistically significantly superior. Nonetheless, we do observe a few exceptions, e.g. on datasets Splice and Diabetis LA and VIdiag respectively perform noticeably poorer.
5.4 Multiclass Logistic Regression
Similarly to logistic regression, multiclass logistic regression [2, Chapter ] does not allow direct Bayesian inference as the use of the softmax function renders integrals over the likelihood term intractable. The unnormalised log-posterior reads:
where denotes the total number of classes. The data are input-label pairs with . Vectors are binary vectors encoding class labels using a -of- coding scheme, e.g. encodes class label in a
-class problem. The probabilityof the -th data item belonging to class is modelled via the softmax function:
where each class is associated with a weight vector , with . The basis functions are defined in the same way as in Section 5.2. We initialise hyperparameters in the same way as described in Section 5.3.
To avoid inadvertently selecting single datasets on which the proposed algorithm performs well, we experiment with the collection of multiclass datasets used in the work of  in a different context. Details of the datasets are shown in Table 5. We standardise the data column-wise to zero mean and unit standard deviation. Using random subsampling, we split each dataset times into a training ( of the data) and testing () set. We report the median LPD for each dataset and algorithm in Table 5 and median error rate in Table 6. Again, best performances are marked in bold. We mark the best performance with a marker if it is found to be statistically significant using the same two conditions described in Section 4.2. The results show an improvement over LA and the use of a factorised posterior in VIdiag. We also see that MVIeig performs well on this set of problems in terms of LPD, though the picture is not as clear in terms of error rate in Table 6. Finally, we note the low performance of all methods on the dataset Crab, evidently in Table 6. This may be perhaps attributed to the particular choice of the RBF kernel made here, though other kernels (cf ) may be more appropriate.
6 Discussion and Conclusion
We proposed Mixed Variational Inference (MVI) as a method for approximate Bayesian inference in non-conjugate models. MVI makes use of the posterior obtained via the Laplace approximation and the objective function provided by variational inference. The adoption of the Laplace posterior helps with capturing a-posteriori correlations; the partial adaptation of the Laplace posterior, in the form of the proposed MVI posteriors, helps limit the number of free variational parameters that need to be optimised. The numerical results show that the MVI posteriors have the potential to improve on the performance of the Laplace approximation and on the performance of the commonly adopted factorised Gaussian posterior in variational inference.
Strictly speaking, however, one should be aware of the fact that a posterior that approximates the true posterior better, does not necessarily guarantee improved log-predictive density; vice versa, a “naive” approximating posterior (e.g. factorised) may in principle provide satisfactory predictive performance. This observation has been previously stated in 
where a variety of approximations are evaluated in the context of Gaussian process binary classification. Therein it is stated that, in principle, even a poor approximation in terms of posterior moments can still provide good predictions. After all, as far as variational approximations are concerned, it is evident in objective (3) that the goal is to find a that is as close as possible to the true posterior; this does not necessarily correlate with improved predictive performance. Nevertheless, one does expect in practice that an approximation that captures posterior correlations in the parameters to be more useful than a factorised approximation that practically draws the parameters independently of one another when making predictions (see (13)). But beyond this expectation, it is admittedly difficult to anticipate what approximation may perform best. Indeed, in the numerical experiments we notice that while MVIlr seems well suited for logistic regression (see Table 3), this is not necessarily the case in multiclass logistic regression (see Table 5).
In its present form, MVI is limited to Gaussian posteriors. It would be interesting to extend MVI to non-Gaussian posteriors, though at first sight it seems that its dependence on the Laplace approximation considerably limits it. An interesting direction, inspired by , would be to postulate a posterior based on a mixture of Gaussians, where the covariance of each Gaussian comes from a Laplace approximation performed at a different mode. Forming an approximating posterior using multiple modes procured by the Laplace approximation has been previously suggested in . However, therein no objective function akin to (6) is guiding the inference of the posterior. This could be potentially addressed by extending MVI so that is now a Gaussian mixture whose covariance matrices are given by the Laplace approximation and partially updated as suggested in Sections 3.1, 3.2, 3.3. We reserve such investigations for future research.
The author acknowledges the useful discussions and encouragement of Christoph Schnörr, Kai Polsterer and Ata Kaban.
-  M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
-  C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
-  Michael A Chappell, Adrian R Groves, Brandon Whitcher, and Mark W Woolrich. Variational bayesian inference for a nonlinear forward model. IEEE Transactions on Signal Processing, 57(1):223–236, 2009.
-  Peter Dalgaard. Introductory statistics with R. Springer Science & Business Media, 2008.
-  Anthony Christopher Davison, David Victor Hinkley, et al. Bootstrap methods and their application, volume 1. Cambridge university press, 1997.
Nicolas Depraetere and Martina Vandebroek.
A comparison of variational approximations for fast inference in mixed logit models.Computational Statistics, 32(1):93–125, 2017.
-  Eleazar Eskin, Alex J Smola, and SVN Vishwanathan. Laplace propagation. In Advances in neural information processing systems, pages 441–448, 2004.
-  Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.
-  Samuel J Gershman, Matthew D Hoffman, and David M Blei. Nonparametric variational inference. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 235–242. Omnipress, 2012.
-  Nikolaos Gianniotis, Christoph Schnörr, Christian Molkenthin, and Sanjay Singh Bora. Approximate variational inference based on a finite sample of gaussian latent variables. Pattern Analysis and Applications, 19(2):475–485, May 2016.
-  Bjørn Sand Jensen, Jens Brehm Nielsen, and Jan Larsen. Bounded gaussian process regression. In Machine Learning for Signal Processing (MLSP), 2013 IEEE International Workshop on, pages 1–6. IEEE, 2013.
-  Ata Kabán. On bayesian classification with laplace priors. Pattern Recognition Letters, 28(10):1271–1282, 2007.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
David JC MacKay.
A practical bayesian framework for backpropagation networks.Neural computation, 4(3):448–472, 1992.
Thomas P Minka.
Expectation propagation for approximate bayesian inference.
Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
Patrick Kofod Mogensen and Asbjørn Nilsen Riseth.
Optim: A mathematical optimization package for julia.Journal of Open Source Software, 3(24), 2018.
-  Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078, 2008.
-  John Paisley, David M Blei, and Michael I Jordan. Variational bayesian inference with stochastic search. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1363–1370. Omnipress, 2012.
Ioannis Psorakis, Theodoros Damoulas, and Mark A. Girolami.
Multiclass Relevance Vector Machines: Sparsity and Accuracy.
IEEE Transactions on Neural Networks, 21(10):1588–1598, 2010.
-  Jeffrey Regier, Andrew Miller, Jon McAuliffe, Ryan Adams, Matt Hoffman, Dustin Lang, David Schlegel, and Mr Prabhat. Celeste: Variational inference for a generative model of astronomical images. In International Conference on Machine Learning, pages 2095–2103, 2015.
Tim Salimans, David A Knowles, et al.
Fixed-form variational posterior approximation through stochastic linear regression.Bayesian Analysis, 8(4):837–882, 2013.
-  Devinderjit Sivia and John Skilling. Data analysis: a Bayesian tutorial. OUP Oxford, 2006.
-  Harold Soh. Distance-preserving probabilistic embeddings with side information: Variational bayesian multidimensional scaling gaussian process. In IJCAI, pages 2011–2017, 2016.
-  Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational bayes for non-conjugate inference. In International Conference on Machine Learning, pages 1971–1979, 2014.
-  D.G. Tzikas, C.L. Likas, and N.P. Galatsanos. The Variational Approximation for Bayesian inference. Signal Processing Magazine, IEEE, 25(6):131–146, 2008.
-  M. W. Woolrich and T. E. Behrens. Variational bayes inference of spatial mixture models for segmentation. IEEE Transactions on Medical Imaging, 25(10):1380–1391, 2006.
-  Anqi Wu, Nicholas G Roy, Stephen Keeley, and Jonathan W Pillow. Gaussian process based nonlinear latent structure discovery in multivariate spike train data. In Advances in Neural Information Processing Systems, pages 3496–3505, 2017.
-  Shipeng Yu, Kai Yu, Volker Tresp, and Hans-Peter Kriegel. Variational bayesian dirichlet-multinomial allocation for exponential family mixtures. In European Conference on Machine Learning, pages 841–848. Springer, 2006.