Bayesian inference is becoming the standard mode of inference as computational resources increase, algorithms advance and scientists across fields become aware of the importance of uncertainty. However, exact Bayesian inference is hardly ever possible whenever the model likelihood function deviates from mathematically convenient forms (i.e. conjugacy). Deterministic approximations are constantly gaining ground on the ubiquitous and computationally intensive Monte Carlo sampling methods that are capable of producing high quality approximations to otherwise intractable quantities such as posterior densities or marginal likelihoods.
Often, however, deterministic schemes are tailored to a particular model, or family of models, and hence previously derived methods might not be transferable to a new setting (e.g. in 
a specialised algorithm for Bayesian inference in neural networks is considered). This introduces practical difficulties, mostly in fields beyond machine learning, whenever an implementation of Bayesian inference is required particularly in early explorative stages where the model formulation is likely to keep changing. In this work we introduce a scheme for approximate Bayesian inference that aims to be general enough so that it can accommodate a variety of models. The proposed scheme is conceptually simple and requires only that the gradient of the model log-likelihood with respect to its parameters be available.
Specifically, we consider the task of computing a global Gaussian approximation to a given intractable posterior distribution representing a probabilistic model by maximizing a standard variational lower bound . This lower bound involves the expectation of the log-likelihood of the model distribution with respect to the approximating distribution . To enable the computation of these expectations either in closed form or through computationally tractable numerical approximations, the likelihoods are typically restricted to conditionally factorized forms. This step of the approach specifically depends on the model at hand.
By contrast, our conceptually simple method presented below is more generally applicable to various models in the same way. We demonstrate this by working out examples for a variety of models. In particular, we demonstrate empirically that our approach results in Gaussian approximations that are superior to the basic Laplace approximation , which is the typical objective of variational approximation schemes. A formal comparison to related state-of-the-art methods for computing improved Gaussian approximations, e.g. through the nested Laplace approximation  or by expectation propagation , is beyond the scope of this paper, however.
Our paper is organized as follows: in Section 2 we introduce in general terms the proposed scheme. Starting from a standard formulation, that typically variational methods use, we postulate a Gaussian posterior and show how to form an approximation to the lower bound of the marginal log-likelihood. The obtained approximation allows us to optimise the variational parameters and of by gradient optimisation. For the reader that wishes to refresh her/his memory or obtain a more detailed explanation of the equations presented in Section 2, we refer to [5, 21].
In Section 3 we demonstrate the proposed scheme on a number of applications and compare it against exact inference, Laplace and variational approximations. Specifically, in Subsection 3.1 we show a visual comparison of approximating flexible bivariate densities using the Laplace approximation and the proposed scheme. In Subsection 3.2
we apply our approach on the problem of Bayesian linear regression which actually does admit an analytical and exact solution. This is useful as it allows us to empirically verify the correctness of our scheme against the posterior obtained by exact inference. Subsequently, in Subsections3.3 and 3.4 we compare the proposed scheme with approaches that take into account the functional forms of classification problems. We show that despite its general formulation, the proposed scheme performs up to par in this setting without exploiting any such problem specific knowledge. In Subsection 3.5
we show how a change in the model likelihood of probabilistic principal component analysis, that renders inference problematic, can easily be accommodated by the proposed scheme. This demonstrates the versatility of our approach in handling such cases in a direct and simple manner. Finally, in Subsection 3.6 we show how the proposed scheme can be applied beyond the usual statistical models, namely on a real geophysical model . We believe that the proposed method raises a range of interesting questions and directions of research; we briefly discuss them in Section 4.
2 Proposed Scheme for Approximate Variational Inference
Assume an observed dataset of inputs and outputs modelled by a model parametrised by . For observed outputs corrupted by Gaussian noise of precision , the following likelihood111Besides the likelihood based on the Gaussian density, others can also be accommodated as shown in Sections 3.3-3.5. The choice of the Gaussian is made for ease of exposition. arises:
is the vector of model outputs calculated on all inputs. Furthermore, assume a Gaussian prior on the parameters :
Our wish is to approximate the true posterior of the parameters
. We do not make any assumptions about the model having conjugate priors for the parameters. Model may have a complex functional form that hinders exact Bayesian inference, or even the application of an approximate Bayesian scheme such as VBEM  with a factorised prior. However, we do have to make an assumption on the form of the posterior. We choose to postulate an approximate Gaussian posterior for the parameters :
Parameters and of the posterior are called variational parameters. For reasons that will become obvious shortly, we choose to parametrise covariance matrix as with . The postulated posterior now reads:
Hence, the actual variational parameters are and .
2.1 Approximate Lower Bound
The first step in introducing the proposed scheme, is writing the marginal log-likelihood and lower-bounding it in the standard way using Jensens’ inequality [5, Eq. (2.46)]:
An alternative motivation of the lower bound is provided in [21, Eq. (15)] Maximising the lower bound in Eq. (5) with respect to the free variational parameters and of , results in the best Gaussian approximation to the true posterior. Term , the integrated likelihood in Eq. (5), is a potentially intractable integral. We approximate term using Monte Carlo sampling:
where we draw samples from the postulated posterior . Due to the sampling, however, the variational parameters no longer appear in the approximation Eq. (6) . Nevertheless, it is possible to re-introduce them by rewriting the sampled weights as222This is where the parametrisation becomes useful.:
where variables are sampled from the standard normal . We summarise all samples by . Hence, we can rewrite Eq. (6) as:
Hence, the variational parameters and are now made explicit in this approximation. We expand the approximation of term further:
Term in Eq. (5
) is simply the Kullback-Leibler divergence (KLD) between the two Gaussian densitiesand , and can be calculated in closed form:
We can now put together the approximated term in Eq. (9) and the KLD term in Eq. (10), to formulate the following objective function333The subscript stands for finite sample. . Discarding constants, the proposed approximate lower bound reads:
Objective is an approximation to the intractable lower bound in Eq. (5). It consists of two parts, the approximation to the integrated likelihood (term ) and the exact KLD (term ). The proposed lower bound becomes more accurate when the number of samples is large.
2.2 Optimisation of Approximate Lower Bound
Gradients of can be calculated with respect to the variational parameters and in order to find the best approximating posterior :
where denotes the Jacobian matrix of and is the pseudo-inverse of due to the derivation of the log-determinant444See derivative rule in . in Eq. (11). Analogous equations for the case of exact variational inference can be found at [5, Eq. (2.64)]. Given the current posteriorand have analytical updates:
Again, analogous equations for the above hyperparameter updates can be found in [21, Eqs. (38), (39)].
The proposed scheme is summarised with the pseudocode in Algorithm 1. Convergence was established in the experiments by checking whether the difference between the objective function values between two successive iterations is less than . Gradient optimisation of and was carried out using the scaled conjugate gradient algorithm . The outcome of the above scheme is the approximation to the marginal log-likelihood (also called log-evidence). The scheme imparts us with the approximate Gaussian posterior .
2.3 Monitoring Generalisation Performance
For large values of the proposed lower bound approximates the true bound in Eq. (5) closely. Therefore, we expect that optimising will yield approximately the same optimal variational parameters as the optimisation of the intractable true lower bound would.
The proposed scheme exhibits some fluctuation as is a function of the random set of samples . Hence, if the algorithm, as summarised in Algorithm 1, is run again, a new set will be drawn and a different function will be optimised. However, for large enough the fluctuation due to will be innocuous and approximately the same variational parameters will be found for any drawn 555Discounting other sources of randomness like initialisation, etc..
However, if on the other hand we choose a small , then the variational parameters will overly depend on the small set of samples that happened to be drawn at the beginning of the algorithm. As a consequence, will approximate poorly, and the resulting posterior will also be a poor approximation to the true posterior. Hence, the variational parameters will be overfitted to the small set of samples that happened to be drawn.
Naturally, the question arises of how to choose a large enough in order avoid overfitting the variational parameters on . A practical answer to this question is the following: at the beginning of the algorithm we draw a second independent set of samples where is preferably a number larger than . At each (or every few) iteration(s) we monitor the quantity on the independent666We stress that is not used in training. sample set . If the variational parameters are not overfitting the drawn , then we should see that as the lower bound increases, the quantity should also display a tendency to increase. If on the other hand the variational parameters are overfitting the drawn , then though is increasing, we will notice that is actually deteriorating. This is a clear sign that a larger is required.
The described procedure is reminiscent of monitoring the generalisation performance of a learning algorithm on a validation set during training. A significant difference, however, is that while validation sets are typically of limited size, here we can set arbitrarily large. For practical purposes, we found that was good enough to detect overfitting. An illustration of overfitting the variational parameters in provided in Sec. 3.2.
In this section we apply the proposed approach on a variety of applications, namely regression, classification, denoising and geophysical modelling. In particular, the geophysical example shows how the method can be applied beyond standard statistical models.
3.1 Fitting Bivariate Posteriors
We test our proposed scheme on some artificially constructed posteriors by using a flexible parametric class of densities, due to , which reads:
whereis a real-value function such that . Here, we take to be the dot product of a row and column vector, as in . The goal is to find the best Gaussian approximation to instances of Eq. (16) for different column vectors . To that end, we tried to find the best Gaussian using the Laplace approximation and the proposed scheme. We used . The results are shown in Fig. 1. The Gaussian approximations are drawn as black dashed curves with their mean marked as a red cross. The goodness of each approximation has been measured as the Kullback-Leibler divergence, and it is noted in the respective captions. We note that the proposed scheme fares better than the Laplace approximation as the latter evidently focuses on the mode of the target density instead on where the volume of the density lies. The KLD in these examples was calculated numerically as there is no closed form between a Gaussian and a member of the densities in Eq. (16)
3.2 Bayesian Linear Regression
Bayesian linear regression constitutes a useful example for corroborating that the proposed scheme works correctly as we can compare the obtained posterior to the posterior obtained by exact Bayesian inference .
We consider data targets generated by the equation
with inputs uniformly drawn in the range
. We add white Gaussian noise to the data targets with a standard deviation of
. We calculate a set of radial basis functions on the data inputs
where . The last element in serves as a bias term. We set , and adopt the linear model where . We complete the model by choosing the following densities:
Postulated posterior: , where and .
We set the number of samples of variables to . We inferred the Gaussian posterior of the weights using both exact Bayesian inference  and the proposed scheme. In Fig. 2 we plot the mean predictions as obtained by the two schemes and note that they are very similar, especially in the areas where enough data items are present. Similarly, in Fig. 3 we plot the covariance matrices found by the two schemes and note their close similarity. Hence, we conclude that the proposed scheme stands in close agreement to the exact Bayesian solution.
Finally, we demonstrate on the same dataset the effect of overfitting the variational parameters when is set too low. In Fig. 4 we are monitoring the lower bounds and , see Sec. 2.3. In the left, we run the algorithm for and : we see that as increases with each iteration, so does . This means that the fitted variational parameters generalise well. On the right hand side, we run the algorithm for but kept . Here we clearly see that while increases, the lower bound is deteriorating. This a clear sign that a larger is required and that the variational parameters are overfitted.
3.3 Bayesian Logistic Regression
In this section we apply the proposed scheme to Bayesian logistic regression and compare with the variational approach presented in. The data are input-label pairs with . Again, like in Sec. 3.2, we calculate basis functions on the input data and take . We set . We complete the model by choosing the following densities:
Postulated posterior: .
We evaluated both schemes on datasets preprocessed by Rätsch et al777http://www.raetschlab.org/Members/raetsch/benchmark
. Each preprocessed dataset has been randomly partitioned into a 100 non-overlapping training and testing sets. Hence the performance of both schemes was evaluated as the accuracy on the test set, that is the ratio of correctly classified test samples over all test samples. The predictive distribution for the proposed scheme was approximated using a Monte Carlo estimate. We drew 200 parameter samples from the fitted Gaussian posteriorand measured performance on the testing set as the average accuracy under each sample of parameters. The results reported in Table 1, mean squared error and standard deviation, show that both the proposed schemes and the variational bound in  perform virtually the same. We note that, as opposed to  which exploits the functional form of logistic regression in order to design a bespoke lower bound, the proposed method does not take into account any such knowledge and still is capable of delivering comparable performance. Hence, we find the results in this section encouraging.
|Dataset||Bound in ||Proposed|
|Banana||0.8893 0.0055||0.8893 0.0054|
|Cancer||0.7119 0.0456||0.7116 0.0454|
|Heart||0.5528 0.0445||0.5500 0.0476|
|Solar||0.6449 0.0172||0.6445 0.0167|
3.4 Bayesian Multiclass Classification
In this section we apply the proposed scheme on Bayesian multiclass classification. The data are input-label pairs . Vectors are binary vectors encoding class labels using -of- coding scheme, e.g. encodes class label in a -class problem. A typical way of formulating multiclass classification is multiclass logistic regression (MLR), see [6, Chapter ]
for more details. MLR models the probabilityof the -th data item belonging to class via the softmax function . denotes the total number of classes, and each class is associated with a weight vector . Similarly to logistic regression, MLR does not allow direct Bayesian inference as the use of the softmax function renders integrals over the likelihood term intractable. Thus, Bayesian MLR is a good candidate problem for the proposed approach. We specify the following model:
Postulated posterior: , with .
As a corroboration of the usefulness of our approximation to Bayesian MLR, we compare with the multiclass generalisation of the relevance vector machine (RVM)  presented in . We use the multiclass UCI datasets suggested therein. Amongst the two generalisations of the RVM suggested in , we use the version. We also follow the suggestion of  concerning the choice of kernels for the different datasets. We set . Table 2 summarises the results of our numerical simulations along with details of the datasets. While the designs a refined probabilistic model in order to make probabilistic multiclass classification possible, the proposed scheme does not take into account any kind of such knowledge and is still able to deliver competitive performance, in terms of predictive accuracy, as seen in Table 2. The good performance demonstrates both the usefulness and versatility of the proposed method.
3.5 Probabilistic Image Denoising
In this section we further demonstrate how the proposed method can take in its stride a change in the model-likelihood that complicates computations. The model considered here is the ubiquitous probabilistic principal component analysis (PPCA) introduced in
. PPCA assumes that the observed high-dimensional dataare manifestations of low-dimensional latent variables , under a linear mapping expressed by a matrix and an offset , corrupted by Gaussian noise :
PPCA formulates a computationally amenable linear-Gaussian model which allows integrating out the latent variables and obtaining the marginal log-likelihood. Estimating and follows by maximising the marginal log-likelihood . Various works extend PPCA by replacing the noise model with other choices, e.g.  uses the Student-t distribution, in order to deal with different types of noise. A recent interesting suggestion is the choice of the Cauchy density as the noise model , albeit in a non-probabilistic formulation. The Cauchy density with location and scale parameters reads:
Choosing the Cauchy density as the noise model leads to a version of PPCA where the marginal log-likelihood is no longer tractable and so the latent variables cannot be integrated out. This is simply because the prior on is not conjugate to the Cauchy likelihood. However, the proposed method can be used to approximate this intractable marginal log-likelihood. Formally, we specify the following Cauchy-PPCA model:
Postulated posterior: .
Parameters , and are obtained by gradient optimisation of the proposed lower bound .
We applied the original PPCA  and the proposed Cauchy-PPCA on a task concerning the denoising of images that have undergone pixel corruption. The aim here is to show the ease with which the proposed method can accommodate a change in the specification (i.e. change from Normal to Cauchy likelihood) and deliver a well-performing model.
The data are face images of individuals obtained from the Extended Yale B Database . There are images per individual under poses and illumination conditions. The images are grayscale images whose pixels have values between and . We rescale the images to pixels. Hence and we set , i.e. both PCA schemes project the images to a latent space of dimension equal to . We corrupt of the pixels in each image by drawing a new value uniformly in the range . For each individual, we use half of the corrupted images as the training set and the other half as the test set.
Fig. 5 presents results obtained on test images from individuals. The figure shows from left to right the original and corrupted test image followed by the two reconstructions obtained by PPCA and Cauchy-PPCA respectively. In order to quantify the quality of reconstruction, we use the following measure between the original and reconstructed images: . This measure is quoted below each image. The results in Fig. 5 evidently show that Cauchy-PPCA achieves better denoising levels than PPCA. In actual fact, in our numerical experiments we found that Cauchy-PPCA outperformed PPCA on all individuals.
The present numerical experiment demonstrates the versatility of the proposed method in how it can easily extend PPCA to incorporate a Cauchy likelihood. This is achieved without exploiting any particular knowledge pertaining to the probabilistic specification of the model.
3.6 Bayesian Inference for the Stochastic Model by Boore
In this section we apply the proposed scheme on a geophysical model called the stochastic model. The stochastic model, due to Boore , is used to predict ground motion at a given site of interest caused by an earthquake. Ground motion is simply the shaking of the earth and it is a fundamental quantity in estimating the seismic hazard of structures. From a physical point of view, the stochastic model describes the radiated source spectrum and its amplitude changes in the frequency domain due to wave propagation from the source to the site of interest. The inputs to the stochastic model are the distance of the site of interest to the seismic source, the magnitude of the earthquake, and the frequency of ground motion. The stochastic model, in its simple form, has a parameter associated with the seismic source known in the literature as stress parameter (), two parameters associated with the path attenuation called geometrical spreading () and quality factor (), and one more parameter associated with the site called near-surface attenuation (). All aforementioned parameters are bounded within a physical range. In the case of multiple seismic sources, each source is associated with its own distinct stress parameter. The scalar output of the model is the mean Fourier amplitude of the ground motion. The type of ground motion we consider here is acceleration. We denote the stochastic model as a function , where , where is the number of seismic sources. This situation is depicted in Fig. 6. We refer the interested reader to  for more details. Estimating the posterior uncertainty of the model parameters is important in seismic hazard analysis as the propagation of uncertainty in the parameters can have an impact on the estimated hazard curve . A discussion of how these posteriors can be utilised in later stages of seismic hazard analysis is beyond the scope of this work.
We specify the model by choosing the following densities:
Flat prior: .
Postulated posterior: .
In contrast to the previous applications, here we choose a very flat prior. Ground motion data inputs are denoted by and targets by .
We performed experiments on a subset of the recently compiled RESORCE database  which is a pan-European database of strong-motion recordings. In particular we focused on data records originating from a station in the region of L’Aquilla for seismic sources. Hence, the total number of free model parameters in is . We experimented with varying number of data records, , in order to test the robustness of the Laplace and the proposed approximation in scenarios of limited data. Such situations arise in geophysical studies when data recordings are incomplete due to distortions in frequencies caused by instrumentation errors. The performance of Laplace and the proposed scheme was evaluated as the prediction error on test sets. Both schemes were run times, and each run involved a new random realisation of the training and testing set. Parameter was set to for all experiments in this section. The predictive distribution for the proposed scheme was approximated using a Monte Carlo estimate. We drew parameter samples from the Gaussian fitted posterior and estimated performance on the testing set as the average of the mean squared error under each parameter sample. The results are reported in Table 3.
The results show that the proposed approximation fares better than Laplace, although at the performances are virtually identical. For lower
, however, the Laplace approximation exhibits much higher variance than the proposed scheme.
4 Discussion and Conclusion
We have presented a scheme for Bayesian variational inference that is applicable in cases where the likelihood function renders more standard approaches difficult. The scheme is conceptually simple as it relies on a simple Monte Carlo average of the intractable part of the variational lower bound, see Eq. (5), and the re-introduction of the variational parameters resulting in the objective of Eq. (11). The scheme can thus be generally applied to other models where variational inference is difficult requiring only the gradients of the log-likelihood function with respect to the parameters.
In the numerical experiments we have shown that (a) the proposed scheme stands in close agreement with exact inference in Bayesian linear regression, (b) it performs up to par in classification tasks against methods that design bespoke model formulations, (c) it fares better than the Laplace approximation in a number of cases, and (d) it is very versatile and can be applied to a variety of problems.
Future work will address the relationship of our approach to variational approaches [18, 9] that provide alternative ways to compute improved Gaussian approximations to intractable posteriors relative to the Laplace approximation. Another aspect concerns ways to cope with very large problems that would require a large number of samples to obtain a sufficiently accurate approximation in Eq. (8). A natural choice would be to turn the scheme in Algorithm 1 into a recursive stochastic optimisation scheme  that employs small sample sets computed at each iteration, akin to stochastic gradient-based large-scale empirical risk minimisation . These two approaches should not be confused, however. The latter employs subsampling of the data whereas our scheme generates samples based on current parameter estimates of the approximate posterior. Clearly, our scheme could incorporate sequential subsampling of the data as well. The problem of proving convergence of such an overall stochastic approximation approach in a suitable sense  seems to be open.
The RESORCE database  was used in this work with the kind permission of the SIGMA project888http://www.projet-sigma.com. N. Gianniotis was partially funded by the BMBF project “Potsdam Research Cluster for Georisk Analysis, Environmental Change and Sustainability”. C. Molkenthin and S. S. Bora were funded by the graduate research school GeoSim of the Geo.X initiative999http://www.geo-x.net.
-  S. Akkar, M.A. Sandikkaya, M. Senyurt, A. Azari Sisi, B. Ö. Ay, P. Traversa, J. Douglas, F. Cotton, L. Luzi, B. Hernandez, and S. Godey. Reference Database for Seismic Ground-Motion in Europe (RESORCE). Bulletin of Earthquake Engineering, 12(1):311–339, 2014.
-  Cédric Archambeau, Nicolas Delannay, and Michel Verleysen. Robust Probabilistic Projections. In Proceedings of the 23rd International Conference on Machine Learning, pages 33–40. ACM, 2006.
The Skew-Normal Distribution and Related Multivariate Families.Scandinavian Journal of Statistics, 32(2):159–188, 2005.
-  David Barber and Christopher M. Bishop. Ensemble Learning in Bayesian Neural Networks. In Generalization in Neural Networks and Machine Learning, pages 215–237. Springer, 1998.
-  M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
-  C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
-  D. M. Boore. Simulation of Ground Motion Using the Stochastic Method. Pure and Applied Geophysics, 160(3-4):635–676, 2003.
-  L. Bottou. Stochastic Gradient Tricks. In Neural Networks, Tricks of the Trade, Reloaded, volume 7700 of LNCS. Springer, 2012.
-  B. Cseke and T. Heskes. Approximate Marginals in Latent Gaussian Models. Journal of Machine Learning Research, 12:417–454, 2011.
Athinodoros S. Georghiades, Peter N. Belhumeur, and David Kriegman.
From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose.Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(6):643–660, 2001.
-  Tommi Jaakkola and Michael Jordan. Bayesian Parameter Estimation via Variational Methods. Statistics and Computing, 10(1):25–37, 2000.
-  H.J. Kushner and G.G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.
Martin Fodslette Møller.
A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning.Neural networks, 6(4):525–533, 1993.
-  M. Opper and C. Archambeau. The Variational Gaussian Approximation Revisited. Neural Computation, 21:786–792, 2009.
-  Kaare B. Petersen and Michael S. Pedersen. The Matrix Cookbook, November 15 2012.
-  Ioannis Psorakis, Theodoros Damoulas, and Mark A. Girolami. Multiclass Relevance Vector Machines: Sparsity and Accuracy. IEEE Transactions on Neural Networks, 21(10):1588–1598, 2010.
-  L. Reiter. Earthquake Hazard Analysis: Issues and Insights. Colombia University Press, 1991.
-  H. Rue, S. Martino, and N. Chopin. Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society: Series B, 71(2):319–392, 2009.
-  Michael E Tipping. Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211–244, 2001.
-  Michael E. Tipping and Christopher M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B, 61(3):611–622, 1999.
-  D.G. Tzikas, C.L. Likas, and N.P. Galatsanos. The Variational Approximation for Bayesian inference. Signal Processing Magazine, IEEE, 25(6):131–146, 2008.
-  Pengtao Xie and Eric P. Xing. Cauchy Principal Component Analysis. CoRR, abs/1412.6506, 2014.