Bayesian inference is intrinsically plagued by computational problems because its key object, the posterior distribution, is computationally hard to approximate. Solutions to this challenge can be roughly decomposed into two classes. Sampling methods, dating back to the famed Metropolis-Hastings algorithm (Metropolis et al. (1953); Hastings (1970)), provide a first possible solution: approximate the posterior using a large number of samples whose marginal distribution is the posterior distribution (Brooks et al. (2011)). Any expected value under the posterior can then be approximated using the corresponding empirical mean over the samples. A second solution is provided by Variational methods, which aim to find the member of a parametric family which is the closest (in some sense) to the posterior (Blei et al. (2017)). For example, the historical Laplace approximation (Laplace (1820)) proposes to approximate the posterior by a Gaussian centered at the Maximum A Posteriori value, while the more modern Gaussian Variational Approximation finds a Gaussian which minimizes the reverse Kullback-Leibler divergence (KL) (Opper and Archambeau (2009)). Computationally, the choice between the two corresponds to a trade-off between accuracy (the error of sampling methods typically converges at speed $O(N_s^{-1/2})$, where $N_s$ is the number of samples, while Variational methods will always have some residual error) and speed (Variational methods tend to quickly and cheaply find the best approximation; e.g. Nickisch and Rasmussen (2008)). Currently, Variational methods are furthermore held back by the fact that they are perceived to be unrigorous approximations due to the absence of results guaranteeing their precision.
One regime in which Variational methods are particularly interesting is that of large datasets. Indeed, as the number of datapoints grows, sampling methods struggle computationally since the computation of one additional sample requires a pass through the whole dataset (but see Bardenet et al. (2014); Chen et al. (2014); Maclaurin and Adams (2015); Bardenet et al. (2017) for modern attempts at tackling this issue). In contrast, Variational methods are still able to tackle these large datasets since they can leverage the computational prowess of optimization. For example, it is straightforward to solve an optimization problem while accessing only subsets (or batches) of the data at any one time, thus minimizing the memory cost of the method, while evaluating the Metropolis-Hastings ratio on subsets of the data is much trickier.
Furthermore, in the large-data limit, it is known that the posterior distribution becomes simple: it tends to be almost equal to its Laplace approximation (with total variation error typically scaling as $O(N^{-1/2})$ if the dimensionality is fixed), a result known as the Bernstein-von Mises theorem (BvM; Van der Vaart (2000); Kleijn et al. (2012)). Intuitively, we should thus expect that Variational methods would typically have smaller error as $N$, the number of datapoints, grows larger.
However, existing variants of the BvM theorem fail to be useful in characterizing whether, for a given posterior and a given approximation of it, the approximation is good enough or not. This might be due to historical reasons: BvM theorems aimed instead at proving that Bayesian methods are valid in the frequentist paradigm (under correct model specification; Kleijn et al. (2012)).
In this article, which expands upon the preliminary work in Dehaene (2017), I propose to extend the scope of BvM theorems by proving (Th.5 and Cor.7) that the distance (measured using the KL divergence) between a log-concave density $p$ on $\mathbb{R}^d$ and its Laplace approximation $\mathcal{LA}$ can be (roughly and with caveats) approximated using the "Kullback-Leibler variance", i.e. the variance of the log-density ratio $\log(\mathcal{LA}/p)$ under $\mathcal{LA}$, thus yielding a computable assessment of whether, in a given problem, the Laplace approximation is good enough or not. Furthermore, the KL divergence is upper-bounded (up to dimension-dependent factors) in terms of the scalar $c_3$ (defined in eq.42 below), which measures the relative strength of the third derivative compared to the second derivative of $\log p$. In the conventional large-data limit, $c_3$ scales as $O_P(N^{-1/2})$ (Cor.6) and this result thus recovers existing BvM theorems, since the total variation distance scales as the square root of the KL divergence due to Pinsker's inequality.
This article is organized in the following way. Section 1 introduces key notations and gives a short review of existing BvM results. Section 2 then introduces three preliminary propositions which give deterministic approximations of $KL(\mathcal{LA}, p)$. These approximations are limited, but provide a key stepping stone towards Th.5. Section 3 then presents the main result, Th.5, and shows that the classical BvM theorem in the IID large-data limit is recovered as Cor.6. Section 4 then shows that Th.5 yields computable approximations of $KL(\mathcal{LA}, p)$ and investigates them empirically in the linear logistic classification model. Finally, Section 5 discusses the significance of these findings and how they can be used to derive rigorous Bayesian inferences from Laplace approximations of the posterior distribution.
Note that, due to the length of the proofs, I only give sketches of the proofs in the main text. Fully rigorous proofs are given in the appendix.
1 Notations and background
Throughout this article, let $p(\theta)$ denote a probability density over $\theta \in \mathbb{R}^d$ which we seek to approximate using its Laplace approximation:

$$\mathcal{LA}(\theta) = \mathcal{N}(\theta; \mu, \Sigma)$$
I will distinguish whether variables have distribution $p$ or $\mathcal{LA}$ through the use of indices (e.g. $E_p$, $E_{\mathcal{LA}}$) or through more explicit notation. Recall that the mean and covariance of $\mathcal{LA}$ are as follows: $\mu$ is the maximum of $p$ and $\Sigma$ is the inverse of the negative log-Hessian of $p$ at $\mu$:

$$\mu = \arg\max_\theta \log p(\theta), \qquad \Sigma = \left( -\nabla^2 \log p(\mu) \right)^{-1}$$
The Laplace approximation thus corresponds to a quadratic Taylor approximation of $\log p$ around its maximum.
I will furthermore assume that $f(\theta) = -\log p(\theta)$ is (up to its normalization constant) a strictly convex function that can be differentiated at least three times. This ensures that $\mu$ is unique and straightforward to compute through gradient descent, and that $p$ is a log-concave density function, a family with many interesting properties (see Saumard and Wellner (2014) for a thorough review). Note that, in a Bayesian context, it is straightforward to guarantee that the posterior is log-concave since it follows from the prior and all likelihoods being log-concave. These assumptions could be weakened, but at the cost of a sharp loss of clarity.
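As a concrete one-dimensional illustration of these definitions, the following sketch computes a Laplace approximation by Newton's method. The target (one logistic-likelihood term plus a standard Gaussian prior term) is an assumed example for illustration, not one taken from this article.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Negative log-density f(x) = log(1 + exp(-x)) + x^2 / 2:
# one logistic likelihood term plus a standard Gaussian prior term.
def f_prime(x):
    return (sigmoid(x) - 1.0) + x

def f_second(x):
    return sigmoid(x) * (1.0 - sigmoid(x)) + 1.0

# Strict convexity of f guarantees a unique mode; Newton's method finds it.
mu = 0.0
for _ in range(50):
    mu -= f_prime(mu) / f_second(mu)

sigma2 = 1.0 / f_second(mu)  # variance = inverse of the negative log-Hessian at the mode
print(mu, sigma2)
```

The mode and the curvature at the mode are the only two quantities the Laplace approximation retains about $p$: everything away from the mode is extrapolated quadratically.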
I will measure the distance between $p$ and $\mathcal{LA}$ using the Kullback-Leibler divergence:

$$KL(p, \mathcal{LA}) = E_{\theta \sim p}\left[ \log \frac{p(\theta)}{\mathcal{LA}(\theta)} \right], \qquad KL(\mathcal{LA}, p) = E_{\theta \sim \mathcal{LA}}\left[ \log \frac{\mathcal{LA}(\theta)}{p(\theta)} \right]$$
I will refer to these as the forward (from truth to approximation) KL divergence and the reverse (approximation to truth) KL divergence in order to distinguish them.
The KL divergence is a positive quantity which upper-bounds the total-variation distance through the Pinsker inequality:

$$TV(p, q) \le \sqrt{\frac{KL(p, q)}{2}}$$
Contrary to the total-variation distance, the KL divergence is not symmetric nor does it generally respect a triangle inequality.
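The Pinsker inequality can be checked numerically. The sketch below, using two arbitrary one-dimensional Gaussians (an illustrative assumption), compares the closed-form KL divergence to a total-variation distance computed by Riemann summation.

```python
import math

def gauss_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Two Gaussians: p = N(0, 1) and q = N(1, 2^2).
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed-form KL(p, q) between two univariate Gaussians.
kl = math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Total-variation distance by Riemann summation: TV = (1/2) * integral of |p - q|.
dx = 0.001
grid = [-20.0 + i * dx for i in range(40000)]
tv = 0.5 * sum(abs(gauss_pdf(x, m1, s1) - gauss_pdf(x, m2, s2)) for x in grid) * dx

print(tv, math.sqrt(kl / 2))  # Pinsker: TV <= sqrt(KL / 2)
```

The inequality is not tight here, which is typical: Pinsker converts a sharp KL bound into a somewhat looser total-variation bound.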
In order to state convergence results, I will use the probabilistic big-O and small-o notations. Recall that $X_N = O_P(a_N)$ is equivalent to the sequence $X_N / a_N$ being bounded in probability: for any $\epsilon > 0$, there exists a threshold $M_\epsilon$ such that:

$$\forall N: \quad P\left( \left| \frac{X_N}{a_N} \right| > M_\epsilon \right) \le \epsilon$$

while $X_N = o_P(a_N)$ is equivalent to the sequence $X_N / a_N$ converging in probability to $0$, i.e. for any threshold $\epsilon > 0$:

$$P\left( \left| \frac{X_N}{a_N} \right| > \epsilon \right) \xrightarrow[N \to \infty]{} 0$$
Finally, my analysis strongly relies on three key changes of variables. First, I will denote with $y$ the affine transformation of $\theta$ such that the distribution of $y$ under $\mathcal{LA}$ is the standard Gaussian distribution:

$$y = \Sigma^{-1/2} (\theta - \mu) \sim \mathcal{N}(0, I_d) \quad \text{under } \mathcal{LA}$$

Densities and log-densities on the $y$ space will be denoted with a tilde (e.g. $\tilde{p}$, $\tilde{f}$).
Second, I will further consider a switch to spherical coordinates in the $y$ referential:

$$y = r s, \qquad r = \|y\| \ge 0$$

where $s$ takes values over $S_{d-1}$, the $(d-1)$-dimensional unit sphere. Recall that, under the Gaussian distribution $\mathcal{N}(0, I_d)$, $r$ and $s$ are independent, $r$ is a $\chi_d$ random variable and $s$ is a uniform random variable over $S_{d-1}$.

Finally, for technical reasons, I do not work with the radius $r$ but with its cubic root $t$:

$$t = r^{1/3}$$
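The polar decomposition of a standard Gaussian is easy to verify by simulation. The sketch below (a Monte Carlo check, not code from the article) draws standard Gaussian vectors and confirms that the squared radius has mean $d$, as expected for a $\chi_d$ radius, while the direction averages to zero, as expected for a uniform direction on the sphere.

```python
import random
import math

random.seed(0)
d, n = 5, 20000

sq_radii = []
mean_dir = [0.0] * d
for _ in range(n):
    y = [random.gauss(0.0, 1.0) for _ in range(d)]
    r2 = sum(v * v for v in y)  # r^2 is chi-squared with d degrees of freedom
    sq_radii.append(r2)
    r = math.sqrt(r2)
    for j in range(d):          # s = y / r is uniform on the unit sphere
        mean_dir[j] += y[j] / r / n

print(sum(sq_radii) / n)  # close to d = 5
print(mean_dir)           # each coordinate close to 0
```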
1.2 The Bernstein-von Mises theorem
The Bernstein-von Mises theorem asserts that the Laplace approximation is asymptotically correct. While it was already intuited by Laplace (Laplace (1820)), the first explicit formulations date back to separate foundational contributions by Bernstein (1911) and von Mises (1931). The first rigorous proof was given by Doob (1949).
We focus here on the IID setting, which is the easiest to understand, but the result can be derived in the very general Locally Asymptotically Normal setup (LAN; Van der Vaart (2000); Kleijn et al. (2012)).
Consider the problem of analyzing $N$ IID datapoints $(x_i)_{i \le N}$, with common density $p_0(x)$, according to some Bayesian model, composed of a prior $\pi(\theta)$ and a conditional model describing the conditional IID distribution of the $x_i$ with density $p(x | \theta)$. The posterior is then:

$$p(\theta | x_{1:N}) \propto \pi(\theta) \prod_{i=1}^{N} p(x_i | \theta)$$
Noting $f_i(\theta) = -\log p(x_i | \theta)$ the negative log-density due to the $i$-th datapoint, and $f(\theta)$ the negative log-density of the joint distribution, we have:

$$f(\theta) = -\log \pi(\theta) + \sum_{i=1}^{N} f_i(\theta)$$

The trivial rewriting in eq.(19) makes explicit that $f$ is somewhat akin to a biased empirical mean of the $f_i$, which are IID function-valued random variables, a structure they inherit from the $x_i$ being IID.
Critically, the BvM theorem does not require correct model specification, i.e. it does not require the data-generating density $p_0$ to correspond to one of the inference-model densities $p(\cdot | \theta)$. Note that this requires us to reconsider what the goal of the analysis is, since it means that there is no true value $\theta_0$ such that $p_0 = p(\cdot | \theta_0)$. As an alternative, we might try to recover the value $\theta^*$ of $\theta$ such that the conditional density $p(\cdot | \theta)$ is the closest to the truth. Both Maximum Likelihood Estimation (MLE) and Bayesian inference behave in this manner and aim at recovering the parameter which minimizes the KL divergence:

$$\theta^* = \arg\min_\theta KL\big( p_0, p(\cdot | \theta) \big)$$

which is equivalent to minimizing the theoretical average of the $f_i$:

$$\theta^* = \arg\min_\theta E_{x \sim p_0}\big[ -\log p(x | \theta) \big] = \arg\min_\theta E\big[ f_i(\theta) \big]$$

The connection is particularly immediate for the MLE since it aims at recovering the minimum of the theoretical average $E[f_i(\theta)]$ by using the minimum of the empirical average $N^{-1} \sum_i f_i(\theta)$.
Three technical conditions need to hold in order for the BvM theorem to apply in the IID setting:
The prior needs to put mass on all neighborhoods of $\theta^*$.
If not, then the prior either rules out $\theta^*$ or $\theta^*$ lies on the edge of the prior's support. Both possibilities modify the limit behavior heavily.
The posterior needs to concentrate around $\theta^*$ as $N \to \infty$.
More precisely, for any $\epsilon > 0$, the posterior probability of the event $\{ \|\theta - \theta^*\| \ge \epsilon \}$ needs to converge to $0$. Recall that the posterior is random since it inherits its randomness from the sequence $(x_i)$; this is thus a convergence in probability:

$$P\big( \|\theta - \theta^*\| \ge \epsilon \,\big|\, x_{1:N} \big) \xrightarrow[N \to \infty]{P} 0$$

This is a technical condition that ensures that the posterior is consistent in recovering $\theta^*$ as $N$ grows. This is very comparable to the necessity of assuming that the MLE is consistent in order for it to exhibit Gaussian limit behavior (Van der Vaart (2000)).
The function-valued random variables $f_i$ need to have a regular distribution. More precisely:
The theoretical average function $\theta \mapsto E[f_i(\theta)]$ needs to have a strictly positive-definite Hessian at $\theta^*$.
The gradient of $f_i$ at $\theta^*$ must have finite variance.
In a local neighborhood around $\theta^*$, $\nabla f_i$ is $M_i$-Lipschitz, where $M_i$ is a random variable such that $E[M_i^2]$ is finite.
Under such assumptions, the posterior distribution becomes asymptotically Gaussian, in that the total-variation distance between the posterior and its Laplace approximation converges to $0$ in probability:

$$TV\big( p(\cdot | x_{1:N}), \mathcal{LA}_N \big) = o_P(1)$$
Indeed, both the posterior and its Laplace approximation inherit their randomness from the random sequence $(x_i)$ and the distance between the two is thus random. The BvM theorem establishes furthermore that Bayesian inference is a valid procedure for frequentist inference, in that it is equivalent in the large-$N$ limit to Maximum Likelihood Estimation. Furthermore, under correct model specification, Bayesian credible regions are asymptotically valid frequentist confidence regions (Kleijn et al. (2012)).
However, BvM theorems of this nature are limited in several ways. One trivial but worrying flaw is that they ignore the prior. If we consider for example a "strong" Gaussian prior with small variance, then when $N$ is small, the likelihood will have negligible influence on the posterior, and we intuitively expect that the posterior will be almost equal to its Laplace approximation. Existing theorems fail to establish this. Furthermore, such BvM theorems are of a different probabilistic nature than normal Bayesian inference. Indeed, in BvM theorems, the probability statement is made a priori over the random data $(x_i)$. The result thus applies to the typical posterior and not to any specific one. This complicates the application of such results in a Bayesian setting, where the focus is instead on making probabilistic statements conditional on the data. Finally, these theorems are also of limited applicability due to their reliance on theoretical quantities that are inaccessible directly and hard to estimate, like the random Lipschitz constant $M_i$ and its second moment.
In this article, I will prove a theorem which addresses these limits and gives a sharp approximation of the KL divergence between the Laplace approximation and a fixed log-concave target $p$. This bound recovers the classical BvM theorem in the classical large-data limit setup that we have presented here (see Section 3.3). Finally, this bound identifies both the small-$N$ and large-$N$ regimes in which the posterior is approximately Gaussian (see Section 4.2).
2 Three limited deterministic BvM propositions
In this section, I present three simple propositions that are almost-interesting deterministic BvM results. However, all three fall short due to either relying on assumptions that are too restrictive or referring to quantities that are unwieldy. Even then, they still are useful both in building intuition on BvM results and as a gentle introduction into important tools for the proof of Theorem 5.
Note that, while they are restricted in other ways, the following propositions hold under more general assumptions than only for a log-concave $p$ and its Laplace approximation $\mathcal{LA}$, as emphasized in every section and in the statement of the propositions.
2.1 The “Kullback-Leibler variance”
Throughout this subsection, let $q$ and $p$ be any probability densities.
The KL divergence between $q$ and $p$ corresponds to the first moment of the random variable $\log(q(\theta)/p(\theta))$ under the density $q$:

$$KL(q, p) = E_{\theta \sim q}\big[ \log q(\theta) - \log p(\theta) \big]$$
It is very hard to say anything about this quantity since it requires knowledge of the normalization constants of both $q$ and $p$. Outside of the handful of probability distributions for which the integral is known, normalization constants prove to be a major obstacle towards theoretical or computational results of any kind.
The following proposition offers a possible solution: the variance of $\log(q/p)$ can be used to approximate the KL divergence. I will refer to this quantity as the KL variance, defined as:

$$KLV(q, p) = \mathrm{Var}_{\theta \sim q}\big[ \log q(\theta) - \log p(\theta) \big]$$

Critically, $KLV(q, p)$ does not require knowledge of the normalization constants because they do not modify the variance.
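The fact that normalization constants drop out of the variance is what makes the KL variance computable. The sketch below is a Monte Carlo illustration under an assumed Gaussian $q$ and an assumed unnormalized quartic target $p$ (neither is an example from this article): it estimates the KL variance from samples of $q$ using only unnormalized log-densities.

```python
import random

random.seed(1)

def log_q_unnorm(x):   # unnormalized log-density of q = N(0, 1)
    return -0.5 * x * x

def log_p_unnorm(x):   # unnormalized log-density of a non-Gaussian, log-concave target
    return -0.5 * x * x - 0.1 * x ** 4

n = 100000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
# Log-ratio log q - log p, known only up to an additive constant:
h = [log_q_unnorm(x) - log_p_unnorm(x) for x in samples]

mean_h = sum(h) / n
kl_var = sum((v - mean_h) ** 2 for v in h) / n  # any additive constant cancels here
print(kl_var)
```

For this target the log-ratio is $0.1 x^4$, whose variance under $\mathcal{N}(0,1)$ is $0.01 (E[x^8] - E[x^4]^2) = 0.01 (105 - 9) = 0.96$, so the Monte Carlo estimate should land near that value.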
Proposition 1: restrictive approximation of $KL(q, p)$.
For any densities $q$ and $p$, define the exponential family interpolating between them:

$$q_\beta(\theta) \propto q(\theta)^{1-\beta} \, p(\theta)^{\beta}, \qquad \beta \in [0, 1]$$

then the reverse KL divergence can be approximated using the KL variance, up to an error term involving the third cumulant of $\log(q(\theta)/p(\theta))$ under the $q_\beta$.
If the difference between the log-densities $\log q - \log p$ is bounded, then the difference between $KL(q, p)$ and its KL-variance approximation is bounded too.
This proposition offers the KL variance as a great alternative to the KL divergence that is both computable and tractable. However, it falls short due to the very important limitation that $\log q - \log p$ must be bounded in order to achieve precise control of the error of the approximation. This means that Prop.1 will not be able to provide a rigorous general BvM theorem. Indeed, the Laplace approximation is a purely local approximation of $p$, and the boundedness of the log-density difference is only valid in a small neighborhood around the MAP value $\mu$. As $\|\theta - \mu\|$ becomes large, the difference between the log-densities typically tends to infinity.
2.2 The log-Sobolev inequality
Throughout this subsection, $q$ is not restricted to being the Laplace approximation of $p$, and neither density is necessarily restricted to being Gaussian.
Another possible path towards avoiding the normalization constants in the KL divergence is given by the Log-Sobolev Inequality (LSI; Bakry and Émery (1985); Otto and Villani (2000)). The LSI relates the KL divergence, which requires knowledge of the normalization constants, to the relative Fisher information, defined as:

$$F(q, p) = E_{\theta \sim q}\Big[ \big\| \nabla \log q(\theta) - \nabla \log p(\theta) \big\|^2 \Big]$$

which does not require the normalization constants due to the derivative.
A key theorem by Bakry and Émery (1985) shows that the relationship between $KL(q, p)$ and $F(q, p)$ can be controlled by the minimal curvature of $-\log p$, i.e. the smallest eigenvalue of the Hessian matrices $\nabla^2 (-\log p)(\theta)$. In order for the LSI to hold, this minimal curvature needs to be strictly positive, i.e. $p$ needs to be strongly log-concave (Saumard and Wellner (2014)). The LSI is usually stated in the following form.
Proposition 2: the log-Sobolev inequality.
If $p$ is a $k$-strongly log-concave density, i.e.

$$\forall \theta: \quad \nabla^2 (-\log p)(\theta) \succeq k I_d, \qquad k > 0$$

then, for any $q$, the reverse KL divergence is bounded:

$$KL(q, p) \le \frac{F(q, p)}{2k}$$
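For Gaussians, both sides of the LSI are available in closed form, which gives a quick sanity check; the specific densities chosen below are illustrative assumptions, not an example from this article.

```python
import math

# q = N(m, s^2) and p = N(0, sigma^2); p is k-strongly log-concave with k = 1/sigma^2.
m, s, sigma = 1.0, 0.8, 1.5
k = 1.0 / sigma**2

# Closed-form reverse KL divergence KL(q, p).
kl = math.log(sigma / s) + (s**2 + m**2) / (2 * sigma**2) - 0.5

# Closed-form relative Fisher information F(q, p) = E_q[(grad log q - grad log p)^2]:
# the score difference is a*x + m/s^2 with a = 1/sigma^2 - 1/s^2, a linear function,
# so its second moment under q is a^2 * s^2 + (m / sigma^2)^2.
a = 1.0 / sigma**2 - 1.0 / s**2
fisher = a**2 * s**2 + m**2 / sigma**4

print(kl, fisher / (2 * k))  # LSI: KL(q, p) <= F(q, p) / (2 k)
```

The bound holds with visible slack even in this fully Gaussian case, which foreshadows the looseness discussed below when the minimal curvature of $p$ is unrepresentative of its typical curvature.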
There are two possible ways to transform the LSI into a BvM result: either we can assume that the target $p$ is strongly log-concave, or we can use the fact that a Gaussian distribution is strongly log-concave. The following two propositions correspond to these two direct applications of the LSI.
Proposition 3: loose/restrictive LSI bound on $KL(q, p)$.
If $p$ is strongly log-concave, so that there exists a strictly positive-definite matrix $K$ such that:

$$\forall \theta: \quad \nabla^2 (-\log p)(\theta) \succeq K$$

then, for any approximation $q$, the reverse KL divergence is bounded:

$$KL(q, p) \le \frac{F(q, p)}{2 \lambda_{\min}(K)}$$
Proposition 4: unwieldy LSI bound on $KL(p, q)$.
If $q$ is a Gaussian with covariance $\Sigma_q$, then for any $p$ the forward KL divergence is bounded:

$$KL(p, q) \le \frac{\lambda_{\max}(\Sigma_q)}{2} F(p, q)$$
Both of these propositions offer an interesting alternative to the KL divergence which can be useful to build a heuristic understanding of the BvM result, but they fail to be useful in practice.
Prop.3 falls short due to the severity of assuming that the target density $p$, i.e. the posterior in a Bayesian context, is strongly log-concave. Indeed, in order for the bound of Prop.3 to be tight, the minimal curvature $\lambda_{\min}(K)$ needs to be comparable to the curvature at the MAP value, $\nabla^2(-\log p)(\mu)$, or more generally to be representative of the typical curvature. For a typical posterior distribution, these quantities differ widely, with the mode being much more peaked than any other region of the posterior distribution, or at least much more peaked than the region with minimum curvature.
Prop.4 falls short for very different reasons. Indeed, it comes with absolutely no restrictions on the target density $p$. However, it bounds the KL divergence with an expected value of a complicated quantity under the target $p$. Dealing with an expected value under $p$ is as complicated as tackling the normalization constant for theoretical or computational purposes, which makes Prop.4 inapplicable.
3 Gaussian approximations of simply log-concave distributions
This section first details the assumptions required and then presents the main result of this article, Theorem 5, establishing an approximation of $KL(\mathcal{LA}, p)$ that is well-suited to computational and theoretical investigations of the quality of the Laplace approximation. Finally, I show how to recover the classical IID BvM theorem as a corollary.
3.1 Assumptions
In this first subsection, I present the assumptions required for Theorem 5 to hold and discuss the reasons why they are necessary. I defer the discussion of their applicability in a Bayesian context to Section 5.
In order for the Laplace approximation to be close to $p$, we need assumptions that control $p$ at two qualitatively different levels.
First, we need global control over the shape of $p$ so that we can avoid trivial counter-examples such as the following mixture of two Gaussians, where the second Gaussian has a very high variance:

$$p(\theta) = \frac{1}{2} \phi(\theta; 0, I_d) + \frac{1}{2} \phi(\theta; 0, \sigma^2 I_d), \qquad \sigma \gg 1$$

where $\phi(\theta; m, \Sigma)$ denotes the density of the Gaussian with the given mean and covariance. Around the mode $\theta = 0$, the density is completely dominated by the component with the small variance and the Laplace approximation would completely ignore the second component. The Laplace approximation would then miss half of the mass of $p$ and would thus provide a very poor approximation. Less artificial examples are straightforward to construct with fat tails instead of a mixture component.
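This failure mode can be quantified numerically. The sketch below (with illustrative parameters assumed, not taken from the article) builds a one-dimensional two-component mixture, computes the Laplace standard deviation from the curvature of the log-density at the mode, and measures how much of the mixture's mass falls within a few such standard deviations.

```python
import math

def mix_pdf(x, s1=0.1, s2=10.0):
    # Equal mixture of a narrow N(0, s1^2) and a very wide N(0, s2^2).
    g1 = math.exp(-0.5 * (x / s1) ** 2) / (s1 * math.sqrt(2 * math.pi))
    g2 = math.exp(-0.5 * (x / s2) ** 2) / (s2 * math.sqrt(2 * math.pi))
    return 0.5 * g1 + 0.5 * g2

# Curvature of log p at the mode x = 0, by central finite differences.
h = 1e-4
curv = (math.log(mix_pdf(h)) - 2 * math.log(mix_pdf(0.0)) + math.log(mix_pdf(-h))) / h**2
sigma_la = 1.0 / math.sqrt(-curv)  # Laplace std, dominated by the narrow component

# Mass of the mixture within +/- 4 Laplace standard deviations of the mode.
def mass_within(a):
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 0.5 * (phi(a / 0.1) - phi(-a / 0.1)) + 0.5 * (phi(a / 10.0) - phi(-a / 10.0))

mass = mass_within(4 * sigma_la)
print(sigma_la, mass)  # the Laplace region captures only about half of the mass of p
```

The Laplace standard deviation comes out near 0.1, the scale of the narrow component, so the wide component's half of the mass is almost entirely outside the Laplace approximation's effective support.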
In the present article, I propose to achieve global control by assuming that is log-concave. Log-concave distributions are not only unimodal but they have tails that decay at least exponentially. As such, it is thus impossible for a lot of mass of to be hidden in the tails and the Laplace approximation thus necessarily captures the global shape of . Please see Saumard and Wellner (2014) for a thorough review of their properties.
However, global control is not enough. Since the Laplace approximation is based on a Taylor expansion of $\log p$ to second order, we need local control on the regularity of $\log p$. We will do so by assuming that $f = -\log p$ is differentiable three times and by controlling its third derivative $\nabla^3 f$, an order-3 tensor which takes three vectors as input and returns a scalar.
The size of the third derivative can be measured by using the max norm, which measures the maximum relative size of the inputted vectors and outputted scalar:

$$\big\| \nabla^3 f(\theta) \big\| = \max_{u, v, w \ne 0} \frac{ \big| \nabla^3 f(\theta)(u, v, w) \big| }{ \|u\| \, \|v\| \, \|w\| }$$

The smaller the third derivative, the more accurate the Taylor expansion of $\log p$ to second order. However, a key technical point is that the absolute size of the third derivative is not what matters. What matters instead is the relative size of the third derivative compared to the second, and more precisely to the Hessian matrix at the mode: $\nabla^2 f(\mu)$. One possibility to measure this relative size is to compute the third derivative of the log-density on the standardized space $y$: $\nabla^3 \tilde{f}(y)$. Theorem 5 establishes that the maximum of the max-norms as we vary $y$ is the key component that controls the distance between $p$ and $\mathcal{LA}$. We will denote this key quantity as $c_3$:

$$c_3 = \max_y \big\| \nabla^3 \tilde{f}(y) \big\|$$
Note that it is already NP-hard to compute exactly the max norm of an order-3 tensor (Hillar and Lim (2013)). $c_3$ is thus potentially tricky to compute. However, this value is only required in order to control the higher-order error terms and, for computational concerns, we will be able to focus instead on the dominating term. The precise value of $c_3$ can then be bounded very roughly or ignored.
3.2 A general deterministic bound
3.2.1 Structure of the proof
Given the two assumptions that the target density $p$ is log-concave and regular, as measured by the scalar $c_3$, it is possible to precisely bound the KL divergence $KL(\mathcal{LA}, p)$. The bound is actually derived from a direct application of Prop.1 and Prop.3, despite their flaws. This subsection gives an overview of the structure of the proof. Full details are presented in the appendix.
The key step is the use of a tricky change of variables from $\theta$ to the standardized variable $y = \Sigma^{-1/2}(\theta - \mu)$ and finally to spherical coordinates in $y$:

$$y = r s, \qquad r = \|y\|$$

where $s$ takes values over the $(d-1)$-dimensional unit sphere $S_{d-1}$.
Under the distribution $\mathcal{LA}$, the pair of random variables $(r, s)$ is simple. Indeed, it is a basic result of statistics that they are independent and that $s$ is a uniform random variable over $S_{d-1}$ while $r$ is a $\chi_d$ random variable (Appendix Lemma 17). Expected values under $\mathcal{LA}$ are thus still simple to compute in the $(r, s)$ parameterization. Furthermore, the KL divergence has the nice feature that it is invariant to bijective changes of variables. This change of variables thus decomposes the KL divergence of interest, $KL(\mathcal{LA}, p)$, into the KL divergence due to $s$ and that due to $r$ (Appendix Lemma 16). More precisely, we have:

$$KL(\mathcal{LA}, p) = KL_s + E_s\big[ KL_{r|s} \big]$$

where $KL_s$ is the KL divergence between the marginal distributions of $s$ under $\mathcal{LA}$ and $p$, and $KL_{r|s}$ is the KL divergence between the conditional distributions of $r$ given $s$ under $\mathcal{LA}$ and $p$.
This decomposition already resurrects Proposition 1. Indeed, the random direction $s$ takes values over a compact region of space. There is thus going to be a maximum for the difference between the log-densities, which means that we will be able to approximate $KL_s$ using the KL variance and precisely bound the error.
Computation of the KL variance is further simplified because the marginal density of $s$ under $\mathcal{LA}$ is constant (uniform over the sphere). Thus, the variance only comes from the marginal log-density of $s$ under $p$. However, computing this marginal would be slightly tricky here since that would involve a slightly unwieldy integral against $r$. An additional useful trick thus consists in replacing the log-marginal with its Evidence Lower Bound (ELBO) computed under the approximating $\chi_d$ distribution of $r$, plus a constant that does not depend on $s$ and thus does not affect the KL variance. The error of this approximation of the log-marginal is precisely equal to $KL_{r|s}$, which we now turn to bounding.
The KL divergence between the conditional distributions of $r$ is still challenging. While the $\chi_d$ distribution of $r$ under $\mathcal{LA}$ is strongly log-concave, the conditional distribution of $r$ under $p$ cannot be guaranteed to be more than simply log-concave without additional highly-damaging assumptions. This is due to the tail of $p$, where the curvature could tend to $0$. An additional trick comes into play here: changing the variable from the radius $r$ to its cubic root $t = r^{1/3}$.
Once more, this does not modify the KL divergence, but it does modify the properties of the problem in interesting ways. Indeed, we can now prove that the distribution of $t$ under $p$ is necessarily strongly log-concave and that its minimum log-curvature tends to the minimum log-curvature of the corresponding distribution under $\mathcal{LA}$. This unlocks applying the LSI in the interesting direction of Proposition 3, thus yielding a bound on $KL_{r|s}$.
3.2.2 Statement of the Theorem
By combining the approximations of both halves of $KL(\mathcal{LA}, p)$, we obtain the following Theorem.
Theorem 5: a deterministic BvM theorem.
For any log-concave distribution $p$ supported on $\mathbb{R}^d$ and its Laplace approximation $\mathcal{LA}$, in the asymptote where $c_3 \to 0$ the reverse KL divergence $KL(\mathcal{LA}, p)$ is bounded, to dominating order, by an explicit expression which scales as $c_3^2$ (up to dimension-dependent factors). The reverse KL thus goes to $0$ as $c_3 \to 0$.
This theorem establishes an upper bound on the rate at which the KL divergence goes to $0$ in the large-data limit. Critically, this convergence is slower for higher-dimensional probability distributions. It is unclear whether this rate is optimal or pessimistic. We discuss this point further in Section 5, where we give an example which scales at this rate but is much less regular than assumed by the theorem, since it does not satisfy the theorem's differentiability assumptions everywhere. However, a careful investigation of the proof reveals that the proof of the theorem does not make use of the corresponding continuity assumption, and this example is thus still very interesting in establishing one worst case of the theorem.
Another important aspect of Theorem 5 to note is that the first-order term of the error is only expressed as expected values under the Gaussian $\mathcal{LA}$. It is thus straightforward to approximate this term by sampling from $\mathcal{LA}$. I take advantage of this property in Section 4 in order to give computable asymptotically-correct approximations of the KL divergence.
3.3 Recovering the classical IID theorem
First, we need to extend the setup of Section 1.2 so that the assumptions of Theorem 5 hold. This only requires two additional assumptions. First, we need the negative log-prior and all negative log-likelihoods $f_i$ to be strictly convex so that the posterior is guaranteed to be log-concave. Second, we need to control the size of the third derivative of the negative log-prior and the typical size of the third derivatives of the negative log-likelihoods $f_i$.
More precisely, let $\theta^*$ denote the pseudo-true parameter value, and let $J$ be the Hessian-based version of the Fisher information:

$$J = E\big[ \nabla^2 f_i(\theta^*) \big]$$

Then, let $m_i$ denote the (random) maximum of the third derivative of $f_i$, normalized by $J$ (i.e. the maximum of the max norm of $\nabla^3 f_i$ after standardizing each input by $J^{-1/2}$).
Note that this normalization by $J$ is comparable to the normalization by $\Sigma$ in the definition of $c_3$ (eq.42) since asymptotically we will have $\Sigma \approx (N J)^{-1}$. We will assume that $E[m_i]$ is finite. We will similarly assume that the equivalent value $m_0$, computed on the prior term, is finite.
Under these assumptions, the MAP estimator constructed from the first $N$ likelihoods is a consistent estimator of $\theta^*$ and asymptotically Gaussian with variance decaying at rate $N^{-1}$ (Appendix Lemma 33, Appendix Prop.34). As a consequence, the covariance of the Laplace approximation scales as (using a law-of-large-numbers approximation of the sum):

$$\Sigma \approx (N J)^{-1}$$
The scaled third derivative of the negative log-posterior then scales approximately as (using the sub-additivity of maxima):

$$c_3 \lesssim N^{-3/2} \Big( m_0 + \sum_{i=1}^{N} m_i \Big)$$

From a law-of-large-numbers argument, the sum scales approximately as $N E[m_i]$ and $c_3$ thus scales at most as $O_P(N^{-1/2})$. It is furthermore straightforward to establish through a law-of-large-numbers argument that this rate cannot generally be improved if any $f_i$ is such that its third derivative is non-zero.
The following corollary of Theorem 5 summarizes this chain of reasoning.
Corollary 6: Bernstein-von Mises, IID case.
Let $p_N$ be the posterior distribution:

$$p_N(\theta) \propto \exp\Big( -f_0(\theta) - \sum_{i=1}^{N} f_i(\theta) \Big)$$

where the $f_i$ are IID function-valued random variables. Let $\mathcal{LA}_N$ be the Laplace approximation of $p_N$ and $J$ be the Hessian-based Fisher information:

$$J = E\big[ \nabla^2 f_i(\theta^*) \big]$$

If $\exp(-f_0)$ and all $\exp(-f_i)$ are log-concave and have controlled third derivatives, with $m_0 < \infty$ and $E[m_i] < \infty$, then, as $N \to \infty$, $c_3 = O_P(N^{-1/2})$ and $p_N$ and $\mathcal{LA}_N$ converge to one another, with $KL(\mathcal{LA}_N, p_N)$ going to $0$ at the rate dictated by $c_3^2$.
Note that this line of reasoning is straightforward to extend to other models for which the $f_i$ might have a more complicated dependence structure. For example, consider linear models with a fixed design. Under such models, the $f_i$ are independent but not identically distributed. However, it is still straightforward to establish, as long as the design is balanced so that no single observation retains influence asymptotically, that the Hessian of $f$ scales linearly in $N$ asymptotically and that $c_3$ would scale as $O(N^{-1/2})$. Similarly, the result could be extended to a situation in which the $f_i$ have a time-series dependency. As long as there is a high-enough degree of independence inside the chain, an Ergodic-theorem-type argument would yield that the Hessian scales linearly and $c_3$ as $O(N^{-1/2})$. More generally, I conjecture that conditions yielding a Locally Asymptotically Normal likelihood (LAN; Van der Vaart (2000)) and a log-concave posterior would result in Theorem 5 being applicable. I will address this question in future work.
4 Computable approximations of $KL(\mathcal{LA}, p)$
In this section, I detail how to approximate the KL divergence $KL(\mathcal{LA}, p)$ by using Theorem 5. I further demonstrate the bounds in a simple example: logistic regression.
4.1 Proposed approximations
At face value, Theorem 5 offers two main approximations: the LSI bound on the $r$-part of the divergence and the variance of the ELBO for the $s$-part, which can be combined to yield an approximation of $KL(\mathcal{LA}, p)$ which we will denote as "LSI+VarELBO" (recall the decomposition $KL(\mathcal{LA}, p) = KL_s + E_s[KL_{r|s}]$).
We can similarly use the KL-variance-based approximation of $KL_s$ to obtain another approximation of $KL(\mathcal{LA}, p)$ which we will denote as "LSI+KLvar".
However, observe that the KL variance is the sum of the VarELBO term and a second term corresponding to the KL divergence due to the $r$-variables (Appendix Lemma 30). The "LSI+KLvar" approximation thus counts twice the contribution of the $r$-variables to $KL(\mathcal{LA}, p)$, which seems silly. Theorem 5 thus offers heuristic support for the idea of using directly the KL variance as an approximation of $KL(\mathcal{LA}, p)$, as suggested by Proposition 1. I must emphasize again that neither of those results establishes rigorously that this is correct, but this turns out to be a very accurate approximation of $KL(\mathcal{LA}, p)$ in my experiments.
Computing these bounds requires the evaluation of expected values under $\mathcal{LA}$, which can be performed through sampling from $\mathcal{LA}$. Further simplification can be achieved by replacing $\log p$ by its Taylor approximation to third or fourth order. For the KL variance approximation, this immediately yields an explicit expected value. For the LSI and the VarELBO approximations, we need to further perform a rough approximation of the radius (accurate in the large-$d$ limit; Appendix Lemma 19) in order to simplify the corresponding expected values.
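As a one-dimensional sanity check of the KL-variance heuristic, the sketch below (with an assumed logistic-plus-Gaussian-prior target, not the article's experiment) compares half the KL variance under the Laplace approximation to the exact reverse KL divergence computed by numerical quadrature. The factor one-half follows from the small-perturbation cumulant expansion of the KL divergence; the article's own definitions may normalize the KL variance differently.

```python
import math

def f(x):  # assumed negative log-density: logistic likelihood + Gaussian prior
    return math.log(1.0 + math.exp(-x)) + 0.5 * x * x

# Laplace approximation by Newton's method.
mu = 0.0
for _ in range(50):
    s = 1.0 / (1.0 + math.exp(-mu))
    mu -= (s - 1.0 + mu) / (s * (1.0 - s) + 1.0)
s = 1.0 / (1.0 + math.exp(-mu))
sigma = 1.0 / math.sqrt(s * (1.0 - s) + 1.0)

def log_g(x):  # log-density of the Laplace approximation
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# Deterministic quadrature on a wide grid around the mode.
dx = 0.001
grid = [mu - 10.0 + i * dx for i in range(20000)]
g = [math.exp(log_g(x)) for x in grid]
h = [log_g(x) + f(x) for x in grid]  # log g - log p, up to the constant log Z_p

log_zp = math.log(sum(math.exp(-f(x)) for x in grid) * dx)
kl = sum(gi * hi for gi, hi in zip(g, h)) * dx + log_zp  # exact reverse KL

mean_h = sum(gi * hi for gi, hi in zip(g, h)) * dx
kl_var = sum(gi * (hi - mean_h) ** 2 for gi, hi in zip(g, h)) * dx
print(kl, kl_var / 2.0)  # half the KL variance tracks the reverse KL
```

Both quantities only involve the unnormalized target inside a variance or a $\mathcal{LA}$-expectation, so in higher dimensions the quadrature would simply be replaced by sampling from $\mathcal{LA}$.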
These various approximations are summarized in the following corollary of Theorem 5.
Corollary 7: computable approximations of $KL(\mathcal{LA}, p)$.
In an asymptote where $c_3 \to 0$, the KL divergence $KL(\mathcal{LA}, p)$ can be approximated as (in order from most interesting to least):