 # Metric Gaussian Variational Inference

A variational Gaussian approximation of the posterior distribution can be an excellent way to infer posterior quantities. However, to capture all posterior correlations the parametrization of the full covariance is required, which scales quadratic with the problem size. This scaling prohibits full-covariance approximations for large-scale problems. As a solution to this limitation we propose Metric Gaussian Variational Inference (MGVI). This procedure approximates the variational covariance such that it requires no parameters on its own and still provides reliable posterior correlations and uncertainties for all model parameters. We approximate the variational covariance with the inverse Fisher metric, a local estimate of the true posterior uncertainty. This covariance is only stored implicitly and all necessary quantities can be extracted from it by independent samples drawn from the approximating Gaussian. MGVI requires the minimization of a stochastic estimate of the Kullback-Leibler divergence only with respect to the mean of the variational Gaussian, a quantity that only scales linearly with the problem size. We motivate the choice of this covariance from an information geometric perspective. The method is validated against established approaches in a small example and the scaling is demonstrated in a problem with over a million parameters.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The inference of large and complex Bayesian models is a challenging task. For many models of interest the posterior distribution cannot be obtained analytically and one has to rely on approximations to this posterior. Depending on the concrete problem, there is a large toolbox of methods with different accuracies and computational demands. Sampling techniques based on MCMC approximate the true posterior with a set of samples drawn from this posterior, which converges to the true posterior distribution in the limit of infinite samples. Such methods do come with extremely high computational costs, especially in high dimensions and strong coupling between the parameters. One efficient extension of this is Hamiltonian Monte Carlo (HMC) (Duane et al., 1987), which explores the posterior more efficiently. It still struggles for deep hierarchical Bayesian models due to the coupling between different parameters. To solve this, Betancourt and Girolami (2015) proposes to choose a coordinate system in which the hierarchy is flat and therefore the parameters are independent conditioned on the data. We will rely in this paper on such a standardized parametrization as well, as it exhibits a number of conceptual and numerical advantages. Another modification to HMC is proposed by Girolami and Calderhead (2011). Here the Fisher information metric is used to explore the manifold even more efficiently. This metric will play a central role in this paper, as we will use it to locally approximate the uncertainty.

Thus, sampling techniques provide the most accurate posterior approximations, but are demanding the most computational costs. On the other side of the spectrum of possible inference methods is the Maximum Posterior (MAP) estimate. It approximates the posterior with a delta distribution at the most probable parameter configuration. Its single advantage is that it is extremely fast to compute. It only requires the maximization of the posterior, which is still feasible in extremely high dimensional parameter spaces. There are a couple of downsides to this approach, for example this maximization may not result in a global maximum for multi-modal distribution and it depends on the initial conditions. More severe is the lack of a notion of uncertainty. It is prone to over-fit the concrete data realization, and especially in complex model, results are often implausible. One way to obtain an uncertainty estimate from a MAP estimate is the Laplace approximation (for details see

Bishop (2006)

), which approximates the true posterior with a Gaussian distribution centered at the MAP and the inverse Hessian at this location is used as an approximate covariance. Sometimes also the Fisher information metric is used, then it is a Fisher-Laplace approximation

(Amari, 1998; Hartmann and Vanhatalo, 2017). This approach lends the concepts of natural gradients (Amari, 1998; Martens, 2014), are used for efficient numerical optimization. This is already close to the algorithm we are proposing. Instead of first maximizing the posterior and then estimating the uncertainty around the maximum, we will use the Fisher information metric at our current location in parameter space as covariance estimate and then we shift the location of this Gaussian such that it better approximates the true posterior. We iteratively re-estimate the uncertainty at the newly obtained location until it converges.

An economic compromise between accuracy and computational cost is variational inference. For a comprehensive review on this topic see Blei et al. (2017). Here the true posterior distribution is approximated with some other, parametrized distribution. These parameters are then adjusted such that the variational Kullback-Leibler divergence is minimized. The choice of the approximating distribution is crucial to capture certain aspects of the posterior. A popular approach for complex models is the mean-field approximation, where different parameters are approximated individually by their own distribution, assuming independence. This assumption might not be justified, resulting in a sub-optimal approximation. Another widely used approach is a Gaussian approximate distribution (see Opper and Archambeau (2009)). It allows to preserve posterior correlations between the different model parameters by inferring the covariance as well. This approach severely limited to small problems as the size of the covariance scales quadratically with the overall problem size. For variational inference it is also important in which coordinate system a certain parameter is approximated. Automatic Differentiation Variational Inference (ADVI) (Kucukelbir et al., 2017) proposes to choose a standard parametrization of the problem, utilizing the reparametrization trick (Kingma and Welling, 2013), similarly as in the previously mentioned HMC context. These standard coordinates can be chosen in a way that the prior on all parameters are independent Gaussian distributions. In this space it is then natural to approximate the posterior with a Gaussian as well. For uninformative data the posterior will still be close to a Gaussian which is captured well by the approximation.

As mentioned, a full parametrization of the covariance is only feasible for relatively small problems due to its quadratic scaling with the number of model parameters. In this paper we propose Metric Gaussian Variational inference (MGVI) that replaces the covariance of our approximating Gaussian with the inverse Fisher information metric evaluated at the current estimate of the mean, following the spirit of the Fisher-Laplace approximation. This way we do not have to parametrize the covariance at all. The metric itself is still an object of the squared parameter size, but it is sufficient to have it only available implicitly as a collection of sparse matrices expressed by computer routines. We describe a procedure how independent samples according to such an implicit covariance can be obtained. These samples are used to stochastically estimate and minimize the variational Kullback-Leibler divergence between the true posterior and the approximation with respect to the mean of the Gaussian. This results in a new mean estimate with an updated Fisher information metric. The procedure is iterated until convergence. MGVI infers only the mean of the Gaussian, which scales linearly with the problem size. This allows to use the method for significantly larger problems while still taking correlations between all parameters into account.

We validate the accuracy of MGVI in a first example against HMC, as well as full covariance and mean-field ADVI in a Poisson log-Gaussian inference with squared exponential kernel. The result provided by MGVI is extremely close to the posterior samples provided by HMC and practically equivalent to those from full covariance ADVI. Mean-field ADVI was incapable of capturing the correct posterior correlations.

In a second example we demonstrate the scaling in size and complexity of MGVI by approximating the posterior of a binary Gaussian process classification with non-parametric kernel with over a million parameters. In this case a MAP provides a poor result and sampling, as well as full covariance ADVI is practically unfeasible. The MGVI inference was performed on a single CPU within a couple of hours and provides accurate results for the classification rate as well as the non-parametric spectral density. The obtained uncertainty on the estimate captures the true error well, however it is under-estimated on a percentage level. An under-estimation is expected in the variational approach.

## 2 Basics and Notation

### 2.1 Bayesian Inference

Bayesian inference in general describes how the knowledge on one quantity of a system affects the knowledge on some other quantity of interest. This is done according to Bayes theorem:

 P(θ|d)=P(d|θ)P(θ)P(d) (1)

The posterior distribution of the unknown quantity given some known data is equal to the likelihood of observing the data given a certain configuration of multiplied by the prior distribution . This whole expression is normalized by one over the evidence .

Prior knowledge on the system is encoded in the prior distribution. The likelihood describes how the observed data is related to the parameters of the model. The main difficulty arises in the calculation of the evidence to obtain a properly normalized posterior distribution. In many cases this normalization cannot be calculated analytically and one either has to perform an approximate inference, for example Maximum Posterior (MAP), or variational inference, or consult costly sampling techniques.

Instead of working with probability distributions, it is equivalent to discuss the inference in terms of information

, defined as the negative logarithm of a distribution , i.e. . Bayes theorem in this perspective reads:

 H(θ|d) ≡−ln(P(θ|d)) (2) =H(d|θ)+H(θ)−H(d) (3) ˆ=H(d|θ)+H(θ) (4)

Often we will deal with un-normalized distributions, which are proportional to the true posterior, up to a constant factor. On the logarithmic level of information, this corresponds to terms of constant information independent of the quantity of interest. Leaving these terms out is indicated here by the

sign. Information, compared to probability, is additive, making it a more convenient quantity to deal with and we will base our further discussion on it.

### 2.2 Variational Inference

Variational inference is a powerful tool to obtain approximate posterior distributions to complex problems within reasonable timescales (Blei et al., 2017). Because the true posterior is only approximated, some features of the true posterior will be neglected. The approximation is obtained by minimizing the Kullback-Leibler divergence (Kullback and Leibler, 1951). It is defined as

 DKL(˜P(θ|θ∗)||P(θ|d)) =∫Dθ˜P(θ|θ∗)ln˜P(θ|θ∗)P(θ|d) (5) ≡⟨H(θ|d)⟩˜P(θ|θ∗)−⟨˜H(θ|θ∗)⟩˜P(θ|θ∗) (6) ˆ=⟨H(d,θ)⟩˜P(θ|θ∗)−⟨˜H(θ|θ∗)⟩˜P(θ|θ∗). (7)

Expectation values are expressed by , always indicating the respective distribution. We will indicate the approximate distribution and information by and to make them more distinguishable from their counterparts in the original problem. In order to minimize the KL-divergence between the two distributions it is sufficient to minimize the last expression, as parameter-independent terms are irrelevant to obtain minimal divergence. This expression is equivalent to the negative Evidence Lower Bound (ELBO) (Bishop, 2006). The parameter solution of minimal KL divergence provides the variational approximation of the original problem.

It is sufficient to minimize a stochastic estimate of the KL-divergence (Salimans et al., 2013) by drawing samples from the approximate distribution. Using the reparametrization trick (Kingma and Welling, 2013) allows for complex approximate posterior distributions.

## 3 Gaussian Variational Inference

Gaussian Variational Inference (Opper and Archambeau, 2009) describes the special case of Gaussian approximate distributions. They exhibit a number of advantages, for example capturing uncertainty and complex correlations between all quantities. In this case the approximate distribution reads:

 ˜P(θ|¯θ,Θ)=G(θ−¯θ,Θ) (8)

With this the general KL-divergence is:

 DKL(G(θ−¯θ,Θ)||P(θ|d))ˆ=⟨H(d,θ)⟩G(θ−¯θ,Θ)−⟨˜H(θ−¯θ,Θ)⟩G(θ−¯θ,Θ) (9)

In order to perform the variational inference of the parameters, the expression above is minimized with respect to the approximate posterior mean and covariance . The second term in this equation corresponds to the Shannon entropy of the approximate Gaussian and it has the analytic form:

 ⟨˜H(θ−¯θ,Θ)⟩G(θ−¯θ,Θ)=12ln|2πeΘ| (10)

Here expresses a determinant and is Euler’s number. For the inference we require gradient information with respect to the parameters. Here we can use two properties of Gaussian distributions to replace the derivatives with respect to the parameters outside the expectation values with derivatives of their arguments inside the expectation value. Consider some generic function :

 (11) =12∂2∂¯θ∂¯θ†⟨g(θ)⟩G(θ−¯θ,Θ)=12⟨∂2∂θ∂θ†g(θ)⟩G(θ−¯θ,Θ) (12)

This results in a gradient of the mean:

 (13)

Note that the Shannon entropy term does not explicitly contribute as it is independent of the mean. Setting the gradient of Eq. 9 with respect to the covariance to zero, using Eq. 10 and Eq. 12, one can derive the following implicit relation for the covariance:

 Θ−1= ⟨∂2∂θ∂θ†H(d,θ)⟩G(θ−¯θ,Θ) (14) = −⟨∂∂θ(1P(d,θ)∂P(d,θ))∂θ†)⟩G(θ−¯θ),Θ (15) = ⟨∂H(d,θ)∂θ∂H(d,θ)∂θ†⟩G(θ−¯θ,Θ)−⟨1P(d,θ)∂2P(d,θ)∂θ∂θ†⟩G(θ−¯θ,Θ) (16)

This is only a valid representation for

if the overall term is positive definite. The first term, containing the outer product of first derivatives certainly is. Problematic is the second term, which involves the second derivative of the probability distribution. It might introduce negative eigenvalues, destroying the overall positive definiteness of the covariance of the Gaussian. Note that if the posterior

is identical to the approximating Gaussian, this second term vanishes and it is small if the posterior is close to this Gaussian.

Due to its positive definiteness any covariance can be expressed in terms of another matrix via . This way the reparametrization trick (Kingma and Welling, 2013) can be used to perform the inference as now an arbitrary Gaussian field is expressed in terms of an a priori standard Gaussian parameter via . The problem with an explicit parametrization of the covariance is that the size of the covariance scales quadratically with the number of model parameters in the original problem. This allows only for small problem sizes if the full correlation structure should be taken into account. For large problems usually a diagonal covariance has to be assumed to avoid the explosion of free parameters in the approximation. A diagonal covariance approximation has forgotten any posterior correlations between the parameters.

In variational inference the family of approximating distributions can be chosen freely. One can, for example, choose a Gaussian with a fixed covariance and only infer the mean. In this spirit the Maximum Posterior solution (MAP) can be interpreted as a Gaussian approximation with vanishing variance. How useful the result of such an approximation is depends critically on the accuracy of the uncertainty estimate.

In this paper we propose to use a local estimate of the uncertainty at the location of the current mean extracted from the true posterior distribution. This follows the logic of a Laplace approximation, where first the MAP solution is computed and then the curvature around the maximum is used as an uncertainty estimate. The curvature in the Laplace approximation is guaranteed to be positive definite. For our purpose it is not the right quantity to approximate the local uncertainty, as it can become indefinite outside the mode. During the inference we might certainly come across such regions. One possible solution is to use the Fisher Information metric as proxy, which we will discuss it in the next section.

If we only have to infer the mean of the approximating Gaussian, we do not have to calculate the Shannon entropy given in Eq.10, as it is (mostly) independent of the mean. This simplifies the relevant KL-divergence to only the cross entropy term:

 DKLˆ=⟨H(d,θ)⟩G(θ−¯θ,Θ) (17)

We already calculated its gradient in Eq. 13. In order to perform the minimization of the KL-divergence, we have to compute the expectation value. Instead of minimizing the KL directly, we can also minimize a stochastic estimate where the expectation value is replaced with the summation over a number of samples drawn from the approximate distribution (Salimans et al., 2013).

Splitting the sample in a mean contribution and Gaussian residual allows us to adapt the samples to an updated mean. This is a form of the reparametrization trick. It is therefore sufficient to obtain residual samples to infer the mean of the approximate Gaussian for a given covariance.

## 4 Standardization

Deep hierarchical Bayesian models can be used to describe sophisticated models and complex dependencies. The inference of their parameters might be extremely hard due to the strong interdependence between the different quantities, resulting in a numerically stiff problem. For continuous parameters, one can reparametrize the problem to get rid of the hierarchical structure and flatten down the hierarchy. (Kingma and Welling, 2013; Betancourt and Girolami, 2015). Automatic Differentiation Variational Inference (ADVI) (Kucukelbir et al., 2017) proposes to choose a parametrization in which all parameters are independent a priori and have a standard Gaussian prior and perform a variational inference there. This parametrization exhibits a number of numerical and conceptual advantages, especially if the true distribution is approximated with a Gaussian (Knollmüller and Enßlin, 2018).

Conceptually one takes a likelihood together with a hierarchical prior and performs coordinate transformation to uniform parameters using the multivariate distributional transform (Rüschendorf, 2009). This uses the inverse conditional cumulative density functions, following the logic of inverse transform sampling (Devroye, 1986). The uniform parameters are then transformed to standard Gaussian parameters in the same fashion with . The resulting parameters are a priori completely independent and the entire complexity is encoded in the composition of the two transformations . The probability distribution and its information in these coordinates read:

 P(d,ξ) =P(d|f(ξ))G(ξ,1) (20) H(d,ξ) ˆ=H(d|f(ξ))+12ξ†1ξ (21)

For the rest of the paper we will indicate standardized parameters with , whereas general parameters are . The Gaussian approximation in standard coordinates is denoted as . This standardization allows us to obtain an uncertainty estimate of a certain structure, which enables us to draw samples from the approximate distribution.

## 5 Information Metric

The Fisher information metric measures the sensitivity of a distribution under parametric perturbation and it can be used as a lower bound of the variance of an estimator. We will use it in this paper in two distinct ways. First, to construct a locally valid uncertainty estimate at the location of the mean of our distribution, and secondly to perform fast optimization using natural gradient optimization (Amari, 1998). Conceptually these two are completely distinct, but they will involve the same mathematical object.

### 5.1 Approximating the Covariance

We want to find a locally valid uncertainty estimate of the true posterior at the location of the current mean estimate of our Gaussian to be used as the covariance of the variational Gaussian. As mentioned before, the curvature cannot be used due to possibly negative eigenvalues. A suitable uncertainty estimate derives from the Fisher information metric. We cannot give a mathematically rigorous justification why it is the right choice, but this object exhibits a number of required properties. First and foremost it can be represented as an implicit operator, which allows the application of the here presented procedure even for extremely high dimensional problems. More details on this will follow in the next section. There are also a number of motivations to choose such an approximation. The covariance estimate we will be using also occurs in the Fisher-Laplace approximation (Hartmann and Vanhatalo, 2017) and as local guidance for HMC (Girolami and Calderhead, 2011), as well as natural gradient optimization (Amari, 1998). Finally, an empirical investigation of the here presented procedure yields exceptional results, as we will see later.

Information geometry (Amari and Nagaoka, 2007) explores the statistical manifold generated by differential variation of probability distributions equipped with the Fisher information metric. Its basis is differential geometry that provides a rich toolbox to investigate the properties of probability distributions.

For now we adapt a frequentists perspective, varying parameters of some (likelihood) distribution. Here the Fisher information metric is defined as:

 Md|θ (22)

The information metric is by construction symmetric and positive definite. Let now

be an unbiased estimator for

, then the variance of the estimator is bounded by the inverse Fisher information metric:

 ⟨(θ−ˆθ)(θ−ˆθ)†⟩P(d|θ)≥M−1d|θ (23)

This is the Cramér-Rao bound (Cramér, 1946; Rao, 1992). The indicates that the left minus the right side of the equation exhibits a positive semi-definite matrix.

The concept of Fisher information can also be extended to a Bayesian setting (Ferreira, 1981) via averaging over the prior and adding a prior information term. The joint information metric then reads:

Note the structural similarity of this object and the first term in Eq. 16. This leads to the Schützenberger inequality (Schützenberger, 1957):

 (26)

The posterior variance averaged over the evidence is larger or equal to the inverse information metric. Unfortunately this inequality does not make a statement directly about the posterior variance and we usually cannot calculate due to computationally unfeasible expectation values involved.

This is the point where the mathematical argumentation stops and we have to make a number of ad hoc simplifications to obtain computational feasibility. We want to evaluate this joint metric at a certain location for a certain data set to obtain a local uncertainty estimate. The location will be the current mean of our approximate Gaussian distribution and the data will be the actually observed data.

We propose the following metric to be used as a locally valid uncertainty estimate around :

 M ≡Md|¯θ+M¯θ (27)

To some extent we can regard as an estimator of , extend the Fisher information metric from the likelihood with the prior metric and interpret this as an estimate of the posterior uncertainty on itself. This can be done because of the symmetry between the mean of a Gaussian and its argument. This is also the term used by Girolami and Calderhead (2011), as well as Hartmann and Vanhatalo (2017). It is worth pointing out that in the purely Gaussian case the above relation reproduces exactly the posterior covariance, which should be reassuring.

Generalizing the above procedure to hierarchies of the Bayesian model with
yields:

 M= Md|¯θ+N∑i=1Mθi|¯θi<\quad, with (28) Mθi|θi<= ⟨∂H(θi|θi<)∂θi≤∂H(θi|θi<)∂θ†i≤⟩P(θi|θi<) (29)

Here indicates all parameters higher up the hierarchy of the deep Bayesian model. This leads to a number of complex and involved expressions that make it hard to calculate and later draw samples from.

One approach to solve this is choosing the standardized parametrization of the problem in which the hierarchy becomes flat and all parameters are independent a priori (Knollmüller and Enßlin, 2018). The mean of the approximate Gaussian in this parametrization is indicated by

. In this special case the prior metric becomes the identity matrix and our uncertainty estimate takes the following form:

 M=Md|¯ξ+1 (30)

Usually the likelihood itself is a rather simple distribution describing some process on how the data was created. What makes the model complex is how the parameters of the simple distribution are possibly non-linearly related to a set of other parameters. Having a Gaussian likelihood and a Gaussian prior can still result in an arbitrarily complex model if the function that relates the prior quantity to the parameters of the likelihood is sufficiently rich. In the standardized coordinates this is exactly where the complexity of the model is stored.

Consider some simple parameters of the likelihood . These might exhibit a deep hierarchical prior structure, but the likelihood itself is simple. In order to calculate the Fisher information metric it is sufficient to calculate the metric for the simple parameters and to know how these are related to the complex parametrization. This function is simply the coordinate transformation to standard coordinates.

 Md|ξ= ⟨∂H(d|ξ)∂ξ∂H(d|ξ)∂ξ†⟩P(d|ξ) (31) = (32) = J†ξMd|θJξ (33)

The middle term is the information metric, the outer terms are the Jacobian matrices of the relation how these simple parameters are related to the complex model. The metric contribution from the likelihood is simply a push-forward of the likelihood metric into the parameter space. The inverse of this term is also the Cramér-Rao bound for biased estimators.

Adding the prior information metric to that and evaluating it at the current mean estimate, we obtain our local estimate of the posterior covariance .

 Ξ−1 =M (34) =J†¯ξMd|¯θJ¯ξ+1 (35)

Once the we formulated our problem in standardized coordinates, the uncertainty estimate has only three ingredients. First, the prior metric, which is just the identity operator in the space of parameters. Second, the Fisher information metric of the likelihood, which can be easily calculated for a large number of commonly used likelihoods. Finally, the Jacobian of the function relating the likelihood parameters to the standardized model parameters. This function has to be implemented anyway as it is equivalent to implementing a concrete inference problem. Its Jacobian can then be obtained by auto-differentiation, or consistently applying the chain rule. Typically none of these quantities have to be stored in the form of a dense matrix. We will elaborate on the concept of implicit operators in the dedicated Section

6. Nevertheless, this approximate covariance is a non-diagonal approximation that captures correlations between all involved parameters.

### 5.2 The Information Metric of Variational Gaussian Inference

The information metric of the Gaussian variational inference problem is an object of its own interest. For the general case consider two distributions and . Their KL-divergence reads:

 DKL(p||q)=⟨ln(q)⟩p−⟨ln(p)⟩p (36)

We will investigate small variations of the approximate distribution . Expanding the divergence up to second order in this variation yields:

 DKL(p+δp||q) (37) (38)

For Gaussian Variational Inference we consider and a posterior distribution . For MGVI only variations of the mean are relevant, therefore .

 DKL(p+δp||q)≈ DKL(p||q)−⟨ln(q)Θ−1(θ−¯θ)⟩pδ¯θ −12⟨(θ−¯θ)†Θ−1(θ−¯θ)Θ−1(θ−¯θ)⟩pδ¯θ +12δ¯θ†⟨Θ−1(θ−¯θ)(θ−¯θ)†Θ−1⟩pδ¯θ (39) = DKL(p||q)+⟨∂H(θ|d)∂θ⟩pδ¯θ+12δ¯θ†Θ−1δ¯θ (40)

This second order expansion of the KL divergence of Gaussian variational inference reveals that is the Fisher information metric of the problem. We also re-discover the gradient in the linear term as also stated in Eq. 13. This allows us to use efficient Natural Gradient descent (Amari, 1998; Martens, 2014) to perform our inference.

## 6 Implicit Operators

The information metric as a matrix has a dimension of the number of parameters squared. Storing it explicitly on a computer is already unfeasible for relatively small problems. In imaging for example, millions of pixel parameters are not uncommon and we will demonstrate MGVI for such an example at the end. The metric is build out of a collection of linear transformations, projections and diagonal operators that all can be expressed efficiently by sparse matrices represented by computer routines. The metric itself is therefore expressible as an implicit operator, described by the composition of these simple operators. By construction the metric is linear and positive definite, and therefore invertible. The inverse of the metric correlates all parameters with each other, usually resulting in a dense matrix expression, which will serve as approximate posterior covariance. This object is of interest during the inference, as well as for posterior analysis. As mentioned before, we cannot afford to store the posterior covariance at any moment explicitly. We have to extract all relevant information on correlations from the implicit metric only. This requires to apply the metric, as well as its inverse to vectors.

The implicit representation allows us to apply the metric to some vector efficiently.

 b =Θ−1x (41)

More problematic is the application of the covariance , the inverse metric, to some vector .

 x =Θb (42)

This matrix inversion can be done by solving Eq. 41 numerically for , equivalent to solving a set of linear equations. The metric is certainly positive definite, which allows us to use the Conjugate Gradient algorithm (Shewchuk et al., 1994) for this inversion. This algorithm makes extensive use of the positive definiteness of the problem, allowing for rapid convergence, compared to more general solvers. The resulting vector then approximately satisfies Eq. 42.

The numerical inversion of the metric is the key to draw samples from a Gaussian distribution with the metric as precision matrix.

 Θ=⟨θθ†⟩G(θ,Θ)≈1N∑{θ∗i}θ∗iθ∗†i (43)

By choosing a standardized parametrization in which all parameters exhibit a standard Gaussian prior, our covariance always exhibits the identical structure:

 Ξ=(J†¯ξMd|¯ξJ¯ξ+1)−1 (44)

This is the structure of a Wiener covariance (Wiener, 1949) with a standard Gaussian prior, the Jacobian takes the role of a response, or design matrix and finally the Fisher information metric behaves like the noise covariance. We will use this structure to re-create a completely synthetic Wiener filter reconstruction from which we extract a valid residual sample that just has to be shifted to the current mean to obtain a sample from the approximate posterior distribution (Knollmüller and Enßlin, 2017).

We start by drawing a parameter realization from the standard Gaussian prior and some noise realization. Note that the likelihood metric should be easily invertible and we should have access to its eigenbasis for common likelihoods.

 ξ′ ↶G(ξ,1) (45) n′ ↶G(n,M−1d|¯ξ) (46)

The next step involves generating synthetic data by applying the Jacobian to the prior sample and adding the noise to that.

 d′=J¯ξξ′+n′ (47)

From this synthetic data we can now calculate the posterior mean according to the model:

 ¯ξ′ =Ξj′\quad,\,with (48) j′ =J†¯ξMd|¯ξd′ (49)

Here is the noise weighted, back-projected data, or information source. In order to solve for the posterior mean, the covariance has to be applied to that vector. To perform this step numerically, we will rely on implicit operator inversion, as described above. The posterior distribution and variance is for this synthetic problem given by

 P(ξ|d′) =G(ξ−¯ξ′,Ξ) (50) ⟨(ξ−¯ξ′)(ξ−¯ξ′)†⟩G(ξ−¯ξ′,Ξ) =Ξ. (51)

Thus, exhibits correlations according to our approximate covariance. Shifting it to the actual mean of the approximate posterior results in a sample from this distribution:

 ¯ξ+Δξ↶G(ξ−¯ξ,Ξ) (52)

Using this procedure, we can draw a set of independent samples from the approximate posterior distribution, which allows us to statistically estimate the Kullback-Leibler divergence. Note that drawing these samples can be costly, as every sample requires the numerical inversion of the covariance, but drawing several samples is completely independent from each other and it can be done in parallel. Overall we might want to use as little samples as possible to reduce the numerical effort. Note that is equally a sample from the distribution that requires almost no additional computations. Despite being statistically dependent of adding it to the set of samples still leads to the correct sample average. Empirically we found that adding such samples is helpful in stabilizing the algorithm by counterbalancing extreme fluctuations in certain parameters.

Another important point is how accurately the numerical inversion is performed. Of course, a higher accuracy results in better samples, but also requires more computations. The effect of un-converged samples depends mainly of the starting position of the Conjugate Gradient. Roughly speaking it updates first the most informative directions. These correspond to the smallest eigenvalues of the covariance. Starting at the mean of the standard Gaussian prior, i.e. at zero, after iterations of the conjugate gradient at least the most informative directions are updated in the mean. The corresponding residual now exhibits the correct variance for these direction, whereas the remaining directions still have the prior variance. Overall, un-converged samples will have the correct variance for the best informed directions and the remaining directions over-estimate the actual variance encoded in the approximate covariance. This behavior safeguards us from a number of pitfalls that can be observed in MAP estimators by underestimating, or ignoring variance.

## 7 Metric Gaussian Variational Inference

At this point we want to summarize the key concepts of the here proposed method. MGVI approximates a complex posterior with a Gaussian distribution. The covariance of this approximate posterior is a local estimate of the true uncertainty at the location of its current mean. Using implicit operators, we avoid explicitly parametrizing its covariance, reducing the number of required parameters drastically. In general we are free to chose any valid covariance, but practically it should resemble the true local covariance as closely as possible and one must be able to draw samples from a Gaussian with such a covariance.

Our proposed covariance makes it especially convenient, but it is certainly not the only possibility. The procedure might be extended to better covariance estimates in the future.

We chose to follow ADVI (Kucukelbir et al., 2017) and perform the approximation in a standardized parametrization. This might result in non-Gaussian approximate posteriors in the original parametrization. In this parametrization the information, as outlined in Eq. 21

of the joint distribution over data and standardized parameters

We want to variationally approximate the posterior corresponding to this model with a Gaussian distribution of the form (Eq. 8)

 ˜P(ξ|¯ξ,Ξ)=G(ξ−¯ξ,Ξ) (54)

by choosing such that it locally approximates the true covariance at the location of . This requires the minimization of the Kullback-Leibler divergence between the approximate and the true posterior with respect to the mean parameter . As stated in Eq. 17 it consists of only one term explicitly depending on the mean:

 (55)

The gradient on the mean is calculated by averaging the gradient of the information with respect to the parameters over the approximate distribution (Eq. 13):

 ∂DKL∂¯ξ=⟨∂H(d,ξ)∂ξ⟩G(ξ−¯ξ,Ξ) (56)

To efficiently minimize the KL-divergence for a given , we found in Eq. 40 that this covariance is equivalent to the Fisher information of the minimization problem, which allows us to base our minimization scheme on natural gradients. We obtain our descent direction by weighting the gradient with the inverse local metric:

 Δ¯ξ=Ξ∂DKL∂¯ξ (57)

The local uncertainty around is estimated by the Fisher information metric of the likelihood evaluated at the current mean plus the prior metric (Eq. 35):

 Ξ=(J†¯ξMd|¯θJ¯ξ+1)−1 (58)

where was the Fisher metric of the likelihood in simple coordinates, the Jacobian of the standardizing transformation and the identity operator. This is a non-diagonal full-rank, positive definite matrix that correlates all parameters with another. We cannot store it explicitly at any time, but its inverse, the precision matrix can be well represented by a collection of sparse, implicit operations. In order to work with the covariance, we do have to rely on numerical operator inversion, as outlined in Sec. 6.

Crucial to this entire method is that we can draw samples following the statistics of such an implicit covariance. This enables us to approximate all the above expectation values with sample averages without having to calculate any of them analytically. This broadens the applicability of this method to arbitrarily complex models, as long as the standardization is feasible.

Such samples are drawn by identifying the individual terms of the covariance with the one of a Wiener covariance and to perform a reconstruction of synthetic data with the underlying linear measurement model. The residuals of the initially drawn synthetic parameters and its reconstruction exhibits exactly the correct covariance of the approximation. Solving the linear filter problem requires the numerical inversion of the covariance, but this way we avoid the necessity to have it explicitly available. In the same way we do also only require the application of the covariance to the gradient to calculate the mean in the next iteration. A possible way to implement MGVI is outlined in Algorithm 1. Details may be adjusted for personal preference or better alternatives, for example the sampling routine, the concrete minimization or inversion strategy.

## 8 Numerical examples

We will explore the applicability of the here proposed algorithm in two examples.

In the first example we discuss the problem of inferring the rate of a Poisson distribution described as a log-Gaussian process. This process exhibits a squared exponential kernel of known amplitude and width. The results obtained by MGVI are compared to HMC sampling, MAP estimation, ADVI with a full covariance, as well as to a mean-field approximation with diagonal covariance.

The second example demonstrates the well behaved scaling of MGVI with the problem size, as well as its viability in the context of complex models with conceptually distinct parameters. Here we discuss the problem of binary Gaussian process classification in two dimensions with non-parametric kernel estimation. The data consists of binary values with associated location. The likelihood is the Bernoulli distribution and its rate is linked through a sigmoid function to a Gaussian process with unknown kernel. The size of this problem exceeds one million parameters. The computation and storage of a dense covariance as used by ADVI with full covariance is computationally unfeasible as it would require to maintain

entries. This problem size prohibits validation with the other methods and we compare the result of MGVI of the second example only to an inexpensive MAP estimate.

For an even larger numerical example with real data we refer to Leike and Enßlin (2019), where a three dimensional dust map in our galactic vicinity is reconstructed in a resolution of voxels from dust absorption measures and star locations obtained by the Gaia satellite. The reconstruction problem involved a truncated Gaussian likelihood with log-Gaussian prior and unknown kernel, analogous to the model used in the second example. The inference was conducted using the here described MGVI procedure. Finally, Frank et al. (2019) formulate locality and causality priors to learn the dynamics of a field from noisy and incomplete observation. Again, the inference of this field together with its dynamics is done via MGVI.

### 8.1 Poisson log-Gaussian Inference

#### 8.1.1 Setup

In the first example we discuss the inference of the rate of a Poisson likelihood providing count data , where the rate is modeled as a log-Gaussian process with Gaussian kernel of known amplitude and width. The count data for our experiment is displayed in Fig. 1. The Poisson likelihood reads:

 P(d|λ)=λde−λd! (59)

Its Fisher information metric with respect to this rate parameter is:

 ⟨∂H(d|λ)∂λ∂H(d|λ)∂λ†⟩P(d|λ)=ˆλ−1 (60)

This is a diagonal matrix, indicated by the hat, in the data space with the inverse of the rate as covariance. Interestingly, it is the inverse of the variance of the Poisson distribution. The rate is expressed in terms of the exponential of a Gaussian process with prior distribution and some linear response operator . Assuming a stationary, or homogeneous and isotropic kernel, the kernel can be expressed in terms of a spectral density in the harmonic domain, i.e. where

indicates the Fourier transformation,

is the projection of the spectral density and represents the squared exponential correlation kernel in Fourier space (in one dimension). Here, is a characteristic length-scale, a variance parameter and is the harmonic coordinate. A hat over a vector raises it to diagonal matrix, i.e. . This defines the mathematical setup of this first example.

Now we perform the reparametrization trick such that the new parametrization follows a standard Gaussian distribution. As the prior is already Gaussian, we simply have to identify with and rewrite . With this reparametrization we express the information of the problem for a given spectrum as

 H(d,ξ)ˆ=−d†lnReAξ+1†ReAξ+12ξ†1ξ (61)

The indicates a scalar product with the one vector, corresponding to the integration over the space. Especially interesting is the function that relates the initial rate to the parameters of our model:

 λ =f(ξ) (62) =ReAξ (63)

This function allows us to build the local covariance estimate that we will use during the inference. Here the parameter dependence is still relatively simple and the Jacobian can be calculated by hand, following the chain rule. For more complex models we recommend the usage of auto-differentiation tools. The covariance turns out to be:

 Ξ−1 =∂f(ξ)∂ξ†ˆλ−1∂f(ξ)∂ξ∣∣ ∣∣ξ=¯ξ+1 (64) =A†ˆeA¯ξ†R†ˆ1ReA¯ξRˆeA¯ξA+1 (65)

Here we see that the metric is composed out of a collection of operators that can be simply implemented and combined.

Now, we approximate the posterior probability implied by the model as described by Eq.

61 with the a metric Gaussian and follow the procedure described in this paper in order to infer the model parameters.

We implemented the problem within Python using the NIFTy111NIFTy documentation: http://ift.pages.mpcdf.de/NIFTy/
NIFTy code: https://gitlab.mpcdf.mpg.de/ift/NIFTy
package (Selig et al., 2013; Steininger et al., 2017). From version 5 on this package implements the MGVI algorithm natively, supports implicit operator handling, auto-differentiation, and it allows to build complex models by having a syntax that allows the code to be close to the mathematical notation. We implemented the identical problem in Stan (Stan Development Team et al., 2016). This allows us to compare our results to HMC sampling, ADVI, and MAP estimation. HMC provides samples from the true posterior and therefore should serve as a reference in the validation of our approach. For ADVI we perform both, a fully parametrized covariance, as well as a mean-field approximation, estimating only a diagonal covariance. Using the full covariance limits the possible problem size and we will stick to parameters to describe the Gaussian process, as well as equidistant data points. The data is is drawn according to the model and the concrete realization is shown in Fig. 1. Figure 1: A Poisson realization drawn according to a log-Gaussian process with Gaussian kernel on linear scale.

For the inference with MGVI we used samples to estimate the KL and its gradient and we updated them times after we minimized the KL sufficiently using natural gradients. Figure 2: Reconstructed rates and posterior samples provided by MGVI in comparison to those from various other methods.

#### 8.1.2 Results

All methods recover the underlying rate quite well. The inferred rates are shown in Fig. 2 for MGVI and the other methods. The uncertainty of the different estimates are indicated by a set of posterior samples drawn around their corresponding mean rates for all methods that provide such samples. Here, the result of MGVI is close to that of HMC as well as that of ADVI for the fully parametrized covariance. The MAP estimate does not provide an uncertainty and the ADVI mean-field approximation exhibits a larger scatter compared to the true posterior, as indicated by the HMC samples. The behavior of the mean-field ADVI is surprising, as one would expect the mean-field approximation to under-estimate the uncertainty. We checked whether this is an artifact of insufficient convergence of the mean-field ADVI method by trying it on various other field configurations (not shown). However, this behavior is reproduced consistently, which argues for it being an intrinsic limitation of the mean-field approximation. We also note that overall the relative uncertainty is higher in regions of low counts and smaller in regions of high counts. This is expected from a Poisson likelihood, as its variance is equal to its rate and therefore the relative uncertainty increases with decreasing rate, .

A more detailed insight into the result is obtained by looking at the correlation between posterior quantities. Here we choose to compare the rates at a pair of nearby locations and to investigate scatter plots of their values for a set of samples provided by the different methods. To investigate quantities with sufficient uncertainty we pick a pair of locations from a low-count region near visible in Fig. 3.

Comparing MGVI and HMC (Fig. 2(a)), we find a strong overlap between both distributions. In this case MGVI provides an excellent estimate for the true covariance and the sampling procedure from the implicit covariance generates accurate samples. The underlying true value is consistent with both results. The same can be observed by comparing MGVI to ADVI with fully parametrized covariance (Fig. 2(b)), where both methods result in practically the identical approximation. As mentioned before, the mean-field approximation (Fig. 2(c)) shows significantly larger variance, but only along the correlation direction. The diagonal approximation seems to be insufficient for this problem and MGVI surpasses this approach in fidelity.

As this is a low-count region, the dominant contribution to the posterior is the prior. This is the origin for the large variance and strong correlation between the nearby locations. The same scatter plots for a high-count region are shown in Fig. 4. Here the dominating contribution to the posterior is the likelihood. The overall variance is smaller and the correlation is less pronounced. The overall observation from the low-count region also applies here. The MGVI posterior samples overlap strongly with those obtained by HMC (Fig. 3(a)), maybe the MGVI samples are shifted down slightly. The same can be said about full ADVI (Fig. 3(b)). Mean-field ADVI (Fig. 3(c)) over-estimates the true posterior covariance again.

This first example validates empirically MGVI and shows that it is on par with full covariance ADVI. The proposed approximation of the variational covariance provides reasonable uncertainty estimates. MGVI requires only the estimation of parameters in this concrete example, whereas full covariance ADVI has two orders of magnitude more: parameters to represent the posterior mean rates plus additionally parameters to represent the posterior uncertainty covariance.

### 8.2 Binary Gaussian Process Classification with non-parametric Kernel

In the second example we apply MGVI to a much higher dimensional problem and more complex context, making it unfeasible for a fully parametrized covariance. Binary Gaussian process classification is used to attribute regions to certain classes and identify boundaries between them. A comprehensive overview can be found in Kuss and Rasmussen (2005) and Nickisch and Rasmussen (2008). In addition to the typical formulation, we will also infer the underlying kernel. We consider binary data in two dimensions, measured only at certain locations., a Bernoulli likelihood and that its rate parameter is described by a sigmoid function applied to an underlying Gaussian process. The kernel of this process is unknown and will be modeled non-parametrically. We will assume a stationary, isotropic kernel and model it by two components. The first component follows a power law that is modified by the second component, a log-Gaussian process with a smooth kernel. Overall the spectral density is parametrized by two power-law parameters, an amplitude and the spectral index, and the Gaussian process parameters for the component modifying this power-law.

#### 8.2.1 Setup

The likelihood in this example is the Bernoulli distribution that reads

 P(d|μ)=μd(1−μ)1−d (66)

for some rate parameter on the unit interval and binary outcome . The Fisher information metric for this likelihood is

 ⟨∂H(d|μ)∂μ∂H(d|μ)∂μ†⟩P(d|μ)=ˆμ(1−μ)−1 (67)

The rate is linked to a Gaussian process by a sigmoid function and a linear response:

 μ =Rσ(s) (68) =R12(1+tanh(s)) (69)

The kernel of this process is assumed to be stationary and isotropic and can be expressed as with spectral density . This object is also part of the inference and it is modeled according to

 p(k)=ealnk+b+τk. (70)

The first two terms in the exponent model a power-law kernel, which is linear on double-logarithmic scale, with power and amplitude , both of which get a Gaussian prior with assumed mean ( and ) and variance ( and ) . The last term in the exponent follows a Gaussian process that is smooth on logarithmic spatial scale and known kernel . The graphical structure of this described model is shown in Fig.  LABEL:fig:diagram.

Reparametrizing the model parameters yields the following relation to the original rate :

 μ =f(ξ) (71) =Rσ(F−1(ˆPe(¯a+σaξa)lnk+¯b+σbξb+Aξτ)ξs) (72)

This reparametrized model has therefore a highly non-linear likelihood in terms of its parameters, where . and are again the Fourier transformation and the isotropic projection of a spectrum to a Fourier space. Hats indicate diagonal operators. Obtaining this function is tedious but straight forward and can be done automatically, given the structure of the model. We spare the reader the expressions of the Jacobian of the function with respect to its parameters as this should be implemented using to auto-differentiation tools. Structurally the problem is now identical to the previous one with information and approximate covariance:

 H(d,ξ) (73) Ξ =(J†¯ξˆ(f(¯ξ)(1−f(¯ξ))−1J¯ξ+1)−1 (74)

Regarding the numerical setup, we consider binary data points on a two dimensional plane organized in a checkerboard. We use parameters to describe the Gaussian process underlying this rate. The spectral density is parametrized by additional two parameters for the power-law and for the non-parametric part, resulting in overall more than a million parameters, which is completely out of reach for explicit covariance parametrization. For simplicity periodic boundaries were assumed. For the inference we used independent samples plus their mirrored counter-parts, as described in Sec. 6 and we updated them times. At the end we obtained additional samples for posterior analysis. The inference was conducted on a single CPU within a couple of hours. We compare the result with a MAP estimate.

#### 8.2.2 Results

The synthetic data, the true underlying rate, the MAP solution, as well as the mean inferred rate from MGVI are shown in Fig. 6. The data is only sampled at certain locations and due to the binary output appears noisy (Fig.  5(a)). It exhibits spatial characteristics, predominately showing one class over the other in certain regions. The true rate (Fig. 5(b)), from which the data was drawn, shows rich features on all scales. The largest of them can also be seen in the data directly, but small-scale features are washed out due to the Bernoulli noise. Performing the MAP estimate for all parameters does not provide a reasonable result for the rate, as shown in Fig. 5(c). MAP is known to over-fit on the data and this can be clearly seen in this case. Because the spectral density is also part of the inference problem, MAP can absorb almost any noise feature into the spectral density, resulting in a very high and flat spectrum, as can be seen in Fig. 6(a). This way the data can be perfectly explained, as de facto the regularization term involved in Gaussian process regression was switched off by this high spectrum estimate. The MAP estimate is therefore largely identical to an unregularized Maximum Likelihood estimate. For such coupled and complex models MAP does not provide any reasonable result, and therefore a Laplace approximation does not provide any reasonable result either.

The mean rate recovered by MGVI is shown in Fig. 5(d)

. Up to a certain scale its morphology matches the true rate exceptionally well. Even in unobserved areas the structures are recovered correctly to some extent (as can be seen e.g. in the top right and bottom left corners). Small scales cannot be recovered as the data does not contain much information on them. This is also reflected in the standard deviation at each location, shown in Fig

6(b). The highest uncertainty is, as expected, in the un-observed areas, reproducing the checkerboard pattern. The standard deviation is also modulated by the rate itself. The more a certain region is attributed to one class, the lower its uncertainty. The uncertainty is especially high at the boundaries between the classes. We can validate this error by calculating how much of the true rate lies outside the one sigma interval indicated by the sampled standard deviation. For the correct standard deviation we would expect , but in our case it is and therefore the estimated error is roughly too low. Such an underestimation is expected from the Cramér-Rao bound and in variational inference in general.

The recovered spectral density is shown in Fig. 6(a). At the largest scales, and therefore the smallest modes, the true correlation structure is correctly recovered within the error, indicated by a set of samples. Even most large-scale spectral features are identified correctly by the algorithm. At a certain point towards smaller scales the error increases significantly. This is also the point where the recovered spectrum diverges from the true one. This might indicate incomplete convergence, however, those highly uncertain parameters are the last ones to converge anyway and are affected the most by the stochastic estimation of the KL-divergence. Even on those scales the trues spectrum is not completely out of the error bound and seems half-way consistent with the inferred spectrum.

Overall MGVI manages to accurately infer an approximation to this large and complex model within a reasonable amount of computational resources and time. The error estimates are, as expected slightly too low, but of correct magnitude. (a) The binary data drawn from the true rate. (a) The true and recovered spectral densities of MAP and MGVI with samples.

## 9 Conclusion

In this paper we introduced Metric Gaussian Variational Inference (MGVI) as a procedure to perform approximate inference for large and complex inference problems. MGVI performs a variational Gaussian inference in a standardized parametrization, following ADVI (Kucukelbir et al., 2017). Instead of explicitly parametrizing the covariance, a local approximation based on the Fisher information metric is used, which corresponds to a lower bound of the uncertainty. This covariance can be expressed in terms of implicit operators removing the need to store it at any moment explicitly. A stochastic estimate of the Kullback-Leibler divergence between the true posterior and the approximation is minimized with respect to the parameters of the approximation, the mean of the metric Gaussian. The KL-divergence is estimated with independent samples from the approximation. The MGVI approach circumvents the severe limitation of the quadratic scaling of memory and computations of a fully parametrized covariance in Gaussian variational inference. Instead, MGVI scales only linearly with the problem size in terms of required parameters, allowing its application to significantly larger problems.

We demonstrate the capabilities of MGVI in two examples. In the first one we applied MGVI to a Poisson log-Gaussian process inference problem and compared its result to those provided by HMC, ADVI and MAP. Our approach closely reproduced the result of HMC and ADVI with full covariance in terms of accuracy. MGVI significantly outperforms ADVI with a mean-field covariance, which exhibits the same computational scaling behavior. The second example is a binary Gaussian process classification with a non-parametric kernel. Here the problem involves more than a million parameters, making a full covariance parametrization completely unfeasible. The Gaussian process, as well as its kernel were inferred simultaneously. The inference with MGVI was performed within hours on a single CPU and the result accurately recovers the underlying ground truth for both, the Gaussian process and the kernel. We found that the error is slightly underestimated by roughly . A MAP estimate in the same setting completely over-fits the data and does not provide any reasonable result.

Overall MGVI is a general method that can be used in a large variety of different contexts to provide accurate approximations to the correct posterior. Correlations between all quantities are preserved by the scheme and the number of required parameters scales only linearly with the problem size, making MGVI suitable for extremely large problems. Independent posterior samples can be obtained from the approximation to propagate the uncertainties to any posterior quantity.

## 10 Acknowledgments

We acknowledge Philipp Arras, Philipp Frank, Maksim Greiner, Sebastian Hutschenreuter, Reimar Leike, Daniel Pumpe, Martin Reinecke and Theo Steininger for fruitful discussions and comments on the manuscript.