# Variational Reformulation of Bayesian Inverse Problems

The classical approach to inverse problems is based on the optimization of a misfit function. Despite its computational appeal, such an approach suffers from many shortcomings, e.g., non-uniqueness of solutions, modeling prior knowledge, etc. The Bayesian formalism to inverse problems avoids most of the difficulties encountered by the optimization approach, albeit at an increased computational cost. In this work, we use information theoretic arguments to cast the Bayesian inference problem in terms of an optimization problem. The resulting scheme combines the theoretical soundness of fully Bayesian inference with the computational efficiency of a simple optimization.

## Authors

• 5 publications
• 12 publications
• 1 publication
• 17 publications
• ### Solution of Physics-based Bayesian Inverse Problems with Deep Generative Priors

Inverse problems are notoriously difficult to solve because they can hav...
07/06/2021 ∙ by Dhruv V. Patel, et al. ∙ 6

• ### Variational Autoencoding of PDE Inverse Problems

Specifying a governing physical model in the presence of missing physics...
06/28/2020 ∙ by Daniel J. Tait, et al. ∙ 0

• ### A Hadamard fractioal total variation-Gaussian (HFTG) prior for Bayesian inverse problems

This paper studies the infinite-dimensional Bayesian inference method wi...
10/29/2021 ∙ by Li-Li Wang, et al. ∙ 0

• ### On the Locally Lipschitz Robustness of Bayesian Inverse Problems

In this note we consider the robustness of posterior measures occuring i...
06/17/2019 ∙ by Björn Sprungk, et al. ∙ 0

• ### Data-Driven Forward Discretizations for Bayesian Inversion

This paper suggests a framework for the learning of discretizations of e...
03/18/2020 ∙ by Daniele Bigoni, et al. ∙ 0

• ### Bayesian inference under model misspecification using transport-Lagrangian distances: an application to seismic inversion

Model misspecification constitutes a major obstacle to reliable inferenc...
05/14/2021 ∙ by Andrea Scarinci, et al. ∙ 0

• ### Sparse image reconstruction on the sphere: a general approach with uncertainty quantification

Inverse problems defined naturally on the sphere are becoming increasing...
05/11/2021 ∙ by Matthew A. Price, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

As we are approaching the era of exascale computing, we encounter more and more complex physical models. These complex models have many unknown parameters that need to be inferred from experimental measurements. That is, inverse problems are becoming an integral part of every scientific discipline that attempts to combine computational models with data. These include climate modeling [4], numerical weather prediction [18, 25], subsurface hydrology and geology [15], and many more.

It can be argued that the “right” way to pose an inverse problem is to follow the Bayesian formalism [28, 17]

. It is the “right” way because it deals with three well-known difficulties of inverse problems: non-uniqueness issues, modeling prior knowledge, and estimating experimental noise. The Bayesian solution of an inverse problem is summarized neatly by the

posteriorprobability density of the quantity of interest. In turn, the posterior can only be explored numerically by Monte Carlo (MC) methods, the most successful of which is Markov Chain Monte Carlo (MCMC) [24, 14]. Because of the need to repeatedly evaluate the underlying physical model, MCMC is computationally impractical for all but the simplest cases. Therefore, we need methods that approximate the posterior in a computationally efficient manner.

One way to come up with a computationally attractive approximation is to replace the physical model with a cheap-to-evaluate surrogate [20, 22]. In this way, MCMC becomes feasible again. However, there is no free lunch: firstly, things become complicated when the surrogate is inaccurate, and, secondly, constructing accurate surrogates is exponentially hard as the number of parameters increase [3]. Because of these difficulties, in this work, we attempt to develop a surrogate-free methodology.

Perhaps the simplest idea is to approximate the posterior with a delta function centered about its maximum. The maximum of the posterior is known as the maximum a posteriori (MAP) estimate of the parameters. The MAP approach is justified if the posterior is sharply picked around a unique maximum. It requires the solution of an optimization problem. The objective function of this optimization has two parts: a misfit and a regularization part that arise from the likelihood and the prior, respectively. The MAP estimate is commonly used in numerical weather prediction problems [25] as well as in seismic inversion [8].

The Laplace approximation represents the posterior by a Gaussian density with a mean given by the MAP and a covariance matrix given by the negative inverse Hessian of the logarithm of the posterior. The Laplace approximation is justified when the posterior has a unique maximum and is shaped, more or less, like a Gaussian around it. As before, the identification of the MAP requires the solution of an optimization problem. The required Hessian information may be estimated numerically either by automatic differentiation methods [12] or by developing the adjoint equations of the underlying physical model [26].

Variational inference (VI) [10, 7]

is a class of techniques in Bayesian statistics targeted toward the construction of approximate posteriors. One proceeds by posing a variational problem whose solution over a family of probability densities approximates the posterior. VI techniques have been proved quite effective for a wide class of inference problems. However, in their archetypal form, they are not directly applicable to inverse problems. This is due to the, typically, non-analytic nature of the underlying physical models. Fortunately, the recent developments in non-parametric VI by

[11] can come to rescue. This is the approach we follow in this work. In non-parametric VI, the family of probability densities that serve as candidate posteriors is the family of mixtures of Gaussians with a fixed number of components [23]. Since a mixture of Gaussians with an adequate number of components can represent any probability density, this approximating family is sufficiently large. The approximate posterior is constructed by minimizing the information loss between the true posterior and the approximate one over the candidate family. This is achieved by solving a series of optimization problems [7].

The outline of the paper is as follows. We start section 2 with a generic discussion of the Bayesian formulation of inverse problems. In section 2.1 we present the core ideas of VI and in section 2.2 we show how one can develop approximation schemes. We validate the proposed methodology numerically by solving two inverse problems: the estimation of the kinetic parameters of a catalysis system (section 3.1) and the identification of the source of contamination based on scarce concentration measurements (section 3.2). Finally, in A we provide all the technical details that are required for the implementation of the proposed methodology.

## 2 Methodology

A forward model associated with a physical phenomenon can be thought of as a function , that connects some unknown parameters to some observed quantities . This connection is defined indirectly via a likelihood function:

 p(y|ξ,θ)=p(y|f(ξ),θ), (1)

where denotes the parameters that control the measurement noise and/or the model discrepancy. Notice how in Eq. (1) the observations, , are explicitly connected to the parameters, , through the model, . A typical example of a likelihood function is the isotropic Gaussian likelihood:

 p(y|ξ,θ)=N(y|f(ξ),e2θI), (2)

where is a real number, and is the unit matrix with the same dimensions as . The exponential of the parameter,

 σ=eθ, (3)

can be thought of as the standard deviation of the measurement noise. The usual parameterization of the isotropic Gaussian likelihood uses

instead of . We do not follow the usual approach because in our numerical examples, it is preferable to work with real rather than positive numbers.

Both and are characterized by prior probability densities, and , respectively. These describe our state of knowledge, prior to observing . As soon as is observed, our updated state of knowledge is captured by the posterior distribution:

 p(ξ,θ|y)=p(y|ξ,θ)p(ξ)p(θ)p(y), (4)

where the normalization constant ,

 p(y)=∫p(y|ξ,θ)p(ξ)p(θ)dξdθ, (5)

is known as the evidence. Eq. (4) is the Bayesian solution to the inverse problem. Notice that it is a probability density over the joint space of and . This is to be contrasted with the classical approaches to inverse problems whose solutions result in point estimates of the unknown variables. The mass of this probability density corresponds to our inability to fully resolve the values of and due to insufficient experimental information.

Notice that by writing ,, , and , we may simplify the notation of Eq. (4) to

 p(ω|y)=p(y|ω)p(ω)p(y). (6)

This is the notation we follow through out this work. The goal of the rest of the paper is to construct an algorithmic framework for the efficient approximation of the posterior of Eq. (6).

### 2.1 Variational inference

Consider a family of probability densities over . Our objective is to choose a that is as “close” as possible to the posterior of Eq. (6). This “closeness” is measured by the Kullback-Leibler (KL) divergence [21],

 KL[q(ω)∥p(ω|y)]=Eq[logq(ω)p(ω|y)], (7)

where denotes the expectation with respect to . Intuitively, the KL divergence can be thought of as the information loss we experience when we approximate the posterior with the probability density . It is easy to show that

 KL[q(ω)∥p(ω|y)]≥0, (8)

and that the equality holds if and only if . Therefore, if the posterior is in , then minimizing Eq. (7) over will give an exact answer. For an arbitrary choice of , we postulate that minimizing Eq. (7) yields a good approximation to the posterior.

Unfortunately, calculation of Eq. (7) requires knowledge of the posterior. This means that Eq. (7) cannot be used directly in an optimization scheme. In order to proceed, we need an objective function that does not depend explicitly on the posterior. To this end, notice that the evidence may be expressed as:

 logp(y)=F[q]+KL[q(ω)∥p(ω|y)], (9)

where

 F[q]=Eq[logp(y,ω)q(ω)]=H[q]+Eq[logp(y,ω)], (10)

with

 p(y,ω)=p(y|ω)p(ω) (11)

being the joint probability density of and , and

 H[q]=−Eq[logq(ω)] (12)

being the entropy of . Since the KL divergence is non-negative (Eq. (8)), we have from Eq. (9) that

 F[q]≤logp(y). (13)

The functional is generally known as the evidence lower bound (ELBO).

We see, that maximizing Eq. (10) is equivalent to minimizing Eq. (7). In addition, Eq. (10) does not depend on the posterior in an explicit manner. This brings us to the variational principle used through out this work: The “best” approximation to the posterior of Eq. (6) over the family of probability densities is the solution to the following optimization problem:

 q∗(ω)=argmaxqF[q]. (14)

### 2.2 Developing approximation schemes

The main difficulty involved in the solution Eq. (14) is the evaluation of expectations over . In principle, one can approximate these expectations via a Monte Carlo procedure and, then, use a variant of the Robbins-Monro algorithm [27]. Such an approach yields a stochastic optimization scheme in the spirit of [5], and [2]. Whether or not such a scheme is more efficient than MCMC sampling of the posterior is beyond the scope of this work. Here, we follow the approach outlined by [11]. In particular, we will derive analytical approximations of for the special case in which is the family of Gaussian mixtures.

The family of Gaussian mixtures with components is the family of probability densities of the form:

 q(ω)=L∑i=1wiN(ω|μi,Σi) (15)

where , and are the responsibility, mean, and covariance matrix, respectively, of the -th component of the mixture. The responsibilities are non-negative and they sum to one while the covariance matrices are positive definite. When we work with , the generic variational problem of Eq. (6) is equivalent to optimization with respect to all the and . In what follows, we replace the ELBO, of Eq. (10), with a series of analytic approximations that exploit the properties of , and, finally, we derive a three-step optimization scheme that yields a local maximum of the approximate ELBO.

We start with an approximation to the entropy (see Eq. (12)) of a Gaussian mixture Eq. (15). There are basically two kinds of approximations that may be derived: 1) using Jensen’s inequality yields a lower bound to built out of convolutions of Gaussians (see [11] and [16]), and 2) employing a Taylor expansion of about each and evaluating the expectation over (see [16]). We have experimented with both approximations to the entropy without observing any significant differences in the numerical results. Therefore, we opt for the former one since it has a very simple analytical form. An application of Jensen’s inequality followed by well-known results about the convolution of two Gaussians yields

 H[q]≥H0[q], (16)

where

 H0[q]=−L∑i=1wilnqi, (17)

with

 qi=L∑j=1wjN(μi|μj,Σi+Σj). (18)

The idea is to simply replace in Eq. (10) with of Eq. (17). This results in a lower bound to the ELBO .

Now, we turn our attention to the second term of Eq. (10). For convenience, let us write it as:

 L[q]=Eq[logp(y,ω)]. (19)

Notice that may be expanded as:

 L[q]=L∑i=1wiEN(ω|μi,Σi)[logp(y,ω)], (20)

and that each expectation term can be approximated by taking the Taylor expansion of about :

 logp(y,ω)≈logp(y,ω=μi)+∇ωlogp(y,ω=μi)(ω−μi)+12(ω−μi)T∇2ωlogp(y,ω=μi)(ω−μi), (21)

where and stand for the Jacobian and the Hessian with respect to , respectively. Combining Eq. (20) with Eq. (21), we get the zero and second order Taylor approximation to of Eq. (19),

 L0[q]=L∑i=1wilogp(y,ω=μi), (22)

and

 L2[q]=L0[q]+12L∑i=1wiTr[Σi∇2ωlogp(y,ω=μi)], (23)

respectively.

Combining Eq. (17) with Eq. (22) or Eq. (23) we get an approximation to Eq. (10). In particular, we define:

 Fa[q]=H0[q]+La[q], (24)

where , or , selects the approximation to Eq. (19). From this point on, our goal is to derive an algorithm that converges to a local maximum of .

Notice that requires the Hessian of at . Therefore, optimizing it with respect to would require third derivatives of . This, in turn, implies the availability of third derivatives of the forward model . Getting third derivatives of the forward model is impractical in almost all cases. In contrast, optimization of with respect to requires only first derivatives of , i.e., only first derivatives of the forward model . In many inverse problems of interest, derivatives can be obtained at a minimum cost by making use of adjoint techniques (e.g., [9]). Therefore, optimization of with respect to is computationally feasible.

The situation for is quite different. Firstly, notice that increases logarithmically without bounds as a function of and that the does not depend on at all. We see that the lowest approximation that carries information about is . Looking at Eq. (23) we observe that this information is conveyed via the Hessian of . This, in turn, requires the Hessian of the forward model . The latter is a non-trivial task which is, however, feasible. In addition, notice that if is restricted to be diagonal, then only the diagonal part of the Hessian of is required, thus, significantly reducing the memory requirements. The computation of as well as of its gradients with respect to , and is discussed in A.

Taking the above into account, we propose an optimization scheme that alternates between optimizing , , and . The algorithm is summarized in Algorithm 1. However, in order to avoid the use of third derivatives of the forward model we follow [11] in using as the objective function when optimizing for . Furthermore, we restrict our attention to diagonal covariance matrices,

 Σi=diag(σ2i1,…,σ2id), (25)

since we do not want to deal with the issue of enforcing positive definiteness of the ’s during their optimization. We use the L-BFGS-B algorithm of [6], which can perform bound constrained optimization. The bounds we use are problem-specific and are discussed in section 3. In all numerical examples, we use the same convergence tolerance .

## 3 Examples

In this section we present two numerical examples: 1) The problem of estimating rate constants in a catalysis system (section 3.1), and 2) the problem of identifying the source of contamination in a two dimensional domain (section 3.2). In both examples, we compare the approximate posterior to a MCMC [14] estimate. We used the Metropolis-Adjusted-Langevin-Algorithm (MALA) [1], since it can use the derivatives of the forward models to accelerate convergence. In all examples, the step size of the MALA proposal was , the first steps were burned, and we observed the chain every steps for a total of steps. We implented our methodology is Python. The code is freely available at https://github.com/ebilionis/variational-reformulation-of-inverse-problems.

### 3.1 Reaction kinetic model

We consider the problem of estimating kinetic parameters of multi-step chemical reactions that involve various intermediate or final products. In particular, we study the dynamical system that describes the catalytic conversion of nitrate () to nitrogen () and other by-products by electrochemical means. The mechanism that is followed is complex and not well understood. In this work, we use the simplified model proposed by [19], which includes the production of nitrogen (), ammonia (), and nitrous oxide () as final products, as well as that of nitrite () and an unknown species X as reaction intermediates (see figure 1).

The dynamical system associated with the reaction depicted in figure 1 is:

 d[NO−3]dt=−k1[NO−3],d[NO−2]dt=k1[NO−3]−(k2+k4+k5)[NO−2],d[X]dt=k2[NO−2]−k3[X],d[N2]dt=k3[X],d[NH3]dt=k4[NO−2],d[N2O]dt=k5[NO−2], (26)

where denotes the concentration of a quantity, and , are the kinetic rate constants. Our goal is to estimate the kinetic rate constants based on the experimental measurements obtained by [19]. For completeness, these measurements are reproduced in table 1. The initial conditions for Eq. (26) are given by the row of the table 1.

In order to avoid numerical instabilities, we work with a dimensionless version of Eq. (26). In particular, we define the scaled time:

 τ=t180min, (27)

the scaled concentrations:

 ui=[Y]500mmol⋅L−1, (28)

for , and , respectively, and the scaled kinetic rate constants:

 κi=ki⋅180min, (29)

for . The dimensionless dynamical system associated with Eq. (26) is:

 ˙u1=−κ1u1,˙u2=κ1u1−(κ2+κ4+κ5)u2,˙u3=κ2u2−κ3u3,˙u4=κ3u3,˙u5=κ4u2,˙u6=κ5u2, (30)

where denotes the differentiation of with respect to the scaled time . The initial conditions for Eq. (30) are given by the scaled version of the row of table 1. We arrange the scaled version of the experimental data of table 1 in a vector form, with , by concatenating rows of table 1 while skipping the third column.

Since the kinetic rate constants, , of the scaled dynamical system of Eq. (30) are non-negative, it is problematic to attempt to approximate the posteriors associated with them with mixtures of Gaussians. For this reason, we work with the logarithms of the ’s,

 ξi=logκi, (31)

for . We collectively denote those variables by

. The prior probability density we assign to

is:

 p(ξ)=5∏i=1p(ξi), (32)

with

 p(ξi)=N(ξi|0,1), (33)

for .

The forward model associated with the experimental observations is:

 f(ξ)=⎛⎜⎝u1(t2,ξ),u2(t2,ξ),u4(t2,ξ)u5(t2,ξ),u6(t2,ξ),…u1(t7,ξ),u2(t7,ξ),u4(t7,ξ)u5(t7,ξ),u6(t7,ξ),⎞⎟⎠, (34)

where is the solution of Eq. (30) with the initial conditions specified by the scaled version of the row of table 1, and scaled kinetic rate constants, , given by inverting Eq. (31), i.e. . The derivatives of can be solving a series of dynamical systems forced by the solution of Eq. (30). This is discussed in A.5.

We use the isotropic Gaussian likelihood defined in Eq. (2). It is further discussed in A.4. The prior we assign to the parameter of the likelihood is:

 p(θ)=N(θ|−1,1). (35)

Since the noise represented by is , this prior choice corresponds to a belief that the measurement noise is around %.

We solve the problem using the variational approach as outlined in Algorithm 1. To approximate the posterior, we only use one Gaussian, in Eq. (15). We impose no bounds on . However, we require that the diagonal elements of the covariance matrix are bounded below by and above by .

Table 2 compares the variational estimates of the scaled kinetic rate constants, (see Eq. (31)), and the logarithm of the likelihood noise, (see Eq. (2)), to the MCMC (MALA) estimates. We see that the mean of the two estimates are in close agreement, albeit the variational approach slightly underestimates the uncertainty of its prediction. However, notice that if we order the parameters in terms of increasing uncertainty, both methods yield the same ordering. Therefore, even though the numerical estimates of the uncertainty differ, the relative estimates of the uncertainty are qualitatively the same. It is worth mentioning at this point, that the variational approach uses only evaluations of the forward model. This is to be contrasted with the thousands of evaluations required so that the MCMC (MALA) estimates converge.

The variational approach with approximates the posterior of and

with one multivariate Gaussian distribution with a diagonal covariance. Therefore, the distribution of each one of the components is a Gaussian. Using Eqs. (

31) and (29), it is easy to show that the kinetic rate constant

follows a log-normal distribution with log-scale parameter

and shape parameter . Similarly, we can show that the noise follows a log-normal distribution with local-scale parameter and shape parameter . Using the percentiles of these lognormal distributions, we compute the median and the credible intervals of the kinetic rate constants and the noise . The results are shown in the third and fourth columns of table 3. They are in good agreement with the results found in [19] using a MCMC strategy (reproduced in the second column of the same table 3). An element of our analysis not found in [19] is the estimation of the measurement noise. Since measures the noise of the scaled version of the data , we see (last line of table 3) that the measurement noise is estimated to be around 2.15%.

In figure 2, we compare the variational posterior (VAR ()) of the scaled kinetic rate constants, , as well as of the noise of the likelihood, , to histograms of the same quantities obtained via MCMC (MALA). Once again, we confirm the excellent agreement between the two methods. In the same figure, we also plot the prior probability density we assigned to each parameter. The prior probability of is practically invisible, because it picks at about . Given the big disparity between prior and posterior distributions, we see that the result is relatively insensitive to the priors we assign. If, in addition, we take into account that the measurement noise is estimated to be quite small, we conclude that Eq. (26) does a very good job of explaining the experimental observations.

Figure 3 shows the uncertainty in the scaled concentrations, , as a function of scaled time, , obtained by approximating the posterior of the parameters and compares them with the variational approach (left), to the MCMC (MALA) estimate. Again, we notice that the two plots are in very good agreement, albeit the variational approach seems to slightly underestimate the uncertainty.

### 3.2 Contaminant source identification

We now apply our methodology to a synthetic example of contaminant source identification. We are assuming that we have experimental measurements of contaminant concentrations at specific locations, and we are interested in estimating the location of the contamination source.

The concentration of the contaminant obeys the two-dimensional transport model described by a diffusion equation

 ∂u∂t=∇2u+g(t,x,ξ),x∈B, (36)

where is the spatial domain and is the source term. The source term is assumed to have a Gaussian:

 g(t,s;xs)=g0e−|x−ξ|22ρ21[0,Ts](t), (37)

where is the strength of the contamination, is its spreadwidth, is the shutoff time parameter, and is the source center. Therefore, is the only parameter that needs to be identified experimentally. We impose homogeneous Neumann boundary conditions

 ∇u⋅n=0,x∈∂B (38)

and zero initial condition

 u(0,x,ξ)=0. (39)

We consider two different scenarios. In the first one, measurements of take place on the four corners of . On the second one, measurements of take place on the middle points of the upper and lower boundaries of . The former results in a unimodal posterior for and can be approximated with just one Gaussian. The latter results in a bimodal posterior for and requires a mixture of two Gaussians. Eq. (36) is solved via a finite volume scheme implemented using the Python package Fipy[13]. The required gradients of the solution are obtained by solving a series of PDE’s similar to Eq. (36) but with different source terms (see A.6). In both scenarios, we generate synthetic observations, , by solving Eq. (36) on a grid, source , and adding Gaussian noise with standard deviation . For the forward evaluations needed during the calibration process we use a grid and we denote the corresponding solution by . The prior we use for is uniform,

 p(ξ)∝1B(ξ), (40)

the likelihood is given by Eq. (2), and the log-noise parameter of the likelihood has the same prior as the previous example (see Eq. (35)).

#### First case: Observations at the four corners

The synthetic data, ( sensors measurements) are generated by sampling the solution, at the four corners of , and by adding Gaussian noise with :

 y=⎛⎜ ⎜⎝u(t1,(0,0)),u(t1,(1,0)),u(t1,(0,1)),u(t1,(1,1)),⋮u(t4,(0,0)),u(t4,(1,0)),u(t4,(0,1)),u(t4,(1,1))⎞⎟ ⎟⎠+noise. (41)

The corresponding forward model generated by the solution, , is given by:

 f(ξ)=⎛⎜ ⎜⎝~u(t1,(0,0),ξ),~u(t1,(1,0),ξ),~u(t1,(0,1),ξ),~u(t1,(1,1),ξ),⋮~u(t4,(0,0),ξ),~u(t4,(1,0),ξ),~u(t4,(0,1);ξ),~u(t4,(1,1),ξ)⎞⎟ ⎟⎠. (42)

Figure 4 compares the posteriors obtained via the variational approach with Gaussian in Eq. (15) to those obtained via MCMC (MALA). The true value of each parameter is indicated by a vertical, green, dashed line. It is worth noting at this point, that the variational approach required forward model evaluations as opposed to thousands required by MCMC (MALA). In real time, the variational approach took about 15 minutes on a single computational node, while the MCMC (MALA) required 3 days on the same node. Notice that the posterior cannot identify the true source exactly. This is due to the noise that we have added in the synthetic data. The result of such noise is always to broaden the posterior. We see that the variational approach does a good job in identifying an approximate location for the source as well as estimating the noise level . However, we notice once again that it underestimates the true uncertainty.

#### Second case: Observations at the middle point of the upper and lower boundaries

The synthetic data, ( sensors measurements) are generated by sampling the solution, at the upper and lower boundaries of , and by adding Gaussian noise with :

 y=⎛⎜ ⎜⎝u(t1,(0.5,0)),u(t1,(0.5,1)),⋮u(t4,(0.5,0)),u(t4,(0.5,1))⎞⎟ ⎟⎠+noise. (43)

The corresponding forward model generated by the solution, , is given by:

 f(ξ)=⎛⎜ ⎜⎝~u(t1,(0.5,0),ξ),~u(t1,(0.5,1),ξ),⋮~u(t4,(0.5,0),ξ),~u(t4,(0.5,1),ξ)⎞⎟ ⎟⎠. (44)

Figure 5 compares the posteriors obtained via the variational approach with Gaussians in Eq. (15) to those obtained via MCMC (MALA). The true value of each parameter is indicated by a vertical, green, dashed line. It is worth noting at this point, that the variational approach required only forward model evaluations. Using symmetry arguments, it is easy to show that data generated by solving Eq. (36) with a source located at look identical to the data that can be generated from a source located at . As a result, the posterior distribution is bimodal. Therefore, we expect common MCMC methodologies to have a hard time dealing with this problem. The reason is that once the MCMC chain visits one of the modes, it is very unlikely that it will ever leave it to visit the other mode. In reality, it is impossible to visit the other mode unless a direct jump is proposed. The reason our MCMC (MALA) scheme works is because we have handpicked a proposal step that does allow for occasional jumps from one mode to the other. On the other hand, we see that the variational approach with Gaussians in Eq. (15) can easily deal with bimodal (or multimodal) posteriors. However, there are a few details that need to be mentioned here. Firstly, one needs to use an greater than or equal to the true number of modes of the posterior. Since, the latter is unknown, a little bit of experimentation would be required in a real problem. Secondly, even if the true is used, Algorithm 1 might still find fewer modes than the true number. For example, in our numerical experiments we have noticed that if and of Eq. (15) with are initially very close together, they are both attracted by the same mode. We believe that this is an artifact introduced by the Taylor approximation to the joint probability function (see Eq. (22) and Eq. (23)), and, in particular, of its local nature. In our example, a few random initializations of the ’s are enough to guarantee the identification of both posterior modes.

## 4 Conclusions

We presented a novel approach to inverse problems that provides an optimization perspective to the Bayesian point of view. In particular, we used information theoretic arguments to derive an optimization problem that targets the estimation of the posterior within the the class of mixtures of Gaussians. The scheme proceeds by postulating that the “best” approximate posterior is the one that exhibits the minimum information loss (relative entropy) within the class of candidate posteriors. We showed how the minimization of the information loss is equivalent to the maximization of a lower bound to the evidence (normalization constant of the posterior). Since the derived lower bound was a computationally intractable quantity, we derived a crude approximation to it that requires the gradients of the forward model with respect to the input variables that we want to infer.

We demonstrated the efficacy of our method to solve inverse problems with just a few forward model evaluations in two numerical examples: 1) the estimation of the kinetic rate constants in a catalysis system, and 2) the identification of the contamination source in a simple diffusion problem. The performance of the scheme was compared to that of a state of the art MCMC technique (MALA) and was found to be satisfactory, albeit slightly underestimating the uncertainty. The scheme was able to solve both inverse problems with a fraction of the computational cost. In particular, our approach required around 50 forward model evaluations as opposed to the tens of thousands that are required by MCMC.

The variational approach seems to open up completely new ways of solving stochastic inverse problems. The scope of the approach is much wider than the particular techniques used in this paper. Just as a indication, the following are some of the research directions that we plan to pursue in the near future: 1) Derive alternative -more accurate- approximations to the lower bound of the evidence; 2) Experiment with dimensionality reduction ideas that would allow us to carry out the variational optimization in high-dimensional problems; 3) Derive stochastic algorithms for maximizing the lower bound without invoking any approximations. We believe that the variational approach has the potential of making stochastic inverse problems solvable with only a moderate increase in the computational cost as compared to classical approaches.

## Acknowledgments

I.K. acknowledges financial support through a Marie Curie International Outgoing Fellowship within the 7th European Community Framework Programme (Award IOF-327650). N.Z, as ‘Royal Society Wolfson Research Merit Award’ holder acknowledges support from the Royal Society and the Wolfson Foundation. N.Z. also acknowledges strategic grant support from EPSRC to the University of Warwick for establishing the Warwick Centre for Predictive Modeling (grant EP/L027682/1). In addition, N.Z. as Hans Fisher Senior Fellow acknowledges support of the Technische Universitt Mnchen - Institute for Advanced Study, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement no. .

## Appendix A Computation of Fa[q] and its gradient

Algorithm 1 requires the evaluation of the gradient of the approximate ELBO of Eq. (24) with respect to all the parameters of the Gaussian mixture of Eq. (15). That is, we must be able to evaluate

 Fa[q] = H0[q]+La[q], (45) ∂∂βFa[q] = ∂∂βH0[q]+∂∂βLa[q], (46)

for , for , and and .

### a.1 Computation of H0[q] and its gradient

The computations relative to are:

 H0[q] = −L∑i=1wilog(qi), (47) ∂∂wiH0[q] = −log(qi)−L∑r=1wrNriqr, (48) ∂∂μijH0[q] = −wiL∑r=1wrNriArij(1qi+1qr), (49) ∂∂ΣijkH0[q] = 12wiL∑r=1wrNriBrijk(1qi+1qr), (50)

where, in order to simplify the notation, we have used the following intermediate quantities:

 Nri = N(μr|μi,Σr+Σi), (51) qi = L∑r=1wrNri, (52) Arij = d∑s=1(Σr+Σi)−1js(μrs−μis), (53) Brijk = (Σr+Σi)−1jk+ArijArik. (54)

### a.2 Computation of La[q] and its gradients

The computations relative to the part are:

 L0[q] = L∑i=1wiCi, (55) ∂∂wiL0[q] = Ci, (56) ∂∂μijL0[q] = Dij, (57) ∂∂ΣijkL0[q] = 0, (58) L2[q] = L0[q]+12L∑i=1wid∑j,k=1ΣijkEijk, (59) ∂∂wiL2[q] = ∂∂wiL0[q]+12d∑j,k=1ΣijkEijk, (60) ∂∂ΣijkL2[q] = ∂∂ΣijkL0[q]+12wiEijk, (61)

where, in order to simplify the notation, we have used the following intermediate quantities:

 J(ω) = logp(y,ω)=logp(y|ω)+logp(ω), (62) Ci = J(μi), (63) Dij = ∂∂ωjJ(μi), (64) Eijk = ∂2∂ωjωkJ(μi). (65)

We do not provide the derivatives of with respect to because they are not needed in Algorithm 1. The joint probability function of Eq. (62) depends on the details of the likelihood, the prior, and the underlying forward model. We discuss the computation of Eqs. (62), (64), and (65) in A.3.

### a.3 Computing the derivatives of J(ω)

In this section, we show how the gradient of the forward model appears through the differentiation of the joint probability density of Eq. (62). Recall (see beginning of section 2) that , where are the parameters of the forward model , and the parameters of the likelihood function of Eq. (1). Therefore, let us write:

 J(ω)=J(ξ,θ) = L(y,f(ξ),θ)+Pξ(ξ)+Pθ(θ), (66) L(y,f(ξ),θ) = logp(y|f(ξ),θ), (67) Pξ(ξ) = logp(ξ), (68) Pθ(θ) = logp(θ). (69)

Using the chain rule, we have:

 ∂J∂ξj = dy∑s=1∂L∂fs∂fs∂ξj+∂Pξ∂ξj, (70) ∂J∂θj = ∂L∂θj+∂Pθ∂θj, (71) ∂2J∂ξj∂ξj = dy∑s,r=1∂2L∂fr∂fs∂fr∂ξk∂fs∂ξj+dy∑s=1∂L∂fs∂2fs∂ξj∂ξk+∂2Pξ∂ξj∂ξk, (72) ∂2J∂θi∂θj = ∂2L∂θi∂θj+∂2Pξ∂θi∂θj, (73) ∂2J∂ξj∂θk = dy∑s=1∂2L∂θk∂fs∂fs∂ξj. (74)

Therefore, the Jacobian and the Hessian of the forward model are required. However, as is obvious by close inspection of Eqs. (59), (60), and (61), if the covariance matrices of the mixture of Eq. (15) are diagonal, then only the diagonal elements of the Hessian of the forward model are essential. This is the approach we follow in our numerical examples.

### a.4 Isotropic Gaussian likelihood

In both our numerical examples, we use the isotropic Gaussian likelihood defined in Eq. (2). Its logarithm is:

 (75)