# Unbiased Estimation of the Gradient of the Log-Likelihood in Inverse Problems

We consider the problem of estimating a parameter associated to a Bayesian inverse problem. Treating the unknown initial condition as a nuisance parameter, typically one must resort to a numerical approximation of the gradient of the log-likelihood and also adopt a discretization of the problem in space and/or time. We develop a new methodology to unbiasedly estimate the gradient of the log-likelihood with respect to the unknown parameter, i.e. the expectation of the estimate has no discretization bias. Such a property is not only useful for estimation in terms of the original stochastic model of interest, but can be used in stochastic gradient algorithms, which benefit from unbiased estimates. Under appropriate assumptions, we prove that our estimator is not only unbiased but also of finite variance. In addition, when implemented on a single processor, we show that the cost to achieve a given level of error is comparable to that of multilevel Monte Carlo methods, both practically and theoretically. However, the new algorithm also allows parallel computation on arbitrarily many processors without any loss of efficiency, asymptotically. In practice, this means any precision can be achieved at a fixed, finite constant cost, provided that enough processors are available.


## 1 Introduction

The problem of inferring unknown parameters associated to the solution of (partial) differential equations (PDEs) is referred to as an inverse problem. In such a context, when the forward problem is well-posed, the inverse problem is often ill-posed and challenging to solve, even numerically. The area has a long history and a large literature (see e.g. [7, 14]), yet the intersection with statistics is still comparatively small, particularly considering the significant overlap in methods, algorithms and objectives. If one adopts a Bayesian approach to the solution of the inverse problem, then the object of interest is a posterior distribution, and in particular expectations with respect to this distribution [8, 12]. While this provides an elegant solution and quantified uncertainty via a well-defined target distribution, it is more challenging to solve than its deterministic counterpart, requiring at least a Hessian in addition to a maximum a posteriori estimator for a Laplace approximation, if not more expensive Monte Carlo methods. Here we assume that solution of the Bayesian inverse problem (BIP) requires computationally intensive Monte Carlo methods for accurate estimation. We furthermore assume that the statistical model can only be defined up to some unknown parameters.

Consider a BIP with unknown $u$ and data $y$, related through a PDE, and assume that the statistical model is known only up to some parameter $\theta$ (assumed finite dimensional). In other words, the posterior distribution takes the form

$$p(du,\theta\,|\,y) \propto p(y\,|\,u,\theta)\,p(du\,|\,\theta)\,p(\theta).$$

Due to sensitivity with respect to the parameter $\theta$ and strong correlation with the unknown $u$, such a posterior distribution can be highly complex and very challenging to sample from, even using quite advanced Markov chain Monte Carlo (MCMC) algorithms. In this article, the unknown $u$ is treated as a nuisance parameter and the goal is to maximize the marginal likelihood of the parameter $\theta$,

$$p(y\,|\,\theta) = \int_X p(y\,|\,u,\theta)\,p(du\,|\,\theta).$$

In such a scenario one is left with a finite-dimensional optimization problem, albeit with an objective function that is not available analytically. This intractability arises from two sources:

• first, for a given $u$ only a discretization of the likelihood can be evaluated;

• second, the discretized marginal likelihood is a high-dimensional integral which itself must be approximated.

Moreover, the associated gradient of the log-likelihood is not available, which may be of interest in optimization algorithms. In the following we will suppress the notation for the fixed observation $y$ and present the method generally. In particular, we use the notation $\gamma_\theta(u)\,du$ for the unnormalized posterior, where $du$ represents the finite measure of an infinitesimal volume element, which may or may not be Lebesgue measure, and $\gamma_\theta(u)$ is its density. We will also denote its integral by $Z_\theta := \int_X \gamma_\theta(u)\,du$, and the posterior by $\eta_\theta(du) = \gamma_\theta(u)\,du / Z_\theta$.

In this article, we present a new scheme to provide finite variance estimates of the gradient of the log-likelihood that are unbiased. To be precise, let $\nabla_\theta\log(Z_\theta)$ denote the gradient of the log-likelihood with no discretization bias. The proposed method provides an estimator $\widehat{\nabla_\theta\log(Z_\theta)}$ such that $\mathbb{E}[\widehat{\nabla_\theta\log(Z_\theta)}] = \nabla_\theta\log(Z_\theta)$, where $\mathbb{E}$ denotes the expectation with respect to the randomization induced by our numerical approach. Moreover, the estimator is constructed so that one only needs access to finite resolution (discretized) approximations of the BIP. This scheme is of interest for several reasons:

1. The estimate is unbiased, which is useful for stochastic gradient algorithms;

2. The method is easy to parallelize;

3. The method helps to provide a benchmark for other computations.

In terms of the first point, it is often simpler to verify the validity of stochastic gradient algorithms when the estimate of the noisy functional is unbiased. Whilst this is not always needed (see [13] for a special case, which does not apply in our context), it at least provides the user peace of mind when implementing optimization schemes. The second point is of interest in terms of efficiency of application, especially relative to competing methods. The third point simply states that one can check the precision of biased methodology. We explain the approach in more detail in Section 3.

This article is structured as follows. In Section 2 we explain the generic problem to which our approach is applicable. In particular, a concrete example in the context of Bayesian inverse problems is described. In Section 3 we present our methodology and the proposed estimator. In Section 4 we show that our proposed estimator is unbiased and of finite variance, and we consider the cost of obtaining the estimate. In Section 5 several numerical examples are presented to investigate the performance of the estimator in practice, including its efficiency when used in the relevant context of a stochastic gradient algorithm for parameter estimation. The proofs of some of our theoretical results can be found in Appendix A.

## 2 Problem Setting

### 2.1 Generic Problem

Let $(X,\mathcal{X})$ be a measurable space, and define a probability measure $\eta_\theta$, $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$, on it as

$$\eta_\theta(du) = \frac{\gamma_\theta(u)\,du}{\int_X \gamma_\theta(u)\,du},$$

where $\gamma_\theta : X \to \mathbb{R}_+$, and $du$ is a finite measure on $(X,\mathcal{X})$. We are interested in computing

$$\nabla_\theta\log\Big(\int_X\gamma_\theta(u)\,du\Big) = \int_X \nabla_\theta\log(\gamma_\theta(u))\,\eta_\theta(du) = \int_X \varphi_\theta(u)\,\eta_\theta(du),$$

where we have defined $\varphi_\theta(u) := \nabla_\theta\log(\gamma_\theta(u))$. From here on, we will use the following short-hand notation for a measure $\mu$ on $(X,\mathcal{X})$ and a measurable, $\mu$-integrable $\varphi$:

$$\mu(\varphi) := \int_X \varphi(x)\,\mu(dx),$$

which should be understood as a column vector of integrals.

In practice, we assume that we must work with an approximation of $\gamma_\theta$ and $\eta_\theta$. Let $l \in \mathbb{N}_0$ denote a level of discretization, and set

$$\eta^l_\theta(du) = \frac{\gamma^l_\theta(u)\,du}{\int_X \gamma^l_\theta(u)\,du},$$

where $\gamma^l_\theta : X \to \mathbb{R}_+$. We are now interested in computing

$$\nabla_\theta\log\Big(\int_X\gamma^l_\theta(u)\,du\Big) = \int_X \nabla_\theta\log(\gamma^l_\theta(u))\,\eta^l_\theta(du) = \int_X \varphi^l_\theta(u)\,\eta^l_\theta(du).$$

It is assumed explicitly that

$$\lim_{l\to+\infty}\eta^l_\theta(\varphi^l_\theta) = \eta_\theta(\varphi_\theta).$$

### 2.2 Example of Problem

We will focus on the following particular problem. Let $D \subset \mathbb{R}^d$ with $\partial D \in C^1$ convex and $f \in L^2(D)$. Consider the following PDE on $D$:

$$-\nabla\cdot(\hat u\,\nabla p) = f \quad \text{on } D, \qquad p = 0 \quad \text{on } \partial D, \qquad (1)$$

where

$$\hat u(x) = \bar u(x) + \sum_{k=1}^{K} u_k\,\sigma_k\,\phi_k(x).$$

Define $u := (u_1,\dots,u_K)$, with $u_k \overset{\text{i.i.d.}}{\sim} \mathcal{U}[-1,1]$ (the uniform distribution on $[-1,1]$). This determines the prior distribution for $u$. The state space is $X = [-1,1]^K$. Let $p(\cdot;u)$ denote the weak solution of (1) for parameter value $u$. The following will be assumed.

• (A2.2) $\bar u, \phi_k \in C(D)$ for $k = 1,\dots,K$, and there is a $u_\star > 0$ such that $\inf_{x\in D}\hat u(x) \ge u_\star$ for all $u \in X$.

Note that this assumption guarantees $\hat u(x) \ge u_\star > 0$ uniformly in $u$, hence there is a well-defined (weak) solution $p(\cdot;u)$ which will be bounded, uniformly in $u$, in an appropriate space, e.g. $H^1_0(D)$.

Define the following vector-valued function

$$\mathcal{G}(u) = [g_1(p(\cdot;u)),\dots,g_M(p(\cdot;u))]^\intercal,$$

where $g_1,\dots,g_M$ are elements of the dual space of the solution space (e.g. $g_m \in L^2(D)$ is sufficient), for $m = 1,\dots,M$. It is assumed that the data take the form

$$y = \mathcal{G}(u) + \xi, \qquad \xi \sim \mathcal{N}(0,\theta^{-1}\cdot I_M), \qquad \xi \perp u,$$

where $\mathcal{N}(m,\Sigma)$ denotes the Gaussian distribution with mean $m$ and covariance matrix $\Sigma$, $I_M$ is the $M\times M$ identity matrix, and $\perp$ denotes independence. The unnormalized density of $u$, for fixed $(y,\theta)$, is then given by

$$\gamma_\theta(u) = \theta^{M/2}\exp\Big(-\frac{\theta}{2}\|\mathcal{G}(u)-y\|^2\Big), \qquad (2)$$

the normalized density is given by

$$\eta_\theta(u) = \frac{\gamma_\theta(u)}{Z_\theta},$$

where $Z_\theta := \int_X \gamma_\theta(u)\,du$, and the quantity of interest is defined as

$$\varphi_\theta(u) := \nabla_\theta\log(\gamma_\theta(u)) = \frac{M}{2\theta} - \frac{1}{2}\|\mathcal{G}(u)-y\|^2. \qquad (3)$$
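As a concrete illustration, once the forward map $\mathcal{G}(u)$ has been computed, the quantities in (2)-(3) are cheap to evaluate. A minimal Python sketch follows; the array `G_u`, standing for a user-supplied evaluation of $\mathcal{G}(u)$ (e.g. from the finite element solver of Section 2.2.1), is an assumption on our part:

```python
import numpy as np

def log_gamma(G_u, y, theta):
    """Unnormalized log-density log gamma_theta(u), cf. (2).

    G_u : forward map G(u) evaluated at u, shape (M,)
    y   : observed data, shape (M,)
    """
    M = y.shape[0]
    r = G_u - y
    return 0.5 * M * np.log(theta) - 0.5 * theta * np.dot(r, r)

def phi(G_u, y, theta):
    """Gradient of log gamma_theta(u) w.r.t. theta, cf. (3)."""
    M = y.shape[0]
    r = G_u - y
    return 0.5 * M / theta - 0.5 * np.dot(r, r)
```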

#### 2.2.1 Particular setup

Let $D = [0,1]$. For the prior specification of $\hat u$, we fix $\bar u$ and $(\sigma_k)_{k=1}^{K}$, and for $k \in \{1,\dots,K\}$, let $\phi_k(x) = \sin(k\pi x)$ if $k$ is odd and $\phi_k(x) = \cos(k\pi x)$ if $k$ is even. The observation operator is $\mathcal{G}(u) = [p(x_1;u),\dots,p(x_M;u)]^\intercal$ for fixed points $x_1,\dots,x_M \in D$, and the parameter $\theta$ in the observation noise covariance is fixed when generating the data.

The PDE problem at resolution level $l$ is solved using a finite element method with piecewise linear shape functions on a uniform mesh of width $h_l = 2^{-l}$, for $l \in \mathbb{N}$. Thus, on the $l$th level the finite-element basis functions are defined, for $i \in \{1,\dots,2^l-1\}$ with $x_i = i\,h_l$, as:

$$\psi^l_i(x) = \begin{cases} (1/h_l)\,[\,x-(x_i-h_l)\,] & \text{if } x\in[x_i-h_l,\,x_i],\\ (1/h_l)\,[\,x_i+h_l-x\,] & \text{if } x\in[x_i,\,x_i+h_l].\end{cases}$$

To solve the PDE, the expansion $p^l = \sum_{i=1}^{2^l-1} p^l_i\,\psi^l_i$ is plugged into (1), and projected onto each basis element:

$$-\Big\langle \nabla\cdot\Big(\hat u\,\nabla\sum_{i=1}^{2^l-1} p^l_i\,\psi^l_i\Big),\,\psi^l_j\Big\rangle = \langle f,\psi^l_j\rangle,$$

resulting in the following linear system:

$$A^l(u)\,p^l = f^l,$$

where we introduce the matrix $A^l(u)$ with entries $A^l_{ij}(u) = \langle \hat u\,\nabla\psi^l_i,\nabla\psi^l_j\rangle$, and the vectors $p^l$, $f^l$ with entries $p^l_i$ and $f^l_i = \langle f,\psi^l_i\rangle$, respectively.
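For intuition, a possible sketch of the level-$l$ solve in Python follows, with $D = [0,1]$ as above. The midpoint quadrature for $\hat u$ on each cell and the vertex quadrature for the load vector are our assumptions, not choices prescribed by the text:

```python
import numpy as np

def solve_pde_level(u_hat, f, l):
    """Piecewise-linear FEM for -(u_hat p')' = f on [0,1], p(0) = p(1) = 0,
    at resolution level l (mesh width h_l = 2**-l).  Solves A^l(u) p^l = f^l
    and returns the interior nodal values p^l_1, ..., p^l_{2^l - 1}."""
    h = 2.0 ** (-l)
    n = 2 ** l - 1                       # number of interior nodes
    x = h * np.arange(2 ** l + 1)        # all nodes, including the boundary
    a = u_hat(0.5 * (x[:-1] + x[1:]))    # diffusion coefficient at cell midpoints
    A = np.zeros((n, n))                 # stiffness matrix A^l(u)
    for i in range(n):
        A[i, i] = (a[i] + a[i + 1]) / h  # interior node i+1 touches cells i, i+1
        if i + 1 < n:
            A[i, i + 1] = A[i + 1, i] = -a[i + 1] / h
    fl = h * f(x[1:-1])                  # load vector f^l_j = <f, psi^l_j>
    return np.linalg.solve(A, fl)

# usage sketch (both coefficient and source are vectorized callables):
# p_nodes = solve_pde_level(lambda x: 1.0 + 0 * x, lambda x: np.sin(np.pi * x), l=4)
```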

Define $\mathcal{G}^l(u) := [g_1(p^l(\cdot;u)),\dots,g_M(p^l(\cdot;u))]^\intercal$. Denote the corresponding approximated unnormalized density by

$$\gamma^l_\theta(u) = \theta^{M/2}\exp\Big(-\frac{\theta}{2}\|\mathcal{G}^l(u)-y\|^2\Big), \qquad (4)$$

and the approximated normalized density by

$$\eta^l_\theta(u) = \frac{\gamma^l_\theta(u)}{Z^l_\theta},$$

where $Z^l_\theta := \int_X \gamma^l_\theta(u)\,du$. We further define

$$\varphi^l_\theta(u) := \nabla_\theta\log(\gamma^l_\theta(u)) = \frac{M}{2\theta} - \frac{1}{2}\|\mathcal{G}^l(u)-y\|^2. \qquad (5)$$

It is well-known that under assumption (A2.2), $p^l(\cdot;u)$ converges to $p(\cdot;u)$ as $l \to \infty$, uniformly in $u$ [3, 4]. Furthermore, continuity ensures that $\mathcal{G}^l(u)$ converges to $\mathcal{G}(u)$ and $\varphi^l_\theta(u)$ converges to $\varphi_\theta(u)$, uniformly in $u$, as well.

## 3 Methodology for Unbiased Estimation

We now describe our methodology for computing an unbiased estimate of $\eta_\theta(\varphi_\theta)$. For simplicity of exposition we will suppose that for each $j \in \{1,\dots,d_\theta\}$ and $l \in \mathbb{N}_0$, $(\varphi^l_\theta)_j \in \mathcal{B}_b(X)$, where $(\varphi)_j$ denotes the $j$th element of a vector $\varphi$ and $\mathcal{B}_b(X)$ is the collection of bounded, measurable and real-valued functions on $X$. This constraint is not needed for the numerical implementation of the method, but shall reduce most of the technical exposition to follow. As remarked in the introduction, the basic approach follows that in [9] with some notable differences. We now detail how the approach will work.

### 3.1 Methodology in [9]

The underlying approach of [9] is a type of double randomization scheme. The first step is to use the single-term estimator as developed in [11]. Suppose one wants to estimate $\eta_\theta(\varphi_\theta)$, but one only has access to a methodology that can approximate $\eta^l_\theta(\varphi^l_\theta)$ for each fixed $l \in \mathbb{N}_0$. Let $P_L$ be a positive probability mass function on $\mathbb{N}_0$ and suppose that one can construct a sequence of random variables $(\Xi^l_\theta)_{l\in\mathbb{N}_0}$ such that

$$\mathbb{E}[\Xi^0_\theta] = \eta^0_\theta(\varphi^0_\theta), \qquad (6)$$
$$\mathbb{E}[\Xi^l_\theta] = \eta^l_\theta(\varphi^l_\theta) - \eta^{l-1}_\theta(\varphi^{l-1}_\theta), \quad l\in\mathbb{N}, \qquad (7)$$

and that

$$\sum_{l\in\mathbb{N}_0}\frac{1}{P_L(l)}\,\mathbb{E}[\|\Xi^l_\theta\|^2] < +\infty, \qquad (8)$$

where $\|\cdot\|$ denotes the Euclidean norm. Now if one draws $L \sim P_L$, then $\Xi^L_\theta / P_L(L)$ is an unbiased and finite variance estimator of $\eta_\theta(\varphi_\theta)$. It should be noted that (6)-(7) are not necessary conditions, but are sufficient to ensure the unbiasedness of the estimator.
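In code, the single-term estimator is only a few lines. A sketch follows, where `sample_Xi(l)` is a hypothetical routine returning one realization of $\Xi^l_\theta$, and $P_L$ is taken geometric purely for illustration (any positive mass function satisfying (8) may be used):

```python
import numpy as np

def single_term_estimate(sample_Xi, rng, p=0.5):
    """Single-term estimator of [11]: draw L ~ P_L, return Xi^L / P_L(L).

    sample_Xi : hypothetical routine, l -> one realization of Xi^l_theta
                satisfying (6)-(7)
    p         : parameter of an (assumed) geometric mass function
                P_L(l) = p * (1 - p)**l on {0, 1, 2, ...}
    """
    L = rng.geometric(p) - 1            # L ~ P_L, supported on {0, 1, 2, ...}
    return sample_Xi(L) / (p * (1.0 - p) ** L)

# usage sketch: average many i.i.d. copies
# rng = np.random.default_rng(0)
# est = np.mean([single_term_estimate(my_sample_Xi, rng) for _ in range(1000)], axis=0)
```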

In the context of interest, it can be challenging to obtain a sequence of random variables $(\Xi^l_\theta)_{l\in\mathbb{N}_0}$ which possesses the properties (6)-(8). We will detail one possible approach at a high level and then explain in detail how one can actually construct a simulation method to achieve this high-level description.

### 3.2 High Level Approach

The objective of this section is to highlight the generic procedure used in [9] for producing estimates that satisfy (6)-(7). The basic idea is to use another application of randomization to construct such unbiased estimators from a consistent sequence of estimators. In particular, consider a given increasing sequence $(N_p)_{p\in\mathbb{N}_0}$ with $N_p \in \mathbb{N}$ for each $p \in \mathbb{N}_0$, and $\lim_{p\to\infty}N_p = +\infty$. Then, we suppose that one can construct $N_p$-sample Monte Carlo (type) estimators $\xi^{l,p}_\theta$ for each $l \in \mathbb{N}_0$, such that almost surely the following consistency results hold

$$\lim_{p\to\infty}\xi^{0,p}_\theta = \eta^0_\theta(\varphi^0_\theta), \qquad (9)$$
$$\lim_{p\to\infty}\xi^{l,p}_\theta = \eta^l_\theta(\varphi^l_\theta)-\eta^{l-1}_\theta(\varphi^{l-1}_\theta), \quad l\in\mathbb{N}. \qquad (10)$$

For a given $p$, we do not require $\xi^{l,p}_\theta$ and $\xi^{l,p-1}_\theta$ to be independent, nor do we require unbiasedness of the individual estimators, as in

$$\mathbb{E}[\xi^{0,p}_\theta] = \eta^0_\theta(\varphi^0_\theta), \qquad \mathbb{E}[\xi^{l,p}_\theta] = \eta^l_\theta(\varphi^l_\theta)-\eta^{l-1}_\theta(\varphi^{l-1}_\theta), \quad l\in\mathbb{N}.$$

Now set

$$\Xi^{0,0}_\theta := \xi^{0,0}_\theta, \qquad \Xi^{0,p}_\theta := \xi^{0,p}_\theta - \xi^{0,p-1}_\theta, \quad p\in\mathbb{N}.$$

For $l \in \mathbb{N}$ given, set

$$\Xi^{l,0}_\theta := \xi^{l,0}_\theta, \qquad \Xi^{l,p}_\theta := \xi^{l,p}_\theta - \xi^{l,p-1}_\theta, \quad p\in\mathbb{N}.$$

Let $P_P$ be a positive probability mass function on $\mathbb{N}_0$, and set $\overline{P}_P(p) := \sum_{q \ge p} P_P(q)$. Now if

$$\sum_{p\in\mathbb{N}_0}\frac{1}{\overline{P}_P(p)}\,\mathbb{E}[\|\xi^{0,p}_\theta-\eta^0_\theta(\varphi^0_\theta)\|^2] < +\infty, \qquad (11)$$
$$\sum_{p\in\mathbb{N}_0}\frac{1}{\overline{P}_P(p)}\,\mathbb{E}[\|\xi^{l,p}_\theta-\{\eta^l_\theta(\varphi^l_\theta)-\eta^{l-1}_\theta(\varphi^{l-1}_\theta)\}\|^2] < +\infty, \quad l\in\mathbb{N}, \qquad (12)$$

and one draws $P \sim P_P$, then

$$\Xi^l_\theta = \sum_{p=0}^{P}\frac{1}{\overline{P}_P(p)}\,\Xi^{l,p}_\theta \qquad (13)$$

will allow $(\Xi^l_\theta)_{l\in\mathbb{N}_0}$ to satisfy (6)-(7), where expectations are understood to be with respect to $P$ as well, yet this is suppressed in the notation. Moreover, the $\Xi^l_\theta$ will have finite variances. This result follows as we are simply using the coupled sum estimator as in [11] and using [15, Theorem 5], for instance, to verify the conditions required.
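A sketch of the coupled-sum step (13) follows, again with a geometric $P_P$ as an assumed illustration; `sample_xi_path(l, P)` is a hypothetical routine standing in for Algorithm 2 of Section 3.3.2, returning the coupled sequence $(\xi^{l,0}_\theta,\dots,\xi^{l,P}_\theta)$ from a single simulation:

```python
import numpy as np

def coupled_sum_Xi(sample_xi_path, l, rng, p=0.5):
    """Coupled-sum estimator (13) of eta^l(phi^l) - eta^{l-1}(phi^{l-1}).

    sample_xi_path : hypothetical routine, (l, P) -> [xi^{l,0}, ..., xi^{l,P}]
                     from one coupled simulation
    p              : parameter of an (assumed) geometric P_P on {0, 1, ...}
    """
    P = rng.geometric(p) - 1                    # P ~ P_P on {0, 1, 2, ...}
    xi = np.asarray(sample_xi_path(l, P), dtype=float)
    Xi = np.diff(np.concatenate(([0.0], xi)))   # Xi^{l,p} = xi^{l,p} - xi^{l,p-1}
    P_bar = (1.0 - p) ** np.arange(P + 1)       # bar P_P(p) = Prob(P >= p)
    return np.sum(Xi / P_bar)
```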

### 3.3 Details of the Approach

We will now describe how to obtain the sequence $(\Xi^{l,p}_\theta)_{p\in\mathbb{N}_0}$ for $l \in \mathbb{N}_0$ fixed.

#### 3.3.1 MLSMC Method of [2]

To introduce our approach, we first consider the MLSMC method in [2], which will form the basis for our estimation procedure. Define, for $(l,u) \in \mathbb{N}_0\times X$,

$$G^l_\theta(u) = \frac{\gamma^{l+1}_\theta(u)}{\gamma^l_\theta(u)},$$

and for $l \in \mathbb{N}$, let $M^l_\theta$ be an $\eta^l_\theta$-invariant Markov kernel; that is, for any $\varphi \in \mathcal{B}_b(X)$,

$$\eta^l_\theta(\varphi) = \int_X\Big(\int_X \varphi(u')\,M^l_\theta(u,du')\Big)\eta^l_\theta(du). \qquad (14)$$

Define, for $\mu \in \mathcal{P}(X)$ (the collection of probability measures on $(X,\mathcal{X})$) and $l \in \mathbb{N}$,

$$\Phi^l_\theta(\mu)(du') := \frac{1}{\mu(G^{l-1}_\theta)}\int_X G^{l-1}_\theta(u)\,M^l_\theta(u,du')\,\mu(du). \qquad (15)$$

Noting that

$$\eta^l_\theta(\varphi) = \frac{\eta^{l-1}_\theta(G^{l-1}_\theta\varphi)}{\eta^{l-1}_\theta(G^{l-1}_\theta)} = \frac{Z^{l-1}_\theta}{Z^l_\theta}\,\eta^{l-1}_\theta(G^{l-1}_\theta\varphi) = \frac{1}{Z^l_\theta}\int_X \gamma^l_\theta(u)\,\varphi(u)\,du, \qquad (16)$$

equations (14) and (15) lead to the recursion

$$\eta^l_\theta(\varphi) = \frac{\eta^{l-1}_\theta(G^{l-1}_\theta\varphi)}{\eta^{l-1}_\theta(G^{l-1}_\theta)} = \frac{1}{\eta^{l-1}_\theta(G^{l-1}_\theta)}\int_X G^{l-1}_\theta(u)\Big(\int_X\varphi(u')\,M^l_\theta(u,du')\Big)\eta^{l-1}_\theta(du) = \Phi^l_\theta(\eta^{l-1}_\theta)(\varphi).$$

Consider $N \in \mathbb{N}$ samples, and slightly modify the MLSMC algorithm used in [2] to keep the number of samples across layers fixed, up to some given level $l$. Details are given in Algorithm 1.
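A possible Python sketch of this fixed-$N$ sampler follows, under our reading of the recursion $\eta^s_\theta = \Phi^s_\theta(\eta^{s-1}_\theta)$. The user-supplied ingredients `sample_eta0` (draws from $\eta^0_\theta$), `G(s, u)` (evaluates $G^s_\theta(u)$) and `M_step(s, u, rng)` (one step of an $\eta^s_\theta$-invariant MCMC kernel $M^s_\theta$) are hypothetical signatures, and multinomial resampling is one possible way to sample from $\Phi^s_\theta(\eta^{s-1,N}_\theta)$:

```python
import numpy as np

def mlsmc_fixed_N(N, l, sample_eta0, G, M_step, rng):
    """Fixed-N MLSMC sampler: a sketch of Algorithm 1, targeting the law (17).

    Returns the list of particle arrays (u^{1:N}_0, ..., u^{1:N}_l)."""
    particles = [sample_eta0(N, rng)]                 # u^{1:N}_0 ~ eta^0_theta
    for s in range(1, l + 1):
        u_prev = particles[-1]
        w = np.array([G(s - 1, u) for u in u_prev])   # weights G^{s-1}_theta
        idx = rng.choice(N, size=N, p=w / w.sum())    # multinomial resampling
        particles.append(
            np.array([M_step(s, u_prev[i], rng) for i in idx])  # mutate by M^s
        )
    return particles
```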

This algorithm yields samples distributed according to the following joint law

$$\mathbb{P}^N_\theta\big(d(u^{1:N}_0,\dots,u^{1:N}_l)\big) = \Big(\prod_{i=1}^N \eta^0_\theta(du^i_0)\Big)\Big(\prod_{s=1}^{l}\prod_{i=1}^N \Phi^s_\theta(\eta^{s-1,N}_\theta)(du^i_s)\Big), \qquad (17)$$

where $\eta^{s,N}_\theta(du) := \frac{1}{N}\sum_{i=1}^N\delta_{u^i_s}(du)$ for $s \in \{0,\dots,l-1\}$. One can compute an estimate of $\eta^0_\theta(\varphi^0_\theta)$ as

$$\eta^{0,N}_\theta(\varphi^0_\theta) := \frac{1}{N}\sum_{i=1}^N \varphi^0_\theta(u^i_0).$$

Following from (16), for $l \in \mathbb{N}$, one can estimate $\eta^l_\theta(\varphi^l_\theta) - \eta^{l-1}_\theta(\varphi^{l-1}_\theta)$ with

$$\frac{\eta^{l-1,N}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1,N}_\theta(G^{l-1}_\theta)} - \eta^{l-1,N}_\theta(\varphi^{l-1}_\theta) = \frac{\frac{1}{N}\sum_{i=1}^N G^{l-1}_\theta(u^i_{l-1})\,\varphi^l_\theta(u^i_{l-1})}{\frac{1}{N}\sum_{i=1}^N G^{l-1}_\theta(u^i_{l-1})} - \frac{1}{N}\sum_{i=1}^N \varphi^{l-1}_\theta(u^i_{l-1}).$$
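Given the level $l-1$ particles from the sampler above, this increment estimator is straightforward to compute. A sketch (with `G_lm1`, `phi_l`, `phi_lm1` standing for user implementations of $G^{l-1}_\theta$, $\varphi^l_\theta$ and $\varphi^{l-1}_\theta$, taken scalar-valued for simplicity):

```python
import numpy as np

def increment_estimate(u_lm1, G_lm1, phi_l, phi_lm1):
    """N-sample estimate of eta^l(phi^l) - eta^{l-1}(phi^{l-1}) using only
    the level l-1 particles u^{1:N}_{l-1}, as in the display above."""
    g = np.array([G_lm1(u) for u in u_lm1])
    p_l = np.array([phi_l(u) for u in u_lm1])
    p_lm1 = np.array([phi_lm1(u) for u in u_lm1])
    # self-normalized importance-weighted average minus plain average
    return np.dot(g, p_l) / np.sum(g) - np.mean(p_lm1)
```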

The reason for using the samples generated at level $l-1$ to estimate $\eta^l_\theta(\varphi^l_\theta)$ as well as $\eta^{l-1}_\theta(\varphi^{l-1}_\theta)$ is to construct estimators which satisfy conditions such as (8). Standard results (for instance in [5]) allow one to prove that almost surely

$$\lim_{N\to\infty}\eta^{0,N}_\theta(\varphi^0_\theta) = \eta^0_\theta(\varphi^0_\theta),$$
$$\lim_{N\to\infty}\left(\frac{\eta^{l-1,N}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1,N}_\theta(G^{l-1}_\theta)} - \eta^{l-1,N}_\theta(\varphi^{l-1}_\theta)\right) = \eta^l_\theta(\varphi^l_\theta)-\eta^{l-1}_\theta(\varphi^{l-1}_\theta), \quad l\in\mathbb{N}.$$

Note that in general one has

$$\mathbb{E}^N_\theta\bigg[\frac{\eta^{l-1,N}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1,N}_\theta(G^{l-1}_\theta)} - \eta^{l-1,N}_\theta(\varphi^{l-1}_\theta)\bigg] \neq \eta^l_\theta(\varphi^l_\theta)-\eta^{l-1}_\theta(\varphi^{l-1}_\theta), \quad l\in\mathbb{N},$$

where $\mathbb{E}^N_\theta$ is the expectation associated to the probability $\mathbb{P}^N_\theta$ in (17).

#### 3.3.2 Approach for Constructing $(\Xi^{l,p}_\theta)_{p\in\mathbb{N}_0}$

In order to calculate our approximation, we will consider the following approach, which was also used in [9]. Given any $(l,p) \in \mathbb{N}_0^2$, we will run Algorithm 2 in order to obtain $\Xi^{l,p}_\theta$.

The joint probability law of the samples simulated according to Algorithm 2 is

$$\mathbb{P}_\theta\big(d(u^{1:N_p}_0,\dots,u^{1:N_p}_{(l-1)\vee 0})\big) = \prod_{q=0}^{p}\mathbb{P}^{N_q-N_{q-1}}_\theta\big(d(u^{N_{q-1}+1:N_q}_0,\dots,u^{N_{q-1}+1:N_q}_{(l-1)\vee 0})\big), \qquad (18)$$

where $N_{-1} := 0$ and $\mathbb{P}^N_\theta$ is as defined in (17). For $(l,p)$ given, consider running Algorithm 2. Then for any $s \in \{0,\dots,(l-1)\vee 0\}$ we can construct the following empirical probability measure on $X$:

$$\eta^{s,N_{0:p}}_\theta(du_s) := \sum_{q=0}^{p}\Big(\frac{N_q-N_{q-1}}{N_p}\Big)\,\eta^{s,N_q-N_{q-1}}_\theta(du_s). \qquad (19)$$

Note the recursion

$$\eta^{s,N_{0:p}}_\theta(du_s) = \Big(\frac{N_p-N_{p-1}}{N_p}\Big)\,\eta^{s,N_p-N_{p-1}}_\theta(du_s) + \frac{N_{p-1}}{N_p}\,\eta^{s,N_{0:p-1}}_\theta(du_s).$$
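In terms of implementation, (19) and the recursion above say that averages under $\eta^{s,N_{0:p}}_\theta$ never need to be recomputed from scratch: when the $p$th batch of $N_p - N_{p-1}$ new samples arrives, any running Monte Carlo average can be updated in constant time, as sketched below:

```python
def update_average(avg_prev, batch_avg, N_prev, N_curr):
    """One step of the recursion following (19): combine the average over the
    first N_prev = N_{p-1} samples with the average over the new batch of
    N_curr - N_prev samples, giving the average over all N_curr = N_p samples."""
    return ((N_curr - N_prev) / N_curr) * batch_avg + (N_prev / N_curr) * avg_prev
```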

Now define

$$\Xi^{l,p}_\theta := \begin{cases} \eta^{0,N_{0:p}}_\theta(\varphi^0_\theta) - \eta^{0,N_{0:p-1}}_\theta(\varphi^0_\theta) & \text{if } l = 0,\\[8pt] \dfrac{\eta^{l-1,N_{0:p}}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1,N_{0:p}}_\theta(G^{l-1}_\theta)} - \eta^{l-1,N_{0:p}}_\theta(\varphi^{l-1}_\theta) - \bigg(\dfrac{\eta^{l-1,N_{0:p-1}}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1,N_{0:p-1}}_\theta(G^{l-1}_\theta)} - \eta^{l-1,N_{0:p-1}}_\theta(\varphi^{l-1}_\theta)\bigg) & \text{otherwise,} \end{cases} \qquad (20)$$

where $\eta^{0,N_{0:-1}}_\theta(\varphi^0_\theta) := 0$, and

$$\frac{\eta^{l-1,N_{0:-1}}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1,N_{0:-1}}_\theta(G^{l-1}_\theta)} - \eta^{l-1,N_{0:-1}}_\theta(\varphi^{l-1}_\theta) := 0.$$

For convenience in the next section, the conditions (11)-(12), translated to the notation used in this section, are

$$\sum_{p\in\mathbb{N}_0}\frac{1}{\overline{P}_P(p)}\,\mathbb{E}_\theta\Big[\big\|\big[\eta^{0,N_{0:p}}_\theta-\eta^0_\theta\big](\varphi^0_\theta)\big\|^2\Big] < +\infty, \qquad (21)$$
$$\sum_{p\in\mathbb{N}_0}\frac{1}{\overline{P}_P(p)}\,\mathbb{E}_\theta\bigg[\bigg\|\frac{\eta^{l-1,N_{0:p}}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1,N_{0:p}}_\theta(G^{l-1}_\theta)} - \eta^{l-1,N_{0:p}}_\theta(\varphi^{l-1}_\theta) - \bigg(\frac{\eta^{l-1}_\theta(G^{l-1}_\theta\varphi^l_\theta)}{\eta^{l-1}_\theta(G^{l-1}_\theta)} - \eta^{l-1}_\theta(\varphi^{l-1}_\theta)\bigg)\bigg\|^2\bigg] < +\infty, \quad l\in\mathbb{N}, \qquad (22)$$

where $\mathbb{E}_\theta$ is used to denote the expectation associated to the probability $\mathbb{P}_\theta$ in (18).

### 3.4 Method

The new method is now presented in Algorithm 3.

The estimate of $\eta_\theta(\varphi_\theta)$ is given by (23). In Section 4 we will prove that it is both unbiased and of finite variance, and we will investigate the cost of computing the estimate.
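To make the overall structure concrete, a high-level sketch of the procedure follows, doubly randomizing over $L \sim P_L$ and $P \sim P_P$ (both taken geometric purely for illustration); `sample_xi_path(l, P)` again stands in for Algorithm 2:

```python
import numpy as np

def unbiased_grad_log_likelihood(n_rep, sample_xi_path, rng, pL=0.5, pP=0.5):
    """Sketch of Algorithm 3: average n_rep i.i.d. single-term estimators,
    each built from the coupled-sum estimator (13), to obtain an unbiased
    estimate of the gradient of the log-likelihood, cf. (23)."""
    total = 0.0
    for _ in range(n_rep):                       # embarrassingly parallel loop
        L = rng.geometric(pL) - 1                # L ~ P_L on {0, 1, 2, ...}
        P = rng.geometric(pP) - 1                # P ~ P_P on {0, 1, 2, ...}
        xi = np.asarray(sample_xi_path(L, P), dtype=float)
        Xi = np.diff(np.concatenate(([0.0], xi)))        # increments Xi^{L,p}
        P_bar = (1.0 - pP) ** np.arange(P + 1)           # bar P_P(p)
        Xi_L = np.sum(Xi / P_bar)                        # coupled sum (13)
        total += Xi_L / (pL * (1.0 - pL) ** L)           # single-term weight 1/P_L(L)
    return total / n_rep
```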

There are several points of practical interest to be made at this stage (the first two were noted already in [9]). First, the loop over the independent samples in Algorithm 3 can be easily parallelized. Second, one does not need to make $L$ and $P$ independent; this is only assumed for simplicity of presentation, but is not required. Third, the current method uses only the level $(l-1)\vee 0$ marginal of (18). All the samples for $s < (l-1)\vee 0$ and associated empirical measures (19) are discarded and only the level $(l-1)\vee 0$ empirical measure is utilized. This differs from [2], where all the lower-level empirical measures are used. It is possible these samples could be utilized to improve the accuracy of the method, but it is not necessary and so is not investigated further here. The potential efficiency of the double randomization scheme, as well as a discussion of the overall efficiency of the approach, is given in [9, Section 2.5].

## 4 Theoretical Results

Our main objective is to show that $\Xi^l_\theta$, as defined in (13) with $\Xi^{l,p}_\theta$ as in (20), will satisfy (6)-(8). To that end, one must first show that the $\Xi^{l,p}_\theta$ satisfy (21)-(22), which certainly verifies (6)-(7), and then one must establish that (8) holds. We make the following assumptions.

• For each $\theta \in \Theta$, there exist $0 < \underline{C} \le \overline{C} < +\infty$ such that

$$\sup_{l\ge 0}\sup_{u\in X} G^l_\theta(u) \le \overline{C}, \qquad \inf_{l\ge 0}\inf_{u\in X} G^l_\theta(u) \ge \underline{C}.$$
• For each $\theta \in \Theta$ and $l \in \mathbb{N}$, there exists a $\rho \in (0,1)$ such that for any $(u,v) \in X^2$ and any measurable $A \subseteq X$,

$$\int_A M^l_\theta(u,du') \ge \rho\int_A M^l_\theta(v,dv').$$
• For each $\theta \in \Theta$, there exists a $C < +\infty$ such that for each $j \in \{1,\dots,d_\theta\}$,

$$\sup_{l\ge 0}\sup_{u\in X}\big|\big(\varphi^l_\theta(u)\big)_j\big| \le C.$$