Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods

04/07/2021 ∙ by Andrew B. Duncan, et al. ∙ Imperial College London ∙ California Institute of Technology ∙ University of Warwick

The increasing availability of data presents an opportunity to calibrate unknown parameters which appear in complex models of phenomena in the biomedical, physical and social sciences. However, model complexity often leads to parameter-to-data maps which are expensive to evaluate and are only available through noisy approximations. This paper is concerned with the use of interacting particle systems for the solution of the resulting inverse problems for parameters. Of particular interest is the case where the available forward model evaluations are subject to rapid fluctuations, in parameter space, superimposed on the smoothly varying large scale parametric structure of interest. Multiscale analysis is used to study the behaviour of interacting particle system algorithms when such rapid fluctuations, which we refer to as noise, pollute the large scale parametric dependence of the parameter-to-data map. Ensemble Kalman methods (which are derivative-free) and Langevin-based methods (which use the derivative of the parameter-to-data map) are compared in this light. The ensemble Kalman methods are shown to behave favourably in the presence of noise in the parameter-to-data map, whereas Langevin methods are adversely affected. On the other hand, Langevin methods have the correct equilibrium distribution in the setting of noise-free forward models, whilst ensemble Kalman methods only provide an uncontrolled approximation, except in the linear case. Therefore a new class of algorithms, ensemble Gaussian process samplers, which combine the benefits of both ensemble Kalman and Langevin methods, are introduced and shown to perform favourably.


1 Introduction

1.1 The Setting

The focus of this work is on the solution of inverse problems in the setting where only noisy evaluations of the forward problem (the parameter-to-data map) are available, and where these evaluations are expensive. The methodological approaches we study are all ensemble based. The take-home message of the paper is that judicious use of ensemble information can remove the pitfalls associated with gradient-based methods in this setting while retaining the advantages of gradient descent; the conclusions apply to both optimization and sampling approaches to inversion. We provide theoretical and numerical studies which allow us to differentiate between existing ensemble based approaches, and we propose a new ensemble-based method.

The problem we study is this: we aim to infer a parameter θ ∈ ℝ^d given observations y ∈ ℝ^K of G(θ), so that

(1) y = G(θ) + η;

here the specific instance of the observational noise η is not known, but its distribution is; to be concrete we assume that η ∼ N(0, Γ), with strictly positive-definite covariance Γ. After imposing a prior probability measure π₀ on θ, we may formulate the Bayesian inference problem of finding the posterior distribution π on θ given y. Application of Bayes rule shows that this is proportional to

(2) π(θ) ∝ exp(−Φ(θ)) π₀(θ),
(3) Φ(θ) := ½ |Γ^{−1/2}(y − G(θ))|².

The objective is either to generate samples from the target distribution π (Bayesian approach) or to compute minimizers of Φ (optimization approach).

In order to understand the setting where we only have access to noisy evaluations of the forward model (it is important to distinguish between the observational noise η, appearing in the observations y, and the concept of noisy evaluations of the forward model), we consider the following class of forward models. We assume that we wish to solve the inverse problem with G = G₀, but that we have access only to evaluations of G^ε, which is the sum of the smooth component G₀ and an additional rapidly fluctuating noise term:

(4) G^ε(θ) = G₀(θ) + G₁(θ, θ/ε);

here ε > 0 controls the characteristic length-scale of the fluctuations. We work in the setting where only G^ε, and not G₀ itself, may be evaluated, but the goal is nonetheless to solve the smoothed inverse problem of finding θ from y where

(5) y = G₀(θ) + η.

Another application of Bayes rule shows that the resulting posterior distribution π̄ is proportional to

(6) π̄(θ) ∝ exp(−Φ₀(θ)) π₀(θ),
(7) Φ₀(θ) := ½ |Γ^{−1/2}(y − G₀(θ))|².
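To fix ideas, the following minimal sketch contrasts the rough potential built from G^ε with the smooth potential built from G₀. The specific functions and numerical values below are illustrative assumptions of ours, not the models used later in the paper.

```python
import numpy as np

# Illustrative one-dimensional toy problem; the functions and values here are
# our own choices, not the models used in the paper.
eps = 0.01           # characteristic length-scale epsilon of the fluctuations
gamma = 0.5          # observational noise standard deviation, Gamma = gamma^2
theta_true = 1.0

def G0(theta):
    """Smooth component of the forward map."""
    return theta ** 2

def G_eps(theta):
    """Noisy forward map (4): G0 plus a rapidly oscillating periodic term."""
    return G0(theta) + 0.1 * np.sin(theta / eps)

rng = np.random.default_rng(0)
y = G0(theta_true) + gamma * rng.standard_normal()

def Phi0(theta):
    """Smooth potential (7), built from the inaccessible map G0."""
    return 0.5 * (y - G0(theta)) ** 2 / gamma ** 2

def Phi_eps(theta):
    """Rough potential, built from the only map we can actually evaluate."""
    return 0.5 * (y - G_eps(theta)) ** 2 / gamma ** 2
```

Plotting Phi_eps against Phi0 on a fine grid shows O(1/ε)-frequency oscillations superimposed on the smooth envelope; gradient-based methods see the oscillations, while the ensemble Kalman methods studied below average them out.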

The following is a concrete example of a parameter-inference problem of interest.

Example 1.1

Consider the following parameter-dependent dynamical system

dz/dt = F(z; θ), z(0) = z₀,

and assume that data is available to us as finite time-averages of a function φ on the state space:

G_T(θ) = (1/T) ∫₀^T φ(z(t; θ)) dt,

where z(t; θ) denotes the solution at time t. For ergodic, mixing dynamical systems a central limit theorem may sometimes be proved to hold [araujo2015rapid], or empirically observed. In this setting

G_T(θ) ≈ G₀(θ) + T^{−1/2} N(0, Σ(θ)),

where G₀(θ) is the infinite time average of φ(z(t; θ)) which, by ergodicity, is independent of the initial condition z₀. Thus G_T may be viewed as a noisy perturbation of the function G₀, with the noise induced by the unknown initial condition z₀. In many chaotic dynamical systems the fluctuations in the θ-dependence of G_T are rapidly varying. In our computations we approximate Σ(θ) by a constant covariance Σ estimated from a single long run of the model. This setting arises in climate modelling: see [cleary2020calibrate].
Figure 1 shows an example in which the Lorenz ’63 model is used, φ corresponds to evaluation of one of the three coordinates of the solution, and θ is a parameter appearing linearly in the driving vector field. Further details of the model, along with the particulars of φ and θ, are provided in Section 5.3.

Figure 1: The potential function (negative log-posterior density) for the unknown parameters of the parameter-dependent dynamical system described in Example 1.1 and further detailed in Section 5.3. The roughness of the potential arises from the noise introduced into the data through the time-averaging process over the ergodic dynamical system.

1.2 Literature Review

The focus of this paper is the solution of inverse problems, via optimization or probabilistic approaches. Due to their wide applicability and practical success, Markov Chain Monte Carlo (MCMC) methods based on Metropolis-Hastings (MH) transition kernels remain the de-facto approach to performing Bayesian inference for complex and/or high-dimensional statistical models [hastings1970monte, robert2013monte], and, in particular, play an important role in Bayesian inversion [kaipio2006statistical]. An MCMC method can effectively explore the posterior distribution of a Bayesian model, providing systematic uncertainty quantification and, with sufficient computational effort, returning an arbitrarily accurate approximation to an expectation of a quantity of interest. On the other hand, such methods require large numbers of iterations to provide an accurate characterisation of the posterior distribution [geyer1991markov]. While convergence of many MH chains can be guaranteed under mild conditions [meyn2012markov], the speed of convergence can be prohibitively slow, either because of small, incremental proposal moves or because of frequent proposal rejections. For Bayesian models with computationally expensive likelihoods this may render MCMC based methods computationally prohibitive, as they require at least one likelihood evaluation per MCMC step. In practice it is not unusual for MCMC methods to require very large numbers of likelihood computations [geyer1991markov], which may be infeasible if each involves the evaluation of an expensive PDE, for example in climate modeling [jarvinen2012ensemble], the geophysical sciences more generally [oliver2008inverse, chandra2019bayeslands] and for agent-based models [gomes2019parameter]. The overall performance depends on both the characteristics of the target posterior distribution and the parameters of the specific proposal distribution. In particular, the proposal variance must be chosen suitably for the model under consideration, for example based on MCMC convergence diagnostics [mengersen1999mcmc, vats2020analyzing].

The presence of multiple modes in the target posterior distribution is another cause of slow MCMC convergence, as MH-based chains will spend the majority of their time exploring around a single mode, with rare transitions between modes. This has motivated various extensions to standard MH-based MCMC, including delayed-rejection methods [green2001delayed, haario2006dram], adaptive MCMC, and methods based on ensembles to promote better state-space exploration, e.g. parallel tempering [liu2008monte] and others. This issue is further exacerbated for models with posterior distributions exhibiting “roughness”, characterised by a non-convex, non-smooth posterior with large numbers of local maxima, such as arises in solving inverse problems of the type described in Example 1.1, or akin to the frustrated energy landscapes arising in molecular models of protein structures, glassy models in statistical physics, and similar models arising in the training of neural networks [baity2018comparing]. Similar scenarios also occur naturally in Bayesian models arising in geoscience [chandra2019bayeslands]. In many practical applications, these local maxima will not play any important role in quantifying the model uncertainty, due to their lack of large-scale structure; algorithms are therefore sought which find minimizers of a modified loss function characterizing the large-scale structure [chosh]. The focus of our work is to understand the extent to which we are able to characterise the large-scale structure of the posterior without being hampered by the presence of small-scale fluctuations in the likelihood.

In the context of Bayesian inverse problems, such pathologies may arise naturally if the forward model exhibits multiscale behaviour [frederick2017numerical], particularly when only sparse data is available, giving rise to identifiability issues [chandra2019bayeslands]. Alternatively, this may occur if one only has access to a noisy estimate of the likelihood, e.g. for some classes of intractable likelihoods such as those arising from stochastic differential equation models with sparse observations. Similarly, rough posteriors may arise if one is fitting a Bayesian inverse problem based on estimators of sufficient statistics of the forward model [morzfeld2018feature]; this setting arises in parameter estimation problems of the type described in Example 1.1, where time-averaged quantities are used for parameter estimation in chaotic dynamical systems. Posterior distributions arising from Bayesian neural networks have also been observed to have a similar structure [baity2018comparing, chandra2019bayeslands]. In the special case where one has an unbiased estimator of the likelihood, Pseudo-Marginal MCMC methods [andrieu2009pseudo] provide a means of sampling from the exact posterior distribution, but the performance of these methods degrades very quickly with increasing dimension.

The behaviour of MH chains for sampling a particular class of rough distributions was studied in [plechavc2019sampling], where it was shown that the performance of Random Walk Metropolis-Hastings based MCMC is independent of the characteristic roughness length-scale. On the other hand, standard gradient-based MH methods, such as the Metropolis-Adjusted Langevin Algorithm (MALA), are sensitive to the posterior roughness, in that the performance degrades as the fluctuation scale of the noisy likelihood evaluation is decreased, even when optimally tuned. A similar dichotomy between random walk and gradient based proposals for MH can also be found in [livingstone2019robustness]. Study of parameter estimation for the effective single-scale dynamics of the overdamped Langevin equation, based on sample path data generated with multiscale potentials of the type considered here, was introduced in [papavasiliou2009maximum, pavliotis2007parameter], building on the framework of [Olla94homogenizationof]. In subsequent works [duncan2016noise, duncan2016brownian], the authors study the influence of the fast-scale perturbation of the potential on the speed of convergence to equilibrium, identifying situations where the exponential rate of convergence is independent of ε and situations where the speed of convergence will decrease as ε → 0; in particular, using Muckenhoupt’s criterion [Muckenhoupt1972, bakry2013analysis], it is shown that for potentials of the form V(θ) + p(θ/ε), where p is a smooth function with unit period, a Logarithmic-Sobolev inequality cannot hold for the semigroup associated with Langevin dynamics.

We emphasise that the objective of our work is orthogonal to those in [plechavc2019sampling, duncan2016noise, duncan2016brownian], in that we aim to obtain approximate samples from the smoothed posterior distribution given by (6) based on evaluations of G^ε, whereas in the aforementioned papers the objective is to sample directly from the multiscale target distribution. In this sense our work has more in common with the objective of the papers [papavasiliou2009maximum, pavliotis2007parameter]; however our setting is not restricted to learning parameters from sample paths of diffusion processes.

The new method that we introduce is based on the use of Gaussian processes to find a smoothed potential to use within Langevin dynamics. In the context of uncertainty quantification, Gaussian Processes (GPs) were first used to model ore reserves for mining [krige1951statistical], leading to the kriging method, which is now widely used in the geo-statistics community [stein2012interpolation]. Subsequently, GPs have been widely used to provide black-box emulation of computationally expensive codes [sacks1989design], and in [kennedy2001bayesian] a Bayesian formulation of the underpinning framework is introduced. Emulation methods based on GPs are now widespread and find uses in numerous applications, ranging from computer code calibration [higdon2004combining] and global sensitivity analysis [oakley2004probabilistic] to uncertainty analysis [oakley2002bayesian] and MCMC [lan2016emulation].

Ensemble Kalman based methods have had much empirical success for state estimation of noisily observed dynamical systems, finding application in numerical weather prediction and the geophysical sciences [asch2016data, evensen2009data]. In [reich2011dynamical] the “analysis” step in sequential state estimation was studied from the viewpoint of sequential Monte Carlo (SMC, [del2006sequential]) methods, making a link between data-assimilation and the solution of inverse problems. The fact that the ensemble Kalman methodology is derivative-free and only requires ensembles of moderate size has led to its increased use in the solution of inverse problems; this initial development was primarily in the context of oil reservoir simulation [chen2012ensemble, chen2013levenberg, emerick2013ensemble], where the methods go by the names randomized maximum likelihood and iterative ensemble smoother [raanes2019revising, raanes2019revisingb] and are akin to SMC methods. The approach has since been generalized to optimization methods found by iterating these algorithms over arbitrary numbers of steps, rather than transporting the prior to the posterior in a fixed number of steps [iglesias2013ensemble, albers2019ensemble]; we will refer to such methods, in which (discrete or continuous time) integration over arbitrarily long time-intervals is used, as Ensemble Kalman Inversion (EKI). Empirically, ensemble Kalman based inversion methods have been shown to produce good parameter estimates even for high dimensional models, but they do not provide systematic uncertainty quantification. In [garbuno2020interacting] an extension of the methods, which we refer to as the Ensemble Kalman Sampler (EKS), is proposed; this method provides approximate samples from the target Bayesian posterior distribution, and is able to sample exactly from the posterior distribution for Gaussian models [nusken2019note, garbuno2020affine]. Despite their widespread use, ensemble Kalman methods rely on a Gaussian assumption for the posterior distribution, and their output may diverge from the true posterior for models which depart from this assumption. A recently introduced method, based on multiscale simulation of an interacting particle system, delivers controllable approximations of the true posterior [pavliotis2021derivative]; it is rather slow in its basic form, but can be made more efficient when preconditioned by covariance information derived from the output of the ensemble Kalman sampler. The use of ensemble methods to enable more effective posterior exploration also arises in the context of MCMC: methods include replica exchange [swendsen1986replica, liu2008monte] as well as Metropolis-Hastings based approaches [goodman2010ensemble, cappe2004population, jasra2007population, iba2001population]. Ensemble-based MCMC methods which accelerate the Markov chain dynamics by introducing preconditioners computed from ensemble information (e.g. the sample covariance) have also been studied [leimkuhler2018ensemble, garbuno2020affine, carrillo2019wasserstein, duncan2019geometry]. The recent work [maoutsa2020interacting] also employs an ensemble of particles for evolving density estimation using kernel methods, with the objective of approximating the solution of a Fokker-Planck equation.

In this work we seek to understand how ensemble methods can be employed to effectively mitigate the challenges of sampling from noisy or rough posterior distributions. We emphasise that, although we are able to evaluate G^ε and not G₀, it is not our objective to sample from the posterior distribution of the Bayesian model (2) with G = G^ε, but rather to generate samples from the smoothed probability distribution (6), determined only by G₀. We identify an important dichotomy between different classes of ensemble methods which resonates with the conclusions of [plechavc2019sampling]: (a) those which calculate the gradient of the log-posterior density for every particle within the ensemble and then aggregate this information to update the particle positions; (b) those which evaluate the log-posterior for every particle and then compute a gradient, or approximate gradient, based on the aggregate. The Noisy SVGD and Empirically Preconditioned Langevin methods detailed in [duncan2019geometry] and [carrillo2019wasserstein] respectively fall into the first class, while the EKS falls into the second class of ensemble methods. We show that those in class (b) are robust to the roughness of the posterior landscape and produce approximations of the posterior (6), using only evaluations of G^ε, but with relaxation times independent of ε; in contrast, the performance of those in class (a) deteriorates as the characteristic length-scale ε of the roughness converges to zero, and such methods do not solve the inverse problem defined by the smooth component G₀, but rather a correction of it.

Our analysis and evaluation of the algorithms is based on deploying multiscale methodology to determine the effect of small scale structures on the large scales of interest; in particular we apply the formal perturbation approach to multiscale analysis which was systematically developed in [bensoussan2011asymptotic], and which is presented pedagogically in [pavliotis2008multiscale]. To simplify the analysis we perform the multiscale analysis for mean field limit problems, requiring the study of nonlinear Fokker-Planck equations; previous use of multiscale methods for nonlinear Fokker-Planck equations arising from mean field limits may be found in [gomes2018mean, gomes2019mean].

This analysis demonstrates the benefits of the EKS for obtaining ε-independent convergence, but leaves open the issue that the EKS leads to an uncontrolled approximation of the posterior, except in the Gaussian setting. Motivated by the success of the EKS for sampling from models with noisy likelihoods, and by the wish to make controlled approximations of the posterior, we propose a new class of ensemble methods known as ensemble Gaussian process samplers (GPS), which can sample effectively from rough distributions without making the ansatz of a Gaussian posterior distribution that is used in the EKS. The strategy underpinning this method involves evolving an ensemble of particles according to overdamped Langevin dynamics, using a surrogate GP emulator to replace the potentially expensive log-likelihood term; different policies for evolving the design points of the GP emulator will be considered. For an appropriate choice of kernel the GP approximation of the log-likelihood can be differentiated, even if the log-likelihood is not smooth. The use of surrogate GP models to accelerate MCMC has been considered before; for example, in [lan2016emulation] a GP emulator was used to calculate the higher-order derivatives of the log-likelihood required in the calculation of Riemannian Manifold Hamiltonian Monte Carlo. Similarly, in [strathmann2015gradient] a nonparametric density estimator based on an infinite-dimensional exponential family was used to approximate the log-posterior, which was then used to compute the derivatives required for HMC. The use of surrogate models to augment existing MCMC methods through a delayed rejection scheme has been considered in [zhang2019accelerating] for GPs and [yan2019adaptive] for neural network surrogates.

1.3 Our Contributions

In this paper we make the following contributions to the analysis and development of ensemble based methods for the solution of the inverse problem (6), based on the forward model G₀, given only access to the noisy approximation G^ε of the form (4):

  • by means of multiscale analysis we demonstrate that the ensemble Kalman sampler (EKS, which does not use gradients of the forward model) exhibits an averaging property which enables the algorithm to avoid undesirable effects of noise in the forward model evaluations;

  • by means of multiscale analysis we demonstrate that Langevin based samplers (which use gradients of the forward model) exhibit a homogenization property which causes the algorithm to slow down in the presence of noise in the forward model evaluations;

  • we introduce the ensemble GP sampler (GPS) which combines the benefits of the EKS (averaging out noisy forward model evaluations) with the benefits of Langevin based sampling (exact gradients and exact posteriors), overcoming the drawbacks of the two methods (uncontrolled approximation of the posterior, slow performance in presence of noisy forward model evaluations respectively);

  • we employ numerical experiments to illustrate the averaging and homogenization effects of the EKS and Langevin samplers, and to show the benefits of the GPS.

In Section 2 we define the EKS and study its application to noisy forward models by means of multiscale averaging. In Section 3 we define a class of interacting Langevin samplers and study its application to noisy forward models by means of multiscale homogenization. Section 4 introduces the new ensemble Gaussian process sampler. Numerical results for all three methods are presented in Section 5.

2 Derivative-Free Sampling

2.1 Ensemble Kalman Sampling

The Ensemble Kalman Sampler is an approximate inference method based on the solution of the following system of SDEs for an interacting ensemble of J particles θ¹, …, θᴶ in ℝ^d, coupled through empirical moments of the ensemble Θ_t = (θ¹_t, …, θᴶ_t):

(8) dθⁱ_t = −C^{θG}(Θ_t) Γ^{−1} (G(θⁱ_t) − y) dt − C(Θ_t) Σ₀^{−1} (θⁱ_t − m₀) dt + ((d+1)/J)(θⁱ_t − m̄(Θ_t)) dt + √(2 C(Θ_t)) dWⁱ_t,

where the Wⁱ are standard independent Brownian motions in ℝ^d, N(m₀, Σ₀) denotes the Gaussian prior, and

(9a) m̄(Θ) = (1/J) Σ_{j=1}^{J} θʲ, C^{θG}(Θ) = (1/J) Σ_{j=1}^{J} (θʲ − m̄(Θ)) ⊗ (G(θʲ) − Ḡ), Ḡ = (1/J) Σ_{j=1}^{J} G(θʲ),
(9b) C(Θ) = (1/J) Σ_{j=1}^{J} (θʲ − m̄(Θ)) ⊗ (θʲ − m̄(Θ)).

We assume that all particles are initially distinct and sampled from a high variance distribution, which we choose w.l.o.g. to be the prior. The method was introduced in [garbuno2020interacting] without the linear correction term proportional to (d+1)/J; this correction term was identified and analyzed in [nusken2019note, garbuno2020affine]. In the case where G is linear, and when initialized with positive-definite empirical covariance, this system converges exponentially fast to a measure comprising J independent copies of the desired posterior distribution associated with the Bayesian formulation of the inverse problem for θ [nusken2019note, garbuno2020affine]. In the nonlinear case this result only holds approximately [garbuno2020interacting], but the output of the SDEs may be used as a key component in other algorithms for the solution of the inverse problem [cleary2020calibrate, pavliotis2021derivative] which come equipped with rigorous bounds for the approximation of the posterior.
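A minimal Euler-Maruyama discretisation of (8)-(9) might read as follows; this is our own sketch, assuming the Gaussian prior N(m₀, Σ₀) above, and omitting the adaptive time-stepping typically used in practice.

```python
import numpy as np

def eks_step(thetas, G, y, Gamma_inv, m0, Sigma0_inv, dt, rng):
    """One Euler-Maruyama step of the EKS SDE (8)-(9).

    thetas : (J, d) array of particles; G maps a (d,) array to a (K,) array.
    The data-misfit term uses the particle/forward-map cross-covariance and
    is therefore derivative-free."""
    J, d = thetas.shape
    Gs = np.array([G(t) for t in thetas])        # (J, K) forward evaluations
    m = thetas.mean(axis=0)                      # empirical mean (9a)
    Gbar = Gs.mean(axis=0)
    C = (thetas - m).T @ (thetas - m) / J        # empirical covariance (9b)
    CthG = (thetas - m).T @ (Gs - Gbar) / J      # (d, K) cross-covariance

    drift = (
        -(Gamma_inv @ (Gs - y).T).T @ CthG.T     # derivative-free misfit term
        - (thetas - m0) @ Sigma0_inv @ C         # Gaussian-prior term
        + (d + 1) / J * (thetas - m)             # finite-J correction term
    )
    # Noise with covariance 2 C dt, shared preconditioner across particles.
    L = np.linalg.cholesky(C + 1e-10 * np.eye(d))
    noise = np.sqrt(2 * dt) * rng.standard_normal((J, d)) @ L.T
    return thetas + dt * drift + noise
```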

Also of interest is the mean field limit of this system, which takes the form

(10) dθ̄_t = −C^{θG}(ρ_t) Γ^{−1} (G(θ̄_t) − y) dt − C(ρ_t) Σ₀^{−1} (θ̄_t − m₀) dt + √(2 C(ρ_t)) dW_t,

where W is a standard independent Brownian motion in ℝ^d and, for a density ρ on ℝ^d, we define the mean m(ρ) ∈ ℝ^d and the covariances C(ρ) ∈ ℝ^{d×d} and C^{θG}(ρ) ∈ ℝ^{d×K} by

(11) m(ρ) = ∫ θ ρ(θ) dθ,
(12) C(ρ) = ∫ (θ − m(ρ)) ⊗ (θ − m(ρ)) ρ(θ) dθ,
(13) C^{θG}(ρ) = ∫ (θ − m(ρ)) ⊗ (G(θ) − Ḡ(ρ)) ρ(θ) dθ, Ḡ(ρ) = ∫ G(θ) ρ(θ) dθ.

Here ρ_t is the time-dependent density of the process, and self-consistency implies that it satisfies the nonlinear Fokker-Planck equation

(14) ∂_t ρ = ∇ · ( ρ C^{θG}(ρ) Γ^{−1} (G(θ) − y) + ρ C(ρ) Σ₀^{−1} (θ − m₀) + ∇ · (C(ρ) ρ) ).

It is useful to notice that C(ρ) depends only on ρ and not on θ, and that hence we may also write

(15) ∂_t ρ = ∇ · ( ρ C^{θG}(ρ) Γ^{−1} (G(θ) − y) + ρ C(ρ) Σ₀^{−1} (θ − m₀) ) + C(ρ) : ∇∇ρ.

In equation (14) the outer divergence acts on vectors, the inner one on matrices; in equation (15) the Frobenius inner-product between matrices is used.

2.2 The small ε limit

In order to understand the performance of the EKS algorithm when rapid fluctuations are present in the forward model, we proceed to analyze it under the following assumption:

Assumption 2.2

The forward model G = G^ε, where

(16) G^ε(θ) = G₀(θ) + G₁(θ, θ/ε),

with G₀ : ℝ^d → ℝ^K smooth, and G₁ : ℝ^d × 𝕋^d → ℝ^K smooth; without loss of generality G₁ may be taken to average to zero over its second argument, since any nonzero mean can be absorbed into G₀.

Here 𝕋^d denotes the d-dimensional unit torus: G₁ is a periodic function in every direction of its second argument. Although the periodic perturbation is a simplification of the typical noisy models encountered in practice, such as the class presented in Example 1.1, it provides a convenient form for analysis which is enlightening about the behaviour of algorithms more generally; furthermore, the multiscale ideas we use may be generalized to stationary random perturbations, and similar conclusions are to be expected [bensoussan2011asymptotic].

We use formal multiscale perturbation expansions to understand the effect of the rapidly varying perturbation on the smooth envelope G₀ of the forward model, in the context of the EKS, using the mean field limit. To describe the result of this multiscale analysis we define the averaged mean field limit equations

(17) dθ̄_t = −C^{θG₀}(ρ̄_t) Γ^{−1} (G₀(θ̄_t) − y) dt − C(ρ̄_t) Σ₀^{−1} (θ̄_t − m₀) dt + √(2 C(ρ̄_t)) dW_t,

with

C^{θG₀}(ρ̄) = ∫ (θ − m(ρ̄)) ⊗ (G₀(θ) − Ḡ₀(ρ̄)) ρ̄(θ) dθ, Ḡ₀(ρ̄) = ∫ G₀(θ) ρ̄(θ) dθ.

To be self-consistent the density ρ̄_t must satisfy the nonlinear Fokker-Planck equation

(18) ∂_t ρ̄ = ∇ · ( ρ̄ C^{θG₀}(ρ̄) Γ^{−1} (G₀(θ) − y) + ρ̄ C(ρ̄) Σ₀^{−1} (θ − m₀) + ∇ · (C(ρ̄) ρ̄) ).

The important observation is that these equations constitute the EKS for the smooth envelope G₀ of the multiscale forward model G^ε. The following result is derived in Appendix A and it shows that, as ε → 0, equation (14) is approximated by (18).

Formal Perturbation Result

Let Assumption 2.2 hold. If the solution ρ^ε of (14) is expanded in the form ρ^ε = ρ₀ + ε ρ₁ + ε² ρ₂ + ⋯, then formal multiscale analysis demonstrates that ρ₀ satisfies (18).

Remark 1
  • The result shows that, as ε → 0, the mean field SDE (10), and the nonlinear Fokker-Planck equation (14) for its density, are approximated by the SDE (17), and the nonlinear Fokker-Planck equation (18) for its density. This means that the EKS algorithm simply behaves as if G^ε = G₀, and ignores the rapid fluctuations on top of G₀; this is a very desirable feature for computations whose goal is to solve the inverse problem (1) with G = G₀, but where only black box evaluations of G^ε given by (4) are available.

  • We choose to formulate this result in terms of the mean field limit because this leads to a transparent derivation of the relevant result. The analysis is cleaner in this limit as it concerns a nonlinear Fokker-Planck equation with spatial domain ℝ^d; similar results may also be obtained for the finite particle system by considering a linear Fokker-Planck equation with spatial domain ℝ^{Jd}. Rigorous justification of the formal expansion could be approached by use of the Itô formula (see Chapters 17 and 18 in [pavliotis2008multiscale], for example); the main technical difficulty in this setting is the need to derive bounds from below on the covariance operator, something which is considered in [garbuno2020affine], where the finite particle system is proved to be ergodic.

3 Derivative-Based Sampling

If the forward problem is differentiable with respect to θ, and the derivative can be computed efficiently, then a standard approach to generate samples from the distribution π defined by (2) is to employ sampling methods which use the gradient in an attempt to explore state-space effectively. For example, one may consider the overdamped Langevin process [pavliotis2015stochastic], given by the solution of the following Itô stochastic differential equation (SDE)

(19) dθ_t = −∇V(θ_t) dt + √2 dW_t,

where V(θ) := Φ(θ) − log π₀(θ) and Φ is defined by (3). Under mild conditions [pavliotis2015stochastic], the Markov process θ_t will be ergodic with unique invariant distribution given by π. In particular, θ_t will converge in distribution to π as t → ∞. MCMC methods based on (19) include the Unadjusted Langevin Algorithm (ULA) [roberts1996exponential], the Metropolis Adjusted Langevin Algorithm (MALA) [bou2010pathwise] and variants such as the Riemannian Manifold MALA algorithm (RM-MALA) [girolami2011riemann]. Hybrid (also known as Hamiltonian) Monte Carlo based methods also exploit the gradient of V to explore the state-space [duane1987hybrid], and have been generalized to the Riemannian manifold setting in [girolami2011riemann].
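For concreteness, a minimal sketch of ULA, the simplest discretisation of (19), is given below; grad_V is assumed to be a user-supplied gradient of the potential V.

```python
import numpy as np

def ula(grad_V, theta0, dt, n_steps, rng):
    """Unadjusted Langevin algorithm: Euler-Maruyama discretisation of (19).

    grad_V is the gradient of the potential V = Phi - log(pi_0). When the
    potential is built from the noisy forward model G^eps, grad_V inherits
    O(1/eps)-frequency oscillations; this is the mechanism behind the
    slow-down analysed in Section 3.2."""
    theta = np.array(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - dt * grad_V(theta) + np.sqrt(2.0 * dt) * noise
        path.append(theta.copy())
    return np.array(path)
```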

3.1 Ensemble Langevin Sampling

A system of overdamped Langevin processes can be coupled in a natural fashion through a spatially-varying preconditioning matrix. More specifically, we can define a system of ℝ^d-valued diffusion processes θ¹, …, θᴶ by the following SDE

(20) dθⁱ_t = −P(θⁱ_t) ∇V(θⁱ_t) dt + ∇ · P(θⁱ_t) dt + √(2 P(θⁱ_t)) dWⁱ_t,

where the Wⁱ are standard independent Brownian motions in ℝ^d and ∇ · P denotes the divergence applied to each row of P, for i = 1, …, J. As in the previous section, we assume that particles are initially distinct and sampled from a high variance distribution, such as the prior. If the matrix function P is positive definite then the process will be invariant with respect to the product measure π^{⊗J}.

We now identify two particular choices of preconditioner. For the first preconditioner we define

(21) P(θⁱ_t) = C(Θ_t),

where C(Θ_t), defined by (9b), is the empirical covariance. Assuming that C(Θ₀) is positive definite, then C(Θ_t) is positive definite for all t ≥ 0, and so Θ_t converges in distribution to π^{⊗J}. This idea follows from the more general concept of preconditioning arbitrary interacting particle MCMC methods through their empirical covariance [leimkuhler2018ensemble]. Furthermore, in the case of a linear forward operator, equation (8) becomes (20) [garbuno2020interacting], connecting the work in this section with the EKS. Quantitative estimates on the rate of convergence to equilibrium in the setting of preconditioned interacting Langevin equations can be found in [garbuno2020affine, carrillo2019wasserstein].
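A sketch of one step of this ensemble Langevin sampler is given below, under our own simplifying choices (a finite-J correction term in the spirit of [garbuno2020affine] is included, and adaptive step-size control is omitted). Note that, in contrast to the EKS step above, gradients are computed per particle before any aggregation, which is exactly the class (a) structure of Section 1.2.

```python
import numpy as np

def els_step(thetas, grad_V, dt, rng):
    """One Euler-Maruyama step of the ensemble Langevin sampler (20) with the
    empirical covariance preconditioner (21).

    grad_V maps a (d,) array to a (d,) array; with a noisy forward model the
    per-particle gradients see the rapid fluctuations directly."""
    J, d = thetas.shape
    m = thetas.mean(axis=0)
    C = (thetas - m).T @ (thetas - m) / J            # preconditioner (21)
    grads = np.array([grad_V(t) for t in thetas])    # per-particle gradients
    drift = -grads @ C + (d + 1) / J * (thetas - m)  # includes finite-J term
    L = np.linalg.cholesky(C + 1e-10 * np.eye(d))
    noise = np.sqrt(2 * dt) * rng.standard_normal((J, d)) @ L.T
    return thetas + dt * drift + noise
```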

The mean field limit of this system of interacting particles with this choice of preconditioner takes the form

dθ̄_t = −C(ρ_t) ∇V(θ̄_t) dt + √(2 C(ρ_t)) dW_t,

where C(ρ) is as in (12). By self-consistency, the associated non-linear Fokker-Planck equation for the time-dependent density ρ_t of the process is given by

(22) ∂_t ρ = ∇ · ( C(ρ) ( ρ ∇V + ∇ρ ) ).

The second choice of preconditioner is kernel-based, acting on the ensemble through

(𝒦_Θ h)(θ) = (1/J) Σ_{j=1}^{J} k(θ, θʲ) h(θʲ)

for θ ∈ ℝ^d, where k is a characteristic kernel. This diffusion process is known as Noisy Stein Variational Gradient Descent (N-SVGD), and its properties are studied in [duncan2019geometry]; the original (SVGD) algorithm appears in [liu2016stein] and is analyzed in [lu2019scaling]. If k is suitably chosen then one can show that Θ_t converges in distribution to the product distribution π^{⊗J}. The mean field limit of this system of interacting particles takes the following form:

dθ̄_t = −𝒦_{ρ_t} ∇( V + log ρ_t )(θ̄_t) dt.

By self-consistency, the time-dependent density ρ_t satisfies the following non-linear non-local Fokker-Planck equation:

(23) ∂_t ρ = ∇ · ( ρ 𝒦_ρ ∇( V + log ρ ) ),

where 𝒦_ρ is the operator defined by

(𝒦_ρ h)(θ) = ∫ k(θ, θ′) h(θ′) ρ(θ′) dθ′

for all suitable vector fields h.

3.2 The small ε limit

As an ensemble scheme, the system described by (20) aggregates information from individual particles to obtain a better informed direction in which to explore the posterior distribution. Unlike the EKS, these approaches compute the gradient before aggregating across particles. We show that this causes the resulting sampler to perform poorly in the presence of rapid fluctuations in the evaluation of the likelihood. To this end, given ε > 0, we consider a system with mean-field limit characterised by the following PDE:

(24) ∂_t ρ = ∇ · ( ρ 𝒜_ρ ∇( V^ε + log ρ ) ),

where 𝒜_ρ is a bounded linear operator on suitable vector fields for every probability density function ρ, and where V^ε(θ) := Φ^ε(θ) − log π₀(θ), with Φ^ε given by (3) for the forward model G = G^ε satisfying Assumption 2.2. Note that (24) includes both the models described by (22) and (23).

The following result is derived in Appendix B. It characterises the evolution of the leading order term ρ₀ of the solution ρ^ε of (24). Unlike the setting in the previous section for the EKS, the ε → 0 limit is not the same as the Fokker-Planck equation obtained from applying the ensemble Langevin equation methodology to the inverse problem defined by the forward model G₀, with posterior given by (6).

Formal Perturbation Result

Let Assumption 2.2 hold. If the solution ρ^ε of (24) is expanded in the form ρ^ε = ρ₀ + ε ρ₁ + ε² ρ₂ + ⋯, then formal multiscale analysis demonstrates that ρ₀ satisfies the following mean field PDE:

(25) ∂_t ρ₀ = ∇ · ( ρ₀ 𝒜_{ρ₀} K(θ) ∇( Ψ + log ρ₀ ) ),

where Ψ(θ) := V₀(θ) − log Z(θ), with V₀(θ) := Φ₀(θ) − log π₀(θ), and, writing V^ε(θ) = V₀(θ) + V₁(θ, θ/ε) for the decomposition of the rough potential induced by (4),

Z(θ) := ∫_{𝕋^d} e^{−V₁(θ,z)} dz, K(θ) := Z(θ)^{−1} ∫_{𝕋^d} ( I + ∇_z χ(θ, z) ) e^{−V₁(θ,z)} dz.

Here χ is a solution to the following second order PDE in z, parameterized by θ:

∇_z · ( e^{−V₁(θ,z)} ( ∇_z χ(θ, z) + I ) ) = 0, z ∈ 𝕋^d.

Furthermore, for arbitrary ξ ∈ ℝ^d,

(26) ⟨ξ, K(θ) ξ⟩ ≤ |ξ|².

Remark 2
  • The homogenized mean field equations in the limit ε → 0 describe the evolution of a density with unique invariant distribution proportional to exp(−Ψ(θ)) = exp(−V₀(θ)) Z(θ). This invariant distribution will generally not be equal to the invariant distribution π̄, associated with the smoothed inverse problem (1) with G = G₀ and defined in (6), because of the presence of the θ-dependent factor Z(θ). This indicates that using an ensemble of coupled Langevin particles with a potential derived from the noisy forward problem will not result in an ‘averaging out’ to obtain samples from the posterior of the smoothed inverse problem with potential derived from the smooth forward problem G₀; indeed there will in general be an O(1) deviation from the target invariant distribution.

  • A second effect caused by the fast-scale perturbation is a slow-down of convergence to equilibrium: specifically, (26) implies that the spectral gap associated with the mean field equation (25) will generally be smaller than that associated with the slowly-varying forward operator G₀.

  • The same considerations described in the second bullet of Remark 1 also apply here.

4 The Best Of Both Worlds

In this section we detail another gradient-free ensemble method, which makes use of estimates of the log-likelihood over the ensemble of particles to estimate the (noisy) gradient. This approximation is then used to evolve each particle forward according to overdamped Langevin dynamics. To this end, we model the partially observed potential V as a Gaussian process with kernel k, where k is an appropriately chosen positive definite kernel on ℝ^d × ℝ^d. This idea is inspired by the paper [maoutsa2020interacting], which uses a closely related approach with the goal of approximating solutions to a Fokker-Planck equation.

In this work, we choose k to be a Gaussian radial basis function kernel of the form

k(θ, θ′) = τ² exp( −|θ − θ′|² / (2λ²) ),

where τ is the kernel amplitude and λ is the kernel bandwidth. Given (noisy) evaluations v = (v₁, …, v_N) of the potential at the ensemble points θ¹, …, θᴺ, we seek a function f such that

v_n ≈ f(θⁿ), n = 1, …, N.

The corresponding Gaussian process posterior for f has mean function

μ(θ) = k(θ, Θ) (K + σ² I)^{−1} v

and covariance function

c(θ, θ′) = k(θ, θ′) − k(θ, Θ) (K + σ² I)^{−1} k(Θ, θ′).

Here K = ( k(θⁱ, θʲ) )_{i,j=1}^{N}, k(θ, Θ) = ( k(θ, θ¹), …, k(θ, θᴺ) ), and σ is the standard deviation of the observational noise. The gradient of the posterior mean is well-defined and given by

∇μ(θ) = ∇_θ k(θ, Θ) (K + σ² I)^{−1} v.

The particles in the ensemble are then evolved forward according to overdamped Langevin dynamics, i.e.

(27) dθⁱ_t = −∇μ(θⁱ_t) dt + √2 dWⁱ_t.

The three hyperparameters (τ, λ, σ) should be chosen to reflect the spread and local variation in the data. As the conditioning points are evolved over time, these parameters should also be adjusted accordingly. We impose log-normal priors on the amplitude τ and the observation noise standard deviation σ, and a Gamma prior on the lengthscale λ. These prior modeling choices on the hyper-parameters ensure that the posterior mean does not introduce any short-scale variations below the levels of the available data [cleary2020calibrate, gelman2013bayesian, gelman2017prior]. As is standard in the training of GPs, we center and rescale the training data to have mean zero and variance one. To select the hyperparameters we employ an empirical Bayesian approach: we compute the maximum a-posteriori values of the hyperparameters after marginalising out f. This entails selecting (τ, λ, σ) which maximise the log marginal posterior,

L(τ, λ, σ) = −½ vᵀ (K + σ² I)^{−1} v − ½ log det(K + σ² I) + log p(τ, λ, σ),

where p(τ, λ, σ) denotes the prior density over the hyperparameters.

In simulations we employ an Euler-Maruyama discretisation of (27), coupled with a gradient descent scheme for adaptively selecting the hyperparameters. Let Θ_n = (θ¹_n, …, θᴶ_n) denote the particle ensemble at time-step n. The algorithm for evolving the particles forward to time-step n+1 is summarised as follows:

  • For i = 1, …, J:

    • Set θⁱ_{n+1} = θⁱ_n − h ∇μ_n(θⁱ_n) + √(2h) ξⁱ_n, where ξⁱ_n ∼ N(0, I) iid.

  • Update the hyperparameters by a gradient descent step on the negative log marginal posterior: (τ, λ, σ) ← (τ, λ, σ) + h′ ∇L(τ, λ, σ).

In the above, h and h′ are step-sizes for the Langevin updates and the hyperparameter gradient descent, respectively, and μ_n denotes the GP posterior mean conditioned on the current evaluations.

If we are in a situation where evaluating the likelihood is computationally intensive, then we may consider a straightforward modification of these dynamics where time is split into a fixed set of epochs where we keep the same conditioning points within the same epoch, performing several steps of Langevin updates and hyperparameter tuning based on the same conditioning points. This permits effective exploration of the posterior distribution but with a fixed number of log-likelihood evaluations.
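One epoch of these dynamics, under simplifying assumptions of ours, might look as follows: the hyperparameters are held fixed within the epoch rather than updated by gradient descent, and the conditioning points are reused across the inner Langevin steps, as described above. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def rbf(X, Y, amp, ell):
    """Gaussian RBF kernel k(x, y) = amp^2 exp(-|x - y|^2 / (2 ell^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return amp ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def grad_gp_mean(x, X, alpha, amp, ell):
    """Gradient of the GP posterior mean mu(x) = sum_n alpha_n k(x, X_n)."""
    k = rbf(x[None, :], X, amp, ell)[0]
    return ((X - x) / ell ** 2 * (k * alpha)[:, None]).sum(axis=0)

def gps_epoch(thetas, V_noisy, dt, n_inner, amp, ell, sigma, rng):
    """One epoch of the GPS: condition a GP on noisy potential evaluations at
    the current ensemble, then take n_inner Langevin steps (27) driven by the
    gradient of the GP posterior mean, reusing the same conditioning points."""
    J, d = thetas.shape
    X = thetas.copy()                              # conditioning (design) points
    v = np.array([V_noisy(t) for t in X])          # noisy potential values
    v = (v - v.mean()) / v.std()                   # centre and rescale, as in the text
    K = rbf(X, X, amp, ell) + sigma ** 2 * np.eye(J)
    alpha = np.linalg.solve(K, v)                  # GP regression weights
    for _ in range(n_inner):
        grads = np.array([grad_gp_mean(t, X, alpha, amp, ell) for t in thetas])
        thetas = thetas - dt * grads + np.sqrt(2 * dt) * rng.standard_normal((J, d))
    return thetas
```

Note that each epoch costs exactly J evaluations of the potential, regardless of n_inner, which is the point of the epoch construction when likelihood evaluations are expensive.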

5 Numerical Results

We conclude by illustrating the performance of the different methods on three numerical examples. In particular we compare the Ensemble Kalman Sampler (EKS) introduced in Section 2 with the Ensemble Langevin Sampler (ELS) discussed in Section 3.1, along with the Gaussian process sampler (GPS) detailed in Section 4. Note that we do not present numerical experiments for the Noisy Stein Variational Gradient Descent explored in Section 3.1, as its performance was found to be consistently inferior to the (already poorly performing) ELS in the context of the multiscale problems considered here. Our results show the desirable behaviour of the EKS with respect to its ability to avoid the rapid fluctuations imposed on the smooth parametric structure of interest and rapidly converge to the desired smooth posterior; they show the undesirable slowing down of the ELS, but do not illustrate the modified limiting posterior, as the slow performance of the ELS means that this limit is not reached in a reasonable number of iterations; and they demonstrate that the GPS has the same quality of performance as the EKS, with a further improved rate of convergence.

5.1 A linear model

As a first pedagogical example we consider a multiscale inverse problem of the form (1) for a forward map of the form (4):

(28) G^ε(θ) = Aθ + G₁(θ, θ/ε),

where A is a linear map and G₁ is a rapidly oscillating periodic perturbation. The objective is to recover the posterior distribution associated with the “slowly-varying” component of the forward model, G₀(θ) = Aθ, based on evaluations of the multiscale forward map G^ε. To this end, we generate observed data y for a true value θ† of θ, with observational covariance Γ. We impose a Gaussian prior on the unknown parameter θ with zero mean and covariance Σ₀. In the absence of any multiscale perturbations in the forward model, the resulting posterior would be Gaussian with mean and covariance given by

m_post = Σ_post Aᵀ Γ^{−1} y, Σ_post = ( Aᵀ Γ^{−1} A + Σ₀^{−1} )^{−1},

respectively. For a small value of ε, each of the presented methods is used to evolve an ensemble of J particles from the same initial distribution over a fixed number of time-units. The step-size employed for each method is selected differently to ensure stability.
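The Gaussian reference posterior which the samplers should recover can be computed directly; in the snippet below the values of A, Γ, Σ₀ and y are hypothetical placeholders, chosen only to make the example runnable.

```python
import numpy as np

# Reference Gaussian posterior for the linear model; A, Gamma, Sigma0 and y
# are placeholder values, not those used in the experiments.
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])
Gamma_inv = np.eye(2) / 0.1 ** 2      # Gamma = (0.1)^2 I
Sigma0_inv = np.eye(2)                # prior covariance Sigma0 = I
y = np.array([1.2, -0.3])

# Sigma_post = (A^T Gamma^{-1} A + Sigma0^{-1})^{-1}
# m_post     = Sigma_post A^T Gamma^{-1} y
Sigma_post = np.linalg.inv(A.T @ Gamma_inv @ A + Sigma0_inv)
m_post = Sigma_post @ (A.T @ Gamma_inv @ y)
print("posterior mean:", m_post)
print("posterior covariance:\n", Sigma_post)
```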
Figure 2 shows the particle ensemble at the final time (blue), the true solution (red dot), as well as the posterior (6) associated with the slowly varying part of the forward model (black contour lines). We observe that particles get stuck in the many local minima for the Ensemble Langevin Sampler in Figure 2(a). This is consistent with the formal theoretical result highlighted in Section 3.2, which indicates that the multiscale perturbation will have a significant influence on the target distribution. This is not the case for the EKS and GPS, which are able to recover the slowly varying target distribution correctly; see Figure 2(b) and Figure 2(c) respectively. Figure 2(d) shows the negative log-likelihood averaged across all the particles in the ensemble. One observes that both the EKS and GPS are able to move towards the mode of the slowly-varying target distribution, with the GPS able to converge faster due to the use of the approximate gradient. On the other hand, the Langevin sampler is strongly influenced by the multiscale perturbations and, by the end of the simulation, has remained distant from the posterior distribution. While the GPS does converge the fastest, we note that its performance appears to be sensitive to the initial selection of hyper-parameter values and step-sizes, which were set through a preliminary tuning phase. On the other hand, the EKS is surprisingly robust to the choice of step-size.

(a) Ensemble Langevin Sampler
(b) Ensemble Kalman Sampler
(c) Ensemble GP Sampler
(d) Negative Log-Likelihood
Figure 2: Plot of the particle ensemble for each of the three processes at the final simulation time. The contour plot indicates the posterior distribution for the “slowly-varying” forward model. The red dot denotes the truth. Note the different scaling of the axes in plots (a-c). Plot (d) shows the evolution of the negative log-likelihood as a function of time-step.

5.2 Multi-modal posteriors

It is well-known that multi-modal posteriors pose a significant challenge for ensemble Kalman-based approaches, since they are constructed on the assumption that the posterior is Gaussian. In the following we illustrate the flexibility of the GPS by considering an inverse problem of the form (1) for the unknown parameter θ, with a multiscale forward map of the form (4) in which the slowly-varying component G₀ is non-injective and the perturbation G₁ is sinusoidal. To demonstrate the three proposed methods we generate observation data y for a true parameter value θ†, and we impose a Gaussian prior on the unknown parameter θ. As the slowly-varying component of the forward map is non-injective, the associated posterior density will exhibit four global modes. The ELS, EKS and GPS were each simulated with an ensemble of J particles, starting from the same spread-out initial distribution. Note that a significantly smaller step-size was selected for the Langevin sampler to ensure stability of the process. We plot the final ensemble in Figure 3. As before, we observe that the ELS struggles to explore the large-scale features of the posterior, in this case remaining concentrated on a single mode. The effect of the multiscale perturbations can be clearly seen in the final-time ensemble, as the particle distribution appears ‘corrugated’ due to the influence of the sinusoidal component of the forward model. The EKS appears to be unaffected by the fine-scale structure in the forward model, but concentrates in a region at the centre of the posterior, reflecting the fact that the EKS assumes that the target distribution is Gaussian and thus concentrates around a single mode; with different initializations the EKS may instead concentrate on a single one of the four modes of the posterior, rather than a compromise between all of them. Finally, we observe that the GPS manages to effectively explore the large-scale structure of the posterior, sampling from all four modes of the distribution.

(a) ELS
(b) EKS
(c) GPS
Figure 3: Comparison of the three approaches after simulating the ensemble for the full simulation time, for the Bayesian inverse problem highlighted in Section 5.2. The contour plot indicates the posterior distribution associated with the slowly-varying forward map, while the points denote the ensemble at the final time.

5.3 The Lorenz ’63 model

We consider the 3-dimensional Lorenz equations [lorenz1963deterministic] as the underlying dynamical system in Example 1.1. The system is given by

(30a) dx/dt = σ(y − x),
(30b) dy/dt = x(r − z) − y,
(30c) dz/dt = xy − bz,

with parameters (σ, r, b). Data is generated by simulating (30) for the classical parameter set (σ, r, b) = (10, 28, 8/3), for which system (30) exhibits chaotic behavior. In the following we fix the parameter σ to its true value and focus on the inverse problem of identifying r and b from time-averaged data. In this setting, the forward model data corresponds to first and second order moments of the solution, computed over time windows of length T. This coincides with the setting of Example 1.1, where the function φ is chosen so that its time average returns the first and second moments of the state. This defines a multiscale forward map G_T for θ = (r, b). Specifically, G_T(θ) is a noisy approximation of the infinite-time average of the first and second moments, with rapid fluctuations in θ caused by the use of a finite-time average. Although the fine-scale fluctuations are not periodic, as considered in the mean-field analysis of the ELS and EKS, we shall demonstrate numerically that the general conclusions remain valid. To ensure that there is minimal correlation between subsequent evaluations of the forward map, we set the initial condition of (30), at each step of the sampling algorithm and for each ensemble member, to be the state of the dynamical system from the previous ODE solve, for the same ensemble member, evaluated at a large random time. Following [cleary2020calibrate, Section 5] we estimate the covariance Γ using long-time series data. Specifically, we use time-averages of φ evaluated over a long trajectory of (30) at the true parameter values; this trajectory is split into windows of size T, with an initial transient neglected. Given observation data y, the resulting negative log-likelihood function is given by

(31) Φ_T(θ) = ½ |Γ^{−1/2}( y − G_T(θ) )|².

Figure 4 shows the profile of Φ_T as a function of r, for a fixed value of b. We impose a multivariate log-normal prior on θ = (r, b), and we denote by V_T(θ) := Φ_T(θ) − log π₀(θ) the resulting negative log-posterior density.

Figure 4: Profile of the noisy negative log-likelihood Φ_T over r, for fixed b. The red line denotes the ‘true’ value of r.
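A sketch of the finite-time-average forward map G_T is given below. The integrator, step sizes and the exact content of φ (here, the first and second moments of each coordinate) are our own illustrative choices, not the paper's exact numerical setup.

```python
import numpy as np

def lorenz_rhs(z, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Right-hand side of the Lorenz '63 system (30)."""
    x, y, w = z
    return np.array([sigma * (y - x), x * (r - w) - y, x * y - b * w])

def G_T(theta, z0, T=10.0, dt=0.01, burn=1.0):
    """Finite-time-average forward map of Example 1.1 / Section 5.3.

    theta = (r, b), with sigma fixed at its true value. phi collects first
    and second moments of the state, so G_T(theta) is a noisy approximation
    of their infinite-time average. A fixed-step RK4 integrator is used."""
    r, b = theta
    z = np.array(z0, dtype=float)
    n_burn, n_avg = int(burn / dt), int(T / dt)
    acc = np.zeros(6)
    for i in range(n_burn + n_avg):
        k1 = lorenz_rhs(z, r=r, b=b)
        k2 = lorenz_rhs(z + 0.5 * dt * k1, r=r, b=b)
        k3 = lorenz_rhs(z + 0.5 * dt * k2, r=r, b=b)
        k4 = lorenz_rhs(z + dt * k3, r=r, b=b)
        z = z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        if i >= n_burn:
            acc += np.concatenate([z, z * z])   # phi(z) = (z, z*z)
    return acc / n_avg
```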

The EKS, ELS and GPS processes were all simulated for one algorithmic time unit, with the step-size adjusted to ensure stability of each process. Each log-likelihood evaluation involves integrating the ODE (30) for T physical time units. Each process was simulated with J particles, with initial conditions drawn from the prior. In Figure 5(a-c) we plot the particle ensemble at the final time for each method, overlaid with a contour plot of the negative log-posterior density. In Figure 5(d) we plot the negative log-likelihood function averaged across all particles in the ensemble for each process. It is clear from the plots in Figure 5 that the EKS and GPS have concentrated around the true value and are distributed according to a smoothed version of the posterior. On the other hand, the particles undergoing the ELS dynamics have remained trapped around the local minima of the multiscale posterior distribution, preventing the particles from concentrating in a similar fashion.

(a) Ensemble Langevin Sampler
(b) Ensemble Kalman Sampler
(c) Ensemble GP Sampler
(d) Negative Log-Likelihood
Figure 5: Comparison of the three approaches after simulating the ensemble for one unit of algorithmic time. The contour plot indicates the posterior distribution, while the dots denote the ensemble at the final time. Plot (d) shows the evolution of the negative log-likelihood as a function of time.

6 Conclusions

In this paper we discussed and analysed different ensemble methods for solving inverse problems with noisy and expensive likelihoods. A formal multiscale analysis approach was employed to characterise the influence of small-scale fluctuations on sampling when the objective is to explore the large-scale structure of the posterior distribution. Within this framework we contrasted the long-term behaviour of sampling schemes which use gradient information with that of schemes which are gradient-free, using the Ensemble Langevin Sampler (ELS) and the Ensemble Kalman Sampler (EKS) as specific examples.

Both the formal analysis and the computational experiments illustrate the robustness of the EKS to noisy perturbations of the forward model and demonstrate its ability to efficiently characterise the underlying large-scale structures of the resulting noisy posterior. This is not the case for Langevin methods, whose long-time behavior is significantly impacted by the noise: these methods do not identify the correct large-scale structure, and are also slowed down by the presence of small-scale structure. Motivated by the success of the EKS in this setting, we propose a new class of ensemble based methods, the so-called ensemble Gaussian process samplers (GPS), which are also robust to noisy perturbations of the forward model, but still employ gradient information to effectively explore the posterior distribution, without making any assumptions on the distribution of the posterior.

There are various interesting and challenging directions for future research. Time-averaged quantities for parameter estimation, as considered in Example 1.1, arise frequently in chaotic systems such as climate models. We expect that the methods proposed in this work would provide effective tools for quantifying uncertainty in such problems. In particular, it would be interesting to gauge the efficacy of the proposed GPS approach for the calibration and uncertainty quantification of a general circulation model, such as in [dunbar2020calibration].

While computational experiments have demonstrated the good performance of the GPS, it is evident that this method requires careful tuning of hyper-parameters, which is currently achieved using a preliminary tuning stage. Gaining an understanding of how to select these parameters based on the multiscale structure of the forward map will be important for further algorithmic development.

On the theoretical front, it would be of interest to make the presented formal multiscale arguments rigorous. This might prove challenging, as it would require bounds on the solution χ of the cell problem, a Poisson PDE which characterises the large-scale influence of small-scale perturbations. Any such analysis would require tight lower bounds on the eigenvalues of the empirical covariance process, uniformly over time. While the empirical covariance has been shown to be positive definite in [garbuno2020affine], obtaining quantitative lower bounds on its eigenvalues remains an open problem for future study. Another interesting problem is to characterise the long-time behaviour of the ensemble Gaussian process sampler. In particular, identifying conditions for stability and ergodicity, along with quantifying the asymptotic bias, are questions which we leave for future study.

References

Appendix A Multiscale Analysis For EKS

In this section we derive Formal Perturbation Result 2.2 concerning averaging for the mean field limit of the EKS. To carry out the analysis we extend the spatial domain of the mean field system from ℝ^d to ℝ^d × 𝕋^d, as is standard in the perturbation approach described in [bensoussan2011asymptotic, pavliotis2008multiscale]. The analysis will be streamlined by making the following definitions. For economy of notation we reuse the notation m(·), C(·), C^{θG}(·), now to denote functionals whose input domain is extended from densities on ℝ^d to densities on ℝ^d × 𝕋^d; this naturally generalizes the definitions of Section 2.

In the following, ρ(θ, z) denotes a probability density on ℝ^d × 𝕋^d and ρ̄(θ) := ∫_{𝕋^d} ρ(θ, z) dz denotes the corresponding marginal probability density on ℝ^d. Using this notation we define the extended moment functionals by integrating over both θ and z. Note that, in employing this notation, C^{θG}, viewed as a matrix-valued functional on densities, is extended from its definition in Section 2 to act on densities on ℝ^d × 𝕋^d. However, if the density ρ is constant in z then we recover the definition from Section 2; in this case the extended functional agrees with its Section 2 counterpart evaluated at the marginal ρ̄. We also define the following differential operators.