 # Convergence Acceleration of Ensemble Kalman Inversion in Nonlinear Settings

Many data-science problems can be formulated as an inverse problem, where the parameters are estimated by minimizing a proper loss function. When complicated black-box models are involved, derivative-free optimization tools are often needed. The ensemble Kalman filter (EnKF) is a particle-based derivative-free Bayesian algorithm originally designed for data assimilation. Recently, it has been applied to inverse problems for computational efficiency. The resulting algorithm, known as ensemble Kalman inversion (EKI), involves running an ensemble of particles with EnKF update rules so they can converge to a minimizer. In this article, we investigate EKI convergence in general nonlinear settings. To improve convergence speed and stability, we consider applying EKI with non-constant step-sizes and covariance inflation. We prove that EKI can hit critical points with finite steps in non-convex settings. We further prove that EKI converges to the global minimizer polynomially fast if the loss function is strongly convex. We verify the analysis presented with numerical experiments on two inverse problems.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

A crucial task of data science is to build mathematical models that can explain existing data, and use it to infer structural information. A general way to formulate this mathematically is through

 (1.1) y=G(u)+η.

From (1.1), that stands for parameters of interest we try to infer from some associated noisy data , where describes the physical laws that relate and . Finally models uncontrollable noises in the data collection process, which we assume here is an independent Gaussian noise, i.e. . Typical example includes the subsurface flow problem, where stands for the underground geological structure, is the pressure reading at different locations, and

involves solving a partial differential equation (PDE) called “Darcy’s law”. In most applications, the forward problem, that is finding

or with a given , is relatively straightforward. But the associated inverse problem, that is finding with a given , can be difficult.

To solve an inverse problem  given as (1.1), one standard approach is finding such that is closest to . Mathematically, this is equivalent to minimizing the data-misfit function

 (1.2) ℓo(u)=∥G(u)−y∥2Γ.

This is commonly referred to as the variational approach. Here and what follows, we use to denote the Mahalanobis norm of with weight matrix . Yet, this approach often leads to unphysical solutions that overfit the data, or there can be non-unique solutions. These issues can often be alleviated by incorporating physical information through regularization [7, 24]. One popular choice is Tikhonov regularization which introduces a preference matrix and weight parameter , where now we consider minimizing the loss function

 (1.3) ℓ(u)=∥G(u)−y∥2Γ+λ∥u∥2Σ.

In PDE applications, the matrix is often chosen as certain Laplacian operators to enforce smoothness on , and is set as a tuning parameter. Another way of solving the inverse problem (1.1) which negates computing the minimizer of (1.2) is the Bayesian approach . This approach characterizes the solution as the conditional distribution of given , i.e. , which is discussed in more detail in Section 2.1. The physical information can be incorporated by assuming a prior distribution for , such as . The variational and Bayesian approaches are closely related. In particular the minimizer of (1.3

) is also known as the maximum a posteriori probability estimator in Bayesian statistics.

In order to minimize (1.2) or (1.3), classical optimization methods, such as gradient descent and Gauss–Newton, require the gradient information of . However, this information can be computationally expensive if is high dimensional or if the model is complex. This is often the case for many modern-day big data applications. For example, in numerical weather prediction (NWP), represents the atmospheric and oceanic state on earth. Its dimension can exceed and describes the evolution of a fluid equation of multiple scales. In situations such as this, one would rather treat as a black-box, and apply optimization methods that are derivative-free .

The ensemble Kalman filter (EnKF) [20, 21] is a derivative-free algorithm designed for data assimilation problems [33, 48], which can be interpreted as an inverse problem where a sequence of interrelated parameters are to be recovered. It was originally derived as a Monte Carlo approximation of the Kalman filter , which is the Bayesian solution to (1.3) assuming is linear. Because its formulation is derivative-free, the EnKF can be executed efficiently, and hence has been widely applied for NWP problems [14, 25]. The application of the EnKF to the setting of Bayesian inverse problems goes back to . Since then, a wide development of work has been done on applying ensemble Kalman methods for inverse problems arising in PDEs. This was initiated by the works of Iglesias [26, 27] and has lead to numerous further directions [10, 11, 50]. We will refer the application of EnKF to inverse problems as ensemble Kalman inversion (EKI).

As a short description, EKI draws an initial ensemble from the prior distribution or a smoothed version of it, and repeatedly applies the EnKF to update the ensemble so it fits the data better. In the linear setting, the continuous time approximation of EKI eventually converges to the minimizer of the data-misfit function (1.2) . However, this minimizer in general does not contain prior information and may overfit the data. To avoid this issue, one approach was devised by  which incorporated iterative Levenburg–Marquardt regularization, taking motivation from an earlier work . Recently, another more direct approach is found by introducing an artificial observation in the EnKF step, so the ensemble converges instead to the minimizer of (1.3). This formulation is known as Tikhonov ensemble Kalman inversion . Our investigation will mostly focus on it, and for simplicity of notation, we will refer it as EKI in the discussion below.

Despite the empirical success of EKI in the references aforementioned, its behaviour as an optimizer for (1.3) is not well understood. Convergence results of EKI are available only for linear observations and the continuous-time limit of EKI iterates [10, 30, 32, 36, 41, 50]

. However, EKI algorithms in practice have to run at discrete time, and the observations are rarely linear. Moreover, recent machine learning research has shown that using non-constant step-size or learning rate can significantly improve the optimization algorithm results

[2, 19, 46, 47]. For EKI, using a non-constant step-size is related to incorporating covariance inflation , which are important tuning techniques for improving both the accuracy and stability in NWP [4, 5, 54, 40, 53]. Yet, these important features and connections can not be revealed if one investigates only the continuous-time limit.

This paper intends to fill these gaps by investigating EKI as a derivative-free optimization tool. Our contributions are highlighted through the following:

• We develop a new version of the Tikhonov EKI algorithm, where non-constant step-sizes and covariance inflation are applied. These modifications are essential to the algorithm performance both in theory and numerical tests.

• We compare the long time behavior of EKI with the Gauss–Newton method in a general nonlinear setting. Such comparison leads to an intuitive explanation why EKI can be used for optimization.

• Assuming a general nonlinear map , we show that EKI can converge to approximate critical points with finitely many iterations. If in addition the regularized loss function (1.3) is strongly convex, we show that EKI converges to the global minimum at a polynomial speed.

• Based on our convergence analysis, we provide guidelines on how to choose the step-size and covariance inflation in EKI. We implement the EKI on the Lorenz 96 model in 1D, and a nonlinear elliptic partial differential equation in 1D and 2D. Not only do we test the theory but also assess the performance of the method as an inverse problem solver against the standard Tikhonov EKI.

### 1.1. Notation and organization

The structure of this article is as follows. In Section 2 we provide an overview of the preliminary material required, reviewing the EnKF, while introducing our formulation of the inverse problem loss function (1.3). This leads to Section 3 where we state the main results while introducing the assumptions on the optimization and convergence analysis. Numerical verification of the results are shown in Section 4, while finally we conclude our findings and discuss potential areas of future work in Section 5. The appendix will contain the majority of proofs from Section 3.

Throughout the article we use and to denote norm and its corresponding inner product. For any arbitrary function, we will further denote its Jacobian and Hessian matrix as and . Given a matrix the -operator norm is defined as . Given two symmetric matrices and , we use to indicate the matrix is positive semidefinite. Given a covariance matrix the Mahalanobis norm is defined by .

## 2. Tikhonov ensemble Kalman inversion

In this section we review some of the key steps for deriving the Tikhonov EKI algorithm. We initiate with an overview of optimization with iterative Bayesian approaches, while discussing how to implement EKI with non-constant step-sizes and covariance inflations. For notation simplicity, we assume in our discussion. This does not sacrifice any generality, since we can always replace with otherwise.

### 2.1. Optimization by iterative Bayesian sampling

In the Bayesian inverse problem setting, one often assumes follows a prior , then given an observation from model (1.1), the posterior distribution is given by

 p1(u)=1√det(2πΓ)exp(−12∥G(u)−y∥2Γ)p0(u)∝exp(−12ℓo(u))p0(u).

Then suppose we use as a prior, and the data as an observation again, the next posterior distribution is given by

 p2(u)=1√det(2πΓ)exp(−12∥G(u)−y∥2Γ)p1(u)∝exp(−ℓo(u))p0(u).

If we iterate this procedure times, the resulting posterior is given by

 pN(u)∝exp(−N2ℓo(u))p0(u).

When is large enough, most of the probability mass of will concentrate on the minimum of . So if we can generate samples from , they are likely to be close to the minimizer of .

### 2.2. Regularized observation

As discussed in the introduction, the minimizer of the data-misfit function may be a nonphysical solution which overfits the data. It is often more desirable to minimize the regularized loss function (1.3) instead. In order to implement the iterative Bayesian sampling methods discussed in Section 2.1, we need to include the regularization term into the observation model. This is achieved by concatenating the real observation with a direct artificial observation with observation noise . We begin by extending (1.1) to the equations equationparentequation

 (2.1a) y =G(u)+η, (2.1b) u =ζ,

where

are independent random variables distributed as

and Define variables and mapping as follows,

 z=[y0],H(u)=[G(u)u],ξ=[ηζ].

Then note that

 ξ∼N(0,Γ+),Γ+=[Γ00Σ].

We can express our modified inverse problem as

 (2.2) z=H(u)+ξ.

Under this transformation, the regularized loss function in (1.3) can be express as the data-misfit function of (2.2):

 ℓ(u)=∥Γ−1/2(G(u)−y)∥2+∥Σ−1/2u∥2=∥(H(u)−z)∥2Γ+.

Moreover, if we apply an iterative Bayesian sampling method with observation from (2.2), the -th posterior is given by

 (2.3) pn+1(u)=1√det(2πΓ+)exp(−12∥H(u)−y∥2Γ+)pn(u)∝exp(−12(n+1)ℓ(u))p0(u).

To minimize , it suffices to sample for a large .

### 2.3. Kalman filter and ensemble formulation

When is Gaussian and is linear, the Kalman filter  provides explicit formulas for the posterior distribution in (2.3). In particular, is given by , where

 (2.4) Σn+1=Σn−ΣnHT(Γ++HΣnHT)−1HΣn,bn+1=Σn+1Σ−1nbn+Σn+1HTΓ−1+z.

Iterating the same formula, one can find the sequential distributions are all Gaussian, which implies the mean and covariance all have explicit forms.

In practical implementations, applying the Kalman filter can be difficult, as may be nonlinear, and inverting the associated matrices can be expensive if the underlying dimension is large. The EnKF algorithm is designed to overcome these two issues. It uses a group of particles

to represent the Gaussian distribution

, where the covariance matrices in (2.4) can be approximated by their sample versions.

In particular, we define

 mn=1KK∑i=1u(i)n,¯¯¯¯¯Hn=1KK∑i=1H(u(i)n).

and the sample covariances

 Cpun=1KK∑i=1(H(u(i)n)−¯¯¯¯¯Hn)⊗(u(i)n−mn),Cupn=(Cpun)T.

Suppose are i.i.d. samples from and , it is evident that , , , and are approximations of and . By inserting these approximations in (2.4), we have the next posterior distribution . There are in general two ways to update the particles such that their mean and covariance satisfy (2.4). The first way is directly updating the particles by

 (2.5) u(i)n+1 =u(i)n+Cupn(Cppn+Γ+)−1(z+ξ(i)n−H(u(i)n)),

where are i.i.d. samples from . With these artificial noises, one can show that on average, the mean and covariance of are approximately in (2.4). On the other hand, these artificial noises create fluctuation and instability. The second way is simply finding a group of particles such that their mean and covariance match the target formulas in (2.4). This leads to the mean update

 (2.6) mn+1=mn+Cupn(Cppn+Γ+)−1(z−H(mn)).

Then we seek a new ensemble centered at , so that

 Cuun+1=Cuun−Cupn(Γ++Cppn)−1Cpun.

This formulation is often called the ensemble square root filter [37, 52]. It is known to perform better than the particle formulation (2.5).

### 2.4. Non-constant step-size and covariance inflation

Recall that in the gradient descent (GD) algorithm, one generates a sequence of iterates to approach the minimum of . One way to update the iterate is by

 un+1=un−h∇ℓ(un),

where is often called the step-size. One can interpret GD as implementing Euler’s method for the gradient flow of an ordinary different equation. In the machine learning literature, it is shown that using a decreasing sequence may improve algorithm performance [2, 3], since it allows the algorithm to take larger steps at earlier iterations and explore more regions, while converging to a solution faster with smaller steps in later iterations.

One of the findings in this paper and some earlier works is that the EKI in the long run similar to the GD [23, 31]. One would naturally conjecture that implementing non-constant step-size may lead to improved optimization performance. In , it is shown that to implement EKI with a constant small step-size , we simply replace with . One can also reach such procedure by considering applying tempering techniques from sequential Monte Carlo when sampling (2.3) [15, 28]. Here we implement the same idea, except that we explore the possibility of using the step-size in place of . When , this is the same taking a constant step-size. As a result, the mean update formula is given by

 (2.7) mn+1=mn+Cupn(Cppn+h−1nΓ+)−1(z−H(mn)).

Unlike other machine learning algorithms such as stochastic gradient descend , the step-size for EKI does not need to decrease. This is because the movement made by (2.7) is closer toa Gauss–Newton type of algorithm, instead of a GD type of algorithm. Yet, the step-size parameter does control the final convergence speed. This will be clearer when we have more analysis results. Specifically Remark 3.4 will provide further details.

Aside from implementing the non-constant step-size, we will also apply additive covariance inflation [4, 5, 54] for the update formula. The resulting covariance update is given by

 (2.8) Cuun+1=Cuun−Cupn(Cppn+h−1nΓ+)−1Cpun+α2nΣ,

where is a sequence of positive parameters to be specified. In the literature of EnKF, covariance inflation is commonly applied for improved algorithm stability. It is equivalent to adding a stochastic noise in the particle formulation (2.5). Such operation is also known as rejuvenation in data assimilation, and it in general makes the associated Kalman filter system controllable.

## 3. Main results

In this section we state our analysis results regarding the convergence of Tikhonov EKI. We first aim to understand the behaviour of the ensemble, more specifically the ensemble covariance . We then compare the EKI update and Gauss–Newton update, where we show that their difference converges to zero. Finally we state results regarding both convergence towards local and global minimizers. The proofs of these results will be omitted from this section and are provided in the appendices.

### 3.1. Ensemble covariance collapse

The first step of our analysis involves understanding the ensemble configuration of EKI. For simplicity, we impose the following regularity assumption for the map .

###### Assumption 3.1.

has bounded first and second order derivatives. So there are constants and such that for all and

 ∥∇H(z)∥≤M1,∥H(z′)−H(z)∥≤M1∥z′−z∥,vT∇2H(z)v≤M2∥v∥2.
###### Theorem 3.2.

Under Assumption 3.1, suppose the EKI algorithm (2.7)-(2.8) is implemented with and , where the parameters satisfy

 γ−1≤β≤γ.

Then the sample covariance is bounded from above and below for all ,

 κmnγ−β−1Σ⪯Cuun⪯κMnγ−β−1Σwith constants κm,κM>0.

In the view of classical linear Kalman filter theory , this result indicates that the system is observable and controllable.

### 3.2. Connection with Gauss–Newton

In the Kalman filter literature, it is well known that the Kalman filter mean update (2.4) is closely related to the Gauss–Newton method . We show below that as an ensemble formulation of the Kalman filter, EKI inherits such a connection in the long run. This intuitively explains why EKI is an appropriate optimization tool.

As explained in Section 2.1 and 2.2, at the -th step, EKI is attempting to sample the posterior distribution (2.3). Given that is assumed to be , and that is assumed to be Gaussian, the mean of , , should be the minimizer of , which is proportional to

 (3.1) ℓn+1(u)=∥u−mn∥2Cuun+∥H(u)−z∥2Γ+h−1n.

Note that we have replaced with to implement our non-constant step-size. Since is of nonlinear-least-square form, given the current mean , the Gauss–Newton method indicates that should be , where

 Gn =[h−1n(Cuun)−1+JTnΓ−1+Jn]−1JTnΓ−1+(z−H(mn)) (3.2) =CuunJTn(JnCuunJTn+h−1nΓ+)−1(z−H(mn)),Jn:=∇H(mn).

Because , by Theorem 3.2 we can estimate

 (3.3) ∥Gn∥≤O(∥Cuun∥hn)=O(nγ−1).

Next, recall the mean movement from EKI (2.7) is given by

 (3.4) Δn:=mn+1−mn=Cupn(Cppn+h−1nΓ+)−1(z−H(mn)).

This is different from (3.2), however we can show their difference converges to zero. To see that, recall from Theorem 3.2, we find the ensemble covariance decreases to zero in a well controlled manner. In particular, the particles are very close to the mean when is large. This indicates the ensemble spread is very small. We can apply a first order approximation:

 H(u(i)n)≈H(mn)+JnΔu(i)n.

With this approximation, we find that

 Cupn≈CuunJTn,Cppn≈JnCuunJTn.

Applying these approximations to (3.4), we recover (3.2). More specifically, the difference between EKI mean update and Gauss–Newton update is bounded, as discussed in the following proposition.

###### Proposition 3.3.

Under the setting of Theorem 3.2, there is a constant , such that for sufficiently large the following bound holds:

 ∥Gn−Δn∥≤M3hnK∥Cuun∥32∥z−H(mn)∥.

Given the estimates in Theorem 3.2, the upper bound above is of order , which will converge to zero with large .

###### Remark 3.4.

Recall that in (3.3) we show the mean movement made by EKI is of order . So the difference between EKI and Gauss–Newton is of a lower order. Also note that the step-size parameter actually does not control EKI mean movement. Instead, it controls the speed of ensemble collapse as in Theorem 3.2, and consequentially the accuracy of EKI in approximating Gauss–Newton.

### 3.3. Iterative descent made by EKI

While Proposition 3.3 explains how the EKI iterates optimize a sequence of loss functions (3.1), it is unclear how the regularized loss function (1.3) is optimized in the process. is not necessarily the limit of , since is not a fixed point. One interesting fact is that by running a Gauss-Netwon type update for , the value of is also decreased at each step. To see this, we again apply the Taylor expansion

 H(mn+Gn)≈H(mn)+JnGn,

and find that

 ℓ(mn+Gn)=∥H(mn+Gn)−z∥2Γ+≈ℓ(mn)−2⟨Γ−1+(H(mn)−z),JnGn⟩.

The important observation here is that

 ⟨Γ−1+(H(mn)−z),JnGn⟩ =(JTnΓ−1+(z−H(mn)))T[(hnCuun)−1+JTnΓ−1+Jn]−1JTnΓ−1+(z−H(mn)) =∥JTnΓ−1+(z−H(mn))∥2(hnCuun)−1+JTnΓ−1+Jn≥0.

It is easy to check that

 JTnΓ−1+(z−H(mn))=−∇ℓ(mn).

Since Proposition 3.3 suggests that ,

 ℓ(mn+1)≈ℓ(mn)−2∥∇ℓ(mn)∥2(hnCuun)−1+JTnΓ−1+Jn.

The error of this approximation is given by the following.

###### Proposition 3.5.

Under the same setting as Proposition 3.3 we have the following estimate

 ℓ(mn+1)=ℓ(mn)−2∥∇ℓ(mn)∥2(hnCuun)−1+JTnΓ−1+Jn+Rn,

where the residual is bounded by

 |Rn|≤M4hn∥Cuun∥32max{∥z−H(mn)∥4,1}.

### 3.4. Convergence analysis

Classical analysis of optimization algorithms often focus on understanding the limiting behavior of the iterations. When the underlying loss function is strongly convex, there is a unique global minimum, so it is of interest to show the algorithms can converge to this minimizer with finite steps. Under non-convex settings, the global minimum can be non-unique, and it is more practical to ask whether the algorithm can converge to a critical point of the loss function.

The EnKF is known to have certain stability issues , in the sense the iterates in principle may diverge to infinity. With general observation functions, EKI can have the same phenomena. But this issue can often be fixed by modifying the algorithm, if we know proper solutions should be bounded by a known radius . Such information can often to be obtained from the physical background of the inverse problem. As a consequence, it is reasonable to modify the EKI algorithm so the particles are bounded. One simple way to achieve this is by modifying the observation map outside the radius . In particular, we have the proposition.

###### Proposition 3.6.

Suppose the observation map takes value when , there is a threshold iteration , so that the EKI sequence is bounded such that

 ∥mn∥≤max{2M+2+∥z∥,∥mn0∥}∀n≥n0.

The requirement that takes value when can be enforced for general observation function by producting it with a mollifier, for example we replace with

 ˜G(u)=G(u)exp(−C((M+1)2−∥u∥2)−1)where C is a large constant.

Proposition 3.6 indicates it is reasonable to assume the mean sequence is bounded. Then from the descend estimate in Proposition 3.5, we can show EKI will reach an approximate critical point with finite iterations.

###### Theorem 3.7.

Under the setting of Theorem 3.2, suppose that the EKI mean sequence is bounded and the parameter , then or any ,

 minn≤Nϵ{∥∇ℓ(mn)∥}≤ϵ.

The threshold iteration is given by

 Nϵ={exp(D/ϵ2)ifγ=0,(D/ϵ2)2min{2γ,β+1−γ+δ}if1>γ>0.

here is a constant independent of .

In addition if we assume the loss function is strongly convex, then we have the following theorem which establishes convergence to the global minimizer. The theorem also states that the convergence is attained at a polynomial rate.

###### Theorem 3.8.

Under the same setting as Theorem 3.2, suppose in addition that is strongly convex, so there is a

such that for any vectors

 ℓ(x)−ℓ(y)≥⟨∇ℓ(y),x−y⟩+λc∥x−y∥2.

Then there is a threshold iteration and constant so that the following estimates hold for any :

1. If , for any ,

 λc∥mN−u∗∥2≤ℓ(mN)−ℓ(u∗)≤DNα.

Here is given by Theorem 3.2 and

is the minimum eigenvalue of

.

2. If , for any ,

 λc∥mN−u∗∥2≤ℓ(mN)−ℓ(u∗)≤DNα.

## 4. Numerical results

In this section we present several experiments assessing the performance of EKI with additive covariance inflation and non-constant step-sizes. We test the inversion performance on both the Lorenz 96 model and a nonlinear partial differential equation (PDE) from the field of geosciences. Before describing in detail the experiments we present the different test models and their corresponding inverse problem.

### 4.1. Test models

#### 4.1.1. Lorenz 96 model

The first test problem is the Lorenz 96 (L96) model . The L96 model is a dynamical system designed to describe equatorial waves in atmospheric science. The L96 model takes the form

 (4.1) dvkdt=vk−1(vk+1−vk−2)−vk+F,k=1,…,N,v0=vN,vN+1=v1,v−1=vN−1.

In (4.1), denotes the current state of the system at the -th grid point. is a forcing constant with default value . The dimension is often chosen as , but other large numbers can also be used. The initial condition

 v(0)=(v1(0),…,vN(0))T,

of (4.1) is randomly sampled from the Gaussian approximation of the equilibrium measure of L96. We also generate the initial ensemble for EKI from the same Gaussian distribution. The associated inverse problem is the recovery of using 20 noisy partial measurements at time

 (4.2) yk=v2k−1(t)+ηk.

This inverse problem is a standard test problem for data assimilation. It is also a good testbed for high dimensional Bayesian computational methods, since the dimension can take arbitrary large values [35, 42].

#### 4.1.2. 1D Darcy’s law

The second test problem will be a nonlinear PDE motivated from geosciences referred to as Darcy’s law. Assume that in a domain we have a source field and a diffusion coefficient , referred to as the permeability, then the forward problem is concerned with solving , known as the pressure, in the PDE

 (4.3) −∇⋅(κ∇p) =f,x∈D, p =0,x∈∂D,

with Dirichlet boundary conditions. The inverse problem concerned with (4.3) is the recovery of the permeability from measurements of the pressure at 20 equidistance locations . The associated inverse problem is then defined as

 (4.4) yj=p(xj)+ηj,j=1,⋯,J,

where the are Gaussian noises, assumed independent, mean zero and joint covariance . By defining , we can rewrite (4.4) as the inverse problem

 y=G(u)+η,η∼N(0,Γ).

As a starter, we will consider a 1D Darcy flow problem, where the domain is given as . By testing on this toy problem, we can have a more direct observation of the recovery skill with different parameter setups, as shown later by Figure 3. The initialization of the ensemble is taken to be a mean-zero Gaussian with covariance function , which is a common covariance function used in the context of uncertainty quantification .

#### 4.1.3. 2D Darcy’s law

Our final test problem is Darcy’s law as stated through (4.3), but we consider in two dimension. As before we are interested in the recovery of the permeability in the domain where we specify equidistant pointwise observations. However one key difference in the 2D setting is the initialization of TEKI. We follow the setting described in . One way to simulate draws from a random function is through the series expansion

 (4.5) u=∑k∈Z2+√λkξkφk(x),ξk∼N(0,1),i.i.d..

(4.5) is known as the Karhunen-Loève (KL) expansion, where and

are the respective eigenfunctions and eigenvectors of the covariance operator

, defined as

 φk(x)=√2sin(π⟨k,x⟩),λk=(|k|2π2+τ2)−ν,k∈Z2+.

The corresponding KL expansion (4.5) satisfies the eigenvalue problem where our covariance operator is defined as

 C=(−△+τ2)−ν,

with denoting the Laplacian operator, such that denotes the regularity and denotes the inverse lengthscale. For the tests below, . For simplicity, we initiate EKI with KL basis vectors, in other words we let

 (4.6) u(k)(x)=φk(x),k=1,…,K.

This has been shown in  to work well in the context of TEKI. The truth for this test model is given in Figure 5. We specify the observation noise covariance as in the 1D Darcy flow experiment.

### 4.2. Parameter settings

To test our modifications of EKI, we monitor the effect of modifying the parameters and . Recall that the parameter controls the step-size , where Theorem 3.8 suggests that taking a nonzero can result in a faster, and improved, performance. The other parameter arises from the formula for covariance inflation,

 (4.7) α2n=α20h−10n2γ−β−2.

For Theorems 3.7 and 3.8 to apply, the parameters need to satisfy the following

 0<γ<1,γ−1≤β≤γ.

To see the effect of the non-constant step-size we also test a setup where resulting in a constant step-size. We set the initial step-size as and the initial inflation factor as . We will consider 10 different setups where we monitor (i) the relative error of the mean with respect to the truth , i.e.

 (4.8) ∥mN−m∗∥l2∥m∗∥l2,

and (ii) the relative error evaluated in the loss function , i.e. . These different setups are provided in Table 1.

The first five setups monitor the effect of while the last fiver monitor the parameter . We set the ensemble size fixed at for the first two models, and for the 2D model we set ensemble members. The effect of modifying and has been well documented in [11, 26] , thus we negate doing the same. Aside from analyzing the relative errors we also plot the reconstruction of the unknown for the first 5 setups of each test model. We set the iterative model to run for 23 iterations. We compare our results with the case of no additive covariance inflation, i.e. from the inflation formula (4.7), which we label as VTEKI (vanilla Tikhonov EKI). For all test experiments we place the regularization parameter as =2.

### 4.3. Inversion results

Figures 1, 3 and 6

demonstrate the reconstruction of the unknown for each model problem. To highlight the effect of the non-constant step-size and variance inflation we plotted the first 5 setups and the case of VTEKI. As seen from the Figures

1 and 3, as we increase we tend to get a better reconstruction which is closer to the truth. A more in-depth analysis of the effect of the step-size and covariance inflation is provided in Tables 2 - 7, where we consider the different setups. Throughout each model we notice that as we increase and decrease we see an improvement in learning the unknown.

This is verified further through Figures 2, 4 and 7 which show the relative error. For the error plots we see that the decay of the error is faster than the rate . We choose , which satisfies for the first 5 setups. Note that the error eventually plateaus. The reason for this is that we plot the error w.r.t. the truth, while the convergence result in Theorem 3.8 considers the error w.r.t. the minimizer. For the latter 5 setups, the error plots are similar, so we do not show them here.

Our experiments use different initial ensembles for the 1D problems hence we see different errors for the first iteration. This is to show that the method is robust under different initializations. We have conducted identical experiments which have the same initialization, and we see a similar performance with respect to each setup. Figure 1. Reconstruction of the true initial condition of the L96 model. Figure 2. Relative error and error in loss function of the L96 model. Figure 3. Reconstruction of the true initial condition of 1D Darcy flow. Figure 4. Relative error and error in loss function of 1D Darcy flow. Figure 5. Gaussian random field truth. Figure 6. Reconstruction of the truth for 2D Darcy flow. Top row. Left: Vanilla TEKI. Centre: EKI setup 1. Right : EKI setup 2. Bottom row. Left: EKI setup 3. Centre: EKI setup 4. Right: EKI setup 5. Figure 7. Relative error and error in loss function of 2D Darcy flow.

## 5. Conclusion

Ensemble Kaman inversion (EKI) is a derivate-free method used to solve inverse problems. As it formulated in a variational manner, a natural question to ask is how to accelerate its convergence. We aimed to answer this by considering a discrete time formulation of EKI and providing a convergence analysis in general nonlinear settings. We show that approximate stationary points are attainable in a non-convex setting while global minimizers are attainable in a strongly convex setting. A key insight in our work was the use of filtering techniques such as covariance inflation and adopting an ensemble square-root formulation, as well as ideas of non-constant step-sizes from machine learning. These results were highlighted through different numerical examples which validate the performance improvement of EKI. By using various inverse problems, we were able to see the consistency among experiments.

This work promotes various directions to take both in terms of algorithmic efficiency and also further theory for EKI. For the algorithmic efficiency one could consider understanding the relationship between EKI and common optimization techniques which include Nesterov acceleration 

, momentum and stochastic gradient descent

[13, 45], where initial work has been done in this in the context of machine learning [23, 31]. As our results hold for a nonlinear operator in discrete time, a natural direction is to to investigate similar properties in the noisy continuous-time setting, similar to that of . A final direction to consider is the adaptive learning of the regularization parameter , which would optimize the effect of the regularization.

## Acknowledgments

NKC acknowledges a Singapore Ministry of Education Academic Research Funds Tier 2 grant [MOE2016-T2-2-135]. The research of XTT is supported by the Singapore Ministry of Education Academic Research Funds Tier 1 grant R-146-000-292-114.

## Appendix A Convergence of the ensemble

The proof of Theorem 3.2 is decomposed into four lemmas below. Note that we can always normalize the problem by consider the transformation , so the prior covariance of is . Likewise, the observation model (1.1) can be transformed to

 ~y=Γ−1/2G(u)+Γ−1η=˜G(~u)+~η.

Therefore without loss of generality, we assume in most of the analysis below for simplicity.

###### Lemma A.1.

Let be the minimum eigenvalue of , then the sequence satisfies:

 ωn+1≥ωn+hn1+α2n(ωn+hn).
###### Proof.

Recall the sample covariance update rule is given by

 Cuun+1=Cuun−Cupn(h−1nΓ++Cppn)−1Cpun+α2nI.

To continue, we have a closer look at the cross covariance matrices and denote

 A=1KK∑i=1(G(u(i)n)−1KK∑i=1G(u(i)n))⊗(G(u(i)n)−1KK∑i=1G(u(i)n)),
 B=1KK∑i=1(u(i)n−mn)⊗(G(u(i)n)−1KK∑i=1G(u(i)n)),
 C=1KK∑i=1(u(i)n−mn)⊗(u(i)n−mn)=Cuun.

Then

 Cupn=[BC],Cppn=[ABTBC].

For a sufficiently large number , the following holds

 Cppn+h−1nΓ+=[A+h−1nΓBTBC+h−1nI]⪯[MIBTBC+h−1nI].

As a consequence, we can apply the block matrix inversion formula 

 [Cppn+h−1nΓ+]−1 ⪰[MIBTBC+h−1nI]−1 =⎡⎣1MI+1M2BT[C+h−1nI−1MBBT]−1B,−1MBT[C+h−1nI−1MBBT]−1−1M[C+h−1nI−1MBBT]−1B,[C+h−1nI−1MBBT]−1⎤⎦.

Since this holds for any sufficiently large , we can let , and find that

 [Cppn+h−1nΓ+]−1⪰[0,00,[C+h−1nI]−1].

Therefore

 Cupn[Cppn+h−1nΓ+]−1Cpun⪰C[C+h−1nI]−1C=Cuun[Cuun+h−1nI]−1Cuun.

So we have

 Cuun+1⪯Cuun−Cuun[Cuun+h−1nI]−1Cuun+α2nI.

Clearly shares the same eigenvectors with , while an eigenvalue for the former corresponds to an eigenvalue for the latter. From this, we find that if is the minimum eigenvalue of , then the minimum eigenvalue of will be bounded from below by

 ωn+1≥ωn+hn1+α2n(ωn+hn).

###### Lemma A.2.

With defined in Lemma A.1, let be the minimal eigenvalue of , it satisfies the following recursion:

 cn+1≥cnM2