1. Introduction
A crucial task of data science is to build mathematical models that can explain existing data and use them to infer structural information. A general way to formulate this mathematically is through
(1.1)  y = G(u) + η.
In (1.1), u stands for the parameters of interest that we try to infer from some associated noisy data y, while G describes the physical laws that relate u and y. Finally, η models uncontrollable noise in the data collection process, which we assume here to be independent Gaussian noise, i.e. η ∼ N(0, Γ). A typical example is the subsurface flow problem, where u stands for the underground geological structure, y is the pressure reading at different locations, and G
involves solving a partial differential equation (PDE) called "Darcy's law". In most applications, the forward problem, that is, finding y or G(u) with a given u, is relatively straightforward. But the associated inverse problem, that is, finding u with a given y, can be difficult. To solve an inverse problem [24] given as (1.1), one standard approach is finding u such that G(u) is closest to y. Mathematically, this is equivalent to minimizing the data-misfit function
(1.2)  Φ(u) = ½‖y − G(u)‖²_Γ.
This is commonly referred to as the variational approach. Here and in what follows, we use ‖v‖_A to denote the Mahalanobis norm of a vector v with weight matrix A. Yet, this approach often leads to unphysical solutions that overfit the data, or there can be nonunique solutions. These issues can often be alleviated by incorporating physical information through regularization [7, 24]. One popular choice is Tikhonov regularization, which introduces a preference matrix C and a weight parameter λ, where now we consider minimizing the loss function
(1.3)  Φ_R(u) = ½‖y − G(u)‖²_Γ + (λ/2)‖Cu‖².
In PDE applications, the matrix C is often chosen as a certain Laplacian operator to enforce smoothness on u, and λ is set as a tuning parameter. Another way of solving the inverse problem (1.1), which avoids computing the minimizer of (1.2), is the Bayesian approach [51]. This approach characterizes the solution as the conditional distribution of u given y, which is discussed in more detail in Section 2.1. The physical information can be incorporated by assuming a prior distribution for u. The variational and Bayesian approaches are closely related. In particular, the minimizer of (1.3) is also known as the maximum a posteriori probability estimator in Bayesian statistics.
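To make the two objectives concrete, the following sketch evaluates the data-misfit (1.2) and the Tikhonov-regularized loss (1.3) numerically. The function names and the exact weighting convention (misfit weighted by the inverse noise covariance, penalty of the form ½λ‖Cu‖²) are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def data_misfit(u, y, G, Gamma_inv):
    # 0.5 * ||y - G(u)||^2 measured in the Mahalanobis norm with weight Gamma
    r = y - G(u)
    return 0.5 * r @ Gamma_inv @ r

def tikhonov_loss(u, y, G, Gamma_inv, C, lam):
    # data misfit plus the Tikhonov penalty 0.5 * lam * ||C u||^2
    return data_misfit(u, y, G, Gamma_inv) + 0.5 * lam * np.sum((C @ u) ** 2)
```

Setting lam = 0 recovers the plain misfit; increasing lam pulls the minimizer toward parameters preferred by the matrix C.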
In order to minimize (1.2) or (1.3), classical optimization methods, such as gradient descent and Gauss–Newton, require gradient information of the loss function. However, this information can be computationally expensive to obtain if u is high dimensional or if the model is complex. This is often the case for many modern-day big data applications. For example, in numerical weather prediction (NWP), u represents the atmospheric and oceanic state on earth. Its dimension can be extremely large, and the forward map describes the evolution of a fluid system with multiple scales. In situations such as this, one would rather treat the forward map as a black box, and apply optimization methods that are derivative-free [49].
The ensemble Kalman filter (EnKF) [20, 21] is a derivative-free algorithm designed for data assimilation problems [33, 48], which can be interpreted as inverse problems where a sequence of interrelated parameters is to be recovered. It was originally derived as a Monte Carlo approximation of the Kalman filter [29], which is the Bayesian solution to (1.3) assuming the forward map is linear. Because its formulation is derivative-free, the EnKF can be executed efficiently, and hence it has been widely applied to NWP problems [14, 25]. The application of the EnKF to the setting of Bayesian inverse problems goes back to [44]. Since then, a wide body of work has developed on applying ensemble Kalman methods to inverse problems arising in PDEs. This was initiated by the works of Iglesias [26, 27] and has led to numerous further directions [10, 11, 50]. We will refer to the application of the EnKF to inverse problems as ensemble Kalman inversion (EKI).
As a short description, EKI draws an initial ensemble from the prior distribution or a smoothed version of it, and repeatedly applies the EnKF to update the ensemble so that it fits the data better. In the linear setting, the continuous-time approximation of EKI eventually converges to the minimizer of the data-misfit function (1.2) [50]. However, this minimizer in general does not contain prior information and may overfit the data. To avoid this issue, one approach was devised by [26], which incorporated iterative Levenberg–Marquardt regularization, taking motivation from an earlier work [24]. Recently, another more direct approach has been found by introducing an artificial observation in the EnKF step, so that the ensemble converges instead to the minimizer of (1.3). This formulation is known as Tikhonov ensemble Kalman inversion [12]. Our investigation will mostly focus on it, and for simplicity of notation, we will refer to it as EKI in the discussion below.
Despite the empirical success of EKI in the aforementioned references, its behaviour as an optimizer for (1.3) is not well understood. Convergence results for EKI are available only for linear observations and the continuous-time limit of the EKI iterates [10, 30, 32, 36, 41, 50]. However, EKI algorithms in practice have to run in discrete time, and the observations are rarely linear. Moreover, recent machine learning research has shown that using a nonconstant stepsize or learning rate can significantly improve optimization results [2, 19, 46, 47]. For EKI, using a nonconstant stepsize is related to incorporating covariance inflation [50], an important tuning technique for improving both accuracy and stability in NWP [4, 5, 40, 53, 54]. Yet, these important features and connections cannot be revealed if one investigates only the continuous-time limit. This paper intends to fill these gaps by investigating EKI as a derivative-free optimization tool. Our contributions are highlighted through the following:

We develop a new version of the Tikhonov EKI algorithm, where nonconstant stepsizes and covariance inflation are applied. These modifications are essential to the algorithm's performance, both in theory and in numerical tests.

We compare the long-time behavior of EKI with the Gauss–Newton method in a general nonlinear setting. Such a comparison leads to an intuitive explanation of why EKI can be used for optimization.

Assuming a general nonlinear map, we show that EKI can converge to approximate critical points within finitely many iterations. If, in addition, the regularized loss function (1.3) is strongly convex, we show that EKI converges to the global minimum at a polynomial speed.

Based on our convergence analysis, we provide guidelines on how to choose the stepsize and covariance inflation in EKI. We implement EKI on the Lorenz 96 model in 1D, and on a nonlinear elliptic partial differential equation in 1D and 2D. Not only do we test the theory, but we also assess the performance of the method as an inverse problem solver against the standard Tikhonov EKI.
1.1. Notation and organization
The structure of this article is as follows. In Section 2 we provide an overview of the required preliminary material, reviewing the EnKF, while introducing our formulation of the inverse problem loss function (1.3). This leads to Section 3, where we state the main results, while introducing the assumptions for the optimization and convergence analysis. Numerical verification of the results is shown in Section 4, and finally we conclude our findings and discuss potential areas of future work in Section 5. The appendix contains the majority of the proofs from Section 3.
Throughout the article we use ‖·‖ and ⟨·, ·⟩ to denote the Euclidean norm and its corresponding inner product. For an arbitrary function, we further denote its Jacobian and Hessian matrices as ∇f and ∇²f. Given a matrix A, the operator norm is defined as ‖A‖ = sup_{‖v‖=1} ‖Av‖. Given two symmetric matrices A and B, we use A ⪯ B to indicate that the matrix B − A is positive semidefinite. Given a covariance matrix Σ, the Mahalanobis norm is defined by ‖v‖_Σ = ‖Σ^{−1/2}v‖.
2. Tikhonov ensemble Kalman inversion
In this section we review some of the key steps for deriving the Tikhonov EKI algorithm. We begin with an overview of optimization via iterative Bayesian approaches, and then discuss how to implement EKI with nonconstant stepsizes and covariance inflation. For notational simplicity, we assume Γ = I in our discussion. This does not sacrifice any generality, since we can always replace G with Γ^{−1/2}G otherwise.
2.1. Optimization by iterative Bayesian sampling
In the Bayesian inverse problem setting, one often assumes u follows a prior distribution p₀(u); then, given an observation y from model (1.1), the posterior distribution is given by
p₁(u | y) ∝ p₀(u) exp(−½‖y − G(u)‖²).
Then, suppose we use p₁ as a prior, and the data y as an observation again; the next posterior distribution is given by
p₂(u | y) ∝ p₁(u | y) exp(−½‖y − G(u)‖²) ∝ p₀(u) exp(−‖y − G(u)‖²).
If we iterate this procedure n times, the resulting posterior is given by
p_n(u | y) ∝ p₀(u) exp(−(n/2)‖y − G(u)‖²).
When n is large enough, most of the probability mass of p_n will concentrate near the minimum of the data misfit. So if we can generate samples from p_n, they are likely to be close to the minimizer.
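The iteration above can be checked in a scalar Gaussian toy model, where each reuse of the data is a conjugate update. The setup below (prior N(m0, v0), observation model y = u + noise with noise variance s) is a hypothetical illustration, not the paper's general vector setting.

```python
def iterated_posterior(m0, v0, y, s, n):
    # Re-assimilate the same observation y for n rounds; after n rounds the
    # posterior matches a single Bayesian update with noise variance s / n.
    m, v = m0, v0
    for _ in range(n):
        k = v / (v + s)       # scalar Kalman gain
        m = m + k * (y - m)   # posterior mean
        v = (1.0 - k) * v     # posterior variance
    return m, v
```

As n grows, the variance shrinks like 1/n and the mean concentrates on the misfit minimizer, here the data value itself.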
2.2. Regularized observation
As discussed in the introduction, the minimizer of the data-misfit function may be a nonphysical solution which overfits the data. It is often more desirable to minimize the regularized loss function (1.3) instead. In order to implement the iterative Bayesian sampling method discussed in Section 2.1, we need to include the regularization term in the observation model. This is achieved by concatenating the real observation y with a direct artificial observation of u, which carries its own observation noise. We begin by extending (1.1) to the pair of equations
(2.1a)  y = G(u) + η,
(2.1b)  0 = Cu + η′,
where η and η′ are independent random variables distributed as
η ∼ N(0, I),  η′ ∼ N(0, λ⁻¹I).
Define the variable z and the mapping F as follows: z = [y; 0] and F(u) = [G(u); Cu]. Then note that the concatenated noise ζ = [η; η′] is distributed as N(0, Γ_λ), with Γ_λ = diag(I, λ⁻¹I). We can express our modified inverse problem as
(2.2)  z = F(u) + ζ.
Under this transformation, the regularized loss function in (1.3) can be expressed as the data-misfit function of (2.2):
½‖z − F(u)‖²_{Γ_λ} = ½‖y − G(u)‖² + (λ/2)‖Cu‖².
Moreover, if we apply an iterative Bayesian sampling method with the observation z from (2.2), the n-th posterior is given by
(2.3)  p_n(u | z) ∝ exp(−(n/2)‖z − F(u)‖²_{Γ_λ}),
where z, F and Γ_λ are the extended data, map and noise covariance from (2.2). To minimize the regularized loss, it suffices to sample p_n for a large n.
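A small helper makes the concatenation explicit: it stacks the data with a zero artificial observation, and the map with a scaled regularization term. The √λ scaling is an illustrative convention chosen so that the plain misfit of the extended system equals the regularized loss.

```python
import numpy as np

def extend_problem(G, C, y, lam):
    # z = [y; 0] and F(u) = [G(u); sqrt(lam) * C u], so that
    # 0.5*||z - F(u)||^2 = 0.5*||y - G(u)||^2 + 0.5*lam*||C u||^2
    z = np.concatenate([y, np.zeros(C.shape[0])])
    def F(u):
        return np.concatenate([G(u), np.sqrt(lam) * (C @ u)])
    return z, F
```

Any least-squares solver applied to the pair (z, F) then minimizes the Tikhonov-regularized loss automatically.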
2.3. Kalman filter and ensemble formulation
When the prior is Gaussian and F is linear, the Kalman filter [29] provides explicit formulas for the posterior distributions in (2.3). In particular, starting from a Gaussian N(m₀, C₀), the next posterior is given by N(m₁, C₁), where
(2.4)  m₁ = m₀ + C₀Fᵀ(FC₀Fᵀ + Γ)⁻¹(z − Fm₀),  C₁ = C₀ − C₀Fᵀ(FC₀Fᵀ + Γ)⁻¹FC₀,
with Γ denoting the observation noise covariance.
Iterating the same formula, one finds that the subsequent distributions are all Gaussian, which implies the means and covariances all have explicit forms.
In practical implementations, applying the Kalman filter can be difficult, as F may be nonlinear, and inverting the associated matrices can be expensive if the underlying dimension is large. The EnKF algorithm is designed to overcome these two issues. It uses a group of particles u^{(1)}, …, u^{(J)} to represent the Gaussian distribution, where the covariance matrices in (2.4) can be approximated by their sample versions. In particular, we define the sample means
ū = (1/J) ∑_{j=1}^J u^{(j)},  F̄ = (1/J) ∑_{j=1}^J F(u^{(j)}),
and the sample covariances
Ĉ^{uF} = (1/J) ∑_{j=1}^J (u^{(j)} − ū)(F(u^{(j)}) − F̄)ᵀ,  Ĉ^{FF} = (1/J) ∑_{j=1}^J (F(u^{(j)}) − F̄)(F(u^{(j)}) − F̄)ᵀ.
Suppose the u^{(j)} are i.i.d. samples from the current Gaussian distribution; then the sample means and covariances above are approximations of their exact counterparts. By inserting these approximations into (2.4), we obtain an approximation of the next posterior distribution. There are in general two ways to update the particles such that their mean and covariance satisfy (2.4). The first way is directly updating the particles by
(2.5)  u^{(j)}_{new} = u^{(j)} + Ĉ^{uF}(Ĉ^{FF} + Γ)⁻¹(z + ζ^{(j)} − F(u^{(j)})),
where the ζ^{(j)} are i.i.d. samples from N(0, Γ). With these artificial noises, one can show that, on average, the mean and covariance of the updated ensemble approximately follow (2.4). On the other hand, these artificial noises create fluctuation and instability. The second way is simply finding a group of particles such that their mean and covariance match the target formulas in (2.4). This leads to the mean update
(2.6)  m_{new} = ū + Ĉ^{uF}(Ĉ^{FF} + Γ)⁻¹(z − F̄).
Then we seek a new ensemble centered at this updated mean, so that its sample covariance matches the target covariance in (2.4). This formulation is often called the ensemble square root filter [37, 52]. It is known to perform better than the particle formulation (2.5).
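For concreteness, here is a sketch of the perturbed-observation particle update (2.5) for a generic map. The 1/(J−1) covariance normalization and the scalar noise level are simplifying assumptions; the square-root variant discussed above would replace the sampled noise with a deterministic transform of the ensemble.

```python
import numpy as np

def enkf_update(ens, F, z, gamma, rng):
    # ens: (J, d) ensemble; F: observation map; z: data; gamma: noise variance
    J = ens.shape[0]
    Fe = np.array([F(u) for u in ens])             # forward ensemble, (J, m)
    um, Fm = ens.mean(axis=0), Fe.mean(axis=0)
    Cuf = (ens - um).T @ (Fe - Fm) / (J - 1)       # cross covariance
    Cff = (Fe - Fm).T @ (Fe - Fm) / (J - 1)        # output covariance
    K = Cuf @ np.linalg.inv(Cff + gamma * np.eye(len(z)))  # Kalman gain
    noise = rng.normal(0.0, np.sqrt(gamma), size=(J, len(z)))
    return ens + (z + noise - Fe) @ K.T            # perturbed-observation step
```

Note that no derivative of F is evaluated anywhere: the gain is built entirely from ensemble statistics, which is what makes the method derivative-free.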
2.4. Nonconstant stepsize and covariance inflation
Recall that in the gradient descent (GD) algorithm, one generates a sequence of iterates to approach the minimum of the loss function Φ. One way to update the iterate is by
u_{n+1} = u_n − h ∇Φ(u_n),
where h > 0 is often called the stepsize. One can interpret GD as implementing Euler's method for the gradient flow of an ordinary differential equation. In the machine learning literature, it has been shown that using a decreasing stepsize sequence may improve algorithm performance [2, 3], since it allows the algorithm to take larger steps at earlier iterations and explore more regions, while converging to a solution faster with smaller steps in later iterations.
One of the findings of this paper and some earlier works is that EKI in the long run behaves similarly to GD [23, 31]. One would naturally conjecture that implementing a nonconstant stepsize may lead to improved optimization performance. In [50], it is shown that to implement EKI with a constant small stepsize h, one simply replaces the noise covariance Γ with h⁻¹Γ. One can also arrive at such a procedure by applying tempering techniques from sequential Monte Carlo when sampling (2.3) [15, 28]. Here we implement the same idea, except that we explore the possibility of using an iteration-dependent stepsize h_n in place of the constant h. When the stepsize exponent is zero, this is the same as taking a constant stepsize. As a result, the mean update formula is given by
(2.7)  m_{n+1} = m_n + Ĉ^{uF}_n (Ĉ^{FF}_n + h_n⁻¹Γ)⁻¹ (z − F̄_n).
Unlike in other machine learning algorithms such as stochastic gradient descent [3], the stepsize for EKI does not need to decrease. This is because the movement made by (2.7) is closer to a Gauss–Newton type of algorithm than to a GD type of algorithm. Yet, the stepsize parameter does control the final convergence speed. This will become clearer once we have more analysis results; specifically, Remark 3.4 will provide further details.
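A polynomial schedule is one simple way to realize a nonconstant stepsize; the parametrization below is an illustrative assumption (the paper's exact schedule may differ). A zero exponent recovers the constant stepsize, and since the EKI stepsize need not decrease, a negative exponent gives a growing schedule.

```python
def stepsize_schedule(n, h0=1.0, beta=0.5):
    # h_n = h0 * (n + 1)**(-beta); beta = 0 gives a constant stepsize,
    # beta < 0 gives a growing one.
    return h0 * (n + 1) ** (-beta)
```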
Aside from implementing the nonconstant stepsize, we also apply additive covariance inflation [4, 5, 54] in the update formula. The resulting covariance update is given by
(2.8)  C_{n+1} = Ĉ_{n+1} + α_n I,
where Ĉ_{n+1} is the covariance produced by the Kalman update and {α_n} is a sequence of positive parameters to be specified. In the EnKF literature, covariance inflation is commonly applied to improve algorithm stability. It is equivalent to adding stochastic noise in the particle formulation (2.5). Such an operation is also known as rejuvenation in data assimilation, and in general it makes the associated Kalman filter system controllable.
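Additive inflation itself is a one-line operation on the ensemble covariance; the sketch below assumes the inflation is isotropic (a multiple of the identity), matching the additive form in (2.8).

```python
import numpy as np

def inflate_covariance(C, alpha_n):
    # Shift every eigenvalue of the covariance up by alpha_n > 0,
    # which keeps the ensemble from collapsing prematurely.
    return C + alpha_n * np.eye(C.shape[0])
```

Because every eigenvalue is lifted by alpha_n, the inflated covariance is always strictly positive definite, even if the raw sample covariance is rank deficient.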
3. Main results
In this section we state our analysis results regarding the convergence of Tikhonov EKI. We first aim to understand the behaviour of the ensemble, more specifically the ensemble covariance. We then compare the EKI update with the Gauss–Newton update, where we show that their difference converges to zero. Finally, we state results regarding convergence towards both local and global minimizers. The proofs of these results are omitted from this section and provided in the appendices.
3.1. Ensemble covariance collapse
The first step of our analysis involves understanding the ensemble configuration of EKI. For simplicity, we impose the following regularity assumption on the observation map.
Assumption 3.1.
has bounded first and second order derivatives. So there are constants and such that for all and
Theorem 3.2.
In view of classical linear Kalman filter theory [29], this result indicates that the system is observable and controllable.
3.2. Connection with Gauss–Newton
In the Kalman filter literature, it is well known that the Kalman filter mean update (2.4) is closely related to the Gauss–Newton method [1]. We show below that as an ensemble formulation of the Kalman filter, EKI inherits such a connection in the long run. This intuitively explains why EKI is an appropriate optimization tool.
As explained in Sections 2.1 and 2.2, at the n-th step, EKI attempts to sample the posterior distribution (2.3). Given that the previous distribution is assumed to be Gaussian, the mean of the next posterior should be the minimizer of its negative log-density, which is proportional to
(3.1) 
Note that we have replaced the noise covariance with its stepsize-scaled version to implement our nonconstant stepsize. Since (3.1) is of nonlinear-least-squares form, given the current mean, the Gauss–Newton method indicates that the next mean should be given by the following update:
(3.2) 
Because , by Theorem 3.2 we can estimate
(3.3) 
Next, recall that the mean movement of EKI (2.7) is given by
(3.4)  m_{n+1} = m_n + Ĉ^{uF}_n (Ĉ^{FF}_n + h_n⁻¹Γ)⁻¹ (z − F̄_n).
This is different from (3.2); however, we can show that their difference converges to zero. To see this, recall from Theorem 3.2 that the ensemble covariance decreases to zero in a well-controlled manner. In particular, the particles are very close to the mean when n is large. This indicates the ensemble spread is very small. We can apply a first-order approximation:
With this approximation, we find that
Applying these approximations to (3.4), we recover (3.2). More specifically, the difference between the EKI mean update and the Gauss–Newton update is bounded, as stated in the following proposition.
Proposition 3.3.
Remark 3.4.
Recall that in (3.3) we bounded the order of the mean movement made by EKI. The difference between EKI and Gauss–Newton is of a lower order. Also note that the stepsize parameter does not actually control the EKI mean movement. Instead, it controls the speed of ensemble collapse as in Theorem 3.2, and consequently the accuracy of EKI in approximating Gauss–Newton.
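For reference, a generic damped Gauss–Newton step for a least-squares objective 0.5‖z − F(u)‖² looks as follows; the small diagonal damping term is an assumption added for numerical invertibility and is not part of the idealized update discussed above.

```python
import numpy as np

def gauss_newton_step(u, F, Jac, z, damp=1e-8):
    # Linearize F at u and solve the resulting linear least-squares problem.
    r = z - F(u)
    J = Jac(u)
    H = J.T @ J + damp * np.eye(len(u))   # Gauss-Newton Hessian approximation
    return u + np.linalg.solve(H, J.T @ r)
```

For a linear map the step lands on the least-squares solution in one iteration, which is the behavior EKI approximates once the ensemble has collapsed.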
3.3. Iterative descent made by EKI
While Proposition 3.3 explains how the EKI iterates optimize a sequence of loss functions (3.1), it is unclear how the regularized loss function (1.3) is optimized in the process. The minimizer of (1.3) is not necessarily the limit of the iterates a priori, since it is not evidently a fixed point of the update. One interesting fact is that by running a Gauss–Newton-type update for (3.1), the value of (1.3) is also decreased at each step. To see this, we again apply the Taylor expansion
and find that
The important observation here is that
It is easy to check that
Since Proposition 3.3 suggests that ,
The error of this approximation is given by the following.
Proposition 3.5.
Under the same setting as Proposition 3.3 we have the following estimate
where the residual is bounded by
3.4. Convergence analysis
Classical analysis of optimization algorithms often focuses on understanding the limiting behavior of the iterates. When the underlying loss function is strongly convex, there is a unique global minimum, so it is of interest to show that the algorithm converges to this minimizer within finitely many steps. In nonconvex settings, the global minimum can be nonunique, and it is more practical to ask whether the algorithm converges to a critical point of the loss function.
The EnKF is known to have certain stability issues [54], in the sense that the iterates may in principle diverge to infinity. With general observation functions, EKI can exhibit the same phenomenon. But this issue can often be fixed by modifying the algorithm, if we know that proper solutions should be bounded by a known radius. Such information can often be obtained from the physical background of the inverse problem. As a consequence, it is reasonable to modify the EKI algorithm so that the particles are bounded. One simple way to achieve this is by modifying the observation map outside the radius [12]. In particular, we have the following proposition.
Proposition 3.6.
Suppose the observation map takes a constant value outside a ball of known radius; then there is a threshold iteration beyond which the EKI sequence remains bounded.
The requirement that the observation map be constant outside the radius can be enforced for a general observation function by multiplying it with a mollifier, i.e. replacing it with a mollified version.
Proposition 3.6 indicates that it is reasonable to assume the mean sequence is bounded. Then, from the descent estimate in Proposition 3.5, we can show EKI will reach an approximate critical point within finitely many iterations.
Theorem 3.7.
Under the setting of Theorem 3.2, suppose that the EKI mean sequence is bounded and the stepsize parameter is admissible; then for any prescribed accuracy level, the iterates reach an approximate critical point.
The threshold iteration is given by an explicit bound, where the constant involved is independent of the accuracy level.
If, in addition, we assume the loss function is strongly convex, then we have the following theorem, which establishes convergence to the global minimizer. The theorem also states that the convergence is attained at a polynomial rate.
4. Numerical results
In this section we present several experiments assessing the performance of EKI with additive covariance inflation and nonconstant stepsizes. We test the inversion performance on both the Lorenz 96 model and a nonlinear partial differential equation (PDE) from the field of geosciences. Before describing the experiments in detail, we present the different test models and their corresponding inverse problems.
4.1. Test models
4.1.1. Lorenz 96 model
The first test problem is the Lorenz 96 (L96) model [39]. The L96 model is a dynamical system designed to describe equatorial waves in atmospheric science. The L96 model takes the form
(4.1)  dx_i/dt = (x_{i+1} − x_{i−2}) x_{i−1} − x_i + F,  i = 1, …, d,
where the indices are interpreted cyclically (modulo d).
In (4.1), x_i denotes the current state of the system at the i-th grid point, and F is a forcing constant with default value F = 8. The dimension d is often chosen as 40, but other large values can also be used. The initial condition x(0) of (4.1) is randomly sampled from the Gaussian approximation of the equilibrium measure of L96. We also generate the initial ensemble for EKI from the same Gaussian distribution. The associated inverse problem is the recovery of x(0) using 20 noisy partial measurements at a fixed observation time:
(4.2) 
This inverse problem is a standard test problem for data assimilation. It is also a good testbed for high-dimensional Bayesian computational methods, since the dimension can take arbitrarily large values [35, 42].
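The right-hand side of (4.1) is straightforward to vectorize with cyclic shifts; the default forcing F = 8 and the equilibrium check below follow the standard L96 convention.

```python
import numpy as np

def l96_rhs(x, F=8.0):
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, indices cyclic
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F
```

The constant state x_i ≡ F is an (unstable) steady state, which gives a quick sanity check on the implementation.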
4.1.2. 1D Darcy’s law
The second test problem is a nonlinear PDE motivated from geosciences, referred to as Darcy's law. Assume that in a domain D we have a source field f and a diffusion coefficient a, referred to as the permeability; the forward problem is then concerned with solving for the pressure p in the PDE
(4.3)  −∇·(a(x)∇p(x)) = f(x),  x ∈ D,
with Dirichlet boundary conditions. The inverse problem associated with (4.3) is the recovery of the permeability a from measurements of the pressure p at 20 equidistant locations x_i. The associated inverse problem is then defined as
(4.4)  y_i = p(x_i) + η_i,  i = 1, …, 20,
where the η_i are independent mean-zero Gaussian noises with a prescribed joint covariance. By collecting the pointwise pressure values into a forward map, we can rewrite (4.4) as an inverse problem of the form (1.1).
As a starting point, we consider a 1D Darcy flow problem, where the domain is the unit interval. By testing on this toy problem, we can get a more direct view of the recovery skill under different parameter setups, as shown later in Figure 3. The initialization of the ensemble is taken to be a mean-zero Gaussian with a covariance function commonly used in the context of uncertainty quantification [38].
4.1.3. 2D Darcy’s law
Our final test problem is Darcy's law as stated in (4.3), but now considered in two dimensions. As before, we are interested in the recovery of the permeability in the domain, where we specify equidistant pointwise observations. However, one key difference in the 2D setting is the initialization of TEKI. We follow the setting described in [12]. One way to simulate draws from a random function is through the series expansion
(4.5) 
Equation (4.5) is known as the Karhunen–Loève (KL) expansion, which is built from the eigenfunctions and eigenvalues of the covariance operator. In our case the covariance operator is taken to be an inverse power of a shifted Laplacian,
with Δ denoting the Laplacian operator, where one parameter controls the regularity of the draws and another controls the inverse lengthscale. These parameters are kept fixed in the tests below. For simplicity, we initiate EKI with KL basis vectors; in other words, we let
(4.6) 
This has been shown in [12] to work well in the context of TEKI. The truth for this test model is given in Figure 5. We specify the observation noise covariance as in the 1D Darcy flow experiment.
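A draw from a truncated KL expansion of this Laplacian-type covariance can be sketched as follows on the unit interval, using sine eigenfunctions with eigenvalues (k²π² + τ²)^(−α); the parameter values and the 1D simplification are illustrative assumptions.

```python
import numpy as np

def kl_sample(n_grid, n_modes, tau=3.0, alpha=2.0, rng=None):
    # u(x) = sum_k sqrt(lambda_k) * xi_k * sqrt(2) * sin(k*pi*x), xi_k ~ N(0,1)
    if rng is None:
        rng = np.random.default_rng()
    x = np.linspace(0.0, 1.0, n_grid)
    u = np.zeros(n_grid)
    for k in range(1, n_modes + 1):
        lam_k = (k**2 * np.pi**2 + tau**2) ** (-alpha)
        u += np.sqrt(lam_k) * rng.standard_normal() * np.sqrt(2.0) * np.sin(k * np.pi * x)
    return x, u
```

Larger α yields smoother draws, and truncating at n_modes corresponds to initializing the ensemble in the span of the leading KL basis vectors, as in (4.6).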
4.2. Parameter settings
To test our modifications of EKI, we monitor the effect of two parameters: one controlling the stepsize schedule and one controlling the covariance inflation. Recall that the first parameter controls the stepsize, where Theorem 3.8 suggests that taking it nonzero can result in faster, improved performance. The other parameter arises from the formula for covariance inflation,
(4.7) 
For Theorems 3.7 and 3.8 to apply, the parameters need to satisfy a compatibility condition.
To see the effect of the nonconstant stepsize, we also test a setup with a zero stepsize exponent, resulting in a constant stepsize. We fix the initial stepsize and the initial inflation factor. We consider 10 different setups, where we monitor (i) the relative error of the mean with respect to the truth, i.e.
(4.8) 
and (ii) the relative error evaluated in the loss function , i.e. . These different setups are provided in Table 1.
Setup  stepsize param.  inflation param.  Setup  stepsize param.  inflation param.

1  0  0.9  6  0.2  0.2 
2  0.2  0.9  7  0.2  0.3 
3  0.4  0.9  8  0.2  0.5 
4  0.6  0.9  9  0.2  0.7 
5  0.8  0.9  10  0.2  0.9 
The first five setups monitor the effect of the stepsize parameter, while the last five monitor the inflation parameter. We fix the ensemble size for the first two models, and use a larger ensemble for the 2D model. The effect of modifying the remaining algorithmic parameters has been well documented in [11, 26], so we do not repeat those experiments. Aside from analyzing the relative errors, we also plot the reconstruction of the unknown for the first five setups of each test model. We run the iterative method for 23 iterations. We compare our results with the case of no additive covariance inflation, i.e. setting the inflation term in (4.7) to zero, which we label as VTEKI (vanilla Tikhonov EKI). For all test experiments we set the regularization parameter to 2.
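The error metric (4.8) reported in the tables is simply the relative ℓ² distance of the recovered mean to the truth; a minimal implementation:

```python
import numpy as np

def relative_error(u_mean, u_true):
    # relative l2 error of the recovered mean against the truth
    return np.linalg.norm(u_mean - u_true) / np.linalg.norm(u_true)
```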
4.3. Inversion results
The reconstruction figures demonstrate the recovery of the unknown for each model problem. To highlight the effect of the nonconstant stepsize and covariance inflation, we plot the first five setups and the case of VTEKI. As seen from Figures 1 and 3, as we increase the stepsize parameter we tend to get a better reconstruction, one which is closer to the truth. A more in-depth analysis of the effect of the stepsize and covariance inflation is provided in Tables 2-7, where we consider the different setups. Throughout each model we notice that increasing the stepsize parameter and decreasing the inflation parameter improves the learning of the unknown. This is verified further through Figures 2, 4 and 7, which show the relative error. From the error plots we see that the decay of the error is faster than the theoretical rate; we choose parameter values satisfying the conditions of the theory for the first five setups. Note that the error eventually plateaus. The reason for this is that we plot the error with respect to the truth, while the convergence result in Theorem 3.8 considers the error with respect to the minimizer. For the last five setups, the error plots are similar, so we do not show them here.
Our experiments use different initial ensembles for the 1D problems; hence we see different errors at the first iteration. This is to show that the method is robust under different initializations. We have conducted identical experiments with the same initialization, and we see similar performance for each setup.
Setup  Relative error  Loss function 

1  0.126  1.78 
2  0.076  1.65 
3  0.073  1.63 
4  0.065  1.56 
5  0.057  1.53 
VTEKI  0.183  2.11 
Setup  Relative error  Loss function 

6  0.065  1.50 
7  0.068  1.52 
8  0.073  1.57 
9  0.077  1.62 
10  0.081  1.68 
VTEKI  0.184  2.13 
Setup  Relative error  Loss function 

1  0.112  1.63 
2  0.058  1.50 
3  0.053  1.45 
4  0.047  1.42 
5  0.044  1.40 
VTEKI  0.161  1.87 
Setup  Relative error  Loss function 

6  0.044  1.37 
7  0.046  1.39 
8  0.049  1.42 
9  0.057  1.48 
10  0.060  1.53 
VTEKI  0.163  1.86 
Setup  Relative error  Loss function 

1  0.172  2.01 
2  0.164  1.89 
3  0.158  1.84 
4  0.154  1.81 
5  0.152  1.76 
VTEKI  0.208  2.06 
Setup  Relative error  Loss function 

6  0.154  1.64 
7  0.156  1.67 
8  0.161  1.73 
9  0.164  1.79 
10  0.167  1.89 
VTEKI  0.212  2.09 
5. Conclusion
Ensemble Kalman inversion (EKI) is a derivative-free method used to solve inverse problems. As it is formulated in a variational manner, a natural question to ask is how to accelerate its convergence. We aimed to answer this by considering a discrete-time formulation of EKI and providing a convergence analysis in general nonlinear settings. We showed that approximate stationary points are attainable in a nonconvex setting, while global minimizers are attainable in a strongly convex setting. A key insight in our work was the use of filtering techniques such as covariance inflation and an ensemble square-root formulation, as well as ideas of nonconstant stepsizes from machine learning. These results were highlighted through different numerical examples which validate the performance improvement of EKI. By using various inverse problems, we were able to see consistency among the experiments.
This work suggests various directions to take, both in terms of algorithmic efficiency and further theory for EKI. For algorithmic efficiency, one could consider understanding the relationship between EKI and common optimization techniques, including Nesterov acceleration [43], momentum and stochastic gradient descent [13, 45], where initial work has been done in the context of machine learning [23, 31]. As our results hold for a nonlinear operator in discrete time, a natural direction is to investigate similar properties in the noisy continuous-time setting, similar to that of [8]. A final direction to consider is the adaptive learning of the regularization parameter, which would optimize the effect of the regularization.
Acknowledgments
NKC acknowledges a Singapore Ministry of Education Academic Research Funds Tier 2 grant [MOE2016T22135]. The research of XTT is supported by the Singapore Ministry of Education Academic Research Funds Tier 1 grant R146000292114.
Appendix A Convergence of the ensemble
The proof of Theorem 3.2 is decomposed into the four lemmas below. Note that we can always normalize the problem by considering a linear change of variables, so that the prior covariance of u is the identity. Likewise, the observation model (1.1) can be transformed accordingly. Therefore, without loss of generality, we assume an identity prior covariance in most of the analysis below for simplicity.
Lemma A.1.
Let be the minimum eigenvalue of , then the sequence satisfies:
Proof.
Recall the sample covariance update rule is given by
To continue, we have a closer look at the cross covariance matrices and denote
Then
For a sufficiently large number , the following holds
As a consequence, we can apply the block matrix inversion formula [6]
Since this holds for any sufficiently large , we can let , and find that
Therefore
So we have
Clearly shares the same eigenvectors with , while an eigenvalue for the former corresponds to an eigenvalue for the latter. From this, we find that if is the minimum eigenvalue of , then the minimum eigenvalue of will be bounded from below by
∎
Lemma A.2.
With defined in Lemma A.1, let be the minimal eigenvalue of , it satisfies the following recursion: