1. Introduction
Inverse problems arise in almost all areas of application, e.g. in biology, engineering and environmental systems. The integration of data can substantially reduce the uncertainty in model-based predictions and is therefore indispensable in many applications. Advances in machine learning provide exciting potential to complement and enhance simulators for complex phenomena in the inverse setting. We propose a novel approach to inverse problems in which the forward problem is approximated by a neural network, i.e. the computationally intense forward problem is replaced by a neural network surrogate. However, a neural network used as a surrogate model needs to be trained beforehand in order to guarantee a good approximation of the forward problem for arbitrary inputs of the unknown coefficient. To reduce the cost of building the surrogate over the entire parameter space, we suggest solving the inverse problem and training the neural network simultaneously, in a one-shot fashion. From a computational point of view, this approach has the potential to reduce the overall cost significantly, as the neural network serves as a surrogate of the forward model only at the parameter estimated from the data and not over the entire parameter space. The ensemble Kalman inversion (EKI), a derivative-free optimizer, is applied to the resulting optimization problem.
1.1. Background and literature overview
The goal of computation is to solve the following inverse problem: Recover the unknown parameter in an abstract model
(1) 
from a finite number of observations of the state given by
(2) 
which might be subject to noise. We denote by , and Banach spaces. The operator describes the underlying model, typically a PDE or ODE in the applications of interest. The variable denotes the state of the model and is defined on a domain . The operator is the observation operator, mapping the state variables to the observations. Classical approaches to inverse problems are based on an optimization approach for the (regularized) data misfit, see e.g. [12, 25]. One-shot (all-at-once) approaches are well established in the context of PDE-constrained optimization (see e.g. [5]) and have recently also been introduced in the inverse setting [23, 24]. The idea is to solve the underlying model equation simultaneously with the optimality conditions. This is in contrast to so-called reduced or black-box methods, which formulate the minimization as an unconstrained optimization problem via the solution operator of (1). The connection of the optimization approach to the Bayesian setting via the maximum a posteriori estimate is well established in the finite-dimensional setting
[22] as well as in the infinite-dimensional setting for certain prior classes, see [1, 8, 9, 18]. We refer to [22, 37] for more details on the Bayesian approach to inverse problems. Recently, data-driven approaches have been introduced to the inverse setting to reduce the computational complexity in case of highly complex forward models and to improve models in case of limited understanding of the underlying process [2]. Neural networks in the inverse framework have been successfully applied in the case of parametric holomorphic forward models [19] and in the case of limited knowledge of the underlying model [29, 30, 31, 36, 38]. To train neural networks, gradient-based methods are typically used [16]. The ensemble Kalman inversion (EKI) (see e.g. [20, 32, 33]) has recently been applied as a gradient-free optimizer for the training of neural networks in [17, 26].

1.2. Our contribution
The goal of this work is twofold:

First, we formulate the inverse problem in a one-shot fashion and establish the connection to the Bayesian setting. In particular, the Bayesian viewpoint allows us to incorporate model uncertainty and gives a natural way to define the regularization parameters. In the case of an exact forward model, the vanishing-noise limit can be interpreted as a penalty method. This allows us to establish convergence of the one-shot formulation, viewed as an unconstrained optimization problem, to the corresponding (regularized) solution of the reduced optimization problem (with exact forward model). The (numerical approximation of the) forward problem is replaced by a neural network in the one-shot formulation, i.e. the neural network does not have to be trained in advance for all potential parameters.

Second, we show that the EKI can be used effectively to solve the resulting optimization problem. We provide a convergence analysis in the linear setting. To enhance the performance, we modify the algorithm, motivated by the continuous version of the EKI. Numerical experiments demonstrate the robustness of the proposed algorithm, also in the nonlinear setting.
1.3. Outline
In section 2, we discuss the optimization and Bayesian approaches to inverse problems, introduce the one-shot formulation and establish the connection to vanishing noise and penalty methods in the case of exact forward models. Section 3 gives an overview of the neural networks used to approximate the forward problem. The EKI as a derivative-free optimizer for the one-shot formulation is discussed and analyzed in section 4. Finally, numerical experiments demonstrating the effectiveness of the proposed approach are presented in section 5. We conclude in section 6 with an outlook on future work.
1.4. Notation
We denote by the Euclidean norm and by the corresponding inner product. For a given symmetric, positive definite matrix (s.p.d.) , the weighted norm is defined by and the weighted inner product by .
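As a concrete illustration, a minimal Python sketch of these weighted quantities; the weighting convention is our assumption (when the weight matrix is a covariance, the misfit is conventionally weighted by its inverse), since the displayed definitions were lost in extraction:

```python
import numpy as np

def weighted_norm(x, C):
    """Weighted norm sqrt(x^T C^{-1} x) for an s.p.d. matrix C.

    Assumption: the weight acts through the inverse of C, as is common
    when C is a noise or prior covariance."""
    return np.sqrt(x @ np.linalg.solve(C, x))

def weighted_inner(x, y, C):
    """Weighted inner product x^T C^{-1} y for an s.p.d. matrix C."""
    return x @ np.linalg.solve(C, y)
```

With `C` the identity, both reduce to the Euclidean norm and inner product.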
2. Problem Formulation
We introduce in the following the optimization and Bayesian approaches to an abstract inverse problem of the form (1)–(2).
Throughout this paper, we derive the methods and theoretical results under the assumption that and are finite-dimensional, i.e. we assume that the forward problem has been discretized by a suitable numerical scheme and that the parameter space is finite-dimensional as well, possibly after dimension truncation. Though most of the ideas and results can be generalized to the infinite-dimensional setting, we avoid the technicalities arising from the infinite-dimensional setting and focus on the discretized problem, i.e. , and .
2.1. Optimization approach to the inverse problem
The optimization approach leads to the following problem
(3)  
(4) 
i.e. we minimize the data misfit in a suitable weighted norm with s.p.d., given that the forward model is satisfied. Due to the ill-conditioning of the inverse problem, a regularization term on the unknown parameters is often introduced in order to stabilize the optimization, i.e. we consider
(5)  
(6) 
where the regularization is denoted by and the positive scalar is usually chosen according to prior knowledge on the unknown parameter . We will comment on the motivation of the regularization via the Bayesian approach in the following section. To do so, we first introduce the so-called reduced problems of (3)–(4) and (5)–(6), respectively. The forward model is typically a well-posed problem, in the sense that for each parameter , there exists a unique state such that in . Introducing the solution operator s.t. , we can reformulate the optimization problem (3)–(4) as an unconstrained optimization problem
(7) 
and
(8) 
respectively.
2.2. Bayesian approach to the inverse problem
Adopting the Bayesian approach to inverse problems, we view the unknown parameters as an -valued random variable with prior distribution . The noise in the observations is assumed to be additive and described by a random variable with s.p.d., i.e. (9)
Further, we assume that the noise is stochastically independent of . By Bayes' theorem, we obtain the posterior distribution
(10) 
the conditional distribution of the unknown given the observation .
2.3. Connection between the optimization and the Bayesian approach
The optimization approach leads to a point estimate of the unknown parameters, whereas the Bayesian approach computes the conditional distribution of the unknown parameters given the data, the posterior distribution, as the solution of the inverse problem. Since the computation or approximation of the posterior distribution is prohibitively expensive in many applications, one often resorts to point estimates such as the maximum a posteriori (MAP) estimate, the most likely point of the unknown parameters. Denoting by the Lebesgue density of the prior distribution, the MAP estimate is defined as
(11) 
Assuming a Gaussian prior distribution, i.e. , the MAP estimate is given by the solution of the following minimization problem
(12) 
The Gaussian prior assumption leads to a Tikhonovtype regularization in the objective function, whereas the first term in the objective function results from the Gaussian assumption on the noise. We refer the interested reader to [10, 22] for more details on the MAP estimate.
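Since the displayed equations did not survive extraction, we record for orientation one standard form of the minimization problem (12) consistent with the surrounding text; the symbol names (solution operator $S$, observation operator $\mathcal{O}$, noise covariance $\Gamma$, prior mean $u_0$ and prior covariance $C_0$) are our assumption, with the weighted norms of section 1.4 acting through the inverse covariances:

```latex
u_{\mathrm{MAP}}
\;\in\; \operatorname*{arg\,min}_{u}\;
\tfrac12 \bigl\| y - \mathcal{O}\bigl(S(u)\bigr) \bigr\|_{\Gamma}^{2}
\;+\; \tfrac12 \bigl\| u - u_0 \bigr\|_{C_0}^{2}
```

The first term is the (covariance-weighted) data misfit from the Gaussian noise model, the second the Tikhonov-type regularization induced by the Gaussian prior.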
2.4. Oneshot formulation for inverse problems
Black-box methods are based on the reduced formulation (7), assuming that the forward problem can be solved exactly in each iteration. Here, we will follow a different approach, which solves the forward and optimization problems simultaneously. Various names for the simultaneous solution of the design and state equations exist: one-shot method, all-at-once, piggyback iterations, etc. We refer the reader to [5] and the references therein for more details.
Following the oneshot ideas, we seek to solve the problem
(13) 
Due to the noise in the observations, we instead consider
(14) 
with normally distributed noise , s.p.d. Similarly, we assume that (15)
i.e. we assume that the model error can be described by , s.p.d. Thus, we obtain the problem
(16) 
The MAP estimate is then computed by the solution of the following minimization problem
(17) 
where and are regularizations of the parameter and the state , and .
2.5. Vanishing noise and penalty methods
In the case of an exact forward model, i.e. in case the forward equation is supposed to be satisfied exactly with , this can be modeled in the Bayesian setting by vanishing noise. In order to illustrate this idea, we consider a parametrized noise covariance model for and a given s.p.d. matrix . The limit for corresponds to the vanishing-noise setting and can be interpreted as reducing the uncertainty in our model. The MAP estimate in the one-shot framework is thus given by
(18) 
with . This form reveals the close connection to penalty methods, which attempt to solve constrained optimization problems such as (3)–(4) by sequentially solving unconstrained optimization problems of the form (18) for a sequence of monotonically increasing penalty parameters . The following well-known result on the convergence of the resulting algorithm can be found e.g. in [3].
Proposition 1.
Let the observation operator , the forward model and the regularization functions , be continuous and the feasible set be nonempty. For let denote a global minimizer of
(19) 
with strictly monotonically increasing and for . Then every accumulation point of the sequence is a global minimizer of
(20)  
(21)  s.t. 
This classical convergence result ensures the feasibility of the estimates, i.e. the proposed approach is able to incorporate and, in the limit, exactly satisfy physical constraints. We also mention the possibility of considering exact penalty terms in the objective, corresponding to different noise models in the Bayesian setting. This will be the subject of future work.
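The sequential minimization described in Proposition 1 can be sketched as a quadratic-penalty loop. The helper names and the use of SciPy's BFGS solver are illustrative choices, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def penalty_method(data_misfit, constraint, x0, betas):
    """Quadratic-penalty loop: for a strictly increasing sequence of
    penalty parameters beta_k -> infinity, minimize
        data_misfit(x) + beta_k/2 * ||constraint(x)||^2,
    warm-starting each solve at the previous minimizer (cf. Proposition 1,
    where accumulation points are minimizers of the constrained problem)."""
    x = np.asarray(x0, dtype=float)
    for beta in betas:
        obj = lambda z: data_misfit(z) + 0.5 * beta * np.sum(constraint(z) ** 2)
        x = minimize(obj, x, method="BFGS").x  # unconstrained inner solve
    return x
```

For a toy problem, minimizing `(x0 - 2)^2` subject to `x1 = x0` drives the iterates toward `(2, 2)` as the penalty parameter grows.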
This setting will be the starting point of the incorporation of neural networks into the problem. Instead of minimizing with respect to the state , we will approximate the solution of the forward problem by a neural network , where denote the parameters of the neural network to be learned within this framework. Thus, we obtain the corresponding minimization problem
(22) 
where denotes the state approximated by the neural network.
3. Neural Networks
In the following, we distinguish between the neural network architecture or neural network parameters, i.e. a set of weights and biases, and the neural network itself, which is the mapping obtained when an activation function is applied to the affine linear transformations defined by the neural network parameters:
The neural network architecture or the neural network parameters with input dimension and layers is a sequence of matrix-vector tuples where , with . Hence, the number of neurons in layer is given by and , i.e. and for . Given an activation function , we call the mapping defined by the recursion a neural network with activation function , where is understood to act componentwise on vector-valued inputs, i.e. for . In the following, the activation function is always the sigmoid function . We will therefore omit the superscript indicating the activation function, i.e. we abbreviate . In the context of approximating the solution of the forward problem , this definition of the neural network allows us to distinguish between the neural network evaluated at one specific input point, , and the neural network as a function of , denoted by . This class of neural networks is often called feedforward deep neural networks (DNNs); such networks have recently found success in forward [35] as well as Bayesian inverse [19] problems in the field of uncertainty quantification.
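A minimal sketch of this feedforward construction with sigmoid activation; the affine (activation-free) output layer is our assumption, since conventions for the last layer vary:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied componentwise."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, rng):
    """Neural network parameters: a sequence of (weight, bias) tuples,
    one per layer, mapping layer_sizes[k] -> layer_sizes[k+1] neurons."""
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Evaluate the network at one input point via the layer recursion:
    sigmoid of the affine map on hidden layers, affine map on the output
    layer (the latter is an assumption of this sketch)."""
    for W, b in params[:-1]:
        x = sigmoid(W @ x + b)
    W, b = params[-1]
    return W @ x + b
```

Here `forward(params, x)` corresponds to the network evaluated at a specific input point, while `lambda x: forward(params, x)` is the network as a function.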
Based on the approximation results for polynomials by feedforward DNNs [39], the authors in [35] derive bounds on the expression rate for multivariate, real-valued functions depending holomorphically on a sequence of parameters. More specifically, the authors consider functions that admit sparse Taylor gpc expansions, i.e. summable Taylor gpc coefficients. Such functions arise as response surfaces of parametric PDEs, or in a more general setting from parametric operator equations, see e.g. [34] and the references therein. Their main result is that these functions can be expressed with arbitrary accuracy (uniformly w.r.t. ) by DNNs of size bounded by with a constant independent of the dimension of the input data . Similar results for parametric PDEs can be found in [27].
3.1. Neural networks for inverse problems
The methods in [35] motivated the work [19], in which the authors show holomorphy of the data-to-QoI map for additive, centered Gaussian observation noise in Bayesian inverse problems. Using the fact that holomorphy implies fast convergence of Taylor expansions, the authors derive an exponential expression rate bound in terms of the overall network size.
Our approach differs from the ideas above in that we do not approximate the data-to-QoI map, but instead emulate the state itself by a DNN. Hence, in our method the input of the neural network is a point in the domain of the state, . The output of the neural network is the state at this point, , i.e. . By we denote the neural network as a function of , i.e. . In combination with a one-shot approach for the training of the neural network parameters, our method is more closely related to the physics-informed neural networks (PINNs) of [29, 30].
3.1.1. Connection to physicsinformed neural networks
In [29, 30] the authors consider PDEs of the form
where is a nonlinear differential operator parametrized by . The authors replace by a neural network and use automatic differentiation to construct the function . The neural network parameters are then obtained by minimizing , where
For the minimization, an L-BFGS method is used. The parameters of the differential operator become parameters of the neural network and can be learned by minimizing the MSE.
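The MSE objective described above (data misfit plus PDE residual) can be sketched for a 1D Poisson model problem. The cited works obtain the derivatives of the network by automatic differentiation; this dependency-free sketch substitutes a central finite difference, and all names are illustrative:

```python
import numpy as np

def pinn_style_loss(u, x_data, u_data, x_colloc, f, eps=1e-4):
    """PINN-style objective MSE_u + MSE_f for the model problem
    -u''(x) = f(x): mean squared data misfit at observation points plus
    mean squared PDE residual at collocation points.  Here u is any
    callable (e.g. a neural network as a function of x); the second
    derivative is approximated by a central finite difference instead of
    automatic differentiation to keep the sketch self-contained."""
    mse_u = np.mean([(u(x) - ud) ** 2 for x, ud in zip(x_data, u_data)])

    def residual(x):
        u_xx = (u(x + eps) - 2.0 * u(x) + u(x - eps)) / eps ** 2
        return -u_xx - f(x)

    mse_f = np.mean([residual(x) ** 2 for x in x_colloc])
    return mse_u + mse_f
```

For the exact solution of a consistent problem, both terms vanish up to discretization and rounding error.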
In [38] the authors consider so-called Bayesian neural networks (BNNs), where the neural network parameters are updated according to Bayes' theorem. Here, the initial distribution on the network parameters serves as the prior distribution. The likelihood requires the PDE solution, which is obtained by concatenating the Bayesian neural network with a physics-informed neural network; the authors call this construction Bayesian physics-informed neural networks (B-PINNs). For the estimation of the posterior distributions, they use the Hamiltonian Monte Carlo method and variational inference. In contrast to PINNs, the Bayesian framework allows them to quantify the aleatoric uncertainty associated with noisy data. In addition, their numerical experiments indicate that B-PINNs outperform PINNs in the case of large noise levels on the observations. In contrast, our proposed method is based on the MAP estimate and remains exact in the small-noise limit. We propose a derivative-free optimization method, the EKI, which shows promising results (also compared to quasi-Newton methods) without requiring derivatives w.r.t. the weights and design parameters.
4. Ensemble Kalman Filter for Neural Network Based Oneshot Inversion
The ensemble Kalman inversion (EKI) generalizes the well-known ensemble Kalman filter (EnKF), introduced by Evensen and coworkers in the data assimilation context [14], to the inverse setting; see [20] for more details. Since the Kalman filter involves a Gaussian approximation of the underlying posterior distribution, we focus on an iterative version based on tempering in order to reduce the linearization error. Recall the posterior distribution given by
for an abstract inverse problem
with mapping from the unknowns to the observations with , s.p.d. We define the intermediate measures
(23) 
by scaling the data misfit / likelihood by the step size , . The idea is to evolve the prior distribution into the posterior distribution by this sequence of intermediate measures and to apply the EnKF to the resulting artificial time dynamical system. Note that we account for the repeated use of the observations by amplifying the noise variance by in each step. The EKI then uses an ensemble of particles with to approximate the intermediate measures by (24)
with denoting the Dirac delta mass located at . The particles are transformed in each iteration by the application of the Kalman update formulas to the empirical mean and covariance in the form
(25) 
where denotes the Kalman gain and, for , the operators and given by
(26)  
(27)  
(28)  
(29) 
are the empirical covariances and empirical mean in the observation space. Since this update does not uniquely define the transformation of each particle to the next iteration , the specific choice of the transformation leads to different variants of the EKI. We will focus here on the generalization of the EnKF as introduced by [20] resulting in a mapping of the particles of the form
(30) 
where
The are i.i.d. random variables distributed according to with corresponding to the case of perturbed observations and to the unperturbed observations.
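One iteration of this particle update can be sketched as follows; the vectorization and variable names are illustrative, and the formulas follow the empirical-covariance Kalman update described above (with the noise covariance amplified by the inverse step size):

```python
import numpy as np

def eki_step(U, G, y, Gamma, h, rng=None):
    """One EKI iteration.

    U:     (d, J) array of J particles
    G:     forward response map acting on a single particle
    y:     observation vector
    Gamma: observation noise covariance (s.p.d.)
    h:     step size; the noise covariance is amplified by 1/h
    rng:   if given, perturbed observations are used; otherwise the
           unperturbed variant
    """
    J = U.shape[1]
    GU = np.column_stack([G(U[:, j]) for j in range(J)])
    dU = U - U.mean(axis=1, keepdims=True)
    dG = GU - GU.mean(axis=1, keepdims=True)
    Cug = dU @ dG.T / J                       # particle/observation cross-covariance
    Cgg = dG @ dG.T / J                       # empirical covariance in observation space
    K = Cug @ np.linalg.inv(Cgg + Gamma / h)  # Kalman gain
    if rng is not None:
        noise = rng.multivariate_normal(np.zeros(y.size), Gamma / h, size=J).T
        Y = y[:, None] + noise                # perturbed observations
    else:
        Y = np.tile(y[:, None], (1, J))       # unperturbed observations
    return U + K @ (Y - GU)
```

Iterating this map on a linear forward problem drives the ensemble mean toward the data, consistent with the optimization viewpoint adopted here.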
The motivation via the sequence of intermediate measures and the resulting artificial time allows us to derive the continuous-time limit of the iteration, which has been extensively studied in [4, 32, 33] to develop the analysis of the EKI in the linear setting. This limit arises by taking the parameter in (30) to zero, resulting in
(31) 
As shown in [13], the EKI does not in general converge to the true posterior distribution. Therefore, the analysis presented in [4, 32, 33] views the EKI as a derivativefree optimizer of the data misfit, which is also the viewpoint we adopt here.
4.0.1. Ensemble Kalman inversion for neural network based oneshot formulation
By approximating the state of the underlying PDE by a neural network, we seek to optimize both the unknown parameter and the parameters of the neural network . The idea is based on defining the function , where denotes the state approximated by the neural network and . This leads to the empirical summary statistics
and the EKI update
(32) 
where the perturbed observations are computed as before
(33) 
with
Figure 1 illustrates the basic idea of the application of the EKI to solve the neural network based oneshot formulation.
The EKI (32) will be used as a derivative-free optimizer of the data misfit. The analysis presented in [4, 32, 33] shows that the EKI in its continuous form is able to recover the data with a finite number of particles in the limit, under suitable assumptions on the forward problem and the set of particles. In particular, the analysis assumes a linear forward problem. Extensions to the nonlinear setting can be found e.g. in [7]. The limit corresponds to the noise-free setting, as the inverse noise covariance scales with in (23). To explore the scaling of the noise and to discuss regularization techniques, we illustrate the ideas in the following for a Gaussian linear setting, i.e. we assume that the forward response operator is linear with and . Considering the large ensemble size limit , the mean and covariance satisfy the equations
(34)  
(35) 
for in (30). By considering the dynamics of the inverse covariance, it is straightforward to show that the solution is given by
(36) 
see e.g. [15] and the references therein for details. Note that corresponds to the posterior covariance and that for . Furthermore, the mean is given by
(37) 
in particular, the mean minimizes the data misfit in the limit . The application of the EKI in the inverse setting therefore often requires additional techniques such as adaptive stopping [33] or additional regularization [6] to overcome the ill-posedness of the minimization problem. To control the regularization of the data misfit and the neural network individually, we consider the following system
(38)  
(39) 
with , , and . The loss function for the augmented system is therefore given by
(40) 
Assuming that the resulting forward operator
(41) 
is linear, the EKI will converge to the minimum of the regularized loss function (40), cf. [32]. To ensure the feasibility of the EKI estimate (w.r.t. the underlying forward problem), we propose the following algorithm, using the ideas discussed in section 2.5.
Theorem 1.
Assume that the forward operator , , is linear, i.e. with . Let be strictly monotonically increasing and for . Further, assume that the initial ensemble members are chosen so that .
Then, Algorithm 1 generates a sequence of estimates , where minimizes the loss function for the augmented system given by
with given . Furthermore, every accumulation point of is the (unique, global) minimizer of
s.t. 
Proof.
Under the assumption of a linear forward model, the penalty function
is strictly convex for all , i.e. there exists a unique minimizer of the penalized problem. Choosing the initial ensemble such that ensures the convergence of the EKI estimate to the global minimizer, see [6, Theorem 3.13] and [32, Theorem 4]. The convergence of Algorithm 1 to the minimizer of the constrained problem then follows from Proposition 1. ∎
Remark 1.
Note that the convergence result of Theorem 1 involves an assumption on the size of the ensemble to ensure convergence to the (global) minimizer of the loss function in each iteration. This is due to the well-known subspace property of the EKI, i.e. the EKI estimate will lie in the span of the initial ensemble when using the variant of the EKI discussed here. In the case of a large or possibly infinite-dimensional parameter / state space, the assumption on the size of the ensemble is usually not satisfied in practice. Techniques such as variance inflation, localization and adaptive ensemble choice are able to overcome the subspace property and thus might lead to computationally much more efficient algorithms. Furthermore, we stress that the convergence result presented above requires linearity of the forward and observation operators (w.r.t. the optimization variables), i.e. the assumption is not fulfilled when considering neural networks with a nonlinear activation function as approximation of the forward problem. We will demonstrate in the numerical experiments that the EKI shows promising results even in the nonlinear setting. The generalization of the theory to the nonlinear case will be the subject of future work.
To accelerate the computation of the minimizer, we suggest the following variant of Algorithm 1.
5. Numerical Results
We present in the following numerical experiments illustrating the one-shot inversion. The first example is a linear, one-dimensional problem, for which we compare the black-box method, the quasi-Newton method for the one-shot inversion, the quasi-Newton method for the neural network based one-shot inversion (Algorithm 1), the EKI for the one-shot inversion and the EKI for the neural network based one-shot inversion (Algorithm 2). The second experiment is concerned with the corresponding two-dimensional problem, to investigate the potential of the EKI for neural network based inversion in the higher-dimensional setting.
5.1. Onedimensional example
We consider the problem of recovering the unknown parameter from noisy observations
where is the solution of the one-dimensional elliptic equation
(42) 
with operator observing the dynamical system at equispaced observation points , .
We approximate the forward problem (42) numerically on a uniform mesh with mesh width by a finite element method with continuous, piecewise linear ansatz functions. The approximated solution operator will be denoted by , with .
The unknown parameter is assumed to be Gaussian, i.e. , with (discretized) covariance operator for . For our inverse problem, we assume an observational noise covariance , a model error covariance , and we choose the regularization parameter , while we turn off the regularization on , i.e. we set . Further, we choose a feedforward DNN with layers, where we set the size of the hidden layers to and the size of the input and output layers to . As activation function we choose the sigmoid function . The EKI method is based on the deterministic formulation represented through the coupled ODE system
(43) 
which will be solved with the MATLAB function ode45 up to time . The ensembles of particles , and , respectively, will be initialized with particles as i.i.d. samples, where the parameters are drawn from the prior distribution , the states are drawn from , and the weights of the neural network approximation are drawn from , all independent of each other.
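As a Python analogue of this setup, with SciPy's RK45 integrator playing the role of MATLAB's ode45, a sketch of the continuous-time EKI dynamics on a linear toy problem; all sizes, covariances and the final time are illustrative choices, not the paper's configuration:

```python
import numpy as np
from scipy.integrate import solve_ivp

def eki_ode_rhs(t, u_flat, G, y, Gamma_inv, d, J):
    """Right-hand side of the continuous-time EKI: each particle is driven
    by the misfit preconditioned with the empirical particle/observation
    cross-covariance, du_j/dt = C^{uG} Gamma^{-1} (y - G(u_j))."""
    U = u_flat.reshape(d, J)
    GU = np.column_stack([G(U[:, j]) for j in range(J)])
    dU = U - U.mean(axis=1, keepdims=True)
    dG = GU - GU.mean(axis=1, keepdims=True)
    Cug = dU @ dG.T / J  # empirical cross-covariance
    return (Cug @ Gamma_inv @ (y[:, None] - GU)).ravel()

# usage: linear toy problem, particles initialized as i.i.d. standard normals
d, J = 2, 5
rng = np.random.default_rng(1)
A = np.eye(d)
y = np.array([1.0, -1.0])
sol = solve_ivp(eki_ode_rhs, (0.0, 2000.0), rng.standard_normal(d * J),
                args=(lambda u: A @ u, y, np.eye(d), d, J), method="RK45")
U_final = sol.y[:, -1].reshape(d, J)  # ensemble at the final time
```

For this linear problem, the ensemble mean approaches the data as the artificial time grows, consistent with the optimization viewpoint on the EKI.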
We compare the results to a classical gradientbased method, namely the quasiNewton method with BFGS updates, as implemented by MATLAB.
We summarize the methods in the following and introduce abbreviations:

reduced formulation: explicit solution (redTik).

oneshot formulation: we compare the performance of the EKI with Algorithm 1 (osEKI_1), the EKI with Algorithm 2 (osEKI_2) and the quasiNewton method with Algorithm 1 (osQN_1).

neural network based oneshot formulation: we compare the performance of the EKI with Algorithm 2 (nnosEKI_2) and the quasiNewton method with Algorithm 1 (nnosQN_1).
Figure 2 shows the increasing sequence of used for Algorithm 1 and the quasi-Newton method, and for Algorithm 2 (over time).
5.1.1. Oneshot inversion
To illustrate the convergence result of the EKI and to numerically investigate the performance of Algorithm 2, we start the discussion by a comparison of the oneshot inversion based on the FEM approximation of the forward problem in the 1d example.
Figure 3 illustrates the difference of the estimates given by the EKI with Algorithm 1 (osEKI_1), the EKI with Algorithm 2 (osEKI_2) and the quasi-Newton method with Algorithm 1 (osQN_1) compared to the Tikhonov solution and the truth (on the left-hand side) and in the observation space (on the right-hand side). We observe that all three methods lead to an excellent approximation of the Tikhonov solution. Due to the linearity of the forward problem, the quasi-Newton method as well as the EKI with Algorithm 1 are expected to converge to the regularized solution. The EKI with Algorithm 2 shows a similar performance while reducing the computational effort significantly compared to Algorithm 1.
The comparison of the data misfit and the residual of the forward problem shown in Figure 4 reveals a very good performance of the EKI (for both algorithms), with feasibility of the estimate (w.r.t. the forward problem) in the range of .
5.1.2. Oneshot method with neural network approximation
The next experiment replaces the forward problem by a neural network in the oneshot setting. Due to the excellent performance of Algorithm 2 in the previous experiment, we focus in the following on this approach for the neural network based oneshot inversion.
The EKI for the neural network based one-shot inversion leads to a very good approximation of the regularized solution (cf. Figure 5), whereas the performance of the quasi-Newton approach is slightly worse, which might be attributed to the nonlinearity introduced by the neural network approximation.