Inverse problems arise in almost all areas of application, e.g. biological, engineering and environmental systems. The integration of data can substantially reduce the uncertainty in model-based predictions and is therefore indispensable in many applications. Advances in machine learning provide exciting potential to complement and enhance simulators for complex phenomena in the inverse setting. We propose a novel approach to inverse problems in which the computationally intensive forward problem is replaced by a neural network. However, when used as a surrogate model, the neural network needs to be trained beforehand in order to guarantee a good approximation of the forward problem for arbitrary inputs of the unknown coefficient. To reduce the cost of building the surrogate over the entire parameter space, we suggest solving the inverse problem and training the neural network simultaneously, in a one-shot fashion. From a computational point of view, this approach has the potential to reduce the overall cost significantly, as the neural network serves as a surrogate of the forward model only near the unknown parameter estimated from the data rather than over the entire parameter space. The ensemble Kalman inversion (EKI), a derivative-free optimizer, is applied to the resulting optimization problem.
1.1. Background and literature overview
The goal of computation is to solve the following inverse problem: Recover the unknown parameter in an abstract model
from a finite number of observations of the state given by
which might be subject to noise. We denote by , and Banach spaces. The operator describes the underlying model, typically a PDE or ODE in the applications of interest. The variable denotes the state of the model and is defined on a domain . The operator is the observation operator, mapping the state variables to the observations. Classical methods for inverse problems are based on an optimization approach for the (regularized) data misfit, see e.g. [12, 25]. One-shot (all-at-once) approaches are well established in the context of PDE-constrained optimization (see e.g. ) and have recently also been introduced in the inverse setting [23, 24]. The idea is to solve the underlying model equation simultaneously with the optimality conditions. This is in contrast to so-called reduced or black-box methods, which formulate the minimization as an unconstrained optimization problem via the solution operator of (1
). The connection of the optimization approach to the Bayesian setting via the maximum a posteriori estimate is well established in the finite-dimensional setting as well as in the infinite-dimensional setting for certain prior classes, see [1, 8, 9, 18]. We refer to [22, 37] for more details on the Bayesian approach to inverse problems. Recently, data-driven approaches have been introduced to the inverse setting to reduce the computational complexity of highly complex forward models and to improve models in case of limited understanding of the underlying process . Neural networks in the inverse framework have been successfully applied in the case of parametric holomorphic forward models  and in the case of limited knowledge of the underlying model [29, 30, 31, 36, 38]. To train neural networks, gradient-based methods are typically used . The ensemble Kalman inversion (EKI) (see e.g. [20, 32, 33]) has recently been applied as a gradient-free optimizer for the training of neural networks in [17, 26].
1.2. Our contribution
The goal of this work is two-fold:
First, we formulate the inverse problem in a one-shot fashion and establish the connection to the Bayesian setting. In particular, the Bayesian viewpoint allows us to incorporate model uncertainty and gives a natural way to define the regularization parameters. In case of an exact forward model, the vanishing-noise limit can be interpreted as a penalty method. This allows us to establish convergence of the one-shot formulation, posed as an unconstrained optimization problem, to the corresponding (regularized) solution of the reduced optimization problem (with exact forward model). The (numerical approximation of the) forward problem is replaced by a neural network in the one-shot formulation, i.e. the neural network does not have to be trained in advance for all potential parameters.
Second, we show that the EKI can be effectively used to solve the resulting optimization problem. We provide a convergence analysis in the linear setting. To enhance the performance, we modify the algorithm motivated by the continuous version of the EKI. Numerical experiments demonstrate the robustness of the proposed algorithm, also in the nonlinear setting.
In section 2, we discuss the optimization and Bayesian approaches to inverse problems, introduce the one-shot formulation and establish the connection to vanishing noise and penalty methods in case of exact forward models. Section 3 gives an overview of neural networks used to approximate the forward problem. The EKI as a derivative-free optimizer for the one-shot formulation is discussed and analyzed in section 4. Finally, numerical experiments demonstrating the effectiveness of the proposed approach are presented in section 5. We conclude in section 6 with an outlook on future work.
We denote by the Euclidean norm and by the corresponding inner product. For a given symmetric, positive definite matrix (s.p.d.) , the weighted norm is defined by and the weighted inner product by .
2. Problem Formulation
Throughout this paper, we derive the methods and theoretical results under the assumption that and are finite-dimensional, i.e. we assume that the forward problem has been discretized by a suitable numerical scheme and the parameter space is finite-dimensional as well, possibly after dimension truncation. Though most of the ideas and results can be generalized to the infinite-dimensional setting, we avoid the technicalities arising from the infinite-dimensional setting and focus on the discretized problem, i.e. , and .
2.1. Optimization approach to the inverse problem
The optimization approach leads to the following problem
i.e. we minimize the data misfit in a suitable weighted norm with s.p.d., given that the forward model is satisfied. Due to the ill-conditioning of the inverse problem, a regularization term on the unknown parameters is often introduced in order to stabilize the optimization, i.e. we consider
where the regularization is denoted by and the positive scalar is usually chosen according to prior knowledge on the unknown parameter . We will comment on the motivation of the regularization via the Bayesian approach in the following section. To do so, we first introduce the so-called reduced problem of (3)-(4) and (5)-(6), respectively. The forward model is typically a well-posed problem, in the sense that for each parameter , there exists a unique state such that in . Introducing the solution operator s.t. , we can reformulate the optimization problem (3)-(4) as an unconstrained optimization problem
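The display equations (3)-(4), (5)-(6) and (7) did not survive in this copy; a hedged reconstruction in our own notation (u the unknown parameter, p the state, G the forward model, \mathcal{O} the observation operator, S the solution operator — all symbol names are ours and should be checked against the original) reads:

```latex
% Constrained formulation, cf. (3)-(4):
\min_{u,p}\ \tfrac{1}{2}\,\| y - \mathcal{O}(p) \|_{\Gamma}^{2}
\quad \text{s.t.} \quad G(u,p) = 0 .
% Regularized variant, cf. (5)-(6):
\min_{u,p}\ \tfrac{1}{2}\,\| y - \mathcal{O}(p) \|_{\Gamma}^{2} + \alpha\, R(u)
\quad \text{s.t.} \quad G(u,p) = 0 .
% Reduced (black-box) formulation (7), with S(u) the solution of G(u,p)=0:
\min_{u}\ \tfrac{1}{2}\,\| y - \mathcal{O}(S(u)) \|_{\Gamma}^{2} .
```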
2.2. Bayesian approach to the inverse problem
Adopting the Bayesian approach to inverse problems, we view the unknown parameters as an
-valued random variable with prior distribution. The noise in the observations is assumed to be additive and described by a random variable with s.p.d., i.e.
Further, we assume that the noise is stochastically independent of
. By Bayes’ theorem, we obtain the posterior distribution
the conditional distribution of the unknown given the observation .
2.3. Connection between the optimization and the Bayesian approach
The optimization approach leads to a point estimate of the unknown parameters, whereas the Bayesian approach computes, as the solution of the inverse problem, the conditional distribution of the unknown parameters given the data, i.e. the posterior distribution. Since the computation or approximation of the posterior distribution is prohibitively expensive in many applications, one often resorts to point estimates such as the maximum a posteriori (MAP) estimate, the most likely value of the unknown parameters. Denoting by the Lebesgue density of the prior distribution, the MAP estimate is defined as
Assuming a Gaussian prior distribution, i.e. , the MAP estimate is given by the solution of the following minimization problem
The Gaussian prior assumption leads to a Tikhonov-type regularization in the objective function, whereas the first term in the objective function results from the Gaussian assumption on the noise. We refer the interested reader to [10, 22] for more details on the MAP estimate.
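In formulas, the two statements above can be reconstructed as follows (a hedged reconstruction in our notation, with prior N(u_0, C_0) and noise N(0, \Gamma)):

```latex
% MAP estimate as the maximizer of the posterior density:
u_{\mathrm{MAP}} \in \arg\max_{u}\ \pi(u \mid y) .
% With Gaussian prior and Gaussian noise, this is a Tikhonov-type problem:
u_{\mathrm{MAP}} \in \arg\min_{u}\ \tfrac{1}{2}\,\| y - \mathcal{O}(S(u)) \|_{\Gamma}^{2}
 + \tfrac{1}{2}\,\| u - u_{0} \|_{C_{0}}^{2} .
```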
2.4. One-shot formulation for inverse problems
Black-box methods are based on the reduced formulation (7), assuming that the forward problem can be solved exactly in each iteration. Here, we will follow a different approach, which solves the forward and optimization problem simultaneously. Various names for the simultaneous solution of the design and state equations exist: one-shot method, all-at-once, piggy-back iterations, etc. We refer the reader to  and the references therein for more details.
Following the one-shot ideas, we seek to solve the problem
Due to the noise in the observations, we rather consider
with normally distributed noise, s.p.d. Similarly, we assume that
i.e. we assume that the model error can be described by , s.p.d.. Thus, we obtain the problem
The MAP estimate is then computed by the solution of the following minimization problem
where and are regularizations of the parameter and the state , and .
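In the notation used above, the one-shot MAP objective plausibly takes the following form (a hedged reconstruction; \alpha, \beta and the regularizers R_u, R_p are our names for the quantities described in the text):

```latex
\min_{u,p}\ \tfrac{1}{2}\,\| y - \mathcal{O}(p) \|_{\Gamma_{\mathrm{obs}}}^{2}
 + \tfrac{1}{2}\,\| G(u,p) \|_{\Gamma_{\mathrm{model}}}^{2}
 + \alpha\, R_{u}(u) + \beta\, R_{p}(p) .
```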
2.5. Vanishing noise and penalty methods
In case of an exact forward model, i.e. in case the forward equation is supposed to be satisfied exactly with , this can be modeled in the Bayesian setting by vanishing noise. To illustrate this idea, we consider a parametrized noise covariance model for and a given s.p.d. matrix . The limit for corresponds to the vanishing-noise setting and can be interpreted as reducing the uncertainty in our model. The MAP estimate in the one-shot framework is thus given by
with . This form reveals the close connection to penalty methods, which attempt to solve constrained optimization problems such as (3)-(4) by sequentially solving unconstrained optimization problems of the form (18) for a sequence of monotonically increasing penalty parameters . The following well-known result on the convergence of the resulting algorithm can be found e.g. in .
Let the observation operator , the forward model and the regularization functions , be continuous and the feasible set be nonempty. For let denote a global minimizer of
with strictly monotonically increasing and for . Then every accumulation point of the sequence is a global minimizer of
This classic convergence result ensures the feasibility of the estimates, i.e. the proposed approach is able to incorporate and, in the limit, exactly satisfy physical constraints. We also mention the possibility of considering exact penalty terms in the objective, corresponding to different noise models in the Bayesian setting. This will be the subject of future work.
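The penalty strategy underlying this result can be sketched on a toy linear-Gaussian problem. Everything below (the operators A and O, the data y, the parameter alpha) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

# Toy sketch of the penalty strategy above: the linear constraint
# p = A u stands in for the forward model G(u, p) = 0 and is enforced
# only in the limit of increasing penalty parameters lambda_k. All
# operators and parameters here are illustrative assumptions.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))           # "forward model": p = A u
O = rng.standard_normal((4, 5))           # observation operator
u_true = rng.standard_normal(3)
y = O @ A @ u_true + 0.1 * rng.standard_normal(4)   # noisy data
alpha = 1e-3                              # Tikhonov parameter on u

def penalized_minimizer(lam):
    # minimize ||y - O p||^2 + lam ||p - A u||^2 + alpha ||u||^2 over (u, p)
    n_u, n_p = 3, 5
    H = np.zeros((n_u + n_p, n_u + n_p))
    g = np.zeros(n_u + n_p)
    H[n_u:, n_u:] += O.T @ O              # data misfit term
    g[n_u:] += O.T @ y
    B = np.hstack([-A, np.eye(n_p)])      # B z = p - A u for z = (u, p)
    H += lam * B.T @ B                    # penalty term
    H[:n_u, :n_u] += alpha * np.eye(n_u)  # regularization term
    z = np.linalg.solve(H, g)             # solve the normal equations
    return z[:n_u], z[n_u:]

for lam in [1.0, 1e2, 1e4, 1e6]:          # strictly increasing penalties
    u, p = penalized_minimizer(lam)
    print(lam, np.linalg.norm(p - A @ u)) # constraint violation shrinks
```

As the proposition suggests, the constraint violation of the unconstrained minimizers decays as the penalty parameter grows.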
This setting will be the starting point of the incorporation of neural networks into the problem. Instead of minimizing with respect to the state , we will approximate the solution of the forward problem by a neural network , where denote the parameters of the neural network to be learned within this framework. Thus, we obtain the corresponding minimization problem
where denotes the state approximated by the neural network.
3. Neural Networks
In the following, we distinguish between the neural network architecture, or neural network parameters, given as a set of weights and biases, and the neural network itself, which is the associated mapping obtained when an activation function is applied to the affine linear transformations defined by the neural network parameters:
The neural network architecture or the neural network parameters with input dimension and
layers is a sequence of matrix-vector tuples
where , with
. Hence, the number of neurons in layer is given by and , i.e. and for . Given an activation function , we call the mapping
defined by the recursion
a neural network with activation function , where is understood to act component-wise on vector-valued inputs, i.e. for
. In the following the activation function is always the sigmoid function. We will therefore waive the superscript indicating the activation function, i.e. we abbreviate . In the context of approximating the solution of the forward problem , this definition of the neural network allows us to distinguish between the neural network evaluated at one specific input point, , and the neural network as a function of , denoted by .
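A minimal sketch of this feed-forward construction (layer sizes are illustrative; the output layer is taken to be affine, as is common in this literature):

```python
import numpy as np

# A minimal implementation of the definition above: parameters theta are
# a sequence of (weight, bias) tuples; the sigmoid acts component-wise on
# every layer but the last, which is affine. Layer sizes are illustrative.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(sizes, rng):
    # sizes = (n_0, ..., n_K): input, hidden and output dimensions
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def network(theta, x):
    # recursion: z_0 = x, z_k = sigma(W_k z_{k-1} + b_k), affine output layer
    z = x
    for W, b in theta[:-1]:
        z = sigmoid(W @ z + b)
    W, b = theta[-1]
    return W @ z + b

rng = np.random.default_rng(1)
theta = init_params((1, 10, 10, 1), rng)   # scalar input x, scalar state value
print(network(theta, np.array([0.5])))
```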
Neural networks of this class are often called feed-forward deep neural networks (DNNs) and have recently found success in forward  as well as Bayesian inverse  problems in the field of uncertainty quantification.
Based on approximation results for polynomials by feed-forward DNNs , the authors in  derive bounds on the expression rate for many-variate, real-valued functions depending holomorphically on a sequence of parameters. More specifically, the authors consider functions that admit sparse Taylor gpc expansions, i.e. -summable Taylor gpc coefficients. Such functions arise as response surfaces of parametric PDEs, or in a more general setting from parametric operator equations, see e.g.  and the references therein. Their main result is that these functions can be expressed with arbitrary accuracy (uniformly w.r.t. ) by DNNs of size bounded by with a constant independent of the dimension of the input data . Similar results for parametric PDEs can be found in .
3.1. Neural networks for inverse problems
The methods in  motivated the work of  in which the authors show holomorphy of the data-to-QoI map for additive, centered Gaussian observation noise in Bayesian inverse problems. Using the fact that holomorphy implies fast convergence of Taylor expansions, the authors derived an exponential expression rate bound in terms of the overall network size.
Our approach differs from the ideas above in that we do not approximate the data-to-QoI map, but instead emulate the state itself by a DNN. Hence, in our method the input of the neural network is a point in the domain of the state, . The output of the neural network is the state at this point, , i.e. . By we denote the neural network as a function of , i.e. . In combination with a one-shot approach for the training of the neural network parameters, our method is more closely related to the physics-informed neural networks (PINNs) in [29, 30].
3.1.1. Connection to physics-informed neural networks
where is a nonlinear differential operator parametrized by . The authors replace by a neural network and use automatic differentiation to construct the function . The neural network parameters are then obtained by minimizing , where
For the minimization, an L-BFGS method is used. The parameters of the differential operator become parameters of the neural network and can be learned by minimizing the MSE.
In  the authors consider so-called Bayesian neural networks (BNNs), where the neural network parameters are updated according to Bayes' theorem. Here, the initial distribution on the network parameters serves as the prior distribution. The likelihood requires the PDE solution, which is obtained by concatenating the Bayesian neural network with a physics-informed neural network; the result is called a Bayesian physics-informed neural network (B-PINN). For the estimation of the posterior distributions, the authors use the Hamiltonian Monte Carlo method and variational inference. In contrast to PINNs, the Bayesian framework allows them to quantify the aleatoric uncertainty associated with noisy data. In addition, their numerical experiments indicate that B-PINNs beat PINNs in case of large noise levels on the observations. In contrast, our proposed method is based on the MAP estimate and remains exact in the small-noise limit. We propose a derivative-free optimization method, the EKI, which shows promising results (also compared to quasi-Newton methods) without requiring derivatives w.r.t. the weights and design parameters.
4. Ensemble Kalman Filter for Neural Network Based One-shot Inversion
The ensemble Kalman inversion (EKI) generalizes the well-known ensemble Kalman filter (EnKF), introduced by Evensen and coworkers in the data assimilation context , to the inverse setting, see  for more details. Since the Kalman filter involves a Gaussian approximation of the underlying posterior distribution, we focus on an iterative version based on tempering in order to reduce the linearization error. Recall the posterior distribution given by
for an abstract inverse problem
with mapping from the unknowns to the observations with , s.p.d. We define the intermediate measures
by scaling the data misfit / likelihood by the step size , . The idea is to evolve the prior distribution into the posterior distribution
by this sequence of intermediate measures and to apply the EnKF to the resulting artificial time dynamical system. Note that we account for the repeated use of the observations by amplifying the noise variance by the inverse step size in each step. The EKI then uses an ensemble of particles with to approximate the intermediate measures by
with denoting the delta-Dirac mass located at . The particles are transformed in each iteration by the application of the Kalman update formulas to the empirical mean and covariance in the form
where denotes the Kalman gain and, for , the operators and given by
are the empirical covariances and empirical mean in the observation space. Since this update does not uniquely define the transformation of each particle to the next iteration , the specific choice of the transformation leads to different variants of the EKI. We will focus here on the generalization of the EnKF as introduced by  resulting in a mapping of the particles of the form
The are i.i.d. random variables distributed according to with corresponding to the case of perturbed observations and to the unperturbed observations.
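A single EKI iteration with perturbed observations can be sketched as follows; the forward map, noise covariance and step size are illustrative stand-ins for the quantities in (30):

```python
import numpy as np

# One EKI step with perturbed observations, sketching update (30): each
# particle is shifted by a Kalman-type gain built from empirical
# covariances in observation space. The forward map G, the noise
# covariance Gamma, the data y and the step size h are illustrative
# placeholders, not the paper's quantities.
def eki_step(U, G, y, Gamma, h, rng):
    # U: (J, d) ensemble, G: forward map R^d -> R^m
    J = U.shape[0]
    GU = np.array([G(u) for u in U])                 # (J, m)
    u_bar, g_bar = U.mean(axis=0), GU.mean(axis=0)
    Cug = (U - u_bar).T @ (GU - g_bar) / J           # cross-covariance (d, m)
    Cgg = (GU - g_bar).T @ (GU - g_bar) / J          # covariance (m, m)
    # perturbed observations; the noise variance is amplified by 1/h
    Y = y + rng.multivariate_normal(np.zeros(len(y)), Gamma / h, size=J)
    K = Cug @ np.linalg.inv(Cgg + Gamma / h)         # Kalman gain (d, m)
    return U + (Y - GU) @ K.T

# linear toy problem G(u) = A u with noise-free synthetic data
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
u_true = rng.standard_normal(4)
y = A @ u_true
Gamma = 0.01 * np.eye(6)
U0 = rng.standard_normal((20, 4))                    # initial ensemble
U = U0.copy()
for _ in range(50):
    U = eki_step(U, lambda u: A @ u, y, Gamma, h=0.1, rng=rng)
print(np.linalg.norm(A @ U.mean(axis=0) - y))        # data misfit of the mean
```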
The motivation via the sequence of intermediate measures and the resulting artificial time allows one to derive the continuous-time limit of the iteration, which has been extensively studied in [4, 32, 33] to develop the analysis of the EKI in the linear setting. This limit arises by taking the parameter in (30) to zero, resulting in
As shown in , the EKI does not in general converge to the true posterior distribution. Therefore, the analysis presented in [4, 32, 33] views the EKI as a derivative-free optimizer of the data misfit, which is also the viewpoint we adopt here.
4.0.1. Ensemble Kalman inversion for neural network based one-shot formulation
By approximating the state of the underlying PDE by a neural network, we seek to optimize the unknown parameter on the one hand and the parameters of the neural network on the other. The idea is based on defining the function , where denotes the state approximated by the neural network and . This leads to the empirical summary statistics
and the EKI update
where the perturbed observations are computed as before
Figure 1 illustrates the basic idea of the application of the EKI to solve the neural network based one-shot formulation.
The EKI (32) will be used as a derivative-free optimizer of the data misfit. The analysis presented in [4, 32, 33] shows that the EKI in its continuous form is able to recover the data with a finite number of particles in the limit under suitable assumptions on the forward problem and the set of particles. In particular, the analysis assumes a linear forward problem. Extensions to the nonlinear setting can be found e.g. in . The limit corresponds to the noise-free setting, as the inverse noise covariance scales with in (23). To explore the scaling of the noise and to discuss regularization techniques, we illustrate the ideas in the following for a Gaussian linear setting, i.e. we assume that the forward response operator is linear with and . Considering the large ensemble size limit , the mean and covariance satisfy the equations
for in (30). By considering the dynamics of the inverse covariance, it is straightforward to show that the solution is given by
see e.g.  and the references therein for details. Note that corresponds to the posterior covariance and that for . Furthermore, the mean is given by
in particular the mean minimizes the data misfit in the limit . The application of the EKI in the inverse setting therefore often requires additional techniques such as adaptive stopping  or additional regularization  to overcome the ill-posedness of the minimization problem. To control the regularization of the data misfit and neural network individually, we consider the following system
with , , and
. The loss function for the augmented system is therefore given by
Assuming that the resulting forward operator
is linear, the EKI will converge to the minimum of the regularized loss function (40), cp. . To ensure the feasibility of the EKI estimate (w.r.t. the underlying forward problem), we propose the following algorithm using the ideas discussed in section 2.5.
Assume that the forward operator , ,
is linear, i.e. with . Let be strictly monotonically increasing and for . Further, assume that the initial ensemble members are chosen so that .
Then, Algorithm 1 generates a sequence of estimates , where minimizes the loss function for the augmented system given by
with given . Furthermore, every accumulation point of is the (unique, global) minimizer of
Under the assumption of a linear forward model, the penalty function
is strictly convex for all , i.e. there exists a unique minimizer of the penalized problem. Choosing the initial ensemble such that ensures the convergence of the EKI estimate to the global minimizer, see [6, Theorem 3.13] and [32, Theorem 4]. The convergence of Algorithm 1 to the minimizer of the constrained problem then follows from Proposition 1. ∎
Note that the convergence result in Theorem 1 involves an assumption on the size of the ensemble to ensure convergence to the (global) minimizer of the loss function in each iteration. This is due to the well-known subspace property of the EKI, i.e. the EKI estimate will lie in the span of the initial ensemble when using the variant of the EKI discussed here. In case of a large or possibly infinite-dimensional parameter / state space, the assumption on the size of the ensemble is usually not satisfied in practice. Techniques such as variance inflation, localization and adaptive ensemble choice are able to overcome the subspace property and thus might lead to computationally much more efficient algorithms. Furthermore, we stress that the convergence result presented above requires the linearity of the forward and observation operators (w.r.t. the optimization variables), i.e. the assumption is not fulfilled when considering neural networks with a nonlinear activation function as approximation of the forward problem. We will demonstrate in the numerical experiments that the EKI shows promising results even in the nonlinear setting. The generalization of the theory to the nonlinear case will be the subject of future work.
To accelerate the computation of the minimizer, we suggest the following variant of Algorithm 1.
5. Numerical Results
We present in the following numerical experiments illustrating the one-shot inversion. The first example is a linear, one-dimensional problem, for which we compare the black-box method, quasi-Newton method for the one-shot inversion, quasi-Newton method for the neural network based one-shot inversion (Algorithm 1), EKI for the one-shot inversion and EKI for the neural networks based one-shot inversion (Algorithm 2). The second experiment is concerned with the corresponding two-dimensional problem to investigate the potential of the EKI for neural network based inversion in the higher-dimensional setting.
5.1. One-dimensional example
We consider the problem of recovering the unknown data from noisy observations
where is the solution of the one-dimensional elliptic equation
with operator observing the dynamical system at equispaced observation points , .
We approximate the forward problem (42) numerically on a uniform mesh with mesh width by a finite element method with continuous, piecewise linear ansatz functions. The approximated solution operator will be denoted by , with .
The unknown parameter is assumed to be Gaussian, i.e. , with (discretized) covariance operator for . For our inverse problem we assume an observational noise covariance , a model error covariance and we choose the regularization parameter , while we turn off the regularization on , i.e. we set . Further, we choose a feed-forward DNN with layers, where we set the size of the hidden layers to and the size of the input and output layers to . As activation function we choose the sigmoid function . The EKI method is based on the deterministic formulation represented through the coupled ODE system
which will be solved with the MATLAB function ode45 up to time . The ensembles of particles , and , respectively, will be initialized with i.i.d. samples, where the parameters are drawn from the prior distribution , the states are drawn from and the weights of the neural network approximation are drawn from , all independent of each other.
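The deterministic coupled ODE system can be sketched as follows; for self-containedness we use a simple explicit Euler scheme instead of ode45, and a small linear forward map as an illustrative stand-in for the FEM-based setup:

```python
import numpy as np

# Sketch of the deterministic continuous-time EKI: the particle system
# follows the coupled ODE above, here integrated by explicit Euler
# instead of MATLAB's ode45. The linear forward map, the priors and the
# dimensions are illustrative stand-ins for the FEM-based experiment.
def eki_rhs(U, G, y, Gamma_inv):
    # dU_j/dt = -C^{ug} Gamma^{-1} (G(u_j) - y), with empirical C^{ug}
    J = U.shape[0]
    GU = np.array([G(u) for u in U])
    u_bar, g_bar = U.mean(axis=0), GU.mean(axis=0)
    Cug = (U - u_bar).T @ (GU - g_bar) / J
    return -(GU - y) @ Gamma_inv @ Cug.T

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
u_true = rng.standard_normal(3)
y = A @ u_true
Gamma_inv = np.eye(5)                  # noise covariance Gamma = I
U0 = rng.standard_normal((15, 3))      # i.i.d. draws, playing the prior
U = U0.copy()
dt, T = 1e-2, 5.0
for _ in range(int(T / dt)):           # explicit Euler time stepping
    U = U + dt * eki_rhs(U, lambda u: A @ u, y, Gamma_inv)
print(np.linalg.norm(A @ U.mean(axis=0) - y))   # data misfit of the mean
```

In the linear setting the data misfit of the ensemble mean decays along the flow, consistent with the analysis cited in section 4.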
We compare the results to a classical gradient-based method, namely the quasi-Newton method with BFGS updates, as implemented by MATLAB.
We summarize the methods in the following and introduce abbreviations:
reduced formulation: explicit solution (redTik).
one-shot formulation: we compare the performance of the EKI with Algorithm 1 (osEKI_1), the EKI with Algorithm 2 (osEKI_2) and the quasi-Newton method with Algorithm 1 (osQN_1).
neural network based one-shot formulation: we compare the performance of the EKI with Algorithm 2 (nnosEKI_2) and the quasi-Newton method with Algorithm 1 (nnosQN_1).
Figure 2 shows the increasing sequence of used for Algorithm 1 and the quasi-Newton method and Algorithm 2 (over time).
5.1.1. One-shot inversion
To illustrate the convergence result of the EKI and to numerically investigate the performance of Algorithm 2, we start the discussion by a comparison of the one-shot inversion based on the FEM approximation of the forward problem in the 1d example.
Figure 3 illustrates the differences between the estimates given by the EKI with Algorithm 1 (osEKI_1), the EKI with Algorithm 2 (osEKI_2) and the quasi-Newton method with Algorithm 1 (osQN_1), compared to the Tikhonov solution and the truth (on the left-hand side) and in the observation space (on the right-hand side). We observe that all three methods lead to an excellent approximation of the Tikhonov solution. Due to the linearity of the forward problem, the quasi-Newton method as well as the EKI with Algorithm 1 are expected to converge to the regularized solution. The EKI with Algorithm 2 shows a similar performance while reducing the computational effort significantly compared to Algorithm 1.
The comparison of the data misfit and the residual of the forward problem shown in Figure 4 reveals a very good performance of the EKI (for both algorithms), with feasibility of the estimate (w.r.t. the forward problem) in the range of .
5.1.2. One-shot method with neural network approximation
The next experiment replaces the forward problem by a neural network in the one-shot setting. Due to the excellent performance of Algorithm 2 in the previous experiment, we focus in the following on this approach for the neural network based one-shot inversion.
The EKI for the neural network based one-shot inversion leads to a very good approximation of the regularized solution (cp. Figure 5), whereas the performance of the quasi-Newton approach is slightly worse, which might be attributed to the nonlinearity introduced by the neural network approximation.