1 Introduction
Inverse problems have been traditionally posed as inferring unknown or uncertain parameters (e.g., coefficients, initial conditions, boundary or domain source terms, geometry) that characterize an underlying model from given (possibly noisy) observational or experimental data [53, 25]. Such inverse problems governed by physicsbased models, also referred to as data assimilation in the meteorological and climate communities [26], abound in a wider range of application areas such as geophysics, cryosphere studies, medical imaging, biochemistry, and control theory. Typically the models governing these inverse problems are considered deterministic. In reality, however, in addition to the inversion parameter, these models involve other sources of uncertainties and randomness. For instance, the models have multiple uncertain coefficients or unknown or random source terms, parameters that are not—or cannot be—inferred. Motivated by the need to account for these additional uncertainties, researchers in recent years have shown a growing interest in considering inverse problems governed by stochastic (or uncertain) models, mostly in the context of optimal control [55, 9, 41, 31]. In this paper, we consider the inference of parameters for stochastic models (described by differential equations) and quantify the uncertainty associated with this inference.
Contributions
This study introduces a methodology for the statistical treatment of inverse problems constrained by physicsbased models with stochastic terms. The salient idea of our approach is to express the problem as finding the inversion parameter for which the stochastic model generates a distribution that best explains the distribution of the observations. To this end, we formulate a Bayesian inverse problem with a likelihood given by suitable statistical scoring metrics typically employed in forecast verification [49, 19, 18]. Proper scores allow for including in the inversion process a large range of statistical features (e.g., spatial and/or temporal correlations and biases) and, in this respect, are a significant departure from and improvement over the traditional leastsquares misfit metrics.
We also delve into the issue of how different statistical scores affect the results of the inference problem. This issue becomes important when we explore fitness functions for multivariate distributions because one invariably needs to rely on statistics that typically favor certain features over others, for example, variance over correlations. Traditionally, the solution of the inverse problem is the parameter field (or a distribution if working with statistical inverse problems) that is the closest to the true parameter field in some metric, e.g., leastsquares or statistical scores such as those proposed in this work. However, one can also pose the problem as finding the parameter field that generates the most accurate predictions in some statistical sense. In other words, we are interested in how much more predictable the model is after inference/inversion, not necessarily in the goodness of fit. We also explore this alternative inversion paradigm in this study and show that various proper scores or combination of them can be used successfully in different inversion setups (i.e, inverse problems governed by spatial differential equations and timedependent differential equations) to improve the model predictability.
Another critical issue we address in this paper is the efficient computation of the numerical solution of the proposed statistical inverse problems. Namely, we propose a solution approach based on numerical optimization and provide the ingredients, in the form of gradientbased scalable optimization algorithms and adjoint sensitivities, that are needed to ensure the scalability of our methodology to largescale problems. More specifically, to compute the maximum a posteriori (MAP) point, we solve an optimization problem (implicitly) constrained by the stochastic model with a quasiNewton limitedmemory secant algorithm, BFGS updates for the inverse of the Hessian of the cost function, and an Armijo line search. We derive adjointbased expressions for efficient computation of the gradient of the objective with respect to the inversion parameters. We illustrate our approach with two problems. The first is an inversion for the coefficient field in an elliptic partial differential equation (PDE), interpreted as a subsurface flow problem under a stochastic source field. The second problem is represented by the inversion for a parameter in a differentialalgebraic system of equations (DAEs) under stochastic load terms. This model can represent a power grid system with uncertain load behavior, which induces small transients in the system, the goal here being to determine the dynamic parameters within a quasistationary regime.
Problem formulation
In what follows, let us consider that we have a mathematical model expressed as , with states and parameters driven by a stochastic forcing
with known probability law. Formally, such a model can consist of a standard stochastic PDE on a domain
() with suitable boundary ; in this case is a function on and is defined by means of a probability space , where is the set of events, is the algebra of events, and is a probability measure. We assume that this probability space is completely defined. We take to be a realvalued deterministic field, although extensions to random fields are also possible. Formally, the mathematical model can be defined as in [22] in the following form:(1) 
We also assume that we have a suitable vector space with finite stochastic moments when
is a PDE. More details on the setup can be found in [21]. In this work is defined by a PDE (see Sec. 4) or an ordinary differential equation(ODE)/DAE (see Sec.
5) system. We assume the availability of sparse observations of the states corresponding to the true parameters, which we denote by . These observations are subject to observational noise , which we assume to be a random vector with known Borel probability measure , in addition to the stochastic forcing .At the high level, the inverse problem formulation we use in this work is standard: Given model (1) and observations such that a.s., find that generates model predictions that best explain the observations under a certain metric, where the operator is a projection map from the state space to the observations space. A commonly used metric is the distance between the observations and observables . The metrics used in this work are referred to as statistical score functions that compute the fitness or a distance between the distribution of the observables and the set of validation data, namely, observations . We denote these score functions by , where the first two components represent the observation and model predictions. The third component represents the control variable, and will be used to encode prior information or a regularization factor. We introduce and discuss in detail such score functions in the next section.
Scores are positive functions that achieve their global minimum when observations and model predictions are statistically indistinguishable. For that reason scores have been used as loss or utility functions in order to assess the level of confidence one has in the probabilistic model prediction [27, 17]. We define a likelihood function based on this metric as , and eventually construct a posteriori function. Therefore, the inverse problem can be formulated as finding such that equationparentequation
(2a)  
while observing  
(2b) 
with i.i.d. and .
The objective in (2a) depends on the observed data via a single or multiple realizations, the numerical model observables output , and potentially explicitly on the parameter . The inclusion of in the objective is meant to account for the prior information on the parameter or to regularize the optimization problem. We remark that the optimization problem also depends on the parameter implicitly through the PDE or ODE/DAE constraint described by .
Related work
Inverse problems with stochastic parameters are typically addressed in a multilevel context. Here the stochasticity may come from reducing the models and introducing a stochastic term that accounts for the model reduction error [3, 29]. In other cases, the additional stochastic/uncertain input is treated as a nuisance parameter and approximative premarginalization over this parameter is carried out [33, 25, 24]. In this study we consider the case when stochasticity is inherent in the problem and we do not have access to a deterministic version of a complete model. In the optimal control community, recent efforts have targeted moment matching between a stochastic controlled PDE and observations, [55, 9, 41]
. In most cases the loss function is based on statistics of univariate or marginal distributions. Extending to multivariate settings is extremely challenging because of the difficulty of accounting for complex dependencies, the curse of dimensionality, and the lack of order or rank statistics. The scoring functions used in this study are precisely addressing the multivariate aspect of the model and observational data distributions. A strategy similar to what we present here has been introduced in the statistical community sometimes under the name of statistical postprocessing or model output statistics. It consists of altering the computational model probabilistic forecasts by postprocessing the ensemble forecasts, and it tends to address the issue of bias and dispersion
[18]. Most of these approaches are variations of Bayesian model averaging discussed in [40] and the nonhomogeneous regression model proposed in [18]. In these strategies the numerical model or its parameters are not controlled; only its output is adjusted [46, 28, 6, 7, 45, 49, 14]. In the strategy proposed in this study, the model itself through its parameters is part of the control space.The remaining sections of this paper are organized as follows. We begin by introducing the scoring functions and discuss the property needed in order for the ansatz (2) to be well posed in Section 2. In Sections 4 and 5 we introduce a prototype elliptic PDEbased model problem with application in subsurface flow and a timedependent problem driven by DAEs with application in power grid modeling, respectively. We conclude in Section 6 with a discussion of the method, results, and its limitations.
2 Scoring and metrics
In statistics one way to quantitatively compare or rank probabilistic models is using score functions. A score function is a scalar metric that takes as inputs (i) verification data, in our inverse problem formulation the observations , and (ii) outputs from the model to be evaluated, which in the present work are the observables subject to observational noise, namely, , and returns a scalar used to scoring or ranking verification outputs with respect to the verification data. These scores are commonly used in forecast evaluation where competing forecasts are compared [48].
While evaluating numerical model simulations, one aims to detect bias, trends, outliers, or correlation misspecification. To create an objective function that can distinguish among different modeling strategies, one needs appropriate mathematical scoring metrics to rank them. Complex mathematical properties are required for consistent ranking of the models
[17].Score functions use different statistics to distinguish among different (statistical) models. Moreover, in order to be able to distinguish among and consistently rank different models, score functions are required to have specific mathematical properties. Proper score functions are widely used in statistics to ensure consistent ranking, for example in forecast evaluation, where competing forecasts are compared [48]. The following definition of proper scoring is from [17].
Definition 1
In other words, this definition states that a proper scoring rule prevents a score from favoring any probabilistic distribution over the distribution of the verification data. In addition, a proper score has two main desired features: (1) statistical consistency between the verification data and the model outputs, called calibration, and (2) reasonable dispersion in the model outputs, provided they are calibrated, which is referred to as sharpness. In statistics, the trend is to build proper scores that access simultaneous calibration and sharpness.
Various functions can be used to assess the error between the verification data and model outputs; however, scoring is not restricted to pointwise comparison. Methods to evaluate the quality of unidimensional outputs are well understood [16]; however, the evaluation of multidimensional outputs or ensemble of outputs has been addressed in the literature more recently [19, 38, 44] and remains challenging. The energy and variogrambased scores (detailed below) are well suited to multiple multidimensional realizations of a same model to be verified. Therefore, in this paper we will focus on these scores (and combination of them). In what follows, let us denote by the statistical part of the score independent of the parameter .
A widely accepted score is the continuous ranked probability score (CRPS):
(4) 
where
is the cumulative distribution function (CDF) of
, , and is the Heavyside function. The CRPS computes a distance between a full probability distribution and a single deterministic observation, where both are represented by their CDF. However, this score is only univariate and cannot be used if the dimension of the observations is larger than one. Moreover, one typically does not have access to a functional probability distribution function (PDF). Therefore we will use below samples from the PDF.For this reason and since the variogrambased score is defined directly using the samples, in the following, we express the energy and variogrambased scores for discrete arguments as .
2.1 Energy score
The energy score [19] is multivariate and proper. It generalizes CRPS (4) from univariate to multivariate and can be expressed as
(5) 
This score is sensitive to bias and variance discrepancy, but potentially less sensitive to correlations; it will be denoted as the ESscore.
In the probabilistic forecast context, scores can be used as a loss function to fit probabilistic predictive distributions to observations; for instance, see [43]. Similarly to this idea, we propose to use statistical proper scores as objective functions in the underlying inverse problems. In this context, the score could, for instance, be the energy score (2b); and if only samples from distribution are available, it can be defined as follows:
(6) 
where is the number of samples and are the observations corresponding to th sample from the right hand side, in other words, , evaluated at parameter . Here is a linear observation operator that extracts measurements from .
2.2 Variogram score
The variogrambased score [44] is multivariate, proper, more sensitive to covariance (structure) but less to bias. It is given by
(7) 
where we take , is the dimension of observations (e.g., number of observational points in space), and is a function of the distance of the position of observation and observation . In other words we take differences between every observation and then of the corresponding expectation of the scenarios. If we denote , where is the unit vector with the component , the preceding equation becomes
(8) 
This will be referred to as the VSscore.
2.3 Discussion and other scores
The energy score is known for failing to discriminate misspecified correlation structures of the fields, but it successfully identifies fields with expectation similar to the one of the verifying data. On the other hand, the variogrambased score fails to discriminate fields with misspecified intensity, but it discriminates between correlation structures [39, 44]. Because of these different features and in order to discriminate fields according to their intensity and correlation structure, we propose to use a linear combination of the two scores, namely,
(9) 
where and are problem specific. We will refer to this hybrid score as the HSscore. It is also a proper score because any linear positive combination of proper scores remains a proper score.
The score functions defined above are referred to as instantaneous scores because they are functions of one verification sample. If more than one verification sample is available—for example, if we have samples of
from the true distribution—then we can estimate the mean score defined as follows:
(10) 
In most cases, scores are used on verification data that are assumed to be perfect. In practice, however, observations are almost always tainted with errors. Limited recent studies on forecast verification attempt to address this issue [13, 32]
. Incorporating error and uncertainty in the scoring setup is challenging. Analytical results are tractable only in particular cases such as linear or multiplicative noise with Gaussian and Gamma distributions, respectively. One way of tackling the observational error is to assume some probability distributions for the observations and the model outputs and to consider a new score defined as the conditional expectation of the original score given the observations
[32]. Using the notations of Definition 1, we can express the corrected score as , where represents the hidden true state of the system. In practice, to implement this method, one has to assume some distribution for the and and access an estimation of the distribution parameters. If the errors are i.i.d., then their contribution can be factored out when using the score as a loss function. This is the case under consideration in this study.3 Model problems
To probe the proposed statistical treatment of inverse problems constrained by differential equationsbased models with stochastic inputs, we consider two model problems. The first is a coefficient field inversion for subsurface flow governed by an elliptic PDE with a stochastic input, in other words, a PDEconstrained model problem (Section 4); the second is a parameter identification problem for power grid governed by DAEs with stochastic input, in other words, a DAEconstrained model problem (Section 5). For both problems, we generate synthetic observations by using one or more samples , . These samples then enter into the forward models with a parameter considered the truth, . We note that can also be a distribution. We then solve the optimization problem (2a) for the MAP point by evaluating the likelihood or a posteriori function with the precomputed scenarios such that , . To solve the PDEconstrained model problem efficiently, in Section 4.1 we derive the gradient of the objective function with respect to these fixed scenarios using adjoints. We remark that our calculations use classical Monte Carlo to estimate the solution of the problem at hand; however, more sophisticated methods such as higherorder [20] or multilevel [15] Monte Carlo methods can be used to solve the underlying stochastic PDE.
4 Model problem 1: Coefficient field inversion in an elliptic PDE with a random input
In this section, we study the inference of the log coefficient field in an elliptic partial differential equation with a random/stochastic input. This example can model, for instance, the steadystate equivalent for groundwater flows [31]. For simplicity, we state the equations using a deterministic righthand side. We will then turn our attention to the case where the volume source terms are stochastic. To this end, consider the forward model
(11) 
where () is an open bounded domain with sufficiently smooth boundary , . Here, is the state variable; , , and are volume, Dirichlet, and Neumann boundary source terms, respectively; and is an uncertain parameter field in , where is a Laplacianlike operator, as defined in [47, 1] and for completeness repeated in Section 4.3. To state the weak form of (11), we define the spaces,
where is the Sobolev space of functions in with square integrable derivatives. Then, the weak form of (11) is as follows: Find such that
Here and denote the standard inner products in and , respectively.
In what follows we treat as a stochastic term, denoted by for consistency, given by a twodimensional heterogeneous Gaussian process with known distribution . We use the instantaneous scores defined in Section 2 for this model.
4.1 Adjoint and gradient derivation
We apply an adjointbased approach to derive gradient information with respect to the parameter field for the optimization problem (2a). The adjoint equations are derived through a Lagrangian formalism [50]. To this end, the Lagrangian functional can be written as
(12) 
where is the adjoint corresponding to state . The formal Lagrangian formalism yields that, at a minimizer of (2a), variations of the Lagrangian functional with respect to all variables vanish. Thus we have
equationparentequation
(13a)  
(13b)  
(13c) 
for all variations and . Note that (13a) and (13b) are the weak forms of the state and of the adjoint equations, respectively. The adjoint righthand side in strong form for the energy score (6) is
(14) 
and for the variogram score (7) the component of is
(15) 
for , and for . Here
(16)  
where denotes the component of , namely, .
4.2 Computational approach and cost
The inverse problems (2) are solved by using hIPPYlib (an inverse problem Python library [52, 51]). It implements stateoftheart scalable adjointbased algorithms for PDEbased deterministic and Bayesian inverse problems. It builds on FEniCS [12, 30] for the discretization of the PDEs and on PETSc [4, 5] for scalable and efficient linear algebra operations and solvers needed for the solution of the PDEs.
The gradient computation technique presented in the preceding section allows using stateoftheart nonintrusive computational techniques of nonlinear programming to solve the estimation problems (2) efficiently for the energy and variogram scores we propose, as well as any combination of them. More specifically, we use a quasiNewton limitedmemory secant algorithm with BFGS updates for the inverse of the Hessian [34, 8] and an Armijo linesearch [34] to solve (2) as an unconstrained optimization problem. This quasiNewton solution approach is appealing since it can have fast local convergence properties similar to Newtonlike methods without requiring Hessian evaluations and it also converges from remote starting points as robust as a gradientbased algorithm. In our computations the total number of quasiNewton iterations was reasonably low, varying between and . The implementation in hIPPYlib uses an efficient compact limitedmemory representation [11] of the inverse Hessian approximation that has reduced space and time computational complexities, namely, , where denotes the cardinal of the discretization vector of and is the length of the quasiNewton secant memory (usually taken as .
The computational cost per iteration is overwhelmingly incurred in the evaluation of the objective function in (2) and its gradient. For both the energy and variogram scores, the evaluation of the objective and its gradient requires forward PDE solves and adjoint PDE solves, respectively, to compute states in (13a) and adjoint variables in (13b). To achieve this, the projected states appearing in (6) and (14) are stored in memory to avoid the expensive reevaluations of the PDEs and state projections for the computation of the score in (6) and adjoint righthand side in (14). Similarly, for the variogram score, in the evaluation of the objective function we save the terms () as a matrix for each and reuse them in the computation of the adjoint righthand sides (15) during the objective gradient evaluation. This approach effectively avoids expensive reevaluations of the PDEs at the cost of extra storage.
From (6) and (14) one can see that the computation of the energy score and its gradient also includes a complexity term in addition to the forward and adjoint solves. A similar extra complexity term is present in the computation of the variogram score from (7) and its adjoint righthand size from (15).
Undoubtedly, the objective and gradient computations can be parallelized efficiently for both scores because of the presence of the summation operators. In particular, both scores allow a straightforward scenariobased decomposition that allows the PDE (forward and adjoint) solves to be done in parallel. Coupled with the (lowerlevel) parallelism achievable in hIPPYlib via DOLFIN and PETSc, this approach would result in an effective multilevel decomposition with potential for massive parallelism and would allow tackling complex PDEs and a large number of scenarios. The quasiNewton method based on secant updates used in this work can be parallelized efficiently, as one of the authors has shown recently [35].
We remark that a couple of potential nontrivial parallelization bottlenecks exist. For example, both the energy score and variogram score apparently require nontrivial interprocess communication in computing the righthand side (14) of the adjoint systems as well as in the computation of the double summation in the score itself. In this work we have used only serial calculations and deferred for future investigations efficient parallel computation techniques addressing such concerns.
4.3 Computational experiment
In this section we present the numerical experiment setup for the forward and inverse problems.
Forward problem: For the forward problem (11), we assume an unknown volume forcing, (i.e., ) and noflow conditions on , in other words, the homogeneous Neumann conditions on . The flow is driven by a pressure difference between the top and the bottom boundary; that is, we use on and on . This Dirichlet part of the boundary is denoted by . In Figure 1, we show the “truth” permeability used in our numerical tests, and the corresponding pressure.
The stochastic forcing term: The stochastic volume forcing is given by a twodimensional heterogeneous Gaussian process with known distribution defined by
(18) 
where , and . We choose , , , and . In Figure 2 we illustrate the forcing field and solution used to generate the observations as well as two other realizations of the forcing field along with their corresponding solutions.
The prior: Following [47], we choose the prior to be Gaussian; that is, is a prior distribution, where is the mean and is the covariance operator of the prior, modeled as the inverse of an elliptic differential operator. To study the effect of the prior on our results, we use an informative prior and the standard prior, both built in hIPPYlib [52]. The informative prior is constructed by assuming that we can measure the logpermeability coefficient at five points, namely, , in , namely, , , , , , as in [52]. This prior is built by using mollifier functions
The mean for this prior is computed as a regularized leastsquares fit of the point observations , by solving
(19) 
where is a differential operator of the form
(20) 
equipped with homogeneous natural boundary conditions, , and is a realization of a Gaussian random field with zero average and covariance matrix . Above
is an s.p.d. anisotropic tensor,
, and control the correlation length and the variance of the prior operator; in our computations we used and . The covariance for the informative prior is defined as , where , with a penalization constant taken as 10 in our computations. The standard prior distribution is , with .We note that the prior in finite dimensions is given by
(21) 
where is the discretization of the prior covariance operator and is a mass weighted inner product [10, 36]. In the Bayesian formulation, the posterior is obtained as . By taking the negative log of the posterior, the objective in (2a) becomes
(22) 
While the statistical assumptions ease the computations, this objective can take a different form under different functional likelihood or prior expressions. However, the overall MAP finding procedure will broadly follow the same steps.
4.4 Results
The aim of this computational study is twopronged. On the one hand, our goal is to invert for the unknown (or uncertain) parameter field with the measure of success being the retrieval of a parameter field close to the “truth”: this is the traditional inverse problem approach. On the other hand, we aim to generate accurate predictions in some statistical sense; for example, we are interested in covering well the multivariate distribution of the observables. We will therefore carry out two analyses: one focused on the inverted parameter field and one on the model output (i.e., the observables). In each of the analyses, to assess the inversion quality of our approach, we will use the standard root mean square error (RMSE), the Structural SIMilarity (SSIM) index [54], and the rank histogram as a qualitative tools. The SSIM is the product of three terms (luminance, contrast, and structure) evaluating respectively the matching of intensity between the two datasets and , the variability, and the covariability of the two signals. In statistical terms, luminance, contrast, and structure can be seen as evaluating the bias, variance, and correlation between the two datasets, respectively. SSIM is expressed as
where , , and
respectively are the mean, standard deviation, and crosscovariance of each dataset, and
, , and are constants derived from the datasets. The SSIM takes values between and . The closer to the values are, the more similar the two signals are in terms of intensity, variability, and covariability. Researchers commonly also investigate the three components (luminance, contrast, variability) separately, as done hereafter.For a visual assessment of the statistical consistency between two datasets in terms of probability distributions [2, 23] we use the rank histogram, an assessment tool often used in forecast verification. This rank histogram gives us an idea about the statistical consistency for the two datasets. The more uniform the histogram is, the more statistically consistent (i.e., sharp and calibrated) it is.
We solve the optimization problem (2a) with as the energy and variogram scores and the forward model given by (11). To understand the effect of the prior on the inversion results, for our numerical studies, we consider two priors: an informed prior and a standard prior, as discussed in Section 4.3. In what follows, we discuss the inversion results.
Comparison of MAP and true parameters
In Figure 3 we show the difference between the true parameter and the MAP estimate for both the informative (top) and standard priors (bottom), as well as for the energy (left column) and variogram (center column) scores. In the topright panel we also show the initial guess for the optimization solver. The results reveal that the VSscore objective displays a stronger match between the MAP and than does the ESscore one, whereas the HSscore falls in between. The HSscore coefficients in (9) are chosen to be and , with a betterinformed choice possible but not fully explored in this study. Models with the informed prior exhibit smaller discrepancies than do the models with the standard prior. These results are displayed in Table 1.
a) informed prior: energy score (left), variogram score (center) and initial guess (right)  
b) standard prior: energy score (left), variogram score (right) 
Table 1 shows the RMSE and SSIM and its 3 components, computed between the parameters and . As expected, being in the informed prior case leads to smaller RMSE and better SSIM for the inverted parameter, in other words, better overall performance. Additionally, the contrast term, which is related to the variance of each signal, is well captured by the three scores and by both types of priors (standard and informed). In particular, we note that the use of standard priors degrades the capture of the intensity of the parameters given by the luminance term. As expected, the VSscore and HSscore perform better than the ESscore at capturing the covariance between the two parameters and (structure term).
Samples  Luminance  Contrast  Structure  SSIM  RMSE  

Model  (a) Informed prior  
ES  1  0.847  1  0.698  0.591  1.137 
4  0.801  0.989  0.797  0.631  1.134  
8  0.824  0.995  0.777  0.637  1.108  
32  0.803  0.996  0.765  0.612  1.155  
64  0.793  0.992  0.757  0.596  1.176  
128  0.786  0.995  0.759  0.594  1.187  
VS  1  0.825  0.997  0.755  0.621  1.124 
4  0.955  0.976  0.868  0.809  0.696  
8  0.966  0.967  0.859  0.803  0.66  
32  0.98  0.948  0.848  0.788  0.617  
64  0.982  0.939  0.846  0.78  0.612  
128  0.987  0.947  0.855  0.799  0.574  
HS  1  0.837  1  0.722  0.605  1.136 
4  0.935  0.981  0.868  0.796  0.765  
8  0.947  0.979  0.853  0.737  0.737  
32  0.958  0.973  0.833  0.776  0.719  
64  0.960  0.966  0.83  0.769  0.716  
128  0.962  0.972  0.836  0.782  0.697  
Model  (b) Standard prior  
ES  1  0.729  0.997  0.442  0.321  2.988 
4  0.336  1  0.371  0.125  2.535  
8  0.29  0.999  0.385  0.112  2.483  
32  0.386  0.989  0.469  0.179  2.538  
64  0.283  0.985  0.469  0.131  2.43  
128  0.371  0.996  0.412  0.152  2.548  
VS  1  0.828  0.991  0.527  0.432  3.146 
4  0.547  0.999  0.447  0.245  2.737  
8  0.268  0.998  0.363  0.097  2.017  
32  0.412  0.995  0.46  0.189  1.8  
64  0.839  0.995  0.43  0.359  1.3  
128  0.863  0.999  0.402  0.346  1.292  
HS  1  0.85  0.989  0.501  0.421  3.202 
4  0.351  0.998  0.42  0.147  2.527  
8  0.192  1  0.406  0.078  2.385  
32  0.223  0.989  0.485  0.107  2.37  
64  0.023  0.992  0.465  0.011  2.194  
128  0.074  0.998  0.421  0.031  2.265 
Comparison between and
To assess the quality and statistical properties of the observables generated by the model, in Figure 4 we show the rank histograms reflecting the statistical consistency between the true observables and the generated ones . The results show that the standard priors (right row) provide a better calibration between and than do the informed priors (central row). Additionally, the results show that the ESscore (top row) generates calibrated . This is not unexpected since the energy score is known for discriminating between the intensity of the signals it compares [39]. The VSscore (center row) does not present good calibration results. This result is not unexpected either since the variogram score is known for not capturing the intensity of the signals it compares [44]. We remark that the observables from the VSscore show some overdispersion (bellshaped histogram). The hybrid score seems to take advantage of the properties of the energy score in terms of calibration and thus appears to be a good compromise between the ESscore and the VSscore. We note that the rank histograms do not assess the correlation structure of the data. In that context we investigate in the following indexes that measure the spatial data structure.
initial guess  ES (informed prior)  ES (standard prior) 
VS (informed prior)  VS (standard prior)  
HS (informed prior)  HS (standard prior) 
confidence intervals associated with the estimated count histogram. The closer to the uniform histogram, the better the consistency between
and . Histograms are obtained for simulations with 64 samples.In order to investigate the spatial structure of the observables, a metric assessing structural feature, namely, the SSIM, is computed. For each generated sample, in order to assess the overall error between the signals, the SSIM and RMSE are computed between the true and the recovered observable. The values of the metrics are summarized in boxplots in Figure 5. Comparable to Figure 4, Figure 5 shows that the ESscore and HSscore provide better results than does the VSscore in terms of recovering the observables . The metrics tend to have more variability for the VSscore and HSscore when the number of samples increases. This variability likely comes from the overdispersion of the outputs of the VSscore. We note, however, that the overall range of the bulk of the distribution (box) stays reasonably narrow. The hybrid score thus appears to be a good compromise between the ESscore and VSscore.
RMSE  
ES (standard prior)  VS (standard prior)  HS (standard prior) 
ES (informed prior)  VS (informed prior)  HS (informed prior) 
SSIM  
ES (standard prior)  VS (standard prior)  HS (standard prior) 
ES (informed prior)  VS (informed prior)  HS (informed prior) 
As a conclusion, the overall method shows a wide range of results in this experimental setup. In terms of capturing the parameter field and the intensity and the variability of , the VS and HS scores show better agreement with the true one than does the ESbased model. For observables, however, the ESscore shows good results in capturing statistics of the data . As expected, the informed priors help capture the parameter better, as seen on Table 1 whereas the standard prior case gives a better calibration between and and more accurate intensity and structure of (see Figures 4 and 5) arguably by relaxing the parameter constrained through the prior.
5 Model problem 2: Parameter identification in power grid applications governed by DAEs
Next we probe the proposed scores on a power grid inverse model problem governed by an index1 DAE system. This model incorporates an electromagnetic machine, a slack bus, and a stochastic load as illustrated in Figure 6.
We model the power grid using the generator, current, and network equations [42], namely, equationparentequation
(23a) 
Here is associated mainly with generators, and represents current (part of the generator equations) and the network equations (Kirchhoff). The full set of equations is given in Appendix A. The unknown (or inversion) parameter here is and represents the generator inertia. This parameter can be interpreted as how fast the generator reacts to fluctuations in the network. In our case represents fluctuations in the load, where and represent the real and imaginary resistive components, respectively.
In recent work we have explored estimating the inertia parameters in a standard 9bus system given a single known disturbance in the load from synthetic bus voltage observations [37]. In this study we pose the problem as having a small signal disturbance in the load, which is a discrete process in time. This problem now describes a realistic behavior of smallscale consumers drawing power from the grid in an unobserved fashion. What we consider to be known is the distribution of the probabilistic load process. Moreover, we assume that we measure the power flow (or voltage) at one of the buses (see Figure 6).
5.1 Computational experiment
We start with the system at dynamic steady state having and . The system is integrated with backward Euler with stochastic forcing. This discretization is then equivalent to firstorder weak convergence in the mean square error sense of the stochastic DAE. Higherorder methods have been used as well, without noticing any qualitative differences for our current setup. The time series generated by the stochastic forcing is given by two independent stationary processes with known distribution :
(24) 
where and, as before, . Some realizations of these time series are shown in Figure 7(c)7(d).
The simulation time is seconds, with a time step of (10 ms). From the 10second window we extract 5 seconds (seconds 3 to 8) to avoid initialization or mixture issues. We consider an ensemble of 1,000 samples integrated with this time step, with 800 being considered as numerical simulations and 200 set aside for observations. In this experiment we do not consider observational noise and do not need to use any regularization. The optimization problem becomes a univariate unconstrained program, and therefore we approximate the gradient with finite differences. Nevertheless, one can compute the gradients via adjoints as has been done for model problem 1 in Section 4 as well.
5.2 Analysis of the results
The timedependent setting allows us to analyze different aspects of the inverse problem solution. For instance, in the steadystate case such as the first model problem, the spatial domain is fixed, and in general so is the number of observations. In the unsteady example, one typically has control over the observation window. To this end, we begin by exploring the effect of adding observations to the inference process and thus using the mean score (10). Specifically, we compute the score values at integer values of the parameter, from 1 to 35, and use 1 to 200 batches of observations. In other words we explore the mean score values with . One batch is the time series obtained under a stochastic forcing realization. Each batch can be seen as a 5second simulation, and this assumes that the distribution is stationary for the entire inference window. The observations of the voltage and corresponding to the middle bus are taken at every time step.
We illustrate the results in Figure 8 for the ESscore and VSscore computed with respect to two exact values of the parameter, 10 and 20, respectively. We observe that both the energy and variogram scores converge to the exact value, with the variogram score converging much faster than the energy score especially when the exact value of the parameter is 10. We also remark that convergence guarantees are not easy to ascertain a priori; as reflected in the figure, a different number of observations are necessary in order to reach an accurate conclusion. One possible strategy to mitigate this issue is to use two different scores and observe the system until they are in agreement and do not change with additional observations.
The numerical results are carried out in Matlab using the default optimization solver. In Figure 9 we show the reconstructed parameter values as a function of function evaluations for the exact parameter values (black dashed line) 10 (left) and 20 (right). The results for the energy score are shown in red (cross) and for the variogram score in blue (circles). The bounds set to truth used in the optimization solver are shown with a black solid line. This figure shows that with both scores the optimization solver converges to a relatively good estimate of the exact parameter value in a relatively small number of function evaluations.
6 Conclusions
We have presented a statistical treatment of inverse problems governed by physicsbased models with stochastic inputs. The goal of this study is to quantify the quality of the inverted parameter field, measured by a comparison with the “truth,” and the quality of the recovered observable: for example, given the inverted parameter field, quantify how well we fit the distribution of the observable. The end goal of our study is to introduce a Bayesian inverse problem formulation that facilitates the integration of data with physicsbased models in order to quantify the uncertainties in model predictions. To this end, inspired from statistics, we propose to replace the traditional leastsquares minimization problem—which minimizes the norm of the misfit between data and observables—with a set of loss functions that describes quantitatively the distance between the distributions of model generated data and the distribution of observational data. We refer to these metrics as scores, as known in the statistics community.
To compute the maximum a posteriori (MAP) point for the proposed Bayesian inverse problem, we solve an optimization problem constrained by the physicsbased models under stochastic inputs with a quasiNewton limitedmemory algorithm. For efficient calculation of the gradient of the objective with respect to the inversion parameters, we derive adjointbased expressions. Several challenges are associated with solving such optimization problems. First, these inverse problems are large scale, stemming from discretization of the parameter field in the case of PDEbased models or the size of the power grid network. Second, although we employ an efficient method to calculate derivatives, the number of adjoint solves increases with the number of samples and are coupled. As we indicated above, however, the communication during the adjoint calculations follows a fixed pattern and can be optimized for, arguably resulting in overall scalable strategies. Third, structured error in measurements (data) requires special attention, and convergence guarantees are not easy to ascertain a priori. Nevertheless, as we discussed for the second model, one can use a crossvalidation strategy to ascertain and mitigate said convergence issues.
We have studied the performance of the proposed formulation in the context of two applications: a coefficient field inversion for subsurface flow governed by an elliptic PDE with a stochastic source and a parameter inversion for power grid governed by DAEs. In both cases the goal was to obtain predictive probabilistic models that explain the data.
Acknowledgments
We thank Michael Scheuerer for providing helpful comments on scoring functions and Umberto Villa for helpful discussions about hIPPYlib. The work of C. G. Petra was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DEAC5207NA27344. This material also was based upon work supported by the U.S. Department of Energy, Office of Science, under contract DEAC0206CH11357.
References
 [1] A. Alexanderian, N. Petra, G. Stadler, and O. Ghattas, A fast and scalable method for Aoptimal design of experiments for infinitedimensional Bayesian nonlinear inverse problems, SIAM Journal on Scientific Computing, 38 (2016), pp. A243–A272, https://doi.org/10.1137/140992564, http://dx.doi.org/10.1137/140992564.
 [2] J. L. Anderson, A method for producing and evaluating probabilistic forecasts from ensemble model integrations, Journal of Climate, 9 (1996), pp. 1518–1530.
 [3] G. Bal, I. Langmore, and Y. Marzouk, Bayesian inverse problems with Monte Carlo forward models, Inverse Problems & Imaging, 7 (2013), p. 81, https://doi.org/10.3934/ipi.2013.7.81, http://aimsciences.org//article/id/d904b995d45b482cb6973347afcb9c98.
 [4] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc home page, 2001. http://www.mcs.anl.gov/petsc.
 [5] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc Web page, 2009. http://www.mcs.anl.gov/petsc.
 [6] S. Baran, Probabilistic wind speed forecasting using bayesian model averaging with truncated normal components, Computational Statistics & Data Analysis, 75 (2014), pp. 227–238.
 [7] S. Baran and S. Lerch, Lognormal distribution based ensemble model output statistics models for probabilistic windspeed forecasting, Quarterly Journal of the Royal Meteorological Society, 141 (2015), pp. 2289–2299.
 [8] A. Borzì and V. Schulz, Computational Optimization of Systems Governed by Partial Differential Equations, SIAM, 2012.
 [9] A. Borzì and G. Von Winckel, Multigrid methods and sparsegrid collocation techniques for parabolic optimal control problems with random coefficients, SIAM Journal on Scientific Computing, 31 (2009), pp. 2172–2192.
 [10] T. BuiThanh, O. Ghattas, J. Martin, and G. Stadler, A computational framework for infinitedimensional Bayesian inverse problems Part I: The linearized case, with application to global seismic inversion, SIAM Journal on Scientific Computing, 35 (2013), pp. A2494–A2523, https://doi.org/10.1137/12089586X.
 [11] R. H. Byrd, J. Nocedal, and R. B. Schnabel, Representations of quasinewton matrices and their use in limited memory methods, Mathematical Programming, 63 (1994), pp. 129–156, https://doi.org/10.1007/BF01582063, http://dx.doi.org/10.1007/BF01582063.
 [12] T. Dupont, J. Hoffman, C. Johnson, R. Kirby, M. Larson, A. Logg, and R. Scott, The FEniCS project, tech. report, 2003.
 [13] C. A. T. Ferro, Measuring forecast performance in the presence of observation error, Quarterly Journal of the Royal Meteorological Society, 143 (2017), pp. 2665–2676, https://doi.org/10.1002/qj.3115, http://dx.doi.org/10.1002/qj.3115.
 [14] Y. Gel, A. E. Raftery, and T. Gneiting, Calibrated probabilistic mesoscale weather field forecasting: The geostatistical output perturbation method, Journal of the American Statistical Association, 99 (2004), pp. 575–583.
 [15] M. Giles, Multilevel Monte Carlo methods, Acta Numerica, 24 (2015), pp. 259–328.
 [16] T. Gneiting and M. Katzfuss, Probabilistic forecasting, Annual Review of Statistics and Its Application, 1 (2014), pp. 125–151.
 [17] T. Gneiting and A. E. Raftery, Strictly proper scoring rules, prediction, and estimation, Journal of the American Statistical Association, 102 (2007), pp. 359–378.
 [18] T. Gneiting, A. E. Raftery, A. H. Westveld III, and T. Goldman, Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation, Monthly Weather Review, 133 (2005), pp. 1098–1118.
 [19] T. Gneiting, L. I. Stanberry, E. P. Grimit, L. Held, and N. A. Johnson, Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds, Test, 17 (2008), pp. 211–235.
 [20] M. Gunzburger, N. Jiang, and Z. Wang, A secondorder timestepping scheme for simulating ensembles of parameterized flow problems, Computational Methods in Applied Mathematics, (2017).
 [21] M. Gunzburger, C. Webster, and G. Zhang, Stochastic finite element methods for partial differential equations with random input data, Acta Numerica, 23 (2014), p. 521–650, https://doi.org/10.1017/S0962492914000075.
 [22] M. Hairer, Introduction to Stochastic PDEs. Lecture Notes, 2009.
 [23] T. M. Hamill, Interpretation of rank histograms for verifying ensemble forecasts, Monthly Weather Review, 129 (2001), pp. 550–560.
 [24] J. Kaipio and V. Kolehmainen, Bayesian Theory and Applications, Oxford University Press, 2013, ch. Approximate Marginalization Over Modeling Errors and Uncertainties in Inverse Problems, pp. 644–672.
 [25] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, vol. 160 of Applied Mathematical Sciences, SpringerVerlag, New York, 2005.
 [26] E. Kalnay, Atmospheric Modeling, Data Assimilation and Predictability, Cambridge University Press, 2003.
 [27] R. Kass and A. Raftery, Bayes factors, Journal of the American Statistical Association, 90 (1995), pp. 773–795, https://doi.org/10.1080/01621459.1995.10476572.
 [28] S. Lerch and T. L. Thorarinsdottir, Comparison of nonhomogeneous regression models for probabilistic wind speed forecasting, Tellus A: Dynamic Meteorology and Oceanography, 65 (2013), p. 21206.
 [29] H. Lie, T. Sullivan, and A. Teckentrup, Random forward models and loglikelihoods in Bayesian inverse problems, ArXiv eprints, (2017), https://arxiv.org/abs/1712.05717.
 [30] A. Logg, K.A. Mardal, and G. N. Wells, eds., Automated Solution of Differential Equations by the Finite Element Method, vol. 84 of Lecture Notes in Computational Science and Engineering, Springer, 2012.
 [31] S. Mattis, T. Butler, C. Dawson, D. Estep, and V. Vesselinov, Parameter estimation and prediction for groundwater contamination based on measure theory, Water Resources Research, 51 (2015), pp. 7608–7629.
 [32] P. Naveau and J. Bessac, Forecast evaluation with imperfect observations and imperfect models. 2018.
 [33] R. Nicholson, N. Petra, and P. J. Kaipio, Estimation of the Robin coefficient field in a Poisson problem with uncertain conductivity field, Inverse Problems, 34 (2018), p. 115005.
 [34] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, New York, 2nd ed., 2006.
 [35] C. G. Petra, A memorydistributed quasiNewton solver for nonlinear programming problems with a small number of general constraints, in review to the Journal of Parallel and Distributed Computed, (2017).
 [36] N. Petra, J. Martin, G. Stadler, and O. Ghattas, A computational framework for infinitedimensional Bayesian inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet inverse problems, SIAM Journal on Scientific Computing, 36 (2014), pp. A1525–A1555.
 [37] N. Petra, C. Petra, Z. Zhang, E. Constantinescu, and M. Anitescu, A Bayesian approach for parameter estimation with uncertainty for dynamic power systems, IEEE Transactions on Power Systems, 32 (2017), pp. 2735–2743, https://doi.org/10.1109/TPWRS.2016.2625277.
 [38] P. Pinson and R. Girard, Evaluating the quality of scenarios of shortterm wind power generation, Applied Energy, 96 (2012), pp. 12–20.
 [39] P. Pinson and J. Tastu, Discrimination ability of the energy score, tech. report, Technical University of Denmark, 2013.
 [40] A. E. Raftery, T. Gneiting, F. Balabdaoui, and M. Polakowski, Using Bayesian model averaging to calibrate forecast ensembles, Monthly Weather Review, 133 (2005), pp. 1155–1174.
 [41] E. Rosseel and G. Wells, Optimal control with stochastic PDE constraints and uncertain controls, Computer Methods in Applied Mechanics and Engineering, 213 (2012), pp. 152–167.
 [42] P. W. Sauer and M. Pai, Power system dynamics and stability, Prentice Hall Upper Saddle River, NJ, 1998.
 [43] M. Scheuerer, Probabilistic quantitative precipitation forecasting using ensemble model output statistics, Quarterly Journal of the Royal Meteorological Society, 140 (2014), pp. 1086–1096.
 [44] M. Scheuerer and T. M. Hamill, Variogrambased proper scoring rules for probabilistic forecasts of multivariate quantities, Monthly Weather Review, 143 (2015), pp. 1321–1334.
 [45] M. Scheuerer and D. Möller, Probabilistic wind speed forecasting on a grid based on ensemble model output statistics, The Annals of Applied Statistics, 9 (2015), pp. 1328–1349.
 [46] J. M. L. Sloughter, T. Gneiting, and A. E. Raftery, Probabilistic wind speed forecasting using ensembles and Bayesian model averaging, Journal of the American Statistical Association, 105 (2010), pp. 25–35.
 [47] A. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica, 19 (2010), pp. 451–559.
 [48] T. Thorarinsdottir, T. Gneiting, and N. Gissibl, Using proper divergence functions to evaluate climate models, SIAM/ASA Journal on Uncertainty Quantification, 1 (2013), pp. 522–534.

[49]
T. L. Thorarinsdottir and T. Gneiting,
Probabilistic forecasts of wind speed: ensemble model output statistics by using heteroscedastic censored regression
, Journal of the Royal Statistical Society: Series A (Statistics in Society), 173 (2010), pp. 371–388.  [50] F. Tröltzsch, Optimal Control of Partial Differential Equations: Theory, Methods and Applications, vol. 112 of Graduate Studies in Mathematics, American Mathematical Society, 2010.
 [51] U. Villa, N. Petra, and O. Ghattas, hIPPYlib: an Extensible Software Framework for Largescale Deterministic and Bayesian Inversion, (2016), https://doi.org/10.5281/zenodo.596931, http://hippylib.github.io.
 [52] U. Villa, N. Petra, and O. Ghattas, hIPPYlib: An extensible software framework for largescale inverse problems, Journal of Open Source Software, 3 (2018), p. 115005, https://doi.org/10.21105/joss.00940.
 [53] C. R. Vogel, Computational Methods for Inverse Problems, Frontiers in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2002.
 [54] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing, 13 (2004), pp. 600–612.
 [55] N. Zabaras and B. Ganapathysubramanian, A scalable framework for the solution of stochastic inverse problems using a sparse grid collocation approach, Journal of Computational Physics, 227 (2008), pp. 4697–4735.
Appendix A Power grid equations
Below we present the equations extracted from [42] for the power grid example discussed in Sec. 5. The one generator 3bus system is described by index1 DAEs. Here we have 7 differential equations and 8 algebraic equations. The differential variables are the first seven variables, with the rest being algebraic.
The initial condition that gives a steady state is given by , here and .
The stochastic noise is characterized by and the parameter sought is , while observing the voltage at the slack bus, and .