Continuous advances in numerical methods and computational technology have made it feasible to simulate large-scale physical phenomena such as weather systems, computer vision, and medical imaging. Mathematical models are widely used in practice to predict the behavioral patterns of such physical processes. In the applications we consider, the mathematical models are typically described by systems of partial differential equations (PDEs). However, parameters that are needed for a full description of the mathematical models, such as initial and boundary conditions, or coefficients, are typically unknown and need to be inferred from experimental data by solving an inverse problem. The acquisition of data is usually a laborious or expensive process and has a certain cost associated with it. Due to budgetary or physical considerations, often times, only a limited amount of data can be collected. Even in applications where collecting data is relatively cheap, processing large amounts of data can be computationally cumbersome, or a poor design may lead to wastage of resources, or may miss out on important information regarding the parameters of interest. Therefore, it is important to control the experimental conditions for data acquisition in a way that makes optimal use of resources to accurately reconstruct or infer the parameters of interest. This is known as Optimal Design of Experiments (ODE).
In this article we adopt the Bayesian approach for solving inverse problems. The Bayesian approach has the following ingredients: the data, the mathematical model, the statistical description of the observational noise, and the prior information about the parameters we wish to infer. Bayes’ theorem is used to combine these ingredients to produce the posterior distribution, which encapsulates the uncertainty in every aspect of the inverse problem. The posterior distribution can be interrogated in various ways: one can compute the peak of this distribution, called the maximum a posteriori probability (MAP) estimate, which estimates the posterior mode of the parameter, draw samples from this distributions, or compute the conditional mean. The Bayesian approach to ODE aims to minimize various measures of uncertainty in the inferred parameters by minimizing certain criteria based on the posterior distribution. Popular examples of design criteria include the Bayesian A- and D-optimal criteria[1, 2, 3]. For Gaussian posteriors, the A- and D-optimal criteria are defined as the trace and the log-determinant of the posterior covariance matrix, respectively.
ODE is an active area of research [4, 1, 3, 5, 2, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. Specifically, in the recent years a number of advances have been made on optimal design of experiments for large scale applications. The articles [25, 8, 9, 26, 27, 15, 18, 21] target A-optimal experimental designs for large-scale inverse problems. Fast algorithms for computing D-optimal experimental design criterion, given by expected information gain, for nonlinear inverse problems, were introduced in [12, 16, 28]. An efficient greedy algorithm for computing Bayesian D-optimal designs, with correlated observations, is introduced in . Choosing a D-optimal experimental design that targets a specific region in the parameter space is discussed in , where the optimal design minimizes the marginal uncertainties in the region of interest. The works  and  address theory and computational methods for Bayesian D-optimal design in infinite-dimensional Bayesian linear inverse problems. Motivated by goal-oriented approaches for parameter dimensionality reduction [32, 33, 34], our paper presents theory and methods for goal-oriented optimal design of experiments (GOODE).
There are two potential drawbacks in the standard Bayesian approach for ODE. First, in certain applications, what may be of interest is not the reconstructed parameter in itself, but some prediction quantity involving the reconstructed parameter. In this situation, it may be desirable to deploy valuable resources to collect experimental data so as to minimize the uncertainty in the end-goal, i.e. prediction, rather than the reconstructed quantity. Second, the reconstructed parameters are often spatial images or infinite dimensional functions. When discretized on a fine-scale grid, the resulting parameter dimension is very high. The posterior covariance matrix is also very high dimensional; forming and storing this covariance matrix explicitly is computationally infeasible on problems discretized on very fine grid resolutions. Consequently, evaluating the optimal design criteria is challenging. Randomized matrix methods[35, 36] have been instrumental in addressing such computational challenges. When the dimension of the predictions is smaller than the dimension of the reconstructed parameter, working in the prediction space may be computationally beneficial. These two reasons—the need for designs tailored to predictions and the computational savings offered by targeting low-dimensional prediction quantities—motivate us to propose goal-oriented criteria, and devise efficient algorithms for their computation and optimization.
As a motivating application, consider the transport of a contaminant in an urban environment. The inverse problem of interest here seeks to identify the source of the contaminant from measurements collected at sensor locations. The standard Bayesian approach to ODE involves controlling the sensor locations in order to reconstruct the source (represented as a spatial function) with minimized uncertainty over the entire domain. On the other hand, if instead of determining the initial condition, the goal is to predict the average contaminant around a building after a certain amount of time has elapsed, then the experimental design should explicitly account for this goal. This application is explored in detail in Sections 4 and 5. As a preview, in Figure 1 we show optimal sensor placements corresponding to three different goals: roughly speaking, the left, middle, and right figures depict sensor placements that are focused on the first building, the second building, and both buildings, respectively. This shows immediately, that incorporating the end goal, i.e., the target prediction, in the ODE problem results in different sensor placements, which may be valuable in practice. More details are provided in Section 5.
The main contributions of this article are as follows. We propose two goal oriented criteria—A-GOODE and D-GOODE—that are analogues of the A- and D-optimality criteria in classical Bayesian ODE. We investigate the properties of these two criteria by explaining connections to the classical Bayesian ODE criteria, and provide interpretations for the criteria by using tools from information theory. Then, we propose a gradient-based optimization framework for GOODE using these two criteria. To facilitate this, we derive expressions for the gradient of these objective functions, along with a computational recipe for computing them, for large-scale linear inverse problems governed by time-dependent PDEs. Our proposed strategy is implemented within the context of optimal sensor placement for the class of inverse problems under study. We enforce sparsity of the sensor placements using an -norm penalty approach, and propose a practical strategy for specifying the associated penalty parameter. In addition, we present comprehensive numerical results, in context of initial state inversion in a time-dependent advection-diffusion equation, to test the performance of the criteria and the proposed algorithms. The specific model problem under study is motivated by the inverse problem of contaminant source identification.
This article is organized as follows. Section 2 presents a brief overview of Bayesian inverse problems and optimal design of experiments. Section 3 formulates the goal-oriented optimal design of experiments problem, and presents the computational methods for solving such problems. Section 4 describes the model inverse advection-diffusion problem and the setup of numerical experiments used to test the proposed methods. Numerical results are detailed in Section 5. Concluding remarks are given in Section 6.
In this section, we briefly review the Bayesian formulation of an inverse problem, and some basics from Bayesian optimal design of experiments.
2.1 Bayesian inverse problem
Consider the problem of reconstructing an unknown parameter using noisy data and a model
is a centered random variable that models measurement noise. Specifically, we consider a Gaussian noise model. We consider the case that
is a (finite-dimensional) vector of measurement data,is an element of an appropriate infinite-dimensional real separable Hilbert space, and
is a continuous linear transformation, which will we will refer to as the parameter-to-observable map. In applications we target, where is a bounded domain in , with , or . The infinite-dimensional formulation, finite-element discretization, and numerical solution of this problem have been addressed in detail in , under the assumption of a Gaussian prior and additive Gaussian noise model, which is the setting we consider here.
To keep the presentation simple, we consider the discretized version of the problem. However, in what follows, we pay close attention to the issues pertaining to discretization of the infinite-dimensional problem and the infinite-dimensional limit. We denote the Gaussian prior for the discretized parameter as , and assume . Letting be the discretized parameter-to-observable map, the additive Gaussian noise assumption leads to the Gaussian likelihood,
That is, . It is well known (see e.g., [38, Chapter 3]) that, in this setting, the posterior is also a Gaussian with
It is also worth mentioning that the posterior mean is the minimizer of the following functional
The Hessian of the above functional, is given by
where denotes the Hessian of the data-misfit term in (4).
Note that for the linear operator , we use to denote its adjoint. The reason we do not simply use matrix transpose is as follows. The underlying infinite-dimensional Bayesian inverse problem is formulated on equipped with the standard inner product. As noted in , when discretizing the Bayesian inverse problem, we also need to pay attention to the choice of inner products. Namely, the discretized parameter space has to be equipped with a discretized inner product. If finite element method is used to discretize the inverse problem, then, the appropriate inner product to use is the Euclidean inner product weighted by the finite-element mass matrix . That is, for , in , we use . In the infinite-dimensional setting, the forward operator is a mapping from to ; the domain is endowed with the inner product and the co-domain is equipped with the Euclidean inner product, which we denote by . Upon discretization, we work with the discretized forward operator . For and , we have
from this we note . See  for further details on finite-element discretization of Bayesian inverse problems.
2.2 Bayesian Optimal design of experiments
Next, we turn to the problem of optimal design of experiments (ODE) for Bayesian linear inverse problems governed by PDEs, which has received great amount of attention in recent years [39, 11, 14, 12, 40]. In a standard Bayesian experimental design problem, we seek an experimental design that results in minimized posterior uncertainty in the inferred parameter . The way one chooses to quantify posterior uncertainty leads to the choice of the design criterion [1, 41, 42, 2, 43, 44, 45, 46, 4, 3]. When the posterior distribution is Gaussian, the standard experimental design criteria are defined as functionals of ; here denotes a generic vector of experimental design parameters. For example, the A-optimal design is found by minimizing the trace of the posterior covariance operator
On the other hand, a D-optimal design is obtained by minimizing the log-determinant of the posterior covariance
3 Goal-Oriented Optimal Design of Experiments
Classical Bayesian optimal experimental design constructs experimental designs that result in minimized posterior uncertainty on the inversion parameter . On the other hand, goal-oriented optimal design of experiments (GOODE) seeks designs that minimize the uncertainty associated with a goal quantity of interest (QoI), which is a function of . In this article, we consider a goal QoI of the form
where is a linear operator, which we call the goal operator. The parameter is inferred by solving an inverse problem, as described in the previous section.
As mentioned before, to keep the presentation simple, we work with discretized quantities. The goal operator is thus the discretization of a linear transformation that maps the inversion parameter, an element of , to a goal QoI. We consider the case where the goal QoI is finite-dimensional; i.e., the goal operator has a finite-dimensional range independent of discretization. This is motivated by many applications in which the end goal is either a scalar or a relatively low-dimensional vector. We denote the dimension of the goal by .
Assuming the Gaussian linear setting presented above, the prior and posterior laws of defined in (8) can be obtained as follows: the prior distribution law of is , with
where is the adjoint of the goal operator . The posterior distribution law of the goal QoI , conditioned on the observations , is also Gaussian and is given by , where
Below, we describe the GOODE criteria under consideration, and present a scalable framework for computing these criteria and their derivatives with respect to design parameters, which exploits the fact that the end goal is of much lower dimension than .
We focus on goal-oriented A- and D-optimal experimental designs (see Section 3.1). We also examine these criteria from a decision theoretic point of view. In the case of goal oriented Bayesian A-optimality, we show that the minimization of expected Bayes risk of the goal QoI is equivalent to minimizing the trace of the covariance operator . In the case of goal-oriented Bayesian D-optimality, we show that maximizing the expected information gain for the goal QoI is equivalent to minimizing log-determinant of . These results, which are natural extensions of the known results from classical Bayesian optimal experimental design theory, provide further insight on the interpretation of the presented GOODE criteria. In Section 3.2, we describe the precise definition of an experimental design vector that parameterizes a sensor placement and leads to formulation of a suitable optimization problem. This is followed by the discussion of the optimization problem for finding A- and D-GOODE criteria and a computational framework for computing the associated objective functions and their gradients in Section 3.3. Algorithmic descriptions of the proposed methodologies, along with a discussion of the computational cost, are outlined in Section 3.4. In Section 3.5, we further discuss the connections of the GOODE criteria to the corresponding classical experimental design criteria, and outline possible extensions to the cases of nonlinear goal operators.
3.1 GOODE criteria
Let, as before, denote a generic vector of experimental design parameters. (The precise definition of , in the case of optimal sensor placement problems, will be provided later in this section.) In this section, we will assume that has full row-rank, and consider a Bayesian linear inverse problem as formulated in Section 2.1. Here we propose and examine the implications of goal-oriented A- and D-optimal criteria.
Goal-oriented A-optimal criterion
The goal-oriented A-optimal design (A-GOODE) criterion is defined to be the trace of the posterior covariance matrix of the goal QoI:
Two special cases are worth pointing out here. In the special case is a row vector , we get , which is the classical C-optimality criterion. On the other hand, if , then this is nothing but the classical Bayesian A-optimal criterion.
Note that while is a high-dimensional operator, in practice usually has a low-dimensional range. Thus, in practice is a low-dimensional operator. From computational point of view, this means computing the A-GOODE criterion is significantly cheaper than that of the classical A-optimality which is .
Minimizing the A-GOODE criterion can be understood as minimizing the average variance of the goal QoI. However, having a measure of the statistical quality of the estimatoris also of crucial importance. Theorem 3.1 below addresses this by relating the goal-oriented A-optimal criterion and the expected Bayes risk of , which is nothing but the mean squared error averaged over the prior distribution.
Consider a Bayesian linear inverse problem as formulated in Section 2.1. Let be the estimator for the goal QoI, , where has full row-rank, then,
See Appendix A. ∎
Note that for notational convenience, and since the result holds pointwise in , we have suppressed the dependence on in the statement of the theorem.
Goal-oriented Bayesian D-Optimality criterion
The goal-oriented D-optimal design (D-GOODE) criterion is taken to be the log-determinant of the posterior end-goal covariance :
To explain the motivation for this optimality criterion, we consider the Kullback-Leibler (KL) divergence  from the posterior to prior distribution of the end-goal QoI :
Since both distributions are Gaussian, the KL-divergence has a closed form expression, which will simplify the calculations considerably. The expected information gain is defined as
The classical Bayesian D-optimality criterion is related to the expected information gain, quantified by the expected KL divergence between the posterior distribution and the prior distribution. A similar relation for the goal-oriented D-optimality criterion can be derived. We present the following result that relates and the expected information gain :
Consider a Bayesian linear inverse problem as formulated in Section 2.1. Let be an end-goal QoI, where has full row-rank. Then,
See Appendix D. ∎
The significance of Theorem 3.3 is that it says minimizing with the appropriate constraints on , amounts to maximizing the expected information gain under the same constraints.
Note that in the case of , and in the limit as
the criterion (13) is meaningless. The reason for this
is that in the infinite-dimensional limit is a positive self-adjoint trace class operator
and thus its eigenvalues accumulate at zero. On the other hand
in the case of
is a positive self-adjoint trace class operator and thus its eigenvalues accumulate at zero. On the other hand in the case of, (16) simplifies to
which recovers the known result  that for a Gaussian linear Bayesian inverse problem
and note that the quantity to the right is well defined in the limit as .
The previous remark shows that (17) is the correct expression for the expected information gain to choose in the case of . While the derivation of (17) here is done for the discretized version of the problem, as shown in  such an expression for the expected information gain can be derived in the infinite-dimensional Hilbert space setting also.
3.2 Goal-oriented sensor placement
The experimental conditions we choose to control are the sensor locations in the domain at which data are to be collected. This can be expressed as an optimal design of experiments (ODE) problem, as we now demonstrate. Our strategy is to fix an array of candidate locations for sensors and then select an optimal subset of the candidate sensor locations. In this context, a design is a binary vector where each entry corresponds to whether or not a particular sensor is active. Practical considerations, such as budgetary or physical constraints, limit the number of sensors that can be chosen. In this context, ODE seeks to identify the best possible sensor locations out of the possible sensor locations.
We consider Bayesian inverse problems with time-dependent linear forward models. In this case, maps the inversion parameter to spatio-temporal observations of the state at the sensor locations and at observation times. In what follows, we make use of the notation for the forward model that maps to the equivalent sensor measurements at observation time instance .
In the present formulation, enters the Bayesian inverse problem through the data likelihood:
where , and is a block diagonal matrix with . In particular, where and is the Kronecker product. The noise covariance is in general a block diagonal matrix , where is the spatial noise covariance matrix corresponding to th observation time, and is the number of observation time instances. In the present work, we assume that observations are uncorrelated in space and time, and thus is a diagonal matrix. While this assumption is not necessary, it simplifies the formulation considerably. Since is also diagonal, we have the convenient relation
The posterior covariance of the parameter is
Therefore, the posterior distribution of the goal QoI , conditioned by the observations , and the design is the Gaussian with
The above definition of will be substituted in the GOODE criteria described above.
Identifying the best sensor locations out of a set of candidate sensor locations is a combinatorial problem that is computationally intractable even for modest values of and . A standard approach [49, 25, 8, 9, 15], which we follow in this article, is to relax the binary condition on the design weights and let , . To ensure that only a limited number of sensors are allowed to be active, we use sparsifying penalty functions to control the sparsity of the optimal designs.
In the present formulation, non-binary weights are difficult to interpret and implement. Thus, some form of thresholding scheme is required to make a computed optimal design vector into a binary design. In this work, we adopt the following heuristic for thresholding: assuming that onlysensors are to be placed in the candidate locations, as is common practice, the locations corresponding to highest weights can be selected. This means that the corresponding weights are set to whereas all other sensors are set to , therefore giving a “near-optimal” solution to the original binary problem . An alternative approach, not considered in this article, is to obtain binary weights is to successively approximate the -norm by employing a sequence of penalty functions yielding a binary solution . The issue of sparsification and the choice of sparsifying penalty function is elaborated further in the description of the GOODE optimization problem formulation below and in the numerical results section.
3.3 The optimization problem
The generic form of the goal-oriented experimental design problem involves an optimization problem of the form
where, is the specific design criterion, is a penalty function, and is a user-defined penalty parameter that controls sparsity of the design. In this work, we make a choice to incorporate an norm to control sparsity of the design; that is, we set
Depending on whether we want goal-Oriented A- or D-optimality, is either or , respectively.
The optimization problem is solved using a gradient based approach, and therefore, this requires the derivation of the gradient. Since the design weights are restricted to the interval , differentiable, and the gradient of the penalty term in (21) is , where is a vector of ones. In the sequel, we derive expressions for the gradient of and .
3.3.1 Gradient of
The gradient of with respect to the design is given by
|where is the pointwise Hadamard product, and|
Here is the coordinate vector in , and is the forward model that maps the parameter to the equivalent observation at time instance , . See Appendix B for derivation of this gradient expression.
In the case the forward model is time-independent, the expressions simplify to
where is the coordinate vector in .
3.3.2 Gradient of
We present two alternate way of computing the gradient that are equivalent, but differ in computational cost depending on the number of sensors and dimension of the goal QoI, i.e. . In the first formulation we assume that . We can compute the gradient as
and is the coordinate vector in . The details are given in Appendix C.1. This formulation is especially suited for large-scale four-dimensional variational (4D-Var) data assimilation applications [51, 52, 53], including weather forecasting, and ocean simulations. Note that, in this formulation, evaluating the gradient requires the square-root of the posterior covariance matrix .
Evaluating (25), requires solving linear systems. If , the following alternative formulation of the gradient of will be computationally beneficial:
|where is now:|
and is the coordinate vector in . The derivation details are given in Appendix C.2. Note that the two gradient expressions are equivalent, in exact arithmetic.
3.4 Implementation and computational considerations
The main bottleneck of solving A-GOODE and D-GOODE problems lie in the evaluation of the respective objective functions and the associated gradients. Here, we detail the steps of evaluating the objective function and the gradient of both A-GOODE and D-GOODE problems. These steps are explained in Algorithms 1, and 2, respectively. For simplicity, we drop the penalty terms in the computations presented in the two Algorithms 1, and 2.
The A-GOODE problem. Algorithm 1 outlines the main steps of evaluating the A-GOODE objective and the gradient (23). Notice that all loops over the end-goal dimension, in Algorithm 1, i.e., steps 1–3, 8–10, and 16–20, are embarrassingly parallel. Evaluating , , is independent of the design , and can be evaluated offline and saved for later use. Step 2 requires a Hessian solve for each of the vectors , which can be done using preconditioned conjugate gradient. Each application of the Hessian requires a forward and an adjoint PDE solve. Using the prior covariance as a preconditioner, this requires CG iterations, i.e., PDE solves, where is the numerical rank of the prior-preconditioned data misfit Hessian; see [54, 55]. Another forward solutions of the underlying PDE are required in Step in the algorithm, for gradient computation. Therefore, the total number of PDE solves for objective and gradient evaluation is . This computations can be easily parallelized in cores. Moreover, the applications of the inverse Hessian—the Hessian solves—can be accelerated by using low-rank approximations of the prior-preconditioned data misfit Hessian . Note that Algorithm (1) also requires independent applications of and its adjoint, which can be done in parallel.
In Algorithm 1, is the goal operator, is the dimension of the end-goal, is the number of observation time instances, is the experimental design, is the weighted Hessian, is the covariance of the measurement noise at time instance , and is the forward model that maps the parameter from time instance to the equivalent observation at observation time instance .
The D-GOODE problem. Algorithm 2 describes the main steps for calculating D-GOODE objective and gradient expressions (25)–(26). Similar to Algorithm 1, all loops over the end-goal dimension, i.e., steps 1–3, 14–22, and 16–20 in Algorithm 2, are inherently parallel. Also, the loop over the observation dimension, , in Algorithm 2 is inherently parallel. With small end-goal space, the cost of Cholesky factorization of , in step is negligible compared to PDE solutions. The first form of the gradient, i.e. steps 12–24, requires Hessian solves, and forward PDE solves that can run completely in parallel. The second form of the gradient, i.e. steps 25–41, on the other hand requires Hessian solves, and forward and adjoint solves of the underlying system of PDEs. As mentioned before, low-rank approximation of the prior-preconditioned data misfit Hessian can be used to accelerate computations. Checkpointing is utilized in the second form of the gradient, i.e., steps 28–31. The checkpointed solutions are recalled in the adjoint solves in step 33.
Algorithm 2 requires applications of and its adjoint for objective function evaluations. As for the gradient, we need applications of with the first form of the gradient, and applications of . As before, the loops over end-goal and observation dimensions are embarrassingly parallel.
3.5 Connections to classical ODE criteria and extensions
Here we further discuss the connections of the GOODE criteria to the corresponding classical experimental design criteria and an extension to nonlinear goal operators. We have already mentioned two important connections:
Taking trivially recovers the Bayesian A- and D-optimal design criteria.
In the special case is a row vector , we get , which is the Bayesian C-optimality criterion.
Furthermore, if the vector is randomly drawn from a distribution with mean zero and identity covariance, then