 # Randomization and reweighted ℓ_1-minimization for A-optimal design of linear inverse problems

We consider optimal design of PDE-based Bayesian linear inverse problems with infinite-dimensional parameters. We focus on the A-optimal design criterion, defined as the average posterior variance and quantified by the trace of the posterior covariance operator. We propose using structure exploiting randomized methods to compute the A-optimal objective function and its gradient, and provide a detailed analysis of the error for the proposed estimators. To ensure sparse and binary design vectors, we develop a novel reweighted ℓ_1-minimization algorithm. We also introduce a modified A-optimal criterion and present randomized estimators for its efficient computation. We present numerical results illustrating the proposed methods on a model contaminant source identification problem, where the inverse problem seeks to recover the initial state of a contaminant plume, using discrete measurements of the contaminant in space and time.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A central problem in scientific computing involves estimating parameters that describe mathematical models, such as initial conditions, boundary conditions, or material parameters. This is often addressed by using experimental measurements and a mathematical model to compute estimates of unknown model parameters. In other words, one can estimate the parameters by solving an inverse problem. Experimental design involves specifying the experimental setup for collecting measurement data, with the goal of accurately recovering the parameters of interest. As such, optimal experimental design (OED) is an important aspect of effective and efficient parameter estimation. Namely, in applications where collecting experimental measurements is expensive (e.g., because of budget, labor, or physical constraints), deploying experimental resources has to be done efficiently and in a parsimonious manner. Even when collecting large amounts of data is feasible, OED is still important; the computational cost of processing all the data may be prohibitive or a poorly designed experiment with many measurements may miss important information about the parameters of interest.

To make matters concrete, we explain the inverse problem and the experimental design in the context of an application. Consider the transport of a contaminant in an urban environment or the subsurface. The forward problem involves forecasting the spread of the contaminant, whereas the inverse problem involves using the measurements of the contaminant concentration at discrete points in space and time to recover the source of the contaminant (i.e., the initial state). In this application, OED involves optimal placement of sensors, at which measurement data is collected, to reconstruct the initial state.

We focus on OED for Bayesian linear inverse problems governed by PDEs. In our formulation of the OED problem, the goal is to find an optimal subset of sensors from a fixed array of candidate sensor sites. The experimental design is parameterized by assigning non-negative weights to each candidate sensor location. Ideally, we seek a binary weight vector ; if , a sensor will be placed at the th candidate location, and if , no sensor will be placed at the location. However, formulating an optimization problem over binary weight vectors leads to a problem with combinatorial complexity that is computationally prohibitive. A common approach to address this issue is to relax the binary requirement on design weights by letting the weights take values in the interval . The sparsity of the design will then be controlled using a penalty method; see e.g., [HaberMagnantLuceroEtAl12, a14infinite, yu2018scalable]. This results in an optimization problem of the following form:

 min\boldmathw∈[0,1]nsΦ(\boldmathw)+γP(\boldmathw), (1)

where denotes the design criterion, is a penalty parameter, and is a penalty function.

Adopting a Bayesian approach to the inverse problem, the design criterion will be a measure of the uncertainty in the estimated parameters. In this article, we focus on a popular choice known as the A-optimal criterion [Ucinski05, ChalonerVerdinelli95]. That is, we seek a sensor configuration that results in a minimized average posterior variance. The design criterion, in this case, is given by the trace of the posterior covariance operator.

One major challenge in solving Eq. 1, specifically for PDE-based inverse problems, is the computational cost of objective function and gradient evaluations. Namely, the posterior covariance operator is dense, high-dimensional, and computationally challenging to explicitly form—computing applications of to vectors requires solving multiple PDEs. Furthermore, these computations must be performed at each iteration of an optimization algorithm used to solve Eq. 1. To address this computational challenge, efficient and accurate matrix-free approaches for computing the OED objective and its gradient are needed. Another challenge in solving the OED problem Eq. 1 is the need for a suitable penalty method that is computationally tractable and results in sparse and binary optimal weight vectors. This article is about methods for overcoming these computational challenges.

##### Related work

For an extensive review of the OED literature, we refer the reader to [alexanderian2018dopt]. We focus here on works that are closely related to the present article. Algorithms for A-optimal designs for ill-posed linear inverse problems were proposed in [HaberHoreshTenorio08, HaberMagnantLuceroEtAl12] and more specifically for infinite-dimensional Bayesian linear inverse problems in [a14infinite]. In these articles, Monte-Carlo trace estimators are used to approximate the A-optimal design criterion and its gradient. Also, [HaberMagnantLuceroEtAl12, a14infinite] advocate use of low-rank approximations using the Lanczos algorithm or the randomized SVD [Halko2011structure]. We refer to our previous work [saibaba2016randomized] for comparison of Monte Carlo trace estimators and those based on randomized subspace iteration; it was shown that the latter are significantly more accurate than Monte Carlo trace estimators. Regarding sparsity control, various techniques have been used to approximate the -“norm” to enforce sparse and binary designs. For example, [HaberHoreshTenorio08, HaberHoreshTenorio10, HaberMagnantLuceroEtAl12] use the -penalty function with an appropriate threshold to post-process the solution. In [a14infinite], a continuation approach is proposed that involves solving a sequence of optimization problems with non-convex penalty functions that approximate the -“norm”. More recently, in [yu2018scalable], a sum-up rounding approach is proposed to obtain binary optimal designs.

##### Our approach and contributions

In this article, we make the following advances in methods for A-optimal sensor placements in infinite-dimensional Bayesian linear inverse problems:

1. We present efficient and accurate randomized estimators, based on the randomized subspace iteration, for A-optimal criterion and its gradient. This is accompanied by a detailed algorithm that guides efficient implementations, discussion of computational cost, as well as theoretical error analysis; see Section 3. Our estimators are structure exploiting, in that they use the low-rank structure embedded in the posterior covariance operator. To quantify the accuracy of the estimators we present rigorous error analysis, significantly advancing the methods in [saibaba2016randomized]. A desirable feature of our analysis is that the bounds are independent of the dimension of the discretized inversion parameter. Furthermore, the computational cost (measured in the number of PDE solves) of the A-optimal objective and gradient using our proposed estimators is independent of the discretized parameter dimension.

2. We propose a new algorithm for optimal sensor placement that is based on solving a sequence of reweighted -optimization problems; see Section 4. An important benefit of this approach is that one works with convex penalty functions, and since the A-optimal criterion itself is a strictly convex function of , in each step of the reweighted algorithm a convex optimization problem is solved. We derive this algorithm by using the Majorization-Minimization principle applied to a novel penalty function that promotes binary designs. The reweighted -optimization problems are accelerated by the efficient randomized estimators for the optimality criterion and their gradients. To our knowledge, the presented framework, based on reweighted -minimization, is the first of its kind in the context of OED.

3. Motivated by reducing computational cost, we propose a new criterion known as modified A-optimal criterion; see Section 5. This criterion is derived by considering a suitably weighted A-optimal criterion. We present randomized estimators with complete error analysis for computing the modified A-optimal criterion and its gradient.

We illustrate the benefits of the proposed algorithms on a model problem from contaminant source identification. A comprehensive set of numerical experiments is provided so to test various aspect of the presented approach; see Section 6.

Finally, we remark that the randomized estimators and the reweighted approach for promoting sparse and binary weights are of independent interest beyond the application to OED.

## 2 Preliminaries

In this section, we recall the background material needed in the remainder of the article.

### 2.1 Bayesian linear inverse problems

We consider a linear inverse problem of estimating , using the model

 Fm+\boldmathη=\boldmathy.

Here is a linear parameter-to-observable map (also called the forward operator), represents the measurement noise, and is a vector of measurement data. The inversion parameter is an element of , where is a bounded domain.

#### The setup of the inverse problem

To fully specify the inverse problem, we need to describe the prior law of and our choice of data likelihood. For the prior, we choose a Gaussian measure . We assume the prior mean is a sufficiently regular element of and that the covariance operator is a strictly positive self-adjoint trace-class operator. Following the developments in [Stuart10, Bui-ThanhGhattasMartinEtAl13, DashtiStuart17], we use with taken to be a Laplacian-like operator [Stuart10]. This ensures that is trace-class in two and three space dimensions.

We consider the case where represents a time-dependent PDE and we assume observations are taken at sensor locations at points in time. Thus, the vector of experimental data is an element of . An application of the parameter-to-observable map,

, involves a PDE solve followed by an application of a spatiotemporal observation operator. We assume a Gaussian distribution on the experimental noise,

. Given this choice of the noise model—additive and Gaussian—the likelihood probability density function (pdf) is

 π\mathuplike(\boldmathy∣m)∝exp{−12(Fm−\boldmathy)⊤\boldmathΓ−1\mathupnoise(Fm−\boldmathy)}.

Furthermore, the solution of the Bayesian inverse problem—the posterior distribution law —is given by the Gaussian measure with

 m\boldmathy\mathuppost=Γ\mathuppost(F∗% \boldmathΓ−1\mathupnoise\boldmathy+Γ−1\mathupprm\mathuppr),Γ\mathuppost=(F∗\boldmathΓ−1\mathupnoiseF+Γ−1\mathuppr)−1. (2)

Note that here the posterior mean coincides with the maximum a posteriori probability (MAP) estimator. We refer to [Stuart10] for further details.

#### Discretization

We use a continuous Galerkin finite element discretization approach for the governing PDEs, as well as the inverse problem. Specifically, our discretization of the Bayesian inverse problem follows the developments in [Bui-ThanhGhattasMartinEtAl13]. The discretized parameter space in the present case is equipped with the inner product and norm , where is the finite element mass matrix. Note that is the discretized

inner product. The discretized parameter-to-observable map is a linear transformation

with adjoint discussed below. The discretized prior measure is obtained by discretizing the prior mean and covariance operator, and the discretized posterior measure is given by , with

 Γpost=(F∗\boldmath% Γ−1noiseF+Γ−1pr)−1,\boldmathm\boldmathy\mathuppost=Γpost(F∗\boldmathΓ−1noise\boldmathy+Γ−1pr\boldmathmpr).

We point out that the operator is the Hessian of the data-misfit cost functional whose minimizer is the MAP point, and is thus referred to as the data-misfit Hessian; see e.g., [a14infinite].

It is important to keep track of the inner products in the domains and ranges of the linear mappings, appearing in the above expressions, when computing the respective adjoint operators. For the convenience of the readers, in Fig. 1, we summarize the different spaces that are important, the respective inner products, and the adjoints of the linear transformations defined between these spaces. Figure 1: Different spaces, their inner products, and the adjoints of linear transformations between them. Here ⟨⋅,⋅⟩ denotes the Euclidean inner product and ⟨⋅,⋅⟩M is the mass-weighted inner product.

Using the fact that is a self-adjoint operator on and the form of the adjoint operator (see Fig. 1), we can rewrite the expression for as follows:

 Γpost=Γ1/2prM−1/2(I+\mathbfcalF⊤\boldmathΓ−1noise\mathbfcalF)−1M1/2Γ1/2pr,

with

 \mathbfcalF=FΓ1/2prM−1/2. (3)

Note that the operator is a symmetric positive semidefinite matrix, and is a similarity transform of the prior-preconditioned data-misfit Hessian . In many applications (including the application considered in Section 6),

has rapidly decaying eigenvalues and therefore, it can be approximated by a low-rank matrix. This is a key insight that will be exploited in our estimators for the OED criterion and its gradient.

### 2.2 Randomized subspace iteration algorithm

In this article, we develop and use randomized estimators to efficiently compute the design criteria and their derivatives. We first explain how to use randomized algorithms for computing low-rank approximations of a symmetric positive semidefinite matrix . To draw connection with the previous subsection, in our application will stand for . We first draw a random Gaussian matrix

(i.e., the entries are independent and identically distributed standard normal random variables). We then perform

steps of subspace iteration on with the starting guess to obtain the matrix . If, for example, the matrix has rank , or the eigenvalues decay sufficiently, then the range of is a good approximation to the range of under these suitable conditions. This is the main insight behind randomized algorithms. We now show how to obtain a low-rank approximation of . A thin-QR factorization of is performed to obtain the matrix , which has orthonormal columns. We then form the “projected” matrix and obtain the low-rank approximation

 A≈QTQ⊤. (4)

This low-rank approximation can be manipulated in many ways depending on the desired application. An alternative low-rank approximation can be computed using the Nyström approximation, see e.g., [Halko2011structure].

In addition, once the matrix is computed, it can be used in various ways. In [saibaba2016randomized], was used as an estimator for , whereas was used as estimator for . The main idea behind these estimators is that the eigenvalues of are good approximations to the eigenvalues of , when is sufficiently low-rank or has rapidly decaying eigenvalues. Our estimators for the A-optimal criterion and its gradient utilize the same idea but in a slightly different form.

### 2.3 A-optimal design of experiments

As mentioned in the introduction, an experimental design refers to a placement of sensors used to collect measurement data for the purposes of parameter inversion. Here we describe the basic setup of the optimization problem for finding an A-optimal design.

#### Experimental design and A-optimal criterion

We seek to find an optimal subset of a network of candidate sensor locations, which collect measurements at points in time. The experimental design is parameterized by a vector of design weights . In the present work, we use the A-optimal criterion to find the optimal design. That is, we seek designs that minimize the average posterior variance, as quantified by . (The precise nature of the dependence of on will be explained below.) Note that and thus, minimizing the trace of the posterior covariance operator is equivalent to minimizing

 Φaopt(\boldmathw)≡tr(Γpost% (\boldmathw)−Γpr). (5)

This is the objective function we seek to minimize for finding A-optimal designs. As seen below, this formulation of the A-optimal criterion is well suited for approximations via randomized matrix methods Section 2.2. Note also that minimizing amounts to maximizing , which can be thought of as a measure of uncertainty reduction.

We can also understand Eq. 5 from a decision theoretic point of view. It is well known [ChalonerVerdinelli95, alexanderian2016bayesian, AttiaAlexanderianSaibaba18] that for Bayesian linear inverse problems with Gaussian prior and additive Gaussian noise models, coincides with the expected Bayes risk:

 tr(Γpost)= Eμ\mathuppr(Eπ\mathuplike(\boldmathy∣\boldmathm)(∥\boldmathm\boldmathy\mathuppost−\boldmathm∥2M)) = ∫Vn∫Rnsnt∥\boldmathm\boldmathy\mathuppost−\boldmathm∥2Mπ\mathuplike(\boldmathy∣\boldmathm)d\boldmathyμ\mathuppr(d\boldmathm).

Here denotes the discretized prior measure, . Using

 ∫Vn∥\boldmathm\mathuppr−\boldmath% m∥2Mμ\mathuppr(d% \boldmathm)=tr(Γpr),

we see that

 tr(Γpost−Γpr)=Eμ\mathuppr(Eπ\mathuplike(\boldmathy∣\boldmathm)(∥\boldmathm\boldmathy\mathuppost−\boldmathm∥2M−∥\boldmathm\mathuppr−\boldmathm∥2M));

this provides an alternate interpretation of

as an expected Bayes risk with a modified loss function.

#### Dependence of the A-optimal criterion to the design weights

We follow the same setup as [a14infinite]. Namely, the design weights enter the Bayesian inverse problem through the data likelihood; therefore, the operator depends on as follows:

 \mathbfcalHm(\boldmathw)=\mathbfcalF⊤W1/2\boldmathΓ−1noiseW1/2\mathbfcalF,

We refer to [a14infinite] for details. This results in the dependent posterior covariance operator:

The matrix , which weights the spatiotemporal observations, is defined as follows:

 W=ns∑i=1wiEi,withEi=Int⊗eie⊤i,

where is the Kronecker product. In our formulation, we assume uncorrelated observations across sensor locations and time which implies is a block diagonal matrix, with blocks of the form ; here , , denote the measurement noise at individual sensor locations. Using this structure for , we define

 Wnoise≡W1/2\boldmath% Γ−1noiseW1/2=nsens∑i=1wiEnoisei, (6)

with . Thus, we have and the A-optimal criterion can be written as

 Φaopt(\boldmathw) (7) =tr([(\mathbfcalHm(\boldmathw)+I)−1−I]Z),

with

 Z≡M1/2ΓprM−1/2. (8)

Anticipating that we will use a gradient-based solver for solving Eq. 10, we also need the gradient of which we now derive. Using Theorems B.17 and B.19 in [Ucinski05], the partial derivatives of Eq. 7 with respect to , , are

 ∂jΦaopt(\boldmathw)=−tr((I+\mathbfcalHm(\boldmathw))−1∂j\mathbfcalHm(\boldmathw)(I+\mathbfcalHm(\boldmathw))−1Z). (9)

(We have used the notation to denote .) Note that using the definition of , we have , .

#### The optimization problem for finding an A-optimal design

We now specialize the optimization problem Eq. 1 to the case of A-optimal sensor placement for linear inverse problems governed by time-dependent PDEs:

 min\boldmathw∈[0,1]nsΦaopt(\boldmathw)+γP(\boldmathw). (10)

As explained before, to enable efficient solution methods for the above optimization problem we need (i) a numerical method for fast computation of and its gradient, and (ii) a choice of penalty function that promotes sparse and binary weights. The former is facilitated by the randomized subspace iteration approach outlined earlier (see Section 3), and for the latter we present an approach based on rewighted minimization (see Section 4).

## 3 Efficient computation of A-optimal criterion and its gradient

The computational cost of solving Eq. 10 is dominated by the PDE solves required in OED objective and gradient evaluations; these operations need to be performed repeatedly when using an optimization algorithm for solving Eq. 10. Therefore, to enable computing A-optimal designs for large-scale applications, efficient methods for objective and gradient computations are needed. In this section, we derive efficient and accurate randomized estimators for Eq. 7 and Eq. 9. The proposed estimators are matrix-free—they require only applications of the (prior-preconditioned) forward operator and its adjoint on vectors. Moreover, the computational cost of computing these estimators does not increase with the discretized parameter dimension. This is due to the fact that our estimators exploit the low-rank structure of , a problem property that is independent of the choice of discretization.

We introduce our proposed randomized estimators for the A-optimal design criterion and its gradient in Section 3.1. Additionally, we present a detailed computational method for computing the proposed estimators. We analyze the errors associated with our proposed estimators in Section 3.2.

### 3.1 Randomized estimators for Φaopt(% \boldmathw) and its gradient

Consider the low-rank approximation of given by , with and computed using Algorithm 1. Replacing by its approximation and using the cyclic property of the trace, we obtain the estimator for the A-optimal criterion Eq. 7:

 ˆΦaopt(\boldmathw% \boldmathw;ℓ)=tr(((I+ˆ\mathbfcalHm(\boldmathw))−1−I)Z). (11)

To derive an estimator for the gradient, once again, we replace with its low-rank approximation in Eq. 9 to obtain

 ˆ∂jΦaopt(\boldmathw;ℓ)=−tr((I+ˆ\mathbfcalHm(% \boldmathw))−1\mathbfcalF⊤Enoisej\mathbfcalF(I+ˆ\mathbfcalHm(\boldmathw))−1Z), (12)

for .

#### Computational procedure

First, we discuss computation of the A-optimal criterion estimator using Algorithm 1. Typically, the algorithm can be used with , due to rapid decay of eigenvalues of . In this case, Algorithm 1 requires applications of . Since each application of requires one apply (forward solve) and one apply (adjoint solve), computing requires PDE solves. Letting the spectral decomposition of the matrix be given by and denoting , we have . Applying the Sherman–Morrison–Woodbury formula [meyer2000matrix] and the cyclic property of the trace to Eq. 11, we obtain

 ˆΦaopt(\boldmathw% \boldmathw)=−tr(DTV⊤ZV), (13)

where .

Next, we describe computation of the gradient estimator Eq. 12. Here we assume ; the extension to the case is straightforward and is omitted. Again, using the Woodbury formula and cyclic property of the trace, we rewrite Eq. 12 as

 (14)

for . Expanding this expression, we obtain

 ˆ∂jΦaopt(% \boldmathw) =−tr(Z\mathbfcalF⊤Enoisej\mathbfcalF)+2tr(\mathbfcalFVDTV⊤Z\mathbfcalF⊤Enoisej)− (15)

Note that the first term in Eq. 15 does not depend on the design , for . As a result, this term can be precomputed and used in subsequent function evaluations. We expand to be

 sj =−tr(Z\mathbfcalF⊤Enoisej\mathbfcalF) =−nsnt∑k=1(\mathbfcalF⊤(Enoisej)1/2ˆ\boldmathek)⊤Z(\mathbfcalF⊤(Enoisej)1/2ˆ\boldmathek),

where is the

column of the identity matrix of size

. Because there are only columns of with nonzero entries, the total cost to precompute for is PDE solves.

To compute the remaining terms in Eq. 15, we exploit the fact that has columns. Notice all the other occurrences of and occur as a combination of and (or of their transposes). Both of these terms require PDE solves to compute. As a result, the total cost to evaluate and is PDE solves to apply Algorithm 1 and PDE solves to compute and . We detail the steps for computing our estimators for A-optimal criterion and its gradient in Algorithm 2.

#### Alternative approaches and summary of computational cost

A closely related variation of Algorithm 2 is obtained by replacing step 1 of the algorithm (i.e., randomized subspace iteration) by the solution of an eigenvalue problem to compute the dominant eigenvalues of “exactly”. We refer to this method as Eig-, where is the target rank of . This idea was explored for computing Bayesian D-optimal designs in [alexanderian2018dopt]. The resulting cost is similar to that of the randomized method: it would cost PDE solves per iteration to compute the spectral decomposition of , plus PDE solves to precompute for . While both the randomized and Eig- methods provide a viable scheme for computing the A-optimal criteria, our randomized method can exploit parallelism to lower computational costs. Each matrix-vector application with in Algorithm 1 can be computed in parallel. However, if accurate eigenpairs of are of importance to the problem, one can choose to use the Eig- approach at the cost of computing a more challenging problem.

Another possibility suitable for problems where the forward model does not depend on the design (as is the case in the present work), is to precompute a low-rank SVD of , which can then be applied as necessary to compute the A-optimal criterion and its gradient. This frozen forward operator approach has been explored in [a14infinite, HaberMagnantLuceroEtAl12] for the A-optimal criterion and in [alexanderian2018dopt] for the D-optimal criterion. The resulting PDE cost of precomputing a low-rank approximation of is , with indicating the target rank. The Frozen method is beneficial as no additional PDE solves are required when applying Algorithm 2; however, this approach would not favor problems where depends on , nor can the modeling errors associated with the PDE be controlled in subsequent evaluations without the construction of another operator.

Finally, if the problem size is not too large (i.e., in small-scale applications), one could explicitly construct the forward operator . This enables exact (excluding floating point errors) computation of the objective and its gradient. The total cost for evaluating involves an upfront cost of PDE solves. We summarize the computational cost of Algorithm 2 along with the other alternatives mentioned above in Table 1.

To summarize, the randomized methods for computing OED objective and its gradient present several attractive features; our approach is well suited to large-scale applications; it is matrix free, simple to implement and parallelize, and exploits low-rank structure in the inverse problem. Moreover, as is the case with the Eig-k approach, the performance of the randomized subspace iteration does not degrade as the dimension of the parameter increases due to mesh refinement. This is the case because the randomized subspace iteration relies on spectral properties of the prior-preconditioned data misfit Hessian—a problem structure that is independent of discretization.

### 3.2 Error Analysis

Here we analyze the error associated with our estimators computed using Algorithm 2. Since is symmetric positive semidefinite, we can order its eigenvalues as . Suppose are the dominant eigenvalues of , we define and and we assume that the eigenvalue ratio satisfies

 γk=∥Λ2∥2∥Λ−11∥2=λk+1λk<1.

We now present the error bounds for the objective function and its gradient. To this end, we define the constant as

 C≡e2(k+p)(p+1)2(12π(p+1))2p+1(μ+√2)2(p+1p−1), (16)

with and .

###### Theorem 1

Let and be approximations of the A-optimal objective function and its gradient , respectively, computed using Algorithm 2. Let be the target rank and be the oversampling parameter such that . Then, with

 E[|Φaopt(\boldmathw)−ˆΦaopt(\boldmathw;ℓ)|]≤∥Z∥2(tr(f(Λ2))+tr(f(γ2q−1kCΛ2))). (17)

Furthermore, with , for ,

 E[|∂jΦaopt(\boldmathw)−∂jˆΦaopt(\boldmathw% \boldmathw;ℓ)|]≤2∥Z∥2∥Pj∥2(tr(f(Λ2))+tr(f(γ2q−1kCΛ2))). (18)

###### Proof

See Section A.2.

In Theorem 1, the estimators are unbiased when the target rank equals the rank of . If the eigenvalues decay rapidly, the bounds suggest that the estimators are accurate. Recall that is the number of nonzero eigenvalues. Consequently, it is seen that the bounds are independent of the dimension of the discretization .

## 4 An optimization framework for finding binary designs

We seek A-optimal designs by solving an optimization problem of the form,

 min\boldmathw∈WΦ(\boldmathw)+γP(\boldmathw),W=[0,1]ns, (19)

where the design criterion is either the A-optimal criterion or the modified A-optimal criterion (see Section 5). In the previous sections, we laid out an efficient framework for computing accurate approximations to the A-optimal OED criterion and its gradient. We now discuss the choice of the penalty term and the algorithm for solving the optimization problem.

The choice of the penalty term must satisfy two conditions: sparsity, measured by the number of nonzeros of the design vector, and binary designs, i.e., designs that are either or . One possibility for the penalty function is the -“norm”, , which measures the number of nonzero entries in the design. However, the resulting optimization problem is challenging to solve due to its combinatorial complexity. A common practice is to replace the -“norm” penalty by the -norm, . The penalty function has desirable features: it is a convex penalty function that promotes sparsity of the optimal design vector . However, the resulting design is sparse but not necessarily binary and additional post-processing in the form of thresholding is necessary to enforce binary designs.

In what follows, we introduce a suitable penalty function that enforces both sparsity and binary designs and an algorithm for solving the OED optimization problem based on the MM approach. The resulting algorithm takes the form of a reweighted -minimization algorithm [Candes2008sparsity].

### 4.1 Penalty functions

We propose the following penalty function

 Pϵ(\boldmathw)=ns∑i=1|wi||wi|+ϵ,\boldmathw∈Rns, (20)

for a user-defined parameter . This penalty function approximates for small values of ; however, as becomes smaller the corresponding optimization problem becomes harder. To illustrate the choice of penalty functions, in Fig. 2, we plot along with and , with . Using in the OED problem leads to the optimization problem,

 min\boldmathw∈WΦ(\boldmathw)+γPϵ(\boldmathw). (21)

In Eq. 21, the absolute values in definition of can be dropped since we limit the search for optimal solutions in . Since is concave, Eq. 21 is a non-convex optimization problem. To tackle this, we adopt the majorization-minimization (MM) approach.

### 4.2 MM approach and reweighted ℓ1 algorithm

The idea behind the MM approach is to solve a sequence of optimization problems whose solutions converge to that of the original problem [hl2004MMAlgorithms, lange2016mm]. This sequence is generated by a carefully constructed surrogate that satisfies two properties—the surrogate must majorize the objective function for all values, and the surrogate must match the objective function at the current iterate. More specifically, suppose

 J(\boldmathw)=Φ(\boldmathw)+γns∑i=1wiwi+ϵ.

Then the surrogate function at the current iterate must satisfy

 g(\boldmathw|\boldmathw(m)) ≥J(\boldmathw)∀\boldmathw∈W, g(\boldmathw(m)|\boldmathw% (m)) =J(\boldmathw(m)).

Granted the existence of this surrogate function, to find the next iterate we solve the optimization problem

 \boldmathw(m+1)=argmin\boldmathw∈Wg(\boldmathw|\boldmathw(m)). (22)

To show the objective function decreases at the next iterate, observe that the next iterate stays within the feasible region and use the two properties of the surrogate function

 J(\boldmathw(m+1))≤g(\boldmathw(m+1)|\boldmathw(m))≤g(\boldmath% w(m)|\boldmathw(m))=J(\boldmathw(m)).

To construct this surrogate function, we use the fact that a concave function is below its tangent [lange2016mm, Equation (4.7)]. Applying this to our concave penalty , we have

 Pϵ(\boldmathw)≤Pϵ(\boldmathw(m))+(\boldmathw−\boldmathw(m))⊤∇\boldmathwPϵ(\boldmathw(m)),%forall\boldmathw∈W.

With this majorization relation, we define the surrogate function to be

 g(\boldmathw|\boldmathw(m))=Φ(\boldmathw)+γ(Pϵ(\boldmathw(m))+(\boldmathw−\boldmathw(m))⊤∇\boldmathwPϵ(\boldmathw(m)))

By dropping the terms that do not depend on , it can be readily verified that Eq. 22 can be replaced by the equivalent problem

 \boldmathw(m+1)= argmin\boldmathw∈WΦ(\boldmathw)+γns∑i=1ϵwi(w(m)i+ϵ)2 (23) =