# Sequentially optimized projections in X-ray imaging

This work applies Bayesian experimental design to selecting optimal projection geometries in (discretized) parallel beam X-ray tomography assuming the prior and the additive noise are Gaussian. The introduced greedy exhaustive optimization algorithm proceeds sequentially, with the posterior distribution corresponding to the previous projections serving as the prior for determining the design parameters, i.e. the imaging angle and the lateral position of the source-receiver pair, for the next one. The algorithm allows redefining the region of interest after each projection as well as adapting parameters in the (original) prior to the measured data. Both A and D-optimality are considered, with emphasis on efficient evaluation of the corresponding objective functions. Two-dimensional numerical experiments demonstrate the functionality of the approach.

## Authors

• 29 publications
• 24 publications
• 7 publications
• 7 publications
• 2 publications
04/01/2021

### Edge-promoting adaptive Bayesian experimental design for X-ray imaging

This work considers sequential edge-promoting Bayesian experimental desi...
07/09/2018

### Deriving Neural Network Architectures using Precision Learning: Parallel-to-fan beam Conversion

In this paper, we derive a neural network architecture based on an analy...
02/08/2021

### Core Imaging Library – Part I: a versatile Python framework for tomographic imaging

We present the Core Imaging Library (CIL), an open-source Python framewo...
04/11/2018

### Projection image-to-image translation in hybrid X-ray/MR imaging

The potential benefit of hybrid X-ray and MR imaging in the intervention...
09/22/2020

### Constrained Non-Linear Phase Retrieval for Single Distance X-ray Phase Contrast Tomography

X-ray phase contrast tomography (XPCT) is widely used for 3D imaging of ...
01/02/2017

### Retrieving Similar X-Ray Images from Big Image Data Using Radon Barcodes with Single Projections

The idea of Radon barcodes (RBC) has been introduced recently. In this p...
04/29/2022

### Forward-fitting STIX visibilities

Aima. To determine to what extent the problem of forward fitting visibil...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Tomographic image reconstruction is a classical inverse problem with its history spanning over 100 years since the seminal work of Radon [27]. It gives rise to a wealth of applications in modern imaging for which research remains active to date. The contemporary mathematical questions are often driven by practical challenges such as limited data — consider, e.g., having a limited number of imaging angles or a limited field-of-view. There exists a number of proposed methods for image reconstruction given such incomplete data, however, much less is known about how to optimally design the imaging configuration if there are constraints on, e.g., the number of angles, if individual projections do not cover all of the imaged object, or if there is a particular region of interest (ROI) that may vary as a function of time. Such design is desirable in any application where the exposure to radiation must be minimized or the acquisition of data is otherwise expensive.

Optimal experimental design (OED) is also an extensive field of research with a massive body of literature; for a general introduction we refer to [5]

. In particular, OED in inverse problems has taken leaps during the last decade due to the increase in computational resources. In the present paper, we adopt the Bayesian paradigm to inverse problems, that is, the observed data is used to update the prior knowledge about the unknown into a posterior probability distribution

[22, 31]

. OED in Bayesian inference is formulated as minimization of a Bayesian cost with respect to the design parameters

[10]; different cost functions correspond to different optimality criteria. Here we focus on the two most common approaches: A-optimality, where a typical quadratic loss is considered, and D-optimality

, which maximizes the Kullback–Leibler divergence of the posterior distribution with respect to the prior.

For general high-dimensional inference problems, the minimization of the Bayes cost requires a huge computational effort since just evaluating the cost corresponds to (numerically) integrating the considered cost function against the joint probability distribution of the unknown and the data. Considerable effort has been put into formulating approximative schemes to solve the corresponding optimization task; in the framework of inverse problems this work is discussed below. However, for linear Bayesian inference problems with Gaussian prior and likelihood, it is well-known that the integral defining the Bayes cost for A or D-optimality can be evaluated analytically as the trace or the determinant of the posterior covariance, respectively, leading to more attractive optimization problems [10]. This is the general setting considered in this work.

### 1.1 Our contribution

Our paper focuses on discussing benefits of OED in linear X-ray imaging. Our contributions to this subject are as follows:

• We propose a greedy exhaustive sequential method for optimizing the imaging configuration in X-ray tomography in the spirit of [11]

and discuss its efficient numerical implementation. More precisely, our work proposes how to optimize the next imaging angle and the lateral position of the source-receiver pair if the geometric specifications of the previous projections are known. At each step, the posterior covariance corresponding to the previous projections is fed as the prior covariance for the optimization of the next imaging angle and the lateral position. We discuss both A and D-optimal designs. As the dimension of the new measurement at each step is considerably lower than that of the unknown, the computational cost of evaluating the necessary posterior traces and determinants can be significantly lowered by resorting to the Woodbury matrix identity and the matrix determinant lemma without employing Monte Carlo estimators (cf.

[3, 4, 7]).

• If the covariance matrices for the (original) Gaussian prior and the additive Gaussian noise process are known, the sequentially optimal projections can be determined before performing any actual measurements. However, we also introduce an optimization algorithm based on a varying ROI. Indeed, the ROI can be adapted to the reconstruction after a new projection image becomes available either algorithmically or based on expert knowledge, and subsequently the next optimal projection can be determined accounting for the change in the ROI. A related idea for the A-optimality criterion is suggested in [11].

• We develop a data-driven variant of our optimization method that simultaneously adapts the (original) prior distribution to the data observed at each step. As a numerical example, we demonstrate that the method is able to identify the true correlation length used in generating an imaged target.

It should be emphasized that the forms of the optimization target functions corresponding to both A and D-optimality criteria are well-known [10]. However, linear X-ray imaging provides an interesting example for further analysis of the setting, not only due to its high-dimensionality, but also since the design space (i.e. the projection geometry) is continuous by nature, which leads to challenging computational considerations. We adopt a sequential approach to the optimization due to its computational tractability and in order to include feedback from the observed data in the design process. Notice that sequential optimization is in general suboptimal if the total number of projections is fixed a priori and no feedback from the measured data is included in the process [19].

### 1.2 Literature review

In recent years, OED for large-scale inverse problems has gained substantial attention; see, e.g., [2, 3, 4, 6, 12, 13, 14, 16, 20, 23, 24, 25]. In particular, we acknowledge that the study of A and D-optimal designs has been extended to infinite-dimensional Hilbert spaces in [1]. The computational feasibility of experimental design in large-scale problems is often based on linearization and the use of Monte Carlo trace estimators such as the ones developed in [7]. However, there is also an interesting avenue of research presented in [17, 18, 19], where the authors develop design methods based on dynamical programming.

Optimization of projection angles in X-ray imaging has previously been considered in [28]. The authors explore two interesting approaches for empirical A-optimal design in constrained problems based on training data: On the one hand, they solve a sparse design out of a predetermined set of angles, which is an approach aligned with previous literature. On the other hand, they propose a gradient-based optimization scheme. However, there seems to be no guarantee of being able to find a global optimum and the presented numerical experiments are limited to two angles. The present paper is closely related to [28] as our approach is also based on a prior. As novelties compared to [28], we consider the lateral position of the source-receiver pair as a second design parameter, our sequential design enables incorporating feedback from the observed data in the process, and the exhaustive optimization algorithm facilitates circumventing difficulties arising from the nonconvexity of the objective functions. In fact, our numerical examples clearly demonstrate that the optimization of the projection geometry in X-ray imaging suffers from the existence of local optima.

In terms of adaptivity, our work draws inspiration from [11], where A-optimal sequential design for dynamical inverse problems is considered. Our setup can be derived from [11] by assuming static noise-free dynamics. However, in [11] the adaptation of design is obtained by introducing a so-called monitor function that measures the difference in the last two reconstructions. In our framework, the approach of [11] would roughly correspond to choosing the ROI to be an area where such a difference is high.

From an application’s point of view, our proposed method fits in the larger field of sparse tomography [15, 30], but moreover to applications with strict dose limitations and obstructions to the imaging region [9]. Such limitations might be due to safety reasons, experimental environment, or even expensive beamtime at synchrotron facilities [8]. For instance, in image-guided radiotherapy angles are chosen to radiate only a cancerous region of interest, without damaging essential surrounding tissue, using the knowledge obtained during the measurements, cf. [21]. Here, the proposed OED framework could provide insights into treatment planning and creation of automated algorithms to exclude critical regions from the beam direction.

This paper is organized as follows. In Section 2, we introduce our discretized parallel beam framework for two-dimensional X-ray imaging. Section 3 reviews the basics of finite-dimensional Bayesian OED and considers efficient numerical evaluation of the objective functions corresponding to A and D-optimality. The greedy exhaustive optimization algorithm and its adaptive variants are presented in Section 4. Finally, the numerical experiments are documented in Section 5, and some conclusions are drawn in Section 6. For completeness and to facilitate reading the paper, an appendix presents brief derivations of the target functions for A and D-optimal designs.

## 2 Measurement model and its discretization

The measurements are modeled as parallel beam tomography, where multiple parallel rays are directed into the object , and the resulting intensities of the rays are measured at detectors [26]. The attenuation is described by the equation

 I=I0exp(−∫Luds), (1)

where is the line along which the considered ray travels, is the intensity of the X-ray before entering the object and is the absorption. Obviously, (1) can equivalently be given as

 log(I0)−log(I)=∫Lu(x)ds.

In particular, the difference between the logarithms of the emitted and measured intensities is typically considered as the available data when X-ray tomography is tackled mathematically.

After discretizing the imaged domain into pixels, the forward operator, mapping the discretized absorption to a single set of log-intensity measurements at the detectors can be approximated by a matrix , where and are the number of detectors and the number of pixels, respectively (see, e.g., [30]); typically the dimension of the unknown is higher than the number of pixels in a single projection image, i.e. . Each ray is parametrized by its (signed) distance to the origin and its angle , say, relative to the positive horizontal axis. For a single projection image with detectors, one thus needs to specify parameters . We assume the rays/detectors to be equally spaced, but in our setting the complete parallel beam source-receiver pair does not typically cover the whole domain, that is, a single measurement with a given angle only produces a projection image of a part of .

Take note that is very sparse, which makes corresponding matrix multiplications relatively inexpensive computationally. Moreover, at least if , it is reasonable to expect that has a full (effective) rank, i.e.,

, with all its singular values being of the same order of magnitude. In particular, no projection matrix considered in our numerical experiments has a Euclidean norm based condition number that is higher than

.

We assume the domain consists of three distinct regions:

• ROI: the area about which we want to recover information,

• Obstruction: a nuisance region that obstructs the propagation of X-rays,

• Background: the rest of the object.

By resorting to this kind of a division, one can model settings where only a certain part of the target object is of interest. Such situations could arise, e.g., in a medical application if one has already determined the location of an organ, tumor or other abnormality that is to be further monitored. Moreover, in an industrial application one may only be interested in a certain component or section of some structure. In addition, the introduced division enables modeling obstructions that are difficult to image through, such as thick bone structures or other highly attenuating materials.

## 3 Bayesian inversion and experimental design

In Bayesian inversion all parameters carrying uncertainty are treated as random variables

[22, 31]

. The prior probability distributions for these parameters reflect the available information before the measurements are carried out. A measurement is modeled as a realization of a random variable depending on both the noise process and the random parameters in the forward model, as well as on the so-called design parameters that are deterministic and define the measurement setup. The Bayes’ formula is then employed to form the posterior probability density that updates the prior based on the information in the measurement. In our setting, the design variables are the projection angle and the lateral position of the parallel beam source-receiver pair, and our ultimate aim is to (sequentially) choose these parameters according to either the A or D-optimality criterion of Bayesian experimental design

[1, 10].

In this section, we only consider one step of the optimization process, that is, we assume to have a prior covariance (or its inverse) at hand and we consider optimally choosing the design parameters that define a single parallel beam projection. The computational efficiency is in the focus of our attention. The actual sequential optimization algorithm is presented in Section 4.

### 3.1 Bayes’ formula

Let represent the (noisy) projection image and be the design variable, and suppose our prior information on the (discretized) absorption distribution is encoded in a probability density . By Bayes’ formula, the posterior density for the (randomized) absorption reads

 π(x|y;p)=π(y|x;p)πpr(x)π(y;p),x∈Rn, (2)

where is the so-called likelihood function and the normalizing term in the denominator is the marginal density of the measurement evaluated at the available data .

In this work we assume the prior is Gaussian, i.e. , and the measurement can be modeled as a realization of the random variable

 Y=R(p)X+N, (3)

where is independent of . Here, and are symmetric positive definite covariance matrices, is the prior mean for , and we have explicitly indicated the nonlinear dependence of the discrete measurement matrix on the positioning of the parallel beam source-receiver pair. Under these simplifying assumptions, the posterior in (2) is also Gaussian with the covariance matrix and mean [22]

 Γpost(p) =(Γ−1pr+R(p)TΓ−1noiseR(p))−1, (4a) ˆx(p) =Γpost(p)(Γ−1prxpr+R(p)TΓ−1noisey), (4b)

respectively, as can be deduced by a straightforward completion of squares in (2). Using the Woodbury matrix identity, these equations can alternatively be written as

 Γpost(p) =Γpr−ΓprR(p)T(R(p)ΓprR(p)T+Γnoise)−1R(p)Γpr, (5a) ˆx(p) =xpr+ΓprR(p)T(R(p)ΓprR(p)T+Γnoise)−1(y−R(p)xpr), (5b) which are computationally far more attractive than (4) if m≪n, as is the case in our setting. However, if one is interested in the inverse covariance Γpost(p)−1, then (4) should be used.

### 3.2 A and D-optimality

In Bayesian optimal experimental design, one often considers minimizing the expected squared distance of the unknown in a given (semi)norm around some chosen point estimate, which corresponds to the so-called A-optimal design; in all our considerations, the point estimate of interest is the posterior mean. The other commonly used alternative is to look for the D-optimal design, that is, to maximize the information gain when the prior is replaced by the posterior. See, e.g., [1, 10] for more details.

Assuming the employed seminorm is induced by the positive semidefinite matrix for a given , in our simple, i.e. Gaussian, linear and finite-dimensional, setting, A-optimality corresponds to choosing a design parameter satisfying

 pA=argminptr(AΓpost(p)AT). (6)

On the other hand, D-optimality is equivalent to finding

 pD=argminplog(detΓpost(p)), (7)

where the logarithm follows from the definition of D-optimality but it also makes the numerical treatment of (7) considerably more stable as discussed in the following subsection. For completeness, the deductions of (6) and (7) are presented in Appendix A.

Since all pixels are of the same size in our discretized model, choosing

to be the identity matrix leads to using (a scaled)

-norm as the distance measure in (6). If one is only interested in the expected -accuracy of the posterior mean over some specific ROI within , one simply only needs to replace the diagonal elements of corresponding to the pixels in the complement of the ROI by zeros. On the other hand, being only interested in the information gain in the ROI for D-optimality corresponds to replacing in (7) by the matrix obtained by deleting the rows and columns of corresponding to the uninteresting pixels. We refer to Appendix A for the proof of this statement as well as for the exact form of the information gain when the prior is replaced by the posterior, as it equals only up to an affine transformation.

### 3.3 Evaluation of the target functionals

Since the target functionals in (6) and (7) typically have multiple local minima, we choose a straightforward approach and perform an exhaustive search over a two-dimensional grid. In consequence, the ability to efficiently evaluate the optimization targets becomes the top priority. To this end, observe that the (effective) rank of the perturbation in (4) is presumably since the measurement matrix corresponding to a single parallel beam projection is expected to be of full rank with all its singular values being of the same order of magnitude (cf. Section 2). Hence, our plan is to employ the matrix determinant lemma and the alternative formula for the posterior covariance in (4) in order to only need to compute traces of inverses and log-determinants for matrices of size . In particular, there is no apparent reason for resorting to Monte Carlo techniques, such as those in [3, 4, 7], because the low (effective) rank of the perturbation in (4) can be exploited explicitly.

With a suitable numbering of the pixels, the posterior covariance matrix can be written in a block form as

 Γpost=[ΓROIΓmixΓTmixΓrest,]=[(Γpost)11(Γpost)12(Γpost)21(Γpost)22]∈Rn×n, (8)

where and are the (marginal) covariance matrices for the pixels in the ROI and the rest of the image, respectively. We use a similar indexing for analogous block decompositions of other matrices as well. Moreover, as in (8), we often simplify the notation by not explicitly marking the dependence on .

###### Remark

Because our final product is a sequential optimization algorithm, the prior covariance for an optimization step is usually the optimized posterior from the previous one. Hence, having an explicit representation for at hand is arguably the most natural assumption for a single step of the algorithm; see (4). However, assuming is of moderate size, can also be formed explicitly via (4) without consuming too much time, and thus we assume to explicitly know in the following. On the other hand, if the noise covariance is diagonal, as it is in all our numerical experiments, knowing is essentially the same as knowing .

#### 3.3.1 D-optimality

As discussed in Section 3.2, the aim of D-optimal design with a preassigned ROI is to minimize

 ΦD(p):=log(detΓROI(p)) (9)

over in a certain subset of . Recall that our aim is to efficiently evaluate at numerous .

To begin with, observe that

 det(Γpost) =det[ΓROIΓmix0Γrest−ΓTmixΓ−1ROIΓmix] =det(ΓROI)det(Γrest−ΓTmixΓ−1ROIΓmix),

where is invertible as a nonempty diagonal block of a positive definite matrix. In particular, is the Schur complement of , meaning that

 Γ−1post=:Σ=[Σ11Σ12Σ21Σ22]=[ ∗ ∗ ∗(Γrest−ΓTmixΓ−1ROIΓmix)−1].

Altogether we thus have

 det(ΓROI)=det(Γpost)det(Γrest−ΓTmixΓ−1ROIΓmix)=det(Σ22)det(Σ),

and so one needs to consider evaluating the logarithms of and , both of which depend on the design parameter .

In the rest of this section, we explicitly mark which matrices depend on . According to (4),

 Σ(p) =Γ−1pr+R(p)TΓ−1noiseR(p), Σ22(p) =(Γ−1pr)22+R2(p)TΓ−1noiseR2(p),

where is a columnwise partitioning, with corresponding to the ROI and to the rest of the image. By virtue of the matrix determinant lemma,

 det(Σ(p)) det(Σ22(p)) =det(R2(p)((Γ−1pr)22)−1R2(p)T+Γnoise)det(Γ−1noise)det((Γ−1pr)22),

where only the first determinants, respectively, depend on . Moreover, the matrices associated to these -dependent determinants are only of size , i.e. small.

To evaluate , one thus needs to precompute Cholesky decomposition for the (large) positive definite matrix that is independent of . Since , i.e. the number of columns in , is assumed to be relatively low, this further enables building the (small) Cholesky decompositions

 C(p)C(p)T =R(p)ΓprR(p)T+Γnoise, (10a) ~C(p)~C(p)T =R2(p)((Γ−1pr)22)−1R2(p)T+Γnoise (10b)

for all on the employed grid. Finally, a straightforward algebraic manipulation gives

 log(det(ΓROI(p)))=2m∑j=1(log(~cjj(p))−log(cjj(p)))+c, (11)

where and , , are the diagonal elements of and , respectively, and is independent of .

Observe that it is far more stable to numerically evaluate the sum of logarithms in (11) than the products of the diagonal elements of the Cholesky factors needed for computing itself. Moreover, the constant in (11) can be evaluated by considering Cholesky decompositions for and either or ; the former was already employed when building (10), and the latter neither poses any difficulties if is not huge. In fact, one can even evaluate the actual information gain when the prior is replaced by the posterior for any without too severely compromising the computational efficiency; see Appendix A for further details. This makes it possible to compare the information gains between different rounds of the sequential optimization algorithm introduced in Section 4 below.

#### 3.3.2 A-optimality

As discussed in Section 3.2, when aiming at an A-optimal design, one needs to minimize

 ΦA(p):=tr(AΓpost(p)AT)

over in a certain subset of for a given . In our numerical tests presented in Section 5, is an identity matrix with the diagonal elements corresponding to the complement of the ROI replaced by zeros. Nevertheless, we perform the following calculations for an unspecified to maintain generality.

As in the case of D-optimality, we introduce a Cholesky decomposition

 C(p)C(p)T=R(p)ΓprR(p)T+Γnoise∈Rm×m,

and then form an auxiliary matrix

 B(p)=C(p)−1R(p)ΓprAT∈Rm×l.

According to (5),

 ΦA(p)=tr(AΓprAT−B(p)TB(p))=tr(AΓprAT)−tr(B(p)TB(p)).

Hence, fundamental properties of the matrix trace allow the representation

 ΦA(p)=c′−m∑j=1l∑k=1Bjk(p)2.

Since does not depend on , it does not affect the minimization of . However, evaluating does not considerably slow down the minimization process as a whole because it needs only be done once; knowing allows comparing the minimal values of between different iterations of the sequential minimization algorithm introduced in the following section.

## 4 Sequential optimization of measurements

In this section we present the algorithm for sequentially optimizing the parallel beam projections for X-ray tomography. We start with the basic algorithm that aims at finding the optimal projections prior to having any measurements at hand. Subsequently, we consider the modifications needed if one wants to estimate a hyperparameter in the prior based on the observed data or to change the ROI for the

th optimization round based on the reconstruction, i.e. the conditional mean (CM) estimate, computed from the first projections.

### 4.1 Basic algorithm

The algorithm is initialized by determining the discretization of into pixels, choosing the spacing of the detectors/rays for a single parallel beam projection, and specifying the covariance matrices and for the Gaussian prior of the absorption coefficient and for the Gaussian noise model, respectively.111The mean of the prior is not needed for the sequential optimization algorithm, unless it affects the noise model. One also needs to choose the ROI, select the number of optimization rounds, and determine how the design variable parametrizes a parallel beam projection. In our numerical experiments, the first component of corresponds to the projection angle and the second one to the distance of the bisection of the parallel beam source-receiver pair from the origin.

The optimization is performed sequentially, i.e., so that the posterior probability density after the th projection is used as the prior for choosing the th projection. At each step the target functional for A or D-optimality is evaluated for all on a two-dimensional optimization grid that also needs to be predetermined. The particular yielding the minimum value for the considered minimization target is chosen as the optimal parameter, and subsequently the posterior covariance, i.e. the prior covariance for the next optimization step, is formed according to (4) or (5).

The algorithm proceeds altogether as follows:

###### Algorithm 1
Choose the covariances and for the initial Gaussian prior and the noise model . Select the ROI, the grid for and the number of optimization rounds . Set and .
for  do
for all on the optimization grid do
Evaluate or as outlined in Section 3.3.
end for
Find the minimizer of the considered optimization target.
Append .
Form the posterior covariance .
Set .
end for
return

The columns of the output matrix define the sequentially optimized parallel beam projections. To be more precise, the th column of defines the A or D-optimal projection given the previous projections, that is, each of the optimized projections is (only) locally optimal. In particular, there is no reason to expect that the found parallel beam projections would be globally optimal [19]: It is a simpler and computationally less demanding task to optimize the projections one by one than to simultaneously find parallel beam projections that are jointly A or D-optimal. However, there is anyway reason to expect that the projections defined by the columns of are more optimal than, e.g., a random choice.

Due to the assumptions that the measurement model is linear with additive noise and the prior and noise process are independent Gaussians, Algorithm 1 can be run prior to performing any measurements. Hence, one may expect to have lots of computational time and resources for performing the sequential optimization of Algorithm 1 in an ‘offline mode’. However, it is also possible to change the ROI for the next optimization step or to estimate some free parameters in the prior based on the previous data or reconstruction as explained in the following two subsections.

### 4.2 Adaptive region of interest

If the sequentially optimal angles are not determined before the measurements but as a part of the online imaging procedure, the ROI may be altered between the optimization rounds based on the observations of the expert running the algorithm and, e.g., interesting or alarming features in the previous reconstruction. In this case, one also needs to give the initial prior mean as an input and incorporate the computation of the CM estimate in each step of the algorithm; see (4) and (5). Moreover, the natural output is no longer the sequence of optimal projections but the final reconstruction, and the outer iterations should be stopped by the operator of the algorithm.

Choose the covariances and as well as the mean for the initial Gaussian prior and the noise model . Select the ROI and the optimization grid for . Initialize , and .
while satisfactory reconstruction has not been reached do
Set .
for all on the optimization grid do
Evaluate or as outlined in Section 3.3.
end for
Find the minimizer of the considered optimization target.
Observe new data , where is a realization of the noise.
Form the posterior covariance and the CM estimate .

Redefine (heuristically) the ROI based on

.
Set and .
end while
Set and .
return   and

It follows from basic Bayesian analysis that the sequentially updated reconstruction and the spread estimator produced by Algorithm 2 are the same one would obtain by first collecting the data for all optimized projection angles and only then computing the CM estimate and the posterior covariance in a single step.

### 4.3 Data-driven hyperparameter estimation

Suppose the original prior covariance depends on a hierarchical parameter , the true value of which is unknown. Our aim is to introduce an algorithm that alternates between determining the optimal design parameter and the maximum likelihood estimate for the hyperparameter based on the measured data. The latter step can be considered as empirically adapting the prior at the same time as the sequential design process is performed.

Denote by

 Γms(ρ;p)=Γnoise+R(p)Γ0(ρ)R(p)T∈Rm×m

the covariance matrix of the measurement given the prior covariance and the projection matrix . The marginal density of the measurement conditioned on is hence given by

 π(y|ρ;p)∝exp(−12(log(detΓms(ρ;p))−(y−R(p)x0)TΓms(ρ;p)−1(y−R(p)x0))),

where is the (original) prior mean. In consequence, having independent measurements corresponding to the design parameters at hand leads to

 π (y1,…,yk|ρ;p1,…,pk)=k∏i=1π(yi|ρ;pi) (12) ∝exp(−12k∑i=1(log(detΓms(ρ;pi))−(yi−R(pi)x0)TΓms(ρ;pi)−1(yi−R(pi)x0))).

After obtaining the latest measurement corresponding to the design parameter , one can thus determine the latest maximum likelihood (ML) estimate for the hyperparameter by maximizing (12) with respect to .

In addition to the computational cost corresponding to maximizing (12), an extra price one needs to pay for the empirical estimation of a hyperparameter is the need to appropriately update the posterior for to make it compatible with the current ML estimate . In other words, one needs to recompute the posterior from a scratch: is formed, e.g., via (4) or (5) with , built from diagonal blocks, and replaced by the ‘total projection matrix’ .

###### Algorithm 3 (data-driven hyperparameter estimation)
Choose the covariance for the noise model and the mean for the Gaussian prior. Initialize , , and set . Select the ROI and the grid for .
while e.g., a satisfactory reconstruction has not been reached do
Set .
for all on the optimization grid do
Evaluate or as outlined in Section 3.3.
end for
Find the minimizer of the considered optimization target.
Observe the data , where is a realization of the noise.
Solve for the ML estimate
Form the posterior covariance “from a scratch”.
Set .
end while

Computing the ML estimate is relatively straightforward: When , one needs to pay some attention to make sure the employed minimization technique for finding is not very slow and does not diverge in case there is not much information on the whereabouts of the true parameter value to begin with. However, when for is available, it can be used as the initial guess for finding by the Newton’s method that exhibits fast convergence in our numerical experiments, where is the correlation length for a certain parametrized covariance structure. If the elementwise derivatives of are known, the required first two derivatives for the term in the exponent of can be straightforwardly evaluated by resorting to differentiation formulas for the determinant and the matrix inverse. Altogether, finding the needed ML estimates poses no severe computational difficulties if is relatively small and , as is the case in our numerical studies.

## 5 Numerical experiments

In all our tests, is the unit square divided into a uniform mesh of pixels. The maximal distance of X-rays passing through from its center point is set to . If contains an obstruction that blocks X-rays, the rows corresponding to the lines passing through the obstruction are deleted from the considered projection matrices, as the corresponding measurements do not carry any information on the unknown absorption and can thus be ignored. Hence, the number of active detectors may vary as a function of the position of the parallel beam source-receiver pair relative to the possible obstruction, which is taken into account when optimizing the optimal projections and computing related reconstructions. (It should be intuitively clear that optimal parallel beam projections try to avoid imaging through an obstruction whenever possible.)

The (initial) prior covariance between the remaining pixels is assumed to be of the form

 (Γ0)i,j=γ2exp(−|xi−xj|22ℓ2), (13)

where denotes the Euclidean norm, is the so-called correlation length and

is the pixelwise standard deviation. Here

and are the coordinates of the pixels (or more specifically the pixel centers) with indices and . Loosely speaking, the larger is, the less oscillating are random draws from the (initial) prior . The components of the additive noise in (3) are assumed to be independent with a common standard deviation .

### 5.1 Test 1: Basic algorithm

Our first numerical tests consider the basic version of our optimization routine, i.e. Algorithm 1. The aim is to demonstrate that the algorithm generates sequentially A-optimal parallel beam projections that are intuitively acceptable and clearly outperform random selections. Moreover, according to our tests, the optimization of projections can be performed using a considerably coarser discretization compared to the one used for computing the actual reconstructions, without significantly compromising the overall performance of the approach. Regarding D-optimality, it is numerically demonstrated that the sequential optimization scheme may sometimes lead to globally suboptimal sets of projections, although in many cases the sets of sequentially A and D-optimal projections are actually qualitatively very similar. For completeness it should also be admitted that the possibility that sequentially A-optimal projections may also sometimes suffer from apparent global nonoptimality cannot be excluded based solely on the presented results.

#### Test 1.1: Sanity check

Let us start by applying Algorithm 1 to the simplest possible setting: the ROI is , the parallel beam source-receiver pair is of the maximal width , and there are no obstructions inside . The number of pixels per edge of is chosen to be corresponding to altogether pixels, and the number of individual detectors is per a single projection image. The pointwise standard deviation of the prior covariance in (13) is assumed to be and the correlation length is chosen as . A random draw from the prior probability distribution of the absorption is visualized in the left-hand imaged of Figure 1. Here and in what follows, we assume the prior has zero mean, which is physically unrealistic as such, but mathematically this assumption simply corresponds to translating the coordinate system in by some given physically sensible, component-wise positive prior mean

. The standard deviation of the additive white noise is

.

Figure 2 presents the A and D-optimality target functionals as functions of the projection angle over the first ten iterations of Algorithm 1. To be quite precise, a modified A-optimality target is considered as it corresponds to the expected reconstruction error, and in case of D-optimality, the depicted quantity is the actual information gain when the prior is replaced by the posterior. (The maximization of the information gain is equivalent to minimizing , as explained in Appendix A.) For both optimality criteria, the optimal angles are distributed rather uniformly over the interval in such a way that the most recent projection angle is always located around the midpoint of the widest angular interval with no previous projections, which is intuitively what one would expect. It is noteworthy that after two projections the target functionals have multiple local optima; there is no reason to expect this would not be the case for more complicated imaging configurations as well.

The right-hand image of Figure 1 shows the mean reconstruction errors after each of the first ten A and D-optimal projections over a sample of target absorptions drawn from the assumed prior. To be more precise, the noisy measurements are first simulated for each of the target absorptions and all optimal angles. Then, the reconstructions, i.e. posterior means, are formed and compared to the corresponding targets when including the simulated noisy projections in the reconstruction process one by one in the same order as they were introduced by Algorithm 1. The A and D-optimal angles seem to perform equally well on average, even though the mean reconstruction error is precisely the quantity the A-optimal angles were designed to minimize.

For comparison, random projection angles are also considered: the mean

reconstruction errors over the considered sample of absorptions are computed for 1000 sequences of random projection angles, the components of which are picked independently from the uniform distribution over

. The right-hand image of Figure 1 illustrates the averages of the resulting mean reconstruction errors together with their standard deviations over the considered random sequences of projection angles. According to Figure 1, the A and D-optimal projections clearly outperform the random selections — at least in this simple geometric setting.

#### Test 1.2: Discoidal ROI

The second test is essentially a repetition of the first one, but with a parallel beam source-receiver pair of narrower width and the ROI specified to be a disk of radius centered at ; see the left-hand image of Figure 3. Observe, in particular, that the positioning of the source-receiver pair can now also be adjusted laterally and its width is the same as the diameter of the ROI. The standard deviation of the noise is set to , and the number of individual detectors in the narrower parallel beam sensor is . The other parameters are as in Test 1.1, that is, , and .

Figures 4 and 5 visualize the first six sequentially A and D-optimal projections, respectively, with the corresponding pixelwise (posterior) standard deviations shown in the background. The A-optimal projections behave as intuitively as in Test 1.1: all projections approximately cover the ROI and they are also distributed relatively uniformly over all angles. Although the D-optimal projections also cover the ROI, they correspond to subsequent rotations of about relative to one of the previous projections. Even though such projections are indeed locally D-optimal, they definitely do not form a globally D-optimal set of projections: for example, the six sequentially A-optimal projections in Figure 4 jointly produce a lower value for the D-optimality target than the sequentially D-optimal ones in Figure 5. This demonstrates that our sequential optimization procedure does not always produce parallel beam projections that are close to the global optimum.

The right-hand image of Figure 3 illustrates the mean reconstruction errors after each of the first ten (noisy) A and D-optimal projections over a sample of target absorptions drawn from the assumed prior. As in Figure 1 of Test 1.1, these mean errors are compared with the averages and standard deviations of the corresponding mean reconstruction errors for (semi)randomly selected projection sequences. To be more precise, the angles of the projections in these sequences are picked randomly from the uniform distribution over , but the lateral position of the parallel beam source-receiver pair is subsequently adjusted so that all projections cover the ROI in order to allow a fair comparison with the optimized projections. The sequentially A-optimal projections clearly produce the lowest mean reconstruction errors, but this time around the random selections slightly outperform the sequentially D-optimal projections in the considered performance metric.

#### Test 1.3: Obstruction and effect of coarse discretization

Next we test Algorithm 1 with an obstruction inside and also examine the effect that a coarse discretization has on the optimal angles. Only A-optimality is considered — the results for D-optimality would be qualitatively the same in this case. The obstruction blocking the X-rays is defined via , with being the ROI; see the left-hand image in Figure 6. The parallel beam source-receiver pair has width , and the parameters defining the prior and the additive measurement noise are also the same as in the previous test: , and . We employ two different levels of discretization: moderately dense, with and , and very coarse, with and .

Figure 7 presents the first six sequentially A-optimal projections for the denser discretization, with the corresponding pixelwise (posterior) standard deviations shown in the background. As expected, the algorithm tries to avoid imaging through the obstruction and emphasizes to begin with reducing the uncertainty in the right half of , where imaging from almost all directions is possible. The projections produced by Algorithm 1 for the coarser discretization, not illustrated here, are qualitatively similar, but not exactly the same as the ones shown in Figure 7.

The right-hand image of Figure 6 depicts the mean reconstruction errors after each of the first ten (noisy) A-optimal projections over a sample of target absorptions drawn from the assumed prior. Both sets of optimal projections, i.e. those deduced using the dense discretization ( and ) and the ones deduced using the coarse discretization ( and ), are considered. However, all reconstructions and the corresponding errors are computed using the dense discretization. Based on the right-hand image of Figure 6, it seems quite obvious that the level of discretization used for finding the sequentially A-optimal projections does not play a major role in the performance of the overall approach — at least for the studied parameter values and the simple geometry.

### 5.2 Test 2: Adaptive region of interest

Let us then study a setting where the information on the location of the ROI is updated during the imaging process, that is, we test Algorithm 2. We adopt the geometry in the left-hand image of Figure 6 as well as the parameter values in (the denser discretization of) Test 1.3: , , , and . The width of the parallel beam source-receiver pair is , and is the ROI to begin with.

The top left image in Figure 8 shows the true absorption inside : in the top right corner, there is an inhomogeneity that is characterized by a higher level of absorption compared to the rest of the domain. The other two images in the top row of Figure 8 present the first two A-optimal projections and the corresponding reconstructions, i.e. the posterior means, computed from simulated noisy measurements. Notice that these first projections are the same as the corresponding ones in Figure 6. Since the second reconstruction hints there may be something interesting going on in the top right quadrant of , the ROI is redefined accordingly. The subsequent three A-optimal projections are illustrated in the bottom row of Figure 8 together with the resulting reconstructions. Comparing these results with those in Figure 6, it seems that the redefinition of the ROI had the desired effect as all of the last three projections clearly provide information on the top right quadrant of .

### 5.3 Test 3: Simultaneous estimation of the prior correlation length

Our final numerical experiment applies Algorithm 3 to deducing the value of an unknown correlation length in the prior covariance (13). The other free parameters are chosen to be , , and . The parallel beam source-receiver pair has width , i.e. its position cannot be adjusted laterally, there are no obstructions inside , and only A-optimality is considered. The presented numerical results would be qualitatively similar if the unknown parameter in the prior were instead the pointwise standard deviation or/and if D-optimality were used as the criterion for choosing the projection angles.

The following simple test is repeated times: A random value for the true correlation length is picked from the uniform distribution over and a corresponding target absorption is subsequently drawn from the zero-mean Gaussian density with the prior covariance (13). Then, Algorithm 3 is run with the conservative initial guess for the correlation length. The estimate for the correlation length after the first projection, i.e. , is determined by performing ten steps of the golden section line search over the interval and subsequently applying the Newton’s method to fine-tune the location of the optimum. When determining for , mere Newton’s method is used with as the initial guess. In all cases, the Newton’s method is stopped when its step size is less than , and as a safety measure, Newton steps with absolute value larger than are scaled down to (retaining the sign of the step). No obvious problems with the convergence of the implemented minimization routine for finding were observed during computations.

The left-hand imaged in Figure 9 shows the mean signed errors in the estimate for during the first ten iterations of Algorithm 3 over the aforementioned sample of 1000 absorption targets. The corresponding standard deviations are visualized as errorbars. The mean signed error is almost zero independently of the number of (sequentially A-optimal) projections, whereas the standard deviation of the estimate over the sample of absorption targets decreases from approximately after the first projection to about after the tenth one. Altogether, the deduction of the prior correlation length based on noisy projection data seems to function well.

Certain mean errors in the reconstructions computed based on noisy parallel beam projections of the studied sample of target absorptions are shown as functions of the number of projections in the right-hand image of Figure 9. The reconstructions and the corresponding errors are computed in two different ways: (i) by accounting for the change in the estimate for the prior correlation length predicted by Algorithm 3 after each new sequentially A-optimal projection and (ii) by sticking with the conservative initial guess of for the correlation length throughout the optimization of the projection angles and in the computation of the corresponding reconstructions. It is obvious that including the estimation of the correlation length as a part of the reconstruction algorithm leads to superior reconstruction accuracy, although this also makes the algorithm significantly more expensive computationally. If the initial guess for the prior correlation length were (significantly) more accurate, it is debatable whether fine-tuning the estimate for would be worthwhile computationally.

## 6 Concluding remarks

We have introduced a greedy, exhaustive optimization algorithm that applies Bayesian OED to optimizing parallel beam projections in X-ray imaging. Although the sequential algorithm does not guarantee global optimality for the set of projections it predicts, our numerical experiments demonstrate that it usually exhibits satisfactory global behavior as well. More importantly, it fits to the medical imaging paradigm where the patient is exposed to the minimal amount of radiation without knowing a priori the necessary number of imaging angles. To this end, we have discussed modifications of the algorithm that allow tuning the ROI or some parameters in the (original) prior based on the measured data.

The main obstacle for adopting our algorithm in practical use is arguably its computational complexity: as the optimization target typically has many local minima (cf. Figure 2), we seek the optimal parameters via evaluating the target on a dense enough grid in order to guarantee finding the global optimum for the next projection. In a three-dimensional setting, such a simple approach becomes prohibitively expensive as the number of pixels in a single projection image can be in the order of , and thus a more sophisticated adaptive optimization routine needs to be developed for more realistic imaging geometries. On the positive side, the overall performance of our approach does not seem to depend heavily on the level of discretization, which may permit a procedure where the next projection geometry is optimized using a coarse discretization, but the actual reconstruction corresponds to a much denser one.

In addition, only a very simple Gaussian prior for the absorption distribution was employed in our numerical studies. The introduced algorithm should thus be tested with more realistic priors and, even more importantly, its functionality should be verified if no Gaussian prior is well aligned with the imaged target.

## Appendix A Target functionals for Bayesian experimental design

The purpose of this appendix is to deduce the minimization targets (6) and (7) and to indicate how the latter is actually related to the information gain when a prior is replaced by the posterior. It is also explained why should be replaced by in (7) if one is only interested in the information gain over the ROI, cf. (8).

### a.1 D-optimality

The Kullback–Leibler divergence of the distribution of from that of , i.e.,

 DKL(X|y∥X;p) =∫Rnlog(π(x|y;p)πpr(x))π(x|y;p)dx =∫Rn(log(π(x|y;p))−log(πpr(x)))π(x|y;p)dx, (14)

has an interpretation as the increase in the level of information when the prior distribution of is replaced by the posterior induced by the data . The aim of D-optimal design is to maximize the expectation of over all measurements, which means the precise form of the minimization target is

 ~ΦD(p)=Ey(−DKL(X|y∥X;p))=−∫RmDKL(X|y∥X;p)π(y;p)dy;

see (9) for comparison.

The expression

 h(X|y)=−∫Rnlog(π(x|y;p))π(x|y;p)dx (15)

appearing in (A.1) is called the differential entropy of . Since is Gaussian, allows an explicit representation [29]

 h(X|y)=n2+n2log(2π)+12log(det(Γpost(p))). (16)

On the other hand, the expected value of the second term on the right-hand side of (A.1) satisfies

 Ey(∫Rnlog(πpr(x))π(x|y;p)dx)= ∫Rm∫Rnlog(πpr(x))π(x,y;p)dxdy = ∫Rn∫Rmπ(y|x;p)dylog(πpr(x))πpr(x)dx = ∫Rnlog(πpr(x))πpr(x)dx = −n2−n2log(2π)−12log(detΓpr),

where the last expression is simply for the Gaussian random variable . In particular, take note that this expression does not depend on the design variable .

Because does not depend on , we finally arrive at the simple expression

 ~ΦD(p) =Ey(h(X|y))−h(X) (17) =12(log(detΓpost(p))−log(detΓpr)),

which explains why the minimizers of and in (9) are the same if ROI is the whole . Note that corresponds to the actual information gain that is to be maximized.

Finally, observe that the above calculation can be repeated as such if and are replaced by the corresponding marginal densities over the ROI. Hence, the precise form of the minimization target when the information gain is to be maximized over the ROI is obtained by replacing by and by in (17), cf. (9).

### a.2 A-optimality

To begin with, let and observe that the CM estimate , given in (4), naturally depends on the data . The expected squared distance of the unknown from in the squared seminorm induced by

over the joint distribution of

 ΦA(p) =∫Rm∫Rn(x−ˆx(y;p))TATA(x−ˆx(y;p))π(x,y;p)dxdy =∫Rm∫Rntr((x−ˆx(y;p))TATA(x−ˆx(y;p)))π(x|y;p)dxπ(y;p)dy

as the trace of a scalar is the scalar itself. Because the trace is linear and invariant under cyclic permutations, we get

 ΦA(p) =tr(A∫Rm∫Rn(x−ˆx(y;p))(x−ˆx(y;p))Tπ(x|y;p)dxπ(y;p)dyAT) =tr(A∫RmΓpost(p)π(y;p)dyAT) =tr(AΓpost(p)AT),

where the last step follows from being independent of in our simple, Gaussian and linear setting.

## References

• [1] Alexanderian, A., Gloor, P. J., Ghattas, O., et al. On Bayesian A- and D-optimal experimental designs in infinite dimensions. Bayesian Anal. 11, 3 (2016), 671–695.
• [2] Alexanderian, A., Petra, N., Stadler, G., and Ghattas, O. A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized -sparsification. SIAM J. Sci. Comput. 36, 5 (2014), A2122–A2148.
• [3] Alexanderian, A., Petra, N., Stadler, G., and Ghattas, O. A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems. SIAM J. Sci. Comput. 38, 1 (2016), A243–A272.
• [4] Alexanderian, A., and Saibaba, A. K. Efficient D-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems. SIAM J. Sci. Comp. 40, 5 (2018), A2956–A2985.
• [5] Atkinson, A., Donev, A., Tobias, R., et al. Optimum experimental designs, with SAS, vol. 34. Oxford University Press, 2007.
• [6] Attia, A., Alexanderian, A., and Saibaba, A. K. Goal-oriented optimal design of experiments for large-scale Bayesian linear inverse problems. Inverse Problems 34, 9 (2018), 095009.
• [7] Avron, H., and Toledo, S. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM 58, 2 (2011), 1–34.
• [8] Batenburg, J. K., De Carlo, F., Mancini, L., and Sijbers, J. Advanced x-ray tomography: experiment, modeling, and algorithms. Meas. Sci. Technol. 29, 8 (2018).
• [9] Borg, L., Jørgensen, J. S., Frikel, J., and Sporring, J. Reduction of variable-truncation artifacts from beam occlusion during in situ x-ray tomography. Meas. Sci. Technol. 28, 12 (2017), 124004.
• [10] Chaloner, K., and Verdinelli, I. Bayesian experimental design: A review. Stat. Sci. (1995), 273–304.
• [11] Fohring, J., and Haber, E. Adaptive A-optimal experimental design for linear dynamical systems. SIAM/ASA J. Uncertainty Quantification 4, 1 (2016), 1138–1159.
• [12] Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems 24, 5 (2008), 055012.
• [13] Haber, E., Horesh, L., and Tenorio, L. Numerical methods for the design of large-scale nonlinear discrete ill-posed inverse problems. Inverse Problems 26, 2 (2009), 025002.
• [14] Haber, E., Magnant, Z., Lucero, C., and Tenorio, L. Numerical methods for A-optimal designs with a sparsity constraint for ill-posed inverse problems. Comput. Opt. Appl. 52, 1 (2012), 293–314.
• [15] Hämäläinen, K., Kallonen, A., Kolehmainen, V., Lassas, M., Niinimäki, K., and Siltanen, S. Sparse tomography. SIAM J. Sc. Comput. 35, 3 (2013), B644–B665.
• [16] Hannukainen, A., Hyvönen, N., and Perkkiö, L. Inverse heat source problem and experimental design for determining iron loss distribution. arXiv preprint arXiv:2003.10395 (2020).
• [17] Huan, X., and Marzouk, Y. Gradient-based stochastic optimization methods in Bayesian experimental design. Int. J. Uncertain. Quan. 4, 6 (2014).
• [18] Huan, X., and Marzouk, Y. M. Simulation-based optimal Bayesian experimental design for nonlinear systems. J. Comput. Phys. 232, 1 (2013), 288–317.
• [19] Huan, X., and Marzouk, Y. M. Sequential Bayesian optimal experimental design via approximate dynamic programming. arXiv preprint arXiv:1604.08320 (2016).
• [20] Hyvönen, N., Seppänen, A., and Staboulis, S. Optimizing electrode positions in electrical impedance tomography. SIAM J. Appl. Math. 74 (2014), 1831–1851.
• [21] Jaffray, D. A. Image-guided radiotherapy: from current concept to future perspectives. Nat. Rev. Clin. Oncol. 9, 12 (2012), 688.
• [22] Kaipio, J., and Somersalo, E. Statistical and computational inverse problems, vol. 160. Springer Science & Business Media, 2006.
• [23] Khodja, M., Prange, M., and Djikpesse, H. Guided Bayesian optimal experimental design. Inverse Problems 26, 5 (2010), 055008.
• [24] Long, Q., Motamed, M., and Tempone, R. Fast Bayesian optimal experimental design for seismic source inversion. Comput. Methods Appl. Mech. Eng. 291 (2015), 123–145.
• [25] Long, Q., Scavino, M., Tempone, R., and Wang, S. Fast estimation of expected information gains for Bayesian experimental designs based on laplace approximations. Comput. Methods Appl. Mech. Eng. 259 (2013), 24–39.
• [26] Natterer, F., and Wübbeling, F. Mathematical methods in image reconstruction. SIAM Monographs on Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2001.
• [27] Radon, J. Über die bestimmung von funktionen durch ihre integralwerte längs gewisser mannigfaltigkeiten. Ber. Verh. Sächsischen Akad. Wiss. Leipzig. Math.-Phys. Kl. 69 (1917), 262–277.
• [28] Ruthotto, L., Chung, J., and Chung, M. Optimal experimental design for inverse problems with state constraints. SIAM J. Sci. Comput. 40, 4 (2018), B1080–B1100.
• [29] Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27 (1948), 379–423.
• [30] Siltanen, S., Kolehmainen, V., Järvenpää, S., Kaipio, J. P., Koistinen, P., Lassas, M., Pirttilä, J., and Somersalo, E. Statistical inversion for medical x-ray tomography with few radiographs: I. general theory. Phys. Med. Biol. 48 (may 2003), 1437–1463.
• [31] Stuart, A. M. Inverse problems: A Bayesian perspective. Acta Numer. 19 (2010), 451–559.