Alternating Minimization Algorithm with Automatic Relevance Determination for Transmission Tomography under Poisson Noise

December 29, 2014 ∙ Yan Kaganovsky et al. ∙ Duke University and Washington University in St. Louis

We propose a globally convergent alternating minimization (AM) algorithm for image reconstruction in transmission tomography, which extends automatic relevance determination (ARD) to Poisson noise models with Beer's law. The algorithm promotes solutions that are sparse in the pixel/voxel-differences domain by introducing additional latent variables, one for each pixel/voxel, and then learning these variables from the data using a hierarchical Bayesian model. Importantly, the proposed AM algorithm is free of any tuning parameters, while achieving image quality comparable to standard penalized-likelihood methods. Our algorithm exploits optimization transfer principles which reduce the problem to parallel 1D optimization tasks (one for each pixel/voxel), making the algorithm feasible for large-scale problems. This approach considerably reduces the computational bottleneck of ARD associated with the posterior variances. Positivity constraints inherent in transmission tomography problems are also enforced. We demonstrate the performance of the proposed algorithm for x-ray computed tomography using synthetic and real-world datasets. The algorithm is shown to have much better performance than prior ARD algorithms based on approximate Gaussian noise models, even for high photon flux.


1 Introduction

Tomographic image reconstruction is the process of estimating an object from measurements of its line integrals along different angles [1]. Tomographic reconstruction is an ill-posed problem [2] in the sense that there are multiple solutions that are consistent with the data. It is therefore desirable to make use of prior knowledge, which includes the statistical characteristics of the measured data, and properties that are expected of the image.

The images in transmission tomography represent attenuation per unit length of the impinging beam due to energy absorption inside the imaged object. Well known examples are x-ray computed tomography (CT) [3], and electron tomography [4]. The attenuation must be nonnegative, since a negative value corresponds to an increase in the intensity of the exiting beam, which is non-physical. These images can often be well approximated by a sparse representation in some transform domain. In some applications, sparsity is present directly in the native image domain [5], but more commonly it is present in the pixel-difference domain [6] or in some transform domain, such as the wavelet, curvelet or shearlet transforms [7, 8].

There are generally two types of approaches for tomographic image reconstruction. The first type consists of one-shot algorithms, such as filtered back-projection [1] and its extensions, which rely on analytic formulas. These algorithms do not incorporate the type of prior knowledge mentioned above and also produce prominent artifacts when some of the data are missing. The second type consists of iterative algorithms which are based on minimizing some objective function. The latter enable the incorporation of prior knowledge about the data and the image of interest, e.g., the objective function can put less emphasis on measurements that are expected to be more noisy, include penalties that promote sparsity of the image in some representation, and positivity constraints for the image can also be included. In addition, iterative algorithms are far more robust to missing data.

Many transmission tomography problems involve physical processes consisting of quanta, e.g., the number of transmitted photons in x-ray CT, or the number of transmitted electrons in electron tomography. Therefore, there is a strong physical motivation for modeling these processes using independent Poisson random variables [4, 6]. In the medical CT community, the Poisson noise model is commonly used [6] and provides the basis for a well-known class of iterative algorithms [6, 9, 11, 12, 13]. In addition, the mean of each measurement is modeled according to Beer's law [6], which is an inherent non-linearity in transmission data.

In this paper, we develop a new class of statistical iterative algorithms for image reconstruction in transmission tomography that is inspired by automatic relevance determination (ARD) [14, 15, 16, 17]. It incorporates prior knowledge that includes Poisson statistics of the underlying signals, Beer’s law, positivity of the image, and also sparsity of the image in some underlying representation. What sets this new class apart from prior art is the fact that it automatically learns the balance between data-fidelity and prior knowledge, thus avoiding the use of any tuning parameters and the difficulty in determining them. In addition, it also computes posterior variances, which can be used for adaptive sensing/experimental design, or for determining the system’s sensitivity to noise. To the best of our knowledge, the proposed algorithm is the first ARD-based algorithm for Poisson noise that is both scalable to very large problems and that is also guaranteed to converge.

Our work is motivated by x-ray CT which is used in medical and security applications [1]. Nevertheless, the methods presented here can be applied more generally to any type of transmission tomography that involves count data. In formulating the problem presented in Sec. 1.1 we have tried to retain some generality in the hope that other imaging modalities could potentially benefit from the proposed approach. Naturally, this entails some simplifications of the forward model which are discussed later in Sec. 7.

It should be noted that most modern commercial CT scanners do not employ photon-counting detectors. However, there are already photon-counting detectors used in clinical studies [18], and it is certainly possible that they will be integrated into commercial CT scanners in the near future. Modern CT scanners also utilize a high photon flux, which has led various researchers to propose algorithms that are based on an approximate post-log Gaussian noise model [19, 20, 21] rather than a Poisson noise model. The Gaussian model can be derived by making a 2nd-order Taylor series approximation to the log-likelihood corresponding to the Poisson noise model with Beer's law, under the assumption of moderate to high photon flux. This results in a data-weighted least-squares datafit term [19], which also corresponds to a post-log Gaussian noise model. However, in low-count scenarios, which are critical for maintaining a low radiation dose in pediatric and small-animal CT, there could be a substantial performance benefit in using the Poisson model once photon-counting detectors become widely available. For example, it has been shown empirically that the approximate Gaussian model leads to a systematic negative bias that increases as photon flux decreases [22].
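
For completeness, a brief sketch of that standard expansion (notation ours, not the paper's): for a single measurement y_i with mean I_i exp(−l_i), where l_i denotes the corresponding line integral, the Poisson log-likelihood is, up to constants,

    L_i(l_i) = −y_i l_i − I_i \exp(−l_i).

Expanding to second order about the post-log data point \hat{l}_i = \log(I_i / y_i), where L_i'(\hat{l}_i) = 0 and L_i''(\hat{l}_i) = −y_i, gives

    L_i(l_i) ≈ L_i(\hat{l}_i) − (y_i / 2) (l_i − \hat{l}_i)²,

i.e., a least-squares datafit in the post-log data with weights y_i, which corresponds to a Gaussian noise model on \hat{l}_i with variance 1/y_i.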

1.1 Model

We consider a Poisson noise model for transmission tomography [6] given by

    p(y | x) = \prod_{i=1}^{M} Poisson( y_i ; I_i \exp(−φ_i^T x) ),    (1)

where here and henceforth p(y | x) denotes the likelihood function, y = (y_1, …, y_M) contains the measurements, Poisson(y; λ) denotes the univariate Poisson probability distribution with rate λ, and the exponential function encompasses Beer's law (a physical property of the signals [6]). The column vector x denotes the lexicographical ordering of the 2D/3D image to be reconstructed, where the entry x_j is the linear attenuation coefficient at the j-th pixel/voxel in units of inverse length. The operation φ_i^T x represents a line integral over the image corresponding to the i-th source-detector pair (see Fig. 1) and to the measurement y_i. We denote by Φ the system matrix that is formed by concatenating the row vectors φ_i^T (i = 1, …, M). I_i is the mean number of photons that would have been measured using the i-th source-detector pair if the imaged object was removed. We make the standard assumption that I_i was measured for all i in an independent calibration experiment and is known to us.

We construct Φ using the approach in [23], where the entry φ_ij at the i-th row and j-th column of Φ is equal to the length at which the line between the i-th source-detector pair intersects the j-th pixel/voxel (see Fig. 1). Each ray traverses at most O(p^{1/d}) pixels/voxels, where p is the total number of pixels/voxels and d = 2, 3 for 2D/3D images, respectively. Accordingly, each row in the system matrix has only O(p^{1/d}) non-zero entries. Henceforth, we absorb a reference linear attenuation coefficient into Φ so that x is dimensionless.
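
As a concrete illustration of the model in (1), the following minimal Python sketch simulates measurements for a toy problem. The sparse system matrix here is random and purely illustrative (a real Φ would be built from ray-pixel intersection lengths as in [23]); all names and sizes are ours.

    import numpy as np
    from scipy.sparse import random as sparse_random

    rng = np.random.default_rng(0)
    p = 64 * 64          # total number of pixels (64x64 image)
    M = 2000             # number of source-detector pairs (measurements)

    # Illustrative sparse system matrix; each row would hold the intersection
    # lengths of one ray with the pixels it traverses (random here for brevity).
    Phi = sparse_random(M, p, density=0.02, random_state=0, data_rvs=rng.random).tocsr()

    x = np.abs(rng.normal(0.0, 0.02, size=p))   # nonnegative attenuation image
    I0 = np.full(M, 1e4)                        # blank-scan photon counts I_i

    line_integrals = Phi @ x                    # [Phi x]_i, one entry per ray
    rates = I0 * np.exp(-line_integrals)        # Beer's law: mean transmitted counts
    y = rng.poisson(rates)                      # Poisson measurements, as in (1)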

Figure 1: Tomographic measurements in 2D. The total linear attenuation for the i-th measurement is equal to the line integral over the imaged domain along the path defined by the i-th source-detector pair. Several measurements are collected simultaneously using several detectors. The source and detectors are rotated to collect different views. Each line integral is modeled as φ_i^T x = Σ_j φ_ij x_j, where x_j is the linear attenuation coefficient at the j-th pixel in units of inverse length and φ_ij is the intersection length of the i-th line with the j-th pixel. If p is the total number of pixels, then each line intersects at most O(√p) pixels.
Figure 2: An example of an adjacency graph defining a neighborhood in (10). Each node represents a pixel in the image, and (j, k) ∈ N iff the j-th and k-th nodes are connected by an edge. In this example, the node colored in light green and marked by a star has 4 neighboring pixels, shown in dark green. Each neighbor has a weight of 1/4 when computing their average. The image is augmented to introduce a boundary of pixels with zero values, represented by white nodes. The latter are not included in the forward model and are merely introduced to provide the interpretation for the weight 1/4 when averaging fewer than 4 neighbors, e.g., consider the node colored in light red and marked by a square.

1.2 Background

The method proposed in this paper is an extension of automatic relevance determination (ARD). For the convenience of readers unfamiliar with ARD, we first provide background explaining the potential advantages of ARD that motivate this work, and we discuss the key differences between ARD and other, more common sparse estimation methods. Our aim here is to emphasize previous challenges in adapting ARD to the Poisson model in (1), challenges which we aim to resolve with the proposed method.

1.2.1 Maximum a Posteriori (MAP) Estimation

Statistical iterative methods for CT [6, 9, 11, 13] cast the problem of image reconstruction as a penalized maximum likelihood estimation problem, given by

    x̂ = arg max_x { L(y; x) − λ R(x) },    (2)

where L(y; x) = \log p(y | x) is the log-likelihood function, which represents the datafit. For the Poisson model in (1), maximizing the log-likelihood is equivalent to minimizing the I-divergence² between the actual measurements y and the predicted measurements I ∘ \exp(−Φx) [13], where ∘ denotes an elementwise product. The prior probability distribution p(x) ∝ \exp(−λ R(x)) in (2) represents prior knowledge about x, and R(x) is a penalty that promotes solutions with desired properties according to this prior knowledge, e.g., the ℓ1 penalty [5] or the total variation (TV) penalty [25, 26], which promote sparse solutions. λ is a tuning parameter that determines the tradeoff between the datafit and the prior, or the bias-variance tradeoff. The “best” value of λ is usually unknown and depends on the imaged object at hand. To determine λ, it is often required to solve (2) many times for different values of λ, examine all reconstructed images, and then select the value that best fits the needs of the application, based on some chosen figure of merit. For many common choices of R considered below, decreasing λ will put more weight on the datafit term and will result in increased variance and in noisier images. Increasing λ will put more weight on the penalty term, and will result in increased bias and increased artifacts in the image, e.g., staircase artifacts. For a given λ, the solution in (2) can be interpreted as the maximum a posteriori (MAP) solution, i.e., the maximizing point of the posterior distribution, which is given by Bayes' theorem

    p(x | y) = p(y | x) p(x) / p(y) ∝ p(y | x) p(x).    (3)

²The I-divergence between two nonnegative vectors u and v is defined as I(u ‖ v) = Σ_i [ u_i \log(u_i / v_i) − u_i + v_i ] [24].

The ℓ1 and TV penalties are non-differentiable and do not allow the use of standard gradient-based methods for solving (2). An alternative to TV is the class of neighborhood roughness penalties (Gibbs priors) [11], which have the following form

    R(x) = Σ_{(j,k) ∈ N} w_{jk} ψ(x_j − x_k),    (4)

where w_{jk} is a positive weight inversely proportional to the distance between the centers of the j-th and k-th pixels/voxels, N is a set of ordered pairs (j, k) defining a neighborhood system, and ψ is chosen to be a positive, even, and twice continuously differentiable function. To preserve sharp edges in the image, ψ is often chosen such that it grows only linearly for large arguments, and it can be viewed as a smooth version of the ℓ1 penalty. A common choice is [10, 11]

    (5)

which approaches a quadratic for |t| ≪ δ and a linear function as |t| → ∞, where δ is an edge-preservation parameter. This type of penalty allows the use of standard gradient-based methods, but comes with the price of having an additional tuning parameter δ, so the search for the tuning parameters becomes even more cumbersome. Decreasing δ too much will result in noisy images but with sharp edges, and increasing δ too much will result in blurred images.
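
As one concrete example (our choice for illustration; the specific ψ used in (5) may differ), the Lange-type potential below is positive, even, twice continuously differentiable, approximately quadratic for |t| ≪ δ, and approximately linear for |t| ≫ δ:

    import numpy as np

    def edge_preserving_potential(t, delta):
        """Lange-type potential: ~ t^2/2 for |t| << delta and ~ delta*|t| for |t| >> delta."""
        a = np.abs(t) / delta
        return delta**2 * (a - np.log1p(a))

    def roughness_penalty(x, neighbor_pairs, weights, delta):
        """Neighborhood roughness penalty of the form (4): sum over pairs of w_jk * psi(x_j - x_k)."""
        return sum(w * edge_preserving_potential(x[j] - x[k], delta)
                   for (j, k), w in zip(neighbor_pairs, weights))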

1.2.2 Automatic Relevance Determination (ARD)

A different approach to sparse estimation is “automatic relevance determination” (ARD) [14]. ARD was originally proposed for Gaussian likelihoods with the prior [15]

    p(x | γ) = N( x ; 0, diag(γ) ),    (6)

where diag(γ) denotes a diagonal matrix with main diagonal γ = (γ_1, …, γ_p). In this approach sparsity is promoted by treating x as latent variables and seeking the newly introduced hyperparameters γ that maximize the marginal likelihood (also called the “model evidence” [16])

    γ̂ = arg max_γ p(y | γ) = arg max_γ ∫ p(y | x) p(x | γ) dx.    (7)

Once γ̂ is found, the estimate for x is given by the mean of the posterior distribution p(x | y, γ̂). Interestingly, the solutions to (7) have many γ_j that are (numerically) zero, and the posterior distribution for the corresponding parameters x_j becomes highly concentrated around zero³. The mechanism that promotes sparsity of x has been revealed by Wipf et al. [27, 28], with the analysis specifically tailored for Gaussian likelihoods.

³By employing Bayes' rule in (3), it can be easily shown that if the prior is concentrated around zero then the corresponding posterior distribution is also concentrated around zero.

The solution of (7) was originally addressed by Tipping in [15] using an expectation-maximization (EM) algorithm [29]. In the E-step, the Gaussian posterior probability distribution p(x | y, γ) is computed for the given hyperparameters γ. In the M-step, γ are updated based on the means and variances of the posterior distribution found in the E-step. As opposed to MAP estimation methods, which focus on the mode (maximum) of the posterior distribution, ARD methods also compute the covariance matrix (during the E-step), or at least the marginal variances, which provide additional information. ARD-based algorithms are therefore computationally more demanding than MAP methods, mainly due to the expensive task of estimating the variances. Previous extensions of ARD [28, 30] require the same type of operations and rely on approximations to the variances [30] to maintain scalability.
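
For concreteness, the sketch below shows a Tipping-style EM update for the Gaussian-likelihood case (background only, not the algorithm proposed in this paper; notation and function names are ours). Note the p x p matrix inversion in the E-step, which is the source of the scalability problem discussed below.

    import numpy as np

    def ard_em_gaussian(A, y, noise_var, n_iter=50):
        """EM-type ARD for y = A x + Gaussian noise, with prior x_j ~ N(0, gamma_j).
        Background sketch in the spirit of [15]."""
        M, p = A.shape
        gamma = np.ones(p)
        for _ in range(n_iter):
            # E-step: Gaussian posterior p(x | y, gamma) = N(mu, Sigma)
            Sigma = np.linalg.inv(A.T @ A / noise_var + np.diag(1.0 / gamma))
            mu = Sigma @ (A.T @ y) / noise_var
            # M-step: update each hyperparameter from the posterior mean and variance
            gamma = mu**2 + np.diag(Sigma)
        return mu, np.diag(Sigma), gamma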

Despite computational complications, the ARD approach can be quite rewarding and excel in hard or ill-posed problems, as evidenced in several recent applications [31]. A very significant advantage of ARD methods is their ability to automatically learn the balance between the datafit and the degree of sparsity. This avoids the tuning of nuisance parameters, which is often nonintuitive, object-dependent, and requires many trials. The posterior variances are key in determining this tradeoff. The variances can also be used for adaptive sensing/experimental design [17, 30], which is outside the scope of this paper and will be the subject of future work.

1.2.3 Difficulties in Previous ARD-based Methods for Poisson Models

The ARD-EM algorithm [15] described in Sec. 1.2.2 cannot be directly applied to the Poisson likelihood in (1), since the E-step is intractable due to the non-conjugacy⁴ of the prior in (6) and the likelihood. Although we are not aware of any previously published work on ARD for Poisson models with Beer's law, a simple remedy to the non-conjugacy is the Laplace method, discussed in [15]. In that approach, the posterior is approximated during the E-step using a multivariate Gaussian distribution, with the mean equal to the MAP solution and the covariance matrix equal to the inverse of the Hessian matrix for the log-likelihood, evaluated at the MAP solution. However, the resulting “EM-like” algorithm is not based upon a consistent objective function, and is not guaranteed to converge. In addition, it requires inversions of the Hessian matrix with computational complexity O(p³), where p is the total number of pixels/voxels. This is infeasible for x-ray CT, where p could be very large (e.g., millions of voxels or more). Instead, one can use more sophisticated methods [30, 32, 33] to indirectly estimate the diagonal of the inverse, i.e., the marginal variances, since only these are required for updating the hyperparameters [15]. Note, however, that the work in [30, 32, 33] does not consider Laplace approximations, so these methods have not been tested for this case. Irrespective of how the variances are computed, the “EM-like” algorithm with the Laplace approximation is unsatisfactory from a principled perspective due to the lack of an objective function and due to the lack of convergence guarantees. The objective is also desired in order to assess convergence in practice and in order to verify the correctness of the implementation. In addition, we are faced with conceptual difficulties, since the analysis that revealed the sparsity-promoting mechanism of ARD by Wipf et al. [16, 27, 28] is restricted to Gaussian likelihoods.

⁴A prior is said to be conjugate to the likelihood if both the prior and posterior probability distributions are in the same family. The posterior is then given in closed form.

1.2.4 Previous ARD Method for Transmission Tomography Based on an Approximate Post-Log Gaussian Noise Model

As mentioned in the introduction, it is reasonable to replace the model in (1) with an approximate post-log Gaussian noise model under moderate to high photon flux conditions. An ARD algorithm for transmission tomography based on the Gaussian model and the EM algorithm in [15] was proposed by Jeon et al. [17]. It should be noted that the original ARD algorithm in [15] has a computational cost of O(p³) due to the computation of posterior variances, and therefore it is not practical for large-scale problems. In contrast, the algorithm proposed in [17] scales far more favorably with the problem size for both 2D and 3D images (d = 2, 3). There are several disadvantages of the algorithm in [17]: (1) it does not enforce positivity of the solution; (2) due to the approximate estimation of posterior variances, the overall algorithm does not have convergence guarantees; (3) it requires one to choose the type and number of probing vectors for estimating the variances [17], which depends on the specific geometry of the system and on the imaged object; (4) in order to estimate the posterior variances it requires solving many matrix equations, which is computationally expensive; (5) computing the objective function is too expensive to be feasible for large-scale problems; (6) it is limited to proper priors and complete sparse representations.

1.3 Contributions

The contributions of this paper are summarized as follows.

  1. We present a new alternating minimization (AM) algorithm for transmission tomography, which extends automatic relevance determination (ARD) to Poisson noise models with Beer’s law. The proposed algorithm is free of any tuning parameters and unlike prior ARD methods, also has a consistent objective function that can be evaluated in a computationally efficient way, in order to assess convergence in practice and to verify the correctness of the implementation.

  2. We reveal the sparsity-promoting mechanism in the proposed ARD-based algorithm for Poisson likelihoods. This preserves insights similar to those gained from previous work [16, 27, 28] for Gaussian likelihoods.

  3. We derive a scalable AM algorithm for transmission tomography that is well adapted to parallel computing. We reduce the computational complexity considerably by simplifying the algorithm into parallel 1D optimization tasks (one for each pixel/voxel), with the scalar mean and variance updates also done in parallel.

  4. We present a convergence analysis for the scalable AM algorithm and establish guarantees for global convergence (an algorithm is said to be “globally convergent” if for arbitrary starting points it is guaranteed to converge to a point satisfying the necessary optimality conditions, local optima included [34]). To the best of our knowledge, this is the first ARD-based algorithm for Poisson models that has convergence guarantees.

  5. We study the performance of the algorithm on synthetic and real-world x-ray datasets.

  6. We compare the proposed algorithm to the Gaussian ARD method in [17] and demonstrate that the proposed algorithm leads to better performance under Poisson-distributed data, even for high photon flux.

1.4 Outline of the Paper

The rest of this paper is organized as follows. In Sec. 2 we present a general overview of the proposed AM algorithm and explain how sparsity is promoted. In Sec. 3 we present a scalable AM algorithm based on separable surrogates for the objective function, which is well adapted to parallel computing and feasible for large-scale problems. In Sec. 4 we present a detailed analysis of the convergence properties of the algorithm presented in Sec. 3. In Sec. 5 we make connections between the proposed algorithm and related previous methods. We also present an alternative view of the proposed ARD method that further explains how sparsity is promoted. In Sec. 6 we extend the proposed ARD framework to allow the modeling of sparsity in overcomplete representations and the use of improper priors, while retaining the convergence properties studied in Sec. 4. In Secs. 7 and 8 we discuss model limitations and possible alternative optimization methods related to our work. In Sec. 9 we study the performance of the proposed scalable ARD algorithm on synthetic and real-world datasets (with a parallel implementation used in the experiments) and compare it to prior MAP estimation methods as well as to the ARD method in [17], which is based on a post-log Gaussian noise model. Conclusions are presented in Sec. 10.

2 Variational ARD

2.1 Prior

In many applications of transmission tomography, sparsity is not present directly in the native basis but in some transform domain. Therefore, we consider a generalized prior (similar to Seeger and Nickisch [30]) to address these cases,

    p(x | γ) ∝ \prod_{j=1}^{p} N( z_j ; 0, γ_j ),    (8)

where γ = (γ_1, …, γ_p) are newly introduced hyperparameters, and we assume a transformation

    z = Ψ x,    (9)

where Ψ is known and z is approximately sparse. By choosing Ψ = I (the identity), the prior in (8) reduces to (6). To provide some insight into the mechanism of the proposed algorithm we assume for now that Ψ is square and invertible. We shall present a more general approach in Sec. 6 to allow for cases where these assumptions are not met in practice.

Any scalable algorithm will rely on efficient matrix-vector multiplication Ψx. This is achieved if Ψ is sparse or has some special structure, such as discrete Fourier or wavelet transforms. In this work we focus on neighborhood penalties, which are commonly used for x-ray CT, the application explored in Sec. 9. We first consider Ψ given by

    Ψ_{jk} = 1 if k = j;   Ψ_{jk} = −1/c_j if (j, k) ∈ N;   Ψ_{jk} = 0 otherwise,    (10)

where N is the set of ordered pairs defining a neighborhood system and c_j is the number of neighbors for the j-th pixel/voxel. An example is shown in Fig. 2. The neighborhood is usually chosen to be small enough such that Ψ will be sparse, with only c_j + 1 non-zero elements in the j-th row. For the choice in (10), each z_j in (9) is equal to the difference between the j-th pixel/voxel and the average of its neighboring pixels/voxels, i.e., z_j = x_j − (1/c_j) Σ_{k: (j,k) ∈ N} x_k. Note that sparsity of these pixel/voxel differences implies piecewise-smooth images. Lastly, we note that the choice in (10) implies Dirichlet boundary conditions, which is equivalent to assuming additional neighbors outside the image domain with values set to zero, such that the number of neighbors for each pixel/voxel is equal, i.e., c_j = c for all j (see Fig. 2, where c = 4). These boundary conditions agree with physical constraints typical of many real experimental setups, where the attenuation is zero outside the region of interest. We shall remove these restrictions in Sec. 6.
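
A minimal sketch of the choice in (10) for a 2D image with a 4-nearest-neighbor system and the zero (Dirichlet) boundary described above (our naming; out-of-image neighbors simply contribute zero):

    from scipy.sparse import lil_matrix

    def neighbor_difference_matrix(nx, ny):
        """Sparse Psi with (Psi x)_j = x_j - (1/4) * (sum of the 4-neighbors of pixel j),
        where neighbors outside the image are treated as zeros (Dirichlet boundary)."""
        p = nx * ny
        Psi = lil_matrix((p, p))
        for iy in range(ny):
            for ix in range(nx):
                j = iy * nx + ix
                Psi[j, j] = 1.0
                for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    kx, ky = ix + dx, iy + dy
                    if 0 <= kx < nx and 0 <= ky < ny:  # missing neighbors contribute 0
                        Psi[j, ky * nx + kx] = -0.25
        return Psi.tocsr()

    # z = Psi @ x is approximately sparse for piecewise-smooth images x.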

2.2 Variational View of the EM Algorithm for ARD

We start with a known variational view [35] of the EM algorithm which is used to solve (7) for Gaussian noise models. Here we introduce some of the notation used throughout this work, and discuss the type of difficulties encountered when trying to extend ARD to non-Gaussian noise models.

We rewrite the model evidence as

    \log p(y | γ) = −F(q, γ) + KL( q(x) ‖ p(x | y, γ) ),    (11)

where

    F(q, γ) = −E_q[ \log p(y, x | γ) ] − H(q),    (12)
    KL( q(x) ‖ p(x | y, γ) ) = E_q[ \log q(x) − \log p(x | y, γ) ],    (13)

q(x) is the proposed variational distribution, p(x | y, γ) is the exact posterior distribution, F is the free variational energy (FVE) or negative evidence lower bound (ELBO), KL(· ‖ ·) is the Kullback-Leibler (KL) divergence, H(q) is the entropy of q, and E_q is the expected value with respect to q. The EM algorithm can be viewed [35] as minimizing the free variational energy (FVE) function in (12) (or maximizing the ELBO) by alternating between updating q in the E-step and updating γ in the M-step, as summarized below:

    q^{(t+1)} = arg min_q F(q, γ^{(t)}),    (14)
    γ^{(t+1)} = arg min_γ F(q^{(t+1)}, γ).    (15)

In the E-step, the optimal solution with zero KL divergence is q^{(t+1)}(x) = p(x | y, γ^{(t)}), i.e., the exact posterior based on the previous values of γ, which can be computed directly from (3). In the M-step, minimizing F with respect to γ can only increase the KL term, since the latter is nonnegative, and overall the model evidence is also increased.

For non-Gaussian likelihoods and the prior in (8), the E-step in (14) is intractable, i.e., the expression for p(x | y, γ) is unknown and its direct calculation according to (3) involves a high-dimensional integral that is prohibitive to compute. Instead, we are forced to restrict q to a tractable family of distributions to approximate the posterior. However, this results in an algorithm that is no longer guaranteed to increase the evidence [36]. This occurs because the KL term in (11) is never zero and can either decrease or increase after updating γ. Furthermore, a local maximum of the evidence with respect to γ is generally not a local minimum of F (or a local maximum of the ELBO), except for degenerate cases [36].

Therefore, when considering the above approach to extend ARD, we immediately encounter a conceptual difficulty, since most previous theoretical studies of ARD [16, 27, 28] are based on the maximization of the evidence and its specific closed-form expression for the Gaussian likelihood. It is not immediately clear how to extend the ARD framework to the Poisson noise model in (1) in a manner which will preserve the insights gained from [16, 27, 28] and will explain why sparsity is being promoted by the ARD algorithm.

Perhaps surprisingly, we shall show that the variational algorithm discussed above still preserves the principles of ARD, despite the fact that it is not based on maximizing the evidence, but rather on maximizing the ELBO (lower bound for the evidence). We call this framework “Variational ARD” (VARD).

2.3 Alternating-Minimization Algorithm for Variational ARD

As mentioned in Sec. 2.2, the true posterior distribution corresponding to the Poisson model in (1) and the prior in (8) is intractable, so we cannot solve the E-step in (14) exactly, and an approximation to the posterior should be used instead. Perhaps the most common approach is to use a free-form factorized posterior distribution (mean-field variational Bayes) [37]; however, this is also intractable here due to the lack of conditional conjugacy. Instead, we restrict the posterior to a parametric multivariate Gaussian form. In addition, in order to reduce the computational complexity, we restrict ourselves to factorized distributions, i.e.,

    q(x) = \prod_{j=1}^{p} N( x_j ; μ_j, σ_j² ),    (16)

where N(x; μ, σ²) denotes a univariate normal distribution with mean μ and variance σ². We propose to modify the EM algorithm in (14)–(15) by adding the constraint that q takes the form (16) during the E-step in (14). The resulting algorithm is defined by the following two steps,

    (μ, σ²)^{(t+1)} = arg min_{μ ≥ 0, σ² > 0} { E_q[ −\log p(y | x) ] + KL( q(x) ‖ p(x | γ^{(t)}) ) },    (17)
    γ^{(t+1)} = arg min_γ KL( q^{(t+1)}(x) ‖ p(x | γ) ),    (18)

which are repeated until convergence. Note that we have rewritten the FVE of (12) as the sum of the expected negative log-likelihood and the KL divergence between q and the prior, F(μ, σ², γ) = E_q[−\log p(y | x)] + KL(q ‖ p(x | γ)). The first term in (17) does not depend on γ and was therefore omitted in (18). In the backward (B) step in (17), we update the approximation to the posterior, which is reduced to only searching for the variational parameters (μ, σ²). In the forward (F) step in (18), we update the generative (forward) model by finding γ. Note that due to the constraint in (16), this is no longer an EM algorithm, and therefore we distinguish between the “E-step” and the “B-step,” and similarly between the “M-step” and the “F-step.”

Note that the posterior mean μ is the main object of interest and provides the estimates for the linear attenuation coefficients, which must be nonnegative. Accordingly, a non-negativity constraint on μ was added in (17). The objective in (17) is only defined for positive variances; using the extended-value convention, we set it to +∞ whenever any σ_j² is not positive.

The AM algorithm formed by repeating the updates in (17)–(18) will reduce the FVE at each iteration (and increase the ELBO). Importantly, as shown below, the AM algorithm leads to near-sparse solutions, i.e., many of the marginal posterior distributions will be highly concentrated around zero in the transform space defined by (9). The roles of μ, σ², and γ in promoting sparse solutions will be made clear in Sec. 2.4.

For the Poisson model in (1), the prior in (8), and the approximate posterior proposed in (16), the objective function in (17) can be written as (dropping the iteration number)

(19)
(20)
(21)
(22)

where φ_ij are the entries of the system matrix Φ defined after (1), and Ψ_jk are the entries of the matrix Ψ defined in (8)–(9). In deriving the last term in (22), we assumed that Ψ is square and non-singular⁶. We provide a generalization in Sec. 6 which removes this restriction and allows any Ψ.

Recalling the discussion in Sec. 1.1 about the sparse structure of Φ, we note that the objective function in (19)–(22) has a computational complexity on the order of a single forward projection, O(M p^{1/d})⁷, which is feasible to compute in order to assess convergence in practice. In contrast, the original ARD [15] and other extensions [28, 30] require O(p³) operations to evaluate the objective, which is prohibitive for large-scale problems where p is very large, so the objective can only be approximated, e.g., via the stochastic estimator of Papandreou and Yuille [33].

⁶In deriving (22) we used the fact that the term involving the determinant of Ψ is a finite constant and can be excluded. Note that Ψ has to be non-singular in order for this expression to be well-defined, and the corresponding equality only holds when Ψ is square and non-singular.

⁷For convenience we restate the definitions here: M is the number of measurements, p is the total number of pixels/voxels, and d = 2, 3 for 2D/3D images, respectively.

The solution to the F-step in (18) is obtained by setting the derivative of the objective with respect to each γ_j to zero, which yields

    γ = (Ψ μ) ∘ (Ψ μ) + (Ψ ∘ Ψ) σ²,    (23)

where ∘ denotes the elementwise product, i.e., each γ_j is set to the posterior second moment of z_j = [Ψx]_j. It is straightforward to verify that (23) is indeed a minimizing point of (18) (for finite γ). For fixed γ, the FVE objective function is jointly convex with respect to (μ, σ²), which implies that any local minimum of the objective in (17) is also a global minimum. Note that the FVE function is not jointly convex with respect to (μ, σ², γ), so local minima are possible. In a similar way, the original ARD-EM algorithm [15] is also only guaranteed to find a local maximum of the evidence [40].
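
A minimal sketch of the closed-form F-step (23), assuming Ψ is stored as a SciPy sparse matrix and μ, σ² are NumPy arrays (names are ours):

    def f_step_update(Psi, mu, sigma2):
        """F-step (23): gamma_j = E_q[z_j^2] with z = Psi x and a factorized Gaussian q,
        i.e., gamma = (Psi mu)^2 + (Psi o Psi) sigma^2 (elementwise)."""
        Psi_mu = Psi @ mu                    # posterior mean of z
        var_z = Psi.multiply(Psi) @ sigma2   # posterior variance of each z_j
        return Psi_mu**2 + var_z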

2.4 Discussion

The B-step in (17) is well understood from a Bayesian variational inference perspective, and it is equivalent to minimizing the KL divergence between the proposed posterior and the true posterior , as described in Fig. 3(a).

However, the F-step in (18) is somewhat puzzling, as it is unclear why it promotes sparse solutions, especially since using a Gaussian prior for MAP estimation does not lead to sparse solutions. To explain how sparsity is promoted, consider the -space defined in (9), where the image has a sparse representation. Note that the KL divergence in (18) is invariant under the transformation in (9). The prior and posterior distributions become

(24)

where is assumed to be invertible. Next, we make use of the following representation for a factorized Student’s-t distribution [27]

(25)

where is a univariate Student’s-t distribution with denoting the Gamma function, and . To obtain (25), one exploits Fenchel duality [41] where is the Fenchel dual of , and then uses the monotonically increasing transformation to obtain . By exponentiating both sides of the dual representation for one obtains (25) [27]. For , with sharp spikes at . Note in (25) that implies , and the multivariate t-distribution becomes an envelope for the ARD prior in (24). This is illustrated in Fig. 3(b) using a 2D example with , which depicts the proposed posterior and the prior , with the latter “trapped” inside the level sets of due to (25). One can see in Fig. 3(b) how selecting the widths of the Gaussian prior to minimize the KL divergence between and promotes the shrinking of for some (minimizing KL divergence between the two distributions is illustrated as maximizing the overlap between the level-set ellipsoids corresponding to the majority of the probability mass of these distributions). In the example of Fig. 3, the posterior has a much lower mean and variance along than along , and the F-step will further shrink . The corresponding parameter will be shrunk in the sense that the marginal prior will be more concentrated around zero, and so will the true posterior (see (3)). In the following B-step, the approximate posterior will also concentrate further around zero due to the zero-forcing property of the KL divergence [42] (recall that in the B-step we minimize KL-divergence between and the true posterior). Note that repeating the B and F steps is required to obtain sparse solutions.

(a) Backward Step
(b) Forward Step
Figure 3: Description of the AM algorithm proposed in (17)–(18). (a) Backward step: level sets for the likelihood (red curve), prior (blue curve) and the proposed posterior (black curve); true posterior is concentrated in the orange region; proposed posterior is chosen to minimize the KL divergence with . (b) Forward step: a level set for the student’s t distribution (dashed black curves) forming an envelope for the level set of the zero mean Gaussian prior (blue curves); and correspond to two different choices for where assigns equal variance along and and assigns small variance along and large variance along . Choosing maximizes the overlap between the prior and the proposed posterior (leading to lower KL divergence). is shrunk in the sense that its exact and proposed posterior distributions are further concentrated around smaller values due to the concentration of the prior about zero. Note that Fig. (a) is in the domain whereas Fig. (b) is in the domain.

The description of the mechanism responsible for promoting sparsity was inspired by the work of Wipf et al. [27], which uses similar arguments. However, that work relies entirely on the evidence function in (11) since a Gaussian noise model was considered. In contrast, our proposed approach is based on the free variational energy (FVE) function and applies more generally, even to cases where the expression for the evidence is unknown and an algorithm that increases the evidence is unavailable. This is important for non-Gaussian noise models, for which the evidence function is intractable, so we are forced to work with the FVE function instead. Another important difference between our work and that of Wipf et al. [27] is that we show how sparsity is promoted during the iterations of the algorithm, whereas in [27] it has been shown that a solution to (7) is sparse. Note also that [27] considers the original ARD prior in (6), whereas we consider the more general case in (8).

3 Parallel AM Algorithm

Here we present a modified AM algorithm, which uses an alternative B-step. The rationale behind the modified algorithm lies in decoupling the variational parameters of each pixel/voxel. This reduces the algorithm to simple 1D optimization tasks that can be computed in parallel, leading to high scalability. First, note that (20) and (21) are not additively separable with respect to the individual parameters μ_j and σ_j². We shall therefore introduce a surrogate function, making use of the following lemma.

Lemma 3 (the convex decomposition lemma [13]). Let f be a convex function, let x^{(t)} be a reference point, and let β_1, …, β_p be nonnegative weights that sum to one. Then

(26)

where denotes a column vector with all zeros except 1 at the th entry. Equality in (26) is obtained if for all or if is affine.

The proof readily follows from Jensen’s inequality [41]

(27)

where we added and subtracted a reference point and then applied Jensen’s inequality to where and . Equality is obtained iff is independent of or is affine. The former is obtained if ,  .

We will also use a minor (but useful) extension of Lemma 3 by O'Sullivan and Benac [13], where we introduce a dummy variable, indexed j = 0, with a corresponding weight β_0, such that the augmented weights still sum to one. Accordingly, the variables and weights in (26) are augmented by the dummy terms, and the summation in (26) over j is extended to include j = 0. This modification leads to the requirement that the augmented weights (including β_0) sum to one, which allows the original weights to sum to at most one, instead of the strict equality required in Lemma 3. We choose to fix the dummy variable so that the term corresponding to j = 0 on the right-hand side of (26) is just a constant and will not be involved in the following optimization.
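
To make the lemma concrete, here is the inequality in the form we use it (a sketch in our notation; the exact statement in (26) may differ in its details). For a convex function f, coefficients a_j, nonnegative weights β_j that sum to one (with β_j > 0 whenever a_j ≠ 0), and a reference point x^{(t)}, Jensen's inequality gives

    f( \sum_j a_j x_j ) = f( \sum_j β_j [ (a_j/β_j)(x_j − x_j^{(t)}) + \sum_k a_k x_k^{(t)} ] )
                        ≤ \sum_j β_j f( (a_j/β_j)(x_j − x_j^{(t)}) + \sum_k a_k x_k^{(t)} ).

The right-hand side is additively separable in the components x_j, and equality holds at x = x^{(t)} (or when f is affine), which is exactly the property needed to build separable surrogates.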

Before we proceed, we make several definitions to simplify the following derivations,

(28)
(29)
(30)

where , and are defined in (1). To distinguish between quantities associated with the mean and variance, we denote any quantities associated with the variance using a tilde. In (28), is the projection of the posterior mean along the th line, which we shall call “mean-type” projection; is a projection of the posterior variance along the th line using an elementwise squared system matrix, which we term a “variance-type” projection. In (29), is the predicted Poisson rate for the th measurement, based on the expectation with respect to posterior distribution computed at iteration . In (30), is the back-projection of measurements to the th pixel; and are the back-projections of the predicted Poisson rate in (29) to the th pixel, using the backward (adjoint) operator and the squared backward operator, respectively, which we call “mean-type” and “variance-type” backprojections.
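
In our notation (Φ the system matrix, I0 the blank-scan counts, μ and σ² the current variational means and variances), the quantities described in (28)–(30) can be computed with two forward projections and two backprojections; a sketch, assuming Φ is a SciPy sparse matrix:

    import numpy as np

    def vard_projections(Phi, I0, y, mu, sigma2):
        """Sketch of the quantities in (28)-(30): mean/variance-type projections,
        predicted Poisson rates, and the corresponding backprojections."""
        Phi_sq = Phi.multiply(Phi)                 # elementwise-squared system matrix
        proj_mean = Phi @ mu                       # "mean-type" projections        (28)
        proj_var = Phi_sq @ sigma2                 # "variance-type" projections
        # Predicted Poisson rate E_q[I_i exp(-phi_i^T x)] under the factorized Gaussian q  (29)
        rates = I0 * np.exp(-proj_mean + 0.5 * proj_var)
        bp_data = Phi.T @ y                        # backprojection of the data      (30)
        bp_mean = Phi.T @ rates                    # "mean-type" backprojection of the rates
        bp_var = Phi_sq.T @ rates                  # "variance-type" backprojection of the rates
        return proj_mean, proj_var, rates, bp_data, bp_mean, bp_var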

By noting that (20) can be written as and that is jointly convex with respect to , we apply Lemma 3 to each separately. Using (26) with its minor extension (mentioned after Lemma 3) and , , and choosing a dummy variable which is set to zero, we obtain the bound at iteration with defined as

(31)

where we made use of (28)–(30) and corresponds to while corresponds to (for the th measurement). Note that the summation over in (31) includes the dummy variable with for any . Also note that (31) is separable with respect to the components and . According to the extended Lemma 3 we are free to choose as long as

(32)

which must be satisfied for each . We make the following choice for and

(33)

which can be easily verified to satisfy the condition in (32). Substituting (33) into (31), and using the definitions in Eq. (30), we obtain

(34)

The particular choice in (33) is very useful since the sum over in (31) needs to be done only once instead of repeating it for each update of during the search for the minimum of .

We derive a separable bound for (21) by noting that the function involved is convex and using Lemma 3 again to obtain

(35)

where (also defined in (38)). Similarly to (32), we require using the dummy variable extension. One choice for is

(36)

Substituting (36) into (35), we obtain

(37)

where we introduced the following definitions

(38)
(39)
(40)

in (38) is the estimated th element in the sparse representation for the object based on the posterior mean , e.g., if is chosen according to (10) (assuming sparsity in the pixel/voxel-difference space) then where defines the neighborhood system and is the number of neighbors. In deriving (39) we made use of .

Combining (34), (37), and (22) (the latter is already separable), we obtain a surrogate function that is a global majorizer of the objective in (19)–(22), given by

(41)
(42)
(43)

Note that the surrogate function in (41) is separable with respect to both the components of μ and the components of σ². This yields a modified B-step where the surrogate is minimized by simultaneous (parallel) 1D optimization tasks, each with respect to a different component μ_j or σ_j². The iterations are defined as

(44)
(45)

Combining the modified B-step in (44)–(45) with the F-step in (23), we obtain the parallel VARD algorithm described in Algorithm 1.

1:Initialize
2:Compute in (30) % backprojection of data %
3:for t=1 to N do % AM iterations %
4:      % “mean-type” projections %
5:      % “variance-type” projections %
6:     
7:     Compute , and defined in (38)–(40)
8:      % “mean-type” backprojections of %
9:      % “variance-type” backprojections of %
10:     for j=1 to p do % executed in parallel %
11:          % see (42) %
12:          % see (43) %      
13:      % see (23) %
Algorithm 1 : PAR-VARD (VARD with Separable Surrogates)

Each iteration requires a “mean-type” forward-projection and a “variance type” forward-projection (lines 4-5 in Algorithm 1) which can be computed in parallel. It also requires a “mean-type” backprojection and a “variance-type” backprojection (lines 8-9 in Algorithm 1), which can also be computed in parallel. Each iteration also requires solving two types of 1D optimization problems, one for the mean components and one for the variance components (lines 11-12 in Algorithm 1). The 1D optimization tasks for all pixels/voxels can be done in parallel and the scalar updates for mean and variance can also be done in parallel. Note that despite the latter parallelization, and provide information about both the mean and variance at the previous iteration which is shared during the mean and variance updates in (44)–(45) (lines 11-12 in Algorithm 1). Also note that these subiterations for are themselves iterations of another alternating minimization algorithm (lines 4-12 of Algorithm 1). Here we chose to use only one subiteration, but any number can be used instead.

A considerable advantage of this algorithm is that it allows fast parallel computing on multi-core computers, or on dedicated hardware such as field-programmable gate arrays (FPGAs). Another advantage is that the solution to the constrained problem in (44) can be obtained by first solving the unconstrained problem using simple gradient-based methods; if the unconstrained minimizer is negative, then the solution to the constrained problem must be zero, and otherwise the solution is the same. This simple thresholding at zero leads to the optimal solution only in one-dimensional minimization problems; the same procedure applied to the original B-step in (17) can lead to sub-optimal solutions or even lack of convergence.

The computational complexity of a forward/backprojection is O(M p^{1/d}), where M is the number of measurements, p is the total number of pixels/voxels, and d = 2, 3 for 2D/3D images, respectively (recall the discussion about the sparsity of Φ in Sec. 1.1). The computational complexity of the operations involving Ψ is O(pc), where c is the number of neighboring pixels/voxels defined by the prior (recall the discussion on the sparsity of Ψ in Sec. 2.1). Typically, c is small and M p^{1/d} ≫ pc, so the forward/backprojection cost dominates the cost of the operations involving Ψ. The details regarding the implementation of the 1D optimization tasks (lines 11-12 in Algorithm 1) are given in Appendix B.
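
For orientation, the skeleton below mirrors the loop structure of Algorithm 1 (projections, backprojections, parallel per-pixel 1D updates, then the F-step). It is a structural sketch only: the two per-pixel 1D solvers are hypothetical callables standing in for the minimizations of (42)–(43) (whose details are in Appendix B), and all names are ours.

    import numpy as np

    def par_vard(Phi, Psi, I0, y, n_iter, solve_mean_1d, solve_var_1d):
        """Structural sketch of PAR-VARD (Algorithm 1); solve_mean_1d/solve_var_1d are
        hypothetical per-pixel 1D solvers for the surrogates (42)-(43)."""
        p = Phi.shape[1]
        mu = np.zeros(p)                  # posterior means (the image estimate)
        sigma2 = np.ones(p)               # posterior variances
        gamma = np.ones(p)                # ARD hyperparameters
        Phi_sq = Phi.multiply(Phi)
        Psi_sq = Psi.multiply(Psi)
        bp_data = Phi.T @ y               # backprojection of the data (line 2)
        for _ in range(n_iter):
            proj_mean = Phi @ mu                              # line 4
            proj_var = Phi_sq @ sigma2                        # line 5
            rates = I0 * np.exp(-proj_mean + 0.5 * proj_var)  # predicted rates, cf. (29)
            bp_mean = Phi.T @ rates                           # line 8
            bp_var = Phi_sq.T @ rates                         # line 9
            for j in range(p):                                # lines 10-12, parallelizable
                mu_j = solve_mean_1d(j, mu, sigma2, gamma, bp_data, bp_mean)
                mu[j] = max(mu_j, 0.0)                        # positivity by thresholding
                sigma2[j] = solve_var_1d(j, mu, sigma2, gamma, bp_var)
            gamma = (Psi @ mu)**2 + Psi_sq @ sigma2           # F-step, cf. (23) (line 13)
        return mu, sigma2, gamma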

4 Optimality Conditions and Convergence Analysis

First, we provide the necessary optimality conditions for the minimization problems presented in Secs. 2–3, as well as a few definitions, in order to simplify the following analysis. We then present a detailed study of the convergence properties of the parallel AM algorithm derived in Sec. 3 (Algorithm 1). We shall refer to this algorithm as “PAR-VARD” (parallel VARD).

4.1 Necessary Optimality Conditions

  1. The necessary Karush-Kuhn-Tucker (KKT) optimality conditions [41] for to be a solution to (17) are given by

    (46)
    (47)

    where are defined by (30) with and replaced by and , respectively; is defined by in (38) with replaced by ; is defined in (40).

  2. The necessary KKT condition for to be a solution to (18) is given by

    (48)