1 Introduction
Tomographic image reconstruction is the process of estimating an object from measurements of its line integrals along different angles
[1]. Tomographic reconstruction is an ill-posed problem [2] in the sense that there are multiple solutions that are consistent with the data. It is therefore desirable to make use of prior knowledge, which includes the statistical characteristics of the measured data and properties that are expected of the image.

The images in transmission tomography represent attenuation per unit length of the impinging beam due to energy absorption inside the imaged object. Well-known examples are x-ray computed tomography (CT) [3] and electron tomography [4]. The attenuation must be nonnegative, since a negative value corresponds to an increase in the intensity of the exiting beam, which is nonphysical. These images can often be well approximated by a sparse representation in some transform domain. In some applications, sparsity is present directly in the native image domain [5], but more commonly it is present in the pixel-difference domain [6] or in some transform domain, such as the wavelet, curvelet or shearlet transforms [7, 8].
There are generally two types of approaches to tomographic image reconstruction. The first type consists of one-shot algorithms, such as filtered backprojection [1] and its extensions, which rely on analytic formulas. These algorithms do not incorporate the type of prior knowledge mentioned above and produce prominent artifacts when some of the data are missing. The second type consists of iterative algorithms, which are based on minimizing some objective function. The latter enable the incorporation of prior knowledge about the data and the image of interest: the objective function can put less emphasis on measurements that are expected to be noisier, include penalties that promote sparsity of the image in some representation, and enforce positivity of the image. In addition, iterative algorithms are far more robust to missing data.
Many transmission tomography problems involve physical processes consisting of quanta, e.g., the number of transmitted photons in x-ray CT, or the number of transmitted electrons in electron tomography. Therefore, there is a strong physical motivation for modeling these processes using independent Poisson random variables [4, 6]. In the medical CT community, the Poisson noise model is commonly used [6] and provides the basis for a well-known class of iterative algorithms [6, 9, 11, 12, 13]. In addition, the mean of each measurement is modeled according to Beer's law [6], which is an inherent nonlinearity in transmission data.

In this paper, we develop a new class of statistical iterative algorithms for image reconstruction in transmission tomography that is inspired by automatic relevance determination (ARD) [14, 15, 16, 17]. It incorporates prior knowledge that includes the Poisson statistics of the underlying signals, Beer's law, positivity of the image, and sparsity of the image in some underlying representation. What sets this new class apart from prior art is that it automatically learns the balance between data fidelity and prior knowledge, thus avoiding the use of any tuning parameters and the difficulty in determining them. In addition, it also computes posterior variances, which can be used for adaptive sensing/experimental design, or for determining the system's sensitivity to noise. To the best of our knowledge, the proposed algorithm is the first ARD-based algorithm for Poisson noise that is both scalable to very large problems and guaranteed to converge.
Our work is motivated by x-ray CT, which is used in medical and security applications [1]. Nevertheless, the methods presented here apply more generally to any type of transmission tomography that involves count data. In formulating the problem presented in Sec. 1.1 we have tried to retain some generality in the hope that other imaging modalities could potentially benefit from the proposed approach. Naturally, this entails some simplifications of the forward model, which are discussed later in Sec. 7.
It should be noted that most modern commercial CT scanners do not employ photon-counting detectors. However, photon-counting detectors are already used in clinical studies [18], and it is certainly possible that they will be integrated into commercial CT scanners in the near future. Modern CT scanners also utilize a high photon flux, which has led various researchers to propose algorithms based on an approximate post-log Gaussian noise model [19, 20, 21] rather than a Poisson noise model. The Gaussian model can be derived by making a second-order Taylor series approximation to the log-likelihood corresponding to the Poisson noise model with Beer's law, under the assumption of moderate to high photon flux. This results in a data-weighted least-squares data-fit term [19], which also corresponds to a post-log Gaussian noise model. However, in low-count scenarios, which are critical to maintaining low radiation dose in pediatric and small-animal CT, there could be a substantial performance benefit in using the Poisson model once photon-counting detectors become widely available. For example, it has been shown empirically that the approximate Gaussian model leads to a systematic negative bias that increases as photon flux decreases [22].
1.1 Model
We consider a Poisson noise model for transmission tomography [6] given by

(1) $p(\mathbf{y}\,|\,\mathbf{x}) = \prod_{i=1}^{M} \mathrm{Poisson}\big(y_i;\, b_i e^{-\mathbf{a}_i^{T}\mathbf{x}}\big),$

where here and henceforth $p(\mathbf{y}\,|\,\mathbf{x})$ denotes the likelihood function, $\mathbf{y} = (y_1, \dots, y_M)^T$ contains the measurements, $\mathrm{Poisson}(y; \lambda)$ denotes the univariate Poisson probability distribution with rate $\lambda$, and the exponential function encompasses Beer's law (a physical property of the signals [6]). The column vector $\mathbf{x} \in \mathbb{R}^{N}$ denotes the lexicographical ordering of the 2D/3D image to be reconstructed, where the entry $x_j$ is the linear attenuation coefficient at the $j$th pixel/voxel in units of inverse length. The operation $\mathbf{a}_i^{T}\mathbf{x}$ represents a line integral over the image corresponding to the $i$th source-detector pair (see Fig. 2) and to the measurement $y_i$. We denote by $\mathbf{A}$ the system matrix that is formed by concatenating the row vectors $\mathbf{a}_i^{T}$ ($i = 1, \dots, M$). $b_i$ is the mean number of photons that would have been measured using the $i$th source-detector pair if the imaged object were removed. We make the standard assumption that $b_i$ was measured for all $i$ in an independent calibration experiment and is known to us.

We construct $\mathbf{A}$ using the approach in [23], where the entry $A_{ij}$ at the $i$th row and $j$th column of $\mathbf{A}$ is equal to the length at which the line between the $i$th source-detector pair intersects the $j$th pixel/voxel (see Fig. 2). Each ray traverses at most $O(N^{1/d})$ pixels/voxels, where $N$ is the total number of pixels/voxels and $d = 2, 3$ for 2D/3D images, respectively. Accordingly, each row in the system matrix has only $O(N^{1/d})$ nonzero entries. Henceforth, we absorb a reference linear attenuation coefficient into $\mathbf{x}$ so that $\mathbf{x}$ is dimensionless.
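To make the forward model concrete, the following sketch simulates noisy transmission data for a toy 2D phantom. The ray geometry (axis-aligned rows and columns only), the phantom, and all variable names are illustrative assumptions; a realistic system matrix would come from a ray-tracing construction such as the one in [23].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 16x16 phantom, flattened lexicographically into the vector x.
n = 16
img = np.zeros((n, n))
img[4:12, 4:12] = 0.5                 # dimensionless attenuation values
x = img.ravel()

# Illustrative system matrix: one ray per image row and one per column,
# so each nonzero entry is the (unit) intersection length of a ray with
# a pixel.  Note each row of A has only ~sqrt(N) nonzeros, as in the text.
M = 2 * n
A = np.zeros((M, n * n))
for i in range(n):
    A[i, i * n:(i + 1) * n] = 1.0     # horizontal ray through image row i
    A[n + i, i::n] = 1.0              # vertical ray through image column i

b = 1e4 * np.ones(M)                  # blank-scan means b_i (calibration)
lam = b * np.exp(-A @ x)              # Beer's law: mean photon counts
y = rng.poisson(lam)                  # Poisson-distributed measurements
```

Rays passing through the object record fewer photons on average; the Poisson draw models the quantum counting noise that motivates the likelihood in (1).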
1.2 Background
The method proposed in this paper is an extension of automatic relevance determination (ARD). For the convenience of readers unfamiliar with ARD, we first provide background explaining the potential advantages of ARD that motivate this work, and discuss the key differences between ARD and other, more common sparse estimation methods. Our aim here is to emphasize previous challenges in adapting ARD to the Poisson model in (1), challenges which we aim to resolve with the proposed method.
1.2.1 Maximum a Posteriori (MAP) Estimation
Statistical iterative methods for CT [6, 9, 11, 13] cast the problem of image reconstruction as a penalized maximum-likelihood estimation problem, given by

(2) $\hat{\mathbf{x}} = \arg\max_{\mathbf{x} \ge \mathbf{0}}\; L(\mathbf{x}) - \beta R(\mathbf{x}),$

where $L(\mathbf{x}) = \log p(\mathbf{y}\,|\,\mathbf{x})$ is the log-likelihood function, which represents the data-fit. For the Poisson model in (1), maximizing the log-likelihood is equivalent to minimizing the I-divergence^{2} between the actual measurements $\mathbf{y}$ and the predicted measurements $\mathbf{b} \circ e^{-\mathbf{A}\mathbf{x}}$ [13], where $\circ$ denotes an elementwise product. $p(\mathbf{x}) \propto e^{-\beta R(\mathbf{x})}$ in (2) is the prior probability distribution representing prior knowledge about $\mathbf{x}$, and $R(\mathbf{x})$ is a penalty that promotes solutions with desired properties according to this prior knowledge, e.g., the $\ell_1$ penalty [5], or the total variation (TV) penalty [25, 26], which promote sparse solutions. $\beta > 0$ is a tuning parameter that determines the tradeoff between the data-fit and the prior, or the bias-variance tradeoff. The "best" value of $\beta$ is usually unknown and depends on the imaged object at hand. To determine $\beta$, it is often required to solve (2) many times for different values of $\beta$, examine all reconstructed images, and then select the value that best fits the needs of the application, based on some chosen figure of merit. For many common choices of $R$ considered below, decreasing $\beta$ will put more weight on the data-fit term and will result in increased variance and noisier images. Increasing $\beta$ will put more weight on the penalty term and will result in increased bias and increased artifacts in the image, e.g., staircase artifacts. For a given $\beta$, the solution in (2) can be interpreted as the maximum a posteriori (MAP) solution, i.e., the maximizing point of the posterior distribution, which is given by Bayes' theorem

(3) $p(\mathbf{x}\,|\,\mathbf{y}) = \frac{p(\mathbf{y}\,|\,\mathbf{x})\, p(\mathbf{x})}{p(\mathbf{y})}.$

^{2}The I-divergence between two nonnegative vectors $\mathbf{p}$ and $\mathbf{q}$ is defined as $I(\mathbf{p}\,\|\,\mathbf{q}) = \sum_i \big[ p_i \log(p_i / q_i) - p_i + q_i \big]$ [24].
The $\ell_1$ and TV penalties are nondifferentiable and do not allow the use of standard gradient-based methods for solving (2). An alternative to TV is the class of neighborhood roughness penalties (Gibbs priors) [11], which have the following form:

(4) $R(\mathbf{x}) = \sum_{\{j,k\} \in \mathcal{N}} w_{jk}\, \psi(x_j - x_k),$

where $w_{jk}$ is a positive weight inversely proportional to the distance between the centers of the $j$th and $k$th pixels/voxels, $\mathcal{N}$ is a set of ordered pairs $\{j, k\}$ defining a neighborhood system, and $\psi$ is chosen to be a positive, even, and twice continuously differentiable function. To preserve sharp edges in the image, $\psi$ is often chosen such that $\psi(t) \approx |t|$ for large $|t|$, and can be viewed as a smooth version of the $\ell_1$ penalty. A common choice is [10, 11]

(5) $\psi(t) = \delta^2 \big( |t/\delta| - \log(1 + |t/\delta|) \big),$

which approaches the quadratic $t^2/2$ for $|t| \ll \delta$ and the linear function $\delta |t|$ as $|t|/\delta \to \infty$. This type of penalty allows the use of standard gradient-based methods, but comes at the price of having an additional tuning parameter $\delta$, so the search for the tuning parameters becomes even more cumbersome. Decreasing $\delta$ too much will result in noisy images with sharp edges, and increasing $\delta$ too much will result in blurred images.
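The two limiting regimes of the edge-preserving potential can be checked numerically. This sketch assumes the form $\psi_\delta(t) = \delta^2(|t/\delta| - \log(1 + |t/\delta|))$ discussed above; the function name is ours.

```python
import numpy as np

def psi(t, delta):
    """Edge-preserving potential: quadratic near 0, linear in the tails."""
    u = np.abs(t / delta)
    return delta**2 * (u - np.log1p(u))

delta = 1.0
# Near the origin the penalty behaves like t^2 / 2 (quadratic smoothing) ...
small = 1e-3
print(psi(small, delta) / (small**2 / 2))      # ratio close to 1
# ... while for |t| >> delta it grows like delta * |t| (l1-like, edge preserving).
large = 1e3
print(psi(large, delta) / (delta * large))     # ratio close to 1
```

The crossover scale is set by $\delta$, which is exactly why $\delta$ acts as the extra tuning parameter controlling the noise/blur tradeoff described above.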
1.2.2 Automatic Relevance Determination (ARD)
A different approach to sparse estimation is "automatic relevance determination" (ARD) [14]. Originally, ARD was proposed for Gaussian likelihoods with the prior [15]

(6) $p(\mathbf{x}\,|\,\boldsymbol{\gamma}) = \mathcal{N}\big(\mathbf{x};\, \mathbf{0},\, \mathrm{diag}(\boldsymbol{\gamma})\big),$

where $\mathrm{diag}(\boldsymbol{\gamma})$ denotes a diagonal matrix with main diagonal $\boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_N)^T$. In this approach, sparsity is promoted by treating $\mathbf{x}$ as latent variables and seeking the newly introduced hyperparameters $\boldsymbol{\gamma}$ that maximize the marginal likelihood (also called the "model evidence" [16])

(7) $\hat{\boldsymbol{\gamma}} = \arg\max_{\boldsymbol{\gamma} \ge \mathbf{0}}\; p(\mathbf{y}\,|\,\boldsymbol{\gamma}) = \arg\max_{\boldsymbol{\gamma} \ge \mathbf{0}} \int p(\mathbf{y}\,|\,\mathbf{x})\, p(\mathbf{x}\,|\,\boldsymbol{\gamma})\, d\mathbf{x}.$

Once $\hat{\boldsymbol{\gamma}}$ is found, the estimate for $\mathbf{x}$ is given by the mean of the posterior distribution $p(\mathbf{x}\,|\,\mathbf{y}, \hat{\boldsymbol{\gamma}})$. Interestingly, the solutions to (7) have many $\gamma_j = 0$, and the posterior distribution for the corresponding parameters $x_j$ becomes highly concentrated around zero^{3}. The mechanism that promotes sparsity has been revealed by Wipf et al. [27, 28], with the analysis specifically tailored for Gaussian likelihoods.

^{3}By employing Bayes' rule in (3), it can easily be shown that if the prior is concentrated around $x_j = 0$, then the corresponding posterior distribution is also concentrated around $x_j = 0$.
The solution of (7) was originally addressed by Tipping in [15] using an expectation-maximization (EM) algorithm [29]. In the E-step, the Gaussian posterior probability distribution $p(\mathbf{x}\,|\,\mathbf{y}, \boldsymbol{\gamma})$ is computed for given hyperparameters $\boldsymbol{\gamma}$. In the M-step, $\boldsymbol{\gamma}$ are updated based on the mean and variances of the posterior distribution found in the E-step. As opposed to MAP estimation methods, which focus on the mode (maximum) of the posterior distribution, ARD methods also compute the covariance matrix (during the E-step), or at least the marginal variances, which provide additional information. ARD-based algorithms are therefore computationally more demanding than MAP methods, mainly due to the expensive task of estimating the variances. Previous extensions of ARD [28, 30] require the same type of operations and rely on approximations to the variances [30] to maintain scalability.

Despite computational complications, the ARD approach can be quite rewarding and excels in hard or ill-posed problems, as evidenced in several recent applications [31]. A very significant advantage of ARD methods is their ability to automatically learn the balance between the data-fit and the degree of sparsity. This avoids the tuning of nuisance parameters, which is often nonintuitive, object-dependent, and requires many trials. The posterior variances are key in determining this tradeoff. The variances can also be used for adaptive sensing/experimental design [17, 30], which is outside the scope of this paper and will be the subject of future work.
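The E/M updates just described can be illustrated with a minimal ARD-EM loop for a toy Gaussian linear model $y = Ax + \text{noise}$ with known noise variance. The posterior update $\Sigma = (A^T A/\sigma^2 + \mathrm{diag}(1/\gamma))^{-1}$, $m = \Sigma A^T y/\sigma^2$ and the M-step $\gamma_j \leftarrow m_j^2 + \Sigma_{jj}$ follow [15]; the toy data, sizes, and iteration count are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 40, 20
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
support = [2, 7, 15]
x_true[support] = 3.0                         # sparse ground truth
sigma2 = 0.01                                  # known noise variance
y = A @ x_true + np.sqrt(sigma2) * rng.standard_normal(M)

gamma = np.ones(N)                             # hyperparameters (prior variances)
for _ in range(200):
    # E-step: Gaussian posterior p(x | y, gamma) = N(m, Sigma).
    Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(1.0 / gamma))
    m = Sigma @ A.T @ y / sigma2
    # M-step: gamma_j <- E[x_j^2] = m_j^2 + Sigma_jj (floored for stability).
    gamma = np.maximum(m**2 + np.diag(Sigma), 1e-12)

# Most gamma_j collapse toward zero; the true support survives.
print(np.round(gamma, 3))
```

The collapse of most $\gamma_j$ toward zero, with the corresponding posteriors concentrating at zero, is the pruning behavior described in the text.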
1.2.3 Difficulties in Previous ARD-Based Methods for Poisson Models
The ARD-EM algorithm [15] described in Sec. 1.2.2 cannot be directly applied to the Poisson likelihood in (1), since the E-step is intractable due to the non-conjugacy^{4} of the prior in (6) and the likelihood. Although we are not aware of any previously published work on ARD for Poisson models with Beer's law, a simple remedy to the non-conjugacy is the Laplace method, discussed in [15]. In that approach, the posterior is approximated during the E-step using a multivariate Gaussian distribution, with the mean equal to the MAP solution and the covariance matrix equal to the inverse of the Hessian matrix of the negative log-posterior, evaluated at the MAP solution. However, the resulting "EM-like" algorithm is not based upon a consistent objective function and is not guaranteed to converge. In addition, it requires inversions of the Hessian matrix with computational complexity $O(N^3)$, where $N$ is the total number of pixels/voxels. This is infeasible for x-ray CT, where $N$ could be very large. Instead, one can use more sophisticated methods [30, 32, 33] to indirectly estimate the diagonal of the inverse, i.e., the marginal variances, since only these are required for updating the hyperparameters [15]. Note, however, that the work in [30, 32, 33] does not consider Laplace approximations, so these methods have not been tested for this case. Irrespective of how the variances are computed, the "EM-like" algorithm with the Laplace approximation is unsatisfactory from a principled perspective, due to the lack of an objective function and the lack of convergence guarantees. The objective is also desired in order to assess convergence in practice and to verify the correctness of the implementation. In addition, we are faced with conceptual difficulties, since the analysis that revealed the sparsity-promoting mechanism of ARD by Wipf et al. [16, 27, 28] is restricted to Gaussian likelihoods.

^{4}A prior is said to be conjugate to the likelihood if both the prior and posterior probability distributions are in the same family. The posterior is then given in closed form.

1.2.4 Previous ARD Method for Transmission Tomography Based on an Approximate Post-Log Gaussian Noise Model
As mentioned in the introduction, it is reasonable to replace the model in (1) with an approximate post-log Gaussian noise model under moderate- to high-photon-flux conditions. An ARD algorithm for transmission tomography based on the Gaussian model and the EM algorithm in [15] was proposed by Jeon et al. [17]. It should be noted that the original ARD algorithm in [15] has a computational cost of $O(N^3)$ due to the computation of posterior variances, and therefore it is not practical for large-scale problems. In contrast, the algorithm proposed in [17] scales much more favorably with the image size, for both 2D and 3D images. There are several disadvantages of the algorithm in [17]: (1) it does not enforce positivity of the solution; (2) due to the approximate estimation of posterior variances, the overall algorithm does not have convergence guarantees; (3) it requires one to choose the type and number of probing vectors for estimating the variances [17], which depends on the specific geometry of the system and on the imaged object; (4) in order to estimate the posterior variances it requires solving many matrix equations, which is computationally expensive; (5) evaluating the objective function is itself expensive, making this operation infeasible for large-scale problems; (6) it is limited to proper priors and complete sparse representations.
1.3 Contributions
The contributions of this paper are summarized as follows.

We present a new alternating-minimization (AM) algorithm for transmission tomography, which extends automatic relevance determination (ARD) to Poisson noise models with Beer's law. The proposed algorithm is free of any tuning parameters and, unlike prior ARD methods, also has a consistent objective function that can be evaluated in a computationally efficient way, in order to assess convergence in practice and to verify the correctness of the implementation.

We derive a scalable AM algorithm for transmission tomography that is well adapted to parallel computing. We reduce the computational complexity considerably by simplifying the algorithm into parallel 1D optimization tasks (one for each pixel/voxel), with the scalar mean and variance updates also done in parallel.

We present a convergence analysis for the scalable AM algorithm and establish guarantees for global convergence^{5}. To the best of our knowledge, this is the first ARD-based algorithm for Poisson models that has convergence guarantees.

^{5}An algorithm is said to be "globally convergent" if, for arbitrary starting points, it is guaranteed to converge to a point satisfying the necessary optimality conditions (local optima included) [34].

We study the performance of the algorithm on synthetic and real-world x-ray datasets.

We compare the proposed algorithm to the Gaussian ARD method in [17]
and demonstrate that the proposed algorithm leads to better performance under Poisson distributed data, even for high photon flux.
1.4 Outline of the Paper
The rest of this paper is organized as follows. In Sec. 2 we present a general overview of the proposed AM algorithm and explain how sparsity is promoted. In Sec. 3 we present a scalable AM algorithm based on separable surrogates for the objective function, which is well adapted to parallel computing and feasible for large-scale problems. In Sec. 4 we present a detailed analysis of the convergence properties of the algorithm presented in Sec. 3. In Sec. 5 we make connections between the proposed algorithm and related previous methods. We also present an alternative view of the proposed ARD method that further explains how sparsity is promoted. In Sec. 6 we extend the proposed ARD framework to allow the modeling of sparsity in overcomplete representations and the use of improper priors, while retaining the convergence properties studied in Sec. 4. In Secs. 7-8 we discuss model limitations and possible alternative optimization methods related to our work. In Sec. 9 we study the performance of the proposed scalable ARD algorithm on synthetic and real-world datasets (with a parallel implementation used in the experiments) and compare it to prior MAP estimation methods as well as to the ARD method in [17], which is based on a post-log Gaussian noise model. Conclusions are presented in Sec. 10.
2 Variational ARD
2.1 Prior
In many applications of transmission tomography, sparsity is not present directly in the native basis but in some transform domain. Therefore, we consider a generalized prior (similar to Seeger and Nickisch [30]) to address these cases:

(8) $p(\mathbf{x}\,|\,\boldsymbol{\gamma}) \propto \mathcal{N}\big(\mathbf{W}\mathbf{x};\, \mathbf{0},\, \mathrm{diag}(\boldsymbol{\gamma})\big),$

where $\boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_N)^T$ are newly introduced hyperparameters, and we assume a transformation

(9) $\mathbf{s} = \mathbf{W}\mathbf{x},$

where $\mathbf{W}$ is known and $\mathbf{s}$ is approximately sparse. By choosing $\mathbf{W} = \mathbf{I}$, the prior in (8) reduces to (6). To provide some insight into the mechanism of the proposed algorithm, we assume for now that $\mathbf{W}$ is square and invertible. We shall present a more general approach in Sec. 6 to allow for cases where these assumptions are not met in practice.

Any scalable algorithm will rely on efficient matrix-vector multiplication with $\mathbf{W}$. This is achieved if $\mathbf{W}$ is sparse or has some special structure, such as discrete Fourier or wavelet transforms. In this work we focus on neighborhood penalties, which are commonly used for x-ray CT, the application explored in Sec. 9. We first consider $\mathbf{W}$ given by

(10) $W_{jk} = \begin{cases} 1, & j = k, \\ -1/|\mathcal{N}_j|, & \{j, k\} \in \mathcal{N}, \\ 0, & \text{otherwise}, \end{cases}$

where $\mathcal{N}$ is the set of ordered pairs defining a neighborhood system and $|\mathcal{N}_j|$ is the number of neighbors of the $j$th pixel/voxel. An example is shown in Fig. 2. The neighborhood is usually chosen to be small enough such that $\mathbf{W}$ will be sparse, with only $|\mathcal{N}_j| + 1$ nonzero elements in the $j$th row. For the choice in (10), each $s_j$ in (9) is equal to the difference between the $j$th pixel/voxel and the average of its neighboring pixels/voxels, i.e., $s_j = x_j - \frac{1}{|\mathcal{N}_j|} \sum_{k \in \mathcal{N}_j} x_k$. Note that sparsity of these pixel/voxel differences implies piecewise-smooth images. Lastly, we note that the choice in (10) implies Dirichlet boundary conditions, which is equivalent to assuming additional neighbors outside the image domain with values set to zero, such that the number of neighbors $|\mathcal{N}_j|$ is equal for all pixels/voxels (see Fig. 2). These boundary conditions agree with physical constraints typical of many real experimental setups, where the attenuation is zero outside the region of interest. We shall remove these restrictions in Sec. 6.
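The neighbor-difference operator just described can be sketched for a 2D image with 4-connected neighbors and Dirichlet boundaries (every pixel is treated as having exactly four neighbors, with out-of-image neighbors fixed to zero). The sparse-matrix layout and function name are implementation choices of ours.

```python
import numpy as np
from scipy.sparse import lil_matrix

def neighborhood_W(n):
    """W = I - (neighbor average) for an n x n image, 4-connectivity,
    Dirichlet boundary conditions (|N_j| = 4 for every pixel)."""
    N = n * n
    W = lil_matrix((N, N))
    for r in range(n):
        for c in range(n):
            j = r * n + c
            W[j, j] = 1.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    W[j, rr * n + cc] = -0.25   # -1/|N_j| with |N_j| = 4
    return W.tocsr()

n = 8
W = neighborhood_W(n)
img = np.zeros((n, n))
img[2:6, 2:6] = 1.0                  # piecewise-constant test image
s = W @ img.ravel()                  # transform coefficients s = W x
print(int((np.abs(s) > 1e-12).sum()), "of", n * n, "coefficients are nonzero")
```

For this piecewise-constant image, only the coefficients along the edges of the square are nonzero, illustrating why sparsity of s corresponds to piecewise-smooth images.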
2.2 Variational View of the EM Algorithm for ARD
We start with a known variational view [35] of the EM algorithm, which is used to solve (7) for Gaussian noise models. Here we introduce some of the notation used throughout this work, and discuss the type of difficulties encountered when trying to extend ARD to non-Gaussian noise models.
We rewrite the model evidence as

(11) $\log p(\mathbf{y}\,|\,\boldsymbol{\gamma}) = -F(q, \boldsymbol{\gamma}) + \mathrm{KL}\big(q(\mathbf{x})\,\|\, p(\mathbf{x}\,|\,\mathbf{y}, \boldsymbol{\gamma})\big),$

where

(12) $F(q, \boldsymbol{\gamma}) = -\mathbb{E}_q\big[\log p(\mathbf{y}\,|\,\mathbf{x}) + \log p(\mathbf{x}\,|\,\boldsymbol{\gamma}) - \log q(\mathbf{x})\big],$

(13) $\mathrm{KL}\big(q(\mathbf{x})\,\|\, p(\mathbf{x}\,|\,\mathbf{y}, \boldsymbol{\gamma})\big) = \mathbb{E}_q\big[\log q(\mathbf{x}) - \log p(\mathbf{x}\,|\,\mathbf{y}, \boldsymbol{\gamma})\big],$

$q(\mathbf{x})$ is the proposed variational distribution, $p(\mathbf{x}\,|\,\mathbf{y}, \boldsymbol{\gamma})$ is the exact posterior distribution, $F$ is the free variational energy (FVE), or negative evidence lower bound (ELBO), $\mathrm{KL}$ is the Kullback-Leibler (KL) divergence, and $\mathbb{E}_q$ is the expected value with respect to $q$. The EM algorithm can be viewed [35] as minimizing the free variational energy (FVE) function in (12) (or maximizing the ELBO) by alternating between updating $q$ in the E-step and updating $\boldsymbol{\gamma}$ in the M-step, as summarized below:

(14) E-step: $q^{(t+1)} = \arg\min_{q}\; F\big(q, \boldsymbol{\gamma}^{(t)}\big),$

(15) M-step: $\boldsymbol{\gamma}^{(t+1)} = \arg\min_{\boldsymbol{\gamma}}\; F\big(q^{(t+1)}, \boldsymbol{\gamma}\big).$

In the E-step, the optimal solution with zero KL divergence is $q^{(t+1)}(\mathbf{x}) = p(\mathbf{x}\,|\,\mathbf{y}, \boldsymbol{\gamma}^{(t)})$, i.e., the exact posterior based on the previous hyperparameter values, which can be computed directly from (3). In the M-step, minimizing $F$ with respect to $\boldsymbol{\gamma}$ can only increase the KL term, since it is nonnegative, and overall the model evidence is also increased.
For non-Gaussian likelihoods and the prior in (8), the E-step in (14) is intractable, i.e., the expression for the exact posterior is unknown, and its direct calculation according to (3) involves a high-dimensional integral that is prohibitive to compute. Instead, we are forced to restrict $q$ to a tractable family of distributions to approximate the posterior. However, this results in an algorithm that is no longer guaranteed to increase the evidence [36]. This occurs because the KL term in (11) is never zero and can either decrease or increase after updating $q$. Furthermore, a local maximum of the evidence with respect to $\boldsymbol{\gamma}$ is generally not a local minimum of $F$ (or a local maximum of the ELBO), except for degenerate cases [36].
Therefore, when considering the above approach to extend ARD, we immediately encounter a conceptual difficulty, since most previous theoretical studies of ARD [16, 27, 28] are based on the maximization of the evidence and its specific closedform expression for the Gaussian likelihood. It is not immediately clear how to extend the ARD framework to the Poisson noise model in (1) in a manner which will preserve the insights gained from [16, 27, 28] and will explain why sparsity is being promoted by the ARD algorithm.
Perhaps surprisingly, we shall show that the variational algorithm discussed above still preserves the principles of ARD, despite the fact that it is not based on maximizing the evidence, but rather on maximizing the ELBO (lower bound for the evidence). We call this framework “Variational ARD” (VARD).
2.3 Alternating-Minimization Algorithm for Variational ARD
As mentioned in Sec. 2.2, the true posterior distribution corresponding to the Poisson model in (1) and the prior in (8) is intractable, so we cannot solve the E-step in (14) exactly, and an approximation to the posterior should be used instead. Perhaps the most common approach is to use a free-form factorized posterior distribution (mean-field variational Bayes) [37]; however, this is also intractable here due to the lack of conditional conjugacy. Instead, we restrict the posterior to a parametric multivariate Gaussian form. In addition, in order to reduce the computational complexity, we restrict ourselves to factorized distributions, i.e.,

(16) $q(\mathbf{x}) = \prod_{j=1}^{N} \mathcal{N}(x_j;\, m_j,\, v_j),$

where $\mathcal{N}(x; m, v)$ denotes a univariate normal distribution with mean $m$ and variance $v$. We propose to modify the EM algorithm in (14)-(15) by adding the constraint (16) during the E-step in (14). The resulting algorithm is defined by the following two steps:

(17) B-step: $\big(\mathbf{m}^{(t+1)}, \mathbf{v}^{(t+1)}\big) = \arg\min_{\mathbf{m} \ge \mathbf{0},\, \mathbf{v} > \mathbf{0}}\; \Big\{ -\mathbb{E}_q\big[\log p(\mathbf{y}\,|\,\mathbf{x})\big] + \mathrm{KL}\big(q(\mathbf{x})\,\|\, p(\mathbf{x}\,|\,\boldsymbol{\gamma}^{(t)})\big) \Big\},$

(18) F-step: $\boldsymbol{\gamma}^{(t+1)} = \arg\min_{\boldsymbol{\gamma}}\; \mathrm{KL}\big(q^{(t+1)}(\mathbf{x})\,\|\, p(\mathbf{x}\,|\,\boldsymbol{\gamma})\big),$

which are repeated until convergence. Note that we have rewritten the FVE of (12) as $F = -\mathbb{E}_q[\log p(\mathbf{y}\,|\,\mathbf{x})] + \mathrm{KL}(q(\mathbf{x})\,\|\, p(\mathbf{x}\,|\,\boldsymbol{\gamma}))$. The first term in (17) does not depend on $\boldsymbol{\gamma}$ and was therefore omitted in (18). In the backward (B) step in (17), we update the approximation to the posterior, which reduces to searching only for the variational parameters $(\mathbf{m}, \mathbf{v})$. In the forward (F) step in (18), we update the generative (forward) model by finding $\boldsymbol{\gamma}$. Note that due to the constraint in (16), this is no longer an EM algorithm, and therefore we distinguish between the "E-step" and the "B-step," and similarly between the "M-step" and the "F-step."

Note that the posterior mean $\mathbf{m}$ is the main object of interest and provides the estimates for the linear attenuation coefficients, which must be nonnegative. Accordingly, a nonnegativity constraint on $\mathbf{m}$ was added in (17). The $\log$ function in (17) is only defined for positive values, and using the extended-value function we assume $F = +\infty$ whenever any $v_j \le 0$.
The AM algorithm formed by repeating the updates in (17)-(18) will reduce the FVE at each iteration (and increase the ELBO). Importantly, as shown below, the AM algorithm leads to near-sparse solutions, i.e., many of the marginal posterior distributions will be highly concentrated around zero in the transform space defined by (9). The roles of the B- and F-steps in promoting sparse solutions will be made clear in Sec. 2.4.
For the Poisson model in (1), the prior in (8), and the approximate posterior proposed in (16), the objective function in (17) can be written as (dropping the iteration number)

(19) $F(\mathbf{m}, \mathbf{v}, \boldsymbol{\gamma}) = \sum_{i=1}^{M} y_i \sum_{j=1}^{N} A_{ij} m_j$

(20) $\quad + \sum_{i=1}^{M} b_i \exp\Big( -\sum_{j=1}^{N} A_{ij} m_j + \tfrac{1}{2} \sum_{j=1}^{N} A_{ij}^2 v_j \Big)$

(21) $\quad + \frac{1}{2} \sum_{j=1}^{N} \frac{1}{\gamma_j} \bigg[ \Big( \sum_{k=1}^{N} W_{jk} m_k \Big)^{\!2} + \sum_{k=1}^{N} W_{jk}^2 v_k \bigg]$

(22) $\quad + \frac{1}{2} \sum_{j=1}^{N} \log \gamma_j - \frac{1}{2} \sum_{j=1}^{N} \log v_j + \mathrm{const},$

where $A_{ij}$ are the entries of the system matrix $\mathbf{A}$ defined after (1), and $W_{jk}$ are the entries of the matrix $\mathbf{W}$ defined in (8). In deriving the last term in (22), we assumed that $\mathbf{W}$ is square and nonsingular^{6}. We provide a generalization in Sec. 6 which removes this restriction and allows any $\mathbf{W}$.

^{6}In deriving (22) we used $p(\mathbf{x}\,|\,\boldsymbol{\gamma}) = |\det \mathbf{W}|\, \mathcal{N}\big(\mathbf{W}\mathbf{x};\, \mathbf{0},\, \mathrm{diag}(\boldsymbol{\gamma})\big)$, where $\log|\det \mathbf{W}|$ is a finite constant and can be excluded. Note that $\mathbf{W}$ has to be nonsingular in order for this expression to be well-defined, and this normalization only holds when $\mathbf{W}$ is square and nonsingular.
Recalling the discussion in Sec. 1.1 about the sparse structure of $\mathbf{A}$ and $\mathbf{W}$, we note that the objective function in (19)-(22) has a computational complexity of $O(MN^{1/d} + N)$,^{7} which is feasible to compute in order to assess convergence in practice. In contrast, the original ARD [15] or other extensions [28, 30] require $O(N^3)$ operations to evaluate the objective, which is prohibitive for large-scale problems where $N$ is very large, so the objective can only be approximated, e.g., via the stochastic estimator of Papandreou and Yuille [33].

^{7}For convenience we restate the definitions here: $M$ is the number of measurements, $N$ is the total number of pixels/voxels, and $d = 2, 3$ for 2D/3D images, respectively.
The solution to the F-step in (18) is obtained by solving $\partial F / \partial \boldsymbol{\gamma} = \mathbf{0}$, which yields

(23) $\boldsymbol{\gamma}^{(t+1)} = \big(\mathbf{W}\mathbf{m}^{(t+1)}\big) \circ \big(\mathbf{W}\mathbf{m}^{(t+1)}\big) + (\mathbf{W} \circ \mathbf{W})\, \mathbf{v}^{(t+1)},$

where $\circ$ denotes an elementwise product. It is straightforward to verify that (23) is indeed a minimizing point of (18) (for finite $\boldsymbol{\gamma}$). For fixed $\boldsymbol{\gamma}$, the FVE objective function is jointly convex with respect to $(\mathbf{m}, \mathbf{v})$, which implies that any local minimum of the objective in (17) is also a global minimum. Note that the FVE function is not jointly convex with respect to $(\mathbf{m}, \mathbf{v}, \boldsymbol{\gamma})$, so local minima are possible. In a similar way, the original ARD-EM algorithm [15] is also only guaranteed to find a local maximum of the evidence [40].
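Under the factorized posterior (16), the second moment of each transform coefficient decomposes as $\mathbb{E}[(\mathbf{W}\mathbf{x})_j^2] = (\mathbf{W}\mathbf{m})_j^2 + \sum_k W_{jk}^2 v_k$, which is exactly the elementwise update in (23). A small sketch (variable names are ours), with a brute-force check of the vectorized update:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
W = rng.standard_normal((N, N))
m = rng.standard_normal(N)           # posterior means
v = rng.uniform(0.1, 1.0, N)         # posterior variances

# F-step update: gamma_j = E[(Wx)_j^2] = (Wm)_j^2 + sum_k W_jk^2 v_k.
gamma = (W @ m) ** 2 + (W ** 2) @ v

# Sanity check against an explicit per-coefficient computation.
gamma_loop = np.array([(W[j] @ m) ** 2 + np.sum(W[j] ** 2 * v)
                       for j in range(N)])
print(np.allclose(gamma, gamma_loop))   # True
```

Both terms use only matrix-vector products with W and its elementwise square, which is what keeps the F-step cheap for sparse W.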
2.4 Discussion
The B-step in (17) is well understood from a Bayesian variational inference perspective, and it is equivalent to minimizing the KL divergence between the proposed posterior $q(\mathbf{x})$ and the true posterior $p(\mathbf{x}\,|\,\mathbf{y}, \boldsymbol{\gamma})$, as described in Fig. 3(a).

However, the F-step in (18) is somewhat puzzling, as it is unclear why it promotes sparse solutions, especially since using a Gaussian prior for MAP estimation does not lead to sparse solutions. To explain how sparsity is promoted, consider the space defined in (9), where the image has a sparse representation. Note that the KL divergence in (18) is invariant under the transformation in (9). The prior and posterior distributions become

(24) $p(\mathbf{s}\,|\,\boldsymbol{\gamma}) = \mathcal{N}\big(\mathbf{s};\, \mathbf{0},\, \mathrm{diag}(\boldsymbol{\gamma})\big), \qquad q(\mathbf{s}) = \mathcal{N}\big(\mathbf{s};\, \mathbf{W}\mathbf{m},\, \mathbf{W}\,\mathrm{diag}(\mathbf{v})\,\mathbf{W}^T\big),$
where $\mathbf{W}$ is assumed to be invertible. Next, we make use of the following representation for a factorized Student's-t distribution [27]:

(25) $\prod_{j=1}^{N} t_{\nu}(s_j) = \max_{\boldsymbol{\gamma} \ge \mathbf{0}}\; \prod_{j=1}^{N} \mathcal{N}(s_j;\, 0,\, \gamma_j)\, \varphi(\gamma_j),$

where $t_{\nu}$ is a univariate Student's-t distribution with $\nu$ degrees of freedom, whose normalization involves the Gamma function $\Gamma$, and $\varphi$ is a nonnegative function of the hyperparameters. To obtain (25), one exploits Fenchel duality [41] together with a monotonically increasing transformation; exponentiating both sides of the dual representation yields (25) [27]. For small $\nu$, $t_{\nu}$ becomes heavy-tailed, with sharp spikes at $s_j = 0$. Note that (25) implies that the right-hand side is upper-bounded by the left-hand side for every $\boldsymbol{\gamma} \ge \mathbf{0}$, so the multivariate t-distribution becomes an envelope for the ARD prior in (24). This is illustrated in Fig. 3(b) using a 2D example with $N = 2$, which depicts the proposed posterior $q(\mathbf{s})$ and the prior $p(\mathbf{s}\,|\,\boldsymbol{\gamma})$, with the latter "trapped" inside the level sets of the Student's-t envelope due to (25). One can see in Fig. 3(b) how selecting the widths of the Gaussian prior to minimize the KL divergence between $q(\mathbf{s})$ and $p(\mathbf{s}\,|\,\boldsymbol{\gamma})$ promotes the shrinking of $\gamma_j$ for some $j$ (minimizing the KL divergence between the two distributions is illustrated as maximizing the overlap between the level-set ellipsoids corresponding to the majority of the probability mass of these distributions). In the example of Fig. 3, the posterior has a much lower mean and variance along one coordinate than along the other, and the F-step will further shrink the corresponding $\gamma_j$. That parameter will be shrunk in the sense that the marginal prior will be more concentrated around zero, and so will the true posterior (see (3)). In the following B-step, the approximate posterior will also concentrate further around zero due to the zero-forcing property of the KL divergence [42] (recall that in the B-step we minimize the KL divergence between $q$ and the true posterior). Note that repeating the B- and F-steps is required to obtain sparse solutions.
The description of the mechanism responsible for promoting sparsity was inspired by the work of Wipf et al. [27], which uses similar arguments. However, that work relies entirely on the evidence function in (11), since a Gaussian noise model was considered. In contrast, our proposed approach is based on the free variational energy (FVE) function and applies more generally, even to cases where the expression for the evidence is unknown and an algorithm that increases the evidence is unavailable. This is important for non-Gaussian noise models, for which the evidence function is intractable, so we are forced to work with the FVE function instead. Another important difference between our work and that of Wipf et al. [27] is that we show how sparsity is promoted during the iterations of the algorithm, whereas in [27] it has been shown that a solution to (7) is sparse. Note also that [27] considers the original ARD prior in (6), whereas we consider the more general case in (8).
3 Parallel AM Algorithm
Here we present a modified AM algorithm, which uses an alternative Bstep. The rationale behind the modified algorithm lies in decoupling the variational parameters of each pixel/voxel. This reduces the algorithm into simple 1D optimization tasks that can be computed in parallel, leading to high scalability. First, note that (20) and (21) are not additively separable with respect to the individual parameters . We shall therefore introduce a surrogate function, making use of the following lemma. [The convex decomposition lemma [13]] Let be a convex function with and let be a reference point. Let such that . Then
(26) 
where denotes a column vector with all zeros except for a 1 at the th entry. Equality in (26) is obtained if for all , or if is affine.
The proof readily follows from Jensen’s inequality [41]
(27) 
where we added and subtracted a reference point and then applied Jensen's inequality to , where and . Equality is obtained iff is independent of , or if is affine. The former holds if , .
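The separable bound in (26) can be sanity-checked numerically. The sketch below uses an illustrative convex function `f` (the exponential, chosen arbitrarily), an arbitrary coefficient vector `a` and reference point `xhat`, and weights proportional to `|a_j|` (one common choice satisfying the lemma's conditions); it verifies both the majorization and the equality at the reference point.

```python
import math

def f(t):
    """Any convex scalar function; exp() is used purely for illustration."""
    return math.exp(t)

a = [0.7, 1.3, 0.4]       # illustrative coefficients (one "row" of a system matrix)
xhat = [1.0, 0.5, 2.0]    # reference point (current iterate)
beta = [abs(aj) / sum(map(abs, a)) for aj in a]  # convex weights, sum to 1

def original(x):
    return f(sum(aj * xj for aj, xj in zip(a, x)))

def surrogate(x):
    """Separable upper bound on f(a.x) built at xhat via Jensen's inequality."""
    axhat = sum(aj * xhj for aj, xhj in zip(a, xhat))
    return sum(bj * f((aj / bj) * (xj - xhj) + axhat)
               for aj, bj, xj, xhj in zip(a, beta, x, xhat))

x = [1.5, 0.2, 2.5]
print(surrogate(x), original(x))          # surrogate majorizes the original
print(surrogate(xhat), original(xhat))    # equality at the reference point
```

Each summand of the surrogate depends on a single component of `x`, which is precisely the separability exploited in the parallel B-step below.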
We will also use a minor (but useful) extension of Lemma 3 by O'Sullivan and Benac [13], where we introduce a dummy variable denoted with a corresponding weight , such that . Accordingly, and in (26) are replaced by and , respectively ( is the collection of all 's), and the summation in (26) over from to is replaced by a summation from to . This modification leads to the requirement that , which allows the freedom to choose instead of the strict equality in Lemma 3 (which of course requires that ). We choose to fix the dummy variable so that the term corresponding to on the right-hand side of (26) is just a constant and will not be involved in the subsequent optimization.

Before we proceed, we make several definitions to simplify the following derivations:
(28)  
(29)  
(30) 
where , and are defined in (1). To distinguish between quantities associated with the mean and those associated with the variance, we denote the latter using a tilde. In (28), is the projection of the posterior mean along the th line, which we shall call a "mean-type" projection; is a projection of the posterior variance along the th line using an element-wise squared system matrix, which we term a "variance-type" projection. In (29), is the predicted Poisson rate for the th measurement, based on the expectation with respect to the posterior distribution computed at iteration . In (30), is the backprojection of the measurements to the th pixel; and are the backprojections of the predicted Poisson rate in (29) to the th pixel, using the backward (adjoint) operator and the squared backward operator, respectively, which we call "mean-type" and "variance-type" backprojections.
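The two projection types can be sketched concretely. In the toy example below, all matrix and vector values are purely illustrative; the predicted Poisson rate is computed as the log-normal moment `d_i * exp(-[P mu]_i + [P~ sigma^2]_i / 2)`, which is one plausible form of the expectation under a factorized Gaussian posterior and a Beer–Lambert transmission model, and is an assumed stand-in for the paper's (29).

```python
import math

# Tiny 3-measurement, 4-pixel example; all values are illustrative.
P = [[1.0, 0.5, 0.0, 0.2],
     [0.0, 1.0, 0.7, 0.0],
     [0.3, 0.0, 1.0, 0.5]]
d = [100.0, 80.0, 120.0]         # assumed blank-scan (incident) intensities
mu = [0.10, 0.05, 0.20, 0.15]    # posterior means per pixel
sig2 = [0.01, 0.02, 0.01, 0.03]  # posterior variances per pixel

# "Mean-type" projection: ordinary forward projection of the posterior mean.
p_mean = [sum(Pij * mj for Pij, mj in zip(row, mu)) for row in P]

# "Variance-type" projection: forward projection of the posterior variances
# with the element-wise squared system matrix.
p_var = [sum(Pij**2 * vj for Pij, vj in zip(row, sig2)) for row in P]

# Assumed predicted Poisson rate: E[d_i * exp(-(Px)_i)] under a factorized
# Gaussian posterior equals d_i * exp(-p_mean_i + p_var_i / 2).
rate = [di * math.exp(-pm + 0.5 * pv)
        for di, pm, pv in zip(d, p_mean, p_var)]
print(p_mean, p_var, rate)
```

Note how the variance-type projection reuses the same sparsity pattern as the system matrix, so both projections can be computed in one pass over the nonzeros of P.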
By noting that (20) can be written as and that is jointly convex with respect to , we apply Lemma 3 to each separately. Using (26) with its minor extension (mentioned after Lemma 3) and , , and choosing a dummy variable which is set to zero, we obtain the bound at iteration with defined as
(31) 
where we made use of (28)–(30) and corresponds to while corresponds to (for the th measurement).
Note that the summation over in (31) includes the dummy variable with for any .
Also note that (31) is separable with respect to the components and .
According to the extended Lemma 3 we are free to choose as long as
(32) 
which must be satisfied for each . We make the following choice for and
(33) 
which can easily be verified to satisfy the condition in (32). Substituting (33) into (31) and using the definitions in (30), we obtain
(34) 
The particular choice in (33) is very useful, since the sum over in (31) needs to be computed only once, rather than being repeated for each update of during the search for the minimum of .
We derive a separable bound for (21), , by noting that is a convex function and using Lemma 3 again to obtain
(35) 
where (also defined in (38)). Similarly to (32), we require using the dummy variable extension. One choice for is
(36) 
Substituting (36) into (35), we obtain
(37) 
where we introduced the following definitions
(38)  
(39)  
(40) 
in (38) is the estimated th element of the sparse representation of the object, based on the posterior mean ; e.g., if is chosen according to (10) (assuming sparsity in the pixel/voxel-difference space), then , where defines the neighborhood system and is the number of neighbors. In deriving (39) we made use of .
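A pixel-difference representation of the kind referenced above can be sketched on a toy 2D image. The exact form of (10) is not reproduced here; the sketch below assumes one common variant, in which each output element is the average difference between a pixel and its in-bounds 4-neighbors, so that flat regions map to zero and the representation is sparse for piecewise-constant images.

```python
# Toy piecewise-constant image: flat background with a single bright pixel.
img = [[1.0, 1.0, 1.0],
       [1.0, 5.0, 1.0],
       [1.0, 1.0, 1.0]]

def pixel_diff(img):
    """Average difference to in-bounds 4-neighbors (assumed form of (10))."""
    H, W = len(img), len(img[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            nbrs = [(a, b) for a, b in nbrs if 0 <= a < H and 0 <= b < W]
            out[i][j] = sum(img[i][j] - img[a][b] for a, b in nbrs) / len(nbrs)
    return out

z = pixel_diff(img)
print(z)  # corner entries are exactly 0; the bright pixel maps to 4.0
```

Interior pixels in constant regions map to exactly zero, which is what makes a sparsity-promoting penalty on this representation effective for approximately piecewise-constant attenuation maps.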
Combining (34), (37), and (22) (the latter is already separable), we obtain a surrogate function that is a global majorizer of , i.e., , given by
(41)  
(42)  
(43) 
Note that the surrogate function in (41) is separable with respect to the components of both and . This yields a modified B-step in which is minimized by simultaneous (parallel) 1D optimization tasks, each with respect to a different component or . The iterations are defined as
(44)  
(45) 
Combining the modified B-step in (44)–(45) with the F-step in (23), we obtain the parallel VARD algorithm described in Algorithm 1.
Each iteration requires a "mean-type" forward projection and a "variance-type" forward projection (lines 4–5 in Algorithm 1), which can be computed in parallel. It also requires a "mean-type" backprojection and a "variance-type" backprojection (lines 8–9 in Algorithm 1), which can likewise be computed in parallel. Each iteration also requires solving two types of 1D optimization problems, one for the mean components and one for the variance components (lines 11–12 in Algorithm 1). The 1D optimization tasks for all pixels/voxels can be performed in parallel, and the scalar updates for the mean and variance can also be computed in parallel. Note that despite the latter parallelization, and provide information about both the mean and variance at the previous iteration, which is shared during the mean and variance updates in (44)–(45) (lines 11–12 in Algorithm 1). Also note that these subiterations for are themselves iterations of another alternating minimization algorithm (lines 4–12 of Algorithm 1). Here we chose to use only one subiteration, but any number can be used.
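The parallel per-pixel update can be sketched as follows. The actual 1D surrogate pieces come from (41)–(43); the sketch below substitutes illustrative quadratic surrogates per pixel (the coefficients are arbitrary), minimizes each one independently by golden-section search over a nonnegative interval, and runs the independent tasks through a thread pool to emphasize that no pixel's update depends on another's.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def golden_section_min(phi, lo, hi, tol=1e-8):
    """Minimize a unimodal 1D function on [lo, hi] by golden-section search."""
    invphi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

# Illustrative per-pixel 1D surrogates 0.5*c*t^2 + b*t (stand-ins for the
# separable terms of (41)-(43)); (curvature, slope-at-zero) pairs.
coeffs = [(2.0, -3.0), (1.0, 0.5), (4.0, -8.0)]

def update_pixel(cb):
    c, b = cb
    # Nonnegativity is enforced by restricting the search to [0, 10].
    return golden_section_min(lambda t: 0.5 * c * t * t + b * t, 0.0, 10.0)

# The per-pixel tasks are independent, so they may run concurrently.
with ThreadPoolExecutor() as pool:
    mu_new = list(pool.map(update_pixel, coeffs))
print(mu_new)  # approx [1.5, 0.0, 2.0]
```

The same structure applies to the variance updates; only the 1D objective changes.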
A considerable advantage of this algorithm is that it allows fast parallel computation on multicore computers or on dedicated hardware such as field-programmable gate arrays (FPGAs). Another advantage is that the solution to the constrained problem in (44) can be obtained by first solving the unconstrained problem using simple gradient-based methods; if , then the solution to the constrained problem must be ; otherwise, the solutions coincide. This simple thresholding for leads to the optimal solution only in one-dimensional minimization problems; applying the same procedure to the original B-step in (17) can lead to suboptimal solutions or even a lack of convergence.
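The solve-then-threshold argument is easy to verify in 1D. Using a quadratic as a stand-in for the 1D surrogate pieces (the actual objectives come from (41)–(43)), the sketch below clamps the unconstrained stationary point at zero and checks it against a brute-force grid search over the nonnegative interval.

```python
def constrained_argmin_quadratic(c, b):
    """argmin over t >= 0 of 0.5*c*t^2 + b*t (c > 0): if the unconstrained
    minimizer -b/c is negative, the function increases on [0, inf), so the
    constrained minimizer is 0; otherwise the two coincide."""
    return max(0.0, -b / c)

def grid_argmin(phi, hi=10.0, n=200001):
    """Brute-force minimizer over a fine grid on [0, hi], for verification."""
    return min((hi * k / (n - 1) for k in range(n)), key=phi)

# Illustrative (curvature, slope-at-zero) pairs; the middle case has a
# negative unconstrained minimizer and is clamped to 0.
for c, b in [(2.0, -3.0), (1.0, 0.5), (4.0, -8.0)]:
    t_clamp = constrained_argmin_quadratic(c, b)
    t_grid = grid_argmin(lambda t: 0.5 * c * t * t + b * t)
    assert abs(t_clamp - t_grid) < 1e-3, (t_clamp, t_grid)
print("clamping matches grid search")
```

In higher dimensions this clamping argument fails, because zeroing one coordinate changes the optimal values of the others; this is exactly why the claim holds for the separable 1D tasks but not for the original coupled B-step.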
The computational complexity of a forward/backprojection is , where is the number of measurements, is the total number of pixels/voxels, and for 2D/3D images, respectively (recall the discussion about the sparsity of in Sec. 1.1). The computational complexity of the operations and is , where is the number of neighboring pixels/voxels defined by the prior (recall the discussion on the sparsity of in Sec. 2.1). Typically, , , and , so the forward/backprojection cost dominates the cost of and . The details regarding the implementation of the 1D optimization tasks (lines 11–12 in Algorithm 1) are given in Appendix B.
4 Optimality Conditions and Convergence Analysis
First we provide the necessary optimality conditions for the minimization problems presented in Secs. 2–3 as well as a few definitions in order to simplify the following analysis. We then present a detailed study of the convergence properties of the parallel AM algorithm derived in Sec. 3 (Algorithm 1). We shall refer to this algorithm as “PARVARD” (parallel VARD).
4.1 Necessary Optimality Conditions

The necessary KKT condition for to be a solution to (18) is given by
(48)