I Introduction
Compressive sensing and sparse signal representation have attracted the interest of an increasing number of researchers in recent years (see, e.g., [1, 2, 3, 4]). This has been motivated by the widespread applicability that such techniques find in a large variety of engineering areas. Generally speaking, these disciplines consider the following signal model:
(1) 
In this expression, is a vector of measurement samples, is an dictionary matrix with basis vectors , . The additive term is an perturbation vector, which is assumed to be a white Gaussian random vector with mean equal to zero and covariance matrix , where denotes the noise precision parameter and
is the identity matrix. The sparsity aspect is reflected by the fact that the entries in the
unknown weight vector are mostly zero. By obtaining a sparse estimate of we can accurately represent with a minimal number of basis vectors. We coin the signal model (1) as either real, when , , and are all real, or as complex, when , , and are all complex.^{1}Obviously, one could also consider a mixed model where, e.g., and are complex but is real. Since applying our inference approach to this mixed model is straightforward, we discard it in this paper and focus on the two most relevant cases of real and complex signal models. Historically, real signal models have dominated the research in sparse signal representation and compressive sampling. However, applications in which sparse estimation for complex signal models is sought are not uncommon in practice. An example is the estimation of multipath wireless channels [4, 5, 6]. The extension from sparse signal representation in real signal models to complex models is not always straightforward, as we will discuss in this paper.
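As a concrete illustration of signal model (1), the following sketch generates a measurement vector as a dictionary matrix times a sparse weight vector plus white Gaussian noise. All symbol names (`y`, `D`, `w`, `n`, `beta`) and dimensions are our own choices for illustration; the paper's notation is not reproduced in this extraction.

```python
import numpy as np

# Illustrative sketch of signal model (1): measurements = dictionary @ sparse
# weights + white Gaussian noise. Names and sizes are hypothetical.
rng = np.random.default_rng(0)

N, M, K = 64, 128, 5        # measurements, dictionary columns, nonzero weights
beta = 100.0                # noise precision (inverse variance)

D = rng.standard_normal((N, M))                  # dictionary of M basis vectors
w = np.zeros(M)                                  # mostly-zero weight vector
support = rng.choice(M, size=K, replace=False)
w[support] = rng.standard_normal(K)
n = rng.standard_normal(N) / np.sqrt(beta)       # zero-mean white Gaussian noise
y = D @ w + n                                    # vector of measurement samples
```

With many more dictionary columns than measurements (M > N), recovering `w` from `y` is only possible because `w` is sparse, which is the premise of the estimators discussed next.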
Sparse Bayesian learning (SBL) [7, 8, 3] applied to signal model (1) aims at finding a sparse maximum a posteriori (MAP) estimate of
(2) 
In this expression, denotes the norm and the parameter takes values when the signal model (1) is real and when it is complex. The estimate minimizes a weighted sum of two terms: the first term represents the squared error of the reconstructed signal, while the second term is a penalization term designed to enforce a sparse estimate of the weight vector .^{2}Here denotes , and thus , for some arbitrary constant . We will also make use of , which denotes for some positive constant .
In consequence, a variety of different estimators can be obtained by modifying the prior probability density function (pdf)
. Each of these estimators represents a different compromise between accuracy of the reconstructed signal and sparseness of the estimate. Instead of working directly with the prior pdf , SBL typically uses a two-layer (2L) hierarchical prior model that involves a conditional prior pdf and a hyperprior pdf
. The goal is then to select these pdfs in such a way that we can construct efficient yet computationally tractable iterative algorithms that estimate both the hyperparameter vector
and the weight vector with the latter estimate being sparse. The 2L prior model can be well understood from the perspective of a Gaussian scale mixture (GSM) model [9, 10, 11, 12], where the entries in are modeled as independent GSMs [13]. Specifically, is modeled as , where
is a standard Gaussian random variable and
is a nonnegative random scaling factor, also known as the mixing variable and described by its mixing density .^{3}In this configuration, can be seen as the variance of . By choosing appropriate mixing densities, various sparsity-inducing penalty terms in (2) can be realized. One approach illustrating the use of the GSM model for SBL is the Relevance Vector Machine (RVM) [7], where the mixing densities are identical and equal to an inverse gamma pdf.^{4}In the original formulation of the RVM algorithm in [7] the parameter models the precision (inverse variance) of the conditional Gaussian prior pdf and the hyperprior pdf is a gamma pdf. The model has been reformulated here in order to facilitate the comparison with the framework adopted in this paper. This choice leads to the marginal prior pdf being a product of Student-t pdfs. An expectation-maximization (EM) algorithm is then formulated to estimate the hyperparameter vector ; as the estimate decreases it drives the corresponding weight estimate towards zero, thus encouraging a solution with only a few nonzero coefficients in . Another approach, presented in [14], realizes two popular sparsity-inducing penalization terms: the norm and the log-sum. This is achieved by selecting identical mixing densities equal to, respectively, an exponential pdf and the density of the noninformative Jeffreys prior. In the former case, the marginal prior pdf is the product of Laplace pdfs and the corresponding MAP estimator (2) solves the norm-constrained minimization problem. The same problem is also solved by the well-known Least Absolute Shrinkage and Selection Operator (LASSO) [15] and Basis Pursuit Denoising [16]. Similarly to the RVM, an EM algorithm is used in [14] to estimate the hyperparameter vector . The resulting iterative algorithms are equivalent to special cases of the FOcal Underdetermined System Solver (FOCUSS) algorithm [17].
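The GSM construction above is easy to verify empirically: drawing the mixing variable from an exponential density and scaling a standard Gaussian yields samples whose marginal is Laplace, and hence noticeably heavier-tailed than a Gaussian. The sketch below is illustrative only; the variable names are ours.

```python
import numpy as np

# Gaussian scale mixture sketch: w = sqrt(gamma) * u with u ~ N(0, 1).
# With an exponential mixing density for gamma, the marginal of w is the
# Laplace pdf (real case), as stated in the text.
rng = np.random.default_rng(1)
S = 200_000
gamma = rng.exponential(scale=1.0, size=S)    # mixing variable (variance of w)
u = rng.standard_normal(S)                     # standard Gaussian variable
w = np.sqrt(gamma) * u                         # GSM samples

# A Laplace distribution has excess kurtosis 3, so the GSM samples are
# heavier-tailed than a Gaussian (excess kurtosis 0).
excess_kurtosis = np.mean(w**4) / np.mean(w**2) ** 2 - 3.0
```

The heavy tails and sharp peak at zero are precisely what make the resulting marginal priors sparsity-inducing.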
A significant limitation of the sparse estimators in [7, 14] is that the EM algorithms they use are known to suffer from high computational complexity and slow convergence [18]. Specifically, many iterations of the algorithm are needed before the entries of become small enough to disregard the corresponding basis vectors. To circumvent this shortcoming, a fast inference scheme is proposed in [18] for the RVM and later applied in [19]. The Fast Laplace algorithm [19] is derived based on an augmented probabilistic model obtained by adding a third layer to the hierarchical structure of the Laplace pdf; this third layer introduces a hyper-hyperprior for the rate parameter of the exponential pdf. The algorithm in [19] can be seen as the Bayesian version of the reweighted LASSO algorithm [20]. A similar three-layer (3L) prior model leading to an adaptive version of the LASSO estimator is also proposed in [21].
Even though the fast algorithms in [18] and [19] converge faster than their EM counterparts, they still suffer from slow convergence, especially in low and moderate signal-to-noise ratio (SNR) regimes, as we will demonstrate in this paper. Furthermore, in these regimes the algorithms significantly overestimate the number of nonzero weights and therefore do not produce sparse representations. We will show that this behavior results from the selection of the hierarchical prior models.
Additionally, though complex GSM models have been proposed [22, 23], they have not been extensively studied for SBL in the literature. An example illustrating this fact is the hierarchical modeling of the penalty term. While it is well known that this penalty term results from selecting the exponential mixing density in real models, the same density will not induce this penalty term in complex models. Yet to the best of our knowledge, the complex GSM model realizing the penalty term has not been established in the literature. Motivated by the relative scarcity of formal tools for sparse learning in complex models and inspired by the recent developments of sparse Bayesian algorithms, we present a sparse Bayesian approach that applies to both real and complex signal models.
We first present a 2L hierarchical prior model inspired by the GSM model for both real and complex weights. By selecting each mixing density , , to be a gamma pdf, a Bessel K pdf is obtained for the marginal prior of each of the entries in [10, 11, 12]. The Bessel K pdf not only encompasses the Laplace pdf as a special case,^{5}The Bessel K pdf is in turn a special case of an even larger class of generalized hyperbolic distributions [10], obtained when the mixing density is a Generalized Inverse Gaussian pdf.
but also introduces new degrees of freedom that allow for novel priors to be used in SBL. In the particular case where the dictionary matrix
is orthonormal, we demonstrate that the MAP estimator is a generalization of the soft-thresholding rule, with the degree of sparseness depending on the selection of the shape parameter of the gamma pdf . Additionally, we show that the prior model has a strong connection to the Bayesian formalism of the group LASSO [21, 24]. A 3L hierarchical model is also investigated that results from an extension of the 2L model. For orthonormal , the MAP estimator is, in this case, a generalization of the hard-thresholding rule, with the shape parameter of again determining the sparseness property of the estimator. Finally, we apply the fast inference scheme in [18] to the 2L and 3L hierarchical structures to design sparse estimators for . As these structures encompass those used in [18] and [19], the iterative algorithms derived in these publications can be seen as instances of our estimation algorithms. We provide numerical results that reveal the superior performance of our estimators with respect to convergence speed, sparseness, and mean-squared error (MSE). Most remarkably, the 2L prior model leads to an estimator that outperforms state-of-the-art Bayesian and non-Bayesian sparse estimators. The remainder of this paper is organized as follows. In Section II, we detail the 2L and 3L hierarchical prior models for real and complex weights and analyze their sparsity-inducing properties. Two fast Bayesian iterative algorithms based on the 2L and 3L models are proposed in Section III, where we also outline their relationship to the algorithms in [18] and [19]. A performance comparison of our two algorithms with related state-of-the-art sparse estimators obtained via Monte Carlo simulations is presented in Section IV. Finally, we conclude the paper in Section V.
II Bayesian Hierarchical Modeling of Sparsity Constraints
We begin with the specification of the probabilistic structure of the SBL problem for signal model (1). Two types of hierarchical prior models for the weight vector are considered: a 2L and a 3L hierarchical model. Later we will see that these models lead to priors for with distinct sparsityinducing properties.
The joint pdf of the signal model (1) augmented with the 2L prior model for reads
(3) 
From (1), is Gaussian: if the signal model is real and if it is complex.^{6} and denote respectively a multivariate real and a multivariate complex Gaussian pdf with mean vector and covariance matrix . The sparsity-inducing properties of this Bayesian model are determined by the joint prior pdf . Motivated by GSM modeling as well as previous works on SBL [7, 14, 19], we select the conditional prior pdf to factorize into a product of Gaussian pdfs: . For the sake of using a single notation for for both real and complex weights, we introduce the parameterization
(4) 
The conditional prior pdf for real is realized by selecting , while entails the conditional prior pdf for complex .
The hyperprior pdf of the 2L prior model is selected to be some suitable pdf defined over with fixed parameter . An extension of a 2L hierarchy with an additional layer, where the entries in are also modeled as random variables, has been considered in [21, 19]. Doing so yields the corresponding 3L prior model with the joint pdf specified as follows:
(5) 
where is the pdf of the “hyper-hyperprior”.
In the sequel we analyze the 2L and 3L prior models in more detail by computing the prior pdfs of the weight vector obtained by marginalizing the joint prior pdfs and with respect to the hyperparameters and respectively. We then discuss the sparsity-inducing properties of these marginalized priors for both real and complex weights. This includes the modeling of the penalty term as a special case.
II-A Two-Layer Hierarchical Prior Model
We begin with the hierarchical model of the Bessel K pdf. Specifically, is selected as a product of gamma pdfs with individual rate parameters, i.e., with .^{7} denotes a gamma pdf with shape parameter and rate parameter . Observe that while the prior pdfs , , are assigned individual rate parameters , they have the same shape parameter . Let us define . The pdf of the marginal prior of is computed to be
(6) 
with
(7) 
In this expression, is the modified Bessel function of the second kind and order . Note that (7) is valid for both real () and complex () weight vector . Let us study (7) for these two cases in more detail.
II-A1 Real weights
With the selection , (7) coincides with the Bessel K pdf [10, 11]. This pdf has been used for, e.g., modeling image probabilities [25, 26]. Moreover, selecting and making use of the identity [27], (7) simplifies to the Laplace pdf
(8) 
The Laplace distribution for real weights is thereby obtained from a GSM model with an exponential mixing density [9].
II-A2 Complex weights
We now consider (7) for the case when the weight is complex, i.e., . In contrast to the previous case, selecting no longer leads to the Laplace pdf. Instead, a gamma pdf with shape parameter has to be used as the mixing density. With this choice, the modified Bessel function in (7) has order and
(9) 
which is the Laplace pdf in the complex domain. Note that the choice has some connection with the group LASSO and its Bayesian interpretation [21, 24], where sparsity is enforced simultaneously over a group of variables. In the Bayesian interpretation of the group LASSO, a gamma pdf with a shape parameter is employed to model the prior for the group weights; naturally, if we let a group consist of the real and imaginary parts of a complex variable, then and hence, .
To conclude, the 2L prior model realizes the penalty term with for real and with for complex when selecting , . With these settings, the argument of the minimization in (2) particularizes to the LASSO cost function.
Let us stress that (7) represents a family of prior pdfs for parameterized by and . In Fig. 1, we visualize the restriction^{8} to of the prior pdf in (7) with for various values of . (^{8}Let denote a function defined on a set . The restriction of to a subset is the function defined on that coincides with on this subset.) Observe the change of the shape of with : the smaller the value of is, the more rapidly decays around the origin. Moreover, with the selection , becomes improper, i.e., its integral is no longer finite. This in turn leads to an improper prior density with a light tail and a pole at the origin.^{9}When this prior simplifies to . This prior has a strong sparsity-inducing nature, as we demonstrate next.
Naturally, the 2L prior model can be used with arbitrary values of , leading to the general optimization problem (2) with penalty term
(10) 
To illustrate this, consider Fig. 2(a), where the contour lines of the restriction to of are plotted. Each contour line is computed for a specific choice of . It can be seen that as approaches , more probability mass concentrates along the axes; as a consequence, the mode of the resulting posterior pdf is more likely to be found close to the axes, thus indicating a sparse solution. The behavior of the classical penalty term obtained for can also be clearly recognized.
In order to gain insight into the impact of on the MAP estimator (2) with penalty term (10), we follow the approach in [14] and consider the case when is orthonormal, i.e., when , , where is the Kronecker delta. In this case the optimization (2) decouples into independent scalar optimization problems. Furthermore, when the 2L prior model realizes an penalty term, i.e., the pdf (8) (real case) or (9) (complex case) is selected, the MAP estimator coincides with the soft-thresholding rule
(11) 
where is the sign function. For an arbitrary value of , we cannot compute in closed form. Therefore, we make use of the EM algorithm as implemented in [14] to approximate the MAP estimator. Figure 2(b) visualizes the estimation rule for different values of . Note that as decreases, more components of the estimate are pulled towards zero since the threshold value increases, thus encouraging a sparser solution.
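For reference, the soft-thresholding rule in its generic form is sketched below. The threshold `t` stands in for the noise- and rate-dependent constant of rule (11), whose exact expression is not reproduced in this extraction.

```python
import numpy as np

# Generic soft-thresholding rule: sign(x) * max(|x| - t, 0). This is the form
# the MAP estimator takes for an orthonormal dictionary under an l1-type
# penalty; the threshold t is a placeholder for the paper's constant.
def soft_threshold(x, t):
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```

Coefficients whose magnitude falls below the threshold are set exactly to zero, while the surviving coefficients are shrunk towards zero by the threshold amount.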
II-B Three-Layer Hierarchical Prior Model
We now turn to the SBL problem with a 3L prior model for , which leads to the joint pdf (5) of the augmented signal model. In contrast to the 2L prior model, we consider the entries in as random and assume that with , .
We now compute the prior pdf by marginalizing with respect to and . First, we note that and marginalize this pdf over . This requires computing . Defining and we obtain
(12) 
with
(13) 
Finally, marginalizing over yields
(14) 
with
(15) 
In this expression, is the confluent hypergeometric function [27]. In Fig. 3(a), we depict the contour lines of the restriction to of .
Observe that although the contours are qualitatively similar to those shown in Fig. 2(a) for the 2L model, the corresponding estimation rules depicted in Fig. 3(b) differ from those shown in Fig. 2(b): the 3L prior model leads to a generalization of the hard-thresholding rule.
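To make the qualitative difference concrete, the basic hard-thresholding rule is sketched below. The paper's 3L estimator is a generalization of this rule; the plain version here is only meant to contrast with soft thresholding, which additionally shrinks the surviving coefficients.

```python
import numpy as np

# Basic hard-thresholding rule: keep a coefficient unchanged when its
# magnitude exceeds the threshold, set it to zero otherwise. The threshold t
# is a placeholder; the paper's 3L MAP estimator generalizes this rule.
def hard_threshold(x, t):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > t, x, 0.0)
```

Unlike soft thresholding, the retained coefficients are unbiased, at the price of a discontinuous estimation rule.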
References [21] and [19] also propose 3L prior models to derive Bayesian adaptive versions of the LASSO estimator, for the hierarchical adaptive LASSO and for the Fast Laplace algorithm respectively. There is, however, one important difference between these approaches and the one advocated in our work: the 2L and 3L prior models presented here are not constrained to realize the penalty term, but can be used for arbitrary values of . It is precisely this distinction that makes these hierarchical prior models lead to estimators that outperform classical penalized sparse estimators, as our numerical evaluation in Section IV will reveal.
II-C The Jeffreys Noninformative Prior
Note that in (13) entails the density of the noninformative Jeffreys prior for and thereby the improper marginal prior density . The same holds for the 2L prior model with the selection . In other words, when the highest hierarchy layer in the prior formulation is chosen to be noninformative, the log-sum penalization is invoked. This penalty term has gained much interest in the literature, including [7, 8, 14], as it is known to strongly promote sparse estimates. We should also add that the reweighting scheme in [20] is also motivated by the log-sum penalty term.
III Sparse Bayesian Inference
In this section we present the Bayesian inference scheme that relies on the 2L and 3L prior models presented in Section II. First, we exploit the EM framework to derive an iterative algorithm that estimates the unknown model parameters of the 3L prior model. Then, a fast algorithm with low computational complexity inspired by [18] is presented. The algorithm based on the 2L model can easily be obtained by treating as a fixed parameter.
III-A Sparse Bayesian Inference Using EM
Direct computation of the estimate in (2) is not always straightforward. In particular, for small the penalty term (see, e.g., (10)) loses convexity and the calculation of the minimum of the objective function in (2) becomes impractically complex. The hierarchical structure of the 2L and 3L prior models instead provides an efficient yet computationally tractable approach to computing an estimate of . This is achieved at the expense of introducing the additional hyperparameters and , which have to be estimated alongside . The sparse Bayesian inference scheme previously used for SBL (see [7, 18, 8, 29, 19]) can also be adopted for our prior models. Its main steps are summarized below for the 3L prior model; the reader is referred to the above-cited works for more information. The inference scheme exploits the particular factorization of the posterior pdf of the model parameters:
(16) 
The joint maximization of the left-hand side of (16) is not feasible. Instead, a two-step approach is adopted, consisting of a first step that computes the maximizer of the second factor, followed by a second step that calculates the maximizer of the first factor with set to the value obtained in the first step. Note that the second step computes the MAP estimate of , while the first step calculates the conditional MAP estimate of given set to its MAP estimate. Unfortunately, the first step is computationally prohibitive, so it is replaced in the Bayesian inference scheme by an EM algorithm that sequentially computes an approximation of the MAP estimate of . The virtue of this modified scheme is that the conditional MAP estimate of computed in the second step is readily available as a byproduct of the E-step of the EM algorithm. In our particular application context, we update the estimates of the components in sequentially, instead of jointly. The generalized EM (GEM) framework justifies this modification.
The EM algorithm aims at approximating the MAP of , i.e. the maximizer of
(17) 
by exploiting the fact that is complete data for :
(18) 
The E-step of the EM algorithm computes the conditional expectation
(19) 
with respect to the conditional pdf or , depending on whether the underlying signal model is real or complex. Here, and denote the current estimates computed at the previous iteration.^{10}To keep notation simple we avoid the use of an iteration index in the formulation of the EM algorithm: in the updating equations, the estimates on the left-hand side are the newly updated estimates, while those on the right-hand side are the current estimates. In either case, the parameters of the conditional pdf of read
(20)  
(21) 
with . The M-step of the EM algorithm updates the estimate of by equating it with the maximizer of (19). We can invoke the GEM framework to replace this step by a step updating the three components in sequentially. Specifically, , , and . Doing so we obtain
(22)  
(23)  
(24) 
In accordance with [8, 14], we have selected a constant (improper) prior for , i.e., . This choice is motivated by the fact that the influence of the shape of the conjugate gamma pdf becomes small for a large .
In this formulation of the EM algorithm, we have chosen as the hidden variable. Another approach, followed in [14], consists in letting be the hidden variable and including the estimate of in the M-step. However, our numerical results indicate that the resulting EM algorithm leads to estimators that may exhibit unstable behavior when small values of the parameter are selected. These estimators consistently converge to a local minimum and often fail to produce sparse estimates of . We have observed the same convergence behavior when this EM algorithm is applied to the probabilistic model using the noninformative Jeffreys prior for , , also proposed in [14]. As previously mentioned, this estimator coincides with the FOCUSS algorithm with the setting [17] (see [8] for a thorough numerical analysis of different choices of ). Finally, another approach to estimating the model parameters is to use variational Bayesian inference techniques, see, e.g., [30, 31, 6]. This approach is not pursued in this paper.
III-B Fast Sequential Sparse Inference Scheme
Although the EM algorithm derived in the previous subsection yields simple expressions for the parameter updates, it suffers from two major drawbacks. First, the algorithm converges rather slowly, i.e., many iterations are needed before the entries of corresponding to the zero-valued weights become small enough that the corresponding basis vectors can be removed, which would reduce the computational complexity of the update (20). Second, unless , i.e., in the noiseless case, the entries of are never exactly zero; they take very small, yet nonzero, values, and thus an additional, usually empirical, thresholding procedure is needed to remove components with small weights in the inference procedure.
In order to circumvent these drawbacks, we follow the framework proposed in [18]: rather than estimating the entries of simultaneously for all components in the dictionary , the objective function (17) is maximized sequentially with respect to the parameter of a single basis vector at a time, while keeping the parameters of the other basis vectors fixed. Thus, the parameters of all components are updated sequentially, one after another. This approach leads to a much more efficient inference scheme that accelerates convergence compared to EM. It also provides conditions for determining when a component has to be removed, thus eliminating the need for a pruning threshold.
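The one-parameter-at-a-time maximization is an instance of coordinate-wise ascent, which can be sketched generically as follows. The concave toy objective below is our own stand-in; the paper instead maximizes the marginal-likelihood objective (17), whose exact form is not reproduced in this extraction.

```python
import numpy as np

# Generic coordinate-wise maximization sketch: each parameter is updated in
# turn while the others are held fixed, mirroring the sequential update of
# the per-basis-vector hyperparameters. The toy objective is
# f(x) = -sum((x - target)^2), chosen only for illustration.
def coordinate_ascent(partial_grad, x0, n_sweeps=10, step=0.5):
    x = np.array(x0, dtype=float)
    for _ in range(n_sweeps):
        for i in range(x.size):
            x[i] += step * partial_grad(x, i)  # move along coordinate i only
    return x

target = np.array([1.0, -2.0, 3.0])
grad_i = lambda x, i: -2.0 * (x[i] - target[i])  # partial derivative df/dx_i
x_hat = coordinate_ascent(grad_i, np.zeros(3))
```

In the fast scheme, each coordinate update additionally admits a closed-form maximizer (or a pruning decision), which is what makes it so much cheaper than a full EM sweep.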
Let us reinspect (17) in more detail. First we note that in (17) can be computed in closed form: since both and are Gaussian pdfs, is likewise Gaussian with mean zero and covariance matrix . Now, let us consider the dependency of (17) on a single parameter . We note that . Making use of the matrix inversion lemma [32], we express the inverse of as
(25) 
where . Following the GEM framework, we consider the terms in (17) depending on while keeping and fixed to their current estimates:
(26) 
The first summand contains the terms independent of . The terms depending on are encompassed in the second summand:
(27) 
with the definitions and . Note that the domain of definition of is . We seek to compute the stationary points of . To this end, we take its derivative and equate the result to zero. Doing so yields the cubic equation
(28) 
When ranges through , the equation has three solutions, which can be determined analytically. While a feasible solution for is constrained to be positive, a further analysis is needed to determine which of the roots of the cubic polynomial has to be selected as the solution. We immediately note that , while . In other words, depending on the sign of , diverges either to plus or to minus infinity as tends to zero. In the following we consider several different choices of .
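Numerically, the root-selection step amounts to computing the three roots of the cubic and discarding infeasible (complex or nonpositive) candidates. The coefficients below are placeholders chosen so the roots are 1, 2, 3; the paper's coefficients depend on quantities not reproduced in this extraction.

```python
import numpy as np

# Root selection for a cubic stationary-point equation like (28): compute all
# three roots and keep only the real, positive candidates, since the
# hyperparameter being optimized must be positive. Placeholder coefficients.
coeffs = [1.0, -6.0, 11.0, -6.0]                  # x^3 - 6x^2 + 11x - 6
roots = np.roots(coeffs)
feasible = sorted(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)
# Among the feasible candidates, the one maximizing the objective is then
# selected, following the case analysis in the text.
```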
III-B1 Stationary points of for
When , diverges to plus infinity at the origin and hence, has no global maximum. In this case we turn to the following proposition:
Proposition 1
Provided the function has at most a single local maximum.
Proof: See the Appendix. Proposition 1 states a very pleasing result, as it provides us with an easy routine for selecting : either we select the local maximizer of , provided it exists, or we prune the basis vector in by setting .
III-B2 Stationary points of for
In this case (28) simplifies to
(29) 
The two roots of the right-hand quadratic polynomial read
(30)  
(31) 
where and , with the equality holding if, and only if, . While is always negative, is positive if, and only if, . Moreover, from and we conclude that if is positive it corresponds to the global maximum of . Thus, the solution is obtained as
(32) 
The case corresponds to the hierarchical modeling of the Laplace prior for real weights: (32) coincides with the estimate of obtained in [19] for the Fast Laplace algorithm when and . Obviously, (32) can also be used for the complex case with , yet in this case the equivalent prior for is no longer Laplace, as we showed in Section II, but some other prior from the Bessel K family.
The case is also connected to the Fast RVM algorithm [18]. The selection and corresponds to a constant hyperprior pdf , so that the hyperparameter vector is estimated via maximization of the type-II likelihood function . Taking the derivative of (27) and setting the result to zero leads to the update expression
(33) 
used in [18]. It is interesting to note that (33) is independent of and thus is the same regardless of whether the signal model (1) is real or complex. The same holds for the RVM algorithm.
III-B3 Stationary points of for
Since diverges to minus infinity as and , and is continuous, it must have a global maximum. In this case, we make use of Proposition 2.
Proposition 2
Provided the function has a single stationary point and this stationary point corresponds to a maximum.
Proof: See the Appendix. Hence, the positive solution to (28) is the global maximizer of .
III-B4 Stationary points of for
This case is of limited practical interest since the resulting penalty term is no longer sparsity-inducing. We therefore omit the corresponding analysis of the stationary points.
III-C Realization of the Fast Sequential Inference Scheme
The fast inference scheme presented in Section III-B can be realized using different approaches. For instance, one can start out with the “full” dictionary and at each iteration of the algorithm remove a basis vector with a corresponding . This effectively reduces the model complexity “on the fly”. However, the first iterations still suffer from high computational complexity due to the update (20).
To avoid this, we follow the approach in [18], which consists in starting with an “empty” dictionary and incrementally filling it by adding one basis vector at each iteration of the algorithm. More specifically, we start with an empty dictionary and “test” each basis vector by computing each , , and add the one vector in with the corresponding that gives rise to the greatest increase in between two consecutive iterations. The estimator derived based on the 3L model updates according to (23). We then follow the add, delete, and re-estimate procedures for , , , and given in [18] ( and can be computed in a very efficient manner for all basis vectors). In the first update iterations, the estimate of is likely to be inaccurate, as contains only a few basis vectors. Therefore, the estimation of is enabled only after several update iterations have been performed. Until then, must be initialized to a fixed value.
IV Numerical Results
In this section we analyze the performance of the estimation algorithms proposed in Section III. The performance is evaluated by means of Monte Carlo simulations. Both real and complex signal models are considered in the investigations. We use a random dictionary matrix whose entries are independent and identically distributed (iid) zero-mean real (complex symmetric) Gaussian random variables with unit variance. A total of components in are nonzero, and their indices are uniformly drawn without repetition from the set . These indices together with are unknown to the algorithms. The nonzero components in
are iid and drawn from a zero-mean real (complex circular) Gaussian distribution with unit variance. All reported curves are computed based on a total of
Monte Carlo runs. For each such run, new realizations of the dictionary matrix , the vector , and the random perturbation vector are generated. To further simplify the analysis of the simulation results, we assume that the noise variance is known. The performance of the algorithms is evaluated based on the MSE of the estimator and the mean number of nonzero components in . These two performance metrics combined provide a strong indication of whether a given algorithm has properly inferred the support set of , defined as the set of indices of the nonzero components in this vector.

TABLE I: Parameter settings of the proposed algorithms

Legend    Model    Parameters
Fast2L    2L       , 
Fast2L    2L       , 
Fast3L    3L       , , 
We refer to the proposed estimation algorithms using the 2L and 3L prior models as Fast2L and Fast3L, respectively. In all simulations we select the free parameters of the prior models as summarized in Table I. According to Section II, affects the sparsity-inducing property of the hierarchical prior model and, correspondingly, the penalty term in (2). The value of the parameter is appended to the acronyms Fast2L and Fast3L. Note that the values of the rate parameters , , need to be specified for Fast2L. Our investigations show that for the performance of Fast2L becomes largely independent of the choice of in moderate to high SNR regimes; nonetheless, in the low SNR regime a large value of results in overly sparse estimates of . We select for Fast2L.
We compare the performance of Fast2L and Fast3L to the performance of the following stateoftheart sparse estimators:

EMRVM: the RVM algorithm in [7, 8] that uses the EM algorithm for parameter estimation assuming a constant hyperparameter prior.^{11}^{11}11The software is available online at http://dsp.ucsd.edu/~dwipf/.

FastRVM: the algorithm in [18, 29], a fast sequential version of EMRVM assuming a constant hyperparameter prior. (The software is available online at http://people.ee.duke.edu/~lcarin/BCS.html.)

FastLaplace: the algorithm in [19], an analogue of FastRVM derived from the hierarchical Laplace prior model for real weights, with an additional hyperprior for the rate parameter . (The software is available online at http://ivpl.eecs.northwestern.edu/.)

SpaRSA and ReSpaRSA: the sparse reconstruction by separable approximation (SpaRSA) algorithm for solving the LASSO cost function [33] (the software is available online at http://www.lx.it.pt/~mtf/SpaRSA/), and a modified version of it (ReSpaRSA) with the reweighted minimization realized using the framework proposed in [20].
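SpaRSA's separable approximation reduces each iteration to the same componentwise soft-thresholding step used by plain ISTA. The sketch below illustrates that proximal update for the LASSO objective 0.5·||y − Φw||² + τ·||w||₁; the fixed step size 1/L and all names are our simplifications — SpaRSA itself selects the step adaptively (e.g., with a Barzilai–Borwein rule), which this sketch does not reproduce.

```python
import numpy as np

def soft_threshold(x, t):
    """Componentwise soft-thresholding; works for real and complex entries."""
    mag = np.abs(x)
    return np.where(mag > t, (1 - t / np.maximum(mag, t)) * x, 0)

def ista(Phi, y, tau, n_iter=200):
    """Fixed-step ISTA for the LASSO objective 0.5*||y - Phi w||^2 + tau*||w||_1.

    Illustrative only: SpaRSA shares the separable prox step but chooses
    the step size adaptively rather than using the fixed 1/L below.
    """
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1], dtype=Phi.dtype)
    for _ in range(n_iter):
        grad = Phi.conj().T @ (Phi @ w - y)  # gradient of the quadratic fit term
        w = soft_threshold(w - grad / L, tau / L)
    return w
```

The reweighted variant would rerun such a solver with per-component weights on the penalty, updated from the previous estimate, in the spirit of the framework of [20].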
As a reference, we also consider the performance of the oracle estimator of [34], which “knows” the support of the weight vector, i.e., the positions of its nonzero entries. The oracle estimate reads

(34)  $\hat{\boldsymbol{w}}_{\text{oracle}} = \left(\boldsymbol{\Phi}_s^{\mathrm{H}} \boldsymbol{\Phi}_s\right)^{-1} \boldsymbol{\Phi}_s^{\mathrm{H}} \boldsymbol{y}$

where $\boldsymbol{\Phi}_s$ is the dictionary matrix constructed from those columns of the dictionary that correspond to the nonzero entries of the weight vector. The MSE of the oracle estimator for a fixed $\boldsymbol{\Phi}_s$ and noise precision $\lambda$ is $\lambda^{-1} \operatorname{tr}\{(\boldsymbol{\Phi}_s^{\mathrm{H}} \boldsymbol{\Phi}_s)^{-1}\}$.
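A minimal sketch of the oracle estimator: least squares restricted to the true support, together with its MSE for a fixed support. The function and argument names are illustrative, not the paper's notation.

```python
import numpy as np

def oracle_estimate(Phi, y, support, noise_precision):
    """Least-squares estimate restricted to the true support, plus its MSE.

    Assumed names: Phi is the dictionary, support holds the indices of
    the nonzero weights, noise_precision is the (known) inverse noise variance.
    """
    Phi_s = Phi[:, support]                     # columns matching the true support
    G = Phi_s.conj().T @ Phi_s                  # Gram matrix Phi_s^H Phi_s
    w_hat = np.zeros(Phi.shape[1], dtype=Phi.dtype)
    w_hat[support] = np.linalg.solve(G, Phi_s.conj().T @ y)
    # MSE for fixed Phi_s: tr{(Phi_s^H Phi_s)^{-1}} divided by the noise precision
    mse = float(np.trace(np.linalg.inv(G)).real) / noise_precision
    return w_hat, mse
```

In the noiseless case this recovers the weight vector exactly, which is why the oracle serves as the MSE reference curve in the figures.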
We expect the performance of an iterative SBL algorithm to be determined by two main factors: the underlying prior model and the inference scheme applied to it. FastRVM, FastLaplace, Fast2L, and Fast3L are all obtained by applying the same inference scheme to different hierarchical prior models. Hence, by comparing the performance of these algorithms we can draw conclusions about the sparsity-inducing properties of their respective prior models. (Naturally, the practical implementation of the inference scheme also impacts the performance; FastLaplace, Fast2L, and Fast3L are all implemented based on the software for FastRVM.) This comparison is carried out in Section IV-A. In Section IV-B, we benchmark our proposed algorithms against the remaining estimators in the above list.
IV-A Impact of the Hierarchical Prior Model on the Performance
To this end, we compare the performance of FastRVM, FastLaplace, Fast2L, and Fast3L. These estimators are all derived using the same inference scheme, but applied to different hierarchical models. FastRVM and FastLaplace can easily be adapted to complex scenarios. Note that FastLaplace does not make use of the Laplace prior in the complex case, as explained earlier. The iterative nature of the algorithms requires a stopping criterion. We use the same stopping criterion for all algorithms, based on the difference in between two consecutive iterations, as done in [29].
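A stopping rule of this kind can be sketched as below; the tolerance value and the max-difference metric are our choices and not necessarily those used in [29].

```python
import numpy as np

def has_converged(w_new, w_old, tol=1e-8):
    """Stop when consecutive iterates differ by less than tol (illustrative choice)."""
    return float(np.max(np.abs(w_new - w_old))) < tol
```

All four algorithms are run with the same rule so that their iteration counts, reported below, are directly comparable.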
We begin by evaluating the performance versus (i) the SNR per received signal component and (ii) the number of iterations used until the stopping criterion is met. The results, shown in Fig. 4, depict the sensitivity of the algorithms to measurement noise. In these investigations we set and ; the number of nonzero components in is set to .
Fast2L achieves nearly the performance of the oracle estimator in terms of MSE for both real and complex models. Furthermore, FastRVM and FastLaplace perform as well as Fast3L. We also note that in the high-SNR regime (SNR dB) the performances of all compared estimators are almost indistinguishable. In the moderate-SNR regime, however, the impact of the noise becomes significant, and Fast2L with small clearly outperforms the other algorithms. The performance results for Fast3L are not reported, since they almost coincide with those for Fast2L. This indicates that it is mainly the selection of the shape parameter in the second layer that dominates the performance of the algorithms, regardless of whether the 2L or the 3L prior model is used.
We now study the mean of versus the SNR. The corresponding results are summarized in Figs. 5(a) and 5(d). Clearly, Fast2L outperforms the other algorithms: is unbiased for SNR values as low as 15 dB in the complex case. FastRVM and FastLaplace produce a significantly biased estimate of in the low-to-moderate SNR regime; only at very high SNR values is the estimate unbiased.
In Figs. 5(b) and 5(e) we plot the mean of versus the algorithms’ iteration index for the SNR fixed at 30 dB. Observe that Fast2L converges in roughly 30–40 iterations for both and . In contrast, the other algorithms require from to more than update iterations until convergence is achieved. As shown in Figs. 5(c) and 5(f), the number of iterations varies with the SNR. Notice that Fast2L requires fewer iterations than any other considered algorithm over almost the whole evaluated SNR range. Furthermore, we observe that Fast3L and FastRVM converge faster than FastLaplace. Nonetheless, FastLaplace leads to sparser representations.

Next we evaluate the performance of the algorithms versus the number of available measurements and the value of for the SNR fixed at dB. In these investigations we have not included Fast3L, as the previously presented results show that it performs poorly compared to Fast2L() and Fast2L(). First, we fix and and vary from to samples. Then, for fixed and we vary from to . The corresponding performance results are depicted in Fig. 6. In Figs. 6(a), 6(b), 6(e), and 6(f) we plot the MSE and the mean of versus the number of measurements . Notice the characteristic threshold-like behavior of the MSE curves with respect to the number of observed measurement samples . The threshold occurs at a smaller for complex signal models than for real signal models. This is explained by the fact that in the complex case both the real and imaginary parts are used to prune components in , thus improving the MSE threshold. Furthermore, FastRVM and FastLaplace consistently produce a biased estimate of for both real and complex signal models; in contrast, Fast2L achieves an unbiased estimate of for sufficiently large values of .

In Figs. 6(c), 6(d), 6(g), and 6(h) we plot the MSE and the mean of versus . A similar threshold-like behavior can be observed as increases, with the threshold occurring at a higher value of for complex signal models, i.e., in complex scenarios less sparse signals can be estimated with higher accuracy. Analogously to the previously discussed results, Fast2L with small again outperforms the other algorithms. In particular, Fig. 6(h) shows that Fast2L achieves an unbiased estimate of for values as high as .
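The two quantities plotted throughout Figs. 5 and 6 can be computed per trial as follows; the zero-threshold used to count nonzero components is our own assumption, since in practice pruned components are exactly zero in some implementations and merely negligible in others.

```python
import numpy as np

def trial_metrics(w_true, w_hat, zero_tol=1e-6):
    """Per-trial MSE of the weight estimate and the estimated support size.

    These are averaged over the Monte Carlo runs to produce the reported
    MSE curves and the mean number of nonzero components.
    """
    mse = float(np.mean(np.abs(w_true - w_hat) ** 2))
    support_size = int(np.sum(np.abs(w_hat) > zero_tol))  # estimated ||w_hat||_0
    return mse, support_size
```

An estimator is “unbiased” in the sense used above when the mean of `support_size` over the runs matches the true number of nonzero components.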
IV-B Comparison with the Other Sparse Estimators

In the following, we compare Fast2L() with EMRVM, EMLaplace, SpaRSA, and a reweighted version of SpaRSA. The latter two estimators implement a proximal gradient method to optimize the LASSO objective function, with the reweighted version introducing additional parameter weighting to improve performance. Specifically, SpaRSA optimizes the LASSO objective function with a single regularization parameter that regulates the trade-off between the quadratic fit term