# Sparse Estimation using Bayesian Hierarchical Prior Modeling for Real and Complex Linear Models

In sparse Bayesian learning (SBL), Gaussian scale mixtures (GSMs) have been used to model sparsity-inducing priors that realize a class of concave penalty functions for the regression task in real-valued signal models. Motivated by the relative scarcity of formal tools for SBL in complex-valued models, this paper proposes a GSM model, the Bessel K model, that induces concave penalty functions for the estimation of complex sparse signals. The properties of the Bessel K model are analyzed when it is applied to Type I and Type II estimation. This analysis reveals that, by tuning the parameters of the mixing pdf, different penalty functions are invoked depending on the estimation type used, the value of the noise variance, and whether real or complex signals are estimated. Using the Bessel K model, we derive a sparse estimator based on a modification of the expectation-maximization algorithm formulated for Type II estimation. The estimator includes as special instances the algorithms proposed by Tipping and Faul [1] and by Babacan et al. [2]. Numerical results show the superiority of the proposed estimator over these state-of-the-art estimators in terms of convergence speed, sparseness, reconstruction error, and robustness in low and medium signal-to-noise ratio regimes.


## I Introduction

Compressive sensing and sparse signal representation have attracted the interest of an increasing number of researchers in recent years (see, e.g., [1, 2, 3, 4]), motivated by the widespread applicability that such techniques find in a large variety of engineering areas. Generally speaking, these disciplines consider the following signal model:

$$\mathbf{y} = \mathbf{H}\boldsymbol{\alpha} + \mathbf{w}. \tag{1}$$

In this expression, y is an M×1 vector of measurement samples and H is an M×L dictionary matrix with basis vectors h_l, l = 1, …, L, as its columns. The additive term w is an M×1 perturbation vector, which is assumed to be a white Gaussian random vector with mean equal to zero and covariance matrix λ⁻¹I, where λ denotes the noise precision parameter and I is the M×M identity matrix. The sparsity aspect is reflected by the fact that the entries of the unknown weight vector α are mostly zero. By obtaining a sparse estimate of α we can accurately represent y with a minimal number of basis vectors.

We coin the signal model (1) as either real, when y, H, and α are all real, or as complex, when y, H, and α are all complex. (One could also consider a mixed model where, e.g., H and y are complex but α is real; since applying our inference approach to this mixed model is straightforward, we discard it in this paper and focus on the two most relevant cases of real and complex signal models.) Historically, real signal models have dominated the research in sparse signal representation and compressive sampling. However, applications in which sparse estimation for complex signal models is sought are not uncommon in practice. An example is the estimation of multipath wireless channels [4, 5, 6]. The extension from sparse signal representation in real signal models to complex models is not always straightforward, as we will discuss in this paper.
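As a concrete illustration of the two model classes, the following sketch generates a complex instance of (1). The dimensions M, L, the sparsity level K, and the noise precision λ are arbitrary illustrative choices, not values taken from the paper; the real case is obtained by dropping the imaginary parts.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, K = 64, 128, 8       # measurements, dictionary size, nonzero weights (arbitrary)

# Complex dictionary H with unit-norm columns
H = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
H /= np.linalg.norm(H, axis=0)

# K-sparse complex weight vector alpha
alpha = np.zeros(L, dtype=complex)
support = rng.choice(L, size=K, replace=False)
alpha[support] = rng.standard_normal(K) + 1j * rng.standard_normal(K)

# Circularly symmetric white Gaussian noise with precision lam
lam = 100.0
w = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) * np.sqrt(0.5 / lam)

y = H @ alpha + w          # measurement model (1)
```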

Sparse Bayesian learning (SBL) [7, 8, 3] applied to signal model (1) aims at finding a sparse maximum a posteriori (MAP) estimate of α:

$$\hat{\boldsymbol{\alpha}}_{\mathrm{MAP}}(\mathbf{y}) = \operatorname*{arg\,min}_{\boldsymbol{\alpha}} \left\{ \rho\,\|\mathbf{y}-\mathbf{H}\boldsymbol{\alpha}\|_2^2 + \lambda^{-1} Q(\boldsymbol{\alpha}) \right\}. \tag{2}$$

In this expression, ‖·‖₂ denotes the ℓ₂ norm and the parameter ρ takes the value 1/2 when the signal model (1) is real and 1 when it is complex. The estimate minimizes a weighted sum of two terms: the first term represents the squared error of the reconstructed signal, while the second term is a penalization term designed to enforce a sparse estimate of the weight vector α. (Here Q(α) denotes the penalty −log p(α) up to an arbitrary additive constant; we also write ∝ to denote equality up to a positive multiplicative constant.)

In consequence, a variety of different estimators can be obtained by modifying the prior probability density function (pdf) p(α). Each of these estimators represents a different compromise between accuracy of the reconstructed signal and sparseness of the estimate. Instead of working directly with the prior pdf p(α), SBL typically uses a two-layer (2-L) hierarchical prior model that involves a conditional prior pdf p(α|γ) and a hyperprior pdf p(γ). The goal is then to select these pdfs in such a way that we can construct efficient yet computationally tractable iterative algorithms that estimate both the hyperparameter vector γ and the weight vector α, with the latter estimate being sparse.

The 2-L prior model can be well understood from the perspective of a Gaussian scale mixture (GSM) model [9, 10, 11, 12], where the entries in α are modeled as independent GSMs [13]. Specifically, each weight α_l is modeled as the product of a standard Gaussian random variable and the square root of a nonnegative random scaling factor γ_l, also known as the mixing variable and described by its mixing density p(γ_l). (In this configuration, γ_l can be seen as the variance of α_l.) By choosing appropriate mixing densities, various sparsity-inducing penalty terms in (2) can be realized.
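The GSM construction is easy to check by simulation. The sketch below (real case, ρ = 1/2) draws the mixing variables from a gamma pdf with unit shape parameter; the resulting marginal of α_l is then Laplace (cf. (8) below), whose mean absolute value is 1/√(2η). The sample size and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, eta = 200_000, 1.0, 2.0

# Mixing variable: gamma pdf with shape eps and rate eta
gamma = rng.gamma(shape=eps, scale=1.0 / eta, size=n)

# GSM: alpha_l = sqrt(gamma_l) * u_l with u_l standard Gaussian (real case)
alpha = np.sqrt(gamma) * rng.standard_normal(n)

# For eps = 1 the marginal is Laplace, so E|alpha| = 1 / sqrt(2 * eta) = 0.5 here
print(np.mean(np.abs(alpha)))
```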

One approach illustrating the use of the GSM model for SBL is the Relevance Vector Machine (RVM) [7], where the mixing densities are identical and equal to an inverse gamma pdf. (In the original formulation of the RVM algorithm in [7] the hyperparameter models the precision, i.e., the inverse variance, of the conditional Gaussian prior pdf and the hyperprior pdf is a gamma pdf; the model has been reformulated here in order to facilitate the comparison with the framework adopted in this paper.) This choice leads to the marginal prior pdf p(α) being a product of Student-t pdfs. An expectation-maximization (EM) algorithm is then formulated to estimate the hyperparameter vector γ; as the estimate γ̂_l decreases, it drives the corresponding weight estimate towards zero, thus encouraging a solution with only a few nonzero coefficients in α. Another approach, presented in [14], realizes two popular sparsity-inducing penalization terms: the ℓ₁ norm and the log-sum. This is achieved by selecting identical mixing densities equal to, respectively, an exponential pdf and the density of the noninformative Jeffreys prior. In the former case, the marginal prior pdf is the product of Laplace pdfs and the corresponding MAP estimator (2) solves the ℓ₁-norm constrained minimization problem. The same problem is also solved by the well-known Least Absolute Shrinkage and Selection Operator (LASSO) [15] and Basis Pursuit Denoising [16]. Similarly to the RVM, an EM algorithm is used in [14] to estimate the hyperparameter vector γ. The resulting iterative algorithms are equivalent to special cases of the FOcal Underdetermined System Solver (FOCUSS) algorithm [17].

A significant limitation of the sparse estimators in [7, 14] is that the EM algorithms they use are known to suffer from high computational complexity and slow convergence [18]. Specifically, many iterations of the algorithm are needed before the entries of γ̂ become small enough to disregard the corresponding basis vectors. To circumvent this shortcoming, a fast inference scheme is proposed in [18] for the RVM and later applied in [19]. The Fast Laplace algorithm [19] is derived based on an augmented probabilistic model obtained by adding a third layer to the hierarchical structure of the Laplace pdf; this third layer introduces a hyper-hyperprior for the rate parameter of the exponential pdf. The algorithm in [19] can be seen as the Bayesian version of the re-weighted LASSO algorithm [20]. A similar three-layer (3-L) prior model leading to an adaptive version of the LASSO estimator is also proposed in [21].

Even though the fast algorithms in [18] and [19] converge faster than their EM counterparts, they still suffer from slow convergence, especially in low and moderate signal-to-noise ratio (SNR) regimes, as we will demonstrate in this paper. Furthermore, in these regimes the algorithms significantly overestimate the number of nonzero weights and therefore fail to produce sparse representations. We will show that this behavior results from the selection of the hierarchical prior models.

Additionally, though complex GSM models have been proposed [22, 23], they have not been extensively studied for SBL in the literature. An example illustrating this fact is the hierarchical modeling of the ℓ₁ penalty term. While it is well known that this penalty term results from selecting an exponential mixing density in real models, the same density does not induce this penalty term in complex models. Yet, to the best of our knowledge, the complex GSM model realizing the ℓ₁ penalty term has not been established in the literature. Motivated by the relative scarcity of formal tools for sparse learning in complex models and inspired by the recent developments of sparse Bayesian algorithms, we present a sparse Bayesian approach that applies to both real and complex signal models.

We first present a 2-L hierarchical prior model inspired by the GSM model for both real and complex weights. By selecting each mixing density p(γ_l), l = 1, …, L, to be a gamma pdf, a Bessel K pdf is obtained for the marginal prior of each of the entries in α [10, 11, 12]. The Bessel K pdf not only encompasses the modeling of the Laplace pdf as a special case (the Bessel K pdf is in turn a special case of the even larger class of generalized hyperbolic distributions [10], obtained when the mixing density is a generalized inverse Gaussian pdf), but also introduces new degrees of freedom that allow for novel priors to be used in SBL. In the particular case where the dictionary matrix H is orthonormal, we demonstrate that the MAP estimator is a generalization of the soft-thresholding rule, with the degree of sparseness depending on the selection of the shape parameter ε of the gamma pdf. Additionally, we show that the prior model has a strong connection to the Bayesian formalism of the group LASSO [21, 24]. A 3-L hierarchical model is also investigated that results from an extension of the 2-L model. For orthonormal H the MAP estimator is, in this case, a generalization of the hard-thresholding rule, with the shape parameter ε determining again the sparseness property of the estimator. Finally, we apply the fast inference scheme in [18] to the 2-L and 3-L hierarchical structures to design sparse estimators for α. As these structures encompass those used in [18] and [19], the iterative algorithms derived in these publications can be seen as instances of our estimation algorithms. We provide numerical results that reveal the superior performance of our estimators with respect to convergence speed, sparseness, and mean-squared error (MSE). Most remarkably, the 2-L prior model leads to an estimator that outperforms the state-of-the-art Bayesian and non-Bayesian sparse estimators.

The remainder of this paper is organized as follows. In Section II, we detail the 2-L and 3-L hierarchical prior models for real and complex weights and analyze their sparsity-inducing properties. Two fast Bayesian iterative algorithms based on the 2-L and 3-L models are proposed in Section III, where we also outline their relationship to the algorithms in [18] and [19]. A performance comparison of our two algorithms with related state-of-the-art sparse estimators obtained via Monte Carlo simulations is presented in Section IV. Finally, we conclude the paper in Section V.

## II Bayesian Hierarchical Modeling of Sparsity Constraints

We begin with the specification of the probabilistic structure of the SBL problem for signal model (1). Two types of hierarchical prior models for the weight vector α are considered: a 2-L and a 3-L hierarchical model. Later we will see that these models lead to priors for α with distinct sparsity-inducing properties.

The joint pdf of the signal model (1) augmented with the 2-L prior model for α reads

$$p(\mathbf{y},\boldsymbol{\alpha},\boldsymbol{\gamma},\lambda) = p(\mathbf{y}|\boldsymbol{\alpha},\lambda)\,p(\lambda)\,p(\boldsymbol{\alpha}|\boldsymbol{\gamma})\,p(\boldsymbol{\gamma}). \tag{3}$$

From (1), p(y|α,λ) is Gaussian: a real multivariate Gaussian pdf with mean Hα and covariance λ⁻¹I if the signal model is real, and the corresponding complex multivariate Gaussian pdf if it is complex. The sparsity-inducing properties of this Bayesian model are determined by the joint prior pdf p(α|γ)p(γ). Motivated by GSM modeling as well as previous works on SBL [7, 14, 19], we select the conditional prior pdf p(α|γ) to factorize into a product of Gaussian pdfs: p(α|γ) = ∏_{l=1}^{L} p(α_l|γ_l). For the sake of using a single notation for p(α_l|γ_l) for both real and complex weights, we introduce the parameterization

$$p(\alpha_l|\gamma_l) = \left(\frac{\rho}{\pi\gamma_l}\right)^{\!\rho} \exp\!\left(-\rho\,\frac{|\alpha_l|^2}{\gamma_l}\right). \tag{4}$$

The conditional prior pdf for real α is realized by selecting ρ = 1/2, while ρ = 1 entails the conditional prior pdf for complex α.

The hyperprior pdf p(γ) of the 2-L prior model is selected to be some suitable pdf defined on [0, ∞)^L with fixed parameters. An extension of the 2-L hierarchy with an additional layer, where the entries in the parameter vector η of p(γ) are also modeled as random variables, has been considered in [21, 19]. Doing so yields the corresponding 3-L prior model with the joint pdf specified as follows:

$$p(\mathbf{y},\boldsymbol{\alpha},\boldsymbol{\gamma},\boldsymbol{\eta},\lambda) = p(\mathbf{y}|\boldsymbol{\alpha},\lambda)\,p(\lambda)\,p(\boldsymbol{\alpha}|\boldsymbol{\gamma})\,p(\boldsymbol{\gamma}|\boldsymbol{\eta})\,p(\boldsymbol{\eta}), \tag{5}$$

where p(η) is the pdf of the "hyper-hyperprior".

In the sequel we analyze the 2-L and 3-L prior models in more detail by computing the prior pdfs of the weight vector α obtained by marginalizing the joint prior pdfs p(α|γ)p(γ) and p(α|γ)p(γ|η)p(η) with respect to the hyperparameters γ and (γ, η) respectively. We then discuss the sparsity-inducing properties of these marginalized priors in the cases where the weights are real and complex. This includes the modeling of the ℓ₁ penalty term as a special case.

### II-A Two-Layer Hierarchical Prior Model

We begin with the hierarchical model of the Bessel K pdf. Specifically, p(γ) is selected as a product of gamma pdfs with individual rate parameters, i.e., p(γ; ε, η) = ∏_{l=1}^{L} Ga(γ_l; ε, η_l) with η ≜ [η_1, …, η_L]^T. (Ga(·; a, b) denotes a gamma pdf with shape parameter a and rate parameter b.) Observe that while the prior pdfs p(γ_l; ε, η_l), l = 1, …, L, are assigned individual rate parameters η_l, they share the same shape parameter ε. The pdf of the marginal prior of α is computed to be

$$p(\boldsymbol{\alpha};\epsilon,\boldsymbol{\eta}) = \int_0^\infty p(\boldsymbol{\alpha}|\boldsymbol{\gamma})\,p(\boldsymbol{\gamma};\epsilon,\boldsymbol{\eta})\,\mathrm{d}\boldsymbol{\gamma} = \prod_{l=1}^{L} p(\alpha_l;\epsilon,\eta_l) \tag{6}$$

with

$$p(\alpha_l;\epsilon,\eta_l) = \frac{2\,\rho^{\frac{\epsilon+\rho}{2}}}{\pi^{\rho}\,\Gamma(\epsilon)}\,\eta_l^{\frac{\epsilon+\rho}{2}}\,|\alpha_l|^{\epsilon-\rho}\,K_{\epsilon-\rho}\!\left(2\sqrt{\rho\eta_l}\,|\alpha_l|\right). \tag{7}$$

In this expression, K_ν(·) is the modified Bessel function of the second kind and order ν = ε − ρ. Note that (7) is valid for both real (ρ = 1/2) and complex (ρ = 1) weight vectors α. Let us study (7) for these two cases in more detail.

#### II-A1 Real weights

With the selection ρ = 1/2, (7) coincides with the Bessel K pdf [10, 11]. This pdf has been used for, e.g., modeling image probabilities [25, 26]. Moreover, selecting ε = 1 and making use of the identity K_{1/2}(z) = √(π/(2z)) e^{−z} [27], (7) simplifies to the Laplace pdf

$$p(\alpha_l;\epsilon=1,\eta_l) = \sqrt{\frac{\eta_l}{2}}\,\exp\!\left(-\sqrt{2\eta_l}\,|\alpha_l|\right), \qquad \alpha_l \in \mathbb{R}. \tag{8}$$

The Laplace distribution for real weights is thereby obtained from a GSM model with an exponential mixing density [9].

#### II-A2 Complex weights

We now consider (7) for the case when the weight vector α is complex, i.e., ρ = 1. In contrast to the previous case, selecting ε = 1 no longer leads to the Laplace pdf. Instead, a gamma pdf with shape parameter ε = 3/2 has to be used as the mixing density. With this choice, the modified Bessel function in (7) has order 1/2 and

$$p(\alpha_l;\epsilon=3/2,\eta_l) = \frac{2\eta_l}{\pi}\,\exp\!\left(-2\sqrt{\eta_l}\,|\alpha_l|\right), \qquad \alpha_l \in \mathbb{C}, \tag{9}$$

which is the Laplace pdf in the complex domain. Note that the choice ε = 3/2 has some connection with the group LASSO and its Bayesian interpretation [21, 24], where sparsity is enforced simultaneously over a group of variables. In the Bayesian interpretation of the group LASSO, a gamma pdf whose shape parameter grows with the group size is employed to model the prior for the group weights; naturally, if we let a group consist of the real and imaginary parts of a complex variable, the resulting shape parameter is precisely ε = 3/2.

To conclude, the 2-L prior model realizes the ℓ₁ penalty term with ε = 1 for real α and with ε = 3/2 for complex α. With these settings, the argument of the minimization in (2) particularizes to the LASSO cost function.

Let us stress that (7) represents a family of prior pdfs for α parameterized by ε and η. In Fig. 1, we visualize the restriction to [0, ∞) of the prior pdf in (7) for various values of ε. (Given a function f defined on a set A, the restriction of f to a subset of A is the function defined on that subset that coincides with f there.) Observe the change of the shape of the pdf with ε: the smaller the value of ε is, the more rapidly the pdf decays around the origin. Moreover, with the selection ε = 0 the mixing density becomes improper, i.e., its integral is no longer finite. This in turn leads to an improper prior density for α_l which has a light tail and a pole at the origin. This prior realizes a strong sparsity-inducing nature, as we demonstrate next.

Naturally, the 2-L prior model can be used with arbitrary values of ε, leading to the general optimization problem (2) with penalty term

$$Q(\boldsymbol{\alpha};\epsilon,\boldsymbol{\eta}) = -\sum_{l=1}^{L} \log\!\left(|\alpha_l|^{\epsilon-\rho}\,K_{\epsilon-\rho}\!\left(2\sqrt{\rho\eta_l}\,|\alpha_l|\right)\right). \tag{10}$$

To illustrate this, consider Fig. 2(a), where the contour lines of the restriction to ℝ² of the prior corresponding to (10) are plotted. Each contour line is computed for a specific choice of ε. It can be seen that as ε approaches zero, more probability mass concentrates along the axes; as a consequence, the mode of the resulting posterior pdf is more likely to be found close to the axes, thus indicating a sparse solution. The behavior of the classical ℓ₁ penalty term obtained for ε = 1 can also be clearly recognized.

In order to get insight into the impact of ε on the MAP estimator (2) with penalty term (10), we follow the approach in [14] and consider the case when H is orthonormal, i.e., when h_k^H h_l = δ_{kl}, k, l = 1, …, L, where δ_{kl} is the Kronecker delta. In this case the optimization (2) decouples into L independent scalar optimization problems. Furthermore, when the 2-L prior model realizes an ℓ₁ penalty term, i.e., the pdf (8) (real case) or (9) (complex case) is selected, the MAP estimator coincides with the soft-thresholding rule

$$\hat{\alpha}_{l,\mathrm{MAP}}(\mathbf{y}) = \operatorname{sign}(\mathbf{h}_l^{\mathrm{H}}\mathbf{y})\,\max\!\left\{0,\;|\mathbf{h}_l^{\mathrm{H}}\mathbf{y}| - \lambda^{-1}\sqrt{\eta_l/\rho}\right\}, \qquad l = 1,\ldots,L, \tag{11}$$

where sign(·) is the sign function. For an arbitrary value of ε, we cannot compute the MAP estimate in closed form. Therefore, we make use of the EM algorithm as implemented in [14] to approximate the MAP estimator. Figure 2(b) visualizes the estimation rule for different values of ε. Note that as ε decreases, more components of the estimate are pulled towards zero since the threshold value increases, thus encouraging a sparser solution.
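For an orthonormal dictionary, rule (11) is straightforward to implement. Below is a minimal sketch; the function name and interface are ours, and sign(z) = z/|z| covers both the real and the complex case.

```python
import numpy as np

def soft_threshold_map(y, H, lam, eta, rho):
    """Soft-thresholding rule (11) for an orthonormal dictionary H.

    rho = 0.5 for a real model, 1.0 for a complex one; eta holds the
    rate parameters eta_l and lam is the noise precision."""
    z = H.conj().T @ y                           # decoupled statistics h_l^H y
    thresh = np.sqrt(np.asarray(eta) / rho) / lam
    mag = np.maximum(np.abs(z) - thresh, 0.0)    # shrink magnitude towards zero
    absz = np.abs(z)
    sgn = np.divide(z, absz, out=np.zeros_like(z), where=absz > 0)
    return sgn * mag
```

With H = I (trivially orthonormal), λ = 1, η_l = 1, and ρ = 1/2 the threshold is √2, so an input component of 3.0 shrinks to 3 − √2 while components below the threshold are set to zero.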

### II-B Three-Layer Hierarchical Prior Model

We now turn to the SBL problem with a 3-L prior model for α, which leads to the joint pdf (5) of the augmented signal model. In contrast with the 2-L prior model, we consider the entries in η as random and assume that they are independently gamma distributed, η_l ∼ Ga(·; a_l, b_l), l = 1, …, L.

We now compute the prior pdf of α by marginalizing p(α|γ)p(γ|η)p(η) with respect to γ and η. First, we note that p(γ|η) = ∏_{l=1}^{L} Ga(γ_l; ε, η_l) and marginalize this pdf over η. Defining a ≜ [a_1, …, a_L]^T and b ≜ [b_1, …, b_L]^T we obtain

$$p(\boldsymbol{\gamma};\epsilon,\mathbf{a},\mathbf{b}) = \prod_{l=1}^{L} \int_0^\infty p(\gamma_l|\eta_l;\epsilon)\,p(\eta_l;a_l,b_l)\,\mathrm{d}\eta_l = \prod_{l=1}^{L} p(\gamma_l;\epsilon,a_l,b_l) \tag{12}$$

with

$$p(\gamma_l;\epsilon,a_l,b_l) = \frac{b_l^{a_l}\,\Gamma(\epsilon+a_l)}{\Gamma(\epsilon)\,\Gamma(a_l)}\,\gamma_l^{\epsilon-1}\,(\gamma_l+b_l)^{-(\epsilon+a_l)}. \tag{13}$$

Finally, marginalizing over γ yields

$$p(\boldsymbol{\alpha};\epsilon,\mathbf{a},\mathbf{b}) = \prod_{l=1}^{L} p(\alpha_l;\epsilon,a_l,b_l) \tag{14}$$

with

$$p(\alpha_l;\epsilon,a_l,b_l) = \int_0^\infty p(\alpha_l|\gamma_l)\,p(\gamma_l)\,\mathrm{d}\gamma_l = \left(\frac{\rho}{\pi b_l}\right)^{\!\rho} \frac{\Gamma(\epsilon+a_l)\,\Gamma(a_l+\rho)}{\Gamma(\epsilon)\,\Gamma(a_l)} \left(\frac{\rho|\alpha_l|^2}{b_l}\right)^{\!\epsilon-\rho} U\!\left(\epsilon+a_l;\,\epsilon-\rho+1;\,\frac{\rho|\alpha_l|^2}{b_l}\right). \tag{15}$$

In this expression, U(·;·;·) is the confluent hypergeometric function [27]. In Fig. 3(a), we depict the contour lines of the restriction to ℝ² of the prior pdf (15).

Observe that although the contours are qualitatively similar to those shown in Fig. 2(a) for the 2-L model, the corresponding estimation rules depicted in Fig. 3(b) differ from those shown in Fig. 2(b): the 3-L prior model leads to a generalization of the hard-thresholding rule.

References [21] and [19] also propose a 3-L prior model to derive a Bayesian adaptive version of the LASSO estimator, for the hierarchical adaptive LASSO and for the Fast Laplace algorithm respectively. There is, however, one important difference between these approaches and the one advocated in our work: the 2-L and 3-L prior models presented here are not constrained to realize the ℓ₁ penalty term, but can be used with arbitrary values of ε. It is precisely this distinction that makes these hierarchical prior models lead to estimators that outperform classical ℓ₁-penalized sparse estimators, as our numerical evaluation in Section IV will reveal.
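As a sanity check on the marginal hyperprior (13), the pdf obtained by integrating the gamma rate η_l out of Ga(γ_l; ε, η_l) must itself be proper. A short numerical verification (the parameter values are arbitrary):

```python
import math
import numpy as np

def p_gamma_marginal(g, eps, a, b):
    """Marginal hyperprior (13): pdf of gamma_l after eta_l is integrated out."""
    c = b**a * math.gamma(eps + a) / (math.gamma(eps) * math.gamma(a))
    return c * g ** (eps - 1.0) * (g + b) ** (-(eps + a))

# Trapezoidal integration on a log-spaced grid approximating (0, inf)
g = np.logspace(-8, 7, 40001)
p = p_gamma_marginal(g, eps=1.5, a=1.0, b=2.0)
total = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(g))
print(total)   # ≈ 1.0, confirming (13) is a proper pdf
```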

### II-C The Jeffreys Noninformative Prior

Note that the selection a_l = b_l = 0 in (13) entails the density of the noninformative Jeffreys prior for γ_l, p(γ_l) ∝ γ_l⁻¹, and thereby an improper marginal prior density for α_l. The same holds for the 2-L prior model with the selection ε = η_l = 0. In other words, when the highest hierarchy layer in the prior formulation is chosen to be noninformative, the log-sum penalization is invoked. This penalty term has gained much interest in the literature, including [7, 8, 14], as it is known to strongly promote sparse estimates. We should also add that the re-weighting scheme in [20] has been motivated from the log-sum penalty term as well.

## III Sparse Bayesian Inference

In this section we present the Bayesian inference scheme that relies on the 2-L and 3-L prior models presented in Section II. First, we exploit the EM framework to derive an iterative algorithm that estimates the unknown model parameters of the 3-L prior model. Then, a fast algorithm with low computational complexity inspired by [18] is presented. The algorithm based on the 2-L model can easily be obtained by treating η as a fixed parameter.

### III-A Sparse Bayesian Inference using EM

Direct computation of the estimate in (2) is not always a straightforward task. In particular, for small ε the penalty term (see, e.g., (10)) loses convexity and the calculation of the minimum of the objective function in (2) becomes impractically complex. The hierarchical structure of the 2-L and 3-L prior models instead provides an efficient yet computationally tractable approach to compute an estimate of α. This is achieved at the expense of introducing the additional hyperparameters γ and η, which have to be estimated alongside α. The sparse Bayesian inference scheme previously used for SBL (see [7, 18, 8, 29, 19]) can also be adopted for our prior models. Its main steps are summarized below for the 3-L prior model; the reader is referred to the above cited works for more information. The inference scheme exploits the particular factorization of the posterior pdf of the model parameters:

$$p(\boldsymbol{\alpha},\boldsymbol{\gamma},\boldsymbol{\eta},\lambda|\mathbf{y}) = p(\boldsymbol{\alpha}|\mathbf{y},\boldsymbol{\gamma},\lambda)\,p(\boldsymbol{\gamma},\boldsymbol{\eta},\lambda|\mathbf{y}). \tag{16}$$

The joint maximization of the left-hand side of (16) is not feasible. Instead, a two-step approach is adopted: a first step computes the maximizer of the second factor, followed by a second step which calculates the maximizer of the first factor with (γ, η, λ) set to the values already obtained in the first step. Note that the first step computes the MAP estimate of (γ, η, λ), while the second step calculates the conditional MAP estimate of α given (γ, η, λ) set to its MAP estimate. Unfortunately, the first step is computationally prohibitive, so it is replaced in the Bayesian inference scheme by an EM algorithm that sequentially computes an approximation of the MAP estimate of (γ, η, λ). The virtue of this modified scheme is that the conditional MAP estimate of α computed in the second step is readily a byproduct obtained in the E-step of the EM algorithm. In our particular application context, we update the estimates of the components γ, η, and λ sequentially, instead of jointly. The generalized EM (GEM) framework justifies this modification.

The EM algorithm aims at approximating the MAP estimate of (γ, η, λ), i.e., the maximizer of

$$\mathcal{L}(\boldsymbol{\gamma},\boldsymbol{\eta},\lambda) = \log p(\mathbf{y},\boldsymbol{\gamma},\boldsymbol{\eta},\lambda) = \log\!\big(p(\mathbf{y}|\boldsymbol{\gamma},\lambda)\,p(\boldsymbol{\gamma}|\boldsymbol{\eta})\,p(\boldsymbol{\eta})\,p(\lambda)\big) \tag{17}$$

by exploiting the fact that (y, α) is complete data for (γ, η, λ):

$$p(\mathbf{y},\boldsymbol{\gamma},\boldsymbol{\eta},\lambda) = \int p(\mathbf{y},\boldsymbol{\alpha},\boldsymbol{\gamma},\boldsymbol{\eta},\lambda)\,\mathrm{d}\boldsymbol{\alpha}. \tag{18}$$

The E-step of the EM algorithm computes the conditional expectation

$$\big\langle \log p(\mathbf{y},\boldsymbol{\alpha},\boldsymbol{\gamma},\boldsymbol{\eta},\lambda) \big\rangle \tag{19}$$

with respect to the conditional pdf p(α|y, γ̂, λ̂), which is a real or a complex multivariate Gaussian pdf depending on whether the underlying signal model is real or complex. Here, γ̂, η̂, and λ̂ denote the current estimates computed at a previous iteration. (To keep notation simple we avoid the use of an iteration index in the formulation of the EM algorithm: in the updating equations, the estimates on the left-hand side are the new updates, while those on the right-hand side are the current estimates.) In either case, the parameters of the conditional pdf of α read

$$\hat{\boldsymbol{\Sigma}} = \big(\hat{\lambda}\,\mathbf{H}^{\mathrm{H}}\mathbf{H} + \hat{\boldsymbol{\Gamma}}^{-1}\big)^{-1}, \tag{20}$$

$$\hat{\boldsymbol{\mu}} = \hat{\lambda}\,\hat{\boldsymbol{\Sigma}}\,\mathbf{H}^{\mathrm{H}}\mathbf{y}, \tag{21}$$

with Γ̂ ≜ diag(γ̂₁, …, γ̂_L). The M-step of the EM algorithm updates the estimate of (γ, η, λ) by equating it with the maximizer of (19). We can invoke the GEM framework to replace this step by a step updating the three components γ, η, and λ sequentially, each with the remaining components fixed at their current estimates. Doing so we obtain

$$\hat{\gamma}_l = \frac{\epsilon-\rho-1 + \sqrt{(\epsilon-\rho-1)^2 + 4\rho\,\hat{\eta}_l\,\langle|\alpha_l|^2\rangle}}{2\hat{\eta}_l}, \qquad l = 1,\ldots,L, \tag{22}$$

$$\hat{\eta}_l = \frac{\epsilon + a_l - 1}{\hat{\gamma}_l + b_l}, \qquad l = 1,\ldots,L, \tag{23}$$

$$\hat{\lambda} = \frac{M}{\big\langle\|\mathbf{y}-\mathbf{H}\boldsymbol{\alpha}\|_2^2\big\rangle}, \tag{24}$$

where ⟨|α_l|²⟩ = |μ̂_l|² + [Σ̂]_{ll} and ⟨‖y − Hα‖₂²⟩ = ‖y − Hμ̂‖₂² + tr(HΣ̂H^H).

In accordance with [8, 14], we have selected a constant (improper) prior for λ, i.e., p(λ) ∝ 1. This choice is motivated by the fact that the influence of the shape of the conjugate gamma pdf becomes small for a large M.

In this formulation of the EM algorithm, we have chosen α as the hidden variable. Another approach, followed in [14], consists in letting γ be the hidden variable and including the estimate of α in the M-step. However, our numerical results indicate that the resulting EM algorithm leads to estimators that may exhibit an unstable behavior when small values of the parameter ε are selected. These estimators consistently converge to a local minimum and often fail to produce sparse estimates of α. We have observed the same convergence behavior when this EM algorithm is applied to the probabilistic model using the noninformative Jeffreys prior for γ_l, l = 1, …, L, also proposed in [14]. As previously mentioned, this estimator coincides with the FOCUSS algorithm for a particular setting of its parameters [17] (see [8] for a thorough numerical analysis of different choices of this setting). Finally, another approach to estimate the model parameters is to use variational Bayesian inference techniques, see, e.g., [30, 31, 6]. This approach has not been followed in this paper.
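The E-step (20)-(21) and the sequential M-step (22)-(24) can be sketched in a few lines. The snippet below is a sketch for the real signal model (ρ = 1/2); the hyperparameter defaults, the iteration count, and the small clipping of γ̂ (to avoid division by exact zeros) are our own illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def gem_sbl(y, H, eps=0.5, a=1.0, b=1.0, lam=None, rho=0.5, n_iter=200):
    """GEM iteration for the 3-L model: E-step (20)-(21), M-step (22)-(24)."""
    M, L = H.shape
    gamma = np.ones(L)
    eta = np.ones(L)
    lam_hat = 1.0 if lam is None else lam
    for _ in range(n_iter):
        # E-step: posterior covariance (20) and mean (21) of alpha
        Sigma = np.linalg.inv(lam_hat * H.T @ H + np.diag(1.0 / gamma))
        mu = lam_hat * Sigma @ H.T @ y
        e2 = mu**2 + np.diag(Sigma)              # <|alpha_l|^2>
        # M-step (GEM): sequential updates (22), (23), (24)
        c = eps - rho - 1.0
        gamma = (c + np.sqrt(c**2 + 4.0 * rho * eta * e2)) / (2.0 * eta)
        gamma = np.maximum(gamma, 1e-12)         # guard against exact zeros
        eta = (eps + a - 1.0) / (gamma + b)
        if lam is None:                          # noise precision update (24)
            resid = np.sum((y - H @ mu) ** 2) + np.trace(H @ Sigma @ H.T)
            lam_hat = M / resid
    return mu, gamma, lam_hat
```

As noted above, the EM iteration converges slowly and the entries of γ̂ never become exactly zero; the fast scheme of the next subsection addresses both issues.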

### III-B Fast Sequential Sparse Inference Scheme

Although the EM algorithm derived in the previous subsection yields simple expressions for the parameter updates, it suffers from two major drawbacks. First, the algorithm converges rather slowly, i.e., many iterations are needed before the entries of γ̂ corresponding to the zero-valued weights become small enough to remove the corresponding basis vectors, which would reduce the computational complexity of the update (20). Second, unless the noise vanishes, the entries of γ̂ are never exactly zero; they take very small, yet nonzero, values, and thus an additional, usually empirical, thresholding procedure is needed to remove components with small weights in the inference procedure.

In order to circumvent these drawbacks, we follow the framework proposed in [18]: rather than estimating the entries of γ simultaneously for all components in the dictionary H, the objective function (17) is maximized sequentially with respect to the parameter γ_l of a single basis vector at a time, while keeping the parameters of the other basis vectors fixed. Thus, the parameters of all components are updated sequentially, one after another. This approach leads to a much more efficient inference scheme that accelerates convergence compared to EM. It also provides conditions for determining when a component has to be removed, thus eliminating the need for a pruning threshold.

Let us reinspect (17) in more detail. First we note that p(y|γ, λ) in (17) can be computed in closed form. Since both p(y|α, λ) and p(α|γ) are Gaussian pdfs, p(y|γ, λ) is likewise Gaussian with mean zero and covariance matrix C ≜ λ⁻¹I + HΓH^H. Now, let us consider the dependency of (17) on a single parameter γ_l. We note that C = C_{−l} + γ_l h_l h_l^H, where C_{−l} ≜ λ⁻¹I + ∑_{k≠l} γ_k h_k h_k^H is the covariance matrix with the contribution of the l-th basis vector removed. Making use of the matrix inversion lemma [32] we express the inverse of C as

$$\mathbf{C}^{-1} = \mathbf{C}_{-l}^{-1} - \frac{\mathbf{C}_{-l}^{-1}\mathbf{h}_l\mathbf{h}_l^{\mathrm{H}}\mathbf{C}_{-l}^{-1}}{\gamma_l^{-1} + \mathbf{h}_l^{\mathrm{H}}\mathbf{C}_{-l}^{-1}\mathbf{h}_l}. \tag{25}$$
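The rank-one update (25) is the Sherman-Morrison identity applied to C = C_{−l} + γ_l h_l h_l^H. A quick numerical sanity check (arbitrary test dimensions, real-valued for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)
M, L, l = 8, 5, 2
lam = 4.0

H = rng.standard_normal((M, L))
gam = rng.uniform(0.5, 2.0, size=L)

C = np.eye(M) / lam + (H * gam) @ H.T            # C = lam^{-1} I + H Gamma H^H
h = H[:, l]
C_ml = C - gam[l] * np.outer(h, h)               # contribution of h_l removed

Cinv_ml = np.linalg.inv(C_ml)
woodbury = Cinv_ml - (Cinv_ml @ np.outer(h, h) @ Cinv_ml) / (1.0 / gam[l] + h @ Cinv_ml @ h)
print(np.allclose(woodbury, np.linalg.inv(C)))   # True: (25) holds
```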

Following the GEM framework, we consider the terms in (17) that depend on γ_l while keeping η and λ fixed to their current estimates:

$$\begin{aligned}\mathcal{L}(\boldsymbol{\gamma}) \triangleq \mathcal{L}(\boldsymbol{\gamma};\hat{\boldsymbol{\eta}},\hat{\lambda}) &= -\rho\log|\mathbf{C}_{-l}| - \rho\,\mathbf{y}^{\mathrm{H}}\mathbf{C}_{-l}^{-1}\mathbf{y} + (\epsilon-1)\sum_{k\neq l}\log\gamma_k - \sum_{k\neq l}\hat{\eta}_k\gamma_k + \ell(\gamma_l)\\ &= \mathcal{L}(\boldsymbol{\gamma}_{-l}) + \ell(\gamma_l). \end{aligned}\tag{26}$$

The first summand L(γ_{−l}) contains the terms independent of γ_l. The terms depending on γ_l are encompassed in the second summand:

$$\ell(\gamma_l) \triangleq -\rho\log(1+\gamma_l s_l) + \frac{\rho\,|q_l|^2}{\gamma_l^{-1}+s_l} + (\epsilon-1)\log\gamma_l - \hat{\eta}_l\gamma_l \tag{27}$$

with the definitions s_l ≜ h_l^H C_{−l}⁻¹ h_l and q_l ≜ h_l^H C_{−l}⁻¹ y. Note that the definition domain of ℓ(γ_l) is [0, ∞). We seek to compute the stationary points of ℓ(γ_l). To this end, we take its derivative and equate the result to zero. Doing so yields the cubic equation

$$0 = -\gamma_l^3\,\hat{\eta}_l s_l^2 - \gamma_l^2\big[(1-\epsilon+\rho)s_l^2 + 2\hat{\eta}_l s_l\big] + \gamma_l\big[2(\epsilon-1)s_l - \rho s_l + \rho|q_l|^2 - \hat{\eta}_l\big] + (\epsilon-1). \tag{28}$$

The equation has three solutions, which can be determined analytically. While a feasible solution for γ_l is constrained to be nonnegative, further analysis is needed to determine which of the roots of the cubic polynomial has to be selected as the solution. We immediately note that ℓ(γ_l) → −∞ as γ_l → ∞, while the behavior at the origin is governed by the term (ε − 1) log γ_l. In other words, depending on the sign of ε − 1, ℓ(γ_l) diverges either to plus or minus infinity when γ_l tends to zero. In the following we consider several different choices for ε.

#### III-B1 Stationary points of ℓ(γ_l) for ε < 1

When ε < 1, ℓ(γ_l) diverges to plus infinity at the origin and hence has no global maximum. In this case we turn to the following proposition:

###### Proposition 1

Provided ε < 1, the function ℓ(γ_l) has at most a single local maximum on (0, ∞).

For a proof, see the Appendix. Proposition 1 states a very pleasing result, as it provides us with an easy routine for selecting γ̂_l: either we select the local maximizer of ℓ(γ_l), provided it exists, or we prune the basis vector h_l from H by setting γ̂_l = 0.

#### III-B2 Stationary points of ℓ(γ_l) for ε = 1

In this case (28) simplifies to

$$0 = -\gamma_l^2\,\hat{\eta}_l s_l^2 - \gamma_l\big[\rho s_l^2 + 2\hat{\eta}_l s_l\big] - \rho s_l + \rho|q_l|^2 - \hat{\eta}_l. \tag{29}$$

The two roots of (29) read

$$\gamma_l^{(-)} = \frac{-(\rho s_l + 2\hat{\eta}_l) - \sqrt{\Delta_l}}{2 s_l \hat{\eta}_l}, \tag{30}$$

$$\gamma_l^{(+)} = \frac{-(\rho s_l + 2\hat{\eta}_l) + \sqrt{\Delta_l}}{2 s_l \hat{\eta}_l}, \tag{31}$$

where Δ_l ≜ ρ²s_l² + 4ρη̂_l|q_l|² and Δ_l ≥ ρ²s_l², with the equality holding if, and only if, q_l = 0. While γ_l^(−) is always negative, γ_l^(+) is positive if, and only if, |q_l|² − s_l > ρ⁻¹η̂_l. Moreover, since ℓ(γ_l) is finite at the origin for ε = 1 and diverges to minus infinity as γ_l → ∞, we conclude that if γ_l^(+) is positive it corresponds to the global maximum of ℓ(γ_l). Thus, the solution is obtained as

$$\hat{\gamma}_l = \begin{cases} \dfrac{-(\rho s_l + 2\hat{\eta}_l) + \sqrt{\Delta_l}}{2 s_l \hat{\eta}_l}, & \text{if } |q_l|^2 - s_l > \rho^{-1}\hat{\eta}_l \\[2mm] 0, & \text{otherwise}. \end{cases} \tag{32}$$

The case ε = 1 with ρ = 1/2 corresponds to the hierarchical modeling of the Laplace prior for real weights: (32) then coincides with the estimate of γ_l obtained in [19] for the Fast Laplace algorithm. Obviously, (32) can also be used for the complex case with ρ = 1, yet in this case the equivalent prior for α_l is no longer Laplace, as we showed in Section II, but some other prior from the Bessel K family.

The case ε = 1 is also connected to the Fast RVM algorithm [18]. The selection ε = 1 and η_l = 0 corresponds to a constant hyperprior pdf p(γ_l) ∝ 1, so that the hyperparameter vector γ is estimated via maximization of the type-II likelihood function p(y|γ, λ). Taking the derivative of (27) with η̂_l = 0 and setting the result to zero leads to the update expression

$$\hat{\gamma}_l = \begin{cases} (|q_l|^2 - s_l)/s_l^2, & \text{if } |q_l|^2 > s_l \\ 0, & \text{otherwise} \end{cases} \tag{33}$$

used in [18]. It is interesting to note that (33) is independent of ρ and thus is the same regardless of whether the signal model (1) is real or complex. The same holds for the RVM algorithm.
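Both closed-form updates fit in a few lines. In the sketch below (the helper is ours; Δ_l is written in the simplified form ρ²s_l² + 4ρη̂_l|q_l|²), η = 0 selects the type-II rule (33) and η > 0 the ε = 1 rule (32):

```python
import numpy as np

def gamma_update(q, s, eta, rho):
    """Update of gamma_l for eps = 1: eq. (32), or eq. (33) when eta = 0."""
    if eta == 0.0:                        # constant hyperprior: Fast RVM rule (33)
        return (abs(q) ** 2 - s) / s**2 if abs(q) ** 2 > s else 0.0
    if abs(q) ** 2 - s <= eta / rho:      # gamma^(+) not positive: prune, (32)
        return 0.0
    delta = (rho * s) ** 2 + 4.0 * rho * eta * abs(q) ** 2
    return (-(rho * s + 2.0 * eta) + np.sqrt(delta)) / (2.0 * s * eta)
```

Consistently with the text, (32) approaches (33) as η → 0: for instance, q = 2, s = 1, ρ = 1/2 gives γ̂ = 3 under both rules.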

#### III-B3 Stationary points of ℓ(γ_l) for 1 < ε ≤ 1 + ρ

Since ℓ(γl) diverges to minus infinity both as γl tends to zero and as γl tends to infinity, and is continuous, it must have a global maximum. In this case, we make use of Proposition 2.

###### Proposition 2

Provided 1 < ϵ ≤ 1+ρ, the function ℓ(γl) has a single stationary point, and this stationary point corresponds to a maximum.

See the Appendix. Hence, the positive solution to (28) is the global maximizer of ℓ(γl).

#### Iii-B4 Stationary points of ℓ(γl) for ϵ>1+ρ

This case is of limited practical interest, since the resulting penalty term is no longer sparsity-inducing. We therefore leave out the corresponding analysis of the stationary points.
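The case analysis of Sections III-B1 through III-B4 can be summarized by a small dispatcher; the labels below are ours and merely restate which treatment applies for each range of ϵ.

```python
def stationary_point_case(eps, rho):
    """Which treatment of l(gamma_l) applies (Secs. III-B1 to III-B4)."""
    if eps < 1.0:
        return "at most one local maximum (Proposition 1)"
    if eps == 1.0:
        return "closed-form solution (32)"
    if eps <= 1.0 + rho:
        return "unique stationary point, global maximum (Proposition 2)"
    return "penalty not sparsity-inducing"
```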

### Iii-C Realization of the Fast Sequential Inference Scheme

The fast inference scheme presented in Section III-B can be realized using different approaches. For instance, one can start out with the “full” dictionary and, at each iteration of the algorithm, remove every basis vector whose hyperparameter estimate ^γl is set to zero. This effectively reduces the model complexity “on the fly”. However, the first iterations still suffer from a high computational complexity due to the update (20).

To avoid this, we follow the approach in [18], which consists in starting with an “empty” dictionary and incrementally filling it by adding one basis vector at each iteration of the algorithm. More specifically, we start with an empty dictionary, “test” each candidate basis vector by computing its hyperparameter update, and add the one vector that gives rise to the greatest increase in the objective between two consecutive iterations. The estimator derived based on the 3-L model performs the update according to (23). We then follow the add, delete, and re-estimate procedures given by [18] (the required quantities can be computed in a very efficient manner for all basis vectors). In the first update iterations, the estimate of the noise variance is likely to be inaccurate, as the dictionary contains only a few basis vectors. Therefore, the estimation of the noise variance is enabled only after several update iterations have been performed; until then, it is initialized to a fixed value.
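The incremental strategy above can be sketched schematically as follows; `score` and `update` are hypothetical placeholders standing in for the closed-form quantities of the paper, and the loop only illustrates the add/re-estimate/delete bookkeeping.

```python
def fast_sequential(candidates, score, update, max_iter=100, tol=1e-6):
    """Greedy add/re-estimate/delete loop over candidate basis vectors."""
    gammas = {}                                   # active basis vectors
    for _ in range(max_iter):
        gains = {i: score(i, gammas) for i in candidates}
        best = max(gains, key=gains.get)          # vector with largest gain
        if gains[best] <= tol:                    # no further increase: stop
            break
        g = update(best)
        if g > 0.0:
            gammas[best] = g                      # add or re-estimate
        else:
            gammas.pop(best, None)                # delete (prune)
    return gammas
```

In the actual algorithm, the gain is the increase of the objective between two consecutive iterations, and the update is the closed-form hyperparameter estimate derived in Section III-B.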

## Iv Numerical Results

In this section we analyze the performance of the estimation algorithms proposed in Section III. The performance is evaluated by means of Monte Carlo simulations. Both real and complex signal models are considered in the investigations. We use a random dictionary matrix whose entries are independent and identically distributed (iid) zero-mean real (complex symmetric) Gaussian random variables with unit variance. A fixed number of components in the weight vector α are nonzero, and the indices of these are uniformly drawn without repetition from the set of all component indices. These indices, together with their number, are unknown to the algorithms. The nonzero components in α are iid and drawn from a zero-mean real (complex circular) Gaussian distribution with unit variance. All reported curves are averaged over independent Monte Carlo runs. For each such run, new realizations of the dictionary matrix, the vector α, and the random perturbation vector are generated. To further simplify the analysis of the simulation results, we assume that the noise variance is known. The performance of the algorithms is evaluated based on the MSE of the estimator and on the mean number of nonzero components in the estimate of α. These two performance metrics combined provide a strong indication of whether a given algorithm has properly inferred the support set of α, defined as the set of indices of the nonzero components in this vector.
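Both performance metrics are straightforward to compute; this sketch assumes real-valued vectors represented as plain Python lists.

```python
def mse(alpha, alpha_hat):
    """Mean squared error between the true and the estimated weight vector."""
    return sum((a - b) ** 2 for a, b in zip(alpha, alpha_hat)) / len(alpha)

def support(alpha, tol=1e-12):
    """Support set: indices of the nonzero components of alpha."""
    return {i for i, a in enumerate(alpha) if abs(a) > tol}
```

An algorithm has properly inferred the support when `support(alpha_hat)` equals `support(alpha)`.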

We refer to the proposed estimation algorithms using the 2-L and 3-L prior models as Fast-2L and Fast-3L, respectively. In all simulations we select the free parameters of the prior models as summarized in Table I. According to Section II, the shape parameter ϵ affects the sparsity-inducing property of the hierarchical prior model and, correspondingly, the penalty term in (2). The value of ϵ is appended to the acronyms Fast-2L and Fast-3L. Note that the values of the rate parameters need to be specified for Fast-2L. Our investigations show that the performance of Fast-2L becomes largely independent of this choice in moderate to high SNR regimes; nonetheless, in the low SNR regime a large value of the rate parameters results in overly sparse estimates of α. The values selected for Fast-2L are given in Table I.

We compare the performance of Fast-2L and Fast-3L to the performance of the following state-of-the-art sparse estimators:

1. EM-RVM: the RVM algorithm in [7, 8] that uses the EM algorithm for parameter estimation assuming a constant hyperparameter prior. (The software is available on-line at http://dsp.ucsd.edu/~dwipf/.)

2. EM-Laplace: the EM algorithm based on the hierarchical model for the Laplace prior in [14]. Note that EM-Laplace is equivalent to the FOCUSS algorithm [17] for a particular choice of its norm parameter.

3. Fast-RVM: the algorithm in [18, 29], which is a fast sequential version of EM-RVM assuming a constant hyperparameter prior. (The software is available on-line at http://people.ee.duke.edu/~lcarin/BCS.html.)

4. Fast-Laplace: the algorithm in [19], an analogue of Fast-RVM derived based on the hierarchical Laplace prior model for real weights, with an additional hyperprior for the rate parameter. (The software is available on-line at http://ivpl.eecs.northwestern.edu/.)

5. SpaRSA and Re-SpaRSA: the sparse reconstruction by separable approximation (SpaRSA) algorithm for solving the LASSO cost function [33] (the software is available on-line at http://www.lx.it.pt/~mtf/SpaRSA/) and a modified version of it (Re-SpaRSA) with the re-weighted minimization realized using the framework proposed in [20].

As a reference, we also consider the performance of the oracle estimator of [34], which “knows” the positions of the nonzero entries of α. The oracle estimate reads

 ^αo(y) = (HoᴴHo)⁻¹Hoᴴy  on supp(α);  0, elsewhere,  (34)

where Ho is the dictionary matrix constructed from those columns of the full dictionary that correspond to the nonzero entries in α. The MSE of the oracle estimator for a fixed dictionary and noise precision can be computed in closed form.
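For the real-valued case, the oracle estimate (34) amounts to a least-squares fit restricted to the true support. The sketch below solves the normal equations with a plain Gaussian elimination (no pivoting) and is meant only as an illustration, not an efficient or numerically robust implementation.

```python
def oracle_estimate(H, y, support):
    """Oracle LS estimate (34) for real-valued H (list of rows) and y."""
    m, k = len(y), len(support)
    Ho = [[row[j] for j in support] for row in H]       # columns on supp(alpha)
    # normal equations: (Ho^T Ho) x = Ho^T y
    A = [[sum(Ho[r][i] * Ho[r][j] for r in range(m)) for j in range(k)]
         for i in range(k)]
    b = [sum(Ho[r][i] * y[r] for r in range(m)) for i in range(k)]
    for i in range(k):                                   # forward elimination
        for j in range(i + 1, k):
            f = A[j][i] / A[i][i]
            A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
            b[j] -= f * b[i]
    x = [0.0] * k
    for i in reversed(range(k)):                         # back substitution
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, k))) / A[i][i]
    alpha = [0.0] * len(H[0])                            # zeros off the support
    for pos, j in enumerate(support):
        alpha[j] = x[pos]
    return alpha
```

In the complex case, the transposes in the normal equations become conjugate transposes, matching the Hermitian transpose Hoᴴ in (34).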

We expect the performance of an SBL iterative algorithm to be determined by two main factors: the underlying prior model used and the inference scheme applied to it. Fast-RVM, Fast-Laplace, Fast-2L, and Fast-3L are all obtained by applying the same inference scheme to different hierarchical prior models. Hence, by comparing the performance of these algorithms we can draw conclusions on the sparsity-inducing properties of their respective prior models. (Naturally, the practical implementation of the inference scheme also impacts the performance; Fast-Laplace, Fast-2L, and Fast-3L are all implemented based on the software for Fast-RVM.) This comparison is done in Section IV-A. In Section IV-B, we benchmark our proposed algorithms against the rest of the estimators in the above list.

### Iv-a Impact of the Hierarchical Prior Model on the Performance

To this end, we compare the performance of Fast-RVM, Fast-Laplace, Fast-2L, and Fast-3L. These estimators are all derived based on the same inference scheme, yet applied to different hierarchical models. Fast-RVM and Fast-Laplace can be easily adapted to complex scenarios; notice, however, that Fast-Laplace does not make use of the Laplace prior in the complex case, as explained earlier. The iterative nature of the algorithms requires a stopping criterion. We use the same stopping criterion for all algorithms, based on the difference between two consecutive iterations, as done in [29].

We begin by evaluating the performance versus (i) the SNR per received signal component and (ii) the number of iterations used until the stopping criterion is met. The results, shown in Fig. 4, depict the sensitivity of the algorithms to measurement noise. In these investigations the dimensions of the signal model and the number of nonzero components in α are kept fixed.

Fast-2L achieves nearly the performance of the oracle estimator in terms of MSE for both real and complex models. Furthermore, Fast-RVM and Fast-Laplace perform as well as Fast-3L. We also note that in the high SNR regime the performance of all compared estimators is almost indistinguishable. Yet in the moderate SNR regime, the impact of the noise becomes significant and Fast-2L with small ϵ clearly outperforms the other algorithms. The performance results for Fast-3L are not reported since they almost coincide with those for Fast-2L. This indicates that it is mainly the selection of the shape parameter in the second layer that dominates the performance of the algorithms, regardless of whether the 2-L or the 3-L prior model is used.

We now study the mean number of nonzero components in the estimate of α versus the SNR. The corresponding results are summarized in Figs. 5(a) and 5(d). Clearly, Fast-2L outperforms the other algorithms: its estimate is unbiased for SNRs as low as 15 dB in the complex case. Fast-RVM and Fast-Laplace produce a significantly biased estimate in the low to average SNR regime; only for very high SNR values does it become unbiased.

In Figs. 5(b) and 5(e) we plot the mean number of nonzero components versus the algorithms’ iteration index for the SNR fixed at 30 dB. Observe that Fast-2L converges in roughly 30–40 iterations for both the real and the complex signal model. In contrast, the other algorithms require substantially more update iterations until convergence is achieved. As shown in Figs. 5(c) and 5(f), the number of iterations varies with the SNR. Notice that Fast-2L requires fewer iterations than any other considered algorithm over almost the whole evaluated SNR range. Furthermore, we observe that Fast-3L and Fast-RVM converge faster than Fast-Laplace; nonetheless, Fast-Laplace leads to sparser representations.

Next we evaluate the performance of the algorithms versus the number of available measurements and versus the number of nonzero components in α, for a fixed SNR. In these investigations we have not included Fast-3L, as we can conclude from the previously presented results that it performs poorly compared to Fast-2L. First, we keep the model dimensions fixed and vary the number of measurements; then, for a fixed number of measurements, we vary the number of nonzero components. The corresponding performance results are depicted in Fig. 6. In Figs. 6(a), 6(b), 6(e), and 6(f) we plot the MSE and the mean number of nonzero components versus the number of measurements. Notice the characteristic threshold-like behavior of the MSE curves with respect to the number of observed measurement samples. The threshold occurs at a smaller number of measurements for complex signal models than for real signal models. This is explained by the fact that in the complex case both the real and the imaginary parts are used to prune components in α, thus improving the MSE threshold. Furthermore, Fast-RVM and Fast-Laplace consistently produce a biased estimate of the number of nonzero components for both real and complex signal models; in contrast, Fast-2L achieves an unbiased estimate for sufficiently many measurements.

In Figs. 6(c), 6(d), 6(g), and 6(h) we plot the MSE and the mean number of nonzero components versus the number of nonzero components in α. A similar threshold-like behavior can be observed as this number increases, with the threshold occurring at a higher value for complex signal models; i.e., in complex scenarios less sparse signals can be estimated with higher accuracy. Analogously to the previously discussed results, Fast-2L with small ϵ again outperforms the other algorithms. In particular, Fig. 6(h) shows that Fast-2L achieves an unbiased estimate even for comparatively large numbers of nonzero components.