Application of Bayesian Hierarchical Prior Modeling to Sparse Channel Estimation

by   Niels Lovmand Pedersen, et al.
Aalborg University

Existing methods for sparse channel estimation typically provide an estimate computed as the solution maximizing an objective function defined as the sum of the log-likelihood function and a penalization term proportional to the l1-norm of the parameter of interest. However, other penalization terms have proven to have strong sparsity-inducing properties. In this work, we design pilot-assisted channel estimators for OFDM wireless receivers within the framework of sparse Bayesian learning by defining hierarchical Bayesian prior models that lead to sparsity-inducing penalization terms. The estimators result as an application of the variational message-passing algorithm on the factor graph representing the signal model extended with the hierarchical prior models. Numerical results demonstrate the superior performance of our channel estimators as compared to traditional and state-of-the-art sparse methods.


page 1

page 2

page 3

page 4


A Fast Iterative Bayesian Inference Algorithm for Sparse Channel Estimation

In this paper, we present a Bayesian channel estimation algorithm for mu...

Doppler Shift and Channel Estimation for Intelligent Transparent Surface Assisted Communication Systems on High-Speed Railways

The critical distinction between the emerging intelligent transparent su...

A Distributed Sparse Channel Estimation Technique for mmWave Massive MIMO Systems

In this paper, we study the problem of sparse channel estimation via a c...

A variational Bayes framework for sparse adaptive estimation

Recently, a number of mostly ℓ_1-norm regularized least squares type det...

Low-Overhead Hierarchically-Sparse Channel Estimation for Multiuser Wideband Massive MIMO

The problem of excessive pilot overhead required for uplink massive MIMO...

Bayesian and L1 Approaches to Sparse Unsupervised Learning

The use of L1 regularisation for sparse learning has generated immense r...

Dual-Space Analysis of the Sparse Linear Model

Sparse linear (or generalized linear) models combine a standard likeliho...

I Introduction

During the last few years the research on compressive sensing techniques and sparse signal representations [1, 2] applied to channel estimation has received considerable attention, see e.g., [3, 4, 5, 6, 7]. The reason is that, typically, the impulse response of the wireless channel has a few dominant multipath components. A channel exhibiting this property is said to be sparse [3].

The general goal of sparse signal representations from overcomplete dictionaries is to estimate the sparse vector

in the following system model:


In this expression is the vector of measurement samples and represents the samples of the additive white Gaussian random noise with covariance matrix and precision parameter . The matrix is the overcomplete dictionary with more columns than rows () and is an unknown sparse vector, i.e., has few nonzero elements at unknown locations.

Often, a sparse channel estimator is constructed by solving the -norm constrained quadratic optimization problem, see among others [4, 5, 6]:


with and , , denoting the

vector norm. This method is also known as Least Absolute Shrinkage and Selection Operator (LASSO) regression

[8] or Basis Pursuit Denoising [9]. The popularity of the LASSO regression is mainly attributed to the convexity of the cost function, as well as to its provable sparsity-inducing properties (see [2]). In [4, 5, 6] the LASSO regression is applied to orthogonal frequency-division multiplexing (OFDM) pilot-assisted channel estimation. Various channel estimation algorithms that minimize the LASSO cost function using convex optimization are compared in [6].

Another approach to sparse channel estimation is sparse Bayesian learning (SBL) [7, 10, 11, 12]. Specifically, SBL aims at finding a sparse maximum a posteriori (MAP) estimate of


by specifying a prior such that the penalty term induces a sparse estimate .111Here denotes , and thus , for some arbitrary constant . We will also make use of which denotes for some positive constant .

Obviously, by comparing (2) and (3) the SBL framework realizes the LASSO cost function by choosing the Laplace prior with . However, instead of working directly with the prior , SBL models this using a two-layer (2-L) hierarchical structure. This involves specifying a conditional prior

and a hyperprior

such that has a sparsity-inducing nature. The hierarchical approach to the representation of

has several important advantages. First of all, one is free to choose simple and analytically tractable probability density functions (pdfs). Second, when carefully chosen, the resulting hierarchical structure allows for the construction of efficient yet computationally tractable iterative inference algorithms with analytical derivation of the inference expressions.

In [13] we propose a 2-L and a three-layer (3-L) prior model for . These hierarchical prior models lead to novel sparsity-inducing priors that include the Laplace prior for complex variables as a special case. This paper adapts the Bayesian probabilistic framework introduced in [13] to OFDM pilot-assisted sparse channel estimation. We then propose a variational message passing (VMP) algorithm that effectively exploits the hierarchical structure of the prior models. This approach leads to novel channel estimators that make use of various priors with strong sparsity-inducing properties. The numerical results reveal the promising potential of our estimators with improved performance as compared to state-of-the-art methods. In particular, the estimators outperform LASSO.

Throughout the paper we shall make use of the following notation: and denote respectively the transpose and the Hermitian transpose; the expression denotes the expectation of the function with respect to the density ; denotes a multivariate complex Gaussian pdf with mean and covariance matrix ; similarly, denotes a Gamma pdf with shape parameter and rate parameter .

Ii Signal Model

We consider a single-input single-output OFDM system with subcarriers. A cyclic prefix (CP) is added to preserve orthogonality between subcarriers and to eliminate inter-symbol interference between consecutive OFDM symbols. The channel is assumed static during the transmission of each OFDM symbol. The received (baseband) OFDM signal reads in matrix-vector notation


The diagonal matrix contains the transmitted symbols. The components of the vector are the samples of the channel frequency response at the subcarriers. Finally,

is a zero-mean complex symmetric Gaussian random vector of independent components with variance


To estimate the vector in (4), a total of pilot symbols are transmitted at selected subcarriers. The pilot pattern denotes the set of indices of the pilot subcarriers. The received signals observed at the pilot positions are then divided each by the corresponding pilot symbol to produce the vector of observations:


We assume that all pilot symbols hold unit power such that the statistics of the noise term remain unchanged, i.e., yields the samples of the true channel frequency response (at the pilot subcarriers) corrupted by additive complex white Gaussian noise with component variance .

In this work, we consider a frequency-selective wireless channel that remains constant during the transmission of each OFDM symbol. The maximum relative delay is assumed to be large compared to the sampling time , i.e., [3]. The impulse response of the wireless channel is modeled as a sum of multipath components:


In this expression, and are respectively the complex weight and the continuous delay of the th multipath component, and is the Dirac delta function. The parameter is the total number of multipath components. The channel parameters , , and ,

, are random variables. Specifically, the weights

, , are mutually uncorrelated zero-mean with the sum of their variances normalized to one. Additional details regarding the assumptions on the model (6) are provided in Section VI.

Iii The Dictionary Matrix

Our goal is to estimate in (4) by applying the general optimization problem (3) to the observation model (5). For doing so, we must define a proper dictionary matrix

. In this section we give an example of such a matrix. As a starting point, we invoke the parametric model (

6) of the channel. Making use of this model, (5) can be written as


with , , , , and depending on the pilot pattern as well as the unknown delays in . Specifically, the th entry of reads


with denoting the frequency of the th pilot subcarrier. In the general optimization problem (3) the columns of are known. However, the columns of in (7) depend on the unknown delays in . To circumvent this discrepancy we follow the same approach as in [5] and consider a grid of uniformly-spaced delay samples in the interval :


with such that is an integer. We now define the dictionary as . Thus, the entries of are of the form (8) with delay vector . The number of columns in is thereby inversely proportional to the selected delay resolution .

It is important to notice that the system model (1) with defined using discretized delay components is an approximation of the true system model (7). This approximation model is introduced so that (3) can be applied to solve the channel estimation task. The estimate of the channel vector at the pilot subcarriers is then . In order to estimate the channel in (4) the dictionary is appropriately expanded (row-wise) to include all subcarrier frequencies.

Iv Bayesian Prior Modeling

In this section we specify the joint pdf of the system model (1) when it is augmented with the 2-L and the 3-L hierarchical prior model. The joint pdf of (1) augmented with the 2-L hierarchical prior model reads


The 3-L prior model considers the parameter specifying the prior of in (10) as random. Thus, the joint pdf of (1) augmented with this hierarchical prior model is of the form


In (10) and (11) we have due to (1

). Furthermore, we select the conjugate prior

. Finally, we let with . In the following we show the main results and properties of these prior models. We refer to [13] for a more detailed analysis.

Iv-a Two-Layer Hierarchical Prior Model

Fig. 1: 2-L hierarchical prior pdf for : (a) Contour plot of the restriction to the – plane of the penalty term . (b) Restriction to of the resulting MAP estimation rule (3) with as a parameter in the case when is orthonormal. The black dashed line indicates the hard-threshold rule and the black solid line the soft-threshold rule (obtained with ). The black dashed line indicates the penalty term resulting when the prior pdf is a circular symmetric Gaussian pdf.

The 2-L prior model assumes that with . We compute the prior of to be




In this expression, is the modified Bessel function of the second kind with order . The prior (13) leads to the general optimization problem (3) with penalty term


We now show that the 2-L prior model induces the -norm penalty term and thereby the LASSO cost function as a special case. Selecting and using the identity [14], (13) yields the Laplace prior


With the selection , , we obtain .

The prior pdf (13) is specified by and the regularization parameter . In order to get insight into the impact of on the properties of this prior pdf we consider the case . In Fig. 1(a) the contour lines of the restriction to of are visualized;222Let denote a function defined on a set . The restriction of to a subset is the function defined on that coincides with on this subset. each contour line is computed for a specific choice of . Notice that as decreases towards more probability mass accumulates along the -axes; as a consequence, the mode of the resulting posterior is more likely to be located close to the axes, thus promoting a sparse solution. The behavior of the classical penalty term obtained for can also be clearly recognized. In Fig. 1(b) we consider the case when is orthonormal and compute the MAP estimator (3) with penalty term (14) for different values of . Note the typical soft-threshold-like behavior of the estimators. As , more components of are pulled towards zero since the threshold value increases, thus encouraging a sparser solution.

Iv-B Three-Layer Hierarchical Prior Model

Fig. 2: Three-layer hierarchical prior pdf for with the setting , : (a) Restriction to of the resulting MAP estimation rule (3) with as a parameter in the case when is orthonormal. The black dashed line indicates the hard-threshold rule and the black solid line the soft-threshold rule. (b) Contour plot of the restriction to the – plane of the penalty term .

We now turn to the SBL problem with a 3-L prior model for leading to the joint pdf in (11). Specifically, the goal is to incorporate the regularization parameter into the inference framework. To that end, we define with and compute the prior . Defining and we obtain with


In this expression, is the confluent hypergeometric function [14]. In Fig. 2(a) we show the estimation rules produced by the MAP solver for different values of and fixed parameters and when is orthonormal. It can be seen that the estimation rules obtained with the 3-L prior model approximate the hard-thresholding rule. In Fig. 2(b), we depict the contour lines of the restriction to of . Observe that although the contours behave qualitatively similarly to those shown in Fig. 1(a) for the 2-L prior model, the estimation rules in Fig. 2(a) and Fig. 1(b) are different.

Naturally, the 3-L prior model encompasses three free parameters, , , and . The choice and small (practically we let , ) induces a weighted log-sum penalization term. This term is known to strongly promote a sparse estimate [10, 11]. Later in the text we will also adopt this parameter setting.

V Variational Message Passing

Fig. 3: A factor graph that represents the joint pdf (11). In this figure , , , , and .

In this section we present a VMP algorithm for estimating in (4) given the observation in (5). Let be the set of unknown parameters and be the joint pdf specified in (11). The factor graph [15] that encodes the factorization of is shown in Fig. 3. Consider an auxiliary pdf for the unknown parameters that factorizes according to . The VMP algorithm is an iterative scheme that attempts to compute the auxiliary pdf that minimizes the Kullback-Leibler (KL) divergence . In the following we summarize the key steps of the algorithm; the reader is referred to [16] for more information on VMP.

From [16] the auxiliary function , , is updated as the product of incoming messages from the neighboring factor nodes to the variable node :


In (17) is the set of factor nodes neighboring the variable node and denotes the message from factor node to variable node . This message is computed as


where is the set of variable nodes neighboring the factor node . After an initialization procedure, the individual factors of are then updated iteratively in a round-robin fashion using (17) and (18).

We provide two versions of the VMP algorithm: one applied to the 2-L prior model (referred to as VMP-2L) and another one applied to the 3-L model (VMP-3L). The messages corresponding to VMP-2L are easily obtained as a special case of the messages computed for VMP-3L by assuming , where is some fixed real number.

V-1 Update of

According to (17) and Fig. 3 the computation of the update of requires evaluating the product of messages and . Multiplying these two messages yields the Gaussian auxiliary pdf with covariance matrix and mean given by


In the above expression we have defined .

Fig. 4: Comparison of the performance of the VMP-2L, VMP-3L, RWF, RVM, and SparseRSA algorithms: (a) BER versus , (b) MSE versus , (c) MSE versus number of available pilots with fixed and the ratio between received symbol power and noise variance set to 15 dB. In (a,b) we have and . In (a) the dashed line shows the BER performance when the true channel vector in (4) is known.

V-2 Update of

The update of is proportional to the product of the messages and :


The right-hand side expression in (21) is recognized as the product of Generalized Inverse Gaussian (GIG) pdfs [17] with order . Observe that the computation of in (19) requires evaluating for all

. Luckily, the moments of the GIG distribution are given in closed form for any



V-3 Update of

The update of is proportional to the product of messages and :


Clearly, factorizes as a product of gamma pdfs, one for each individual entry in . The first moment of used in (22) is easily computed as


Naturally, is only computed for VMP-3L.

V-4 Update of

It can be shown that . The first moment of used in (19) and (20) is therefore


Vi Numerical Results

Sampling time, 32.55 ns
CP length 4.69 s / 144
Subcarrier spacing 15 kHz
Pilot pattern Equally spaced, QPSK
Modulation QPSK
Subcarriers, 1200
Pilots, 100
OFDM symbols 1
Information bits 727
Channel interleaver Random
Convolutional code
Decoder BCJR algorithm [19]
TABLE I: Parameter settings for the simulations. The convolutional code and decoder has been implemented using [18].

We perform Monte Carlo simulations to evaluate the performance of the two versions of the derived VMP algorithm in Section V. We consider a scenario inspired by the 3GPP LTE standard [20] with the settings specified in Table I. The multipath channel (6) is based on the model used in [21] where, for each realization of the channel, the total number of multipath components

is Poisson distributed with mean of

and the delays ,

, are independent and uniformly distributed random variables drawn from the continuous interval [0, 144

] (corresponding to the CP length). The th nonzero component conditioned on the delay

has a zero-mean complex circular symmetric Gaussian distribution with variance

and parameters .333The parameter is computed such that , where is the joint pdf of the parameters of the channel model. In the considered simulation scenario, , , and (the decay rate).

To initialize the VMP algorithm we set and equal to the inverse of the sample variance of and the inverse number of columns respectively. Furthermore, we let in (25), which corresponds to the Jeffreys noninformative prior for . Once the initialization is completed, the algorithm sequentially updates the auxiliary pdfs , , , and until convergence is achieved. Obviously, is only updated for VMP-3L, whereas for VMP-2L the entries in are set to . For both versions we select and for VMP-3L we set and , . Finally, the dictionary is specified by pilot subcarriers and a total of columns (corresponding to the choice and in (9)).

The VMP is compared to a classical OFDM channel estimator and two state-of-the-art sparse estimation schemes. Specifically, we use as benchmark the robustly-designed Wiener Filter (RWF) [22], the relevance vector machine (RVM) [10], [11],444The software is available on-line at and the sparse reconstruction by separable approximation (SpaRSA) algorithm [23].555The software is available on-line at The RVM is an EM algorithm based on the 2-L prior model of the student-t pdf over each , whereas SpaRSA is a proximal gradient method for solving (2). In case of the SpaRSA algorithm the regularization parameter needs to be set. In all simulations, we let , which leads to good performance in high signal-to-noise ratio (SNR) regime.

The performance is compared with respect to the resulting bit-error-rate (BER) and mean-squared error (MSE) of the estimate versus the SNR (). In addition, in order to quantify the necessary pilot overhead, we evaluate the MSE versus the number of available pilots . Hence, in this setup is no longer fixed as in Table I.

In Fig. 4(a) we compare the BER performance of the different schemes. We see that VMP-3L outperforms the other schemes across all the SNR range considered. Specifically, at 1 % BER the gain is approximately 2 dB compared to VMP-2L and RVM and 3 dB compared to SpaRSA and RWF. Also VMP-2L achieves lower BER in the SNR range 0 - 12 dB compared to RVM and across the whole SNR range compared to SpaRSA and RWF.

The superior BER performance of the VMP algorithm is well reflected in the MSE performance shown in Fig. 4(b). Again VMP-3L is a clear winner followed by VMP-2L. The bad MSE performance of the SpaRSA for low SNR is due to the difficulty in specifying a suitable regularization parameter across a large SNR range.

We next fix the ratio between received symbol power and noise variance to 15 dB666Note that this value does not correspond with as represented in Fig. 4(a) and 4(b). The specific depends on the number of bits in an OFDM block, which in turn depends on the number of pilot symbols . and evaluate the MSE versus number of available pilots . The results are depicted in Fig. 4(c). Observe a noticeable performance gain obtained with VMP-3L. In particular, VMP-3L exhibits the same MSE performance as VMP-2L and RVM using only approximately 85 pilots, roughly half as many as VMP-2L and RVM. Furthermore, VMP-3L, using this number of pilots, significantly outperforms SpaRSA and RWF using 200 pilots.

Vii Conclusion

In this paper, we proposed channel estimators based on sparse Bayesian learning. The estimators rely on Bayesian hierarchical prior modeling and variational message passing (VMP). The VMP algorithm effectively exploits the probabilistic structure of the hierarchical prior models and the resulting sparsity-inducing priors. Our numerical results show that the proposed channel estimators yield superior performance in terms of bit-error-rate and mean-squared error as compared to other existing estimators, including the estimator based on the -norm constraint. They also allow for a significant reduction of the amount of pilot subcarriers needed for estimating a given channel.


This work was supported in part by the 4GMCT cooperative research project funded by Intel Mobile Communications, Agilent Technologies, Aalborg University and the Danish National Advanced Technology Foundation. This research was also supported in part by the project ICT- 248894 Wireless Hybrid Enhanced Mobile Radio Estimators (WHERE2).


  • [1] R. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, July 2007.
  • [2] E. J. Candes and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.
  • [3] W. Bajwa, J. Haupt, A. Sayeed, and R. Nowak, “Compressed channel sensing: A new approach to estimating sparse multipath channels,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1058–1076, Jun. 2010.
  • [4] G. Taubock and F. Hlawatsch, “A compressed sensing technique for OFDM channel estimation in mobile environments: Exploiting channel sparsity for reducing pilots,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing ICASSP 2008, 2008, pp. 2885–2888.
  • [5] C. R. Berger, S. Zhou, J. C. Preisig, and P. Willett, “Sparse channel estimation for multicarrier underwater acoustic communication: From subspace methods to compressed sensing,” IEEE Trans. on Sig. Proc., vol. 58, no. 3, pp. 1708–1721, 2010.
  • [6] J. Huang, C. R. Berger, S. Zhou, and J. Huang, “Comparison of basis pursuit algorithms for sparse channel estimation in underwater acoustic OFDM,” in Proc. OCEANS 2010 IEEE - Sydney, 2010, pp. 1–6.
  • [7] D. Shutin and B. H. Fleury, “Sparse variational Bayesian SAGE algorithm with application to the estimation of multipath wireless channels,” IEEE Trans. on Sig. Proc., vol. 59, pp. 3609–3623, 2011.
  • [8] R. Tibshirani, “Regression shrinkage and selection via the LASSO,” J. R. Statist. Soc., vol. 58, pp. 267–288, 1994.
  • [9] S. S. Chen, D. L. Donoho, Michael, and A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal on Scientific Computing, vol. 20, pp. 33–61, 1998.
  • [10] M. Tipping, “Sparse Bayesian learning and the relevance vector machine,”

    J. of Machine Learning Res.

    , vol. 1, pp. 211–244, June 2001.
  • [11] D. Wipf and B. Rao, “Sparse Bayesian learning for basis selection,” IEEE Trans. on Sig. Proc., vol. 52, no. 8, pp. 2153 – 2164, aug. 2004.
  • [12]

    D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, “The variational approximation for Bayesian inference,”

    IEEE Signal Process. Mag., vol. 25, no. 6, pp. 131–146, November 2008.
  • [13] N. L. Pedersen, D. Shutin, C. N. Manchón, and B. H. Fleury, “Sparse estimation using Bayesian hierarchical prior modeling for real and complex models,” submitted to IEEE Trans. on Sig. Proc., 2012, arXiv:1108.4324v1.
  • [14] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.   Dover, 1972.
  • [15] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 498–519, Feb 2001.
  • [16] J. Winn and C. M. Bishop, “Variational message passing,” J. Mach. Learn. Res., vol. 6, pp. 661–694, 2005.
  • [17] B. Jorgensen, Statistical Properties of the Generalized Inverse Gaussian Distribution (Lecture Notes in Statistics 9).   Springer-Verlag New York Inc, 1982.
  • [18] The iterative solutions coded modulation library. [Online]. Available:
  • [19] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Trans. on Inform. Theory, vol. 20, no. 2, pp. 284–287, 1974.
  • [20] 3rd Generation Partnership Project (3GPP) Technical Specification, “Evolved universal terrestrial radio access (e-utra); base station (bs) radio transmission and reception,” TS 36.104 V8.4.0, Tech. Rep., 2008.
  • [21] M. L. Jakobsen, K. Laugesen, C. Navarro Manchón, G. E. Kirkelund, C. Rom, and B. Fleury, “Parametric modeling and pilot-aided estimation of the wireless multipath channel in OFDM systems,” in Proc. IEEE Int Communications (ICC) Conf, 2010, pp. 1–6.
  • [22]

    O. Edfors, M. Sandell, J.-J. van de Beek, S. K. Wilson, and P. O. Börjesson, “OFDM channel estimation by singular value decomposition,”

    IEEE Trans. on Communications, vol. 46, no. 7, pp. 931–939, 1998.
  • [23] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, “Sparse reconstruction by separable approximation,” IEEE Trans. on Sig. Proc., vol. 57, no. 7, pp. 2479–2493, 2009.