I Introduction
During the last few years the research on compressive sensing techniques and sparse signal representations [1, 2] applied to channel estimation has received considerable attention, see e.g., [3, 4, 5, 6, 7]. The reason is that, typically, the impulse response of the wireless channel has a few dominant multipath components. A channel exhibiting this property is said to be sparse [3].
The general goal of sparse signal representations from overcomplete dictionaries is to estimate the sparse vector
in the following system model:(1) 
In this expression is the vector of measurement samples and represents the samples of the additive white Gaussian random noise with covariance matrix and precision parameter . The matrix is the overcomplete dictionary with more columns than rows () and is an unknown sparse vector, i.e., has few nonzero elements at unknown locations.
Often, a sparse channel estimator is constructed by solving the norm constrained quadratic optimization problem, see among others [4, 5, 6]:
(2) 
with and , , denoting the
vector norm. This method is also known as Least Absolute Shrinkage and Selection Operator (LASSO) regression
[8] or Basis Pursuit Denoising [9]. The popularity of the LASSO regression is mainly attributed to the convexity of the cost function, as well as to its provable sparsityinducing properties (see [2]). In [4, 5, 6] the LASSO regression is applied to orthogonal frequencydivision multiplexing (OFDM) pilotassisted channel estimation. Various channel estimation algorithms that minimize the LASSO cost function using convex optimization are compared in [6].Another approach to sparse channel estimation is sparse Bayesian learning (SBL) [7, 10, 11, 12]. Specifically, SBL aims at finding a sparse maximum a posteriori (MAP) estimate of
(3) 
by specifying a prior such that the penalty term induces a sparse estimate .^{1}^{1}1Here denotes , and thus , for some arbitrary constant . We will also make use of which denotes for some positive constant .
Obviously, by comparing (2) and (3) the SBL framework realizes the LASSO cost function by choosing the Laplace prior with . However, instead of working directly with the prior , SBL models this using a twolayer (2L) hierarchical structure. This involves specifying a conditional prior
and a hyperprior
such that has a sparsityinducing nature. The hierarchical approach to the representation ofhas several important advantages. First of all, one is free to choose simple and analytically tractable probability density functions (pdfs). Second, when carefully chosen, the resulting hierarchical structure allows for the construction of efficient yet computationally tractable iterative inference algorithms with analytical derivation of the inference expressions.
In [13] we propose a 2L and a threelayer (3L) prior model for . These hierarchical prior models lead to novel sparsityinducing priors that include the Laplace prior for complex variables as a special case. This paper adapts the Bayesian probabilistic framework introduced in [13] to OFDM pilotassisted sparse channel estimation. We then propose a variational message passing (VMP) algorithm that effectively exploits the hierarchical structure of the prior models. This approach leads to novel channel estimators that make use of various priors with strong sparsityinducing properties. The numerical results reveal the promising potential of our estimators with improved performance as compared to stateoftheart methods. In particular, the estimators outperform LASSO.
Throughout the paper we shall make use of the following notation: and denote respectively the transpose and the Hermitian transpose; the expression denotes the expectation of the function with respect to the density ; denotes a multivariate complex Gaussian pdf with mean and covariance matrix ; similarly, denotes a Gamma pdf with shape parameter and rate parameter .
Ii Signal Model
We consider a singleinput singleoutput OFDM system with subcarriers. A cyclic prefix (CP) is added to preserve orthogonality between subcarriers and to eliminate intersymbol interference between consecutive OFDM symbols. The channel is assumed static during the transmission of each OFDM symbol. The received (baseband) OFDM signal reads in matrixvector notation
(4) 
The diagonal matrix contains the transmitted symbols. The components of the vector are the samples of the channel frequency response at the subcarriers. Finally,
is a zeromean complex symmetric Gaussian random vector of independent components with variance
.To estimate the vector in (4), a total of pilot symbols are transmitted at selected subcarriers. The pilot pattern denotes the set of indices of the pilot subcarriers. The received signals observed at the pilot positions are then divided each by the corresponding pilot symbol to produce the vector of observations:
(5) 
We assume that all pilot symbols hold unit power such that the statistics of the noise term remain unchanged, i.e., yields the samples of the true channel frequency response (at the pilot subcarriers) corrupted by additive complex white Gaussian noise with component variance .
In this work, we consider a frequencyselective wireless channel that remains constant during the transmission of each OFDM symbol. The maximum relative delay is assumed to be large compared to the sampling time , i.e., [3]. The impulse response of the wireless channel is modeled as a sum of multipath components:
(6) 
In this expression, and are respectively the complex weight and the continuous delay of the th multipath component, and is the Dirac delta function. The parameter is the total number of multipath components. The channel parameters , , and ,
, are random variables. Specifically, the weights
, , are mutually uncorrelated zeromean with the sum of their variances normalized to one. Additional details regarding the assumptions on the model (6) are provided in Section VI.Iii The Dictionary Matrix
Our goal is to estimate in (4) by applying the general optimization problem (3) to the observation model (5). For doing so, we must define a proper dictionary matrix
. In this section we give an example of such a matrix. As a starting point, we invoke the parametric model (
6) of the channel. Making use of this model, (5) can be written as(7) 
with , , , , and depending on the pilot pattern as well as the unknown delays in . Specifically, the th entry of reads
(8) 
with denoting the frequency of the th pilot subcarrier. In the general optimization problem (3) the columns of are known. However, the columns of in (7) depend on the unknown delays in . To circumvent this discrepancy we follow the same approach as in [5] and consider a grid of uniformlyspaced delay samples in the interval :
(9) 
with such that is an integer. We now define the dictionary as . Thus, the entries of are of the form (8) with delay vector . The number of columns in is thereby inversely proportional to the selected delay resolution .
It is important to notice that the system model (1) with defined using discretized delay components is an approximation of the true system model (7). This approximation model is introduced so that (3) can be applied to solve the channel estimation task. The estimate of the channel vector at the pilot subcarriers is then . In order to estimate the channel in (4) the dictionary is appropriately expanded (rowwise) to include all subcarrier frequencies.
Iv Bayesian Prior Modeling
In this section we specify the joint pdf of the system model (1) when it is augmented with the 2L and the 3L hierarchical prior model. The joint pdf of (1) augmented with the 2L hierarchical prior model reads
(10) 
The 3L prior model considers the parameter specifying the prior of in (10) as random. Thus, the joint pdf of (1) augmented with this hierarchical prior model is of the form
(11) 
In (10) and (11) we have due to (1
). Furthermore, we select the conjugate prior
. Finally, we let with . In the following we show the main results and properties of these prior models. We refer to [13] for a more detailed analysis.Iva TwoLayer Hierarchical Prior Model
The 2L prior model assumes that with . We compute the prior of to be
(12) 
with
(13) 
In this expression, is the modified Bessel function of the second kind with order . The prior (13) leads to the general optimization problem (3) with penalty term
(14) 
We now show that the 2L prior model induces the norm penalty term and thereby the LASSO cost function as a special case. Selecting and using the identity [14], (13) yields the Laplace prior
(15) 
With the selection , , we obtain .
The prior pdf (13) is specified by and the regularization parameter . In order to get insight into the impact of on the properties of this prior pdf we consider the case . In Fig. 1(a) the contour lines of the restriction to of are visualized;^{2}^{2}2Let denote a function defined on a set . The restriction of to a subset is the function defined on that coincides with on this subset. each contour line is computed for a specific choice of . Notice that as decreases towards more probability mass accumulates along the axes; as a consequence, the mode of the resulting posterior is more likely to be located close to the axes, thus promoting a sparse solution. The behavior of the classical penalty term obtained for can also be clearly recognized. In Fig. 1(b) we consider the case when is orthonormal and compute the MAP estimator (3) with penalty term (14) for different values of . Note the typical softthresholdlike behavior of the estimators. As , more components of are pulled towards zero since the threshold value increases, thus encouraging a sparser solution.
IvB ThreeLayer Hierarchical Prior Model
We now turn to the SBL problem with a 3L prior model for leading to the joint pdf in (11). Specifically, the goal is to incorporate the regularization parameter into the inference framework. To that end, we define with and compute the prior . Defining and we obtain with
(16) 
In this expression, is the confluent hypergeometric function [14]. In Fig. 2(a) we show the estimation rules produced by the MAP solver for different values of and fixed parameters and when is orthonormal. It can be seen that the estimation rules obtained with the 3L prior model approximate the hardthresholding rule. In Fig. 2(b), we depict the contour lines of the restriction to of . Observe that although the contours behave qualitatively similarly to those shown in Fig. 1(a) for the 2L prior model, the estimation rules in Fig. 2(a) and Fig. 1(b) are different.
V Variational Message Passing
In this section we present a VMP algorithm for estimating in (4) given the observation in (5). Let be the set of unknown parameters and be the joint pdf specified in (11). The factor graph [15] that encodes the factorization of is shown in Fig. 3. Consider an auxiliary pdf for the unknown parameters that factorizes according to . The VMP algorithm is an iterative scheme that attempts to compute the auxiliary pdf that minimizes the KullbackLeibler (KL) divergence . In the following we summarize the key steps of the algorithm; the reader is referred to [16] for more information on VMP.
From [16] the auxiliary function , , is updated as the product of incoming messages from the neighboring factor nodes to the variable node :
(17) 
In (17) is the set of factor nodes neighboring the variable node and denotes the message from factor node to variable node . This message is computed as
(18) 
where is the set of variable nodes neighboring the factor node . After an initialization procedure, the individual factors of are then updated iteratively in a roundrobin fashion using (17) and (18).
We provide two versions of the VMP algorithm: one applied to the 2L prior model (referred to as VMP2L) and another one applied to the 3L model (VMP3L). The messages corresponding to VMP2L are easily obtained as a special case of the messages computed for VMP3L by assuming , where is some fixed real number.
V1 Update of
According to (17) and Fig. 3 the computation of the update of requires evaluating the product of messages and . Multiplying these two messages yields the Gaussian auxiliary pdf with covariance matrix and mean given by
(19)  
(20) 
In the above expression we have defined .
V2 Update of
The update of is proportional to the product of the messages and :
(21) 
The righthand side expression in (21) is recognized as the product of Generalized Inverse Gaussian (GIG) pdfs [17] with order . Observe that the computation of in (19) requires evaluating for all
. Luckily, the moments of the GIG distribution are given in closed form for any
[17]:(22) 
V3 Update of
The update of is proportional to the product of messages and :
(23) 
Clearly, factorizes as a product of gamma pdfs, one for each individual entry in . The first moment of used in (22) is easily computed as
(24) 
Naturally, is only computed for VMP3L.
V4 Update of
Vi Numerical Results
Sampling time,  32.55 ns 

CP length  4.69 s / 144 
Subcarrier spacing  15 kHz 
Pilot pattern  Equally spaced, QPSK 
Modulation  QPSK 
Subcarriers,  1200 
Pilots,  100 
OFDM symbols  1 
Information bits  727 
Channel interleaver  Random 
Convolutional code  
Decoder  BCJR algorithm [19] 
We perform Monte Carlo simulations to evaluate the performance of the two versions of the derived VMP algorithm in Section V. We consider a scenario inspired by the 3GPP LTE standard [20] with the settings specified in Table I. The multipath channel (6) is based on the model used in [21] where, for each realization of the channel, the total number of multipath components
is Poisson distributed with mean of
and the delays ,, are independent and uniformly distributed random variables drawn from the continuous interval [0, 144
] (corresponding to the CP length). The th nonzero component conditioned on the delayhas a zeromean complex circular symmetric Gaussian distribution with variance
and parameters .^{3}^{3}3The parameter is computed such that , where is the joint pdf of the parameters of the channel model. In the considered simulation scenario, , , and (the decay rate).To initialize the VMP algorithm we set and equal to the inverse of the sample variance of and the inverse number of columns respectively. Furthermore, we let in (25), which corresponds to the Jeffreys noninformative prior for . Once the initialization is completed, the algorithm sequentially updates the auxiliary pdfs , , , and until convergence is achieved. Obviously, is only updated for VMP3L, whereas for VMP2L the entries in are set to . For both versions we select and for VMP3L we set and , . Finally, the dictionary is specified by pilot subcarriers and a total of columns (corresponding to the choice and in (9)).
The VMP is compared to a classical OFDM channel estimator and two stateoftheart sparse estimation schemes. Specifically, we use as benchmark the robustlydesigned Wiener Filter (RWF) [22], the relevance vector machine (RVM) [10], [11],^{4}^{4}4The software is available online at http://dsp.ucsd.edu/~dwipf/. and the sparse reconstruction by separable approximation (SpaRSA) algorithm [23].^{5}^{5}5The software is available online at http://www.lx.it.pt/~mtf/SpaRSA/ The RVM is an EM algorithm based on the 2L prior model of the studentt pdf over each , whereas SpaRSA is a proximal gradient method for solving (2). In case of the SpaRSA algorithm the regularization parameter needs to be set. In all simulations, we let , which leads to good performance in high signaltonoise ratio (SNR) regime.
The performance is compared with respect to the resulting biterrorrate (BER) and meansquared error (MSE) of the estimate versus the SNR (). In addition, in order to quantify the necessary pilot overhead, we evaluate the MSE versus the number of available pilots . Hence, in this setup is no longer fixed as in Table I.
In Fig. 4(a) we compare the BER performance of the different schemes. We see that VMP3L outperforms the other schemes across all the SNR range considered. Specifically, at 1 % BER the gain is approximately 2 dB compared to VMP2L and RVM and 3 dB compared to SpaRSA and RWF. Also VMP2L achieves lower BER in the SNR range 0  12 dB compared to RVM and across the whole SNR range compared to SpaRSA and RWF.
The superior BER performance of the VMP algorithm is well reflected in the MSE performance shown in Fig. 4(b). Again VMP3L is a clear winner followed by VMP2L. The bad MSE performance of the SpaRSA for low SNR is due to the difficulty in specifying a suitable regularization parameter across a large SNR range.
We next fix the ratio between received symbol power and noise variance to 15 dB^{6}^{6}6Note that this value does not correspond with as represented in Fig. 4(a) and 4(b). The specific depends on the number of bits in an OFDM block, which in turn depends on the number of pilot symbols . and evaluate the MSE versus number of available pilots . The results are depicted in Fig. 4(c). Observe a noticeable performance gain obtained with VMP3L. In particular, VMP3L exhibits the same MSE performance as VMP2L and RVM using only approximately 85 pilots, roughly half as many as VMP2L and RVM. Furthermore, VMP3L, using this number of pilots, significantly outperforms SpaRSA and RWF using 200 pilots.
Vii Conclusion
In this paper, we proposed channel estimators based on sparse Bayesian learning. The estimators rely on Bayesian hierarchical prior modeling and variational message passing (VMP). The VMP algorithm effectively exploits the probabilistic structure of the hierarchical prior models and the resulting sparsityinducing priors. Our numerical results show that the proposed channel estimators yield superior performance in terms of biterrorrate and meansquared error as compared to other existing estimators, including the estimator based on the norm constraint. They also allow for a significant reduction of the amount of pilot subcarriers needed for estimating a given channel.
Acknowledgment
This work was supported in part by the 4GMCT cooperative research project funded by Intel Mobile Communications, Agilent Technologies, Aalborg University and the Danish National Advanced Technology Foundation. This research was also supported in part by the project ICT 248894 Wireless Hybrid Enhanced Mobile Radio Estimators (WHERE2).
References
 [1] R. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, July 2007.
 [2] E. J. Candes and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.
 [3] W. Bajwa, J. Haupt, A. Sayeed, and R. Nowak, “Compressed channel sensing: A new approach to estimating sparse multipath channels,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1058–1076, Jun. 2010.
 [4] G. Taubock and F. Hlawatsch, “A compressed sensing technique for OFDM channel estimation in mobile environments: Exploiting channel sparsity for reducing pilots,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing ICASSP 2008, 2008, pp. 2885–2888.
 [5] C. R. Berger, S. Zhou, J. C. Preisig, and P. Willett, “Sparse channel estimation for multicarrier underwater acoustic communication: From subspace methods to compressed sensing,” IEEE Trans. on Sig. Proc., vol. 58, no. 3, pp. 1708–1721, 2010.
 [6] J. Huang, C. R. Berger, S. Zhou, and J. Huang, “Comparison of basis pursuit algorithms for sparse channel estimation in underwater acoustic OFDM,” in Proc. OCEANS 2010 IEEE  Sydney, 2010, pp. 1–6.
 [7] D. Shutin and B. H. Fleury, “Sparse variational Bayesian SAGE algorithm with application to the estimation of multipath wireless channels,” IEEE Trans. on Sig. Proc., vol. 59, pp. 3609–3623, 2011.
 [8] R. Tibshirani, “Regression shrinkage and selection via the LASSO,” J. R. Statist. Soc., vol. 58, pp. 267–288, 1994.
 [9] S. S. Chen, D. L. Donoho, Michael, and A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal on Scientific Computing, vol. 20, pp. 33–61, 1998.

[10]
M. Tipping, “Sparse Bayesian learning and the relevance vector machine,”
J. of Machine Learning Res.
, vol. 1, pp. 211–244, June 2001.  [11] D. Wipf and B. Rao, “Sparse Bayesian learning for basis selection,” IEEE Trans. on Sig. Proc., vol. 52, no. 8, pp. 2153 – 2164, aug. 2004.

[12]
D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, “The variational approximation for Bayesian inference,”
IEEE Signal Process. Mag., vol. 25, no. 6, pp. 131–146, November 2008.  [13] N. L. Pedersen, D. Shutin, C. N. Manchón, and B. H. Fleury, “Sparse estimation using Bayesian hierarchical prior modeling for real and complex models,” submitted to IEEE Trans. on Sig. Proc., 2012, arXiv:1108.4324v1.
 [14] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, 1972.
 [15] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, “Factor graphs and the sumproduct algorithm,” IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 498–519, Feb 2001.
 [16] J. Winn and C. M. Bishop, “Variational message passing,” J. Mach. Learn. Res., vol. 6, pp. 661–694, 2005.
 [17] B. Jorgensen, Statistical Properties of the Generalized Inverse Gaussian Distribution (Lecture Notes in Statistics 9). SpringerVerlag New York Inc, 1982.
 [18] The iterative solutions coded modulation library. [Online]. Available: http://www.iterativesolutions.com
 [19] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Trans. on Inform. Theory, vol. 20, no. 2, pp. 284–287, 1974.
 [20] 3rd Generation Partnership Project (3GPP) Technical Specification, “Evolved universal terrestrial radio access (eutra); base station (bs) radio transmission and reception,” TS 36.104 V8.4.0, Tech. Rep., 2008.
 [21] M. L. Jakobsen, K. Laugesen, C. Navarro Manchón, G. E. Kirkelund, C. Rom, and B. Fleury, “Parametric modeling and pilotaided estimation of the wireless multipath channel in OFDM systems,” in Proc. IEEE Int Communications (ICC) Conf, 2010, pp. 1–6.

[22]
O. Edfors, M. Sandell, J.J. van de Beek, S. K. Wilson, and P. O. Börjesson, “OFDM channel estimation by singular value decomposition,”
IEEE Trans. on Communications, vol. 46, no. 7, pp. 931–939, 1998.  [23] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, “Sparse reconstruction by separable approximation,” IEEE Trans. on Sig. Proc., vol. 57, no. 7, pp. 2479–2493, 2009.