 # Variational Bayesian Inference of Line Spectral Estimation with Multiple Measurement Vectors

In this paper, the line spectral estimation (LSE) problem with multiple measurement vectors (MMVs) is studied utilizing the Bayesian methods. Motivated by the recently proposed variational line spectral estimation (VALSE) method, we extend it to deal with the MMVs setting, which is especially important in array signal processing. The VALSE method can automatically estimate the model order and nuisance parameters such as noise variance and weight variance. In addition, by approximating the probability density function (PDF) of the frequencies with the mixture of von Mises PDFs, closed-form update equation and the uncertainty degree of the estimates can be obtained. Interestingly, we find that the VALSE with MMVs can be viewed as applying the VALSE with single measurement vector (SMV) to each snapshot, and combining the intermediate data appropriately. Furthermore, the proposed prior distribution provides a good interpretation of tradeoff between grid and off-grid based methods. Finally, numerical results demonstrate the effectiveness of the VALSE method, compared to the state-of-the-art methods in the MMVs setting.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Line spectral estimation (LSE), i.e., recovering the parameters of a superposition of complex exponential functions is one of the classical problems in signal processing fields , which has many applications such as channel estimation in wireless communications [3, 4], direction of arrival estimation in radar systems , speech analysis and so on. Traditional methods for solving the LSE problem include periodogram, MUSIC, ESPRIT and maximum likelihood (ML) method [2, 6, 7, 8]. For periodogram method, it is difficult to recover the closely separated frequencies . While for subspace methods such as MUSIC and ESPRIT which utilize the covariance matrix to estimate the frequencies, they perform well when the model order is known and the signal to noise ratio (SNR) is high. As for the ML methods, it involves maximizing the nonconvex function which has a multimodal shape with a sharp global maximum. Iterative algorithm is often proposed with accurate initialization to solve the ML problem [9, 10]. Given that the model order is unknown in applications, some criterions such as Akaike information criterion are adopted to estimate the model order .

In the past decades, sparse methods for LSE have been popular due to the development of sparse signal representation and compressed sensing theory. By discretizing the continuous frequency into a finite set of grid points, the nonlinear problem can be formulated as a linear problem. optimization , sparse iterative covariance-based estimation (SPICE) [13, 14, 15], sparse Bayesian learning  are main sparse methods. Compared to classical methods, the grid based methods perform better by utilizing the sparsity in the spatial domain. Due to the grid mismatch, dictionary-based approaches suffer from spectral leakage. To mitigate the drawbacks of static dictionary, gridless methods have been proposed to gradually refine the dynamic dictionary, such as iterative grid refinement, joint sparse signal and parameter estimation [12, 17]. In , a Newtonalized orthogonal matching pursuit (NOMP) method is proposed, where a Newton step and feedback are utilized to refine the frequency estimation. In addition, the NOMP algorithm is also extended to deal with the MMVs setting . Compared to the incremental step in updating the frequencies in NOMP approach, the iterative reweighted approach (IRA)  estimates the frequencies in parallel.

To avoid the model mismatch issues, off-grid compressed sensing methods which work directly with continuously parameterized dictionaries have been proposed [21, 22, 23, 24, 25, 26]. For the SMV case, the atom norm based method has been proposed in the noiseless case . In [22, 23], the atom soft thresholding (AST) method is proposed in the noisy case. Since AST method requires knowledge of the noise variance, the gridless SPICE (GLS) method is proposed without knowledge of noise power . In , an exact discretization-free method called sparse and parametric approach (SPA) is proposed for uniform and sparse linear arrays, which is based on the well-established covariance fitting criterion. In , two approaches based on atomic norm minimization and structured covariance estimation are developed in the MMV case, and the benefit of including MMV is demonstrated. To further improve the resolution of the atom norm based methods, enhanced matrix completion (EMac)  and reweighted atomic-norm minimization (RAM)  are proposed and the resolution capability is improved numerically. These off-grid based methods involve solving a semidefinite programming (SDP) problem , whose computation complexity is prohibitively high for large-scale problems.

A different gridless approach is based on the Bayesian framework and sparse Bayesian learning (SBL) [30, 31] is adopted, where variational inference methods 

or maximization of the marginalized posterior probability density function (PDF)

 is performed. For all these approaches, only point estimates of the frequency are computed in each iteration, which is similar to the classical ML methods. Another limitation is that these methods usually overestimates the model order [32, 34]. In , a low complexity superfast LSE methods are proposed based on fast Toeplitz matrix inversion algorithm.

### I-a Main Contributions and Comparisons to Related Work

In , an off-grid variational line spectral estimation (VALSE) algorithm is proposed, where PDFs of the frequencies are estimated, instead of retaining only the point estimates of the frequencies. This more complete Bayesian approach allows to represent and operate with the frequency uncertainty, in addition to only that of the weights. Here we rigorously develop the variational Bayesian inference method for LSE in the MMVs setting, which is especially important in array signal processing. Meanwhile, the derived MVALSE reveals close relationship to the VALSE algorithm, which is suitable for parallel processing. We study the performance of the MVALSE method with von Mises prior PDFs for the frequencies. The prior information may be given from past experience, and is particularly useful when the SNR is low or few samples are available . For sequential estimation, the output of the PDF of the frequencies from the previous observations can be employed as the prior of the frequency, and sequential MVALSE (Seq-MVALSE) is proposed. Furthermore, substantial experiments are conducted to illustrate the competitive performance of the MVALSE method and its application to DOA problems, compared to other sparse based approaches.

### I-B Paper Organization and Notation

The rest of this paper is organized as below. Section II describes the signal model with MMV and introduces the probabilistic formulation. Section III develops the MVALSE algorithm and the details of the updating expressions are presented. In addition, the relationship between the VALSE and MVALSE are revealed, and the Seq-MVALSE is also presented. Substantial numerical experiments are provided in Section VI and Section VII concludes the paper.

Let be a subset of indices and denote its cardinality. For the matrix , let denote the submatrix by deleting the columns of indexed by . For the matrix and , let and denote the th row of and , respectively. Let and denote the submatrix by choosing the rows of and indexed by . For the matrix , let denote the submatrix by choosing both the rows and columns of indexed by . Let , and be the conjugate, transpose and Hermitian transpose operator of , respectively. Let

denote the identity matrix of dimension

. Let denote the Frobenius norm. denotes the indices excluding and returns the real part. Let

denote the complex normal distribution of

with mean and covariance , and let denote the von Mises distribution of with mean direction and concentration parameter . For a vector , let denote the number of nonzero elements, and sometimes we let or denote its th element. Similarly, let or denote the th element of , and let and denote the th row and th column of , respectively.

## Ii Problem Setup

For line spectral estimation problem with snapshots, the measurements consist of a superposition of complex sinusoids corrupted by the additive white Gaussian noise (AWGN) , which is described by

 Y=K∑k=1a(˜θk)˜wTk+U, (1)

where is the number of measurements for each observation. The complex weights over the snapshots and the frequency of the th component are represented by and respectively . The elements of the noise are i.i.d. and , and .

Since the number of complex sinusoids is generally unknown, the measurements is assumed to consist of a superposition of known components with , i.e.,

 Y=N∑i=1a(θi)wTi+U=AW+U, (2)

where , denotes the column of , denote the th row of . Since , the binary hidden variables are introduced and the probability mass function is , where and

 p(si;ρ)=λsi(1−λ)(1−si). (3)

We assume , where

follows a Bernoulli-Gaussian distribution

 p(wi|si;τ)=(1−si)δ(wi)+siCN(wi;0,τIL), (4)

where is the Dirac delta function. From (3) and (4), it can be seen that controls the probability of the th component being active. The prior distribution of the frequency is , where is encoded through the von Mises distribution [37, p. 36]

 p(θi)=VM(θi;μ0,i,κ0,i)=12πI0(κ0,i)eκ0,icos(θ−μ0,i), (5)

where and are the mean direction and concentration parameters of the prior of the th frequency , is the modified Bessel function of the first kind and the order [37, p. 348]. Note that corresponds to the uninformative prior distribution .

For measurement model (2), the likelihood is

 p(Y|AW;ν)=∏i,jCN(Yij;[AW]i,j,ν). (6)

Let and be the model and estimated parameters. Given the above statistical model, the type II maximum likelihood (ML) estimation of the model parameters is

 ^βML=argmaxβ ∫p(Y,Φ;β)dsdWdθ, (7)

where . Then the minimum mean square error (MMSE) estimate of the parameters is

 ^ΦMMSE=E[Φ|Y;^βML], (8)

where the expectation is taken with respect to the PDF

 p(Φ|Y;^βML)∝p(Y|AW;^νML)N∏i=1p(θi)p(wi|si;^τML)p(si;^ρML). (9)

However, computing both the ML estimate of (7) and the MMSE estimate of (8) are intractable. Thus an iterative algorithm is designed in the following.

## Iii MVALSE Algorithm

In this section, a mean field variational Bayes method is proposed to find an approximate PDF by minimizing the Kullback-Leibler (KL) divergence [38, p. 732]

 KL(q(Φ|Y)||p(Φ|Y))=∫q(Φ|Y)lnq(Φ|Y)p(Φ|Y)dθdWds. (10)

For any assumed PDF , the log marginal likelihood (model evidence) is [38, pp. 732-733]

 lnp(Y;β)=KL(q(Φ|Y)||p(Φ|Y))+L(q(Φ|Y);β), (11)

where

 L(q(Φ|Y);β)=Eq(Φ|Y)[lnp(Y,Φ;β)q(Φ|Y)]. (12)

For a given data , is a constant, thus minimizing the KL divergence is equivalent to maximizing in (11). Therefore we maximize in the sequel.

For the factored PDF , the following assumptions are made:

• Given , the frequencies are mutually independent.

• The posterior of the binary hidden variables has all its mass at , i.e., .

• Given and , the frequencies and weights are independent.

As a result, can be factored as

 q(Φ|Y)=N∏i=1q(θi|Y)q(W|Y,s)δ(s−ˆs). (13)

Due to the factorization property of (13), the frequencies can be estimated from the marginal distribution as [37, pp. 26]

 ˆθi=arg(Eq(θi|Y)[ejθi]), (14a) ˆai=Eq(θi|Y)[a(θi)], i∈{1,...,N}, (14b)

where returns the angle. In Section III-A, is approximated as a von Mises distribution. For von Mises distribution (5), . Therefore, is also the mean direction of for von Mises distribution. Besides, 111As for , the magnitudes of the elements of are less than . An alternative approach is to assume the following posterior PDF which corresponds to the point estimates of the frequencies, and let be , which yields the VALSE-pt algorithm . Numerical results show that the performance of VALSE-pt is slightly worse than that of VALSE algorithm . Here we use (14b) to estimate ..

Given that , the posterior PDF of is

 q(W|Y)=∫q(W|Y,s)δ(s−ˆs)ds=q(W|Y;ˆs). (15)

For the given posterior PDF , the mean and covariance of the weights are estimated as

 ˆwi=Eq(W|Y)[wi], (16a) ˆCi,j=Eq(W|Y)[wiwHj]−ˆwiˆwHj, i,j∈{1,...,N}. (16b)

Let be the set of indices of the non-zero components of , i.e.,

 S={i|1≤i≤N,si=1}.

Analogously, is defined based on . The model order is estimated as the cardinality of , i.e.,

 ˆK=|ˆS|.

According to (2), the noise-free signal is reconstructed as

 ˆX=∑i∈ˆSˆaiˆwTi.

Maximizing with respect to all the factors is also intractable. Similar to the Gauss-Seidel method , is optimized over each factor , and separately with the others being fixed. Let be the set of all latent variables. Maximizing (12) with respect to the posterior approximation of each latent variable yields [38, pp. 735, eq. (21.25)]

 lnq(zd|Y)=Eq(z∖zd|Y)[lnp(Y,z)]+const, (17)

where the expectation is with respect to all the variables except and the constant ensures normalization of the PDF. In the following, we detail the procedures.

### Iii-a Inferring the frequencies

For each , we maximize with respect to the factor . For , we have . According to (17), for , the optimal factor can be calculated as

 lnq(θi|Y)= Eq(z∖θi|Y)[lnp(Y,Φ;β)]+const. (18)

In Appendix VIII-A, it is shown that

 q(θi|Y)∝p(θi)(a)exp(Re{ηHia(θi)})(b), (19)

where the complex vector is given by

 ηi=2ν⎛⎜⎝Y−∑j∈ˆS∖{i}ˆajˆwTj⎞⎟⎠ˆw∗i−2ν∑j∈ˆS∖{i}tr(ˆCj,i)ˆaj (20)

for , and otherwise, which is consistent with the results in [34, equ. (17)] for the SMV case. In order to obtain the approximate posterior distribution of , as shown in the next subsection, (14b) needs to be computed. While it is hard to obtain the analytical results for the PDF (19

from  is used to obtain a von Mises approximation. For the second frequency, the prior can be similarly chosen from the set with the first selected prior being removed. For the other frequencies, the steps follow similarly.

It is worth noting that for the prior distribution (5), when tends to infinity, , where denotes the Dirac delta function. Consequently, the signal model (2) is a sum over deterministic frequencies , i.e., . Thus, in this case, the MVALSE algorithm is a complete grid based method. When , corresponding to the uninformative prior, the VALSE is a complete off-grid based method. Thus, by varying , the prior of the VALSE algorithm provides a trade-off between grid method and off-grid method.

### Iii-B Inferring the weights and support

Next are fixed and is maximized w.r.t. . Define the matrices and as

 (21a) H=ˆAHY, (21b)

where denotes the th element of .

According to (17), can be calculated as

 = Eq(θ|Y)[N∑i=1lnp(si)+lnp(W|s)+lnp(Y|θ,W)]+const = ||s||0lnλ1−λ+||s||0Lln1πτ−1τtr(WSWHS)+2νRe{tr(WHSHS)}−1νtr(WHSJSWS)+const = tr((WS−ˆWS)HˆC−1S,0(WS−ˆWS))+const, (22)

where

 ˆWS=ν−1ˆCS,0HS, (23a) ˆCS,0=(JSν+I|S|τ)−1. (23b)

From (13), the posterior approximation can be factored as the product of and . According to the formulation of (22), for a given , is a complex Gaussian distribution, i.e.,

 q(W|Y;ˆs)= 1(π||ˆs||0det(ˆCˆS,0))Lexp[−tr((WˆS−ˆWˆS)HˆC−1ˆS,0(WˆS−ˆWˆS))]∏i∉ˆSδ(wi) (24) = L∏l=1CN(wˆS,l;ˆwˆS,l,ˆCˆS,0)∏i∉ˆSδ(wi,l), (25)

where denotes the th column of . From (25), it can be seen that each column of is independent and is a complex Gaussian distribution. This is convenient for parallel execution, as described in Section IV.

To calculate , has to be given. Plugging the postulated PDF (13) in (12), one has

 lnZ(ˆs)≜L(q(θ,W,s|Y);ˆs)=Eq(θ,W,s|Y)[lnp(Y,θ,W,s;ˆs)q(θ,W,s|Y;ˆs)] = Eq(θ,W,s|Y)[N∑i=1lnp(si)+lnp(W|s)+lnp(Y|θ,W)−lnq(W|Y)]+const = −Llndet(JˆS+ντI|ˆS|)+||ˆs||0lnλ1−λ+ν−1tr(HHˆS(JˆS+ντI|ˆS|)−1HˆS)+||ˆs||0Llnντ+const. (26)

Thus should be chosen to maximize (26).

The computation cost of enumerative method to find the globally optimal binary sequence of (26) is , which is impractical for typical values of . Here a greedy iterative search strategy similar to  is proposed. For a given , we update it as follows: For each , calculate , where is the same as except that the th element of is flipped. Let . If , we update with the th element flipped, and is updated, otherwise is kept, and the algorithm is terminated. In fact, can be easily calculated and the details are provided in Appendix VIII-B.

Since each step increases the objective function (which is bounded) and can take a finite number of values (at most ), the method converges in a finite number of steps to some local optimum. If deactive is not allowed and is initialized as , then it can be proved that finding a local maximum of costs only steps. In general, numerical experiments show that steps is often enough to find the local optimum.

### Iii-C Estimating the model parameters

After updating the frequencies and weights, the model parameters is estimated via maximizing the lower bound for fixed . In Appendix VIII-C, it is shown that

 L(q(θ,W,s|Y);β)=Eq(θ,W,s|Y)[lnp(Y,θ,W,s;β)q(θ,W,s|Y)] = −1ν[||Y−ˆAˆSˆWˆS||2F+Ltr(JˆSˆCˆS,0)]−1τ[tr(ˆWˆSˆWHˆS)+Ltr(ˆCˆS,0)] + ||ˆs||0(lnλ1−λ−Llnτ)+Nln(1−λ)−MLlnν+const. (27)

Setting , , , we have

 ˆν =||Y−ˆAˆSˆWˆS||2F/(ML)+tr(JˆSˆCˆS,0)/M+∑i∈ˆSL∑l=1|ˆWil|2(1−||ˆai||22/M)/L, ˆλ =||ˆs||0N,ˆτ=tr(ˆWˆSˆWHˆS)+Ltr(ˆCˆS,0)L||ˆs||0. (28)

### Iii-D The MVALSE algorithm

Now the details of updating the assumed posterior have been given and summarized in Algorithm 1. For the proposed algorithm, the initialization is important for the performance of the algorithm. The schemes that we initialize , , and , are below.

First, initialize as , which can be simplified as the form similar to (19): By defining with cardinality and . Obviously . For each , by constructing as with , can be re-expressed as

 (29)

Then can be calculated. Since and (21) can be calculated. According to (23b) and (23a), is calculated. Then we update with . Following the previous steps, , and are all initialized. As for the model parameters , is used to build a Toeplitz estimate of . Let

be the average of the lower quarter of the eigenvalues of that matrix, and

is initialized as , where the active probability is initialized as .

The complexity of MVALSE algorithm is dominated by the two steps : the maximization of and the approximations of the posterior PDF by mixtures of von Mises PDFs. For the maximization of , if is initialized such that and deactive is not allowed, it can be proved that the greedy iterative search strategy needs at most steps to converge. For the general case where deactive is allowed, numerical experiments show that steps is enough to converge. For each step, the computational complexity is due to the matrix multiplication. Therefore, the computational complexity is . For the approximations of the posterior PDF by mixtures of von Mises PDFs, the Heuristic method [34, subsection D of Section IV] is adopted and the computational complexity is . In conclusion, the dominant computational complexity of the MVALSE is with being the number of iterations as is close to .

## Iv MVALSE with Parallel Processing

The MVALSE Algorithm 1 is compared with the VALSE algorithm . The MMVs can be decoupled as SMVs. For each SMV, we perform the VALSE algorithm and obtain according to [34, eq. (17)] for the th snapshot, i.e.,

 ηi,l=2ν⎛⎜⎝yl−∑j∈ˆS∖{i}ˆaj[ˆwTj]l⎞⎟⎠[ˆw∗i]l−2ν∑j∈ˆS∖{i}[ˆCj,i]l,lˆaj, (30)

where denotes the th element of , denotes the th element of . From (20), is the sum of for all the snapshots, i.e., , and now each is updated as . We use to obtain estimates and . In addition, we update the weights and their covariance (23) by applying the SMV VALSE. Let be the estimated weights of the th snapshot, the whole weight matrix (23) can be constructed as . It is worth noting that equation (25) reveals that for different snapshots, the weight vectors are uncorrelated. Besides, the covariance of the weights for each snapshot is the same, which means that the common covariance of the weight can be fed to the SMV VALSE. For updating under the active case, according to [34, equ. (40)], the changes for the th snapshot is

 Δk,l=lnvkτ+|[uk]l|2vk+lnλ1−λ. (31)

Thus (39) can also be expressed as

 Δk=L∑l=1Δk,l−(L−1)lnλ1−λ, (32)

which can be viewed as a sum of the results from the VALSE in SMVs, minus an additional constant term . Similarly, for the deactive case, (42) can be viewed as a sum of the results (equation (44) in ) from the VALSE in SMVs, plus an additional constant term . The additional constant terms can not be neglected because we need to determine the sign of (39) and (42) to update . For the th snapshot, running the VALSE algorithm yields the model parameters estimates

 ˆνl =||yl−ˆAˆS[ˆWˆS]:,l||2/M+tr(JˆSˆCˆS,0)/M+∑i∈ˆS|ˆwil|2(1−||ˆai||