## I Introduction

The problem of line spectral estimation (LSE) [1], i.e. extracting the parameters of a superposition of complex exponential functions from noisy measurements is fundamental in numerous disciplines in engineering, physics, and natural sciences. To quote a few examples, solutions to this problem have applications to range and direction estimation in sonar and radar, channel estimation in wireless communications, speech analysis, spectroscopy, molecular dynamics, power electronics, geophysical exploration.

In LSE, the original signal is a superposition of complex sinusoids, i.e.,

(1) |

where and are the complex amplitude and (angular) frequency, respectively, of the

th component. We are given the vector

containing noisy measurements of those components of with indices in , . Defining the function , and the vector representing additive noise, we write(2) |

The problem of LSE involves estimating the number of sinusoidal components, also referred to as model order selection, and their associated parameters and . Even if the model order is given, LSE is still nontrivial because of the nonlinear dependency of (2) on the frequencies.^{1}^{1}1When and the frequencies are given, the complex amplitudes can be easily estimated with the linear least-squares method.

### I-a Prior Work

Under the assumption of known , the ’s are traditionally estimated using the maximum-likelihood (ML) technique or subspace methods, such as [2, 3]. The ML method involves the hard task of maximizing a nonconvex function that has a multimodal shape with a sharp global maximum. The maximizer is typically searched using iterative algorithms (e.g., [4, 5, 6]) which, however, require accurate initialization and, at best, are guaranteed to converge to a local optimum. Nonetheless, the performance of the ML technique is superior to that of subspace methods, the difference being evident especially when the sample size or alternatively the signal-to-noise ratio (SNR) are small. Since is typically unknown in practice, the model order is conventionally selected based on an information criterion, which comprises a data term representing the fitting error and a penalty term that increases with the model order (see [7] and references therein). Assuming a range of potential model orders, the parameters corresponding to each possible order are estimated using, e.g., one of the aforementioned methods. Finally, the tradeoff between fitting error and model complexity is made by selecting the configuration that minimizes the criterion. Scanning a range of model orders can be computationally expensive. Also, in non-asymptotic regimes (particularly limited or SNR), information criteria tend to provide a wrong model order. A comprehensive review of classical approaches can be found in [1].

A more recent approach to LSE is dictionary-based model estimation, see [8] and the references therein. In this approach, nonlinear estimation of the frequencies is avoided by discretizing the range into a finite set (grid) of samples that represent the candidate frequency estimates. The signal model (2) is then approximated with a linear system comprising a so-called dictionary matrix (whose columns are given by evaluated at the grid samples) and a vector of weights. Thus, the original nonlinear problem is replaced by a linear inverse problem to which a sparse solution is sought. The nonzero entries of the sparse estimate of the weight vector encode the model order and parameter estimates. There is a plethora of techniques that can be used for sparse signal representation, see the detailed survey [9]. However, restricting the candidate frequency estimates to a discrete grid induces spectral leakage due to the model mismatch. Consequently, can admit only an approximately sparse representation (or may be even incompressible) in a finite dictionary [10, 11]. On the one hand, a denser grid provides a better sparse approximation and higher accuracy of frequency estimation. On the other hand, increasing the grid density makes the dictionary columns highly coherent, which might affect the sparse reconstruction capability, and boosts the computational complexity. To alleviate the mismatch issues, several approaches are conceived, e.g.: in [11], the concept of structured sparsity is utilized to inhibit closely-spaced frequency estimates; the method in [12]

starts with a coarse grid and heuristically iterates between estimating the weights and placing a finer grid around the location of the non-zero weight estimates; in

[13, 14, 15, 16], a less fine grid is used as a baseline and the dictionary matrix is modified to include auxiliary interpolation functions.

In the quest for gridless methods which work directly with continuously parameterized dictionaries, i.e., dictionaries whose parameter ranges in , several works depart from using a static dictionary given by a fixed grid. By including the parameters that dictate the dictionary in the estimation problem, they obtain dynamic dictionary algorithms in which the candidate frequencies and hence the dictionary columns are gradually refined. In [8], two such algorithms are designed based on the regularized least squares objective by adding a penalty term to prohibit closely spaced frequencies and respectively imposing a hard constraint on the minimum distance between frequencies. The algorithms approximately solve the involved nonlinear estimation and still require an initial grid [8]. A different line of works adopts the Bayesian framework and augments the probabilistic model of sparse Bayesian learning (SBL) [17, 18] to incorporate the candidate frequencies. In SBL, a sparse weight vector is promoted by selecting a parameterized/hierarchical prior model for its entries [17, 18]. Estimation in the augmented model is performed using variational inference methods [19, 20, 21] or maximization of the marginalized posterior pdf [22]. Common to all existing SBL-based approaches is that they restrict to compute point estimates of the frequencies (i.e., MAP/ML estimates), which implies nontrivial maximization of highly multimodal functions (similar to classical ML frequency estimation) in each iteration. The maximization is accomplished approximately by using a grid followed by refinement with Newton’s method or interpolation. Another limitation is that, while providing good reconstruction performance, the SBL-based methods reportedly overestimate the model order, i.e., they consistently output additional spurious components (artifacts) of small power [19, 21].

A different gridless approach that avoids the frequency discretization issues is based on the atomic norm (equivalently, the total variation norm), which is the continuous analog of the norm and allows for working with an infinite, continuous dictionary. In this way, it is shown that for the noiseless case the frequencies can be perfectly recovered from complete data () [23] or incomplete data () [24], as long as they are well separated. In [25], the atomic norm soft thresholding (AST) method, which solves a convex program, is proposed for LSE from noisy, complete data. AST is generalized to the incomplete data case in [26]

. Given that AST requires knowledge of the noise variance, the grid-based SPICE method

[27] (which minimizes a covariance matrix fitting criterion) is extended in [26] to gridless SPICE (GLS). GLS is applicable to both complete and incomplete data cases without knowledge of the noise power and is equivalent to atomic-norm-based methods; however, it has the limitations of frequency splitting and inaccurate model order estimation [26]. To overcome the two drawbacks, a GLS-based framework is proposed in [26], in which: GLS is used as a method to estimate the covariance matrix of , based on which the model order is selected using the SORTE algorithm [28] and the frequencies are estimated with MUSIC [2].An important limitation of atomic-norm-based techniques is that they require the frequencies be sufficiently separated in order to be recovered. Enhanced matrix completion (EMaC) [29] and reweighted atomic-norm minimization (RAM) [30] are two recent algorithms that are reported to improve the resolution capability of atomic-norm methods.

### I-B Contribution

In this paper, we propose a method for LSE from the measurement model (2) by following the approach of sparse Bayesian inference including estimation of continuous-valued frequencies. The key development that sets our work apart from the related methods [19, 20, 21, 22] is that, instead of retaining only point estimates of the frequencies, we seek a more complete Bayesian treatment by estimating pdfs of the frequencies and computing expectations over them. Our basic motivation is that, in general, a fully Bayesian approach is expected to show benefits, especially in the situations where sample sizes or SNRs are limited. The fully Bayesian approach naturally allows for representing and operating with the uncertainty of the frequency estimates, in addition to only that of the weights as considered so far. In particular, our approach involves computing expectations of , rather than just evaluating the phasor at a certain point estimate. The uncertainty impacts all other estimates and also the criterion for accepting a component in the estimated model (through the estimates involved) and therefore the model order estimate. Our results show that accounting for the uncertainty of frequency estimates with the fully Bayesian approach proves to be essential for improving model estimation performance. A second distinction from related works is that we employ a Bernoulli-Gaussian hierarchical model for the weights [31] instead of the typical SBL prior model [17, 18]. By analyzing the component-acceptance criteria induced by the two models, we observe that the Bernoulli-Gaussian model is more resilient to insertion of small spurious components.

We provide our probabilistic formulation of LSE in Section II. Since exact inference in the proposed model requires computations that do not admit closed-form analytical expressions, we take the variational approach^{2}^{2}2

Variational methods are deterministic inference techniques which provide analytical approximations of posterior pdfs, unlike the stochastic method of Markov chain Monte Carlo (MCMC) sampling. The convergence of MCMC methods can be prohibitively slow and difficult to diagnose. MCMC sampling has been previously used for LSE, see

[32] and the references therein. to: compute approximate posterior pdfs of the frequencies and weights; attempt MAP detection of the binary vector of the hierarchical model; and target ML estimation of the noise variance and parameters of the Bernoulli-Gaussian model. The variational optimization problem consists in maximizing a lower bound on the model evidence over the pdfs and parameters of interest. In Section III, we derive implicit expressions for local maximizers, which are to be updated iteratively. To enable closed-form expectations over the approximate pdfs of the frequencies, we show in Section IV that these pdfs can be very well represented by mixtures of von Mises pdfs (see also Appendices B and C). In Section V, we propose a specific initialization and schedule of iterations that define the variational LSE (VALSE) algorithm. VALSE has several attractive features: it is fully automated (i.e., does not include parameters to be tuned, as all necessary parameters are learned from the data); it converges because each step increases the lower bound on the model evidence; it has the ability to easily incorporate prior knowledge about the frequencies (through a von Mises pdf or a mixture of such pdfs); it provides posterior distributions based on which uncertainty in LSE can be quantified. In Section VI the performance of VALSE is evaluated and compared against state-of-art methods through computer simulations. Finally, Section VII concludes the paper.## Ii Bayesian Formulation and Variational Approach

Given the difficulty of not knowing the model order in (2), for the design of our Bayesian estimator we propose a probabilistic model consisting of a superposition of (i.e. the dimension of the original signal in (1)) complex sinusoids that have random frequencies and weights. Since we want that eventually only of those components have nonzero weights, we use a sparsity-promoting prior model for the weights. Inference in the following model ideally would recover the true frequencies and corresponding nonzero weights and yield zero weights for the excessive components. Concretely, we assume that the measurement vector is a realization of a random process described by

(3) |

The complex weights are governed by independent Bernoulli variables such that the elements of are independent and

has a Bernoulli-Gaussian distribution. That is,

(4) |

and . Since implies that , the probability controls how likely it is for the th component to be “active” (i.e. its weight to be nonzero). In (4), has a zero-mean Gaussian pdf with variance .^{3}^{3}3While should model some prior knowledge about the amplitudes, for the design of our estimator we select a zero-mean Gaussian pdf mainly for computational convenience (see Sec. III-B). In fact, in the simulation experiments we generate the complex amplitudes in (1) from a distribution different than Gaussian.
In this paper, denotes the complex univariate/multivariate Gaussian pdf with mean and covariance . The frequencies have the prior pdf . As justified in Section IV, is a von Mises pdf, or a mixture of such pdfs if one wants to model a more sophisticated, possibly multimodal distribution; the lack of prior knowledge can be represented by setting the concentration parameter of the von Mises pdf to zero. We assume that the components of the noise are iid complex Gaussian with mean zero and variance , which gives the likelihood

(5) |

The model parameters are collectively denoted by .

We can relate model (3) to a sparse approximation problem in which, given the frequencies , is the dictionary matrix and we need to infer the weights from data samples. Using sparsity-promoting hierarchical models for is a common Bayesian approach to find sparse solutions to ill-posed problems in compressed sensing. While the Bayesian treatment of LSE [19, 20, 21, 22] typically uses the SBL prior model [17, 18], the Bernoulli-Gaussian model [31, 33] has not been used in the LSE context before. In the Bernoulli-Gaussian model, the binary vector represents the support of the weights . Contrary to the standard sparse estimation problem, in our context the dictionary is parameterized by the frequencies that are to be inferred as well.

We would like to compute mean and circular mean estimates of and , respectively, based on the posterior pdf

(6) |

In (6), the joint pdf in the numerator is the likelihood (5) times the prior pdfs defined above, i.e.

(7) |

while the denominator , called the model evidence (or marginal likelihood of ), is the marginal of the joint pdf and acts as a normalizing constant. Fig.1 illustrates the factor graph representation of (7).

The sought estimates unfortunately require operations (high-dimensional integrals, summation over possible values of ) that cannot be performed analytically. Therefore we use variational inference to compute a surrogate pdf that should approximate (6) well and at the same time enable tractable estimation.

The variational approach builds on the fact that, for any postulated pdf , the log model evidence can be expressed as [34, Ch. 10]

(8) |

The first term in (8

) is the Kullback-Leibler divergence of

from ,^{4}

^{4}4The KL divergence of a pdf from a pdf (both defined on some set ) is . while the functional reads

(9) |

Given that is constant w.r.t. and , minimizing the divergence is equivalent to maximizing and tightening it as a lower bound to the log model evidence. The KL divergence vanishes only when , in which case attains its maximum value, . Nonetheless, as we already mentioned, working with the posterior pdf (6) is intractable so we have to restrict the family of candidate pdfs.

We postulate that factors as

(10) |

That is, we assume that the frequencies are *a posteriori* independent (mutually and of the other variables).^{5}^{5}5The assumed factorization of is also referred to as a naïve mean field approximation. Furthermore, we consider that has all its mass at , i.e., , where the function equals when and otherwise. The simplifying restrictions define a family of pdfs and our goal is to search for the member which maximizes the lower bound .

The estimates of interest are computed from as follows. Since is an angle, its estimate is defined so as to give the mean direction of [35]:

(11) |

The estimates , are central in this work. Their magnitudes are with equality if, and only if is the Dirac delta distribution. A broad signifying high uncertainty gives a small magnitude, and vice versa. Those estimates with indices in give the elements of ; similarly, . The mean and covariance estimates of the weights are defined as

(12) |

Given that , the posterior pdf of is

(13) |

Intuitively, the closer is to , the better the estimates (11) and (12) approximate the estimates which we would have computed from (6), if we could. The forms of the pdfs and the support estimate in the r.h.s. of (II) result from maximizing the lower bound . When the parameters in are unknown, we target their ML estimates also by maximizing the lower bound to the log marginal likelihood .

Finally, based on and , we define the estimates of the quantities in the original superposition (1). Let be the set of indices of the non-zero components of , i.e.

Analogously, we define based on . The estimate of the model order is the cardinality of :

(14) |

We define the reconstructed signal as the expectation of the signal part in the r.h.s. of (3) over , which gives

(15) |

The components of and with indices in give the estimates of the frequencies and amplitudes in (1).

## Iii Solution to the Variational Optimization Problem

We now turn to maximizing the lower bound in (9) with of the form (II). Except for restricting to give probability one to some sequence , we do not impose any constraints on the forms of the factors in (II). That is, the forms of the approximate posterior pdfs result from variational optimization and are dictated by the likelihood (5) and prior pdfs. As maximizing over all factors simultaneously is not viable, we perform alternating optimization: is maximized over each of the factors , , , in turn while keeping the others fixed. Consequently, the form of each factor is implicit because it depends on the other factors.

Upon their initialization, we iteratively cycle through the factors and replace them one by one with a revised expression. Such a scheme is guaranteed to converge to some local maximum of [34, Ch. 10]. In the following we derive the factor expressions that correspond to the fixed-point of the scheme. A specific initialization and scheduling of updates are proposed in Sec. V.

### Iii-a Inferring the frequencies

For each , maximizing in (9) w.r.t. the factor gives [34, Ch. 10, p. 466]

where the expectation is taken over , the joint pdf is given by (7) and the constant ensures normalization of the pdf. We further write only the terms that depend on , i.e.

Plugging the Gaussian form of the likelihood (5) in the above expression and carrying out the required expectations, we finally obtain

(16) |

where the complex vector is given by

(17) |

when , and otherwise. The second factor in the r.h.s. of (16) is an approximation of the marginal likelihood of ; it is an extremely multimodal function, see Sec. IV. According to (17), the likelihood favors values of for which the angle between and the residual signal (after canceling the interference from the other components) is small.^{6}^{6}6The angle between two complex vectors and satisfies .
Interestingly, the likelihood corresponds to coherent estimation of from the residual signal when the weight is fixed to . At the same time, it penalizes (to an extent given by the cross-variance of the weights) values that result in small angle between and of the other components in the model. Naturally, when (i.e. ), only the prior information comes into play in (16).

### Iii-B Inferring the weights and support

We next maximize w.r.t. when , , are kept fixed. Since is restricted in (II) to give the marginal pmf , we cannot anymore use the factor-update expression corresponding to free-form optimization [34, Ch. 10, p. 466]. So we will explicitly carry out the maximization of .

Plugging the postulated pdf (II) in (9) we obtain

Let us introduce the pdf

where is given by (7) and is the normalizing constant obtained by integrating the exponential over . We can now write

(18) |

Inspecting (18), for any the maximum of over is attained when the KL divergence vanishes. Thus, has its maximum at

(19) |

To compute required for and in (19), we use (7), together with (5) and (4), and obtain an expression that is quadratic in , given that all are Gaussian. We define the matrix with elements and , , , and the vector . From (13) and (19), we obtain

where the mean and covariance matrix of the Gaussian posterior pdf of are

(20) |

The mean is the LMMSE estimate of assuming . For , the measurements are noninformative and, conveniently, , i.e., .

From (19), the sequence (which determines ) is the maximizer of

(21) |

Since maximizing the nonlinear function (III-B) is NP-hard, in Appendix A we propose a suboptimal procedure which is guaranteed to converge to a local maximum of .

According to Appendix A, a sinusoidal component (we drop the index

for the moment) is admitted only if the posterior mean

and variance of its weight (for ) satisfy(22) |

It is interesting to relate (22) to the test obtained in [36] for the SBL prior model of the weights [17, 18], where and are the mean and variance of the posterior divided by the prior. The SBL prior model is often used for estimating superimposed signals [19, 20, 21, 22] and, reportedly, the resulting estimators output additional spurious components (artifacts) of small power. Since can be viewed as an SNR of the component, the threshold can be heuristically increased such that a higher SNR is required [36, 19]. For the Bernoulli-Gaussian prior model we express (22) as

(23) |

where we used , which gives and . Thus, the threshold (23) is not constant but depends on , and also . The latter dependence makes the method more resilient to insertion of artifacts, because, as shown in Fig. 2, the threshold increases with smaller variance, unlike for the SBL model where it stays the same.

### Iii-C Estimating the model parameters

The noise variance is often unknown in practice. Also, it might be unclear how to set the parameters and of the Bernoulli-Gaussian prior model. We show that learning the parameters can be easily included in the variational approach.

The lower bound (9) now additionally depends on . We alternate between maximizing over for fixed to (according to the previous subsections) and over for fixed . In the latter step,

where we write only the term depending on . The joint pdf and the approximate posterior pdf are given by (7) and (II), respectively. Based on the forms of the likelihood (5) and prior pdfs defined in Sec. II, we obtain

We can carry out independently over each parameter. Equating the partial derivatives to zero gives unique solutions that correspond to the global maximum (the second-order derivatives are strictly negative). Specifically, we obtain

(24) |

Thus, takes into account not only the fitting error, but also the uncertainty of weight estimation (through ) and of frequency estimation (via ). Regarding the latter, the sharper , the closer is to and therefore the smaller the contribution to . For and we obtain the estimates

(25) |

Naturally, is given by the number of nonzero components of and is the averaged second-moment of the weights corresponding to those components.

## Iv Approximating by a mixture of von Mises pdfs

In this section, after providing some background on the von Mises distribution, we show that any pdf of the form , such as in (16), can be well represented by a mixture of von Mises pdfs (MVM). The proposed approximation enables easy computation of expectations over such pdfs. We exploit the MVM approximation in the initialization of our algorithm as well, since the exponential of the periodogram also has the said form.

### Iv-a The von Mises distribution

Among the distributions on the unit circle, the von Mises (VM) distribution is of significant importance, its role being similar to that of the Gaussian distribution on the line [35]. The pdf of the VM distribution of a random angle is

The parameters and are the mean direction and concentration parameter, respectively, and is the modified Bessel function of the first kind and order . The pdf is symmetrical about its single mode, which is at . The VM pdf can also be parameterized in terms of :

The properties of circular distributions are completely determined by the characteristic function,

, [35]. The characteristic function of the VM distribution is(26) |

The moments of circular distributions are the moments of , i.e., values of the characteristic function. The first moment of the VM distribution, , gives the mean direction and the mean resultant length .

The multiplication of two VM pdfs gives

(27) |

with . That is, the result is proportional to a VM pdf with mean direction and concentration . Thus, the family of VM pdfs is closed under multiplication.

### Iv-B The proposed MVM approximation

In the following, we drop the frequency index for convenience. We write (16) as

(28) |

where the entries of have the polar form . When the factor in (28) corresponding to is a constant, so we can just remove this index from . Also, when , the factor indexed by has the form of a von Mises (VM) pdf with mean direction and concentration . Furthermore, the factors indexed by have the form of -fold wrapped VM pdfs. Thus, we can write (28) as

(29) |

In Appendix B we show that a wrapped VM pdf can be very well approximated by an appropriate MVM obtained by matching their characteristic functions. Employing the result (52), we approximate each of the -fold wrapped VM pdfs in (29) by a mixture of VM pdfs, i.e.,

(30) |

where . The components of the MVM have equal amplitudes and concentrations. The value of the latter is the solution to

(31) |

where is the modified Bessel function of the first kind and order . We show in Appendix B that an approximate solution to the transcendental equation (31) can be easily found. The means , , are given by

(32) |

i.e., they are evenly distributed around the circle, apart. The higher the concentration parameter of the wrapped VM pdf, the better its approximation (30). As illustrated in Fig. 3, the approximation is very tight even for moderate values of the concentration and still good for small concentrations.

The proposed approximation enables us to exploit the fact that the family of VM pdfs is closed under multiplication. To that end, we conveniently choose the prior pdf of to be , with .^{7}^{7}7When we do not have any prior information about , we can set the concentration , in which case the prior pdf becomes the uniform circular pdf .^{8}^{8}8Alternatively, we can select an MVM prior, if we wish to use a multimodal distribution.
Replacing (30) in (29) we obtain that is an MVM. Specifically, let us write and define . Using the multi-index , we have

(33) |

with

(34) |

and the normalizing constant . We explicitly express (33) as an MVM where the amplitude, mean and concentration of each of the mixture’s components are given by the corresponding parameter :

(35) |

The number of components in (35) can be intractable. For the component with index to have an important contribution to (35), its amplitude and concentration must be high, i.e., be large. Based on the observation that only a small fraction of them contribute significantly to the mass of , in the following we propose two heuristic methods for representing (35) by a limited number of components.

### Iv-C Heuristic 1

The first heuristic is a greedy procedure aiming to find and represent by only the most dominant components in (35). The idea is to progressively construct an approximation of (28) by sweeping through the index set and including in the approximation one additional index in each step. In step , , we have a “partial” posterior pdf given by the factors in (28) with indices , i.e., only measurements are taken into account. The partial pdf is an MVM with components parameterized by . As outlined in Algorithm 1, in each step the heuristic procedure retains from the “partial” posterior (at most) components having the highest concentration parameters. The complexity of the greedy search is . The algorithm outputs the parameters in which give

(36) |

where . Now we can compute expectations in closed-form. Using (36) and (26), we obtain

Similarly, the frequency estimate defined in (11) is given by