## I. Introduction

The problem of spectral estimation studies how signal power is distributed over frequencies, and has rich applications in speech coding, radar and sonar signal processing, and many other areas. Suppose a discrete-time real-valued signal is observed at finitely many time points and contaminated with i.i.d. Gaussian noise. In common with all spectral models, we assume the signal can be represented as a linear combination of sinusoids, and aim to recover the spectrum of the signal at a desired resolution. However, the problem becomes very challenging when the required frequency resolution is high. In particular, the number of frequency levels at the desired resolution can be (much) greater than the sample size; this is referred to as super-resolution spectral estimation. For such discrete-time signals of finite length, the classical methods based on Fourier analysis [1] or the least-squares periodogram (LSP) [2, 3] suffer from power leakage and have very limited spectral resolution [1]. More recent algorithms, such as Burg [1], MUSIC [4], and RELAX [5], only alleviate the issue to some extent.

We assume that the signal is sparse in the frequency-domain, i.e., the number of its sinusoidal components is small relative to the sample size, referred to as the spectral sparsity. It is a realistic assumption in many applications (e.g., astronomy [6] and radar signal processing [7]), and makes it possible to apply the revolutionary compressed sensing (CS) technique. In [6], Chen and Donoho proposed the basis pursuit (BP) to handle overcomplete dictionaries and unevenly sampled signals. A number of similar works followed, see, e.g., [8, 9, 10, 11, 12, 13, 14, 15] and the references therein.

We point out two crucial facts that cannot be ignored in super-resolution spectrum reconstruction. (a) When the desired frequency resolution is very high, neighboring dictionary atoms become very similar and thus necessarily result in high coherence or collinearity. As is well known in the literature, the popular convex $\ell_1$ technique as used in the BP yields inconsistent frequency selection and suboptimal rates in estimation and prediction under such coherent setups [16, 17, 18, 19, 20]. (b) The grouping structure of the sinusoidal components is an essential feature in spectrum recovery: if a frequency $f$ is absent in the signal, the coefficients for $\cos(2\pi f t)$ and $\sin(2\pi f t)$ should *both* be zero.

In this paper we investigate super-resolution spectral recovery from a statistical perspective and propose a group iterative spectrum thresholding (GIST) framework to tackle the aforementioned challenges. GIST allows for (possibly nonconvex) shrinkage estimation and can exploit the pairing structure. Interestingly, we find that neither the $\ell_0$ nor the $\ell_1$ regularization is satisfactory for spectrum estimation, and advocate a hybrid $\ell_0+\ell_2$ type shrinkage estimation. Theoretical analysis shows that the new regularization essentially removes the stringent coherence requirement and can accommodate much lower SNR and higher coherence. Furthermore, a GIST variant provides a screening technique for supervised dimension reduction to deal with applications in ultrahigh dimensions. The rest of this paper is organized as follows. We formulate the problem from a statistical point of view and briefly survey the literature in Section II. In Section III, we propose the GIST framework, in more detail: a novel form of regularization, a generic algorithm for fitting group nonconvex penalized models, a data-resampling based model selection criterion, and a probabilistic spectral screening for fast computation. Experimental results are shown in Section IV. We summarize the conclusions in Section V. The technical details are left to the Appendices.

## II. Model Setup and the Super-Resolution Challenge

In this section, we introduce the problem of super-resolution spectrum estimation and review some existing methods from a statistical point of view.
Let $y(t_n)$, $1 \le n \le N$, be a real-valued signal contaminated with i.i.d. Gaussian noise. (We focus on real signals in this paper, but our methodology carries over to complex-valued signals; see Section V.)
The sampling time sequence $\{t_n\}$ is *not* required to be uniform (cf. [6]). In order to achieve super-resolution spectral recovery, an *overcomplete* frequency dictionary must be applied.
Concretely, we use a grid of $D$ evenly spaced frequencies $f_k = kW/D$, $1 \le k \le D$, to construct the sine and cosine frequency predictors, i.e., $\cos(2\pi f_k t)$ and $\sin(2\pi f_k t)$. Let $F$ denote the set of nonzero frequencies $\{f_1, \dots, f_D\}$.
The upper band limit $W$ can be taken as the Nyquist frequency or estimated based on the spectral window [21].
The cardinality $D$ of the frequency grid controls the frequency resolution, given by $W/D$.
The true spectra of the signal are assumed to be discrete for convenience, because the quantization error can always be reduced by increasing the value of $D$. The signal can then be represented by

$$y(t_n) = \sum_{k=1}^{D} \left[a_k \cos(2\pi f_k t_n) + b_k \sin(2\pi f_k t_n)\right] + e(t_n), \qquad 1 \le n \le N, \tag{1}$$

where the coefficients $a_k$, $b_k$ are unknown, and the noise terms $e(t_n)$ are i.i.d. Gaussian with zero mean and unknown variance $\sigma^2$. Traditionally, $D \le N/2$. But in super-resolution spectral analysis, $D$ can take a much larger value than $N$. This still results in a well-defined problem because only a few pairs $(a_k, b_k)$ are nonzero under the spectral sparsity assumption. From (1), with $\boldsymbol{t} = [t_1, \dots, t_N]^T$, we introduce for each $f_k$ the two column vectors $\cos(2\pi f_k \boldsymbol{t})$ and $\sin(2\pi f_k \boldsymbol{t})$ (applied componentwise) and define the predictor matrix

$$X = \left[\cos(2\pi f_1 \boldsymbol{t}), \dots, \cos(2\pi f_D \boldsymbol{t}), \sin(2\pi f_1 \boldsymbol{t}), \dots, \sin(2\pi f_D \boldsymbol{t})\right]. \tag{2}$$

(Some redundant or useless predictors can be removed in concrete problems; see (4).) Denote the coefficient vector by $\boldsymbol{\beta} = [a_1, \dots, a_D, b_1, \dots, b_D]^T$ and the intercept (zero-frequency component) by $\alpha$. Now the model can be formulated as a linear regression

$$\boldsymbol{y} = \alpha \boldsymbol{1} + X \boldsymbol{\beta} + \boldsymbol{e}, \tag{3}$$

where $\boldsymbol{\beta}$ is sparse and $\boldsymbol{e} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 I)$. In super-resolution analysis, $2D \gg N$, giving a small-sample-size, high-dimensional design. Linear analysis such as the Fourier transform fails for such an underdetermined system.

As a demonstration, we consider a noisy 'TwinSine' signal composed of two sinusoids at closely spaced frequencies, with 100 observations. Obviously, the frequency resolution needs to be fine enough to perceive and distinguish the two sinusoidal components with different coefficients, and the required dictionary size $D$ is then much larger than the sample size. The concrete design matrix (without the intercept) is given by

$$X = \left[\cos(2\pi f_1 \boldsymbol{t}), \dots, \cos(2\pi f_D \boldsymbol{t}), \sin(2\pi f_1 \boldsymbol{t}), \dots, \sin(2\pi f_{D-1} \boldsymbol{t})\right]. \tag{4}$$

The last sine atom disappears because $f_D$ equals the Nyquist frequency $1/2$ and all sampling times $t_n$ are integers, so that $\sin(2\pi f_D t_n) = \sin(\pi t_n) = 0$. This yields a super-resolution spectral estimation problem.
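As an illustration of the dictionary construction above, the following sketch builds an overcomplete sine/cosine design matrix in NumPy and drops numerically vanishing atoms. The unit sampling grid, the band limit $W = 1/2$, and the grid size $D = 250$ are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def make_dictionary(t, D, W=0.5):
    """Overcomplete sine/cosine dictionary on a grid of D evenly spaced
    frequencies f_k = k*W/D, k = 1..D (grid and band limit are
    illustrative assumptions)."""
    f = W * np.arange(1, D + 1) / D           # frequency grid
    phase = 2 * np.pi * np.outer(t, f)        # N x D phase matrix
    X = np.hstack([np.cos(phase), np.sin(phase)])
    # drop numerically zero atoms (e.g. sin(2*pi*(1/2)*t) at integer t)
    keep = np.linalg.norm(X, axis=0) > 1e-8
    return X[:, keep]

t = np.arange(1, 101)              # 100 uniform integer sampling times
X = make_dictionary(t, D=250)
# 2*250 atoms minus the vanishing last sine atom leaves 499 columns
```

With integer sampling times, only the sine atom at the Nyquist frequency vanishes, matching the dropped column in (4).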

There are many algorithms for identifying the spectrum of a discrete-time signal, but not all of them can super-resolve. From a modeling perspective, we classify them as nonsparse methods and sparse methods. Most classical methods (e.g., [3, 21, 2]) are nonsparse and assume no knowledge of the power spectra. For super-resolution spectrum estimation, they may seriously broaden the main lobes and introduce side lobes. *In this paper, we focus on sparse methods.* As mentioned above, one popular assumption for solving underdetermined systems is signal sparsity: the number of present frequency components is small relative to the number of samples. The problem is still NP-hard, because the frequency locations of the truly relevant sinusoidal components are unknown and the number of candidate components can be very large. In fact, the frequency grid used for constructing the dictionary can be made arbitrarily fine by the user.

Early attempts to enforce sparsity include greedy or exhaustive searches [22, 23] and genetic algorithms with a sparsity constraint [24]. Harikumar [25] computes the maximally sparse solutions under a constraint on the fitting error. A breakthrough is due to Chen and Donoho, who proposed the basis pursuit (BP) for spectrum estimation [6]; a number of similar works followed [11, 8, 9, 10]. BP is able to super-resolve for unevenly sampled signals. In our notation, the noiseless version of BP solves the convex optimization problem $\min_{\boldsymbol{\beta}} \|\boldsymbol{\beta}\|_1$ subject to $X\boldsymbol{\beta} = \boldsymbol{y}$. The noisy versions can be defined similarly, in a penalty or constraint form. The $\ell_1$-norm provides the tightest convex relaxation of the $\ell_0$-norm and achieves a sparse spectral representation of the signal within feasible time and cost.

In recent years, the power and limitations of this convex relaxation have been systematically studied in a large body of compressed sensing literature. In short, to guarantee good statistical performance in prediction, estimation, or model selection, the coherence of the system must be low, in terms of, e.g., mutual coherence conditions [16], the restricted isometry property (RIP) [17], and irrepresentable conditions [19], among others. For example, the RIP of order $s$ requires that for any index set $I$ with $|I| \le s$, there exists an RIP constant $\delta_s$ such that $(1-\delta_s)\|\boldsymbol{v}\|_2^2 \le \|X_I \boldsymbol{v}\|_2^2 \le (1+\delta_s)\|\boldsymbol{v}\|_2^2$ for all $\boldsymbol{v}$; when $\delta_s$ is small, any $s$ predictors in $X$ are approximately orthogonal. In theory, to guarantee the $\ell_1$-norm's effectiveness in statistical accuracy, frequency selection consistency, and algorithmic stability, such RIP constants have to be small in a noisy setup [17]. Similarly, the mutual coherence, defined as the maximum absolute value of the off-diagonal elements in the scaled Gram matrix $X^T X$, has to be very low [16].
Such theoretical results clearly indicate that the super-resolution challenge *cannot* be fully addressed by $\ell_1$-norm based methods, because many similar sinusoidal components may arise in the dictionary and bring in high coherence.

To enhance the sparsity of the BP, Blumensath and Davies proposed iterative hard thresholding (IHT) [12, 13]; see [14, 15] for some approximation methods. Intuitively, nonconvex penalties can better approximate the $\ell_0$-norm and yield sparser estimates than the convex $\ell_1$-penalty. On the other hand, we find that when the signal-to-noise ratio (SNR) is low and/or the coherence is high, the $\ell_0$ penalization may give an over-sparse spectral estimate and miss certain true frequency components. The high miss rates are due to the fact that the regularization operates through (hard) thresholding only, offering no shrinkage at all for nonzero coefficients. It therefore tends to kill too many predictors to achieve the appropriate extent of shrinkage, especially when the SNR is low. An inappropriate nonconvex penalty may seriously mask true signal components. This issue will be examined in the next section.

## III. GIST Framework

This section examines super-resolution spectrum estimation in detail. The complete group iterative spectrum thresholding (GIST) framework is introduced at the end.

### III-A A novel regularization form

In this subsection, we study a group penalized least-squares model and investigate the appropriate type of regularization.

The BP finds a solution to an underdetermined linear system with the minimum $\ell_1$ norm. When the signal is corrupted by noise as in (3), the following $\ell_1$-penalized linear model is more commonly used:

$$\min_{\alpha, \boldsymbol{\beta}} \; \frac{1}{2} \|\boldsymbol{y} - \alpha\boldsymbol{1} - X\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1, \tag{5}$$

where $\lambda$ is a regularization parameter providing a trade-off between the fitting error and the solution sparsity. The intercept or zero-frequency component is not subject to any penalty. To include more sparsity-enforcing penalties, we consider a more general problem in this paper, which minimizes

$$\frac{1}{2} \|\boldsymbol{y} - \alpha\boldsymbol{1} - X\boldsymbol{\beta}\|_2^2 + \sum_{i=1}^{2D} P(|\beta_i|; \lambda), \tag{6}$$

where $P(\cdot; \lambda)$ is a univariate penalty function parameterized by $\lambda$ and is possibly nonconvex.

Some structural information can be further incorporated in spectrum estimation.
From the derivation of (3), the absence of frequency $f_k$ in the signal implies $a_k = b_k = 0$, i.e., the sine and cosine coefficients at $f_k$ vanish *simultaneously*.
This pairing structure shows it is more reasonable to impose the so-called group sparsity or block sparsity [26, 27] on the coefficient pairs than the unstructured sparsity on individual coefficients. The group penalized model with the model design (2) minimizes

$$\frac{1}{2} \|\boldsymbol{y} - \alpha\boldsymbol{1} - X\boldsymbol{\beta}\|_2^2 + \sum_{k=1}^{D} P\left(\sqrt{a_k^2 + b_k^2}; \lambda\right). \tag{7}$$

(In the problem with the design matrix given by (4), the last sine predictor disappears and thus we always set its coefficient $b_D$ to be $0$.) The penalty function is the same as before and is allowed to be nonconvex. For ease of computation, the first term in (6) and (7) will be replaced by $\frac{1}{2k_0^2}\|\boldsymbol{y} - \alpha\boldsymbol{1} - X\boldsymbol{\beta}\|_2^2$ for some $k_0$ large enough; see the comment after Theorem 1.
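To make the grouped objective concrete, here is a small sketch (our own illustration, not the paper's code) that evaluates an objective of the form (7) by applying a generic penalty to the $\ell_2$ norm of each (cosine, sine) coefficient pair; the column layout and the group-lasso penalty in the example are assumptions.

```python
import numpy as np

def group_objective(y, X, beta, lam, P):
    """Grouped penalized loss: squared error plus a penalty P applied to
    the l2 norm of each (cosine, sine) coefficient pair.
    Assumed column layout: [cos_1..cos_D, sin_1..sin_D]."""
    D = X.shape[1] // 2
    pairs = np.hypot(beta[:D], beta[D:])      # ||(a_k, b_k)||_2 per frequency
    return 0.5 * np.sum((y - X @ beta) ** 2) + np.sum(P(pairs, lam))

# example penalty: group lasso, P(u; lam) = lam * u
P_l1 = lambda u, lam: lam * u
```

Swapping in a nonconvex `P` (e.g., a hard or hard-ridge penalty) changes only the penalty term, not the grouped structure.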

A crucial problem is then to determine the appropriate form of $P$ for regularization purposes.
The popular $\ell_1$-penalty may result in insufficient sparsity and relatively large prediction error, as shown in Section IV. There is still much room for improvement in super-resolution spectral estimation.
Before we proceed, it is worth pointing out that there are two objectives involved in this task:

Objective 1 (O1): *accurate* prediction of the signal at any new time point in the time domain;

Objective 2 (O2): *parsimonious* spectral representation of the signal in the Fourier domain.

O1+O2 complies with Occam's razor principle: the simplest way to explain the data is the best. A perfect approach must reflect both concerns to produce a stable sparse model with good generalizability.

From the perspective of O2, the $\ell_0$-norm constructs an ideal penalty

$$P_0(t; \lambda) = \frac{\lambda^2}{2} 1_{t \neq 0}, \tag{8}$$

where the indicator function $1_{t \neq 0}$ is $1$ when $t \neq 0$ and $0$ otherwise. Yet it is discrete and strongly nonconvex.
Interestingly, given any model matrix, the class of penalties $\kappa P_H + (1-\kappa) P_0$ for any $\kappa \in [0, 1]$ mimics the behavior of (8), where $P_H$, referred to as the *hard-penalty*, is defined by

$$P_H(t; \lambda) = \begin{cases} -t^2/2 + \lambda |t|, & |t| < \lambda \\ \lambda^2/2, & |t| \ge \lambda. \end{cases} \tag{9}$$

Based on [28], we can show that all such penalties, including the continuous penalty (9) ($\kappa = 1$) and the discrete penalty (8) ($\kappa = 0$), result in the same global minima in optimization. Fig. 1 illustrates the penalty family in a neighborhood around 0.

A different type of regularization is desirable for objective O1. Even if all truly relevant sinusoidal components could be successfully located, these atoms are not necessarily far apart in the frequency domain, and thus collinearity may occur. In statistical signal processing, Tikhonov regularization is an effective means of dealing with this singularity issue, which seriously affects estimation and prediction accuracy. It takes the form of an $\ell_2$-norm penalty

$$P_R(t; \eta) = \frac{\eta}{2} t^2, \tag{10}$$

also known as the ridge penalty in statistics. The necessity and benefit of introducing such shrinkage in multidimensional estimation date back to the famous James-Stein estimator [29]. Even when the purpose is detection, O1 plays an important role, because most parameter tuning methods are designed to reduce prediction error.

Taking both concerns into account, we advocate the following hybrid *hard-ridge* (HR) penalty as a fusion of (9) and (10):

$$P_{HR}(t; \lambda, \eta) = \begin{cases} -t^2/2 + \lambda |t|, & |t| < \lambda/(1+\eta) \\ \dfrac{\eta t^2}{2} + \dfrac{\lambda^2}{2(1+\eta)}, & |t| \ge \lambda/(1+\eta). \end{cases} \tag{11}$$

The hard portion induces sparsity for small coefficients, while the ridge portion, representing Tikhonov regularization, helps address the coherence of the design and compensates for noise and collinearity. In the following subsections, we will show that the hard-ridge penalty so defined also allows for ease of optimization and has better frequency selection performance.

Finally, we point out the difference between HR and the elastic net [30], which adds an additional ridge penalty to the lasso problem (5). The elastic net penalty, i.e., $\lambda \|\boldsymbol{\beta}\|_1 + \frac{\eta}{2} \|\boldsymbol{\beta}\|_2^2$, may over-shrink the model (referred to as the *double-shrinkage* effect [30]) and cannot enforce a higher level of sparsity than the $\ell_1$-penalty. In contrast, using a $q$-function trick [31], it can be shown that HR results in the same estimator as the '$\ell_0 + \ell_2$' penalty

$$P(t; \lambda, \eta) = \frac{\eta}{2} t^2 + \frac{\lambda^2}{2(1+\eta)} 1_{t \neq 0}. \tag{12}$$

The ridge part does not affect the discontinuity of this penalty at zero, and there is no double-shrinkage effect for nonzero coefficient estimates.

### III-B GIST fitting algorithm

We discuss how to fit the group penalized model (7) for a wide class of penalty functions. We assume both $\boldsymbol{y}$ and $X$ have been centered so that the intercept term vanishes in the model. Our main tool for tackling the computational challenge is the class of $\Theta$-estimators [32]. Let $\Theta(\cdot; \lambda)$ be an arbitrarily given threshold function, with $\lambda$ as the parameter, which is odd, monotone, and an unbounded shrinkage rule (see [32] for the rigorous definition). A group $\Theta$-estimator is defined to be a solution to

$$\boldsymbol{\beta} = \vec{\Theta}\left(\boldsymbol{\beta} + \frac{1}{k_0^2} X^T (\boldsymbol{y} - X\boldsymbol{\beta}); \lambda\right). \tag{13}$$

Here, for any coefficient pair $\boldsymbol{\xi} = (a_k, b_k)$, $\vec{\Theta}(\boldsymbol{\xi}; \lambda)$ is a $2$-dimensional vector satisfying $\vec{\Theta}(\boldsymbol{\xi}; \lambda) = \Theta(\|\boldsymbol{\xi}\|_2; \lambda)\, \boldsymbol{\xi} / \|\boldsymbol{\xi}\|_2$ for $\boldsymbol{\xi} \neq \boldsymbol{0}$, and $\vec{\Theta}(\boldsymbol{0}; \lambda) = \boldsymbol{0}$. In the simpler case when no grouping is assumed, the $\Theta$-estimator equation (13) reduces to the componentwise equation

$$\boldsymbol{\beta} = \Theta\left(\boldsymbol{\beta} + \frac{1}{k_0^2} X^T (\boldsymbol{y} - X\boldsymbol{\beta}); \lambda\right). \tag{14}$$

A $\Theta$-estimator is necessarily a $P$-penalized estimator provided that

$$P(t; \lambda) = \int_0^{|t|} \left(\Theta^{-1}(u; \lambda) - u\right) \mathrm{d}u + q(t; \lambda), \qquad \Theta^{-1}(u; \lambda) \triangleq \sup\{s : \Theta(s; \lambda) \le u\}, \tag{15}$$

holds for some nonnegative $q(\cdot; \lambda)$ satisfying $q(\Theta(s; \lambda); \lambda) = 0$ for any $s$ [28]. Based on this result, we can compute $P$-penalized estimators by solving (13) for an appropriate $\Theta$.

We establish the convergence of Algorithm 1 in the following theorem. For simplicity, assume that there is no intercept term in the model (which is reasonable when $\boldsymbol{y}$ and $X$ have both been centered), and that $k_0 \ge \|X\|_2$. Construct an energy function for any $(\boldsymbol{\beta}, \boldsymbol{\gamma})$ as follows:

$$G(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \frac{1}{2k_0^2} \|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2 + \Psi_P(\boldsymbol{\beta}) + \frac{1}{2} \|\boldsymbol{\beta} - \boldsymbol{\gamma}\|_2^2 - \frac{1}{2k_0^2} \|X(\boldsymbol{\beta} - \boldsymbol{\gamma})\|_2^2, \tag{16}$$

with the non-group and group versions of $\Psi_P(\boldsymbol{\beta})$ being $\sum_{i=1}^{2D} P(|\beta_i|; \lambda)$ and $\sum_{k=1}^{D} P(\sqrt{a_k^2 + b_k^2}; \lambda)$, respectively. Since $k_0 \ge \|X\|_2$, $G(\boldsymbol{\beta}, \boldsymbol{\gamma})$ is always greater than or equal to the (scaled) objective function $F(\boldsymbol{\beta})$ as defined in (7) or (6). This energy function can be used to prove the convergence of the iterates to a $\Theta$-estimator.

###### Theorem 1

For any $\lambda$ and a thresholding rule $\Theta$ satisfying (15), under the continuity assumption in Appendix A, Algorithm 1 in either group or non-group form converges, and the iterates $\boldsymbol{\beta}^{(j)}$ satisfy

$$G(\boldsymbol{\beta}^{(j+1)}, \boldsymbol{\beta}^{(j)}) \le G(\boldsymbol{\beta}^{(j)}, \boldsymbol{\beta}^{(j-1)}). \tag{17}$$

Furthermore, any limit point $\boldsymbol{\beta}^*$ of $\boldsymbol{\beta}^{(j)}$ is a group (or non-group) $\Theta$-estimator that satisfies (13) (or (14)), and the sequence $G(\boldsymbol{\beta}^{(j+1)}, \boldsymbol{\beta}^{(j)})$ decreases to the limit $F(\boldsymbol{\beta}^*)$ with $F$ defined in (7) (or (6)).

Applying the theorem to Algorithm 1, we know that the non-group form solves the optimization problem (6) and the group form solves (7), for any arbitrarily given $\lambda$ and $\eta$. Algorithm 1 is thus justified for computing a penalized spectrum estimate associated with $P$, provided that a proper $\Theta$ can be found to satisfy (15).

The $\Theta$-$P$ strategy covers most commonly used penalties, in either group or non-group form. We give some examples below. (i) When $\Theta$ is the soft-thresholding, the $P$-function according to (15) is the $\ell_1$-norm penalty used in BP, and the non-group version of our algorithm reduces to iterative soft thresholding [33]. The group $\ell_1$ penalty (called the group lasso [26]) is more suitable for frequency selection and can be handled by Algorithm 1 as well. (ii) When $\Theta$ is the hard-thresholding, for $q = 0$ we get the hard-penalty (9), and for $q(t; \lambda) = \frac{1}{2}(\lambda - |t|)^2 1_{0 < |t| < \lambda}$ we get the $\ell_0$-penalty (8). The algorithm, in non-group form, corresponds to iterative hard thresholding [12, 13]. (iii) Finally, if we define $\Theta$ to be the hard-ridge thresholding

$$\Theta_{HR}(t; \lambda, \eta) = \begin{cases} 0, & |t| < \lambda \\ t/(1+\eta), & |t| \ge \lambda, \end{cases} \tag{18}$$

then the associated $P$ is the hard-ridge penalty (11). Adding an appropriate $q$-function, we reach the $\ell_0 + \ell_2$ penalty (12). See [28] for more examples, such as SCAD, $\ell_p$ ($0 < p < 1$), and the elastic net.
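The three thresholding rules discussed above can be sketched as follows; the boundary convention at $|t| = \lambda$ is an implementation assumption.

```python
import numpy as np

def soft(u, lam):
    """Soft-thresholding; via (15) it corresponds to the l1 penalty."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def hard(u, lam):
    """Hard-thresholding; it corresponds to the hard/l0 penalty family."""
    return np.where(np.abs(u) >= lam, u, 0.0)

def hard_ridge(u, lam, eta):
    """Hard-ridge thresholding: zero out small entries and apply a
    ridge-type shrinkage t/(1+eta) to the survivors."""
    return np.where(np.abs(u) >= lam, u / (1.0 + eta), 0.0)
```

Note how the hard portion decides the support while the ridge factor $1/(1+\eta)$ supplies the shrinkage that plain hard-thresholding lacks.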

Algorithm 1 includes a relaxation parameter $\omega$, which is an effective means of accelerating convergence; see the recent work by Maleki and Donoho [34]. (Our relaxation form is novel and is of Type I based on [35].) In practice, with our default choice of $\omega$, the number of iterations can be reduced by about 40% in comparison with the nonrelaxed form.
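A minimal sketch of the group iterative thresholding loop is given below, paraphrasing (not reproducing) Algorithm 1: a gradient step on the squared error, group thresholding of each (cosine, sine) pair, and an optional relaxation step. The scaling by $k_0^2$ and the relaxation form are simplified assumptions.

```python
import numpy as np

def gist_fit(y, X, theta, lam, steps=200, omega=1.0):
    """Group iterative thresholding sketch.  `theta` is a scalar
    thresholding rule applied to each pair norm; omega = 1 recovers the
    plain iteration, omega > 1 the relaxed (accelerated) variant."""
    p = X.shape[1]
    D = p // 2
    k0sq = np.linalg.norm(X, 2) ** 2          # k0^2 >= ||X||_2^2
    beta = np.zeros(p)
    for _ in range(steps):
        z = beta + X.T @ (y - X @ beta) / k0sq     # gradient step
        norms = np.hypot(z[:D], z[D:])             # per-pair norms
        scale = np.zeros(D)
        nz = norms > 0
        scale[nz] = theta(norms[nz], lam) / norms[nz]
        thresholded = np.concatenate([z[:D] * scale, z[D:] * scale])
        beta = beta + omega * (thresholded - beta)  # relaxation step
    return beta
```

Plugging in soft, hard, or hard-ridge rules for `theta` yields the group lasso, group hard, and group hard-ridge fits, respectively.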

### III-C Statistical analysis

Although $\ell_1$ regularization is popular (see, e.g., BP [6]), in the following we show that the HR penalty has better selection power: it essentially removes the stringent coherence assumption and can accommodate lower SNRs. We focus on the group form, based on the discussion in Section III-A.

Let $F$ be the entire frequency set covered by the dictionary. For the design matrix defined in (2), $|F| = D$. Given any frequency $f \in F$, we use $X_f$ to denote the submatrix of $X$ formed by the sine and cosine frequency atoms at $f$, and $\boldsymbol{\beta}_f$ the corresponding coefficient vector. If $I \subset F$ is an index set, $X_I$ and $\boldsymbol{\beta}_I$ are defined similarly. In general, $X_f$ is of size $N \times 2$ and $\boldsymbol{\beta}_f \in \mathbb{R}^2$ (but not always; cf. (4)). Given any coefficient vector $\boldsymbol{\beta}$, we introduce

$$\mathcal{F}(\boldsymbol{\beta}) = \{f \in F : \boldsymbol{\beta}_f \neq \boldsymbol{0}\} \tag{19}$$

to characterize the frequency selection outcome. In particular, we write $\mathcal{F}^* = \mathcal{F}(\boldsymbol{\beta}^*)$ associated with the true coefficient vector $\boldsymbol{\beta}^*$, and let $s = |\mathcal{F}^*|$ be the number of frequencies present in the true signal and $J = |F| - s$ the number of irrelevant frequencies.

We introduce two useful quantities characterizing the geometry of the dictionary. Recall $k_0 \ge \|X\|_2$ (the square root of the largest eigenvalue of $X^T X$). Given $\mathcal{F}^*$, write $X_1 = X_{\mathcal{F}^*}$ for the submatrix of relevant atoms and $X_2 = X_{F \setminus \mathcal{F}^*}$ for the irrelevant atoms. In this subsection, we assume the design matrix has been column-normalized so that every column has the same 2-norm. Define $\rho$ through the smallest eigenvalue of the Gram matrix $X_1^T X_1$ and $\mu$ through the spectral norm of the cross-product $X_2^T X_1$. ($X_1$ is of size $N \times 2s$ typically.) Intuitively, $\mu$ measures the 'mean' correlation between the relevant frequency atoms and the irrelevant atoms; when $\mu$ is high, the coherence of the dictionary is necessarily high. Denote by $P_S$ the probability that, with soft-thresholding being applied, there exists at least one estimate $\hat{\boldsymbol{\beta}}$ from Algorithm 1 such that $\mathcal{F}(\hat{\boldsymbol{\beta}}) \neq \mathcal{F}^*$; $P_{HR}$ is similarly defined for hard-ridge thresholding. Theorem 2 bounds these two probabilities.

###### Theorem 2

Assume model (3) holds with true frequency set $\mathcal{F}^*$ of size $s$.

(i) Let $\Theta$ be the soft-thresholding. Under a bound on the coherence quantity $\mu$ and with $\lambda$ chosen in accordance with the noise level, we have

$$P_S \le \epsilon_S, \tag{20}$$

where $\epsilon_S$ is a small quantity determined by $N$, $s$, $J$, the noise level, and the minimum signal strength.

(ii) Let $\Theta$ be the hard-ridge thresholding. Assume $\lambda$ and $\eta$ are chosen in accordance with the noise level and the minimum signal strength. Then

$$P_{HR} \le \epsilon_{HR}, \tag{21}$$

where $\epsilon_{HR}$ is an analogous quantity.

As seen from (20) and (21), both inconsistent-detection probabilities are small. It is worth mentioning that, in practice, we found the optimal value of $\eta$ is usually small, which nevertheless effectively handles singularity and collinearity in comparison with $\eta = 0$, as supported by the literature (e.g., [36]). In the following, we compare the assumptions and probability bounds. The setup where $s$ is small relative to $N$ but $J$ is overwhelmingly large is of particular interest: the number of truly present frequencies is small relative to the sample size, while the number of irrelevant frequencies can be huge. The coherence conditions characterize how much correlation between atoms can be accommodated, while the conditions on $\lambda$ and the minimum signal strength describe how weak the signal can be. (i) For the $\ell_1$ penalty, the coherence condition is a version of the irrepresentable conditions and cannot be relaxed in general [19]. In contrast, for the $\ell_0 + \ell_2$ penalty, the bound for $\mu$ becomes large when $\eta$ is small, and so the stringent coherence requirement can be essentially removed! (ii) When $\eta$ is small in the hard-ridge thresholding, the noiseless ridge estimator is close to $\boldsymbol{\beta}^*$, but the minimum signal strength can be much lower than that required by the $\ell_1$ penalty. (iii) Finally, for small values of $\eta$, $P_{HR}$ compares favorably with $P_S$, and so the hard-ridge thresholding has a better chance of recovering the whole spectrum correctly.

*Remark.* Including the ridge penalty in the regularization helps enhance estimation and prediction accuracy, especially when the frequency resolution is quite high and the true signal is multi-dimensional. Even when the purpose is selection alone, it is meaningful, because most tuning strategies for the regularization parameters are based on prediction (generalization) error.

### III-D Model comparison criterion

This part studies the problem of how to choose proper regularization parameters for given data $(\boldsymbol{y}, X)$. In (7), the parameter $\lambda$ (together with $\eta$, if applicable) provides a statistical bias-variance tradeoff in regularizing the model, and ought to be tuned in a data-driven manner.
In common with most researchers (see, e.g., [11, 37, 38]), we first specify a grid of parameter values, then run Algorithm 1 for every value in the grid to get a *solution path*, and finally use a *model comparison criterion* to find the optimal estimate. The commonly used model comparison criteria are the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and cross-validation (CV). But we have found none of them satisfactory for high-dimensional super-resolution spectral estimation.

Ideally, in a data-rich situation, one would divide the whole dataset into a training subset and a validation subset. For each candidate parameter value, one trains the model on the training subset and evaluates the prediction accuracy by, say, the squared error on the validation subset. However, this data-splitting approach is only reasonable when the validation subset is large enough to approximate the true prediction error; it cannot be used in our problem due to the insufficiency of observations. A popular data-reusing method for small samples is $K$-fold CV. Divide the dataset into $K$ folds; let $\mathcal{D}_k$ denote the $k$th subset and $\mathcal{D}_{-k}$ the remaining data. To obtain the CV error at any parameter value, one needs to fit $K$ penalized models: setting $\mathcal{D}_{-k}$ as the training data, solve the penalized problem to obtain an estimate, and then calculate its validation error on $\mathcal{D}_k$. The summarized CV error, averaged over the $K$ folds, serves as the comparison criterion. After the optimal parameter value is determined, we refit the model on the global dataset to get the final estimate.

However, when a nonconvex penalty is applied, this plain CV has an inherent drawback: the trained models at a common value of $\lambda$ may not be comparable, and thus averaging their validation errors may make little sense. The reasons are twofold. (i) The regularization parameter appears in a Lagrangian-form optimization problem (cf. (6) or (7)). In general, the optimal $\lambda$ to guarantee good selection and estimation must be a function of both the true coefficient vector and the data. Notice that in the trainings of $K$-fold CV, the training data change from fold to fold. The same value of $\lambda$ may therefore have different regularization effects for different training datasets, even though the true coefficient vector remains the same. Fig. 2 shows the numbers of nonzero coefficient estimates in $K$-fold CV; they are never consistent at any fixed value of $\lambda$! (ii) The solution path associated with a nonconvex penalty is generally discontinuous in $\lambda$. Fig. 3 plots the solution path for the default TwinSine signal. Even a small change in $\lambda$ may result in a totally different estimate and zero-nonzero pattern. In consideration of both (i) and (ii), cross-validating $\lambda$ is not a proper tuning strategy for our problem.

To resolve the training inconsistency, we advocate a generic selective cross validation (SCV) for parameter tuning with sparsity-inducing penalties. First, the sparsity algorithm is run on the *entire* dataset to get a solution path over the parameter grid. Every estimate on the path determines a candidate model with predictor set given by its nonzero coefficients. Next, we *cross-validate the candidate models* (instead of the parameter value) to evaluate the goodness-of-fit of each candidate model. In this way, all trainings are restricted to the *same* subset of predictors.
Concretely, for penalties without shrinkage, such as the $\ell_0$-penalty, the refit on each training fold is the unpenalized regression estimate restricted to the selected predictors, while for penalties with a ridge portion, such as the $\ell_0 + \ell_2$ penalty, it is the corresponding ridge regression estimate (cf. Theorem 1). Finally, the total SCV error is summarized over the folds. Motivated by the work of [37], we add a high-dimensional BIC correction term to define the model comparison criterion, where the correction involves the degrees of freedom (DF) of the candidate model. When the true signal has a parsimonious representation in the frequency domain, i.e., the number of present frequencies is very small, such a correction is necessary; see [37] for a further theoretical justification. For the $\ell_0$ or $\ell_1$ penalty, DF is approximately the number of nonzero components in the estimate; for the ridge-type penalty, DF is given by the trace formula in [36]. The optimal estimate is chosen from the original solution path by minimizing the corrected criterion. We point out that in SCV, the sparsity algorithm is only required to run once, on the whole dataset, to generate one solution path, while CV needs $K$ such solution paths; SCV is more efficient in computation.

### III-E Probabilistic spectra screening

Computational complexity is another major challenge in super-resolution studies. In Algorithm 1, each iteration involves only matrix-vector multiplications and componentwise thresholding operations; both have low complexity and can be vectorized. The number of flops per iteration is $O(ND)$, which is linear in the dictionary size $D$. In our experiments, a moderate number of iterations suffices. (Restricting attention to uniformly sampled data and frequency atoms in the dictionary construction, we can use the fast Fourier transform (FFT) in computation to reduce the per-iteration complexity further, as pointed out by an anonymous reviewer; see [39] and Section IV.) On the other hand, with a superbly high-resolution dictionary (where $D$ is very large), dimension reduction is still desirable to further reduce the computational cost.

This is indeed possible under the spectral sparsity assumption, where the number of true components is supposed to be much smaller than $D$. One may reduce the dimension from $2D$ to a smaller number $d$ of candidate atoms before running the formal algorithm. If the candidate predictors are wisely chosen, the truly relevant atoms will be included with high probability and the sacrifice in selection/estimation performance will be mild. Hereinafter, we call the ratio $d/(2D)$ the *candidate ratio*. A well-designed screening algorithm should not be very sensitive to the candidate ratio as long as it is reasonably large. A significant decrease in computational time can be achieved after this supervised dimension reduction.
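As an illustration of such a screening step (the paper's probabilistic rule is not reproduced here), one can keep the $d$ frequency pairs whose (cosine, sine) atoms have the largest joint marginal correlation with the observed signal:

```python
import numpy as np

def screen_frequencies(y, X, d):
    """Keep the d frequency pairs with the largest joint marginal
    correlation with y.  A simple stand-in for the supervised screening
    step; assumes column layout [cos_1..cos_D, sin_1..sin_D]."""
    D = X.shape[1] // 2
    c = X.T @ y
    score = np.hypot(c[:D], c[D:])            # per-frequency strength
    keep = np.sort(np.argsort(score)[::-1][:d])
    cols = np.concatenate([keep, keep + D])   # retained cos and sin columns
    return keep, X[:, cols]
```

The formal algorithm is then run on the reduced $N \times 2d$ design, and a frequency surviving the screen can still be discarded by the subsequent thresholding fit.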
