Recently, many researchers have proposed algorithms for estimating the -norm of a linear time-invariant (LTI) filter from input/output data [16, 18, 12, 10, 9, 11]. A common property of these algorithms is eschewing model parameter estimation for directly estimating either the worst case -input signal [18, 12] or the maximizing frequency [9, 11]. One of the major motivations behind these algorithms is sample efficiency: since the -norm is a scalar estimate whereas the number of model parameters can be very large and possibly infinite, intuitively one expects that norm estimation can be performed using substantially fewer samples compared with model estimation.
In this paper, we study the fundamental limits of estimating the -norm of a finite impulse response (FIR) filter, and compare to known bounds for FIR model estimation. We show that for passive algorithms which do not adapt their inputs in response to the result of previous queries, it is no more efficient to estimate the -norm than to estimate the model. For active algorithms which do adapt their inputs, we show that compared to model estimation, norm estimation is only at most a factor more efficient when the underlying model is a length FIR filter. Our analysis raises an interesting open question: whether or not there there exists an active sampling strategy which attains our stated lower bound.
Based on our theoretical findings, we study the empirical performance of several existing adaptive algorithms compared to a simple (non-adaptive) estimator which first fits a model via least-squares and then returns the norm of the model. Surprisingly, we find that the current adaptive methods do not perform significantly better than the simple estimator.
2 Related Work
Data-driven methods for estimating the -norm of an LTI system fall into roughly two major approaches: (a) estimating the worst case -signal via a power-iteration type algorithm [18, 12, 10] and (b) discretizing the interval and searching for the maximizing frequency [9, 11].
Algorithms that rely on power iteration take advantage of a clever time-reversal trick introduced by Wahlberg et al. , which allows one to query the adjoint system with only input/output access to the original system
. One issue with these methods is that the rate of convergence of the top singular value of the truncated Toeplitz matrix to the-norm of the system is typically (c.f. ), but the constant hidden in the notation can be quite large as pointed out by 
. A non-asymptotic analysis of the statistical quality of the norm estimate returned by the power iteration remains an open question; asymptotic results can be found in.
are rooted in ideas from the multi-armed bandit literature. Here, each frequency can be treated as either an “arm” or an “expert”, and an adaptive algorithm such as Thompson sampling or multiplicative weights 
is applied. While sharp regret analysis for bandits has been developed by the machine learning and statistics communities, one of the barriers to applying this analysis is a lack of a sharp theory for the level of discretization required. In practice, the number of grid points is a parameter that must be appropriately tuned for the problem at hand.
The problem of estimating the model parameters of a LTI system with model error measured in -norm is studied by , and in -norms by . For the FIR setting,  gives matching upper and lower bounds for identification in -norm. These bounds will serve as a baseline for us to compare our bounds with in the norm estimation setting. Helmicki et al.  provide lower bounds for estimating both a model in -norm and its frequency response at a particular frequency in a query setting where the noise is worst-case. In this work we consider a less conservative setting with stochastic noise. Müller et al.  prove an asymptotic regret lower bound over algorithms that sample only one frequency at every iteration. Their notion of regret is however defined with respect to the best frequency in a fixed discrete grid and not the -norm. As we discuss in Section 3.2, this turns out to be a subtle but important distinction.
3 Problem Setup and Main Results
In this section we formulate the problem under consideration and state our main results.
We fix a filter length and consider an unknown length causal FIR filter with .
We study the following time-domain input/output query model for : for rounds, we
first choose an input such that , and then
we observe a sample ,
where denotes the upper left section of the
semi-infinite Toeplitz matrix induced by treating as an element of
We note that our results extend naturally to the setting when is the upper left section
for a positive integer .
Furthermore, one can restrict both the system coefficients and the inputs to be real-valued
by considering the discrete cosine transform (DCT) instead of the discrete Fourier transform (DFT)
in our proofs.
to be real-valued by considering the discrete cosine transform (DCT) instead of the discrete Fourier transform (DFT) in our proofs.. By for a complex we mean we observe where and are independent.
After these rounds, we are to return an estimate of the norm based on the collected data . The expected risk of any algorithm for this problem is measured as , where the expectation is taken with respect to both the randomness of the algorithm and the noise of the outputs .
Our results distinguish between passive and active algorithms. A passive algorithm is one in which the distribution of the input at time is independent of the history . An active algorithm is one where the distribution of is allowed to depend on this history.
Given this setup, our first result is a minimax lower bound for the risk attained by any passive algorithm. [Passive lower bound] Fix a . Let for a universal constant and . We call a passive algorithm admissible if the matrix . We have the following minimax lower bound on the risk of any passive admissible algorithm :
Here, is a universal constant. We note the form of the can be recovered from the proof.
The significance of Theorem 3 is due to the fact that under our query model, one can easily produce an estimate such that , for instance by setting where
is the first standard basis vector (see e.g. Theorem 1.1 of). That is, for passive algorithms the number of samples to estimate the model parameters is equal (up to constants) to the number of samples needed to estimate the -norm, at least for the worst-case. The situation changes slightly when we look at active algorithms. [Active lower bound] The following minimax lower bound on the risk of any active algorithm holds:
Here, is a universal constant.
We see in the active setting that the lower bound is weakened by a logarithmic factor in . This bound shows that for low SNR regimes when , the gains of being active are minimal in the worst case. On the other hand, for high SNR regimes when , the gains are potentially quite substantial. We are unaware of any algorithm that provably achieves the active lower bound; it is currently unclear whether or not the lower bound is loose, or if more careful design/analysis is needed to find a better active algorithm. However, in Section 3.2 we discuss special cases of FIR filters for which the lower bound in Theorem 3 is not improvable.
We note that our proof draws heavily on techniques used to prove minimax lower bounds for functional estimation, specifically in estimating the -norm of a unknown mean vector in the Gaussian sequence model. An excellent overview of these techniques is given in Chapter 22 of .
3.1 Hypothesis testing and sector conditions
An application that is closely related to estimating the -norm is testing whether or not the -norm exceeds a certain fixed threshold
. Specifically, consider a test statisticthat discriminates between the two alternatives and . This viewpoint is useful because it encompasses testing for more fine-grained characteristics of the Nyquist plot via simple transformations. For instance, given , one may test if is contained within a circle in the complex plane centered at with radius by equivalently checking if the -norm of the system is less than ; this is known as the -sector condition .
Due to the connection between estimation and hypothesis testing, our results also give lower bounds on the sum of the Type-I and Type-II errors of any testdiscriminating between the hypothesis and . Specifically, queries in the passive case (when is sufficiently small) and
queries in the active case are necessary for any test to have Type-I and Type-II error less than a constant probability.
3.2 Shortcomings of discretization
In order to close the gap between the upper and lower bounds, one needs to explicitly deal with the continuum of frequencies on ; here we argue that if the maximizing frequency is known a-priori to lie in discrete set of points, then the lower bound is sharp.
Suppose we consider a slightly different query model, where at each time the algorithm chooses a frequency and receives . For simplicity let us also assume that the -norm is bounded by one. Note that by slightly enlarging to the upper left triangle, we can emulate this query model by using a normalized complex sinusoid with frequency .
If the maximizing frequency for the -norm of the underlying system is located on the grid and its phase is known, this problem immediate reduces to a standard -arm multi-armed bandit (MAB) problem where each arm is associated with a point on the grid. For this family of instances, the following active algorithm has expected risk upper bounded by times a constant:
Run a MAB algorithm that is optimal in the stochastic setting (such as MOSS ) for iterations, with each of the arms associated to a frequency .
Sample an index where:
Query for times and return the sample mean.
More generally, for any grid of frequencies such that the number of grid points is , this algorithm obtains expected risk. Hence the lower bound of Theorem 3 is actually sharp with respect to these instances.
Therefore, the issue that needs to be understood is whether or not the continuum of frequencies on fundamentally requires additional sample complexity compared to a fixed discrete grid. Note that a naïve discretization argument is insufficient here. For example, it is known (see e.g. Lemma 3.1 of ) that by choosing equispaced frequencies one obtains a discretization error bounded by , e.g. for the largest . This bound is too weak, however, since it requires that the number of arms scale as in order to obtain a risk bounded by ; in terms of , the risk would scale .
To summarize, if one wishes to improve the active lower bound to match the rate given by Theorem 3, one needs to consider a prior distribution over hard instances where the support of the maximizing frequency is large (possibly infinite) compared to . On the other hand, if one wishes to construct an algorithm achieving the rate of Theorem 3, then one will need to understand the function at a much finer resolution than Lipschitz continuity.
4 Proof of Main Results
The proof of Theorem 3 and Theorem 3 both rely on a reduction to Bayesian hypothesis testing. While this reduction is standard in the statistics and machine learning communities (see e.g. Chapter 2 of ), we briefly outline it here, as we believe these techniques are not as widely used in the controls literature.
First, let be two prior distributions on . Suppose that for all we have and for all we have for some . Let
denote the joint distribution of, which combines the prior distribution with the observation model. Then,
where for two measures we define the total-variation (TV) distance as . Hence, if one can construct two prior distributions with the aforementioned properties and furthermore show that , then one deduces that the minimax risk is lower bounded by . This technique is generally known as Le Cam’s method, and will be our high-level proof strategy.
As working directly with the TV distance is often intractable, one typically computes upper bounds to the TV distance. We choose to work with both the KL-divergence and the -divergence. The KL-divergence is defined as , and the -divergence is defined as (we assume that so these quantities are well-defined). One has the standard inequalities and .
4.1 Proof of passive lower bound (Theorem 3)
The main reason for working with the -divergence is that it operates nicely with mixture distributions, as illustrated by the following lemma. [see e.g. Lemma 22.1 of ] Let be a parameter space and for each let be a measure over indexed by . Fix a measure on and a prior measure on . Define the mixture measure . Suppose for every , the measures and are both absolutely continuous w.r.t. a fixed base measure on . Define the function as
We have that:
We now specialize this lemma to our setting. Here, our distributions are over ; for a fixed system parameter , the joint distribution has the density (assuming that has the density ):
where denotes the PDF of the multivariate Gaussian . Note that this factorization with independent of is only possible under the passive assumption.
Supposing that , we have that
In part (a) we complete the square and in part (b) we use the fact that the distributions for are independent as a consequence of the passive assumption. In particular for (a), we first observe that:
Hence writing , we obtain:
We now construct two prior distributions on . The first prior will be the system with all coefficients zeros, i.e. . The second prior will be more involved. To construct it, we let . By the admissibility assumption on the algorithm , is invertible. Let denote an index set to be specified. Let denote the unnormalized discrete Fourier transform (DFT) matrix (i.e. and ). We define our prior distribution as, for some to be chosen:
We choose as according to the following proposition. Let be independently drawn from distributions such that a.s for all . Let and suppose that is invertible. There exists an index set such that and for all ,
First, we observe that:
denote the eigenvalues of. By Cauchy-Schwarz,
Now for any fixed satisfying , we have:
That is, we have shown that:
Now we state an auxiliary proposition whose proof follows from Markov’s inequality. Let satisfy and . Then there exists an index set with cardinality such that for all . By the auxiliary proposition, there exists an index set such that and for all . Hence, for any ,
Above, the first inequality holds because the -norm of the -th column of a matrix exceeds the absolute value of the -th position of the matrix, the second inequality is because for any positive definite matrix , we have and the last inequality is due to the property of . ∎
We now observe that for indices , defining :
where . In (a) we used the fact that for two vectors we have , e.g. the fact that convolution is commutatitive. Combining this calculation with Lemma 4.1,
where in (a) we used Cauchy-Schwarz and in (b) we used the fact that . Now condition on . For a
, define the random variableas:
We have that by construction. Furthermore, note that for and also that that for any vector . These facts, along with the assumption that , show that almost surely. Hence, is a zero-mean sub-Gaussian random variable with sub-Gaussian parameter (see e.g. Ch. 2 of  for background exposition on sub-Gaussian random variables). Therefore, we know that for any
, its moment generating function (MGF) is bounded as:
Hence by iterating expectations and setting , we have:
Therefore, for any choice of such that:
we have that:
Hence if and if we set to be:
we have that:
assuming the condition (4.1) is satisfied. This bound then implies that . We now aim to choose so that the condition (4.1) holds. Plugging our setting of from (4.2) into (4.1) and rearranging yields the condition .
To conclude, we need to show a minimum separation between the -norm on vs. . Clearly on . On the other hand, for , we observe that for ,
where inequality (a) comes from for any and inequality (b) comes from Proposition 4.1. Hence we have constructed two prior distributions with a separation of but a total variation distance less than . Theorem 3 now follows.
4.2 Proof of active lower bound (Theorem 3)
For this setting we let and . The proof proceeds by bounding the KL-divergence . To do this, we first bound , where is the joint distribution induced by the parameter and is the joint distribution induced by the parameter . Proceeding similarly to the proof of Theorem 1.3 in ,
A simple calculation shows that:
and hence the operator norm of this matrix is bounded by one. Therefore, by convexity of ,