# Minimax Lower Bounds for H_∞-Norm Estimation

The problem of estimating the H_∞-norm of an LTI system from noisy input/output measurements has attracted recent attention as an alternative to parameter identification for bounding unmodeled dynamics in robust control. In this paper, we study lower bounds for H_∞-norm estimation under a query model where at each iteration the algorithm chooses a bounded input signal and receives the response of the chosen signal corrupted by white noise. We prove that when the underlying system is an FIR filter, H_∞-norm estimation is no more efficient than model identification for passive sampling. For active sampling, we show that norm estimation is at most a factor of r more sample efficient than model identification, where r is the length of the filter. We complement our theoretical results with experiments which demonstrate that a simple non-adaptive estimator of the norm is competitive with state-of-the-art adaptive norm estimation algorithms.


## 1 Introduction

Recently, many researchers have proposed algorithms for estimating the H_∞-norm of a linear time-invariant (LTI) filter from input/output data [16, 18, 12, 10, 9, 11]. A common property of these algorithms is that they eschew model parameter estimation in favor of directly estimating either the worst-case ℓ₂ input signal [18, 12] or the maximizing frequency [9, 11]. One of the major motivations behind these algorithms is sample efficiency: since the H_∞-norm is a scalar whereas the number of model parameters can be very large and possibly infinite, intuitively one expects that norm estimation can be performed using substantially fewer samples than model estimation.

In this paper, we study the fundamental limits of estimating the H_∞-norm of a finite impulse response (FIR) filter, and compare them to known bounds for FIR model estimation. We show that for passive algorithms, which do not adapt their inputs in response to the results of previous queries, it is no more efficient to estimate the H_∞-norm than to estimate the model. For active algorithms, which do adapt their inputs, we show that norm estimation is at most a factor of r more sample efficient than model estimation when the underlying model is a length-r FIR filter. Our analysis raises an interesting open question: whether there exists an active sampling strategy which attains our stated lower bound.

Based on our theoretical findings, we study the empirical performance of several existing adaptive algorithms compared to a simple (non-adaptive) estimator which first fits a model via least squares and then returns the H_∞-norm of the fitted model. Surprisingly, we find that the current adaptive methods do not perform significantly better than this simple estimator.
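As an illustration, here is a minimal sketch of such a plug-in estimator: excite the system with random inputs, fit the FIR coefficients by ordinary least squares, and return the peak gain of the fitted model on a fine frequency grid. All parameter values and the random-excitation design below are illustrative assumptions, not the experimental setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
r, N, sigma = 8, 2000, 0.1
g_true = rng.standard_normal(r) / np.sqrt(r)   # unknown FIR coefficients (assumed instance)

def toep(u, r):
    """Lower-triangular r x r Toeplitz matrix T(u) with first column u."""
    T = np.zeros((r, r))
    for k in range(r):
        T[k:, k] = u[:r - k]
    return T

# Passive random excitation: x_t = T(g) u_t + noise.  Since convolution is
# commutative, T(g) u = T(u) g, so stacking the N queries gives an ordinary
# least-squares problem in the coefficient vector g.
A = np.zeros((N * r, r))
b = np.zeros(N * r)
for t in range(N):
    u = rng.standard_normal(r)
    u /= np.linalg.norm(u)                     # respect a unit input budget
    A[t * r:(t + 1) * r] = toep(u, r)
    b[t * r:(t + 1) * r] = toep(u, r) @ g_true + sigma * rng.standard_normal(r)
g_hat = np.linalg.lstsq(A, b, rcond=None)[0]

# Plug-in H-infinity estimate: peak magnitude of the frequency response on a fine grid.
H_hat = np.abs(np.fft.fft(g_hat, 4096)).max()
H_true = np.abs(np.fft.fft(g_true, 4096)).max()
assert abs(H_hat - H_true) < 0.1
```

The point of the sketch is that the non-adaptive estimator requires no tuning beyond the grid resolution used for the final norm computation.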

## 2 Related Work

Data-driven methods for estimating the H_∞-norm of an LTI system fall into roughly two major approaches: (a) estimating the worst-case ℓ₂ signal via a power-iteration type algorithm [18, 12, 10] and (b) discretizing the frequency interval [0, 2π) and searching for the maximizing frequency [9, 11].

Algorithms that rely on power iteration take advantage of a clever time-reversal trick introduced by Wahlberg et al. [18], which allows one to query the adjoint system with only input/output access to the original system. One issue with these methods is that, while the top singular value of the truncated Toeplitz matrix converges to the H_∞-norm of the system as the truncation length grows (c.f. [2]), the constant hidden in the asymptotic notation can be quite large, as pointed out by [15]. A non-asymptotic analysis of the statistical quality of the norm estimate returned by the power iteration remains an open question; asymptotic results can be found in [12].

The algorithms developed in [9, 11], based on discretizing the frequencies, are rooted in ideas from the multi-armed bandit literature. Here, each frequency can be treated as either an “arm” or an “expert”, and an adaptive algorithm such as Thompson sampling [9] or multiplicative weights [11] is applied. While sharp regret analysis for bandits has been developed by the machine learning and statistics communities [3], one of the barriers to applying this analysis is the lack of a sharp theory for the level of discretization required. In practice, the number of grid points is a parameter that must be appropriately tuned for the problem at hand.

The problem of estimating the model parameters of an LTI system with model error measured in H_∞-norm is studied by [14], and in ℓ_p-norms by [5]. For the FIR setting, [14] gives matching upper and lower bounds for identification in H_∞-norm. These bounds serve as a baseline against which we compare our bounds in the norm estimation setting. Helmicki et al. [7] provide lower bounds for estimating both a model in H_∞-norm and its frequency response at a particular frequency, in a query setting where the noise is worst-case. In this work we consider a less conservative setting with stochastic noise. Müller et al. [9] prove an asymptotic regret lower bound over algorithms that sample only one frequency at every iteration. Their notion of regret, however, is defined with respect to the best frequency in a fixed discrete grid and not the H_∞-norm. As we discuss in Section 3.2, this turns out to be a subtle but important distinction.

## 3 Problem Setup and Main Results

In this section we formulate the problem under consideration and state our main results. We fix a filter length r and consider an unknown length-r causal FIR filter H(g) = ∑_{k=0}^{r−1} g_k z^{−k} with g ∈ ℂʳ. We study the following time-domain input/output query model for H(g): for rounds t = 1, ..., N, we first choose an input u_t ∈ ℂʳ such that ‖u_t‖₂ ≤ M, and then we observe a sample x_t = T(g)u_t + w_t, where T(g) denotes the upper-left r × r section of the semi-infinite Toeplitz matrix induced by treating g as an element of ℓ²(ℕ).¹ By w_t ∼ N(0, σ²I) for a complex w_t we mean that w_t = w_t^{re} + j w_t^{im}, where w_t^{re} ∼ N(0, (σ²/2)I) and w_t^{im} ∼ N(0, (σ²/2)I) are independent.

¹ We note that our results extend naturally to the setting when T(g) is the upper-left kr × kr section for a positive integer k. Furthermore, one can restrict both the system coefficients and the inputs to be real-valued by considering the discrete cosine transform (DCT) instead of the discrete Fourier transform (DFT) in our proofs.

After these N rounds, we are to return an estimate Ĥ of the norm ‖H(g)‖_∞ based on the collected data {(u_t, x_t)}_{t=1}^N. The expected risk of any algorithm 𝒜 for this problem is measured as E[|Ĥ_𝒜 − ‖H(g)‖_∞|], where the expectation is taken with respect to both the randomness of the algorithm and the noise on the outputs x_t.

Our results distinguish between passive and active algorithms. A passive algorithm is one in which the distribution of the input u_t at time t is independent of the history {(u_k, x_k)}_{k=1}^{t−1}. An active algorithm is one where the distribution of u_t is allowed to depend on this history.
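The query model can be sketched as a minimal simulation (all parameter values below are illustrative assumptions; an active policy would differ only in that it may inspect the history before choosing u_t):

```python
import numpy as np

rng = np.random.default_rng(3)
r, N, sigma, M = 4, 10, 0.1, 1.0
g = rng.standard_normal(r) / np.sqrt(r)   # unknown FIR coefficients (assumed instance)

def toep(u):
    """Lower-triangular r x r Toeplitz matrix T(u) with first column u."""
    T = np.zeros((r, r))
    for k in range(r):
        T[k:, k] = u[:r - k]
    return T

def query(u):
    """One round of the oracle: the response T(g) u plus white noise."""
    return toep(u) @ g + sigma * rng.standard_normal(r)

history = []
for t in range(N):
    # Passive policy: u_t is drawn without looking at `history`.
    u = rng.standard_normal(r)
    u *= M / np.linalg.norm(u)            # respect the input budget ||u||_2 <= M
    history.append((u, query(u)))
assert len(history) == N
```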

Given this setup, our first result is a minimax lower bound on the risk attained by any passive algorithm. [Passive lower bound] Fix a γ > 0 and suppose ‖u_t‖₂ ≤ M almost surely for all t. Let N ≥ C (M⁴/γ²) r² log² r for a universal constant C. We call a passive algorithm 𝒜 admissible if the matrix Σ := (1/N) ∑_{t=1}^N E[T(u_t)* T(u_t)] satisfies Σ ⪰ γI. We have the following minimax lower bound on the risk of any passive admissible algorithm 𝒜:

 inf_𝒜 sup_{g ∈ ℂʳ} E[ |Ĥ_𝒜 − ‖H(g)‖_∞| ] ≥ C′ (σ/M) √( r log r / N ). (3.1)

Here, C′ is a universal constant. We note that the exact form of the constants can be recovered from the proof.

The significance of Theorem 3.1 is due to the fact that, under our query model, one can easily produce an estimate ĝ such that E[‖H(ĝ) − H(g)‖_∞] ≲ (σ/M)√(r log r / N), for instance by setting u_t = M e₁, where e₁ is the first standard basis vector (see e.g. Theorem 1.1 of [14]). That is, for passive algorithms the number of samples needed to estimate the model parameters is equal (up to constants) to the number of samples needed to estimate the H_∞-norm, at least in the worst case. The situation changes slightly when we look at active algorithms. [Active lower bound] The following minimax lower bound on the risk of any active algorithm 𝒜 holds:

 inf_𝒜 sup_{g ∈ ℂʳ} E[ |Ĥ_𝒜 − ‖H(g)‖_∞| ] ≥ C′ (σ/M) √( r / N ). (3.2)

Here, C′ is a universal constant.

We see that in the active setting the lower bound is weakened by a logarithmic factor in r. This shows that in low-SNR regimes, when M/σ is small, the gains from being active are minimal in the worst case. On the other hand, in high-SNR regimes, when M/σ is large, the gains are potentially quite substantial. We are unaware of any algorithm that provably achieves the active lower bound; it is currently unclear whether the lower bound is loose, or whether a more careful design/analysis is needed to find a better active algorithm. However, in Section 3.2 we discuss special cases of FIR filters for which the lower bound in Theorem 3.2 is not improvable.

We note that our proof draws heavily on techniques used to prove minimax lower bounds for functional estimation, specifically for estimating the ℓ_∞-norm of an unknown mean vector in the Gaussian sequence model. An excellent overview of these techniques is given in Chapter 22 of [19].

### 3.1 Hypothesis testing and sector conditions

An application that is closely related to estimating the H_∞-norm is testing whether or not the H_∞-norm exceeds a certain fixed threshold γ. Specifically, consider a test statistic ψ that discriminates between the two alternatives ‖H(g)‖_∞ ≤ γ and ‖H(g)‖_∞ > γ. This viewpoint is useful because it encompasses testing for more fine-grained characteristics of the Nyquist plot via simple transformations. For instance, one may test whether the Nyquist plot of H is contained within a circle in the complex plane with a given center and radius by equivalently checking whether the H_∞-norm of a suitably shifted and scaled system is less than one; this is known as a sector condition [6].

Due to the connection between estimation and hypothesis testing, our results also give lower bounds on the sum of the Type-I and Type-II errors of any test ψ discriminating between the hypotheses ‖H(g)‖_∞ = 0 and ‖H(g)‖_∞ ≥ γ. Specifically, on the order of (σ/(Mγ))² r log r queries in the passive case (when γ is sufficiently small) and on the order of (σ/(Mγ))² r queries in the active case are necessary for any test to have Type-I and Type-II errors less than a constant probability.

### 3.2 Shortcomings of discretization

In order to close the gap between the upper and lower bounds, one needs to explicitly deal with the continuum of frequencies on [0, 2π); here we argue that if the maximizing frequency is known a priori to lie in a discrete set of r points, then the lower bound is sharp.

Suppose we consider a slightly different query model, where at each time t the algorithm chooses a frequency ω_t ∈ [0, 2π) and receives a noisy measurement of the frequency response at ω_t. For simplicity, let us also assume that the H_∞-norm is bounded by one. Note that, by slightly enlarging T(g) to a larger upper-left section of the semi-infinite Toeplitz matrix, we can emulate this query model by using a normalized complex sinusoid with frequency ω_t as the input.

If the maximizing frequency for the H_∞-norm of the underlying system is located on the grid {2πi/r : i = 0, ..., r−1} and its phase is known, this problem immediately reduces to a standard r-arm multi-armed bandit (MAB) problem where each arm is associated with a point on the grid. For this family of instances, the following active algorithm has expected risk upper bounded by √(r/N) times a constant:

1. Run a MAB algorithm that is optimal in the stochastic setting (such as MOSS [1]) for N/2 iterations, with each of the r arms associated to a frequency on the grid.

2. Sample an index I where:

 Pr(I = i) = (# of times arm i was pulled) / (N/2).

3. Query the frequency associated with arm I for N/2 rounds and return the sample mean.

More generally, for any grid of frequencies such that the number of grid points is O(r), this algorithm obtains O(√(r/N)) expected risk. Hence the lower bound of Theorem 3.2 is actually sharp with respect to these instances.
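The three-step scheme above can be sketched as a minimal simulation. Everything below is an illustrative assumption: the arm gains are arbitrary values with a clear gap (standing in for |G| evaluated on the grid), and a UCB1-style index stands in for MOSS.

```python
import numpy as np

rng = np.random.default_rng(4)
r, N, sigma = 8, 20000, 0.5
# Assumed gains at the r grid frequencies; the maximizing frequency lies on
# the grid by construction, as in the family of instances described above.
means = np.array([1.5, 0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1])

def pull(i):
    """Noisy measurement of the gain at grid frequency i."""
    return means[i] + sigma * rng.standard_normal()

# Step 1: a UCB1-style bandit (standing in for MOSS) for N/2 iterations.
counts = np.ones(r)
sums = np.array([pull(i) for i in range(r)])
for t in range(r, N // 2):
    ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
    i = int(np.argmax(ucb))
    sums[i] += pull(i)
    counts[i] += 1

# Step 2: sample an arm index proportionally to the pull counts.
I = rng.choice(r, p=counts / counts.sum())

# Step 3: query arm I for N/2 rounds and return the sample mean.
est = np.mean([pull(I) for _ in range(N // 2)])

assert counts.argmax() == means.argmax()   # the bandit concentrates on the best arm
assert abs(est - means[I]) < 0.05          # the final sample mean concentrates
```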

Therefore, the issue that needs to be understood is whether or not the continuum of frequencies on [0, 2π) fundamentally requires additional sample complexity compared to a fixed discrete grid. Note that a naïve discretization argument is insufficient here. For example, it is known (see e.g. Lemma 3.1 of [14]) that by choosing k equispaced frequencies one obtains a discretization error bounded by O(r/k), i.e. the largest gain on the grid is within O(r/k) of the H_∞-norm. This bound is too weak, however, since it requires that the number of arms scale as √(rN) in order for the discretization error to be bounded by √(r/N); with that many arms, the risk would scale as (r/N)^{1/4} instead of (r/N)^{1/2}.
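The behavior of the discretization error is easy to observe numerically. The sketch below (with an arbitrary random filter and illustrative grid sizes) checks that the grid maximum always underestimates the peak gain and improves as the grid is refined:

```python
import numpy as np

rng = np.random.default_rng(1)
r = 32
g = rng.standard_normal(r)   # arbitrary FIR filter (illustrative)

# Proxy for the true H-infinity norm: the peak gain on a very fine frequency grid.
true_norm = np.abs(np.fft.fft(g, 1 << 16)).max()

prev_err = np.inf
for k in [64, 256, 1024]:
    # Peak gain over k equispaced frequencies; the zero-padded FFT evaluates
    # the transfer function exactly at these grid points.
    grid_max = np.abs(np.fft.fft(g, k)).max()
    err = true_norm - grid_max
    # The grid max underestimates the peak, and nested grids can only improve.
    assert 0 <= err <= prev_err + 1e-12
    prev_err = err
```

The grid sizes are chosen as nested powers of two so that each coarser grid is a subset of the finer ones, which makes the monotone improvement exact rather than merely typical.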

To summarize, if one wishes to improve the active lower bound to match the √(r log r / N) rate given by Theorem 3.1, one needs to consider a prior distribution over hard instances where the support of the maximizing frequency is large (possibly infinite) compared to r. On the other hand, if one wishes to construct an algorithm achieving the √(r/N) rate of Theorem 3.2, then one will need to understand the frequency response as a function of ω at a much finer resolution than Lipschitz continuity.

## 4 Proof of Main Results

The proofs of Theorem 3.1 and Theorem 3.2 both rely on a reduction to Bayesian hypothesis testing. While this reduction is standard in the statistics and machine learning communities (see e.g. Chapter 2 of [13]), we briefly outline it here, as we believe these techniques are not as widely used in the controls literature.

First, let π₁, π₂ be two prior distributions on ℂʳ. Suppose that for all θ ∈ supp(π₁) we have ‖H(θ)‖_∞ = 0, and for all θ ∈ supp(π₂) we have ‖H(θ)‖_∞ = 2c for some c > 0. Let Pr_{π_i} denote the joint distribution of {(u_t, x_t)}_{t=1}^N, which combines the prior distribution π_i with the observation model. Then,

 sup_{θ∈ℂʳ} E[ |Ĥ − ‖H(θ)‖_∞| ] ≥ c · sup_{θ∈ℂʳ} Pr_θ( |Ĥ − ‖H(θ)‖_∞| ≥ c )
  ≥ c · max_{i=1,2} { ∫ Pr_θ( |Ĥ − ‖H(θ)‖_∞| ≥ c ) π_i(dθ) }
  ≥ (c/2) ( Pr_{π₁}( |Ĥ| ≥ c ) + Pr_{π₂}( |Ĥ − 2c| ≥ c ) )
  ≥ (c/2) ( Pr_{π₁}( Ĥ ≥ c ) + Pr_{π₂}( Ĥ < c ) )
  ≥ (c/2) ( 1 − d_TV(Pr_{π₁}, Pr_{π₂}) ),

where for two measures P, Q we define the total-variation (TV) distance as d_TV(P, Q) := sup_A |P(A) − Q(A)|. Hence, if one can construct two prior distributions with the aforementioned properties and furthermore show that d_TV(Pr_{π₁}, Pr_{π₂}) ≤ 1/2, then one deduces that the minimax risk is lower bounded by c/4. This technique is generally known as Le Cam’s method, and it will be our high-level proof strategy.

As working directly with the TV distance is often intractable, one typically computes upper bounds on it. We choose to work with both the KL-divergence and the χ²-divergence. The KL-divergence is defined as KL(P, Q) := ∫ log(dP/dQ) dP, and the χ²-divergence is defined as χ²(P, Q) := ∫ (dP/dQ)² dQ − 1 (we assume that P ≪ Q so these quantities are well-defined). One has the standard inequalities d_TV(P, Q) ≤ √(KL(P, Q)/2) and KL(P, Q) ≤ log(1 + χ²(P, Q)) [13].
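These divergences and inequalities can be sanity-checked by numerical quadrature. The sketch below (with two arbitrary unit-variance Gaussians chosen purely for illustration) verifies Pinsker's inequality and the KL–χ² bound:

```python
import numpy as np

# Two unit-variance Gaussians with different means (illustrative choice).
mu1, mu2, sigma = 0.0, 0.5, 1.0
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
q = np.exp(-(x - mu2) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

tv = 0.5 * np.sum(np.abs(p - q)) * dx        # total-variation distance
kl = np.sum(p * np.log(p / q)) * dx          # KL(P, Q)
chi2 = np.sum((p - q) ** 2 / q) * dx         # chi^2(P, Q)

# Pinsker's inequality and the KL-chi^2 bound.
assert tv <= np.sqrt(kl / 2) + 1e-9
assert kl <= np.log(1 + chi2) + 1e-9
```

For this pair of Gaussians the quantities are available in closed form (KL = (μ₁−μ₂)²/2 and χ² = exp((μ₁−μ₂)²) − 1), so the quadrature can also be checked against those values.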

### 4.1 Proof of passive lower bound (Theorem 3.1)

The main reason for working with the χ²-divergence is that it interacts nicely with mixture distributions, as illustrated by the following lemma. [See e.g. Lemma 22.1 of [19]] Let Θ be a parameter space and for each θ ∈ Θ let Pr_θ be a measure over 𝒳 indexed by θ. Fix a measure ℚ on 𝒳 and a prior measure π on Θ, and define the mixture measure Pr_π := ∫ Pr_θ π(dθ). Suppose that for every θ ∈ Θ, the measures Pr_θ and ℚ are both absolutely continuous w.r.t. a fixed base measure μ on 𝒳. Define the function G : Θ × Θ → ℝ as

 G(θ₁, θ₂) := ∫ (dPr_{θ₁}/dμ)(x) (dPr_{θ₂}/dμ)(x) / (dℚ/dμ)(x) μ(dx).

We have that:

 χ²(Pr_π, ℚ) = E_{θ₁, θ₂ ∼ π⊗π}[ G(θ₁, θ₂) ] − 1.

We now specialize this lemma to our setting. Here, our distributions are over {(u_t, x_t)}_{t=1}^N; for a fixed system parameter θ, the joint distribution Pr_θ has the density (assuming that u_t has the density γ_t):

 p_θ(u₁, x₁, ..., u_N, x_N) = ∏_{t=1}^N γ_t(u_t) φ(x_t; T(θ)u_t),

where φ(·; m) denotes the PDF of the multivariate Gaussian N(m, σ²I). Note that this factorization, with γ_t independent of θ, is only possible under the passive assumption.

Setting ℚ = Pr₀, the distribution induced by θ = 0, we have that

 G(θ₁, θ₂) = E_{u_t}[ exp( (1/σ²) ∑_{t=1}^N Re⟨T(θ₁)u_t, T(θ₂)u_t⟩ ) ].
###### Proof.

We write:

 G(θ₁, θ₂) = ∫ (p_{θ₁} p_{θ₂} / p₀) du₁ dx₁ ⋯ du_N dx_N
  = ∫ ∏_{t=1}^N γ_t(u_t) φ(x_t; T(θ₁)u_t) φ(x_t; T(θ₂)u_t) / φ(x_t; 0) du_t dx_t
  = ∏_{t=1}^N ∫ γ_t(u_t) φ(x_t; T(θ₁)u_t) φ(x_t; T(θ₂)u_t) / φ(x_t; 0) du_t dx_t
  (a)= ∏_{t=1}^N E_{u_t}[ exp( (1/σ²) Re⟨T(θ₁)u_t, T(θ₂)u_t⟩ ) ]
  (b)= E_{u_t}[ exp( (1/σ²) ∑_{t=1}^N Re⟨T(θ₁)u_t, T(θ₂)u_t⟩ ) ].

In part (a) we complete the square, and in part (b) we use the fact that the u_t are independent across t as a consequence of the passive assumption. In particular, for (a) we first observe that:

 ‖x_t − T(θ₁)u_t‖₂² + ‖x_t − T(θ₂)u_t‖₂² − ‖x_t‖₂²
  = 2‖x_t‖₂² + ‖T(θ₁)u_t‖₂² + ‖T(θ₂)u_t‖₂² − 2Re⟨x_t, T(θ₁)u_t + T(θ₂)u_t⟩ − ‖x_t‖₂²
  = ‖x_t‖₂² + ‖T(θ₁)u_t + T(θ₂)u_t‖₂² − 2Re⟨x_t, T(θ₁)u_t + T(θ₂)u_t⟩ − 2Re⟨T(θ₁)u_t, T(θ₂)u_t⟩
  = −2Re⟨T(θ₁)u_t, T(θ₂)u_t⟩ + ‖x_t − (T(θ₁)u_t + T(θ₂)u_t)‖₂².

Hence, writing C for the normalizing constant of the Gaussian density φ, we obtain:

 ∫ φ(x_t; T(θ₁)u_t) φ(x_t; T(θ₂)u_t) / φ(x_t; 0) dx_t
  = ∫ C exp( −(1/(2σ²)) ( ‖x_t − T(θ₁)u_t‖₂² + ‖x_t − T(θ₂)u_t‖₂² − ‖x_t‖₂² ) ) dx_t
  = exp( (1/σ²) Re⟨T(θ₁)u_t, T(θ₂)u_t⟩ ) ∫ C exp( −(1/(2σ²)) ‖x_t − (T(θ₁)u_t + T(θ₂)u_t)‖₂² ) dx_t
  = exp( (1/σ²) Re⟨T(θ₁)u_t, T(θ₂)u_t⟩ ) ∫ φ(x_t; T(θ₁)u_t + T(θ₂)u_t) dx_t
  = exp( (1/σ²) Re⟨T(θ₁)u_t, T(θ₂)u_t⟩ ). ∎

We now construct two prior distributions on ℂʳ. The first prior π₁ will be a point mass on the system with all coefficients zero, i.e. θ = 0. The second prior will be more involved. To construct it, we let Σ := (1/N) ∑_{t=1}^N E[T(u_t)* T(u_t)]. By the admissibility assumption on the algorithm 𝒜, Σ is invertible. Let ℐ ⊆ {1, ..., r} denote an index set to be specified. Let F denote the unnormalized discrete Fourier transform (DFT) matrix (i.e. F*F = rI and F⁻¹ = F*/r). We define our prior distribution π₂ as, for some τ > 0 to be chosen:

 π₂ = Unif( { τ Σ^{−1/2} F^{−1} e_i }_{i ∈ ℐ} ).

We choose ℐ according to the following proposition. Let the inputs u_t be independently drawn from distributions such that ‖u_t‖₂ ≤ M a.s. for all t. Let Σ be as above and suppose that Σ is invertible. Then there exists an index set ℐ such that |ℐ| ≥ r/2 and, for all i ∈ ℐ,

 ‖F Σ^{−1/2} F^{−1} e_i‖_∞ ≥ 1/(2M).
###### Proof.

First, we observe that:

 ∑_{i=1}^r e_i^T F Σ^{1/2} F^{−1} e_i = Tr( F Σ^{1/2} F^{−1} ) = Tr( Σ^{1/2} ).

Let λ₁, ..., λ_r denote the eigenvalues of Σ. By Cauchy–Schwarz,

 Tr( Σ^{1/2} ) = ∑_{i=1}^r √λ_i ≤ √( r Tr(Σ) ).

Now, for any fixed u satisfying ‖u‖₂ ≤ M, we have:

 Tr( T(u)* T(u) ) ≤ r M².

Hence,

 Tr(Σ) = (1/N) ∑_{t=1}^N E_{u_t}[ Tr( T(u_t)* T(u_t) ) ] ≤ r M².

That is, we have shown that:

 ∑_{i=1}^r e_i^T F Σ^{1/2} F^{−1} e_i ≤ r M  ⟺  (1/r) ∑_{i=1}^r e_i^T F Σ^{1/2} F^{−1} e_i ≤ M.

Now we state an auxiliary proposition, whose proof follows from Markov’s inequality. Let a₁, ..., a_r ≥ 0 satisfy (1/r) ∑_{i=1}^r a_i ≤ M. Then there exists an index set ℐ with cardinality |ℐ| ≥ r/2 such that a_i ≤ 2M for all i ∈ ℐ. By the auxiliary proposition, there exists an index set ℐ such that |ℐ| ≥ r/2 and e_i^T F Σ^{1/2} F^{−1} e_i ≤ 2M for all i ∈ ℐ. Hence, for any i ∈ ℐ,

 ‖F Σ^{−1/2} F^{−1} e_i‖_∞ ≥ e_i^T F Σ^{−1/2} F^{−1} e_i ≥ 1 / ( e_i^T F Σ^{1/2} F^{−1} e_i ) ≥ 1/(2M).

Above, the first inequality holds because the ℓ_∞-norm of the i-th column of a matrix is at least the absolute value of the matrix's (i, i)-th entry, the second inequality holds because for any positive definite matrix A and vector v we have (v* A v)(v* A^{−1} v) ≥ ‖v‖₂⁴ (applied here with A = Σ^{1/2} and the columns of F, which satisfy ‖F* e_i‖₂² = r), and the last inequality is due to the defining property of ℐ. ∎
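The auxiliary proposition is simply Markov's inequality applied to the nonnegative numbers a_i; the following sketch checks the generic statement on arbitrary nonnegative data:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.exponential(scale=1.0, size=1000)   # arbitrary nonnegative values
M = a.mean()                                # so (1/n) sum_i a_i = M exactly

# Markov: #{i : a_i > 2M} * 2M <= sum(a) = n*M, hence at most n/2 indices
# exceed 2M, i.e. at least half of them are bounded by 2M.
assert (a <= 2 * M).sum() >= len(a) / 2
```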

We now observe that, for indices ℓ₁, ℓ₂ ∈ ℐ, defining Δ := ∑_{t=1}^N T(u_t)* T(u_t) − NΣ:

 ∑_{t=1}^N ⟨T(Σ^{−1/2}F^{−1}e_{ℓ₁}) u_t, T(Σ^{−1/2}F^{−1}e_{ℓ₂}) u_t⟩
  (a)= ∑_{t=1}^N ⟨T(u_t) Σ^{−1/2}F^{−1}e_{ℓ₁}, T(u_t) Σ^{−1/2}F^{−1}e_{ℓ₂}⟩
  = ∑_{t=1}^N e_{ℓ₁}* F^{−*}Σ^{−1/2} T(u_t)* T(u_t) Σ^{−1/2}F^{−1}e_{ℓ₂}
  = e_{ℓ₁}* F^{−*}Σ^{−1/2} (NΣ + Δ) Σ^{−1/2}F^{−1}e_{ℓ₂}
  = N e_{ℓ₁}* F^{−*}F^{−1}e_{ℓ₂} + e_{ℓ₁}* F^{−*} Δ̃ F^{−1}e_{ℓ₂}
  = (N/r) 𝟙{ℓ₁ = ℓ₂} + e_{ℓ₁}* F^{−*} Δ̃ F^{−1}e_{ℓ₂},

where Δ̃ := Σ^{−1/2} Δ Σ^{−1/2}. In (a) we used the fact that for two vectors u, v ∈ ℂʳ we have T(u)v = T(v)u, i.e. the fact that convolution is commutative. Combining this calculation with Lemma 4.1,

 E_{θ_i}[ G(θ₁, θ₂) ] = E_{ℓ_i, u_t}[ exp( (τ²/σ²)(N/r) 𝟙{ℓ₁=ℓ₂} ) exp( (τ²/σ²) Re( e_{ℓ₁}* F^{−*} Δ̃ F^{−1}e_{ℓ₂} ) ) ]
  (a)≤ √( E_{ℓ_i, u_t}[ exp( (2τ²N/(σ²r)) 𝟙{ℓ₁=ℓ₂} ) ] ) · √( E_{ℓ_i, u_t}[ exp( (2τ²/σ²) Re( e_{ℓ₁}* F^{−*} Δ̃ F^{−1}e_{ℓ₂} ) ) ] )
  (b)≤ √( (2/r) exp( 2Nτ²/(σ²r) ) + 1 − 2/r ) · √( E_{ℓ_i, u_t}[ exp( (2τ²/σ²) Re( e_{ℓ₁}* F^{−*} Δ̃ F^{−1}e_{ℓ₂} ) ) ] ),

where in (a) we used Cauchy–Schwarz and in (b) we used the fact that Pr(ℓ₁ = ℓ₂) ≤ 2/r, since |ℐ| ≥ r/2. Now condition on ℓ₁, ℓ₂. For t = 1, ..., N, define the random variable ψ_t as:

 ψ_t := Re( e_{ℓ₁}* F^{−*}Σ^{−1/2} T(u_t)* T(u_t) Σ^{−1/2}F^{−1}e_{ℓ₂} )
   − Re( e_{ℓ₁}* F^{−*}Σ^{−1/2} E_{u_t}[ T(u_t)* T(u_t) ] Σ^{−1/2}F^{−1}e_{ℓ₂} ).

We have E[ψ_t] = 0 by construction. Furthermore, note that ‖Σ^{−1/2}‖ ≤ 1/√γ by admissibility, and that ‖F^{−1}e_ℓ‖₂² = 1/r for any standard basis vector e_ℓ. These facts, along with the assumption that ‖u_t‖₂ ≤ M, show that |ψ_t| ≤ 2M²/γ almost surely. Hence, ψ_t is a zero-mean sub-Gaussian random variable with sub-Gaussian parameter 2M²/γ (see e.g. Ch. 2 of [17] for background exposition on sub-Gaussian random variables). Therefore, for any λ ∈ ℝ, the moment generating function (MGF) of the sum is bounded as:

 E_{u_t | ℓ_i}[ exp( λ ∑_{t=1}^N ψ_t ) ] ≤ exp( 2λ² M⁴ N / γ² ).

Hence, by iterating expectations and setting λ = 2τ²/σ², we have:

 E_{ℓ_i, u_t}[ exp( (2τ²/σ²) Re( e_{ℓ₁}* F^{−*}Σ^{−1/2} Δ Σ^{−1/2} F^{−1}e_{ℓ₂} ) ) ] ≤ exp( 8τ⁴ M⁴ N / (σ⁴ γ²) ).

Therefore, for any choice of τ such that:

 τ⁴ ≤ (log(1.1)/8) · σ⁴ γ² / (M⁴ N), (4.1)

we have that:

 E_{θ_i, u_t}[ G(θ₁, θ₂) ] ≤ √1.1 · √( (2/r) exp( 2Nτ²/(σ²r) ) + 1 − 2/r ).

Hence, if we set τ to be:

 τ² = ( σ² r / (2N) ) log( 1 + 0.21 r ), (4.2)

we have that:

 χ²(Pr_{π₂}, Pr₀) = E_{θ_i, u_t}[ G(θ₁, θ₂) ] − 1 ≤ 1/4,

assuming the condition (4.1) is satisfied. This bound then implies that d_TV(Pr_{π₂}, Pr₀) ≤ 1/2. We now aim to choose N so that the condition (4.1) holds. Plugging our setting of τ² from (4.2) into (4.1) and rearranging yields the condition N ≳ (M⁴/γ²) r² log²(1 + 0.21 r).

To conclude, we need to show a minimum separation between the H_∞-norms on supp(π₁) vs. supp(π₂). Clearly ‖H(θ)‖_∞ = 0 on supp(π₁). On the other hand, for π₂ we observe that, for any i ∈ ℐ,

 ‖H( τ Σ^{−1/2} F^{−1} e_i )‖_∞ (a)≥ τ ‖F Σ^{−1/2} F^{−1} e_i‖_∞ (b)≥ τ / (2M),

where inequality (a) comes from ‖H(θ)‖_∞ ≥ ‖Fθ‖_∞ for any θ ∈ ℂʳ (the DFT evaluates the transfer function on the grid of r-th roots of unity) and inequality (b) comes from Proposition 4.1. Hence we have constructed two prior distributions with a separation of τ/(2M) but a total variation distance of at most 1/2. Theorem 3.1 now follows.

### 4.2 Proof of active lower bound (Theorem 3.2)

For this setting we let π₁ = δ₀ and π₂ = Unif({ τ F^{−1} e_i }_{i=1}^r). The proof proceeds by bounding the KL-divergence KL(Pr_{π₁}, Pr_{π₂}). To do this, we first bound KL(Pr₀, Pr_i), where Pr₀ is the joint distribution induced by the parameter θ = 0 and Pr_i is the joint distribution induced by the parameter θ = τ F^{−1} e_i. Proceeding similarly to the proof of Theorem 1.3 in [14],

 KL(Pr₀, Pr_i) = E_{Pr₀}[ log ∏_{t=1}^N ( γ_t(u_t | {u_k, x_k}_{k=1}^{t−1}) p₀(x_t | u_t) ) / ( γ_t(u_t | {u_k, x_k}_{k=1}^{t−1}) p_i(x_t | u_t) ) ]
  = E_{Pr₀}[ log ∏_{t=1}^N p₀(x_t | u_t) / p_i(x_t | u_t) ]
  = ∑_{t=1}^N E_{Pr₀}[ log p₀(x_t | u_t) / p_i(x_t | u_t) ]
  = ∑_{t=1}^N E_{u_t ∼ Pr₀}[ KL( N(0, σ²I), N( T(τ F^{−1} e_i) u_t, σ²I ) ) ]
  = ( τ² / (2σ²) ) ∑_{t=1}^N E_{u_t ∼ Pr₀}[ ‖T(F^{−1} e_i) u_t‖₂² ].

A simple calculation shows that:

 ∑_{i=1}^r T(F^{−1} e_i)* T(F^{−1} e_i) = diag( 1, (r−1)/r, (r−2)/r, ..., 1/r ),
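This identity can be checked numerically. The sketch below assumes the unnormalized-DFT convention used in Section 4.1, under which the columns of F⁻¹ are the inverse FFTs of the standard basis vectors:

```python
import numpy as np

r = 8
acc = np.zeros((r, r), dtype=complex)
for i in range(r):
    e = np.zeros(r)
    e[i] = 1.0
    v = np.fft.ifft(e)   # i-th column of F^{-1} under the unnormalized DFT convention
    # Lower-triangular Toeplitz matrix T(v) with first column v.
    T = np.zeros((r, r), dtype=complex)
    for k in range(r):
        T[k:, k] = v[: r - k]
    acc += T.conj().T @ T

expected = np.diag([(r - k) / r for k in range(r)])
assert np.allclose(acc, expected)
```

In particular, the largest diagonal entry is 1, which confirms that the operator norm of the sum is bounded by one, as used in the convexity step that follows.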

and hence the operator norm of this matrix is bounded by one. Therefore, by convexity of the KL-divergence,

 KL(Pr_{π₁}, Pr_{π₂}) ≤ (1/r) ∑_{i=1}^r KL(Pr₀, Pr_i)
  = ( τ² / (2σ² r) ) ∑_{i=1}^r ∑_{t=1}^N E_{u_t ∼ Pr₀}[ ‖T(F^{−1} e_i) u_t‖₂² ]
  = ( τ² / (2σ² r) ) ∑_{t=1}^N E_{u_t ∼ Pr₀}[ ∑_{i=1}^r ‖T(F^{−1} e_i) u_t‖₂² ]