# Efficient Estimation of Smooth Functionals in Gaussian Shift Models

We study a problem of estimation of smooth functionals of parameter $\theta$ of Gaussian shift model $X=\theta+\xi,\ \theta\in E$, where $E$ is a separable Banach space and $X$ is an observation of unknown vector $\theta$ in Gaussian noise $\xi$ with zero mean and known covariance operator $\Sigma$. In particular, we develop estimators $T(X)$ of $f(\theta)$ for functionals $f:E\mapsto\mathbb{R}$ of Hölder smoothness $s>0$ such that $\sup_{\|\theta\|\le 1}E_\theta(T(X)-f(\theta))^2\lesssim(\|\Sigma\|\vee(E\|\xi\|^2)^s)\wedge 1$, where $\|\Sigma\|$ is the operator norm of $\Sigma$, and show that this mean squared error rate is minimax optimal (up to a logarithmic factor) at least in the case of standard Gaussian shift model ($E=\mathbb{R}^d$ equipped with the canonical Euclidean norm, $\xi=\sigma Z$, $Z\sim N(0;I_d)$). Moreover, we determine a sharp threshold on the smoothness $s$ of functional $f$ such that, for all $s$ above the threshold, $f(\theta)$ can be estimated efficiently with a mean squared error rate of the order $\|\Sigma\|$ in a "small noise" setting (that is, when $E\|\xi\|^2$ is small). The construction of efficient estimators is crucially based on a "bootstrap chain" method of bias reduction. The results could be applied to a variety of special high-dimensional and infinite-dimensional Gaussian models (for vector, matrix and functional data).

## 1 Introduction

The problem of estimation of functionals of “high complexity” parameters of statistical models often occurs both in high-dimensional and in nonparametric statistics, where it is important to identify features of a complex parameter that can be estimated efficiently with fast (sometimes parametric) convergence rates. Such problems are very important in the case of vector, matrix or functional parameters in a variety of applications including functional data analysis and kernel machine learning ([34], [5]). In this paper, we study a very basic version of this problem in the case of rather general Gaussian models with unknown mean. Consider the following Gaussian shift model

$$X=\theta+\xi,\quad \theta\in E, \tag{1.1}$$

where $E$ is a separable Banach space, $\theta\in E$ is an unknown parameter and $\xi$ is a mean zero Gaussian random variable in $E$ (the noise) with known covariance operator $\Sigma$. In other words, an observation $X$ in Gaussian shift model (1.1) is a Gaussian vector in $E$ with unknown mean $\theta$ and known covariance $\Sigma$. Recall that $\Sigma$ is an operator from the dual space $E^{*}$ into $E$ such that $\Sigma u=E\langle\xi,u\rangle\xi,\ u\in E^{*}$. Here and in what follows, $\langle x,u\rangle$ denotes the value of a linear functional $u\in E^{*}$ on a vector $x\in E$ (although, in some parts of the paper, with a little abuse of notation, $\langle\cdot,\cdot\rangle$ will also denote the inner product of Euclidean spaces). It is well known that the covariance operator $\Sigma$ of a Gaussian vector in $E$ is bounded and, moreover, it is nuclear.

Our goal is to study the problem of estimation of $f(\theta)$ for smooth functionals $f:E\mapsto\mathbb{R}$. The problem of estimation of smooth functionals of parameters of infinite-dimensional (nonparametric) models has been studied for several decades. It is considerably harder than in the classical finite-dimensional parametric i.i.d. models, where under standard regularity assumptions, $f(\hat\theta)$ ($\hat\theta$ being the maximum likelihood estimator) is an asymptotically efficient (in the sense of Hájek-Le Cam) estimator of $f(\theta)$ with $\sqrt{n}$-rate for continuously differentiable functions $f$.

In the nonparametric case, classical convergence rates do not necessarily hold in functional estimation problems and minimax optimal convergence rates have to be determined. Moreover, even when the classical convergence rates do hold, the construction of efficient estimators is often a challenging problem. Such problems have often been studied for special models (Gaussian white noise model, nonparametric density estimation model, etc.) and for special functionals (with a number of nontrivial results even in the case of linear and quadratic functionals). Early results in this direction are due to Levit [28, 29] and Ibragimov and Khasminskii [15]. Further important references include Ibragimov, Nemirovski and Khasminskii [16], Donoho and Liu [9, 10], Bickel and Ritov [2], Donoho and Nussbaum [11], Nemirovski [31, 32], Birgé and Massart [4], Laurent [26], Lepski, Nemirovski and Spokoiny [30], Cai and Low [6, 7], Klemelä [19] as well as a vast literature on semiparametric efficiency (see, e.g., [3] and references therein). Early results on consistent and asymptotically normal estimation of smooth functionals of high-dimensional parameters are due to Girko [13, 14]. More recently, there has been a lot of interest in efficient and minimax optimal estimation of functionals of parameters of high-dimensional models, including a variety of problems related to semiparametric efficiency of regularized estimators (see [36], [17], [37]), minimax optimal rates of estimation of special functionals (see [8]), and efficient estimation of smooth functionals of covariance in Gaussian models [23, 20].

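Throughout the paper, given nonnegative $a,b$, $a\lesssim b$ means that $a\le Cb$ for a numerical constant $C$, $a\gtrsim b$ is equivalent to $b\lesssim a$ and $a\asymp b$ is equivalent to $a\lesssim b$ and $b\lesssim a$. Sometimes signs of relationships $\lesssim,\gtrsim$ and $\asymp$ will be provided with subscripts (say, $a\lesssim_{\gamma}b$ or $a\asymp_{\gamma}b$), indicating possible dependence of the constants on the corresponding parameters.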

In what follows, exponential bounds on random variables (say, on ) are often stated in the following form: there exists a constant such that, for all

with probability at least

The proof could often result in a slightly different bound, for instance, with probability However, replacing constant with it is easy to obtain the probability bound in the initial form In such cases, we say that “adjusting the constants” allows us to write the probability as (without providing further details).

We will now briefly discuss the results of Ibragimov, Nemirovski and Khasminskii [16] and follow up results of Nemirovski [31, 32] that are especially close to our approach to the problem. In [16], the following model was studied

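$$dX^{(n)}(t)=\theta(t)\,dt+\frac{1}{\sqrt{n}}\,dw(t),\quad t\in[0,1],$$

in which a “signal” $\theta\in\Theta\subset L_2([0,1])$ is observed in a Gaussian white noise ($w$ being a standard Brownian motion on $[0,1]$). The complexity of the parameter space $\Theta$ was characterized by Kolmogorov widths:

$$d_m(\Theta):=\inf_{L\subset L_2([0,1]),\ \dim(L)\le m}\ \sup_{\theta\in\Theta}\|\theta-P_L\theta\|_{2},$$

where $P_L$ denotes the orthogonal projection onto subspace $L$. Assuming that and, for some $\beta>0$,

$$d_m(\Theta)\lesssim m^{-\beta},\quad m\ge 1,$$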

the goal of the authors was to determine a “smoothness threshold” such that, for all $s$ above the threshold and for all functionals $f$ on $L_2([0,1])$ of smoothness $s$, $f(\theta)$ could be estimated efficiently with rate $n^{-1/2}$ based on observation $X^{(n)}$ (whereas for $s$ below the threshold there exist functionals $f$ of smoothness $s$ such that $f(\theta)$ could not be estimated with parametric rate $n^{-1/2}$). It turned out that the main difficulties in this problem are related to a proper definition of the smoothness of the functional $f$. In particular, even such simple functional as could not be estimated efficiently on some sets with The smoothness of functionals on Hilbert space is usually defined in terms of their Hölder type norms that, in turn, depend on a way in which the norm of Fréchet derivatives is defined. The $j$-th order Fréchet derivative $f^{(j)}(\theta)$ is a symmetric $j$-linear form on $L_2([0,1])$. The most common definition of the norm of such a form is the operator norm: Other possibilities include Hilbert–Schmidt norm and “hybrid” norms The Hölder classes in [16] were defined in terms of the following norms: for

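$$\|f\|_{\tilde C^{s}}:=\max_{0\le j\le k-1}\ \sup_{\theta\in 2U}\|f^{(j)}(\theta)\|_{HS}\ \bigvee\ \sup_{\theta\in 2U}\|f^{(k)}(\theta)\|_{(1)}\ \bigvee\ \sup_{\theta,\theta'\in 2U,\ \theta\neq\theta'}\frac{\|f^{(k)}(\theta)-f^{(k)}(\theta')\|}{\|\theta-\theta'\|}.$$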

With this somewhat complicated definition, it was proved that, if and, either and or and then there exists an asymptotically efficient estimator of with convergence rate

The construction of such estimators was based on the development of a method of unbiased estimation of Hilbert–Schmidt polynomials on $L_2([0,1])$ and on Taylor expansion of $f$ in a neighborhood of an estimator $\hat\theta$ of $\theta$ with an optimal nonparametric error rate. It was later shown in [31, 32] that the smoothness thresholds described above are optimal.

We will study similar problems for Gaussian shift model (1.1) trying to determine smoothness thresholds for efficient estimation in terms of proper complexity characteristics for this model.

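Among the simplest smooth functionals on $E$ are bounded linear functionals $\langle\theta,u\rangle$, $u\in E^{*}$. For a straightforward estimator $\langle X,u\rangle$ of such a functional,

$$E_\theta(\langle X,u\rangle-\langle\theta,u\rangle)^2=E\langle\xi,u\rangle^2=\langle\Sigma u,u\rangle,$$

and, for functionals $u$ from the unit ball of $E^{*}$, the largest possible mean squared error is equal to the operator norm of $\Sigma$:

$$\|\Sigma\|=\sup_{\|u\|,\|v\|\le 1}E\langle\xi,u\rangle\langle\xi,v\rangle=\sup_{\|u\|\le 1}E\langle\xi,u\rangle^2.$$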

It is also not hard to prove the following proposition.

###### Proposition 1.1.

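Let

$$\hat T(X):=\begin{cases}\langle X,u\rangle & \text{for } \|\Sigma\|\le 1\\ 0 & \text{for } \|\Sigma\|>1.\end{cases}$$

Then

$$\sup_{\|u\|\le 1}\ \sup_{\|\theta\|\le 1}E_\theta(\hat T(X)-\langle\theta,u\rangle)^2\le\|\Sigma\|\wedge 1$$

and

$$\sup_{\|u\|\le 1}\ \inf_{T}\ \sup_{\|\theta\|\le 1}E_\theta(T(X)-\langle\theta,u\rangle)^2\gtrsim\|\Sigma\|\wedge 1. \tag{1.2}$$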

In what follows, the complexity of estimation problem will be characterized by two parameters of the noise $\xi$. One is the operator norm $\|\Sigma\|$, which is involved in the minimax mean squared error for estimation of linear functionals. It will be convenient to view $\|\Sigma\|$ as the weak variance of $\xi$. Another complexity parameter is the strong variance of $\xi$ defined as

$$E\|\xi\|^2=E\sup_{\|u\|,\|v\|\le 1}\langle\xi,u\rangle\langle\xi,v\rangle=E\sup_{\|u\|\le 1}\langle\xi,u\rangle^2.$$

Clearly, $\|\Sigma\|\le E\|\xi\|^2$. The ratio of these two parameters,

$$r(\Sigma):=\frac{E\|\xi\|^2}{\|\Sigma\|},$$

is called the effective rank of $\Sigma$ and it was used earlier in concentration bounds for sample covariance operators and their spectral projections [22, 21]. The following properties of $r(\Sigma)$ are obvious:

$$r(\Sigma)\ge 1\ \text{ and }\ r(\lambda\Sigma)=r(\Sigma),\ \lambda>0.$$

Thus, the effective rank is invariant with respect to rescaling of $\Sigma$ (or rescaling of the noise). In this sense, $\|\Sigma\|$ and $r(\Sigma)$ can be viewed as complementary parameters of the noise. It is easy to check that, if $E$ is a Hilbert space, then $E\|\xi\|^2={\rm tr}(\Sigma)$, which implies that $r(\Sigma)=\frac{{\rm tr}(\Sigma)}{\|\Sigma\|}$. Clearly, $r(\Sigma)$ could be viewed as a way to measure the dimensionality of the noise. In particular, for the maximum likelihood estimator $X$ of $\theta$ in the Gaussian shift model (1.1), we have $E_\theta\|X-\theta\|^2=\|\Sigma\|\,r(\Sigma)$, resembling a standard formula $\sigma^2d$ for the risk of estimation of a vector in $\mathbb{R}^d$ observed in a “white noise” with variance $\sigma^2$.
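As a small numerical illustration of these definitions, the following Python sketch computes the weak variance, the strong variance and the effective rank of a covariance matrix. It assumes the Euclidean case $E=\mathbb{R}^d$ with the $\ell_2$-norm, where the strong variance equals the trace of $\Sigma$. The function name `effective_rank` and the test matrices are illustrative choices, not part of the paper.

```python
import numpy as np

def effective_rank(cov: np.ndarray) -> float:
    """Effective rank r(Sigma) = E||xi||^2 / ||Sigma|| for a Euclidean norm.

    Assumes E = R^d with the l2-norm, so that E||xi||^2 = tr(Sigma)
    and ||Sigma|| is the largest eigenvalue of the covariance matrix.
    """
    weak_variance = np.linalg.eigvalsh(cov)[-1]  # eigenvalues come in ascending order
    strong_variance = np.trace(cov)              # E||xi||^2 in the Hilbert space case
    return float(strong_variance / weak_variance)

# Isotropic noise Sigma = sigma^2 I_d has effective rank d.
d, sigma = 50, 0.1
r_isotropic = effective_rank(sigma**2 * np.eye(d))

# A fast-decaying spectrum gives a small effective rank even when d is large.
decaying = np.diag(1.0 / np.arange(1, d + 1) ** 2)
r_decaying = effective_rank(decaying)

# Rescaling the noise leaves the effective rank unchanged.
r_rescaled = effective_rank(7.0 * decaying)

print(r_isotropic, r_decaying, r_rescaled)
```

In this sketch the second matrix has ambient dimension 50 but an effective rank below 2, which is the sense in which $r(\Sigma)$ measures the dimensionality of the noise rather than of the space.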

We discuss below several simple examples of the general Gaussian shift model (1.1).

###### Example 1.

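Standard Gaussian shift model. Let $E=\mathbb{R}^d$ be equipped with the canonical Euclidean inner product and the corresponding norm (the $\ell_2$-norm), and let $\xi=\sigma Z$, where $\sigma>0$ is a known constant and $Z\sim N(0;I_d)$. In this case, $\|\Sigma\|=\sigma^2$, $E\|\xi\|^2=\sigma^2d$ and $r(\Sigma)=d$. Note that the size of effective rank crucially depends on the choice of underlying norm of the linear space. For instance, if $\mathbb{R}^d$ is equipped with the $\ell_\infty$-norm instead of $\ell_2$-norm, then we still have $\|\Sigma\|=\sigma^2$, but

$$E\|\xi\|_{\ell_\infty}^2\asymp\sigma^2\log d,$$

implying that $r(\Sigma)\asymp\log d$.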
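The dependence of the effective rank on the norm can be checked by simulation. The sketch below is a Monte Carlo illustration under the assumptions of Example 1 ($\xi=\sigma Z$, $Z\sim N(0;I_d)$). The helper `strong_variance_mc`, the sample size and the seed are hypothetical choices made here.

```python
import numpy as np

def strong_variance_mc(d, sigma, norm_ord, n_samples=2000, seed=0):
    """Monte Carlo estimate of E||xi||^2 for xi = sigma * Z, Z ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    xi = sigma * rng.standard_normal((n_samples, d))
    norms = np.linalg.norm(xi, ord=norm_ord, axis=1)  # vector norm of each row
    return float(np.mean(norms ** 2))

d, sigma = 1000, 0.5
weak_variance = sigma ** 2  # ||Sigma|| = sigma^2 for both the l2- and the l_inf-norm

r_l2 = strong_variance_mc(d, sigma, 2) / weak_variance
r_linf = strong_variance_mc(d, sigma, np.inf) / weak_variance

print(f"l2 effective rank    ~ {r_l2:.1f} (d = {d})")
print(f"l_inf effective rank ~ {r_linf:.1f} (log d = {np.log(d):.1f})")
```

The first estimate should be close to $d$, while the second should be a moderate multiple of $\log d$, in line with the two-sided bound $\asymp$ stated above.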

###### Example 2.

Matrix Gaussian shift models. Let $E$ be the space of all symmetric $d\times d$ matrices equipped with the operator norm and let $\xi=\sigma Z$ with known parameter $\sigma>0$ and $Z$ sampled from the Gaussian orthogonal ensemble (that is, $Z=(Z_{ij})_{i,j=1}^{d}$ is a symmetric random matrix whose entries $Z_{ij}$, $i\le j$, are independent centered Gaussian r.v.). In this case, $\|\Sigma\|\asymp\sigma^2$ and

$$E\|\xi\|^2=\sigma^2E\|Z\|^2\asymp\sigma^2d,$$

implying that $r(\Sigma)\asymp d$. As before, the effective rank would be different for a different choice of norm on $E$. For instance, if $E$ is equipped with the Hilbert–Schmidt norm, then $r(\Sigma)\asymp d^2$ (compare this with Example 1).
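The linear growth of $E\|Z\|^2$ in $d$ can also be illustrated numerically. The sketch below assumes one common normalization of the Gaussian orthogonal ensemble (diagonal entries of variance $1$, off-diagonal entries of variance $1/2$). This normalization, the helper names and the sample sizes are assumptions made here for illustration; a different normalization only changes the constants.

```python
import numpy as np

def sample_goe(d, rng):
    """Symmetric Gaussian matrix with Var(Z_ii) = 1 and Var(Z_ij) = 1/2 (assumed normalization)."""
    g = rng.standard_normal((d, d))
    return (g + g.T) / 2.0

def mean_sq_operator_norm(d, n_samples=200, seed=0):
    """Monte Carlo estimate of E||Z||^2, with ||.|| the operator (spectral) norm."""
    rng = np.random.default_rng(seed)
    sq_norms = [np.linalg.norm(sample_goe(d, rng), ord=2) ** 2 for _ in range(n_samples)]
    return float(np.mean(sq_norms))

# The ratio E||Z||^2 / d should stay bounded away from 0 and infinity as d grows.
ratios = {d: mean_sq_operator_norm(d) / d for d in (25, 50, 100)}
for d, ratio in ratios.items():
    print(d, round(ratio, 2))
```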

###### Example 3.

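Gaussian functional data model. Let $E=C([0,1]^d)$ be equipped with the sup-norm $\|\cdot\|_{\infty}$. Suppose that $\xi=\sigma Z$, where $\sigma>0$ is a known parameter and $Z$ is a mean zero Gaussian process on $[0,1]^d$ with the sample paths continuous a.s. (and with known distribution). Without loss of generality, assume that Suppose that, for some $\beta>0$,

$$\tau^2(t,s):=E|Z(t)-Z(s)|^2\lesssim|t-s|^{\beta},\quad t,s\in[0,1]^d.$$

Then, it is easy to see that the following bound holds for the metric entropy of $[0,1]^d$ with respect to metric $\tau$:

$$H_{\tau}([0,1]^d;\varepsilon)\lesssim_{\beta}d\log\frac{1}{\varepsilon}.$$

It follows from Dudley’s entropy bound that

$$E\|Z\|_{\infty}^2\lesssim_{\beta}\Bigl(\int_0^1H_{\tau}^{1/2}([0,1]^d;\varepsilon)\,d\varepsilon\Bigr)^2\lesssim d.$$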

Therefore, it is easy to conclude that and implying that

In the following sections, we develop estimators of $f(\theta)$ in Gaussian shift model with mean squared error of the order $(\|\Sigma\|\vee(E\|\xi\|^2)^{s})\wedge 1$, where $s$ is the degree of smoothness of functional $f$. We also show that this error rate is minimax optimal up to a logarithmic factor (at least in the case of standard Gaussian shift model). Moreover, we determine a sharp threshold on smoothness $s$ such that, for all $s$ above this threshold and all functionals $f$ of smoothness $s$, the mean squared error rate of estimation of $f(\theta)$ is of the order $\|\Sigma\|$ (as for linear functionals), and, for all $s$ strictly above the threshold, we prove the efficiency of our estimators in the “small noise” case (when the strong variance $E\|\xi\|^2$ is small). The key ingredient in the development of such estimators is a bootstrap chain bias reduction method introduced in [20] in the problem of estimation of smooth functionals of covariance operators. We will outline this approach in Section 2 and develop it in detail in Section 3 for Gaussian shift models.

## 2 Overview of Main Results

We will study how the optimal error rate of estimation of $f(\theta)$ for parameter $\theta$ of Gaussian shift model (1.1) depends on the smoothness of the functional $f$ as well as on the weak and strong variances, $\|\Sigma\|$ and $E\|\xi\|^2$, of the noise $\xi$ (or, equivalently, on the parameters $\|\Sigma\|$ and $r(\Sigma)$). To this end, we define below a Banach space $C^{s,\gamma}(E)$ of functionals $f:E\mapsto\mathbb{R}$ of smoothness $s>0$ such that $f$ and its derivatives grow as $\|\theta\|\to\infty$ not faster than $\|\theta\|^{\gamma}$ for some $\gamma\ge 0$.

### 2.1 Differentiability

For Banach spaces let be the Banach space of symmetric -linear forms with bounded operator norm

$$\|M\|:=\sup_{\|h_1\|\le 1,\dots,\|h_k\|\le 1}\|M(h_1,\dots,h_k)\|<\infty.$$

For is the space of constants (vectors of ). A function defined by where is called a bounded homogeneous -polynomial on with values in It is known that uniquely defines A bounded polynomial on with values in is an arbitrary function represented as a finite sum where is a non-zero bounded homogeneous -polynomial. For we set Polynomials are uniquely defined by The degree of is defined as (with ). If for define

$$\|P\|_{op}:=\sum_{j\in I}\|M_j\|.$$

Recall that a function $f:E\mapsto F$ is called Fréchet differentiable at a point $x\in E$ iff there exists a bounded linear operator $f'(x)$ from $E$ to $F$ (Fréchet derivative) such that

$$f(x+h)-f(x)=f'(x)h+o(\|h\|)\ \text{ as }\ h\to 0.$$

Higher order Fréchet derivatives could be defined by induction. The -th order Fréchet derivative at point is defined as the Fréchet derivative of the mapping (assuming its Fréchet differentiability). It is a bounded linear operator from to that could be also viewed as a bounded symmetric -linear form from the space As always, we call -times (Fréchet) continuously differentiable if its -th order derivative exists and it is a continuous function on Clearly, polynomials are times Fréchet differentiable for any If is a polynomial and then is a constant (a -linear symmetric form that does not depend on ) and

We will be interested in what follows in classes of smooth functionals with at most polynomial (with respect to ) growth of their derivatives. To this end, we describe below several useful norms.

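First, let $g:E\mapsto F$. For $\gamma\ge 0$, let

$$\|g\|_{L_{\infty,\gamma}}:=\sup_{x\in E}\frac{\|g(x)\|}{(1\vee\|x\|)^{\gamma}}$$

and for $\rho\in(0,1]$, let

$$\|g\|_{{\rm Lip}_{\rho,\gamma}}:=\sup_{x'\neq x''}\frac{\|g(x')-g(x'')\|}{(1\vee\|x'\|\vee\|x''\|)^{\gamma}\|x'-x''\|^{\rho}}.$$

Assuming that spaces $E,F$ are equipped with their Borel $\sigma$-algebras, we define $L_{\infty,\gamma}(E;F)$ as the space of measurable functions $g:E\mapsto F$ with $\|g\|_{L_{\infty,\gamma}}<\infty$. We also define

$${\rm Lip}_{\rho,\gamma}(E;F):=\{g:\|g\|_{{\rm Lip}_{\rho,\gamma}}<\infty\}.$$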

In the case of we will write simply and for we write instead of

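For $k\ge 0$, we will define the norm

$$\|g\|_{C^{k,\gamma}}:=\max_{0\le j\le k}\|g^{(j)}\|_{L_{\infty,\gamma}}$$

and the space $C^{k,\gamma}(E;F)$ of $k$ times differentiable functions (with the growth rate of derivatives characterized by $\gamma$). Finally, for $s=k+\rho$ with $k\ge 0$ and $\rho\in(0,1]$, define

$$\|g\|_{C^{s,\gamma}}:=\max_{0\le j\le k}\|g^{(j)}\|_{L_{\infty,\gamma}}\vee\|g^{(k)}\|_{{\rm Lip}_{\rho,\gamma}}$$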

and the space As before, we set It is easy to see that for any polynomial such that and for all

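In what follows, we frequently use bounds on the remainder of the first order Taylor expansion

$$S_g(x;h):=g(x+h)-g(x)-g'(x)(h),\quad x,h\in E$$

of Fréchet differentiable function $g$. We will skip the proof of the following simple lemma.

###### Lemma 2.1.

Assume that $g$ is Fréchet differentiable in $E$ with $\|g'\|_{{\rm Lip}_{\rho,\gamma}}<\infty$. Then

$$|S_g(x;h)|\lesssim\|g'\|_{{\rm Lip}_{\rho,\gamma}}(1\vee\|x\|\vee\|h\|)^{\gamma}\|h\|^{1+\rho},\quad x,h\in E$$

and

$$|S_g(x;h')-S_g(x;h)|\lesssim\|g'\|_{{\rm Lip}_{\rho,\gamma}}(1\vee\|x\|\vee\|h\|\vee\|h'\|)^{\gamma}(\|h\|\vee\|h'\|)^{\rho}\|h'-h\|,\quad x,h,h'\in E.$$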

### 2.2 Definition of estimators and risk bounds

The crucial step in construction of estimator is a bias reduction method developed in detail in Section 3 and briefly outlined here. Consider the following linear operator

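$$\mathcal{T}g(\theta):=E_\theta g(X)=Eg(\theta+\xi),\quad \theta\in E$$

that is well defined on the spaces $L_{\infty,\gamma}(E)$ for $\gamma\ge 0$. Given a smooth functional $f:E\mapsto\mathbb{R}$, we would like to find a functional $g$ on $E$ such that the bias of estimator $g(X)$ of $f(\theta)$ is small enough. In other words, we would like to find an approximate solution of operator equation $\mathcal{T}g=f$. Under the assumption that the strong variance $E\|\xi\|^2$ of the noise $\xi$ is small, the operator $\mathcal{T}$ is close to the identity operator $\mathcal{I}$. Define $\mathcal{B}:=\mathcal{T}-\mathcal{I}$. Then, at least formally, the solution of the equation $\mathcal{T}g=f$ could be written as a Neumann series:

$$g=(\mathcal{I}+\mathcal{B})^{-1}f=(\mathcal{I}-\mathcal{B}+\mathcal{B}^2-\mathcal{B}^3+\dots)f.$$

We will define an estimator $f_k(X)$ in terms of a partial sum of this series:

$$f_k(\theta):=\sum_{j=0}^{k}(-1)^{j}\mathcal{B}^{j}f(\theta),\quad \theta\in E.$$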

It will be proved in Section 3, that, for this estimator, the bias is of the order provided that for and is bounded by a constant.
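The partial Neumann sum can be computed exactly in a toy setting, which makes the bias reduction visible. The sketch below assumes $E=\mathbb{R}$, $\xi\sim N(0,\sigma^2)$ and a polynomial $f$ given by its coefficients. This one-dimensional polynomial setting and all function names (`apply_T`, `apply_B`, `bootstrap_chain`, `bias`) are illustrative assumptions, not the paper's construction in general Banach spaces. It relies on the algebraic identity $\mathcal{T}f_k-f=(-1)^{k}\mathcal{B}^{k+1}f$, which follows by telescoping the sum.

```python
import numpy as np
from math import comb

def gaussian_moment(j: int, sigma: float) -> float:
    """E xi^j for xi ~ N(0, sigma^2): sigma^j (j-1)!! for even j, 0 for odd j."""
    if j % 2 == 1:
        return 0.0
    double_factorial = 1.0
    for m in range(j - 1, 0, -2):
        double_factorial *= m
    return sigma ** j * double_factorial

def apply_T(coeffs, sigma):
    """Coefficients (ascending powers) of (T g)(x) = E g(x + xi) for a polynomial g."""
    out = np.zeros(len(coeffs), dtype=float)
    for n, c in enumerate(coeffs):
        for j in range(n + 1):
            out[n - j] += c * comb(n, j) * gaussian_moment(j, sigma)
    return out

def apply_B(coeffs, sigma):
    """B = T - I applied to a polynomial."""
    return apply_T(coeffs, sigma) - np.asarray(coeffs, dtype=float)

def bootstrap_chain(coeffs, sigma, k):
    """Coefficients of f_k = sum_{j=0}^{k} (-1)^j B^j f."""
    term = np.asarray(coeffs, dtype=float).copy()
    total = term.copy()
    for j in range(1, k + 1):
        term = apply_B(term, sigma)
        total += (-1) ** j * term
    return total

def bias(coeffs, sigma, k):
    """Coefficients of the bias E_theta f_k(X) - f(theta) = (T f_k - f)(theta)."""
    return apply_T(bootstrap_chain(coeffs, sigma, k), sigma) - np.asarray(coeffs, dtype=float)

sigma = 0.3
f = [0.0, 0.0, 0.0, 0.0, 1.0]  # f(theta) = theta^4
for k in range(3):
    print(k, bias(f, sigma, k))
```

For $f(\theta)=\theta^4$ the plug-in estimator ($k=0$) has bias $6\sigma^2\theta^2+3\sigma^4$, one step of the chain leaves a constant bias $-6\sigma^4$, and two steps remove the bias entirely, since $\mathcal{B}$ lowers the degree of a polynomial by two in this setting.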

We will prove in Section 4 the following result.

###### Theorem 2.1.

Let for some and let Suppose that Let

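$$T_k(X):=\begin{cases}f_k(X) & \text{if } E^{1/2}\|\xi\|^2\le 1/2\\ 0 & \text{otherwise.}\end{cases}$$

Then

$$E_\theta(T_k(X)-f(\theta))^2\lesssim_{\gamma}(k+1)^{\gamma}\|f\|_{C^{s,\gamma}}^2(1\vee\|\theta\|)^{2\gamma}\Bigl(\bigl(\|\Sigma\|\vee(E\|\xi\|^2)^{s}\bigr)\wedge 1\Bigr). \tag{2.1}$$

It follows from bound (2.1) that

$$\sup_{\|f\|_{C^{s,\gamma}}\le 1}\ \sup_{\|\theta\|\le 1}E_\theta(T_k(X)-f(\theta))^2\lesssim_{s,\gamma}\Bigl(\bigl(\|\Sigma\|\vee(E\|\xi\|^2)^{s}\bigr)\wedge 1\Bigr). \tag{2.2}$$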

We will show in Section 7 that, in the case of standard Gaussian shift model, the above bound is optimal up to a logarithmic factor in a minimax sense. More precisely, in this case, the following result holds.

###### Theorem 2.2.

Let (equipped with the standard Euclidean norm) and let for some Then

$$\gtrsim\Bigl(\|\Sigma\|\bigvee\Bigl(\frac{E\|\xi\|^2}{\log d}\Bigr)^{s}\Bigr)\bigwedge 1, \tag{2.3}$$

where the infimum is taken over all possible estimators

At this point, we do not know whether the log factor in the minimax rate is needed and we could not extend the lower bound to general Gaussian shift models in Banach spaces.

### 2.3 Efficiency

Bound (2.2) implies that, if the smoothness $s$ of functional $f$ is sufficiently large, namely if

$$(E\|\xi\|^2)^{s}\le\|\Sigma\|, \tag{2.4}$$

then

$$\sup_{\|f\|_{C^{s,\gamma}}\le 1}\ \sup_{\|\theta\|\le 1}E_\theta(T_k(X)-f(\theta))^2\lesssim_{s,\gamma}\|\Sigma\|\wedge 1, \tag{2.5}$$

which coincides with the largest minimax optimal mean squared error for linear functionals from the unit ball in $E^{*}$. Condition (2.4) can be equivalently written as

$$s\ge 1+\frac{\log r(\Sigma)}{\log\frac{1}{\|\Sigma\|}-\log r(\Sigma)}. \tag{2.6}$$
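The equivalence of (2.4) and (2.6) follows by writing $E\|\xi\|^2=\|\Sigma\|\,r(\Sigma)$ and taking logarithms, which is valid when $E\|\xi\|^2<1$. The Python sketch below evaluates the threshold and the rate from (2.2) for given weak variance and effective rank. The function names and the numerical values ($\sigma=0.1$ and an effective rank of $\sigma^{-1}$) are hypothetical choices made here for illustration.

```python
import math

def smoothness_threshold(weak_var: float, eff_rank: float) -> float:
    """Right-hand side of (2.6): 1 + log r / (log(1/||Sigma||) - log r).

    Only meaningful when the strong variance ||Sigma|| * r(Sigma) lies in (0, 1),
    so that the denominator -log(E||xi||^2) is positive.
    """
    strong_var = weak_var * eff_rank
    if not 0.0 < strong_var < 1.0:
        raise ValueError("requires 0 < E||xi||^2 < 1")
    log_r = math.log(eff_rank)
    return 1.0 + log_r / (math.log(1.0 / weak_var) - log_r)

def rate(weak_var: float, eff_rank: float, s: float) -> float:
    """Rate from (2.2): (||Sigma|| v (E||xi||^2)^s) ^ 1."""
    return min(max(weak_var, (weak_var * eff_rank) ** s), 1.0)

# Hypothetical example: ||Sigma|| = sigma^2 and r(Sigma) = sigma^(-2 * alpha).
sigma, alpha = 0.1, 0.5
weak_var, eff_rank = sigma ** 2, sigma ** (-2 * alpha)
s_star = smoothness_threshold(weak_var, eff_rank)

print(s_star, rate(weak_var, eff_rank, 1.5), rate(weak_var, eff_rank, 3.0))
```

In this example the threshold equals $1/(1-\alpha)$: below it the rate is dominated by the term $(E\|\xi\|^2)^{s}$, and at or above it the rate reduces to $\|\Sigma\|$.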

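If $\|\Sigma\|$ is a small parameter and $r(\Sigma)\le\|\Sigma\|^{-\alpha}$ for some $\alpha\in(0,1)$, condition (2.6) would follow from the condition $s\ge\frac{1}{1-\alpha}$. On the other hand, it follows from bound (2.3) that, in the case of standard Gaussian shift model, the smoothness threshold $\frac{1}{1-\alpha}$ is sharp for estimation with mean squared error rate $\|\Sigma\|$. Indeed, in this case, $\|\Sigma\|=\sigma^2$, $r(\Sigma)=d$ and, if $\sigma$ is small and $d=\sigma^{-2\alpha}$ for some $\alpha\in(0,1)$, then, for any $s<\frac{1}{1-\alpha}$, there exists a functional $f$ with $\|f\|_{C^{s,\gamma}}\le 1$ such that

$$\inf_{T}\ \sup_{\|\theta\|\le 1}E_\theta(T(X)-f(\theta))^2\gtrsim\frac{\sigma^{2s(1-\alpha)}}{\log^{s}(1/\sigma)},$$

which is significantly larger than $\sigma^2$ as $\sigma\to 0$.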

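In the case when $r(\Sigma)\le\|\Sigma\|^{-\alpha}$ for some $\alpha\in(0,1)$ and $s>\frac{1}{1-\alpha}$ (or, more generally, when $(E\|\xi\|^2)^{s}$ is of a smaller order than $\|\Sigma\|$), it is possible to prove that $f_k(X)-f(\theta)$ is close in distribution to normal and establish the efficiency of estimator $f_k(X)$. More precisely, let

$$\sigma_{f,\xi}^2(\theta):=E(f'(\theta)(\xi))^2=\langle\Sigma f'(\theta),f'(\theta)\rangle.$$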

For denote

$$K(f;\Sigma;\theta):=K_{s,\gamma}(f;\Sigma;\theta):=\frac{\|f\|_{C^{s,\gamma}}(1\vee\|\theta\|)^{\gamma}\|\Sigma\|^{1/2}}{\sigma_{f,\xi}(\theta)}.$$

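It is easy to see that

$$\sigma_{f,\xi}(\theta)\le\|\Sigma\|^{1/2}\|f'(\theta)\|\le\|f'\|_{L_{\infty,\gamma}}(1\vee\|\theta\|)^{\gamma}\|\Sigma\|^{1/2}\le\|f\|_{C^{s,\gamma}}(1\vee\|\theta\|)^{\gamma}\|\Sigma\|^{1/2},$$

implying that $K_{s,\gamma}(f;\Sigma;\theta)\ge 1$. We also have that

$$K_{s,\gamma}(f;\lambda\Sigma;\theta)=K_{s,\gamma}(f;\Sigma;\theta),\quad \lambda>0,$$

which means that $K_{s,\gamma}(f;\Sigma;\theta)$ does not depend on the noise level $\|\Sigma\|^{1/2}$. In what follows, it will be assumed that the functional $K_{s,\gamma}(f;\Sigma;\theta)$ is bounded from above by a constant, implying that $\sigma_{f,\xi}(\theta)$ is within a constant from its upper bound $\|f\|_{C^{s,\gamma}}(1\vee\|\theta\|)^{\gamma}\|\Sigma\|^{1/2}$. This is the case, for instance, when $\theta$ is in a bounded set and $\sigma_{f,\xi}(\theta)\gtrsim\|f\|_{C^{s,\gamma}}\|\Sigma\|^{1/2}$ (in other words, the standard deviation $\sigma_{f,\xi}(\theta)$ is not too small compared with the noise level $\|\Sigma\|^{1/2}$).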

The following result will be proved in Section 5.

###### Theorem 2.3.

Suppose, for some and some Suppose also that Then

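$$\sup_{y\in\mathbb{R}}\Bigl|P_\theta\Bigl\{\frac{f_k(X)-f(\theta)}{\sigma_{f,\xi}(\theta)}\le y\Bigr\}-P\{Z\le y\}\Bigr|\lesssim_{\gamma}(k+1)^{\gamma/2}K_{s,\gamma}(f;\Sigma;\theta) \tag{2.7}$$

where $Z$ is a standard normal r.v. Moreover,

$$\Bigl\|\frac{f_k(X)-f(\theta)}{\sigma_{f,\xi}(\theta)}-Z\Bigr\|_{L_2(P)}\lesssim_{\gamma}(k+1)^{\gamma/2}K_{s,\gamma}(f;\Sigma;\theta)\Bigl((E^{1/2}\|\xi\|^2)^{\rho}\bigvee\frac{(E^{1/2}\|\xi\|^2)^{s}}{\|\Sigma\|^{1/2}}\Bigr). \tag{2.8}$$

It follows from bound (2.8) and the triangle inequality that

$$\frac{E_\theta^{1/2}(f_k(X)-f(\theta))^2}{\sigma_{f,\xi}(\theta)}\le 1+c_{\gamma}(k+1)^{\gamma/2}K_{s,\gamma}(f;\Sigma;\theta)\Bigl((E^{1/2}\|\xi\|^2)^{\rho}\bigvee\frac{(E^{1/2}\|\xi\|^2)^{s}}{\|\Sigma\|^{1/2}}\Bigr). \tag{2.9}$$

Assume that $\theta$ is in a set of parameters where $K_{s,\gamma}(f;\Sigma;\theta)$ is upper bounded by a constant. Then, $\frac{E_\theta^{1/2}(f_k(X)-f(\theta))^2}{\sigma_{f,\xi}(\theta)}$ is close to $1$ uniformly in this set provided that $E\|\xi\|^2$ is small and $(E^{1/2}\|\xi\|^2)^{s}$ is much smaller than $\|\Sigma\|^{1/2}$ (say, if $r(\Sigma)\le\|\Sigma\|^{-\alpha}$ for some $\alpha\in(0,1)$ and $s>\frac{1}{1-\alpha}$).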

Finally, in Section 6, we will prove the following minimax lower bound.

###### Theorem 2.4.

Suppose for some and Let Then, there exists a constant such that for all and all covariance operators satisfying the condition the following bound holds

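$$\inf_{T}\ \sup_{\|\theta-\theta_0\|\le c\|\Sigma\|^{1/2}}\frac{E_\theta(T(X)-f(\theta))^2}{\sigma_{f,\xi}^2(\theta)}\ge 1-D_{\gamma}K_{s,\gamma}^2(f;\Sigma;\theta_0)\Bigl(c^{s-1}\|\Sigma\|^{(s-1)/2}+\frac{1}{c^2}\Bigr),$$

where the infimum is taken over all possible estimators $T(X)$.

The bound of Theorem 2.4 shows that, when the noise level $\|\Sigma\|^{1/2}$ is small and $K_{s,\gamma}(f;\Sigma;\theta_0)$ is upper bounded by a constant, the following asymptotic minimax result (in spirit of Hájek and Le Cam) holds

$$\lim_{c\to\infty}\ \liminf_{\|\Sigma\|^{1/2}\to 0}\ \inf_{T}\ \sup_{\|\theta-\theta_0\|\le c\|\Sigma\|^{1/2}}\frac{E_\theta(T(X)-f(\theta))^2}{\sigma_{f,\xi}^2(\theta)}\ge 1$$

locally in a neighborhood of parameter $\theta_0$ of size commensurate with the noise level. This shows the optimality of the variance $\sigma_{f,\xi}^2(\theta)$ of normal approximation and the efficiency of estimator $f_k(X)$.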

###### Remark 2.1.

In the case of matrix Gaussian shift model of Example 2 (that is, when is the space of symmetric matrices equipped with operator norm and being a random matrix from Gaussian orthogonal ensemble), the results of the paper could be applied, in particular, to bilinear forms of smooth functions of symmetric matrices: where is a smooth function in real line and Namely, it was shown in [20], Corollary 2 (based on the results of [33], [1]) that the -norm of operator function can be controlled in terms of Besov -norm of underlying function of real variable This allows one to apply all the results stated above to functional provided that is in a proper Besov space. Note that spectral projections of that correspond to subsets of its spectrum separated by a positive gap from the rest of the spectrum could be represented as for sufficiently smooth functions which allows one to apply the results to bilinear forms of spectral projections (see also [24]). In [20], similar results were obtained for smooth functionals of covariance operators.

###### Remark 2.2.

Obviously, the results of the paper can be applied to the model of i.i.d. observations If then it follows from Theorem 2.1 that

 (2.10)

Uniformly in the class of covariances with and for some this yields a bound on the mean squared error of the order provided that Moreover, if estimator is asymptotically normal and asymptotically efficient with convergence rate and limit variance

## 3 Bias Reduction

A crucial part of our approach to efficient estimation of smooth functionals of $\theta$ is a new bias reduction method based on iterative application of parametric bootstrap. Our goal is to construct an estimator of smooth functional $f(\theta)$ of parameter $\theta\in E$ and, to this end, we construct an estimator of the form $g(X)$ for some functional $g:E\mapsto\mathbb{R}$ for which the bias $E_\theta g(X)-f(\theta)$ is negligible compared with the noise level $\|\Sigma\|^{1/2}$. Define the following linear operator:

$$\mathcal{T}g(\theta):=E_\theta g(X)=Eg(\theta+\xi),\quad \theta\in E.$$
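To see this operator and the resulting bias reduction at work in the simplest possible case, consider the one-dimensional illustration $E=\mathbb{R}$, $\xi\sim N(0,\sigma^2)$, $f(\theta)=\sin\theta$. This choice is an assumption made here for illustration and is not an example taken from the paper. Since $Ee^{i\xi}=e^{-\sigma^2/2}$, all quantities entering the estimator $f_k$ of Section 2.2 are explicit:

```latex
\begin{align*}
\mathcal{T}f(\theta) &= E\sin(\theta+\xi)=e^{-\sigma^2/2}\sin\theta,
\qquad
\mathcal{B}f=(\mathcal{T}-\mathcal{I})f=-q\,f,\quad q:=1-e^{-\sigma^2/2},\\
f_k(\theta) &= \sum_{j=0}^{k}(-1)^{j}\mathcal{B}^{j}f(\theta)
 =\Bigl(\sum_{j=0}^{k}q^{j}\Bigr)\sin\theta
 =\frac{1-q^{k+1}}{1-q}\,\sin\theta,\\
E_\theta f_k(X)-f(\theta) &= \mathcal{T}f_k(\theta)-f(\theta)
 =\bigl(1-q^{k+1}\bigr)\sin\theta-\sin\theta
 =-q^{k+1}\sin\theta .
\end{align*}
```

Since $q\le\sigma^2/2$, the bias of $f_k(X)$ in this toy case is of the order $\sigma^{2(k+1)}$, compared with the order $\sigma^2$ for the plug-in estimator $f(X)$ (the case $k=0$).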
###### Proposition 3.1.

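For all $\gamma\ge 0$, $\mathcal{T}$ is a bounded linear operator from the space $L_{\infty,\gamma}(E)$ into itself with

$$\|\mathcal{T}\|_{L_{\infty,\gamma}(E)\mapsto L_{\infty,\gamma}(E)}\le 2^{\gamma}(1+E\|\xi\|^{\gamma}). \tag{3.1}$$

Proof. Indeed, by the definition of $L_{\infty,\gamma}$-norm,

$$|g(\theta+\xi)|\le 2^{\gamma}\|g\|_{L_{\infty,\gamma}}(1\vee\|\theta\|\vee\|\xi\|)^{\gamma}.$$

Therefore,

$$|\mathcal{T}g(\theta)|\le E|g(\theta+\xi)|\le 2^{\gamma}\|g\|_{L_{\infty,\gamma}}E(1\vee\|\theta\|\vee\|\xi\|)^{\gamma}\le 2^{\gamma}\bigl[(1\vee\|\theta\|)^{\gamma}+E\|\xi\|^{\gamma}\bigr]\|g\|_{L_{\infty,\gamma}},$$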

which easily implies that

 ∥Tg∥L∞,γ≤2γ(1+E∥ξ