The problem of estimation of functionals of “high complexity” parameters of statistical models often occurs both in high-dimensional and in nonparametric statistics, where it is important to identify features of a complex parameter that can be estimated efficiently with fast (sometimes parametric) convergence rates. Such problems are very important in the case of vector, matrix or functional parameters in a variety of applications, including functional data analysis and kernel machine learning (, ). In this paper, we study a very basic version of this problem in the case of rather general Gaussian models with unknown mean. Consider the following Gaussian shift model
\[
X = \theta + \xi, \tag{1.1}
\]
where $E$ is a separable Banach space, $\theta \in E$ is an unknown parameter and $\xi$ is a mean zero Gaussian random variable in $E$ (the noise) with known covariance operator $\Sigma$. In other words, an observation $X$ in Gaussian shift model (1.1) is a Gaussian vector in $E$ with unknown mean $\theta$ and known covariance $\Sigma$. Recall that $\Sigma$ is an operator from the dual space $E^{*}$ into $E$ such that $\langle \Sigma u, v\rangle = E\langle \xi, u\rangle \langle \xi, v\rangle$, $u, v \in E^{*}$. Here and in what follows, $\langle x, u\rangle$ denotes the value of a linear functional $u \in E^{*}$ on a vector $x \in E$ (although, in some parts of the paper, with a little abuse of notation, $\langle \cdot, \cdot\rangle$ will also denote the inner product of Euclidean spaces). It is well known that the covariance operator $\Sigma$ of a Gaussian vector in $E$ is bounded and, moreover, it is nuclear.
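For concreteness, the model can be simulated in a finite-dimensional specialization (a minimal numerical sketch, assuming $E = \mathbb{R}^d$ with a synthetic covariance; all names below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5
theta = np.arange(1.0, d + 1)      # the unknown mean (known here, for simulation only)
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d                # a known covariance operator

# A single observation from the Gaussian shift model: X = theta + xi,
# with xi a mean-zero Gaussian vector with covariance Sigma.
xi = rng.multivariate_normal(np.zeros(d), Sigma)
X = theta + xi

# Sanity check over many independent replications: the empirical mean
# recovers theta and the empirical covariance recovers Sigma.
n = 100_000
xs = theta + rng.multivariate_normal(np.zeros(d), Sigma, size=n)
assert np.allclose(xs.mean(axis=0), theta, atol=0.05)
assert np.allclose(np.cov(xs, rowvar=False), Sigma, atol=0.05)
```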
Our goal is to study the problem of estimation of $f(\theta)$ for smooth functionals $f : E \mapsto \mathbb{R}$. The problem of estimation of smooth functionals of parameters of infinite-dimensional (nonparametric) models has been studied for several decades. It is considerably harder than in the classical finite-dimensional parametric i.i.d. models, where, under standard regularity assumptions, $f(\hat\theta_n)$ ($\hat\theta_n$ being the maximum likelihood estimator) is an asymptotically efficient (in the sense of Hájek–Le Cam) estimator of $f(\theta)$ with a $\sqrt{n}$-rate for continuously differentiable functions $f$.
In the nonparametric case, classical convergence rates do not necessarily hold in functional estimation problems and minimax optimal convergence rates have to be determined. Moreover, even when the classical convergence rates do hold, the construction of efficient estimators is often a challenging problem. Such problems have often been studied for special models (Gaussian white noise model, nonparametric density estimation model, etc.) and for special functionals (with a number of nontrivial results even in the case of linear and quadratic functionals). Early results in this direction are due to Levit [28, 29] and Ibragimov and Khasminskii . Further important references include Ibragimov, Nemirovski and Khasminskii , Donoho and Liu [9, 10], Bickel and Ritov , Donoho and Nussbaum , Nemirovski [31, 32], Birgé and Massart , Laurent , Lepski, Nemirovski and Spokoiny , Cai and Low [6, 7], Klemelä , as well as a vast literature on semiparametric efficiency (see, e.g.,  and references therein). Early results on consistent and asymptotically normal estimation of smooth functionals of high-dimensional parameters are due to Girko [13, 14]. More recently, there has been a lot of interest in efficient and minimax optimal estimation of functionals of parameters of high-dimensional models, including a variety of problems related to semiparametric efficiency of regularized estimators (see , , ), minimax optimal rates of estimation of special functionals (see ), and efficient estimation of smooth functionals of covariance in Gaussian models [23, 20].
Throughout the paper, given nonnegative $a$ and $b$, $a \lesssim b$ means that $a \le C b$ for a numerical constant $C > 0$; $a \gtrsim b$ is equivalent to $b \lesssim a$; and $a \asymp b$ is equivalent to $a \lesssim b$ and $b \lesssim a$. Sometimes, the signs of the relationships $\lesssim, \gtrsim, \asymp$ will be provided with subscripts (say, $\lesssim_{\gamma}$ or $\asymp_{\gamma}$), indicating possible dependence of the constants on the corresponding parameters.
In what follows, exponential bounds on random variables (say, on $\eta$) are often stated in the following form: there exists a constant $c > 0$ such that, for all $t \ge 1$,
with probability at least $1 - e^{-t}$. The proof could often result in a slightly different bound, for instance, holding with probability $1 - C e^{-t}$ for some constant $C > 1$. However, replacing the constant $c$ with a larger one, it is easy to obtain the probability bound in the initial form $1 - e^{-t}$. In such cases, we say that “adjusting the constants” allows us to write the probability as $1 - e^{-t}$ (without providing further details).
We will now briefly discuss the results of Ibragimov, Nemirovski and Khasminskii  and follow-up results of Nemirovski [31, 32] that are especially close to our approach to the problem. In , the following model was studied
in which a “signal” is observed in a Gaussian white noise ( being a standard Brownian motion on ). The complexity of the parameter space was characterized by Kolmogorov widths:
where denotes the orthogonal projection onto subspace Assuming that and, for some
the goal of the authors was to determine a “smoothness threshold” such that, for all and for all functionals on of smoothness could be estimated efficiently with rate based on observation (whereas for there exist functionals of smoothness such that could not be estimated with parametric rate ). It turned out that the main difficulties in this problem are related to a proper definition of the smoothness of the functional In particular, even such a simple functional as could not be estimated efficiently on some sets with The smoothness of functionals on Hilbert space is usually defined in terms of their Hölder type norms that, in turn, depend on the way in which the norm of Fréchet derivatives is defined. The -th order Fréchet derivative is a symmetric -linear form on The most common definition of the norm of such a form is the operator norm: Other possibilities include the Hilbert–Schmidt norm and “hybrid” norms The Hölder classes in  were defined in terms of the following norms: for
With this somewhat complicated definition, it was proved that, if and, either and or and then there exists an asymptotically efficient estimator of with convergence rate
The construction of such estimators was based on the development of a method of unbiased estimation of Hilbert–Schmidt polynomials and on a Taylor expansion of in a neighborhood of an estimator of with an optimal nonparametric error rate. It was later shown in [31, 32] that the smoothness thresholds described above are optimal.
We will study similar problems for Gaussian shift model (1.1) trying to determine smoothness thresholds for efficient estimation in terms of proper complexity characteristics for this model.
Among the simplest smooth functionals on $E$ are bounded linear functionals $\langle \cdot, u\rangle$, $u \in E^{*}$. For the straightforward estimator $\langle X, u\rangle$ of such a functional,
\[
E_{\theta}\bigl(\langle X, u\rangle - \langle \theta, u\rangle\bigr)^{2} = \langle \Sigma u, u\rangle,
\]
and, for functionals from the unit ball of $E^{*}$, the largest possible mean squared error is equal to the operator norm of $\Sigma$: $\sup_{\|u\| \le 1} \langle \Sigma u, u\rangle = \|\Sigma\|$.
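The claim about linear functionals is easy to verify numerically (an illustrative sketch in $\mathbb{R}^d$; the setup below is assumed, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
theta = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d

u = rng.standard_normal(d)
u /= np.linalg.norm(u)             # a linear functional from the unit ball

# Plug-in estimator <X, u> of <theta, u>, replicated over many observations.
n = 200_000
X = theta + rng.multivariate_normal(np.zeros(d), Sigma, size=n)
est = X @ u

assert abs(est.mean() - theta @ u) < 0.02        # unbiased
assert abs(est.var() - u @ Sigma @ u) < 0.05     # MSE = <Sigma u, u>
# The worst case over the unit ball is the operator norm of Sigma.
assert u @ Sigma @ u <= np.linalg.norm(Sigma, 2) + 1e-12
```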
It is also not hard to prove the following proposition.
In what follows, the complexity of the estimation problem will be characterized by two parameters of the noise $\xi$. One is the operator norm
\[
\|\Sigma\| = \sup_{\|u\| \le 1} \langle \Sigma u, u\rangle,
\]
which is involved in the minimax mean squared error for estimation of linear functionals.
It will be convenient to view $\sigma^{2}(\xi) := \|\Sigma\|$ as the weak variance of $\xi$. Another complexity parameter is the strong variance of $\xi$, defined as $E\|\xi\|^{2}$.
Clearly, $\sigma^{2}(\xi) \le E\|\xi\|^{2}$. The ratio of these two parameters,
\[
r(\xi) := \frac{E\|\xi\|^{2}}{\sigma^{2}(\xi)},
\]
is called the effective rank of $\xi$; it was used earlier in concentration bounds for sample covariance operators and their spectral projections [22, 21]. The following properties of $r(\xi)$ are obvious:
Thus, the effective rank is invariant with respect to rescaling of $\xi$ (or rescaling of the noise). In this sense, $\sigma^{2}(\xi)$ and $r(\xi)$ can be viewed as complementary parameters of the noise. It is easy to check that, if $E$ is a Hilbert space, then $E\|\xi\|^{2} = \mathrm{tr}(\Sigma)$, which implies that $r(\xi) = \mathrm{tr}(\Sigma)/\|\Sigma\| \le \mathrm{rank}(\Sigma)$. Clearly, $r(\xi)$ can be viewed as a way to measure the dimensionality of the noise. In particular, for the maximum likelihood estimator $X$ of $\theta$ in the Gaussian shift model (1.1), we have $E_{\theta}\|X - \theta\|^{2} = E\|\xi\|^{2} = \sigma^{2}(\xi)\, r(\xi)$, resembling a standard formula for the risk of estimation of a vector observed in a “white noise” with variance $\sigma^{2}(\xi)$.
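In the Hilbert space case, the effective rank and its basic properties (scale invariance, the bound by the usual rank) can be checked directly (an illustrative sketch; `effective_rank` is a hypothetical helper name):

```python
import numpy as np

def effective_rank(Sigma):
    # In a Hilbert space, E||xi||^2 = tr(Sigma) and the weak variance equals
    # the largest eigenvalue ||Sigma||, so r(xi) = tr(Sigma) / ||Sigma||.
    eigs = np.linalg.eigvalsh(Sigma)
    return eigs.sum() / eigs.max()

d = 10
# Isotropic noise: all eigenvalues equal, so the effective rank equals d.
assert np.isclose(effective_rank(np.eye(d)), d)

# Scale invariance: r(c * Sigma) = r(Sigma) for any c > 0.
rng = np.random.default_rng(2)
A = rng.standard_normal((d, d))
Sigma = A @ A.T
assert np.isclose(effective_rank(Sigma), effective_rank(3.7 * Sigma))

# A spiked covariance has effective rank close to 1, far below its usual rank d.
spiked = np.diag([100.0] + [0.01] * (d - 1))
assert effective_rank(spiked) < 1.1
```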
We discuss below several simple examples of the general Gaussian shift model (1.1).
Standard Gaussian shift model. Let $E = \mathbb{R}^{d}$ be equipped with the canonical Euclidean inner product and the corresponding norm (the $\ell_{2}$-norm), and let $\xi = \sigma Z$, where $\sigma > 0$ is a known constant and $Z$ is a standard normal vector. In this case, $\sigma^{2}(\xi) = \sigma^{2}$, $E\|\xi\|^{2} = \sigma^{2} d$ and $r(\xi) = d$. Note that the size of the effective rank crucially depends on the choice of the underlying norm of the linear space. For instance, if $\mathbb{R}^{d}$ is equipped with the $\ell_{\infty}$-norm instead of the $\ell_{2}$-norm, then we still have $\sigma^{2}(\xi) = \sigma^{2}$, but $r(\xi) \asymp \log d$.
Matrix Gaussian shift models. Let $E$ be the space of all symmetric $d \times d$ matrices equipped with the operator norm, and let $\xi = \gamma Z$ with known parameter $\gamma > 0$ and $Z$ sampled from the Gaussian orthogonal ensemble (that is, $Z$ is a symmetric random matrix whose entries on and above the diagonal are independent r.v., $Z_{ii} \sim N(0, 2)$ and $Z_{ij} \sim N(0, 1)$ for $i < j$). In this case, $\sigma^{2}(\xi) \asymp \gamma^{2}$ and $E\|\xi\|^{2} \asymp \gamma^{2} d$, implying that $r(\xi) \asymp d$. As before, the effective rank would be different for a different choice of norm on $E$. For instance, if $E$ is equipped with the Hilbert–Schmidt norm, then $r(\xi) \asymp d^{2}$ (compare this with Example 1).
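The operator-norm scaling behind this example can be checked numerically (an illustrative sketch; the GOE normalization used below, diagonal variance 2 and off-diagonal variance 1, is one standard convention):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_goe(d):
    # Symmetric matrix with independent entries on and above the diagonal:
    # off-diagonal ~ N(0, 1), diagonal ~ N(0, 2).
    G = rng.standard_normal((d, d))
    return (G + G.T) / np.sqrt(2)

d = 400
norms = [np.abs(np.linalg.eigvalsh(sample_goe(d))).max() for _ in range(20)]
avg = float(np.mean(norms))

# The operator norm concentrates near 2*sqrt(d), so E||Z||^2 is of order d,
# while the weak variance stays of order 1: effective rank of order d.
assert 1.8 * np.sqrt(d) < avg < 2.2 * np.sqrt(d)
```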
Gaussian functional data model. Let be equipped with the sup-norm Suppose that where is a known parameter and is a mean zero Gaussian process on with the sample paths continuous a.s. (and with known distribution). Without loss of generality, assume that Suppose that, for some
Then, it is easy to see that the following bound holds for the metric entropy of with respect to metric
It follows from Dudley’s entropy bound that
Therefore, it is easy to conclude that and implying that
In the following sections, we develop estimators of in the Gaussian shift model with mean squared error of the order
where is the degree of smoothness of functional We also show that this error rate is minimax optimal up to a logarithmic factor (at least in the case of standard Gaussian shift model). Moreover, we determine a sharp threshold on smoothness such that, for all above this threshold and all functionals of smoothness the mean squared error rate of estimation of is of the order (as for linear functionals), and, for all strictly above the threshold, we prove the efficiency of our estimators in the “small noise” case (when the strong variance is small). The key ingredient in the development of such estimators is a bootstrap chain bias reduction method introduced in  in the problem of estimation of smooth functionals of covariance operators. We will outline this approach in Section 2 and develop it in detail in Section 3 for Gaussian shift models.
2 Overview of Main Results
We will study how the optimal error rate of estimation of for parameter of Gaussian shift model (1.1) depends on the smoothness of the functional as well as on the weak and strong variances of the noise (or, equivalently, on the parameters and ). To this end, we define below a Banach space of functionals of smoothness such that the functional and its derivatives grow not faster than for some
For Banach spaces let be the Banach space of symmetric -linear forms with bounded operator norm
For is the space of constants (vectors of ). A function defined by where is called a bounded homogeneous -polynomial on with values in It is known that uniquely defines A bounded polynomial on with values in is an arbitrary function represented as a finite sum where is a non-zero bounded homogeneous -polynomial. For we set Polynomials are uniquely defined by The degree of is defined as (with ). If for define
Recall that a function is called Fréchet differentiable at a point iff there exists a bounded linear operator from to (Fréchet derivative) such that
Higher order Fréchet derivatives could be defined by induction. The -th order Fréchet derivative at point is defined as the Fréchet derivative of the mapping (assuming its Fréchet differentiability). It is a bounded linear operator from to that could be also viewed as a bounded symmetric -linear form from the space As always, we call -times (Fréchet) continuously differentiable if its -th order derivative exists and it is a continuous function on Clearly, polynomials are times Fréchet differentiable for any If is a polynomial and then is a constant (a -linear symmetric form that does not depend on ) and
We will be interested in what follows in classes of smooth functionals with at most polynomial (with respect to ) growth of their derivatives. To this end, we describe below several useful norms.
First, let For let
and for let
Assuming that spaces are equipped with their Borel -algebras, we define as the space of measurable functions with We also define
In the case of we will write simply and for we write instead of
For we will define the norm
and the space of times differentiable functions (with the growth rate of derivatives characterized by ). Finally, for with and define
and the space As before, we set It is easy to see that for any polynomial such that and for all
In what follows, we frequently use bounds on the remainder of the first order Taylor expansion
of Fréchet differentiable function We will skip the proof of the following simple lemma.
Assume that is Fréchet differentiable in with Then
2.2 Definition of estimators and risk bounds
The crucial step in the construction of the estimator is a bias reduction method developed in detail in Section 3 and briefly outlined here. Consider the following linear operator:
\[
(\mathcal{T} g)(\theta) := E_{\theta} g(X) = E g(\theta + \xi), \quad \theta \in E,
\]
which is well defined on the spaces introduced above. Given a smooth functional $f$, we would like to find a functional $g$ on $E$ such that the bias $E_{\theta} g(X) - f(\theta)$ of the estimator $g(X)$ of $f(\theta)$ is small enough. In other words, we would like to find an approximate solution of the operator equation $\mathcal{T} g = f$. Under the assumption that the strong variance of the noise is small, the operator $\mathcal{T}$ is close to the identity operator $\mathcal{I}$. Define $\mathcal{B} := \mathcal{T} - \mathcal{I}$. Then, at least formally, the solution of the equation $\mathcal{T} g = f$ could be written as a Neumann series:
\[
g = (\mathcal{I} + \mathcal{B})^{-1} f = \sum_{j \ge 0} (-1)^{j} \mathcal{B}^{j} f.
\]
We will define an estimator $f_{k}(X)$ of $f(\theta)$ in terms of a partial sum of this series:
\[
f_{k} := \sum_{j=0}^{k} (-1)^{j} \mathcal{B}^{j} f.
\]
It will be proved in Section 3 that, for this estimator, the bias is of the order provided that for and is bounded by a constant.
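The effect of a single correction step can be seen in an elementary one-dimensional toy example (illustrative only, not the paper's construction: here the operator $\mathcal{B}f(\theta) = E f(\theta + \xi) - f(\theta)$ is available in closed form, whereas in general it is handled via the bootstrap chain). For $f(\theta) = \theta^{3}$ and $\xi \sim N(0, \sigma^{2})$, one has $\mathcal{B}f(\theta) = 3\sigma^{2}\theta$ and $\mathcal{B}^{2} f = 0$, so the partial sum $f_{1} = f - \mathcal{B}f$ removes the bias entirely:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, theta = 0.5, 1.3

f = lambda t: t ** 3
# Partial Neumann sum f_1 = f - Bf, with (Bf)(t) = E f(t + xi) - f(t) = 3*sigma^2*t.
f1 = lambda t: t ** 3 - 3 * sigma ** 2 * t

# Average each estimator over independent replications of the observation
# X = theta + xi to expose its bias.
n = 400_000
X = theta + sigma * rng.standard_normal(n)

naive = f(X).mean()        # plug-in estimator of f(theta): bias 3*sigma^2*theta
corrected = f1(X).mean()   # bias-corrected estimator

assert abs(naive - f(theta)) > 0.5       # the bias 3*0.25*1.3 ~ 0.975 is visible
assert abs(corrected - f(theta)) < 0.02  # the correction removes it
```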
We will prove in Section 4 the following result.
Let for some and let Suppose that Let
It follows from bound (2.1) that
We will show in Section 7 that, in the case of standard Gaussian shift model, the above bound is optimal up to a factor in a minimax sense. More precisely, in this case, the following result holds.
Let (equipped with the standard Euclidean norm) and let for some Then
where the infimum is taken over all possible estimators
At this point, we do not know whether the log factor in the minimax rate is needed, and we have not been able to extend the lower bound to general Gaussian shift models in Banach spaces.
Bound (2.2) implies that, if the smoothness of functional is sufficiently large, namely if
which coincides with the largest minimax optimal mean squared error for linear functionals from the unit ball in Condition (2.4) can be equivalently written as
If is a small parameter and for some condition (2.6) would follow from the condition On the other hand, it follows from bound (2.3) that, in the case of standard Gaussian shift model, the smoothness threshold is sharp for estimation with mean squared error rate Indeed, in this case, and, if is small and for some then, for any there exists a functional with such that
which is significantly larger than as
In the case when for some and (or, more generally, when is of a smaller order than ), it is possible to prove that is close in distribution to normal and establish the efficiency of estimator More precisely, let
It is easy to see that
implying that We also have that
which means that does not depend on the noise level In what follows, it will be assumed that the functional is bounded from above by a constant, implying that is within a constant from its upper bound This is the case, for instance, when is in a bounded set and
(in other words, the standard deviation is not too small compared with the noise level ).
The following result will be proved in Section 5.
Suppose, for some and some Suppose also that Then
where is a standard normal r.v. Moreover,
It follows from bound (2.3) that
Assume that is in a set of parameters where is upper bounded by a constant. Then, is close to uniformly in provided that is small and is much smaller than (say, if and ).
Finally, in Section 6, we will prove the following minimax lower bound.
Suppose for some and Let Then, there exists a constant such that for all and all covariance operators satisfying the condition the following bound holds
where the infimum is taken over all possible estimators
The bound of Theorem 2.4 shows that, when the noise level is small and is upper bounded by a constant, the following asymptotic minimax result (in the spirit of Hájek and Le Cam) holds
locally in a neighborhood of parameter of size commensurate with the noise level. This shows the optimality of the variance of normal approximation and the efficiency of estimator
In the case of the matrix Gaussian shift model of Example 2 (that is, when is the space of symmetric matrices equipped with the operator norm and is a random matrix from the Gaussian orthogonal ensemble), the results of the paper could be applied, in particular, to bilinear forms of smooth functions of symmetric matrices: where is a smooth function on the real line. Namely, it was shown in , Corollary 2 (based on the results of , ) that the -norm of the operator function can be controlled in terms of the Besov -norm of the underlying function of real variable. This allows one to apply all the results stated above to the functional provided that is in a proper Besov space. Note that spectral projections of that correspond to subsets of its spectrum separated by a positive gap from the rest of the spectrum could be represented as for sufficiently smooth functions which allows one to apply the results to bilinear forms of spectral projections (see also ). In , similar results were obtained for smooth functionals of covariance operators.
Obviously, the results of the paper can be applied to the model of $n$ i.i.d. observations sampled from the Gaussian distribution with mean $\theta$ and covariance $\Sigma$, since the sample mean is an observation in a Gaussian shift model with covariance $\Sigma/n$. If then it follows from Theorem 2.1 that
Uniformly in the class of covariances with and for some this yields a bound on the mean squared error of the order provided that Moreover, if estimator is asymptotically normal and asymptotically efficient with convergence rate and limit variance
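The reduction from i.i.d. observations to a single Gaussian shift observation can be verified by simulation (a sketch under the standard fact that the sample mean of $n$ i.i.d. Gaussian vectors with covariance $\Sigma$ has covariance $\Sigma/n$):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 6, 50
theta = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d

# n i.i.d. observations collapse to a single Gaussian shift observation:
# the sample mean has distribution N(theta, Sigma / n).
reps = 20_000
xs = rng.multivariate_normal(np.zeros(d), Sigma, size=(reps, n))
means = theta + xs.mean(axis=1)

emp_cov = np.cov(means, rowvar=False)
assert np.allclose(means.mean(axis=0), theta, atol=0.01)
assert np.allclose(emp_cov, Sigma / n, atol=0.01)
```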
3 Bias Reduction
A crucial part of our approach to efficient estimation of smooth functionals of is a new bias reduction method based on iterative application of parametric bootstrap. Our goal is to construct an estimator of a smooth functional of parameter and, to this end, we construct an estimator of the form for some functional for which the bias is negligible compared with the noise level Define the following linear operator:
For all is a bounded linear operator from the space into itself with
Proof. Indeed, by the definition of the -norm,
which easily implies that