Asymptotically Efficient Estimation of Smooth Functionals of Covariance Operators

10/25/2017 ∙ Vladimir Koltchinskii et al. ∙ Georgia Institute of Technology

Let X be a centered Gaussian random variable in a separable Hilbert space H with covariance operator Σ. We study the problem of estimation of a smooth functional of Σ based on a sample X_1, …, X_n of n independent observations of X. More specifically, we are interested in functionals of the form ⟨f(Σ), B⟩, where f: ℝ → ℝ is a smooth function and B is a nuclear operator in H. We prove concentration and normal approximation bounds for the plug-in estimator ⟨f(Σ̂), B⟩, Σ̂ := n^{-1}∑_{j=1}^n X_j ⊗ X_j being the sample covariance based on X_1, …, X_n. These bounds show that ⟨f(Σ̂), B⟩ is an asymptotically normal estimator of its expectation E_Σ⟨f(Σ̂), B⟩ (rather than of the parameter of interest ⟨f(Σ), B⟩) with a parametric convergence rate O(n^{-1/2}), provided that the effective rank r(Σ) := tr(Σ)/‖Σ‖ (tr(Σ) being the trace and ‖Σ‖ being the operator norm of Σ) satisfies the assumption r(Σ) = o(n). At the same time, we show that the bias of this estimator is typically as large as r(Σ)/n (which is larger than n^{-1/2} if r(Σ) ≥ n^{1/2}). In the case when H is a finite-dimensional space of dimension d = o(n), we develop a method of bias reduction and construct an estimator ⟨h(Σ̂), B⟩ of ⟨f(Σ), B⟩ that is asymptotically normal with convergence rate O(n^{-1/2}). Moreover, we study asymptotic properties of the risk of this estimator and prove minimax lower bounds for arbitrary estimators, showing the asymptotic efficiency of ⟨h(Σ̂), B⟩ in a semi-parametric sense.


1 Introduction

Let X be a centered Gaussian random variable in a separable Hilbert space ℍ sampled from a normal distribution N(0; Σ) with mean zero and covariance operator Σ = E(X ⊗ X). The purpose of this paper is to study the problem of estimation of smooth functionals of the unknown covariance Σ based on a sample X_1, …, X_n of n i.i.d. observations of X. Specifically, we deal with functionals of the form ⟨f(Σ), B⟩, where f : ℝ → ℝ is a smooth function and B is a nuclear operator.

Such important examples in principal component analysis as estimation of bilinear forms of spectral projection operators of the covariance Σ could be easily reduced to this basic problem. Moreover, estimation of ⟨f(Σ), B⟩ is a major building block in the estimation of more general functionals built from expressions of this form and their linear combinations, which provides a way to the development of methods of statistical estimation of very general functionals of the covariance.
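As a concrete illustration, here is a minimal numerical sketch (not from the paper) of the plug-in estimator ⟨f(Σ̂), B⟩, assuming a finite-dimensional ℍ = ℝ^d; the choices of f and B below are hypothetical:

```python
import numpy as np

def plug_in_functional(X, f, B):
    """Plug-in estimator <f(Sigma_hat), B> = tr(f(Sigma_hat) B).

    X: (n, d) array of centered observations, f: vectorized map applied
    to eigenvalues, B: symmetric (d, d) matrix playing the role of a
    nuclear operator.
    """
    n = X.shape[0]
    Sigma_hat = X.T @ X / n                  # sample covariance
    lam, U = np.linalg.eigh(Sigma_hat)       # spectral decomposition
    f_Sigma_hat = (U * f(lam)) @ U.T         # f(Sigma_hat) = U diag(f(lam)) U^T
    return float(np.trace(f_Sigma_hat @ B))

# Hypothetical example: f(x) = log(1 + x), B a rank-one projector.
rng = np.random.default_rng(0)
d, n = 10, 500
Sigma = np.diag(1.0 / np.arange(1, d + 1))   # eigenvalues 1, 1/2, ..., 1/d
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
B = np.zeros((d, d)); B[0, 0] = 1.0
print(plug_in_functional(X, np.log1p, B))
```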

Throughout the paper, we use the following notations. Given a, b ≥ 0, a ≲ b means that a ≤ Cb for a numerical (most often, unspecified) constant C > 0; a ≳ b is equivalent to b ≲ a; a ≍ b is equivalent to a ≲ b and a ≳ b. Sometimes, the constants in the above relationships might depend on some parameter(s). In such cases, the signs ≲, ≳ and ≍ are provided with subscripts: say, a ≲_γ b means that a ≤ C_γ b for a constant C_γ > 0 that depends on γ.

Let B(ℍ) denote the space of all bounded linear operators in a separable Hilbert space ℍ equipped with the operator norm, and let B_sa(ℍ) denote the subspace of all bounded self-adjoint operators.¹ In what follows, A* denotes the adjoint of an operator A, tr(A) denotes its trace (provided that A is trace class) and ‖A‖ denotes its operator norm. We use the notation ‖A‖_p for the Schatten p-norm of A: ‖A‖_p := (tr(|A|^p))^{1/p}, where |A| := (A*A)^{1/2}. In particular, ‖A‖_1 is the nuclear norm of A, ‖A‖_2 is its Hilbert–Schmidt norm and ‖A‖_∞ = ‖A‖ is its operator norm. We denote the space of self-adjoint operators A with ‖A‖_p < ∞ (p-th Schatten class operators) by S_p(ℍ). The space of compact self-adjoint operators in ℍ is denoted by C_sa(ℍ). The inner product notation ⟨·, ·⟩ is used both for inner products in the underlying Hilbert space ℍ and for the Hilbert–Schmidt inner product between operators. Moreover, it is also used to denote bounded linear functionals on the spaces of operators (for instance, ⟨A, B⟩ = tr(AB), where A is a bounded operator and B is a nuclear operator, is the value of such a linear functional on the space of bounded operators).

For u, v ∈ ℍ, u ⊗ v denotes the tensor product of the vectors u and v: (u ⊗ v)x := u⟨v, x⟩, x ∈ ℍ. The operator u ⊗ v is of rank one, and finite linear combinations of rank one operators are operators of finite rank. The rank of an operator A is denoted by rank(A). Finally, C_+(ℍ) denotes the cone of self-adjoint positively semi-definite nuclear operators in ℍ (the covariance operators).

¹ The main results of the paper are proved in the case when ℍ is a real Hilbert space. However, on a couple of occasions, especially in auxiliary statements, its complexification (with a standard extension of the inner product) and the complexification of the operators acting in ℍ are needed. With some abuse of notation, we keep in such cases the notation ℍ for the complex Hilbert space.

In what follows, we often use exponential bounds for random variables of the following form: for all t ≥ 1, with probability at least 1 − e^{−t},

ξ ≤ C(a + b√t + ct)

(for a random variable ξ, some quantities a, b, c ≥ 0 and a numerical constant C > 0). Sometimes, our derivation would yield a slightly different probability bound, for instance: for all t ≥ 1, with probability at least 1 − 2e^{−t}, ξ ≤ C(a + b√t + ct). Such bounds could be easily rewritten again in the initial form by adjusting the value of the constant: replacing t by t + log 2 ≤ 2t shows that, for all t ≥ 1, with probability at least 1 − e^{−t}, we have ξ ≤ 2C(a + b√t + ct). Such an adjustment of constants will be used in many proofs without further notice.

1.1 Sample covariance and effective rank

Let

Σ̂ := Σ̂_n := n^{−1} ∑_{j=1}^{n} X_j ⊗ X_j

denote the sample covariance based on the data X_1, …, X_n. It is well known that Σ̂ is a complete sufficient statistic and the maximum likelihood estimator in the problem of estimation of the unknown covariance Σ in the model X_1, …, X_n i.i.d. ∼ N(0; Σ).

In what follows, we often use the so-called effective rank of the covariance Σ as a complexity parameter of the covariance estimation problem. It is defined as

r(Σ) := tr(Σ)/‖Σ‖.

Note that r(Σ) ≤ rank(Σ) ≤ dim(ℍ).
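A two-line computation (an illustrative sketch; the example covariances are hypothetical) makes the contrast between the effective rank and the dimension explicit:

```python
import numpy as np

def effective_rank(Sigma):
    """r(Sigma) := tr(Sigma) / ||Sigma|| (operator norm in the denominator)."""
    return np.trace(Sigma) / np.linalg.norm(Sigma, 2)

d = 100
print(effective_rank(np.eye(d)))                                # isotropic: r = d = 100
print(effective_rank(np.diag(1.0 / np.arange(1, d + 1) ** 2)))  # fast decay: r ~ pi^2/6
```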

The following result of Koltchinskii and Lounici [37] shows that, in the Gaussian case, the size of the random variable ‖Σ̂ − Σ‖/‖Σ‖ (which is the relative operator norm error of the estimator Σ̂ of Σ) is completely characterized by the ratio r(Σ)/n.

Theorem 1.

The following bound holds:

E_Σ‖Σ̂ − Σ‖ ≍ ‖Σ‖ ( √(r(Σ)/n) ∨ (r(Σ)/n) ).    (1.1)

Moreover, for all t ≥ 1, with probability at least 1 − e^{−t},

| ‖Σ̂ − Σ‖ − E_Σ‖Σ̂ − Σ‖ | ≲ ‖Σ‖ ( √(t/n) ∨ (t/n) ).    (1.2)

It follows from expectation bound (1.1) and concentration inequality (1.2) that, for all t ≥ 1, with probability at least 1 − e^{−t},

‖Σ̂ − Σ‖ ≲ ‖Σ‖ ( √(r(Σ)/n) ∨ (r(Σ)/n) ∨ √(t/n) ∨ (t/n) )    (1.3)

and, for all p ≥ 1,

E_Σ^{1/p}‖Σ̂ − Σ‖^p ≲_p ‖Σ‖ ( √(r(Σ)/n) ∨ (r(Σ)/n) ).    (1.4)

To avoid the dependence of the constant on p, the following modification of the above bound will be used on a couple of occasions:

E_Σ^{1/p}‖Σ̂ − Σ‖^p ≲ ‖Σ‖ ( √(r(Σ)/n) ∨ (r(Σ)/n) ∨ √(p/n) ∨ (p/n) ).    (1.5)
Since r(Σ) ≤ d := dim(ℍ), the bounds in terms of effective rank do imply well known bounds in terms of dimension. For instance, for all t ≥ 1, with probability at least 1 − e^{−t},

‖Σ̂ − Σ‖ ≲ ‖Σ‖ ( √(d/n) ∨ (d/n) ∨ √(t/n) ∨ (t/n) )    (1.6)

(see, e.g., [60]). Of course, bound (1.6) is meaningless in the infinite-dimensional case. In the finite-dimensional case, it is sharp if Σ is isotropic (Σ = cI_d for a constant c > 0), or if it is of isotropic type, that is, the spectrum of Σ is bounded from above and bounded away from zero (by constants). In this case, r(Σ) ≍ d, which makes (1.6) sharp. This is the case, for instance, for the popular spiked covariance models introduced by Johnstone [32] (see also [33, 50, 8]). However, in the case of fast decay of the eigenvalues of Σ, the effective rank r(Σ) could be significantly smaller than d, and it becomes the right complexity parameter in covariance estimation.
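The following small simulation (a sketch, not part of the paper) is consistent with the expectation bound (1.1): for a covariance with fast eigenvalue decay, the ratio of a Monte Carlo estimate of E‖Σ̂ − Σ‖ to ‖Σ‖(√(r(Σ)/n) ∨ r(Σ)/n) stays roughly constant as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_op_norm_error(Sigma, n, reps=200):
    """Monte Carlo estimate of E||Sigma_hat - Sigma|| for Gaussian data."""
    d = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, d)) @ L.T      # X_i ~ N(0, Sigma)
        errs.append(np.linalg.norm(X.T @ X / n - Sigma, 2))
    return np.mean(errs)

d = 50
Sigma = np.diag(1.0 / np.arange(1, d + 1))         # r(Sigma) ~ log d << d
r = np.trace(Sigma) / np.linalg.norm(Sigma, 2)
for n in [50, 200, 800]:
    rate = max(np.sqrt(r / n), r / n)              # sqrt(r/n) v r/n
    print(n, mean_op_norm_error(Sigma, n) / rate)  # roughly constant in n
```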

In what follows, we are interested in the problems in which r(Σ) is allowed to be large, but r(Σ) = o(n) as n → ∞. This is a necessary and sufficient condition for Σ̂ to be an operator norm consistent estimator of Σ. It also means that Σ̂ is a small perturbation of Σ when n is large, and methods of perturbation theory could be used to analyze the behavior of f(Σ̂) for smooth functions f.

1.2 Overview of main results

In this subsection, we state and discuss the main results of the paper concerning asymptotically efficient estimation of functionals ⟨f(Σ), B⟩ for a smooth function f and a nuclear operator B. It turns out that the proper notion of smoothness of the function f in these problems is defined in terms of Besov spaces B^s_{∞,1}(ℝ) and the corresponding Besov norms ‖·‖_{B^s_{∞,1}}. The relevant definitions, notations and references are provided in Section 2.

A standard approach to the asymptotic analysis of plug-in estimators (in particular, such as f(Σ̂)) in statistics is the Delta Method, based on the first order Taylor expansion of f(Σ̂) around Σ. Due to a result by Peller (see Section 2), for any f ∈ B^1_{∞,1}(ℝ), the mapping A ↦ f(A) is Fréchet differentiable with respect to the operator norm on the space of bounded self-adjoint operators in ℍ. Let Σ be a covariance operator with spectral decomposition

Σ = ∑_{λ ∈ σ(Σ)} λ P_λ,

σ(Σ) being the spectrum of Σ, λ being an eigenvalue of Σ and P_λ being the corresponding spectral projection (the orthogonal projection onto the eigenspace of λ). Then the derivative of the operator function f at Σ in the direction H is given by the following formula:

Df(Σ; H) = ∑_{λ, μ ∈ σ(Σ)} f^{[1]}(λ, μ) P_λ H P_μ,

where f^{[1]}(λ, μ) := (f(λ) − f(μ))/(λ − μ) for λ ≠ μ and f^{[1]}(λ, λ) := f′(λ) (see Section 2). Moreover, if, for some s ∈ (1, 2], f ∈ B^s_{∞,1}(ℝ), then the following first order Taylor expansion holds:

f(Σ̂) = f(Σ) + Df(Σ; Σ̂ − Σ) + S_f(Σ; Σ̂ − Σ),

with the linear term Df(Σ; Σ̂ − Σ) and the remainder S_f(Σ; Σ̂ − Σ) satisfying the bound

‖S_f(Σ; H)‖ ≲ ‖f‖_{B^s_{∞,1}} ‖H‖^s

(see (2.15)). Since the linear term ⟨Df(Σ; Σ̂ − Σ), B⟩ is the sum of i.i.d. random variables, it is easy to check (for instance, using the Berry–Esseen bound) that n^{1/2}⟨Df(Σ; Σ̂ − Σ), B⟩ is asymptotically normal with limit mean zero and limit variance

σ_f²(Σ; B) := 2‖Σ^{1/2} Df(Σ; B) Σ^{1/2}‖_2².
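For intuition, the derivative formula Df(Σ; H) above can be evaluated numerically in finite dimensions (an illustrative sketch; the helper frechet_derivative and the test function f(x) = x² are not the paper's code):

```python
import numpy as np

def frechet_derivative(Sigma, H, f, df):
    """Df(Sigma; H) via the formula above: in the eigenbasis of Sigma,
    Df(Sigma; H) = U (L * (U^T H U)) U^T, where L[i, j] is the divided
    difference f^[1](lam_i, lam_j) (Loewner matrix)."""
    lam, U = np.linalg.eigh(Sigma)
    fl, dl = f(lam), df(lam)
    num = fl[:, None] - fl[None, :]
    den = lam[:, None] - lam[None, :]
    ok = np.abs(den) > 1e-12
    # Divided differences; f'(lam_i) on (near-)coincident eigenvalues.
    L = np.where(ok, num / np.where(ok, den, 1.0), dl[:, None] + 0.0 * den)
    return U @ (L * (U.T @ H @ U)) @ U.T

# Sanity check with f(x) = x^2, where Df(Sigma; H) = Sigma H + H Sigma.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5)); Sigma = A @ A.T / 5
H = rng.standard_normal((5, 5)); H = (H + H.T) / 2
D = frechet_derivative(Sigma, H, lambda x: x ** 2, lambda x: 2 * x)
print(np.allclose(D, Sigma @ H + H @ Sigma))   # True
```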

Using exponential bound (1.3) on ‖Σ̂ − Σ‖, one can easily conclude that the remainder ⟨S_f(Σ; Σ̂ − Σ), B⟩ is asymptotically negligible (that is, of the order o(n^{−1/2})) if (r(Σ)/n)^{s/2} = o(n^{−1/2}) or, equivalently, r(Σ) = o(n^{1−1/s}). In the case when s = 2, this means that r(Σ) = o(n^{1/2}). This implies that ⟨f(Σ̂), B⟩ is an asymptotically normal estimator of ⟨f(Σ), B⟩ with convergence rate n^{−1/2} and limit normal distribution N(0; σ_f²(Σ; B)) (under the assumption that r(Σ) = o(n^{1−1/s})). The above perturbation analysis is essentially the same as for spectral projections of Σ̂ in the case of fixed finite dimension (see Anderson [3]), or in the infinite-dimensional case when the “complexity” of the problem (characterized by r(Σ) or d) is fixed (see Dauxois, Pousse and Romain [13]). Note also that the bias of the estimator ⟨f(Σ̂), B⟩,

E_Σ⟨f(Σ̂), B⟩ − ⟨f(Σ), B⟩,

is upper bounded by E_Σ|⟨S_f(Σ; Σ̂ − Σ), B⟩| ≲ (r(Σ)/n)^{s/2}; so, it is of the order o(n^{−1/2}) (asymptotically negligible) under the same condition r(Σ) = o(n^{1−1/s}). Moreover, it is easy to see that this bound on the bias is sharp for generic smooth functions f. For instance, if f(x) = x², then one can check by a straightforward computation that

E_Σ⟨Σ̂², B⟩ − ⟨Σ², B⟩ = (tr(Σ)⟨Σ, B⟩ + ⟨Σ², B⟩)/n,

which is typically of the order r(Σ)/n (recall that tr(Σ) = ‖Σ‖r(Σ)). This means that, as soon as r(Σ) ≳ n^{1/2}, both the bias and the remainder are not asymptotically negligible and, moreover, it turns out that, if r(Σ)/n^{1/2} → ∞, then ⟨f(Σ̂), B⟩ is not even a n^{1/2}-consistent estimator of ⟨f(Σ), B⟩.
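A quick Monte Carlo check of the bias identity above for f(x) = x² (an illustrative sketch with hypothetical choices of Σ and B):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, reps = 8, 40, 5000
Sigma = np.diag(np.linspace(2.0, 0.5, d))
B = np.eye(d) / d                              # a simple nuclear B

vals = []
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    S = X.T @ X / n
    vals.append(np.trace(S @ S @ B))
emp_bias = np.mean(vals) - np.trace(Sigma @ Sigma @ B)
thy_bias = (np.trace(Sigma) * np.trace(Sigma @ B)
            + np.trace(Sigma @ Sigma @ B)) / n
print(emp_bias, thy_bias)                      # agree up to Monte Carlo error
```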

Our first goal is to show that ⟨f(Σ̂), B⟩ is an asymptotically normal estimator of its own expectation E_Σ⟨f(Σ̂), B⟩ with convergence rate n^{−1/2} and limit variance σ_f²(Σ; B) in the class of covariances with effective rank of the order o(n). Given a > 0 and r_n > 0, define

𝒢(r_n; a) := {Σ ∈ C_+(ℍ) : r(Σ) ≤ r_n, ‖Σ‖ ≤ a}.

Theorem 2.

Suppose f ∈ B^s_{∞,1}(ℝ) for some s ∈ (1, 2]. Let B be a nuclear operator and let a, σ₀ > 0. Suppose that r_n → ∞ and r_n = o(n) as n → ∞. Then

sup_Σ sup_{x ∈ ℝ} | P_Σ{ n^{1/2}(⟨f(Σ̂), B⟩ − E_Σ⟨f(Σ̂), B⟩)/σ_f(Σ; B) ≤ x } − Φ(x) | → 0    (1.7)

as n → ∞, where the supremum is over all Σ ∈ 𝒢(r_n; a) with σ_f(Σ; B) ≥ σ₀, and Φ is the standard normal distribution function.

This result is a consequence of Corollary 4 proved in Section 4, which provides an explicit bound on the accuracy of normal approximation. Its proof is based on a concentration bound, developed in Section 3, for the remainder of the first order Taylor expansion around its expectation. This bound essentially shows that the centered remainder

⟨S_f(Σ; Σ̂ − Σ), B⟩ − E_Σ⟨S_f(Σ; Σ̂ − Σ), B⟩

is of the order n^{−1/2}(r(Σ)/n)^{(s−1)/2}, which is o(n^{−1/2}) as soon as r(Σ) = o(n).

Theorem 2 shows that the naive plug-in estimator ⟨f(Σ̂), B⟩ “concentrates” around its expectation E_Σ⟨f(Σ̂), B⟩ with an approximately standard normal distribution of the random variables

n^{1/2}(⟨f(Σ̂), B⟩ − E_Σ⟨f(Σ̂), B⟩)/σ_f(Σ; B).

At the same time, as we discussed above, the plug-in estimator has a large bias when the effective rank of Σ is sufficiently large (say, r(Σ) ≳ n^{1/2} for functions of smoothness s = 2). In the case when r(Σ) = o(n^{1−1/s}) with s ∈ (1, 2] and f ∈ B^s_{∞,1}(ℝ), the bias is negligible and ⟨f(Σ̂), B⟩ becomes an asymptotically normal estimator of ⟨f(Σ), B⟩.

Moreover, we will also derive the asymptotics of the risk of the plug-in estimator for loss functions satisfying the following assumption:

Assumption 1.

Let ℓ : ℝ → ℝ_+ be a loss function such that ℓ(0) = 0, ℓ(−u) = ℓ(u) for u ∈ ℝ, ℓ is nondecreasing and convex on ℝ_+ and, for some constants c₁, c₂ > 0,

ℓ(u) ≤ c₁ e^{c₂ u}, u ≥ 0.

Corollary 1.

Suppose f ∈ B^s_{∞,1}(ℝ) for some s ∈ (1, 2]. Let B be a nuclear operator and let a, σ₀ > 0. Suppose that r_n → ∞ and r_n = o(n^{1−1/s}) as n → ∞. Then

sup_Σ sup_{x ∈ ℝ} | P_Σ{ n^{1/2}(⟨f(Σ̂), B⟩ − ⟨f(Σ), B⟩)/σ_f(Σ; B) ≤ x } − Φ(x) | → 0    (1.8)

as n → ∞. Moreover, under the same assumptions on f, B and r_n, and for any loss function ℓ satisfying Assumption 1,

sup_Σ | E_Σ ℓ( n^{1/2}(⟨f(Σ̂), B⟩ − ⟨f(Σ), B⟩)/σ_f(Σ; B) ) − E ℓ(Z) | → 0    (1.9)

as n → ∞, where Z is a standard normal random variable (in both cases, the supremum is over all Σ ∈ 𝒢(r_n; a) with σ_f(Σ; B) ≥ σ₀).

The main difficulty in asymptotically efficient estimation of the functional ⟨f(Σ), B⟩ is related to the development of bias reduction methods. We will now discuss an approach to this problem in the case when ℍ is a finite-dimensional space of dimension d = dim(ℍ) and the covariance operator Σ is of isotropic type (the spectrum of Σ is bounded from above and bounded away from zero by constants that do not depend on d). In this case, the effective rank r(Σ) is of the same order as the dimension d; so, d will be used as a complexity parameter. The development of a similar approach in a more general setting (when the effective rank is the relevant complexity parameter) remains an open problem.

Consider the following integral operator:

(T g)(Σ) := E_Σ g(Σ̂) = ∫_{C_+(ℍ)} g(S) P(Σ; dS), Σ ∈ C_+(ℍ),

where C_+(ℍ) is the cone of positively semi-definite self-adjoint operators in ℍ (covariance operators) and P(Σ; ·) is the distribution of the sample covariance Σ̂ based on n i.i.d. observations sampled from N(0; Σ) (which is a rescaled Wishart distribution). In what follows, T will be called the Wishart operator. We will view it as an operator acting on bounded measurable functions g on the cone C_+(ℍ), taking values either in the real line, or in the space of self-adjoint operators. Such operators play an important role in the theory of Wishart matrices (see, e.g., James [27, 28, 29], Graczyk, Letac and Massam [22, 23], Letac and Massam [42]). Their properties will be discussed in detail in Section 5.

To find an unbiased estimator g(Σ̂) of f(Σ), one has to solve the integral equation (T g)(Σ) = f(Σ), Σ ∈ C_+(ℍ) (the Wishart equation). Let D := T − I, I being the identity operator. Then, the solution of the Wishart equation can be formally written as the Neumann series

g = T^{−1} f = (I + D)^{−1} f = ∑_{j ≥ 0} (−1)^j D^j f.

We will define an approximate solution of this equation in terms of a partial sum of the Neumann series:

f_k := ∑_{j=0}^{k} (−1)^j D^j f.

With this definition, we have

(T f_k)(Σ) − f(Σ) = (−1)^k (D^{k+1} f)(Σ), Σ ∈ C_+(ℍ).

It remains to show that D^{k+1} f is small for smooth enough functions f, which would imply that the bias of the estimator f_k(Σ̂) of f(Σ) is also small. [Very recently, a similar approach was considered in the paper by Jiao, Han and Weissman [31] in the case of estimation of a function of the parameter of a binomial model. In that case, (T f)(θ) is the Bernstein polynomial of degree n approximating the function f, and some results of classical approximation theory ([21], [57]) were used in [31] to control D^{k+1} f.]
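The following sketch (hypothetical function names; Monte Carlo resampling in place of exact Wishart integrals) shows how f_k(Σ̂) can be approximated in practice. It uses the binomial expansion ∑_{j=0}^{k}(I − T)^j = ∑_{m=0}^{k}(−1)^m C(k+1, m+1) T^m and the fact that (T^m f)(Σ̂) is the expected value of f after m iterated resampling steps started at Σ̂ (the “bootstrap chain” formalized below):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)

def f_functional(S, B):
    """<f(S), B> with the hypothetical choice f(x) = log(1 + x)."""
    lam, U = np.linalg.eigh(S)
    return float(np.trace((U * np.log1p(np.clip(lam, 0.0, None))) @ U.T @ B))

def resample_cov(S, n):
    """One application of the kernel P(S; .): sample covariance of
    n fresh draws from N(0, S)."""
    L = np.linalg.cholesky(S + 1e-10 * np.eye(S.shape[0]))
    X = rng.standard_normal((n, S.shape[0])) @ L.T
    return X.T @ X / n

def f_k_estimator(S_hat, B, n, k, mc=500):
    """Monte Carlo evaluation of f_k(S_hat) = sum_{j<=k} ((-1)^j D^j f)(S_hat),
    expanded as sum_{m<=k} (-1)^m C(k+1, m+1) (T^m f)(S_hat)."""
    Tm = np.zeros(k + 1)           # estimates of (T^m f)(S_hat), m = 0..k
    Tm[0] = f_functional(S_hat, B)
    for _ in range(mc):
        S = S_hat
        for m in range(1, k + 1):
            S = resample_cov(S, n)         # one more step of the chain
            Tm[m] += f_functional(S, B) / mc
    return sum((-1) ** m * comb(k + 1, m + 1) * Tm[m] for m in range(k + 1))
```

For k = 1 the coefficients are (2, −1), so the estimator reduces to the familiar one-step bootstrap bias correction 2f(Σ̂) − E[f(Σ̂*) | Σ̂].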

Note that P(Σ; ·) is a Markov kernel, and it could be viewed as the transition kernel of a Markov chain

Σ̂^{(0)}, Σ̂^{(1)}, Σ̂^{(2)}, …

in the cone C_+(ℍ), where Σ̂^{(0)} := Σ, Σ̂^{(1)} := Σ̂ and, in general, for any k ≥ 0, Σ̂^{(k+1)} is the sample covariance based on n i.i.d. observations sampled from the distribution N(0; Σ̂^{(k)}) (conditionally on Σ̂^{(k)}). In other words, the Markov chain is based on iterative applications of the bootstrap, and it will be called in what follows the bootstrap chain. As a consequence of bound (1.6), with a high probability (conditionally on Σ̂^{(k)}),

‖Σ̂^{(k+1)} − Σ̂^{(k)}‖ ≲ ‖Σ̂^{(k)}‖ √(d/n);

so, when d = o(n), the Markov chain moves in “small steps” of the order √(d/n). Clearly, with the above definitions,

(T^k f)(Σ) = E[ f(Σ̂^{(k)}) | Σ̂^{(0)} = Σ ], k ≥ 0.
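Continuing the sketch above (reusing resample_cov), one can watch the “small steps” of the bootstrap chain; assuming d ≪ n, the increments have operator norm of the order √(d/n):

```python
# Steps of the bootstrap chain: successive increments have operator norm
# of order sqrt(d/n) when d << n (here d/n = 0.01).
d, n = 20, 2000
S = np.eye(d)
for t in range(5):
    S_next = resample_cov(S, n)
    print(t, np.linalg.norm(S_next - S, 2), np.sqrt(d / n))
    S = S_next
```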

Note that, by Newton’s binomial formula,

(D^k f)(Σ) = ((T − I)^k f)(Σ) = ∑_{j=0}^{k} (−1)^{k−j} C(k, j) (T^j f)(Σ) = E ∑_{j=0}^{k} (−1)^{k−j} C(k, j) f(Σ̂^{(j)}).

The expression ∑_{j=0}^{k} (−1)^{k−j} C(k, j) f(Σ̂^{(j)}) could be viewed as the k-th order difference of the function f along the Markov chain {Σ̂^{(j)}}. It is well known that, for a k times continuously differentiable function g in the real line, the k-th order difference

Δ_h^k g(x) := ∑_{j=0}^{k} (−1)^{k−j} C(k, j) g(x + jh)

(C(k, j) being the binomial coefficients) is of the order |h|^k for a small increment h. Thus, at least heuristically, one can expect that (D^k f)(Σ) would be of the order (d/n)^{k/2} (since √(d/n) is the size of the “steps” of the Markov chain {Σ̂^{(j)}}). This means that, for d much smaller than n, one can achieve a significant bias reduction with a relatively small number of steps k. The justification of this heuristic is rather involved. It is based on a representation of the functional in question in terms of a real valued function on the cone C_+(ℍ) invariant with respect to the orthogonal group. The properties of orthogonally invariant functions are then used to derive an integral representation for the function D^k f that implies, for a sufficiently smooth f, bounds on D^k f of the order (d/n)^{k/2} and, as a consequence, bounds on the bias of the estimator f_k(Σ̂) of f(Σ) of the order o(n^{−1/2}), provided that d = o(n) and k is sufficiently large (see (5.14) in Section 5 and Theorem 7, Corollary 5 in Section 6).
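The classical fact about k-th order differences invoked here is easy to check numerically (an illustrative sketch):

```python
import numpy as np
from math import comb

def kth_difference(g, x, h, k):
    """Delta_h^k g(x) = sum_j (-1)^(k-j) C(k, j) g(x + j h)."""
    return sum((-1) ** (k - j) * comb(k, j) * g(x + j * h) for j in range(k + 1))

for k in [1, 2, 3]:
    for h in [1e-1, 1e-2]:
        print(k, h, abs(kth_difference(np.cos, 1.0, h, k)))  # decays like h**k
```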

The next step in the analysis of the estimator ⟨f_k(Σ̂), B⟩ is to derive normal approximation bounds for it. To this end, we study in Section 7 smoothness properties of the functions T^j f for a smooth orthogonally invariant function f; these properties are later used to establish proper smoothness of such functions as f_k and to derive concentration bounds on the remainder of the first order Taylor expansion of f_k(Σ̂), which is the main step in showing that the centered remainder is asymptotically negligible and in proving the normal approximation. In addition, we show that the limit variance in the normal approximation of ⟨f_k(Σ̂), B⟩ coincides with σ_f²(Σ; B) (which is exactly the same as the limit variance in the normal approximation of ⟨f(Σ̂), B⟩). This, finally, yields the normal approximation bounds of Theorems 9 and 10 in Section 8.

Given a ≥ 1 and d ≥ 1, denote by 𝒮_d(a) the set of all covariance operators Σ in a d-dimensional space ℍ such that σ(Σ) ⊂ [1/a, a]. The following result on uniform normal approximation of the estimator ⟨f_k(Σ̂), B⟩ of ⟨f(Σ), B⟩ is an immediate consequence of Theorem 10.

Theorem 3.

Let B be a nuclear operator and let a ≥ 1, σ₀ > 0. Suppose that, for some α ∈ (0, 1), d = d_n ≤ n^α. Suppose also that, for some s > 1/(1 − α), f ∈ B^s_{∞,1}(ℝ), and let k be an integer number such that 2(k + 1) > 1/(1 − α). Then

sup_Σ sup_{x ∈ ℝ} | P_Σ{ n^{1/2}(⟨f_k(Σ̂), B⟩ − ⟨f(Σ), B⟩)/σ_f(Σ; B) ≤ x } − Φ(x) | → 0    (1.10)

as n → ∞, where the supremum is over all Σ ∈ 𝒮_d(a) with σ_f(Σ; B) ≥ σ₀. Moreover, if ℓ is a loss function satisfying Assumption 1, then

sup_Σ | E_Σ ℓ( n^{1/2}(⟨f_k(Σ̂), B⟩ − ⟨f(Σ), B⟩)/σ_f(Σ; B) ) − E ℓ(Z) | → 0    (1.11)

as n → ∞ (the supremum being over the same set of covariances), Z being a standard normal random variable.

Remark 1.

Note that, for s ∈ (1, 2] and α < 1 − 1/s, one can choose k = 0, implying that ⟨f_k(Σ̂), B⟩ in Theorem 3 is a usual plug-in estimator (compare this with Corollary 1). However, for α ≥ 1/2, we have to assume that f ∈ B^s_{∞,1}(ℝ) for some s > 2 and choose k ≥ 1 to satisfy the condition 2(k + 1) > 1/(1 − α). Thus, in this case, the bias correction is already nontrivial. For larger values of α, even more smoothness of f is required and more iterations in our bias reduction method are needed.

To show the asymptotic efficiency of the estimator ⟨f_k(Σ̂), B⟩, it remains to prove a minimax lower bound on the risk of an arbitrary estimator of the functional ⟨f(Σ), B⟩ that would imply the optimality of the variance σ_f²(Σ; B) in normal approximations (1.10), (1.11). This is done for the quadratic loss (under mild assumptions) in Theorem 11 of Section 9, implying that ⟨f_k(Σ̂), B⟩ is an asymptotically efficient estimator of ⟨f(Σ), B⟩ in a semi-parametric sense.

1.3 Related results

To the best of our knowledge, the problem of efficient estimation for general classes of smooth functionals of covariance operators in the setting of the current paper has not been studied before. However, many results in the literature on nonparametric, semiparametric and high-dimensional statistics, as well as some results in random matrix theory, are relevant in our context. We provide below a very brief discussion of some of these results.

Asymptotically efficient estimation of smooth functionals of infinite-dimensional parameters has been an important topic in nonparametric statistics for a number of years, one that also has deep connections to efficiency in semiparametric estimation (see, e.g., [7], [18] and references therein). The early references include Levit [43, 44] and the book of Ibragimov and Khasminskii [24]. In the paper by Ibragimov, Nemirovski and Khasminskii [25], and later in the paper [47] and in the Saint-Flour lectures [48] by Nemirovski, sharp results on efficient estimation of general smooth functionals of the parameter of a Gaussian shift model were obtained, precisely describing the dependence between the rate of decay of the Kolmogorov diameters of the parameter space (used as a measure of its complexity) and the degree of smoothness of the functionals for which efficient estimation is possible. A general approach to the construction of efficient estimators of smooth functionals in Gaussian shift models was also developed in these papers. The result of Theorem 3 is in the same spirit, with the growth rate of the dimension d of the space being the complexity parameter instead of the rate of decay of the Kolmogorov diameters. At this point, we do not know whether the smoothness threshold for efficient estimation obtained in this theorem is sharp.

More recently, there has been a lot of interest in semi-parametric efficiency properties of regularization-based estimators (such as LASSO) in various models of high-dimensional statistics, see, e.g., [59], [26], [61], [30] as well as in minimax optimal rates of estimation of special functionals (in particular, linear and quadratic) in such models [9], [10], [12].

In a series of pioneering papers in the 80s–90s, Girko obtained a number of results on asymptotically normal estimation of many special functionals of covariance matrices in the high-dimensional setting, in particular, on estimation of the Stieltjes transform of the spectral function (see [19] and also [20] and references therein). His estimators were typically functions of the sample covariance defined in terms of certain equations (so-called G-estimators), and the proofs of their asymptotic normality were largely based on the martingale CLT. The centering and normalizing parameters in the limit theorems in these papers are often hard to interpret, and the estimators were not proved to be asymptotically efficient.

Asymptotic normality of so-called linear spectral statistics tr(f(Σ̂)), centered either by their own expectations or by the integral of f with respect to a Marchenko–Pastur type law, has been an active subject of research in random matrix theory, both in the case of high-dimensional sample covariance (or Wishart) matrices and in other random matrix models such as Wigner matrices; see, e.g., Bai and Silverstein [5], Lytova and Pastur [45], Sosoe and Wong [56]. Although these results do not have direct statistical implications, since tr(f(Σ̂)) does not “concentrate” around the corresponding population parameter tr(f(Σ)), the probabilistic and analytic techniques developed in these papers are highly relevant.

There are many results in the literature on special cases of the above problem, such as the asymptotic normality of the statistic log det(Σ̂) = tr(log(Σ̂)) (the log-determinant). If d = d_n ≤ n, then it was shown that the sequence

(log det(Σ̂) − b_{n,d} − log det(Σ))/a_{n,d}

converges in distribution to a standard normal random variable for explicitly given sequences a_{n,d}, b_{n,d} that depend only on the sample size n and on the dimension d. This means that log det(Σ̂) is an asymptotically normal estimator of log det(Σ) subject to a simple bias correction (see, e.g., Girko [19] and the more recent paper by Cai, Liang and Zhou [11]). In this case, the problem is relatively simple since log det(Σ̂) − log det(Σ) = log det(Σ̂₀), where Σ̂₀ is the sample covariance based on a sample of n i.i.d. standard normal random vectors.
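A short simulation illustrates this (an illustrative sketch, assuming SciPy for digamma/trigamma; it uses the Bartlett decomposition, under which log det(Σ̂) − log det(Σ) is distributed as ∑_{i=1}^{d} log(χ²_{n−i+1}/n), so the exact centering and normalizing sequences are sums of digamma and trigamma terms):

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(5)
d, n, reps = 30, 200, 2000
Sigma = np.diag(np.linspace(0.5, 2.0, d))
true_logdet = np.linalg.slogdet(Sigma)[1]

ks = n - np.arange(d)                   # degrees of freedom n, n-1, ..., n-d+1
b = np.sum(digamma(ks / 2.0) + np.log(2.0 / n))   # exact bias b_{n,d}
v = np.sum(polygamma(1, ks / 2.0))                # exact variance a_{n,d}^2

z = []
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    ld = np.linalg.slogdet(X.T @ X / n)[1]
    z.append((ld - b - true_logdet) / np.sqrt(v))
print(np.mean(z), np.std(z))            # approx 0 and 1: standard normal limit
```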

In a recent paper by Koltchinskii and Lounici [36] (see also [38, 39]), the problem of estimation of bilinear forms of spectral projections of covariance operators was studied in the setting when the effective rank r(Σ) = o(n) as n → ∞.² Normal approximation and concentration results for bilinear forms centered by their expectations were proved using first order perturbation expansions for empirical spectral projections and concentration inequalities for their remainder terms (which is similar to the approach of the current paper). Special properties of the bias of these estimators were studied; in the case of one-dimensional spectral projections, this led to the development of a bias reduction method based on sample splitting that resulted in the construction of n^{1/2}-consistent and asymptotically normal estimators of linear forms of eigenvectors of the true covariance (principal components) in the case when r(Σ) = o(n) as n → ∞. This approach has been further developed in a very recent paper by Koltchinskii, Loeffler and Nickl [40], in which asymptotically efficient estimators of linear forms of eigenvectors of Σ were studied.

² For other recent results on covariance estimation under assumptions on its effective rank, see [46, 54].

Other recent references on estimation of functionals of covariance include Fan, Rigollet and Wang [15] (optimal rates of estimation of special functionals of covariance under sparsity assumptions) and Gao and Zhou [16] (Bernstein–von Mises theorems for functionals of covariance).

2 Analysis and operator theory preliminaries

In this section, we discuss several results in operator theory concerning perturbations of smooth functions of self-adjoint operators in Hilbert spaces. They are simple modifications of known results due to several authors (see the recent survey by Aleksandrov and Peller [2]).

2.1 Entire functions of exponential type and Besov spaces

Let f be an entire function and let σ > 0. It is said that f is of exponential type σ (more precisely, of exponential type at most σ) if, for any ε > 0, there exists a constant C = C(ε, f) > 0 such that

|f(z)| ≤ C e^{(σ+ε)|z|}, z ∈ ℂ.

In what follows, E_σ denotes the space of all entire functions of exponential type σ. It is straightforward to see (and well known) that f ∈ E_σ if and only if

lim sup_{|z|→∞} log|f(z)|/|z| ≤ σ.

With a little abuse of notation, the restriction of a function f ∈ E_σ to the real line ℝ will also be denoted by f. Let

f̂(t) := ∫_ℝ f(x) e^{−itx} dx, t ∈ ℝ,

denote the Fourier transform of f (if f is not square integrable, its Fourier transform could be understood in the sense of tempered distributions). According to the Paley–Wiener theorem, f ∈ E_σ ∩ L_2(ℝ) if and only if supp(f̂) ⊂ [−σ, σ]. It is also well known that f ∈ E_σ ∩ L_∞(ℝ) if and only if supp(f̂) ⊂ [−σ, σ], with f̂ understood as a tempered distribution.

We will use the following Bernstein inequality

‖f′‖_{L_∞(ℝ)} ≤ σ ‖f‖_{L_∞(ℝ)}

that holds for all functions f ∈ E_σ ∩ L_∞(ℝ). Moreover, since f ∈ E_σ implies f′ ∈ E_σ, we also have

‖f″‖_{L_∞(ℝ)} ≤ σ² ‖f‖_{L_∞(ℝ)},

and similar bounds hold for all the derivatives of f. The next elementary lemma is a corollary of Bernstein inequality. It provides bounds on the remainder of the first order Taylor expansion of a function f ∈ E_σ ∩ L_∞(ℝ).

Lemma 1.

Let f ∈ E_σ ∩ L_∞(ℝ). Denote

R_f(x; h) := f(x + h) − f(x) − f′(x)h, x, h ∈ ℝ.

Then

|R_f(x; h)| ≤ (σ²‖f‖_{L_∞(ℝ)}/2) h², x, h ∈ ℝ,

and

|R_f(x; h) − R_f(x; h′)| ≤ σ²‖f‖_{L_∞(ℝ)} (|h| ∨ |h′|) |h − h′|, x, h, h′ ∈ ℝ,

where a ∨ b := max(a, b).

We also need an extension of Bernstein inequality to functions of many complex variables. Let f : ℂ^k → ℂ be an entire function and let σ₁, …, σ_k > 0. The function f is of exponential type (σ₁, …, σ_k) if, for any ε > 0, there exists a constant C = C(ε, f) > 0 such that

|f(z₁, …, z_k)| ≤ C exp{(σ₁ + ε)|z₁| + … + (σ_k + ε)|z_k|}, (z₁, …, z_k) ∈ ℂ^k.

Let E_{σ₁,…,σ_k} be the set of all such functions. The following extension of Bernstein inequality could be found in the paper by Nikolsky [49], who actually proved it for an arbitrary L_p-norm, 1 ≤ p ≤ ∞. If f ∈ E_{σ₁,…,σ_k} ∩ L_∞(ℝ^k), then, for any j = 1, …, k and any natural number m,

‖∂^m f/∂x_j^m‖_{L_∞(ℝ^k)} ≤ σ_j^m ‖f‖_{L_∞(ℝ^k)}.    (2.1)

Let w be an infinitely differentiable function in the real line with values in [0, 1] such that w(t) = 1 for |t| ≤ 1 and w(t) = 0 for |t| ≥ 2. Define w₀ := w and, for j ≥ 1,

w_j(t) := w(2^{−j}t) − w(2^{−j+1}t), t ∈ ℝ,

which implies that supp(w_j) ⊂ {t : 2^{j−1} ≤ |t| ≤ 2^{j+1}}. These definitions immediately imply that

∑_{j ≥ 0} w_j(t) = 1, t ∈ ℝ.

Finally, define functions W_j ∈ 𝒮(ℝ) (the Schwartz space of functions in ℝ) by their Fourier transforms as follows: Ŵ_j := w_j, j ≥ 0. For a tempered distribution f, its Littlewood–Paley decomposition is given by f = ∑_{j ≥ 0} f_j with f_j := f * W_j, and the Besov norm used throughout the paper is defined as ‖f‖_{B^s_{∞,1}} := ∑_{j ≥ 0} 2^{js}‖f_j‖_{L_∞(ℝ)}, the Besov space B^s_{∞,1}(ℝ) being the space of all tempered distributions f for which this norm is finite.