# Theory of functional principal components analysis for discretely observed data

For discretely observed functional data, estimating eigenfunctions with a diverging index is essential in nearly all methods based on functional principal components analysis. In this paper, we propose a new approach to handle each term appearing in the perturbation series and overcome the summability issue caused by the estimation bias. We obtain moment bounds for the eigenfunctions and eigenvalues over a wide range of sampling rates. We show that, under some mild assumptions, the moment bound for eigenfunctions with diverging indices is optimal in the minimax sense. This is the first attempt at obtaining an optimal rate for eigenfunctions with diverging index for discretely observed functional data. Our results fill the gap in theory between the ideal estimation from fully observed functional data and the reality that observations are taken at discrete time points with noise; this has its own merits in models involving inverse problems and deserves further investigation.


## 1 Introduction

With the rapid evolution of modern data collection technologies, functional data emerge ubiquitously and have been studied extensively over the past decades. In general, functional data are regarded as stochastic processes with certain smoothness conditions or as realizations of Hilbert space valued random elements. These perspectives convey two essential natures of functional data, smoothness and infinite dimensionality, which distinguish functional data from high-dimensional and Euclidean data. For a comprehensive treatment of functional data, we recommend the monographs by ramsay2005functional, ferraty2006nonparametric, horvath2012inference, hsing2015theoretical and kokoszka2017introduction.

Although functional data provide information over a continuum, which is often time or spatial location, real data are mostly collected in a discrete form with measurement errors. One usually uses $n$ to denote the sample size and $N_i$ the number of observations for the $i$-th subject. Thanks to the smoothness nature of functional data, having a large number of observations per subject is a blessing rather than a curse, in contrast to high-dimensional data (hall2006propertieshans). There is an extensive literature on nonparametric methods for accessing the smoothness nature of functional data, including the kernel method (yao2005jasa; hall2006propertieshans) and various kinds of spline methods (rice2001nonparametric; yao2006penalized; paul2009consistency; cai2011optimal).

For a given smoothing method, there are two typical strategies. When the observation points per subject are relatively dense, pre-smoothing each curve before further analysis is suggested by ramsay2005functional and zhang2007statistical. Otherwise, pooling observations together from all subjects is recommended when the sampling scheme is rather sparse (yao2005jasa). Whether individual or pooled information is used affects the convergence rate and the phase transition in estimating population quantities such as the mean and covariance functions. When the sampling rate is large enough, the curves reconstructed by pre-smoothing are $\sqrt{n}$-consistent and thus the estimated mean and covariance functions based on the pre-smoothed curves attain the optimal parametric rate. In contrast, by borrowing information across all subjects, the pooling method requires a much weaker condition on the sampling rate for mean and covariance estimation to reach optimality (cai2010nonparametric; cai2011optimal; zhang2016sparse), which provides theoretical insight into the supremacy of the pooling strategy over the pre-smoothing strategy.
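As a purely illustrative numerical sketch (not from the paper), the following contrasts the two strategies for mean estimation with a Nadaraya–Watson smoother; the sinusoidal mean, Gaussian kernel, bandwidths and sample sizes are assumptions chosen for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def nw_smooth(t_eval, t_obs, y_obs, h):
    """Nadaraya-Watson (local constant) smoother with a Gaussian kernel."""
    w = np.exp(-0.5 * ((t_eval[:, None] - t_obs[None, :]) / h) ** 2)
    return (w @ y_obs) / w.sum(axis=1)

# n sparse curves observed with noise; the true mean is sin(2*pi*t).
n, N = 200, 5
grid = np.linspace(0, 1, 50)
mu = np.sin(2 * np.pi * grid)
T = rng.uniform(0, 1, size=(n, N))
Y = np.sin(2 * np.pi * T) + 0.3 * rng.standard_normal((n, N))

# Pre-smoothing strategy: smooth each curve separately, then average.
pre = np.mean([nw_smooth(grid, T[i], Y[i], h=0.15) for i in range(n)], axis=0)

# Pooling strategy: throw all (T_ij, X_ij) pairs into a single smoother.
pool = nw_smooth(grid, T.ravel(), Y.ravel(), h=0.05)

err_pre = np.mean((pre - mu) ** 2)
err_pool = np.mean((pool - mu) ** 2)
print(err_pre, err_pool)
```

In this sparse design ($N=5$ points per curve), per-curve smoothing carries a large bias that averaging cannot remove, while the pooled smoother can afford a much smaller bandwidth.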

However, estimating the mean and covariance only reflects the smoothness nature of functional data and leaves the infinite dimensionality out of consideration. Due to the decaying eigenvalues, covariance operators of functional random objects are non-invertible; consequently, regularization is needed in models involving inverse problems with functional covariates, including functional linear regression (FLR) (yao2005aos; hall2007methodology; yuan2010reproducing), the functional generalized linear model (fGLM) (muller2005generalized; dou2012estimation) and the functional Cox model (qu2016optimal). Truncation on the functional principal components (FPC) is a well-developed approach to regularization (hall2007methodology; dou2012estimation), which requires the asymptotic behavior of the estimated eigenfunctions in theoretical analysis. In order to suppress the estimation bias, the number of principal components used in estimating the regression function should grow with the sample size. Therefore, the convergence rate for a diverging number of estimated eigenfunctions is a fundamental issue in functional data analysis (FDA). It is not only of interest for its own theoretical merits but is also embedded in most inverse models involving functional covariates.

For fully observed functional data, the seminal work hall2007methodology obtained the convergence rate for the $j$-th eigenfunction, which has been proved optimal in the minimax sense by wahl2020information. Subsequently, this result became the keystone in establishing the optimal convergence in FLR (hall2007methodology) and fGLM (dou2012estimation). For discretely observed functional data, stochastic bounds derived in the existing literature only cover a fixed number of eigenfunctions. Specifically, applying the local linear smoother, hall2006propertieshans showed that a fixed eigenfunction admits a one-dimensional nonparametric rate. In another line of research, under the reproducing kernel Hilbert space (RKHS) framework, cai2010nonparametric claimed that eigenfunctions with fixed indices admit the same convergence rate as the covariance function. We point out here that even though both results are one-dimensional nonparametric rates (they differ at most by a minor factor), the methodologies and techniques used are completely disparate. On one hand, hall2006propertieshans utilized the fact that the additional integration enables the eigenfunctions to be estimated at a faster rate than the covariance function itself and adopted a detailed perturbation expansion from bosq2000linear. On the other hand, the one-dimensional rate for the covariance function in cai2010nonparametric is facilitated by the tensor product space, which is much smaller than the $L^2$ space on $[0,1]^2$, and their result for eigenfunctions is a direct application of the perturbation bound in bhatia1983perturbation. A detailed discussion of the perturbation bounds can be found in Section 2. Moreover, paul2009consistency proposed a reduced rank model and studied its asymptotic properties under a particular setting. We emphasize that the theoretical results in the literature mentioned above hold only for a fixed number of eigenfunctions, and to the best of our knowledge, there has been no progress in obtaining the convergence rate for a diverging number of eigenfunctions when the data are discretely observed with noise.

An intrinsic difference between estimating a diverging versus a fixed number of eigenfunctions lies in the infinite-dimensional nature of functional data. Specifically, the decaying eigenvalues make the analysis of eigenfunctions with diverging index challenging, causing all existing techniques to fail. Moreover, the perturbation results used in hall2007methodology and dou2012estimation are facilitated by the cross-sectional sample covariance, which reduces each term in the perturbation series to the FPC scores. This virtue no longer exists when the trajectories are observed at discrete time points, and a summability issue occurs due to the estimation bias; see Section 2 for further elaboration. The combination of smoothness and infinite dimensionality reveals the elevated difficulty in estimating a diverging number of eigenfunctions, which remains an open problem in FDA.

In view of the aforementioned significance and difficulties, we present in this paper a unified theory for estimating a diverging number of eigenfunctions from discretely observed functional data. The main contributions of this paper are summarized as follows. First, we bring up a new approach to handle each term appearing in the perturbation series, which overcomes the summability issue caused by the estimation bias. This methodology can also be applied to other perturbation-involved problems, such as estimation and prediction in FLR. Second, under some mild assumptions, we obtain the moment bound for eigenfunctions with diverging indices. Our unified theory allows the sampling rate to vary over a wide range, and we show that once the sampling rate reaches a certain magnitude, determined by the smoothness of the underlying trajectory and the frequency of the estimated eigenfunction, this bound becomes optimal in the minimax sense as if the curves were fully observed. This result is the first attempt at obtaining an optimal convergence rate for eigenfunctions with diverging index for discretely observed functional data. Third, we obtain the asymptotic normality for eigenvalues with diverging index. In summary, the results obtained in this paper are not merely theoretical progress; they build a bridge between the “ideal” and the “reality” in models involving inverse problems, which has its own merits deserving further investigation.

In what follows, we write $a\lesssim b$ if $a\le Cb$ for a positive constant $C$, and $a\gtrsim b$ if $b\lesssim a$; we write $a\asymp b$ if $a\lesssim b$ and $b\lesssim a$. A non-random sequence is said to be $O(1)$ if it is bounded; for non-random sequences $a_n$ and $b_n$, $a_n=O(b_n)$ stands for $a_n/b_n$ being bounded and $a_n=o(b_n)$ stands for $a_n/b_n$ converging to zero. Throughout this paper, $C$ stands for a positive constant which may vary from place to place. For $x\in\mathbb{R}$, $\lfloor x\rfloor$ denotes the largest integer smaller than or equal to $x$. For a function $f$, $\|f\|$ denotes its $L^2$ norm and $|f|_\infty$ its supremum norm.

The remainder of the paper is organized as follows. Section 2 introduces the model, theoretical results are presented in Section 3, and the proofs are collected in Section 4.

## 2 Eigenfunctions estimation for discretely observed data

Let $X$ be a square integrable stochastic process on $[0,1]$ and $X_1,\dots,X_n$ independent and identically distributed (i.i.d.) copies of $X$. The mean and covariance functions of $X$ are $\mu(t)=E\{X(t)\}$ and $C(s,t)=\mathrm{Cov}\{X(s),X(t)\}$, respectively. By Mercer's Theorem (indritz1963methods, Chapter 4), $C$ admits the spectral decomposition

 C(s,t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s) \phi_k(t), (1)

where $\lambda_1\ge\lambda_2\ge\cdots\ge 0$ are eigenvalues and $\phi_k$'s are the corresponding eigenfunctions. The eigenfunctions form a complete orthonormal system (CONS) of $L^2([0,1])$, the space of square-integrable functions on $[0,1]$. For each $i$, the process admits the so-called Karhunen–Loève expansion

 X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik} \phi_k(t), (2)

where $\xi_{ik}=\langle X_i-\mu,\phi_k\rangle$ are uncorrelated zero-mean random variables with variance $\lambda_k$. In this paper we focus on eigenfunction estimation; thus we assume $\mu=0$ without loss of generality (W.L.O.G.).

In practice, we cannot observe the full trajectory $X_i(t)$. Instead, measurements are taken at discrete time points with noise contamination. The actual observations for each $i$ are

 \{(T_{ij}, X_{ij}) \,|\, X_{ij} = X_i(T_{ij}) + \varepsilon_{ij},\ j=1,\dots,N_i\}, (3)

where the $\varepsilon_{ij}$'s are i.i.d. copies of a zero-mean measurement error $\varepsilon$ with finite variance. We further assume that the $T_{ij}$'s are i.i.d. copies of $T$, which follows the uniform distribution on $[0,1]$.
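The observation model (3) can be simulated through a truncated Karhunen–Loève expansion. In the sketch below, the Fourier cosine basis, the decay $\lambda_k=k^{-2}$ and the noise level are illustrative assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(1)

# Truncated Karhunen-Loeve simulation of the observation model (3).
n, N, K = 100, 20, 25
lam = np.arange(1, K + 1, dtype=float) ** -2.0        # lambda_k = k^{-2} (a = 2)
T = rng.uniform(0, 1, size=(n, N))                    # uniform design points
xi = rng.standard_normal((n, K)) * np.sqrt(lam)       # scores, Var(xi_k) = lambda_k

def phi(k, t):
    """Orthonormal Fourier cosine basis on [0, 1]."""
    return np.sqrt(2.0) * np.cos(k * np.pi * t)

# X_ij = X_i(T_ij) + eps_ij, where X_i(t) = sum_k xi_ik phi_k(t) (mu = 0).
basis = np.stack([phi(k, T) for k in range(1, K + 1)])  # shape (K, n, N)
X_obs = np.einsum('ik,kin->in', xi, basis) + 0.1 * rng.standard_normal((n, N))
print(X_obs.shape)
```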

Kernel methods have been widely developed as a smoothing approach in FDA due to their attractive theoretical features (yao2005jasa; hall2006propertieshans; li2010uniform; zhang2016sparse). The main motivation of this paper is to derive a unified theory for estimating a diverging number of eigenfunctions from discretely observed functional data; thus, we adopt the local constant smoother here to avoid distractions from complicated calculations. Denote $\delta_{ijl}=X_{ij}X_{il}$, and the covariance estimator is obtained through

 \hat{C}(s,t) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{N(N-1)} \frac{1}{h^2} \sum_{1\le j\ne l\le N} K\left(\frac{T_{ij}-s}{h}\right) K\left(\frac{T_{il}-t}{h}\right) \delta_{ijl}, (4)

where $K$ is a symmetric density kernel and $h$ is the bandwidth. The function $\hat{C}$ admits an empirical version of decomposition (1),

 \hat{C}(s,t) = \sum_{k=1}^{\infty} \hat{\lambda}_k \hat{\phi}_k(s) \hat{\phi}_k(t), (5)

where $\hat{\lambda}_k$ and $\hat{\phi}_k$ are estimators for $\lambda_k$ and $\phi_k$. We further assume $\langle\hat{\phi}_k,\phi_k\rangle\ge 0$, which entails no loss of generality since eigenfunctions are identified up to a sign.
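A direct implementation sketch of the pooled local constant estimator (4) followed by the empirical eigendecomposition (5) on a grid; the rank-one truth, the Epanechnikov kernel and the simple grid quadrature are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative truth: X_i(t) = xi_i * sqrt(2) sin(2 pi t), so lambda_1 = 1.
n, N, h = 300, 15, 0.1
grid = np.linspace(0.05, 0.95, 40)
T = rng.uniform(0, 1, (n, N))
X = rng.standard_normal((n, 1)) * np.sqrt(2) * np.sin(2 * np.pi * T) \
    + 0.2 * rng.standard_normal((n, N))

K_epan = lambda u: 0.75 * np.clip(1 - u ** 2, 0, None)   # Epanechnikov kernel

# Kernel weights per subject: W[i, j, s] = K((T_ij - s) / h).
W = K_epan((T[:, :, None] - grid[None, None, :]) / h)    # shape (n, N, G)
# Product sums including j = l, then subtract the diagonal j = l terms,
# so only cross products X_ij X_il with j != l remain, as in (4).
S_full = np.einsum('ijs,ij,il,ilt->st', W, X, X, W)
S_diag = np.einsum('ijs,ij,ij,ijt->st', W, X, X, W)
C_hat = (S_full - S_diag) / (n * N * (N - 1) * h ** 2)

# Eigendecomposition (5) via grid quadrature (Riemann weights).
evals, evecs = np.linalg.eigh(C_hat * (grid[1] - grid[0]))
lam1_hat = evals[-1]
print(lam1_hat)
```

Excluding the diagonal terms $j=l$ is what keeps the measurement-error variance off the estimated covariance surface.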

Before introducing our theoretical results, we shall give a comprehensive synopsis of the eigenfunction estimation problem in FDA. We start with the following resolvent series (6) and illustrate how it is applied in statistical analysis.

 E(\|\hat{\phi}_j-\phi_j\|^2) \asymp \sum_{k\ne j} \frac{E\{\langle(\hat{C}-C)\phi_j,\phi_k\rangle^2\}}{(\lambda_k-\lambda_j)^2}. (6)

Such expansions can be found in bosq2000linear, dou2012estimation and li2010uniform; see Chapter 5 of hsing2015theoretical for details. Denote by $\eta_j$ the eigengap of $\lambda_j$; then a rough bound for $E\|\hat{\phi}_j-\phi_j\|^2$ is derived from (6) and Bessel's inequality directly:

 E\|\hat{\phi}_j-\phi_j\|^2 \lesssim \eta_j^{-2} E\|\hat{C}-C\|_{HS}^2, (7)

which is exactly the bound in bhatia1983perturbation. However, this bound is clearly suboptimal for eigenfunction estimation in two respects. First, $\eta_j$ is bounded away from zero when $j$ is regarded as fixed, and (7) then indicates that eigenfunctions admit the same convergence rate as the covariance function. This is counter-intuitive since the integration usually brings extra smoothness (cai2006prediction); thus, the eigenfunction should admit a rate faster than the two-dimensional rate of $\hat{C}$. Second, when $j$ diverges to infinity, $\eta_j$ shrinks to zero and the factor $\eta_j^{-2}$ appearing in (7) cannot be ignored. Under polynomially decaying eigenvalues, the rate obtained from (7) is strictly slower than the minimax rate and thus clearly suboptimal (wahl2020information). Therefore, in order to obtain a sharp rate for $\hat{\phi}_j$, we should resort to the original perturbation series (6) rather than its approximation (7).
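A finite-dimensional toy illustration of why the crude bound (7) is loose: for a small symmetric perturbation of a diagonal covariance with polynomially decaying eigenvalues, the realized eigenvector error sits orders of magnitude below $\eta_j^{-2}\|\hat{C}-C\|_{HS}^2$. The dimension, decay and perturbation scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Covariance with eigenvalues lambda_k = k^{-2}, perturbed symmetrically.
p = 30
lam = np.arange(1, p + 1, dtype=float) ** -2.0
C = np.diag(lam)
E = 1e-4 * rng.standard_normal((p, p))
E = (E + E.T) / 2                        # symmetric perturbation
C_hat = C + E

evals, vecs = np.linalg.eigh(C_hat)
order = np.argsort(evals)[::-1]          # sort eigenpairs decreasingly
vecs = vecs[:, order]

j = 5                                    # a moderately large index
phi_j = np.zeros(p); phi_j[j - 1] = 1.0  # true j-th eigenvector
phi_j_hat = vecs[:, j - 1] * np.sign(vecs[j - 1, j - 1])  # fix the sign

actual = np.sum((phi_j_hat - phi_j) ** 2)
eta_j = min(lam[j - 2] - lam[j - 1], lam[j - 1] - lam[j])  # eigengap at j
crude = np.linalg.norm(E, 'fro') ** 2 / eta_j ** 2         # bound (7)
print(actual, crude)
```

Only the few entries of the perturbation adjacent to index $j$ actually drive the error, which is what the series (6) captures and the crude bound (7) throws away.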

When each trajectory $X_i$ is observed entirely, which refers to the fully observed case, the cross-sectional sample covariance is a canonical estimator of $C$. Then the numerator in each term of equation (6) can be reduced to the FPC scores under some mild assumptions (hall2007methodology; dou2012estimation), and $E\|\hat{\phi}_j-\phi_j\|^2$ is bounded by $n^{-1}\sum_{k\ne j}\lambda_j\lambda_k/(\lambda_j-\lambda_k)^2$. Under polynomially decaying eigenvalues, this summation is dominated by the terms with $k$ close to $j$, which gives the rate $j^2/n$, optimal in the minimax sense (wahl2020information); see Lemma 7 of dou2012estimation for a detailed elaboration.

However, we emphasize that when tackling discretely observed data, all the existing literature uses a bound analogous to (7), which excludes diverging indices as a consequence. Specifically, the result in cai2010nonparametric is obtained by applying the bound (7) directly, and their one-dimensional rate is facilitated by the tensor product space. In contrast, the one-dimensional rate derived by hall2006propertieshans and li2010uniform profits from a detailed calculation based on the perturbation series (6). However, their theoretical analysis rests on the assumption that the eigengap is bounded away from zero, which forces $j$ to be a fixed constant. Moreover, the restricted maximum likelihood estimator proposed by paul2009consistency relies on a similar boundedness assumption on the eigengap. When the data are discretely observed, how to utilize (6) effectively is the key to obtaining a sharp bound for a diverging number of eigenfunctions.

The main challenge comes from quantifying the summation (6) without the fully observed sample covariance. For the pre-smoothing method, the reconstructed trajectories achieve parametric convergence when the number of observations per subject is of sufficiently large magnitude, and thus the estimated covariance function attains the parametric rate. However, this does not guarantee the optimal convergence of a diverging number of eigenfunctions by the existing techniques. Specifically, the numerator in each term of (6) is no longer reduced to the component scores, and such a complicated form makes it infeasible to quantify this infinite summation when the index diverges. Besides, for the pooling method, it is also highly nontrivial to sum up all the terms $E\langle(\hat{C}-C)\phi_j,\phi_k\rangle^2$ with respect to $k$; see the comments below Theorem 1 for more detail.

## 3 L2 convergence for eigenfunctions and eigenvalues

Based on the aforementioned issues, we propose a new technique to handle the perturbation series (6) for the discretely observed functional data. We shall make the following regularity assumptions.

###### 1 (a.1)

$X$ has a finite fourth moment, $E\|X\|^4<\infty$; $E(\xi_{ik}^4)\lesssim\lambda_k^2$ for all $i$ and $k$.

###### 2 (a.2)

The covariance function admits an expansion

 C(s,t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s) \phi_k(t),

where the eigenvalues are decreasing with $\lambda_k \asymp k^{-a}$ and $\lambda_k-\lambda_{k+1} \gtrsim k^{-a-1}$ for some $a>1$ and each $k$.

###### 3 (a.3)

For each $j$, the eigenfunction $\phi_j$ satisfies suitable boundary conditions together with

 \sup_{t\in[0,1]} |\phi_j^{(k)}(t)| \lesssim j^{c/2} \sup_{t\in[0,1]} |\phi_j^{(k-1)}(t)|, \quad \text{for } k=1,2,

where $c$ is a positive constant; W.L.O.G. we assume $\sup_{t\in[0,1]}|\phi_j(t)|\lesssim 1$.

Assumptions (A.1) and (A.2) are commonly adopted in the FDA literature (yao2005jasa; hall2007methodology; cai2010nonparametric; dou2012estimation). For exponentially decaying eigenvalues, the number of eigenfunctions that can be well estimated is only of logarithmic order, and thus a polynomial decaying rate is mainly considered in FDA. In mean and covariance estimation, it is typical to assume that the covariance function has bounded second-order derivatives (zhang2016sparse). However, this assumption might be inappropriate in estimating a specific eigenfunction with diverging index. In general, the frequency of $\phi_j$ is higher for larger $j$, which requires a smaller bandwidth to capture its local variation. The boundary assumption on $\phi_j$ and its first-order derivative eliminates the edge effect caused by the local constant smoother for technical convenience, and may be relaxed with more technicality. Assumption (A.3) characterizes the frequency increment of a specific eigenfunction via the smoothness of its derivatives, and the constant $c$ can be determined explicitly for commonly used bases, Fourier, Legendre and wavelet bases for instance. In hall2006propertieshans, the authors impose a uniform smoothness condition that is achievable only for fixed $j$, and Assumption (A.3) can be regarded as its generalized version. For the kernel function and sampling points, we shall make the following assumptions (zhang2016sparse).
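For the Fourier cosine basis specifically, each differentiation multiplies the supremum norm by $j\pi$, so the factor $j^{c/2}$ in (A.3) corresponds to $c=2$. A quick numerical check (illustrative, not from the paper):

```python
import numpy as np

# Ratio of sup-norms of successive derivatives for phi_j(t) = sqrt(2) cos(j pi t).
t = np.linspace(0, 1, 100001)
ratios = {}
for j in (3, 10, 40):
    phi_j = np.sqrt(2) * np.cos(j * np.pi * t)
    dphi_j = np.gradient(phi_j, t)                 # numerical first derivative
    ratios[j] = np.max(np.abs(dphi_j)) / np.max(np.abs(phi_j))
print(ratios)                                      # approximately j * pi for each j
```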

###### 4 (k.1)

$K$ is a bounded symmetric probability density function on $[-1,1]$.

###### 5 (k.2)

The process $X$, the sampling points $T$ and the measurement errors $\varepsilon$ are mutually independent.

The following theorem quantifies $E\langle\Delta\phi_j,\phi_k\rangle^2$ with $\Delta=\hat{C}-C$, which is fundamental to most problems in FDA that involve inverse issues, such as eigenfunction estimation with diverging index and estimation and prediction in FLR.

###### Theorem 1.

Under assumptions (A.1) to (A.3), (K.1) and (K.2), denote $\Delta=\hat{C}-C$. For all $k\ne j$,

 E\langle\Delta\phi_j,\phi_k\rangle^2 \lesssim \frac{1}{n}\left( j^{-a}k^{-a} + \frac{j^{-a}+k^{-a}}{N} + \frac{1}{N^2} \right) + h^4 k^{2c-2a}

and

 \sum_{k=j+1}^{\infty} E\langle\Delta\phi_j,\phi_k\rangle^2 \lesssim \frac{1}{n}\left( j^{1-2a} + \frac{h^{-1}j^{-a}+j^{1-a}}{N} + \frac{1}{hN^2} \right) + h^4 j^{1+2c-2a}.

When $\hat{C}$ is obtained from a kernel-type smoother similar to (4), its convergence rate is a two-dimensional kernel smoothing rate (zhang2016sparse). Similarly, one can show that $\Delta\phi_j$ admits a one-dimensional rate. The first assertion of Theorem 1 reveals that the rate for $E\langle\Delta\phi_j,\phi_k\rangle^2$ is a degenerated kernel smoothing rate: the variance terms vanish after two integrations. However, by Bessel's equality, summing $E\langle\Delta\phi_j,\phi_k\rangle^2$ over all $k$ recovers $E\|\Delta\phi_j\|^2$, which indicates that one cannot simply sum the first bound over all $k$ due to the estimation bias. In the fully observed case, the summation is dominated by the terms with $k$ close to $j$ (hall2007methodology; dou2012estimation). This inspires us that the convergence rate caused by the inverse issue can be captured by the summation over $k\le j$, while the tail sum over $k>j$ can be treated as a unity. The second assertion in Theorem 1 reveals that this tail sum admits a one-dimensional rate caused by kernel smoothing: the terms involving $h^{-1}$ in the variance are facilitated by the additional integration, and the term involving $j^{1+2c-2a}$ is caused by the increasing frequency of $\phi_j$ for larger $j$.

Based on Theorem 1, the following theorem gives a unified theory for estimating eigenfunctions with diverging indices, which is the main result of this paper.

###### Theorem 2.

Under assumptions (A.1) to (A.3), (K.1) and (K.2), suppose the following holds.

###### 6 (m.1)

and .

Define the corresponding event; then on this event,

 (8)

The event denotes the set of all realizations for which the estimation is well behaved for sample size $n$, sampling rate $N$ and bandwidth $h$ (hall2007methodology). It is worth mentioning that "on the event" is not a statement related to a conditioning argument in the sense of probability theory; it should be interpreted as restricting attention to all realizations for which the event occurs. Alternatively, the index bound in (M.1) can be regarded as the maximum number of eigenfunctions that can be well estimated based on the observed data. Note that this number is not fixed but can diverge to infinity, jointly determined by the observed data ($n$ and $N$), the smoothing strategy ($h$) and the decaying eigengap ($a$ and $c$). Specifically, growing $n$ and $N$ make the covariance function, and thus the eigenfunctions, well estimated. The frequency of $\phi_j$ is higher for larger $j$, which requires a smaller $h$ to capture its local variations. For larger $j$, the eigengap shrinks to zero rapidly, which makes two adjacent eigenfunctions difficult to distinguish. The condition in Assumption (M.1) covers most cases in the subsequent analysis, such as regression problems.

Theorem 2 is a good illustration of both the infinite dimensionality and smoothness natures of functional data. To understand this result, note that the leading term is the convergence rate in the fully observed case, while the remaining terms appearing on the right-hand side of (8) can be viewed as the contamination caused by the discrete observations and measurement errors. Specifically, the term involving $h^4$ represents the smoothing bias, accompanied by a typical one-dimensional variance from kernel smoothing. The terms containing $N$ are owing to the discrete approximation, and those involving positive powers of $j$ arise from the decaying eigengaps for an increasing number of eigencomponents. Rather than the stochastic bounds (in the form of $O_P$) obtained by hall2006propertieshans, paul2009consistency and cai2010nonparametric, (8) is a moment bound and is more feasible for subsequent analysis in inverse-involved problems, where moment bounds are commonly required.

Theorem 2 is a unified result for eigenfunction estimation and places no restriction on the sampling rate $N$. As with the phase transition in mean and covariance function estimation (cai2011optimal; zhang2016sparse), we hope to derive a systematic partition in estimating a diverging number of eigenfunctions. The following corollary discusses the convergence rate of $E\|\hat{\phi}_j-\phi_j\|^2$ in different sampling regimes after selecting the optimal bandwidth $h$.

###### Corollary 1.

Under assumptions (A.1) to (A.3), (K.1) and (K.2), assume (M.1) is satisfied. For each $j$,

1. In the sparser sampling regime,

 E(\|\hat{\phi}_j-\phi_j\|^2) \lesssim \frac{j^2}{n} + \frac{j^{(4a+2c+2)/5}}{(nN)^{4/5}}. (9)

In addition, when $N$ is sufficiently large, this rate reduces to $j^2/n$.

2. In the denser sampling regime,

 E(\|\hat{\phi}_j-\phi_j\|^2) \lesssim \frac{j^{2a+2}}{nN^2} + \frac{j^{(8a+2c+2)/5}}{(nN^2)^{4/5}}. (10)

###### Remark 1.

Different from mean and covariance function estimation, the optimal bandwidth for estimating the $j$-th eigenfunction varies with $j$. When the eigenvalues decay rapidly relative to the frequency increment, the optimal bandwidth grows with $j$: over-smoothing is needed to balance the increasing estimation variance caused by the decaying eigengaps. On the contrary, when the frequency of the target eigenfunction is relatively large compared to its smoothness, the optimal bandwidth shrinks as $j$ grows in order to capture the local variations, which stands for the under-smoothing case.

###### Remark 2.

When $j$ is fixed and $N$ is finite, the convergence rate derived from Corollary 1 becomes a typical one-dimensional kernel smoothing rate and reaches the optimum for a suitable sampling rate. This result is consistent with those in hall2006properties and optimal in the minimax sense. When $j$ is not fixed but diverges to infinity, the known lower bound for fully observed data is derived by applying the van Trees inequality on the special orthogonal group (wahl2020information). For discretely observed functional data, there are no lower bounds for eigenfunctions with diverging indices, and it is not straightforward to extend the results of wahl2020information to the discrete case. Lower bounds for eigenfunctions with diverging indices under discrete observation are beyond the scope of this paper and remain an open problem deserving investigation.

###### Remark 3.

For commonly used bases, the convergence rate for the $j$-th eigenfunction reaches the optimum, as if the curves were fully observed, once the sampling rate is large enough. When $j$ is fixed, the phase transition occurs at the same sampling order as for the mean and covariance functions, consistent with hall2006propertieshans and cai2010nonparametric. For a diverging index, the largest index of eigenfunction that can be well estimated is constrained by Assumption (M.1), and the phase transition occurs at a larger sampling order, which can be interpreted from two aspects. On one hand, more observations per subject are needed to obtain an optimal rate for eigenfunctions with diverging indices compared to mean and covariance estimation, which reveals the elevated difficulty caused by the infinite dimensionality. On the other hand, the required order is only slightly larger, which demonstrates the merits of the pooling method as well as our theoretical analysis.

Next, we focus on the asymptotic normality of the estimated eigenvalues. Before stating our results, we shall make the following assumptions.

###### 7 (n.1)

and .

###### 8 (n.2)

For any sequence of indices $(k_1,k_2,k_3,k_4)$, $E(\xi_{ik_1}\xi_{ik_2}\xi_{ik_3}\xi_{ik_4})=0$ unless each index is repeated.

Assumption (N.1) gives the additional moment condition needed in deriving the asymptotic normality, which is typical in the FDA literature (zhang2016sparse). Assumption (N.2) is a standard assumption in functional principal components analysis (cai2006prediction; hall2009theory), which simplifies the moment calculation.

###### Theorem 3.

Under assumptions (A.1) to (A.3), (K.1), (K.2), (N.1) and (N.2), suppose in addition that

###### 9 (m.2)

and .

Then, for all admissible $j$,

 \Sigma_n^{-1/2}\left(\frac{\hat{\lambda}_j-\lambda_j}{\lambda_j}-2\lambda_j\sigma_K^2h^2\int_h^{1-h}\phi_j^{(2)}(u)\phi_j(u)\,du+o(j^{-a}h^2)\right)\stackrel{d}{\longrightarrow}N(0,1),

where

 \Sigma_n=\frac{1}{n}\left[\frac{(N-2)(N-3)}{N(N-1)}\frac{E(\xi_j^4)}{\lambda_j^2}+\frac{4(N-2)}{N(N-1)}\frac{E\{\xi_j^2(\|X\phi_j\|^2+\sigma_X^2)\}}{\lambda_j^2}+\frac{2}{N(N-1)}\frac{E\{(\|X\phi_j\|^2+\sigma_X^2)^2\}}{\lambda_j^2}-1\right].

Note that $\lambda_j\to 0$ as $j\to\infty$, so regularization is needed such that the eigenvalues serve on a common scale of variability. After two integrations, $\hat{\lambda}_j$ admits a degenerated kernel smoothing rate, and for fixed $j$, $\hat{\lambda}_j$ is consistent for small enough bandwidth. For slowly diverging $j$, three types of asymptotic normality emerge from Theorem 3 depending on the order of the sampling rate relative to $\lambda_j$.

###### Corollary 2.

Under the assumptions of Theorem 3, for all admissible $j$:

1. If $N\lambda_j\to\infty$, $\sqrt{n}\left\{\frac{\hat{\lambda}_j-\lambda_j}{\lambda_j}-2\lambda_j\sigma_K^2h^2\int_h^{1-h}\phi_j^{(2)}(u)\phi_j(u)\,du\right\}\stackrel{d}{\longrightarrow}N\left(0,\{E(\xi_j^4)-\lambda_j^2\}/\lambda_j^2\right),

2. If $N\lambda_j\to C_1$ for a constant $C_1\in(0,\infty)$,

 \sqrt{n}\left(\frac{\hat{\lambda}_j-\lambda_j}{\lambda_j}-2\lambda_j\sigma_K^2h^2\int_h^{1-h}\phi_j^{(2)}(u)\phi_j(u)\,du\right)\stackrel{d}{\longrightarrow}N\left(0,\ \frac{E(\xi_j^4)-\lambda_j^2}{\lambda_j^2}+\frac{4E\{\xi_j^2(\|X\phi_j\|^2+\sigma_X^2)\}}{C_1\lambda_j}+\frac{2E\{(\|X\phi_j\|^2+\sigma_X^2)^2\}}{C_1^2}\right).
3. If $N\lambda_j\to 0$,

 \sqrt{n}N\lambda_j\left(\frac{\hat{\lambda}_j-\lambda_j}{\lambda_j}-2\lambda_j\sigma_K^2h^2\int_h^{1-h}\phi_j^{(2)}(u)\phi_j(u)\,du\right)\stackrel{d}{\longrightarrow}N\left(0,\ \frac{2N}{N-1}E\{(\|X\phi_j\|^2+\sigma_X^2)^2\}\right).

## 4 Proofs of theorems in Section 3

In this section, we will give the proofs of theorems in Section 3. In order to avoid the summability issues mentioned in Section 2, we propose a new approach to treat the dominating terms in the summation.

To begin with, for a function $\phi$ denote $T_h\phi(v)=h^{-1}\int K\{(v-t)/h\}\phi(t)\,dt$; it is clear that $T_h$ is a self-adjoint and bounded operator. We first bound $T_h\phi_k-\phi_k$:

 |T_h\phi_k(v)-\phi_k(v)| = \left| \frac{1}{h}\int K\left(\frac{v-t}{h}\right)\phi_k(t)\,dt - \phi_k(v) \right| = \left| \int_{-1}^{1} K(u)\left\{ \phi_k(v) - hu\,\phi_k^{(1)}(v) + \frac{h^2u^2}{2}\phi_k^{(2)}(v^*) \right\} du - \phi_k(v) \right| \lesssim h^2 |\phi_k^{(2)}|_\infty \lesssim h^2 k^c,

where $v^*$ lies between $v-hu$ and $v$, and the last inequality comes from assumption (A.3). Then

 \|T_h\phi_k-\phi_k\|^2 \lesssim h^4 k^{2c} \quad \text{for all } k. (11)
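The bound (11) can be checked numerically: with an Epanechnikov kernel and Fourier eigenfunctions (so $c=2$ in the notation above), the interior $L^2$ error of $T_h\phi_k-\phi_k$ should scale like $h^4k^4$, so doubling $k$ should multiply the squared error by roughly $2^4=16$. The kernel and grids below are illustrative assumptions.

```python
import numpy as np

K_epan = lambda u: 0.75 * np.clip(1 - u ** 2, 0, None)   # Epanechnikov kernel
u = np.linspace(-1, 1, 2001)
du = u[1] - u[0]

def sq_err(k, h):
    """Interior L2 error of T_h phi_k - phi_k for phi_k(t) = sqrt(2) cos(k pi t)."""
    v = np.linspace(h, 1 - h, 400)                       # interior points only
    phi = lambda t: np.sqrt(2) * np.cos(k * np.pi * t)
    # T_h phi(v) = int K(u) phi(v - h u) du, by the substitution t = v - h u.
    Th = (K_epan(u)[None, :] * phi(v[:, None] - h * u[None, :])).sum(1) * du
    return np.mean((Th - phi(v)) ** 2) * (1 - 2 * h)

e1 = sq_err(k=4, h=0.01)
e2 = sq_err(k=8, h=0.01)
print(e2 / e1)        # roughly (8 / 4)^4 = 16
```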

### 4.1 Proof of Theorem 1

###### Proof 1.

Recall estimator (4) and note that

 \langle \hat{C}\phi_j, \phi_k \rangle = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{N(N-1)} \sum_{l_1\ne l_2} \delta_{il_1l_2} T_h\phi_j(T_{il_1}) T_h\phi_k(T_{il_2}).

Denote $A_i(\phi_j,\phi_k)=\sum_{l_1\ne l_2}\delta_{il_1l_2}T_h\phi_j(T_{il_1})T_h\phi_k(T_{il_2})$, so that $\langle\hat{C}\phi_j,\phi_k\rangle=\{nN(N-1)\}^{-1}\sum_{i=1}^n A_i(\phi_j,\phi_k)$. The second moment of each $A_i(\phi_j,\phi_k)$ can be decomposed as

 E A_i^2(\phi_j,\phi_k) = 4!\binom{N}{4} A_{i1}(\phi_j,\phi_k) + 3!\binom{N}{3} A_{i2}(\phi_j,\phi_k) + 2!\binom{N}{2} A_{i3}(\phi_j,\phi_k)

with

 A_{i1}(\phi_j,\phi_k) = E\left( \langle X_i, T_h\phi_j \rangle^2 \langle X_i, T_h\phi_k \rangle^2 \right),

 A_{i2}(\phi_j,\phi_k) = 2E\left( \langle X_i, T_h\phi_j \rangle \langle X_i, T_h\phi_k \rangle \langle X_i T_h\phi_j, X_i T_h\phi_k \rangle \right) + E\left\{ |\langle X_i, T_h\phi_j \rangle|^2 \left( \|X_i T_h\phi_k\|^2 + \sigma_X^2 \|T_h\phi_k\|^2 \right) \right\} + E\left\{ |\langle X_i, T_h\phi_k \rangle|^2 \left( \|X_i T_h\phi_j\|^2 + \sigma_X^2 \|T_h\phi_j\|^2 \right) \right\} := A_{i21}(\phi_j,\phi_k) + A_{i22}(\phi_j,\phi_k) + A_{i23}(\phi_j,\phi_k),

 A_{i3}(\phi_j,\phi_k) = E\left( \|X_i T_h\phi_j\|^2 + \sigma_X^2 \|T_h\phi_j\|^2 \right)\left( \|X_i T_h\phi_k\|^2 + \sigma_X^2 \|T_h\phi_k\|^2 \right) + E\left( \langle X_i T_h\phi_j, X_i T_h\phi_k \rangle + \sigma_X^2 \langle T_h\phi_j, T_h\phi_k \rangle \right)^2 := A_{i31}(\phi_j,\phi_k) + A_{i32}(\phi_j,\phi_k).

By the Cauchy–Schwarz and AM–GM inequalities,

 A_{i21}(\phi_j,\phi_k) \le 2E\left( \langle X, T_h\phi_j \rangle \langle X, T_h\phi_k \rangle \|X T_h\phi_j\| \|X T_h\phi_k\| \right) \le E\left( \langle X, T_h\phi_j \rangle^2 \|X T_h\phi_k\|^2 + \langle X, T_h\phi_k \rangle^2 \|X T_h\phi_j\|^2 \right) \le A_{i22}(\phi_j,\phi_k) + A_{i23}(\phi_j,\phi_k).

A similar argument bounds $A_{i32}$ by $A_{i31}$, and thus $A_{i2}\lesssim A_{i22}+A_{i23}$, $A_{i3}\lesssim A_{i31}$. In summary,

 E\langle\Delta\phi_j,\phi_k\rangle^2 \lesssim \left( E\langle\Delta\phi_j,\phi_k\rangle \right)^2 + \frac{1}{n}\left\{ A_{i1}(\phi_j,\phi_k) + \frac{A_{i22}(\phi_j,\phi_k)+A_{i23}(\phi_j,\phi_k)}{N} + \frac{A_{i31}(\phi_j,\phi_k)}{N^2} \right\}. (12)

The following lemmas give bounds for $\|E\Delta\|_{HS}^2$ and related fourth-moment quantities, which are useful in quantifying the bias and variance terms in equation (12); their proofs can be found in the supplementary material.

###### Lemma 4.

Under assumptions (A.1) to (A.3) and (K.1), (K.2),

 \|E\Delta\|_{HS}^2 \le \mathrm{Const.}\times \begin{cases} h^4, & 2a-2c>1, \\ h^4|\ln h|, & 2a-2c=1, \\ h^{2(2a-1)/c}, & 2a-2c<1. \end{cases}

###### Lemma 5.

Under assumptions (A.1) to (A.3), (K.1) and (K.2),

 E∣∣ ∣∣∑