# Nonparametric regression using needlet kernels for spherical data

Needlets have been recognized as state-of-the-art tools for tackling spherical data, owing to their excellent localization properties in both the spatial and frequency domains. This paper develops kernel methods associated with the needlet kernel for nonparametric regression problems whose predictor variables are defined on a sphere. Due to the localization property in the frequency domain, we prove that the regularization parameter of kernel ridge regression associated with the needlet kernel can decrease arbitrarily fast. A natural consequence is that the regularization term of kernel ridge regression is not necessary in the sense of rate optimality. Based further on the excellent localization property in the spatial domain, we also prove that all the l^q (0 < q < ∞) kernel regularization estimates associated with the needlet kernel, including the kernel lasso estimate and the kernel bridge estimate, possess almost the same generalization capability for a large range of regularization parameters in the sense of rate optimality. This finding tentatively reveals that, if the needlet kernel is utilized, the choice of q might not have a strong impact on the generalization capability in some modeling contexts. From this perspective, q can be specified arbitrarily, or merely by non-generalization criteria such as smoothness, computational complexity, and sparsity.


## 1 Introduction

Contemporary scientific investigations frequently encounter the problem of exploring the relationship between a response variable and a number of predictor variables whose domain is the surface of a sphere. Examples include the study of gravitational phenomena Freeden1998 , cosmic microwave background radiation Dolelson2003 , tectonic plate geology Chang2000 and image rendering Tsai2006 . As the sphere is topologically a compact two-point homogeneous manifold, some widely used schemes for the Euclidean space, such as neural networks Gyorfi2002 ; Scholkopf2001 , are no longer the most appropriate methods for tackling spherical data. Designing efficient approaches tailored to extracting useful information from spherical data has been a recent focus in statistical learning Downs2003 ; Marzio2014 ; Monnier2011 ; Pelletier2006 .

Recent years have witnessed a considerable number of approaches to nonparametric regression for spherical data. A classical and long-standing technique is the orthogonal series method associated with spherical harmonics Abrial2008 , whose local performance is quite poor, since spherical harmonics are not well localized but spread out all over the sphere. Another widely used technique is the stereographic projection method Downs2003 , in which statistical problems on the sphere are formulated in the Euclidean space by use of a stereographic projection. A major problem is that the stereographic projection usually leads to a distorted theoretical analysis paradigm and a relatively sophisticated statistical behavior. Localization methods, such as the Nadaraya-Watson-like estimate Pelletier2006 , the local polynomial estimate Bickel2007 and the local linear estimate Marzio2014 , are also alternative and interesting nonparametric approaches. Unfortunately, the manifold structure of the sphere is not well taken into account in these approaches. Minh Minh2006 also developed a general theory of reproducing kernel Hilbert spaces on the sphere and advocated utilizing kernel methods to tackle spherical data. However, for some popular kernels such as the Gaussian Minh2010 and polynomial Cao2013 kernels, kernel methods suffer from either a similar problem as the localization methods, or a similar drawback as the orthogonal series methods. In fact, it remains open whether there is a kernel tailored to spherical data such that both the manifold structure of the sphere and the localization requirement are sufficiently considered.

Our focus in this paper is not on developing a novel technique to cope with spherical nonparametric regression problems, but on introducing a kernel tailored to kernel methods on the sphere. In detail, we aim to find a kernel that possesses an excellent spatial localization property and makes full use of the manifold structure of the sphere. Recalling that one of the most important factors embodying the manifold structure is the special frequency domain of the sphere, a kernel that can control the frequency domain freely is preferable. Thus, the kernel we need is actually a function that possesses excellent localization properties in both the spatial and frequency domains. Under this circumstance, the needlet kernel comes into our sights. Needlets, introduced by Narcowich et al. Narcowich2006 ; Narcowich20061 , are a new kind of second-generation spherical wavelets, which can be shown to make up a tight frame with both perfect spatial and frequency localization properties. Furthermore, needlets have a clear statistical nature Baldi2008 ; Kerkyacharian2012 , the most important aspect of which is that, in Gaussian and isotropic random fields, the random spherical needlets behave asymptotically as an i.i.d. array Baldi2008 . It can be found in Narcowich2006 that the spherical needlets correspond to a needlet kernel, which is also well localized in the spatial and frequency domains. Consequently, the needlet kernel possesses the reproducing property (Narcowich2006, , Lemma 3.8), the compressible property (Narcowich2006, , Theorem 3.7) and the best approximation property (Narcowich2006, , Corollary 3.10).

The aim of the present article is to pursue the theoretical advantages of the needlet kernel in kernel methods for spherical nonparametric regression problems. If the kernel ridge regression (KRR) associated with the needlet kernel is employed, model selection boils down to determining the frequency and the regularization parameter. Due to the excellent localization in the frequency domain, we find that the regularization parameter of KRR can decrease arbitrarily fast for a suitable frequency. An extreme case is that the regularization term is not necessary for KRR in the sense of rate optimality. This attribute is totally different from kernels without a good localization property in the frequency domain Cucker2002 , such as the Gaussian Minh2010 and Abel-Poisson Freeden1998 kernels. We regard this property as the first feature of the needlet kernel. Besides good generalization capability, some real-world applications also require the estimate to possess smoothness, low computational complexity and sparsity Scholkopf2001 . This guides us to consider the l^q (0 < q < ∞) kernel regularization schemes (KRS) associated with the needlet kernel, including the kernel bridge regression and the kernel lasso estimate Wu2008 . The first feature of the needlet kernel implies that the generalization capabilities of all l^q-KRS are almost the same, provided the regularization parameter is set small enough. However, such a setting leaves no essential difference among the l^q-KRS, as each of them then behaves similarly to least squares. To distinguish the behaviors of the l^q-KRS for different q, we should establish a similar result for a large regularization parameter. With the aid of a probabilistic cubature formula and the excellent localization property of the needlet kernel in both the frequency and spatial domains, we find that all l^q-KRS can attain almost the same, almost optimal, generalization error bounds, provided the regularization parameter does not exceed a threshold depending on the number of samples m and the prediction accuracy ε. This implies that the choice of q does not have a strong impact on the generalization capability of l^q-KRS, even for relatively large regularization parameters. From this perspective, q can be specified by non-generalization criteria such as smoothness, computational complexity and sparsity. We regard this as the second feature of the needlet kernel.

The remainder of the paper is organized as follows. In the next section, the needlet kernel, together with its important properties such as the reproducing property, compressible property and best approximation property, is introduced. In Section 3, we study the generalization capability of the kernel ridge regression associated with the needlet kernel. In Section 4, we consider the generalization capability of the l^q kernel regularization schemes, including the kernel bridge regression and the kernel lasso. In Section 5, we provide the proofs of the main results. We conclude the paper with some useful remarks in the last section.

## 2 The needlet kernel

Let S^d be the unit sphere embedded into R^{d+1}. For an integer k ≥ 0, the restriction to S^d of a homogeneous harmonic polynomial of degree k is called a spherical harmonic of degree k. The class of all spherical harmonics of degree k is denoted by H_k^d, and the class of all spherical harmonics of degree at most n is denoted by Π_n^d. Of course, Π_n^d is the direct sum of the H_k^d for k ≤ n, and it comprises the restriction to S^d of all algebraic polynomials in d+1 variables of total degree not exceeding n. The dimension of H_k^d is given by

 D_k^d := \dim H_k^d = \frac{2k+d-1}{k+d-1}\binom{k+d-1}{k} \ (k \ge 1), \qquad D_0^d = 1,

and that of Π_n^d is \sum_{k=0}^n D_k^d.

The addition formula establishes a connection between the spherical harmonics of degree k and the Legendre polynomial P_k^{d+1} Freeden1998 :

 \sum_{l=1}^{D_k^d} Y_{k,l}(x)\,Y_{k,l}(x') = \frac{D_k^d}{|S^d|}\,P_k^{d+1}(x\cdot x'), (2.1)

where P_k^{d+1} is the Legendre polynomial of degree k in dimension d+1. The Legendre polynomial can be normalized so that P_k^{d+1}(1) = 1, and it satisfies the orthogonality relations

 \int_{-1}^1 P_k^{d+1}(t)\,P_j^{d+1}(t)\,(1-t^2)^{\frac{d-2}{2}}\,dt = \frac{|S^d|}{|S^{d-1}|\,D_k^d}\,\delta_{k,j},

where δ_{k,j} is the usual Kronecker symbol.

The following Funk-Hecke formula establishes a connection between spherical harmonics and a function φ defined on [−1,1] Freeden1998 :

 \int_{S^d} \phi(x\cdot x')\,H_k(x')\,d\omega(x') = B(\phi,k)\,H_k(x), (2.2)

where

 B(\phi,k) = |S^{d-1}|\int_{-1}^1 P_k^{d+1}(t)\,\phi(t)\,(1-t^2)^{\frac{d-2}{2}}\,dt.

A function η is said to be admissible Narcowich20061 if η satisfies the following condition:

 suppη⊂[0,2],η(t)=1 on [0,1], and 0≤η(t)≤1 on [1,2].

The needlet kernel Narcowich2006 is then defined to be

 K_n(x\cdot x') = \sum_{k=0}^{\infty} \eta\Big(\frac{k}{n}\Big)\,\frac{D_k^d}{|S^d|}\,P_k^{d+1}(x\cdot x'). (2.3)

The needlets can be deduced from the needlet kernel and a spherical cubature formula Brown2005 ; LeGia2008 ; Mhaskar2000 . We refer the readers to Baldi2008 ; Kerkyacharian2012 ; Narcowich2006 for a detailed description of needlets. According to the definition of the admissible function, it is easy to see that K_n possesses an excellent localization property in the frequency domain. The following Lemma 2.1, which can be found in Narcowich2006 and Brown2005 , shows that K_n also possesses a perfect spatial localization property.

###### Lemma 2.1

Let η be admissible. Then for every k > 0 and every integer r ≥ 0 there exists a constant C depending only on k, r, d and η such that

 \Big|\frac{d^r}{dt^r}K_n(\cos\theta)\Big| \le \frac{C\,n^{d+2r}}{(1+n\theta)^k}, \qquad \theta\in[0,\pi].
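As a purely illustrative numerical sketch (not part of the paper's analysis), the kernel (2.3) and its localization can be checked directly for d = 2, where D_k^2 = 2k+1, |S^2| = 4π and P_k^3 is the classical Legendre polynomial. The smooth cutoff `eta` below is one concrete admissible choice, an assumption of this sketch rather than the specific function used in the cited references.

```python
import numpy as np
from scipy.special import eval_legendre

def eta(t):
    """A smooth admissible cutoff: 1 on [0, 1], 0 for t >= 2,
    smoothly decreasing on (1, 2)."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    out[t <= 1.0] = 1.0
    mid = (t > 1.0) & (t < 2.0)
    s = t[mid] - 1.0                      # s in (0, 1) on the taper
    bump = lambda x: np.exp(-1.0 / x)     # classical C^infinity bump ingredient
    out[mid] = bump(1.0 - s) / (bump(1.0 - s) + bump(s))
    return out

def needlet_kernel(t, n):
    """K_n(t) = sum_k eta(k/n) (2k+1)/(4 pi) P_k(t) on S^2 (d = 2)."""
    t = np.asarray(t, dtype=float)
    K = np.zeros_like(t)
    for k in range(2 * n):                # eta(k/n) vanishes for k >= 2n
        K += eta(k / n) * (2 * k + 1) / (4 * np.pi) * eval_legendre(k, t)
    return K

# Spatial localization: K_n concentrates near t = 1 (i.e. theta = 0).
vals = needlet_kernel(np.cos(np.array([0.0, 0.5, 1.5])), 16)
```

Since only the k = 0 term has nonzero mean on [−1, 1], the kernel integrates to 1 over the sphere (2π∫_{−1}^{1} K_n(t) dt = 1), matching the fact that K_n reproduces constants.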

For f ∈ L^p(S^d), we write

 K_n * f(x) := \int_{S^d} K_n(x\cdot x')\,f(x')\,d\omega(x').

We also denote by E_N(f)_p the best approximation error of f ∈ L^p(S^d) (1 ≤ p ≤ ∞) from Π_N^d, i.e.

 E_N(f)_p := \inf_{P\in\Pi_N^d} \|f-P\|_{L^p(S^d)}.

Then the needlet kernel satisfies the following Lemma 2.2, which can be deduced from Narcowich2006 .

###### Lemma 2.2

K_n is a reproducing kernel for Π_n^d; that is, K_n * P = P for every P ∈ Π_n^d. Moreover, for any f ∈ L^p(S^d), we have K_n * f ∈ Π_{2n}^d, and

 \|K_n*f\|_{L^p(S^d)} \le C\,\|f\|_{L^p(S^d)}, \quad \text{and} \quad \|f-K_n*f\|_{L^p(S^d)} \le C\,E_n(f)_p,

where C is a constant depending only on d and η.

It is obvious that K_n is a positive semi-definite kernel; thus, it follows from the well-known Mercer theorem Minh2006 that K_n corresponds to a reproducing kernel Hilbert space (RKHS).

###### Lemma 2.3

Let K_n be defined as above. Then the reproducing kernel Hilbert space associated with K_n carries the inner product:

 \langle f,g\rangle_{K_n} := \sum_{k=0}^{\infty}\sum_{j=1}^{D_k^d} \eta(k/n)^{-1}\,\hat f_{k,j}\,\hat g_{k,j},

where \hat f_{k,j} := \langle f, Y_{k,j}\rangle and \hat g_{k,j} := \langle g, Y_{k,j}\rangle.

## 3 Kernel ridge regression associated with the needlet kernel

In spherical nonparametric regression problems with predictor variables X ∈ S^d and response variables Y ∈ R, we observe m i.i.d. samples z = {(x_i, y_i)}_{i=1}^m drawn from an unknown distribution ρ. Without loss of generality, it is always assumed that |Y| ≤ M almost surely, where M is a positive constant. One natural measurement of an estimate f is the generalization error,

 E(f):=∫Z(f(X)−Y)2dρ,

which is minimized by the regression function Gyorfi2002 defined by

 f_\rho(x) := \int_Y y\,d\rho(y|x).

Let L^2_{\rho_X} be the Hilbert space of ρ_X-square-integrable functions on S^d, with norm ‖·‖_ρ. It is well known that, for every f ∈ L^2_{\rho_X}, there holds

 E(f)−E(fρ)=∥f−fρ∥2ρ. (3.1)

We formulate the learning problem in terms of probability rather than expectation. To this end, we present a formal way to measure the performance of learning schemes in probability. Let Θ ⊂ L^2_{\rho_X} and let M(Θ) be the class of all Borel measures ρ such that f_ρ ∈ Θ. For each ε > 0, we enter into a competition over the class Φ_m of all estimators based on m samples by

 AC_m(\Theta,\varepsilon) := \inf_{f_z\in\Phi_m}\,\sup_{\rho\in M(\Theta)} P^m\big\{z: \|f_\rho - f_z\|_\rho^2 > \varepsilon\big\}.

As it is impossible to obtain a nontrivial convergence rate without imposing any restriction on the distribution ρ (Gyorfi2002, , Chap.3), we should introduce certain prior information. Let r > 0. Denote by W_r the Bessel-potential Sobolev class Mhaskar2010 of all f such that

 \|f\|_{W_r} := \Big\|\sum_{k=0}^{\infty} \big(k+(d-1)/2\big)^r P_k f\Big\|_2 \le 1,

where

 P_k f = \sum_{j=1}^{D_k^d} \langle f, Y_{k,j}\rangle\,Y_{k,j}.

It follows from the well-known Sobolev embedding theorem that W_r ⊂ C(S^d), provided r > d/2. In our analysis, we assume r > d/2.

The learning scheme employed in this section is the following kernel ridge regression (KRR) associated with the needlet kernel

 f_{z,\lambda} := \arg\min_{f\in H_{K_n}} \Big\{\frac{1}{m}\sum_{i=1}^m (f(x_i)-y_i)^2 + \lambda\,\|f\|_{K_n}^2\Big\}. (3.2)

Since |Y| ≤ M almost surely, it is easy to see that E(π_M f) ≤ E(f) for arbitrary f, where π_M is the truncation operator π_M f(x) := sign(f(x)) min{|f(x)|, M}. As there isn't any additional computation for employing the truncation operator, it has been used in a large number of papers; to name just a few, Cao2013 ; Devore2006 ; Gyorfi2002 ; Lin2014 ; Minh2006 ; Wu2008 ; Zhou2006 . The following Theorem 3.1 illustrates the generalization capability of KRR associated with the needlet kernel and reveals the first feature of the needlet kernel.

###### Theorem 3.1

Let f_ρ ∈ W_r with r > d/2, let 0 < ε_0 < 1 be any real number, and let the frequency n and the regularization parameter λ be chosen suitably. If f_{z,λ} is defined as in (3.2), then there exist positive constants C_1, C_2, C_3, C_4 depending only on d, r, M and ε_0, and ε^-, ε^+ satisfying

 C_1 m^{-2r/(2r+d)} \le \varepsilon^- \le \varepsilon^+ \le C_2 (m/\log m)^{-2r/(2r+d)}, (3.3)

such that for any ε ≤ ε^-,

 \sup_{f_\rho\in W_r} P^m\big\{z: \|f_\rho-\pi_M f_{z,\lambda}\|_\rho^2 > \varepsilon\big\} \ge AC_m(W_r,\varepsilon) \ge \varepsilon_0, (3.4)

and for any ε ≥ ε^+,

 e^{-C_3 m\varepsilon} \le AC_m(W_r,\varepsilon) \le \sup_{f_\rho\in W_r} P^m\big\{z: \|f_\rho-\pi_M f_{z,\lambda}\|_\rho^2 > \varepsilon\big\} \le e^{-C_4 m\varepsilon}. (3.5)

We give several remarks on Theorem 3.1 below. In some real-world applications, only m data are available, the purpose of learning is to produce an estimate with prediction error at most ε, and statisticians are required to assess the probability of success. Obviously, that probability depends heavily on m and ε. If m is too small, then no estimate can finish the learning task with the prescribed accuracy. This fact is quantitatively verified by inequality (3.4). More specifically, (3.4) shows that if the learning task is to yield an accuracy at most ε ≤ ε^-, and, apart from the prior knowledge f_ρ ∈ W_r, only m data are available, then all learning schemes, including KRR associated with the needlet kernel, may fail with high probability. To circumvent this, the only way is to acquire more samples, just as inequalities (3.5) purport to show: if the accuracy satisfies ε ≥ ε^+, then the probability of success of KRR is at least 1 − e^{−C_4 mε}. The first inequality (lower bound) of (3.5) implies that this confidence cannot be improved further. The values of ε^- and ε^+ are thus very critical, since the smallest number of samples needed to finish the learning task lies in the interval they determine. Inequalities (3.3) show that, for KRR, there holds

 [\varepsilon^-,\varepsilon^+] \subset \big[C_1 m^{-2r/(2r+d)},\; C_2 (m/\log m)^{-2r/(2r+d)}\big].

This implies that the interval [ε^-, ε^+] is almost the shortest possible, in the sense that, up to a logarithmic factor, its upper and lower bounds are asymptotically identical. Furthermore, Theorem 3.1 also presents a sharp phase-transition phenomenon of KRR: the behavior of the confidence function changes dramatically within the critical interval [ε^-, ε^+], dropping from a constant to an exponentially small quantity. All the above assertions show that the learning performance of KRR is essentially revealed in Theorem 3.1.

An interesting finding in Theorem 3.1 is that the regularization parameter of KRR can decrease arbitrarily fast, provided it is small enough. The extreme case is that least squares possesses the same generalization performance as KRR. This is not surprising in the realm of nonparametric regression, due to the needlet kernel's localization property in the frequency domain. By controlling the frequency of the needlet kernel, H_{K_n} is essentially a finite-dimensional linear space. Thus, (Gyorfi2002, , Th.3.2 & Th.11.3), together with Lemma 5.1 in the present paper, automatically yields the optimal learning rate of least squares associated with the needlet kernel in the sense of expectation. In contrast, Theorem 3.1 presents an exponential confidence estimate for KRR, which together with (3.3) makes (Gyorfi2002, , Th.11.3) a corollary of Theorem 3.1. Theorem 3.1 also shows that the only purpose of introducing the regularization term in KRR is to conquer the singularity of the kernel matrix. Under this circumstance, a small λ leads to an ill-conditioned matrix, while a large λ leads to a large approximation error. Theorem 3.1 illustrates that, if the needlet kernel is employed, then λ can be set to guarantee both a small condition number of the kernel matrix and an almost optimal generalization error bound. From (3.3), it is easy to deduce that, to attain the optimal learning rate, the minimal eigenvalue of the regularized kernel matrix can be kept bounded away from zero, which guarantees that the matrix inversion technique is suitable for solving (3.2).

## 4 l^q kernel regularization schemes associated with the needlet kernel

In the last section, we analyzed the generalization capability of KRR associated with the needlet kernel. This section studies the learning capability of the l^q kernel regularization schemes (KRS) whose hypothesis space is the sample-dependent hypothesis space Wu2008 associated with K_n,

 H_{K,z} := \Big\{\sum_{i=1}^m a_i K_n(x_i,\cdot) : a_i\in\mathbb{R}\Big\}.

The corresponding l^q-KRS is defined by

 f_{z,\lambda,q} \in \arg\min_{f\in H_{K,z}} \Big\{\frac{1}{m}\sum_{i=1}^m (f(x_i)-y_i)^2 + \lambda\,\Omega_z^q(f)\Big\}, (4.1)

where

 \Omega_z^q(f) := \inf_{(a_1,\dots,a_m)\in\mathbb{R}^m} \sum_{i=1}^m |a_i|^q, \quad \text{for } f=\sum_{i=1}^m a_i K_n(x_i,\cdot).

With different choices of the order q, (4.1) leads to various specific forms of the regularizer. q = 2 corresponds to kernel ridge regression Scholkopf2001 , which smoothly shrinks the coefficients toward zero, while q = 1 leads to the LASSO Tibshirani1995 , which sets small coefficients exactly to zero and thereby also serves as a variable selection operator. The varying forms and properties of the regularizer make the choice of the order q crucial in applications. Apparently, an optimal q may depend on many factors such as the learning algorithm, the purpose of the study and so forth. The following Theorem 4.1 shows that, if the needlet kernel is utilized in l^q-KRS, then q may not have an important impact on the generalization capability for a large range of regularization parameters, in the sense of rate optimality.
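To make (4.1) concrete for q = 1 (the kernel lasso), a minimal proximal-gradient (ISTA) sketch is given below; the taper, data and parameter values are illustrative assumptions, not those of the paper's analysis.

```python
import numpy as np
from scipy.special import eval_legendre

def needlet_gram(X, Y, n):
    """Gram matrix K_n(x_i . y_j) on S^2 with eta(t) = clip(2 - t, 0, 1)."""
    T = np.clip(X @ Y.T, -1.0, 1.0)
    K = np.zeros_like(T)
    for k in range(2 * n):
        K += np.clip(2.0 - k / n, 0.0, 1.0) * (2 * k + 1) / (4 * np.pi) \
             * eval_legendre(k, T)
    return K

def kernel_lasso(K, y, lam, iters=2000):
    """ISTA for the q = 1 scheme in (4.1):
       min_a (1/m) ||K a - y||^2 + lam * sum_i |a_i|."""
    m = len(y)
    a = np.zeros(m)
    L = (2.0 / m) * np.linalg.norm(K, 2) ** 2   # Lipschitz const. of the gradient
    for _ in range(iters):
        g = (2.0 / m) * K @ (K @ a - y)         # gradient of the data-fit term
        z = a - g / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(1)
m, n_freq, lam = 150, 6, 1e-3
X = rng.normal(size=(m, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X[:, 2] + 0.05 * rng.normal(size=m)
K = needlet_gram(X, X, n_freq)
a = kernel_lasso(K, y, lam)
obj = np.mean((K @ a - y) ** 2) + lam * np.abs(a).sum()
```

For q = 2 the same scheme reduces to the linear system of KRR, while for 0 < q < 1 the penalty is non-convex and ISTA must be replaced by, e.g., iteratively reweighted methods; Theorem 4.1 below states that these choices share the same rate-optimal generalization bound for a wide range of λ.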

Before stating the main results, we must first introduce a restriction on the marginal distribution ρ_X. Let J be the identity mapping

 J: L^2_{\rho_X} \longrightarrow L^2(S^d),

and D_{ρ_X} := ‖J‖ is called the distortion of ρ_X (with respect to the Lebesgue measure) Zhou2006 , which measures how much ρ_X distorts the Lebesgue measure.

###### Theorem 4.1

Let f_ρ ∈ W_r with r > d/2, 0 < q < ∞, let 0 < ε_0 < 1 be any real number, and let the frequency n and the regularization parameter λ be chosen suitably. If f_{z,λ,q} is defined as in (4.1), then there exist positive constants C_1, C_2, C_3, C_4 depending only on d, r, M and ε_0, and ε_m^-, ε_m^+ satisfying

 C_1 m^{-2r/(2r+d)} \le \varepsilon_m^- \le \varepsilon_m^+ \le C_2 (m/\log m)^{-2r/(2r+d)}, (4.2)

such that for any ε ≤ ε_m^-,

 \sup_{f_\rho\in W_r} P^m\big\{z: \|f_\rho-\pi_M f_{z,\lambda,q}\|_\rho^2 > \varepsilon\big\} \ge AC_m(W_r,\varepsilon) \ge \varepsilon_0, (4.3)

and for any ε ≥ ε_m^+,

 e^{-C_3 m\varepsilon} \le AC_m(W_r,\varepsilon) \le \sup_{f_\rho\in W_r} P^m\big\{z: \|f_\rho-\pi_M f_{z,\lambda,q}\|_\rho^2 > \varepsilon\big\} \le e^{-C_4 D_{\rho_X}^{-1} m\varepsilon}. (4.4)

Compared with KRR (3.2), a common consensus is that l^q-KRS (4.1) may bring certain additional benefits, such as sparsity, for a suitable choice of q. However, this assertion is not always true: it depends heavily on the value of the regularization parameter. If the regularization parameter is extremely small, then the l^q-KRS for any q behave similarly to least squares, and Theorem 4.1 then holds trivially by Theorem 3.1. To distinguish the character of l^q-KRS with different q, one should consider a relatively large regularization parameter. Theorem 4.1 shows that, for a large range of regularization parameters, all the l^q-KRS associated with the needlet kernel can attain the same, almost optimal, generalization error bound. It should be highlighted that the admissible regularization parameter here is, to the best of our knowledge, almost the largest among all existing results; we encourage the readers to compare our result with those in Lin2014 ; Shi2011 ; Tong2010 ; Wu2008 . Furthermore, we find that it is sufficient to embody the feature of the kernel regularization schemes. Taking the kernel lasso as an example, for the regularization parameter derived in Theorem 4.1, to yield a prediction accuracy ε we have

 fz,λ,1∈argminf∈HK,z{1mm∑i=1(f(xi)−yi)2+λΩ1z(f)},

and

 1mm∑i=1(f(xi)−yi)2≤ε.

According to the structural risk minimization principle, we obtain

 Ω1z(fz,λ,1)≤C.

Intuitively, the generalization capability of l^q-KRS (4.1) with a large regularization parameter may depend on the choice of q. However, from Theorem 4.1 it follows that the learning schemes defined by (4.1) can indeed achieve the same asymptotically optimal rates for all q. In other words, on the premise of embodying the feature of l^q-KRS with different q, the choice of q has no influence on the generalization capability in the sense of rate optimality. Thus, we can determine q by taking other, non-generalization considerations, such as smoothness, sparsity and computational complexity, into account. Finally, we explain the reason for this phenomenon in terms of the needlet kernel's perfect localization property in the spatial domain. Due to the localization property of K_n, we can approximate f_ρ locally by an approximant in H_{K,z} built from a few functions K_n(x_i, ·) whose centers x_i are near the point of interest. As f_ρ is bounded by M, the coefficients of these terms are also bounded. That is, we can construct in H_{K,z} a good approximant whose coefficient norm is bounded for arbitrary q. Then, using the standard error decomposition technique in Cucker2001 , which divides the generalization error into the approximation error and the sample error, the approximation error of l^q-KRS is independent of q. For the sample error, we can tune λ (which may depend on q) to offset the effect of q. A generalization error estimate independent of q is then natural.

## 5 Proofs

In this section, we present the proof of Theorem 3.1 and Theorem 4.1, respectively.

### 5.1 Proof of Theorem 3.1

For the sake of brevity, we set f_n := K_n * f_ρ. Let

 S(λ,m,n):={E(πMfz,λ)−Ez(πMfz,λ)+Ez(fn)−E(fn)}.

Then it is easy to deduce that

 E(πMfz,λ)−E(fρ)≤S(λ,m,n)+Dn(λ), (5.1)

where D_n(λ) := E(f_n) − E(f_ρ) + λ‖f_n‖²_{K_n}. If we set ξ_1(z) := (π_M f_{z,λ}(x) − y)² − (f_ρ(x) − y)² and ξ_2(z) := (f_n(x) − y)² − (f_ρ(x) − y)², then the (f_ρ(x) − y)² terms cancel, and we can rewrite the sample error as

 S(λ,m,n)={E(ξ1)−1mm∑i=1ξ1(zi)}+{1mm∑i=1ξ2(zi)−E(ξ2)}=:S1+S2. (5.2)

The aim of this subsection is to bound D_n(λ), S_1 and S_2, respectively. To bound D_n(λ), we need the following two lemmas. The first is a Jackson-type inequality that can be deduced from Mhaskar2010 ; Narcowich2006 , and the second describes the RKHS norm of f_n.

###### Lemma 5.1

Let f_ρ ∈ W_r. Then there exists a constant C depending only on d and r such that

 \|f_\rho - f_n\| \le C n^{-2r},

where \|\cdot\| denotes the uniform norm on the sphere.

###### Lemma 5.2

Let f_n = K_n * f_ρ be defined as above. Then we have

 ∥fn∥2Kn≤M2.

Proof. Due to the addition formula (2.1), we have

 K_n(x\cdot y) = \sum_{k=0}^{2n}\eta\Big(\frac{k}{n}\Big)\sum_{j=1}^{D_k^d} Y_{k,j}(x)\,Y_{k,j}(y) = \sum_{k=0}^{2n}\eta\Big(\frac{k}{n}\Big)\frac{D_k^d}{|S^d|}\,P_k^{d+1}(x\cdot y).

Since

 Kn∗f(x)=∫SdKn(x⋅y)f(y)dω(y),

it follows from the Funk-Hecke formula (2.2) that

 \widehat{K_n * f}_{u,v} = \int_{S^d} f(x')\int_{S^d} K_n(x\cdot x')\,Y_{u,v}(x)\,d\omega(x)\,d\omega(x') = \int_{S^d} |S^{d-1}|\int_{-1}^1 K_n(t)\,P_u^{d+1}(t)\,(1-t^2)^{\frac{d-2}{2}}\,dt\;Y_{u,v}(x')\,f(x')\,d\omega(x') = |S^{d-1}|\,\hat f_{u,v}\int_{-1}^1 K_n(t)\,P_u^{d+1}(t)\,(1-t^2)^{\frac{d-2}{2}}\,dt.

Moreover,

 \int_{-1}^1 K_n(t)\,P_u^{d+1}(t)\,(1-t^2)^{\frac{d-2}{2}}\,dt = \eta\Big(\frac{u}{n}\Big)\frac{D_u^d}{|S^d|}\int_{-1}^1 \big(P_u^{d+1}(t)\big)^2(1-t^2)^{\frac{d-2}{2}}\,dt = \eta\Big(\frac{u}{n}\Big)\frac{D_u^d}{|S^d|}\cdot\frac{|S^d|}{|S^{d-1}|\,D_u^d} = \eta\Big(\frac{u}{n}\Big)\frac{1}{|S^{d-1}|}.

Therefore,

 \widehat{K_n*f}_{u,v} = \eta\Big(\frac{u}{n}\Big)\hat f_{u,v}.

This implies

 \|K_n*f\|_{K_n}^2 = \sum_{u=0}^{2n}\eta\Big(\frac{u}{n}\Big)^{-1}\sum_{v=1}^{D_u^d}\big(\widehat{K_n*f}_{u,v}\big)^2 \le \sum_{u=0}^{2n}\sum_{v=1}^{D_u^d}\hat f_{u,v}^2 \le \|f\|_{L^2(S^d)}^2 \le M^2.

The proof of Lemma 5.2 is completed.

Based on the above two lemmas, it is easy to deduce an upper bound for D_n(λ).

###### Proposition 5.3

Let f_ρ ∈ W_r. There exists a positive constant C depending only on d and r such that

 D_n(\lambda) \le C n^{-2r} + M^2\lambda.

In the rest of this subsection, we bound S_1 and S_2, respectively. The approach used here is somewhat standard in learning theory. S_2 is a typical quantity that can be estimated by probability inequalities; we bound it by the following one-sided Bernstein inequality Cucker2001 .

###### Lemma 5.4

Let ξ be a random variable on a probability space Z with mean E(ξ) and variance σ²(ξ). If |ξ(z) − E(ξ)| ≤ M_ξ for almost all z ∈ Z, then, for all ε > 0,

 Pm{1mm∑i=1ξ(zi)−E(ξ)≥ε}≤exp⎧⎪ ⎪⎨⎪ ⎪⎩−mε22(σ2+13Mξε)⎫⎪ ⎪⎬⎪ ⎪⎭.

With the help of the above lemma, we can deduce the following bound for S_2.

###### Proposition 5.5

For every ε > 0, with confidence at least

 1−exp(−3mε248M2(2∥fn−fρ∥2ρ+ε))

there holds

 1mm∑i=1ξ2(zi)−E(ξ2)≤ε.

Proof. It follows from Lemma 2.2 that ‖f_n‖_∞ ≤ M, which together with |Y| ≤ M yields that

 |ξ2|≤(∥fn∥∞+M)(∥fn∥∞+M)≤4M2.

Hence |ξ_2 − E(ξ_2)| ≤ 8M². Moreover, we have

 E(\xi_2^2) = E\Big(\big((f_n(X)-f_\rho(X))\big((f_n(X)-Y)+(f_\rho(X)-Y)\big)\big)^2\Big) \le 16M^2\|f_n-f_\rho\|_\rho^2,

which implies that

 σ2(ξ2)≤E(ξ22)≤16M2∥fn−fρ∥2ρ.

Now we apply Lemma 5.4 to ξ_2. It asserts that for any t > 0,

 1mm∑i=1ξ2(zi)−E(ξ2)≤t

with confidence at least

 1−exp⎛⎜ ⎜⎝−mt22(σ2(ξ2)+83M2t)⎞⎟ ⎟⎠≥1−exp(−3mt248M2(2∥fn−fρ∥2ρ+t)).

This implies the desired estimate.

It is more difficult to estimate S_1, because ξ_1 involves the sample z through f_{z,λ}. We will use the idea of empirical risk minimization to bound this term by means of the covering number Cucker2001 . The main tools are the following lemmas.

###### Lemma 5.6

Let V_k be a k-dimensional function space defined on S^d. Denote π_M V_k := {π_M f : f ∈ V_k}. Then

 \log N(\pi_M V_k, \eta) \le c\,k\,\log\frac{M}{\eta},

where c is a positive constant and N(π_M V_k, η) is the covering number associated with the uniform norm, i.e., the number of elements in a least η-net of π_M V_k.

Lemma 5.6 is a direct consequence of combining (Maiorov1999, , Property 1) and (Maiorov2006, , P.437). It shows that the covering number of a bounded finite-dimensional function space can be bounded properly. The following ratio probability inequality is a standard result in learning theory Cucker2001 . It deals with variances for a function class, since the Bernstein inequality takes care of the variance well only for a single random variable.

###### Lemma 5.7

Let G be a set of functions on Z such that, for some c > 0, |g − E(g)| ≤ B almost everywhere and E(g²) ≤ c E(g) for each g ∈ G. Then, for every ε > 0,

 P^m\Big\{\sup_{g\in G}\frac{E(g)-\frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{E(g)+\varepsilon}} \ge \sqrt{\varepsilon}\Big\} \le N(G,\varepsilon)\exp\Big\{-\frac{m\varepsilon}{2c+\frac{2B}{3}}\Big\}.

Now we are in a position to give an upper bound for S_1.

###### Proposition 5.8

For all ε > 0, the required bound on S_1 holds with confidence at least

 1−exp{cndlog4M2ε−3mε128M2}.

Proof. Set

 F:={(f(X)−Y)2−(fρ(X)−Y)2:f∈πMHK}.

Then every g ∈ F can be written as g(z) = (π_M f(X)−Y)² − (f_ρ(X)−Y)² for some f ∈ H_K. Therefore,

 E(g)=E(πMf)−E(fρ)≥0,  1mm∑i=1g(zi)=Ez(πM(f))−Ez(fρ).

Since |π_M f(X)| ≤ M and |Y| ≤ M almost everywhere, we find that

 |g(z)|=|(πMf(X)−fρ(X))((πMf(X)−Y)+(fρ(X)−Y))|≤8M2

almost everywhere. It follows that |g − E(g)| ≤ 16M² almost everywhere and

 E(g2)≤16M2∥πMf−fρ∥2ρ=16M2E(g).

Now we apply Lemma 5.7 with B = c = 16M² to the set of functions F and obtain that

 supf∈πMHK{E(f)−E(fρ)}−{Ez(f)−Ez(fρ)}√{E(f)−E(fρ)}+ε=supg∈FE(g)−1m∑mi=1g(zi)√E(g)+ε≤√ε (5.3)

with confidence at least

 1−N(F,ε)exp{−3mε1