# Optimal Rates for Spectral-regularized Algorithms with Least-Squares Regression over Hilbert Spaces

In this paper, we study regression problems over a separable Hilbert space with the square loss, covering non-parametric regression over a reproducing kernel Hilbert space. We investigate a class of spectral-regularized algorithms, including ridge regression, principal component analysis, and gradient methods. We prove optimal, high-probability convergence results in terms of variants of norms for the studied algorithms, considering a capacity assumption on the hypothesis space and a general source condition on the target function. Consequently, we obtain almost sure convergence results with optimal rates. Our results improve and generalize previous results, filling a theoretical gap for the non-attainable cases.


## 1 Introduction

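Let the input space $H$ be a separable Hilbert space with inner product denoted by $\langle\cdot,\cdot\rangle_H$, and let the output space be $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H\times\mathbb{R}$, $\rho_X(\cdot)$ the induced marginal measure on $H$, and $\rho(\cdot|x)$ the conditional probability measure on $\mathbb{R}$ with respect to $x\in H$ and $\rho$. Let the hypothesis space be $H_\rho=\{f:H\to\mathbb{R} \mid \exists\,\omega\in H \text{ with } f(x)=\langle\omega,x\rangle_H,\ \rho_X\text{-almost surely}\}$. The goal of least-squares regression is to approximately solve the following expected risk minimization,

$$\inf_{f\in H_\rho}\mathcal{E}(f),\qquad \mathcal{E}(f)=\int_{H\times\mathbb{R}}(f(x)-y)^2\,d\rho(x,y),\tag{1}$$

where the measure $\rho$ is known only through a sample $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^n$ of size $n$, independently and identically distributed according to $\rho$. Let $L^2_{\rho_X}$ be the Hilbert space of square integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with its norm given by $\|f\|_\rho=\left(\int_H|f(x)|^2\,d\rho_X(x)\right)^{1/2}$. The function that minimizes the expected risk over all measurable functions is the regression function [6, 27], defined as

$$f_\rho(x)=\int_{\mathbb{R}}y\,d\rho(y|x),\qquad x\in H,\ \rho_X\text{-almost every}.\tag{2}$$

Throughout this paper, we assume that the support of $\rho_X$ is compact and there exists a constant $\kappa$ such that

$$\langle x,x'\rangle_H\le\kappa^2,\qquad\forall x,x'\in H,\ \rho_X\text{-almost every}.\tag{3}$$

Under this assumption, $H_\rho$ is a subspace of $L^2_{\rho_X}$, and a solution $f_H$ for (1) is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$. See e.g., [14, 1], or Section 2 for further details.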

The above problem was raised for non-parametric regression with kernel methods [6, 27] and it is closely related to functional regression [20]. A common and classic approach for the above problem is based on spectral algorithms. It amounts to solving an empirical linear equation, where to avoid over-fitting and to ensure good performance, a filter function for regularization is involved, see e.g., [1, 10]. Such approaches include ridge regression, principal component regression, gradient methods and iterated ridge regression.

A large amount of research has been carried out for spectral algorithms within the setting of learning with kernel methods, see e.g., [26, 5] for Tikhonov regularization, [33, 31] for gradient methods, and [4, 1] for general spectral algorithms. Statistical results have been developed in these references, but they are still not satisfactory. For example, most of the previous results are restricted either to the case where the space $H_\rho$ is universally consistent (i.e., $H_\rho$ is dense in $L^2_{\rho_X}$) [26, 31, 4] or to the attainable case (i.e., $f_H\in H_\rho$) [5, 1]. Also, some of these results require an unnatural assumption that the sample size is large enough, and the derived convergence rates tend to be (capacity-dependently) suboptimal in the non-attainable cases. Finally, it is still unclear whether one can derive capacity-dependently optimal convergence rates for spectral algorithms under a general source assumption.

In this paper, we study statistical results for spectral algorithms. Considering a capacity assumption on the space $H$ [32, 5], and a general source condition [1] on the target function $f_H$, we show high-probability, optimal convergence results in terms of variants of norms for spectral algorithms. As a corollary, we obtain almost sure convergence results with optimal rates. The general source condition is used to characterize the regularity/smoothness of the target function in $L^2_{\rho_X}$, rather than in $H_\rho$ as in [5, 1]. The derived convergence rates are optimal in a minimax sense. Our results not only resolve the issues mentioned in the last paragraph but also generalize previous results to convergence results with different norms and consider a more general source condition.

## 2 Learning with Kernel Methods and Notations

In this section, we first introduce supervised learning with kernel methods, which is a special instance of the learning setting considered in this paper. We then introduce some useful notations and auxiliary operators.

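Learning with Kernel Methods. Let $\Xi$ be a closed subset of the Euclidean space $\mathbb{R}^d$. Let $\mu$ be an unknown but fixed Borel probability measure on $\Xi\times\mathbb{R}$. Assume that $\{(\xi_i,y_i)\}_{i=1}^n$ are i.i.d. from the distribution $\mu$. A reproducing kernel $K$ is a symmetric function $K:\Xi\times\Xi\to\mathbb{R}$ such that $(K(u_i,u_j))_{i,j=1}^{\ell}$ is positive semidefinite for any finite set of points $\{u_i\}_{i=1}^{\ell}$ in $\Xi$. The kernel $K$ defines a reproducing kernel Hilbert space (RKHS) $(H_K,\|\cdot\|_K)$ as the completion of the linear span of the set $\{K_\xi(\cdot):=K(\xi,\cdot):\xi\in\Xi\}$ with respect to the inner product $\langle K_\xi,K_u\rangle_K:=K(\xi,u)$. For any $f\in H_K$, the reproducing property holds: $f(\xi)=\langle K_\xi,f\rangle_K$. In learning with kernel methods, one considers the following minimization problem

$$\inf_{f\in H_K}\int_{\Xi\times\mathbb{R}}(f(\xi)-y)^2\,d\mu(\xi,y).$$

Since $f(\xi)=\langle f,K_\xi\rangle_K$ by the reproducing property, the above can be rewritten as

$$\inf_{f\in H_K}\int_{\Xi\times\mathbb{R}}(\langle f,K_\xi\rangle_K-y)^2\,d\mu(\xi,y).$$

Defining another probability measure $\rho(K_\xi,y)=\mu(\xi,y)$, the above reduces to (1).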

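Notations and Auxiliary Operators. We next introduce some notations and auxiliary operators which will be useful in the following. For a given bounded operator $A$ between two Hilbert spaces, $\|A\|$ denotes the operator norm of $A$, i.e., $\|A\|=\sup_{\|f\|=1}\|Af\|$.

Let $S_\rho:H\to L^2_{\rho_X}$ be the linear map $\omega\mapsto\langle\omega,\cdot\rangle_H$, which is bounded by $\kappa$ under Assumption (3). Furthermore, we consider the adjoint operator $S_\rho^*:L^2_{\rho_X}\to H$, the covariance operator $T:H\to H$ given by $T=S_\rho^*S_\rho$, and the operator $L:L^2_{\rho_X}\to L^2_{\rho_X}$ given by $L=S_\rho S_\rho^*$. It can be easily proved that $S_\rho^*g=\int_H x\,g(x)\,d\rho_X(x)$ and $T=\int_H\langle\cdot,x\rangle_H\,x\,d\rho_X(x)$. Under Assumption (3), the operators $T$ and $L$ can be proved to be positive trace class operators (and hence compact):

$$\|L\|=\|T\|\le\operatorname{tr}(T)=\int_H\operatorname{tr}(x\otimes x)\,d\rho_X(x)=\int_H\|x\|_H^2\,d\rho_X(x)\le\kappa^2.\tag{4}$$

For any $\omega\in H$, it is easy to prove the following isometry property [27]

$$\|S_\rho\omega\|_\rho=\|\sqrt{T}\omega\|_H.\tag{5}$$

Moreover, according to the spectral theorem,

$$\|L^{-\frac12}S_\rho\omega\|_\rho\le\|\omega\|_H.\tag{6}$$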

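We define the sampling operator $S_{\mathbf{x}}:H\to\mathbb{R}^n$ by $(S_{\mathbf{x}}\omega)_i=\langle\omega,x_i\rangle_H$, $i=1,\dots,n$, where the norm $\|\cdot\|_{\mathbb{R}^n}$ in $\mathbb{R}^n$ is the Euclidean norm times $1/\sqrt{n}$. Its adjoint operator $S_{\mathbf{x}}^*:\mathbb{R}^n\to H$, defined by $\langle S_{\mathbf{x}}^*\mathbf{y},\omega\rangle_H=\langle\mathbf{y},S_{\mathbf{x}}\omega\rangle_{\mathbb{R}^n}$ for $\mathbf{y}\in\mathbb{R}^n$, is thus given by $S_{\mathbf{x}}^*\mathbf{y}=\frac1n\sum_{i=1}^n y_ix_i$. Moreover, we can define the empirical covariance operator $T_{\mathbf{x}}:H\to H$ such that $T_{\mathbf{x}}=S_{\mathbf{x}}^*S_{\mathbf{x}}$. Obviously,

$$T_{\mathbf{x}}=\frac1n\sum_{i=1}^n\langle\cdot,x_i\rangle_H\,x_i.$$

By Assumption (3), similar to (4), we have

$$\|T_{\mathbf{x}}\|\le\operatorname{tr}(T_{\mathbf{x}})\le\kappa^2.\tag{7}$$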
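As an illustration of the empirical operators above, here is a minimal sketch for the finite-dimensional case $H=\mathbb{R}^d$. The function names, the random data, and the choice $d=3$ are assumptions made for this example only. It checks the adjoint identity and the empirical analogue of the isometry property (5), using the inner product on $\mathbb{R}^n$ scaled by $1/n$.

```python
import random

def sampling_operator(xs, w):
    """S_x w = (<w, x_i>)_{i=1..n} for inputs xs in R^d (hypothetical helper)."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]

def adjoint(xs, y):
    """S_x^* y = (1/n) * sum_i y_i x_i."""
    n, d = len(xs), len(xs[0])
    return [sum(y[i] * xs[i][j] for i in range(n)) / n for j in range(d)]

def empirical_covariance(xs, w):
    """T_x w = (1/n) * sum_i <w, x_i> x_i, computed as S_x^* S_x w."""
    return adjoint(xs, sampling_operator(xs, w))

def scaled_inner(u, v):
    """Inner product on R^n scaled by 1/n (Euclidean norm times 1/sqrt(n))."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def dot(u, v):
    """Plain inner product on H = R^d."""
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
n, d = 50, 3
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
w = [random.gauss(0, 1) for _ in range(d)]

# Adjoint identity: <S_x^* y, w>_H == <y, S_x w>_{R^n}
lhs = dot(adjoint(xs, y), w)
rhs = scaled_inner(y, sampling_operator(xs, w))

# Empirical analogue of (5): ||S_x w||^2 == <T_x w, w>_H
norm_sq = scaled_inner(sampling_operator(xs, w), sampling_operator(xs, w))
quad = dot(empirical_covariance(xs, w), w)
```

Both pairs (`lhs`, `rhs`) and (`norm_sq`, `quad`) should agree up to floating-point rounding.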

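A simple calculation shows that [6, 27] for all $f\in L^2_{\rho_X}$,

$$\mathcal{E}(f)-\mathcal{E}(f_\rho)=\|f-f_\rho\|_\rho^2.$$

Then it is easy to see that (1) is equivalent to $\inf_{f\in H_\rho}\|f-f_\rho\|_\rho^2$. Using the projection theorem, one can prove that a solution $f_H$ for the above problem is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$, and moreover, for all $f\in H_\rho$ (see e.g., [14]),

$$S_\rho^*f_\rho=S_\rho^*f_H,\tag{8}$$

and

$$\mathcal{E}(f)-\mathcal{E}(f_H)=\|f-f_H\|_\rho^2.\tag{9}$$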
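For completeness, the excess-risk identity $\mathcal{E}(f)-\mathcal{E}(f_\rho)=\|f-f_\rho\|_\rho^2$ used in this section can be derived as follows. This is a standard argument, sketched here under the assumption that all integrals are finite; the cross term vanishes because, by (2), the inner integral over $y$ is zero for $\rho_X$-almost every $x$.

```latex
\begin{align*}
\mathcal{E}(f)
 &= \int_{H\times\mathbb{R}} \big( (f(x)-f_\rho(x)) + (f_\rho(x)-y) \big)^2 \, d\rho(x,y) \\
 &= \|f-f_\rho\|_\rho^2 + \mathcal{E}(f_\rho)
    + 2\int_H (f(x)-f_\rho(x))
      \Big( \int_{\mathbb{R}} (f_\rho(x)-y)\, d\rho(y|x) \Big)\, d\rho_X(x) \\
 &= \|f-f_\rho\|_\rho^2 + \mathcal{E}(f_\rho).
\end{align*}
```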

## 3 Spectral/Regularized Algorithms

In this section, we introduce spectral algorithms.

The search for an approximate solution in $H_\rho$ for Problem (1) is equivalent to the search for an approximate solution in $H$ for

$$\inf_{\omega\in H}\tilde{\mathcal{E}}(\omega),\qquad\tilde{\mathcal{E}}(\omega)=\int_{H\times\mathbb{R}}(\langle\omega,x\rangle_H-y)^2\,d\rho(x,y).\tag{10}$$

As the expected risk $\tilde{\mathcal{E}}(\omega)$ cannot be computed exactly and can only be approximated through the empirical risk $\tilde{\mathcal{E}}_{\mathbf{z}}(\omega)$, defined as

$$\tilde{\mathcal{E}}_{\mathbf{z}}(\omega)=\frac1n\sum_{i=1}^n(\langle\omega,x_i\rangle_H-y_i)^2,$$

a first idea to deal with the problem is to replace the objective function in (10) with the empirical risk, which leads to an estimator $\hat\omega$ satisfying the empirical linear equation

$$T_{\mathbf{x}}\hat\omega=S_{\mathbf{x}}^*\mathbf{y}.$$

However, solving the empirical linear equation directly may lead to a solution that fits the sample points very well but has a large expected risk. This is known as the overfitting phenomenon in statistical learning theory. Moreover, the inverse of the empirical covariance operator $T_{\mathbf{x}}$ does not exist in general. To tackle this issue, a common approach in statistical learning theory and inverse problems is to replace $T_{\mathbf{x}}^{-1}$ with an alternative, regularized one, which leads to spectral algorithms [8, 4, 1].

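A spectral algorithm is generated by a specific choice of filter function. We recall the definition of filter functions as follows.

###### Definition 3.1 (Filter functions).

Let $\Lambda$ be a subset of $\mathbb{R}_+$. A class of functions $\{G_\lambda\}_{\lambda\in\Lambda}$ defined on $[0,\kappa^2]$ is said to be filter functions with qualification $\tau$ if there exist some positive constants $E,F_\tau<\infty$ such that

$$\sup_{\alpha\in[0,1]}\ \sup_{\lambda\in\Lambda}\ \sup_{u\in]0,\kappa^2]}|u^\alpha G_\lambda(u)|\,\lambda^{1-\alpha}\le E,\tag{11}$$

and

$$\sup_{\alpha\in[0,\tau]}\ \sup_{\lambda\in\Lambda}\ \sup_{u\in]0,\kappa^2]}|(1-G_\lambda(u)u)|\,u^\alpha\lambda^{-\alpha}\le F_\tau.\tag{12}$$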

Given a filter function $G_\lambda$, the spectral algorithm is defined as follows.

###### Algorithm 1.

Let $G_\lambda$ be a filter function indexed with $\lambda$. The spectral algorithm over the samples $\mathbf{z}$ is given by

$$\omega_\lambda^{\mathbf{z}}=G_\lambda(T_{\mathbf{x}})S_{\mathbf{x}}^*\mathbf{y},\tag{13}$$

and

$$f_\lambda^{\mathbf{z}}=S_\rho\omega_\lambda^{\mathbf{z}}.\tag{14}$$

Here, for a self-adjoint, compact operator $L$ over a separable Hilbert space $H$, $G_\lambda(L)$ is an operator on $H$ defined by spectral calculus: suppose that $\{(\sigma_i,\psi_i)\}_i$ is a set of normalized eigenpairs of $L$ with the eigenfunctions $\{\psi_i\}_i$ forming an orthonormal basis of $H$, then $G_\lambda(L)=\sum_i G_\lambda(\sigma_i)\,\psi_i\otimes\psi_i$.

Different filter functions correspond to different regularization algorithms. The following examples provide several specific choices of filter functions, which lead to different types of regularization methods, see e.g. [10, 1, 26].

###### Example 3.1 (Spectral cut-off).

Consider the spectral cut-off or truncated singular value decomposition (TSVD) defined by

$$G_\lambda(u)=\begin{cases}u^{-1},&\text{if }u\ge\lambda,\\0,&\text{if }u<\lambda.\end{cases}$$

Then the qualification $\tau$ could be any positive number and $E=F_\tau=1$.

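###### Example 3.2 (Gradient methods).

The choice $G_\lambda(u)=\sum_{k=1}^{t}\eta(1-\eta u)^{t-k}$ with $\eta\in]0,\kappa^{-2}]$, where we identify $\lambda=(\eta t)^{-1}$, corresponds to gradient methods or the Landweber iteration algorithm. The qualification $\tau$ could be any positive number, and .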
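The link between gradient methods and a filter function can be checked numerically. The sketch below assumes the standard Landweber filter $G(u)=\eta\sum_{k=0}^{t-1}(1-\eta u)^k$ with $\lambda$ identified with $(\eta t)^{-1}$, takes $H=\mathbb{R}^d$, and uses hypothetical helper names and synthetic data. It compares $t$ steps of gradient descent on the empirical risk, started from zero, with the filter applied to $S_{\mathbf{x}}^*\mathbf{y}$.

```python
import random

def mat_vec(T, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(tij * vj for tij, vj in zip(row, v)) for row in T]

def gradient_descent(T, b, eta, t):
    """t steps of w <- w - eta * (T w - b), started from w = 0."""
    w = [0.0] * len(b)
    for _ in range(t):
        Tw = mat_vec(T, w)
        w = [wi - eta * (twi - bi) for wi, twi, bi in zip(w, Tw, b)]
    return w

def landweber_filter_apply(T, b, eta, t):
    """G(T) b with G(u) = eta * sum_{k=0}^{t-1} (1 - eta*u)^k."""
    acc = [0.0] * len(b)
    v = list(b)  # v holds (I - eta*T)^k b at step k
    for _ in range(t):
        acc = [ai + eta * vi for ai, vi in zip(acc, v)]
        Tv = mat_vec(T, v)
        v = [vi - eta * tvi for vi, tvi in zip(v, Tv)]
    return acc

random.seed(1)
n, d = 40, 3
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]

# Empirical covariance T_x and the vector S_x^* y in coordinates.
T = [[sum(x[i] * x[j] for x in xs) / n for j in range(d)] for i in range(d)]
b = [sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(d)]

# Here ||x||^2 <= d, so eta = 1/d satisfies eta <= 1/kappa^2 with kappa^2 = d.
eta, t = 1.0 / d, 25
w_gd = gradient_descent(T, b, eta, t)
w_filter = landweber_filter_apply(T, b, eta, t)
```

The two vectors `w_gd` and `w_filter` coincide up to rounding, since unrolling the recursion $\omega_{k+1}=(I-\eta T_{\mathbf{x}})\omega_k+\eta S_{\mathbf{x}}^*\mathbf{y}$ from $\omega_0=0$ gives exactly the polynomial filter.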

###### Example 3.3 ((Iterated) ridge regression).

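Let $l\in\mathbb{N}$. Consider the function

$$G_\lambda(u)=\sum_{i=1}^{l}\lambda^{i-1}(\lambda+u)^{-i}=\frac1u\left(1-\frac{\lambda^l}{(\lambda+u)^l}\right).$$

It is easy to show that the qualification $\tau=l$, $E=l$ and $F_\tau=1$. In the case that $l=1$, the algorithm is ridge regression.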
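The filter conditions (11) and (12) can be probed numerically for the examples above. The following is a rough sketch rather than a proof: the suprema are estimated on a finite grid for one value of $\lambda$, and the helper names, the grid sizes, and the values $\lambda=0.05$, $\kappa^2=1$ are assumptions made for this illustration.

```python
def spectral_cutoff(lam):
    """Filter of Example 3.1: 1/u above the threshold, 0 below."""
    return lambda u: 1.0 / u if u >= lam else 0.0

def iterated_ridge(lam, l=1):
    """Filter of Example 3.3 in closed form; l = 1 is ridge regression."""
    return lambda u: (1.0 - (lam / (lam + u)) ** l) / u

def sup_conditions(G, lam, kappa2, tau, grid=400, steps=20):
    """Grid estimates of the suprema in (11) and (12) for a single lambda."""
    us = [kappa2 * (k + 1) / grid for k in range(grid)]
    alphas_11 = [i / steps for i in range(steps + 1)]        # alpha in [0, 1]
    alphas_12 = [tau * i / steps for i in range(steps + 1)]  # alpha in [0, tau]
    e_hat = max(abs(u ** a * G(u)) * lam ** (1 - a)
                for u in us for a in alphas_11)
    f_hat = max(abs(1 - G(u) * u) * u ** a * lam ** (-a)
                for u in us for a in alphas_12)
    return e_hat, f_hat

lam, kappa2 = 0.05, 1.0
e_cut, f_cut = sup_conditions(spectral_cutoff(lam), lam, kappa2, tau=2.0)
e_ridge, f_ridge = sup_conditions(iterated_ridge(lam, 1), lam, kappa2, tau=1.0)
e_iter, f_iter = sup_conditions(iterated_ridge(lam, 2), lam, kappa2, tau=2.0)
```

On this grid the estimates stay below $1$ for the cut-off and for ridge regression, and below $E=l=2$, $F_\tau=1$ for the twice-iterated ridge filter, consistent with the constants in the examples.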

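The performance of spectral algorithms can be measured in terms of the excess risk $\mathcal{E}(f_\lambda^{\mathbf{z}})-\mathcal{E}(f_H)$, which is exactly $\|f_\lambda^{\mathbf{z}}-f_H\|_\rho^2$ according to (9). Assuming that $f_H\in H_\rho$, which implies that there exists some $\omega_*\in H$ such that $f_H=S_\rho\omega_*$ (in this case, the solution with minimal $H$-norm for $f_H=S_\rho\omega$ is denoted by $\omega_H$), it can be measured in terms of the $H$-norm, $\|\omega_\lambda^{\mathbf{z}}-\omega_H\|_H$, which is closely related to $\|L^{-\frac12}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho$ according to (6). In what follows, we will measure the performance of spectral algorithms in terms of a broader class of norms, $\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho$, where $a$ is such that $L^{-a}f_H$ is well defined. Throughout this paper, we assume that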

## 4 Convergence Results

In this section, we first introduce some basic assumptions and then present convergence results for spectral algorithms.

### 4.1 Assumptions

The first assumption relates to a moment condition on the output value $y$.

###### Assumption 1.

There exist positive constants $Q$ and $M$ such that for all $l\ge 2$ with $l\in\mathbb{N}$,

$$\int_{\mathbb{R}}|y|^l\,d\rho(y|x)\le\frac12\,l!\,M^{l-2}Q^2,\tag{15}$$

$\rho_X$-almost surely.

The above assumption is very standard in statistical learning theory. It is satisfied if $y$ is bounded almost surely, or if $y=\langle\omega_*,x\rangle_H+\epsilon$, where $\epsilon$ is a Gaussian random variable with zero mean that is independent from $x$. Obviously, Assumption 1 implies that the regression function is bounded almost surely, as

$$|f_\rho(x)|\le\int_{\mathbb{R}}|y|\,d\rho(y|x)\le\left(\int_{\mathbb{R}}|y|^2\,d\rho(y|x)\right)^{\frac12}\le Q.\tag{16}$$
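To make the moment condition (15) concrete, the sketch below checks it numerically in the simplest Gaussian case, assuming pure noise $y=\epsilon\sim N(0,\sigma^2)$ (zero regression function) and the choice $M=Q=\sigma$. Both the noise-only setting and these constants are assumptions of this example, and the function names are hypothetical. It uses the closed form $\mathbb{E}|\epsilon|^l=\sigma^l\,2^{l/2}\,\Gamma((l+1)/2)/\sqrt{\pi}$.

```python
import math

def gaussian_abs_moment(l, sigma):
    """E|eps|^l for eps ~ N(0, sigma^2), via the Gamma-function closed form."""
    return sigma ** l * 2 ** (l / 2) * math.gamma((l + 1) / 2) / math.sqrt(math.pi)

def moment_bound(l, M, Q):
    """Right-hand side of (15): (1/2) * l! * M^(l-2) * Q^2."""
    return 0.5 * math.factorial(l) * M ** (l - 2) * Q ** 2

sigma = 1.5
# (l, left-hand side, right-hand side) for l = 2, ..., 20 with M = Q = sigma
checks = [(l, gaussian_abs_moment(l, sigma), moment_bound(l, sigma, sigma))
          for l in range(2, 21)]
```

For $l=2$ the two sides coincide (both equal $\sigma^2$), and for larger $l$ the factorial bound dominates with increasing slack.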

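The next assumption relates to the regularity/smoothness of the target function $f_H$. As $f_H\in\overline{\mathrm{Range}(S_\rho)}$ and $L=S_\rho S_\rho^*$, it is natural to assume a general source condition on $f_H$ as follows.

###### Assumption 2.

$f_H$ satisfies

$$\int_H(f_H(x)-f_\rho(x))^2\,x\otimes x\,d\rho_X(x)\preceq B^2T,\tag{17}$$

and the following source condition

$$f_H=\phi(L)g_0,\qquad\text{with }\|g_0\|_\rho\le R.\tag{18}$$

Here, $B,R\ge 0$ and $\phi:[0,\kappa^2]\to\mathbb{R}_+$ is a non-decreasing index function such that $\phi(0)=0$ and $\phi(\kappa^2)<\infty$. Moreover, for some $\zeta\in[0,\tau]$, $\phi(u)u^{-\zeta}$ is non-decreasing, and the qualification $\tau$ of $G_\lambda$ covers the index function $\phi$.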

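Recall that the statement "the qualification $\tau$ of $G_\lambda$ covers the index function $\phi$" is defined as follows [1].

###### Definition 4.1.

We say that the qualification $\tau$ covers the index function $\phi$ if there exists a $c>0$ such that for all $0<\lambda\le\kappa^2$,

$$c\,\frac{\lambda^\tau}{\phi(\lambda)}\le\inf_{\lambda\le u\le\kappa^2}\frac{u^\tau}{\phi(u)}.\tag{19}$$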
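As a quick worked instance of (19), consider the Hölder-type index function $\phi(u)=u^\zeta$ with $0\le\zeta\le\tau$ (an assumption chosen for this illustration). The ratio $u^\tau/\phi(u)$ is then non-decreasing, so the infimum is attained at $u=\lambda$ and the covering condition holds with $c=1$:

```latex
\begin{gather*}
\phi(u)=u^{\zeta},\quad 0\le\zeta\le\tau
 \quad\Longrightarrow\quad
 \frac{u^{\tau}}{\phi(u)}=u^{\tau-\zeta}\ \text{is non-decreasing in } u, \\
\inf_{\lambda\le u\le\kappa^{2}}\frac{u^{\tau}}{\phi(u)}
 =\lambda^{\tau-\zeta}
 =\frac{\lambda^{\tau}}{\phi(\lambda)} .
\end{gather*}
```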

Condition (17) is trivially satisfied if is bounded almost surely. Moreover, when making a consistency assumption, i.e., , as that in [26, 4, 5, 28], for kernel-based non-parametric regression, it is satisfied with Condition (18) is a more general source condition that characterizes the “regularity/smoothness” of the target function. It is trivially satisfied with as . In non-parametric regression with kernel methods, one typically considers Hölders condition (corresponding to ) [26, 5, 4] . [1, 18, 21] considers a general source condition but only with an index function , where can be decomposed as and is operator monotone with and , and is Lipschitz continuous with . In the latter case has a solution in as that [27, 22]

$$L^{\frac12}(L^2_{\rho_X})\subseteq H_\rho.\tag{20}$$

In this paper, we will consider a source assumption with respect to a more general index function, $\phi=\psi\vartheta$, where $\psi$ is operator monotone with $\psi(0)=0$ and $\psi(\kappa^2)<\infty$, and $\vartheta$ is Lipschitz continuous. Without loss of generality, we assume that the Lipschitz constant of $\vartheta$ is $1$, as one can always scale both sides of the source condition (18). Recall that the function $\psi$ is called operator monotone on $[0,b]$ if, for any pair of self-adjoint operators $U,V$ with spectra in $[0,b]$ such that $U\preceq V$, we have $\psi(U)\preceq\psi(V)$.

Finally, the last assumption relates to the capacity of the hypothesis space $H_\rho$ (induced by $H$).

###### Assumption 3.

For some $\gamma\in[0,1]$ and $c_\gamma>0$, $T$ satisfies

$$\operatorname{tr}(T(T+\lambda I)^{-1})\le c_\gamma\lambda^{-\gamma},\qquad\text{for all }\lambda>0.\tag{21}$$

The left-hand side of (21) is called the effective dimension [5], or the degrees of freedom [32]. It can be related to covering/entropy number conditions, see [27] for further details. Assumption 3 is always true for $\gamma=1$ and $c_\gamma=\kappa^2$, since $T$ is a trace class operator, which implies that the eigenvalues of $T$, denoted as $\{\sigma_i\}_i$, satisfy $\operatorname{tr}(T)=\sum_i\sigma_i\le\kappa^2$. This is referred to as the capacity independent setting. Assumption 3 with $\gamma<1$ allows one to derive better rates. It is satisfied, e.g., if the eigenvalues of $T$ satisfy a polynomial decaying condition $\sigma_i\sim i^{-1/\gamma}$, or with $\gamma=0$ if $T$ is finite rank.
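The capacity condition (21) can be illustrated on a synthetic spectrum. The sketch below assumes eigenvalues $\sigma_i=i^{-2}$, i.e. polynomial decay with $\gamma=1/2$, truncated at $20000$ terms; both choices, and the function name, are assumptions of this example. It computes the effective dimension $\sum_i\sigma_i/(\sigma_i+\lambda)$ and compares it with the capacity-independent bound $\operatorname{tr}(T)/\lambda$ and with the scaling $\lambda^{-\gamma}$.

```python
import math

def effective_dimension(eigs, lam):
    """N(lambda) = tr(T (T + lambda I)^{-1}) for a spectrum given as a list."""
    return sum(s / (s + lam) for s in eigs)

gamma = 0.5
eigs = [i ** (-1.0 / gamma) for i in range(1, 20001)]  # sigma_i = i^(-2)
trace = sum(eigs)

lams = [10.0 ** (-k) for k in range(1, 6)]
# (lambda, N(lambda), N(lambda) * lambda^gamma) for each regularization level
table = [(lam, effective_dimension(eigs, lam),
          effective_dimension(eigs, lam) * lam ** gamma) for lam in lams]
```

For this spectrum, comparing the sum with the integral $\int_0^\infty(1+\lambda x^2)^{-1}dx=\pi/(2\sqrt{\lambda})$ shows that the rescaled values in the third column stay below $\pi/2$, so (21) holds with $\gamma=1/2$ and $c_\gamma=\pi/2$, a much tighter bound than $\operatorname{tr}(T)/\lambda$ for small $\lambda$.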

### 4.2 Main Results

Now we are ready to state our main results as follows.

###### Theorem 4.2.

Under Assumptions 1, 2 and 3, let , with , and The followings hold with probability at least .
1) If is operator monotone with , and , or Lipschitz continuous with constant over , then

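$$\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho\le\lambda^{-a}\left(\frac{\tilde C_1}{n\lambda^{\frac12\vee(1-\zeta)}}+\frac{\tilde C_2}{\sqrt{n\lambda^{\gamma}}}+\tilde C_3\phi(\lambda)\right)\log\frac6\delta\left(\log\frac6\delta+\gamma(\theta^{-1}\wedge\log n)\right)^{1-a}.\tag{22}$$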

2) If , where is operator monotone with , and , and is non-decreasing, Lipschitz continuous with constant and . Furthermore, assume that the quality of covers then

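$$\begin{aligned}\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho\le{}&\lambda^{-a}\log\frac6\delta\left(\log\frac6\delta+\gamma(\theta^{-1}\wedge\log n)\right)^{1-a}\\&\times\left(\frac{\tilde C_1}{n\lambda^{\frac12\vee(1-\zeta)}}+\frac{\tilde C_4}{\sqrt{n\lambda^{\gamma}}}+\tilde C_5\phi(\lambda)+\tilde C_6\vartheta(\lambda)\psi(n^{-\frac12})\right).\end{aligned}\tag{23}$$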

Here, are positive constants depending only on and (independent from and , and given explicitly in the proof).

The above theorem provides convergence results with respect to variants of norms in high-probability for spectral algorithms. Balancing the different terms in the upper bounds, one has the following results with an optimal, data-dependent choice of regularization parameters. Throughout the rest of this paper, is denoted as a positive constant that depends only on and , and it could be different at its each appearance.

###### Corollary 4.3.

Under the assumptions and notations of Theorem 4.2, let and where The following holds with probability at least
1) Let be as in Part 1) of Theorem 4.2, then

$$\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho\le C\,\frac{\phi(\Theta^{-1}(n^{-1}))}{(\Theta^{-1}(n^{-1}))^{a}}\,\log^{2-a}\frac6\delta.\tag{24}$$

2) Let be as in Part 2) of Theorem 4.2 and , then (24) holds.

The error bounds in the above corollary are optimal as they match the minimax rates from [21] (considering only the case and ). The assumption that the qualification of covers in Part 2) of Corollary 4.3 is also implicitly required in [1, 18, 21], and it is always satisfied for principal component analysis and gradient methods. The condition will be satisfied in most cases when the index function has a Lipschitz continuous part, and moreover, it is trivially satisfied when as will be seen from the proof.

As a direct corollary of Theorem 4.2, we have the following results considering Hölder source conditions.

###### Corollary 4.4.

Under the assumptions and notations of Theorem 4.2, we let in Assumption 2 and , then with probability at least

 (25)

The error bounds in (25) are optimal as the convergence rates match the minimax rates shown in [5, 3] with . The above result asserts that spectral algorithms with an appropriate regularization parameter converge optimally.

Corollary 4.4 provides convergence results in high probability for the studied algorithms. It implies convergence in expectation and almost sure convergence, as shown in the following. Moreover, when it can be translated into convergence results with respect to norms related to

###### Corollary 4.5.

Under the assumptions of Corollary 4.4, the following holds.
1) For any we have

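$$\mathbb{E}\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho^q\le C\begin{cases}n^{-\frac{q(\zeta-a)}{2\zeta+\gamma}}&\text{if }2\zeta+\gamma>1,\\n^{-q(\zeta-a)}(1\vee\log n^{\gamma})^{q(1-a)}&\text{if }2\zeta+\gamma\le1.\end{cases}\tag{26}$$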

2) For any ,

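$$\lim_{n\to\infty}\|L^{-a}(S_\rho\omega_\lambda^{\mathbf{z}}-f_H)\|_\rho\,n^{\frac{\zeta-a-\epsilon}{1\vee(2\zeta+\gamma)}}=0,\qquad\text{almost surely}.$$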

3) If then for some almost surely, and with probability at least

$$\|T^{\frac12-a}(\omega_\lambda^{\mathbf{z}}-\omega_H)\|_H\le Cn^{-\frac{\zeta-a}{2\zeta+\gamma}}\log^{2-a}\frac6\delta.\tag{27}$$
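To get a feel for the rates in (26), the following sketch evaluates its right-hand side without the unspecified constant $C$. The function name and the sample parameter values are assumptions of this illustration.

```python
import math

def expected_rate(n, zeta, gamma, a=0.0, q=2.0):
    """Right-hand side of (26) up to the constant C (hypothetical helper)."""
    if 2 * zeta + gamma > 1:
        return n ** (-q * (zeta - a) / (2 * zeta + gamma))
    # Regime 2*zeta + gamma <= 1: extra logarithmic factor (1 v log n^gamma)^(q(1-a)).
    log_term = max(1.0, math.log(n ** gamma))
    return n ** (-q * (zeta - a)) * log_term ** (q * (1 - a))

n = 10 ** 4
rate_smooth = expected_rate(n, zeta=1.0, gamma=0.5)   # 2*zeta + gamma = 2.5 > 1
rate_rough = expected_rate(n, zeta=0.25, gamma=0.5)   # 2*zeta + gamma = 1
```

With these values, `rate_smooth` equals $n^{-0.8}$, while `rate_rough` equals $n^{-0.5}$ multiplied by the squared logarithmic factor $(\tfrac12\log n)^2$.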
###### Remark 4.6.

If then Assumption 3 is trivially satisfied with , and Assumption 2 could be satisfied with any . (Note that this is not true in general if is a general Hilbert space, and the proof for the finite-dimensional cases could be simplified, leading to some smaller constants in the error bounds.) Here denotes the smallest eigenvalue of . Thus, following from the proof of Theorem 4.2, we have that with probability at least

 ∥L−a(fzλ−fH)∥ρ≤C√cγnlog6δ(log6δlogcγ)1−a.

The proofs of all the results stated in this subsection are postponed to the next section.

### 4.3 Discussions

There is a large amount of research on theoretical results for non-parametric regression with kernel methods in the literature, see e.g., [30, 23, 29, 15, 7, 18, 25, 13] and references therein. As noted in Section 2, our results apply to non-parametric regression with kernel methods. In what follows, we will translate some of the results for kernel-based regression into results for regression over a general Hilbert space and compare our results with these results.

We first compare Corollary 4.4 with some of these results in the literature for spectral algorithms with Hölder source conditions. Making a source assumption as

$$f_\rho=L^\zeta g_0\qquad\text{with }\|g_0\|_\rho\le R,\tag{28}$$

, and with , [11] shows that with probability at least

$$\|f_\lambda^{\mathbf{z}}-f_\rho\|_\rho\le Cn^{-\frac{\zeta}{2\zeta+\gamma}}\log^4\frac1\delta.$$

Condition (28) implies that as and . Thus almost surely. (Such an assumption is satisfied if and it is supported by many function classes and reproducing kernel Hilbert spaces in learning with kernel methods [27].) In comparison, Corollary 4.4 is more general. It provides convergence results in terms of different norms for a more general Hölder source condition, allowing and Besides, it does not require the extra assumption and the derived error bound in (25) has a smaller order of dependence on For the assumption (28) with , certain results are derived in [26] for Tikhonov regularization and in [31] for gradient methods, but the rates are suboptimal and capacity-independent. Recently, [13] shows that under the assumption (28), with and , spectral algorithms have the following error bounds in expectation,

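$$\mathbb{E}\|f_\lambda^{\mathbf{z}}-f_\rho\|_\rho^2\le C\begin{cases}n^{-\frac{2\zeta}{2\zeta+\gamma}}&\text{if }2\zeta+\gamma>1,\\n^{-2\zeta}(1\vee\log n^{\gamma})&\text{if }2\zeta+\gamma\le1.\end{cases}$$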

Note also that [7] provides the same optimal error bounds as the above, but is restricted to the cases and . In comparison, Corollary 4.4 is more general. It provides convergence results with different norms and it does not require the universal consistency assumption. The derived error bound in (25) is more meaningful as it holds with high probability. However, it has an extra logarithmic factor in the upper bound for the case $2\zeta+\gamma\le1$, which is worse than that from [13]. [1, 3] study statistical results for spectral algorithms, under a Hölder source condition, with Particularly, [3] shows that if

$$n\ge C\lambda^{-2}\log^2\frac1\delta,\tag{29}$$

then with probability at least , with and ,

$$\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho\le Cn^{-\frac{\zeta-a}{2\zeta+\gamma}}\log\frac6\delta.$$

In comparison, Corollary 4.4 provides optimal convergence rates even in the case that , while it does not require the extra condition (29). Note that we do not pursue an error bound that depends both on and the noise level as those in [3, 7], but it should be easy to modify our proof to derive such error bounds (at least in the case that ). The only results by now for the non-attainable cases with a general Hölder condition with respect to (rather than ) are from [14], where convergence rates of order are derived (but only) for gradient methods assuming is large enough.

We next compare Theorem 4.2 with results from [1, 21] for spectral algorithms considering general source conditions. Assuming that with (which implies for some ,) where is as in Part 2) of Theorem 4.2, [1] shows that if the qualification of covers and (29) holds, then with probability at least

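$$\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho\le C\lambda^{-a}\left(\phi(\lambda)\sqrt{\lambda}+\frac{1}{\sqrt{\lambda n}}\right)\log\frac6\delta,\qquad a=0,\tfrac12.$$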

The error bound is capacity independent, i.e., with $\gamma=1$. Involving the capacity assumption (note that, from the proof in [21], the results from [21] also require (29)), the error bound is further improved in [21] to

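$$\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho\le C\lambda^{-a}\left(\phi(\lambda)\sqrt{\lambda}+\frac{1}{\sqrt{n\lambda^{\gamma}}}\right)\log\frac6\delta,\qquad a=0,\tfrac12.$$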

As noted in [11, Discussion], these results lead to the following estimates in expectation

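$$\mathbb{E}\|L^{-a}(f_\lambda^{\mathbf{z}}-f_H)\|_\rho^2\le C\lambda^{-2a}\left(\phi(\lambda)\sqrt{\lambda}+\frac{1}{\sqrt{n\lambda^{\gamma}}}\right)^2\log n,\qquad a=0,\tfrac12.$$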

In comparison with these results, Theorem 4.2 is more general, considering a general source assumption and covering the general case that may not be in . Furthermore, it provides convergence results with respect to a broader class of norms, and it does not require the condition (29). Finally, it leads to convergence results in expectation with a better rate (without the logarithmic factor) when the index function is , and it can infer almost-sure convergence results.

## 5 Proofs

In this section, we prove the results stated in Section 4. We first give some basic lemmas, and then give the proof of the main results.

### 5.1 Lemmas

Deterministic Estimates
We first introduce the following lemma, which is a generalization of [1, Proposition 7]. For notational simplicity, we denote

$$R_\lambda(u)=1-G_\lambda(u)u,\tag{30}$$

and

$$\mathcal{N}(\lambda)=\operatorname{tr}(T(T+\lambda)^{-1}).$$
###### Lemma 5.1.

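Let $\phi$ be a non-decreasing index function such that the qualification $\tau$ of the filter function $G_\lambda$ covers $\phi$, and suppose that for some $\zeta\in[0,\tau]$, $\phi(u)u^{-\zeta}$ is non-decreasing. Then for all $a\in[0,\zeta]$ and $0<\lambda\le\kappa^2$,

$$\sup_{0<u\le\kappa^2}|R_\lambda(u)|\,\phi(u)u^{-a}\le c_g\,\phi(\lambda)\lambda^{-a},\qquad c_g=\frac{F_\tau}{c\wedge1},\tag{31}$$

where $c$ is from Definition 4.1.

###### Proof.

When $\lambda\le u\le\kappa^2$, by (19), we have

$$\frac{\phi(u)}{u^\tau}\le\frac1c\,\frac{\phi(\lambda)}{\lambda^\tau}.$$

Thus,

$$|R_\lambda(u)|\phi(u)u^{-a}=|R_\lambda(u)|u^{\tau-a}\phi(u)u^{-\tau}\le|R_\lambda(u)|u^{\tau-a}c^{-1}\phi(\lambda)\lambda^{-\tau}\le F_\tau c^{-1}\lambda^{-a}\phi(\lambda),$$

where for the last inequality, we used (12). When $0<u\le\lambda$, since $\phi(u)u^{-\zeta}$ is non-decreasing,

$$|R_\lambda(u)|\phi(u)u^{-a}=|R_\lambda(u)|u^{\zeta-a}\phi(u)u^{-\zeta}\le|R_\lambda(u)|u^{\zeta-a}\phi(\lambda)\lambda^{-\zeta}\le F_\tau\phi(\lambda)\lambda^{-a},$$

where we used (12) for the last inequality. Combining the two cases finishes the proof. ∎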
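The bound of Lemma 5.1 can be sanity-checked numerically in a simple case. The sketch below assumes ridge regression, $G_\lambda(u)=(\lambda+u)^{-1}$ with qualification $\tau=1$ and $F_\tau=1$, together with the index function $\phi(u)=u^\zeta$, $\zeta\in[0,1]$, for which the covering constant is $c=1$. Under these assumptions the lemma's constant is $1$, and the supremum is approximated on a finite grid. The function names and parameter grids are hypothetical.

```python
def residual_ridge(lam, u):
    """R_lambda(u) = 1 - G_lambda(u) * u for G_lambda(u) = 1 / (lam + u)."""
    return lam / (lam + u)

def lemma_lhs(lam, zeta, a, kappa2=1.0, grid=2000):
    """Grid estimate of sup_u |R_lambda(u)| * phi(u) * u^(-a) with phi(u) = u^zeta."""
    us = [kappa2 * (k + 1) / grid for k in range(grid)]
    return max(abs(residual_ridge(lam, u)) * u ** zeta * u ** (-a) for u in us)

def lemma_rhs(lam, zeta, a, c_g=1.0):
    """c_g * phi(lambda) * lambda^(-a); c_g = 1 under the assumptions above."""
    return c_g * lam ** zeta * lam ** (-a)

results = []
for lam in [1e-3, 1e-2, 1e-1, 1.0]:
    for zeta in [0.25, 0.5, 1.0]:
        for a in [0.0, zeta / 2, zeta]:
            results.append((lemma_lhs(lam, zeta, a), lemma_rhs(lam, zeta, a)))
```

Every pair in `results` satisfies left-hand side $\le$ right-hand side, which matches the elementary inequality $\lambda^{1-s}u^{s}\le\lambda+u$ for $s=\zeta-a\in[0,1]$.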

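Using the above lemma, we have the following results for the deterministic vector $\omega_\lambda$, defined by

$$\omega_\lambda=G_\lambda(T)S_\rho^*f_H.\tag{32}$$

###### Lemma 5.2.

Under Assumption 2, we have, for all $a$ and $\lambda$ as in Lemma 5.1,

$$\|L^{-a}(S_\rho\omega_\lambda-f_H)\|_\rho\le c_gR\,\phi(\lambda)\lambda^{-a},\tag{33}$$

and

$$\|\omega_\lambda\|_H\le E\,\phi(\kappa^2)\,\kappa^{-(2\zeta\wedge1)}\lambda^{-(\frac12-\zeta)_+}.\tag{34}$$

The left-hand side of (33) is often called the true bias.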

###### Proof.

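Following from the definition of $\omega_\lambda$ in (32), we have

$$S_\rho\omega_\lambda-f_H=S_\rho G_\lambda(T)S_\rho^*f_H-f_H=(LG_\lambda(L)-I)f_H.$$

Introducing (18), with the notation $R_\lambda(u)=1-G_\lambda(u)u$, we get

$$\|L^{-a}(S_\rho\omega_\lambda-f_H)\|_\rho=\|L^{-a}R_\lambda(L)\phi(L)g_0\|_\rho\le\|L^{-a}R_\lambda(L)\phi(L)\|R.$$

Applying the spectral theorem with (4) and Lemma 5.1, which leads to

$$\|L^{-a}R_\lambda(L)\phi(L)\|\le\sup_{u\in[0,\kappa^2]}|R_\lambda(u)|u^{-a}\phi(u)\le c_g\phi(\lambda)\lambda^{-a},$$

one can get (33).
From the definition of $\omega_\lambda$ in (32) and applying (18), we have

$$\|\omega_\lambda\|_H=\|G_\lambda(T)S_\rho^*\phi(L)g_0\|_H\le\|G_\lambda(T)S_\rho^*\phi(L)\|R.$$

According to the spectral theorem, with (4), one has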

 ∥Gλ