# Optimal rates for the regularized learning algorithms under general source condition

We consider learning algorithms under a general source condition with polynomial decay of the eigenvalues of the integral operator in the vector-valued function setting. We discuss the upper convergence rates of the Tikhonov regularizer under a general source condition corresponding to a monotone increasing index function. The convergence issues are studied for general regularization schemes by using the concept of operator monotone index functions in the minimax setting. We also address the minimum possible error for any learning algorithm.

## 1 Introduction

Learning theory [10, 16, 34] aims to learn the relation between the inputs and outputs based on finite random samples. We require some underlying space in which to search for the relation function. From experience we have some idea about the underlying space, which is called the hypothesis space. Learning algorithms try to infer the best estimator $f$ over the hypothesis space such that $f(x)$ gives the maximum information about the output variable $y$ for any unseen input $x$. The given samples $\{(x_i, y_i)\}_{i=1}^m$ are not exact, in the sense that for the underlying relation function $f$ we have $f(x_i) \neq y_i$ but $f(x_i) \approx y_i$. We assume that the uncertainty follows the probability distribution $\rho$ on the sample space $X \times Y$, and the underlying function (called the regression function) for the probability distribution $\rho$ is given by

$$f_\rho(x) = \int_Y y \, d\rho(y|x), \quad x \in X,$$

where $\rho(y|x)$ is the conditional probability measure of $y$ given $x$. The problem of obtaining an estimator from examples is ill-posed. Therefore we apply regularization schemes [4, 15, 17, 33] to stabilize the problem. Various regularization schemes are studied for inverse problems. In the context of learning theory [8, 11, 16, 22, 34], the square-loss regularization (Tikhonov regularization) is widely considered to obtain the regularized estimator [9, 11, 29, 30, 31, 32]. Rosasco et al. introduced general regularization in learning theory and provided the error bounds under Hölder's source condition. Bauer et al. discussed the convergence issues for general regularization under a general source condition by removing the Lipschitz condition on the regularization considered previously. Caponnetto et al. discussed the square-loss regularization under polynomial decay of the eigenvalues of the integral operator with Hölder's source condition.

Here we discuss the convergence issues of general regularization schemes under a general source condition and polynomial decay of the eigenvalues of the integral operator. We present the minimax upper convergence rates for Tikhonov regularization under the general source condition $f_\mathcal{H} \in \Omega_{\phi,R}$ for a monotone increasing index function $\phi$. For general regularization the minimax rates are obtained using an operator monotone index function $\phi$. The concept of effective dimension [24, 35] is exploited to achieve the convergence rates, and it plays an important role in the choice of regularization parameters. We also discuss the lower convergence rates for any learning algorithm under the smoothness conditions. We present the results in the vector-valued function setting; in particular, they can be applied to multi-task learning problems.

The structure of the paper is as follows. In the second section, we introduce some basic assumptions and notations for supervised learning problems. In Section 3, we present the upper and lower convergence rates under the smoothness conditions in minimax setting.

## 2 Learning from examples: Notations and assumptions

In the learning theory framework [8, 11, 16, 22, 34], the sample space $Z = X \times Y$ consists of two spaces: the input space $X$ (a locally compact second countable Hausdorff space) and the output space $(Y, \langle \cdot, \cdot \rangle_Y)$ (a real Hilbert space). The input space $X$ and the output space $Y$ are related by some unknown probability distribution $\rho$ on $Z$. The probability measure can be split as $\rho(x, y) = \rho(y|x)\rho_X(x)$, where $\rho(y|x)$ is the conditional probability measure of $y$ given $x$ and $\rho_X$ is the marginal probability measure on $X$. The only available information is the random i.i.d. samples $\mathbf{z} = ((x_1, y_1), \ldots, (x_m, y_m))$ drawn according to the probability measure $\rho$. Given the training set $\mathbf{z}$, learning theory aims to develop an algorithm which provides an estimator $f_{\mathbf{z}} : X \to Y$ such that $f_{\mathbf{z}}(x)$ predicts the output variable $y$ for any given input $x$. The goodness of the estimator can be measured by the generalization error of a function $f$, which is defined as

$$\mathcal{E}(f) := \mathcal{E}_\rho(f) = \int_Z V(f(x), y) \, d\rho(x, y), \tag{1}$$

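where $V$ is the loss function. The minimizer of $\mathcal{E}(f)$ for the square loss function $V(f(x), y) = \|f(x) - y\|_Y^2$ is given by

$$f_\rho(x) := \int_Y y \, d\rho(y|x), \tag{2}$$

where $f_\rho$ is called the regression function. The regression function $f_\rho$ belongs to the space of square integrable functions provided that

$$\int_Z \|y\|_Y^2 \, d\rho(x, y) < \infty. \tag{3}$$

We search for the minimizer of the generalization error over a hypothesis space $\mathcal{H}$,

$$f_\mathcal{H} := \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \int_Z \|f(x) - y\|_Y^2 \, d\rho(x, y) \right\}, \tag{4}$$

where $f_\mathcal{H}$ is called the target function. In case $f_\rho \in \mathcal{H}$, $f_\mathcal{H}$ becomes the regression function $f_\rho$.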

Because the probability distribution $\rho$ is inaccessible, we minimize the regularized empirical estimate of the generalization error over the hypothesis space $\mathcal{H}$,

$$f_{\mathbf{z},\lambda} := \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{i=1}^m \|f(x_i) - y_i\|_Y^2 + \lambda \|f\|_\mathcal{H}^2 \right\}, \tag{5}$$

where $\lambda$ is the positive regularization parameter. The regularization schemes [4, 15, 17, 22, 33] are used to incorporate various features in the solution such as boundedness, monotonicity and smoothness. In order to optimize the vector-valued regularization functional, one of the main problems is to choose the appropriate hypothesis space, which is assumed to be a source to provide the estimator.

### 2.1 Reproducing kernel Hilbert space as a hypothesis space

###### Definition 2.1.

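(Vector-valued reproducing kernel Hilbert space) For a non-empty set $X$ and a real Hilbert space $(Y, \langle \cdot, \cdot \rangle_Y)$, the Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$ of functions from $X$ to $Y$ is called a reproducing kernel Hilbert space if for any $x \in X$ and $y \in Y$ the linear functional which maps $f \in \mathcal{H}$ to $\langle y, f(x) \rangle_Y$ is continuous.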

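By Riesz lemma , for every $x \in X$ and $y \in Y$ there exists a linear operator $K_x : Y \to \mathcal{H}$ such that

$$\langle y, f(x) \rangle_Y = \langle K_x y, f \rangle_\mathcal{H}, \quad \forall f \in \mathcal{H}.$$

Therefore the adjoint operator $K_x^* : \mathcal{H} \to Y$ is given by $K_x^* f = f(x)$. Through the linear operator $K_x : Y \to \mathcal{H}$ we define the linear operator $K(x, t) : Y \to Y$,

$$K(x, t)y := (K_t y)(x).$$

From Proposition 2.1 , the linear operator $K(x, t) \in \mathcal{L}(Y)$ (the set of bounded linear operators on $Y$), and $K(x, x)$ is a nonnegative bounded linear operator. For any $m \in \mathbb{N}$, $\{x_i\}_{i=1}^m \subset X$ and $\{y_i\}_{i=1}^m \subset Y$, we have that $\sum_{i,j=1}^m \langle y_i, K(x_i, x_j) y_j \rangle_Y \geq 0$. The operator-valued function $K : X \times X \to \mathcal{L}(Y)$ is called the kernel.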

There is a one-to-one correspondence between kernels and reproducing kernel Hilbert spaces [3, 25]. So a reproducing kernel Hilbert space $\mathcal{H}$ corresponding to a kernel $K$ can be denoted as $\mathcal{H}_K$ and the norm in the space $\mathcal{H}$ can be denoted as $\|\cdot\|_{\mathcal{H}_K}$. In the following, we suppress $K$ by simply using $\mathcal{H}$ for the reproducing kernel Hilbert space and $\|\cdot\|_\mathcal{H}$ for its norm.

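Throughout the paper we assume the reproducing kernel Hilbert space $\mathcal{H}$ is separable such that

1. $K_x : Y \to \mathcal{H}$ is a Hilbert-Schmidt operator for all $x \in X$ and $\kappa := \sqrt{\sup_{x \in X} \operatorname{Tr}(K_x^* K_x)} < \infty$.

2. The real function from $X \times X$ to $\mathbb{R}$, defined by $(x, t) \mapsto \langle K_t v, K_x w \rangle_\mathcal{H}$, is measurable for all $v, w \in Y$.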

By the representation theorem , the solution of the penalized regularization problem (5) will be of the form:

$$f_{\mathbf{z},\lambda} = \sum_{i=1}^m K_{x_i} c_i, \quad \text{for } (c_1, \ldots, c_m) \in Y^m.$$

###### Definition 2.2.

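Let $\mathcal{H}$ be a separable Hilbert space and $\{e_k\}_{k=1}^\infty$ be an orthonormal basis of $\mathcal{H}$. Then for any positive operator $A \in \mathcal{L}(\mathcal{H})$ we define $\operatorname{Tr}(A) = \sum_{k=1}^\infty \langle A e_k, e_k \rangle$. It is well-known that the number $\operatorname{Tr}(A)$ is independent of the choice of the orthonormal basis.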

###### Definition 2.3.

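An operator $A \in \mathcal{L}(\mathcal{H})$ is called a Hilbert-Schmidt operator if $\operatorname{Tr}(A^* A) < \infty$. The family of all Hilbert-Schmidt operators is denoted by $\mathcal{L}_2(\mathcal{H})$. For $A \in \mathcal{L}_2(\mathcal{H})$, we define $\operatorname{Tr}(A) = \sum_{k=1}^\infty \langle A e_k, e_k \rangle$ for an orthonormal basis $\{e_k\}_{k=1}^\infty$ of $\mathcal{H}$.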

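It is well-known that $\mathcal{L}_2(\mathcal{H})$ is a separable Hilbert space with the inner product

$$\langle A, B \rangle_{\mathcal{L}_2(\mathcal{H})} = \operatorname{Tr}(B^* A)$$

and its norm satisfies

$$\|A\|_{\mathcal{L}(\mathcal{H})} \leq \|A\|_{\mathcal{L}_2(\mathcal{H})} \leq \operatorname{Tr}(|A|),$$

where $|A| = \sqrt{A^* A}$ and $\|\cdot\|_{\mathcal{L}(\mathcal{H})}$ is the operator norm (For more details see ).

For the positive trace class operator $K_x K_x^*$, we have

$$\|K_x K_x^*\|_{\mathcal{L}(\mathcal{H})} \leq \|K_x K_x^*\|_{\mathcal{L}_2(\mathcal{H})} \leq \operatorname{Tr}(K_x K_x^*) \leq \kappa^2.$$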

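Given the ordered set $\mathbf{x} = (x_1, \ldots, x_m) \in X^m$, the sampling operator $S_\mathbf{x} : \mathcal{H} \to Y^m$ is defined by $S_\mathbf{x}(f) = (f(x_1), \ldots, f(x_m))$ and its adjoint $S_\mathbf{x}^* : Y^m \to \mathcal{H}$ is given by

$$S_\mathbf{x}^* \mathbf{c} = \frac{1}{m} \sum_{i=1}^m K_{x_i} c_i, \quad \forall \mathbf{c} = (c_1, \ldots, c_m) \in Y^m.$$

The regularization scheme (5) can be expressed as

$$f_{\mathbf{z},\lambda} = \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \|S_\mathbf{x} f - \mathbf{y}\|_m^2 + \lambda \|f\|_\mathcal{H}^2 \right\}, \tag{6}$$

where $\|\mathbf{y}\|_m^2 = \frac{1}{m} \sum_{i=1}^m \|y_i\|_Y^2$.

We obtain the explicit expression of $f_{\mathbf{z},\lambda}$ by taking the functional derivative of the above expression over the RKHS $\mathcal{H}$.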

###### Theorem 2.1.

For any positive choice of $\lambda$, the functional (6) has a unique minimizer:

$$f_{\mathbf{z},\lambda} = (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} S_\mathbf{x}^* \mathbf{y}. \tag{7}$$
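As an illustration of the closed form (7), the following is a minimal sketch for the scalar-valued case $Y = \mathbb{R}$. With the representer form $f = \sum_i c_i K_{x_i}$ and the $1/m$ normalization of the empirical term in (5), the normal equations reduce to $(\mathbf{K} + m\lambda I)\mathbf{c} = \mathbf{y}$, where $\mathbf{K}$ is the Gram matrix. The Gaussian kernel, its width and the function names are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    """Illustrative scalar kernel K(x, t) = exp(-(x - t)^2 / (2 width^2)) (an assumption)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def tikhonov_fit(x, y, lam, kernel=gaussian_kernel):
    """Coefficients c of f = sum_i c_i K_{x_i}, solving (K + m*lam*I) c = y."""
    m = len(x)
    gram = kernel(x, x)
    return np.linalg.solve(gram + m * lam * np.eye(m), y)

def tikhonov_predict(x_train, coef, x_new, kernel=gaussian_kernel):
    """Evaluate the estimator at new inputs."""
    return kernel(x_new, x_train) @ coef

def objective(x, y, coef, lam, kernel=gaussian_kernel):
    """Regularized empirical risk (5) written in terms of the coefficients."""
    gram = kernel(x, x)
    residual = gram @ coef - y
    return residual @ residual / len(x) + lam * coef @ gram @ coef

# Synthetic data, purely for demonstration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
coef = tikhonov_fit(x, y, lam=1e-2)
print(objective(x, y, coef, 1e-2))
```

Solving the linear system directly avoids forming an explicit inverse; the regularization term $m\lambda I$ also keeps the system well conditioned when the Gram matrix is nearly singular.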

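Define $f_\lambda$ as the minimizer of the optimization functional,

$$f_\lambda := \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \int_Z \|f(x) - y\|_Y^2 \, d\rho(x, y) + \lambda \|f\|_\mathcal{H}^2 \right\}. \tag{8}$$

Using the fact $\mathcal{E}(f) = \|L_K^{1/2}(f - f_\mathcal{H})\|_\mathcal{H}^2 + \mathcal{E}(f_\mathcal{H})$, we get the expression of $f_\lambda$,

$$f_\lambda = (L_K + \lambda I)^{-1} L_K f_\mathcal{H}, \tag{9}$$

where the integral operator $L_K$ is a self-adjoint, non-negative, compact operator on the space of square integrable functions, defined as

$$L_K(f)(x) := \int_X K(x, t) f(t) \, d\rho_X(t), \quad x \in X.$$

The integral operator $L_K$ can also be defined as a self-adjoint operator on $\mathcal{H}$. We use the same notation $L_K$ for both the operators defined on different domains. It is well-known that $L_K^{1/2}$ is an isometry from the space of square integrable functions to the reproducing kernel Hilbert space.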

In order to achieve uniform convergence rates for learning algorithms we need some prior assumptions on the probability measure $\rho$. Following the notion of Bauer et al. and Caponnetto et al., we consider the class of probability measures which satisfies the assumptions:

1. For the probability measure $\rho$ on $X \times Y$,

   $$\int_Z \|y\|_Y^2 \, d\rho(x, y) < \infty. \tag{10}$$

2. The minimizer of the generalization error $f_\mathcal{H}$ (4) over the hypothesis space $\mathcal{H}$ exists.

3. There exist some constants $M, \Sigma$ such that for almost all $x \in X$,

   $$\int_Y \left( e^{\|y - f_\mathcal{H}(x)\|_Y / M} - \frac{\|y - f_\mathcal{H}(x)\|_Y}{M} - 1 \right) d\rho(y|x) \leq \frac{\Sigma^2}{2M^2}. \tag{11}$$

4. The target function $f_\mathcal{H}$ belongs to the class $\Omega_{\phi,R}$ with

   $$\Omega_{\phi,R} := \left\{ f \in \mathcal{H} : f = \phi(L_K) g \text{ and } \|g\|_\mathcal{H} \leq R \right\}, \tag{12}$$

   where $\phi$ is a continuous increasing index function defined on the interval $[0, \kappa^2]$ with the assumption $\phi(0) = 0$. This condition is usually referred to as the general source condition .

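In addition, we consider the set of probability measures which satisfies the conditions (i), (ii), (iii), (iv) and for which the eigenvalues $t_n$ of the integral operator $L_K$ follow the polynomial decay: for fixed positive constants $\alpha, \beta$ and $b > 1$,

$$\alpha n^{-b} \leq t_n \leq \beta n^{-b} \quad \forall n \in \mathbb{N}. \tag{13}$$

Under the polynomial decay of the eigenvalues, the effective dimension $\mathcal{N}(\lambda)$, which measures the complexity of the RKHS, can be estimated from Proposition 3  as follows,

$$\mathcal{N}(\lambda) := \operatorname{Tr}\left((L_K + \lambda I)^{-1} L_K\right) \leq \frac{\beta b}{b - 1} \lambda^{-1/b}, \quad \text{for } b > 1, \tag{14}$$

and without the polynomial decay condition (13), we have

$$\mathcal{N}(\lambda) \leq \|(L_K + \lambda I)^{-1}\|_{\mathcal{L}(\mathcal{H})} \operatorname{Tr}(L_K) \leq \frac{\kappa^2}{\lambda}.$$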
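The two estimates of the effective dimension can be compared numerically. The sketch below is an illustration only: it assumes the eigenvalues sit exactly on the upper envelope $t_n = \beta n^{-b}$ with the arbitrary choices $\beta = 1$, $b = 2$, and it truncates the spectrum, so the computed trace is a lower approximation of $\mathcal{N}(\lambda)$. The function names are hypothetical.

```python
def effective_dimension(eigenvalues, lam):
    """Tr((L_K + lam I)^{-1} L_K) computed from a (truncated) list of eigenvalues."""
    return sum(t / (t + lam) for t in eigenvalues)

def polynomial_decay_bound(beta, b, lam):
    """Right-hand side of (14); requires b > 1."""
    return beta * b / (b - 1.0) * lam ** (-1.0 / b)

beta, b = 1.0, 2.0  # illustrative constants, not taken from the paper
eigenvalues = [beta * n ** (-b) for n in range(1, 200_001)]
trace = sum(eigenvalues)  # truncated Tr(L_K)

for lam in (1e-1, 1e-2, 1e-3, 1e-4):
    n_lam = effective_dimension(eigenvalues, lam)
    print(f"lam={lam:g}  N(lam)~{n_lam:.2f}  "
          f"bound (14)={polynomial_decay_bound(beta, b, lam):.2f}  "
          f"Tr(L_K)/lam={trace / lam:.2f}")
```

For small $\lambda$ the estimate (14) grows like $\lambda^{-1/b}$, which is much slower than the $\lambda^{-1}$ growth of the trace-based estimate; this gap is what the polynomial decay assumption buys in the rates.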

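We discuss the convergence issues for the learning algorithms ($\mathbf{z} \to f_\mathbf{z} \in \mathcal{H}$) in the probabilistic sense by exponential tail inequalities such that

$$\operatorname{Prob}_\mathbf{z} \left\{ \|f_\mathbf{z} - f_\rho\|_\rho \leq \varepsilon(m) \log\left(\frac{1}{\eta}\right) \right\} \geq 1 - \eta$$

for all $0 < \eta \leq 1$, where $\varepsilon(m)$ is a positive decreasing function of $m$. Using these probabilistic estimates we can obtain error estimates in expectation by integration of tail inequalities:

$$E_\mathbf{z}\left(\|f_\mathbf{z} - f_\rho\|_\rho\right) = \int_0^\infty \operatorname{Prob}_\mathbf{z}\left(\|f_\mathbf{z} - f_\rho\|_\rho > t\right) dt \leq \int_0^\infty \exp\left(-\frac{t}{\varepsilon(m)}\right) dt = \varepsilon(m),$$

where $\|f\|_\rho = \left\{ \int_X \|f(x)\|_Y^2 \, d\rho_X(x) \right\}^{1/2}$ and $E_\mathbf{z}(\xi) = \int_{Z^m} \xi \, d\rho(z_1) \ldots d\rho(z_m)$.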
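The passage from the tail inequality to the bound in expectation rests on a change of variable. As a short worked sketch, assuming the tail inequality holds for every $0 < \eta \leq 1$, set $t = \varepsilon(m) \log(1/\eta)$, so that $\eta = \exp(-t/\varepsilon(m))$ and $t$ ranges over $[0, \infty)$:

$$\begin{aligned}
\operatorname{Prob}_\mathbf{z}\left(\|f_\mathbf{z} - f_\rho\|_\rho > t\right) &\leq \exp\left(-\frac{t}{\varepsilon(m)}\right) \quad \forall t \geq 0, \\
\int_0^\infty \exp\left(-\frac{t}{\varepsilon(m)}\right) dt &= \left[ -\varepsilon(m) \exp\left(-\frac{t}{\varepsilon(m)}\right) \right]_0^\infty = \varepsilon(m).
\end{aligned}$$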

## 3 Convergence analysis

In this section, we analyze the convergence issues of the learning algorithms on reproducing kernel Hilbert space under the smoothness priors in the supervised learning framework. We discuss the upper and lower convergence rates for vector-valued estimators in the standard minimax setting. Therefore the estimates can be utilized particularly for scalar-valued functions and multi-task learning algorithms.

### 3.1 Upper rates for Tikhonov regularization

In general, we consider Tikhonov regularization in learning theory. Tikhonov regularization is briefly discussed in the literature [11, 13, 22, 33]. The error estimates for Tikhonov regularization are discussed theoretically under Hölder's source condition [9, 31, 32]. We establish the error estimates for the Tikhonov regularization scheme under the general source condition $f_\mathcal{H} \in \Omega_{\phi,R}$ for some continuous increasing index function $\phi$ and the polynomial decay of the eigenvalues of the integral operator $L_K$.

In order to estimate the error bounds, we consider the following inequality used in the papers [4, 9], which is based on the results of Pinelis and Sakhanenko.

###### Proposition 3.1.

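Let $\xi$ be a random variable on the probability space $(\Omega, \mathcal{B}, P)$ with values in a real separable Hilbert space $\mathcal{H}$. If there exist two constants $Q$ and $S$ satisfying

$$E\left\{ \|\xi - E(\xi)\|_\mathcal{H}^n \right\} \leq \frac{1}{2} n! \, S^2 Q^{n-2} \quad \forall n \geq 2, \tag{15}$$

then for any $0 < \eta < 1$ and for all $m \in \mathbb{N}$,

$$\operatorname{Prob}\left\{ (\omega_1, \ldots, \omega_m) \in \Omega^m : \left\| \frac{1}{m} \sum_{i=1}^m \left[ \xi(\omega_i) - E(\xi(\omega_i)) \right] \right\|_\mathcal{H} \leq 2 \left( \frac{Q}{m} + \frac{S}{\sqrt{m}} \right) \log\left(\frac{2}{\eta}\right) \right\} \geq 1 - \eta.$$

In particular, the inequality (15) holds if

$$\|\xi(\omega)\|_\mathcal{H} \leq Q \quad \text{and} \quad E\left(\|\xi(\omega)\|_\mathcal{H}^2\right) \leq S^2.$$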
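The concentration inequality of Proposition 3.1 can be probed with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the Hilbert space is replaced by $\mathbb{R}^d$, and $\xi$ is drawn uniformly from the unit sphere, so that $E(\xi) = 0$, $\|\xi\| \leq Q = 1$ and $E\|\xi\|^2 \leq S^2 = 1$. The sample sizes and function names are arbitrary choices for the example.

```python
import numpy as np

def deviation_bound(m, Q, S, eta):
    """Right-hand side of the tail bound: 2 (Q/m + S/sqrt(m)) log(2/eta)."""
    return 2.0 * (Q / m + S / np.sqrt(m)) * np.log(2.0 / eta)

def empirical_coverage(m, dim, eta, trials, rng):
    """Fraction of trials in which the sample mean of xi stays within the bound."""
    g = rng.standard_normal((trials, m, dim))
    xi = g / np.linalg.norm(g, axis=2, keepdims=True)      # uniform on the unit sphere
    deviation = np.linalg.norm(xi.mean(axis=1), axis=1)    # || (1/m) sum xi_i - E xi ||
    return float(np.mean(deviation <= deviation_bound(m, 1.0, 1.0, eta)))

rng = np.random.default_rng(0)
coverage = empirical_coverage(m=200, dim=5, eta=0.05, trials=2000, rng=rng)
print(coverage)  # should be at least 1 - eta if the bound holds
```

The bound is not tight for such a well-behaved variable, so the observed coverage is expected to exceed $1 - \eta$ by a wide margin.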

We estimate the error bounds for the regularized estimators by measuring the effect of random sampling and the complexity of $f_\mathcal{H}$. The quantities described in Proposition 3.2 express the probabilistic estimates of the perturbation measure due to random sampling. The expressions of Proposition 3.3 describe the complexity of the target function $f_\mathcal{H}$, which are usually referred to as the approximation errors. The approximation errors are independent of the samples $\mathbf{z}$.

###### Proposition 3.2.

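Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho$ satisfying the assumptions (10), (11) and $\kappa = \sqrt{\sup_{x \in X} \operatorname{Tr}(K_x^* K_x)} < \infty$. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, we have

$$\left\| (L_K + \lambda I)^{-1/2} \left\{ S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\mathcal{H} \right\} \right\|_\mathcal{H} \leq 2 \left( \frac{\kappa M}{m \sqrt{\lambda}} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda)}{m}} \right) \log\left(\frac{4}{\eta}\right) \tag{16}$$

and

$$\|S_\mathbf{x}^* S_\mathbf{x} - L_K\|_{\mathcal{L}(\mathcal{H})} \leq 2 \left( \frac{\kappa^2}{m} + \frac{\kappa^2}{\sqrt{m}} \right) \log\left(\frac{4}{\eta}\right). \tag{17}$$

###### Proof.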

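To estimate the first expression, we consider the random variable $\xi_1(z) = (L_K + \lambda I)^{-1/2} K_x (y - f_\mathcal{H}(x))$ from $(Z, \rho)$ to the reproducing kernel Hilbert space $\mathcal{H}$ with

$$E_z(\xi_1) = \int_Z (L_K + \lambda I)^{-1/2} K_x (y - f_\mathcal{H}(x)) \, d\rho(x, y) = 0,$$

$$\frac{1}{m} \sum_{i=1}^m \xi_1(z_i) = (L_K + \lambda I)^{-1/2} \left( S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\mathcal{H} \right)$$

and

$$\begin{aligned}
E_z\left(\|\xi_1 - E_z(\xi_1)\|_\mathcal{H}^n\right) &= E_z\left(\|(L_K + \lambda I)^{-1/2} K_x (y - f_\mathcal{H}(x))\|_\mathcal{H}^n\right) \\
&\leq E_z\left(\|K_x^* (L_K + \lambda I)^{-1} K_x\|_{\mathcal{L}(Y)}^{n/2} \, \|y - f_\mathcal{H}(x)\|_Y^n\right) \\
&\leq E_x\left(\operatorname{Tr}\left((L_K + \lambda I)^{-1} K_x K_x^*\right) \|K_x^* (L_K + \lambda I)^{-1} K_x\|_{\mathcal{L}(Y)}^{\frac{n}{2} - 1} \, E_y\left(\|y - f_\mathcal{H}(x)\|_Y^n\right)\right).
\end{aligned}$$

Under the assumption (11) we get

$$E_z\left(\|\xi_1 - E_z(\xi_1)\|_\mathcal{H}^n\right) \leq \frac{n!}{2} \left( \Sigma \sqrt{\mathcal{N}(\lambda)} \right)^2 \left( \frac{\kappa M}{\sqrt{\lambda}} \right)^{n-2}, \quad \forall n \geq 2.$$

On applying Proposition 3.1 we conclude that

$$\left\| (L_K + \lambda I)^{-1/2} \left\{ S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\mathcal{H} \right\} \right\|_\mathcal{H} \leq 2 \left( \frac{\kappa M}{m \sqrt{\lambda}} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda)}{m}} \right) \log\left(\frac{4}{\eta}\right)$$

with confidence $1 - \eta/2$.

The second expression can be estimated similarly by considering the random variable $\xi_2(x) = K_x K_x^*$ from $(X, \rho_X)$ to $\mathcal{L}_2(\mathcal{H})$. The proof can also be found in De Vito et al. ∎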

###### Proposition 3.3.

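Suppose $f_\mathcal{H} \in \Omega_{\phi,R}$. Then,

1. Under the assumption that $\phi(t)\sqrt{t}$ and $\sqrt{t}/\phi(t)$ are nondecreasing functions, we have

   $$\|f_\lambda - f_\mathcal{H}\|_\rho \leq R \phi(\lambda) \sqrt{\lambda}. \tag{18}$$

2. Under the assumption that $\phi(t)$ and $t/\phi(t)$ are nondecreasing functions, we have

   $$\|f_\lambda - f_\mathcal{H}\|_\rho \leq R \kappa \phi(\lambda) \tag{19}$$

   and

   $$\|f_\lambda - f_\mathcal{H}\|_\mathcal{H} \leq R \phi(\lambda). \tag{20}$$

###### Proof.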

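The hypothesis $f_\mathcal{H} \in \Omega_{\phi,R}$ implies $f_\mathcal{H} = \phi(L_K) g$ for some $g \in \mathcal{H}$ with $\|g\|_\mathcal{H} \leq R$. To estimate the approximation error bounds, we consider

$$f_\lambda - f_\mathcal{H} = \left\{ (L_K + \lambda I)^{-1} L_K - I \right\} \phi(L_K) g.$$

Therefore,

$$\|f_\lambda - f_\mathcal{H}\|_\rho \leq \left\| L_K^{1/2} \left\{ (L_K + \lambda I)^{-1} L_K - I \right\} \phi(L_K) \right\|_{\mathcal{L}(\mathcal{H})} \|g\|_\mathcal{H}.$$

Using the functional calculus we get

$$\|f_\lambda - f_\mathcal{H}\|_\rho \leq R \sup_{\alpha \in (0, \kappa^2]} \left| 1 - (\alpha + \lambda)^{-1} \alpha \right| \phi(\alpha) \sqrt{\alpha}.$$

Then under the assumptions on $\phi$ described in (i), we obtain

$$\|f_\lambda - f_\mathcal{H}\|_\rho \leq R \phi(\lambda) \sqrt{\lambda}$$

and under the assumptions on $\phi$ described in (ii), we have

$$\|f_\lambda - f_\mathcal{H}\|_\rho \leq R \kappa \phi(\lambda).$$

In the same manner, with the assumptions on $\phi$ described in (ii), we get

$$\|f_\lambda - f_\mathcal{H}\|_\mathcal{H} = \left\| \left\{ (L_K + \lambda I)^{-1} L_K - I \right\} \phi(L_K) g \right\|_\mathcal{H} \leq R \sup_{\alpha \in (0, \kappa^2]} \left| 1 - (\alpha + \lambda)^{-1} \alpha \right| \phi(\alpha) \leq R \phi(\lambda).$$

Hence we achieve the required estimates. ∎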
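The suprema appearing in this proof can be checked on a grid. The sketch below is an illustration under assumptions not made in the paper: it takes the Hölder-type index function $\phi(t) = t^r$ with $r \leq 1/2$ (for which $\phi(t)$, $t/\phi(t)$, $\phi(t)\sqrt{t}$ and $\sqrt{t}/\phi(t)$ are all nondecreasing), sets $\kappa^2 = 1$, and uses $|1 - (\alpha + \lambda)^{-1}\alpha| = \lambda/(\alpha + \lambda)$. The function name is hypothetical.

```python
def qualification_sup(phi, lam, kappa_sq, weight=lambda a: 1.0, grid=20_000):
    """Grid approximation of sup over (0, kappa^2] of lam/(a + lam) * phi(a) * weight(a)."""
    best = 0.0
    for k in range(1, grid + 1):
        a = kappa_sq * k / grid
        best = max(best, lam / (a + lam) * phi(a) * weight(a))
    return best

kappa_sq = 1.0  # illustrative value
for r in (0.25, 0.5):
    phi = lambda t, r=r: t ** r
    for lam in (1e-1, 1e-2, 1e-3):
        sup_rho = qualification_sup(phi, lam, kappa_sq, weight=lambda a: a ** 0.5)
        sup_h = qualification_sup(phi, lam, kappa_sq)
        print(f"r={r} lam={lam:g}  "
              f"sup_rho={sup_rho:.3e} <= {phi(lam) * lam ** 0.5:.3e}  "
              f"sup_H={sup_h:.3e} <= {phi(lam):.3e}")
```

Because the grid maximum never exceeds the true supremum, a grid value above $\phi(\lambda)\sqrt{\lambda}$ or $\phi(\lambda)$ would contradict the estimates; for the choices above no such violation should appear.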

###### Theorem 3.1.

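Let $\mathbf{z}$ be i.i.d. samples drawn according to a probability measure $\rho$ satisfying the assumptions (i)-(iv), where $\phi$ is the index function satisfying the conditions that $\phi(t)$, $t/\phi(t)$ are nondecreasing functions. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, for the regularized estimator $f_{\mathbf{z},\lambda}$ (7) the following upper bound holds:

$$\|f_{\mathbf{z},\lambda} - f_\mathcal{H}\|_\mathcal{H} \leq 2 \left\{ R \phi(\lambda) + \frac{2 \kappa M}{m \lambda} + \sqrt{\frac{4 \Sigma^2 \mathcal{N}(\lambda)}{m \lambda}} \right\} \log\left(\frac{4}{\eta}\right)$$

provided that

$$\sqrt{m} \lambda \geq 8 \kappa^2 \log(4/\eta). \tag{21}$$

###### Proof.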

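The error of the regularized solution $f_{\mathbf{z},\lambda}$ can be estimated in terms of the sample error and the approximation error as follows:

$$\|f_{\mathbf{z},\lambda} - f_\mathcal{H}\|_\mathcal{H} \leq \|f_{\mathbf{z},\lambda} - f_\lambda\|_\mathcal{H} + \|f_\lambda - f_\mathcal{H}\|_\mathcal{H}. \tag{22}$$

Now $f_{\mathbf{z},\lambda} - f_\lambda$ can be expressed as

$$f_{\mathbf{z},\lambda} - f_\lambda = (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} \left\{ S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\lambda - \lambda f_\lambda \right\}.$$

Then $f_\lambda = (L_K + \lambda I)^{-1} L_K f_\mathcal{H}$ implies

$$L_K f_\mathcal{H} = L_K f_\lambda + \lambda f_\lambda.$$

Therefore,

$$f_{\mathbf{z},\lambda} - f_\lambda = (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} \left\{ S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\lambda - L_K (f_\mathcal{H} - f_\lambda) \right\}.$$

Employing the RKHS-norm we get

$$\begin{aligned}
\|f_{\mathbf{z},\lambda} - f_\lambda\|_\mathcal{H} &\leq \left\| (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} \left\{ S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\mathcal{H} + (S_\mathbf{x}^* S_\mathbf{x} - L_K)(f_\mathcal{H} - f_\lambda) \right\} \right\|_\mathcal{H} \\
&\leq I_1 I_2 + I_3 \|f_\lambda - f_\mathcal{H}\|_\mathcal{H} / \lambda,
\end{aligned}$$

where $I_1 = \|(S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} (L_K + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})}$, $I_2 = \|(L_K + \lambda I)^{-1/2} (S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\mathcal{H})\|_\mathcal{H}$ and $I_3 = \|S_\mathbf{x}^* S_\mathbf{x} - L_K\|_{\mathcal{L}(\mathcal{H})}$.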

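The estimates of $I_2$, $I_3$ can be obtained from Proposition 3.2 and the only task is to bound $I_1$. For this we consider

$$(S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} (L_K + \lambda I)^{1/2} = \left\{ I - (L_K + \lambda I)^{-1} (L_K - S_\mathbf{x}^* S_\mathbf{x}) \right\}^{-1} (L_K + \lambda I)^{-1/2},$$

which implies

$$I_1 \leq \sum_{n=0}^\infty \left\| (L_K + \lambda I)^{-1} (L_K - S_\mathbf{x}^* S_\mathbf{x}) \right\|_{\mathcal{L}(\mathcal{H})}^n \left\| (L_K + \lambda I)^{-1/2} \right\|_{\mathcal{L}(\mathcal{H})} \tag{24}$$

provided that $\|(L_K + \lambda I)^{-1} (L_K - S_\mathbf{x}^* S_\mathbf{x})\|_{\mathcal{L}(\mathcal{H})} < 1$. To verify this condition, we consider

$$\left\| (L_K + \lambda I)^{-1} (S_\mathbf{x}^* S_\mathbf{x} - L_K) \right\|_{\mathcal{L}(\mathcal{H})} \leq I_3 / \lambda.$$

Now using Proposition 3.2 we get with confidence $1 - \eta/2$,

$$\left\| (L_K + \lambda I)^{-1} (S_\mathbf{x}^* S_\mathbf{x} - L_K) \right\|_{\mathcal{L}(\mathcal{H})} \leq \frac{4 \kappa^2}{\sqrt{m} \lambda} \log\left(\frac{4}{\eta}\right).$$

From the condition (21) we get with confidence $1 - \eta/2$,

$$\left\| (L_K + \lambda I)^{-1} (S_\mathbf{x}^* S_\mathbf{x} - L_K) \right\|_{\mathcal{L}(\mathcal{H})} \leq \frac{1}{2}. \tag{25}$$

Consequently, using (25) in the inequality (24) we obtain with probability $1 - \eta/2$,

$$I_1 = \left\| (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} (L_K + \lambda I)^{1/2} \right\|_{\mathcal{L}(\mathcal{H})} \leq 2 \left\| (L_K + \lambda I)^{-1/2} \right\|_{\mathcal{L}(\mathcal{H})} \leq \frac{2}{\sqrt{\lambda}}. \tag{26}$$

From Proposition 3.2 we have with confidence $1 - \eta/2$,

$$\|S_\mathbf{x}^* S_\mathbf{x} - L_K\|_{\mathcal{L}(\mathcal{H})} \leq 2 \left( \frac{\kappa^2}{m} + \frac{\kappa^2}{\sqrt{m}} \right) \log\left(\frac{4}{\eta}\right).$$

Again from the condition (21) we get with probability $1 - \eta/2$,

$$I_3 = \|S_\mathbf{x}^* S_\mathbf{x} - L_K\|_{\mathcal{L}(\mathcal{H})} \leq \frac{\lambda}{2}. \tag{27}$$

Therefore, the error decomposition (22) and the bound on $\|f_{\mathbf{z},\lambda} - f_\lambda\|_\mathcal{H}$ above, together with (16), (20), (26), (27), provide the desired bound. ∎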
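The Neumann series step behind (24)-(26) can be illustrated in finite dimensions: when an operator $A$ has norm at most $1/2$, the series $\sum_n A^n$ converges to $(I - A)^{-1}$ and its norm is at most $2$. The sketch below uses a random $6 \times 6$ matrix as a stand-in for $(L_K + \lambda I)^{-1}(L_K - S_\mathbf{x}^* S_\mathbf{x})$; the matrix, its size and the function name are assumptions made for the example.

```python
import numpy as np

def neumann_inverse(a, terms=60):
    """Partial sum of the Neumann series sum_n a^n, approximating (I - a)^{-1}."""
    identity = np.eye(a.shape[0])
    total, power = identity.copy(), identity.copy()
    for _ in range(terms):
        power = power @ a
        total = total + power
    return total

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
a *= 0.5 / np.linalg.norm(a, 2)          # rescale so that the operator norm equals 1/2
exact = np.linalg.inv(np.eye(6) - a)

print(np.allclose(neumann_inverse(a), exact))  # series agrees with the direct inverse
print(np.linalg.norm(exact, 2))                # operator norm, bounded by 1 / (1 - 1/2)
```

This is the mechanism by which condition (21) turns the perturbation bound (25) into the factor $2$ in (26).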

###### Theorem 3.2.

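Let $\mathbf{z}$ be i.i.d. samples drawn according to a probability measure $\rho$ satisfying the assumptions (i)-(iv), and let $f_{\mathbf{z},\lambda}$ be the regularized solution (7) corresponding to Tikhonov regularization. Then for all $0 < \eta < 1$, with confidence $1 - \eta$, the following upper bounds hold:

1. Under the assumption that $\phi(t)$, $\sqrt{t}/\phi(t)$ are nondecreasing functions,

   $$\|f_{\mathbf{z},\lambda} - f_\mathcal{H}\|_\rho \leq 2 \left\{ R \phi(\lambda) \sqrt{\lambda} + \frac{2 \kappa M}{m \sqrt{\lambda}} + \sqrt{\frac{4 \Sigma^2 \mathcal{N}(\lambda)}{m}} \right\} \log\left(\frac{4}{\eta}\right)$$

2. Under the assumption that $\phi(t)$, $t/\phi(t)$ are nondecreasing functions,

   $$\|f_{\mathbf{z},\lambda} - f_\mathcal{H}\|_\rho \leq \left\{ R (\kappa + \sqrt{\lambda}) \phi(\lambda) + \frac{4 \kappa M}{m \sqrt{\lambda}} + \sqrt{\frac{16 \Sigma^2 \mathcal{N}(\lambda)}{m}} \right\} \log\left(\frac{4}{\eta}\right)$$

provided that

$$\sqrt{m} \lambda \geq 8 \kappa^2 \log(4/\eta). \tag{28}$$

###### Proof.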

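In order to establish the error bounds of $f_{\mathbf{z},\lambda} - f_\mathcal{H}$ in the $\|\cdot\|_\rho$-norm, we first estimate $f_{\mathbf{z},\lambda} - f_\lambda$ in the $\|\cdot\|_\rho$-norm:

$$f_{\mathbf{z},\lambda} - f_\lambda = (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} \left\{ S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\lambda - L_K (f_\mathcal{H} - f_\lambda) \right\}.$$

Employing the $\|\cdot\|_\rho$-norm, we get

$$\begin{aligned}
\|f_{\mathbf{z},\lambda} - f_\lambda\|_\rho &\leq \left\| L_K^{1/2} (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} \left\{ S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\mathcal{H} + (S_\mathbf{x}^* S_\mathbf{x} - L_K)(f_\mathcal{H} - f_\lambda) \right\} \right\|_\mathcal{H} \\
&\leq I_4 \left\{ I_2 + I_3 \|f_\lambda - f_\mathcal{H}\|_\mathcal{H} / \sqrt{\lambda} \right\},
\end{aligned}$$

where $I_2 = \|(L_K + \lambda I)^{-1/2} (S_\mathbf{x}^* \mathbf{y} - S_\mathbf{x}^* S_\mathbf{x} f_\mathcal{H})\|_\mathcal{H}$, $I_3 = \|S_\mathbf{x}^* S_\mathbf{x} - L_K\|_{\mathcal{L}(\mathcal{H})}$ and $I_4 = \|L_K^{1/2} (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} (L_K + \lambda I)^{1/2}\|_{\mathcal{L}(\mathcal{H})}$.

The estimates of $I_2$ and $I_3$ can be obtained from Proposition 3.2. To get the estimate for the sample error, we consider the following expression to bound $I_4$,

$$L_K^{1/2} (S_\mathbf{x}^* S_\mathbf{x} + \lambda I)^{-1} (L_K + \lambda I)^{1/2} = L_K^{1/2} (L_K + \lambda I)^{-1/2} \left\{ I - (L_K + \lambda I)^{-1/2} (L_K - S_\mathbf{x}^* S_\mathbf{x}) (L_K + \lambda I)^{-1/2} \right\}^{-1},$$

which implies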

 I4≤||L1/2K(LK+λI)−1/2||L(H)||{I−(LK+λI)−1/2(LK−