# Kernel Conjugate Gradient Methods with Random Projections

We propose and study kernel conjugate gradient methods (KCGM) with random projections for least-squares regression over a separable Hilbert space. Considering two types of random projections generated by randomized sketches and Nyström subsampling, we prove optimal statistical results with respect to variants of norms for the algorithms under a suitable stopping rule. Particularly, our results show that if the projection dimension is proportional to the effective dimension of the problem, KCGM with randomized sketches can generalize optimally, while achieving a computational advantage. As a corollary, we derive optimal rates for classic KCGM in the case that the target function may not be in the hypothesis space, filling a theoretical gap.

## 1 Introduction

Let the input space $H$ be a separable Hilbert space with inner product $\langle\cdot,\cdot\rangle_H$, and the output space $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H\times\mathbb{R}$. We study the following expected risk minimization,

$$\inf_{\omega\in H}\widetilde{\mathcal{E}}(\omega),\qquad \widetilde{\mathcal{E}}(\omega)=\int_{H\times\mathbb{R}}\big(\langle\omega,x\rangle_H-y\big)^2\,d\rho(x,y), \tag{1}$$

where the measure $\rho$ is known only through a sample $\{(x_i,y_i)\}_{i=1}^n$ of size $n$, independently and identically distributed (i.i.d.) according to $\rho$. As noted in [20, 21], this setting covers nonparametric regression with kernel methods [8, 33], and it is close to functional linear regression [27] with zero intercept and to linear inverse problems [11].

In large-scale learning scenarios, the search for an approximate estimator for the above problem via some specific algorithm can be restricted to a smaller subspace $S$ in order to gain computational advantages [36, 32, 10]. Typically, with a subsample/sketch dimension $m<n$, $S=\overline{\operatorname{span}\{\tilde x_j:1\le j\le m\}}$, where each $\tilde x_j$ is chosen randomly from the input set $\mathbf{x}=\{x_1,\dots,x_n\}$, or $S=\overline{\operatorname{span}\{\sum_{j=1}^{n}G_{ij}x_j:1\le i\le m\}}$, where $G=[G_{ij}]$ is a general random matrix whose rows are drawn according to a distribution. The former is called Nyström subsampling while the latter is called randomized sketches. Limiting the solution to the subspace $S$, replacing the expected risk by the empirical risk over the sample, and combining with an explicit, linear regularization technique based on spectral filtering of the empirical covariance operator leads to the projected-regularized algorithms. We refer to [1, 37, 19] and references therein for the statistical results and computational advantages of this kind of algorithm.

In this paper, we take a different step and apply random-projection techniques to another class of efficient iterative algorithms: kernel conjugate gradient type algorithms. As noted in [19], a solution of the empirical risk minimization over the subspace $S$ can be given by solving a projected normalized linear equation. We apply the kernel conjugate gradient methods (KCGM) [25, 15] for “solving” this normalized linear equation (without any explicit regularization term), and at the $t$-th iteration, we get an estimator that fits the linear equation best over the $t$-th order Krylov subspace. The regularization that ensures its best performance is realized by early-stopping the iterative procedure.

Using the early-stopping (iterative) regularization [40, 38, 28] has its own benefit compared with spectral-filtering algorithms, as it can tune the “regularization parameter” in an adaptive way if a suitable stopping rule is used. Thus, for some easy learning problems, an iterative algorithm can stop earlier while generalizing optimally, leading to some computational advantages.

Considering either randomized sketches or Nyström subsampling, we provide statistical results in terms of different norms with optimal rates. Particularly, our results indicate that for KCGM with randomized sketches, the algorithm can generalize optimally after some number of iterations, provided that the sketch dimension is proportional to the effective dimension [39] of the problem.

Furthermore, we point out that the time and space complexities of the algorithm are lower than those of classic KCGM. Thus, our results suggest that KCGM with randomized sketches can generalize optimally at a lower computational cost, even without considering benign assumptions on the problem in the attainable case (i.e., the expected risk minimization has at least one solution in $H$).

Finally, as a corollary, we derive the first result with optimal capacity-dependent rates for classical KCGM in the non-attainable case, filling a theoretical gap since [4].

This paper is organized as follows. We first introduce some preliminary notations and the studied algorithms in Section 2. We then introduce some basic assumptions and state our main results in Section 3, followed by some simple discussions and numerical illustrations. All the proofs are given in Section 4 and the Appendix.

## 2 Learning with Kernel Conjugate Gradient Methods and Random Projection

In this section, we first introduce some necessary notations. We then present KCGM with projection (abbreviated as projected-KCGM), and discuss its numerical realizations for two types of projection, generated by randomized sketches and by Nyström sketches/subsampling.

### 2.1 Notations and Auxiliary Operators

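Let $\rho_X(\cdot)$ be the induced marginal measure on $H$ of $\rho$, and $\rho(\cdot\mid x)$ the conditional probability measure on $\mathbb{R}$ with respect to $x\in H$ and $\rho$. Define the hypothesis space

$$H_\rho=\{f:H\to\mathbb{R}\mid\exists\,\omega\in H\ \text{with}\ f(x)=\langle\omega,x\rangle_H,\ \rho_X\text{-almost surely}\}.$$

Denote by $L^2_{\rho_X}$ the Hilbert space of square-integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with its norm given by $\|f\|_\rho=\big(\int_H|f(x)|^2\,d\rho_X(x)\big)^{1/2}$. Throughout this paper, we assume that the support of $\rho_X$ is compact and there exists a constant $\kappa$ such that

$$\langle x,x'\rangle_H\le\kappa^2,\quad\forall x,x'\in H,\ \rho_X\text{-almost every}. \tag{2}$$

For a bounded operator $A$ mapping from a separable Hilbert space $H_1$ to another separable Hilbert space $H_2$, $\|A\|$ denotes the operator norm of $A$, i.e., $\|A\|=\sup_{f\in H_1,\|f\|_{H_1}=1}\|Af\|_{H_2}$. For $n\in\mathbb{N}$, the set $\{1,\dots,n\}$ is denoted by $[n]$. For any real numbers $a$ and $b$, $a\vee b=\max(a,b)$ and $a\wedge b=\min(a,b)$.

Let $S_\rho:H\to L^2_{\rho_X}$ be the linear map $\omega\mapsto\langle\omega,\cdot\rangle_H$, which is bounded by $\kappa$ under Assumption (2). Furthermore, we consider the adjoint operator $S_\rho^*:L^2_{\rho_X}\to H$, the covariance operator $T:H\to H$ given by $T=S_\rho^*S_\rho$, and the integral operator $L:L^2_{\rho_X}\to L^2_{\rho_X}$ given by $L=S_\rho S_\rho^*$. It can be easily proved that

$$\begin{aligned}
S_\rho^*g&=\int_H x\,g(x)\,d\rho_X(x),\\
Lf&=S_\rho S_\rho^*f=\int_H f(x)\langle x,\cdot\rangle_H\,d\rho_X(x),\quad\text{and}\\
T&=S_\rho^*S_\rho=\int_H\langle\cdot,x\rangle_H\,x\,d\rho_X(x).
\end{aligned}$$

Under Assumption (2), the operators $T$ and $L$ can be proved to be positive trace class operators (and hence compact):

$$\|L\|=\|T\|\le\operatorname{tr}(T)=\int_H\operatorname{tr}(x\otimes x)\,d\rho_X(x)=\int_H\|x\|_H^2\,d\rho_X(x)\le\kappa^2. \tag{3}$$

For any $\omega\in H$, it is easy to prove the following isometry property,

$$\|S_\rho\omega\|_\rho=\|\sqrt{T}\omega\|_H. \tag{4}$$

Moreover, according to the singular value decomposition of a compact operator, one can prove

$$\|L^{-\frac12}S_\rho\omega\|_\rho\le\|\omega\|_H. \tag{5}$$

Similarly, for all $f\in L^2_{\rho_X}$ there holds

$$\|S_\rho^*f\|_H=\|L^{\frac12}f\|_\rho,\quad\text{and} \tag{6}$$

$$\|T^{-\frac12}S_\rho^*f\|_H\le\|f\|_\rho. \tag{7}$$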

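We define the (normalized) sampling operator $S_{\mathbf{x}}:H\to\mathbb{R}^n$ by

$$(S_{\mathbf{x}}\omega)_i=\frac{1}{\sqrt{n}}\langle\omega,x_i\rangle_H,\quad i\in[n],$$

where the norm $\|\cdot\|_2$ in $\mathbb{R}^n$ is the usual Euclidean norm. Its adjoint operator $S_{\mathbf{x}}^*:\mathbb{R}^n\to H$, defined by $\langle S_{\mathbf{x}}^*\mathbf{y},\omega\rangle_H=\langle\mathbf{y},S_{\mathbf{x}}\omega\rangle_2$ for $\mathbf{y}\in\mathbb{R}^n$, is thus given by

$$S_{\mathbf{x}}^*\mathbf{y}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}y_ix_i.$$

For notational simplicity, we also denote $\bar{\mathbf{y}}=\frac{1}{\sqrt{n}}\mathbf{y}$. Moreover, we can define the empirical covariance operator $T_{\mathbf{x}}:H\to H$ such that $T_{\mathbf{x}}=S_{\mathbf{x}}^*S_{\mathbf{x}}$. Obviously,

$$T_{\mathbf{x}}=S_{\mathbf{x}}^*S_{\mathbf{x}}=\frac{1}{n}\sum_{i=1}^{n}\langle\cdot,x_i\rangle_H\,x_i.$$

By Assumption (2), similar to (3), we have

$$\|T_{\mathbf{x}}\|\le\operatorname{tr}(T_{\mathbf{x}})\le\kappa^2. \tag{8}$$

For any two input sets $\mathbf{x}$ and $\tilde{\mathbf{x}}$, denote by $K_{\mathbf{x}\tilde{\mathbf{x}}}$ the $|\mathbf{x}|\times|\tilde{\mathbf{x}}|$ matrix with its $(i,j)$-th entry given by $\frac{1}{\sqrt{|\mathbf{x}||\tilde{\mathbf{x}}|}}\langle x_i,\tilde x_j\rangle_H$. Obviously,

$$K_{\mathbf{x}\tilde{\mathbf{x}}}=S_{\mathbf{x}}S_{\tilde{\mathbf{x}}}^*.$$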
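To make the sampling operator and its derived objects concrete, the following minimal sketch instantiates them for a finite-dimensional $H=\mathbb{R}^d$ with the standard inner product. The dimensions, the random inputs and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                       # sample size and dimension of H = R^d (illustrative)
X = rng.normal(size=(n, d))        # row i plays the role of the input x_i

S_x = X / np.sqrt(n)               # sampling operator: (S_x w)_i = <w, x_i> / sqrt(n)
T_x = S_x.T @ S_x                  # empirical covariance T_x = S_x^* S_x (d x d)
K_xx = S_x @ S_x.T                 # normalized Gram matrix K_xx = S_x S_x^* (n x n)

# Direct formula T_x = (1/n) * sum_i x_i x_i^T.
T_direct = sum(np.outer(x, x) for x in X) / n

# The operator norm is bounded by the trace, as in (8).
op_norm = np.linalg.norm(T_x, 2)
trace = np.trace(T_x)
```

In this finite-dimensional setting, $T_{\mathbf{x}}$ and $K_{\mathbf{xx}}$ share the same nonzero eigenvalues, which is why the algorithms below can work with Gram matrices only.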

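Problem (1) is equivalent to

$$\inf_{f\in H_\rho}\mathcal{E}(f),\qquad\mathcal{E}(f)=\int_{H\times\mathbb{R}}(f(x)-y)^2\,d\rho(x,y). \tag{9}$$

The function that minimizes the expected risk over all measurable functions is the regression function [8, 33], defined as

$$f_\rho(x)=\int_{\mathbb{R}}y\,d\rho(y\mid x),\quad x\in H,\ \rho_X\text{-almost every}. \tag{10}$$

A simple calculation shows that the following well-known fact holds [8, 33], for all $f\in L^2_{\rho_X}$,

$$\mathcal{E}(f)-\mathcal{E}(f_\rho)=\|f-f_\rho\|_\rho^2.$$

Under Assumption (2), $H_\rho$ is a subspace of $L^2_{\rho_X}$. Thus a solution for the problem (9) is the projection $f_H$ of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$, and for all $f\in H_\rho$ [20],

$$S_\rho^*f_\rho=S_\rho^*f_H,\quad\text{and} \tag{11}$$

$$\mathcal{E}(f)-\mathcal{E}(f_H)=\|f-f_H\|_\rho^2. \tag{12}$$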

### 2.2 Kernel Conjugate Gradient Methods with Projection

In this subsection, we introduce KCGM with solutions restricted to the subspace $S$, a closed subspace of $H$. Let $P$ be the projection operator with its range $S$. As noted in [19], a solution for the empirical risk minimization over $S$ is given by $P\hat\omega$, with $\hat\omega$ such that

$$PT_{\mathbf{x}}P\hat\omega=PS_{\mathbf{x}}^*\bar{\mathbf{y}}. \tag{13}$$

Note that, as $P=P^*$, $PT_{\mathbf{x}}P=(S_{\mathbf{x}}P)^*S_{\mathbf{x}}P$. Thus, (13) could be viewed as a normalized equation of $S_{\mathbf{x}}P\omega=\bar{\mathbf{y}}$. Motivated by [15, 4], we study the following conjugate gradient type algorithms applied to this normalized equation. For notational simplicity, we let

$$U=PT_{\mathbf{x}}P, \tag{14}$$

and write to mean

###### Algorithm 1 (Projected-KCGM).

For any

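$$\omega_t=\operatorname*{arg\,min}_{\omega\in\mathcal{K}_t(U,PS_{\mathbf{x}}^*\bar{\mathbf{y}})}\|U\omega-PS_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H. \tag{15}$$

Here, $\mathcal{K}_t(U,PS_{\mathbf{x}}^*\bar{\mathbf{y}})$ is the so-called Krylov subspace, defined as

$$\mathcal{K}_t(U,PS_{\mathbf{x}}^*\bar{\mathbf{y}})=\operatorname{span}\{PS_{\mathbf{x}}^*\bar{\mathbf{y}},UPS_{\mathbf{x}}^*\bar{\mathbf{y}},\cdots,U^{t-1}PS_{\mathbf{x}}^*\bar{\mathbf{y}}\}=\{p(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}:p\in\mathcal{P}_{t-1}\},$$

where $\mathcal{P}_{t-1}$ denotes the set of real polynomials of degree at most $t-1$.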
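The minimization in (15) can be illustrated in finite dimensions by building the Krylov subspace explicitly and solving a small least-squares problem. This is a naive sketch for intuition only: it is not the conjugate gradient recurrence used in practice, the function name `krylov_min_residual` is hypothetical, and it assumes the first $t$ Krylov vectors are linearly independent.

```python
import numpy as np

def krylov_min_residual(U, r, t):
    """Return argmin of ||U w - r||_2 over w in K_t(U, r) = span{r, U r, ..., U^{t-1} r}."""
    # Build the Krylov vectors; rescaling each one keeps the span and avoids overflow.
    cols = [r / np.linalg.norm(r)]
    for _ in range(t - 1):
        v = U @ cols[-1]
        cols.append(v / np.linalg.norm(v))
    # Orthonormal basis of the Krylov subspace (assumes independent Krylov vectors).
    V, _ = np.linalg.qr(np.column_stack(cols))
    # Least squares over the coefficients c of w = V c.
    c, *_ = np.linalg.lstsq(U @ V, r, rcond=None)
    return V @ c

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
U = A @ A.T / 6 + 0.5 * np.eye(6)   # symmetric positive definite stand-in for P T_x P
r = rng.normal(size=6)              # stand-in for P S_x^* ybar
residuals = [np.linalg.norm(U @ krylov_min_residual(U, r, t) - r) for t in range(1, 7)]
```

Because the Krylov subspaces are nested, the residual norms are non-increasing in $t$; early stopping picks an iterate before the residual is driven all the way to zero.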

Different choices of the subspace $S$ correspond to different algorithms. Particularly, when $S=H$, the algorithm is the classical KCGM. In this paper, we will set

$$S=\overline{\operatorname{span}\Big\{\sum_{j=1}^{n}G_{ij}x_j:1\le i\le m\Big\}},$$

where $G=[G_{ij}]\in\mathbb{R}^{m\times n}$ is a random matrix, or

$$S=\overline{\operatorname{span}\{\tilde x_j:1\le j\le m\}},$$

with each $\tilde x_j$ chosen randomly from $\mathbf{x}$. The following examples provide numerical realizations of Algorithm 1, considering randomized sketches, Nyström-subsampling sketches and non-sketching regimes.

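###### Example 2.1 (Randomized sketches).

Let $S=\operatorname{range}\{S_{\mathbf{x}}^*G^*\}$, and $G$ be a matrix in $\mathbb{R}^{m\times n}$. Let $R$ be a matrix such that $(GK_{\mathbf{xx}}G^*)^\dagger=RR^*$. Denote $\tilde K=R^*GK_{\mathbf{xx}}^2G^*R$ and $b=R^*GK_{\mathbf{xx}}\bar{\mathbf{y}}$. In this case, Algorithm 1 is equivalent to $\omega_t=S_{\mathbf{x}}^*G^*Ra_t$ with $a_t$ given by

$$a_t=\operatorname*{arg\,min}_{a\in\mathcal{K}_t(\tilde K,b)}\|\tilde Ka-b\|_2. \tag{16}$$

We call this type of algorithm sketched-KCGM.

###### Example 2.2 (Subsampling sketches).

In Nyström-subsampling sketches, $S=\operatorname{span}\{\tilde x_j:1\le j\le m\}$ with each $\tilde x_j$ drawn randomly following a distribution from $\mathbf{x}$. Let $R$ be a matrix such that $(K_{\tilde{\mathbf{x}}\tilde{\mathbf{x}}})^\dagger=RR^*$. Denote $\tilde K=R^*K_{\tilde{\mathbf{x}}\mathbf{x}}K_{\mathbf{x}\tilde{\mathbf{x}}}R$ and $b=R^*K_{\tilde{\mathbf{x}}\mathbf{x}}\bar{\mathbf{y}}$. In this case, Algorithm 1 is equivalent to $\omega_t=S_{\tilde{\mathbf{x}}}^*Ra_t$ with $a_t$ given by

$$a_t=\operatorname*{arg\,min}_{a\in\mathcal{K}_t(\tilde K,b)}\|\tilde Ka-b\|_2.$$

We call this algorithm Nyström-KCGM.

###### Example 2.3 (Non-sketches [4]).

For the ordinary non-sketching regimes, $S=H$. Let $K=K_{\mathbf{xx}}$. Then Algorithm 1 is equivalent to $\omega_t=S_{\mathbf{x}}^*a_t$ with $a_t$ given by

$$a_t=\operatorname*{arg\,min}_{a\in\mathcal{K}_t(K,\bar{\mathbf{y}})}\|Ka-\bar{\mathbf{y}}\|_K,$$

where $\|a\|_K^2=a^\top Ka$.

In all the above examples, in order to execute the algorithms, one only needs to know how to compute $\langle x,x'\rangle_H$ for any two points $x,x'\in H$, a requirement that is met in many cases such as learning with kernel methods.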

In general, as that the computation of the matrix (or ) can be parallelized, the computational costs are in time and in space for sketched/Nyström KCGM after -iterations, while they are in time and in space for non-sketched KCGM. As shown both in theory and our numerical results, the total number of iterations for the algorithms to achieve best performance is typically less than for sketched/Nyström KCGM.

A classical [29] or sketched [2] kernel conjugate gradient type algorithm was proposed for solving the penalized empirical risk minimization. In contrast, Algorithm 1 is for “solving” the (unpenalized) empirical risk minimization and it does not involve any explicit penalty. In this case, we do not need to tune the penalty parameter. The best generalization ability of Algorithm 1 is ensured by early-stopping the procedure, considering a suitable stopping rule.

The proofs for the three examples will be given in Subsection 4.1.

## 3 Main Results

In this section, we first introduce some common assumptions from statistical learning theory, and then present our statistical results for sketched/Nyström-KCGM and classical KCGM.

### 3.1 Assumptions

###### Assumption 1.

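There exist positive constants $Q$ and $M$ such that for all $l\ge2$ with $l\in\mathbb{N}$,

$$\int_{\mathbb{R}}|y|^l\,d\rho(y\mid x)\le\frac12\,l!\,M^{l-2}Q^2, \tag{17}$$

$\rho_X$-almost surely. Furthermore, for some $B>0$, $f_H$ satisfies

$$\int_H(f_H(x)-f_\rho(x))^2\,x\otimes x\,d\rho_X(x)\preceq B^2T. \tag{18}$$

Obviously, Assumption 1 implies that the regression function is bounded almost surely, as

$$|f_\rho(x)|\le\int_{\mathbb{R}}|y|\,d\rho(y\mid x)\le\Big(\int_{\mathbb{R}}|y|^2\,d\rho(y\mid x)\Big)^{\frac12}\le Q. \tag{19}$$

Condition (17) is satisfied if $y$ is bounded almost surely or if $y=\langle\omega_*,x\rangle_H+\epsilon$ for some Gaussian noise $\epsilon$. Condition (18) is satisfied if $f_H-f_\rho$ is bounded almost surely or the hypothesis space is consistent, i.e., $\inf_{H_\rho}\mathcal{E}=\mathcal{E}(f_\rho)$.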

###### Assumption 2.

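$f_H$ satisfies the following Hölder source condition

$$f_H=L^\zeta g_0,\quad\text{with}\quad\|g_0\|_\rho\le R. \tag{20}$$

Here, $R$ and $\zeta$ are non-negative numbers.

Assumption 2 relates to the regularity/smoothness of $f_H$. The bigger $\zeta$ is, the stronger the assumption and the smoother $f_H$ is, as

$$L^{\zeta_1}(L^2_{\rho_X})\subseteq L^{\zeta_2}(L^2_{\rho_X})\quad\text{when }\zeta_1\ge\zeta_2.$$

Particularly, when $\zeta\ge1/2$, there exists some $\omega_H\in H$ such that $S_\rho\omega_H=f_H$ almost surely [33], while for $\zeta=0$ the assumption holds trivially.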

###### Assumption 3.

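For some $\gamma\in[0,1]$ and $c_\gamma>0$, $T$ satisfies

$$\mathcal{N}(\lambda):=\operatorname{tr}\big(T(T+\lambda I)^{-1}\big)\le c_\gamma\lambda^{-\gamma},\quad\text{for all }\lambda>0. \tag{21}$$

Assumption 3 characterizes the capacity of the hypothesis space. The left-hand side of (21) is called the effective dimension [39]. As $T$ is a trace-class operator, Condition (21) is trivially satisfied with $\gamma=1$ (which is called the capacity-independent case). Furthermore, it is satisfied with a general $\gamma\in(0,1]$ if the eigenvalues $\{\sigma_i\}_i$ of $T$ satisfy $\sigma_i\lesssim i^{-1/\gamma}$.

We refer to [19] for more comments on the above assumptions.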
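As a numerical illustration of Assumption 3, the effective dimension can be evaluated directly from a spectrum. The sketch below assumes, purely for illustration, a polynomially decaying spectrum $\sigma_i=i^{-1/\gamma}$ truncated to finitely many terms, and checks that $\mathcal{N}(\lambda)\lambda^{\gamma}$ stays bounded over a range of $\lambda$; the helper name `effective_dimension` is hypothetical.

```python
import numpy as np

def effective_dimension(eigs, lam):
    """N(lam) = tr(T (T + lam I)^{-1}) for an operator T with eigenvalues `eigs`."""
    return float(np.sum(eigs / (eigs + lam)))

gamma = 0.5
i = np.arange(1, 200_001)
eigs = i ** (-1.0 / gamma)               # assumed decay sigma_i = i^(-1/gamma), truncated
lams = np.logspace(-4, -1, 7)            # regularization levels to probe
ratios = np.array([effective_dimension(eigs, lam) * lam ** gamma for lam in lams])
```

If the ratios stay below a fixed constant as $\lambda$ shrinks, the bound (21) holds on the probed range with that constant playing the role of $c_\gamma$.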

### 3.2 General Results for Kernel Conjugate Gradient Method with Projection

The following results provide convergence results for general projected-KCGM with a data-dependent stopping rule.

###### Theorem 3.1.

Under Assumptions 1, 2 and 3, let . Assume that for some , and for any ,

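$$\mathbb{P}\Big(\|(I-P)T^{\frac12}\|^2>C_1'\lambda^{1\vee\frac{\zeta-a}{1-a}}\log\frac{2}{\delta}\Big)\le\delta,\qquad\lambda=n^{-\frac{1}{(2\zeta+\gamma)\vee1}}\,b_{n,\zeta,\gamma}. \tag{22}$$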

Then the following results hold with probability at least . There exist positive constants and (which depend only on ) such that if the stopping rule is

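$$\|U\omega_t-PS_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H\le\tilde C_1\log^{\frac32}\frac{2}{\delta}\;n^{-\frac{\zeta+1/2}{1\vee(2\zeta+\gamma)}}\,b_{n,\zeta,\gamma}^{\zeta+1/2},$$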

then

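$$\|L^{-a}(S_\rho\omega_{\hat t}-f_H)\|_\rho\le\tilde C_2\log^{2-a}\frac{2}{\delta}\;n^{-\frac{\zeta-a}{1\vee(2\zeta+\gamma)}}\,b_{n,\zeta,\gamma}^{\zeta-a}.$$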

Furthermore, if for some and

$$\|T^{\frac12-a}(\omega_{\hat t}-\omega_H)\|_H\le\tilde C_2\log^{2-a}\frac{2}{\delta}\;n^{-\frac{\zeta-a}{1\vee(2\zeta+\gamma)}}. \tag{23}$$

Here,

$$b_{n,\zeta,\gamma}=(1\vee\log n^\gamma)^{\mathbf{1}\{2\zeta+\gamma\le1\}}. \tag{24}$$

The convergence rate from the above is optimal as it matches the minimax lower rate derived for in [7, 5].
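To get a feel for the bound in Theorem 3.1, the small sketch below evaluates its $n$-dependent part, $n^{-\frac{\zeta-a}{1\vee(2\zeta+\gamma)}}b_{n,\zeta,\gamma}^{\zeta-a}$, with $b_{n,\zeta,\gamma}$ as in (24). The constant $\tilde C_2$ and the $\log\frac{2}{\delta}$ factor are deliberately dropped, and the function names are hypothetical.

```python
import math

def log_factor(n, zeta, gamma):
    """b_{n,zeta,gamma} from (24): 1 v log(n^gamma) if 2*zeta + gamma <= 1, else 1."""
    if 2 * zeta + gamma <= 1:
        return max(1.0, gamma * math.log(n))
    return 1.0

def rate(n, zeta, gamma, a=0.0):
    """n-dependent part of the bound on ||L^{-a}(S_rho w - f_H)||_rho (constants dropped)."""
    b = log_factor(n, zeta, gamma)
    return n ** (-(zeta - a) / max(1.0, 2 * zeta + gamma)) * b ** (zeta - a)

# Example: zeta = 1/2, gamma = 1/2, a = 0 gives the familiar n^(-1/3) decay of the norm.
example = rate(1000, 0.5, 0.5)
```

With $a=0$, squaring this quantity gives the corresponding rate for the excess risk.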

Convergence results with respect to different measures arise from statistical learning theory and inverse problems. In statistical learning theory, one is typically interested in the generalization ability, measured in terms of excess risks, $\mathcal{E}(f)-\mathcal{E}(f_H)=\|f-f_H\|_\rho^2$. In inverse problems, one is interested in the convergence within the space $H$.

Theorem 3.1 asserts that projected-KCGM converges optimally if the projection error is small enough. The condition (22) is satisfied with random projections induced by randomized sketches or Nyström subsampling if the sketching dimension is large enough, as shown in Section 4. Thus we have the following corollaries for sketched or Nyström KCGM.

### 3.3 Results for Kernel Conjugate Gradient Methods with Randomized Sketches

In this subsection, we state optimal convergence results with respect to different norms for KCGM with randomized sketches from Example 2.1.

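We assume that the sketching matrix $G$ satisfies the following concentration property: for any finite subset $E$ in $\mathbb{R}^n$ and for any $t>0$,

$$\mathbb{P}\Big(\big|\|Ga\|_2^2-\|a\|_2^2\big|\ge t\|a\|_2^2\Big)\le2|E|\exp\Big(-\frac{t^2m}{c_0'\log^\beta n}\Big). \tag{25}$$

Here, $c_0'$ and $\beta$ are universal non-negative constants.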

###### Example 3.1.

Many matrices satisfy the concentration property.

1) Subgaussian sketches. Matrices with i.i.d. subgaussian (such as Gaussian or Bernoulli) entries satisfy (25) with some universal constants $c_0'$ and $\beta$. More generally, if the rows of $G$ are independent (scaled) copies of an isotropic vector, then $G$ also satisfies (25) [23]. Recall that a random vector $a$ is isotropic if for all $v$,

$$\mathbb{E}[\langle a,v\rangle_2^2]=\|v\|_2^2,\quad\text{and}\quad\inf\{t:\mathbb{E}[\exp(\langle a,v\rangle_2^2/t^2)]\le2\}\le\alpha\|v\|_2,$$

for some constant $\alpha$.

2) Randomized orthogonal system (ROS) sketches. As noted in [17], a matrix that satisfies the restricted isometry property from compressed sensing [6, 12], with randomized column signs, satisfies (25). Particularly, a random partial Fourier matrix or a random partial Hadamard matrix with randomized column signs satisfies (25) for some universal constants $c_0'$ and $\beta$.
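The concentration property (25) can be observed empirically for a Gaussian sketch. The snippet below is a minimal sketch under assumed settings: entries of $G$ are drawn i.i.d. from $N(0,1/m)$ so that $\mathbb{E}\|Ga\|_2^2=\|a\|_2^2$, and the sizes `n`, `m` and the finite set `E` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 500                                      # ambient and sketch dimensions (illustrative)
G = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))   # Gaussian sketch with E||G a||^2 = ||a||^2
E = rng.normal(size=(20, n))                          # a finite set of 20 test vectors (rows)

# Ratio ||G a||^2 / ||a||^2 for every a in E; values close to 1 indicate concentration.
ratios = np.sum((E @ G.T) ** 2, axis=1) / np.sum(E ** 2, axis=1)
max_dev = float(np.max(np.abs(ratios - 1.0)))
```

Increasing `m` tightens the deviation at the rate suggested by the exponent in (25), at the price of a larger sketched problem.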

###### Corollary 3.2.

Under Assumptions 1, 2 and 3, let where is a random matrix satisfying (25). Let , and

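$$m\ge\tilde C_3\log^3\frac{3}{\delta}\,\log^\beta n\begin{cases}n^{\gamma}[1\vee\log n^\gamma]^{-\gamma},&\text{if }2\zeta+\gamma\le1,\\[2pt] n^{\frac{\gamma(\zeta-a)}{(1-a)(2\zeta+\gamma)}},&\text{if }\zeta\ge1,\\[2pt] n^{\frac{\gamma}{2\zeta+\gamma}},&\text{otherwise},\end{cases} \tag{26}$$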

for some (which depends only on Then the conclusions in Theorem 3.1 hold.

When , the minimal sketching dimension is proportional to the effective dimension up to a logarithmic factor, which we believe that it is unimprovable.

According to Corollary 3.2, sketched-KCGM can generalize optimally if the sketching dimension is large enough.

### 3.4 Results for Kernel Conjugate Gradient Methods with Nyström Sketches

In this subsection, we provide optimal rates with respect to different norms for KCGM with Nyström sketches from Example 2.2.

###### Corollary 3.3.

Under Assumptions 1, 2 and 3, let , , , and

 m≥n1∨ζ−a(1−a)(2ζ+γ)[1∨lognγ].

Then the conclusions in Theorem 3.1 are true.

The requirement on the sketch dimension of Nyström-KCGM does not depend on the probability constant , but it is stronger than that of sketched-KCGM if ignoring the factor

###### Remark 3.4.

In the above, we only consider the plain Nyström subsampling. Using the approximated leveraging score (ALS) Nyström subsampling [35, 10], we can further improve the projection dimension condition to (26), see Section 4 for details. However, in this case, we need to compute the ALS with an appropriate pseudo regularization parameter .

### 3.5 Optimal Rates for Classical Kernel Conjugate Gradient Methods

As a direct corollary, we derive optimal rates for classical KCGM as follows, covering the non-attainable cases.

###### Corollary 3.5.

Under Assumptions 1, 2 and 3, let and . Then the conclusions in Theorem 3.1 are true.

To the best of our knowledge, the above results provide the first optimal capacity-dependent rate for KCGM in the non-attainable case, i.e. . This thus provides an answer to a question open since [4].

Convergence results for kernel partial least squares under different stopping rules have been derived in [22, 30], but the derived optimal rates are only for the attainable cases. Our analysis could be extended to this different type of algorithm with similar stopping rules.

We present some numerical results to illustrate our derived results in the setting of learning with kernel methods. In all the simulations, we constructed training datas from the regression model , where the regression function , the input is uniformly drawn from , and

is a Gaussian noise with zero mean and standard deviation

. By construction, the function belongs to the first-order Sobolev space with . In all the simulations, the RKHS is associated with a Sobolev kernel . As noted in [37, Example 3] for Sobolev kernel, according to [14], Assumption 3 is satisfied with As suggested by our theory, we set the projection dimension for KCGM with ROS sketches based on the fast Hadamard transform while for KCGM with plain Nyström sketches. We performed simulations for in the set so as to study scaling with the sample size. For each , we performed 100 trials and both squared prediction errors and training errors averaged over these 100 trials were computed. The errors for versus the iterations were reported in Figure 1. For each the minimal squared prediction error over the first iterations is computed and these errors versus the sample size were reported in Figure 2

in order to compare with a state-of-the-art algorithm, kernel ridge regression (KRR). From Figure 1, we see that the squared prediction errors decrease over the first iterations and then increase for both sketched and Nyström KCGM. This indicates that the number of iterations has a regularization effect. Our theory predicts that the squared prediction loss should tend to zero at the same rate as that of KRR. Figure 2 confirms this theoretical prediction.

All the results stated in this section will be proved in Section 4.

## 4 Proof

In this section and the appendix, we provide all the proofs.

### 4.1 Proof for Subsection 2.2

Let $Q$ be a compact operator from the Euclidean space $\mathbb{R}^m$ to $H$ such that $\overline{\operatorname{range}(Q)}=S$, and let $R$ be a matrix such that $(Q^*Q)^\dagger=RR^*$. As $P$ is the projection operator onto $S$, then

$$P=Q(Q^*Q)^\dagger Q^*=QRR^*Q^*. \tag{27}$$
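The factorization in (27) is easy to check numerically when $H=\mathbb{R}^d$. In the sketch below, $R$ is built from an eigendecomposition of $Q^*Q$, which is one possible choice satisfying $(Q^*Q)^\dagger=RR^*$; the dimensions and the random $Q$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 3
Q = rng.normal(size=(d, m))                  # columns of Q span the subspace S of R^d

evals, evecs = np.linalg.eigh(Q.T @ Q)       # Q^* Q is symmetric positive semi-definite
keep = evals > 1e-12 * evals.max()           # discard numerically zero eigenvalues
R = evecs[:, keep] / np.sqrt(evals[keep])    # then R R^T = (Q^* Q)^dagger

P = Q @ R @ R.T @ Q.T                        # projection onto range(Q), as in (27)
```

The same construction of $R$, applied to $GK_{\mathbf{xx}}G^*$ or $K_{\tilde{\mathbf{x}}\tilde{\mathbf{x}}}$ instead of $Q^*Q$, is what the numerical realizations of Algorithm 1 need.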

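For any polynomial function $q$, we have that

$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=q(PT_{\mathbf{x}}P)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=q(PS_{\mathbf{x}}^*S_{\mathbf{x}}P)PS_{\mathbf{x}}^*\bar{\mathbf{y}}.$$

Noting that $P^2=P$, and using Lemma 4.2 from the coming subsection,

$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=PS_{\mathbf{x}}^*q(S_{\mathbf{x}}PPS_{\mathbf{x}}^*)\bar{\mathbf{y}}=PS_{\mathbf{x}}^*q(S_{\mathbf{x}}PS_{\mathbf{x}}^*)\bar{\mathbf{y}}.$$

Introducing (27),

$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=QRR^*Q^*S_{\mathbf{x}}^*q(S_{\mathbf{x}}QRR^*Q^*S_{\mathbf{x}}^*)\bar{\mathbf{y}}. \tag{28}$$

Applying Lemma 4.2 again,

$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=QRq(R^*Q^*S_{\mathbf{x}}^*S_{\mathbf{x}}QR)R^*Q^*S_{\mathbf{x}}^*\bar{\mathbf{y}}=QRq(\tilde K)b, \tag{29}$$

where we denote

$$b=R^*Q^*S_{\mathbf{x}}^*\bar{\mathbf{y}},\quad\text{and}\quad\tilde K=R^*Q^*S_{\mathbf{x}}^*S_{\mathbf{x}}QR.$$

Using (27) and $P^2=P$, for any $g\in H$,

$$\|QRR^*Q^*g\|_H^2=\langle QRR^*Q^*QRR^*Q^*g,g\rangle_H=\langle QRR^*Q^*g,g\rangle_H=\|R^*Q^*g\|_2^2,$$

and we get from (28) that

$$\|q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H=\|R^*Q^*S_{\mathbf{x}}^*q(S_{\mathbf{x}}QRR^*Q^*S_{\mathbf{x}}^*)\bar{\mathbf{y}}\|_2=\|q(\tilde K)b\|_2, \tag{30}$$

where we used Lemma 4.2 for the last equality.

Note that the solution of (15) is given by $\omega_t=p_t(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}$, with

$$p_t=\operatorname*{arg\,min}_{p\in\mathcal{P}_{t-1}}\|(Up(U)-I)PS_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H.$$

Using (29) and (30), we know that $\omega_t=QRp_t(\tilde K)b$, with

$$p_t=\operatorname*{arg\,min}_{p\in\mathcal{P}_{t-1}}\|(\tilde Kp(\tilde K)-I)b\|_2,$$

which is equivalent to $\omega_t=QRa_t$, with

$$a_t=\operatorname*{arg\,min}_{a\in\mathcal{K}_t(\tilde K,b)}\|\tilde Ka-b\|_2.$$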
###### Proof for Example 2.1.

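For general randomized sketches, $Q=S_{\mathbf{x}}^*G^*$. In this case,

$$\tilde K=R^*GS_{\mathbf{x}}S_{\mathbf{x}}^*S_{\mathbf{x}}S_{\mathbf{x}}^*G^*R=R^*GK_{\mathbf{xx}}^2G^*R,$$

and $b=R^*GS_{\mathbf{x}}S_{\mathbf{x}}^*\bar{\mathbf{y}}=R^*GK_{\mathbf{xx}}\bar{\mathbf{y}}$. ∎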
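The formulas above translate into a short numerical realization of sketched-KCGM. The following is a minimal sketch under several assumptions that are not from the paper: a Gaussian kernel and a Gaussian sketch are used for illustration, the Krylov minimization is solved by an explicit least-squares construction rather than a conjugate gradient recurrence, and the names `krylov_min_residual` and `sketched_kcgm` are hypothetical.

```python
import numpy as np

def krylov_min_residual(A, b, t):
    """argmin of ||A a - b||_2 over a in K_t(A, b), via an explicit Krylov basis."""
    cols = [b / np.linalg.norm(b)]
    for _ in range(t - 1):
        v = A @ cols[-1]
        cols.append(v / np.linalg.norm(v))
    V, _ = np.linalg.qr(np.column_stack(cols))
    c, *_ = np.linalg.lstsq(A @ V, b, rcond=None)
    return V @ c

def sketched_kcgm(K, y, G, n_iters):
    """Iterates of sketched-KCGM; K is the normalized Gram matrix K_xx, G is m x n.

    Returns a list of (alpha_t, residual_t), where omega_t = S_x^* alpha_t and
    residual_t = ||K~ a_t - b||_2.
    """
    n = len(y)
    ybar = y / np.sqrt(n)
    evals, evecs = np.linalg.eigh(G @ K @ G.T)
    keep = evals > 1e-8 * evals.max()            # numerical rank cut-off (an assumption)
    R = evecs[:, keep] / np.sqrt(evals[keep])    # R R^T = (G K G^T)^dagger
    GK = G @ K
    K_tilde = R.T @ GK @ GK.T @ R                # R^T G K^2 G^T R, using K = K^T
    b = R.T @ GK @ ybar
    path = []
    for t in range(1, n_iters + 1):
        a_t = krylov_min_residual(K_tilde, b, t)
        alpha_t = G.T @ (R @ a_t)                # omega_t = S_x^* G^T R a_t
        path.append((alpha_t, float(np.linalg.norm(K_tilde @ a_t - b))))
    return path

rng = np.random.default_rng(0)
n, m = 200, 20
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2)) / n   # Gaussian kernel, entries k(x_i, x_j)/n
G = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))                # Gaussian sketch
path = sketched_kcgm(K, y, G, n_iters=5)
fitted = [np.sqrt(n) * K @ alpha for alpha, _ in path]             # fitted values at the training inputs
```

Only the $n\times n$ Gram matrix and the $m\times n$ sketch enter the computation, and each iterate lives in an $m$-dimensional (or smaller) space, which is the source of the computational savings discussed in Section 2.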

###### Proof for Example 2.2.

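In Nyström subsampling, $\tilde{\mathbf{x}}$ is a subset of size $m$ drawn randomly following a distribution from $\mathbf{x}$, $S=\operatorname{range}\{S_{\tilde{\mathbf{x}}}^*\}$, and $Q=S_{\tilde{\mathbf{x}}}^*$. In this case, $\tilde K=R^*K_{\tilde{\mathbf{x}}\mathbf{x}}K_{\mathbf{x}\tilde{\mathbf{x}}}R$ and $b=R^*K_{\tilde{\mathbf{x}}\mathbf{x}}\bar{\mathbf{y}}$. ∎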

###### Proof for Example 2.3.

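For the ordinary non-sketching regimes, $S=H$ and $P=I$. Denote $K=K_{\mathbf{xx}}$. Then

$$\omega_t=\operatorname*{arg\,min}_{\omega\in\mathcal{K}_t(T_{\mathbf{x}},S_{\mathbf{x}}^*\bar{\mathbf{y}})}\|T_{\mathbf{x}}\omega-S_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H$$

is equivalent to $\omega_t=S_{\mathbf{x}}^*\hat a_t$ with $\hat a_t$ given by

$$\hat a_t=\operatorname*{arg\,min}_{a\in\mathcal{K}_t(K,\bar{\mathbf{y}})}\|Ka-\bar{\mathbf{y}}\|_K.$$

Indeed,

$$\|T_{\mathbf{x}}\omega-S_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H^2=\|S_{\mathbf{x}}^*(S_{\mathbf{x}}\omega-\bar{\mathbf{y}})\|_H^2$$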