Let the input space $H$ be a separable Hilbert space with inner product $\langle \cdot, \cdot \rangle_H$, and let the output space be $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H \times \mathbb{R}$. We study the following expected risk minimization,
where the measure $\rho$ is known only through a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^n$ of size $n$, drawn independently and identically distributed (i.i.d.) according to $\rho$. As noted in [20, 21], this setting covers nonparametric regression with kernel methods [8, 33], and it is closely related to functional linear regression with zero intercept and to linear inverse problems.
In large-scale learning scenarios, the search for an approximate estimator for the above problem via specific algorithms can be restricted to a smaller subspace in order to achieve computational advantages [36, 32, 10]. Typically, with a subsample/sketch dimension $m \ll n$, one considers the subspace $\operatorname{span}\{\tilde{x}_1, \dots, \tilde{x}_m\}$, where each $\tilde{x}_j$ is chosen randomly from the input set $\{x_1, \dots, x_n\}$, or the subspace $\operatorname{span}\{\sum_{i=1}^n G_{ji} x_i : 1 \le j \le m\}$, where $G \in \mathbb{R}^{m \times n}$ is a general random matrix whose rows are drawn according to a distribution. The former is called Nyström subsampling, while the latter is called randomized sketching. Restricting the solution to the subspace, replacing the expected risk by the empirical risk over the sample, and combining this with a (linear and explicit) regularization technique based on spectral filtering of the empirical covariance operator leads to projected regularized algorithms. We refer to [1, 37, 19] and references therein for the statistical results and computational advantages of this class of algorithms.
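Both families of random projections are easy to instantiate in matrix form. The following is a minimal illustrative sketch (the function names are ours, not the paper's): plain Nyström subsampling selects $m$ of the $n$ samples uniformly at random, while a sub-Gaussian sketch mixes all $n$ samples with i.i.d. Gaussian weights scaled so that $\mathbb{E}[G^\top G] = I$.

```python
import numpy as np

def nystrom_sketch(n, m, rng):
    """Plain Nystrom subsampling: pick m of the n training points
    uniformly at random; each row of the sketch selects one sample."""
    idx = rng.choice(n, size=m, replace=False)
    S = np.zeros((m, n))
    S[np.arange(m), idx] = 1.0
    return S

def gaussian_sketch(n, m, rng):
    """Sub-Gaussian randomized sketch: i.i.d. N(0, 1/m) entries,
    so that E[G^T G] = I."""
    return rng.standard_normal((m, n)) / np.sqrt(m)
```

Applying either matrix to the inputs (or to the kernel matrix) produces the $m$-dimensional subspace used by the projected algorithms below.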
In this paper, we take a different route and apply random-projection techniques to another efficient and powerful class of iterative algorithms: kernel conjugate gradient type algorithms. As noted in earlier work, a solution of the empirical risk minimization over the subspace can be obtained by solving a projected normalized linear equation. We apply kernel conjugate gradient methods (KCGM) [25, 15] to “solve” this normalized linear equation (without any explicit regularization term); at the $t$-th iteration, we obtain the estimator that fits the linear equation best over the $t$-th order Krylov subspace. The regularization needed to ensure its best performance is realized by early stopping of the iterative procedure.
Early-stopping (iterative) regularization [40, 38, 28] has its own benefit compared with spectral-filtering algorithms, as it can tune the “regularization parameter” in an adaptive way if a suitable stopping rule is used. Thus, for easier learning problems, an iterative algorithm can stop earlier while still generalizing optimally, leading to computational advantages.
Considering either randomized sketches or Nyström subsampling, we provide statistical results in terms of different norms with optimal rates. In particular, our results indicate that KCGM with randomized sketches can generalize optimally after a certain number of iterations, provided that the sketch dimension is proportional to the effective dimension $\mathcal{N}(\lambda)$ of the problem.
Furthermore, we point out that the computational complexities of the sketched algorithm are lower, in both time and space, than those of classic KCGM. Thus, our results suggest that KCGM with randomized sketches can generalize optimally with less computation, e.g., in the attainable case (i.e., when the expected risk minimization has at least one solution in $H$) without further benign assumptions on the problem.
Finally, as a corollary, we derive the first result with optimal capacity-dependent rates for classical KCGM in the non-attainable case, filling a long-standing theoretical gap.
The paper is organized as follows. We first introduce some preliminary notation and the studied algorithms in Section 2. We then introduce some basic assumptions and state our main results in Section 3, followed by some discussion and numerical illustrations. All proofs are given in Section 4 and the appendix.
2 Learning with Kernel Conjugate Gradient Methods and Random Projection
In this section, we first introduce some necessary notation. We then present KCGM with projection (abbreviated as projected-KCGM) and discuss its numerical realization for two types of projection, generated by randomized sketches and by Nyström sketches/subsampling.
2.1 Notations and Auxiliary Operators
Let $Z = H \times \mathbb{R}$, let $\rho_X(\cdot)$ be the marginal measure on $H$ induced by $\rho$, and let $\rho(\cdot \mid x)$ be the conditional probability measure on $\mathbb{R}$ with respect to $x \in H$ and $\rho$. Define the hypothesis space
$$H_\rho = \{f : H \to \mathbb{R} \mid \exists\, \omega \in H \text{ with } f(x) = \langle \omega, x \rangle_H, \ \rho_X\text{-almost surely}\}.$$
Denote by $L^2_{\rho_X}$ the Hilbert space of square-integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with norm given by $\|f\|_\rho = \left(\int_H |f(x)|^2 \, d\rho_X(x)\right)^{1/2}$. Throughout this paper, we assume that the support of $\rho_X$ is compact and that there exists a constant $\kappa \in [1, \infty)$ such that
$$\langle x, x' \rangle_H \le \kappa^2, \quad \rho_X\text{-almost surely.} \tag{2}$$
For a bounded operator $L$ mapping from a separable Hilbert space $H_1$ to another separable Hilbert space $H_2$, $\|L\|$ denotes the operator norm of $L$, i.e., $\|L\| = \sup_{\|f\|_{H_1} = 1} \|Lf\|_{H_2}$. The set $\{1, 2, \dots, n\}$ is denoted by $[n]$. For any real numbers $a$ and $b$, $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$.
Let $S_\rho : H \to L^2_{\rho_X}$ be the linear map $\omega \mapsto \langle \omega, \cdot \rangle_H$, which is bounded by $\kappa$ under Assumption (2). Furthermore, we consider the adjoint operator $S_\rho^* : L^2_{\rho_X} \to H$, the covariance operator $T : H \to H$ given by $T = S_\rho^* S_\rho$, and the integral operator $L : L^2_{\rho_X} \to L^2_{\rho_X}$ given by $L = S_\rho S_\rho^*$. It can be easily proved that
$$S_\rho^* g = \int_H x\, g(x) \, d\rho_X(x), \qquad T = \int_H \langle \cdot, x \rangle_H\, x \, d\rho_X(x).$$
Under Assumption (2), the operators $T$ and $L$ can be proved to be positive trace-class operators (and hence compact), with
$$\|L\| = \|T\| \le \operatorname{tr}(T) = \int_H \|x\|_H^2 \, d\rho_X(x) \le \kappa^2.$$
For any $\omega \in H$, it is easy to prove the following isometry property,
$$\|S_\rho \omega\|_\rho = \|\sqrt{T}\, \omega\|_H.$$
Moreover, according to the singular value decomposition of a compact operator, one can prove
Similarly, for all $g \in L^2_{\rho_X}$, there holds
$$\|S_\rho^* g\|_H = \|\sqrt{L}\, g\|_\rho.$$
We define the (normalized) sampling operator $S_{\mathbf{x}} : H \to \mathbb{R}^n$ by
$$(S_{\mathbf{x}} \omega)_i = \frac{1}{\sqrt{n}} \langle \omega, x_i \rangle_H, \quad i \in [n],$$
where the norm in $\mathbb{R}^n$ is the usual Euclidean norm. Its adjoint operator $S_{\mathbf{x}}^* : \mathbb{R}^n \to H$, defined by $\langle S_{\mathbf{x}}^* \mathbf{y}, \omega \rangle_H = \langle \mathbf{y}, S_{\mathbf{x}} \omega \rangle_{\mathbb{R}^n}$ for $\mathbf{y} \in \mathbb{R}^n$, is thus given by
$$S_{\mathbf{x}}^* \mathbf{y} = \frac{1}{\sqrt{n}} \sum_{i=1}^n y_i x_i.$$
For notational simplicity, we also denote $\bar{\mathbf{y}} = \frac{1}{\sqrt{n}}(y_1, \dots, y_n)^\top$. Moreover, we can define the empirical covariance operator $T_{\mathbf{x}} : H \to H$ such that $T_{\mathbf{x}} = S_{\mathbf{x}}^* S_{\mathbf{x}}$. Obviously,
$$T_{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^n \langle \cdot, x_i \rangle_H\, x_i.$$
Denote by $K_{\mathbf{x}\tilde{\mathbf{x}}}$ the matrix with its $(i,j)$-th entry given by $\langle x_i, \tilde{x}_j \rangle_H$ for any two input sets $\mathbf{x} = \{x_i\}_{i=1}^{n_1}$ and $\tilde{\mathbf{x}} = \{\tilde{x}_j\}_{j=1}^{n_2}$. Obviously, $S_{\mathbf{x}} S_{\mathbf{x}}^* = \frac{1}{n} K_{\mathbf{x}\mathbf{x}}$.
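For concreteness, when the input space is $\mathbb{R}^d$ with the Euclidean inner product, the matrix of pairwise inner products is the usual Gram matrix. The following is a minimal sketch (function names are ours):

```python
import numpy as np

def gram_matrix(X, Xt):
    """(i, j)-th entry is the inner product <x_i, x~_j>; for row-wise
    vectors in R^d this is simply the matrix product X @ Xt.T."""
    return np.asarray(X) @ np.asarray(Xt).T

def kernel_matrix(X, Xt, k):
    """Generic version for an arbitrary kernel / inner product k(., .)."""
    return np.array([[k(xi, xj) for xj in Xt] for xi in X])
```

The two agree when `k` is the Euclidean inner product; `kernel_matrix` also covers the kernel-methods setting mentioned later, where only evaluations of $k(\cdot,\cdot)$ are available.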
Problem (1) is equivalent to
2.2 Kernel Conjugate Gradient Methods with Projection
In this subsection, we introduce KCGM with solutions restricted to a closed subspace $S$ of $H$. Let $P$ be the projection operator with range $S$. As noted in earlier work, a solution of the empirical risk minimization over $S$ is given by $\hat{\omega} = P \hat{\nu}$, with $\hat{\nu}$ such that
$$P S_{\mathbf{x}}^* S_{\mathbf{x}} P \hat{\nu} = P S_{\mathbf{x}}^* \bar{\mathbf{y}}. \tag{13}$$
Note that, as $P$ is a projection, $(S_{\mathbf{x}} P)^* = P S_{\mathbf{x}}^*$. Thus, (13) can be viewed as a normalized equation of $S_{\mathbf{x}} P \omega = \bar{\mathbf{y}}$. Motivated by [15, 4], we study the following conjugate gradient type algorithm applied to this normalized equation. For notational simplicity, we let $A = S_{\mathbf{x}} P$
and write $\mathcal{K}_t$ to mean the Krylov subspace $\mathcal{K}_t(A^* A, A^* \bar{\mathbf{y}})$.
Algorithm 1 (Projected-KCGM).
Here, $\mathcal{K}_t(B, c)$ is the so-called Krylov subspace, defined for an operator $B$ and a vector $c$ as
$$\mathcal{K}_t(B, c) = \operatorname{span}\{c, Bc, \dots, B^{t-1} c\} = \{p(B)\, c : p \in \mathcal{P}_{t-1}\},$$
where $\mathcal{P}_{t-1}$ denotes the set of real polynomials of degree at most $t-1$.
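In matrix form (finite-dimensional inputs, no projection), the iteration can be sketched as conjugate gradient applied to the normal equation $A^\top A x = A^\top b$: after $t$ steps the iterate lies in the $t$-th order Krylov subspace, and $t$ itself plays the role of the regularization parameter. This is a hedged illustration under those assumptions, not the paper's exact pseudocode:

```python
import numpy as np

def cg_normal_eq(A, b, t_max, tol=1e-12):
    """Conjugate gradient on the normal equation A^T A x = A^T b.
    The t-th iterate minimizes the normal-equation objective over the
    t-th order Krylov subspace K_t(A^T A, A^T b); stopping at a small
    t_max realizes early-stopping regularization."""
    x = np.zeros(A.shape[1])
    r = A.T @ b            # residual of the normal equation
    p = r.copy()
    rs = r @ r
    for _ in range(t_max):
        if rs < tol:       # converged: further steps would only refit noise
            break
        Ap = A.T @ (A @ p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

With a large `t_max` this recovers the least-squares solution; truncating it early gives the early-stopped estimator studied in the paper.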
Different choices of the subspace $S$ correspond to different algorithms. In particular, when $S = H$, the algorithm is the classical KCGM. In this paper, we will set
$$S = \operatorname{span}\Big\{\sum_{i=1}^n G_{ji} x_i : 1 \le j \le m\Big\},$$
where $G \in \mathbb{R}^{m \times n}$ is a random matrix, or
$$S = \operatorname{span}\{\tilde{x}_1, \dots, \tilde{x}_m\},$$
with each $\tilde{x}_j$ chosen randomly from $\{x_1, \dots, x_n\}$. The following examples provide numerical realizations of Algorithm 1, considering randomized sketches, Nyström-subsampling sketches, and the non-sketching regime.
Example 2.1 (Randomized sketches).
Let , and be a matrix in . Let be the matrix such that with . Denote and In this case, Algorithm 1 is equivalent to with given by
We call this type of algorithm sketched-KCGM.
Example 2.2 (Subsampling sketches).
In Nyström-subsampling sketches, with each drawn randomly following a distribution from . Let be the matrix such that with . Denote and In this case, Algorithm 1 is equivalent to with given by
We call this algorithm Nyström-KCGM.
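The computational benefit of the Nyström variant is that the empirical problem involves only the $n \times m$ matrix of inner products with the selected points. The following hedged sketch illustrates this reduction (names are ours; `np.linalg.lstsq` is a stand-in for the CG iterations with early stopping that the algorithm actually uses):

```python
import numpy as np

def nystrom_reduced_fit(K, y, idx):
    """Restrict the estimator to the span of the m selected points, so the
    fitted function at the training inputs is K[:, idx] @ a. The empirical
    least-squares problem then only involves the n x m matrix K_nm, whose
    normal equation K_nm^T K_nm a = K_nm^T y is 'solved' iteratively with
    early stopping in the paper; lstsq is used here as a stand-in."""
    Knm = K[:, idx]
    a, *_ = np.linalg.lstsq(Knm, y, rcond=None)
    return a
```

The returned coefficient vector has only $m$ entries, which is the source of the time and space savings over classic KCGM.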
In all the above examples, in order to execute the algorithms one only needs to know how to compute the inner product $\langle x, x' \rangle_H$ for any two points $x, x'$, which is the case in many settings, such as learning with kernel methods.
In general, since the computation of the sketched kernel matrices can be parallelized, sketched/Nyström KCGM after $t$ iterations has lower computational costs, in both time and space, than non-sketched KCGM. As shown both in theory and in our numerical results, the total number of iterations needed to achieve the best performance is typically small for sketched/Nyström KCGM.
Classical or sketched kernel conjugate gradient type algorithms have previously been proposed for solving the penalized empirical risk minimization. In contrast, Algorithm 1 “solves” the (unpenalized) empirical risk minimization and does not involve any explicit penalty; consequently, we do not need to tune a penalty parameter. The best generalization ability of Algorithm 1 is instead ensured by early stopping of the procedure, using a suitable stopping rule.
The proofs for the three examples will be given in Subsection 4.1.
3 Main Results
In this section, we first introduce some common assumptions from statistical learning theory, and then present our statistical results for sketched/Nyström-KCGM and classical KCGM.
There exist positive constants $Q$ and $M$ such that for all $l \in \mathbb{N}$ with $l \ge 2$,
$$\int_{\mathbb{R}} |y|^l \, d\rho(y \mid x) \le \frac{1}{2}\, l!\, M^{l-2} Q^2,$$
$\rho_X$-almost surely. Furthermore, for some $B > 0$, $f_\rho$ satisfies
Obviously, Assumption 1 implies that the regression function $f_\rho$ is bounded almost surely, as
$$|f_\rho(x)| \le \left(\int_{\mathbb{R}} y^2 \, d\rho(y \mid x)\right)^{1/2} \le Q.$$
The regression function $f_\rho$ satisfies the following Hölder source condition:
$$f_\rho = L^\zeta g_0, \qquad \|g_0\|_\rho \le R.$$
Here, $\zeta$ and $R$ are non-negative numbers.
Assumption 2 relates to the regularity/smoothness of $f_\rho$. The bigger $\zeta$ is, the stronger the assumption is and the smoother $f_\rho$ is, as the range of $L^{\zeta_1}$ is contained in the range of $L^{\zeta_2}$ whenever $\zeta_1 \ge \zeta_2$. In particular, when $\zeta \ge 1/2$, there exists some $\omega_\rho \in H$ such that $f_\rho(x) = \langle \omega_\rho, x \rangle_H$ almost surely, while for $\zeta = 0$ the assumption holds trivially.
For some $\gamma \in [0, 1]$ and $c_\gamma > 0$, $T$ satisfies
$$\mathcal{N}(\lambda) := \operatorname{tr}\big(T (T + \lambda I)^{-1}\big) \le c_\gamma \lambda^{-\gamma}, \qquad \forall \lambda > 0. \tag{21}$$
Assumption 3 characterizes the capacity of $H$. The left-hand side of (21) is called the effective dimension $\mathcal{N}(\lambda)$. As $T$ is a trace-class operator, Condition (21) is trivially satisfied with $\gamma = 1$ (which is called the capacity-independent case). Furthermore, it is satisfied with a general $\gamma \in (0, 1)$
if the eigenvalues $\{\sigma_i\}_i$ of $T$ satisfy a polynomial decay $\sigma_i \lesssim i^{-1/\gamma}$.
We refer to previous works for more comments on the above assumptions.
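The effective dimension is cheap to evaluate once the spectrum is available; the small sketch below (our own helper) makes the definition concrete.

```python
import numpy as np

def effective_dimension(sigmas, lam):
    """N(lambda) = tr(T (T + lam I)^{-1}) = sum_i s_i / (s_i + lam),
    computed from the eigenvalues s_i of the covariance operator T."""
    sigmas = np.asarray(sigmas, dtype=float)
    return float(np.sum(sigmas / (sigmas + lam)))

# A polynomial eigenvalue decay s_i ~ i^{-1/gamma} yields
# N(lambda) = O(lambda^{-gamma}), matching Assumption 3.
```

Note that $\mathcal{N}(\lambda)$ grows as $\lambda$ decreases, and is bounded by the number of nonzero eigenvalues.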
3.2 General Results for Kernel Conjugate Gradient Method with Projection
The following results provide convergence guarantees for general projected-KCGM with a data-dependent stopping rule.
Convergence results with respect to different norms arise in statistical learning theory and in inverse problems. In statistical learning theory, one is typically interested in the generalization ability, measured in terms of the excess risk; in inverse problems, one is interested in convergence within the space $H$.
Theorem 3.1 asserts that projected-KCGM converges optimally if the projection error is small enough. Condition (22) is satisfied by random projections induced by randomized sketches or Nyström subsampling if the sketching dimension is large enough, as shown in Section 4. Thus we have the following corollaries for sketched and Nyström KCGM.
3.3 Results for Kernel Conjugate Gradient Methods with Randomized Sketches
In this subsection, we state optimal convergence results with respect to different norms for KCGM with randomized sketches from Example 2.1.
We assume that the sketching matrix $G$ satisfies the following concentration property: for any finite subset $E$ of $\mathbb{R}^n$ and for any $z \in E$,
Here, $c$ and $c'$ are universal non-negative constants.
Many matrices satisfy this concentration property.
1) Subgaussian sketches. Matrices with i.i.d. subgaussian (such as Gaussian or Bernoulli) entries satisfy (25) with some universal constants. More generally, if the rows of $G$ are independent (scaled) copies of an isotropic vector, then $G$ also satisfies (25). Recall that a random vector $v \in \mathbb{R}^n$ is isotropic if $\mathbb{E}\langle v, u \rangle^2 = c_0 \|u\|_2^2$ for all $u \in \mathbb{R}^n$,
for some constant $c_0 > 0$.
2) Randomized orthogonal system (ROS) sketches. As noted previously, matrices satisfying the restricted isometry property from compressed sensing [6, 12], with randomized column signs, satisfy (25). In particular, random partial Fourier matrices and random partial Hadamard matrices with randomized column signs satisfy (25) for some universal constant.
In this regime, the minimal sketching dimension is proportional to the effective dimension up to a logarithmic factor, which we believe is unimprovable.
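A ROS sketch can be sketched in a few lines, assuming the sample size is a power of two (a fast Hadamard transform would replace the explicit matrix in practice; function names are ours):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def ros_sketch(n, m, rng):
    """Randomized orthogonal system sketch: randomize the column signs,
    then pick m rows of the orthonormal Hadamard matrix at random,
    rescaled so that E[G^T G] = I."""
    D = rng.choice([-1.0, 1.0], size=n)          # random column signs
    rows = rng.choice(n, size=m, replace=False)  # random row subset
    H = hadamard(n) / np.sqrt(n)                 # orthonormal system
    return np.sqrt(n / m) * H[rows] * D          # broadcasting applies diag(D)
```

Because the Hadamard matrix is orthogonal with $\pm 1$ entries, the sketch can be applied in $O(n \log n)$ time via the fast Hadamard transform, rather than the $O(n^2)$ cost of a dense Gaussian sketch.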
According to Corollary 3.2, sketched-KCGM can generalize optimally if the sketching dimension is large enough.
3.4 Results for Kernel Conjugate Gradient Methods with Nyström Sketches
In this subsection, we provide optimal rates with respect to different norms for KCGM with Nyström sketches from Example 2.2.
The requirement on the sketch dimension of Nyström-KCGM does not depend on the probability constant, but it is stronger than that of sketched-KCGM, ignoring logarithmic factors.
In the above, we only consider plain Nyström subsampling. Using approximate leverage-score (ALS) Nyström subsampling [35, 10], the projection-dimension condition can be further improved to (26); see Section 4 for details. However, in that case we need to compute the approximate leverage scores with an appropriate pseudo-regularization parameter.
3.5 Optimal Rates for Classical Kernel Conjugate Gradient Methods
As a direct corollary, we derive optimal rates for classical KCGM, covering the non-attainable case.
To the best of our knowledge, the above results provide the first optimal capacity-dependent rates for KCGM in the non-attainable case, i.e., when the expected risk minimization has no solution in $H$. This answers a long-standing open question.
Convergence results for kernel partial least squares under different stopping rules have been derived in [22, 30], but the derived optimal rates hold only for the attainable case. Our analysis could be extended to this related type of algorithm with similar stopping rules.
We present some numerical results to illustrate our theory in the setting of learning with kernel methods. In all the simulations, we constructed training data from the regression model $y = f_\rho(x) + \epsilon$, where the input $x$ is uniformly drawn from $[0, 1]$ and $\epsilon$
is Gaussian noise with zero mean and fixed standard deviation. By construction, the regression function $f_\rho$ belongs to the first-order Sobolev space. In all the simulations, the RKHS is associated with a first-order Sobolev kernel. As noted in [37, Example 3] for the Sobolev kernel, Assumption 3 is satisfied with $\gamma = 1/2$. As suggested by our theory, we set the projection dimension accordingly for KCGM with ROS sketches, based on the fast Hadamard transform, and for KCGM with plain Nyström sketches. We performed simulations for a range of sample sizes $n$ so as to study scaling with the sample size. For each $n$, we performed 100 trials, and both the squared prediction errors and the training errors averaged over these 100 trials were computed. The errors versus the number of iterations are reported in Figure 1. For each $n$, the minimal squared prediction error over the first iterations is computed, and these errors versus the sample size are reported in Figure 2,
in order to compare with the state-of-the-art algorithm kernel ridge regression (KRR). From Figure 1, we see that the squared prediction errors decrease over the first iterations and then increase, for both sketched and Nyström KCGM. This indicates that the number of iterations has a regularization effect. Our theory predicts that the squared prediction loss should tend to zero at the same rate as that of KRR; Figure 2 confirms this prediction.
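A simulation of this kind can be set up as follows. The regression function below is a hypothetical stand-in lying in the first-order Sobolev space (the paper's exact choice of $f_\rho$ is not reproduced here), and the noise level `sigma` is an assumed parameter:

```python
import numpy as np

def make_data(n, sigma=0.5, rng=None):
    """Toy regression data y = f(x) + noise with x ~ Uniform[0, 1].
    f(x) = |x - 1/2| - 1/4 is a stand-in regression function in the
    first-order Sobolev space; the paper's exact f_rho may differ."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.abs(x - 0.5) - 0.25 + sigma * rng.standard_normal(n)
    return x, y
```

Repeating such draws over several sample sizes and averaging the squared prediction error over independent trials reproduces the experimental protocol described above.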
All the results stated in this section will be proved in Section 4.
In this section and the appendix, we provide all the proofs.
4.1 Proof for Subsection 2.2
Let be a compact operator from the Euclidean space to such that . It is easy to see that . Let and be the matrix such that . As is the projection operator onto then
For any polynomial function we have that
Noting that , and using Lemma 4.2 from the coming subsection,
Introducing with (27),
Noting that , and applying Lemma 4.2,
where we denote
Using , which implies and for any
we get from (28) that
where we used Lemma 4.2 for the last equality.
Note that the solution of (15) is given by , with
which is equivalent to , with
Proof for Example 2.1.
For general randomized sketches, . In this case,
and . ∎
Proof for Example 2.2.
In Nyström subsampling, is a subset of size drawn randomly following a distribution from , , and In this case, and . ∎
Proof for Example 2.3.
For the ordinary non-sketching regimes, and Denote Then
is equivalent to with given by