1 Introduction
Let the input space be a separable Hilbert space, with its inner product denoted in the usual way, and let the output space be the set of real numbers. Let an unknown probability measure be given on the product of the input and output spaces. In this paper, we study the following expected risk minimization,
(1)
where the measure is known only through a sample of finite size, drawn independently and identically distributed (i.i.d.) according to it.
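For concreteness, the minimization in (1) is of the standard least-squares form below; the symbols used here (H for the input space, ρ for the measure, ω for the hypothesis) are chosen for illustration rather than being the paper's own notation:

```latex
\min_{\omega \in H}\ \widetilde{\mathcal{E}}(\omega),
\qquad
\widetilde{\mathcal{E}}(\omega) \;=\; \int_{H \times \mathbb{R}} \big(\langle \omega, x\rangle_H - y\big)^2 \, d\rho(x, y).
```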
The above regression setting covers nonparametric regression over a reproducing kernel Hilbert space (Cucker & Zhou, 2007; Steinwart & Christmann, 2008), and it is close to functional regression (Ramsay, 2006) and linear inverse problems (Engl et al., 1996). A basic algorithm for the problem is ridge regression, together with its generalization, spectral algorithms. Such algorithms can be viewed as solving an empirical linear equation with the empirical covariance operator replaced by a regularized one; see (Caponnetto & Yao, 2006; Bauer et al., 2007; Gerfo et al., 2008; Lin et al., 2018) and references therein. Here, the regularization is used to control the complexity of the solution against overfitting and to achieve the best generalization ability. The function/estimator generated by a classic regularized algorithm lies in a data-dependent subspace of the hypothesis space. More often, for some specific algorithms, the search for an estimator is restricted to a different (and possibly smaller) subspace, which leads to regularized algorithms with projection. Such approaches have computational advantages in nonparametric regression with kernel methods (Williams & Seeger, 2000; Smola & Schölkopf, 2000). Typically, given a subsample/sketch dimension, the subspace is either spanned by a subset of points chosen randomly from the input set, or generated by a general randomized matrix whose rows are drawn according to some distribution. The resulting algorithms are called Nyström regularized algorithms and sketched-regularized algorithms, respectively.
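As an illustration of the two constructions, the following Python sketch builds, in a finite-dimensional simulation, a Nyström subspace from a random subsample of the inputs and a sketched subspace from a random sketch matrix; the variable names and the Gaussian sketch are choices made for this example, not the paper's prescription.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 50, 20               # sample size, input dimension, subsample/sketch dimension
X = rng.standard_normal((n, d))     # rows play the role of the sampled inputs

# Nystrom subspace: spanned by m inputs chosen uniformly at random from the sample.
idx = rng.choice(n, size=m, replace=False)
nystrom_basis = X[idx]              # (m, d); the estimator is searched in the span of these rows

# Sketched subspace: spanned by the rows of G @ X, where G is a randomized sketch matrix.
G = rng.standard_normal((m, n)) / np.sqrt(m)
sketch_basis = G @ X                # (m, d)
```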
The starting points of this paper are recent papers (Bach, 2013; Alaoui & Mahoney, 2015; Yang et al., 2015; Rudi et al., 2015; Myleiko et al., 2017) where convergence results on Nyström/sketched regularized algorithms for learning with kernel methods are given. Particularly, within the fixed design setting, i.e., where the input points are deterministic while the outputs are treated as random, convergence results have been derived in (Bach, 2013; Alaoui & Mahoney, 2015) for Nyström ridge regression and in (Yang et al., 2015) for sketched ridge regression. Within the random design setting (which is more meaningful in statistical learning theory (Hsu et al., 2014)) and under a regularity/smoothness condition on the target function (Smale & Zhou, 2007), optimal statistical results on generalization error bounds (excess risks) have been obtained in (Rudi et al., 2015) for Nyström ridge regression. The latter results were further generalized in (Myleiko et al., 2017) to a general Nyström regularized algorithm.
Although results have been developed for sketched ridge regression in the fixed design setting, it is still unclear if one can get statistical results for general sketched-regularized algorithms in the random design setting. Besides, all the derived results, either for sketched or Nyström regularized algorithms, are only for the attainable case, i.e., the case where the expected risk minimization (1) has at least one solution in the hypothesis space. Moreover, they saturate (Bauer et al., 2007) at a critical value, meaning that they cannot lead to better convergence rates even with a smoother target function. Motivated by this, in this paper, we study statistical results of projected-regularized algorithms for least-squares regression over a separable Hilbert space within the random design setting.
We first extend the analysis in (Lin et al., 2018) for classic regularized algorithms to projected-regularized algorithms, and prove statistical results with respect to a broader class of norms. We then show that optimal rates can be retained for sketched-regularized algorithms, provided that the sketch dimension is proportional to the effective dimension (Zhang, 2005) up to a logarithmic factor. As a byproduct, we obtain similar results for Nyström regularized algorithms.
Interestingly, our results are the first ones with optimal, distribution-dependent rates that do not have any saturation effect for sketched/Nyström regularized algorithms, considering both the attainable and non-attainable cases. In our proof, we naturally integrate proof techniques from (Smale & Zhou, 2007; Caponnetto & De Vito, 2007; Rudi et al., 2015; Myleiko et al., 2017; Lin et al., 2018). Our novelties lie in a new estimate of the projection error for sketched-regularized algorithms, a novel analysis to overcome the saturation effect, and a refined analysis for Nyström regularized algorithms; see Section 4 for details.
2 Learning with Projected-regularized Algorithms
In this section, we introduce some notations as well as auxiliary operators, and present projected-regularized algorithms.
2.1 Notations and Auxiliary Operators
Let the induced marginal measure on the input space and the conditional probability measure on the output space (given an input) be those induced by the unknown measure. For simplicity, we assume that the support of the marginal measure is compact and that there exists a constant such that
(2) 
Define the hypothesis space as the set of linear functions induced by elements of the input space. Denote by the space of square-integrable functions on the input space with respect to the marginal measure, with the corresponding norm.
For a given bounded operator, the operator norm is denoted and defined in the usual way. Further shorthand notation for index sets and real numbers is used throughout.
Consider the linear evaluation map from the hypothesis space to the space of square-integrable functions, which is bounded under Assumption (2). Furthermore, we consider its adjoint operator, the covariance operator on the hypothesis space, and the integral operator on the space of square-integrable functions; these operators are closely related to one another through the evaluation map. Under Assumption (2), the covariance and integral operators can be shown to be positive trace class operators (and hence compact):
(3) 
For any element of the hypothesis space, it is easy to prove the following isometry property (Bauer et al., 2007),
(4) 
Moreover, according to the singular value decomposition of a compact operator, one can prove that
(5) 
We define the (modified) sampling operator, which maps an element of the hypothesis space to the vector of its (suitably normalized) values at the sample points, where the norm on the sample space is the usual Euclidean norm. Its adjoint operator is defined accordingly. For notational simplicity, we introduce shorthand for the resulting empirical quantities. Moreover, we can define the empirical covariance operator as the composition of the adjoint of the sampling operator with the sampling operator itself. By Assumption (2), similar to (3), we have
(6) 
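The following Python sketch realizes these objects in a finite-dimensional simulation; the 1/sqrt(n) normalization of the sampling operator is an assumption made for illustration, since the exact convention is not restated here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))     # sampled inputs as rows
y = rng.standard_normal(n)          # sampled outputs

def sampling_op(w):
    """(Modified) sampling operator: maps w to the scaled evaluation vector (<w, x_i>)_i / sqrt(n)."""
    return X @ w / np.sqrt(n)

def sampling_op_adjoint(v):
    """Adjoint of the sampling operator: maps v to (1/sqrt(n)) * sum_i v_i x_i."""
    return X.T @ v / np.sqrt(n)

# Empirical covariance operator: composition of the adjoint with the sampling operator,
# i.e. (1/n) * sum_i x_i x_i^T in matrix form.
T_emp = X.T @ X / n
```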
It is easy to see that Problem (1) is equivalent to
(7) 
The function that minimizes the expected risk over all measurable functions is the regression function (Cucker & Zhou, 2007; Steinwart & Christmann, 2008), defined as,
(8) 
A simple calculation shows that the following well-known fact holds (Cucker & Zhou, 2007; Steinwart & Christmann, 2008): the expected risk of any function equals its squared distance (in the induced norm) to the regression function plus the expected risk of the regression function itself. Then it is easy to see that (7) is equivalent to minimizing the distance to the regression function over the hypothesis space. Under Assumption (2), the hypothesis space is a subspace of the space of square-integrable functions. Using the projection theorem, one can prove that a solution of Problem (7) is the projection of the regression function onto the closure of the hypothesis space, and moreover, for all measurable functions (Lin & Rosasco, 2017),
(9) 
and
(10) 
Note that the projection of the regression function does not necessarily lie in the hypothesis space.
Throughout this paper, we consider a closed, finite-dimensional subspace of the hypothesis space, together with the orthogonal projection operator onto it (or onto the corresponding subspace of the space of square-integrable functions).
2.2 Projected-regularized Algorithms
In this subsection, we motivate and introduce projected-regularized algorithms.
The expected risk in (1) cannot be computed exactly. It can only be approximated through the empirical risk over the sample. A first idea to deal with the problem is to replace the objective function in (1) with the empirical risk. Moreover, we restrict the solution to the chosen subspace. This leads to the projected empirical risk minimization. A simple calculation shows that a solution of this problem satisfies a projected normal equation involving the empirical covariance operator. Motivated by classic (iterated) ridge regression, we replace the empirical covariance operator with a regularized one, which leads to the following projected (iterated) ridge regression.
Algorithm 1.
The projected (iterated) ridge regression algorithm of a given order over the samples and the chosen subspace is given by the estimator defined below.¹

¹Let a self-adjoint, compact operator over a separable Hilbert space be given. A function of such an operator is defined by spectral calculus: suppose a set of normalized eigenpairs is given, with the eigenfunctions forming an orthonormal basis of the space; then
(11)
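A minimal finite-dimensional sketch of the projected (iterated) ridge regression estimator is given below, assuming the standard iterated Tikhonov recursion and a 1/n normalization of the empirical operators; the function and variable names are illustrative.

```python
import numpy as np

def projected_iterated_ridge(X, y, P, lam, order=1):
    """Sketch of projected (iterated) ridge regression.

    X: (n, d) inputs, y: (n,) outputs, P: (d, d) orthogonal projection matrix,
    lam: regularization parameter, order: number of Tikhonov iterations.
    """
    n, d = X.shape
    A = P @ (X.T @ X / n) @ P            # projected empirical covariance operator
    b = P @ (X.T @ y / n)                # projected empirical moment
    w = np.zeros(d)
    for _ in range(order):               # iterated Tikhonov: w <- (A + lam I)^{-1} (b + lam w)
        w = np.linalg.solve(A + lam * np.eye(d), b + lam * w)
    return w

# Usage with a random 4-dimensional projection subspace.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 10)), rng.standard_normal(100)
Q, _ = np.linalg.qr(rng.standard_normal((10, 4)))   # orthonormal basis of the subspace
w_hat = projected_iterated_ridge(X, y, Q @ Q.T, lam=0.1, order=2)
```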
Remark 1.
1) Our results not only hold for projected ridge regression, but also for general projected-regularized algorithms, in which the regularization is described by a general filter function. A class of functions of the regularization parameter is called a family of filter functions with a given qualification if there exist positive constants such that
(12) 
and
(13) 
2) A simple calculation shows that
(14) 
Thus, the function above is a filter function with the corresponding qualification and constants.
When the order equals one, it is the filter function of classic ridge regression, and the algorithm is projected ridge regression.
3) Another typical filter function studied in the literature is the one corresponding to principal component (spectral cut-off) regularization, in which an indicator function keeps only the part of the spectrum above the regularization threshold.
In this case, the constants in the filter conditions can be taken to be one, and the qualification could be any positive number. A numerical sketch of these filter functions is given after this remark.
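The following Python sketch illustrates the filter functions discussed in this remark (ridge, iterated ridge, and spectral cut-off) and how a filter is turned into an estimator via spectral calculus; the helper names are mine and the formulas follow the standard definitions from the spectral-algorithm literature.

```python
import numpy as np

def ridge_filter(u, lam):
    """Classic ridge (Tikhonov) filter: G_lam(u) = 1 / (u + lam)."""
    return 1.0 / (u + lam)

def iterated_ridge_filter(u, lam, tau=2):
    """Iterated ridge filter of order tau: (1 - (lam/(u+lam))**tau) / u, with limit tau/lam at u = 0."""
    u = np.asarray(u, dtype=float)
    out = np.full(u.shape, tau / lam)
    nz = u > 0
    out[nz] = (1.0 - (lam / (u[nz] + lam)) ** tau) / u[nz]
    return out

def spectral_cutoff_filter(u, lam):
    """Principal component (spectral cut-off) filter: G_lam(u) = 1/u if u >= lam, else 0."""
    return np.where(u >= lam, 1.0 / np.maximum(u, lam), 0.0)

def apply_filter(A, b, lam, flt, **kw):
    """Estimator via spectral calculus: G_lam(A) b for a symmetric positive semi-definite matrix A."""
    s, V = np.linalg.eigh(A)
    return V @ (flt(np.maximum(s, 0.0), lam, **kw) * (V.T @ b))
```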
In the above, the regularization parameter needs to be chosen appropriately in order to achieve the best performance. Throughout this paper, we assume that it lies in a prescribed range.
The performance of an estimator can be measured in terms of the excess risk (generalization error), which, according to (10), is exactly the squared distance to the projection of the regression function. In the attainable case, i.e., when the expected risk minimization has a solution in the hypothesis space (in this case, the solution with minimal norm is taken as the reference), the performance can also be measured in terms of the norm of the hypothesis space, which is closely related to the excess risk according to (5). In what follows, we will measure the performance of an estimator in terms of a broader class of norms, indexed by a parameter for which the corresponding quantity is well defined. One should keep in mind that all the derived results also hold if we replace the projection of the regression function with the minimal-norm solution in the attainable case. We will report these results in a longer version of this paper. Convergence with respect to different norms has strong background in convex optimization, inverse problems, and statistical learning theory. Particularly, convergence with respect to target function values and convergence in norm have been studied in convex optimization. Interestingly, convergence in norm can imply convergence in target function values (although the derived rate is not optimal), while the converse is not true.
3 Convergence Results
In this section, we first introduce some basic assumptions and then present convergence results for projected-regularized algorithms. Finally, we give results for sketched/Nyström regularized algorithms.
3.1 Assumptions
In this subsection, we introduce three standard assumptions made in statistical learning theory (Steinwart & Christmann, 2008; Cucker & Zhou, 2007; Lin et al., 2018). The first assumption relates to a moment condition on the output value.
Assumption 1.
There exist positive constants and such that for all with
(15) 
almost surely.
Typically, the above assumption is satisfied if the output is bounded almost surely, or if the output is a linear function of the input plus a Gaussian noise term with zero mean that is independent of the input. Condition (15) implies that the regression function is bounded almost surely, using the Cauchy-Schwarz inequality.
The next assumption relates to the regularity/smoothness of the target function.
Assumption 2.
satisfies
(16) 
and the following Hölder source condition
(17) 
Here, are nonnegative numbers.
Condition (16) is trivially satisfied if the output is bounded almost surely. Moreover, when making a consistency assumption, i.e., assuming that the regression function lies in the closure of the hypothesis space, as in (Smale & Zhou, 2007; Caponnetto, 2006; Caponnetto & De Vito, 2007; Steinwart et al., 2009) for kernel-based nonparametric regression, it is satisfied as well. Condition (17) characterizes the regularity of the target function (Smale & Zhou, 2007). A bigger regularity exponent corresponds to a higher regularity and a stronger assumption, and it can lead to a faster convergence rate. Particularly, when the exponent is at least one half, the target function lies in the hypothesis space (Steinwart & Christmann, 2008). This means that the expected risk minimization (1) has at least one solution in the hypothesis space, which is referred to as the attainable case.
Finally, the last assumption relates to the capacity of the hypothesis space.
Assumption 3.
For some and , satisfies
(18) 
The left-hand side of (18) is called the degrees of freedom (Zhang, 2005), or the effective dimension (Caponnetto & De Vito, 2007). Assumption 3 is always true with the capacity exponent equal to one, since the covariance operator is a trace class operator; this is referred to as the capacity-independent setting. Assumption 3 with a smaller capacity exponent allows one to derive better rates. It is satisfied, e.g., if the eigenvalues of the covariance operator satisfy a polynomially decaying condition, or with capacity exponent zero if the covariance operator is of finite rank.
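As a concrete illustration, the effective dimension can be computed directly from the eigenvalues of the covariance operator; the sketch below, with polynomially decaying eigenvalues chosen for the example, shows its growth as the regularization parameter decreases.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """Effective dimension (degrees of freedom): trace(T (T + lam I)^{-1}) = sum_i s_i / (s_i + lam)."""
    return float(np.sum(eigvals / (eigvals + lam)))

# Example with polynomially decaying eigenvalues s_i = i**(-1/gamma), here gamma = 0.5.
gamma = 0.5
s = np.arange(1, 10001, dtype=float) ** (-1.0 / gamma)
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, effective_dimension(s, lam))     # grows roughly like lam**(-gamma)
```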
3.2 Results for Projected-regularized Algorithms
We are now ready to state our first result as follows. Throughout this paper, we use a generic positive constant that depends only on the parameters of the assumptions, and it could be different at each appearance. Moreover, standard notation is used for inequalities that hold up to such a constant.
Theorem 1.
The above result provides high-probability error bounds with respect to a family of norms for projected-regularized algorithms. The upper bound consists of three terms. The first term depends on the regularity parameter, and it arises from estimating the bias. The second term depends on the sample size, and it arises from estimating the variance. The third term depends on the projection error. Note that there is a trade-off between the bias and variance terms. Ignoring the projection error, solving this trade-off leads to the best choice of the regularization parameter and to the following results.
Corollary 2.
Under the assumptions and notations of Theorem 1, with the regularization parameter chosen accordingly, the following holds with probability at least the prescribed confidence level.
1) If
(22) 
2) If and ,
(23) 
3) If
(24) 
Comparing the derived upper bound for projected-regularized algorithms with that for classic regularized algorithms in (Lin et al., 2018), we see that the former has an extra term, which is caused by the projection. The above result asserts that projected-regularized algorithms perform similarly to classic regularized algorithms if the projection operator is well chosen, so that the projection error is small enough.
In a special case, we get the following result.
Corollary 3.
Under the assumptions and notations of Theorem 1, let and . Then with probability at least ,
(25) 
3.3 Results for Sketched-regularized Algorithms
In this subsection, we state results for sketched-regularized algorithms.
In sketched-regularized algorithms, the range of the projection operator is the subspace generated by applying a sketch matrix, where the sketch matrix satisfies the following concentration inequality: for any finite subset of vectors and for any confidence level,
(26) 
Here, the constants involved are universal nonnegative constants. Many matrices satisfy this concentration property.
Randomized orthogonal system (ROS) sketches. As noted in (Krahmer & Ward, 2011), a matrix that satisfies the restricted isometry property from compressed sensing, combined with randomized column signs, satisfies (26). Particularly, a random partial Fourier matrix, or a random partial Hadamard matrix, with randomized column signs satisfies (26) for some universal constant. Using ROS sketches has an advantage in computation, since for suitably chosen orthonormal matrices such as the DFT and Hadamard matrices, a matrix-vector product can be executed in near-linear time, in contrast to the time required for the same operation with generic dense sketches.
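A minimal numpy sketch of a ROS-type transform (random signs followed by a fast Walsh-Hadamard transform and row subsampling) is given below; the 1/sqrt(m) scaling and the helper names are choices made for this illustration.

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform of a vector whose length is a power of two (O(n log n))."""
    a = a.copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def ros_sketch(v, m, rng):
    """Apply an m-row ROS sketch: randomized signs, Hadamard transform, and random row subsampling."""
    n = len(v)
    signs = rng.choice([-1.0, 1.0], size=n)       # randomized column signs
    rows = rng.choice(n, size=m, replace=False)   # random subset of rows
    return fwht(signs * v)[rows] / np.sqrt(m)

rng = np.random.default_rng(0)
v = rng.standard_normal(1024)
sv = ros_sketch(v, m=64, rng=rng)   # avoids forming the dense m-by-n sketch matrix explicitly
```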
The following corollary shows that sketched-regularized algorithms have optimal rates provided the sketch dimension is not too small.
Corollary 4.
The above results assert that sketched-regularized algorithms converge optimally, provided the sketch dimension is not too small; in other words, the error caused by projection is negligible when the sketch dimension is large enough. Note that the minimal sketch dimension from the above is proportional to the effective dimension up to a logarithmic factor in the considered case.
Remark 2.
Considering only the case and , (Yang et al., 2015) provides optimal error bounds for sketched ridge regression within the fixed design setting.
3.4 Results for Nyström Regularized Algorithms
As a byproduct of the paper, using Corollary 2, we derive the following results for Nyström regularized algorithms.
Corollary 5.
Remark 3.
1) Considering only a special case of the parameters, (Rudi et al., 2015) provides optimal generalization error bounds for Nyström ridge regression. This result was further extended in (Myleiko et al., 2017) to general Nyström regularized algorithms with a general source assumption indexed by an operator monotone function (but only in the attainable case). Note that, as in classic ridge regression, Nyström ridge regression saturates, i.e., it does not have a better rate even for a larger regularity exponent.
2) In some cases, (Myleiko et al., 2017) also provides certain generalization error bounds for plain Nyström regularized algorithms, but the rates are capacity-independent, and the minimal projection dimension is larger than ours (considering comparable cases for the sake of fairness).
In the above corollary, we consider plain Nyström subsampling. Using ALS Nyström subsampling (Drineas et al., 2012; Gittens & Mahoney, 2013; Alaoui & Mahoney, 2015), we can improve the projection dimension condition to (27).
ALS Nyström Subsampling
For a given regularization level, the leverage scores associated with the sample form a set of nonnegative weights, one per sample point. The approximate leverage scores (ALS) form a set of weights that agree with the true leverage scores up to a constant factor. In the ALS Nyström subsampling regime, each subsampled point is drawn i.i.d. according to a distribution proportional to the approximate leverage scores.
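A small finite-dimensional sketch of leverage-score-based Nyström subsampling follows; the particular definition of the ridge leverage scores used here (with the empirical covariance operator and a 1/n factor) is one common convention and an assumption for this illustration.

```python
import numpy as np

def ridge_leverage_scores(X, lam):
    """Ridge leverage scores l_i(lam) = x_i^T (T_x + lam I)^{-1} x_i / n, where T_x = X^T X / n."""
    n, d = X.shape
    T = X.T @ X / n
    M = np.linalg.solve(T + lam * np.eye(d), X.T)     # columns are (T_x + lam I)^{-1} x_i
    return np.einsum('ij,ji->i', X, M) / n

def als_nystrom_indices(X, lam, m, rng):
    """Draw m indices i.i.d. with probabilities proportional to the (approximate) leverage scores."""
    scores = ridge_leverage_scores(X, lam)
    p = scores / scores.sum()
    return rng.choice(X.shape[0], size=m, replace=True, p=p)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
idx = als_nystrom_indices(X, lam=0.05, m=30, rng=rng)   # Nystrom subspace spanned by X[idx]
```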
Corollary 6.
All the results stated in this section will be proved in the next section.
4 Proof
In this section, we prove the results stated in Section 3. We first give some deterministic estimates and an analytic result. We then give some probabilistic estimates. Applying the probabilistic estimates to the analytic result, we prove the results for projected-regularized algorithms. We finally estimate the projection errors and present the proof for sketched-regularized algorithms.
4.1 Deterministic Estimates
In this subsection, we introduce some deterministic estimates. For notational simplicity, throughout this paper, we denote
We define a deterministic vector as follows,
(30) 
The vector is often called the population function. We introduce the following lemma. The proof is essentially the same as that of Lemma 26 in (Lin & Cevher, 2018), and we thus omit it.
Lemma 7.
The above lemma provides some basic properties of the population function. It will be useful for the proof of our main results. The left-hand side of (31) is often called the true bias.
Using the above lemma and some basic operator inequalities, we can prove the following analytic, deterministic result.
Proposition 8.
The above proposition is key to our proof. The proof of the proposition in one regime borrows ideas from (Smale & Zhou, 2007; Caponnetto & De Vito, 2007; Rudi et al., 2015; Myleiko et al., 2017; Lin et al., 2018), where the key step is an error decomposition from (Lin & Cevher, 2018). Our novelty lies in the proof for the other regime; see the appendix for further details.
4.2 Proof for Projected-regularized Algorithms
To derive total error bounds from Proposition 8, it is necessary to develop probabilistic estimates for the random quantities , , and . We thus introduce the following four lemmas.
Lemma 9 ((Lin & Cevher, 2018)).
Lemma 10.
Let It holds with probability at least
Here, the norm appearing above is the Hilbert-Schmidt norm.
Lemma 11.
Under Assumption 3, let It holds with probability at least
The proof of the above lemmas can be done by simply applying concentration inequalities for sums of Hilbert-space-valued random variables. We refer to (Lin & Rosasco, 2017) for the proofs.
Lemma 12.
With the above probabilistic estimates and the analytic result, Proposition 8, we are now ready to prove the results for projected-regularized algorithms.
Proof of Theorem 1.
4.3 Proof for Sketched-regularized Algorithms
In order to use Corollary 2 for sketched-regularized algorithms, we need to estimate the projection error. The basic idea is to approximate the projection error in terms of its 'empirical' version. The estimate is quite lengthy, and it is divided into several steps.
Lemma 13.
Let the relevant quantities be as defined above. Given a fixed index, assume that