Let the input space be a separable Hilbert space with inner product denoted by , and the output space . Let
be an unknown probability measure on. In this paper, we study the following expected risk minimization,
where the measure is known only through a sample of size , independently and identically distributed (i.i.d.) according to .
The above regression setting covers nonparametric regression over a reproducing kernel Hilbert space (Cucker & Zhou, 2007; Steinwart & Christmann, 2008), and it is close to functional regression (Ramsay, 2006) and linear inverse problems (Engl et al., 1996). A basic algorithm for the problem is ridge regression, together with its generalization, the spectral algorithm. Such algorithms can be viewed as solving an empirical linear equation in which the empirical covariance operator is replaced by a regularized one; see (Caponnetto & Yao, 2006; Bauer et al., 2007; Gerfo et al., 2008; Lin et al., 2018) and references therein. Here, the regularization is used to control the complexity of the solution, to guard against over-fitting and to achieve the best generalization ability.
The function/estimator generated by a classic regularized algorithm lies in the subspace of , where More often, the search for an estimator for some specific algorithms is restricted to a different (and possibly smaller) subspace , which leads to regularized algorithms with projection. Such approaches have computational advantages in nonparametric regression with kernel methods (Williams & Seeger, 2000; Smola & Schölkopf, 2000). Typically, with a subsample/sketch dimension , where is chosen randomly from the input set , or where is a general randomized matrix whose rows are drawn according to a distribution. The resulting algorithms are called the Nyström regularized algorithm and the sketched-regularized algorithm, respectively.
The starting points of this paper are recent papers (Bach, 2013; Alaoui & Mahoney, 2015; Yang et al., 2015; Rudi et al., 2015; Myleiko et al., 2017) where convergence results on Nyström/sketched regularized algorithms for learning with kernel methods are given. In particular, within the fixed design setting, i.e., the inputs are deterministic while the outputs are treated as random, convergence results have been derived in (Bach, 2013; Alaoui & Mahoney, 2015) for Nyström ridge regression and in (Yang et al., 2015) for sketched ridge regression. Within the random design setting (which is more meaningful in statistical learning theory (Hsu et al., 2014)) and involving a regularity/smoothness condition on the target function (Smale & Zhou, 2007), optimal statistical results on generalization error bounds (excess risks) have been obtained in (Rudi et al., 2015) for Nyström ridge regression. The latter results were further generalized in (Myleiko et al., 2017) to a general Nyström regularized algorithm.
Although results have been developed for sketched ridge regression in the fixed design setting, it is still unclear whether one can obtain statistical results for general sketched-regularized algorithms in the random design setting. Besides, all the derived results, either for sketched or Nyström regularized algorithms, are only for the attainable case, i.e., the case that the expected risk minimization (1) has at least one solution in . Moreover, they saturate (Bauer et al., 2007) at a critical value, meaning that they cannot lead to better convergence rates even with a smoother target function. Motivated by these, in this paper, we study statistical results of projected-regularized algorithms for least-squares regression over a separable Hilbert space within the random design setting.
We first extend the analysis in (Lin et al., 2018) for classic-regularized algorithms to projected-regularized algorithms, and prove statistical results with respect to a broader class of norms. We then show that optimal rates can be retained for sketched-regularized algorithms, provided that the sketch dimension is proportional to the effective dimension (Zhang, 2005) up to a logarithmic factor. As a byproduct, we obtain similar results for Nyström regularized algorithms.
Interestingly, our results are the first ones with optimal, distribution-dependent rates that do not have any saturation effect for sketched/Nyström regularized algorithms, considering both the attainable and non-attainable cases. In our proof, we integrate proof techniques from (Smale & Zhou, 2007; Caponnetto & De Vito, 2007; Rudi et al., 2015; Myleiko et al., 2017; Lin et al., 2018). Our novelties lie in a new estimate on the projection error for sketched-regularized algorithms, a novel analysis to overcome the saturation effect, and a refined analysis for Nyström regularized algorithms; see Section 4 for details.
2 Learning with Projected-regularized Algorithms
In this section, we introduce some notation as well as auxiliary operators, and present projected-regularized algorithms.
2.1 Notations and Auxiliary Operators
Let , the induced marginal measure on of , and the conditional probability measure on with respect to and . For simplicity, we assume that the support of is compact and that there exists a constant , such that
Define the hypothesis space Denote the Hilbert space of square-integrable functions from to with respect to , with its norm given by
For a given bounded operator denotes the operator norm of , i.e., . Let the set is denoted by For any real number , , .
Let be the linear map , which is bounded by under Assumption (2). Furthermore, we consider the adjoint operator , the covariance operator given by , and the integral operator given by It can be easily proved that and Under Assumption (2), the operators and can be proved to be positive trace class operators (and hence compact):
For any , it is easy to prove the following isometry property (Bauer et al., 2007),
Moreover, according to the singular value decomposition of a compact operator, one can prove that
We define the (modified) sampling operator by , where the norm in is the usual Euclidean norm. Its adjoint operator defined by for is thus given by For notational simplicity, we let Moreover, we can define the empirical covariance operator such that . Obviously, By Assumption (2), similar to (3), we have
It is easy to see that Problem (1) is equivalent to
A simple calculation shows that the following well-known fact holds (Cucker & Zhou, 2007; Steinwart & Christmann, 2008), for all Then it is easy to see that (7) is equivalent to Under Assumption (2), is a subspace of Using the projection theorem, one can prove that a solution for the problem (7) is the projection of the regression function onto the closure of in , and moreover, for all (Lin & Rosasco, 2017),
Note that is not necessarily in .
Throughout this paper, is a closed, finite-dimensional subspace of , and is the projection operator onto or .
2.2 Projected-regularized Algorithms
In this subsection, we motivate and introduce projected-regularized algorithms.
The expected risk in (1) cannot be computed exactly. It can only be approximated through the empirical risk A first idea to deal with the problem is to replace the objective function in (1) with the empirical risk. Moreover, we restrict the solution to the subspace . This leads to the projected empirical risk minimization, A simple calculation shows that a solution for the above is given by , with satisfying Motivated by the classic (iterated) ridge regression, we replace with a regularized one, which leads to the following projected (iterated) ridge regression.
The projected (iterated) ridge regression algorithm of order over the samples and the subspace is given by , where
Footnote 1: Let be a self-adjoint, compact operator over a separable Hilbert space . is an operator on defined by spectral calculus: suppose that is a set of normalized eigenpairs of with the eigenfunctions forming an orthonormal basis of , then
1) Our results not only hold for projected ridge regression, but also hold for a general projected-regularized algorithm, in which is a general filter function. Given a class of functions are called filter functions with qualification () if there exist some positive constants such that
2) A simple calculation shows that
Thus, is a filter function with qualification , and
When , it is a filter function for classic ridge regression and the algorithm is projected ridge regression.
3) Another typical filter function studied in the literature is which corresponds to principal component (spectral cut-off) regularization. Here, denotes the indicator function. In this case, , and could be any positive number.
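The filter functions discussed in the remarks above can be explored numerically. The following is a minimal sketch, assuming the standard forms from the spectral-regularization literature (Bauer et al., 2007): the iterated ridge filter written as a finite geometric sum, and the spectral cut-off filter that inverts only eigenvalues at or above the regularization level. The function names `ridge_filter`, `cutoff_filter` and `residual` are hypothetical and introduced only for illustration.

```python
import numpy as np

def ridge_filter(u, lam, order=1):
    """Iterated ridge filter of the given order (order=1 is plain ridge).

    Assumed form: sum_{i=1}^{order} lam^(i-1) / (lam + u)^i.
    """
    u = np.asarray(u, dtype=float)
    return sum(lam ** (i - 1) / (lam + u) ** i for i in range(1, order + 1))

def cutoff_filter(u, lam):
    """Spectral cut-off: invert eigenvalues at or above lam, discard the rest."""
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    keep = u >= lam
    out[keep] = 1.0 / u[keep]
    return out

def residual(filter_fn, u, lam, **kw):
    """Residual 1 - u * G_lam(u), the quantity that governs the bias."""
    return 1.0 - u * filter_fn(u, lam, **kw)

if __name__ == "__main__":
    u = np.linspace(1e-6, 1.0, 2001)   # grid of eigenvalues in (0, 1]
    lam = 0.05
    for order in (1, 2, 3):
        r = residual(ridge_filter, u, lam, order=order)
        # For alpha <= order, r * (u / lam)^alpha should stay bounded by 1;
        # for alpha > order it grows, which is the saturation effect.
        for alpha in (order, order + 1):
            print(order, alpha, float(np.max(r * (u / lam) ** alpha)))
```

Under the assumed form, the residual of the iterated ridge filter of order τ equals (λ/(λ+u))^τ, so the weighted residual stays bounded only for exponents up to τ. The cut-off filter has residual zero above λ, which is why any positive qualification works for it.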
In the above, is a regularization parameter which needs to be well chosen in order to achieve the best performance. Throughout this paper, we assume that
The performance of an estimator can be measured in terms of excess risk (generalization error), which is exactly according to (10). Assuming that , i.e., for some (in this case, the solution with minimal -norm for is denoted by ), it can be measured in terms of -norm, which is closely related to , according to (5). In what follows, we will measure the performance of an estimator in terms of a broader class of norms, where is such that is well defined. But one should keep in mind that all the derived results also hold if we replace with in the attainable case, i.e., We will report these results in a longer version of this paper. Convergence with respect to different norms is well motivated by convex optimization, inverse problems, and statistical learning theory. In particular, convergence with respect to target function values and -norm has been studied in convex optimization. Interestingly, convergence in -norm can imply convergence in target function values (although the derived rate is not optimal), while the opposite is not true.
3 Convergence Results
In this section, we first introduce some basic assumptions and then present convergence results for projected-regularized algorithms. Finally, we give results for sketched/Nyström regularized algorithms.
The first assumption relates to a moment condition on the output value.
There exist positive constants and such that for all with
Typically, the above assumption is satisfied if is bounded almost surely, or if , where is a Gaussian random variable with zero mean that is independent of . Condition (15) implies that the regression function is bounded almost surely, using the Cauchy-Schwarz inequality.
The next assumption relates to the regularity/smoothness of the target function
and the following Hölder source condition
Here, are non-negative numbers.
Condition (16) is trivially satisfied if is bounded almost surely. Moreover, when making a consistency assumption, i.e., , as in (Smale & Zhou, 2007; Caponnetto, 2006; Caponnetto & De Vito, 2007; Steinwart et al., 2009), for kernel-based non-parametric regression, it is satisfied with Condition (17) characterizes the regularity of the target function (Smale & Zhou, 2007). A larger corresponds to a higher regularity and a stronger assumption, and it can lead to a faster convergence rate. In particular, when ,
(Steinwart & Christmann, 2008). This means that the expected risk minimization (1) has at least one solution in , which is referred to as the attainable case.
Finally, the last assumption relates to the capacity of the space ().
For some and , satisfies
The left-hand side of (18) is called the degrees of freedom (Zhang, 2005), or effective dimension (Caponnetto & De Vito, 2007). Assumption 3 is always true for and , since is a trace class operator. This is referred to as the capacity independent setting. Assumption 3 with allows one to derive better rates. It is satisfied, e.g., if the eigenvalues of satisfy a polynomial decaying condition , or with if is finite rank.
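To make the capacity assumption concrete, the effective dimension can be evaluated directly from a spectrum. The following is a minimal sketch, assuming the usual definition as the trace of the covariance operator composed with its regularized inverse, and assuming a truncated spectrum with polynomial decay i^(-1/γ); the helper name `effective_dimension` and the specific values of γ and λ are chosen only for illustration.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """Trace of T (T + lam I)^{-1}, computed from the eigenvalues of T."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + lam)))

if __name__ == "__main__":
    gamma = 0.5
    idx = np.arange(1, 200_001)
    eigvals = idx ** (-1.0 / gamma)      # assumed decay: sigma_i = i^(-2)
    for lam in (1e-1, 1e-2, 1e-3, 1e-4):
        n_lam = effective_dimension(eigvals, lam)
        # Under polynomial decay, n_lam * lam^gamma should remain bounded.
        print(lam, n_lam, n_lam * lam ** gamma)

    # Finite-rank case: the effective dimension never exceeds the rank.
    print(effective_dimension([1.0, 0.5, 0.25, 0.0, 0.0], 1e-6))
```

For this particular decay, comparing the sum with the integral of 1/(1 + λx²) shows that the scaled quantity is bounded by π/2 for every λ > 0, which is one instance of the polynomial bound in the assumption.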
3.2 Results for Projected-regularized Algorithms
We are now ready to state our first result as follows. Throughout this paper, denotes a positive constant that depends only on and , and it may differ at each appearance. Moreover, we write to mean .
The above result provides high-probability error bounds with respect to different norms for projected-regularized algorithms. The upper bound consists of three terms. The first term depends on the regularity parameter , and it arises from estimating bias. The second term depends on the sample size, and it arises from estimating variance. The third term depends on the projection error. Note that there is a trade-off between the bias and variance terms. Ignoring the projection error, solving this trade-off leads to the best choice of and the following results.
Under the assumptions and notations of Theorem 1, let Then the following holds with probability at least .
2) If and ,
Comparing the derived upper bound for projected-regularized algorithms with that for classic regularized algorithms in (Lin et al., 2018), we see that the former has an extra term, which is caused by projection. The above result asserts that projected-regularized algorithms perform similarly to classic regularized algorithms if the projection operator is well chosen such that the projection error is small enough.
In the special case that , we get the following result.
Under the assumptions and notations of Theorem 1, let and . Then with probability at least ,
3.3 Results for Sketched-regularized Algorithms
In this subsection, we state results for sketched-regularized algorithms.
In sketched-regularized algorithms, the range of the projection operator is the subspace where is a sketch matrix satisfying the following concentration inequality: For any finite subset in and for any
Here, and are universal non-negative constants. Many matrices satisfy the concentration property.
Randomized orthogonal system (ROS) sketches. As noted in (Krahmer & Ward, 2011), a matrix that satisfies the restricted isometry property from compressed sensing, with randomized column signs, satisfies (26). In particular, a random partial Fourier matrix, or a random partial Hadamard matrix, with randomized column signs satisfies (26) with for some universal constant . Using ROS sketches has a computational advantage: for suitably chosen orthonormal matrices such as the DFT and Hadamard matrices, a matrix-vector product can be executed in time, in contrast to the time required for the same operation with generic dense sketches.
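As an illustration of why ROS sketches are cheap to apply, the following is a minimal sketch of a subsampled randomized Hadamard transform. It assumes the common construction (random column signs, a normalized Hadamard matrix in Sylvester ordering, then uniform row subsampling without replacement) and an input length that is a power of two; the names `fwht` and `ros_sketch` are hypothetical.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along axis 0.

    Uses O(n log n) additions; the length n must be a power of two.
    """
    x = np.array(x, dtype=float, copy=True)
    n = x.shape[0]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b            # butterfly: sum
            x[i + h:i + 2 * h] = a - b    # butterfly: difference
        h *= 2
    return x

def ros_sketch(x, m, rng):
    """Apply an m x n Hadamard-based ROS sketch to x without forming the matrix."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)          # random column signs
    rows = rng.choice(n, size=m, replace=False)      # uniform row subsample
    flipped = x * signs.reshape((n,) + (1,) * (x.ndim - 1))
    # sqrt(n/m) * R * (H / sqrt(n)) * D  simplifies to  R * H * D / sqrt(m).
    return fwht(flipped)[rows] / np.sqrt(m)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)
    y = ros_sketch(x, 128, rng)
    print(np.linalg.norm(x) ** 2, np.linalg.norm(y) ** 2)
```

With this scaling the squared norm is preserved in expectation, and exactly when all rows are kept. The transform costs O(n log n) here; the faster running time mentioned in the text relies on a subsampled variant of the transform that this sketch does not implement.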
The following corollary shows that sketched-regularized algorithms have optimal rates provided the sketch dimension is not too small.
The above results assert that sketched-regularized algorithms converge optimally, provided the sketch dimension is not too small; in other words, the error caused by projection is negligible when the sketch dimension is large enough. Note that the minimal sketch dimension from the above is proportional to the effective dimension up to a logarithmic factor for the case
Considering only the case and , (Yang et al., 2015) provides optimal error bounds for sketched ridge regression within the fixed design setting.
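For intuition, projected ridge regression can be written out in a finite-dimensional linear model. The following is a minimal sketch, assuming the input space is R^d, the empirical covariance is X^T X / n, and the estimator is ridge regression restricted to a subspace spanned by the columns of a given matrix (sketched inputs X^T G^T for a Gaussian sketch G, or a random subset of the inputs for plain Nyström). The function name `projected_ridge` and all problem sizes are hypothetical.

```python
import numpy as np

def projected_ridge(X, y, basis, lam):
    """Ridge regression restricted to the column span of `basis`.

    Minimizes (1/n) * ||X w - y||^2 + lam * ||w||^2 over w in span(basis).
    Assumes `basis` has full column rank.
    """
    n = X.shape[0]
    Q, _ = np.linalg.qr(basis)               # orthonormal basis of the subspace
    T_x = X.T @ X / n                        # empirical covariance
    b = X.T @ y / n                          # empirical cross-covariance
    k = Q.shape[1]
    coef = np.linalg.solve(Q.T @ T_x @ Q + lam * np.eye(k), Q.T @ b)
    return Q @ coef

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m, lam = 200, 50, 20, 1e-2
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    G = rng.standard_normal((m, n)) / np.sqrt(m)        # Gaussian sketch matrix
    w_sketch = projected_ridge(X, y, X.T @ G.T, lam)    # sketched subspace

    subset = rng.choice(n, size=m, replace=False)
    w_nystrom = projected_ridge(X, y, X[subset].T, lam) # plain Nystrom subspace

    w_full = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    for name, w in (("sketch", w_sketch), ("nystrom", w_nystrom), ("full", w_full)):
        print(name, float(np.mean((X @ w - y) ** 2)))
```

In this setting the restricted solution coincides with applying the ridge filter to the projected covariance and projecting back, and it reduces to classic ridge regression when the subspace is the whole space. The example only illustrates the construction; it does not reproduce the rates stated in the corollary.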
3.4 Results for Nyström Regularized Algorithms
As a byproduct of the paper, using Corollary 2, we derive the following results for Nyström regularized algorithms.
Under the assumptions of Theorem 1, let , , and . Then with probability at least ,
1) Considering only the case and , (Rudi et al., 2015) provides optimal generalization error bounds for Nyström ridge regression. This result was further extended in (Myleiko et al., 2017) to a general Nyström regularized algorithm with a general source assumption indexed with an operator monotone function (but only in the attainable cases). Note that as in classic ridge regression, Nyström ridge regression saturates over i.e., it does not have a better rate even for a bigger .
2) For the case and , (Myleiko et al., 2017) provides certain generalization error bounds for plain Nyström regularized algorithms, but the rates are capacity-independent, and the minimal projection dimension is larger than ours (considering the case for the sake of fairness).
In the above lemma, we consider the plain Nyström subsampling. Using the ALS Nyström subsampling (Drineas et al., 2012; Gittens & Mahoney, 2013; Alaoui & Mahoney, 2015), we can improve the projection dimension condition to (27).
ALS Nyström Subsampling
Let For the leveraging scores of is the set with
The -approximated leveraging scores (ALS) of is a set satisfying for some . In ALS Nyström subsampling regime, where each is i.i.d. drawn according to
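The sampling scheme just described can be sketched in a few lines. The following is a minimal illustration, assuming the ridge leverage score definition used by Alaoui & Mahoney (2015), namely the diagonal of K (K + λ n I)^{-1} for the Gram matrix K, and using exact scores in place of approximate ones. The function names `ridge_leverage_scores` and `als_nystrom_indices`, and the Gaussian kernel in the demo, are hypothetical choices.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Diagonal of K (K + lam * n * I)^{-1} for a symmetric PSD Gram matrix K."""
    n = K.shape[0]
    # K commutes with (K + lam n I)^{-1}, so solving from the left is equivalent.
    scores = np.diag(np.linalg.solve(K + lam * n * np.eye(n), K)).copy()
    return np.clip(scores, 0.0, None)     # guard against tiny negative round-off

def als_nystrom_indices(K, lam, m, rng):
    """Draw m indices i.i.d. with probability proportional to the scores."""
    scores = ridge_leverage_scores(K, lam)
    probs = scores / scores.sum()
    idx = rng.choice(K.shape[0], size=m, replace=True, p=probs)
    return idx, probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((100, 2))
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq_dists)           # Gaussian kernel Gram matrix
    idx, probs = als_nystrom_indices(K, 1e-2, 15, rng)
    print(sorted(idx.tolist()))
    print(float(ridge_leverage_scores(K, 1e-2).sum()))
```

The scores sum to the empirical effective dimension, so points that are hard to represent by the others receive a larger sampling probability. Computing exact scores requires solving an n-by-n system, which is why approximate scores are used in practice.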
All the results stated in this section will be proved in the next section.
In this section, we prove the results stated in Section 3. We first give some deterministic estimates and an analytic result. We then give some probabilistic estimates. Applying the probabilistic estimates to the analytic result, we prove the results for projected-regularized algorithms. We finally estimate the projection errors and present the proof for sketched-regularized algorithms.
4.1 Deterministic Estimates
In this subsection, we introduce some deterministic estimates. For notational simplicity, throughout this paper, we denote
We define a deterministic vector as follows,
The vector is often called population function. We introduce the following lemma. The proof is essentially the same as that for Lemma 26 from (Lin & Cevher, 2018). We thus omit it.
Under Assumption 2, the following holds.
1) For any
The above lemma provides some basic properties for the population function. It will be useful for the proof of our main results. The left-hand side of (31) is often called the true bias.
Using the above lemma and some basic operator inequalities, we can prove the following analytic, deterministic result.
Under Assumption 2, let
Then, for any the following holds.
1) If ,
The above proposition is key to our proof. The proof of the above proposition for the case borrows ideas from (Smale & Zhou, 2007; Caponnetto & De Vito, 2007; Rudi et al., 2015; Myleiko et al., 2017; Lin et al., 2018), whereas the key step is an error decomposition from (Lin & Cevher, 2018). Our novelty lies in the proof for the case , see the appendix for further details.
4.2 Proof for Projected-regularized Algorithms
To derive total error bounds from Proposition 8, it is necessary to develop probabilistic estimates for the random quantities , , and . We thus introduce the following four lemmas.
Lemma 9 ((Lin & Cevher, 2018)).
Under Assumption 3, let , for some , and
We have with probability at least
Let It holds with probability at least
Here, denotes the Hilbert-Schmidt norm.
Under Assumption 3, let It holds with probability at least
The proofs of the above lemmas follow by applying concentration inequalities for sums of Hilbert-space-valued random variables. We refer to (Lin & Rosasco, 2017) for the proofs.
With the above probabilistic estimates and the analytic result, Proposition 8, we are now ready to prove the results for projected-regularized algorithms.
Proof of Theorem 1.
4.3 Proof for Sketched-regularized Algorithms
In order to use Corollary 2 for sketched-regularized algorithms, we need to estimate the projection error. The basic idea is to approximate the projection error in terms of its ‘empirical’ version, . The estimate for is quite lengthy and it is divided into several steps.
Let and Given a fixed , assume that for ,