Optimal Rates of Sketched-regularized Algorithms for Least-Squares Regression over Hilbert Spaces

03/12/2018
by   Junhong Lin, et al.

We investigate regularized algorithms combined with projection for the least-squares regression problem over a Hilbert space, covering nonparametric regression over a reproducing kernel Hilbert space. We prove convergence results with respect to variants of norms, under a capacity assumption on the hypothesis space and a regularity condition on the target function. As a result, we obtain optimal rates for regularized algorithms with randomized sketches, provided that the sketch dimension is proportional to the effective dimension up to a logarithmic factor. As a byproduct, we obtain similar results for Nyström regularized algorithms. Our results are the first with optimal, distribution-dependent rates that do not have any saturation effect for sketched/Nyström regularized algorithms, considering both the attainable and non-attainable cases.

1 Introduction

Let the input space H be a separable Hilbert space with inner product denoted by \langle \cdot, \cdot \rangle_H, and the output space \mathbb{R}. Let \rho be an unknown probability measure on H \times \mathbb{R}. In this paper, we study the following expected risk minimization,

\inf_{\omega \in H} \tilde{\mathcal{E}}(\omega), \qquad \tilde{\mathcal{E}}(\omega) = \int_{H \times \mathbb{R}} \big( \langle \omega, x \rangle_H - y \big)^2 \, d\rho(x, y),   (1)

where the measure \rho is known only through a sample \mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^n of size n, independently and identically distributed (i.i.d.) according to \rho.

The above regression setting covers nonparametric regression over a reproducing kernel Hilbert space (Cucker & Zhou, 2007; Steinwart & Christmann, 2008), and it is close to functional regression (Ramsay, 2006) and linear inverse problems (Engl et al., 1996). A basic algorithm for the problem is ridge regression and its generalization, the spectral algorithm. Such algorithms can be viewed as solving an empirical, linear equation with the empirical covariance operator replaced by a regularized one; see (Caponnetto & Yao, 2006; Bauer et al., 2007; Gerfo et al., 2008; Lin et al., 2018) and references therein. Here, the regularization is used to control the complexity of the solution, to protect against over-fitting, and to achieve the best generalization ability.
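For intuition, here is a minimal finite-dimensional sketch (taking H = R^d) of ridge regression viewed as solving the regularized empirical linear equation; the 1/sqrt(n) scaling and all names are illustrative assumptions, not the paper's code.

```python
# Minimal sketch (not the paper's implementation): classic ridge regression viewed as
# solving the regularized empirical linear equation (T_x + lam*I) w = S_x^* y_bar,
# in the finite-dimensional case H = R^d. The 1/sqrt(n) "modified" scaling is an
# assumption about the sampling operator; other conventions only rescale y_bar.
import numpy as np

def ridge_estimator(X, y, lam):
    """X: (n, d) inputs, y: (n,) outputs, lam: regularization parameter."""
    n, d = X.shape
    Sx = X / np.sqrt(n)                  # sampling operator as an (n, d) matrix
    y_bar = y / np.sqrt(n)
    Tx = Sx.T @ Sx                       # empirical covariance operator, (d, d)
    b = Sx.T @ y_bar                     # S_x^* y_bar
    return np.linalg.solve(Tx + lam * np.eye(d), b)

# usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(200)
w_hat = ridge_estimator(X, y, lam=1e-2)
```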

The function/estimator generated by a classic regularized algorithm lies in the subspace H_x = span{x_1, ..., x_n} of H. More often, the search of an estimator for some specific algorithms is restricted to a different (and possibly smaller) subspace S of H, which leads to regularized algorithms with projection. Such approaches have computational advantages in nonparametric regression with kernel methods (Williams & Seeger, 2000; Smola & Schölkopf, 2000). Typically, S = span{\tilde{x}_1, ..., \tilde{x}_m} with a subsample/sketch dimension m ≤ n, where each \tilde{x}_j is chosen randomly from the input set {x_1, ..., x_n}, or S = span{\sum_{i=1}^n G_{ji} x_i : 1 ≤ j ≤ m}, where G ∈ ℝ^{m×n} is a general randomized matrix whose rows are drawn according to a distribution. The resulting algorithms are called Nyström regularized algorithms and sketched-regularized algorithms, respectively.
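To make the two subspace constructions concrete, here is a small illustrative sketch for the finite-dimensional case H = R^d, where subspaces are column spans; the Gaussian sketch is just one admissible choice of G.

```python
# Minimal sketch of the two projection subspaces (H = R^d, so subspaces are column spans).
# Nystrom: span of m randomly subsampled inputs. Sketched: span of m random combinations
# G @ X of all inputs, with G an (m, n) random sketch matrix (here Gaussian, for illustration).
import numpy as np

def nystrom_basis(X, m, rng):
    """Columns spanning S = span{x_j : j in a random subset of size m} (plain Nystrom)."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return X[idx].T                       # (d, m)

def sketched_basis(X, m, rng):
    """Columns spanning S = span{sum_i G_ji x_i, j = 1..m} for a Gaussian sketch G."""
    n = X.shape[0]
    G = rng.standard_normal((m, n)) / np.sqrt(m)
    return (G @ X).T                      # (d, m)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
B_nys, B_skt = nystrom_basis(X, 50, rng), sketched_basis(X, 50, rng)
```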

The starting points of this paper are recent papers (Bach, 2013; Alaoui & Mahoney, 2015; Yang et al., 2015; Rudi et al., 2015; Myleiko et al., 2017), where convergence results on Nyström/sketched regularized algorithms for learning with kernel methods are given. In particular, within the fixed design setting, i.e., where the inputs are deterministic while the outputs are treated as random, convergence results have been derived in (Bach, 2013; Alaoui & Mahoney, 2015) for Nyström ridge regression and in (Yang et al., 2015) for sketched ridge regression. Within the random design setting (which is more meaningful in statistical learning theory (Hsu et al., 2014)) and under a regularity/smoothness condition on the target function (Smale & Zhou, 2007), optimal statistical results on generalization error bounds (excess risks) have been obtained in (Rudi et al., 2015) for Nyström ridge regression. The latter results were further generalized in (Myleiko et al., 2017) to general Nyström regularized algorithms.
Although results have been developed for sketched ridge regression in the fixed design setting, it is still unclear whether one can obtain statistical results for general sketched-regularized algorithms in the random design setting. Besides, all the derived results, either for sketched or Nyström regularized algorithms, cover only the attainable case, i.e., the case where the expected risk minimization (1) has at least one solution in H. Moreover, they saturate (Bauer et al., 2007) at a critical value, meaning that they cannot lead to better convergence rates even with a smoother target function. Motivated by this, in this paper, we study statistical results of projected-regularized algorithms for least-squares regression over a separable Hilbert space within the random design setting.

We first extend the analysis in (Lin et al., 2018) for classic-regularized algorithms to projected-regularized algorithms, and prove statistical results with respect to a broader class of norms. We then show that optimal rates can be retained for sketched-regularized algorithms, provided that the sketch dimension is proportional to the effective dimension (Zhang, 2005) up to a logarithmic factor. As a byproduct, we obtain similar results for Nyström regularized algorithms.

Interestingly, our results are the first with optimal, distribution-dependent rates that do not have any saturation effect for sketched/Nyström regularized algorithms, considering both the attainable and non-attainable cases. In our proofs, we naturally integrate proof techniques from (Smale & Zhou, 2007; Caponnetto & De Vito, 2007; Rudi et al., 2015; Myleiko et al., 2017; Lin et al., 2018). Our novelties lie in a new estimate of the projection error for sketched-regularized algorithms, a novel analysis to overcome the saturation effect, and a refined analysis for Nyström regularized algorithms; see Section 4 for details.

The rest of the paper is organized as follows. Section 2 introduces some auxiliary notations and projected-regularized algorithms. Section 3 presents assumptions and our main results, followed by brief discussions. Finally, Section 4 gives the proofs of our main results.

2 Learning with Projected-regularized Algorithms

In this section, we introduce some notations as well as auxiliary operators, and present projected-regularized algorithms.

2.1 Notations and Auxiliary Operators

Let ρ_X(·) denote the induced marginal measure of ρ on H, and ρ(·|x) the conditional probability measure of ρ on ℝ with respect to x ∈ H and ρ_X. For simplicity, we assume that the support of ρ_X is compact and that there exists a constant κ ∈ [1, ∞) such that

\langle x, x' \rangle_H \le \kappa^2, \qquad \forall\, x, x' \in \operatorname{supp}(\rho_X).   (2)

Define the hypothesis space H_ρ = {f : H → ℝ | ∃ ω ∈ H with f(x) = \langle ω, x \rangle_H, ρ_X-almost surely}. Denote by L^2_{ρ_X} the Hilbert space of square-integrable functions from H to ℝ with respect to ρ_X, with its norm given by \|f\|_\rho = \big( \int_H |f(x)|^2 \, d\rho_X(x) \big)^{1/2}.

For a given bounded operator A : L^2_{ρ_X} → H, \|A\| denotes the operator norm of A, i.e., \|A\| = \sup_{\|f\|_\rho = 1} \|A f\|_H. The set {1, 2, ..., n} is denoted by [n]. For any real number a, we write a_+ = max(a, 0) and a ∧ b = min(a, b).

Let S_ρ : H → L^2_{ρ_X} be the linear map ω ↦ \langle ω, \cdot \rangle_H, which is bounded by κ under Assumption (2). Furthermore, we consider the adjoint operator S_ρ^* : L^2_{ρ_X} → H, the covariance operator T : H → H given by T = S_ρ^* S_ρ, and the integral operator L : L^2_{ρ_X} → L^2_{ρ_X} given by L = S_ρ S_ρ^*. It can be easily proved that S_ρ^* g = \int_H x\, g(x)\, d\rho_X(x) and T = \int_H \langle \cdot, x \rangle_H\, x \, d\rho_X(x). Under Assumption (2), the operators T and L can be proved to be positive trace-class operators (and hence compact):

\|L\| = \|T\| \le \operatorname{tr}(T) = \int_H \|x\|_H^2 \, d\rho_X(x) \le \kappa^2.   (3)

For any ω ∈ H, it is easy to prove the following isometry property (Bauer et al., 2007),

\|S_\rho \omega\|_\rho = \|\sqrt{T}\, \omega\|_H.   (4)

Moreover, according to the singular value decomposition of a compact operator, one can prove that

(5)

We define the (modified) sampling operator S_x : H → ℝ^n by (S_x ω)_i = n^{-1/2} \langle ω, x_i \rangle_H, i ∈ [n], where the norm in ℝ^n is the usual Euclidean norm. Its adjoint operator S_x^* : ℝ^n → H, defined by \langle S_x^* y, ω \rangle_H = \langle y, S_x ω \rangle_{ℝ^n} for y ∈ ℝ^n, is thus given by S_x^* y = n^{-1/2} \sum_{i=1}^n y_i x_i. For notational simplicity, we let \bar{y} = n^{-1/2}(y_1, \dots, y_n). Moreover, we can define the empirical covariance operator T_x : H → H such that T_x = S_x^* S_x. Obviously, T_x = \frac{1}{n} \sum_{i=1}^n \langle \cdot, x_i \rangle_H\, x_i. By Assumption (2), similar to (3), we have

\|T_x\| \le \operatorname{tr}(T_x) = \frac{1}{n} \sum_{i=1}^n \|x_i\|_H^2 \le \kappa^2.   (6)
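As a quick finite-dimensional sanity check (H = R^d), the operators above reduce to ordinary matrices; the 1/sqrt(n) scaling below follows the convention assumed in the reconstruction above and may differ from the paper's exact normalization.

```python
# Finite-dimensional sketch of the empirical operators (H = R^d): S_x is an (n, d)
# matrix, its adjoint is the transpose, and T_x = S_x^T S_x is the empirical covariance.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d))

Sx = X / np.sqrt(n)                      # (S_x w)_i = <w, x_i> / sqrt(n)
Tx = Sx.T @ Sx                           # empirical covariance operator, equals X.T X / n
assert np.allclose(Tx, X.T @ X / n)
# trace bound of (6): tr(T_x) equals the average of ||x_i||^2
assert np.isclose(np.trace(Tx), np.mean(np.sum(X**2, axis=1)))
```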

It is easy to see that Problem (1) is equivalent to

\inf_{f \in H_\rho} \mathcal{E}(f), \qquad \mathcal{E}(f) = \int_{H \times \mathbb{R}} (f(x) - y)^2 \, d\rho(x, y).   (7)

The function that minimizes the expected risk over all measurable functions is the regression function (Cucker & Zhou, 2007; Steinwart & Christmann, 2008), defined as

f_\rho(x) = \int_{\mathbb{R}} y \, d\rho(y|x), \qquad x \in H, \ \rho_X\text{-almost surely}.   (8)

A simple calculation shows that the following well-known fact holds (Cucker & Zhou, 2007; Steinwart & Christmann, 2008): for all f ∈ L^2_{ρ_X}, \mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2. Then it is easy to see that (7) is equivalent to \inf_{f \in H_\rho} \|f - f_\rho\|_\rho^2. Under Assumption (2), H_ρ is a subspace of L^2_{ρ_X}. Using the projection theorem, one can prove that a solution f_H for problem (7) is the projection of the regression function f_ρ onto the closure of H_ρ in L^2_{ρ_X}, and moreover, for all f ∈ H_ρ (Lin & Rosasco, 2017),

\int_H \big(f_H(x) - f_\rho(x)\big) f(x) \, d\rho_X(x) = 0,   (9)

and

\mathcal{E}(f) - \mathcal{E}(f_H) = \|f - f_H\|_\rho^2.   (10)

Note that f_H does not necessarily lie in H_ρ.

Throughout this paper, S is a closed, finite-dimensional subspace of H, and P is the projection operator onto S.

2.2 Projected-regularized Algorithms

In this subsection, we introduce projected-regularized algorithms.

The expected risk in (1) cannot be computed exactly. It can only be approximated through the empirical risk \tilde{\mathcal{E}}_{\mathbf{z}}(\omega) = \frac{1}{n} \sum_{i=1}^n (\langle \omega, x_i \rangle_H - y_i)^2. A first idea to deal with the problem is to replace the objective function in (1) with the empirical risk. Moreover, we restrict the solution to the subspace S. This leads to the projected empirical risk minimization \min_{\omega \in S} \tilde{\mathcal{E}}_{\mathbf{z}}(\omega). A simple calculation shows that a solution for the above is given by \omega_{\mathbf{z}} = P \hat{\omega}, with \hat{\omega} satisfying P T_{\mathbf{x}} P \hat{\omega} = P S_{\mathbf{x}}^* \bar{y}. Motivated by classic (iterated) ridge regression, we replace the (pseudo)inverse of P T_{\mathbf{x}} P with a regularized one, which leads to the following projected (iterated) ridge regression.

Algorithm 1.

The projected (iterated) ridge regression algorithm of order τ ∈ ℕ over the samples z and the subspace S is given by ω_z^λ = P\, G_λ(P T_x P)\, P S_x^* \bar{y}, where G_λ(u) = \sum_{i=0}^{\tau-1} \lambda^i (\lambda + u)^{-(i+1)} is the filter function of (iterated) ridge regression.¹

¹ Let L be a self-adjoint, compact operator over a separable Hilbert space H. G_λ(L) is an operator on H defined by spectral calculus: suppose that {(σ_i, ψ_i)}_i is a set of normalized eigenpairs of L, with the eigenfunctions {ψ_i}_i forming an orthonormal basis of H; then

G_\lambda(L) = \sum_i G_\lambda(\sigma_i) \, \langle \cdot, \psi_i \rangle_H \, \psi_i.   (11)
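A minimal sketch of the above algorithm in the finite-dimensional case H = R^d and for order τ = 1 (plain projected ridge), under the scaling conventions assumed above; B and all other names are illustrative, and this is not the paper's reference implementation.

```python
# Minimal finite-dimensional sketch (H = R^d) of projected ridge regression as described
# above: project onto S = range(B), solve the regularized equation on the subspace, and
# apply the projection again. The 1/sqrt(n) scaling follows earlier assumptions.
import numpy as np

def projected_ridge(X, y, B, lam):
    """X: (n, d) inputs, y: (n,) outputs, B: (d, m) basis of the subspace S, lam > 0."""
    n, d = X.shape
    Q, _ = np.linalg.qr(B)               # orthonormal basis of S
    P = Q @ Q.T                          # projection operator onto S
    Sx = X / np.sqrt(n)
    y_bar = y / np.sqrt(n)
    A = P @ (Sx.T @ Sx) @ P              # P T_x P
    b = P @ (Sx.T @ y_bar)               # P S_x^* y_bar
    # ridge filter G_lam(A) b = (A + lam*I)^{-1} b, then project back onto S
    return P @ np.linalg.solve(A + lam * np.eye(d), b)
```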
Remark 1.

1) Our results hold not only for projected ridge regression, but also for a general projected-regularized algorithm, in which G_λ is a general filter function. Given Λ ⊆ (0, ∞), a class of functions {G_λ : [0, κ²] → [0, ∞), λ ∈ Λ} are called filter functions with qualification τ (τ ≥ 1) if there exist positive constants E and F_τ such that

(12)

and

(13)

2) A simple calculation shows that

(14)

Thus, G_λ is a filter function with qualification τ. When τ = 1, it is the filter function of classic ridge regression and the algorithm is projected ridge regression.
3) Another typical filter function studied in the literature is G_λ(u) = u^{-1} \mathbf{1}\{u \ge \lambda\}, which corresponds to principal component (spectral cut-off) regularization. Here, \mathbf{1}\{\cdot\} denotes the indicator function. In this case, the filter constants equal one, and the qualification τ could be any positive number.
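For concreteness, a small sketch of the two filter functions mentioned above, applied to a symmetric matrix through its eigendecomposition (the spectral calculus of the footnote); the iterated-ridge formula is the one written above, and parameter names are illustrative.

```python
# Sketch of two filter functions applied through spectral calculus (eigendecomposition):
# iterated ridge of order tau, and spectral cut-off. Applied to a symmetric PSD matrix A,
# G_lam(A) b approximates A^{-1} b with regularization lam.
import numpy as np

def g_iterated_ridge(u, lam, tau=1):
    # G_lam(u) = sum_{i=0}^{tau-1} lam^i (lam + u)^{-(i+1)}; tau = 1 gives 1/(lam + u)
    return sum(lam**i / (lam + u)**(i + 1) for i in range(tau))

def g_spectral_cutoff(u, lam):
    # keep components with eigenvalue >= lam, invert them, discard the rest
    return np.where(u >= lam, 1.0 / np.maximum(u, lam), 0.0)

def apply_filter(A, b, g, lam, **kw):
    """Compute G_lam(A) b by spectral calculus for a symmetric matrix A."""
    evals, evecs = np.linalg.eigh(A)
    return evecs @ (g(evals, lam, **kw) * (evecs.T @ b))

# usage
A = np.diag([1.0, 0.5, 0.1, 0.01])
b = np.ones(4)
w_ridge = apply_filter(A, b, g_iterated_ridge, lam=0.05, tau=3)
w_cut = apply_filter(A, b, g_spectral_cutoff, lam=0.05)
```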

In the above, λ is a regularization parameter which needs to be well chosen in order to achieve the best performance. Throughout this paper, we assume that λ ∈ (0, 1].

The performance of an estimator ω ∈ H can be measured in terms of the excess risk (generalization error), which is exactly \|S_\rho \omega - f_H\|_\rho^2 according to (10). Assuming that f_H ∈ H_ρ, i.e., f_H = S_ρ ω_H for some ω_H ∈ H (in this case, the solution with minimal H-norm for (7) is denoted by ω_H), it can also be measured in terms of the H-norm \|\omega - \omega_H\|_H, which is closely related to \|L^{-1/2}(S_\rho \omega - f_H)\|_\rho, according to (5). In what follows, we will measure the performance of an estimator in terms of a broader class of norms, \|L^{-a}(S_\rho \omega - f_H)\|_\rho, where a ≥ 0 is such that L^{-a}(S_\rho \omega - f_H) is well defined. But one should keep in mind that all the derived results also hold if we replace \|L^{-a}(S_\rho \omega - f_H)\|_\rho with \|T^{1/2 - a}(\omega - \omega_H)\|_H in the attainable case, i.e., when f_H ∈ H_ρ. We will report these results in a longer version of this paper. Convergence with respect to different norms has strong backgrounds in convex optimization, inverse problems, and statistical learning theory. In particular, convergence with respect to target function values and in H-norm has been studied in convex optimization. Interestingly, convergence in H-norm implies convergence in target function values (although the derived rate is not optimal), while the opposite is not true.

3 Convergence Results

In this section, we first introduce some basic assumptions and then present convergence results for projected-regularized algorithms. Finally, we give results for sketched/Nyström regularized algorithms.

3.1 Assumptions

In this subsection, we introduce three standard assumptions made in statistical learning theory (Steinwart & Christmann, 2008; Cucker & Zhou, 2007; Lin et al., 2018). The first assumption relates to a moment condition on the output value y.

Assumption 1.

There exist positive constants Q and M such that for all l ∈ ℕ with l ≥ 2,

\int_{\mathbb{R}} |y|^l \, d\rho(y|x) \le \frac{1}{2}\, l!\, M^{l-2} Q^2,   (15)

ρ_X-almost surely.

Typically, the above assumption is satisfied if y is bounded almost surely, or if y = \langle \omega_*, x \rangle_H + \epsilon for some ω_* ∈ H, where ε is a Gaussian random variable with zero mean that is independent of x. Condition (15) implies that the regression function f_ρ is bounded almost surely, using the Cauchy-Schwarz inequality.

The next assumption relates to the regularity/smoothness of the target function f_H.

Assumption 2.

f_H satisfies

\int_H \big(f_H(x) - f_\rho(x)\big)^2 \, x \otimes x \; d\rho_X(x) \preceq B^2\, T,   (16)

and the following Hölder source condition,

f_H = L^{\zeta} g_0 \quad \text{with} \quad \|g_0\|_\rho \le R.   (17)

Here, B, ζ and R are non-negative numbers.

Condition (16) is trivially satisfied if y is bounded almost surely. Moreover, when making a consistency assumption, i.e., \inf_{f \in H_\rho} \mathcal{E}(f) = \mathcal{E}(f_\rho), as in (Smale & Zhou, 2007; Caponnetto, 2006; Caponnetto & De Vito, 2007; Steinwart et al., 2009) for kernel-based non-parametric regression, it is satisfied with B = 0. Condition (17) characterizes the regularity of the target function f_H (Smale & Zhou, 2007). A bigger ζ corresponds to a higher regularity and a stronger assumption, and it can lead to a faster convergence rate. In particular, when ζ ≥ 1/2, f_H ∈ H_ρ (Steinwart & Christmann, 2008). This means that the expected risk minimization (1) has at least one solution in H, which is referred to as the attainable case.
Finally, the last assumption relates to the capacity of the space H_ρ.

Assumption 3.

For some γ ∈ [0, 1] and c_γ > 0, T satisfies

\mathcal{N}(\lambda) := \operatorname{tr}\!\big(T (T + \lambda I)^{-1}\big) \le c_\gamma\, \lambda^{-\gamma}, \qquad \forall\, \lambda > 0.   (18)

The left-hand side of (18) is called the degrees of freedom (Zhang, 2005), or the effective dimension (Caponnetto & De Vito, 2007). Assumption 3 is always true for γ = 1 and c_γ = κ², since T is a trace-class operator. This is referred to as the capacity-independent setting. Assumption 3 with γ < 1 allows us to derive better rates. It is satisfied, e.g., if the eigenvalues {σ_i}_i of T satisfy a polynomial decaying condition σ_i ∼ i^{-1/γ}, or with γ = 0 if T is of finite rank.
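As an illustration of the quantity in (18), a short sketch that computes the effective dimension from a set of eigenvalues (here synthetic, polynomially decaying ones standing in for the spectrum of T):

```python
# Sketch: effective dimension N(lam) = tr(T (T + lam I)^{-1}) computed from eigenvalues.
# With polynomially decaying eigenvalues sigma_i ~ i^{-1/gamma}, N(lam) grows like lam^{-gamma}.
import numpy as np

def effective_dimension(eigvals, lam):
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative eigenvalues
    return float(np.sum(eigvals / (eigvals + lam)))

# usage: eigenvalues with polynomial decay corresponding to gamma = 0.5
sigma = np.arange(1, 1001) ** (-2.0)
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, effective_dimension(sigma, lam))
```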

3.2 Results for Projected-regularized Algorithms

We are now ready to state our first result. Throughout this paper, C denotes a positive constant that depends only on the parameters of the assumptions and the filter function (but not on n, λ, or δ), and it could be different at each appearance. Moreover, we write a ≲ b to mean a ≤ C b.

Theorem 1.

Under Assumptions 1, 2 and 3, let for some , , and . Then the following holds with probability at least ().
1) If ,

(19)

2) If and

(20)

Here, is the projection error and

(21)

The above result provides high-probability error bounds with respect to variants of norms for projected-regularized algorithms. The upper bound consists of three terms. The first term depends on the regularity parameter ζ, and it arises from estimating the bias. The second term depends on the sample size, and it arises from estimating the variance. The third term depends on the projection error. Note that there is a trade-off between the bias and variance terms. Ignoring the projection error, solving this trade-off leads to the best choice of λ and the following results.

Corollary 2.

Under the assumptions and notations of Theorem 1, let λ be chosen to balance the bias and variance terms as described above. Then the following holds with probability at least 1 − δ.
1) If

(22)

2) If and ,

(23)

3) If

(24)

Comparing the derived upper bound for projected-regularized algorithms with that for classic regularized algorithms in (Lin et al., 2018), we see that the former has an extra term, which is caused by projection. The above result asserts that projected-regularized algorithms perform similarly to classic regularized algorithms if the projection operator is well chosen, so that the projection error is small enough.

In the special case that P = I, we get the following result.

Corollary 3.

Under the assumptions and notations of Theorem 1, let and . Then with probability at least ,

(25)

The above result recovers the result derived in (Lin et al., 2018). The convergence rates are optimal, as they match the minimax rates derived in (Caponnetto & De Vito, 2007; Blanchard & Mucke, 2016).

3.3 Results for Sketched-regularized Algorithms

In this subsection, we state results for sketched-regularized algorithms.

In sketched-regularized algorithms, the range of the projection operator P is the subspace S = span{\sum_{i=1}^n G_{ji} x_i : 1 ≤ j ≤ m}, where G ∈ ℝ^{m×n} is a sketch matrix satisfying the following concentration inequality: for any finite subset of ℝ^n and for any ε > 0,

(26)

Here, the two constants appearing in (26) are universal and non-negative. Many matrices satisfy this concentration property.

  • Subgaussian sketches. Matrices with i.i.d. subgaussian (such as Gaussian or Bernoulli) entries satisfy (26) with universal constants. More generally, if the rows of G are independent (scaled) copies of an isotropic vector, then G also satisfies (26) (Mendelson et al., 2008).

  • Randomized orthogonal system (ROS) sketches. As noted in (Krahmer & Ward, 2011), a matrix that satisfies the restricted isometry property from compressed sensing, combined with randomized column signs, satisfies (26). In particular, a random partial Fourier matrix, or a random partial Hadamard matrix with randomized column signs, satisfies (26) up to logarithmic factors for some universal constant. Using ROS sketches has an advantage in computation: for suitably chosen orthonormal matrices such as the DFT and Hadamard matrices, a matrix-vector product can be executed in O(n log n) time, in contrast to the O(nm) time required for the same operation with generic dense sketches. A small illustrative generator for both families is sketched below.
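A minimal, illustrative generator for the two families of sketch matrices discussed above; it uses a dense normalized Hadamard matrix from SciPy rather than a fast transform, so it demonstrates the construction but not the O(n log n) speed.

```python
# Illustrative generators for the two sketch families discussed above (not the paper's code).
# Gaussian: i.i.d. N(0, 1/m) entries. ROS-style: random column signs, an orthonormal
# Hadamard transform, and m randomly subsampled rows, rescaled by sqrt(n/m).
import numpy as np
from scipy.linalg import hadamard

def gaussian_sketch(m, n, rng):
    return rng.standard_normal((m, n)) / np.sqrt(m)

def ros_sketch(m, n, rng):
    """n must be a power of 2 for SciPy's Hadamard construction."""
    signs = rng.choice([-1.0, 1.0], size=n)          # randomized column signs
    H = hadamard(n) / np.sqrt(n)                     # orthonormal Hadamard matrix
    rows = rng.choice(n, size=m, replace=False)      # random partial transform
    return np.sqrt(n / m) * H[rows] * signs

rng = np.random.default_rng(0)
n, m = 512, 64
a = rng.standard_normal(n)
for G in (gaussian_sketch(m, n, rng), ros_sketch(m, n, rng)):
    print(np.linalg.norm(G @ a), np.linalg.norm(a))   # approximately equal
```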

The following corollary shows that sketched-regularized algorithms have optimal rates provided the sketch dimension is not too small.

Corollary 4.

Under the assumptions of Theorem 1, let P be the projection operator onto S = span{\sum_{i=1}^n G_{ji} x_i : 1 ≤ j ≤ m}, where G is a randomized matrix satisfying (26). Let λ be chosen as in Corollary 2, and assume that

(27)

Then with confidence at least 1 − δ, the following holds:

(28)

The above results assert that sketched-regularized algorithms converge optimally provided the sketch dimension is not too small; in other words, the error caused by projection is negligible when the sketch dimension is large enough. Note that the minimal sketch dimension from the above is proportional to the effective dimension up to a logarithmic factor in the case considered.

Remark 2.

Considering only the case and , (Yang et al., 2015) provides optimal error bounds for sketched ridge regression within the fixed design setting.

3.4 Results for Nyström Regularized Algorithms

As a byproduct of the paper, using Corollary 2, we derive the following results for Nyström regularized algorithms.

Corollary 5.

Under the assumptions of Theorem 1, let , , and . Then with probability at least ,

provided that

Remark 3.

1) Considering only the attainable case and the excess risk, (Rudi et al., 2015) provides optimal generalization error bounds for Nyström ridge regression. This result was further extended in (Myleiko et al., 2017) to general Nyström regularized algorithms with a general source assumption indexed by an operator monotone function (but only in the attainable case). Note that, as in classic ridge regression, Nyström ridge regression saturates at ζ = 1, i.e., it does not have a better rate even for a bigger ζ.
2) For the non-attainable case, (Myleiko et al., 2017) provides certain generalization error bounds for plain Nyström regularized algorithms, but the rates are capacity independent, and the minimal projection dimension is larger than ours (considering the same case for the sake of fairness).

In the above corollary, we consider plain Nyström subsampling. Using ALS Nyström subsampling (Drineas et al., 2012; Gittens & Mahoney, 2013; Alaoui & Mahoney, 2015), we can improve the projection dimension condition to (27).

ALS Nyström Subsampling

Let K denote the n × n Gram matrix with entries K_{ij} = \langle x_i, x_j \rangle_H. For λ > 0, the leverage scores of x are the set {l_i(λ)}_{i=1}^n with

l_i(\lambda) = \big( K (K + n\lambda I)^{-1} \big)_{ii}, \qquad i \in [n].

The t-approximated leverage scores (ALS) of x are a set {\hat{l}_i(λ)}_{i=1}^n satisfying l_i(λ) ≤ \hat{l}_i(λ) ≤ t\, l_i(λ) for some t ≥ 1. In the ALS Nyström subsampling regime, S = span{x_{i_1}, ..., x_{i_m}}, where each i_j is i.i.d. drawn according to the probabilities \hat{l}_i(λ) / \sum_{k=1}^n \hat{l}_k(λ).
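A compact sketch of leverage-score-based Nyström subsampling in the finite-dimensional case, following the construction above; exact ridge leverage scores are computed directly for simplicity (any t-approximation could be substituted), and all names are illustrative.

```python
# Sketch of leverage-score-based Nystrom subsampling (finite-dimensional illustration).
# Exact ridge leverage scores l_i(lam) = (K (K + n*lam*I)^{-1})_ii are computed directly;
# any t-approximation of these scores can be used in their place.
import numpy as np

def ridge_leverage_scores(K, lam):
    n = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))

def als_nystrom_indices(K, lam, m, rng):
    scores = ridge_leverage_scores(K, lam)
    probs = scores / scores.sum()
    return rng.choice(K.shape[0], size=m, replace=True, p=probs)   # i.i.d. draws

# usage with a linear kernel K_ij = <x_i, x_j>
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
K = X @ X.T
idx = als_nystrom_indices(K, lam=1e-2, m=30, rng=rng)
subspace_basis = X[idx].T          # columns spanning the Nystrom subspace S
```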

Corollary 6.

Under the assumptions of Theorem 1, let λ be chosen as in Corollary 2 and let S be spanned by a subsample drawn following an ALS Nyström subsampling scheme. Then with probability at least 1 − δ, (28) holds provided that

(29)

All the results stated in this section will be proved in the next section.

4 Proof

In this section, we prove the results stated in Section 3. We first give some deterministic estimates and an analytic result. We then give some probabilistic estimates. Applying the probabilistic estimates to the analytic result, we prove the results for projected-regularized algorithms. We finally estimate the projection errors and present the proof for sketched-regularized algorithms.

4.1 Deterministic Estimates

In this subsection, we introduce some deterministic estimates. For notational simplicity, throughout this paper, we denote

We define a deterministic vector as follows,

(30)

The vector defined in (30) is often called the population function. We introduce the following lemma; its proof is essentially the same as that of Lemma 26 in (Lin & Cevher, 2018), and we thus omit it.

Lemma 7.

Under Assumption 2, the following holds.
1) For any

(31)

2)

(32)

The above lemma provides some basic properties of the population function. It will be useful for the proof of our main results. The left-hand side of (31) is often called the true bias.
Using the above lemma and some basic operator inequalities, we can prove the following analytic, deterministic result.

Proposition 8.

Under Assumption 2, let

Then, for any the following holds.
1) If ,

(33)

2) If

(34)

The above proposition is key to our proof. The proof of the first case borrows ideas from (Smale & Zhou, 2007; Caponnetto & De Vito, 2007; Rudi et al., 2015; Myleiko et al., 2017; Lin et al., 2018), where the key step is an error decomposition from (Lin & Cevher, 2018). Our novelty lies in the proof of the second case; see the appendix for further details.

4.2 Proof for Projected-regularized Algorithms

To derive total error bounds from Proposition 8, it is necessary to develop probabilistic estimates for the random quantities appearing there. We thus introduce the following four lemmas.

Lemma 9 ((Lin & Cevher, 2018)).

Under Assumption 3, let , for some , and

(35)

We have with probability at least

and

Lemma 10.

Let It holds with probability at least

Here, \| \cdot \|_{HS} denotes the Hilbert-Schmidt norm.

Lemma 11.

Under Assumption 3, let It holds with probability at least

The proofs of the above lemmas follow by simply applying concentration inequalities for sums of Hilbert-space-valued random variables; we refer to (Lin & Rosasco, 2017) for details.

Lemma 12.

(Lin et al., 2018) Under Assumptions 1, 2 and 3, let be given by (30). For all the following holds with probability at least

(36)

Here, and

With the above probabilistic estimates and the analytic result, Proposition 8, we are now ready to prove the results for projected-regularized algorithms.

Proof of Theorem 1.

We use Proposition 8 to prove the result, and thus need to estimate the random quantities appearing in it. From Lemmas 9, 10, 11 and 12, we know that with probability at least 1 − δ,

(37)
(38)

The results thus follow by plugging the above estimates into (33) or (34), combined with a direct calculation. ∎

4.3 Proof for Sketched-regularized Algorithms

In order to apply Corollary 2 to sketched-regularized algorithms, we need to estimate the projection error. The basic idea is to approximate the projection error in terms of its 'empirical' version. The estimate is quite lengthy and is divided into several steps.

Lemma 13.

Let and Given a fix , assume that for ,