Spectrally-truncated kernel ridge regression and its free lunch

06/14/2019
by   Arash A. Amini, et al.

Kernel ridge regression (KRR) is a well-known and popular nonparametric regression approach with many desirable properties, including minimax rate-optimality in estimating functions that belong to common reproducing kernel Hilbert spaces (RKHS). The approach, however, is computationally intensive for large data sets, due to the need to operate on a dense n × n kernel matrix, where n is the sample size. Recently, various approximation schemes for solving KRR have been considered, and some of them analyzed. Some approaches, such as Nyström approximation and sketching, have been shown to preserve the rate optimality of KRR. In this paper, we consider the simplest approximation, namely, spectrally truncating the kernel matrix to its largest r < n eigenvalues. We derive an exact expression for the maximum risk of this truncated KRR over the unit ball of the RKHS. This result can be used to study the exact trade-off between the level of spectral truncation and the regularization parameter of the KRR. We show that, as long as the RKHS is infinite-dimensional, there is a threshold on r above which the spectrally-truncated KRR, surprisingly, outperforms the full KRR in terms of the minimax risk, where the minimum is taken over the regularization parameter. This strengthens the existing results on approximation schemes by showing that not only does one not lose in terms of the rates, but truncation can in fact improve the performance, for all finite samples (above the threshold). In other words, there is nothing to be gained by running the full KRR, and one should always truncate. Our proof is elementary and distribution-free, only requiring the noise vector to be isotropic.



1 Introduction

The general nonparametric regression problem can be stated as

y_i = f^*(x_i) + w_i, \qquad i = 1, \dots, n, \qquad\qquad (1)

where w = (w_1, \dots, w_n) is a noise vector and f^* is the function of interest, to be approximated from the noisy observations y = (y_1, \dots, y_n). Here, \mathcal{X} is the space to which the covariates

x_1, \dots, x_n belong. We consider fixed-design regression, where the covariates are assumed to be deterministic. The problem has a long history in statistics and machine learning

[wasserman2006all, tsybakov2009intro]. In this paper, we assume that f^* belongs to a reproducing kernel Hilbert space (RKHS), denoted \mathbb{H} [paulsen2016introduction]. Such spaces are characterized by the existence of a reproducing kernel, that is, a positive semidefinite function that uniquely determines the underlying function space \mathbb{H}. RKHSs are very versatile modeling tools and include, for example, Sobolev spaces of smooth functions, whose norms are measures of function roughness (the opposite of smoothness) [wahba1990spline]. Throughout, we think of these Sobolev spaces as the concrete examples of \mathbb{H}. By assuming an upper bound on the Hilbert norm of f^*, we can encode a prior belief that the true data-generating function has a certain degree of smoothness. Without loss of generality, we assume that f^* belongs to the unit ball of the RKHS, that is,

\|f^*\|_{\mathbb{H}} \le 1. \qquad\qquad (2)

A natural estimator is then the kernel ridge regression (KRR) estimator, defined as the solution of the following optimization problem:

\hat{f} \in \operatorname*{argmin}_{f \in \mathbb{H}} \; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2 + \lambda \, \|f\|_{\mathbb{H}}^2, \qquad\qquad (3)

where \lambda > 0 is a regularization parameter. It is well-known that this problem can be reduced to a finite-dimensional problem (by an application of the so-called representer theorem) [kimeldorf1971some]:

(4)

where K \in \mathbb{R}^{n \times n} is the (normalized empirical) kernel matrix. Although (4) has a closed-form solution, it involves inverting an n \times n dense matrix, with time complexity O(n^3), which is prohibitive in practice.
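To make the cost concrete, here is a minimal numpy sketch of the closed-form solve, under an assumed (common) parameterization in which the unnormalized kernel matrix is used and the coefficients solve (K + n\lambda I)\alpha = y; the paper's normalized kernel matrix may differ by constants, so this is illustrative rather than the paper's exact formulation.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def krr_fit(X, y, lam, bandwidth=1.0):
    """Full KRR: solve (K + n*lam*I) alpha = y, a dense O(n^3) operation."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return alpha, K

def krr_predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate f_hat(x) = sum_i alpha_i k(x, x_i) at new points."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha
```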

Various approximation schemes have been proposed to mitigate the computational cost, including (i) approximating the kernel matrix or (ii) directly approximating the optimization problem (4). Examples of the former are the Nyström approximation, column sampling, and their variants [williams2001using, zhang2008improved, kumar2009ensemble, li2010making, talwalkar2014matrix]. An example of the latter is sketching [alaoui2015fast, yang2017randomized], where one restricts the coefficient vector to a low-dimensional subspace determined by a random sketching matrix. It is in fact known that Nyström can be considered a special case of sketching with random standard basis vectors [yang2017randomized]. Sketching, with a sufficiently large sketch dimension, has been shown in [yang2017randomized] to achieve minimax optimal rates over Sobolev spaces, under mild conditions on the sketching matrix. Similarly, the Nyström approximation has been analyzed in [cortes2010impact, yang2012nystrom, jin2013improved, alaoui2015fast, bach2013sharp] and [rudi2015less], the latter showing minimax rate optimality. In addition to the above, (iii) divide-and-conquer approaches have been proposed [zhang2013divide], where one solves the problem over subsamples and then aggregates by averaging, with some rate-optimality guarantees. Other notable approaches to scaling include (iv) approximating translation-invariant kernel functions via Monte Carlo averages of tensor products of randomized feature maps [rahimi2008random, le2013fastfood] and (v) applying stochastic gradient descent in function space [dai2014scalable].

In this paper, we consider the most direct kernel approximation, namely, replacing K by its best rank-r approximation (in Frobenius norm). This amounts to truncating the eigenvalue decomposition of K to its top r eigenvalues. We refer to the resulting KRR approximation as the spectrally-truncated KRR (ST-KRR). Although somewhat slower than the Nyström approximation and fast forms of sketching, ST-KRR can be considered an ideal rank-r spectral approximation. By analyzing it, one can also gain insight into approximate SVD truncation approaches such as Nyström or sketching. In addition, it is practically a very viable solution for moderate-size problems. See Appendix A for a discussion of the time complexity of various schemes.
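Concretely, under the same assumed parameterization as the sketch above, the r-truncated estimator only needs the top-r eigenpairs of the kernel matrix; a minimal version:

```python
import numpy as np

def st_krr_fit(K, y, lam, r):
    """Spectrally-truncated KRR (illustrative sketch).

    Replaces K by its best rank-r approximation K_r = U_r diag(mu_1..mu_r) U_r'
    and returns the minimum-norm coefficient vector, which lies in the span of
    the top-r eigenvectors. Assumes the same (K + n*lam*I)-style objective as
    the full-KRR sketch above; the paper's normalization may differ."""
    n = K.shape[0]
    mu, U = np.linalg.eigh(K)                   # eigenvalues in ascending order
    idx = np.argsort(mu)[::-1][:r]              # indices of the top-r eigenvalues
    mu_r, U_r = mu[idx], U[:, idx]
    beta = (U_r.T @ y) / (mu_r + n * lam)       # r-dimensional ridge solve
    return U_r @ beta                           # minimum-norm alpha in R^n
```

The resulting coefficient vector is then plugged into the exact kernel mapping for prediction, e.g. `krr_predict(X_train, alpha, X_new)` from the previous snippet.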

We derive an exact expression for the maximum (empirical) mean-squared error (MSE) of ST-KRR, uniformly over the unit ball of the RKHS. This expression is solely in terms of the eigenvalues of the kernel matrix K, the regularization parameter \lambda, the truncation level r, and the noise level \sigma. Thus, if one has access to the eigenvalues and the noise level (or estimates of them), one can plot the exact regularization curve (maximum MSE versus \lambda) for a given truncation level r and sample size n, and determine the optimal value of \lambda. We also note that since the empirical eigenvalues quickly approach those of the integral operator associated with the kernel (as n grows) [koltchinskii2000random], one can use these idealized eigenvalues instead of the empirical ones to get an excellent approximation of these regularization curves.
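The exact expression (11) is not reproduced here, but the workflow it enables is easy to sketch: given the eigenvalues, a noise level, and a truncation level, evaluate a worst-case-risk formula on a grid of \lambda and minimize. The `max_mse` function below is an assumed stand-in of the generic bias-variance shape (worst-case squared bias over the kept and discarded eigendirections, plus the variance of the truncated ridge smoother); it should be replaced by the paper's formula (11) when implementing the method for real.

```python
import numpy as np

def max_mse(mu, lam, r, n, sigma2):
    """ASSUMED stand-in for the exact worst-case MSE (11); illustrative only."""
    mu_kept = mu[:r]
    worst_bias = (lam**2) * mu_kept / (mu_kept + lam)**2    # kept directions
    tail = mu[r] if r < len(mu) else 0.0                    # discarded directions
    variance = sigma2 / n * np.sum(mu_kept**2 / (mu_kept + lam)**2)
    return max(worst_bias.max(), tail) + variance

# Example: a polynomially decaying spectrum (Sobolev-1-like), purely illustrative.
n, sigma2, r = 500, 0.25, 50
mu = 1.0 / np.arange(1, n + 1)**2                           # decreasing eigenvalues
lams = np.logspace(-8, 0, 200)
curve = np.array([max_mse(mu, lam, r, n, sigma2) for lam in lams])
lam_opt = lams[curve.argmin()]                              # minimizer of the curve
```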

We then show that there is a threshold on r, the truncation level, which we denote as r^*, such that for all r \ge r^*, the minimal maximum-risk of the r-truncated KRR, with the minimum taken over the regularization parameter, is strictly better than that of the full KRR whenever r^* < n. In particular, for infinite-dimensional RKHSs, r^* < n, hence truncating at level r^* is guaranteed to strictly improve performance. The slower the decay of the eigenvalues, the larger this gap in performance.

We also show how the exact expression for the maximum MSE can be used to easily establish a slightly weaker bound for ST-KRR, similar to those derived in [yang2017randomized] for sketching. We discuss the link between the statistical dimension considered in [yang2017randomized] and the optimal truncation level r^*, and show how the same rate-optimality guarantees hold for ST-KRR. (Rate-optimality also follows from the fact that ST-KRR, with proper truncation, strictly dominates the full KRR, and the latter is rate-optimal. However, we do these calculations to make the comparison easier.)

Finally, we illustrate the results with some numerical simulations, which show some further surprises. For example, the Gaussian kernel has a much faster eigendecay rate than a Sobolev-1 kernel (exponential versus polynomial decay). Hence, the optimal truncation level asymptotically grows much more slowly for the Gaussian kernel. However, for finite samples, depending on the choice of the Gaussian bandwidth, the exact optimal truncation level, computed numerically, can be larger than that of the Sobolev-1 kernel.

2 Preliminaries

Before stating and studying the spectrally-truncated problem in more detail, let us make some observations regarding the original KRR problem in (3). For \omega \in \mathbb{R}^n, we define the kernel mapping

(5)

Note that this is a linear map from \mathbb{R}^n to \mathbb{H}. This map is the link between the solutions of the two optimization problems (3) and (4): for any optimal solution \hat{\omega} of (4), its image under the kernel mapping will be an optimal solution of (3). The link is easy to establish by observing the following two identities:

(6)

the first of which uses the reproducing property of the kernel: f(x) = \langle f, K(\cdot, x) \rangle_{\mathbb{H}} for all f \in \mathbb{H}. We will frequently use this property in the sequel. The proof of the equivalence follows from an argument similar to our discussion of identifiability below.

2.1 Identifiability

Let us first observe that f^* in (1) is not (statistically) identifiable. That is, there are multiple functions (in fact, infinitely many if \mathbb{H} is infinite-dimensional) for which the observation vector y has the exact same distribution. To see this, let

\mathcal{F}_n := \operatorname{span}\{ K(\cdot, x_1), \dots, K(\cdot, x_n) \} \subseteq \mathbb{H}. \qquad\qquad (7)

Let \bar{f} be the projection of f^* onto \mathcal{F}_n. (It is always possible to choose at least one such projection, by the definition of projection and since \mathcal{F}_n is a closed subspace of \mathbb{H}.) Given the observations y, we can only hope to recover the following equivalence class:

[f^*] \;=\; \{ f \in \mathbb{H} : f(x_i) = f^*(x_i), \; i = 1, \dots, n \} \;=\; \bar{f} + \mathcal{F}_n^{\perp},

where the last line follows since f^* - \bar{f} \in \mathcal{F}_n^{\perp} by the property of orthogonal projection (and any element of \mathcal{F}_n^{\perp} vanishes at the x_i, hence can be absorbed into the equivalence class).

We will use \bar{f} as the representative of the (identifiable) equivalence class of f^*. We are interested in measuring functional deviations (e.g., the error in our estimate relative to the true function) in the empirical norm:

\|f\|_n^2 \;:=\; \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2.

The use of this norm is common in the literature on nonparametric regression [van2000empirical, wainwright2019high]. It is interesting to note that,

\|\bar{f} - f^*\|_n = 0, \qquad\qquad (8)

and \|\bar{f}\|_{\mathbb{H}} \le \|f^*\|_{\mathbb{H}}, since projections are contractive. Thus, recalling (2), \bar{f} also belongs to the Hilbert unit ball: \|\bar{f}\|_{\mathbb{H}} \le 1. It is in fact easy to see that \bar{f} has the least Hilbert norm among the members of the equivalence class (i.e., it is the smoothest version). Thus, without loss of generality, we can identify f^* with \bar{f}. Equivalently, we can assume from the start that f^* is the image of some \omega^* \in \mathbb{R}^n under the kernel mapping (5). Note that the "no loss of generality" statement holds as long as we are working with the empirical norm, due to (8).
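For concreteness, the least-norm claim is the usual Pythagorean argument for orthogonal projections (stated here in the generic notation used above): any member of the equivalence class differs from \bar{f} by an element of \mathcal{F}_n^{\perp}, so

\|f\|_{\mathbb{H}}^2 \;=\; \|\bar{f}\|_{\mathbb{H}}^2 + \|f - \bar{f}\|_{\mathbb{H}}^2 \;\ge\; \|\bar{f}\|_{\mathbb{H}}^2, \qquad f \in [f^*],

with equality if and only if f = \bar{f}.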

3 Main results

Let K = U \operatorname{diag}(\mu_1, \dots, \mu_n)\, U^T be the eigenvalue decomposition (EVD) of the empirical kernel matrix K defined in (4). Here, U \in \mathbb{R}^{n \times n} is an orthogonal matrix and \mu_1 \ge \mu_2 \ge \dots \ge \mu_n \ge 0 are the eigenvalues of K. We assume for simplicity that \mu_n > 0, that is, the exact kernel matrix is invertible. Consider the rank-r approximation of K, obtained by keeping the top r eigenvalues and truncating the rest to zero, that is,

K_r := U_r \operatorname{diag}(\mu_1, \dots, \mu_r)\, U_r^T.

Here, U_r \in \mathbb{R}^{n \times r} collects the first r columns of U. The idea is to solve (4) with K replaced with K_r, to obtain the coefficient vector, and then form our functional estimate by applying the (exact) kernel mapping (5). An r-truncated \lambda-regularized KRR estimator with input y is a function \hat{f}_{r,\lambda} where

(9)
(10)

A minimizer in (9), without the additional condition in (10), is not unique, due to the rank deficiency of K_r; thus, we can ask for it to satisfy additional constraints. The equality condition in (10), which can be stated as restricting the minimizer to the column space of K_r, can always be satisfied: it is enough to choose the unique minimizer in that column space, that is, one of the form U_r \beta for some \beta \in \mathbb{R}^r. This is how the estimator is often implemented in practice.

We are interested in the deviation of \hat{f}_{r,\lambda} from the true function in the empirical norm. More precisely, we take the mean-squared error \mathbb{E}\, \|\hat{f}_{r,\lambda} - \bar{f}\|_n^2 as the statistical risk.

Our main result is an exact expression for the worst-case risk of \hat{f}_{r,\lambda} over the unit ball of the RKHS: Let \hat{f}_{r,\lambda} be an r-truncated \lambda-regularized KRR estimator (Definition 3) applied to input y generated from model (1). Let

Then, for all \lambda > 0 and r,

(11)

with .

The first term in (11) is the worst-case approximation error (WAE) and the second term is the estimation error (EE). The approximation error (AE) is the risk (relative to \bar{f}) of the estimate obtained by passing the noiseless observations, instead of y, through the estimation procedure. The AE is the deterministic part of the risk, while the estimation error is the stochastic part.
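As a point of reference (this is the generic decomposition behind these two terms, not the paper's exact expression (11)): for any estimator \hat{f} with finite second moment,

\mathbb{E}\, \|\hat{f} - \bar{f}\|_n^2 \;=\; \underbrace{\|\mathbb{E}\hat{f} - \bar{f}\|_n^2}_{\text{approximation error}} \;+\; \underbrace{\mathbb{E}\, \|\hat{f} - \mathbb{E}\hat{f}\|_n^2}_{\text{estimation error}},

and the WAE is the supremum of the first term over the unit ball \|f^*\|_{\mathbb{H}} \le 1.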

The function attains its maximum of , over , at . Thus, as long as , the bound is good. In general,

(12)

We note that since the KRR estimates are linear in y, Theorem 3 easily gives the maximum MSE expression over the Hilbert ball of arbitrary radius R: replace \sigma^2 in (11) with \sigma^2 / R^2 and multiply the entire right-hand side by R^2.

We also have a precise result on the regularized risk of the approximating function:

Let the approximating function be obtained by passing the noiseless observations, instead of y, through the estimation procedure in Definition 3. Then,

(13)

3.1 Maximum-risk inadmissibility

Let us now consider how the maximum risk of the truncated KRR compares with that of the full version. For every r, define

In addition, recalling that the case r = n is the full KRR estimator, let

(14)

That is, \lambda^* is the regularization parameter that achieves the minimal maximum-risk for the full KRR. We have the following corollary of Theorem 3: For every r, and every \lambda with,

(15)

In particular, for every r \ge r^*,

(16)

Both inequalities are strict whenever .

Corollary 3.1 shows that the \lambda-optimized truncated KRR strictly improves on the \lambda-optimized full KRR whenever r^* < n, in a sense rendering the full KRR inadmissible, as far as the maximum risk over the unit ball of \mathbb{H} is concerned. Note that we are not claiming inadmissibility in the classical sense, which requires one estimator to improve on another for all f^*. In general, the slower the decay of the eigenvalues, the more significant the improvement gained by truncation. Note that (14) allows one to set the precise truncation level, including the exact constants, if one has access to the eigenvalues of the kernel matrix. In practice, for large n, the eigenvalues of the associated kernel integral operator (if available) can act as excellent surrogates for the empirical eigenvalues [koltchinskii2000random].
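Continuing the earlier sketch (reusing `max_mse`, `mu`, `n`, `sigma2`, and `lams`), one plausible reading of this recipe is: optimize \lambda for the full KRR and then truncate at the smallest r whose next eigenvalue falls below \lambda^*. Both the risk formula and this threshold rule are assumptions standing in for (11) and (14), respectively.

```python
# Regularization curve of the *full* KRR (r = n) and its minimizer lambda*.
risk_full = np.array([max_mse(mu, lam, n, n, sigma2) for lam in lams])
lam_star = lams[risk_full.argmin()]

# Assumed threshold rule: smallest r with mu_{r+1} <= lambda*.
below = np.nonzero(mu <= lam_star)[0]
r_star = int(below[0]) if below.size else n
```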

3.2 Gaussian complexity and rates

Less precise bounds, albeit good enough to capture the correct asymptotic rate as n grows, can be obtained in terms of the Gaussian complexity of the unit ball of the RKHS. These types of results have been obtained for the sketched KRR. To make the comparison easier, let us show how such bounds can be obtained from Theorem 3.

Let us define the r-truncated complexity (of the empirical Hilbert ball) as

\mathcal{R}_r(\delta) \;:=\; \sqrt{ \frac{1}{n} \sum_{i=1}^{r} \min\{ \delta^2, \mu_i \} }. \qquad\qquad (17)

For the case r = n, this matches the definition of the kernel complexity in [yang2017randomized], which we refer to for the related background. In particular, (17) is a tight upper bound on the Gaussian complexity of the intersection of the unit Hilbert ball and the \delta-ball in the empirical norm [wainwright2019high, Chapter 13]. We have: [Looser bound] Under the setup of Theorem 3, for any \delta > 0,

(18)

If , one can replace the first term with for a better bound.

Choosing , we obtain

The latter upper bound is what one would get for the full KRR. Matching the two terms in that bound, we choose \delta such that \mathcal{R}_n(\delta) \asymp \delta^2 / \sigma, which gives the well-known critical radius \delta_n for the KRR problem [wainwright2019high]. It is known that \delta_n^2 gives the optimal rate of convergence for estimating functions in the unit ball of \mathbb{H}, i.e., its rate of decay matches that of the minimax risk [yang2017randomized]. The above argument shows that as long as r is taken large enough so that \mu_{r+1} \lesssim \delta_n^2, the r-truncated KRR achieves (at least) the same rate as the full KRR. (For sketching, the same conclusion is established in [yang2017randomized], where the smallest r satisfying \mu_r \le \delta_n^2 is referred to as the statistical dimension of the kernel.)
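A small numerical sketch of these quantities, under assumed conventions (the critical-radius equation \mathcal{R}_n(\delta) = \delta^2/(2\sigma) follows the textbook formulation in [wainwright2019high]; the constants are not the paper's):

```python
import numpy as np

def kernel_complexity(mu, delta, n):
    """R(delta) = sqrt((1/n) * sum_j min(delta^2, mu_j)); cf. (17) with r = n."""
    return np.sqrt(np.sum(np.minimum(delta**2, mu)) / n)

def critical_radius(mu, n, sigma=1.0, lo=1e-8, hi=10.0, iters=100):
    """Bisection for the delta solving R(delta) = delta^2 / (2*sigma)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kernel_complexity(mu, mid, n) > mid**2 / (2.0 * sigma):
            lo = mid                      # complexity still dominates: move up
        else:
            hi = mid
    return hi

def statistical_dimension(mu, delta_n):
    """Smallest r with mu_r <= delta_n^2 (eigenvalues in decreasing order)."""
    below = np.nonzero(mu <= delta_n**2)[0]
    return int(below[0]) + 1 if below.size else len(mu)
```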

For Sobolev-\alpha kernels, with eigendecay \mu_j \asymp j^{-2\alpha}, we obtain \delta_n^2 \asymp n^{-2\alpha/(2\alpha+1)}. Interestingly, in this case, the estimates based on the weaker bound (18) and the exact bound (11) give the same rate (cf. Appendix C). This is expected, since the given rate is known to be minimax optimal for Sobolev spaces. The same goes for the Gaussian kernel, for which the eigenvalues decay exponentially and the rate is nearly parametric (of order 1/n up to logarithmic factors).

Order-wise, will be the same as defined in (14), that is , whenever matches the optimal rate. Hence, often for large and the argument leading to (12) suggests that in this case . Then,

For Sobolev-\alpha kernels, this suggests a truncation level r \asymp n^{1/(2\alpha+1)}, which gives moderate savings for high smoothness levels \alpha. Similarly, for the Gaussian kernel, it is not hard to see that truncating to r = O(\log n) is enough to get the same rate as the full KRR, which is a substantial saving.

Figure 1: Plots of (top) the maximum theoretical MSE and (bottom) the typical empirical MSE, versus the regularization parameter \lambda, for the Gaussian (with bandwidth = 10) and Sobolev-1 kernels with equispaced samples. The optimally-truncated KRR is shown together with the full KRR.

4 Simulations

We now present some numerical experiments to corroborate the theory. We consider a Gaussian kernel (with the bandwidth given in Figure 1), as well as the Sobolev-1 kernel, and take the covariates to be equispaced points in each interval. The top row of Figure 1 shows the theoretical maximum MSE, as given by Theorem 3, for the two kernels, for both the full KRR and the optimally truncated version. As predicted by Theorem 3, the minimum achievable maximum MSE is smaller for the truncated KRR.

To compute the optimal truncation, we first evaluated the regularization curve of the full KRR, obtained the minimizer \lambda^*, and then used (14) to compute the optimal truncation level r^*. For the setup of the simulation, we obtain a value of r^* for the Gaussian kernel and another for the Sobolev-1 kernel. It is interesting to note that although, in terms of rates, r^* for the Gaussian kernel should be asymptotically much smaller than that of the Sobolev-1 kernel, in finite samples the truncation level for the Gaussian could be bigger, as seen here. This is due to the unspecified, potentially large, constants in the rates (which depend on the bandwidth as well). Also, notice how surprisingly small r^* is relative to n in both cases.

The bottom row of Figure 1 shows the empirical MSE obtained for a typical random f^*, by computing the KRR estimates from the observations y and comparing with f^*. The random true function is generated as a random element of the span of the kernel sections, normalized so that \|f^*\|_{\mathbb{H}} = 1 (see the sketch below). We have generated observations from (1). The plots were obtained using 1000 replications. The truncation levels are those calculated based on the maximum MSE formula (11). The plots show that, for a typical application, the truncated KRR also dominates the full KRR.
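A compact end-to-end sketch of one way to run this kind of experiment. All concrete choices here (Gaussian kernel, bandwidth, noise level, \lambda, truncation level, number of replications) are illustrative and not necessarily those used for Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, bw, lam, r = 200, 0.5, 0.1, 1e-3, 20        # illustrative values

x = np.linspace(0.0, 1.0, n)[:, None]
K = np.exp(-(x - x.T)**2 / (2.0 * bw**2))             # Gaussian kernel matrix

# Random true function f* = sum_i c_i k(., x_i), normalized so ||f*||_H = 1.
coef = rng.standard_normal(n)
coef /= np.sqrt(coef @ K @ coef)                      # ||f*||_H^2 = c' K c
f_star = K @ coef                                     # f*(x_i), i = 1..n

def fit(Kmat, y, lam, r=None):
    """Ridge coefficients; if r is given, use the rank-r truncated kernel."""
    mu, U = np.linalg.eigh(Kmat)
    order = np.argsort(mu)[::-1]
    mu, U = mu[order], U[:, order]
    if r is not None:
        mu, U = mu[:r], U[:, :r]
    return U @ ((U.T @ y) / (mu + len(y) * lam))

mse_full, mse_trunc = [], []
for _ in range(200):                                  # replications
    y = f_star + sigma * rng.standard_normal(n)
    for store, rr in ((mse_full, None), (mse_trunc, r)):
        f_hat = K @ fit(K, y, lam, rr)                # fitted values at design points
        store.append(np.mean((f_hat - f_star)**2))

print("full KRR MSE:", np.mean(mse_full), "truncated KRR MSE:", np.mean(mse_trunc))
```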

5 Proof of the main result

Here we give the proof of Theorem 3. The remaining proofs can be found in Appendix B.

The -transform.

From the discussion in Section 2.1, both the KRR estimate and the true function belong to given in (7). It is then useful to have an expression for the empirical error of functions belonging to this space. First, we observe that . Now, take any , and let and . Then, we have

(19)

where the first equality is by the linearity of the kernel mapping. For any function in \mathcal{F}_n, we call the corresponding coefficient vector its \omega-space representation. Identity (19) shows that it is often easier to work in the \omega-space, since the \omega-transform turns empirical norms on functions into the usual norms on vectors. In other words, the \omega-transform is a Hilbert space isometry between \mathcal{F}_n, equipped with the empirical norm, and \mathbb{R}^n, equipped with the Euclidean norm. In the \omega-space, the KRR optimization problem can be equivalently stated as:

(20)

where K^{\dagger} is the pseudo-inverse of K, and \operatorname{range}(K) its range. More precisely: For any \lambda > 0, problems (4) and (20) are equivalent in the following sense:


  • For any minimizer of (4), is a minimizer of (20), and

  • for any minimizer of (20), any is a minimizer of (4).

It is often the case that the kernel matrix itself is invertible, in which case the pseudo-inverse is the ordinary inverse and the range is all of \mathbb{R}^n, so that problem (20) simplifies. However, the equivalence in Lemma 5 holds even if we replace K with a rank-deficient approximation. This observation will be useful in the sequel.

Proof of Theorem 3.

Take to be as in Definition 3 and let . Since is the minimizer of we have or . Hence, or

(21)

Let be the noise vector in (1) and . We also let

(22)

Then, we can write model (1) as , where is zero mean with . From (19), we have , and

where the first equality uses assumption (10). It follows that

where the first term is the approximation error (AE) and the second term, the estimation error (EE). Let us write so that . We define

(23)

and note that is diagonal. Let and . Then, since norm is unitarily invariant, we have

Controlling the estimation error: We have

using the orthogonal invariance of the Euclidean norm, since U is an orthogonal matrix. Then,

(24)

establishing the EE part of the result.

Controlling the approximation error: Recall that we are interested in the worst-case approximation error (WAE) over the unit ball of the Hilbert space, i.e., over . Also, recall that without loss of generality, we can take . Hence,

(25)

where the second equality is from (6), and the latter two are by definitions of and . We obtain

A further change of variable gives

where , applied to matrices, is the operator norm. Note that is a diagonal matrix with diagonal elements, for followed by zeros. It follows that is diagonal with diagonal elements:

(26)

Since (\mu_i) is a non-increasing sequence, we obtain

which is the desired result. ∎

Appendix A Time complexity comparison

The ST-KRR and approximate versions such as Nyström and sketching all have comparable time complexity for computing the r-truncated KRR estimate, once the pieces required for approximating the kernel matrix (e.g., the sketched matrices in the case of sketching, and the top-r eigenpairs in the case of ST-KRR) are computed. Computing these pieces is where the methods differ. For sketching, this step could have complexity as large as O(n^2 r) for dense sketches, O(n^2 \log r) for randomized Fourier and Hadamard sketches, down to as low as O(n r) for the Nyström approximation.

For the ST-KRR, this step involves computing the top r eigenpairs of the symmetric matrix K, for which the Lanczos algorithm is the standard approach and for which a complexity analysis is hard to find in the literature. However, the results of [kuczynski1992estimating] suggest an average-case complexity that is near-quadratic in n for a dense kernel matrix. More precisely, [kuczynski1992estimating] shows that, on average, O(\log(n)/\sqrt{\varepsilon}) Lanczos iterations are enough to compute the top eigenvalue to within relative error \varepsilon, hence an overall average-case complexity of O(\mathrm{nnz}(K) \log(n) / \sqrt{\varepsilon}), where \mathrm{nnz}(K) is the number of nonzero elements of the matrix K.
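In practice, the top-r eigenpairs can be obtained with an off-the-shelf Lanczos-type routine; for example, scipy's `eigsh` (which wraps ARPACK's implicitly restarted Lanczos) can be used as follows. This is a usage sketch, not a complexity claim.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def top_r_eigenpairs(K, r):
    """Top-r eigenpairs of a symmetric PSD kernel matrix via ARPACK (Lanczos).

    Returns eigenvalues in decreasing order with matching eigenvectors.
    Requires r < n; for r close to n, a full numpy.linalg.eigh is usually faster."""
    mu, U = eigsh(K, k=r, which='LM')   # largest-magnitude eigenvalues
    order = np.argsort(mu)[::-1]
    return mu[order], U[:, order]
```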

Appendix B Remaining proofs

Proof of Proposition 3.

We will use the same notation as in the proof of Theorem 3. By the same argument as in that proof, we have where is defined in (23) and for given in (22). Let be the solution of (9) for the input (instead of ) so that . Using the optimality condition in the proof of Theorem 3,

where we have used (10) and (21), with (i.e., ). We can write

using , and . Recall from (25) that is equivalent to . It follows that

where the third equality uses the change of variable as in the proof of Theorem 3, and the last line follows since all the matrices involved are diagonal and hence commute. The result now follows by combining (24) and (26), after some algebra. ∎

Proof of Corollary 3.1.

Let the estimation error be as in (11). Note that as long as r \le n, the estimation error of the truncated KRR is no larger than that of the full KRR. It remains to show that the WAE of the truncated KRR is less than that of the full KRR. We have, for r \ge r^*,

This proves (15). For the second assertion, it is enough to apply (15) with , noting that in this case, the RHS will be the minimax risk of the full KRR and the LHS is further lower bounded by the minimax risk of the truncated KRR. ∎

Proof of Corollary 3.2.

For any , we have , where . Hence, the estimation error in (11) is bounded as

This upper bound is within a factor of of the estimation error. Using to shave off the power by one, we obtain the weaker bound:

(27)

Recalling definition (17), we conclude that if ,

Combining with the WAE bound (12), we obtain the desired result. ∎

Proof of Lemma 5.

Let and be the objective functions in (4) and (20), respectively. We have for any , which follows from the identity . Now, assume that is a minimizer of , and let . Pick any ; there exists such that , and we have . The other direction follows similarly. ∎

Appendix C Rate calculations

Here we compute the error rate predicted by the strong and weak bounds and show that they are the same. Let \delta > 0. Assume the polynomial eigendecay \mu_j \asymp j^{-2\alpha} of the Sobolev-\alpha kernel. Taking d to be the smallest integer satisfying \mu_d \le \delta^2, we have

where the first inequality uses an integral approximation to the sum and the second uses the definition of d. Setting the two sides of the critical inequality equal, we obtain the critical radius \delta_n^2 \asymp n^{-2\alpha/(2\alpha+1)}.
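For reference, a standard version of this calculation, with constants suppressed and using the complexity (17) with r = n: assuming \mu_j \asymp j^{-2\alpha} and taking d \asymp \delta^{-1/\alpha} (so that \mu_d \asymp \delta^2),

\mathcal{R}_n(\delta)^2 \;=\; \frac{1}{n} \sum_{j=1}^{n} \min\{\delta^2, \mu_j\} \;\lesssim\; \frac{1}{n} \Big( d\,\delta^2 + \sum_{j > d} j^{-2\alpha} \Big) \;\lesssim\; \frac{d\,\delta^2}{n} \;\asymp\; \frac{\delta^{2 - 1/\alpha}}{n}.

Solving \mathcal{R}_n(\delta) \asymp \delta^2 then gives \delta_n^2 \asymp n^{-2\alpha/(2\alpha+1)}.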

Now consider the strong bound. As discussed in the text, . Also, as the proof of Corollary 3.2 shows, we have

Letting be defined as the smallest integer such that , we get as before. Then, the maximum MSE is bounded as

Since