1 Introduction
The general nonparametric regression problem can be stated as
(1) 
where is a noise vector and is the function of interest to be approximated from the noisy observations . Here, is the space to which the covariates
belong. We consider the fixed design regression where the covariates are assumed to be deterministic. The problem has a long history in statistics and machine learning
[wasserman2006all, tsybakov2009intro]. In this paper, we assume that belongs to a reproducing kernel Hilbert space (RKHS), denoted as [paulsen2016introduction]. Such spaces are characterized by the existence of a reproducing kernel, that is, a positive semidefinite function that uniquely determines the underlying function space . RKHSs are very versatile modeling tools and include, for example, Sobolev spaces of smooth functions whose norms are measures of function roughness (the opposite of smoothness) [wahba1990spline]. Throughout, we think of these Sobolev spaces as the concrete examples of . By assuming an upper bound on the Hilbert norm of , we can encode a prior belief that the true data generating function has a certain degree of smoothness. Without loss of generality, we assume that belongs to the unit ball of the RKHS, that is,(2) 
A natural estimator is then, the kernel ridge regression (KRR), defined as the solution of the following optimization problem:
(3) 
where is a regularization parameter. It is wellknown that this problem can be reduced to a finitedimensional problem (by an application of the socalled representer theorem) [kimeldorf1971some]:
(4) 
is the (normalized empirical) kernel matrix. Although (4) has a closed form solution, it involves inverting an dense matrix, with time complexity , which is prohibitive in practice.
Various approximation schemes have been proposed to mitigate the computational costs, including (i) approximating the kernel matrix or (ii) directly approximating the optimization problem (4). Examples of the former are the Nyström approximation, column sampling and their variants [williams2001using, zhang2008improved, kumar2009ensemble, li2010making, talwalkar2014matrix]. An example of the latter is sketching [alaoui2015fast, yang2017randomized] where one restricts to the subspace
, for some random matrix
. It is in fact known that Nyström can be considered a special case of sketching with random standard basis vectors [yang2017randomized]. Sketching, with sufficiently large , has been shown in [yang2017randomized] to achieve minimax optimal rates over Sobolev spaces, under mild conditions on the sketching matrix . Similarly, the Nyström approximation has been analyzed in [cortes2010impact, yang2012nystrom, jin2013improved, alaoui2015fast, bach2013sharp] and [rudi2015less], the latter showing minimax rate optimality. In addition to the above, (iii) divide and conquer approaches have been proposed [zhang2013divide], where one solves the problem over subsamples and then aggregate by averaging, with some rate optimality guarantees. Other notable approaches to scaling include (iv) approximating translationinvariant kernel functions via Monte Carlo averages of tensor products of randomized feature maps
[rahimi2008random, le2013fastfood] and (v) applying stochastic gradient in the function space [dai2014scalable].In this paper, we consider the most direct kernel approximation, namely, replacing by its best rank approximation (in Frobenius norm). This amounts to truncating the eigenvalue decomposition of to its top eigenvalues. We refer to the resulting KRR approximation as the spectrallytruncated KRR (STKRR). Although somewhat slower than the Nyström approximation and fast forms of sketching, STKRR can be considered an ideal rank spectral approximation. By analyzing it, one can also gain insights about approximate SVD truncation approaches such as Nyström or sketching. In addition, practically, it is a very viable solution for moderate size problems. See Appendix A for a discussion of the time complexity of various schemes.
We derive an exact expression for the maximum (empirical) meansquared error (MSE) of STKRR, uniformly over the unit ball of the RKHS. This expression is solely in terms of the eigenvalues of the kernel matrix , the regularization parameter , the truncation level , and the noise level . Thus if one has access to and the noise level (or estimates of them), one can plot the exact regularization curve (maximum MSE versus ) for a given truncation level and sample size , and determine the optimal value of . We also note that since the empirical eigenvalues quickly approach those of the integral operator associated with (as ) [koltchinskii2000random], one can use these idealized eigenvalues instead of to get an excellent approximation of these regularization curves.
We then show that there is an optimal threshold on , the truncation level, which we denote as such that for all , the minimal maximumrisk of truncated KRR, with minimum taken over the regularization parameter, is strictly better than that of the full KRR whenever . In particular, for infinitedimensional RKHSs, , hence truncating at level is guaranteed to strictly improve performance. The slower the decay of the eigenvalues, the larger this gap in performance.
We also show how the exact expression for the maximum MSE can be used to easily establish an slightly weaker bound for STKRR, similar to those derived in [yang2017randomized] for sketching. We discuss the link between the statistical dimension considered in [yang2017randomized] and the optimal truncation level , and show how the same rateoptimality guarantees hold for STKRR. (Rateoptimality also follows form the fact that STKRR, with proper , strictly dominates full KRR and the latter is rateoptional. However, we do these calculations to make the comparison easier.)
Finally, we illustrate the results with some numerical simulations showing some further surprises. For example, the Gaussian kernel has a much faster eigendecay rate than a Sobolev1 kernel (exponential versus polynomial decay). Hence, the optimal truncation level asymptotically grows much slower for the Gaussian kernel. However, for finite samples, depending on the choice of the Gaussian bandwidth, the exact optimal truncation level, computed numerically, can be larger than that of Sobolev1.
2 Preliminaries
Before stating and studying the spectrallytruncated problem in more details, let us make some observations regarding the original KRR problem in (3). For , we define the kernel mapping
(5) 
Note that is a linear map from . This map is the link between the solutions of the two optimization problems (3) and (4): For any optimal solution of (4), will be an optimal solution of (3). The link is easy to establish by observing the following two identities:
(6) 
the first of which uses the reproducing property of the kernel: . We will frequently use this property in the sequel. The proof of the equivalence follows from an argument similar to our discussion of the identifiability below.
2.1 Identifiability
Let us first observe that in (1) is not (statistically) identifiable. That is, there are multiple functions (in fact infinitely many if is infinite dimensional) for which the vector has the exact same distribution. To see this, let
(7) 
Let be the projection of onto . (It is always possible to choose at least one such by the definition of projection and since is a closed subspace of .) Given observations , we can only hope to recover the following equivalence class:
where the last line follows since by the property of orthogonal projection (and can be absorbed into ).
We will use as the representative of the (identifiable) equivalence class of . We are interested in measuring functional deviations (e.g., the error in our estimate relative to the true function) in the empirical norm:
The use of this norm is common in the literature of nonparametric regression [van2000empirical, wainwright2019high]. It is interesting to note that ,
(8) 
and , since projections are contractive. Thus, recalling (2), also belongs to the Hilbert unit ball: . It is in fact easy to see that has the least Hilbert norm among the members in the equivalence class (i.e., the smoothest version). Thus, without loss of generality, we can identify with . Equivalently, we can assume from the start that is of the form for some . Note that the “no loss of generality” statement holds as long as we are working with the empirical norm, due to (8).
3 Main results
Let the eigenvalue decomposition (EVD) of the empirical kernel matrix defined in (4). Here,
is an orthogonal matrix and
where are the eigenvalues of . We assume for simplicity that , that is, the exact kernel matrix is invertible. Consider the rank approximation of , obtained by keeping the top eigenvalues and truncating the rest to zero, that is,Here, and collects the first columns of . The idea is to solve (4) with replaced with , to obtain . We then form our functional estimate by using the (exact) kernel mapping (5). An truncated regularized KRR estimator with input , is a function where
(9)  
(10) 
A minimizer in (9), without the additional condition , is not unique due to the rank deficiency of . Thus, we can ask for it to satisfy additional constraints. The equality condition in (10), which can be stated as can always be satisfied. It is enough to choose to be the unique minimizer in , that is, for some . This is how the estimator is often implemented in practice.
We are interested in the deviation of from the true function in the empirical norm. More precisely, we are interested in the meansquared error as the statistical risk:
Our main result is an expression for the worstcase risk of over the unit ball of the RKHS: Let be an truncated regularized kernel KRR estimator (Definition 3) applied to input generated from model (1). Let
where . Then, for all and ,
(11) 
with .
The first term in (11) is the worstcase approximation error (WAE) and the second term the estimation error (EE). The approximation error (AE) is the risk (relative to ) of which is obtained by passing the noiseless observations , instead of , through the estimation procedure. The AE is the deterministic part of the risk and is given by . The estimation error is the stochastic part of the risk and is given by .
The function attains its maximum of , over , at . Thus, as long as , the bound is good. In general,
(12) 
We note that since the KRR estimates are linear in , Theorem 3 easily gives the maximum MSE expression over the Hilbert ball of arbitrary radius , by replacing in (11) with and multiplying the entire righthand side by .
We also have a precise result on the regularized risk of the approximating function:
Let be obtained by passing the noiseless observations , instead of , through the estimation procedure in Definition 3. Then,
(13) 
3.1 Maximumrisk inadmissibility
Let us now consider how the maximum risk of the truncated KKR compares with the full version. For every, , define
In addition, recalling that is the full KRR estimator, let
(14) 
That is, is the regularization parameter that achieves the minimal maximumrisk for the full KRR. We have the following corollary of Theorem 3: For every , and every with ,
(15) 
In particular, for every ,
(16) 
Both inequalities are strict whenever .
Corollary 3.1 shows that optimized strictly improves on optimized full KRR whenever , in a sense rendering the full KRR inadmissible, as far as the maximum risk over is concerned. Note that we are not claiming inadmissibility in the classical sense which requires one estimator to improve on another for all . In general, the slower the decay of , the more significant the improvement gained by truncation. Note that (14) allows one to set the precise truncation level including the exact constants if one has access to the eigenvalues of the kernel matrix. In practice, for large , the eigenvalues of the associated kernel integral operator (if available) can act as excellent surrogates for [koltchinskii2000random].
3.2 Gaussian complexity and rates
Less precise bounds, albeit good enough to capture the correct asymptotic rate as , can be obtained in terms of the Gaussian complexity of the unit ball of the RKHS. These types of results have been obtained for the SketchedKRR. To make a comparison easier, let us show how such bounds can be obtained from Theorem 3.
Let us define the truncated complexity (of the empirical Hilbert ball) as
(17) 
For the case , this matches the definition of the kernel complexity in [yang2017randomized], which we refer to for the related background. In particular, (17) is a tight upper bound on the Gaussian complexity of the intersection of and [wainwright2019high, Chapter 13]. We have: [Looser bound] Under the setup of Theorem 3, for any ,
(18) 
If , one can replace the first term with for a better bound.
Choosing , we obtain
The latter upper bound is what one would get for the full KRR. Matching the two terms in that bound, we chooses such that which gives the wellknown critical radius for the KRR problem [wainwright2019high]. It is known that gives the optimal rate of convergence for estimating functions in , i.e., its rate of decay matches that of the minimax risk [yang2017randomized]. The above argument shows that as long as is taken large enough so that , the truncated KRR achieves (at least) the same rate as the full KRR. (For the sketching, the same conclusion is established in [yang2017randomized], where the smallest satisfying is referred to as the statistical dimension of the kernel.)
For Sobolev kernels, with eigendecay , we obtain . Interestingly, in this case, the estimate based on the weaker bound (18) and the exact bound (11) give the same rate (cf. Appendix C). This is expected since the given rate is known to be minimax optimal for Sobolev spaces. The same goes for the Gaussian kernel for which and the rate is for .
Orderwise, will be the same as defined in (14), that is , whenever matches the optimal rate. Hence, often for large and the argument leading to (12) suggests that in this case . Then,
For Sobolev kernels, this suggests truncation level which gives moderate savings for high smoothness levels . Similarly, for the Gaussian kernel, it is not hard to see that truncating to is enough to get the same rate as the full KRR, which is a substantial saving.
4 Simulations
We now present some numerical experiments to corroborate the theory. We consider a Gaussian kernel of bandwidth on , as well as the Sobolev1 kernel on . We take the covariates to be equispaced points in each interval. The top row of Figure 1 shows the plot of the theoretical maximum MSE as given by Theorem 3 for the two kernels, for both the full KRR (), and the optimally truncated version (). We have used in (11). As predicted by Theorem 3, the minimum achievable maximum MSE is smaller for the truncated KRR.
To compute the optimal truncation, we have evaluated the regularization curve of the full KRR first, obtained the minimizer and then used (14) to compute the optimal truncation level . For the setup of the simulation, we get for the Gaussian and for the Sobolev1. It is interesting to note that although in terms of rates, for the Gaussian should be asymptotically much smaller than that of Sobolev1, in finite samples, the truncation level for the Guassian could be bigger as can be seen here. This is due to the unspecified, potentially large, constants in the rates (that depend on the bandwidth as well). Also, notice how surprisingly small is relative to in both cases.
The bottom row of Figure 1 shows the empirical MSE obtained for a typical random , by computing the KRR estimates for observation and comparing with . The random true function is generated as where and further normalized so that . We have generated observations from (1) with . The plots were obtained using 1000 replications. The truncation levels are those calculated based on the maximum MSE formula (11). The plots show that for a typical application, the truncated KRR also dominates the full KRR.
5 Proof of the main result
The transform.
From the discussion in Section 2.1, both the KRR estimate and the true function belong to given in (7). It is then useful to have an expression for the empirical error of functions belonging to this space. First, we observe that . Now, take any , and let and . Then, we have
(19) 
where the fist equality is by the linearity of . For any function , we call the space representation of . Identity (19) shows that it is often easier to work in the space since the transform turns empirical norms on functions into the usual norms on vectors. In other words, the map , is a Hilbert space isometry from to . In the space, the KRR optimization problem can be equivalently stated as:
(20) 
where is the pseudo inverse of , and its range. More precisely: For any , problems (4) and (20) are equivalent in the following sense:

[label=]
It is often the case that the kernel matrix itself is invertible, in which case , and problem (20) simplifies. However, the equivalence in Lemma 5 holds even if we replace with an approximation which is rank deficient. This observation will be useful in the sequel.
Proof of Theorem 3.
Take to be as in Definition 3 and let . Since is the minimizer of we have or . Hence, or
(21) 
Let be the noise vector in (1) and . We also let
(22) 
Then, we can write model (1) as , where is zero mean with . From (19), we have , and
where the first equality uses assumption (10). It follows that
where the first term is the approximation error (AE) and the second term, the estimation error (EE). Let us write so that . We define
(23) 
and note that is diagonal. Let and . Then, since norm is unitarily invariant, we have
Controlling the estimation error: We have
using since is an orthogonal matrix. Then,
(24) 
establishing the EE part of the result.
Controlling the approximation error: Recall that we are interested in the worstcase approximation error (WAE) over the unit ball of the Hilbert space, i.e., over . Also, recall that without loss of generality, we can take . Hence,
(25) 
where the second equality is from (6), and the latter two are by definitions of and . We obtain
A further change of variable gives
where , applied to matrices, is the operator norm. Note that is a diagonal matrix with diagonal elements, for followed by zeros. It follows that is diagonal with diagonal elements:
(26) 
Since is a nonincreasing sequence, we obtain
which is the desired result. ∎
Appendix A Time complexity comparison
The STKRR and approximate versions such as Nyström and sketching all have time complexity of for computing the truncated KRR estimate, once the pieces required for approximating the kernel matrix (e.g., and in the case of sketching, and in the case of STKRR and so on) are computed. Computing these pieces is where these methods differ. For sketching, this step could have complexity as large as for dense sketches, for randomized Fourier and Hadamard sketches, to as low as for the Nyström.
For the STKRR, this step involves computing the top eigenpairs of the symmetric matrix , for which the Lanczos algorithm is the standard and for which a complexity analysis is hard to find in the literature. However, results of [kuczynski1992estimating] suggest that it has averagecase complexity . More precisely, [kuczynski1992estimating] shows that on average Lanczos iterations are enough to compute the top eigenvalue to within relative error , hence an overall averagecase complexity where is the number of nonzero elements of matrix .
Appendix B Remaining proofs
Proof of Proposition 3.
We will use the same notation as in the proof of Theorem 3. By the same argument as in that proof, we have where is defined in (23) and for given in (22). Let be the solution of (9) for the input (instead of ) so that . Using the optimality condition in the proof of Theorem 3,
where we have used (10) and (21), with (i.e., ). We can write
using , and . Recall from (25) that is equivalent to . It follows that
where the third equality is using the change of variable as in the proof of Theorem 3, and the last line follows since all the matrices are diagonal and hence commute. The result now follows by combining (24) and (26), after some algebra. ∎
Proof of Corollary 3.1.
Let be the estimation error of as in (11). Note that as long as , we have . It remains to show that the WAE of the truncated KRR is less than that of full KRR. We have for ,
This proves (15). For the second assertion, it is enough to apply (15) with , noting that in this case, the RHS will be the minimax risk of the full KRR and the LHS is further lower bounded by the minimax risk of the truncated KRR. ∎
Proof of Corollary 3.2.
For any , we have , where . Hence, the estimation error in (11) is bounded as
This upper bound is within a factor of of the estimation error. Using to shave off the power by one, we obtain the weaker bound:
(27) 
Recalling definition (17), we conclude that if ,
Combining with the WAE bound (12), we obtain the desired result. ∎
Appendix C Rate calculations
Here we compute the error rate predicted by the strong and weak bounds and show that they are the same. Let . Assume the polynomial eigendecay of the Sobolev kernel . Taking to be the smallest integer satisfying , we have
where the first inequality uses an integral approximation to the sum and the second uses the definition of . Setting we have , hence the critical radius .
Now consider the strong bound. As discussed in the text, . Also, as the proof of Corollary 3.2 shows, we have
Letting be defined as the smallest integer such that , we get as before. Then, the maximum MSE is bounded as
Since
Comments
There are no comments yet.