# Randomized sketches for kernels: Fast and optimal non-parametric regression

Kernel ridge regression (KRR) is a standard method for performing non-parametric regression over reproducing kernel Hilbert spaces. Given n samples, the time and space complexity of computing the KRR estimate scale as O(n^3) and O(n^2) respectively, and so is prohibitive in many cases. We propose approximations of KRR based on m-dimensional randomized sketches of the kernel matrix, and study how small the projection dimension m can be chosen while still preserving minimax optimality of the approximate KRR estimate. For various classes of randomized sketches, including those based on Gaussian and randomized Hadamard matrices, we prove that it suffices to choose the sketch dimension m proportional to the statistical dimension (modulo logarithmic factors). Thus, we obtain fast and minimax optimal approximations to the KRR estimate for non-parametric regression.

## Authors

• 24 publications
• 27 publications
• 80 publications
• ### Optimal Rates of Sketched-regularized Algorithms for Least-Squares Regression over Hilbert Spaces

We investigate regularized algorithms combining with projection for leas...
03/12/2018 ∙ by Junhong Lin, et al. ∙ 0

• ### Early stopping and non-parametric regression: An optimal data-dependent stopping rule

The strategy of early stopping is a regularization technique based on ch...
06/15/2013 ∙ by Garvesh Raskutti, et al. ∙ 0

• ### Deterministic error bounds for kernel-based learning techniques under bounded noise

We consider the problem of reconstructing a function from a finite set o...
08/10/2020 ∙ by Emilio T. Maddalena, et al. ∙ 0

• ### Non-parametric sparse additive auto-regressive network models

Consider a multi-variate time series (X_t)_t=0^T where X_t ∈R^d which ma...
01/23/2018 ∙ by Hao Zhou, et al. ∙ 0

• ### RKHSMetaMod : An R package to estimate the Hoeffding decomposition of an unknown function by solving RKHS Ridge Group Sparse optimization problem

In the context of the Gaussian regression model, RKHSMetaMod estimates a...
05/31/2019 ∙ by Halaleh Kamari, et al. ∙ 0

• ### Kernel Conjugate Gradient Methods with Random Projections

We propose and study kernel conjugate gradient methods (KCGM) with rando...
11/05/2018 ∙ by Junhong Lin, et al. ∙ 0

• ### Spectrally-truncated kernel ridge regression and its free lunch

Kernel ridge regression (KRR) is a well-known and popular nonparametric ...
06/14/2019 ∙ by Arash A. Amini, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The goal of non-parametric regression is to make predictions of a response variable

based on observing a covariate vector

. In practice, we are given a collection of samples, say of covariate-response pairs and our goal is to estimate the regression function . In the standard Gaussian model, it is assumed that the covariate-response pairs are related via the model

 yi =f∗(xi)+σwi,for i=1,…,n (1)

where the sequence consists of i.i.d. standard Gaussian variates. It is typical to assume that the regression function has some regularity properties, and one way of enforcing such structure is to require to belong to a reproducing kernel Hilbert space, or RKHS for short [3, 28, 9]). Given such an assumption, it is natural to estimate by minimizing a combination of the least-squares fit to the data and a penalty term involving the squared Hilbert norm, leading to an estimator known kernel ridge regression, or KRR for short [10, 22]). From a statistical point of view, the behavior of KRR can be characterized using existing results on -estimation and empirical processes (e.g. [13, 17, 26]). When the regularization parameter is set appropriately, it is known to yield a function estimate with minimax prediction error for various classes of kernels.

Despite these attractive statistical properties, the computational complexity of computing the KRR estimate prevent it from being routinely used in large-scale problems. More precisely, in a standard implementation [21], the time complexity and space complexity of KRR in a standard implementation scale as and , respectively, where refers to the number of samples. As a consequence, it becomes important to design methods to compute approximate forms of the KRR estimate, while retaining guarantees of optimality in terms of statistical minimaxity. Various authors have taken different approaches to this problem. Zhang et al. [30] analyze a distributed implementation of KRR, in which a set of machines each compute a separate estimate based on a random -way partition of the full data set, and combine it into a global estimate by averaging. This divide-and-conquer approach has time complexity and space complexity and , respectively. Zhang et al. [30] give conditions on the number of splits , as a function of the kernel, under which minimax optimality of the resulting estimator can be guaranteed. More closely related to this paper are methods that are based on forming a low-rank approximation to the -dimensional kernel matrix, such as the Nyström methods (e.g. [7, 8]). The time complexity by using a low-rank approximation is either or , depending on the specific approach (excluding the time for factorization), where is the maintained rank, and the space complexity is . Some recent work [4, 2] analyzes the tradeoff between the rank and the resulting statistical performance of the estimator, and we discuss this line of work at more length in Section 3.3.

In this paper, we consider approximations to KRR based on random projections, also known as sketches, of the data. Random projections are a classical way of performing dimensionality reduction, and are widely used in many algorithmic contexts (e.g., see the book [27] and references therein). Our proposal is to approximate -dimensional kernel matrix by projecting its row and column subspaces to a randomly chosen -dimensional subspace with . By doing so, an approximate form of the KRR estimate can be obtained by solving an -dimensional quadratic program, which involves time and space complexity and . Computing the approximate kernel matrix is a pre-processing step that has time complexity for suitably chosen projections; this pre-processing step is trivially parallelizable, meaning it can be reduced to to by using clusters.

Given such an approximation, we pose the following question: how small can the projection dimension be chosen while still retaining minimax optimality of the approximate KRR estimate? We answer this question by connecting it to the statistical dimension of the

-dimensional kernel matrix, a quantity that measures the effective number of degrees of freedom. (See Section

2.3 for a precise definition.) From the results of earlier work on random projections for constrained Least Squares estimators (e.g., see [18, 19]), it is natural to conjecture that it should be possible to project the kernel matrix down to the statistical dimension while preserving minimax optimality of the resulting estimator. The main contribution of this paper is to confirm this conjecture for several classes of random projection matrices.

The remainder of this paper is organized as follows. Section 2 is devoted to further background on non-parametric regression, reproducing kernel Hilbert spaces and associated measures of complexity, as well as the notion of statistical dimension of a kernel. In Section 3, we turn to statements of our main results. Theorem 2 provides a general sufficient condition on a random sketch for the associated approximate form of KRR to achieve the minimax risk. In Corollary 1, we derive some consequences of this general result for particular classes of random sketch matrices, and confirm these theoretical predictions with some simulations. We also compare at more length to methods based on the Nyström approximation in Section 3.3. Section 4 is devoted to the proofs of our main results, with the proofs of more technical results deferred to the appendices. We conclude with a discussion in Section 5.

## 2 Problem formulation and background

We begin by introducing some background on nonparametric regression and reproducing kernel Hilbert spaces, before formulating the problem discussed in this paper.

### 2.1 Regression in reproducing kernel Hilbert spaces

Given samples from the non-parametric regression model (1), our goal is to estimate the unknown regression function . The quality of an estimate can be measured in different ways: in this paper, we focus on the squared error

 ∥ˆf−f∗∥2n :=1nn∑i=1(ˆf(xi)−f∗(xi))2. (2)

Naturally, the difficulty of non-parametric regression is controlled by the structure in the function , and one way of modeling such structure is within the framework of a reproducing kernel Hilbert space (or RKHS for short). Here we provide a very brief introduction referring the reader to the books [6, 9, 28] for more details and background.

Given a space

endowed with a probability distribution

, the space consists of all functions that are square-integrable with respect to . In abstract terms, a space is an RKHS if for each , the evaluation function is a bounded linear functional. In more concrete terms, any RKHS is generated by a positive semidefinite (PSD) kernel function in the following way. A PSD kernel function is a symmetric function such that, for any positive integer , collections of points and weight vector , the sum is non-negative. Suppose moreover that for each fixed , the function belongs to . We can then consider the vector space of all functions of the form

 g(⋅)=N∑i=1ωiK(⋅,vi)

for some integer , points and weight vector . By taking the closure of all such linear combinations, it can be shown [3] that we generate an RKHS, and one that is uniquely associated with the kernel . We provide some examples of various kernels and the associated function classes in Section 2.3 to follow.

### 2.2 Kernel ridge regression and its sketched form

Given the dataset , a natural method for estimating unknown function is known as kernel ridge regression (KRR): it is based on the convex program

 f† :=argminf∈H{12nn∑i=1(yi−f(xi))2+λn∥f∥2H}, (3)

where is a regularization parameter.

As stated, this optimization problem can be infinite-dimensional in nature, since it takes place over the Hilbert space. However, as a straightforward consequence of the representer theorem [12], the solution to this optimization problem can be obtained by solving the -dimensional convex program. In particular, let us define the empirical kernel matrix, namely the -dimensional symmetric matrix with entries . Here we adopt the scaling for later theoretical convenience. In terms of this matrix, the KRR estimate can be obtained by first solving the quadratic program

 ω† =argminω∈Rn{12ωTK2ω−ωTKy√n+λnωTKω}, (4a) and then outputting the function f†(⋅)=1√nn∑i=1ω†iK(⋅,xi). (4b)

In principle, the original KRR optimization problem (4a) is simple to solve: it is an dimensional quadratic program, and can be solved exactly using

via a QR decomposition. However, in many applications, the number of samples may be large, so that this type of cubic scaling is prohibitive. In addition, the

-dimensional kernel matrix is dense in general, and so requires storage of order numbers, which can also be problematic in practice.

In this paper, we consider an approximation based on limiting the original parameter to an -dimensional subspace of , where is the projection dimension. We define this approximation via a sketch matrix , such that the -dimensional subspace is generated by the row span of . More precisely, the sketched kernel ridge regression estimate is given by first solving

 ˆα =argminθ∈Rm{12αT(SK)(KST)α−αTSKy√n+λnαTSKSTα}, (5a) and then outputting the function ˆf(⋅) :=1√nn∑i=1(STˆα)iK(⋅,xi). (5b)

Note that the sketched program (5a) is a quadratic program in dimensions: it takes as input the -dimensional matrices and the -dimensional vector . Consequently, it can be solved efficiently via QR decomposition with computational complexity . Moreover, the computation of the sketched kernel matrix in the input can be parallellized across its columns.

In this paper, we analyze various forms of random sketch matrices . Let us consider a few of them here.

##### Sub-Gaussian sketches:

We say the row of the sketch matrix is zero-mean -sub-Gaussian if for any fixed unit vector , we have

 P[|⟨u,si⟩≥t|]≤2e−nδ22for all δ≥0.

Many standard choices of sketch matrices have i.i.d. -sub-Gaussian rows in this sense; examples include matrices with i.i.d. Gaussian entries, i.i.d. Bernoulli entries, or random matrices with independent rows drawn uniformly from a rescaled sphere. For convenience, the sub-Gaussian sketch matrices considered in this paper are all rescaled so that their rows have the covariance matrix .

##### Randomized orthogonal system (ROS) sketches:

This class of sketches are based on randomly sampling and rescaling the rows of a fixed orthonormal matrix

. Examples of such matrices include the discrete Fourier transform (DFT) matrix, and the Hadamard matrix. More specifically, a ROS sketch matrix

is formed with i.i.d. rows of the form

 si =√nmRHTpi,for i=1,…,m,

where is a random diagonal matrix whose entries are i.i.d. Rademacher variables and is a random subset of rows sampled uniformly from the identity matrix without replacement. An advantage of using ROS sketches is that for suitably chosen orthonormal matrices, including the DFT and Hadamard cases among others, a matrix-vector product (say of the form for some vector ) can be computed in time, as opposed to time required for the same operation with generic dense sketches. For instance, see Ailon and Liberty [1] and [18] for further details. Throughout this paper, we focus on ROS sketches based on orthonormal matrices with uniformly bounded entries, meaning that for all entries . This entrywise bound is satisfied by Hadamard and DFT matrices, among others.

##### Sub-sampling sketches:

This class of sketches are even simpler, based on sub-sampling the rows of the identity matrix without replacement. In particular, the sketch matrix has rows of the form , where the vectors are drawn uniformly at random without replacement from the -dimensional identity matrix. It can be understood as related to a ROS sketch, based on the identity matrix as an orthonormal matrix, and not using the Rademacher randomization nor satisfying the entrywise bound. In Appendix A, we show that the sketched KRR estimate (5a) based on a sub-sampling sketch matrix is equivalent to the Nyström approximation.

### 2.3 Kernel complexity measures and statistical guarantees

So as to set the stage for later results, let us characterize an appropriate choice of the regularization parameter , and the resulting bound on the prediction error . Recall the empirical kernel matrix defined in the previous section: since it is symmetric and positive definite, it has an eigendecomposition of the form , where is an orthonormal matrix, and is diagonal with elements

. Using these eigenvalues, consider the

kernel complexity function

 ˆR(δ) = ⎷1nn∑j=1min{δ2,ˆμj}, (6)

corresponding to a rescaled sum of the eigenvalues, truncated at level . This function arises via analysis of the local Rademacher complexity of the kernel class (e.g., [5, 13, 17, 20]

). For a given kernel matrix and noise variance

, the critical radius is defined to be the smallest positive solution to the inequality

 ˆR(δ)δ ≤δσ. (7)

Note that the existence and uniqueness of this critical radius is guaranteed for any kernel class [5].

##### Bounds on ordinary KRR:

The significance of the critical radius is that it can be used to specify bounds on the prediction error in kernel ridge regression. More precisely suppose that we compute the KRR estimate (3) with any regularization parameter . Then with probability at least , we are guaranteed that

 ∥f†−f∗∥2n ≤cu{λn+δ2n}, (8)

where is a universal constant (independent of , and the kernel). This known result follows from standard techniques in empirical process theory (e.g., [26, 5]); we also note that it can be obtained as a corollary of our more general theorem on sketched KRR estimates to follow (viz. Theorem 2).

To illustrate, let us consider a few examples of reproducing kernel Hilbert spaces, and compute the critical radius in different cases. In working through these examples, so as to determine explicit rates, we assume that the design points are sampled i.i.d. from some underlying distribution , and we make use of the useful fact that, up to constant factors, we can always work with the population-level kernel complexity function

 R(δ) = ⎷1n∞∑j=1min{δ2,μj}, (9)

where are the eigenvalues of the kernel integral operator (assumed to be uniformly bounded). This equivalence follows from standard results on the population and empirical Rademacher complexities [17, 5].

###### Example 1 (Polynomial kernel).

For some integer , consider the kernel function on given by . For , it generates the class of all linear functions of the form for some scalars , and corresponds to a linear kernel. More generally, for larger integers , it generates the class of all polynomial functions of degree at most —that is, functions of the form .

Let us now compute a bound on the critical radius . It is straightforward to show that the polynomial kernel is of finite rank at most , meaning that the kernel matrix always has at most non-zero eigenvalues. Consequently, as long , there is a universal constant such that

 ˆR(δ) ≤c√D+1nδ,

which implies that . Consequently, we conclude that the KRR estimate satisifes the bound with high probability. Note that this bound is intuitive, since a polynomial of degree has free parameters.

###### Example 2 (Gaussian kernel).

The Gaussian kernel with bandwidth takes the form . When defined with respect to Lebesgue measure on the real line, the eigenvalues of the kernel integral operator scale as as . Based on this fact, it can be shown that the critical radius scales as . Thus, even though the Gaussian kernel is non-parametric (since it cannot be specified by a fixed number of parametrers), it is still a relatively small function class.

###### Example 3 (First-order Sobolev space).

As a final example, consider the kernel defined on the unit square given by . It generates the function class

 H1[0,1]={ f:[0,1]→R∣f(0)=0, (10) and f is abs. cts. with ∫10[f′(x)]2dx<∞},

a class that contains all Lipschitz functions on the unit interval . Roughly speaking, we can think of the first-order Sobolev class as functions that are almost everywhere differentiable with derivative in . Note that this is a much larger kernel class than the Gaussian kernel class. The first-order Sobolev space can be generalized to higher order Sobolev spaces, in which functions have additional smoothness. See the book [9] for further details on these and other reproducing kernel Hilbert spaces.

If the kernel integral operator is defined with respect to Lebesgue measure on the unit interval, then the population level eigenvalues are given by for . Given this relation, some calculation shows that the critical radius scales as . This is the familiar minimax risk for estimating Lipschitz functions in one dimension [23].

##### Lower bounds for non-parametric regression:

For future reference, it is also convenient to provide a lower bound on the prediction error achievable by any estimator. In order to do so, we first define the statistical dimension of the kernel as

 dn :=argminj=1,…,n{ˆμj≤δ2n}, (11)

and if no such index exists. By definition, we are guaranteed that for all . In terms of this statistical dimension, we have

 ˆR(δn) =[dnnδ2n+1nn∑j=dn+1ˆμj]1/2,

showing that the statistical dimension controls a type of bias-variance tradeoff.

It is reasonable to expect that the critical rate should be related to the statistical dimension as . This scaling relation holds whenever the tail sum satisfies a bound of the form . Although it is possible to construct pathological examples in which this scaling relation does not hold, it is true for most kernels of interest, including all examples considered in this paper. For any such regular kernel, the critical radius provides a fundamental lower bound on the performance of any estimator, as summarized in the following theorem:

###### Theorem 1 (Critical radius and minimax risk).

Given i.i.d. samples from the standard non-parametric regression model over any regular kernel class, any estimator has prediction error lower bounded as

 sup∥f∗∥H≤1E∥˜f−f∗∥2n ≥cℓδ2n, (12)

where is a numerical constant, and is the critical radius (7).

The proof of this claim, provided in Appendix B.1, is based on a standard applicaton of Fano’s inequality, combined with a random packing argument. It establishes that the critical radius is a fundamental quantity, corresponding to the appropriate benchmark to which sketched kernel regression estimates should be compared.

## 3 Main results and their consequences

We now turn to statements of our main theorems on kernel sketching, as well as a discussion of some of their consequences. We first introduce the notion of a -satisfiable sketch matrix, and then show (in Theorem 2) that any sketched KRR estimate based on a -satisfiable sketch also achieves the minimax risk. We illustrate this achievable result with several corollaries for different types of randomized sketches. For Gaussian and ROS sketches, we show that choosing the sketch dimension proportional to the statistical dimension of the kernel (with additional log factors in the ROS case) is sufficient to guarantee that the resulting sketch will be -satisfiable with high probability. In addition, we illustrate the sharpness of our theoretical predictions via some experimental simulations.

### 3.1 General conditions for sketched kernel optimality

Recall the definition (11) of the statistical dimension , and consider the eigendecomposition of the kernel matrix, where

is an orthonormal matrix of eigenvectors, and

is a diagonal matrix of eigenvalues. Let denote the left block of , and similarly, denote the right block. Note that the columns of the left block correspond to the eigenvectors of associated with the leading eigenvalues, whereas the columns of the right block correspond to the eigenvectors associated with the remaining smallest eigenvalues. Intuitively, a sketch matrix is “good” if the sub-matrix is relatively close to an isometry, whereas the sub-matrix has a relatively small operator norm.

This intuition can be formalized in the following way. For a given kernel matrix , a sketch matrix is said to be -satisfiable if there is a universal constant such that

 |||(SU1)TSU1−Idn|||op≤1/2,and|||SU2D1/22|||% op≤cδn, (13)

where .

Given this definition, the following theorem shows that any sketched KRR estimate based on a -satisfiable matrix achieves the minimax risk (with high probability over the noise in the observation model):

###### Theorem 2 (Upper bound).

Given i.i.d. samples from the standard non-parametric regression model, consider the sketched KRR problem (5a) based on a -satisfiable sketch matrix . Then any for , the sketched regression estimate from equation (5b) satisfies the bound

 ∥ˆf−f∗∥2n ≤cu{λn+δ2n}

with probability greater than .

We emphasize that in the case of fixed design regression and for a fixed sketch matrix, the -satisfiable condition on the sketch matrix is a deterministic statement: apart from the sketch matrix, it only depends on the properties of the kernel function and design variables . Thus, when using randomized sketches, the algorithmic randomness can be completely decoupled from the randomness in the noisy observation model (1).

##### Proof intuition:

The proof of Theorem 2 is given in Section 4.1. At a high-level, it is based on an upper bound on the prediction error that involves two sources of error: the approximation error associated with solving a zero-noise version of the KRR problem in the projected -dimensional space, and the estimation error between the noiseless and noisy versions of the projected problem. In more detail, letting denote the vector of function evaluations defined by , consider the quadratic program

 α† (14)

as well as the associated fitted function . The vector is the solution of the sketched problem in the case of zero noise, whereas the fitted function corresponds to the best penalized approximation of within the range space of .

Given this definition, we then have the elementary inequality

 12∥ˆf−f∗∥2n ≤∥f†−f∗∥2n% Approximation error+∥f†−ˆf∥2n% Estimation error. (15)

For a fixed sketch matrix, the approximation error term is deterministic: it corresponds to the error induced by approximating over the range space of . On the other hand, the estimation error depends both on the sketch matrix and the observation noise. In Section 4.1, we state and prove two lemmas that control the approximation and error terms respectively.

As a corollary, Theorem 2 implies the stated upper bound (8) on the prediction error of the original (unsketched) KRR estimate (3). Indeed, this estimator can be obtained using the “sketch matrix” , which is easily seen to be -satisfiable. In practice, however, we are interested in sketch matrices with , so as to achieve computational savings. In particular, a natural conjecture is that it should be possible to efficiently generate -satisfiable sketch matrices with the projection dimension proportional to the statistical dimension of the kernel. Of course, one such -satisfiable matrix is given by , but it is not easy to generate, since it requires computing the eigendecomposition of . Nonetheless, as we now show, there are various randomized constructions that lead to -satisfiable sketch matrices with high probability.

### 3.2 Corollaries for randomized sketches

When combined with additional probabilistic analysis, Theorem 2 implies that various forms of randomized sketches achieve the minimax risk using a sketch dimension proportional to the statistical dimension . Here we analyze the Gaussian and ROS families of random sketches, as previously defined in Section 2.2. Throughout our analysis, we require that the sketch dimension satisfies a lower obund of the form

 m ≥{cdnfor Gaussian sketches, andcdnlog4(n)for ROS sketches, (16a) where dn is the statistical dimension as previously defined in equation (11). Here it should be understood that the constant c can be chosen sufficiently large (but finite). In addition, for the purposes of stating high probability results, we define the function ϕ(m,dn,n) :=⎧⎪⎨⎪⎩c1e−c2mfor Gaussian sketches, % andc1[e−c2mdnlog2(n)+e−c2dnlog2(n)]for ROS sketches, (16b)

where are universal constants. With this notation, the following result provides a high probability guarantee for both Gaussian and ROS sketches:

###### Corollary 1 (Guarantees for Gaussian and ROS sketches).

Given i.i.d. samples from the standard non-parametric regression model (1), consider the sketched KRR problem (5a) based on a sketch dimension satisfying the lower bound (16a). Then there is a universal constant such that for any , the sketched regression estimate (5b) satisfies the bound

 ∥ˆf−f∗∥2n ≤c′u{λn+δ2n}

with probability greater than .

In order to illustrate Corollary 1, let us return to the three examples previously discussed in Section 2.3. To be concrete, we derive the consequences for Gaussian sketches, noting that ROS sketches incur only an additional overhead.

• for the -order polynomial kernel from Example 1, the statistical dimension for any sample size is at most , so that a sketch size of order is sufficient. This is a very special case, since the kernel is finite rank and so the required sketch dimension has no dependence on the sample size.

• for the Gaussian kernel from Example 2, the statistical dimension satisfies the scaling , so that it suffices to take a sketch dimension scaling logarithmically with the sample size.

• for the first-order Sobolev kernel from Example 3 , the statistical dimension scales as , so that a sketch dimension scaling as the cube root of the sample size is required.

In order to illustrate these theoretical predictions, we performed some simulations. Beginning with the Sobolev kernel on the unit square, as introduced in Example 3, we generated i.i.d. samples from the model  (1

) with noise standard deviation

, the unknown regression function

 f∗(x) =|x+0.5|−0.5, (17)

and uniformly spaced design points for . By construction, the function belongs to the first-order Sobolev space with . As suggested by our theory for the Sobolev kernel, we set the projection dimension , and then solved the sketched version of kernel ridge regression, for both Gaussian sketches and ROS sketches based on the fast Hadamard transform. We performed simulations for in the set so as to study scaling with the sample size. As noted above, our theory predicts that the squared prediction loss should tend to zero at the same rate as that of the unsketched estimator . Figure 1 confirms this theoretical prediction. In panel (a), we plot the squared prediction error versus the sample size, showing that all three curves (original, Gaussian sketch and ROS sketch) tend to zero. Panel (b) plots the rescaled prediction error versus the sample size, with the relative flatness of these curves confirming the decay predicted by our theory.

In our second experiment, we repeated the same set of simulations this time for the Gaussian kernel with bandwidth , and the function . In this case, as suggested by our theory, we choose the sketch dimension . Figure 2 shows the same types of plots with the prediction error. In this case, we expect that the squared prediction error will decay at the rate . This prediction is confirmed by the plot in panel (b), showing that the rescaled error , when plotted versus the sample size, remains relatively constant over a wide range.

### 3.3 Comparison with Nyström-based approaches

It is interesting to compare the convergence rate and computational complexity of our methods with guarantees based on the Nyström approximation. As shown in Appendix A, this Nyström approximation approach can be understood as a particular form of our sketched estimate, one in which the sketch corresponds to a random row-sampling matrix.

Bach [4] analyzed the prediction error of the Nyström approximation to KRR based on uniformly sampling a subset of -columns of the kernel matrix , leading to an overall computational complexity of . In order for the approximation to match the performance of KRR, the number of sampled columns must be lower bounded as

 p ≿n∥diag(K(K+λnI)−1)∥∞logn,

a quantity which can be substantially larger than the statistical dimension required by our methods. Moreover, as shown in the following example, there are many classes of kernel matrices for which the performance of the Nyström approximation will be poor.

###### Example 4 (Failure of Nyström approximation).

Given a sketch dimension , consider an empirical kernel matrix that has a block diagonal form , where and for any integer . Then the probability of not sampling any of the last columns/rows is at least . This means that with probability at least , the sub-sampling sketch matrix can be expressed as , where . Under such an event, the sketched KRR (5a) takes on a degenerate form, namely

 ˆα =argminθ∈Rm{12αTS1K21ST1α−αTS1K1y1√n+λnαTS1K1ST1α},

and objective that depends only on the first observations. Since the values of the last observations can be arbitrary, this degeneracy has the potential to lead to substantial approximation error.

The previous example suggests that the Nyström approximation is likely to be very sensitive to non-inhomogeneity in the sampling of covariates. In order to explore this conjecture, we performed some additional simulations, this time comparing both Gaussian and ROS sketches with the uniform Nyström approximation sketch. Returning again to the Gaussian kernel with bandwidth , and the function , we first generated i.i.d. samples that were uniform on the unit interval . We then implemented sketches of various types (Gaussian, ROS or Nyström) using a sketch dimension . As shown in the top row (panels (a) and (b)) of Figure 3, all three sketch types perform very well for this regular design, with prediction error that is essentially indistiguishable from the original KRR estimate. Keeping the same kernel and function, we then considered an irregular form of design, namely with samples perturbed as follows:

 xi ∼{Unif[0,1/2]if i=1,…,n−k1+zifor i=k+1,…,n

where each . The performance of the sketched estimators in this case are shown in the bottom row (panels (c) and (d)) of Figure 3. As before, both the Gaussian and ROS sketches track the performance of the original KRR estimate very closely; in contrast, the Nyström approximation behaves very poorly for this regression problem, consistent with the intuition suggested by the preceding example.

As is known from general theory on the Nyström approximation, its performance can be improved by knowledge of the so-called leverage scores of the underlying matrix. In this vein, recent work by Alaoui and Mahoney [2] suggests a Nyström approximation non-uniform sampling of the columns of kernel matrix involving the leverage scores. Assuming that the leverage scores are known, they show that their method matches the performance of original KRR using a non-uniform sub-sample of the order columns. When the regularization parameter is set optimally—that is, proportional to —then apart from the extra logarithmic factor, this sketch size scales with the statistical dimension, as defined here. However, the leverage scores are not known, and their method for obtaining a sufficiently approximation requires sampling columns of the kernel matrix , where

 ~p≿λ−1ntrace(K)logn.

For a typical (normalized) kernel matrix , we have ; moreover, in order to achieve the minimax rate, the regularization parameter should scale with . Putting together the pieces, we see that the sampling parameter must satisfy the lower bound . This requirement is much larger than the statistical dimension, and prohibitive in many cases:

• for the Gaussian kernel, we have , and so , meaning that all rows of the kernel matrix are sampled. In contrast, the statistical dimension scales as .

• for the first-order Sobolev kernel, we have , so that . In contrast, the statistical dimension for this kernel scales as .

It remains an open question as to whether a more efficient procedure for approximating the leverage scores might be devised, which would allow a method of this type to be statistically optimal in terms of the sampling dimension.

## 4 Proofs

In this section, we provide the proofs of our main theorems. Some technical proofs of the intermediate results are provided in the appendices.

### 4.1 Proof of Theorem 2

Recall the definition (14) of the estimate , as well as the upper bound (15) in terms of approximation and estimation error terms. The remainder of our proof consists of two technical lemmas used to control these two terms.

###### Lemma 1 (Control of estimation error).

Under the conditions of Theorem 2, we have

 ∥f†−ˆf∥2n≤cδ2n (18)

with probability at least .

###### Lemma 2 (Control of approximation error).

For any -satisfiable sketch matrix , we have

 ∥f†−f∗∥2n ≤c{λn+δ2n}and∥f†∥H≤c{1+δ2nλn}. (19)

These two lemmas, in conjunction with the upper bound (15), yield the claim in the theorem statement. Accordingly, it remains to prove the two lemmas.

#### 4.1.1 Proof of Lemma 1

So as to simplify notation, we assume throughout the proof that . (A simple rescaling argument can be used to recover the general statement). Since is optimal for the quadratic program (14), it must satisfy the zero gradient condition

 −SK(1√nf∗−KSTα†)+λnSKSTα† =0. (20)

By the optimality of and feasibility of for the sketched problem (5a), we have

 12∥KSTˆα∥22 −1√nyTKSTˆα+λn∥√KSTˆα∥22 ≤12∥KSTα†∥22−1√nyTKSTα†+λn∥√KSTα†∥22

Defining the error vector , some algebra leads to the following inequality

 12∥KˆΔ∥22 (21)

Consequently, by plugging in and applying the optimality condition (20), we obtain the basic inequality

 12∥KˆΔ∥22 ≤∣∣1√nwTKˆΔ∣∣−λn∥√KˆΔ∥22. (22)

The following lemma provides control on the right-hand side:

###### Lemma 3.

With probability at least , we have

 ∣∣1√nwTKΔ∣∣ ≤{6δn∥KΔ∥2+2δ2n%forall$∥√KΔ∥2≤1$,2δn∥KΔ∥2+2δ2n∥√KΔ∥2+116δ2nfor all ∥√KΔ∥2≥1. (23)

See Appendix B.2 for the proof of this lemma.

Based on this auxiliary result, we divide the remainder of our analysis into two cases:

##### Case 1:

If , then the basic inequality (22) and the top inequality in Lemma 3 imply

 12∥KˆΔ∥22 ≤∣∣1√nwTKˆΔ∣∣≤6δn∥KˆΔ∥2+2δ2n (24)

with probability at least . Note that we have used that fact that the randomness in the sketch matrix is independent of the randomness in the noise vector . The quadratic inequality (24) implies that for some universal constant .

##### Case 2:

If , then the basic inequality (22) and the bottom inequality in Lemma 3 imply

 12∥KˆΔ∥22 ≤2δn∥KˆΔ∥2+2δ2n∥√KˆΔ∥2+116δ2n−λn∥√KˆΔ∥22

with probability at least . If , then under the assumed condition , the above inequality gives

 12∥KˆΔ∥22≤2δn∥KˆΔ∥2+116δ2n≤14∥KˆΔ∥22+4δ2n+116δ2n.

By rearranging terms in the above, we obtain for a universal constant, which completes the proof.

#### 4.1.2 Proof of Lemma 2

Our goal is to show that the bound

 12n∥z∗−√nKSTα†∥22+λn∥√KSTα†∥22≤c{λn+δ2n}.

In fact, since is a minimizer, it suffices to exhibit some for which this inequality holds. Recalling the eigendecomposition , it is equivalent to exhibit some such that

 12∥θ∗−D˜STα∥22+λnαT˜SD˜STα ≤c{λn+δ2n}, (25)

where is the transformed sketch matrix, and the vector satisfies the ellipse constraint .

We do so via a constructive procedure. First, we partition the vector into two sub-vectors, namely and . Similarly, we partition the diagonal matrix into two blocks, and , with dimensions and respectively. Under the condition , we may let denote the left block of the transformed sketch matrix, and similarly, let denote the right block. In terms of this notation, the assumption that is -satisfiable corresponds to the inequalities

 |||˜ST1˜S1−Idn|||op≤12,and|||˜S2√D2|||op≤cδn. (26)

As a consequence, we are guarantee that the matrix is invertible, so that we may define the -dimensional vector

 ˆα =˜S1(˜ST1˜S1)−1(D1)−1β