Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates

05/22/2013 ∙ by Yuchen Zhang, et al.

We establish optimal convergence rates for a decomposition-based scalable approach to kernel ridge regression. The method is simple to describe: it randomly partitions a dataset of size N into m subsets of equal size, computes an independent kernel ridge regression estimator for each subset, then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel ridge regression on all N samples. Our two main theorems establish that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples. As concrete examples, our theory guarantees that the number of processors m may grow nearly linearly for finite-rank kernels and Gaussian kernels and polynomially in N for Sobolev spaces, which in turn allows for substantial reductions in computational cost. We conclude with experiments on both simulated data and a music-prediction task that complement our theoretical results, exhibiting the computational and statistical benefits of our approach.


1 Introduction

In non-parametric regression, the statistician receives $N$ samples of the form $\{(x_i, y_i)\}_{i=1}^N$, where each $x_i \in \mathcal{X}$ is a covariate and $y_i \in \mathbb{R}$ is a real-valued response, and the samples are drawn i.i.d. from some unknown joint distribution $\mathbb{P}$ over $\mathcal{X} \times \mathbb{R}$. The goal is to estimate a function $\widehat{f}$ that can be used to predict future responses based on observing only the covariates. Frequently, the quality of an estimate $\widehat{f}$ is measured in terms of the mean-squared prediction error $\mathbb{E}[(\widehat{f}(X) - Y)^2]$, in which case the conditional expectation $f^*(x) = \mathbb{E}[Y \mid X = x]$ is optimal. The problem of non-parametric regression is a classical one, and researchers have studied a wide range of estimators (see, e.g., the books [11, 32, 30]). One class of methods, known as regularized $M$-estimators [30], is based on minimizing the combination of a data-dependent loss function with a regularization term. The focus of this paper is a popular $M$-estimator that combines the least-squares loss with a squared Hilbert norm penalty for regularization. When working in a reproducing kernel Hilbert space (RKHS), the resulting method is known as kernel ridge regression, and is widely used in practice [12, 26]. Past work has established bounds on the estimation error for RKHS-based methods [e.g., 16, 20, 30, 35], which have been refined and extended in more recent work [e.g., 27].

Although the statistical aspects of kernel ridge regression (KRR) are well understood, the computation of the KRR estimate can be challenging for large datasets. In a standard implementation [24], the kernel matrix must be inverted, which requires $O(N^3)$ time and $O(N^2)$ memory. Such scalings are prohibitive when the sample size $N$ is large. As a consequence, approximations have been designed to avoid the expense of finding an exact minimizer. One family of approaches is based on low-rank approximation of the kernel matrix; examples include kernel PCA [25], the incomplete Cholesky decomposition [9], and Nyström sampling [33]. These methods reduce the time complexity to $O(r^2 N)$ or $O(r N^2)$, where $r \ll N$ is the preserved rank. The associated prediction error has only been studied very recently. Concurrent work by Bach [1] establishes conditions on the maintained rank that still guarantee optimal convergence rates; see the discussion for more detail. A second line of research has considered early stopping of iterative optimization algorithms for KRR, including gradient descent [34, 22] and conjugate gradient methods [6], where early stopping provides regularization against over-fitting and improves run-time. If the algorithm stops after $t$ iterations, the aggregate time complexity is $O(t N^2)$.
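As an illustration of the early-stopping alternative just mentioned (a minimal sketch, not the implementation studied in the cited works; the Gaussian kernel, step size, and toy data below are placeholder choices), gradient descent in the RKHS on the empirical least-squares loss can be run for a fixed number of iterations, with the iteration count playing the role of the regularization parameter:

```python
import numpy as np

def rbf_kernel(X, Z, bandwidth=0.2):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def early_stopped_kernel_gd(K, y, num_iters, step=None):
    """Gradient descent in the RKHS on the empirical least-squares loss.

    The iterate is f_t = sum_i alpha_i K(., x_i); each step moves alpha
    in the direction -(step / N) * (K @ alpha - y), and stopping after a
    small number of iterations plays the role of regularization.
    """
    N = len(y)
    if step is None:
        step = N / np.linalg.eigvalsh(K)[-1]  # conservative step size
    alpha = np.zeros(N)
    for _ in range(num_iters):
        residual = K @ alpha - y              # f_t(x_i) - y_i for all i
        alpha -= (step / N) * residual
    return alpha

# Toy usage: fit noisy sine data, stopping after 50 iterations.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
K = rbf_kernel(X, X)
alpha = early_stopped_kernel_gd(K, y, num_iters=50)
predictions = rbf_kernel(np.linspace(0.0, 1.0, 5)[:, None], X) @ alpha
```

Each iteration requires one kernel matrix-vector product, consistent with the $O(t N^2)$ aggregate cost noted above.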

In this work, we study a different decomposition-based approach. The algorithm is appealing in its simplicity: we partition the dataset of size $N$ randomly into $m$ equal-sized subsets, and we compute the kernel ridge regression estimate $\widehat{f}_i$ for each of the $m$ subsets independently, with a careful choice of the regularization parameter. The estimates are then averaged via $\bar{f} = (1/m) \sum_{i=1}^m \widehat{f}_i$. Our main theoretical result gives conditions under which the average $\bar{f}$ achieves the minimax rate of convergence over the underlying Hilbert space. Even using naive implementations of KRR, this decomposition gives time and memory complexity scaling as $O(N^3/m^2)$ and $O(N^2/m^2)$, respectively. Moreover, our approach dovetails naturally with parallel and distributed computation: we are guaranteed a superlinear speedup with $m$ parallel processors (though we must still communicate the function estimates from each processor). Divide-and-conquer approaches have been studied by several authors, including McDonald et al. [19] for perceptron-based algorithms, Kleiner et al. [15] for distributed versions of the bootstrap, and Zhang et al. [36] for parametric smooth convex optimization problems. This paper demonstrates the potential benefits of divide-and-conquer approaches for nonparametric and infinite-dimensional regression problems.

One difficulty in solving each of the sub-problems independently is how to choose the regularization parameter. Due to the infinite-dimensional nature of non-parametric problems, the choice of regularization parameter must be made with care [e.g., 12]. An interesting consequence of our theoretical analysis is in demonstrating that, even though each partitioned sub-problem is based on only the fraction $1/m$ of the samples, it is nonetheless essential to regularize the partitioned sub-problems as though they had all $N$ samples. Consequently, from a local point of view, each sub-problem is under-regularized. This "under-regularization" allows the bias of each local estimate to be very small, but it causes a detrimental blow-up in the variance. However, as we prove, the $m$-fold averaging underlying the method reduces variance enough that the resulting estimator $\bar{f}$ still attains the optimal convergence rate.

The remainder of this paper is organized as follows. We begin in Section 2 by providing background on the kernel ridge regression estimate and discussing the assumptions that underlie our analysis. In Section 3, we present our main theorems on the mean-squared error between the averaged estimate $\bar{f}$ and the optimal regression function $f^*$. We provide both a result for the case when the regression function $f^*$ belongs to the Hilbert space $\mathcal{H}$ associated with the kernel, as well as a more general oracle inequality that holds for a general $f^*$. We then provide several corollaries that exhibit concrete consequences of the results, including convergence rates of $r/N$ for kernels with finite rank $r$, and convergence rates of $N^{-2\nu/(2\nu+1)}$ for estimation of functions in a Sobolev space with $\nu$ degrees of smoothness. As we discuss, both of these estimation rates are minimax-optimal and hence unimprovable. We devote Sections 4 and 5 to the proofs of our results, deferring more technical aspects of the analysis to appendices. Lastly, we present simulation results in Section 6.1 to further explore our theoretical results, while Section 6.2 contains experiments with a reasonably large music prediction task.

2 Background and problem formulation

We begin with the background and notation required for a precise statement of our problem.

2.1 Reproducing kernels

The method of kernel ridge regression is based on the idea of a reproducing kernel Hilbert space. We provide only a very brief coverage of the basics here, referring the reader to one of the many books on the topic (e.g., [31, 26, 3, 10]) for further details. Any symmetric and positive semidefinite kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defines a reproducing kernel Hilbert space (RKHS for short). For a given distribution $\mathbb{P}$ on $\mathcal{X}$, the Hilbert space $\mathcal{H}$ is strictly contained within $L^2(\mathbb{P})$. For each $x \in \mathcal{X}$, the function $z \mapsto K(z, x)$ is contained within the Hilbert space $\mathcal{H}$; moreover, the Hilbert space is endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ such that $K(\cdot, x)$ acts as the representer of evaluation, meaning

\[ \langle f, K(x, \cdot) \rangle_{\mathcal{H}} = f(x) \quad \text{for all } f \in \mathcal{H}. \tag{1} \]

We let $\|\cdot\|_{\mathcal{H}}$ denote the norm in $\mathcal{H}$, and similarly $\|\cdot\|_2$ denotes the norm in $L^2(\mathbb{P})$. Under suitable regularity conditions, Mercer's theorem guarantees that the kernel has an eigen-expansion of the form

\[ K(x, x') = \sum_{j=1}^{\infty} \mu_j \phi_j(x) \phi_j(x'), \]

where $\mu_1 \ge \mu_2 \ge \cdots \ge 0$ is a non-negative sequence of eigenvalues, and $\{\phi_j\}_{j=1}^{\infty}$ is an orthonormal basis for $L^2(\mathbb{P})$.

From the reproducing relation (1), we have $\langle \phi_j, \phi_j \rangle_{\mathcal{H}} = 1/\mu_j$ for any $j$ and $\langle \phi_j, \phi_k \rangle_{\mathcal{H}} = 0$ for any $j \neq k$. For any $f \in \mathcal{H}$, by defining the basis coefficients $\theta_j = \int f(x) \phi_j(x) \, d\mathbb{P}(x)$ for $j = 1, 2, \ldots$, we can expand the function in terms of these coefficients as $f = \sum_{j=1}^{\infty} \theta_j \phi_j$, and simple calculations show that

\[ \|f\|_2^2 = \sum_{j=1}^{\infty} \theta_j^2 \quad \text{and} \quad \|f\|_{\mathcal{H}}^2 = \sum_{j=1}^{\infty} \frac{\theta_j^2}{\mu_j}. \]

Consequently, we see that the RKHS can be viewed as an elliptical subset of the sequence space $\ell^2(\mathbb{N})$, with the ellipse defined by the non-negative eigenvalues $\{\mu_j\}_{j=1}^{\infty}$.
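As a hedged numerical illustration of this eigen-expansion (the Gaussian kernel and the uniform sampling distribution below are arbitrary choices for the sketch), the Mercer eigenvalues $\mu_j$ can be approximated by the eigenvalues of the scaled kernel matrix $K/n$ formed from $n$ samples drawn from $\mathbb{P}$:

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=0.5):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 h^2))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Draw n samples from P (here, uniform on [0, 1]) and form K / n.
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0.0, 1.0, size=(n, 1))
K = gaussian_kernel(X, X)

# Eigenvalues of K / n approximate the Mercer eigenvalues mu_1 >= mu_2 >= ...
mu_hat = np.sort(np.linalg.eigvalsh(K / n))[::-1]
print("leading empirical eigenvalues:", np.round(mu_hat[:5], 4))
```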

2.2 Kernel ridge regression

Suppose that we are given a data set $\{(x_i, y_i)\}_{i=1}^N$ consisting of $N$ i.i.d. samples drawn from an unknown distribution $\mathbb{P}$ over $\mathcal{X} \times \mathbb{R}$, and our goal is to estimate the function that minimizes the mean-squared error $\mathbb{E}[(f(X) - Y)^2]$, where the expectation is taken jointly over $(X, Y)$ pairs. It is well known that the optimal function is the conditional mean $f^*(x) := \mathbb{E}[Y \mid X = x]$. In order to estimate the unknown function $f^*$, we consider an $M$-estimator that is based on minimizing a combination of the least-squares loss defined over the dataset with a weighted penalty based on the squared Hilbert norm,

\[ \widehat{f} := \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{N} \sum_{i=1}^{N} \big(f(x_i) - y_i\big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \right\}, \tag{2} \]

where $\lambda > 0$ is a regularization parameter. When $\mathcal{H}$ is a reproducing kernel Hilbert space, the estimator (2) is known as the kernel ridge regression estimate, or KRR for short. It is a natural generalization of the ordinary ridge regression estimate [13] to the non-parametric setting.

By the representer theorem for reproducing kernel Hilbert spaces [31], any solution to the KRR program (2) must belong to the linear span of the kernel functions $\{K(\cdot, x_i), i = 1, \ldots, N\}$. This fact allows the computation of the KRR estimate to be reduced to an $N$-dimensional quadratic program, involving the $N^2$ entries of the kernel matrix $\{K(x_i, x_j), i, j = 1, \ldots, N\}$. On the statistical side, a line of past work [30, 35, 7, 27, 14] has provided bounds on the estimation error of $\widehat{f}$ as a function of $N$ and $\lambda$.
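To make the representer-theorem reduction concrete, the following minimal sketch (assuming the $1/N$-scaled objective in equation (2) and using a Gaussian kernel as a stand-in for a generic kernel) computes the coefficient vector by solving the $N$-dimensional linear system $(K + \lambda N I)\alpha = y$ and predicts via $\widehat{f}(x) = \sum_i \alpha_i K(x, x_i)$:

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=0.5):
    """Gaussian kernel matrix between rows of X and rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def krr_fit(X, y, lam, bandwidth=0.5):
    """Solve (K + lam * N * I) alpha = y for the KRR coefficients."""
    N = len(y)
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * N * np.eye(N), y)

def krr_predict(alpha, X_train, X_test, bandwidth=0.5):
    """Evaluate f_hat(x) = sum_i alpha_i K(x, x_i) at the test points."""
    return gaussian_kernel(X_test, X_train, bandwidth) @ alpha

# Toy usage.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(300, 1))
y = np.cos(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(300)
alpha = krr_fit(X, y, lam=1e-3)
preds = krr_predict(alpha, X, np.linspace(0.0, 1.0, 5)[:, None])
```

The linear solve is the source of the cubic time and quadratic memory scaling discussed in the introduction.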

3 Main results and their consequences

We now turn to the description of our algorithm, followed by the statements of our main results, namely Theorems 3.2 and 10. Each theorem provides an upper bound on the mean-squared prediction error for any trace-class kernel. The second theorem is of "oracle type," meaning that it applies even when the true regression function $f^*$ does not belong to the Hilbert space $\mathcal{H}$, and hence involves a combination of approximation and estimation error terms. The first theorem requires that $f^* \in \mathcal{H}$, and provides somewhat sharper bounds on the estimation error in this case. Both of these theorems apply to any trace-class kernel, but as we illustrate, they provide concrete results when applied to specific classes of kernels. Indeed, as a corollary, we establish that our distributed KRR algorithm achieves the statistically minimax-optimal rates for three different kernel classes, namely finite-rank, Gaussian, and Sobolev.

3.1 Algorithm and assumptions

The divide-and-conquer algorithm is easy to describe. We are given $N$ samples drawn i.i.d. according to the distribution $\mathbb{P}$. Rather than solving the kernel ridge regression problem (2) on all $N$ samples, the method executes the following three steps (see the sketch following this list):

  1. Divide the set of samples $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ evenly and uniformly at random into the $m$ disjoint subsets $S_1, \ldots, S_m$.

  2. For each $i = 1, 2, \ldots, m$, compute the local KRR estimate

    \[ \widehat{f}_i := \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{|S_i|} \sum_{(x, y) \in S_i} \big(f(x) - y\big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \right\}. \tag{3} \]

  3. Average together the local estimates and output $\bar{f} = \frac{1}{m} \sum_{i=1}^{m} \widehat{f}_i$.
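The following is a minimal, hedged sketch of these three steps (not the authors' code; the Gaussian kernel, the toy data, and the prediction-averaging interface are placeholder choices). As discussed in the introduction, the regularization parameter should be set as though each local problem had all $N$ samples:

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=0.5):
    """Gaussian kernel matrix between rows of X and rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def local_krr(X, y, lam, bandwidth=0.5):
    """Step 2: kernel ridge regression on one subset, (K + lam*n*I) alpha = y."""
    n = len(y)
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def dc_krr_fit(X, y, m, lam, bandwidth=0.5, seed=0):
    """Step 1: split the N samples evenly at random; fit m local KRR estimates.

    Note: lam should be chosen as though each sub-problem had all N samples
    (the "under-regularization" discussed in the introduction).
    """
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(y)), m)
    return [(X[idx], local_krr(X[idx], y[idx], lam, bandwidth)) for idx in subsets]

def dc_krr_predict(models, X_test, bandwidth=0.5):
    """Step 3: average the m local predictors at the query points."""
    preds = [gaussian_kernel(X_test, X_i, bandwidth) @ alpha_i
             for X_i, alpha_i in models]
    return np.mean(preds, axis=0)

# Toy usage: N = 2000 samples split across m = 8 subsets.
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(2000, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(2000)
models = dc_krr_fit(X, y, m=8, lam=1e-4)
y_bar = dc_krr_predict(models, np.linspace(0.0, 1.0, 9)[:, None])
```

Averaging the local predictions at query points is one way to realize step 3; alternatively, the local coefficient vectors and their sample points can be communicated and combined.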

This description actually provides a family of estimators, one for each choice of the regularization parameter $\lambda > 0$. Our main result applies to any choice of $\lambda$, while our corollaries for specific kernel classes optimize $\lambda$ as a function of the kernel.

We now describe our main assumptions. Our first assumption, for which we have two variants, deals with the tail behavior of the basis functions $\{\phi_j\}$: for some $k \ge 2$, there is a constant $\rho < \infty$ such that $\mathbb{E}[\phi_j(X)^{2k}] \le \rho^{2k}$ for all $j \in \mathbb{N}$.

In certain cases, we show that sharper error guarantees can be obtained by enforcing a stronger condition of uniform boundedness: there is a constant $\rho < \infty$ such that $\sup_{x \in \mathcal{X}} |\phi_j(x)| \le \rho$ for all $j \in \mathbb{N}$.

Recalling that $f^*(x) = \mathbb{E}[Y \mid X = x]$, our second assumption involves the deviations of the zero-mean noise variables $Y - f^*(X)$. In the simplest case, when $f^* \in \mathcal{H}$, we require only a bounded variance condition: the function $f^* \in \mathcal{H}$, and for each $x \in \mathcal{X}$, we have $\mathbb{E}[(Y - f^*(X))^2 \mid X = x] \le \sigma^2$.

When the function $f^* \notin \mathcal{H}$, we require a slightly stronger variant of this assumption. For each $\lambda \ge 0$, define

\[ f^*_\lambda := \operatorname*{argmin}_{f \in \mathcal{H}} \Big\{ \mathbb{E}\big[(f(X) - Y)^2\big] + \lambda \|f\|_{\mathcal{H}}^2 \Big\}. \tag{4} \]

Note that $f^*_0$ corresponds to the usual regression function, though the infimum may not be attained for $\lambda = 0$ (our analysis addresses such cases). Since $f^*_\lambda \in \mathcal{H}$ for each $\lambda > 0$, we are guaranteed that the associated conditional mean-squared error $\mathbb{E}[(f^*_\lambda(X) - Y)^2 \mid X = x]$ is finite for almost every $x$. In this more general setting, the following assumption replaces Assumption 3.1:

For any $\lambda \ge 0$, there exists a constant $\tau < \infty$ such that $\mathbb{E}\big[(f^*_\lambda(X) - Y)^4\big] \le \tau^4$. This condition with $\lambda = 0$ is slightly stronger than Assumption 3.1.

3.2 Statement of main results

With these assumptions in place, we are now ready for the statements of our main results. All of our results give bounds on the mean-squared estimation error $\mathbb{E}\|\bar{f} - f^*\|_2^2$ associated with the averaged estimate $\bar{f}$, based on assigning $n = N/m$ samples to each of $m$ machines. Both theorem statements involve the following three kernel-related quantities:

\[ \operatorname{tr}(K) := \sum_{j=1}^{\infty} \mu_j, \qquad \gamma(\lambda) := \sum_{j=1}^{\infty} \frac{\mu_j}{\mu_j + \lambda}, \qquad \beta_d := \sum_{j=d+1}^{\infty} \mu_j. \tag{5} \]

The first quantity is the kernel trace, which serves as a crude estimate of the "size" of the kernel operator and is assumed to be finite. The second quantity $\gamma(\lambda)$, familiar from previous work on kernel regression [35], is known as the "effective dimensionality" of the kernel $K$ with respect to $L^2(\mathbb{P})$. Finally, the quantity $\beta_d$ is parameterized by a positive integer $d$ that we may choose in applying the bounds, and it describes the tail decay of the eigenvalues of $K$. For $d = 0$, note that $\beta_0$ reduces to the ordinary trace. Finally, both theorems involve one further quantity that depends on the number of moments $k$ in Assumption 3.1, namely

(6)

Here the parameter $d$ may be optimized to obtain the sharpest possible upper bound, and it may be chosen arbitrarily. (The algorithm's execution is independent of $d$.)
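As a small, hedged illustration of these quantities (assuming the definitions in equation (5); the polynomial eigenvalue sequence is an arbitrary example), the following snippet computes the trace, the effective dimension $\gamma(\lambda)$, and the tail sum $\beta_d$:

```python
import numpy as np

def kernel_quantities(mu, lam, d):
    """Trace, effective dimension gamma(lam), and tail sum beta_d
    for a (truncated) eigenvalue sequence mu_1 >= mu_2 >= ... >= 0."""
    trace = mu.sum()
    gamma = (mu / (mu + lam)).sum()
    beta_d = mu[d:].sum()          # sum over j > d (0-indexed slice)
    return trace, gamma, beta_d

# Example: polynomial decay mu_j = j^{-2 nu} with nu = 1, truncated at 10^5 terms.
j = np.arange(1, 100_001)
mu = j ** (-2.0)
print(kernel_quantities(mu, lam=1e-3, d=100))
```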

With $f^* \in \mathcal{H}$ and under Assumptions 3.1 and 3.1, the mean-squared error of the averaged estimate $\bar{f}$ is upper bounded as

(7)

where

and $C$ denotes a universal (numerical) constant.

Theorem 3.2 is a general result that applies to any trace-class kernel. Although the statement appears somewhat complicated at first sight, it yields concrete and interpretable guarantees on the error when specialized to particular kernels, as we illustrate in Section 3.3.

Before doing so, let us provide a few heuristic arguments for intuition. In typical settings, the term involving the moment parameter goes to zero quickly: if the number of moments $k$ is large and the number of partitions $m$ is not too large, it will be of lower order. As for the remaining terms, at a high level, we show that an appropriate choice of the free parameter $d$ leaves the first two terms in the upper bound (7) dominant. Note that the terms involving the tail sum $\beta_d$ are decreasing in $d$, while the terms involving $d$ directly increase with $d$. However, the increasing term grows only logarithmically in $d$, which allows us to choose a fairly large value of $d$ without a significant penalty. As we show in our corollaries, for many kernels of interest, as long as the number of machines $m$ is not "too large," this tradeoff is such that the remaining terms are also of lower order compared to the first two terms in the bound (7). In such settings, Theorem 3.2 guarantees an upper bound of the form

\[ \mathbb{E}\big\|\bar{f} - f^*\big\|_2^2 \le c \left( \lambda \|f^*\|_{\mathcal{H}}^2 + \frac{\sigma^2 \gamma(\lambda)}{N} \right). \tag{8} \]

This inequality reveals the usual bias-variance trade-off in non-parametric regression; choosing a smaller value of $\lambda > 0$ reduces the first squared bias term, but increases the second variance term. Consequently, the setting of $\lambda$ that minimizes the sum of these two terms is defined by the relationship

\[ \lambda \|f^*\|_{\mathcal{H}}^2 \asymp \frac{\sigma^2 \gamma(\lambda)}{N}. \tag{9} \]

This type of fixed point equation is familiar from work on oracle inequalities and local complexity measures in empirical process theory [2, 16, 30, 35], and when $\lambda$ is chosen so that the fixed point equation (9) holds, this (typically) yields minimax-optimal convergence rates [2, 16, 35, 7]. In Section 3.3, we provide detailed examples in which the choice of $\lambda$ specified by equation (9), followed by an application of Theorem 3.2, yields minimax-optimal prediction error (for the divide-and-conquer algorithm) for many kernel classes.
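As a hedged worked example of this balancing step (an illustration only, assuming the simplified form (8) with polynomial eigendecay $\mu_j \asymp j^{-2\nu}$ and treating $\sigma$ and $\|f^*\|_{\mathcal{H}}$ as constants), the effective dimension satisfies

\[ \gamma(\lambda) = \sum_{j=1}^{\infty} \frac{\mu_j}{\mu_j + \lambda} \asymp \lambda^{-1/(2\nu)}, \]

so the balance (9) reads $\lambda \asymp \lambda^{-1/(2\nu)}/N$, which gives

\[ \lambda \asymp N^{-2\nu/(2\nu+1)} \quad \text{and hence} \quad \mathbb{E}\big\|\bar{f} - f^*\big\|_2^2 \lesssim \lambda \|f^*\|_{\mathcal{H}}^2 + \frac{\sigma^2 \gamma(\lambda)}{N} \asymp N^{-2\nu/(2\nu+1)}, \]

matching the Sobolev-space rate discussed in Section 3.3.2.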

We now turn to an error bound that applies without requiring that $f^* \in \mathcal{H}$. Define the radius $R$ associated with the population regression function $f^*_\lambda$ previously defined in equation (4). The theorem requires a few conditions beyond those of Theorem 3.2, involving the quantities $\operatorname{tr}(K)$, $\gamma(\lambda)$, and $\beta_d$ defined in equation (5), as well as the error moment $k$ from Assumption 3.1. We assume that the relevant triplet of positive integers satisfies the conditions

(10)

We then have the following result: under condition (10), Assumption 3.1, and Assumption 3.1, for any $\lambda > 0$ we have

(11)

where the residual term is given by

(12)

and $C$ denotes a universal (numerical) constant.

Remarks:

Theorem 10 is an instance of an oracle inequality, since it upper bounds the mean-squared error in terms of an approximation error that could be computed only by an oracle knowing the sampling distribution $\mathbb{P}$, plus the residual error term (12).

In some situations, it may be difficult to verify Assumption 3.1. In such scenarios, an alternate condition suffices. For instance, if there exists a constant $\tau < \infty$ bounding sufficiently many moments of the response variable $Y$, then the bound (11) holds (under condition (10)) with the residual term (12) replaced by the alternative residual error

(13)

In essence, if the response variable $Y$ has sufficiently many moments, the prediction mean-squared error in the statement of Theorem 10 can be replaced by constants related to the size of $Y$. See Section 5.2 for a proof of inequality (13).

In comparison with Theorem 3.2, Theorem 10 provides somewhat looser bounds. It is, however, instructive to consider a few special cases. For the first, we may assume that $f^* \in \mathcal{H}$, in which case there is no approximation error and the comparison function may be taken to be $f^*$ itself; this choice (essentially) recovers Theorem 3.2. We are thus left with the bound

(14)

where $\lesssim$ denotes an inequality up to constants. By inspection, this bound is roughly equivalent to Theorem 3.2; see in particular the decomposition (8). On the other hand, when the condition $f^* \in \mathcal{H}$ fails to hold, we can choose the comparison function and the parameter $\lambda$ to balance the familiar approximation and estimation errors. In this case, we have

(15)

The condition (10) required to apply Theorem 10 imposes constraints on the number of subsampled data sets $m$ that are stronger than those of Theorem 3.2. In particular, ignoring constants and logarithmic factors, Theorem 3.2 allows the number of partitions $m$ to grow more quickly with $N$ than condition (10) permits (recall the remarks on $m$ following Theorem 3.2, or look ahead to condition (25)). Thus, at least in our current analysis, generalizing to the case $f^* \notin \mathcal{H}$ prevents us from dividing the data into finer subsets.

Finally, it is worth noting that in practice, the optimal choice for the regularization parameter $\lambda$ may not be known a priori. Thus it seems that an adaptive choice of the regularization would be desirable (see, for example, Tsybakov [29, Chapter 3]). Cross-validation or other types of unbiased risk estimation are natural choices; however, it is at this point unclear how to achieve such behavior in the distributed setting we study, that is, where each estimate $\widehat{f}_i$ depends only on the $i$th local dataset. We leave this as an open question.

3.3 Some consequences

We now turn to deriving some explicit consequences of our main theorems for specific classes of reproducing kernel Hilbert spaces. In each case, our derivation follows the broad outline given in the remarks following Theorem 3.2: we first choose the regularization parameter $\lambda$ to balance the bias and variance terms, and then show, by comparison to known minimax lower bounds, that the resulting upper bound is optimal. Finally, we derive an upper bound on the number of subsampled data sets $m$ for which the minimax-optimal convergence rate can still be achieved.

3.3.1 Finite-rank kernels

Our first corollary applies to problems for which the kernel has finite rank $r$, meaning that its eigenvalues satisfy $\mu_j = 0$ for all $j > r$. Examples of such finite-rank kernels include the linear kernel $K(x, x') = \langle x, x' \rangle$ on $\mathbb{R}^d$, which has rank at most $d$, and the kernel $K(x, x') = (1 + x x')^m$ generating polynomials of degree $m$, which has rank at most $m + 1$. For a kernel with rank $r$, consider the output $\bar{f}$ of the divide-and-conquer KRR algorithm with an appropriately chosen regularization parameter $\lambda$. Suppose that Assumption 3.1 and Assumption 3.1 (or 3.1) hold, and that the number of processors $m$ satisfies the bound

where $c$ is a universal (numerical) constant. For suitably large $N$, the mean-squared error is bounded as

(16)

For finite-rank kernels, the rate (16) is known to be minimax-optimal, meaning that there is a universal constant $c' > 0$ such that

(17)

where the infimum ranges over all estimators $\widehat{f}$ based on observing all $N$ samples (and with no constraints on memory and/or computation). This lower bound follows from Theorem 2(a) of Raskutti et al. [23].
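For orientation, a hedged summary of the finite-rank case (assuming the $r/N$ scaling for rank-$r$ kernels quoted in the introduction, with constants suppressed): the upper bound (16) and the lower bound (17) match in the sense that

\[ \mathbb{E}\big\|\bar{f} - f^*\big\|_2^2 \lesssim \frac{\sigma^2 r}{N} \qquad \text{and} \qquad \inf_{\widehat{f}} \sup_{f^*} \mathbb{E}\big\|\widehat{f} - f^*\big\|_2^2 \gtrsim \frac{\sigma^2 r}{N}, \]

where the infimum is over all estimators using the full sample of size $N$. Up to constants, the distributed estimator thus loses nothing relative to the centralized minimax risk.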

3.3.2 Polynomially decaying eigenvalues

Our next corollary applies to kernel operators with eigenvalues that obey a bound of the form

\[ \mu_j \le C\, j^{-2\nu} \quad \text{for all } j = 1, 2, \ldots, \tag{18} \]

where $C$ is a universal constant and $\nu > 1/2$ parameterizes the decay rate. Kernels with polynomially decaying eigenvalues include those that underlie the Sobolev spaces with different smoothness orders [e.g. 5, 10]. As a concrete example, the first-order Sobolev kernel $K(x, x') = 1 + \min\{x, x'\}$ on $[0, 1]$ generates an RKHS of Lipschitz functions with smoothness $\nu = 1$. Other higher-order Sobolev kernels also exhibit polynomial eigendecay with larger values of the parameter $\nu$.
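As a hedged numerical check of this example (assuming the kernel $K(x, x') = 1 + \min\{x, x'\}$ on $[0, 1]$ with the uniform distribution), the eigenvalues of the scaled kernel matrix $K/n$ should decay roughly as $j^{-2}$, consistent with $\nu = 1$:

```python
import numpy as np

# First-order Sobolev kernel on [0, 1]: K(x, x') = 1 + min(x, x').
n = 1000
x = np.sort(np.random.default_rng(0).uniform(0.0, 1.0, size=n))
K = 1.0 + np.minimum.outer(x, x)

# Eigenvalues of K / n approximate the kernel operator's eigenvalues.
mu_hat = np.sort(np.linalg.eigvalsh(K / n))[::-1]

# If mu_j ~ j^{-2}, then mu_hat[j] * j^2 should be roughly constant.
j = np.arange(1, 51)
print(np.round(mu_hat[:50] * j ** 2, 3)[:10])
```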

For any kernel with $\nu$-polynomial eigendecay (18), consider the output $\bar{f}$ of the divide-and-conquer KRR algorithm with $\lambda = N^{-2\nu/(2\nu+1)}$. Suppose that Assumption 3.1 and Assumption 3.1 (or 3.1) hold, and that the number of processors $m$ satisfies the bound

where $C$ is a constant depending only on $\nu$. Then the mean-squared error is bounded as

\[ \mathbb{E}\big\|\bar{f} - f^*\big\|_2^2 \le C\, N^{-2\nu/(2\nu+1)}. \tag{19} \]

The upper bound (19) is unimprovable up to constant factors, as shown by known minimax bounds on estimation error in Sobolev spaces [28, 29]; see also Theorem 2(b) of Raskutti et al. [23].

3.3.3 Exponentially decaying eigenvalues

Our final corollary applies to kernel operators with eigenvalues that obey a bound of the form

(20)

for strictly positive constants. Such classes include the RKHS generated by the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|_2^2)$. For a kernel with exponential eigendecay (20), consider the output $\bar{f}$ of the divide-and-conquer KRR algorithm with an appropriately chosen regularization parameter $\lambda$. Suppose that Assumption 3.1 and Assumption 3.1 (or 3.1) hold, and that the number of processors $m$ satisfies the bound

where $C$ is a constant depending only on the constants in the eigendecay condition (20). Then the mean-squared error is bounded as

(21)

The upper bound (21) is also minimax-optimal for the exponential kernel classes, which behave like a finite-rank kernel whose effective rank grows only slowly with the sample size $N$.

Summary:

Each corollary gives a critical threshold for the number $m$ of data partitions: as long as $m$ is below this threshold, we see that the decomposition-based algorithm gives the optimal rate of convergence. It is interesting to note that the number of splits may be quite large: each threshold grows asymptotically with $N$ whenever the basis functions have more than four moments (viz. Assumption 3.1). Moreover, the method can attain these optimal convergence rates while using substantially less computation than standard kernel ridge regression methods.
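To make the computational savings concrete, here is a hedged back-of-the-envelope calculation under the naive cost model from the introduction, in which a single KRR solve on $n$ points takes time of order $n^3$ and memory of order $n^2$: with $m$ partitions of $n = N/m$ points each,

\[ \text{total time} \;\asymp\; m \cdot \left(\frac{N}{m}\right)^3 = \frac{N^3}{m^2}, \qquad \text{memory per machine} \;\asymp\; \left(\frac{N}{m}\right)^2 = \frac{N^2}{m^2}. \]

For example, with $N = 10^6$ and $m = 10^3$, the total work drops from about $10^{18}$ to $10^{12}$ operations (a factor of $m^2 = 10^6$), and each machine stores a kernel matrix with $10^6$ rather than $10^{12}$ entries.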

4 Proofs of Theorem 3.2 and related results

We now turn to the proofs of Theorem 3.2 and Corollaries 3.3.1 through 3.3.3. This section contains only a high-level view of the proof of Theorem 3.2; we defer more technical aspects to the appendices.

4.1 Proof of Theorem 3.2

Using the definition of the averaged estimate $\bar{f} = \frac{1}{m} \sum_{i=1}^m \widehat{f}_i$, a bit of algebra yields

\[ \mathbb{E}\big\|\bar{f} - f^*\big\|_2^2 = \big\|\mathbb{E}[\widehat{f}_1] - f^*\big\|_2^2 + \frac{1}{m} \mathbb{E}\big\|\widehat{f}_1 - \mathbb{E}[\widehat{f}_1]\big\|_2^2, \]

where we used the fact that $\mathbb{E}[\widehat{f}_i] = \mathbb{E}[\widehat{f}_1]$ for each $i \in \{1, \ldots, m\}$, the subsets being identically distributed and independent. Using this once more, we bound the variance term to see that

\[ \mathbb{E}\big\|\bar{f} - f^*\big\|_2^2 \le \big\|\mathbb{E}[\widehat{f}_1] - f^*\big\|_2^2 + \frac{1}{m} \mathbb{E}\big\|\widehat{f}_1 - f^*\big\|_2^2, \tag{22} \]

where we have used the fact that $\mathbb{E}[\widehat{f}_1]$ minimizes $\mathbb{E}\|\widehat{f}_1 - g\|_2^2$ over $g \in L^2(\mathbb{P})$.

The error bound (22) suggests our strategy: we upper bound the bias term $\|\mathbb{E}[\widehat{f}_1] - f^*\|_2^2$ and the term $\mathbb{E}\|\widehat{f}_1 - f^*\|_2^2$ respectively. Based on equation (3), the estimate $\widehat{f}_1$ is obtained from a standard kernel ridge regression with sample size $n = N/m$ and ridge parameter $\lambda$. Accordingly, the following two auxiliary results provide bounds on these two terms, where the reader should recall the definitions of $\gamma(\lambda)$ and $\beta_d$ from equation (5). In each lemma, $C$ represents a universal (numerical) constant.

[Bias bound] Under Assumptions 3.1 and 3.1, for each , we have

(23)

[Variance bound] Under Assumptions 3.1 and 3.1, for each , we have

(24)

The proofs of these lemmas, contained in Appendices A and B respectively, constitute one main technical contribution of this paper.

Given these two lemmas, the remainder of the theorem proof is straightforward. Combining the inequality (22) with Lemmas 4.1 and 4.1 yields the claim of Theorem 3.2.

4.2 Proof of Corollary 3.3.1

We first present a general inequality bounding the size of for which optimal convergence rates are possible. We assume that is chosen large enough that for some constant , we have in Theorem 3.2, and that the regularization has been chosen. In this case, inspection of Theorem 3.2 shows that if is small enough that

then the term provides a convergence rate given by . Thus, solving the expression above for , we find

Taking -th roots of both sides, we obtain that if

(25)

then the term of the bound (7) is .

Now we apply the bound (25) in the case in the corollary. Let us take ; then . We find that since each of its terms is bounded by 1, and we take . Evaluating the expression (25) with this value, we arrive at

If we have sufficiently many moments that , and (for example, if the basis functions have a uniform bound ), then we may take , which implies that , and we replace with (we assume ), by recalling Theorem 3.2. Then so long as

for some constant , we obtain an identical result.

4.3 Proof of Corollary 3.3.2

We follow the program outlined in our remarks following Theorem 3.2. We must first choose so that . To that end, we note that setting gives

Dividing by , we find that , as desired. Now we choose the truncation parameter . By choosing for some , then we find that and an integration yields . Setting guarantees that and ; the corresponding terms in the bound (7) are thus negligible. Moreover, we have for any finite that .
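The integration step mentioned above can be sketched explicitly; a hedged calculation, assuming the polynomial decay $\mu_j \le C j^{-2\nu}$ of condition (18) and a truncation level $d$:

\[ \beta_d = \sum_{j=d+1}^{\infty} \mu_j \le C \sum_{j=d+1}^{\infty} j^{-2\nu} \le C \int_{d}^{\infty} t^{-2\nu}\, dt = \frac{C\, d^{1-2\nu}}{2\nu - 1}, \]

so the tail sum decays polynomially in $d$, and taking $d$ to be a sufficiently large power of $N$ makes the corresponding terms in the bound (7) negligible.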

Applying the general bound (25) on , we arrive at the inequality

Whenever this holds, we have convergence rate . Now, let Assumption 3.1 hold, and take . Then the above bound becomes (to a multiplicative constant factor) , as claimed.

4.4 Proof of Corollary 3.3.3

First, we set . Considering the sum , we see that for , the elements of the sum are bounded by . For , we make the approximation

Thus we find that for some constant . By choosing , we have that the tail sum and -th eigenvalue both satisfy . As a consequence, all the terms involving or in the bound (7) are negligible.

Recalling our inequality (25), we thus find that (under Assumption 3.1), as long as the number of partitions satisfies

the convergence rate of to is given by . Under the boundedness assumption 3.1, as we did in the proof of Corollary 3.3.1, we take in Theorem 3.2. By inspection, this yields the second statement of the corollary.

5 Proof of Theorem 10 and related results

In this section, we provide the proof of Theorem 10, as well as the bound (13) based on the alternative form of the residual error. As in the previous section, we present a high-level proof, deferring more technical arguments to the appendices.

5.1 Proof of Theorem 10

We begin by stating and proving two auxiliary claims:

(26a)
(26b)

Let us begin by proving equality (26a). By adding and subtracting terms, we have

where equality (i) follows since the random variable

is mean-zero given .

For the second equality (26b), consider any function in the RKHS that satisfies . The definition of the minimizer guarantees that

This result combined with equation (26a) establishes the equality (26b).

We now turn to the proof of the theorem. Applying Hölder’s inequality yields that