How Many Machines Can We Use in Parallel Computing for Kernel Ridge Regression?

05/25/2018 · by Meimei Liu, et al. · Purdue University, Indiana University

This paper attempts to solve a basic problem in distributed statistical inference: how many machines can we use in parallel computing? In kernel ridge regression, we address this question in two important settings: nonparametric estimation and hypothesis testing. Specifically, we find a range for the number of machines under which optimal estimation/testing is achievable. The employed empirical process method provides a unified framework that allows us to handle various regression problems (such as thin-plate splines and nonparametric additive regression) under different settings (such as univariate, multivariate and diverging-dimensional designs). It is worth noting that the upper bounds on the number of machines are proven to be un-improvable (up to a logarithmic factor) in two important cases: smoothing spline regression and Gaussian RKHS regression. Our theoretical findings are backed by thorough numerical studies.

1 Introduction

In the parallel computing environment, a common practice is to distribute a massive dataset to multiple processors, and then aggregate local results obtained from separate machines into global counterparts. This Divide-and-Conquer (D&C) strategy often requires a growing number of machines to deal with an increasingly large dataset. An important question to statisticians is: how many machines can we use in parallel computing while guaranteeing statistical optimality? The present work aims to explore this basic yet fundamentally important question in a classical nonparametric regression setup, i.e., kernel ridge regression (KRR). This can be done by carefully analyzing the statistical versus computational trade-off in the D&C framework, where the number of deployed machines is treated as a simple proxy for computing cost.

Recently, researchers have made impressive progress on KRR in the modern D&C framework with different conquer strategies; examples include the median-of-means estimator proposed by [11], Bayesian aggregation considered by [15, 22, 18, 20], and simple averaging considered by [29] and [16]. Upper bounds on the number of machines have been studied for such strategies to guarantee good statistical properties. For instance, [29] showed that, when the number of machines lies within a suitable range, the D&C method still preserves minimax optimal estimation. In smoothing spline regression (a special case of KRR), [16] derived critical, i.e., un-improvable, upper bounds on the number of machines to achieve either optimal estimation or optimal testing, but their results are only valid for univariate fixed designs. The critical bound for estimation obtained by [16] significantly improves the one given in [29]. Nonetheless, it remains unknown whether the results obtained in [16] continue to hold in a more general KRR framework where the design is either multivariate or random. On the other hand, there is a lack of literature dealing with nonparametric testing in general KRR. To the best of our knowledge, [16] is the only reference, and it treats only the special case of smoothing spline regression with univariate fixed designs.

In this paper, we consider KRR in the D&C regime in a general setup: the design is random and multivariate. As our technical contribution, we characterize upper bounds on the number of machines for achieving optimal estimation and testing by quantifying an empirical process (EP), such that a sharper concentration bound on the EP leads to a tighter characterization of how many machines are allowed. Our EP approach can handle various function spaces, including Sobolev spaces, Gaussian RKHSs, and spaces with special structure such as additive functions, in a unified manner. As an illustrative example, in the particular case of smoothing spline regression, we introduce the Green function for equivalent kernels into the EP bound and achieve a polynomial-order improvement in the allowable number of machines compared with [29]. It is worth noting that our upper bound is almost identical to that of [16] (up to a logarithmic factor) for optimal estimation, which is proven to be un-improvable.

The second main contribution of this paper is to propose a Wald-type test statistic for nonparametric testing in the D&C regime. The asymptotic null distribution and power behavior of the proposed test statistic are carefully analyzed. One important finding is that the upper bounds on the number of machines for optimal testing are dramatically different from those for optimal estimation, indicating the essentially different natures of the two problems. Our testing results are derived in a general framework that covers the aforementioned important function spaces. As an important byproduct, we derive a minimax rate of testing for nonparametric additive models with a diverging number of components, which is new in the literature. This rate is crucial in deriving the upper bound on the number of machines for optimal testing, and is of independent interest.

2 Background and Distributed Kernel Ridge Regression

We begin by introducing some background on reproducing kernel Hilbert spaces (RKHS) and our nonparametric testing formulation under distributed kernel ridge regression.

2.1 Nonparametric regression in reproducing kernel Hilbert spaces

Suppose that the data $(X_i, Y_i)$, $i = 1, \dots, N$, are i.i.d. generated from the following regression model

Y_i = f(X_i) + \epsilon_i, \qquad i = 1, \dots, N, \qquad (2.1)

where the $\epsilon_i$ are random errors with zero mean and variance $\sigma^2$, the covariate $X_i$ follows a distribution $P_X$, and $Y_i$ is a real-valued response. Here, the dimension of the covariate is either fixed or diverging with $N$, and the regression function $f$ is unknown.

Throughout we assume that $f \in \mathcal{H}$, where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) associated with an inner product $\langle \cdot, \cdot \rangle_{\mathcal H}$ and a reproducing kernel function $K(\cdot, \cdot)$. By Mercer's Theorem, $K$ has the following spectral expansion ([21]):

K(x, x') = \sum_{i \ge 1} \mu_i \varphi_i(x) \varphi_i(x'),

where $\{\mu_i\}_{i \ge 1}$ is a sequence of eigenvalues and $\{\varphi_i\}_{i \ge 1}$ form a basis in $L^2(P_X)$. Moreover, for any $i, j \ge 1$,

\int \varphi_i(x) \varphi_j(x) \, dP_X(x) = \delta_{ij} \quad \text{and} \quad \langle \varphi_i, \varphi_j \rangle_{\mathcal H} = \delta_{ij} / \mu_i,

where $\delta_{ij}$ is Kronecker's delta.

We introduce a norm on $\mathcal H$ by combining the $L^2(P_X)$ norm and the $\mathcal H$ norm to facilitate our statistical inference theory. For $f, g \in \mathcal H$, define

\langle f, g \rangle = \langle f, g \rangle_{L^2(P_X)} + \lambda \langle f, g \rangle_{\mathcal H}, \qquad (2.2)

where $\langle f, g \rangle_{L^2(P_X)} = \int f g \, dP_X$ and $\lambda > 0$ is the penalization parameter. Clearly, $\langle \cdot, \cdot \rangle$ defines an inner product on $\mathcal H$. It is easy to prove that $(\mathcal H, \langle \cdot, \cdot \rangle)$ is also an RKHS with a reproducing kernel function $\bar K$ satisfying the following so-called reproducing property:

\langle \bar K_x, f \rangle = f(x) \quad \text{for all } f \in \mathcal H,

where $\bar K_x(\cdot) = \bar K(x, \cdot)$ for $x$ in the support of $P_X$.

For any $f \in \mathcal H$, we can express the function in terms of the Fourier expansion $f = \sum_{i \ge 1} f_i \varphi_i$ with $f_i = \langle f, \varphi_i \rangle_{L^2(P_X)}$. Therefore, for $f, g \in \mathcal H$,

\langle f, g \rangle = \sum_{i \ge 1} (1 + \lambda/\mu_i) f_i g_i. \qquad (2.3)

Replacing $g$ with $\bar K_x$ in (2.3) and using the reproducing property, we have $\langle f, \bar K_x \rangle = f(x) = \sum_{i \ge 1} f_i \varphi_i(x)$, so that the Fourier coefficients of $\bar K_x$ are $\varphi_i(x)/(1 + \lambda/\mu_i)$. Then, for any $x, x'$, $\bar K$ has the explicit eigen-expansion

\bar K(x, x') = \sum_{i \ge 1} \frac{\varphi_i(x) \varphi_i(x')}{1 + \lambda/\mu_i}.

2.2 Distributed kernel ridge regression

For estimating $f$, we consider kernel ridge regression (KRR) in a divide-and-conquer (D&C) regime. First, randomly divide the $N$ samples into $s$ subsamples. Let $I_j$ denote the set of indices of the observations in subsample $j$, for $j = 1, \dots, s$. For simplicity, suppose all subsamples have equal size $n$, so that the total sample size is $N = ns$. Then, we estimate $f$ based on subsample $j$ through the following KRR method:

\hat f_j = \arg\min_{f \in \mathcal H} \; \frac{1}{n} \sum_{i \in I_j} \big( Y_i - f(X_i) \big)^2 + \lambda \| f \|_{\mathcal H}^2,

where $\lambda > 0$ is the penalization parameter. The D&C estimator of $f$ is defined as the average of the $\hat f_j$'s, that is, $\bar f = s^{-1} \sum_{j=1}^{s} \hat f_j$.
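To make the D&C procedure concrete, the following is a minimal Python sketch of distributed KRR under the above setup. It uses the closed-form ridge solution on each subsample; the RBF kernel, the tuning value, and all function names (rbf_kernel, krr_fit, dc_krr) are illustrative placeholders of ours, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=10.0):
    """Gaussian (RBF) kernel matrix for 1-D inputs; a stand-in for the paper's kernel."""
    return np.exp(-gamma * (x1[:, None] - x2[None, :]) ** 2)

def krr_fit(x, y, lam, kernel=rbf_kernel):
    """Kernel ridge regression on one subsample; returns a prediction function."""
    n = len(x)
    K = kernel(x, x)
    # Solve (K + n*lam*I) alpha = y, the standard KRR normal equations.
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda x_new: kernel(x_new, x) @ alpha

def dc_krr(x, y, s, lam, rng=None):
    """Divide-and-conquer KRR: fit KRR on each of s random subsamples, then average."""
    rng = rng if rng is not None else np.random.default_rng(0)
    blocks = np.array_split(rng.permutation(len(x)), s)
    fits = [krr_fit(x[b], y[b], lam) for b in blocks]
    return lambda x_new: np.mean([f(x_new) for f in fits], axis=0)
```

Each machine only needs to solve an $n \times n$ linear system with $n = N/s$, which is the source of the computational savings; the averaging step requires no communication beyond the fitted local predictors.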

Based on $\bar f$, we further propose a Wald-type statistic for testing the hypothesis

H_0: f = 0 \quad \text{vs.} \quad H_1: f \neq 0. \qquad (2.4)

In general, testing $H_0: f = f_0$ (for a known $f_0$) is equivalent to testing $H_0: f = 0$, since one can simply replace the responses $Y_i$ by $Y_i - f_0(X_i)$. So, (2.4) entails no loss of generality.

3 Main results

In this section, we derive some general results on the D&C estimator and the associated test. Let us first introduce some regularity assumptions.

3.1 Assumptions

The following Assumptions A1 and A2 require that the design density is bounded and that the error has a finite fourth moment; both are commonly used in the literature, see [3].

Assumption A1.

There exists a constant $c_0 > 0$ such that the design density $\pi$ of $P_X$ satisfies $\pi(x) \le c_0$ for all $x$ in its support.

Assumption A2.

There exists a positive constant $C$ such that $\mathbb E[\epsilon^4 \mid X] \le C$ almost surely.

Define $\|\cdot\|_{\sup}$ as the supremum norm. We further assume that the eigenfunctions $\varphi_i$ are uniformly bounded in the supremum norm and satisfy a certain tail-sum property.

Assumption A3.

and .

The uniform boundedness condition on the eigenfunctions holds for various kernels; examples include the univariate periodic kernel, the 2-dimensional Gaussian kernel, and the multivariate additive kernel; see [7], [10] and references therein. The tail-sum property can also be verified in various RKHSs; the verification is deferred to the Appendix.
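As a concrete illustration of the boundedness condition (our own example, consistent with the univariate periodic case mentioned above), the trigonometric basis on $[0,1]$,

\[
\varphi_1(x) = 1, \qquad \varphi_{2k}(x) = \sqrt{2}\cos(2\pi k x), \qquad \varphi_{2k+1}(x) = \sqrt{2}\sin(2\pi k x), \qquad k \ge 1,
\]

is orthonormal in $L^2[0,1]$ and uniformly bounded by $\sqrt{2}$, while the eigenvalues of an $m$th-order periodic Sobolev kernel decay polynomially, $\mu_i \asymp i^{-2m}$.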

Define the effective dimension associated with the eigenvalues $\{\mu_i\}$ and the penalization parameter $\lambda$; it has been widely studied, see [1], [9], [28], among others. There is an explicit relationship between the effective dimension and $\lambda$, as illustrated in various concrete examples in Section 3.4. Another quantity of interest is a related series, which represents the variance term defined in Theorem 3.5. In the following Proposition 3.1, we show that this variance term has the same order as the effective dimension.
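For reference, the effective dimension in the cited literature ([1], [9], [28]) is typically defined by the trace-type quantity below; we write it as $\mathcal N(\lambda)$ (our notation) and assume the paper's definition matches this standard form:

\[
\mathcal N(\lambda) \;=\; \sum_{i \ge 1} \frac{\mu_i}{\mu_i + \lambda} \;=\; \sum_{i \ge 1} \frac{1}{1 + \lambda/\mu_i}.
\]

For polynomially decaying eigenvalues $\mu_i \asymp i^{-2m}$ this gives $\mathcal N(\lambda) \asymp \lambda^{-1/(2m)}$, and for exponentially decaying eigenvalues it grows only logarithmically in $1/\lambda$.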

Proposition 3.1.

Suppose Assumption A3 holds. For any , .

Next, define the supremum of an empirical process based on each subsample; this quantity plays a vital role in determining the critical upper bound on the number of machines that guarantees statistical optimality. As shown in our main theorems, a sharper bound on this empirical process directly leads to an improved upper bound on the number of machines. Assumption A4 provides a concentration bound for the empirical process, stating that it is uniformly bounded in terms of constants that depend on the kernel. Verification of Assumption A4 is deferred to Section 3.4, where it is carried out in concrete settings using empirical process methods and the constants are specified explicitly.

Assumption A4.

There exist nonnegative constants such that

3.2 Minimax optimal estimation

In this section, we derive a general error bound for the D&C estimator $\bar f$. Suppose that model (2.1) holds, and denote the true regression function by $f_0$. For convenience, introduce a self-adjoint operator from $\mathcal H$ to itself associated with the penalization; the existence of such an operator follows from [14, Proposition 2.1]. We first obtain a uniform error bound for the $\hat f_j$'s in the following Lemma 3.2.

Lemma 3.2.

Suppose Assumptions A1, A3, and A4 are satisfied, and the number of machines satisfies the required bound with the constants given in Assumption A4. Then, with probability approaching one, for any $j = 1, \dots, s$,

(3.1)
(3.2)

Here, (3.1) quantifies the deviation of $\hat f_j$ from its conditional mean through a higher-order remainder term, and (3.2) quantifies the bias of $\hat f_j$. Lemma 3.2 immediately leads to the following result on $\bar f$. Specifically, (3.1) and (3.2) lead to (3.3), which, together with the rates given in Lemma A.1, leads to (3.4).

Theorem 3.3.

If the conditions in Lemma 3.2 hold, then with probability approaching one,

(3.3)
(3.4)

Theorem 3.3 is a general result that holds for many commonly used kernels. Under the stated conditions, the conditional mean squared error can be upper bounded by the sum of a variance term and a squared bias term. Minimax optimal estimation is then obtained by choosing the particular $\lambda$ that balances this bias-variance trade-off; see [1], [25]. Section 3.4 further provides concrete and interpretable guarantees on the conditional mean squared error for particular kernels.
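To make the trade-off concrete, here is the classical calculation in the smoothing spline case, written as a sketch under the standard eigenvalue decay $\mu_i \asymp i^{-2m}$ (it is not reproduced from the paper's displays):

\[
\underbrace{\lambda \, \|f_0\|_{\mathcal H}^2}_{\text{squared bias}} \;+\; \underbrace{\frac{\sigma^2 \, \mathcal N(\lambda)}{N}}_{\text{variance}}
\;\asymp\; \lambda \;+\; \frac{\lambda^{-1/(2m)}}{N},
\]

which is balanced at $\lambda \asymp N^{-2m/(2m+1)}$, recovering the classical minimax estimation rate $N^{-2m/(2m+1)}$ for $m$th-order Sobolev classes; see [19].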

It is worth noting that, through the conditions of Lemma 3.2 and Theorem 3.3, we build a direct connection between the upper bound on the number of machines and the uniform bound on the empirical process: a sharper concentration bound on the empirical process yields an improved upper bound on the number of machines, and such concentration bounds are supplied by the empirical process methods developed in this work. For instance, for the smoothing spline regression in Section 3.4.1, we introduce the Green function for equivalent kernels from [3] to provide a sharp concentration bound on the empirical process. Consequently, we achieve an upper bound on the number of machines that is almost identical to the critical one obtained by [16] (up to a logarithmic factor), and that improves the one obtained by [29] by a polynomial order.

3.3 Minimax optimal testing

In this section, we derive the asymptotic distribution of the proposed test statistic and further investigate its power behavior. For simplicity, assume that the error variance $\sigma^2$ is known; otherwise, we can replace it by a consistent estimator to carry out our procedure. We will show that the distributed test statistic can achieve the minimax rate of testing (MRT), provided that the number of divisions belongs to a suitable range. Here, MRT refers to the minimal separation between the null and the alternative hypotheses under which valid testing is possible. The range for the number of machines is determined by the criterion that the proposed test statistic asymptotically achieves the correct size and high power.

Before proving consistency of the test statistic, i.e., Theorem 3.5, let us state a technical lemma. It involves the empirical kernel matrix obtained by evaluating $\bar K$ at the design points of a subsample, together with some auxiliary quantities.

Lemma 3.4.

Suppose Assumptions A1,A2, A3, A4 are all satisfied, and , , . Then it holds that

(3.5)

Furthermore, as , , where .

The following theorem shows that the test statistic is asymptotically normal under the null hypothesis. The key condition for such a result involves the constants determined by the uniform bound on the empirical process in Assumption A4. This condition in turn leads to upper bounds on the number of machines for achieving the MRT; see Section 3.4 for detailed illustrations.

Theorem 3.5.

Suppose Assumptions A1 to A4 are all satisfied, and as , , , and . Then, as ,

By Theorem 3.5, we can define an asymptotic testing rule with significance level $\alpha$ that rejects $H_0$ when the standardized statistic exceeds the corresponding percentile of the standard normal distribution.

For any $f$ in the alternative, we define a separation quantity that measures the distance between the null and the alternative hypotheses. The following Theorem 3.6 shows that, if the alternative signal is separated from zero by a suitable order, then the proposed test asymptotically achieves high power. To achieve optimal testing, it is sufficient to minimize this separation. Under the stated domination condition, it can be simplified as

(3.6)

Then, the MRT can be achieved by selecting $\lambda$ to balance the trade-off between the bias and the standard deviation; see [6], [23]. It is worth noting that such a trade-off in (3.6) for testing is different from the bias-variance trade-off in (3.3) for estimation, thus leading to a different optimal rate for testing.
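For contrast with the estimation trade-off, the analogous calculation for testing in the smoothing spline case (again a sketch under $\mu_i \asymp i^{-2m}$, not the paper's display) balances the squared bias against the standard deviation of the test statistic, which scales like $\sqrt{\mathcal N(\lambda)}/N$ rather than $\mathcal N(\lambda)/N$:

\[
\lambda \;+\; \frac{\sqrt{\mathcal N(\lambda)}}{N} \;\asymp\; \lambda \;+\; \frac{\lambda^{-1/(4m)}}{N}
\quad\Longrightarrow\quad \lambda \asymp N^{-4m/(4m+1)},
\]

giving a squared separation of order $N^{-4m/(4m+1)}$, i.e., a separation rate of $N^{-2m/(4m+1)}$, the classical minimax rate of testing in the sense of [6]. This is why the optimal $\lambda$ for testing differs from the optimal $\lambda$ for estimation.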

Theorem 3.6.

If the conditions in Theorem 3.5 hold, then for any , there exist and s.t.

Section 3.4 will develop upper bounds on the number of machines in various concrete examples based on the above general theorems. Our results will indicate that the ranges of the number of machines that achieve the MRT are dramatically different from those that achieve optimal estimation.

3.4 Examples

In this section, we derive upper bounds on the number of machines in four featured examples, based on the general results obtained in Sections 3.2 and 3.3, to achieve optimal estimation/testing. Our examples cover univariate, multivariate and diverging-dimensional designs.

3.4.1 Example 1: Smoothing spline regression

Suppose that $f$ belongs to a ball of constant radius in $S^m([0,1])$, the $m$th-order Sobolev space on $[0,1]$. Then model (2.1) becomes the usual smoothing spline regression. In addition to Assumption A1, we assume that

(3.7)

We call a design satisfying (3.7) quasi-uniform, a common assumption in many statistical problems; see [3]. The quasi-uniform assumption excludes cases where the design density is (nearly) zero at certain points, which may cause estimation inaccuracy at those points.
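A quasi-uniform design condition of this kind is conventionally stated as two-sided bounds on the design density (our reading of (3.7), consistent with [3]):

\[
0 < c_1 \le \pi(x) \le c_2 < \infty \qquad \text{for all } x \in [0,1],
\]

for some constants $c_1, c_2$.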

It is known that, when $m > 1/2$, $S^m([0,1])$ is an RKHS under the inner product defined in (2.2); see [14], [4]. Meanwhile, Assumption A3 holds with polynomially decaying kernel eigenvalues, $\mu_i \asymp i^{-2m}$. Hence, Proposition 3.1 holds with an effective dimension of order $\lambda^{-1/(2m)}$. We next provide a sharp concentration inequality to bound the empirical process.

Proposition 3.7.

Under (3.7), there exist universal positive constants such that for any ,

The proof of Proposition 3.7 is based on a novel technical tool that we introduce into the D&C framework: the Green function for equivalent kernels; see [3, Corollary 5.41]. An immediate consequence of Proposition 3.7 is that Assumption A4 holds with the corresponding constants. Then, based on Theorems 3.3 and 3.6, we have the following results.

Corollary 3.8.

Suppose that (3.7) and Assumptions A1 and A2 hold.

  1. If , and , then .

  2. If , and , then the Wald-type test achieves minimax rate of testing .

It is known that the resulting estimation rate is minimax-optimal; see [19]. Furthermore, the testing rate is also minimax-optimal in the sense of [6]. It is worth noting that the upper bound on the number of machines matches (up to a logarithmic factor) the critical one obtained by [16] for evenly spaced designs, and is substantially larger than the one obtained by [29] for bounded eigenfunctions; see Table 1 for the comparison.

Table 1: Comparison of upper bounds on the number of machines to achieve minimax optimal estimation in smoothing spline regression, together with the sharpness of each bound: Zhang et al. [29], Shang et al. [16], and our approach.

3.4.2 Example 2: Nonparametric additive regression

Consider the class of additive functions whose components lie in univariate Sobolev balls of a common constant radius. That is, any $f$ in this class has an additive decomposition into univariate components; a standard form is sketched below. Here, the number of components $d$ is either fixed or slowly diverging. Such additive models have been well studied in the literature; see [19], [8], [13], [27], among others. Suppose the covariates $X_1, \dots, X_d$ are independent and that each marginal density satisfies (3.7). For identifiability, assume that each additive component has mean zero under the corresponding marginal distribution. For each component, define the associated univariate kernel and inner product accordingly.
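For concreteness, a standard way to write such an additive class is (our notation; the component smoothness and the centering constraint are stated as assumptions consistent with the description above):

\[
f(x) = \sum_{j=1}^{d} f_j(x_j), \qquad f_j \in S^{m}([0,1]), \qquad \mathbb E\big[f_j(X_j)\big] = 0, \quad j = 1, \dots, d,
\]

with each component norm bounded by a common constant.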

It is easy to verify that this additive class is an RKHS under the inner product defined in (2.2). Lemma 3.9 below summarizes its properties in the case of $d$ additive components.

Lemma 3.9.
  1. There exist eigenfunctions and eigenvalues satisfying Assumption A3.

  2. It holds that , and accordingly.

  3. For , , where is a bounded constant.

  4. Assumption A4 holds with .

Lemma 3.9(4) establishes a concentration inequality for the empirical process in the additive model. The proof is based on an extension of the Green function technique ([3]) to the diverging-dimensional setting; see Lemma A.2 in the Appendix.

Combining Lemma 3.9, Theorems 3.3, 3.5 and 3.6, we have the following result.

Corollary 3.10.
  1. Suppose Assumptions A1, A2 hold. If , , , , then .

  2. Suppose Assumptions A1, A2 hold. If , , , and , then the Wald-type test achieves minimax rate of testing with .

Remark 3.1.

It was shown by [13] that this is the minimax estimation rate for the nonparametric additive model. Part (1) of Corollary 3.10 provides an upper bound on the number of machines such that the D&C estimator achieves this rate. Meanwhile, Part (2) of Corollary 3.10 provides a different upper bound on the number of machines such that our Wald-type test achieves the minimax rate of testing. It should be emphasized that this minimax rate of testing is a new result in the literature and is of independent interest. The proof is based on a local geometry approach recently developed by [23]. When $d = 1$, all results in this section reduce to those of Example 1 on univariate smoothing splines.

3.4.3 Example 3: Gaussian RKHS regression

Suppose that $\mathcal H$ is an RKHS generated by the Gaussian kernel, whose scale parameters are constants. Assumption A3 then holds with exponentially decaying eigenvalues; see [17]. The remaining conditions required by our general theory can also be shown to hold. To verify Assumption A4, we need the following lemma.
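For reference, the Gaussian kernel on $\mathbb R^{d}$ takes the form below (the parametrization is ours); its eigenvalues decay nearly exponentially, so the associated effective dimension grows only polylogarithmically in $1/\lambda$:

\[
K(x, x') = \exp\!\Big( -\frac{\|x - x'\|_2^2}{2\sigma^2} \Big), \qquad \sigma > 0.
\]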

Lemma 3.11.

For Gaussian RKHS, Assumption A4 holds with , .

Following Theorems 3.3, 3.5 and 3.6, we obtain the following consequence.

Corollary 3.12.

Suppose that $\mathcal H$ is a Gaussian RKHS and Assumptions A1 and A2 hold.

  1. If and , then .

  2. If and , then the Wald-type test achieves minimax rate of testing .

Corollary 3.12 shows that one can choose the number of machines to be nearly of the order of the total sample size (up to a logarithmic factor) to obtain optimal estimation. This is consistent with the upper bound obtained by [29] for optimal estimation, up to a different logarithmic factor. Interestingly, Corollary 3.12 shows that an almost identical choice of the number of machines also yields optimal testing.

3.4.4 Example 4: Thin-Plate spline regression

Consider the $m$th-order Sobolev space on a compact domain in $\mathbb R^d$, with $d$ being fixed. It is known that Assumption A3 holds with polynomially decaying eigenvalues $\mu_i \asymp i^{-2m/d}$; see [5]. Hence, the effective dimension is of order $\lambda^{-d/(2m)}$. The following lemma verifies Assumption A4.

Lemma 3.13.

For thin-plate splines, Assumption A4 holds with , .

Following Theorem 3.3, Theorem 3.5 and Theorem 3.6, we have the following result.

Corollary 3.14.

Suppose that the thin-plate setting above holds with $m > d/2$, and that Assumptions A1 and A2 hold.

  1. If and , then .

  2. If and , then the Wald-type test achieves minimax rate of testing .

Corollary 3.14 provides upper bounds on the number of machines. These upper bounds are smaller than those in Corollary 3.8 for the univariate case, since the technique used here to bound the empirical process is not as sharp as the Green function technique used in Proposition 3.7 for the univariate example.

4 Simulation

In this section, we examine the performance of the proposed estimation and testing procedures across various choices of the number of machines, in three examples based on simulated datasets.

4.1 Smoothing spline regression

The data were generated from the following regression model

(4.1)

where the signal strength is controlled by a constant $c$. A cubic spline (i.e., $m = 2$ in Section 3.4.1) was employed for estimating the regression function. To display the impact of the number of machines on statistical performance, we considered several total sample sizes and a range of values for the number of machines. To examine the estimation procedure, we generated data from model (4.1) with a fixed nonzero signal strength. Mean squared errors (MSE) were computed over 100 independent replications. The left panel of Figure 1 summarizes the results: the MSE increases with the number of machines, and jumps sharply once the number of machines exceeds a threshold. Recall that the theoretical upper bound on the number of machines is given in Corollary 3.8; hence, estimation performance deteriorates near this theoretical boundary.
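As a minimal illustration of this experiment, the snippet below reuses the hypothetical dc_krr helper sketched in Section 2.2; the regression function, noise level, machine counts, and number of replications are illustrative placeholders, not the paper's settings.

```python
import numpy as np

# Assumes rbf_kernel, krr_fit, dc_krr from the Section 2.2 sketch are in scope.
rng = np.random.default_rng(1)
N = 2048
f_true = lambda x: np.sin(2 * np.pi * x)      # placeholder regression function
x_grid = np.linspace(0.0, 1.0, 200)

for s in [1, 4, 16, 64, 256]:                 # candidate numbers of machines
    errors = []
    for _ in range(20):                       # fewer replications than the paper's 100
        x = rng.uniform(0.0, 1.0, N)
        y = f_true(x) + rng.normal(0.0, 1.0, N)
        f_bar = dc_krr(x, y, s, lam=1e-3, rng=rng)
        errors.append(np.mean((f_bar(x_grid) - f_true(x_grid)) ** 2))
    print(f"s = {s:4d}   MSE = {np.mean(errors):.4f}")
```

The theory predicts that the MSE stays roughly flat for small numbers of machines and deteriorates once the number of machines grows past the admissible range.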

We next consider the hypothesis testing problem of whether the regression function is zero. To examine the proposed Wald test, we generated data from model (4.1) at two signal strengths: $c = 0$ for examining the size of the test, and a nonzero $c$ for examining its power. The significance level was chosen as 0.05. Both size and power were calculated as proportions of rejections over 500 independent replications. The middle and right panels of Figure 1 summarize the results. Specifically, the right panel shows that the size approaches the nominal level under various choices of the number of machines, demonstrating the validity of the Wald test. The middle panel shows that the power increases as the number of machines decreases; the power stays near one when the number of machines is small, whereas it quickly drops to zero when the number of machines becomes too large. This is consistent with our theoretical findings; recall the theoretical upper bound on the number of machines for testing in Corollary 3.8. The numerical results also reveal that the upper bound on the number of machines needed for optimal testing is indeed smaller than the one required for optimal estimation.

Figure 1: Smoothing spline regression. (a) MSE of the D&C estimator versus the number of machines. (b) Power of the Wald test versus the number of machines. (c) Size of the Wald test versus the number of machines.

4.2 Nonparametric additive regression

We generated data from the following nonparametric regression model with two additive components

(4.2)

where the two additive component functions are scaled by a signal-strength constant $c$. To examine the estimation procedure, we generated data from (4.2) with a nonzero signal strength; to examine the testing procedure, we generated data both under the null and under an alternative. The sample sizes and numbers of machines were chosen to be the same as in the smoothing spline example of Section 4.1. Results are summarized in Figure 2. The interpretations are similar to those of Figure 1, with only a slightly different asymptotic trend: the MSE increases sharply once the number of machines passes a threshold, and the power quickly approaches one once the number of machines is small enough. The sizes are around the nominal level 0.05 in all cases.

Figure 2: Additive regression model. (a) MSE of the D&C estimator versus the number of machines. (b) Power of the Wald test versus the number of machines. (c) Size of the Wald test versus the number of machines.

5 Conclusion

Our work offers theoretical insights on how to allocate data in parallel computing for KRR, in both estimation and testing. In comparison with [29] and [16], our work provides a general and unified treatment of such problems in modern diverging-dimension or big-data settings. Furthermore, by using the Green function for equivalent kernels to obtain a sharp concentration bound on the relevant empirical processes, we have improved, by a polynomial order, the upper bound on the number of machines obtained by [29] for optimal estimation in smoothing spline regression, reaching a bound that is proven un-improvable (up to a logarithmic factor) in [16]. Finally, we would like to point out that our theory is useful in designing a distributed version of the generalized cross-validation method for choosing the tuning parameter and the number of machines; see [24].

Appendix A Proofs of main results

A.1 Notation table

sample size
response
covariate
random error
reproducing kernel Hilbert space (RKHS)
design density
dimension of covariate
, the inner product and norm under
kernel function under the norm
eigenvalue
eigenfunction
inner product
, embedded inner product and norm
inner product
kernel function associated with the embedded inner product (2.2)
number of divisions (machines)
the set of indices of the observations in a given subsample
the subsample size
the KRR estimate of $f$ based on a given subsample
penalization parameter
D&C estimator
test statistic
the supremum norm
self-adjoint operator introduced in Section 3.2
K: empirical kernel matrix
the $m$th-order Sobolev space on $[0,1]$

Table 2: A list of frequently used notation and its meaning.

A.2 Some preliminary results

Lemma A.1.
  1. For any , .

  2. For any , .

Proof.

where the last inequality follows from Assumption A3 and the definition of the effective dimension.

A.3 Proofs in Section 3.2

Our theoretical analysis relies on a set of Fréchet derivatives to be specified below: for , the Fréchet derivative of can be identified as: for any ,