In parallel computing environments, a common practice is to distribute a massive dataset across multiple processors and then aggregate the local results obtained on separate machines into global counterparts. This divide-and-conquer (D&C) strategy often requires a growing number of machines to handle an increasingly large dataset. An important question for statisticians is "how many machines can we use in parallel computing while guaranteeing statistical optimality?" The present work explores this basic yet fundamentally important question in a classical nonparametric regression setup, namely kernel ridge regression (KRR). This is done by carefully analyzing the statistical-versus-computational trade-off in the D&C framework, where the number of deployed machines is treated as a simple proxy for computing cost.
Recently, researchers have made impressive progress on KRR in the modern D&C framework with different conquer strategies; examples include the median-of-means estimator, Bayesian aggregation [15, 22, 18, 20], and simple averaging. Upper bounds on the number of machines that guarantee good statistical properties have been studied for such strategies. For instance, it has been shown that, when the number of employed processors lies in a suitable range, the D&C method still preserves minimax-optimal estimation. In smoothing spline regression (a special case of KRR), critical, i.e., un-improvable, upper bounds on the number of machines have been derived for achieving either optimal estimation or optimal testing, but those results are only valid for univariate fixed designs; the critical bound for estimation significantly improves earlier ones. Nonetheless, it remains unknown whether such results continue to hold in a more general KRR framework where the design is either multivariate or random. On the other hand, there is a lack of literature dealing with nonparametric testing in general KRR; to the best of our knowledge, the only existing reference treats the special case of smoothing spline regression with univariate fixed designs.
In this paper, we consider KRR in the D&C regime in a general setup: the design is random and multivariate. As our technical contribution, we characterize upper bounds on the number of machines for achieving optimal estimation and testing by quantifying an empirical process (EP), such that a sharper concentration bound for the EP leads to a tighter upper bound on the number of machines. Our EP approach can handle various function spaces, including Sobolev spaces, the Gaussian RKHS, and spaces with special structure such as additive functions, in a unified manner. As an illustrative example, in the particular case of smoothing spline regression, we introduce the Green function for equivalent kernels into the EP bound and achieve a polynomial-order improvement in the upper bound over earlier work. It is worth noting that, for optimal estimation, our upper bound is almost identical (up to a logarithmic factor) to the critical one, which is proven to be un-improvable.
The second main contribution of this paper is a Wald-type test statistic for nonparametric testing in the D&C regime. The asymptotic null distribution and power behavior of the proposed test statistic are carefully analyzed. One important finding is that the upper bounds on the number of machines for optimal testing are dramatically different from those for estimation, indicating the essentially different natures of the two problems. Our testing results are derived in a general framework that covers the aforementioned important function spaces. As an important byproduct, we derive a minimax rate of testing for nonparametric additive models with a diverging number of components, which is new in the literature. This rate is crucial for deriving the upper bound on the number of machines for optimal testing, and is of independent interest.
2 Background and Distributed Kernel Ridge Regression
We begin by introducing some background on reproducing kernel Hilbert space (RKHS), and our nonparametric testing formulation under the distributed kernel ridge regression.
2.1 Nonparametric regression in reproducing kernel Hilbert spaces
Suppose that the data are i.i.d. generated from the following regression model
where the random errors have mean zero and finite variance, the covariates follow a distribution on the design space, and the response is real-valued. Here, the covariate dimension is either fixed or diverging with the sample size, and the regression function is unknown.
Throughout we assume that the regression function belongs to a reproducing kernel Hilbert space (RKHS) associated with an inner product and a reproducing kernel function. By Mercer's theorem, the kernel has the following spectral expansion:
where the kernel eigenvalues form a sequence and the eigenfunctions form a basis of the space. Moreover, the eigenfunctions are orthonormal: their pairwise inner products are given by Kronecker's delta.
To facilitate our statistical inference theory, we introduce a norm that combines the original RKHS norm with the squared-error norm. For any function in the space, define
where the second term involves the penalization parameter. Clearly, this defines an inner product on the space. It is easy to prove that the space equipped with this inner product is also an RKHS, with a reproducing kernel function satisfying the following so-called reproducing property:
where for .
For any , we can express the function in terms of the Fourier expansion as . Therefore,
Replacing with in (2.3), we have . Then for any , has an explicit eigen-expansion expressed as
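As a point of reference, in much of the KRR literature the kernel associated with such a penalized inner product takes the following "equivalent kernel" form; the notation here is ours and purely illustrative, with $\mu_i$ denoting the kernel eigenvalues, $\phi_i$ the eigenfunctions, and $\lambda$ the penalization parameter:

```latex
\widetilde{K}(x, y) \;=\; \sum_{i=1}^{\infty} \frac{\phi_i(x)\,\phi_i(y)}{1 + \lambda/\mu_i}.
```

This form can be verified directly: taking the penalized inner product of $\widetilde{K}(x,\cdot)$ with $\phi_j$ gives $\phi_j(x)\,(1+\lambda/\mu_j)/(1+\lambda/\mu_j) = \phi_j(x)$, which is exactly the reproducing property.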
2.2 Distributed kernel ridge regression
For estimating the regression function, we consider kernel ridge regression (KRR) in a divide-and-conquer (D&C) regime. First, randomly divide the samples into subsamples. Let denote the set of indices of the observations from subsample for . For simplicity, suppose , i.e., all subsamples are of equal size. Hence, the total sample size is . Then, we estimate based on subsample through the following KRR method:
where is the penalization parameter. The D&C estimator is defined as the average of the subsample estimators, that is, .
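To make the construction concrete, the subsample fits and their average can be sketched in a few lines. This is a minimal illustration rather than the procedure used in the paper: it assumes a Gaussian kernel with a hand-picked bandwidth, and uses the closed-form KRR solution alpha = (K + n*lam*I)^{-1} y on each subsample.

```python
import numpy as np

def gaussian_kernel(X, Z, bw=0.2):
    """Gram matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 bw^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bw ** 2))

def krr_fit(X, y, lam):
    """KRR on one subsample: alpha = (K + n*lam*I)^{-1} y."""
    n = X.shape[0]
    alpha = np.linalg.solve(gaussian_kernel(X, X) + n * lam * np.eye(n), y)
    return X, alpha

def dc_krr_predict(fits, Xnew):
    """Divide-and-conquer estimator: average of the local KRR predictions."""
    return np.mean([gaussian_kernel(Xnew, Xj) @ aj for Xj, aj in fits], axis=0)

rng = np.random.default_rng(0)
N, s = 600, 4                                   # total sample size, machines
X = rng.uniform(0, 1, size=(N, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(N)

blocks = np.array_split(rng.permutation(N), s)  # random, equal-size split
fits = [krr_fit(X[idx], y[idx], lam=1e-3) for idx in blocks]

Xgrid = np.linspace(0, 1, 50).reshape(-1, 1)
fhat = dc_krr_predict(fits, Xgrid)              # averaged estimate on a grid
mse = np.mean((fhat - np.sin(2 * np.pi * Xgrid[:, 0])) ** 2)
```

Averaging the local predictions is equivalent to averaging the local function estimates pointwise, which is exactly the conquer step of the D&C strategy.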
Based on , we further propose a Wald-type statistic for testing the hypothesis
$$H_0: f = 0 \quad \text{vs.} \quad H_1: f \neq 0. \qquad (2.4)$$
In general, testing $H_0: f = f_0$ for a known $f_0$ is equivalent to testing $H_0: f = 0$. So, (2.4) incurs no loss of generality.
3 Main results
In this section, we derive some general results on distributed estimation and testing. Let us first introduce some regularity assumptions.
The random errors have finite fourth moments, a condition commonly used in the literature.
There exists a constant such that for all , .
There exists a positive constant such that almost surely.
Define as the supremum norm of . We further assume that the eigenfunctions are uniformly bounded, and satisfy a certain tail-sum property.
The uniform boundedness condition on the eigenfunctions holds for various kernels; examples include the univariate periodic kernel, the two-dimensional Gaussian kernel, and multivariate additive kernels. The tail-sum property can also be verified in various RKHSs, and its verification is deferred to the Appendix.
Define the effective dimension, a quantity that has been widely studied in the literature. There is an explicit relationship between the effective dimension and the penalization parameter, as illustrated in various concrete examples in Section 3.4. Another quantity of interest is the series representing the variance term defined in Theorem 3.5. In the following Proposition 3.1, we show that this variance term has the same order as the effective dimension.
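For concreteness, the effective dimension is commonly defined in the KRR literature as follows; the symbols $\mu_i$ and $\lambda$ are our illustrative notation for the kernel eigenvalues and the penalization parameter:

```latex
\mathcal{N}(\lambda) \;=\; \sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + \lambda},
```

which counts, in a smoothed way, the number of eigen-directions whose eigenvalues exceed the regularization level $\lambda$.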
Suppose Assumption A3 holds. For any , .
Define , and
Here, each term is the supremum of an empirical process based on one subsample. This quantity plays a vital role in determining the critical upper bound on the number of machines that guarantees statistical optimality. As shown in our main theorems, a sharper bound on it directly leads to an improved upper bound on the number of machines. Assumption A4 provides a concentration bound for these suprema, saying that they are uniformly bounded, with constants that depend on the kernel. Verification of Assumption A4 in concrete settings, based on empirical process methods, is deferred to Section 3.4, where the values of the constants are explicitly specified.
There exist nonnegative constants such that
3.2 Minimax optimal estimation
In this section, we derive a general error bound for . Let and . Suppose that (2.1) holds under . For convenience, let be a self-adjoint operator from to itself such that for all . The existence of follows by [14, Proposition 2.1]. We first obtain a uniform error bound for ’s in the following Lemma 3.2.
(3.1) quantifies the deviation from to its conditional mean through a higher order remainder term, and (3.2) quantifies the bias of . Lemma 3.2 immediately leads to the following result on . Specifically, (3.1) and (3.2) lead to (3.3), which, together with the rates of and in Lemma A.1, leads to (3.4).
If the conditions in Lemma 3.2 hold, then with probability approaching one,
Theorem 3.3 is a general result that holds for many commonly used kernels. The stated condition directly implies that, as long as the remainder term is dominated, the conditional mean squared error can be upper bounded by the sum of the variance term and the squared bias term. Minimax-optimal estimation can then be obtained by choosing the penalization parameter that balances this bias-variance trade-off. Section 3.4 further provides concrete and interpretable guarantees on the conditional mean squared error for particular kernels.
It is worth noting that, through the conditions of Lemma 3.2 and Theorem 3.3, we build a direct connection between the upper bound on the number of machines and the uniform bound on the empirical process. That is, a tighter upper bound on the number of machines can be achieved through a sharper concentration bound on the empirical process, which is guaranteed by the empirical process methods in this work. For instance, in the smoothing spline regression of Section 3.4.1, we introduce the Green function for equivalent kernels to provide a sharp concentration bound. Consequently, we achieve an upper bound on the number of machines that is almost identical (up to a logarithmic factor) to the critical one, and improves earlier bounds by a polynomial order.
3.3 Minimax optimal testing
In this section, we derive the asymptotic distribution of the test statistic and further investigate its power behavior. For simplicity, assume that the noise variance is known; otherwise, we can replace it by a consistent estimator. We will show that the distributed test statistic can achieve the minimax rate of testing (MRT), provided that the number of divisions belongs to a suitable range. Here, the MRT is defined as the minimal distance between the null and the alternative hypotheses such that valid testing is possible. The range of the number of divisions is determined by the criterion that the proposed test statistic asymptotically achieves both correct size and high power.
Before proving consistency of the test statistic, i.e., Theorem 3.5, let us state a technical lemma. Define with , and let . Define the empirical kernel matrix and .
The following theorem shows that the test statistic is asymptotically normal under the null. The key condition for this result involves the constants determined through the uniform bound in Assumption A4. This condition in turn leads to upper bounds on the number of machines for achieving the MRT; see Section 3.4 for detailed illustrations.
By Theorem 3.5, we can define an asymptotic testing rule with significance level as follows:
where is the corresponding percentile of the standard normal distribution.
For any , define
is used to measure the distance between the null and the alternative hypotheses. The following Theorem 3.6 shows that, if the alternative signal is separated from zero by an order , then the proposed test statistic asymptotically achieves high power. To achieve optimal testing, it is sufficient to minimize . As long as is dominated by , can be simplified as
Then, the MRT can be achieved by selecting the penalization parameter to balance the trade-off between the bias term and the standard deviation term. It is worth noting that this trade-off in (3.6) for testing is different from the bias-variance trade-off in (3.3) for estimation, thus leading to a different optimal testing rate.
If the conditions in Theorem 3.5 hold, then for any , there exist and s.t.
Section 3.4 will develop upper bounds on the number of machines in various concrete examples based on the above general theorems. Our results will indicate that the ranges of the number of machines that achieve the MRT are dramatically different from those that achieve optimal estimation.
3.4 Examples

In this section, we derive upper bounds on the number of machines in four featured examples to achieve optimal estimation/testing, based on the general results obtained in Sections 3.2 and 3.3. Our examples cover univariate, multivariate, and diverging-dimensional designs.
3.4.1 Example 1: Smoothing spline regression
Suppose for a constant , where is the th order Sobolev space on , i.e.,
We call a design satisfying (3.7) quasi-uniform, a common assumption in many statistical problems. The quasi-uniform assumption excludes cases where the design density is (nearly) zero at certain points, which may cause estimation inaccuracy at those points.
It is known that when , is an RKHS under the inner product . Meanwhile, Assumption A3 holds with kernel eigenvalues , . Hence, Proposition 3.1 holds with . We next provide a sharp concentration inequality to bound the empirical process.
Under (3.7), there exist universal positive constants such that for any ,
The proof of Proposition 3.7 is based on a novel technical tool that we introduce into the D&C framework: the Green function for equivalent kernels; see [3, Corollary 5.41]. An immediate consequence of Proposition 3.7 is that Assumption A4 holds with . Then, based on Theorems 3.3 and 3.6, we have the following results.
It is known that the estimation rate is minimax-optimal. Furthermore, the testing rate is also minimax-optimal. It is worth noting that the upper bound on the number of machines matches (up to a logarithmic factor) the critical one for evenly spaced designs, which is substantially larger than the bound obtained previously for bounded eigenfunctions; see Table 1 for the comparison.
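As a side calculation, the polynomial eigen-decay in this example makes the order of the effective dimension explicit. Assuming the decay $\mu_i \asymp i^{-2m}$ stated above and comparing the sum with an integral (via the substitution $x = \lambda^{-1/(2m)} u$),

```latex
\sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + \lambda}
\;\asymp\; \sum_{i=1}^{\infty} \frac{1}{1 + \lambda i^{2m}}
\;\asymp\; \int_{0}^{\infty} \frac{dx}{1 + \lambda x^{2m}}
\;=\; \lambda^{-1/(2m)} \int_{0}^{\infty} \frac{du}{1 + u^{2m}}
\;\asymp\; \lambda^{-1/(2m)},
```

so the effective dimension grows polynomially as $\lambda \to 0$, which is the scaling behind the familiar Sobolev minimax rates.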
3.4.2 Example 2: Nonparametric additive regression
Consider the function space
where is a constant. That is, any function in the space has an additive decomposition into component functions. Here, the number of components is either fixed or slowly diverging. Such additive models have been well studied in the literature. Suppose the covariate coordinates are independent and each satisfies (3.7). For identifiability, assume each component function has mean zero. For and , define
Lemma 3.9 establishes a concentration inequality for the empirical process in the additive model. The proof is based on extending the Green function techniques to the diverging-dimensional setting; see Lemma A.2 in the Appendix.
It is known that the minimax estimation rate in the nonparametric additive model takes this form. Part (1) of Corollary 3.10 provides an upper bound on the number of machines such that the distributed estimator achieves this rate. Meanwhile, Part (2) of Corollary 3.10 provides a different upper bound on the number of machines such that our Wald-type test achieves the minimax rate of testing. It should be emphasized that this minimax rate of testing is a new result in the literature and is of independent interest. The proof is based on a recently developed local geometry approach. In the univariate case, all results in this section reduce to Example 1 on univariate smoothing splines.
3.4.3 Example 3: Gaussian RKHS regression
Suppose that is an RKHS generated by the Gaussian kernel , where are constants. Here we consider . Then Assumption A3 holds with , . It can be shown that holds. To verify Assumption A4, we need the following lemma.
For Gaussian RKHS, Assumption A4 holds with , .
Corollary 3.12 shows that one can choose the number of machines to be of a certain order (up to a logarithmic factor) to obtain both optimal estimation and testing. This is consistent with the known upper bound for optimal estimation, differing only in the logarithmic factor. Interestingly, Corollary 3.12 also shows that the number of machines can be chosen almost as large to obtain optimal testing.
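For intuition on why the Gaussian case tolerates so many machines, consider a geometric eigen-decay $\mu_i \asymp e^{-c i}$ for some constant $c > 0$; this is an illustrative profile of the fast decay associated with Gaussian kernels, and the exact decay in this paper may differ. Then the effective dimension is only logarithmic in $1/\lambda$:

```latex
\sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + \lambda}
\;\asymp\; \#\{\, i : \mu_i \ge \lambda \,\} + O(1)
\;\asymp\; \frac{1}{c}\,\log(1/\lambda),
```

since the terms with $\mu_i \ge \lambda$ are each of order one, while the geometric tail with $\mu_i < \lambda$ sums to a constant. Such a small effective dimension leaves far more room in the bias-variance and size-power budgets, consistent with the nearly sample-size-order choice of the number of machines in Corollary 3.12.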
3.4.4 Example 4: Thin-plate spline regression
For thin-plate splines, Assumption A4 holds with , .
4 Simulation studies

In this section, we examine the performance of the proposed estimation and testing procedures under various choices of the number of machines in three examples, based on simulated datasets.
4.1 Smoothing spline regression
The data were generated from the following regression model
where , and is a constant. A cubic spline (i.e., in Section 3.4.1) was employed for estimating the regression function. To display the impact of the number of divisions on statistical performance, we set sample sizes for and chose for . To examine the estimation procedure, we generated data from model (4.1) with . Mean squared errors (MSE) were computed over 100 independent replications. The left panel of Figure 1 summarizes the results: the MSE increases with the number of machines, and jumps suddenly once the number of machines exceeds a threshold. Recall that the theoretical upper bound on the number of machines for optimal estimation is given in Corollary 3.8; estimation performance deteriorates near this theoretical boundary.
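The qualitative phase transition described above is easy to reproduce in a toy version of the experiment. The following sketch is our own illustration, not the paper's setup: it uses Gaussian-kernel KRR instead of a cubic spline, a made-up sample size, and compares the MSE of the averaged estimator for a small versus a very large number of machines.

```python
import numpy as np

def krr_predict(Xtr, ytr, Xte, lam, bw=0.2):
    """Gaussian-kernel ridge regression fitted on one subsample."""
    def gram(A, B):
        return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * bw ** 2))
    n = len(Xtr)
    alpha = np.linalg.solve(gram(Xtr, Xtr) + n * lam * np.eye(n), ytr)
    return gram(Xte, Xtr) @ alpha

def dc_mse(s, X, y, Xgrid, truth, lam=1e-3):
    """MSE of the s-machine divide-and-conquer estimator on a grid."""
    blocks = np.array_split(np.arange(len(X)), s)
    fbar = np.mean([krr_predict(X[b], y[b], Xgrid, lam) for b in blocks], axis=0)
    return np.mean((fbar - truth) ** 2)

rng = np.random.default_rng(1)
N = 600
X = rng.uniform(0, 1, N)
f = lambda x: np.sin(2 * np.pi * x)
y = f(X) + 0.1 * rng.standard_normal(N)
Xgrid = np.linspace(0.05, 0.95, 40)

mse_few = dc_mse(4, X, y, Xgrid, f(Xgrid))     # 150 observations per machine
mse_many = dc_mse(150, X, y, Xgrid, f(Xgrid))  # only 4 observations per machine
```

With only a handful of observations per machine, each local fit is severely biased toward zero away from its own data, and averaging cannot repair this bias; the MSE therefore deteriorates sharply, mirroring the jump in the left panel of Figure 1.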
We next consider the hypothesis testing problem. To examine the proposed Wald test, we generated data from model (4.1) under two settings: one for examining the size of the test and the other for examining its power. The significance level was chosen as 0.05. Both size and power were calculated as proportions of rejections over 500 independent replications. The middle and right panels of Figure 1 summarize the results. Specifically, the right panel shows that the size approaches the nominal level under various choices of the number of machines, demonstrating the validity of the Wald test. The middle panel shows that the power increases as the number of machines decreases; the power stays high for moderate numbers of machines, whereas it quickly drops to zero when the number of machines is too large. This is consistent with our theoretical findings; recall the theoretical upper bound on the number of machines for optimal testing in Corollary 3.8. The numerical results also reveal that the upper bound on the number of machines for optimal testing is indeed smaller than the one required for optimal estimation.
4.2 Nonparametric additive regression
We generated data from the following nonparametric model of two additive components
where , and , , and is a constant. To examine the estimation procedure, we generated data from (4.2) with . To examine the testing procedure, we generated data at . The sample sizes and numbers of machines were chosen as in the smoothing spline example of Section 4.1. Results are summarized in Figure 2. The interpretations are similar to those for Figure 1, with a slightly different asymptotic trend: the MSE increases suddenly at a certain number of machines, the power quickly approaches one as the number of machines decreases, and the sizes are around the nominal level 0.05 in all cases.
Our work offers theoretical insights into how to allocate data in parallel computing for KRR, for both estimation and testing. Compared with existing work, it provides a general and unified treatment of these problems in modern diverging-dimension and big-data settings. Furthermore, by using the Green function for equivalent kernels to obtain a sharp concentration bound on the relevant empirical processes, we have improved the upper bound on the number of machines in smoothing spline regression for optimal estimation by a polynomial order, reaching a bound proven un-improvable (up to a logarithmic factor). Finally, we point out that our theory is useful for designing a distributed version of the generalized cross-validation method for choosing the tuning parameter and the number of machines.
Appendix A Proofs of main results
A.1 Notation table
|reproducing kernel Hilbert space (RKHS)|
|dimension of covariate|
|,||the inner product and norm under|
|kernel function under the norm|
|,||embedded inner product and norm|
|kernel function equipped with|
|number of divisions|
|the set of indices of the observations from subsample |
|the subsample size|
|the estimate of based on subsample|
|the supremum norm|
|self-adjoint operator satisfying |
|K||empirical kernel matrix|
|the th order Sobolev space on|
A.2 Some preliminary results
For any , .
For any , .
where the last inequality is by Assumption A3 and the definition of .
A.3 Proofs in Section 3.2
Our theoretical analysis relies on a set of Fréchet derivatives specified below. For , the Fréchet derivative of can be identified as follows: for any ,