1 Introduction
The rapid development of data generation and acquisition has had a profound impact on knowledge discovery. Collecting data of unprecedented size and complexity is now feasible in many scientific fields. For example, a satellite takes thousands of high-resolution images per day; a Walmart store has millions of transactions per week; and Facebook generates billions of posts per month. Such examples also occur in agriculture, geology, finance, marketing, bioinformatics, and Internet studies, among others. The appearance of big data brings great opportunities for extracting new information and discovering subtle patterns. Meanwhile, its huge volume also poses many challenges to traditional data analysis, where a dataset is typically processed on a single machine. In particular, some severe challenges are computational: both the storage bottleneck and algorithmic feasibility need to be addressed. Designing effective and efficient analytic tools for big data has been a recent focus in the statistics and machine learning communities [24].

In the literature, several strategies have been proposed for processing big data. To overcome the storage bottleneck, the Hadoop system was developed to conduct distributed storage and parallel processing. The idea of Hadoop follows a natural divide-and-conquer framework, where a large problem is divided into several manageable subproblems and the final output is obtained by combining the corresponding sub-outputs. With the aid of Hadoop, many machine learning methods can be rebuilt into distributed versions for big data analysis. For example, McDonald et al. [14] considered a distributed training approach for the structured perceptron, while Kleiner et al. [10] introduced a distributed bootstrap method. Recently, similar ideas have also been applied to statistical point estimation [11], kernel ridge regression [28], matrix factorization [13], and principal component analysis [26].

To better understand the divide-and-conquer strategy, consider the following illustrative example. Suppose that a dataset consists of $N$ random samples with dimension $p$. We assume that the data follow a linear model $y = x^{T}\beta + \varepsilon$ with a random noise $\varepsilon$. The goal of learning is to estimate the regression coefficient $\beta$. Let $y$ be the $N$-dimensional response vector and $X$ be the $N \times p$ covariate matrix. The huge sample size of this problem makes the single-machine least squares estimate $\hat\beta = (X^{T}X)^{-1}X^{T}y$ computationally costly. Instead, one may first evenly distribute the $N$ samples to $m$ local machines and obtain $m$ subestimates $\hat\beta_j$, $j = 1, \dots, m$, from independent runs. The final estimate of $\beta$ can then be obtained by averaging the $m$ subestimates, $\bar\beta = \frac{1}{m}\sum_{j=1}^{m}\hat\beta_j$. Compared with the traditional method, such a distributed learning framework utilizes the computing power of multiple machines and avoids storing and operating on the original full dataset directly. We further illustrate this framework in Figure 1 and refer to it as a distributed algorithm.

The distributed algorithm provides a computationally viable route for learning with big data. However, it remains largely unknown whether such a divide-and-conquer scheme indeed provides valid theoretical inferences for the original data. For point estimation, Li et al. [11] showed that the distributed moment estimation is consistent if an unbiased estimate is obtained for each of the subproblems. For kernel ridge regression, Zhang et al. [28] showed that, with appropriate tuning parameters, the distributed algorithm does lead to a valid estimation. To provide some insight into the feasibility issue, we numerically compare the estimation accuracy of $\bar\beta$ with that of $\hat\beta$ in the aforementioned example. Specifically, we generate independently from and set based on independent observations from . The value of is generated from the presumed linear model with . We then randomly distribute the full data to $m$ local machines and output $\bar\beta$ based on local ridge estimates $\hat\beta_j$ for $j = 1, \dots, m$. In Figure 2, we plot the estimation errors versus the number of local machines based on three types of estimators: , , and . For a wide range of $m$, the distributed estimator $\bar\beta$ appears to achieve an accuracy similar to that of the traditional $\hat\beta$. However, this no longer holds when $m$ is overly large. This observation raises an interesting but fundamental question for using the distributed algorithm in regression: under what conditions does the distributed estimator provide an effective estimation of the target function? In this paper, we aim to answer this question and provide more general theoretical support for distributed regression.

Under the kernel-based regression setup, we propose to take generalization consistency as a criterion for measuring the feasibility of distributed algorithms. That is, we regard an algorithm as theoretically feasible if its generalization error tends to zero as the number of observations goes to infinity. To justify distributed regression, a uniform convergence rate is needed for bounding the generalization error over the $m$ subestimators. This brings new and challenging issues in the analysis for the big data setup. Under mild conditions, we show that distributed kernel regression (DKR) is feasible when the number of its distributed subproblems is moderate. Our result is applicable to many commonly used regression models, which incorporate a variety of loss, kernel, and penalty functions. Moreover, the feasibility of DKR does not rely on any parametric assumption on the true model. It therefore provides a basic and general understanding of distributed regression analysis. We demonstrate the promising performance of DKR via both simulation and real data examples.
The rest of the paper is organized as follows. In Section 2, we introduce the model setup and formulate the DKR algorithm. In Section 3, we establish the generalization consistency and justify the feasibility of DKR. In Section 4, we present numerical examples to support the good performance of DKR. Finally, we conclude the paper in Section 5 with some useful remarks.
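To make the averaging scheme from the illustrative linear-model example concrete, the following is a minimal Python sketch, not the authors' code. The function names (`ridge_fit`, `distributed_fit`, `simulate`), the sample size, the dimension, the noise level, and the sampling distributions are all assumptions chosen for illustration; the settings behind Figure 2 are not reproduced here.

```python
import numpy as np


def ridge_fit(X, y, lam=0.0):
    """Closed-form ridge estimate; lam=0 gives ordinary least squares.

    Solves (X'X + lam * n * I) b = X'y for b.
    """
    n, p = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ y)


def distributed_fit(X, y, m, lam=0.0, seed=0):
    """Split the rows randomly into m near-equal segments, fit each
    segment separately, and average the m local estimates."""
    rng = np.random.default_rng(seed)
    segments = np.array_split(rng.permutation(X.shape[0]), m)
    local = [ridge_fit(X[idx], y[idx], lam) for idx in segments]
    return np.mean(local, axis=0)


def simulate(N=2000, p=10, noise=1.0, seed=0):
    """Hypothetical data generator: Gaussian covariates, uniform
    coefficients, Gaussian noise (assumed, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, p))
    beta = rng.uniform(0.0, 1.0, size=p)
    y = X @ beta + noise * rng.standard_normal(N)
    return X, y, beta


if __name__ == "__main__":
    X, y, beta = simulate()
    for m in (1, 4, 20, 100):
        err = np.linalg.norm(distributed_fit(X, y, m) - beta)
        print(f"m = {m:3d}  estimation error = {err:.4f}")
```

With `m=1` the sketch reduces to the single-machine estimate, so running it over a grid of `m` values gives a rough, small-scale analogue of the comparison the text describes.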
2 Distributed Kernel Regression
2.1 Notations
Let $Y$ be a response variable bounded by some $M > 0$ and $X$ be its $d$-dimensional covariate drawn from a compact set $\mathcal{X}$. Suppose that $Z = (Y, X)$ follows a fixed but unknown distribution $\rho$ with its support fully filled on $\mathcal{Z} = [-M, M] \times \mathcal{X}$. Let $S = \{z_i = (y_i, x_i)\}_{i=1}^{N}$ be $N$ independent observations collected from $Z$. The goal of the study is to estimate the potential relationship between $X$ and $Y$ through analyzing $S$.

Let $\ell$ be a nonnegative loss function and $f$ be an arbitrary mapping from $\mathcal{X}$ to $\mathbb{R}$. We use $\mathcal{E}(f)$ to denote the expected risk of $f$, that is, the expectation of the loss incurred by $f$ when $Z$ is drawn from $\rho$. The minimizer $f_\rho$ of $\mathcal{E}(f)$ is called the regression function, which is an oracle estimate under $\ell$ and thus serves as a benchmark for other estimators. Since $\rho$ is unknown, $f_\rho$ is only conceptual. Practically, it is common to estimate $f_\rho$ through minimizing a regularized empirical risk
(1) 
where $\mathcal{H}$ is a user-specified hypothesis space, $\mathcal{E}_S(f)$ is the empirical risk, $\|\cdot\|_{\mathcal{H}}$ is a norm in $\mathcal{H}$, and $\lambda$ is a regularization parameter.
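Framework (1) covers a broad range of regression methods. In the machine learning community, it is popular to take the hypothesis space to be a reproducing kernel Hilbert space (RKHS). Specifically, let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous, symmetric, and positive semidefinite kernel function. The RKHS $\mathcal{H}_K$ is a Hilbert space of integrable functions induced by $K$. For any $f = \sum_i \alpha_i K(u_i, \cdot)$ and $g = \sum_j \beta_j K(v_j, \cdot)$, their inner product is defined by
$$\langle f, g \rangle_K = \sum_{i,j} \alpha_i \beta_j K(u_i, v_j),$$
and the kernel norm is given by $\|f\|_K = \sqrt{\langle f, f \rangle_K}$. It is easy to verify that
$$f(x) = \langle f, K(x, \cdot) \rangle_K \qquad (2)$$
for any $f \in \mathcal{H}_K$. Therefore, $K$ is a reproducing kernel of $\mathcal{H}_K$. Readers may refer to [1], [21] for more detailed discussions of RKHS.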
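The kernel inner product and the reproducing property (2) can be checked numerically for functions in the span of kernel sections. The sketch below is illustrative only: the Gaussian kernel, its bandwidth `sigma`, and the helper names `gaussian_kernel`, `rkhs_inner`, and `evaluate` are assumptions rather than notation from the paper.

```python
import numpy as np


def gaussian_kernel(U, V, sigma=1.0):
    """Gram matrix with entries exp(-||u_i - v_j||^2 / (2 sigma^2))."""
    U = np.atleast_2d(U)
    V = np.atleast_2d(V)
    sq = (np.sum(U**2, axis=1)[:, None]
          + np.sum(V**2, axis=1)[None, :]
          - 2.0 * U @ V.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def rkhs_inner(alpha, U, beta, V, kernel=gaussian_kernel):
    """<f, g>_K for f = sum_i alpha_i K(u_i, .) and g = sum_j beta_j K(v_j, .)."""
    return float(alpha @ kernel(U, V) @ beta)


def evaluate(alpha, U, x, kernel=gaussian_kernel):
    """Point evaluation f(x) = sum_i alpha_i K(u_i, x)."""
    return float(alpha @ kernel(U, x)[:, 0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = rng.standard_normal((5, 2))      # centers u_1, ..., u_5 in R^2
    alpha = rng.standard_normal(5)       # coefficients of f
    x = rng.standard_normal(2)           # evaluation point
    # K(x, .) is the function with a single center x and coefficient 1.
    lhs = rkhs_inner(alpha, U, np.array([1.0]), x)
    print("reproducing property gap:", abs(lhs - evaluate(alpha, U, x)))
    print("kernel norm of f:", np.sqrt(rkhs_inner(alpha, U, alpha, U)))
```

Here the inner product of `f` with the single kernel section at `x` is compared against direct evaluation of `f(x)`; the two agree up to floating-point error for any kernel passed in.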
2.2 The DKR Algorithm
We now consider (1) in the big data setup. In particular, we assume that the sample $S$ is too big to be processed on a single machine, so a distributed version is needed. Suppose $S$ is evenly and randomly assigned to $m$ local machines, with each machine processing $n = N/m$ samples. We denote by $S_j$ the sample segment assigned to the $j$th machine. The global estimator $\bar f$ is then constructed by averaging the $m$ local estimators $\hat f_1, \dots, \hat f_m$. Specifically, by taking the hypothesis space in (1) to be the RKHS $\mathcal{H}_K$, this strategy leads to the distributed kernel regression (DKR), which is described as Algorithm 1.

By the representer theorem [17], $\hat f_j$ in step 2 of DKR can be constructed from $\mathrm{span}\{K(x_i, \cdot) : (y_i, x_i) \in S_j\}$. This allows DKR to be carried out in practice within finite-dimensional subspaces. The distributed framework of DKR enables parallel processing and is thus appealing for the analysis of big data. With $m = 1$, DKR reduces to regular kernel-based learning, which has received a great deal of attention in the literature [18], [23], [27]. With quadratic $\ell$ and , Zhang et al. [28] conducted a feasibility analysis for DKR with . Unfortunately, their results are built upon the closed-form solution of $\hat f_j$ and thus are not applicable to other DKR cases. In this work, we attempt to provide a more general feasibility result for using DKR in big data.
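Algorithm 1 itself is not reproduced in this text, so the following Python sketch shows one way the described steps (random even split, local kernel estimators, averaging) could look. It is restricted to a special case in which each local problem has a closed form: the quadratic loss with a squared kernel-norm penalty and a Gaussian kernel. Those choices, the default values of `lam` and `sigma`, and the function names are assumptions for illustration and do not cover the general losses and penalties the paper analyzes.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def fit_local(X, y, lam, sigma):
    """Local kernel ridge estimator on one segment.

    Minimizes (1/n) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2.
    By the representer theorem f = sum_i a_i K(x_i, .), and the
    coefficients solve (K + n * lam * I) a = y.
    """
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    a = np.linalg.solve(K + n * lam * np.eye(n), y)
    return X, a


def dkr_fit(X, y, m, lam=1e-3, sigma=1.0, seed=0):
    """Randomly split the sample into m near-equal segments and fit each."""
    rng = np.random.default_rng(seed)
    segments = np.array_split(rng.permutation(X.shape[0]), m)
    return [fit_local(X[idx], y[idx], lam, sigma) for idx in segments]


def dkr_predict(model, X_new, sigma=1.0):
    """Average the m local predictions; sigma must match the value
    used in dkr_fit."""
    preds = [gaussian_kernel(X_new, Xj, sigma) @ aj for Xj, aj in model]
    return np.mean(preds, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, size=(400, 1))
    y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(400)
    X_new = np.linspace(-0.9, 0.9, 50)[:, None]
    for m in (1, 4, 16):
        model = dkr_fit(X, y, m, lam=1e-3, sigma=0.5)
        pred = dkr_predict(model, X_new, sigma=0.5)
        mse = np.mean((pred - np.sin(3.0 * X_new[:, 0])) ** 2)
        print(f"m = {m:2d}  test MSE = {mse:.5f}")
```

Each local fit only needs its own segment, so the list comprehension in `dkr_fit` is the part that would run in parallel on separate machines; only the coefficient vectors and segment inputs are needed at prediction time.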
3 Consistency of DKR
3.1 Preliminaries and Assumptions
In regression analysis, a good estimator of $f_\rho$ is expected not only to fit the training set $S$ but also to predict future samples from $Z$. In the machine learning community, such an ability is often referred to as the generalization capability. Recall that $f_\rho$ is a conceptual oracle estimator, which enjoys the lowest generalization risk under a given loss. The goodness of an estimator $\hat f$ can typically be measured by
$$\mathcal{E}(\hat f) - \mathcal{E}(f_\rho). \qquad (3)$$
A feasible (consistent) $\hat f$ is then required to have generalization error (3) converge to zero as $N \to \infty$. When the quadratic loss is used, the convergence of (3) also leads to the convergence of $\hat f$ to $f_\rho$ in the $L_2$ norm under the marginal distribution of $X$, which corresponds to the traditional notion of consistency in statistics.

When $\ell$ is convex, Jensen's inequality implies that
$$\mathcal{E}(\bar f) - \mathcal{E}(f_\rho) \le \frac{1}{m} \sum_{j=1}^{m} \left[ \mathcal{E}(\hat f_j) - \mathcal{E}(f_\rho) \right].$$
Therefore, the consistency of $\bar f$ is implied by the uniform consistency of the $m$ local estimators $\hat f_j$ for $j = 1, \dots, m$. Under appropriate conditions, this result may be straightforward in the fixed-$m$ setup. However, for analyzing big data, it is particularly desirable to have $m$ associated with the sample size $N$. This is because the number of machines needed in an analysis is usually determined by the scale of the problem: the larger a dataset is, the more machines are needed. This in turn suggests that, in asymptotic analysis, $m$ may diverge to infinity as $N$ increases. This liberal requirement on $m$ poses new and challenging issues in justifying $\bar f$ under the big data setup.

Clearly, the effectiveness of a learning method relies on the prior assumptions on $f_\rho$ as well as the choice of loss and kernel. For the convenience of discussion, we assess the performance of DKR under the following conditions.

A1. and , where $\|\cdot\|_\infty$ denotes the function supremum norm.

A2. The loss function $\ell$ is convex and nonnegative. For any and , there exists a constant such that

A3. For any and , there exists a , such that . Moreover, let for some . There exist constants , , such that
where denotes the covering number of a set by balls of radius with respect to .
Condition A1 is a regularity assumption on $f_\rho$, which can be trivial in applications. For the quadratic loss, we have and thus A1 holds naturally with . Condition A2 requires that $\ell$ is Lipschitz continuous in . It is satisfied by many commonly used loss functions for regression analysis. Condition A3 corresponds to the notion of a universal kernel in [15], which implies that $\mathcal{H}_K$ is dense in $C(\mathcal{X})$, the space of continuous functions on $\mathcal{X}$. It therefore serves as a prerequisite for estimating an arbitrary $f_\rho$ from $\mathcal{H}_K$. A3 also requires that the unit ball of $\mathcal{H}_K$ has a polynomial complexity. Under our setup, a broad range of choices of $K$ satisfy this condition, which include the popular Gaussian kernel as a special case [29], [30].
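As a rough numerical illustration of the kind of requirement in Condition A2, the sketch below estimates a Lipschitz constant for a few convex, nonnegative regression losses when both the prediction and the response are restricted to a bounded range $[-M, M]$. The exact form of A2 is not recoverable from this text, so the reading used here (Lipschitz continuity in the predicted value on a bounded range) is an assumption, as are the particular losses, the Huber threshold `delta`, and the helper name `lipschitz_estimate`.

```python
import numpy as np


def quadratic(t, y):
    return (t - y) ** 2


def absolute(t, y):
    return np.abs(t - y)


def huber(t, y, delta=1.0):
    r = np.abs(t - y)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))


def lipschitz_estimate(loss, M=1.0, grid=201):
    """Largest finite-difference slope of t -> loss(t, y) over a grid
    of predictions t and responses y in [-M, M]."""
    t = np.linspace(-M, M, grid)
    y = np.linspace(-M, M, grid)
    vals = loss(t[:, None], y[None, :])          # vals[i, j] = loss(t_i, y_j)
    slopes = np.abs(np.diff(vals, axis=0)) / np.diff(t)[:, None]
    return float(slopes.max())


if __name__ == "__main__":
    for name, loss in [("quadratic", quadratic),
                       ("absolute", absolute),
                       ("huber", huber)]:
        print(f"{name:9s} estimated Lipschitz constant on [-1, 1]: "
              f"{lipschitz_estimate(loss):.3f}")
```

For the quadratic loss the slope in the prediction is $2|t - y|$, which is bounded by $4M$ on the bounded range, so boundedness of the response is what keeps the constant finite in that case.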
3.2 Generalization Analysis
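To justify DKR, we decompose (3) for the DKR estimator $\bar f$ as
$$\mathcal{E}(\bar f) - \mathcal{E}(f_\rho) = \left[ \mathcal{E}_S(f_0) - \mathcal{E}(f_0) + \mathcal{E}(\bar f) - \mathcal{E}_S(\bar f) \right] \qquad (4)$$
$$\qquad + \left[ \mathcal{E}_S(\bar f) - \mathcal{E}_S(f_0) \right] \qquad (5)$$
$$\qquad + \left[ \mathcal{E}(f_0) - \mathcal{E}(f_\rho) \right], \qquad (6)$$
where $f_0$ is an arbitrary element of $\mathcal{H}_K$ and $\mathcal{E}_S$ denotes the empirical risk on $S$. We refer to (4) as the sample error and to (5) as the hypothesis error. The consistency of $\bar f$ is implied if (3) has convergent suberrors in (4)-(6). Since $f_0$ is arbitrary, (6) measures how closely the oracle $f_\rho$ can be approximated from the candidate space $\mathcal{H}_K$. This term purely reflects the prior assumptions of a learning problem. Under Conditions A1-A3, with a such that , (6) is naturally bounded by . We therefore carry on our justification by bounding the sample and hypothesis errors.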
3.2.1 Sample Error Bound
Let us first work on the sample error (4), which describes the difference between the expected loss and the empirical loss for an estimator. For the convenience of analysis, let us rewrite (4) as
(7) 
where and . It should be noted that the randomness of is purely from , which makes a fixed quantity and a sample mean of independent observations. For , since is an output of , is random in and s are dependent on each other. We derive a probability bound for the sample error through investigating (7).

To facilitate our proofs, we first state the one-sided Bernstein inequality as the following lemma.
Lemma 1.
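Let $\xi_1, \dots, \xi_n$ be independently and identically distributed random variables with $E(\xi_i) = \mu$ and $\mathrm{Var}(\xi_i) = \sigma^2$. If $|\xi_i - \mu| \le T$ for some $T > 0$, then for any $\epsilon > 0$,
$$P\left\{ \frac{1}{n} \sum_{i=1}^{n} \xi_i - \mu \ge \epsilon \right\} \le \exp\left\{ - \frac{n \epsilon^2}{2(\sigma^2 + \epsilon T / 3)} \right\}.$$

The probability bounds for the two terms of (7) are given respectively in the following propositions.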
Proposition 1.
Suppose that Conditions A1-A2 are satisfied. For any and , we have
Proof.
Proposition 2.
Suppose that Conditions A1-A3 are satisfied. For any and , we have
where .
Proof.
Let . Under Condition A3, is dense in . Therefore, for any , there exists a , such that . By A2, we further have
Consequently,
(11)  
Let be a cover of by balls of radius with respect to . With , (11) implies that
(12)  
where the last inequality follows from Lemma 1. By A3, we have
(13) 
Let . Inequality (12) together with (13) further implies that
(14) 
Based on Propositions 1 and 2, decomposition (7) implies directly the following probability bound of the sample error.
Theorem 1.
(Sample Error) Suppose that Conditions A1-A3 are satisfied. Let . For any and , we have, with probability at least ,
(17) 
where
3.2.2 Hypothesis Error Bound
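We now continue our feasibility analysis with the hypothesis error (5), which measures the empirical risk difference between $\bar f$ and an arbitrary $f_0 \in \mathcal{H}_K$. When DKR is conducted with $m = 1$, $\bar f$ corresponds to single-machine kernel learning. By setting , the hypothesis error has a natural zero bound by definition. However, this property is no longer valid for a general DKR with $m > 1$.

When $\ell$ is convex, we have (5) bounded by
$$\mathcal{E}_S(\bar f) - \mathcal{E}_S(f_0) \le \frac{1}{m} \sum_{j=1}^{m} \left[ \mathcal{E}_S(\hat f_j) - \mathcal{E}_S(f_0) \right] \le \max_{1 \le j \le m} \left[ \mathcal{E}_S(\hat f_j) - \mathcal{E}_S(f_0) \right]. \qquad (18)$$
This implies that the hypothesis error of $\bar f$ is bounded by a uniform bound of the hypothesis errors over the $m$ subestimators. We formulate this idea as the following theorem.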
Theorem 2.
(Hypothesis Error) Suppose that Conditions A1-A3 are satisfied. For any and , we have, with probability at least ,
where , , and are defined in Theorem 1.
Proof.
Without loss of generality, we prove the theorem for with . Recall that DKR splits into segments . Let be the sample set with removed from and be the empirical risk for a sample set of size . Under A2, $\ell$ is convex and thus
(19)  
where and .
Let us first work on the first term of (19). By definition of , we know that
Therefore,
(20) 
This implies that the first term of (19) is bounded by .
We now turn to bound the second term of (19). Specifically, we further decompose by
where
Note that is independent of . Proposition 1 readily implies that, with probability at least ,
Also, by applying Proposition 2 with , we have, with probability at least ,
with the same defined in Proposition 2. Consequently, we have, with probability at least ,
where .
Inequality (20) and the preceding bound further imply that, with probability at least
The theorem is therefore proved. ∎
Theorem 2 implies that, with appropriate and , the hypothesis error of DKR has an bound in probability. This result is applicable to a general with , which incorporates the diverging situations.
3.3 Generalization Bound of DKR
With the aid of Theorems 1-2, we obtain a probability bound for the generalization error of $\bar f$ as the following theorem.
Theorem 3.
(Generalization Error) Suppose that Conditions A1-A3 are satisfied. When is sufficiently large, for any ,
with probability at least , where and .
Proof.
Theorem 3 suggests that, if we set , the generalization error of $\bar f$ is bounded by an term in probability. In other words, as , a properly tuned DKR leads to an estimator that achieves the oracle predictive power. This justifies the feasibility of using the divide-and-conquer strategy for kernel-based regression analysis. Under the assumption that , we have and thus is feasible with . Moreover, when DKR is conducted with Gaussian kernels, Condition A3 is satisfied with any and thus enjoys a nearly convergence rate to .

Theorem 3 provides theoretical support for the distributed learning framework (Algorithm 1). It also reveals that the convergence rate of $\bar f$ is related to the scale of the local sample size $n$. This seems reasonable, because $\hat f_j$ is biased from $f_\rho$ under a general setup. The individual bias of $\hat f_j$ may diminish as $n$ increases. It would not, however, be balanced off by averaging the $\hat f_j$'s for $j = 1, \dots, m$. As a result, the generalization bound of $\bar f$ is determined by the largest bias among the $m$ local estimators. When $\hat f_j$ is (nearly) unbiased, its generalization performance is mainly affected by its variance. In that case, $\bar f$ is likely to achieve a faster convergence rate by averaging over the $\hat f_j$'s. We use the following corollary to show some insights on this point.

Corollary 1.
Suppose that DKR is conducted with the quadratic loss and . If for any , then under Conditions A1-A3, we have
Proof.
Let be the marginal distribution of . When the quadratic loss is used, we have
(22) 
Since we assume for any , (22) implies that
(23)  