1 Introduction and Distributed Learning Algorithms
In the era of big data, the rapid expansion of computing capacities in automatic data generation and acquisition brings data of unprecedented size and complexity, and raises a series of scientific challenges such as storage bottleneck and algorithmic scalability (Zhou et al., 2014)
. To overcome the difficulty, some approaches for generating scalable approximate algorithms have been introduced in the literature such as low-rank approximations of kernel matrices for kernel principal component analysis(Schölkopf et al., 1998; Bach, 2013), incomplete Cholesky decomposition (Fine, 2002), early-stopping of iterative optimization algorithms for gradient descent methods (Yao et al., 2007; Raskutti et al., 2014), and greedy-type algorithms. Another method proposed recently is distributed learning based on a divide-and-conquer approach and a particular learning algorithm implemented in individual machines (Zhang et al., 2015; Shamir and Srebro, 2014). This method produces distributed learning algorithms consisting of three steps: partitioning the data into disjoint subsets, applying a particular learning algorithm implemented in an individual machine to each data subset to produce an individual output (function), and synthesizing a global output by utilizing some average of the individual outputs. This method can successfully reduce the time and memory costs, and its learning performance has been observed in many practical applications to be as good as that of a big machine which could process the whole data. Theoretical attempts have been recently made in (Zhang et al., 2013, 2015) to derive learning rates for distributed learning with least squares regularization under certain assumptions.
This paper aims at error analysis of the distributed learning with regularized least squares and its approximation to the algorithm processing the whole data in one single machine. Recall (Cristianini and Shawe-Taylor, 2000; Evgeniou et al., 2000) that in a reproducing kernel Hilbert space (RKHS) induced by a Mercer kernel on an input metric space , with a sample where is the output space, the least squares regularization scheme can be stated as
Here is a regularization parameter and is the cardinality of
. This learning algorithm is also called kernel ridge regression in statistics and has been well studied in learning theory. See e.g.(De Vito et al., 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Bauer et al., 2007; Smale and Zhou, 2007; Steinwart and Christmann, 2008). The regularization scheme (1) can be explicitly solved by using a standard matrix inversion technique, which requires costs of in time and in memory. However, this matrix inversion technique may not conquer challenges on storages or computations arising from big data.
The distributed learning algorithm studied in this paper starts with partitioning the data set into disjoint subsets . Then it assigns each data subset to one machine or processor to produce a local estimator by the least squares regularization scheme (1). Finally, these local estimators are communicated to a central processor, and a global estimator is synthesized by taking a weighted average
of the local estimators . This algorithm has been studied with a matrix analysis approach in (Zhang et al., 2015) where some error analysis has been conducted under some eigenfunction assumptions for the integral operator associated with the kernel, presenting error bounds in expectation.
In this paper we shall use a novel integral operator approach to prove that is a good approximation of . We present a representation of the difference in terms of empirical integral operators, and analyze the error in expectation without any eigenfunction assumptions. As a by-product, we present the best learning rates for the least squares regularization scheme (1) in a general setting, which surprisingly has not been done for a general kernel in the literature (see detailed comparisons below).
2 Main Results
Our analysis is carried out in the standard least squares regression framework with a Borel probability measureon , where the input space is a compact metric space. The sample is independently drawn according to . The Mercer kernel defines an integral operator on as
where is the function in and is the marginal distribution of on .
2.1 Error Bounds for the Distributed Learning Algorithm
Our error bounds in expectation for the distributed learning algorithm (2) require the uniform boundedness condition for the output , that is, for some constant , there holds almost surely. Our bounds are stated in terms of the approximation error
where is the data-free limit of (1) defined by
denotes the norm of , the Hilbert space of square integrable functions with respect to , and is the regression function (conditional mean) of defined by
with being the conditional distribution of induced at .
Since is continuous, symmetric and positive semidefinite, is a compact positive operator of trace class and is invertible. Define a quantity measuring the complexity of with respect to , the effective dimension (Zhang, 2005), to be the trace of the operator as
In Section 6 we shall prove the following first main result of this paper concerning error bounds in expectation of in and in . Denote .
Assume almost surely. If for , then we have
where is a constant depending only on .
To derive the explicit learning rate of algorithm (2), one needs the following assumption as a characteristic of the complexity of the hypothesis space (Caponnetto and De Vito, 2007; Blanchard and Krmer, 2010),
for some and . In particular, let be a set of normalized eigenpairs of on with being an orthonormal basis of and arranged in a non-increasing order, and let
be the spectral decomposition. Since , the condition (7) with always holds true with . For , implies (7) (see, e.g. Caponnetto and De Vito (2007)). This condition is satisfied, e.g., by the Sobolev space , where is a ball in with the integer ,
being the uniform distribution on, and (Steinwart et al., 2009; Edmunds and Triebel, 1996).
The results in (Caponnetto and De Vito, 2007; Steinwart et al., 2009; Zhang et al., 2015) showed that if , then the optimal learning rates of algorithm (2) with can be obtained in the sense that the upper and lower bounds for are asymptomatically identical. Thus, to derive learning rates for , a more general case with an arbitrary is covered as follows.
Assume almost surely. If for , and satisfies
for some constant , then we have
where is a constant depending only on ,
, and the largest eigenvalue of.
In the special case that , the approximation error can be bounded as . A more general condition can be imposed for the regression function as
where the integral operator is regarded as a compact positive operator on and its th power is well defined for any . The condition (9) means lies in the range of , and the special case corresponds to the choice . Under condition (9), we can obtain from Corollary 2.1 the following nice convergence rates for the distributed learning algorithm.
Assume (9) for some , almost surely and for some . If for with
and , then we have
In particular, when and , the choice yields and
In Corollary 2.1, we present learning rates in both and norms. The -norm bound is useful because it equals (subject to a constant) the generalization error . The norm controls the norm since for any in , (Smale and Zhou, 2007); this inequality also implies the application of the -norm bound in the mismatched problem where the generalization power is measured in some -norm with different from .
In Corollary 2.1, the established error bounds are monotonously decreasing with respect to , which is different from the error analysis in (Zhang et al., 2015). The reason is that we are concerned with the difference between and
. This difference reflects the variance of the distributed learning algorithm. Concerning the learning rate (as shown in Corollary2.2 below), the regularization parameter should be smaller, and then the learning rate is independent of , provided is not very large.
2.2 Minimax Rates of Convergence for Least Squares Regularization Scheme
The second main result of this paper is a sharp error bound for the least squares regularization scheme (1
). We can even relax the uniform boundedness to a moment condition that for some constant,
where is the conditional variance defined by .
Assume and (11) for some . Then we have
If the parameters satisfy , we have the following explicit bound.
In particular, if , that is, the conditional variances are uniformly bounded, then
In particular, when (9) is satisfied, we have the following learning rates.
In particular, when (the conditional variances are uniformly bounded), we have
Combining bounds for and , we can derive learning rates for the distributed learning algorithm (2) for regression.
Assume almost surely and (9) for some . If for some , for , and satisfies the restriction
then by taking , we have
Corollary 2.2 shows that distributed learning with least squares regularization scheme (2) can reach the minimax rates in expectation, provided satisfies (13). It should be pointed out that we consider error analysis under (9) with while (Zhang et al., 2015) focused on the case (9) with . The main novelty of our analysis is that by using a novel second order decomposition for the difference of operator inverses, we remove the eigenfunction assumptions in (Zhang et al., 2015) and provide error bounds for a larger range of .
In this paper, we only derive minimax rates for the least squares regularization scheme (1) as well as its distributed version (2) in expectation. We guess it is possible to derive error bounds in probability by combining the proposed second order decomposition approach with the analysis in (Caponnetto and De Vito, 2007; Blanchard and Krmer, 2010). We will study it in a future publication.
Corollary 2.2 and Corollary 2.2 suggest that the optimal choice of the regularization parameter should be independent of the number of partitions. In particular, for regularized least squares (1), the distributed scheme shares the optimal with the batch learning scheme. This observation is consistent with the results in (Zhang et al., 2015). We note that there are several parameter selection approaches in literature including cross-validation (Györfy et al., 2002) and the balancing principle (De Vito et al., 2010). It would be interesting to develop some parameter selection method for distributed learning.
3 Comparisons and Discussion
The least squares regularization scheme (1) is a classical algorithm for regression and has been extensively investigated in statistics and learning theory. There is a vast literature on this topic. Here for a general kernel beyond the Sobolev kernels, we compare our results with the best learning rates in the existing literature. Denote the set of positive eigenvalues of as arranged in a decreasing order, and a set of normalized (in ) eigenfunctions of corresponding to the eigenvalues .
Under the assumption that the orthogonal projection of in onto the closure of satisfies (9) for some , and that the eigenvalues satisfy with some , it was proved in (Caponnetto and De Vito, 2007) that
and denotes a set of probability measures satisfying some moment decay condition (which is satisfied when ). This learning rate is suboptimal due to the limitation taken for and the logarithmic factor in the case . In particular, to have with confidence , one needs to restrict to be large enough and has the constant depending on to be large enough. Using similar mathematical tools as that in ((Caponnetto and De Vito, 2007)) and a novel second order decomposition for the difference of operator inverses, we succeed in deriving learning rates in expectation in Corollary 2.2 by removing the logarithmic factor in the case .
Under the assumption that almost surely, the eigenvalues satisfying with some and , and for some constant , the pair satisfying
for every , it was proved in (Steinwart et al., 2009) that for some constant depending only on and , with confidence , for any ,
and is the approximation error defined by
When , and the choice gives a learning rate of order . But one needs to impose the condition (14) for the functions spaces and , and to take the projection onto , although (14) is more general than the uniform boundedness assumption of the eigenfunctions and holds when is the Sobolev space and is the uniform distribution (Steinwart et al., 2009; Mendelson and Neeman, 2010). Our learning rates in Corollary 2.2 do not require such a condition for the function spaces, nor do we take the projection. Learning rates for the least squares regularization scheme (1) in the -metric have been investigated in the literature (Smale and Zhou, 2007).
For the distributed learning algorithm (2) with subsets of equal size, under the assumption that for some constants and , the eigenfunctions satisfy
that and for some and , it was proved in (Zhang et al., 2015) that for and satisfying the restriction
with a constant depending only on , there holds . This interesting result was achieved by a matrix analysis approach for which the eigenfunction assumption (15) played an essential role.
The eigenfunction assumption (15) generalizes the classical case that the eigenfunctions are uniformly bounded: . An example of a Mercer kernel was presented in (Zhou, 2002, 2003) to show that smoothness of the Mercer kernel does not guarantee the uniform boundedness of the eigenfunctions. Furthermore, (Gittens and Mahoney, 2016)
provided a practical reason for avoiding unform boundedness assumption on the eigenfunctions (or eigenvectors) in terms of localization and sparseness. The condition (15), to the best of our knowledge, only holds when is the Sobolev space and is the Lebesgue measure or is a periodical kernel. It is a challenge to verify (15) for some widely used kernels including the Gaussian kernel. It would be interesting to find practical instance such that (15) holds. Our learning rates stated in Corollary 2.1 do not require such an eigenfunction assumption. Also, our restriction (10) for the number of local processors is more general when is close to . Notice that the learning rates stated in Corollary 2.1 are for the difference between the output function of the distributed learning algorithm (2) and that of the algorithm (1) using the whole data. In the special case of , we can see that , achieved by choosing , is smaller as becomes larger. This is natural because the error reflects more the sample error and should become smaller when we use more local processors. On the other hand, as one expects, increasing the number of local processors would increase the approximation error for the regression problem, which can also be seen from the bound with stated in Theorem 2.2. The result in Corollary 2.2 with compensates and gives the best learning rate by restricting as in (13).
Besides the divide-and-conquer technique, there are some other widely-used approaches towards the goal of reducing time complexity. For example, the localized learning (Meister and Steinwart, 2016), Nyström regularization (Bach, 2013) and on-line learning (Dekel et al., 2012), to name but a few. A key advantage of the divide-and-conquer technique is that it also reduces the space complexity without a significant lost (as proved in this paper) of prediction power. Although here we only consider the distributed regularized least squares, it would be important also to develop the theory for the distributed variance of other algorithms such as the spectral algorithms (Bauer et al., 2007), empirical feature-based learning (Guo and Zhou, 2012), error entropy minimization (Hu et al., 2015), randomized Kaczmarz (Lin and Zhou, 2015), and so on. It would be important to consider the strategies of parameter selection and data partition for distributed learning.
In this paper, we consider the regularized least squares with Mercer kernels. It would be interesting to minimize the assumptions on the kernel and the domain to maximize the scope of applications. For example, the domain that does not have a metric (Shen et al., 2014), the kernel that is only bounded and measurable (Steinwart and Scovel, 2012), and so on.
4 Second Order Decomposition of Operator Differences and Norms
To analyze the error , we need the following representation in terms of the difference of inverse operators denoted by
and for the data subset . The empirical integral operator is defined with replaced by the data subset .
Define two random variablesand with values in the Hilbert space by
We can derive a representation for in the following lemma.
Assume . For , we have
A well known formula (see e.g. (Smale and Zhou, 2007)) asserts that
So we know that
Also, with the whole data , we have
Then the first desired expression for follows.
Our error estimates are achieved by a novel second order decomposition of operator differences in our integral operator approach. We approximate the integral operator by the empirical integral operator on defined with the input data set as
where the reproducing property for is used. Since is a Mercer kernel, is a finite-rank positive operator and is invertible.
The operator difference in our study is with and . Our second order decomposition for the difference is stated as follows.
Let and be invertible operators on a Banach space. Then we have
In particular, we have
We can decompose the operator as
This is the first order decomposition.