The rapid expansion in data size and complexity has brought a series of scientific challenges in the era of big data, such as storage bottlenecks and algorithmic scalability issues [37, 34, 18]. Distributed learning is the most popular approach for handling big data. Among the many strategies for distributed learning, the divide-and-conquer approach has proven to be both simple and effective, while also preserving data security and privacy by minimizing mutual information communications.
This paper studies the theoretical performance of divide-and-conquer based distributed learning for empirical risk minimization (ERM) within a learning theory framework. Given samples
drawn independently and identically distributed (i.i.d.) from an unknown probability distribution on , the ERM can be defined as
where is a loss function (if the loss includes a regularizer, then (1) corresponds to regularized ERM) and is a hypothesis space. In this paper, we assume that the hypothesis space is a Hilbert space. In distributed learning, the data set is partitioned into disjoint subsets, and the th local estimator is produced on each data subset:
The final global estimator is then obtained by
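To make the scheme concrete, the following is a minimal numerical sketch of divide-and-conquer ERM, using ridge regression (squared loss plus a squared-norm regularizer) as an illustrative local ERM. The function names and the parameter `lam` are our own choices, not from the paper.

```python
import numpy as np

def local_erm(X, y, lam=0.1):
    """Local (regularized) ERM on one subset: ridge regression in closed form,
    minimizing mean squared loss + lam * ||w||^2 over the subset."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def divide_and_conquer_erm(X, y, m, lam=0.1):
    """Partition the N samples into m disjoint subsets, solve the local ERM
    on each subset, and average the m local estimators."""
    parts = zip(np.array_split(X, m), np.array_split(y, m))
    return np.mean([local_erm(Xj, yj, lam) for Xj, yj in parts], axis=0)

rng = np.random.default_rng(0)
N, d, m = 2000, 5, 8
X = rng.normal(size=(N, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=N)

w_global = divide_and_conquer_erm(X, y, m)   # distributed estimator
w_central = local_erm(X, y)                  # m = 1: centralized estimator
print(np.linalg.norm(w_global - w_central))  # small: the two nearly agree
```

On well-conditioned data such as this, the averaged estimator tracks the centralized one closely, which is the behavior the risk bounds below quantify.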
The theoretical foundations of distributed learning for (regularized) ERM have received increasing attention in machine learning, and have recently been explored within the framework of learning theory [34, 18, 17, 36, 35, 31, 20, 19, 6]. However, most existing risk analyses rely on the closed form of the least squares solution and the properties of the reproducing kernel Hilbert space (RKHS), and are thus only applicable when distributed learning uses a least squares loss in an RKHS. Studies establishing risk bounds of distributed learning for general loss functions and hypothesis spaces remain limited.
In this paper, we study the risk performance of distributed ERM based on the divide-and-conquer approach for general loss functions and hypothesis spaces. Concretely, we combine proof techniques from stochastic convex optimization (for general loss functions) with covering-number arguments (for general hypothesis spaces). Note that stochastic convex optimization and covering numbers are usually two separate paths of theoretical analysis; the main technical difficulty of this paper is how to integrate these two different proof techniques for distributed learning.
The main contributions of the paper include:
Result I. If the number of processors satisfies a suitable restriction, we present a tight risk bound (we use soft-O notation to hide constant factors as well as polylogarithmic factors), assuming a hypothesis space with logarithmic covering number (see Assumption 1 for details) and a smooth, Lipschitz continuous and strongly convex loss function.
Result II. Under the alternative basic assumption that the hypothesis space has a polynomial covering number (see Assumption 2 for details), and under a corresponding restriction on the number of processors, another tight risk bound is established.
Result III. Without the strong-convexity restriction on the loss function in Result I, a more general risk bound is derived when the number of processors satisfies a suitable restriction and the optimal risk is small, in which case the rate is faster.
Recently, divide-and-conquer based distributed learning has been studied for ridge regression [17, 34, 35, 31], gradient descent algorithms [23, 19], online learning, local average regression, spectral algorithms [11, 12, 7], and the minimum error entropy principle, and its learning performance has been observed in many practical applications. For point estimation, [31] showed that distributed ERM leads to an estimator that is consistent with the unknown regression function. Under local strong convexity, smoothness, and a reasonable set of other conditions, an improved bound was also established.
Optimal learning rates for divide-and-conquer kernel ridge regression in expectation were established in seminal work, under certain eigenfunction assumptions. Removing the eigenfunction assumptions, an improved bound was derived using a novel integral operator method. With similar proof techniques, optimal learning rates were established for distributed spectral algorithms, kernel-based distributed gradient descent algorithms, kernel-based distributed semi-supervised learning, and distributed local average regression. Unfortunately, the optimal learning rates for these distributed learning methods depend on special properties of the square loss and the RKHS (such as the closed-form solution and the integral operator of the kernel function), which do not apply when analyzing performance under other loss functions and hypothesis spaces. To fill this gap, in this paper we derive risk bounds based on general properties of loss functions and hypothesis spaces, making them more broadly applicable.
The rest of the paper is organized as follows. In Section 2, we discuss our main results. In Section 3, we compare against related work. Section 4 is the conclusion. All the proofs are given in the last part.
II Main Results
In this section, we provide and discuss our main results. To this end, we first introduce several notions.
Let be an -covering of a hypothesis space , i.e., for every , one can find an such that
Let be the -covering number of , that is, the smallest cardinality of an -covering of . Let
be the risk of . We denote the optimal function and risk of , respectively, as
II-A Basic Assumptions
In this subsection, we introduce some basic assumptions on the hypothesis space and the loss function.
Assumption 1 (logarithmic covering number).
There exists some such that
Many popular function classes satisfy the above assumption when the hypothesis space is bounded.
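As a concrete sanity check (our own illustration, not from the paper): a bounded finite-dimensional class such as the cube $[-1,1]^d$ under the sup norm is covered by a regular grid with roughly $(1/\varepsilon)^d$ centers, so the log-covering number grows only logarithmically in $1/\varepsilon$, as in Assumption 1.

```python
import math

def cube_covering_number(d, eps):
    """Grid covering of [-1, 1]^d in the sup norm: centers spaced 2*eps apart
    along each axis cover the interval [-1, 1] with radius eps."""
    per_axis = math.ceil(1.0 / eps)
    return per_axis ** d

# log N(eps) grows only logarithmically in 1/eps (times the dimension d):
for eps in (0.5, 0.1, 0.01):
    n = cube_covering_number(3, eps)
    print(eps, n, math.log(n) <= 3 * math.log(2.0 / eps))
```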
Assumption 2 (polynomial covering number).
There exists some such that
If the hypothesis space is bounded, this type of covering number is satisfied by many Sobolev/Besov classes. For instance, if the kernel eigenvalues decay polynomially, then the RKHS satisfies Assumption 2. For the RKHS of a Gaussian kernel, the kernel eigenvalues decay nearly exponentially.
To derive risk bounds for divide-and-conquer ERM without specific assumptions on the type of hypothesis space, we adopt the covering number as a tool to measure the complexity of the hypothesis space. To use covering numbers in learning theory, an assumption that the hypothesis space is bounded is usually needed (see [5, 10] for details). In fact, ERM usually includes a regularizer, that is,
which is equivalent to the following constrained optimization for a constant related to the regularization parameter:
Thus, the bounded-hypothesis assumption is usually implicit in (regularized) ERM.
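This implicit norm bound can be checked numerically. The sketch below (our own construction, using regularized least squares) relies on the elementary argument that the regularized objective at the minimizer is at most its value at zero, so $\lambda\|\hat w\|^2 \le \overline{y^2}$ and the effective hypothesis space is a ball of radius $\sqrt{\overline{y^2}/\lambda}$:

```python
import numpy as np

# For min_w  mean((Xw - y)^2) + lam * ||w||^2, the minimizer w_hat satisfies
# lam * ||w_hat||^2 <= objective(w_hat) <= objective(0) = mean(y^2),
# so the effective hypothesis space is the ball of radius sqrt(mean(y^2)/lam).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = rng.normal(size=500)
lam = 0.5
n = len(y)
w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(4), X.T @ y / n)
radius = np.sqrt(np.mean(y ** 2) / lam)
print(np.linalg.norm(w_hat), "<=", radius)
```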
Assumption 3 (smoothness, Lipschitz continuity, and convexity).
The loss function is non-negative, -smooth, -Lipschitz continuous, and convex w.r.t. for any .
Assumption 3 is satisfied by several popular losses when the inputs and outputs are bounded, such as the square loss, the logistic loss, the squared hinge loss, the squared -loss, and so on.
Assumption 4 (strong convexity).
The loss function is an -strongly convex function w.r.t. for any .
Note that the loss usually includes a regularizer, e.g., a squared-norm penalty. In this case, the regularized loss is strongly convex while the original loss is only required to be convex.
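To see why (a sketch in our own notation, for a Hilbert-space hypothesis): adding the regularizer $\lambda\|f\|^2$ to a convex loss $\ell$ yields a $2\lambda$-strongly convex objective.

```latex
% Hilbert-space identity, for t in [0,1]:
\|tf+(1-t)g\|^2 = t\|f\|^2 + (1-t)\|g\|^2 - t(1-t)\|f-g\|^2 .
% Adding \lambda\|\cdot\|^2 to a convex loss \ell therefore gives
\ell\bigl(tf+(1-t)g\bigr) + \lambda\|tf+(1-t)g\|^2
  \le t\bigl(\ell(f)+\lambda\|f\|^2\bigr)
    + (1-t)\bigl(\ell(g)+\lambda\|g\|^2\bigr)
    - \lambda\, t(1-t)\,\|f-g\|^2 ,
% i.e. the regularized objective is 2\lambda-strongly convex.
```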
Assumption 5 (-diversity).
There exists some such that
where is the diversity between all partition-based estimates, and is the th local estimator, .
If the partition-based estimates are not all almost the same, Assumption 5 is satisfied.
II-B Risk Bounds
In the following, we first derive two tight risk bounds under a smooth, Lipschitz continuous and strongly convex loss function. Then, we consider the more general case by removing the strong-convexity restriction.
The above theorem implies that when the loss is smooth, Lipschitz continuous and strongly convex, distributed ERM achieves a tight risk bound. The rate in Theorem 1 is minimax-optimal in some cases:
From Theorem 1, we know that, to achieve the tight risk bound, the number of processors should satisfy the restriction
Thus, the number of processors can reach , which is sufficient for using distributed learning in practical applications.
Thus, our risk bound of order is minimax-optimal.
From Theorem 2, we know that, to achieve the tight risk bound, the number of processors should satisfy the restriction
Note that the admissible number of processors is smaller than that of Theorem 1, because the polynomial covering-number restriction is looser than the logarithmic one. When the eigenvalues decay fast enough (as satisfied by the Gaussian kernel), the number of processors can reach .
| Paper     | Loss Function    | Hypothesis Space | Other Condition          | Risk Bound | Optimal | Partitions |
|           | Square loss      | Assumption 1     | Eigenfunctions (1)       |            | Yes     |            |
|           | Square loss      | Assumption 2     | Eigenfunctions (1)       |            | Yes     |            |
|           | Square loss      | RKHS             | Regularity condition (6) |            | Yes     |            |
|           | Square loss      | RKHS             | Regularity condition (6) |            | Yes     |            |
| Theorem 1 | Assumptions 3, 4 | Assumption 1     | Assumption 5             |            | Yes     |            |
| Theorem 2 | Assumptions 3, 4 | Assumption 2     | Assumption 5             |            | Yes     |            |
| Theorem 3 | Assumption 3     | Assumption 1     | –                        |            | Yes if  |            |
II-C Risk Bounds without Strong Convexity
In the following, we provide a more general risk bound without the strong-convexity restriction.
From the above theorem, one can see that:
The rate of this theorem is worse than that of Theorem 1, due to the relaxed restriction on the loss function.
The above theorem implies that, when the optimal risk is small, the risk bound improves, and in this case the rate is faster.
In the centralized case, i.e., with a single processor, the order of the risk can reach
which is nearly optimal. To the best of our knowledge, such a fast rate of ERM for the centralized case has not been given before.
In Theorem 3, the risk bound holds for all admissible values of the parameter, which balances the tightness of the bound against the number of processors: the smaller the parameter, the tighter the risk bound and the fewer the processors.
III Comparison with Related Work
In this section, we compare our results with related work. Our and previous results for (regularized) distributed ERM are summarized in Table I.
The theoretical foundations of distributed learning for (regularized) ERM have recently been explored within a learning theory framework [17, 36, 35, 31, 18, 20, 19, 6]. Among these works, [35, 18, 7] are the three most relevant papers, so in the following we compare our results in detail with theirs. Seminal work considered the learning performance of divide-and-conquer kernel ridge regression and, using a matrix decomposition approach, derived two optimal learning rates for finite-rank kernels and polynomial eigen-decay kernels, respectively, under the assumption that, for some constants, the normalized eigenfunctions satisfy
where is the integral operator induced by the kernel function and is the regression function. However, that analysis only works in a restricted regime. Subsequent work generalized these results and derived the optimal learning rate in the general setting, under a restriction on the number of processors for bounded kernel functions. Thus, in the special case considered there, the number of local processors does not increase with the sample size, so the largest number of local processors is bounded, which may limit the applicability of distributed learning.
Compared with previous works, there are two main novelties of our results.
The proof techniques of this paper are based on general properties of loss functions and hypothesis spaces, whereas the proofs in [35, 18, 7] depend on special properties of the square loss and the RKHS. Thus, our results apply to general loss functions and hypothesis spaces, generalizing the results of [35, 18, 7];
To derive the optimal rates, [18, 7] require the number of local processors to be less than a quantity that, in the worst case, is bounded by a constant. In this paper, however, the number of processors our result allows can be much larger. Thus, our result relaxes the restriction on the number of processors in [18, 7].
IV Conclusion
In this paper, we studied the risk performance of distributed ERM and derived tight risk bounds for general loss functions and hypothesis spaces. We first showed that, when the number of processors satisfies certain restrictions, tight risk bounds can be obtained assuming a hypothesis space with logarithmic (or polynomial) covering number and a smooth, Lipschitz continuous and strongly convex loss function. We further presented a more general risk bound by removing the strong-convexity restriction.
In future work, we will further improve our results in three directions:
In our results, the loss function must be (strongly) convex. In future work, we will consider using the Polyak-Lojasiewicz condition instead of (strong) convexity.
Prior work has shown that the number of processors can be increased using unlabeled examples. In future work, we will consider using unlabeled examples to improve our results.
In this paper, we only consider simple divide-and-conquer based distributed learning; in future work, we will consider extending our results to other distributed learning machines.
This work was supported in part by the National Natural Science Foundation of China (No.61703396, No.61673293), the CCF-Tencent Open Fund, the Youth Innovation Promotion Association CAS, the Excellent Talent Introduction of Institute of Information Engineering of CAS (No. Y7Z0111107), the Beijing Municipal Science and Technology Project (No. Z191100007119002), and the Key Research Program of Frontier Sciences, CAS (No. ZDBS-LY-7024).
V-A The Key Idea
Note that if the loss is an -strongly convex function, then the risk is also -strongly convex. According to the properties of a strongly convex function, we have
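For reference, the strong-convexity property invoked here can be written, in our own notation with risk $R$, minimizer $f^{*}$ and modulus $\mu$, as:

```latex
% \mu-strong convexity of R:
R(g) \ge R(f) + \langle \nabla R(f),\, g - f \rangle + \tfrac{\mu}{2}\,\|g-f\|^2 .
% At the minimizer f^{*}, optimality gives
% \langle \nabla R(f^{*}),\, g - f^{*} \rangle \ge 0, hence
R(g) - R(f^{*}) \ge \tfrac{\mu}{2}\,\|g - f^{*}\|^2 .
```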
By (8), one can see that
Therefore, we have
Next, we estimate the following quantity:
Note that is convex, thus is convex. By the convexity of and the optimality condition of , we have
Thus, we get
Substituting the above equation into (10), we have
Next, we utilize the covering number to establish an upper bound for the first term in the last line of (12); the second term in the last line of (12) is upper bounded via a concentration inequality.
V-B Proof of Theorem 1
Lemma 1 (Lemma 2 of ).
Let be a Hilbert space and be a random variable on with values in . Assume that
almost surely. Denote
Let be independent random draws of . For any , with confidence ,
Lemma 2.
If the loss function is -smooth and convex, then for any , with probability at least , we have
Note that the loss is -smooth and convex, so by (2.1.7) of , we have
Taking expectation over both sides, we have
where the last inequality follows from the optimality condition of , i.e.,
Note that is -smooth, thus we have
We obtain Lemma 2 by taking the union bound over all . ∎
Under Assumption 3, with probability at least , we have