I Introduction
The rapid expansion in data size and complexity has brought a series of scientific challenges in the era of big data, such as storage bottlenecks and algorithmic scalability issues [37, 34, 18]. Distributed learning is the most popular approach for handling big data. Among the many strategies of distributed learning, the divide-and-conquer approach has proven to be simple and effective, while also preserving data security and privacy by minimizing mutual communication.
This paper aims to study the theoretical performance of divide-and-conquer distributed learning for ERM within a learning theory framework. Given a sample $D = \{z_i\}_{i=1}^{N} = \{(x_i, y_i)\}_{i=1}^{N}$ drawn independently and identically (i.i.d.) from an unknown probability distribution $\rho$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, the ERM estimator can be defined as
$$f_D = \arg\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \ell(f, z_i), \qquad (1)$$
where $\ell$ is a loss function^{1} and $\mathcal{H}$ is a hypothesis space. (^{1}If $\ell$ is a regularized loss function, that is, the sum of a loss and a regularizer, then (1) corresponds to regularized ERM.) In this paper, we assume that $\mathcal{H}$ is a Hilbert space. In distributed learning, the data set $D$ is partitioned into $m$ disjoint subsets $D_1, \dots, D_m$, with $D = \bigcup_{j=1}^{m} D_j$. The $j$th local estimator $f_{D_j}$ is produced on each data subset $D_j$:
$$f_{D_j} = \arg\min_{f \in \mathcal{H}} \frac{1}{|D_j|} \sum_{z \in D_j} \ell(f, z). \qquad (2)$$
The final global estimator is then obtained by averaging: $\bar{f}_D = \frac{1}{m} \sum_{j=1}^{m} f_{D_j}$.
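As a concrete illustration of the divide-and-conquer scheme above, the following sketch instantiates it with regularized least squares, for which each local ERM problem (2) has a closed form. The data, parameters, and the ridge instantiation are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def local_ridge(X, y, lam):
    # Closed-form regularized ERM on one data subset:
    # w = argmin_w (1/n)||Xw - y||^2 + lam * ||w||^2
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def divide_and_conquer_erm(X, y, m, lam):
    # Partition the sample into m disjoint subsets, fit a local
    # estimator on each, and average them into the global estimator.
    locals_ = [local_ridge(Xj, yj, lam)
               for Xj, yj in zip(np.array_split(X, m), np.array_split(y, m))]
    return np.mean(locals_, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)                       # synthetic target
y = X @ w_true + 0.1 * rng.normal(size=1000)
w_bar = divide_and_conquer_erm(X, y, m=10, lam=1e-3)
```

Averaging the ten local estimators leaves only one round of communication (the local solutions themselves), which is what makes the scheme attractive for privacy and scalability.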
The theoretical foundations of distributed learning for (regularized) ERM have received increasing attention in machine learning, and have recently been explored within the framework of learning theory [34, 18, 17, 36, 35, 31, 20, 19, 6]. However, most existing risk analyses are based on the closed form of the least squares solution and the properties of the reproducing kernel Hilbert space (RKHS), and are therefore only applicable when distributed learning uses a least squares loss in an RKHS. Studies establishing risk bounds of distributed learning for general loss functions and hypothesis spaces remain limited.

In this paper, we study the risk performance of distributed ERM based on the divide-and-conquer approach for general loss functions and hypothesis spaces. Concretely, we combine proof techniques from stochastic convex optimization, to handle general loss functions, with covering-number arguments, to handle general hypothesis spaces. Note that stochastic convex optimization and covering numbers are usually two separate paths of theoretical analysis; the main technical difficulty of this paper is integrating these two different proof techniques for distributed learning.
The main contributions of the paper include:

Result I. If the number of processors satisfies a suitable restriction, we present a tight risk bound^{2} under the assumptions that the hypothesis space has a logarithmic covering number (see Assumption 1 for details) and that the loss function is smooth, Lipschitz continuous and strongly convex. (^{2}We use the big-O notation and its tilde variant to hide constant factors as well as polylogarithmic factors.)

Result II. Under another basic assumption, namely that the hypothesis space has a polynomial covering number (see Assumption 2 for details), another tight risk bound is established, provided the number of processors satisfies a corresponding restriction.

Result III. Without the restriction of strong convexity on the loss function in Result I, a more general risk bound is derived when the number of processors satisfies a suitable restriction and the optimal risk is small; in this regime, the resulting rate is faster than the standard one.
Related Work
Risk analysis for the original (regularized) ERM has been extensively explored within the framework of learning theory [29, 2, 30, 1, 8, 4, 27, 24, 26, 33]. Recently, divide-and-conquer based distributed learning with ridge regression [17, 34, 35, 31], gradient descent algorithms [23, 19], online learning [32], local average regression [6], spectral algorithms [11, 12, 7], and the minimum error entropy principle [15] has been proposed, and its learning performance has been observed in many practical applications. For point estimation, [17] showed that the distributed moment estimation is consistent if an unbiased estimate is obtained for each of the subproblems. For distributed regularized least squares in an RKHS, [31] showed that distributed ERM leads to an estimator that is consistent with the unknown regression function. Under local strong convexity, smoothness and a reasonable set of other conditions, an improved bound was established in [36]. Optimal learning rates in expectation for divide-and-conquer kernel ridge regression were established in the seminal work of [35], under certain eigenfunction assumptions. Removing the eigenfunction assumptions, an improved bound was derived in [18] using a novel integral operator method. Using proof techniques similar to those of [18] or [35], optimal learning rates were established for distributed spectral algorithms [11], kernel-based distributed gradient descent algorithms [19], kernel-based distributed semi-supervised learning [7], and distributed local average regression [6]. Unfortunately, the optimal learning rates for these distributed learning methods depend on special properties of the square loss and the RKHS (such as their closed form and the integral operator of the kernel function), which do not apply when analyzing the performance under other loss functions and hypothesis spaces. To fill this gap, in this paper we derive risk bounds based on the general properties of loss functions and hypothesis spaces, making them more broadly applicable.

The rest of the paper is organized as follows. In Section II, we present our main results. In Section III, we compare against related work. Section IV concludes. All the proofs are given in Section V.
II Main Results
In this section, we provide and discuss our main results. To this end, we first introduce several notions.
Let $\mathcal{H}_\epsilon$ be an $\epsilon$-covering of a hypothesis space $\mathcal{H}$, i.e., for every $f \in \mathcal{H}$, one can find an $f' \in \mathcal{H}_\epsilon$ such that $\|f - f'\| \le \epsilon$. Let $\mathcal{N}(\mathcal{H}, \epsilon)$ be the covering number of $\mathcal{H}$, that is, the smallest cardinality of an $\epsilon$-covering of $\mathcal{H}$. Let $\mathcal{E}(f) = \mathbb{E}_{z \sim \rho}[\ell(f, z)]$ be the risk of $f$. We denote the optimal function and optimal risk of $\mathcal{H}$, respectively, as $f^* = \arg\min_{f \in \mathcal{H}} \mathcal{E}(f)$ and $\mathcal{E}(f^*)$.
II-A Assumptions
In this subsection, we introduce some basic assumptions of the hypothesis space and loss function.
Assumption 1 (logarithmic covering number).
There exists some constant $c_1 > 0$ such that
$$\log \mathcal{N}(\mathcal{H}, \epsilon) \le c_1 \log\frac{1}{\epsilon}. \qquad (3)$$
Many popular function classes satisfy the above assumption when the hypothesis space is bounded.
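To make the notion concrete, the following sketch (an illustration, not the paper's construction) greedily builds an $\epsilon$-net for a bounded one-dimensional linear class evaluated on a grid, and shows that the covering number grows like $1/\epsilon$, so its logarithm grows like $\log(1/\epsilon)$, in the spirit of Assumption 1.

```python
import numpy as np

def covering_number(points, eps):
    # Greedy eps-net in the sup-norm over a discretized function class:
    # repeatedly pick an uncovered function and discard its eps-ball.
    remaining = list(points)
    centers = 0
    while remaining:
        c = remaining[0]
        remaining = [p for p in remaining if np.max(np.abs(p - c)) > eps]
        centers += 1
    return centers

# Bounded linear class {x -> w*x : |w| <= 1} evaluated on a grid of [0, 1].
xs = np.linspace(0, 1, 50)
ws = np.linspace(-1, 1, 2001)
funcs = [w * xs for w in ws]

for eps in [0.2, 0.1, 0.05]:
    print(eps, covering_number(funcs, eps))  # N scales roughly like 1/eps
```

Halving $\epsilon$ roughly doubles the net size here, so $\log \mathcal{N}$ grows only logarithmically in $1/\epsilon$, as (parametric) classes covered by Assumption 1 require.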
Assumption 2 (polynomial covering number).
There exist some constants $c_2 > 0$ and $p > 0$ such that
$$\log \mathcal{N}(\mathcal{H}, \epsilon) \le c_2\, \epsilon^{-p}. \qquad (4)$$
If $\mathcal{H}$ is bounded, this type of covering number is satisfied by many Sobolev/Besov classes [10]. For instance, if the kernel eigenvalues decay at a polynomial rate, then the RKHS satisfies Assumption 2 [5]. For the RKHS of a Gaussian kernel, the kernel eigenvalues decay at an exponential rate.
Remark 1.
To derive risk bounds for divide-and-conquer ERM without specific assumptions on the type of hypothesis space, we adopt the covering number as a tool to measure the complexity of the hypothesis space. To use covering numbers in learning theory, an assumption that the hypothesis space is bounded is usually needed (see [5, 10] for details). In fact, ERM usually includes a regularizer, that is,
$$\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \ell(f, z_i) + \lambda \|f\|^2,$$
which is equivalent to the following constrained optimization for a constant $R$ related to $\lambda$:
$$\min_{f \in \mathcal{H},\, \|f\| \le R} \frac{1}{N} \sum_{i=1}^{N} \ell(f, z_i).$$
Thus, the boundedness assumption on the hypothesis space is usually implicit in (regularized) ERM.
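The equivalence between regularized and norm-constrained ERM can be illustrated numerically: solve a ridge problem, take $R$ to be the norm of its solution, and check that projected gradient descent on the constrained problem recovers the same estimator. The data and solver parameters below are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=300)
lam = 0.1

# Regularized ERM (ridge) in closed form.
w_ridge = np.linalg.solve(X.T @ X / 300 + lam * np.eye(4), X.T @ y / 300)
R = np.linalg.norm(w_ridge)        # radius induced by the regularizer

# Constrained ERM over the ball {||w|| <= R} via projected gradient descent.
w = np.zeros(4)
step = 0.05
for _ in range(5000):
    w -= step * 2 * X.T @ (X @ w - y) / 300   # gradient of empirical risk
    nrm = np.linalg.norm(w)
    if nrm > R:                               # project back onto the ball
        w *= R / nrm
```

Since the unconstrained least squares solution lies outside the ball, the constrained minimizer sits on the boundary and satisfies the same stationarity condition as the ridge solution, so the two coincide.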
Assumption 3.
The loss function $\ell(f, z)$ is nonnegative, smooth, Lipschitz continuous, and convex w.r.t. $f$ for any $z$.
Assumption 3 is satisfied by several popular losses when $f(x)$ and $y$ are bounded, such as the square loss $(f(x) - y)^2$, the logistic loss $\log(1 + \exp(-y f(x)))$, the squared hinge loss $\max(0, 1 - y f(x))^2$, and so on.
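The boundedness caveat can be checked numerically. The sketch below (illustrative, not from the paper) estimates the Lipschitz and smoothness constants of the logistic loss on a bounded margin range by finite differences.

```python
import numpy as np

# Check that t -> log(1 + exp(-t)) is convex, Lipschitz, and smooth
# on a bounded range of margins t = y * f(x).
t = np.linspace(-5, 5, 2001)
loss = np.log1p(np.exp(-t))
grad = np.gradient(loss, t)      # bounded first derivative => Lipschitz
curv = np.gradient(grad, t)      # bounded second derivative => smooth

lipschitz = np.max(np.abs(grad))        # close to 1
smoothness = np.max(curv)               # close to 1/4 (peak of sigmoid')
convex = np.min(curv[1:-1]) >= -1e-6    # nonnegative curvature
```

On an unbounded domain the same constants still hold for the logistic loss, but for losses like the square loss the Lipschitz constant genuinely requires bounded $f(x)$ and $y$.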
Assumption 4.
The loss function $\ell(f, z)$ is a $\sigma$-strongly convex function w.r.t. $f$ for any $z$.
Note that the loss usually includes a regularizer, e.g., $\lambda \|f\|^2$. In this case, the regularized loss is strongly convex while only requiring the data-fitting term to be convex.
Assumption 5 (diversity).
There exists some positive constant such that the diversity between all partition-based estimates is bounded from below, where the diversity is measured over the local estimators $f_{D_j}$, $j = 1, \dots, m$, defined in (2).
If the partition-based estimates $f_{D_j}$, $j = 1, \dots, m$, are not all nearly identical, Assumption 5 is satisfied.
II-B Risk Bounds
In the following, we first derive two tight risk bounds for smooth, Lipschitz continuous and strongly convex loss functions. Then, we consider the more general case by removing the restriction of strong convexity.
Theorem 1.
The above theorem implies that when the loss is smooth, Lipschitz continuous and strongly convex, the distributed ERM achieves a tight risk bound. The rate in Theorem 1 is minimax-optimal in some cases:
From Theorem 1, we know that, to achieve the tight risk bound, the number of processors should satisfy a certain restriction. The resulting allowable number of processors is sufficient for using distributed learning in practical applications.
Theorem 2.
From Theorem 2(b) of [22], we know that, under Assumption 2, there is a universal constant such that the minimax risk is bounded from below at the corresponding rate. Thus, our risk bound is minimax-optimal.
From Theorem 2, we know that, to achieve the tight risk bound, the number of processors should satisfy a certain restriction. Note that the allowable number of processors is smaller than that of Theorem 1, because the restriction of a polynomial covering number is looser than that of a logarithmic one. When the covering-number exponent is small (as is satisfied by the Gaussian kernel), the number of processors can be correspondingly larger.
TABLE I: Summary of our and previous results for (regularized) distributed ERM.

Paper | Loss Function | Hypothesis Space | Other Condition | Risk Bound | Optimal | Partitions
[35] | Square loss | Assumption 1 | Eigenfunctions (1) | | Yes |
[35] | Square loss | Assumption 2 | Eigenfunctions (1) | | Yes |
[18] | Square loss | RKHS | Regularity condition (6) | | Yes |
[7] | Square loss | RKHS | Regularity condition (6) | | Yes |
Theorem 1 | Assumptions 3, 4 | Assumption 1 | Assumption 5 | | Yes |
Theorem 2 | Assumptions 3, 4 | Assumption 2 | Assumption 5 | | Yes |
Theorem 3 | Assumption 3 | Assumption 1 | – | | Yes if the optimal risk is small |
II-C Risk Bounds without Strong Convexity
In the following, we provide a more general risk bound without the restriction of strong convexity.
Theorem 3.
From the above theorem, one can see that:

The rate of this theorem is worse than that of Theorem 1, due to the relaxed restriction on the loss function.

The above theorem implies that, when the optimal risk is small, the risk bound improves; in this case, the rate is faster than the standard one.

In the centralized case, that is, when the number of processors is one, the order of the risk can reach a nearly optimal rate. To the best of our knowledge, such a fast rate of ERM for the centralized case has never been given before.
Remark 2.
In Theorem 3, the risk bound holds for a range of values of a tuning parameter, which balances the tightness of the bound against the number of processors: the smaller the parameter, the tighter the risk bound and the fewer the processors.
III Comparison with Related Work
In this section, we compare our results with related work. Our and previous results for (regularized) distributed ERM are summarized in Table I.
The theoretical foundations of distributed learning for (regularized) ERM have recently been explored within a learning theory framework [17, 36, 35, 31, 18, 20, 19, 6]. Among these works, [35, 18, 7] are the three most relevant papers; thus, in the following, we compare our results in detail with those in [35, 18, 7]. The seminal work of [35] considered the learning performance of divide-and-conquer kernel ridge regression. Using a matrix decomposition approach, [35] derived two optimal learning rates, for finite-rank kernels and kernels with polynomially decaying eigenvalues respectively, under the assumption that, for some constants, the normalized eigenfunctions satisfy
(5) 
The condition in (5) is possibly too strong, and it was thus removed in [18], which used a novel integral operator approach under the regularity condition:
(6) 
where the operator in (6) is the integral operator induced by the kernel function and the target is the regression function. However, the analysis in [18] only works in a limited regularity regime. The work [7] generalized the results of [18] and derived the optimal learning rate over the full regularity range, under a restriction on the number of partitions for bounded kernel functions. Thus, we find that, in a special case, the allowable number of local processors does not increase with the sample size and is bounded by a constant, which may limit the applicability of distributed learning.
Compared with previous works, there are two main novelties of our results.

The proof techniques of this paper are based on the general properties of loss functions and hypothesis spaces, whereas the proofs in [35, 18, 7] depend on the special properties of the square loss and the RKHS. Thus, our results are suitable for general loss functions and hypothesis spaces, generalizing the results of [35, 18, 7];

To derive the optimal rates, [18, 7] show that the number of local processors should be bounded by a quantity that, in some regimes, reduces to a constant. In this paper, the allowable number of processors is larger; thus, our result relaxes the restriction on the number of processors in [18, 7].
IV Conclusion
In this paper, we studied the risk performance of distributed ERM and derived tight risk bounds for general loss functions and hypothesis spaces. We first showed that, when the number of processors satisfies certain restrictions, tight risk bounds can be obtained under the assumptions of a logarithmic (or polynomial) covering number of the hypothesis space and a smooth, Lipschitz continuous and strongly convex loss function. We further presented a more general risk bound by removing the restriction of strong convexity.
In future work, we will further improve our results in three directions:

In our results, the loss function should be (strongly) convex. In future work, we will consider using the Polyak-Łojasiewicz condition [16] instead of (strong) convexity.

In [7], it is shown that the restriction on the number of processors can be improved using unlabeled examples. In future work, we will consider using unlabeled examples to improve our results.

In this paper, we only considered simple divide-and-conquer based distributed learning; in future work, we will consider extending our results to other distributed learning machines.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Nos. 61703396 and 61673293), the CCF-Tencent Open Fund, the Youth Innovation Promotion Association CAS, the Excellent Talent Introduction Program of the Institute of Information Engineering, CAS (No. Y7Z0111107), the Beijing Municipal Science and Technology Project (No. Z191100007119002), and the Key Research Program of Frontier Sciences, CAS (No. ZDBS-LY-7024).
V Proof
In this section, we first introduce the key idea of the proof, and then give the proofs of Theorems 1, 2 and 3.
V-A The Key Idea
Note that if the loss is a $\sigma$-strongly convex function, then the empirical risk is also strongly convex. According to the properties of a strongly convex function, we have
(7) 
and, similarly,
(8) 
By (8), one can see that
Therefore, we have
(9) 
In the following, we estimate the excess risk term:
(10) 
Note that the loss is convex, thus the empirical risk is convex. By this convexity and the optimality condition of the local estimator [3], we have
Thus, we get
(11) 
Substituting the above bound into (10), we have
(12) 
In the following, we utilize the covering number to establish an upper bound for the first term in the last line of (12); the second term in the last line of (12) is upper bounded by a concentration inequality.
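To make the first step concrete, the following numerical sketch checks the $\sigma$-strong-convexity lower bound used above on a ridge-regularized least squares objective. The objective, data, and the value $\sigma = 2\lambda$ are assumptions of this illustration, not quantities from the paper.

```python
import numpy as np

# Verify F(v) >= F(u) + <grad F(u), v - u> + (sigma/2)*||v - u||^2
# for F(w) = (1/n)||Xw - y||^2 + lam*||w||^2, whose Hessian dominates
# 2*lam*I, i.e., F is (2*lam)-strongly convex under this normalization.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)
lam = 0.5
sigma = 2 * lam      # strong-convexity constant from the regularizer

def F(w):
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def gradF(w):
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w

ok = True
for _ in range(100):  # random pairs of points
    u, v = rng.normal(size=3), rng.normal(size=3)
    lower = F(u) + gradF(u) @ (v - u) + 0.5 * sigma * np.sum((v - u) ** 2)
    ok = ok and (F(v) >= lower - 1e-9)
```

For quadratic objectives the inequality holds with equality up to the curvature gap, which is why the regularizer alone supplies the needed strong convexity.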
V-B Proof of Theorem 1
Lemma 1 (Lemma 2 of [24]).
Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable on $(Z, \rho)$ with values in $\mathcal{H}$. Assume $\|\xi\| \le M < \infty$ almost surely. Denote $\sigma^2(\xi) = \mathbb{E}\|\xi\|^2$. Let $\{z_i\}_{i=1}^{n}$ be independent random drawers of $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$,
$$\Big\| \frac{1}{n} \sum_{i=1}^{n} \xi(z_i) - \mathbb{E}\xi \Big\| \le \frac{2M \log(2/\delta)}{n} + \sqrt{\frac{2\sigma^2(\xi) \log(2/\delta)}{n}}.$$
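As a sanity check, the following simulation compares the empirical deviation of a Hilbert-space (here, finite-dimensional vector) sample mean with a Bennett-type bound of the form quoted in Lemma 1. The specific constants follow the common Smale-Zhou form and are an assumption of this sketch.

```python
import numpy as np

# Simulate xi uniform on [-1, 1]^d (mean zero) and compare the deviation
# ||(1/n) sum xi_i - E xi|| against the high-probability bound
# 2*M*log(2/delta)/n + sqrt(2*sigma2*log(2/delta)/n).
rng = np.random.default_rng(2)
n, d, trials, delta = 200, 5, 2000, 0.05
M = np.sqrt(d)        # ||xi|| <= sqrt(d) almost surely
sigma2 = d / 3.0      # E||xi||^2 for i.i.d. uniform[-1,1] coordinates
bound = (2 * M * np.log(2 / delta) / n
         + np.sqrt(2 * sigma2 * np.log(2 / delta) / n))

deviations = []
for _ in range(trials):
    xi = rng.uniform(-1, 1, size=(n, d))
    deviations.append(np.linalg.norm(xi.mean(axis=0)))
failure_rate = np.mean(np.array(deviations) > bound)  # should be <= delta
```

The empirical failure rate is far below $\delta$, reflecting that the bound is a worst-case guarantee rather than a tight estimate for this benign distribution.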
Lemma 2.
If the loss function is smooth and convex, then for any $0 < \delta < 1$, with probability at least $1 - \delta$, we have
(13) 
where .
Proof.
Note that the loss is smooth and convex, so by (2.1.7) of [21], we have
Taking expectation over both sides, we have
(14) 
where the last inequality follows from the optimality condition of the minimizer, i.e.,
Note that the loss is smooth; thus, we have
(15) 
We obtain Lemma 2 by taking the union bound over all elements of the covering. ∎
Lemma 3.
Under Assumption 3, with probability at least $1 - \delta$, we have
(16) 