# Theoretical Analysis of Divide-and-Conquer ERM: Beyond Square Loss and RKHS

Theoretical analysis of divide-and-conquer based distributed learning with the least square loss in the reproducing kernel Hilbert space (RKHS) has recently been explored within the framework of learning theory. However, studies on learning theory for general loss functions and hypothesis spaces remain limited. To fill this gap, we study the risk performance of distributed empirical risk minimization (ERM) for general loss functions and hypothesis spaces. The main contributions are two-fold. First, we derive two tight risk bounds under certain basic assumptions on the hypothesis space, as well as the smoothness, Lipschitz continuity and strong convexity of the loss function. Second, we develop a more general risk bound for distributed ERM without the restriction of strong convexity.


## I Introduction

The rapid expansion in data size and complexity has brought a series of scientific challenges in the era of big data, such as storage bottlenecks and algorithmic scalability issues [37, 34, 18]. Distributed learning is the most popular approach for handling big data. Among the many strategies of distributed learning, the divide-and-conquer approach has been shown to be simple and effective, while also being able to preserve data security and privacy by minimizing mutual information communication.

This paper aims to study the theoretical performance of divide-and-conquer based distributed ERM within a learning theory framework. Given a sample

 S = \{z_i = (x_i, y_i)\}_{i=1}^{N} \in (\mathcal{Z} = \mathcal{X} \times \mathcal{Y})^{N},

drawn independently and identically (i.i.d.) from an unknown probability distribution ρ on \mathcal{Z}, the ERM estimator is defined as

 \hat{f} = \operatorname*{argmin}_{f \in \mathcal{H}} \hat{R}(f) := \frac{1}{N} \sum_{j=1}^{N} \ell(f, z_j), \qquad (1)

where ℓ is a loss function (if ℓ is a regularized loss, that is, a loss plus a regularizer, then (1) corresponds to regularized ERM) and H is a hypothesis space. In this paper, we assume that H is a Hilbert space. In distributed learning, the data set S is partitioned into m disjoint subsets S_1, …, S_m, with S = ∪_{i=1}^m S_i. The i-th local estimator f̂_i is produced on each data subset S_i:

 \hat{f}_i = \operatorname*{argmin}_{f \in \mathcal{H}} \hat{R}_i(f) := \frac{1}{|S_i|} \sum_{z_j \in S_i} \ell(f, z_j). \qquad (2)

The final global estimator is then obtained by averaging:

 \bar{f} = \frac{1}{m} \sum_{i=1}^{m} \hat{f}_i.
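As a concrete illustration of the scheme in (1)-(2), the following sketch (our own, not from the paper; ridge regression stands in for a generic smooth, strongly convex loss, and the sample sizes are hypothetical) fits a local regularized least-squares estimator on each of m disjoint subsets and averages them:

```python
import numpy as np

def local_ridge(X, y, lam):
    """Solve the regularized ERM (2) on one data subset in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(d), X.T @ y / len(y))

def divide_and_conquer_erm(X, y, m, lam=0.1):
    """Split the sample into m disjoint subsets, fit a local estimator on
    each, and return the uniform average (the global estimator f_bar)."""
    locals_ = [local_ridge(Xi, yi, lam)
               for Xi, yi in zip(np.array_split(X, m), np.array_split(y, m))]
    return np.mean(locals_, axis=0), locals_

rng = np.random.default_rng(0)
N, d = 4000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.1 * rng.normal(size=N)

f_bar, _ = divide_and_conquer_erm(X, y, m=8)
f_hat = local_ridge(X, y, 0.1)          # centralized estimator on all data
print(np.linalg.norm(f_bar - f_hat))    # small: averaging tracks the global fit
```

For well-conditioned data the averaged estimator stays close to the estimator trained on the full sample; the risk bounds of this paper quantify how large m can be before this breaks down.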

The theoretical foundations of distributed learning for (regularized) ERM have received increasing attention in machine learning, and have recently been explored within the framework of learning theory [34, 18, 17, 36, 35, 31, 20, 19, 6]. However, most existing risk analyses rely on the closed form of the least-squares solution and on the properties of the reproducing kernel Hilbert space (RKHS), and are therefore only suitable when distributed learning uses the least square loss in an RKHS. Studies establishing risk bounds of distributed learning for general loss functions and hypothesis spaces remain limited.

In this paper, we study the risk performance of distributed ERM based on the divide-and-conquer approach for general loss functions and hypothesis spaces. Concretely, we use proof techniques from stochastic convex optimization to handle general loss functions, and covering numbers to handle general hypothesis spaces. Note that stochastic convex optimization and covering number arguments are usually two separate paths of theoretical analysis; the main technical difficulty of this paper is how to integrate these two proof techniques for distributed learning.

The main contributions of the paper include:

• Result I. If the number of processors satisfies m ≤ Õ(√N), we present a tight risk bound of order Õ(h/N) (we use O and Õ to hide constant factors as well as polylogarithmic factors in N or 1/δ), under a basic assumption of a logarithmic covering number of the hypothesis space (log C(H, ϵ) ≃ h log(1/ϵ), see Assumption 1 for details) and a smooth, Lipschitz continuous and strongly convex loss function.

• Result II. Under another basic assumption, that the hypothesis space has a polynomial covering number (log C(H, ϵ) ≃ (1/ϵ)^{1/h}, see Assumption 2 for details), and if the number of processors satisfies m ≤ Õ(N^{h/(2h+1)}), another tight risk bound of order O(N^{−2h/(2h+1)}) is established.

• Result III. Without the restriction of strong convexity of the loss function in Result I, a more general risk bound of order Õ(h/N^{1−r}) is derived when the number of processors satisfies m ≤ O(N^r), r ∈ (0, 1), and the optimal risk R* is small. Since N^{1−r} > √N for r < 1/2, the rate is then faster than O(1/√N).

### Related Work

Risk analysis for standard (regularized) ERM has been extensively explored within the framework of learning theory [29, 2, 30, 1, 8, 4, 27, 24, 26, 33]. Recently, divide-and-conquer based distributed learning with ridge regression [17, 34, 35, 31], gradient descent algorithms [23, 19], online learning [32], local average regression [6], spectral algorithms [11, 12, 7], and the minimum error entropy principle [15] have been proposed, and their learning performance has been observed in many practical applications. For point estimation, [17] showed that the distributed moment estimation is consistent if an unbiased estimate is obtained for each of the subproblems. For distributed regularized least squares in RKHS, [31] showed that distributed ERM leads to an estimator that is consistent with the unknown regression function. Under local strong convexity, smoothness and a reasonable set of other conditions, an improved bound was established in [36].

Optimal learning rates for divide-and-conquer kernel ridge regression in expectation were established in the seminal work of [35], under certain eigenfunction assumptions. Removing the eigenfunction assumptions, an improved bound was derived in [18] using a novel integral operator method. Using proof techniques similar to [18] or [35], optimal learning rates were established for distributed spectral algorithms [11], kernel-based distributed gradient descent algorithms [19], kernel-based distributed semi-supervised learning [7], and distributed local average regression [6]. Unfortunately, the optimal learning rates for these distributed learning methods depend on special properties of the square loss and RKHS (such as the closed form of the solution and the integral operator of the kernel function), which do not apply when analyzing the performance under other loss functions and hypothesis spaces. To fill this gap, we derive risk bounds based on general properties of the loss function and hypothesis space, making our results more widely applicable.

The rest of the paper is organized as follows. In Section 2, we discuss our main results. In Section 3, we compare against related work. Section 4 concludes. All proofs are given in Section 5.

## II Main Results

In this section, we provide and discuss our main results. To this end, we first introduce several notions.

Let {f̃_1, …, f̃_n} be an ϵ-covering of a hypothesis space H, i.e., for every f ∈ H, one can find an f̃ in the covering such that

 \|f - \tilde{f}\|_{\mathcal{H}} \le \epsilon.

Let C(H, ϵ) be the ϵ-covering number of H, that is, the smallest cardinality of an ϵ-covering of H. Let

 R(f) := \mathbb{E}_z[\ell(f, z)]

be the risk of f. We denote the optimal function and the optimal risk over H, respectively, as

 f^* := \operatorname*{argmin}_{f \in \mathcal{H}} R(f) \quad \text{and} \quad R^* := R(f^*).

### II-A Assumptions

In this subsection, we introduce some basic assumptions on the hypothesis space and the loss function.

###### Assumption 1 (logarithmic covering number).

There exists some h > 0 such that

 \forall \epsilon \in (0,1), \quad \log C(\mathcal{H}, \epsilon) \simeq h \log(1/\epsilon). \qquad (3)

Many popular function classes satisfy the above assumption when the hypothesis space is bounded:

• Any function space with finite VC-dimension h [28], including linear functions and univariate polynomials of degree k (for which h ≃ k + 1) as special cases;

• Any RKHS based on a kernel with finite rank h [5].
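A quick numerical sanity check of Assumption 1 (our own illustration, not from the paper): for a bounded h-dimensional class, a grid ϵ-net in the sup-norm has about (1/ϵ)^h elements, so log C(H, ϵ) grows like h log(1/ϵ):

```python
import math

def grid_cover_size(d, eps):
    """Size of a grid eps-net of the sup-norm cube [-1,1]^d:
    centers spaced 2*eps apart cover every point within eps."""
    return math.ceil(1.0 / eps) ** d

d = 3  # plays the role of h in Assumption 1
ratios = []
for eps in (0.1, 0.01, 0.001):
    ratios.append(math.log(grid_cover_size(d, eps)) / math.log(1.0 / eps))
print(ratios)  # each ratio equals d = 3, i.e. log C ≈ h log(1/eps)
```

The ratio log C / log(1/ϵ) stabilizes at the dimension, which is exactly the constant h in (3).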

###### Assumption 2 (polynomial covering number).

There exists some h > 0 such that

 \forall \epsilon \in (0,1), \quad \log C(\mathcal{H}, \epsilon) \simeq (1/\epsilon)^{1/h}. \qquad (4)

If H is bounded, this type of covering number is satisfied by many Sobolev/Besov classes [10]. For instance, if the kernel eigenvalues decay polynomially at a rate of j^{−2h}, then the RKHS satisfies Assumption 2 [5]. For the RKHS of a Gaussian kernel, the kernel eigenvalues decay at a nearly exponential rate.

###### Remark 1.

To derive risk bounds for divide-and-conquer ERM without specific assumptions on the type of hypothesis space, we adopt the covering number as the tool to measure the complexity of the hypothesis space. To use covering numbers in learning theory, a boundedness assumption on the hypothesis space is usually needed (see [5, 10] for details). In fact, ERM usually includes a regularizer, that is,

 \min_{f \in \mathcal{H}} \frac{1}{N} \sum_{j=1}^{N} \ell(f, z_j) + \lambda \|f\|_{\mathcal{H}}^2,

which, for a constant c related to λ, is equivalent to the constrained optimization

 \min_{f \in \mathcal{H}} \frac{1}{N} \sum_{j=1}^{N} \ell(f, z_j), \quad \text{s.t. } \|f\|_{\mathcal{H}}^2 \le c.

Thus, the boundedness assumption on the hypothesis space is usually implicit in (regularized) ERM.

###### Assumption 3.

The loss function ℓ(f, z) is non-negative, G-smooth, L-Lipschitz continuous, and convex w.r.t. f for any z ∈ Z.

Assumption 3 is satisfied by several popular losses when X and Y are bounded, such as the square loss (y − f(x))², the logistic loss log(1 + exp(−yf(x))), the squared hinge loss max(0, 1 − yf(x))², the squared ϵ-insensitive loss max(0, |y − f(x)| − ϵ)², and so on.

###### Assumption 4.

The loss function ℓ(f, z) is η-strongly convex w.r.t. f for any z ∈ Z.

Note that ℓ usually includes a regularizer, e.g. ℓ(f, z) + λ‖f‖²_H. In that case the regularized loss is strongly convex, which only requires ℓ itself to be a convex function.
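As a small numerical check of the remark above (our own example, not from the paper): adding λ‖f‖² to a merely convex loss yields an η-strongly convex loss with η = 2λ. The script verifies an inequality of the form (8) pointwise for a regularized logistic loss in one dimension:

```python
import math, random

def reg_logistic(f, y=1.0, lam=0.5):
    """Regularized logistic loss: a convex base loss plus lam*f^2,
    which is eta-strongly convex with eta = 2*lam."""
    return math.log1p(math.exp(-y * f)) + lam * f * f

random.seed(0)
eta = 1.0   # = 2 * lam
ok = True
for _ in range(1000):
    f, g, t = random.uniform(-5, 5), random.uniform(-5, 5), random.random()
    lhs = reg_logistic(t * f + (1 - t) * g)
    rhs = (t * reg_logistic(f) + (1 - t) * reg_logistic(g)
           - eta * t * (1 - t) / 2 * (f - g) ** 2)
    ok = ok and lhs <= rhs + 1e-12
print(ok)  # True: the strong-convexity inequality holds pointwise
```

The quadratic part contributes the strong-convexity gap exactly, while the logistic part only needs plain convexity.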

###### Assumption 5 (τ-diversity).

There exists some τ > 0 such that

 \frac{1}{4m^2} \sum_{i,j=1, i \neq j}^{m} \|\hat{f}_i - \hat{f}_j\|_{\mathcal{H}}^2 \ge \tau,

where τ measures the diversity between all partition-based estimates, and f̂_i is the i-th local estimator, i = 1, …, m.

Assumption 5 is satisfied as long as the partition-based estimates f̂_i, i = 1, …, m, are not all almost identical.
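The τ-diversity of Assumption 5 is a simple pairwise statistic; a minimal sketch (our own code, with toy estimators) computes it and shows it vanishes exactly when all local estimators coincide:

```python
import numpy as np

def diversity(local_estimators):
    """tau-diversity of Assumption 5: (1/(4 m^2)) * sum_{i != j} ||f_i - f_j||^2."""
    F = np.asarray(local_estimators)
    m = len(F)
    sq = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)  # pairwise ||f_i - f_j||^2
    return sq.sum() / (4 * m * m)   # diagonal terms are zero, so i = j adds nothing

identical = [np.ones(3)] * 4
spread = [np.ones(3) * i for i in range(4)]
print(diversity(identical), diversity(spread))  # 0.0 1.875
```

Zero diversity corresponds to the degenerate case excluded by Assumption 5, where averaging brings no strong-convexity gain.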

### II-B Risk Bounds

In the following, we first derive two tight risk bounds for smooth, Lipschitz continuous and strongly convex loss functions. Then, we consider the more general case by removing the restriction of strong convexity.

###### Theorem 1.

Under Assumptions 1, 3, 4 and 5, if the number of processors satisfies the bound

 m \le \min\left\{ \frac{N\eta}{8Gh\log\frac{N}{\delta}},\; \frac{\sqrt{Nh\eta}}{L\log\frac{2}{\delta}},\; h\eta + \frac{N\eta^2\tau}{128R^*\log\frac{2}{\delta}} \right\},

then, with probability at least 1 − δ, we have (in this paper, polylogarithmic factors are usually ignored or treated as constants for simplicity, so Õ(h/N) is often written as O(h/N))

 R(\bar{f}) - R(f^*) \le \widetilde{O}\Big(\frac{h}{N}\Big).

The above theorem implies that when ℓ is smooth, Lipschitz continuous and strongly convex, distributed ERM achieves a risk bound of order Õ(h/N). The rate in Theorem 1 is minimax-optimal in some cases:

• Finite VC-dimension. If the VC-dimension of H is bounded by h, which is a special case of Assumption 1, [9, 13, 14] show that there exist a constant c′ and a function f such that

 R(f) - R(f^*) \ge \frac{c'h}{N}.

• Square loss. Note that the square loss satisfies Assumption 3. From Theorem 2(a) of [22], we find that, under Assumption 1, there is a universal constant c′ such that

 \inf_{f} \sup_{f^* \in B_{\mathcal{H}}(1)} \big[R(f) - R(f^*)\big] \ge \frac{c'h}{N},

where B_H(1) is the unit ball in H.

From Theorem 1, we know that, to achieve the tight risk bound, the number of processors should satisfy the restriction

 m \le \widetilde{\Omega}\left( \min\left\{ \frac{N\eta}{\log\frac{N}{\delta}},\; \frac{\sqrt{N}}{\log\frac{1}{\delta}},\; \frac{N}{R^*\log\frac{1}{\delta}} \right\} \right).

Thus, the number of processors can reach Ω̃(√N), which is sufficient for using distributed learning in practical applications.

###### Theorem 2.

Under Assumptions 2, 3, 4 and 5, if the number of processors satisfies the bound

 m \le \min\left\{ \frac{N\eta}{4\big(N^{\frac{1}{2h+1}} + \log\frac{2}{\delta}\big)},\; \frac{\sqrt{\eta N^{\frac{h+1}{2h+1}}}}{L\log\frac{2}{\delta}},\; \frac{\eta N^{\frac{h}{2h+1}}}{GL\log\frac{2}{\delta}},\; N^{\frac{1}{2h+1}} + \frac{N\eta^2\tau}{128R^*\log\frac{2}{\delta}} \right\},

then, with probability at least 1 − δ, we have

 R(\bar{f}) - R(f^*) \le O\big(N^{-\frac{2h}{2h+1}}\big).

From Theorem 2(b) of [22], we know that, under Assumption 2, there is a universal constant c′ such that

 \inf_{f} \sup_{f^* \in B_{\mathcal{H}}(1)} \mathbb{E}_z\big[\|f - f^*\|_2^2\big] \ge c' N^{-\frac{2h}{2h+1}}.

Thus, our risk bound of order O(N^{−2h/(2h+1)}) is minimax-optimal.

From Theorem 2, we know that, to achieve the tight risk bound, the number of processors should satisfy the restriction

 m \le \widetilde{\Omega}\left( \min\left\{ N^{\frac{2h}{2h+1}},\; N^{\frac{h+1}{2h+1}},\; N^{\frac{h}{2h+1}} \right\} \right) = \widetilde{\Omega}\big(N^{\frac{h}{2h+1}}\big).

Note that N^{h/(2h+1)} < √N, so the allowed number of processors is smaller than that of Theorem 1; this is because the polynomial covering number condition is looser than the logarithmic one. When h → ∞ (which is satisfied by the Gaussian kernel), m can reach Ω̃(√N).

### II-C Risk Bounds without Strong Convexity

In the following, we provide a more general risk bound without the restriction of strong convexity.

###### Theorem 3.

Under Assumptions 1 and 3, if the number of processors satisfies m ≤ O(N^r) for some r ∈ (0, 1), then, with probability at least 1 − δ, we have

 R(\bar{f}) - R(f^*) \le O\left( \frac{h\log(N/\delta)}{N^{1-r}} + \sqrt{\frac{R^*\log\frac{1}{\delta}}{N^{1-r}}} \right).

If the optimal risk is small, that is, R* ≤ O(h²/N^{1−r}), we have

 R(\bar{f}) - R(f^*) \le O\left( \frac{h\log(N/\delta)}{N^{1-r}} \right).
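The small-risk condition can be read off by comparing the two terms of the bound; a short sketch (our own reconstruction, in the notation of Theorem 3):

```latex
\sqrt{\frac{R^*\log\frac{1}{\delta}}{N^{1-r}}}
\;\le\; \frac{h\log(N/\delta)}{N^{1-r}}
\quad\Longleftrightarrow\quad
R^* \;\le\; \frac{h^2\log^2(N/\delta)}{N^{1-r}\,\log\frac{1}{\delta}}
\;=\; \widetilde{O}\!\Big(\frac{h^2}{N^{1-r}}\Big),
```

so whenever R* is at most of this order, the square-root term is dominated and the bound collapses to its first term.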

From the above theorem, one can see that:

• The rate of this theorem is worse than that of Theorem 1. This is due to the relaxation of the restrictions on the loss function.

• When the optimal risk R* is small, the risk bound is of order Õ(h/N^{1−r}). Note that N^{1−r} > √N for r < 1/2, so in this case the rate is faster than O(1/√N).

• In the centralized case, that is, m = 1, the order of the risk can reach

 R(\bar{f}) - R(f^*) = \widetilde{O}\Big(\frac{h}{N}\Big),

which is nearly optimal. To the best of our knowledge, such a fast rate of ERM for the centralized case has never been given before.

###### Remark 2.

In Theorem 3, the risk bound holds for all r ∈ (0, 1). The parameter r is used to balance the tightness of the bound against the number of processors: the smaller r is, the tighter the risk bound and the fewer the processors.

## III Comparison with Related Work

In this section, we compare our results with related work. Our and previous results for (regularized) distributed ERM are summarized in Table I.

The theoretical foundations of distributed learning for (regularized) ERM have recently been explored within a learning theory framework [17, 36, 35, 31, 18, 20, 19, 6]. Among these works, [35, 18, 7] are the three most relevant papers; thus, in the following, we compare our results with those in [35, 18, 7] in detail. The seminal work of [35] considered the learning performance of divide-and-conquer kernel ridge regression. Using a matrix decomposition approach, [35] derived two optimal learning rates, for finite-rank kernels and for kernels with polynomially decaying eigenvalues, respectively, under the assumption that, for some constants A < ∞ and k ≥ 2, the normalized eigenfunctions ϕ_j satisfy

 \forall j = 1, 2, \ldots, \quad \mathbb{E}\big[\phi_j(X)^{2k}\big] \le A^{2k}. \qquad (5)

The condition in (5) is possibly too strong, and it was thus removed in [18], which used a novel integral operator approach under the regularity condition

 f_\rho = L_K^r h_\rho, \quad \text{for some } r > 0 \text{ and } h_\rho \in L_\rho^2, \qquad (6)

where L_K is the integral operator induced by the kernel function and f_ρ is the regression function. However, the analysis in [18] only works for r ∈ [1/2, 1]. In [7], the results of [18] were generalized, and the optimal learning rate was derived for all r under the restriction m ≤ O(N^{(2r−1)/(2r+1)}) for bounded kernel functions. Thus, for the special case of r = 1/2, the number of local processors m does not increase with N. Note that (2r−1)/(2r+1) ≤ 1/3 for r ≤ 1, so the largest number of local processors can only reach O(N^{1/3}), which may limit the applicability of distributed learning.

Compared with previous works, there are two main novelties in our results.

• The proof techniques of this paper are based on general properties of loss functions and hypothesis spaces, while the proofs of [35, 18, 7] depend on special properties of the square loss and RKHS. Thus, our results are suitable for general loss functions and hypothesis spaces, generalizing the results of [35, 18, 7];

• To derive the optimal rates, [18, 7] require the number of local processors to be less than O(N^{(2r−1)/(2r+1)}), r ∈ [1/2, 1]. Thus, the number of processors is restricted to a constant for r = 1/2, and the best result is O(N^{1/3}) (for r = 1). In contrast, the number of processors in our result can reach Ω̃(√N), relaxing the restriction on the number of processors in [18, 7].

## IV Conclusion

In this paper, we studied the risk performance of distributed ERM and derived tight risk bounds for general loss functions and hypothesis spaces. We first showed that when the number of processors satisfies certain restrictions, tight risk bounds can be obtained under the assumption of a logarithmic (or polynomial) covering number of the hypothesis space and a smooth, Lipschitz continuous and strongly convex loss function. We further presented a more general risk bound by removing the restriction of strong convexity.

In future work, we will improve our results in three directions:

• In our results, the loss function must be (strongly) convex. We will consider using the Polyak-Lojasiewicz condition [16] instead of (strong) convexity.

• In [7], it is shown that the allowed number of processors can be improved using unlabeled examples. We will consider using unlabeled examples to improve our results as well.

• In this paper, we only consider simple divide-and-conquer based distributed learning; we will consider extending our results to other distributed learning machines.

## Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No.61703396, No.61673293), the CCF-Tencent Open Fund, the Youth Innovation Promotion Association CAS, the Excellent Talent Introduction of Institute of Information Engineering of CAS (No. Y7Z0111107), the Beijing Municipal Science and Technology Project (No. Z191100007119002), and the Key Research Program of Frontier Sciences, CAS (No. ZDBS-LY-7024).

## V Proof

In this section, we first introduce the key idea of the proof, and then give the proofs of Theorems 1, 2 and 3.

### V-A The Key Idea

Note that if ℓ is an η-strongly convex function, then R is also η-strongly convex. According to the properties of strongly convex functions, for all f, f′ ∈ H we have

 R(f') \ge R(f) + \langle \nabla R(f), f' - f \rangle_{\mathcal{H}} + \frac{\eta}{2}\|f' - f\|_{\mathcal{H}}^2, \qquad (7)

or, equivalently, for all t ∈ [0, 1],

 R(tf + (1-t)f') \le tR(f) + (1-t)R(f') - \frac{\eta t(1-t)}{2}\|f - f'\|_{\mathcal{H}}^2. \qquad (8)

By (8), one can see that

 R(\bar{f}) = R\Big(\frac{1}{m}\sum_{i=1}^{m}\hat{f}_i\Big)
  \le \frac{1}{m}\sum_{i=1}^{m} R(\hat{f}_i) - \frac{\eta}{4m^2}\sum_{i,j=1, i \neq j}^{m}\|\hat{f}_i - \hat{f}_j\|_{\mathcal{H}}^2
  \le \frac{1}{m}\sum_{i=1}^{m} R(\hat{f}_i) - \eta\tau \quad (\text{by Assumption 5}).

Therefore, we have

 R(\bar{f}) - R(f^*) \le \frac{1}{m}\sum_{i=1}^{m}\big[R(\hat{f}_i) - R(f^*)\big] - \eta\tau. \qquad (9)
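The first step of (9) can be checked numerically (our own toy example): for a quadratic η-strongly convex risk, averaging the local estimators reduces the average risk by exactly η times the diversity term:

```python
import numpy as np

rng = np.random.default_rng(1)
eta = 2.0
f_star = rng.normal(size=4)

def R(f):
    """A toy eta-strongly convex risk with minimizer f_star."""
    return 0.5 * eta * np.sum((f - f_star) ** 2)

local = rng.normal(size=(6, 4))           # m = 6 local estimators
f_bar = local.mean(axis=0)
m = len(local)
# (1/(4 m^2)) * sum_{i,j} ||f_i - f_j||^2 (i = j terms are zero)
tau = sum(np.sum((fi - fj) ** 2) for fi in local for fj in local) / (4 * m * m)
lhs = R(f_bar)
rhs = np.mean([R(fi) for fi in local]) - eta * tau
print(lhs <= rhs + 1e-9)  # True (equality here, since R is exactly quadratic)
```

For a general η-strongly convex risk the identity becomes the inequality in (8)-(9).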

Next, we estimate R(f̂_i) − R(f*). By (7) with f = f̂_i and f′ = f*,

 R(\hat{f}_i) - R(f^*) + \frac{\eta}{2}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2
  \le \langle \nabla R(\hat{f}_i), \hat{f}_i - f^* \rangle_{\mathcal{H}}
  = \big\langle \nabla R(\hat{f}_i) - \nabla R(f^*) - [\nabla\hat{R}_i(\hat{f}_i) - \nabla\hat{R}_i(f^*)], \hat{f}_i - f^* \big\rangle_{\mathcal{H}}
   + \big\langle \nabla R(f^*) - \nabla\hat{R}_i(f^*), \hat{f}_i - f^* \big\rangle_{\mathcal{H}}
   + \big\langle \nabla\hat{R}_i(\hat{f}_i), \hat{f}_i - f^* \big\rangle_{\mathcal{H}}. \qquad (10)

Note that ℓ is convex, thus R̂_i is convex. By the convexity of R̂_i and the optimality condition of f̂_i [3], we have

 \langle \nabla\hat{R}_i(\hat{f}_i), f - \hat{f}_i \rangle_{\mathcal{H}} \ge 0, \quad \forall f \in \mathcal{H}.

Thus, taking f = f*, we get

 \langle \nabla\hat{R}_i(\hat{f}_i), \hat{f}_i - f^* \rangle_{\mathcal{H}} \le 0. \qquad (11)

Substituting (11) into (10) and applying the Cauchy-Schwarz inequality, we have

 R(\hat{f}_i) - R(f^*) + \frac{\eta}{2}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2
  \le \big\langle \nabla R(\hat{f}_i) - \nabla R(f^*) - [\nabla\hat{R}_i(\hat{f}_i) - \nabla\hat{R}_i(f^*)], \hat{f}_i - f^* \big\rangle_{\mathcal{H}} + \big\langle \nabla R(f^*) - \nabla\hat{R}_i(f^*), \hat{f}_i - f^* \big\rangle_{\mathcal{H}}
  \le \big\|\nabla R(\hat{f}_i) - \nabla R(f^*) - [\nabla\hat{R}_i(\hat{f}_i) - \nabla\hat{R}_i(f^*)]\big\|_{\mathcal{H}} \cdot \|\hat{f}_i - f^*\|_{\mathcal{H}}
   + \big\|\nabla R(f^*) - \nabla\hat{R}_i(f^*)\big\|_{\mathcal{H}} \cdot \|\hat{f}_i - f^*\|_{\mathcal{H}}. \qquad (12)

In the following, we use the covering number to establish an upper bound for the first term in the last line of (12), and a concentration inequality to upper bound the second term.

### V-B Proof of Theorem 1

To prove Theorem 1, we first recall a lemma from [24], and then provide two further lemmas.

###### Lemma 1 (Lemma 2 of [24]).

Let H be a Hilbert space and ξ be a random variable on (Z, ρ) with values in H. Assume that

 \|\xi\|_{\mathcal{H}} \le \tilde{M} < \infty

almost surely. Denote

 \sigma^2(\xi) = \mathbb{E}\big(\|\xi\|_{\mathcal{H}}^2\big).

Let {ξ_i}_{i=1}^l be independent random draws of ξ. For any 0 < δ < 1, with confidence 1 − δ,

 \Big\|\frac{1}{l}\sum_{i=1}^{l}\big[\xi_i - \mathbb{E}(\xi_i)\big]\Big\|_{\mathcal{H}} \le \frac{2\tilde{M}\log\frac{2}{\delta}}{l} + \sqrt{\frac{2\sigma^2(\xi)\log\frac{2}{\delta}}{l}}.
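Lemma 1 is a Bernstein-type bound for Hilbert-space-valued averages. The following simulation (our own, with ξ = Unif[−1,1]³ as a stand-in random vector) checks empirically that the deviation of the empirical mean stays below the stated bound:

```python
import numpy as np

rng = np.random.default_rng(2)
l, d, delta = 2000, 3, 0.05
M, trials, hits = np.sqrt(d), 200, 0   # ||xi|| <= sqrt(d) almost surely

for _ in range(trials):
    xi = rng.uniform(-1, 1, size=(l, d))
    sigma2 = d / 3.0                         # E||xi||^2 for Unif[-1,1]^d
    dev = np.linalg.norm(xi.mean(axis=0))    # E(xi) = 0 here
    bound = 2 * M * np.log(2 / delta) / l \
            + np.sqrt(2 * sigma2 * np.log(2 / delta) / l)
    hits += dev <= bound
print(hits / trials)   # ~1.0: the bound holds far more often than 1 - delta
```

The first (M-dependent) term decays as 1/l and the variance term as 1/√l, which is what drives the rates in Lemmas 2 and 3.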
###### Lemma 2.

If the loss function is G-smooth and convex, then for any f ∈ H, with probability at least 1 − δ, we have

 \big\|\nabla R(f) - \nabla R(f^*) - [\nabla\hat{R}_i(f) - \nabla\hat{R}_i(f^*)]\big\|_{\mathcal{H}} \le \frac{Gm\|f - f^*\|_{\mathcal{H}}\,D_{\mathcal{H},\delta,\epsilon}}{N} + \sqrt{\frac{Gm\,(R(f) - R(f^*))\,D_{\mathcal{H},\delta,\epsilon}}{N}}, \qquad (13)

where D_{\mathcal{H},\delta,\epsilon} := 2\log\frac{2C(\mathcal{H},\epsilon)}{\delta}.

###### Proof.

Note that ℓ is G-smooth and convex, so by (2.1.7) of [21], for all f ∈ H and z ∈ Z we have

 \|\nabla\ell(f,z) - \nabla\ell(f^*,z)\|_{\mathcal{H}}^2 \le G\big(\ell(f,z) - \ell(f^*,z) - \langle \nabla\ell(f^*,z), f - f^* \rangle_{\mathcal{H}}\big).

Taking expectations on both sides, we have

 \mathbb{E}_z\big[\|\nabla\ell(f,z) - \nabla\ell(f^*,z)\|_{\mathcal{H}}^2\big] \le G\big(R(f) - R(f^*) - \langle \nabla R(f^*), f - f^* \rangle_{\mathcal{H}}\big) \le G\big(R(f) - R(f^*)\big), \qquad (14)

where the last inequality follows from the optimality condition of f*, i.e.,

 \langle \nabla R(f^*), f - f^* \rangle_{\mathcal{H}} \ge 0, \quad \forall f \in \mathcal{H}.

Note that ℓ is G-smooth, thus we have

 \|\nabla\ell(f,z_i) - \nabla\ell(f^*,z_i)\|_{\mathcal{H}} \le G\|f - f^*\|_{\mathcal{H}}. \qquad (15)

Substituting (14) and (15) into Lemma 1 (applied with ξ = ∇ℓ(f, z) − ∇ℓ(f*, z) and l = |S_i| = N/m), with probability at least 1 − δ, we have

 \big\|\nabla R(f) - \nabla R(f^*) - [\nabla\hat{R}_i(f) - \nabla\hat{R}_i(f^*)]\big\|_{\mathcal{H}}
  = \Big\|\nabla R(f) - \nabla R(f^*) - \frac{m}{N}\sum_{z_i \in S_i}[\nabla\ell(f,z_i) - \nabla\ell(f^*,z_i)]\Big\|_{\mathcal{H}}
  \le \frac{2mG\|f - f^*\|_{\mathcal{H}}\log\frac{2}{\delta}}{N} + \sqrt{\frac{2mG(R(f) - R(f^*))\log\frac{2}{\delta}}{N}}.

We obtain Lemma 2 by taking the union bound over all functions in an ϵ-covering of H. ∎

###### Lemma 3.

Under Assumption 3, with probability at least 1 − δ, we have

 \|\nabla R(f^*) - \nabla\hat{R}_i(f^*)\|_{\mathcal{H}} \le \frac{2Lm\log\frac{2}{\delta}}{N} + \sqrt{\frac{8GR^*m\log\frac{2}{\delta}}{N}}. \qquad (16)
###### Proof.

Since ℓ is G-smooth and non-negative, from Lemma 4 of [25] we have

 \|\nabla\ell(f^*,z_i)\|_{\mathcal{H}}^2 \le 4G\,\ell(f^*,z_i),

and thus

 \mathbb{E}_z\big[\|\nabla\ell(f^*,z)\|_{\mathcal{H}}^2\big] \le 4G\,\mathbb{E}_z[\ell(f^*,z)] = 4G\,R(f^*).

Since ℓ is L-Lipschitz continuous, we have

 |\ell(f^* + \delta f, z) - \ell(f^*, z)| \le L\|\delta f\|_{\mathcal{H}}, \quad \forall \delta f \in \mathcal{H}.

Thus, from the definition of the differential of ℓ, we have

 \|\nabla\ell(f^*,z)\|_{\mathcal{H}} \le L.

Then, according to Lemma 1 (applied with ξ = ∇ℓ(f*, z), M̃ = L, σ²(ξ) ≤ 4GR* and l = N/m), with probability at least 1 − δ, we have

 \|\nabla R(f^*) - \nabla\hat{R}_i(f^*)\|_{\mathcal{H}} \le \frac{2Lm\log\frac{2}{\delta}}{N} + \sqrt{\frac{8GR^*m\log\frac{2}{\delta}}{N}}. ∎

###### Proof of Theorem 1.

From the properties of the ϵ-covering, we know that there exists a function f̃ in the covering such that

 \|\hat{f}_i - \tilde{f}\|_{\mathcal{H}} \le \epsilon.

Thus, we have

 \big\|\nabla R(\hat{f}_i) - \nabla R(f^*) - [\nabla\hat{R}_i(\hat{f}_i) - \nabla\hat{R}_i(f^*)]\big\|_{\mathcal{H}}
  \overset{\text{smoothness}}{\le} \big\|\nabla R(\tilde{f}) - \nabla R(f^*) - [\nabla\hat{R}_i(\tilde{f}) - \nabla\hat{R}_i(f^*)]\big\|_{\mathcal{H}} + 2G\epsilon
  \le \frac{GmD_{\mathcal{H},\delta,\epsilon}\|\tilde{f} - f^*\|_{\mathcal{H}}}{N} + \sqrt{\frac{GmD_{\mathcal{H},\delta,\epsilon}(R(\tilde{f}) - R(f^*))}{N}} + 2G\epsilon
  \overset{\text{Lipschitz}}{\le} \frac{GmD_{\mathcal{H},\delta,\epsilon}\|\hat{f}_i - f^*\|_{\mathcal{H}}}{N} + \frac{Gm\epsilon D_{\mathcal{H},\delta,\epsilon}}{N} + \sqrt{\frac{GmD_{\mathcal{H},\delta,\epsilon}(R(\hat{f}_i) - R(f^*))}{N}} + \sqrt{\frac{GLm\epsilon D_{\mathcal{H},\delta,\epsilon}}{N}} + 2G\epsilon. \qquad (17)

Substituting (17) and (16) into (12), with probability at least 1 − δ, we have

 R(\hat{f}_i) - R(f^*) + \frac{\eta}{2}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2
  \le \frac{GmD_{\mathcal{H},\delta,\epsilon}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2}{N}
   + \frac{Gm\epsilon D_{\mathcal{H},\delta,\epsilon}\|\hat{f}_i - f^*\|_{\mathcal{H}}}{N}
   + 2G\epsilon\|\hat{f}_i - f^*\|_{\mathcal{H}}
   + \|\hat{f}_i - f^*\|_{\mathcal{H}}\sqrt{\frac{GmD_{\mathcal{H},\delta,\epsilon}(R(\hat{f}_i) - R(f^*))}{N}}
   + \|\hat{f}_i - f^*\|_{\mathcal{H}}\sqrt{\frac{GLm\epsilon D_{\mathcal{H},\delta,\epsilon}}{N}}
   + \frac{2Lm\log\frac{2}{\delta}\|\hat{f}_i - f^*\|_{\mathcal{H}}}{N}
   + \|\hat{f}_i - f^*\|_{\mathcal{H}}\sqrt{\frac{8GR^*m\log\frac{2}{\delta}}{N}}. \qquad (18)

Note that

 \sqrt{ab} \le \frac{a}{2c} + \frac{bc}{2}, \quad \forall a, b, c > 0.

Therefore, we have

 \|\hat{f}_i - f^*\|_{\mathcal{H}}\sqrt{\frac{GmD_{\mathcal{H},\delta,\epsilon}(R(\hat{f}_i) - R(f^*))}{N}} \le \frac{2GmD_{\mathcal{H},\delta,\epsilon}(R(\hat{f}_i) - R(f^*))}{N\eta} + \frac{\eta}{8}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2;

 \frac{2Lm\log\frac{2}{\delta}\|\hat{f}_i - f^*\|_{\mathcal{H}}}{N} \le \frac{16L^2m^2\log^2\frac{2}{\delta}}{N^2\eta} + \frac{\eta}{16}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2;

 \|\hat{f}_i - f^*\|_{\mathcal{H}}\sqrt{\frac{8GR^*m\log\frac{2}{\delta}}{N}} \le \frac{64GR^*m\log\frac{2}{\delta}}{N\eta} + \frac{\eta}{32}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2;

 2G\epsilon\|\hat{f}_i - f^*\|_{\mathcal{H}} \le \frac{64G^2\epsilon^2}{\eta} + \frac{\eta}{64}\|\hat{f}_i - f^*\|_{\mathcal{H}}^2.