# Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate

Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use L_1 regularization. In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L_1 regularization. pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework of pSCOPE, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that pSCOPE is convergent with a linear convergence rate if the data partition is good enough. We also prove that better data partition implies faster convergence rate. Furthermore, pSCOPE is also communication efficient. Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.


## 1 Introduction

Many machine learning models can be formulated as the following regularized empirical risk minimization problem:

$$\min_{w\in\mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) + R(w), \qquad (1)$$

where $w$ is the parameter to learn, $f_i(w)$ is the loss on training instance $i$, $n$ is the number of training instances, and $R(w)$ is a regularization term. Recently, sparse learning, which tries to learn a sparse model for prediction, has become a hot topic in machine learning. There are different ways to implement sparse learning Tibshirani94regressionshrinkage ; DBLP:conf/icml/WangKS017 . One popular way is to use $L_1$ regularization, i.e., $R(w) = \lambda\|w\|_1$. In this paper, we focus on sparse learning with $R(w) = \lambda\|w\|_1$. Hence, in the following content of this paper, $R(w) = \lambda\|w\|_1$ unless otherwise stated.

One traditional method to solve (1) is proximal gradient descent (pGD) DBLP:journals/siamis/BeckT09 , which can be written as follows:

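$$w_{t+1} = \mathrm{prox}_{R,\eta}\big(w_t - \eta\nabla F(w_t)\big), \qquad (2)$$

where $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, $w_t$ is the value of $w$ at iteration $t$, $\eta$ is the learning rate, and $\mathrm{prox}_{R,\eta}$ is the proximal mapping defined as

$$\mathrm{prox}_{R,\eta}(u) = \arg\min_{v}\Big(R(v) + \frac{1}{2\eta}\|v-u\|^2\Big). \qquad (3)$$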
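For the $L_1$ case the proximal mapping in (3) has a closed form, coordinate-wise soft-thresholding. The following is a minimal sketch under the assumption $R(v) = \lambda\|v\|_1$; the function names and the argument `grad_F` (a callable returning $\nabla F$) are illustrative and not part of the paper.

```python
def soft_threshold(u, tau):
    """Coordinate-wise prox of tau * ||.||_1 evaluated at the list u."""
    out = []
    for x in u:
        if x > tau:
            out.append(x - tau)
        elif x < -tau:
            out.append(x + tau)
        else:
            out.append(0.0)
    return out


def prox_l1(u, eta, lam):
    """prox_{R,eta}(u) for the assumed R(v) = lam * ||v||_1: threshold at eta * lam."""
    return soft_threshold(u, eta * lam)


def pgd_step(w, grad_F, eta, lam):
    """One proximal gradient step as in Eq. (2)."""
    g = grad_F(w)
    return prox_l1([wi - eta * gi for wi, gi in zip(w, g)], eta, lam)


# Toy usage with F(w) = 0.5 * ||w - c||^2, whose gradient is w - c.
c = [3.0, 0.2, -1.5]
w_next = pgd_step([0.0, 0.0, 0.0], lambda w: [wi - ci for wi, ci in zip(w, c)], 1.0, 0.5)
```

Coordinates whose magnitude falls below the threshold $\eta\lambda$ are set exactly to zero, which is how the proximal step produces sparse iterates.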

Recently, stochastic learning methods, including stochastic gradient descent (SGD) DBLP:journals/siamjo/NemirovskiJLS09 , stochastic average gradient (SAG) DBLP:journals/mp/SchmidtRB17 , stochastic variance reduced gradient (SVRG) DBLP:conf/nips/Johnson013 , and stochastic dual coordinate ascent (SDCA) DBLP:journals/jmlr/Shalev-Shwartz013 , have been proposed to speed up the learning procedure in machine learning. Inspired by the success of these stochastic learning methods, proximal stochastic methods, including proximal SGD (pSGD) DBLP:conf/nips/LangfordLZ08 ; DBLP:journals/jmlr/DuchiS09 ; DBLP:conf/pkdd/ShiL15 ; DBLP:journals/siamjo/ByrdHNS16 , proximal block coordinate descent (pBCD) Tseng:2001:CBC:565614.565615 ; citeulike:9641309 ; DBLP:conf/icml/ScherrerHTH12 , proximal SVRG (pSVRG) DBLP:journals/siamjo/Xiao014 and proximal SDCA (pSDCA) DBLP:conf/icml/Shalev-Shwartz014 , have also been proposed for sparse learning in recent years. All these proximal stochastic methods are sequential (serial) and implemented with a single thread.

The serial proximal stochastic methods may not be efficient enough for solving large-scale sparse learning problems. Furthermore, the training set might be distributively stored on a cluster of multiple machines in some applications. Hence, distributed sparse learning DBLP:conf/icml/AybatWI15 with a cluster of multiple machines has attracted much attention in recent years, especially for large-scale applications with high-dimensional data. In particular, researchers have recently proposed several distributed proximal stochastic methods for sparse learning DBLP:journals/corr/LiXZL16 ; DBLP:conf/aaai/MengCYWML17 ; DBLP:journals/jmlr/LeeLMY17 ; DBLP:journals/jmlr/MahajanKS17 ; DBLP:journals/corr/SmithFJJ15 . (Footnote 1: In this paper, we mainly focus on distributed sparse learning with $L_1$ regularization. The distributed methods for non-sparse learning, like those in DBLP:conf/nips/ReddiHSPS15 ; DBLP:conf/icdm/DeG16 ; DBLP:conf/aistats/LeblondPL17 , are not considered.)

One main branch of the distributed proximal stochastic methods includes distributed pSGD (dpSGD) DBLP:journals/corr/LiXZL16 , distributed pSVRG (dpSVRG) DBLP:journals/corr/HuoGH16 ; DBLP:conf/aaai/MengCYWML17 and distributed SVRG (DSVRG) DBLP:journals/jmlr/LeeLMY17 . Both dpSGD and dpSVRG adopt a centralized framework and mini-batch based strategy for distributed learning. One typical implementation of a centralized framework is based on Parameter Server DBLP:conf/osdi/LiAPSAJLSS14 ; DBLP:conf/kdd/XingHDKWLZXKY15 , which supports both synchronous and asynchronous communication strategies. One shortcoming of dpSGD and dpSVRG is that the communication cost is high. More specifically, the communication cost of each epoch is $O(n)$, where $n$ is the number of training instances. DSVRG adopts a decentralized framework with lower communication cost than dpSGD and dpSVRG. However, in DSVRG only one worker updates parameters locally at any time while all other workers are idle.

Another branch of the distributed proximal stochastic methods is based on block coordinate descent DBLP:conf/icml/BradleyKBG11 ; DBLP:journals/mp/RichtarikT16 ; DBLP:journals/siamrev/FercoqR16 ; DBLP:journals/jmlr/MahajanKS17 . Although in each iteration these methods update only a block of coordinates, they usually have to pass through the whole data set. Because the data are partitioned across machines, this also incurs a high communication cost in each iteration.

Another branch of the distributed proximal stochastic methods is based on SDCA. One representative is PROXCOCOA DBLP:journals/corr/SmithFJJ15 . Although PROXCOCOA has been theoretically proved to have a linear convergence rate with low communication cost, we find that it is not efficient enough in experiments.

In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with $L_1$ regularization. pSCOPE is a proximal generalization of the scalable composite optimization for learning (SCOPE) DBLP:conf/aaai/ZhaoXSGL17 . SCOPE cannot be used for sparse learning, while pSCOPE can. The contributions of pSCOPE are briefly summarized as follows:

• pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The CALL framework is communication efficient because there is no communication during the inner iterations of each epoch.

• pSCOPE is theoretically guaranteed to be convergent with a linear convergence rate if the data partition is good enough, and better data partition implies faster convergence rate. Hence, pSCOPE is also computation efficient.

• In pSCOPE, a recovery strategy is proposed to reduce the cost of proximal mapping when handling high dimensional sparse data.

• Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

## 2 Preliminary

In this paper, we use $\|\cdot\|$ to denote the $L_2$ norm $\|\cdot\|_2$, and $w^*$ to denote the optimal solution of (1). For a vector $a$, we use $a^{(j)}$ to denote the $j$th coordinate value of $a$. $[n]$ denotes the set $\{1, 2, \ldots, n\}$. For a function $h(a; b)$, we use $\nabla h(a; b)$ to denote the gradient of $h(a; b)$ with respect to (w.r.t.) the first argument $a$. Furthermore, we give the following definitions.

###### Definition 1

We call a function $h(\cdot)$ $L$-smooth if it is differentiable and there exists a positive constant $L$ such that $\forall a, b$: $h(b) \le h(a) + \nabla h(a)^T (b - a) + \frac{L}{2}\|a - b\|^2$.

###### Definition 2

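We call a function $h(\cdot)$ convex if there exists a constant $\mu \ge 0$ such that $\forall a, b$: $h(a) \ge h(b) + \zeta^T (a - b) + \frac{\mu}{2}\|a - b\|^2$, where $\zeta \in \partial h(b) = \{c \mid h(a) \ge h(b) + c^T (a - b), \forall a\}$. If $h(\cdot)$ is differentiable, then $\zeta = \nabla h(b)$. If $\mu > 0$, $h(\cdot)$ is called $\mu$-strongly convex.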

Throughout this paper, we assume that $R(w)$ is convex, $F(w)$ is strongly convex and each $f_i(w)$ is smooth. We do not assume that each $f_i(w)$ is convex.

## 3 Proximal SCOPE

In this paper, we focus on distributed learning with one master (server) and $p$ workers in the cluster, although the algorithm and theory of this paper can also be easily extended to cases with multiple servers like the Parameter Server framework DBLP:conf/osdi/LiAPSAJLSS14 ; DBLP:conf/kdd/XingHDKWLZXKY15 .

The parameter $w$ is stored in the master, and the training set $D$ is partitioned into $p$ parts denoted as $D_1, D_2, \ldots, D_p$. Here, $D_k$ contains a subset of instances from $D$, and $D_k$ will be assigned to the $k$th worker. $\bigcup_{k=1}^{p} D_k = D$. Based on this data partition scheme, the proximal SCOPE (pSCOPE) for distributed sparse learning is presented in Algorithm 1. The main task of the master is to add and average vectors received from workers. Specifically, it needs to calculate the full gradient $z = \nabla F(w_t) = \frac{1}{n}\sum_{k=1}^{p} z_k$, where $z_k$ is the sum of the gradients $\nabla f_i(w_t)$ over the instances on the $k$th worker. Then it needs to calculate $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$. The main task of workers is to update the local parameters $u_{k,m}$ initialized with $u_{k,0} = w_t$. Specifically, for each worker $k$, after it gets the full gradient $z$ from the master, it calculates a stochastic gradient

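$$v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z, \qquad (4)$$

and then updates its local parameter $u_{k,m}$ by a proximal mapping with learning rate $\eta$:

$$u_{k,m+1} = \mathrm{prox}_{R,\eta}\big(u_{k,m} - \eta v_{k,m}\big). \qquad (5)$$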
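The master and worker steps described above can be sketched in a single process as follows. This is only an illustration, not the authors' implementation: it assumes a squared loss $f_i(w) = \frac{1}{2}(x_i^T w - y_i)^2$, $R(w) = \lambda\|w\|_1$, $M$ inner iterations per worker, and a plain average of the local parameters on the master; the workers are simulated one after another, and all function names are hypothetical.

```python
import random


def soft_threshold(x, tau):
    """Scalar prox of tau * |.|, applied coordinate-wise for R(w) = lam * ||w||_1."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))


def grad_fi(w, x, y):
    """Gradient of the illustrative squared loss f_i(w) = 0.5 * (x.w - y)^2."""
    r = dot(x, w) - y
    return [r * xj for xj in x]


def objective(w, data, lam):
    loss = sum(0.5 * (dot(x, w) - y) ** 2 for x, y in data) / len(data)
    return loss + lam * sum(abs(wj) for wj in w)


def pscope_outer_iteration(w_t, parts, eta, lam, M, rng):
    n = sum(len(part) for part in parts)
    d = len(w_t)
    # Master: add the local gradient sums and average them into the full gradient z.
    z = [0.0] * d
    for part in parts:
        for x, y in part:
            for j, gj in enumerate(grad_fi(w_t, x, y)):
                z[j] += gj / n
    # Workers: M local steps each, with no communication inside this loop.
    local_params = []
    for part in parts:
        u = list(w_t)
        for _ in range(M):
            x, y = part[rng.randrange(len(part))]
            g_u, g_w = grad_fi(u, x, y), grad_fi(w_t, x, y)
            v = [a - b + zj for a, b, zj in zip(g_u, g_w, z)]   # Eq. (4)
            u = [soft_threshold(uj - eta * vj, eta * lam)       # Eq. (5)
                 for uj, vj in zip(u, v)]
        local_params.append(u)
    # Master: average the local parameters (assumed form of the averaging step).
    p = len(parts)
    return [sum(u[j] for u in local_params) / p for j in range(d)]


def run_demo(seed=0, n=200, p=4, outer_iters=10):
    """Run a few outer iterations on synthetic data; returns (initial, final) objective."""
    rng = random.Random(seed)
    w_true = [1.0, -2.0, 0.0, 0.0, 0.5]
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        data.append((x, dot(x, w_true) + 0.01 * rng.gauss(0.0, 1.0)))
    parts = [data[k::p] for k in range(p)]  # i.i.d. data, so striding gives balanced parts
    w, lam = [0.0] * len(w_true), 0.01
    start = objective(w, data, lam)
    for _ in range(outer_iters):
        w = pscope_outer_iteration(w, parts, eta=0.01, lam=lam, M=200, rng=rng)
    return start, objective(w, data, lam)
```

In this sketch the only quantities exchanged per outer iteration are the local gradient sums, the full gradient $z$, and the local parameters, which reflects the claim that no communication happens during the inner iterations.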

From Algorithm 1, we can find that pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The cooperative operation is mainly adding and averaging in the master. During the autonomous local learning procedure in each outer iteration, which contains $M$ inner iterations (see Algorithm 1), there is no communication. Hence, the communication cost for each epoch of pSCOPE is constant, which is much less than the mini-batch based strategy with $O(n)$ communication cost for each epoch DBLP:journals/corr/LiXZL16 ; DBLP:journals/corr/HuoGH16 ; DBLP:conf/aaai/MengCYWML17 .

pSCOPE is a proximal generalization of SCOPE DBLP:conf/aaai/ZhaoXSGL17 . Although pSCOPE is mainly motivated by sparse learning with $L_1$ regularization, the algorithm and theory of pSCOPE can also be used for smooth regularization like $L_2$ regularization. Furthermore, when the data partition is good enough, pSCOPE can avoid the extra term $c(u_{k,m} - w_t)$ in the update rule of SCOPE, which is necessary for the convergence guarantee of SCOPE.

## 4 Effect of Data Partition

In our experiments, we find that the data partition affects the convergence of the learning procedure. Hence, in this section we propose a metric to measure the goodness of a data partition, based on which the convergence of pSCOPE can be theoretically proved. Due to space limitations, the detailed proofs of the lemmas and theorems are given in the long version DBLP:journals/corr/abs-1803-05621 .

### 4.1 Partition

First, we give the following definition:

###### Definition 3

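Define $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$. We call $\pi$ a partition w.r.t. $P(\cdot)$, if $F(w) = \frac{1}{p}\sum_{k=1}^{p}\phi_k(w)$ and each $\phi_k(\cdot)$ ($k = 1, \ldots, p$) is $\mu_k$-strongly convex and $L_k$-smooth ($\mu_k, L_k > 0$). Here, $P(\cdot)$ is defined in (1) and $F(\cdot)$ is defined in (2). We denote $\mathcal{A}(P) = \{\pi \mid \pi$ is a partition w.r.t. $P(\cdot)\}$.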

###### Remark 1

Here, is an ordered sequence of functions. In particular, if we construct another partition by permuting of , we consider them to be two different partitions. Furthermore, two functions in can be the same. Two partitions , are considered to be equal, i.e., , if and only if .

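For any partition $\pi$ w.r.t. $P(\cdot)$, we construct new functions $P_k(\cdot;\cdot)$ as follows:

$$P_k(w;a) = \phi_k(w;a) + R(w), \quad k = 1, \ldots, p, \qquad (6)$$

where $\phi_k(w;a) = \phi_k(w) + G_k(a)^T w$, $G_k(a) = \nabla F(a) - \nabla\phi_k(a)$, and $w, a \in \mathbb{R}^d$.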
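As a quick consistency check on (6), assume the shifted local function has the form $\phi_k(w;a) = \phi_k(w) + G_k(a)^T w$ with $G_k(a) = \nabla F(a) - \nabla\phi_k(a)$, which is the form used in the proof of Lemma 1 in Appendix A. Since $F$ is the average of the $\phi_k$, the correction terms $G_k(a)$ cancel on average, so the local objectives average back to the global objective for every $a$:

```latex
\begin{aligned}
\frac{1}{p}\sum_{k=1}^{p} G_k(a)
  &= \nabla F(a) - \frac{1}{p}\sum_{k=1}^{p}\nabla\phi_k(a) = 0, \\
\frac{1}{p}\sum_{k=1}^{p} P_k(w;a)
  &= \frac{1}{p}\sum_{k=1}^{p}\phi_k(w)
   + \Big(\frac{1}{p}\sum_{k=1}^{p} G_k(a)\Big)^{T} w + R(w)
   = F(w) + R(w) = P(w).
\end{aligned}
```

This identity is what allows the global optimal value $P(w^*)$ to be compared with the average of the local optimal values.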

In particular, given a data partition $D_1, \ldots, D_p$ of the training set $D$, let $F_k(w) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w)$, which is also called the local loss function. Assume each $F_k(\cdot)$ is strongly convex and smooth, and $F(w) = \frac{1}{p}\sum_{k=1}^{p} F_k(w)$. Then, we can find that $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is a partition w.r.t. $P(\cdot)$. By taking expectation on $v_{k,m}$ defined in Algorithm 1, we obtain $\mathbb{E}[v_{k,m} \mid u_{k,m}] = \nabla F_k(u_{k,m}) + G_k(w_t)$. According to the theory in DBLP:journals/siamjo/Xiao014 , in the inner iterations of pSCOPE, each worker tries to optimize the local objective function $P_k(w; w_t)$ using proximal SVRG with initialization $w = w_t$ and training data $D_k$, rather than optimizing $F_k(w) + R(w)$. Then we call such a $P_k(w;a)$ the local objective function w.r.t. $\pi$. Compared to the subproblem of PROXCOCOA (equation (2) in DBLP:journals/corr/SmithFJJ15 ), $P_k(w;a)$ is simpler and there is no hyperparameter in it.

### 4.2 Good Partition

In general, the data distribution on each worker is different from the distribution of the whole training set. Hence, there exists a gap between each local optimal value and the global optimal value. Intuitively, the whole learning algorithm has slow convergence rate or cannot even converge if this gap is too large.

###### Definition 4

For any partition $\pi$ w.r.t. $P(\cdot)$, we define the Local-Global Gap as

$$l_\pi(a) = P(w^*) - \frac{1}{p}\sum_{k=1}^{p} P_k(w_k^*(a); a),$$

where $w_k^*(a) = \arg\min_{w} P_k(w;a)$.

We have the following properties of Local-Global Gap:

###### Lemma 1

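For any partition $\pi$ w.r.t. $P(\cdot)$ and any $a$, $l_\pi(a) \ge l_\pi(w^*) = 0$, and $l_\pi(a) = P(w^*) + \frac{1}{p}\sum_{k=1}^{p} H_k^*(-G_k(a))$, where $H_k^*(\cdot)$ is the conjugate function of $\phi_k(\cdot) + R(\cdot)$.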
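The local-global gap can be evaluated in closed form in the scalar quadratic setting used as a warm-up in Appendix A.2.1. The sketch below assumes $\phi_k(w) = \frac{1}{2}m_k w^2 + b_k w + c_k$, $R(w) = |w|$, $F$ equal to the average of the $\phi_k$, and $G_k(a) = F'(a) - \phi_k'(a)$; the function names are illustrative. It can be used to check numerically that the gap is nonnegative and vanishes at $a = w^*$, as the proof in Appendix A.1 indicates.

```python
def soft_threshold(x, tau):
    """Scalar prox of tau * |.|."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def min_quadratic_l1(m, q):
    """Return (argmin, min) of 0.5 * m * w^2 + q * w + |w| for m > 0."""
    w = -soft_threshold(q, 1.0) / m
    return w, 0.5 * m * w * w + q * w + abs(w)


def local_global_gap(ms, bs, cs, a):
    """Return (l_pi(a), w_star) for scalar quadratics phi_k given by (ms, bs, cs)."""
    p = len(ms)
    # F is the average of the phi_k, so its coefficients are the averages.
    m, b, c = sum(ms) / p, sum(bs) / p, sum(cs) / p
    w_star, val = min_quadratic_l1(m, b)
    P_star = val + c
    total = 0.0
    for mk, bk, ck in zip(ms, bs, cs):
        Gk = (m * a + b) - (mk * a + bk)          # G_k(a) = F'(a) - phi_k'(a)
        _, vk = min_quadratic_l1(mk, bk + Gk)     # minimum of P_k(.; a) without c_k
        total += vk + ck
    return P_star - total / p, w_star


# Two local functions with different curvature; here w* = 1.
gap_at_opt, w_star = local_global_gap([1.0, 3.0], [-4.0, -2.0], [0.0, 1.0], 1.0)
gap_elsewhere, _ = local_global_gap([1.0, 3.0], [-4.0, -2.0], [0.0, 1.0], 0.0)
```

When all $m_k$ are equal the gap is identically zero in this scalar setting, which matches the intuition that identical local functions form an ideal partition.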

###### Theorem 1

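Let $R(w) = \|w\|_1$. For any partition $\pi$ w.r.t. $P(\cdot)$, there exists a constant $\gamma < \infty$ such that $l_\pi(a) \le \gamma\|a - w^*\|^2$, $\forall a$.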

The result in Theorem 1 can be easily extended to smooth regularization which can be found in the long version DBLP:journals/corr/abs-1803-05621 .

According to Theorem 1, the local-global gap can be bounded by $\gamma\|a - w^*\|^2$. Given a specific $a$, the smaller $\gamma$ is, the smaller the local-global gap will be. Since the constant $\gamma$ only depends on the partition $\pi$, intuitively $\gamma$ can be used to evaluate the goodness of a partition $\pi$. We define a good partition as follows:

###### Definition 5

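We call $\pi$ an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$ if $\pi$ is a partition w.r.t. $P(\cdot)$ and

$$\gamma(\pi;\epsilon) \triangleq \sup_{\|a - w^*\|^2 \ge \epsilon} \frac{l_\pi(a)}{\|a - w^*\|^2} \le \xi. \qquad (7)$$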

In the following, we give the bound of $\gamma(\pi;\epsilon)$.

###### Lemma 2

Assume is a partition w.r.t. , where is the local loss function, each is Lipschitz continuous with bounded domain and sampled from some unknown distribution . If we assign these

uniformly to each worker, then with high probability,

. Moreover, if is convex w.r.t. , then . Here we ignore the term and dimensionality .

For example, in Lasso regression, it is easy to get that the corresponding local-global gap $l_\pi(a)$ is convex according to Lemma 1 and the fact that $G_k(a)$ is an affine function in this case.

Lemma 2 implies that as long as the size of the training data is large enough, $\gamma(\pi;\epsilon)$ will be small and $\pi$ will be a good partition. Please note that "uniformly" here means that each $f_i(\cdot)$ is assigned to one of the $p$ workers and each worker has equal probability of receiving it. We call the partition resulting from uniform assignment a uniform partition in this paper. With a uniform partition, each worker will have almost the same number of instances. As long as the size of the training data is large enough, a uniform partition is a good partition.
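The uniform assignment described above can be sketched in a few lines. The function name and the fixed seed are illustrative choices, not part of the paper.

```python
import random


def uniform_partition(n, p, seed=0):
    """Assign each of n instance indices to one of p workers with equal probability."""
    rng = random.Random(seed)
    parts = [[] for _ in range(p)]
    for i in range(n):
        parts[rng.randrange(p)].append(i)
    return parts


parts = uniform_partition(n=8000, p=8)
sizes = [len(part) for part in parts]
```

Because every instance is assigned independently, the part sizes are only approximately equal; they concentrate around $n/p$ as $n$ grows.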

## 5 Convergence of Proximal SCOPE

In this section, we will prove the convergence of Algorithm 1 for proximal SCOPE (pSCOPE) using the results in Section 4.

###### Theorem 2

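Assume $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ is an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$. For convenience, we set $\mu_k = \mu$, $L_k = L$, $k = 1, \ldots, p$. If $\|w_t - w^*\|^2 \ge \epsilon$, then

$$\mathbb{E}\|w_{t+1} - w^*\|^2 \le \Big[(1 - \mu\eta + 2L^2\eta^2)^M + \frac{2L^2\eta + 2\xi}{\mu - 2L^2\eta}\Big]\|w_t - w^*\|^2.$$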

Because a smaller $\xi$ means a better partition and the partition $\pi$ corresponds to the data partition in Algorithm 1, we can see that a better data partition implies a faster convergence rate.
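To see the dependence on the partition quality concretely, the factor in Theorem 2 can be evaluated for different values of $\xi$. The sketch below reads the bracket as $(1 - \mu\eta + 2L^2\eta^2)^M + \frac{2L^2\eta + 2\xi}{\mu - 2L^2\eta}$ and additionally assumes $\mu > 2L^2\eta$ so that the denominator is positive; the function name and the numeric values are illustrative only.

```python
def contraction_factor(mu, L, eta, M, xi):
    """Evaluate the bracketed factor of Theorem 2 (assumed reading, see lead-in)."""
    denom = mu - 2.0 * L * L * eta
    if denom <= 0:
        raise ValueError("this sketch assumes eta < mu / (2 * L^2)")
    decay = (1.0 - mu * eta + 2.0 * L * L * eta * eta) ** M
    return decay + (2.0 * L * L * eta + 2.0 * xi) / denom


rho_good = contraction_factor(mu=1.0, L=1.0, eta=0.01, M=1000, xi=0.01)
rho_poor = contraction_factor(mu=1.0, L=1.0, eta=0.01, M=1000, xi=0.1)
```

With the other quantities fixed, the factor grows linearly in $\xi$, so a partition with a smaller $\xi$ gives a smaller factor and hence a faster guaranteed rate.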

###### Corollary 1

Assume is a -good partition w.r.t. . For convenience, we set . If , taking , , where is the conditional number, then we have . To get the -suboptimal solution, the computation complexity of each worker is .

###### Corollary 2

When , which means we only use one worker, pSCOPE degenerates to proximal SVRG DBLP:journals/siamjo/Xiao014 . Assume is -strongly convex () and -smooth. Taking , , we have . To get the -optimal solution, the computation complexity is .

We can find that pSCOPE has a linear convergence rate if the partition is $(\epsilon, \xi)$-good, which implies that pSCOPE is computation efficient and that we need $O(\log(1/\epsilon))$ outer iterations to get an $\epsilon$-optimal solution. For all inner iterations, each worker updates $u_{k,m}$ without any communication. Hence, the communication cost is $O(\log(1/\epsilon))$, which is much smaller than the mini-batch based strategy with $O(n)$ communication cost for each epoch DBLP:journals/corr/LiXZL16 ; DBLP:journals/corr/HuoGH16 ; DBLP:conf/aaai/MengCYWML17 .

Furthermore, in the above theorems and corollaries, we only assume that the local loss function is strongly convex. We do not need each to be convex. Hence, and it is weaker than the assumption in proximal SVRG DBLP:journals/siamjo/Xiao014 whose computation complexity is when . In addition, without convexity assumption for each , our result for the degenerate case is consistent with that in DBLP:conf/icml/Shalev-Shwartz16 .

## 6 Handle High Dimensional Sparse Data

For the cases with high dimensional sparse data, we propose recovery strategy to reduce the cost of proximal mapping so that it can accelerate the training procedure. Here, we adopt the widely used linear model with elastic net Zou05regularizationand as an example for illustration, which can be formulated as follows: , where is the loss function. We assume many instances in are sparse vectors and let .

Proximal mapping is unacceptable when the data dimensionality is too large, since we need to execute the conditional statements times which is time consuming. Other methods, like proximal SGD and proximal SVRG, also suffer from this problem.

Since is a constant during the update of local parameter , we will design a recovery strategy to recover it when necessary. More specifically, in each inner iteration, with the random index , we only recover to calculate the inner product and update for . For those , we do not immediately update . The basic idea of these recovery rules is: for some coordinate , we can calculate directly from , rather than doing iterations from to . Here, . At the same time, the new algorithm is totally equivalent to Algorithm 1. It will save about times of conditional statements, where is the sparsity of . This reduction of computation is significant especially for high dimensional sparse training data. Due to space limitation, the complete rules are moved to the long version DBLP:journals/corr/abs-1803-05621 . Here we only give one case of our recovery rules in Lemma 3.

###### Lemma 3

(Recovery Rule) We define the sequence as: and for , . For the coordinate and constants , if for any . If , then the relation between and can be summarized as follows: define which satisfies

1. If , then

2. If , then

## 7 Experiment

We use two sparse learning models for evaluation. One is logistic regression (LR) with elastic net Zou05regularizationand : . The other is Lasso regression Tibshirani94regressionshrinkage : . All experiments are conducted on a cluster of multiple machines. The CPU for each machine has 12 Intel E5-2620 cores, and the memory of each machine is 96GB. The machines are connected by 10GB Ethernet. Evaluation is based on four datasets in Table 1: cov, rcv1, avazu, kdd2012. All of them can be downloaded from the LibSVM website.

### 7.1 Baselines

We compare pSCOPE with six representative baselines: the proximal gradient descent based method FISTA DBLP:journals/siamis/BeckT09 , the ADMM-type method DFAL DBLP:conf/icml/AybatWI15 , the Newton-type method mOWL-QN DBLP:conf/icml/GongY15 , the proximal SVRG based method AsyProx-SVRG DBLP:conf/aaai/MengCYWML17 , the proximal SDCA based method PROXCOCOA DBLP:journals/corr/SmithFJJ15 , and distributed block coordinate descent DBCD DBLP:journals/jmlr/MahajanKS17 . FISTA and mOWL-QN are serial. We design distributed versions of them, in which workers distributively compute the gradients and the master then gathers the gradients from workers for the parameter update.

All methods use 8 workers. One master will be used if necessary. Unless otherwise stated, all methods except DBCD and PROXCOCOA use the same data partition, which is obtained by uniformly assigning each instance to a worker (uniform partition). Hence, different workers will have almost the same number of instances. This uniform partition strategy satisfies the condition in Lemma 2. Hence, it is a good partition. DBCD and PROXCOCOA adopt a coordinate distributed strategy to partition the data.

### 7.2 Results

The convergence results of LR with elastic net and Lasso regression are shown in Figure 1. DBCD is too slow, and hence we will separately report the time of it and pSCOPE when they get -suboptimal solution in Table 2. AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present the results of it on the datasets cov and rcv1. From Figure 1 and Table 2, we can find that pSCOPE outperforms all the other baselines on all datasets.

### 7.3 Speedup

We also evaluate the speedup of pSCOPE on the four datasets for LR. We run pSCOPE and stop it when the gap . The speedup is defined as: . We set . The speedup results are in Figure 2 (a). We can find that pSCOPE gets promising speedup.

### 7.4 Effect of Data Partition

We evaluate pSCOPE under different data partitions. We use two datasets cov and rcv1 for illustration, since they are balanced datasets which means the number of positive instances is almost the same as that of negative instances. For each dataset, we construct four data partitions:  (Each worker has access to the whole data),  (Uniform partition);  (75% positive instances and 25% negative instances are on the first 4 workers, and other instances are on the last 4 workers),  (All positive instances are on the first 4 workers, and all negative instances are on the last 4 workers).
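The label-skewed partitions described above can be generated with a helper like the following. This is a sketch under assumptions: the function name, the `pos_frac_first_half` parameter, and the choice to spread instances evenly within each half of the workers are illustrative, since the paper only specifies which labels go to the first and last 4 workers.

```python
import random


def skewed_partition(labels, p=8, pos_frac_first_half=0.75, seed=0):
    """Split instance indices across p workers with label skew between the two halves.

    A fraction pos_frac_first_half of the positive instances and the complementary
    fraction of the negative instances go to the first p // 2 workers; the rest go
    to the remaining workers.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y > 0]
    neg = [i for i, y in enumerate(labels) if y <= 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos = round(pos_frac_first_half * len(pos))
    n_neg = round((1.0 - pos_frac_first_half) * len(neg))
    first = pos[:n_pos] + neg[:n_neg]
    second = pos[n_pos:] + neg[n_neg:]
    rng.shuffle(first)
    rng.shuffle(second)
    half = p // 2
    # Spread each group evenly over its half of the workers (an assumption).
    parts = [first[k::half] for k in range(half)]
    parts += [second[k::(p - half)] for k in range(p - half)]
    return parts


labels = [1] * 400 + [-1] * 400
parts_75_25 = skewed_partition(labels, p=8, pos_frac_first_half=0.75)
parts_separated = skewed_partition(labels, p=8, pos_frac_first_half=1.0)
```

Setting `pos_frac_first_half=1.0` reproduces the fully separated case, while `0.75` gives the 75%/25% split; the case in which every worker sees the whole data set needs no partitioning code.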

The convergence results are shown in Figure 2 (b). We can see that the data partition does affect the convergence of pSCOPE. The partition achieves the best performance, which verifies the theory in this paper. (Footnote 3: The proof of that is the best partition and is in the appendix.) The performance of the uniform partition is similar to that of the best partition , and is better than the other two data partitions. In real applications with large-scale datasets, it is impractical to assign each worker the whole dataset. Hence, we prefer to choose the uniform partition in real applications, which is also adopted in the above experiments of this paper.

## 8 Conclusion

In this paper, we propose a novel method, called pSCOPE, for distributed sparse learning. Furthermore, we theoretically analyze how the data partition affects the convergence of pSCOPE. pSCOPE is both communication and computation efficient. Experiments on real data show that pSCOPE can outperform other state-of-the-art methods to achieve the best performance.

## References

• (1) Necdet S. Aybat, Zi Wang, and Garud Iyengar. An asynchronous distributed proximal gradient method for composite convex optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 2454–2462, 2015.
• (2) Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
• (3) Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning, pages 321–328, 2011.
• (4) Richard H. Byrd, S. L. Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
• (5) Soham De and Tom Goldstein. Efficient distributed SGD with variance reduction. In Proceedings of the 16th IEEE International Conference on Data Mining, pages 111–120, 2016.
• (6) John C. Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
• (7) Olivier Fercoq and Peter Richtárik. Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent. SIAM Review, 58(4):739–771, 2016.
• (8) Pinghua Gong and Jieping Ye. A modified orthant-wise limited memory quasi-newton method with convergence analysis. In Proceedings of the 32nd International Conference on Machine Learning, pages 276–284, 2015.
• (9) Zhouyuan Huo, Bin Gu, and Heng Huang. Decoupled asynchronous proximal stochastic gradient descent with variance reduction. CoRR, abs/1609.06804, 2016.
• (10) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
• (11) Sham M. Kakade and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. CoRR, abs/0910.0610, 2009.
• (12) John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems, pages 905–912, 2008.
• (13) Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: asynchronous parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 46–54, 2017.
• (14) Jason D. Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research, 18:122:1–122:43, 2017.
• (15) Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, pages 583–598, 2014.
• (16) Yitan Li, Linli Xu, Xiaowei Zhong, and Qing Ling. Make workers work harder: decoupled asynchronous proximal stochastic gradient descent. CoRR, abs/1605.06619, 2016.
• (17) Dhruv Mahajan, S. Sathiya Keerthi, and S. Sundararajan. A distributed block coordinate descent method for training L1 regularized linear classifiers. Journal of Machine Learning Research, 18:91:1–91:35, 2017.
• (18) Qi Meng, Wei Chen, Jingcheng Yu, Taifeng Wang, Zhiming Ma, and Tie-Yan Liu. Asynchronous stochastic proximal optimization algorithms with variance reduction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2329–2335, 2017.
• (19) Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
• (20) Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2647–2655, 2015.
• (21) Peter Richtárik and Martin Takác. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.
• (22) Chad Scherrer, Mahantesh Halappanavar, Ambuj Tewari, and David Haglin. Scaling up coordinate descent algorithms for large L1 regularization problems. In Proceedings of the 29th International Conference on Machine Learning, pages 1407–1414, 2012.
• (23) Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
• (24) Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33rd International Conference on Machine Learning, pages 747–754, 2016.
• (25) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Proceedings of the 22nd Conference on Learning Theory, 2009.
• (26) Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
• (27) Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, pages 64–72, 2014.
• (28) Ziqiang Shi and Rujie Liu. Large scale optimization with proximal stochastic newton-type gradient descent. In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 691–704, 2015.
• (29) Virginia Smith, Simone Forte, Michael I. Jordan, and Martin Jaggi. L1-regularized distributed optimization: A communication-efficient primal-dual framework. CoRR, abs/1512.04011, 2015.
• (30) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
• (31) Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
• (32) Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, pages 3636–3645, 2017.
• (33) Tong T. Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244, 2008.
• (34) Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• (35) Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21st International Conference on Knowledge Discovery and Data Mining, pages 1335–1344, 2015.
• (36) Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, and Wu-Jun Li. SCOPE: scalable composite optimization for learning on Spark. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2928–2934, 2017.
• (37) Shen-Yi Zhao, Gong-Duo Zhang, Ming-Wei Li, and Wu-Jun Li. Proximal SCOPE for distributed sparse learning: Better data partition implies faster convergence rate. CoRR, abs/1803.05621, 2018.
• (38) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

## Appendix A Effect of Data Partition

### A.1 Proof of Lemma 1

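For any partition $\pi$ w.r.t. $P(\cdot)$ and any $a$, $l_\pi(a) \ge l_\pi(w^*) = 0$, and $l_\pi(a) = P(w^*) + \frac{1}{p}\sum_{k=1}^{p} H_k^*(-G_k(a))$, where $H_k^*(\cdot)$ is the conjugate function of $\phi_k(\cdot) + R(\cdot)$.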

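Proof  According to Definition 4 and $P(w) = \frac{1}{p}\sum_{k=1}^{p} P_k(w;a)$, we have

$$l_\pi(a) = P(w^*) - \frac{1}{p}\sum_{k=1}^{p} P_k(w_k^*(a);a) = \frac{1}{p}\sum_{k=1}^{p}\big(P_k(w^*;a) - P_k(w_k^*(a);a)\big) \ge 0.$$

On the other hand, let $a = w^*$, then $\phi_k(w;w^*) = \phi_k(w) + G_k(w^*)^T w$, and

$$\nabla\phi_k(w^*;w^*) + \zeta = \nabla\phi_k(w^*) + \nabla F(w^*) - \nabla\phi_k(w^*) + \zeta = \nabla F(w^*) + \zeta = 0,$$

where $\zeta \in \partial R(w^*)$ so that $\nabla F(w^*) + \zeta = 0$. It implies that $0 \in \partial P_k(w^*;w^*)$. Due to the strong convexity of $P_k(w;w^*)$ w.r.t. $w$, we have $w_k^*(w^*) = w^*$, which means $l_\pi(w^*) = 0$.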

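For the dual form, according to (6) and the definition of $H_k^*(\cdot)$, we have

$$P_k(w_k^*(a);a) = \min_{w}\big(\phi_k(w) + R(w) + G_k(a)^T w\big) = -\max_{w}\big(-G_k(a)^T w - (\phi_k(w) + R(w))\big) = -H_k^*(-G_k(a)).$$

Then we have

$$l_\pi(a) = P(w^*) + \frac{1}{p}\sum_{k=1}^{p} H_k^*(-G_k(a)).$$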

### A.2 Proof of Theorem 1

#### A.2.1 Warm up: quadratic function

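$$\phi_k(w) = \frac{1}{2} m_k w^2 + b_k w + c_k,$$

and

$$F(w) = \frac{1}{2} m w^2 + b w + c, \qquad R(w) = |w|,$$

where $m_k > 0$, and $m = \frac{1}{p}\sum_{k=1}^{p} m_k$, $b = \frac{1}{p}\sum_{k=1}^{p} b_k$, $c = \frac{1}{p}\sum_{k=1}^{p} c_k$ so that $F(w) = \frac{1}{p}\sum_{k=1}^{p}\phi_k(w)$. The corresponding $P_k(w;a)$ is defined as

$$P_k(w;a) = \frac{1}{2} m_k w^2 + b_k w + c_k + (ma + b - m_k a - b_k) w + |w| = \frac{1}{2} m_k w^2 + (ma + b - m_k a) w + c_k + |w|. \qquad (8)$$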

Then we have the following lemma:

###### Lemma 4

With $P_k(w;a)$ defined above,

$$P(w^*) - \frac{1}{p}\sum_{k=1}^{p} P_k(w_k^*(a);a) \le \gamma (a - w^*)^2, \quad \forall a \in \mathbb{R},$$

where .

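Proof  For convenience, we define three sets

$$K_1(a) = \{k \mid (m - m_k)a + b < -1\}, \quad K_2(a) = \{k \mid (m - m_k)a + b \in [-1, 1]\}, \quad K_3(a) = \{k \mid (m - m_k)a + b > 1\}.$$

Then it is easy to find that

• if $k \in K_1(a)$, then $w_k^*(a) = -\frac{(m - m_k)a + b + 1}{m_k}$, $P_k(w_k^*(a);a) = c_k - \frac{[(m - m_k)a + b + 1]^2}{2m_k}$;

• if $k \in K_2(a)$, then $w_k^*(a) = 0$, $P_k(w_k^*(a);a) = c_k$;

• if $k \in K_3(a)$, then $w_k^*(a) = -\frac{(m - m_k)a + b - 1}{m_k}$, $P_k(w_k^*(a);a) = c_k - \frac{[(m - m_k)a + b - 1]^2}{2m_k}$.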

Now we calculate the Local-Global Gap.

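Firstly we consider the case that $w^* > 0$. We have $b < -1$, $w^* = -\frac{b+1}{m}$, and

$$\begin{aligned} l_\pi(a) &= -\frac{(b+1)^2}{2m} - \frac{1}{p}\Big[\sum_{k \in K_1(a)} -\frac{[(m - m_k)a + b + 1]^2}{2m_k} + \sum_{k \in K_3(a)} -\frac{[(m - m_k)a + b - 1]^2}{2m_k}\Big] \\ &= -\frac{m (w^*)^2}{2} + \frac{1}{p}\Big[\sum_{k \in K_1(a)} \frac{[(m - m_k)a - m w^*]^2}{2m_k} + \sum_{k \in K_3(a)} \frac{[(m - m_k)a - m w^* - 2]^2}{2m_k}\Big]. \end{aligned}$$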

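For $k \in K_3(a)$, we have

$$\begin{aligned} [(m - m_k)a - m w^* - 2]^2 &= [(m - m_k)a - m w^*]^2 - 4[(m - m_k)a - m w^*] + 4 \\ &= [(m - m_k)a - m w^*]^2 - 4[(m - m_k)a + b + 1] + 4 \\ &\le [(m - m_k)a - m w^*]^2. \end{aligned}$$

Then we have

$$\begin{aligned} l_\pi(a) &\le -\frac{m (w^*)^2}{2} + \frac{1}{p}\Big[\sum_{k \in K_1(a)} \frac{[(m - m_k)a - m w^*]^2}{2m_k} + \sum_{k \in K_3(a)} \frac{[(m - m_k)a - m w^*]^2}{2m_k}\Big] \\ &\le -\frac{m (w^*)^2}{2} + \frac{1}{p}\sum_{k=1}^{p} \frac{[(m - m_k)a - m w^*]^2}{2m_k} \\ &= \frac{1}{p}\sum_{k=1}^{p} \frac{(m - m_k)^2}{2m_k}(a - w^*)^2. \end{aligned}$$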
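The final equality in the chain above is not immediate. A short expansion, assuming $m = \frac{1}{p}\sum_{k=1}^{p} m_k$ (so that $\frac{1}{p}\sum_{k}(m - m_k) = 0$), shows why the $-\frac{m (w^*)^2}{2}$ term cancels and only a multiple of $(a - w^*)^2$ remains:

```latex
\begin{aligned}
[(m-m_k)a - m w^*]^2
  &= [(m-m_k)(a-w^*) - m_k w^*]^2 \\
  &= (m-m_k)^2 (a-w^*)^2 - 2 m_k (m-m_k)(a-w^*)\, w^* + m_k^2 (w^*)^2, \\
\frac{1}{p}\sum_{k=1}^{p} \frac{[(m-m_k)a - m w^*]^2}{2 m_k}
  &= \frac{1}{p}\sum_{k=1}^{p} \frac{(m-m_k)^2}{2 m_k}(a-w^*)^2
   - (a-w^*)\, w^* \cdot \frac{1}{p}\sum_{k=1}^{p}(m-m_k)
   + \frac{m (w^*)^2}{2} \\
  &= \frac{1}{p}\sum_{k=1}^{p} \frac{(m-m_k)^2}{2 m_k}(a-w^*)^2 + \frac{m (w^*)^2}{2}.
\end{aligned}
```

Subtracting $\frac{m (w^*)^2}{2}$ from both sides gives the stated bound with coefficient $\frac{1}{p}\sum_{k=1}^{p}\frac{(m-m_k)^2}{2m_k}$ for this case.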

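Secondly we consider the case that $w^* = 0$. We have $b \in [-1, 1]$, $P(w^*) = c$, and

$$l_\pi(a) = \frac{1}{p}\Big[\sum_{k \in K_1(a)} \frac{[(m - m_k)a + b + 1]^2}{2m_k} + \sum_{k \in K_3(a)} \frac{[(m - m_k)a + b - 1]^2}{2m_k}\Big].$$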

For , we have , which means that