Feature-Distributed SVRG for High-Dimensional Linear Classification

02/10/2018
by Gong-Duo Zhang, et al.

Linear classification has been widely used in many high-dimensional applications like text classification. To perform linear classification for large-scale tasks, we often need to design distributed learning methods on a cluster of multiple machines. In this paper, we propose a new distributed learning method, called feature-distributed stochastic variance reduced gradient (FD-SVRG) for high-dimensional linear classification. Unlike most existing distributed learning methods which are instance-distributed, FD-SVRG is feature-distributed. FD-SVRG has lower communication cost than other instance-distributed methods when the data dimensionality is larger than the number of data instances. Experimental results on real data demonstrate that FD-SVRG can outperform other state-of-the-art distributed methods for high-dimensional linear classification in terms of both communication cost and wall-clock time, when the dimensionality is larger than the number of instances in training data.



1 Introduction

Linear classification models, such as logistic regression (LR) and linear support vector machine (SVM), can achieve good performance in many high-dimensional applications like text classification. When the training set is too large to be handled by a single machine (node), we often need to design distributed learning methods on a cluster of multiple machines. Hence, it has become an interesting topic to design distributed linear classification models for large-scale tasks with high dimensionality.

For large-scale linear classification problems, stochastic gradient descent (SGD) and its variants like stochastic average gradient (SAG) 

(Schmidt et al., 2017), SAGA (Defazio et al., 2014), stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz & Zhang, 2013, 2014) and stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013) have shown promising performance in real applications. Hence, most existing distributed learning methods adopt SGD or its variants for updating (learning) the parameters of the linear classification models. Representatives include PSGD (Zinkevich et al., 2010), DC-ASGD (Zheng et al., 2017), DisDCA (Yang, 2013), CoCoA (Jaggi et al., 2014), CoCoA+ (Ma et al., 2015) and DSVRG (Lee et al., 2017).

According to the organization of the cluster, existing distributed learning methods can be divided into three main categories. The first category is based on the master-slave framework, which has one master node (machine) and several slave nodes. In general, the model parameter is stored on the master node, and the data is distributively stored on the slave nodes. The master node is responsible for updating the model parameter, and the slave nodes are responsible for computing the gradient or stochastic gradient. This is a centralized framework. One representative of this category is MLlib on Spark (Meng et al., 2016). The bottleneck of this kind of centralized framework is the high communication cost on the central (master) node (Lian et al., 2017).

The second category is based on the Parameter Server framework (Li et al., 2014a, b; Xing et al., 2015), which has two kinds of nodes, called Servers and Workers. The Servers are used to store and update the model parameter, and the Workers are used to distributively store the data and compute the gradient or stochastic gradient. PS-Lite (Li et al., 2014a) and Petuum (Xing et al., 2015) are two representatives of this category. (PS-Lite is called Parameter Server in the original paper (Li et al., 2014a). In this paper, we use Parameter Server to denote the general framework and PS-Lite to denote the specific Parameter Server platform of (Li et al., 2014a); PS-Lite can be downloaded from https://github.com/dmlc/ps-lite.) Parameter Server is also a centralized framework. Unlike the master-slave framework, which has only one master node for the model parameter, Parameter Server can use multiple Servers to distributively store and update the model parameter, and hence can relieve the communication burden on the central nodes (Servers). However, there is still frequent communication of gradients and parameters between Servers and Workers, because the parameters and data are stored separately on Servers and Workers in the Parameter Server framework.

The third category is the decentralized framework, in which there are only workers and no central nodes (servers). The data is distributively stored on all the workers, and all workers store and update (learn) the model parameter. D-PSGD (Lian et al., 2017) and DSVRG (Lee et al., 2017) are two representatives of this category. D-PSGD is a distributed SGD method that abandons the central node and needs much less communication on the busiest node compared to centralized frameworks, but it still needs to communicate parameter vectors frequently between workers. DSVRG is a distributed SVRG method with a ring topology. Because the convergence rate of SVRG is much faster than that of SGD (Johnson & Zhang, 2013), DSVRG also converges much faster than other SGD-based distributed methods. Furthermore, the decentralized framework of DSVRG avoids the communication bottleneck of the centralized frameworks. Hence, DSVRG has achieved promising performance for learning linear classification models.

Most existing distributed learning methods, including all the centralized and decentralized methods mentioned above, are instance-distributed, which partition the training data by instances. These instance-distributed methods have achieved promising performance in large-scale problems when the number of instances is larger than dimensionality (the number of features). In some real applications like web mining, astronomical projects, financial and biotechnology applications, the dimensionality can be larger than the number of instances (Negahban et al., 2012). For these cases, the communication cost of instance-distributed methods is typically high, because they often need to communicate the high-dimensional parameter vectors or gradient vectors among different machines.

In this paper, we propose a new distributed SVRG method, called feature-distributed SVRG (FD-SVRG), for high-dimensional linear classification. The contributions of FD-SVRG are briefly listed as follows:

  • Unlike most existing distributed learning methods which are instance-distributed, FD-SVRG is feature-distributed.

  • FD-SVRG has the same convergence rate as the non-distributed (serial) SVRG, while the parameters in FD-SVRG can be distributively learned.

  • FD-SVRG has lower communication cost than other instance-distributed methods when the dimensionality is larger than the number of instances.

  • Experimental results on real data demonstrate that FD-SVRG can outperform other state-of-the-art distributed SVRG methods in terms of both communication cost and wall-clock time, when the dimensionality is larger than the number of instances in training data.

  • In particular, compared with the Parameter Server platform PS-Lite, which has been widely used in both academia and industry, FD-SVRG is several orders of magnitude faster.

Please note that our feature-distributed framework is not only applicable to SVRG; it can also be applied to SGD and other variants. Furthermore, it can also be used for regression and other linear models. Due to space limitations, we only focus on SVRG-based linear classification here and leave other variants for future study.

2 Problem Formulation

Although there exist different formulations for the linear classification problem, this paper focuses on the most popular formulation, shown as follows:

$$\min_{\mathbf{w}} \; P(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{w}), \qquad (1)$$
$$f_i(\mathbf{w}) = \xi(\mathbf{w}; \mathbf{x}_i, y_i) + R(\mathbf{w}), \qquad (2)$$

where $n$ is the number of instances in the training set, $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector of instance $i$, $y_i \in \{-1, +1\}$ is the class label of instance $i$, $d$ is the dimensionality (number of features) of the instances, $\mathbf{w} \in \mathbb{R}^d$ is the parameter to learn, $f_i(\mathbf{w})$ is the loss defined on instance $i$ with $\xi(\cdot)$ being the underlying loss function, and $R(\mathbf{w})$ is a regularization function. Here, we only focus on two-class problems, but the techniques in this paper can also be adapted for multi-class problems, which we omit to save space.

Many popular linear classification models can be formulated in the form of (1). For example, in logistic regression (LR), $f_i(\mathbf{w}) = \log\left(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$, where $\log\left(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right)$ is the logistic loss and $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$ is the $L_2$-norm regularization function with a hyper-parameter $\lambda$. In linear SVM, $f_i(\mathbf{w}) = \max\left(0,\, 1 - y_i \mathbf{w}^T \mathbf{x}_i\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$.
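To make this formulation concrete, the following NumPy sketch evaluates the LR objective in (1) and the gradient of a single component $f_i(\mathbf{w})$. It is only an illustration under the reconstruction above; the function and variable names (P, grad_fi, lam) are ours, not from the paper's code.

```python
import numpy as np

def P(w, X, y, lam):
    """Objective (1) with f_i(w) = log(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2."""
    margins = y * (X.T @ w)                 # X is d x n, so X.T @ w gives all n inner products
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)

def grad_fi(w, x_i, y_i, lam):
    """Gradient of one component f_i(w), the quantity stochastic methods sample."""
    s = 1.0 / (1.0 + np.exp(y_i * (w @ x_i)))   # equals sigmoid(-y_i w^T x_i)
    return -s * y_i * x_i + lam * w

# Tiny usage example: d = 5 features, n = 4 instances.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
y = np.array([1.0, -1.0, 1.0, -1.0])
w = np.zeros(5)
print(P(w, X, y, lam=1e-4))                 # equals log(2) at w = 0
print(grad_fi(w, X[:, 0], y[0], lam=1e-4))
```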

In many real applications, the training set can be too large to be handled by a single machine (node). Hence, we need to design distributed learning methods to learn (optimize) the parameter $\mathbf{w}$ on a cluster of multiple machines (nodes). In some applications, $n$ can be larger than $d$, while in other applications, $d$ can be larger than $n$. In this paper, we focus on the case where $d$ is larger than $n$, which has attracted much attention in recent years (Chandrasekaran et al., 2012; Negahban et al., 2012; Wang et al., 2016; Sivakumar & Banerjee, 2017).

3 Related Work

As stated in Section 1, there exist three main categories of distributed learning methods for the problem in (1). We briefly introduce these existing methods to motivate the contribution of this paper. Because the master-slave framework can be seen as a special case of Parameter Server with only one Server, we only introduce the Parameter Server framework and the decentralized framework.

3.1 Parameter Server

The Parameter Server framework (Li et al., 2014a; Xing et al., 2015) is illustrated in Figure 1, in which there are $q$ Servers and $p$ Workers. Let $\mathbf{X} \in \mathbb{R}^{d \times n}$ denote the training data matrix, where the $i$-th column of $\mathbf{X}$ is the $i$-th instance $\mathbf{x}_i$. In Parameter Server, the whole training set is instance-distributed, which means that $\mathbf{X}$ is partitioned vertically (by instance) into $p$ subsets $\{\mathbf{X}_1, \ldots, \mathbf{X}_p\}$, and $\mathbf{X}_k$ is assigned to Worker_$k$. The parameter $\mathbf{w}$ is cut into $q$ parts $\{\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(q)}\}$, and $\mathbf{w}^{(j)}$ is assigned to Server_$j$. Servers are responsible for updating parameters, and Workers are responsible for computing gradients.


Figure 1: Parameter Server framework.

The communication is done by pull and push operations. That is to say, the Workers pull parameters from the Servers and push gradients to the Servers. When the number of Workers becomes larger, there are more machines computing gradients at the same time, and the pull and push requests arrive at the Servers more frequently. Although PS-Lite and Petuum can use sparse data structures to decrease the communication cost for sparse data, the total communication cost is still high because of the frequent communication. Adding more Servers makes the parameter segment on each Server smaller, but it cannot reduce the number of communication requests to the Servers. Furthermore, as the number of Servers increases, each Worker needs to cut the parameter into more segments and communicate more frequently. For the SVRG-based methods, such as mini-batch based synchronous distributed SVRG (SynSVRG) and asynchronous SVRG (AsySVRG) (Reddi et al., 2015; Zhao & Li, 2016), we need to communicate a dense full gradient vector in each epoch, and hence the communication cost is also high. (Many existing AsySVRG methods, such as those in (Reddi et al., 2015; Zhao & Li, 2016), were initially proposed for multi-thread systems with a shared memory, but these methods can be easily extended to a cluster of multiple machines to get a distributed version. The distributed SVRG methods, including SynSVRG and AsySVRG, implemented with Parameter Server can be found in Appendix B of the supplementary material.)
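To make the pull/push interaction concrete, here is a minimal single-process sketch of the pattern described above. It is not the PS-Lite API; all class and method names are illustrative, and the regularization term is omitted for brevity.

```python
import numpy as np

class Server:
    """Holds one shard of the parameter vector and applies pushed gradient shards."""
    def __init__(self, dim, lr):
        self.w_shard = np.zeros(dim)
        self.lr = lr
    def pull(self):
        return self.w_shard.copy()
    def push(self, grad_shard):
        self.w_shard -= self.lr * grad_shard

class Worker:
    """Holds a subset of instances; computes a stochastic gradient on the full parameter."""
    def __init__(self, X_local, y_local):
        self.X, self.y = X_local, y_local             # d x n_local and n_local
    def stochastic_grad(self, w):
        i = np.random.randint(self.X.shape[1])
        x, yi = self.X[:, i], self.y[i]
        s = 1.0 / (1.0 + np.exp(yi * (w @ x)))        # logistic loss derivative (no regularizer)
        return -s * yi * x

# One synchronous round: 2 Servers (parameter split in half) and 2 Workers (instances split).
d, n = 6, 8
X = np.random.randn(d, n); y = np.sign(np.random.randn(n))
servers = [Server(d // 2, lr=0.1), Server(d - d // 2, lr=0.1)]
workers = [Worker(X[:, :4], y[:4]), Worker(X[:, 4:], y[4:])]
for wk in workers:
    w_full = np.concatenate([s.pull() for s in servers])      # pull: d scalars per Worker
    g = wk.stochastic_grad(w_full)
    servers[0].push(g[:d // 2]); servers[1].push(g[d // 2:])  # push: d scalars per Worker
```

Note how every pull and every push moves vectors of total length $d$, which is exactly the cost that becomes problematic when $d$ is large.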

3.2 Decentralized Framework

Parameter Server is a centralized framework, in which the central nodes (Servers) become the busiest machines because they have to handle the heavy communication from the Workers, especially when the number of Workers is large. Recently, decentralized frameworks (Lian et al., 2017; Lee et al., 2017) have been proposed to avoid the communication traffic jam on the busiest machines. This framework is illustrated in Figure 2. There are no Servers in the decentralized methods. Each machine (Worker) is assigned a subset of the training instances. Based on its local subset of training instances, each Worker computes gradients and updates its local parameter. Then the machines communicate parameters with each other. EXTRA (Shi et al., 2015), D-PSGD (Lian et al., 2017) and DSVRG (Lee et al., 2017) are representatives of this kind of decentralized method. Although the communication cost of decentralized methods is balanced among Workers, these methods need to communicate dense parameter vectors, even if the data is sparse. When the data is high-dimensional, the communication cost is also very high.


Figure 2: Decentralized framework.

4 Feature-Distributed SVRG

Most existing distributed learning methods, including all the centralized and decentralized methods introduced in Section 3, are instance-distributed, which partition the training data by instances. When the dimensionality is larger than the number of instances, i.e., $d > n$, the communication cost of instance-distributed methods is typically high, because they often need to communicate high-dimensional parameter vectors or gradient vectors of length $d$ among different machines.

In this section, we present our new method, called FD-SVRG, for high-dimensional linear classification. FD-SVRG is feature-distributed, which means it partitions the training data by features. For the training data matrix $\mathbf{X}$, the difference between the instance-distributed partition and the feature-distributed partition is illustrated in Figure 3, where the upper-right shows the feature-distributed partition and the lower-right shows the instance-distributed partition.
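The following small NumPy snippet illustrates the two partition schemes in Figure 3 for a $d \times n$ data matrix (columns are instances, rows are features); the variable names are illustrative.

```python
import numpy as np

d, n, p = 6, 8, 2                                # p machines
X = np.arange(d * n).reshape(d, n)               # d x n data matrix

# Instance-distributed: split X vertically (by columns/instances).
instance_shards = np.array_split(X, p, axis=1)   # each shard is d x (n/p)

# Feature-distributed (FD-SVRG): split X horizontally (by rows/features).
feature_shards = np.array_split(X, p, axis=0)    # each shard is (d/p) x n

print([s.shape for s in instance_shards])        # [(6, 4), (6, 4)]
print([s.shape for s in feature_shards])         # [(3, 8), (3, 8)]
```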

FD-SVRG is based on SVRG, which is much faster than SGD (Johnson & Zhang, 2013). For ease of understanding, we briefly present the original non-distributed (serial) SVRG in Appendix A of the supplementary material.


Figure 3: The difference between instance-distributed and feature-distributed.

Figure 4: Framework of FD-SVRG.

4.1 Framework

Figure 4 shows the distributed framework of FD-SVRG, in which there are $p$ Workers and one Coordinator. FD-SVRG is feature-distributed. More specifically, the training data matrix $\mathbf{X}$ is partitioned horizontally (by features) into $p$ parts $\{\mathbf{X}_1, \ldots, \mathbf{X}_p\}$, and $\mathbf{X}_k \in \mathbb{R}^{d_k \times n}$ is stored on Worker_$k$. Here, $\sum_{k=1}^{p} d_k = d$. The parameter $\mathbf{w}$ is also partitioned into $p$ parts $\{\mathbf{w}_1, \ldots, \mathbf{w}_p\}$, with $\mathbf{w}_k \in \mathbb{R}^{d_k}$. The features of $\mathbf{w}_k$ correspond to the features of $\mathbf{X}_k$. $\mathbf{w}_k$ is also stored on Worker_$k$.

4.2 Learning Algorithm

First, we rewrite $f_i(\mathbf{w})$ in (1) as follows:

$$f_i(\mathbf{w}) = \xi(\mathbf{w}^T \mathbf{x}_i; y_i) + \sum_{k=1}^{p} R_k(\mathbf{w}_k), \qquad (3)$$

where $R_k(\mathbf{w}_k)$ is the regularization function defined on $\mathbf{w}_k$. It is easy to find that if $R(\mathbf{w})$ is the $L_1$ or $L_2$ norm, $R_k(\mathbf{w}_k)$ is also the $L_1$ or $L_2$ norm.

Then we can get the gradient:

$$\nabla f_i(\mathbf{w}) = \xi'(\mathbf{w}^T \mathbf{x}_i; y_i)\, \mathbf{x}_i + \nabla R(\mathbf{w}). \qquad (4)$$

We can find that the main computation of the gradient is to calculate the inner products $\mathbf{w}^T \mathbf{x}_i$ and $\tilde{\mathbf{w}}^T \mathbf{x}_i$, where $\tilde{\mathbf{w}}$ denotes the snapshot parameter of SVRG. Since

$$\mathbf{w}^T \mathbf{x}_i = \sum_{k=1}^{p} \mathbf{w}_k^T \mathbf{x}_{i,k},$$

where $\mathbf{x}_{i,k}$ denotes the sub-vector of $\mathbf{x}_i$ stored on Worker_$k$,

we can distributively complete the computation.
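This decomposition can be checked numerically: with a feature partition, the global inner product is just the sum of the Workers' local partial inner products. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

d, n, p = 10, 5, 4
X = np.random.randn(d, n)
w = np.random.randn(d)

row_blocks = np.array_split(np.arange(d), p)             # feature partition across p Workers
partial = [w[rows] @ X[rows, :] for rows in row_blocks]  # each Worker: n local partial products
global_inner = sum(partial)                              # what the tree-sum would produce

assert np.allclose(global_inner, w @ X)                  # equals w^T x_i for every instance i
```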


Figure 5: An example of tree-structured global sum with 4 Workers.

The whole learning algorithm of FD-SVRG is shown in Algorithm 1. The Coordinator is used to help sum the local inner products $\mathbf{w}_k^T \mathbf{x}_{i,k}$ from all Workers, where $\mathbf{x}_{i,k}$ is the $i$-th column of $\mathbf{X}_k$. Here, we use a tree-structured communication (reduce) scheme to get the global sum. An example of the tree-structured global sum with 4 Workers is shown in Figure 5. We pair the Workers so that while Worker_1 adds the result from Worker_2, Worker_3 can add the result from Worker_4 simultaneously. After the Coordinator has computed the sum, it broadcasts the sum to all Workers in a reverse-order tree structure. A similar tree structure can be constructed for more Workers. This scheme is faster than the strategy in which all Workers send their results directly to the Coordinator for summation, especially when the number of Workers is large.
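The following sketch simulates the tree-structured sum of Figure 5 in a single process: pairs of Workers are reduced level by level, and the Coordinator's result is then broadcast back down. It illustrates the communication pattern only; it is not the paper's implementation.

```python
def tree_allsum(local_values):
    """Return the global sum as every Worker would receive it after the broadcast."""
    level = list(local_values)
    while len(level) > 1:                        # reduce phase: halve the senders each round
        nxt = []
        for j in range(0, len(level) - 1, 2):
            nxt.append(level[j] + level[j + 1])  # Worker j adds the value from Worker j+1
        if len(level) % 2 == 1:
            nxt.append(level[-1])                # odd one out is carried up unchanged
        level = nxt
    total = level[0]                             # the Coordinator now holds the global sum
    return [total] * len(local_values)           # broadcast in reverse tree order

print(tree_allsum([1.5, 2.0, -0.5, 3.0]))        # [6.0, 6.0, 6.0, 6.0]
```

With 4 Workers this pattern moves 4 scalars up and 4 scalars down, i.e., 8 messages, matching the solid arrows in Figure 5.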

1:  Initialize $\tilde{\mathbf{w}}_k^0$, step size $\eta$;
2:  for $s = 1, 2, \ldots$ do
3:     Compute the local inner products $\tilde{u}_{i,k} = (\tilde{\mathbf{w}}_k^{s-1})^T \mathbf{x}_{i,k}$ for $i = 1, \ldots, n$;
4:     Use tree-structured communication scheme to obtain: $\tilde{u}_i = \sum_{k'=1}^{p} \tilde{u}_{i,k'}$ for $i = 1, \ldots, n$;
5:     Compute the full gradient: $\tilde{\mathbf{g}}_k = \frac{1}{n}\sum_{i=1}^{n} \xi'(\tilde{u}_i; y_i)\, \mathbf{x}_{i,k} + \nabla R_k(\tilde{\mathbf{w}}_k^{s-1})$;
6:     Let $\mathbf{w}_k^0 = \tilde{\mathbf{w}}_k^{s-1}$;
7:     for $t = 1, 2, \ldots, m$ do
8:         Pick up an instance from the local data with index $i_t$;
9:         Compute $u_{i_t,k} = (\mathbf{w}_k^{t-1})^T \mathbf{x}_{i_t,k}$;
10:         Use tree-structured communication scheme to obtain: $u_{i_t} = \sum_{k'=1}^{p} u_{i_t,k'}$;
11:         $\mathbf{w}_k^t = \mathbf{w}_k^{t-1} - \eta \left[ \xi'(u_{i_t}; y_{i_t})\, \mathbf{x}_{i_t,k} + \nabla R_k(\mathbf{w}_k^{t-1}) - \xi'(\tilde{u}_{i_t}; y_{i_t})\, \mathbf{x}_{i_t,k} - \nabla R_k(\tilde{\mathbf{w}}_k^{s-1}) + \tilde{\mathbf{g}}_k \right]$;
12:     end for
13:     Set $\tilde{\mathbf{w}}_k^s = \mathbf{w}_k^m$;
14:  end for
Algorithm 1 Feature-Distributed SVRG (FD-SVRG) on the $k$-th Worker

When computing the full gradient, the inner products of all the data instances are computed only once per outer iteration, because the snapshot parameter $\tilde{\mathbf{w}}^{s-1}$ is constant within each outer iteration $s$. In line 11 of Algorithm 1, only $u_{i_t}$ needs to be received from the Coordinator. The Worker does not need to receive $\tilde{u}_{i_t}$ again in the inner iterations, because the Worker has already received all the $\tilde{u}_i$ when computing the full gradient for outer iteration $s$.
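Putting the pieces together, the following single-process simulation follows the structure of Algorithm 1 as reconstructed above: each simulated Worker owns a feature block $(\mathbf{X}_k, \mathbf{w}_k)$, and only scalar inner products are "exchanged" (a plain Python sum stands in for the tree-structured reduce). It uses the logistic loss with $L_2$ regularization and illustrative names and hyper-parameter values; it is a sketch, not the authors' code.

```python
import numpy as np

def fd_svrg(X, y, lam=1e-4, eta=0.02, p=4, outer=30, m=None):
    """Simulate FD-SVRG with p feature shards; only inner-product scalars cross shards."""
    d, n = X.shape
    m = m or n
    blocks = np.array_split(np.arange(d), p)          # feature partition across p Workers
    w = [np.zeros(len(b)) for b in blocks]            # w_k stored on Worker_k
    rng = np.random.default_rng(0)

    def dloss(u, yi):
        # derivative of log(1 + exp(-y*u)) w.r.t. u, clipped to avoid overflow
        return -yi / (1.0 + np.exp(np.clip(yi * u, -500, 500)))

    for _ in range(outer):
        w_snap = [wk.copy() for wk in w]
        # lines 3-4: local inner products for all instances, then the "global sum"
        u_tilde = sum(w_snap[k] @ X[blocks[k], :] for k in range(p))
        # line 5: each Worker's block of the full gradient
        coeff = dloss(u_tilde, y)                     # n scalars, identical on all Workers
        g_tilde = [X[blocks[k], :] @ coeff / n + lam * w_snap[k] for k in range(p)]
        for _ in range(m):                            # lines 7-12: inner loop
            i = rng.integers(n)                       # same instance index on every Worker
            u = sum(w[k] @ X[blocks[k], i] for k in range(p))   # lines 9-10: one scalar
            for k in range(p):                        # line 11: local SVRG update
                g_new = dloss(u, y[i]) * X[blocks[k], i] + lam * w[k]
                g_old = dloss(u_tilde[i], y[i]) * X[blocks[k], i] + lam * w_snap[k]
                w[k] -= eta * (g_new - g_old + g_tilde[k])
    return np.concatenate(w)

# Toy run with d > n, the setting this paper targets.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 20))
y = np.sign(rng.standard_normal(20))
w_hat = fd_svrg(X, y)
print("training accuracy:", np.mean(np.sign(w_hat @ X) == y))
```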

4.3 Convergence Analysis

It is easy to find that the update rule of FD-SVRG is exactly equivalent to that of the non-distributed (serial) SVRG (Johnson & Zhang, 2013). Hence, the convergence property of FD-SVRG is the same as that of the non-distributed SVRG. Please note that the non-distributed SVRG has two options to get $\tilde{\mathbf{w}}^s$ in the original paper (Johnson & Zhang, 2013) (the readers can also refer to Algorithm 2 in Appendix A of the supplementary material). The authors of (Johnson & Zhang, 2013) only proved the convergence of Option II, without proving the convergence of Option I. In FD-SVRG, however, we prefer Option I, because we need to keep the parameter identical across the different machines under the feature partition. If Option II were taken, extra communication would be needed for the randomly chosen index. In this section, we prove that Option I is also convergent with a linear convergence rate.

Theorem 1

Assume $P(\mathbf{w})$ is $\mu$-strongly convex and each $f_i(\mathbf{w})$ is $L$-smooth, which means that $\forall \mathbf{w}, \mathbf{w}'$,

$$\|\nabla f_i(\mathbf{w}) - \nabla f_i(\mathbf{w}')\| \leq L \|\mathbf{w} - \mathbf{w}'\|.$$

In the $s$-th outer loop, the inner loop starts with $\mathbf{w}^0 = \tilde{\mathbf{w}}^{s-1}$, then we have

$$\mathbb{E}\|\mathbf{w}^m - \mathbf{w}^*\|^2 \leq \left(\alpha^m + \frac{\beta(1-\alpha^m)}{1-\alpha}\right) \mathbb{E}\|\tilde{\mathbf{w}}^{s-1} - \mathbf{w}^*\|^2,$$

where $\eta$ is the step size (learning rate), $\alpha = 1 - 2\mu\eta + 2L^2\eta^2$, $\beta = 2L^2\eta^2$, and $\mathbf{w}^*$ is the optimal value of $P(\mathbf{w})$.

Proof  Let $\mathbf{v}_t = \nabla f_{i_t}(\mathbf{w}^{t-1}) - \nabla f_{i_t}(\tilde{\mathbf{w}}) + \tilde{\mathbf{g}}$, where $\tilde{\mathbf{g}} = \nabla P(\tilde{\mathbf{w}})$ is the full gradient. Then we get $\mathbb{E}[\mathbf{v}_t] = \nabla P(\mathbf{w}^{t-1})$. According to the update rule $\mathbf{w}^t = \mathbf{w}^{t-1} - \eta \mathbf{v}_t$, we have

$$\mathbb{E}\|\mathbf{w}^t - \mathbf{w}^*\|^2 = \|\mathbf{w}^{t-1} - \mathbf{w}^*\|^2 - 2\eta\, \nabla P(\mathbf{w}^{t-1})^T(\mathbf{w}^{t-1} - \mathbf{w}^*) + \eta^2\, \mathbb{E}\|\mathbf{v}_t\|^2.$$

According to the definition of $\mathbf{v}_t$, we obtain

$$\begin{aligned}
\mathbb{E}\|\mathbf{v}_t\|^2
&\leq 2\,\mathbb{E}\|\nabla f_{i_t}(\mathbf{w}^{t-1}) - \nabla f_{i_t}(\mathbf{w}^*)\|^2 + 2\,\mathbb{E}\|\nabla f_{i_t}(\tilde{\mathbf{w}}) - \nabla f_{i_t}(\mathbf{w}^*) - \tilde{\mathbf{g}}\|^2 \\
&\leq 2\,\mathbb{E}\|\nabla f_{i_t}(\mathbf{w}^{t-1}) - \nabla f_{i_t}(\mathbf{w}^*)\|^2 + 2\,\mathbb{E}\|\nabla f_{i_t}(\tilde{\mathbf{w}}) - \nabla f_{i_t}(\mathbf{w}^*)\|^2 \\
&\leq 2L^2\|\mathbf{w}^{t-1} - \mathbf{w}^*\|^2 + 2L^2\|\tilde{\mathbf{w}} - \mathbf{w}^*\|^2.
\end{aligned}$$

The first inequality uses the fact that $\|\mathbf{a} + \mathbf{b}\|^2 \leq 2\|\mathbf{a}\|^2 + 2\|\mathbf{b}\|^2$, the second inequality uses $\tilde{\mathbf{g}} = \mathbb{E}[\nabla f_{i_t}(\tilde{\mathbf{w}}) - \nabla f_{i_t}(\mathbf{w}^*)]$ (since $\nabla P(\mathbf{w}^*) = \mathbf{0}$) together with $\mathbb{E}\|\boldsymbol{\xi} - \mathbb{E}\boldsymbol{\xi}\|^2 \leq \mathbb{E}\|\boldsymbol{\xi}\|^2$, and the last inequality uses the smooth property of $f_i$. Then we have

$$\mathbb{E}\|\mathbf{w}^t - \mathbf{w}^*\|^2 \leq \|\mathbf{w}^{t-1} - \mathbf{w}^*\|^2 - 2\eta\, \nabla P(\mathbf{w}^{t-1})^T(\mathbf{w}^{t-1} - \mathbf{w}^*) + 2L^2\eta^2\left(\|\mathbf{w}^{t-1} - \mathbf{w}^*\|^2 + \|\tilde{\mathbf{w}} - \mathbf{w}^*\|^2\right).$$

Using the strongly convex property of $P(\mathbf{w})$, i.e., $\nabla P(\mathbf{w}^{t-1})^T(\mathbf{w}^{t-1} - \mathbf{w}^*) \geq \mu\|\mathbf{w}^{t-1} - \mathbf{w}^*\|^2$, we obtain

$$\mathbb{E}\|\mathbf{w}^t - \mathbf{w}^*\|^2 \leq \left(1 - 2\mu\eta + 2L^2\eta^2\right)\|\mathbf{w}^{t-1} - \mathbf{w}^*\|^2 + 2L^2\eta^2\|\tilde{\mathbf{w}} - \mathbf{w}^*\|^2.$$

For convenience, let $\alpha = 1 - 2\mu\eta + 2L^2\eta^2$, $\beta = 2L^2\eta^2$. Taking expectation on the above inequality with respect to the whole history, we obtain

$$\mathbb{E}\|\mathbf{w}^t - \mathbf{w}^*\|^2 \leq \alpha\, \mathbb{E}\|\mathbf{w}^{t-1} - \mathbf{w}^*\|^2 + \beta\, \mathbb{E}\|\tilde{\mathbf{w}} - \mathbf{w}^*\|^2.$$

Let $\mathbf{w}^0 = \tilde{\mathbf{w}} = \tilde{\mathbf{w}}^{s-1}$ and apply the above recursion for $t = 1, \ldots, m$; we obtain the result in Theorem 1.

Please note that for the $s$-th outer loop, $\mathbf{w}^m$ actually denotes $\tilde{\mathbf{w}}^s$, and $\mathbf{w}^0$ is actually $\tilde{\mathbf{w}}^{s-1}$. Based on Theorem 1, we have

$$\mathbb{E}\|\tilde{\mathbf{w}}^s - \mathbf{w}^*\|^2 \leq \rho\, \mathbb{E}\|\tilde{\mathbf{w}}^{s-1} - \mathbf{w}^*\|^2, \qquad \rho = \alpha^m + \frac{\beta(1-\alpha^m)}{1-\alpha}.$$

When the step size $\eta$ is small enough, $\rho < 1$. It means that our FD-SVRG has a linear convergence rate.

4.4 Implementation Details

4.4.1 Mini-Batch

FD-SVRG can adopt a mini-batch strategy as described in (Zhao et al., 2014). In each iteration, the Workers sample a batch of data with batch size $b$. Then, $b$ inner products are computed at once and the $b$ scalars are communicated together. Taking a mini-batch cannot reduce the total communication of FD-SVRG, but it can reduce the communication frequency (the number of communication rounds).
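A small sketch of this mini-batch variant under the same assumptions as before: each Worker computes $b$ local inner products at once, so the $b$ scalars can be communicated in a single round. Names are illustrative.

```python
import numpy as np

def local_batch_inner(Xk, wk, batch_idx):
    """One Worker's contribution for a whole mini-batch: a length-b vector of scalars."""
    return wk @ Xk[:, batch_idx]                  # b scalars, sent together in one message

d, n, p, b = 12, 30, 3, 5
X = np.random.randn(d, n); w = np.random.randn(d)
blocks = np.array_split(np.arange(d), p)          # feature partition across p Workers
batch = np.random.choice(n, size=b, replace=False)
u_batch = sum(local_batch_inner(X[rows], w[rows], batch) for rows in blocks)
assert np.allclose(u_batch, w @ X[:, batch])      # b global inner products per round
```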

4.5 Complexity Analysis

For one iteration of the outer loop in SVRG, $n + m$ gradients have to be computed. Here, $m$ is set as the number of local data instances in general. For the instance-distributed methods, the data is partitioned across $p$ machines, and each machine has $n/p$ instances. DSVRG sets $m = n/p$. Then each machine computes $n/p$ gradients in the inner loops. Since there is only one machine at work in the inner loops of DSVRG, it computes $n + n/p$ gradients in total during one outer iteration. When computing the full gradient, the center of DSVRG sends the parameter to each machine and then receives the local gradients from each machine, so the communication cost is $2pd$ scalars. In the inner loops, the center sends the full gradient to the machine that is at work, and this machine iteratively updates the parameter and then returns it to the center after the iterative updating, so the communication cost is $2d$ scalars. Hence, the total communication cost is $2(p+1)d$ scalars. That means DSVRG computes $n + n/p$ gradients with a communication cost of $2(p+1)d$ scalars.

For FD-SVRG, when computing one gradient, the communication cost of a tree is $2p$ scalars. For the example in Figure 5, there are 4 Workers and the communication cost is 8 scalars (solid arrows). So the total communication cost is $2p(n + n/p) = 2(p+1)n$ scalars for computing $n + n/p$ gradients. Compared to the $2(p+1)d$ of DSVRG, we can find that when $d > n$, FD-SVRG has lower communication cost. Note that DSVRG only parallelizes the SVRG algorithm when computing full gradients, while our method parallelizes the SVRG algorithm both when computing full gradients and in the inner loops. It means that, to compute the same number of gradients, the average time of FD-SVRG is lower than that of DSVRG. Furthermore, FD-SVRG also parallelizes the communication through the tree-structured communication strategy, which can further reduce communication time. Hence, FD-SVRG is expected to be much faster than DSVRG, which will be verified in our experiments.
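As a rough illustration of this comparison, the following snippet plugs the data set sizes from Table 1 into the per-outer-iteration communication counts derived above (about $2(p+1)n$ scalars for FD-SVRG versus $2(p+1)d$ for DSVRG), assuming $p = 16$ Workers. This is back-of-the-envelope arithmetic only, not measured communication.

```python
datasets = {            # name: (d = #features, n = #instances), from Table 1
    "news20":  (1_355_191, 19_954),
    "url":     (3_231_961, 2_396_130),
    "webspam": (16_609_143, 350_000),
    "kdd2010": (29_890_095, 19_264_097),
}
p = 16
for name, (d, n) in datasets.items():
    fd, ds = 2 * (p + 1) * n, 2 * (p + 1) * d     # scalars per outer iteration
    print(f"{name:8s}  FD-SVRG ~{fd:.2e}   DSVRG ~{ds:.2e}   d > n: {d > n}")
```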

SynSVRG and AsySVRG can be implemented with the Parameter Server framework (refer to Appendix B of the supplementary material). They need to communicate $d$-dimensional vectors many times in the inner loops of SVRG, so their communication cost is on the order of $pmd$ scalars per outer iteration, which is much higher than those of DSVRG and FD-SVRG.

5 Experiments

In this section, we choose logistic regression (LR) with an $L_2$-norm regularization term to conduct experiments. Hence, the formula (1) is defined as follows:

$$\min_{\mathbf{w}} P(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\left[\log\left(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2\right]. \qquad (5)$$

All the experiments are performed on a cluster of several machines (nodes) connected by 10Gbps Ethernet. Each machine has 12 Intel Xeon E5-2620 cores and 96GB of memory.

5.1 Data Sets

We use four data sets for evaluation: news20, url, webspam and kdd2010. All of these data sets can be downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The detailed information about these data sets is summarized in Table 1. We use 8 Workers to train news20 because it is relatively small. For the other data sets, 16 Workers are used for training.

Data set     #Features (d)    #Instances (n)
news20       1,355,191        19,954
url          3,231,961        2,396,130
webspam      16,609,143       350,000
kdd2010      29,890,095       19,264,097
Table 1: Data sets for evaluation

5.2 Experimental Setting

We choose DSVRG, AsySVRG and SynSVRG as baselines. AsySVRG and SynSVRG are implemented on the Parameter Server framework (refer to Appendix B of the supplementary material). Besides Workers, AsySVRG and SynSVRG need extra machines as Servers. In our experiments, we take 8 Servers for AsySVRG and 4 Servers for SynSVRG. The number of Workers for AsySVRG and SynSVRG is the same as that for FD-SVRG and DSVRG. The parameter $\mathbf{w}$ is initialized to $\mathbf{0}$. We set the number of inner loop iterations of each method to be the number of training instances on each Worker. The step size is fixed during training.

5.3 Efficiency Comparison

Figure 6: Efficiency comparison in terms of wall-clock time on (a) news20, (b) url, (c) webspam and (d) kdd2010.

Figure 7: Efficiency comparison in terms of communication cost on (a) news20, (b) url, (c) webspam and (d) kdd2010.

The convergence results for a fixed regularization hyper-parameter $\lambda$ are shown in Figure 6 and Figure 7. In both figures, the vertical axis denotes the gap between the objective function value and the optimal value. In Figure 6, the horizontal axis denotes the wall-clock time (in seconds) used by the different methods. In Figure 7, the horizontal axis is the communication cost, which denotes how many scalars have been communicated; a $d$-dimensional vector is counted as $d$ scalars. We can find that FD-SVRG achieves the best performance in terms of both wall-clock time and communication cost.

Because DSVRG is the fastest baseline, we choose DSVRG as the reference to measure our method's speedup. The result is shown in Table 2, in which the time is recorded when the gap between the objective function value and the optimal value falls below a fixed threshold. The upper half of Table 2 reports the time of the two methods on the four data sets, and the lower half reports the speedup of our method over DSVRG. We can find that our method is several times faster than DSVRG.

                         DSVRG     FD-SVRG
Time          news20     2.83      0.68
(in seconds)  url        119.1     19.24
              webspam    33.01     4.23
              kdd2010    400.35    13.39
Speedup       news20     1         4.16
              url        1         6.19
              webspam    1         7.8
              kdd2010    1         29.9
Table 2: Speedup over DSVRG

PS-Lite (Li et al., 2014a) has been widely used in both academia and industry, so we also compare FD-SVRG with PS-Lite. The original implementation of PS-Lite is based on SGD (Li et al., 2014a), which we denote as PS-Lite (SGD). In particular, PS-Lite (SGD) is an asynchronous SGD implemented on PS-Lite and provided by the authors of (Li et al., 2014a). We summarize the speedup over PS-Lite (SGD) in Table 3. The time is recorded when the gap between the objective function value and the optimal value falls below a fixed threshold. We can find that our method is hundreds or even thousands of times faster than PS-Lite (SGD).

To evaluate the influence of the regularization hyper-parameter $\lambda$, we conduct experiments on the webspam data set with two different values of $\lambda$. The results are shown in Figure 8. Once again, our method achieves the best performance in both cases.

                         PS-Lite (SGD)   FD-SVRG
Time          news20     1000            0.69
(in seconds)  url        2000            19.25
              webspam    827.12          4.23
              kdd2010    2000            13.39
Speedup       news20     1               1449
              url        1               103
              webspam    1               196
              kdd2010    1               149
Table 3: Speedup over PS-Lite (SGD)

Figure 8: Efficiency comparison in terms of wall-clock time for different values of $\lambda$.

5.4 Scalability

By changing the number of Workers, we can compute the speedup of FD-SVRG, defined as follows:

$$\text{speedup} = \frac{\text{training time with one Worker}}{\text{training time with } p \text{ Workers}}.$$

We take the number of Workers to be 1, 4, 8 and 16. When the gap between the objective function value and the optimal value falls below a fixed threshold, we stop the training process and record the time.

The speedup results are shown in Figure 9. We can find that FD-SVRG achieves a speedup close to the ideal one. Hence, FD-SVRG has strong scalability for handling large-scale problems.

Figure 9: Speedup of FD-SVRG on webspam.

6 Conclusion

Most existing distributed learning methods for linear classification are instance-distributed, which cannot achieve satisfactory results for high-dimensional applications when the dimensionality is larger than the number of instances. In this paper, we propose a novel distributed learning method, called FD-SVRG, by adopting a feature-distributed data partition strategy. Experimental results show that our method can achieve the best performance for cases when the dimensionality is larger than the number of instances.

References

  • Chandrasekaran et al. (2012) Chandrasekaran, Venkat, Recht, Benjamin, Parrilo, Pablo A., and Willsky, Alan S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
  • Defazio et al. (2014) Defazio, Aaron, Bach, Francis R., and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Jaggi et al. (2014) Jaggi, Martin, Smith, Virginia, Takác, Martin, Terhorst, Jonathan, Krishnan, Sanjay, Hofmann, Thomas, and Jordan, Michael I. Communication-efficient distributed dual coordinate ascent. In Neural Information Processing Systems, pp. 3068–3076, 2014.
  • Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems, pp. 315–323, 2013.
  • Lee et al. (2017) Lee, Jason D., Lin, Qihang, Ma, Tengyu, and Yang, Tianbao. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research, 18:122:1–122:43, 2017.
  • Li et al. (2014a) Li, Mu, Andersen, David G., Park, Jun Woo, Smola, Alexander J., Ahmed, Amr, Josifovski, Vanja, Long, James, Shekita, Eugene J., and Su, Bor-Yiing. Scaling distributed machine learning with the parameter server. In Operating Systems Design and Implementation, pp. 583–598, 2014a.
  • Li et al. (2014b) Li, Mu, Andersen, David G., Smola, Alexander J., and Yu, Kai. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, pp. 19–27, 2014b.
  • Lian et al. (2017) Lian, Xiangru, Zhang, Ce, Zhang, Huan, Hsieh, Cho-Jui, Zhang, Wei, and Liu, Ji. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Neural Information Processing Systems, pp. 5336–5346, 2017.
  • Ma et al. (2015) Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtárik, Peter, and Takác, Martin. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning, pp. 1973–1982, 2015.
  • Meng et al. (2016) Meng, Xiangrui, Bradley, Joseph K., Yavuz, Burak, Sparks, Evan R., Venkataraman, Shivaram, Liu, Davies, Freeman, Jeremy, Tsai, D. B., Amde, Manish, Owen, Sean, Xin, Doris, Xin, Reynold, Franklin, Michael J., Zadeh, Reza, Zaharia, Matei, and Talwalkar, Ameet. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17:34:1–34:7, 2016.
  • Negahban et al. (2012) Negahban, Sahand N., Ravikumar, Pradeep, Wainwright, Martin J., and Yu, Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, pp. 538–557, 2012.
  • Reddi et al. (2015) Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Póczos, Barnabás, and Smola, Alexander J. On variance reduction in stochastic gradient descent and its asynchronous variants. In Neural Information Processing Systems, pp. 2647–2655, 2015.
  • Schmidt et al. (2017) Schmidt, Mark W., Roux, Nicolas Le, and Bach, Francis R. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
  • Shalev-Shwartz & Zhang (2013) Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
  • Shalev-Shwartz & Zhang (2014) Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pp. 64–72, 2014.
  • Shi et al. (2015) Shi, Wei, Ling, Qing, Wu, Gang, and Yin, Wotao. EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  • Sivakumar & Banerjee (2017) Sivakumar, Vidyashankar and Banerjee, Arindam. High-dimensional structured quantile regression. In International Conference on Machine Learning, pp. 3220–3229, 2017.
  • Wang et al. (2016) Wang, Xiangyu, Dunson, David B., and Leng, Chenlei. No penalty no tears: Least squares in high-dimensional linear models. In International Conference on Machine Learning, pp. 1814–1822, 2016.
  • Xing et al. (2015) Xing, Eric P., Ho, Qirong, Dai, Wei, Kim, Jin Kyu, Wei, Jinliang, Lee, Seunghak, Zheng, Xun, Xie, Pengtao, Kumar, Abhimanu, and Yu, Yaoliang. Petuum: A new platform for distributed machine learning on big data. In International Conference on Knowledge Discovery and Data Mining, pp. 1335–1344, 2015.
  • Yang (2013) Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Neural Information Processing Systems, pp. 629–637, 2013.
  • Zhao & Li (2016) Zhao, Shen-Yi and Li, Wu-Jun. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In AAAI Conference on Artificial Intelligence, pp. 2379–2385, 2016.
  • Zhao et al. (2014) Zhao, Tuo, Yu, Mo, Wang, Yiming, Arora, Raman, and Liu, Han. Accelerated mini-batch randomized block coordinate descent method. In Neural Information Processing Systems, pp. 3329–3337, 2014.
  • Zheng et al. (2017) Zheng, Shuxin, Meng, Qi, Wang, Taifeng, Chen, Wei, Yu, Nenghai, Ma, Zhiming, and Liu, Tie-Yan. Asynchronous stochastic gradient descent with delay compensation. In International Conference on Machine Learning, pp. 4120–4129, 2017.
  • Zinkevich et al. (2010) Zinkevich, Martin, Weimer, Markus, Smola, Alexander J., and Li, Lihong. Parallelized stochastic gradient descent. In Neural Information Processing Systems, pp. 2595–2603, 2010.

Appendix A Serial SVRG

The learning procedure of the non-distributed (serial) SVRG is shown in Algorithm 2, where $\eta$ is the learning rate, $\mathbf{w}^t$ denotes the parameter value at inner iteration $t$, and $m$ is a hyper-parameter (the number of inner iterations).

1:  Initialize $\tilde{\mathbf{w}}^0$, step size $\eta$, inner loop size $m$;
2:  for $s = 1, 2, \ldots$ do
3:     $\tilde{\mathbf{w}} = \tilde{\mathbf{w}}^{s-1}$;
4:     $\tilde{\mathbf{g}} = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{\mathbf{w}})$, $\mathbf{w}^0 = \tilde{\mathbf{w}}$;
5:     for $t = 1, 2, \ldots, m$ do
6:        Randomly pick $i_t \in \{1, \ldots, n\}$;
7:        $\mathbf{w}^t = \mathbf{w}^{t-1} - \eta\left(\nabla f_{i_t}(\mathbf{w}^{t-1}) - \nabla f_{i_t}(\tilde{\mathbf{w}}) + \tilde{\mathbf{g}}\right)$;
8:     end for
9:     option I: Set $\tilde{\mathbf{w}}^s = \mathbf{w}^m$;
10:     option II: Set $\tilde{\mathbf{w}}^s = \mathbf{w}^t$ for randomly chosen $t \in \{0, \ldots, m-1\}$;
11:  end for
Algorithm 2 SVRG

Appendix B Distributed SVRG with Parameter Server

Both SynSVRG and AsySVRG can be implemented on the Parameter Server framework to get distributed versions of SVRG. SynSVRG is shown in Algorithm 3 and Algorithm 4: Algorithm 3 describes the operations of the Servers, and Algorithm 4 describes the operations of the Workers.

1:  Initialize $\mathbf{w}^{(j)}$, step size $\eta$;
2:  for $s = 1, 2, \ldots$ do
3:     $\tilde{\mathbf{w}}^{(j)} = \mathbf{w}^{(j)}$;
4:     Send $\tilde{\mathbf{w}}^{(j)}$ to all Workers;
5:     Receive the local gradient sums $\{\mathbf{z}_k^{(j)}\}_{k=1}^{p}$ from the Workers;
6:     Compute the full gradient $\tilde{\mathbf{g}}^{(j)} = \frac{1}{n}\sum_{k=1}^{p} \mathbf{z}_k^{(j)}$;
7:     for $t = 1, 2, \ldots, m$ do
8:        Send $\mathbf{w}^{(j)}$ to all Workers;
9:        Receive $\{\mathbf{v}_t^{(j),k}\}_{k=1}^{p}$ from the Workers;
10:        $\mathbf{v}_t^{(j)} = \frac{1}{p}\sum_{k=1}^{p} \mathbf{v}_t^{(j),k} + \tilde{\mathbf{g}}^{(j)}$;
11:        $\mathbf{w}^{(j)} = \mathbf{w}^{(j)} - \eta\, \mathbf{v}_t^{(j)}$;
12:     end for
13:     $\mathbf{w}^{(j),s} = \mathbf{w}^{(j)}$;
14:  end for
Algorithm 3 Task of Server_$j$ in SynSVRG
1:  for $s = 1, 2, \ldots$ do
2:     Receive $\tilde{\mathbf{w}}^{(j)}$ ($j = 1, \ldots, q$) from the Servers and combine them into the complete $\tilde{\mathbf{w}}$;
3:     Compute the local gradient sum $\mathbf{z}_k = \sum_{i \in \mathcal{I}_k} \nabla f_i(\tilde{\mathbf{w}})$, where $\mathcal{I}_k$ is the index set of the local instances;
4:     Cut $\mathbf{z}_k$ into $q$ parts and send the parts to the Servers, where $\mathbf{z}_k^{(j)}$ is sent to Server_$j$;
5:     for $t = 1, 2, \ldots, m$ do
6:        Receive $\mathbf{w}^{(j)}$ from the Servers and combine them into the complete $\mathbf{w}$;
7:        Pick up an instance from the local data with index $i_t$;
8:        $\mathbf{v}_t^k = \nabla f_{i_t}(\mathbf{w}) - \nabla f_{i_t}(\tilde{\mathbf{w}})$;
9:        Cut $\mathbf{v}_t^k$ into $q$ parts and send the parts to the Servers, where $\mathbf{v}_t^{(j),k}$ is sent to Server_$j$;
10:     end for
11:  end for
Algorithm 4 Task of Worker_$k$ in SynSVRG

AsySVRG is shown in Algorithm 5 and Algorithm 6: Algorithm 5 describes the operations of the Servers, and Algorithm 6 describes the operations of the Workers.

1:  Initialize $\mathbf{w}^{(j)}$, step size $\eta$;
2:  for $s = 1, 2, \ldots$ do
3:     $\tilde{\mathbf{w}}^{(j)} = \mathbf{w}^{(j)}$;
4:     Send $\tilde{\mathbf{w}}^{(j)}$ to all Workers;
5:     Receive the local gradient sums $\{\mathbf{z}_k^{(j)}\}_{k=1}^{p}$ from the Workers;
6:     Compute the full gradient $\tilde{\mathbf{g}}^{(j)} = \frac{1}{n}\sum_{k=1}^{p} \mathbf{z}_k^{(j)}$;
7:     $t = 0$;
8:     repeat
9:        if Receive pull request from a Worker then
10:           Send $\mathbf{w}^{(j)}$ to the Worker;
11:        else if Receive push request from a Worker then
12:           Receive $\mathbf{v}^{(j),k}$ from the Worker;
13:           $\mathbf{w}^{(j)} = \mathbf{w}^{(j)} - \eta\left(\mathbf{v}^{(j),k} + \tilde{\mathbf{g}}^{(j)}\right)$;
14:           $t$ increases by $1$;
15:        end if
16:     until $t \geq m$
17:     Send End Signal to all Workers;
18:     $\mathbf{w}^{(j),s} = \mathbf{w}^{(j)}$;
19:  end for
Algorithm 5 Task of Server_$j$ in AsySVRG
1:  for $s = 1, 2, \ldots$ do
2:     Receive $\tilde{\mathbf{w}}^{(j)}$ ($j = 1, \ldots, q$) from the Servers and combine them into the complete $\tilde{\mathbf{w}}$;
3:     Compute the local gradient sum $\mathbf{z}_k = \sum_{i \in \mathcal{I}_k} \nabla f_i(\tilde{\mathbf{w}})$;
4:     Cut $\mathbf{z}_k$ into $q$ parts and send the parts to the Servers, where $\mathbf{z}_k^{(j)}$ is sent to Server_$j$;
5:     repeat
6:        Send pull requests to all Servers;
7:        Receive $\mathbf{w}^{(j)}$ from the Servers and combine them into the complete $\mathbf{w}$;
8:        Pick up an instance from the local data with index $i_t$;
9:        $\mathbf{v}^k = \nabla f_{i_t}(\mathbf{w}) - \nabla f_{i_t}(\tilde{\mathbf{w}})$;
10:        Send push requests to all Servers;
11:        Cut $\mathbf{v}^k$ into $q$ parts and send the parts to the Servers, where $\mathbf{v}^{(j),k}$ is sent to Server_$j$;
12:     until receiving the End Signal from the Servers;
13:  end for
Algorithm 6 Task of Worker_$k$ in AsySVRG