1 Introduction
Linear classification models, such as logistic regression (LR) and linear support vector machine (SVM), can achieve good performance in many high-dimensional applications like text classification. When the training set is too large to be handled by a single machine (node), we often need to design distributed learning methods on a cluster of multiple machines. Hence, it has become an interesting topic to design distributed linear classification models for large-scale tasks with high dimensionality.
For large-scale linear classification problems, stochastic gradient descent (SGD) and its variants, such as stochastic average gradient (SAG) (Schmidt et al., 2017), SAGA (Defazio et al., 2014), stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz & Zhang, 2013, 2014) and stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), have shown promising performance in real applications. Hence, most existing distributed learning methods adopt SGD or its variants for updating (learning) the parameters of linear classification models. Representatives include PSGD (Zinkevich et al., 2010), DC-ASGD (Zheng et al., 2017), DisDCA (Yang, 2013), CoCoA (Jaggi et al., 2014), CoCoA+ (Ma et al., 2015) and DSVRG (Lee et al., 2017).
According to the organization of the cluster, existing distributed learning methods can be divided into three main categories. The first category is based on the master-slave framework, which has one master node (machine) and several slave nodes. In general, the model parameter is stored on the master node, and the data is distributively stored on the slave nodes. The master node is responsible for updating the model parameter, and the slave nodes are responsible for computing the gradient or stochastic gradient. This category is a centralized framework. One representative of this category is MLlib on Spark (Meng et al., 2016). The bottleneck of this kind of centralized framework is the high communication cost on the central (master) node (Lian et al., 2017).
The second category is based on the Parameter Server framework (Li et al., 2014a, b; Xing et al., 2015), which has two kinds of nodes, called Servers and Workers respectively. The Servers are used to store and update the model parameter, and the Workers are used to distributively store the data and compute the gradient or stochastic gradient. PS-Lite (Li et al., 2014a) and Petuum (Xing et al., 2015) are two representatives of this category. (PS-Lite is called Parameter Server in the original paper (Li et al., 2014a). In this paper, we use Parameter Server to denote the general framework, and PS-Lite for the specific Parameter Server platform in (Li et al., 2014a). PS-Lite can be downloaded from https://github.com/dmlc/ps-lite.) Parameter Server is also a centralized framework. Unlike the master-slave framework, which has only one master node for the model parameter, Parameter Server can use multiple Servers to distributively store and update the model parameter, and hence can relieve the communication burden on the central nodes (Servers). However, there still exists frequent communication of gradients and parameters between Servers and Workers, because the parameter and data are stored separately on Servers and Workers in the Parameter Server framework.
The third category is the decentralized framework, in which there are only workers and no central nodes (servers). The data is distributively stored on all the workers, and all workers need to store and update (learn) the model parameter. D-PSGD (Lian et al., 2017) and DSVRG (Lee et al., 2017) are two representatives of this category. D-PSGD is a distributed SGD method that abandons the central node and needs much less communication on the busiest node compared to centralized frameworks, but it still needs to communicate the parameter vector frequently between workers. DSVRG is a distributed SVRG method with a ring framework. Because the convergence rate of SVRG is much faster than that of SGD (Johnson & Zhang, 2013), DSVRG also converges much faster than other SGD-based distributed methods. Furthermore, the decentralized framework of DSVRG also avoids the communication bottleneck of the centralized frameworks. Hence, DSVRG has achieved promising performance for learning linear classification models.
Most existing distributed learning methods, including all the centralized and decentralized methods mentioned above, are instance-distributed, i.e., they partition the training data by instances. These instance-distributed methods have achieved promising performance in large-scale problems when the number of instances is larger than the dimensionality (the number of features). However, in some real applications like web mining, astronomical projects, and financial and biotechnology applications, the dimensionality can be larger than the number of instances (Negahban et al., 2012). For these cases, the communication cost of instance-distributed methods is typically high, because they often need to communicate high-dimensional parameter vectors or gradient vectors among different machines.
In this paper, we propose a new distributed SVRG method, called feature-distributed SVRG (FD-SVRG), for high-dimensional linear classification. The contributions of FD-SVRG are briefly listed as follows:

Unlike most existing distributed learning methods, which are instance-distributed, FD-SVRG is feature-distributed.

FD-SVRG has the same convergence rate as the non-distributed (serial) SVRG, while the parameters in FD-SVRG can be distributively learned.

FD-SVRG has lower communication cost than instance-distributed methods when the dimensionality is larger than the number of instances.

Experimental results on real data demonstrate that FD-SVRG can outperform other state-of-the-art distributed SVRG methods in terms of both communication cost and wall-clock time, when the dimensionality is larger than the number of instances in the training data.

In particular, compared with the Parameter Server platform PS-Lite, which has been widely used in both academia and industry, FD-SVRG is several orders of magnitude faster.
Please note that our feature-distributed framework is not only applicable to SVRG; it can also be applied to SGD and other variants. Furthermore, it can also be used for regression or other linear models. Due to space limitations, we only focus on SVRG-based linear classification here and leave other variants for future study.
2 Problem Formulation
Although there exist different formulations for the linear classification problem, this paper focuses on the most popular formulation, shown as follows:
(1)  min_w P(w) = (1/n) Σ_{i=1}^n f_i(w),
(2)  f_i(w) = ξ_i(w^T x_i) + R(w),
where n is the number of instances in the training set, x_i ∈ R^d is the feature vector of instance i, y_i ∈ {+1, -1} is the class label of instance i, d is the dimensionality (number of features) of the instances, w ∈ R^d is the parameter to learn, ξ_i(·) is the loss defined on instance i, and R(w) is a regularization function. Here, we only focus on two-class problems, but the techniques in this paper can also be adapted for multi-class problems, which are omitted to save space.
Many popular linear classification models can be formulated in the form of (1). For example, in logistic regression (LR), f_i(w) = log(1 + exp(-y_i w^T x_i)) + (λ/2) ||w||_2^2, where log(1 + exp(-y_i w^T x_i)) is the logistic loss and (λ/2) ||w||_2^2 is the L2-norm regularization function with a hyper-parameter λ. In linear SVM, f_i(w) = max(0, 1 - y_i w^T x_i) + (λ/2) ||w||_2^2.
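The two examples above can be written down directly. The following sketch (with our own helper names, assuming the L2-regularized forms given in the text) also checks the LR gradient against finite differences:

```python
import numpy as np

# Illustrative sketch of the two f_i examples, assuming the forms
# f_i(w) = loss(y_i * x_i^T w) + (lam/2)*||w||^2 stated in the text.

def f_i_logistic(w, x, y, lam):
    return np.log1p(np.exp(-y * (x @ w))) + 0.5 * lam * (w @ w)

def f_i_svm(w, x, y, lam):
    return max(0.0, 1.0 - y * (x @ w)) + 0.5 * lam * (w @ w)

def grad_f_i_logistic(w, x, y, lam):
    m = y * (x @ w)
    return -y * x / (1.0 + np.exp(m)) + lam * w

# check the analytic LR gradient against central finite differences
rng = np.random.default_rng(1)
w, x, y, lam = rng.normal(size=4), rng.normal(size=4), 1.0, 0.1
g = grad_f_i_logistic(w, x, y, lam)
eps = 1e-6
for j in range(4):
    e = np.zeros(4); e[j] = eps
    num = (f_i_logistic(w + e, x, y, lam) - f_i_logistic(w - e, x, y, lam)) / (2 * eps)
    assert abs(num - g[j]) < 1e-5
print("gradient check passed")
```

The hinge loss of linear SVM is shown for completeness; it is not differentiable at the hinge point, so the finite-difference check is done only for the logistic loss.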
In many real applications, the training set can be too large to be handled by a single machine (node). Hence, we need to design distributed learning methods to learn (optimize) the parameter w on a cluster of multiple machines (nodes). In some applications, n can be larger than d, and in other applications, d can be larger than n. In this paper, we focus on the case where d is larger than n, which has attracted much attention in recent years (Chandrasekaran et al., 2012; Negahban et al., 2012; Wang et al., 2016; Sivakumar & Banerjee, 2017).
3 Related Work
As stated in Section 1, there exist three main categories of distributed learning methods for the problem in (1). Here, we briefly introduce these existing methods to motivate the contribution of this paper. Because the master-slave framework can be seen as a special case of Parameter Server with only one Server, we only introduce the Parameter Server and decentralized frameworks.
3.1 Parameter Server
The Parameter Server framework (Li et al., 2014a; Xing et al., 2015) is illustrated in Figure 1, in which there are Servers and Workers. Let X denote the training data matrix, where the i-th column of X is the i-th instance x_i. In Parameter Server, the whole training set is instance-distributed, which means that X is partitioned vertically (by instance) into subsets, each of which is assigned to one Worker. The parameter w is cut into parts, each of which is assigned to one Server. The Servers are responsible for updating the parameter, and the Workers are responsible for computing gradients.
The communication is done by pull and push operations. That is to say, the Workers pull parameters from the Servers and push gradients to the Servers. When the number of Workers becomes larger, more machines compute gradients at the same time, and the pull and push requests arrive more frequently at the Servers. Although PS-Lite and Petuum can use suitable data structures to decrease the communication cost for sparse data, the total communication cost is still high because of the frequent communication. Adding more Servers makes the parameter segment on each Server smaller, but cannot reduce the number of communication requests handled by the Servers. Furthermore, as the number of Servers increases, each Worker needs to cut the parameter into more segments and communicate more frequently. For SVRG-based methods, such as mini-batch based synchronous distributed SVRG (SynSVRG) and asynchronous SVRG (AsySVRG) (Reddi et al., 2015; Zhao & Li, 2016), a dense full gradient vector needs to be communicated in each epoch, and hence the communication cost is also high. (Many existing AsySVRG methods, such as those in (Reddi et al., 2015; Zhao & Li, 2016), were initially proposed for multi-thread systems with a shared memory, but they can easily be extended to a cluster of multiple machines to get a distributed version. The distributed SVRG methods, including SynSVRG and AsySVRG, implemented with Parameter Server can be found in Appendix B of the supplementary material.)
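To make the pull/push pattern concrete, here is a minimal single-process sketch (the class and method names are hypothetical, not the real PS-Lite API): Servers hold disjoint parameter segments, and a Worker pulls them, computes a local logistic-loss gradient, and pushes each gradient segment back to its owner.

```python
import numpy as np

# Hypothetical in-process sketch of the Parameter Server pull/push pattern
# (not the real PS-Lite API): Servers hold disjoint parameter segments,
# Workers hold instance subsets and push gradient segments.

class Server:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)   # parameter segment stored on this Server
        self.lr = lr

    def pull(self):
        return self.w.copy()

    def push(self, grad):        # apply a gradient update to this segment
        self.w -= self.lr * grad

def worker_step(servers, X_local, y_local):
    # pull: reassemble the full parameter from all Servers' segments
    w = np.concatenate([s.pull() for s in servers])
    # compute a local logistic-loss gradient on this Worker's instances
    margin = y_local * (X_local @ w)
    g = -(X_local.T @ (y_local / (1.0 + np.exp(margin)))) / len(y_local)
    # push: send each Server the segment of the gradient it owns
    offset = 0
    for s in servers:
        d_k = len(s.w)
        s.push(g[offset:offset + d_k])
        offset += d_k

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = np.sign(X @ np.ones(6))
servers = [Server(3), Server(3)]       # parameter split across 2 Servers
for _ in range(50):
    worker_step(servers, X, y)         # a single Worker, for brevity
w = np.concatenate([s.w for s in servers])
print(np.mean(np.sign(X @ w) == y))    # training accuracy
```

Note how every step moves a full d-dimensional parameter and gradient between Workers and Servers; this is exactly the per-step communication that becomes costly when d is large.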
3.2 Decentralized Framework
Parameter Server is a centralized framework, in which the central nodes (Servers) become the busiest ones, handling a high communication burden from the Workers, especially when the number of Workers is large. Recently, decentralized frameworks (Lian et al., 2017; Lee et al., 2017) have been proposed to avoid the communication traffic jam on the busiest machines. This framework is illustrated in Figure 2. There are no Servers in decentralized methods. Each machine (Worker) is assigned a subset of the training instances. Based on its local subset of training instances, each Worker computes gradients and updates its local parameter. Then the machines communicate parameters with each other. EXTRA (Shi et al., 2015), D-PSGD (Lian et al., 2017) and DSVRG (Lee et al., 2017) are representatives of this kind of decentralized method. Although the communication cost of decentralized methods is balanced among Workers, these methods need to communicate dense parameter vectors, even if the data is sparse. When the data is high-dimensional, the communication cost is also very high.
4 FeatureDistributed SVRG
Most existing distributed learning methods, including all the centralized and decentralized methods introduced in Section 3, are instance-distributed, i.e., they partition the training data by instances. When the dimensionality is larger than the number of instances, i.e., d > n, the communication cost of instance-distributed methods is typically high, because they often need to communicate high-dimensional parameter vectors or gradient vectors of length d among different machines.
In this section, we present our new method, called FD-SVRG, for high-dimensional linear classification. FD-SVRG is feature-distributed, which means it partitions the training data by features. For the training data matrix X, the difference between an instance-distributed partition and a feature-distributed partition is illustrated in Figure 3, where the upper right shows a feature-distributed partition and the lower right shows an instance-distributed partition.
FD-SVRG is based on SVRG, which is much faster than SGD (Johnson & Zhang, 2013). For ease of understanding, we briefly present the original non-distributed (serial) SVRG in Appendix A of the supplementary material.
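For readers who want a concrete reference point, the following is a minimal serial SVRG sketch with Option I (the next snapshot is the last inner iterate), applied to L2-regularized logistic regression; the variable names are ours, not those of Algorithm 2.

```python
import numpy as np

# Minimal serial SVRG sketch (Option I: next snapshot = last inner iterate)
# for L2-regularized logistic regression. Names are illustrative.

def grad_fi(w, x, y, lam):
    # gradient of log(1 + exp(-y * x^T w)) + (lam/2)*||w||^2
    return -y * x / (1.0 + np.exp(y * (x @ w))) + lam * w

def svrg(X, Y, lam=0.1, eta=0.1, outer=20, inner=None):
    n, d = X.shape
    inner = inner or n
    u = np.zeros(d)                                   # snapshot parameter u
    rng = np.random.default_rng(0)
    for _ in range(outer):
        # full gradient at the snapshot, recomputed once per outer loop
        full = np.mean([grad_fi(u, X[i], Y[i], lam) for i in range(n)], axis=0)
        w = u.copy()
        for _ in range(inner):
            i = rng.integers(n)
            v = grad_fi(w, X[i], Y[i], lam) - grad_fi(u, X[i], Y[i], lam) + full
            w -= eta * v                              # variance-reduced step
        u = w                                         # Option I
    return u

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
Y = np.sign(X @ np.ones(5))
w = svrg(X, Y)
print(np.mean(np.sign(X @ w) == Y))                   # training accuracy
```

The inner update uses the stochastic gradient of one instance, corrected by the difference between the snapshot's stochastic and full gradients, which is the variance-reduction idea that FD-SVRG parallelizes.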
4.1 Framework
Figure 4 shows the distributed framework of FD-SVRG, in which there are p Workers and one Coordinator. FD-SVRG is feature-distributed. More specifically, the training data matrix X is partitioned horizontally (by features) into p parts X_1, ..., X_p, and X_k ∈ R^{d_k × n} is stored on Worker_k. Here, Σ_{k=1}^p d_k = d. The parameter w is also partitioned into p parts w_1, ..., w_p, with w_k ∈ R^{d_k}. The features of w_k correspond to the features of X_k, and w_k is also stored on Worker_k.
4.2 Learning Algorithm
First, we rewrite f_i(w) in (1) as follows:
(3)  f_i(w) = ξ_i(w^T x_i) + Σ_{k=1}^p r_k(w_k),
where r_k(w_k) is the regularization function defined on w_k. It is easy to find that if R(w) is the L1-norm or L2-norm, r_k(w_k) is also the L1-norm or L2-norm.
Then we can get the gradient:
(4)  ∇f_i(w) = ξ_i'(w^T x_i) x_i + ∇R(w),
where ξ_i'(·) denotes the derivative of the scalar loss ξ_i(·).
We can find that the main computation of the gradient is calculating the inner products w^T x_i and u^T x_i, where u denotes the snapshot parameter in SVRG. Since
w^T x_i = Σ_{k=1}^p w_k^T x_{k,i},
where x_{k,i} denotes the feature block of x_i stored on Worker_k, we can distributively complete the computation.
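The decomposition can be sketched in a few lines: each Worker contributes one scalar partial inner product for its feature block, and the sum of the p scalars equals the full inner product exactly.

```python
import numpy as np

# Sketch of the feature-wise decomposition: Worker k holds the block
# (w_k, x_{k,i}) and contributes one scalar partial inner product;
# summing the p scalars recovers w^T x_i exactly.

rng = np.random.default_rng(3)
d, p = 12, 4
w = rng.normal(size=d)
x = rng.normal(size=d)
blocks = np.array_split(np.arange(d), p)      # feature partition over p Workers
partials = [w[b] @ x[b] for b in blocks]      # one scalar per Worker
assert np.isclose(sum(partials), w @ x)
print(sum(partials), w @ x)                   # identical values
```

So instead of moving d-dimensional vectors, FD-SVRG only needs to aggregate p scalars per inner product.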
The whole learning algorithm of FD-SVRG is shown in Algorithm 1. The Coordinator is used to help sum the partial inner products w_k^T x_{k,i} from all Workers, where x_{k,i} is the block of the i-th column of X stored on Worker_k. Here, we use a tree-structured communication (reduce) scheme to get the global sum. An example of the tree-structured global sum with 4 Workers is shown in Figure 5. We pair the Workers so that while Worker_1 adds the result from Worker_2, Worker_3 can simultaneously add the result from Worker_4. After the Coordinator has computed the sum, it broadcasts the sum to all Workers in a reverse-order tree structure. A similar tree structure can be constructed for more Workers. This is faster than the strategy in which all Workers send their results directly to the Coordinator for summation, especially when the number of Workers is large.
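The tree-structured sum can be sketched as follows (a serial simulation of the parallel rounds; in a real cluster, the pairs in each round reduce simultaneously):

```python
# Sketch of the tree-structured global sum: in each round, Workers are
# paired and the right member of each pair sends its partial sum left,
# so p partial scalars are reduced in O(log p) communication rounds.

def tree_sum(values):
    vals = list(values)
    while len(vals) > 1:
        nxt = []
        for i in range(0, len(vals) - 1, 2):
            nxt.append(vals[i] + vals[i + 1])   # pairs reduce in parallel
        if len(vals) % 2 == 1:
            nxt.append(vals[-1])                # an odd Worker waits a round
        vals = nxt
    return vals[0]

partials = [1.5, -0.25, 2.0, 0.75]              # e.g. 4 Workers' w_k^T x_{k,i}
print(tree_sum(partials))                       # → 4.0
```

With p Workers, the reduce takes about log2(p) rounds instead of p sequential receptions at the Coordinator, and the reverse-order broadcast has the same depth.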
When computing the full gradient, the inner products of all the instances are computed only once per outer iteration, because the snapshot parameter u is constant within each outer iteration. In line 11 of Algorithm 1, only w_t^T x_i needs to be received from the Coordinator. A Worker does not need to receive u^T x_i again in the inner iterations, because it already received u^T x_i when computing the full gradient of the current outer iteration.
4.3 Convergence Analysis
It is easy to find that the update rule of FD-SVRG is exactly equivalent to that of the non-distributed (serial) SVRG (Johnson & Zhang, 2013). Hence, the convergence property of FD-SVRG is the same as that of the non-distributed SVRG. Please note that the non-distributed SVRG has two options to set the next snapshot in the original paper (Johnson & Zhang, 2013) (the readers can also refer to Algorithm 2 in Appendix A of the supplementary material). The authors of (Johnson & Zhang, 2013) only proved the convergence of Option II, without proving the convergence of Option I. But in FD-SVRG, we prefer Option I, because we need to keep the parameter identical across the machines over which the features are partitioned. If Option II were taken, there would be extra communication for the random value. In this section, we prove that Option I is also convergent, with a linear convergence rate.
Theorem 1
Assume P(w) is μ-strongly convex and each f_i(w) is L-smooth, which means that for all w and w',
P(w') ≥ P(w) + ∇P(w)^T (w' - w) + (μ/2) ||w' - w||^2,
||∇f_i(w) - ∇f_i(w')|| ≤ L ||w - w'||.
In the s-th outer loop, the inner loop starts with w_0 = u_{s-1}; then we have
E ||w_{t+1} - w*||^2 ≤ α E ||w_t - w*||^2 + β ||w_0 - w*||^2,
where η is the step size (learning rate), α = 1 - 2ημ + 2η^2 L^2, β = 2η^2 L^2, and w* is the optimal solution of P(w).
Proof Let v_t = ∇f_{i_t}(w_t) - ∇f_{i_t}(u_{s-1}) + ∇P(u_{s-1}), where ∇P(u_{s-1}) is the full gradient. Then we get E[v_t] = ∇P(w_t). According to the update rule w_{t+1} = w_t - η v_t, we have
E ||w_{t+1} - w*||^2 = ||w_t - w*||^2 - 2η ∇P(w_t)^T (w_t - w*) + η^2 E ||v_t||^2.
According to the definition of v_t, we obtain
E ||v_t||^2 ≤ 2 E ||∇f_{i_t}(w_t) - ∇f_{i_t}(w*)||^2 + 2 E ||∇f_{i_t}(u_{s-1}) - ∇f_{i_t}(w*) - ∇P(u_{s-1})||^2
≤ 2 E ||∇f_{i_t}(w_t) - ∇f_{i_t}(w*)||^2 + 2 E ||∇f_{i_t}(u_{s-1}) - ∇f_{i_t}(w*)||^2
≤ 2 L^2 ||w_t - w*||^2 + 2 L^2 ||u_{s-1} - w*||^2.
The first inequality uses the fact that ||a + b||^2 ≤ 2||a||^2 + 2||b||^2; the second inequality uses E||Z - E[Z]||^2 ≤ E||Z||^2 together with E[∇f_{i_t}(u_{s-1}) - ∇f_{i_t}(w*)] = ∇P(u_{s-1}). The last inequality uses the smooth property of f_i(w). Then we have
E ||w_{t+1} - w*||^2 ≤ ||w_t - w*||^2 - 2η ∇P(w_t)^T (w_t - w*) + 2η^2 L^2 (||w_t - w*||^2 + ||u_{s-1} - w*||^2).
Using the strongly convex property of P(w), we obtain
∇P(w_t)^T (w_t - w*) ≥ μ ||w_t - w*||^2.
For convenience, let α = 1 - 2ημ + 2η^2 L^2 and β = 2η^2 L^2. Taking expectation on the above inequality, we obtain
E ||w_{t+1} - w*||^2 ≤ α E ||w_t - w*||^2 + β ||u_{s-1} - w*||^2.
Since w_0 = u_{s-1}, we obtain the result in Theorem 1.
Please note that for the s-th outer loop, w_0 actually denotes u_{s-1}, and u_s is actually w_M, where M is the number of inner iterations. Based on Theorem 1, applying the recursion M times, we have
E ||u_s - w*||^2 ≤ [α^M + β (1 - α^M) / (1 - α)] ||u_{s-1} - w*||^2.
When the step size η is small enough, α^M + β (1 - α^M) / (1 - α) < 1. This means that our FD-SVRG has a linear convergence rate.
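As a quick numerical sanity check, one can evaluate the per-outer-loop contraction factor for representative constants (the per-step bound of the standard SVRG form with α = 1 - 2ημ + 2η²L² and β = 2η²L² is assumed here, and the concrete values of μ, L and M below are illustrative, not taken from the paper):

```python
# Numerical sanity check of the linear rate, assuming a per-step bound
#   E||w_{t+1}-w*||^2 <= a*E||w_t-w*||^2 + b*||w_0-w*||^2
# with a = 1 - 2*eta*mu + 2*eta^2*L^2 and b = 2*eta^2*L^2.

def rho(eta, mu, L, M):
    a = 1 - 2 * eta * mu + 2 * eta ** 2 * L ** 2
    b = 2 * eta ** 2 * L ** 2
    return a ** M + b * (1 - a ** M) / (1 - a)   # per-outer-loop factor

mu, L, M = 0.1, 1.0, 100                          # illustrative constants
for eta in [0.04, 0.02, 0.005]:
    print(eta, rho(eta, mu, L, M))                # each factor is below 1
```

For all three step sizes the factor stays below 1, illustrating the claim that a small enough η yields a linear rate; too large a step size (η approaching μ/L²) pushes the factor toward 1.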
4.4 Implementation Details
4.4.1 Mini-Batch
FD-SVRG can adopt a mini-batch strategy as described in (Zhao et al., 2014). In each iteration, the Workers sample a batch of instances with batch size b. Then the b partial inner products are computed at once and the resulting scalars are communicated together. Taking a mini-batch cannot reduce the total communication volume of FD-SVRG, but it can reduce the communication frequency (the number of communication rounds).
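A sketch of the batched computation (with hypothetical variable names): each Worker sends one message of b scalars per batch rather than b separate one-scalar messages.

```python
import numpy as np

# Mini-batch sketch: each Worker computes its partial inner products for a
# whole batch at once, so the b scalars per Worker travel in one message
# instead of b separate ones (same volume, fewer rounds).

rng = np.random.default_rng(4)
d, n, p, b = 8, 20, 2, 5
X = rng.normal(size=(d, n))                  # columns are instances
w = rng.normal(size=d)
blocks = np.array_split(np.arange(d), p)     # feature partition over p Workers
batch = rng.choice(n, size=b, replace=False)
# one message of b scalars per Worker:
partial_msgs = [w[blk] @ X[np.ix_(blk, batch)] for blk in blocks]
inner = np.sum(partial_msgs, axis=0)         # aggregated at the Coordinator
assert np.allclose(inner, w @ X[:, batch])
print(inner.shape)                           # (5,)
```

The total number of communicated scalars per batch is still b per Worker; only the number of messages drops from b to 1.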
4.5 Complexity Analysis
For an iteration of the outer loop in SVRG, there are n + M gradients to be computed, where M, the number of inner iterations, is generally set to the number of local data instances. For the instance-distributed methods, the data is partitioned across p machines, and each machine has n/p instances. DSVRG sets M = n/p. Then one machine computes n/p gradients in the inner loops. Since only one machine is at work in the inner loops of DSVRG, it computes n + n/p gradients in total during one outer iteration. When computing the full gradient, the center of DSVRG sends the parameter (of length d) to each machine and then receives a partial gradient (of length d) from each machine; this communication cost is 2pd. In the inner loops, the center sends the full gradient to the machine that is at work; that machine iteratively updates the parameter and then returns the parameter to the center after the iterative updating; this communication cost is 2d. So the total communication cost is 2(p+1)d. That means DSVRG computes n + n/p gradients with a communication cost of 2(p+1)d scalars.
For FD-SVRG, when computing one gradient, the communication cost of one tree-structured sum (reduce plus broadcast) is 2p scalars. For the example in Figure 5, there are 4 Workers and the communication cost is 8 scalars (solid arrows). So the total communication cost is 2p(n + n/p) = 2(p+1)n scalars for computing n + n/p gradients. Compared with the 2(p+1)d of DSVRG, we can find that when d > n, FD-SVRG has lower communication cost. Note that DSVRG only parallelizes the SVRG algorithm when computing full gradients, while our method parallelizes the SVRG algorithm both when computing full gradients and in the inner loops. This means that, to compute the same number of gradients, the average time of FD-SVRG is lower than that of DSVRG. Furthermore, FD-SVRG also parallelizes the communication by the tree-structured communication strategy, which can reduce communication time. Hence, FD-SVRG is expected to be much faster than DSVRG, which will be verified in our experiments.
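Under this accounting (assuming DSVRG communicates 2(p+1)d scalars per outer iteration and FD-SVRG 2p scalars per computed gradient, as derived here), a quick check with a news20-like shape illustrates the d > n advantage:

```python
# Back-of-the-envelope comparison of per-outer-iteration communication:
# 2(p+1)d scalars for DSVRG vs 2p scalars per gradient for FD-SVRG
# (n instances, d features, p Workers); FD-SVRG wins exactly when n < d.

def cost_dsvrg(n, d, p):
    return 2 * (p + 1) * d          # full-gradient exchange + inner-loop handoff

def cost_fdsvrg(n, d, p):
    grads = n + n // p              # full gradient + inner-loop gradients
    return 2 * p * grads            # 2p scalars per gradient via the tree

# news20-like shape: d >> n
n, d, p = 19_954, 1_355_191, 8
print(cost_dsvrg(n, d, p), cost_fdsvrg(n, d, p))
```

With these numbers FD-SVRG communicates roughly 2(p+1)n ≈ 3.6e5 scalars versus roughly 2(p+1)d ≈ 2.4e7 for DSVRG, about two orders of magnitude less, matching the d/n ratio.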
SynSVRG and AsySVRG can be implemented within the Parameter Server framework (refer to Appendix B of the supplementary material). They need to send d-dimensional vectors many times in the inner loops of SVRG, so their communication cost is on the order of nd per outer iteration, which is much higher than those of DSVRG and FD-SVRG.
5 Experiments
In this section, we choose logistic regression (LR) with an L2-norm regularization term for the experiments. Hence, formula (1) is defined as follows:
(5)  min_w P(w) = (1/n) Σ_{i=1}^n log(1 + exp(-y_i w^T x_i)) + (λ/2) ||w||_2^2.
All the experiments are performed on a cluster of machines (nodes) connected by 10Gbps Ethernet. Each machine has 12 Intel Xeon E5-2620 cores and 96GB of memory.
5.1 Data Sets
We use four data sets for evaluation: news20, url, webspam and kdd2010. All of these data sets can be downloaded from the LibSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The detailed information about these data sets is summarized in Table 1. We use 8 Workers to train on news20 because it is relatively small. For the other data sets, 16 Workers are used for training.
Data set  Features (d)  Instances (n)

news20  1,355,191  19,954 
url  3,231,961  2,396,130 
webspam  16,609,143  350,000 
kdd2010  29,890,095  19,264,097 
5.2 Experimental Setting
We chose DSVRG, AsySVRG and SynSVRG as baselines. AsySVRG and SynSVRG are implemented on the Parameter Server framework (refer to Appendix B of the supplementary material). Besides Workers, AsySVRG and SynSVRG need extra machines as Servers. In our experiments, we use 8 Servers for AsySVRG and 4 Servers for SynSVRG. The number of Workers for AsySVRG and SynSVRG is the same as that for FD-SVRG and DSVRG. The parameter w is initialized to 0. We set the number of inner iterations of each method to the number of training instances on each Worker. The step size is fixed during training.
5.3 Efficiency Comparison
When the regularization hyper-parameter λ is fixed, the convergence results are shown in Figure 6 and Figure 7. In both figures, the vertical axis denotes the gap between the objective function value and the optimal value. In Figure 6, the horizontal axis denotes the wall-clock time (in seconds) used by the different methods. In Figure 7, the horizontal axis denotes the communication cost, i.e., how many scalars have been communicated; a d-dimensional vector is counted as d scalars. We can find that FD-SVRG achieves the best performance in terms of both wall-clock time and communication cost.
Because DSVRG is the fastest baseline, we choose DSVRG as the baseline for measuring our method's speedup. The results are shown in Table 2, in which the time is recorded when the gap between the objective function value and the optimal value falls below a threshold. The times of the two methods on the four data sets are in the upper half of Table 2, and the lower half shows our method's speedup over DSVRG. We can find that our method is several times faster than DSVRG.
              DSVRG    FD-SVRG
Time          news20   2.83     0.68
(in seconds)  url      119.1    19.24
              webspam  33.01    4.23
              kdd2010  400.35   13.39
Speedup       news20   1        4.16
              url      1        6.19
              webspam  1        7.8
              kdd2010  1        29.9
PS-Lite (Li et al., 2014a) has been widely used in both academia and industry, so we also compare FD-SVRG with PS-Lite. The original implementation of PS-Lite is based on SGD (Li et al., 2014a); we denote it as PS-Lite (SGD). In particular, PS-Lite (SGD) is an asynchronous SGD implemented on PS-Lite, provided by the authors of (Li et al., 2014a). We summarize the speedup over PS-Lite (SGD) in Table 3. The time is recorded when the gap between the objective function value and the optimal value falls below a threshold. We can find that our method is hundreds or even thousands of times faster than PS-Lite (SGD).
To evaluate the influence of the regularization hyper-parameter λ, we conduct experiments on the webspam data set with two different values of λ. The results are shown in Figure 8. Once again, our method achieves the best performance in both cases.
              PS-Lite (SGD)  FD-SVRG
Time          news20   1000    0.69
(in seconds)  url      2000    19.25
              webspam  827.12  4.23
              kdd2010  2000    13.39
Speedup       news20   1       1449
              url      1       103
              webspam  1       196
              kdd2010  1       149
5.4 Scalability
By changing the number of Workers, we can compute the speedup of FD-SVRG, defined as follows:
speedup = (training time with one Worker) / (training time with p Workers).
We take the number of Workers to be 1, 4, 8 and 16. When the gap between the objective function value and the optimal value falls below a threshold, we stop the training process and record the time.
The speedup results are shown in Figure 9. We can find that FD-SVRG achieves a speedup close to the ideal one. Hence, FD-SVRG has strong scalability for handling large-scale problems.
6 Conclusion
Most existing distributed learning methods for linear classification are instance-distributed, and they cannot achieve satisfactory performance in high-dimensional applications where the dimensionality is larger than the number of instances. In this paper, we propose a novel distributed learning method, called FD-SVRG, which adopts a feature-distributed data partition strategy. Experimental results show that our method achieves the best performance when the dimensionality is larger than the number of instances.
References
 Chandrasekaran et al. (2012) Chandrasekaran, Venkat, Recht, Benjamin, Parrilo, Pablo A., and Willsky, Alan S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
 Defazio et al. (2014) Defazio, Aaron, Bach, Francis R., and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Neural Information Processing Systems, pp. 1646–1654, 2014.
 Jaggi et al. (2014) Jaggi, Martin, Smith, Virginia, Takác, Martin, Terhorst, Jonathan, Krishnan, Sanjay, Hofmann, Thomas, and Jordan, Michael I. Communication-efficient distributed dual coordinate ascent. In Neural Information Processing Systems, pp. 3068–3076, 2014.
 Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems, pp. 315–323, 2013.
 Lee et al. (2017) Lee, Jason D., Lin, Qihang, Ma, Tengyu, and Yang, Tianbao. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research, 18:122:1–122:43, 2017.
 Li et al. (2014a) Li, Mu, Andersen, David G., Park, Jun Woo, Smola, Alexander J., Ahmed, Amr, Josifovski, Vanja, Long, James, Shekita, Eugene J., and Su, BorYiing. Scaling distributed machine learning with the parameter server. In Operating Systems Design and Implementation, pp. 583–598, 2014a.
 Li et al. (2014b) Li, Mu, Andersen, David G., Smola, Alexander J., and Yu, Kai. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, pp. 19–27, 2014b.
 Lian et al. (2017) Lian, Xiangru, Zhang, Ce, Zhang, Huan, Hsieh, Cho-Jui, Zhang, Wei, and Liu, Ji. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Neural Information Processing Systems, pp. 5336–5346, 2017.
 Ma et al. (2015) Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtárik, Peter, and Takác, Martin. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning, pp. 1973–1982, 2015.
 Meng et al. (2016) Meng, Xiangrui, Bradley, Joseph K., Yavuz, Burak, Sparks, Evan R., Venkataraman, Shivaram, Liu, Davies, Freeman, Jeremy, Tsai, D. B., Amde, Manish, Owen, Sean, Xin, Doris, Xin, Reynold, Franklin, Michael J., Zadeh, Reza, Zaharia, Matei, and Talwalkar, Ameet. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17:34:1–34:7, 2016.

 Negahban et al. (2012) Negahban, Sahand N., Ravikumar, Pradeep, Wainwright, Martin J., and Yu, Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, pp. 538–557, 2012.
 Reddi et al. (2015) Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Póczos, Barnabás, and Smola, Alexander J. On variance reduction in stochastic gradient descent and its asynchronous variants. In Neural Information Processing Systems, pp. 2647–2655, 2015.
 Schmidt et al. (2017) Schmidt, Mark W., Roux, Nicolas Le, and Bach, Francis R. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
 Shalev-Shwartz & Zhang (2013) Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
 Shalev-Shwartz & Zhang (2014) Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pp. 64–72, 2014.
 Shi et al. (2015) Shi, Wei, Ling, Qing, Wu, Gang, and Yin, Wotao. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
 Sivakumar & Banerjee (2017) Sivakumar, Vidyashankar and Banerjee, Arindam. High-dimensional structured quantile regression. In International Conference on Machine Learning, pp. 3220–3229, 2017.
 Wang et al. (2016) Wang, Xiangyu, Dunson, David B., and Leng, Chenlei. No penalty no tears: Least squares in high-dimensional linear models. In International Conference on Machine Learning, pp. 1814–1822, 2016.
 Xing et al. (2015) Xing, Eric P., Ho, Qirong, Dai, Wei, Kim, Jin Kyu, Wei, Jinliang, Lee, Seunghak, Zheng, Xun, Xie, Pengtao, Kumar, Abhimanu, and Yu, Yaoliang. Petuum: A new platform for distributed machine learning on big data. In International Conference on Knowledge Discovery and Data Mining, pp. 1335–1344, 2015.
 Yang (2013) Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Neural Information Processing Systems, pp. 629–637, 2013.
 Zhao & Li (2016) Zhao, Shen-Yi and Li, Wu-Jun. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In AAAI Conference on Artificial Intelligence, pp. 2379–2385, 2016.
 Zhao et al. (2014) Zhao, Tuo, Yu, Mo, Wang, Yiming, Arora, Raman, and Liu, Han. Accelerated mini-batch randomized block coordinate descent method. In Neural Information Processing Systems, pp. 3329–3337, 2014.
 Zheng et al. (2017) Zheng, Shuxin, Meng, Qi, Wang, Taifeng, Chen, Wei, Yu, Nenghai, Ma, Zhiming, and Liu, TieYan. Asynchronous stochastic gradient descent with delay compensation. In International Conference on Machine Learning, pp. 4120–4129, 2017.
 Zinkevich et al. (2010) Zinkevich, Martin, Weimer, Markus, Smola, Alexander J., and Li, Lihong. Parallelized stochastic gradient descent. In Neural Information Processing Systems, pp. 2595–2603, 2010.
Appendix A Serial SVRG
The learning procedure of the non-distributed (serial) SVRG is shown in Algorithm 2, where η is the learning rate, w_t denotes the parameter value at iteration t, and M, the number of inner iterations, is a hyper-parameter.