Introduction
Many machine learning models can be formulated as composite optimization problems that take the form of a finite sum of functions: $\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, where $w$ is the parameter to learn (optimize), $n$ is the number of training instances, and $f_i(w)$ is the loss function on training instance $i$. For example, $f_i(w) = \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2$ in logistic regression (LR), and $f_i(w) = \max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2}\|w\|^2$ in support vector machine (SVM), where $\lambda$ is the regularization hyperparameter and $(x_i, y_i)$ is training instance $i$, with $x_i \in \mathbb{R}^d$ being the feature vector and $y_i \in \{+1, -1\}$ being the class label. Other cases like matrix factorization and deep neural networks can also be written as similar forms of composite optimization.
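As a concrete illustration of the finite-sum form, the following minimal Python sketch evaluates the objective as an average of per-instance losses. It assumes the usual L2-regularized logistic and hinge losses with labels in {+1, -1}; the function names `lr_loss`, `svm_loss`, and `objective` are illustrative and are not taken from the paper's code.

```python
import math

def lr_loss(w, x, y, lam):
    """Regularized logistic loss on one instance (x, y), with y in {+1, -1}."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    # log(1 + exp(-margin)), computed in a numerically stable way
    data_term = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return data_term + 0.5 * lam * sum(wj * wj for wj in w)

def svm_loss(w, x, y, lam):
    """Regularized hinge loss on one instance (x, y), with y in {+1, -1}."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    return max(0.0, 1.0 - margin) + 0.5 * lam * sum(wj * wj for wj in w)

def objective(w, data, lam, loss):
    """Finite-sum objective: the average of the per-instance losses."""
    return sum(loss(w, x, y, lam) for x, y in data) / len(data)
```

Passing `lr_loss` or `svm_loss` as the `loss` argument switches between the two models without changing the finite-sum structure.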
Due to its efficiency and effectiveness, stochastic optimization (SO) has recently attracted much attention for solving the composite optimization problems in machine learning [Xiao2009, Bottou2010, Duchi, Hazan, and Singer2011, Schmidt, Roux, and Bach2013, Johnson and Zhang2013, Zhang, Mahdavi, and Jin2013, ShalevShwartz and Zhang2013, ShalevShwartz and Zhang2014, Lin, Lu, and Xiao2014, Nitanda2014]. Existing SO methods can be divided into two categories. The first category is stochastic gradient descent (SGD) and its variants, such as stochastic average gradient (SAG) [Schmidt, Roux, and Bach2013] and stochastic variance reduced gradient (SVRG) [Johnson and Zhang2013], which perform optimization on the primal problem. The second category, such as stochastic dual coordinate ascent (SDCA) [ShalevShwartz and Zhang2013], performs optimization on the dual formulation. Many advanced SO methods, such as SVRG and SDCA, are more efficient than traditional batch learning methods in both theory and practice for large-scale learning problems.
Most traditional SO methods are sequential, which means that the optimization procedure is not performed in parallel. However, with the increase of data scale, traditional sequential SO methods may not be efficient enough to handle large-scale datasets. Furthermore, in this big data era, many large-scale datasets are distributively stored on a cluster of multiple machines. Traditional sequential SO methods cannot be directly used for these kinds of distributed datasets. To handle large-scale composite optimization problems, researchers have recently proposed several parallel SO (PSO) methods for multi-core systems and distributed SO (DSO) methods for clusters of multiple machines.
PSO methods perform SO on a single machine with multiple cores (multiple threads) and a shared memory. Typically, synchronous strategies with locks are much slower than asynchronous ones. Hence, recent progress on PSO mainly focuses on designing asynchronous or lock-free optimization strategies [Recht et al.2011, Liu et al.2014, Hsieh, Yu, and Dhillon2015, J. Reddi et al.2015, Zhao and Li2016].
DSO methods perform SO on clusters of multiple machines. DSO can be used to handle extremely large problems which are beyond the processing capability of a single machine. In many real applications, especially industrial applications, the datasets are typically distributively stored on clusters. Hence, DSO has recently become a hot research topic. Many DSO methods have been proposed, including distributed SGD methods based on the primal formulation and distributed methods based on the dual formulation. Representative distributed SGD methods include PSGD [Zinkevich et al.2010], BAVGM [Zhang, Wainwright, and Duchi2012] and Splash [Zhang and Jordan2015]. Representative distributed dual methods include DisDCA [Yang2013], CoCoA [Jaggi et al.2014] and CoCoA+ [Ma et al.2015]. Many of these methods come with theoretical convergence proofs and promising empirical evaluations. However, most of these DSO methods might not be scalable enough.
In this paper, we propose a novel DSO method, called scalable composite optimization for learning (SCOPE), and implement it on the fault-tolerant distributed platform Spark [Zaharia et al.2010]. SCOPE is both computation-efficient and communication-efficient. Empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and DSO methods, in terms of scalability.
Please note that some asynchronous methods or systems, such as Parameter Server [Li et al.2014], Petuum [Xing et al.2015] and the methods in [Zhang and Kwok2014, Zhang, Zheng, and Kwok2016], have also been proposed for distributed learning with promising performance. But these methods or systems cannot be easily implemented on Spark with the MapReduce programming model which is actually a bulk synchronous parallel (BSP) model. Hence, asynchronous methods are not the focus of this paper. We will leave the design of asynchronous version of SCOPE and the corresponding empirical comparison for future study.
SCOPE
Framework of SCOPE
SCOPE is based on a master-slave distributed framework, which is illustrated in Figure 1. More specifically, there is a master machine (called Master) and $p$ ($p \geq 1$) slave machines (called Workers) in the cluster. These Workers are called Worker_1, Worker_2, $\ldots$, and Worker_p, respectively.
Data Partition and Parameter Storage

For Workers: The whole dataset $D$ is distributively stored on all the Workers. More specifically, $D$ is partitioned into $p$ subsets, which are denoted as $\{D_1, D_2, \ldots, D_p\}$ with $D = \bigcup_{k=1}^{p} D_k$. $D_k$ is stored on Worker_k. The data stored on different Workers are different from each other, which means that if $i \neq j$, $D_i \cap D_j = \emptyset$.

For Master: The parameter $w$ is stored on the Master and the Master always keeps the newest version of $w$.
Different Workers cannot communicate with each other. This is similar to most existing distributed learning frameworks like MLlib [Meng et al.2016], Splash, Parameter Server, and CoCoA.
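The data partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partition` and the random shuffling are choices made here, and the paper itself puts no constraint on how the data are split across Workers.

```python
import random

def partition(data, p, seed=0):
    """Split data into p disjoint subsets whose sizes differ by at most one.

    Shuffling first is an assumption of this sketch; it makes each subset
    resemble the global data distribution.
    """
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    # Subset k takes every p-th shuffled index, starting at offset k.
    return [[data[i] for i in indices[k::p]] for k in range(p)]
```

Every instance lands in exactly one subset, so the subsets are pairwise disjoint and their union is the whole dataset.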
Optimization Algorithm
The whole optimization (learning) algorithm is completed cooperatively by the Master and Workers:

Task of Master: The operations completed by the Master are outlined in Algorithm 1. The Master has two main tasks. The first task is to compute the full gradient after the local gradient sums have been received from all Workers, and then send the full gradient to all Workers. The second task is to update the parameter $w$ after all the locally updated parameters have been received, and then send the updated parameter to all Workers. The computation load of the Master is therefore lightweight.

Task of Workers: The operations completed by the Workers are outlined in Algorithm 2. Each Worker has two main tasks. The first task is to compute the sum of the gradients on its local data (called the local gradient sum), i.e., $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$ for Worker_k, and then send the local gradient sum to the Master. The second task is to train by using only the local data, after which the Worker sends the locally updated parameter, denoted as $\tilde{u}_k$ for Worker_k, to the Master and waits for the newest $w$ from the Master.
Here, $w_t$ denotes the global parameter at the $t$th iteration and is stored on the Master. $u_{k,m}$ denotes the local parameter at the $m$th iteration on Worker_k.
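Algorithms 1 and 2 are not reproduced in this text, so the following Python sketch simulates the Master and the Workers sequentially in one process to make the two tasks concrete. It is a sketch under assumptions, not the authors' implementation: it assumes the SVRG-style local update with the extra term $c(u_{k,m} - w_t)$ discussed later in this section, assumes that each Worker returns its last local iterate, assumes that the Master averages the returned parameters, and uses L2-regularized logistic regression as the loss. All function and parameter names are hypothetical.

```python
import math
import random

def lr_grad(w, x, y, lam):
    """Gradient of log(1 + exp(-y * <x, w>)) + (lam / 2) * ||w||^2."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    # s = 1 / (1 + exp(margin)), written to avoid overflow
    if margin >= 0:
        e = math.exp(-margin)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return [-y * s * xj + lam * wj for xj, wj in zip(x, w)]

def lr_objective(w, parts, lam):
    """Average regularized logistic loss over all partitions."""
    total, n = 0.0, 0
    for part in parts:
        for x, y in part:
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
            n += 1
    return total / n + 0.5 * lam * sum(wj * wj for wj in w)

def worker_gradient_sum(w_t, part, lam):
    """Worker task 1: sum of the gradients over the local data at w_t."""
    z_k = [0.0] * len(w_t)
    for x, y in part:
        z_k = [a + b for a, b in zip(z_k, lr_grad(w_t, x, y, lam))]
    return z_k

def worker_local_update(w_t, z, part, lam, eta, c, M, rng):
    """Worker task 2: M local stochastic steps that use only local data."""
    u = list(w_t)
    for _ in range(M):
        x, y = rng.choice(part)
        g_u = lr_grad(u, x, y, lam)
        g_w = lr_grad(w_t, x, y, lam)
        # Variance-reduced gradient plus the extra term c * (u - w_t).
        u = [uj - eta * (gu - gw + zj + c * (uj - wj))
             for uj, gu, gw, zj, wj in zip(u, g_u, g_w, z, w_t)]
    return u

def scope(parts, dim, lam, eta, c, T, M, seed=0):
    """Sequential simulation of the Master loop over T outer iterations."""
    rng = random.Random(seed)
    n = sum(len(part) for part in parts)
    p = len(parts)
    w = [0.0] * dim
    for _ in range(T):
        # Master task 1: full gradient from the local gradient sums.
        sums = [worker_gradient_sum(w, part, lam) for part in parts]
        z = [sum(col) / n for col in zip(*sums)]
        # Master task 2: average the locally updated parameters.
        local = [worker_local_update(w, z, part, lam, eta, c, M, rng)
                 for part in parts]
        w = [sum(col) / p for col in zip(*local)]
    return w
```

In a real deployment the two list comprehensions over `parts` would run in parallel on the Workers, and only the vectors `z_k`, `z`, the local parameters, and `w` would cross the network.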
SCOPE is inspired by SVRG [Johnson and Zhang2013], which utilizes the full gradient to speed up the convergence of stochastic optimization. However, the original SVRG in [Johnson and Zhang2013] is sequential. To design a distributed SVRG method, one natural strategy is to adapt the mini-batch SVRG [Zhao et al.2014] to distributed settings, which is a typical strategy in most distributed SGD frameworks like Parameter Server [Li et al.2014] and Petuum [Xing et al.2015]. In the appendix, we briefly outline the sequential SVRG and the mini-batch based distributed SVRG (called DisSVRG); all the appendices and proofs of this paper can be found in the arXiv version of this paper [Zhao et al.2016]. There exist three major differences between SCOPE and SVRG (or DisSVRG).
The first difference is that in SCOPE each Worker locally performs stochastic optimization by only using its native data (refer to the update on the local parameter $u_{k,m+1}$ for each Worker_k in Algorithm 2). On the contrary, SVRG or DisSVRG performs stochastic optimization on the Master (refer to the update on $u_{m+1}$) based on the whole dataset, which means that we need to randomly pick an instance or a mini-batch from the whole dataset $D$ in each iteration of stochastic optimization. The local stochastic optimization in SCOPE can dramatically reduce the communication cost, compared with DisSVRG with the mini-batch strategy.
The second difference is the update rule of the global parameter $w_{t+1}$ in the Master. There are no locally updated parameters in DisSVRG with the mini-batch strategy, and hence the update rule of $w_{t+1}$ in the Master for DisSVRG cannot be written in the form of Algorithm 1, i.e., $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} \tilde{u}_k$.
The third difference is the update rule for $u_{k,m+1}$ in SCOPE and $u_{m+1}$ in SVRG or DisSVRG. Compared to SVRG, SCOPE has an extra term $c(u_{k,m} - w_t)$ in Algorithm 2 to guarantee convergence, where $c$ is a parameter related to the objective function. The rigorous theoretical proof will be provided in the following section about convergence. Here, we just give some intuition about the extra term $c(u_{k,m} - w_t)$. Since SCOPE puts no constraints on how to partition training data on different Workers, the data distributions on different Workers may be totally different from each other. That means the local gradient in each Worker cannot necessarily approximate the full gradient. Hence, the term $\nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z$, where $i_{k,m}$ is the instance sampled by Worker_k at local iteration $m$ and $z$ is the full gradient at $w_t$, is a biased estimate of the full gradient. This is different from SVRG, whose stochastic gradient is an unbiased estimate of the full gradient. The biased estimate in SCOPE may lead $u_{k,m+1}$ to be far away from the optimal value $w^*$. To avoid this, we use the technique in the proximal stochastic gradient that adds an extra term $c(u_{k,m} - w_t)$ to make $u_{k,m+1}$ not be far away from $w_t$. If $w_t$ is close to $w^*$, $u_{k,m+1}$ will also be close to $w^*$. So the extra term in SCOPE is reasonable for convergence guarantee. At the same time, it does not bring extra computation since the update rule in SCOPE can be rewritten as $u_{k,m+1} = (1 - c\eta) u_{k,m} - \eta\left(\nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + \hat{z}\right)$, where $\eta$ is the step size and $\hat{z} = z - c w_t$ can be precomputed and fixed as a constant for different $m$.
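To make the "no extra computation" remark concrete, the sketch below compares two algebraically equivalent forms of the local step. It assumes the update $u_{k,m+1} = u_{k,m} - \eta(g_u - g_w + z + c(u_{k,m} - w_t))$, where $g_u$ and $g_w$ stand for the sampled instance's gradients at $u_{k,m}$ and $w_t$, and the rewritten form $u_{k,m+1} = (1 - c\eta)u_{k,m} - \eta(g_u - g_w + \hat{z})$ with $\hat{z} = z - c w_t$. The function names are hypothetical.

```python
def update_original(u, g_u, g_w, z, w_t, eta, c):
    """One local step with the extra term c * (u - w_t) written explicitly."""
    return [uj - eta * (gu - gw + zj + c * (uj - wj))
            for uj, gu, gw, zj, wj in zip(u, g_u, g_w, z, w_t)]

def precompute_z_hat(z, w_t, c):
    """z_hat = z - c * w_t, computed once per outer iteration."""
    return [zj - c * wj for zj, wj in zip(z, w_t)]

def update_rewritten(u, g_u, g_w, z_hat, eta, c):
    """The same step, with the extra term folded into a scaling of u."""
    return [(1.0 - c * eta) * uj - eta * (gu - gw + zh)
            for uj, gu, gw, zh in zip(u, g_u, g_w, z_hat)]
```

Because `z_hat` does not depend on the inner iteration, the rewritten form needs only one extra scalar multiplication of `u` per step instead of a vector subtraction against `w_t`.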
Besides the above mini-batch based strategy (DisSVRG) for distributed SVRG, there also exist some other distributed SVRG methods, including DSVRG [Lee et al.2016], KroMagnon [Mania et al.2015], SVRGfoR [Konecný, McMahan, and Ramage2015] and the distributed SVRG in [De and Goldstein2016]. DSVRG needs communication between Workers, and hence it cannot be directly implemented on Spark. KroMagnon focuses on an asynchronous strategy, which cannot be implemented on Spark either. SVRGfoR can be implemented on Spark, but it provides no theoretical results about convergence. Furthermore, SVRGfoR is proposed for cases with unbalanced data partitions and sparse features. On the contrary, our SCOPE can be used for any kind of features with a theoretical guarantee of convergence. Moreover, in our experiment, we find that our SCOPE can outperform SVRGfoR. The distributed SVRG in [De and Goldstein2016] cannot be guaranteed to converge because it is similar to the version of SCOPE with $c = 0$.
EASGD [Zhang, Choromanska, and LeCun2015] also adopts a parameter like $c$ to control the difference between the local update and global update. However, EASGD assumes that each worker has access to the entire dataset while SCOPE only requires that each worker has access to a subset. Local learning strategy is also adopted in other problems like probabilistic logic programs [Riguzzi et al.2016].
Communication Cost
Traditional mini-batch based distributed SGD methods, such as DisSVRG in the appendix, need to transfer the parameter $w$ and stochastic gradients frequently between Workers and Master. For example, the number of communication times is $O(TM)$ for DisSVRG, where $T$ is the number of outer iterations and $M$ is the number of inner iterations. Other traditional mini-batch based distributed SGD methods have the same number of communication times. Typically, $M = \Theta(n)$. Hence, traditional mini-batch based methods have $O(Tn)$ communication times, which may lead to high communication cost.
Most training (computation) load of SCOPE comes from the inner loop of Algorithm 2, which is done at the local Worker without any communication. The number of communication times in SCOPE is $O(T)$, which is dramatically less than the $O(Tn)$ of traditional mini-batch based distributed SGD or distributed SVRG methods. In the following section, we will prove that SCOPE has a linear convergence rate in terms of the iteration number $T$. It means that to achieve an $\epsilon$-optimal solution ($\hat{w}$ is called an $\epsilon$-optimal solution if $E\|\hat{w} - w^*\|^2 \leq \epsilon$, where $w^*$ is the optimal solution), $T = O(\log\frac{1}{\epsilon})$. Hence, $T$ is typically not large for many problems. For example, in most of our experiments, we can achieve convergent results with a small $T$. Hence, SCOPE is communication-efficient. SCOPE is a synchronous framework, which means that some waiting time is also needed for synchronization. Because the number of synchronizations is also $O(T)$ and $T$ is typically a small number, the waiting time is also small.
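The gap between the two communication patterns can be illustrated with simple arithmetic. The sketch below is a back-of-the-envelope count under assumptions made here: a mini-batch method is taken to exchange data once per inner iteration, and SCOPE is taken to need two Master-Worker exchanges per outer iteration (one for the gradient, one for the parameter), as suggested by the two tasks described earlier. The constants and function names are illustrative only.

```python
def rounds_minibatch(T, M):
    """Communication rounds when every inner iteration talks to the Master."""
    return T * M

def rounds_scope(T, exchanges_per_outer=2):
    """Communication rounds when only the outer loop talks to the Master."""
    return T * exchanges_per_outer
```

For example, with `T = 10` outer iterations and `M = 100000` inner iterations, the first count is 1,000,000 rounds while the second is 20.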
SCOPE on Spark
The computing framework of SCOPE is well suited to the popular distributed platform Spark. The programming model underlying Spark is MapReduce, which is actually a BSP model. In SCOPE, the tasks of the Workers, namely computing the local gradient sum and running the training procedure in the inner loop of Algorithm 2, can be seen as the Map process since both of them only use local data. The task of the Master, which computes the average for both the full gradient $z$ and $w_{t+1}$, can be seen as the Reduce process.
The MapReduce programming model is essentially a synchronous model, which needs some synchronization cost. Fortunately, the number of synchronization times is very small as stated above. Hence, both communication cost and waiting time are very small for SCOPE. In this paper, we implement our SCOPE on Spark since Spark has been widely adopted in industry for big data applications, and our SCOPE can be easily integrated into the data processing pipeline of those organizations using Spark.
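The Map and Reduce roles described above can be mimicked in plain Python with the built-in `map` and `functools.reduce`, which stand in here for the distributed operations of a platform such as Spark. This is a single-process sketch, not the paper's Scala implementation; the toy loss $0.5\|w - a\|^2$ and all function names are assumptions chosen to keep the example short.

```python
from functools import reduce

def local_gradient_sum(w, part, grad):
    """Map side: sum of per-instance gradients over one partition."""
    total = [0.0] * len(w)
    for instance in part:
        total = [t + g for t, g in zip(total, grad(w, instance))]
    return total

def vector_add(a, b):
    """Reduce side: element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def full_gradient(w, parts, grad):
    """Full gradient as a Map over partitions followed by a Reduce."""
    n = sum(len(part) for part in parts)
    local_sums = map(lambda part: local_gradient_sum(w, part, grad), parts)
    total = reduce(vector_add, local_sums)
    return [t / n for t in total]

def toy_grad(w, instance):
    """Gradient of the toy loss 0.5 * ||w - a||^2 for instance a."""
    return [wj - aj for wj, aj in zip(w, instance)]
```

Only one vector per partition reaches the Reduce step, which is why the Master's share of the work stays small.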
Convergence of SCOPE
In this section, we will prove the convergence of SCOPE when the objective functions are strongly convex. We only list some Lemmas and Theorems, the detailed proofs of which can be found in the appendices [Zhao et al.2016].
For convenience, we use $w^*$ to denote the optimal solution. $\|\cdot\|$ denotes the $L_2$ norm $\|\cdot\|_2$. We assume that $n = pq$, which means that each Worker has the same number of training instances and $|D_1| = |D_2| = \cdots = |D_p| = q$. In practice, we cannot necessarily guarantee that these $|D_k|$s are the same. However, it is easy to guarantee that $\forall i, j$, $\big||D_i| - |D_j|\big| \leq 1$, which will not affect the performance.
We define $p$ local functions as $F_k(w) = \frac{1}{q}\sum_{i \in D_k} f_i(w)$, where $k = 1, 2, \ldots, p$. Then we have $P(w) = \frac{1}{p}\sum_{k=1}^{p} F_k(w)$.
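The relation between the local functions and the global objective can be checked numerically. The sketch below assumes the definitions $F_k(w) = \frac{1}{q}\sum_{i \in D_k} f_i(w)$ and $P(w) = \frac{1}{n}\sum_{i} f_i(w)$, and uses a scalar toy loss $(w - a)^2$ that is not from the paper. The function names are hypothetical.

```python
def local_function(w, part, f):
    """F_k(w): average loss over the instances of one partition."""
    return sum(f(w, inst) for inst in part) / len(part)

def global_objective(w, parts, f):
    """P(w): average loss over all instances of all partitions."""
    n = sum(len(part) for part in parts)
    return sum(f(w, inst) for part in parts for inst in part) / n

def mean_of_local_functions(w, parts, f):
    """(1 / p) * sum_k F_k(w)."""
    return sum(local_function(w, part, f) for part in parts) / len(parts)

def toy_loss(w, a):
    """Scalar toy loss used only for this illustration."""
    return (w - a) ** 2
```

With equally sized partitions the two quantities coincide; with unequal sizes the plain average of the local functions weights the instances unevenly, which is why the equal-size assumption is stated explicitly.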
To prove the convergence of SCOPE, we first give two assumptions which have also been widely adopted by most existing stochastic optimization algorithms for convergence proof.
Assumption 1 (Smooth Gradient).
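There exists a constant $L > 0$ such that $\forall a, b \in \mathbb{R}^d$ and $i = 1, 2, \ldots, n$, we have $\|\nabla f_i(a) - \nabla f_i(b)\| \leq L\|a - b\|$.
Assumption 2 (Strongly Convex).
For each local function $F_k(\cdot)$, there exists a constant $\mu > 0$ such that $\forall a, b \in \mathbb{R}^d$, we have $F_k(a) \geq F_k(b) + \nabla F_k(b)^T(a - b) + \frac{\mu}{2}\|a - b\|^2$.
Please note that these assumptions are weaker than those in [Zhang and Jordan2015, Ma et al.2015, Jaggi et al.2014], since we do not need each $f_i(w)$ to be convex and we do not make any assumption about the Hessian matrices either.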
Lemma 1.
Let . If , then we have .
Let , . Given and which are determined by the objective function, we can always guarantee , , and by setting . We have the following theorems:
Theorem 1.
If we take , then we can get the following convergence result:
When , , which means we can get a linear convergence rate if we take .
Theorem 2.
If we take with , then we can get the following convergence result:
When , , which means we can also get a linear convergence rate if we take with .
According to Theorem 1 and Theorem 2, we can find that SCOPE gets a linear convergence rate when is larger than some threshold. To achieve an optimal solution, the computation complexity of each worker is . In our experiment, we find that good performance can be achieved with . Hence, SCOPE is computationefficient.
Impact of Parameter $c$
In Algorithm 2, we need the parameter $c$ to guarantee the convergence of SCOPE. Specifically, Lemma 1 places a condition on $c$. Here, we discuss the necessity of $c$.
We first assume $c = 0$, and try to find whether Algorithm 2 will converge or not. It means that in the following derivation, we always assume $c = 0$.
Let us define another local function:
and denote
Let . When , . Then, we have and . Hence, we can find that each local Worker actually tries to optimize the local function with SVRG based on the local data . It means that if we set a relatively small and a relatively large , the will converge to .
Since is strongly convex, we have . Then, we can get
For the lefthand side, we have
For the righthand side, we have
Combining the two approximations, we can get
where and are two Hessian matrices for the local function and the global function , respectively. Assuming in each iteration we can always get the local optimal values for all local functions, we have
(1) 
Please note that all the above derivations assume that $c = 0$. From (1), we can find that Algorithm 2 will not necessarily converge if $c = 0$, and the convergence property is dependent on the Hessian matrices of the local functions.
Here, we give a simple example for illustration. We set and . We set a small stepsize and a large . The convergence results of SCOPE with different are presented in Table 1.
Table 1: Convergence of SCOPE with different values of $c$.
$c$        0   1   5   10
Converge?  No  No  No  Yes
Separating Data Uniformly
If we separate data uniformly, which means that the local data distribution on each Worker is similar to the global data distribution, then we have and . From (1), we can find that can make SCOPE converge for this special case.
Experiment
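We choose logistic regression (LR) with an $L_2$-norm regularization term to evaluate SCOPE and baselines. Hence, $P(w)$ is defined as $P(w) = \frac{1}{n}\sum_{i=1}^{n}\left[\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2\right]$. The code can be downloaded from https://github.com/LIBBLE/LIBBLESpark/.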
Dataset
We use four datasets for evaluation. They are MNIST8M, epsilon, KDD12 and DataA. The first two datasets can be downloaded from the LibSVM website (https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/). MNIST8M contains 8,100,000 handwritten digits. We set the instances of digits 5 to 9 as positive, and set the instances of digits 0 to 4 as negative. KDD12 is the dataset of Track 1 for KDD Cup 2012, which can be downloaded from the KDD Cup website (http://www.kddcup2012.org/). DataA is a dataset from a data mining competition (http://www.yiban.cn/project/2015ccf/comp_detail.php?cid=231). The information about these datasets is summarized in Table 2. All the data is normalized before training. The regularization hyperparameter $\lambda$ is set to $10^{-4}$ for the first three datasets, which are relatively small, and is set to $10^{-6}$ for the largest dataset DataA. Similar phenomena can be observed for other values of $\lambda$, which are omitted due to space limitation. For all datasets, we set .
Table 2: Information about the datasets.
Dataset   #instances    #features   memory   $\lambda$
MNIST8M   8,100,000     784         39G      1e-4
epsilon   400,000       2,000       11G      1e-4
KDD12     73,209,277    1,427,495   21G      1e-4
DataA     106,691,093   320         260G     1e-6
Experimental Setting and Baseline
Distributed Platform
We have a Spark cluster of 33 machines (nodes) connected by 10GB Ethernet. Each machine has 12 Intel Xeon E5-2620 cores with 64GB memory. We construct two clusters, a small one and a large one, from the original 33 machines for our experiments. The small cluster contains 9 machines: one master and eight slaves. We use 2 cores for each slave. The large cluster contains 33 machines: 1 master and 32 slaves. We use 4 cores for each slave. In both clusters, each machine has access to 64GB memory on the corresponding machine and one core corresponds to one Worker. Hence, the small cluster has one Master and 16 Workers, and the large cluster has one Master and 128 Workers. The small cluster is for experiments on the three relatively small datasets including MNIST8M, epsilon and KDD12. The large cluster is for experiments on the largest dataset DataA. We use Spark 1.5.2 for our experiment, and implement our SCOPE in Scala.
Baseline
Because the focus of this paper is to design distributed learning methods for Spark, we compare SCOPE with distributed learning baselines which can be implemented on Spark. More specifically, we adopt the following baselines for comparison:

MLlib (http://spark.apache.org/mllib/) [Meng et al.2016]: MLlib is an open source library for distributed machine learning on Spark. It is mainly based on two optimization methods: mini-batch based distributed SGD and distributed L-BFGS. We find that the distributed SGD method is much slower than distributed L-BFGS on Spark in our experiments. Hence, we only compare our method with distributed L-BFGS for MLlib, which is a batch learning method.

LibLinear (https://www.csie.ntu.edu.tw/ cjlin/liblinear/) [Lin et al.2014]: LibLinear is a distributed Newton method, which is also a batch learning method.

Splash (http://zhangyuc.github.io/splash) [Zhang and Jordan2015]: Splash is a distributed SGD method that uses the local learning strategy to reduce communication cost [Zhang, Wainwright, and Duchi2012], which is different from the mini-batch based distributed SGD method.

CoCoA (https://github.com/gingsmith/cocoa) [Jaggi et al.2014]: CoCoA is a distributed dual coordinate ascent method that uses the local learning strategy to reduce communication cost, and it is formulated from the dual problem.

CoCoA+ (https://github.com/gingsmith/cocoa) [Ma et al.2015]: CoCoA+ is an improved version of CoCoA. Different from CoCoA, which averages local updates to obtain the global parameters, CoCoA+ adds the local updates.
The above baselines include state-of-the-art distributed learning methods with different characteristics. The authors of all these methods have released their source code publicly. We use the source code provided by the authors for our experiment. For all baselines, we try several parameter values and report the best performance.
Efficiency Comparison with Baselines
We compare SCOPE with other baselines on the four datasets. The result is shown in Figure 2. Each marked point on the curves denotes one update for $w$ by the Master, which typically corresponds to an iteration in the outer loop. For SCOPE, good convergence results can be obtained with fewer than five updates (i.e., the $T$ in Algorithm 1). We can find that Splash oscillates on some datasets since it introduces variance in the training process. On the contrary, SCOPE is stable, which is consistent with SCOPE being a variance reduction method like SVRG. The curves indicate that SCOPE has a linear convergence rate, which also conforms to our theoretical analysis. Furthermore, SCOPE is much faster than all the other baselines.
SCOPE can also outperform SVRGfoR [Konecný, McMahan, and Ramage2015] and DisSVRG. Experimental comparison can be found in appendix [Zhao et al.2016].
Speedup
We use dataset MNIST8M for speedup evaluation of SCOPE. Two cores are used for each machine. We evaluate speedup by increasing the number of machines. The training process will stop when the gap between the objective function value and the optimal value is less than . The is defined as follows: where is the number of machines and we choose . The experiments are performed by 5 times and the average time is reported for the final speedup result.
The speedup result is shown in Figure 3, where we can find that SCOPE has a super-linear speedup. This might be reasonable due to the higher cache hit ratio with more machines [Yu et al.2014]. This speedup result is quite promising in our multi-machine setting since the communication cost is much larger than that of the multi-thread setting. The good speedup of SCOPE can be explained by the fact that most training work can be locally completed by each Worker and SCOPE does not need much communication.
SCOPE is based on the synchronous MapReduce framework of Spark. One shortcoming of synchronous framework is the synchronization cost, which includes both communication time and waiting time. We also do experiments to show the low synchronization cost of SCOPE, which can be found in the appendix [Zhao et al.2016].
Conclusion
In this paper, we propose a novel DSO method, called SCOPE, for distributed machine learning on Spark. Theoretical analysis shows that SCOPE is convergent with a linear convergence rate for strongly convex cases. Empirical results show that SCOPE can outperform other state-of-the-art distributed methods on Spark.
Acknowledgements
This work is partially supported by the “DengFeng” project of Nanjing University.
References
 [Bottou2010] Bottou, L. 2010. Largescale machine learning with stochastic gradient descent. In International Conference on Computational Statistics.
 [De and Goldstein2016] De, S., and Goldstein, T. 2016. Efficient distributed SGD with variance reduction. In IEEE International Conference on Data Mining.
 [Duchi, Hazan, and Singer2011] Duchi, J. C.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
 [Hsieh, Yu, and Dhillon2015] Hsieh, C.J.; Yu, H.F.; and Dhillon, I. S. 2015. Passcode: Parallel asynchronous stochastic dual coordinate descent. In International Conference on Machine Learning.
 [J. Reddi et al.2015] J. Reddi, S.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. J. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In Neural Information Processing Systems.
 [Jaggi et al.2014] Jaggi, M.; Smith, V.; Takac, M.; Terhorst, J.; Krishnan, S.; Hofmann, T.; and Jordan, M. I. 2014. Communicationefficient distributed dual coordinate ascent. In Neural Information Processing Systems.
 [Johnson and Zhang2013] Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems.
 [Konecný, McMahan, and Ramage2015] Konecný, J.; McMahan, B.; and Ramage, D. 2015. Federated optimization: Distributed optimization beyond the datacenter. arXiv:1511.03575.
 [Lee et al.2016] Lee, J. D.; Lin, Q.; Ma, T.; and Yang, T. 2016. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv:1507.07595v2.
 [Li et al.2014] Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation.
 [Lin et al.2014] Lin, C.Y.; Tsai, C.H.; Lee, C.P.; and Lin, C.J. 2014. Largescale logistic regression and linear support vector machines using spark. In IEEE International Conference on Big Data.
 [Lin, Lu, and Xiao2014] Lin, Q.; Lu, Z.; and Xiao, L. 2014. An accelerated proximal coordinate gradient method. In Neural Information Processing Systems.
 [Liu et al.2014] Liu, J.; Wright, S. J.; Ré, C.; Bittorf, V.; and Sridhar, S. 2014. An asynchronous parallel stochastic coordinate descent algorithm. In International Conference on Machine Learning.
 [Ma et al.2015] Ma, C.; Smith, V.; Jaggi, M.; Jordan, M. I.; Richtárik, P.; and Takác, M. 2015. Adding vs. averaging in distributed primaldual optimization. In International Conference on Machine Learning.
 [Mania et al.2015] Mania, H.; Pan, X.; Papailiopoulos, D. S.; Recht, B.; Ramchandran, K.; and Jordan, M. I. 2015. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv:1507.06970.
 [Meng et al.2016] Meng, X.; Bradley, J.; Yavuz, B.; Sparks, E.; Venkataraman, S.; Liu, D.; Freeman, J.; Tsai, D.; Amde, M.; Owen, S.; Xin, D.; Xin, R.; Franklin, M. J.; Zadeh, R.; Zaharia, M.; and Talwalkar, A. 2016. Mllib: Machine learning in apache spark. Journal of Machine Learning Research 17(34):1–7.
 [Nitanda2014] Nitanda, A. 2014. Stochastic proximal gradient descent with acceleration techniques. In Neural Information Processing Systems.
 [Recht et al.2011] Recht, B.; Re, C.; Wright, S. J.; and Niu, F. 2011. Hogwild!: A lockfree approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems.

 [Riguzzi et al.2016] Riguzzi, F.; Bellodi, E.; Zese, R.; Cota, G.; and Lamma, E. 2016. Scaling structure learning of probabilistic logic programs by mapreduce. In European Conference on Artificial Intelligence.
 [Schmidt, Roux, and Bach2013] Schmidt, M. W.; Roux, N. L.; and Bach, F. R. 2013. Minimizing finite sums with the stochastic average gradient. CoRR abs/1309.2388.
 [ShalevShwartz and Zhang2013] ShalevShwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research 14(1):567–599.
 [ShalevShwartz and Zhang2014] ShalevShwartz, S., and Zhang, T. 2014. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning.
 [Xiao2009] Xiao, L. 2009. Dual averaging method for regularized stochastic learning and online optimization. In Neural Information Processing Systems.
 [Xing et al.2015] Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; and Yu, Y. 2015. Petuum: A new platform for distributed machine learning on big data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
 [Yang2013] Yang, T. 2013. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Neural Information Processing Systems.
 [Yu et al.2014] Yu, Z.Q.; Shi, X.J.; Yan, L.; and Li, W.J. 2014. Distributed stochastic ADMM for matrix factorization. In International Conference on Information and Knowledge Management.
 [Zaharia et al.2010] Zaharia, M.; Chowdhury, M.; Franklin, M. J.; Shenker, S.; and Stoica, I. 2010. Spark: Cluster computing with working sets. In USENIX Workshop on Hot Topics in Cloud Computing.
 [Zhang and Jordan2015] Zhang, Y., and Jordan, M. I. 2015. Splash: Userfriendly programming interface for parallelizing stochastic algorithms. CoRR abs/1506.07552.
 [Zhang and Kwok2014] Zhang, R., and Kwok, J. T. 2014. Asynchronous distributed ADMM for consensus optimization. In International Conference on Machine Learning.
 [Zhang, Choromanska, and LeCun2015] Zhang, S.; Choromanska, A.; and LeCun, Y. 2015. Deep learning with elastic averaging SGD. In Neural Information Processing Systems.
 [Zhang, Mahdavi, and Jin2013] Zhang, L.; Mahdavi, M.; and Jin, R. 2013. Linear convergence with condition number independent access of full gradients. In Neural Information Processing Systems.
 [Zhang, Wainwright, and Duchi2012] Zhang, Y.; Wainwright, M. J.; and Duchi, J. C. 2012. Communicationefficient algorithms for statistical optimization. In Neural Information Processing Systems.
 [Zhang, Zheng, and Kwok2016] Zhang, R.; Zheng, S.; and Kwok, J. T. 2016. Asynchronous distributed semistochastic gradient optimization. In AAAI Conference on Artificial Intelligence.
 [Zhao and Li2016] Zhao, S.Y., and Li, W.J. 2016. Fast asynchronous parallel stochastic gradient descent: A lockfree approach with convergence guarantee. In AAAI Conference on Artificial Intelligence.
 [Zhao et al.2014] Zhao, T.; Yu, M.; Wang, Y.; Arora, R.; and Liu, H. 2014. Accelerated minibatch randomized block coordinate descent method. In Neural Information Processing Systems.
 [Zhao et al.2016] Zhao, S.Y.; Xiang, R.; Shi, Y.H.; Gao, P.; and Li, W.J. 2016. SCOPE: scalable composite optimization for learning on Spark. CoRR abs/1602.00133.
 [Zinkevich et al.2010] Zinkevich, M.; Weimer, M.; Li, L.; and Smola, A. J. 2010. Parallelized stochastic gradient descent. In Neural Information Processing Systems.
Appendix A
SVRG and Mini-Batch based Distributed SVRG
The sequential SVRG is outlined in Algorithm 3, which is the same as the original SVRG in [Johnson and Zhang2013].
Proof of Lemma 1
We define the local stochastic gradient in Algorithm 2 as follows:
Then the update rule at local Workers can be rewritten as follows:
(2) 
Lemma 2.
The conditional expectation of local stochastic gradient on is
Proof.
∎
Lemma 3.
The variance of has the following property:
Proof.
The second inequality uses Assumption 1. The third inequality uses the fact that . ∎