The need to scale up machine learning in the face of sheer volumes of data has spurred recent interest in developing efficient distributed optimization algorithms. Distributed machine learning jobs often involve solving a non-convex, decomposable, and regularized optimization problem of the following form:

$$\min_{x}\ \sum_{i=1}^{N} f_i(x) + h(x), \qquad \text{s.t. } x \in X_i,\ i = 1, \dots, N, \tag{1}$$
where each $f_i$ is a smooth but possibly non-convex function, fitting the model to local training data available on node $i$; each $X_i$ is a closed, convex, and compact set; and the regularizer $h$ is a separable, convex but possibly non-smooth regularization term to prevent overfitting. Example problems of this type can be found in deep learning with regularization [Dean et al.2012, Chen et al.2015], robust matrix completion [Niu et al.2011], LASSO [Tibshirani et al.2005], sparse logistic regression [Liu et al.2009], and sparse support vector machine (SVM) [Friedman et al.2001].
To date, a number of efficient asynchronous and distributed stochastic gradient descent (SGD) algorithms, e.g., [Niu et al.2011, Lian et al.2015, Li et al.2014a], have been proposed, in which each worker node asynchronously updates its local model or gradients based on its local dataset, and sends them to the server(s) for model updates or aggregation. Yet, SGD is not particularly suitable for solving optimization problems with non-smooth objectives or with constraints, which are prevalent in practical machine learning with regularization, e.g., [Liu et al.2009]. Distributed (synchronous) ADMM [Boyd et al.2011, Zhang and Kwok2014, Chang et al.2016a, Chang et al.2016b, Hong2017, Wei and Ozdaglar2013, Mota et al.2013, Taylor et al.2016] has been widely studied as an alternative method, which avoids the common pitfalls of SGD for highly non-convex problems, such as saturation effects, poor conditioning, and saddle points [Taylor et al.2016]. The original idea of distributed ADMM can be found in [Boyd et al.2011], which describes an essentially synchronous algorithm. In this work, we focus on studying the asynchronous distributed alternating direction method of multipliers (ADMM) for non-convex non-smooth optimization.
Asynchronous distributed ADMM has been actively discussed in recent literature. Zhang and Kwok [2014] consider an asynchronous ADMM assuming bounded delay, which enables each worker node to update a local copy of the model parameters asynchronously without waiting for other workers to complete their work, while a single server is responsible for driving the local copies of model parameters toward the global consensus variables. They provide a proof of convergence for convex objective functions only. Wei and Ozdaglar [2013] assume that communication links between nodes can fail randomly, and propose an ADMM scheme that converges almost surely to a saddle point. Chang et al. [2016a, 2016b] propose an asynchronous ADMM algorithm with analysis for non-convex objective functions. However, their work requires each worker to solve a subproblem exactly, which is often costly in practice. Hong [2017] proposes another asynchronous ADMM algorithm, where each worker only computes the gradients based on local data, while all model parameter updates happen at a single server, a possible bottleneck in large clusters.
To our knowledge, all existing work on asynchronous distributed ADMM requires locking global consensus variables at the (single) server for each model update; although asynchrony is allowed among workers, i.e., workers are allowed to be at different iterations of model updating. Such atomic or memory-locking operations essentially serialize model updates contributed by different workers, which may seriously limit the algorithm scalability. In many practical problems, not all workers need to access all model parameters. For example, in recommender systems, a local dataset of user-item interactions is only associated with a specific set of users (and items), and therefore does not need to access the latent variables of other users (or items). In text categorization, each document usually consists of a subset of words or terms in corpus, and each worker only needs to deal with the words in its own local corpus.
A distributed machine learning architecture is illustrated in Fig. 1. There are multiple server nodes, each known as a parameter server ("PS"), storing a subset (block) of the model parameters (consensus variables). There are also multiple worker nodes; each worker owns a local dataset and has a loss function $f_i$ depending on one or several blocks of model parameters, but not necessarily all of them. If there is only one server node, the architecture in Fig. 1 degenerates to a "star" topology with a single master, which has been adopted by Spark [Zaharia et al.2010]. With multiple servers, the system is also called a Parameter Server architecture [Dean et al.2012, Li et al.2014a] and has been adopted by many large-scale machine learning systems, including TensorFlow [Abadi et al.2016] and MXNet [Chen et al.2015].
It is worth noting that enabling block-wise updates in ADMM is critical for training large models, such as sparse logistic regression and robust matrix completion, since not all worker nodes need to work on all model parameters: each worker only needs to work on the blocks of parameters pertaining to its local dataset. For these reasons, block-wise updates have been extensively studied for a number of gradient-type distributed optimization algorithms, including SGD [Lian et al.2015], proximal gradient descent [Li et al.2014b], block or stochastic coordinate descent (BCD or SCD) [Liu and Wright2015], as well as for a recently proposed block successive upper bound minimization method (BSUM) [Hong et al.2016b].
In this work, we propose the first block-wise asynchronous distributed ADMM algorithm that can increase efficiency over existing single-server ADMM algorithms, by better exploiting the parallelization opportunity in model parameter updates. Specifically, we introduce the general form consensus optimization problem [Boyd et al.2011], and solve it in a block-wise asynchronous fashion, thus making ADMM amenable for implementation on Parameter Server, with multiple servers hosting model parameters. In our algorithm, each worker only needs to work on one or multiple blocks of parameters that are relevant to its local data, while different blocks of model parameters can be updated in parallel asynchronously subject to a bounded delay. Since this scheme does not require locking all the decision variables together, it belongs to the set of lock-free optimization algorithms (e.g., HOGWILD! [Niu et al.2011] as a lock-free version of SGD) in the literature. Our scheme is also useful on shared memory systems, such as on a single machine with multi-cores or multiple GPUs, where enforcing atomicity on all the consensus variables is inefficient.
Theoretically, we prove that, for general non-convex objective functions, our scheme can converge to stationary points. Experimental results on a cluster of 36 CPU cores have demonstrated the convergence and near-linear speedup of the proposed ADMM algorithm, for training sparse logistic regression models based on a large real-world dataset.
2.1 Consensus Optimization and ADMM
Consider the global consensus problem

$$\min_{\{x_i\},\, z}\ \sum_{i=1}^{N} f_i(x_i) + h(z), \tag{2a}$$
$$\text{s.t. } x_i = z, \quad i = 1, \dots, N, \tag{2b}$$

where $z$ is often called the global consensus variable, traditionally stored on a master node, and $x_i$ is its local copy updated and stored on one of the worker nodes. The objective $\sum_i f_i$ is decomposable. It has been shown [Boyd et al.2011] that such a problem can be efficiently solved using distributed (synchronous) ADMM. In particular, let $\lambda_i$ denote the Lagrange dual variable associated with each constraint in (2b) and define the Augmented Lagrangian as

$$L(x, z; \lambda) = \sum_{i=1}^{N} \Big( f_i(x_i) + \langle \lambda_i, x_i - z \rangle + \frac{\rho}{2} \|x_i - z\|^2 \Big) + h(z), \tag{3}$$

where $x$ represents a juxtaposed matrix of all $x_i$, and $\lambda$ represents the juxtaposed matrix of all $\lambda_i$. We have, for (synchronized) rounds $t = 0, 1, 2, \dots$, the following variable updating equations:

$$x_i^{t+1} = \arg\min_{x_i}\ f_i(x_i) + \langle \lambda_i^t, x_i - z^t \rangle + \frac{\rho}{2} \|x_i - z^t\|^2,$$
$$z^{t+1} = \arg\min_{z}\ h(z) + \sum_{i=1}^{N} \Big( \langle \lambda_i^t, x_i^{t+1} - z \rangle + \frac{\rho}{2} \|x_i^{t+1} - z\|^2 \Big),$$
$$\lambda_i^{t+1} = \lambda_i^t + \rho \, (x_i^{t+1} - z^{t+1}).$$
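To make the synchronous consensus ADMM scheme of Sec. 2.1 concrete, the following is a minimal single-process sketch for a least-squares loss with an $\ell_1$ regularizer (all function and variable names are illustrative, not part of the paper's implementation; the $z$-update reduces to soft-thresholding because $h$ is the $\ell_1$ norm):

```python
import numpy as np

def soft_threshold(u, kappa):
    """Proximal operator of kappa * ||.||_1 (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - kappa, 0.0)

def consensus_admm(A_list, b_list, lam=0.1, rho=1.0, epochs=200):
    """Synchronous consensus ADMM for min sum_i 0.5*||A_i x - b_i||^2 + lam*||x||_1."""
    d, N = A_list[0].shape[1], len(A_list)
    x = [np.zeros(d) for _ in range(N)]      # local copies x_i
    lmbd = [np.zeros(d) for _ in range(N)]   # dual variables lambda_i
    z = np.zeros(d)                          # global consensus variable
    for _ in range(epochs):
        # x-update: each worker solves its regularized local subproblem exactly
        for i in range(N):
            x[i] = np.linalg.solve(A_list[i].T @ A_list[i] + rho * np.eye(d),
                                   A_list[i].T @ b_list[i] + rho * z - lmbd[i])
        # z-update: proximal step on the averaged worker messages
        z = soft_threshold(np.mean([x[i] + lmbd[i] / rho for i in range(N)], axis=0),
                           lam / (N * rho))
        # dual ascent on each constraint x_i = z
        for i in range(N):
            lmbd[i] = lmbd[i] + rho * (x[i] - z)
    return z
```

In a distributed deployment, the $x$- and $\lambda$-updates run on the workers and the $z$-update on the master; this sketch simply runs all three in one loop.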
2.2 General Form Consensus Optimization
Many machine learning problems involve highly sparse models, in the sense that each local dataset on a worker is only associated with a few model parameters, i.e., each $f_i$ only depends on a subset of the elements in $z$. The global consensus optimization problem in (2), however, ignores such sparsity, since in each round each worker must push the entire vectors $x_i$ and $\lambda_i$ to the master node to update $z$. In fact, this is the setting of all recent work on asynchronous distributed ADMM, e.g., [Zhang and Kwok2014]. In this case, when multiple workers attempt to update the global consensus variable $z$ at the same time, $z$ must be locked to ensure atomic updates, which leads to diminishing efficiency as the number of workers increases.
To better exploit model sparsity in practice for further parallelization opportunities between workers, we consider the general form consensus optimization problem [Boyd et al.2011]. Specifically, with $N$ worker nodes and $M$ server nodes, the vectors $x_i$, $\lambda_i$ and $z$ can all be decomposed into $M$ blocks. Let $z_j$ denote the $j$-th block of the global consensus variable $z$, located on server $j$, for $j = 1, \dots, M$. Similarly, let $x_{i,j}$ ($\lambda_{i,j}$) denote the corresponding $j$-th block of the local variable $x_i$ ($\lambda_i$) on worker $i$. Let $\mathcal{E}$ be the set of all pairs $(i, j)$ such that $f_i$ depends on the block $z_j$ (and correspondingly $x_{i,j}$ depends on $z_j$). Furthermore, let $\mathcal{N}(j) := \{i : (i, j) \in \mathcal{E}\}$ denote the set of all the neighboring workers of server $j$. Similarly, let $\mathcal{N}(i) := \{j : (i, j) \in \mathcal{E}\}$ denote the set of neighboring servers of worker $i$.
Then, the general form consensus problem [Boyd et al.2011] is described as follows:

$$\min_{\{x_i\},\, z}\ \sum_{i=1}^{N} f_i(x_i) + h(z), \tag{4a}$$
$$\text{s.t. } x_{i,j} = z_j,\ \forall (i, j) \in \mathcal{E}; \quad x_i \in X_i,\ i = 1, \dots, N. \tag{4b}$$

In fact, in $f_i(x_i)$, a block $x_{i,j}$ will only be relevant if $(i, j) \in \mathcal{E}$, and will be a dummy variable otherwise, whose value does not matter. Yet, since the sparse dependencies of $f_i$ on the blocks can be captured through the specific form of $f_i$, here we have included all blocks in each $f_i$'s arguments just to simplify the notation.
The structure of problem (4) can effectively capture the sparsity inherent to many practical machine learning problems. Since each $f_i$ only depends on a few blocks, the formulation in (4) essentially reduces the number of decision variables: it does not matter what value $x_{i,j}$ takes for any $(i, j) \notin \mathcal{E}$. For example, when training a topic model for documents, the feature of each document is represented as a bag of words, and hence only a subset of all words in the vocabulary will be active in each document's feature. In this case, the constraint $x_{i,j} = z_j$ only accounts for those words that appear in document $i$, and therefore only those words need to be optimized. Like (3), we also define the Augmented Lagrangian (to simplify notation, we still use $x$ and $\lambda$ as previously defined, but entries with $(i, j) \notin \mathcal{E}$ are not taken into account) as follows:

$$L(x, z; \lambda) = \sum_{i=1}^{N} f_i(x_i) + h(z) + \sum_{(i, j) \in \mathcal{E}} \Big( \langle \lambda_{i,j}, x_{i,j} - z_j \rangle + \frac{\rho_{ij}}{2} \|x_{i,j} - z_j\|^2 \Big).$$
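The dependency structure $\mathcal{E}$, $\mathcal{N}(i)$, and $\mathcal{N}(j)$ can be computed directly from which features appear in each local dataset. The following sketch (with hypothetical helper names, assuming features are partitioned into contiguous blocks of equal size) shows one way to derive these sets:

```python
def dependency_sets(local_features, num_servers, block_size):
    """Build the dependency pairs E and neighbor sets from sparse local data.

    local_features[i] is the set of feature indices appearing in worker i's
    dataset; features are assumed partitioned into contiguous blocks of
    size block_size, with block j hosted on server j.
    """
    E = set()
    for i, feats in enumerate(local_features):
        for f in feats:
            E.add((i, f // block_size))  # worker i depends on block of feature f
    # N(j): workers that touch server j's block; N(i): servers worker i needs
    workers_of = {j: sorted(i for (i, jj) in E if jj == j) for j in range(num_servers)}
    servers_of = {i: sorted(j for (ii, j) in E if ii == i)
                  for i in range(len(local_features))}
    return E, workers_of, servers_of
```

Each worker then pulls and pushes only the blocks listed in its own `servers_of` entry, never the full parameter vector.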
The formulation in (4) aligns perfectly with the latest Parameter Server architecture shown in Fig. 1. Here we can let each server node $j$ maintain one model block $z_j$, such that worker $i$ updates $z_j$ if and only if $(i, j) \in \mathcal{E}$. Since all three vectors $x_i$, $\lambda_i$ and $z$ in (4) are decomposable into blocks, to achieve higher efficiency, we will investigate block-wise algorithms which not only enable different workers to send their updates asynchronously to the servers (as prior work on asynchronous ADMM does), but also enable different model blocks to be updated in parallel and asynchronously on different servers, removing the locking or atomicity assumption required for updating the entire $z$.
3 A Block-wise, Asynchronous, and Distributed ADMM Algorithm
In this section, we present our proposed block-wise, asynchronous and distributed ADMM algorithm (termed AsyBADMM) for the general form consensus problem. For ease of presentation, we first describe a synchronous version, motivated by the basic distributed ADMM for non-convex optimization problems, as a starting point.
3.1 Block-wise Synchronous ADMM
The update rules presented in Sec. 2.1 represent the basic synchronous distributed ADMM approach [Boyd et al.2011]. To solve the general form consensus problem, our block-wise version extends this synchronous algorithm mainly by 1) approximating the update rule of $x_i$ with a simpler expression under non-convex objective functions, and 2) converting the all-vector updates of variables into block-wise updates only for $(i, j) \in \mathcal{E}$.
Generally speaking, in each synchronized epoch, each worker node $i$ updates all blocks of its local primal variables $x_{i,j}$ and dual variables $\lambda_{i,j}$ for $j \in \mathcal{N}(i)$, and pushes these updates to the corresponding servers. Each server $j$, once it has received $x_{i,j}$ and $\lambda_{i,j}$ from all $i \in \mathcal{N}(j)$, will update $z_j$ accordingly, by aggregating these received blocks.
Specifically, at epoch $t$, the basic synchronous distributed ADMM would perform the following update for $x_i$:

$$x_i^{t+1} = \arg\min_{x_i \in X_i}\ f_i(x_i) + \sum_{j \in \mathcal{N}(i)} \Big( \langle \lambda_{i,j}^t, x_{i,j} - z_j^t \rangle + \frac{\rho_{ij}}{2} \|x_{i,j} - z_j^t\|^2 \Big).$$

However, this subproblem is hard to solve exactly, especially when $f_i$ is non-convex. To handle non-convex objectives, we adopt an alternative solution [Hong2017, Hong et al.2016a] to this subproblem through the following first-order approximation of $f_i$ at $z^t$:

$$x_i^{t+1} = \arg\min_{x_i}\ \langle \nabla f_i(z^t), x_i - z^t \rangle + \sum_{j \in \mathcal{N}(i)} \Big( \langle \lambda_{i,j}^t, x_{i,j} - z_j^t \rangle + \frac{\rho_{ij}}{2} \|x_{i,j} - z_j^t\|^2 \Big), \tag{5}$$

where the solution to (5) can be readily obtained by setting the partial derivative w.r.t. $x_i$ to zero.
The above full-vector update on $x_i$ is equivalent to the following block-wise updates on each block $j \in \mathcal{N}(i)$ by worker $i$:

$$x_{i,j}^{t+1} = z_j^t - \frac{1}{\rho_{ij}} \big( \nabla_j f_i(z^t) + \lambda_{i,j}^t \big), \tag{6}$$

where $\nabla_j f_i$ is the partial derivative of $f_i$ w.r.t. $x_{i,j}$. Furthermore, the dual variable blocks can also be updated in a block-wise fashion as follows:

$$\lambda_{i,j}^{t+1} = \lambda_{i,j}^t + \rho_{ij} \big( x_{i,j}^{t+1} - z_j^t \big). \tag{7}$$
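A single worker-side step for one block, matching the first-order primal update followed by the dual ascent step described above (the reconstructed update forms are an assumption; the function name and argument layout are illustrative), can be sketched as:

```python
import numpy as np

def worker_update(z_block, grad_block, lam_block, rho):
    """One block-wise worker step for block j of worker i.

    z_block:    the pulled copy of the server block z_j
    grad_block: partial gradient of f_i w.r.t. this block, evaluated at z
    lam_block:  current dual block lambda_{i,j}
    rho:        penalty parameter for this (worker, server) pair
    """
    # primal block update: gradient step corrected by the dual variable
    x_new = z_block - (grad_block + lam_block) / rho
    # dual ascent on the constraint x_{i,j} = z_j
    lam_new = lam_block + rho * (x_new - z_block)
    return x_new, lam_new
```

Note that substituting the primal step into the dual step gives `lam_new == -grad_block`, i.e., the dual variable tracks the negative partial gradient, which is what makes sending only the pair $(x_{i,j}, \lambda_{i,j})$ to the server sufficient.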
Note that, in fact, each $f_i$ only depends on a part of $z$, and thus each worker $i$ only needs to pull the relevant blocks $z_j$ for $j \in \mathcal{N}(i)$. Again, we put the full vector $z$ in $\nabla f_i(z^t)$ just to simplify notation.
On the server side, server $j$ will update $z_j$ based on the newly updated $x_{i,j}^{t+1}$ and $\lambda_{i,j}^{t+1}$ received from all workers $i$ such that $(i, j) \in \mathcal{E}$. Again, the $z$ update in the basic synchronous distributed ADMM can be rewritten in the following block-wise form (with a regularization term introduced):

$$z_j^{t+1} = \mathrm{prox}_{h_j}^{\beta_j}\!\left( \frac{ \sum_{i \in \mathcal{N}(j)} \big( \rho_{ij} x_{i,j}^{t+1} + \lambda_{i,j}^{t+1} \big) + \gamma z_j^t }{ \beta_j } \right), \tag{8}$$

where $h_j$ is defined as the restriction of the separable regularizer $h$ to block $j$, i.e., $h(z) = \sum_{j=1}^{M} h_j(z_j)$, and the proximal operator is defined as

$$\mathrm{prox}_{h}^{\beta}(u) := \arg\min_{v}\ h(v) + \frac{\beta}{2} \|v - u\|^2.$$

Furthermore, the regularization term $\frac{\gamma}{2} \|z_j - z_j^t\|^2$ is introduced to stabilize the results, which will be helpful in the asynchronous case.
In the update of $z_j$, the constant of the proximal operator is given by $\beta_j = \sum_{i \in \mathcal{N}(j)} \rho_{ij} + \gamma$. It is now clear that it is sufficient for worker $i$ to send $\rho_{ij} x_{i,j}^{t+1} + \lambda_{i,j}^{t+1}$ to server $j$ in epoch $t$.
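For an $\ell_1$ regularizer, the proximal step in the server update is soft-thresholding. The following sketch (names illustrative; the update form follows the reconstructed equation above and assumes each worker sends the combined message $\rho_{ij} x_{i,j} + \lambda_{i,j}$) shows one server-side block update:

```python
import numpy as np

def prox_l1(u, kappa):
    """Soft-thresholding: proximal operator of kappa * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - kappa, 0.0)

def server_update(messages, z_old, gamma, l1_weight):
    """One synchronous update of block z_j.

    messages:  list of (rho_ij * x_ij + lam_ij, rho_ij) pairs, one per
               worker in N(j)
    z_old:     previous value of the block, used by the proximal
               regularization term (gamma/2)*||z_j - z_j^t||^2
    l1_weight: weight of the l1 regularizer restricted to this block
    """
    beta = sum(rho for _, rho in messages) + gamma  # prox constant beta_j
    u = (sum(m for m, _ in messages) + gamma * z_old) / beta
    return prox_l1(u, l1_weight / beta)
```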
3.2 Block-wise Asynchronous ADMM
We now take one step further and present a block-wise asynchronous distributed ADMM algorithm, which is the main contribution of this paper. In the asynchronous algorithm, each worker $i$ uses a local epoch counter to keep track of how many times $x_i$ has been updated, although different workers may be in different epochs, due to random delays in computation and communication.
Let us first focus on a particular worker $i$. While worker $i$ is in epoch $t$, there is no guarantee that worker $i$ can download the up-to-date $z$: different blocks of $z$ may have been updated different numbers of times, of which worker $i$ is unaware. Therefore, we use $\hat{z}_j^t$ to denote the latest copy of $z_j$ on server $j$ while worker $i$ is in epoch $t$. Then, the original synchronous updating equations (6) and (7) for $x_{i,j}$ and $\lambda_{i,j}$, respectively, are simply replaced by

$$x_{i,j}^{t+1} = \hat{z}_j^t - \frac{1}{\rho_{ij}} \big( \nabla_j f_i(\hat{z}^t) + \lambda_{i,j}^t \big),$$
$$\lambda_{i,j}^{t+1} = \lambda_{i,j}^t + \rho_{ij} \big( x_{i,j}^{t+1} - \hat{z}_j^t \big).$$
Now let us focus on the server side. In the asynchronous case, the variables $x_{i,j}$ and $\lambda_{i,j}$ from different workers do not generally arrive at the server at the same time. In this case, we update $z_j$ incrementally as soon as a pair $(x_{i,j}, \lambda_{i,j})$ is received from some worker $i$, until such pairs have been received from all $i \in \mathcal{N}(j)$, at which point the update of $z_j$ is fully finished. We use $\tilde{z}_j$ to denote the working (dirty) copy of $z_j$, for which the update may not yet have been finished by all workers. The update of $\tilde{z}_j$ is then given by

$$\tilde{z}_j = \mathrm{prox}_{h_j}^{\beta_j}\!\left( \frac{ \sum_{i \in \mathcal{N}(j)} \big( \rho_{ij} \tilde{x}_{i,j} + \tilde{\lambda}_{i,j} \big) + \gamma z_j^t }{ \beta_j } \right),$$

where $(\tilde{x}_{i,j}, \tilde{\lambda}_{i,j})$ is the newly received pair if worker $i$ triggered the above update; for all other workers, $(\tilde{x}_{i,j}, \tilde{\lambda}_{i,j})$ is the latest version that server $j$ holds for that worker. The regularization coefficient $\gamma$ helps to stabilize convergence in the asynchronous execution with random delays.
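The incremental server behavior can be sketched as a small event-driven class: the server keeps the latest message per worker and refreshes its working copy on every arrival. This is an illustrative sketch under simplifying assumptions (a smooth block with no regularizer, a uniform penalty $\rho$ for all workers, and the committed value reused as the proximal anchor); all names are hypothetical:

```python
import numpy as np

class BlockServer:
    """One parameter server holding block z_j, updated incrementally
    as worker messages (rho * x_ij + lam_ij) arrive out of order."""

    def __init__(self, dim, gamma, rho):
        self.z = np.zeros(dim)   # working copy of z_j
        self.gamma, self.rho = gamma, rho
        self.latest = {}         # worker id -> latest received message

    def receive(self, worker_id, message):
        # overwrite any stale message from this worker
        self.latest[worker_id] = message
        # refresh the working copy using the newest message per worker
        beta = self.rho * len(self.latest) + self.gamma
        self.z = (sum(self.latest.values()) + self.gamma * self.z) / beta
        return self.z
```

A non-smooth $h_j$ would simply wrap the final assignment in the corresponding proximal operator, as in the synchronous server update.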
In each epoch, each worker selects which block to update according to a uniform distribution, which is common in practice. Due to the page limit, we only consider this random block selection scheme, and refer readers to other options, including Gauss-Seidel and Gauss-Southwell block selection, in the literature, e.g., [Hong et al.2016b].
We conclude this section with a few remarks on implementation issues that characterize key features of our proposed block-wise asynchronous algorithm, which differs from the full-vector updates in the literature [Hong2017]. Firstly, model parameters are stored in blocks, so different workers can update different blocks asynchronously in parallel, taking advantage of the popular Parameter Server architecture. Secondly, workers can pull some blocks of $z$ while other blocks are being updated, enhancing concurrency. Thirdly, in our implementation, workers compute both gradients and local variables. In contrast, in the full-vector ADMM [Hong2017], workers are only responsible for computing gradients, so all previously computed and transmitted gradients must be cached on servers, incurring non-negligible memory overhead.
4 Convergence Analysis
In this section, we provide convergence analysis of our algorithm under certain standard assumptions:
Assumption 1 (Block Lipschitz Continuity).
For each $f_i$ and each block $j$, there exists a positive constant $L_{ij}$ such that

$$\big\| \nabla_j f_i(x) - \nabla_j f_i(y) \big\| \le L_{ij} \|x - y\|, \quad \forall x, y.$$
Assumption 2 (Bounded from Below).
Each function $f_i$ is bounded from below, i.e., there exists a finite number $\underline{f} \le f^*$, where $f^*$ denotes the optimal objective value of problem (4).
Assumption 3 (Bounded Delay).
The total delay of each link is bounded by a constant for each pair of worker $i$ and server $j$. Formally, there is an integer $T$ such that every block copy $\hat{z}_j^t$ used by worker $i$ in epoch $t$ was updated no earlier than epoch $t - T$. The same bound also holds for the updates $(x_{i,j}, \lambda_{i,j})$ received by the servers.
To characterize the convergence behavior, a commonly used metric is the squared norm of the gradients. Due to the potential non-smoothness of $h$, Hong et al. [2016a] propose a hybrid metric combining the gradient mapping with the vanilla gradient, as follows:

$$P(x^t, z^t; \lambda^t) := \big\| \tilde{\nabla}_z L(x^t, z^t; \lambda^t) \big\|^2 + \sum_{(i,j) \in \mathcal{E}} \big\| x_{i,j}^t - z_j^t \big\|^2,$$

where $\tilde{\nabla}_z L$ is the proximal gradient of the Augmented Lagrangian with respect to $z$, defined as

$$\tilde{\nabla}_z L(x, z; \lambda) := z - \mathrm{prox}_{h}\big[ z - \nabla_z \big( L(x, z; \lambda) - h(z) \big) \big].$$

It is clear that if $P(x^t, z^t; \lambda^t) \to 0$, then we obtain a stationary solution of (1).
Then the following holds for Algorithm 1. Algorithm 1 converges in the following sense:

$$\lim_{t \to \infty} \big\| x_{i,j}^{t+1} - x_{i,j}^{t} \big\| = 0, \quad \forall (i, j) \in \mathcal{E}, \tag{19a}$$
$$\lim_{t \to \infty} \big\| z_j^{t+1} - z_j^{t} \big\| = 0, \quad \forall j, \tag{19b}$$
$$\lim_{t \to \infty} \big\| x_{i,j}^{t} - z_j^{t} \big\| = 0, \quad \forall (i, j) \in \mathcal{E}. \tag{19c}$$
For each worker $i$ and server $j$, denote the limit points of $\{x_{i,j}^t\}$, $\{z_j^t\}$ and $\{\lambda_{i,j}^t\}$ by $x_{i,j}^*$, $z_j^*$ and $\lambda_{i,j}^*$, respectively. Then these limit points satisfy the KKT conditions of problem (4), i.e., we have

$$x_{i,j}^* = z_j^*, \quad \forall (i, j) \in \mathcal{E}, \tag{20a}$$
$$\nabla_j f_i(x_i^*) + \lambda_{i,j}^* = 0, \quad \forall (i, j) \in \mathcal{E}, \tag{20b}$$
$$\sum_{i \in \mathcal{N}(j)} \lambda_{i,j}^* \in \partial h_j(z_j^*), \quad \forall j. \tag{20c}$$
When the sets $X_i$ are compact, the sequence of iterates generated by Algorithm 1 converges to the set of stationary points.
For some $\epsilon > 0$, let $T(\epsilon)$ denote the first epoch that achieves the following:

$$T(\epsilon) := \min\big\{ t : P(x^t, z^t; \lambda^t) \le \epsilon \big\}.$$

Then there exists some constant $C > 0$ such that

$$T(\epsilon) \le \frac{C \big( L(x^1, z^1; \lambda^1) - \underline{f} \big)}{\epsilon},$$

where $\underline{f}$ is defined in Assumption 2.
Due to the non-convexity of the objective function, no guarantee of global optimality is possible in general. The penalty parameter $\rho$ acts like the learning rate hyper-parameter in gradient descent: a large $\rho$ slows down convergence, while a smaller one can speed it up. The term $\gamma$ is associated with the delay bound $T$. In the synchronous case, we can set $\gamma = 0$; otherwise, to guarantee convergence, $\gamma$ should be increased as the maximum allowable delay $T$ increases.
We now show how our algorithm can be used to solve challenging non-convex non-smooth problems in machine learning, and how AsyBADMM exhibits a near-linear speedup as the number of workers increases. We use a cluster of 18 instances of type c4.large on Amazon EC2. This type of instance has 2 CPU cores and 3.75 GB RAM, running 64-bit Ubuntu 16.04 LTS (HVM). Each server and worker process uses up to 2 cores. In total, our deployment uses 36 CPU cores and 67.5 GB RAM. Two machines serve as server nodes, while the other 16 machines serve as worker nodes. Note that we treat one core as a computational node (either a worker or a server node).
Setup: In this experiment, we consider the sparse logistic regression problem:

$$\min_{x}\ \sum_{i=1}^{N} \sum_{(a, b) \in \mathcal{D}_i} \log\big( 1 + e^{-b \langle a, x \rangle} \big) + \mu \|x\|_1,$$

where $\mathcal{D}_i$ is the local dataset on worker $i$ with features $a$ and labels $b \in \{-1, +1\}$, and the margin $b \langle a, x \rangle$ is clipped by a constant $C$ to clip out some extremely large values for robustness. $\ell_1$-regularized logistic regression is one of the most popular algorithms used for large-scale risk minimization. We consider the public sparse text dataset KDDa (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). This dataset has more than 8 million samples, 20 million features, and 305 million nonzero entries. To show the advantage of parallelism, we set up five experiments with 1, 4, 8, 16 and 32 worker nodes, respectively. In each experiment, the whole dataset is evenly split into smaller parts, and each node only has access to its local dataset.
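A minimal sketch of the local objective, with the margin clipping described above (the exact placement of the clip is an assumption; function names are illustrative), can be written as:

```python
import numpy as np

def clipped_logistic_loss(x, feats, labels, C=20.0):
    """Local logistic loss with margin clipping for numerical robustness.

    feats: (n, d) dense array standing in for a sparse feature matrix;
    labels: array of n labels in {-1, +1}; C: clip threshold (assumed
    to bound the margin magnitude).
    """
    margins = np.clip(labels * (feats @ x), -C, C)  # clip extreme margins
    return np.sum(np.log1p(np.exp(-margins)))

def l1_objective(x, feats, labels, mu, C=20.0):
    """Clipped logistic loss plus the l1 regularizer mu * ||x||_1."""
    return clipped_logistic_loss(x, feats, labels, C) + mu * np.abs(x).sum()
```

On a real deployment, `feats` would be a sparse matrix restricted to the feature blocks in $\mathcal{N}(i)$, so each worker evaluates this objective using only the blocks it pulls.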
We implement our algorithm on the ps-lite framework [Li et al.2014a], which is a lightweight implementation of the Parameter Server architecture. It supports Parameter Server for multiple devices in a single machine, and multiple machines in a cluster. It is the back end of the kvstore API of the deep learning framework MXNet [Chen et al.2015]. Each worker updates the blocks by cycling through the coordinates of $x_i$ and updating each in turn, restarting at a random coordinate after each cycle.
Results: Empirically, Assumption 3 is observed to hold on this cluster. We fix the hyper-parameter $\gamma$ and the clip threshold constant $C$, and use the same penalty parameter $\rho_{ij} = \rho$ for all $(i, j)$. Fig. 2(a) and Fig. 2(b) show the convergence behavior of our proposed algorithm in terms of objective function values. From the figures, we can clearly observe the convergence of our proposed algorithm. This observation confirms that asynchrony with tolerable delay can still lead to convergence.
To further analyze the parallelism in AsyBADMM, we measure the speedup by the relative time for workers to perform the same number of iterations, i.e., the speedup of $n$ workers $= T_1(k) / T_n(k)$, where $T_n(k)$ is the time it takes $n$ workers to perform $k$ iterations of optimization. Fig. 2(b) illustrates the running time comparison, and Table 1 shows that AsyBADMM achieves near-linear speedup.
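The speedup metric above is straightforward to compute from wall-clock measurements; a small helper (illustrative, with hypothetical timing values) makes the definition explicit:

```python
def speedups(times):
    """Compute speedup T_1(k) / T_n(k) for each worker count n.

    times: dict mapping worker count n -> seconds to finish the same
    k iterations; times[1] is the single-worker baseline.
    """
    t1 = times[1]
    return {n: t1 / t for n, t in sorted(times.items())}
```

Near-linear speedup corresponds to `speedups(times)[n]` staying close to `n`.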
6 Concluding Remarks
In this paper, we propose a block-wise, asynchronous and distributed ADMM algorithm to solve general non-convex and non-smooth optimization problems in machine learning. Under the bounded delay assumption, we have shown that our proposed algorithm converges to stationary points satisfying KKT conditions. The block-wise updating nature of our algorithm makes it feasible to implement on Parameter Server, taking advantage of the ability to update different blocks of model parameters in parallel on distributed servers. Experimental results based on a real-world dataset have demonstrated the convergence and near-linear speedup of the proposed ADMM algorithm for training large-scale sparse logistic regression models on Amazon EC2 clusters.
- [Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283. USENIX Association, 2016.
- [Boyd et al.2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- [Chang et al.2016a] Tsung-Hui Chang, Mingyi Hong, Wei-Cheng Liao, and Xiangfeng Wang. Asynchronous distributed admm for large-scale optimization – part i: Algorithm and convergence analysis. IEEE Transactions on Signal Processing, 64(12):3118–3130, 2016.
- [Chang et al.2016b] Tsung-Hui Chang, Wei-Cheng Liao, Mingyi Hong, and Xiangfeng Wang. Asynchronous distributed admm for large-scale optimization – part ii: Linear convergence analysis and numerical performance. IEEE Transactions on Signal Processing, 64(12):3131–3144, 2016.
- [Chen et al.2015] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
- [Dean et al.2012] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
- [Friedman et al.2001] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.
- [Hong et al.2016a] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
- [Hong et al.2016b] Mingyi Hong, Meisam Razaviyayn, Zhi-Quan Luo, and Jong-Shi Pang. A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing. IEEE Signal Processing Magazine, 33(1):57–77, 2016.
- [Hong2017] Mingyi Hong. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An admm approach. IEEE Transactions on Control of Network Systems, PP(99):1–1, 2017.
- [Li et al.2014a] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.
- [Li et al.2014b] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
- [Lian et al.2015] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
- [Liu and Wright2015] Ji Liu and Stephen J Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.
- [Liu et al.2009] Jun Liu, Jianhui Chen, and Jieping Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 547–556. ACM, 2009.
- [Mota et al.2013] João FC Mota, João MF Xavier, Pedro MQ Aguiar, and Markus Puschel. D-admm: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
- [Niu et al.2011] Feng Niu, Benjamin Recht, Christopher Re, and Stephen Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
- [Taylor et al.2016] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable admm approach. In International Conference on Machine Learning, pages 2722–2731, 2016.
- [Tibshirani et al.2005] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
- [Wei and Ozdaglar2013] Ermin Wei and Asuman Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. arXiv preprint arXiv:1307.8254, 2013.
- [Zaharia et al.2010] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. HotCloud, 10:10–10, 2010.
- [Zhang and Kwok2014] Ruiliang Zhang and James Kwok. Asynchronous distributed admm for consensus optimization. In International Conference on Machine Learning, pages 1701–1709, 2014.
Appendix A Key Lemmas
For simplicity, we say that the update of $x_{i,j}$ is performed at epoch $t$ when worker $i$ is updating block $j$ at epoch $t$. If the update is not performed at epoch $t$, the inequality holds trivially, since $x_{i,j}^{t+1} = x_{i,j}^{t}$. So we only consider the case that the update of $x_{i,j}$ is performed at epoch $t$. Note that in this case, we have
Since , we have
Therefore, we have
Since the actual updating time for $x_{i,j}$ must lie in the interval $[t - T, t]$, and that for $z_j$ in $[t - T, t]$ as well, we have
which proves the lemma.
At epoch $t$, we have
The update is then performed as follows:
Thus, we have
Appendix B Proof of Lemma 3
Next, we bound the gap between two consecutive Augmented Lagrangian values, breaking it down into three steps, namely, updating $x$, $\lambda$, and $z$:
To prove Lemma 3, we bound the above three gaps individually. Firstly, we bound the gap incurred by updating $x$. For each worker $i$, at epoch $t$, we use the following auxiliary function for convergence analysis:
To simplify the proof in this section, we consider the case where only one block is updated. Therefore, only block $j$ in $x_i^{t+1}$ differs from $x_i^{t}$, and similarly for $\lambda_i^{t+1}$ and $z^{t+1}$. We will use $\hat{z}^t$ to denote the delayed version of $z^t$ in this proof.
For worker $i$, we have the following inequality to bound the gap after updating $x_i$:
From the block Lipschitz continuity assumption and the updating rule, we have