I Introduction
Machine learning (ML) techniques are permeating many aspects of our lives, boosting the development of new applications ranging from autonomous driving and virtual and augmented reality to the Internet of things [1]. Training high-performance ML models requires computations over massive volumes of data, e.g., large-scale matrix-vector multiplications, which cannot be carried out on a single centralized computing server. Distributed computing frameworks such as MapReduce [2] enable a centralized master node to allocate data and update the global model, while tens or hundreds of distributed computing nodes, called workers, train ML models in parallel using partial data. Since the task completion time depends on the slowest worker, a key bottleneck in distributed computing is the straggler effect: experiments on Amazon EC2 instances show that some workers can be 5 times slower than the typical performance [3].

The straggler effect can be mitigated by adding redundancy to the distributed computing system via coding [3, 4, 2, 6, 5, 8, 7], or by scheduling computation tasks [9, 10, 11]. Maximum distance separable (MDS) codes are widely applied to matrix multiplications [3, 4, 2, 6, 5, 7], and can reduce the task completion time by a factor of $\Theta(\log n)$, where $n$ is the number of workers [2]. A unified coded computing framework for straggler mitigation is proposed in [4]. Heterogeneous workers are considered in [5], and an asymptotically optimal load allocation scheme is proposed. Although stragglers are slower than typical workers, they can still make non-negligible contributions to the system [6, 8]. A hierarchical coded computing framework is thus proposed in [6], where tasks are partitioned into multiple levels so that stragglers contribute to subtasks in the lower levels. Multi-message communication with Lagrange coded computing is used in [8] to exploit straggling servers.
The straggler effect can be mitigated even with uncoded computing, via redundant scheduling of tasks and multi-message communications. A batched coupon’s collector scheme is proposed in [9], and the expected completion time is analyzed in [10]. The input data is partitioned into batches, and each worker randomly processes one at a time, until the master collects all the results. Deterministic scheduling orders of tasks at different workers are proposed in [11], specifically cyclic and staircase scheduling, and the relation between redundancy and task completion time is characterized.
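As a toy illustration of this batched scheme, the following sketch (hypothetical parameters; every worker is assumed to take one unit of time per batch) simulates workers repeatedly picking random batches until the master has collected all of them:

```python
import numpy as np

rng = np.random.default_rng(0)

def completion_rounds(n_batches, n_workers, rng):
    """Rounds until the master holds every batch at least once, when each
    worker processes one uniformly random batch per round."""
    collected, rounds = set(), 0
    while len(collected) < n_batches:
        rounds += 1
        collected.update(rng.integers(n_batches, size=n_workers).tolist())
    return rounds

# average completion time over many realizations
avg_rounds = float(np.mean([completion_rounds(10, 4, rng) for _ in range(5000)]))
```

Redundant random scheduling trades duplicated work for robustness: no single slow worker can block a specific batch.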
Existing papers mainly consider a single master. In practice, however, workers may be shared by more than one master to carry out multiple large-scale computation tasks in parallel.
In this work, we focus on a multi-task assignment problem for a heterogeneous distributed computing system using MDS codes. As shown in Fig. 1, we consider multiple masters, each with a matrix-vector multiplication task, and a number of workers with heterogeneous computing capabilities. The goal is to design centralized worker assignment and load allocation algorithms that minimize the completion time of all the tasks. We consider both dedicated and probabilistic
worker assignment policies, and formulate a non-convex optimization problem under a unified framework. For dedicated assignment, each worker serves one master. The optimal load allocation is derived, and the worker assignment is transformed into a max-min allocation problem, for which NP-hardness is proved and greedy algorithms are proposed. For probabilistic assignment, each worker selects a master to serve based on an optimized probability, and a successive convex approximation (SCA) based algorithm is proposed. Simulation results show that the proposed algorithms can drastically reduce the task completion time compared to uncoded and unbalanced coded schemes.
The rest of the paper is organized as follows. The system model and problem formulation are introduced in Sec. II. Dedicated and probabilistic worker assignments, and the corresponding load allocation algorithms, are proposed in Sec. III and Sec. IV, respectively. Simulation results are presented in Sec. V, and the conclusions are summarized in Sec. VI.
II System Model and Problem Formulation
II-A System Architecture
We consider a heterogeneous distributed computing system with $n$ masters and $m$ workers, where $m \geq n$. We assume that each master has a matrix-vector multiplication task$^{1}$
$^{1}$ In training ML models, e.g., linear regression, matrix-vector multiplication tasks are carried out at each iteration of the gradient descent algorithm. These tasks are independent over iterations, thus we focus on one iteration here.
The task of master $i$ is to compute $y_i = A_i x_i$, where $A_i \in \mathbb{R}^{r_i \times c_i}$ is the data matrix, $x_i \in \mathbb{R}^{c_i}$ is the input vector, and $y_i \in \mathbb{R}^{r_i}$. The masters can use the workers to complete their computation tasks in a distributed manner.

To deal with straggling workers, we adopt MDS coded computation, and encode the rows of $A_i$. Define the coded version of $A_i$ as $\tilde{A}_i$, which is further divided into $m$ sub-matrices:
$\tilde{A}_i = \left[\tilde{A}_{i,1}^{\top}, \tilde{A}_{i,2}^{\top}, \dots, \tilde{A}_{i,m}^{\top}\right]^{\top}$,  (1)
where $\tilde{A}_{i,j} \in \mathbb{R}^{l_{i,j} \times c_i}$ is assigned to worker $j$, and $l_{i,j}$ is a non-negative integer representing the load allocated to worker $j$. Vector $x_i$ is multicast from master $i$ to the workers with $l_{i,j} > 0$, and worker $j$ calculates the multiplication of the $l_{i,j}$ coded rows of $\tilde{A}_{i,j}$ and $x_i$. Matrix $A_i$ is thus $\left(\sum_{j=1}^{m} l_{i,j},\, r_i\right)$-MDS-coded, with the requirement $\sum_{j=1}^{m} l_{i,j} \geq r_i$. Upon aggregating the multiplication results for any $r_i$ coded rows of $\tilde{A}_i$, master $i$ can recover $y_i = A_i x_i$.
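The encode/recover pipeline described above can be sketched as follows; a random Gaussian generator matrix is used here as a stand-in for an MDS code (any $r$ of its rows are invertible with probability 1), and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
r, c, L = 6, 4, 10          # rows of A, columns of A, total coded rows

A = rng.standard_normal((r, c))
x = rng.standard_normal(c)

# Encode the rows of A; row-blocks of A_coded play the role of the
# sub-matrices distributed to the workers.
G = rng.standard_normal((L, r))
A_coded = G @ A

results = A_coded @ x        # one scalar result per coded row
fastest = rng.choice(L, size=r, replace=False)   # any r results suffice

# Recover y = A x by inverting the corresponding rows of G
y = np.linalg.solve(G[fastest], results[fastest])
```

Whichever $r$ coded results arrive first determine the selected rows of $G$, so the slowest workers never need to be waited for.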
II-B Task Processing Time
The processing times of the assigned computation tasks at the workers are modeled as mutually independent random variables. Following the literature on coded computing, the processing time at each worker is modeled by a shifted exponential distribution.$^{2}$

$^{2}$ In this work, the worker assignment and load allocation algorithms are designed under the assumption of a shifted exponential distribution. However, the proposed methods can also be applied to other distributions, as long as the corresponding function defined in (40) is convex.

The processing time $T_{i,j}$ for worker $j$ to calculate the multiplication of $l$ coded rows of $\tilde{A}_i$ and $x_i$ has the cumulative distribution function:
$\Pr\left[T_{i,j} \leq t\right] = 1 - e^{-\frac{\mu_{i,j}}{l}\left(t - a_{i,j} l\right)}, \quad t \geq a_{i,j} l$,  (2)
where $a_{i,j} > 0$ is a parameter indicating the minimum processing time for one coded row, and $\mu_{i,j} > 0$ is the parameter modeling the straggling effect.
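Under this model, the time to process $l$ coded rows is the deterministic shift $a\,l$ plus an exponential term with rate $\mu / l$. A quick empirical check of the CDF in (2), with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, a, l = 2.0, 0.1, 20      # straggling rate, per-row shift, load (coded rows)

# T = a*l + Exp(rate = mu/l): minimum time a*l plus a straggling term
samples = a * l + rng.exponential(scale=l / mu, size=100_000)

t = 5.0
empirical = float(np.mean(samples <= t))
analytical = 1.0 - np.exp(-(mu / l) * (t - a * l))  # valid since t >= a*l
```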
We consider a heterogeneous environment by assuming that $a_{i,j}$ and $\mu_{i,j}$ differ across master-worker pairs $(i, j)$, for $i = 1, \dots, n$ and $j = 1, \dots, m$. This assumption reflects the fact that workers may have different computation speeds, and that the dimensions of $A_i$ and $x_i$ vary over $i$.
II-C Worker Assignment Policy
We consider two worker assignment policies:
1) Dedicated worker assignment: In this policy, each worker is assigned computation tasks from a single master. Let the indicator $s_{i,j} = 1$ if worker $j$ provides computing service for master $i$, and $s_{i,j} = 0$ otherwise. Since a worker serves at most one master, we have $\sum_{i=1}^{n} s_{i,j} \leq 1$.
2) Probabilistic worker assignment: In this policy, each worker $j$ randomly selects which master to serve, choosing master $i$ with probability $p_{i,j}$. For each worker $j$, we have $\sum_{i=1}^{n} p_{i,j} = 1$. Fig. 1 illustrates an example, where a worker selects among the masters according to these probabilities.
II-D Problem Formulation
Let $d_{i,j}(t)$ denote the number of multiplication results (one result refers to the multiplication of one coded row of $\tilde{A}_i$ with $x_i$) that master $i$ collects from worker $j$ by time $t$. We assume that worker $j$ computes $\tilde{A}_{i,j} x_i$ and then sends the result to the master upon completion, without further dividing it into subtasks or transmitting any feedback before completion. Therefore, master $i$ can either receive $l_{i,j}$ results or none from worker $j$ by time $t$. We denote the number of aggregated results at master $i$ until time $t$ by $d_i(t)$, and we have $d_i(t) = \sum_{j=1}^{m} d_{i,j}(t)$.
Our objective is to minimize the average completion time $t$, upon which all the masters can aggregate sufficient results from the workers to recover their computations with high probability. We aim to design a centralized policy that optimizes the worker assignment $\{s_{i,j}\}$ and load allocation $\{l_{i,j}\}$. The optimization problem is formulated as:
$\min_{t,\, \{s_{i,j}\},\, \{l_{i,j}\}} \ t$  (3)

s.t. $\Pr\left[d_i(t) \geq r_i\right] \geq \rho, \quad \forall i$,  (4)

$\sum_{i=1}^{n} s_{i,j} \leq 1 \ \text{(dedicated)} \ \text{or} \ \sum_{i=1}^{n} p_{i,j} = 1 \ \text{(probabilistic)}, \quad \forall j$,  (5)

$l_{i,j} \in \mathbb{Z}_{\geq 0}, \quad \forall i, j$,  (6)
where we have $s_{i,j} \in \{0, 1\}$ for dedicated worker assignment, while $s_{i,j} = p_{i,j} \in [0, 1]$ for probabilistic worker assignment, and $\mathbb{Z}_{\geq 0}$ is the set of non-negative integers. In constraint (4), $\Pr[d_i(t) \geq r_i]$ is the probability that master $i$ receives no fewer than $r_i$ results by time $t$, i.e., the probability of $y_i$ being recovered, and it is required to exceed a target threshold $\rho$. Constraint (5) guarantees that under dedicated assignment each worker serves at most one master, and that under probabilistic assignment the total probability rule is satisfied.
The key challenge is that constraint (4) cannot be expressed explicitly, since it is difficult to enumerate all the combinations of worker results satisfying $d_i(t) \geq r_i$ in a heterogeneous environment with non-uniform loads $l_{i,j}$. Therefore, we instead consider an approximation to this problem, obtained by substituting constraint (4) with an expectation constraint:
$\min_{t,\, \{s_{i,j}\},\, \{l_{i,j}\}} \ t$  (7)

s.t. $\mathbb{E}\left[d_i(t)\right] \geq r_i, \quad \forall i$, and constraints (5), (6),  (8)
where constraint (8) states that the expected number of results master $i$ receives by time $t$ is no less than $r_i$. A similar approach is used in [5], where the gap between the solutions of the original and the approximate problems is proved to be bounded when there is a single master. We will design algorithms that solve problem (7) in the following two sections. In Sec. V, we will use Monte Carlo simulations to evaluate the average time it takes to complete all the tasks, and show that it is close to the approximate completion time obtained by solving problem (7).
Constraint (8) can be expressed explicitly. Let $\mathbb{1}\{E\}$ be an indicator function taking value $1$ if event $E$ is true, and $0$ otherwise. If $t \geq a_{i,j} l_{i,j}$ (and thus $\Pr[T_{i,j} \leq t] > 0$),
$\mathbb{E}\left[d_{i,j}(t)\right] = s_{i,j}\, l_{i,j} \left(1 - e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t - a_{i,j} l_{i,j}\right)}\right)$.  (9)
If $t < a_{i,j} l_{i,j}$ (and thus $\Pr[T_{i,j} \leq t] = 0$), we have $\mathbb{E}[d_{i,j}(t)] = 0$. Summing over the workers, $\mathbb{E}[d_i(t)] = \sum_{j=1}^{m} \mathbb{E}[d_{i,j}(t)]$.
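The relation between the expectation constraint (8) and the exact recovery-probability constraint (4) can be explored numerically. The sketch below uses a hypothetical single-master instance with four workers and the all-or-nothing return model described above:

```python
import numpy as np

rng = np.random.default_rng(1)
loads = np.array([30, 25, 20, 25])           # l_j: coded rows per worker
mus = np.array([1.0, 1.5, 0.8, 1.2])         # straggling parameters
shifts = np.array([0.05, 0.08, 0.06, 0.07])  # per-row shifts a_j
r, t = 60, 40.0                              # recovery threshold and deadline

# Left-hand side of the expectation constraint (8)
p_done = 1.0 - np.exp(-(mus / loads) * (t - shifts * loads))  # t >= a_j*l_j here
expected = float(np.sum(loads * p_done))

# Monte Carlo estimate of the exact constraint (4): each worker returns
# all l_j results (if done by t) or none
trials = 50_000
T = shifts * loads + rng.exponential(scale=loads / mus, size=(trials, 4))
prob_recover = float(np.mean(np.sum(np.where(T <= t, loads, 0), axis=1) >= r))
```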
The following observations help us simplify problem (7):
1) From constraint (8), we can infer that the optimal task completion time satisfies $t \geq a_{i,j} l_{i,j}$ for every worker with $l_{i,j} > 0$. In fact, if there exists a pair $(i, j)$ such that $t < a_{i,j} l_{i,j}$, we have $\mathbb{E}[d_{i,j}(t)] = 0$, i.e., master $i$ cannot expect to receive any results from worker $j$. By reducing $l_{i,j}$ to satisfy $t \geq a_{i,j} l_{i,j}$, it is possible to further reduce $t$.
2) Due to the high dimension of the input matrix $A_i$, $r_i$ is usually on the order of hundreds or thousands. We thus relax the integer constraint $l_{i,j} \in \mathbb{Z}_{\geq 0}$ to $l_{i,j} \geq 0$, and omit the effect of rounding in the following derivations.
Based on the two observations, by substituting (9), we simplify constraint (8) as:

$\sum_{j=1}^{m} s_{i,j}\, l_{i,j} \left(1 - e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t - a_{i,j} l_{i,j}\right)}\right) \geq r_i, \quad \forall i$.  (10)
Problem (7) can thus be simplified as follows:

$\min_{t,\, \{s_{i,j}\},\, \{l_{i,j}\}} \ t$  (11)

s.t. (10), (5), and $l_{i,j} \geq 0, \ \forall i, j$.  (12)
Problem (11) is a non-convex optimization problem due to the non-convexity of (10), and is in general difficult to solve. In the following two sections, we propose algorithms for dedicated and probabilistic worker assignment, together with the corresponding load allocation, respectively.
III Dedicated Worker Assignment
In this section, we solve problem (11) for dedicated worker assignment, where $s_{i,j} \in \{0, 1\}$. Given the assignment of workers, we first derive the optimal load allocation. The worker assignment is then transformed into a max-min allocation problem, for which NP-hardness is shown and two greedy algorithms are developed.
III-A Optimal Load Allocation for a Given Worker Assignment
We first assume that the subset of workers serving master $i$ is given by $\mathcal{W}_i$, and derive the optimal load allocation for master $i$ that minimizes its approximate completion time. The problem is formulated as:
$\min_{t_i,\, \{l_{i,j}\}} \ t_i$  (13)

s.t. $\sum_{j \in \mathcal{W}_i} l_{i,j} \left(1 - e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t_i - a_{i,j} l_{i,j}\right)}\right) \geq r_i$,  (14)

$l_{i,j} \geq 0, \quad \forall j \in \mathcal{W}_i$,  (15)
where $t_i$ is defined as the approximate completion time of master $i$, and the expected number of results aggregated at master $i$ by time $t_i$ is

$\sum_{j \in \mathcal{W}_i} l_{i,j} \left(1 - e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t_i - a_{i,j} l_{i,j}\right)}\right)$.  (16)
Lemma 1.
Problem (13) is a convex optimization problem.
Proof.
See Appendix A. ∎
The partial Lagrangian of problem (13) is given by
$\mathcal{L} = t_i + \lambda \left(r_i - \sum_{j \in \mathcal{W}_i} l_{i,j} \left(1 - e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t_i - a_{i,j} l_{i,j}\right)}\right)\right)$,  (17)
where $\lambda \geq 0$ is the Lagrange multiplier associated with (14).
The partial derivatives of $\mathcal{L}$ can be derived as
$\frac{\partial \mathcal{L}}{\partial t_i} = 1 - \lambda \sum_{j \in \mathcal{W}_i} \mu_{i,j}\, e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t_i - a_{i,j} l_{i,j}\right)}$,  (18)
$\frac{\partial \mathcal{L}}{\partial l_{i,j}} = -\lambda \left(1 - \left(1 + \frac{\mu_{i,j} t_i}{l_{i,j}}\right) e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t_i - a_{i,j} l_{i,j}\right)}\right)$.  (19)
The optimal solution needs to satisfy the Karush–Kuhn–Tucker (KKT) conditions
$\frac{\partial \mathcal{L}}{\partial t_i} = 0, \quad \frac{\partial \mathcal{L}}{\partial l_{i,j}} = 0, \ \forall j \in \mathcal{W}_i$,  (20)

$\lambda \left(r_i - \sum_{j \in \mathcal{W}_i} l_{i,j} \left(1 - e^{-\frac{\mu_{i,j}}{l_{i,j}}\left(t_i - a_{i,j} l_{i,j}\right)}\right)\right) = 0$,  (21)

$\lambda \geq 0$.  (22)
Define $W_{-1}(\cdot)$ as the lower branch of the Lambert W function, which satisfies $W_{-1}(x)\, e^{W_{-1}(x)} = x$ and $W_{-1}(x) \leq -1$. Let

$\eta_{i,j} = -1 - W_{-1}\left(-e^{-\mu_{i,j} a_{i,j} - 1}\right)$.  (23)
By solving KKT conditions (20)-(22), the optimal load allocation for each individual master is given as follows.
Theorem 1.
For master $i$ and a given subset $\mathcal{W}_i$ of workers serving this master, the optimal load allocation derived from problem (13), and the corresponding minimum approximate completion time, are given by:

$l_{i,j}^* = \frac{\mu_{i,j}\, t_i^*}{\eta_{i,j}}, \quad \forall j \in \mathcal{W}_i$,  (24)

$t_i^* = r_i \left(\sum_{j \in \mathcal{W}_i} \frac{\mu_{i,j}}{1 + \eta_{i,j}}\right)^{-1}$.  (25)
Proof.
See Appendix B. ∎
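To make the role of the Lambert W function concrete: for a single worker under the model in (2), setting the derivative of the expected number of returned results to zero yields a load proportional to the deadline, via the lower branch $W_{-1}$. The sketch below (our own rearrangement of the stationarity condition, with hypothetical parameters) cross-checks this closed form against direct numerical maximization:

```python
import numpy as np
from scipy.special import lambertw
from scipy.optimize import minimize_scalar

mu, a, t = 1.5, 0.2, 10.0    # hypothetical straggling rate, shift, deadline

def expected_results(l):
    """Expected number of results returned by the deadline t for load l."""
    if l <= 0 or t < a * l:
        return 0.0
    return l * (1.0 - np.exp(-(mu / l) * (t - a * l)))

# Stationarity of expected_results in l rearranges to w*exp(w) = -exp(-mu*a - 1)
# with w <= -1, i.e. the lower Lambert W branch:
w = lambertw(-np.exp(-mu * a - 1.0), k=-1).real
l_star = -mu * t / (1.0 + w)            # load maximizing expected_results

# Numerical cross-check against brute-force maximization
res = minimize_scalar(lambda l: -expected_results(l),
                      bounds=(1e-6, t / a), method='bounded')
```

The optimal load scales linearly with the deadline, which is why, once the per-worker constants are computed, the per-master completion time admits a closed form.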
III-B Greedy Worker Assignment Algorithms
Now we consider how to assign the workers to different masters so as to minimize the overall task completion time $\max_i t_i^*$. Let

$u_{i,j} = \frac{\mu_{i,j}}{r_i \left(1 + \eta_{i,j}\right)}$.  (26)
Based on Theorem 1, the worker assignment problem can be transformed into a max-min allocation problem, given in the following proposition.
Proposition 1.
Problem (11) with $s_{i,j} \in \{0, 1\}$ is equivalent to

$\max_{\{s_{i,j}\}} \ \min_{i} \ \sum_{j=1}^{m} s_{i,j}\, u_{i,j}$  (27)

s.t. $\sum_{i=1}^{n} s_{i,j} \leq 1, \quad \forall j$.  (28)
Proof.
We use $t_i^*(\mathcal{W}_i)$ to represent the minimum task completion time of master $i$ given its set of workers $\mathcal{W}_i$, and define the overall completion time as $\max_i t_i^*(\mathcal{W}_i)$. From Theorem 1, we have:

$t_i^*(\mathcal{W}_i) = \left(\sum_{j \in \mathcal{W}_i} u_{i,j}\right)^{-1}$.  (29)

Note that minimizing $\max_i t_i^*(\mathcal{W}_i)$ is equivalent to maximizing $\min_i \sum_{j \in \mathcal{W}_i} u_{i,j}$. With $s_{i,j} = 1$ if and only if $j \in \mathcal{W}_i$, and constraint (28) ensuring that each worker serves at most one master, problem (11) is equivalent to (27). ∎
Problem (27) is a combinatorial optimization problem known as max-min allocation, which is motivated by the fair allocation of indivisible goods [12, 13, 14]. Specifically, there are $n$ agents and $m$ items. Each item has a distinct value for each agent, and can only be allocated to one agent. The goal is to maximize the minimum sum value over the agents by allocating the items as fairly as possible. In our problem, each master corresponds to an agent with sum value $\sum_{j} s_{i,j} u_{i,j}$, and each worker can be considered as an item with value $u_{i,j}$ for master $i$. The problem can be reduced to the NP-complete partition problem [15] by considering only two agents, with each item having identical value for both agents. Therefore, problem (27) is NP-hard. An approximation algorithm for max-min allocation with factor $O(n^{\epsilon})$, running in $n^{O(1/\epsilon)}$ time for $\epsilon = \Omega(\log\log n / \log n)$, is proposed in [13]. Another polynomial-time algorithm is proposed in [14], guaranteeing an $O(\sqrt{n} \log^3 n)$ approximation to the optimum. However, these algorithms have high computational complexity and are difficult to implement, so we propose two low-complexity greedy algorithms instead.

An iterated greedy algorithm, inspired by [16], where a similar min-max fairness problem is investigated, is proposed in Algorithm 1. In the initialization phase, each worker is assigned to the master for which its value is the highest. The main iteration consists of the following three phases:
1) Insertion: We extract each worker from its current master, and reassign it to a master with the minimum sum value. As shown in Lines 12-14, if the minimum sum value among the masters is improved, the worker is moved to serve the new master. The complexity of this phase is $O(mn)$.
2) Interchange: We pick two workers serving two different masters, and interchange their assignments. If the minimum sum value is improved, and the overall system performance is improved, the interchange is kept. The complexity of this phase is $O(m^2)$. Note that the insertion and interchange phases are repeated multiple times within each iteration, in order to reach a local optimum.
3) Exploration: We randomly remove some workers from the current assignment, and reallocate them in a greedy manner. This operation can be regarded as an exploration step, which prevents the algorithm from getting stuck in a local optimum.
When the number of iterations reaches a predefined maximum, or the performance no longer improves, the main loop terminates. Note that the final output is the assignment obtained before the last exploration phase.
While Algorithm 1 requires multiple iterations to obtain a good assignment, Algorithm 2, inspired by the largest-value-first algorithm in [12], is even simpler, with only one round. In a homogeneous case, where each item has the same value for all agents, the algorithm repeatedly finds the agent with the minimum sum value and assigns it a remaining item with the largest value; this guarantees a 4/3-approximation to the optimum. We extend the idea of largest-value-first to the heterogeneous environment, and propose a simple greedy algorithm. As shown in Algorithm 2, in the initialization phase, we find a master without any workers assigned, and allocate to it an available worker with the largest contribution for it. In the main loop, we always select the master with the minimum sum value, and allocate to it a remaining worker with the maximum value for this master. The overall complexity of the simple greedy algorithm is $O(m(m + n))$.
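A minimal sketch of the simple greedy assignment (the value matrix and names are illustrative; `values[i, j]` plays the role of worker $j$'s contribution to master $i$):

```python
import numpy as np

def simple_greedy(values):
    """values[i, j]: contribution of worker j to master i.
    Repeatedly give the master with the smallest sum value the remaining
    worker that is most valuable to it."""
    n, m = values.shape
    assignment = -np.ones(m, dtype=int)   # master index per worker
    sums = np.zeros(n)
    remaining = set(range(m))
    # initialization: every master gets one worker, its best available one
    for i in range(n):
        j = max(remaining, key=lambda j: values[i, j])
        assignment[j], sums[i] = i, values[i, j]
        remaining.remove(j)
    # main loop: always help the currently worst-off master
    while remaining:
        i = int(np.argmin(sums))
        j = max(remaining, key=lambda j: values[i, j])
        assignment[j] = i
        sums[i] += values[i, j]
        remaining.remove(j)
    return assignment, sums

rng = np.random.default_rng(3)
assignment, sums = simple_greedy(rng.random((3, 10)))
```

Each worker is touched exactly once after initialization, which is what keeps the algorithm to a single round.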
IV Probabilistic Worker Assignment
In this section, we solve problem (11) for probabilistic worker assignment, where $s_{i,j} = p_{i,j} \in [0, 1]$. The key challenge is the non-convexity of constraint (10). We observe that constraint (10) can be decomposed into a difference of convex functions, and adopt the SCA method to jointly solve the worker assignment and load allocation problems.
From Lemma 1, we know that the function defined in (40) is convex in the load and time variables. Based on it, it is easy to see that the following functions are all convex:
(30)

(31)
and we have
(32) |
By linearizing the concave parts at any given feasible point, the convex upper approximations of the functions in (30) and (31) can be obtained as follows [17]:
(33)

(34)
Let the subscript $(i, j)$ denote the variables, parameters and functions related to master $i$ and worker $j$; thus,
(35) |
Now we can give a convex upper approximation of the left-hand side of constraint (10) in the following lemma.
Lemma 2.
The left-hand side of constraint (10) can be approximated by a convex function as follows:
(36) |
Given a feasible point of problem (11), the convex approximation to problem (11) at that point is given by:
(37)

s.t. (38)

(39)
A probabilistic worker assignment and load allocation algorithm based on the SCA method is proposed in Algorithm 3. A diminishing step-size rule is adopted, with the step-size scaled down by a fixed decreasing ratio at each iteration, which guarantees the convergence of the SCA [17]. Starting from a feasible point of problem (11), we iteratively solve convex optimization problems in which constraint (10) is replaced by its convex upper approximation (38). The iteration terminates when the solution becomes stationary, and according to Theorem 2 in [17], the stationary solution obtained by the SCA-based algorithm is a local optimum.
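The core SCA idea — upper-bound the concave part of a difference-of-convex function by its linearization and solve the resulting convex surrogate repeatedly — can be illustrated on a one-dimensional toy problem (this is only the principle behind Algorithm 3, not the algorithm itself):

```python
import numpy as np

# f(x) = x**4 - 2*x**2 is a difference of convex functions:
# g(x) = x**4 (convex) minus h(x) = 2*x**2 (convex).
def sca(x0, iters=60):
    x = x0
    for _ in range(iters):
        # Surrogate at x_k: x**4 - (h(x_k) + h'(x_k)*(x - x_k)), which is
        # convex; its minimizer solves 4*x**3 = 4*x_k, i.e. x = cbrt(x_k).
        x = np.cbrt(x)
    return float(x)

x_star = sca(0.3)            # converges to the local optimum x = 1
```

Each surrogate upper-bounds $f$ and matches it at the current iterate, so the sequence of surrogate minimizers descends to a stationary point of $f$.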
IV-A Comparison of Dedicated and Probabilistic Assignments
We remark that the completion time of probabilistic worker assignment is a lower bound on that achieved by dedicated worker assignment, since any feasible point of dedicated assignment is also feasible for probabilistic assignment. On the other hand, dedicated assignment simplifies the connections between workers and masters, requiring less communication for the multicast of the input vectors and less storage at each worker. Moreover, the proposed dedicated assignment algorithms have lower computational complexity and are easier to implement.
V Simulation Results
In this section, we evaluate the average task completion time of the proposed dedicated and probabilistic worker assignment algorithms, in both small-scale and large-scale scenarios. In the small-scale scenario, with a few masters and workers, we consider three benchmarks: 1) Uncoded computing with uniform dedicated worker assignment: each master is assigned an equal number of workers, and its data matrix is equally partitioned into uncoded sub-matrices, one per assigned worker. 2) Coded computing with uniform dedicated worker assignment [5]: each master is assigned an equal number of workers, and the load is allocated according to Theorem 1. 3) Brute-force search for dedicated worker assignment: the oracle solution for dedicated worker assignment is obtained by searching over all possible assignments, with the load allocated according to Theorem 1. In the large-scale scenario, with more masters and workers, we only use the first two benchmarks, due to the high complexity of the brute-force search.
The straggling parameters $\mu_{i,j}$ are randomly selected and the shift parameters $a_{i,j}$ are set following [5]. In Algorithm 1, we randomly remove workers in each exploration phase. In Algorithm 3, we set a small convergence threshold and decreasing ratio, and use the CVX toolbox (http://cvxr.com/cvx/) to solve each convex approximation problem. We obtain the worker assignment and load allocation from the algorithms that minimize the approximate completion time. We then carry out Monte Carlo realizations and calculate the average task completion time, upon which all masters can recover their computations.
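The Monte Carlo evaluation step can be sketched as follows for a hypothetical single-master instance: sample the all-or-nothing worker finish times, then record the first instant at which the accumulated coded rows reach the recovery threshold:

```python
import numpy as np

rng = np.random.default_rng(7)
loads = np.array([40, 30, 30, 20])            # allocated coded rows per worker
mus = np.array([1.2, 0.9, 1.5, 1.0])          # straggling parameters
shifts = np.array([0.05, 0.07, 0.04, 0.06])   # per-row shifts
r = 70                                        # rows needed for recovery

def completion_time(rng):
    # shifted-exponential finish time of each worker, as in (2)
    T = shifts * loads + rng.exponential(scale=loads / mus)
    order = np.argsort(T)
    acc = np.cumsum(loads[order])
    k = int(np.searchsorted(acc, r))  # first worker whose results reach r rows
    return float(T[order][k])

avg_time = float(np.mean([completion_time(rng) for _ in range(10_000)]))
```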
Figs. 2 and 3 compare the average task completion time achieved by the proposed algorithms and the benchmarks. The x-axis is the index of the master nodes, and each group of bars shows the average task completion time of a master under different algorithms. We label the time at which the final task is finished on the corresponding master, which is the quantity we aim to minimize. For example, in Fig. 2, using the uncoded scheme, the masters finish at different times, and we label the latest completion time on the corresponding master.
As shown in both Fig. 2 and Fig. 3, the proposed greedy dedicated assignment and SCA-based probabilistic assignment algorithms outperform the uncoded and coded benchmarks with uniform assignment of dedicated workers. For example, in the small-scale scenario, all the proposed algorithms substantially improve the delay performance over both the uncoded and the coded benchmarks. The performance gain is mainly achieved by taking the heterogeneity of the system into account. Moreover, the iterated greedy algorithm is slightly better than the simple greedy algorithm, and both perform close to the optimal brute-force search. Probabilistic assignment further outperforms dedicated assignment, which is consistent with the fact that it provides a lower bound for dedicated assignment. We also record the approximate completion times obtained by solving problem (11), which are quite close to the average completion times observed in the simulations.
In Fig. 4, the impact of the decreasing ratio on the convergence of the SCA-based probabilistic assignment algorithm is evaluated in the large-scale scenario. The decreasing ratio determines the step-size, and thus the convergence rate, of the SCA algorithm. We can see that, by choosing a proper decreasing ratio, the proposed algorithm converges within a small number of iterations, and outperforms the iterated greedy algorithm for dedicated worker assignment.
VI Conclusions
We have considered a joint worker assignment and load allocation problem in a distributed computing system with heterogeneous computing servers, i.e., workers, and multiple master nodes competing for these workers. MDS coding has been adopted by the masters to mitigate the straggler effect, and both dedicated and probabilistic assignment algorithms have been proposed in order to minimize the average task completion time. Simulation results show that the proposed algorithms can significantly reduce the task completion time compared to both uncoded task assignment and an unbalanced coded scheme. While probabilistic assignment is more general, we have observed through simulations that the two have similar delay performance. We have also noted that dedicated assignment has lower computational complexity and lower communication and storage requirements, which is beneficial for practical implementation. As future work, we plan to take communication delays into consideration, and to develop decentralized algorithms.
Appendix A Proof of Lemma 1
It is easy to see that the objective (13) and constraints (15) are convex. Let

$g(l, t) = l\, e^{-\frac{\mu}{l}\left(t - a l\right)} - l$,  (40)

with variables $l > 0$, $t$, and parameters $\mu > 0$, $a > 0$. The Hessian matrix of $\varphi(l, t) = l\, e^{-\frac{\mu}{l}(t - a l)}$ is:

$\nabla^2 \varphi = \frac{e^{-\frac{\mu}{l}(t - a l)}}{l} \begin{pmatrix} \frac{\mu^2 t^2}{l^2} & -\frac{\mu^2 t}{l} \\ -\frac{\mu^2 t}{l} & \mu^2 \end{pmatrix}$.  (41)

The eigenvalues of $\nabla^2 \varphi$ are $0$ and $\frac{e^{-\frac{\mu}{l}(t - a l)}}{l}\left(\mu^2 + \frac{\mu^2 t^2}{l^2}\right) \geq 0$. Thus $\nabla^2 \varphi \succeq 0$, and $\varphi$ is convex. Since $g$ is $\varphi$ minus a linear term, $g$ is convex. Constraint (14) is the summation of convex functions, and hence convex. Therefore, problem (13) is a convex optimization problem.

Appendix B Proof of Theorem 1
References
- [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” [Online] Available: https://arxiv.org/abs/1812.02858, Dec. 2018.
- [2] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1514-1529, Mar. 2018.
- [3] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: avoiding stragglers in distributed learning,” in Proc. Int. Conf. on Machine Learning, Sydney, Australia, Aug. 2017, pp. 3368–3376.
- [4] S. Li, M. A. Maddah-Ali and A. S. Avestimehr, “A unified coding framework for distributed computing with straggling servers,” IEEE Global Commun. Conf. Workshop, Washington, DC, USA, Dec. 2016.
- [5] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, “Coded computation over heterogeneous clusters,” [Online] Available: https://arxiv.org/abs/1701.05973, Jan. 2017.
- [6] N. Ferdinand and S. C. Draper, “Hierarchical coded computation,” in Proc. IEEE Int. Symp. on Inform. Theory (ISIT), Vail, CO, USA, Jun. 2018, pp. 1620–1624.
- [7] S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrix multiplication,” [Online] Available: https://arxiv.org/abs/1801.10292, May 2018.
- [8] E. Ozfatura, D. Gündüz, and S. Ulukus, “Speeding up distributed gradient descent by utilizing non-persistent stragglers,” [Online] Available: https://arxiv.org/abs/1808.02240, Aug. 2018.
- [9] S. Li, S. M. M. Kalan, A. S. Avestimehr, and M. Soltanolkotabi, “Near-optimal straggler mitigation for distributed gradient methods,” IEEE Int. Parallel and Distributed Processing Symp. Workshops, Vancouver, BC, Canada, May 2018, pp. 857-866.
- [10] A. Behrouzi-Far and E. Soljanin, “On the effect of task-to-worker assignment in distributed computing systems with stragglers,” 56th Annual Allerton Conf. on Commun., Control, and Comput., Monticello, IL, USA, Oct. 2018, pp. 560-566.
- [11] M. Mohammadi Amiri and D. Gündüz, “Computation scheduling for distributed machine learning with straggling workers,” [Online] Available: https://arxiv.org/abs/1810.09992, Oct. 2018.
- [12] B. Deuermeyer, D. Friesen, and M. Langston, “Scheduling to maximize the minimum processor finish time in a multiprocessor system,” SIAM J. Algebraic Discrete Methods, vol. 3, no. 2, pp. 190-196, Jun. 1982.
- [13] D. Chakrabarty, J. Chuzhoy, and S. Khanna, “On allocating goods to maximize fairness,” 50th Annual IEEE Symposium on Foundations of Computer Science, Atlanta, GA, USA, Oct. 2009, pp. 107-116.
- [14] A. Asadpour and A. Saberi, “An approximation algorithm for max-min fair allocation of indivisible goods,” SIAM J. Comput., vol. 39, no. 7, pp. 2970-2989, May 2010.
- [15] B. Hayes, “Computing science: the easiest hard problem,” American Scientist, vol. 90, no. 2, pp. 113-117, Apr. 2002.
- [16] L. Fanjul-Peyro and R. Ruiz, “Iterated greedy local search methods for unrelated parallel machine scheduling,” European Journal of Operational Research, vol. 207, no. 1, pp. 55-69, Nov. 2010.
- [17] G. Scutari, F. Facchinei, and L. Lampariello, “Parallel and distributed methods for constrained nonconvex optimization -part I: theory,” IEEE Trans. Signal Process., vol. 65, no. 8, pp. 1929-1944, Apr. 2017.