## I Introduction

With the rapid development of artificial intelligence technologies and the explosion of data, computation tasks for the training and inference of machine learning (ML) models are becoming increasingly complex and demanding, and are almost impossible to carry out on a single machine. Distributed computing frameworks have been developed to parallelize these computations [2, 3], where a centralized *master* node takes charge of task partition, data dissemination, and result collection, and distributed computing nodes, called *workers*, process partial computation tasks in parallel.

While parallel processing across multiple workers speeds up computation, the overall delay depends critically on the slowest worker. According to the experiments on the commercial Amazon elastic compute cloud (EC2) platform, some workers might experience much longer computation and communication delays than the average [4, 5, 6]. This fact is mainly due to the randomness of the system, e.g., time-varying stochastic workloads of the workers, or the traffic over the communication network connecting the workers to the master. Such randomness leads to the so-called *straggler effect*, which substantially increases the overall computation delay and becomes a major bottleneck in distributed computing.

The key idea to mitigate the straggler effect is to add redundancy to the computation tasks, so that the computation result does not rely on receiving results from all the workers. State-of-the-art approaches mainly include redundant scheduling of computation tasks [7, 8, 9], and various coding schemes [10], such as maximum distance separable (MDS) coding [4, 11, 5, 6, 12, 13], gradient coding [14, 15, 16], and polynomial coded computation [17, 18]. Among them, the easiest policy is to replicate each task to multiple workers upon its arrival, and the optimal number of replicas can be derived under exponential [7] or general service time distributions [8]. The orders of partitioned tasks at different workers are designed in [9], and the impact of redundancy on the task completion delay under different scheduling orders is characterized.

Compared with simple task replication, coding can further improve the efficiency of computation. MDS coding schemes under different system settings have been widely investigated for matrix multiplication, which is the most common type of computation task in the distributed computing system. With homogeneous workers, it is proved in [4] that MDS coding can reduce the computation delay by compared to uncoded computation. Considering that workers have heterogeneous computing capabilities, the load allocation algorithms are proposed in [5] and [12] for a single-task scenario, both with asymptotic optimality. Based on [5], an online, recursive load allocation algorithm is further proposed in [13] for the random task arrival case, where cancellation is enabled to clear the unfinished parts of each task upon its completion, so as to avoid unnecessary computations.

Although stragglers are slower than average, they can still provide partial results. This can be achieved by the hierarchical coded computation framework [6], or multi-message communications [19, 20, 21, 18]. Specifically, in the hierarchical framework, the coded task at each worker is partitioned into multiple layers. Stragglers are able to finish the lower layer sub-tasks and thus the coding redundancy in the lower layers can be reduced to improve system efficiency [6]. With multi-message communication, each worker is allowed to transmit multiple messages containing partial computation results to the master in each time slot, so that stragglers can still contribute some messages to the system rather than none [19]. Multi-message communication may introduce additional transmission overhead, and the corresponding trade-off between communication and computation delay is investigated in [20]. Bivariate polynomial coding is introduced in [18], and is shown to reduce the average computation delay with respect to univariate polynomial alternatives. This method is further combined with the concept of age of information for timely distributed computing in [21].

The papers above mainly address the straggler effect caused by the randomness of computation delay. Meanwhile, as the communication data volume between the master and worker nodes is usually high, the communication delay cannot be ignored either. Particularly, master and worker nodes might be base stations, mobile phones and smart vehicles at the edge of the wireless network, where the communication delay through wireless links may be highly stochastic and non-negligible. A scalable framework is proposed in [22] for coded distributed computing over wireless networks, where the communication load does not scale with the number of workers. Considering an MDS-coded distributed computing system with homogeneous workers, the impact of packet erasure channel on the delay of tasks is analyzed in [23]. Under heterogeneous settings, fixed transmission rate is considered in [24], and the load allocation of MDS-coded tasks is optimized. A cooperative transmission scheme for coded matrix multiplication is proposed in [25] to reduce the inter-cell interference, while a joint coding and node scheduling algorithm is proposed in [26] based on reinforcement learning.

Most existing papers on distributed coded computing only consider a single-master scenario, and the impact of communication delay on the load allocation has not been sufficiently investigated. In this work, we consider a multi-master heterogeneous-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded with MDS codes, and allocated to workers for parallel computation, with random communication and computation delay. The goal is to jointly design worker assignment and load allocation algorithms to minimize the completion delay of all the tasks. The main contributions of this work are summarized as follows:

1) We consider both dedicated and fractional worker assignment policies, where each worker can process the encoded tasks of either a single master or multiple masters, respectively. Considering the randomness of communication and computation delays, we formulate a unified delay minimization problem for the joint allocation of computing power, communication bandwidth and task load.

2) For dedicated worker assignment, we obtain a non-convex mixed-integer non-linear programming problem (MINLP). The load allocation problem is solved first by deriving a convex approximation problem with Markov’s inequality. Worker assignment is then transformed to a max-min allocation problem, which is NP-hard and solved with greedy heuristics. A successive convex approximation (SCA) based algorithm is proposed to further enhance the load allocation.

3) For fractional worker assignment, the optimization problem is non-convex. We again use Markov’s inequality to simplify the problem, and transform the fractional worker assignment and resource allocation problem to max-min allocation by deriving its optimality condition. A greedy algorithm is proposed accordingly.

4) Simulations under various settings verify the feasibility of the proposed Markov’s inequality based approximation, and show the significant delay reduction of the proposed algorithms over benchmarks. In particular, when using Amazon EC2 for delay evaluation, about and delay reductions are achieved by the proposed algorithms compared to the uncoded and coded benchmarks, respectively.

The rest of the paper is organized as follows. In Section II, we introduce the system model and formulate the problem. In Section III, we propose load allocation and worker assignment algorithms under the dedicated case. In Section IV, we further consider the fractional assignment case. Simulation results are shown in Section V, and conclusions are given in Section VI.

## II System Model and Problem Formulation

As shown in Fig. 1, we consider a distributed computing system with master nodes and worker nodes, denoted by and , respectively. Each master has a matrix-vector multiplication task, denoted by , where , , and are positive integers. Each task can be partitioned and allocated to a subset of workers and processed by them in parallel. Local computation at the master is also available, and thus the set of nodes that can serve master is defined as , where index represents local processing.

To reduce the straggler effect brought by the randomness of communication and computation, we introduce redundancy to each task through MDS coded computation. Specifically, each master encodes matrix in units of rows to get its coded version , where denotes the number of coded rows. Then, the coded matrix is divided into disjoint sub-matrices , , , , where has rows, i.e., . Note that, indicates no assignment from master to worker . Let be the set of workers to serve master .

After task encoding and assignment, each master transmits and to worker through their communication channel. We assume that the channel of each worker is orthogonal with that of others, and each worker can allocate its channel bandwidth to multiple masters and communicate with them simultaneously. This assumption is suitable for many realistic scenarios, e.g., the communication link is wired, or each worker is a base station with orthogonal wireless bandwidth. Each worker calculates the multiplication of and , and transmits back the result. Finally, master can recover the result of the original task upon receiving the inner products of any out of coded rows of and vector .
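The recovery property described above (any sufficiently large subset of coded rows suffices) can be illustrated with a minimal sketch. It assumes a Vandermonde generator matrix over the rationals, which is one possible MDS construction and not necessarily the code used in this work; the helper names `vandermonde`, `matmul` and `solve` are hypothetical.

```python
from fractions import Fraction

def vandermonde(nodes, width):
    """Generator rows [1, x, x^2, ...]; any `width` rows with distinct x are invertible."""
    return [[Fraction(x) ** j for j in range(width)] for x in nodes]

def matmul(P, Q):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def solve(G, Y):
    """Solve G Z = Y by Gauss-Jordan elimination (G square and invertible)."""
    n = len(G)
    aug = [list(g) + list(y) for g, y in zip(G, Y)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

A = [[1, 2], [3, 4], [5, 6]]          # 3 original rows
x = [[1], [-1]]                       # task: compute A x
G = vandermonde(range(1, 6), 3)       # 5 coded rows built from the 3 originals
A_coded = matmul(G, A)                # rows of A_coded are split among the nodes
coded_results = matmul(A_coded, x)    # one inner product per coded row

received = [4, 0, 2]                  # indices of the 3 results that arrive first
recovered = solve([G[i] for i in received],
                  [coded_results[i] for i in received])
```

Exact rational arithmetic is used here only to keep the decoding step free of rounding error in a small example; a practical system would work with floating-point or finite-field codes.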

### II-A Worker Assignment Policy

We consider two worker assignment policies in this work:

1) *Dedicated worker assignment:*
Each worker only serves a single master.
For , let be the worker assignment indicator, where if worker is assigned a coded task by master , and otherwise. We have .

2) *Fractional worker assignment:*
We allow each worker to serve multiple masters simultaneously through processor sharing. Let be the fraction of computing power of worker allocated to master , with .
Define as the fraction of bandwidth allocated to the link between master and worker , with .

We assume that a master is always dedicated, i.e., it only computes its own local task and does not help other masters. Therefore, for , we have and . Also note that, for dedicated worker assignment, the bandwidth allocation variable is binary, with .

### II-B Communication and Computation Delays

We consider the delay of transmitting from master to worker , and ignore the transmission delays of and the computation results. This is because the size of is typically much larger than that of and the result vector. Moreover, as is shared among multiple workers that serve master , can be transmitted in a more efficient way, such as broadcast or multicast.

The communication delay to transmit a single coded row from master to worker using the whole bandwidth is modeled by an exponential distribution [23], with rate parameter . Define the total communication delay of transmitting using of the bandwidth as , whose cumulative distribution function (CDF) is given by:

(1) |

At each master, local processing does not need communication, and thus .

Following the literature [4, 5, 6, 13, 24], the delay of computing the inner product of one coded row of and vector at worker or master () is modeled by a shifted exponential distribution, with shift parameter and rate parameter . For , define the total computation delay of as , with CDF

(2) |

Let be the total communication plus computation delay of the task assigned from master to worker , where and are two independent random variables. Then, if and , the CDF of is given as follows:

(3) |

If and , the CDF of is

(4) |

Otherwise, if , .

For local computation, we have . When , the CDF is given by

(5) |

otherwise, .
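The delay model of this subsection can be explored numerically with a short sampling sketch. It assumes a specific scaling that should be checked against (1) and (2): the communication delay of a sub-task is exponential with rate `bandwidth * gamma / load`, and its computation delay is a shifted exponential with shift `shift * load / power` and rate `power * mu / load`. The function and parameter names are hypothetical.

```python
import random

def sample_total_delay(load, gamma, shift, mu, bandwidth=1.0, power=1.0, rng=random):
    """Draw one communication-plus-computation delay for `load` coded rows.

    Assumed scaling: per-row delays grow linearly with the load and inversely
    with the allocated resource fraction. gamma=None models local processing
    at the master, which needs no communication.
    """
    comm = 0.0 if gamma is None else rng.expovariate(bandwidth * gamma / load)
    comp = shift * load / power + rng.expovariate(power * mu / load)
    return comm + comp

def mean_total_delay(load, gamma, shift, mu, bandwidth=1.0, power=1.0):
    """Expected value of the delay drawn by sample_total_delay."""
    comm = 0.0 if gamma is None else load / (bandwidth * gamma)
    return comm + load * (shift + 1.0 / mu) / power

rng = random.Random(0)
samples = [sample_total_delay(100, gamma=50.0, shift=0.01, mu=20.0, rng=rng)
           for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
expected_mean = mean_total_delay(100, gamma=50.0, shift=0.01, mu=20.0)
```

Comparing `empirical_mean` with `expected_mean` gives a quick sanity check of the parameters before they are used in any allocation routine.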

### II-C Problem Formulation

Our objective is to minimize the task completion delay, by jointly optimizing the allocation of task load , computing power , and communication bandwidth . As the communication and computation delays are random, we aim to minimize the delay , upon which the probability that all the masters can recover their computations is higher than a given threshold . The optimization problem is formulated as:

(6a) | ||||

s.t. | (6b) | |||

(6c) | ||||

(6d) | ||||

(6e) |

In constraint (6b), is defined as the number of computation results that can be received by the master until time , where a unit result refers to the inner product of one coded row of and vector . Constraint (6b) guarantees that each task can be recovered with probability . Equation (6c) is the resource allocation constraint of each worker. In constraint (6d), we have for dedicated worker assignment, while for fractional worker assignment. In constraint (6e), represents the set of non-negative integers.

Since workers have heterogeneous computing and communication capabilities, their loads will be different in general. To derive , we need to find all the combinations of that satisfy , and further derive their joint probability distributions, which is intractable. As a result, problem is hard to solve.

We thus consider an approximation of , where the probability constraint (6b) is substituted by an expectation constraint, shown as follows:

(7a) | ||||

s.t. | (7b) | |||

(7c) | ||||

Constraint (7b) states that master is expected to receive sufficient computation results to recover by time . A similar approximation is also used in [5, 13, 24], and the performance gap in the single-master case can be bounded [5]. As has high dimension, the non-zero are typically large; we therefore relax to in (7c), and ignore the rounding error in the following.

To simplify the system workflow as well as the theoretical analysis, we assume that each encoded task , either being processed locally or allocated to a worker, is processed as a whole without any further partition. Accordingly, each master can only receive computation results from node upon the completion. As computations on workers are independent, can be written as follows:

where denotes the indicator function with if event is true, and otherwise. For , is given in (3) or (4), and for , is given in (5).
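The difference between the probabilistic constraint (6b) and the expectation constraint (7b) can be examined by simulation. The following sketch estimates both the recovery probability and the expected number of received rows at a given time, under the assumption that each sub-task is delivered as a whole and that delays scale linearly with the load; the tuple layout `(comm_mean, shift, comp_mean)` per node and all function names are hypothetical.

```python
import random

def draw_delay(load, comm_mean, shift, comp_mean, rng):
    """One total delay for `load` rows; per-row means scale linearly (assumed)."""
    if load == 0:
        return 0.0
    # comm_mean == 0 models local processing at the master.
    comm = rng.expovariate(1.0 / (load * comm_mean)) if comm_mean > 0 else 0.0
    comp = load * shift + rng.expovariate(1.0 / (load * comp_mean))
    return comm + comp

def received_rows(loads, delays, t):
    """Rows available at time t: a node counts only once its whole sub-task is done."""
    return sum(l for l, d in zip(loads, delays) if d <= t)

def estimate_recovery(loads, nodes, t, rows_needed, trials=5000, seed=0):
    """Monte Carlo estimates of P[rows received >= rows_needed] and E[rows received]."""
    rng = random.Random(seed)
    hits = 0
    total = 0
    for _ in range(trials):
        delays = [draw_delay(l, *node, rng) for l, node in zip(loads, nodes)]
        x = received_rows(loads, delays, t)
        total += x
        hits += x >= rows_needed
    return hits / trials, total / trials

loads = [300, 500, 400]                      # first entry: the master itself
nodes = [(0.0, 0.01, 0.02), (0.005, 0.005, 0.01), (0.01, 0.01, 0.02)]
prob, mean_rows = estimate_recovery(loads, nodes, t=20.0, rows_needed=600)
```

Sweeping `t` and comparing the time at which `mean_rows` reaches the required number of rows with the time at which `prob` reaches the target threshold indicates how conservative or optimistic the expectation-based approximation is for a given configuration.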

In the following two sections, we design solutions to under dedicated and fractional worker assignments, respectively. We will further show in Section V that a good solution to can also achieve low delay under the constraints of .

## III Dedicated Worker Assignment

In this section, we solve problem under the dedicated worker assignment policy, where and . Accordingly, problem is a non-convex MINLP, which is very challenging to solve in general.

We decouple the binary worker assignment variable and the continuous load allocation variable to seek a solution. First, given any worker assignment decision, the load allocation problem is still non-convex. We use Markov’s inequality to provide a convex approximation to the non-convex constraint, and derive the optimal load allocation for this sub-problem. We also show that, when either the computation or communication delay plays a leading role, the original load allocation problem is convex, and the optimal solution can be derived. Then, based on the optimal load allocation, we transform the worker assignment problem into a max-min allocation problem, which is still NP-hard and thus solved with greedy heuristics. Finally, after optimizing the worker assignment, we further provide an enhanced load allocation algorithm by solving the original non-convex problem with the SCA method.

### III-A Load Allocation for the General Case

Given the set of workers that serve master , the optimal load allocation problem aims to minimize the task completion delay for each master :

(8a) | ||||

s.t. | (8b) | |||

(8c) |

where includes the master itself, and denotes the load allocation vector. For , the CDF of the total delay under dedicated assignment can be obtained by letting and in (3) and (4). Accordingly, is a non-convex function, making problem hard to solve.

We provide an approximation to based on Markov’s inequality, i.e., for ,

(9) |

At the master, . Let

(10) |

Then we have

(11) |

Substituting inequality (11) into (8b), we obtain a tighter constraint, and an approximation to is given by

(12a) | ||||

s.t. | (12b) | |||

(12c) |

Problem is a convex optimization problem, and the optimal solution is given as follows.

###### Theorem 1.

For a given subset of workers that serves a master , the optimal load allocation and the corresponding task completion delay to are

(13a) | ||||

(13b) |

###### Proof.

See Appendix A. ∎

As shown in (10), represents the expected total delay for worker to handle a unit coded task of master , and thus indicates the average communication plus computation rate. As shown in Theorem 1, the optimal load allocated to each worker is proportional to , while inversely proportional to the overall communication plus computation rates of workers.
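The proportionality described above can be made concrete with a minimal sketch. It assumes that the Markov bound takes the form P[T ≤ t] ≥ 1 − lθ/t, where l is the load of a node and θ its expected delay per coded row, so that the approximate constraint reads Σ l(1 − lθ/t) ≥ L. The closed form in the code is worked out from this assumed bound and should be checked against (13a) and (13b); the names `markov_load_allocation` and `thetas` are illustrative.

```python
def markov_load_allocation(rows_needed, thetas):
    """Load allocation under the assumed Markov-bound constraint.

    thetas[n] is the expected delay of node n per coded row (master included).
    For a fixed t, the bound sum_n l_n * (1 - l_n * theta_n / t) is maximized
    by l_n = t / (2 * theta_n); the smallest t meeting the constraint follows.
    """
    rate_sum = sum(1.0 / th for th in thetas)
    t_star = 4.0 * rows_needed / rate_sum
    loads = [t_star / (2.0 * th) for th in thetas]
    return loads, t_star

thetas = [0.02, 0.01, 0.04]
loads, t_star = markov_load_allocation(1000, thetas)
# Value of the approximate expectation constraint at the returned point.
expected_rows_bound = sum(l * (1.0 - l * th / t_star) for l, th in zip(loads, thetas))
```

Under this assumed form, each load is proportional to 1/θ and the loads sum to twice the number of required rows, which gives a feel for how much redundancy the approximation introduces.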

### III-B Load Allocation for the Computation Delay Dominant Case

When computation delay is much larger than the communication delay, we ignore the latter and get . The CDF of is given in (2). It is easy to see that the optimal solution of must satisfy . In fact, if there is a worker such that , then , meaning that the master cannot expect to obtain the computation results from worker . By reducing to satisfy , constraint (8b) can be strictly satisfied, and thus can be further reduced.

Based on this observation, constraint (8b) of can be written as

(14) |

The following theorem provides the optimal solution to .

###### Theorem 2.

When computation delay dominates the total delay, is a convex optimization problem, and the optimal load allocation and task completion delay are

(15a) | ||||

(15b) |

where , and denotes the lower branch of Lambert W function, with and .

###### Proof.

See Appendix B. ∎

Similar results can be derived for the communication delay dominant case, by substituting with and letting .

### III-C Dedicated Worker Assignment Algorithms

In this subsection, we design worker assignment algorithms, aiming to assign workers to masters in a balanced manner and minimize the completion delay of the slowest task.

According to Theorem 1, the minimum task completion delay that can be achieved under a given subset of workers is

(16) |

where we recall that is the worker assignment indicator.

From , the objective of worker assignment is . As , the objective is equivalent to . Let , and thus

(17) |

The worker assignment problem can be transformed into the following form:

(18a) | ||||

s.t. | (18b) | |||

(18c) |

Note that, for the computation delay dominant case, we only need to set , while the rest of the derivation still holds.

Problem is known as the *max-min allocation* problem, which was proposed for the fair assignment of items [27, 28]. In the original max-min allocation problem, each of the items has a unique value for an agent, and can be assigned to one of the agents. The objective is to assign all the items to the agents as fairly as possible, by maximizing the minimum total value of agents.
In , each worker is an item with value for master , and each master corresponds to an agent.
The partitioning problem [29] is a special case of the max-min allocation problem, obtained by considering only 2 agents and assuming that each item has the same value for each agent. Since the partitioning problem is NP-complete, the max-min allocation problem is NP-hard.

Although some polynomial-time algorithms have been proposed for the max-min allocation problem with worst-case performance guarantee [27, 28], they are very complex and difficult to implement. Instead, we propose two greedy algorithms in the following.

Inspired by [30], an iterated greedy algorithm is proposed, as shown in Algorithm 1. In the initialization phase, we assign each worker to the master with highest , in order to maximize the contribution of workers. Then, we iterate among the insertion, interchange, and exploration phases, until the termination condition is met. To be specific, in the insertion phase, each worker is re-assigned to a master with the minimum sum value if the minimum sum value of the masters is improved. In the interchange phase, any two workers exchange the masters they are serving, if the minimum sum values of both masters, and the total value of the workers are improved. In the exploration phase, a subset of workers are randomly removed from the current assignment, and allocated to the masters in a greedy manner. If the number of iterations reaches a preset maximum value, or the minimum sum value of the masters does not improve any more, the iteration is terminated. Note that, the final output is the worker assignment after the interchange phase.
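The initialization and insertion phases described above can be sketched as follows. This is a partial illustration only: the interchange and exploration phases and the termination rule of Algorithm 1 are omitted, and the data layout (`values[m][n]` for the value of worker `n` to master `m`, `local_values[m]` for the master's own value) is an assumption.

```python
def master_totals(assignment, local_values, values):
    """Sum value of every master: its local value plus its assigned workers."""
    totals = list(local_values)
    for worker, master in assignment.items():
        totals[master] += values[master][worker]
    return totals

def initial_assignment(values):
    """Assign each worker to the master for which it has the highest value."""
    num_masters, num_workers = len(values), len(values[0])
    return {n: max(range(num_masters), key=lambda m: values[m][n])
            for n in range(num_workers)}

def insertion_phase(assignment, local_values, values):
    """Move workers to the weakest master while the minimum sum value improves."""
    assignment = dict(assignment)
    improved = True
    while improved:
        improved = False
        for n in sorted(assignment):
            totals = master_totals(assignment, local_values, values)
            weakest = min(range(len(totals)), key=totals.__getitem__)
            if assignment[n] == weakest:
                continue
            trial = dict(assignment)
            trial[n] = weakest
            if min(master_totals(trial, local_values, values)) > min(totals):
                assignment = trial
                improved = True
    return assignment

local_values = [1, 1]
values = [[5, 3, 4], [4, 2, 1]]
start = initial_assignment(values)
result = insertion_phase(start, local_values, values)
```

Because a move is accepted only when the minimum sum value strictly increases, the loop cannot cycle and terminates after finitely many moves.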

As shown in Algorithm 2, we also propose a simple greedy algorithm that does not require iterations for performance improvement, inspired by the largest-value-first algorithm [31]. The initial value of each master is related to its local computation capability, given by . During the main loop, we select a master whose current sum value is the minimum, and allocate an available worker with highest for master . The algorithm terminates when all the workers are allocated.
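The simple greedy procedure described above can be sketched in a few lines. The data layout is an assumption (`values[m][n]` for the value of worker `n` to master `m`, `local_values[m]` for the initial value of master `m`), and ties are broken by the lowest index, which the text does not specify.

```python
def simple_greedy_assignment(local_values, values):
    """Repeatedly give the currently weakest master its best available worker."""
    num_masters = len(local_values)
    totals = list(local_values)            # initial value: local computation only
    available = set(range(len(values[0])))
    assignment = {}
    while available:
        # Master whose current sum value is the minimum.
        m = min(range(num_masters), key=lambda i: totals[i])
        # Available worker with the highest value for that master.
        n = max(sorted(available), key=lambda j: values[m][j])
        assignment[n] = m
        totals[m] += values[m][n]
        available.remove(n)
    return assignment, totals

local_values = [1.0, 2.0]
values = [[5, 3, 1], [4, 6, 2]]
assignment, totals = simple_greedy_assignment(local_values, values)
```

Each worker is assigned exactly once, so the loop runs once per worker and needs no iterative refinement.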

### III-D SCA-Enhanced Load Allocation

The main purpose of using Markov’s inequality for load allocation in the general case is to provide an explicit form for the worker assignment problem. After that, we can get back to the original load allocation problem to further improve the performance. We observe that the non-convex constraint (8b) in has a structure of the difference of convex functions, and thus we implement the SCA method to further optimize the load allocation.

When ,

(19) |

Let . Without loss of generality, we assume , and let

Otherwise, we can exchange with , and the following solution still works. Let . From Appendix B, we know that , , and are all convex functions. Accordingly,

(20) |

that is, can be decomposed into the difference of convex functions.

For any given point , a convex upper bound of can be obtained by linearizing :

(21) |

Let . A convex approximation problem to under point , denoted by , is given by

(22a) | ||||

s.t. | (22b) | |||

(22c) |

Based on the SCA method proposed in [32], we develop an SCA-enhanced load allocation algorithm, as shown in Algorithm 3. For each master and the corresponding worker assignment by Algorithm 1 or 2, the SCA algorithm starts from a feasible point of . Note that Markov’s inequality yields a constraint that is tighter than (8b), and thus the solution in Theorem 1 is feasible and directly provides . Then, we iteratively solve convex optimization problems until convergence, where in the -th iteration, is updated according to Line 4 using step-size . According to [32], we update with a decreasing ratio , so as to guarantee the convergence to a local optimum.
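The mechanics of SCA with a diminishing step size can be seen on a one-dimensional toy problem. The sketch below is not the load allocation problem of this section: it minimizes f(x) = x⁴ − 2x², written as a difference of the convex functions g(x) = x⁴ and h(x) = 2x², and the step-size rule `step * (1 - alpha * step)` is an assumed, commonly used choice rather than the rule of Algorithm 3.

```python
def sca_minimize(x0, step=1.0, alpha=0.1, iterations=200):
    """Toy SCA for f(x) = g(x) - h(x) over x > 0, with g(x) = x**4, h(x) = 2*x**2."""
    x = x0
    for _ in range(iterations):
        # Linearizing h at the current point gives the convex surrogate
        # y**4 - 4*x*y + const, whose minimizer over y > 0 is the cube root of x.
        x_hat = x ** (1.0 / 3.0)
        # Move part of the way towards the surrogate minimizer.
        x = x + step * (x_hat - x)
        # Diminishing step size (assumed rule).
        step = step * (1.0 - alpha * step)
    return x

x_star = sca_minimize(0.5)
f_star = x_star ** 4 - 2.0 * x_star ** 2
```

Starting from 0.5, the iterates increase towards the stationary point x = 1, where f attains its minimum over the positive reals. In the load allocation problem the difference-of-convex structure sits in the constraint rather than the objective, so each iteration requires a convex solver instead of a closed-form surrogate minimizer.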

As a summary, we would like to provide the following remarks.

###### Remark 1.

Scope of application:
*While we assumed certain delay distributions in the system model, the Markov’s inequality based approximate load allocation and the corresponding worker assignment algorithms, introduced in Section III-A and Section III-C, do not rely on these distributions. Instead, the proposed solution can be applied to any communication and computation delay distributions with broad adaptivity, as long as their mean values are known.
To further carry out the SCA-enhanced load allocation, we need to specify the delay distributions.*

###### Remark 2.

Iterated matrix multiplication:
*Distributed matrix-vector multiplication is often needed for the training of large ML models, where matrix corresponds to the data and vector to the model [4, 19]. Using a common training algorithm such as distributed gradient descent, the coded data is transmitted to the workers at the beginning, while multiple iterations of computations are required with the updated model vector. In this scenario, we can use the result of the computation-delay dominant case for worker assignment and load allocation, or modify the communication delay distribution of by removing the load variable .*

## IV Fractional Worker Assignment

While dedicated worker assignment only needs a simple communication connection topology between masters and workers, it may lead to an unbalanced worker assignment, particularly when a few workers are much more powerful than the others, or the number of workers is relatively small. Therefore, in this section, we further consider fractional worker assignment, by allowing each worker to serve multiple masters simultaneously. In this case, we have ,
