# Computation Scheduling for Distributed Machine Learning with Straggling Workers

We study the scheduling of computation tasks across n workers in a large-scale distributed learning problem. Computation speeds of the workers are assumed to be heterogeneous and unknown to the master, and redundant computations are assigned to workers in order to tolerate straggling workers. We consider sequential computation and instantaneous communication from each worker to the master, and each computation round, which can model a single iteration of the stochastic gradient descent algorithm, is completed once the master receives k distinct computations from the workers. Our goal is to characterize the average completion time as a function of the computation load, which denotes the portion of the dataset available at each worker. We propose two computation scheduling schemes that specify the computation tasks assigned to each worker, as well as their computation schedule, i.e., the order of execution, and derive the corresponding average completion time in closed form. We also establish a lower bound on the minimum average completion time. Numerical results show a significant reduction in the average completion time over existing coded computing schemes, which are designed to mitigate straggling servers but often ignore the computations of non-persistent stragglers, as well as over uncoded computing schemes. Furthermore, it is shown numerically that when the speeds of different workers are relatively skewed, the gap between the upper and lower bounds is relatively small. The reduction in the average completion time is obtained at the expense of increased communication from the workers to the master. We study the resulting trade-off by comparing the average number of distinct computations sent from the workers to the master for each scheme, defined as the communication load.


## I Introduction

The growing computational complexity and memory requirements of emerging machine learning applications involving massive datasets cannot be satisfied on a single machine. Consequently, distributed computation across tens or even hundreds of computation servers, called workers, has been a topic of great interest in machine learning [1] and big data analytics [2]. A major bottleneck in distributed computation, and in its application to large learning tasks, is that the overall performance can significantly deteriorate due to slow servers, referred to as stragglers. To mitigate the limitation of straggling servers, coded computation techniques, inspired by erasure codes, have been proposed recently. With coded computation, computations from only a subset of non-straggling workers are sufficient to complete the computation task, thanks to redundant computations performed by the faster workers. In [3], the authors employ a maximum-distance separable (MDS) code-inspired distributed computation scheme to mitigate the effect of straggling servers in a distributed matrix-vector multiplication problem. A more general distributed gradient descent problem is considered in [4], where a labeled dataset is partitioned and distributed to worker nodes, each of which evaluates the gradient on its own partition. Various coding schemes have been introduced in [4, 5, 6, 7, 8] that assign redundant computations to workers to attain tolerance against straggling workers. Coded distributed computation has also been studied for matrix-matrix multiplication, where the labeled data is coded before being delivered to workers [9, 10, 11], and for distributed computation of a polynomial function [12]. Please see [13] for an overview and classification of different approaches.

Most existing coded computation techniques are designed to tolerate persistent stragglers, i.e., workers that are extremely slow compared to the rest for a long period of time, and they therefore discard computations performed by straggling workers. However, persistent stragglers are rarely seen in practice, and we often encounter non-persistent stragglers, which, despite being slower, complete a significant portion of the assigned tasks by the time faster workers complete their assignments [14]. Recently, there have been efforts to exploit the computations carried out by non-persistent stragglers at the expense of increasing the communication load from the workers to the master [14, 15, 16, 13, 17]. The techniques studied in [14, 15, 16, 13] are based on coding with associated encoding and decoding complexities, which may require the availability and central processing of all the data points at the master. Furthermore, the coded design proposed in [14] depends on the statistical behavior of the stragglers, which may not be possible to predict accurately in practice. The coding technique studied in [16] requires a sufficiently large number of data samples assigned to each worker to guarantee decodability of the target function at the master node with high probability, while the approach considered in [17] requires a large number of workers compared to the number of data batches to ensure that the master node can recover all the data from the workers with high probability.

We do not apply any coding across the dataset, but consider a centralized scheduling strategy for uncoded distributed computation, where the computation tasks are assigned to the workers by the master server. Each worker can compute a limited number of tasks, referred to as the computation load. Computations are carried out sequentially, and the result of each computation is sent to the master immediately after it is completed. Communication delay from the workers to the master is ignored, although independent delays across workers can easily be incorporated into our framework. This sequential computation and communication framework allows the master to exploit even partial computations by slow workers.

Assuming that the computation time of a task by each worker is random, the goal is to characterize the minimum average completion time as a function of the computation load. We first provide a generic expression for the average completion time as a function of the computation schedule, which specifies both the tasks assigned to each worker and their computation order. We propose two different computation scheduling schemes, and obtain closed-form expressions for their average completion times, which provide an upper bound on the minimum average completion time. We also establish a lower bound on the minimum average completion time, which is tight for low and high computation loads. To fully understand the impact of the proposed uncoded distributed computation framework on the system performance, we also study the resultant communication load, which is defined as the average total number of distinct computations transmitted from the workers to the master. Since the computations are transmitted sequentially, we assume that each transmission corresponds to a communication packet creating additional traffic for the underlying communication protocol. It is expected that the increased communication load will increase the communication delay; however, the exact amount of this delay will depend on the network topology, connection technology, and the communication protocol between the workers and the master.

The organization of the paper is as follows. We present the system model and the problem formulation in Section II. In Section III, we analyze the average completion time for the general case. We provide an upper and a lower bound on the minimum average completion time in Section IV and Section V, respectively. In Section VI, we overview some of the alternative approaches in the literature, and compare their performance with the proposed uncoded schemes numerically. Finally, the paper is concluded in Section VII.

Notations: $\mathbb{R}$, $\mathbb{Z}$, and $\mathbb{Z}^+$ represent the sets of real values, integers, and positive integers, respectively. For two integers $i$ and $j$, $i\le j$, $[i:j]$ denotes the set $\{i,i+1,\dots,j\}$. For any $i\in\mathbb{Z}^+$, we define $[i]\triangleq[1:i]$; and finally, $\binom{n}{k}$ denotes the binomial coefficient "$n$ choose $k$".

## II Problem Formulation

We consider distributed computation of a function $h$ over a dataset $X=\{X_1,\dots,X_n\}$ across $n$ workers. Each element $X_j$ of the dataset, which we will call a data point, may correspond to a minibatch of labeled data samples. Function $h:\mathbb{V}\to\mathbb{U}$ is an arbitrary function, where $\mathbb{V}$ and $\mathbb{U}$ are two vector spaces over the same field $\mathbb{F}$, and each $X_j$ is an element of $\mathbb{V}$, for $j\in[n]$. The computation is considered completed once the master recovers $k$ distinct evaluations (tasks) $h(X_j)$, $j\in\mathcal{K}$, where $\mathcal{K}$ is any arbitrary subset of $[n]$ with $|\mathcal{K}|=k$. Note that we allow partial computations, i.e., $k$ can be smaller than $n$, and we refer to $k$ as the computation target. We also define the computation load $r$ as the maximum number of data points available at each worker for computation. We denote by $\mathcal{X}_i$ the indices of the data points in the dataset assigned to worker $i$, where $|\mathcal{X}_i|\le r$, $i\in[n]$; i.e., worker $i$ computes $h(X_j)$, $j\in\mathcal{X}_i$, for $i\in[n]$.

We denote by $T_i$ the time worker $i$ spends to compute each task assigned to it, $i\in[n]$. We assume that $T_i$ is a random variable with cumulative distribution function (CDF) $F_i(t)$, for $i\in[n]$ (we assume that the CDF $F_i$ is a smooth function of its argument, for $i\in[n]$), and, for $i\ne i'$, $T_i$ is independent of $T_{i'}$. In our model, while the computation speed of each server is random, we assume that, once its value is fixed, each computation at that server takes the same amount of time. Each worker sends the result of each assigned computation to the master immediately after its computation. We assume that the communication time is negligible compared to the computation time; that is, the result of each computation becomes available at the master immediately after its completion. (Note that a constant or a random communication delay, independent across workers, can easily be incorporated into this framework by updating the computation time statistics accordingly; however, we expect communication delays to be correlated across workers in a network setting. The impact of such correlations, e.g., due to congestion, will be the topic of our future work.)

We assume that the computations start at time $t=0$ at all the workers, and once the master receives $k$ distinct evaluations, it sends an acknowledgement message to all the workers to stop computations. We denote the time at which the master receives the result of computation $h(X_j)$ by $T_{X_j}$, $j\in[n]$, which is a random variable. Let $T_{i,X_j}$ denote the time at which worker $i$, for $i\in[n]$, computes $h(X_j)$; then we have

$$T_{X_j}=\min_{i\in[n]}\left\{T_{i,X_j}\right\},\quad\text{for } j\in[n]. \tag{1}$$

The distributions of the random variables $T_{X_j}$ depend on the assignment of the computation tasks to the workers, as well as on the order in which these tasks are carried out by each worker. If evaluation $h(X_j)$ has not been assigned to worker $i$, i.e., $j\notin\mathcal{X}_i$, we set $T_{i,X_j}=\infty$, for $i\in[n]$. We note that $T_{X_j}$, in general, is not independent of $T_{X_{j'}}$, for $j\ne j'$.

We define the task ordering (TO) matrix as an $r\times n$ matrix of integers, $C\in[n]^{r\times n}$, specifying the computation schedule of the tasks assigned to each worker. Let

$$C=\begin{bmatrix}c_{11}&c_{12}&\dots&c_{1n}\\c_{21}&c_{22}&\dots&c_{2n}\\\vdots&\vdots&\ddots&\vdots\\c_{r1}&c_{r2}&\dots&c_{rn}\end{bmatrix}, \tag{2}$$

where $c_{ij}\in[n]$, for $i\in[r]$ and $j\in[n]$. Each column of matrix $C$ corresponds to a different worker, and its elements from top to bottom represent the order of computations. That is, the entry $c_{ij}$ denotes the index of the element of the dataset that is computed by worker $j$ as its $i$-th evaluation; i.e., worker $j$ first computes $h(X_{c_{1j}})$, then computes $h(X_{c_{2j}})$, and so on, until either it computes $h(X_{c_{rj}})$, or it receives the acknowledgement message from the master and stops computing. Note that the task assignment and the order of evaluations are specified by a unique TO matrix $C$. While any matrix $C\in[n]^{r\times n}$ is a valid TO matrix, it is easy to see that an optimal TO matrix will have distinct entries in each of its columns, and at least $k$ distinct entries overall.
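The two properties noted above are easy to check programmatically. Below is a minimal sketch; the function and variable names are ours, not the paper's, and the "at least $k$ distinct entries overall" check reflects our reading of the requirement that the master must be able to collect $k$ distinct evaluations.

```python
# Minimal sketch for validating a TO matrix (illustrative names).
# C is given row-major: C[i][j] is the dataset index (1-indexed) that
# worker j computes as its (i+1)-th evaluation.

def is_reasonable_to_matrix(C, n, k):
    """Check two properties an optimal TO matrix should satisfy:
    no repeated entries within a column (a worker never recomputes a
    task), and at least k distinct entries overall, so the master can
    ever collect k distinct evaluations."""
    r = len(C)
    cols = [[C[i][j] for i in range(r)] for j in range(n)]
    if not all(len(set(col)) == len(col) for col in cols):
        return False
    return len({c for col in cols for c in col}) >= k

# the TO matrix of Example 1 below (n = 4 workers, r = 3)
C_example = [[1, 3, 3, 4],
             [2, 2, 4, 3],
             [3, 1, 1, 1]]
```

Running the check on a matrix with a repeated entry in some column returns `False`.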

###### Example 1.

Consider the following TO matrix for $n=4$ and $r=3$:

$$C=\begin{bmatrix}1&3&3&4\\2&2&4&3\\3&1&1&1\end{bmatrix}. \tag{3}$$

According to the TO matrix $C$ in (3), each worker follows the computation schedule below:

• Worker 1 first computes $h(X_1)$, then $h(X_2)$, and finally $h(X_3)$.

• Worker 2 first computes $h(X_3)$, then $h(X_2)$, and finally $h(X_1)$.

• Worker 3 first computes $h(X_3)$, then $h(X_4)$, and finally $h(X_1)$.

• Worker 4 first computes $h(X_4)$, then $h(X_3)$, and finally $h(X_1)$.

Since the computation time is assumed to scale linearly with the number of computation tasks executed, i.e., the time for worker $i$ to complete $l$ computations is $lT_i$, it follows that:

$$\begin{aligned}
T_{1,X_1}&=T_1, & T_{1,X_2}&=2T_1, & T_{1,X_3}&=3T_1, & T_{1,X_4}&=\infty, &\text{(4a)}\\
T_{2,X_1}&=3T_2, & T_{2,X_2}&=2T_2, & T_{2,X_3}&=T_2, & T_{2,X_4}&=\infty, &\text{(4b)}\\
T_{3,X_1}&=3T_3, & T_{3,X_2}&=\infty, & T_{3,X_3}&=T_3, & T_{3,X_4}&=2T_3, &\text{(4c)}\\
T_{4,X_1}&=3T_4, & T_{4,X_2}&=\infty, & T_{4,X_3}&=2T_4, & T_{4,X_4}&=T_4, &\text{(4d)}
\end{aligned}$$

and therefore,

$$\begin{aligned}
T_{X_1}&=\min\{T_1,\,3T_2,\,3T_3,\,3T_4\}, &\text{(5a)}\\
T_{X_2}&=\min\{2T_1,\,2T_2\}, &\text{(5b)}\\
T_{X_3}&=\min\{3T_1,\,T_2,\,T_3,\,2T_4\}, &\text{(5c)}\\
T_{X_4}&=\min\{2T_3,\,T_4\}. &\text{(5d)}
\end{aligned}$$
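The mapping from a TO matrix and per-task worker times to the collection times $T_{X_j}$ of (5) can be sketched in a few lines. This is a hypothetical helper (names are ours), assuming the linear-scaling model above.

```python
import math

def task_times(C, T):
    """C: r x n TO matrix with 1-indexed dataset entries; T: per-task
    computation time of each worker. Under linear scaling, worker j
    delivers its (i+1)-th scheduled task at time (i+1)*T[j]; each task's
    collection time is the minimum over the workers that compute it."""
    r, n = len(C), len(C[0])
    TX = [math.inf] * n
    for j in range(n):            # worker j (0-indexed)
        for i in range(r):        # its (i+1)-th computation
            task = C[i][j] - 1
            TX[task] = min(TX[task], (i + 1) * T[j])
    return TX

# the TO matrix of Example 1
C = [[1, 3, 3, 4], [2, 2, 4, 3], [3, 1, 1, 1]]
```

For instance, with $T=(3,5,2,10)$, equations (5a)-(5d) give $T_{X_1}=3$, $T_{X_2}=6$, $T_{X_3}=2$, and $T_{X_4}=4$, which the helper reproduces.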

For a given task assignment and order of computations, resulting in a unique TO matrix $C$, we denote the completion time, which corresponds to the time it takes the master to receive $k$ distinct computations, by $T_C(r,k)$. Note that $T_C(r,k)$ is also a random variable, and we define the average completion time as

$$\overline{T}_C(r,k)\triangleq\mathbb{E}\left[T_C(r,k)\right], \tag{6}$$

where the expectation is taken over the distributions of $T_1,\dots,T_n$. We define the minimum average completion time as

$$\overline{T}^*(r,k)\triangleq\min_{C}\left\{\overline{T}_C(r,k)\right\}, \tag{7}$$

where the minimization is taken over all possible TO matrices $C$. The goal is to characterize $\overline{T}^*(r,k)$.

###### Remark 1.

We highlight that most coded distributed computation schemes in the literature require the master to recover the computations (or their average) over the whole dataset. However, it is known that convergence of stochastic gradient descent is guaranteed even if the gradient is computed on only a portion of the dataset at each iteration [18, 19, 20, 21, 22, 23, 24, 25]. This is particularly the case for the random straggling model considered here, where the straggling workers, and hence the uncomputed gradient values, vary at each iteration.

The above formulation of distributed computation includes several problems of interest studied in the literature, some of which are listed below for the case $k=n$, where the goal is to collect all the computations at the master:

• In [17], the dataset is divided into batches, and the gradient computed for the samples within each batch needs to be delivered to the master. Our formulation covers this problem by treating $X_j$ as the $j$-th batch and function $h$ as the gradient computed over the samples in a batch, and by letting each worker compute the gradients of the samples of at most $r$ batches.

• Linear regression with a least-squares objective function is studied in [11]. For a feature matrix $W\in\mathbb{R}^{m\times d}$, for some $m,d\in\mathbb{Z}^+$, the problem boils down to computing the matrix multiplication $W^{\mathsf T}W\theta$ in a distributed manner, where $\theta$ denotes the parameter vector. Matrix $W$ is divided into $n$ equal-size sub-matrices $W_1,\dots,W_n$, where $W=[W_1^{\mathsf T}\ \cdots\ W_n^{\mathsf T}]^{\mathsf T}$. The goal is to compute $W^{\mathsf T}W\theta=\sum_{j=1}^{n}W_j^{\mathsf T}W_j\theta$, where each worker $i$ can compute the evaluations $W_j^{\mathsf T}W_j\theta$, $j\in\mathcal{X}_i$. This problem is covered by our formulation by letting function $h$ be defined as $h(W_j)=W_j^{\mathsf T}W_j\theta$. Another application of our problem is the matrix-vector multiplication problem, in which the goal is to compute $W\theta$ in a distributed manner, where $W\in\mathbb{R}^{m\times d}$ and $\theta\in\mathbb{R}^{d}$. Matrix $W$ is again divided into $n$ equal-size sub-matrices $W_1,\dots,W_n$, where $W=[W_1^{\mathsf T}\ \cdots\ W_n^{\mathsf T}]^{\mathsf T}$. Defining $h$ as $h(W_j)=W_j\theta$, $j\in[n]$, we have $W\theta=[h(W_1)^{\mathsf T}\ \cdots\ h(W_n)^{\mathsf T}]^{\mathsf T}$. We let each worker compute at most $r$ of the evaluations.

• Our formulation also covers the problem studied in [12], where a multivariate polynomial function is computed over a dataset of inputs in a distributed manner utilizing the $n$ workers. Note, however, that the scheme considered in [12] allows coding across the input dataset.
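For the matrix-vector case in the second bullet above, the decomposition into per-block evaluations $h(W_j)=W_j\theta$ can be illustrated with a small sketch (plain Python lists; the names are ours, not the paper's):

```python
# Illustrative sketch: W*theta is recovered by concatenating the n
# per-block evaluations h(W_j) = W_j * theta of consecutive row blocks.

def matvec(W, theta):
    """Dense matrix-vector product over nested lists."""
    return [sum(w * t for w, t in zip(row, theta)) for row in W]

def split_rows(W, n):
    """Divide W into n equal-size sub-matrices of consecutive rows."""
    d = len(W) // n
    return [W[i * d:(i + 1) * d] for i in range(n)]

W = [[1, 0], [0, 1], [2, 3], [4, 5]]
theta = [1, 2]
partials = [matvec(Wj, theta) for Wj in split_rows(W, n=2)]  # h(W_1), h(W_2)
recovered = [y for p in partials for y in p]  # concatenation equals W*theta
```

Each evaluation is an independent task, so any worker holding block $W_j$ can contribute it; the master simply concatenates the $k=n$ distinct results.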

## III Average Completion Time Analysis

Here we analyze the average completion time for a given TO matrix $C$.

###### Theorem 1.

For a given TO matrix $C$, we have

$$\Pr\{T_C(r,k)>t\}=1-F_{T_C}(t)=\sum_{i=n-k+1}^{n}(-1)^{n-k+i+1}\binom{i-1}{n-k}\sum_{S\subset[n]:|S|=i}\Pr\{T_{X_j}>t,\ \forall j\in S\}, \tag{8}$$

which yields

$$\overline{T}_C(r,k)=\sum_{i=n-k+1}^{n}(-1)^{n-k+i+1}\binom{i-1}{n-k}\sum_{S\subset[n]:|S|=i}\int_{0}^{\infty}\Pr\{T_{X_j}>t,\ \forall j\in S\}\,dt. \tag{9}$$

Note that the dependence of the completion time statistics on the TO matrix $C$ in (8) and (9) is through the statistics of the $T_{X_j}$'s. We also note that the expressions in (8) and (9) are fairly general, and apply also to settings where the computation time of a task may depend on its order of computation, rather than being the same for all the computations carried out by the same worker, as we assume in this work.

###### Proof.

The event $\{T_C(r,k)>t\}$ occurs if and only if, for some set $G$ of at least $n-k+1$ task indices, every computation indexed by $G$ takes longer than $t$ while every remaining computation is completed by time $t$; i.e.,

$$\Pr\{T_C(r,k)>t\}=\Pr\Bigl\{\bigcup_{G\subset[n]:\,n-k+1\le|G|\le n}\bigl\{T_{X_j}>t,\ T_{X_{j'}}\le t,\ \forall j\in G,\ \forall j'\in G'\bigr\}\Bigr\}, \tag{10}$$

where we define $G'\triangleq[n]\setminus G$. Since the events in the union are mutually exclusive (pairwise disjoint) for distinct sets $G$, we have

$$\Pr\{T_C(r,k)>t\}=\sum_{i=n-k+1}^{n}\sum_{G\subset[n]:|G|=i}\Pr\{T_{X_j}>t,\ T_{X_{j'}}\le t,\ \forall j\in G,\ \forall j'\in G'\}=\sum_{i=n-k+1}^{n}\sum_{G\subset[n]:|G|=i}H_{G,G'}, \tag{11}$$

where, for $G\subset[n]$ and $G'=[n]\setminus G$, we define

$$H_{G,G'}\triangleq\Pr\{T_{X_j}>t,\ T_{X_{j'}}\le t,\ \forall j\in G,\ \forall j'\in G'\}. \tag{12}$$

Given a particular set $G$ with $|G|=i$, for $i\in[n-k+1:n]$, we have

$$H_{G,G'}=\sum_{l=i}^{n}(-1)^{i+l}\sum_{\hat G\subset G':|\hat G|=l-i}H_{G\cup\hat G,\emptyset}=\sum_{l=i}^{n}(-1)^{i+l}\sum_{\hat G\subset G':|\hat G|=l-i}\Pr\{T_{X_j}>t,\ \forall j\in G\cup\hat G\}, \tag{13}$$

where we used the fact that, for any $g\in G'$, we have

$$H_{G,G'}=H_{G,G'\setminus\{g\}}-H_{G\cup\{g\},G'\setminus\{g\}}. \tag{14}$$

According to (13), for $i\in[n-k+1:n]$, we have

$$\begin{aligned}\sum_{G\subset[n]:|G|=i}H_{G,G'}&=\sum_{G\subset[n]:|G|=i}\sum_{l=i}^{n}(-1)^{i+l}\sum_{\hat G\subset G':|\hat G|=l-i}H_{G\cup\hat G,\emptyset}\\&=\sum_{l=i}^{n}(-1)^{i+l}\sum_{G\subset[n]:|G|=i}\sum_{\hat G\subset G':|\hat G|=l-i}H_{G\cup\hat G,\emptyset}\\&\overset{(a)}{=}\sum_{l=i}^{n}(-1)^{i+l}\binom{l}{i}\sum_{S\subset[n]:|S|=l}H_{S,\emptyset},\end{aligned} \tag{15}$$

where (a) follows since, for each set $S\subset[n]$ with $|S|=l$, there are $\binom{l}{i}$ pairs $(G,\hat G)$ with $G\cup\hat G=S$ and $|G|=i$. Plugging (15) into (11) yields

$$\Pr\{T_C(r,k)>t\}=\sum_{i=n-k+1}^{n}\sum_{l=i}^{n}(-1)^{i+l}\binom{l}{i}\sum_{S\subset[n]:|S|=l}H_{S,\emptyset}. \tag{16}$$

For a particular set $S$ with $|S|=s$, for some $s\in[n-k+1:n]$, the coefficient of $H_{S,\emptyset}$ in (16) is given by

$$\sum_{i=n-k+1}^{s}(-1)^{i+s}\binom{s}{i}=\sum_{i=0}^{s}(-1)^{i+s}\binom{s}{i}-\sum_{i=0}^{n-k}(-1)^{i+s}\binom{s}{i}=-(-1)^{n-k+s}\binom{s-1}{n-k}=(-1)^{n-k+s+1}\binom{s-1}{n-k}, \tag{17}$$

which results in

$$\Pr\{T_C(r,k)>t\}=\sum_{i=n-k+1}^{n}(-1)^{n-k+i+1}\binom{i-1}{n-k}\sum_{S\subset[n]:|S|=i}H_{S,\emptyset}. \tag{18}$$

According to the definition of $H_{S,\emptyset}$ in (12), (18) establishes the expression given in (8). Furthermore, since $T_C(r,k)\ge 0$, i.e., the completion time is non-negative, we have

$$\overline{T}_C(r,k)=\mathbb{E}\left[T_C(r,k)\right]=\int_{0}^{\infty}\Pr\{T_C(r,k)>t\}\,dt, \tag{19}$$

which, after substituting $\Pr\{T_C(r,k)>t\}$ from (18), yields the expression given in (9). ∎

###### Remark 2.

For $k=n$, we have

$$\Pr\{T_C(r,n)>t\}=\sum_{i=1}^{n}(-1)^{i+1}\sum_{S\subset[n]:|S|=i}\Pr\{T_{X_j}>t,\ \forall j\in S\}, \tag{20}$$

and

$$\overline{T}_C(r,n)=\sum_{i=1}^{n}(-1)^{i+1}\sum_{S\subset[n]:|S|=i}\int_{0}^{\infty}\Pr\{T_{X_j}>t,\ \forall j\in S\}\,dt. \tag{21}$$
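The inclusion-exclusion expression in (8) holds sample by sample, so it can be sanity-checked numerically by comparing the empirical tail probability of the completion time with the alternating sum, both evaluated on the same draws of $T_{X_1},\dots,T_{X_n}$. A small sketch with illustrative names and arbitrary random task times:

```python
import random
from itertools import combinations
from math import comb

def lhs_rhs(TX_samples, t, k):
    """Empirical Pr{T_C > t} (left) vs. the alternating sum of (8)
    (right), both computed from the same task-time samples."""
    n = len(TX_samples[0])
    lhs = rhs = 0.0
    for TX in TX_samples:
        # completion needs k distinct tasks, so T_C > t iff at least
        # n - k + 1 of the T_{X_j} exceed t
        lhs += sum(x > t for x in TX) >= n - k + 1
        for i in range(n - k + 1, n + 1):
            coef = (-1) ** (n - k + i + 1) * comb(i - 1, n - k)
            rhs += coef * sum(all(TX[j] > t for j in S)
                              for S in combinations(range(n), i))
    m = len(TX_samples)
    return lhs / m, rhs / m

random.seed(0)
samples = [[random.random() * 3 for _ in range(5)] for _ in range(200)]
p_emp, p_ie = lhs_rhs(samples, t=1.0, k=3)
```

Because the underlying indicator identity is exact, the two estimates agree up to floating-point rounding for any sample set.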

The minimum average completion time $\overline{T}^*(r,k)$ can be obtained as a solution of the optimization problem in (7). Providing a general characterization of $\overline{T}^*(r,k)$ is elusive. In the next section, we will propose two specific computation task assignment and scheduling schemes, and evaluate their average completion times.

## IV Upper Bounds on the Minimum Average Completion Time

In this section, we introduce and study two particular computation task assignment and scheduling schemes, namely cyclic scheduling (CS) and staircase scheduling (SS). The average completion times of these schemes provide upper bounds on $\overline{T}^*(r,k)$.

### IV-A Cyclic Scheduling (CS) Scheme

The CS scheme is motivated by the symmetry across the workers when we have no prior information on their computation speeds. CS makes sure that each computation task has a different order at different workers. This is achieved by a cyclic shift operator.

We denote the TO matrix of the CS scheme by $C_{CS}$ and its element in the $i$-th row and $j$-th column by $C_{CS}(i,j)$, for $i\in[r]$ and $j\in[n]$. The TO matrix is given by

$$C_{CS}(i,j)=g(j+i-1),\quad\text{for } i\in[r] \text{ and } j\in[n], \tag{22}$$

where the function $g$ is defined as follows:

$$g(l)\triangleq\begin{cases}l,&\text{if } 1\le l\le n,\\l-n,&\text{if } l\ge n+1,\\l+n,&\text{if } l\le 0.\end{cases} \tag{23}$$

Thus, the TO matrix is given by

$$C_{CS}=\begin{bmatrix}g(1)&g(2)&\dots&g(n)\\g(2)&g(3)&\dots&g(n+1)\\\vdots&\vdots&\ddots&\vdots\\g(r)&g(r+1)&\dots&g(n+r-1)\end{bmatrix}. \tag{24}$$

Due to the linear scaling of the computation time with the number of computations executed, for the TO matrix in (24) we have, for $j\in[n]$,

$$T_{g(j-i+1),X_j}=\begin{cases}iT_{g(j-i+1)},&\text{for } i=1,\dots,r,\\\infty,&\text{for } i=r+1,\dots,n,\end{cases} \tag{25}$$

which results in

$$T_{X_j}=\min_{i=1,\dots,r}\left\{iT_{g(j-i+1)}\right\}. \tag{26}$$
###### Example 2.

Consider $n=6$ and $r=4$. We have

$$C_{CS}=\begin{bmatrix}1&2&3&4&5&6\\2&3&4&5&6&1\\3&4&5&6&1&2\\4&5&6&1&2&3\end{bmatrix}, \tag{27}$$

and

$$\begin{aligned}
T_{X_1}&=\min\{T_1,\,2T_6,\,3T_5,\,4T_4\}, &\text{(28a)}\\
T_{X_2}&=\min\{T_2,\,2T_1,\,3T_6,\,4T_5\}, &\text{(28b)}\\
T_{X_3}&=\min\{T_3,\,2T_2,\,3T_1,\,4T_6\}, &\text{(28c)}\\
T_{X_4}&=\min\{T_4,\,2T_3,\,3T_2,\,4T_1\}, &\text{(28d)}\\
T_{X_5}&=\min\{T_5,\,2T_4,\,3T_3,\,4T_2\}, &\text{(28e)}\\
T_{X_6}&=\min\{T_6,\,2T_5,\,3T_4,\,4T_3\}. &\text{(28f)}
\end{aligned}$$
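The cyclic TO matrix of (24) is straightforward to generate. A minimal sketch (function names are ours), with the wrap-around map $g$ of (23) implemented via modular arithmetic:

```python
# Sketch of the cyclic scheduling (CS) TO matrix in (22)-(24).

def g(l, n):
    """Wrap any integer shift back into the index range [1, n];
    equivalent to the piecewise definition in (23) for the shifts used."""
    return (l - 1) % n + 1

def cs_matrix(n, r):
    """Row i (1-indexed) holds g(j + i - 1) for workers j = 1..n."""
    return [[g(j + i - 1, n) for j in range(1, n + 1)]
            for i in range(1, r + 1)]
```

For $n=6$ and $r=4$, `cs_matrix(6, 4)` reproduces the matrix in (27).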

In order to characterize the average completion time of the CS scheme, $\overline{T}_{CS}(r,k)$, using (9), we need to obtain $H_{S,\emptyset}=\Pr\{T_{X_j}>t,\ \forall j\in S\}$ for any set $S\subset[n]$ such that $n-k+1\le|S|\le n$, where the $T_{X_j}$ are given in (26). Consider a set $S\subset[n]$ with $|S|=i$, $i\in[n-k+1:n]$. We have

$$H_{S,\emptyset}=\Pr\Bigl\{\min_{i=1,\dots,r}\{iT_{g(j-i+1)}\}>t,\ \forall j\in S\Bigr\}=\Pr\{T_{g(j-i+1)}>t/i,\ \forall i\in[r],\ \forall j\in S\}. \tag{29}$$

In order to obtain $H_{S,\emptyset}$, we need to find all possible values of $g(j-i+1)$, for $i\in[r]$ and $j\in S$. Thus, we define

$$U_S\triangleq\{g(j-i+1),\ \forall i\in[r],\ \forall j\in S\}, \tag{30}$$

and $u_S\triangleq|U_S|$. We represent the set $U_S$ as

$$U_S=\{p_1,\dots,p_{u_S}\}, \tag{31}$$

where $p_l\in[n]$, for $l\in[u_S]$. We also define

$$t_{p_l}\triangleq\min\{i\in[r]:\ g(j-i+1)=p_l \text{ for some } j\in S\}. \tag{32}$$

Accordingly, we have

$$H_{S,\emptyset}=\Pr\{T_{p_l}>t/t_{p_l},\ \forall l\in[u_S]\}, \tag{33}$$

where, due to the independence of $T_1,\dots,T_n$, we obtain

$$H_{S,\emptyset}=\prod_{l=1}^{u_S}\bigl(1-F_{p_l}(t/t_{p_l})\bigr). \tag{34}$$

Next, for $S\subset[n]$ with $n-k+1\le|S|\le n$, we define

$$L_S^{CS}\triangleq\int_{0}^{\infty}\prod_{l=1}^{u_S}\bigl(1-F_{p_l}(t/t_{p_l})\bigr)\,dt. \tag{35}$$

Then, the average completion time of the CS scheme can be written as

$$\overline{T}_{CS}(r,k)=\sum_{i=n-k+1}^{n}(-1)^{n-k+i+1}\binom{i-1}{n-k}\sum_{S\subset[n]:|S|=i}L_S^{CS}. \tag{36}$$

Note that we have obtained an explicit characterization of the average completion time of the CS scheme in terms of the CDFs of the workers' computation times. While this CDF depends on the particular computation task as well as on the capacity and load of the particular server, it is often modeled as a shifted exponential in the literature [3, 14].

In the following corollary, we characterize $L_S^{CS}$ for shifted exponential computation times, i.e., for $i\in[n]$,

$$F_i(t)=\begin{cases}1-e^{-\mu_i(t-\tau_i)},&\text{if } t\ge\tau_i,\\0,&\text{if } t<\tau_i,\end{cases} \tag{37}$$
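For intuition, the shifted exponential model in (37) says each task takes at least $\tau_i$ seconds, plus an exponential tail with rate $\mu_i$, so the mean per-task time is $\tau_i+1/\mu_i$. A minimal sampler sketch (names are ours):

```python
import random, math

def sample_shifted_exp(mu, tau, rng=random):
    # add the minimum service time tau to an Exp(mu) draw, matching
    # F(t) = 1 - exp(-mu (t - tau)) for t >= tau
    return tau + rng.expovariate(mu)

def cdf(t, mu, tau):
    """The shifted exponential CDF of (37)."""
    return 1.0 - math.exp(-mu * (t - tau)) if t >= tau else 0.0
```

Averaging many draws recovers the mean $\tau+1/\mu$, and no draw ever falls below the shift $\tau$.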

where $\mu_i,\tau_i>0$, for $i\in[n]$. We define $\tau'_{p_l}$ as the $l$-th smallest value among $\{t_{p_1}\tau_{p_1},\dots,t_{p_{u_S}}\tau_{p_{u_S}}\}$, for $l\in[u_S]$, i.e.,

$$\begin{aligned}\tau'_{p_1}&\triangleq\min\{t_{p_1}\tau_{p_1},\dots,t_{p_{u_S}}\tau_{p_{u_S}}\},&\text{(38a)}\\\tau'_{p_l}&\triangleq\min\bigl\{\{t_{p_1}\tau_{p_1},\dots,t_{p_{u_S}}\tau_{p_{u_S}}\}\setminus\{\tau'_{p_1},\dots,\tau'_{p_{l-1}}\}\bigr\},\quad\text{for } l=2,\dots,u_S.&\text{(38b)}\end{aligned}$$

We obtain the unique indices $i_1,\dots,i_{u_S}$ such that $\tau'_{p_l}=t_{p_{i_l}}\tau_{p_{i_l}}$, for $l\in[u_S]$; accordingly, we define

$$\mu'_{p_l}\triangleq\mu_{p_{i_l}}/t_{p_{i_l}},\quad\text{for } l\in[u_S]. \tag{39}$$
###### Corollary 1.

Given a fixed set $S\subset[n]$ with $n-k+1\le|S|\le n$, we have

$$L_S^{CS}=\tau'_{p_1}+\sum_{l=1}^{u_S}\frac{1}{\sum_{j=1}^{l}\mu'_{p_j}}e^{\sum_{j=1}^{l}\mu'_{p_j}\tau'_{p_j}}\Bigl(e^{-\tau'_{p_l}\sum_{j=1}^{l}\mu'_{p_j}}-e^{-\tau'_{p_{l+1}}\sum_{j=1}^{l}\mu'_{p_j}}\Bigr), \tag{40}$$

where $\tau'_{p_{u_S+1}}\triangleq\infty$.

###### Proof.

According to the definitions of $\tau'_{p_l}$ and $\mu'_{p_l}$, we have

$$\prod_{l=1}^{u_S}\bigl(1-F_{p_l}(t/t_{p_l})\bigr)=\begin{cases}1,&\text{if } t<\tau'_{p_1},\\e^{\sum_{j=1}^{l}\mu'_{p_j}\tau'_{p_j}}e^{-\sum_{j=1}^{l}\mu'_{p_j}t},&\text{if } \tau'_{p_l}\le t<\tau'_{p_{l+1}},\ l\in[u_S],\end{cases} \tag{41}$$

where $\tau'_{p_{u_S+1}}\triangleq\infty$. Thus, it follows that

$$L_S^{CS}=\int_{0}^{\tau'_{p_1}}1\,dt+\sum_{l=1}^{u_S}\int_{\tau'_{p_l}}^{\tau'_{p_{l+1}}}e^{\sum_{j=1}^{l}\mu'_{p_j}\tau'_{p_j}}e^{-\sum_{j=1}^{l}\mu'_{p_j}t}\,dt=\tau'_{p_1}+\sum_{l=1}^{u_S}\frac{1}{\sum_{j=1}^{l}\mu'_{p_j}}e^{\sum_{j=1}^{l}\mu'_{p_j}\tau'_{p_j}}\Bigl(e^{-\tau'_{p_l}\sum_{j=1}^{l}\mu'_{p_j}}-e^{-\tau'_{p_{l+1}}\sum_{j=1}^{l}\mu'_{p_j}}\Bigr). \tag{42}$$

Overall, the average completion time of the CS scheme with shifted exponential CDFs is given by

$$\overline{T}_{CS}(r,k)=\sum_{i=n-k+1}^{n}(-1)^{n-k+i+1}\binom{i-1}{n-k}\sum_{S\subset[n]:|S|=i}\Biggl(\tau'_{p_1}+\sum_{l=1}^{u_S}\frac{1}{\sum_{j=1}^{l}\mu'_{p_j}}e^{\sum_{j=1}^{l}\mu'_{p_j}\tau'_{p_j}}\Bigl(e^{-\tau'_{p_l}\sum_{j=1}^{l}\mu'_{p_j}}-e^{-\tau'_{p_{l+1}}\sum_{j=1}^{l}\mu'_{p_j}}\Bigr)\Biggr). \tag{43}$$

The numerical evaluation and comparison of the above result will be presented in Section VI.

### Iv-B Staircase Scheduling (SS) Scheme

While CS seems to be a natural way of scheduling tasks to workers with unknown speeds, one can see that imposing the same computation direction across all the workers may not be ideal when the goal is to recover $k$ distinct computations at the master. Alternatively, here we propose the SS scheme, which introduces inverse computation orders at the workers.

The entries of the TO matrix of the SS scheme, $C_{SS}$, are given by, for $i\in[r]$ and $j\in[n]$,

$$C_{SS}(i,j)=g\bigl(j+(-1)^{j-1}(i-1)\bigr). \tag{44}$$

It follows that

$$C_{SS}=\begin{bmatrix}g(1)&g(2)&\dots&g(n)\\g(2)&g(1)&\dots&g(n+(-1)^{n-1})\\\vdots&\vdots&\ddots&\vdots\\g(r)&g(3-r)&\dots&g(n+(-1)^{n-1}(r-1))\end{bmatrix}. \tag{45}$$

We remark here that the main difference between the CS and SS schemes is that in the CS scheme (according to (22)) all the workers have the same step size and direction in their computations, while in the SS scheme (according to (44)) workers with odd and even indices have different directions (ascending and descending, respectively) in the order in which they carry out the computations assigned to them, but the same step size in their evaluations.
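The staircase TO matrix can be generated directly from (44). A sketch with our function names, where $g$ is again implemented via modular arithmetic:

```python
# Sketch of the staircase scheduling (SS) TO matrix in (44):
# odd-indexed workers scan the dataset in ascending order, even-indexed
# workers in descending order, with indices wrapped into [1, n] by g.

def g(l, n):
    return (l - 1) % n + 1  # wrap-around map of (23)

def ss_matrix(n, r):
    return [[g(j + (-1) ** (j - 1) * (i - 1), n) for j in range(1, n + 1)]
            for i in range(1, r + 1)]
```

For $n=6$ and $r=4$, worker 1 computes tasks $1,2,3,4$ (ascending) while worker 2 computes $2,1,6,5$ (descending), and each column still contains distinct entries.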

Assuming a linear scaling of the computation time as before, it can be verified that, for $j\in[n]$,

 Tg(j+(−1)j+i−1(i−1)),X