Heterogeneity-aware Gradient Coding for Straggler Tolerance

Gradient descent algorithms are widely used in machine learning. To deal with huge volumes of data, we consider implementing gradient descent in a distributed computing setting, where multiple workers compute the gradient over partial data and a master node aggregates their results to obtain the gradient over the whole data. However, the performance of this setting can be severely affected by straggler workers. Recently, coding-based approaches have been introduced to mitigate the straggler problem, but they are efficient only when the workers are homogeneous, i.e., have the same computation capabilities. In this paper, we consider heterogeneous workers, which are common in modern distributed systems. We propose a novel heterogeneity-aware gradient coding scheme that can not only tolerate a predetermined number of stragglers but also fully utilize the computation capabilities of heterogeneous workers. We show that this scheme is optimal when the computation capabilities of the workers are estimated accurately. A variant of this scheme is further proposed to improve performance when these estimates are less accurate. We implement our schemes for gradient-descent-based image classification on QingCloud clusters. Evaluation results show that our schemes can reduce the whole computation time by up to 3× compared with a state-of-the-art coding scheme.


I Introduction

With the rapid growth of data size, fast processing of big data becomes more and more important. Due to the saturation of Moore's law, distributed processing has been viewed as the primary way to break through the limits of computing power. Modern systems for distributed processing of big data, like MapReduce [1] and Apache Spark [2], usually adopt a master-slave architecture. In such an architecture, a master server divides the initial task into many small tasks and assigns them to several slave nodes (workers). These workers process the tasks in parallel and return their outcomes to the master upon completion.

In this distributed setting, the performance of the system is usually limited by delays or faults that occur while the master collects outcomes from the workers [3]. Delays or faults are usually caused by stragglers, i.e., workers that cannot return an outcome within a reasonable deadline. Stragglers arise mainly for two reasons: 1) transient fluctuations of resources in the cluster, e.g., fault occurrence [4, 5] and resource contention between processes, and 2) consistent heterogeneity of the cluster [6]. Due to the notable negative impact of stragglers on performance, many recent works have been proposed to mitigate them for different tasks [7, 8, 9]. In this paper, we focus on the task of gradient computation. The gradient is the derivative of the objective function and is the cornerstone of many optimization algorithms [10, 11]. For the gradient computation task, Tandon et al. [12] propose using coding to tolerate stragglers. In their framework, the gradient of a sample is computed by several workers, so that the master can recover it as long as it receives the update of any worker that participates in the computation for that sample. The essence of this gradient coding method is to improve straggler tolerance through data duplication. Though their method works efficiently for stragglers incurred by transient fluctuations, it can do nothing for stragglers caused by heterogeneity, because it does not take the computing capabilities of workers into account when designing the coding scheme. Another work [13] encodes the second moment of the data to reduce the computational overhead of encoding raw data. However, it is limited to the gradient of linear models and cannot be used in many domains, e.g., the training of DNNs.

Considering the shortcomings of existing methods, we seek to tolerate stragglers incurred by both causes, i.e., stragglers in heterogeneous clusters, so as to improve the processing efficiency of the distributed system. This is a non-trivial problem, because heterogeneity is very common in modern clusters [8, 14, 6]. We solve this problem by designing a solution that can both tolerate transient stragglers and fully utilize the computing resources of a heterogeneous cluster. To achieve this goal, we propose two heterogeneity-aware gradient coding methods that adaptively allocate data partitions to each worker according to its computing capability. In this way, every worker has a similar completion time, so the consistent stragglers incurred by heterogeneity are eliminated. Meanwhile, the transient stragglers are tolerated through coding.

To implement the heterogeneity-aware gradient coding scheme, data partitions are first allocated to the workers according to their processing speeds, and then we show how to construct the corresponding coding strategy. Experimental evaluations were done on popular deep learning tasks on several heterogeneous clusters of different scales. The results show that our methods improve the performance of deep learning tasks by up to 3× compared with traditional gradient coding methods.

Our contributions are summarized as follows:

  • Straggler tolerance in heterogeneous setups is of great importance, but is ignored by existing methods. We propose a new heterogeneity-aware gradient coding scheme that works efficiently in heterogeneous clusters while tolerating stragglers.

  • We theoretically show that our heterogeneity-aware gradient coding scheme is optimal for a cluster with accurately estimated computing capacity.

  • Considering the practical difficulty that the computing capacity of a running system is hard to measure accurately, we further propose a more efficient variant of the heterogeneity-aware gradient coding scheme.

  • We implement our coding schemes for gradient-based machine learning tasks on QingCloud clusters. Evaluation results show that our coding schemes can not only tolerate stragglers but also fully utilize the computing capabilities of the workers.

This paper is organized as follows. Related work on stragglers in distributed systems is presented in Section II. We then present the problem formulation in Section III. After that, we present our two heterogeneity-aware gradient coding schemes, the heter-aware and the group-based coding schemes. In Section VI, a wide range of evaluations on various large-scale heterogeneous clusters shows the efficiency of our coding schemes. Finally, conclusions are drawn in Section VII.

II Related Work

The straggler problem has a long history in parallel computing, and it has attracted more and more interest in the era of big data. Below, we first introduce methods for the straggler problem from specific to general settings, and then review the recently emerging coding-based methods for straggler mitigation.

Since distributed learning is the typical task that uses gradients, we first review work on stragglers in distributed learning systems. Due to the fault tolerance inherent in machine learning tasks, many methods start from the parallelization mechanism. Typical asynchronous parallel training algorithms such as TAP [15, 16] and SSP [17, 18, 19] were proposed to avoid stragglers during learning; the core idea of these methods is to improve hardware efficiency by sacrificing statistical efficiency (e.g., convergence accuracy and speed) [20]. Built on SSP, DynSSP [6] was proposed to improve the statistical efficiency of asynchronous learning by tuning learning rates. Though such parallel algorithms can reduce the impact of stragglers, they are hard to analyze, debug, and reproduce. Besides, the accuracy at convergence may not be optimal, as shown in [21]. Different from these works, we mitigate stragglers under the BSP distributed scheme, which preserves accuracy.
Another line of work for mitigating stragglers is load balancing, which applies to general tasks. Many works [22, 23, 24] rebalance workload allocation by work stealing in traditional parallel computing. Work stealing is a technique that reallocates tasks from busy cores to idle cores. However, this idea is not suitable for machine learning tasks, especially DNN training. One reason is that each iteration of DNN training is very short, lasting only a few seconds or less [25, 26, 27], so detecting stragglers and transferring workloads within an iteration is almost impossible. In this paper, we propose a new load balancing method that exploits a property of data-parallel processing tasks: the computational complexity of each task is proportional to its number of samples.

Recently, coding-theoretic methods have also been introduced into distributed computation to tolerate stragglers. The initial works [28, 29] target large-scale matrix multiplication. They encode the matrix to tolerate stragglers and design a coded shuffling algorithm to reduce the data shuffling traffic. An improvement in [30, 13] is to encode the second moment of the data for the linear regression problem, which reduces the computational overhead of encoding raw data.

The works [31, 32, 33] utilize polynomial interpolation to design coded computations that tolerate more stragglers under the same workload compared with traditional coding methods. Different from our model, however, all these algorithms are limited to linear models and cannot be adopted by a broad class of optimization problems; for example, this strict condition cannot be satisfied by current DNN models. A general coding method named gradient coding was proposed in [12]. Different from traditional works that encode the data directly, it encodes the gradients generated by the optimization algorithm, so the linear-model constraint is removed. Based on [12], [34] proposes reducing communication overhead using coding, but at the price of additional computing load incurred by the coding itself. Besides, neither coding method takes the computing capacity of workers into account, which wastes computing resources. Though [35] and [36] aim at reducing the computing load of coding methods, they do so at the cost of sacrificing optimization accuracy. Recognizing this, in this paper we propose a heterogeneity-aware gradient coding method for general optimization problems, which not only takes computing capacity into account but also preserves model accuracy.

III Problem Formulation

III-A The Framework

Consider a typical distributed learning system, as illustrated in Fig. 1, which consists of a master and a set of $n$ workers denoted by $W_1, \ldots, W_n$. A whole dataset $D$ is divided into $k$ equal-sized data partitions $D_1, \ldots, D_k$, i.e., $D = D_1 \cup \cdots \cup D_k$. The partial gradient over a data partition $D_j$ is denoted by $g_j$, which can be obtained by computation over $D_j$. The whole task of distributed computation over this learning system is to obtain the aggregated gradient $g = \sum_{j=1}^{k} g_j$.

Fig. 1: Distributed learning system with possibly heterogeneous workers, where small rectangles represent computing units. The main component of the task is that the server aggregates all partial gradients from the workers.

A direct approach is to allocate different data partitions to different workers. Each worker computes the partial gradients over the data partitions in hand and then sends the summation of these partial gradients to the master. After collecting all the summations from the workers, the master obtains the aggregated gradient by summing them up. However, when some worker is a straggler, the computation latency can increase significantly. Even worse, when some worker fails (e.g., a virtual machine breaks down), the whole task cannot be completed. In order to tolerate stragglers/failures, we consider the following general coding-based scheme. Initially, each worker $W_i$ is allocated a subset of data partitions $\mathcal{D}_i$, where different $\mathcal{D}_i$'s may overlap. Then $W_i$ computes all the corresponding partial gradients $\{g_j : D_j \in \mathcal{D}_i\}$. After this, $W_i$ encodes these gradients as $f_i$, where $f_i$ is the encoding function of $W_i$, and sends the result to the master. After receiving enough results from some set of workers, the master recovers the desired aggregated gradient immediately through a decoding function.

III-B Gradient Coding Strategy

Same as [12], we consider linear encoding functions, i.e., the coded result of worker $W_i$ is a linear combination of the partial gradients $g_1, \ldots, g_k$. Specifically, we can represent it as

$f_i = \sum_{j=1}^{k} b_{i,j}\, g_j,$

where the vector $b_i = (b_{i,1}, \ldots, b_{i,k})$, and its support, denoted by $\mathrm{supp}(b_i)$, which is the set of indices of non-zero entries of $b_i$, satisfies $\mathrm{supp}(b_i) = \{j : D_j \in \mathcal{D}_i\}$, i.e., the indices of the non-zero entries of $b_i$ show the allocation of data partitions to worker $W_i$. Let $B = [b_1, \ldots, b_n]^{\mathsf T} \in \mathbb{R}^{n \times k}$, which not only describes the allocation of data partitions to each worker, but also represents the encoding function of each worker. Henceforth, we will refer to $B$ as a gradient coding strategy.
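To make the encoding step concrete, the following Python sketch (not from the paper; the flattened-gradient representation and the helper name encode_worker are our assumptions) shows how worker $W_i$ would form its coded message from row $b_i$ of $B$:

import numpy as np

def encode_worker(i, B, partial_grads):
    # Worker W_i sends f_i = sum_j B[i, j] * g_j over its local partitions.
    # B: n x k coding matrix; row i is non-zero only on supp(b_i).
    # partial_grads: dict {j: g_j} with the partial gradients W_i computed.
    f_i = None
    for j in range(B.shape[1]):
        if B[i, j] != 0:                     # only partitions allocated to W_i
            contrib = B[i, j] * partial_grads[j]
            f_i = contrib if f_i is None else f_i + contrib
    return f_i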

We seek gradient coding strategies that are robust to any $s$ stragglers, with $s < n$. Same as [12], we assume that any straggler is a full straggler, i.e., it can be arbitrarily slow, to the extent of complete failure. Under this assumption, a sufficient and necessary condition for a gradient coding strategy to be robust to any $s$ stragglers has been shown in [12] as follows.

Lemma 1.

A gradient coding strategy $B$ is robust to any $s$ stragglers if and only if $B$ satisfies the following condition:

(Condition C1): for any subset $U \subseteq \{1, \ldots, n\}$ with $|U| = n - s$,

$\mathbf{1}_{1 \times k} \in \mathrm{span}\{b_i : i \in U\},$   (1)

where $\mathbf{1}_{1 \times k}$ is the all-one row vector, and $\mathrm{span}\{\cdot\}$ is the span of vectors.
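Condition (C1) can be checked directly, though only for small clusters, by enumerating all sets of $n-s$ surviving workers and testing whether the all-one vector lies in their row span. The sketch below (our own illustration, not part of the paper) does this with a least-squares test:

import itertools
import numpy as np

def satisfies_c1(B, s, tol=1e-8):
    # True iff, for every subset U of n - s rows of B,
    # the all-one vector lies in span{b_i : i in U}.
    n, k = B.shape
    ones = np.ones(k)
    for U in itertools.combinations(range(n), n - s):
        B_U = B[list(U), :]                          # (n - s) x k submatrix
        x, *_ = np.linalg.lstsq(B_U.T, ones, rcond=None)
        if np.linalg.norm(B_U.T @ x - ones) > tol:
            return False
    return True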

Given a coding strategy $B$ that satisfies condition (C1), a decoding strategy $A$ can be constructed correspondingly for all straggler patterns. Each row of $A$ corresponds to a specific straggler scenario: the master decodes by combining the coded gradients sent by the non-straggler workers of that scenario. Accordingly, the decoding function can also be a linear combination, i.e., the aggregated gradient is recovered as a weighted sum of the received coded results. Hence, the decoding strategy can be constructed by solving

$A B = \mathbf{1},$   (2)

where $\mathbf{1}$ is the matrix with all elements being 1. To reduce storage cost, the decoding matrix can be partially stored, specifically for the common straggler patterns. As for decoding vectors needed for irregular straggler patterns, they can be solved in real time; the time for solving a decoding vector can usually be ignored because $n$ and $s$ are usually small numbers.
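In practice the master need not precompute $A$ for every pattern: once a decodable set of workers has responded, the corresponding decoding vector can be solved on the fly. A possible sketch (our illustration; responses is assumed to map worker indices to their coded messages):

import numpy as np

def decode(B, responses):
    # responses: dict {i: f_i} from the workers that have finished.
    U = sorted(responses)
    B_U = B[U, :]
    ones = np.ones(B.shape[1])
    a, *_ = np.linalg.lstsq(B_U.T, ones, rcond=None)   # a^T B_U ~ 1^T
    if np.linalg.norm(B_U.T @ a - ones) > 1e-6:
        raise ValueError("cannot decode from this set of workers yet")
    return sum(a[t] * responses[i] for t, i in enumerate(U))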

III-C Problem Formulation

Besides the tolerance of stragglers, we are mainly concerned with the computation time of the whole task. We consider heterogeneous workers with different computation capabilities. For each worker $W_i$, let $\mu_i$ denote the number of partial gradients over data partitions that $W_i$ can compute per unit time when it is a non-straggler, which can be estimated by sampling. Thus, given a gradient coding strategy $B$, the computation time of worker $W_i$, denoted by $T_i(B)$, is given by

$T_i(B) = \frac{\|b_i\|_0}{\mu_i},$

where $\|b_i\|_0$ denotes the $\ell_0$-norm of $b_i$, or equivalently, the cardinality of $\mathrm{supp}(b_i)$. Without loss of generality, we assume that the workers are indexed in order of their throughputs.

Evidently, the computation time of the whole task under strategy $B$ depends on which workers are stragglers, referred to as the straggler pattern. For a considered straggler pattern $F$ (the set of stragglers), the computation time of the whole task under strategy $B$, denoted by $T(B, F)$, can be characterized as the minimum time $t$ such that the coded results received from the non-straggler workers that have finished by time $t$ suffice to decode the aggregated gradient.

For a gradient coding strategy $B$ that can tolerate up to $s$ stragglers, we evaluate its performance by the computation time of the whole task under $B$ in the worst case, which is denoted by $T(B)$ and is given by

$T(B) = \max_{F : |F| = s} T(B, F).$   (3)
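As a simple illustration (ours, not the paper's), the worst case in (3) can be estimated under the conservative assumption that the master waits for all $n-s$ non-stragglers of each pattern; this ignores the possibility of decoding from an earlier-finishing subset:

import itertools

def worst_case_time(loads, mus, s):
    # loads[i] = ||b_i||_0, mus[i] = throughput of worker i.
    n = len(loads)
    worst = 0.0
    for stragglers in itertools.combinations(range(n), s):
        t = max(loads[i] / mus[i] for i in range(n) if i not in stragglers)
        worst = max(worst, t)
    return worst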

Aiming at finding a gradient coding strategy with the best performance, we have the following optimization problem:

$\min_{B} \ T(B)$   (4)
s.t. $B$ is robust to any $s$ stragglers.

For ease of reading, the main notations used in this paper are summarized in Table I.

Symbol Definition
$n$ The number of workers
$k$ The number of data partitions
$s$ The number of stragglers
$W_i$ Worker $i$
$k_i$ The number of data partitions on worker $i$
$\mu_i$ The throughput of worker $i$
$A$ Decoding matrix
$B$ Coding matrix
$\mathbf{1}$ Matrix with all elements being 1
$[n]$ $\{1, \ldots, n\}$
$\mathrm{supp}(b)$ $\{j : b_j \neq 0\}$, where $b_j$ is the $j$-th element of vector $b$
$F$ $\{i : W_i$ is a straggler$\}$
$\mathcal{D}$ The set of all data partitions
$\mathcal{W}$ The set of all workers
$G$ Group composed of workers
$\mathcal{G}$ Set composed of groups
TABLE I: Symbols

IV Heterogeneity-aware Gradient Coding Strategy

In this section, we present our coding scheme for heterogeneous distributed systems in detail. First, we specify how to design the support of $B$ with load balance and straggler tolerance in mind; we solve this by designing a heterogeneity-aware data allocation scheme. After that, the construction process of $B$, which is the key to accurate decoding, is elaborated. Finally, we show that our coding strategy is optimal for problem (4).

IV-A The Design

We first show how to allocate data partitions to the workers, which gives the support structure of $B$, i.e., the positions of the non-zero elements in $B$.

In order to tolerate $s$ stragglers, each data partition has to be assigned to at least $s+1$ workers. In our design, each data partition is copied exactly $s+1$ times, and there are in total $(s+1)k$ copies of data partitions, i.e., $\sum_{i=1}^{n} k_i = (s+1)k$, where $k_i$ is the number of data partitions assigned to worker $W_i$. For load balancing, we set $k_i$ to be proportional to $\mu_i$, the computation rate of $W_i$. Hence, we have

$k_i = \frac{\mu_i}{\sum_{l=1}^{n} \mu_l} (s+1) k.$   (5)

Without loss of generality, here we assume that each $k_i$ given by (5) is an integer.

Once $k_1, \ldots, k_n$ are fixed, we assign the total $(s+1)k$ copies of data partitions to the workers in a cyclic manner. Specifically, the set of data partitions assigned to worker $W_i$ is given as

$\mathcal{D}_i = \left\{ D_{(j \bmod k) + 1} : \textstyle\sum_{l=1}^{i-1} k_l \le j < \sum_{l=1}^{i} k_l \right\},$   (6)

i.e., each worker takes the next $k_i$ copies along the cycle of partitions. It is straightforward to see that, for each data partition, there are exactly $s+1$ copies assigned to different workers. By denoting a non-zero entry by $*$, the support structure of worker $W_i$ is the row vector whose $j$-th entry is $*$ if $D_j \in \mathcal{D}_i$ and $0$ otherwise, and the support structure of $B$ can be written as the $n \times k$ pattern obtained by stacking these row vectors.   (7)
Example 1.

As an example, consider a small system of workers with normalized sampling throughputs. If there is $s = 1$ straggler, we can allocate the data partitions according to (5) and (6) and determine the support structure of $B$ accordingly.
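The allocation in (5) and (6) is easy to implement. The following sketch (ours; the rounding step is an assumption for the case where (5) is not exactly integral) computes the $k_i$ and the cyclic supports:

def allocate(mus, k, s):
    total = (s + 1) * k
    ks = [round(total * mu / sum(mus)) for mu in mus]    # Eq. (5), rounded
    supports, nxt = [], 0
    for k_i in ks:
        part = [(nxt + j) % k for j in range(k_i)]       # cyclic allocation, Eq. (6)
        supports.append(sorted(set(part)))
        nxt = (nxt + k_i) % k
    return ks, supports

When $\sum_i k_i = (s+1)k$ and every $k_i \le k$, each partition appears in exactly $s+1$ supports.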

Given the support structure of $B$, we now introduce how to construct $B$ such that it satisfies condition (C1). In our construction, an auxiliary matrix $H \in \mathbb{R}^{(s+1) \times n}$ is introduced, which satisfies the following properties.

  • (P1): any $s+1$ columns of $H$ are linearly independent.

  • (P2): for any submatrix $H'$ composed of $s$ columns of $H$ and any non-zero vector $x$ such that $x^{\mathsf T} H' = 0$, $x^{\mathsf T} \mathbf{1}_{s+1} \neq 0$.

The usefulness of such an $H$ is revealed by the following result.

Lemma 2.

For a matrix $H$ having properties (P1) and (P2), there exists a matrix $B$ with the support structure of (7) such that $H B = \mathbf{1}_{(s+1) \times k}$ and $B$ satisfies condition (C1).

Proof.

Our proof proceeds as follows. First, we construct a matrix $B$ with the support structure of (7) such that $H B = \mathbf{1}_{(s+1) \times k}$. Then, we show that $B$ satisfies condition (C1).

For each $j \in \{1, \ldots, k\}$, let $H_j$ be the submatrix of $H$ obtained by deleting all the $i$-th columns for which the $i$-th element of the $j$-th column of the support structure (7) is zero. Since each column of the support structure (7) has $s+1$ non-zero elements, $H_j$ has $s+1$ columns, which are linearly independent according to property (P1). Therefore, $H_j$ is non-singular and has an inverse, denoted by $H_j^{-1}$. Let

$c_j = H_j^{-1} \mathbf{1}_{s+1},$

and let $B$ be the matrix formed by embedding each $c_j$, $j \in \{1, \ldots, k\}$, into the $j$-th column of the support structure (7). The embedding assigns each value of $c_j$ to the position of the corresponding non-zero entry in that column. Evidently, $H B = \mathbf{1}_{(s+1) \times k}$.

Next we show that the constructed $B$ satisfies condition (C1). Let $b_1, \ldots, b_n$ be the rows of $B$. Consider an arbitrary subset $U \subseteq \{1, \ldots, n\}$ such that $|U| = n - s$. Let $H_{\bar U}$ be the submatrix composed of all the $i$-th columns of $H$ with $i \notin U$. Since $H_{\bar U}$ has $s$ columns while it has $s+1$ rows, there exists some non-zero vector $x$ such that $x^{\mathsf T} H_{\bar U} = 0$. Since $H$ satisfies property (P2), we have $x^{\mathsf T} \mathbf{1}_{s+1} \neq 0$. Hence,

$x^{\mathsf T} H B = x^{\mathsf T} \mathbf{1}_{(s+1) \times k} = (x^{\mathsf T} \mathbf{1}_{s+1})\, \mathbf{1}_{1 \times k}.$

Note that for $i \notin U$, the $i$-th entry of the vector $x^{\mathsf T} H$ is equal to 0 since $x^{\mathsf T} H_{\bar U} = 0$. We then have that $\mathbf{1}_{1 \times k}$ belongs to the span of $\{b_i : i \in U\}$. Therefore, $B$ satisfies condition (C1). The proof is accomplished. ∎

In the proof of Lemma 2, we give a construction of $B$ with the desired properties, provided that we have a matrix $H$ satisfying properties (P1) and (P2). Hence, all we need now is to construct such a matrix $H$. In the following, we show that a random choice of $H$ suffices, where each entry of $H$ is chosen from the interval $[0, 1]$ independently and uniformly at random.

Lemma 3.

For a matrix $H \in \mathbb{R}^{(s+1) \times n}$ where each entry is chosen from the interval $[0, 1]$ independently and uniformly at random, $H$ satisfies both properties (P1) and (P2) with probability 1.

Proof.

It has been shown in [12] that $H$ satisfies property (P1) with probability 1. So we only need to show that $H$ satisfies (P2) with probability 1.

Consider any submatrix $H'$ composed of $s$ columns of $H$. Let $h_1, \ldots, h_{s+1}$ be the rows of $H'$. Without loss of generality, we assume that the values of $h_1, \ldots, h_s$ have been exposed and they are linearly independent, which holds with probability 1, so that we focus on the randomness of $h_{s+1}$. Let $x = (x_1, \ldots, x_s, 1)$, which is unique, be such that

$x^{\mathsf T} H' = \sum_{i=1}^{s} x_i h_i + h_{s+1} = 0.$

We can check that

$(x_1, \ldots, x_s)$

is a continuous multivariate random variable. So the probability of $x^{\mathsf T} \mathbf{1}_{s+1} = 1 + \sum_{i=1}^{s} x_i \neq 0$ is 1. On the other hand, if $x^{\mathsf T} \mathbf{1}_{s+1} \neq 0$, then any non-zero vector $y$ such that $y^{\mathsf T} H' = 0$ is a non-zero multiple of $x$, and hence $y^{\mathsf T} \mathbf{1}_{s+1} \neq 0$. Therefore, property (P2) restricted to this $H'$ holds with probability 1. Since there are $\binom{n}{s}$ such submatrices $H'$, taking a union bound over them shows that property (P2) holds with probability 1. ∎

The algorithm for constructing $B$ is given in Alg. 1.

Input: $s$, $k$, $\{\mu_i\}_{i=1}^{n}$
Output: $B$

1:compute $k_1, \ldots, k_n$ by Eq. (5) and the cyclic allocation $\mathcal{D}_1, \ldots, \mathcal{D}_n$ by Eq. (6)
2:initialize $B \leftarrow \mathbf{0}_{n \times k}$
3:generate $H \in \mathbb{R}^{(s+1) \times n}$ with i.i.d. uniform entries
4:for $j$ in $\{1, \ldots, k\}$ do
5:     $U_j \leftarrow \{i : D_j \in \mathcal{D}_i\}$
6:     $H_j \leftarrow$ columns of $H$ indexed by $U_j$
7:     $c_j \leftarrow H_j^{-1} \mathbf{1}_{s+1}$
8:     embed $c_j$ into the rows $U_j$ of the $j$-th column of $B$
9:return $B$
Algorithm 1 Heter-aware Coding Scheme
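A compact numpy sketch of this construction (ours, not the authors' implementation; supports is the output of the allocation sketch above) is:

import numpy as np

def build_B(supports, k, s, seed=0):
    # supports[i] = list of partition indices allocated to worker i.
    n = len(supports)
    H = np.random.default_rng(seed).uniform(0.0, 1.0, size=(s + 1, n))
    B = np.zeros((n, k))
    for j in range(k):
        owners = [i for i in range(n) if j in supports[i]]   # the s + 1 holders of D_j
        c_j = np.linalg.solve(H[:, owners], np.ones(s + 1))  # H_j c_j = 1
        B[owners, j] = c_j
    return B

With probability 1 over the random $H$, the resulting $B$ passes the satisfies_c1 check sketched in Section III.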

As a consequence of Lemma 1, Lemma 2 and Lemma 3, we have the following theorem immediately.

Theorem 4.

The matrix $B$ constructed by Alg. 1 is robust to any $s$ stragglers with probability 1.

IV-B Optimality

Theorem 5.

The gradient coding strategy $B$ constructed by Alg. 1 is an optimal solution to problem (4) with probability 1.

Proof.

Let $B^\ast$ be an optimal gradient coding strategy, and let $b^\ast_i$ be the $i$-th row of $B^\ast$. If there exists some $i$ such that $\|b^\ast_i\|_0 / \mu_i > T(B^\ast)$, then, according to the definition of $T(B^\ast)$ (cf. Eq. (3)), the result of worker $W_i$ is useless for the earliest successful decoding whatever the straggler pattern is. Hence, we can remove the assignment of data partitions to worker $W_i$, which affects neither the straggler tolerance nor the computation time of the whole task; in other words, the modified strategy is still optimal. We can therefore conclude that there exists an optimal gradient coding strategy $B'$ such that

$\frac{\|b'_i\|_0}{\mu_i} \le T(B') \quad \text{for all } i,$

where $b'_i$ is the $i$-th row of $B'$. Now we have

$\sum_{i=1}^{n} \|b'_i\|_0 \le T(B') \sum_{i=1}^{n} \mu_i.$

On the other hand, in order to tolerate $s$ stragglers, each data partition has to be assigned to at least $s+1$ workers. This implies that

$\sum_{i=1}^{n} \|b'_i\|_0 \ge (s+1)k.$

Hence,

$T(B') \ge \frac{(s+1)k}{\sum_{i=1}^{n} \mu_i}.$

For our construction $B$, every worker completes its local task in time $k_i / \mu_i = (s+1)k / \sum_{l=1}^{n} \mu_l$ according to Eq. (5). Hence, $T(B) = (s+1)k / \sum_{l=1}^{n} \mu_l \le T(B')$, which implies that $B$ is optimal. ∎

V Group-based Coding Scheme

Based on the sampling throughput $\mu_i$ of each worker $W_i$, we proposed an optimal solution for problem (4) in the previous section. However, in a practical system, $\mu_i$ is hard to measure exactly because of small fluctuations at runtime, so the coding scheme can hardly achieve the optimum. In fact, we can further improve the performance by reducing the number of workers needed to recover the gradient. This is because (1) the fewer workers the master has to wait for, the smaller the recovery time tends to be, and (2) from Lemma 2, recovering the gradient from the $B$ constructed by Alg. 1 needs $n - s$ workers given $s$ stragglers. In what follows, we show that this number can be reduced by finding groups, where a group is a subset of workers that by itself can recover the gradient. Denoting a group by $G$ and the set of groups by $\mathcal{G}$, the following conditions are desired:

  • (G1): for all workers $W_a, W_b \in G$ with $a \neq b$, their sets of data partitions satisfy $\mathcal{D}_a \cap \mathcal{D}_b = \emptyset$ and $\bigcup_{W_i \in G} \mathcal{D}_i = \mathcal{D}$;

  • (G2): for all groups $G_a, G_b \in \mathcal{G}$ with $a \neq b$, $G_a \cap G_b = \emptyset$.

Input: $\{\mathcal{D}_i\}_{i=1}^{n}$
Output: $\mathcal{G}$

1:$\mathcal{G}$ = FindAllGroups($\emptyset$, $\emptyset$)
2:$\mathcal{G}$ = PruneGroups($\mathcal{G}$)
3:return $\mathcal{G}$
4:
5:function FindAllGroups($G$, $S$)  ▷ $G$: current group, $S = \bigcup_{W_i \in G} \mathcal{D}_i$
6:     initialize groups set $\mathcal{G} \leftarrow \emptyset$
7:     for $W_j$ not in $G$ do
8:         if $\mathcal{D}_j \cap S = \emptyset$ and $\mathcal{D}_j \cup S \subsetneq \mathcal{D}$ then
9:              $\mathcal{G}'$ = FindAllGroups($G \cup \{W_j\}$, $S \cup \mathcal{D}_j$)
10:              for subgroup in $\mathcal{G}'$ do
11:                  add subgroup to $\mathcal{G}$
12:         else if $\mathcal{D}_j \cap S = \emptyset$ and $\mathcal{D}_j \cup S = \mathcal{D}$ then
13:              add $G \cup \{W_j\}$ to $\mathcal{G}$
14:         else
15:              pass
16:     return $\mathcal{G}$
17:
18:function PruneGroups($\mathcal{G}$)
19:     while $\mathcal{G}$ does not satisfy condition (G2) do
20:         find $G \in \mathcal{G}$ that intersects the most other groups
21:         remove $G$ from $\mathcal{G}$
22:     return $\mathcal{G}$
Algorithm 2 Find Groups
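For intuition, the following Python sketch (ours; it follows the interpretation of conditions (G1) and (G2) given above and enumerates groups exhaustively, so it is only meant for small clusters) mirrors Alg. 2:

def find_groups(supports, k):
    all_parts = set(range(k))
    groups = []

    def extend(group, covered, start):
        for i in range(start, len(supports)):
            s_i = set(supports[i])
            if covered & s_i:                  # overlap violates (G1): skip
                continue
            new_cov = covered | s_i
            if new_cov == all_parts:
                groups.append(group + [i])     # found a complete group
            else:
                extend(group + [i], new_cov, i + 1)

    extend([], set(), 0)
    # prune greedily until groups are pairwise disjoint (condition (G2))
    while groups:
        overlaps = [sum(bool(set(g) & set(h)) for h in groups if h is not g)
                    for g in groups]
        if max(overlaps) == 0:
            break
        groups.pop(overlaps.index(max(overlaps)))
    return groups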

As shown in Alg. 2, all groups are first found in a recursive way so as to satisfy condition (G1), and then some groups are pruned so that condition (G2) is satisfied. After finding the groups, we simply set the non-zero elements of $B$ for the workers in groups to 1. Besides, for the workers not in any group, let $\bar B$ be the submatrix composed of all the $i$-th rows of $B$ where worker $W_i$ belongs to no group. Obviously, $B$ can be constructed as long as the submatrix $\bar B$ is solved, and $\bar B$ can be constructed by using Alg. 1 under $s$ stragglers.

Slightly different from the decoding function for the $B$ constructed by Alg. 1, the decoding matrix is constructed separately for workers in groups and workers not in groups. For the workers in each group $G$, we design the corresponding decoding vector as $a_G$ with entries $a_{G,i} = \mathbb{1}\{W_i \in G\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function. Obviously, we have

$\sum_{W_i \in G} b_i = \mathbf{1}_{1 \times k}.$   (8)
Fig. 2: Average time per iteration of different coding schemes running on Cluster-A with (a) $s=1$ and (b) $s=2$ stragglers. The stragglers are created artificially by adding delays to the workers. The results show that our proposed heter-aware and group-based gradient coding schemes perform best regardless of the delay.

Consequently, a decoding submatrix, denoted by $A_1$, is composed of the decoding vectors of all groups. As for the workers not in groups, the decoding submatrix $A_2$ is solved according to (2).
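The group path of the decoder is therefore trivial: because the grouped workers' coding vectors are 0/1 indicators over disjoint, covering partition sets, the master recovers the full gradient by simply summing the coded messages of any one complete group, with no linear system to solve. A minimal sketch (ours; group_responses is assumed to map the workers of one complete group to their coded messages):

def decode_from_group(group_responses):
    total = None
    for f in group_responses.values():
        total = f if total is None else total + f
    return total            # equals sum_j g_j by Eq. (8)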

The algorithm for constructing $B$ and solving the decoding matrix is shown in Alg. 3.

According to Theorem 4, we have the following theorem.

Theorem 6.

The matrix $B$ constructed by Alg. 3 is robust to any $s$ stragglers with probability 1.

Proof.

According to conditions (G1) and (G2) of the groups and Theorem 4, we know that the submatrix $\bar B$ is robust to $s$ stragglers with probability 1. Besides, every group can be used to recover the gradient, as shown in (8). Hence, $B$ is robust to $s$ stragglers with probability 1. ∎

From this theorem, we know that the group-based coding scheme is also an optimal solution to problem (4), because in the deterministic setting the computation time of each active worker is the same as that of the workers under the coding scheme of Alg. 1.

Example 2.

An example is shown by the following support structure of the workers. There are three groups, $G_1$, $G_2$, and $G_3$, each composed of several workers whose partition sets are disjoint and jointly cover all partitions. Later, the system prunes $G_3$ to satisfy condition (G2). For constructing $B$, all entries of the workers that are in groups are set to 1, and the remaining entries (for the workers not in any group) are solved by using Alg. 1.

Input: $\{\mathcal{D}_i\}_{i=1}^{n}$, $s$
Output: $B$, $A$

1:initialize $B$ with the support structure of (7)
2:$\mathcal{G}$ = FindGroups (Alg. 2)
3:for $G$ in $\mathcal{G}$ do
4:     set the non-zero entries of the rows of $B$ for workers in $G$ to 1
5:     construct the decoding vector $a_G$ as in (8)
6:solve $\bar B$ (the rows of workers not in any group) via Alg. 1
7:solve $A_2$ by (2)
8:$B$ = merge(rows of grouped workers, $\bar B$)
9:$A$ = merge($A_1$, $A_2$)
10:return $B$, $A$
Algorithm 3 Group-Detection Coding Scheme

VI Performance Evaluations

In this section, experiments are presented to show the results of our coding schemes. Our coding schemes are mainly compared with two schemes: 1) the naive scheme, in which the whole dataset is divided uniformly across the workers and the server makes an update step only after all workers complete; and 2) the cyclic coding scheme [12], which uniformly divides the dataset into data partitions, makes $s+1$ copies of each data partition, and assigns an equal number of partitions to every worker. We did not implement the fractional repetition scheme or the partial coding scheme of [12]: the fractional repetition scheme not only has the strong limitation of requiring the number of workers to be divisible by $s+1$, but also performs comparably to the cyclic coding scheme; as for the partial coding scheme, it makes the strong assumption that the slowest worker is at most a constant factor slower than the fastest worker, so it is unable to tolerate corrupted workers.

Fig. 3: Average time per iteration on different clusters ((a) Cluster-B, (b) Cluster-C, (c) Cluster-D). Our coding schemes perform best on all clusters with different configurations.
number of vCPUs Cluster-A Cluster-B Cluster-C Cluster-D
2-vCPUs 2 2 1 0
4-vCPUs 2 4 4 4
8-vCPUs 3 8 10 20
12-vCPUs 1 0 12 18
16-vCPUs 0 2 5 16
TABLE II: Cluster Configurations

Experiment Setup. Based on QingCloud [37], we evaluate on various heterogeneous clusters with scales ranging from 8 workers to 58 workers. We design four clusters, Cluster-A, Cluster-B, Cluster-C, and Cluster-D, as shown in Table II. This design mainly aims to cover various scales and degrees of heterogeneity to show the generality of our coding schemes. The instance type is the performance type, and the operating system of all nodes is 64-bit CentOS 7.1. PyTorch [38] is adopted as the platform.

Workload. Two typical image classification datasets, Cifar10 [39] and ImageNet [40], are adopted. Cifar10 is composed of 50,000 training images, on which we train AlexNet [39], and ImageNet consists of over 1 million images, on which we train ResNet34 [41].

Metrics. System efficiency is measured by running time, which shows the overall efficiency of the distributed learning system. It consists of statistical efficiency and hardware efficiency. Statistical efficiency measures the convergence rate of the learning algorithm and can be shown by the learning curve. Hardware efficiency represents how efficiently the CPU resources are used.

VI-A Experimental Results

VI-A1 Robustness to Stragglers

To simulate faults, we add extra delay to randomly chosen workers on Cluster-A to show both the performance improvement and the straggler tolerance of our coding schemes. We artificially generate one and two stragglers, as shown in Fig. 2(a) and Fig. 2(b). As expected, the running time of the naive scheme increases with the delay, and it cannot run at all once workers fail. Correspondingly, all coding schemes are configured for $s=1$ straggler in Fig. 2(a) and $s=2$ stragglers in Fig. 2(b). Different from the naive distributed learning algorithm, the cyclic scheme can tolerate stragglers, so its running time changes relatively little under different delays, as shown in Fig. 2(a). However, its running time still increases as the delay grows. This is mainly because the performance of the cyclic scheme is limited by the workers with low computing capacity: it approaches the performance of the slow workers as the delay increases, until it reaches a lower bound when the delay is infinite (i.e., faults take place). Compared with these two schemes, both our heter-aware and group-based coding schemes are robust to stragglers, and their running times remain almost unchanged, as shown in Fig. 2(a) and Fig. 2(b). When a fault takes place, our heter-aware coding scheme even achieves up to 3× speedup compared with the cyclic coding scheme because of its higher computing resource usage.
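For reproducibility of this kind of experiment, straggler injection can be as simple as sleeping inside the worker loop before the coded message is returned; the sketch below (ours; it reuses the encode_worker sketch from Section III, and the surrounding training loop is assumed) illustrates the idea:

import time

def worker_step(i, B, partial_grads, extra_delay=0.0, is_straggler=False):
    if is_straggler:
        time.sleep(extra_delay)      # simulate a slow or faulty worker
    return encode_worker(i, B, partial_grads)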

Fig. 4: Training loss curves of different learning schemes on Cluster-C. The group-based coding scheme has the best convergence efficiency, followed by the heter-aware coding scheme. The cyclic coding scheme is only slightly more efficient than the naive learning method because its workload allocation does not adapt to worker capacity. SSP performs worst in such a heterogeneous setting due to consistent stragglers and a poor convergence rate.

VI-A2 Efficiency under Different Clusters

To show the generality and efficiency of our coding schemes, we extend the experiments to a wider range of clusters with different scales and computing configurations, namely Cluster-B, Cluster-C, and Cluster-D. The results are shown in Fig. 3. Clearly, the heter-aware and group-based coding schemes achieve better performance than the other methods on every cluster configuration. In contrast, the traditional cyclic coding scheme can even worsen performance, because it aggravates the straggler problem by allocating the same workload to workers with different computing capacities.

Besides, one notable advantage of coding-based methods is their better statistical efficiency from using BSP. This does not hold for asynchronous learning algorithms, as discussed in depth in [21]. We therefore validate the efficiency of our learning method against SSP, a notably efficient asynchronous distributed learning algorithm. The result is shown in Fig. 4. Due to the heterogeneous computing capacities of the workers, SSP reaches the staleness threshold at nearly every step, so its synchronization overhead is similar to that of the naive BSP algorithm. Besides, the master receives unbalanced contributions from different data parts when updating the parameters, due to the discrepancy among workers, so SSP has a lower convergence rate than BSP. Consequently, our coding schemes converge more smoothly and faster than SSP, as shown in Fig. 4.

Fig. 5: Computing resource usage of different coding schemes. The group-based coding scheme achieves the best computing resource usage among all coding schemes.

Finally, we discuss the hardware efficiency of our coding schemes, using computing resource usage as the metric, calculated as an average per iteration.

As we can see in Fig. 5, the naive scheme has a low resource usage. This is caused by the workers with low computing capacity and by many other factors, e.g., background interfering processes and network fluctuations. The cyclic coding scheme mitigates this problem by discarding stragglers. However, it is still limited by the unbalanced distribution of computing resources. Our heter-aware and group-based coding schemes address both problems and achieve high resource usage. Though about half of the resources are still idle due to communication overhead, this can be addressed by combining the technique proposed in [42], which codes gradients layer by layer.

VII Conclusion

To tolerate stragglers and take full advantage of computing resources, we propose two new coding schemes in this paper: the heter-aware and the group-based coding schemes. The traditional coding methods proposed in [12] can efficiently mitigate stragglers, especially for fault tolerance, but their uniform data allocation mechanism leads to poor performance in heterogeneous clusters. Considering this, our coding schemes take both stragglers and heterogeneity into account: they first allocate data partitions to workers according to their processing speeds and then design the corresponding coding strategy. Evaluations show that our coding schemes achieve up to 3× speedup compared with the cyclic coding scheme.

References