I Introduction
With the rapid growth of data size, fast processing of big data becomes increasingly important. Due to the saturation of Moore's law, distributed processing has been viewed as the primary method for breaking through the limits of computing power. Modern systems for distributed processing of big data, such as MapReduce [1] and Apache Spark [2], usually adopt a master-slave architecture. In such an architecture, a master server divides the initial task into many small tasks and assigns them to several slave nodes (workers). These workers process the tasks in parallel and return the outcomes to the master upon completion.
In this distributed setting, the performance of the system is usually limited by delays or faults as the master collects outcomes from the workers [3]. Delays or faults are usually incurred by stragglers, i.e., workers that cannot return an outcome within a reasonable deadline. Stragglers are mainly caused by two factors: 1) transient fluctuation of resources in the cluster, e.g., fault occurrences [4, 5] and resource contention between processes, and 2) consistent heterogeneity of clusters [6]. Due to the notable negative impact of stragglers on performance, many recent works have been proposed to mitigate them for different tasks [7, 8, 9]. In this paper, we focus on the task of gradient computing. The gradient is the derivative of the objective function and is of great importance as the cornerstone of many optimization algorithms [10, 11]. For the gradient computing task, Tandon et al. [12] propose a coding method to tolerate stragglers. In their framework, the gradient over a data partition is computed by several workers, so that the master can recover it as long as it receives the update of any worker participating in that partition's computation. The essence of this gradient coding method is to improve straggler tolerance through data duplication. Though their method works efficiently for stragglers incurred by transient fluctuation, it can do nothing for stragglers caused by heterogeneity, because it does not take the computing capabilities of workers into account when designing the coding scheme. Another work [13] encodes the second moment of the data to reduce the computational overhead of encoding the raw data. However, it is limited to the gradient of linear models, which cannot be used in many domains, e.g., the training of DNNs.
Considering the insufficiencies of existing methods, we seek to tolerate stragglers incurred by both causes, i.e., stragglers in heterogeneous clusters, so that the processing efficiency of distributed systems can be improved. This is a nontrivial problem, because heterogeneity is very common in modern clusters [8, 14, 6]. We solve it by designing a solution that both tolerates transient stragglers and makes full use of the computing resources in a heterogeneous cluster. To achieve this goal, we propose two heterogeneity-aware gradient coding methods that adaptively allocate data partitions to each worker according to its computing capability. In this way, all workers have similar completion times, so that the consistent stragglers incurred by heterogeneity are eliminated; the transient stragglers, in turn, are eliminated by using coding theory.
To implement the heterogeneity-aware gradient coding scheme, data partitions are first allocated to each worker according to its processing speed, and then we show how to construct the coding strategy. Experimental evaluations were conducted on popular deep learning tasks over several heterogeneous clusters of different scales. Results show that our methods improve the performance of deep learning tasks compared to traditional gradient coding methods. Our contributions are summarized as follows:

Straggler tolerance in heterogeneous setups is of great importance but is ignored by existing methods. We propose a new heterogeneity-aware gradient coding scheme that works efficiently in heterogeneous clusters while tolerating stragglers.

We theoretically show that our heterogeneity-aware gradient coding scheme is optimal for a cluster whose computing capacities are accurately estimated.

Considering the practicalities of a running system, where the computing capacity is hard to measure accurately, we further propose a more efficient variant of the heterogeneity-aware gradient coding scheme.

We evaluate our coding schemes on gradient-based machine learning tasks on QingCloud clusters. Evaluation results show that our coding schemes not only tolerate stragglers but also make full use of the computing capabilities of the workers.
This paper is organized as follows. Related work on stragglers in distributed systems is presented in Section II. Then, we formulate the problem in Section III. After that, we present our two heterogeneity-aware gradient coding schemes, the heter-aware and the group-based coding scheme. In Section VI, a wide range of evaluations is performed on various large-scale heterogeneous clusters to show the efficiency of our coding schemes. Finally, conclusions are drawn in Section VII.
II. Related Work
The straggler problem has a long history in parallel computing, and it has attracted growing interest with the arrival of the big data era. Below, we first introduce methods for the straggler problem from specific to general, and then review recently emerging coded methods for straggler mitigation.
Since distributed learning is the typical gradient-based task, we first review related work on stragglers in distributed learning systems. Due to the fault-tolerance property inherent in machine learning tasks, many methods start from the parallel mechanism. Typical asynchronous parallel training algorithms, including TAP [15, 16] and SSP [17, 18, 19], were proposed to avoid stragglers in learning steps; the core idea of these methods is to improve hardware efficiency by sacrificing statistical efficiency (e.g., convergence accuracy and speed) [20]. Building on SSP, DynSSP [6] was proposed to improve the statistical efficiency of asynchronous learning by tuning learning rates. Though such parallel algorithms can reduce the effect of stragglers, they are hard to analyze, debug, and reproduce. Besides, the converged accuracy cannot reach the optimum, as shown in [21]. Different from these works, we mitigate stragglers for the BSP distributed scheme, which preserves accuracy.
Another line of work for mitigating stragglers is load balancing, which applies to general tasks. Many works [22, 23, 24] try to rebalance workload allocation using work stealing in traditional parallel computing. Work stealing is a technique that reallocates tasks from busy cores to idle cores. However, this idea is not suitable for machine learning tasks, especially DNN training. One reason is that each iteration of DNN training is very short, lasting only a few seconds or less [25, 26, 27], so that detecting stragglers and transferring workloads in time is almost impossible. In this paper, we propose a new load balancing method that exploits a property of data-parallel processing tasks: the computing complexity of each task is proportional to its number of samples.
Recently, coding-theory-based methods were also introduced into distributed computation to tolerate stragglers. The initial works [28, 29] aim at large-scale matrix multiplication; they encode the matrix to tolerate stragglers and design a coded shuffling algorithm to reduce the data shuffling traffic. An improvement in [30, 13] is to encode the second moment of the data for the linear regression problem, reducing the computational overhead of encoding the raw data. [31, 32, 33] utilize polynomial interpolation to design coded computation that tolerates more stragglers under the same workload compared to traditional coding methods. Different from our model, all these algorithms are limited to linear models, and thus cannot be adopted by a broad class of optimization problems; for example, this strict condition cannot be satisfied by current DNN models. A general coding method named gradient coding was proposed in [12]. Different from traditional works that encode the data directly, they encode the gradients generated by the optimization algorithm, so that the linear-model constraint can be removed. Based on [12], [34] proposes reducing communication overhead by a coding method, but further increases the computing load incurred by coding. Besides, neither of these coding methods takes the computing capacity of workers into account, which wastes computing resources. Though [35] and [36] aim at reducing the computing load of coding methods, they come at the cost of sacrificing optimization accuracy. Recognizing this, in this paper we propose a heterogeneity-aware gradient coding method for general optimization problems that not only takes computing capacity into account but also preserves model accuracy.

III. Problem Formulation
III-A The Framework
Consider a typical distributed learning system, as illustrated in Fig. 1, which consists of a master and a set of $n$ workers. The whole dataset is divided into $k$ equal-sized data partitions. The partial gradient over a data partition can be obtained by computation over that partition. The whole task of distributed computation over this learning system is to obtain the aggregated gradient, i.e., the sum of the partial gradients over all $k$ partitions.
A direct approach is to allocate different data partitions to different workers. Each worker computes the partial gradients over the data partitions at hand, and then sends the summation of these partial gradients to the master. After collecting all the summations from the workers, the master obtains the aggregated gradient by summing them up. However, when there exist stragglers, the computation latency can be significantly increased. Even worse, when some worker fails (e.g., a virtual machine breaks down), the whole task cannot be completed. In order to tolerate stragglers/failures, we consider the following general coding-based scheme. Initially, each worker is allocated a subset of data partitions, where the subsets of different workers may overlap. Each worker then computes all the corresponding partial gradients, encodes them with its encoding function, and sends the result to the master. After receiving enough results from some workers, the master recovers the desired aggregated gradient immediately via a decoding function.
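To make the scheme concrete, here is a minimal Python simulation of such a coding-based scheme with 3 workers tolerating 1 straggler. The coding matrix is the well-known 3-worker example from [12]; all names here are illustrative sketches, not part of this paper's own construction.

```python
# Toy simulation of the coding-based scheme: 3 workers, 1 straggler.
# Partial "gradients" are scalars here for simplicity.
# Coding matrix B from the classic n = 3, s = 1 example in [12]:
#   worker 0 sends g0/2 + g1, worker 1 sends g1 - g2, worker 2 sends g0/2 + g2.
B = [
    [0.5, 1.0, 0.0],
    [0.0, 1.0, -1.0],
    [0.5, 0.0, 1.0],
]

# Decoding coefficients for every pair of surviving workers, chosen so
# that the combination of their rows of B equals the all-ones vector.
DECODERS = {
    (0, 1): [2.0, -1.0],
    (0, 2): [1.0, 1.0],
    (1, 2): [1.0, 2.0],
}

def run(gradients, straggler):
    # each worker encodes its locally computed partial gradients
    coded = [sum(b * g for b, g in zip(row, gradients)) for row in B]
    # the master only waits for the two surviving workers
    survivors = tuple(i for i in range(3) if i != straggler)
    a = DECODERS[survivors]
    # linear decoding recovers the aggregated gradient g0 + g1 + g2
    return sum(ai * coded[i] for ai, i in zip(a, survivors))
```

Whichever single worker straggles, `run` returns the full sum of the three partial gradients, which is exactly the robustness property the scheme is after.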
III-B Gradient Coding Strategy
As in [12], we consider linear encoding functions, i.e., the coded result of each worker $W_i$ is a linear combination of the partial gradients, with coefficient vector $b_i \in \mathbb{R}^k$. The support of $b_i$, denoted $\mathrm{supp}(b_i)$, is the set of indices of nonzero entries of $b_i$; it indicates the allocation of data partitions to worker $W_i$. Let $B = [b_1^T, \dots, b_n^T]^T$, which not only describes the allocation of data partitions to each worker but also represents the encoding function of each worker. Henceforth, we refer to $B$ as a gradient coding strategy. We seek gradient coding strategies that are robust to any $s$ stragglers. As in [12], we assume that any straggler is a full straggler, i.e., it can be arbitrarily slow, to the extent of complete failure. Under this assumption, a sufficient and necessary condition for a gradient coding strategy to be robust to any $s$ stragglers was shown in [12] as follows.
Lemma 1.
A gradient coding strategy $B$ is robust to any $s$ stragglers if and only if $B$ satisfies the following condition:
(Condition 1): for any subset $S \subseteq \{1,\dots,n\}$ with $|S| = n - s$,
(1)  $\mathbf{1} \in \operatorname{span}\{b_i : i \in S\}$
where $\mathbf{1}$ is the all-ones vector and $\operatorname{span}\{\cdot\}$ is the span of the vectors.
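Condition 1 can be checked mechanically for a candidate coding matrix. The sketch below is an illustration rather than the paper's own algorithm; the helper names `rank` and `robust` are hypothetical. It tests whether the all-ones vector lies in the span of every set of $n-s$ rows of $B$, by checking that appending the all-ones vector never raises the rank.

```python
from itertools import combinations

def rank(rows, tol=1e-9):
    # matrix rank via Gauss-Jordan elimination (enough for a small check)
    m = [list(r) for r in rows]
    rk = 0
    for c in range(len(m[0])):
        piv = next((i for i in range(rk, len(m)) if abs(m[i][c]) > tol), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk:
                f = m[i][c] / m[rk][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def robust(B, s):
    # Condition 1: for every set of n - s surviving rows, the all-ones
    # vector lies in their span, i.e. appending it must not raise the rank.
    n, k = len(B), len(B[0])
    ones = [1.0] * k
    return all(rank([B[i] for i in surv] + [ones]) ==
               rank([B[i] for i in surv])
               for surv in combinations(range(n), n - s))
```

For instance, the 3-worker coding matrix of [12] passes this check for one straggler, while a plain identity allocation (each partition on one worker) fails it.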
Given a coding strategy $B$ that satisfies Condition 1, the corresponding decoding strategy can be obtained for all straggler patterns. Each row of the decoding matrix $A$ corresponds to a specific straggler pattern, and the master decodes by using the coded gradients sent by the surviving workers of that pattern. Accordingly, the decoding function is also a linear combination. Hence, the decoding strategy can be constructed by solving
(2)  $AB = \mathbf{1}$
To reduce storage cost, the decoding matrix can be partially stored, specifically for common straggler patterns. For irregular straggler patterns, the decoding vectors can be solved in real time; the time for solving a decoding vector can usually be ignored, since the relevant dimensions are small.
III-C Problem Formulation
Besides straggler tolerance, we are mainly concerned with the computation time of the whole task. We consider heterogeneous workers with different computation capabilities. For each worker $W_i$, let $\mu_i$ denote the number of partial gradients over data partitions that $W_i$ can compute per unit time when it is a non-straggler, which can be estimated by sampling. Thus, given a gradient coding strategy $B$, the computation time of worker $W_i$ is $t_i = \|b_i\|_0 / \mu_i$, where $\|b_i\|_0$ denotes the $\ell_0$ norm of $b_i$, or equivalently, the cardinality of $\mathrm{supp}(b_i)$. Without loss of generality, we assume the workers are indexed in order of throughput.
Evidently, the computation time of the whole task under a given strategy depends on which workers are stragglers, which is referred to as the straggler pattern. For a considered straggler pattern, the computation time of the whole task under the strategy can be characterized as
where is the minimum value of such that
For a gradient coding strategy that can tolerate up to $s$ stragglers, we evaluate its performance by the worst-case computation time of the whole task, which is given by
(3) 
Aiming at finding a gradient coding strategy with the best performance, we have the following optimization problem:
(4)  
s.t. 
For ease of reading, the main notations used in this paper are summarized in Table I.
Symbol  Definition

$n$  The number of workers
$k$  The number of data partitions
$s$  The number of stragglers
$W_i$  Worker $i$
$k_i$  The number of data partitions on worker $W_i$
$\mu_i$  The throughput of worker $W_i$
$A$  Decoding matrix
$B$  Coding matrix
$\mathbf{1}$  Matrix with all elements being 1
$[n]$  $\{1, \dots, n\}$
$\mathrm{supp}(x)$  $\{i : x_i \neq 0\}$, where $x_i$ is the $i$th element of vector $x$
$\mathcal{S}$  $\{i :$ worker $W_i$ is a straggler$\}$
$\mathcal{D}$  The set of all data partitions
$\mathcal{W}$  The set of all workers
$G_i$  Group composed of workers
$\mathcal{G}$  Group set composed of groups
IV. Heterogeneity-aware Gradient Coding Strategy
In this section, we present our coding scheme for heterogeneous distributed systems in detail. First, we specify how to design the support of $B$ with load balance and straggler tolerance in mind; we solve this by designing a heterogeneity-aware data allocation scheme. After that, we elaborate the construction process of $B$, which is the key to accurate decoding. Finally, we show that our coding strategy is optimal for problem (4).
IV-A The Design
We first show how to allocate data partitions to the workers, which gives the support structure of $B$, i.e., the positions of the nonzero elements in $B$.
In order to tolerate $s$ stragglers, each data partition has to be assigned to at least $s+1$ workers. In our design, each data partition is copied exactly $s+1$ times, so there are in total $(s+1)k$ copies of data partitions, i.e., $\sum_{i=1}^{n} k_i = (s+1)k$, where $k_i$ is the number of data partitions assigned to worker $W_i$. For load balancing, we set $k_i$ to be proportional to $\mu_i$, the computation rate of $W_i$. Hence, we have
(5)  $k_i = (s+1)k \cdot \mu_i / \sum_{j=1}^{n} \mu_j$
Without loss of generality, here we assume that each $k_i$ is an integer.
Once the $k_i$ are fixed, we assign the total $(s+1)k$ copies of data partitions to the workers in a cyclic manner. Specifically, the set of data partitions assigned to worker $W_i$ is given as
(6)  $\mathcal{D}_i = \{ D_{(l \bmod k)+1} : l = \sum_{j<i} k_j, \dots, \sum_{j<i} k_j + k_i - 1 \}$
It is straightforward to see that, for each data partition, there are exactly $s+1$ copies assigned to different workers. By denoting a nonzero entry as $\ast$, the support structure of worker $W_i$ is the vector whose $j$th entry is $\ast$ if $D_j \in \mathcal{D}_i$ and $0$ otherwise, and the support structure of $B$ can be written as
(7)  the $n \times k$ pattern obtained by stacking these vectors.
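The allocation steps above can be sketched in a few lines of Python. This is an illustration of the proportional, cyclic idea of Eqs. (5)-(6), not the paper's exact algorithm; the helper name `allocate` is hypothetical, exact proportions are rounded by largest remainder, and we assume each worker's share does not exceed $k$.

```python
def allocate(k, s, speeds):
    """Heterogeneity-aware cyclic allocation sketch.

    k: number of data partitions, s: stragglers to tolerate,
    speeds: estimated throughput of each worker.
    Returns the list of partition indices assigned to each worker.
    """
    total = (s + 1) * k                       # total copies to hand out
    shares = [total * v / sum(speeds) for v in speeds]
    counts = [int(x) for x in shares]         # k_i proportional to speed
    # largest-remainder rounding so the counts sum to (s + 1) * k
    leftovers = sorted(range(len(speeds)),
                       key=lambda i: shares[i] - counts[i], reverse=True)
    for i in leftovers[:total - sum(counts)]:
        counts[i] += 1
    # hand out consecutive copies cyclically: every partition then ends
    # up on exactly s + 1 workers
    assign, start = [], 0
    for c in counts:
        assign.append([(start + j) % k for j in range(c)])
        start += c
    return assign
```

For example, with $k = 4$ partitions, $s = 1$, and speeds $(1, 1, 2)$, the twice-as-fast worker receives twice as many partitions, and each partition appears on exactly two workers.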
Example 1.
As an example, consider a system of workers with given normalized sampling throughputs. If there is one straggler, we can allocate the data partitions and determine the support structure of $B$ accordingly.
Given the support structure of $B$, we now introduce how to construct $B$ such that it satisfies condition (C1). In our construction, an auxiliary matrix $H$ is introduced, which satisfies the following properties.

(P1): any $s+1$ columns of $H$ are linearly independent.

(P2): for any submatrix composed by columns of and any nonzero vector such that , .
The usefulness of such a matrix is revealed by the following result.
Lemma 2.
For a matrix $H$ having properties (P1) and (P2), there exists a matrix $B$ with a support structure of (7) such that $B$ satisfies condition (C1).
Proof.
Our proof proceeds as follows. First, we construct a matrix $B$ with a support structure of (7). Then, we show that this $B$ satisfies condition (C1).
For each , let be the submatrix of by deleting all the th columns where the th element of the th column of the support structure (7) is zero. Since each column of the support structure (7) has nonzero elements, has columns which are linearly independent according to property (P1). Therefore, is nonsingular, and has an inverse which is denoted by . Let
and be the matrix formed by embedding each , into the th column of the support structure (7). The embedding process is to assign each value in to according to the position presented in . Evidently, .
Next we show that the constructed satisfies condition C1. Let be the rows of . Consider an arbitrary subset such that . Let be the submatrix composed by all the th columns of where . Since has columns while it has rows, there exists some nonzero vector such that . Since satisfies property (P2), we have . Hence,
Note that for the workers outside the considered subset, the corresponding entries of the resulting vector are equal to 0. We then have that the all-ones vector belongs to the span of the rows indexed by the subset. Therefore, $B$ satisfies condition (C1). This completes the proof. ∎
In the proof of Lemma 2, we gave a construction of $B$ with the desired properties, provided a matrix $H$ satisfying properties (P1) and (P2). Hence, all we need now is to construct such a matrix $H$. In the following, we show that a random choice of $H$ suffices, where each entry of $H$ is chosen independently and uniformly at random from an interval.
Lemma 3.
For a matrix $H$ where each entry is chosen independently and uniformly at random from an interval, $H$ satisfies both properties (P1) and (P2) with probability 1.
Proof.
It has been shown in [12] that such an $H$ satisfies property (P1) with probability 1, so we only need to show that $H$ satisfies (P2) with probability 1.
Consider any submatrix composed by columns of . Let be the rows of . Without loss of generality, we assume that the values of have been exposed and they are independent which holds with probability 1, so that we focus on the randomness of . Let , which is unique, such that
We can check that
is a continuous multivariate random variable. So the probability of
is 1. On the other hand, if , then for any nonzero vector such that , and . Therefore, . This implies that the property (P2) restricted to the holds with probability 1. Since there are such , taking a union bound over them shows that property (P2) holds with probability 1. ∎

The algorithm for constructing the coding matrix is given in Alg. 1.
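Property (P1) of a random $H$ is easy to verify numerically. The standalone sketch below follows my reading of the construction, in which $H$ has $s+1$ rows and $n$ columns; the helper name `rank` is hypothetical. It draws a random $H$ and checks that every choice of $s+1$ columns is linearly independent, which happens with probability 1 for a continuous distribution.

```python
import random
from itertools import combinations

def rank(rows, tol=1e-9):
    # matrix rank via Gauss-Jordan elimination
    m = [list(r) for r in rows]
    rk = 0
    for c in range(len(m[0])):
        piv = next((i for i in range(rk, len(m)) if abs(m[i][c]) > tol), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk:
                f = m[i][c] / m[rk][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

random.seed(0)
s, n = 2, 5
# assumed shape: H has s + 1 rows and n columns, entries i.i.d. uniform
H = [[random.random() for _ in range(n)] for _ in range(s + 1)]

# (P1): every choice of s + 1 columns of H has full rank
for cols in combinations(range(n), s + 1):
    sub = [[H[r][c] for c in cols] for r in range(s + 1)]
    assert rank(sub) == s + 1
```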
Theorem 4.
The matrix $B$ constructed by Alg. 1 is robust to any $s$ stragglers with probability 1.
IV-B Optimality
Theorem 5.
Proof.
Let be an optimal gradient coding strategy. Let be the th row of . If there exists some such that , then according to the definition of (cf. Eq. (3)), the result of worker is useless for earliest successful decoding whatever the straggler pattern is. Hence, we can remove the assignment of data partitions to worker , which does not affect the straggler tolerance or the computation time of the whole task. In other words, this is still an optimal gradient coding strategy. Hence, we can conclude that there exists an optimal gradient coding strategy such that
where is the th row of . Now we have
On the other hand, in order to tolerate straggler, each data partition has to be assigned to at least workers. This implies that
Hence,
For our construction , we can see that every worker completes its local task in time according to Eq. (2). Hence, , which implies that is optimal. ∎
V. Group-based Coding Scheme
Based on the sampled throughput of each worker, we proposed an optimal solution to problem (4) in the section above. However, in a practical system the throughput is hard to measure exactly because of small fluctuations at runtime, so the coding scheme can hardly achieve the optimum. In fact, we can further improve performance by reducing the number of workers needed to recover the gradient. This is because (1) waiting for fewer workers reduces the expected recovering time, and (2) from Lemma 2, we can directly conclude that recovering the gradient from the matrix constructed by Alg. 1 needs $n-s$ workers given $s$ stragglers. In what follows, we show that this number can be reduced by finding groups, where a group consists of at most $n-s$ workers and can by itself be used to recover the gradient. Denoting a group as $G$, the following conditions are required:

(): for all workers , their sets of data partitions satisfy

(): for all groups in , ,
As shown in Alg. 2, all groups are found recursively so as to satisfy the first condition, and then several groups are pruned to satisfy the second. After finding the groups, we simply set the nonzero elements of the corresponding workers in groups to one. Besides, consider the submatrix of $B$ composed of the rows of the workers not in any group. Obviously, $B$ can be constructed as long as this submatrix is solved, and the submatrix can be constructed by using Alg. 1 under the given number of stragglers.
Slightly different from the decoding function for the matrix constructed by Alg. 1, the decoding matrix here is constructed separately for workers in groups and for workers not in groups. For the workers in each group, we design the corresponding decoding vector as the indicator of the group's members. Obviously, we have
(8) 
Consequently, a decoding submatrix is composed of the decoding vectors of all groups. As for the workers not in groups, the decoding submatrix is solved according to (2).
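The group decoding rule can be illustrated with a tiny example (names and scalar "gradients" are hypothetical): since a group's partition sets are disjoint and jointly cover all partitions, and the group members' nonzero coding entries are all one, summing the members' plain outputs recovers the full aggregated gradient.

```python
# Toy illustration of group-based decoding.
partitions = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}   # partial gradients

group = [
    {"parts": [0, 1]},   # worker A's disjoint partition set
    {"parts": [2, 3]},   # worker B's disjoint partition set
]

def worker_output(w):
    # within a group, nonzero coding entries are simply set to one,
    # so each worker sends the plain sum over its partitions
    return sum(partitions[p] for p in w["parts"])

def group_decode(group):
    # decoding vector is the indicator of the group's members
    return sum(worker_output(w) for w in group)
```

Here `group_decode(group)` equals the sum of all partial gradients, so this group of two workers alone suffices to recover the gradient.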
The algorithm for constructing the coding matrix and solving the decoding matrix is shown in Alg. 3.
According to Theorem 4, we have the following theorem.
Theorem 6.
The matrix constructed by Alg. 3 is robust to any stragglers with probability 1.
Proof.
According to the two conditions on groups and Theorem 4, we can easily see that the non-group part is robust to stragglers with probability 1. Besides, every group can be used to recover the gradient, as shown above. Hence, the constructed matrix is robust to stragglers with probability 1. ∎
From this theorem, we know that the group-based coding scheme is also an optimal solution to problem (4), since under the deterministic situation the computation time of each active worker is the same as that of a worker in coding scheme 1.
Example 2.
An example is shown in the following support structure of the workers. There are three groups. Later, the system prunes one group to satisfy the second condition. For constructing $B$, all entries of the workers that are in groups are set to one, and the remaining entries are solved by using Alg. 1.
VI. Performance Evaluations
In this section, experiments are presented to show the results of our coding schemes. Our coding schemes are mainly compared with two baselines: 1) the naive scheme, in which the whole dataset is divided uniformly across the workers and the server makes an update step after waiting for the completion of all workers; and 2) the cyclic coding scheme [12], which uniformly divides the dataset into data partitions, makes $s+1$ copies of each partition, and lets each worker compute the same number of partitions. We did not implement the fractional repetition scheme or the partial coding scheme of [12]: the fractional repetition scheme not only has the strong limitation of requiring the number of workers to be divisible by $s+1$, but also performs comparably to the cyclic coding scheme; the partial coding scheme makes the strong assumption that the slowest worker is at most a bounded factor slower than the fastest worker, so it is unable to tolerate corrupted workers.
number of vCPUs  ClusterA  ClusterB  ClusterC  ClusterD 

2vCPUs  2  2  1  0 
4vCPUs  2  4  4  4 
8vCPUs  3  8  10  20 
12vCPUs  1  0  12  18 
16vCPUs  0  2  5  16 
Experiment Setup. Based on QingCloud [37], we perform evaluations on various heterogeneous clusters of different scales. We design four clusters, Cluster-A, Cluster-B, Cluster-C and Cluster-D, as shown in Table II. This design mainly aims to cover various scales and degrees of heterogeneity, to show the generality of our coding scheme. The instance type is the performance type, and the operating system of all nodes is 64-bit CentOS 7.1. PyTorch [38] is adopted as the platform.

Workload. Two typical image classification datasets, CIFAR-10 [39] and ImageNet [40], are adopted. CIFAR-10 is composed of training images on which we train AlexNet [39], and ImageNet consists of over a million images on which we train ResNet-34 [41].

Metrics. System efficiency is measured by running time to show the overall efficiency of the distributed learning system. It consists of statistical efficiency and hardware efficiency. Statistical efficiency measures the convergence rate of the learning algorithm, which can be shown by the learning curve. Hardware efficiency represents the efficiency of CPU resource usage.
VI-A Experimental Results
VI-A1 Robustness to Stragglers
By simulating faults, we add extra delay to randomly chosen workers on Cluster-A to show both the performance improvement and the straggler tolerance of our coding scheme. We artificially generate different numbers of stragglers, as shown in Fig. 2(a) and Fig. 2(b). As expected, the running time of the naive scheme increases with the delay, and it cannot run normally once faults occur on workers. Correspondingly, the coding schemes are designed for one straggler in Fig. 2(a) and for more stragglers in Fig. 2(b). Different from the naive distributed learning algorithm, the cyclic algorithm can tolerate stragglers, in that its running time changes little under different delays, as shown in Fig. 2(a). However, the running time of the cyclic algorithm also increases with the delay. This is mainly because the performance of the cyclic scheme is limited by the workers with low computing capacity, and it approaches the performance of the low-computing workers as the delay increases, until it reaches the lower bound when the delay is infinite (faults take place). Compared to these two schemes, both our heter-aware and group-based coding schemes are robust to stragglers, in that their running time stays almost unchanged, as shown in Fig. 2(a) and Fig. 2(b). When faults take place, our heter-aware coding scheme even achieves a speedup compared to the cyclic coding scheme because of its high computing resource usage.
VI-A2 Efficiency under Different Clusters
To show the generality and efficiency of our coding schemes, we extend the experiments to a larger range of clusters with different scales and computing configurations, namely Cluster-B, Cluster-C and Cluster-D. The results are shown in Fig. 3. Obviously, the heter-aware and group-based coding schemes achieve better performance than the other methods on each cluster configuration. On the other side, the traditional cyclic coding scheme can even make performance worse, for it aggravates the straggler problem by allocating an equal workload to workers with different computing capacities.
Besides, one notable advantage of coding-based methods is that they have better statistical efficiency by using BSP. This is not true for asynchronous learning algorithms, which is discussed in depth in [21]. We here validate the efficiency of our learning method compared to SSP, a notably efficient asynchronous distributed learning algorithm. The result is shown in Fig. 4. Due to the heterogeneous computing capacities of the workers, SSP in fact easily reaches the staleness threshold at nearly every step, so that its synchronization overhead is similar to that of the naive BSP learning algorithm. Besides, the master receives unbalanced contributions from different data parts to the update of the parameters due to the discrepancy among workers, so SSP has a lower convergence rate than BSP. Consequently, our coding scheme converges more smoothly and faster than SSP, as shown in Fig. 4.
At last, we discuss the hardware efficiency of our coding schemes, using computing resource usage as the metric. Resource usage is calculated over an average iteration. As shown in Fig. 5, Naive has a low resource usage, which is incurred by low-computing-capacity workers and many other factors, e.g., interfering background processes and network fluctuation. The cyclic coding scheme mitigates this problem by discarding stragglers; however, it is still limited by the unbalanced distribution of computing resources. Our heter-aware and group-based coding schemes solve both problems and achieve high resource usage. Though about half of the resources are still idle due to communication overhead, this can be addressed by the complementary technique proposed in [42] that codes gradients layer by layer.
VII. Conclusion
To tolerate stragglers and take full advantage of computing resources, we propose two new coding schemes in this paper: the heter-aware and the group-based coding scheme. Traditional coding methods proposed in [12] can efficiently mitigate stragglers, especially for fault tolerance, but their uniform data allocation mechanism leads to poor performance in heterogeneous clusters. Considering this, our coding schemes take both stragglers and heterogeneity into account: they tolerate stragglers by first allocating data partitions to workers according to their processing speeds and then designing a corresponding coding strategy. Evaluations show that our coding schemes achieve a substantial speedup compared to the cyclic coding scheme.
References
 [1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
 [2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’10, Boston, MA, USA, June 22, 2010, 2010.
 [3] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
 [4] Y. Yan, Y. Gao, Y. Chen, Z. Guo, B. Chen, and T. Moscibroda, “TR-Spark: Transient computing for big data analytics,” in Proceedings of the Seventh ACM Symposium on Cloud Computing, pp. 484–496, ACM, 2016.
 [5] A. Harlap, A. Tumanov, A. Chung, G. R. Ganger, and P. B. Gibbons, “Proteus: agile ml elasticity through tiered reliability in dynamic resource markets,” in Proceedings of the Twelfth European Conference on Computer Systems, pp. 589–604, ACM, 2017.
 [6] J. Jiang, B. Cui, C. Zhang, and L. Yu, “Heterogeneity-aware distributed parameter servers,” in Proceedings of the 2017 ACM International Conference on Management of Data, pp. 463–478, ACM, 2017.
 [7] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Effective straggler mitigation: Attack of the clones,” in Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013, Lombard, IL, USA, April 25, 2013, pp. 185–198, 2013.
 [8] Q. Zhang, M. F. Zhani, R. Boutaba, and J. L. Hellerstein, “Dynamic heterogeneity-aware resource provisioning in the cloud,” IEEE Transactions on Cloud Computing, vol. 2, no. 1, pp. 14–28, 2014.
 [9] G. Ananthanarayanan, M. C.C. Hung, X. Ren, I. Stoica, A. Wierman, and M. Yu, “Grass: trimming stragglers in approximation analytics,” 2014.
 [10] A. Cutkosky and R. Busa-Fekete, “Distributed stochastic optimization via adaptive SGD,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pp. 1914–1923, 2018.
 [11] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
 [12] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in International Conference on Machine Learning, pp. 3368–3376, 2017.
 [13] R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding with ldpc codes,” arXiv preprint arXiv:1805.08327, 2018.
 [14] X. Zhao, L. Liu, Q. Zhang, and X. Dong, “Improving mapreduce performance in a heterogeneous cloud: A measurement study,” in Cloud Computing (CLOUD), 2014 IEEE 7th International Conference on, pp. 400–407, IEEE, 2014.
 [15] A. Smola and S. Narayanamurthy, “An architecture for parallel topic models,” Proceedings of the VLDB Endowment, vol. 3, no. 12, pp. 703–710, 2010.
 [16] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, pp. 1223–1231, 2012.
 [17] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in Advances in neural information processing systems, pp. 1223–1231, 2013.
 [18] J. Cipar, Q. Ho, J. K. Kim, S. Lee, G. R. Ganger, G. Gibson, K. Keeton, and E. P. Xing, “Solving the straggler problem with bounded staleness.,” in HotOS, vol. 13, pp. 22–22, 2013.
 [19] H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, et al., “Exploiting bounded staleness to speed up big data analytics.,” in USENIX Annual Technical Conference, pp. 37–48, 2014.
 [20] S. Hadjis, C. Zhang, I. Mitliagkas, D. Iter, and C. Ré, “Omnivore: An optimizer for multi-device deep learning on cpus and gpus,” arXiv preprint arXiv:1606.04487, 2016.
 [21] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, “Revisiting distributed synchronous sgd,” arXiv preprint arXiv:1604.00981, 2016.
 [22] R. D. Blumofe and C. E. Leiserson, “Scheduling multithreaded computations by work stealing,” Journal of the ACM (JACM), vol. 46, no. 5, pp. 720–748, 1999.
 [23] U. A. Acar, A. Charguéraud, and M. Rainey, “Scheduling parallel programs by work stealing with private deques,” in ACM SIGPLAN Notices, vol. 48, pp. 219–228, ACM, 2013.
 [24] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha, “Scalable work stealing,” in High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on, pp. 1–11, IEEE, 2009.
 [25] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” CoRR, vol. abs/1512.01274, 2015.
 [26] K. Keeton and T. Roscoe, eds., 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2–4, 2016, USENIX Association, 2016.
 [27] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
 [28] IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, July 10–15, 2016, IEEE, 2016.
 [29] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2018.
 [30] R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding with ldpc codes,” arXiv preprint arXiv:1805.08327, 2018.
 [31] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially coded regression: Optimal straggler mitigation via data encoding,” arXiv preprint arXiv:1805.09934, 2018.
 [32] Q. Yu, N. Raviv, J. So, and A. S. Avestimehr, “Lagrange coded computing: Optimal design for resiliency, security and privacy,” arXiv preprint arXiv:1806.00939, 2018.
 [33] E. Ozfatura, D. Gunduz, and S. Ulukus, “Speeding up distributed gradient descent by utilizing non-persistent stragglers,” arXiv preprint arXiv:1808.02240, 2018.
 [34] M. Ye and E. Abbe, “Communication-computation efficient gradient coding,” arXiv preprint arXiv:1802.03475, 2018.
 [35] N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, “Gradient coding from cyclic mds codes and expander graphs,” arXiv preprint arXiv:1707.03858, 2017.
 [36] Z. Charles, D. Papailiopoulos, and J. Ellenberg, “Approximate gradient coding via sparse random graphs,” arXiv preprint arXiv:1711.06771, 2017.
 [37] “Qingcloud.” https://www.qingcloud.com/.
 [38] A. Paszke, S. Gross, S. Chintala, and G. Chanan, “Pytorch.” https://github.com/pytorch/pytorch.
 [39] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
 [40] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
 [41] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 [42] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing, “Poseidon: An efficient communication architecture for distributed deep learning on gpu clusters,” arXiv preprint, 2017.