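In many machine learning problems, for given training data points $x_1, \ldots, x_N$ and the corresponding labels $y_1, \ldots, y_N$, the objective is to minimize the parameterized empirical loss function
$$L(\theta) \triangleq \frac{1}{N}\sum_{i=1}^{N} l\big((x_i, y_i), \theta\big) + R(\theta),$$
where $\theta$ is the parameter vector, $l$ is an application specific loss function, and $R(\theta)$ is the regularization component. This optimization problem is commonly solved by gradient descent (GD), where at each iteration, the parameter vector is updated along the GD direction:
$$\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t),$$
where $\eta_t$ is the learning rate at iteration $t$, and the gradient at the current parameter vector is given by $\nabla L(\theta_t) = \frac{1}{N}\sum_{i=1}^{N} \nabla l\big((x_i, y_i), \theta_t\big) + \nabla R(\theta_t)$.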
When a large data set is considered, convergence of GD may take a long time, and distributed GD (DGD) techniques may be needed to speed up the convergence, where the computational task is divided into smaller sub-tasks and distributed across multiple computing servers (CSs) to be executed in parallel. At the beginning of the process, the aggregating server (AS) assigns sub-tasks to each CS, which may involve computing the gradient for different data points at each iteration. Whenever a CS completes the sub-tasks assigned to it, it sends the results to the AS, where the results are aggregated to obtain the updated parameter vector, which is then transmitted to all the CSs to be used in the next iteration of the DGD algorithm. While distributed computation is essential to handle large data sets, the completion time of each iteration is constrained by the slowest server(s), called the straggling server(s), which can be detrimental for the convergence of the algorithm.
Typically, the computation and communication latencies of the CSs vary over time, and these values are not known in advance for a particular DGD session. The random behavior of persistent straggling servers can be modeled as a packet erasure communication channel, in which the transmitted data packets are randomly erased. Motivated by this analogy, several papers have recently introduced coding theoretic ideas in order to mitigate the effect of straggling servers in DGD [1, 2, 3, 4]. The main idea behind these schemes is to introduce redundancy when allocating computation tasks to CSs in order to mitigate straggling servers.
More recently, it has been shown that more efficient straggler mitigation techniques can be introduced for specific computation tasks. Particular attention has been paid to the least squares linear regression problem, which has the following loss function:
$$L(\theta) = \frac{1}{2N}\|X\theta - y\|^2,$$
where $X$ is the data matrix whose $i$th row is the data point $x_i$, and $y$ is the vector of labels. For this particular model, the gradient is given by
$$\nabla L(\theta_t) = \frac{1}{N}\left(X^T X\theta_t - X^T y\right).$$
Note that $X^T y$ remains the same throughout all the iterations, and the main computation task is to calculate $X^T X\theta_t$. In this particular case the problem can be reduced to distributed matrix-matrix multiplication or matrix-vector multiplication, and the linearity of the gradient computation allows exploiting novel ideas from coding theory [1, 5, 6, 7, 8].
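The structure of this gradient can be illustrated with a minimal single-machine sketch. It assumes the loss $\frac{1}{2N}\|X\theta - y\|^2$ with no regularization; the function name `gd_least_squares`, the step size, and the synthetic data are hypothetical choices made only for illustration. The point is that the label-dependent term is computed once, while the product with the current parameter vector is recomputed at every iteration.

```python
import numpy as np

def gd_least_squares(X, y, lr=0.1, iters=2000):
    """Plain GD for the loss (1/2N) * ||X theta - y||^2 (illustrative sketch)."""
    N, L = X.shape
    theta = np.zeros(L)
    Xty = X.T @ y                      # fixed across iterations, computed once
    for _ in range(iters):
        # Main per-iteration task: X^T (X theta); this is what DGD distributes.
        grad = (X.T @ (X @ theta) - Xty) / N
        theta = theta - lr * grad
    return theta

# Synthetic, noiseless data (hypothetical) so the minimizer is known exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta_hat = gd_least_squares(X, y)
```

In a distributed setting, the term `X.T @ (X @ theta)` is the part that would be split across the CSs.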
Before the detailed explanation and analysis of these schemes, we want to emphasize that in most of the straggling avoidance techniques designed for DGD, it is assumed that the straggling servers make no contribution to the computation task. However, in practice, non-persistent straggling servers are capable of completing a certain portion of their assigned tasks. Therefore, our main objective in this paper is to redesign the straggling avoidance techniques in a way that the computational capacity of the non-persistent stragglers can also be utilized. This will be achieved at the expense of an increase in the number of computations conveyed to the AS from the CSs, which we will define as the communication load. We first focus on the DGD scheme for the linear regression problem, then we consider another DGD strategy with uncoded computations, which can be applied to a general loss function.
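Table I: Classification of DGD schemes by straggling avoidance strategy (UCUC, UCCC, coded computation).
Table II: Classification of DGD schemes as without pre-processing or with pre-processing.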
I-A Straggler Avoidance Techniques
In general, DGD schemes can be classified under three groups based on the employed straggling avoidance strategy; namely, 1) uncoded computation with uncoded communication (UCUC); 2) uncoded computation with coded communication (UCCC); and finally, 3) coded computation. The first group includes techniques in which the data points or mini-batches are distributed among the CSs, and each CS computes certain gradients and returns the results to the AS. In order to limit the completion time, the AS can update the parameter vector after receiving only a limited number of gradients. The most common example of such schemes is the stochastic gradient descent (SGD) approach with several different implementations, such as the K-sync SGD, K-batch-sync SGD, K-async SGD and K-batch-async SGD (see  for more details on these particular techniques). The schemes in the second group also distribute the data points in a similar fashion, but the computation results, i.e., values of the gradients, are sent to the AS in a coded form to achieve a certain tolerance against slow/straggling CSs [2, 11, 3]. While in uncoded computation the training data points are provided to the CSs as they are, in coded computation they are delivered in coded form [1, 5, 6, 7]. Classification of some of the DGD techniques in the literature into these three groups is given in Table I. In all these schemes, the main idea is to assign redundant tasks to CSs in order to avoid straggling servers. We assume that $r$ tasks (these might correspond to data points or mini-batches depending on the application) are assigned to each CS, where $r$ will be called the computation load.
In the gradient coding approach [2], a UCCC scheme, rows of the data matrix $X$, denoted by $x_1, \ldots, x_N$ and also referred to as data points, are distributed to $K$ CSs$^{1}$. Each row is assigned to multiple CSs to create redundancy. Each CS computes $x_i^T x_i\theta_t$ for all the rows assigned to it, and sends a linear combination of these computations to the AS. In gradient coding the AS can recover the full gradient by receiving coded gradients from only $K - r + 1$ CSs, at the expense of increased computation load at the CSs. Alternatively, in coded computation, linear combinations of the rows of $X$ are distributed to the CSs. For each assigned coded input $\tilde{x}$, the corresponding CS computes $\tilde{x}^T\tilde{x}\theta_t$, and transmits the result to the AS.
$^{1}$ Throughout the paper, for simplicity, we assume that the number of data points is equal to the number of CSs, i.e., $N = K$, although the proposed schemes can be easily applied to any $(N, K)$ pair. Moreover, while we refer to data points, each data point can represent a mini-batch of an arbitrary size depending on the application.
Note that $X^T X$ in the gradient expression remains the same throughout the iterations of the DGD process. Hence, if $W \triangleq X^T X$ is computed at the beginning of the process, the AS only requires the results of the inner products $w_i\theta_t$, where $w_i$ is the $i$th row of $W$. We refer to schemes that work directly with data samples as distributed computation without preprocessing, and to schemes that work with $W$ as distributed computation with preprocessing. If $W$ is available at the AS, the DGD for linear regression boils down to distributed matrix-vector multiplication, and the linear combinations of the rows of $W$ can be distributed to CSs as coded inputs [1, 5, 6, 8]. Classification of some of the known techniques in the literature according to pre-processing is given in Table II.
I-B Communication Load of DGD
Coded computation and communication techniques are designed to ameliorate the effects of slow/straggling servers such that fast servers can compensate for the straggling ones. In most of the existing schemes, each non-straggling CS transmits a single message to the AS at each iteration of the DGD algorithm, conveying the results of all computation tasks assigned to it, while the straggling servers do not transmit at all as they cannot complete their assigned tasks. This restriction leads to a trade-off between the per-server computation load, $r$, and the non-straggling threshold, where the latter denotes the minimum number of CSs that must complete their tasks for the AS to recover all the gradients. This is achieved by assigning redundant computations to each of the CSs. In the extreme case, it may even be sufficient to get the results from only one CS, if all the computation tasks are assigned to each of the CSs, i.e., $r = N$.
However, it is important to emphasize that a smaller non-straggling threshold does not necessarily imply a lower completion time; thus, the number of computations assigned to each CS and the non-straggling threshold should be chosen carefully. Indeed, beyond a threshold on the computation load (i.e., the number of computation tasks assigned to each CS), the average completion time starts increasing.
An important limitation of the existing schemes in the literature is that the computations that have been carried out by the straggling servers are discarded, and not used by the AS at all; thus, the computation capacity of the network is underutilized. We show in this paper that the performance of the existing schemes can be improved by allowing the communication of multiple messages from the CSs to the AS at each iteration of the employed DGD technique, so that CSs can send the results of partial computations before completing all the assigned computations at the expense of an increased communication load, which characterizes the average number of total transmissions from the CSs to the AS per iteration. We remark that the overall impact of the increased communication load on the completion time depends on the distributed system architecture as well as the communication protocol used. The proposed multi-message techniques may be more attractive for special-purpose high performance computing (HPC) architectures employing message passing interface (MPI) rather than physically distributed machines communicating through standard networking protocols .
Multiple messages per server per iteration have recently been considered in [5] and [8]. In [5], a hierarchical coded computation scheme is proposed, in which the computation tasks are divided into disjoint layers. For each layer an MDS code is used for encoding the rows of $X^T X$, while the parameters are optimized according to the straggling statistics of the servers. Although this scheme provides an improvement compared to single-message schemes, it has two main limitations. First, the code design is highly dependent on the straggling behavior of the servers, which is often not easy to predict, and can be time-varying. Second, if a sufficient number of coded computations for a particular layer are received to allow the decoding of the corresponding gradients, any further computations received for this particular layer will be useless. In that sense, a strategy with a single layer will have a lower per iteration completion time when the decoding time is neglected. However, the decoding complexity at the AS also affects the network performance, and this layered structure helps reduce the decoding complexity. In [8], the authors also consider the multi-message approach, but instead of using an MDS code with a layered structure they use rateless codes, particularly LT codes, to reduce the decoding complexity. However, to achieve the reported results, a large number of coded messages must be passed to the AS at each iteration, which induces a packetization problem that limits its applicability to real systems.
I-C Objective and Contributions
Although the aforementioned works [5, 8] allow multiple messages per server (per iteration), they assume the presence of a preprocessing step; that is, instead of distributing the rows of matrix $X$ (or their coded versions) as computation tasks, the rows of matrix $X^T X$ are distributed. However, obtaining $X^T X$ may not be practical for large data sets. Hence, we focus on the performance of coded computation and communication schemes that work directly on matrix $X$, allowing multiple messages to be transmitted from each CS at each iteration. Moreover, in many scenarios with huge data sets, the data may not even be available centrally at the AS, and is instead stored at the CSs to reduce the communication costs and the storage requirements at the AS. Therefore, we also consider uncoded computation techniques.
As we discussed previously, the schemes in the literature focus on minimizing the non-straggling threshold, which does not necessarily capture the average completion time statistics for one iteration of the GD algorithm. Indeed, in certain regimes of computation load $r$, the average completion time may increase as the non-straggling threshold decreases. Accordingly, in this paper, we consider the average completion time as the main performance metric and develop DGD algorithms that can provide a trade-off between the communication load and the computation load.
To model the straggling behavior at the CSs, we use the model introduced in  to derive a closed form expression for the completion time statistics for both single and multi-message communication scenarios. We will also present numerical results based on Monte-Carlo simulations to compare the performances of different schemes in terms of the trade-off they obtain between the average completion time and the computation load. We also analyze the performance of an uncoded computation and communication scheme for the multi-message scenario, and show that in certain cases it outperforms its coded counterparts, while also significantly reducing the decoding complexity.
II Coded Computation
We first explain the coded computation strategy when there is no pre-processing step, i.e., $X^T X$ is not known in advance. For a given computational load constraint $r$, also called the repetition factor, $r$ coded rows $\tilde{x}_{i,1}, \ldots, \tilde{x}_{i,r}$ are assigned to CS $i$, which executes the computations $\tilde{x}_{i,j}^T\tilde{x}_{i,j}\theta_t$, $j = 1, \ldots, r$. Once all these computations are executed, CS $i$ returns their sum to the AS. The results obtained from a sufficient number of CSs are used at the AS to compute the next iteration of the parameter vector, $\theta_{t+1}$. Now we will briefly summarize the Lagrange coded computation method introduced in [13, 7], which utilizes polynomial interpolation for the code design.
II-A Lagrange Polynomial
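Consider the following polynomial
$$f(z) \triangleq \sum_{i=1}^{N} x_i \prod_{j=1, j\neq i}^{N} \frac{z - \alpha_j}{\alpha_i - \alpha_j},$$
where $\alpha_1, \ldots, \alpha_N$ are $N$ distinct real numbers, and $x_1, \ldots, x_N$ are vectors of size $1 \times L$. The main feature of the polynomial $f(z)$ is that $f(\alpha_i) = x_i$, for $i = 1, \ldots, N$. Let us consider another polynomial
$$h(z) \triangleq f(z)^T f(z)\theta,$$
such that$^{2}$ $h(\alpha_i) = x_i^T x_i\theta$. Hence, if the coefficients of polynomial $h(z)$ are known, then the term $X^T X\theta = \sum_{i=1}^{N} x_i^T x_i\theta$ can be obtained easily. We remark that the degrees of the polynomials $f(z)$ and $h(z)$ are $N-1$ and $2(N-1)$, respectively. Accordingly, if the values of $h(z)$ at $2N-1$ distinct points are known at the AS, then all its coefficients can be obtained via polynomial interpolation. This is the key notion behind Lagrange coded computation, which is explained in the next subsection.
$^{2}$ We dropped the time index on $\theta$ for brevity.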
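The interpolation argument above can be checked numerically with a small sketch. It assumes that the rows of a data matrix `X` are the interpolated vectors, that the second polynomial is the product of the first with its transpose applied to the parameter vector, and that $2N-1$ evaluations are available; the helper `lagrange_eval`, the interpolation points, and the random data are hypothetical and chosen only for illustration.

```python
import numpy as np

def lagrange_eval(nodes, values, z):
    """Evaluate at z the unique polynomial taking values[i] at nodes[i]."""
    total = np.zeros_like(values[0], dtype=float)
    for i, a_i in enumerate(nodes):
        w = 1.0
        for j, a_j in enumerate(nodes):
            if j != i:
                w *= (z - a_j) / (a_i - a_j)
        total = total + w * values[i]
    return total

rng = np.random.default_rng(1)
N, L = 4, 3
X = rng.normal(size=(N, L))               # rows play the role of x_1, ..., x_N
theta = rng.normal(size=L)
alphas = np.arange(1.0, N + 1)            # f(alpha_i) = x_i
betas = np.linspace(0.5, 4.5, 2 * N - 1)  # 2N-1 distinct evaluation points

# h(beta) = f(beta)^T f(beta) theta, one vector of length L per point.
H = []
for b in betas:
    f_b = lagrange_eval(alphas, X, b)
    H.append(f_b * (f_b @ theta))
H = np.array(H)

# Interpolate h from its 2N-1 samples and sum its values at the alphas.
gradient_term = sum(lagrange_eval(betas, H, a) for a in alphas)
```

Because the second polynomial has degree $2(N-1)$, the $2N-1$ samples determine it exactly, so `gradient_term` should agree with `X.T @ X @ theta` up to floating-point error.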
II-B Lagrange Coded Computation (LCC)
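Let us first assume that $N$ is a multiple of $r$. For given $r$ and $N$, the rows of $X$, $x_1, \ldots, x_N$, are divided into $r$ disjoint groups, each of size $N/r$, and the rows within each group are ordered according to their indices. Let $x_{k,j}$ denote the $j$th row in the $k$th group, and $X_k$ denote all the rows in the $k$th group; that is, $X_k$ is the $N/r \times L$ submatrix of $X$. Then, for distinct real numbers $\alpha_1, \ldots, \alpha_{N/r}$, we form the following $r$ structurally identical polynomials of degree $N/r - 1$, which take the rows of $X_k$ as their coefficients:
$$f_k(z) \triangleq \sum_{i=1}^{N/r} x_{k,i} \prod_{j=1, j\neq i}^{N/r} \frac{z - \alpha_j}{\alpha_i - \alpha_j}, \quad k = 1, \ldots, r.$$
Then we define
$$h(z) \triangleq \sum_{k=1}^{r} f_k(z)^T f_k(z)\theta.$$
Coded vectors $\tilde{x}_{i,k}$, $k = 1, \ldots, r$, for CS $i$, are obtained by evaluating the polynomials $f_k(z)$ at $K$ distinct values $\beta_1, \ldots, \beta_K$, i.e., $\tilde{x}_{i,k} = f_k(\beta_i)$. At each iteration of the DGD algorithm, CS $i$ returns the value of
$$h(\beta_i) = \sum_{k=1}^{r} \tilde{x}_{i,k}^T\tilde{x}_{i,k}\theta.$$
The degree of polynomial $h(z)$ is $2(N/r - 1)$; and thus, the non-straggling threshold for LCC is given by $2N/r - 1$; that is, having received the value of $h(z)$ at $2N/r - 1$ distinct points, the AS can interpolate $h(z)$ and compute
$$\sum_{j=1}^{N/r} h(\alpha_j) = \sum_{j=1}^{N/r}\sum_{k=1}^{r} x_{k,j}^T x_{k,j}\theta = X^T X\theta.$$
When $N$ is not divisible by $r$, zero-valued data points can be added to $X$ to make it divisible by $r$. Hence, in general the non-straggling threshold is given by $2\lceil N/r\rceil - 1$.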
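A toy single-message LCC round can be sketched as follows, under stated assumptions: six data points, computation load two, six workers, consecutive rows grouped together, and one real evaluation point per worker. All names (`lagrange_eval`, `worker_output`) and numeric values are hypothetical; the sketch only illustrates that the results of any $2\lceil N/r\rceil - 1$ workers suffice under this construction.

```python
import numpy as np

def lagrange_eval(nodes, values, z):
    """Evaluate at z the unique polynomial taking values[i] at nodes[i]."""
    total = np.zeros_like(values[0], dtype=float)
    for i, a_i in enumerate(nodes):
        w = 1.0
        for j, a_j in enumerate(nodes):
            if j != i:
                w *= (z - a_j) / (a_i - a_j)
        total = total + w * values[i]
    return total

rng = np.random.default_rng(2)
N, L, r, K = 6, 3, 2, 6                    # hypothetical sizes, N divisible by r
X = rng.normal(size=(N, L))
theta = rng.normal(size=L)

groups = X.reshape(r, N // r, L)           # r groups of N/r consecutive rows
alphas = np.arange(1.0, N // r + 1)        # shared interpolation points
betas = np.linspace(0.5, 3.5, K)           # one evaluation point per worker

def worker_output(beta):
    """Sum over the r coded rows of (coded_row^T coded_row theta)."""
    out = np.zeros(L)
    for k in range(r):
        coded_row = lagrange_eval(alphas, groups[k], beta)
        out += coded_row * (coded_row @ theta)
    return out

threshold = 2 * (N // r) - 1               # non-straggling threshold
fastest = rng.choice(K, size=threshold, replace=False)  # any subset works
received = np.array([worker_output(betas[i]) for i in fastest])

# Interpolate the degree-2(N/r - 1) polynomial and sum it over the alphas.
recovered = sum(lagrange_eval(betas[fastest], received, a) for a in alphas)
```

Here one worker is ignored entirely, which mirrors the single-message assumption that a straggler contributes nothing.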
II-C LCC with Multi-Message Communication
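LCC for distributed gradient descent was originally proposed in [13, 7] considering the transmission of a single message to the AS per CS per iteration. Here, we introduce a multi-message version of LCC by using a single polynomial of degree $N-1$, instead of using $r$ different polynomials, each of degree $N/r - 1$. We define
$$f(z) \triangleq \sum_{i=1}^{N} x_i \prod_{j=1, j\neq i}^{N} \frac{z - \alpha_j}{\alpha_i - \alpha_j},$$
where $\alpha_1, \ldots, \alpha_N$ are $N$ distinct real numbers, and we construct
$$h(z) \triangleq f(z)^T f(z)\theta,$$
such that $h(\alpha_i) = x_i^T x_i\theta$. Consequently, if the polynomial $h(z)$ is known at the AS, then the full gradient can be obtained. To this end, $r$ coded vectors $\tilde{x}_{i,j}$, $j = 1, \ldots, r$, which are assigned to CS $i$, are constructed by evaluating $f(z)$ at $r$ different points $\beta_{i,1}, \ldots, \beta_{i,r}$, i.e.,
$$\tilde{x}_{i,j} = f(\beta_{i,j}).$$
CS $i$ computes $\tilde{x}_{i,j}^T\tilde{x}_{i,j}\theta$, and transmits the resultant vector to the AS after each computation. The coded computation corresponding to coded data point $\tilde{x}_{i,j}$ at CS $i$ provides the value of polynomial $h(z)$ at point $\beta_{i,j}$. The degrees of the polynomials $f(z)$ and $h(z)$ are $N-1$ and $2(N-1)$, respectively, which implies that $h(z)$ can be interpolated from its values at any $2N-1$ distinct points. Hence, any $2N-1$ computations received from any subset of the CSs are sufficient to obtain the full gradient.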
We note that, in the original LCC scheme, coded data points are constructed by evaluating different polynomials at the same point, whereas in the multi-message LCC scheme, coded data points are constructed by evaluating a single polynomial at distinct points. In the multi-message scenario, the per iteration completion time can be reduced since the partial computations of the non-persistent stragglers are also utilized; however, this comes at the expense of an increase in the communication load. Nevertheless, it is possible to set the number of polynomials to a different value to seek a balance between the communication load and the per iteration completion time. This will be illustrated in Section V.
III Uncoded Computation and Communication (UCUC)
In UCUC, the data points are divided into $K$ groups, where $K$ is the number of CSs, and each group is assigned to a different CS. While the per iteration completion time is determined by the slowest CS in this case, it can be reduced by assigning multiple data points to each CS, and allowing it to communicate the result of its computation for each data point right after its execution. We note here that, with UCUC, the AS can apply SGD, and evaluate the next iteration of the parameter vector without waiting for all the computations. While we will mainly consider GD with a full gradient computation in our analysis for a fair comparison with the presented CGD approaches, we will show in Section V that significant gains can be obtained in both computation time and communication load by ignoring only 5% of the computations.
Let be the assignment matrix for the data points to CSs, where means that the th data point is computed by the th CS in the th order.
An easy and efficient way of constructing is to use a circular shift matrix, where
For instance, for and , we have:
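One possible circular shift assignment can be sketched as follows. This is an illustrative construction under assumptions, not necessarily the exact matrix used here: it assumes the number of data points equals the number of CSs, uses 0-based indices, and the function name `circular_shift_assignment` and the sizes in the example are hypothetical.

```python
def circular_shift_assignment(num_workers, load):
    """Entry [j][k] is the (0-based) data point that worker k computes in order j.

    Each row is the previous row shifted cyclically by one position, so every
    data point is assigned to `load` distinct workers.
    """
    return [[(k + j) % num_workers for k in range(num_workers)]
            for j in range(load)]

# Hypothetical example: 4 workers (and 4 data points), computation load 2.
A = circular_shift_assignment(4, 2)
# Worker 0 computes data point 0 first and data point 1 second, and so on.
```

With this ordering, each worker starts with a different data point, so the first message from every worker already covers the whole data set when no worker straggles.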
We highlight that, in the multi-message scenario, uncoded communication always outperforms the gradient coding scheme of [2]. In the latter, a necessary condition to obtain the full gradient is that each partial gradient, i.e., the gradient corresponding to one data point, is computed by at least one server. It is easy to see that, under this condition, the full gradient can also be obtained by UCUC. Thus, the main advantage of the gradient coding scheme is its reduced communication overhead, and we do not consider a multi-message gradient coding scheme. We note here that the utilization of the non-persistent stragglers in the single-message UCUC scenario is studied in . In the scheme proposed in , instead of sending each gradient separately, each CS transmits the sum of the gradients computed up until a specified time constraint, and these sums are combined at the AS using different weights.
IV Per Iteration Completion Time Statistics
In this section, we analyze the statistics of the per iteration completion time $T$ for the DGD schemes introduced above. For the analysis we consider a setup with $K$ CSs, and similarly we assume that the data set is divided into $N$ data points. For the straggling behavior, we adopt the model in  and , and assume that the probability of completing $s$ computations at any server, such as multiplying $\theta$ with $s$ different coded rows, by time $t$ is given by
$$F_s(t) \triangleq \begin{cases} 1 - e^{-\mu\left(\frac{t}{s} - \alpha\right)}, & \text{if } t \geq s\alpha, \\ 0, & \text{otherwise}. \end{cases}$$
The statistical model considered above is a shifted exponential distribution, such that the duration of a computation cannot be less than $\alpha$. We also note that, although the overall computation time at a particular CS has an exponential distribution, the duration of each computation is assumed to be identical. Further, let $P_s(t)$ denote the probability of completing exactly $s$ computations by time $t$. We have
where , since there are a total of computations assigned to each user. One can observe from (16) that ; and, hence can be written as follows:
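We divide the CSs into $r+1$ groups according to the number of computations completed by time $t$. Let $N_s(t)$ be the number of CSs that have completed exactly $s$ computations by time $t$, $s = 0, \ldots, r$, and define $\mathbf{N}(t) \triangleq (N_0(t), \ldots, N_r(t))$, where $\sum_{s=0}^{r} N_s(t) = K$. With $P_s(t)$ denoting the probability that a CS has completed exactly $s$ computations by time $t$, the probability of a particular realization $\mathbf{n} = (n_0, \ldots, n_r)$ is given by
$$\Pr(\mathbf{N}(t) = \mathbf{n}) = \frac{K!}{\prod_{s=0}^{r} n_s!}\prod_{s=0}^{r} P_s(t)^{n_s}.$$
At this point, we introduce $M(t)$, which denotes the total number of computations completed by all the CSs by time $t$, i.e., $M(t) \triangleq \sum_{s=1}^{r} s N_s(t)$, and let $M$ denote the threshold for obtaining the full gradient$^{3}$. Hence, the probability of recovering the full gradient at the AS by time $t$, $\Pr(T \leq t)$, is given by $\Pr(M(t) \geq M)$. Consequently, we have
$$\Pr(T \leq t) = \sum_{\mathbf{n}: \sum_{s=1}^{r} s n_s \geq M} \Pr(\mathbf{N}(t) = \mathbf{n}).$$
$^{3}$ Recall that this threshold depends on the existence of a preprocessing step.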
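Per iteration completion time statistics of non-straggler threshold based schemes can be derived similarly. For a given non-straggler threshold $K_{th}$ and per server computation load $r$, with $F_r(t)$ denoting the probability that a CS completes all $r$ of its computations by time $t$, we have
$$\Pr(T \leq t) = \sum_{k=K_{th}}^{K} \binom{K}{k} F_r(t)^{k}\left(1 - F_r(t)\right)^{K-k}$$
when $t \geq r\alpha$, and $0$ otherwise, where $\alpha$ is the minimum duration of a single computation.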
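The difference between single-message and multi-message completion times can be illustrated with a small Monte Carlo sketch. It assumes a shifted exponential model in which each worker draws one per-computation duration (a fixed shift plus an exponential delay) and needs that same duration for each of its computations; all function names and numeric parameters below are hypothetical and are not the settings used in the numerical results.

```python
import numpy as np

def sample_unit_times(rng, num_workers, mu, alpha):
    """Per-computation duration of each worker: alpha plus an Exp(mu) delay."""
    return alpha + rng.exponential(1.0 / mu, size=num_workers)

def single_message_time(tau, load, k_th):
    """Time at which k_th workers have finished all `load` computations."""
    return np.sort(load * tau)[k_th - 1]

def multi_message_time(tau, load, m):
    """Time at which m computations, counted over all workers, are finished."""
    # Worker i finishes its s-th computation at time s * tau[i].
    finish = np.outer(np.arange(1, load + 1), tau).ravel()
    return np.sort(finish)[m - 1]

rng = np.random.default_rng(3)
K, r, k_th, mu, alpha = 10, 4, 5, 1.0, 0.1   # hypothetical parameters
m = r * k_th                                  # same number of results required
trials = 2000
single = np.empty(trials)
multi = np.empty(trials)
for n in range(trials):
    tau = sample_unit_times(rng, K, mu, alpha)
    single[n] = single_message_time(tau, r, k_th)
    multi[n] = multi_message_time(tau, r, m)
avg_single, avg_multi = single.mean(), multi.mean()
```

When both schemes require the same total number of computation results, the multi-message time can never exceed the single-message time in any realization, since partial work from slower workers also counts toward the threshold.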
V Numerical Results
We first verify the correctness of the expressions provided for the per iteration completion time statistics in (19) and (22) through Monte Carlo simulations generating 100000 independent realizations. Then, we will show that the multiple-message communication approach can reduce the average per-iteration completion time significantly. In particular, we analyze the per iteration completion time of different DGD schemes: gradient coding (GC), Lagrange coded computation (LCC), and LCC with multi-message communication (LCC-MMC). For the simulations we consider two different settings, with , and , , respectively, and use the cumulative distribution function (CDF) in (15) with parameters and for the completion time statistics.
In Fig. 1 we plot the CDF of the per iteration completion time for the GC, LCC, and LCC-MMC schemes according to the closed form expressions derived in Section IV and Monte Carlo simulations. We observe from Fig. 1 that the provided closed-form expressions match perfectly with the results from the Monte Carlo simulations. We also observe that, although the LCC-MMC and LCC schemes perform closely in the first scenario, LCC-MMC outperforms the LCC scheme in the second scenario. This is because, when the per user computation load is increased, it will take more time for even the fast CSs to complete all the assigned computations, which results in a higher number of non-persistent stragglers. Hence, the performance gap between LCC-MMC and LCC increases with $r$. Similarly, we also observe that GC performs better for small when the ratio is preserved.
Next, we consider the setup from , where CSs are assigned tasks to be computed at each iteration, where different computations are assigned to each server. Again, we use the distribution in (15) with parameters and . We compare the average per iteration completion time, , of the GC, LCC and LCC-MMC schemes, as well as the uncoded scheme with multi-message communication, UC-MMC, and the results are illustrated in Fig. 2. We observe that the LCC-MMC approach can provide approximately reduction in the average completion time compared to LCC, and more than reduction compared to GC. A more interesting result is that the UC-MMC scheme outperforms both LCC and GC. This result is especially important since UC-MMC has no decoding complexity at the AS. Hence, when the decoding time of the AS is also included in the average per iteration completion time, this improvement will be even more significant.
Finally, we analyze the performance of the various DGD schemes with respect to computation load . We consider the previous setup with , and consider . For the performance analysis, we consider both the average per iteration completion time and the communication load, measured by the average total number of transmissions from the CSs to the AS, and the results obtained from Monte Carlo realizations are illustrated in Fig. 3. From Fig. 3(a), we observe that the UC-MMC scheme consistently outperforms LCC for all computation load values. More interestingly, UC-MMC performs very close to LCC-MMC, and for a small , such as , it can even outperform LCC-MMC. Hence, in terms of the computation load, UC-MMC can be considered a better option compared to LCC, especially when is low.
On the other hand, from Fig. 3(b) we observe that, in terms of the communication load the best scheme is LCC, while the UC-MMC introduces the highest communication load. We also observe that communication load of the LCC-MMC scheme remains constant with , whereas that of the LCC (UC-MMC) scheme monotonically decreases (increases) with . Accordingly, the communication load of the LCC and UC-MMC schemes are closest at . From both Fig. 3(a) and Fig. 3(b) we note that, when is low, e.g., when the CSs have small storage capacity, UC-MMC may outperform the LCC scheme in terms of the average per iteration completion time including the decoding time as well.
An important aspect of the average per-iteration completion time that is ignored here, and by other works in the literature, is the decoding complexity at the AS. Among these three schemes, UC-MMC has the lowest decoding complexity, while LCC-MMC has the highest. However, as discussed in Section II, the number of transmissions as well as the decoding complexity can be reduced by increasing the number of polynomials used in the encoding process. To illustrate this, we consider a different implementation of the LCC-MMC scheme, where two polynomials are used in the encoding part, denoted by LCC-MMC-2. In this scheme, for given , the coded inputs correspond to the evaluation of two polynomials, each of degree , at different points. Each CS sends a partial result to the AS after the execution of two computations, which correspond to the evaluation of these two polynomials at the same point. Since two polynomials are used in the encoding, the number of transmissions is reduced to approximately half of that of LCC-MMC, as illustrated in Fig. 4(b). Although a noticeable improvement is achieved in the communication load, we observe a relatively small increase in the average per iteration completion time, as illustrated in Fig. 4(a).
Overall, the optimal strategy highly depends on the network structure. When the completion time is dominated by the CSs’ computation time, LCC-MMC becomes the best alternative. This might be the case when the workers represent GPUs or CPUs on the same machine. On the other hand, if the communication load is the bottleneck, then LCC becomes more attractive, especially when the servers have enough storage capacity, i.e., large $r$. However, as we observe in Fig. 4, the communication load and the average per iteration completion time can be balanced by adjusting the number of polynomials used in the encoding process; hence, the per iteration completion time can be reduced further without causing an excessive increase in the communication load. We also note here that it has been recently shown in  that the communication load can be reduced further by doing consecutive matrix multiplications at the CSs over several iterations without communicating with the AS, and then sending higher degree coded matrix multiplication results to the AS. In the end, the AS interpolates a polynomial with a higher degree, which requires a larger non-straggling threshold compared to LCC, but with the benefit of a drastically reduced communication load. However, we note that the implementation of the proposed strategy is limited by the number of CSs, since the non-straggling threshold cannot be larger than the number of CSs.
We also observe that when the CSs have a small storage capacity, i.e., small , UC-MMC has the lowest per iteration completion time. Moreover, when the decoding complexity is taken into account, UC-MMC can be preferable to coded computation schemes. Another advantage of the UC-MMC scheme is its applicability to K-batch SGD. The coded computation approaches are designed to obtain the full gradient; hence, at each iteration, they wait until they can recover all the gradient values. However, in the K-batch stochastic gradient descent approach the parameter vector is updated when any gradient values, corresponding to different batches (data points), are available at the AS. Using gradients corresponding to data points, instead of the full gradient, the per iteration completion time can be reduced. To this end, we consider a partial gradient scheme with multi-message communication, UC-MMC-PG, with tolerance, i.e., . We plot the average completion time and communication loads for different values of in Fig. 5. The results show that when is small, UC-MMC-PG can reduce the average completion time up to compared to LCC, and up to compared to UC-MMC; while only two gradient values are missing at each iteration. In addition to the improvement in average completion time, the UC-MMC-PG scheme can also reduce the communication load as shown in Fig. 5(b). We remark that, in the K-batch approach the gradient used for each update is less accurate compared to the full-gradient approach; however, since the parameter vector is updated over many iterations, K-batch approach may converge to the optimal value faster than the full-gradient approach.
VI Conclusions and Future Directions
We have introduced novel coded and uncoded DGD schemes when multi-message communication is allowed from each server at each iteration of the DGD algorithm. We first provided a closed-form expression for the per iteration completion time statistics of these schemes, and verified our results with Monte Carlo simulations. Then, we compared these schemes with other DGD schemes in the literature in terms of the average computation and communication loads incurred. We have observed that allowing multiple messages to be conveyed from each CS at each GD iteration can reduce the average completion time significantly at the expense of an increase in the average communication load. Depending on the network structure, the communication protocol employed, and the computation capabilities of the CSs, we have proposed a generalized coded DGD scheme that can provide a balance between the communication load and the completion time. We also observed that UCUC with a simple circular shift can be more efficient than coded computation approaches when the servers have limited storage capacity. We emphasize that, despite the benefits of coded computation in reducing the computation time, its relevance in practical big data problems is questionable due to the need to jointly transform the whole data set, which may not even be possible to store in a single server. As a future extension of this work, we will analyze the overall performance of these schemes in a practical setup for a more realistic comparison.
-  K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. on Information Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2018.
-  R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proc. Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70, Sydney, Australia, Aug. 2017, pp. 3368–3376.
-  W. Halbawi, N. A. Ruhi, F. Salehi, and B. Hassibi, “Improving distributed gradient descent using Reed-Solomon codes,” CoRR, vol. abs/1706.05436, 2017. [Online]. Available: http://arxiv.org/abs/1706.05436
-  S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, “Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD,” in The 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
-  N. Ferdinand and S. C. Draper, “Hierarchical coded computation,” in IEEE Int’l Symp. on Information Theory (ISIT), Vail, CO, Jun. 2018.
-  R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding with LDPC codes,” SysML Conference, 2018.
-  S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially coded regression: Optimal straggler mitigation via data encoding,” CoRR, vol. abs/1805.09934, 2018. [Online]. Available: http://arxiv.org/abs/1805.09934
-  A. Mallick, M. Chaudhari, and G. Joshi, “Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication,” CoRR, vol. abs/1804.10331, 2018. [Online]. Available: http://arxiv.org/abs/1804.10331
-  S. Li, S. M. M. Kalan, A. S. Avestimehr, and M. Soltanolkotabi, “Near-optimal straggler mitigation for distributed gradient methods,” CoRR, vol. abs/1710.09990, 2017. [Online]. Available: http://arxiv.org/abs/1710.09990
-  N. Ferdinand, B. Gharachorloo, and S. C. Draper, “Anytime exploitation of stragglers in synchronous stochastic gradient descent,” in IEEE Int’l Conf. on Machine Learning and Applications (ICMLA), Dec. 2017, pp. 141–146.
-  M. Ye and E. Abbe, “Communication-computation efficient gradient coding,” CoRR, vol. abs/1802.03475, 2018. [Online]. Available: http://arxiv.org/abs/1802.03475
-  T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,” CoRR, vol. abs/1802.09941, 2018. [Online]. Available: http://arxiv.org/abs/1802.09941
-  S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. R. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrix multiplication,” CoRR, vol. abs/1801.10292, 2018. [Online]. Available: http://arxiv.org/abs/1801.10292
-  F. Haddadpour, Y. Yang, M. Chaudhari, V. R. Cadambe, and P. Grover, “Straggler-resilient and communication-efficient distributed iterative linear solver,” CoRR, vol. abs/1806.06140, 2018. [Online]. Available: https://arxiv.org/abs/1806.06140