Compressed Coded Distributed Computing

05/05/2018 ∙ by Songze Li, et al.

Communication overhead is one of the major performance bottlenecks in large-scale distributed computing systems, in particular for machine learning applications. Conventionally, compression techniques are used to reduce the load of communication by combining intermediate results of the same computation task as much as possible. Recently, via the development of coded distributed computing (CDC), it has been shown that it is possible to enable coding opportunities across intermediate results of different computation tasks to further reduce the communication load. We propose a new scheme, named compressed coded distributed computing (in short, compressed CDC), which jointly exploits the above two techniques (i.e., combining the intermediate results of the same computation and coding across the intermediate results of different computations) to significantly reduce the communication load for computations with linear aggregation (reduction) of intermediate results in the final stage that are prevalent in machine learning (e.g., distributed training algorithms where partial gradients are computed distributedly and then averaged in the final stage). In particular, compressed CDC first compresses/combines several intermediate results for a single computation, and then utilizes multiple such combined packets to create a coded multicast packet that is simultaneously useful for multiple computations. We characterize the achievable communication load of compressed CDC and show that it substantially outperforms both combining methods and CDC scheme.


I. Introduction

In order to scale up machine learning applications that process a massive amount of data, various distributed computing frameworks have been developed in which data is stored and processed distributedly on multiple cores or GPUs of a single machine, or on multiple machines in computing clusters (see, e.g., [1, 2, 3]). When implementing these frameworks, the communication overhead of shuffling intermediate results across distributed computing nodes is a major performance bottleneck. For example, it was observed in [4] that on a Facebook Hadoop cluster, 33% of the job execution time was spent on data shuffling. This bottleneck is becoming worse for training deep neural networks with millions of model parameters (e.g., ResNet-50 [5]) using distributed stochastic gradient descent, where partial gradients with millions of entries need to be passed between computing nodes.

Conventionally, compression techniques are used to reduce the communication load by combining intermediate results of the same computation task as much as possible. For example, in the original MapReduce distributed computing framework [1], when the Reduce function is commutative and associative, a “combiner function” is proposed to pre-combine multiple intermediate values with the same key computed from different Map functions. Then, instead of sending multiple values to the reducer, the mapper only needs to send the pre-combined value whose size is the same as one of the values before combining, which significantly reduces the bandwidth consumption without any performance loss.
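To make the combining step concrete, the following is a minimal Python sketch of the combiner idea; the toy Map function and its outputs are hypothetical and only illustrate how values sharing the same key are pre-summed before they are shuffled.

    from collections import defaultdict

    def map_file(file_id, record):
        # Hypothetical Map function: emit one (key, value) pair per output function.
        return [(q, hash((q, record)) % 100) for q in range(3)]

    def combine(mapped_pairs):
        # Combiner: pre-sum all values with the same key before they leave the mapper,
        # so each key contributes a single value to the shuffle, regardless of how
        # many local records produced it.
        combined = defaultdict(int)
        for key, value in mapped_pairs:
            combined[key] += value
        return dict(combined)

    # Example: 4 local records produce 12 pairs, but only 3 values are shuffled.
    pairs = [p for r in ["a", "b", "c", "d"] for p in map_file(0, r)]
    print(len(pairs), len(combine(pairs)))   # 12 3

The same idea applies whenever the Reduce function is commutative and associative: the pre-combined value carries exactly the information the reducer needs.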

Coded distributed computing (CDC) is another approach, recently proposed in [6, 7], to mitigate the communication bottleneck. Unlike the compression/combining technique, CDC enables coding opportunities across intermediate results of different computation tasks to further reduce the communication load. In particular, within a MapReduce-type distributed computing model, CDC specifies a repetitive pattern of computing Map functions, creating side information at the computing nodes that enables coded multicasting during data shuffling across nodes, where each coded multicast packet is simultaneously useful for multiple Reduce tasks. For example, if we repeat each of the Map tasks r times across the cluster, then utilizing the CDC scheme, we can reduce the total amount of bandwidth consumption by a factor of r. It has been shown that CDC can provide substantial speedups in practice [8], and several generalizations of it have been developed in the literature [9, 10, 11, 12, 13].

In this paper, we focus on MapReduce-type distributed computing frameworks and propose a new scheme, named compressed coded distributed computing (in short, compressed CDC). It jointly exploits the above compression/combining technique and the CDC scheme to significantly reduce the communication load for computation tasks with linear Reduce functions (and arbitrary Map functions) that are prevalent in data analytics (e.g., distributed gradient descent where the partial gradients computed at multiple distributed computing nodes are averaged to reduce to the final gradient). Specifically, the compressed CDC scheme first specifies a repetitive storage of the dataset across distributed computing nodes. Each node, after processing locally stored files, first pre-combines the intermediate values of a single computation task needed by another node. Having generated multiple such pre-combined packets for different tasks, the computing node further codes them to generate a coded multicast packet that is simultaneously useful for multiple tasks. Therefore, compressed CDC enjoys both the intra-computation gain from combining, and the inter-computation gain from coded multicasting.

We characterize the achievable communication load of compressed CDC and show that it substantially outperforms both the combining method and the CDC scheme. In particular, compared with the scheme that only relies on the combining technique, compressed CDC reduces the communication load by a factor that is proportional to the storage size of each computing node, which is significant for the common scenarios where large-scale machine learning tasks are executed on commodity servers with relatively small storage. On the other hand, compared with the CDC scheme whose communication load scales linearly with the size of the dataset, compressed CDC eliminates this dependency by pre-combining intermediate values of the same task, allowing the system to scale up to handle computations on arbitrarily large datasets.

Other Related Work

Motivated by the fact that training algorithms exhibit tolerance to precision loss of intermediate values, and as opposed to the above lossless compression technique that guarantees exact recovery of computation results, a family of lossy compression (or quantization) algorithms for distributed learning systems has been developed to compress the intermediate results (e.g., gradients) for a smaller bandwidth consumption (see, e.g., [14, 15, 16]). Apart from compression, various coding techniques have also been recently utilized in distributed machine learning algorithms to mitigate the communication bottleneck and the stragglers' delay (see, e.g., [17, 18, 19, 20, 21, 22, 23, 24, 25]).

II. Motivating Example

In this section, we demonstrate, through a motivating example, how the compression and CDC techniques, applied alone or jointly, can help reduce the bandwidth requirement of distributed computing tasks.

Fig. 1: A MapReduce framework to compute 3 functions from 6 files with linear Reduce functions.

As shown in Fig. 1, we consider a MapReduce job of computing 3 output functions, represented by red/circle, green/square, and blue/triangle respectively, by processing 6 input files. When mapping a file, we obtain 3 intermediate values, one for each of the 3 functions, which are represented by the color/shape of the corresponding functions labelled by the file index. The Reduce operation of each output function computes its final result by summing up the intermediate values of the function from all 6 input files. This computation job is executed on 3 distributed computing nodes connected through a multicast network. Each node can store up to 4 files in its local memory. As shown in Fig. 2, we assign the computation tasks such that Nodes 1, 2, and 3 are respectively responsible for the final reduction of the red/circle, green/square, and blue/triangle functions. For this problem, we are interested in minimizing the communication load, which is the number of bits that need to be shuffled between computing nodes to accomplish the computation tasks, normalized by the size of a single intermediate value. Next, we describe three coded computing schemes and compare their communication loads.

Fig. 2: Coded computing schemes for a MapReduce job with linear Reduce functions, which processes 6 files to compute 3 functions, over 3 distributed computing nodes each with a storage size of 4 files.

For all three schemes, as illustrated in Fig. 2, the file placement is performed such that Node 1 stores the files 1, 2, 3, and 4, Node 2 stores the files 3, 4, 5, and 6, and Node 3 stores the files 1, 2, 5, and 6.

II-1 Compression scheme

Since only the sum of the intermediate values is needed for the final reduction, we can pre-combine the computed intermediate values of the same function at the sender node to reduce communication. For example, as shown in Fig. 2(a), having computed the green squares labelled by 1 and 2 in the Map phase, Node 1 sums them up and sends the computed sum to Node 2, instead of sending them individually. Upon receiving this pre-combined packet, Node 2 can directly use it for the final reduction of the green/square function. This compression scheme reduces the communication load by half, compared with the schemes that unicast uncoded intermediate values, and achieves a communication load of 3.

II-2 CDC scheme

Utilizing the redundant Map results across computing nodes, the CDC scheme creates coded multicast packets by combining intermediate values of different functions that are intended for different nodes. As shown in Fig. 2(b), since the blue triangle labelled by 3 is computed at both Nodes 1 and 2, and the green square labelled by 1 is computed at both Nodes 1 and 3, Node 1 can multicast the bit-wise XOR (denoted by ⊕) of these two intermediate values to the other two nodes. From this coded packet, both Nodes 2 and 3 can decode their intended values by cancelling their locally computed values. Since each of the multicast packets is simultaneously useful for two nodes, the CDC scheme cuts the communication load by half compared with the schemes that unicast uncoded intermediate values, and achieves a communication load of 3. While achieving the same communication load as the compression scheme that pre-combines intermediate values of the same function, the CDC scheme combines intermediate values from different functions, and allows each of them to be recovered individually instead of only their sum. Therefore, CDC can be utilized on more general MapReduce jobs with arbitrary Reduce functions to slash the communication load.
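The cancellation argument can be illustrated with a small Python sketch, using random byte strings as stand-ins for the two intermediate values of Fig. 2(b); the variable names are illustrative only.

    import os

    T = 8  # bytes per intermediate value (stand-in for T bits)
    xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))

    blue_3  = os.urandom(T)   # intermediate value of Node 3's function from file 3
    green_1 = os.urandom(T)   # intermediate value of Node 2's function from file 1

    # Node 1 has computed both values locally and multicasts their XOR.
    coded = xor(blue_3, green_1)

    # Node 2 also mapped file 3, so it cancels blue_3 and recovers green_1.
    assert xor(coded, blue_3) == green_1
    # Node 3 also mapped file 1, so it cancels green_1 and recovers blue_3.
    assert xor(coded, green_1) == blue_3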

II-3 Compressed CDC scheme

The above two techniques can be applied jointly to further reduce the communication load. In particular, we can generate coded multicast packets as in the CDC scheme from the pre-combined packets created as in the compression scheme. Each node, as shown in Fig. 2(c), sums up two pairs of intermediate values to generate two pre-combined packets, each of which is needed by another node. Then, for example, Node 1 first splits each of its pre-combined packets (the unlabelled green square and the unlabelled blue triangle) into two segments, and computes the bit-wise XOR of two segments, one from each of the pre-combined packets, generating a coded packet whose size is half of the size of an intermediate value. Finally, Node 1 multicasts this coded packet to Nodes 2 and 3. Similar operations are performed at Nodes 2 and 3. Next, each node utilizes its locally computed intermediate values to decode the intended pre-combined packet, which is used to reduce its output function. Compared with the compression and the CDC schemes, the compressed CDC scheme exploits both the compression opportunities within individual functions, and the multicasting opportunities across different functions, and achieves a communication load of 3/2.
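Below is a toy end-to-end simulation of this example in Python, assuming the file placement described above and integer vectors as stand-in intermediate values; it is only a sketch of the idea, not an implementation of the general scheme.

    import random

    FILES = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {1, 2, 5, 6}}   # node -> locally stored files
    FUNC  = {1: "red", 2: "green", 3: "blue"}                      # node -> function it reduces
    DIM   = 4                                                      # entries per intermediate value
    random.seed(0)

    # v[(f, n)]: intermediate value of function f mapped from file n (toy integer vector).
    v = {(f, n): [random.randrange(256) for _ in range(DIM)]
         for f in FUNC.values() for n in range(1, 7)}

    xor = lambda a, b: [x ^ y for x, y in zip(a, b)]
    presum = lambda f, files: [sum(v[(f, n)][i] for n in files) for i in range(DIM)]

    # packet[k]: the pre-combined sum Node k still needs (its function over its 2 missing files).
    missing = {k: set(range(1, 7)) - FILES[k] for k in FILES}
    packet  = {k: presum(FUNC[k], missing[k]) for k in FILES}

    def seg(k, i):
        # Half of packet[k] labelled by node i (one of the two nodes that can compute packet[k]).
        lo, _hi = sorted(set(FILES) - {k})
        return packet[k][:DIM // 2] if i == lo else packet[k][DIM // 2:]

    # Each node i multicasts one half-sized coded packet: the XOR of its labelled halves
    # of the two pre-combined packets it can compute locally.
    sent = {i: xor(*[seg(k, i) for k in sorted(set(FILES) - {i})]) for i in FILES}

    # Decoding at Node 2: cancel the half it can recompute itself, keep the half of packet[2].
    decoded = {}
    for i in set(FILES) - {2}:
        other = (set(FILES) - {i, 2}).pop()        # the interfering packet inside sent[i]
        decoded[i] = xor(sent[i], seg(other, i))   # Node 2 recomputes seg(other, i) from its files
    assert decoded[1] + decoded[3] == packet[2]    # halves labelled 1 and 3, in order

Each of the three multicast packets is half the size of an intermediate value, giving the load of 3/2.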

In the next section, we first give the general problem formulation, and then present our main results on the proposed coded computing scheme that jointly exploits both types of coding from the compression scheme and the CDC scheme.

III. Problem Formulation and Main Results

We consider a computation job of processing N input files, for some N ∈ ℕ, to compute Q output functions, for some Q ∈ ℕ. We denote the input files as w_1, …, w_N, and the output functions as φ_1, …, φ_Q. We focus on a class of computation jobs with linear aggregation, for which the computation of each output function can be decomposed as the sum of intermediate values computed from the input files, i.e., for q = 1, …, Q,

φ_q(w_1, …, w_N) = Σ_{n=1}^{N} v_{q,n} = Σ_{n=1}^{N} g_{q,n}(w_n),   (1)

where v_{q,n} = g_{q,n}(w_n) is the intermediate value of φ_q computed from the file w_n via some intermediate (Map) function g_{q,n}. So far, we have introduced one computation job that involves computing Q functions. Here, we consider the scenario where J such computation jobs are executed in parallel, for some J ∈ ℕ. We denote the input files of job j as w^j_1, …, w^j_N, and the output functions job j wants to compute as φ^j_1, …, φ^j_Q. As an example, we can consider executing J machine learning tasks (e.g., image classification), each of which has its own dataset and aims to obtain its own set of model parameters. Another example is a navigation application, where J navigation sessions, each of which requires finding the shortest path on a disjoint sector of the map, are executed in parallel.
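As a concrete instance of a decomposition of the form (1), the following Python sketch (an illustration with a hypothetical least-squares objective, not taken from the paper) shows how a full gradient is the sum of per-file partial gradients, so that only the sum of the intermediate values is ever needed at the reducer.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 6, 5                                  # N input "files" (data shards), model dimension d
    files = [(rng.normal(size=(10, d)), rng.normal(size=10)) for _ in range(N)]  # (X_n, y_n)
    theta = rng.normal(size=d)                   # current model

    def g(n, w):
        # "Map" function g_{q,n}: partial gradient of the squared loss on shard n.
        X, y = w
        return X.T @ (X @ theta - y)

    # phi(w_1, ..., w_N) = sum_n g_{q,n}(w_n): the full gradient equals the sum of the
    # intermediate values, so the reducer only needs their sum, not each value separately.
    full_grad = sum(g(n, w) for n, w in enumerate(files))
    X_all = np.vstack([X for X, _ in files]); y_all = np.concatenate([y for _, y in files])
    assert np.allclose(full_grad, X_all.T @ (X_all @ theta - y_all))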

III-A Network model

The above described computation jobs are executed distributedly on a computing cluster that consists of K distributed computing nodes, for some K ∈ ℕ. These computing nodes are denoted as Node 1, …, Node K. Here we assume Q ≥ K, and focus on a symmetric setting for the sake of load balancing, in which K divides Q, and each node is responsible for computing Q/K output functions for each job. The nodes are connected through an error-free broadcast network. Each node has a local storage that can store up to μNJ input files, i.e., a μ fraction of the entire dataset that contains all NJ input files from all jobs, for some μ satisfying 1/K ≤ μ < 1.

Before the computation starts, each node selects and stores a subset of the input files from the dataset. For each node k, we denote the set of indices of the files stored locally as M_k. A valid file placement has to satisfy 1) |M_k| ≤ μNJ, for all k ∈ {1, …, K} (local storage constraint), and 2) M_1 ∪ M_2 ∪ ⋯ ∪ M_K covers all NJ file indices (the entire dataset needs to be collectively stored across the cluster).
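A minimal sketch of checking these two constraints; the dictionary M plays the role of the sets M_k, and the example placement is the one assumed in the motivating example of Section II.

    def is_valid_placement(M, total_files, storage_limit):
        """M: dict node -> set of stored file indices (the sets M_k).
        Checks 1) the local storage constraint and 2) that the union covers the dataset."""
        within_storage = all(len(files) <= storage_limit for files in M.values())
        covers_dataset = set().union(*M.values()) == set(range(1, total_files + 1))
        return within_storage and covers_dataset

    # The 3-node placement of the motivating example (6 files, storage of 4 files each):
    M = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {1, 2, 5, 6}}
    print(is_valid_placement(M, total_files=6, storage_limit=4))   # True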

III-B Distributed computing model

The nodes process their locally stored files to compute the output functions following a MapReduce-type model. In particular, the overall computation proceeds in three phases: Map phase, Shuffle phase, and Reduce phase.

Map phase. For each file w^j_n of job j, n ∈ {1, …, N}, in its local storage, Node k maps it into Q intermediate values v^j_{1,n}, …, v^j_{Q,n}, one for each of the Q functions computed in job j. We assume that all the intermediate values across the J jobs have the same size of T bits, which is the case when, for example, we are training J image classifiers in parallel using the same deep neural network.

Shuffle phase. Before the Shuffle phase starts, for each computation job j, we assign the tasks of reducing the Q output functions symmetrically across the K nodes, such that each node computes a disjoint subset of Q/K functions. We denote the set of the indices of the output functions assigned to Node k for job j as W^j_k, for k ∈ {1, …, K}.

In the Shuffle phase, each node k produces a message, denoted by X_k, as a function of the intermediate values it computed locally in the Map phase, where l_k denotes the length of the message X_k in bits. Having generated X_k, Node k broadcasts it to all the other nodes.

Definition 1 (Communication Load).

We define the communication load, denoted by L, as the total number of bits contained in all broadcast messages, normalized by the size T of a single intermediate value, i.e.,

L = (1/T) Σ_{k=1}^{K} l_k.   (2)

Reduce phase. For each job j and each q ∈ W^j_k, Node k computes the output function φ^j_q as in (1), using the locally computed Map results and the broadcast messages received in the Shuffle phase.

III-C Main Results

For the above formulated distributed computing problem, we first study the effects of applying the compression scheme and the CDC scheme individually on reducing the communication load. Then, we present our main result, which is a communication load achieved by the proposed computing scheme that jointly utilizes compression and CDC.

Exploiting the compression technique, each sender node pre-combines all the intermediate values needed at the receiver node for a particular function, and then sends the pre-combined value. We demonstrate in the appendix that the following communication load can be achieved by solely applying compression.

L_comp = QJ · max{⌈1/μ⌉ − 1, 1}.   (3)

The above communication load achieved by compression only depends on the storage size μ (and the total number of output functions QJ). In the regime of μ ≥ 1/2, the communication load is a constant QJ that does not decrease as the storage size increases. This is because, as long as μ < 1, each node has to receive at least one intermediate value for each of the functions it is computing.

When only applying the CDC scheme without compression, as shown in [7], we can achieve the communication load

L_CDC = QNJ (1 − μ) / (μK).   (4)

The CDC scheme creates coded multicast packets that are each simultaneously useful for μK nodes. Hence, for a fixed storage fraction μ, the achieved communication load decreases inversely proportionally with the network size (K). On the other hand, since the CDC scheme was designed to handle general Reduce functions that require each of the intermediate values separately as inputs, the load also scales linearly with the number of input files (N).

We propose the compressed coded distributed computing (compressed CDC) scheme, which jointly utilizes the combining and the coded multicasting techniques, and achieves a smaller communication load than those achieved by applying each of the two techniques individually. We present the performance of compressed CDC in the following theorem.

Theorem 1.

To execute J computation jobs with linear aggregation of intermediate results, each of which processes N input files to compute Q output functions, distributedly over K computing nodes each with a local storage of size μNJ files, the proposed compressed CDC scheme achieves the following communication load

L_CCDC = QJ (μK + 1)(K − μK) / (μK²),   (5)

for μ such that μK is an integer with 1 ≤ μK ≤ K − 1, and J = x · C(K, μK + 1), for some x ∈ ℕ.

We describe the general compressed CDC scheme in the next section.

Remark 1.

Compared with the compression scheme whose communication load is given in (3), for large K, the proposed compressed CDC scheme reduces the communication load by a multiplicative factor that grows as the per-node storage shrinks, on the order of 1/μ when μ is small. In the scenarios where the cluster consists of many low-end computing nodes with small storage size (e.g., μ = Θ(1/K)), this bandwidth reduction can scale with the network size. Also, in contrast to the compression scheme, the load of compressed CDC keeps decreasing as the storage size increases.

Remark 2.

Unlike the communication load in (4) achieved by the CDC scheme, the communication load achieved by the compressed CDC scheme does not grow with the number of input files. This is accomplished by incorporating the compression technique, i.e., pre-combining multiple intermediate values of the same Reduce function.

Remark 3.

The file placement of the compressed CDC scheme is performed such that all input files of each particular computation job are placed exclusively on a unique subset of μK + 1 nodes, following a repetitive pattern specified by the CDC scheme. As a result, the compressed CDC scheme executes a batch of C(K, μK + 1) jobs in parallel, where C(K, μK + 1) denotes the number of size-(μK + 1) subsets of the K nodes. In the Shuffle phase of compressed CDC, each computing node first pre-combines several intermediate values of a single function reduced at another node, and then applies bit-wise XOR operations on multiple such pre-combined packets to generate a coded multicast packet that is simultaneously useful for the computations at μK other nodes. We note that these can be computations of different functions in the same job, as well as of functions in different jobs.

IV. Description of the compressed CDC scheme

In this section, we describe the proposed compressed CDC scheme, and analyze its communication load.

We consider the storage size μ such that μK is an integer with 1 ≤ μK ≤ K − 1, and take sufficiently many computation jobs to process in parallel, where the number of jobs J = x · C(K, μK + 1), for some x ∈ ℕ. The proposed compressed CDC scheme operates on a batch of C(K, μK + 1) jobs at a time, and repeats the same operations x times to process all the jobs. Therefore, it is sufficient to describe the scheme for the case of x = 1.

Along the general description of the compressed CDC scheme, we consider the following illustrative example.

Example (compressed CDC). We have a distributed computing cluster that consists of K = 4 nodes, each with a storage size of μ = 1/2 of the entire dataset (i.e., 12 files). On this cluster, we need to execute J = 4 MapReduce jobs with linear Reduce functions, each of which requires processing N = 6 files to compute Q = 4 output functions. Each node is responsible for computing one output function for each of the jobs. In particular, Node k computes

φ^j_k(w^j_1, …, w^j_6) = Σ_{n=1}^{6} v^j_{k,n},   (6)

for all j, k ∈ {1, 2, 3, 4}, where v^j_{k,n} is the intermediate value of the kth function of job j mapped from the nth input file of job j.

IV-A File placement

For each job j, j ∈ {1, …, C(K, μK + 1)}, all of its N input files are stored exclusively on a unique subset of μK + 1 nodes, and we denote the set of indices of these nodes as N_j. Within N_j, each file of job j is repeatedly stored on μK nodes. In particular, we first evenly partition the N files of job j into μK + 1 batches, and label each batch by a unique size-μK subset S of N_j. Then, we store all the files in a batch on each of the nodes whose index is in the corresponding subset S. We denote the set of indices of the files from job j in the batch labelled by a subset S as B^j_S. The file placement is performed such that for each S ⊂ N_j with |S| = μK, and each k ∈ S, we have

B^j_S ⊆ M_k,   (7)

for all j, where M_k is the set of indices of all files stored at Node k.

Applying the above file placement, each node in N_j stores μK·N/(μK + 1) files of job j. Since each node is in C(K − 1, μK) of the subsets N_j of size μK + 1, it stores overall C(K − 1, μK) · μK·N/(μK + 1) = μNJ files, satisfying its local storage constraint.
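The storage accounting above can be double-checked with a short Python sketch (an illustration with x = 1, where r stands for μK): it enumerates the job subsets and batches and counts how many files each node ends up storing.

    from itertools import combinations

    def place(K, r, N):
        """Compressed-CDC-style placement for one round (x = 1), with r = mu*K.
        Returns ({node: number of stored files}, J) for J = C(K, r+1) jobs of N files each."""
        assert N % (r + 1) == 0
        stored = {k: 0 for k in range(1, K + 1)}
        jobs = list(combinations(range(1, K + 1), r + 1))   # one job per (r+1)-subset of nodes
        for Nj in jobs:
            for S in combinations(Nj, r):                   # batches of N/(r+1) files each
                for k in S:
                    stored[k] += N // (r + 1)
        return stored, len(jobs)

    # The example of this section: K = 4, mu*K = 2, N = 6 files per job.
    stored, J = place(K=4, r=2, N=6)
    mu = 0.5
    print(stored, J)                              # every node stores 12 files, J = 4 jobs
    assert all(s == mu * 6 * J for s in stored.values())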

Fig. 3: File placement onto computing nodes. For each j, we place the set of N files for job j onto a unique subset of μK + 1 nodes, following a repetitive pattern where each file is stored on μK nodes.

Example (compressed CDC: file placement). As shown in Fig. 3, we perform the file placement such that for each j ∈ {1, 2, 3, 4}, the set of 6 files from job j is placed on a unique subset of 3 nodes. For example, the files of job 1 are exclusively stored on Nodes 1, 2, and 3. These files are partitioned into 3 batches of 2 files each, i.e., B^1_{{1,2}}, B^1_{{1,3}}, and B^1_{{2,3}}. Then, the two files in B^1_{{1,2}} are stored on Nodes 1 and 2, the two files in B^1_{{1,3}} are stored on Nodes 1 and 3, and the two files in B^1_{{2,3}} are stored on Nodes 2 and 3.

IV-B Coded computing

After the file placement, the compressed CDC scheme starts the computation and data shuffling in subsets of μK + 1 nodes. Within each subset N_j, j ∈ {1, …, C(K, μK + 1)}, that contains the indices of μK + 1 nodes, the computing scheme proceeds in two stages. In the first stage, the nodes in N_j process the files they have exclusively stored, i.e., the files of job j. In the second stage, they handle files from other jobs.

IV-B1 Stage 1 (coding for a single job)

In the first stage, the nodes in N_j only process input files and compute output functions for job j. For ease of exposition, we drop all the job indices in the rest of the description of stage 1. According to the file placement, each node in N_j stores μK·N/(μK + 1) files of job j, and each node in a subset S ⊂ N_j of μK nodes stores all the files in the batch B_S.

In the Map phase, each node maps all the files of job j it has stored locally, for all Q output functions of job j. We note that after the Map phase, for each subset S ⊂ N_j of size μK, and the node k = N_j \ S, each of the nodes in S has computed Q/K intermediate values, one for each of the Q/K functions assigned to Node k, from each of the N/(μK + 1) files in the batch B_S. More precisely, these intermediate values are

{v_{q,n} : q ∈ W_k, n ∈ B_S}.   (8)

In the Shuffle phase, within each subset S ⊂ N_j of size μK, we first perform the pre-combining operation as follows. For each i ∈ S, Node i sums up the intermediate values computed in (8) to obtain the pre-combined values

u_q = Σ_{n ∈ B_S} v_{q,n},   (9)

for all q ∈ W_k, where k = N_j \ S.

Having computed such pre-combined values u_q, q ∈ W_k, the nodes in S concatenate them to generate a packet p_S = (u_q : q ∈ W_k), and evenly and arbitrarily split it into μK segments. We label the segments by the elements in S. That is, for i ∈ S, we have

p_S = (p_{S,i} : i ∈ S).   (10)

Finally, each node i ∈ N_j generates a coded packet by computing the bit-wise XOR (denoted by ⊕) of the data segments labelled by i, i.e.,

c_i = ⊕_{S ⊂ N_j : |S| = μK, i ∈ S} p_{S,i},   (11)

and multicasts c_i to all other nodes in N_j.

After Node k receives a coded packet c_i from Node i, it cancels all the segments p_{S,i} with k ∈ S, and recovers the intended segment p_{N_j\{k},i}. Repeating this decoding process for all received coded packets, Node k recovers p_{N_j\{k}}, and hence the pre-combined values u_q = Σ_{n ∈ B_{N_j\{k}}} v_{q,n}, for all q ∈ W_k. Using these values, together with the local Map results, Node k computes the output φ_q for all q ∈ W_k. After the first stage of computation, each node in N_j completes its computation tasks for job j.

Since each of the coded packets in (11) contains QT/(μK·K) bits, the communication load exerted in the Shuffle phase of the first stage, within the subset N_j, is

L_1 = (μK + 1) Q / (μK²).   (12)

Example (compressed CDC: coding for a single job). We start describing the proposed scheme in the subset N_1 = {1, 2, 3} of Nodes 1, 2, and 3. In the first stage of computation, since the files of job 1 are exclusively stored on these nodes, the three nodes focus on processing job 1. The computation and communication scheme for this stage is the same as described for the example in Fig. 2(c). By the end of this stage, Nodes 1, 2, and 3 have computed their assigned functions for job 1. The first stage incurs a communication load of 3/2.
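The following symbolic Python sketch (an abstraction that tracks segment labels rather than actual data) spells out which segments each stage-1 coded packet combines, and why the receiving node can cancel all of them except the one it is missing.

    from itertools import combinations

    def stage1_shuffle(nodes):
        # Segment ('p', S, i): the piece, labelled by node i, of the pre-combined packet of batch S.
        r = len(nodes) - 1              # r = mu*K; the batches are the size-r subsets of `nodes`
        return {i: {('p', S, i) for S in combinations(nodes, r) if i in S} for i in nodes}

    def stage1_decode(coded, k):
        # Node k already knows every packet whose batch S contains k (it stores B_S and can
        # recompute it), so XOR-cancelling those leaves exactly one unknown segment per sender.
        recovered = set()
        for i, segments in coded.items():
            if i == k:
                continue
            unknown = {s for s in segments if k not in s[1]}
            assert len(unknown) == 1
            recovered |= unknown
        return recovered

    coded = stage1_shuffle((1, 2, 3))
    print(stage1_decode(coded, 3))   # both labelled halves of the packet of batch (1, 2)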

IV-B2 Stage 2 (coding across jobs)

In the second stage, we first take a node k outside N_j, and then for each i ∈ N_j, we label the job whose input files are exclusively stored on the nodes in (N_j \ {i}) ∪ {k} as j_i. Next, the nodes in N_j \ {i} process the files of job j_i in the batch B^{j_i}_{N_j\{i}} in the Map phase, and communicate the computed intermediate values needed by Node i in a coded manner.

For a node k outside N_j, and each i ∈ N_j, the nodes in N_j \ {i} share the batch B^{j_i}_{N_j\{i}} of N/(μK + 1) files of job j_i. In the Map phase, for each i ∈ N_j, each node in N_j \ {i} computes Q/K intermediate values, one for each function of job j_i assigned to Node i, from each of the files in the batch B^{j_i}_{N_j\{i}}. More precisely, each node in N_j \ {i} computes the intermediate values

{v^{j_i}_{q,n} : q ∈ W^{j_i}_i, n ∈ B^{j_i}_{N_j\{i}}}.   (13)

In the Shuffle phase, for each i ∈ N_j, the nodes in N_j \ {i} first pre-combine the Map results in (13) locally to compute

u^{j_i}_q = Σ_{n ∈ B^{j_i}_{N_j\{i}}} v^{j_i}_{q,n},   (14)

for all q ∈ W^{j_i}_i.

Next, as similarly done in the first stage, the nodes in N_j \ {i} first concatenate the above pre-combined values to form a packet p^{j_i} = (u^{j_i}_q : q ∈ W^{j_i}_i), and then split it into μK segments. We label these segments by the elements in N_j \ {i}, i.e., for m ∈ N_j \ {i}, we have

p^{j_i} = (p^{j_i}_m : m ∈ N_j \ {i}).   (15)

Finally, each node m ∈ N_j generates a coded packet by computing the bit-wise XOR of the data segments labelled by m, i.e.,

c_m = ⊕_{i ∈ N_j \ {m}} p^{j_i}_m,   (16)

and multicasts c_m to all other nodes in N_j.

We note that since the job j_i (whose input files are exclusively stored on the nodes in (N_j \ {i}) ∪ {k}) is different for different i, the above coded packet is generated using intermediate values from μK different jobs.

Having received a coded packet c_m from Node m, Node i cancels all the segments p^{j_{i'}}_m with i' ≠ i, and recovers the intended segment p^{j_i}_m. Repeating this decoding process for all received coded packets, Node i recovers p^{j_i}, and hence the partial sums u^{j_i}_q, for all q ∈ W^{j_i}_i.

We repeat the above Map and Shuffle phase operations for all nodes k outside N_j. By the end of the second stage, each node in N_j has recovered K − μK − 1 partial sums, contributing to the computation of its assigned functions from K − μK − 1 other jobs.

The communication load incurred in the Shuffle phase, for a particular node k outside N_j, is (μK + 1)Q/(μK²), and the total communication load of the second stage, within the subset N_j, is

L_2 = (K − μK − 1)(μK + 1) Q / (μK²).   (17)
Fig. 4: Illustration of the operations in the second stage of compressed CDC, in the subset of Nodes 1, 2, and 3. Note that in this stage, pre-combined packets from different jobs are utilized to create coded multicast packets.

Example (compressed CDC: coding across jobs). We now move on to describe the second stage of compressed CDC within the subset {1, 2, 3} via Fig. 4, where we represent the functions computed by Nodes 1, 2, and 3 by red/circle, green/square, and blue/triangle respectively, and the intermediate value of a function from a file by the corresponding color/shape labelled by the file index. In this stage, as shown in Fig. 4, each node maps 4 files, two of which belong to one job and the other two to another job. For example, Node 1 maps the two files of job 2 that are stored on Nodes 1 and 2, and the two files of job 3 that are stored on Nodes 1 and 3, producing two blue triangles (intermediate values of Node 3's function of job 2) and two green squares (intermediate values of Node 2's function of job 3). During data shuffling, each node first sums up the two intermediate values from the same job to create two pre-combined packets locally (e.g., the summation of the two blue triangles from job 2, and the summation of the two green squares from job 3, at Node 1). Then, as shown in Fig. 4, each node splits each of the computed sums evenly into two segments, computes the bit-wise XOR of two segments, one from each sum, and multicasts it to the other two nodes. Finally, each node decodes the intended sum from the multicast packets using its locally computed intermediate values. The second stage incurs a communication load of 3/2.
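The same kind of symbolic sketch as in stage 1 (again tracking labels only, not actual data) shows the stage-2 coded packets combining segments from different jobs, and the cancellation at the receiver; here a job is identified by the set of nodes that exclusively store it.

    def stage2_shuffle(Nj, k):
        # Segment ('job', nodes_of_job, m): the piece, labelled by node m, of the pre-combined
        # packet of the job exclusively stored on `nodes_of_job`, destined to the one node of Nj
        # that is missing from `nodes_of_job`.
        return {m: {('job', frozenset((set(Nj) - {i}) | {k}), m) for i in Nj if i != m} for m in Nj}

    def stage2_decode(coded, i):
        # Node i stores a batch of every job whose node set contains i, so it can cancel those
        # segments; the single remaining segment per sender belongs to the job it needs.
        recovered = set()
        for m, segments in coded.items():
            if m == i:
                continue
            unknown = {s for s in segments if i not in s[1]}
            assert len(unknown) == 1
            recovered |= unknown
        return recovered

    coded = stage2_shuffle(Nj=(1, 2, 3), k=4)
    print(stage2_decode(coded, 1))   # both labelled halves of the packet of the job stored on {2, 3, 4}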

Having performed this two-stage operation on all C(K, μK + 1) subsets of μK + 1 nodes, each node has finished computing its assigned functions from the C(K − 1, μK) jobs whose files it stores. For each of the remaining jobs, say job j with Node k outside N_j, and each size-μK subset S of N_j, Node k receives, in the subset S ∪ {k}, a partial sum of intermediate values over the batch B^j_S for each of the functions in W^j_k. Summing up these μK + 1 partial sums, Node k finishes computing each of its assigned functions from job j.

The overall communication load of compressed CDC is

L_CCDC = x · C(K, μK + 1) · (L_1 + L_2) = QJ (μK + 1)(K − μK) / (μK²).   (18)

Example (compressed CDC: final reduction). After the two-stage computations in the subset {1, 2, 3}, we repeat the same operations in the other subsets of 3 nodes. In the end, taking Node 1 as an example,

  • In subset {1, 2, 3}, Node 1 computes φ^1_1, and a partial sum of the intermediate values of φ^4_1 over the batch of job 4 stored on Nodes 2 and 3,

  • In subset {1, 2, 4}, Node 1 computes φ^2_1, and a partial sum of the intermediate values of φ^4_1 over the batch of job 4 stored on Nodes 2 and 4,

  • In subset {1, 3, 4}, Node 1 computes φ^3_1, and a partial sum of the intermediate values of φ^4_1 over the batch of job 4 stored on Nodes 3 and 4.

Finally, Node 1 computes φ^4_1 by adding up the received partial sums in the 3 subsets. We can verify that Nodes 2, 3, and 4 also successfully recover their assigned functions from the 4 jobs. The overall communication load is 12.

Remark 4.

For the above example, using only the combining technique to process each job, we would have communicated 4 pre-combined packets, one intended for each node, for each job, achieving a communication load of 16 in total. On the other hand, using the CDC scheme that only exploits the coded multicasting opportunities, we would have achieved a communication load of 24.
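As a quick numeric check, the following Python snippet plugs the parameters of this example into the load expressions (3), (4), and (5) as written above; the helper functions are only illustrations of those expressions.

    from math import ceil

    def L_compression(Q, J, mu):                      # expression (3)
        return Q * J * max(ceil(1 / mu) - 1, 1)

    def L_cdc(Q, N, J, mu, K):                        # expression (4)
        return Q * N * J * (1 - mu) / (mu * K)

    def L_compressed_cdc(Q, J, mu, K):                # expression (5)
        r = mu * K
        return Q * J * (r + 1) * (K - r) / (r * K)

    Q, N, J, mu, K = 4, 6, 4, 0.5, 4                  # parameters of the Section IV example
    print(L_compression(Q, J, mu),                    # 16   (combining only)
          L_cdc(Q, N, J, mu, K),                      # 24.0 (CDC only)
          L_compressed_cdc(Q, J, mu, K))              # 12.0 (compressed CDC)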

V. Conclusion

We propose a coded distributed computing scheme for MapReduce jobs with linear Reduce functions, named compressed coded distributed computing (compressed CDC), which achieves substantially smaller bandwidth consumption compared with the state-of-the-art schemes. Compressed CDC jointly exploits 1) pre-combining intermediate results for the same computation task, and 2) coded multicasting across different computation tasks, achieving a significant communication reduction compared with what is achieved by applying the above two techniques separately. A future direction is to develop lower bounds on the minimum communication load, and to study the optimality of the compressed CDC scheme.

Appendix: Communication Load of the Compression Scheme

For the schemes that solely apply the compression/combining techniques, we consider a class of single-job strategies where we repeat the same steps used to execute a single job, for all J jobs. Hence, it is sufficient to describe and analyze the scheme for the case where J = 1. In this case, each computing node stores μN files of the single job locally, and wants to compute Q/K output functions.

We first consider the case of small storage size where 1/K ≤ μ ≤ 1/2. In this case, we partition the indices of the N input files into ⌈1/μ⌉ batches, which are denoted as B_1, …, B_{⌈1/μ⌉}. Each of the first ⌈1/μ⌉ − 1 batches contains μN file indices, and the last batch contains the remaining file indices. In the file placement phase, for each i ∈ {1, …, ⌈1/μ⌉}, we place the input files whose indices are in B_i in the local storage of the nodes whose indices are congruent to i modulo ⌈1/μ⌉. In other words, Node k, k ∈ {1, …, K}, stores the files whose indices are in the batch B_{((k−1) mod ⌈1/μ⌉)+1}. We note that since μ ≥ 1/K, each batch of files is placed on at least one node.

In the Map phase, each node maps each of the files in its locally stored batch, generating the intermediate values for all Q output functions. In the Shuffle phase, for a node to compute a function assigned to it, apart from the intermediate values computed from the local batch of files, it needs the partial sums of intermediate values from the other ⌈1/μ⌉ − 1 batches. Assume that Node k stores the files in B_i locally; then, for some other node k' that stores a different batch B_{i'}, i' ≠ i, Node k' first pre-combines the intermediate values for the function φ_q, q ∈ W_k, to generate

Σ_{n ∈ B_{i'}} v_{q,n},   (19)

and sends this pre-combined packet to Node k. Having received ⌈1/μ⌉ − 1 such pre-combined packets, one from a node that stores each of the other batches, Node k computes the function φ_q by summing them up together with the intermediate values computed from the local batch. In this communication scheme, each node receives ⌈1/μ⌉ − 1 pre-combined packets, each of which has the same size as a single intermediate value, for each of its Q/K assigned functions, incurring a total communication load of

Q (⌈1/μ⌉ − 1).   (20)

We note that for the case of 1/2 < μ < 1, we can place the files in a total of 2 batches, and each node only receives a single pre-combined packet to compute each of its assigned functions, resulting in a total communication load of Q. For any μ < 1, since each node has to receive at least one intermediate value to compute each of its assigned functions, the incurred communication load of this class of schemes is at least Q. Hence, increasing the storage size beyond μ = 1/2 does not further reduce the communication load, and, accounting for all J jobs, we have

L_comp = QJ · max{⌈1/μ⌉ − 1, 1}.   (21)
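A small numeric illustration of this behavior (a sketch of expression (20)-(21) on a per-job basis): the load decreases with the storage size μ up to μ = 1/2 and then plateaus.

    from math import ceil

    def compression_load_per_job(Q, mu):
        # Each node receives one pre-combined packet per missing batch, per assigned function.
        batches = max(ceil(1 / mu), 2)        # at least two batches once mu > 1/2
        return Q * (batches - 1)

    for mu in (1/6, 1/4, 1/3, 1/2, 2/3):
        print(mu, compression_load_per_job(Q=3, mu=mu))
    # 1/6 -> 15, 1/4 -> 9, 1/3 -> 6, 1/2 -> 3, 2/3 -> 3  (no further reduction beyond mu = 1/2)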

References

  • [1] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Sixth USENIX OSDI, Dec. 2004.
  • [2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” 2nd USENIX HotCloud, vol. 10, p. 10, June 2010.
  • [3] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” NIPS, pp. 693–701, 2011.
  • [4] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, “Managing data transfers in computer clusters with orchestra,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, Aug. 2011.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE CVPR, pp. 770–778, 2016.
  • [6] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” 53rd Allerton Conference, Sept. 2015.
  • [7] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, “A fundamental tradeoff between computation and communication in distributed computing,” IEEE Trans. Inf. Theory, vol. 64, no. 1, Jan. 2018.
  • [8] S. Li, S. Supittayapornpong, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded terasort,” IPDPS ParLearning Workshop, May 2017.
  • [9] S. Li, Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “A scalable framework for wireless distributed computing,” IEEE/ACM Trans. Netw., vol. 25, no. 5, pp. 2643–2654, Oct. 2017.
  • [10] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded distributed computing: Straggling servers and multistage dataflows,” 54th Allerton Conference, Sept. 2016.
  • [11] Y. H. Ezzeldin, M. Karmoose, and C. Fragouli, “Communication vs distributed computation: an alternative trade-off curve,” e-print arXiv:1705.08966, 2017.
  • [12] M. Kiamari, C. Wang, and A. S. Avestimehr, “On heterogeneous coded distributed computing,” IEEE GLOBECOM, Dec. 2017.
  • [13] K. Konstantinidis and A. Ramamoorthy, “Leveraging coding techniques for speeding up distributed computing,” e-print arXiv:1802.03049, 2018.
  • [14] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns,” Interspeech, 2014.
  • [15] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” NIPS, pp. 1707–1718, 2017.
  • [16] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary gradients to reduce communication in distributed deep learning,” NIPS, pp. 1508–1518, 2017.
  • [17] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1514–1529, 2018.
  • [18] S. Dutta, V. Cadambe, and P. Grover, “Short-Dot: Computing large linear transforms distributedly using coded short dot products,” NIPS, pp. 2100–2108, 2016.
  • [19] R. Tandon, Q. Lei, A. Dimakis, and N. Karampatziakis, “Gradient coding,” NIPS Machine Learning Systems Workshop, 2016.
  • [20] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “A unified coding framework for distributed computing with straggling servers,” IEEE NetCod, Dec. 2016.
  • [21] ——, “Coding for distributed fog computing,” IEEE Commun. Mag., vol. 55, no. 4, Apr. 2017.
  • [22] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” NIPS, pp. 4406–4416, 2017.
  • [23] ——, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,” e-print arXiv:1801.07487, 2018.
  • [24] L. Song, C. Fragouli, and T. Zhao, “A pliable index coding approach to data shuffling,” IEEE ISIT, pp. 2558–2562, 2017.
  • [25] M. A. Attia and R. Tandon, “Information theoretic limits of data shuffling for distributed learning,” IEEE GLOBECOM, Dec. 2016.