1. Introduction
Cloud computing and large distributed frameworks like Apache Spark (Zaharia et al., 2010) are widely used because they enable the efficient, low-cost execution of large-scale applications, such as machine learning and graph applications, on data sizes of the order of tens of terabytes and more. However, as we “scale out” computations across many distributed nodes, one needs to deal with the “system noise” that is due to several factors such as heterogeneity of hardware, hardware failures, disk I/O delay, communication delay, operating system issues, maintenance activities, and power limits (Ananthanarayanan et al., 2010). System noise leads to uneven execution latencies where different servers may take different amounts of time to execute the same task, even if the servers have identical hardware configurations. In the extreme case, a server may even be an order of magnitude slower than the remaining servers; such a server is called a straggler node. Such speed variations create significant delays in task executions and can also lead to major performance bottlenecks, since the master node waits for the slowest worker nodes to finish their tasks. This phenomenon results in tail latency, which can be defined as the high-percentile completion latency of the distributed tasks. As the number of servers within a cluster experiencing this speed variance increases, the probability of long tail latency increases exponentially
(Dean and Barroso, 2013). Replication approaches are commonly used today to deal with the straggler delay bottleneck. For example, in distributed computation and storage frameworks like Hadoop MapReduce (had, 2014) and Spark, the data that needs to be processed is split into partitions by the master node, and each data partition is replicated across a subset of workers. The master node then keeps track of the progress of the task on each worker node. After completion of a certain fraction of the tasks, if the master observes that a particular node’s progress is slow, it schedules a copy of that slow task on another node which holds a copy of the data partition to be processed. Whichever node finishes first, either the original or the replica, returns the results to the master node, and the other copy of the task is terminated. This technique is used in the Hadoop MapReduce framework (had, 2014). However, this technique is reactive, since the master node waits until most of the tasks finish their execution before launching replicas of tasks running on straggler nodes. In addition, a replica can only be launched on a restrictive subset of nodes. This process significantly degrades the overall execution time.
Recently, it has been shown that “coding” can provide a novel approach to mitigate the tail latency caused by straggler nodes, and a new framework named “coded computation” has been proposed (Lee et al., 2016; Li et al., 2016a; Reisizadeh et al., 2017; Yu et al., 2017; Dutta et al., 2016). The key idea of coded computation frameworks is to inject computation redundancy in an unorthodox coded form (as opposed to the state-of-the-art replication approaches) in order to create robustness to stragglers. For example, it was shown in (Lee et al., 2016) that error-correcting codes (e.g., Maximum-Distance-Separable (MDS) codes; MDS codes are an important class of block codes since they have the greatest error-correcting and -detecting capabilities, see (Hill, 1986), Chapter 16) can be utilized to create redundant computation tasks for linear computations (e.g., matrix multiplication).
Overview of MDS coding: An (n,k)-MDS coded computation first decomposes the overall computation into k smaller tasks, for some k ≤ n. Then it encodes them into n coded tasks using an (n,k)-MDS code, and assigns each of them to a node to compute. From the desirable “any k of n” property of the MDS code, we can accomplish the overall computation once we have collected the results from the fastest k out of the n coded tasks, without worrying about the tasks still running on the slow nodes (or stragglers). We observed that figuring out the amount of redundancy to provision during coding, i.e., the value of k, poses a significant challenge because it is hard to estimate the “noise” or speed variability in the compute nodes beforehand. For an (n,k)-MDS code the assumption is that there may be at most n−k very slow nodes or failures. But estimating the number of stragglers a priori is an extremely challenging task (Ananthanarayanan et al., 2013; Zaharia et al., 2008). As such, assuming worst-case scenarios for the number of stragglers and specifying a conservative k seems to be the best possible option. Coded computation schemes use these parameters to encode the data and to determine how much of the coded dataset is processed by each of the compute nodes. The smaller the value of k (i.e., the more conservative, highly redundant), the larger the amount of computation performed by each node in the cluster. If the number of persistent stragglers during a particular execution instance is fewer than what the scheme is built to support, the efficiency of coded computing drops. For instance, in MDS coding as explained above there is no significant performance benefit if there are fewer than n−k stragglers, since the coded computation still has to wait for k nodes to complete their execution. In cloud computing systems, partial stragglers are more often encountered, i.e., nodes that are slower but can do a partial amount of the work assigned to them. The existing coded computation schemes always waste the compute capability of the partial stragglers and do not take advantage of the fact that the data needed for computation already exists on them and they can do a partial amount of work (more on this in section 7.2). It is this lack of elasticity that makes coded computing unpalatable in large-scale cluster settings. What is ideal is to allow the developer to select high-redundancy coding to be conservative (essentially assuming a reasonable worst-case straggler scenario) but allow a workload scheduler to decide how much redundant computing to perform based on observed speed variations in a distributed or cloud computing environment.
In this work, we design a new dynamic workload distribution strategy for coded computing that adapts elastically to the speeds of nodes measured during runtime, irrespective of how much redundancy is chosen for creating the coded data. Our proposed S2C2 (Slack Squeeze Coded Computing) strategy adapts to a varying number of stragglers by squeezing out any computation slack that may be built into the coded computation to tolerate worst-case execution scenarios. The performance of S2C2 is determined by the actual speeds measured and the actual number of very slow nodes seen, rather than by the redundancy used in encoding the data. As the speeds of nodes change, S2C2 responds by appropriately increasing or decreasing the amount of work allocated to each node in the cluster to maximize performance.
To predict the speeds of the nodes as they change during runtime, we propose a novel prediction mechanism. We model speed prediction as a time series forecasting problem and use a Long Short-Term Memory (LSTM) based learning model to predict the speeds. These predicted speeds are used by S2C2 to do work allocation among the nodes. In summary, the main contributions of this paper are as follows:

We empirically study and evaluate the speed variations of compute nodes in a large-scale cloud computing cluster. We then propose an LSTM-based model to predict the speed of each node in the next computation round.

We propose S2C2, which exploits the computational redundancy available in the system and elastically distributes work to worker nodes using predictions from the LSTM model. It increases performance without compromising robustness.

We propose two variations of S2C2 and evaluate their performance on our local cluster and on the cloud while running machine learning and graph ranking workloads. We also propose a new fine-grained replication approach that combines over-decomposition of data (Laxmikant Kale, 1993) with speed prediction based workload distribution.

We first demonstrate the performance gains of S2C2 on top of an MDS-coded dataset by deploying it on the DigitalOcean cloud (dig, [n. d.]). While executing algorithms such as gradient descent, graph ranking, and graph filtering, S2C2 is able to reduce the total compute latency by up to % over the conventional coded computation and by up to % over the fine-grained replication approach.
The rest of the paper is organized as follows: section 2 provides background on coded computation; section 3 describes speed prediction and the overheads of coded computation; section 4 describes the proposed S2C2 algorithm; section 5 describes extensions to non-linear coded computing; section 6 provides implementation and system details; section 7 shows evaluations; and section 8 describes related work.
2. Coded Computing Background
In this section we briefly introduce coded computation. For clarity of exposition we first focus on how coded computing is applied to linear algebraic operations. Let us consider a distributed matrix multiplication problem where a master node wants to multiply a data matrix A with an input vector x to compute y = Ax. The data matrix A is distributed across worker nodes, on which the matrix multiplication will be executed in a distributed manner. One natural approach to tackle this problem is to vertically and evenly divide the data matrix A into submatrices, each of which is stored on one node. Then, when each node receives the input x, it simply multiplies its locally stored submatrix with x and returns the result, and the master vertically concatenates the returned results to obtain the final product. However, we note that since this uncoded approach relies on successfully retrieving the task results from all nodes, it has a major drawback: once one of the nodes runs slow, the computation may take long to finish. The coded computation framework deals with slow or straggler nodes by optimally creating redundant computation tasks. For example, a (3,2)-MDS coded computing scheme vertically partitions the data matrix A into two submatrices A1 and A2, and creates one redundant task by summing A1 and A2. Then A1, A2, and A1+A2 are stored on worker nodes 1, 2, and 3 respectively. In this case the final result is obtained once the master receives the task results from any 2 out of the 3 nodes, without needing to wait for the slow/straggler node. Let us assume worker node 2 is a straggler and the master node only collects results from nodes 1 and 3. Then the master node can compute A2x by subtracting the computed task of node 1, i.e. A1x, from the computed task of node 3, i.e. (A1+A2)x.
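The (3,2)-MDS decode step described above can be checked with a short, self-contained sketch (the toy matrices and straggler choice here are our own, purely for illustration; pure Python, no libraries):

```python
# Toy (3,2)-MDS coded matrix-vector multiply.
# A is split into row blocks A1 and A2; worker 3 stores the parity A1 + A2.

def mat_vec(M, x):
    """Plain matrix-vector product over Python lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

A1 = [[1, 2], [3, 4]]          # stored on worker 1
A2 = [[5, 6], [7, 8]]          # stored on worker 2 (the straggler)
A3 = [[a + b for a, b in zip(r1, r2)]
      for r1, r2 in zip(A1, A2)]  # worker 3 stores A1 + A2
x = [1, 1]

y1 = mat_vec(A1, x)            # worker 1 responds
y3 = mat_vec(A3, x)            # worker 3 responds; straggler worker 2 is ignored
y2_recovered = [b - a for a, b in zip(y1, y3)]   # A2*x = (A1+A2)*x - A1*x

assert y2_recovered == mat_vec(A2, x)
y = y1 + y2_recovered          # concatenating the two row blocks gives the full A*x
```

The master thus obtains the full product from any 2 of the 3 responses, which is exactly the "any k of n" property.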
Broader use of coded computing: MDS-coded computing can inject redundancy to tolerate stragglers in linear computations. Coded computing is applicable to a wider range of compute-intensive algorithms, going beyond linear computations. Polynomial coded computing (Yu et al., 2017; Gupta et al., 2018) can tolerate stragglers in computations that solve polynomial equations such as Hessian matrix computation. Lagrange coded computing (Yu et al., 2018)
can add coded redundancy to tolerate stragglers in any arbitrary multivariate polynomial computations such as general tensor algebraic functions, inner product functions, function computing outer products, and tensor contractions
(Renteln, 2013). A recent work (Kosaian et al., 2018) demonstrated some promising results while extending coded computing to deep learning inference.
3. Motivation
3.1. Overheads
Consider an uncoded strategy with replication, i.e., each data partition is replicated across r different worker nodes, where r is the replication factor. Consider a node executing a task t on data partition p. If the node is determined to be a straggler at some future time, the master node can replicate task t on any one of the nodes which has a replica of data partition p to speed it up. However, there are several challenges. First, when should the master determine that the node is a straggler? Second, even if the master has early knowledge of the node being a straggler, it is restricted to launching the task only on the subset of nodes that have the required data partition p. Third, in the worst case, if all the nodes with replicas are also stragglers, i.e., if the system has r stragglers, the uncoded replication strategy cannot speed up the computation at all. An alternative is to move the data partition to another available faster node and execute the task on that node. This option forces data transfer time into the critical path of the overall execution time. Let us consider the Uncoded with 2-replication scheme in Figure 1. If the number of very slow or straggler nodes is 2 or more, computation latency increases significantly. Similar behavior can be observed for the Uncoded with 3-replication scheme when there are 3 or more stragglers.
Next let us consider (n,k)-MDS coded computation on matrix multiplication. The master node divides the original matrix into k submatrices, encodes them into n partitions and distributes them to n workers. As we discussed before, a small k needs to be chosen to deal with worst-case scenarios. However, this overprovisioning with a small k comes with a price. If the original data size is D, then each of the n worker nodes must compute on a coded partition of size D/k. As k becomes smaller, each worker node has to execute a larger computation, independent of its actual speed. On the other hand, if the designer picks a large k, then the robustness of the computation decreases. This is a difficult tradeoff, since the selection of k must be done prior to creating a correct encoding and decoding strategy and distributing the encoded data partitions appropriately to all nodes, which are usually done once before executing the given workload.
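The tradeoff can be made concrete with a little arithmetic: under an (n,k)-MDS code each worker computes on D/k rows while up to n−k stragglers are tolerated. A minimal sketch (the dataset size is an arbitrary illustrative value):

```python
def mds_cost(n, k, D):
    """Per-worker computation (rows) and straggler tolerance of an (n,k)-MDS code."""
    assert 1 <= k <= n
    return {"rows_per_worker": D / k, "stragglers_tolerated": n - k}

D = 90000  # rows in the original matrix (illustrative)
print(mds_cost(12, 10, D))  # {'rows_per_worker': 9000.0, 'stragglers_tolerated': 2}
print(mds_cost(12, 9, D))   # {'rows_per_worker': 10000.0, 'stragglers_tolerated': 3}
```

Moving from (12,10) to (12,9) buys one extra straggler of tolerance at the cost of roughly 11% more work per worker on every run, slow nodes or not.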
One solution to deal with the straggler uncertainty in MDS-coded computation is to store multiple encoded partitions in each worker node, such that the system can adapt and choose the appropriate encoded partition dynamically when the number of stragglers changes in the cluster. For example, in a cluster with 12 worker nodes, each worker node can store a (12,9)-MDS encoded partition and a (12,10)-MDS encoded partition at the same time. Assume the original data size is D. When it is observed that there are three straggling nodes, (12,9)-MDS coded computation will be performed with each worker node operating on an encoded partition of size D/9; and when it is observed that there are fewer straggling nodes, (12,10)-MDS coded computation is performed with each worker node operating on a partition of size D/10. This approach is optimal only for two scenarios, and supporting a wider range of scenarios means storing more copies of the encoded data. This dramatically increases the storage overhead. It is possible to encode the data at run time and redistribute the large data partitions based on measured speeds and slow node count. However, this would dramatically increase the communication overhead and is not practical.
Figure 1 shows the computation time for (12,10)-MDS coded computation and (12,9)-MDS coded computation with a varying number of stragglers. The computation time of (12,10)-MDS coded computation increases exponentially when there are more than 2 stragglers. The computation time of (12,9)-MDS coded computation remains constant with more stragglers. But there is a significant increase in baseline execution latency with this strategy compared to the other strategies, because the (12,9)-MDS code requires each worker node to perform more work than the (12,10)-MDS code, even if the number of stragglers is fewer than 3.
In summary, although conservative MDS-coded computation can provide robust protection against stragglers, its computation overhead per node is higher and remains the same even when all the nodes in the cluster are fast, since it does not make efficient use of all worker nodes. These drawbacks bring us to our key idea, which is a workload scheduling strategy that provides the same robustness as (n,k)-MDS coded computation but induces only the smaller computation overhead of an (n,s)-MDS coded computation when there are only n−s stragglers in the cluster, with s > k. We present our proposed S2C2 (Slack Squeeze Coded Computation) scheme in the next section.
3.2. Speed prediction and Coded data
In the introduction we noted that it is important to consider the speed variations across compute nodes when determining the efficacy of discarding the work done by slow nodes in the MDS coded computing framework. To collect and analyze the execution speeds of servers, we conducted experiments on 100 compute nodes, referred to as droplets, in the DigitalOcean cloud (dig, [n. d.]). Each droplet is similar to a t2.micro shared compute instance in Amazon AWS. For our experiments, each droplet node executes matrix-matrix multiplication and logs its execution time after completion of every 1% of the task. The size of each matrix is 20000 by 5000. We analyzed the measured speeds at 1% granularity intervals at all nodes. Figure 2 shows the speed variations in 4 of the representative nodes. The x-axis in each plot corresponds to time. The y-axis in each plot corresponds to the speed of the node normalized by its maximum observed speed during the experiment.
One critical observation from the figure is that while the speed of each node varies over time, on average the speed observed at any time slot stays within 10% for about 10 samples in the neighborhood. This relatively slowly changing speed provides us an opportunity to estimate the speeds of nodes in future intervals using speeds from past intervals. The speed estimates can be reasonably accurate for most of the time intervals, except for a short time window when the speed changes drastically; but even then, we will soon be able to track the new speed as the nodes stay in that speed zone.
To find a good prediction mechanism, we treated the speeds of each node as a time series and modeled our problem as a time series forecasting problem. We evaluated LSTM (Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)) and Auto-Regressive Integrated Moving Average (ARIMA) models to predict the speeds. We found that the LSTM models provided the best prediction accuracy among all models. The model predicts the speeds of the nodes within % of their actual values. In statistical terms, the Mean Absolute Percentage Error of the model on the test set is %. This prediction error is 5% better than simply using the speed from the previous iteration. As expected, only immediately after a large speed variance is observed does the prediction lag behind, but it catches up with the observed speed soon after.
Based on this critical observation, we hypothesize that reliably estimating the speeds for the next computation round allows the master node to perform appropriate task assignment to the workers such that the computations performed by all workers can be utilized to obtain the final result. But this fine-granularity task assignment and utilization of all worker nodes becomes feasible only if there is no data movement overhead between rounds of computation. Coded computing is well suited for this fine-grained task assignment, since the input data that is distributed among workers is encoded and, as a result, no additional data movement is needed between rounds of computation. However, this feature is not exploited in conventional MDS-coded computation. In uncoded computation, to assign workload optimally based on the predicted speeds, either each worker node needs to store a significant percentage of the entire data, which can impose a huge storage overhead, or the master must redistribute the data among nodes at runtime, which can add a huge communication overhead for iterative workloads such as gradient descent and page rank. To measure the storage overhead of uncoded computation, we performed experiments in our local cluster consisting of 12 worker nodes. We measure the total data moved to each node between rounds of computation and consider it the effective storage needed at that node to avoid additional data movement. Figure 3 shows the mean effective storage needed at each node to avoid data movement during the course of 270 gradient descent iterations for Logistic Regression. In this experiment, the uncoded computation has accurate predictions of the speeds of nodes for the next iteration. It needs a large fraction of the total data to be stored at each worker node to have zero data movement overhead. For S2C2 with (12,10)-MDS coding the data storage remains fixed at 1/10 of the total data, much lower than that of the uncoded computation.
Following from these observations, we argue for S2C2, which exploits the unique feature of coded data availability and thereby utilizes the compute capacity of all worker nodes.
4. S2C2
4.1. Basic algorithm
The major goals of the S2C2 algorithm are to achieve high tolerance to stragglers and to reduce the computation work assigned per worker when the number of slow nodes observed during run time is less than the conservative estimate. To achieve high straggler tolerance, the master node encodes and distributes the large matrix using a conservative (n,k)-MDS coding once at the beginning. To assign reduced computation work to worker nodes, the master node then employs the S2C2 algorithm. Two key insights underlie our algorithm in a cluster using conservative MDS coding:

Each worker node stores a high-redundancy encoded matrix data partition.

The master node can decode and construct the final product as long as it receives any k out of n responses corresponding to each row index of the partitioned matrix.
Let there be ns ≤ n−k stragglers in the (n,k)-MDS coded cluster. As we explained in the previous section, when there are ns stragglers, (n,s)-MDS coding with s = n−ns is the best-suited coding strategy. But rather than using a new (n,s)-MDS code to re-encode the data, we use the (n,k)-MDS coded data partitions as they are, and instead change the amount of work done by each node. In particular, S2C2 allocates a decodable computational work assignment per node equal to that in (n,s)-MDS coding instead of (n,k)-MDS coding. If r is the number of rows in the original matrix, each node gets an allocation of r/s rows to be computed.
Figure 4 provides an illustration of the S2C2 strategy in a cluster consisting of 4 worker nodes (and 1 master node). Figure 4(a) shows the conventional (4,2)-MDS coded computation performed when worker 4 is the only straggler node and the remaining 3 workers have the same speed. Note that (4,2)-MDS coding is conservative here, since it can support 2 stragglers but in this case there is only 1 straggler. Each worker node computes on its full partition, but the master node needs only the results from workers 1 and 2 and can ignore the result from worker 3. Submatrices A1 and A2 refer to the vertical divisions of the matrix A. The data stored in worker 3 is a coded matrix, A1+A2. The data stored in worker 4 is coded as A1+2A2. These codes are generated as per MDS-coding principles.
Figure 4(b) shows the conventional (4,3)-MDS coded computation when worker 4 is the straggler node. Each non-straggler node computes on its full partition, but the size of the partition here is smaller than the partition size under the previous coding. The master node needs the results of all three non-straggler workers to construct the final product. Submatrices B1, B2, and B3 are the vertical divisions of the matrix A into three parts. The data stored in worker 4 is coded as B1+B2+B3.
S2C2 with (4,2)-MDS coded computation for this scenario is shown in Figure 4(c). If we consider the data in each worker as composed of 3 equal-size partitions, worker node 1 computes only on the first and second of its partitions. Worker 2 computes only on the first and third of its partitions. Worker 3 computes only on the second and third of its partitions. As a result, each worker node performs less computation, and this computation is equal to the amount performed by each worker in conventional (4,3)-MDS coded computation. The partitions to be computed at each worker are assigned so as to ensure that each row index is computed by exactly two workers. This is necessary for the master node to successfully decode the results at the end of the computation.
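One simple way to realize such an assignment is to deal the chunk indices cyclically across the live workers; the sketch below is our own reading of the scheme (the helper name and the exact dealing order are assumptions, not the paper's code), and it reproduces the scenario above, where each chunk ends up covered by exactly k = 2 workers:

```python
from collections import Counter

def basic_s2c2_assignment(n, k, ns, c):
    """Chunk assignment for basic S2C2 (illustrative sketch).

    n, k : parameters of the conservative (n,k)-MDS code
    ns   : number of stragglers observed (ns <= n - k)
    c    : number of chunks each worker's stored partition is divided into
    Returns {live_worker_index: [chunk indices it computes]}.
    Requires s = n - ns to divide k * c evenly.
    """
    s = n - ns                          # live (non-straggler) workers
    per_worker = k * c // s             # chunks each live worker computes
    assert per_worker * s == k * c, "s must divide k*c"
    seq = [t % c for t in range(k * c)]  # k cyclic passes over the chunk indices
    return {w: seq[w * per_worker:(w + 1) * per_worker] for w in range(s)}

# The scenario above: (4,2)-MDS code, one straggler, 3 chunks per stored partition.
plan = basic_s2c2_assignment(n=4, k=2, ns=1, c=3)
coverage = Counter(chunk for chunks in plan.values() for chunk in chunks)
# Every chunk index is computed by exactly k = 2 workers, so the master can decode.
assert all(count == 2 for count in coverage.values())
```

Each live worker computes k·c/s = 2 of its 3 chunks, matching the r/s-rows-per-node allocation of the basic algorithm.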
4.2. General S2C2 Algorithm
In cloud computing services and data centers, compute nodes within a cluster can run at different speeds, as described in section 3.2, due to being shared or due to various microarchitectural events such as cache misses and other control/data bottlenecks. They can also be heterogeneous. We present a general S2C2 algorithm which, unlike basic S2C2, can consider the variation in speeds of all nodes and assign work to them accordingly. At the beginning of the execution of every application, the matrix data is partitioned, encoded, and distributed to the worker nodes using MDS coding. For efficient decoding and work allocation, the general S2C2 algorithm also decomposes and considers each matrix partition as composed of chunks (groups) of rows, i.e., over-decomposition. The speed predictions from the LSTM model are provided to the general S2C2 algorithm. Workers are then sorted according to their speeds. Starting from the worker with the highest speed, each worker is assigned a number of chunks to compute equal to the ratio of its speed to the total available computational speed of all workers. If the chunks assigned to a worker turn out to be more than the total chunks in the partition already stored at that worker, the algorithm reassigns these extra chunks to the next worker. This case occurs when one worker is much faster than all other workers. The algorithm is summarized in Algorithm 1. In the case where all non-straggler nodes have equal speed, general S2C2 reduces to simple S2C2.
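The core allocation loop of this algorithm can be sketched as follows (an illustrative reimplementation; the function name, rounding rule, and spill-over handling are our own reading of the description, not the paper's Algorithm 1 verbatim):

```python
def general_s2c2_allocation(speeds, total_chunks, chunks_per_worker):
    """Allocate chunks proportionally to predicted speeds, capped per worker.

    speeds            : predicted speed of each worker for the next round
    total_chunks      : chunk-computations that must be performed this round
    chunks_per_worker : chunks stored at each worker (a worker cannot compute
                        more than it stores; extras spill to the next worker)
    """
    order = sorted(range(len(speeds)), key=lambda w: speeds[w], reverse=True)
    total_speed = sum(speeds)
    alloc = [0] * len(speeds)
    remaining = total_chunks
    for w in order:                      # fastest workers first
        share = round(total_chunks * speeds[w] / total_speed)
        alloc[w] = min(share, chunks_per_worker, remaining)
        remaining -= alloc[w]
    for w in order:                      # rounding remainder -> spare capacity
        if remaining == 0:
            break
        extra = min(remaining, chunks_per_worker - alloc[w])
        alloc[w] += extra
        remaining -= extra
    assert remaining == 0, "cluster cannot absorb the requested work"
    return alloc

# Five workers with relative speeds {2,2,2,2,1}, 36 chunk-computations needed,
# at most 9 chunks stored per worker (the scenario used in section 5).
print(general_s2c2_allocation([2, 2, 2, 2, 1], 36, 9))  # -> [8, 8, 8, 8, 4]
```

With equal speeds the proportional shares are identical, so this degenerates to the simple S2C2 assignment, as the text notes.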
4.3. Dealing with misprediction or failures
The speed prediction algorithm can mispredict when there is a sudden and significant change in the speeds of workers. Also, a worker node can die or fail during execution. To handle these scenarios, the S2C2 algorithm employs a timeout mechanism. S2C2 collects results from the first workers that complete their work and measures their average response time. If the remaining workers do not respond within 15% of the average response time, S2C2 treats this situation as a misprediction and reassigns the pending work among the completed workers. We choose 15% based on the average error from our speed prediction algorithm (16.7%).
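Under our reading, the timeout check might look like the following sketch (the function, the use of the first k finishers, and the elapsed-time bookkeeping are assumptions for illustration; in a real run the "times" of unfinished workers would be their elapsed time at the moment of the check):

```python
def detect_mispredictions(response_times, k, slack=0.15):
    """Flag workers whose (elapsed) response time exceeds the average of the
    first k finishers by more than `slack` (15% by default); their pending
    work is reassigned among the workers that already finished."""
    finished = sorted(response_times.items(), key=lambda kv: kv[1])
    avg_k = sum(t for _, t in finished[:k]) / k
    deadline = avg_k * (1 + slack)
    stragglers = [w for w, t in response_times.items() if t > deadline]
    return stragglers, deadline

# Illustrative times in seconds: w4 has stalled well past the 15% window.
times = {"w1": 1.00, "w2": 1.02, "w3": 0.98, "w4": 2.50}
stragglers, deadline = detect_mispredictions(times, k=3)
assert stragglers == ["w4"]
```

A dead node never responds at all, so it is caught by the same deadline, which is how this mechanism also covers outright failures.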
4.4. Robustness of S2C2
Coded computing with S2C2 is robust and can tolerate the same number of stragglers as conventional coded computing because:

The data distribution in S2C2 is identical to the data distribution in conventional coded computing.

The worst case may occur when the speed prediction completely fails. In this case the general S2C2 algorithm, along with the timeout mechanism described in section 4.3, essentially turns into conventional coded computing.
5. Extension to non-linear coded computing
S2C2, being a workload distribution strategy, can be extended to many different coded computations. In this section we demonstrate how to apply it on top of the popular polynomial codes (Yu et al., 2017). We refer the reader to that paper for the mathematical underpinnings of polynomial codes; here we only provide a brief overview to demonstrate how they work and how S2C2 can be applied to such generalized codes. These codes can be used to compute polynomial functions on data with low decoding and encoding overheads.
Consider computing A^T B, a bilinear computation on two matrices A and B, in a distributed manner using a cluster of n nodes. Matrix A is divided into subpartitions along rows and matrix B is divided into subpartitions along columns. Then n encoded partitions each for A and B are computed from these subpartitions. Let us consider the scenario where we want to distribute this computation over a minimum of 5 nodes with one potential straggler. In this case each matrix has 2 subpartitions. Each encoded partition of A is of the form A0 + i*A1. Each encoded partition of B is of the form B0 + i^2*B1, where i is equal to the node index. For example, node 0 will store A0 and B0. Node 2 will store A0+2A1 and B0+4B1, and so on. In this configuration, node 2 computes (A0+2A1)^T (B0+4B1). Hence, each node result combines the four partial coded computations on A and B, and we need such results from any four nodes to be able to fully decode the computation. But if there are no straggler nodes, polynomial coding wastes the extra computation, just as is the case with MDS coding.
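As a sanity check on the "any four nodes suffice" property, the scheme can be reproduced with scalar stand-ins for the subpartitions (our own toy values; real deployments operate on matrix blocks, and the Lagrange interpolation here stands in for the decoder):

```python
def interpolate(points, x):
    """Lagrange interpolation: evaluate the unique degree-(len(points)-1)
    polynomial through `points` at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Scalar stand-ins for the two subpartitions of A and of B.
A0, A1, B0, B1 = 3.0, 5.0, 2.0, 7.0

def node_result(i):
    # Node i stores A0 + i*A1 and B0 + i^2*B1 and computes their product,
    # a degree-3 polynomial in i: A0*B0 + i*A1*B0 + i^2*A0*B1 + i^3*A1*B1.
    return (A0 + i * A1) * (B0 + i * i * B1)

# Suppose node 0 is the straggler; nodes 1-4 respond.
points = [(i, node_result(i)) for i in (1, 2, 3, 4)]

# Four points pin down the degree-3 polynomial, so the master can recover
# the straggler's term, e.g. the value at i = 0, which is exactly A0*B0.
assert abs(interpolate(points, 0) - A0 * B0) < 1e-6
```

The same interpolation evaluated at enough points recovers all four partial products, which is the decoding step of the polynomial code.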
In Figure 5 we illustrate how our S2C2 framework can be applied on top of such a polynomial coded bilinear computation. In Figure 5, the cluster has 5 nodes. For illustration purposes each matrix partition has 9 rows. A minimum of 4 responses per row are needed for successful computation of A^T B. The relative speeds of the nodes are {2,2,2,2,1}. Node 4 is a partial straggler. Conventional polynomial coded computing ignores the computation from this node. General S2C2, however, does not, and allocates partial work to it. General S2C2 allocates {8,8,8,8,4} rows to the 5 nodes respectively, as highlighted by the bounding rectangles in each worker node. The last worker (speed 1) is shown computing the last set of rows. The product of each row is computed by exactly 4 nodes and sent to the master node.
In this paper we evaluate polynomial coding while computing Hessian matrices (Himmelblau, 1972). These Hessian matrices form the foundation for a variety of optimization algorithms such as semidefinite programs, kernel ridge regression, maximum likelihood estimation, and more.
6. Implementation
At the beginning of the computation, the master node encodes the matrix data and distributes the encoded submatrices to the corresponding worker nodes. For MDS coding we deal with just a single matrix, but with polynomial codes we have two matrices to encode, and the two coding strategies use different encodings, as described earlier. At the start of each iteration of our applications, the master node distributes the input vector to all worker nodes. At the end of each iteration, the master node receives the subproducts from the worker nodes, decodes them, and constructs the result vector.
Each worker node runs two separate processes, one for computation and one for communication. The computation process performs the appropriate computation on the encoded data, either a matrix-vector operation in the MDS setting or a Hessian matrix computation in the polynomial setting. The communication process is in charge of receiving input data and work assignment information from the master node, sending back the partial product, and controlling the start and stop of the computation process at the worker node.
6.1. LSTM-based Speed Prediction Model
We used the speed data measured from our experiments in the motivation section as the dataset for training the LSTM model. The train/test dataset split is 80:20. The speed prediction model expects a 1-dimensional input and consists of a single-layer LSTM with a 4-dimensional hidden state with tanh activation, and a 1-dimensional output. The dimension of the hidden state is a hyperparameter; we experimented with different values and selected 4 as it provided the highest accuracy on the dataset. The model is used to predict the speed of each node once every iteration. The input to the model is the speed of a node from the previous iteration, and its output is the speed prediction for the next iteration. The LSTM model computation takes 200 microseconds per node.
6.2. S2C2 specifics
The basic S2C2 strategy needs information on which nodes are stragglers. The general S2C2 strategy needs information on the relative execution speeds of all nodes, and it adjusts the work assignment to the worker nodes according to their speeds. To obtain this information we rely on the iterative nature of our algorithms. Initially, the master node starts with the assumption that all the worker nodes have the same speed, and this is provided as input to the current S2C2 strategy. The master then distributes the work assignment calculated by S2C2 to each worker node. Upon receiving the partial products from the worker nodes, the master node also records the response time of each worker node for the current iteration. If the number of rows computed at a worker is w and its response time is t, then the speed of that worker node for the current iteration is computed as w/t. These values from all nodes are provided as a batch input to the trained LSTM model, which predicts the speeds for the next iteration. The predicted speeds are fed into the general S2C2 strategy to generate the computational work assignment at each worker node for the next iteration. Thus S2C2 automatically adapts to speed changes at the granularity of an iteration.
6.3. Computing Applications
We evaluated our strategy on MDS coding using the following linear algebraic algorithms: Logistic Regression, Support Vector Machine, Page Rank, and Graph Filtering. Graph ranking algorithms like Page Rank and graph signal processing algorithms employ repeated matrix-vector multiplication: calculating page rank involves computing the eigenvector corresponding to the largest eigenvalue, which is done using the power iteration algorithm; graph filtering operations such as the
k-hop filtering operations employ iterations of matrix-vector multiplication over the combinatorial Laplacian matrix. We evaluated our strategy on both these algorithms, and on Polynomial coding using the Hessian matrix operation that we described earlier.
6.4. System Setup
We evaluated the above computing applications in a datacenter-scale setting on the DigitalOcean cloud, employing 11 shared compute instances, each with 1 virtual CPU and 2 GB of memory. We use Kubernetes to bootstrap a cluster from these 11 nodes, with one serving as the master and the other 10 as worker nodes. We then dockerize the computing applications and deploy them on the cloud.
6.5. Verification in a controlled cluster
For theoretical verification purposes we also evaluated all the applications on our local cluster, where we have the ability to precisely control straggler behavior. Our local cluster is composed of 13 identical servers. Each server has two Intel Xeon E5-2630 v3 CPUs, each with 8 cores (8 threads, 2.40 GHz) and 20 MB of L3 cache, running CentOS Linux 7.2.1511, and 64 GB of DRAM. All servers have access to a shared 1 TB storage node and are connected to one another through a Mellanox SX6036 FDR14 InfiniBand switch with a bandwidth of 56 Gbps. We use one node as the master and the other 12 nodes as workers.
6.6. Baseline strategies
We implement and evaluate two baseline strategies. Our first baseline is an enhanced Hadoop-like uncoded approach similar to LATE (Zaharia et al., 2008). In this baseline we use a 3-repetition strategy: each data partition is replicated at 3 randomly selected nodes in the distributed system, and up to six tasks may be speculatively launched. Unlike traditional Hadoop, this enhanced strategy does not enforce strict data locality during speculation, and allows data to be moved at runtime if a task needs to be relaunched on a node that does not have a copy of the data. Furthermore, the speculative task assignment always tries to find a node that already has a copy of the data before moving the data, thereby incurring data communication only when absolutely needed.
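The locality-preferring speculation rule can be sketched as follows (worker names and the data layout are hypothetical):

```python
def pick_speculation_node(partition, idle_nodes, replica_map):
    """Pick a node for a speculative copy of a slow task.

    replica_map: node id -> set of partitions stored locally.
    Prefer an idle node that already holds the partition; only fall back
    to moving the data when no such node exists."""
    for n in idle_nodes:
        if partition in replica_map[n]:
            return n, False               # run locally, no data movement
    if idle_nodes:
        return idle_nodes[0], True        # relaunch elsewhere, move the data
    return None, False                    # no idle node available

replicas = {"w1": {0, 1}, "w2": {2}, "w3": {0, 2}}
print(pick_speculation_node(2, ["w1", "w3"], replicas))  # ('w3', False)
```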
The second baseline is the MDS-coded computation proposed by (Lee et al., 2016) and described previously in Section 2. The two MDS-coding schemes we evaluated are a conservative scheme and an optimistic scheme. No data movement is allowed in these schemes during computation. The purpose of showing results for the high-redundancy MDS coding is simply to show the robustness of our scheme in the presence of such redundancy; we expect that system designers are unlikely to provision 2X computation redundancy in practice. Hence, we will highlight the MDS results in our discussion in the next section.
7. Evaluation
7.1. Results from controlled cluster
We evaluate the performance of our strategy against the baseline strategies for varying straggler counts in our 12-worker-node cluster; these cases correspond to the X-axis in the plots. Each bar captures the average relative execution time of the application over 15 iterations, normalized by the execution time of the uncoded strategy with 0 stragglers in the cluster. The total execution time is dominated by the computation time: the total time the master node spends waiting to receive results from enough worker nodes after commanding each worker to start computing its partitions. The total execution time also includes communication time, the total time spent by the master node communicating with the worker nodes, and assembling time, the total time spent by the master node loading the partial results returned from the worker nodes and decoding them to produce the final result. The majority of the assembling time is spent loading the data, not on the actual decoding itself. The matrix data encoding time is tiny and is paid only once at the beginning, and hence is not included in the figures.
7.1.1. Logistic Regression and SVM
We evaluated Gradient Descent for Logistic Regression (LR) and SVM. The results for both are very similar, hence we focus the discussion on LR. For our experiments we use the publicly available gisette dataset from the UC Irvine machine learning repository (Lichman, 2013), duplicated to create a larger dataset. The final size of the data partition in each node is 760 MB. Only one processor thread in each worker node is used for computation.
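The per-iteration structure of gradient descent for LR can be sketched as follows; the synthetic data and learning rate are illustrative, not the gisette setup. Each iteration's matrix-vector products are exactly the operations the coded strategy distributes across workers:

```python
import numpy as np

def lr_gradient_step(X, y, w, lr=0.1):
    """One full-batch gradient-descent step for logistic regression."""
    z = X @ w                          # distributed matrix-vector multiply
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid probabilities
    grad = X.T @ (p - y) / len(y)     # second distributed multiply
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X @ np.ones(5) > 0).astype(float)   # linearly separable labels

w = np.zeros(5)
for _ in range(200):
    w = lr_gradient_step(X, y, w)
```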
In the cluster, non-straggler workers may have up to 20% variation between their processing speeds, and straggler nodes are 5X slower than the fastest non-straggler node. We compare the three baselines with two versions of our strategy: the basic version, which ignores the speed variation among non-straggler workers and treats them all as having equal speed, and the general version, which takes this speed variation into account and allocates different computational work to non-straggler workers accordingly. The results are shown in Figure 6.
As we can see from Figure 6, when there are no stragglers, all strategies have low execution times, with ours being the lowest. As the number of stragglers increases, the execution time of the uncoded strategy increases, since each slow job must be detected and re-executed, whereas coding-based strategies need no re-execution. Once the number of stragglers exceeds 2, the uncoded strategy's performance degrades to 3x the execution time of the no-straggler scenario. This superlinear degradation occurs because data partitions must be moved across worker nodes prior to re-execution, so communication costs start to play a role in the overall performance loss.
With (12,6)-MDS coding, both versions of our strategy not only provide robustness against up to two stragglers in the cluster, but also reduce the computation overhead due to coding when there are few or no stragglers. By taking the varying speeds of the non-straggler worker nodes into account, the general version outperforms the conservative (12,6)-MDS coded computation strategy by an even larger margin than the basic version does. This result indicates that even if we cannot account for the precise variation in the processing speeds of non-straggler nodes, the basic algorithm provides excellent performance and robustness. However, if the processing speed information is gathered more accurately, the generalized version can squeeze the hidden compute slack in the 20% speed variation and provide further performance improvements without compromising robustness.
7.1.2. Page Rank and Graph Filtering
We evaluated Page Rank (PR) and Graph Filtering. The results for both are very similar, hence we focus the discussion on Page Rank. We used the ranking dataset available from (uto, 2000), duplicated to create a larger dataset for our evaluations.
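As described in Section 6.3, Page Rank relies on the power iteration algorithm, whose core loop is a repeated matrix-vector multiplication. A minimal sketch (the matrix below is a tiny stand-in, not the evaluation dataset):

```python
import numpy as np

def power_iteration(A, num_iters=50):
    """Repeated matrix-vector multiplication: each A @ v product is the
    operation distributed across coded workers in the evaluation."""
    v = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)        # renormalize to avoid overflow
    return v                          # converges to the dominant eigenvector

A = np.array([[2.0, 1.0], [1.0, 2.0]])
v = power_iteration(A)
# dominant eigenvector of A is proportional to [1, 1]
```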
The total execution time for Page Rank, when the non-straggler nodes have up to 20% variation between their speeds, is plotted in Figure 7. Our algorithms significantly outperform the baseline strategies, and the general algorithm reduces the execution time compared to the basic version in all scenarios.
7.2. Results from industrial cloud deployment
In this section we discuss results from our experiments on the DigitalOcean cloud. We evaluate and compare the performance of the General strategy against MDS-coded computation and an over-decomposition strategy based on Charm++ (Laxmikant Kale, 1993; Acun et al., 2014), described below. We evaluated our strategy and MDS-coded computation under (10,7), (9,7), and (8,7) MDS codes. During the course of our experiments we observed different misprediction rates from the LSTM speed prediction model; we show and discuss the performance gains under the experimental conditions with the best-case and worst-case misprediction rates. The performance results across applications are similar (as was also the case in the local cluster setting), so due to space constraints we focus only on SVM results in this section.
7.2.1. Charm++-based Over-decomposition
In the cloud setting, we evaluated an over-decomposition strategy inspired by Charm++ (Laxmikant Kale, 1993; Acun et al., 2014). Our implementation combines over-decomposition with speed prediction. We over-decompose each data partition by a factor of 4: the data is divided into partitions, with each worker receiving 4 of them. The data is also replicated by a factor of 1.42, matching the replication in (10,7)-MDS coding, and the additional partitions are distributed in a round-robin fashion across the workers. The master node uses predictions from the speed model to perform load balancing and transfer partitions between workers during computation. This improves on the uncoded strategy since it allows finer-grained data transfer.
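A sketch of the placement under the stated parameters (the particular replica-placement rule below is our own illustrative choice; the source specifies only the factor-4 decomposition, the ~1.42x replication, and round-robin distribution):

```python
def overdecompose(num_workers=7, factor=4):
    """Each worker gets `factor` primary partitions; replicas of some
    partitions are then placed round-robin on *other* workers so that
    total storage is ~1.42x the data, like a (10,7)-MDS code's redundancy."""
    total = num_workers * factor
    placement = {w: set(range(w * factor, (w + 1) * factor))
                 for w in range(num_workers)}
    num_replicas = round(0.42 * total)            # extra copies to place
    for i in range(num_replicas):
        p = i % total                             # partition to replicate
        owner = p // factor
        # round-robin over the other workers, never the owner itself
        target = (owner + 1 + i % (num_workers - 1)) % num_workers
        placement[target].add(p)
    return placement

placement = overdecompose()
```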
7.2.2. Results in low misprediction rate environment
The average relative execution times for 15 iterations of SVM when we observe a 0% misprediction rate for worker speeds are shown in Figure 8. The execution times of all strategies are normalized by the execution time of our strategy. First, we observe that the over-decomposition approach performs better than (10,7)-MDS coded computation. This is expected, since each worker in MDS-coded computation processes more data than each worker under over-decomposition. Next, all three variations of MDS-coded computation show similar execution times, because in every case the work performed by a single worker remains the same and only the results from the fastest 7 workers are used by the master. Over-decomposition performs similarly to (10,7)-MDS in this environment since there is no additional data movement during computation. Further, our strategy outperforms regular MDS-coded computation for all three data-coding variations, and its performance improves as redundancy increases, because the work done by a single worker decreases with redundancy. The (10,7) variant of our strategy outperforms (10,7)-MDS coded computation. The maximum reduction in execution time over (10,7)-MDS coded computation would occur when all 10 workers remain fast throughout execution; with a 0% misprediction rate, our strategy captures this best possible reduction.
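The best-possible gain over (n,k)-MDS coding follows from a simple argument: MDS assigns each worker 1/k of the work, while with all n workers fast an adaptive strategy can assign each only 1/n, a reduction of 1 - k/n. A small helper illustrating this arithmetic (our own derivation, not code from the evaluation):

```python
def max_reduction(n, k):
    """Best-case execution-time reduction of adaptive load balancing over
    (n,k)-MDS coding: per-worker work drops from 1/k to 1/n of the total,
    so execution time shrinks by 1 - k/n."""
    return 1 - k / n

print(round(max_reduction(10, 7), 2))   # bound for the (10,7) setting
```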
Figure 9 plots the wasted computation effort measured at each worker node during execution of the conservative (10,7)-MDS coded computation and the (10,7) variant of our strategy. Since the misprediction rate is 0%, our strategy wastes no computation effort. In this execution, workers 1, 3, 7 and 8 have high wasted computation; worker 1 has close to 90% of its computation wasted. This is because it is only slightly slower than the fastest 7 workers, but the MDS-coded computation stops the execution of the 3 remaining workers and ignores their results once it receives results from the fastest 7.
7.2.3. Results in high misprediction rate environment
During our experiments with shared VM instances on DigitalOcean, the highest misprediction rate we observe is 18%. Under this condition, the average execution times for 15 iterations of SVM are shown in Figure 10. (10,7)-MDS coded computation performs better than (9,7)- and (8,7)-MDS coded computation because the probability of any 7 out of 10 nodes being fast is higher than that of any 7 out of 9, or 7 out of 8, nodes being fast. Our (8,7) variant outperforms (8,7)-MDS coding by 13%, the (9,7) variant outperforms (9,7)-MDS coding by 11%, and the (10,7) variant outperforms (10,7)-MDS coded computation by 17%. As expected, the (10,7) variant outperforms both the (9,7) and (8,7) variants, since the opportunities for load balancing increase with redundancy. The observed performance of the over-decomposition approach is lower than that of the (10,7)-MDS approach owing to the extra data movement costs for load balancing during computation, whereas (10,7)-MDS coded computation incurs no extra data movement during computation.
The wasted computation effort measured at each worker node under (10,7) coding is shown in Figure 11. Due to the relatively high misprediction rate, our strategy also incurs wasted computation effort among the worker nodes when the compute tasks of slow nodes are cancelled and reassigned to other worker nodes. However, the conservative MDS approach wastes more computation, since it additionally ignores the computation efforts of the slowest 3 nodes. On average, the conservative MDS scheme incurs 47% more wasted computation effort.
7.2.4. Results on Polynomial Coding
We evaluated our strategy applied to conventional Polynomial coding while performing the Hessian matrix computation that we previously described in Section 5. The results collected under low and high misprediction rates are shown in Figure 12. In these experiments, the cluster has 12 nodes. The matrices are each partitioned into 3 submatrices, encoded, and the encoded partitions are distributed to the 12 nodes; each node computes on 2 encoded partitions, and results from any 9 nodes suffice to compute the Hessian. In this setup, our strategy reduces the overall computation time by 19% in the low misprediction rate environment. The part of the Hessian computation that each node must first compute locally is not influenced by our strategy; as a result, the gains are lower than the maximum possible reduction. Under the high misprediction rate environment, our strategy reduces the overall computation time by 14%.
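The exact Hessian expression is given in Section 5 and not reproduced here; the sketch below assumes the common logistic-regression form H = X^T D X with diagonal D, which decomposes into per-partition products and shows why each node must first compute its local product:

```python
import numpy as np

def hessian_from_partitions(X_parts, D_parts):
    """Sum of per-partition products X_i^T D_i X_i equals the full
    X^T diag(D) X, so partial Hessians can be computed locally and
    combined at the master."""
    return sum(Xi.T @ (Di[:, None] * Xi) for Xi, Di in zip(X_parts, D_parts))

rng = np.random.default_rng(1)
X = rng.normal(size=(9, 4))
D = rng.random(9)
H = hessian_from_partitions(np.split(X, 3), np.split(D, 3))
# H equals the full product X.T @ diag(D) @ X
```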
These evaluation results demonstrate the effectiveness of our strategy across different coded computation schemes.
8. Related Work
Straggler Mitigation: The authors of (Dean and Barroso, 2013) propose several software techniques, such as redundant tasks and selective replication, to contain the effect of stragglers. The authors of (Ananthanarayanan et al., 2010) utilize real-time progress reports to detect and cancel stragglers early. (Zaharia et al., 2008) present the LATE algorithm to improve straggler detection and speculative execution in the Hadoop framework. The authors of (Ananthanarayanan et al., 2014) use extrapolation to estimate task durations and perform straggler mitigation. In (Li et al., 2014; Zhang et al., 2016; Kasture and Sanchez, 2014) the authors explore system-level sources of tail latency and implement mechanisms to eliminate them. Adrenaline (Hsu et al., 2017) identifies and selectively speeds up long queries by quick voltage boosting. Paragon (Delimitrou and Kozyrakis, 2013) presents a QoS-aware online scheduler for heterogeneous datacenters. Prior works (Delimitrou and Kozyrakis, 2014; Lo et al., 2014; Leverich and Kozyrakis, 2014; Zhu et al., 2017a) focus on improving resource efficiency while providing low latency. Using replicated tasks to improve response times has been explored in (Ananthanarayanan et al., 2013; Shah et al., 2013; Wang et al., 2014; Gardner et al., 2015; Chaubey and Saule, 2015; Lee et al., 2017); generally, this approach launches multiple copies of each task across workers, uses the results from the fastest copy, and cancels the slower ones, requiring multiple replicas of all the data. Recent works on coded computing have shown that replicating with coding is a better approach to tolerating stragglers than reactive replication of tasks. Another strategy for straggler mitigation is to settle for an approximate result without waiting on the stragglers (Goiri et al., 2015).
Coded Computation: Coded computation is a recently proposed framework with two concepts for dealing with the communication and straggler bottlenecks in distributed computing. The first concept (Li et al., 2015, 2016b) enables an inverse-linear tradeoff between computation load and communication load in distributed computing, which can be leveraged to speed up large-scale data analytics applications (Li et al., 2017). The second concept (Lee et al., 2016), the focus of this paper, provides resiliency to stragglers and can be utilized to mitigate tail latency in distributed computing (Lee et al., 2016; Reisizadeh et al., 2017; Li et al., 2016a; Dutta et al., 2016; Tandon et al., 2016; Yu et al., 2017). In particular, several of these works target distributed machine learning. Recent results from (Kosaian et al., 2018) demonstrate coded computing on non-linear computations, specifically deep learning inference. There have been a few recent works in the coded computing literature that exploit the computations of slow nodes (Zhu et al., 2017b; Yang et al., 2017); however, the key ingredient of our proposed strategy is that it dynamically adapts the computation load of each node to its speed as estimated from previous rounds of computation.
9. Conclusion
In this paper we proposed and evaluated a strategy that efficiently tolerates speed variance and uncertainty about the number of stragglers in the system. Our strategy distributes coded data to nodes and, at runtime, adaptively adjusts the computation work per node, thereby significantly reducing the total execution time of several applications. Through our evaluations using machine learning and graph processing applications, we demonstrate a ~39.3% reduction in execution time. We conclude that the workload scheduling techniques presented in this paper effectively reduce the overhead in coded computation frameworks and make them more effective in real deployments.
References
 dig ([n. d.]) [n. d.]. DigitalOcean. https://www.digitalocean.com/.
 uto (2000) 2000. CS Toronto Datasets. http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html.
 had (2014) 2014. Apache Hadoop. http://hadoop.apache.org/.
 Acun et al. (2014) Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, Lukasz Wesolowski, and Laxmikant Kale. 2014. Parallel Programming with Migratable Objects: Charm++ in Practice (SC).
 Ananthanarayanan et al. (2013) Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective Straggler Mitigation: Attack of the Clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (nsdi’13). USENIX Association, Berkeley, CA, USA, 185–198. http://dl.acm.org/citation.cfm?id=2482626.2482645
 Ananthanarayanan et al. (2014) Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. 2014. GRASS: Trimming Stragglers in Approximation Analytics. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI’14). USENIX Association, Berkeley, CA, USA, 289–302. http://dl.acm.org/citation.cfm?id=2616448.2616475

 Ananthanarayanan et al. (2010) Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the Outliers in MapReduce Clusters using Mantri. In OSDI, Vol. 10. 24.
 Chaubey and Saule (2015) Manmohan Chaubey and Erik Saule. 2015. Replicated Data Placement for Uncertain Scheduling. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW ’15). IEEE Computer Society, Washington, DC, USA, 464–472. https://doi.org/10.1109/IPDPSW.2015.50
 Dean and Barroso (2013) Jeffrey Dean and Luiz André Barroso. 2013. The Tail at Scale. Commun. ACM 56 (2013), 74–80. http://cacm.acm.org/magazines/2013/2/160173thetailatscale/fulltext
 Delimitrou and Kozyrakis (2013) Christina Delimitrou and Christos Kozyrakis. 2013. Paragon: QoS-aware Scheduling for Heterogeneous Datacenters. SIGPLAN Not. 48, 4 (March 2013), 77–88. https://doi.org/10.1145/2499368.2451125
 Delimitrou and Kozyrakis (2014) Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: Resource-efficient and QoS-aware Cluster Management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’14). ACM, New York, NY, USA, 127–144. https://doi.org/10.1145/2541940.2541941

 Dutta et al. (2016) Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. 2016. Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products. In Advances In Neural Information Processing Systems. 2092–2100.
 Gardner et al. (2015) Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor Harchol-Balter, and Esa Hyytia. 2015. Reducing Latency via Redundant Requests: Exact Analysis. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’15). ACM, New York, NY, USA, 347–360. https://doi.org/10.1145/2745844.2745873
 Goiri et al. (2015) Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. 2015. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’15). ACM, New York, NY, USA, 383–397. https://doi.org/10.1145/2694344.2694351
 Gupta et al. (2018) Vipul Gupta, Shusen Wang, Thomas Courtade, and Kannan Ramchandran. 2018. OverSketch: Approximate Matrix Multiplication for the Cloud. arXiv preprint arXiv:1811.02653 (2018).
 Hill (1986) Raymond Hill. 1986. A First Course in Coding Theory. Clarendon Press. https://books.google.com/books?id=UTxjBX9lKoMC
 Himmelblau (1972) David Mautner Himmelblau. 1972. Applied nonlinear programming. McGrawHill Companies.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long ShortTerm Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
 Hsu et al. (2017) Chang-Hong Hsu, Yunqi Zhang, Michael A. Laurenzano, David Meisner, Thomas Wenisch, Ronald G. Dreslinski, Jason Mars, and Lingjia Tang. 2017. Reining in Long Tails in Warehouse-Scale Computers with Quick Voltage Boosting Using Adrenaline. ACM Trans. Comput. Syst. 35, 1, Article 2 (March 2017), 33 pages. https://doi.org/10.1145/3054742
 Kasture and Sanchez (2014) Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict QoS for Latency-critical Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’14). ACM, New York, NY, USA, 729–742. https://doi.org/10.1145/2541940.2541944
 Kosaian et al. (2018) J. Kosaian, K. V. Rashmi, and S. Venkataraman. 2018. Learning a Code: Machine Learning for Approximate NonLinear Coded Computation. ArXiv eprints (June 2018). arXiv:cs.LG/1806.01259
 Laxmikant Kale (1993) Sanjeev Krishnan Laxmikant Kale. 1993. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In Proceedings of OOPSLA’93. ACM Press, 91–108.
 Lee et al. (2016) Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2016. Speeding up distributed machine learning using codes. In 2016 IEEE International Symposium on Information Theory (ISIT). 1143–1147. https://doi.org/10.1109/ISIT.2016.7541478
 Lee et al. (2017) Kangwook Lee, Ramtin Pedarsani, and Kannan Ramchandran. 2017. On Scheduling Redundant Requests With Cancellation Overheads. IEEE/ACM Trans. Netw. 25, 2 (April 2017), 1279–1290. https://doi.org/10.1109/TNET.2016.2622248
 Leverich and Kozyrakis (2014) Jacob Leverich and Christos Kozyrakis. 2014. Reconciling High Server Utilization and Sub-millisecond Quality-of-service. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys ’14). ACM, New York, NY, USA, Article 4, 14 pages. https://doi.org/10.1145/2592798.2592821
 Li et al. (2014) Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the ACM Symposium on Cloud Computing (SOCC ’14). ACM, New York, NY, USA, Article 9, 14 pages. https://doi.org/10.1145/2670979.2670988
 Li et al. (2015) Songze Li, Mohammad Ali MaddahAli, and Amir Salman Avestimehr. 2015. Coded MapReduce. 53rd Allerton Conference (Sept. 2015).
 Li et al. (2016a) Songze Li, Mohammad Ali MaddahAli, and A Salman Avestimehr. 2016a. A Unified Coding Framework for Distributed Computing with Straggling Servers. eprint arXiv:1609.01690 (Sept. 2016). A shorter version to appear in IEEE NetCod 2016.
 Li et al. (2016b) Songze Li, Mohammad Ali MaddahAli, Qian Yu, and A Salman Avestimehr. 2016b. A Fundamental Tradeoff between Computation and Communication in Distributed Computing. to appear in IEEE Transactions on Information Theory (2016).
 Li et al. (2017) Songze Li, Sucha Supittayapornpong, Mohammad Ali MaddahAli, and A. Salman Avestimehr. 2017. Coded Terasort. in proceedings of 2017 International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics (2017).
 Lichman (2013) M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Lo et al. (2014) David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. 2014. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA ’14). IEEE Press, Piscataway, NJ, USA, 301–312. http://dl.acm.org/citation.cfm?id=2665671.2665718
 Reisizadeh et al. (2017) Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Amir Salman Avestimehr. 2017. Coded Computation over Heterogeneous Clusters. In 2017 IEEE International Symposium on Information Theory (ISIT).
 Renteln (2013) Paul Renteln. 2013. Manifolds, Tensors, and Forms: An Introduction for Mathematicians and Physicists. Cambridge University Press.
 Shah et al. (2013) Nihar B. Shah, Kangwook Lee, and Kannan Ramchandran. 2013. When do redundant requests reduce latency? In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). 731–738. https://doi.org/10.1109/Allerton.2013.6736597
 Tandon et al. (2016) Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. 2016. Gradient Coding. arXiv preprint arXiv:1612.03301 (2016).
 Wang et al. (2014) Da Wang, Gauri Joshi, and Gregory Wornell. 2014. Efficient Task Replication for Fast Response Times in Parallel Computation. SIGMETRICS Perform. Eval. Rev. 42, 1 (June 2014), 599–600. https://doi.org/10.1145/2637364.2592042
 Yang et al. (2017) Yaoqing Yang, Pulkit Grover, and Soummya Kar. 2017. Coding Method for Parallel Iterative Linear Solver. to appear Advances In Neural Information Processing Systems (NIPS) (2017).
 Yu et al. (2017) Qian Yu, Mohammad MaddahAli, and A. Salman Avestimehr. 2017. Polynomial Codes: an Optimal Design for HighDimensional Coded Matrix Multiplication. In to appear Advances In Neural Information Processing Systems (NIPS).
 Yu et al. (2018) Qian Yu, Netanel Raviv, Jinhyun So, and A Salman Avestimehr. 2018. Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy. arXiv preprint arXiv:1806.00939 (2018).
 Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud’10). USENIX Association, Berkeley, CA, USA, 10–10. http://dl.acm.org/citation.cfm?id=1863103.1863113
 Zaharia et al. (2008) Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08). USENIX Association, Berkeley, CA, USA, 29–42. http://dl.acm.org/citation.cfm?id=1855741.1855744
 Zhang et al. (2016) Yunqi Zhang, David Meisner, Jason Mars, and Lingjia Tang. 2016. Treadmill: Attributing the Source of Tail Latency Through Precise Load Testing and Statistical Inference. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA ’16). IEEE Press, Piscataway, NJ, USA, 456–468. https://doi.org/10.1109/ISCA.2016.47
 Zhu et al. (2017b) Jingge Zhu, Ye Pu, Vipul Gupta, Claire Tomlin, and Kannan Ramchandran. 2017b. A Sequential Approximation Framework for Coded Distributed Optimization. CoRR abs/1710.09001 (2017). arXiv:1710.09001 http://arxiv.org/abs/1710.09001
 Zhu et al. (2017a) Timothy Zhu, Michael A. Kozuch, and Mor Harchol-Balter. 2017a. WorkloadCompactor: Reducing Datacenter Cost While Providing Tail Latency SLO Guarantees. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC ’17). ACM, New York, NY, USA, 598–610. https://doi.org/10.1145/3127479.3132245