Distributed Matrix Multiplication Using Speed Adaptive Coding

04/15/2019 · by Krishna Narra, et al.

While performing distributed computations in today's cloud-based platforms, execution speed variations among compute nodes can significantly reduce performance and create bottlenecks like stragglers. Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. In this paper, we propose a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S^2C^2). S^2C^2 squeezes the compute slack (i.e., overhead) that is built into coded computing frameworks by efficiently assigning work to all fast and slow nodes according to their speeds, without needing to re-distribute data. We implement an LSTM-based speed prediction algorithm to predict the speeds of compute nodes. We evaluate S^2C^2 on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms. We demonstrate a 19% reduction in computation latency using S^2C^2 compared to job replication and coded computation. We further show how S^2C^2 can be applied beyond linear algebra.


1. Introduction

Cloud computing and large distributed frameworks like Apache Spark (Zaharia et al., 2010) are widely used because they enable the execution of large-scale applications, such as machine learning and graph applications on data sizes on the order of tens of terabytes and more, efficiently and at lower cost. However, as we “scale out” computations across many distributed nodes, one needs to deal with the “system noise” that is due to several factors such as heterogeneity of hardware, hardware failures, disk IO delay, communication delay, operating system issues, maintenance activities, and power limits (Ananthanarayanan et al., 2010). System noise leads to uneven execution latencies where different servers may take different amounts of time to execute the same task, even if the servers have identical hardware configurations. In the extreme case, a server may even be an order of magnitude slower than the remaining servers; such a server is called a straggler node. Such speed variations create significant delays in task executions and can also lead to major performance bottlenecks, since the master node waits for the slowest worker nodes to finish their tasks. This phenomenon results in tail latency, which can be defined as the high-percentile completion latency of the distributed tasks. As the number of servers within a cluster experiencing this speed variance increases, the probability of a long tail latency increases exponentially (Dean and Barroso, 2013).

Replication approaches are commonly used today to deal with the straggler delay bottleneck. For example, in distributed computation and storage frameworks like Hadoop MapReduce (had, 2014) and Spark, the data that needs to be processed is split into partitions by the master node, and each data partition is replicated across a subset of workers. The master node then keeps track of the progress of the task on each worker node. After completion of a certain fraction of the tasks, if the master observes that a particular node's progress is slow, it schedules a copy of that slow task to be executed on another node which contains a copy of the data partition to be processed. Whichever node, either the original or the replica, finishes first returns the results to the master node, and the other copy of the task is terminated. This technique is used in the Hadoop MapReduce framework (had, 2014). However, this technique is reactive, since the master node waits until most of the tasks finish their execution before launching a replica of a task running on a straggler node. In addition, the replica can only be launched on a restricted subset of nodes. This reactive process significantly increases the overall execution time.

Recently, it has been shown that “coding” can provide a novel approach to mitigate the tail latency caused by straggler nodes, and a new framework named “coded computation” has been proposed (Lee et al., 2016; Li et al., 2016a; Reisizadeh et al., 2017; Yu et al., 2017; Dutta et al., 2016). The key idea of coded computation frameworks is to inject computation redundancy in an unorthodox coded form (as opposed to the state-of-the-art replication approaches) in order to create robustness to stragglers. For example, it was shown in (Lee et al., 2016) that error correcting codes, e.g., Maximum-Distance-Separable (MDS) codes, can be utilized to create redundant computation tasks for linear computations such as matrix multiplication. (MDS codes are an important class of block codes since they have the greatest error correcting and detecting capabilities; for more information see (Hill, 1986), Chapter 16.)

Overview of MDS coding: An (n,k)-MDS coded computation first decomposes the overall computation into k smaller tasks, for some k less than or equal to n. It then encodes them into n coded tasks using an (n,k)-MDS code and assigns each of them to a node to compute. From the desirable “any k of n” property of the MDS code, we can accomplish the overall computation once we have collected the results from the fastest k out of the n coded tasks, without worrying about the tasks still running on the slow nodes (or stragglers). We observed that figuring out the amount of redundancy to provision during coding, i.e., the value of k, poses a significant challenge because it is hard to estimate the “noise” or speed variability in the compute nodes beforehand. For an (n,k)-MDS code the assumption is that there may be at most n-k very slow nodes or failures. But estimating the number of stragglers a priori is an extremely challenging task (Ananthanarayanan et al., 2013; Zaharia et al., 2008). As such, assuming a worst-case scenario for the number of stragglers and specifying a conservative k seems to be the best possible option. Coded computation schemes use these parameters to encode the data and to determine how much of the coded dataset is processed by each of the compute nodes. The smaller the value of k (i.e., the more conservative and highly redundant the code), the larger the amount of computation performed by each node in the cluster.

If the number of persistent stragglers during a particular execution instance is fewer than what the scheme is built to support, the efficiency of coded computing drops. For instance, in the MDS coding explained above there is no significant performance benefit if there are fewer than n-k stragglers, since the coded computation still has to wait for k nodes to complete their execution. In cloud computing systems partial stragglers are encountered more often, i.e., nodes that are slower but can do a partial amount of the work assigned to them. The existing coded computation schemes always waste the compute capability of the partial stragglers and do not take advantage of the fact that the data needed for computation already exists on them and they can do a partial amount of work (more on this in Section 7.2). It is this lack of elasticity that makes coded computing unpalatable in large-scale cluster settings. What is ideal is to allow the developer to select high-redundancy coding to be conservative (essentially assuming a reasonable worst-case straggler scenario) but allow a workload scheduler to decide how much redundant computing to perform based on the speed variations observed in a distributed or cloud computing environment.

In this work, we design a new dynamic workload distribution strategy for coded computing that is elastic with respect to the speeds of nodes measured during runtime, irrespective of how much redundancy is chosen for creating the coded data. Our proposed S^2C^2 (Slack Squeeze Coded Computing) strategy adapts to a varying number of stragglers by squeezing out any computation slack that may be built into the coded computation to tolerate worst-case execution scenarios. The performance of S^2C^2 is determined by the actual speeds measured and the actual number of very slow nodes seen, rather than by the redundancy used in encoding the data. As the speeds of nodes change, S^2C^2 responds by appropriately increasing or decreasing the amount of work allocated to each node in the cluster to maximize performance.

To predict the speeds of the nodes as they change during runtime we propose a novel prediction mechanism. We model speed prediction as a time series forecasting problem and use a Long Short-Term Memory (LSTM) based learning model to predict the speeds. These predicted speeds are used by S^2C^2 to do work allocation among the nodes.

In summary, the main contributions in this paper are as follows:

  • We empirically study and evaluate the speed variations of compute nodes in a large-scale cloud computing cluster. We then propose an LSTM-based model to predict the speed of each node in the next computation round.

  • We propose S^2C^2, which exploits the computational redundancy available in the system and elastically distributes work to worker nodes by using predictions from the LSTM model. It increases performance without compromising robustness.

  • We propose two variations of S^2C^2 and evaluate their performance on our local cluster and on the cloud while running machine learning and graph ranking workloads. We also propose a new fine-grained replication approach that combines over-decomposition of data (Laxmikant Kale, 1993) and speed-prediction-based workload distribution.

  • We first demonstrate the performance gains of S^2C^2 on top of an MDS-coded dataset by deploying it on the DigitalOcean cloud (dig, [n. d.]). While executing algorithms such as gradient descent, graph ranking, and graph filtering, S^2C^2 is able to reduce the total compute latency compared to both conventional coded computation and the fine-grained replication approach.

  • Finally, we go beyond linear algebra to demonstrate the versatility of S^2C^2 by applying its workload distribution and scheduling strategies on top of polynomial codes (Gupta et al., 2018; Yu et al., 2017), a coded computing strategy for polynomial computations.

The rest of the paper is organized as follows: Section 2 provides background on coded computation; Section 3 describes speed prediction and the overheads of coded computation; Section 4 describes the proposed S^2C^2 algorithm; Section 5 describes extensions to non-linear coded computing; Section 6 provides implementation and system details; Section 7 presents our evaluations; and Section 8 describes related work.

2. Coded Computing Background

In this section we briefly introduce coded computation. For clarity of explanation we first focus on how coded computing is applied to linear algebraic operations. Consider a distributed matrix multiplication problem where a master node wants to multiply a data matrix A with an input vector x to compute Ax. The data matrix A is distributed across worker nodes on which the matrix multiplication will be executed in a distributed manner.

One natural approach to tackle this problem is to vertically and evenly divide the data matrix A into sub-matrices, each of which is stored on one node. Then, when each node receives the input x, it simply multiplies its locally stored sub-matrix with x and returns the result, and the master vertically concatenates the returned results to obtain the final result. However, since this uncoded approach relies on successfully retrieving the task results from all nodes, it has a major drawback: once one of the nodes runs slow, the computation may take long to finish. The coded computation framework deals with slow or straggler nodes by optimally creating redundant computation tasks. For example, a (3,2)-MDS coded computing scheme vertically partitions the data matrix A into two sub-matrices A1 and A2, and creates one redundant task by summing them, A1+A2. Then A1, A2, and A1+A2 are stored on worker nodes 1, 2, and 3, respectively. In this case the final result is obtained once the master receives the task results from any 2 out of the 3 nodes, without needing to wait for the slow/straggler node. Let us assume worker node 2 is a straggler and the master node only collects results from nodes 1 and 3. The master node can then compute A2·x by subtracting the computed task of node 1, i.e., A1·x, from the computed task of node 3, i.e., (A1+A2)·x.
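As a concrete illustration of this (3,2)-MDS example, the following minimal NumPy sketch (our own illustration, not the paper's implementation; all names are hypothetical) encodes A into A1, A2, and A1+A2, simulates a straggler, and recovers Ax from any two of the three partial results.

import numpy as np

# Minimal sketch of the (3,2)-MDS coded matrix-vector example above.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # data matrix with 6 rows
x = rng.standard_normal(4)        # input vector

# Encode: split A by rows into two halves and add a redundant sum block.
A1, A2 = A[:3], A[3:]
tasks = {1: A1, 2: A2, 3: A1 + A2}           # worker_id -> stored coded partition

# Each worker computes its partial product; suppose worker 2 straggles.
results = {w: M @ x for w, M in tasks.items() if w != 2}

# Decode from workers 1 and 3: A2·x = (A1 + A2)·x - A1·x.
A1x = results[1]
A2x = results[3] - results[1]
Ax = np.concatenate([A1x, A2x])

assert np.allclose(Ax, A @ x)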

Broader use of coded computing: MDS-coded computing can inject redundancy to tolerate stragglers in linear computations. Coded computing is, however, applicable to a wider range of compute-intensive algorithms, going beyond linear computations. Polynomial coded computing (Yu et al., 2017; Gupta et al., 2018) can tolerate stragglers in computations that evaluate polynomials, such as Hessian matrix computation. Lagrange coded computing (Yu et al., 2018) can add coded redundancy to tolerate stragglers in arbitrary multivariate polynomial computations such as general tensor algebraic functions, inner product functions, functions computing outer products, and tensor contractions (Renteln, 2013). A recent work (Kosaian et al., 2018) demonstrated promising results in extending coded computing to deep learning inference.

3. Motivation

3.1. Overheads

Consider an uncoded strategy with r-replication, i.e., each data partition is replicated across r different worker nodes, where r is the replication factor. Consider a node executing a task on data partition P. If the node is determined to be a straggler at some future time, the master node can replicate the task on any one of the other nodes which holds a replica of data partition P to speed it up. However, there are three challenges. First, when should the master determine that the node is a straggler? Second, even if the master has early knowledge of the straggler, it is restricted to launching the task only on the subset of nodes that have the required data partition P. Third, in the worst case, if all the nodes holding the replicas are also stragglers, the uncoded replication strategy cannot speed up the computation at all. An alternative is to move the data partition to another available faster node and execute the task there, but this option forces the data transfer time into the critical path of the overall execution time. Consider the uncoded scheme with 2-replication in Figure 1: if the number of very slow or straggler nodes is 2 or more, computation latency increases significantly. Similar behavior can be observed for the uncoded scheme with 3-replication when there are 3 or more stragglers.

Next let us consider (n,k)-MDS coded computation for matrix multiplication. The master node divides the original matrix into k sub-matrices, encodes them into n partitions, and distributes them to the n workers. As we discussed before, a small k needs to be chosen to deal with worst-case scenarios. However, this over-provisioning with a small k comes at a price. If the original data has m rows, then each of the n worker nodes must compute on a coded partition of size m/k. As k becomes smaller, each worker node has to execute a larger computation, independent of its actual speed. On the other hand, if the designer picks a large k then the robustness of the computation decreases. This is a difficult tradeoff since the selection of k must be done prior to creating a correct encoding and decoding strategy and distributing the encoded data partitions appropriately to all nodes, which is usually done once before executing the given workload.
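To make the tradeoff concrete, the short sketch below (our own illustration under the model above, with an arbitrary example value of m) compares the per-worker work and straggler tolerance for a few (n,k) choices on a 12-node cluster.

import math

m = 120_000  # rows in the original matrix (illustrative value)

# Per-worker coded rows under an (n,k)-MDS code: each of the n workers
# stores and computes on a partition of roughly m/k rows.
for n, k in [(12, 10), (12, 9), (12, 6)]:
    rows_per_worker = math.ceil(m / k)
    tolerated_stragglers = n - k
    print(f"({n},{k})-MDS: {rows_per_worker} rows per worker, "
          f"tolerates {tolerated_stragglers} stragglers")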

One solution to deal with the straggler uncertainty in MDS-coded computation is to store multiple encoded partitions on each worker node, so that the system can adapt and choose the appropriate encoded partition dynamically when the number of stragglers in the cluster changes. For example, in a cluster with 12 worker nodes, each worker node can store a (12,9)-MDS encoded partition and a (12,10)-MDS encoded partition at the same time. Assume the original data has m rows: when it is observed that there are three straggling nodes, (12,9)-MDS coded computation is performed with each worker node operating on an encoded partition of size m/9; and when it is observed that there are fewer straggling nodes, (12,10)-MDS coded computation is performed with each worker node operating on a partition of size m/10. This approach is optimal only for two scenarios, and supporting a wider range of scenarios means storing more copies of the encoded data, which dramatically increases the storage overhead. It is possible to encode the data at run time and redistribute the large data partitions based on the measured speeds and slow node count. However, this would dramatically increase the communication overhead and is not practical.

Figure 1. Logistic Regression Experiments

Figure 1 shows the computation time for (12,10)-MDS coded computation and (12,9)-MDS coded computation with varying number of stragglers. The computation time of (12,10)-MDS coded computation increases exponentially when there are more than 2 stragglers. The computation time of (12,9)-MDS coded computation is constant with more stragglers. But there is a significant increase in baseline execution latency with this strategy compared to other strategies, because (12,9)-MDS code requires each worker node to perform more work than (12,10)-MDS code, even if the number of stragglers is fewer than 3.

In summary, although conservative MDS-coded computation can provide robust protection against stragglers, its computation overhead per node is higher and remains the same even when all the nodes in the cluster are fast, since it does not make efficient use of all worker nodes. These drawbacks bring us to our key idea, which is a workload scheduling strategy that provides the same robustness as the (n,k)-MDS coded computation, but induces only the much smaller computation overhead of an (n,s)-MDS coded computation whenever there are only n-s stragglers in the cluster, with s greater than k. We present our proposed Slack Squeeze Coded Computation (S^2C^2) scheme in Section 4.

Figure 2. Measured speeds

3.2. Speed prediction and Coded data

In the introduction we noted that it is important to consider the speed variations across compute nodes when determining the efficacy of discarding the work done by slow nodes in the MDS coded computing framework. To collect and analyze the execution speeds of servers, we conducted experiments on 100 compute nodes, referred to as droplets, in the DigitalOcean cloud (dig, [n. d.]). Each droplet is similar to a t2.micro shared compute instance in Amazon AWS. For our experiments, each droplet node executes a matrix-matrix multiplication and logs its execution time after completing every 1% of the task. The size of each matrix is 20000 by 5000. We analyzed the measured speeds at 1% granularity intervals at all nodes. Figure 2 shows the speed variations in 4 representative nodes. The x-axis in each plot corresponds to time, and the y-axis corresponds to the speed of the node normalized by its maximum observed speed during the experiment.

One critical observation from the figure is that while the speed of each node varies over time, the speed observed at any time slot typically stays within 10% of its value for about 10 neighboring samples. This relatively slowly changing speed gives us an opportunity to estimate the speeds of nodes in future intervals using speeds from past intervals. The speed estimates can be reasonably accurate for most of the time intervals except for a short time window when the speed changes drastically, but even then we are soon able to track the new speed as the node stays in that speed zone.

To find a good prediction mechanism, we considered the speeds of each node as a time series and modeled our problem as a time series forecasting problem. We evaluated LSTM (Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)) and Auto Regressive Integrated Moving Average (ARIMA) models to predict the speeds. We found that the LSTM model provided the best prediction accuracy among all models: in statistical terms, the mean absolute percentage error of the model on the test set is 16.7% (the average prediction error also referenced in Section 4.3). This prediction error is 5% better than simply reusing the speed from the past iteration. As expected, only immediately after a large speed variance is observed does the prediction lag behind, but it catches up with the observed speed soon after.

Based on this critical observation, we hypothesize that reliably estimating the speeds for the next computation round allows the master node to perform appropriate task assignment to the workers such that the computations performed by all workers can be utilized to obtain the final result. But this fine-granularity task assignment and utilization of all worker nodes becomes feasible only if there is no data movement overhead between rounds of computation. Coded computing is well suited for this fine-grained task assignment since the input data distributed among workers is encoded, and as a result no additional data movement is needed between rounds of computation. However, this feature is not exploited in conventional MDS-coded computation. In uncoded computation, to assign workload optimally based on the predicted speeds, either each worker node needs to store a significant fraction of the entire data, which can impose a huge storage overhead, or the master must redistribute the data among nodes at run time, which can add a huge communication overhead for iterative workloads such as gradient descent and page rank. To measure the storage overhead of uncoded computation, we performed experiments in our local cluster consisting of 12 worker nodes. We measure the total data moved to each node between rounds of computation and consider it as the effective storage needed at that node to avoid additional data movement. Figure 3 shows the mean effective storage needed at each node to avoid data movement during the course of 270 gradient descent iterations for Logistic Regression. In this experiment, the uncoded computation has accurate predictions of the speeds of the nodes for the next iteration. It needs a substantially larger fraction of the total data to be stored at each worker node to have zero data movement overhead. For S^2C^2 with (12,10)-MDS coding, the data storage remains fixed at one tenth (m/10) of the total data, much lower than what the uncoded computation requires.

Figure 3. Storage overhead of uncoded computation even if we can predict speed for each node accurately

Following these observations, we argue for S^2C^2, which exploits the unique availability of coded data at the workers and thereby utilizes the compute capacity of all worker nodes.

4. S^2C^2

4.1. Basic algorithm

Figure 4. Illustration of S^2C^2 on MDS codes: (a) conventional (4,2)-MDS, (b) conventional (4,3)-MDS, (c) S^2C^2 on (4,2)-MDS.
Figure 5. General S^2C^2 on Polynomial Codes

The major goals of the S^2C^2 algorithm are to achieve high tolerance to stragglers and to reduce the computation work assigned per worker when the number of slow nodes observed during run time is less than the conservative estimate. To achieve high straggler tolerance, the master node encodes and distributes the large matrix using a conservative (n,k)-MDS coding once at the beginning. To assign reduced computation work to the worker nodes, the master node then employs the S^2C^2 algorithm. Two key insights underlie our S^2C^2 algorithm in a cluster using conservative (n,k)-MDS coding:

  • Each worker node stores a high-redundancy encoded matrix data partition.

  • The master node can decode and construct the final product as long as it receives any k out of the n responses corresponding to each row index of the partitioned matrix.

Let there be n-s (at most n-k) stragglers in the (n,k)-MDS coded cluster. As we explained in the previous section, when there are n-s stragglers, (n,s)-MDS coding is the best-suited coding strategy. But rather than using a new (n,s)-MDS code to re-encode the data, we use the (n,k)-MDS coded data partitions as they are and instead change the amount of work done by each node. In particular, S^2C^2 allocates a decodable computational work assignment per node equal to that in (n,s)-MDS coding instead of (n,k)-MDS coding. If m is the number of rows in the original matrix, each node gets an allocation of m/s rows to be computed.

Figure 4 provides an illustration of the S^2C^2 strategy in a cluster consisting of 4 worker nodes (and 1 master node). Figure 4(a) shows the conventional (4,2)-MDS coded computation performed when worker 4 is the only straggler node and the remaining 3 workers have the same speed. Note that (4,2)-MDS coding is conservative here, since it can support 2 stragglers but in this case there is only 1 straggler. Each worker node computes on its full partition, but the master node needs only the results from workers 1 and 2 and can ignore the result from worker 3. Sub-matrices A1 and A2 refer to the vertical divisions of the matrix A. The data stored in workers 3 and 4 are coded combinations of A1 and A2, generated as per MDS-coding principles.

Figure 4(b) shows the conventional (4,3)-MDS coded computation when worker 4 is the straggler node. Each non-straggler node computes on its full partition, but the size of the partition here is smaller than the partition size under the previous coding. The master node needs the results of all three non-straggler workers to construct the final product. Sub-matrices A1, A2, and A3 are the vertical divisions of the matrix A into three parts. The data stored in worker 4 is a coded combination of these three sub-matrices.

S^2C^2 with (4,2)-MDS coded computation for this scenario is shown in Figure 4(c). If we consider the data in each worker as composed of 3 equal-size partitions, worker node 1 computes only on the first and second of its partitions, worker 2 computes only on the first and third of its partitions, and worker 3 computes only on the second and third of its partitions. As a result, each worker node performs a smaller amount of compute, equal to the amount performed by each worker in the conventional (4,3)-MDS coded computation. The partitions to be computed at each worker are assigned so that each row index is computed by exactly two workers. This is necessary for the master node to successfully decode the results at the end of the computation.
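The partition-selection rule in this example can be written down directly: under an (n,k)-MDS code with s non-stragglers, deal out k copies of every chunk index round-robin across the s non-straggler workers, so each worker computes k·C/s of its C stored chunks and every chunk index is covered exactly k times. A minimal sketch (our own, with illustrative names) for the (4,2) case above:

from collections import defaultdict

def s2c2_basic_assignment(non_stragglers, C, k):
    # Assign chunk indices (0..C-1 within each worker's stored partition) so that
    # every chunk index is computed by exactly k workers. Assumes at least k
    # non-stragglers. Illustrative sketch, not the paper's code.
    assignment = defaultdict(list)
    # k copies of each chunk index, dealt round-robin over the non-stragglers.
    copies = [c for c in range(C) for _ in range(k)]
    for i, chunk in enumerate(copies):
        assignment[non_stragglers[i % len(non_stragglers)]].append(chunk)
    return dict(assignment)

# (4,2)-MDS with worker 4 as the straggler and C = 3 chunks per partition.
print(s2c2_basic_assignment([1, 2, 3], C=3, k=2))
# -> {1: [0, 1], 2: [0, 2], 3: [1, 2]}: each worker computes 2 of its 3 chunks,
#    and every chunk index is covered by exactly 2 workers, as in Figure 4(c).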

4.2. General Algorithm

In cloud computing services and data centers, compute nodes within a cluster can have different speeds during run time, as described in Section 3.2, due to the nodes being shared or due to various micro-architectural events such as cache misses and other control/data bottlenecks. They can also simply be heterogeneous. We present a General S^2C^2 algorithm which, unlike basic S^2C^2, takes the variation in the speeds of all nodes into account when assigning work to them. At the beginning of the execution of every application, the matrix data is partitioned, encoded, and distributed to the worker nodes using (n,k)-MDS coding. For efficient decoding and work allocation, the General S^2C^2 algorithm also decomposes and treats each matrix partition as composed of chunks (groups) of rows, i.e., over-decomposition. The speed predictions from the LSTM model are provided to General S^2C^2. The workers are then sorted according to their speeds. Starting from the worker with the highest speed, each worker is assigned a number of chunks to be computed equal to the ratio of its speed to the total available computational speed of all workers. If the chunks assigned to a worker turn out to be more than the total chunks in the partition already stored at that worker, the algorithm re-assigns these extra chunks to the next worker. This case occurs when one worker is much faster than all other workers. The algorithm is summarized in Algorithm 1. In the case where all non-straggler nodes have equal speed, General S^2C^2 reduces to basic S^2C^2.

4.3. Dealing with mis-prediction or failures

The speed prediction algorithm can mis-predict when there is a sudden and significant change in the speeds of the workers. A worker node may also die or fail during execution. To handle these scenarios, the S^2C^2 algorithm employs a timeout mechanism. S^2C^2 collects results from the workers that complete their work first and measures their average response time. If the remaining workers do not respond within 15% of this average response time, S^2C^2 treats the situation as a mis-prediction and reassigns the pending work among the completed workers. We choose 15% based on the average error of our speed prediction algorithm (16.7%).
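A sketch of this timeout rule (the helper names, and the exact point at which the timer starts, are our own assumptions; only the 15% slack comes from the text above):

import time

TIMEOUT_SLACK = 0.15  # 15%, chosen from the speed predictor's average error

def wait_with_timeout(pending, completed_times, poll, reassign):
    # pending: set of worker ids still computing; completed_times: response times
    # of the workers that finished first; poll(): returns newly finished worker
    # ids; reassign(stragglers): re-distributes their unfinished chunks among the
    # finished workers. Illustrative sketch of the mechanism in Section 4.3,
    # assumed to be called right after the first group of workers completes.
    avg = sum(completed_times) / len(completed_times)
    deadline = time.time() + TIMEOUT_SLACK * avg
    while pending and time.time() < deadline:
        for w in poll():
            pending.discard(w)
    if pending:                     # mis-prediction or failure suspected
        reassign(pending)           # pending work goes to the completed workers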

4.4. Robustness of S^2C^2

Coded computing with S^2C^2 is robust and can tolerate the same number of stragglers as conventional coded computing because:

  • Data distribution in S^2C^2 is identical to the data distribution in conventional coded computing.

  • The worst case occurs when the speed prediction for S^2C^2 fails completely. In this case, general S^2C^2 along with the timeout mechanism described in Section 4.3 essentially reduces to conventional coded computing.

5. Extension to non-linear coded-computing

S^2C^2, being a workload distribution strategy, can be extended to many different coded computations. In this section we demonstrate how to apply it on top of the popular polynomial codes (Yu et al., 2017). We refer the reader to that paper for the mathematical underpinnings of polynomial codes; here we provide only a brief overview to demonstrate how they work and how S^2C^2 can be applied to such generalized codes. These codes can be used to compute polynomial functions on data with low decoding and encoding overheads.

Consider computing A^T·B, a bilinear computation on two matrices A and B, in a distributed manner using a cluster of 5 nodes. Matrix A is divided into m sub-partitions along rows and matrix B is divided into m sub-partitions along columns. Then an encoded partition of A and an encoded partition of B are computed from these sub-partitions for each node. Let us consider the scenario where we want to distribute this computation over a minimum of 5 nodes with one potential straggler. In this case m = 2, i.e., each matrix has 2 sub-partitions, A_0, A_1 and B_0, B_1. Each encoded partition of A is of the form A_0 + i·A_1, and each encoded partition of B is of the form B_0 + i^2·B_1, where i is equal to the node index. For example, node 0 will store A_0, B_0; node 2 will store A_0 + 2·A_1, B_0 + 4·B_1; and so on. In this configuration, node 2 computes (A_0 + 2·A_1)^T (B_0 + 4·B_1). Hence, there are four coded block products of A and B to recover, and we need results from any four nodes to be able to fully decode this computation. But if there are more than four non-straggler nodes, or partial stragglers, polynomial coding wastes their computation just as in the MDS coding case.
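To make the construction concrete, here is a minimal NumPy sketch of this m = 2 polynomial code (our own illustration following the construction of Yu et al. (2017); splitting both matrices into column blocks and decoding by a small Vandermonde solve are assumptions of this sketch, not details taken from the paper):

import numpy as np

rng = np.random.default_rng(1)
p, q, r = 6, 4, 4
A, B = rng.standard_normal((p, q)), rng.standard_normal((p, r))

# Split each matrix into 2 sub-partitions (column blocks here, as an assumption).
A0, A1 = A[:, :2], A[:, 2:]
B0, B1 = B[:, :2], B[:, 2:]

# Encoding at node i: A~_i = A0 + i*A1, B~_i = B0 + (i**2)*B1.
# Node i's result A~_i^T B~_i is a degree-3 matrix polynomial in i whose
# coefficients are the four blocks A0^T B0, A1^T B0, A0^T B1, A1^T B1.
nodes = range(5)
partials = {i: (A0 + i * A1).T @ (B0 + (i ** 2) * B1) for i in nodes}

# Decode from ANY four responses, e.g. nodes {0, 2, 3, 4} (node 1 straggles):
# solve the Vandermonde system that maps coefficients to evaluations.
fast = [0, 2, 3, 4]
V = np.vander(np.array(fast, dtype=float), N=4, increasing=True)  # rows [1, i, i^2, i^3]
stacked = np.stack([partials[i] for i in fast])                    # shape (4, 2, 2)
coeffs = np.einsum('ij,jkl->ikl', np.linalg.inv(V), stacked)

C00, C10, C01, C11 = coeffs              # A0^T B0, A1^T B0, A0^T B1, A1^T B1
ATB = np.block([[C00, C01], [C10, C11]])
assert np.allclose(ATB, A.T @ B)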

In Figure 5 we illustrate how our S^2C^2 framework can be applied on top of such a polynomial coded bilinear computation. In the figure, the cluster has 5 nodes and, for illustration purposes, each matrix partition has 9 rows. A minimum of 4 responses per row are needed for successful computation of A^T·B. The relative speeds of the nodes are {2,2,2,2,1}; node 4 is a partial straggler. Conventional polynomial coded computing ignores the computation from this node. General S^2C^2, however, does not, and it allocates partial work to it. General S^2C^2 allocates {8,8,8,8,4} rows to the 5 nodes respectively, as highlighted by the bounding rectangles in each worker node in the figure. The last worker (speed 1) is shown computing the last set of rows. The product of each row of the encoded A partition with the encoded B partition is computed by exactly 4 nodes and sent to the master node.

In this paper we evaluate polynomial coding while computing Hessian matrices (Himmelblau, 1972). These Hessian matrices form the foundation of a variety of optimization algorithms such as semi-definite programs, kernel ridge regression, maximum likelihood estimation, and more.

Lines with # are comments
Input: List S of predicted speeds (s_i) of the worker nodes, k of the (n,k)-MDS coding, number of rows per node (R)
Output: Computation assignment (set of row chunks) per node i
# Over-decompose each partition into C chunks of rows
# Minimum total chunks needed for correct decoding: chunks_left = k * C
# Sort the workers as per their speed in descending
# order and assign the number of chunks to be computed
for each node i in sorted S do
     # Allocate a number of chunks to node i proportional to its speed,
     # capped by the C chunks stored at node i
     num_chunks[i] = min(C, ceil(chunks_left * s_i / total speed of the not-yet-assigned workers))
     # Update total chunks left to be computed
     chunks_left = chunks_left - num_chunks[i]
# Assign the exact chunks that will be computed
for each node i in sorted S do
     assign num_chunks[i] chunk indices to node i, dealing the k required copies
     of each chunk index round-robin across the workers so that every chunk
     index is covered by exactly k workers
# Convert chunks to exact row indices
Algorithm 1: General S^2C^2 algorithm
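A runnable sketch of Algorithm 1's allocation step (our own reconstruction from the description in Section 4.2; the function name and the exact rounding rule are illustrative):

import math

def general_s2c2_allocation(speeds, k, C):
    # speeds: predicted speed per worker; k: MDS/recovery parameter; C: chunks
    # stored per worker (over-decomposition factor). Returns the number of chunks
    # each worker should compute so that k*C chunks are computed in total.
    # Illustrative reconstruction of Algorithm 1, not the authors' code.
    n = len(speeds)
    chunks_left = k * C                        # minimum chunks for correct decoding
    order = sorted(range(n), key=lambda i: speeds[i], reverse=True)
    speed_left = sum(speeds)
    alloc = [0] * n
    for i in order:
        # Chunks proportional to this worker's share of the remaining speed,
        # capped by the C chunks it actually stores; excess spills to the next worker.
        share = math.ceil(chunks_left * speeds[i] / speed_left) if speed_left else 0
        alloc[i] = min(C, share, chunks_left)
        chunks_left -= alloc[i]
        speed_left -= speeds[i]
    return alloc

# Example matching Figure 5: 5 nodes with relative speeds {2,2,2,2,1}, 9 rows
# (chunks) per partition, and 4 responses needed per row.
print(general_s2c2_allocation([2, 2, 2, 2, 1], k=4, C=9))   # -> [8, 8, 8, 8, 4]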

6. Implementation

At the beginning of the computation, the master node encodes the matrix data and distributes the encoded sub-matrices to the corresponding worker nodes. For MDS coding we deal with just a single matrix, but with polynomial codes we have two matrices to encode, and the two coding strategies use different encodings as described earlier. At the start of each iteration of our applications, the master node distributes the input vector to all worker nodes. At the end of each iteration, the master node receives the sub-products from the worker nodes, decodes them, and constructs the result vector.

Each worker node has two separate running processes, one for computation and one for communication. The computation process on the worker node performs the appropriate computation on the encoded data, either a matrix-vector operation in the MDS setting or a Hessian matrix computation in the polynomial setting. The communication process is in charge of receiving the input data and work assignment information from the master node, sending back the partial product, and controlling the start and stop of the computation process at the worker node.

6.1. LSTM based speed prediction Model

We used the speed data measured in our experiments from the motivation section as the dataset for training the LSTM model, with an 80:20 train/test split. The speed prediction model expects a 1-dimensional input and consists of a single-layer LSTM with a 4-dimensional hidden state, tanh activation, and a 1-dimensional output. The dimension of the hidden state is a hyperparameter; we experimented with different values and selected 4 as it provided the highest accuracy on the dataset. The model is used to predict the speed of each node once every iteration: its input is the speed of the node from the previous iteration and its output is the speed prediction for the next iteration. The LSTM model computation takes 200 microseconds for each node.
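A minimal PyTorch sketch matching this description (single-layer LSTM with 1-dimensional input and output and a hidden size of 4, which uses tanh internally); the class name, training loop, and data shapes are our own illustration rather than the authors' code:

import torch
import torch.nn as nn

class SpeedLSTM(nn.Module):
    # One-step-ahead speed predictor: previous-iteration speeds in, next-iteration
    # speed out. Hidden size 4 as in Section 6.1; everything else is illustrative.
    def __init__(self, hidden=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len, 1) of past speeds
        h, _ = self.lstm(x)
        return self.out(h[:, -1, :])      # predicted speed for the next iteration

def train(model, speeds, targets, epochs=50):
    # speeds: (num_sequences, seq_len, 1) tensor of measured speeds,
    # targets: (num_sequences, 1) tensor of next-interval speeds.
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(speeds), targets)
        loss.backward()
        opt.step()
    return model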

6.2. S^2C^2 specifics

The basic S^2C^2 strategy needs information on which nodes are stragglers. The general S^2C^2 strategy needs information on the relative execution speeds of all nodes, and it adjusts the work assignment to the worker nodes according to their speeds. To obtain this information we rely on the iterative nature of our algorithms. Initially the master node starts with the assumption that all worker nodes have the same speed, and this is provided as input to the current S^2C^2 strategy. The master then distributes the work assignment calculated by S^2C^2 to each worker node. Upon receiving the partial products from the worker nodes, the master node also records the response time of each worker node for the current iteration. If the number of rows computed at a worker is r and its response time is t, then the speed of that worker node for the current iteration is computed as r/t. These values from all nodes are provided as a batch input to the trained LSTM model, which predicts the speeds for the next iteration. The predicted speeds are fed into the General S^2C^2 strategy to generate the computational work assignment at each worker node for the next iteration. Thus S^2C^2 automatically adapts to speed changes at the granularity of an iteration.
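Putting the pieces together, the per-iteration control loop at the master looks roughly like this (a sketch under the assumptions above; lstm_predict, assign_work, and collect stand in for the components already described and are not names from the paper):

def master_iteration(workers, rows_assigned, lstm_predict, assign_work, collect):
    # One iteration of the S^2C^2 control loop (illustrative sketch).
    # rows_assigned[i]: rows given to worker i in this iteration.
    # 1. Collect partial products and per-worker response times.
    partials, response_times = collect(workers)

    # 2. Measured speed of worker i = rows computed / response time.
    measured = [rows_assigned[i] / response_times[i] for i in range(len(workers))]

    # 3. Predict next-iteration speeds with the trained LSTM (batched over workers).
    predicted = lstm_predict(measured)

    # 4. Generate the next iteration's work assignment with General S^2C^2.
    next_assignment = assign_work(predicted)
    return partials, next_assignment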

6.3. Computing Applications

We evaluated S^2C^2 on MDS coding using the following linear algebraic algorithms: Logistic Regression, Support Vector Machine, Page Rank, and Graph Filtering. Graph ranking algorithms like Page Rank and graph signal processing algorithms employ repeated matrix-vector multiplication: calculating page rank involves computing the eigenvector corresponding to the largest eigenvalue, which is done using the power iteration algorithm, and graph filtering operations such as k-hop filtering employ iterations of matrix-vector multiplication over the combinatorial Laplacian matrix. We evaluated S^2C^2 on both of these algorithms. We evaluated S^2C^2 on polynomial coding using the Hessian matrix operation described earlier.

6.4. System Setup

We evaluated the above computing applications in a datacenter-scale setting on the DigitalOcean cloud. We employ 11 shared compute instances, each with 1 virtual CPU and 2 GB of memory. We use Kubernetes to bootstrap a cluster using these 11 nodes, with one node acting as the master and the other 10 nodes as workers. We then dockerize the computing applications and deploy them on the cloud.

6.5. Verification in a controlled cluster

For theoretical verification purposes we also evaluated all the applications and results on our local cluster where we had the ability to precisely control the straggler behavior. Our local cluster is composed of 13 identical servers. Each server consists of two Intel Xeon CPU E5-2630 v3 each with 8 cores (8 threads, 2.40 GHz), 20 MB of L3 cache, running Centos Linux version 7.2.1511. Each machine has 64GB of DRAM. All the servers have access to a shared storage node of size 1 TB. All the servers are connected to one another through Mellanox SX6036 FDR14 InfiniBand Switch with a bandwidth of 56 Gbps. We use one of the nodes as Master node and other 12 nodes as Worker nodes.

6.6. Baseline strategies

We implement and evaluate two baseline strategies. Our first baseline is an enhanced Hadoop-like uncoded approach that is similar to LATE (Zaharia et al., 2008). In this baseline we use a 3-repetition strategy with up to six speculatively launched tasks. The strategy places 3 copies of the data at 3 randomly selected nodes in the distributed system. Unlike traditional Hadoop, this enhanced Hadoop strategy does not enforce strict data locality during speculation and allows data to be moved at runtime if a task needs to be relaunched on a node that does not have a copy of the data. Furthermore, the speculative task assignment strategy always tries to find a node that already has a copy of the data before moving the data, thereby allowing data communication only when absolutely needed.

The second baseline is the MDS-coded computation proposed by (Lee et al., 2016) and described previously in Section 2. The two MDS-coding schemes we evaluated are (12,6)-MDS as the conservative scheme and (12,10)-MDS as the optimistic scheme. No data movement is allowed in these schemes during computation. The purpose of showing results for (12,6)-MDS coding is to demonstrate the robustness of our scheme in the presence of such high redundancy; we expect that system designers are unlikely to provision 2X computation redundancy in practice.

7. Evaluation

7.1. Results from controlled cluster

We evaluate the performance of S^2C^2 against the baseline strategies for varying straggler counts in our 12-worker-node cluster; these different cases correspond to the X-axis in the plots. Each bar in the plots captures the average relative execution time spent by the application over 15 iterations, normalized by the execution time of the uncoded strategy when there are 0 stragglers in the cluster. The total execution time is dominated by the computation time, which is the total time the master node spends waiting to receive results from enough worker nodes once it commands each worker to start computing on its partition. The total execution time also includes the communication time, the total time spent by the master node in communicating with the worker nodes, and the assembling time, the total time spent by the master node in loading the partial results returned from the worker nodes and decoding them to produce the final result. The majority of the assembling time is spent loading the data rather than on the actual decoding itself. The matrix data encoding time is tiny and is paid only once at the beginning, so it is not included in the figures.

Figure 6. LR with Varying-Speed Non-Stragglers

7.1.1. Logistic Regression and SVM

We evaluated gradient descent for Logistic Regression (LR) and SVM. The results for both are very similar, and hence we focus the discussion on the LR evaluations. For our experiments we use the publicly available gisette dataset from the UC Irvine machine learning repository (Lichman, 2013). The data in this dataset is duplicated to create a larger dataset. The final size of the data partition in each node is 760 MB. Only one processor thread in each worker node is used for computation.

In the cluster, non-straggler workers may have up to 20% variation between their processing speeds, and straggler nodes are 5X slower than the fastest non-straggler node in our cluster. We compare the three baselines with the two versions of S^2C^2: basic S^2C^2, which does not consider this variation in the speeds of the non-straggler workers and treats all of them as having equal speed, and the general S^2C^2 algorithm, which takes this speed variation into account and allocates different computational work to the non-straggler workers accordingly. The results are shown in Figure 6.

As we can see from Figure 6, when there are no stragglers, all strategies have low execution times, with S^2C^2 having the lowest. As the number of stragglers increases, the execution time of the uncoded strategy increases since a slow job needs to be detected and re-executed, whereas in the coding-based strategies there is no need for re-execution. Once the number of stragglers exceeds 2, the uncoded strategy's performance starts to degrade and its execution time is 3x that of the no-straggler scenario. The super-linear degradation occurs because data partitions need to be moved across worker nodes prior to re-execution, and communication costs start to play a role in the overall performance loss.

Both versions of S^2C^2 with (12,6)-MDS coding are not only able to provide robustness against up to two stragglers in the cluster, but are also able to reduce the computation overhead due to the use of coding when there are fewer or no stragglers in the cluster. By taking the various speeds of the non-straggler worker nodes into account, the general version of the S^2C^2 strategy is able to outperform the conservative (12,6)-MDS coded computation strategy even more than the basic version of S^2C^2. This result indicates that even if we cannot take into account the precise variation in the processing speeds of the various non-straggler nodes, the basic S^2C^2 algorithm provides excellent performance and robustness. However, if the processing speed information is gathered more accurately, the generalized S^2C^2 can squeeze the hidden compute slack in the 20% speed variation and provide further performance improvements without compromising robustness.

Figure 7. PR with Varying-Speed Non-Straggler Workers

7.1.2. Page Rank and Graph Filtering

We evaluated Page Rank (PR) and Graph Filtering. The results for both of them are very similar and hence we focus the discussion on Page Rank. We used the ranking dataset available from (uto, 2000) for our evaluations. This dataset is duplicated to create a larger dataset that is used in evaluation.

The total execution time for Page Rank, when the non-straggler nodes can have up to 20% variation between their speeds, is plotted in Figure 7. We can see that the S^2C^2 algorithms significantly outperform the baseline strategies. The general S^2C^2 algorithm reduces the execution time compared to basic S^2C^2 in all scenarios.

7.2. Results from industrial cloud deployment

Figure 8. Execution time comparison on Cloud when S^2C^2 has low mis-prediction rate
Figure 9. Per worker Wasted computation effort with 0% mis-prediction rate

In this section we discuss results from our experiments on the DigitalOcean cloud. In these experiments we evaluate and compare the performance of the General S^2C^2 strategy against MDS coded computation and an over-decomposition strategy based on Charm++ (Laxmikant Kale, 1993; Acun et al., 2014) (described below). We evaluated S^2C^2 and MDS coded computation under (10,7), (9,7) and (8,7) MDS codes. During the course of our experiments we observed different mis-prediction rates from the LSTM speed prediction model. We show and discuss the performance gains under the experimental conditions where we observe the best-case and worst-case mis-prediction rates. The performance results obtained across the various applications are similar (as was also shown in the local cluster setting), so due to space constraints we focus only on the SVM results in this section.

7.2.1. Charm++ based over-decomposition

In the cloud setting, we evaluated an over-decomposition based strategy inspired by Charm++ (Laxmikant Kale, 1993; Acun et al., 2014). In our implementation we combine over-decomposition and speed prediction. We over-decompose each data partition by a factor of 4: the data is divided into partitions with each of the 10 workers receiving 4 partitions. The data is also replicated by a factor of 1.42, similar to the replication in (10,7)-MDS coding, and the additional partitions are distributed in a round-robin fashion across the workers. The master node uses predictions from the speed model to do load balancing and to transfer partitions between workers during computation. This is better than the uncoded strategy since it allows for finer-grained data transfer.

7.2.2. Results in low mis-prediction rate environment

The average relative execution times for 15 iterations of SVM when we observe a 0% mis-prediction rate for worker speeds are shown in Figure 8. The execution times of all strategies in the figure are normalized to a common reference. First, we can observe that the over-decomposition approach performs better than (10,7)-MDS coded computation. This is expected since each worker in MDS-coded computation processes more data than each worker in the over-decomposition strategy. Next, we observe that all three variations of MDS-coded computation show similar execution times: in all cases the work performed by a single worker remains the same and only the results from the fastest 7 workers are used by the master. Over-decomposition performs similarly to (10,7)-S^2C^2 in this environment since there is no additional data movement during computation. Next, S^2C^2 outperforms regular MDS coded computation for all 3 data coding variations. Further, the performance of S^2C^2 increases as the redundancy is increased, because the work done by a single worker decreases as the redundancy is increased. (10,7)-S^2C^2 outperforms the (10,7)-MDS coded computation by the widest margin. For S^2C^2, the maximum reduction in execution time over (10,7)-MDS coded computation would occur when all 10 workers are always fast during execution; the exact reduction in that case is 30%, since each worker computes m/10 rows instead of m/7. S^2C^2 with a 0% mis-prediction rate captures this best possible reduction in execution time.

Figure 9 plots the wasted computation effort measured at each worker node during execution of the conservative (10,7)-MDS coded computation and (10,7)-S^2C^2. Since the mis-prediction rate is 0%, there is no wasted computation effort in S^2C^2. In this execution, workers 1, 3, 7 and 8 have high wasted computation under MDS coding. Worker 1 has close to 90% of its computation wasted because it is only slightly slower than the fastest 7 workers, but the MDS-coded computation stops the execution of the 3 remaining workers and ignores their results once it has received results from the fastest 7 workers.

Figure 10. Execution time comparison on Cloud when S^2C^2 has high mis-prediction rate
Figure 11. Per worker Wasted computation effort with 18% mis-prediction rate

7.2.3. Results in high mis-prediction rate environment

During our experiments with shared VM instances on DigitalOcean, the highest mis-prediction rate we observe is 18%. Under this condition, the average execution times for 15 iterations of SVM are shown in Figure 10. (10,7)-MDS coded computation performs better than (9,7) and (8,7)-MDS coded computation because the probability of any 7 out of 10 nodes being fast is higher than that of any 7 out of 9 or 7 out of 8 nodes being fast. (8,7)-S^2C^2 outperforms (8,7)-MDS coding by 13%, (9,7)-S^2C^2 outperforms (9,7)-MDS coding by 11%, and (10,7)-S^2C^2 outperforms the (10,7)-MDS coded computation approach by 17%. As expected, (10,7)-S^2C^2 outperforms both the (9,7) and (8,7)-S^2C^2 variants since the opportunities for load balancing increase as the redundancy increases. The observed performance of the over-decomposition approach is lower than that of the (10,7)-MDS approach owing to the extra data movement costs for load balancing during computation, whereas in (10,7)-MDS coded computation there are no extra data movement costs during computation.

The wasted computation effort measured at each worker node under (10,7)-coding is shown in Figure 11. Due to the relatively high mis-prediction rate, S^2C^2 also incurs wasted computation effort among the worker nodes when the compute tasks of slow nodes are cancelled and reassigned to other worker nodes. However, the conservative (10,7)-MDS approach incurs higher wasted computation since it always ignores the slowest 3 nodes' computation efforts. On average, the conservative MDS scheme incurs 47% more wasted computation effort.

Figure 12. S^2C^2 on Polynomial codes

7.2.4. Results with S^2C^2 on Polynomial Coding

We evaluated S^2C^2 applied on conventional polynomial coding while performing the Hessian matrix computation we described in Section 5. The results collected under low and high mis-prediction rates are shown in Figure 12. In these experiments, the cluster has 12 nodes. The matrices are each partitioned into 3 sub-matrices, encoded, and the encoded partitions are distributed to the 12 nodes; each node computes on 2 encoded partitions (one from each matrix), and results from any 9 nodes are enough to compute the Hessian. In this setup, S^2C^2 reduces the overall computation time by 19% in the low mis-prediction rate environment. The part of the Hessian computation that each node must perform before the coded bilinear step is not influenced by S^2C^2; as a result, the gains from using S^2C^2 are lower than the theoretical maximum. Under the high mis-prediction rate environment, S^2C^2 reduces the overall computation time by 14%.

These evaluation results demonstrate the effectiveness of S^2C^2 across different coded computation schemes.

8. Related Work

Straggler Mitigation: The authors of (Dean and Barroso, 2013) propose several software techniques to contain the effect of stragglers, such as using redundant tasks and selective replication. The authors of (Ananthanarayanan et al., 2010) utilize real-time progress reports to detect and cancel stragglers early. The authors of (Zaharia et al., 2008) present the LATE algorithm to improve straggler detection and speculative execution in the Hadoop framework. The authors of (Ananthanarayanan et al., 2014) use extrapolation to estimate task durations and perform straggler mitigation. In (Li et al., 2014; Zhang et al., 2016; Kasture and Sanchez, 2014) the authors explore system-level sources of tail latency and implement mechanisms to eliminate these causes. Adrenaline (Hsu et al., 2017) identifies and selectively speeds up long queries by quick voltage boosting. Paragon (Delimitrou and Kozyrakis, 2013) presents a QoS-aware online heterogeneous datacenter scheduler. Prior works (Delimitrou and Kozyrakis, 2014; Lo et al., 2014; Leverich and Kozyrakis, 2014; Zhu et al., 2017a) focus on improving resource efficiency while providing low latency. Using replicated tasks to improve response times has been explored in (Ananthanarayanan et al., 2013; Shah et al., 2013; Wang et al., 2014; Gardner et al., 2015; Chaubey and Saule, 2015; Lee et al., 2017). Generally this approach involves launching multiple copies of each task across workers, using results from the fastest copy and canceling the slower copies. This approach needs multiple replicas of all the data; recent works on coded computing have shown that replicating with coding is a better way to tolerate stragglers than reactive replication of tasks. Another strategy used for straggler mitigation is arriving at an approximate result without waiting on the stragglers (Goiri et al., 2015).

Coded Computation: Coded computation is a recently proposed framework with two concepts for dealing with the communication and straggler bottlenecks in distributed computing. The first coded computing concept (Li et al., 2015, 2016b) enables an inverse-linear tradeoff between computation load and communication load in distributed computing, which can be leveraged to speed up large-scale data analytics applications (Li et al., 2017). The second concept (Lee et al., 2016), the focus of this paper, provides resiliency to stragglers and can be utilized to mitigate tail latency in distributed computing (Lee et al., 2016; Reisizadeh et al., 2017; Li et al., 2016a; Dutta et al., 2016; Tandon et al., 2016; Yu et al., 2017). In particular, several of these works target distributed machine learning. Recent results from (Kosaian et al., 2018) demonstrate coded computing on non-linear computations, specifically deep learning inference. There have been a few recent works in the coded computing literature that exploit the computations of slow nodes (Zhu et al., 2017b; Yang et al., 2017); however, the key ingredient of our proposed strategy is that it dynamically adapts the computation load of each node to its speed estimated from the previous rounds of computation.

9. Conclusion

In this paper we proposed and evaluated S^2C^2, which efficiently tolerates speed variance and uncertainty about the number of stragglers in the system. S^2C^2 distributes coded data to nodes and, during runtime, adaptively adjusts the computation work per node, thereby significantly reducing the total execution time of several applications. Through our evaluations using machine learning and graph processing applications, we demonstrate a ~39.3% reduction in execution time. We conclude that the workload scheduling techniques presented in this paper effectively reduce the overhead in coded computation frameworks and make them more effective in real deployments.

References