# OverSketch: Approximate Matrix Multiplication for the Cloud

We propose OverSketch, an approximate algorithm for distributed matrix multiplication in serverless computing. OverSketch leverages ideas from matrix sketching and high-performance computing to enable cost-efficient multiplication that is resilient to the faults and straggling nodes pervasive in low-cost serverless architectures. We establish statistical guarantees on the accuracy of OverSketch and empirically validate our results by solving a large-scale linear program using interior-point methods, demonstrating a 34% reduction in compute time on AWS Lambda.

## I Introduction

Matrix multiplication is a frequent computational bottleneck in fields such as scientific computing, machine learning, and graph processing. In many applications, such matrices are very large, with dimensions easily scaling up to millions. Consequently, the last three decades have witnessed the development of many algorithms for parallel matrix multiplication for High Performance Computing (HPC). During the same period, technological trends like Moore’s law made arithmetic operations faster and, as a result, the bottleneck for parallel computation shifted from computation to communication. Today, the cost of moving data between nodes exceeds the cost of arithmetic operations by orders of magnitude, and this gap is increasing exponentially with time [1, 2, 3]. This has led to the popularity of communication-avoiding algorithms for parallel computation [3, 4].

In the last few years, there has been a paradigm shift from HPC towards distributed computing on the cloud due to extensive and inexpensive commercial offerings. In spite of developments in recent years, server-based cloud computing is inaccessible to a large number of users due to complex cluster management and a myriad of configuration tools. Serverless computing (the term ‘serverless’ is a misnomer: servers are still used for computation, but their maintenance and provisioning are hidden from the user) has recently begun to fill this void by abstracting away the need for maintaining servers and thus removing the need for complicated cluster management while providing greater elasticity and easy scalability [5, 6, 7]. Some examples are Amazon Web Services (AWS) Lambda, Microsoft Azure Functions, and Google Cloud Functions. Large-scale matrix multiplication, being embarrassingly parallel and frequently encountered, is a natural fit for serverless computing.

Existing distributed algorithms for HPC/server-based systems cannot, in general, be extended to serverless computing due to the following crucial differences between the two architectures:

• Workers in the serverless setting, unlike cluster nodes, do not communicate amongst themselves. They read/write data directly from/to a single data storage entity (for example, cloud storage like AWS S3) and the user is only allowed to submit prespecified jobs and does not have any control over the management of workers [5, 6, 7].

• Distributed computation in HPC/server-based systems is generally limited by the number of workers at disposal. However, in serverless systems, the number of inexpensive workers can easily be scaled into the thousands, but these low-commodity nodes are generally limited by the amount of memory and lifespan available.

• Unlike HPC, nodes in the cloud-based systems suffer degradation due to system noise which can be a result of limited availability of shared resources, network latency, hardware failure, etc. [8, 9]. This causes variability in job times, which results in subsets of slower nodes, often called stragglers, which significantly slow the computation. Time statistics for worker job times are plotted in Figure 1 for AWS Lambda. Notably, there are a few workers () that take much longer than the median job time, thus decreasing the overall computational efficiency of the system. Distributed algorithms robust to such unreliable nodes are desirable in cloud computing.

### I-A Main Contributions

This paper bridges the gap between communication-efficient algorithms for distributed computation and existing methods for straggler-resiliency. To this end, we first analyze the monetary cost of distributed matrix multiplication for serverless computing for two different schemes of partitioning and distributing the data. Specifically, we show that row-column partitioning of input matrices requires asymptotically more communication than blocked partitioning for distributed matrix multiplication, similar to the optimal communication-avoiding algorithms in the HPC literature.

In applications like machine learning, where the data itself is noisy, solution accuracy is often traded for computational efficiency. Motivated by this, we propose OverSketch, a sketching scheme to perform blocked approximate matrix multiplication and prove statistical guarantees on the accuracy of the result. OverSketch has threefold advantages:

1. Reduced computational complexity by significantly decreasing the dimension of input matrices using sketching,

2. Resiliency against stragglers and faults in serverless computing by over-provisioning the sketch dimension,

3. Communication efficiency for distributed multiplication due to the blocked partition of input matrices.

Sketching for OverSketch requires linear time that is embarrassingly parallel. Through experiments on AWS Lambda, we show that small redundancy () is enough to tackle stragglers using OverSketch. Furthermore, we use OverSketch to calculate the Hessian distributedly while solving a large linear program using interior point methods and demonstrate a reduction in total compute time on AWS Lambda.

### I-B Related Work

Traditionally, techniques like speculative execution are used to deal with stragglers, for example, Hadoop MapReduce [10] and Apache Spark [11]. Such techniques work by detecting nodes that are running slowly or will slow down in the future and then assigning their jobs to new nodes without shutting down the original job. The node that finishes first submits its results. This has many limitations. A constant monitoring of jobs is required, which might be costly if there are many workers in the system. It is also possible that a node will straggle only towards the end of the job, and by the time the job is resubmitted, the additional time and computational overhead has already hurt the overall efficiency of the system. The situation is even worse for smaller jobs, as spinning up an extra node requires additional invocation and setup time which can exceed the job time itself.

Recently, approaches based on coding theory have been developed which cleverly introduce redundancy into the computation to deal with stragglers [12, 13, 14, 15, 16, 17]. Many of these proposed schemes have been dedicated to distributed matrix multiplication [13, 15, 14, 16]. In [13], the authors develop a coding scheme for matrix multiplication that uses Maximum Distance Separable (MDS) codes to code in a column-wise fashion and in a row-wise fashion, so that the resultant is a product-code of , where . An illustration is shown in Figure 2. A simpler version of this has been known in the HPC community as Algorithm-Based-Fault-Tolerance (ABFT) [18]. Authors in [14] generalize the results in [13] to a -dimensional product code with only one parity in each dimension. In [15], the authors develop polynomial codes for matrix multiplication, which is an improvement over [13] in terms of recovery threshold, that is, the minimum number of workers required to recover the product .

The commonality in these and other similar results is that they divide the input matrices into row and column blocks, where each worker multiplies a row block (or some combination of row blocks) of and a column block (or some combination of column blocks) of . These methods provide straggler resiliency but are not cost-efficient as they require asymptotically more communication than blocked partitioning of data, as discussed in detail in the next section. Another disadvantage of such coding-based methods is that there are separate encoding and decoding phases that require additional communication and potentially large computational burden at the master node, which may make the algorithm infeasible in some distributed computing environments.

## II Preliminaries

There are two common schemes for distributed multiplication of two matrices and , as illustrated in Figures 2(a) and 2(b). We refer to these schemes as naive and blocked matrix multiplication, respectively. Detailed steps for these schemes are provided in Algorithms 1 and 2, respectively, for the serverless setting. During naive matrix multiplication, each worker receives and multiplies an row-block of and column-block of to compute an block of . Blocked matrix multiplication consists of two phases. During the computation phase, each worker gets two blocks, one each from and , which are then multiplied by the workers. In the reduction phase, to compute a block of , one worker gathers results of all the workers from the cloud storage corresponding to one row-block of and one column-block of and adds them. For example, in Figure 2(b), to get , results from 3 workers who compute , and are added.
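The two schemes can be illustrated on a single machine. The following is a minimal NumPy sketch, not the serverless implementation of Algorithms 1 and 2: the function names, the dictionary standing in for cloud storage, the block-size arguments `a` and `b`, and the assumption that block sizes divide the matrix dimensions are all illustrative choices.

```python
import numpy as np

def naive_matmul(A, B, a):
    """Naive scheme: each (simulated) worker multiplies a row-block of A
    with a column-block of B and writes one a x a block of the product."""
    m, l = A.shape[0], B.shape[1]
    C = np.zeros((m, l))
    for i in range(0, m, a):
        for j in range(0, l, a):
            C[i:i + a, j:j + a] = A[i:i + a, :] @ B[:, j:j + a]
    return C

def blocked_matmul(A, B, b):
    """Blocked scheme: a computation phase followed by a reduction phase."""
    m, n = A.shape
    l = B.shape[1]
    assert B.shape[0] == n and m % b == 0 and n % b == 0 and l % b == 0
    # Computation phase: one worker per (i, j, k) multiplies two b x b blocks
    # and writes the partial product to "storage" (here a plain dictionary).
    storage = {}
    for i in range(m // b):
        for j in range(l // b):
            for k in range(n // b):
                A_ik = A[i * b:(i + 1) * b, k * b:(k + 1) * b]
                B_kj = B[k * b:(k + 1) * b, j * b:(j + 1) * b]
                storage[(i, j, k)] = A_ik @ B_kj
    # Reduction phase: one worker per output block reads its n/b partial
    # products from storage and adds them.
    C = np.zeros((m, l))
    for i in range(m // b):
        for j in range(l // b):
            C[i * b:(i + 1) * b, j * b:(j + 1) * b] = sum(
                storage[(i, j, k)] for k in range(n // b))
    return C
```

Both functions return the exact product; they differ only in how much data each simulated worker reads for the work it performs, which is the quantity the cost analysis in the next section compares.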

It is accepted in High Performance Computing (HPC) that blocked partitioning of input matrices takes less time than naive matrix multiplication [19, 4, 3]. For example, in [4], the authors propose 2.5D matrix multiplication, an optimal communication avoiding algorithm for matrix multiplication in HPC/server-based computing, that divides input matrices into blocks and stores redundant copies of them across processors to reduce bandwidth and latency costs. However, perhaps due to lack of a proper analysis for cloud-based distributed computing, existing algorithms for straggler mitigation in the cloud do naive matrix multiplication [13, 15, 14]. Next, we bridge the gap between cost analysis and straggler mitigation for distributed computation in the serverless setting.

## III Cost Analysis: Naive and Blocked Multiplication

There are communication and computation costs associated with any distributed algorithm. Communication costs themselves are of two types: latency and bandwidth. For example, sending words requires packing them into contiguous memory and transmitting them as a message. The latency cost $\alpha$ is the fixed overhead time spent in packing and transmitting a message over the network. Thus, to send $Q$ messages, the total latency cost is $\alpha Q$. Similarly, to transmit $K$ words, a bandwidth cost proportional to $K$, given by $\beta K$, is associated. Letting $\gamma$ denote the time to perform one floating point operation (FLOP), the total computing cost is $\gamma F$, where $F$ is the total number of FLOPs at the node. Hence, the total time pertaining to one node that sends $Q$ messages, $K$ words and performs $F$ FLOPs is

$$T_{\text{worker}} = \alpha Q + \beta K + \gamma F,$$

where $\alpha \gg \beta \gg \gamma$. The model defined above has been well-studied and is used extensively in the HPC literature [2, 3, 20, 4, 21]. It is ideally suited for serverless computing, where network topology does not affect the latency costs as each worker reads/writes directly from/to the cloud storage and no multicast gains are possible.

However, our analysis for costs incurred during distributed matrix multiplication differs from previous works in three principal ways. 1) Workers in serverless architecture cannot communicate amongst themselves, and hence, our algorithm for blocked multiplication is very different from the optimal communication-avoiding algorithm for HPC that involves message passing between workers [4]. 2) The number of workers in HPC analyses is generally fixed, whereas the number of workers in the serverless setting is quite flexible, easily scaling into the thousands, and the limiting factor is memory/bandwidth available at each node. 3) Computation on the inexpensive cloud is more motivated by savings in expenditure than the time required to run the algorithm. We define our cost function below.

If there are $W$ workers, each doing an equal amount of work, the total amount of money spent in running the distributed algorithm on the cloud is proportional to

$$C_{\text{total}} = W \times T_{\text{worker}} = W(\alpha Q + \beta K + \gamma F). \tag{1}$$

Eq. (1) does not take into account the straggling costs, as they increase the total cost by a constant factor (by re-running the jobs that are straggling) and hence do not affect our asymptotic analysis.
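The cost model in Eq. (1) is simple enough to write down directly. The snippet below is only an illustrative sketch: the function names are hypothetical, and the values of the constants used in the example call are placeholders rather than measurements from any platform.

```python
def worker_time(Q, K, F, alpha, beta, gamma):
    """Time for one worker: latency (alpha per message), bandwidth
    (beta per word) and computation (gamma per FLOP)."""
    return alpha * Q + beta * K + gamma * F

def total_cost(W, Q, K, F, alpha, beta, gamma):
    """Quantity proportional to the money spent by W identical workers, Eq. (1)."""
    return W * worker_time(Q, K, F, alpha, beta, gamma)

# Placeholder constants chosen only so that alpha >> beta >> gamma.
example = total_cost(W=10, Q=2, K=100, F=1000, alpha=1.0, beta=0.5, gamma=0.25)
print(example)
```

Because the cost is the product of the number of workers and the per-worker time, a scheme that finishes sooner per worker can still cost more overall if it needs asymptotically more workers.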

Inexpensive nodes in serverless computing are generally constrained by the amount of memory or communication bandwidth available. For example, AWS Lambda nodes have a maximum allocated memory of 3008 MB (AWS Lambda limits are available at https://docs.aws.amazon.com/lambda/latest/dg/limits.html and may change over time), a fraction of the memory available in today’s smartphones. Let the memory available at each node be limited to words. That is, the communication bandwidth available at each worker is limited to words, and this is the main bottleneck of the distributed system. We would like to multiply two large matrices and in parallel, and let . Note that if , one of the following will happen:

• and , and the input matrices can fit into one worker’s memory and parallelism is not required.

• Either or or both, and block-size for blocked matrix multiplication is . The two schemes, naive and blocked multiplication, would exactly be the same in this case.

Thus, for all practical cases in consideration, .

###### Proposition III.1

For the cost model defined in Eq. (1), communication (i.e., latency and bandwidth) costs for blocked multiplication outperform naive multiplication by a factor of , where the individual costs are listed in Table I.

Proof. See Appendix A.

The rightmost column in Table I lists the ratio of communication costs for naive and blocked matrix multiplication. We note that the latter significantly outperforms the former, with communication costs being asymptotically worse for naive multiplication. An intuition behind why this happens is that each worker in distributed blocked multiplication does more work than in distributed naive multiplication for the same amount of received data. For example, to multiply two square matrices of dimension , where memory at each worker is limited by , for naive multiplication and for blocked multiplication. We note that the amount of work done by each worker in naive and blocked multiplication is and , respectively. Since the total amount of work is constant and equal to , blocked matrix multiplication ends up communicating less during the overall execution of the algorithm as it requires fewer workers. Note that naive multiplication takes less time to complete as each worker does asymptotically less work; however, the number of workers required is asymptotically larger, which is not an efficient utilization of resources and increases the expenditure significantly.

Figure 4 supports the above analysis where we plot the cost in dollars of multiplying two square matrices in AWS Lambda, where each node’s memory is limited by 3008 MB and price per worker per millisecond is $. However, as discussed earlier, existing schemes for straggler-resiliency in distributed matrix multiplication consider naive multiplication which is impractical from a user’s point of view. In the next section, we propose OverSketch, a scheme to mitigate the detrimental effects of stragglers for blocked matrix multiplication.

## IV OverSketch: Straggler-resilient Blocked Matrix Multiplication using Sketching

Many of the recent advances in algorithms for numerical linear algebra have come from the technique of linear sketching, in which a given matrix is compressed by multiplying it with a random matrix of appropriate dimension. The resulting product can then act as a proxy for the original matrix in expensive computations such as matrix multiplication, least-squares regression, low-rank approximation, etc. [22, 23, 24, 25]. For example, computing the product of $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times l}$ takes $O(mnl)$ time. However, if we use $S \in \mathbb{R}^{n \times d}$ to compute the sketches, say $AS$ and $S^T B$, where $d$ is the sketch dimension, we can reduce the computation time to $O(mdl)$ by computing an approximate product $A S S^T B$. This is very useful in applications like machine learning, where the data itself is noisy, and computing the exact result is not needed.

Key idea behind OverSketch: Sketching accelerates computation by eliminating redundancy in the matrix structure through dimension reduction. However, the coding-based approaches described in Section I-B have shown that redundancy can be good for combating stragglers if judiciously introduced into the computation.
With these competing points of view in mind, our algorithm OverSketch works by "oversketching" the matrices to be multiplied by reducing dimensionality not to the minimum required for sketching accuracy, but rather to a slightly higher amount which simultaneously ensures both the accuracy guarantees and speedups of sketching and the straggler resilience afforded by the redundancy which was not eliminated in the sketch. OverSketch further reduces asymptotic costs by adopting the idea of block partitioning from HPC, suitably adapted for a serverless architecture. Next, we propose a sketching scheme for OverSketch and describe the process of straggler mitigation in detail.

### IV-A OverSketch: The Algorithm

During blocked matrix multiplication, the -th block of is computed by assimilating results from workers who compute the product , for . Thus, the computation can be viewed as the product of the row sub-block of and the column sub-block of . An illustration is shown in Figure 5. Assuming is large enough to guarantee the required accuracy in , we increase the sketch dimension from to , where is the worst case number of stragglers in workers. For the example in Figure 5, . To get a better insight on , we observe in our simulations for cloud systems like AWS Lambda and EC2 that the number of stragglers is for most runs. Thus, if , i.e. workers compute one block of , then is sufficient to get similar accuracy for matrix multiplication. Next, we describe how to compute the sketched matrices and .

Many sketching techniques have been proposed recently for approximate matrix computations. For example, to sketch a matrix with sketch dimension , Gaussian projection takes time, Subsampled Randomized Hadamard Transform (SRHT) takes time, count sketch takes time, where is the number of non-zero entries [26, 23, 27, 28]. Count sketch is one of the most popular sketching techniques as it requires linear time to compute the matrix sketch with similar approximation guarantees.
To compute the count sketch of $A$ of sketch dimension $b$, each column in $A$ is multiplied by $-1$ with probability 0.5 and then mapped to an integer sampled uniformly from $\{1, 2, \dots, b\}$. Then, to compute the sketch $A S_c$, columns with the same mapped value are summed. An example of count sketch matrix with $n = 9$ and $b = 3$ is

$$S_c^T = \begin{bmatrix} 0 & 0 & 0 & 1 & -1 & 0 & -1 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 & -1 \end{bmatrix}. \tag{2}$$

Here, $A$ has 9 columns, and columns 4, 5 and 7 were mapped to 1, columns 1, 2 and 6 were mapped to 2, and columns 3, 8 and 9 were mapped to 3. Thus, the count sketch $A S_c$ would have only 3 columns, which are obtained by summing the columns of $A$ with the same mapped value (after possibly multiplying with -1). The sparse structure of $S_c$ ensures that the computation of sketch takes time. However, a drawback of the desirable sparse structure of count sketch is that it cannot be directly employed for straggler mitigation in blocked matrix multiplication as it would imply complete loss of information from a subset of columns of $A$. For the example in (2), suppose the worker processing column 3 of $A S_c$ is straggling. Ignoring this worker would imply that columns 3, 8 and 9 of $A$ were not considered in the computation. This will generally lead to poor accuracy for sketched matrix multiplication.

To facilitate straggler mitigation for blocked matrix multiplication, we propose a new sketch matrix $S$, inspired by count sketch, and define it as

$$S = \frac{1}{\sqrt{N}} (S_1, S_2, \cdots, S_{N+e}), \tag{3}$$

where $N = d/b$, $e$ is the expected number of stragglers per block of the product, and $S_i$, for $i = 1, 2, \dots, N+e$, is a count sketch matrix with dimension $n \times b$. Thus, the total sketch-dimension for the sketch matrix in (3) is $(N+e)b = d + eb$. Computation of this sketch takes time and can be implemented in a distributed fashion trivially, where is the number of workers per block of and is a constant less than for most practical cases. We describe OverSketch in detail in Algorithm 3. We prove statistical guarantees on the accuracy of our sketching based matrix multiplication algorithm next.
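The construction in (3) can be simulated on one machine. The code below is a minimal NumPy sketch under stated assumptions, not Algorithm 3 itself: the function names and the `stragglers` argument are hypothetical, each loop iteration stands in for one worker, the sketch blocks are stored densely for readability (a practical implementation would exploit their sparsity), and if more than `e` workers are dropped the result is averaged over the blocks that remain.

```python
import numpy as np

def count_sketch_matrix(n, b, rng):
    """Return an n x b count sketch matrix: one random +1/-1 entry per row."""
    S = np.zeros((n, b))
    buckets = rng.integers(0, b, size=n)      # bucket assigned to each column of A
    signs = rng.choice([-1.0, 1.0], size=n)   # random sign applied to each column of A
    S[np.arange(n), buckets] = signs
    return S

def oversketch_product(A, B, b, N, e, rng, stragglers=()):
    """Approximate A @ B from N + e independent count sketch blocks,
    ignoring the blocks whose index is listed in `stragglers`."""
    n = A.shape[1]
    finished = []
    for i in range(N + e):
        S_i = count_sketch_matrix(n, b, rng)
        if i in stragglers:
            continue                          # this worker's result never arrives
        finished.append((A @ S_i) @ (S_i.T @ B))
    used = finished[:N]                       # keep the first N results
    # With exactly N results this equals the 1/sqrt(N) scaling of (3) applied twice.
    return sum(used) / len(used)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 50))
B = rng.standard_normal((50, 8))
approx = oversketch_product(A, B, b=20, N=10, e=2, rng=rng, stragglers=(3, 7))
rel_err = np.linalg.norm(A @ B - approx, "fro") ** 2 / (
    np.linalg.norm(A, "fro") ** 2 * np.linalg.norm(B, "fro") ** 2)
print("squared Frobenius error / (||A||_F^2 ||B||_F^2):", rel_err)
```

Because every block $S_i$ touches all columns of $A$, dropping a block only lowers the effective sketch dimension instead of discarding a subset of columns, which is the property that a single count sketch lacks.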
### IV-B OverSketch: Approximation Guarantees

###### Definition IV.1

We say that an approximate matrix multiplication of two matrices $A$ and $B$ using sketch $S$, given by $A S S^T B$, is $(\epsilon, \theta)$ accurate if, with probability at least $1 - \theta$, it satisfies

$$\|AB - A S S^T B\|_F^2 \le \epsilon \|A\|_F^2 \|B\|_F^2.$$

Now, for blocked matrix multiplication using OverSketch and as illustrated in Figure 5, the following holds.

###### Theorem IV.1

Computing $A S S^T B$ using the sketch $S$ in (3) with $d = \frac{2}{\epsilon \theta}$, while ignoring $e$ stragglers among any $N + e$ workers, is $(\epsilon, \theta)$ accurate.

Proof. See Appendix B.

For certain applications, the guarantee in theorem IV.1 may be too crude as the product of $\|A\|_F^2$ and $\|B\|_F^2$ in the RHS can get big for large matrices $A$ and $B$. We can obtain a stronger result than in theorem IV.1 when $\min(\operatorname{rank}(A), \operatorname{rank}(B)) \ll n$, for example, when $A$ is a fat matrix, or $B$ is a tall matrix. Without loss of generality, say $\min(\operatorname{rank}(A), \operatorname{rank}(B)) = \operatorname{rank}(A) = r$. Thus, $\|A\|_F^2 \le r \|A\|_2^2$, where $\|\cdot\|_2$ denotes the spectral norm. Hence, with probability at least $1 - \theta$,

$$\|A S S^T B - AB\|_F^2 \le \epsilon r \|A\|_2^2 \|B\|_F^2.$$

Now, if we increase the sketch dimension by a factor of $r$ to $d = \frac{2r}{\epsilon \theta}$, we get

$$\|A S S^T B - AB\|_F^2 \le \epsilon \|A\|_2^2 \|B\|_F^2 \tag{4}$$

with probability $1 - \theta$, which is a better approximation for the product $AB$.

During the reduction phase, we use workers, which is much less than the number of workers used during the computation phase, that is, . In our experiments, we observe that the possibility of stragglers reduces significantly if fewer workers are used. This is especially true for the reduction phase, as healthy running workers from the computation phase are reused, reducing the chances of stragglers. However, in the unfortunate event that stragglers are observed during reduction, speculative execution can be used, i.e. detecting and restarting the slow job. Another simple solution is to use existing coding techniques as described in Figure 2, that is, by adding one parity row-block to and one parity column-block to before multiplying them, which can tolerate stragglers in the worst case. However, this would require a decoding step to compensate for the missing stragglers.
## V Experimental Results

### V-A Blocked Matrix Multiplication on AWS Lambda

We implement the straggler-resilient blocked matrix multiplication described above in the serverless computing platform Pywren [6, 7] (a working implementation of OverSketch is available at https://github.com/vvipgupta/OverSketch) on the AWS Lambda cloud system to compute an approximate with and as defined in (3) with sketch dimension . Throughout this experiment, we take and to be constant matrices where the entries of are given by for all and and . Thus, to compute -th block of , 30 nodes compute the product of and , where and . While collecting results, we ignore workers for each block of , where is varied from to . The time statistics are plotted in Figure 5(a). The corresponding worker job times are shown in Figure 1, where the median job time is around seconds, and some stragglers return their results around seconds and some others take up to seconds. We note that the compute time for matrix multiplication reduces by a factor of if we ignore at most workers per workers that compute a block of . In Figure 5(b), for same and , we plot average error in matrix multiplication by generating ten instances of sketches and averaging the error in Frobenius norm, , across instances. We see that the average error is only when 4 workers are ignored.

### V-B Solving Optimization Problems with Sketched Matrix Multiplication

Matrix multiplication is the bottleneck of many optimization problems. Thus, sketching has been applied to solve several fairly common optimization problems using second-order methods, like linear programs, maximum likelihood estimation, generalized linear models like least squares and logistic regression, semi-definite programs, support vector machines, kernel ridge regression, etc., with essentially the same convergence guarantees as exact matrix multiplication [24, 25].
As an instance, we solve the following linear program (LP) using interior point methods on AWS Lambda

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & c^T x \\ \text{subject to} \quad & Ax \le b, \end{aligned} \tag{5}$$

where $x \in \mathbb{R}^{m}$, $c \in \mathbb{R}^{m}$, $b \in \mathbb{R}^{n}$, and $A \in \mathbb{R}^{n \times m}$ is the constraint matrix with $n > m$. To solve (5) using the logarithmic barrier method, we solve the following sequence of problems using Newton’s method

$$\min_{x \in \mathbb{R}^m} f(x) = \min_{x \in \mathbb{R}^m} \left( \tau c^T x - \sum_{i=1}^{n} \log(b_i - a_i x) \right), \tag{6}$$

where $a_i$ is the $i$-th row of $A$, $\tau$ is increased geometrically as after every 10 iterations and the total number of iterations is . The update in the $t$-th iteration is given by

$$x_{t+1} = x_t - \eta \left( \nabla^2 f(x_t) \right)^{-1} \nabla f(x_t), \tag{7}$$

where $x_t$ is the estimate of the solution in the $t$-th iteration and $\eta$ is the appropriate step-size. The gradient and Hessian for the objective in (6) are given by

$$\nabla f(x) = \tau c + \sum_{i=1}^{n} \frac{a_i^T}{b_i - a_i x} \quad \text{and} \tag{8}$$

$$\nabla^2 f(x) = A^T \operatorname{diag}\!\left( \frac{1}{(b_i - a_i x)^2} \right) A, \tag{9}$$

respectively. The square root of the Hessian is given by $\nabla^2 f(x)^{1/2} = \operatorname{diag}\!\left( \frac{1}{|b_i - a_i x|} \right) A$. The computation of the Hessian requires $O(nm^2)$ time and is the bottleneck in each iteration. Thus, we use our distributed and sketching-based blocked matrix multiplication scheme to mitigate stragglers while evaluating the Hessian approximately, i.e. $\nabla^2 f(x) \approx \left( S^T \nabla^2 f(x)^{1/2} \right)^T \left( S^T \nabla^2 f(x)^{1/2} \right)$, on AWS Lambda, where $S$ is defined in (3).

We take the block size, , to be , the dimensions of to be and and the sketch dimension to be . We use a total of workers in each iteration. Thus, to compute each block of , workers are assigned to compute matrix multiplication on two blocks. We depict the time and error versus iterations in Figure 7. We plot our results for different values of , where is the number of workers ignored per block of . In our simulations, each iteration includes around seconds of invocation time to launch AWS Lambda workers and assign tasks. In Figure 6(a), we plot the total time that includes the invocation time and computation time versus iterations. In Figure 6(b), we exclude the invocation time and plot just the compute time in each iteration and observe savings in solving (5) when , whereas the effect on the error with respect to the optimal solution is insignificant (as shown in Figure 6(c)).
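The Newton update (7) with a sketched Hessian can be written compactly. The code below is a single-machine illustration under stated assumptions, not the distributed implementation used in the experiments: the function names are hypothetical, `S` is any sketch matrix with $n$ rows (for example the one in (3)), and the iterate `x` is assumed strictly feasible so that all slacks $b_i - a_i x$ are positive.

```python
import numpy as np

def barrier_grad(A, b, c, x, tau):
    """Gradient (8) of the log-barrier objective (6)."""
    slack = b - A @ x                 # b_i - a_i x, assumed positive
    return tau * c + A.T @ (1.0 / slack)

def hessian_sqrt(A, b, x):
    """Square root of the Hessian (9): diag(1 / (b_i - a_i x)) A."""
    slack = b - A @ x
    return A / slack[:, None]

def sketched_newton_step(A, b, c, x, tau, S, eta=1.0):
    """One update (7) in which the Hessian is replaced by its sketched estimate."""
    H_half = hessian_sqrt(A, b, x)    # n x m
    SH = S.T @ H_half                 # sketched square root, (sketch dim) x m
    H_approx = SH.T @ SH              # m x m estimate of A^T diag(1/slack^2) A
    return x - eta * np.linalg.solve(H_approx, barrier_grad(A, b, c, x, tau))
```

Passing the identity matrix as `S` recovers the exact Newton step, which gives a simple way to check the sketched update against the unsketched one on a small problem.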
### V-C Comparison with Existing Straggler Mitigation Schemes

In this section, we compare OverSketch with an existing coding-theory based straggler mitigation scheme described in [13]. An illustration for [13] is shown in Figure 2. We multiply two square matrices and of dimension on AWS Lambda using the two schemes, where and for all . We limit the bandwidth of each worker by 400 MB (i.e. around 48 million entries, where each entry takes 8 bytes) for a fair comparison. Thus, we have , or for OverSketch and for [13], where is the size of the row-block of (and column-block of ). We vary the matrix dimension from to . For OverSketch, we take the sketch dimension to be , and take $e = 1$, i.e., ignore one straggler per block of . For straggler mitigation in [13], we add one parity row in and one parity column in . In Figures 7(a) and 7(b), we compare the workers required and average cost in dollars, respectively, for the two schemes. We note that OverSketch requires asymptotically fewer workers, and this translates into the cost of doing matrix multiplication. This is because the running time at each worker is heavily dependent on communication, which is the same for both the schemes. For , the average error in Frobenius norm for OverSketch is less than , and decreases as is increased.

The scheme in [13] requires an additional decoding phase, and assumes the existence of a powerful master that can store the entire product in memory and decode for the missing blocks using the redundant chunks. This is also true for the other schemes in [15, 14, 16]. Moreover, these schemes would fail when the number of stragglers is more than the provisioned redundancy, while OverSketch has a ‘graceful degradation’ as one can get away by ignoring more workers than provisioned at the cost of accuracy of the result.

## VI Conclusion

Serverless computing penetrates a large user base by allowing users to run distributed applications without the hassles of server management.
We analyzed the cost of distributed computation in serverless computing for naive and blocked matrix multiplication. Through analysis and experiments on AWS Lambda, we show that the latter significantly outperforms the former. Thus, existing straggler mitigation schemes that do naive matrix multiplication are unsuitable. To this end, we develop OverSketch, a sketching based algorithm for approximate blocked matrix multiplication. Our sketching scheme requires time linear in the size of input matrices. As a distributed matrix multiplication algorithm, OverSketch has many advantages: reduction in dimension of input matrices for computational savings, and built-in straggler resiliency. Extensive experiments on AWS Lambda support our claims that OverSketch is resilient to stragglers, cost-efficient, and highly accurate for suitably chosen sketch dimension.

### A Proof of Proposition III.1

To compare naive and blocked multiplication, we first observe that the computation cost in (1), that is , is the same for both naive and blocked multiplication and is equal to , which is the total amount of work done during matrix-matrix multiplication. (The computation cost for blocked matrix multiplication can be further improved by using Strassen type methods that take to multiply two square sub-blocks of dimension , but we do not consider that advantage in this paper for clarity of exposition and to emphasize savings just due to communication.) Let be the number of workers required for naive matrix multiplication. Then, as each worker is sent one row-block of from choices, and one column-block of from choices. Each worker receives words and writes back words. Hence, the total communication incurred during the algorithm is . Also, since each worker can only receive words, we have , thus . Hence, the total bandwidth cost for naive multiplication is . Also, the total number of messages sent during the process is , and hence the total latency cost is .
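During the computation phase for blocked multiplication, , as computation of one block of requires workers, and there are a total of such blocks. Again, each worker receives two blocks, one from each and , and writes back a block, where satisfies . Thus, the total bandwidth cost incurred during the computation phase is . The total number of messages received by the workers is , and, hence, the latency cost is . During the reduction phase, the number of workers required is , and each worker receives blocks of size to compute one block of . Thus, for the reduction phase, the communication is and total messages sent is . Hence, the total latency and bandwidth costs for blocked multiplication are and , respectively. This analysis justifies the costs summarized in Table I and proves the proposition.

### B Proof of Theorem IV.1

The following three lemmas will assist us with the proof of Theorem IV.1.

###### Lemma .1

Let $S_c \in \mathbb{R}^{n \times b}$ be a count sketch matrix. Then, for any vectors $x, y \in \mathbb{R}^{n}$, the following holds:

$$E[x^T S_c S_c^T y] = x^T y, \tag{10}$$

$$\begin{aligned} \operatorname{Var}[x^T S_c S_c^T y] &= \frac{1}{b} \left( \sum_{j \ne l} x_j^2 y_l^2 + \sum_{j \ne l} x_j y_j x_l y_l \right) \\ &\le \frac{1}{b} \left( (x^T y)^2 + \|x\|_2^2 \|y\|_2^2 \right) \le \frac{2}{b} \|x\|_2^2 \|y\|_2^2. \end{aligned} \tag{11}$$

Proof. See [29], Appendix A.

###### Lemma .2

Let $S = \frac{1}{\sqrt{N}} (S_1, S_2, \cdots, S_N) \in \mathbb{R}^{n \times d}$, where $d = Nb$ and $S_i \in \mathbb{R}^{n \times b}$ is a count sketch matrix that satisfies (10) and (11), for all $i \in \{1, 2, \dots, N\}$. Then, for any vectors $x, y \in \mathbb{R}^{n}$, the following holds:

$$E[x^T S S^T y] = x^T y, \qquad \operatorname{Var}[x^T S S^T y] \le \frac{2}{d} \|x\|_2^2 \|y\|_2^2.$$

Proof. Note that $S S^T = \frac{1}{N} \sum_{i=1}^{N} S_i S_i^T$. Thus, $x^T S S^T y = \frac{1}{N} \sum_{i=1}^{N} x^T S_i S_i^T y$ and hence, $E[x^T S S^T y] = x^T y$ by (10) and linearity of expectation. Now,

$$\begin{aligned} \operatorname{Var}[x^T S S^T y] &= E\left[ (x^T S S^T y - x^T y)^2 \right] \\ &= E\left[ \left( \frac{1}{N} \left( (x^T S_1 S_1^T y + x^T S_2 S_2^T y + \cdots + x^T S_N S_N^T y) - N x^T y \right) \right)^2 \right] \\ &= E\left[ \frac{1}{N^2} \left( \sum_{i=1}^{N} (x^T S_i S_i^T y - x^T y) \right)^2 \right] \\ &= \frac{1}{N^2} \left( \sum_{i=1}^{N} E\left[ (x^T S_i S_i^T y - x^T y)^2 \right] + \sum_{i \ne j} E\left[ (x^T S_i S_i^T y - x^T y)(x^T S_j S_j^T y - x^T y) \right] \right) \\ &= \frac{1}{N^2} \left( \sum_{i=1}^{N} \operatorname{Var}[x^T S_i S_i^T y] + \sum_{i \ne j} E\left[ (x^T S_i S_i^T y - x^T y)(x^T S_j S_j^T y - x^T y) \right] \right). \end{aligned} \tag{12}$$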

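Noting that $x^T S_i S_i^T y$, for $i = 1, \dots, N$, are independent random variables and using (10), we get

$$
\mathbb{E}\left[\left(x^T S_i S_i^T y - x^T y\right)\left(x^T S_j S_j^T y - x^T y\right)\right] = \mathbb{E}\left[x^T S_i S_i^T y - x^T y\right]\,\mathbb{E}\left[x^T S_j S_j^T y - x^T y\right] = 0 \quad \text{for all } i \neq j.
$$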

Now, using the above equation and (11) in (12), we get

$$
\mathrm{Var}[x^T S S^T y] \le \frac{1}{N^2} \times N \times \frac{2}{b}\|x\|_2^2 \|y\|_2^2 = \frac{2}{d}\|x\|_2^2 \|y\|_2^2,
$$

which proves the lemma.
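As an informal numerical check of Lemma .2, the following sketch estimates the mean and variance of $x^T S S^T y$ by Monte Carlo simulation. It is only an illustration under stated assumptions: the helper names (`count_sketch`, `stacked_sketch`, `estimate_moments`), the problem sizes, and the Gaussian test vectors are hypothetical choices and are not taken from the paper.

```python
import numpy as np

def count_sketch(n, b, rng):
    """Return an n x b count sketch: each row has a single +/-1 entry
    placed in a uniformly random column."""
    S = np.zeros((n, b))
    cols = rng.integers(0, b, size=n)                # random bucket per row
    signs = 2.0 * rng.integers(0, 2, size=n) - 1.0   # random sign per row
    S[np.arange(n), cols] = signs
    return S

def stacked_sketch(n, b, N, rng):
    """Concatenate N independent count sketches and scale by 1/sqrt(N),
    giving an n x d matrix with d = N * b."""
    blocks = [count_sketch(n, b, rng) for _ in range(N)]
    return np.hstack(blocks) / np.sqrt(N)

def estimate_moments(x, y, b, N, trials, rng):
    """Empirical mean and variance of x^T S S^T y over random sketches."""
    samples = np.empty(trials)
    for t in range(trials):
        S = stacked_sketch(x.size, b, N, rng)
        samples[t] = (x @ S) @ (S.T @ y)
    return samples.mean(), samples.var()

# Assumed (illustrative) sizes and data.
rng = np.random.default_rng(0)
n, b, N = 200, 10, 5
x = rng.standard_normal(n)
y = rng.standard_normal(n)

mean, var = estimate_moments(x, y, b, N, trials=4000, rng=rng)
bound = 2.0 / (N * b) * np.dot(x, x) * np.dot(y, y)  # (2/d) ||x||^2 ||y||^2

print(f"x^T y = {x @ y:.3f}, empirical mean = {mean:.3f}")
print(f"empirical variance = {var:.3f}, bound = {bound:.3f}")
```

With these settings the empirical mean should land close to $x^T y$ and the empirical variance should stay below $\frac{2}{d}\|x\|_2^2\|y\|_2^2$, up to sampling noise.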

###### Lemma .3

Let $d = 2/\epsilon$. Then, for any matrices $A$ and $B$, and $S$ as defined in Lemma .2,

$$
\mathbb{E}\|AB - ASS^TB\|_F^2 \le \epsilon \|A\|_F^2 \|B\|_F^2. \tag{13}
$$

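By the property of the Frobenius norm and linearity of expectation, we have

$$
\mathbb{E}\|AB - ASS^TB\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{l}\mathbb{E}\left|a^{(i)} b^{(j)} - a^{(i)} S S^T b^{(j)}\right|^2, \tag{14}
$$

where $a^{(i)}$ and $b^{(j)}$ are the $i$-th row and $j$-th column of $A$ and $B$, respectively. Since $\mathbb{E}[a^{(i)} S S^T b^{(j)}] = a^{(i)} b^{(j)}$, each summand in (14) is a variance. Now, using Lemma .2 in (14), we get

$$
\begin{aligned}
\mathbb{E}\|AB - ASS^TB\|_F^2 &= \sum_{i=1}^{m}\sum_{j=1}^{l}\mathrm{Var}\left[a^{(i)} S S^T b^{(j)}\right] \\
&\le \sum_{i=1}^{m}\sum_{j=1}^{l}\frac{2}{d}\|a^{(i)}\|_2^2 \|b^{(j)}\|_2^2 \\
&= \epsilon\left(\sum_{i=1}^{m}\|a^{(i)}\|_2^2\right)\left(\sum_{j=1}^{l}\|b^{(j)}\|_2^2\right) \quad (\text{as } d = 2/\epsilon) \\
&= \epsilon \|A\|_F^2 \|B\|_F^2,
\end{aligned}
$$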

which is the desired result.
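The bound (13) can be illustrated numerically. The sketch below averages the squared Frobenius error of $ASS^TB$ over random sketches and compares it with $\epsilon\|A\|_F^2\|B\|_F^2$. The matrix sizes, the Gaussian test matrices, and the helper names are assumptions made for illustration only; $\epsilon$ is derived from the chosen $d = Nb$ so that $d = 2/\epsilon$ holds exactly.

```python
import numpy as np

def count_sketch(n, b, rng):
    """n x b count sketch: one random +/-1 entry per row."""
    S = np.zeros((n, b))
    S[np.arange(n), rng.integers(0, b, size=n)] = (
        2.0 * rng.integers(0, 2, size=n) - 1.0
    )
    return S

def stacked_sketch(n, b, N, rng):
    """S = (1/sqrt(N)) [S_1, ..., S_N] with independent count sketches."""
    return np.hstack([count_sketch(n, b, rng) for _ in range(N)]) / np.sqrt(N)

def mean_squared_error(A, B, b, N, trials, rng):
    """Monte Carlo estimate of E ||AB - A S S^T B||_F^2."""
    exact = A @ B
    errors = []
    for _ in range(trials):
        S = stacked_sketch(A.shape[1], b, N, rng)
        approx = (A @ S) @ (S.T @ B)   # multiply the two sketched matrices
        errors.append(np.linalg.norm(exact - approx, "fro") ** 2)
    return float(np.mean(errors))

# Assumed (illustrative) problem sizes.
rng = np.random.default_rng(1)
m, n, l = 30, 400, 20
b, N = 8, 5
d = N * b
eps = 2.0 / d                          # chosen so that d = 2 / eps

A = rng.standard_normal((m, n))
B = rng.standard_normal((n, l))

err = mean_squared_error(A, B, b, N, trials=300, rng=rng)
bound = eps * np.linalg.norm(A, "fro") ** 2 * np.linalg.norm(B, "fro") ** 2

print(f"estimated E||AB - ASS^TB||_F^2 = {err:.1f}")
print(f"eps * ||A||_F^2 * ||B||_F^2     = {bound:.1f}")
```

Note that the sketched product only multiplies an $m \times d$ matrix by a $d \times l$ matrix, which is where the computational savings come from when $d$ is much smaller than $n$.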

We are now ready to prove Theorem IV.1. As illustrated in Figure 5, we can think of computation of a sub-block as multiplication of row block $A(i,:)$ of $A$ and column block $B(:,j)$ of $B$. Since we ignore up to only workers in the calculation of a block of , the effective sketch dimension is at least $2/(\epsilon\theta)$, and therefore, from Lemma .3,

$$
\mathbb{E}\left\|A(i,:)B(:,j) - A(i,:)S_{ij}S_{ij}^T B(:,j)\right\|_F^2 \le \epsilon\theta\,\|A(i,:)\|_F^2 \|B(:,j)\|_F^2, \tag{15}
$$

for all $i \in \{1, \dots, m/b\}$ and $j \in \{1, \dots, l/b\}$. Note that even if we applied the same sketch on $A$ and $B$ across row and column blocks, respectively, $S_{ij}$ in the above equation might end up being different for each pair $(i, j)$ depending upon the location of stragglers, though with a common property that the sketch dimension is at least $2/(\epsilon\theta)$. Now, we note that

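$$
\begin{aligned}
\mathbb{E}\|ASS^TB - AB\|_F^2 &= \sum_{i=1}^{m/b}\sum_{j=1}^{l/b}\mathbb{E}\left\|A(i,:)B(:,j) - A(i,:)S_{ij}S_{ij}^T B(:,j)\right\|_F^2 \\
&\le \epsilon\theta\sum_{i=1}^{m/b}\sum_{j=1}^{l/b}\|A(i,:)\|_F^2 \|B(:,j)\|_F^2 = \epsilon\theta\,\|A\|_F^2 \|B\|_F^2.
\end{aligned}
$$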

Now, by Markov’s inequality

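$$
\begin{aligned}
P\left(\|ASS^TB - AB\|_F^2 > \epsilon\|A\|_F^2\|B\|_F^2\right) &\le \frac{\mathbb{E}\|ASS^TB - AB\|_F^2}{\epsilon\|A\|_F^2\|B\|_F^2} \\
&\le \frac{\epsilon\theta\,\|A\|_F^2\|B\|_F^2}{\epsilon\|A\|_F^2\|B\|_F^2} = \theta,
\end{aligned}
$$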

which proves the desired result.
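The argument above can be illustrated with a small simulation in which each output block discards the contributions of a few "straggling" sketch blocks. This is a sketch under assumptions, not the paper's implementation: the over-provisioning by `e` extra sketch blocks, the uniformly random choice of stragglers, the function name `oversketch_product`, and all sizes are hypothetical. The sketch dimension is chosen so that $d = Nb = 2/(\epsilon\theta)$.

```python
import numpy as np

def count_sketch(n, b, rng):
    """n x b count sketch: one random +/-1 entry per row."""
    S = np.zeros((n, b))
    S[np.arange(n), rng.integers(0, b, size=n)] = (
        2.0 * rng.integers(0, 2, size=n) - 1.0
    )
    return S

def oversketch_product(A, B, b, N, e, block, rng):
    """Approximate A @ B using N + e sketch blocks, where each output
    block ignores e randomly chosen (simulated straggler) sketch blocks."""
    n = A.shape[1]
    sketches = [count_sketch(n, b, rng) for _ in range(N + e)]
    A_sk = [A @ S for S in sketches]      # each is m x b
    B_sk = [S.T @ B for S in sketches]    # each is b x l
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(0, A.shape[0], block):
        for j in range(0, B.shape[1], block):
            # Keep the N sketch blocks whose (simulated) workers returned.
            alive = rng.choice(N + e, size=N, replace=False)
            partial = sum(
                A_sk[k][i:i + block] @ B_sk[k][:, j:j + block] for k in alive
            )
            C[i:i + block, j:j + block] = partial / N
    return C

# Assumed (illustrative) sizes and parameters.
rng = np.random.default_rng(2)
m, n, l = 40, 300, 40
b, N, e, block = 10, 8, 2, 10
theta = 0.25
eps = 2.0 / (N * b * theta)               # so that N * b = 2 / (eps * theta)

A = rng.standard_normal((m, n))
B = rng.standard_normal((n, l))
exact = A @ B
threshold = eps * np.linalg.norm(A, "fro") ** 2 * np.linalg.norm(B, "fro") ** 2

trials = 200
failures = 0
for _ in range(trials):
    C = oversketch_product(A, B, b, N, e, block, rng)
    if np.linalg.norm(C - exact, "fro") ** 2 > threshold:
        failures += 1
failure_rate = failures / trials

print(f"fraction of trials exceeding the error threshold: {failure_rate:.3f}")
print(f"Markov bound theta: {theta}")
```

Because Markov's inequality is loose, the observed failure fraction is typically far below $\theta$ in this kind of experiment.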

## References

• [1] S. L. Graham, M. Snir, and C. A. Patterson, Getting up to speed: The future of supercomputing.    National Academies Press, 2005.
• [2] L. I. Millett and S. H. Fuller, “Computing performance: Game over or next level?” Computer, vol. 44, pp. 31–38, 01 2011.
• [3] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, “Communication lower bounds and optimal algorithms for numerical linear algebra,” Acta Numerica, vol. 23, pp. 1–155, 2014.
• [4] E. Solomonik and J. Demmel, “Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms,” in Proceedings of the 17th International Conference on Parallel Processing, 2011, pp. 90–109.
• [5] I. Baldini, P. Castro, K. Chang, P. Cheng, S. Fink, V. Ishakian, N. Mitchell, V. Muthusamy, R. Rabbah, A. Slominski, and P. Suter, Serverless Computing: Current Trends and Open Problems.    Springer Singapore, 2017.
• [6] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, and B. Recht, “Occupy the cloud: distributed computing for the 99%,” in Proceedings of the 2017 Symposium on Cloud Computing.    ACM, 2017, pp. 445–451.
• [7] V. Shankar, K. Krauth, Q. Pu, E. Jonas, S. Venkataraman, I. Stoica, B. Recht, and J. Ragan-Kelley, “numpywren: serverless linear algebra,” ArXiv e-prints, Oct. 2018.
• [8] J. Dean and L. A. Barroso, “The tail at scale,” Commun. ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013.
• [9] T. Hoefler, T. Schneider, and A. Lumsdaine, “Characterizing the influence of system noise on large-scale applications by simulation,” in Proc. of the ACM/IEEE Int. Conf. for High Perf. Comp., Networking, Storage and Analysis, 2010, pp. 1–11.
• [10] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
• [11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, pp. 10–10.
• [12] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2018.
• [13] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multiplication,” in IEEE Int. Sym. on Information Theory (ISIT), 2017.    IEEE, 2017, pp. 2418–2422.
• [14] T. Baharav, K. Lee, O. Ocal, and K. Ramchandran, “Straggler-proofing massive-scale distributed matrix multiplication with d-dimensional product codes,” in IEEE Int. Sym. on Information Theory (ISIT), 2018.    IEEE, 2018.
• [15] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Adv. in Neural Inf. Processing Systems, 2017, pp. 4403–4413.
• [16] S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrix multiplication,” arXiv preprint arXiv:1801.10292, 2018.
• [17] J. Zhu, Y. Pu, V. Gupta, C. Tomlin, and K. Ramchandran, “A sequential approximation framework for coded distributed optimization,” in Annual Allerton Conf. on Communication, Control, and Computing, 2017.    IEEE, 2017, pp. 1240–1247.
• [18] M. Vijay and R. Mittal, “Algorithm-based fault tolerance: a review,” Microprocessors and Microsystems, vol. 21, no. 3, pp. 151–161, 1997, Fault Tolerant Computing.
• [19] R. A. van de Geijn and J. Watts, “SUMMA: Scalable universal matrix multiplication algorithm,” Tech. Rep., 1995.
• [20] J. Demmel, “Communication-avoiding algorithms for linear algebra and beyond,” in 2013 IEEE 27th Int. Sym. on Parallel and Distributed Processing, May 2013, pp. 585–585.
• [21] A. Devarakonda, K. Fountoulakis, J. Demmel, and M. W. Mahoney, “Avoiding communication in primal and dual block coordinate descent methods,” arXiv:1612.04003, 2016.
• [22] P. Drineas, R. Kannan, and M. W. Mahoney, “Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication,” SIAM Journal on Computing, vol. 36, no. 1, pp. 132–157, 2006.
• [23] D. P. Woodruff, “Sketching as a tool for numerical linear algebra,” Found. Trends Theor. Comput. Sci., vol. 10, pp. 1–157, 2014.
• [24] M. Pilanci and M. J. Wainwright, “Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence,” SIAM Jour. on Opt., vol. 27, pp. 205–245, 2017.
• [25] Y. Yang, M. Pilanci, and M. J. Wainwright, “Randomized sketches for kernels: Fast and optimal non-parametric regression,” stat, vol. 1050, pp. 1–25, 2015.
• [26] K. L. Clarkson and D. P. Woodruff, “Low rank approximation and regression in input sparsity time,” in Proc. of the Annual ACM Sym. on Theory of Computing. ACM, 2013, pp. 81–90.
• [27] S. Wang, “A practical guide to randomized matrix computations with matlab implementations,” arXiv:1505.07570, 2015.
• [28] X. Meng and M. W. Mahoney, “Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression,” in Proc. of the Forty-fifth Annual ACM Sym. on Theory of Computing. ACM, 2013, pp. 91–100.
• [29] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in Proc. of the 26th Annual International Conference on Machine Learning.    ACM, 2009, pp. 1113–1120.