# MapReduce Meets Fine-Grained Complexity: MapReduce Algorithms for APSP, Matrix Multiplication, 3-SUM, and Beyond

Distributed processing frameworks, such as MapReduce, Hadoop, and Spark, are popular systems for processing large amounts of data. The design of efficient algorithms in these frameworks is a challenging problem: the systems both require parallelism---since datasets are so large that multiple machines are necessary---and limit the degree of parallelism---since the number of machines grows sublinearly in the size of the data. Although MapReduce is over a dozen years old [22], many fundamental problems, such as Matrix Multiplication, 3-SUM, and All Pairs Shortest Paths, lack efficient MapReduce algorithms. We study these problems in the MapReduce setting. Our main contribution is to exhibit smooth trade-offs between the memory available on each machine and the total number of machines necessary for each problem. Overall, we take the memory available to each machine as a parameter, and aim to minimize the number of rounds and the number of machines. In this paper, we build on the well-known theoretical MapReduce framework initiated by Karloff, Suri, and Vassilvitskii [34] and give algorithms for many of these problems. The key to efficient algorithms in this setting lies in defining a sublinear number of large (polynomially sized) subproblems that can then be solved in parallel. We give strategies for MapReduce-friendly partitioning that result in new algorithms for all of the above problems. Specifically, we give constant-round algorithms for the Orthogonal Vectors (OV) and 3-SUM problems, and O(log n)-round algorithms for Matrix Multiplication, All Pairs Shortest Paths (APSP), and Fast Fourier Transform (FFT), among others. In all of these we exhibit trade-offs between the number of machines and the memory per machine.


## 1 Introduction

During the past decade the amount of user generated data has been growing at an astonishing rate. For example, even three years ago, Facebook’s warehouse stored 300 PB of Hive data, with an incoming daily rate of about 600 TB [24]. Similarly, Twitter processes over 500 million tweets per day [52]. As a consequence, developing parallel and scalable solutions to efficiently process this wealth of information is now a central problem in computer science.

#### MapReduce

Distributed processing frameworks such as MapReduce [22], Hadoop [1], and Spark [2] help address this challenge. While differing in details, these frameworks share the same high level principles. The main advantage of these frameworks is that they: (i) support fault-tolerance, (ii) run on a shared cluster with commodity hardware, and (iii) provide a simple abstraction to implement new algorithms.

In the past few years several theoretical models have been proposed to enable formal analysis of algorithms in these settings [25, 34, 8, 26, 27, 33, 46, 48, 29]. Most notable is the MapReduce model of Karloff, Suri, and Vassilvitskii [34], which captures the fundamental challenge of distributing the input data so that no machine ever sees more than a tiny fraction of it. So far this model has received a lot of attention in the applied community, and algorithms for several problems in clustering, distance measures, submodular optimization, and query optimization have been developed [10, 12, 30, 23, 37, 43, 13, 15, 28, 16]. A few papers also considered graph problems such as graph density, minimum cuts, and matchings [38, 4, 5, 9, 18, 42]. Very recently, Im, Moseley, and Sun [29] (STOC’17) showed how to adapt some dynamic programming algorithms to the MapReduce framework.

MapReduce and its variants are essentially special cases of the Bulk Synchronous Parallel (BSP) model [53], but by restricting the allowable parameters, they better capture what is feasible with modern distributed architectures. In the interests of space, we assume familiarity with the basic MapReduce framework, which can roughly be thought of as alternating rounds of local computation and global communication. MapReduce models have four main parameters to consider: 1) the number of machines used by the algorithm, 2) the memory available on each machine, 3) the number of parallel communication rounds, and 4) the running time of the overall algorithm. (Note that even if we do not explicitly restrict the communication in each round, it is bounded by the product of the number of machines and the memory used on each machine.) The MRC framework popularized by [34] assumes that for an input of size n, the number of machines and the memory per machine are bounded by O(n^(1−ε)) for some ε > 0, while the number of rounds is polylogarithmic in n.

While the MRC framework required the algorithms to take subquadratic space (since both the number of machines and the memory per machine are bounded by O(n^(1−ε))), here we are interested in fine-grained trade-offs between space and the number of machines. To that end, we will explore the number of machines necessary when the memory per machine is set to O(n^ε). Obviously, n^(1−ε) machines is a lower bound for all functions that depend on the whole input, and we strive to get as close to this bound as possible. Second, we want the algorithm to be work-efficient, that is, the total processing time over all machines should be close to the running time of an efficient sequential algorithm.

Work efficiency was an important consideration in many PRAM algorithms (e.g., [36]), but has received less attention for MapReduce algorithms. In the MapReduce setting, in order to achieve a work-efficient algorithm, we may need to use additional total memory. In fact, there is often a trade-off between how close we are to a work-efficient algorithm and how much total memory the algorithm uses. We illustrate this trade-off for many classic problems in the MapReduce setting.

Instead of distributing the workload onto different machines, parallel computing models, such as PRAM, assume that a shared memory is accessible to several processors of the same machine. Despite the fundamental difference from a practical point of view, there are many algorithmic similarities between the two models.

For many problems of interest, there are well understood PRAM algorithms. It is, therefore, natural to ask whether these algorithms can be easily adapted to new distributed models. Several papers [34, 27] have shown that it is possible to simulate known PRAM algorithms in the MapReduce model with minimal slow-down. However, these simulations have a major drawback: they do not take advantage of the fact that the new models are stronger and allow for better algorithms, mainly because the distributed models allow for free internal computation at the machines. Our goal in designing distributed algorithms is to minimize the number of communication rounds, whereas in parallel algorithms the goal is to minimize the time complexity.

We leverage this “free” internal computation in several ways. In some cases, it allows us to substantially improve the round complexity of simulated results, e.g., by achieving constant round (instead of logarithmic round) solutions. For other cases, it allows us to improve the overall running time or use fewer processors; and in some cases, it basically simplifies the known solutions.

#### Results Overview

In this paper we propose new MapReduce algorithms for APSP, Matrix Multiplication, as well as other problems such as 3-SUM and Orthogonal Vectors whose fine-grained complexity (see [56] for a survey) has been extensively studied and is well understood in the sequential setting (see e.g. [32, 35]).

We begin with the Orthogonal Vectors and 3-SUM problems and give a few basic and useful ideas for designing MapReduce algorithms. Next, we consider more fundamental problems and present our main results. For matrix multiplication, we show a new technique to parallelize any matrix multiplication algorithm in the MapReduce framework using the bilinear noncommutative model introduced by Strassen [50] and subsequently improved in [44, 14, 49, 47, 19, 51, 20, 55]. Interestingly, while Strassen's framework has already been used to develop parallel algorithms in other computational models [11], previous approaches do not extend to the MapReduce framework. Specifically, we show that given an algorithm based on this method with running time O(n^c) for matrix multiplication, it is possible to obtain a MapReduce algorithm that, for any 0 < ε < 2, uses O(n^(c(1−ε/2))) machines with O(n^ε) memory per machine. We also extend our approach to non-square matrices.

Next, we present an efficient algorithm for matrix multiplication over the (min, +) semiring, obtaining an efficient parallel algorithm whose total running time is optimal (i.e., the sum of the running times over all the parallel instances of the algorithm matches the best sequential running time). We then extend it to new efficient MapReduce algorithms for the all pairs shortest paths problem, the diameter problem, and other graph centrality measures. We also show that, using some ideas from [57], it is possible to improve the running time of the algorithm if the graph is unweighted or if the edge weights are small.

## 2 Our Results and Techniques

We present several MapReduce algorithms for fundamental problems. The novelty of our work is to exhibit smooth trade-offs between the memory available on each machine, and the total number of machines necessary. Overall, we take the memory available to each machine as a parameter, and aim to minimize the number of rounds and number of machines.

We begin, as a warm-up, by stating our results for the orthogonal vectors (OV) problem. This problem is of particular interest to the fine-grained complexity community. In the OV problem, we are given two lists of vectors, A and B, each containing n vectors of dimension d, and want to determine whether there exist two vectors a ∈ A and b ∈ B such that a · b = 0. Although a quadratic-time solution for OV is trivial (iterate over all pairs of vectors and examine whether the inner product of the vectors is equal to 0), this algorithm remains essentially the fastest known for OV to date. This solution is a natural example of an algorithm that can be efficiently parallelized. Therefore, we begin by presenting a MapReduce algorithm for OV.

We show in Section B (deferred to the Appendix in the interest of space) that the above algorithm can be implemented in a single MapReduce round using O(n^(2−2ε)) machines with memory O(n^ε), for any 0 < ε < 1. The idea is to split the lists into sublists of size n^ε and assign a machine to every pair of sublists to find out if the sublists contain a pair of orthogonal vectors. Therefore, this algorithm requires O(n^(2−2ε)) machines. We provide further explanation regarding the MapReduce details and present pseudocode for both mappers and reducers of this algorithm in Section B.
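To make the partitioning concrete, here is a minimal single-machine simulation of the one-round algorithm. The function names are ours, and the sequential loop stands in for the parallel machines; it is an illustration, not the paper's mapper/reducer pseudocode.

```python
from itertools import product

def has_orthogonal_pair(block_a, block_b):
    # Reducer-side work: scan all vector pairs in one (sublist, sublist) subtask.
    return any(sum(x * y for x, y in zip(u, v)) == 0
               for u, v in product(block_a, block_b))

def ov_mapreduce(A, B, s):
    # Split each list into blocks of size s; every pair of blocks becomes one
    # subtask, here checked sequentially instead of on its own machine.
    blocks_a = [A[i:i + s] for i in range(0, len(A), s)]
    blocks_b = [B[i:i + s] for i in range(0, len(B), s)]
    return any(has_orthogonal_pair(ba, bb)
               for ba, bb in product(blocks_a, blocks_b))
```

Each of the (n/s)^2 block pairs is independent, which is why a single communication round suffices.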

The idea that the input can be divided into asymptotically smaller instances, and therefore that problems can be reduced to smaller subproblems, is a promising direction for designing MapReduce algorithms. However, as we show, this idea does not always lead to the most efficient algorithms. In Section 3, we study the 3-SUM problem in the MapReduce setting. In this problem, we are given three lists of integers A, B, and C, each containing n elements. The goal is to determine if there exist a ∈ A, b ∈ B, and c ∈ C such that a + b = c. Similar to the solution of OV, one can divide each of the lists into n^(1−ε) sublists, each of size n^ε. Any combination of the sublists makes a subtask, and as such, the problem breaks into n^(3−3ε) smaller instances, each having an input size of O(n^ε). Thus, we would need n^(3−3ε) machines to solve the problem for each combination of the sublists. However, unlike OV, 3-SUM can be implemented more efficiently in the MapReduce model. The crux of the argument is that not all combinations of sublists need to be examined for a potential solution. In fact, we show in Section 3 that out of the n^(3−3ε) combinations of sublists, only O(n^(2−2ε)) can potentially have a solution. The rest can be ruled out via an argument on the ranges of the sublists. Therefore, we can reduce the number of machines needed to solve 3-SUM from n^(3−3ε) down to O(n^(2−2ε)). The algorithm now needs two rounds: one to determine, on a single machine, which combinations need to be examined, and a second to distribute the subtasks between the machines and solve the problem.

Theorem 3.1 (restated). For ε ≥ 1/2, 3-SUM can be solved with a MapReduce algorithm on O(n^(2−2ε)) machines with memory O(n^ε) in two MapReduce rounds.

Our algorithms for OV and 3-SUM show how MapReduce tools can solve classic problems efficiently with less memory per machine. We now turn to more fundamental problems, such as matrix multiplication, graph centrality measures, and shortest paths.

Our main contribution is an algorithm for multiplying two matrices via MapReduce. Matrix multiplication is one of the most fundamental and oldest problems in computer science. Many algebraic problems such as LUP decomposition, computing the determinant, Gaussian elimination, and matrix inversion can be reduced to matrix multiplication. The trivial algorithm for matrix multiplication takes O(n³) time, and via a long series of results the best sequential algorithm currently takes O(n^ω) time, where ω < 2.373 [40].

An important breakthrough in matrix multiplication algorithms was the first improvement below O(n³), by Strassen [50] in 1969. As our algorithms use some of the ideas from Strassen, we describe his algorithm here briefly. Strassen’s idea was to show that two 2 × 2 matrices can be multiplied using only 7 integer multiplications. Using recursion, we can think of any n × n matrix as four submatrices of size (n/2) × (n/2), and Strassen’s observation shows that only 7 matrix multiplications of (n/2) × (n/2) matrices suffice to determine the solution. Solving the resulting recursion, we see that the total number of integer multiplications is O(n^(log₂ 7)) = O(n^2.81).
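For reference, Strassen's recursion can be sketched as follows. This is a sequential illustration, not the distributed algorithm; the `cutoff` parameter (our own tuning knob) switches to the naive product on small blocks, and we assume the matrix size is a power of two.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion: 7 multiplications of (n/2) x (n/2) blocks.
    Assumes square matrices whose dimension is a power of two."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B          # naive product on small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # the seven recursive products
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # recombine into the four output blocks
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The recursion satisfies T(n) = 7 T(n/2) + O(n²), which solves to O(n^(log₂ 7)).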

Our MapReduce algorithm for matrix multiplication is based on a similar logic, although instead of Strassen [50], we use the latest decomposition of Le Gall [40]. In a single round, we reduce the problem to smaller instances and divide the machines between the instances. In the next round, the machines are evenly divided between the subtasks, and each subtask is to multiply two smaller matrices. Therefore, we again use the same idea to break the problem into smaller pieces. More generally, in every step, we divide the matrices into smaller submatrices and distribute the machines evenly between the smaller instances. We continue this until the memory of each machine (O(n^ε)) is enough to contain the submatrices, that is, until the matrices are of size at most n^(ε/2) × n^(ε/2). At this point, each machine computes the multiplication of its given matrices and outputs the solution. We show in Section 4 that this can be done in O(log n) MapReduce rounds using O(n^(ω(1−ε/2))) machines and O(n^ε) memory, where ω < 2.373 [40] is the best known upper bound on the exponent of any algorithm for matrix multiplication.
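To illustrate the bookkeeping of this recursion, the following sketch (with illustrative parameters of our own: `n0` for the block dimension of the base decomposition, `t` for the number of products it uses, and `mem` for the per-machine memory) computes, for each phase, the current matrix size and the number of parallel multiplications:

```python
def schedule(n, n0, t, mem):
    """Phase bookkeeping for the recursive algorithm: in each phase the matrix
    size shrinks by a factor n0 while the number of parallel multiplications
    grows by a factor t; recursion stops once one matrix fits in a single
    machine's memory (size * size <= mem)."""
    phases = []
    size, tasks = n, 1
    while size * size > mem:
        size //= n0
        tasks *= t
        phases.append((size, tasks))
    return phases
```

For instance, with Strassen's decomposition (n0 = 2, t = 7), multiplying 64 × 64 matrices with memory for a 4 × 4 block takes four phases and 7⁴ = 2401 leaf multiplications.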

Theorem 4.1 (restated). For two given n × n matrices A and B and any 0 < ε < 2, there exists an O(log n)-round MapReduce algorithm to compute A × B with O(n^(ω(1−ε/2))) machines and O(n^ε) memory on each machine.

We further extend this result to a MapReduce algorithm for multiplying an n × n^r matrix by an n^r × n matrix. Let ω(r) denote the smallest constant such that one can multiply an n × n^r matrix by an n^r × n matrix in time O(n^(ω(r))). Our algorithm needs O(n^(ω(r)(1−ε/2))) machines and O(n^ε) memory on each machine. Similarly, the number of MapReduce rounds of our algorithm is O(log n).

Theorem 4.4 (restated). For an n × n^r matrix A and an n^r × n matrix B and any 0 < ε < 2, there exists an O(log n)-round MapReduce algorithm to compute A × B with O(n^(ω(r)(1−ε/2))) machines and O(n^ε) memory on each machine.

In Section 5, we use both Theorems 4.1 and 4.4 to design similar MapReduce algorithms for matrix multiplication over the (min, +) semiring, also known as distance multiplication and denoted by ⋆. In this problem, we are given two matrices A and B, and wish to compute a matrix C = A ⋆ B such that C[i][j] = min_k (A[i][k] + B[k][j]). Again, a trivial cubic-time solution follows from the definition; however, unlike matrix multiplication, the cubic running time of the naive algorithm has not been substantially improved. In fact, many believe that this problem does not admit any truly subcubic algorithm. For small entry ranges, however, faster algorithms are known, as we discuss below.

Our first result for distance multiplication uses a similar approach to the one we used for 3-SUM, and uses O(n^(3(1−ε/2))) machines and O(n^ε) memory on each machine. This algorithm runs in a constant number of MapReduce rounds.

Theorem 5.1 (restated). For any two n × n matrices A and B and 0 < ε < 2, A ⋆ B can be computed with O(n^(3(1−ε/2))) machines and O(n^ε) memory in a constant number of MapReduce rounds.

To prove Theorem 5.1, we reduce the problem to smaller instances. However, unlike our algorithm for 3-SUM, we do not assign each subtask to a single machine. Instead, we assign each subtask to several machines and then compute the solution based on the partial results generated on each of the machines. More precisely, we divide the solution matrix C into submatrices. Notice that for every i and j, we have C[i][j] = min_k (A[i][k] + B[k][j]), and therefore computing the entries of the solution matrix can be parallelized by dividing the range of k between the machines. More precisely, we assign several machines to each submatrix, with each machine in charge of a contiguous range of the index k. Thus, every machine receives a range and a submatrix, along with the corresponding entries of A and B for that submatrix, and outputs partial minima. In the first round, each machine outputs these values for its range, and in the subsequent rounds, for every i and j we compute C[i][j] as the minimum of the partial values generated in round 1. A slightly different variant of this algorithm can compute the distance multiplication of rectangular matrices.
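A single-machine simulation may clarify this partitioning: the range of the inner index k is split into chunks (one per simulated machine), and the partial results are combined with an entrywise minimum. The function names are ours, and the sequential list comprehension stands in for the parallel machines.

```python
import numpy as np

def minplus_block(A, B, k_lo, k_hi):
    """One machine's share: the distance product restricted to k in [k_lo, k_hi)."""
    n = A.shape[0]
    C = np.full((n, n), np.inf)
    for k in range(k_lo, k_hi):
        # A[:, [k]] is n x 1 and B[[k], :] is 1 x n; broadcasting adds them
        C = np.minimum(C, A[:, [k]] + B[[k], :])
    return C

def minplus_mapreduce(A, B, num_machines):
    """Split the k-range among machines, then combine with entrywise min."""
    n = A.shape[0]
    step = -(-n // num_machines)          # ceil(n / num_machines)
    partials = [minplus_block(A, B, lo, min(lo + step, n))
                for lo in range(0, n, step)]
    return np.minimum.reduce(partials)
```

Because min is associative and commutative, the partial results can be combined in any order, which is what makes the second (combining) phase a simple reduction.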

Theorem 5.5 (restated). For an n × m matrix A and an m × n matrix B and any 0 < ε < 2, there exists a MapReduce algorithm that computes A ⋆ B. This algorithm runs on machines with memory O(n^ε), with the analogous machine/memory trade-off, and executes in a constant number of MapReduce rounds. The total running time of the algorithm over all machines is O(n²m).

A reduction similar to the one used by Zwick [57] shows that distance multiplication for small ranges can be done as efficiently as matrix multiplication. We explain this reduction in more detail in Section 5. Table 1 illustrates the problems that we solve either directly, or via a reduction to matrix multiplication. Table 2 shows the performance of our algorithms when the number of machines is equal to the memory of each machine.

### 2.1 Application to Other Problems

Matrix multiplication, both over (min, +) and the standard version, has many applications to other classic problems, and we obtain MapReduce algorithms for several of these problems. It has been shown that for a weighted graph G with adjacency matrix A(G) we have

 APSP(G) = A(G) ⋆ A(G) ⋆ ⋯ ⋆ A(G)   (n times)

where APSP(G) is the matrix that contains the distance from vertex i to vertex j at index (i, j). Thus, one can use our algorithm to obtain a MapReduce algorithm for APSP using distance multiplication O(log n) times. This algorithm, then, can be used to determine the diameter and center of a graph and to examine whether a graph contains a negative cycle. We explain this algorithm in more detail in Section 5.2.
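One standard way to realize the n-fold ⋆-product with O(log n) distance multiplications is repeated min-plus squaring: shortest paths use at most n − 1 edges, so squaring the distance matrix ⌈log₂(n − 1)⌉ times suffices. The following sequential sketch (our own illustration, assuming no negative cycles) shows the idea:

```python
import numpy as np

def minplus(A, B):
    """Distance (min-plus) product of two n x n matrices."""
    n = A.shape[0]
    C = np.full((n, n), np.inf)
    for k in range(n):
        C = np.minimum(C, A[:, [k]] + B[[k], :])
    return C

def apsp(W):
    """APSP by repeated min-plus squaring of the weight matrix W (np.inf for
    missing edges). The k-th squaring covers paths of up to 2**k edges, so
    the result stabilizes after O(log n) squarings."""
    n = W.shape[0]
    D = W.copy()
    np.fill_diagonal(D, 0)
    power = 1
    while power < n - 1:
        D = minplus(D, D)
        power *= 2
    return D
```

In the MapReduce setting, each squaring is one invocation of the distance multiplication algorithm above.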

Although APSP can be solved via distance multiplication, we show that for unweighted graphs (or, in general, graphs with small weights), the algorithm can be improved. This improvement is inspired by the work of Zwick [57]. Our approach to obtaining this result is twofold. Recall that APSP can be solved by taking the adjacency matrix of a graph to the power n in terms of distance multiplication, and to compute the nth power, we can iteratively apply distance multiplication O(log n) times. On one hand, we show that the first few distance multiplications can be run more efficiently since the entries of the matrices are small, using the bounded distance multiplication algorithm instead of the general distance multiplication. On the other hand, the last distance multiplications can be reduced to rectangular distance multiplications involving fewer indices. This second observation is shown by Zwick [57]. Based on these two ideas, we show in Section A that APSP in unweighted graphs can be computed more efficiently than in weighted graphs.

Theorem A.1 (restated). Let

 F(ε) = 3 − (5/2)ε for ε ≤ 0.3192, and a more involved closed-form expression (given in Section A) otherwise,

let ε be a real number, and let G be a graph with n vertices whose edge weights are small integers. There exists a MapReduce algorithm to compute APSP of G with O(n^(F(ε))) machines and O(n^ε) memory in O(log n) MapReduce rounds.

It has been shown that LUP decomposition, computing the determinant, Gaussian elimination, and inversion of a matrix are all equivalent to matrix multiplication (up to logarithmic factors), and thus our algorithms extend to these problems as well.

## 3 3-SUM

In 3-SUM, we are given three lists of integers A, B, and C, each containing up to n numbers. The goal is to find out whether there exist a ∈ A, b ∈ B, and c ∈ C such that a + b = c. The best algorithm for 3-SUM on classic computers runs in O(n²) time, and there has not been any substantial improvement for this problem to this date. In fact, many lower bounds have been proposed on the running time of several algorithmic problems, based on a conjecture that no algorithm can solve 3-SUM in time O(n^(2−δ)) for any δ > 0 [3]. In this section, we present an efficient MapReduce algorithm for 3-SUM that runs in two MapReduce rounds on O(n^(2−2ε)) machines with memory O(n^ε). The total running time of the algorithm over all machines is O(n²). For simplicity, we assume that the numbers of all lists are sorted in non-decreasing order and each list contains exactly n elements. In other words, |A| = |B| = |C| = n, and for all i we have A[i] ≤ A[i+1], B[i] ≤ B[i+1], and C[i] ≤ C[i+1].

The classic algorithm for solving 3-SUM on one machine is as follows: we iterate over all possible choices of a ∈ A for the first element. For every a, we create two pointers, initially pointing at the first elements of B and C respectively. Let b and c denote the values under the pointers at any step of the algorithm; hence, at the beginning of every iteration of the outer loop, both pointers start from the smallest elements of B and C. Next, we move the pointers according to the following rule: if a + b < c, we push the pointer on B one step forward to point at the next element in the list. Similarly, if a + b > c, we move the pointer on C to the next element. Otherwise, if a + b = c, we immediately halt the algorithm and report this triple as an answer to the problem. This way, for every choice of a, we make at most 2n pointer moves, and thus the running time of the algorithm is O(n²). Moreover, since all three lists are sorted initially, we never skip over a potential solution by moving either pointer forward. Therefore, if there is any triple with a + b = c, our algorithm finds it in time O(n²).
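The two-pointer scan just described can be sketched as follows (the function name and list-based interface are our own):

```python
def three_sum(A, B, C):
    """Classic O(n^2) 3-SUM on sorted lists: find a in A, b in B, c in C with
    a + b = c. For each a, both pointers only move forward, so each outer
    iteration costs O(n)."""
    for a in A:
        j = k = 0
        while j < len(B) and k < len(C):
            s = a + B[j]
            if s < C[k]:
                j += 1          # a + b too small: advance the pointer on B
            elif s > C[k]:
                k += 1          # a + b too large: advance the pointer on C
            else:
                return (a, B[j], C[k])
    return None
```

Sortedness is what guarantees correctness: advancing a pointer can only discard sums that are already known to be too small or too large.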

Our MapReduce algorithm for 3-SUM is inspired by the above algorithm. We restrict the memory of each machine to be bounded by O(n^ε) and, for simplicity, assume ε ≥ 1/2. The challenge in the MapReduce setting is that we can no longer afford to store all elements of C on a single machine. This is particularly troubling since, in order to examine whether a pair of numbers a + b adds up to some c ∈ C, we need access to all of the values of the list C. To overcome this hardness, our algorithm runs in two MapReduce rounds. We initially divide the elements of the lists into sublists of size n^ε. For each sublist S, we define its head as the smallest number of the sublist and denote it by H(S). Similarly, we define the tail of a sublist as the largest number of that sublist and refer to it by T(S). Each sublist contains consecutive numbers of a list, and thus for any two sublists S and S′ of the same list we have either T(S) ≤ H(S′) or T(S′) ≤ H(S). In the first MapReduce round of our algorithm, we accumulate all the heads and tails (O(n^(1−ε)) many numbers) in a single machine. Based on this information, we determine which triples of the sublists can potentially contain a solution. We show the number of such combinations is bounded by O(n^(2−2ε)). Therefore, the problem boils down to O(n^(2−2ε)) subproblems of smaller size O(n^ε). In the second MapReduce round of our algorithm, each machine solves a subtask of the problem, and finally we report any solution found on any machine as the solution of 3-SUM. For three sublists A_i, B_j, and C_k of A, B, and C, we say (A_i, B_j, C_k) makes a non-trivial subtask if H(A_i) + H(B_j) ≤ T(C_k) and T(A_i) + T(B_j) ≥ H(C_k). In the following we show how to implement this algorithm with O(n^(2−2ε)) machines and O(n^ε) memory on each machine.
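The first-round pruning can be illustrated with the following sketch, which splits the sorted lists into blocks and keeps exactly the triples satisfying H(A_i) + H(B_j) ≤ T(C_k) and T(A_i) + T(B_j) ≥ H(C_k). For clarity this sketch enumerates all triples, whereas the reducer in the proof of Theorem 3.1 identifies the non-trivial ones faster; the names are ours.

```python
def non_trivial_subtasks(A, B, C, s):
    """Split each sorted list into blocks of size s and return the indices
    (i, j, k) of block triples that can possibly contain a + b = c."""
    def blocks(L):
        return [L[i:i + s] for i in range(0, len(L), s)]
    BA, BB, BC = blocks(A), blocks(B), blocks(C)
    tasks = []
    for i, a in enumerate(BA):
        for j, b in enumerate(BB):
            for k, c in enumerate(BC):
                # a[0], b[0], c[0] are the heads; a[-1], b[-1], c[-1] the tails
                if a[0] + b[0] <= c[-1] and a[-1] + b[-1] >= c[0]:
                    tasks.append((i, j, k))
    return tasks
```

Each surviving triple becomes one second-round subtask of size O(s), solvable independently with the two-pointer algorithm.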

###### Theorem 3.1

For ε ≥ 1/2, 3-SUM can be solved with a MapReduce algorithm on O(n^(2−2ε)) machines with memory O(n^ε) in two MapReduce rounds. The overall running time of our algorithm is O(n²).

Proof. We first divide each of the lists A, B, and C into n^(1−ε) sublists of size n^ε. Each sublist contains consecutive integers of a sorted list. In the first round of the algorithm we feed the heads and the tails of the sublists to a single machine, and that machine decides how the tasks are distributed among the machines in the second round.

Therefore, in the first round, we only have a single machine working as a reducer. This reducer receives the heads and tails of the sublists and reports all triples of sublists that might potentially contain a solution. Notice that if three sublists (A_i, B_j, C_k) do not make a non-trivial subtask, then either H(A_i) + H(B_j) > T(C_k) or T(A_i) + T(B_j) < H(C_k) holds, and thus no potential solution can be found in such sublists.

We begin by proving that the number of non-trivial subtasks is O(n^(2−2ε)).

###### Lemma 3.2

The number of non-trivial subtasks is O(n^(2−2ε)).

Proof. The crux of the argument is that if (A_i, B_j, C_k) makes a non-trivial subtask, then both H(A_i) + H(B_j) ≤ T(C_k) and T(A_i) + T(B_j) ≥ H(C_k) hold. Therefore, none of the triples (A_{i′}, B_{j′}, C_{k′}) with i′ > i, j′ > j, and k′ < k make non-trivial subtasks, since

 H(A_{i′}) + H(B_{j′}) > T(A_i) + T(B_j) ≥ H(C_k) > T(C_{k′}).

Now, we show that for any two different non-trivial subtasks (A_i, B_j, C_k) and (A_{i′}, B_{j′}, C_{k′}), either i + k ≠ i′ + k′ or j + k ≠ j′ + k′. Suppose for the sake of contradiction that both equalities hold. We assume w.l.o.g. that i′ > i; this implies k′ < k since i + k = i′ + k′, and j′ > j since j + k = j′ + k′. This contradicts the above observation. This shows that if we map every non-trivial subtask (A_i, B_j, C_k) to the pair (i + k, j + k), all the corresponding pairs are distinct. Notice that both i + k and j + k range over intervals of length O(n^(1−ε)), and thus the number of non-trivial subtasks cannot be more than O(n^(2−2ε)).

Since ε ≥ 1/2, the memory of a single machine is enough to contain all heads and tails of the sublists. In the first MapReduce round, a reducer identifies all non-trivial subtasks. This can be done in roughly O(n^(2−2ε)) time on a single machine. To this end, we first sort the sublists of A, B, and C based on the values of their heads. It then suffices to find, for each pair (A_i, B_j), which sublists of C contain H(A_i) + H(B_j) and T(A_i) + T(B_j); we can then iterate over all the sublists in between them and report A_i, B_j, and each of those sublists as a non-trivial subtask. Note that sorting all pairs (A_i, B_j) based on H(A_i) + H(B_j) can be done efficiently since both sets of sublists are already sorted. Therefore, we can determine the sublist of C that contains H(A_i) + H(B_j) for all pairs by simultaneously scanning the sublists of C and the sorted pairs. Similarly, we can identify where T(A_i) + T(B_j) appears for each pair. This yields an efficient algorithm for identifying all non-trivial subtasks.

In the second MapReduce round of our algorithm, we feed each non-trivial subtask to a single machine, and that machine finds out whether there exists a 3-SUM solution in that subtask. Recall that the size of each subtask is O(n^ε). Therefore, the number of machines needed by our algorithm is O(n^(2−2ε)) and the memory of each machine is O(n^ε). In addition, the running time of each machine in each MapReduce round is O(n^(2ε)), and thus our overall running time over all machines is O(n²).

## 4 Matrix Multiplication

Matrix multiplication is one of the most fundamental and oldest problems in computer science. Many algebraic problems such as LUP decomposition, computing the determinant, Gaussian elimination, and inverting a matrix can be reduced to matrix multiplication (the reductions may incur additional logarithmic factors in the number of machines and the memory of each machine, but these factors are hidden in the Õ notation) [45]. In addition, matrix multiplication can sometimes be used as a black box to solve combinatorial problems. One example is finding a triangle in an unweighted graph, which can be solved by taking the square of the adjacency matrix of the graph [7]. Despite the importance and long history of this problem, the computational complexity of matrix multiplication is not settled yet.

A naive O(n³)-time solution for multiplying two n × n matrices follows from the definition. The first improvement was the surprising result of Strassen, who showed that the multiplication of two 2 × 2 matrices can be determined using only 7 integer multiplications, and more generally, that the problem of computing the multiplication of two n × n matrices reduces to 7 instances of (n/2) × (n/2) matrix multiplication, plus a constant number of additions of (n/2) × (n/2) matrices, yielding an O(n^(log₂ 7)) algorithm for matrix multiplication.

Perhaps more important than the improvement on the running time of matrix multiplication was the general idea of reducing the problem to small instances in the bilinear noncommutative model. Strassen [50] showed that 7 multiplications suffice for 2 × 2 instances, but any bound on the number of necessary multiplications for any fixed matrix size can be turned into an algorithm for matrix multiplication. Such a notion is now known as the rank of the tensor for multiplying an m × n matrix by an n × p matrix, denoted by R(⟨m, n, p⟩). Let ω be the smallest exponent of n in the running time of any algorithm for computing n × n matrix multiplication. Strassen's algorithm implies ω ≤ log₂ 7 < 2.81, and a further improvement follows by showing a symmetry on the rank of the tensors [41]. There was then a series of improvements [44, 14, 49, 47, 19, 51, 20], culminating in the result of Le Gall, the latest of which shows ω < 2.373 [40]. We use ω to denote this bound. All these bounds are obtained directly or indirectly by bounding the rank of a tensor, and thus Le Gall's result shows that there exists an integer n₀ such that R(⟨n₀, n₀, n₀⟩) ≤ n₀^ω.

Moreover, we assume a decomposition of the tensor into the corresponding number of products in the bilinear noncommutative model is given, since n₀ is a constant and one can compute such a decomposition via exhaustive search. We state our main theorem in terms of ω. Indeed, any improvement on the running time of matrix multiplication based on the bilinear noncommutative model carries over to our setting.

###### Theorem 4.1

For two given n × n matrices A and B and any 0 < ε < 2, there exists a MapReduce algorithm to compute A × B with O(n^(ω(1−ε/2))) machines and O(n^ε) memory on each machine. This algorithm runs in O(log n) rounds.

Proof. The overall idea of the algorithm is to implement Strassen's scheme in a parallel setting. Our algorithm uses O(n^(ω(1−ε/2))) machines with memory O(n^ε). Let t be the number of terms in the decomposition of the solution matrix. We refer to these terms as M_1, …, M_t and assume M_ℓ = A′_ℓ × B′_ℓ, where A′_ℓ and B′_ℓ are linear combinations of the blocks of A and B, respectively. In other words, every entry of the matrices A′_ℓ and B′_ℓ is a linear combination of the entries of A and B, respectively. For simplicity, we assume n is divisible by n₀ (if not, we pad the matrices with extra 0's to make their size divisible by n₀). Next, we decompose both matrices into submatrices of size (n/n₀) × (n/n₀), namely the A_{i,j}'s and B_{i,j}'s. We think of each submatrix A_{i,j} or B_{i,j} as a single entry and compute the product of the two matrices based on the decomposition of [40] in the noncommutative model.

Notice that we only have n² elements in each original matrix, and thus each α_r and β_r is a linear combination of at most m_0² submatrices. Therefore, if the number of machines times the memory of each machine is at least Ω(t · (n/m_0)²), then all α_r’s and β_r’s can be computed in a single round. Thus, we can compute all the variables α_r’s and β_r’s in a single MapReduce round. Via a similar argument, once we compute the values of M_r for every r ≤ t, we can, in a single round, compute the solution matrix A·B. Therefore, the problem boils down to computing the multiplication α_r · β_r for every r ≤ t. We divide the machines evenly between the t subproblems and, recursively, compute the M_r’s in phase 2 of the algorithm.
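For intuition, the m_0 = 2 case is Strassen’s classic decomposition with t = 7 products: each α_r and β_r is a linear combination of the four blocks of its matrix, and the output blocks are linear combinations of the products M_r. A minimal sketch, shown here with scalar blocks (in the algorithm each entry would itself be an (n/2) × (n/2) submatrix):

```python
def strassen_2x2(A, B):
    # A, B are 2x2 matrices given as [[a11, a12], [a21, a22]].
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # Round 1 of the MapReduce algorithm: form the linear combinations
    # alpha_r and beta_r (here, Strassen's seven combinations).
    alphas = [a11 + a22, a21 + a22, a11, a22, a11 + a12, a21 - a11, a12 - a22]
    betas  = [b11 + b22, b11, b12 - b22, b21 - b11, b22, b11 + b12, b21 + b22]
    # Recursive phase: the t = 7 products M_r are independent subproblems.
    m = [x * y for x, y in zip(alphas, betas)]
    # Final round: the entries of C are linear combinations of the M_r's.
    c11 = m[0] + m[3] - m[4] + m[6]
    c12 = m[2] + m[4]
    c21 = m[1] + m[3]
    c22 = m[0] - m[1] + m[2] + m[5]
    return [[c11, c12], [c21, c22]]
```

Only seven multiplications occur, at the cost of extra additions, which is exactly what the linear-combination rounds absorb.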

In phase 2, for every matrix multiplication of size n/m_0, we have p/t machines with memory s. We again use the method of phase 1 to divide the problem into t instances of size n/m_0² for phase 3. More generally, in phase i, for every matrix multiplication of size n/m_0^{i−1}, we have p/t^{i−1} machines with memory s. We stop at the phase ℓ in which p/t^{ℓ−1} = 1, i.e., we only have a single machine per subproblem. In that case, we compute the matrix multiplication locally on the only machine dedicated to the subproblem and report the output. The number of machines and the memory of each machine in each phase is given in Table 3.
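As an illustration of this recursion, the following sketch tabulates the matrix side length and machine count in each phase for hypothetical parameters (m_0 = 2 and t = 7, i.e., Strassen’s decomposition); the function and its arguments are our own illustration, not part of the algorithm’s formal statement:

```python
def phase_schedule(n, m0, t, p):
    # In phase i, each subproblem multiplies matrices of side n / m0**(i - 1)
    # and is assigned p / t**(i - 1) machines; the recursion stops once a
    # subproblem is down to one machine, which multiplies its matrices locally.
    phases = []
    i = 1
    while True:
        side = n // m0 ** (i - 1)
        machines = p // t ** (i - 1)
        phases.append((i, side, machines))
        if machines <= 1:
            break
        i += 1
    return phases
```

For instance, with n = 64 and p = 49 machines, the schedule runs through three phases before each subproblem fits on a single machine.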

Notice that in the last phase of the algorithm, the size of the matrices is n/m_0^{ℓ−1}. Moreover, we have t^{ℓ−1} = p, hence m_0^{ℓ−1} = Θ(n^{1−ε/2}), and the size of the matrices in the last phase is O(n^{ε/2}). Therefore, the memory of the machines in the last phase (O(n^ε)) suffices to compute the multiplication. Furthermore, the number of machines assigned to each task times the memory of each task is at least a constant times the square of the size of the matrices, and thus all linear computations can be done in a single round. Therefore, this algorithm computes the multiplication of two n × n matrices in O(log n) MapReduce rounds with O(n^{ω*(1−ε/2)}) machines and memory O(n^ε).

Setting ε = 2ω*/(ω* + 2) yields the following corollary.

###### Corollary 4.2

For two given n × n matrices A and B, there exists a MapReduce algorithm to compute A·B with O(n^{2ω*/(ω*+2)}) machines and O(n^{2ω*/(ω*+2)}) memory on each machine. This algorithm runs in O(log n) rounds.

Note that 2ω*/(ω* + 2) ≃ 1.086 is very close to 1. The reader can find the complexity of our algorithm, in terms of the memory and the number of machines for matrix multiplication when the number of machines is equal to the memory of each machine, in Figure 1.

A closer look at the analysis of Theorem 4.1 shows that it is not limited to square matrices. For instance, one could show that a similar approach yields an algorithm for multiplying an n × n^x matrix by an n^x × n matrix with fewer operations than the square case requires. However, in order to use this approach for rectangular matrix multiplication, we need a bound on the rank of the corresponding tensors. To this end, we borrow a result of Le Gall [39].

###### Theorem 4.3 (proven in  [39])

Define

 ω⟨1,1,x⟩ := inf { h | RK(⟨n, n, ⌈n^x⌉⟩) = O(n^h) }.

Then for every 0 ≤ x ≤ 1 we have

 ω⟨1,1,x⟩ ≤ 2 if x ≤ γ, and ω⟨1,1,x⟩ ≤ 2 + (ω − 2)(x − γ)/(1 − γ) otherwise.

Theorem 4.3 allows us to extend Theorem 4.1 to imbalanced matrix multiplication. In order to multiply an n × n^x matrix by an n^x × n matrix for any 0 ≤ x ≤ 1, we begin with the following observation: there exists an integer m_0 such that RK(⟨m_0, ⌈m_0^x⌉, m_0⟩) ≤ m_0^{ω*⟨1,x,1⟩}, where

 ω*⟨1,x,1⟩ = ω*⟨1,1,x⟩ ≤ 2 if x ≤ γ, and ω*⟨1,x,1⟩ ≤ 2 + (ω* − 2)(x − γ)/(1 − γ) ≃ 1.83 + 0.53x otherwise,

for γ ≃ 0.3029 [39] and ω* < 2.3729 [40]. This directly follows from Theorem 4.3 and Williams’s bound on ω. Let t = RK(⟨m_0, ⌈m_0^x⌉, m_0⟩). By definition, we can formulate the product of an n × n^x matrix by an n^x × n matrix via linear combinations of the t terms M_r = α_r · β_r, where every term M_r is the product of two terms α_r and β_r, which are linear combinations of the entries of each matrix. Similar to what we did in Theorem 4.1, here we use a MapReduce algorithm to compute this product with several machines of memory O(n^ε).

###### Theorem 4.4

For an n × n^x matrix A and an n^x × n matrix B and any 0 < ε ≤ 2, there exists a MapReduce algorithm to compute A·B with O(n^{ω*⟨1,x,1⟩(1−ε/2)}) machines and O(n^ε) memory on each machine. This algorithm runs in O(log n) rounds.

Proof. The proof is similar to that of Theorem 4.1. Let p be the number of machines and s be the memory of each machine. We assume w.l.o.g. that n is divisible by m_0 and that n^x is divisible by ⌈m_0^x⌉. Of course, if that’s not the case, one can extend the matrices by adding extra 0’s to guarantee these conditions. Also, we assume that A is divided into submatrices of n/m_0 rows and n^x/⌈m_0^x⌉ columns, and B into submatrices of n^x/⌈m_0^x⌉ rows and n/m_0 columns. We refer to these matrices by the A_{i,j}’s and B_{i,j}’s.

As stated in the proof of Theorem 4.1, our algorithm consists of O(log n) rounds. In the first round we compute all the terms α_r’s and β_r’s for all r ≤ t. This can be done in a single round since we only need to compute linear combinations of matrices. Then the problem reduces to t different multiplications, each of an (n/m_0) × (n^x/⌈m_0^x⌉) matrix by an (n^x/⌈m_0^x⌉) × (n/m_0) matrix. Once we solve the problem for these matrices, we can, in a single round, compute the solution matrix. In the first round, we have p machines with memory s, and in every phase of the recursion the number of machines is divided by t while the dimensions of the subproblems shrink by factors of m_0 and ⌈m_0^x⌉. Therefore, in round i each subproblem has p/t^{i−1} machines. We stop at the round ℓ in which p/t^{ℓ−1} = 1, where we only have a single machine with memory s to compute the solution for each subtask. Since the matrices fit in the memory of each machine, we can compute the multiplications in round ℓ and, based on these solutions, recursively combine them to obtain the final product.

Again, if one wishes to minimize the maximum of the number of machines and the memory of each machine, a bound of O(n^{2ω*⟨1,x,1⟩/(ω*⟨1,x,1⟩+2)}) can be derived by setting

 ε = 2ω*⟨1,x,1⟩/(ω*⟨1,x,1⟩ + 2).
###### Corollary 4.5 (of Theorem 4.4)

For a given n × n^x matrix A and an n^x × n matrix B, there exists a MapReduce algorithm to compute A·B with O(n^{2ω*⟨1,x,1⟩/(ω*⟨1,x,1⟩+2)}) machines and O(n^{2ω*⟨1,x,1⟩/(ω*⟨1,x,1⟩+2)}) memory on each machine. This algorithm runs in O(log n) rounds.

Indeed, this result improves as the upper bound on ω*⟨1,x,1⟩ improves. Figure 2 shows the exponent of the complexity of our algorithm for different x’s in the interval [0, 1].

## 5 Matrix Multiplication over (min,+)

In this section we provide an efficient algorithm for matrix multiplication over (min,+). Given two n × n matrices A and B, our goal is to compute an n × n matrix C such that C[i][j] = min_k (A[i][k] + B[k][j]). Throughout this paper, we refer to this operation by A ⋆ B. An important observation here is that for any graph G with adjacency matrix A, the (n−1)-st power of A with respect to ⋆ formulates the distance matrix of G [21]. Therefore, any algorithm for computing A ⋆ B for two matrices A and B can be turned into an algorithm for computing APSP, diameter, and center of a graph with an additional O(log n) overhead. Thus, all results of this section can be seen as algorithms for computing graph centrality measures. In Section 5.1, we present an algorithm for computing A ⋆ B for two n × n matrices and show this yields fast algorithms for determining graph centrality measures. Next, we show in Section 5.2 that a similar approach gives us an algorithm for imbalanced matrix multiplication over (min,+). Finally, in Section 5.3, we show that all our results can be improved if the entries of matrices A and B range over a small interval.

### 5.1 Computing A⋆B for n×n Matrices

We begin by stating a simple algorithm to compute A ⋆ B in one MapReduce round with n machines and memory O(n^{1.5}) for each machine. We next show how one can further improve this algorithm by allowing more MapReduce rounds.

The idea is to divide the solution matrix into n submatrices of size √n × √n and assign the task of computing each submatrix to a separate machine. Each machine then needs access to the entire rows of A and columns of B (√n many rows and √n many columns) corresponding to its solution submatrix, and thus its memory is O(n^{1.5}). Upon receiving the rows and columns of A and B, each machine determines the multiplication of the rows and columns over (min,+) and reports the output. Therefore, all it takes is a mapper to distribute the rows and columns of the matrices between the machines, and n machines to solve the problem for each subtask. The running time of each machine in this case is O(n²). Moreover, the number of machines is n, and thus the overall running time of the algorithm is O(n³), which essentially matches the best known bound for this problem on classic computers.
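The one-round scheme above can be simulated with a loop over the machines; this is a sketch under the assumptions that p is a perfect square and √p divides n (block sizes then follow the description above):

```python
import math

def min_plus_block(A, B, rows, cols):
    # Work done by one machine: it holds the rows of A and the columns of B
    # for its assigned submatrix of C and computes that block locally.
    return {(i, j): min(A[i][k] + B[k][j] for k in range(len(A)))
            for i in rows for j in cols}

def min_plus_one_round(A, B, p):
    # Simulated one-round algorithm: C is split into p blocks of side
    # n / sqrt(p); each block is computed by a separate "machine".
    n = len(A)
    b = math.isqrt(p)   # sqrt(p) blocks per dimension (assumes p is a square)
    step = n // b       # assumes b divides n
    C = [[0] * n for _ in range(n)]
    for bi in range(b):
        for bj in range(b):
            rows = range(bi * step, (bi + 1) * step)
            cols = range(bj * step, (bj + 1) * step)
            for (i, j), v in min_plus_block(A, B, rows, cols).items():
                C[i][j] = v
    return C
```

With p = n machines, each block has side √n and each machine touches O(n^{1.5}) entries, as in the analysis above.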

Although this seems to be an efficient MapReduce algorithm, we show that it can be substantially improved to use fewer machines. In the rest of this section, we present an algorithm to compute A ⋆ B with O(n³/s^{3/2}) machines and memory O(s).

###### Theorem 5.1

For any two n × n matrices A and B and any s ≤ n², A ⋆ B can be computed with O(n³/s^{3/2}) machines and memory O(s) in O(log_s n) MapReduce rounds. Moreover, the total running time of the algorithm is O(n³).

Proof. Our algorithm consists of two stages. The first stage runs in a single MapReduce round. In this round, we divide the solution matrix into n²/s submatrices of size √s × √s. However, instead of assigning each submatrix to a single machine, this time we assign the task of computing the solution of each submatrix to n/√s machines. Notice that in order to compute C[i][j], we have to iterate over all k ∈ [n] and take the minimum of A[i][k] + B[k][j] in this range. Therefore, one can divide this job between n/√s machines by dividing the range of k into intervals of size √s. Each of the machines then receives a range of size √s, a submatrix of the solution, and the entries of A and B corresponding to the solution submatrix and the given range. This makes a total of O(s) matrix entries of A and B. Next, each machine finds the solution of its submatrix subject to the range given to it. The number of solution submatrices is n²/s. Moreover, the task of solving each submatrix is given to n/√s machines, and thus the total number of machines used in this round is O(n³/s^{3/2}). Furthermore, the memory of each machine in this round is O(s), since it only needs access to the values of the matrices for its corresponding submatrix of the solution and range.

In the first stage we compute n/√s candidate values for each entry of the solution. All that remains is to find the minimum of these values for each entry of the matrix and report it as C[i][j]. We do this in the second stage of the algorithm. If n/√s ≤ s, this can be done in a single MapReduce round as follows: divide the entries of the solution matrix evenly between the machines and feed all related values to each machine. Each machine receives the data associated with a set of indices, each having n/√s values generated in the first stage of the algorithm. Notice that the total data given to each machine is O(s) and thus it fits into the memory of each machine. Next, each machine computes the minimum of all values generated in the first phase for each index and outputs the corresponding entries of the solution matrix.

The above algorithm fails when n/√s > s. The reason is that no machine has enough memory to contain all the values corresponding to even a single entry of the solution matrix. However, we can get around this issue by allowing more MapReduce rounds. Since n/√s > s, we have n³/s^{3/2} > n², and thus we have more than n² machines. Therefore, we allocate at least one machine to each entry of the solution matrix. The task of these machines is to compute the minimum of all values for the corresponding entry of the solution matrix. In a single round, we can give s values to each machine and compute the minimum of these numbers. This way we can reduce the n/√s numbers associated with each entry of the matrix to n/s^{3/2} numbers. More generally, in each round we can reduce the size of the data associated with each entry of the matrix by a factor of s. Thus, we can, in O(log_s n) rounds, take the minimum of the data associated with each entry of the solution matrix and report the output.
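The two stages can be simulated as follows; this is a sketch assuming the interval count and fan-in divide evenly, where `q` plays the role of n/√s (the number of k-intervals) and `fan_in` the role of s in the reduction rounds:

```python
def stage1_partial_minima(A, B, q):
    # Stage 1: split the range of k into q intervals; for each entry (i, j)
    # of the solution, the machine holding interval t computes the minimum
    # of A[i][k] + B[k][j] over its own interval only.
    n = len(A)
    step = n // q  # assumes q divides n
    return {(i, j): [min(A[i][k] + B[k][j]
                         for k in range(t * step, (t + 1) * step))
                     for t in range(q)]
            for i in range(n) for j in range(n)}

def stage2_reduce(partials, fan_in):
    # Stage 2: each simulated round shrinks the list of candidate values per
    # entry by a factor of fan_in, until one value per entry remains; that
    # value is the (min,+) product entry.
    rounds = 0
    while any(len(vals) > 1 for vals in partials.values()):
        partials = {key: [min(vals[s:s + fan_in])
                          for s in range(0, len(vals), fan_in)]
                    for key, vals in partials.items()}
        rounds += 1
    return {key: vals[0] for key, vals in partials.items()}, rounds
```

The round counter illustrates the O(log_s n) bound: each pass divides the per-entry data by the fan-in.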

As we mentioned earlier, this result carries over to a number of problems regarding graph centrality measures, including all pairs shortest paths (APSP), diameter, and center of a graph. Although the number of machines and the memory of each machine remain the same for all these problems, both the running time and the number of MapReduce rounds are multiplied by a factor of O(log n).
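The reduction behind this O(log n) overhead is repeated squaring under ⋆: after O(log n) squarings of the adjacency matrix, every entry holds the shortest-path distance. A minimal sketch, with a naive local ⋆ standing in for the distributed product:

```python
def min_plus(A, B):
    # Naive (min,+) product, standing in for the distributed A * B algorithm.
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apsp(adj):
    # All pairs shortest paths by repeated squaring under (min,+): after
    # ceil(log2(n - 1)) squarings, D is the distance matrix of the graph.
    D = [row[:] for row in adj]
    reps = max(1, (len(adj) - 1).bit_length())
    for _ in range(reps):
        D = min_plus(D, D)
    return D
```

Each squaring doubles the maximum number of edges a recorded path may use, hence O(log n) invocations of the (min,+) algorithm suffice.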

###### Corollary 5.2 (of Theorem 5.1)

For any s ≤ n², APSP, diameter, and center of a graph, as well as detecting whether a graph has a negative triangle, can be computed with O(n³/s^{3/2}) machines with memory O(s) in O(log_s n · log n) MapReduce rounds. The total running time of the algorithms for these problems is O(n³ log n).

Proof. As we stated before, the APSP matrix of a graph G is equal to the (n−1)-st power of A (the adjacency matrix of G) with respect to ⋆. Of course, this can be done via O(log n) applications of ⋆ by repeated squaring. Hence, APSP can be solved by simply using our algorithm for matrix multiplication under (min,+) O(log n) times as a blackbox. Negative triangle directly reduces to APSP, and thus the same solution works for it as well [54]. For center and diameter, we first compute the distance matrix of the graph and then, in additional MapReduce rounds, we find the center or diameter. In these additional MapReduce rounds, for every vertex v, we find the closest and furthest vertices to it. This can be done by taking the minimum/maximum of n numbers for each vertex. if