1. Introduction
AllPairs ShortestPaths is a classic problem in graph theory. The problem is both compute and memory intensive, owing to computational and space complexity for graphs with nodes. Over the years, many parallel APSP solvers have been proposed, including for distributed memory clusters, multicore processors, and accelerators. However, to the best of our knowledge, currently no dedicated APSP approach is available for the Apache Spark model (see Section 2
). This is somewhat surprising: many problems in Big Data analytics and Machine Learning (ML) – domains in which Spark holds dominating position – involve APSP as a critical kernel. For example, shortest paths in a neighborhood graph over highdimensional points are known to be very robust approximation of geodesic distances on the underlying manifold
(Bernstein et al., 2000). Spectral dimensionality reduction methods, such as Multidimensional Scaling or Isomap (Tenenbaum et al., 2000), directly involve APSP solver in their workflows. The same holds true for many other techniques, like networks classification (Borgwardt and Kriegel, 2005) or information retrieval (Quirin et al., 2008). In all these applications, data sets with hundreds of thousands and even millions of points/vertices are not uncommon (Talwalkar et al., 2008; Schoeneman and Zola, 2018). At the same time Spark is known and praised for its focus on programmer productivity and easeofuse, including equally convenient deployments in HPC centers as in computational clouds. This makes Spark the frequent platform of choice for nonexperts in parallel computing, who want to combine pure data analytics with processing of large graph instances.In this paper, we ask the question of how efficient and scalable parallel APSP solver can we achieve with Apache Spark. Deriving from the known parallel distributed memory APSP algorithms, we propose four different APSP Spark implementations, which vary in how far they depart from what would be a pure Spark implementation. We extract basic functional blocks that when combined with Spark transformations and actions deliver complete APSP implementations in just few lines of code, promoting programmer productivity. Then, we perform extensive experimental benchmarking to assess efficiency and scalability of the resulting implementations on problems with up to vertices using a 1,024core cluster. We contrast performance of Apache Spark with naive and highly optimized implementations in MPI. Our results show that using pure Spark implementation, that does not involve auxiliary mechanisms and is faulttolerant, is impractical for large problems due to the high cost of data shuffling. By leveraging collect and broadcast operations performed via auxiliary storage we are able to push the size of the problems we can solve, achieving good scalability. However, even then Sparkbased solutions are outperformed by MPI. Our results suggest that while Spark is certainly able to handle large APSP problems, and can compete with naive MPIbased approach, endusers should be aware of the trade off between programmer’s productivity (and the ease of use), and the achievable scalability.
The paper is organized following the common convention. In Section 2, we briefly review related parallel APSP solutions. In Section 3, we state basic assumptions and definitions we use in the work. In Section 4, we propose different strategies to implement APSP with Apache Spark and discuss their pros and cons taking into account implementation complexity and workarounds to missing Spark functionality. We describe detailed experimental validation of the methods in Section 5, and conclude the paper with brief discussion in Section 6.
2. Related Work
The APSP remains very actively researched problem in parallel computing, and hence we necessarily limit our review to projects that directly relate to our work.
In (Solomonik et al., 2013), Solomonik et al. propose recursive 2D blockcyclic algorithm that achieves lowerbound on communication latency and bandwidth, which they next extend to 2.5D communication avoiding formulation. On machine with 24,576 cores, the algorithms maintain strong scaling for problems with nodes, and weak scaling for up to nodes. The paper is also noteworthy for its extensive review of distributed memory APSP solvers. The advantage of the recursive formulation is induced data locality, which directly contributes to improved performance (Solomonik et al., 2013). However, in Spark, the concept of data locality is much weaker, since Spark’s runtime system has a significant freedom in scheduling where to materialize or move data for computations. Even though programmer has some control over the data placement (e.g., via partitioning functions), it does not have direct control over where computations will be executed. Therefore, in our work, we derive from the iterative formulation of the 2D blockcyclic algorithm, which can be found in (Venkataraman et al., 2003), and is one of several methods tracing their origin to transitive closure (Ullman and Yannakakis, 1991).
In (Katz and Kider Jr, 2008), Katz et al. give a solution of the transitive closure problem, and apply it to solving APSP on a GPU. They demonstrate 24 speedup over highly tuned CPU implementation, and report problems for up to . Another efficient GPUbased solution is given by Djidjev et al. in (Djidjev et al., 2014). However, the method applies only to planar graphs. Djidjev solves APSP on sparse graphs with up to one million nodes, using a combination of multicore CPUs and two GPUs. While we could combine GPU acceleration with Spark, e.g., by using Python and Numba JustinTime (JIT) compiler (Lam et al., 2015), it is not clear how advantageous such approach would be (we provide more detailed discussion in Sections 4 and 5).
APSP is one of several graph primitives that can be directly posed as a linear algebra problem, and solved using matrix operations over the semiring (Kepner and Gilbert, 2011), e.g., by employing GraphBLAS (Kepner et al., 2016). Since asymptotically efficient matrix multiplication algorithms (e.g., Strassen) are not applicable to the general APSP under minplus matrix multiplication, algorithms to improve on complexity have been proposed for cases when the graph of interest is directed, undirected and unweighted, or undirected with integer weights from a finite set (Seidel, 1995; Shoshan and Zwick, 1999; Zwick, 1998). In our work, we exploit the semiring formulation to derive one possible implementation, and as in case of GPU acceleration, we could combine multicore GraphBLAS implementation with Spark backbone (see Section 4).
Apache Spark frameworks, such as GraphX (Xin et al., 2013) (no longer maintained) and GraphFrames (Dave et al., 2016), offer multisource shortestpaths algorithms. These algorithms are simple extensions of the single source shortest paths solver in Pregel/BSP model (Malewicz et al., 2010), and are not designed with APSP in mind. Our proposed implementations break away from the Pregel model because in the initial tests, GraphX was unable to handle any reasonable problem size, prompting us to investigate the solutions reported in this paper.
3. Preliminaries
We focus on computing length of all pairs shortest paths (i.e., no paths themselves) in undirected weighted graphs with no negative cycles. Let be such undirected graph, where is a set of vertices, is a set of edges, and determines edge weights. Here, we make no assumption about the graph sparsity, or weights distribution. In general, and in ML or Big Data analytics in particular, each vertex might be a complex object with nontrivial memory layout. However, we assume that some initial preprocessing of the input graph has been performed, and each vertex is uniquely identified by an integer index. This is inline with the common use patterns, where APSP is invoked as a computational building block within a larger framework. The standard approach to handle APSP in such cases is to use a variant of classic FloydWarshall (Cormen et al., 2009) or Johnson algorithms (Cormen et al., 2009), with complexity and , respectively. The Johnson algorithm offers better asymptotic behaviour for reasonably sparse graphs, but typically FloydWarshall derivatives outperform it as they allow for better computational density (see also (Solomonik et al., 2013)). When using FloydWarshall and related algorithms, we will represent adjacency matrix of by , where stores the weight of edge between vertices with indices and . Furthermore, we are going to use the following notation. We will write to describe block of matrix (size of the block and indexing order will be clear from the context), and and to describe rowblock and columnblock , respectively. Similarly, we will write and to describe row and column of , and and to describe column in rowblock and row in columnblock .
To maintain all data elements in our proposed solutions, e.g., matrix , we will primarily depend on Spark’s Resilient Distributed Datasets (RDDs) (Zaharia et al., 2012). For the sake of completeness: RDD is a faulttolerant abstraction over distributed memory, tailored for inmemory iterative algorithms. RDDs are nonmutable, and are lazily evaluated and transparently materialized either in main memory or persistent storage. Unless explicitly specified by a programmer, their partitioning is automatically handled by Spark runtime using either keys range or hashbased partitioning schemes. From the computational point of view, objects within RDD can be processed in parallel by Spark executors using a set of transformations that yield new RDDs. Spark driver node is responsible for managing RDD lineage, and orchestrating the Spark application execution.
When implementing APSP solvers, whenever possible we will depend solely on Apache Spark API, benefiting from the implicit faulttolerance ensured by the platform. Our focus will be on programmerperceived productivity (e.g., minimizing boilerplate code, using convenient highlevel abstractions, etc.). We will refer to such implementations as pure. To improve performance, in some implementations we will depend on workarounds to missing Spark functionality (e.g., implementing pointtopoint data exchange through a shared file system). When such implementations will not be faulttolerant (e.g., failed tasks depending on data in a shared file system are not guaranteed to be able to access that data when rescheduled), we will refer to them as impure.
4. APSP with Apache Spark
We propose and investigate four different APSP approaches that vary in the implementation complexity (and as we shall see, in performance). In all approaches, we 2D decompose the adjacency matrix into blocks, where , and is a userprovided (or autotuned) decomposition parameter. We store the resulting matrix blocks as keyvalue tuples in Spark RDD, where tuple is a key we will use to reference corresponding data block. The matrix decomposition will be different from RDD partitioning, in that a single RDD partition will by maintaining multiple matrix blocks. We will use different ways of assigning RDD partitions to executors (see Section 5), which we can think of as overdecomposition of . Moreover, even though we will be discussing our approaches as if the entire matrix was stored in RDD (to simplify presentation), in the actual implementation we exploit symmetricity of and store only its uppertriangular part. The remaining blocks are generated ondemand by transposition (with no measurable overheads). Hence, executor responsible for the processing of block is also responsible for processing of block . This enables us to reduce the total amount of data maintained by the RDD, while increasing computational cost of processing tasks. We note that by disregarding symmetricity of , our algorithms can be directly adopted for cases where is a directed graph.
In our experiments, we use pySpark. The choice is motivated by practical considerations. First, the use of Python makes it convenient and efficient to store and communicate data blocks maintaining C/C++ compatible rowmajor matrix representation (e.g., via NumPy arrays). This in turn enables us to easily offload computationally intensive operations to a baremetal runtime, either using direct C/C++ bindings (e.g., CPython, Boost.Python (Abrahams and GrosseKunstleve, 2003), implicitly SciPy (Jones et al., 2014)), or through the excellent Numba JIT compiler (Lam et al., 2015). Second, serialization and deserialization of Python objects, which plays a role in our implementations (e.g., when exchanging data via persistent storage), is supported outofbox, further contributing to the ease of implementation and programmer’s productivity. We recognize the fact that pySpark tends to incur overheads due to object conversions between Python and Spark’s native Scala runtime (Xin et al., 2015). However, as we mentioned earlier, in this work we opt for productivity and compactness of the Python code while leveraging baremetal execution when applicable.
4.1. Implementation Elements
To separate computational parts of an APSP solver from the pure Spark mechanics, we first identify functional blocks that will be shared between different implementations. We are careful here not to draw analogy to, e.g., graph building blocks (Kepner et al., 2016), that have solid semantics and theoretical framework attached. In our case, the functional elements are just natural consequence of code separation to enable efficient implementation. We present each element as a function (we use term function somewhat liberally, since in our case function may return/yield multiple elements for the same argument). The functions will be passed to Spark transformations, in which case their first argument, omitted from invocation, will always be a single record stored in the RDD for which transformation is invoked. Table 1 summarizes all functions. How they are used will become clear in the following subsections.
Predicate testing whether given block is in columnblock . 

Predicate testing if given block is th block on diagonal. 
Return a new record with th column of given block (the column is stored as a vector). Within a block rows and columns are zeroindexed. 
Create copies of a diagonal block, each identified by different columnblock index. 
and
Create copies of a given block, each identified by different columnblock index, and copies of its transpose, each identified by different rowblock index. 
Return elementwise minimum between block and . 
Return minplus product between block and . 
Returns MatProd followed by MatMin for block and . 
Execute FloydWarshall, or any other APSP solver, over diagonal block , and return the resulting distance matrix . 
Append block to list of blocks . This function is a simplified representation of the actual Spark combiner. 
Process list of blocks to return block that will be second argument of minplus product with . 
We use yield to indicate that multiple elements are returned.
4.2. Repeated Squaring
In the first approach, we exploit APSP minplus product formulation, and we essentially compute using classic repeated squaring method. While asymptotically this method is clearly inefficient, it is very fast to implement, hence we include it in the comparison. Given RDD with block decomposed , repeated squaring becomes a sequence of three steps over the RDD: cartesian followed by filter to group blocks that should be multiplied in the current iteration, map applying minplus product (function MatProd), and finally reduceByKey with function MatMin to finalize the product. All blocks of can be persisted in the total main memory of the executing cluster, hence directly leveraging support for iterative algorithms in Spark (this will be the case for the remaining algorithms as well). However, the problem with this approach is reliance on cartesian that involves extensive alltoall data shuffle, which we found to stall even on small problems. To bypass this bottleneck of pure implementation, we replaced cartesian with iteration over the columnblocks of , effectively rewriting matrixmatrix product into a series of matrixvector products. This allows us to reduce data movement, at the cost of increased Spark overheads (like scheduling delays or tasks deserialization costs).
The core of the resulting implementation is outlined in Algorithm 1. In line 3, we identify blocks of columnblock to multiply, and group them on the Spark driver node, which next distributes the entire column to executors (line 4). Note that we do not broadcast the column, but rather store its blocks in a shared file system available to driver and executor nodes (e.g., HDFS, GPFS, etc.). As a result, the appropriate blocks can be selected and used by executors only when needed (most of the executors will not be using all blocks). In line 5, we perform the actual matrixvector product. Here, MatProd takes its first argument, block , directly from the RDD, and the second argument is the th block of the columnblock deserialized from the shared secondary storage. Both MatProd and MatMin are delegated for baremetal execution using Numba and NumPy, respectively. The results of individual matrixvector products are brought together via union (line 6) to form RDD ready for the next iteration of the algorithm.
Our implementation of repeated squaring does not require any particular attention to, e.g., when RDDs are materialized, delegating the entire execution to Spark. However, utilizing persistent storage for broadcast introduces sideeffects, which go against Spark’s faulttolerance mechanisms, making the implementation impure. We note that one could consider using shared file system directly at the filtering stage, eliminating the need for collect on the driver. However, this approach is problematic considering Spark scheduler, e.g., files written in one RDD may not be materialized on time to be used by a subsequent RDD.
The computational cost of repeated squaring is bounded by iterations, where each iteration involves map transformation with operation, followed by reduction.
4.3. 2D FloydWarshall
In our second approach, we adopt the textbook parallel FloydWarshall algorithm based on 2D block decomposition (Grama et al., 2003). In this approach, parallelism is due to the two innermost loops of FloydWarshall, which can be partitioned once row/column indexed by the outermost loop is blockwise broadcast, and locally available to the processors. Since Spark provides collect and broadcast that combined can emulate standard broadcast initiated at any processor, we can easily express the entire algorithm in Spark.
Our Spark implementation is outlined in Algorithm 2. The algorithm proceeds in steps. In iteration , we first identify column blocks storing column , and then we extract the actual column using function ExtractCol (lines 25). The column is aggregated on the driver node (line 6) and broadcast to all executors (line 8). Because the memory footprint of a column is very small, the operation can be easily performed without the need for persistent storage. This step is an explicit synchronization point. Once column is available to every executor, we proceed with the update phase of FloydWarshall algorithm, as implemented in FloydWarshallUpdate function. In this case, the first argument is block taken directly from the RDD, and the two remaining arguments are vectors extracted from the column (exploiting the fact that is symmetric). As in earlier algorithm, the actual computations are delegated for baremetal execution using Numba JIT compilation.
The 2D FloydWarshall is interesting in the sense that it is amenable to pure Spark implementation, supporting faulttolerance, and involving no sideeffects. Moreover, it does not require any so called wide transformations, which trigger data shuffling and movement. Consequently, it plays into Spark strengths. However, its computational cost is bounded by iterations, where each iteration involves building blocks with cost . Since synchronization overheads in Spark are significant, we anticipate poor scalability.
4.4. Blocked InMemory
In the third approach, we build on top of the blocked APSP from Venkataraman et al. (Venkataraman et al., 2003). Originally, the algorithm was designed to improve cache utilization of the standard FloydWarshall. It is an iterative algorithm, where each iteration consist of three phases, as depicted in Figure 1. In Phase 1, diagonal block is processed using any efficient APSP solver (in our case sequential FloydWarshall). In Phase 2, the result is passed to update shortest paths in all blocks of block row and block column . Then, in Phase 3, iteration is completed by performing update on the remaining blocks.
Our corresponding implementation, which we call Blocked InMemory, is outlined in Algorithm 3. In iterations, we start by processing diagonal block (lines 24) using FloydWarshall function executed on baremetal via call to optimized SciPy routine. The diagonal block is extracted from RDD via filtering, and once processed we explicitly create its copies (function CopyDiag) that are distributed to executors via custom partitioner (line 4). Here our goal is to place these copies in the RDD partitions that store blocks from the corresponding column/row block (extracted in line 6), thus maximizing data locality and mitigating partitions blowup due to RDDs merging in line 7 (we provide details of partitioning schemes in the next section). We can think about this step as using data shuffling to simulate broadcast, testing Sparks ability to optimize data distribution and corresponding tasks scheduling. To complete pairing of diagonal and column/row blocks we use a combination of combineByKey and map (line 8), by first aggregating a list of blocks to pair (ListAppend), and then enumerating the actual block pairs (ListUnpack). The resulting pairs are processed to compute the update (line 9), and as previously we use block copying and custom partitioning to bring together blocks that will be operated on in Phase 3 (lines 910). To finalize iteration, we implement Phase 3 computations (lines 1215) following the same pattern as in case of Phase 2.
The above implementation of the blocked algorithm depends entirely on faulttolerant Spark functionality, and hence we consider it pure. While it operates in iterations, it is data intensive due to data copying and shuffling involved. However, the method should be able to leverage significant data locality, hence it is a good test for Spark runtime system.
The computational cost of the algorithm is bounded by iterations, with each iteration bounded by . While asymptotically this is similar to 2D FloydWarshall, the blocked algorithm allows us to control the tradeoff between the the cost of single iteration and the number of iterations (which is not the case in the previous methods).
4.5. Blocked Collect/Broadcast
Our last implementation, given in Algorithm 4, is another take on the blocked APSP, this time bypassing explicit data shuffling. Specifically, instead of pairing a diagonal block with column/row blocks, we bring it to the driver node via collect, which then redistributes it to the executors via shared persistent storage (lines 23). We note that here, similar to the repeated squaring algorithm, we do not use broadcast because in pySpark each task created by executor maintains its local copy of the broadcast variables. This usually means exceeding executor’s memory, since the number of running tasks may be the same as the number of cores in a node running the executor. To realize Phase 2, we apply MinPlus function on the column blocks such that the first argument comes from the RDD, and the second argument is taken from Spark storage (line 5). As in earlier phase, we aggregate the entire column on the driver node and then redistribute it via shared persistent storage (lines 67). Finally, in line 9, we perform Phase 3 computations updating the remaining blocks of with the column blocks from persistent storage. All updated blocks are brought into a single RDD via union, and repartitioned to match intended partitioning of (lines 1112).
The collect and broadcast implementation of the blocked algorithm involves secondary storage to handle communication, and hence we consider it impure. Except of custom partitioning of matrix , we do not repartition RDD that stores column/row blocks (line 5). Instead we depend on the partitioning scheme of from which we extract blocks. This means that Spark runtime will most likely trigger data movement to utilize all executors.
Asymptotically, the cost of the algorithm is the same as Blocked InMemory. However, in single iteration we substitute data shuffling with writing to auxiliary storage.
5. Experimental Analysis
To benchmark all four implementations, we performed a set of experiments on a standalone Apache Spark cluster with 32 nodes and GbE interconnect. Each node in the cluster is equipped with two 16core Intel Skylake (Intel Xeon Gold 6130 2.10GHz) processors with 32KB L1 and 1024KB L2 cache, and 192GB of RAM (thus the cluster provides total of 1,024 cores and 6TB of RAM). Moreover, each node has standard SSD drive available, which Spark uses for local data staging (available local storage is 1TB). We note that the use of SSDs is essential for Spark performance, since all data movement in Spark involves staging in the local storage. The local storage is complemented by shared GPFS storage that executors can use for additional communication (e.g., as in case of Blocked Collect/Broadcast algorithm).
In all tests, Spark driver was run with 180GB of memory, to allow efficient management of complex RDD lineages. For the same reason, we run the driver on a dedicated node separately from Spark executors. We allocated one executor per node using the default configuration for the number of cores, i.e., each executor was using all available cores in a node. All executors were configured to use 180GB out of the available 192GB, with the remaining memory available to the operating system and Python interpreter. We note that we tested different Spark runtime configurations, including executortocore ratio, memory use fractions, etc. (Spa, 2019), without noticeable difference in performance.
Our entire software is implemented in Apache Spark 2.2 pySpark, and Python 2.7. Compute intensive linear algebra operations and sequential FloydWarshall algorithm are offloaded to baremetal via NumPy and SciPy that are configured to work with the Intel MKL 2017 BLAS library. Finally, we use Numba 0.35 for justintime compilation whenever required. Our entire software is open source and available from https://gitlab.com/SCoReGroup/APSPark.
5.1. Test Data
All four APSP solvers are oblivious to the structure of input graph, and in our implementations we do not include optimizations to target any specific graph properties. Consequently, the scalability of every solver is a function of graph size expressed only by the number of vertices,
. Therefore our input data is simply a set of synthetic ErdősRényi graphs. In all graphs, the probability of edge,
, is set to , with . The particular choice of is to make graphs generation fast. We note that taking larger to increase likelihood of connectivity in no way adversely affects our performance, as we disregard the cost of populating RDD that stores the adjacency matrix , and each of our approaches scales only with . In our tests, we consider graphs for up to 262,144.5.2. Block Size and Sequential Components
In the first set of experiments, we setup a baseline for subsequent comparisons by analyzing performance of the key functional elements, MatProd combined with MatMin and FloydWarshall, depending on the block size . These building blocks will be dispatched by Spark for sequential execution, and hence their performance will be critical to the overall performance. Moreover, in all our implementations, the block size directly affects level of available parallelism, and level of data movement. If the block size is too large, we may be spending too much time in a sequential execution or staging data blocks in the auxiliary storage, and we may not be able to leverage all executors. On the other hand, if block size is too small, overheads due to tasks scheduling and data shuffling, or creating many small files in the auxiliary storage, may dominate the execution. Hence, the block size should be selected carefully.
To test the sequential components, we call the corresponding functions directly from Python, the same way as they would be invoked by Spark, i.e., FloydWarshall via SciPy with Intel MKL, and MatProd and MatMin via Numba. To test the actual solvers, we use our pySpark implementations. In all tests, we look at the observed execution times. Results of these experiments are reported in Figures 2 and 3.
Figure 2 shows how block size, , affects the time of processing a single adjacency matrix block. As expected, the runtime increases roughly as , in line with the asymptotic behavior of the underlying algorithms. For up to approximately 3,000, sequential operations can be executed very quickly, primarily because adjacency matrix fits in cache memory. Given available Skylake cache, the approximate size for processing completely in L3 cache is around . Once is above that threshold, runtime starts to grow rapidly, going into minutes.
Figure 3 highlights the impact of partitioning granularity on the total execution time. Here we focus on Blocked InMemory (IM) and Blocked Collect/Broadcast (CB) methods, as these are the best performing methods (see next subsection).
Irrespective of the choice of RDD partitioner (we discuss partitioners later), the runtime of both methods first decreases, and then grows with the growing block size. When the block size is too small (), the Blocked InMemory solver fails. This is because the volume of the data that has to be staged in local SSDs due to shuffling exceeds the storage capacity of our cluster (in our case, each node is equipped with 1TB of local storage). Specifically, data repartitioning (call to partitionBy in Algorithm 3, line 15) triggers data shuffling in each iteration. Because shuffled blocks are spilled to the local storage and preserved for fault tolerance, the storage requirement grows linearly with the number of iterations. At the same time, eliminating the call to partitionBy is not an option: without repartitioning step the number of partitions in the RDDs created via union (Algorithm 3, line 12) would quickly explode since in Spark each component RDD preserves its partitioning when in union. The excessive number of partitions would degrade performance of the combineByKey step, and would add significant scheduling overheads. We note that this issue does not affect Blocked Collect/Broadcast, because the volume of the data we manage is smaller (recall that here we avoid creating copies of the data blocks), and repartitioning is thus less aggressive.
The results above confirm the impact of the block size on the performance of our APSP solvers. They suggest also that accelerating single block computations, e.g., by delegating them to GPGPU, could further improve performance of the solvers. Specifically, since with acceleration we should be able to process larger blocks in the same time limit, we could reduce the number of iterations required by the solver without significantly affecting the cost of individual iterations.
5.3. Tuning Performance of the Solvers
In the next set of experiments, we take a closer look at the performance of our solvers in relation to RDD partitioning scheme that we use to distribute RDD with matrix . In Figure 3, we can observe that both partitioning function as well as the number of RDD partitions per core (parameter ) affect the runtime, and hence tuning them may lead to better performance.
The efficiency of each of our APSP methods depends on data locality. Ideally, to reduce the number and frequency of Spark data shuffles, blocks that are paired for computation should be assigned to the same partition, and partitions should be evenly distributed between executors. We should keep in mind however, that even then there is no guarantee that the executor maintaining given partition will be responsible for its processing. Thus, to provide a room for load balancing, the ratio between the number of partitions and the number of available cores should be more than one (per Spark guidelines (Spa, 2019), the number of partitions is recommended to be  the total number of cores, i.e., ).
Method  Partitioner  Decomposition  Time  

Iterations  Single  Projected  
Repeated Squaring  MD  256  18432  45s  9d16h 
512  9216  2m23s  15d8h  
1024  4608  5m6s  16d8h  
2048  2304  19m45s  31d15h  
4096  1152  51m47s  41d10h  
PH  256  18432  44s  9d11h  
512  9216  2m7s  13d13h  
1024  4608  6m5s  19d12h  
2048  2304  18m39s  29d21h  
4096  1152  1h15m  60d6h  
2D FloydWarshall  MD  256  262144  21s  64d11h 
512  262144  18s  53d10h  
1024  262144  17s  51d22h  
2048  262144  18s  55d7h  
4096  262144  20s  61d9h  
PH  256  262144  21s  65d8h  
512  262144  18s  55d10h  
1024  262144  16s  49d7h  
2048  262144  20s  60d3h  
4096  262144  19s  56d9h  
BlockedIM  MD  256  1024  51s  14h29m 
512  512  1m11s  10h8m  
1024  256  1m55s  8h12m  
2048  128  3m44s  7h59m  
4096  64  7m21s  7h51m  
PH  256  1024  48s  13h32m  
512  512  1m14s  10h33m  
1024  256  2m12  9h23m  
2048  128  4m3s  8h39m  
4096  64  8m49s  9h24m  
BlockedCB  MD  256  1024  48s  13h35m 
512  512  1m1s  8h40m  
1024  256  1h40m  7h8m  
2048  128  3m18s  7h4m  
4096  64  8m23s  8h57m  
PH  256  1024  46s  13h12m  
512  512  1m3s  9h4m  
1024  256  1m51s  7h54m  
2048  128  3m51s  8h15m  
4096  64  9m23s  10h2m 
In our implementation we consider two partitioners. The first one is the default Spark partitioner, called Portable Hash (PH), which is available in pySpark as method portable_hash. This is the partitioner that one would use ad hoc. The partitioner is expected to distribute RDD uniformly at random by computing a hash code for Python tuples , which are used as keys in all our RDDs. The resulting hash values aim at equal distribution of RDD partitions between executors, but provide no guarantee on achieving data locality desired by our solvers.
The second partitioner is our multidiagonal partitioner (MD). The assignment of matrix blocks to partitions induced by this approach is outlined in Figure 4. The idea behind MD partitioner is to explicitly place blocks that are in the same column and rowblock into separate partitions, while maintaining equal block distribution between the partitions. The strategy is especially critical to Blocked InMemory and Blocked Collect/Broadcast methods, as it enables us to avoid bottlenecks in Phase 2 of these algorithms. At this point we should note that MD partitioner may seem counterintuitive from the perspective of matrix algorithms based on 2D decomposition. Typically, such algorithms involve some form of gridbased partitioning to optimize communication patterns. However, as we mentioned earlier, in Spark we have limited control over which executors handle which partitions. Therefore, the best strategy is to use partitioning which will give scheduling flexibility to Spark runtime.
In Table 2 we quantify the effect of the block size and the partitioner on the performance of each solver. Here several observations stand out. First, Repeated Squaring and 2D FloydWarshall both are infeasible on large problems, with projected execution times going into days. The poor performance is especially pronounced for 2D FloydWarshall. Even though from the computational point of view the method is asymptotically comparable to the blocked methods, it suffers from unfavorable balance between computations and data movement – each iteration involves operations, the number of iterations is linear in the problem size. Repeated Squaring on the other hand shows very good performance in single iteration, but this is insufficient to compensate higher number of iterations (compared to the blocked methods).
The second observations is that the effect of partitioner depends on the block size, and on factor . When the block size is small, the difference between PH and MD partitioners becomes negligible. This is because partitions become fairly balanced just by chance, as we have many blocks to distribute, and even when partitions are slightly unbalanced, the difference in runtime becomes marginal owing to the small block size. Conversely, for the large block size the impact of the partitioner on runtime becomes critical. Figure 3
(bottom plot) explains this effect in case of blocked methods. The PH partitioner consistently fails to evenly distribute blocks to RDDs, which is caused by suboptimal choice of the hashing function (inspection of Spark code reveals that it uses XOR based mixing of elements of the tuple, which in case of uppertriangular matrix leads to many collisions). The resulting skew in hash distribution translates directly into poor runtimes (Figure
3 top plot). This is especially pronounced for the Blocked InMemory method when . In this case we have to deal with very large and unbalanced partitions with no room for load balancing by dynamic scheduling. Additionally, we run multiple repartitioning steps with their shuffling operations, which ultimately result in unbalanced spills to local storage. We note that because Blocked Collect/Broadcast mitigates shuffling, it performs significantly better.The last observation is with respect to the overdecomposition factor, . From our results it is clear that the Spark guideline of having is imperative. However, since in our solvers the total number of blocks to distribute is a function of the block size, for larger we may easily end up with . In such situations, it may be advantageous to scale down to increase without decreasing block size . This strategy however ultimately limits the scalability of the method. In our tests, most of the time we were able to maintain .
5.4. Scalability Tests
We now turn our attention to scalability of the solvers. We focus only on Blocked InMemory and Blocked Collect/Broadcast, since the other two methods cannot be expected to scale to larger problems (see Table 2). We perform weak scaling analysis, as it can be considered a typical use scenario for Spark (i.e., we increase cluster size to solve larger problem instances).
To assess scalability we measure operations per second (ops), which we express as , where is the time taken to solve the APSP problem of size using cores. Considering that the key computational building blocks used by our solvers have complexity , the measure is a good proxy to assess the performance. For convenience we are reporting gigaops (Gops) normalized with respect to (Gops/core). Our reference point is – the time taken for a proportional problem on a single core using efficient sequential FloydWarshall as implemented in SciPy and tested in Section 5.2.
Results of the experiment are summarized in Table 3 and Figure 5. Here we maintain to ensure that the largest resulting problem (i.e., ) is feasible given the hardware resources we use in the tests. Moreover, for each method and every problem size, we use our multidiagonal partitioner, and we select the optimal block size, , following the arguments given in the previous sections (we report block size in Table 3). Finally, for we record s, which translates into Gops.
Our results show that as expected Blocked Collect/Broadcast outperforms Blocked InMemory. Moreover, for the largest problem size (), Blocked InMemory runs out of space in the local storage, and is unable to finish processing. Both methods show rather stable scaling that saturates around . For performance is slightly degraded, which we attribute to Spark scheduler: for the small problem size, the number of RDD partitions per core is one, and that limits optimal resources utilization (note however, that choosing different block size does not improve performance). For , the Blocked Collect/Broadcast solver achieves 78% Gops/core with respect to the sequential solver, which we consider quite good result taking into account all Sparkadded overheads. We should keep in mind though that Blocked Collect/Broadcast is not faulttolerant.
Method /  64  128  256  512  1024 

BlockedIM  4m2s  14m20s  35m33s  2h17m  – 
1024  1024  1536  2048  –  
BlockedCB  2m50s  11m0s  34m16s  2h11m  8h9m 
1024  1280  1536  2048  2560  
FW2DMPI  2m3s  –  37m2s  –  11h51m 
DCMPI  1m15s  –  18m54s  –  2h52m 
s
5.5. Spark vs MPI
In the last set of tests we contrast performance of our solvers with the approaches based on MPI. The idea is to provide a reference point to judge Spark scaling. To make the comparison, we use two MPIbased APSP solvers. In the first solver, FW2DMPI, we implement the standard parallel FloydWarshall algorithm based on 2D block decomposition (Grama et al., 2003) (described in Section 4.3). The method is relatively straightforward to implement, however, as usual for MPI, the resulting code is objectively more verbose and complex than its Spark counterpart. The second method, DCMPI, is highly optimized divideandconquer solver by Solomonik et al. (Solomonik et al., 2013), available from https://github.com/solomonik/APSP. The solver has been demonstrated to scale extremely well to very large parallel machines, and is the stateoftheart HPC solution. Both solvers are written in C++, and we compile them with Intel MPI 2018 and g++ 7.8 for execution. The MPI codes are run on the same cluster as Spark, however, we use InfiniBand interconnect instead of GbE. Because both MPI solvers assume that processors are organized into a square grid in our tests we use .
From Table 3 and Figure 5 we can see that optimized DCMPI solver significantly outperforms ( on cores) Sparkbased methods. While this result is not surprising, it clearly highlights importance of good parallel algorithm. Specifically, for naive FW2DMPI is outperformed by our Spark solvers, which is explained by communication overheads in FW2DMPI that grow with (due to broadcast) and (due to iterative steps). At this point we should recall that 2D FloydWarshall is completely infeasible in Spark. This is a nice example showing that good runtime system is required but not sufficient to obtain good scalability. In our tests, we checked also impact of the interconnect on MPIbased solvers. When enforcing communication over GbE we observed 15%20% performance degradation, which still makes DCMPI far superior solution.
6. Conclusion
In 2017, Internet circulated widely discussed blog post by Jonathan Dursi “HPC is dying, and MPI is killing it” (Dursi, 2017). One of the messages in the post is that old standards, namely MPI, should and must be replaced by modern APIs like Spark, since MPI offers wrong level of abstraction, hindering programmers’ productivity. In this paper, we demonstrate that indeed by separating core computational functionality from the data and tasks distribution problem, we can implement APSP in pySpark in just few compact lines of code (including usual boilerplate). However, while our optimized implementation is able to handle fairly large graphs in acceptable time limits, it is not faulttolerant and is significantly outperformed by the stateoftheart MPI solver. This is because programmers’ productivity comes at the cost of limited functionality, which in APSP leads to implementations involving large and frequent data movements. While we mitigate this effect by coupling Spark with data broadcast through persistent storage, the resulting solution is far from elegant, and is clearly not faulttolerant.
7. Acknowledgments
The authors would like to acknowledge support provided by the Center for Computational Research at the University at Buffalo.
References
 (1)
 Spa (2019) 2019. Tuning Spark. https://spark.apache.org/docs/latest/tuning.html.
 Abrahams and GrosseKunstleve (2003) D. Abrahams and R.W. GrosseKunstleve. 2003. Building Hybrid Systems with Boost.Python. C/C++ Users Journal 21, 7 (2003), 29–36.
 Bernstein et al. (2000) M. Bernstein, V. de Silva, J.C. Langford, and J.B. Tenenbaum. 2000. Graph Approximations to Geodesics on Embedded Manifolds.
 Borgwardt and Kriegel (2005) K.M. Borgwardt and H. Kriegel. 2005. ShortestPath Kernels on Graphs. In IEEE International Conference on Data Mining (ICDM). 74–81.
 Cormen et al. (2009) T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. 2009. Introduction to Algorithms. MIT press.
 Dave et al. (2016) A. Dave, A. Jindal, L.E. Li, R. Xin, J. Gonzalez, and M. Zaharia. 2016. GraphFrames: An Integrated API for Mixing Graph and Relational Queries. In International Workshop on Graph Data Management Experiences and Systems. 2.
 Djidjev et al. (2014) H. Djidjev, S. Thulasidasan, G. Chapuis, R. Andonov, and D. Lavenier. 2014. Efficient MultiGPU Computation of AllPairs Shortest Paths. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 360–369.

Dursi (2017)
J. Dursi. 2017.
HPC is dying, and MPI is killing it.
https://www.dursi.ca/post/hpcisdyingandmpiiskillingit.html.  Grama et al. (2003) A. Grama, V. Karypis, G. Kumar, and A. Gupta. 2003. Introduction to Parallel Computing. Pearson.
 Jones et al. (2014) E. Jones, T. Oliphant, and P. Peterson. 2014. SciPy: Open Source Scientific Tools for Python. http://www.scipy.org/.
 Katz and Kider Jr (2008) G.J. Katz and J.T. Kider Jr. 2008. AllPairs ShortestPaths for Large Graphs on the GPU. In ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware. 47–55.
 Kepner et al. (2016) J. Kepner, P. Aaltonen, D. Bader, A. Buluc, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, et al. 2016. Mathematical Foundations of the GraphBLAS. arXiv:1606.05790 (2016).
 Kepner and Gilbert (2011) J. Kepner and J. Gilbert. 2011. Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics.
 Lam et al. (2015) S.K. Lam, A. Pitrou, and S. Seibert. 2015. Numba: A LLVMbased Python JIT Compiler. In Workshop on the LLVM Compiler Infrastructure in HPC. 7.
 Malewicz et al. (2010) G. Malewicz, M.H. Austern, A.J.C Bik, J.C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: A System for Largescale Graph Processing. In ACM SIGMOD International Conference on Management of Data. 135–146.
 Quirin et al. (2008) A. Quirin, O. Cordon, J. Santamaria, B. VargasQuesada, and F. MoyaAnegon. 2008. A New Variant of the Pathfinder Algorithm to Generate Large Visual Science Maps in Cubic Time. Information Processing and Management 44, 4 (2008), 1611–1623.
 Schoeneman and Zola (2018) F. Schoeneman and J. Zola. 2018. Scalable Manifold Learning for Big Data with Apache Spark. In IEEE International Conference on Big Data.
 Seidel (1995) R. Seidel. 1995. On the AllPairs ShortestPath Problem in Unweighted Undirected Graphs. J. Comput. System Sci. 51, 3 (1995), 400–403.
 Shoshan and Zwick (1999) A. Shoshan and U. Zwick. 1999. AllPairs ShortestPaths in Undirected Graphs with Integer Weights. In Annual Symposium on Foundations of Computer Science. 605–614.
 Solomonik et al. (2013) E. Solomonik, A. Buluc, and J. Demmel. 2013. Minimizing Communication in AllPairs Shortest Paths. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 548–559.

Talwalkar
et al. (2008)
A. Talwalkar, S. Kumar,
and H. Rowley. 2008.
Largescale Manifold Learning. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
. 1–8.  Tenenbaum et al. (2000) J.B. Tenenbaum, V. de Silva, and J.C. Langford. 2000. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 5500 (2000), 2319.

Ullman and
Yannakakis (1991)
J.D. Ullman and M.
Yannakakis. 1991.
The Input/Output Complexity of Transitive Closure.
Annals of Mathematics and Artificial Intelligence
3, 24 (1991), 331–360.  Venkataraman et al. (2003) G. Venkataraman, S. Sahni, and S. Mukhopadhyaya. 2003. A Blocked Allpairs Shortestpaths Algorithm. Journal of Experimental Algorithmics 8 (2003).

Xin
et al. (2015)
R. Xin, M. Armbrust,
and D. Liu. 2015.
Introducing DataFrames in Apache Spark for Large Scale Data Science.
https://databricks.com/blog/2015/02/17/introducingdataframesinsparkforlargescaledatascience.html.  Xin et al. (2013) R.S. Xin, J.E. Gonzalez, M.J. Franklin, and I. Stoica. 2013. GraphX: A Resilient Distributed Graph System on Spark. In International Workshop on Graph Data Management Experiences and Systems. 2.
 Zaharia et al. (2012) M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, and I. Stoica. 2012. Resilient Distributed Datasets: A Faulttolerant Abstraction for Inmemory Cluster Computing. In USENIX Conference on Networked Systems Design and Implementation (NSDI).
 Zwick (1998) U. Zwick. 1998. AllPairs ShortestPaths in Weighted Directed Graphs–Exact and Almost Exact Algorithms. In Annual Symposium on Foundations of Computer Science. 310–319.
Comments
There are no comments yet.