Performance of All-Pairs Shortest-Paths Solvers with Apache Spark

02/12/2019
by   Frank Schoeneman, et al.
University at Buffalo
0

Algorithms for computing All-Pairs Shortest-Paths (APSP) are critical building blocks underlying many practical applications. The standard sequential algorithms, such as Floyd-Warshall and Johnson, quickly become infeasible for large input graphs, necessitating parallel approaches. In this work, we provide detailed analysis of parallel APSP performance on distributed memory clusters with Apache Spark. The Spark model allows for a portable and easy to deploy distributed implementation, and hence is attractive from the end-user point of view. We propose four different APSP implementations for large undirected weighted graphs, which differ in complexity and degree of reliance on techniques outside of pure Spark API. We demonstrate that Spark is able to handle APSP problems with over 200,000 vertices on a 1024-core cluster, and can compete with a naive MPI-based solution. However, our best performing solver requires auxiliary shared persistent storage, and is over two times slower than optimized MPI-based solver.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

04/15/2018

A Deterministic Distributed Algorithm for Exact Weighted All-Pairs Shortest Paths in Õ(n^3/2) Rounds

We present a deterministic distributed algorithm to compute all-pairs sh...
05/21/2018

Distributed Algorithms for Directed Betweenness Centrality and All Pairs Shortest Paths

The betweenness centrality (BC) of a node in a network (or graph) is a m...
01/07/2021

A Deterministic Parallel APSP Algorithm and its Applications

In this paper we show a deterministic parallel all-pairs shortest paths ...
03/14/2019

Fast Approximate Shortest Paths in the Congested Clique

We design fast deterministic algorithms for distance computation in the ...
07/06/2021

An MPI-based Algorithm for Mapping Complex Networks onto Hierarchical Architectures

Processing massive application graphs on distributed memory systems requ...
02/05/2018

Shortest k-Disjoint Paths via Determinants

The well-known k-disjoint path problem (k-DPP) asks for pairwise vertex-...
04/06/2017

A Comparison of Parallel Graph Processing Implementations

The rapidly growing number of large network analysis problems has led to...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

All-Pairs Shortest-Paths is a classic problem in graph theory. The problem is both compute and memory intensive, owing to computational and space complexity for graphs with nodes. Over the years, many parallel APSP solvers have been proposed, including for distributed memory clusters, multi-core processors, and accelerators. However, to the best of our knowledge, currently no dedicated APSP approach is available for the Apache Spark model (see Section 2

). This is somewhat surprising: many problems in Big Data analytics and Machine Learning (ML) – domains in which Spark holds dominating position – involve APSP as a critical kernel. For example, shortest paths in a neighborhood graph over high-dimensional points are known to be very robust approximation of geodesic distances on the underlying manifold 

(Bernstein et al., 2000). Spectral dimensionality reduction methods, such as Multidimensional Scaling or Isomap (Tenenbaum et al., 2000), directly involve APSP solver in their workflows. The same holds true for many other techniques, like networks classification (Borgwardt and Kriegel, 2005) or information retrieval (Quirin et al., 2008). In all these applications, data sets with hundreds of thousands and even millions of points/vertices are not uncommon (Talwalkar et al., 2008; Schoeneman and Zola, 2018). At the same time Spark is known and praised for its focus on programmer productivity and ease-of-use, including equally convenient deployments in HPC centers as in computational clouds. This makes Spark the frequent platform of choice for non-experts in parallel computing, who want to combine pure data analytics with processing of large graph instances.

In this paper, we ask the question of how efficient and scalable parallel APSP solver can we achieve with Apache Spark. Deriving from the known parallel distributed memory APSP algorithms, we propose four different APSP Spark implementations, which vary in how far they depart from what would be a pure Spark implementation. We extract basic functional blocks that when combined with Spark transformations and actions deliver complete APSP implementations in just few lines of code, promoting programmer productivity. Then, we perform extensive experimental benchmarking to assess efficiency and scalability of the resulting implementations on problems with up to vertices using a 1,024-core cluster. We contrast performance of Apache Spark with naive and highly optimized implementations in MPI. Our results show that using pure Spark implementation, that does not involve auxiliary mechanisms and is fault-tolerant, is impractical for large problems due to the high cost of data shuffling. By leveraging collect and broadcast operations performed via auxiliary storage we are able to push the size of the problems we can solve, achieving good scalability. However, even then Spark-based solutions are outperformed by MPI. Our results suggest that while Spark is certainly able to handle large APSP problems, and can compete with naive MPI-based approach, end-users should be aware of the trade off between programmer’s productivity (and the ease of use), and the achievable scalability.

The paper is organized following the common convention. In Section 2, we briefly review related parallel APSP solutions. In Section 3, we state basic assumptions and definitions we use in the work. In Section 4, we propose different strategies to implement APSP with Apache Spark and discuss their pros and cons taking into account implementation complexity and workarounds to missing Spark functionality. We describe detailed experimental validation of the methods in Section 5, and conclude the paper with brief discussion in Section 6.

2. Related Work

The APSP remains very actively researched problem in parallel computing, and hence we necessarily limit our review to projects that directly relate to our work.

In (Solomonik et al., 2013), Solomonik et al. propose recursive 2D block-cyclic algorithm that achieves lower-bound on communication latency and bandwidth, which they next extend to 2.5D communication avoiding formulation. On machine with 24,576 cores, the algorithms maintain strong scaling for problems with nodes, and weak scaling for up to nodes. The paper is also noteworthy for its extensive review of distributed memory APSP solvers. The advantage of the recursive formulation is induced data locality, which directly contributes to improved performance (Solomonik et al., 2013). However, in Spark, the concept of data locality is much weaker, since Spark’s runtime system has a significant freedom in scheduling where to materialize or move data for computations. Even though programmer has some control over the data placement (e.g., via partitioning functions), it does not have direct control over where computations will be executed. Therefore, in our work, we derive from the iterative formulation of the 2D block-cyclic algorithm, which can be found in (Venkataraman et al., 2003), and is one of several methods tracing their origin to transitive closure (Ullman and Yannakakis, 1991).

In (Katz and Kider Jr, 2008), Katz et al. give a solution of the transitive closure problem, and apply it to solving APSP on a GPU. They demonstrate 2-4 speedup over highly tuned CPU implementation, and report problems for up to . Another efficient GPU-based solution is given by Djidjev et al. in (Djidjev et al., 2014). However, the method applies only to planar graphs. Djidjev solves APSP on sparse graphs with up to one million nodes, using a combination of multi-core CPUs and two GPUs. While we could combine GPU acceleration with Spark, e.g., by using Python and Numba Just-in-Time (JIT) compiler (Lam et al., 2015), it is not clear how advantageous such approach would be (we provide more detailed discussion in Sections 4 and 5).

APSP is one of several graph primitives that can be directly posed as a linear algebra problem, and solved using matrix operations over the semi-ring  (Kepner and Gilbert, 2011), e.g., by employing GraphBLAS (Kepner et al., 2016). Since asymptotically efficient matrix multiplication algorithms (e.g., Strassen) are not applicable to the general APSP under min-plus matrix multiplication, algorithms to improve on complexity have been proposed for cases when the graph of interest is directed, undirected and unweighted, or undirected with integer weights from a finite set (Seidel, 1995; Shoshan and Zwick, 1999; Zwick, 1998). In our work, we exploit the semi-ring formulation to derive one possible implementation, and as in case of GPU acceleration, we could combine multi-core GraphBLAS implementation with Spark backbone (see Section 4).

Apache Spark frameworks, such as GraphX (Xin et al., 2013) (no longer maintained) and GraphFrames (Dave et al., 2016), offer multi-source shortest-paths algorithms. These algorithms are simple extensions of the single source shortest paths solver in Pregel/BSP model (Malewicz et al., 2010), and are not designed with APSP in mind. Our proposed implementations break away from the Pregel model because in the initial tests, GraphX was unable to handle any reasonable problem size, prompting us to investigate the solutions reported in this paper.

3. Preliminaries

We focus on computing length of all pairs shortest paths (i.e., no paths themselves) in undirected weighted graphs with no negative cycles. Let be such undirected graph, where is a set of vertices, is a set of edges, and determines edge weights. Here, we make no assumption about the graph sparsity, or weights distribution. In general, and in ML or Big Data analytics in particular, each vertex might be a complex object with non-trivial memory layout. However, we assume that some initial pre-processing of the input graph has been performed, and each vertex is uniquely identified by an integer index. This is inline with the common use patterns, where APSP is invoked as a computational building block within a larger framework. The standard approach to handle APSP in such cases is to use a variant of classic Floyd-Warshall (Cormen et al., 2009) or Johnson algorithms (Cormen et al., 2009), with complexity and , respectively. The Johnson algorithm offers better asymptotic behaviour for reasonably sparse graphs, but typically Floyd-Warshall derivatives outperform it as they allow for better computational density (see also (Solomonik et al., 2013)). When using Floyd-Warshall and related algorithms, we will represent adjacency matrix of by , where stores the weight of edge between vertices with indices  and . Furthermore, we are going to use the following notation. We will write to describe block of matrix (size of the block and indexing order will be clear from the context), and and to describe row-block and column-block , respectively. Similarly, we will write and to describe row and column of , and and to describe column in row-block and row in column-block .

To maintain all data elements in our proposed solutions, e.g., matrix , we will primarily depend on Spark’s Resilient Distributed Datasets (RDDs) (Zaharia et al., 2012). For the sake of completeness: RDD is a fault-tolerant abstraction over distributed memory, tailored for in-memory iterative algorithms. RDDs are non-mutable, and are lazily evaluated and transparently materialized either in main memory or persistent storage. Unless explicitly specified by a programmer, their partitioning is automatically handled by Spark runtime using either keys range or hash-based partitioning schemes. From the computational point of view, objects within RDD can be processed in parallel by Spark executors using a set of transformations that yield new RDDs. Spark driver node is responsible for managing RDD lineage, and orchestrating the Spark application execution.

When implementing APSP solvers, whenever possible we will depend solely on Apache Spark API, benefiting from the implicit fault-tolerance ensured by the platform. Our focus will be on programmer-perceived productivity (e.g., minimizing boilerplate code, using convenient high-level abstractions, etc.). We will refer to such implementations as pure. To improve performance, in some implementations we will depend on workarounds to missing Spark functionality (e.g., implementing point-to-point data exchange through a shared file system). When such implementations will not be fault-tolerant (e.g., failed tasks depending on data in a shared file system are not guaranteed to be able to access that data when rescheduled), we will refer to them as impure.

4. APSP with Apache Spark

We propose and investigate four different APSP approaches that vary in the implementation complexity (and as we shall see, in performance). In all approaches, we 2D decompose the adjacency matrix into blocks, where , and is a user-provided (or auto-tuned) decomposition parameter. We store the resulting matrix blocks as key-value tuples in Spark RDD, where tuple is a key we will use to reference corresponding data block. The matrix decomposition will be different from RDD partitioning, in that a single RDD partition will by maintaining multiple matrix blocks. We will use different ways of assigning RDD partitions to executors (see Section 5), which we can think of as over-decomposition of . Moreover, even though we will be discussing our approaches as if the entire matrix was stored in RDD (to simplify presentation), in the actual implementation we exploit symmetricity of and store only its upper-triangular part. The remaining blocks are generated on-demand by transposition (with no measurable overheads). Hence, executor responsible for the processing of block is also responsible for processing of block . This enables us to reduce the total amount of data maintained by the RDD, while increasing computational cost of processing tasks. We note that by disregarding symmetricity of , our algorithms can be directly adopted for cases where is a directed graph.

In our experiments, we use pySpark. The choice is motivated by practical considerations. First, the use of Python makes it convenient and efficient to store and communicate data blocks maintaining C/C++ compatible row-major matrix representation (e.g., via NumPy arrays). This in turn enables us to easily offload computationally intensive operations to a bare-metal runtime, either using direct C/C++ bindings (e.g., CPython, Boost.Python (Abrahams and Grosse-Kunstleve, 2003), implicitly SciPy (Jones et al., 2014)), or through the excellent Numba JIT compiler (Lam et al., 2015). Second, serialization and deserialization of Python objects, which plays a role in our implementations (e.g., when exchanging data via persistent storage), is supported out-of-box, further contributing to the ease of implementation and programmer’s productivity. We recognize the fact that pySpark tends to incur overheads due to object conversions between Python and Spark’s native Scala runtime (Xin et al., 2015). However, as we mentioned earlier, in this work we opt for productivity and compactness of the Python code while leveraging bare-metal execution when applicable.

4.1. Implementation Elements

To separate computational parts of an APSP solver from the pure Spark mechanics, we first identify functional blocks that will be shared between different implementations. We are careful here not to draw analogy to, e.g., graph building blocks (Kepner et al., 2016), that have solid semantics and theoretical framework attached. In our case, the functional elements are just natural consequence of code separation to enable efficient implementation. We present each element as a function (we use term function somewhat liberally, since in our case function may return/yield multiple elements for the same argument). The functions will be passed to Spark transformations, in which case their first argument, omitted from invocation, will always be a single record stored in the RDD for which transformation is invoked. Table 1 summarizes all functions. How they are used will become clear in the following subsections.


Predicate testing whether given block is in column-block .

Predicate testing if given block is -th block on diagonal.

Return a new record with

-th column of given block (the column is stored as a vector). Within a block rows and columns are zero-indexed.


Create copies of a diagonal block, each identified by different column-block index.
and
Create copies of a given block, each identified by different column-block index, and copies of its transpose, each identified by different row-block index.

Return element-wise minimum between block and .

Return min-plus product between block and .

Returns MatProd followed by MatMin for block and .

Execute Floyd-Warshall, or any other APSP solver, over diagonal block , and return the resulting distance matrix .

Append block to list of blocks . This function is a simplified representation of the actual Spark combiner.

Process list of blocks to return block that will be second argument of min-plus product with .

We use yield to indicate that multiple elements are returned.

Table 1. Summary of our functional elements.

4.2. Repeated Squaring

In the first approach, we exploit APSP min-plus product formulation, and we essentially compute using classic repeated squaring method. While asymptotically this method is clearly inefficient, it is very fast to implement, hence we include it in the comparison. Given RDD with block decomposed , repeated squaring becomes a sequence of three steps over the RDD: cartesian followed by filter to group blocks that should be multiplied in the current iteration, map applying min-plus product (function MatProd), and finally reduceByKey with function MatMin to finalize the product. All blocks of can be persisted in the total main memory of the executing cluster, hence directly leveraging support for iterative algorithms in Spark (this will be the case for the remaining algorithms as well). However, the problem with this approach is reliance on cartesian that involves extensive all-to-all data shuffle, which we found to stall even on small problems. To bypass this bottleneck of pure implementation, we replaced cartesian with iteration over the column-blocks of , effectively rewriting matrix-matrix product into a series of matrix-vector products. This allows us to reduce data movement, at the cost of increased Spark overheads (like scheduling delays or tasks deserialization costs).

1for i in range(0, log(n)):
2  for J in range(0, q):
3    col = A.filter(InColumn(J)).collect()
4    for block in col: block.tofile()
5    T[J] = A.map(MatProd).reduceByKey(MatMin)
6  A = sc.union(T)
Algorithm 1: Repeated squaring with column-blocks.

The core of the resulting implementation is outlined in Algorithm 1. In line 3, we identify blocks of column-block to multiply, and group them on the Spark driver node, which next distributes the entire column to executors (line 4). Note that we do not broadcast the column, but rather store its blocks in a shared file system available to driver and executor nodes (e.g., HDFS, GPFS, etc.). As a result, the appropriate blocks can be selected and used by executors only when needed (most of the executors will not be using all blocks). In line 5, we perform the actual matrix-vector product. Here, MatProd takes its first argument, block , directly from the RDD, and the second argument is the -th block of the column-block  deserialized from the shared secondary storage. Both MatProd and MatMin are delegated for bare-metal execution using Numba and NumPy, respectively. The results of individual matrix-vector products are brought together via union (line 6) to form RDD ready for the next iteration of the algorithm.

Our implementation of repeated squaring does not require any particular attention to, e.g., when RDDs are materialized, delegating the entire execution to Spark. However, utilizing persistent storage for broadcast introduces side-effects, which go against Spark’s fault-tolerance mechanisms, making the implementation impure. We note that one could consider using shared file system directly at the filtering stage, eliminating the need for collect on the driver. However, this approach is problematic considering Spark scheduler, e.g., files written in one RDD may not be materialized on time to be used by a subsequent RDD.

The computational cost of repeated squaring is bounded by iterations, where each iteration involves map transformation with operation, followed by reduction.

4.3. 2D Floyd-Warshall

In our second approach, we adopt the textbook parallel Floyd-Warshall algorithm based on 2D block decomposition (Grama et al., 2003). In this approach, parallelism is due to the two innermost loops of Floyd-Warshall, which can be partitioned once row/column indexed by the outermost loop is block-wise broadcast, and locally available to the processors. Since Spark provides collect and broadcast that combined can emulate standard broadcast initiated at any processor, we can easily express the entire algorithm in Spark.

1for k in range(0, n):
2  K = k / b
3  kloc = k % b
4
5  D = A.filter(InColumn(K)).map(ExtractCol(kloc)) \
6       .collect()
7
8  colk = sc.broadcast(D)
9
10  A = A.map(FloydWarshallUpdate)
Algorithm 2: Floyd-Warshall with 2D decomposition.

Our Spark implementation is outlined in Algorithm 2. The algorithm proceeds in steps. In iteration , we first identify column blocks storing column , and then we extract the actual column using function ExtractCol (lines 2-5). The column is aggregated on the driver node (line 6) and broadcast to all executors (line 8). Because the memory footprint of a column is very small, the operation can be easily performed without the need for persistent storage. This step is an explicit synchronization point. Once column is available to every executor, we proceed with the update phase of Floyd-Warshall algorithm, as implemented in FloydWarshallUpdate function. In this case, the first argument is block  taken directly from the RDD, and the two remaining arguments are vectors extracted from the column (exploiting the fact that is symmetric). As in earlier algorithm, the actual computations are delegated for bare-metal execution using Numba JIT compilation.

The 2D Floyd-Warshall is interesting in the sense that it is amenable to pure Spark implementation, supporting fault-tolerance, and involving no side-effects. Moreover, it does not require any so called wide transformations, which trigger data shuffling and movement. Consequently, it plays into Spark strengths. However, its computational cost is bounded by iterations, where each iteration involves building blocks with cost . Since synchronization overheads in Spark are significant, we anticipate poor scalability.

4.4. Blocked In-Memory

In the third approach, we build on top of the blocked APSP from Venkataraman et al. (Venkataraman et al., 2003). Originally, the algorithm was designed to improve cache utilization of the standard Floyd-Warshall. It is an iterative algorithm, where each iteration consist of three phases, as depicted in Figure 1. In Phase 1, diagonal block is processed using any efficient APSP solver (in our case sequential Floyd-Warshall). In Phase 2, the result is passed to update shortest paths in all blocks of block row and block column . Then, in Phase 3, iteration is completed by performing update on the remaining blocks.

Figure 1. Processing phases in the blocked Floyd-Warshall algorithm. Diagonal blocks form critical path.
1for i in range(0, q):
2  diag = A.filter(OnDiagonal(i))
3  d = diag.map(FloydWarshall).flatMap(CopyDiag) \
4          .partitionBy(partitioner, p)
5
6  rowcol = A.filter(InColumn(i))
7  d = sc.union([d, rowcol]) \
8        .combineByKey(ListAppend).map(ListUnpack) \
9        .map(MatMin).flatMap(CopyCol) \
10        .partitionBy(partitioner, p)
11
12  A = A.filter(not InColumn(i)).union(d) \
13       .combineByKey(ListAppend) \
14       .map(ListUnpack).map(MatMin) \
15       .partitionBy(partitioner, p)
Algorithm 3: Blocked In-Memory.

Our corresponding implementation, which we call Blocked In-Memory, is outlined in Algorithm 3. In iterations, we start by processing diagonal block (lines 2-4) using FloydWarshall function executed on bare-metal via call to optimized SciPy routine. The diagonal block is extracted from RDD via filtering, and once processed we explicitly create its copies (function CopyDiag) that are distributed to executors via custom partitioner (line 4). Here our goal is to place these copies in the RDD partitions that store blocks from the corresponding column/row block (extracted in line 6), thus maximizing data locality and mitigating partitions blowup due to RDDs merging in line 7 (we provide details of partitioning schemes in the next section). We can think about this step as using data shuffling to simulate broadcast, testing Sparks ability to optimize data distribution and corresponding tasks scheduling. To complete pairing of diagonal and column/row blocks we use a combination of combineByKey and map (line 8), by first aggregating a list of blocks to pair (ListAppend), and then enumerating the actual block pairs (ListUnpack). The resulting pairs are processed to compute the update (line 9), and as previously we use block copying and custom partitioning to bring together blocks that will be operated on in Phase 3 (lines 9-10). To finalize iteration, we implement Phase 3 computations (lines 12-15) following the same pattern as in case of Phase 2.

The above implementation of the blocked algorithm depends entirely on fault-tolerant Spark functionality, and hence we consider it pure. While it operates in iterations, it is data intensive due to data copying and shuffling involved. However, the method should be able to leverage significant data locality, hence it is a good test for Spark runtime system.

The computational cost of the algorithm is bounded by iterations, with each iteration bounded by . While asymptotically this is similar to 2D Floyd-Warshall, the blocked algorithm allows us to control the trade-off between the the cost of single iteration and the number of iterations (which is not the case in the previous methods).

4.5. Blocked Collect/Broadcast

Our last implementation, given in Algorithm 4, is another take on the blocked APSP, this time bypassing explicit data shuffling. Specifically, instead of pairing a diagonal block with column/row blocks, we bring it to the driver node via collect, which then redistributes it to the executors via shared persistent storage (lines 2-3). We note that here, similar to the repeated squaring algorithm, we do not use broadcast because in pySpark each task created by executor maintains its local copy of the broadcast variables. This usually means exceeding executor’s memory, since the number of running tasks may be the same as the number of cores in a node running the executor. To realize Phase 2, we apply MinPlus function on the column blocks such that the first argument comes from the RDD, and the second argument is taken from Spark storage (line 5). As in earlier phase, we aggregate the entire column on the driver node and then redistribute it via shared persistent storage (lines 6-7). Finally, in line 9, we perform Phase 3 computations updating the remaining blocks of with the column blocks from persistent storage. All updated blocks are brought into a single RDD via union, and repartitioned to match intended partitioning of (lines 11-12).

1for i in range(0, q):
2  diag = A.filter(OnDiagonal(i)).map(FloydWarshall)
3  diag.collect().tofile()
4
5  rowcol = A.filter(InColumn(i)).map(MinPlus)
6  rowcol_coll = rowcol.collect()
7  for b in rowcol_coll: b.tofile()
8
9  offcol = A.filter(not InColumn(i)).map(MinPlus)
10
11  A = sc.union([diag, rowcol, offcol]) \
12        .partitionBy(partitioner, p)
Algorithm 4: Blocked Collect/Broadcast.

The collect and broadcast implementation of the blocked algorithm involves secondary storage to handle communication, and hence we consider it impure. Except of custom partitioning of matrix , we do not repartition RDD that stores column/row blocks (line 5). Instead we depend on the partitioning scheme of from which we extract blocks. This means that Spark runtime will most likely trigger data movement to utilize all executors.

Asymptotically, the cost of the algorithm is the same as Blocked In-Memory. However, in single iteration we substitute data shuffling with writing to auxiliary storage.

5. Experimental Analysis

To benchmark all four implementations, we performed a set of experiments on a standalone Apache Spark cluster with 32 nodes and GbE interconnect. Each node in the cluster is equipped with two 16-core Intel Skylake (Intel Xeon Gold 6130 2.10GHz) processors with 32KB L1 and 1024KB L2 cache, and 192GB of RAM (thus the cluster provides total of 1,024 cores and 6TB of RAM). Moreover, each node has standard SSD drive available, which Spark uses for local data staging (available local storage is 1TB). We note that the use of SSDs is essential for Spark performance, since all data movement in Spark involves staging in the local storage. The local storage is complemented by shared GPFS storage that executors can use for additional communication (e.g., as in case of Blocked Collect/Broadcast algorithm).

In all tests, Spark driver was run with 180GB of memory, to allow efficient management of complex RDD lineages. For the same reason, we run the driver on a dedicated node separately from Spark executors. We allocated one executor per node using the default configuration for the number of cores, i.e., each executor was using all available cores in a node. All executors were configured to use 180GB out of the available 192GB, with the remaining memory available to the operating system and Python interpreter. We note that we tested different Spark runtime configurations, including executor-to-core ratio, memory use fractions, etc. (Spa, 2019), without noticeable difference in performance.

Our entire software is implemented in Apache Spark 2.2 pySpark, and Python 2.7. Compute intensive linear algebra operations and sequential Floyd-Warshall algorithm are offloaded to bare-metal via NumPy and SciPy that are configured to work with the Intel MKL 2017 BLAS library. Finally, we use Numba 0.35 for just-in-time compilation whenever required. Our entire software is open source and available from https://gitlab.com/SCoRe-Group/APSPark.

5.1. Test Data

All four APSP solvers are oblivious to the structure of input graph, and in our implementations we do not include optimizations to target any specific graph properties. Consequently, the scalability of every solver is a function of graph size expressed only by the number of vertices,

. Therefore our input data is simply a set of synthetic Erdős-Rényi graphs. In all graphs, the probability of edge, 

, is set to , with . The particular choice of is to make graphs generation fast. We note that taking larger to increase likelihood of connectivity in no way adversely affects our performance, as we disregard the cost of populating RDD that stores the adjacency matrix , and each of our approaches scales only with . In our tests, we consider graphs for  up to 262,144.

5.2. Block Size and Sequential Components

In the first set of experiments, we setup a baseline for subsequent comparisons by analyzing performance of the key functional elements, MatProd combined with MatMin and FloydWarshall, depending on the block size . These building blocks will be dispatched by Spark for sequential execution, and hence their performance will be critical to the overall performance. Moreover, in all our implementations, the block size directly affects level of available parallelism, and level of data movement. If the block size is too large, we may be spending too much time in a sequential execution or staging data blocks in the auxiliary storage, and we may not be able to leverage all executors. On the other hand, if block size is too small, overheads due to tasks scheduling and data shuffling, or creating many small files in the auxiliary storage, may dominate the execution. Hence, the block size should be selected carefully.

To test the sequential components, we call the corresponding functions directly from Python, the same way as they would be invoked by Spark, i.e., FloydWarshall via SciPy with Intel MKL, and MatProd and MatMin via Numba. To test the actual solvers, we use our pySpark implementations. In all tests, we look at the observed execution times. Results of these experiments are reported in Figures 2 and 3.

Figure 2 shows how block size, , affects the time of processing a single adjacency matrix block. As expected, the runtime increases roughly as , in line with the asymptotic behavior of the underlying algorithms. For up to approximately 3,000, sequential operations can be executed very quickly, primarily because adjacency matrix fits in cache memory. Given available Skylake cache, the approximate size for processing completely in L3 cache is around . Once is above that threshold, runtime starts to grow rapidly, going into minutes.

Figure 2. The effect of block size on the execution time of sequential operations MatProd combined with MatMin, and FloydWarshall.
Figure 3. Example effect of block size on the execution time of Blocked In-Memory (IM) and Blocked Collect/Broadcast (CB), depending on the choice of block size, partitioner and partition size. Top: default Spark partitioner. Middle: our multi-diagonal partitioner. Bottom: distribution of RDD partition sizes incurred by both partitioners when . Problem size run on  cores. indicates the number of RDD partitions per core.

Figure 3 highlights the impact of partitioning granularity on the total execution time. Here we focus on Blocked In-Memory (IM) and Blocked Collect/Broadcast (CB) methods, as these are the best performing methods (see next subsection).

Irrespective of the choice of RDD partitioner (we discuss partitioners later), the runtime of both methods first decreases, and then grows with the growing block size. When the block size is too small (), the Blocked In-Memory solver fails. This is because the volume of the data that has to be staged in local SSDs due to shuffling exceeds the storage capacity of our cluster (in our case, each node is equipped with 1TB of local storage). Specifically, data repartitioning (call to partitionBy in Algorithm 3, line 15) triggers data shuffling in each iteration. Because shuffled blocks are spilled to the local storage and preserved for fault tolerance, the storage requirement grows linearly with the number of iterations. At the same time, eliminating the call to partitionBy is not an option: without repartitioning step the number of partitions in the RDDs created via union (Algorithm 3, line 12) would quickly explode since in Spark each component RDD preserves its partitioning when in union. The excessive number of partitions would degrade performance of the combineByKey step, and would add significant scheduling overheads. We note that this issue does not affect Blocked Collect/Broadcast, because the volume of the data we manage is smaller (recall that here we avoid creating copies of the data blocks), and repartitioning is thus less aggressive.

The results above confirm the impact of the block size on the performance of our APSP solvers. They suggest also that accelerating single block computations, e.g., by delegating them to GPGPU, could further improve performance of the solvers. Specifically, since with acceleration we should be able to process larger blocks in the same time limit, we could reduce the number of iterations required by the solver without significantly affecting the cost of individual iterations.

5.3. Tuning Performance of the Solvers

In the next set of experiments, we take a closer look at the performance of our solvers in relation to RDD partitioning scheme that we use to distribute RDD with matrix . In Figure 3, we can observe that both partitioning function as well as the number of RDD partitions per core (parameter ) affect the runtime, and hence tuning them may lead to better performance.

The efficiency of each of our APSP methods depends on data locality. Ideally, to reduce the number and frequency of Spark data shuffles, blocks that are paired for computation should be assigned to the same partition, and partitions should be evenly distributed between executors. We should keep in mind however, that even then there is no guarantee that the executor maintaining given partition will be responsible for its processing. Thus, to provide a room for load balancing, the ratio between the number of partitions and the number of available cores should be more than one (per Spark guidelines (Spa, 2019), the number of partitions is recommended to be - the total number of cores, i.e., ).

Method Partitioner Decomposition Time
Iterations Single Projected
Repeated Squaring MD 256 18432 45s 9d16h
512 9216 2m23s 15d8h
1024 4608 5m6s 16d8h
2048 2304 19m45s 31d15h
4096 1152 51m47s 41d10h
PH 256 18432 44s 9d11h
512 9216 2m7s 13d13h
1024 4608 6m5s 19d12h
2048 2304 18m39s 29d21h
4096 1152 1h15m 60d6h
2D Floyd-Warshall MD 256 262144 21s 64d11h
512 262144 18s 53d10h
1024 262144 17s 51d22h
2048 262144 18s 55d7h
4096 262144 20s 61d9h
PH 256 262144 21s 65d8h
512 262144 18s 55d10h
1024 262144 16s 49d7h
2048 262144 20s 60d3h
4096 262144 19s 56d9h
Blocked-IM MD 256 1024 51s 14h29m
512 512 1m11s 10h8m
1024 256 1m55s 8h12m
2048 128 3m44s 7h59m
4096 64 7m21s 7h51m
PH 256 1024 48s 13h32m
512 512 1m14s 10h33m
1024 256 2m12 9h23m
2048 128 4m3s 8h39m
4096 64 8m49s 9h24m
Blocked-CB MD 256 1024 48s 13h35m
512 512 1m1s 8h40m
1024 256 1h40m 7h8m
2048 128 3m18s 7h4m
4096 64 8m23s 8h57m
PH 256 1024 46s 13h12m
512 512 1m3s 9h4m
1024 256 1m51s 7h54m
2048 128 3m51s 8h15m
4096 64 9m23s 10h2m

Table 2. The effect of block size on execution time.

In our implementation we consider two partitioners. The first one is the default Spark partitioner, called Portable Hash (PH), which is available in pySpark as method portable_hash. This is the partitioner that one would use ad hoc. The partitioner is expected to distribute RDD uniformly at random by computing a hash code for Python tuples , which are used as keys in all our RDDs. The resulting hash values aim at equal distribution of RDD partitions between executors, but provide no guarantee on achieving data locality desired by our solvers.

Figure 4. Data layout induced by our multi-diagonal partitioner. Blocks with the same index are assigned to the same RDD partition.

The second partitioner is our multi-diagonal partitioner (MD). The assignment of matrix blocks to partitions induced by this approach is outlined in Figure 4. The idea behind MD partitioner is to explicitly place blocks that are in the same column- and row-block into separate partitions, while maintaining equal block distribution between the partitions. The strategy is especially critical to Blocked In-Memory and Blocked Collect/Broadcast methods, as it enables us to avoid bottlenecks in Phase 2 of these algorithms. At this point we should note that MD partitioner may seem counter-intuitive from the perspective of matrix algorithms based on 2D decomposition. Typically, such algorithms involve some form of grid-based partitioning to optimize communication patterns. However, as we mentioned earlier, in Spark we have limited control over which executors handle which partitions. Therefore, the best strategy is to use partitioning which will give scheduling flexibility to Spark runtime.

In Table 2 we quantify the effect of the block size and the partitioner on the performance of each solver. Here several observations stand out. First, Repeated Squaring and 2D Floyd-Warshall both are infeasible on large problems, with projected execution times going into days. The poor performance is especially pronounced for 2D Floyd-Warshall. Even though from the computational point of view the method is asymptotically comparable to the blocked methods, it suffers from unfavorable balance between computations and data movement – each iteration involves operations, the number of iterations is linear in the problem size. Repeated Squaring on the other hand shows very good performance in single iteration, but this is insufficient to compensate higher number of iterations (compared to the blocked methods).

The second observations is that the effect of partitioner depends on the block size, and on factor . When the block size is small, the difference between PH and MD partitioners becomes negligible. This is because partitions become fairly balanced just by chance, as we have many blocks to distribute, and even when partitions are slightly unbalanced, the difference in runtime becomes marginal owing to the small block size. Conversely, for the large block size the impact of the partitioner on runtime becomes critical. Figure 3

(bottom plot) explains this effect in case of blocked methods. The PH partitioner consistently fails to evenly distribute blocks to RDDs, which is caused by suboptimal choice of the hashing function (inspection of Spark code reveals that it uses XOR based mixing of elements of the tuple, which in case of upper-triangular matrix leads to many collisions). The resulting skew in hash distribution translates directly into poor runtimes (Figure 

3 top plot). This is especially pronounced for the Blocked In-Memory method when . In this case we have to deal with very large and unbalanced partitions with no room for load balancing by dynamic scheduling. Additionally, we run multiple repartitioning steps with their shuffling operations, which ultimately result in unbalanced spills to local storage. We note that because Blocked Collect/Broadcast mitigates shuffling, it performs significantly better.

The last observation is with respect to the over-decomposition factor, . From our results it is clear that the Spark guideline of having is imperative. However, since in our solvers the total number of blocks to distribute is a function of the block size, for larger we may easily end up with . In such situations, it may be advantageous to scale down to increase without decreasing block size . This strategy however ultimately limits the scalability of the method. In our tests, most of the time we were able to maintain .

5.4. Scalability Tests

We now turn our attention to scalability of the solvers. We focus only on Blocked In-Memory and Blocked Collect/Broadcast, since the other two methods cannot be expected to scale to larger problems (see Table 2). We perform weak scaling analysis, as it can be considered a typical use scenario for Spark (i.e., we increase cluster size to solve larger problem instances).

To assess scalability we measure operations per second (ops), which we express as , where is the time taken to solve the APSP problem of size using cores. Considering that the key computational building blocks used by our solvers have complexity , the measure is a good proxy to assess the performance. For convenience we are reporting giga-ops (Gops) normalized with respect to (Gops/core). Our reference point is – the time taken for a proportional problem on a single core using efficient sequential Floyd-Warshall as implemented in SciPy and tested in Section 5.2.

Results of the experiment are summarized in Table 3 and Figure 5. Here we maintain to ensure that the largest resulting problem (i.e., ) is feasible given the hardware resources we use in the tests. Moreover, for each method and every problem size, we use our multi-diagonal partitioner, and we select the optimal block size, , following the arguments given in the previous sections (we report block size in Table 3). Finally, for we record s, which translates into Gops.

Our results show that as expected Blocked Collect/Broadcast outperforms Blocked In-Memory. Moreover, for the largest problem size (), Blocked In-Memory runs out of space in the local storage, and is unable to finish processing. Both methods show rather stable scaling that saturates around . For performance is slightly degraded, which we attribute to Spark scheduler: for the small problem size, the number of RDD partitions per core is one, and that limits optimal resources utilization (note however, that choosing different block size does not improve performance). For , the Blocked Collect/Broadcast solver achieves 78% Gops/core with respect to the sequential solver, which we consider quite good result taking into account all Spark-added overheads. We should keep in mind though that Blocked Collect/Broadcast is not fault-tolerant.

Method     /      64 128 256 512 1024
Blocked-IM 4m2s 14m20s 35m33s 2h17m
  1024 1024 1536 2048
Blocked-CB 2m50s 11m0s 34m16s 2h11m 8h9m
  1024 1280 1536 2048 2560
FW-2D-MPI 2m3s 37m2s 11h51m
DC-MPI 1m15s 18m54s 2h52m

s

Table 3. Weak scaling of blocked methods.
Figure 5. Weak scaling of blocked methods.

5.5. Spark vs MPI

In the last set of tests we contrast performance of our solvers with the approaches based on MPI. The idea is to provide a reference point to judge Spark scaling. To make the comparison, we use two MPI-based APSP solvers. In the first solver, FW-2D-MPI, we implement the standard parallel Floyd-Warshall algorithm based on 2D block decomposition (Grama et al., 2003) (described in Section 4.3). The method is relatively straightforward to implement, however, as usual for MPI, the resulting code is objectively more verbose and complex than its Spark counterpart. The second method, DC-MPI, is highly optimized divide-and-conquer solver by Solomonik et al. (Solomonik et al., 2013), available from https://github.com/solomonik/APSP. The solver has been demonstrated to scale extremely well to very large parallel machines, and is the state-of-the-art HPC solution. Both solvers are written in C++, and we compile them with Intel MPI 2018 and g++ 7.8 for execution. The MPI codes are run on the same cluster as Spark, however, we use InfiniBand interconnect instead of GbE. Because both MPI solvers assume that processors are organized into a square grid in our tests we use .

From Table 3 and Figure 5 we can see that optimized DC-MPI solver significantly outperforms ( on cores) Spark-based methods. While this result is not surprising, it clearly highlights importance of good parallel algorithm. Specifically, for naive FW-2D-MPI is outperformed by our Spark solvers, which is explained by communication overheads in FW-2D-MPI that grow with (due to broadcast) and (due to iterative steps). At this point we should recall that 2D Floyd-Warshall is completely infeasible in Spark. This is a nice example showing that good runtime system is required but not sufficient to obtain good scalability. In our tests, we checked also impact of the interconnect on MPI-based solvers. When enforcing communication over GbE we observed 15%-20% performance degradation, which still makes DC-MPI far superior solution.

6. Conclusion

In 2017, Internet circulated widely discussed blog post by Jonathan Dursi “HPC is dying, and MPI is killing it” (Dursi, 2017). One of the messages in the post is that old standards, namely MPI, should and must be replaced by modern APIs like Spark, since MPI offers wrong level of abstraction, hindering programmers’ productivity. In this paper, we demonstrate that indeed by separating core computational functionality from the data and tasks distribution problem, we can implement APSP in pySpark in just few compact lines of code (including usual boilerplate). However, while our optimized implementation is able to handle fairly large graphs in acceptable time limits, it is not fault-tolerant and is significantly outperformed by the state-of-the-art MPI solver. This is because programmers’ productivity comes at the cost of limited functionality, which in APSP leads to implementations involving large and frequent data movements. While we mitigate this effect by coupling Spark with data broadcast through persistent storage, the resulting solution is far from elegant, and is clearly not fault-tolerant.

7. Acknowledgments

The authors would like to acknowledge support provided by the Center for Computational Research at the University at Buffalo.

References

  • (1)
  • Spa (2019) 2019. Tuning Spark. https://spark.apache.org/docs/latest/tuning.html.
  • Abrahams and Grosse-Kunstleve (2003) D. Abrahams and R.W. Grosse-Kunstleve. 2003. Building Hybrid Systems with Boost.Python. C/C++ Users Journal 21, 7 (2003), 29–36.
  • Bernstein et al. (2000) M. Bernstein, V. de Silva, J.C. Langford, and J.B. Tenenbaum. 2000. Graph Approximations to Geodesics on Embedded Manifolds.
  • Borgwardt and Kriegel (2005) K.M. Borgwardt and H. Kriegel. 2005. Shortest-Path Kernels on Graphs. In IEEE International Conference on Data Mining (ICDM). 74–81.
  • Cormen et al. (2009) T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. 2009. Introduction to Algorithms. MIT press.
  • Dave et al. (2016) A. Dave, A. Jindal, L.E. Li, R. Xin, J. Gonzalez, and M. Zaharia. 2016. GraphFrames: An Integrated API for Mixing Graph and Relational Queries. In International Workshop on Graph Data Management Experiences and Systems. 2.
  • Djidjev et al. (2014) H. Djidjev, S. Thulasidasan, G. Chapuis, R. Andonov, and D. Lavenier. 2014. Efficient Multi-GPU Computation of All-Pairs Shortest Paths. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 360–369.
  • Dursi (2017) J. Dursi. 2017. HPC is dying, and MPI is killing it.
    https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html.
  • Grama et al. (2003) A. Grama, V. Karypis, G. Kumar, and A. Gupta. 2003. Introduction to Parallel Computing. Pearson.
  • Jones et al. (2014) E. Jones, T. Oliphant, and P. Peterson. 2014. SciPy: Open Source Scientific Tools for Python. http://www.scipy.org/.
  • Katz and Kider Jr (2008) G.J. Katz and J.T. Kider Jr. 2008. All-Pairs Shortest-Paths for Large Graphs on the GPU. In ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware. 47–55.
  • Kepner et al. (2016) J. Kepner, P. Aaltonen, D. Bader, A. Buluc, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, et al. 2016. Mathematical Foundations of the GraphBLAS. arXiv:1606.05790 (2016).
  • Kepner and Gilbert (2011) J. Kepner and J. Gilbert. 2011. Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics.
  • Lam et al. (2015) S.K. Lam, A. Pitrou, and S. Seibert. 2015. Numba: A LLVM-based Python JIT Compiler. In Workshop on the LLVM Compiler Infrastructure in HPC. 7.
  • Malewicz et al. (2010) G. Malewicz, M.H. Austern, A.J.C Bik, J.C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing. In ACM SIGMOD International Conference on Management of Data. 135–146.
  • Quirin et al. (2008) A. Quirin, O. Cordon, J. Santamaria, B. Vargas-Quesada, and F. Moya-Anegon. 2008. A New Variant of the Pathfinder Algorithm to Generate Large Visual Science Maps in Cubic Time. Information Processing and Management 44, 4 (2008), 1611–1623.
  • Schoeneman and Zola (2018) F. Schoeneman and J. Zola. 2018. Scalable Manifold Learning for Big Data with Apache Spark. In IEEE International Conference on Big Data.
  • Seidel (1995) R. Seidel. 1995. On the All-Pairs Shortest-Path Problem in Unweighted Undirected Graphs. J. Comput. System Sci. 51, 3 (1995), 400–403.
  • Shoshan and Zwick (1999) A. Shoshan and U. Zwick. 1999. All-Pairs Shortest-Paths in Undirected Graphs with Integer Weights. In Annual Symposium on Foundations of Computer Science. 605–614.
  • Solomonik et al. (2013) E. Solomonik, A. Buluc, and J. Demmel. 2013. Minimizing Communication in All-Pairs Shortest Paths. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 548–559.
  • Talwalkar et al. (2008) A. Talwalkar, S. Kumar, and H. Rowley. 2008. Large-scale Manifold Learning. In

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    . 1–8.
  • Tenenbaum et al. (2000) J.B. Tenenbaum, V. de Silva, and J.C. Langford. 2000. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 5500 (2000), 2319.
  • Ullman and Yannakakis (1991) J.D. Ullman and M. Yannakakis. 1991. The Input/Output Complexity of Transitive Closure.

    Annals of Mathematics and Artificial Intelligence

    3, 2-4 (1991), 331–360.
  • Venkataraman et al. (2003) G. Venkataraman, S. Sahni, and S. Mukhopadhyaya. 2003. A Blocked All-pairs Shortest-paths Algorithm. Journal of Experimental Algorithmics 8 (2003).
  • Xin et al. (2015) R. Xin, M. Armbrust, and D. Liu. 2015.

    Introducing DataFrames in Apache Spark for Large Scale Data Science.

    https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.
  • Xin et al. (2013) R.S. Xin, J.E. Gonzalez, M.J. Franklin, and I. Stoica. 2013. GraphX: A Resilient Distributed Graph System on Spark. In International Workshop on Graph Data Management Experiences and Systems. 2.
  • Zaharia et al. (2012) M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, and I. Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In USENIX Conference on Networked Systems Design and Implementation (NSDI).
  • Zwick (1998) U. Zwick. 1998. All-Pairs Shortest-Paths in Weighted Directed Graphs–Exact and Almost Exact Algorithms. In Annual Symposium on Foundations of Computer Science. 310–319.