Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers

03/20/2020
by   Dan Alistarh, et al.

Several classic problems in graph processing and computational geometry are solved via incremental algorithms, which split computation into a series of small tasks acting on shared state, which gets updated progressively. While the sequential variant of such algorithms usually specifies a fixed (but sometimes random) order in which the tasks should be performed, a standard approach to parallelizing such algorithms is to relax this constraint to allow for out-of-order parallel execution. This is the case for parallel implementations of Dijkstra's single-source shortest-paths algorithm (SSSP), and for parallel Delaunay mesh triangulation. While many software frameworks parallelize incremental computation in this way, it is not yet well understood whether this relaxed ordering approach can still provide any complexity guarantees. In this paper, we address this problem, and analyze the efficiency guarantees provided by a range of incremental algorithms when parallelized via relaxed schedulers. We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of k, in terms of the maximum priority inversion allowed, will introduce a maximum amount of wasted work of O(log(n) poly(k)), where n is the number of tasks to be executed. For SSSP, we show that the additional work is O(poly(k) d_max / w_min), where d_max is the maximum distance between two nodes, and w_min is the minimum such distance. In practical settings where n ≫ k, this suggests that the overheads of relaxation will be outweighed by the improved scalability of the relaxed scheduler. On the negative side, we provide lower bounds showing that certain algorithms will inherently incur a non-trivial amount of wasted work due to scheduler relaxation, even for relatively benign relaxed schedulers.

1 Introduction

Several classic problems in graph processing and computational geometry can be solved incrementally: algorithms are structured as a series of tasks, each of which examines a subset of the algorithm state, performs some computation, and then updates the state. For instance, in Dijkstra's classic single-source shortest paths (SSSP) algorithm [18], the state consists of the current distance estimates for each node in the graph, and each task corresponds to a node "relaxation," which may update the distance estimates of the node's neighbors. In the classic sequential variant, the order in which tasks get executed is dictated by the sequence of node distances. At the same time, many other incremental algorithms, such as Delaunay mesh triangulation, assume arbitrary (or random) orders on the tasks to be executed, and can even provide efficiency guarantees under such orderings [10].

A significant amount of attention has been given to parallelizing such incremental iterative algorithms, e.g. [20, 28, 10, 16, 17]. One approach has been to study the dependence structure of such algorithms, proving that, in many cases, the dependency chains are shallow. This can be intuitively interpreted as proving that such algorithms should have significant levels of parallelism. One way to exploit this fine-grained parallelism, e.g. [9, 33], has been to carefully split the execution into task prefixes of limited length, and to parallelize the execution of each prefix efficiently. While this approach can be efficient, it does require an understanding of the workload and task structure, and may not be immediately applicable to algorithms where the task ordering depends on the input.

An alternative approach has been to employ scalable data structures which only ensure a relaxed priority order to schedule task-based programs. The idea can be traced back to Karp and Zhang [24], who studied parallel backtracking in the PRAM model, and noticed that, in some cases, the scheduler can relax the strict order induced by the sequential algorithm, allowing tasks to be processed speculatively ahead of their dependencies, without loss of correctness. For instance, when parallelizing SSSP, e.g. [5, 30, 28], the scheduler may retrieve vertices in arbitrary order without breaking correctness, as the distance at each vertex is guaranteed to eventually converge to the minimum. However, there is intuitively a trade-off between the performance gains arising from using scalable relaxed schedulers, and the loss of determinism and the possible wasted work due to having to re-execute speculative tasks.

This approach is quite popular in practice, as several efficient relaxed schedulers have been proposed [32, 6, 35, 5, 21, 28, 29, 4, 31], which can attain state-of-the-art results for graph processing and machine learning [28, 20], and even have hardware implementations [23]. At the same time, despite showing good empirical performance, this approach does not come with analytical bounds: in particular, for most known algorithms, it is not clear how the relaxation factor of the scheduler affects the total work performed by the parallel algorithm.

We address this question in this paper. Roughly, we show under analytic assumptions that, for a set of fundamental algorithms including parallel Dijkstra’s and Delaunay mesh triangulation, the extra work engendered by scheduler relaxation can be negligible with respect to the total number of tasks executed by the sequential algorithm. On the negative side, we show that relaxation does not come for free: we can construct worst-case instances where the cost of relaxation is asymptotically non-negligible, even for relatively benign relaxed schedulers.

We model the relaxed execution of incremental algorithms as follows. The algorithm is specified as an ordered sequence of tasks, which may or may not have precedence constraints. The algorithm’s execution is modeled as an interaction between a processor, which can execute tasks, modify state, and possibly create new tasks, and a scheduler, which stores the tasks in a priority order specified by the algorithm. At each step, the processor requests a new task from the scheduler, examines whether the task can be processed (i.e., that all precedence constraints are satisfied), and then executes the task, possibly modifying the state and inserting new tasks as a consequence.

An exact scheduler would return tasks following priority order. Since ensuring such strict order semantics is known to lead to contention and poor performance [1], practical scalable schedulers often relax the priority order in which tasks are returned, up to some constraints. For generality, in this paper we assume, when proving performance upper bounds, that the scheduler may in fact be adversarial: actively trying to get the algorithm to waste steps, up to some natural rank-inversion and fairness constraints. Specifically, the two natural constraints we enforce on the scheduler are on 1) the maximum rank inversion between the highest-priority task present and the rank of the task returned, and on 2) fairness, in terms of the maximum number of schedule steps for which the task of highest priority may remain unscheduled. For convenience, we upper bound both these quantities by a parameter $k$, which we call the relaxation factor of the scheduler. Simply put, a $k$-relaxed scheduler must 1) return one of the $k$ highest-priority tasks in every step; and 2) return a task at the latest $k$ steps after it has become the highest-priority task available to the scheduler. We note that real schedulers enforce such constraints either deterministically [35] or with high probability [4, 2, 3].

A significant limitation of the above model is that it is sequential, as it assumes a single processor which may execute tasks. While our results will be developed in this simplified sequential model, we also discuss a parallel version of the model in Section 4.

It is natural to ask whether incremental algorithms can still provide any guarantees on total work performed under relaxed schedulers. Additional work may arise due to relaxation for two reasons. The first is if the parallel execution enforces ordering constraints between data-dependent tasks: for instance, when executing a graph algorithm, the task corresponding to a node may need to be processed before the task corresponding to any neighbor which has higher priority in the initial node ordering. A second cause for wasted work is if a task may need to be re-executed once the state is updated. This is the case when running parallel SSSP: due to relaxation, a node may be speculatively relaxed at a distance that is higher than its optimal distance from the source, leading it to be relaxed several times. We note that neither phenomenon occurs when the priority order is strict, since the top-priority task can have no pending precedence constraints and never needs to be re-executed, but both are inherent in relaxed parallel executions.

A trivial upper bound on the wasted work for an algorithm with total work $T$ under a $k$-relaxed scheduler would be $O(kT)$: intuitively, in the worst case the scheduler may return $k - 1$ other tasks before the top-priority one, which can always be executed without violating constraints. The key observation we make in this work is that, because of their local dependency structure, some popular incremental algorithms incur significantly less overhead due to out-of-order execution.

More precisely, for incremental algorithms such as Delaunay mesh triangulation and sorting by insertion, we show that the expected overhead of execution via a $k$-relaxed scheduler is $O(\mathrm{poly}(k) \log n)$, where $n$ is the number of tasks the sequential variant of the algorithm executes. We exploit the following properties of incremental algorithms, shown in [10]: the probability that the task at position $j$ is dependent on the task at position $i$ depends only on the tasks at positions $i$ and $j$, and, assuming a random order of tasks, this probability is upper bounded by $c / \max(i, j)$ for a constant $c$. While the technical details are not immediate, the argument boils down to bounding, for each top-priority task, the number of dependent tasks which may be returned by the scheduler while the task is still in the scheduler queue.

For SSSP, which does not have a dependency structure but may have to re-execute tasks, we use a slightly different approach, based on $\Delta$-stepping [27]. We bound the total overhead of relaxation by $O(\mathrm{poly}(k)\, d_{\max} / w_{\min})$, where $d_{\max}$ is the maximum distance from the source to any node, and $w_{\min}$ is the minimum edge weight. While this overhead may in theory be arbitrarily large, depending on the input, we note that for many graph models this overhead is small. (For example, for Erdős–Rényi graphs with constant weights, $d_{\max}/w_{\min} = O(\log n)$ with high probability, so the overhead is $O(\mathrm{poly}(k) \log n)$.)

It is interesting to interpret these overheads in the context of practical concurrent schedulers such as MultiQueues [29, 20], where the relaxation factor is proportional to the number of concurrent processors $p$, up to logarithmic factors. Since in most instances the number of tasks $n$ is significantly larger than the number of available processors $p$, the overhead of relaxation can be seen to be comparatively small. This observation has already been made empirically for specific instances, e.g. [25]; our analysis formalizes this observation in our model.

On the negative side, we also show that the overhead of relaxation is non-negligible in some instances. Specifically, we exhibit instances of incremental algorithms where the overhead of relaxed execution is $\Omega(\log n)$. Interestingly, this lower bound does not require the scheduler to be adversarial: we show that it holds even in the case of the relatively benign MultiQueue scheduler [29, 4].

Related Work.

Parallel scheduling of iterative algorithms is a vast area, so a complete survey is beyond our scope. We begin by noting that our results are not relevant to standard work-stealing schedulers [12, 11], since such schedulers do not provide any guarantees in terms of the rank of the elements removed. (We are aware of only one previous attempt to add priorities to work-stealing schedulers [22], using a multi-level global queue of tasks, partitioned by priority. This technique is different, and provides no work guarantees.)

An early instance of a relaxed scheduler appears in the work of Karp and Zhang [24], for parallelizing backtracking in the PRAM model. This area has recently become very active, and several relaxed scheduler designs have been proposed, trading off relaxation and scalability, e.g. [32, 6, 35, 5, 21, 28, 29, 4, 31]. In particular, software packages for graph processing [28] and machine learning [20] implement such relaxed schedulers.

Our work is related to the line of research by Blelloch et al. [7, 9, 8, 34, 10], as well as [14, 13, 19], which examines the dependency structure of a broad set of iterative/incremental algorithms and exploits their inherent parallelism for fast implementations. We benefit significantly from the analytic techniques introduced in this line of work.

We note, however, some important differences between these results and our work. The first difference concerns the scheduling model: references such as [9, 34, 10] assume a synchronous PRAM execution model, and focus on analyzing the maximum dependency length of algorithms under random task ordering, validating the results via concurrent implementations. By contrast, we employ a relaxed scheduling model, which models data processing systems based on relaxed priority schedulers, such as [28], and we provide work bounds for such executions. Although superficially similar, our analysis techniques are different from those of e.g. [9, 10], since our focus is not on the maximum dependency depth of the algorithms, but rather on the number of local dependencies which may be exploited by the adversarial scheduler to cause wasted work. We emphasize that the fact that the algorithms we consider may have low dependency depth does not necessarily help, since a sequential algorithm could have low dependency depth and still be executed inefficiently by a relaxed scheduler: a standard example is when the dependency depth is low (logarithmic), but each "level" in a breadth-first traversal of the dependency graph has high fanout. Such an instance has low depth, but would lead to high speculative overheads. (In practice, greedy graph coloring on a dense graph would lead to such an example.)

A second difference concerns the interaction between the scheduler and the algorithm. The scheduling mechanisms proposed in e.g. [9] assume knowledge about the algorithm structure, and in particular adapt the length of the prefix of tasks which can be scheduled at any given time to the structure of the algorithm. In contrast, we assume a basic scheduling model, which may even be adversarial (up to constraints), and show that such schedulers, which relax priority order for increased scalability, inherently provide bounded overheads in terms of wasted work due to relaxation.

Finally, we note that references such as [9, 10] focus on algorithms which are efficient under random orderings of the tasks. In the case of SSSP, we show that relaxed schedulers can efficiently execute algorithms which have a fixed optimal ordering.

Another related reference is [3], in which we introduced the scheduling model used in this paper, related it to MultiQueue schedulers [29], and analyzed the work complexity of simple iterative greedy algorithms such as maximal independent set and greedy graph coloring. We note that the technique introduced in that paper only covered a relatively limited set of iterative algorithms, where the set of tasks is defined and fixed in advance, and focused on the complexity of greedy maximal independent set (MIS) under relaxed scheduling. In contrast, here we consider more complex incremental algorithms, in which tasks can be added and modified dynamically. Moreover, as stated, here we also cover algorithms such as SSSP, in which the computation should follow a fixed sequential task ordering, as opposed to the random ordering which was the case for greedy MIS.

2 Relaxed Schedulers: The Sequential Model

We begin by formally introducing our sequential model of relaxed priority schedulers. We represent a priority scheduler as a relaxed ordered set data structure $Q_k$, where the integer parameter $k \geq 1$ is the relaxation factor. A relaxed priority scheduler $Q_k$ contains (task, label) pairs, where a lower label means a higher priority, and supports the following operations:

  1. IsEmpty(): returns true if $Q_k$ is empty, and false otherwise.

  2. Top(): returns a (task, label) pair if one is available, without deleting it.

  3. Remove(task): removes the specified task from the scheduler. This is used to remove a task returned by Top(), if applicable.

  4. Insert(task, label): inserts a new task-label pair into $Q_k$.

We denote the rank (in $Q_k$) of the task returned by the $t$-th Top() operation by $\mathrm{rank}(t)$, and call it the rank of the task returned on step $t$. For a task $u$, let $\mathrm{fail}(u)$ be the number of inversions experienced by task $u$ between the step when $u$ becomes the highest-priority task in $Q_k$ and the step when $u$ is returned by the scheduler. That is, $\mathrm{fail}(u)$ is the number of scheduler steps needed for the highest-priority task $u$ to be scheduled.

Rank and Fairness Properties.

The relaxed priority schedulers we consider will enforce the following properties:

  1. RankBound: at any time step $t$, $\mathrm{rank}(t) \leq k$.

  2. Fairness: for any task $u$, $\mathrm{fail}(u) \leq k$.

Priority schedulers such as the k-LSM [35] enforce these properties deterministically, where $k$ is a tunable parameter. We have shown in previous work that the MultiQueue [29] scheduler ensures these properties both in sequential and concurrent executions [4, 2], with parameter $k = O(m \log m)$, with failure probability exponentially low in $m$, the number of queues.
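To make this interface concrete, below is a minimal Python sketch of a $k$-relaxed scheduler. This is our own illustration, not an implementation from the literature: the class name and the uniform random choice among the top $k$ tasks are assumptions for demonstration. Top() picks among the $k$ lowest-label tasks (RankBound), and a counter forces the current top task out after $k$ inversions (Fairness), so both properties hold by construction.

```python
import random

class RelaxedScheduler:
    """Illustrative sketch of a k-relaxed scheduler: Top() returns one of
    the k lowest-label tasks, and the current highest-priority task is
    returned after at most k inversions."""

    def __init__(self, k):
        self.k = k
        self.labels = {}        # task -> label; lower label = higher priority
        self.top_task = None    # current highest-priority task
        self.inversions = 0     # inversions suffered by top_task so far

    def is_empty(self):
        return not self.labels

    def insert(self, task, label):
        self.labels[task] = label

    def remove(self, task):
        del self.labels[task]

    def top(self):
        best = min(self.labels, key=self.labels.get)
        if best != self.top_task:            # a new task became top
            self.top_task, self.inversions = best, 0
        if self.inversions >= self.k:        # Fairness forces the top out
            return best
        window = sorted(self.labels, key=self.labels.get)[: self.k]
        choice = random.choice(window)       # RankBound: rank(t) <= k
        if choice != best:
            self.inversions += 1
        return choice
```

Note that this sketch deliberately behaves adversarially within the allowed window, matching the worst-case model used in the upper bounds.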

Next, we describe how incremental algorithms can be implemented in this context.

3 Incremental Algorithms

3.1 General Definitions

We assume a model of incremental algorithms which execute a set of tasks iteratively, one by one, where each task incrementally updates the algorithm's state. For example, in incremental graph algorithms, the shared state corresponds to a data structure storing the graph nodes, edges, and the metadata corresponding to nodes. Tasks usually correspond to vertex operations, and are inserted and executed in some order given by the input. If this task order is random, we say that the incremental algorithm is randomized. We will consider both randomized incremental algorithms, where each task has a priority based on the random order, and deterministic ones, where the order is fixed. Using an exact scheduler corresponds to executing tasks in the same order as the sequential algorithm, while using a relaxed scheduler allows out-of-order execution of tasks.

Definition.

More formally, randomized incremental algorithms such as Delaunay triangulation and comparison sorting via BST insertion can be modelled as follows:

We are given $n$ tasks, which must be executed iteratively in some (possibly random) order. Initially, each task is assigned a unique label in $\{1, 2, \dots, n\}$. For instance, this label can be based on a random permutation $\pi$ of the given tasks: task $u$ gets label $i$ iff $\pi(u) = i$. A lower label can be equated with a higher priority. Each task performs some computation and updates the algorithm state. In the case of Delaunay triangulation, tasks update the triangle mesh, while in the case of comparison sorting, tasks modify the BST accordingly. Generic sequential pseudocode is given in Algorithm 1. We note that a similar generic algorithm was presented in [3] for parallelizing greedy iterative algorithms.

Data: Sequence of tasks $e_1, e_2, \dots, e_n$, in decreasing priority order.
// $Q$ is an exact priority queue.
Insert all tasks into $Q$, task $e_i$ with label $i$
for each step do
      // remove the task with highest priority.
      $e \leftarrow Q$.DeleteMin()
      Execute($e$)
      // stop if $Q$ is empty.
      if $Q$.IsEmpty() then
            break
      end if
end for
Algorithm 1: General framework for incremental algorithms.

When using a relaxed priority queue $Q_k$ instead of an exact priority queue $Q$, one issue is the presence of inter-task dependencies. These dependencies are specified by the algorithm, and are affected by the permutation of the tasks: for comparison sorting, a task depends on all of its ancestor tasks in the resulting BST, while for Delaunay triangulation there is a dependency between two tasks if, right before either one is added, their encroaching regions overlap in at least an edge of the mesh. (Due to space constraints, we assume the reader is familiar with terminology related to Delaunay mesh triangulation. We direct the reader to e.g. [10] for an overview of sequential and parallel algorithms for this problem.)

If task $v$ depends on task $u$ and $\mathrm{label}(u) < \mathrm{label}(v)$, then task $v$ cannot be processed before task $u$. We call task $u$ an ancestor of task $v$ in this case. We assume that the task returned by the relaxed scheduler can be processed only if all of its ancestors have already been processed. Pseudocode is given in Algorithm 2.

Data: Sequence of tasks $e_1, e_2, \dots, e_n$, in decreasing priority order.
// $Q_k$ is a relaxed priority queue.
for each step do
      // get a task of high priority from $Q_k$.
      $e \leftarrow Q_k$.Top()
      // check if $e$ has no unprocessed dependencies.
      if CanExecute($e$) then
            $Q_k$.Remove($e$)
            Execute($e$)
      end if
      // stop if $Q_k$ is empty.
      if $Q_k$.IsEmpty() then
            break
      end if
end for
Algorithm 2: General framework for executing incremental algorithms using relaxed priority schedulers.

Observe that the loop runs for exactly $n$ steps in the exact case, but it may require extra steps in the relaxed case. We are interested in upper bounding the number of extra steps, since this is a measure of the additional work incurred when executing via the relaxed priority queue. To do so, we need to specify some properties of the dependencies of the incremental algorithms we consider.
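To illustrate Algorithm 2 and the extra-step accounting, here is a minimal runnable Python driver. It is our own sketch, assuming the RelaxedScheduler class from Section 2 is in scope; the chain-dependency instance at the end is a hypothetical worst case which deliberately violates property (1) below, so it exhibits the trivial $O(kn)$ behavior rather than the $O(\mathrm{poly}(k) \log n)$ regime.

```python
def run_relaxed(tasks, ancestors, sched):
    """Run Algorithm 2: tasks is a list in priority order (index = label);
    ancestors maps a task to the set of tasks it depends on (all of which
    have lower labels); sched is a k-relaxed scheduler as sketched above.
    Returns (total_steps, extra_steps)."""
    for label, t in enumerate(tasks):
        sched.insert(t, label)
    done, steps, extra = set(), 0, 0
    while not sched.is_empty():
        steps += 1
        t = sched.top()
        if ancestors.get(t, set()) <= done:  # all ancestors processed
            sched.remove(t)
            done.add(t)                      # Execute(t) would update state here
        else:
            extra += 1                       # wasted step: t is blocked
    return steps, extra

# Hypothetical worst-case instance: a chain where task i depends on task i - 1.
n, k = 1000, 8
chain = {i: {i - 1} for i in range(1, n)}
print(run_relaxed(list(range(n)), chain, RelaxedScheduler(k)))
```

Each loop iteration either processes one task or wastes one step, so total_steps equals $n$ plus the number of extra steps, exactly the quantity bounded in the analysis below.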

Denote by $P(i, j)$ the probability that the task with label $j$ depends on the task with label $i < j$. We require the incremental algorithms to have the following properties:

  1. $P(i, j) \leq c / \max(i, j)$ for each pair of task indices $i < j$, where $c$ is a large enough constant which depends on the incremental algorithm.

  2. for each pair $i < j$, $P(i, j)$ depends only on the tasks with labels $i$ and $j$.

The fact that comparison sorting and Delaunay triangulation have the above properties has already been shown in [10]. More precisely, for comparison sorting, these properties are proved in [10, Section 3]. In the case of Delaunay triangulation, property (2) is shown in the same paper [10, Section 4], while property (1) follows from [10, Theorem 4.2].
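Property (1) is also easy to probe empirically for BST insertion. The following self-contained Python sketch (ours, not from [10]) estimates the probability that the task inserted at position $i$ is a BST ancestor of the task inserted at position $j$ under a uniformly random insertion order; the estimates decay roughly like $c / \max(i, j)$, in line with property (1).

```python
import random

def bst_ancestor(i, j, order):
    """True iff the i-th inserted key is an ancestor of the j-th inserted key
    (1-indexed insertion positions, i < j) in the BST built by inserting the
    keys in `order` sequentially."""
    key_i, key_j = order[i - 1], order[j - 1]
    lo, hi = min(key_i, key_j), max(key_i, key_j)
    # Standard fact: key_i is an ancestor of key_j iff, among the keys
    # inserted up to step j whose values lie in [lo, hi], key_i came first.
    between = [t for t in range(j) if lo <= order[t] <= hi]
    return min(between) == i - 1

def estimate(n, i, j, trials=10000):
    hits = 0
    for _ in range(trials):
        order = random.sample(range(n), n)   # uniformly random insertion order
        hits += bst_ancestor(i, j, order)
    return hits / trials

# P(i, j) decays roughly like c / max(i, j):
print(estimate(200, 5, 50), estimate(200, 5, 100))
```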

3.2 Analysis

In this subsection, we prove an upper bound on the number of extra steps required by our generic relaxed framework for executing incremental algorithms. As a first step, we will derive some additional properties of the relaxed scheduler.

Let $A_{ij}$ be the event that some task with label at least $j$ is returned by the scheduler before the task with label $i$ is processed by the incremental algorithm. Observe that if the scheduler returns the highest-priority task, then this task can always be processed by the incremental algorithm, since this task is guaranteed to have no unprocessed ancestors.

Lemma 3.1.

If $j - i > k^2 + k$, then $\Pr[A_{ij}] = 0$.

Proof.

For labels $i < j$, let $u$ be the task with label $i$ and let $v$ be some task with label at least $j$. Also, let $t$ be the earliest step at which the rank of $u$ is at most $k$. This means that at the time steps $1, \dots, t - 1$ we had $\mathrm{rank}(u) > k$, and by the rank property no tasks with labels larger than $i$ were scheduled at the time steps $1, \dots, t - 1$. Thus, we have that at time step $t$, $\mathrm{rank}(v) \geq j - i$. Because of the fairness property it takes at most $k$ steps to remove the task with the highest priority (lowest label), so the at most $k - 1$ tasks ahead of $u$ are removed within $k(k-1)$ steps, and task $u$ will be returned by the scheduler, and subsequently processed by the algorithm, no later than at time step $t + k^2$. The rank of $v$ can decrease by at most 1 after each step, thus at time step $t + k^2$, $\mathrm{rank}(v) \geq j - i - k^2 > k$. Hence, $v$ can be returned by the scheduler only after time step $t + k^2$, and this gives us that $\Pr[A_{ij}] = 0$. ∎

For any label $i$, let $F_i$ be the number of times the scheduler returns a task with label greater than $i$ (a given task can be counted multiple times) before the task with label $i$ is processed by the algorithm. The following holds:

Lemma 3.2.

For any label $i$, $F_i \leq k^2$.

Proof.

Let $u$ be the task with label $i$. Also, let $t$ be the earliest step at which the rank of $u$ is at most $k$.

This means that at the time steps $1, \dots, t - 1$ we had $\mathrm{rank}(u) > k$, and by the rank property no task with label greater than $i$ could be returned by the scheduler at the time steps $1, \dots, t - 1$. Because of the fairness property it takes at most $k$ steps to remove the task with the highest priority (lowest label), so task $u$ will be returned by the scheduler, and subsequently processed by the algorithm, no later than at time step $t + k^2$. Trivially, the total number of times some task with label greater than $i$ (or in fact any task other than $u$) can be returned by the scheduler during the time steps $t, \dots, t + k^2$ is at most $k^2$, which gives $F_i \leq k^2$. ∎

With the above lemmas in place, we can proceed to prove an upper bound on the number of extra steps.

Theorem 3.3.

The expected number of extra steps is upper bounded by $O(k^4 \log n) = O(\mathrm{poly}(k) \log n)$.

Proof.

Let $D_{ij}$ be the event that the task with label $j$ depends on the task with label $i < j$. From the properties of the incremental algorithms we consider, we get that $\Pr[D_{ij}] = P(i, j) \leq c / j$.

Recall that for $i < j$, $A_{ij}$ is the event that some task with label at least $j$ is returned by the relaxed scheduler before the task with label $i$ is processed by the algorithm. Observe that $D_{ij}$ and $A_{ij}$ are independent, since $D_{ij}$ depends only on the initial priorities of the tasks and does not depend on the relaxed scheduler. On the other hand, it is easy to see that $\Pr[D_{ij} \cap A_{ij}] = \Pr[D_{ij}] \cdot \Pr[A_{ij}]$.

Every extra step is caused by a task with an ancestor which is not processed. Let $v$ be the task we are not able to process because of dependencies, and let $v'$ be the highest-priority unprocessed ancestor of $v$. If $v'$ also has an unprocessed ancestor, we repeat the same step. Eventually we reach a pair of tasks $(u, w)$ such that $u$ is the highest-priority unprocessed ancestor of $w$ and all ancestors of $u$ are already processed. Let $i = \mathrm{label}(u)$ and $j = \mathrm{label}(w)$; we charge the extra step to the pair of labels $(i, j)$.

Note that the pair of labels $(i, j)$ can be charged only if $D_{ij}$ and $A_{ij}$ occur: $w$ depends on $u$, and the task returned by the scheduler at the wasted step has label at least $j$ while $u$ is still unprocessed. Let $B_{ij}$ be the event that $(i, j)$ is charged at least once; $B_{ij}$ happens only if both $D_{ij}$ and $A_{ij}$ happen, so $\Pr[B_{ij}] \leq \Pr[D_{ij}] \cdot \Pr[A_{ij}]$. Also, it is easy to see that the total number of times $(i, j)$ can be charged is upper bounded by $F_i \leq k^2$ (recall Lemma 3.2), and by Lemma 3.1 we have $\Pr[A_{ij}] = 0$ unless $j - i \leq k^2 + k$. Putting everything together, the expected number of extra steps is at most

$$\sum_{i=1}^{n} \sum_{j=i+1}^{i+k^2+k} k^2 \cdot \Pr[D_{ij}] \;\leq\; \sum_{i=1}^{n} \sum_{j=i+1}^{i+k^2+k} \frac{c\,k^2}{j} \;\leq\; c\,k^2 (k^2 + k) \sum_{j=1}^{n} \frac{1}{j} \;=\; O(k^4 \log n). \qquad (1)$$

∎

4 Relaxed Schedulers: The Transactional Model

We now consider an alternative model where tasks are executed concurrently, each as part of a (software or hardware) transaction. This is unlike our standard model, which is entirely sequential. It is important to note that the correspondence between the two models is not one-to-one, since in this concurrent model transactions may abort due to data conflicts. More precisely, we assume that the algorithm consists of $n$ tasks, each corresponding to some transaction. Transactions are scheduled by an entity called the transactional scheduler. Every task has a label, where a lower label corresponds to a higher priority. In the transactional model, unlike the sequential model, we assume that a transaction aborts if and only if it is executed concurrently with a transaction it depends on. In other words, dependencies create data conflicts for concurrent transactions, and conflicts are resolved in favor of the higher-priority transaction. Another crucial difference is that in the transactional model we assume an upper bound $C$ on the interval contention: each transaction can be concurrent with at most $C$ transactions in total (during one execution). This is needed because, if $T_1$ is the transaction with the highest priority and $T_2$ is the transaction with the second highest priority, and $T_2$ depends on $T_1$, then $T_1$ can cause $T_2$ to be aborted a large number of times, even in the case of an exact scheduler.
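As a toy illustration of the abort accounting (our own simplification, not the model analyzed below: it replaces the $k$-relaxed transactional scheduler with naive batches of size at most $C$), the following Python sketch counts aborts when conflicts are resolved in favor of the higher-priority transaction.

```python
import random

def simulate_aborts(n, C, depends):
    """depends(i, j) -> bool: does transaction j depend on transaction i (i < j)?
    Transactions labelled 0..n-1 run in batches of at most C; a transaction
    aborts iff a higher-priority transaction it depends on runs in the same
    batch, and aborted transactions are retried. Returns the abort count."""
    pending, aborts = list(range(n)), 0
    while pending:
        batch, pending = pending[:C], pending[C:]
        retry = [j for j in batch
                 if any(depends(i, j) for i in batch if i < j)]
        aborts += len(retry)
        pending = retry + pending   # retried transactions keep priority order
    return aborts

# Toy dependency structure with P(i, j) <= c / max(i, j), as in Section 3:
deps = {(i, j) for j in range(1, 1000) for i in range(j)
        if random.random() < 1.0 / j}
print(simulate_aborts(1000, 8, lambda i, j: (i, j) in deps))
```

The lowest-label transaction in each batch never aborts, mirroring the observation below that the highest-priority transaction always commits.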

Properties of the transactional scheduler

For a transaction $T$, let $\mathrm{fail}(T)$ be the number of transactions returned by the transactional scheduler after the point when $T$ becomes the highest-priority transaction available to the scheduler and before $T$ is returned by the scheduler. We require the transactional scheduler to have the following properties, which are similar to the properties in the sequential model:

  1. RankBound: the transaction with label $i$ is available to the transactional scheduler only after at least $i - k$ transactions with higher priority than it have been executed successfully.

  2. Fairness: for any transaction $T$, $\mathrm{fail}(T) \leq k$.

Next, we derive concurrent versions of the lemmas proved in the sequential setting. Let $A^c_{ij}$ be the event that some transaction with label at least $j$ is executed concurrently with the transaction with label $i$, or is returned by the transactional scheduler before the transaction with label $i$. Observe that if the scheduler returns the highest-priority transaction, then this transaction will never abort.

Lemma 4.1.

If $j - i > (k+1)(k+C) + 2k$, then $\Pr[A^c_{ij}] = 0$.

Proof.

Let $u$ be the transaction with label $i$ and let $v$ be a transaction with label at least $j$. Consider the first point when $u$ is available to the scheduler. Observe that at this point no transaction with label greater than $i + k$ is available to the scheduler, and by the RankBound property there are at most $k$ transactions with higher priority than $u$ which are left to be processed. By the Fairness property, at most $k$ transactions can be scheduled before the transaction with the highest priority is scheduled. Once the highest-priority transaction is scheduled, there can be at most $C$ transactions executing concurrently with it. Thus, the total number of transactions which were running at some point during the period between when $u$ became available to the scheduler and when $u$ was executed successfully is at most $(k+1)(k+C)$. We get that at the point when $u$ has finished its successful execution, $v$ is not yet available to the scheduler, since the total number of successfully executed transactions is at most $i + k + (k+1)(k+C) < j - k$, while $v$ becomes available only after at least $j - k$ transactions with higher priority have been executed successfully. Thus, $\Pr[A^c_{ij}] = 0$. ∎

Let $u$ be the task with label $i$. Let $F^c_i$ be the total number of times the scheduler returns a transaction with label greater than $i$ before it returns the transaction $u$, plus the number of transactions which are concurrent with $u$ at some point.

Lemma 4.2.

For any transaction $u$ with label $i$, $F^c_i \leq (k+1)(k+C)$.

Proof.

As in the proof of the previous lemma, we can show that the total number of transactions which were running at some point during the period between when $u$ became available to the scheduler and when $u$ was executed successfully is at most $(k+1)(k+C)$; this trivially gives us that $F^c_i \leq (k+1)(k+C)$. ∎

Now, we are ready to prove the following theorem:

Theorem 4.3.

The expected number of transactions aborted by an incremental algorithm is at most $O(\mathrm{poly}(k, C)\,\log n)$.

Proof.

Let $D_{ij}$ be the event that the transaction (task) with label $j$ depends on the transaction (task) with label $i < j$. Note that a transaction can cause the transaction with label $j$ to abort only if $D_{ij}$ and $A^c_{ij}$ occur. In the transactional model, we charge the aborted transaction to the transaction which caused the abort. Each transaction can be charged at most $F^c_i$ times; observe that this is a loose upper bound on the number of times a transaction can be charged, since a charge to a transaction can be caused only by a transaction concurrent with it, which gives a bound of $C$. With these properties in place, we can follow exactly the same steps as in the proof of Theorem 3.3 to show that the expected number of transactions aborted by an incremental algorithm is at most $O(\mathrm{poly}(k, C)\,\log n)$. ∎

5 Lower Bound on Wasted Work

In this section, we prove a lower bound on the cost of relaxation in terms of additional work. We emphasize that this argument does not require the scheduler to be adversarial: in fact, we will prove that a fairly benign relaxed priority scheduler, the MultiQueue [29], can cause incremental algorithms to incur wasted work.

More precisely, let $A_{i,i+1}$ be the event that the relaxed scheduler returns the task with label $i + 1$ before the task with label $i$. First, we will prove the following claim for a MultiQueue with $m \geq 2$ queues being used as the relaxed scheduler:

Claim 1.

For every $i$, $\Pr[A_{i,i+1}] \geq 1/8$.

Proof.

First, we describe how incremental algorithms work using the MultiQueue. The MultiQueue maintains $m$ sequential priority queues, where $m$ can be assumed to be a fixed parameter. As before, each task is assigned a label according to a random permutation of the input tasks (a lower label means a higher priority). Initially, all tasks are inserted into the MultiQueue as follows: for each task, we select one priority queue uniformly at random, and insert the task into it. To retrieve a task, the processor selects two priority queues uniformly at random and returns the task with the highest priority (lowest label) among the tasks at the top of the selected priority queues.
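For concreteness, here is a minimal Python rendering of the MultiQueue process just described (an illustrative sketch; real implementations add concurrency control and handle empty queues differently).

```python
import heapq, random

class MultiQueue:
    """Illustrative sequential MultiQueue: m heaps, random inserts, and
    delete-min over two randomly chosen non-empty queues."""

    def __init__(self, m):
        self.queues = [[] for _ in range(m)]

    def insert(self, task, label):
        # Insert into a uniformly random queue.
        heapq.heappush(random.choice(self.queues), (label, task))

    def delete_min(self):
        nonempty = [q for q in self.queues if q]
        if not nonempty:
            return None
        a, b = random.choice(nonempty), random.choice(nonempty)
        # Return the lower-label (higher-priority) of the two queue tops.
        return heapq.heappop(a if a[0] <= b[0] else b)[1]

mq = MultiQueue(4)
for label, task in enumerate("abcdef"):
    mq.insert(task, label)
print([mq.delete_min() for _ in range(6)])  # approximately in label order
```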

Let $u$ be the task with label $i$ and $v$ be the task with label $i + 1$. Additionally, let $q_u$ and $q_v$ be the queues into which $u$ and $v$ are initially inserted. Also, let $E$ be the event that $u$ and $v$ are the top tasks of queues $q_u$ and $q_v$ at some point during the run of our algorithm. We have that:

$$\Pr[A_{i,i+1}] \;\geq\; \Pr[A_{i,i+1} \mid E]\,\Pr[E] \;+\; \Pr[A_{i,i+1} \mid \overline{E},\, q_u \neq q_v]\,\Pr[\overline{E},\, q_u \neq q_v]. \qquad (2)$$

Observe that if $E$ does not occur and $q_u \neq q_v$, then tasks $u$ and $v$ are never compared against each other. Consider two runs of our algorithm until it returns either $u$ or $v$: first with the initially chosen $q_u$ and $q_v$, and second with $q_u$ and $q_v$ swapped (these cases have equal probability of occurring). Since $u$ and $v$ have consecutive labels and are never compared by the MultiQueue, all the comparison results are the same in both cases; hence the scheduler has equal probability of returning $u$ or $v$ first. (It is worth mentioning that $E$ only depends on the values $q_u$ and $q_v$ and does not depend on their ordering.)

This means that:

$$\Pr[A_{i,i+1} \mid \overline{E},\, q_u \neq q_v] = \frac{1}{2}. \qquad (6)$$

Now we look at the case where $u$ and $v$ are the top tasks of queues $q_u$ and $q_v$ at some step $t$. Let $R_u$ be the event that $u$ is returned by the MultiQueue and, similarly, let $R_v$ be the event that $v$ is returned. We need to lower bound the probability that $R_v$ happens before $R_u$. We can safely ignore all the other tasks returned by the scheduler and processed by the algorithm, since this is independent of whether $u$ or $v$ is returned first. Let $z$ be the number of top tasks in the queues which have labels larger than the label of $v$. At a given step, $v$ is returned exactly when $q_v$ is paired with one of those $z$ queues (or picked twice), while $u$ is returned exactly when $q_u$ is paired with one of the $z + 1$ queues whose top has a larger label than $u$ (or picked twice). So we have that, conditioned on one of $u$ or $v$ being returned at this step,

$$\Pr[R_v] \;\geq\; \frac{2z + 1}{(2z + 1) + (2z + 3)} \;=\; \frac{2z + 1}{4z + 4}. \qquad (7)$$

Observe that during the run of the algorithm $z$ may increase, but we will always have the invariant that $z \geq 0$. This means that the probability that $R_v$ happens before $R_u$ is at least:

$$\min_{z \geq 0} \frac{2z + 1}{4z + 4} \;=\; \frac{1}{4}. \qquad (8)$$

This gives us that:

$$\Pr[A_{i,i+1} \mid E] \;\geq\; \frac{1}{4}, \qquad (9)$$

and consequently, since $\Pr[q_u = q_v] = 1/m \leq 1/2$, we get that:

$$\Pr[A_{i,i+1}] \;\geq\; \frac{1}{4} \cdot \Pr[E] + \frac{1}{2} \cdot \Pr[\overline{E},\, q_u \neq q_v] \;\geq\; \frac{1}{4}\left(1 - \frac{1}{m}\right) \;\geq\; \frac{1}{8}. \qquad (10)$$

∎

Theorem 5.1.

For Delaunay triangulation and comparison sorting, the expected number of extra steps is lower bounded by $\Omega(\log n)$.

Proof.

To establish the lower bound, we can assume that if the scheduler returns a vertex $v$ which depends on some other unprocessed vertex, we only check whether the vertex with the preceding label, $\mathrm{label}(v) - 1$, is unprocessed, and we charge the pair $(\mathrm{label}(v) - 1, \mathrm{label}(v))$ if $v$ depends on it. This way, we get that $A_{i,i+1}$ and $D_{i,i+1}$ are not correlated, since if we run the algorithm to the point where the vertex with label $i$ or $i + 1$ is returned, it will never have checked the dependency between them.

We will employ the following property of Delaunay triangulation and BST-based comparison sorting: for any $i$, $\Pr[D_{i,i+1}] \geq c'/i$, for a constant $c' > 0$. This property is easy to verify: in Delaunay triangulation, there is at least $c'/i$ probability that the vertices with labels $i$ and $i+1$ are neighbours in the Delaunay triangulation of the vertices with labels $1, \dots, i+1$; in BST-based comparison sorting, there is at least $2/(i+1)$ probability that the tasks with labels $i$ and $i+1$ have consecutive keys among the keys of the tasks with labels $1, \dots, i+1$; and in both cases the task with label $i+1$ will depend on the task with label $i$ (see [10]).

This, in combination with Claim 1, gives us the lower bound on the number of extra steps, since if the task with label $i+1$ depends on the task with label $i$ and is returned first by the scheduler, this triggers at least one extra step, caused by not being able to process the task:

$$\mathbb{E}[\text{extra steps}] \;\geq\; \sum_{i=1}^{n-1} \Pr[D_{i,i+1}] \cdot \Pr[A_{i,i+1}] \;\geq\; \sum_{i=1}^{n-1} \frac{c'}{i} \cdot \frac{1}{8} \;=\; \Omega(\log n). \qquad (11)$$

∎

6 Analyzing SSSP under Relaxed Scheduling

Preliminaries.

Since the algorithm is different from the ones we considered thus far, we re-introduce some notation. We assume we are given a directed graph $G = (V, E)$ with a positive weight $w(e)$ for each edge $e$, and a source vertex $s$. For each vertex $v$, let $\mathrm{dist}(v)$ be the weight of a shortest path from $s$ to $v$. Additionally, let $d_{\max} = \max_{v} \mathrm{dist}(v)$ and $w_{\min} = \min_{e} w(e)$.

We consider the sequential pseudocode from Algorithm 3, which uses a relaxed priority queue $Q_k$ to find shortest paths from $s$ via a procedure similar to the $\Delta$-stepping algorithm [27].

In this algorithm, Insert($v$, $d$) inserts a vertex $v$ with distance $d$ into $Q_k$; DeleteMin() removes and returns a vertex-distance pair $(v, d)$ such that $v$ is among the $k$ smallest-distance vertices in $Q_k$. We also assume that $Q_k$ supports a DecreaseKey($v$, $d$) operation, which atomically decreases the distance of vertex $v$ in $Q_k$ to $d$.

Data: Graph $G = (V, E)$, source vertex $s$.
Initially empty relaxed priority queue $Q_k$.
Array $\mathrm{tent}(\cdot)$ for tentative distances.
for each vertex $v \in V$ do
      $\mathrm{tent}(v) \leftarrow +\infty$
end for
$\mathrm{tent}(s) \leftarrow 0$
$Q_k$.Insert($s$, $0$)
while $\neg Q_k$.IsEmpty() do
      $(v, d) \leftarrow Q_k$.DeleteMin()
      if $d > \mathrm{tent}(v)$ then
            continue // $(v, d)$ is outdated
      end if
      for each edge $(v, u) \in E$ do
            $d' \leftarrow \mathrm{tent}(v) + w(v, u)$
            if $d' < \mathrm{tent}(u)$ then
                  $\mathrm{tent}(u) \leftarrow d'$
                  // We assume that we can check whether $u$ is in
                  // $Q_k$; this can be implemented by maintaining
                  // the corresponding flag for each vertex.
                  if $u \in Q_k$ then
                        $Q_k$.DecreaseKey($u$, $d'$)
                  else
                        $Q_k$.Insert($u$, $d'$)
                  end if
            end if
      end for
end while
Algorithm 3: SSSP algorithm based on a relaxed priority queue.
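The following runnable Python sketch mirrors Algorithm 3 on a MultiQueue-style relaxed queue. Note one deliberate simplification, which is our own and not the paper's assumption: instead of the DecreaseKey operation assumed by Algorithm 3, it re-inserts a vertex on every distance improvement and filters outdated pairs on removal, which only adds stale DeleteMin operations.

```python
import heapq, random

def relaxed_sssp(adj, s, m=4):
    """adj: dict vertex -> list of (neighbor, weight); s: source vertex.
    Returns (tent, pops): final distances and the number of DeleteMin calls.
    A MultiQueue with m heaps stands in for Q_k; lazy re-insertion replaces
    the DecreaseKey assumed by Algorithm 3."""
    queues = [[] for _ in range(m)]
    tent = {v: float("inf") for v in adj}
    tent[s] = 0
    heapq.heappush(queues[random.randrange(m)], (0, s))
    pops = 0
    while any(queues):
        nonempty = [q for q in queues if q]
        a, b = random.choice(nonempty), random.choice(nonempty)
        d, v = heapq.heappop(a if a[0] <= b[0] else b)
        pops += 1
        if d > tent[v]:
            continue                  # outdated pair, skip (cf. Algorithm 3)
        for u, w in adj[v]:
            if tent[v] + w < tent[u]:
                tent[u] = tent[v] + w  # relax edge (v, u)
                heapq.heappush(queues[random.randrange(m)], (tent[u], u))
    return tent, pops

# Toy usage on a 3-vertex graph:
g = {0: [(1, 2.0), (2, 5.0)], 1: [(2, 1.0)], 2: []}
print(relaxed_sssp(g, 0))  # distances {0: 0, 1: 2.0, 2: 3.0}, plus pop count
```

Comparing pops against the number of pops under an exact priority queue gives exactly the relaxation overhead measured in Section 7.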

Analysis.

We will prove the following statement, which upper bounds the extra work incurred by the relaxed scheduler:

Theorem 6.1.

The number of DeleteMin operations performed by Algorithm 3 is $O(n + k^2\, d_{\max} / w_{\min})$.

Proof.

Our analysis will follow the general pattern of the $\Delta$-stepping analysis. We will partition the vertex set into buckets, based on distance: vertex $v$ belongs to bucket $B_i$ iff $\mathrm{dist}(v) \in [\,i\, w_{\min},\, (i+1)\, w_{\min})$. Let $D = d_{\max} / w_{\min}$ be the total number of buckets we need (for simplicity, we assume that $d_{\max} / w_{\min}$ is an integer).

Observe that, because of the way we defined the buckets, we have the following property, which we will call the bucket property: for any vertex $v$, no shortest path from $s$ to $v$ contains two distinct vertices which belong to the same bucket.

We say that Algorithm 3 processes vertex $v$ at the correct distance if DeleteMin returns $(v, \mathrm{dist}(v))$; this means that at this point $\mathrm{tent}(v) = \mathrm{dist}(v)$ and we relax the outgoing edges of $v$. (See Algorithm 3 for clarification.)

We fix $i$ and look at what happens after Algorithm 3 processes all vertices in the buckets $B_0, \dots, B_{i-1}$ at the correct distance. Because of the bucket property, every $v \in B_i$ has a shortest-path predecessor in an earlier bucket, which has been processed at the correct distance, so $\mathrm{tent}(v) = \mathrm{dist}(v)$: the vertices from bucket $B_i$ are either ready to be processed at the correct distance, or are already processed at the correct distance. To avoid the second case, we also assume that if DeleteMin returns $(v, \mathrm{dist}(v))$, where $v \in B_i$ and not all vertices in the buckets $B_0, \dots, B_{i-1}$ are processed at the correct distance, then this operation still counts towards the total number of DeleteMin operations, but it does not actually remove the task and does not perform edge relaxations, even though $v$ is ready to be processed at the correct distance. This assumption only increases the total number of DeleteMin operations, so to prove the claim it suffices to derive an upper bound in this pessimistic case.

Once the algorithm processes the vertices in the buckets $B_0, \dots, B_{i-1}$ at the correct distances, we know that the only vertices with tentative distance less than $(i+1)\, w_{\min}$ are the vertices in bucket $B_i$. (Note that this statement would not hold if we did not have the DecreaseKey operation: if we inserted multiple copies of vertices in $Q_k$ with different distances, as in some versions of Dijkstra, there might exist outdated copies of a vertex $v$, even though $v$ was already processed at the correct distance.) This means that, at this point, the top vertices in $Q_k$ (the vertices with the smallest distance estimates) belong to $B_i$.

Next, we bound how many DeleteMin operations are needed to process the vertices in $B_i$, after all vertices in the buckets $B_0, \dots, B_{i-1}$ are processed. If $|B_i| > k$, using the rank property, we have that the first $|B_i| - k$ DeleteMin operations all process vertices in $B_i$, since as long as more than $k$ vertices of $B_i$ remain, every returned vertex must be one of them. Once at most $k$ vertices of $B_i$ remain, we know that it will take at most $k^2$ further operations to process all of them, since, by the fairness bound, the number of operations needed to return the top vertex (the one with the smallest tentative distance) is at most $k$, and we showed that the top vertex belongs to $B_i$ until all vertices in $B_i$ are processed. By combining these two cases, we get that the number of DeleteMin operations needed to process the vertices in $B_i$ at the correct distance is at most $|B_i| + k^2$.

Thus, the number of DeleteMin operations performed by Algorithm 3 in total is at most:

$$\sum_{i=0}^{D-1} \left(|B_i| + k^2\right) \;\leq\; n + k^2 D \;=\; O\!\left(n + k^2\, \frac{d_{\max}}{w_{\min}}\right), \qquad (12)$$

as claimed. ∎

Discussion.

A clear limitation is that the bound depends on the maximum distance $d_{\max}$, and on the minimum weight $w_{\min}$. Hence, this bound is relevant mainly for low-diameter graphs with bounded edge weights. We note, however, that this case is relatively common: for instance, [16] considers weighted graph models of low diameter, where weights are chosen in a fixed bounded interval. These assumptions appear to hold for many publicly available weighted graphs [26]. Further, our argument assumes a relaxed scheduler supporting DecreaseKey operations. This operation can be supported by schedulers such as the SprayList [5] or MultiQueues [29, 4], where elements are hashed consistently into the priority queues.

7 Experiments

Figure 1: Overheads (left) and speedups (right) for parallel SSSP Dijkstra’s algorithm executed via a MultiQueue relaxed scheduler on random, road network, and social network graphs. The overhead is measured as the ratio between the number of tasks executed via a relaxed scheduler versus an exact one.
Figure 2: Relaxation overheads versus relaxation factor/queue multiplier for parallel SSSP Dijkstra's algorithm. The number of queues is the multiplier (x axis) times the number of threads, and is proportional to the average relaxation factor of the queue [4].

We implemented the parallel SSSP Dijkstra's algorithm described in Section 6, using an instance of the MultiQueue relaxed priority scheduler [29, 4]. In the classic sequential algorithm, nodes are processed sequentially, while in this parallel version a node can be processed several times, due to out-of-order execution. In our experiments, we are interested in the total number of tasks processed by the concurrent variant, in order to examine the overhead of relaxation in concurrent executions. In addition, we also measure execution times for an increasing number of threads. Overhead is measured as the average number of tasks executed in a concurrent execution divided by the number of tasks executed in a sequential execution using an exact scheduler.

Sample graphs.

We use the following list of graphs in our experiments:

  • A random undirected graph with uniform random edge weights (random);

  • The USA road network graph, with physical distances as edge lengths (road) [15];

  • The LiveJournal social network friendship graph, with uniform random edge weights (social) [26].

Platforms.

We evaluated the experiment on a server with 4 Intel Xeon Gold 6150 (Skylake) sockets. Each socket has 18 cores running at 2.70 GHz, each of which multiplexes 2 hardware threads, for a total of 144 hardware threads. In addition, we ran the experiment on a Google Cloud Platform VM supporting up to 96 hardware threads.

Experimental results.

The experimental results are summarized in Figure 1. On the left column, notice that, on both machines, the overheads of relaxation are almost negligible: for the random graph and the social network, the overhead ratio stays close to 1 at all thread counts, which in practice means the absence of extra work. (Recall that the number of queues is always twice the number of threads, so the relaxation factor increases with the thread count.)

The road network incurs noticeably higher overheads at high thread counts (144 threads / 288 queues). This can be explained by the higher diameter of the road network relative to the LiveJournal and random graphs, and by the higher variance in edge costs for the road network. In terms of speedup (right), our implementation scales well for 1-2 sockets on our local server, after which NUMA effects become prevalent. NUMA effects are less prevalent on the Google Cloud machine, but the maximum speedup there is also more limited. In Figure 2, we examine the relaxation overhead (in terms of the number of extra tasks executed) versus the relaxation factor. While we cannot control the relaxation factor exactly, we know that the average value of this factor is proportional to the number of queues allocated, which is the number of threads (fixed for each sub-plot) times the multiplier for the number of queues (the x axis) [4]. We notice that these overheads are only non-negligible for the road network graph. On the one hand, this suggests that our worst-case analysis is not tight; on the other hand, it can also be interpreted as showing that the overheads of relaxation do become apparent on dense, high-diameter graphs such as road networks.

8 Conclusion

We have provided the first efficiency bounds for parallel implementations of SSSP and Delaunay mesh triangulation under relaxed schedulers. In a nutshell, our results show that, for some inputs and under analytic assumptions, the overheads of parallelizing these algorithms via relaxed schedulers can be negligible. Our findings complement empirical results showing similar trends in the context of high-performance relaxed schedulers [28, 25]. While our analysis was specialized to these algorithms, we believe that our techniques can be generalized to other iterative algorithms, which we leave as future work.

Acknowledgments

We would like to thank Ekaterina Goltsova, Charles E. Leiserson, Tao B. Schardl, and Matthew Kilgore for useful discussions in the incipient stages of this work, and Justin Kopinsky for careful proofreading and insightful suggestions on an earlier draft.

Giorgi Nadiradze was supported by the Swiss National Fund Ambizione Project PZ00P2 161375. Dan Alistarh was supported by European Research Council funding award PR1042ERC01.

References

  • [1] Dan Alistarh, James Aspnes, Keren Censor-Hillel, Seth Gilbert, and Rachid Guerraoui. Tight bounds for asynchronous renaming. J. ACM, 61(3):18:1–18:51, June 2014.
  • [2] Dan Alistarh, Trevor Brown, Justin Kopinsky, Jerry Z. Li, and Giorgi Nadiradze. Distributionally linearizable data structures. In Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures, SPAA ’18, pages 133–142, New York, NY, USA, 2018. ACM.
  • [3] Dan Alistarh, Trevor Brown, Justin Kopinsky, and Giorgi Nadiradze. Relaxed schedulers can efficiently parallelize iterative algorithms. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC ’18, pages 377–386, New York, NY, USA, 2018. ACM.
  • [4] Dan Alistarh, Justin Kopinsky, Jerry Li, and Giorgi Nadiradze. The power of choice in priority scheduling. In Elad Michael Schiller and Alexander A. Schwarzmann, editors, Proceedings of the ACM Symposium on Principles of Distributed Computing, PODC 2017, Washington, DC, USA, July 25-27, 2017, pages 283–292. ACM, 2017.
  • [5] Dan Alistarh, Justin Kopinsky, Jerry Li, and Nir Shavit. The spraylist: A scalable relaxed priority queue. In 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, San Francisco, CA, USA, 2015. ACM.
  • [6] Dmitry Basin, Rui Fan, Idit Keidar, Ofer Kiselov, and Dmitri Perelman. CAFÉ: Scalable task pools with adjustable fairness and contention. In Proceedings of the 25th International Conference on Distributed Computing, DISC'11, pages 475–488, Berlin, Heidelberg, 2011. Springer-Verlag.
  • [7] Guy E Blelloch. Some sequential algorithms are almost always parallel. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA, pages 24–26, 2017.
  • [8] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally deterministic parallel algorithms can be fast. In J. Ramanujam and P. Sadayappan, editors, Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, February 25-29, 2012, pages 181–192. ACM, 2012.
  • [9] Guy E Blelloch, Jeremy T Fineman, and Julian Shun. Greedy sequential maximal independent set and matching are parallel on average. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures, pages 308–317. ACM, 2012.
  • [10] Guy E Blelloch, Yan Gu, Julian Shun, and Yihan Sun. Parallelism in randomized incremental algorithms. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, pages 467–478. ACM, 2016.
  • [11] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of parallel and distributed computing, 37(1):55–69, 1996.
  • [12] Robert D Blumofe and Charles E Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), 46(5):720–748, 1999.
  • [13] Neil Calkin and Alan Frieze. Probabilistic analysis of a parallel algorithm for finding maximal independent sets. Random Structures & Algorithms, 1(1):39–50, 1990.
  • [14] Don Coppersmith, Prabhakar Raghavan, and Martin Tompa. Parallel graph algorithms that are efficient on average. In 28th Annual Symposium on Foundations of Computer Science (FOCS 1987), pages 260–269. IEEE, 1987.
  • [15] Camil Demetrescu, Andrew V Goldberg, and David S Johnson. The Shortest Path Problem: Ninth DIMACS Implementation Challenge, volume 74. American Mathematical Soc., 2009.
  • [16] Laxman Dhulipala, Guy Blelloch, and Julian Shun. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’17, pages 293–304, New York, NY, USA, 2017. ACM.
  • [17] Laxman Dhulipala, Guy E Blelloch, and Julian Shun. Theoretically efficient parallel graph algorithms can be fast and scalable. arXiv preprint arXiv:1805.05208, 2018.
  • [18] Edsger W Dijkstra. A note on two problems in connexion with graphs. Numerische mathematik, 1(1):269–271, 1959.
  • [19] Manuela Fischer and Andreas Noever. Tight analysis of parallel randomized greedy MIS. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2152–2160. SIAM, 2018.
  • [20] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Chandu Thekkath and Amin Vahdat, editors, 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, pages 17–30. USENIX Association, 2012.
  • [21] Andreas Haas, Michael Lippautz, Thomas A. Henzinger, Hannes Payer, Ana Sokolova, Christoph M. Kirsch, and Ali Sezgin. Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation. In Hubertus Franke, Alexander Heinecke, Krishna V. Palem, and Eli Upfal, editors, Computing Frontiers Conference, CF’13, Ischia, Italy, May 14 - 16, 2013, pages 17:1–17:9. ACM, 2013.
  • [22] Shams Imam and Vivek Sarkar. Load balancing prioritized tasks via work-stealing. In European Conference on Parallel Processing, pages 222–234. Springer, 2015.
  • [23] Mark C Jeffrey, Suvinay Subramanian, Cong Yan, Joel Emer, and Daniel Sanchez. Unlocking ordered parallelism with the swarm architecture. IEEE Micro, 36(3):105–117, 2016.
  • [24] R. M. Karp and Y. Zhang. Parallel algorithms for backtrack search and branch-and-bound. Journal of the ACM, 40(3):765–789, 1993.
  • [25] Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Priority queues are not good concurrent priority schedulers. In European Conference on Parallel Processing, pages 209–221. Springer, 2015.
  • [26] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [27] Ulrich Meyer and Peter Sanders. Δ-stepping: a parallelizable shortest path algorithm. Journal of Algorithms, 49(1):114–152, 2003.
  • [28] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 456–471, New York, NY, USA, 2013. ACM.
  • [29] Hamza Rihani, Peter Sanders, and Roman Dementiev. Brief announcement: Multiqueues: Simple relaxed concurrent priority queues. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’15, pages 80–82, New York, NY, USA, 2015. ACM.
  • [30] Konstantinos Sagonas and Kjell Winblad. The contention avoiding concurrent priority queue. In International Workshop on Languages and Compilers for Parallel Computing, pages 314–330. Springer, 2016.
  • [31] Konstantinos Sagonas and Kjell Winblad. A contention adapting approach to concurrent ordered sets. Journal of Parallel and Distributed Computing, 2017.
  • [32] Nir Shavit and Itay Lotan. Skiplist-based concurrent priority queues. In Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS 2000), pages 263–268. IEEE, 2000.
  • [33] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, and Phillip B. Gibbons. Reducing contention through priority updates. In Proceedings of the Twenty-fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’13, pages 152–163, New York, NY, USA, 2013. ACM.
  • [34] Julian Shun, Yan Gu, Guy E Blelloch, Jeremy T Fineman, and Phillip B Gibbons. Sequential random permutation, list contraction and tree contraction are highly parallel. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 431–448. SIAM, 2014.
  • [35] Martin Wimmer, Jakob Gruber, Jesper Larsson Träff, and Philippas Tsigas. The lock-free k-lsm relaxed priority queue. CoRR, abs/1503.05698, 2015.