
# Analysis of Work-Stealing and Parallel Cache Complexity

Parallelism has become extremely popular over the past decade, and there have been a lot of new parallel algorithms and software. The randomized work-stealing (RWS) scheduler plays a crucial role in this ecosystem. In this paper, we study two important topics related to the randomized work-stealing scheduler. Our first contribution is a simplified, classroom-ready version of analysis for the RWS scheduler. The theoretical efficiency of the RWS scheduler has been analyzed for a variety of settings, but most of them are quite complicated. In this paper, we show a new analysis, which we believe is easy to understand, and can be especially useful in education. We avoid using the potential function in the analysis, and we assume a highly asynchronous setting, which is more realistic for today's parallel machines. Our second and main contribution is some new parallel cache complexity for algorithms using the RWS scheduler. Although the sequential I/O model has been well-studied over the past decades, so far very few results have extended it to the parallel setting. The parallel cache bounds of many existing algorithms are affected by a polynomial of the span, which causes a significant overhead for high-span algorithms. Our new analysis decouples the span from the analysis of the parallel cache complexity. This allows us to show new parallel cache bounds for a list of classic algorithms. Our results are only a polylogarithmic factor off the lower bounds, and significantly improve previous results.


## 1. Introduction

Hardware advances in the last decade have brought multicore parallel machines to the mainstream. While there are multiple programming paradigms and tools to enable parallelism in multicore machines, the one based on nested parallelism with the randomized work-stealing (RWS) scheduler is without doubt the most popular and widely used. The nested parallelism model and its variants have been supported by most parallel programming languages (e.g., Cilk, TBB, TPL, X10, Java Fork-join, and OpenMP), introduced in textbooks (e.g., Cormen, Leiserson, Rivest and Stein (CLRS, )), and employed in a variety of research papers (to list a few: (agrawal2014batching, ; blelloch2010low, ; BCGRCK08, ; BG04, ; Blelloch1998, ; blelloch1999pipelining, ; BlellochFiGi11, ; BST12, ; Cole17, ; BBFGGMS16, ; dinh2016extending, ; chowdhury2017provably, ; blelloch2018geometry, ; dhulipala2020semi, ; BBFGGMS18, ; Dhulipala2018, ; blelloch2020randomized, ; gu2021parallel, ; blelloch2016justjoin, ; sun2018pam, ; sun2019parallel, ; ptreedb, )). At a high level, this model allows an algorithm to recursively and dynamically create (fork) parallel tasks, which will be executed on processors by a dynamic scheduler. This nested (binary) fork-join provides a good abstraction for shared-memory parallelism. On the user (algorithm designer or programmer) side, it is a simple extension to the classic programming model with additional keywords for creating new tasks (e.g., fork) and synchronization (e.g., join) between tasks.

The randomized work-stealing (RWS) scheduler plays a crucial role in this ecosystem. It automatically and dynamically maps a nested parallel algorithm to the hardware efficiently both in theory and in practice. In this paper, we study two important topics related to the RWS scheduler. First, we show a new, simplified, and classroom-ready version of analysis for the RWS scheduler, which we believe is easier to understand than existing ones. Second, we provide some new analyses for parallel cache complexity for nested-parallel algorithms based on the RWS scheduler. We provide a list of almost optimal bounds listed in Table 1, and more discussions will be given later.

Our first contribution is a simplified analysis for the RWS scheduler. The theoretical efficiency of the RWS scheduler was first shown by Blumofe and Leiserson (blumofe1999scheduling, ), and was later analyzed for a variety of settings (to list a few: (Acar02, ; itpa, ; muller2016latency, ; acar2013scheduling, ; BBFGGMS16, ; singer2019proactive, ; Acar2016tapp, ; singer2020scheduling, ; arora2001thread, )). Many of these analyses consider more complex settings (e.g., they also account for external I/Os), and the analyses themselves are quite complicated. To the best of our knowledge, the details of the proofs are covered in very few courses related to parallelism, and in most cases, RWS is just treated as a black box. Hence, we focus only on bounding the number of steals of the RWS scheduler, and ask what the simplest analysis for the RWS scheduler can be, that is, which version is the most comprehensible for teaching.

This paper presents a simplified analysis in Section 4. Unlike most of the existing analyses, our version does not rely on defining a potential function for a substructure of the computation, which we believe makes it easier to understand. Our analysis is inspired by a recent analysis (itpa, ) that is similar to (agrawal2008adaptive, ; suksompong2014bounds, ). Our analysis differs from (itpa, ) in two aspects. First, unlike (itpa, ), we do not assume all processors run in lock-step (the PRAM setting). Instead, we assume a highly asynchronous setting, which is more realistic for today’s parallel machines. Second, we separate the math calculation from the details of the RWS algorithm, which may be helpful for classroom teaching.

Our second and main contribution is on the parallel cache complexity of algorithms using the RWS scheduler. On today’s machines, the memory access cost usually dominates the running time of most combinatorial algorithms. To capture this cost in the sequential setting, Aggarwal and Vitter (AggarwalV88, ) first formalized the external-memory model, which was refined by Frigo et al. (Frigo99, ) as the ideal-cache model. The cost measure is called I/O complexity (AggarwalV88, ) or cache complexity (Frigo99, ), denoted as $Q$, and it specifies the communication cost between the cache and the main memory. While this model has been very successful in the algorithm and database communities, few results so far have extended it to the parallel (distributed cache) setting; we summarize them in Section 3.4. Among them, Acar, Blelloch and Blumofe (Acar02, ) first defined the parallel cache complexity $Q_P$ and showed that it is at most $Q + O(PDM/B)$, where $P$, $D$, $M$, and $B$ are the number of processors, span (aka. depth, the longest critical path of dependences), cache size, and cache block size, respectively (definitions in Section 3). However, this bound is pessimistic and usually too loose. Frigo and Strumpen (frigo2009cache, ) and later work by Cole and Ramachandran (cole2013analysis, ; cole2012revisiting, ) showed tighter parallel cache complexity for many cache-oblivious algorithms, summarized in Table 1. However, these bounds have a polynomial overhead in the input size (usually a fixed power of the input size) when the span is polynomial in the input size. Such overhead can easily dominate the cache bound for real-world input sizes, as discussed in the caption of Table 1. How to close this gap remained an open question.

In this paper, we significantly close this gap for a variety of algorithms. The polynomial overhead in the previous analyses (Acar02, ; frigo2009cache, ; cole2013analysis, ; cole2012revisiting, ) is due to the high span of these algorithms (a polynomial of the algorithm’s span shows up in the bound). The key insight in our analysis is to break this polynomial correlation between the algorithm’s span and the overhead for parallel cache complexity. Our new analysis is inspired by the abstraction of the $k$-d grid proposed recently by Blelloch and Gu (BG2020, ), which was previously used to show sequential cache bounds. We extend the idea to analyzing parallel cache complexity. Our analysis directly studies the “real” dependencies in the computation structure of these algorithms, instead of those caused by parallelism. In particular, the core of our analysis is just the recurrences of these algorithms. As a result, the parallel dependency of the computation (the algorithm’s span) does not show up in the analysis. This lets us avoid digging into the details of the scheduling algorithm, and the analysis is no more complicated than solving recurrences. To do this, we define the -recurrence, which effectively models the pattern of recurrences of many classic algorithms, and show a general theorem, Theorem 5.2, to solve these recurrences. By applying these results, we achieve the new bounds in Table 1. Compared to the lower bounds by (ballard2014communication, ; BG2020, ), the parallel overheads are polylogarithmic instead of polynomial in the input size. We believe the methodology is of independent interest, and could be useful in analyzing other parallel or sequential algorithms. We leave this as future work.

The contributions of this paper are as follows.

1. We show a new analysis for the randomized work-stealing scheduler, which avoids the use of potential functions and is very simple.

2. We propose the -recurrence and a general theorem (Theorem 5.2) for solving it, which applies to solving the parallel cache complexity for many classic parallel algorithms.

3. We show new cache complexity bounds for a variety of algorithms, shown in Table 1, which significantly improve existing results and are very close to the lower bounds (only a polylogarithmic overhead).

## 2. old

Hardware advances in the last decade have brought multicore parallel machines to the mainstream. While there are multiple programming paradigms and tools to enable parallelism in multicore machines, the one based on nested fork-join with the randomized work-stealing scheduler is without doubt the most popular and widely used. At a high level, an algorithm can recursively and dynamically create (fork) parallel tasks, which will be executed on all cores. We give more details of the model in Section 3.

Nested (binary) fork-join provides a good abstraction for shared-memory parallelism. On the user (algorithm designer or programmer) side, it is a simple extension to the classic programming model with additional keywords for fork and join. On the system side, the randomized work-stealing scheduler automatically runs the code on all cores, with good theoretical guarantees. This abstraction is also used in most of the modern textbooks and courses for parallel algorithms and programming (to list a few: XXX).

The randomized work-stealing (RWS) scheduler plays a crucial role in this ecosystem since it dynamically maps a parallel algorithm to the hardware in an efficient manner both theoretically and practically. The theoretical efficiency of the RWS scheduler was first shown by Blumofe and Leiserson (BL98), and later analyzed for a variety of settings (to list a few: YYY). Unfortunately, all of these analyses are quite complicated, and the RWS scheduler is usually treated as a black box. To the best of our knowledge, the details of the proofs are not covered in any of the existing courses listed above. Given the importance of the RWS scheduler, it is crucial to simplify the analysis so that it is more comprehensible in courses.

This paper shows such a simplified analysis. Unlike most of the existing analyses (BL and others), our version does not rely on defining a potential function for a substructure of the computation, which we believe makes it easier to understand. (We note that this will limit the applicability of this analysis in more settings, but the goal here is to provide a simple version for the plain setting.) Our analysis is inspired by a recent analysis in (itpa, ). However, unlike that in (itpa, ) (a single long proof), our proof breaks down into several lemmas, each of which is simple both conceptually and in its details. Then, the proof of the theorem simply combines the lemmas. We believe this version is particularly suitable for the classroom.

Our second and main contribution is about parallel cache complexity based on the RWS scheduler. Nowadays, memory access cost dominates the running time of most combinatorial algorithms. To capture this cost in the sequential setting, Aggarwal and Vitter (AggarwalV88, ) first formalized the external-memory model, which was refined by Frigo et al. (Frigo99, ) as the ideal-cache model. The cost measure is I/O complexity (AggarwalV88, ) or cache complexity (Frigo99, ), usually denoted as $Q$, and it specifies the communication cost between the cache and the main memory. While this model has been very successful in the algorithm and database communities, so far we have few results on extending it to the parallel (distributed cache) setting. The two main results are as follows. Acar, Blelloch and Blumofe (Acar02, ) showed that the parallel cache complexity is $Q_P = Q + O(PDM/B)$, where $P$, $D$ and $M$ are the number of cores, span (aka. depth, the longest critical path of dependences), and cache size, and $B$ is the cache block size. However, this bound is usually too loose and counterintuitive: for the same algorithm, the bound increases as $M$ (the cache size) increases, but in reality the opposite happens. Hence, Frigo and Strumpen (frigo2009cache, ) and later work by Cole and Ramachandran (cole2013analysis, ; cole2012revisiting, ) analyzed the RWS scheduler and the computational structure for certain cache-oblivious algorithms, and showed tighter parallel cache complexity as compared to (Acar02, ). However, their bounds are still relatively loose, and have an overhead that is polynomial in the input size when the span is polynomial in the input size. More details can be found in Table 1, and the parallelism overhead can easily dominate the cache bound for real-world input sizes.

In this paper, we show how to break this polynomial dependence between the algorithm’s span and the overhead for parallel cache complexity. Unlike the previous work (frigo2009cache, ; cole2013analysis, ; cole2012revisiting, ) that analyzed how the computation is scheduled by RWS, we directly study the recurrence structures of the computations. This idea leads to a list of tighter bounds, as shown in Table 1. It is inspired by the abstraction of the $k$-d grid proposed recently by Blelloch and Gu (BG2020, ), which was originally used to show cache lower bounds independently of the dependence structure of the computation. Compared to the lower bounds by (ballard2014communication, ; BG2020, ), the overheads of all bounds shown in this paper are polylogarithmic instead of polynomial in the input size.


## 3. Preliminaries

### 3.1. Nested Parallelism

Threads are also allowed to be synchronized with any other threads, typically by using atomic primitives such as test_and_set or compare_and_swap. This model is the most widely used model for multicore programming, and is supported by many parallel languages including NESL (blelloch1992nesl, ), Cilk (frigo1998implementation, ), the Java fork-join framework (Java-fork-join, ), OpenMP (OpenMP, ), X10 (charles2005x10, ), Habanero (budimlic2011design, ), Intel Threading Building Blocks (TBB, ), the Task Parallel Library (TPL, ), and many others. In this model, we usually require the computation to be either race-free (feng1999efficient, ) (i.e., no two logically parallel instructions access the same memory location with at least one of them being a write), or to only use atomic operations (e.g., test_and_set or compare_and_swap) to deal with concurrent writes (e.g., (blelloch2012internally, )).

### 3.2. Work-Span Measure

For a computation using nested parallelism, we can measure its work and span by evaluating its series-parallel computational DAG (i.e., a DAG modeling the dependence between operations in the computation). The work is the number of operations in this computation, or the costs of all tasks in the computation DAG (the time complexity on the RAM model). The span (or depth) is the maximum number of operations over all directed paths in the computation DAG.
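As a small illustration of these two measures, the sketch below computes work and span for a DAG given as an adjacency map. The function name `work_and_span`, the dictionary encoding, and the unit-cost default are assumptions made for this example only; they are not part of the paper.

```python
from functools import lru_cache

def work_and_span(dag, cost=None):
    """Return (work, span) of a DAG given as {node: [successors]}.

    Every node costs 1 unless `cost` (a hypothetical per-node override)
    says otherwise. Uses recursion, so it is meant for small examples.
    """
    nodes = set(dag)
    for succs in dag.values():
        nodes.update(succs)
    cost = cost or {}

    def node_cost(v):
        return cost.get(v, 1)

    # Work: total cost of all nodes in the DAG.
    work = sum(node_cost(v) for v in nodes)

    # Span: maximum total cost over all directed paths.
    @lru_cache(maxsize=None)
    def longest_from(v):
        return node_cost(v) + max(
            (longest_from(u) for u in dag.get(v, [])), default=0
        )

    span = max(longest_from(v) for v in nodes)
    return work, span

# A fork at "a" creates "b" and "c", which join at "d".
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(work_and_span(diamond))
```

For the diamond-shaped fork-join above, all four nodes count toward the work, while the span only counts one of the two parallel branches.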

### 3.3. Randomized Work-Stealing (RWS) Scheduler

In practice, a nested-parallel computation can be scheduled on multicore machines using the randomized work-stealing (RWS) algorithm. More details of the RWS scheduler can be found in (blumofe1999scheduling, ), and here we overview the high-level ideas that will be used in our analysis. The RWS scheduler assigns one double-ended queue (deque) for each processor that can execute a thread at a time. One processor starts with taking the root thread. Each processor then proceeds as follows:

• If the current thread runs a fork, the processor enqueues one thread at the front of its queue (spawned child), and executes the other thread (continuation).

• If the current thread completes, the processor pulls a thread from the front of its own queue.

• If a processor’s queue is empty, it randomly selects one of the other processors, and steals a thread from the end of that processor’s queue (victim queue). If that fails, the processor retries until it succeeds.

Since a steal can be costly in practice (it involves complicated inter-processor communication), a common approach is to wait for at least the time of a successful steal before retrying (Acar02, ).

The overhead of executing the computation based on the RWS scheduler is mainly on the steal attempts (both successful ones and failed ones), plus maintaining the deque. Hence, bounding the number of steals is of great interest for multicore parallelism.
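The three rules above can be sketched as a toy simulation. This is a minimal sketch under simplifying assumptions that are not in the paper: the computation is a balanced binary fork tree with unit-cost nodes and no join nodes, processors take turns in synchronous rounds (so steal attempts never overlap), and a failed steal is simply retried in the next round. The name `simulate_rws` and the returned dictionary are made up for the example.

```python
import random
from collections import deque

def simulate_rws(num_procs, depth, seed=0):
    """Toy RWS run on a balanced binary fork tree with 2**depth leaves."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_procs)]
    current = [None] * num_procs   # remaining fork depth of the running thread
    current[0] = depth             # one processor starts with the root thread
    leaves_done = attempts = successes = rounds = 0
    while leaves_done < 2 ** depth:
        rounds += 1
        for p in range(num_procs):
            task = current[p]
            if task is None:
                # Idle: the own deque is empty, so try to steal from the END
                # of a uniformly random other processor's deque.
                if num_procs == 1:
                    continue
                victim = rng.choice([q for q in range(num_procs) if q != p])
                attempts += 1
                if deques[victim]:
                    current[p] = deques[victim].pop()
                    successes += 1
            elif task > 0:
                # Fork: put the spawned child at the FRONT of the own deque
                # and keep executing the continuation.
                deques[p].appendleft(task - 1)
                current[p] = task - 1
            else:
                # Thread completes: take the next thread from the FRONT
                # of the own deque, or become idle.
                leaves_done += 1
                current[p] = deques[p].popleft() if deques[p] else None
    return {"leaves": leaves_done, "steal_attempts": attempts,
            "successful_steals": successes, "rounds": rounds}

print(simulate_rws(num_procs=4, depth=6, seed=1))
```

Running it with different values of `num_procs` and `depth` gives a feel for how the number of steal attempts relates to the number of processors and the depth of the fork tree, although the synchronous rounds here are stricter than the asynchronous setting assumed in the analysis.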

###### Theorem 3.1.

Executing a series-parallel computation DAG with work $W$ and span $D$ on an RWS scheduler uses $O(PD)$ steals with high probability (whp) in $W$, where $P$ is the number of processors. (We use the term with high probability (whp) in $n$ to indicate that the bound holds with probability at least $1 - n^{-c}$ for any $c \geq 1$. With clear context we drop “in $n$”.)

Theorem 3.1 was first shown by Blumofe and Leiserson (blumofe1999scheduling, ), and was later proved in different papers (e.g., (Acar02, ; itpa, ; muller2016latency, ; acar2013scheduling, ; BBFGGMS16, ; singer2019proactive, ; Acar2016tapp, ; singer2020scheduling, ; arora2001thread, )). Since RWS is an asynchronous algorithm, certain synchronization assumptions are required. Early work (blumofe1999scheduling, ) assumed that all processors are fully synchronized and all operations have unit cost. Later work (Acar02, ) relaxed this assumption (and thus made it more realistic) so that a steal attempt takes at least $s$ and at most $ks$ time steps, where $s$ is the cost for a steal and $k$ is a constant. In this paper, we further relax the synchronization assumption: we assume that, between a failed steal and the next steal attempt of each processor, every other processor can make at most $c$ steal attempts for some constant $c$. Also, between two steal attempts from the same processor, another processor that has work to do will execute at least one instruction. We believe such a relaxation is crucial since, in practice, processors are highly asynchronous for various reasons, including cache misses, processor pipelines, branch prediction, hyper-threading, changing clock speeds, interrupts, or the operating system scheduler. Hence, it is hard to define what time steps mean for different processors. However, it is reasonable to assume that processors run at similar speeds within a constant factor, and that steal attempts are not too frequent (in practice the gap between two steal attempts is usually set to be at least hundreds to thousands of cycles). In the analysis, we consider the simpler case of $c = 1$, but it is easy to see that a larger $c$ will not asymptotically affect the scheduling result (Theorem 3.1) as long as $c$ is a constant.

### 3.4. Cache Complexity

Cache complexity (aka. I/O complexity) measures the memory access cost of an algorithm, which in many cases can be the bottleneck of the execution time, especially for parallel combinatorial algorithms. The idea was first introduced by Aggarwal and Vitter (AggarwalV88, ), and has been widely studied since then. Here we use the definition by Frigo et al. (Frigo99, ), which is the most widely adopted now. We assume a two-level memory hierarchy. The CPU is connected to a small-memory (cache) of size $M$, and this small-memory is connected to a large-memory (main memory) of effectively infinite size. Both small-memory and large-memory are divided into blocks of size $B$ (cachelines), so there are $M/B$ cachelines in the cache. The CPU can only access memory in blocks resident in the cache, and such accesses are free of charge. Finally, we assume an optimal offline cache replacement policy, which is automatic, to transfer the data between the cache and the main memory, and a unit cost for each cacheline load and evict. Practical policies such as LRU or FIFO are $O(1)$-competitive with the optimal offline algorithm if they have a cache of twice the size. The cache complexity of an algorithm, $Q$, is the total cost to execute this algorithm on such a model.
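To make the two-level model concrete, the following sketch counts cacheline loads for a sequence of word addresses. It is a simplified stand-in rather than the model itself: it uses LRU replacement instead of the optimal offline policy, counts only loads (not evictions), and the function name `count_cache_misses` and its parameters `M` and `B` (cache size and block size, both in words) are chosen for this example.

```python
from collections import OrderedDict

def count_cache_misses(addresses, M, B):
    """Count cacheline loads for a word-address trace under LRU replacement."""
    num_lines = M // B          # number of cachelines that fit in the cache
    cache = OrderedDict()       # resident block ids, least recently used first
    misses = 0
    for addr in addresses:
        block = addr // B       # the block (cacheline) holding this word
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the block
            cache[block] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

# Scanning n contiguous words loads each block once, i.e. n / B misses.
print(count_cache_misses(range(1024), M=64, B=8))
```

A second pass over data that fits in the cache causes no further misses, while a second pass over data larger than the cache makes LRU reload every block again.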

The above measure is sequential. One way to extend it to the parallel setting is the “Parallel External Memory” (PEM) model (Arge08, ), which is analogous to the PRAM model (shiloach1981finding, ). However, since modern processors are highly asynchronous (blelloch2020optimal, ) instead of running in lock-step as assumed in PRAM, the PEM model cannot measure the communication between cache and main memory well. An alternative solution is to assume that multiple processors work independently, and that they either share a common cache or own their individual caches. For individual caches (aka. distributed caches), the parallel cache complexity based on RWS is upper bounded by $Q + O(S \cdot M/B)$, where $S$ is the number of steals (Acar02, ). This is easy to see since each steal can lead to, at worst, an entire reload of the cache, with cost $M/B$. Applying Theorem 3.1, we get $Q_P = Q + O(PDM/B)$ whp.
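A quick back-of-the-envelope calculation shows why this steal-based bound becomes loose for high-span algorithms. The sketch below multiplies an assumed number of steals (the product of the processor count and the span, with all hidden constants set to 1) by the cost of one full cache reload. The function name and the machine parameters are hypothetical and chosen only for illustration.

```python
def steal_overhead(num_procs, span, cache_size, block_size):
    """Extra cache misses charged to steals, with hidden constants set to 1."""
    steals = num_procs * span                 # stand-in for the number of steals
    reload_cost = cache_size // block_size    # cachelines in one full cache reload
    return steals * reload_cost

# Hypothetical machine: 64 processors, a cache of 2**20 words, 8-word cachelines.
low_span = steal_overhead(64, span=400, cache_size=2**20, block_size=8)
high_span = steal_overhead(64, span=40_000, cache_size=2**20, block_size=8)
print(low_span, high_span, high_span // low_span)
```

Because the overhead term grows linearly with the span, an algorithm whose span is polynomial in the input size pays a polynomial additive overhead under this bound, regardless of how cache-friendly its sequential execution is.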

There have also been studies on other cache configurations, such as shared caches (BG04, ), multi-level hierarchical caches (BlellochFiGi11, ; SBFGK14, ; BCGRCK08, ), and varying cache sizes (bender2014cache, ; bender2016cache, ). Many parallel cache-efficient algorithms have been designed inspired by these measurements (e.g., (cole2010resource, ; tang2011pochoir, ; BCGRCK08, ; chowdhury2008cache, ; blelloch2010low, ; chowdhury2010cache, ; BG2020, ; BFGGS15, ; Arge08, )).

## 4. Simplified RWS Analysis

The randomized work-stealing (RWS) scheduler plays a crucial role in the ecosystem of nested-parallel algorithms and multicore platforms, since it dynamically maps the algorithms to the hardware. RWS is efficient both theoretically and practically. The theoretical efficiency of the RWS scheduler was first shown by Blumofe and Leiserson (blumofe1999scheduling, ). Much later work analyzed RWS in different settings (Acar02, ; itpa, ; muller2016latency, ; acar2013scheduling, ; BBFGGMS16, ; singer2019proactive, ; Acar2016tapp, ; singer2020scheduling, ; arora2001thread, ). As mentioned, most of these analyses are quite involved, especially when they also consider more complicated settings (e.g., external I/O costs). This makes the analysis very hard to cover in undergraduate or graduate courses, and RWS is usually treated as a black box. Given the importance of the RWS scheduler, it is crucial to make the analysis more comprehensible so that it can be taught in classes. In the following, we show a simplified analysis for RWS, which proves Theorem 3.1.

Unlike most of the existing analyses, our version does not rely on defining the potential function for a substructure of the computation. We understand that this will limit the applicability of this analysis in some settings (e.g., we do not extend the results to consider external I/O costs), but our goal is to provide a simple proof for a reasonably general setting (the binary-forking model (blelloch2020optimal, )), and make it easy to understand in most parallel algorithm courses.

Our analysis is inspired by a recent analysis in (itpa, ) that is similar to those in (agrawal2008adaptive, ; suksompong2014bounds, ). Our analysis differs from (itpa, ) in two aspects. First, the analysis in (itpa, ) assumes all processors run in lock-step (PRAM setting). However, on today’s machines, the processors are loosely synchronized, with relative processing rates changing over time. Hence, the processors can run at different speeds and do not and should not be synchronized. In our analysis, we assume that, between a failed steal attempt and the next steal attempt on each processor, every other processor can make at most $c$ steal attempts for some constant $c$. We believe this is a reasonably realistic assumption that all of today’s machines and RWS implementations satisfy. For simplicity, in our analysis we use $c = 1$, and it is easy to see that this does not affect the asymptotic bounds.

Second, instead of giving a single long proof, we try to make the argument easier to follow by separating the math and calculation from the main idea of the proof. We start from the most optimistic case as a motivating example in Lemma 4.1, and then generalize it to Lemma 4.2 and Lemma 4.3. The only mathematical tools we use are the Chernoff bound and the union bound. These lemmas pave the path to, and decouple the mathematics from, the proof of Theorem 3.1, which involves more details and the core idea about RWS. We believe this can be helpful for classroom teaching.

We say two steal attempts overlap with each other when they choose the same victim processor and happen concurrently. To start with, we show the simplest case where none of the steal attempts overlap with each other. This is the most optimistic case because, in reality, multiple attempts may happen simultaneously and only one of them can succeed. We show the lemma below as a starting point of our analysis.

###### Lemma 4.1.

Given a victim processor and non-overlapping steal attempts, the probability that at least tasks from the deque of the processor are stolen is at least , where is the number of processors.

Here we overload the notation $D$ in the lemma, which was previously defined as the span of the computation. We do so because later, in the proof of Theorem 3.1, we plug in $D$ as the span of the algorithm, so we just use $D$ for convenience. We use the classic version of the Chernoff bound that considers $n$ independent random variables $X_1, \ldots, X_n$ taking values in $\{0, 1\}$. Let $X$ be the sum of these random variables, and let $\mu$ be the expected value of $X$. Then for any “offset” $0 \le \delta \le 1$, $\Pr[X \le (1-\delta)\mu] \le e^{-\delta^2 \mu / 2}$.

###### Proof of Lemma 4.1.

Recall that in RWS, each steal attempt independently chooses a victim processor and tries to steal a task. Hence, it has probability to choose processor (and steal one task if so). We now consider each of the steal attempts as a random variable . if it chooses processor , and otherwise. In Chernoff Bound, let the number of random variables . Then for steals, the expected number of hits is . We are interested in the probability that fewer than steal attempts hit processor . In this case, , so . Applying Chernoff bound, we can get that the probability is no more than . Here . The “” step is because we discard the term that is always positive. Hence, , which proves the lemma. ∎
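The flavor of this calculation can be checked empirically. The sketch below is an illustration under assumptions of its own, since the exact constants of the lemma are not reproduced here: it assumes each attempt hits the victim with probability $1/P$, uses $n = 2P(D + \epsilon)$ attempts, and compares the observed failure rate against $e^{-\epsilon}$, which is what the multiplicative Chernoff lower tail $\Pr[X \le (1-\delta)\mu] \le e^{-\delta^2\mu/2}$ gives for this choice of $n$ (with $\mu = n/P$ and $(1-\delta)\mu = D$, after dropping the positive term $D^2/(2\mu)$ from the exponent). The names `estimate_failure`, `eps`, and `trials` are hypothetical.

```python
import math
import random

def estimate_failure(P, D, eps, trials=2000, seed=0):
    """Estimate Pr[fewer than D of n attempts hit a fixed victim].

    Assumes n = 2 * P * (D + eps) non-overlapping attempts, each choosing
    the victim independently with probability 1/P (an illustrative choice
    of constants, not necessarily the one used in the lemma).
    Returns (empirical failure rate, Chernoff-style bound exp(-eps)).
    """
    rng = random.Random(seed)
    n = math.ceil(2 * P * (D + eps))
    failures = 0
    for _ in range(trials):
        # Processor 0 plays the role of the fixed victim.
        hits = sum(1 for _ in range(n) if rng.randrange(P) == 0)
        if hits < D:
            failures += 1
    return failures / trials, math.exp(-eps)

empirical, bound = estimate_failure(P=8, D=5, eps=3)
print(f"empirical failure rate {empirical:.4f}, bound {bound:.4f}")
```

In runs like this the empirical failure rate typically sits well below the bound, which is expected because the Chernoff bound is not tight for such small parameters.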

Next, we consider the general case that multiple processors try to steal concurrently, so the steal attempts can overlap. In this case, although unlikely, many processors may make their steal attempts “almost” at the same time, where only one processor wins (using arbitrary tie-breaking) and the others fail. As mentioned, we assume that between a failed steal attempt and the next steal attempt on one processor, every other processor can make at most one steal attempt.

###### Lemma 4.2.

Given a specific victim processor and steal attempts from other processors, the probability that tasks from the deque of the processor are stolen is at least .

###### Proof.

Although multiple processors can attempt to steal concurrently, two steals from the same processor will never be concurrent. Hence, based on our assumption, the most pessimistic situation is that the processors always have steals at the same time, which maximizes the chance that a steal hits the queue but fails to get the task. The probability that at least one of the concurrent steals chooses this victim processor (so that at least one of the tasks is stolen) is at least .

We can similarly use Chernoff bound to show that the probability that fewer than tasks are stolen after steps is small. We consider each random variable as a group of steal attempts, with probability of at least to choose the deque of the specific processor . In this case, the expected value of the sum and the offset remain the same as in Lemma 4.1, so the probability is also the same. ∎

Here note that in the analysis, we do not need the assumption that the tasks are in the same deque of a certain processor. In fact, the analysis easily extends to when the tasks are from different processors as long as there is always one task available to be stolen. Therefore, we show the relaxed form of Lemma 4.2.

###### Lemma 4.3.

Given tasks and steal attempts, the probability that these tasks are stolen is at least , as long as at least one task is available at the time of any steal attempt.

As discussed in Section 3.1, any nested-parallel computation can be viewed as a DAG, and each (non-termination) node is an instruction and has either two successors (for a fork) or one successor (otherwise). The RWS scheduler dynamically maps each node to a processor. For each specific path, the length is no more than the span of the algorithm, based on the definition. Based on the RWS algorithm, a node will be mapped to the same processor that executes the predecessor node, except for the spawned children (definition in Section 3.3). The spawned child of a processor is ready to be stolen during the process when is executing the other branch (continuation), and will be executed by if it is not stolen during this process.

We now prove Theorem 3.1 by showing that the computation must have been terminated after steals. We will use Lemma 4.3 and apply union bound.

###### Proof of Theorem 3.1.

We consider a path in the DAG and show that all instructions on this path will be executed within the number of steals given by Lemma 4.3. Each node on this path is either a spawned child (that can be stolen) or executed directly after the previous node by the same processor. Now let’s consider a processor that is not working on the instructions on this path. When the next steal is attempted, the processor working on this path either has added one more node on this path that is ready to be stolen, or has executed the next node on this path. This is because we assume a processor executes at least one instruction between two steal attempts from another processor. The only case in which the next node is not executed is when it is a spawned child. It will not be executed immediately, and needs to wait either until it is stolen for execution, or until the continuation branch finishes and the same processor executes it.

Hence, consider the worst case, in which all nodes on the path are spawned children. Lemma 4.3 upper bounds the number of steals needed to finish the execution of this path. Namely, after steal attempts, all nodes are stolen and executed with probability at least . For a DAG with the longest path length , there are at most paths in the DAG. Now we set where is the work of the computation (the number of nodes in the DAG). For any constant , steals are sufficient for executing all existing paths. Now we take the union bound on the probability that all paths will finish, which is . Since each node in the DAG can have at most two successors, the DAG needs to have longest path length to contain nodes. Hence, the term will not dominate, which simplifies the number of steals to be . ∎

We have included this analysis in a few lectures of parallel algorithms courses, and we would also like to answer a frequently asked question. The question is: in the analysis, we apply a union bound on paths, but apparently the steals cannot cover all paths since is a much lower-order term than in practice. Theoretically, the answer is that is a sufficiently small term for us to apply the union bound, which gives us the desired bound in Theorem 3.1. The more practical and easy-to-understand answer is that we are assuming the worst case, and in practice we do not need all spawned children to be stolen during the execution. In fact, it is likely that most of them are executed by the same processor that spawned them. Take a parallel for-loop as an example, which can be viewed as levels of binary forks. For most of the paths in this DAG, the spawned children are executed by the same processor that executes the parent node. Because of the design of the RWS algorithm, most of the successful steals will involve a large chunk of work, so steal attempts are infrequent. The analysis shows that, once steals are made, the path must have finished, but it is more likely that the computation has finished even before this number of attempts is made.
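The parallel for-loop example can be made concrete with a small sketch. It assumes a balanced binary split of the iteration range, and the function name `fork_tree_stats` is hypothetical (it is not part of any scheduler). It counts how many fork levels lie on the longest path and how many spawned children exist in total, which shows why requiring every spawned child on a path to be stolen is a pessimistic assumption.

```python
def fork_tree_stats(lo, hi):
    """Return (levels, spawned) for a parallel-for over [lo, hi).

    The loop is modeled as recursive binary forking: each fork keeps the
    left half as the continuation and exposes the right half as a spawned
    child that another processor may steal.
    """
    if hi - lo <= 1:
        return 0, 0  # a single iteration needs no fork
    mid = (lo + hi) // 2
    left_levels, left_spawned = fork_tree_stats(lo, mid)    # continuation
    right_levels, right_spawned = fork_tree_stats(mid, hi)  # spawned child
    levels = 1 + max(left_levels, right_levels)
    spawned = 1 + left_spawned + right_spawned
    return levels, spawned


levels, spawned = fork_tree_stats(0, 1000)
print(levels, spawned)
```

For a loop of `n` iterations the tree contains `n - 1` spawned children overall, while any single root-to-leaf path passes through only about `log2(n)` forks.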

## 5. Analysis for Parallel Cache Complexity

Studying parallel cache complexity for nested-parallel algorithms scheduled by RWS is a crucial topic for parallel computing and has been studied in many existing papers (e.g., (Acar02, ; frigo2009cache, ; cole2012revisiting, ; cole2013analysis, )). The goal in these analyses is to show the parallel overhead when scheduling using RWS, in addition to the sequential cache complexity. We show the best existing parallel bounds for a list of widely used algorithms and problems in Table 1. While the results in these papers are reasonably good for algorithms with low (polylogarithmic) span, the bounds for parallel overhead can be significant for algorithms with linear or super-linear span. Compared to the lower bounds for the parallel overhead, the upper bounds given in these papers incur polynomial (usually or ) overheads. Such parallel overhead will dominate the sequential cache bounds over most of the input range. Meanwhile, it is known that the practical performance of many of these algorithms is almost as good as that of low-span algorithms (chowdhury2010cache, ; SchardlThesis, ). Hence, tightly bounding the parallel overhead of such algorithms has remained an open problem for decades.

In this section, we show a new analysis to give almost tight parallel cache bounds for the list of problems in Table 1, which are only a polylogarithmic factor off the lower bounds for the main term. Unlike the previous approaches that analyze the scheduler, we directly study the recurrence relations of such computations and find it surprisingly simple. This new analysis is inspired by the concept of d-grid (BG2020, ) (see more details in Section 5.2).

In the rest of this section, we first review the existing work on this topic in Section 5.1. Section 5.2 presents the high-level idea and the main theorem (Theorem 5.2) of our analysis, which provides a general approach to solve the cache complexity based on recurrences. In Section 5.3, we use a simple example of Kleene’s algorithm to show how to use the newly introduced main theorem. Finally, we show the new results for more complicated algorithms in Section 5.4, and discuss the applicability and open problems in Section 5.5.

### 5.1. Related Work

Given the importance of I/O efficiency and the RWS scheduler, parallel cache bounds have been studied for over 20 years. The definition of distributed cache was given by Acar et al. (Acar02, ), who also showed a trivial parallel upper bound on processors: . To achieve this bound, one just needs to pessimistically assume that in each of the steals, the stealing thread accesses the entire cache of the original processor, which is additional cache misses. This bound is easy to understand and good for algorithms with polylogarithmic span, but is too loose for linear and super-linear span algorithms. Hence, later work showed tighter parallel cache bounds for algorithms with certain structures.

Frigo and Strumpen (frigo2009cache, ) first analyzed the parallel cache complexity of a class of divide-and-conquer computations, such as matrix multiplication and 1D Stencil, where the problems have subproblem cache complexity as a “concave” function of the computation cost (see more details in (frigo2009cache, )). The idea from (frigo2009cache, ) is in fact general and can be applied to a variety of algorithms, as shown in Table 1. Later work by Cole and Ramachandran (cole2012revisiting, ) pointed out a missing part of the analysis in (frigo2009cache, ): the additional cache misses incurred by accessing the execution stacks after a successful steal. They carefully studied this problem, and showed that in most cases, this additional cost is asymptotically bounded by other terms (so the bounds are the same as in (frigo2009cache, )). In other cases, this can lead to a small overhead (e.g., matrix multiplication in row-major format). The authors of (cole2012revisiting, ) also extended the set of applicable algorithms and showed tighter parallel cache bounds for problems such as FFT and list ranking. For the algorithms in this paper, we assume the matrices are in bit-interleaved format (Frigo99, ), so the algorithms incur no asymptotic cost for accessing the execution stacks after steals. Even if this were not the case, we note that none of the algorithms in Table 1 requires access to the cactus stack (many later RWS implementations chose not to support it, for better practicality).

In this paper, we do not consider the additional cost of false sharing (cole2013analysis, ) or other schedulers (cole2017bounding, ; yang2018scheduling, ), but it seems possible to extend the analysis in this paper to the other settings. We leave this as future work.

### 5.2. Our Approach

As opposed to directly analyzing the algorithms on the scheduler as in previous work (frigo2009cache, ; cole2012revisiting, ; cole2013analysis, ), our key observation is to directly study the computation structure of these algorithms. Interestingly, our analysis is mostly independent of the RWS scheduler, and only plugs in some results from (frigo2009cache, ) for basic primitives such as matrix multiplication. By doing so, our analysis bounds the parallel cache complexity much more tightly than the previous results.

The idea of our analysis is motivated by the recent work by Blelloch and Gu (BG2020, ). This work studies several parallel dynamic programming and algebra problems, and defines a structure called d-grid to reveal the computational structure of these problems. It uses d-grid with or to model a list of classic problems, such as matrix multiplication, to capture the memory access pattern of these problems. By using d-grid, their analysis decouples the parallel dependency (and, effectively, the span) from the sequential cache complexity in many parallel algorithms (see details in (BG2020, )). Although they only applied the d-grid analysis to sequential cache complexity, the high-level idea motivated us to revisit the analysis of parallel cache complexity, and inspired us to directly analyze the essence of the computation structure (the recurrences) of the algorithms. This effectively avoids the crux of the previous analyses (Acar02, ; frigo2009cache, ; cole2012revisiting, ; cole2013analysis, ), which incur a polynomial overhead in the cache complexity charged to the span of the algorithm. In all of our analyses, the span of these algorithms, whether linear or super-linear, does not show up in the analysis, which is very different from previous work. Of course, a larger span does lead to more parallel cache overhead, since it increases the number of steal attempts and each successful one incurs at least one additional cache miss. However, in all applications in this paper, this term is bounded by either the sequential bound or the main term for parallel overhead. Putting everything together, we summarize all cache bounds in Table 1; our new parallel cache bounds for linear and super-linear span algorithms are almost as good as those of the low-span algorithms (e.g., matrix multiplication).

As mentioned, our analysis will use previous results to derive the parallel cache bounds for d-grids, and use the recurrence relation to bound the entire algorithm. We formalize the recurrences we study for these algorithms and problems, which we refer to as the $(\alpha,\beta,k,l,m)$-recurrence.

###### Definition 5.1 ((α,β,k,l,m)-recurrence).

An $(\alpha,\beta,k,l,m)$-recurrence is a recurrence in the following form:

$$Q(n) = \alpha \cdot Q(n/\beta) + \sum_i k_i \cdot n^{l_i} \log^{m_i} n$$

where and are non-negative numbers, and is a function of , and .

In the next section, we will use Kleene’s algorithm as an example to show an instantiation of this recurrence relation. The $(\alpha,\beta,k,l,m)$-recurrence is easy to solve using the master method (bentley1980general, ):

###### Theorem 5.2 (Main Theorem).

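The solution to $Q(n)$, an $(\alpha,\beta,k,l,m)$-recurrence, is:

$$O\left(\sum_i k_i \cdot n^{l_i} \log^{m_i} n + \sum_j k_j \cdot n^{l_j} \log^{m_j+1} n + \sum_r k_r \cdot n^{\log_\beta \alpha}\right)$$

for $l_i > \log_\beta \alpha$, $l_j = \log_\beta \alpha$, and $l_r < \log_\beta \alpha$.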
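The case analysis behind the main theorem can be sketched mechanically. The snippet below is only an illustration under stated assumptions: the function name `solve_recurrence` and the representation of each driving term as a `(k, l, m)` triple (standing for `k * n**l * log(n)**m`, with `k` kept as an opaque label) are hypothetical choices, and the three branches follow the standard master-method cases, comparing each exponent `l` with `log_beta(alpha)`.

```python
import math
from fractions import Fraction


def solve_recurrence(alpha, beta, terms):
    """Solve Q(n) = alpha * Q(n / beta) + sum of k * n**l * log(n)**m.

    Each term is a (k, l, m) triple; the result lists the (k, l, m)
    triples of the asymptotic solution, one per input term.
    """
    crit = math.log(alpha, beta)  # critical exponent log_beta(alpha)
    solution = []
    for k, l, m in terms:
        if math.isclose(l, crit):
            solution.append((k, l, m + 1))  # balanced: one extra log factor
        elif l > crit:
            solution.append((k, l, m))      # root-dominated: term unchanged
        else:
            solution.append((k, crit, 0))   # leaf-dominated: k * n**crit
    return solution


# Driving terms of the parallel matrix-multiplication bound, constants dropped.
mm_terms = [
    ("1/(B*sqrt(M))", 3, 0),
    ("P^(1/3)/B", 2, Fraction(2, 3)),
    ("P", 0, 2),
]

# Kleene's algorithm uses alpha = 2 and beta = 2.
for term in solve_recurrence(2, 2, mm_terms):
    print(term)
```

With `alpha = 2` and `beta = 2` the first two terms are unchanged and the last becomes `P * n`. The sketch applies the theorem only; it does not perform the substitution of the call-stack term discussed at the end of this section.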

As shown here and in the next section, the parallel dependencies of the computation (the algorithm’s span) do not show up in the analysis and the solution, which is different from the previous analyses (Acar02, ; frigo2009cache, ; cole2012revisiting, ; cole2013analysis, ).

In the rest of this section, we first use Kleene’s algorithm as an example to show how our approach derives a tighter parallel cache bound. We then list the cache-oblivious algorithms to which Theorem 5.2 applies and for which it yields improved bounds.

It is worth mentioning that the cache bound contains the term for the call stack of the (recursive) subproblems. This term is a constant in the sequential bound, and in many cases the parallel term is the same as the number of steals (e.g., for matrix multiplication and Kleene’s algorithm). In other cases, directly applying Theorem 5.2 leads to a term, which is suboptimal since Cole and Ramachandran (cole2012revisiting, ) showed that this term can be . Hence, when , instead of using from Theorem 5.2, we plug in the term from (cole2012revisiting, ), which gives a tighter result.

### 5.3. Kleene’s Algorithm as an Example

To start with, we use Kleene’s algorithm for all-pair shortest-paths (APSP) as an example to explain the analysis. Kleene’s algorithm solves the APSP problem on an input graph with no negative cycles. The algorithm was first mentioned in (kleene1951representation, ; munro1971efficient, ; fischer1971boolean, ; furman1970application, ), and later discussed in full detail in (Aho74, ). It is a divide-and-conquer algorithm that is I/O-efficient, cache-oblivious and highly parallel. The pseudocode of Kleene’s algorithm is in Algorithm 2.

In Kleene’s algorithm, the graph is represented as the matrix , where is the weight of the edge between vertices and (the weight is if the edge does not exist). is partitioned into 4 submatrices indexed as . The matrix multiplication is defined in a closed semi-ring with . The high-level idea is first to compute the APSP between the first half of the vertices only using the paths between these vertices. Then by applying some matrix multiplication, we update the shortest paths between the second half of the vertices using the computed distances from the first half. We then apply another recursive subtask on the second half vertices. The computed distances are finalized, and then we use them to update the shortest paths from the first-half vertices.
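The high-level description above can be turned into a short sequential sketch. This is not the paper's Algorithm 2: the helper names `minplus_update` and `kleene` are hypothetical, the matrix is a plain list of lists, the size is assumed to be a power of two, and the matrix "multiplication" is the (min, +) product with the result folded into the target block by taking minima. The two recursive calls and six block products match the shape of the cache recurrence used in this section.

```python
INF = float("inf")


def minplus_update(A, c, a, b, n):
    """In place: C = min(C, X (min,+) Y) on n-by-n blocks of A.

    c, a and b are the (row, column) offsets of blocks C, X and Y.
    """
    (cr, cc), (ar, ac), (br, bc) = c, a, b
    # Build the result first so that C may safely alias X or Y.
    out = [[min(A[cr + i][cc + j],
                min(A[ar + i][ac + k] + A[br + k][bc + j] for k in range(n)))
            for j in range(n)] for i in range(n)]
    for i in range(n):
        A[cr + i][cc:cc + n] = out[i]


def kleene(A, r=0, n=None):
    """All-pairs shortest paths on the n-by-n diagonal block of A at offset r.

    Assumes n is a power of two and the graph has no negative cycles.
    """
    if n is None:
        n = len(A)
    if n == 1:
        A[r][r] = min(A[r][r], 0)
        return
    h = n // 2
    b00, b01, b10, b11 = (r, r), (r, r + h), (r + h, r), (r + h, r + h)
    kleene(A, r, h)                      # APSP within the first half
    minplus_update(A, b01, b00, b01, h)  # first half -> second half
    minplus_update(A, b10, b10, b00, h)  # second half -> first half
    minplus_update(A, b11, b10, b01, h)  # second half via the first half
    kleene(A, r + h, h)                  # APSP within the second half
    minplus_update(A, b01, b01, b11, h)  # finalize first -> second
    minplus_update(A, b10, b11, b10, h)  # finalize second -> first
    minplus_update(A, b00, b01, b10, h)  # first half via the second half


A = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
kleene(A)
for row in A:
    print(row)
```

In a parallel implementation the block products between the two recursive calls would be the parallel matrix multiplications; here they run sequentially to keep the sketch short.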

The cache complexity and span of this algorithm follow the recurrence relations:

$$Q(n) = 2Q(n/2) + 6Q_{\mathrm{MM}}(n/2) \tag{1}$$

$$D(n) = 2D(n/2) + 2D_{\mathrm{MM}}(n/2) \tag{2}$$

where is the I/O cost of a matrix multiplication of input size . Note that the recurrence relation for the cache complexity is true no matter if we are considering the sequential case (e.g., and ) or the parallel case (e.g., and ). For the parallel matrix multiplication algorithm from (BG2020, ), we have , , , and .

If we directly use the result from (Acar02, ), then we get the parallel cache bound . With the significant growth of processor counts and cache sizes, the term dominates unless is very large. The tighter bound from (frigo2009cache, ) shows . This bound is tighter than the previous one from (Acar02, ), but the term still dominates unless , which is unlikely in practice. The parallel lower bound for this computation (ballard2014communication, ; BG2020, ) is , so a polynomial gap remains between the lower and upper cache bounds. Our analysis narrows this gap to a polylogarithmic factor.

Now we use Theorem 5.2 to directly solve this -recurrence. Equation 4 includes the cache complexity of matrix multiplication. The parallel bound on processors, based on the algorithm from (BG2020, ), is:

$$Q_{\mathrm{MM},P} = O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{2/3} n}{B} + P\log^2 n\right), \tag{3}$$

which can be shown by the analysis from (frigo2009cache, ; cole2012revisiting, ). Now we can plug in Equation 5 to Equation 4, and get an -recurrence for . In this case, we have , and . Plugging in Theorem 5.2 directly gives the solution of:

$$Q_P(n) = O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{2/3} n}{B} + Pn\right).$$

In this case, the input size is so the corresponding term . Hence, even though Kleene’s algorithm has linear span as opposed to the polylogarithmic span for matrix multiplication, the additional steals caused by the span will not affect the input term (the term in this case for Kleene’s algorithm and matrix multiplication). In fact, as one can see, neither the span bound nor the span recurrence (Equation 2) shows up anywhere in the analysis.

Since Kleene’s algorithm is very simple, we can also show how to directly solve the recurrence by plugging Equation 5 in Equation 4. We believe this can illustrate a more intuitive idea of our analysis.

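$$
\begin{aligned}
Q(n) &= 6Q_{\mathrm{MM}}(n/2) + 12Q_{\mathrm{MM}}(n/4) + \cdots + 3n \cdot Q_{\mathrm{MM}}(1) \\
&= O\left(\frac{6n^3}{8B\sqrt{M}} + \frac{12n^3}{64B\sqrt{M}} + \cdots\right) + O\left(\frac{6P^{1/3} n^2 \log^{2/3} n}{4B} + \frac{12P^{1/3} n^2 \log^{2/3} n}{16B} + \cdots\right) \\
&\quad + O\left(P\log^2 n + 2P\log^2(n/2) + \cdots + Pn\right) \\
&= O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{2/3} n}{B} + Pn\right).
\end{aligned}
$$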

Here the terms in the first two big-Os are decreasing geometrically, while the last term increases geometrically. The main term for parallel overhead is only a polylogarithmic factor () more than the lower bound, as opposed to a polynomial factor () in the previous terms.
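The geometric behavior of the three sums can be checked numerically. The sketch below is an illustration only: the function names `unrolled_terms` and `closed_form_terms` are hypothetical, every hidden constant in the matrix-multiplication bound is assumed to be 1, logarithms are taken base 2, and the parameter values are arbitrary. It sums each term over all levels of the recursion tree and compares the result with the corresponding term of the closed form.

```python
import math


def unrolled_terms(n, P, B, M):
    """Sum the three terms of 6 * Q_MM over all levels of the recursion.

    Level i has 2**i subproblems, each calling matrix multiplication on
    inputs of size n / 2**(i + 1). Hidden constants are taken to be 1.
    """
    seq = par = steal = 0.0
    size, count = n // 2, 1
    while size >= 1:
        lg = math.log2(size)
        seq += 6 * count * size**3 / (B * math.sqrt(M))
        par += 6 * count * P**(1 / 3) * size**2 * lg**(2 / 3) / B
        steal += 6 * count * P * lg**2
        size, count = size // 2, count * 2
    return seq, par, steal


def closed_form_terms(n, P, B, M):
    """The three terms of the claimed bound, again with constant 1."""
    lg = math.log2(n)
    return (n**3 / (B * math.sqrt(M)),
            P**(1 / 3) * n**2 * lg**(2 / 3) / B,
            P * n)


n, P, B, M = 2**10, 64, 8, 2**16
ratios = [u / c for u, c in zip(unrolled_terms(n, P, B, M),
                                closed_form_terms(n, P, B, M))]
print(ratios)
```

Under these assumptions each ratio stays below a small constant, which is what the final big-O expression states: the first two sums are dominated by their first level, and the last one by the leaves.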

Although Kleene’s algorithm can also be directly analyzed as shown above, using Theorem 5.2 enables a simpler way to show a tighter bound of Kleene’s algorithm than previous analysis. More importantly, for many algorithms that are more complicated than Kleene’s algorithm, it is nearly impossible to show new bounds by directly plugging in the recurrences, in which case using Theorem 5.2 easily enables simple analysis and tighter bounds. We will then present these algorithms and our new analysis and bound in Section 5.4.

To start with, we use Kleene’s algorithm for all-pair shortest-paths (APSP) as an example to explain the analysis. Kleene’s algorithm is very simple, so even without using Theorem 5.2, it is not hard to show the parallel cache bound in Table 1. However, this is not the case for many complicated algorithms later discussed in Section 5.4, as many of them are given in the recent paper by (BG2020, ), and their analysis is much harder without the definitions and theorems in Section 5.2.

The cache complexity and span of this algorithm follow the recurrences:

$$Q(n) = 2Q(n/2) + 6Q_{\mathrm{MM}}(n/2) \tag{4}$$

$$D(n) = 2D(n/2) + 2D_{\mathrm{MM}}(n/2)$$

where is the I/O cost of a matrix multiplication of input size . Using the parallel matrix multiplication algorithm from (BG2020, ), we have , , , and .

If we directly use the result from (Acar02, ), then we get the parallel cache bound . With the significant growth of processor counts and cache sizes, the term dominates unless is very large. The tighter bound from (frigo2009cache, ; cole2012efficient; cole2012revisiting, ) shows . This bound is tighter than the previous one from (Acar02, ), but the term still dominates unless , which is unlikely in practice. The parallel lower bound for this computation (ballard2014communication, ; BG2020, ) is , so a polynomial gap remains between the lower and upper cache bounds. Our analysis narrows this gap to a polylogarithmic factor.

As discussed in Section 5.2, our analysis will first use the bound from (frigo2009cache, ; cole2012efficient; cole2012revisiting, ) on the parallel matrix multiplication algorithm from (BG2020, ), which is

$$Q_{\mathrm{MM},P} = O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{2/3} n}{B} + P\log^2 n\right). \tag{5}$$

Since Kleene’s algorithm is very simple, we first directly solve the recurrence and show the result by plugging Equation 5 in Equation 4.

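$$
\begin{aligned}
Q(n) &= 6Q_{\mathrm{MM}}(n/2) + 12Q_{\mathrm{MM}}(n/4) + \cdots + 3n \cdot Q_{\mathrm{MM}}(1) \\
&= O\left(\frac{6n^3}{8B\sqrt{M}} + \frac{12n^3}{64B\sqrt{M}} + \cdots\right) + O\left(\frac{6P^{1/3} n^2 \log^{2/3} n}{4B} + \frac{12P^{1/3} n^2 \log^{2/3} n}{16B} + \cdots\right) \\
&\quad + O\left(P\log^2 n + 2P\log^2(n/2) + \cdots + Pn\right) \\
&= O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{2/3} n}{B} + Pn\right).
\end{aligned}
$$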

Here the terms in the first two big-Os are decreasing geometrically, while the last term increases geometrically. The main term for parallel overhead is only a polylogarithmic factor () more than the lower bound, as opposed to a polynomial factor () in the previous terms.

Now we show how to use Theorem 5.2 to directly solve this -recurrence. For , we have , and . Plugging in Theorem 5.2 directly gives the above solution.

In this case, the input size is so the corresponding term . Hence, even though Kleene’s algorithm has linear span as opposed to the polylogarithmic span for matrix multiplication, the additional steals caused by the span will not affect the input term (the term in this case for Kleene’s algorithm and matrix multiplication).

### 5.4. Other Applications

We have shown the main theorem (Theorem 5.2) and the intuition for why it leads to better parallel cache complexity for Kleene’s algorithm. We now apply the theorem to a variety of classic cache-oblivious algorithms, which leads to better parallel cache bounds. The details of the algorithms can be found in (dinh2016extending, ; BG2020, ). Some of these algorithms are complicated; here we only show the recurrences and the parallel cache bounds, since those are all we need.

#### 5.4.1. Building Blocks

Before we go over the applications, we first show the parallel cache complexity of some basic primitives (d-grid and matrix transpose) that the applications use.

Matrix Multiplication (MM). Matrix multiplication is modeled as a 3d-grid in (BG2020, ). The sequential cache bound is , and the parallel bound on processors is . Here we assume the matrix is stored in the bit-interleaved (BI) format (Frigo99, ), which can be easily converted from other formats such as the row-major format with the same cost as matrix transpose.

Matrix Transpose (MT). Matrix transpose is another widely used primitive in cache-oblivious algorithms. The sequential cache bound is , and the parallel bound on processors (frigo2009cache, ; cole2012revisiting, ) is:

$$Q_{\mathrm{MT},P} = O\left(\frac{n^2}{B} + P\log^2 n\right). \tag{6}$$

The 2d-grid. The 2d-grid can be viewed as an analog of matrix multiplication, but is a 2 dimensional computation instead of 3 dimensional as in MM ( arithmetic operations and memory accesses). It can also be viewed as a matrix-vector multiplication but the matrix is implicit. The 2d-grid is a commonly used primitive in dynamic programming algorithms (BG2020, ). The sequential cache bound is , and the parallel bound on processors (frigo2009cache, ; cole2012revisiting, ) is

$$Q_{\mathrm{2D},P} = O\left(\frac{n^2}{BM} + \frac{P^{1/2} n \log n}{B} + P\log^2 n\right). \tag{7}$$

#### 5.4.2. Gaussian Elimination

Here we consider the parallel divide-and-conquer Gaussian elimination algorithm shown in (BG2020, ), with the recurrence of the cache bound as . Compared to Kleene’s algorithm (Equation 4), this recurrence only differs by a constant. Hence, the parallel cache bound is asymptotically the same as Kleene’s algorithm.

#### 5.4.3. Triangular System Solver

The triangular system solver (TRS) solves the linear system that takes the output of Gaussian elimination (i.e., where is an upper triangular matrix). We consider the parallel divide-and-conquer algorithm for a triangular system solver from (BG2020, ) with cubic work and linear span. The cache bound is:

$$Q_{\mathrm{TRS}}(n) = 4Q_{\mathrm{TRS}}(n/2) + 2Q_{\mathrm{MM}}(n/2).$$

To analyze the parallel cache complexity, we can plug in Equation 5 and get the -recurrence with , , and . Applying and Theorem 5.2 leads to the parallel cache complexity as

$$Q_{\mathrm{TRS},P} = O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{5/3} n}{B} + Pn\right). \tag{8}$$
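As a sanity check on Equation 8, the three master-method cases can be applied term by term. This is only a sketch: constants are dropped, and the treatment of the last term relies on the remark at the end of Section 5.2 rather than on the recurrence alone.

$$
\begin{aligned}
Q_{\mathrm{TRS},P}(n) &= 4\,Q_{\mathrm{TRS},P}(n/2) + O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{2/3} n}{B} + P\log^2 n\right), \qquad \log_2 4 = 2, \\
\frac{n^3}{B\sqrt{M}} &: \quad 3 > 2 \;\Rightarrow\; \frac{n^3}{B\sqrt{M}} \quad \text{(unchanged)}, \\
\frac{P^{1/3} n^2 \log^{2/3} n}{B} &: \quad 2 = 2 \;\Rightarrow\; \frac{P^{1/3} n^2 \log^{5/3} n}{B} \quad \text{(one extra log factor)}, \\
P\log^2 n &: \quad 0 < 2 \;\Rightarrow\; P\,n^{\log_2 4} = Pn^2 \quad \text{(from the recurrence alone)}.
\end{aligned}
$$

The first two lines reproduce the first two terms of Equation 8. The last term of Equation 8 is $Pn$ rather than $Pn^2$, which is consistent with replacing the call-stack term by the tighter one from (cole2012revisiting, ), as described in Section 5.2.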

#### 5.4.4. Cholesky Factorization and LU Decomposition

Both Cholesky factorization and LU decomposition are widely used linear algebraic tools that decompose a matrix into the product of a lower triangular matrix and an upper triangular matrix. The divide-and-conquer algorithms for Cholesky factorization and LU decomposition (dinh2016extending, ) are quite similar in that both are built on top of the triangular system solver and matrix multiplication. The cache bounds for both algorithms are:

$$Q(n) = 2Q(n/2) + Q_{\mathrm{TRS}}(n/2) + O(1) \cdot Q_{\mathrm{MM}}(n/2).$$

We can plug in Equation 5 and Equation 8 to get the -recurrence with , , and . Since , the parallel bound is almost the same as , except that the span for these algorithms is , which increases the last term by a logarithmic factor. Hence for these two problems, we have:

$$Q_P = O\left(\frac{n^3}{B\sqrt{M}} + \frac{P^{1/3} n^2 \log^{5/3} n}{B} + Pn\log n\right).$$

#### 5.4.5. LWS Recurrence

The LWS (least-weighted subsequence) recurrence (hirschberg1987least, ) is one of the most commonly-used DP recurrences in practice. Given a real-valued function for integers and , for ,

 Dj=min0≤i

This recurrence is widely used in real-world applications (kleinberg2006algorithm, ; knuth1981breaking, ; galil1992dynamic, ; galil1994parallel, ; aggarwal1990applications, ; kunnemann2017fine, ). Here we assume that can be computed in constant work based on a constant size of input associated to and , which is true for all these applications.

Here we consider the parallel divide-and-conquer algorithm to solve LWS recurrence from (BG2020, ) with quadratic work and linear span. This algorithm partitions the problems into two halves, solves the first one, applies a 2d-grid computation, and solves the second one. The cache bound is .
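The structure "solve the first half, apply a 2d-grid update, solve the second half" can be sketched sequentially as follows. This is an illustration rather than the algorithm from (BG2020, ): it assumes the standard form of the LWS recurrence, in which `D[j]` is the minimum over `i < j` of `D[i] + w(i, j)` with `D[0]` given, and the names `lws_divide_and_conquer`, `w` and `d0` are hypothetical.

```python
def lws_divide_and_conquer(n, w, d0=0):
    """Compute D[0..n] for the LWS recurrence by divide and conquer.

    w(i, j) is the cost function for 0 <= i < j <= n and d0 is D[0].
    """
    INF = float("inf")
    D = [INF] * (n + 1)
    D[0] = d0

    def solve(lo, hi):
        # Precondition: every candidate D[i] + w(i, j) with i < lo has
        # already been applied to D[j] for all j in [lo, hi].
        if lo == hi:
            return  # D[lo] is final
        mid = (lo + hi) // 2
        solve(lo, mid)  # finalize the first half
        # 2d-grid update: first-half values relax second-half values.
        for j in range(mid + 1, hi + 1):
            for i in range(lo, mid + 1):
                D[j] = min(D[j], D[i] + w(i, j))
        solve(mid + 1, hi)  # finalize the second half

    solve(0, n)
    return D


D = lws_divide_and_conquer(8, lambda i, j: (j - i) ** 2 + 1)
print(D)
```

The two recursive calls and the single grid update per level give the same shape as the cache recurrence stated above. In the parallel algorithm the grid update is the part that runs in parallel, whereas here it is a plain double loop.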

Here, by using Equation 7, the -recurrence has , , and . Since, , so the parallel cache bound is:

$$Q_P(n) = O\left(\frac{n^2}{BM} + \frac{P^{1/2} n \log^2 n}{B} + Pn\right).$$

#### 5.4.6. GAP Recurrence

The GAP problem (galil1989speeding, ; galil1994parallel, ) is a generalization of the edit distance problem that has many applications in molecular biology, geology, and speech recognition. Given a source string and a target string , an “edit” can be a sequence of consecutive deletes corresponding to a gap in , and a sequence of consecutive inserts corresponding to a gap in . Let be the cost of deleting the substring of from -th to -th character, be inserting the substring of accordingly, and be the cost to change the -th character in to -th character in .

Let be the minimum cost for such transformation from the prefix of with characters to the prefix of with characters, the recurrence for is:

 Di,j=min⎧⎪⎨⎪⎩min0≤q

corresponding to either replacing a character, inserting or deleting a substring. The best parallel divide-and-conquer algorithm to compute the GAP recurrence was proposed by Blelloch and Gu (BG2020, ). The cache bound recurrence of the algorithm in (BG2020, ) is , which includes 4 subproblems of half size, a linear number of 2d-grids (see more details in (BG2020, )), and 2 matrix transpose calls.

To derive parallel cache complexity, we can apply Equation 6 and Equation 7 and get the -recurrence with , , and . Then using Theorem 5.2 gives and

$$Q_P(n) = O\left(\frac{n^3}{BM} + \frac{P^{1/2} n^2 \log^2 n}{B} + Pn^{\log_2 3}\right).$$

#### 5.4.7. RNA recurrence

The RNA problem (galil1994parallel, ) is a generalization of the GAP problem. In this problem, a weight function is given, which is the cost to delete the substring of from -th to -th character and insert the substring of from -th to -th character. Similar to GAP, let be the minimum cost for such transformation from the prefix of with characters to the prefix of with characters, the recurrence for is:

 Di,j=min0≤p

This recurrence is widely used in computational biology, for example to compute the secondary structure of RNA (waterman1978rna, ). The RNA recurrence can be viewed as a 2d version of the LWS recurrence, and the latest algorithm from (BG2020, ) has the cache bound of .

For parallel cache complexity, the -recurrence by plugging in Equation 7 is , , and . The parallel cache bound can be solved as:

$$Q_P(n) = O\left(\frac{n^4}{BM} + \frac{P^{1/2} n^2 \log^2 n}{B} + Pn^{\log_2 3}\right).$$

#### 5.4.8. Protein Accordion Folding

The recurrence for protein accordion folding (tithi2015high, ) is:

 Di,j=max1≤k

for , with