Keywords
Data structures, parallel programs, dictionaries, comparison-based search, distribution-sensitive algorithms
Abstract
In this paper [Footnote: This is the authors' version of a paper submitted to the 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '18). It is posted here by permission of ACM for your personal or classroom use. Not for redistribution. The definitive version can be found at https://doi.org/10.1145/3210377.3210390. © 2018 Copyright is held by the owner/author(s). Publication rights licensed to ACM.] we present two versions of a parallel working-set map on $p$ processors that supports searches, insertions and deletions. In both versions, the total work of all operations, when the map has size at least $p$, is bounded by the working-set bound; i.e., the cost of an item depends on how recently it was accessed (for some linearization): accessing an item in the map with recency $r$ takes $O(\log r)$ work. In the simpler version, each map operation has span polylogarithmic in $p$ and in $n$ (where $n$ is the maximum size of the map). In the pipelined version, the span of a map operation on an item with recency $r$ depends on $r$ rather than on the maximum map size. (Operations in parallel may have overlapping spans; span is additive only for operations in sequence.)
Both data structures are designed to be used by a dynamic multithreading parallel program that at each step executes a unit-time instruction or makes a data structure call. To achieve the stated bounds, the pipelined version requires a weak-priority scheduler, which supports a limited form of 2-level prioritization. At the end we explain how the results translate to practical implementations using work-stealing schedulers.
To the best of our knowledge, this is the first parallel implementation of a self-adjusting search structure where the cost of an operation adapts to the access sequence. A corollary of the working-set bound is that the structure achieves static optimality with respect to work: the total work is bounded by the total access cost in an optimal static search tree.
1 Introduction
Map (or dictionary) data structures, such as binary search trees and hash tables, support inserts, deletes and searches/updates (collectively referred to as accesses) and are among the most used and studied data structures. In the comparison model, balanced binary search trees such as AVL trees and red-black trees provide a worst-case cost of $O(\log n)$ per access for a tree with $n$ items. Other kinds of balanced binary trees provide probabilistic or amortized performance guarantees, such as treaps and weight-balanced trees.
Self-adjusting maps, such as splay trees [37], adapt their internal structure to the sequence of operations to achieve better performance bounds that depend on various properties of the access pattern (see [20] for a hierarchical classification). Many of these data structures make it cheaper to search for recently accessed items (temporal locality) or items near previously accessed items (spatial locality). For instance, the working-set structure described by Iacono in [29] has the working-set property (which captures temporal locality): it takes $O(\log r)$ time per operation with access rank $r$ (Definition 1), so its total cost satisfies the working-set bound (Definition 2).
Definition 1 (Access Rank).
Define the access rank of an operation in a sequence $S$ of operations on a map $M$ as follows. The access rank of a successful search for item $x$ is the number of distinct items in $M$ that have been searched for or inserted since the last prior operation on $x$ (including $x$ itself). The access rank of an insertion, deletion or unsuccessful search is always $n$, where $n$ is the current size of $M$.
Definition 2 (Working-Set Bound).
Given any sequence $S$ of map operations, we shall use $W(S)$ to denote the working-set bound for $S$, defined by $W(S) = \sum_i \log_2(r_i + 2)$, where $r_i$ is the access rank of the $i$th operation in $S$ when $S$ is performed on an empty map.
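To make Definitions 1 and 2 concrete, here is a small Python sketch (ours, not from the paper) that computes access ranks via a move-to-front list and sums a logarithmic cost per operation; the exact cost function `log2(r + 2)` is illustrative, standing in for the $\Theta(\log r)$ terms in the bound.

```python
import math

def access_ranks(ops):
    """ops: list of ("search"|"insert"|"delete", key) pairs.
    Returns the access rank of each operation (Definition 1), using a
    move-to-front list: a successful search's rank is the number of distinct
    items searched for or inserted since the last operation on it (counting
    itself); other operations have rank equal to the current map size."""
    recency = []                              # front = most recently accessed
    ranks = []
    for op, key in ops:
        if op == "search" and key in recency:
            r = recency.index(key) + 1        # +1 counts the item itself
            recency.remove(key)
            recency.insert(0, key)            # pull the item to the front
        else:
            r = len(recency)                  # current size of the map
            if op == "insert" and key not in recency:
                recency.insert(0, key)
            elif op == "delete" and key in recency:
                recency.remove(key)
        ranks.append(r)
    return ranks

def working_set_bound(ops):
    """Illustrative working-set bound: sum of log2(rank + 2) (Definition 2)."""
    return sum(math.log2(r + 2) for r in access_ranks(ops))
```

For example, after inserting `a` and `b`, searching for `a` has rank 2 (both `b` and `a` were touched since the last operation on `a`), and searching for it again immediately has rank 1.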
Parallel search structures
Our goal in this paper is to design efficient selfadjusting search structures that can be used by parallel programs. Even designing a nonadjusting parallel or concurrent search structure is quite challenging, and there has been a lot of research on the topic.
There are basically two approaches. In the concurrent computing world, processes independently access the data structure using some variety of concurrency control (e.g., locks) to prevent conflicts. In the parallel computing world, data structures are designed to support executing operations in parallel, either individually or in batches.
For example, in the concurrent computing world, Ellen et al. [19] show how to design a non-blocking binary search tree, with later work generalizing this technique [13] and analyzing the amortized complexity [18]. However, these data structures do not maintain balance in the tree (i.e., the height can get large) and their cost depends on the number of concurrent operations.
An alternate approach (that bears some similarity to the implicit batching that we use) is software combining [22, 28, 32], where each processor inserts a request in a shared queue and a single processor sequentially executes all outstanding requests later. These works provide empirical efficiency but not worst-case bounds.
Another notable example is the CBTree [1, 2], a concurrent splay tree that achieves surprisingly good performance in real experiments, leading to an interesting hypothesis that self-adjustment may be even more valuable (in practice) in concurrent settings than in sequential ones. However, the CBTree does not guarantee that it maintains the proper 'frequency balance', and hence does not provide the guarantees of a splay tree (despite much experimental success).
In the parallel computing world, there are several classic results in the PRAM model. Paul et al. [34] devised a parallel 2-3 tree such that $p$ synchronous processors can perform a batch of $p$ operations on a parallel 2-3 tree of size $n$ in $O(\log n + \log p)$ time. Blelloch et al. [8] show how pipelining can be used to increase the parallelism of tree operations. Also, (batched) parallel priority queues [12, 15, 16, 36] have been utilized to give efficient parallel algorithms, e.g., for shortest paths and minimum spanning trees [12, 16, 33].
More recently, in the dynamic multithreading model, there have been several elegant papers on parallel treaps [9] and on parallelizing a variety of binary search trees [7] supporting unions and intersections, as well as work on batch-parallel search trees with optimal work and span [4]. Other batch-parallel search trees include red-black trees [23] and weight-balanced B-trees [21]. (We are unaware of any batched self-adjusting data structures.)
And yet, such concurrent/parallel map data structures can be difficult to use; the programmer cannot simply treat one as a black box and invoke atomic map operations on it from within an ordinary parallel program. Instead, she must carefully coordinate access to the map.
Implicit batching
Recently, Agrawal et al. [3] introduced the idea of implicit batching. Here, the programmer writes a parallel program that uses a black box data structure, treating calls to the data structure as basic operations. In addition, she provides a data structure that supports batched operations (e.g., search trees in [9, 7]). The runtime system automatically stitches these two components together, ensuring efficient running time by creating batches on the fly and scheduling them appropriately. This idea of implicit batching provides an elegant solution to the problem of parallel search trees.
Our goals
Our goal is to extend the idea of implicit batching to self-adjusting data structures — and more generally, to explore the feasibility of the implicit batching approach for a wider class of problems. In [3], they show how to apply the idea to uniform-cost data structures (where every operation has the same cost). [Footnote: They also provide some bounds for amortized data structures, where queries do not modify the data structure.] In a self-adjusting structure, some operations are much cheaper than others, and additionally every operation may modify the data structure (unlike, say, AVL/red-black trees, where searches have no effect on the structure), which makes parallelization much harder.
We present in this paper, to the best of our knowledge, the first parallel self-adjusting search structure that is distribution-sensitive with worst-case guarantees. In particular, we design two versions of a parallel map whose total work is essentially bounded by the working-set bound (Definition 2) for some linearization of the operations (one that respects the dependencies between them).
Parallel Programming Model
The parallel data structures in this paper can be used by any parallel program expressed through dynamic multithreading (see [14, Ch. 27]), which is the model underlying many parallel languages and libraries, such as Cilk dialects [24, 30], Intel TBB [35], the Microsoft Task Parallel Library [38] and subsets of OpenMP [31]. The programmer expresses algorithmic parallelism through parallel programming primitives such as fork/join (also spawn/sync), parallel loops and synchronized methods, and does not provide any mapping from subcomputations to processors.
These types of programs are typically scheduled using a greedy scheduler [11, 27] or a nearly greedy scheduler such as a work-stealing scheduler (e.g., [10]) provided by the runtime system. A greedy scheduler guarantees that at each time step, if there are $k$ available tasks, then $\min(k, p)$ of them are completed.
We analyze our two data structures in the context of a greedy scheduler and a weak-priority scheduler, respectively. A weak-priority scheduler has two priority levels; at each step, at least half the processors greedily choose high-priority tasks before low-priority tasks, so if there are at most $p/2$ high-priority tasks then all of them are executed. We discuss in Section 8 how to adapt these results for work-stealing schedulers.
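To make these scheduling guarantees concrete, here is a toy single-step model (our sketch; the function names and the exact tie-breaking rule are our assumptions, not from the paper):

```python
def greedy_step(num_ready, p):
    """One greedy-scheduler step: if k tasks are ready, min(k, p) complete."""
    return min(num_ready, p)

def weak_priority_step(num_hi, num_lo, p):
    """One step of a weak-priority scheduler with two priority levels:
    at least ceil(p/2) processors greedily take high-priority tasks first,
    so if num_hi <= p/2 then ALL high-priority tasks run this step; the
    remaining processors greedily take whatever tasks are left."""
    half = (p + 1) // 2                       # processors preferring hi tasks
    hi_done = min(num_hi, half)
    remaining = p - hi_done
    lo_done = min(num_lo, remaining)
    extra = min(num_hi - hi_done, remaining - lo_done)  # idle ones stay greedy
    return hi_done + extra, lo_done
```

Note that the step is still greedy overall: no processor idles while a task of either priority is available.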
2 Main Results
We present two parallel working-set maps that can be used with any parallel program $P$, whose actual execution is captured by a program DAG $D$: each node is a unit-time instruction or a call to some data structure $M$ (called an M-call) that blocks until the answer is returned, and each edge represents a dependency due to the parallel programming primitives. Let $T_1$ be the total number of nodes in $D$, and $T_\infty$ be the number of nodes on the longest path in $D$.
Both designs take work nearly proportional to the working-set bound $W(L)$ (Definition 2) for some legal linearization $L$ of the M-calls, while having good parallelism. (We assume that each key comparison takes $O(1)$ steps.)
The first design, which we call the simple working-set map, is a batched (non-pipelined) data structure.
Theorem 3 (Simple Map Performance).
If $P$ uses only the simple working-set map (i.e., no other data structures), then its running time on $p$ processes using any greedy scheduler is
$$O\!\left(\frac{T_1 + W(L) + s\log p}{p} + T_\infty + M_\infty \log n\right)$$
for some linearization $L$ of the M-calls in $D$, where $M_\infty$ is the maximum number of M-calls on any path in $D$, $n$ is the maximum size of the map, and $s$ is the number of small-ops, defined as operations in $L$ that are performed on the map when its size is less than $p$.
Notice that if the simple map were replaced by an ideal concurrent working-set map (one that does the same work as the sequential working-set map if we ran the program according to linearization $L$), then running $P$ on $p$ processors according to $L$ takes worst-case $\Omega(T^*/p + T_\infty)$ time, where $T^* = T_1 + W(L)$. Also, we very likely have $s\log p = O(T^*)$ in practice (where $s$ is the number of small-ops), and so can usually ignore that term. Thus the simple map gives an essentially optimal time bound except for the "span term" $M_\infty \log n$, which adds $O(\log n)$ time per M-call along some path in $D$. In short, the parallelism of the simple map is within a factor of $O(\log n)$ of optimal.
The second design, which we call the pipelined working-set map, uses a more complex pipelined data structure as well as a weak-priority scheduler (see Section 7.2) to provide a better bound on the "span term".
Theorem 4 (Pipelined Map Performance).
If $P$ uses only the pipelined working-set map, then its running time on $p$ processes using any weak-priority scheduler is
$$O\!\left(\frac{T_1 + W(L) + s\log p}{p} + \hat{T}_\infty\right)$$
for some linearization $L$ of the M-calls in $D$, where $T_1$, $s$ and $n$ are defined as in Theorem 3, and $\hat{T}_\infty$ is the weighted span of $D$, in which each map operation is weighted by its cost according to $W$. Specifically, each map operation in $L$ with access rank $r$ is given weight $\Theta(\log r)$, and $\hat{T}_\infty$ is the maximum weight of any path in $D$.
Compared to the simple map, the "work term" is unchanged, but the "span term" for the pipelined map has no $M_\infty \log n$ component. Since running $P$ on $p$ processors according to the linearization $L$ takes worst-case $\Omega(T^*/p + \hat{T}_\infty)$ time, the pipelined map gives an essentially optimal time bound up to a small additive overhead per map operation along some path in $D$, and hence its parallelism is within a polylogarithmic factor of optimal.
3 Central Ideas
We shall now sketch the intuitive motivations behind the simple and pipelined working-set maps.
It starts with Iacono's sequential working-set structure, which consists of a sequence of balanced binary search trees $T_1, T_2, \ldots$, where tree $T_i$ contains roughly $2^{2^i}$ items and hence has height $O(2^i)$. The invariant maintained is that the most recently accessed items are in the first trees. A search for a key proceeds by searching each tree in the sequence in order until the key is found, in some tree $T_j$. After the search, the found item is moved to $T_1$, and for each $i < j$, the least recently accessed item of tree $T_i$ is moved to tree $T_{i+1}$. By the invariant, any item in the map with recency $r$ takes $O(\log r)$ time to access. Each insertion or deletion can easily be carried out in $O(\log n)$ time while preserving the invariant.
The challenge is to 'parallelize' this working-set structure while preserving the total work. The first step is to process operations in batches, using a batched search structure in place of each 'tree'.
The problem is that, if there are $k$ searches for the same item in the last tree, then according to the working-set bound these operations should take only $O(\log n + k)$ total work. But if these operations all happen in parallel and end up in the same batch, and we execute this batch naively, then each operation will go through the entire structure, leading to $\Omega(k\log n)$ work.
Therefore, in order to get the desired bound, we must combine duplicate accesses in each batch. But naively sorting a batch of $k$ operations takes $O(k\log k)$ work, which is already too much. To eliminate this cost as well, the simple map (Section 6) uses a novel entropy-sorting algorithm, and a careful analysis yields the desired work bound.
Next, we cannot simply apply the generic "implicit batching" transformation of [3] to the sequential working-set map, because the Batcher Bound (Theorem 1 in [3]) bounds the expected running time in terms of the total work done by the data structure and the worst-case span $s(p)$ of processing a batch of $p$ map operations. The problem is that $s(p)$ is $\Omega(\log n)$, because a batch with a search for an item in the last tree has span $\Omega(\log n)$.
Firstly, this means that the resulting work term would be as large as $p \cdot s(p)$ per batch, and so the Batcher Bound would be no better than that of a batched binary search tree. Secondly, the span term would also pay $\Omega(\log n)$ per batch. The simple map has the same span term, because if a cheap operation is 'blocked' by a previous batch containing an expensive operation, then the span of the cheap operation can be $\Omega(\log n)$. To reduce this, the pipelined map uses an intricate pipelining scheme (explained in Section 7) so that a cheap operation is 'blocked' less by the previous batch.
4 Parallel Computation Model
In this section, we describe how the parallel program generates an execution DAG, how we measure the cost of a given execution DAG, and issues related to the chosen memory model.
Execution DAG
The actual complete execution of $P$ is captured by the execution DAG $E$ (which may be schedule-dependent), in which each node is a unit-time instruction and the directed edges represent the underlying computation flow (as constrained by forking/joining of threads and acquiring/releasing of locks). At any point during the execution of $P$, a node in the program/execution DAG is said to be ready if all its parent nodes have been executed. An active thread is simply a ready node in $E$, while a suspended thread is (currently) a terminal node in $E$.
The program DAG $D$ captures the high-level execution of $P$, but interaction between data structure calls is only captured by the execution DAG $E$. We further assume that all the data structures are (implicitly) batched data structures, and that the number of data structures in use is bounded by some constant. To support implicit batching, each data structure call is automatically handled by a parallel buffer for that data structure. (See Appendix Section A.1.)
The execution DAG $E$ consists of core nodes and ds nodes, which are dynamically generated as follows. At the start, $E$ has a single core node, corresponding to the start of the program $P$. Each node can be a local instruction or a synchronization instruction (including fork/join and acquire/release of a lock). Each core node can also be a data structure call. When a node is executed, it may generate child nodes or terminate. A join instruction also generates edges that linearize all the join operations according to the actual execution. Likewise, simultaneous operations on a non-blocking lock generate child nodes that are linearized by edges. For a blocking lock, a release instruction generates a child node that is simply the resumed thread that next acquires the lock (if any), with an edge to it from the node corresponding to the originally suspended thread.
The core nodes are further classified into program nodes and buffer nodes. The program nodes correspond to nodes in the program DAG $D$, and they generate only program nodes, except for data structure calls. An M-call generates a buffer node, corresponding to passing the call to the parallel buffer. This buffer node generates more buffer nodes, until at some point it generates an M-node (every M-node is a ds node), corresponding to the actual operation on $M$, which passes the input batch to $M$. That M-node generates only M-nodes, except when it returns the result of some operation in the batch (generating a program node with an edge to it from the original M-call), or when it becomes ready for input (generating a buffer node that initiates flushing of the parallel buffer).

Effective Cost
We shall now precisely define the notion of effective work/span/cost for a parallel data structure used by a (terminating) parallel program.
Definition 5 (Effective Work/Span/Cost).
Take any program $P$ using a batched data structure $M$ on $p$ processors. Let $E$ be the actual execution DAG of $P$ using $M$. Then the effective work $w(M)$ taken by $M$ (as used by $P$) is the total number of ds nodes in $E$. The effective span $s(M)$ taken by $M$ is the maximum number of ds nodes on a path in $E$. And the effective cost of $M$ is $w(M)/p + s(M)$.
The effective cost has the desired property that it is subadditive across multiple parallel data structures. This implies that our results are composable with other data structures in this model, since we actually show the following for some linearization $L$:

- The simple working-set map (Theorem 12 and Theorem 13) takes effective work $O(W(L) + s\log p)$ and effective span $O(M_\infty \log n)$ (using any scheduler).

- The pipelined working-set map (Theorem 22 and Theorem 25) takes effective work $O(W(L) + s\log p)$ and effective span $O(\hat{T}_\infty)$ (using a weak-priority scheduler; Section 7.2).
Interestingly, the bound on the effective cost of the simple map is independent of the scheduler, while the effective-cost bound for the pipelined map requires a weak-priority scheduler. In addition, using any greedy scheduler, the parallel buffer for either map has effective cost (analogously defined) bounded in terms of the effective work $w$ taken by the map (Appendix Theorem 26). Therefore our main results (Theorem 3 and Theorem 4) follow from the above claims.
Memory Model
Unless otherwise stated, we work within the pointer machine model for parallel programs given by Goodrich and Kosaraju [26]. [Footnote: In short, main memory can be accessed only via pointers, which can only be stored, dereferenced or tested for equality (no pointer arithmetic).] But instead of having synchronous processors, we introduce a new, more realistic [Footnote: Exclusive reads/writes (EREW) is too strict, while concurrent reads/writes (CRCW) does not realistically model the cost of contention, as stated in [25].] QRMW model with queued read-modify-write operations (including read, write, test-and-set, fetch-and-add, compare-and-swap) as described in [17], where multiple memory requests to the same memory cell are FIFO-queued and serviced one at a time, and the processor making each memory request is blocked until the request has been serviced. Our data structures can hence be implemented and used in the dynamic multithreading paradigm.
This QRMW pointer machine model supports binary fork/join primitives. It cannot support constant-time random-access locks, but it supports non-blocking locks (try-locks), where attempts to acquire the lock are serialized but do not block. Acquiring a non-blocking lock succeeds if the lock is not currently held, and fails otherwise; releasing always succeeds. If $k$ threads concurrently access a non-blocking lock, then each access completes within $O(k)$ time steps. Non-blocking locks can be used to support activation calls to a process, where activating a process starts its execution iff it is not already executing and it is ready (some condition is satisfied); the process can optionally reactivate itself on finishing.
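As an illustration (our sketch; `Activatable` and its fields are our names, and Python's `threading.Lock` with `blocking=False` stands in for the try-lock), activation calls can be built from a non-blocking lock plus a pending flag:

```python
import threading

class Activatable:
    """Runs at most one instance of `body` at a time. activate() starts the
    body iff it is not already executing and ready() holds; on finishing,
    the body re-runs itself if another activation arrived meanwhile."""
    def __init__(self, body, ready):
        self._trylock = threading.Lock()      # the non-blocking (try-)lock
        self._pending = False
        self._body = body
        self._ready = ready

    def activate(self):
        self._pending = True
        # try-lock: fails immediately if the process is already executing
        if self._trylock.acquire(blocking=False):
            try:
                while self._pending and self._ready():
                    self._pending = False     # absorb the pending activation
                    self._body()              # run one execution of the body
            finally:
                self._trylock.release()
```

If an `activate()` loses the race for the lock, the current holder observes the pending flag and re-runs the body, so no activation is silently dropped.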
We can also implement a dedicated lock, which is a blocking lock initialized with $c$ keys for some constant $c$, such that simultaneous acquisitions must be made using distinct keys. When a thread attempts to acquire a dedicated lock, it is guaranteed to obtain the lock after at most $c - 1$ other threads that attempt to acquire the lock at the same time or later.
5 Amortized Sequential Working-Set Map
In this section we explain the amortized sequential working-set map, which is similar to Iacono's working-set structure [29] but does not move an accessed item all the way to the front. This localization of the self-adjustment is the basis for parallelizing it in the simple parallel working-set map.
The map keeps the items in a list with segments $\lambda_1, \lambda_2, \ldots$, where segment $\lambda_i$ has capacity $2^{2^i}$ and every segment is full except perhaps the last. Items in each segment are stored in both a key-map and a recency-map, each of which is a BBT (balanced binary tree), sorted by key and by recency respectively. Consider any item $x$ currently in segment $\lambda_i$. On a search for $x$: if $i = 1$ then $x$ is moved to the front (most recent; i.e., first in the recency-map) of $\lambda_1$; otherwise $x$ is moved to the front of $\lambda_{i-1}$ and the last (least recent) item of $\lambda_{i-1}$ is shifted to the front of $\lambda_i$. On a deletion of $x$, it is removed, and for each $j > i$ the first (most recent) item of $\lambda_j$ is moved to the back of $\lambda_{j-1}$. On an insertion, the item is added at the back of the last segment (if that segment is full, then it is added to a new segment).
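A minimal Python sketch of this localized scheme (ours, not from the paper; OrderedDicts with front-equals-most-recent ordering stand in for the key-map/recency-map pairs, so the logarithmic per-segment costs are not modeled):

```python
from collections import OrderedDict

class SequentialWorkingSetMap:
    """Toy model of the amortized sequential working-set map: a list of
    segments where a found item moves only to the front of the *previous*
    segment, not all the way to the front of the whole list."""
    def __init__(self):
        self.segs = [OrderedDict()]           # front of each dict = most recent

    def _cap(self, i):
        return 2 ** (2 ** (i + 1))            # doubly exponential capacities

    def _push_front(self, i, key):
        self.segs[i][key] = True
        self.segs[i].move_to_end(key, last=False)

    def search(self, key):
        for i, seg in enumerate(self.segs):
            if key in seg:
                del seg[key]
                if i == 0:
                    self._push_front(0, key)
                else:
                    self._push_front(i - 1, key)
                    # shift the least recent item of segment i-1 into segment i
                    lru, _ = self.segs[i - 1].popitem(last=True)
                    self._push_front(i, lru)
                return True
        return False

    def insert(self, key):
        if len(self.segs[-1]) >= self._cap(len(self.segs) - 1):
            self.segs.append(OrderedDict())
        self.segs[-1][key] = True             # added at the back (least recent)

    def delete(self, key):
        for i, seg in enumerate(self.segs):
            if key in seg:
                del seg[key]
                # refill: pull the most recent item of each later segment back
                for j in range(i + 1, len(self.segs)):
                    if self.segs[j]:
                        mru = next(iter(self.segs[j]))
                        del self.segs[j][mru]
                        self.segs[j - 1][mru] = True   # back of segment j-1
                return True
        return False
```

The key point is that a successful search touches only two adjacent segments, which is what makes the scheme amenable to batching and pipelining.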
We now prove an abstract lemma about a list with operations and costs that mimic the sequential map. We will later use this same lemma to analyze the simple and pipelined maps as well.
Lemma 6 (Working-Set Cost Lemma).
Take any sequence $S$ of operations on an abstract list $L$, each of which is a search, insert, delete or demote (described below), and a constant $c \ge 1$, such that the following hold (where $n$ is the current size of $L$):

- Searching for an item with rank $r$ in $L$ costs $O(\log r)$, and the item is pulled forward to within the first $cr$ items in $L$.

- Searching for an item not in $L$ costs $O(\log n)$.

- Inserting or deleting an item costs $O(\log n)$.

- Demoting an item in $L$ costs $O(\log n)$ and pushes it backward in $L$, but that item subsequently can only be demoted or deleted.

Then the total cost of performing $S$ on $L$ is $O(W(S))$, where demotions are ignored in computing $W(S)$ (they are not counted as accesses).
Proof.
We shall perform the analysis via the accounting method: each operation on $L$ has a budget according to $W(S)$, and we must use those credits to pay for that operation, possibly saving surplus credits for later. Define the recency of an item $x$ in $L$ to be the number of items in $L$ that have been inserted before, or pulled forward past, $x$ since the last operation on $x$. Clearly, for each search/insertion/deletion of an item in $L$, its access rank (actual recency) is at least its recency in $L$. Each item in $L$ has some stored credit, and we shall maintain the invariant that every item in $L$ with recency $r$ and stored credit $s$ is within the first $c(r + s)$ items in $L$ or has been demoted. The invariant trivially holds at the start.
First we show that, on every operation on an item $x$, the invariant can be preserved for $x$ itself. For an insertion/deletion of, or an unsuccessful search for, $x$, the budget of $\Theta(\log n)$ can pay for the operation and (for an insertion) also for the stored credit of $x$. For a successful search for $x$, it is as follows. Let $s$ be the stored credit and $r$ be the recency of $x$ before the operation, and let $r'$ be the rank of $x$ in $L$ at that point. By the invariant, $x$ was within the first $c(r + s)$ items in $L$ before the operation, and the budget is $\Theta(\log r)$. If $s \le r$, then $r' \le 2cr$ and so the budget can pay for both the operation cost and a new stored credit for $x$. If $s > r$, then $r' \le 2cs$ and so the stored credit can pay for the operation cost and a new stored credit for $x$.
Finally we check that the invariant is preserved for every other item $y$ in $L$. For a search/insertion of $x$, the rank of $y$ in $L$ changes by the same amount as its recency. For a deletion of $x$: if $x$ is after $y$ in $L$ then $y$ is more recent than $x$ and so the recency of $y$ does not change; if $x$ is before $y$ in $L$ then the rank of $y$ in $L$ decreases by one and its recency decreases by at most one. For a demotion, no other item's rank in $L$ increases.
This lemma implies that the amortized sequential working-set map has the desired working-set property.
Theorem 7 (Sequential Map Performance).
The cost of the amortized sequential working-set map satisfies the working-set bound.
Proof.
Let $n$ be the number of items in the map. By construction, there are $O(\log\log n)$ segments, and each operation on segment $\lambda_i$ takes $O(2^i)$ time. Thus each insertion/deletion takes $O(\log n)$ time, and each access/update of an item with rank $r$ in the list (ordered by segment, then by order within the recency-map) takes $O(\log r)$ time. Also, on each access of an item $x$ with rank $r$, its new rank is at most $2r$: if $x$ is in $\lambda_1$ then its new rank is $O(1)$, and if $x$ is in $\lambda_i$ for some $i \ge 2$ then $r$ exceeds the total size of the first $i-1$ segments' predecessors, while the new rank of $x$ (at the front of $\lambda_{i-1}$) is at most the total size of the segments before $\lambda_{i-1}$, plus one. Thus by the Working-Set Cost Lemma (Lemma 6) we are done.
6 Simple Parallel Working-Set Map
We now present our simple batched working-set map. The idea is to use the amortized sequential working-set map (Section 5) and execute operations in batches. [Footnote: Each batch is stored in a leaf-based balanced binary tree for efficient processing.] In order to get the bound we desire, however, we must combine operations in a batch that access the same item. In particular, for $k$ consecutive accesses to the same item, all but the first one should cost $O(1)$. Therefore, we must sort the batch (efficiently) using parallel entropy sort (Appendix Definition 32) to 'combine duplicates'. We also control the size of batches: if batches are too small, then we lose parallelism; if the batches are too large, then the sorting cost is too large.
6.1 Description of the Simple Map
As described in Section 4, M-calls in the program are put into the parallel buffer for the map. When the map is ready (i.e., the previous batch is done), we flush the parallel buffer to obtain the next input batch, which we cut and store in a feed buffer: a queue of bunches, each of size $p$ except possibly the last. (A bunch is a set supporting addition of a batch, and conversion to a batch if it has size $O(p)$. [Footnote: A bunch is implemented using a complete binary tree with batches at the leaves, with a linked list threaded through each level to support adding a leaf in $O(1)$ steps.]) Specifically, we divide the input batch into small batches of size $p$, except possibly the first and last, the first having size $\min(b, p - q)$, where $b$ is the size of the input batch and $q$ is the size of the last bunch in the feed buffer. Next we add that first small batch to the last bunch, and append the rest as new bunches to the feed buffer. Then, based on the current size $m$ of the map, we remove an appropriate number of bunches from the feed buffer, convert them into batches, merge them in parallel into a cut batch, and process it as follows.
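A simplified model of the feed buffer (ours, not from the paper; here a bunch is a flat list rather than the threaded complete binary tree of the footnote, and the policy for how many bunches to cut is left to the caller):

```python
from collections import deque

class FeedBuffer:
    """Toy feed buffer: a queue of bunches, each holding up to p operations;
    all bunches have size exactly p except possibly the last."""
    def __init__(self, p):
        self.p = p
        self.bunches = deque()

    def add(self, batch):
        # the first small batch tops up the last (partially full) bunch
        room = self.p - len(self.bunches[-1]) if self.bunches else 0
        first, rest = batch[:room], batch[room:]
        if first:
            self.bunches[-1].extend(first)
        # remaining operations become new bunches of size p (last may be short)
        for i in range(0, len(rest), self.p):
            self.bunches.append(list(rest[i:i + self.p]))

    def next_cut_batch(self, k):
        """Remove the first k bunches and merge them into one cut batch."""
        out = []
        for _ in range(min(k, len(self.bunches))):
            out.extend(self.bunches.popleft())
        return out
```

This preserves the invariant that every bunch except the last is full, so a cut of $k$ bunches yields a batch of size about $kp$.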
First we sort the cut batch using parallel entropy sort (Appendix Definition 32). Then we combine all operations on each item into one group-operation [Footnote: Each group-operation stores the original operations as a batch.] that is treated as a single operation with the same effect as the whole group of operations in the given order.
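Once the batch is sorted by key, duplicates combine in a single pass (our sketch, not the paper's parallel implementation, which performs this over the entropy-sorted batch in parallel):

```python
def combine_batch(sorted_ops):
    """sorted_ops: list of (key, op) pairs already sorted by key, where op is
    "search", "insert" or "delete". Returns one group-operation per key:
    (key, [ops...]), treated as a single operation whose effect equals
    applying the listed operations in the given order."""
    groups = []
    for key, op in sorted_ops:
        if groups and groups[-1][0] == key:
            groups[-1][1].append(op)          # same key: extend its group
        else:
            groups.append((key, [op]))        # new key: start a new group
    return groups

def net_membership_effect(group_ops):
    """Whether the key ends up inserted/deleted depends only on the last
    insert/delete in the group; searches affect recency, not membership."""
    last = None
    for op in group_ops:
        if op in ("insert", "delete"):
            last = op
    return last
```

This is exactly what makes $k$ parallel searches for one item cost one traversal plus $O(k)$ bookkeeping, rather than $k$ traversals.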
We pass the resulting batch (of group-operations) through the segments from $\lambda_1$ onwards. At segment $\lambda_i$, first we search for the relevant items. For insertions, if the item is found then we treat it as an update. For successful searches/updates, we return the results immediately, and shift the found items (keeping their relative order) to the front of the previous segment (of $\lambda_1$ itself if $i = 1$). For deletions, if the item is found then we delete it. Next, we restore the capacity invariant: for each segment up to $\lambda_i$ in order, we transfer the appropriate number of items between the front of the next segment and the back of the current one, so that either the current segment is full or the next one is empty. Then we pass the unfinished (unreturned) operations (including all deletions) on to the next segment.
At the end, we handle the remaining insertions. First we insert items at the back of the last segment, up to its capacity. If there are leftover items, we create new segments with just enough capacity, and carve out the correct number of items for each new segment in order.
Finally, we return the results for the insertions/deletions and the unsuccessful searches/updates, and we are done processing the batch.
To parallelize the above, we need to be able to efficiently delete, from each segment, any sorted batch of items or any number of the most/least recent items. For this we replace each BBT by a batched parallel 2-3 tree (Appendix Section A.2), where each leaf in the key-map also has a direct pointer to the corresponding leaf in the recency-map, and vice versa. Given a sorted batch of items, we can find them by one batch operation on the key-map; then we have a batch of direct pointers to the leaves for these items in the recency-map, and hence can perform one reverse-indexing operation on the recency-map to obtain a sorted batch of indices for those leaves, with which we can perform one batch operation to delete them. Similarly, to remove the most recent items from a segment, we find them via the recency-map, and then do a reverse-indexing operation on the key-map to obtain them in sorted order, whence we can delete them.
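The cross-pointer arrangement can be sketched as follows (ours, not from the paper; Python lists stand in for the leaf levels of the two 2-3 trees, and the 'direct pointers' are shared node objects):

```python
class Node:
    """An item in a segment; the same object sits in both the key-ordered
    view and the recency-ordered view, so each view can reach the other's
    leaf directly (the cross pointers of the text)."""
    def __init__(self, key):
        self.key = key

class Segment:
    def __init__(self):
        self.key_view = []       # sorted by key (stand-in for the key-map)
        self.rec_view = []       # front = most recent (recency-map)

    def insert(self, key):
        node = Node(key)
        self.key_view.append(node)
        self.key_view.sort(key=lambda n: n.key)
        self.rec_view.insert(0, node)         # new item becomes most recent

    def delete_sorted_keys(self, keys):
        """Batch delete: find nodes via the key view, then use the shared
        node objects to delete the same leaves from the recency view."""
        doomed = set(keys)
        victims = [n for n in self.key_view if n.key in doomed]
        dead = {id(n) for n in victims}
        self.key_view = [n for n in self.key_view if id(n) not in dead]
        self.rec_view = [n for n in self.rec_view if id(n) not in dead]
        return [n.key for n in victims]       # already in key order

    def remove_most_recent(self, k):
        """Remove the k most recent items, returned in sorted key order, as
        obtained by 'reverse-indexing' through the key view."""
        victims = self.rec_view[:k]
        dead = {id(n) for n in victims}
        self.rec_view = self.rec_view[k:]
        out = [n.key for n in self.key_view if id(n) in dead]
        self.key_view = [n for n in self.key_view if id(n) not in dead]
        return out
```

In the real structure both views are batched parallel 2-3 trees, so each of these steps is a single batch operation rather than a linear scan.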
6.2 Analysis of the Simple Map
We first bound the cost of parallel entropy-sorting each batch (Lemma 10). To do so, we will find a batch-preserving linearization (Definition 8) such that for each batch, the entropy bound is at most the insert working-set bound (Definition 9), which we in turn bound by the cost according to the working-set bound $W$, plus $O(\log p)$ per operation performed while the map is small (i.e., has size less than $p$). This extra cost arises when a batch has many operations on distinct items while the map is small, which according to $W$ are cheap.
Definition 8 (Batch-Preserving Linearization).
Take any sequence $B$ of batches of operations on a map $M$. We say that $L$ is a batch-preserving linearization of $B$ if $L$ is a permutation of the operations in $B$ that preserves the ordering of batches and (within each batch) the ordering of operations on the same item.
Remark.
When any two batch-preserving linearizations of $B$ are performed on $M$, the items in $M$ are the same after each batch, and the successfulness of each operation remains the same.
Definition 9 (Insert Working-Set Bound).
The insert working-set bound for any sequence $S$ of map operations is the working-set bound for 'inserting' the items of $S$ in the given order (ignoring the actual operations) into an empty map, namely, for each item, first searching for it and then inserting it iff it is absent.
Lemma 10 (Batch-Sorting Cost Lemma).
Take any sequence $B$ of batches of operations on a map $M$, and any constant $c > 0$. Then there is a batch-preserving linearization $L$ of $B$ such that parallel entropy-sorting (Appendix Definition 32) each batch in $B$ takes $O(W(L) + s\log p)$ total work over all batches, where $s$ is the total number of operations that are performed when $M$ has size less than $cp$ (when $L$ is performed on $M$).
Proof.
Let $L$ be a batch-preserving linearization of $B$ such that each batch in $L$ has the maximum possible insert working-set bound (Definition 9). By the Worst-Case Working-Set Bound (Appendix Theorem 31), this bound is at least the entropy bound for the batch. Thus parallel entropy-sorting each batch takes work proportional to its insert working-set bound (Appendix Theorem 33).
Let $b$ be the size of a given batch, and let $d$ be the number of distinct items (accessed by operations) in it. Partition the batch into subsequences $\beta_1$ and $\beta_2$ such that $\beta_1$ has only the first operation on every distinct item in the batch. Let $W_1$ and $W_2$ be the costs of the operations in $\beta_1$ and $\beta_2$ respectively according to $W$, so the batch costs $W_1 + W_2$ in total. Let $s$ be the number of operations in the batch performed when $M$ has size less than $cp$ (according to $L$).

Note that $\beta_2$ contributes $O(W_2)$ to the insert working-set bound, because every operation in $\beta_2$ is a successful search according to Definition 9, with access rank no more than its access rank according to $L$. Thus it suffices to show that the contribution of $\beta_1$, which is $O(d\log d)$ since any insertion into a map with at most $d$ items has access rank at most $d$, is $O(W_1 + s\log p)$.

If all of $\beta_1$ consists of operations performed while $M$ is small, then its contribution is obviously covered by the $s\log p$ term. Otherwise, a constant fraction of the operations in $\beta_1$ are performed when $M$ has size at least $cp$; according to $L$, each of those operations then costs $\Omega(\log p)$, and so $W_1$ absorbs the remaining contribution.
Next we prove a simple lemma that allows us to divide the work done on the segments among the operations.
Lemma 11 (Segment Work).
Each segment takes work per operation that reaches it.
Proof.
Searching/deleting/shifting the relevant items in the parallel 2-3 trees takes work per operation. Also, for each , the number of transfers (to restore the capacity invariant) between and is at most the number of operations, and each transfer takes work because there are always at most items in . Thus the transfers take total work.
Then we can prove the desired effective work bound for .
Theorem 12 (Effective Work).
takes effective work for some linearization of .
Proof.
Cutting the input batch of size from the parallel buffer into small batches takes work. Adding the first small batch to the last bunch in the feed buffer takes work. Inserting the bunches into the feed buffer takes work. Forming a cut batch of size (converting the bunches and merging the results) takes work. So all this buffering work adds up to per map operation.
Sorting the (cut) batches takes total work (over all batches) for some linearization , by the Batch-Sorting Cost Lemma (Lemma 10). Specifically, we choose . For each batch , let be the size of just before that batch, and then has size and:

If , then and so (as ).

If , then so none of the operations in that batch can be smallops.
It now suffices to show that the work on segments is for some linearization (since either or suffices for the final bound). For this, we pretend that a deleted item is marked rather than removed, and when a segment is filled to capacity all marked items are simultaneously transferred to the next segment, and at the last segment the marked items are removed. This takes more work than what actually does, but is easier to bound.
We shall now use the Working-Set Cost Lemma (Lemma 6) on the list of the items in (including the marked items) in order of segment followed by recency within the segment, where is updated after the batch has passed through each segment in the actual execution of , and after we finish processing the batch.
We simulate the updates to by list operations as follows:

Shift successfully searched/updated items in : Search for them in reverse order (from back to front in ).

Shift marked (tobedeleted) items in : Demote them.

Insert items in : Insert in the desired positions.

Remove marked items in : Delete them.
This simulation yields a sequence of list operations on , to which we can then apply the Working-Set Cost Lemma (Lemma 6).
For each search for an item with rank in , is found in or some segment such that , and so by Lemma 11 (Segment Work) the search takes work in , after which has new rank in at most , as in (see 7). After each batch , let be the final size of and be the new last segment, and then each insertion in takes work in . Each deletion takes work in .
Thus by Lemma 6, takes work on segments. Now let be the same as but with each group-operation expanded to its original sequence of operations. Clearly , since each group-operation is on the same item, so we are done.
And now we turn to bounding the effective span.
Theorem 13 (Effective Span).
takes effective span, where is the number of operations on , and is the maximum size of .
Proof.
First we bound the span of processing each cut batch (i.e., the span of the corresponding execution sub-DAG). Let denote the maximum span of processing a cut batch of size . Take any cut batch of size and let be the size of just before . It takes span to be removed and formed from the feed buffer, and span to be sorted. It then takes span in each segment (because shifting between parallel 2-3 trees of size or cutting a batch of size takes span), which adds up to span over all segments, since when . Returning the results for each group-operation takes span. Thus . If then . If then and hence .
Now let be the actual execution DAG for using (on processors). Then the effective span of is simply the time taken to execute on an unlimited number of processors when each node in takes unit time while every other node takes zero time, since captures all relevant behaviour of using including all the dependencies created by the locks. In this execution, we put a counter at each call in the program DAG , initialized to zero, and at each step we increment the counter at every pending call (i.e., the result is not yet returned). Then the total number of steps is at most the final counter-weighted span of , which we now bound.
Take any path in . Consider each call on . We trace the ‘journey’ of from the parallel buffer as an operation in an uncut batch of size to a cut batch of size to the end of .
Observe that any batch of size takes span to be flushed from the parallel buffer, and span to be cut and added/appended to the bunches in the feed buffer, which in total is at most span.
So, first of all, waits for the preceding uncut batch of size to be processed, taking span. Next, waits for the current cut batch of size to be processed, taking span. After that, is processed, taking span. Then waits for intervening cut batches (between and ) with operations in total. Each intervening batch has some size and hence . Finally, is processed, taking span. Thus takes span in total.
Note that no two calls on the path can wait for the same intervening batch, because the second can be executed only after the first has returned. Thus over all counters at calls on , each of will sum up to at most . Therefore the final counter-weighted span of is at most .
7 Faster Parallel Working-Set Map
To reduce the effective span of , we intuitively have to:

Shift each accessed item near enough to the front, so that accessing it again soon would be cheap.

Pipeline the batches somehow, so that an expensive access in a batch does not hold up the next batch.
Naive pipelining will not work, because operations on the same item may take too much work. Hence we shall use a filter before the pipelined segments to ensure that operations proceeding through them are on distinct items, namely we pass all operations through the filter and only allow an operation through if there is not already another operation on the same item in the pipeline.
For similar reasons as in , we must control both the batch size and filter size to achieve enough parallelism, and so we choose the filter capacity to be . However, we cannot put the filter before the first segment, because accessing the filter requires work per operation, whereas to meet the workingset bound we need operations with access rank to cost only work.
Therefore, we divide the segments into the first slab and the final slab, where the first slab comprises the first segments and the final slab contains the rest, and put the filter after the first slab. Only operations that do not finish in the first slab are passed through the filter, and so the filtering cost per operation is bounded by the work already incurred in going through the first slab. Furthermore, we shift accessed items to the front of the final slab, and ‘cascade’ the excess items only when a later batch passes.
We cannot pipeline the first slab, but since the first slab is essentially a copy of but with only trees, its non-pipelined span turns out to be bounded by the span of sorting. To allow operations on items in the first slab to finish quickly, we need to allow the first slab to run while the final slab is running, but only when the filter has size at most , so that the filter size is always .
We also use special locking schemes to guarantee that the first slab and the segments in the final slab can process the operations at a consistent pace without interfering with one another. Finally, we shall weakly prioritize the execution of the final slab, to prevent excessive work from being done in the first slab on an item if there is already an operation on in the final slab.
7.1 Description of
: 
First slab: where 
Final slab: 
We shall now give the details for implementing this (see Figure 2), making considerable use of the batched parallel 2-3 tree (Appendix Section A.2). has the same segments as in , where segment has assigned capacity but may be underfull or overfull. We shall group the first segments into the first slab, and the other segments into the final slab. uses a feed buffer (like ; see Section 6.1), which is a queue of bunches (see 6), each of size except possibly the last.
The interface is ready iff both the following hold:

The parallel buffer or feed buffer is nonempty.

The filter has size at most .
When the interface is activated (and ready), it does the following (in sequence) on its run (the locks are described later):

Let be the size of the last bunch in the feed buffer. Flush the parallel buffer and cut the input batch of size into small batches of size except possibly the first and last, where the first has size . Add that first small batch to , and append the others as bunches to the feed buffer. Remove the first bunch from the feed buffer and convert it into a batch , which we shall call a cut batch.

Sort using parallel entropy-sort (Appendix Definition 32), combining operations on the same item (see 7), as in .

Pass through the first slab, which processes the operations as in . Successful searches/updates immediately finish, while the rest finish only if there was no final slab. Successful deletions are tagged to indicate success. But just before running (if it exists) to process the remaining batch at that segment, acquire the neighbourlock shared with (as shown in Figure 2) and then acquire the frontlock .

If there was a final slab, then pass the (sorted) batch of unfinished operations through the filter (including successful deletions), insert the filtered batch into the buffer before , and fork (a child thread) to activate .

Release and the neighbourlock shared with .

Reactivate itself.
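The buffering in step 1 can be sketched as follows, with `s` a hypothetical small-batch size and plain Python lists/deques standing in for the parallel buffer and the feed buffer (the real structures are concurrent):

```python
from collections import deque

def flush_and_cut(parallel_buffer, feed_buffer, s):
    """Sketch of the interface's buffering step: flush the parallel
    buffer, top up the last bunch of the feed buffer to size s with the
    first small batch, append the rest as size-s bunches (the last may
    be smaller), then remove the first bunch as the next cut batch."""
    batch = list(parallel_buffer)
    parallel_buffer.clear()
    # The first small batch has exactly the size needed to fill the
    # last bunch up to s (zero if the feed buffer is empty).
    room = (s - len(feed_buffer[-1])) if feed_buffer else 0
    if room:
        feed_buffer[-1].extend(batch[:room])
    rest = batch[room:]
    feed_buffer.extend([rest[i:i + s] for i in range(0, len(rest), s)])
    return list(feed_buffer.popleft()) if feed_buffer else []
```

This keeps every bunch except possibly the last at exactly size `s`, which is what the later span analysis of cut batches relies on.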
The filter is used to ensure that at any point all the operations in the final slab are on distinct items. It is implemented using a batched parallel 2-3 tree that stores items, each tagged with a list of operations on that item (in the order they arrive at the filter) and their cumulative effect (as a single equivalent map operation).
When a batch is passed through the filter, each operation on an item in the filter is appended to the list for it (the effect is also updated) and filtered out of the batch, whereas each operation on an item not in the filter is added to the filter and put into the buffer of .
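A sequential sketch of this filter logic (the class and its API are hypothetical, and a plain dictionary stands in for the batched parallel 2-3 tree; the "cumulative effect" is simplified to the last non-search operation queued):

```python
class PipelineFilter:
    """At most one in-flight operation per item is admitted into the
    final slab; later operations on the same item are queued on the
    in-flight entry together with their combined net effect."""

    def __init__(self):
        # item -> (queued ops in arrival order, combined effect of the
        # queued ops; None means "no net effect yet" (only searches))
        self.entries = {}

    def admit(self, batch):
        """Pass a batch through; return the ops admitted to the slab."""
        admitted = []
        for op in batch:                 # op = (kind, item), with kind
            kind, item = op              # in {'search','insert','delete'}
            if item in self.entries:
                ops, effect = self.entries[item]
                ops.append(op)
                # A later insert/delete overrides; a search changes nothing.
                self.entries[item] = (ops, effect if kind == 'search' else kind)
            else:
                self.entries[item] = ([], None)
                admitted.append(op)
        return admitted

    def remove(self, item):
        """Called when the item's in-flight operation finishes in the
        final slab; returns the queued ops and their combined effect."""
        return self.entries.pop(item)
```

Because only the first pending operation on each item is admitted, every operation inside the pipelined segments is on a distinct item, which is the invariant the pipelining needs.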
The final slab is pipelined in the following way. Between every pair of consecutive segments is a neighbourlock, which is a dedicated lock (see Section 4, Memory Model) with key for each arrow to it in Figure 2. Since each segment needs to access the filter and the contents of , those accesses will also be guarded by a frontlocking scheme using a series of frontlocks , each of which is a dedicated lock with key for each arrow to it in Figure 3. (This will be fully spelt out below.)
Each final slab segment has a sorted buffer before it (for operations from ), which is a batched parallel 2-3 tree. is ready iff its buffer is nonempty, and when activated (and ready) it runs as follows (frontlocking is highlighted):

Acquire the neighbourlocks (between and its neighbours) in the order given by the arrow number in Figure 2.

If , acquire .

If is the terminal segment and have total size exceeding their total capacity, create a new terminal segment .

Flush and process the operations in its buffer as follows:

Search for the accessed items in (by performing one batch operation on the keymap in ). Let be the (sorted batch of) items in that are found in , and delete from .

If , acquire to in that order.

Search for in the filter to determine what to do with it. Let be the items in to be searched or updated, and delete from the filter. (Insertions on items in are treated as updates.) Perform all the updates on items in .

Let . Fork to return the results for operations on , and insert at the front of . If is (now) the terminal segment, perform all insertions at the front of , and delete from the filter, and fork to return the results for operations on .

If the filter size is at most , fork to activate the interface.

If , release to in that order.

If is overfull, transfer items from the back of to the front of so that is full.

If is underfull by items and has items and has successful deletions, transfer items from the front of to the back of .

If is not the terminal segment, insert the operations on (with successful deletions tagged as such) into the buffer of , then fork to activate .


If is the terminal segment and is empty, remove to make the new terminal segment.

If , release .

Release both neighbourlocks and reactivate itself.
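The run procedure above can be condensed into the following Python-style pseudocode skeleton; every lock, tree, and helper name here is a hypothetical stand-in, and the comments map back to the numbered steps:

```python
def run(segment):                                  # a final-slab segment
    with segment.neighbour_locks():                # step 1, in arrow order
        if segment.is_terminal and segment.overflowing():
            segment.create_next_terminal()         # step 3
        batch = segment.buffer.flush()             # step 4
        found = segment.tree.search(batch)         # 4a: items found here
        segment.tree.delete(found)
        with segment.front_locks():                # 4b, 4f: F_1 .. F_i in order
            kinds = filter.classify(found)         # 4c: search/update vs delete
            front.prepend(kinds.to_shift)          # 4d: shift accessed items
            if filter.size() <= threshold:         # 4e: wake the interface
                fork(activate_interface)
        segment.rebalance_with_previous()          # 4g, 4h: fix over-/underfull
        if not segment.is_terminal:                # 4i: pass the rest onward
            segment.next.buffer.insert(batch.unfinished)
            fork(segment.next.activate)
        segment.cleanup_empty_terminal()           # step 5
    fork(segment.activate)                         # step 7: reactivate itself
```

The essential discipline is that the neighbourlocks are held for the whole run while the frontlocks are held only for the short front-access window, which is what the later delay analysis exploits.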
7.2 Weak-priority scheduler
It turns out that, ignoring the sorting cost, all we need to achieve the working-set bound is that, for each operation on an item in the final slab, the work ‘done’ by the first slab on can be counted against the work done in the final slab. This can be ensured using a weak-priority scheduler. A weak-priority scheduler has two queues and , where is the high-priority queue, and each ready node is assigned to either or , and at every step the following hold:

If there are ready nodes, then of them are executed.

If queue has ready nodes, then of them are executed (and so nodes are weakly prioritized).
The nodes generated for the final slab are assigned to , while all other nodes are assigned to . Specifically, each (forked) activation call to and all nodes generated by that are assigned to , except for activation calls to the interface (which are assigned to ). Any suspended thread is put back into its original queue when it is resumed (i.e. the resuming node is assigned to the same queue as the acquiring node).
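A minimal sequential sketch of one scheduler step: strictly serving the high-priority queue first trivially satisfies both weak-priority conditions (the queue names and the step granularity are our simplifications, not the paper's scheduler):

```python
from collections import deque

def weak_priority_step(high, low, p):
    """One step on p processors: execute up to p ready nodes, drawing
    from the high-priority queue first.  If `high` alone has >= p ready
    nodes, all p executed nodes come from it; if the two queues together
    have >= p ready nodes, exactly p nodes are executed."""
    executed = []
    while high and len(executed) < p:    # serve high-priority nodes first
        executed.append(high.popleft())
    while low and len(executed) < p:     # fill remaining slots from low
        executed.append(low.popleft())
    return executed
```

Note that weak priority is a weaker requirement than this strict service: the definition only demands that high-priority nodes fill the step whenever enough of them are ready, which leaves schedulers (such as relaxed work-stealing variants) room to interleave.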
7.3 Analysis of
For each computation (sub-DAG of the actual execution DAG), we shall define its delay, which intuitively captures the minimum possible time it needs, including all waiting on locks. Each blocked acquire of a dedicated lock corresponds to an acquire-stall node in the execution DAG whose child node is created by the release just before the successful acquisition of the lock. Let be the ancestor nodes of that have not yet executed at the point when is executed. Then we define delay as follows.
Definition 14 (Computation Delay).
The delay of a computation is recursively defined as the weighted span of , where each acquire-stall node in is weighted by the delay of (to capture the total waiting at ), and every other node has unit weight.
Also, we classify each operation as segment-bound iff it finishes in some segment, and filter-bound otherwise (namely filtered out due to a prior operation on the same item that finishes in the final slab).
The key lemma (Lemma 19) is that any operation that finishes in segment takes delay to be processed by the final slab, which is ensured (Lemma 18) by the frontlocking scheme and the balance invariants (Lemma 16).
Then takes effective work for some linearization of , and here is the intuitive proof sketch:

The work divides into segment work (done by the slabs) and nonsegment work (cutting and sorting batches, and filtering).

The segment work can be divided per operation; segment does work per operation that reaches it (Lemma 20).

The cutting work is per operation, and the (overall) sorting work is for some linearization of (Lemma 10). The filtering work is per filtered operation, which can be ignored since each filtered operation already took work in the first slab. Similarly we can ignore the work done in passing the batch through .

The segment work on segment-bound operations is for some linearization of (see 21; proven using Lemma 6 as we did for ).

The segment work on filter-bound operations is too:

We can ignore work during high-busy steps (where has at least ready nodes), because the final slab takes work and so there are high-busy steps.

We can ignore every high-busy run of (namely with at least half its steps high-busy), because its work is times the work during high-busy steps.

High-idle runs of take total work.

Every filter-bound operation is filtered out due to some operation in the final slab. So take any operation on item that finishes in a final slab segment .

During each high-idle run (which takes high-idle steps), the processing of is not blocked by any thread (since does not hold any neighbourlock or filterlock), so each high-idle step ‘reduces’ its remaining delay, which by the key lemma (Lemma 19) is .

Therefore the work done on by high-idle runs while is in the final slab is times the work done by the final slab on .


Moreover, takes effective span for some linearization of :

The effective span is the time taken to run the execution DAG on infinitely many processors, where each node takes unit time while every other node takes zero time.

There are filter-full steps (steps in which the filter has size at least ). To see why, let every operation in the final slab consume a token on each step. Then each filter-full step consumes at least tokens. But by the key lemma (Lemma 19) the total token consumption is just times the total work in the final slab, which amounts to (see 21).

There are filter-empty steps (filter size at most ) for some linearization of , where is the number of calls, because each operation essentially has the following path:

It waits filter-empty steps for the current cut batch in the first slab.

Then it waits filter-empty steps per intervening cut batch of size .

Finally it takes steps to pass through the slabs where is its access rank according to , by the key lemma and the rank invariant (see 24).
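The effective-span notion used here, i.e. running time on unboundedly many processors where only certain nodes take unit time, is just a weighted longest path in the execution DAG. A small sketch (names ours):

```python
def weighted_span(succ, weight):
    """Weighted longest path in a DAG.  `succ` maps each node to its
    list of successors; `weight` maps each node to its execution time
    (e.g. 1 for the nodes being counted, 0 for the rest, as in the
    effective-span definition).  The result equals the running time on
    unboundedly many processors."""
    memo = {}
    def longest(v):                      # total weight of the heaviest
        if v not in memo:                # path starting at v
            memo[v] = weight[v] + max(
                (longest(w) for w in succ.get(v, ())), default=0)
        return memo[v]
    return max((longest(v) for v in weight), default=0)
```

Assigning unit weight only to a subset of nodes is exactly how the counter-weighted span argument for Theorem 13 isolates the steps chargeable to a particular computation.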

We shall now give the details. For the purpose of our analysis, we consider a segment to be running iff it has acquired all neighbourlocks and has not released them. We also consider an operation to be in segment exactly when it is in the buffer of or is being processed by up to item 4h (inclusive). Likewise, we consider an item that is found in and to be searched/updated to remain in until it is shifted to (in item 4d).
We begin by showing that remains ‘balanced’; each nonterminal segment has size not too far from capacity.
Definition 15 (Segment Holes).
We say that a segment has holes if is not the terminal segment but has fewer items than its capacity. (If exceeds capacity then it has no holes.)
Lemma 16 (Balance Invariants).
The following balance invariants hold:

If is not running or does not exist, then does not exceed capacity (if it exists).

If the interface is not running, has no holes, and has at most holes where is the number of successful deletions in .

Each final slab segment has at most items.

If a final slab segment is not running, is at most below capacity.
Proof.
Let be the filter size. Then always, since the interface only runs if , on a batch of size at most .
Invariant 1
can only exceed capacity when runs, and restores the invariant (in item 4g) before it finishes running.
Invariant 2
Only the interface creates more holes in (in item 3), each corresponding to a unique successful deletion that is inserted into the buffer of , and so just after the interface finishes running has at most holes where is the number of successful deletions in , and all the holes must be in since . Once runs, will have no holes, because either was the terminal segment or had at least items by Invariant 4.
Invariants 3,4
To establish Invariants 3,4, we shall prove sharper invariants. Let be the total size of minus their total capacity. Let be the number of unfinished operations in . Let be the number of successful deletions in . Then the following invariants hold:

If a final slab segment is not running, .

For each final slab segment , we always have .

.

If a final slab segment is not running, .
Firstly, (A) holds for , because and by Invariant 1 since the interface does not insert any item. Thus by induction it suffices to show that (A) holds for where assuming (A) holds for . This can be done by the following observations:

When is not running, never increases, because each search/update/deletion in does not increase or affect , and each operation that finishes in increases by at most but decreases by .

When is newly created, by (C) since was the previous terminal segment, and .

When is running, is not running. Thus just after finishes running, since does not exceed capacity (due to item 4g), and since has an empty buffer. Thus by (A) for .
We now establish (B) using (A). Consider each final slab segment run. Just before that run, by (A), and there are at most unfinished operations in . During that run, no new operation enters , and increases by at most for each unfinished operation in that finishes. Thus throughout that run.
Next we establish (C). Note that the terminal segment is only changed by the interface or the previous terminal segment, and so when the terminal segment is not running, never increases because no insertions finish. It suffices to observe the following:

Just after is newly created, it does not exceed capacity and so .

Just before was removed (making the new terminal segment), had finished running, and so did not exceed capacity, and hence by (A) for .

Whenever runs and does not create a new terminal segment, just before that run do not exceed total capacity and so at that point. There are two cases:

If : Just before that run, since do not exceed capacity, and has at most operations. During that run, increases by at most per unfinished operation in , and hence after that run .

If : Just before that run, by (A) for , and has operations. During that run, is increased only by per unfinished operation in , and hence after that run .

Finally we establish (D). Firstly, (D) holds for by Invariant 2. Thus by induction it suffices to show that (D) holds for where assuming (D) holds for . This can be done by the following observations:

Any search/update/insertion that finishes does not decrease .

If is not running, any deletion that succeeds in decreases by but increases by .

When is newly created by an run, after that run does not exceed capacity (due to item 4g) and so for each hole in there will be at least successful deletion in , and hence by (D) for .

When runs, after the run because:

If after the run is exactly full, then at that point and , and by (D) for .

If after the run is below capacity and still exists, the run must have made frontward transfers (in item 4h) where was the number of successful deletions in at the start of that run. Thus the run increased by at least , and decreased by .

Finally we can establish Invariants 3,4. By both (B) and (C), for each final slab segment we have , and hence has size at most . By (D), if a final slab segment is not running, then and hence is at most below capacity.
Corollary 17 (Segment Access Bound).
Each batch operation on a parallel 2-3 tree in segment where takes work per operation in the batch and span.
Now we can prove a delay bound on the ‘front access’ (through the frontlocks) by each final slab segment.
Lemma 18 (Front Access Bound).
Any segment takes total delay to acquire the frontlocks and run the frontlocked section (in between) and then release . And similarly the interface takes delay to acquire and run the frontlocked section and then release .
Proof.
The frontlocked section takes delay, since each operation on a parallel 2-3 tree in takes span by Corollary 17. We shall show by induction that any segment that has acquired will release within delay, where is a constant chosen to make it true when . If , then next attempts to acquire , and if it fails then must now be holding it and will release it within delay by induction, and then will actually acquire and then will release within delay by induction, which in total amounts to delay. Therefore any segment that attempts to acquire will wait at most delay for any current holder of to release it, and then take at most delay to run its frontlocked section and release , which is in total delay. Similarly for when the interface attempts to acquire .
Then we can prove the key lemma:
Lemma 19 (Final Slab Bound).
Take any segment where , and any operation . Then runs within delay. So if finishes in segment then the processing of in the final slab takes delay.
Proof.
Once any acquires the second neighbourlock, it will finish within delay, since the operations on the parallel 2-3 trees take span by Corollary 17, the front access takes delay by Lemma 18, and inserting the unfinished operations into the buffer of takes span. Thus once any acquires the first lock, it waits delay for the holder of the second lock to finish, and then itself finishes within delay. And once the interface acquires the lock shared with , it will finish within delay by Lemma 18. Thus any when run will acquire both locks within delay, and then itself finish within delay. Therefore, the final slab takes delay to process any operation that finishes in .
To bound the total work, we begin by partitioning it per operation:
Lemma 20 (Segment Work).
We can divide the segment work (work done on segments) in among the operations in the following natural way — each segment does work per operation it processes.
Proof.
If , then the proof is the same as for Lemma 11 (Segment Work).