Data structures, parallel programs, dictionaries, comparison-based search, distribution-sensitive algorithms
In this paper 111This is the authors’ version of a paper submitted to the 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’18). It is posted here by permission of ACM for your personal or classroom use. Not for redistribution. The definitive version can be found at https://doi.org/10.1145/3210377.3210390. © 2018 Copyright is held by the owner/author(s). Publication rights licensed to ACM. we present two versions of a parallel working-set map on processors that supports searches, insertions and deletions. In both versions, the total work of all operations when the map has size at least is bounded by the working-set bound, i.e., the cost of an item depends on how recently it was accessed (for some linearization): accessing an item in the map with recency takes work. In the simpler version each map operation has span (where is the maximum size of the map). In the pipelined version each map operation on an item with recency has span. (Operations in parallel may have overlapping span; span is additive only for operations in sequence.)
Both data structures are designed to be used by a dynamic multithreading parallel program that at each step executes a unit-time instruction or makes a data structure call. To achieve the stated bounds, the pipelined version requires a weak-priority scheduler, which supports a limited form of 2-level prioritization. At the end we explain how the results translate to practical implementations using work-stealing schedulers.
To the best of our knowledge, this is the first parallel implementation of a self-adjusting search structure where the cost of an operation adapts to the access sequence. A corollary of the working-set bound is that it achieves work static optimality: the total work is bounded by the access costs in an optimal static search tree.
Map (or dictionary) data structures, such as binary search trees and hash tables, support inserts, deletes and searches/updates (collectively referred to as accesses) and are one of the most used and studied data structures. In the comparison model, balanced binary search trees such as AVL trees and red-black trees provide a performance guarantee of worst-case cost per access for a tree with items. Other kinds of balanced binary trees provide probabilistic or amortized performance guarantees, such as treaps and weight-balanced trees.
Self-adjusting maps, such as splay trees , adapt their internal structure to the sequence of operations to achieve better performance bounds that depend on various properties of the access pattern (see  for a hierarchical classification). Many of these data structures make it cheaper to search for recently accessed items (temporal locality) or items near to previously accessed items (spatial locality). For instance, the working-set structure described by Iacono in  has the working-set property (which captures temporal locality); it takes time per operation with access rank (creftype 1), so its total cost satisfies the working-set bound (creftype 2).
Definition 1 (Access Rank).
Define the access rank of an operation in a sequence of operations on a map as follows. The access rank for a successful search for item is the number of distinct items in that have been searched for or inserted since the last prior operation on (including itself). The access rank for an insertion, deletion or unsuccessful search is always , where is the current size of .
Definition 2 (Working-Set Bound).
Given any sequence of map operations, we shall use to denote the working-set bound for , defined by where is the access rank of the -th operation in when is performed on an empty map.
Parallel search structures
Our goal in this paper is to design efficient self-adjusting search structures that can be used by parallel programs. Even designing a non-adjusting parallel or concurrent search structure is quite challenging, and there has been a lot of research on the topic.
There are basically two approaches. In the concurrent computing world, processes independently access the data structure using some variety of concurrency control (e.g., locks) to prevent conflicts. In the parallel computing world, data structures are designed to support executing operations in parallel, either individually or in batches.
For example, in the concurrent computing world, Ellen et al.  show how to design a non-blocking binary search tree, with later work generalizing this technique  and analyzing the amortized complexity . However, these data structures do not maintain balance in the tree (i.e., the height can get large) and their cost depends on the number of concurrent operations.
An alternate approach (that bears some similarity to the implicit batching that we use) is software combining [22, 28, 32], where each processor inserts a request in a shared queue and a single processor sequentially executes all outstanding requests later. These works provide empirical efficiency but not worst-case bounds.
Another notable example is the CBTree [1, 2], a concurrent splay tree that in real experiments achieves surprisingly good performance — leading to an interesting hypothesis that self-adjustment may be even more valuable (in practice) in concurrent settings than sequential settings. However, the CBTree does not guarantee that it maintains the proper ‘frequency balance’, and hence does not provide the guarantees of a splay tree (despite much experimental success).
In the parallel computing world, there are several classic results in the PRAM model. Paul et al.  devised a parallel 2-3 tree such that synchronous processors can perform a batch of operations on a parallel 2-3 tree of size in time. Blelloch et al.  show how pipelining can be used to increase parallelism of tree operations. Also, (batched) parallel priority queues [12, 15, 16, 36] have been utilized to give efficient parallel algorithms such as for shortest-path and minimum spanning tree [12, 16, 33].
More recently, in the dynamic multithreading model, there have been several elegant papers on parallel treaps  and how to parallelize a variety of different binary search trees  supporting unions and intersections, and also work on how to achieve batch parallel search trees with optimal work and span . Other batch parallel search trees include red-black trees  and weight-balanced B-trees . (We are unaware of any batched self-adjusting data structures.)
And yet, such concurrent/parallel map data structures can be difficult to use; the programmer cannot simply treat it as a black box and use atomic map operations on it from within an ordinary parallel program. Instead, she must carefully coordinate access to the map.
Recently, Agrawal et al.  introduced the idea of implicit batching. Here, the programmer writes a parallel program that uses a black box data structure, treating calls to the data structure as basic operations. In addition, she provides a data structure that supports batched operations (e.g., search trees in [9, 7]). The runtime system automatically stitches these two components together, ensuring efficient running time by creating batches on the fly and scheduling them appropriately. This idea of implicit batching provides an elegant solution to the problem of parallel search trees.
Our goal is to extend the idea of implicit batching to self-adjusting data structures — and more generally, to explore the feasibility of the implicit batching approach for a wider class of problems. In , they show how to apply the idea to uniform-cost data structures (where every operation has the same cost). 222They also provide some bounds for amortized data structures, where queries do not modify the data structure. In a self-adjusting structure, some operations are much cheaper than others, and additionally every operation may modify the data structure (unlike say AVL/red-black trees where searches have no effect on the structure), which makes parallelizing it much harder.
We present in this paper, to the best of our knowledge, the first parallel self-adjusting search structure that is distribution-sensitive with worst-case guarantees. In particular, we design two versions of a parallel map whose total work is essentially bounded by the Definition 2 (Working-Set Bound). for some linearization of the operations (that respects the dependencies between them).
Parallel Programming Model
The parallel data structures in this paper can be used in the scenario where a parallel program accesses data structures expressed through dynamic multithreading (see [14, Ch. 27]), which is the case in many parallel languages and libraries, such as Cilk dialects [24, 30], Intel TBB , Microsoft Task Parallel Library  and subsets of OpenMP . The programmer expresses algorithmic parallelism through parallel programming primitives such as fork/join (also spawn/sync), parallel loops and synchronized methods, and does not provide any mapping from subcomputations to processors.
These types of programs are typically scheduled using a greedy scheduler [11, 27] or a nearly greedy scheduler such as work-stealing scheduler (e.g., ) provided by the runtime system. A greedy scheduler guarantees that at each time step if there are available tasks then of them are completed.
We analyze our two data structures in the context of a greedy scheduler and a weak-priority scheduler (respectively). A weak-priority scheduler has two priority levels, and at each step at least half the processors greedily choose high-priority tasks and then low-priority tasks — if there are at most high-priority tasks, then all are executed. We discuss in Section 8 how to adapt these results for work-stealing schedulers.
2 Main Results
We present two parallel working-set maps that can be used with any parallel program , whose actual execution is captured by a program DAG where each node is a unit-time instruction or a call to some data structure , called an -call, that blocks until the answer is returned, and each edge represents a dependency due to the parallel programming primitives. Let be the total number of nodes in , and be the number of nodes on the longest path in .
Both designs take work nearly proportional to the Definition 2 (Working-Set Bound). for some legal linearization of , while having good parallelism. (We assume that each key comparison takes steps.)
The first design, called , is a simpler batched data structure.
Theorem 3 ( Performance).
If uses only (i.e., no other data structures), then its running time on processes using any greedy scheduler is
(as ) for some linearization of , where is the maximum number of -calls on any path in , and is the maximum size of the map, and is the number of small-ops, defined as operations in that are performed on the map when its size is less than .
Notice that if is replaced by an ideal concurrent working-set map (one that does the same work as the sequential working-set map if we ran the program according to linearization ), then running on processors according to the linearization takes worst-case time where . Also, we very likely have in practice, and so can usually ignore the term. Thus gives an essentially optimal time bound except for the “span term” , which adds time per -call along some path in . In short, the parallelism of is within a factor of of the optimal.
The second design, called , uses a more complex pipelined data structure design as well as a weak-priority scheduler (see Section 7.2) to provide a better bound on the “span term”.
Theorem 4 ( Performance).
If uses only , then its running time on processes using any weak-priority scheduler is
for some linearization of , where are defined as in creftype 3, and is the weighted span of where each map operation is weighted by its cost according to . Specifically, each map operation in with access rank is given the weight , and is the maximum weight of any path in .
Compared to , the “work term” is unchanged, but the “span term” for has no term. Since running on processors according to the linearization takes worst-case time, gives an essentially optimal time bound up to an extra time per map operation along some path in , and hence has parallelism within an factor of optimal.
3 Central Ideas
We shall now sketch the intuitive motivations behind and .
It starts with Iacono’s sequential working-set structure, which contains a sequence of balanced binary search trees where tree for contains items and hence has height . The invariant maintained is that the most recently accessed items are in the first trees. A search on a key proceeds by searching in each tree in the sequence in order until the key is found in tree . After a search, the item is moved to and then for each , the least recently accessed item from tree is moved to tree . By the invariant, any item in the map with recency will take time to access. Each insertion or deletion can be easily carried out in time while preserving the invariant.
The challenge is to ‘parallelize’ this working-set structure while preserving the total work. The first step is to process operations in batches, using a batched search structure in place of each ‘tree’.
The problem is that, if there are searches for the same item in the last tree, then according to the working-set bound these operations should take work. But if these operations all happen in parallel and end up in the same batch, and we execute this batch naively, then each operation will go through the entire structure leading to work.
Therefore, in order to get the desired bound, we must combine duplicate accesses in each batch. But naively sorting a batch of operations takes work. To eliminate this as well, (Section 6) uses a novel entropy-sorting algorithm, and a careful analysis yields the desired work bound.
Next, we cannot simply apply the generic “implicit batching” transformation in  to , because the Batcher Bound (Theorem 1 in ) would give an expected running time of for map operations, where is the work done by , and is the worst-case span of a size- batch. The problem is that is , because a batch with a search for an item in the last tree has span .
Firstly, this means that the term would be , and so the Batcher Bound would be no better than for a batched binary search tree. Secondly, the term would be . has the same span term, because if a cheap operation is ‘blocked’ by the previous batch that has an expensive operation, then the span of the cheap operation could be . To reduce this, we improve to using an intricate pipelining scheme (explained in Section 7) so that a cheap operation is ‘blocked’ less by the previous batch.
4 Parallel Computation Model
In this section, we describe how the parallel program generates an execution DAG, how we measure the cost of a given execution DAG, and issues related to the chosen memory model.
The actual complete execution of can be captured by the execution DAG (which may be schedule-dependent), in which each node is a unit-time instruction and the directed edges represent the underlying computation flow (such as constrained by forking/joining of threads and acquiring/releasing of locks). At any point during the execution of , a node in the program/execution DAG is said to be ready if its parent nodes have been executed. An active thread is simply a ready node in , while a suspended thread is a terminal node in .
The program DAG captures the high-level execution of , but interaction between data structure calls is only captured by the execution DAG. We further assume that all the data structures are (implicitly) batched data structures, and that the number of data structures in use is bounded by some constant. To support implicit batching, each data structure call is automatically handled by a parallel buffer for the data structure. (See Appendix Section A.1.)
The execution DAG consists of core nodes and ds nodes, which are dynamically generated as follows. At the start has a single core node, corresponding to the start of the program . Each node could be a local instruction or a synchronization instruction (including fork/join and acquire/release of a lock). Each core node could also be a data structure call. When a node is executed, it may generate child nodes or terminate. A join instruction also generates edges that linearize all the join operations according to the actual execution. Likewise, simultaneous operations on a non-blocking lock generate child nodes that are linearized by edges. For a blocking lock, a release instruction generates a child node that is simply the resumed thread that next acquires the lock (if any), with an edge to it from the node corresponding to the originally suspended thread.
The core nodes are further classified intoprogram nodes and buffer nodes. The program nodes (here -nodes) correspond to nodes in the program DAG , and they generate only program nodes except for data structure calls. An -call generates a buffer node corresponding to passing the call to the parallel buffer. This buffer node generates more buffer nodes, until at some point it generates an -node (every -node is a ds node), corresponding to the actual operation on , which passes the input batch to . That -node generates only -nodes except for when it returns the result of some operation in the batch (generating a program node with an edge to it from the original -call), or when it becomes ready for input (generating a buffer node that initiates flushing of the parallel buffer).
We shall now precisely define the notion of effective work/span/cost for a parallel data structure used by a (terminating) parallel program.
Definition 5 (Effective Work/Span/Cost).
Take any program using a batched data structure on processors. Let be the actual execution DAG of using . Then the effective work taken by (as used by ) is the total number of -nodes in . And the effective span taken by is the maximum number of -nodes on a path in . And the effective cost of is .
The effective cost has the desired property that it is subadditive across multiple parallel data structures. This implies that our results are composable with other data structures in this model, since we actually show the following for some linearization :
Interestingly, the bound for the effective cost of is independent of the scheduler, while the effective cost bound for requires a weak-priority scheduler. In addition, using any greedy scheduler, the parallel buffer for either map has effective cost (analogously defined) at most where is the effective work taken by (Appendix creftype 26). Therefore our main results (creftype 3 and creftype 4) follow from the above claims.
Unless otherwise stated, we work within the pointer machine model for parallel programs given by Goodrich and Kosaraju  333In short, the main memory can be accessed only via pointers, which can only be stored, dereferenced or tested for equality (no pointer arithmetic). . But instead of having synchronous processors, we introduce a new more realistic 444Exclusive reads/writes (EREW) is too strict, while concurrent reads/writes (CRCW) does not realistically model the cost of contention, as stated in . QRMW model with queued read-modify-write operations (including read, write, test-and-set, fetch-and-add, compare-and-swap) as described in , where multiple memory requests to the same memory cell are FIFO-queued and serviced one at a time, and the processor making each memory request is blocked until the request has been serviced. Our data structures can hence be implemented and used in the dynamic multithreading paradigm.
This QRMW pointer machine model supports binary fork/join primitives. It cannot support constant-time random-access locks, but it supports non-blocking locks (try-locks), where attempts to acquire the lock are serialized but do not block. Acquiring a non-blocking lock succeeds if the lock is not currently held but fails otherwise, and releasing always succeeds. If threads concurrently access a non-blocking lock, then each access completes within time steps. Non-blocking locks can be used to support activation calls to a process, where activating a process will start its execution iff it is not already executing and it is ready (some condition is satisfied), and the process can optionally reactivate itself on finishing.
We can also implement a dedicated lock, which is a blocking lock initialized with keys for some constant , such that simultaneous acquisitions must be made using distinct keys. When a thread attempts to acquire a dedicated lock, it is guaranteed to obtain the lock after at most other threads that attempt to acquire the lock at the same time or later.
5 Amortized Sequential Working-set Map
In this section we explain the amortized sequential working-set map , which is similar to Iacono’s working-set structure , but does not move an accessed item all the way to the front. This localization of self-adjustment is the basis for parallelizing it as in .
keeps the items in a list with segments . Each segment has capacity and every segment is full except perhaps the last. Items in each segment are stored in both a key-map and a recency-map, each of which is a BBT (balanced binary tree), sorted by key and by recency respectively. Consider any item currently in segment . On a search of , if then is moved to the front (most recent; i.e. first in the recency-map) of , otherwise is moved to the front of and the last (least recent) item of is shifted to the front of . On a deletion of , it is removed and for each the first (most recent) item of is moved to the back of . On an insertion, the item is added at the back of (if is full, then it is added to a new segment ).
We now prove an abstract lemma about a list with operations and costs that mimic . We will later use this same lemma to analyze and as well.
Lemma 6 (Working-Set Cost Lemma).
Take any sequence of operations on an abstract list , each of which is a search, insert, delete or demote (described below), and a constant , such that the following hold (where is the current size of ):
Searching for an item with rank in costs and it is pulled forward to within the first items in .
Searching for an item not in costs .
Inserting or deleting an item costs .
Demoting an item in costs and pushes it backward in , but that item subsequently can only be demoted or deleted.
Then the total cost of performing on is , where demotions are ignored in computing (they are not counted as accesses).
We shall perform the analysis via the accounting method; each operation on has a budget according to , and we must use those credits to pay for that operation, possibly saving surplus credits for later. Define the -recency of an item in to be the number of items in that have been inserted before or pulled forward past in since the last operation on . Clearly, for each search/insertion/deletion of an item in , its access rank (actual recency) is at least the -recency of . Each item in has some stored credit, and we shall maintain the invariant that every item in with -recency and stored credit is within the first items in or has been demoted. The invariant trivially holds at the start.
First we show that, on every operation on an item , the invariant can be preserved for itself. For insertion/deletion or unsuccessful search for , the budget of can pay for the operation and (for insertion) also pay for the stored credit for . For successful search for , it is as follows. Let be the stored credit and be the -recency of before the operation, and let be the rank of in after that. By the invariant, was within the first items in before the operation. Also the budget is . If , then and so the budget can pay for both the operation cost and a new stored credit of . If , then and so the stored credit can pay for the operation cost and a new stored credit of .
Finally we check that the invariant is preserved for every other item in . For search/insertion of , the rank of in changes by the same amount as its -recency. For deletion of , if is after in then is more recent than and so the -recency of does not change, and if is before in then the rank of in decreases by and its -recency decreases by at most . For a demotion, every other item’s rank in does not increase.
This lemma implies that has the desired working-set property.
Theorem 7 ( Performance).
The cost of satisfies the working-set bound.
Let be the number of items in . By construction, , and each operation on takes time on segment . Thus each insertion/deletion takes time, and each access/update of an item with rank in (in order of segment followed by order in the recency-map) takes time. Also, on each access of an item with rank in , its new rank is at most , because if is in then , and if is in for some then and . Thus by the Lemma 6 (Working-Set Cost Lemma). (creftype 6) we are done.
6 Simple Parallel Working-Set Map
We now present our simple batched working-set map . The idea is to use the amortized sequential working-set map (Section 5) and execute operations in batches 555Each batch is stored in a leaf-based balanced binary tree for efficient processing.. In order to get the bound we desire, however, we must combine operations in a batch that access the same item. In particular, for consecutive accesses to the same item, all but the first one should cost . Therefore, we must sort the batch (efficiently) using the Definition 32 (Parallel Entropy Sort). (Appendix creftype 32), to ‘combine duplicates’. We also control the size of batches — if batches are too small, then we lose parallelism; if the batches are too large, then the sorting cost is too large.
6.1 Description of
As described in the Section 4, -calls in the program are put into the parallel buffer for . When is ready (i.e. the previous batch is done), we flush the parallel buffer to obtain the next input batch , which we cut and store in a feed buffer, which is a queue of bunches (a bunch is a set supporting addition of a batch and -span conversion to a batch if it has size ) 666A bunch is implemented using a complete binary tree with batches at the leaves, with a linked list threaded through each level to support adding a leaf in steps. each of size except possibly the last. Specifically, we divide into small batches of size except possibly the first and last, the first having size , where is the size of and is the size of the last bunch in the feed buffer. Next we add that first small batch to , and append the rest as bunches to the feed buffer. Let be the current size of . Then we remove the first bunches from the feed buffer, convert them into batches, merge them in parallel into a cut batch , and process it as follows.
First we sort using the Definition 32 (Parallel Entropy Sort). (Appendix creftype 32). Then we combine all operations on each item into one group-operation 777Each group-operation stores the original operations as a batch. that is treated as a single operation with the same effect as the whole group of operations in the given order.
We pass the resulting batch (of group-operations) through each segment from to . At segment , first we search for the relevant items. For insertions, if the item is found then we treat it as an update. For successful searches/updates, we return the results immediately, and shift the items (keeping their relative order) to the front of the previous segment ( if ). For deletions, if the item is found then we delete the item. Next, we restore the capacity invariant for — for each segment from to , we transfer the appropriate number of items between the front of and the back of so that either have total size or is empty. Then we pass the unfinished (unreturned) operations (including all deletions) on to the next segment.
At the end, we handle the remaining insertions. First we insert at the back of up to its capacity. If there are leftover items, we create segments with just enough capacity, and carve out the correct amount for each segment from to in that order.
Finally, we return the results for the insertions/deletions and the unsuccessful searches/updates, and we are done processing the batch.
To parallelize the above, we need to be able to efficiently delete, from each segment, any sorted batch of items or any number of the most/least recent items. For this we replace each BBT by a A.2 Batched Parallel 2-3 Tree (Appendix Section A.2), where each leaf in the key-map also has a direct pointer to the corresponding leaf in the recency-map, and vice versa. Given a sorted batch of items, we can find them by one batch operation on the key-map, and then we have a batch of direct pointers to the leaves for these items in the recency-map, and hence can perform one reverse-indexing operation on the recency-map to obtain a sorted batch of indices for those leaves, with which we can perform one batch operation to delete them. Similarly, to remove the most recent items from a segment, we find them via the recency-map, and then do a reverse-indexing operation on the key-map to obtain them in sorted order, whence we can delete them.
6.2 Analysis of
We first bound the cost of parallel entropy-sorting each batch (creftype 10). To do so, we will find a batch-preserving linearization (creftype 8) such that for each batch of size , the entropy bound is at most the insert working-set bound (creftype 9), which we in turn bound by the cost according to the working-set bound , plus per operation when the map is small (i.e. has size ). This extra cost arises when a batch has many operations on distinct items while the map is small, which according to are cheap.
Definition 8 (Batch-Preserving Linearization).
Take any sequence of batches of operations on a map . We say that is a batch-preserving linearization of if is a permutation of the operations in that preserves the ordering of batches and (within each batch) the ordering of operations on the same item.
Remark 0 ().
When any two batch-preserving linearizations of are performed on , the items in are the same after each batch, and the successfulness of each operation remain the same.
Definition 9 (Insert Working-Set Bound).
The insert working-set bound for any sequence of map operations is the working-set bound for ‘inserting’ the items in in the given order (ignoring the actual operations) into an empty map, namely for each item first searching for it and then inserting it iff it is absent.
Lemma 10 (Batch-Sorting Cost Lemma).
Take any sequence of batches of operations on a map , and any constant . Then there is a batch-preserving linearization of such that parallel entropy-sorting (Appendix creftype 32) each batch in takes total work over all batches, where each batch in has size and has operations that are performed when has size less than (when is performed on ).
Let be a batch-preserving linearization of such that each batch in has the maximum insert working-set bound (creftype 9). By the Theorem 31 (Worse-case Working-set Bound). (Appendix creftype 31) is at least the entropy bound for . Thus parallel entropy-sorting takes work (Appendix creftype 33).
Let be the size of , and be the number of distinct items (accessed by operations) in . Partition into subsequences and such that has only the first operation of every distinct item in . For each , let be the cost of the operations in according to , and let be the cost of the operations in according to , so . Let be the number of operations in performed when has size less than (according to ).
Note that because every operation in is a successful search according to with access rank no more than according to . Thus it suffices to show that .
If , then obviously .
If , then at least operations in are performed when has size at least . So according to , each of those operations has access rank at least and hence costs . Thus . Also, , since any insertion on a map with at most items has access rank .
Next we prove a simple lemma that allows us to divide the work done on the segments among the operations.
Lemma 11 ( Segment Work).
Each segment takes work per operation that reaches it.
Searching/deleting/shifting the relevant items in the parallel 2-3 trees takes work per operation. Also, for each , the number of transfers (to restore the capacity invariant) between and is at most the number of operations, and each transfer takes work because there are always at most items in . Thus the transfers take total work.
Then we can prove the desired effective work bound for .
Theorem 12 ( Effective Work).
takes effective work for some linearization of .
Cutting the input batch of size from the parallel buffer into small batches takes work. Adding the first small batch to the last bunch in the feed buffer takes work. Inserting the bunches into the feed buffer takes work. Forming a cut batch of size (converting the bunches and merging the results) takes work. So all this buffering work adds up to per map operation.
Sorting the (cut) batches takes total work (over all batches) for some linearization , by the Lemma 10 (Batch-Sorting Cost Lemma). (creftype 10). Specifically, we choose . For each batch , let be the size of just before that batch, and then has size and:
If , then and so (as ).
If , then so none of the operations in that batch can be small-ops.
It now suffices to show that the work on segments is for some linearization (since either or suffices for the final bound). For this, we pretend that a deleted item is marked rather than removed, and when a segment is filled to capacity all marked items are simultaneously transferred to the next segment, and at the last segment the marked items are removed. This takes more work than what actually does, but is easier to bound.
We shall now use the Lemma 6 (Working-Set Cost Lemma). (creftype 6) on the list of the items in (including the marked items) in order of segment followed by recency within the segment, where is updated after the batch has passed through each segment in the actual execution of , and after we finish processing the batch.
We simulate the updates to by list operations as follows:
Shift successfully searched/updated items in : Search for them in reverse order (from back to front in ).
Shift marked (to-be-deleted) items in : Demote them.
Insert items in : Insert in the desired positions.
Remove marked items in : Delete them.
This simulation yields a sequence of list operations on , to which we can then apply the Lemma 6 (Working-Set Cost Lemma)..
For each search for an item with rank in , is found in or some segment such that , and so by Lemma 11 ( Segment Work). (creftype 11) the search takes work in , after which has new rank in at most , like in (creftype 7). After each batch , let be the final size of and be the new last segment, and then each insertion in takes work in . Each deletion takes work in .
Thus by creftype 6, takes work on segments. Now let be the same as but with each group-operation expanded to its original sequence of operations. Clearly , since each group-operation is on the same item, so we are done.
And now we turn to bounding the effective span.
Theorem 13 ( Effective Span).
takes effective span, where is the number of operations on , and is the maximum size of .
First we bound the the span of processing each cut batch (i.e. the span of the corresponding execution subDAG). Let denote the maximum span of processing a cut batch of size . Take any cut batch of size and let be the size of just before . takes span to be removed and formed from the feed buffer, and span to be sorted. then takes span in each segment (because shifting between parallel 2-3 trees of size or cutting a batch of size takes span), which adds up to span over all segments, since when . Returning the results for each group-operation takes span. Thus . If then . If then and hence .
Now let be the actual execution DAG for using (on processors). Then the effective span of is simply the time taken to execute on an unlimited number of processors when each -node in takes unit time while every other node takes zero time, since captures all relevant behaviour of using including all the dependencies created by the locks. In this execution, we put a counter at each -call in the program DAG , initialized to zero, and at each step we increment the counter at every pending -call (i.e., the result is not yet returned). Then the total number of steps is at most the final counter-weighted span of , which we now bound.
Take any path in . Consider each -call on . We trace the ‘journey’ of from the parallel buffer as an operation in an uncut batch of size to a cut batch of size to the end of .
Observe that any batch of size takes span to be flushed from the parallel buffer, and span to be cut and added/appended to the bunches in the feed buffer, which in total is at most span.
So, first of all, waits for the preceding uncut batch of size to be processed, taking span. Next, waits for the current cut batch of size to be processed, taking span. After that, is processed, taking span. Then waits for intervening cut batches (between and ) with operations in total. Each intervening batch has some size and hence . Finally, is processed, taking span. Thus takes span in total.
Note that no two -calls on the path can wait for the same intervening batch, because the second can be executed only after the first has returned. Thus over all counters at -calls on , each of will sum up to at most . Therefore the final counter-weighted span of is at most .
7 Faster Parallel Working-Set Map
To reduce the effective span of , we intuitively have to:
Shift each accessed item near enough to the front, so that accessing it again soon would be cheap.
Pipeline the batches somehow, so that an expensive access in a batch does not hold up the next batch.
Naive pipelining will not work, because operations on the same item may take too much work. Hence we shall use a filter before the pipelined segments to ensure that operations proceeding through them are on distinct items, namely we pass all operations through the filter and only allow an operation through if there is not already another operation on the same item in the pipeline.
For similar reasons as in , we must control both the batch size and filter size to achieve enough parallelism, and so we choose the filter capacity to be . However, we cannot put the filter before the first segment, because accessing the filter requires work per operation, whereas to meet the working-set bound we need operations with access rank to cost only work.
Therefore, we divide the segments into the first slab and the final slab, where the first slab comprises the first segments and the final slab contains the rest, and put the filter after the first slab. Only operations that do not finish in the first slab are passed through the filter, and so the filtering cost per operation is bounded by the work already incurred in going through the first slab. Furthermore, we shift accessed items to the front of the final slab, and ‘cascade’ the excess items only when a later batch passes.
We cannot pipeline the first slab, but since the first slab is essentially a copy of but with only trees, its non-pipelined span turns out to be bounded by the span of sorting. To allow operation on items in the first slab to finish quickly, we need to allow the first slab to run while the final slab is running, but only when the filter has size at most , so that the filter size is always .
We also use special locking schemes to guarantee that the first slab and the segments in the final slab can process the operations at a consistent pace without interfering with one another. Finally, we shall weakly prioritize the execution of the final slab, to prevent excessive work from being done in the first slab on an item if there is already an operation on in the final slab.
7.1 Description of
|First slab: where|
We shall now give the details for implementing this (see Figure 2), making considerable use of the A.2 Batched Parallel 2-3 Tree (Appendix Section A.2). has the same segments as in , where segment has assigned capacity but may be under-full or over-full. We shall group the first segments into the first slab, and the other segments into the final slab. uses a feed buffer (like ; see Section 6.1), which is a queue of bunches creftype 6 each of size except possibly the last.
The interface is ready iff both the following hold:
The parallel buffer or feed buffer is non-empty.
The filter has size at most .
When the interface is activated (and ready), it does the following (in sequence) on its run (the locks are described later):
Let be the size of the last bunch in the feed buffer. Flush the parallel buffer and cut the input batch of size into small batches of size except possibly the first and last, where the first has size . Add that first small batch to , and append the others as bunches to the feed buffer. Remove the first bunch from the feed buffer and convert it into a batch , which we shall call a cut batch.
Pass through the first slab, which processes the operations as in . Successful searches/updates immediately finish, while the rest finish only if there was no final slab. Successful deletions are tagged to indicate success. But just before running (if it exists) to process the remaining batch at that segment, acquire the neighbour-lock shared with (as shown in Figure 2) and then acquire the front-lock .
If there was a final slab, then pass the (sorted) batch of unfinished operations through the filter (including successful deletions), insert the filtered batch into the buffer before , and fork (a child thread) to activate .
Release and the neighbour-lock shared with .
The filter is used to ensure that at any point all the operations in the final slab are on distinct items. It is implemented using a batched parallel 2-3 tree that stores items, each tagged with a list of operations on that item (in the order they arrive at the filter) and their cumulative effect (as a single equivalent map operation).
When a batch is passed through the filter, each operation on an item in the filter is appended to the list for it (the effect is also updated) and filtered out of the batch, whereas each operation on an item not in the filter is added to the filter and put into the buffer of .
The final slab is pipelined in the following way. Between every pair of consecutive segments is a neighbour-lock, which is a dedicated lock (see Section 4 Memory Model) with key for each arrow to it in Figure 2. Since each segment needs to access the filter and the contents of , those accesses will also be guarded by a front-locking scheme using a series of front-locks , each of which is a dedicated lock with key for each arrow to it in Figure 3. (This will be fully spelt out below.)
Each final slab segment has a sorted buffer before it (for operations from ), which is a batched parallel 2-3 tree. is ready iff its buffer is non-empty, and when activated (and ready) it runs as follows (front-locking is highlighted):
Acquire the neighbour-locks (between and its neighbours) in the order given by the arrow number in Figure 2.
If , acquire .
If is the terminal segment and have total size exceeding their total capacity, create a new terminal segment .
Flush and process the operations in its buffer as follows:
Search for the accessed items in (by performing one batch operation on the key-map in ). Let be the (sorted batch of) items in that are found in , and delete from .
If , acquire to in that order.
Search for in the filter to determine what to do with it. Let be the items in to be searched or updated, and delete from the filter. (Insertions on items in are treated as updates.) Perform all the updates on items in .
Let . Fork to return the results for operations on , and insert at the front of . If is (now) the terminal segment, perform all insertions at the front of , and delete from the filter, and fork to return the results for operations on .
If the filter size is at most , fork to activate the interface.
If , release to in that order.
If is over-full, transfer items from the back of to the front of so that is full.
If is under-full by items and has items and has successful deletions, transfer items from the front of to the back of .
If is not the terminal segment, insert the operations on (with successful deletions tagged as such) into the buffer of , then fork to activate .
If is the terminal segment and is empty, remove to make the new terminal segment.
If , release .
Release both neighbour-locks and reactivate itself.
7.2 Weak-priority scheduler
It turns out that, ignoring the sorting cost, all we need to achieve the working-set bound is that, for each operation on an item in the final slab, the work ‘done’ by the first slab on can be counted against the work done in the final slab. This can be ensured using a weak-priority scheduler. A weak-priority scheduler has two queues and , where is the high-priority queue, and each ready node is assigned to either or , and at every step the following hold:
If there are ready nodes, then of them are executed.
If queue has ready nodes, then of them are executed (and so nodes are weakly-prioritized).
The -nodes generated for the final slab are assigned to , while all other -nodes are assigned to . Specifically, each (forked) activation call to and all nodes generated by that are assigned to , except for activation calls to the interface (which are assigned to ). Any suspended thread is put back into its original queue when it is resumed (i.e. the resuming node is assigned to the same queue as the acquiring node).
7.3 Analysis of
For each computation (subDAG of the actual execution DAG), we shall define its delay, which intuitively captures the minimum possible time it needs, including all waiting on locks. Each blocked acquire of a dedicated lock corresponds to an acquire-stall node in the execution DAG whose child node is created by the release just before the successful acquisition of the lock. Let be the ancestor nodes of that have not yet executed at the point when is executed. Then we define delay as follows.
Definition 14 (Computation Delay).
The delay of a computation is recursively defined as the weighted span of , where each acquire-stall node in is weighted by the delay of (to capture the total waiting at ), and every other node has unit weight.
Also, we classify each operation as segment-bound iff it finishes in some segment, and filter-bound otherwise (namely filtered out due to a prior operation on the same item that finishes in the final slab).
The key lemma (creftype 19) is that any operation that finishes in segment takes delay to be processed by the final slab, which is ensured (creftype 18) by the front-locking scheme and the balance invariants (creftype 16).
Then takes effective work for some linearization of , and here is the intuitive proof sketch:
The work divides into segment work (done by the slabs) and non-segment work (cutting and sorting batches, and filtering).
The segment work can be divided per operation; segment does work per operation that reaches it (creftype 20).
The cutting work is per operation, and the (overall) sorting work is for some linearization of (creftype 10). The filtering work is per filtered operation, which can be ignored since each filtered operation already took work in the first slab. Similarly we can ignore the work done in passing the batch through .
The segment work on filter-bound operations is too:
We can ignore work during high-busy steps (where has at least ready nodes), because the final slab takes work and so there are high-busy steps.
We can ignore every high-busy run of (namely with at least half its steps high-busy), because its work is times the work during high-busy steps.
High-idle runs of take total work.
Every filter-bound operation is filtered out due to some operation in the final slab. So take any operation on item that finishes in a final slab segment .
During each high-idle run (which takes high-idle steps), the processing of is not blocked by any -thread (since does not hold any neighbour-lock or filter-lock), so each high-idle step ‘reduces’ its remaining delay, which by the key lemma (creftype 19) is .
Therefore the work done on by high-idle runs while is in the final slab is times the work done by the final slab on .
Moreover, takes effective span for some linearization of :
The effective span is the time taken to run the execution DAG on infinitely many processors, where each -node takes unit time while every other node takes zero time.
There are filter-full steps (steps in which the filter has size at least ). To see why, let every operation in the final slab consume a token on each step. Then each filter-full step consumes at least tokens. But by the key lemma (creftype 19) the total token consumption is just times the total work in the final slab, which amounts to (creftype 21).
There are filter-empty steps (filter size at most ) for some linearization of , where is the number of -calls, because each operation essentially has the following path:
It waits filter-empty steps for the current cut batch in the first slab.
Then it waits filter-empty steps per intervening cut batch of size .
Finally it takes steps to pass through the slabs where is its access rank according to , by the key lemma and the rank invariant (creftype 24).
We shall now give the details. For the purpose of our analysis, we consider a segment to be running iff it has acquired all neighbour-locks and has not released them. We also consider an operation to be in segment exactly when it is in the buffer of or is being processed by up to item 4h (inclusive). Likewise, we consider an item that is found in and to be searched/updated to remain in until it is shifted to (in item 4d).
We begin by showing that remains ‘balanced’; each non-terminal segment has size not too far from capacity.
Definition 15 ( Segment Holes).
We say that a segment has holes if is not the terminal segment but has fewer items than its capacity. (If exceeds capacity then it has no holes.)
Lemma 16 ( Balance Invariants).
The following balance invariants hold:
If is not running or does not exist, then does not exceed capacity (if it exists).
If the interface is not running, has no holes, and has at most holes where is the number of successful deletions in .
Each final slab segment has at most items.
If a final slab segment is not running, is at most below capacity.
Let be the filter size. Then always, since the interface only runs if , on a batch of size at most .
can only exceed capacity when runs, and restores the invariant (in item 4g) before it finishes running.
Only the interface creates more holes in (in item 3), each corresponding to a unique successful deletion that is inserted into the buffer of , and so just after the interface finishes running has at most holes where is the number of successful deletions in , and all the holes must be in since . Once runs, will have no holes, because either was the terminal segment or had at least items by Invariant 4.
To establish Invariants 3,4, we shall prove sharper invariants. Let be the total size of minus their total capacity. Let be the number of unfinished operations in . Let be the number of successful deletions in . Then the following invariants hold:
If a final slab segment is not running, .
For each final slab segment , we always have .
If a final slab segment is not running, .
Firstly, (A) holds for , because and by Invariant 1 since the interface does not insert any item. Thus by induction it suffices to show that (A) holds for where assuming (A) holds for . This can be done by the following observations:
When is not running, never increases, because each search/update/deletion in does not increase or affect , and each operation that finishes in increases by at most but decreases by .
When is newly created, by (C) since was the previous terminal segment, and .
When is running, is not running. Thus just after finishes running, since does not exceed capacity (due to item 4g), and since has an empty buffer. Thus by (A) for .
We now establish (B) using (A). Consider each final slab segment run. Just before that run, by (A), and there are at most unfinished operations in . During that run, no new operation enters , and increases by at most for each unfinished operation in that finishes. Thus throughout that run.
Next we establish (C). Note that the terminal segment is only changed by the interface or the previous terminal segment, and so when the terminal segment is not running, never increases because no insertions finish. It suffices to observe the following:
Just after is newly created, it does not exceed capacity and so .
Just before was removed (making the new terminal segment), had finished running, and so did not exceed capacity, and hence by (A) for .
Whenever runs and does not create a new terminal segment, just before that run do not exceed total capacity and so at that point. There are two cases:
If : Just before that run, since do not exceed capacity, and has at most operations. During that run, increases by at most per unfinished operation in , and hence after that run .
If : Just before that run, by (A) for , and has operations. During that run, is increased only by per unfinished operation in , and hence after that run .
Finally we establish (D). Firstly, (D) holds for by Invariant 2. Thus by induction it suffices to show that (D) holds for where assuming (D) holds for . This can be done by the following observations:
Any search/update/insertion that finishes does not decrease .
If is not running, any deletion that succeeds in decreases by but increases by .
When is newly created by an run, after that run does not exceed capacity (due to item 4g) and so for each hole in there will be at least successful deletion in , and hence by (D) for .
When runs, after the run because:
If after the run is exactly full, then at that point and , and by (D) for .
If after the run is below capacity and still exists, the run must have made frontward transfers (in item 4h) where was the number of successful deletions in at the start of that run. Thus the run increased by at least , and decreased by .
Finally we can establish Invariants 3,4. By both (B) and (C), for each final slab segment we have , and hence has size at most . By (D), if a final slab segment is not running, then and hence is at most below capacity.
Corollary 17 ( Segment Access Bound).
Each batch operation on a parallel 2-3 tree in segment where takes work per operation in the batch and span.
Now we can prove a delay bound on the ‘front access’ (through the front-locks) by each final slab segment.
Lemma 18 ( Front Access Bound).
Any segment takes total delay to acquire the front-locks and run the front-locked section (in-between) and then release . And similarly the interface takes delay to acquire and run the front-locked section and then release .
The front-locked section takes delay, since each operation on a parallel 2-3 tree in takes span by creftype 17. We shall show by induction that any segment that has acquired will release within delay, where is a constant chosen to make it true when . If , then next attempts to acquire , and if it fails then must now be holding it and will release it within delay by induction, and then will actually acquire and then will release within delay by induction, which in total amounts to delay. Therefore any segment that attempts to acquire will wait at most delay for any current holder of to release it, and then take at most delay to run its front-locked section and release , which is in total delay. Similarly for when the interface attempts to acquire .
Then we can prove the key lemma:
Lemma 19 ( Final Slab Bound).
Take any segment where , and any operation . Then runs within delay. So if finishes in segment then the processing of in the final slab takes delay.
Once any acquires the second neighbour-lock, it will finish within delay, since the operations on the parallel 2-3 trees takes span by creftype 17, the front access take delay by creftype 18, and inserting the unfinished operations into the buffer of takes span. Thus once any acquires the first lock, it waits delay for the holder of the second lock to finish, and then itself finishes within delay. And once the interface acquires the lock shared with , it will finish within delay by creftype 18. Thus any when run will acquire both locks within delay, and then itself finish within delay. Therefore, the final slab takes delay to process any operation that finishes in .
To bound the total work, we begin by partitioning it per operation:
Lemma 20 ( Segment Work).
We can divide the segment work (work done on segments) in among the operations in the following natural way — each segment does work per operation it processes.
If , then the proof is the same as for Lemma 11 ( Segment Work). (