Parallel Finger Search Structures


Keywords

Parallel data structures, multithreading, dictionaries, comparison-based search, distribution-sensitive algorithms

Abstract

In this paper we present two versions of a parallel finger structure FS on p processors that supports searches, insertions and deletions, and has a finger at each end. (This is the authors' version of a paper submitted to the 33rd International Symposium on Distributed Computing (DISC '19). It is posted here for your personal or classroom use. Not for redistribution. © 2019 Copyright is held by the owner/author(s).) This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within a factor of O( (log p)^2 ) of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program P that is modelled by a dynamically generated DAG D where each node is either a unit-time instruction or a call to FS.

The total work done by either version of FS is bounded by the finger bound F[L] (for some linearization L of D), i.e. each operation on an item with distance r from a finger takes O( log r + 1 ) amortized work. Running P using the simpler version takes O( ( T[1] + F[L] ) / p + T[inf] + d * ( (log p)^2 + log n ) ) time on a greedy scheduler, where T[1], T[inf] are the size and span of D respectively, n is the maximum number of items in FS, and d is the maximum number of calls to FS along any path in D. Using the faster version, this is reduced to O( ( T[1] + F[L] ) / p + T[inf] + d * (log p)^2 + s[L] ) time, where s[L] is the weighted span of D where each call to FS is weighted by its cost according to F[L]. We also sketch how to extend FS to support a fixed number of movable fingers.

The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers.

Acknowledgements

We would like to express our gratitude to our families and friends for their wholehearted support, to the kind reviewers who provided helpful feedback, and to all others who have given us valuable comments and advice. This research was supported in part by Singapore MOE AcRF Tier 1 grant T1 251RES1719.

1 Introduction

There has been much research on designing parallel programs and parallel data structures. The dynamic multithreading paradigm (see [14] chap. 27) is one common parallel programming model, in which algorithmic parallelism is expressed through parallel programming primitives such as fork/join (also spawn/sync), parallel loops and synchronized methods, but the program cannot stipulate any mapping from subcomputations to processors. This is the case with many parallel languages and libraries, such as Cilk dialects [20, 24], Intel TBB [32], Microsoft Task Parallel Library [35] and subsets of OpenMP [29].

Recently, Agrawal et al. [3] introduced the exciting modular design approach of implicit batching, in which the programmer writes a multithreaded parallel program that uses a black box data structure, treating calls to the data structure as basic operations, and also provides a data structure that supports batched operations. Given these, the runtime system automatically combines these two components together, buffering data structure operations generated by the program, and executing them in batches on the data structure.

This idea was extended in [4] to data structures that do not process only one batch at a time (to improve parallelism). In this extended implicit batching framework, the runtime system not only holds the data structure operations in a parallel buffer, to form the next input batch, but also notifies the data structure on receiving the first operation in each batch. Independently, the data structure can at any point flush the parallel buffer to get the next batch.

This framework nicely supports pipelined batched data structures, since the data structure can decide when it is ready to get the next input batch from the parallel buffer, which may be even before it has finished processing the previous batch. Furthermore, this framework makes it easy for us to build composable parallel algorithms and data structures with composable performance bounds. This is demonstrated by both the parallel working-set map in [4] and the parallel finger structure in this paper.
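To make the framework concrete, here is a minimal Python sketch of an implicit-batching wrapper written from the description above. The class and method names (call, flush, process_batch) are illustrative assumptions rather than the framework's actual API, and a real runtime would buffer calls in parallel instead of under a single lock.

```python
import threading

class ImplicitBatchingWrapper:
    """Toy model of (extended) implicit batching: program threads submit
    operations that block until answered, while the batched data structure
    decides when to flush the buffer and process the next input batch."""

    def __init__(self, batched_ds):
        self.ds = batched_ds          # assumed to expose process_batch(batch)
        self.buf = []
        self.lock = threading.Lock()

    def call(self, op):
        """Called by program threads; blocks until the result is available."""
        done = threading.Event()
        slot = {}
        with self.lock:
            was_empty = not self.buf
            self.buf.append((op, slot, done))
        if was_empty:
            # first operation of a new batch: notify the data structure,
            # here modelled as simply starting a serving thread
            threading.Thread(target=self._serve).start()
        done.wait()
        return slot.get("result")

    def flush(self):
        """Hand the currently buffered operations to the data structure."""
        with self.lock:
            batch, self.buf = self.buf, []
        return batch

    def _serve(self):
        batch = self.flush()
        results = self.ds.process_batch([op for op, _, _ in batch])
        for (op, slot, done), res in zip(batch, results):
            slot["result"] = res
            done.set()
```

A toy batched structure could implement process_batch(ops) by applying the operations to an ordinary dictionary and returning one result per operation; the point of the sketch is only the buffering and blocking behaviour.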

Finger Structures

The map (or dictionary) data structure, which supports inserts, deletes and searches/updates, collectively referred to as accesses, comes in many different kinds. A common implementation of a map is a balanced binary search tree such as an AVL tree or a red-black tree, which (in the comparison model) takes O(log n) worst-case cost per access for a tree with n items. There are also maps such as splay trees [34] that have amortized rather than worst-case performance bounds.

A finger structure is a special kind of map that comes with a fixed finger at each end and a (fixed) number of movable fingers, each of which has a key (possibly -infinity or +infinity or between adjacent items in the map) that determines its position in the map, such that accessing items nearer the fingers is cheaper. For instance, the finger tree [22] was designed to have the finger property in the worst case; it takes O(log r + 1) steps per operation with finger distance r (Definition 1), so its total cost satisfies the finger bound (Definition 2).

Definition 1 (Finger Distance).

Define the finger distance of accessing an item x on a finger structure M to be the number of items from x to the nearest finger in M (including x), and the finger distance of moving a finger to be the distance moved.

Definition 2 (Finger Bound).

Given any sequence L of operations on a finger structure M, let F[L] denote the finger bound for L, defined by F[L] = sum over i of O( log r_i + 1 ), where r_i is the finger distance of the i-th operation in L when L is performed on M.
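As a concrete illustration of Definitions 1 and 2, the following Python snippet computes finger distances and the resulting finger bound for a structure with just the two fixed end fingers; the constants, and taking exactly log r + 1 as the per-operation cost, are assumptions made for the sake of the example.

```python
import math

def finger_bound(keys_in_order, ops):
    """Toy finger bound for a structure with one fixed finger at each end.
    `keys_in_order` is the current sorted content; `ops` lists accessed keys."""
    total = 0.0
    for key in ops:
        # rank of the accessed position counted from each end (including the item)
        rank_from_front = sum(1 for k in keys_in_order if k <= key) or 1
        rank_from_back = sum(1 for k in keys_in_order if k >= key) or 1
        r = min(rank_from_front, rank_from_back)   # finger distance
        total += math.log2(r) + 1                  # one term of F[L]
    return total

# Accesses near either end are cheap; an access in the middle costs about log n.
print(finger_bound(list(range(1, 1025)), [1, 2, 1024, 512]))
```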

Main Results

We present in this paper, to the best of our knowledge, the first parallel finger structure. In particular, we design two parallel maps that are work-optimal with respect to the finger bound (Definition 2), i.e. each takes O(F[L]) work for some linearization L of the operations (that is consistent with the results), while having very good parallelism. (We assume that each key comparison takes O(1) steps.)

These parallel finger structures can be used by any parallel program P, whose actual execution is captured by a program DAG D, where each node is an instruction that finishes in O(1) time or a call to the finger structure FS, called an FS-call, that blocks until the result is returned, and each edge represents a dependency due to the parallel programming primitives.

The first design is a simpler data structure that processes operations one batch at a time.

Theorem 3 (Performance of the Simpler Version).

If P uses the simpler version (as FS), then its running time on p processors using any greedy scheduler (i.e. at each step, as many tasks are executed as are available, up to p) is

O( ( T[1] + F[L] ) / p + T[inf] + d * ( (log p)^2 + log n ) )

for some linearization L of the FS-calls in D, where T[1] is the number of nodes in D, T[inf] is the number of nodes on the longest path in D, d is the maximum number of FS-calls on any path in D, and n is the maximum size of FS. (To cater to instructions that may not finish in O(1) time (e.g. due to memory contention), it suffices to define T[1] and T[inf] to be the (weighted) work and span (Definition 5) respectively of the program DAG D where each FS-call is assumed to take O(1) time.)

Notice that if FS were an ideal concurrent finger structure (i.e. one that takes O(F[L]) work), then running P using FS on p processors according to the linearization L would still take Omega( ( T[1] + F[L] ) / p + T[inf] ) time in the worst case. Thus the simpler version gives an essentially optimal time bound except for the 'span term' d * ( (log p)^2 + log n ), which adds O( (log p)^2 + log n ) time per FS-call along some path in D.

The second design uses a more complex internal pipeline to reduce the 'span term'.

Theorem 4 (Performance of the Faster Version).

If P uses the faster version, then its running time on p processors using any greedy scheduler is

O( ( T[1] + F[L] ) / p + T[inf] + d * (log p)^2 + s[L] )

for some linearization L of the FS-calls in D, where d is the maximum number of FS-calls on any path in D, and s[L] is the weighted span of D where each FS-call is weighted by its cost according to F[L], except that each finger-move operation is weighted separately. Specifically, each FS-call that is an access with finger distance r according to L is given the weight O( log r + 1 ), finger-moves are given their own weights, and s[L] is the maximum weight of any path in D. Thus, ignoring finger-move operations, the faster version gives an essentially optimal time bound up to an extra O( (log p)^2 ) time per FS-call along some path in D.

We shall first focus on basic finger structures with just one fixed finger at each end, since we can implement the general finger structure with movable fingers by essentially concatenating basic finger structures, as we shall explain later in Section 6. We will also discuss later in Section 7 how to adapt our results for work-stealing schedulers that can actually be provided by a real runtime system.

Challenges & Key Ideas

The sequential finger structure in [22] (essentially a B-tree with carefully staggered rebalancing) takes worst-case O(log r + 1) time per operation with finger distance r, but seems impossible to parallelize efficiently. It turns out that relaxing this bound to amortized O(log r + 1) time admits a simple sequential finger structure (Section 3) that can be parallelized. In it, the items are stored in order in a list of segments, where each segment is a balanced binary search tree and the segment target sizes grow doubly-exponentially (each target is roughly the square of the previous one), with only the last segment allowed to fall short. This ensures that the k-th segment has height O(2^k), and that the items nearest each end are concentrated in the first few segments on that side. Thus for each operation with finger distance r, it takes O(log r + 1) time to search through the segments from both ends simultaneously to find the correct segment and perform the operation in it. After that, we rebalance the segments to preserve the size invariant, in such a way that each imbalanced segment is reset to its target size. These double-exponential segment sizes and the reset-to-middle rebalancing are critical in ensuring that all the rebalancing takes amortized O(1) extra time per operation, even though a single rebalancing cascade may propagate through the whole chain.

The challenge is to parallelize while preserving the total work. Naturally, we want to process operations in batches, and use a batch-parallel search structure in place of each binary search tree. This may seem superficially similar to the parallel working-set map in [4], but the techniques in the earlier paper cannot be applied in the same way, for three main reasons.

Firstly, searches and deletions for items not in the map must still be cheap if they have small finger distance, so we have to eliminate these operations in a separate preliminary phase by an unsorted search of the smaller segments, before sorting and executing the other operations.

Secondly, insertions and deletions must be cheap if they have small finger distance (e.g. deleting an item from the first segment must have O(1) cost), so we cannot enforce a tight segment size invariant, otherwise rebalancing would be too costly.

This is unlike the parallel working-set map, where we not only have a budget of O(log n) for each insertion, deletion or failed search, but also must shift accessed items sufficiently near to the front to achieve the desired span bound. The rebalancing in the parallel finger structures in this paper is hence completely different from that in the parallel working-set map.

Thirdly, for the faster version where the larger segments are pipelined, in order to keep all segments sufficiently balanced, the pipelined segments must never be too underfull, so we must carefully restrict when a batch is allowed to be processed at a segment. Due to this, we cannot even guarantee that a batch of operations will proceed at a consistent pace through the pipeline, but we can use an accounting argument to bound the 'excess delay' by the number of finger-structure calls divided by p.

Other Related Work

There are many approaches for designing efficient parallel data structures, so as to make maximal use of parallelism in a multi-processor system, whether with empirical or theoretical efficiency.

For example, Ellen et al. [17] show how to design a non-blocking concurrent binary search tree, with later work analyzing the amortized complexity [16] and generalizing this technique [13]. Another notable concurrent search tree is the CBTree [2, 1], which is based on the splay tree. But despite experimental success, the theoretical access cost for these tree structures may increase with the number of concurrent operations due to contention near the root, and some of them do not even maintain balance (i.e., the height may get large).

Another method is software combining [19, 23, 30], where each process inserts a request into a shared queue and at any time one process is sequentially executing the outstanding requests. This generalizes to parallel combining [6], where outstanding requests are executed in batches on a suitable batch-parallel data structure (similar to implicit batching). These methods were shown to yield empirically efficient concurrent implementations of various common abstract data structures including stacks, queues and priority queues.

In the PRAM model, Paul et al. [31] devised a parallel 2-3 tree in which p synchronous processors can perform a sorted batch of p operations on a tree of size n in O(log n + log p) time. Blelloch et al. [10] show how to increase the parallelism of tree operations via pipelining. Other similar data structures include parallel treaps [11] and a variety of parallel ordered sets [8] supporting unions and intersections with optimal work, but these do not have optimal span. As it turns out, we can in fact have parallel ordered sets with optimal work and span [5, 26].

Nevertheless, the programmer cannot use this kind of parallel data structure as a black box with atomic operations in a high-level parallel program, but must instead carefully coordinate access to it. This difficulty can be eliminated by designing a suitable batch-parallel data structure and using implicit batching [3] or extended implicit batching as presented in [4] and more fully in this paper. Batch-parallel implementations have been designed for various data structures including weight-balanced B-trees [18], priority queues [6], working-set maps [4] and euler-tour trees [36].

2 Parallel Computation Model

In this section, we describe parallel programming primitives in our model, how a parallel program generates an execution DAG, and how we measure the cost of an execution DAG.

2.1 Parallel Primitives

The parallel finger structures in this paper are described and explained as multithreaded data structures that can be used as composable building blocks in a larger parallel program. In this paper we shall focus on the abstract algorithms behind them, relying merely on the following parallel programming primitives (rather than model-specific implementation details, but see Appendix Section A.6 for those):

  1. Threads: A thread can at any point terminate itself (i.e. finish running). Or it can fork another thread, obtaining a pointer to that thread, or join to a previously forked thread (i.e. wait until that thread terminates). Or it can suspend itself (i.e. temporarily stop running), after which a thread with a pointer to it can resume it (i.e. make it continue running from where it left off). Each of these operations takes O(1) time.

  2. Non-blocking locks: Attempts to acquire a non-blocking lock are serialized but do not block. Acquiring the lock succeeds if the lock is not currently held but fails otherwise, and releasing always succeeds. If there are k concurrent accesses to the lock, then each access finishes within O(k) time.

  3. Dedicated lock: A dedicated lock is a blocking lock initialized with a constant number of keys, where concurrent threads must use different keys to acquire it, but releasing does not require a key. Each attempt to acquire the lock takes O(1) time, and the thread will acquire the lock after at most a constant number of subsequent acquisitions of that lock.

  4. Reactivation calls: A procedure with no input/output can be encapsulated by a reactivation wrapper, in which it can be run only via reactivations. If there are always at most a constant number of concurrent reactivations of the procedure, then whenever a thread reactivates it, if it is not currently running then it will start running (in another thread forked in O(1) time), otherwise it will run within O(1) time after its current run finishes. (A toy sketch of such a wrapper is given after this list.)
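The following is a toy Python sketch of a reactivation wrapper as described in item 4, using an ordinary mutex instead of the model's primitives; it is only meant to show the intended semantics (a reactivation during a run causes exactly one more run afterwards), not the O(1)-time guarantees.

```python
import threading

class ReactivationWrapper:
    """Illustrative sketch, not the paper's implementation: reactivate() never
    blocks the caller; the wrapped procedure runs in its own thread, and if it
    is reactivated while running, it runs once more after the current run."""

    def __init__(self, procedure):
        self.procedure = procedure          # a no-input/no-output callable
        self.lock = threading.Lock()
        self.running = False
        self.pending = False

    def reactivate(self):
        with self.lock:
            if self.running:
                self.pending = True          # remember to rerun afterwards
                return
            self.running = True
        threading.Thread(target=self._run).start()

    def _run(self):
        while True:
            self.procedure()
            with self.lock:
                if not self.pending:
                    self.running = False
                    return
                self.pending = False         # consume one pending reactivation
```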

We also make use of basic batch operations, namely filtering, sorted partitioning, joining and merging (see Appendix Section A.2), which have easy implementations using arrays in the CREW PRAM model. So both parallel finger structures (using a work-stealing scheduler) can be implemented in the (synchronous) Arbitrary CRCW PRAM model with fetch-and-add, achieving the claimed performance bounds. Actually, they were also designed to function correctly with the same performance bounds in a much stricter computation model called the QRMW parallel pointer machine model (see Appendix Section A.1 for details).

2.2 Execution DAG

The program DAG D captures the high-level execution of P, but the actual complete execution of P (including interaction between data structure calls) is captured by the execution DAG (which may be schedule-dependent), in which each node is a basic instruction and the directed edges represent the computation dependencies (such as those constrained by forking/joining of threads and acquiring/releasing of blocking locks). At any point during the execution of P, a node in the program/execution DAG is said to be ready if all its parent nodes have been executed. At any point in the execution, an active thread is simply a ready node in the execution DAG, while a terminated/suspended thread is an executed node in the execution DAG that has no child nodes.

The execution DAG consists of program nodes (specifically P-nodes) and ds (data-structure) nodes, which are dynamically generated as follows. At the start the execution DAG has a single program node, corresponding to the start of the program P. Each node could be a normal instruction (i.e. basic arithmetic/memory operation) or a parallel primitive (see Section 2.1). Each program node could also be a data structure call.

When a (ready) node is executed, it may generate child nodes or terminate. A normal instruction generates one child node and no extra edges. A join generates a child node with an extra edge to it from the terminate node of the joined thread. A resume generates an extra child node (the resumed thread) with an edge to it from the suspend node of the originally suspended thread. Accesses to locks and reactivation calls would each expand to a subDAG comprised of normal instructions and possibly fork/suspend/resume.

The program nodes correspond to nodes in the program DAG D, and except for data structure calls they generate only program nodes. A call to a data structure M is called an M-call. If M is an ordinary (non-batched) data structure, then an M-call generates an M-node (and every M-node is a ds node), which thereafter generates only M-nodes except for calls to other data structures (external to M) or returning the result of some operation (generating a program node with an edge to it from the original M-call).

However, if M is an (implicitly) batched data structure, then all M-calls are automatically passed to the parallel buffer for M (see Appendix Section A.3). So an M-call generates a buffer node corresponding to passing the call to the parallel buffer, as if the parallel buffer for M is itself another data structure and not part of M. Buffer nodes generate only buffer nodes until the buffer notifies M of the buffered M-calls or passes the input batch to M, which generates an M-node. In short, M-nodes exclude all nodes generated as part of the buffer subcomputations (i.e. buffering the M-calls, notifying M, and flushing the buffer).

2.3 Data Structure Costs

We shall now define work and span of any (terminating) subcomputation of a multithreaded program, i.e. any subset of the nodes in its execution DAG. This allows us to capture the intrinsic costs incurred by a data structure, separate from the costs of a parallel program using it.

Definition 5 (Subcomputation Work/Span/Cost).

Take any execution of a parallel program (on p processors), and take any subset A of nodes in its execution DAG. The work w taken by A is the total weight of A where each node is weighted by the time taken to execute it. The span s taken by A is the maximum weight of the nodes of A on any (directed) path in the execution DAG. The cost of A is w/p + s.

Definition 6 (Data Structure Work/Span/Cost).

Take any parallel program P using a data structure M. The work/span/cost of M (as used by P) is the work/span/cost of the set of M-nodes in the execution DAG for P.

Note that the cost of the entire execution DAG is in fact an upper bound on the actual time taken to run it on a greedy scheduler, which on each step assigns as many unassigned ready nodes (i.e. nodes that have been generated but have not been assigned) as possible to available processors (i.e. processors that are not executing any nodes) to be executed.
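For completeness, here is the standard greedy-scheduling argument behind that remark, sketched for the simplified case where every node takes unit time; W denotes the work and S the span of the whole execution DAG.

```latex
% Sketch of the greedy-scheduler bound, assuming unit-time nodes;
% W = work (number of nodes), S = span (longest path) of the execution DAG.
\[
  \underbrace{\#\{\text{steps with all $p$ processors busy}\}}_{\le\, W/p}
  \;+\;
  \underbrace{\#\{\text{steps with an idle processor}\}}_{\le\, S}
  \;\ge\; \text{total number of steps},
\]
% A busy step executes p nodes, so there are at most W/p such steps. On a step
% with an idle processor every ready node of the remaining DAG is executed, so
% the length of its longest remaining path decreases by 1; hence at most S such
% steps. Therefore the running time is at most W/p + S, i.e. the cost.
```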

Moreover, the subcomputation cost is subadditive across subcomputations. Thus our results are composable with other algorithms and data structures in this model, since we actually show the following for some linearization L (where p, n and d are as defined in Section 1 Main Results, and m is the total number of calls to the parallel finger structure).

Theorem 7 (Work/Span Bounds).

Note that the bounds for the work/span of the two parallel finger structures are independent of the scheduler. In addition, using any greedy scheduler, the parallel buffer for either finger structure has a suitably bounded cost (see the Appendix). Therefore our main results (Theorems 3 and 4) follow from these composable bounds (Theorem 7).

In general, if a program uses a fixed number of implicitly batched data structures, then running it using a greedy scheduler takes O( ( T[1] + W ) / p + T[inf] + S + d * log p ) time, where W is the total work of all the data structures, S is the total span of all the data structures, and d is the maximum number of data structure calls on any path in the program DAG.

3 Amortized Sequential Finger Structure

In this section we explain a sequential finger structure with a fixed finger at each end, which (unlike finger structures based on balanced binary trees) is amenable to parallelization and pipelining due to its doubly-exponential segmented structure (which was partially inspired by Iacono’s working-set structure [iacono2001wstree]).


Figure 1: Outline; each box represents a 2-3 tree (the segment sizes grow doubly-exponentially along each chain)

The structure keeps the items in order in two halves, the front half stored in a chain of segments and the back half stored in reverse order in another chain of segments. Each segment has a target size, growing doubly-exponentially along its chain, and a target capacity, which equals the target size except at the last segment of a chain. Each segment stores its items in order in a 2-3 tree. We say that a segment is balanced iff its size is within a fixed fraction of its target capacity, overfull iff it is too far above its target capacity, and underfull iff it is too far below its target capacity. At any time we associate every item to a unique segment that it fits in, determined by the item's rank from the nearer end: an item fits in the first segment (counting from that end) whose cumulative target capacity covers its rank, and in the last segment otherwise. We shall maintain the invariant that every segment is balanced after each operation is finished.

For each operation on an item x, we find the segment that x fits in, by checking the range of items in the k-th segment of each chain, for k = 1, 2, ... in increasing order, and stopping once the segment is found; then we perform the desired operation on the 2-3 tree in that segment. This takes O(log r + 1) steps, where r is the finger distance of the operation.

After that, if the segment becomes imbalanced, we rebalance it by shifting (appropriate) items to or from the next segment (after creating an empty next segment if it does not exist) to make the imbalanced segment have target size or as close as possible (via a suitable split then join of the 2-3 trees), and the next segment is removed if it is the last segment and is now empty. After the rebalancing, the segment will not only be balanced but also have size within its target capacity. But now the next segment may become imbalanced, so the rebalancing may cascade.

Finally, if one chain becomes longer than the other chain, it must be longer by exactly one segment, so we rebalance the chains as follows: if the last segment of the shorter chain is below target size, shift items from the last segment of the longer chain to fill it up to target size. If the last segment of the longer chain is (still) below target size, remove it once it becomes empty, otherwise add a new empty segment to the shorter chain.

Rebalancing may cascade throughout the whole chain and take far more than O(log r + 1) steps. But we shall show below that the rebalancing costs can be amortized away completely, and hence each operation with finger distance r takes amortized O(log r + 1) steps, giving us the finger bound for this sequential structure. We will later use the same technique in analyzing both parallel versions as well.
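To make the idea concrete, here is a deliberately simplified Python sketch: a single-finger version with plain sorted lists instead of 2-3 trees, insertions and searches only, a hypothetical target size of 2^(2^k) for segment k, and an assumed slack factor of 2. It only illustrates why accesses near the finger are cheap and how an overflow cascade resets each segment to its target; it does not reproduce the paper's two-chain structure or exact invariants.

```python
import bisect

class SingleFingerSketch:
    """Sorted-set sketch with one finger at the front and doubly-exponential
    segment targets (an assumed choice of 2**(2**k) for segment k)."""

    def __init__(self):
        self.segs = [[]]                      # sorted lists, front to back

    def _target(self, k):
        return 2 ** (2 ** k)

    def _fit_index(self, key):
        # first segment whose largest item is >= key, else the last segment
        for k, seg in enumerate(self.segs):
            if seg and seg[-1] >= key:
                return k
        return len(self.segs) - 1

    def search(self, key):
        seg = self.segs[self._fit_index(key)]
        i = bisect.bisect_left(seg, key)
        return i < len(seg) and seg[i] == key

    def insert(self, key):
        k = self._fit_index(key)
        seg = self.segs[k]
        i = bisect.bisect_left(seg, key)
        if i == len(seg) or seg[i] != key:
            seg.insert(i, key)
            self._rebalance(k)

    def _rebalance(self, k):
        # overflow cascade: an overfull segment keeps exactly its target size
        # and sheds its largest items into the next segment ("reset to target")
        while k < len(self.segs):
            seg, t = self.segs[k], self._target(k)
            if len(seg) <= 2 * t:             # assumed slack factor of 2
                return
            if k + 1 == len(self.segs):
                self.segs.append([])
            overflow = seg[t:]
            del seg[t:]
            self.segs[k + 1] = overflow + self.segs[k + 1]
            k += 1
```

In this sketch an access with finger distance r scans only the first O(log log r) segments and then does one binary search, i.e. O(log r) comparisons; the cascading _rebalance is what the amortized argument below (Lemma 8) charges back to individual operations.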

Lemma 8 (Rebalancing Cost).

All the rebalancing takes amortized O(1) steps per operation.

Proof.

We shall maintain the invariant that each segment with m items beyond (i.e. above or below) its target capacity has at least proportionally many stored credits. Each operation is given O(1) credits, and we use them to pay for any needed extra stored credits at the segment where we perform the operation. Whenever a segment is rebalanced, it must have had m items beyond its target capacity for some m at least a constant fraction of its target size, and so had stored credits proportional to m. Also, the rebalancing itself takes O(m) steps, after which the next segment needs at most m extra stored credits. Thus the stored credits at the rebalanced segment can be used to pay for both the rebalancing and any extra stored credits needed by the next segment. Whenever the chains are rebalanced, it can be paid for by the last segment rebalancing (which created or removed a segment), and no extra stored credits are needed. Therefore the total rebalancing cost amounts to O(1) per operation.

4 Simpler Parallel Finger Structure

We now present our simpler parallel finger structure. The idea is to use the amortized sequential finger structure (Section 3) and execute operations in batches. We group each pair of front and back segments at the same position into one section, and we say that an item fits in a set of sections iff it fits in some segment of those sections.

The items in each segment are stored in a batch-parallel map (Appendix Section A.5), which supports the following operations (an illustrative interface sketch is given after this list):

  • Unsorted batch search: Search for an unsorted batch of b items within O(b log n) work and O(log b + log n) span, tagging each search with the result, where n is the map size.

  • Sorted batch access: Perform an item-sorted batch of b operations on distinct items within O(b log n) work and O(log b + log n) span, tagging each operation with the result, where n is the map size.

  • Split: Split a map of size n around a given pivot rank (into lower+upper parts) within O(log n) work/span.

  • Join: Join maps of total size n separated by a pivot (i.e. lower+upper parts) within O(log n) work/span.
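For reference, an illustrative Python interface for such a batch-parallel map might look like the following; the names and signatures are assumptions, not the paper's actual API, and a real implementation would back it with a 2-3 tree (or similar) and run the batch loops in parallel.

```python
from typing import Any, Iterable, List, Tuple

class BatchParallelMap:
    """Illustrative interface sketch only; all methods are abstract here."""

    def unsorted_batch_search(self, keys: Iterable[Any]) -> List[Tuple[Any, bool]]:
        """Tag each key of an unsorted batch with whether it is present."""
        raise NotImplementedError

    def sorted_batch_access(self, ops: List[Tuple[str, Any]]) -> List[Any]:
        """Perform an item-sorted batch of operations on distinct items,
        e.g. ('search', k), ('insert', k), ('delete', k), returning one
        result per operation."""
        raise NotImplementedError

    def split(self, rank: int) -> Tuple["BatchParallelMap", "BatchParallelMap"]:
        """Split around a pivot rank into lower and upper parts."""
        raise NotImplementedError

    @staticmethod
    def join(lower: "BatchParallelMap", upper: "BatchParallelMap") -> "BatchParallelMap":
        """Join two parts separated by a pivot."""
        raise NotImplementedError
```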

For each section, we can perform a batch of operations on it within the appropriate work and span if we have the batch sorted. Excluding sorting, the total work would satisfy the finger bound for the same reason as in the sequential structure. However, we cannot afford to sort the input batch right at the start, because if the batch consisted of b searches of b distinct items all with finger distance O(1), then sorting would take Omega(b log b) work and exceed our finger bound budget of O(b).

We can solve this by splitting the sections into two slabs, where the first slab comprises roughly the first lg lg b sections, and passing the batch through a preliminary phase in which we merely perform an unsorted search of the relevant items in the first slab, and eliminate operations on items that fit in the first slab but are neither found nor to be inserted.

This preliminary phase takes O(log r + 1) work per operation with finger distance r, and only a small span at each section. We then sort the uneliminated operations and execute them on the appropriate slab. For this, ordinary sorting still takes too much work as there can be many operations on the same item, but it turns out that the finger bound budget is enough to pay for entropy-sorting (see the Appendix), which takes O(c log(b/c) + c) work for each item that occurs c times in a batch of size b. Rebalancing the segments and chains is a little tricky, but if done correctly it takes amortized O(1) work per operation. Therefore we achieve work-optimality while still being able to process each batch within a small span. The details are below.
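The following sequential Python sketch shows the shape of this eliminate-then-group step; the helper names and the use of a hash map are assumptions made for brevity (the paper uses comparison-based entropy-sorting and merging instead, and combines during the sort itself).

```python
from collections import defaultdict

def separate(batch, fits_first_slab, found_in_first_slab):
    """Toy separation step. `batch` is a list of (kind, key) with kind in
    {'search','update','insert','delete'}; `fits_first_slab(key)` and the set
    `found_in_first_slab` come from the preliminary phase."""
    ineffectual, effectual, residual = [], [], []
    for op in batch:
        kind, key = op
        if fits_first_slab(key):
            if key in found_in_first_slab or kind == 'insert':
                effectual.append(op)
            else:
                ineffectual.append(op)   # not found and not inserted: answer now
        else:
            residual.append(op)

    def group(ops):
        # combine operations of the same kind on the same item into one
        # group-operation, ordered by kind (deletions last) and then by item
        kinds = ['search', 'update', 'insert', 'delete']
        groups = defaultdict(list)
        for kind, key in ops:
            groups[(kinds.index(kind), key)].append((kind, key))
        return [groups[g] for g in sorted(groups)]

    return ineffectual, group(effectual), group(residual)
```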

4.1 Description


Figure 2: Outline; each batch is sorted only after being filtered through the smaller sections

Calls to the finger structure are put into the parallel buffer (Section 2) for it. Whenever the previous batch is done, the structure flushes the parallel buffer to obtain the next batch B. Let b be the size of B; we can assume b <= p, since each call blocks its thread, so at most p calls can be buffered at any time. Based on b, the sections are conceptually divided into two slabs, the first slab comprising roughly the first lg lg b sections and the final slab comprising the remaining sections (where lg is the binary logarithm). The items in each segment are stored in a batch-parallel map (Appendix Section A.5).

The structure processes the input batch in four phases:

  1. Preliminary phase: For each first slab section in order (smallest to largest) do as follows:

    1. Perform an unsorted search in each segment of this section for all the items relevant to the remaining batch (which consists of direct pointers into the original batch), and tag the operations in the original batch with the results.

    2. Remove all operations on items that fit in this section from the remaining batch.

    3. Skip the rest of the first slab if the remaining batch becomes empty.

  2. Separation phase: Partition the batch based on the tags into three parts and handle each part separately as follows:

    1. Ineffectual operations (on items that fit in the first slab but are neither found nor to be inserted): Return the results.

    2. Effectual operations (on items found in or to be inserted into the first slab): Entropy-sort them (see the Appendix) in order of access type (search, update, insertion, deletion) with deletions last, followed by item, combining operations of the same access type on the same item into one group-operation that is treated as a single operation whose effect is the last operation in that group. Each group-operation is stored in a leaf-based binary tree of logarithmic height (but not necessarily balanced), and the combining is done during the entropy-sorting itself.

    3. Residual operations (on items that do not fit in the first slab): Sort them while combining operations in the same manner as for effectual operations. (This does not require entropy-sorting, but combining merge-sort essentially achieves the entropy bound anyway.)

  3. Execution phase: Execute the effectual operations as a batch on the first slab, and then execute the residual operations as a batch on the final slab, for each slab doing the following at each section in order (small to big):

    1. Let be the partition of the batch of operations into the access types (deletions last), each sorted by item.

    2. For each segment in , and for each from to , cut out the operations that fit in from , and perform those operations (as a sorted batch) on , and then return their results.

    3. Skip the rest of the slab if the batch becomes empty.

  4. Rebalancing phase: Rebalance all the segments and chains, by doing the following:

    1. Segment rebalancing: For each chain , for each segment in in order (small to big):

      1. If and is overfull, shift items from to to make have target size.

      2. If and is underfull and has at least items, let be the first underfull segment in , and fill using as follows: for each from down to , shift items from to to make have total size or as close as possible, and then remove if it is emptied.

      3. If is (still) overfull and is the last segment in , create a new (empty) segment .

      4. Skip the rest of the current slab if is (now) balanced and the execution phase had skipped .

    2. Chain rebalancing: After that, if one chain is longer than the other chain , repeat the following until the chains are the same length:

      1. Let the current chains be and . Create new (empty) segments , and shift all items from to , and then fill the underfull segments in using (as in item 4aii).

      2. If is (now) empty again, remove .

4.2 Analysis

First we establish that the rebalancing phase works, by proving the following two lemmas.

Lemma 9 (Segment Rebalancing Invariant).

During the segment rebalancing (item 4a), just after the iteration for segment , for any imbalanced segment in , either or are all underfull.

Proof.

The invariant clearly holds for . Consider each iteration for segment during the segment rebalancing where . If had less than items, then that iteration does nothing, and the invariant is trivially preserved. If was overfull, then by the invariant it must have been the only imbalanced segment in , and would be balanced in item 4ai, preserving the invariant. If was underfull but had at least items, then in item 4aii would be filled using , which had at least items unless it is the last segment in its chain, and hence after that every segment in (that is not removed) would be balanced, preserving the invariant. Finally, if is balanced at the end of that iteration, and had been skipped by the execution phase, then by the invariant all segments in are balanced, and all segments skipped by the rebalancing phase are also balanced, so the invariant is preserved.       

Lemma 10 (Chain Rebalancing Iterations).

The chain rebalancing (item 4b) takes at most two iterations, after which both chains and will have equal length and all their segments will be balanced.

Proof.

By creftype 9, all segments in each chain will be balanced after the segment rebalancing (item 4a). After that, if one chain is longer than the other chain , the first chain rebalancing iteration transfers all items in to the other chain (item 4bi), leaving empty. If remains non-empty, then both chains have length and we are done. Otherwise, would be removed, and then the second chain rebalancing iteration transfers all items in to the other chain, which is at least items, so every segment in would be filled to target size, and hence both chains would have length .       

Next we bound the work done by the simpler parallel finger structure.

Definition 11 (Inward Order).

Take any sequence of map operations and let be the set of items accessed by operations in . Define the inward distance of an operation in on an item to be . We say that is in inward order iff its operations are in order of (non-strict) increasing inward distance. Naturally, we say that is in outward order iff its reverse is in inward order.

Theorem 12 (Work).

The simpler parallel finger structure takes O(F[L]) work for some linearization L of the FS-calls in D.

Proof.

Let be a linearization of -calls in such that:

  • Operations on in earlier input batches are before those in later input batches.

  • The operations within each batch are ordered as follows:

    1. Ineffectual operations are before effectual/residual operations.

    2. Effectual/residual operations are in order of access type (deletions last).

    3. Effectual insertions are in inward order, and effectual deletions are in outward order.

    4. Operations in each group-operation are consecutive and in the same order as in that group.

Let be the same as except that in item 3 effectual deletions are ordered so that those on items in earlier sections are later (instead of outward order). Now consider each input batch of operations on .

In the preliminary and execution phases, each section takes work per operation. Thus each operation in with finger distance according to on an item that was found to fit in section takes work, because if is in the first slab (since earlier effectual operations in did not delete items in ), and if is in the final slab (since ). Therefore these phases take work in total.

Let be the effectual operations in as a subsequence of . Entropy-sorting takes work (Appendix creftype 32), where is the entropy of (i.e.  where is the number of occurrences of the -th operation in ). Partition into parts: searches/updates and insertions and deletions . And let be the entropy of . Then where is the number of operations in the same part of as the -th operation in , and by Jensen’s inequality. Thus entropy-sorting takes work. Let be the cost of according to . Since each operation in has inward distance (with respect to ) at most its finger distance according to , we have (Appendix creftype 28), and hence entropy-sorting takes work in total.

Sorting the residual operations in the batch (that do not fit in the first slab) takes work per operation with finger distance according to , since .

Therefore the separation phase takes work in total. Finally, the rebalancing phase takes amortized work per operation, as we shall prove in the next lemma. Thus takes total work.       

Lemma 13 (Rebalancing Work).

The rebalancing phase of takes amortized work per operation.

Proof.

We shall maintain the credit invariant that each segment with items beyond its target capacity has at least stored credits. The execution phase clearly increases the total stored credits needed by at most per operation, which we can pay for. We now show that the invariant can be preserved after the segment rebalancing and the chain rebalancing.

During the segment rebalancing (item 4a), each shift is performed between some neighbouring segments and , where has items and has items just before the shift, and . The shift clearly takes work. If then this is obviously just work. But if , then will also be rebalanced in item 4ai of the next segment balancing iteration, since at most items will be shifted from to in item 4aii, and hence will still have at least items. In that case, the second term in the work bound for this shift can be bounded by the first term of the work bound for the subsequent shift from to , since . Therefore in any case we can treat this shift as taking only work.

Now consider the two kinds of segment rebalancing:

  • Overflow: item 4ai shifts items from overfull to , where has items just before the shift. After the shift, has target size and needs no stored credits, and would need at most extra stored credits. Thus the credits stored at can pay for both the shift and the needed extra stored credits.

  • Fill: item 4aii fills some underfull segments using , where has items just before the fill, for each . After the fill, every segment in would have target size and need no stored credits, and will need at most extra stored credits, which can be paid for by using half the credits stored at each segment in . The other half of the credits stored at suffices to pay for the shift from to for each .

The chain rebalancing (item 4b) is performed only when segment rebalancing creates or removes a segment and makes one chain longer than the other. Consider the biggest segment that was created or removed. If was created, it must be due to overflowing to in item 4ai, and hence the shift from to already took work. If was removed, it must be due to filling some segments using in item 4aii, but must have had at least items before the execution phase, and at least half of them were either deleted or shifted to , and hence either the deletions can pay credits, or the shift to already took work. Therefore in any case we can afford to ignore up to work done by chain rebalancing.

Now observe that the chain rebalancing performs at most two transfers (item 4bi) of items from the last segment of the longer chain to the shorter chain , by the Lemma 10 ( Chain Rebalancing Iterations). (creftype 10). Each transfer takes work to create the new segments and work to shift over to , and then fills underfull segments in using . The fill takes work for the shift from to , and takes work for each shift from to for each , since has at most items just before the shift. Therefore each transfer takes work in total, and hence we can ignore all the work done by the chain rebalancing.       

And now we turn to bounding the span of the simpler parallel finger structure.

Theorem 14 (Span).

takes span, where is the number of operations on , and is the maximum size of , and is the maximum number of -calls on any path in the program DAG .

Proof.

Let denote the maximum span of processing an input batch of size (that has been flushed from the parallel buffer). Take any input batch of size . We shall bound the span taken by in each phase.

The preliminary phase takes span in each first slab segment , adding up to span. The separation phase also takes span, by Theorem 32 ( Costs). (creftype 32). The execution phase takes span in each segment , adding up to span. Returning the results for each group-operation takes span.

The rebalancing phase also takes span for each segment processed in item 4a, because each shift between segments with total size takes span, and filling using in item 4aii takes span for the first shift from to and then span for each subsequent shift from to . Similarly, the chain rebalancing in item 4b takes span, because it performs at most two iterations by Lemma 10 ( Chain Rebalancing Iterations). (creftype 10), each of which takes span to fill the underfull segments of the shorter chain using its last segment.

Therefore , since if .

Each batch of size waits in the buffer for the preceding batch of size to be processed, taking span, and then itself is processed, taking span, taking span in total. Since over all batches each of will sum up to at most the total number of -calls, and there are at most -calls on any path in the program DAG , the span of is .       

5 Faster Parallel Finger Structure

Although the simpler parallel finger structure has optimal work and a small span, it is possible to reduce the span even further, intuitively by pipelining the batches in some fashion so that an expensive access in a batch does not hold up the next batch.

As with the simpler version, we need to split the sections into two slabs, but this time we fix the first slab at roughly the first lg lg p sections, so that we can pipeline just the final slab. We need to allow big enough batches so that operations that are delayed because earlier batches are full can count their delay against the total work divided by p. But to keep the span of the sorting phase down to O((log p)^2), we need to restrict the batch size. It turns out that restricting the cut batches to size Theta(p) works.

We cannot pipeline the first slab (particularly the rebalancing), but the preliminary phase and separation phase only take O((log p)^2) span. The execution and rebalancing phases are still carried out as before on the first slab, but execution and rebalancing on the final slab are pipelined, by having each final slab section process the batch passed to it and rebalance its preceding segments if necessary.

To guarantee that this local rebalancing is possible, we do not allow a section to proceed if the next section is imbalanced or if there are too many pending operations in the buffer to the next section. In such a situation, the section must stop and reactivate the next section, which would clear its buffer and rebalance itself before restarting the stopped section. It may be that the next section also cannot proceed for the same reason and is stopped in the same manner, and so a section may be delayed by such a stop for a long time. But by a suitable accounting argument we can bound the total delay due to all such stops by the total work divided by p. Similarly, we do not allow the first slab to run (on a new batch) if the first final-slab section is imbalanced or there are too many pending operations in its buffer.

Finally, we use an odd-even locking scheme to ensure that the segments in the final slab do not interfere with each other yet can proceed at a consistent pace. The details are below.

5.1 Description


Figure 3: Sketch; the final slab is pipelined, facilitated by locks between adjacent sections

We shall now give the details (see Figure 3). We will need the bunch structure (see the Appendix) for aggregating batches, which is an unsorted set supporting both addition of a batch of new elements within O(1) work/span and conversion to a batch within O(m) work and O(log m) span if it has size m.
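A bunch can be pictured as nothing more than a list of the added batches, as in the toy Python sketch below; the real structure performs the flattening in parallel (e.g. with prefix sums), which is assumed rather than shown here.

```python
class Bunch:
    """Toy sketch of a 'bunch': an unsorted collection supporting cheap
    addition of a whole batch and later flattening into a single batch."""

    def __init__(self):
        self.batches = []            # list of added batches, never traversed on add
        self.size = 0

    def add_batch(self, batch):
        self.batches.append(batch)   # constant work regardless of batch length
        self.size += len(batch)

    def to_batch(self):
        out = []
        for b in self.batches:       # sequential here; a parallel version would
            out.extend(b)            # place elements using prefix sums
        return out
```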

The faster version has the same sections as the simpler one, with the first slab comprising roughly the first lg lg p sections, and the final slab comprising the other sections. It uses a feed buffer, which is a queue of bunches of operations each of a fixed size except the last (which can be empty). Whenever the structure is notified of input (by the parallel buffer), it reactivates the first slab.

Each section in the final slab has a buffer before it (for pending operations from the preceding section), which for each access type uses an optimal batch-parallel map (Appendix Section A.5) to store bunches of group-operations of that type, where operations on the same item are in the same bunch. When a batch of group-operations on an item is inserted into the buffer, it is simply added to the correct bunch. Whenever we count operations in the buffer, we shall count them individually even if they are on the same item. The first slab and each final slab section also has a deferred flag, which indicates whether its run is deferred until the next section has run. Between every pair of consecutive sections starting from the end of the first slab there is a neighbour-lock, which is a dedicated lock (see Section 2.1) with a key for each arrow to it in Figure 3.

Whenever the first slab is reactivated, it runs as follows:

  1. If the parallel buffer and feed buffer are both empty, terminate.

  2. Acquire the neighbour-lock between and . (Skip steps 2 to 4 and steps 8 to 10 if does not exist.)

  3. If has any imbalanced segment or has more than operations in its buffer, set the first slab’s deferred flag and release the neighbour-lock, and then reactivate and terminate.

  4. Release the neighbour-lock.

  5. Let be the size of the last bunch in the feed buffer. Flush the parallel buffer (if it is non-empty) and cut the input batch of size into small batches of size except possibly the first and last, where the first has size . Add that first small batch to , and append the rest as bunches to the feed buffer.

  6. Remove the first bunch from the feed buffer and convert it into a batch , which we call a cut batch.

  7. Process using the same four phases as in (Figure 2), but restricted to the first slab (i.e. execute only the effectual operations on the first slab, and do segment rebalancing only on the first slab, and do chain rebalancing only if had not existed before this processing). Furthermore, do not update ’s segments’ sizes until after this processing (so that in item 4 will not find any of ’s segments imbalanced until the first slab rebalancing phase has finished).

  8. Acquire the neighbour-lock between and .

  9. Insert the residual group-operations (on items that do not fit in the first slab) into the buffer of , and then reactivate .

  10. Release the neighbour-lock.

  11. Reactivate itself.

Whenever a final slab section is reactivated, it runs as follows:

  1. Acquire the neighbour-locks (between and its neighbours) in the order given by the arrow number in Figure 3.

  2. If has any imbalanced segment or (exists and) has more than operations in its buffer, set ’s deferred flag and release the neighbour-locks, and then reactivate and terminate.

  3. For each access type, flush and process the (sorted) batch of bunches of group-operations of that type in its buffer as follows:

    1. Convert each bunch in to a batch of group-operations.

    2. For each segment in , cut out the group-operations on items that fit in from , and perform them (as a sorted batch) on , and then fork to return the results of the operations (according to the order within each group-operation).

    3. If is non-empty (i.e. has leftover group-operations), insert into the buffer of and then reactivate .

  4. Rebalance locally as follows (essentially like in ):

    1. For each segment in :

      1. If is overfull, shift items from to to make have target size.

      2. If is underfull, shift items from to to make have target size, and then remove if it is emptied.

      3. If is (still) overfull and is the last segment in , create a new segment and reactivate it.

    2. If is (still) the last section, but chain is longer than chain :

      1. Create a new segment and shift all items from to .

      2. If is (now) underfull, shift items from to to make have target size.

      3. If is (now) empty again, remove .

  5. If , and the first slab is deferred, clear its deferred flag then reactivate it.

  6. If , and is deferred, clear its defered flag then reactivate it.

  7. Release the neighbour-locks.

5.2 Analysis

For each computation, we shall define its delay to intuitively capture the minimum time it needs, including all potential waiting on locks. Each blocked acquire of a dedicated lock corresponds to an acquire-stall node in the execution DAG whose child node is created by the release just before the successful acquisition of the lock. Consider the ancestor nodes of that release that have not yet executed at the point when the acquire-stall node is executed. Then the delay of a computation is recursively defined as its weighted span, where each acquire-stall node is weighted by the delay of those not-yet-executed ancestors (to capture the total waiting at that node), and every other node is weighted by its cost. (The delay of a computation depends on the actual execution, due to this definition for each acquire-stall node. But it captures the minimum time needed to run the computation in the following sense: on any step that executes all ready nodes in the remaining computation (i.e. its unexecuted nodes), the delay of the remaining computation is reduced. So if a greedy scheduler is used, the number of steps in which some processor is idle is bounded by the delay.)

Whenever the first slab or a final slab section runs, we say that it defers if it terminates with its deferred flag set (i.e. at step 2), otherwise we say that it proceeds (i.e. to step 3) and eventually finishes (i.e. reaches the last step) with its deferred flag cleared. We now establish some invariants, which guarantee that is always sufficiently balanced.

Lemma 15 (Balance Invariants).

The faster parallel finger structure satisfies the following invariants:

  1. When the first slab is not running, every segment in is balanced and has at most items.

  2. When a final slab section rebalances a segment in (in item 4a), it will make that segment have size .

  3. Just after the last section finishes without creating new sections, the segments in are balanced and both chains have the same length.

  4. Each final slab section always has at most operations in its buffer.

  5. Each final slab segment always has at most items, and at least items unless is the last section.

Proof.

Invariant 1 holds as follows: The first slab proceeds only if ’s segments are balanced, and from that point until after the rebalancing phase, its segments are modified only by itself (since will not modify ), and thereafter all its sections except remain unmodified until it processes the next cut batch. Thus the same proof as for Lemma 9 ( Segment Rebalancing Invariant). (creftype 9) shows that just before the segment rebalancing (item 4a) iteration for , for any imbalanced first slab segment , either or are underfull. But note that the cut batch had at most operations, and so after the execution phase, had at least items unless it was the last segment in its chain. Thus will be made balanced (by item 4ai or item 4aii in the iteration for , or by item 4b). Similarly, will have at most items in each segment, since .

Invariant 2 holds as follows. Each final slab section proceeds only if its segments each has at least items unless it is the last segment in its chain, and its buffer had at most operations by Invariant 4. Since , rebalancing a segment in (item 4a) will make it have size .

Invariant 3 holds as follows. The last section proceeds only if its segments each has at most items, and its buffer had at most operations by Invariant 4. Thus if any of its segments becomes overfull and it creates a new section , it will subsequently be deferred until runs. And during that run of , it will proceed and shift at most items from to , after which will not be overfull, and so will not create another new section . Therefore we can assume that the chains’ lengths never differ by more than one segment, and so the chain rebalancing (item 4b) will make the chains the same length while ensuring the segments in and are balanced.

Invariant 4 holds for , because the first slab proceeds only if ’s buffer has at most operations, and only processes a cut batch of size at most , hence after that ’s buffer will have at most operations. Invariant 4 holds for for each , because proceeds only if ’s buffer has at most operations, and only processes a buffered batch of size at most by Invariant 4 for , hence after that ’s buffer will have at most operations.

Invariant 5 holds as follows. Each final slab segment