Keywords
Parallel data structures, pointer machine, multithreading, dictionaries, 2-3 trees
Abstract
This paper presents a batch-parallel 2-3 tree in the asynchronous PPM (parallel pointer machine) model that supports searches, insertions and deletions in sorted batches and has essentially optimal parallelism, even under the QRMW (queued read-modify-write) memory model where concurrent memory accesses to the same location are queued and serviced one by one.
Specifically, if has items, then performing an item-sorted batch of operations on takes only work and span (in the worst case as ). This is information-theoretically work-optimal for , and also span-optimal in the PPM model. The input batch can be any balanced binary tree, and hence can also be used to implement sorted sets supporting optimal intersection, union and difference of sets with sizes in work and span.
To the author’s knowledge, is the first parallel sorted-set data structure with these performance bounds that can be used in an asynchronous multiprocessor machine under a memory model with queued contention. This is unlike PRAM data structures such as the PVW 2-3 tree (by Paul, Vishkin and Wagener), which rely on lockstep synchronicity of the processors. In fact, is designed to have bounded contention and satisfy the claimed work and span bounds regardless of the execution schedule.
All data structures and algorithms in this paper fit into the dynamic multithreading paradigm. Also, as a consequence of working in the asynchronous PPM model, all their performance bounds are directly composable with those of other data structures and algorithms in the same model.
Acknowledgements
I am exceedingly grateful to my family and friends for their unfailing support, and to all others who have given me valuable comments and advice. In particular, I would like to specially thank my brother Wei Zhong Lim and my supervisor Seth Gilbert for very helpful discussions and feedback.
1 Introduction
The dynamic multithreading paradigm (see [6] chap. 27) is a common parallel programming model underlying many parallel languages and libraries such as subsets of OpenMP [15], Cilk dialects [8, 14], Intel Threading Building Blocks [19] and the Microsoft Task Parallel Library [21]. In this paradigm, algorithmic parallelism is expressed via programming primitives such as fork/join (also spawn/sync), parallel loops and synchronized methods, but the program cannot stipulate any mapping from subcomputations to processors.
Naturally, we consider a multithreaded procedure (which can be an algorithm or a data structure operation) to be correct if and only if it has the desired output behaviour regardless of how the subcomputations it generates are scheduled for execution. Moreover, we wish to obtain good bounds on the work and span of the procedure, again independent of the execution schedule.
Unfortunately, many data structures and algorithms are designed in theoretical computation models with synchronous processors, such as the (synchronous) PRAM models, and so they can be difficult or impossible to implement in dynamic multithreading. Thus it is desirable to have as many useful algorithms and data structures as possible designed in more realistic computation models that are compatible with dynamic multithreading. For example, the QRMW (queued read-modify-write) PPM (parallel pointer machine) model described in Section 5.3 captures both the asynchronicity and contention costs inherent in running multithreaded procedures on most real multiprocessor machines.
One indispensable data structure is the map (or dictionary) data structure, which supports searches/updates, inserts and deletes (collectively referred to as accesses) of items from a linearly ordered set. Balanced binary trees such as the AVL tree or the red-black tree are commonly used to implement a map, taking worst-case cost (in the comparison model) per access for a tree with items. A related data structure is the sorted-set data structure, which supports intersection, union and difference of any two sets. Using maps based on balanced binary trees to implement sorted sets yields worst-case cost of each set operation where are the sizes of the input sets.
The obvious question is whether we can have an efficient multithreaded parallel map, or an efficient multithreaded sorted set, or both. In this paper we describe a multithreaded batch-parallel 2-3 tree that is information-theoretically work-optimal and is span-optimal in the PPM model (i.e. no other data structure in the PPM model can do asymptotically better). Here the input batch is given as a leaf-based balanced binary tree. Specifically, performing a sorted batch of accesses on an instance of with items takes work and span. This is superior to the work and span bounds of the PVW 2-3 tree despite not having the luxury of lockstep synchronous processors.
Furthermore, since is a multithreaded data structure whose performance bounds are independent of the schedule, it is trivially composable, which means that we can use as a black-box data structure in any multithreaded algorithm and easily obtain composable performance bounds. Indeed, the parallel working-set map in [1] and the parallel finger structure in [11] both rely on such a parallel 2-3 tree as a key building block.
2 Related Work
To illustrate the difficulty of converting data structures designed in the PRAM models to efficient multithreaded implementations, consider the PVW 2-3 tree [17] that supports performing an item-sorted batch of searches, insertions or deletions, which was designed in the CREW/EREW PRAM model.
Searches are easy, and pose no problem for a multithreaded implementation. But performing a sorted batch of insertions or deletions on the PVW 2-3 tree with items essentially involves spawning synchronous waves of structural changes from the bottom of the 2-3 tree upwards to the root, each wave taking steps to move up one level. This relies crucially on the lockstep synchronicity of the processors to ensure that these waves never overlap, and naively attempting to use locking to prevent waves from overlapping will cause the worst-case span to increase from to .
Other map data structures in the PRAM models include parallel B-trees by Higham et al. [13], parallel red-black trees by Park et al. [16] and parallel trees by Akhremtsev et al. [2], all of which crucially rely on lockstep synchronous processors as well.
A different approach of pipelining using futures by Blelloch et al. [5] yields an implementation of insertion into a variant of PVW 2-3 trees that requires not only a CREW/EREW PRAM but also a unit-time plus-scan (all-prefix-sums) operation. The multithreaded parallel sorted sets presented by Blelloch et al. in [3] take work but up to span per operation between two sets of sizes where . The span was reduced to by Blelloch et al. in [4], but that algorithm relies on concurrent reads taking time regardless of the extent of contention.
This paper shows that, using special pipelining schemes, it is actually possible to design a multithreaded batch-parallel 2-3 tree that takes work and span, even if only bounded memory contention is permitted.
3 Main Results
This paper presents, to the author’s best knowledge, the first multithreaded sorted-set data structure that can be run on an asynchronous parallel pointer machine and achieves optimal work and span bounds even under a memory model with queued contention. Specifically, the data structure presented works within the QRMW PPM model (Section 5.3).
The underlying batch-parallel 2-3 tree supports performing an item-sorted batch of accesses within work and span (in the worst case as ) (Section 7.3). Here we of course assume that we are given an step comparison function on pairs of items (i.e. the comparison model), but there is no loss of generality. This is information-theoretically work-optimal for , and also span-optimal in the PPM model. Furthermore, the input batch can be any balanced binary tree, including even another instance of , and hence can be used to implement optimal sorted sets supporting intersection, union and difference of sets with sizes in work and span (Section 8).
also supports performing an unsorted batch of searches within work and span (Section 7.2), which is useful when . Additionally, supports performing a reverse-indexing on an unsorted batch of direct pointers to items in it, which yields a sorted batch of those items within work and span (Section 7.4).
Actually, is designed to have bounded contention, meaning that there is some constant such that every batch operation on never has more than concurrent memory accesses to the same field of a memory node.
4 Key Ideas
uses a pipelined splitting scheme to partition the 2-3 tree itself around the operations in the input batch, then performs each operation on its associated part, and then uses a pipelined joining scheme to join the parts back up. Both pipelining schemes are top-down. The splitting scheme is similar to the search in the PVW 2-3 tree, except that we push the 2-3 tree down the input batch, rather than the input batch down the 2-3 tree. But the joining scheme is completely different from the bottom-up restructuring in the PVW 2-3 tree.
The main difficulty in both the splitting phase and joining phase is in finding a top-down procedure that can be decomposed into ‘independent’ local procedures each of which runs in span, which can then be pipelined. This is not so hard for the splitting phase, but for the joining phase this seems to require using integers to maintain the structure of the spine nodes over a sequence of joins but without actually performing the joins (see the outline in Section 7).
5 Parallel Computation Model
In this section, we describe how a multithreaded computation generates an execution DAG, how we measure the cost of a given execution DAG, and the chosen memory model.
5.1 Execution DAG
The actual complete execution of a multithreaded computation is captured by the execution DAG (which may be schedule-dependent), in which each node is a basic instruction and the directed edges represent the computation dependencies (such as constrained by forking/joining of threads and acquiring/releasing of blocking locks). At any point during the computation, a node in the execution DAG is said to be ready if its parent nodes have been executed. At any point in the execution, an active thread is simply a ready node in , while a suspended thread is an executed node in that has no child nodes.
The execution DAG is dynamically generated as follows. At the start has a single node, corresponding to the start of the computation. Each node could be a local instruction or a synchronization instruction (including fork/join and acquire/release of a lock). When a node is executed, it may generate child nodes or terminate. A join instruction also generates edges that linearize it with the preceding and succeeding computations of both joined threads. Concurrent memory accesses (including accesses to non-blocking locks) are not linearized. For a blocking lock, a release instruction generates an additional child node that is the resumed thread that next acquires the lock (if any), with an edge to it from the node corresponding to the originally suspended thread.
For analysis, we often assume that the computation is run on a greedy scheduler, which on each step assigns as many available instructions (i.e. instructions that have been generated but have not been assigned) as possible to available processors (i.e. processors that are not executing any instruction) to be executed.
5.2 Computation Costs
We shall now define work and span of a (terminating) multithreaded computation. This allows us to capture the intrinsic costs incurred by the computation itself, separate from the costs of any multithreaded program using it.
Definition 1 (Computation Work/Span/Cost).
Take any computation (on processors) with execution DAG , and take any subset of nodes in . The work taken by is the total weight of where each node is weighted by the time taken to execute it (which for a blocking lock is just the time it takes to be queued on the lock). The span taken by is the maximum weight of nodes in on any path in . The cost of is .
The computation cost has the desired property that it is subadditive across subcomputations. Thus our results are composable with other algorithms and data structures in this model. Note that the bounds for the work/span of are independent of the scheduler.
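As a small illustration of Definition 1, the following sketch computes the work and span of an entire execution DAG once it is fully known. The function name `work_and_span`, the dictionary-based DAG representation and the example weights are assumptions made for this illustration only; the sketch also ignores the restriction to a subset of nodes and does not compute the cost.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def work_and_span(weights, edges):
    """Return (work, span) of a weighted DAG.

    weights: dict mapping each node to the time taken to execute it.
    edges:   iterable of (u, v) pairs meaning that v depends on u.
    (Both names are hypothetical; the paper does not fix a representation.)
    """
    preds = {v: set() for v in weights}
    for u, v in edges:
        preds[v].add(u)

    # Work is the total weight of all nodes.
    work = sum(weights.values())

    # Span is the maximum total weight along any directed path,
    # computed by dynamic programming in topological order.
    heaviest = {}
    for v in TopologicalSorter(preds).static_order():
        heaviest[v] = weights[v] + max((heaviest[u] for u in preds[v]), default=0)
    span = max(heaviest.values(), default=0)
    return work, span


# A fork/join diamond: a forks b and c, which join at d.
diamond_weights = {"a": 1, "b": 3, "c": 2, "d": 1}
diamond_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(work_and_span(diamond_weights, diamond_edges))  # → (7, 5)
```

In the diamond, the work counts every node, while the span follows only the heavier branch through b.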
5.3 Memory Model
We shall work within the QRMW PPM model that was introduced in [1] as a more realistic PPM (parallel pointer machine) model for parallel programming. Unrealistic assumptions of the synchronous PRAM model include the lockstep synchronicity of processors and the lack of collision on concurrent memory accesses to the same locations [9, 10, 18]. For example, the load latency in the Cray XMT increases roughly linearly with the number of concurrent accesses to the same address but stays roughly constant when the concurrent accesses are to random addresses [20].
In the QRMW (queued read-modify-write) contention model, as described in [7], asynchronous processors perform memory accesses via RMW (read-modify-write) operations (including read, write, test-and-set, fetch-and-add, compare-and-swap), which are supported by almost all modern architectures. Also, to capture contention costs, RMW operations on each memory cell are FIFO-queued to be serviced, with only one RMW operation on that memory cell serviced per time step. The processor making each memory request is blocked until the request has been serviced.
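The queuing rule can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: time is discrete, each request is a hypothetical pair `(issue_step, cell)`, and ties between requests issued at the same step are broken by their order in the input list.

```python
def completion_steps(requests):
    """Simulate FIFO-queued servicing with one RMW per memory cell per time step.

    requests: list of (issue_step, cell) pairs (a hypothetical encoding).
    Returns the step at which each request has been serviced, in input order.
    """
    next_free = {}                  # cell -> first step at which it is idle again
    done = [0] * len(requests)
    # sorted() is stable, so requests issued at the same step keep list order.
    for i in sorted(range(len(requests)), key=lambda k: requests[k][0]):
        issued, cell = requests[i]
        start = max(issued, next_free.get(cell, 0))
        done[i] = start + 1         # servicing takes one step
        next_free[cell] = start + 1
    return done


# Four simultaneous accesses to the same cell are serviced one by one ...
print(completion_steps([(0, "x")] * 4))                             # → [1, 2, 3, 4]
# ... whereas accesses to four distinct cells all finish in one step.
print(completion_steps([(0, "a"), (0, "b"), (0, "c"), (0, "d")]))   # → [1, 1, 1, 1]
```

The first call shows why bounded contention matters: the last of several simultaneous accesses to one cell waits for all the others.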
In the QRMW PPM model, generalizing the PPM model in [12] to cater to the QRMW contention model, all memory accesses are done via pointers, which can be locally stored or tested for equality (but no pointer arithmetic). More precisely, each pointer (if not ) is to a memory node, which has a fixed number of memory cells. Each memory cell can hold a single field, which is either an integer or a pointer. Each processor also has a fixed number of local registers, each of which can hold a single field. At each step, each processor (that has finished its previous operation) can start any one of the following operations, which except for RMW operations finishes in one step:

Perform a basic arithmetic operation on integers in its registers, storing the result in another register. (In this paper we will only use integer addition, subtraction, multiplication, modulo and equality.)

Perform an equality test between pointers in its registers, storing the result ( or ) in an integer register.

Perform an RMW operation on a memory cell via a pointer to the memory node that it belongs to.

Create a new memory node, storing a pointer to it in a register.
This model supports non-blocking locks (try-locks) via test-and-set, where acquiring a non-blocking lock succeeds if the lock is not currently held but fails otherwise, and releasing always succeeds. If threads concurrently access a non-blocking lock, then each access completes within steps. Hence non-blocking locks can be used to support binary fork/join in steps. We can also support reactivation calls to a process via fetch-and-add, where reactivating a process guarantees that it will run again within span after its current run (if any) finishes if there are always at most concurrent reactivations of that process, and that there are at most as many runs as reactivation calls. Reactivation calls can be used to implement a barrier, on which a thread can wait until it is notified. (See the Appendix for detailed implementations.)
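The following sketch shows the idea of a non-blocking lock built on test-and-set. It is not the implementation from the Appendix: the class names `Cell` and `TryLock` are hypothetical, and the internal `threading.Lock` exists only to emulate the atomicity that hardware test-and-set would provide.

```python
import threading


class Cell:
    """One memory cell holding 0 or 1, with an emulated atomic test-and-set."""

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()  # stands in for hardware atomicity only

    def test_and_set(self):
        """Atomically set the cell to 1 and return its previous value."""
        with self._atomic:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        """Atomically reset the cell to 0."""
        with self._atomic:
            self._value = 0


class TryLock:
    """Non-blocking lock: acquiring never waits, it simply succeeds or fails."""

    def __init__(self):
        self._cell = Cell()

    def try_acquire(self):
        # Succeeds exactly when the cell was 0, i.e. the lock was not held.
        return self._cell.test_and_set() == 0

    def release(self):
        # Releasing always succeeds.
        self._cell.clear()


lock = TryLock()
print(lock.try_acquire())  # → True
print(lock.try_acquire())  # → False
lock.release()
print(lock.try_acquire())  # → True
```

Since each call performs a single RMW operation on one cell, concurrent calls are simply queued at that cell under the QRMW model.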
All the results in this paper also hold in the (asynchronous) QRMW PRAM model in [7]. It is worth noting that the cost of contention in the QRMW PPM model requires more sophisticated techniques because we cannot use pointer arithmetic. Furthermore, since all data structures and algorithms in this paper have bounded contention (i.e. there is a constant such that there are never more than pending memory requests to each memory cell), they are trivially implementable in the (synchronous) EREW PPM model as well.
6 Basic Parallel Batch Operations
We will always store any (nonempty) batch of items in a BBT, namely a leaf-based balanced binary tree (i.e. with the items only at its leaves). Each binary tree is identified with its root node , and each node of stores the following:

and are its left and right child nodes respectively.

and are the height and number of leaves respectively of the subtree at .

and are the first item and last item respectively in the subtree at .
For convenience, we shall also use to denote the nodes in , and to denote the nodes in with subtree height (i.e. iff ).
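A leaf-based balanced binary tree with the node fields listed above can be sketched as follows. The field names `left`, `right`, `height`, `size`, `first` and `last` are placeholders chosen for this illustration, as is the convention that a leaf has height 0; the builder splits at the midpoint, which gives a balanced tree but not necessarily the exact shape used in the paper.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class Node:
    left: Optional["Node"]   # left child (None at a leaf)
    right: Optional["Node"]  # right child (None at a leaf)
    height: int              # height of the subtree (assumed 0 at a leaf)
    size: int                # number of leaves in the subtree
    first: Any               # first item in the subtree
    last: Any                # last item in the subtree


def make_leaf(item: Any) -> Node:
    return Node(None, None, 0, 1, item, item)


def make_internal(left: Node, right: Node) -> Node:
    return Node(left, right, 1 + max(left.height, right.height),
                left.size + right.size, left.first, right.last)


def build_bbt(items: Sequence[Any]) -> Node:
    """Build a balanced leaf-based tree over a nonempty sequence of items."""
    if not items:
        raise ValueError("a batch stored in a BBT must be nonempty")
    if len(items) == 1:
        return make_leaf(items[0])
    mid = (len(items) + 1) // 2
    return make_internal(build_bbt(items[:mid]), build_bbt(items[mid:]))


def leaves(node: Node) -> list:
    """List the items at the leaves in left-to-right order."""
    if node.left is None:
        return [node.first]
    return leaves(node.left) + leaves(node.right)


root = build_bbt([10, 20, 30, 40, 50])
print(root.size, root.height, root.first, root.last)  # → 5 3 10 50
```

Storing the size and the first and last items at every node is what lets a procedure decide locally, without descending, which leaves a subtree covers.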
In this section we shall show how to do some basic operations on batches in the QRMW PPM model:

Filter a given batch of items based on an span condition, within work and span.

Balance a given batch of items (i.e. make it a complete BBT), within work and span.

Partition a sorted batch of items around a sorted batch of pivots, within work and span.

Join a batch of batches of items, with items in total, within work and span.

Merge two sorted batches of items, with items in total, within work and span.
6.1 Pipelined Splitting
The pipelined splitting scheme is the key technique employed here for these parallel batch operations. The basic idea is that if we want to distribute the leaves of a binary tree to the leaves of another binary tree , such that each leaf of receives an ordered slice of (Definition 2), then we can push down in a pipelined fashion. Specifically, when a subtree arrives at an internal node, either we can push it down to a child node, or we can start splitting it according to the desired final distribution, each time pushing one of its halves down to a child node. Note that the subtrees arriving at each node of form a slice of .
Definition 2 (Binary Tree Slice).
A slice of a binary tree is a sequence of disjoint non-sibling subtrees of that contain a set of consecutive leaves of . An ordered slice of is a slice of that has the subtrees listed in rightward order in .
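To make Definition 2 concrete, the sketch below computes one particular ordered slice, namely the maximal subtrees that cover a given range of consecutive leaves. The names `Tree`, `build` and `ordered_slice` and the half-open index range are assumptions for this illustration.

```python
class Tree:
    """Minimal leaf-based binary tree node (hypothetical helper for this sketch)."""

    def __init__(self, left=None, right=None, item=None):
        self.left, self.right, self.item = left, right, item
        self.size = 1 if left is None else left.size + right.size


def build(items):
    """Balanced leaf-based tree over a nonempty list of items."""
    if len(items) == 1:
        return Tree(item=items[0])
    mid = (len(items) + 1) // 2
    return Tree(build(items[:mid]), build(items[mid:]))


def ordered_slice(node, lo, hi):
    """Maximal subtrees covering exactly the leaves with index in [lo, hi),
    listed in left-to-right order. Requires 0 <= lo <= hi <= node.size."""
    if lo >= hi:
        return []
    if lo == 0 and hi == node.size:
        return [node]  # whole subtree covered: never return two siblings instead
    k = node.left.size
    return (ordered_slice(node.left, lo, min(hi, k)) +
            ordered_slice(node.right, max(lo - k, 0), hi - k))


tree = build(list(range(8)))
parts = ordered_slice(tree, 1, 7)
print([p.size for p in parts])  # → [1, 2, 2, 1]
```

In this example the subtree sizes first grow and then shrink from left to right, which is the shape that results from cutting a range of leaves out of a balanced tree.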
It turns out that we can use queues and to store the unprocessed subtrees at , and maintain the splitting invariant that forms an ordered slice of . To do so, when we process a subtree at from , if we can push it down whole to a child of then we push it onto , otherwise when we are splitting it we always push the split subtrees onto or . Figure 1 illustrates this pipelined splitting scheme.
This scheme can be carried out using a process at each node that is run only via reactivation calls (see Section 5.3), where each run of processes one subtree from each queue (if any), and reactivates for any child of that it pushes a subtree down to. If the processed subtree is to be split, forks a separate splitting process for that. At the start we can simply push onto . One can now observe that at most one subtree that arrives at a node will be split, and the splitting invariant guarantees that every subtree that arrives at later will only be pushed onto the outer queues or . Therefore no concurrent pushes and no concurrent pops are ever performed on any queue.
Thus each queue can be implemented using a dedicated queue, which is a non-blocking queue implemented by a linked list in the following manner. maintains pointers to both the first node and the last node , and every node in except stores an item and a pointer to the next node , and . Initially . The queue operations are implemented as follows:

Push( ): Create Node with . Set . Set . Set .

Pop(): Create Pointer . If , set . Return .
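One plausible reading of this linked-list queue is sketched below, assuming that the last node is an empty placeholder and that the queue is empty exactly when the first and last pointers coincide. The names `head`, `tail`, `item` and `next` are placeholders, and `pop` here returns the item (or `None` when empty) rather than a pointer, so items equal to `None` are not supported by this sketch.

```python
class _QNode:
    __slots__ = ("item", "next")

    def __init__(self):
        self.item = None
        self.next = None


class DedicatedQueue:
    """Linked-list FIFO queue meant for at most one pusher and one popper at a time."""

    def __init__(self):
        # A single placeholder node: first and last coincide, so the queue is empty.
        self.head = self.tail = _QNode()

    def push(self, item):
        placeholder = _QNode()
        old_tail = self.tail
        old_tail.item = item        # fill in the current placeholder
        old_tail.next = placeholder
        self.tail = placeholder     # publish last, so a popper never sees a half-filled node

    def pop(self):
        node = self.head
        if node is self.tail:       # only the placeholder remains: queue is empty
            return None
        self.head = node.next
        return node.item


q = DedicatedQueue()
q.push("a")
q.push("b")
print(q.pop(), q.pop(), q.pop())  # → a b None
```

Because `push` writes only through the last pointer and `pop` only through the first pointer, one pusher and one popper never write the same field, which is why the absence of concurrent pushes and of concurrent pops suffices.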
The total span is just if we can always determine whether a subtree of can be pushed down whole from a given node of to a given child of or not within span. Intuitively, this is because the processing of a subtree of at node of depends only on the processing of at the parent of and the processing of the preceding subtree of at (either is before in the queue or was a parent subtree of in the splitting process), and has lower depth than in . But we must know what exactly the pipelined splitting scheme is used for to get a good bound on the total work.
6.2 Parallel Filtering
Parallel filtering an (unsorted) batch according to a condition , without changing the order in the batch, is done in 3 phases:

Preprocessing phase: Each item in that satisfies has a rank in the sublist of that satisfies , which we shall call its filtered-rank. Recursively compute for each node in the number of filtered items (i.e. items that satisfy ) in the subtree at . Then recursively compute for each node in the range of filtered-ranks of the filtered items in the subtree at . Then construct a blank batch of size that is a complete BBT (i.e. every level is full except perhaps the last), and compute for each node of the number of leaves in its subtree and the range of their ranks in . And place a barrier at each leaf of .

Pushdown phase: Use the pipelined splitting scheme (Section 6.1) to push down , where a subtree of is pushed down whole to a node of iff . Then clearly each leaf of will be pushed down to a unique leaf of that has an item satisfying , and the order of those leaves in is the same as the order of those items in . Thus when a leaf of reaches a leaf of , we can simply copy the item from to and then notify .

Collating phase: After initiating the pushdown phase, recursively wait on for each leaf of , before returning . Then clearly is only returned after the pushdown phase has finished.
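The role of the filtered-ranks can be seen in a purely sequential sketch: once every node knows how many filtered items lie in its subtree, the destination rank of each filtered item is determined, so order is preserved without comparing items. The sketch below uses a flat output list in place of the blank output batch and omits the queues, barriers and pipelining entirely; all names are hypothetical.

```python
def build(items):
    """Balanced leaf-based tree over a nonempty list, as nested dicts."""
    if len(items) == 1:
        return {"item": items[0]}
    mid = (len(items) + 1) // 2
    return {"left": build(items[:mid]), "right": build(items[mid:])}


def count_filtered(node, keep):
    """Store at each node the number of items in its subtree that satisfy keep."""
    if "item" in node:
        node["count"] = 1 if keep(node["item"]) else 0
    else:
        node["count"] = (count_filtered(node["left"], keep) +
                         count_filtered(node["right"], keep))
    return node["count"]


def place(node, start, out):
    """The filtered items under node occupy ranks start .. start + count - 1."""
    if node["count"] == 0:
        return
    if "item" in node:
        out[start] = node["item"]
        return
    place(node["left"], start, out)
    place(node["right"], start + node["left"]["count"], out)


def filter_batch(items, keep):
    """Order-preserving filter driven only by subtree counts."""
    if not items:
        return []
    root = build(items)
    out = [None] * count_filtered(root, keep)
    place(root, 0, out)
    return out


print(filter_batch([5, 2, 8, 3, 9, 4], lambda x: x % 2 == 0))  # → [2, 8, 4]
```

The two recursive calls in `place` write to disjoint rank ranges, so they are independent of each other.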
We shall now give the technical details, including the specific pushdown phase produced by applying the pipelined splitting scheme here. We shall also state the splitting invariant in greater detail and prove it specifically for parallel filtering.
Definition 3 (Parallel Filtering).
Parallel filtering an (unsorted) batch (or more generally a leaf-based binary tree) according to a condition (such as less than a pivot key), without changing the order in the batch, is done via the following procedure:

First preprocess the input batch and prepare the output batch:

Recursively for each node of , compute the number of items in the subtree at that satisfy . Then recursively compute and for each node of , defined by and and and for each internal node of .

If , return a blank output batch (skipping all the other phases).

Construct a blank output batch of size , and compute for each node of the number of leaves in its subtree, and and defined in exactly the same way as for .

Recursively place at each leaf of a Barrier . // see Appendix, item 40


Then push down to the appropriate leaf nodes via a pipelined splitting scheme:

Recursively place at each node of DedicatedQueue initialized to be empty.

Define feeding to to be pushing onto and then reactivating .

Start by feeding to .

Whenever is reactivated for some node of , it runs the following for each in sequentially:

Pop subtree off . If (i.e. was empty), continue (i.e. skip this ).

Reactivate .

If is a leaf, copy the item from into and then call and return.

If , feed to and return.

If , feed to and return.

Fork the following splitting process:

While is not a leaf:

If :

Feed to and set .


Otherwise:

Feed to and set .



If , feed to , otherwise feed to .




In parallel with the pushdown phase, recursively call for each leaf of (and wait for all to finish), and then recursively update and for each node of before returning .
Lemma 4 (Parallel Filtering Invariants).
The parallel filtering algorithm (Section 6.2) satisfies the following for each node of :

The subtrees fed to (i.e. to either or ) form a slice of .

The subtrees fed to are (strictly) on the left in of those fed to .

The subtrees fed to are in (strictly) leftward order and increasing depth in .

The subtrees fed to are in (strictly) rightward order and increasing depth in .

The splitting process (item 4f) at runs at most once, and once it starts running it will be the only process that pushes onto or .
Proof.
We use structural induction on . Invariants 1,2 follow from themselves for (i.e. the parent of ). Invariants 3,4 follow from themselves and Invariants 2,5 for . To establish Invariant 5, observe that the splitting process runs at most once per node, by Invariant 1, and when it starts running on a subtree from the following hold:

Every subtree from is on the right of , by Invariant 2, and so none of it gets fed to . Hence from then on only the splitting process pushes onto .

Every subsequent subtree from is on the left of , by Invariant 3, and so none of it gets fed to . Also, every preceding subtree from had already been fed to in a preceding run of . Hence from then on only the splitting process pushes onto .
Likewise for a subtree from , by symmetry. Therefore Invariant 5 holds.
Using these invariants we can prove both the correctness and costs of parallel filtering. In particular, by Invariant 5 there are no concurrent pushes performed on the dedicated queues, as required, and hence the pipelined splitting scheme runs correctly. And now we shall bound the parallel filtering costs.
Definition 5 (LogSplitting Property).
We say that a binary tree is log-splitting if every slice of containing (consecutive) leaves of has at most subtrees of .
Theorem 6 (Parallel Filtering Costs).
Parallel filtering a batch of size according to a condition takes work and span if every evaluation of takes work and span.
Proof.
The preprocessing phase clearly takes work and span. And the collating phase clearly takes work and finishes within span after the pushdown phase finishes. So it only remains to show that the pushdown phase takes work and span. Clearly initializing the queues takes work and span.
Now observe that the remaining work taken is times the number of feedings, since every self-reactivation of corresponds to a unique subtree that had been popped off or . Each node of has subtrees fed to it (by Lemma 4, Invariant 1, since is log-splitting), and is times the length of the shortest path from to a leaf (since is a BBT). Thus the number of feedings is times the number of edges of , which is .
To bound the span taken, partition each run of or the splitting process (item 4f) into fragments based on the value of during the run, and for each such we call that fragment a fragment. Observe that each fragment runs within span. Now consider each fragment . If was popped off some , then depends on at most a fragment (that fed to ) and a fragment where is the subtree before in , and note that has a lower depth than in by Lemma 4, Invariants 3 and 4. If was split off as a left/right child of a subtree during the splitting process, then depends only on the fragment, and of course has lower depth than in . In either case, runs within span once the fragments it depends on have finished. Therefore by induction each fragment runs within span where are the depths of in respectively. Thus the whole pushdown phase finishes within span.
Sometimes, it is also useful to use parallel filtering on a leaf-based binary tree that may not be balanced, in which case we have the following cost bounds.
Theorem 7 (Parallel Filtering Costs).
Parallel filtering a leaf-based binary tree of size and height according to a condition takes work and span if every evaluation of takes work and span and the output batch has size .
Proof.
By the same argument as above, we just need to bound the number of feedings, which is clearly because the output batch has nodes and each is fed at most times. The proof of the span bound is the same as above.
Corollary 8 (Parallel Balancing).
Parallel balancing a batch of items to make the underlying BBT a complete binary tree, without changing the order in the batch, can be done by parallel filtering (Section 6.2) with no condition (i.e. the condition always returns ).
Theorem 9 (Parallel Balancing Costs).
Parallel balancing a batch of items takes work and span.
Proof.
The claim follows immediately from Theorem 6 (Parallel Filtering Costs).
In general, we can change the shape of the underlying BBT of any batch by simply parallel filtering with the output batch constructed to have the desired shape.
6.3 Parallel Partitioning
Exactly the same technique allows us to do parallel multiway partitioning of a sorted batch of items around a sorted batch of pivot items, in 3 similar phases:

Preprocessing phase: Insert into (as the rightmost leaf). Then place a flag and a barrier at each node of .

Pushdown phase: Use the pipelined splitting scheme (Section 6.1) to push down , where a subtree of is pushed down whole from a node of to iff and to iff . Then clearly each item in will be pushed down in some subtree of to the leftmost leaf of such that (treating as more than every item). Along the way, if finds that both and are empty and that has been notified, then it waits for the splitting process at to finish (if any, via a barrier created just before the splitting process is started), and then freezes if is . Freezing comprises setting and for each child of notifying before reactivating . This ensures that is notified for every leaf node of once all items in have been pushed down to a leaf of , and no earlier.

Collating phase: After initiating the pushdown phase, recursively for each leaf of , wait on before joining the subtrees in each in reverse order (which is from short to tall), and then tagging with the join of the results.
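The intended result of the partitioning can be stated with a short sequential sketch: each item ends up at the leftmost pivot that is at least as large as it, with one extra final part for items larger than every pivot. The function name `partition_around` and the list-of-lists output are assumptions for this illustration; a binary search stands in for the pipelined pushdown, so this shows only the output, not the parallel algorithm.

```python
from bisect import bisect_left


def partition_around(items, pivots):
    """Partition a sorted list of items around a sorted list of pivots.

    Returns len(pivots) + 1 sorted parts: part i holds the items whose least
    upper bound among the pivots is pivots[i], and the final part holds the
    items greater than every pivot.
    """
    parts = [[] for _ in range(len(pivots) + 1)]
    for x in items:
        # bisect_left gives the index of the first pivot p with p >= x.
        parts[bisect_left(pivots, x)].append(x)
    return parts


print(partition_around([1, 3, 4, 7, 9, 12], [3, 8]))  # → [[1, 3], [4, 7], [9, 12]]
```

Since the items are sorted, every part is a run of consecutive items.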
As before, we shall give the technical details of the whole parallel partitioning algorithm here.
Definition 10 (Parallel Partitioning).
Parallel partitioning a sorted batch of items around a sorted batch of pivot items is done via the following procedure:

Insert into (as the rightmost leaf).

Then push down by essentially the same pipelined splitting scheme as in parallel filtering (Section 6.2):

Recursively place at each node of :

DedicatedQueue , each initialized to be empty.

Bool .

Barrier . // see Appendix, item 40

Pointer .


Define feeding to to be pushing onto and then reactivating .

Start by calling and then feeding to .

Whenever is reactivated for some node of , it runs the following:

If is a leaf, return.

Create Bool

For each in :

Pop subtree off . If (i.e. was empty), continue (i.e. skip this ).

Reactivate and set .

If , feed to and continue.

If , feed to and continue.

Create Barrier . // see Appendix, item 40

Fork the following splitting process:

While is not a leaf:

If :

Feed to and set .


Otherwise:

Feed to and set .



If , feed to , otherwise feed to .

Call .



If and : // if will not feed any more subtrees to its children //

If , call .

If , return, otherwise set .

Call and .

Reactivate and .




In parallel with the pushdown phase, recursively call for every leaf of (i.e. wait for all leaves of to be pushed down to a leaf of ), and then collate the results:

At each leaf of , join the batches in each queue in reverse order, and tag with the join of the results.

Then every leaf of is tagged with the sorted batch of all items in whose least upper bound in is the pivot at .

The correctness of the pushdown phase in this parallel partitioning algorithm follows in the same way as for parallel filtering. To check the correctness of the whole algorithm, we just need to observe the following invariants:

If at any time is , then there will be no more feeding to any queue of .

If at any time an internal node has empty queues and is not running the splitting process, and is , then within span both and will be and remain .
These invariants imply that eventually will become and remain for every leaf of , after which the recursive waiting will be done and the results will be collated.
Lemma 11 (BBT Slice Joining).
Any ordered slice of a BBT containing leaves can be joined into a single BBT in sequential time.
Proof.
is the concatenation of two ordered slices such that in each of them the subtrees have monotonic height with at most one pair of the same height. Thus we can join the subtrees in each ordered slice from shortest to tallest, taking time per join, and then join the two results in time.
Lemma 12 (BBT Log Sum Bound).
Take any real , and any BBT with leaves and a nonnegative real weight function on its nodes such that for every . Then .
Proof ().
Let be the size of . Each node in has at least leaves by induction, and hence where . And by Jensen’s inequality. Since increases when increases, .
Theorem 13 (Parallel Partitioning Costs).
Parallel partitioning a sorted batch of items around a sorted batch of pivots takes work and span.
Proof ().
First we bound the work and span taken by the pushdown phase. Preparing (i.e. inserting , balancing, and initializing the nodes) takes work and span, and by Jensen’s inequality. Observe that the remaining work is times the number of feedings plus times the number of times is set from to for some node . The latter is clearly , so it suffices to bound the number of feedings.
The subtrees fed to each node of form a slice of (creftype 2), so the number of feedings to is at most where is the total number of items in that slice (since is log-splitting). And clearly for every . Thus by Lemma 12 (BBT Log Sum Bound) the total number of feedings is .
Therefore the pushdown phase takes work and span, where the proof of the span bound is the same as for Theorem 6 (Parallel Filtering Costs).
Finally, we bound the work and span taken by the collation phase. The waiting clearly takes work and span. Joining the subtrees at each leaf of takes work/span (see creftype 4 and Lemma 11). Thus the collation phase takes times the work taken by the pushdown phase, and we are done.
6.4 Parallel Joining
Parallel joining of a batch of batches is a very useful basic operation, but we will not need to use it in the batch-parallel 2-3 tree . Nevertheless, we include it here for the sake of completeness.
Definition 14 (Parallel Joining).
Parallel joining a batch of batches is done via the following 2-phase procedure:

For each leaf of , replace by the BBT for the batch at (so that now each leaf of has only one item).

Parallel filter (Section 6.2) with no condition to obtain the output batch , except push down instead.
Theorem 15 (Parallel Joining Costs).
Parallel joining a batch of batches with total size takes work and span.
Proof ().
Phase 1 clearly takes work and span. After phase 1, may not be a BBT, but still has height and is log-splitting. Thus the same proof as for Theorem 6 (Parallel Filtering Costs) holds, and hence phase 2 takes work and span.
Remark 0 ().
Actually, in phase 2 we can push down as originally. For any binary tree , let be the number of leaves of the subtree at each node , and we say that is balanced if for every node of . Then for every balanced binary tree with leaves, . This fact suffices for the work bound, since every BBT is balanced and so after the first step is balanced. Its proof is as follows. Let be the ancestors of any node of (including ). Then for the th ancestor of any leaf of such that . Note that and for any naturals . Thus , because .
Note that if we have a batch of unsorted instances of , rather than just a batch of plain batches, then there is a more efficient algorithm to join them (Section 7.5).
6.5 Parallel Merging
Another useful operation is parallel merging of two sorted batches. As with parallel joining, we do not need it in the batch-parallel 2-3 tree , but we shall provide the algorithm here.
Definition 16 (Parallel Merging).
Parallel merging two sorted lists is done via the following 3-phase procedure:

Parallel partition (Section 6.3) around , resulting in a part of at each leaf of .

For each leaf of , insert the item of at into the part of at , optionally combining duplicates of the same item. (This combining procedure can be any time procedure.)

Parallel join (Section 6.4) the resulting batches at the leaves of .
Theorem 17 (Parallel Merging Costs).
Parallel merging sorted lists with total size takes work and span.
Proof ().
Let be the sizes of respectively. Then phase 1 takes work and span (Theorem 13). Let be the size of the part of that was at the th leaf of after phase 1. Then phase 2 takes work (by Jensen’s inequality) and span. And phase 3 takes work and span (Theorem 15).
Note that if we have two sorted instances of , rather than plain batches, then we can merge them more efficiently (Section 8).
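The three phases of Definition 16 can be traced with a small sequential sketch. It shows the data flow only and carries none of the parallel cost bounds. It assumes each input list is sorted with distinct items; the function name and the `combine` parameter are hypothetical stand-ins for the optional duplicate-combining procedure.

```python
from bisect import bisect_left

def merge_by_partition(a, b, combine=None):
    """Merge sorted lists `a` and `b` by partitioning `a` around the pivots in `b`."""
    # Phase 1: parts[i] receives the items of `a` whose least upper bound in `b`
    # is b[i]; the extra last part receives items above every pivot.
    parts = [[] for _ in range(len(b) + 1)]
    for x in a:
        parts[bisect_left(b, x)].append(x)
    # Phase 2: insert each pivot into its own part. Every item there is <= the
    # pivot, so the pivot belongs at the end; an equal last item is a duplicate.
    for i, p in enumerate(b):
        part = parts[i]
        if combine is not None and part and part[-1] == p:
            part[-1] = combine(part[-1], p)
        else:
            part.append(p)
    # Phase 3: join the parts in pivot order.
    merged = []
    for part in parts:
        merged.extend(part)
    return merged

merged = merge_by_partition([1, 3, 5, 7], [2, 3, 6], combine=lambda x, y: x)
```

Passing no `combine` keeps both copies of a duplicated item, which corresponds to skipping the optional combining step in phase 2.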
7 Batch-Parallel 2-3 Tree
We now present the batch-parallel 2-3 tree and explain how to support the following operations on :

Unsorted batch search: Search for an unsorted batch of items within work and span, tagging each search with the result and a direct pointer to the item in (if it exists).

Sorted batch access: Perform an item-sorted batch of accesses (i.e. searches/updates, inserts and deletes) to distinct items within work and span, tagging each access with the result and a direct pointer to the item in (if it exists).

Batch reverse-indexing: Given an unsorted batch of direct pointers to distinct items in , return a sorted batch of the items at those leaves within work and span.
Here is the number of items in , and a direct pointer is an object that allows reading or modifying any values attached to the item in (but of course not modifying the item itself) in steps. It must also be used in the reverse-indexing operation.
The unsorted batch search is useful when in which case it is more efficient than sorting the batch. The reverse-indexing operation is useful if we want to have synced instances of with the same items but sorted differently, which we can achieve by tagging each item with direct pointers into the other instances of .
Note that the sorted batch access requires that the accesses are to distinct items, but there is no actual disadvantage to that constraint. Suppose we are given an item-sorted input batch of accesses that may have multiple accesses to the same item. We can perform an easy parallel recursion on to compute which accesses to an item are the leftmost access to in . Then we can recursively join all the accesses to into a single batch (see Lemma 11), store it at the leftmost access to , and compute the effective result of (if they are performed in order), within work per access and span. After that, we can parallel filter (Section 6.2) out those leftmost accesses from to obtain an item-sorted batch of the effective accesses, which are to distinct items, within work and span (Theorem 6). We can now perform the usual sorted batch access on , and perform one more parallel recursion to tag the original accesses in with the appropriate results.
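As an illustration of this reduction, the sketch below groups an item-sorted batch by item and replays each group in order. It is sequential, and the access model is an assumption made only for the example: each access is an `(item, op)` pair with `op` one of `"search"`, `"insert"` or `"delete"`, and the result of an access is a boolean. The names `group_accesses` and `replay` are hypothetical.

```python
from itertools import groupby

def group_accesses(batch):
    """Collect all accesses to the same item at that item's leftmost access.

    `batch` must be item-sorted, so equal items are adjacent.
    """
    return [(item, [op for _, op in grp])
            for item, grp in groupby(batch, key=lambda access: access[0])]

def replay(ops, present):
    """Perform `ops` in order on one item; return per-access results and final presence.

    Assumed result convention: search reports presence, insert reports whether
    the item was newly added, delete reports whether the item was removed.
    """
    results = []
    for op in ops:
        if op == "search":
            results.append(present)
        elif op == "insert":
            results.append(not present)
            present = True
        elif op == "delete":
            results.append(present)
            present = False
        else:
            raise ValueError(f"unknown op: {op}")
    return results, present

batch = [(1, "insert"), (1, "delete"), (2, "search"), (4, "insert"), (4, "insert")]
groups = group_accesses(batch)
```

After grouping, the items are distinct, so one access per group can go through the usual sorted batch access, and `replay` yields the results to tag onto the original accesses.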
7.1 Preliminaries & Notation
stores the items in a leaf-based 2-3 tree encoded as a leaf-based red-black tree (i.e. every red node has two black child nodes, and every black node is either a leaf or has two child nodes at most one of which is red, and the black nodes correspond to the nodes of the 2-3 tree). From now on we shall drop the adjective “leaf-based” since we only use leaf-based binary trees and 2-3 trees.
For any 2-3 tree we shall also denote the children of a node of by and and (if it exists), and denote the height of in by . If is encoded as a red-black tree and corresponds to the node in , then would correspond to the first black descendant of (and not necessarily ), and likewise for , and would be the number of black nodes excluding in any path from to a leaf in . These apparent ambiguities will always be resolved by the context, which will always specify whether we treat a node as in a 2-3 tree or in a red-black tree.
For convenience, let denote the standard join of 2-3 trees in that order, and identify a 2-3 tree with its root node. Also we shall write “” and “” as short for “” and “” respectively.
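To make the two views of a node concrete, here is a minimal sketch of reading 2-3 children and 2-3 height off the red-black encoding described above. The class `RBNode` and both function names are hypothetical and serve only this illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class RBNode:
    red: bool = False
    left: Optional["RBNode"] = None
    right: Optional["RBNode"] = None
    item: Any = None  # stored at leaves only (leaf-based tree)

def is_leaf(v: RBNode) -> bool:
    return v.left is None and v.right is None

def two_three_children(v: RBNode):
    """Children of black node `v` in the 2-3 view: its first black descendants."""
    children = []
    for c in (v.left, v.right):
        if c is None:
            continue
        if c.red:
            # A red child is absorbed into v; its two black children count as v's.
            children.extend([c.left, c.right])
        else:
            children.append(c)
    return children

def two_three_height(v: RBNode) -> int:
    """Number of black nodes, excluding `v`, on a path from `v` to a leaf."""
    height = 0
    while not is_leaf(v):
        v = v.left
        if not v.red:
            height += 1
    return height

a, b, c = RBNode(item=1), RBNode(item=2), RBNode(item=3)
root = RBNode(left=RBNode(red=True, left=a, right=b), right=c)
```

Here `root` encodes a 2-3 node with three leaf children, so its 2-3 height is 1 even though the red-black path to `a` has two edges.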
7.2 Unsorted Batch Search
Performing an unsorted search on an input batch of items is done by calling (Definition 18). Note that we cannot simply spawn a thread for each search that traverses from root to leaf, as it would incur span at the root of in the queued contention model (Section 5.3).
Definition 18 (Unsorted Search).

Private USearch( Node of BBT , Item Batch ): // treat as a BBT

If is empty, return.

If is a leaf, recursively tag each item in with a direct pointer to if , and then return.

Parallel partition around pivot into a lower part and an upper part (see Section 6.2).

In parallel call and (and wait for both to finish).

Theorem 19 (Unsorted Search Costs).
takes work and span.
Proof ().
Clearly we can ignore any call where is empty. Each call with nonempty at an internal node of takes work and span to partition into and (Theorem 6). Hence the entire unsorted batch search takes work and span.
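The recursion of Definition 18 can be sketched sequentially as follows. The sketch keeps the key structural idea (the batch is split at each node and travels down as a whole, instead of one root-to-leaf traversal per query) but runs the two recursive calls one after the other and uses plain list filtering in place of the parallel partition. The `Node` layout, the pivot convention (largest item of the left subtree) and all function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class Node:
    item: Any = None    # set at leaves only
    pivot: Any = None   # set at internal nodes (assumed: largest item on the left)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(sorted_items: List[Any]) -> Node:
    """Build a balanced leaf-based search tree over distinct sorted items."""
    if len(sorted_items) == 1:
        return Node(item=sorted_items[0])
    mid = len(sorted_items) // 2
    return Node(pivot=sorted_items[mid - 1],
                left=build(sorted_items[:mid]),
                right=build(sorted_items[mid:]))

def usearch(v: Node, batch: List[int], queries: List[Any], result: List[Any]) -> None:
    """`batch` holds indices into `queries`; matching leaves are written to `result`."""
    if not batch:
        return
    if v.left is None:
        for i in batch:
            if queries[i] == v.item:
                result[i] = v  # the leaf object stands in for a direct pointer
        return
    lower = [i for i in batch if queries[i] <= v.pivot]
    upper = [i for i in batch if queries[i] > v.pivot]
    usearch(v.left, lower, queries, result)   # these two calls run in parallel
    usearch(v.right, upper, queries, result)  # in the actual algorithm

def unsorted_search(root: Node, queries: List[Any]) -> List[Any]:
    result = [None] * len(queries)
    usearch(root, list(range(len(queries))), queries, result)
    return result

hits = unsorted_search(build([1, 3, 5, 7]), [5, 2, 7, 5, 1])
```

Working on indices instead of items lets repeated queries each receive their own tag, as in the two searches for 5 above.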
7.3 Sorted Batch Access
Performing a sorted access on an item-sorted input batch of accesses to distinct items is done in 3 phases:

Splitting phase: Split the items in (treated as a BBT) around the items in , using parallel partitioning (Section 6.3) but without inserting and without collating. The result is that every item in will be in some subtree of at a leaf of such that every leaf of before has item less than and every leaf of after has item greater than .

Execution phase: At each leaf of , join the subtrees of at into a single 2-3 tree (see creftype 26), and execute the access at on that 2-3 tree.

Joining phase: Recursively join the 2-3 trees at the leaves of via a pipelined joining scheme that pushes those 2-3 trees down each other. Here is a high-level overview and explanation of the algorithm:

We define the spine structure (Definition 20) of a non-root spine node (i.e. along the leftmost/rightmost path) of a 2-3 tree as the binary number of bit-length where the th (most significant) bit is if the th spine node from downwards (along the spine) has children, and is otherwise. Then given any 2-3 trees and their left/right children’s spine structures, we can within steps determine whether the join overflows (i.e. is taller than the original trees) and compute the left/right children’s spine structures for the join (Theorem 21).

Augmenting every 2-3 tree with spine structure (i.e. the spine structure of every non-root spine node is stored in ) allows us to join any 2-3 tree into top-down, if or , where we view the joining as pushing down the spine of , and at each node we perform a local adjustment that has the desired effect, based on and and alone. Specifically, we immediately update and to their final values after the join. This includes when overflows where is the next node along the way, in which case we also create a blank child of and tag with , so that at we can move the overflowed subtrees to without having to access again. (See Table 1 below for all needed local adjustments.)

Observe that the above top-down joining procedure naturally decomposes into local adjustments (along the spine of the taller tree) each of which is independent from other local adjustments made at any other node in the 2-3 tree (creftype 27), and hence multiple join operations can be pipelined without affecting the result (creftype 28), as long as the local adjustments done at each node remain in the same order.

This order constraint is easily achieved by using a dedicated queue at each spine node of to maintain the 2-3 trees currently at , and using a process that is run only via reactivation calls (see Section 5.3) to process each 2-3 tree in one by one and perform the appropriate local adjustment at before pushing down to a child of if appropriate. To push a 2-3 tree down to a node , we push it onto (the back of) and then reactivate . also reactivates itself after it has processed each 2-3 tree from .

Putting everything above together: We just need to prepare each 2-3 tree at a leaf of by augmenting it with spine structure, and then at each internal node of recursively compute the join of the 2-3 trees computed by its children, pipelined in the above manner. Then the root of would effectively compute the join of all the 2-3 trees at the leaves of , in the sense that its final state after all queued trees have been processed is the desired join (creftype 30).

So at the end we just have to wait for all queued trees to be processed, which can be done by waiting on a barrier at every internal node of (see Section 5.3), where is notified when the corresponding joining has finished. If that joining was of into , it finishes after the local adjustment that makes a subtree of the resulting 2-3 tree, so we tag with and make the local adjustment that finishes the joining notify .

Operation  Case  Local Adjustment 
Join 2-3 trees and (where or )  
and overflows  
and does not overflow  
Join 2-3 tree to the right of 2-3 subtree (where )  
and overflows  
and does not overflow 
“” denotes that is to be joined to the right of . “” denotes that is tagged with .
It turns out that the same techniques used in the proof of Theorem 13 (Parallel Partitioning Costs) can be used to prove the desired work and span bounds for the sorted batch access (creftype 24, creftype 33, creftype 34).
We shall now fill in the technical details. First is the definition of spine structure and the proof that it can be easily computed for the result of any join without actually performing the join.
Definition 20 (2-3 Tree Spine Structure).
Take any 2-3 tree . The spine structure of a node of that is a right child is defined as where is the number of children of the node on the right spine of the subtree at with distance from the leaf. Symmetrically for the spine structure of a node on the left spine of . The right spine structure of a node of is denoted by and is defined as if is a leaf and otherwise, where is the number of children of and . Symmetrically for the left spine structure of denoted by . Note that if is a non-root right spine node, then , and symmetrically for a non-root left spine node.
Theorem 21 (2-3 Tree Join Spine Structure).
Take any 2-3 trees . Given and , within steps we can determine whether the join overflows (i.e.
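The intuition for why a spine structure lets overflow be decided without walking the spine is that attaching a subtree to a spine behaves like a binary increment. The sketch below is a simplified model and not necessarily the exact encoding of Definition 20. It assumes one bit per spine node, set to 1 exactly when that node has 3 children, with bit `i` for the spine node `i` levels above the lowest one, and it includes the root in the spine. All names are hypothetical.

```python
def attach_explicit(counts, d):
    """Attach one extra child to spine node `d`, splitting upward as needed.

    counts[i] is the number of children (2 or 3) of the i-th spine node,
    counted from the bottom. Returns (new_counts, overflowed).
    """
    counts = list(counts)
    i = d
    while i < len(counts) and counts[i] == 3:
        counts[i] = 2  # 4 children split into 2 + 2; one half stays on the spine
        i += 1         # the other half becomes an extra child one level up
    if i == len(counts):
        counts.append(2)  # the old root split, so a new root with 2 children appears
        return counts, True
    counts[i] = 3
    return counts, False

def encode(counts):
    """Pack child counts into an integer: bit i is 1 iff spine node i has 3 children."""
    return sum((c - 2) << i for i, c in enumerate(counts))

def attach_bits(s, k, d):
    """Same adjustment on the packed form `s` of bit-length `k`, using one addition."""
    t = s + (1 << d)
    overflowed = (t >> k) != 0
    return t & ((1 << k) - 1), (k + 1 if overflowed else k), overflowed

new_counts, overflowed = attach_explicit([3, 3, 2], 0)
```

In this model the carry chain of the addition is exactly the chain of node splits, which is why the final child counts along the spine can be fixed from the top down before any node below has been touched.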