Parallel Write-Efficient Algorithms and Data Structures for Computational Geometry

by   Guy E. Blelloch, et al.
Carnegie Mellon University

In this paper, we design parallel write-efficient geometric algorithms that perform asymptotically fewer writes than standard algorithms for the same problem. This is motivated by emerging non-volatile memory technologies with read performance being close to that of random access memory but writes being significantly more expensive in terms of energy and latency. We design algorithms for planar Delaunay triangulation, k-d trees, and static and dynamic augmented trees. Our algorithms are designed in the recently introduced Asymmetric Nested-Parallel Model, which captures the parallel setting in which there is a small symmetric memory where reads and writes are unit cost as well as a large asymmetric memory where writes are ω times more expensive than reads. In designing these algorithms, we introduce several techniques for obtaining write-efficiency, including DAG tracing, prefix doubling, and α-labeling, which we believe will be useful for designing other parallel write-efficient algorithms.



There are no comments yet.


page 1

page 2

page 3

page 4


Algorithmic Building Blocks for Asymmetric Memories

The future of main memory appears to lie in the direction of new non-vol...

"LOADS of Space": Local Order Agnosticism and Bit Flip Efficient Data Structure Codes

Algorithms, data structures, coding techniques, and other methods that r...

Semi-Asymmetric Parallel Graph Algorithms for NVRAMs

Emerging non-volatile main memory (NVRAM) technologies provide novel fea...

A Software-based NVM Emulator Supporting Read/Write Asymmetric Latencies

Non-volatile memory (NVM) is a promising technology for low-energy and h...

Latency-Redundancy Tradeoff in Distributed Read-Write Systems

Data is replicated and stored redundantly over multiple servers for avai...

Bandwidth-Optimal Random Shuffling for GPUs

Linear-time algorithms that are traditionally used to shuffle data on CP...

Virtual Coset Coding for Encrypted Non-Volatile Memories with Multi-Level Cells

PCM is a popular backing memory for DRAM main memory in tiered memory sy...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In this paper, we design a set of techniques and parallel algorithms to reduce the number of writes to memory as compared to traditional algorithms. This is motivated by the recent trends in computer memory technologies that promise byte-addressability, good read latencies, significantly lower energy and higher density (bits per area) compared to DRAM. However, one characteristic of these memories is that reading from memory is significantly cheaper than writing to it. Based on projections in the literature, the asymmetry is between 5–40 in terms of latency, bandwidth, or energy. Roughly speaking, the reason for this asymmetry is that writing to memory requires a change to the state of the material, while reading only requires detecting the current state. This trend poses the interesting question of how to design algorithms that are more efficient than traditional algorithms in the presence of read-write asymmetry.

There has been recent research studying models and algorithms that account for asymmetry in read and write costs [8, 9, 14, 15, 21, 22, 28, 29, 39, 47, 57, 58]. Blelloch et al. [9, 14, 15] propose models in which writes to the asymmetric memory cost and all other operations are unit cost. The Asymmetric RAM model [9] has a small symmetric memory (a cache) that can be used to hold temporary values and reduce the number of writes to the large asymmetric memory. The Asymmetric NP (Nested Parallel) model [15] is the corresponding parallel extension that allows an algorithm to be scheduled efficiently in parallel, and is the model that we use in this paper to analyze our algorithms.

Write-efficient parallel algorithms have been studied for many classes of problems including graphs, linear algebra, and dynamic programming. However, parallel write-efficient geometric algorithms have only been developed for the 2D convex hull problem [9]. Achieving parallelism (polylogarithmic depth) and optimal write-efficiency simultaneously seems generally hard for many algorithms and data structures in computational geometry. Here, optimal write-efficiency means that the number of writes that the algorithm or data structure construction performs is asymptotically equal to the output size. In this paper, we propose two general frameworks and show how they can be used to design algorithms and data structures from geometry with high parallelism as well as optimal write-efficiency.

The first framework is designed for randomized incremental algorithms [23, 50, 44]. Randomized incremental algorithms are relatively easy to implement in practice, and the challenge is in simultaneously achieving high parallelism and write-efficiency. Our framework consists of two components: a DAG-tracing algorithm and a prefix doubling technique. We can obtain parallel write-efficient randomized incremental algorithms by applying both techniques together. The write-efficiency is from the DAG-tracing algorithm, that given a current configuration of a set of objects and a new object, finds the part of the configuration that “conflicts” with the new object. Finding objects in a configuration of size requires reads but only writes. Once the conflicts have been found, previous parallel incremental algorithms (e.g. [16]) can be used to resolve the conflicts among objects taking linear reads and writes. This allows for a prefix doubling approach in which the number of objects inserted in each round is doubled until all objects are inserted.

Using this framework, we obtain parallel write-efficient algorithms for comparison sort, planar Delaunay triangulation, and -d trees, all requiring optimal work, linear writes, and polylogarithmic depth. The most interesting result is for Delaunay triangulation (DT). Although DT can be solved in optimal time and linear writes sequentially using the plane sweep method [15], previous parallel DT algorithms seem hard to make write-efficient. Most are based on divide-and-conquer, and seem to inherently require writes. Here we use recent results on parallel randomized incremental DT [16] and apply the above mentioned approach. For comparison sort, our new algorithm is stand-alone (i.e., not based on other complicated algorithms like Cole’s mergesort [24, 14]). For -d trees, we introduce the -batched incremental construction technique that maintains the balance of the tree while asymptotically reducing the number of writes.

The second framework is designed for augmented trees, including interval trees, range trees, and priority search trees. Our goal is to achieve write-efficiency for both the initial construction as well as future dynamic updates. The framework consists of two techniques. The first technique is to decouple the tree construction from sorting, and introduce parallel algorithms to construct the trees in linear reads and writes after the objects are sorted (the sorting can be done with linear writes [14]). Such algorithms provide write-efficient constructions of these data structures, but can also be applied in the rebalancing scheme for dynamic updates—once a subtree is unbalanced, we reconstruct it. The second technique is -labeling. We subselect some tree nodes as critical nodes, and maintain part of the augmentation only on these nodes. By doing so, we can limit the number of tree nodes that need to be written on each update, at the cost of having to read more nodes.222At a very high level, the -labeling is similar to the weight-balanced B-tree (WBB tree) proposed by Arge et al. [5, 6], but there are many differences and we discuss them in Section 7.

Using this framework, we obtain efficient augmented trees in the asymmetric setting. In particular, we can construct the trees in optimal work and writes, and polylogarithmic depth. For dynamic updates, we provide a trade-off between performing extra reads in queries and updates, while doing fewer writes on updates. Standard algorithms use reads and writes per update ( reads on a 2D range tree). We can reduce the number of writes by a factor of for , at a cost of increasing reads by at most a factor of in the worst case. For example, when the number of queries and updates are about equal, we can improve the work by a factor of , which is significant given that the update and query costs are only logarithmic.

The contributions of this paper are new parallel write-efficient algorithms for comparison sorting, planar Delaunay triangulation, -d trees, and static and dynamic augmented trees (including interval trees, range trees and priority search trees). We introduce two general frameworks to design such algorithms, which we believe will be useful for designing other parallel write-efficient algorithms.

The first challenge is the complicatedness of many geometry algorithms, especially the parallel ones. For instance, one of the simplest parallel algorithm for 2D Delaunay triangulation (DT) seems to be the incremental DT algorithm [16], although the analysis of that is already sophisticated. An version that uses linear writes instead of writes is non-trivial due to the additional complexity by write-efficiency. The second challenge is the varied combinations of problems and settings. Taking the augmented trees for geometry problems as an example, there exist tens of various forms specifically designed for different queries and settings (e.g., online or offline). Hence, it is possible to optimize the writes for each specific case, but that might not be the most efficient solution.

2 Preliminaries

2.1 Computation Models

Nested-parallel model The algorithms in this paper is based on the nested-parallel model where a computation starts and ends with a single root task. Each task has a constant number of registers, and runs a standard instruction set from a random access machine, except it has one additional instruction called FORK. The FORK instruction takes an integer and creates child tasks, which can run in parallel. Child tasks get a copy of the parent’s register values, with one special register getting an integer from to indicating which child it is. The parent task suspends until all its children finish at which point it continues with the registers in the same state as when it suspended, except the program counter advanced by one. In this paper we consider the computation that has binary branching (i.e., ). In the model, a computation can be viewed as a (series-parallel) DAG in the standard way. We assume every instruction has a weight (cost). The work () is the sum of the weights of the instructions, and the 0pt () is the longest (unweighted) path in this DAG.

Asymmetric NP (Nested Parallel) model We use the Asymmetric NP (Nested Parallel) model [9], which is the asymmetric version of the nested-parallel model, to measure the cost of an algorithm in this paper. The memory in the Asymmetric NP model consists of (i) an infinitely large asymmetric memory (referred to as large-memory) accessible to all processors and (ii) a small private symmetric memory (small-memory) accessible only to one processor. The cost of writing to large memory is , and all other operations have unit cost. The size of the small-memory is measured in words. In this paper, we assume the small memory can store a logarithmic number of words, unless specified otherwise. A more precise and detailed definition of the Asymmetric NP model is given in [32].

The work of a computation is the sum of the costs of the operations in the DAG, which is similar to the symmetric version but just has extra charges for writes. The 0pt is still the longest unweighted path in the DAG. Under mild assumptions, a work-stealing scheduler can execute an algorithm with work and 0pt in expected time on a round-synchronous PRAM with processors [9]. We assume concurrent-read, and concurrent-writes use priority-writes to resolve conflicts. In our algorithm descriptions, the number of writes refers only to the writes to the large-memory, and does not include writes to the small-memory. All reads and writes are to words of size -bits for an input size of .

2.2 Write-Efficient Geometric Algorithms

Sorting and searching are widely used in geometry applications. Sorting requires work and depth [14]. Red-black trees with appropriate rebalancing rules require amortized work per update (insertion or deletion) [56].

These building blocks facilitate many classic geometric algorithms. The planar convex-hull problem can be solved by first sorting the points by coordinates and then using Graham’s scan that requires work [26]. This scan step can be parallelized with 0pt [31]. The output-sensitive version uses work and depth where is the number of points on the hull [9].

One-dimension stabbing query and the interval tree.
Two-dimension range query and the range tree.

Given a set of points where , the three-sided 2D range query ask for the list of (or weighted-sum, count, etc.) of points with x-coordinate in range and , and y-coordinate in range and .

This query can be answered with the range tree, which is a two-level tree structure. (…) In particular it can be represented by a two-level augmented map as:

( , , , , , , )
( , , , , singleton, , )

What is ? Why is it in the outer tree as well as the inner tree? When both level are implemented by augmented trees, this two-level map structure becomes a range tree.

Three-sided range query and the priority tree.

Given a set of points where , the three-sided 2D range query ask for the list of (or weighted-sum, count, etc.) of points with x-coordinate in range and , and y-coordinate no less than . We usually call the y-coordinate the priority of that point.

This query can be answered with the priority search tree [43], which costs work (and writes) then work is in constructing a tree of size , and work per query (with output size ) work is . The root of the priority tree stores the point in with the highest priority. Then all the other points in are evenly split into two parts by the median of their x-coordinate, which then recursively form the left and right subtree. In each level of construction we scan all points to find the point with the highest y-coordinate, remove it, write the left and right part into new arrays, and recursive build the subtrees. In total it costs work and writes per level. This guarantees the tree to be a complete binary tree. With a straight-forward divide-and-conquer technique, the construction can be parallelized with depth . This data structure is completely static.

3 General Techniques for Incremental Algorithms

In this section, we first introduce our framework for randomized incremental algorithms. Our goal is to have a systematic approach for designing geometric algorithms that are highly parallel and write-efficient.

Our observation is that it should be possible to make randomized incremental algorithms write-efficient since each newly added object on expectation only conflicts with a small region of the current configuration. For instance, in planar Delaunay triangulation, when a randomly chosen point is inserted, the expected number of encroached triangles is . Therefore, resolving such conflicts only makes minor modifications to the configuration during the randomized incremental constructions, leading to algorithms using fewer writes. The challenges are in finding the conflicted region of each newly added object write-efficiently and work-efficiently, and in adding multiple objects into the configuration in parallel without affecting write-efficiency. We will discuss the general techniques to tackle these challenges based on the history graph [36, 19], and then discuss how to apply them to develop parallel write-efficient algorithms for comparison sorting in Section 4, planar Delaunay triangulation in Section 5, and -d tree construction in Section 6.

3.1 DAG Tracing

We now discuss how to find the conflict set of each newly added object (i.e., only output the conflict primitives) based on a history (directed acyclic) graph [36, 19] in a parallel and write-efficient fashion. Since the history graphs for different randomized incremental algorithms can vary, we abstract the process as a DAG tracing problem that finds the conflict primitives in each step by following the history graph.

Definition 3.1 (DAG tracing problem).

The DAG tracing problem takes an element , a DAG , a root vertex with zero in-degree, and a boolean predicate function . It computes the vertex set .

We call a vertex visible if is true.

Definition 3.2 (tracable property).

We say that the DAG tracing problem has the tracable property when is visible only if there exists at least one direct predecessor vertex of that is visible.

Variable Description
the length of the longest path in
the set of all visible vertices in
the output set of vertices

The DAG tracing problem can be solved in work, depth and writes when the problem has the tracable property, each vertex has a constant degree, can be evaluated in constant time, and the small-memory has size . Here , , and are defined in the previous table.


We first discuss a sequential algorithm using work and writes. Because of the tracable property, we can use an arbitrary search algorithm to visit the visible nodes, which requires writes since we need to mark whether a vertex is visited or not. However, this approach is not write-efficient when , and we now propose a better solution.

Assume that we give a global ordering of the vertices in (e.g., using the vertex labels) and use the following rule to traverse the visible nodes based on this ordering: a visible node is visited during the search of its direct visible predecessor that has the highest priority among all visible direct predecessors of . Based on this rule, we do not need to store all visited vertices. Instead, when we visit a vertex via a directed edge from , we can check if has the highest priority among all visible predecessors of . This checking has constant cost since has a constant degree and we assume the visibility of a vertex can be verified in constant time. As long as we have a small-memory of size that keeps the recursion stack and each vertex in has a constant in-degree, we can find the output set using work and writes.

We note that the search tree generated under this rule is unique and deterministic. Therefore, this observation allows us to traverse the tree in parallel and in a fork-join manner: we can simultaneously fork off an independent task for each outgoing edges of the current vertex, and all these tasks can be run independently and in parallel. The parallel depth, in this case, is upper bounded by , the depth of the longest path in the graph. ∎

Here we assume the graph is explicitly stored and accessible, so we slightly modify the algorithms to generate the history graph, which is straightforward in all cases in this paper.

3.2 The Prefix-Doubling Approach

The sequential version of randomized incremental algorithms process one object (e.g., a point or vertex) in one iteration. The prefix-doubling approach splits an algorithm into multiple rounds, with the first round processing one iteration and each subsequent round doubling the number of iterations processed. This high-level idea is widely used in parallel algorithm design. We show that the prefix-doubling approach combined with the DAG tracing algorithm can reduce the number of writes by a factor of in a number of algorithms. In particular, our variant of prefix doubling first processes iterations using a standard write-inefficient approach (called as the initial round). Then the algorithm runs incremental rounds, where the ’th round processes the next iterations.

The -batched incremental construction The -batched incremental construction is a variant of the classic incremental construction when the iterative independence graph is a tree. In the classic version, each primitive is directly inserted into the current configuration in turn. However, in the -batched version, each primitive finds the leaf node it belongs to, but it leaves itself in the node, instead of directly adding to the configuration. We start to process a leaf node once it contains primitives (here we assume that the small-memory can hold all primitives). After all primitives are added, we finish the computation in each node and generate the final output.

4 Comparison Sort

We discuss a write-efficient version of incremental sort introduced in [16], which can also be used in many geometry problems and algorithms.

The first algorithm that we consider is sorting by incrementally inserting into a binary search tree (BST) with no rebalancing. Algorithm 1 gives pseudocode that works either sequentially or in parallel. In the parallel version, the for loop is a parallel for, such that each vertex tries to add itself to the tree in every round. When there are multiple assignments on Line 1 to the same location, the smallest value gets written using a priority-write.

Input: A sequence of keys.
Output: A binary search tree over the keys in .
// reads indirectly through the pointer .
// The check on Line 1 is only needed for the parallel version.
1 Root a pointer to a new empty location
2 for  to  do
3       newNode()
4       Root
5       while true do
6             if  null  then
7                   write into the location pointed to by
8                   if   then
9                        break
10            if .key .key   then
11                   pointer to .left
12            else
13                   pointer to .right
return Root
Algorithm 1 incrementalSort [16]

Blelloch et al. [16] showed that the parallel version of incrementalSort generates the same tree as the sequential version, and for a random order of keys runs in work and

depth with high probability

333We say with high probability () to indicate with probability at least . on a priority-write CRCW PRAM. The depth bound increases to when only binary forking is allowed. The key observation is that insertion of keys into this tree in random order has the longest dependence chain to be (i.e., the tree has depth ). However, this algorithm yields writes as each element can execute the while loop on Lines 5–13 times , with each iteration incurring a write. We discuss how the DAG-tracing algorithm and prefix doubling in Section 3 reduce the number of writes in this algorithm.

Linear-write and -depth incremental sort We discuss a linear-write parallel sorting algorithm based on the prefix-doubling approach. The initial round constructs the search tree for the first elements using Algorithm 1. For the ’th incremental round where , we add the next elements into the search tree. In an incremental round, instead of directly running Algorithm 1, we first find the correct position of each element to be inserted (i.e., to reach line 1 in Algorithm 1). This step can be implemented using the DAG tracing algorithm, and in this case the DAG is just the search tree constructed in the previous round. The root vertex is the tree root, and returns true iff the the search of element visit the node . Note that the DAG is actually a rooted tree, and also each element only visits one tree node in each level and ends up in one leaf node (stored in ), which means this step requires work, depth , and writes in the ’th round.

After each element finds the empty leaf node that it belongs to, in the second step in this round we then run Algorithm 1, but using the pointer that was computed in the first step. We refer to the elements in the same empty leaf node as belonging to the same bucket. Notice that the depth of this step in one incremental round is upper bounded by the depth of a random binary search tree which is , so this algorithm has depth : rounds, and in each round there are levels.

We now analyze the expected number of writes of in the second step. In each incremental round the number of elements inserted is the same as the number of elements already in the tree. Hence it is equivalent to randomly throwing balls into bins, where is the number of elements to be inserted in this incremental round. Assume that the adversary picks the relative priorities of the elements within each bin, so that it takes work and to sort elements within each bucket in the worst case. We can show that the probability that there are elements in a bucket is , and when , for some constant and . The expected number of writes within each bucket in this incremental round is therefore:

Hence the overall number of writes is also linear. Algorithm 1 sorts elements in a bucket with depth, and the number of balls in each bin is , so the depth in this step is included in the depth analysis in the previous paragraph. Combining the work and depth gives the following lemma.

Lemma 4.1.

incrementalSort for a random order of keys runs in expected work and depth on Asymmetric NP model with priority-write.

Improving the depth to We can improve the depth to as follows. Notice that for and ,

This indicates that only a small fraction of the buckets in each incremental round are not finished after iterations of the while-loop on Line 1 of Algorithm 1.

In the depth-improved version of the algorithm, the while-loop terminates after iterations, and postpones these insertions (and all further insertions into this subtree in future rounds) to a final round. The final round simply runs another round of Algorithm 2 and inserts all uninserted elements (not write-efficiently). Clearly the depth of the last round is , since it is upper bounded by the depth of running Algorithm 1 for all elements. The depth of the whole algorithm is therefore .

We now analyze the number of writes in the final round. The probability that a bucket in any round does not finish is , and pessimistically there are in total of such buckets. We also know that using Chernoff bound the maximum size of a bucket after the first round is . The number of writes in the last round is upper bounded by the overall number of uninserted elements times the tree depth, which is by setting and appropriately large. This leads to the main theorem.

incrementalSort for a random order of keys runs in expected work and depth on Asymmetric NP model with priority-write.

Note that this gives a much simpler work/write-optimal logarithmic-depth algorithm for comparison sorting than the write-optimal parallel sorting algorithm in [14] that is based on Cole’s mergesort [24], although our algorithm is randomized and requires priority-writes.

5 Planar Delaunay Triangulation

A Delaunay triangulation (DT) in the plane is a triangulation of a set of points such that no point in is inside the circumcircle of any triangle (the circle defined by the triangle’s three corner points). We say a point encroaches on a triangle if it is in the triangle’s circumcircle, so the triangle will be replaced once this point is added to the triangulation. We assume for simplicity that the points are in general position (no three points on a line or four points on a circle).

Delaunay triangulation is widely studied due to its importance in many geometry applications. Without considering the asymmetry between reads and writes, it can be solved sequentially in optimal work. It is relatively easy to generate a sequential write-efficient version that does reads and only requires writes based on the plane sweep method [15]. There are several work-efficient parallel algorithms that run in polylogarithmic depth [49, 7, 17, 16]. More practical ones (e.g., [36, 19]) have linear depth. Unfortunately, none of them perform any less than writes. In particular the divide-and-conquer algorithms [7, 17] seem to inherently require writes since the divide or merge step requires generating an output of size , and is applied for levels. The randomized incremental approach of Blelloch et al. (BGSS) [16], which improves the Boissonnat and Teillaud algorithm [19] to polylogarithmic depth, also requires writes for reasons described below.

In this section, we show how to modify the BGSS algorithm to use only a linear number of writes, while maintaining the expected bound on work, and polylogarithmic depth. Algorithm 2 shows the pseudocode for the BGSS algorithm. In the algorithm, the vertices are labeled from to and when taking a over vertices (Lines 7–2) it is with respect to these labels. The algorithm proceeds in rounds the algorithm adds some triangles (Line 2) and removes others (Line 2) in each round.

In the algorithm, there are dependences between triangles so that some of them need to be processed before the other triangles can proceed. For a sequence of points , BGSS define the dependence graph for the algorithm in the following way. The vertices correspond to triangles created by the algorithm, and for each call to ReplaceTriangle, we place an arc from triangle and its three neighbors (, and ) to each of the one, two, or three triangles created by ReplaceTriangle. Every triangle with depth in is created by the algorithm in round . BGSS show that for a randomly ordered set of input points of size , the depth of the dependence graph is 444We say with high probability () to indicate with probability to be at least for any constant ., and hence the algorithm runs in rounds . Each round can be done in depth giving an overall depth of on the nested-parallel model.

The algorithm, however, is not write-efficient. In particular, every point moves down the DAG through the rounds (on line 2), and therefore can be moved times, each requiring a write.

Input: A sequence of points in the plane.
Output: The Delaunay triangulation of .
Maintains: , the points that encroach on each triangle .
1 a sufficiently large bounding triangle
4 while  for any  do
5       parallel foreach triangle  do
6             Let be the three neighboring triangles
7             if  then
8                   ReplaceTriangle
9return function ReplaceTriangle(,,)
10       foreach edge (three of them) do
11             if  is a boundary of ’s encroached region then
12                   the other triangle sharing
Algorithm 2 ParIncrementalDT

A Linear-Write Version We now discuss a write-efficient version of the BGSS algorithm. We use the DAG tracing and prefix-doubling techniques introduced in Section 3. The algorithm first computes the DT of the earliest points in the randomized order, using the non-write-efficient version. This step requires linear writes. It then runs incremental rounds and in each round adds a number of points equal to the number of points already inserted.

To insert points, we need to construct a search structure in the DAG tracing problem. We can modify the BGSS algorithm to build such a structure. In fact, the structure is effectively a subset of the edges of the dependence graph . In particular, in the algorithm the only inCircle test is on Line 2. In this test, to determine if a point encroaches , we need only check its two ancestors and (we need not also check the two other triangles neighboring , as needed in ). This leads to a DAG with depth at most as large as , and for which every vertex has in-degree 2. The out-degree is not necessarily constant. However, by noting that there can be at most a constant number of outgoing edges to each level of the DAG, we can easily transform it to a DAG with constant out-degree by creating a copy of a triangle at each level after it has out-neighbors. This does not increase the depth, and the number of copies is at most proportional to the number of initial triangles ( in expectation) since the in-degrees are constant. We refer to this as the tracing structure. An example of this structure is shown in Figure 1.

The tracing structure can be used in the DAG tracing problem (Definition 3.1) using the predicate . This predicate has the traceable property since a point can only be added to a triangle (i.e., encroaches on the triangle) if it encroached one of the two input edges from and . We can therefore use the DAG tracing algorithm to find all of the triangles encroached on by a given point starting at the initial root triangle .

Figure 1: An example of the tracing structure. Here a point is added and the encroaching region contains triangles and (subfigure (a)). Four new triangles will be generated and replace the two previous triangles. They may or may not be created in the same round, and in this example this is done in two substeps (subfigures (b) and (c)). Part of the tracing structure is shown in subfigure (d). Four neighbor triangles , , , and are copied, and four new triangles are created. An arrow indicates that a point is encroached by the head triangle only if it is encroached by the tail triangle.

We first construct the DT of the first points in the initial round using Algorithm 2 while building the tracing structure. Then at the beginning of each incremental round, each point traces down the structure to find its encroached triangles, and leaves itself in the encroached set of that triangle. Note that the encroached set for a given point might be large, but the average size across points is constant in expectation.

We now analyze the cost of finding all the encroached triangles when adding a set of new points. As discussed, the depth of is upper bounded by . The number of encroached triangles of a point can be analyzed by considering the degree of the point (number of incident triangles) if added to the DT. By Euler’s formula, the average degree of a node in a planar graph is at most . Since we add the points in a random order, the expected value of in Theorem 3.1 is constant. Finally, the number of all encroached (including non-leaf) triangles of this point is upper bounded by the number of inCircle tests. Then , the expected number of visible vertices of , is (Theorem 4.2 in [16]).

After finding the encroached triangles for each point being added, we need to collect them together to add them to the triangle. This step can be done in parallel with a semisort, which takes linear expected work (writes) and depth  [34], where is the number of inserted points in this round. Combining these results leads to the following lemma.

Lemma 5.1.

Given points in the plane and a tracing structure generated by Algorithm 2 on a randomly selected subset of points, computing for each triangle in the points that encroach it among the remaining points takes work ( writes) and depth in the Asymmetric NP model.

The idea of the algorithm is to keep doubling the size of the set that we add (i.e., prefix doubling). Each round applies Algorithm 2 to insert the points and build a tracing structure, and then the DAG tracing algorithm to locate the points for the next round. The depth of each round is upper bounded by the overall depth of the DAG on all points, which is , where is the original size. We obtain the following theorem.

Planar Delaunay triangulation can be computed using work (i.e., writes) in expectation and depth on the Asymmetric NP model with priority-writes.


The original Algorithm 2 in [16] has depth . In the prefix-doubling approach, the depth of each round is no more than , and the algorithm has rounds. The overall depth is hence depth .

The work bound consists of the costs from the initial round, and the incremental rounds. The initial round computes the triangulation of the first points, using at most inCircle tests, writes and work. For the incremental rounds, we have two components, one for locating encroached triangles in the tracing structure, and one for applying Algorithm 2 on those points to build the next tracing structure. The first part is handled by Lemma 5.1. For the second part we we can apply a similar analysis to Theorem 4.2 of [16]. In particular, the probability that there is a dependence from a triangle in the ’th point (in the random order) to a triangle added by a later point at location in the ordering is upper bounded by . Summing across all points in the second half (we have already resolved the first half) gives:

This is a bound on both the number of reads and the number of writes. Since the points added in each round doubles, the cost is dominated by the last round, which is reads and writes, both in expectation. Combined with the cost of the initial round gives the stated bounds. ∎

6 Space-Partitioning Data Structures

Space partitioning divides a space into non-overlapping regions.555The other type of partitioning is object partitioning that subdivides the set of objects directly (e.g., R-tree [37, 41], bounding volume hierarchies [33, 59]). This process is usually applied repeatedly until the number of objects in a region is small enough, so that we can afford to answer a query in linear work within the region. We refer to the tree structure used to represent the partitioning as the space-partitioning tree. Commonly-used space-partitioning trees include binary space partitioning trees, quad/oct-trees, -d trees, and their variants, and are widely used in computational geometry [26, 38], computer graphics [3], integrated circuit design, learning theory, etc.

In this section, we propose write-efficient construction and update algorithms for -d trees [10]. We discuss how to support dynamic updates write-efficiently in Section 6.2, and we discuss how to apply our technique to other space-partitioning trees in Section 6.3.

6.1 -d Tree Construction and Queries

-d trees have many variants that facilitate different queries. We start with the most standard applications on range queries and nearest neighbor queries, and discussions for other queries are in Section 6.3. A range query can be answered in worst-case work, and an approximate -nearest neighbor (ANN) query requires work assuming bounded aspect ratio,666The largest aspect ratio of a tree node on any two dimensions is bounded by a constant, which is satisfied by the input instances in most real-world applications. both in -dimensional space. The tree to achieve these bounds can be constructed by always partitioning by the median of all of the objects in the current region either on the longest dimension of the region or cycling among the dimensions. The tree has linear size and depth [26], and can be constructed using reads and writes. We now discuss how to reduce the number of writes to .

One solution is to apply the incremental construction by inserting the objects into a -d tree one by one. This approach requires linear writes,

reads and polylogarithmic depth. However, the splitting hyperplane is no longer based on the median, but the object with the highest priority pre-determined by a random permutation. The expected tree depth can be

for , but to preserve the range query cost we need the tree depth to be (see details in Lemma 6.1). Motivated by the incremental construction, we propose the following variant, called -batched incremental construction, which guarantees both write-efficiency and low tree depth.

The -batched incremental construction The -batched incremental construction is a variant of the classic incremental construction where the dependence graph is a tree. Unlike the classic version, where the splitting hyperplane (splitter) of a tree node is immediately set when inserting the object with the highest priority, in the -batched version, each leaf node will buffer at most objects before it determines the splitter. We say that a leaf node overflows if it holds more than objects in its buffer. We say that a node is generated when created by its parent, and settled after finding the splitters, creating leaves and pushing the objects to the leaves’ buffers.

The algorithm proceeds in rounds, where in each round it first finds the corresponding leaf nodes that the inserted objects belong to, and adds them into the buffers of the leaves. Then it settles all of the overflowed leaves, and starts a new round. An illustration of this algorithm is shown in Figure 2. After all objects are inserted, the algorithm finishes building the subtree of the tree nodes with non-empty buffers recursively. For write-efficiency, we require the small-memory size to be , and the reason will be shown in the cost analysis.

Figure 2: An illustration of one round in the -batched incremental construction for . Subfigure (a) shows the initial state of this round. Then the new objects (shown in orange) are added to the buffers in the leaves, as shown in subfigure (b). Two of the buffers overflow, and so we settle these two leaves as shown in subfigure (c).

We make a partition once we have gathered objects in the corresponding subregion based on the median of these objects. When , the algorithm is the incremental algorithm mentioned above, but the range query cost cannot be preserved. When , the algorithm constructs the same tree as the classic -d tree construction algorithm, but requires more than linear writes unless the small-memory size is , which is impractical when is large. We now try to find the smallest value of that preserves the query cost, and we analyze the cost bounds accordingly.

Range query We use the following lemma to analyze the cost of a standard -d range query (on an axis-aligned hypercube for ).

Lemma 6.1.

A -d range query costs using our -d tree of height .

Proof Sketch.

A -d range query has at most faces that generate half-spaces, and we analyze the query cost of each half-space. Since each axis is partitioned once in every consecutive levels, one side of the partition hyperplane perpendicular to the query face will be either entirely in or out of the associated half-space. We do not need to traverse that subtree (we can either directly report the answer or ignore it). Therefore every levels will expand the search tree by a factor of at most . Thus the query cost is . ∎

Lemma 6.2.

For our -batched -d tree, guarantees the tree height to be no more than whp.


We now consider the -batched incremental construction. Since we are partitioning based on the median of random objects, the hyperplane can be different from the actual median. To get the same cost bound, we want the actual number of objects on the two sides to differ by no more than a factor of . Since we pick random samples, by a Chernoff bound the probability that more than samples are within the first objects is upper bounded by . Hence, the probability that the two subtree weights of a tree node differ by more than a factor of is no more than . This controls the tree depth, and based on the previous analysis we want to have . Namely, we want the tree to have no more than levels to reach the subtrees with less than elements, so the overall tree depth is bounded by . Combining these constraints leads to and . ∎

Lemma 6.2 indicates that setting gives a tree height of , and Lemma 6.1 shows that the corresponding range query cost is , matching the standard range query cost.

ANN query If we assume that the input objects are well-distributed and the -d tree satisfies the bounded aspect ratio, then the cost of a -ANN query is proportional to the tree height. As a result, leads to a query cost of .777Actually the tree depth is even when . However, for write-efficiency, we need to support efficient updates as discussed in Section 6.2 that requires the two subtree sizes to be balanced at every node.

Parallel construction and cost analysis To get parallelism, we use the prefix-doubling approach, starting with objects in the first round. The number of reads of the algorithm is still , since it is lower bounded by the cost of sorting when , and upper bounded by since the modified algorithm makes asymptotically no more comparisons than the classic implementation. We first present the following lemma.

Lemma 6.3.

When a leaf overflows at the end of a round, the number of objects in its buffer is when .

Proof Sketch.

In the previous round, assume objects were in the tree. At that time no more than objects are buffered in this leaf node. Then in the current round another objects are inserted, and by a Chernoff bound, the probability that the number of objects falling into this leaf node is more than is at most . Plugging in proves the lemma. ∎

We now bound the parallel depth of this construction. The initial round runs the standard construction algorithm on the first objects, which requires depth. Then in each of the next incremental rounds, we need to locate leaf nodes and a parallel semisort to put the objects into their buffers. Both steps can be done in depth  [34]. Then we also need to account for the depth of settling the leaves after the incremental rounds. When a leaf overflows, by Lemma 6.3 we need to split a set of objects for each leaf, which has a depth of using the classic approach, and is applied for no more than a constant number of times by Lemma 6.3.

We now analyze the number of writes this algorithm requires. The initial round requires writes as it uses a standard construction algorithm on objects. In the incremental rounds, writes are required for each object to find the leaf node it belongs to and add itself to the buffer using semisorting [34]. From Lemma 6.3, when finding the splitting hyperplane and splitting the object for a tree node, the number of writes required is . Note that after a new leaf node is generated from a split, it contains at least objects. Therefore, after all incremental rounds, the tree contains at most tree nodes, and the overall writes to generate them is . After the incremental rounds finish, we need writes to settle the leaves with non-empty buffers, assuming cache size. In total, the algorithm uses writes .

A -d tree that supports range and ANN queries efficiently can be computed using expected work (i.e., writes) and depth in the Asymmetric NP model. For range query the small-memory size required is .

6.2 -d Tree Dynamic Updates

Unlike many other tree structures, we cannot rotate the tree nodes in -d trees since each tree node represents a subspace instead of just a set of objects. Deletion is simple for -d trees, since we can afford to reconstruct the whole structure from scratch when a constant fraction of the objects in the -d tree have been removed, and before the reconstruction we just mark the deleted node (constant reads and writes per deletion via an unordered map). In total, the amortized cost of each deletion is . For insertions, we discuss two techniques that optimize either the update cost or the query cost.

Logarithmic reconstruction [46] We maintain at most -d trees of sizes that are increasing powers of . When an object is inserted, we create a -d tree of size containing the object. While there are trees of equal size, we flatten them and replace the two trees with a tree of twice the size. This process keeps repeating until there are no trees with the same size. When querying, we search in all (at most ) trees. Using this approach, the number of reads and writes on an insertion is , and on a deletion is . The costs for range queries and ANN queries are and respectively, plus the cost for writing the output.

If we apply our write-efficient -batched version when reconstructing the -d trees, we can reduce the writes (but not reads) by a factor of (i.e., and writes per update).

When using logarithmic reconstruction, querying up to trees can be inefficient in some cases, so here we show an alternative solution that only maintains a single tree.

Single-tree version As discussed in Section 6.1, only the tree height affects the costs for range queries and ANN queries. For range queries, Lemma 6.2 indicates that the tree height should be to guarantee the optimal query cost. To maintain this, we can tolerate an imbalance between the weights of two subtrees by a factor of , and reconstruct the subtree when the imbalance is beyond the constraint. In the worst case, a subtree of size is rebuilt once after insertions into the subtree. Since the reconstructing a subtree of size requires work, each inserted object contributes work to every node on its tree path, and there are such nodes. Hence, the amortized work for an insertion is . For efficient ANN queries, we only need the tree height to be , which can be guaranteed if the imbalance between two subtree sizes is at most a constant multiplicative factor. Using a similar analysis, in this case the amortized work for an insertion is .

6.3 Extension to Other Trees and Queries

In Section 6.1 we discussed the write-efficient algorithm to construct a -d tree that supports range and ANN queries. -d trees are also used in many other queries in real-world applications, such as ray tracing, collision detection for non-deformable objects,

-body simulation, and geometric culling (using BSP trees). The partition criteria in these applications are based on some empirical heuristics (e.g., the surface-area heuristic 

[30]), which generally work well on real-world instances, but usually with no theoretical guarantees.

The -batched incremental construction can be applied to these heuristics, as long as each object contributes linearly to the heuristic. Let us consider the surface-area heuristic [30] as an example, which does an axis-aligned split and minimizes the sum of the two products of the subtree’s surface area and the number of objects. Instead of sorting the coordinates of all objects in this subtree and finding the optimal split point, we can do approximately by splitting when at least of the objects are inserted into a region. When picking a reasonable value of (like or ), we believe the tree quality should be similar to the exact approach (which is a heuristic after all). However, the -batched approach does not apply to heuristics that are not linear in the size of the object set. Such cases happen in the Callahan-Kosaraju algorithm [20] when the region of each -d tree node shrinks to the minimum bounding box, or the object-partitioning data structures (like R-trees or bounding volume hierarchies) where each object can contribute arbitrarily to the heuristic.

7 Augmented Trees

An augmented tree is a tree that keeps extra data on each tree node other than what is used to maintain the balance of this tree. We refer to the extra data on each tree node as the augmentation. In this section, we introduce a framework that gives new algorithms for constructing both static and dynamic augmented trees including interval trees, 2D range trees, and priority search trees that are parallel and write-efficient. Using these data structures we can answer 1D stabbing queries, 2D range queries, and 3-sided queries (defined in Section 7.1). For all three problems, we assume that the query results need to be written to the large-memory. Our results are summarized in Table 1. We improve upon the traditional algorithms in two ways. First, we show how to construct interval trees and priority search trees using instead of writes (since the 2D range tree requires storage we cannot asymptotically reduce the number of writes). Second, we provide a tradeoff between update costs and query costs in the dynamic versions of the data structures. The cost bounds are parameterized by . By setting we achieve the same cost bounds as the traditional algorithms for queries and updates. can be chosen optimally if we know the update-to-query ratio . For interval and priority trees, the optimal value of is . The overall work without considering writing the output can be improved by a factor of . For 2D range trees, the optimal value of is .

Construction Query Update
Classic interval tree
Our interval tree
Classic priority search tree
Our priority search tree
Classic range Tree
Our range tree
Table 1: A summary of the work cost of the data structures discussed in Section 7. In all cases, we assume that the tree contains objects (intervals or points). For interval trees and priority search trees, we can reduce the number of writes in the construction from per element to . For dynamic updates, we can reduce the number of writes per update by a factor of at the cost of increasing the number of reads in update and queries by a factor of for any .

We discuss two techniques in this section that we use to achieve write-efficiency. The first technique is to decouple the tree construction from sorting, and we introduce efficient algorithms to construct interval and priority search trees in linear reads and writes after the input is sorted. Sorting can be done in parallel and write-efficiently (linear writes).Using this approach, the tree structure that we obtain is perfectly balanced.

The second technique that we introduce is the -labeling technique. We mark a subset of tree nodes as critical nodes by a predicate function parameterized by , and only maintain augmentations on these critical nodes. We can then guarantee that every update only modifies nodes, instead of nodes as in the classic algorithms. At a high level, the -labeling is similar to the weight-balanced B-tree (WBB tree) proposed by Arge et al. [5, 6] for the external-memory (EM) model [1]. However, as we discuss in Section 7.3, directly applying the EM algorithms [5, 6, 2, 53, 52] does not give us the desired bounds in our model. Secondly, our underlying tree is still binary. Hence, we mostly need no changes to the algorithmic part that dynamically maintains the augmentation in this trees, but just relax the balancing criteria so the underlying search trees can be less balanced. An extra benefit of our framework is that bulk updates can be supported in a straightforward manner. Such bulk updates seem complicated and less obvious in previous approaches. We propose algorithms on our trees that can support bulk updates write-efficiently and in polylogarithmic depth.

The rest of this section is organized as follows. We first provide the problem definitions and review previous results in Section 7.1. Then in Section 7.2, we introduce our post-sorted construction technique for constructing interval and priority search trees using a linear number of writes. Finally, we introduce the -labeling technique to support a tradeoff in query and update cost for interval trees, priority search trees, and range trees in Section 7.3.

7.1 Preliminaries and Previous Work

We define the weight or size of tree node or a subtree as the number of nodes in this subtree plus one. The “plus one” guarantees that the size of a tree node is always the sum of the sizes of its two children, which simplifies our discussion. This is also the standard balancing criteria used for weight-balanced trees [45].

Interval trees and the 1D stabbing queries An interval tree888There exist multiple versions of interval trees. In this paper, we use the version described in [26]. [26, 27, 42] organizes a set of intervals defined by their left and right endpoints. The key on the root of the interval tree is the median of the endpoints. This median divides all intervals into two categories: those completely on its left/right, which then form the left/right subtrees recursively, and those covering the median, which are stored in the root. The intervals in the root are stored in two lists sorted by the left and right endpoints respectively. In this paper, we use red-black trees to maintain such ordered lists to support dynamic updates and refer to them as the inner trees. In the worst case, the previous construction algorithms scan and copy intervals in levels, leading to reads and writes.

The interval tree can be used to answer a 1D stabbing query: given a set of intervals, report a list of intervals covering the specific query point . This can be done by searching in the tree. Whenever is smaller (larger) than the key of the current node, all intervals in the current tree node with left (right) endpoints smaller than should be reported. This can be done efficiently by scanning the list sorted by left (right) endpoints. The overall query cost is (where is the output size).

OLD VERSION: Given a set of intervals defined by their left and right endpoints, a 1D stabbing query asks for the list of intervals covering the specific query point .

1D stabbing queries can be answered efficiently using interval trees.999There exist multiple versions of interval trees. In this paper we use the version described in [26]. For intervals, the interval tree chooses the median of the endpoints as the key on the root, which is also denoted as the splitter. Each interval either fully lies on one side of the splitter, or overlaps the splitter. For intervals that overlap the splitter, we maintain them in two ordered lists on the node, one sorted by the left endpoint and one sorted by the right endpoint. We then recursively construct an interval tree for the intervals that fully on the left (right) side of the splitter, making it the left (right) child of the root. The base case is reached when no intervals lie completely on one side of the splitter. Note that each interval is stored in exactly one tree node (with two copies in that node). To support dynamic updates, the ordered lists of overlapping intervals can be maintained using balanced binary search trees (BSTs), and we refer to these BSTs as the inner trees. In the worst case, the construction scans and copies intervals in level, leading to reads and writes.

For a stabbing query on , we first compare with the root’s key. If is smaller (larger) than the root’s key, then no intervals in the right (left) subtree can cover . For the intervals stored in the root node, we report all intervals that have their left endpoints to the left of . This can be efficiently supported by scanning the ordered list sorted by the left (right) endpoints (or searching the corresponding inner tree), and the cost is proportional to the output size. We then continue the query in the left (right) subtree recursively. The overall query cost is (where is the output size), since we read tree nodes and write out intervals.

2D Range trees and the 2D range queries The 2D range tree [11] organizes a set of points on the 2D plane. It is a tree structure augmented with an inner tree, or equivalently, a two-level tree structure. The outer tree stores every point sorted by their -coordinate. Each node in the outer tree is augmented with an inner tree structure, which contains all the points in its subtree, sorted by their y-coordinate.

The 2D range tree can be used to answer the 2D range query: given points in the 2D plane, report the list of points with -coordinate between and , and -coordinate between and . Such range queries using range trees can be done by two nested searches on in the outer tree and in at most associated inner trees. Using balanced BSTs for both the inner and outer trees, a range tree can be constructed with reads and writes, and each query takes reads and writes (where is the output size). A range tree requires storage so the number of writes for construction is already optimal.

OLD VERSION Given a set of points on the plane, the 2D range query asks for the list of all points with -coordinate between and , and -coordinate between and .

This query can be answered using the range tree, which is a tree structure augmented with an inner tree, or equivalently, a two-level tree structure. The outer tree stores every point with their -coordinates as the keys. Each node in the outer tree is augmented with an inner tree structure, which contains all the points represented in its subtree, with their -coordinates as the keys. To answer a range query, we conduct nested searches on in the outer tree and in at most associated inner trees. Using balanced binary search trees for both the inner and outer trees, a range tree can be constructed with reads and writes, and each query takes reads and writes (where is the output size). A range tree requires storage so the number of writes for construction is already optimal.

Priority search trees and 3-sided range queries The priority search tree [43, 26] (priority tree for short) contains a set of points each with a coordinate () and a priority (). There are two variants of priority trees, one is a search tree on coordinates that keeps a heap order of the priorities as the augmented values [43, 5]. The other one is a heap of the priorities, where each node is augmented with a splitter between the left and right subtrees on the coordinate dimension [43, 26]. The construction of both variants uses reads and writes as shown in the original papers [43, 26]. For example, consider the second variant. The root of a priority tree stores the point with the highest priority in . All the other points are then evenly split into two sets by the median of their coordinates which recursively form the left and right subtrees. The construction scans and copies points in levels, leading to reads and writes for the construction.

Many previous results on dynamic priority search trees use the first variant because it allows for rotation-based updates. In this paper, we discuss how to construct the second variant allowing reconstruction-based updates, since it is a natural fit for our framework. We also show that bulk updates can be done write-efficiently in this variant. For the rest of this section, we discuss the second variant of the priority tree.

The priority tree can be used to answer the 3-sided queries: given a set of points, report all points with coordinates in the range , and priority higher than . This can be done by traversing the tree, skipping the subtrees whose coordinate range do not overlap , or where the priority in the root is lower than . The cost of each query is for an output of size  [26].

OLD VERSION Given a set of points on the plane the 3-sided range query asks for the list of points with -coordinates in the range , -coordinates in the range . We will refer to the -coordinate of a point as its priority.

The priority search tree [43, 26] (priority tree for short) can answer such queries efficiently. There are two variants of the priority trees, one is a search tree based on -coordinates and augments the tree node as a heap of the priorities of the points [43, 5]. The other variant is a heap of the priorities, and each node is augmented with a splitter on the coordinate dimension, which partitions the other points to the left and right subtrees [43, 26]. The construction of both variants uses reads and writes on points.

Most previous work uses the first variant because it allows for rotation-based updates. In this paper we discuss how to construct the second variant of priority trees and perform updates for it, since it is a natural fit for our framework. In addition, we show that by using the second variant, bulk updates on priority trees write-efficiently. For the rest of this section, we refer to a priority tree as the second variant described above.

The root of a priority tree stores the point with the highest priority in . Then all the other points are evenly split into two sets by the median of their -coordinates, and they recursively form the left and right subtrees. The construction algorithm scans all points to find the root, removes the corresponding point, finds the median of the -coordinates of the remaining points, splits them into two sets, and recursively build the left and right subtrees. The tree contains levels, and we read and write each point at most once per level. In total we need reads and writes for the construction.

To answer a 3-sided query, we traverse the tree, skipping the subtrees whose -coordinate range do not intersect with , or whose priority is lower than . The cost of each query is for an output of size  [26].

-WBB Tree.

(..) We use this technique in the pre-sorted interval tree, pre-sorted priority search tree and the range tree. It can reduce the number of writes per update at the cost of increasing the query time, hence is more applicable in the scenario where updates is significantly more frequent than queries. Due to page limit we only explain in details for the dynamic range tree as an example.

We summarize the cost of the previous work and our results in Table 1.

Augmented Tree.

The augmented tree structure is a search tree structure in which each node is associated with some extra information, probably reflecting the property of the subtree rooted at it. We note that the augmented value may depend on not only the entries in ’s subtree, but also ’s ancestors, siblings, etc. This may extend the definition of the augmented tree in most textbooks, but it is crucial for at least one of our applications (e.g., the interval tree and the balanced priority search tree). In an augmented tree , Assume we store in each node an entry , ordered by some comparison function , an augmented tree structure will also keep track of some quantity , called the augmented value of , which can be computed using an augmenting function . We note that may depend on not only the entries in ’s subtree, but also ’s ancestors, siblings, etc. This may extend the definition of the augmented tree in most textbooks, but it is crucial for at least one of our applications (e.g., the priority tournament tree). We accordingly represent an augmented tree structure using . We also use to represent a non-augmented tree.

Persistent tree using path-copying.

When implemented with path-copying, the tree structure can also be fully persistent (or functional), which means that any updates will not modify existing tree but create a new version, with no extra work asymptotically.

7.2 The Post-Sorted Construction

For interval trees and priority search trees, the standard construction algorithms [26, 25, 27, 42, 43] require reads and writes, even though the output is only of linear size. This section describes algorithms for constructing them in an optimal linear number of writes. Both algorithms first sort the input elements by their -coordinate in work and depth using the write-efficient comparison sort described in Section 4. We now describe how to build the trees in reads and writes given the sorted input. For a range tree, since the standard tree has size, the classic construction algorithm is already optimal.

Interval Tree After we sort all coordinates of the endpoints, we can first build a perfectly-balanced binary search tree on the endpoints using reads and writes and depth. We now consider how to construct the inner tree of each tree node.

We create a lowest common ancestor (LCA) data structure on the keys of the tree nodes that allows for constant time queries. This can be constructed in reads/writes and depth [12, 40]. Each interval can then find the tree node that it belongs to using an LCA query on its two endpoints. We then use a radix sort on the intervals. The key of an interval is a pair with the first value being the index of the tree node that the interval belongs to, and the second value being the index of the left endpoint in the pre-sorted order. The sorted result gives the interval list for each tree node sorted by left endpoints. We do the same for the right endpoints. This step takes reads/writes overall. Finally, we can construct the inner trees from the sorted intervals in reads/writes across all tree nodes.

Parallelism is straightforward for all steps except for the radix sort. The number of possible keys can be , and it is not known how to radix sort keys from such a range work-efficiently and in polylogarithmic depth. However, we can sort a range of in expected work and depth  [48]. Hence our goal is to limit the first value into a range. We note that given the left endpoint of an interval, there are only possible locations for the tree node (on the tree path) of this interval. Therefore instead of using the tree node as the first value of the key, we use the level of the tree node, which is in the range . By radix sorting these pairs, so we have the sorted intervals (based on left or right endpoint) for each level. We observe that the intervals of each tree node are consecutive in the sorted interval list per level. This is because for tree nodes and on the same level where is to the left of , the endpoints of ’s intervals must all be to the left of ’s intervals. Therefore, in parallel we can find the first and the last intervals of each node in the sorted list, and construct the inner tree of each node. Since the intervals are already sorted based on the endpoints, we can build inner trees in reads and writes and depth [13].

Priority Tree In the original priority tree construction algorithm, points are recursively split into sub-problems based on the median at each node of the tree. This requires writes at each level of the tree if we explicitly copy the nodes and pack out the root node that is removed. To avoid explicit copying, since the points are already pre-sorted, our write-efficient construction algorithm passes indices determining the range of points belonging to a sub-problem instead of actually passing the points themselves. To avoid packing, we simply mark the position of the removed point in the list as invalid, leaving a hole, and keep track of the number of valid points in each sub-problem.

Our recursive construction algorithm works as follows. For a tree node, we know the range of the points it represents, as well as the number of valid points . We then pick the valid point with the highest priority as the root, mark the point as invalid, find the median among the valid points, and pass the ranges based on the median and number of valid points (either or ) to the left and right sub-trees, which are recursively constructed. The base case is when there is only one valid point remaining, or when the number of holes is more than the valid points. Since each node in the tree can only cause one hole, for every range corresponding to a node, there are at most holes. Since the size of the small-memory is , when the number of valid points is fewer than the number of holes, we can simply load all of the valid points into the small-memory and construct the sub-tree.

To efficiently implement this algorithm, we need to support three queries on the input list: finding the root, finding the -th element in a range (e.g., the median), and deleting an element. All queries and updates can be supported using a standard tournament tree where each interior node maintains the minimum element and the number of valid nodes within the subtree. With a careful analysis, all queries and updates throughout the construction require linear reads/writes overall. The details are provided in Appendix A.

The parallel depth is —the bottleneck lies in removing the points. There are levels in the priority tree and it costs writes for removing elements from the tournament tree on each level. For the base cases, it takes linear writes overall to load the points into the small-memory and linear writes to generate all tree nodes. The depth is .

We summarize our result in this section in Theorem 7.2. An interval tree or a priority search tree can be constructed with pre-sorted input in expected work and depth on the Asymmetric NP model.

7.3 Dynamic Updates using Reconstruction-Based Rebalancing

Dynamic updates (insertions and deletions) are often supported on augmented trees [26, 25, 27, 42, 43] and the goal of this section is to support updates write-efficiently, at the cost of performing extra reads to reduce the overall work. Traditionally, an insertion or deletion costs for interval trees and priority search trees, and for range trees. In the asymmetric setting, the work is multiplied by . To reduce the overall work, we introduce an approach to select a subset of tree nodes as critical nodes, and only update the balance information of those nodes (the augmentations are mostly unaffected). The selection of these critical nodes are done by the -labeling introduced in Section 7.3.1. Roughly speaking, for each tree path from the root to a leaf node, we have critical nodes marked such that the subtree weights of two consecutive marked nodes differ by about a factor of . By doing so, we only need to update the balancing information in the critical nodes, leading to fewer tree nodes modified in an update.

Arge et al. [5, 6] use a similar strategy to support dynamic updates on augmented trees in the external-memory (EM) model, in which a block of data can be transferred in unit cost [1]. They use a -tree instead of a binary tree, which leads to a shallower tree structure and fewer memory accesses in the EM model. However, in the Asymmetric NP model, modifying a block of data requires work proportional to the block size, and directly using their approach cannot reduce the overall work. Inspired by their approach, we propose a simple approach to reduce the work of updates for the Asymmetric NP model.

this paragraph is mostly newly-written, trying explain the reconstruction and alpha-labeling at a high level. The main component of our approach is reconstruction-based rebalancing using the -labeling technique. The reconstruction-based rebalancing is used for avoiding writes in rotations. It defers rebalancing to where newly-added or -deleted nodes take a constant fraction of the current tree size, such that the number of writes amortized to each update is a constant. In Section 7.2, we have shown that given the sorted order of all (end)points, an interval or priority tree can be constructed in a linear number of reads and writes. We can always obtain the sorted order via the tree structure, so when imbalance occurs, we can afford to reconstruct the whole subtree in reads and writes proportional to the subtree size and polylogarithmic depth. The -labeling aims at selecting a subset of tree nodes on an augmented tree, and only maintain balancing information as well as augmented values of these nodes, such that the overall number of writes of each update is reduced. Roughly speaking, it selects every nodes on each tree path, such that the update cost is saved by a factor of . Altogether, combining reconstruction-based rebalancing with -labeling gives a unified approach for different augmented trees: interval trees, priority search trees, and range trees.

The main component of our approach is reconstruction-based rebalancing using the -labeling technique. We can always obtain the sorted order via the tree structure, so when imbalance occurs, we can afford to reconstruct the whole subtree in reads and writes proportional to the subtree size and polylogarithmic depth. This gives a unified approach for different augmented trees: interval trees, priority search trees, and range trees.

We introduce the -labeling idea in Section 7.3.1, the rebalancing algorithm in Section 7.3.2, and its work analysis in Section 7.3.3. We then discuss the maintenance of augmented values for different applications in Section 7.3.4. We mention how to parallelize bulk updates in Section 7.3.5.

7.3.1 -Labeling

The goal of the -labeling is to maintain the balancing information at only a subset of tree nodes, the critical nodes, such that the number of writes per update is reduced. Once the augmented tree is constructed, we label the node as a critical node if for some integer , (1) its subtree weight is between and (inclusive); or (2) its subtree weight is and its sibling’s subtree weight is . All other nodes are secondary nodes. As a special case, we always treat the root as a virtual critical node, but it does not necessary satisfy the invariants of critical nodes. Note that all leaf nodes are critical nodes in -labeling since they always have subtrees of weight 2. When we label a critical node, we refer to its current subtree weight (which may change after insertions/deletions) as its initial weight. Note that after the augmented tree is constructed, we can find and mark the critical nodes in reads/writes and depth. After that, we only maintain the subtree weights for these critical nodes, and use their weights to balance the tree.

For a critical node , holds for some integer . This fact directly follows the definition of the critical node.

For two critical nodes and , if is ’s ancestor and there is no other critical node on the tree path between them, we refer to as ’s critical child, and as ’s critical parent. We define a critical sibling accordingly.

We show the following lemma on the initial weights.

Lemma 7.1.

For any two critical nodes and where is ’s critical parent, their initial weights satisfy .


Based on Fact 7.3.1, we assume and for some integers and . We first show that . It is easy to check that cannot be larger than or equal to . Assume by contradiction that . With this assumption, we will show that there exists an ancestor of , which we refer to it as , which is a critical node. The existence of contradicts the fact that is ’s critical parent. We will use the property that for any tree node the weight of its parent is .

Assume that does not have such an ancestor . Let be the ancestor of with weight closest to but no more than . We consider two cases: (a) and (b) . In case (a) ’s parent has weight at most . cannot be less than by definition of , and so , leading to a contradiction. In case (b), ’s sibling does not have weight , otherwise . However, then , and either is not the ancestor with weight closest to or .

Given , we have (by plugging in and ). Furthermore, since is ’s ancestor, we have . Combining the results proves the lemma. ∎

7.3.2 Rebalancing Algorithm based on -Labeling

We now consider insertions and deletions on an augmented tree. Maintaining the augmented values on the tree are independent of our -labeling technique, and differs slightly for each of the three tree structures. We will further discuss how to maintain augmented values in Section 7.3.4.

We note that deletions can be handled by marking the deleted objects without actually applying the deletion, and reconstructing the whole subtree once a constant fraction of the objects is deleted. Therefore in this section, we first focus on the insertions only. We analyze single insertions here, and discuss bulk insertions later in Section 7.3.5. Once the subtree weight of a critical node reaches twice the initial weight , we reconstruct the whole subtree, label the critical nodes within the subtree, and recalculate the initial weights of the new critical nodes. An exception here is that, if and for a certain , we do not mark the new root since otherwise it violates the bound stated in Lemma 7.2 (see more details in Section 7.3.3) with ’s critical parent. After this reconstruction, ’s original critical parent gets one extra critical child, and the two affected children now have initial weights the same as ’s initial weight. If imbalance occurs at multiple levels, we reconstruct the topmost tree node. An illustration of this process is shown in Figure 3.

Figure 3: An illustration of rebalancing based on -labeling. The critical nodes are shaded. The case after construction is shown in (a) with solid borders. After some insertions, the size of one of the subtrees grows to twice its initial weight (dashed lines in (a)), so the algorithm reconstructs the subtree, as shown in (b). As we keep inserting new nodes along the left spine, the tree will look like what is shown in (c), but Lemma 7.2 guarantees that the subtree of the topmost critical node will be reconstructed before it gets more than critical children. The lemma also guarantees that on the path from a critical node to any of its critical children, there can be at most secondary nodes.

We can directly apply the algorithms in Section 7.2 to reconstruct a subtree as long as we have the sorted order of the (end)points in this subtree. For interval and range trees, we can acquire the sorted order by traversing the subtree. using linear work and depth [51, 9]. For priority trees, since the tree nodes are not stored in-order, we need to insert all interior nodes into the tree in a bottom-up order based on their coordinates (without applying rebalancing) to get the total order on coordinates of all points (the details and cost analysis can be found in Appendix A). After we have the sorted order, a subtree of weight can be constructed in work and depth.

As mentioned, we always treat the root as a virtual critical node, but it does not necessary satisfy the invariants of critical nodes. By doing so, once the weight of the whole tree doubles, we reconstruct the entire tree. We need insertions for one reconstruction on the root (there can be deletions). The cost for reconstruction is for interval trees and priority trees, and for range trees (shown in Section 7.3.4). The amortized cost is of a lower order compared to the update cost shown in Theorem 7.3.4.

7.3.3 Cost Analysis of the Rebalancing

For a critical tree node with subtree weight , the subtree weight of ’s critical sibling satisfies .

To show the rebalancing cost, we first prove some properties about our dynamic augmented trees.

Lemma 7.2.

In a dynamic augmented tree with -labeling, we have for any two critical nodes and where is ’s critical parent.


For any critical node in the tree, the subtree weight of its critical child can grow up to a factor of 2 of ’s initial weight, after which the subtree is reconstructed to two new critical nodes with the same initial weight of . ’s weight can grow up to a factor of 2 of ’s initial weight, without affecting ’s weight (i.e., all insertions occur in ’s other critical children besides ). Combining these observations with the result in Lemma 7.1 shows this lemma except for the part. Originally we have after the previous reconstruction. grows together when grows, and right before the reconstruction of we have