
GPU-friendly, Parallel, and (Almost-)In-Place Construction of Left-Balanced k-d Trees

by Ingo Wald, et al.

We present an algorithm that allows for building left-balanced and complete k-d trees over k-dimensional points in a trivially parallel and GPU-friendly way. Our algorithm requires exactly one int per data point as temporary storage, and uses O(log N) iterations, each of which performs one parallel sort and one trivially parallel CUDA per-node update kernel.





1 Introduction

K-d trees are powerful, versatile, and widely used data structures for storing, managing, and performing queries on k-dimensional data. One particularly interesting type of k-d trees are those that are left-balanced and complete: for those, storing the tree's nodes in level-order means that the entire tree topology—i.e., which nodes are the parent, left child, or right child of a given node—can be deduced purely from each node's array index. This means that such trees require zero overhead for storing pointers etc., which makes them particularly useful for large data, or for devices where memory is precious (such as GPUs, FPGAs, etc.).

This paper is about building such trees in an efficient, parallel, and GPU-friendly way. Throughout the rest of this paper we will omit the terms "left-balanced" and "complete", and simply refer to "k-d trees"; but we always mean those that are both left-balanced and complete.

Figure 1: Illustration of a 2-dimensional k-d tree from Wikipedia [Wik]. Left: the balanced tree built over a 2D point set. Right: the space partitioning achieved by that tree.

The problem with building k-d trees is that any node within this tree has to simultaneously fulfill two different conditions: first, the k-d tree condition that all nodes in its left (respectively right) sub-tree must, in that node's chosen dimension, have coordinates that are less than or equal to (respectively greater than or equal to) that node's coordinate; and second, the balancing condition that the resulting tree must be left-balanced (which implicitly fixes how many nodes any given sub-tree has to have). Building a k-d tree over a given set of elements thus requires finding an ordering of these elements such that every node fulfills both of these conditions; and the problem of finding this ordering becomes particularly interesting when dealing with highly parallel architectures such as GPUs.

The usual way to build such trees is to start with a single array of nodes, then sort this array, split it at the appropriate position that ensures balancing, and recurse; but this approach has several issues. One is that the sorting steps require array order, but the tree requires level order, typically requiring two copies of the data to be kept. An even worse problem is that running this algorithm in parallel requires significant effort for tracking which sub-trees still need to be built, with which plane, with which elements, etc. Finally, like many other top-down hierarchical build algorithms this approach suffers from the fact that the granularity of parallelism changes radically over time, from few jobs with many elements at first, to many jobs with few elements at the end; this causes issues because the "same" logical operation on a sub-tree (such as, for example, sorting its elements) may require very different implementations across different stages of the algorithm (e.g., sorting a list of 4 elements requires a different algorithm than sorting one with 4 million). This need to specialize for problem size often leads to complex code, lots of special cases, etc.

In this paper, we describe an algorithm that can transform a set of k-dimensional data points into a proper left-balanced complete k-d tree by performing a series of steps that re-arrange these points. This algorithm can best be visualized as starting with each data point at the root of the tree, and having the points "trickle" down the tree until each has reached the location in the tree where both the k-d tree and balancing properties are fulfilled. In particular, our algorithm can do that without requiring a secondary copy of the data set, without having to create, partition, or manage temporary lists of nodes during construction, and in a way where every kernel and sort operation always operates on the entire set of all data points, in a way that trivially lends itself to parallel execution.

2 Definitions and Preliminaries

Throughout this paper we assume we need to build a left-balanced binary k-d tree over N elements of k-dimensional data points. Though we will later remove this restriction we will also, for now, assume that the k-d tree to be built will, in any level l, partition a sub-tree's data points using the dimension l mod k.

When referring to nodes in the k-d tree, we use the canonical way of doing so via that node's array index if the tree is stored in level-order. I.e., we refer to the root node as node 0, its children as nodes 1 and 2, etc. Using that indexing, many important properties for a given node can be computed simply from a given node's index. For example, the parent of node i is (i-1)/2 (using integer division), and its children are nodes 2i+1 and 2i+2.

In addition to parent-child relationships we can also compute information about the tree in general. For example, we can compute the tree level of a node i as floor(log2(i+1))—or in integer bit-arithmetic, as 31-clz(i+1), where clz is the instruction for counting the number of leading zeroes in a 32-bit int type (which every modern CPU or GPU supports). We can similarly compute the total number of levels in a tree of N nodes as L = floor(log2(N))+1 (or 32-clz(N)); or the number of nodes in a full tree of L levels as 2^L-1 (or (1<<L)-1 using integer bit-arithmetic); etc. Finally, in addition to nodes we can also argue about sub-trees within a given tree, which we refer to through the index of the node at the root of that sub-tree; i.e., sub-tree s is the set of nodes that are within the sub-tree rooted at node s. We will derive some important properties for sub-trees below.
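As a plain C++ sketch (not the paper's reference code), these index computations might look as follows; the helper names are ours, and __builtin_clz is the GCC/Clang count-leading-zeros intrinsic:

```cpp
// Sketch of the index arithmetic for a level-order stored binary tree.
#include <cassert>

inline int parentOf(int i) { return (i - 1) / 2; }   // parent of node i
inline int lChild(int i)   { return 2 * i + 1; }     // left child of node i
inline int rChild(int i)   { return 2 * i + 2; }     // right child of node i

// level of node i: floor(log2(i+1)), via count-leading-zeros
inline int levelOf(int i)  { return 31 - __builtin_clz(i + 1); }

// number of levels in a tree of N nodes: floor(log2(N)) + 1
inline int numLevels(int N) { return 32 - __builtin_clz(N); }

// number of nodes in a full tree of L levels: 2^L - 1
inline int fullTreeNodes(int L) { return (1 << L) - 1; }
```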

2.1 Sub-tree-Size and Pivot Element

One particularly interesting value we will be making use of in our algorithm is the number of nodes in the sub-tree under a given node s, which we will refer to as ss(s) (for subtree-size of sub-tree s). We show below that for any left-balanced tree of N nodes this value can be computed, for any s, in O(1) complexity (Section 4.1).

The reason this value is important is that for any given sub-tree s we can use it to find the data elements within this sub-tree: assuming we have a sequential list of those elements, and assuming these are already sorted based on the proper split dimension desired for node s, then by virtue of the k-d tree and balancing requirements we know that the first ss(lChild(s)) nodes must be in the left sub-tree, the next one must be the root of this sub-tree (i.e., it must be the element that in the eventual level-order layout of the correct final k-d tree should be at array position s), and all others must belong to the right sub-tree. This position—the one that splits this list into left sub-tree nodes, root, and right sub-tree nodes—we will also refer to as the pivot position, and the element at that location as the pivot element.

Observation 2.1

Let P be the sequential list of data points known to form the set of points that make up the sub-tree under s, and let these points be sorted by their (l mod k)'th coordinate, where l is the level of s. Then, the value ss(lChild(s)) is the k-d tree pivot position for this list; the element at that position should be node s of the final tree, those with lower indices are those that belong into the left sub-tree under s, and those with higher indices belong into the right sub-tree.

2.2 Level-l Ancestors of Nodes and Data Points

A second important value we will make use of in our algorithm is the level-l ancestor of a node n (or lla_l(n)). For any given k-d tree node n with level(n) >= l, the level-l ancestor is the index of the (uniquely defined) node that lives on level l of the tree, and whose sub-tree n is in. For values l >= level(n), we define the level-l ancestor of n to be n itself. I.e., the level-0 ancestor of any node is the root node 0; the level-1 ancestor of any node is either 1 or 2 (except for node 0, whose lla is 0 itself), etc.

Just like we can argue about level-l ancestors of a given existing k-d tree of N elements, so we can also argue about level-l ancestors of data points that may or may not yet be in the proper k-d tree order: in this case, we use the term level-l ancestor of a given input data point p to mean the lla of the node that p should be stored in if that array of data points were already in a valid balanced k-d tree order. The reason this value is important is that, by definition, if we know the lla_l of any data point p to be s, and the level of this s is less than l, then s must be the array position that p should be stored at to fulfill the k-d tree criteria. In particular, if we know the level-L ancestor of every data point (where L is the total depth of the tree) then we know, for each data point, exactly the position it should be at to form the proper k-d tree.
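To illustrate the definition (not the way our algorithm computes it), the level-l ancestor of a node can be obtained naively in O(log N) by walking up the parent chain; a small C++ sketch with hypothetical helper names:

```cpp
// Naive illustration of the level-l ancestor: step to the parent until
// we reach level l (or until n itself is on a level <= l).
#include <cassert>

inline int parentOf(int n) { return (n - 1) / 2; }
inline int levelOf(int n)  { return 31 - __builtin_clz(n + 1); }

int lla(int n, int l) {
  // for l >= level(n), the level-l ancestor of n is n itself
  while (levelOf(n) > l) n = parentOf(n);
  return n;
}
```

For example, node 9 sits on level 3, with parent chain 9 -> 4 -> 1 -> 0, so lla(9,2) = 4 and lla(9,0) = 0.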

3 Algorithm

The core idea of our algorithm is to look at the level-l ancestor property of every input data point, for values of l that increase by one in each iteration (which thus effectively fixes one more level of the tree in each iteration). To do this we temporarily associate each data point with a 32-bit integer value (that we also call the tag for that data point), and use that to store each iteration's lla value for this point. We start by initializing each data point's tag to 0 (which is, by definition, the level-0 ancestor of every node), then in each iteration l we read the point's current lla value from its tag, and compute its respective lla value for the next level l+1. Since the lla for levels greater than the one that a node is on will not change any more, this effectively fixes one more level of the tree in each iteration, meaning that after L = O(log N) steps each data point's tag will store exactly the ID of the node that this data point should become.

The key to this algorithm is, obviously, the kernel that refines a point's lla from one level l to the next level l+1. To do this, we rely on a suitably chosen ordering of all (point,tag) pairs of data, that we will derive next. Using this ordering we will—after putting all data points into that chosen order using a parallel sort—then easily be able to derive each data point's next-level lla.

3.1 Sort Order and Properties of that Order

In any given one of the iterations of our algorithm, let us consider a comparison operator that consists of two nested orderings: first, a major sort order that sorts data points by ascending tag value; and second, a minor sort order that sorts those elements with the same tag based on their respective (l mod k)'th coordinate, where l is the iteration we are currently performing (and in which we are, consequently, finding the position for all nodes on the l'th level of the tree).

In pseudo-code, we can use the following less() operator to achieve this chosen l'th step ordering of our data points:

bool less(int idx_a, int idx_b, int l) {
  int dim = l % k;
  return (tags[idx_a] < tags[idx_b])
      || ((tags[idx_a] == tags[idx_b])
          && (points[idx_a][dim] < points[idx_b][dim]));
}
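For illustration, the same comparator can be exercised on the host with plain C++ and std::sort; the sortedOrder() helper and the toy data below are hypothetical, with k = 2 hard-coded:

```cpp
// Host-side C++ analogue of the less() operator above.
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

using Point = std::array<int, 2>;

// return indices 0..N-1 ordered by ascending tag, with ties broken by the
// coordinate in dimension l % k
std::vector<int> sortedOrder(const std::vector<int> &tags,
                             const std::vector<Point> &points, int l) {
  const int k = 2, dim = l % k;
  std::vector<int> idx(points.size());
  for (int i = 0; i < (int)idx.size(); ++i) idx[i] = i;
  std::sort(idx.begin(), idx.end(), [&](int a, int b) {
    return (tags[a] < tags[b]) ||
           ((tags[a] == tags[b]) && (points[a][dim] < points[b][dim]));
  });
  return idx;
}
```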

Let us now assume that in each iteration l the tag for any given data point will initially specify that point's lla for level l. In that case, based on how that lla is defined we can make the following

Observation 3.1

Let T be an array of N ints where each T[i] stores the level-l ancestor of a given node in a binary tree of N nodes, with l < L. By definition of the lla, there must be exactly 2^l - 1 tags that each point to a node on a level less than l; and all other tags must point to a node on level l.

If we further apply a sort using the above comparison operator, then we immediately arrive at

Observation 3.2

Let A be the array of data points after the l'th iteration's sorting step. Then, the first 2^l - 1 elements of the array will be the nodes with tags less than 2^l - 1, and those will all be already at their proper locations in the array. Following those nodes of levels less than l, starting at offset 2^l - 1, there will be the points that will end up in sub-tree 2^l - 1 (the leftmost node on level l), sorted by their (l mod k)'th coordinate. Following those, at offset (2^l - 1) + ss(2^l - 1), there will be the points with lla 2^l (also sorted in the proper dimension), etc.

We can now use this observation to define an efficient tag update kernel that computes, from this array layout, each node’s next-level ancestor.

3.2 Tag Update Kernel

Using observation 3.2 we know that for each sub-tree s on level l we will find all points tagged with this sub-tree in a single coherent region of our array; we call this the segment of the s-tagged array elements. Let us now call the index where this segment of s-valued tags starts the segment begin index for s, sb(s,l). As this segment contains all of the data points in the sub-tree under s, its size will obviously be ss(s).

Within each such segment, we can now use observation 2.1 to determine, for each element of this array, whether it is the pivot element, or in the left respectively right sub-tree of s. Using this, we arrive at a node update kernel that, assuming the data is properly sorted as above, performs for each array item in parallel the following steps:

  • check if arrayIdx < 2^l - 1. If so, the tag is already correct; terminate.

  • read the tag s from the tag array to know which sub-tree the data point is in.

  • compute sb(s,l) and ss(lChild(s)) and use these to compute the pivot position pivotPos = sb(s,l) + ss(lChild(s)).

  • for indices less than the pivot position, change the tag to lChild(s); for those greater than it, change it to rChild(s).

This leads to the following pseudo-code:

__global__ void updateTags(int tags[], Point points[], int N, int l)
{
   int arrayIdx = CUDA thread index;
   if (arrayIdx >= N || arrayIdx < (1<<l)-1)
      return; /* invalid index, or already done */
   int currentTag = tags[arrayIdx]; // must be a node index on level l
   int pivotPos = sb(currentTag,l) + ss(lChild(currentTag));
   if (arrayIdx < pivotPos)
      tags[arrayIdx] = lChild(currentTag);
   else if (arrayIdx > pivotPos)
      tags[arrayIdx] = rChild(currentTag);
   /* else: tag remains unchanged; this is the root of this sub-tree */
}

Using this tag update kernel, our whole algorithm is simply the following set of iterations:

void parallelKDTreeBuild(Point points[N])
{
   int tags[N];
   in parallel: initialize all tags to 0;
   for (int l = 0; l < numLevels(N); l++) {
      parallel sort {tags,points}, using chosen less() w/ dim l%k;
      in parallel: updateTags(tags,points,N,l);
   }
}
After L = O(log N) steps, every one of the data points will have a tag that indicates where in the array that data point should be stored; and since the array will be sorted by this tag, every data point will automatically be at the proper position—meaning that the resulting array is indeed the proper k-d tree for these data points.

4 Computing ss(s) and sb(s,l) in O(1)

Throughout our algorithm there were two properties that we made frequent use of: the segment begin function sb(s,l), and the sub-tree size function ss(s). We have claimed that each one of those can be computed with O(1) complexity, but have not yet described how to do that.

4.1 Computing the Size of Sub-tree s, ss(s)

Let ss(s) be the number of nodes in the sub-tree rooted in node s. We observe that this value is an intrinsic property of any left-balanced complete tree of N nodes, not specific to our algorithm. We also observe that this value is useful even for other k-d tree construction algorithms, as it allows for computing the pivot position even for list-based recursive partitioning schemes.

Computing this value recursively is trivial, but costs O(N). To compute it in O(1) we first observe that the tree is left-balanced and complete; i.e., if L is the number of levels in the tree then all levels l < L-1 are entirely full, and level L-1 is filled from the left (i.e., it contains nodes 2^(L-1)-1 through N-1).

Now let l be the level of node s; then we know that all of sub-tree s's levels from l up to and including L-2 must be entirely full; and those must contain 2^(L-1-l) - 1 nodes. Furthermore, we can also compute the range of nodes that this sub-tree would have on level L-1, and compare that with the range of nodes that the tree actually has.

The first node of sub-tree s on level L-1—which we call the first lowest-level child (fllc(s))—can be computed by repeatedly computing lChild, for a total of L-1-l times. Since lChild(s) = 2s+1, calling this function once is synonymous with shifting s by 1 to the left and setting the last bit to 1; i.e., we take the binary code of s and append a "1" bit. Doing this L-1-l times on s is then synonymous with taking the binary code of s and appending L-1-l "1" bits, which in integer arithmetic can also be computed as

    fllc_s = ~((~s) << (L-l-1)).

Obviously, if this value is greater than or equal to N, then sub-tree s has no nodes on level L-1; otherwise, it will have at most N - fllc(s) nodes. We also know that in a full tree the sub-tree under s would only be 2^(L-1-l) nodes wide on that level, so the total number of sub-tree-s nodes on the lowest level can be computed as

    numNodes_s_on_lowest_level = min(max(0,N-fllc_s), 1<<(L-l-1)).

Together with the full inner nodes, this for ss(s) evaluates to

    ss(s) = /* inner nodes  */ (1<<(L-l-1)) - 1
          + /* lowest level */ min(max(0,N-fllc_s), 1<<(L-l-1)).

In this expression, l and L can be computed in constant time using simple clz operations as described above; all other operations are basic arithmetic operations like bit-shift, addition, and integer min/max, all of which are constant-cost and trivially cheap on modern architectures.
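As a sanity check, this closed form can be compared against a brute-force recursive count; the following plain C++ sketch (helper names ours; GCC/Clang's __builtin_clz assumed for clz) does exactly that:

```cpp
// ss(s): closed-form sub-tree size for a left-balanced complete tree of
// N nodes, cross-checked against a trivial recursive count.
#include <algorithm>
#include <cassert>

inline int numLevels(int N) { return 32 - __builtin_clz(N); }
inline int levelOf(int s)   { return 31 - __builtin_clz(s + 1); }

// closed-form sub-tree size, O(1)
int ss(int s, int N) {
  int L = numLevels(N), l = levelOf(s);
  int fllc_s = ~((~s) << (L - l - 1));      // first lowest-level child of s
  return ((1 << (L - l - 1)) - 1)           // full inner levels of the sub-tree
       + std::min(std::max(0, N - fllc_s),  // nodes actually present on level L-1,
                  1 << (L - l - 1));        // capped at the sub-tree's width there
}

// reference: recursive count, O(size of sub-tree)
int ssRecursive(int s, int N) {
  if (s >= N) return 0;
  return 1 + ssRecursive(2 * s + 1, N) + ssRecursive(2 * s + 2, N);
}
```

For the 10-node example used in Section 5, this yields ss(0) = 10, ss(1) = 6, and ss(2) = 3.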

4.2 Computing Segment-Begin in Step l, sb(s,l)

The second important property our algorithm frequently relies on is what we called the segment-begin index; i.e., the index in our sorted array (in step l) where all the data points tagged to be in sub-tree s can be found (we observe that in any iteration l our algorithm calls this function only on sub-trees s that are on the l'th level).

Recalling observation 3.2 we know that in the l'th step the array will first contain the 2^l - 1 nodes of the top l levels that have already found their final array position; followed by those for the first sub-tree on level l (tagged with 2^l - 1); then those for sub-tree 2^l; then those for sub-tree 2^l + 1; etc. Since we can compute the size of any sub-tree through ss(s), we could compute sb(s,l) through

    sb(s,l) = (1<<l)-1 + sum of ss(s') over all level-l sub-trees s' < s

but this would have O(N) complexity.

To compute that same value in O(1) we refer to the same observations we made in the previous section. First, we observe that there are a total of s - (2^l - 1) sub-trees on level l that are to the left of s (and thus come before s); we call this the number of left siblings of s, nls(s). Each of these sub-trees must have L-1-l levels that are guaranteed to be full, and as there are nls(s) of those there must be exactly nls(s)*(2^(L-1-l) - 1) such nodes on those inner levels, across all those trees to the left of s. On the lowest level of the tree, these left siblings fill the tree from the left, so the number of their nodes on the lowest level must be the lesser of how many nodes these trees would have if they were all filled (which is nls(s)*2^(L-1-l)), and how many the actual tree has on that level (which is N - (2^(L-1) - 1)).

Taken together, this combines to

  /* typically have those already, so just for completeness: */
  /* level of s */ l = 31-clz(s+1)
  /* num levels */ L = 32-clz(N)
  nls_s = s - ((1<<l)-1)
  sb(s,l) = /* top l levels */ (1<<l)-1
    + /* left siblings, inner  */ nls_s*((1<<(L-l-1))-1)
    + /* left siblings, lowest */ min(nls_s*(1<<(L-l-1)), N-((1<<(L-1))-1))

Like for ss(s), this can be computed with just a few clzs, shifts, and basic arithmetic operations, in O(1).
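Again this can be cross-checked against the naive O(N) summation; the sketch below (plain C++, helper names ours, __builtin_clz assumed) mirrors the pseudo-code above:

```cpp
// sb(s,l): closed-form segment-begin index, checked against the naive
// summation of all left siblings' sub-tree sizes.
#include <algorithm>
#include <cassert>

inline int numLevels(int N) { return 32 - __builtin_clz(N); }
inline int levelOf(int s)   { return 31 - __builtin_clz(s + 1); }

// sub-tree size, O(1) (as derived in Section 4.1)
int ss(int s, int N) {
  int L = numLevels(N), l = levelOf(s);
  int fllc_s = ~((~s) << (L - l - 1));
  return ((1 << (L - l - 1)) - 1)
       + std::min(std::max(0, N - fllc_s), 1 << (L - l - 1));
}

// segment begin in step l, O(1); s must lie on level l
int sb(int s, int l, int N) {
  int L = numLevels(N);
  int nls_s = s - ((1 << l) - 1);              // number of left siblings of s
  return ((1 << l) - 1)                        // top l levels, already placed
       + nls_s * ((1 << (L - l - 1)) - 1)      // left siblings, inner levels
       + std::min(nls_s * (1 << (L - l - 1)),  // left siblings, lowest level
                  N - ((1 << (L - 1)) - 1));
}

// reference: O(N) summation over all left siblings
int sbSlow(int s, int l, int N) {
  int begin = (1 << l) - 1;
  for (int t = (1 << l) - 1; t < s; ++t) begin += ss(t, N);
  return begin;
}
```

For the 10-node example of Section 5 this gives, e.g., sb(1,1) = 1, sb(2,1) = 7, and sb(4,2) = 6.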

5 An Example Algorithm-Walkthrough

To illustrate how this algorithm performs in practice, we have run our reference implementation on 10 randomly chosen int2 data points, and use this section to show, step by step, how it modifies the array of data points and tags.

In our example, let us assume we start with the 10 randomly chosen points listed below. Let us attach a tag to each of these, initialize each such tag with 0, and show these in table form; this gives an initial state of

   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    0    0    0    0    0    0    0    0    0    0
   data[i].x:   10   46   68   40   25   15   44   45   62   53
   data[i].y:   15   63   21   33   54   43   58   40   69   67

Let us now run iteration l = 0, which—since all nodes have the same tag of 0—will first result in all nodes being sorted in x:

   // after step 0 sort
   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    0    0    0    0    0    0    0    0    0    0
   data[i].x:   10   15   25   40   44   45   46   53   62   68
   data[i].y:   15   43   54   33   58   40   63   67   69   21

Running the update-tags step will evaluate the pivot position for sub-tree 0 (which at this stage all nodes are in) as sb(0,0) + ss(lChild(0)) = 0 + ss(1), which is 6, leading to these updated tags

   // after step 0 tags updated
   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    1    1    1    1    1    1    0    2    2    2
   data[i].x:   10   15   25   40   44   45   46   53   62   68
   data[i].y:   15   43   54   33   58   40   63   67   69   21

For l = 1 sorting now already “pins” element 0 (the final root node), and sorts the remaining elements by sub-tree ID and y coordinate, respectively, with ascending y values within the same sub-tree:

   // after step 1 sort
   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    0    1    1    1    1    1    1    2    2    2
   data[i].x:   46   10   40   45   15   25   44   68   53   62
   data[i].y:   63   15   33   40   43   54   58   21   67   69

Updating tags in step l = 1 now finds the pivot elements for nodes 1 and 2, and updates the tags of the remaining elements to the child sub-trees 3–6, respectively:

   // after step 1 tags updated
   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    0    3    3    3    1    4    4    5    2    6
   data[i].x:   46   10   40   45   15   25   44   68   53   62
   data[i].y:   63   15   33   40   43   54   58   21   67   69

Sorting for iteration l = 2 then yields nodes 0–2 fixed correctly, and sub-trees 3–6 in proper order:

   // after step 2 sort
   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    0    1    2    3    3    3    4    4    5    6
   data[i].x:   46   15   53   10   40   45   25   44   68   62
   data[i].y:   63   43   67   15   33   40   54   58   21   69

Updating tags now builds all tags for the third level:

   // after step 2 tags updated
   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    0    1    2    7    3    8    9    4    5    6
   data[i].x:   46   15   53   10   40   45   25   44   68   62
   data[i].y:   63   43   67   15   33   40   54   58   21   69

Finally, performing a final sort gives the (correct) final result as

   // after final sort
   array-idx:    0    1    2    3    4    5    6    7    8    9
   tags     :    0    1    2    3    4    5    6    7    8    9
   data[i].x:   46   15   53   40   44   68   62   10   45   25
   data[i].y:   63   43   67   33   58   21   69   15   40   54

The algorithm is now complete, and this is indeed the correct k-d tree array for this input.
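The walkthrough above can be reproduced with a small serial C++ sketch of the whole algorithm (our own re-implementation, not the paper's reference code): std::stable_sort stands in for the parallel sort, plain loops for the CUDA kernels, with k = 2 and int coordinates hard-coded.

```cpp
// Serial sketch of the complete build loop, using ss() and sb() as
// derived in Section 4; GCC/Clang __builtin_clz assumed.
#include <algorithm>
#include <cassert>
#include <vector>

struct Point { int x, y; };

inline int numLevels(int N) { return 32 - __builtin_clz(N); }
inline int levelOf(int s)   { return 31 - __builtin_clz(s + 1); }
inline int lChild(int s)    { return 2 * s + 1; }
inline int rChild(int s)    { return 2 * s + 2; }

int ss(int s, int N) {                       // sub-tree size, O(1) (Sec. 4.1)
  int L = numLevels(N), l = levelOf(s);
  int fllc_s = ~((~s) << (L - l - 1));
  return ((1 << (L - l - 1)) - 1)
       + std::min(std::max(0, N - fllc_s), 1 << (L - l - 1));
}

int sb(int s, int l, int N) {                // segment begin, O(1) (Sec. 4.2)
  int L = numLevels(N), nls_s = s - ((1 << l) - 1);
  return ((1 << l) - 1)
       + nls_s * ((1 << (L - l - 1)) - 1)
       + std::min(nls_s * (1 << (L - l - 1)), N - ((1 << (L - 1)) - 1));
}

std::vector<Point> buildKDTree(std::vector<Point> pts) {
  const int N = (int)pts.size(), k = 2, L = numLevels(N);
  std::vector<int> tags(N, 0);
  for (int l = 0; l < L; ++l) {
    // "parallel sort" step: by (tag, coordinate in dimension l % k)
    int dim = l % k;
    std::vector<int> idx(N);
    for (int i = 0; i < N; ++i) idx[i] = i;
    std::stable_sort(idx.begin(), idx.end(), [&](int a, int b) {
      if (tags[a] != tags[b]) return tags[a] < tags[b];
      int ca = dim ? pts[a].y : pts[a].x;
      int cb = dim ? pts[b].y : pts[b].x;
      return ca < cb;
    });
    std::vector<Point> p(N); std::vector<int> t(N);
    for (int i = 0; i < N; ++i) { p[i] = pts[idx[i]]; t[i] = tags[idx[i]]; }
    pts.swap(p); tags.swap(t);
    if (l == L - 1) break;  // lowest level: size-1 sub-trees already in place
    // "tag update kernel": one loop iteration per array element
    for (int i = (1 << l) - 1; i < N; ++i) {
      int s = tags[i];      // the level-l node this point currently sits under
      int pivotPos = sb(s, l, N) + ss(lChild(s), N);
      if      (i < pivotPos) tags[i] = lChild(s);
      else if (i > pivotPos) tags[i] = rChild(s);
      // else: this element is node s itself; tag stays
    }
  }
  return pts;  // pts[i] now holds node i of the left-balanced k-d tree
}
```

Fed the 10 example points, this reproduces the final array shown above, with node 0 = (46,63) at index 0 and (25,54) at index 9.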

6 Extension to Split-Widest-Dimension

In our discussion above we have assumed that for each node the builder would use dimension l mod k for partitioning, where l is the level of the tree that this node is in. Though this is exactly what many users of k-d trees want to do, for some applications it has been shown that always partitioning each sub-tree along the dimension where its associated region of space has the widest extent can lead to significantly better query performance.

To do this, a top-down recursive builder would first compute the bounding box of the entire data set (which is the domain of the entire tree, and the region of space that the root node is going to partition); then in each step the builder would consider the domain of the current sub-tree, choose the dimension where this domain’s bounding box is widest, and use that for partitioning. It would then compute the domains of the left and right sub-tree based on the parent’s domain partitioning plane, and recurse.

In our algorithm we cannot easily track bounding boxes for each sub-tree in a top-down manner, as we have potentially many such sub-trees in flight at any point in time. We can, however, easily compute a given sub-tree's bounding box by following its path from the root backwards: doing so will also iterate over all the partitioning planes "above" s, and in each step we know whether s is to the left or right of that plane. Thus, for any s we can start with the (pre-computed) bounding box of the entire domain, then walk from s upwards to the root, and in each step "clip" the bounding box by the given ancestor's splitting plane.

With this in mind, we modify our algorithm as follows: We first reserve the lowest bits in the tag value to store the partitioning dimension we want to use for the sub-tree that this node is in. In the initialization step we pre-compute the bounding box of the entire data set, and initialize these bits to the widest dimension of this data set's bounding box.

In the sort step, we proceed as above, except that for any comparison between points a and b we first retrieve a's node tag, extract that tag's stored split dimension, and, for equal tags, compare a and b along this dimension. We observe that it can absolutely happen that this comparison operator gets called with two points a and b that have different dimensions stored in their tags; however, this is perfectly OK: the comparison operator first compares by tag, so the comparison by coordinate only ever happens if the two points have the same tag—in which case they are in the same sub-tree and use the same split dimension.

The last modification we have to make is to ensure that the tag update kernel will properly compute the split dimension for each of the next step's sub-trees. To do that we first observe that this has to be done only for those points whose tag just got updated to either lChild(s) or rChild(s). If that is the case, we first compute the bounding box of that child sub-tree: we start with the scene bounding box, and first clip it by the split plane of s, which this step has just determined, and which thus is now known. We then follow up the tree, and further clip this bounding box by the split planes of the visited ancestors (all of which have already been fixed in previous iterations). The result of this is, for each point, the bounding box of the new sub-tree that this point just got tagged with. We then compute the proper split dimension, and store it in the new tag.

This step of computing the split dimension for each node obviously increases the cost of the tag update step from O(1) to O(log N) per node, and therefore, to O(N log N) for the entire update step across all nodes. However, the cost of each iteration was O(N log N) even before this change (due to the cost for sorting), so the total algorithm's complexity remains unchanged at O(N log^2 N). Sample code for this algorithm is provided together with sample code for the round-robin based algorithm, at [Wal22].

7 Discussion

In this paper we have described an algorithm that allows for the parallel, GPU-friendly, and (mostly-)in-place construction of left-balanced k-d trees. Sample code for both round-robin and split-widest-dimension variants, as well as for different data types and queries operating on those trees, can be found at [Wal22].

Our algorithm can build left-balanced k-d trees (of either round-robin or split-widest-dimension variety) in O(N log^2 N). We call our algorithm mostly-in-place because the algorithm itself only needs a constant one int per data point of temporary storage, independent of number of nodes, number of dimensions, size of other payload stored with each data point, etc. Of course, we also rely on sorting, and depending on the sorting algorithm used this may require additional temporary storage; in particular the thrust [BH12] based sort we use in our sample code will use non-trivial amounts of storage. A sort based solely on swapping—for example, bitonic sort [MCS15]—would avoid this (and quite possibly be faster, too), but for the sake of readability has not been used in our reference implementation.

7.1 Performance

Our algorithm is designed for parallel execution, and is reasonably fast; on a modern NVIDIA 3090TI GPU a k-d tree over 10 million points can be built in well under one second (see Table 1). That said, we make no claims that our algorithm (and in particular, our implementation) is necessarily the fastest way such trees can be built. As with memory usage, the main cost factor in terms of compute is the parallel sorting: sorting on GPUs can be incredibly fast for radix-sort-style algorithms that operate on simple scalar types; but with current libraries sorting with custom comparison operators is much slower. This could potentially be addressed by using a different implementation of parallel sort, and/or by using a 64-bit tag in which the upper bits store the sub-tree index and the lower bits replicate the point's coordinate in the chosen sorting dimension (one could then use a key-value radix sort with a 64-bit integer key); but for didactic reasons this has not been done in our reference implementation.

   N            1K      10K     100K    1M       10M     100M    1B
   int1 data type:
   RTX 8000      -      0.6ms   1.4ms   12.8ms   165ms   1.89s   24.7s
   A-6000        -      0.6ms   1.2ms   11.9ms   124ms   1.37s   17.7s
   RTX 3090TI    -      0.6ms   1.2ms   11.0ms   106ms   1.07s   16.8s
   float4 data type:
   RTX 8000      -      0.9ms   6.0ms   54.4ms   783ms   10.3s   (oom)
   A-6000        -      0.9ms   5.6ms   49.7ms   536ms   6.5s    (oom)
   RTX 3090TI    -      0.9ms   5.6ms   44.4ms   424ms   5.3s    (oom)

Table 1: Build performance for the given number of data points (K means thousand, M means million, and B means billion), using round-robin partitioning and uniformly distributed random points (averaged over 1,000 runs for the smallest inputs, over 100 runs each for medium-sized ones, and across 10 runs for the largest). oom means "out of memory".

We also observe that there are some obvious savings in that in each l'th step the sort and tree update kernels can be omitted for the first 2^l - 1 elements of the array; and in that one does not actually need to build the lowest level of the tree, since all sub-trees on that level are of size 1 and, after sorting, are already at the right position. These simple optimizations have already been done in our implementation.

7.2 Simplicity and Generality

Arguably the biggest advantage of our algorithm is its simplicity: though the computation of values like ss(s) and sb(s,l) at first glance does not look simple at all, once these are carefully derived (see above) the rest of the algorithm is a trivially simple sequence of alternating between a parallel sort and our roughly ten-line node update kernel. In particular, our algorithm never needs to deal with issuing per-sub-tree launches, with adapting launch dimensions based on sub-tree sizes, with special-case code for sub-trees that are "small enough" to merit special code paths, etc. Though we absolutely have looked into such codes—and have often managed to outperform the above algorithm—the ability to avoid such more complicated methods is a major advantage of our algorithm: almost the entire complexity and cost of our algorithm lies in effective and stable parallel sorting, for which many good solutions already exist.

The simplicity of our algorithm also makes it excellently suited to templating over different data types: while our sample codes focus on float3 or float4 data types we also have a variant that is templated over arbitrary CUDA structs that can contain arbitrary-dimensional points over arbitrary scalar types, including arbitrary other per-point “payload” (such as, for example, photon power for photon mapping [Jen01]).

This simplicity also makes it trivial to port this algorithm to other languages or hardware platforms. For example, the same algorithm trivially ports to the CPU, where it can be run with any serial or parallel sort algorithm, and with the tag update kernel using, for example, TBB for parallelism.

8 Conclusion

In this paper, we have described an algorithm that can build a left-balanced k-d tree over any number N of k-dimensional data points in O(N log^2 N) serial complexity, with each of its O(log N) iterations dominated by one parallel sort. Our algorithm consists of four different but inter-operating ideas: (1), tagging each point with an int that stores that point's level-l ancestor (and refining this value in every iteration); (2), careful observations regarding how to easily and efficiently compute sub-tree size and pivot position for any given node; (3), a carefully crafted sort order that makes these computations possible; and (4), making significant use of parallel sort for re-arranging nodes.

We observe that the combination of tagging nodes by sub-tree ID and sorting to produce sequential arrays of same sub-tree ID allows our algorithm to completely avoid the need to explicitly track which sub-trees still need building, which lists of nodes these contain, etc. This in particular allows our algorithm to always, in each step, operate on all elements of the array in parallel, without the need to explicitly issue possibly many launches for different ranges of array elements—which makes the final algorithm outstandingly simple compared to existing list-based methods. This ability to build the tree in a predictable number of steps that each operate on all elements of the array makes our algorithm a good match for highly parallel architectures such as GPUs.

We make no claims that this algorithm is necessarily the fastest possible construction method for balanced k-d trees; however, our experiments show it to nevertheless be quite fast, allowing a tree over 10 million points to be built in well under a second. This speed, combined with how easy it is to implement this algorithm on a given architecture, makes us believe this algorithm to be a useful tool for anybody that needs to build left-balanced k-d trees in a parallel and/or GPU-friendly manner.


Appendix A Revision History

Revision 1 (ArXiv version v2)


  • Updated title (from “A GPU-friendly, Parallel, and (Almost-)In-Place Algorithm for Building Left-Balanced kd-Trees” to “GPU-friendly, Parallel, and (Almost-)In-Place Construction of Left-Balanced k-d Trees”)

  • Fixed missing URLs in references.

Original Version (ArXiv version v1)

Originally uploaded to ArXiv Oct 31, 2022.