1 Introduction
Kd-trees are powerful, versatile, and widely used data structures for storing, managing, and performing queries on k-dimensional data. One particularly interesting type of kd-tree are those that are left-balanced and complete: for those, storing the tree's nodes in level-order means that the entire tree topology—i.e., which nodes are the parent, left, or right child of a given node—can be deduced from each node's array index. This means that such trees require zero overhead for storing pointers etc., which makes them particularly useful for large data, or for devices where memory is precious (such as GPUs, FPGAs, etc.).
This paper is about building such trees in an efficient, parallel, and GPU-friendly way. Throughout the rest of this paper we will omit the terms "left-balanced" and "complete", and simply refer to "kd-trees"; but we always mean those that are both left-balanced and complete.
The problem with building kd-trees is that any node within the tree has to simultaneously fulfill two different conditions: first, the kd-tree condition that all nodes in its left (respectively right) subtree are, relative to that node's partition plane, all less than (respectively greater than) that node's coordinate (in that node's chosen dimension); and second, the balancing condition that the resulting tree must be left-balanced (which implicitly fixes how many nodes any given subtree has to have). Building a kd-tree over a given set of elements thus requires finding an ordering of these elements such that every node fulfills both of these conditions; and the problem of finding this ordering becomes particularly interesting when dealing with highly parallel architectures such as GPUs.
The usual way to build such trees is to start with a single array of nodes, then sort this array, split it at the appropriate position that ensures balancing, and recurse; but this approach has several issues. One is that the sorting steps require array order, but the tree requires level order, typically requiring one to keep two copies of the data. An even worse problem is that running this algorithm in parallel requires significant effort in tracking which subtrees still need to be built, with which plane, with which elements, etc. Finally, like many other top-down hierarchical build algorithms this approach suffers from the fact that the granularity of parallelism changes radically over time, from few jobs with many elements at first to many jobs with only a handful of elements at the end; this causes issues because the "same" logical operation on a subtree (such as, for example, sorting its elements) may require very different implementations across different stages of the algorithm (e.g., sorting a list of 4 elements requires a different algorithm than sorting one with 4 million). This need to specialize for problem size often leads to complex code, lots of special cases, etc.
In this paper, we describe an algorithm that can transform a set of k-dimensional data points into a proper left-balanced complete kd-tree by performing a series of steps that rearrange these points. This algorithm can best be visualized as starting each node at the root of the tree, and having the nodes "trickle" down the tree until each node has reached the location in the tree where both the kd-tree and balancing properties are fulfilled. In particular, our algorithm can do that without requiring a secondary copy of the data set, without having to create, partition, or manage temporary lists of nodes during construction, and in a way where every kernel and sort operation always operates on the entire set of data points—which trivially lends itself to parallel execution.
2 Definitions and Preliminaries
Throughout this paper we assume we need to build a left-balanced binary kd-tree over N data points of dimension k. Though we will later remove this restriction we will also, for now, assume that the kd-tree to be built will, in any level l, partition a subtree's data points using the dimension l mod k.
When referring to nodes in the kd-tree, we use the canonical way of doing so via a node's array index if the tree is stored in level-order. I.e., we refer to the root node as node 0, its children as nodes 1 and 2, etc. Using that indexing, many important properties of a given node can be computed simply from that node's index. For example, the parent and children of node i are ⌊(i-1)/2⌋, 2i+1, and 2i+2, respectively.
In addition to parent-child relationships we can also compute information about the tree in general. For example, we can compute the tree level of a node i as l(i) = ⌊log2(i+1)⌋—or in integer bit-arithmetic, as 31 - clz(i+1), where clz is the instruction for counting the number of leading zeroes in a 32-bit int type (which every modern CPU or GPU supports). We can similarly compute the total number of levels in a tree of N nodes as L = ⌈log2(N+1)⌉; or the number of nodes in a full tree of L levels as 2^L - 1 (or (1 << L) - 1 using integer bit-arithmetic); etc. Finally, in addition to nodes we can also argue about subtrees within a given tree, which we refer to through the index of the node at the root of that subtree; i.e., subtree s is the set of nodes that are within the subtree rooted at node s. We will derive some important properties for subtrees below.
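These index computations condense to a handful of one-liners. The following C++ sketch illustrates them (the helper names are ours, not from the reference code; clz is expressed via the GCC/Clang builtin):

```cpp
#include <cstdint>

// parent and children of node i in the level-order (heap-style) layout
inline int parentOf(int i)     { return (i - 1) / 2; }
inline int leftChildOf(int i)  { return 2 * i + 1; }
inline int rightChildOf(int i) { return 2 * i + 2; }

// level of node i: floor(log2(i+1)), via count-leading-zeros
inline int levelOf(int i)      { return 31 - __builtin_clz(uint32_t(i) + 1); }

// number of levels in a tree of N >= 1 nodes: ceil(log2(N+1))
inline int numLevels(int N)    { return 32 - __builtin_clz(uint32_t(N)); }

// number of nodes in a full tree of L levels: 2^L - 1
inline int fullTreeSize(int L) { return (1 << L) - 1; }
```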
2.1 Subtree-Size and Pivot Element
One particularly interesting value we will be making use of in our algorithm is the number of nodes in the subtree under a given node s, which we will refer to as ss(s) (for subtree-size of subtree s). We show below that for any left-balanced tree of N nodes this value can be computed, for any s, in O(1) complexity (Section 4.1).
The reason this value is important is that for any given subtree s we can use it to find the data elements within this subtree: assuming we have a sequential list of those elements, and assuming these are already sorted by the proper split dimension desired for node s, then by virtue of the kd-tree and balancing requirements we know that the first ss(2s+1) nodes must be in the left subtree, the next one must be the root (i.e., it must be the element that in the eventual level-order layout of the correct final kd-tree should be at array position s), and all others must belong to the right subtree. This position—the one that splits this list into left subtree nodes, root, and right subtree nodes—we will also refer to as the pivot position, and the element at that location as the pivot element.
Observation 2.1
Let A be the sequential list of data points known to form the set of points that make up the subtree under s, and let these points be sorted by their (l(s) mod k)'th coordinate. Then, the value ss(2s+1) is the kd-tree pivot position for this list; the element at that position should be node s of the final tree, those with lower indices are those that belong into the left subtree under s, and those with higher indices belong into the right subtree.
2.2 Level-l Ancestors of Nodes and Data Points
A second important value we will make use of in our algorithm is the level-l ancestor of a node, lla(l, i). For any given kd-tree node i with l(i) ≥ l, the level-l ancestor is the index of the (uniquely defined) node that lives on level l of the tree, and whose subtree i is in. For values l ≥ l(i), we define the level-l ancestor of i to be i itself. I.e., the level-0 ancestor of any node is the root node 0; the level-1 ancestor of any node is either 1 or 2 (except for node 0, whose lla is 0), etc.
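For concreteness, the lla can be computed either by repeatedly stepping to a node's parent, or in closed form by dropping the lowest bits of i+1; a small C++ sketch (helper names are ours):

```cpp
inline int parentOf(int i) { return (i - 1) / 2; }
inline int levelOf(int i)  { int l = 0; while (i > 0) { i = parentOf(i); ++l; } return l; }

// level-l ancestor of a node, exactly as defined above
int lla(int l, int node)
{
  while (levelOf(node) > l) node = parentOf(node);
  return node;                 // for l >= level(node) this returns node itself
}

// equivalent closed form: drop the lowest (level(node) - l) bits of node+1
int llaFast(int l, int node)
{
  int dl = levelOf(node) - l;
  return (dl <= 0) ? node : ((node + 1) >> dl) - 1;
}
```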
Just like we can argue about level-l ancestors of nodes in a given existing kd-tree of N elements, so we can also argue about level-l ancestors of data points that may or may not yet be in the proper kd-tree order: in this case, we use the term level-l ancestor of a given input data point p_i to mean the lla of the node that p_i should be stored in if that array of data points were already in a valid balanced kd-tree order. The reason this value is important is that, by definition, if we know the lla of a data point p_i to be s, and l(s) equals the level of the node that p_i maps to, then s must be the array position that p_i should be at to fulfill the kd-tree criteria. In particular, if we know the level-(L-1) ancestor of every data point (where L is the total depth of the tree) then we know, for each data point, exactly the position it should be at to form the proper kd-tree.
3 Algorithm
The core idea of our algorithm is to look at the level-l ancestor property of every input data point p_i, for values of l that increase by one in each iteration (and which thus effectively fix one more level of the tree in each iteration). To do this we temporarily associate each data point with a 32-bit integer value (that we also call the tag for that data point), and use it to store each iteration's lla value for this point. We start by initializing each data point's tag to 0 (which is, by definition, the level-0 ancestor of every node); then in each iteration l we read the point's current lla value from its tag, and compute its respective lla value for the next level l+1. Since the lla for levels greater than the one that a node is on will not change any more, this effectively fixes one more level of the tree in each iteration, meaning that after L steps each data point's tag will store exactly the ID of the node that this data point should end up in.
The key to this algorithm is, obviously, the kernel that refines a node's lla from one level l to the next level l+1. To do this, we rely on a suitably chosen ordering of all (point, tag) pairs of data, which we will derive next. Using this ordering we will—after putting all data points into that chosen order using a parallel sort—then easily be able to derive each data point's next-level lla.
3.1 Sort Order and Properties of that Order
In any given one of the L iterations of our algorithm, let us consider a comparison operator that consists of two nested orderings: first, a major sort order that sorts data points by ascending tag value; and second, a minor sort order that sorts those elements with the same tag based on their respective (l mod k)'th coordinate, where l is the iteration we are currently performing (and in which we are, consequently, finding the positions of all nodes on the l'th level of the tree).
In pseudocode, we can use the following less() operator to achieve this chosen l'th-step ordering of our data points:
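A C++ sketch of such a comparison operator (the Point type and the choice of k = 3 are illustrative, not from the reference code):

```cpp
struct Point { float coord[3]; };     // illustrative: k = 3 dimensions

// step-l ordering: major order by ascending tag (= subtree ID),
// minor order by the (l mod k)'th coordinate for equal tags
bool less(int tagA, const Point &a, int tagB, const Point &b, int l, int k)
{
  if (tagA != tagB)
    return tagA < tagB;
  int dim = l % k;                    // split dimension used on level l
  return a.coord[dim] < b.coord[dim];
}
```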
Let us now assume that in each iteration l the tag of any given data point will initially specify that point's lla for level l. In that case, based on how that lla is defined, we can make the following
Observation 3.1
Let T be an array of N ints, where each T[i] stores the level-l ancestor of a given node in a binary tree of N nodes, and l < L. By definition of the lla, there must be exactly 2^l - 1 tags that each point to a node on a level less than l; and all other tags must point to a node on level l.
If we further apply a sort using the above comparison operator, then we immediately arrive at
Observation 3.2
Let A be the array of data points after the l'th iteration's sorting step. Then, the first 2^l - 1 elements of the array will be the nodes with tags less than 2^l - 1, and those will all already be at their proper locations in the array. Following those, starting at offset 2^l - 1, there will be the ss(2^l - 1) nodes that will end up in subtree 2^l - 1 (the leftmost subtree on level l), sorted by their (l mod k)'th coordinate. Following those, at offset (2^l - 1) + ss(2^l - 1), there will be the nodes with lla 2^l (also sorted in the proper dimension), etc.
We can now use this observation to define an efficient tag update kernel that computes, from this array layout, each node's next-level ancestor.
3.2 Tag Update Kernel
Using Observation 3.2 we know that for each subtree s on level l we will find all nodes tagged with this subtree ID in a single coherent region of our array; we call this the segment of the s-tagged array elements. Let us now call the index where this segment with s-valued tags starts the segment begin index for s, sb(s, l). As this segment contains all of the data points in the subtree under s, its size will obviously be ss(s).
Within each such segment, we can now use Observation 2.1 to determine, for each element of this array, whether it is the pivot element, or in the left (respectively right) subtree of s. Using this, we arrive at a node update kernel that—assuming the data is properly sorted as above—for each array item i in parallel, performs the following steps:

1. check if l(tag(i)) < l, i.e., whether the tag already points to a node above level l. If so, the tag is already correct; terminate.

2. read the tag s = tag(i) from the tag array to know which subtree the data point is in.

3. compute sb(s, l) and ss(2s+1), and use these to compute the pivot position pp(s) = sb(s, l) + ss(2s+1).

4. for indices i less than the pivot position, change the tag to 2s+1; for those greater than the pivot position, change it to 2s+2 (the element at the pivot position keeps its tag s, which is its final array position).
This leads to the following pseudocode:
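A CPU-side C++ sketch of this tag update kernel; ss() and sb() are the O(1) functions derived in Section 4, repeated here so the listing is self-contained (helper names are ours):

```cpp
#include <cstdint>
#include <algorithm>

inline int levelOf(int i)   { return 31 - __builtin_clz(uint32_t(i) + 1); }
inline int numLevels(int N) { return 32 - __builtin_clz(uint32_t(N)); }

// subtree size (Section 4.1): full inner levels plus clamped lowest level
int ss(int s, int N)
{
  int L    = numLevels(N);
  int fl   = L - 1 - levelOf(s);            // # guaranteed-full subtree levels
  int fllc = ((s + 1) << fl) - 1;           // first lowest-level child of s
  return ((1 << fl) - 1) + std::min(1 << fl, std::max(0, N - fllc));
}

// segment begin (Section 4.2): fixed top levels plus left siblings' nodes
int sb(int s, int l, int N)
{
  int L   = numLevels(N);
  int fl  = L - 1 - l;
  int nls = s - ((1 << l) - 1);             // # left siblings of s on level l
  return ((1 << l) - 1)
       + nls * ((1 << fl) - 1)
       + std::min(nls * (1 << fl), N - ((1 << (L - 1)) - 1));
}

// tag update for array element i in iteration l (run for all i in parallel)
void updateTag(int *tag, int i, int l, int N)
{
  int s = tag[i];
  if (levelOf(s) < l) return;               // already fixed on an upper level
  int pivot = sb(s, l, N) + ss(2 * s + 1, N);
  if      (i < pivot) tag[i] = 2 * s + 1;   // element goes to left subtree
  else if (i > pivot) tag[i] = 2 * s + 2;   // element goes to right subtree
  // i == pivot: tag stays s, which is this element's final array position
}
```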
Using this tag update kernel, our whole algorithm is simply the following set of iterations:
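As a complete illustration, the following CPU-side C++ sketch runs the whole loop over 2D points, with std::sort standing in for the parallel sort (all names are ours; the reference implementation sorts on the GPU):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Point2 { float c[2]; };                       // k = 2, for illustration
using Tagged = std::pair<int, Point2>;               // (tag, point)

inline int levelOf(int i)   { return 31 - __builtin_clz(uint32_t(i) + 1); }
inline int numLevels(int N) { return 32 - __builtin_clz(uint32_t(N)); }

int ss(int s, int N)                                 // Section 4.1
{
  int L = numLevels(N), fl = L - 1 - levelOf(s), fllc = ((s + 1) << fl) - 1;
  return ((1 << fl) - 1) + std::min(1 << fl, std::max(0, N - fllc));
}

int sb(int s, int l, int N)                          // Section 4.2
{
  int L = numLevels(N), fl = L - 1 - l, nls = s - ((1 << l) - 1);
  return ((1 << l) - 1) + nls * ((1 << fl) - 1)
       + std::min(nls * (1 << fl), N - ((1 << (L - 1)) - 1));
}

void buildKdTree(std::vector<Tagged> &a)
{
  int N = (int)a.size(), L = numLevels(N), k = 2;
  for (auto &t : a) t.first = 0;                     // level-0 ancestor is 0
  for (int l = 0; l < L; ++l) {
    int dim = l % k;                                 // round-robin dimension
    std::sort(a.begin(), a.end(), [dim](const Tagged &x, const Tagged &y) {
      return (x.first != y.first) ? x.first < y.first
                                  : x.second.c[dim] < y.second.c[dim];
    });
    if (l == L - 1) break;                           // after final sort: done
    for (int i = 0; i < N; ++i) {                    // tag update "kernel"
      int s = a[i].first;
      if (levelOf(s) < l) continue;
      int pivot = sb(s, l, N) + ss(2 * s + 1, N);
      if      (i < pivot) a[i].first = 2 * s + 1;
      else if (i > pivot) a[i].first = 2 * s + 2;
    }
  }
}
```

The last iteration only sorts: on level L-1 every segment has size 1, so each of its elements is already its own pivot.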
After L steps, every one of the N data points will have a tag that indicates where in the array that data point should be stored; and since the array will be sorted by this tag, every data point will automatically be at the proper position—meaning that the resulting array is indeed the proper kd-tree for these data points.
4 Computing sb(s, l) and ss(s) in O(1)
Throughout our algorithm there were two properties that we made frequent use of: the segment begin function sb(s, l), and the subtree size function ss(s). We have claimed that each of those can be computed with O(1) complexity, but have not yet described how to do that.
4.1 Computing the Size of Subtree s, ss(s)
Let ss(s) be the number of nodes in the subtree rooted in node s. We observe that this value is an intrinsic property of any kd-tree of N nodes, not specific to our algorithm. We also observe that it is useful even for other kd-tree construction algorithms, as it allows for computing the pivot position even for list-based recursive partitioning schemes.
Computing this value recursively is trivial, but costs O(N). To compute it in O(1) we first observe that the tree is left-balanced and complete; i.e., if L is the number of levels in the tree then all levels l < L-1 are entirely full, and level L-1 is filled from the left (i.e., it contains the nodes 2^(L-1) - 1 through N - 1).
Now let l = l(s) be the level of node s; then we know that all subtree levels from l up to and including L-2 must be entirely full, and those must contain 2^(L-1-l) - 1 of the subtree's nodes. Furthermore, we can also compute the range of nodes that this subtree would have on level L-1, and compare that with the range of nodes that the tree actually has.
The first node of subtree s on level L-1—which we call the first lowest-level child, fllc(s)—can be computed by repeatedly computing lChild for a total of L-1-l times. Since lChild(s) = 2s+1, calling this function once is synonymous with shifting s left by 1 bit and setting the last bit to 1; i.e., we take the binary code of s and append a "1" bit. Doing this L-1-l times on s is then synonymous with taking the binary code of s and appending L-1-l "1" bits, which in integer arithmetic can also be computed as fllc(s) = ((s+1) << (L-1-l)) - 1.
Obviously, if this value is ≥ N, then subtree s has no nodes on level L-1; otherwise, it will have at most N - fllc(s) nodes there. We also know that in a full tree the subtree under s would only be 2^(L-1-l) nodes wide on that level, so the total number of subtree nodes on level L-1 can be computed as min(2^(L-1-l), max(0, N - fllc(s))).
Together with the full inner levels, this for ss(s) evaluates to

ss(s) = 2^(L-1-l) - 1 + min(2^(L-1-l), max(0, N - fllc(s))).
In this expression, L and l can be computed in constant time using simple clz operations as described above; all other operations are basic arithmetic operations like bit-shifts, additions, and integer min/max, all of which are constant-cost and trivially cheap on modern architectures.
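In C++, the whole derivation condenses to a few lines (a sketch; helper names are ours):

```cpp
#include <cstdint>
#include <algorithm>

inline int levelOf(int i)   { return 31 - __builtin_clz(uint32_t(i) + 1); }
inline int numLevels(int N) { return 32 - __builtin_clz(uint32_t(N)); }

// ss(s): nodes on the guaranteed-full subtree levels, plus however many of
// the subtree's lowest-level slots actually exist in a tree of N nodes
int ss(int s, int N)
{
  int L      = numLevels(N);
  int fl     = L - 1 - levelOf(s);         // # of full subtree levels
  int fllc   = ((s + 1) << fl) - 1;        // first lowest-level child of s
  int lowest = std::min(1 << fl, std::max(0, N - fllc));
  return ((1 << fl) - 1) + lowest;
}
```

For the walkthrough of Section 5 (N = 10) this yields ss(1) = 6, matching the pivot position found there.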
4.2 Computing the Segment Begin in Step l, sb(s, l)
The second important property our algorithm frequently relies on is what we called the segment begin index; i.e., the index in our sorted array (in step l) where all the data points tagged to be in subtree s can be found (we observe that in any iteration l our algorithm calls this function only on subtrees s that are on the l'th level).
Recalling Observation 3.2, we know that in the l'th step the array will first contain the 2^l - 1 nodes of the top l levels that have already found their final array positions; followed by those for the first subtree on level l (tagged 2^l - 1); then those for subtree 2^l; then those for subtree 2^l + 1; etc. Since we can compute the size of any subtree through ss, we could compute sb(s, l) as

sb(s, l) = (2^l - 1) + Σ_{t = 2^l - 1}^{s-1} ss(t),

but this would have O(N) complexity.
To compute that same value in O(1) we use the same observations we made in the previous section. First, we observe that there are a total of nls(s) = s - (2^l - 1) subtrees on level l that are to the left of s (and thus come before s); we call this the number of left siblings of s. Each of these subtrees has L-1-l levels that are guaranteed to be full, and as there are nls(s) of those subtrees there must be exactly nls(s) · (2^(L-1-l) - 1) such nodes on those inner levels, across all the trees to the left of s. On the lowest level of the tree, these left siblings fill the tree from the left, so the number of their nodes on the lowest level must be the lesser of how many nodes these trees would have if they were all filled (which is nls(s) · 2^(L-1-l)), and how many the actual tree has on that level (which is N - (2^(L-1) - 1)).
Taken together, this combines to

sb(s, l) = (2^l - 1) + nls(s) · (2^(L-1-l) - 1) + min(nls(s) · 2^(L-1-l), N - (2^(L-1) - 1)).
Like ss(s), this can be computed with just a few clzs, shifts, and basic arithmetic operations, in O(1).
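In C++ (a sketch, using the same helpers as for ss(s)):

```cpp
#include <cstdint>
#include <algorithm>

inline int numLevels(int N) { return 32 - __builtin_clz(uint32_t(N)); }

// sb(s,l): the 2^l - 1 already-fixed nodes of the top l levels, plus the
// inner-level and lowest-level nodes of the nls(s) subtrees left of s
int sb(int s, int l, int N)
{
  int L      = numLevels(N);
  int fl     = L - 1 - l;                  // full levels per level-l subtree
  int nls    = s - ((1 << l) - 1);         // # of left siblings of s
  int inner  = nls * ((1 << fl) - 1);
  int lowest = std::min(nls * (1 << fl), N - ((1 << (L - 1)) - 1));
  return ((1 << l) - 1) + inner + lowest;
}
```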
5 An Example Algorithm Walkthrough
To illustrate how this algorithm performs in practice, we have run our reference implementation on 10 randomly chosen int2 data points, and use this section to show—step by step—how it modifies the array of data points and tags.
In our example, let us assume we start with the 10 randomly chosen points p_0 … p_9. Let us attach a tag to each of these points, initialize each such tag with 0, and show these in table form; this gives an initial state as
Let us now run iteration l = 0, which—since all nodes have the same tag of 0—will first result in all nodes being sorted by their x coordinate:
Running the update-tags step will evaluate the pivot position for subtree 0 (which at this stage all nodes are in) as sb(0, 0) + ss(1), which is 6, leading to these updated tags:
For l = 1, sorting now already "pins" the root element at array position 0, and sorts the remaining elements by subtree ID and y coordinate, respectively, with ascending y values within the same subtree:
Updating the tags in step l = 1 now finds the pivots for nodes 1 and 2, and updates their children into subtrees 3–6, respectively:
Sorting for iteration l = 2 then yields nodes 0–2 fixed correctly, and subtrees 3–6 in proper order:
Updating tags now builds all tags for the third level:
Finally, performing a final sort gives the (correct) final result as
The algorithm is now complete, and this is indeed the correct kd-tree array for this input.
6 Extension to Split-Widest-Dimension
In our discussion above we have assumed that for each node the builder would use dimension l mod k for partitioning, where l is the level of the tree that the node is on. Though this is exactly what many users of kd-trees want to do, for some applications it has been shown that always partitioning each subtree along the dimension where its associated region of space has the widest extent can lead to significantly better query performance.
To do this, a top-down recursive builder would first compute the bounding box of the entire data set (which is the domain of the entire tree, and the region of space that the root node is going to partition); then in each step the builder would consider the domain of the current subtree, choose the dimension in which this domain's bounding box is widest, and use that for partitioning. It would then compute the domains of the left and right subtrees based on the parent's domain and partitioning plane, and recurse.
In our algorithm we cannot easily track bounding boxes for each subtree in a top-down manner, as we have potentially many such subtrees in flight at any point in time. We can, however, easily compute a given subtree's bounding box by following its path from the root backwards: doing so will also iterate over all the partitioning planes "above" s, and in each step we know whether s is to the left or right of that plane. Thus, for any s we can start with the (precomputed) bounding box of the entire domain, then walk from s upwards to the root, and in each step "clip" the bounding box by the given ancestor's splitting plane.
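A C++ sketch of this upward walk (the Box type and the splitDim/splitPos arrays are illustrative assumptions; for the round-robin builder, splitDim[p] would simply be l(p) mod k):

```cpp
#include <algorithm>

struct Box { float lo[2], hi[2]; };        // k = 2, for illustration

// bounding box of subtree s: start from the whole domain and clip by the
// split plane of every ancestor, depending on which side s lies on
Box subtreeBounds(int s, Box box, const int *splitDim, const float *splitPos)
{
  while (s > 0) {
    int p = (s - 1) / 2;                   // parent of s
    int d = splitDim[p];
    if (s == 2 * p + 1)                    // s is in the left subtree of p
      box.hi[d] = std::min(box.hi[d], splitPos[p]);
    else                                   // s is in the right subtree of p
      box.lo[d] = std::max(box.lo[d], splitPos[p]);
    s = p;
  }
  return box;
}
```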
With this in mind, we modify our algorithm as follows: We first reserve the lowest bits of the tag value (enough to encode a dimension index) to store the partitioning dimension we want to use for the subtree that this node is in. In the initialization step we precompute the bounding box of the entire data set, and initialize these bits to the widest dimension of this data-set bounding box.
In the sort step, we proceed as above, except that for any comparison between points p and q we first retrieve p's node tag, extract that tag's stored split dimension, and, for equal tags, compare p and q along this dimension. We observe that it can absolutely happen that this comparison operator gets called with two points p and q that have different dimensions stored in their tags; however, this is perfectly OK: the comparison operator first compares by tag, so the comparison by coordinate only ever happens if the two points have the same tag—in which case they are in the same subtree and use the same split dimension.
The last modification we have to make is to ensure that the tag update kernel properly computes the split dimension for each of the next step's subtrees. To do that, we first observe that this has to be done only for those nodes whose tag just got updated to either 2s+1 or 2s+2. If that is the case, we first compute the bounding box of that new subtree: we start with the scene bounding box, and first clip it by the split plane of s, which this step has just determined, and which thus is now known. We then follow up the tree, and further clip this bounding box by the planes of the visited ancestors (all of which have already been fixed in previous iterations). The result of this is, for each node, the bounding box of the new subtree that this node just got tagged with. We then compute the proper split dimension, and store it in the new tag.
This step of computing the split dimension for each node obviously increases the cost of the tag update step from O(1) to O(log N) per node, and therefore to O(N log N) for the entire update step across all nodes. However, the cost of each iteration was O(N log N) even before this change (due to the cost of sorting), so the total algorithm's complexity remains unchanged at O(N log² N). Sample code for this algorithm is provided together with sample code for the round-robin based algorithm, at https://github.com/ingowald/cudaKDTree.
7 Discussion
In this paper we have described an algorithm that allows for the parallel, GPU-friendly, and (mostly) in-place construction of left-balanced kd-trees. Sample code for both the round-robin and split-widest-dimension variants, as well as for different data types and queries operating on those trees, can be found at https://github.com/ingowald/cudaKDTree [Wal22].
Our algorithm can build left-balanced kd-trees (of either round-robin or split-widest-dimension variety) in O(N log² N). We call our algorithm mostly in place because the algorithm itself only needs one temporary int per data point, independent of the number of nodes, the number of dimensions, the size of other payload stored with each data point, etc. Of course, we also rely on sorting, and depending on the sorting algorithm used this may require additional temporary storage; in particular the thrust [BH12] based sort we use in our sample code will use non-trivial amounts of storage. A sort based solely on swapping—for example, bitonic sort [MCS15]—would avoid this (and quite possibly be faster, too), but for the sake of readability has not been used in our reference implementation.
7.1 Performance
Our algorithm is designed for parallel execution, and is reasonably fast; on a modern NVIDIA 3090 TI GPU a kd-tree over 10 million points can be built in well under one second (see Table 1). That said, we make no claims that our algorithm (and in particular, our implementation) is necessarily the fastest way such trees can be built. As with memory usage, the main cost factor in terms of compute is parallel sorting: sorting on GPUs can be incredibly fast for radix-sort-style algorithms that operate on simple scalar types, but with current libraries, sorting with custom comparison operators is much slower. This could potentially be addressed by using a different implementation of parallel sort, and/or by using a 64-bit tag in which the upper bits store the subtree index and the lower bits replicate the point's coordinate in the chosen sorting dimension (one can then use a key-value radix sort on a 64-bit integer key); but for didactic reasons this has not been done in our reference implementation.
N           |  1K   |  10K  |  100K  |   1M   |  10M  | 100M  |  1B
------------+-------+-------+--------+--------+-------+-------+------
int1 data type:
RTX 8000    |  ms   | 0.6ms | 1.4ms  | 12.8ms | 165ms | 1.89s | 24.7s
A6000       |  ms   | 0.6ms | 1.2ms  | 11.9ms | 124ms | 1.37s | 17.7s
RTX 3090TI  |  ms   | 0.6ms | 1.2ms  | 11.0ms | 106ms | 1.07s | 16.8s
float4 data type:
RTX 8000    | 0.9ms | 6.0ms | 54.4ms | 783ms  | 10.3s | (oom) |
A6000       | 0.9ms | 5.6ms | 49.7ms | 536ms  | 6.5s  | (oom) |
RTX 3090TI  | 0.9ms | 5.6ms | 44.4ms | 424ms  | 5.3s  | (oom) |
Table 1: Build times for varying N (1B means 1 billion), using round-robin partitioning and uniformly distributed random points (averaged over 1,000 runs for the smallest, over 100 runs each for the mid-size, and across 10 runs for the largest values of N). "oom" means "out of memory".

We also observe that there are some obvious savings in that in each l'th step the sort and tree update kernels can be omitted for the first 2^l - 1 elements of the array; and in that one does not actually need to build the lowest level of the tree, since all subtrees on that level are of size 1 and, after sorting, are already at the right position. These simple optimizations have already been done in our implementation.
7.2 Simplicity and Generality
Arguably the biggest advantage of our algorithm is its simplicity: though the computation of values like ss(s) and sb(s, l) at first glance does not look simple at all, once these are carefully derived (see above) the rest of the algorithm is a trivially simple sequence of alternations between a parallel sort and our O(10)-line node update kernel. In particular, our algorithm never needs to deal with issuing per-subtree launches, with adapting launch dimensions based on subtree sizes, with special-case code for subtrees that are "small enough" to merit special code paths, etc. Though we absolutely have looked into such codes—and have often managed to outperform the above algorithm—not having to deal with such more complicated methods is a major advantage of our algorithm: almost the entire complexity and cost of our algorithm lies in effective and stable parallel sorting, for which many good solutions already exist.
The simplicity of our algorithm also makes it excellently suited for templating over different data types: while our sample codes focus on float3 or float4 data types, we also have a variant that is templated over arbitrary CUDA structs, which can contain arbitrary-dimensional points over arbitrary scalar types, including arbitrary other per-point "payload" (such as, for example, photon power for photon mapping [Jen01]).
This simplicity also makes it trivial to port this algorithm to other languages or hardware platforms. For example, the same algorithm trivially ports to the CPU, where it can be run with any serial or parallel sort algorithm, and with the update kernel using, for example, TBB for parallelism.
8 Conclusion
In this paper, we have described an algorithm that can build a left-balanced kd-tree over any number of k-dimensional data points in O(N log² N) serial complexity (and, given sufficiently many processors, polylogarithmic parallel complexity). Our algorithm consists of four different but interoperating ideas: (1) tagging each point with an int that stores that node's level-l ancestor (and refining this value in every iteration); (2) careful observations regarding how to easily and efficiently compute subtree size and pivot position for any given node; (3) a carefully crafted sort order that makes these computations possible; and (4) making significant use of parallel sort for rearranging nodes.
We observe that the combination of tagging nodes by subtree ID and sorting to produce sequential arrays of same-subtree-ID elements allows our algorithm to completely avoid the need to explicitly track which subtrees still need building, which lists of nodes these contain, etc. This in particular allows our algorithm to always, in each step, operate on all elements of the array in parallel, without the need to explicitly issue possibly many launches for different ranges of array elements—which makes the final algorithm outstandingly simple compared to existing list-based methods. This ability to build the tree in a predictable number of steps that each operate on all elements of the array makes our algorithm a good match for highly parallel architectures such as GPUs.
We make no claims that this algorithm is necessarily the fastest possible construction method for balanced kd-trees; however, our experiments show it to nevertheless be quite fast, allowing a tree over 10 million points to be built in well under a second. This speed, combined with how easy the algorithm is to implement on a given architecture, makes us believe that it will be a useful tool for anybody who needs to build left-balanced kd-trees in a parallel and/or GPU-friendly manner.
References
 [BH12] Nathan Bell and Jared Hoberock. Thrust: A productivity-oriented library for CUDA. In GPU Computing Gems Jade Edition, pages 359–371. Elsevier, 2012.
 [Jen01] Henrik Wann Jensen. Realistic Image Synthesis Using Photon Mapping, volume 364. A K Peters, Natick, 2001.
 [MCS15] Qi Mu, Liqing Cui, and Yufei Song. The implementation and optimization of bitonic sort algorithm based on CUDA. CoRR, abs/1506.01446, 2015.
 [Wal22] Ingo Wald. cudaKDTree: Sample Code for GPU Construction and Stack-free Traversal of kd-Trees, 2022. https://github.com/ingowald/cudaKDTree.
 [Wik] Wikipedia. kd tree. https://en.wikipedia.org/wiki/Kd_tree, last visited Oct 23, 2022.
Appendix A Revision History
 Revision 1 (arXiv version v2):

- Updated title (from "A GPU-friendly, Parallel, and (Almost) In-Place Algorithm for Building Left-Balanced kd-Trees" to "GPU-friendly, Parallel, and (Almost) In-Place Construction of Left-Balanced kd Trees").

- Fixed missing URLs in references.

 Original version (arXiv version v1): originally uploaded to arXiv Oct 31, 2022. https://arxiv.org/abs/2211.00120