1 Introduction
3D space is only one dimension higher than 2D image space, yet the additional dimension introduces an unpredictable multiplier to the computational and storage cost. This makes 3D perception tasks such as geometric reconstruction and appearance refinement challenging to implement. To reduce complexity, one compromise is to reuse dense data structures and constrain the 3D space by bounding the region of interest, e.g., adopting a 3D array in a bounded space [36]. While this simple approach succeeds at the scale of objects, it cannot meet the demands of room- or city-scale perception, which is necessary for virtual tours, telepresence, and autonomous driving.
Since the 3D space is generally a collection of 2D surface manifolds, its sparsity can be exploited by partitioning to reduce computational cost. The general idea is to split the large 3D space into smaller regions and only process the non-empty ones. There is a plethora of well-established data structures for 3D space partitioning. Examples include trees (Octree [33], KD-tree [4]) and hash maps (spatial hashing [37]). While trees are able to adaptively achieve high precision, they 1) require an initial bounding volume, 2) usually take unbalanced traversal time for a batch of spatial queries, and hence 3) are less friendly to batched operations. On the other hand, spatial hashing coupled with a plain array structure is more scalable and parallelizable for reconstruction tasks [37, 42]. Nonetheless, most current implementations are task-specific and difficult to generalize. A user-friendly, general hash map is missing for efficient spatial perception at scale.
The reason for this absence is understandable. A parallel hash map on GPU has to resolve collisions and thread conflicts, and preferably organize an optimized memory manager, none of which is trivial to implement. Previous studies have attempted to tackle the problem in one or more aspects, driven by their selected downstream applications. Furthermore, most popular parallel GPU hash maps are implemented in C++/CUDA and only expose low-level interfaces. As a result, every extension built on such a hash map must start from low-level programming. While these designs usually guarantee performance under certain circumstances [2, 3, 37, 42], as of today they leave a gap from the standpoint of the research community, which prefers off-the-shelf libraries for fast prototyping in a high-level scripting language with tensors and automatic differentiation. Our motivation is to bridge this gap, enabling researchers to develop sophisticated 3D perception routines with less effort and driving the community toward large-scale 3D perception.
To this end, we design a modern hash map framework with the following major contributions:


a user-friendly, dynamic, and generic hash map interface (a generic hash map supports arbitrary-dimensional keys and values in various data types) that enables tensor I/O, advanced indexing, and in-place automatic differentiation when bridged to autodiff engines such as PyTorch;

an index-first adaptor that supports various state-of-the-art parallel GPU hash map backends and accelerates hash map operations with an improved data layout;

a handful of downstream applications that achieve higher performance than state-of-the-art implementations with fewer lines of code (LoC).
Experiments show that ASH achieves better performance with fewer LoC on both synthetic benchmarks and real-world tasks.
2 Related Work
2.1 Parallel Hash Map
The hash map is a data structure that seeks to map sparse keys (e.g., unbounded indices, strings, coordinates) from a set $\mathcal{K}$ to values from a set $\mathcal{V}$ with amortized $O(1)$ access. It relies on a hash function $h : \mathcal{K} \to \mathcal{I}$ that maps a key to an index set $\mathcal{I}$ suitable for indexing (or addressing) on a computer.
Ideally, with a perfect injective hash function $h$, a hash map could be implemented as $v = \mathbf{a}[h(k)]$, where $\mathbf{a}$ is an array of objects of type $\mathcal{V}$ and $[\cdot]$ is the trivial array element accessor. In practice, however, it is intractable to find an injective map given a sparse key distribution in $\mathcal{K}$ and an index set of constrained size $|\mathcal{I}|$ under a limited computational budget. Therefore, modifications are required to resolve the inevitable collisions, where $h(k_1) = h(k_2)$ with $k_1 \neq k_2$.
There are two classes of techniques for collision resolution: open addressing and separate chaining. Open addressing searches for another candidate slot via a probing algorithm until an empty address is found. The simplest probing scheme, linear probing [25], computes $h_i(k) = \big(h(k) + i\big) \bmod |\mathcal{I}|$ starting from $h_0(k) = h(k)$, where $i$ is the number of attempts. Separate chaining, on the other hand, maintains multiple entries per mapped index: in implementation, a linked list is grown at index $i$ whenever $h(k_1) = h(k_2) = i$.
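To make the probing scheme concrete, here is a minimal sequential Python sketch of open addressing with linear probing (illustrative only; the class name and fixed capacity are ours, and real GPU variants replace the loop with parallel atomic probes):

```python
# Minimal illustration of open addressing with linear probing.
# A sequential sketch for exposition, not the parallel GPU variant.
class LinearProbingMap:
    def __init__(self, capacity=8):
        self.keys = [None] * capacity    # key slots
        self.vals = [None] * capacity    # value slots

    def _probe(self, key):
        # h_i(k) = (h(k) + i) mod |I|, i = 0, 1, 2, ...
        base = hash(key)
        for i in range(len(self.keys)):
            yield (base + i) % len(self.keys)

    def insert(self, key, val):
        for idx in self._probe(key):
            if self.keys[idx] is None or self.keys[idx] == key:
                self.keys[idx], self.vals[idx] = key, val
                return idx
        raise RuntimeError("hash map is full")

    def find(self, key):
        for idx in self._probe(key):
            if self.keys[idx] is None:
                return None              # empty slot: key absent
            if self.keys[idx] == key:
                return self.vals[idx]
        return None

m = LinearProbingMap()
m.insert((1, 2, 3), "voxel block A")     # tuple keys mimic 3D coordinates
assert m.find((1, 2, 3)) == "voxel block A"
```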
While hash map implementations are widely available on CPU, their GPU counterparts have only emerged in the recent decade. The majority of GPU hash maps use open addressing [48, 2, 15, 24], mainly due to its simplicity of implementation and its capability of handling highly concurrent operations. CUDPP [2] utilizes Cuckoo Hashing [39], while CoherentHash [15] adopts Robin Hood Hashing [6], both involving advanced probing designs. Although performant when $\mathcal{K}$ and $\mathcal{V}$ are limited to integer sets, these variations cannot be generalized to spatial hashing and only allow static input. Recently, WarpCore [24] added support for non-integer keys and dynamic insertion, but the key domain is still limited to at most 64 bits.
There are also a few separate chaining implementations on GPU involving device-side linked lists. SlabHash [3] builds a linked list with a 128-bit Slab as the minimal unit, optimized for SIMT (Single Instruction, Multiple Threads) warp operations. Although SlabHash allows dynamic insertion, similar to the aforementioned GPU hash maps, only integer $\mathcal{K}$ and $\mathcal{V}$ are supported. stdgpu [47] follows the conventional C++ Standard Library std::unordered_map and builds supporting vectors, bitset lock guards, and linked lists from scratch, resulting in a generic, dynamic hash map. Despite these rich functionalities, stdgpu is not optimized for large value sets. In addition, due to its low-level templated design, users have to write device code even for simple tasks.
We refer the readers to a comprehensive review of GPU hash maps [30].
2.2 Space Partitioning Structures
3D data is not as simple to organize as 2D images. While a 2D image can be stored in a dense matrix, exploiting sparsity in 3D data is paramount due to the memory limits of current hardware.
The most widely used data structures for 3D indexing are arguably trees. A KD-tree [4] recursively sorts k-dimensional data along a selected axis and partitions the data at the median point. By nature, a KD-tree is designed for neighbor search. In 3D, it is mainly used to organize 3D points and their features. Examples include normal estimation and nearest-neighbor association in Iterative Closest Points (ICP) [44, 56], and 3D feature association in global registration [45, 55]. GPU adaptations of KD-trees exist [46, 52], but they are not suitable for incrementally changing scenes, as they are usually constructed once and queried repeatedly.
Bounding volume hierarchy (BVH) is another hierarchical representation that organizes primitives such as objects and triangles in 3D. There are various GPU adaptations [16, 29, 50], mostly targeted at ray tracing and dynamic collision detection. While parallel construction is possible and deformation of the nodes is allowed, the tree structure typically remains unchanged, assuming a fixed layout.
While KD-trees and BVHs split the space unevenly according to the data distribution, an Octree [33], on the other hand, recursively partitions 3D space into 8 even subvolumes according to space occupancy. It has been widely used in adaptive 3D mapping [20, 34] for robot navigation. Parallel GPU implementations exist, from optimized data structures [19] to domain-specific languages [21]. However, these works generally focus on physics simulation within a bounded region of interest where the spatial partition is predefined. While parallel incremental subdivision [51] is possible, an initial bounding region is still required, and the trees are not guaranteed to be balanced.
Spatial hashing is another variation of spatial management, with amortized $O(1)$ access time provided by hash maps. With an unbounded region of interest, it has been widely used in real-time volumetric scene reconstruction, bundled with small dense 3D arrays. A handful of CPU implementations achieve real-time performance [27, 17] at the expense of deep coupling with the tracking pipelines. Similarly, GPU implementations [37, 42, 12] reach high frame rates using GPU-based spatial hashing. However, all of these studies depend on ad hoc GPU hash maps exclusive to the specific systems. Concurrent race conditions have not been fully resolved in several implementations [37, 12], where volumes can be randomly under-allocated.
2.3 Spatially Varying 3D Representations
A truncated signed distance field (TSDF) [11] is an implicit representation of surfaces, recording the point-wise distance to the nearest surface. It is frequently used for dense scene reconstruction from noisy input. The distribution of surfaces is generally spatially varying; therefore, a proper parameterization is usually necessary, either in a discrete [37] or a neural [7] form.
Non-rigid deformation methods [53, 35] seek to embed point clouds in a deformable grid, where each point is anchored to neighboring grid points and deformed accordingly. They are mainly used for animation or for calibrating distorted models. Similar to a deformation grid, complex lighting for rendering can be approximated by spatially varying spherical harmonics (SVSH) [32] placed at a sparse grid. These grids are natural applications of spatial hashing.
A comprehensive review of spatially varying representations for real-world scene reconstruction is available [58].
While ad hoc implementations of these representations have been introduced on CPU or GPU, ASH provides a device-agnostic interface that requires less code and delivers better performance. Table I compares various aspects of existing GPU hash maps, either as standalone data structures (Section 2.1) or embedded in applications (Section 2.2). To the best of our knowledge, ASH is the first implementation that simultaneously supports dynamic insertion, ensures correctness via atomic operations, allows generic keys, and offers a modern tensor interface with Python bindings for better usability.
3 Overview
Before plunging into the details, we first provide a high-level overview of our framework in Fig. 1.
Conventional parallel hash maps reorganize the structure-of-arrays (SoA) input, i.e., the separate key array and value array, into an array of structures (AoS) in which keys and values are paired, inserted, and stored. Consequently, arrays of pointers to pair structures (std::pair in C++, thrust::pair in CUDA, and tuple in Python) are returned upon query, and all operations, from insertion and query to in-place value increments, require users to write device code and visit the AoS through low-level pointers.
In contrast, ASH sticks to SoA. Fig. 1 shows the workflow of ASH. Instead of pointers to pairs, ASH returns index and mask arrays that can be directly consumed by tensor libraries such as PyTorch [41] (without memory copy) and NumPy [18] (with GPU-to-host memory copy). As a result, downstream functions, from duplicate-key removal to in-place modification, can be chained with hash map insertion and query in ASH via advanced indexing, without writing any device code. As a general and device-agnostic interface for parallel hash maps, our framework is built upon switchable backends whose details are hidden from the user. Currently, separate-chaining backends are supported, including the generic stdgpu [47] backend and the integer-only SlabHash [3] backend extended to support arbitrary key-value data types. TBB's concurrent hash map [28] powers the CPU counterpart with an interface identical to the CUDA one.
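To make the workflow concrete, the sketch below emulates the index-and-mask pattern of Fig. 1 in PyTorch. torch.unique stands in for the parallel hash map backend, and the function name and buffer layout are illustrative rather than the exact ASH binding; the point is the caller-side chaining via advanced indexing:

```python
import torch

# Stand-in for the SoA workflow of Fig. 1: torch.unique emulates the parallel
# hash map backend, while the caller-side pattern (indices + masks consumed by
# advanced indexing) matches what ASH returns. Names here are illustrative.
def insert(keys, vals, key_buffer, value_buffer):
    unique_keys, inverse = torch.unique(keys, dim=0, return_inverse=True)
    key_buffer[: unique_keys.shape[0]] = unique_keys  # fill the key buffer k^b
    masks = torch.zeros(keys.shape[0], dtype=torch.bool)
    seen = set()
    for j, u in enumerate(inverse.tolist()):          # first occurrence wins
        if u not in seen:
            seen.add(u)
            masks[j] = True
    indices = inverse                                 # buffer index per input key
    value_buffer[indices[masks]] = vals[masks]        # fill the value buffer v^b
    return indices, masks

keys = torch.tensor([[0, 0, 0], [1, 0, 0], [0, 0, 0]], dtype=torch.int32)
vals = torch.zeros((3, 8))
key_buffer = torch.zeros((16, 3), dtype=torch.int32)
value_buffer = torch.zeros((16, 8))

indices, masks = insert(keys, vals, key_buffer, value_buffer)
value_buffer[indices[masks]] += 1.0                   # chained in-place update
```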
In this paper, we use calligraphic letters to represent sets, and normal lowercase letters for their elements. Normal uppercase letters denote functions. Bold lowercase letters denote vectors of elements, or arrays from the programmer's perspective. Bold uppercase letters are for matrices. For instance, in a hash map, we are interested in key elements $k$ and their vectorized processing, e.g., the query $Q(\mathbf{k})$. Specifically, we use $\mathbf{i}$ to denote an array of indices and $\mathbf{m}$ an array of boolean selection masks. Given an arbitrary array $\mathbf{x}$, we denote by $\mathbf{x}[\mathbf{i}]$ and $\mathbf{x}[\mathbf{m}]$ the indexing and selection functions applied to $\mathbf{x}$. We use $\mathcal{K}$ and $\mathcal{V}$ as the key and value sets of a hash map, $h$ as the internal hash function that converts a key to an index, and $H$ as the general hash map enclosing $h$.
4 The ASH Framework
4.1 Classical Hashing
In a hash map $H$, since the hash function $h$ cannot be perfect, as discussed in Section 2, we have to store the keys to verify whether collisions happen ($h(k_1) = h(k_2)$ but $k_1 \neq k_2$).
In separate chaining, the bucket-linked-list architecture is used to resolve hash collisions. With $n$ initial buckets, we construct the hash function $h_n(k) = h(k) \bmod n$, where $h$ is defined in Section 2.1. As shown in Fig. 2, keys with the same hashed index $i = h_n(k)$ are first aggregated in the $i$-th bucket, where a linked list grows adaptively to accommodate the distinct keys. A conventional hash map stores key-value pairs as its storage units. Consequently, two keys can be distinguished by checking their hashed indices and then comparing the stored keys from the pairs in order, and manipulation of the keys and values is achieved by iterating over such pairs.
With this formulation, assuming a subset $\mathbf{k} \subseteq \mathcal{K}$ has been inserted into the hash map with associated values $\mathbf{v}$, a query function can be described as

$Q(k) = \mathrm{iterator}\big((k, v)\big), \quad k \in \mathbf{k}$  (1)

where $(k, v)$ forms a concrete pair stored in the hash map. This format is common in implementations, e.g., in C++ (std::unordered_map) and Python (dict).
4.2 Function Chaining and Parallel Hashing
The element-wise operation in Eq. 1 can be extended to vectors via parallel kernels. However, interpretation of the returned iterators of pairs is still required at the low level. In other words, although the parallel version can be implemented efficiently, results are still packed in an AoS instead of an SoA:

$Q(\mathbf{k}) = \big\{ \mathrm{iterator}\big((k_j, v_j)\big) \big\}_j$  (2)

This forms a barrier when the parallel query is located in a chain of functions. For instance, to apply any function $f$ (e.g., a geometry transformation) over the result of a query, a low-level function second that selects the value element from a pair must be provided to dereference the low-level structures and manipulate the keys and values in place. In other words, we have to implement a non-trivial second,

$\mathrm{second} : \mathrm{iterator}\big((k, v)\big) \mapsto v$  (3)

to force the conversion from AoS to SoA and chain a high-level function $f$ with $Q$ as $f \circ \mathrm{second} \circ Q$. (This could alternatively be achieved by returning a copy of the values, but that is not feasible when dealing with large-scale data, e.g., hierarchical voxel grids.) This can be tedious when prototyping geometric perception routines that require hash map structures, since off-the-shelf operations have to be reimplemented in kernels.
We reformulate the problem by introducing two affiliated arrays, $\mathbf{k}^b$ and $\mathbf{v}^b$ (note the superscript $b$ for buffer; they are not the input $\mathbf{k}$ and $\mathbf{v}$), of capacity $c$, in general larger than the number of buckets. These arrays are designed for the explicit storage of keys and values, respectively, serve as buffers that support a natural SoA layout, and are exposed to the user for direct access and in-place modification. Now the query function can be rewritten as

$Q(k) = i, \quad \text{where } \mathbf{k}^b[i] = k, \; \mathbf{v}^b[i] = v$  (4)

and this version is ready for parallelization. At this stage, to combine $Q$ and $\mathbf{v}^b$, we can chain $f$ to manipulate values:

$f\big(\mathbf{v}^b\big[Q(\mathbf{k})\big]\big)$  (5)

which retains convenient properties such as array vectorization and advanced indexing.
When the input $\mathbf{k}$ is not a subset of the keys stored in the hash map, our formulation maintains its effectiveness through a simple masked extension:

$Q(k) = (i, m), \quad m = \begin{cases} \text{true}, & k \text{ is found}, \\ \text{false}, & \text{otherwise}, \end{cases}$  (6)

which is also ready for parallelization. The chaining of functions is now given by

$\mathbf{i}, \mathbf{m} = Q(\mathbf{k})$  (7)

$f\big(\mathbf{v}^b\big[\mathbf{i}[\mathbf{m}]\big]\big)$  (8)

using advanced indexing with masks. We can also select the valid queries with $\mathbf{k}[\mathbf{m}]$ without visiting $\mathbf{k}^b$. While our discussion has focused on the query function, the same applies to insertion.
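In tensor-library terms, Eqs. 7 and 8 reduce to two lines of advanced indexing. A NumPy sketch with hand-made query results (the arrays stand in for the outputs of $Q(\mathbf{k})$):

```python
import numpy as np

# Masked chaining of Eqs. (7)-(8): indices i and masks m returned by a query
# are consumed directly by advanced indexing. The arrays below stand in for
# the outputs of Q(k) on a hash map whose value buffer is v_b.
v_b = np.arange(10, dtype=np.float64)        # value buffer v^b
i = np.array([3, 0, 7, 0])                   # buffer indices (0 is a dummy for misses)
m = np.array([True, False, True, False])     # masks: which queries succeeded

v_b[i[m]] = np.sqrt(v_b[i[m]])               # f applied in place to found values only
```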
In essence, by converting the pair-first AoS to an index-first SoA format with the help of array buffers, we can conveniently chain high-level functions over hash map query and insertion. This simple change enables easy development on hash maps and unleashes their potential for fast prototyping and differentiable computation. However, the layout requires fundamental changes to the hash map data structure. With this in mind, we move on to illustrate how the ASH layer converts the AoS in native backends to our SoA layout.
4.3 Generic Backends
We start by converting stdgpu [47], a state-of-the-art generic GPU hash map, into a backend of ASH. stdgpu follows the convention of its CPU counterpart std::unordered_map by providing a templated interface. The underlying implementation is a classical bucket-linked-list structure with customized locks to avoid race conditions on GPU. To exploit the power of a generic hash map without reinventing the wheel, we reuse its operations over keys (i.e., the lock-guarded bucket and linked-list operations) and redirect the value mapping to our buffer $\mathbf{v}^b$.
A dynamic GPU hash map requires dynamic allocation and freeing of keys and values in device kernels. With a preallocated key buffer $\mathbf{k}^b$ and value buffer $\mathbf{v}^b$, we maintain an additional index heap $\mathbf{h}$, as shown in Fig. 3. The index heap stores buffer indices pointing into the buffers, where the heap top maintains the currently available buffer index in $\mathbf{h}$. The heap top starts at $0$, and is atomically increased at allocation and decreased at free. With the buffers and the dynamically changing $\mathbf{h}$, we instantiate a generic hash map in stdgpu whose templated value type is the buffer-index integer; the values stored in the backend are buffer indices used to access $\mathbf{v}^b$, which is exposed to the user.
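A sequential Python sketch of the index heap follows (a simplification of the device-side logic; the comments note where atomics are used in the parallel version):

```python
# Sequential sketch of the index heap; on the GPU, `top` is moved with
# atomicAdd/atomicSub so that concurrent threads receive distinct indices.
class IndexHeap:
    def __init__(self, capacity):
        self.heap = list(range(capacity))  # h: all buffer indices, initially free
        self.top = 0                       # heap top: next available slot in h

    def allocate(self):
        idx = self.heap[self.top]          # read the current free index
        self.top += 1                      # atomicAdd(top, 1) on device
        return idx

    def free(self, idx):
        self.top -= 1                      # atomicSub(top, 1) on device
        self.heap[self.top] = idx          # return the index to the pool

heap = IndexHeap(capacity=4)
i = heap.allocate()                        # i addresses k_b[i], v_b[i]
heap.free(i)
```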
4.3.1 Insertion
The insertion of a pair $(k, v)$ is now decoupled into two steps: i) insertion of $(k, i)$ into the backend hash map, where $i$ is the buffer index dynamically acquired from the heap top; and ii) insertion of $k$ and $v$ into the buffers at $\mathbf{k}^b[i]$ and $\mathbf{v}^b[i]$.
A naive implementation acquires a buffer index from $\mathbf{h}$ on every insertion attempt and frees it if the insertion fails because the key already exists. However, when running in parallel, the atomicAdd and atomicSub calls may conflict among threads, leading to race conditions. A two-pass insertion could resolve the issue: in the first pass, we allocate a batch of indices from $\mathbf{h}$ determined by the input size, attempt the insertions, and record the results; in the second pass, we return the indices from failed insertions to $\mathbf{h}$.
We adopt a more efficient one-pass lazy insertion. We first attempt to insert $(k, -1)$, with $-1$ as a dummy index, into the backend and observe whether it succeeds. If not, nothing needs to be done. Otherwise, we capture the returned pointer to the pair, trigger an index allocation from $\mathbf{h}$, and directly replace the dummy $-1$ with the allocated index. This significantly reduces overhead when key uniqueness is low (i.e., many duplicates exist among the keys to be inserted).
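A sequential sketch of the lazy scheme, with a Python dict standing in for the backend (on the GPU, the dummy insertion and replacement are performed atomically):

```python
# Sequential sketch of one-pass lazy insertion: the buffer index is allocated
# only after the backend confirms the key is new, so duplicate inserts are
# rejected without touching the index heap (atomics on the GPU).
def lazy_insert(backend, free_indices, key_buffer, value_buffer, key, value):
    if key in backend:                     # backend insert of (key, -1) failed
        return backend[key], False         # duplicate: nothing else to do
    backend[key] = -1                      # reserve the slot with a dummy index
    i = free_indices.pop()                 # heap allocation (atomic on device)
    backend[key] = i                       # replace the dummy -1 with i
    key_buffer[i], value_buffer[i] = key, value
    return i, True

backend, free = {}, list(range(4))
kb, vb = [None] * 4, [None] * 4
print(lazy_insert(backend, free, kb, vb, (0, 0, 0), 1.0))   # (3, True)
print(lazy_insert(backend, free, kb, vb, (0, 0, 0), 2.0))   # (3, False)
```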
4.3.2 Query
The query operation is simpler. We first look up the buffer index $i$ associated with $k$ in the backend. On success, we end up with $\mathbf{k}^b[i] = k$, and the target value is accessible to users through $\mathbf{v}^b[i]$.
4.4 Non-Generic Backends
While generic GPU hash maps have only recently become available, the parallel computation research community has long focused on more controlled setups where both $\mathcal{K}$ and $\mathcal{V}$ are limited to certain dimensions or data types. We seek to generalize this non-generic setup with our index heap and to verify its performance in more real-world applications. In this section, we show how ASH generalizes SlabHash [3], a warp-oriented dynamic GPU hash map that only allows insertions and queries of the Int32 data type.
An extension to generic key types is non-trivial for SlabHash, since its warp operations only apply to variables of limited word length. Our implementation extends the hash set variant of SlabHash, where only integer keys are maintained in the backend.
4.4.1 Generalization via Index Heap
The index heap is the core of generalizing the SlabHash backend. In brief, a generic key is represented by its associated buffer index in an integer-only hash set, allocated in the same way as discussed in Section 4.3.1. As illustrated in Fig. 4, all insertions and queries are redirected through the buffer indices to the actual keys and values via the index heap. The actual implementation, however, involves more intricate design changes.
Given a generic key $k$, we first locate its bucket via $h_n(k)$. Ideally, we can then allocate a buffer index $i$ at $\mathbf{h}$'s top and insert it into the linked list at that bucket in the integer-only hash set. The accompanying key and value are put in $\mathbf{k}^b[i]$ and $\mathbf{v}^b[i]$. In a query, we similarly locate the bucket first, then search for the key in the linked list by visiting $\mathbf{k}^b[i]$ via each stored index $i$.
4.4.2 MultiPass Insertion
Although query can proceed as described above, the lazy insertion from Section 4.3.1 is problematic in this setup. The main reason is that, while the race condition in inserting the index does not occur in warp-oriented insertions, copying the actual key to $\mathbf{k}^b$ requires global memory writes. These may not be synchronized among threads, as copying a multi-dimensional key takes several non-atomic instructions. As a result, the insertion of a key could be accidentally triggered when i) a duplicate key's index has already been inserted, but ii) the actual key has only been partially copied to the buffer $\mathbf{k}^b$; when unsynchronized, the partially written key fails the duplicate check, and the same key is unexpectedly inserted again. In practice, with more than 1 million keys inserted in parallel, such conflicts happen with small but nonzero probability. To resolve them, we split insertion into three passes (a sketch follows the list):
Pass 1: batch insert all keys into $\mathbf{k}^b$ by directly copying all candidates via batch-allocated indices from $\mathbf{h}$;

Pass 2: perform parallel hashing with the indices from pass 1. In this pass, keys are read-only in the global buffers and hence face no race conditions. Successful insertions are marked in a mask array;

Pass 3: batch insert values into $\mathbf{v}^b$ using the success masks, and free the remaining indices back to $\mathbf{h}$.
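A sequential sketch of the three passes, with a Python dict standing in for the integer-only backend and the parallel loops flattened:

```python
# Sequential sketch of the three-pass insertion used with the SlabHash-style
# backend: keys are fully copied before any hashing, so pass 2 only ever
# reads them, avoiding the partial-copy races described above.
def three_pass_insert(keys, values, key_buffer, value_buffer, free_indices):
    # Pass 1: batch-allocate indices and copy every candidate key.
    indices = [free_indices.pop() for _ in keys]
    for i, k in zip(indices, keys):
        key_buffer[i] = k

    # Pass 2: insert buffer indices into the backend; a key is a duplicate if
    # an earlier index already maps to an equal key in the buffer.
    backend = {}                       # stands in for the integer hash set
    masks = []
    for i in indices:
        k = key_buffer[i]
        masks.append(k not in backend)
        backend.setdefault(k, i)

    # Pass 3: copy values for successful inserts, free indices of failures.
    for i, v, ok in zip(indices, values, masks):
        if ok:
            value_buffer[i] = v
        else:
            free_indices.append(i)
    return indices, masks

kb, vb, free = [None] * 8, [None] * 8, list(range(8))
idx, m = three_pass_insert([(0, 0, 0), (0, 0, 0), (1, 1, 1)], [1, 2, 3], kb, vb, free)
print(m)   # [True, False, True]
```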
While there is overhead due to the multi-pass operation, it remains practical for a dynamic hash map. First, keys are relatively inexpensive to copy, especially spatial coordinates, while the more expensive copying of values is done without redundancy. Second, a dynamic hash map generally reserves sufficient memory for further growth, so inserting all candidate keys does not exceed the buffer capacity.
4.5 Rehashing and Memory Management
While buffers are represented as fixed-size arrays, the storage must grow to accommodate accumulated input data that can exceed the hash map's capacity, e.g., 3D points from an RGB-D stream. This triggers rehashing, where we adopt the conventional strategy of the C++ Standard Library: double the buffer size, collect all active keys and values, and batch insert them into the enlarged buffers.
Dynamic insertion can involve frequent frees and allocations of small memory blobs that are adjacent and mergeable. In view of this, we implement a tree-structured global GPU memory manager similar to PyTorch's [41].
4.6 Dispatch Routines
To enable bindings for non-templated languages, e.g., Python, the tensor interface is non-templated and takes dtypes and shapes as arguments. In the context of spatial hashing, we support arbitrary-dimensional keys by expanding dispatcher macros in C++. Float types have ill-defined precision behaviors on GPU; therefore, converting keys to integers at the desired precision is recommended before using the hash map.
We additionally dispatch values by their element byte size into the intrinsically supported vector types int, int2, int3, and int4. This adaptation accelerates trivially copiable value objects such as int3, and also supports non-trivially copiable value blocks (e.g., an array referenced by a void pointer). It improves the insertion of large value chunks by approximately a factor of 10.
4.7 Multi-Value Hash Map and Hash Set
ASH supports multi-value hash maps that store values organized as SoA, as well as hash sets, with minor changes to the framework.
In applications, a key can be mapped to multiple values. For instance, a 3D coordinate can be mapped to both a normal and a color in a point cloud. While the mapped values could be packed into an array of structures (AoS) to fit a plain hash map, code complexity would increase, since structure-level functions have to be implemented. We instead generalize the hash map by extending the single value buffer to an array of value buffers and looping over the properties per index during insertion. This simple change supports the storage of complex value types in SoA form, which allows easy vectorized query and indexing.
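For instance, with NumPy arrays standing in for the value buffers (the indices here are hand-made placeholders for what insertion would return):

```python
import numpy as np

# Multi-value SoA sketch: one buffer index addresses several property buffers.
# A real backend returns `indices`/`masks` from insertion; here we fake them.
capacity = 100
normals = np.zeros((capacity, 3))      # value buffer 0
colors = np.zeros((capacity, 3))       # value buffer 1

indices = np.array([4, 17, 42])        # buffer indices of three inserted keys
normals[indices] = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
colors[indices] = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]])
```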
A hash set, on the other hand, is a simplified hash map: an unordered set that stores unique keys. It is useful for maintaining a set by rejecting duplicates, e.g., in point cloud voxelization. By removing value storage and ignoring value insertion, a hash map becomes a hash set.
5 Experiments
We start with synthetic experiments showing that ASH, with its optimized memory layout, increases performance while improving usability. All experiments in this section are conducted on a laptop with an Intel i7-6700HQ CPU (4 cores, 8 threads) and an Nvidia GTX 1070 GPU. In all experiments, the hash map capacity is set equal to the number of input keys (regardless of duplicates). Each reported time is an average over 10 trials.
5.1 Spatial Hashing with Generic Backend
The first experiment compares vanilla stdgpu against ASH with the stdgpu backend (ASH-stdgpu). For fairness, we extend the stdgpu examples so that arrays of iterators and masks are returned for in-place manipulation. The number of buckets and the load factor are determined internally by stdgpu.
Setup 1.

We test randomly generated 3D spatial coordinates mapped to float value blocks of varying sizes, varying the key $k$ (3D integer coordinates), value $v$ (float blocks), capacity $c$, and uniqueness $\sigma$, where $\sigma$ indicates the ratio of the number of unique keys to the total number of keys being inserted or queried.
Fig. 5 illustrates the comparison between vanilla stdgpu and ASH-stdgpu. For the insert operation, ASH-stdgpu is significantly faster than stdgpu when $\sigma$ is low, and the performance gain increases with the value byte size. This is mainly due to the SoA memory layout and the lazy insertion mechanism, where a lightweight integer is inserted during an attempt instead of the actual value. At high input uniqueness $\sigma$, ASH-stdgpu maintains the performance advantage for low and medium value sizes, and its performance is comparable to stdgpu for large value sizes, indicating that our dispatch pattern for copying values helps in high-throughput scenarios. For the find operation, ASH-stdgpu is consistently faster than vanilla stdgpu under both high and low key uniqueness.
In addition to insert and find, we introduce a new activate operation. The activate operation "activates" the input keys by inserting them into the hash map and obtaining the associated buffer indices, without writing any values. This is especially useful when the initialization is predetermined and can be applied element-wise in place afterwards; examples include TSDF voxel blocks (zeros) and multi-layer perceptrons (random initialization). The activate operation is absent in most existing hash maps and is only available as hard-coded functions [37, 42, 12]. With the activate operation, we conduct an ablation study comparing the insertion time of the keys alone versus that of both keys and values. Fig. 6 compares the runtime of insert and activate in ASH-stdgpu. The key, value, capacity, and uniqueness choices are the same as in Setup 1. We observe that while insertion time increases with the value size, activation time remains stable.
5.2 Integer Hashing with Non-Generic Backend
Next, we compare ASH with the SlabHash backend (ASH-slab) against vanilla SlabHash. Since SlabHash only supports integer keys, we limit ASH-slab to the same integer key types here. The number of buckets is fixed in proportion to the capacity, since this is empirically the best factor when ASH-slab is applied to non-generic and generic tasks. Since vanilla SlabHash only supports data I/O from the host, we include the host-device transfer time when measuring the performance of ASH-slab.
Setup 2.

We test randomly generated scalar integer keys mapped to scalar float values, varying the key $k$, value $v$, capacity $c$, and uniqueness $\sigma$ as in Setup 1.
As shown in Fig. 7, although ASH-slab does not use the non-blocking warp-oriented operations of SlabHash (in order to support generic key and value types), its insert is still comparable to vanilla SlabHash, which is optimized exclusively for integers. The performance drop of ASH-slab as $\sigma$ increases is expected: the overhead of multi-pass insertion grows correspondingly.
It is worth mentioning that, with an improved global memory manager, the construction of an ASH-slab hash map takes less than 1 ms under all circumstances, while vanilla SlabHash consistently takes around 30 ms for its redundant slab memory manager. In practice, where a hash map is constructed and used once (e.g., voxelization), ASH-slab is the more practical solution.
5.3 Ablation Between Backends
We now conduct an ablation study between the ASH backends, ASH-stdgpu and ASH-slab, with arbitrary input key-value types beyond integers. The experimental setup follows Section 5.1.
In Fig. 8, we can see that ASH-stdgpu outperforms ASH-slab in most circumstances with the 3D coordinate keys and varying-length values common in real-world applications. While the warp-oriented operations heavily used in SlabHash enjoy intrinsic acceleration, they sacrifice operation granularity: threads can only move on to the next task once all operations in a warp (of 32 threads) have finished. As a result, early termination upon insertion failure is less likely in a warp-oriented hash map. If the data layout is not well suited to the intrinsic operations (e.g., low-uniqueness input, or keys with a long word width), the performance drop can be significant.
This observation is more apparent for insertion under varying input densities. With a relatively small value size and high uniqueness, ASH-slab performs better. When uniqueness is low, however, each thread in ASH-slab still has to finish a similar workload before terminating, while ASH-stdgpu can reject many failed insertions early and move on to the following workload. As of now, ASH-slab is suitable for the voxel downsampling application, while ASH-stdgpu is better for the other tasks. We therefore set stdgpu as the default backend for ASH in the remaining sections.
5.4 Code Complexity
We now study usage at the user end. First of all, the ASH framework, regardless of the backend used, is compiled as a library. A C++ developer can include the header, build the example directly with a CPU compiler, and link against the precompiled library with a light tensor engine. An equivalent Python interface is provided via pybind [22], as shown in Fig. 1.
In contrast, using SlabHash's interface with an input array from host memory requires a CUDA compiler, along with manual bucket-linked-list configuration. For further performance improvement, detailed memory management has to be done manually via cudaMalloc, cudaMemcpy, and cudaFree. stdgpu provides a built-in memory manager but requires writing device-side kernel functions with explicit execution.
In a query operation, found values are returned by copy in SlabHash, so in-place modification requires modifying the library itself. stdgpu exposes iterators, so customized kernels need to be implemented to return arrays of iterators and masks for further operations. The LoC required for the same functionality in C++ is listed in Table II.
                      stdgpu   SlabHash   ASH
Device-code free?     ✗        ✓          ✓
CUDA-compiler free?   ✗        ✗          ✓
Construct (LoC)       3        9          3
Find (LoC)            22       2          2
Insert (LoC)          27       1          2
6 Applications
We now present a number of applications and ready-to-use systems in 3D perception that demonstrate the power of ASH, with fewer LoC and better performance. The presented applications include:

Voxelization;

Real-time dense volumetric SLAM, including TSDF integration and surface reconstruction;

Non-rigid registration and deformation;

Joint geometry and appearance refinement with shape-from-shading (SfS).
The first two experiments are conducted on a machine with an Intel i7-6700HQ CPU (4 cores, 8 threads) and an Nvidia GeForce GTX 1070 GPU. The rest are done on an Intel i7-7700 CPU (4 cores, 8 threads) and an Nvidia GeForce GTX 1080 Ti GPU.
6.1 Voxelization
Setup 3.

In voxelization, a hash map maps a point cloud's discretized coordinates to their natural array indices, and the hash map capacity is the point cloud size.
Voxelization is a core operation for discretizing input data. It is essential for sparse convolution [10, 9] at the quantization preprocessing stage, and is commonly used in coarse-to-fine point cloud registration to generate a point cloud "pyramid" [56].
Voxelization is a natural task for parallel hashing, as the essence of the operation is to discard duplicates at grid points. To achieve this, we first discretize the input by converting coordinates from the continuous meter metric to voxel units. A simple hash map insertion then eliminates the duplicates and maps them to the remaining unique coordinates. (A hash set can be used if the correspondences are not required.) The returned indices can be reused to trace other properties associated with the input, such as colors and point normals. Pseudocode for voxelization is given in Algorithm 1.
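A NumPy rendition of the idea (Algorithm 1 itself runs on the GPU hash map; here np.unique plays the role of the duplicate-rejecting insertion):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Voxel-downsample a point cloud; np.unique stands in for the parallel
    hash map insertion that rejects duplicate discretized coordinates."""
    coords = np.floor(points / voxel_size).astype(np.int64)  # meters -> voxel units
    _, first_indices = np.unique(coords, axis=0, return_index=True)
    return points[first_indices], first_indices  # indices trace colors/normals too

points = np.random.rand(100000, 3) * 10.0        # synthetic 10 m x 10 m x 10 m scene
down, idx = voxelize(points, voxel_size=0.05)
print(down.shape)                                # one representative point per 5 cm voxel
```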
We compare voxelization implemented in ASH against two popular implementations: MinkowskiEngine [10] on CUDA and Open3D [56] on CPU. Our experiments are conducted on a large scene input typical for scene perception, and on a small fragment of the scene typical for an RGB-D input frame, as shown in Fig. 9.
To evaluate performance, we vary the voxel size from 5 mm to 5 cm, which spans the typical spectrum of voxelization applications, from dense reconstruction to feature extraction. In Fig. 10, we can see that our implementation consistently outperforms the baselines for inputs at both scales. Meanwhile, measuring the LoC written in C++ (the Python wrappers are all one-liners) required for the functionality given the hash map interface, we observe that ASH requires about 1/3 of the LoC, as shown in Table III.

                MinkowskiEngine   Open3D   ASH
Voxelization    71                72       28
6.2 Real-Time Volumetric Scene Reconstruction
6.2.1 Truncated Signed Distance Function
Scene representation with a truncated signed distance function (TSDF) from a sequence of 3D inputs was introduced in [11] and adapted to RGB-D data in KinectFusion [36]. It takes a sequence of depth images with their poses as input, and seeks to estimate the signed distance, an implicit function value, at each point $\mathbf{x}$. The signed distance measured for frame $i$ is given by (details including depth masking and the projective pinhole camera model are omitted for clarity and can be found in KinectFusion [36]):
$d_i(\mathbf{x}) = D_i\big(\Pi(\mathbf{T}_i \mathbf{x})\big) - z(\mathbf{T}_i \mathbf{x})$  (9)

$\tilde{d}_i(\mathbf{x}) = \Psi\big(d_i(\mathbf{x})\big)$  (10)

where $\Pi$ projects the 3D point to 2D to obtain a range reading $D_i(\cdot)$ after the rigid transformation $\mathbf{T}_i$, and $z(\cdot)$ takes the depth of the transformed point. To reject outliers, a truncation function $\Psi$ is applied to $d_i$. There are multiple variations of $\Psi$ and of the definition of the signed distance [5]; in this paper, we follow the convention of KinectFusion [36]. With a sequence of inputs, the per-point signed distance can be estimated in a least-squares sense with the closed-form solution
$d(\mathbf{x}) = \dfrac{\sum_i w_i(\mathbf{x})\, \tilde{d}_i(\mathbf{x})}{\sum_i w_i(\mathbf{x})}$  (11)

where $w_i$ is a weight selected depending on view angles and distances [5]. In other words, with a sequence of depth inputs and their poses, we can measure the TSDF at any point in 3D within the camera frustums. We can also rewrite Eq. 11 incrementally:

$d^{(i+1)}(\mathbf{x}) = \dfrac{d^{(i)}(\mathbf{x})\, w^{(i)}(\mathbf{x}) + \tilde{d}_{i+1}(\mathbf{x})\, w_{i+1}(\mathbf{x})}{w^{(i)}(\mathbf{x}) + w_{i+1}(\mathbf{x})}, \quad w^{(i+1)}(\mathbf{x}) = w^{(i)}(\mathbf{x}) + w_{i+1}(\mathbf{x})$  (12)

where $w^{(i)}$ is the accumulated weight paired with $d^{(i)}$.
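The incremental update of Eq. 12 is a per-voxel weighted running average; a NumPy sketch on a single flattened voxel block (block shape and truncation value follow Section 6.2.6):

```python
import numpy as np

def integrate(tsdf, weight, d_new, w_new):
    """Incremental TSDF update (Eq. 12): a per-voxel weighted running average.
    `tsdf`/`weight` hold the accumulated d^(i), w^(i); `d_new`/`w_new` are the
    truncated distance and weight observed in the new frame."""
    tsdf = (tsdf * weight + d_new * w_new) / (weight + w_new)
    weight = weight + w_new
    return tsdf, weight

tsdf, weight = np.zeros(16 ** 3), np.zeros(16 ** 3)              # one flat voxel block
d_obs = np.clip(np.random.randn(16 ** 3) * 0.01, -0.04, 0.04)    # truncated at 4 cm
tsdf, weight = integrate(tsdf, weight, d_obs, np.ones(16 ** 3))
```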
6.2.2 Spatially Hashed TSDF Volume
Setup 4.

In a scene represented by a volumetric TSDF grid, a hash map maps coarse voxel block coordinates to the voxel blocks' TSDF data structures, covering small- to large-scale indoor scenes. Here $r$ denotes the voxel block resolution, i.e., the number of voxels per block side (typically 8 or 16).
While recent neural representations utilize multi-layer perceptrons to approximate the TSDF in continuous space [7], classical approaches use discretized voxels. Such representations have a long history and are ready for real-world, real-time applications; they can also provide data for training neural representations.
The state-of-the-art volumetric discretization for TSDF reconstruction is spatial hashing, where voxels are allocated around surfaces on demand at a fine voxel resolution (5.8 mm in our experiments, Section 6.2.6). While it is possible to directly hash high-resolution voxels, the access pattern would be less cache-friendly, as neighboring voxels are scattered across the hash map. A hierarchical structure is a better layout, where small dense voxel grids (of shape $r^3$) are the minimal units in the hash map, and fine-grained access is redirected to simple 3D indexing. In other words, a voxel at point $\mathbf{x}$ can be indexed by a coupled hash map lookup and a direct local addressing:

$\mathbf{x}_b = \big\lfloor \mathbf{x} / (l \cdot r) \big\rfloor$  (13)

$\mathbf{x}_v = \big\lfloor \mathbf{x} / l \big\rfloor \bmod r$  (14)

where $l$ is the voxel size and $r$ is the voxel block resolution as described in Setup 4; $\mathbf{x}_b$ serves as the hash map key of the voxel block, and $\mathbf{x}_v$ is the local voxel index within the block.
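A NumPy sketch of Eqs. 13 and 14 (the parameter values follow the fast-mode setup in Section 6.2.6):

```python
import numpy as np

def locate_voxel(x, voxel_size=0.0058, block_resolution=16):
    """Split a 3D point into (block key, local voxel index) per Eqs. (13)-(14).
    The block coordinate is the spatial hash map key; the local index
    addresses the small dense grid stored as the value."""
    voxel = np.floor(x / voxel_size).astype(np.int64)   # global voxel coords
    block = voxel // block_resolution                   # Eq. (13): hash key
    local = voxel % block_resolution                    # Eq. (14): dense index
    return block, local

block, local = locate_voxel(np.array([1.0, -0.2, 0.73]))
print(block, local)
```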
While previous implementations [37, 42, 12] achieve remarkable performance, they lack modularized designs: geometry processing and hash map operations are coupled due to the absence of a well-designed parallel GPU hash map. One deficiency of such designs is unsafe parallel insertion, where the capacity of a hash map can be exceeded. Another is ad hoc, recurring low-level linked-list access inside geometry processing kernels. Our implementation demonstrates the first modularized, decoupled pipeline in which safe hash map operations are used without any ad hoc modifications.
6.2.3 Allocation and TSDF Integration
Setup 5.

For an input depth image, a hash map maps the unprojected point coordinates to the active block indices described in Setup 4. The capacity of this hash map is on the order of the input image resolution, of which only a small fraction are valid entries.
In the modularized design, we first introduce a double hash map structure for voxel block allocation and TSDF estimation. Voxel block allocation identifies surface points from the depth input and computes block coordinates with Eq. 13. Intuitively, these could be directly inserted into the global hash map described in Setup 4. This is achievable in an ad hoc implementation where the core of the hash map is modified at the kernel level and unsafe insertion is allowed [37, 42]. However, in a modularized setup, this leads to problems: a VGA-resolution depth input contains about $3 \times 10^5$ points and easily exceeds the empirical global hash map capacity. As mentioned above, rehashing would be triggered under such circumstances, which is both time- and memory-consuming, especially for a hash map with memory-demanding voxel blocks as values.
To address this issue without changing the low-level implementation or sacrificing safety, we introduce a second hash map: the local hash map from Setup 5. It is similar to the one used in voxelization, mapping 3D coordinates unprojected from depth at voxel block precision to integer indices. With this setup, a larger input capacity is acceptable, as the local hash map is lightweight and can be cleared or reconstructed from scratch per frame.
There are two main benefits of the local hash map: it converts the input from the raw point scale to the voxel block scale, which is safe for the global hash map without rehashing; and, as a byproduct, it keeps track of the active voxel blocks of the current frame, which can be directly used in the subsequent integration and ray casting. The local and global hash maps are linked through in-place value modification, where a query of a coordinate in the local map is redirected to the global map. Fig. 11 shows the roles of the two hash maps, and Algorithm 2 details their construction and interaction.
Algorithm 2 summarizes the usage of hash maps in voxel block allocation. TSDF integration can then be implemented following Eq. 12 as a pure geometry function, either in a low-level CUDA kernel or in a high-level vectorized Python script. In other words, spatial hashing is detached from the core geometric computation, providing more flexibility for performance optimization.
6.2.4 Ray Casting and Tracking
With known camera poses, TSDF integration is ready to use with the operations above. We now extend it to a dense SLAM system with ray casting and localization.
Ray casting renders depth images by marching rays through the spatially hashed volumes and finding zero-crossing interfaces. It provides both a visualization of the currently reconstructed scene and frame-to-model tracking for a dense SLAM system. A vectorized implementation is described in Algorithm 3, where hash map queries and geometric computations are conducted alternately in a batched fashion, which is ready for differentiable rendering when incorporated into an autodiff engine like PyTorch.
It is worth noting that while the vectorized ray casting is easy to implement and enables autodiff, an optimized version with device-level iterative hash map access is better suited to a real-time reconstruction system. Here, ASH-stdgpu is preferred over ASH-slab, since warp-oriented operations entangle groups of rays and forbid early termination once surfaces are found.
Further accelerations can improve ray casting speed. As discussed in [42], adaptive spherical ray casting and a precomputed min-max range estimate constrain the search range and boost performance. The latter can be obtained by simply reusing the active keys collected in Algorithm 2 and projecting them, without any involvement of hash maps.
In a performance-first application, we can squeeze more out of our double-hash-map architecture. By default, the global hash map is used in Algorithm 3. Since the local hash map is associated with the global hash map through the active indices, we can replace the global hash map with the local one, accompanied by the value buffer of the global hash map. With this simple change, we query the local hash map and access the global hash map in place, without touching the geometric computations. This does sacrifice rendering quality, as shown in Fig. 12, since out-of-frustum voxel blocks are ignored. In practice, however, it only affects boundary regions, does not affect tracking accuracy, and still produces high-quality scene reconstructions; see Fig. 13.
6.2.5 Surface Extraction
A volumetric scene reconstruction is not usable by most software and solutions until the results are exported as a point cloud or triangle mesh. Hence, we implement a variation of Marching Cubes [31] that extracts vertices and triangle faces at zero crossings in the volume. In a spatially hashed implementation, boundary voxels of voxel blocks are harder to process, since queries of neighboring voxel blocks are frequent and shared vertices among triangles are hard to track.
One popular practice is to simply emit vertices at isosurfaces and disregard duplicates, but this usually results in a heavily redundant mesh [27, 42] or requires time-consuming postprocessing to merge vertices [37]. Another approach introduces an auxiliary volumetric data structure to atomically record vertex-voxel associations, but such implementations are overly complex and require frequent low-level hash map queries coupled with surface extraction [13, 12].
Now that we have a unified hash map interface, we simplify the voxel block neighbor search routine [12] and set up a 1-radius neighbor lookup table in advance, as described in Algorithm 4. Surface extraction is then detached from hash map access and can be optimized separately. As low-hanging fruit, point cloud extraction is implemented with the same routine by skipping the triangle generation step. In fact, surface extraction of the medium-scale scene shown in Fig. 13 takes less than 100 ms, making interactive surface updates possible in a real-time reconstruction system.
6.2.6 Switchable Modes and Reconstruction System
To demonstrate the best performance we can achieve by making full use of the hash map, and for a fair comparison against the baselines [37, 42, 12], we develop a fast mode for the reconstruction system. In this setup, we use a less aggressive ray-based allocation strategy [42] and voxel blocks without color information, consistent with the baselines. We compare the fast mode against the state-of-the-art implementations with the same parameter setup: 5.8 mm voxel size, identical voxel block resolution, 4 cm TSDF truncation distance, and a 0.2-3 m valid depth range. Performance is profiled on the lounge scene shown in Fig. 13, since the sequence contains multiple loop closures and tracking is stable, minimizing disturbance on the reconstruction side. Fig. 15 compares both the runtime and the LoC (all code is reformatted with clang-format using a modified Google style) of the modules across the baselines. In most cases, we see a significant performance gain with fewer LoC, thanks to the elimination of redundant hash map lookups in the geometry kernels.
Operation          Fast (ms)   Complete (ms)
Integration        0.75        1.80
Ray casting        0.97        3.72
Mesh extraction    146.47      190.57
Point extraction   30.17       87.78
In addition to the fast mode, we also implement a complete mode that stores richer volumetric information, including color, and reduces noise by integrating depth into TSDF voxel blocks from a 1-radius-neighbor allocation [56]. A comparison of the two modes is shown in Table IV. The introduction of color requires double the memory and triple the computation in trilinear interpolation for ray casting and surface extraction; hence the complete mode is slower, but it provides a better user experience. Fig. 14 shows the interactive reconstruction system, where real-time point cloud reconstruction is provided in the complete mode. The system runs at interactive frame rates on a mid-range laptop, including realistic rendering. Note that to switch between the fast and complete modes, we only need to change the allocation function and several parameters (several dozen lines in total) without rewriting the core.
To conclude this subsection, we presented a dense SLAM system with a modular design that separates hash maps from geometric operations. Our system is faster, requires fewer LoC, and supports an easy tradeoff between speed and fidelity. It also provides the flexibility to optimize the geometry functions without touching the hash map interface.
6.3 Non-Rigid Volumetric Deformation
The aforementioned modules are suitable for medium-scale reconstruction with around 10K input VGA-resolution frames. For input sequences beyond that range, or with heavy noise, drift is inevitable, and offline reconstruction [8] is therefore preferred. In a submap-based offline reconstruction system, long sequences are split into smaller subsets, each yielding a submap reconstruction with less drift. A global pose graph is then optimized after robust registration of the submaps to refine their poses. For details, we refer readers to state-of-the-art reconstruction systems [8, 12].
We can simply reuse the integration and surface extraction components from the previous subsection in an offline system, with fast processing for fragment estimation and scene reconstruction [8, 12]. However, issues persist in challenging scenes, e.g., heavy misalignment due to strong simulated noise in the Augmented ICL dataset [8], and the artifacts present in the large-scale indoor RGB-D LiDAR dataset [40], as shown in Fig. 16.
To deal with this, non-rigid volumetric deformation was presented in Simultaneous Localization and Calibration (SLAC) [54, 53] to jointly optimize camera poses and refine geometric details through non-rigid deformation. In brief, SLAC embeds fragment points in a control grid $\mathcal{G}$ for deformation, and optimizes a loss parameterized over the grid with an as-rigid-as-possible regularizer of the form

$E(\mathcal{G}) = \sum_{(\mathbf{p}, \mathbf{q})} \big\| D(\mathbf{p}) - D(\mathbf{q}) \big\|^2 + \lambda \sum_{\mathbf{g}} \sum_{\mathbf{g}' \in \mathcal{N}(\mathbf{g})} \big\| \big(\mathbf{g}' + \mathcal{G}(\mathbf{g}')\big) - \big(\mathbf{g} + \mathcal{G}(\mathbf{g})\big) - \mathbf{R}_{\mathbf{g}}\,(\mathbf{g}' - \mathbf{g}) \big\|^2$  (15)

where $(\mathbf{p}, \mathbf{q})$ are corresponding 3D points between submaps obtained by nearest-neighbor search after pose graph optimization, and $D$ deforms a point by embedding it in the control grid and trilinearly interpolating the deformations of the neighboring grid points. $\mathcal{G}(\mathbf{g})$ accesses the offset value at a grid point $\mathbf{g}$, and $\mathbf{R}_{\mathbf{g}}$ is the rigid rotation that minimizes the residual locally, where $\mathbf{g}' \in \mathcal{N}(\mathbf{g})$ ranges over the 1-ring neighbors of $\mathbf{g}$. It controls local distortions in the as-rigid-as-possible regularizer.
This problem formulation is, however, complicated to realize in code; in the original implementation, the deformation grid is a simplified regular 3D array, and out-of-bound points are discarded during optimization. As of today, SLAC has never been reproduced apart from the original implementation. We observe that ASH can apply operations similar to those for TSDF grids to generate a spatially hashed control grid.
Setup 6.

A volumetric deformation hash map maps deformation grid coordinates to grid position offsets.
Equipped with ASH, the non-rigid deformation can be written in a few lines by combining Algorithm 1 and modifying Algorithm 4 for 1-cube nearest-neighbor search. We first voxelize the input point cloud with the deformation grid size. Then, instead of the 1-radius nearest neighbors (27 entries), we look up the 1-cube nearest neighbors (8 entries), where a point is enclosed by the cube formed by its neighboring grid points. The interpolation ratios can be computed jointly. We can also adapt the 1-radius neighbor search to the 1-ring neighbors used by the regularizer. The embedding of a fragment point cloud is visualized in Fig. 17, where the edges indicate the association between points and grid points, and the colors show the interpolation ratios. Note that this visualization is also made easy by the simple interface of ASH.
With this parameterization, we reproduce SLAC after rewriting the nonlinear least-squares solver, jointly optimizing the grid points and submap poses given the correspondences. In addition, the hash map can be saved to and loaded from disk for further processing, including deformed TSDF integration that reconstructs the scene from the deformed input depth images embedded in the grid. Experiments show that with a modularized design and a spatial hash map, we are able to reproduce SLAC and reduce artifacts after optimization, as shown in Fig. 16.
We can see a gain in performance with fewer LoC in Table V on the livingroom 1 scene with heavy simulated noise (to control the experiment, we use the initial fragment pose graph from the baseline implementation). Note that while the hash map generalizes from a bounded to an unbounded deformation grid, the LoC and time devoted to the core nonlinear least squares, which takes up the greatest portion of the optimization, are not reduced. Our deformation and integration speed per frame is roughly an order of magnitude faster (Table V), which is critical for large-scale sequences of thousands of frames. While being faster and easier to develop, our system achieves higher reconstruction quality in terms of precision, recall, and F-score at a fixed distance threshold [40].

Operation               Original SLAC       ASH-SLAC
                        Time (ms)   LoC     Time (ms)   LoC
Non-rigid optim.        2041.1      1585    1982.1      1535
Deformed integration    125.38      944     10.62       446

Reconstruction quality
Precision (%)           29.19               36.10
Recall (%)              51.44               61.34
F-score (%)             37.24               45.45
6.4 Shape-from-Shading Refinement
SLAC reduces artifacts in large-scale scenes. For small-scale objects, volumetric reconstruction outputs smooth surfaces, but fine details are often impaired by the weight averaging of the TSDF.
(Figure: (a) normal map; (b) color map; (c) rendering with a ceramic material.)
Shape-from-Shading (SfS) refines details by jointly optimizing the volumetric TSDF given the initial geometry and appearance [57]. It takes as input a reconstructed volumetric TSDF grid with a set of high-resolution keyframe RGB images and their poses, and outputs a jointly optimized TSDF $d$ and albedo $a$ through an image formation model of the form

$E(d, a) = \sum_{\mathbf{x}} \big\| \nabla B(\mathbf{x}) - \nabla I\big(\Pi(\mathbf{T}\, \mathbf{s}(\mathbf{x}))\big) \big\|^2 + \lambda_s E_s + \lambda_v E_v + \lambda_a E_a$  (16)

where the estimated voxel-wise appearance is computed by $B(\mathbf{x}) = a(\mathbf{x}) \cdot \mathrm{SH}\big(\mathbf{n}(\mathbf{x})\big)$ (SH stands for spherical harmonics), and is associated with the closest surface point

$\mathbf{s}(\mathbf{x}) = \mathbf{x} - d(\mathbf{x})\, \mathbf{n}(\mathbf{x})$  (17)

which is projected to the image $I$ through $\Pi$ after a rigid transformation $\mathbf{T}$. Here the voxel-wise normal $\mathbf{n}$ is directly derived from $d$ with numerical gradients:

$\mathbf{n}(\mathbf{x}) = \nabla d(\mathbf{x}) \,/\, \|\nabla d(\mathbf{x})\|$  (18)

Similar to SLAC, we use $d(\mathbf{x})$ and $a(\mathbf{x})$ to access the TSDF and albedo values at a grid point $\mathbf{x}$. $\lambda_s$, $\lambda_v$, and $\lambda_a$ are coefficients for regularizing smoothness through the Laplacian, stability, and piecewise albedo constancy via a weighted chromaticity regularizer, respectively [57].
While the image formation model is straightforward, the underlying data structure, similar to SLAC, can be complex to implement, mainly because of the prevalent nearest-neighbor searches in normal computation and neighbor-voxel regularizers. As a result, enabling such a system without a modern hash map requires a low-level C++ implementation, and is consequently limited to the low-level Ceres solver [1] for autodiff in optimization. Furthermore, the spatially hashed voxels have to be bounded to reduce computational cost [32].
Now equipped with ASH, we provide a simplified solution built upon the hash map and advanced indexing. Unlike SLAC, which requires time-consuming deformable TSDF re-integration for the final scene reconstruction, SfS allows reusing the accelerated surface extraction from the TSDF grids without further optimization. We therefore implement the SfS pipeline in pure Python as an example of fast prototyping of a differentiable rendering pipeline. Running on the GPU, we lift the constraint of a user-defined bounding box and optimize the full reconstructed surface.
Since extreme performance is not required here, we drop the hierarchical volumetric layout and use a simple voxel-based hash map:

Setup 7.

A voxel indexer is given by a hash set over discretized voxel coordinates.
With this setup, we can reuse the SLAC code to look up the 1-ring neighborhood for normal estimation and Laplacian regularization. However, one more lookup is required: since we minimize the difference of appearance gradients in Eq. 16, we need to find the 1-ring neighbors that themselves have complete 1-ring neighborhoods. In other words, we have to find the intersection of two sets. While NumPy provides this functionality for 1D arrays through sorting, our hash map enables an unordered intersection that generalizes to multi-dimensional input:
Setup 8.

With two input sets $\mathbf{a}$ and $\mathbf{b}$, the intersection is given by the following operations: initialize a hash set with $\mathbf{a}$; query $\mathbf{b}$ and obtain the success masks $\mathbf{m}$; return $\mathbf{b}[\mathbf{m}]$.
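A sketch of Setup 8, with a Python set standing in for the GPU hash set:

```python
import numpy as np

def unordered_intersect(a, b):
    """Setup 8 with a Python set standing in for the GPU hash set: insert a,
    query b for a success mask, and return b[m]. Works on multi-dimensional
    keys (rows) without any sorting."""
    table = {tuple(k) for k in a}                 # initialize hash set with a
    m = np.array([tuple(k) in table for k in b])  # query b -> success masks
    return b[m]

a = np.array([[0, 0, 0], [1, 0, 0], [2, 2, 2]])
b = np.array([[1, 0, 0], [5, 5, 5], [2, 2, 2]])
print(unordered_intersect(a, b))                  # [[1 0 0] [2 2 2]]
```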
After the data association is found and the SH parameters are estimated in a preprocessing step, all the terms in Eq. 16 reduce to a trivial combination of indexing and arithmetic operations. We take advantage of PyTorch's autodiff and backpropagate the gradients through the built-in differentiable indexing layer; ADAM [26] is used as the optimizer. Thus, the core volumetric SfS pipeline [57] is reproduced in pure Python.
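A toy PyTorch sketch of the pattern (buffers, indices, and the residual are placeholders; the point is that gradients flow through indexing back to the flat buffers):

```python
import torch

# Differentiable indexing sketch: gradients flow from a loss over gathered
# voxel values back to the flat parameter buffers, which is all Eq. (16)
# needs once neighbor indices are precomputed by the hash map.
tsdf = torch.randn(1000, requires_grad=True)      # flat TSDF buffer d
albedo = torch.rand(1000, requires_grad=True)     # flat albedo buffer a
idx = torch.tensor([3, 42, 317])                  # buffer indices from the hash map

shading = albedo[idx] * tsdf[idx]                 # toy appearance term
loss = (shading - 0.5).pow(2).sum()               # toy photometric residual
loss.backward()                                   # gradients land only at tsdf[idx]
print(tsdf.grad.nonzero().flatten())              # indices 3, 42, 317 (almost surely)
```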
An extension with spatially varying lighting [32] can also be implemented easily, wrapped in another hash map:
Setup 9.

Spatially varying spherical harmonics (SVSH, bands = 3) are described by a hash map that maps lighting subvolume coordinates to the corresponding SH coefficients.
The embedding of an active voxel in the SVSH map is identical to SLAC, with 1-cube neighbors for the data term and 1-ring neighbors for the regularizer. A further description is omitted here, as the formulation and implementation are similar to Eq. 16 [32].
With both SfS and SVSH optimization implemented (pose optimization and voxel upsampling are currently disabled) [32], we show results on the lion scene in Fig. 18. Without voxel grid upsampling, both the geometry and the appearance details are sharper. Regarding performance and code complexity, Table VI shows that our code is much shorter in pure Python, and faster per iteration thanks to the CUDA autodiff engine in PyTorch. Note that Ceres is a second-order optimizer on CPU that empirically converges in fewer iterations than the first-order ADAM optimizer; in practice, however, we found that ADAM converges well within 50 iterations against the preset 10 iterations of the nonlinear least-squares solver. Thus, the total optimization time of our implementation is still lower, despite processing more voxels (recall that we do not require an additional bounding box).
We also evaluate reconstruction quality in Table VI. We render our optimized mesh with the keyframe camera extrinsic and intrinsic parameters and compute the RMSE against the raw input images. For the baseline [32], we follow a similar procedure and render the optimized mesh (not upsampled, for fairness) with the refined camera parameters. We use the masks provided by the baseline to ensure the same regions of interest. The results show that our implementation achieves improved RMSE despite the simplified development.
Operation        Original Intrinsic3D      ASH-Intrinsic3D
                 Time (s)    LoC           Time (s)    LoC
SVSH optim.      0.503       605           0.092       254
Joint optim.     147.323     7399          0.916       1416

Rendering quality
RMSE mean        0.677                     0.627
RMSE std         0.095                     0.120
Performance per epoch (with 26 high-resolution keyframes) and LoC (top), and rendering quality (bottom), comparing ASH-Intrinsic3D against the original implementation [32] on the lion scene [32]. ASH-Intrinsic3D is faster with fewer LoC, and yields comparable rendering from the refined reconstruction.

7 Conclusions and Future Work
We presented ASH, a performant and easy-to-use framework for spatial hashing. Both synthetic and real-world experiments demonstrate the power of the framework. With ASH, users can achieve the same or better performance in 3D perception tasks while writing less code.
There are various avenues for future work. At the architecture level, we seek to introduce the open-addressing variants [2, 24] of parallel hash maps for flexibility and potentially higher-performance static hash maps. At the low level, we plan to further optimize the GPU backend and accelerate the CPU counterpart, potentially with cache-level optimization and code generation [23, 21]. We also plan to apply ASH to sparse convolution [10, 49] and neural rendering [14, 43], where spatially varying parameterizations are exploited.
ASH accelerates a variety of 3D perception workloads. We hope that the presented framework will serve both research and production applications.
References
[1] Ceres solver. http://ceres-solver.org
[2] Real-time parallel hashing on the GPU. In ACM SIGGRAPH Asia, 2009.
[3] A dynamic hash table for the GPU. In IEEE IPDPS, 2018.
[4] Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[5] Direct camera pose tracking and 3D reconstruction with signed distance functions. In RSS, 2013.
[6] Robin Hood hashing. In Annual Symposium on Foundations of Computer Science, 1985.
[7] Deep local shapes: learning local SDF priors for detailed 3D reconstruction. In ECCV, 2020.
[8] Robust reconstruction of indoor scenes. In CVPR, 2015.
[9] Deep global registration. In CVPR, 2020.
[10] 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In CVPR, 2019.
[11] A volumetric method for building complex models from range images. In ACM SIGGRAPH, 1996.
[12] GPU accelerated robust scene reconstruction. In IROS, 2019.
[13] An efficient volumetric mesh representation for real-time scene reconstruction using spatial hashing. In ICRA, 2018.
[14] FastNeRF: high-fidelity neural rendering at 200 FPS. arXiv preprint arXiv:2103.10380, 2021.
[15] Coherent parallel hashing. ACM Transactions on Graphics, 30(6):1–8, 2011.
[16] Real-time ray tracing on GPU with BVH-based packet traversal. In Symposium on Interactive Ray Tracing, 2007.
[17] FlashFusion: real-time globally consistent dense 3D reconstruction using CPU computing. In RSS, 2018.
[18] Array programming with NumPy. Nature, 585(7825):357–362, 2020.
[19] GVDB: raytracing sparse voxel database structures on the GPU. In Proceedings of High Performance Graphics, 2016.
[20] OctoMap: an efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 34(3):189–206, 2013.
[21] Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics, 38(6):1–16, 2019.
[22] pybind11 – seamless operability between C++11 and Python. https://github.com/pybind/pybind11, 2017.
[23] Enoki: structured vectorization and differentiation on modern processor architectures. https://github.com/mitsuba-renderer/enoki, 2019.
[24] WarpCore: a library for fast hash tables on GPUs. arXiv preprint arXiv:2009.07914, 2020.
[25] Optimizing GPU-accelerated group-by and aggregation. VLDB, 8:20, 2015.
[26] Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] CHISEL: real-time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. In RSS, 2015.
[28] The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technology Journal, 11(4), 2007.
[29] Fast BVH construction on GPUs. In Computer Graphics Forum, Vol. 28, pp. 375–384, 2009.
[30] Data-parallel hashing techniques for GPU architectures. IEEE Transactions on Parallel and Distributed Systems, 31(1):237–250, 2019.
[31] Marching cubes: a high resolution 3D surface construction algorithm. ACM SIGGRAPH, 21(4):163–169, 1987.
[32] Intrinsic3D: high-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In ICCV, 2017.
[33] Geometric modeling using octree encoding. Computer Graphics and Image Processing, 19(2):129–147, 1982.
[34] OpenVDB: an open-source data structure and toolkit for high-resolution volumes. In ACM SIGGRAPH Courses, 2013.
[35] DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In CVPR, 2015.
[36] KinectFusion: real-time dense surface mapping and tracking. In ISMAR, 2011.
[37] Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, 32(6):1–11, 2013.
[38] Mitsuba 2: a retargetable forward and inverse renderer. ACM Transactions on Graphics, 38(6):1–17, 2019.
[39] Cuckoo hashing. Journal of Algorithms, 51(2):122–144, 2004.
[40] Colored point cloud registration revisited. In ICCV, 2017.
[41] PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, 2019.
[42] InfiniTAM v3: a framework for large-scale 3D reconstruction with loop closure. arXiv preprint arXiv:1708.00783, 2017.
[43] KiloNeRF: speeding up neural radiance fields with thousands of tiny MLPs. arXiv preprint arXiv:2103.13744, 2021.
[44] Efficient variants of the ICP algorithm. In 3DV, 2001.
[45] Fast point feature histograms (FPFH) for 3D registration. In ICRA, 2009.
[46] 3D is here: Point Cloud Library (PCL). In ICRA, 2011.
[47] stdgpu: efficient STL-like data structures on the GPU. arXiv preprint arXiv:1908.05936, 2019.
[48] RAPIDS: collection of libraries for end-to-end GPU data science. 2018.
[49] Lagrangian fluid simulation with continuous convolutions. In ICLR, 2020.
[50] Ray tracing deformable scenes using dynamic bounding volume hierarchies. ACM Transactions on Graphics, 26(1), 2007.
[51] Octree-based fusion for real-time 3D reconstruction. Graphical Models, 75(3):126–136, 2013.
[52] Real-time kd-tree construction on graphics hardware. ACM Transactions on Graphics, 27(5):1–11, 2008.
[53] Simultaneous localization and calibration: self-calibration of consumer depth cameras. In CVPR, 2014.
[54] Elastic fragments for dense scene reconstruction. In ICCV, 2013.
[55] Fast global registration. In ECCV, 2016.
[56] Open3D: a modern library for 3D data processing. arXiv preprint arXiv:1801.09847, 2018.
[57] Shading-based refinement on volumetric signed distance functions. ACM Transactions on Graphics, 34(4):1–14, 2015.
[58] State of the art on 3D reconstruction with RGB-D cameras. In Computer Graphics Forum, Vol. 37, pp. 625–652, 2018.