ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception

by Wei Dong, et al.

We present ASH, a modern and high-performance framework for parallel spatial hashing on GPU. Compared to existing GPU hash map implementations, ASH achieves higher performance, supports richer functionality, and requires fewer lines of code (LoC) when used for implementing spatially varying operations from volumetric geometry reconstruction to differentiable appearance reconstruction. Unlike existing GPU hash maps, the ASH framework provides a versatile tensor interface, hiding low-level details from the users. In addition, by decoupling the internal hashing data structures and key-value data in buffers, we offer direct access to spatially varying data via indices, enabling seamless integration with modern libraries such as PyTorch. To achieve this, we 1) detach stored key-value data from the low-level hash map implementation; 2) bridge the pointer-first low-level data structures to index-first high-level tensor interfaces via an index heap; 3) adapt both generic and non-generic integer-only hash map implementations as backends to operate on multi-dimensional keys. We first profile our hash map against state-of-the-art hash maps on synthetic data to show the performance gain from this architecture. We then show that ASH can consistently achieve higher performance on various large-scale 3D perception tasks with fewer LoC by showcasing several applications, including 1) point cloud voxelization, 2) dense volumetric SLAM, 3) non-rigid point cloud registration and volumetric deformation, and 4) spatially varying geometry and appearance refinement. ASH and its example applications are open sourced in Open3D (




1 Introduction

3D space is only one dimension higher than the 2D image space, yet the additional dimension introduces an unpredictable multiplier to the computational and storage cost. This increased dimension makes 3D perception tasks such as geometric reconstruction and appearance refinement challenging to implement. To reduce complexity, one compromise is to reuse dense data structures and constrain the 3D space by bounding the region of interest, e.g., adopting a 3D array in a bounded space [36]. While this simple approach succeeds at the scale of objects, it cannot meet the demand of room or city scale perception, which is necessary for virtual tour, telepresence, and autonomous driving.

Since the 3D space is generally a collection of 2D surface manifolds, its sparsity can be exploited by partitioning to reduce computational cost. The general idea is to split the large 3D space into smaller regions and only proceed with the non-empty ones. There is a plethora of well-established data structures for 3D space partitioning. Examples include trees (Octree [33], KD-tree [4]) and hash maps (spatial hashing [37]). While trees are able to adaptively achieve high precision, they 1) require an initial bounding volume and 2) usually take unbalanced traversal time for a batch of spatial queries and hence 3) are less friendly to batched operations. On the other hand, spatial hashing coupled with a plain array structure is more scalable and parallelizable for reconstruction tasks [37, 42]. Nonetheless, most of the current implementations are task-specific and difficult to generalize. A user-friendly, general hash map is missing for efficient spatial perception at scale.

The reason for this absence is understandable. A parallel hash map on GPU has to resolve collisions and thread conflicts and preferably organize an optimized memory manager, none of which is trivial to implement. Previous studies have attempted to tackle the problem in one or more aspects, driven by their selected downstream applications. Furthermore, most of the popular parallel GPU hash maps are implemented in C++/CUDA and only expose low-level interfaces. As a result, all the extensions using such a hash map must start from low-level programming. While these designs usually guarantee performance under certain circumstances [2, 3, 37, 42], as of today, they leave a gap from the standpoint of the research community, which prefers to use off-the-shelf libraries for fast prototyping with a high-level scripting language using tensors and automatic differentiation. Our motivation is to bridge this gap to enable researchers to develop sophisticated 3D perception routines with less effort and drive the community towards large-scale 3D perception.

To this end, we design a modern hash map framework with the following major contributions:

  1. a user-friendly, dynamic, and generic hash map interface (a generic hash map supports arbitrary dimensional keys and values in various data types) that enables tensor I/O, advanced indexing, and in-place automatic differentiation when bridged to autodiff engines such as PyTorch;

  2. an index-first adaptor that supports various state-of-the-art parallel GPU hash map backends and accelerates hash map operations with an improved data layout;

  3. a handful of downstream applications that achieve higher performance compared to state-of-the-art implementations with fewer LoC.

Experiments show that ASH achieves better performance with fewer LoC on both synthetic experiments and real-world tasks.

2 Related Work

2.1 Parallel Hash Map

The hash map is a data structure that seeks to map sparse keys (e.g. unbounded indices, strings, coordinates) from the set $\mathcal{K}$ to values from the set $\mathcal{V}$ with amortized $O(1)$ access. It has a hash function $h: \mathcal{K} \to \mathcal{I}_n$ that maps the key to the index set $\mathcal{I}_n = \{0, 1, \dots, n-1\}$ for indexing (or addressing) that is viable on a computer.

Ideally, with a perfect injective hash function $h$, a hash map can be implemented by $H: \mathcal{K} \to \mathcal{V},\ k \mapsto \mathbf{v}[h(k)]$, where $\mathbf{v}$ is an array of objects of type $\mathcal{V}$ and $[\cdot]$ is the trivial array element accessor. However, in practice, it is intractable to find an injective map given a sparse key distribution in $\mathcal{K}$ and a constrained index set of size $n$ due to computational budget. Therefore, modifications are required to resolve inevitable collisions, where $k_1 \neq k_2$ but $h(k_1) = h(k_2)$.

There are two classes of techniques for collision resolution: open addressing and separate chaining. Open addressing searches for another candidate via a probing algorithm until an empty address is found. The simplest probing, linear probing [25], computes $(h(k) + t) \bmod n$ starting from $t = 0$, where $t$ is the number of attempts. Separate chaining, on the other hand, maintains multiple entries per mapped index. In implementation, a linked list is grown at $h(k_1)$ if $h(k_1) = h(k_2)$ and $k_1 \neq k_2$.
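To make the two collision-resolution strategies concrete, the following is a minimal CPU sketch in plain Python. The function names are hypothetical and not part of any library discussed here, and Python lists stand in for device arrays and linked lists.

```python
def insert_linear_probing(table, key):
    """Open addressing: probe (h(k) + t) mod n for t = 0, 1, ... until a slot fits."""
    n = len(table)
    for t in range(n):
        slot = (hash(key) + t) % n
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def insert_chaining(buckets, key):
    """Separate chaining: each bucket keeps a list of the keys hashed to it."""
    b = hash(key) % len(buckets)
    if key not in buckets[b]:
        buckets[b].append(key)
    return b


# Integer keys 1, 5, 9 all hash to index 1 when n = 4 (small ints hash to themselves).
table = [None] * 4
slots = [insert_linear_probing(table, k) for k in (1, 5, 9)]

buckets = [[] for _ in range(4)]
for k in (1, 5, 9):
    insert_chaining(buckets, k)
```

With linear probing the colliding keys spill into neighboring slots, while with chaining they share one bucket whose list grows.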

While hash map implementations are widely available for CPU, their GPU counterparts have only emerged in the last decade. The majority of GPU hash maps use open addressing [48, 2, 15, 24], mainly due to its simplicity of implementation and its capability of handling highly concurrent operations. CUDPP [2] utilizes Cuckoo Hashing [39], while CoherentHash [15] adopts Robin Hood Hashing [6], both involving advanced probing designs. Although performant when keys and values are limited to integer sets, these variations cannot be generalized to spatial hashing and only allow static input. More recently, WarpCore [24] supports non-integer data and dynamic insertion, but the key domain is still limited to at most 64 bits.

There are also a few separate chaining implementations on GPU involving device-side linked lists. SlabHash [3] builds a linked list with a 128-byte Slab as the minimal unit, optimized for SIMT (Single Instruction Multiple Threads) warp operations. Although SlabHash allows dynamic insertion, only integer keys and values are supported, as with the aforementioned GPU hash maps. stdgpu [47] follows the conventional C++ Standard Library std::unordered_map and builds supporting vectors, bitset lock guards, and linked lists from scratch, resulting in a generic, dynamic hash map. Despite this rich functionality, stdgpu is not optimized for large value sets. In addition, due to its low-level templated design, users have to write device code even for simple tasks.

We refer the readers to a comprehensive review of GPU hash maps [30].

Columns: Method, Dynamic, Atomic, Generic, Python.
Rows: SlabHash [3], CUDPP [2], cuDF [48], WarpCore [24], stdgpu [47], InfiniTAM [42], VoxelHashing [37], GPURobust [12], ASH.

TABLE I: Comparison of existing parallel GPU hash maps. ASH preserves the dynamic, atomic, generic properties, and is extendable to the non-templated high-level Python interfaces.

2.2 Space Partitioning Structures

3D data is not as simple to organize as 2D images. While a 2D image can be stored in a dense matrix, exploiting sparsity in 3D data is paramount due to the limits in computer memory of the current day.

The most widely used data structures for 3D indexing are arguably trees. A KD-tree [4] recursively sorts k-dimensional data along a selected axis and partitions data at the median point. By nature, a KD-tree is designed for neighbor search. In 3D, it is mainly used to organize 3D points and their features. Examples include normal estimation and nearest neighbor association in Iterative Closest Points (ICP) [44, 56] and 3D feature association in global registration [45, 55]. GPU adaptations exist for KD-trees [46, 52], but are not suitable for incrementally changing scenes, as they are usually constructed once and queried repeatedly.

Bounding volume hierarchy (BVH) is another hierarchical representation that organizes primitives such as objects and triangles in 3D. There are various GPU adaptations [16, 29, 50] mostly targeted at ray tracing and dynamic collision detection. While a parallel construction is possible and deformation of the nodes is allowed, the tree structure typically remains unchanged, assuming a fixed layout.

While KD-trees and BVH split the space unevenly by data distribution, an Octree [33] recursively partitions the 3D space into 8 equal subvolumes according to space occupation states. It has been widely used in adaptive 3D mapping [20, 34] for robot navigation. There have been parallel implementations on GPU, from optimized data structures [19] to domain-specific languages [21]. However, these works generally focus on physics simulation within a bounded region of interest where the spatial partition is predefined. While parallel incremental division [51] is possible, an initial bounding region is still required, and the trees are not guaranteed to be balanced.

Spatial hashing is another variation of spatial management, with access time depending on the underlying hash map. With an unbounded region of interest, it has been widely used in real-time volumetric scene reconstruction, bundled with small dense 3D arrays. A handful of CPU implementations have achieved real-time performance [27, 17] at the expense of deep coupling with the tracking pipelines. Similarly, GPU implementations [37, 42, 12] reach high frame rates using GPU-based spatial hashing. However, all of these studies depend on ad hoc GPU hash maps exclusive to their specific systems. Concurrent race conditions have not been fully resolved in several implementations [37, 12], where volumes can be randomly under-allocated.

Fig. 1: The insertion and query workflow of ASH. Left: ASH takes a key and/or a value tensor as input, with a one-liner interface. For insertion, tensors are organized in SoA. Middle: inside ASH, a switchable parallel hash map is maintained, connected to the buffers that store actual keys and values via an index heap (details are discussed in §4.3-4.4). Right: buffer indices and masks are returned upon insertion and query, organized together in SoA. Chained functions can be easily applied to hashed data by indexing ASH buffers with indices and masks. Examples include obtaining unique keys through insertion, and applying in-place value increment through query.

2.3 Spatially Varying 3D Representations

A truncated signed distance field (TSDF) [11] is an implicit representation of surfaces, recording point-wise distance to the nearest surface point. It is frequently used for dense scene reconstruction with noisy input. The distribution of surfaces is generally spatially varying and therefore, a proper parameterization is usually necessary, either in a discrete [37] or neural [7] form.

Non-rigid deformation methods [53, 35] seek to embed point clouds in a deformable grid, where each point is anchored to neighbor grid points and deformed accordingly. They are mainly used for animation or calibration for distorted models. Similar to a deformation grid, complex lighting for rendering can be approximated by spatially-varying spherical harmonics (SVSH) [32] placed at a sparse grid. These grids are natural applications of spatial hashing.

A comprehensive review of spatially varying representations for real-world scene reconstruction is available [58].

While ad hoc implementations have been introduced for these representations either on CPU or GPU, ASH provides a device-agnostic interface that requires less code and provides better performance. Table I compares various aspects of existing GPU hash maps, either as a standalone data structure (Section 2.1) or embedded in an application (Section 2.2). To the best of our knowledge, ASH is the first implementation that simultaneously supports dynamic insertion, ensures correctness via atomic operations, allows generic keys, and has a modern tensor interface and Python binding for better usability.

3 Overview

Before plunging into the details, we first provide a high-level overview of our framework in Fig. 1.

Conventional parallel hash maps reorganize the structure of arrays (SoA) input, i.e., the separate key array and value array, into an array of structures (AoS) where keys and values are paired, inserted, and stored. Therefore, an array of pointers to pair structures (std::pair in C++, thrust::pair in CUDA, and tuple in Python) is returned upon query. Consequently, operations from insertion and query to in-place value increment require users to write device code and access the AoS through low-level pointers.

In contrast, ASH sticks to SoA. Fig. 1 shows the workflow of ASH. Instead of pointers to pairs, ASH returns index and mask arrays that can be directly consumed by tensor libraries such as PyTorch [41] (without memory copy) and NumPy [18] (with GPU to host memory copy). As a result, subsequent functions, from duplicate key removal to in-place modification, can be chained with hash map insertion and query in ASH via advanced indexing without writing any device code. As a general and device-agnostic interface for parallel hash maps, our framework is built upon switchable backends with details hidden from the user. Currently, separate chaining backends are supported, including the generic stdgpu [47] backend and the integer-only SlabHash [3] backend extended to support arbitrary key-value data types. TBB’s concurrent hash map [28] powers the CPU counterpart with an interface identical to the CUDA one.

In this paper, we use calligraphic letters to represent sets, and normal lower-case letters for their elements. Normal upper-case letters denote functions. Bold lower-case letters denote vectors of elements, or arrays from the programmer’s perspective. Bold upper-case letters are for matrices. For instance, in a hash map, we are interested in key elements $k \in \mathcal{K}$ and their vectorized processing, e.g. query $H(\mathbf{k})$. Specifically, we use $\mathcal{I}$ to denote a set of indices $\{0, 1, 2, \dots\}$, and $\mathcal{M}$ as the boolean selection $\{0, 1\}$. Given an arbitrary vector $\mathbf{x}$, we denote $\mathbf{x}[\mathbf{i}]$ and $\mathbf{x}[\mathbf{m}]$ as indexing and selection functions applied to $\mathbf{x}$ when the entries of $\mathbf{i}$ are in $\mathcal{I}$ and the entries of $\mathbf{m}$ are in $\mathcal{M}$. We use $\mathcal{K}, \mathcal{V}$ as the key and value sets for a hash map. $h$ is the internal hash function that converts a key to an index. $H$ is the general hash map enclosing $h$.

4 The ASH Framework

4.1 Classical Hashing

In a hash map $H$, since the hash function $h$ cannot be perfect as discussed in Section 2, we have to store keys to verify whether collisions happen ($h(k_1) = h(k_2)$ but $k_1 \neq k_2$).

In separate chaining, the bucket-linked list architecture is used to resolve hash collisions. With $n$ initial buckets, we construct the hash function $h: \mathcal{K} \to \mathcal{I}_n$, where $\mathcal{I}_n = \{0, 1, \dots, n-1\}$ is the index set defined in Section 2. As shown in Fig. 2, keys with the same hashed index $i$ are first aggregated in the $i$-th bucket, where a linked list grows adaptively to accommodate different keys. A conventional hash map stores key-value pairs $\langle k, v \rangle$ as the storage units. Consequently, two keys can be distinguished by checking the hashed index $h(k)$ and then the key $k$ from the pair, in order, and manipulation of the keys and values can be achieved by iterating over such pairs.

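With this formulation, assuming a subset of keys $\mathcal{X} \subset \mathcal{K}$ has been inserted into the hash map with associated values $\mathcal{Y} \subset \mathcal{V}$, a query function can be described as

$$H: \mathcal{X} \to \mathcal{X} \times \mathcal{Y}, \quad k \mapsto \langle k, v \rangle, \tag{1}$$

where $\langle k, v \rangle$ forms a concrete pair stored in the hash map. This format is common in implementations, e.g. in C++ (std::unordered_map) and Python (dict).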

Fig. 2: Illustration of a classical hash map using separate chaining. Keys (left) are put into corresponding buckets (middle) obtained by the hash function $h$. A linked list (right) is constructed per bucket to store key-value pairs within the same bucket but with unequal keys.
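As an illustration of the classical layout in Fig. 2, here is a minimal sketch of a pair-based, separately chained hash map in plain Python. The class name `ChainedHashMap` is hypothetical, and Python lists stand in for the per-bucket linked lists.

```python
class ChainedHashMap:
    """Bucket + linked-list hash map; key-value pairs are the storage unit (AoS)."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for pair in bucket:
            if pair[0] == key:        # same bucket and equal key: reject duplicate
                return False
        bucket.append([key, value])   # grow the chain with a new pair
        return True

    def find(self, key):
        for pair in self._bucket(key):
            if pair[0] == key:
                return pair           # the caller receives the stored pair itself
        return None


m = ChainedHashMap(num_buckets=2)
first = m.insert((0, 0, 1), 1.0)
second = m.insert((3, 2, 1), 2.0)
duplicate = m.insert((0, 0, 1), 5.0)

pair = m.find((3, 2, 1))
pair[1] += 1.0                        # in-place update requires touching the pair
```

Note that the caller has to dereference the returned pair to reach the value, which is the pair-first access pattern the next section seeks to avoid.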

4.2 Function Chaining and Parallel Hashing

The element-wise operation in Eq. 1 can be extended to vectors via parallel kernels. However, interpretation of the returned iterators of pairs is still required at the low level. In other words, although the parallel version can be implemented efficiently, results are still packed in an AoS instead of SoA:

$$H: \mathbf{k} \mapsto \big( \langle k_1, v_1 \rangle, \langle k_2, v_2 \rangle, \dots \big).$$

This forms a barrier when the parallel query is located in a chain of functions. For instance, to apply any function $f$ (e.g., geometry transformation) over the result of a query, the low-level function second that selects the value element from a pair must be provided to dereference the low-level structures and manipulate the keys and values in-place. In other words, we have to implement a non-trivial $\tilde{f}$:

$$\tilde{f}\big(\langle k, v \rangle\big) = f\big(\mathrm{second}(\langle k, v \rangle)\big) = f(v)$$

to force the conversion from AoS to SoA and chain a high-level function $f$ with $H$. (This can be simply achieved by returning a copy of values, but that is not feasible, especially when dealing with large-scale data, e.g. hierarchical voxel grids.) This could be tedious when prototyping geometry perception that requires hash map structures, since off-the-shelf operations have to be reimplemented in kernels.

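We reformulate this problem by introducing two affiliate arrays, $\mathbf{k}^B$ and $\mathbf{v}^B$ (note the superscript $B$ for buffering; they are not the input $\mathbf{k}$ and $\mathbf{v}$), of a capacity set relative to $n$, the number of buckets. These arrays are designed for explicit storage of keys and values, respectively. They serve as buffers that support a natural SoA layout and are exposed to the user for direct access and in-place modification. Now the query function can be rewritten as

$$H: k \mapsto i, \quad \text{such that } \mathbf{k}^B[i] = k \text{ and } \mathbf{v}^B[i] = v,$$

and this version is ready for parallelization. At this stage, to combine a high-level function $f$ with the query $H$, we can chain them to manipulate values:

$$f\big(\mathbf{v}^B[H(\mathbf{k})]\big),$$

which retains convenient properties such as array vectorization and advanced indexing.

When the input keys $\mathbf{k}$ are not all stored in the hash map, our formulation maintains its effectiveness by a simple masked extension that returns an index array $\mathbf{i}$ and a mask array $\mathbf{m}$:

$$H: \mathbf{k} \mapsto (\mathbf{i}, \mathbf{m}), \quad m_j = 1 \text{ if } k_j \text{ is stored in the hash map, and } m_j = 0 \text{ otherwise},$$

which is also ready for parallelization. Now the chaining of functions is given by

$$f\big(\mathbf{v}^B[\mathbf{i}[\mathbf{m}]]\big),$$

using advanced indexing with masks. We can also select valid queries with $\mathbf{k}[\mathbf{m}]$ without visiting $\mathbf{k}^B$. While our discussion was about the query function, the same applies to insertion.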

In essence, by converting the pair-first AoS to an index-first SoA format with the help of array buffers, we can conveniently chain high-level functions over hash map query and insertion. This simple change enables easy development on hash maps and unleashes their potential for fast prototyping and differentiable computation. However, the layout requires fundamental changes to the hash map data structure. With this in mind, we move on to illustrate how the ASH layer converts the AoS in native backends to our SoA layout.
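The index-first SoA idea can be sketched on the CPU as follows. This is only an illustration under stated assumptions: `IndexFirstMap` is a hypothetical name, a Python dict stands in for the parallel backend, plain lists stand in for tensors, and capacity overflow is not handled. It is not the ASH or Open3D API.

```python
class IndexFirstMap:
    """Backend maps key -> buffer index; keys and values live in flat buffers."""

    def __init__(self, capacity):
        self.key_buffer = [None] * capacity    # exposed key buffer
        self.value_buffer = [0.0] * capacity   # exposed value buffer
        self._index_of = {}                    # stand-in for the hash map backend

    def insert(self, keys, values):
        indices, masks = [], []
        for k, v in zip(keys, values):
            is_new = k not in self._index_of
            if is_new:
                i = len(self._index_of)
                self._index_of[k] = i
                self.key_buffer[i], self.value_buffer[i] = k, v
            indices.append(self._index_of[k])
            masks.append(is_new)               # False marks a rejected duplicate
        return indices, masks

    def find(self, keys):
        indices = [self._index_of.get(k, -1) for k in keys]
        masks = [i >= 0 for i in indices]
        return indices, masks


m = IndexFirstMap(capacity=8)
ins_idx, ins_mask = m.insert([(0, 0, 0), (1, 0, 0), (0, 0, 0)], [1.0, 2.0, 3.0])
# Chained function 1: unique keys via insertion, using indices and masks only.
unique_keys = [m.key_buffer[i] for i, ok in zip(ins_idx, ins_mask) if ok]

idx, mask = m.find([(1, 0, 0), (5, 5, 5)])
# Chained function 2: in-place increment through the value buffer, no pairs involved.
for i, ok in zip(idx, mask):
    if ok:
        m.value_buffer[i] += 10.0
```

Both chained operations work purely on the returned indices and masks, which is what lets them map onto advanced indexing in a tensor library.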

4.3 Generic Backends

We start with converting stdgpu [47], a state-of-the-art generic GPU hash map, into a backend of ASH. stdgpu follows the convention of its CPU counterpart std::unordered_map by providing a templated interface. The underlying implementation is a classical bucket-linked list structure with customized locks to avoid race conditions on GPU. To exploit the power of a generic hash map without reinventing the wheel, we seek to reuse the operations over keys (i.e. lock-guarded bucket and linked list operations), and redirect the value mapping to our value buffer.

A dynamic GPU hash map requires dynamic allocation and freeing of keys and values in device kernels. With the pre-allocated key buffer and value buffer, we maintain an additional index heap, as shown in Fig. 3. The index heap stores buffer indices pointing into the buffers, and the heap top marks the currently available buffer index. The heap top starts at 0, and is atomically increased at allocation and decreased at free. With the buffers and the dynamically changing heap top, we instantiate a generic hash map in stdgpu whose templated value type is an integer buffer index; the values stored in the hash map are thus buffer indices, used to access the value buffer exposed to the user.

Fig. 3: Illustration of a generic hash map bridged to tensors in ASH. Buffer indices are dynamically provided by the available indices maintained in the index heap at the increasing heap top (middle), acting as the values in the underlying hash map (left). It connects the hash map and the actual keys and values stored in the buffers (right) by accessing the buffers at the stored index. The key-bucket correspondences are the same as in Fig. 2 and are omitted for simplicity.
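A minimal sequential sketch of such an index heap is given below. The class name `IndexHeap` and its layout are assumptions made for illustration; on GPU the counter updates would be atomic operations, which plain Python integers only imitate.

```python
class IndexHeap:
    """Free list of buffer indices; entries at and above `top` are available."""

    def __init__(self, capacity):
        self.heap = list(range(capacity))
        self.top = 0                      # heap top starts at 0

    def allocate(self):
        if self.top >= len(self.heap):
            raise MemoryError("buffer is full")
        index = self.heap[self.top]
        self.top += 1                     # stands in for an atomic increment
        return index

    def free(self, index):
        self.top -= 1                     # stands in for an atomic decrement
        self.heap[self.top] = index       # the freed index becomes the next available


heap = IndexHeap(capacity=4)
a, b, c = heap.allocate(), heap.allocate(), heap.allocate()
heap.free(b)                              # index 1 is returned to the heap
d = heap.allocate()                       # and handed out again
```

Because only integers move through the heap, the key and value buffers themselves never need to be reallocated when entries come and go.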

4.3.1 Insertion

The insertion of a pair $\langle k, v \rangle$ is now decoupled into two steps: i) insertion of $\langle k, i \rangle$ into the hash map, where $i$ is the buffer index dynamically acquired from the heap top, and ii) insertion of $k$ and $v$ into the buffers at index $i$.

A naive implementation acquires a buffer index from the index heap on every insertion attempt and frees it if the insertion fails because the key already exists. However, when running in parallel, atomicAdd and atomicSub from different threads may conflict, leading to race conditions. A two-pass insertion could resolve the issue: in the first pass, we allocate a batch of indices from the index heap, sized by the input, attempt insertions, and record results; in the second pass, we return the indices of failed insertions to the index heap.

We adopt a more efficient one-pass lazy insertion. We first attempt to insert $\langle k, -1 \rangle$, with $-1$ as the dummy index, into the backend and observe whether it succeeds. If not, nothing needs to be done. Otherwise, we capture the returned pointer to the pair, trigger an index allocation from the index heap, and directly replace the dummy $-1$ with the allocated index $i$. This significantly reduces the overhead when the key uniqueness is low (i.e., many duplicates exist in the keys to be inserted).
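The lazy insertion logic can be sketched sequentially as below. This is an illustration only: `lazy_insert` and `heap_state` are hypothetical names, a dict stands in for the stdgpu backend, and the thread-level concurrency that motivates the design is not modeled.

```python
DUMMY = -1


def lazy_insert(backend, heap_state, key_buffer, value_buffer, key, value):
    """Insert (key, value); returns (buffer_index, success)."""
    # Step 1: attempt the insertion with a dummy index; no allocation yet.
    if key in backend:
        return backend[key], False        # duplicate: nothing to allocate or free
    backend[key] = DUMMY
    # Step 2: only a successful insertion pays for an index allocation.
    index = heap_state["free"][heap_state["top"]]
    heap_state["top"] += 1
    backend[key] = index                  # replace the dummy with the real index
    key_buffer[index], value_buffer[index] = key, value
    return index, True


backend, key_buffer, value_buffer = {}, [None] * 4, [None] * 4
heap_state = {"free": [0, 1, 2, 3], "top": 0}
results = [
    lazy_insert(backend, heap_state, key_buffer, value_buffer, k, v)
    for k, v in [((1, 2, 3), "a"), ((1, 2, 3), "b"), ((4, 5, 6), "c")]
]
```

The duplicate key never touches the heap, so with many duplicates in the input the heap top moves far less often than in the naive scheme.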

4.3.2 Query

The query operation is simpler. We first look up the buffer index $i$ for a given key $k$ in the backend. If the lookup succeeds, the target value is accessible to users in the value buffer at index $i$.

4.4 Non-generic Backends

While generic GPU hash maps have only recently become available, the research community in parallel computation has been focusing on more controlled setups where both keys and values are limited to certain dimensions or data types. We seek to generalize this non-generic setup with our index heap and verify its performance in more real-world applications. In this section, we show how ASH can be used to generalize SlabHash [3], a warp-oriented dynamic GPU hash map that only allows insertions and queries of the Int32 data type.

An extension to generic key types is non-trivial for SlabHash since its warp operations only apply to variables with limited word length. Our implementation extends the hash set variation of SlabHash, where only integers as keys are maintained in the backend.

4.4.1 Generalization via Index Heap

The index heap is the core to generalizing the SlabHash backend. In brief, a generic key is represented by its associated buffer index in an integer-only hash set, allocated the same way as discussed in Section 4.3.1. As illustrated in Fig. 4, all the insertions and queries are redirected from the buffer indices to actual keys and values via the index heap. However, the actual implementation involves more complicated changes in design.

Fig. 4: Illustration of a non-generic hash set enhanced by ASH. Integer buffer indices allocated from the index heap (middle) are inserted as delegate keys directly into the hash set (left), associated with actual keys in the key buffer (right) at the same indices. The key-bucket correspondences are the same as in Figs. 2 and 3 and are omitted for simplicity.

Given a generic key $k$, we first locate the bucket $h(k)$. Ideally, we can then allocate a buffer index $i$ at the heap top and insert it into the linked list at the bucket $h(k)$ in the integer-only hash set. The accompanying key and value are put in the buffers at index $i$. In query, we similarly first locate the bucket $h(k)$, then search for the key in the linked list by visiting the key buffer via each stored index.

4.4.2 Multi-Pass Insertion

Although query can be applied as mentioned above, the lazy insertion described in Section 4.3.1 is problematic in this setup. The main reason is that while the race condition in inserting an index does not occur in warp-oriented insertions, the copy of the actual key to the key buffer requires global memory writes. These writes may not be synchronized among threads, as copying a multi-dimensional key takes several non-atomic instructions. As a result, the insertion of a key $k$ could be accidentally triggered when i) the index of a duplicate of $k$ has already been inserted but ii) its actual key has only been partially copied to the key buffer. The key comparison would then mistakenly report the two keys as different, followed by the unexpected insertion of the duplicate when unsynchronized. In practice, with more than 1 million keys to be inserted in parallel, these kinds of conflicts happen with low probability. To resolve conflicts, we split insertion into three passes:

  • Pass 1: batch insert all keys into the key buffer by directly copying all candidates to batch-allocated indices from the index heap;

  • Pass 2: perform parallel hashing with the indices from pass 1. In this pass, keys are read-only in global buffers and hence do not face race conditions. Successful insertions are marked in a mask array;

  • Pass 3: batch insert values into the value buffer for the successful masks, and free the remaining indices back to the index heap.
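The three passes above can be sketched sequentially as follows. The function name, the list-of-lists hash set, and the list used as a free-index pool are all assumptions for illustration; the real passes run as parallel kernels with warp-level operations.

```python
def three_pass_insert(keys, values, hash_set, free_indices, key_buffer, value_buffer):
    """hash_set: list of buckets, each a list of buffer indices (delegate keys)."""
    # Pass 1: batch-allocate one index per candidate and copy every key.
    candidates = [free_indices.pop() for _ in keys]
    for i, k in zip(candidates, keys):
        key_buffer[i] = k
    # Pass 2: hash with read-only keys; equality is checked through the key buffer.
    masks = []
    for i, k in zip(candidates, keys):
        bucket = hash_set[hash(k) % len(hash_set)]
        duplicate = any(key_buffer[j] == k for j in bucket)
        if not duplicate:
            bucket.append(i)
        masks.append(not duplicate)
    # Pass 3: copy values for successes; return the other indices to the pool.
    for i, v, ok in zip(candidates, values, masks):
        if ok:
            value_buffer[i] = v
        else:
            key_buffer[i] = None
            free_indices.append(i)
    return candidates, masks


capacity = 4
key_buffer, value_buffer = [None] * capacity, [None] * capacity
free_indices = [3, 2, 1, 0]               # pop() hands out 0 first
hash_set = [[] for _ in range(2)]
indices, masks = three_pass_insert(
    [(1, 1, 1), (2, 2, 2), (1, 1, 1)], [10, 20, 30],
    hash_set, free_indices, key_buffer, value_buffer,
)
```

Keys are fully written before any comparison happens, so pass 2 never observes a partially copied key; the cost is one temporary slot per duplicate.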

While there is overhead due to the multi-pass operation, it is still practical for a dynamic hash map. First, keys are relatively inexpensive to copy, especially for spatial coordinates, while the more expensive copying of values is done without redundancy. Second, a dynamic hash map generally reserves sufficient memory for further growth, so that inserting all candidate keys would not exceed the buffer capacity.

4.5 Rehashing and Memory Management

While buffers are represented as fixed-size arrays, growth of storage is needed to accommodate the accumulated input data, which can exceed the hash map’s capacity, e.g. 3D points from an RGB-D stream. This triggers rehashing, where we adopt the conventional strategy to double the buffer size as common in the C++ Standard Library, collect all the active keys and values, and batch insert them into the enlarged buffer.
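A minimal sketch of the doubling strategy is shown below. The function `rehash` and the dict standing in for the rebuilt hash map are hypothetical; in this sketch active entries are compacted to the front of the new buffers, so buffer indices obtained before rehashing should not be assumed valid afterwards.

```python
def rehash(key_buffer, value_buffer, active_indices):
    """Double the capacity and batch re-insert all active keys and values."""
    new_capacity = 2 * len(key_buffer)
    new_keys = [None] * new_capacity
    new_values = [None] * new_capacity
    new_index_of = {}                     # stand-in for the rebuilt hash map
    for new_i, old_i in enumerate(active_indices):
        new_keys[new_i] = key_buffer[old_i]
        new_values[new_i] = value_buffer[old_i]
        new_index_of[new_keys[new_i]] = new_i
    return new_keys, new_values, new_index_of


old_keys = [(0, 0, 0), None, (1, 1, 1), (2, 2, 2)]
old_values = [1.0, None, 2.0, 3.0]
new_keys, new_values, index_of = rehash(old_keys, old_values, active_indices=[0, 2, 3])
```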

In dynamic insertions, there can be frequent allocation and freeing of small memory blobs that are adjacent and mergeable. In view of this, we implement a tree-structured global GPU memory manager similar to that of PyTorch [41].

4.6 Dispatch Routines

To enable bindings to non-templated languages, e.g. Python, the tensor interface is non-templated so that it can take dtypes and shapes as arguments. In the context of spatial hashing, we support arbitrary dimensional keys by expanding the dispatcher macros in C++. Floating-point types have undetermined precision behaviors on GPU. Therefore, we recommend converting floating-point keys to integers at the desired precision before using the hash map.
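One common way to perform such a conversion is to quantize coordinates by a chosen resolution. The sketch below assumes a hypothetical helper `quantize` and a `voxel_size` parameter; the paper does not prescribe this particular scheme.

```python
import math


def quantize(point, voxel_size):
    """Map a float 3D point to the integer coordinate of the voxel containing it."""
    return tuple(math.floor(c / voxel_size) for c in point)


# Two nearby points fall into the same 5 cm voxel and therefore share one integer key.
key_a = quantize((0.12, -0.03, 1.49), voxel_size=0.05)
key_b = quantize((0.13, -0.01, 1.46), voxel_size=0.05)
```

Using floor rather than truncation keeps negative coordinates in the correct voxel.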

We additionally dispatch values by their element byte sizes into intrinsically supported vectors: int, int2, int3, and int4. This adaptation accelerates trivially copyable value objects such as int3, and supports non-trivially copyable value blocks (e.g. an array pointed to by a void pointer). This speeds up the insertion of large value chunks by approximately a factor of 10.

4.7 Multi-value Hash Map and Hash Set

ASH supports multi-value hash maps that store values organized in SoA, and hash sets, with minor changes to the framework.

In application, a key can be mapped to multiple values. For instance, a 3D coordinate can be mapped to a normal and a color in a point cloud. While the mapped values can be packed as an array of structures (i.e., AoS) to fit a hash map, code complexity could increase since structure-level functions have to be implemented. We can generalize the hash map’s functionality by extending the single value buffer to an array of value buffers and applying loops over properties per index during an insertion. This simple change supports the storage of complex value types in SoA that allows easy vectorized query and indexing.
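The multi-value extension can be sketched as one flat buffer per property. The class `MultiValueMap` and its keyword-argument interface are hypothetical, and a dict again stands in for the backend.

```python
class MultiValueMap:
    """One key buffer plus one flat value buffer per property (SoA)."""

    def __init__(self, capacity, value_names):
        self.key_buffer = [None] * capacity
        self.value_buffers = {name: [None] * capacity for name in value_names}
        self._index_of = {}               # stand-in for the hash map backend

    def insert(self, key, **values):
        if key in self._index_of:
            return self._index_of[key], False
        index = len(self._index_of)
        self._index_of[key] = index
        self.key_buffer[index] = key
        for name, value in values.items():   # loop over properties per index
            self.value_buffers[name][index] = value
        return index, True


m = MultiValueMap(capacity=4, value_names=["normal", "color"])
i, ok = m.insert((1, 2, 3), normal=(0.0, 0.0, 1.0), color=(255, 0, 0))
j, ok_dup = m.insert((1, 2, 3), normal=(1.0, 0.0, 0.0), color=(0, 0, 0))
```

Each property can then be indexed on its own, e.g. reading only colors for a batch of indices, without defining a struct type.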

A hash set, on the other hand, is a simplified hash map: an unordered set that stores unique keys. It is generally useful in maintaining a set by rejecting duplicates, such as in point cloud voxelization. By removing the value buffer and ignoring value insertion, a hash map becomes a hash set.
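As a small illustration of the hash set use case, point cloud voxelization can be sketched with Python's built-in set. The function `voxelize` and its parameters are hypothetical and run sequentially on the CPU.

```python
import math


def voxelize(points, voxel_size):
    """Return one integer voxel coordinate per occupied voxel, in first-seen order."""
    seen = set()                          # hash set: keys only, no value buffer
    unique = []
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        if key not in seen:               # duplicates are rejected
            seen.add(key)
            unique.append(key)
    return unique


points = [(0.01, 0.02, 0.03), (0.04, 0.01, 0.02), (0.26, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=0.1)
```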

5 Experiments

Fig. 5: Hash map performance comparison between ASH-stdgpu and the vanilla stdgpu with 3D integer keys. Each curve shows the average operation time (y-axis) with varying hash map value sizes in bytes (x-axis), given a controlled backend, input length, and input key uniqueness ratio. Lower is better. Further factors are denoted by the legends on the right. ASH-stdgpu performs consistently faster than the vanilla stdgpu.

We start with synthetic experiments to show that ASH, with its optimized memory layout, increases performance while improving usability. All experiments in this section are conducted on a laptop with an Intel i7-6700HQ (4 cores, 8 threads) CPU and an Nvidia GTX 1070 GPU. In all the experiments, we assume the hash map capacity is equivalent to the number of input keys (regardless of duplicates). Each reported time is an average of 10 trials.

5.1 Spatial Hashing with Generic Backend

The first experiment is the performance comparison between vanilla stdgpu and ASH with stdgpu backend (ASH-stdgpu). For fairness, we extend the examples of stdgpu such that an array of iterators and masks are returned for in-place manipulations. The number of buckets and the load factor are determined internally by stdgpu.

Setup 1.

We test randomly generated 3D spatial coordinates mapped to float value blocks of varying sizes. The key, value, capacity, and uniqueness are chosen as follows:

where the uniqueness indicates the ratio of the number of unique keys to the total number of keys being inserted or queried.

Fig. 5 illustrates the comparison between vanilla stdgpu and ASH-stdgpu. For the insert operation, ASH-stdgpu is significantly faster than stdgpu when the uniqueness is low, and the performance gain increases as the value byte size increases. This is mainly due to the SoA memory layout and the lazy insertion mechanism, where a lightweight integer index is inserted in an attempt instead of the actual value. At a high input uniqueness, ASH-stdgpu maintains the performance advantage with low and medium value sizes, and its performance is comparable to stdgpu with a large value size. This indicates that our dispatch pattern in copying values helps in a high throughput scenario. For the find operation, ASH-stdgpu is consistently faster than vanilla stdgpu, under both high and low key uniqueness settings.

In addition to insert and find, we introduce a new activate operation. The activate operation “activates” the input keys by inserting them into the hash map and obtaining the associated buffer indices. This is especially useful when we can pre-determine and apply the element-wise initialization. Examples include the TSDF voxel blocks (zeros) and multi-layer perceptrons (random initializations). The activate operation is absent in most existing hash maps and is only available as hard-coded functions [37, 42, 12].
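The semantics of insert, activate, and find can be illustrated with a toy, CPU-only model. This is a minimal sketch and not the ASH or Open3D API: the class name `ToyIndexHashMap`, its method signatures, and the use of `-1` for failed entries are assumptions made for illustration, and a Python dict stands in for the parallel GPU backend. Keys map to integer buffer indices, while values live in a separate buffer.

```python
class ToyIndexHashMap:
    """Toy model: keys map to buffer indices; values live in a separate list."""

    def __init__(self, capacity, default_value=0.0):
        self.capacity = capacity
        self.key_to_index = {}  # stand-in for the hash backend
        self.value_buffer = [default_value] * capacity
        # Free buffer indices, handed out in increasing order.
        self.free_indices = list(range(capacity - 1, -1, -1))

    def activate(self, keys):
        """Insert keys only; return (buffer indices, success masks)."""
        indices, masks = [], []
        for key in keys:
            if key in self.key_to_index or not self.free_indices:
                # Duplicate key or full map: report failure, write nothing.
                indices.append(-1)
                masks.append(False)
                continue
            index = self.free_indices.pop()
            self.key_to_index[key] = index
            indices.append(index)
            masks.append(True)
        return indices, masks

    def insert(self, keys, values):
        """Activate keys, then copy values only for successful insertions."""
        indices, masks = self.activate(keys)
        for index, ok, value in zip(indices, masks, values):
            if ok:
                self.value_buffer[index] = value
        return indices, masks

    def find(self, keys):
        """Return (buffer indices, found masks) without copying values."""
        indices = [self.key_to_index.get(key, -1) for key in keys]
        return indices, [index >= 0 for index in indices]
```

In this model, activate never touches the value buffer, so its cost does not depend on the value size, whereas insert copies one value per successful key.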

With the activate operation, we conduct ablation studies to compare the insertion time of merely the keys versus the insert time of both the keys and values. Fig. 6 compares the runtime between insert and activate in ASH-stdgpu. The key, value, capacity, and uniqueness choices are the same as in Setup 1. We observe that while the insertion time increases as the value size increases, the activation time remains stable.

Fig. 6: Study of the activate operation introduced in ASH against insert with 3D integer keys on the ASH-stdgpu backend. Each curve shows the average operation time (y-axis) with varying hash map value sizes in bytes (x-axis), given an input length and input key uniqueness ratio. Lower is better. Activate keeps a stable runtime in the tasks that do not require explicit value insertion, while insert time increases corresponding to the hash map value size.

5.2 Integer Hashing with Non-Generic Backend

Next, we compare ASH based on SlabHash backend (ASH-slab) with the vanilla SlabHash. Since SlabHash only supports integers as keys and values, we limit our ASH-slab backend to the same integer types here. The number of buckets is capacity (load factor is approx. ), since it is empirically the best factor when ASH-slab is applied to non-generic and generic tasks. Since vanilla SlabHash only supports data I/O from the host, we include the data transfer time between host and device when measuring the performance of ASH-slab.

Setup 2.

We test randomly generated scalar integer values mapped to scalar float values. The key , value , capacity and uniqueness are chosen as follows:

As shown in Fig. 7, although ASH-slab does not make use of the non-blocking warp-oriented operations in SlabHash in order to enable support for generic key and value types, our insert is still comparable to vanilla SlabHash, which is only optimized for integers. The drop in performance of ASH-slab as the uniqueness increases is expected, since the overhead of multi-pass insertion increases correspondingly.

Fig. 7: Hash map performance comparison between ASH-slab and the vanilla SlabHash with 1D integer keys and values. Each curve shows the average operation time (y-axis) with varying input key uniqueness ratio (x-axis), given an input length. Lower is better. ASH-slab retains a comparable performance for integers while supporting generalization to arbitrary dimensional keys and values of various data types.

It is worth mentioning that with an improved global memory manager, the construction of an ASH-slab hash map takes less than 1ms under all circumstances, while the vanilla SlabHash consistently takes 30ms due to its redundant slab memory manager. In practice, where the hash map is constructed and used once (e.g., voxelization), ASH-slab is a more practical solution.

5.3 Ablation Between Backends

We now conduct an ablation study with different backends in ASH, namely ASH-stdgpu and ASH-slab, with arbitrary input key-value types beyond integers. The experimental setup follows Section 5.1.

Fig. 8: Ablation study of the hash map performance with 3D integer keys over different backends. Each curve shows the average operation time (y-axis) with varying hash map value sizes in bytes (x-axis), given a controlled backend, input length, and input key uniqueness ratio. Lower is better. ASH-stdgpu outperforms ASH-slab in most circumstances with the 3D integer keys and varying length values, common in real-world applications.

In Fig. 8, we can see that ASH-stdgpu outperforms ASH-slab in most circumstances with the 3D coordinate keys and varying length values that are common in real-world applications. While warp-oriented operations heavily used in SlabHash enjoy the benefits of intrinsic acceleration, they sacrifice the granularity of operations. Threads can only move on to the next task once all the operations in a warp (of 32) are finished. As a result, early termination when an insertion failure occurs is less likely in a warp-oriented hash map. If the data layout is not well-distributed for the intrinsic operations (e.g., low-uniqueness input, keys with long word width), the performance drop could be significant.

This observation is more apparent in insertion under varying input densities. With a relatively low value size and a high uniqueness, ASH-slab performs better. When the uniqueness is low, however, each thread in ASH-slab still has to finish a similar workload before termination, while ASH-stdgpu can reject many failed insertions early and move on to the following workloads. As of now, ASH-slab is suitable for the voxel downsampling application, while ASH-stdgpu is better for other tasks. Therefore, we set stdgpu as the default backend for ASH in the remaining sections.

5.4 Code Complexity

We now study the usage at the user end. First of all, the ASH framework, regardless of the backend used, is already compiled as a library. A C++ developer can easily include the header and build the example directly with a CPU compiler and link to the precompiled library with a light tensor engine. An equivalent Python interface is provided via pybind [22] as shown in Fig. 1.

On the other hand, to use SlabHash’s interface with an input array from host memory, a CUDA compiler is required, along with manual bucket-linked list chain configurations. For further performance improvement, detailed memory management has to be done manually via cudaMalloc, cudaMemcpy and cudaFree. stdgpu provides a built-in memory manager but requires writing device-side kernel functions with explicit execution.

In a query operation, the found values are returned by-copy for SlabHash, so in-place modification requires further modification of the library. stdgpu exposes the iterators, and therefore customized kernels need to be implemented to return an array of iterators and masks for further operations. The LoC required for the same functionality in C++ are listed in Table II.

                       stdgpu   SlabHash   ASH
Device code free?
CUDA compiler free?
Construct              3        9          3
Find                   22       2          2
Insert                 27       1          2
TABLE II: Comparison of the complexity of coding (top) and LoC (bottom) of each operation among the implementations. Unlike stdgpu and SlabHash, ASH does not require that users write device code or use a CUDA compiler. It requires few LoC for construction, query, and insertion.

6 Applications

We now illustrate a number of applications and ready-to-use systems in 3D perception to demonstrate the power of ASH with fewer LoC and better performance. The presented applications include:

  1. Voxelization;

  2. Real-time dense volumetric SLAM, including TSDF integration and surface reconstruction;

  3. Non-rigid registration and deformation;

  4. Joint geometry and appearance refinement with shape from shading (SfS).

The first two experiments are conducted on the machine with an Intel i7-6700HQ CPU (4 cores, 8 threads) and an Nvidia GeForce GTX 1070 GPU. The rest are done on an Intel i7-7700 CPU (4 cores, 8 threads) and an Nvidia GeForce GTX 1080Ti GPU.

6.1 Voxelization

Setup 3.

In voxelization, a hash map maps a point cloud’s discretized coordinates to its natural array indices, and the hash map capacity is the point cloud size, typically ranging from to

Voxelization is a core operation for discretizing the input data. It is essential for sparse convolution [10, 9] at the quantization preprocessing stage and is usually used in coarse-to-fine point cloud registration tasks to generate a point cloud “pyramid” [56].

Voxelization is a natural task for parallel hashing, as the essence of the operation is to discard duplicates at grid points. To achieve this, we first discretize the input by converting the coordinates from continuous metric units (meters) to voxel units. Then a simple hash map insertion eliminates the duplicates and associates the inputs with the remaining unique coordinates (a hash set can be used if the correspondence is not required). The returned indices can be reused for tracing other properties such as colors and point normals associated with the input. The pseudo-code for voxelization can be found in Algorithm 1.

1:input points
2:array indices , voxel size
3:hash map , insert function
Algorithm 1 Voxelization
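
The voxelization steps described above can be sketched in a few lines of Python. This is a minimal CPU illustration under stated assumptions, not the ASH implementation: the function name `voxelize` is hypothetical, and a Python dict stands in for the parallel hash map, with `setdefault` mimicking an insertion that fails on duplicate keys.

```python
import numpy as np


def voxelize(points, voxel_size):
    """Return unique voxel coordinates and, for each, the index of one input point."""
    # Discretize: continuous metric coordinates -> integer voxel units.
    coords = np.floor(np.asarray(points, dtype=np.float64) / voxel_size).astype(np.int64)

    # A dict stands in for the hash map; the first point to claim a voxel
    # wins, and later duplicates are rejected (like a failed insertion).
    voxel_to_point = {}
    for point_index, coord in enumerate(map(tuple, coords)):
        voxel_to_point.setdefault(coord, point_index)

    voxel_coords = np.array(list(voxel_to_point.keys()), dtype=np.int64).reshape(-1, 3)
    point_indices = np.array(list(voxel_to_point.values()), dtype=np.int64)
    return voxel_coords, point_indices
```

The returned `point_indices` can be used to gather colors or normals of the surviving points, e.g. `colors[point_indices]`.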

We compare voxelization implemented in ASH with two popular implementations, MinkowskiEngine [10] on CUDA and Open3D [56] on CPU. Our experiments are conducted on a large scene input with points which is typical for scene perception, and a small fragment of the scene with points which is typical for an RGB-D input frame, as shown in Fig. 9.

Fig. 9: Visualization of point cloud voxelization. Top: scene-level large-scale inputs. Bottom: fragment-level small-scale inputs. Left: original point clouds. Right: voxelized point clouds.

To evaluate the performance, we vary the voxel size parameter from 5mm to 5cm, which is typical in the spectrum of voxelization applications, from dense reconstruction to feature extraction. In Fig. 10 we can see that our implementation consistently outperforms the baselines for inputs at both scales. Meanwhile, in measuring the LoC written in C++ (the Python wrappers are all one-liners) required for the functionality given the hash map interface, we observe that ASH requires fewer than half the LoC of the baselines, as shown in Table III.

Fig. 10: Performance comparison of voxelization. Each curve shows the run time (y-axis) over the varying voxel size (x-axis). Lower is better. ASH is consistently faster than Open3D’s default voxelizer (CPU) and MinkowskiEngine (CUDA).
               MinkowskiEngine   Open3D   ASH
Voxelization   71                72       28
TABLE III: Comparison of LoC of voxelization in the C++ implementation calling the underlying hash map.

6.2 Real Time Volumetric Scene Reconstruction

6.2.1 Truncated Signed Distance Function

Scene representation with a truncated signed distance function (TSDF) from a sequence of 3D inputs was introduced in [11] and adapted to RGB-D in KinectFusion [36]. It takes a sequence of depth images with their poses as input, and seeks to optimize the signed distance, an implicit function value per point at . The signed distance measured for frame is given by (details including depth masking and the projective pinhole camera model are omitted for clarity and can be found in KinectFusion [36]):



projects the 3D point to 2D with a range reading after a rigid transformation. To reject outliers, a truncation function is applied to . There are multiple variations of and the definition of signed distance [5]. For this paper, we follow the convention in KinectFusion [36].

With a sequence of inputs, per-point signed distance can be estimated in least squares with a closed form solution


where is the selected weight depending on view angles and distances [5]. In other words, with a sequence of depth inputs and their poses, we can measure TSDF at any point in 3D within the camera frustums. We can also rewrite Eq. 11 incrementally:


where is the accumulated weight paired with .
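
The equivalence between the closed-form estimate and its incremental form can be checked numerically. The sketch below assumes the standard weighted running average used in KinectFusion-style fusion; the function names and the per-observation weights are illustrative assumptions rather than the paper's notation.

```python
def integrate(tsdf, weight, observed_sdf, observed_weight=1.0):
    """One incremental update of a voxel's TSDF and accumulated weight."""
    new_weight = weight + observed_weight
    new_tsdf = (weight * tsdf + observed_weight * observed_sdf) / new_weight
    return new_tsdf, new_weight


def batch_estimate(observed_sdfs, observed_weights):
    """Closed-form weighted least-squares estimate over all observations."""
    total = sum(w * d for d, w in zip(observed_sdfs, observed_weights))
    return total / sum(observed_weights)
```

Starting from a zero accumulated weight and applying `integrate` once per frame yields the same value as `batch_estimate` over all frames, so only the running TSDF and weight need to be stored per voxel.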

6.2.2 Spatially Hashed TSDF Volume

Setup 4.

In a scene represented by a volumetric TSDF grid, a hash map maps the coarse voxel blocks’ coordinates to the TSDF data structure of the voxel block, and the hash map capacity is typically to for small to large-scale indoor scenes:

where is the voxel block resolution, which is set to or .

While recent neural representations utilize multi-layer perceptrons to approximate the TSDF in continuous space [7], classical approaches use discretized voxels. Such representations have a long history and are ready for real-world, real-time applications. They can also provide data for training neural representations.

The state-of-the-art volumetric discretization for TSDF reconstruction is spatial hashing, where points are allocated around surfaces on-demand at a voxel resolution of around . While it is possible to directly hash high-resolution voxels, the access pattern could be less cache-friendly, as the neighbor voxels are scattered in the hash map. A hierarchical structure is a better layout, where small dense voxel grids (e.g. in the shape of or ) are the minimal unit in a hash map; detailed access can be redirected to simple 3D indexing. In other words, a voxel can be indexed by a coupled hash map lookup and a direct local addressing


where is the voxel size and is the voxel block resolution as described in Setup 4.
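
The coupled lookup can be sketched as follows: a point is first converted to an integer voxel coordinate, which is then split into a voxel block key (for the hash map) and a local coordinate (for direct array addressing). The function name `locate_voxel` and the x-fastest flattening order are assumptions chosen for illustration.

```python
import math


def locate_voxel(point, voxel_size, block_resolution):
    """Split a 3D point into (voxel block key, local voxel coordinate, flat offset)."""
    # Global integer voxel coordinate.
    voxel = [math.floor(c / voxel_size) for c in point]
    # Block key: the hash map key (floor division also handles negatives).
    block_key = tuple(v // block_resolution for v in voxel)
    # Local coordinate inside the dense block, each in [0, block_resolution).
    local = tuple(v % block_resolution for v in voxel)
    # Direct addressing into the block's flat array (x fastest, an arbitrary choice).
    offset = local[0] + block_resolution * (local[1] + block_resolution * local[2])
    return block_key, local, offset
```

Only `block_key` requires a hash map query; the remaining addressing is plain integer arithmetic, which keeps neighboring voxels contiguous in memory.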

While previous implementations [37, 42, 12] have achieved remarkable performance, modularized designs are missing. Geometry processing and hash map operations were coupled due to the absence of a well-designed parallel GPU hash map. One deficiency of this design is unsafe parallel insertion, where the capacity of a hash map can be exceeded. Another is ad hoc recurring low-level linked list access in geometry processing kernels. Our implementation demonstrates the first modularized and decoupled pipeline where safe hash map operations are used without any ad hoc modifications.

Fig. 11: Illustration of local and global hash maps iteratively used in real-time reconstruction. The local hash map activates voxel blocks enclosing points observed in the viewing frustum. The global hash map accumulates such activated blocks and maintains all the blocks around the isosurface.

6.2.3 Allocation and TSDF Integration

Setup 5.

For an input depth image, a hash map maps the unprojected point coordinates to the active indices as described in Setup 4, and the capacity of a hash map is typically to , with to valid entries:

In the modularized design, we first introduce a double hash map structure for voxel block allocation and TSDF estimation. Voxel block allocation identifies points from as surfaces and computes coordinates with Eq. 13. Intuitively, they can be directly inserted into the global hash map described in Setup 4. This is achievable in an ad hoc implementation where the core of the hash map is modified at the kernel code level, and unsafe insertion is allowed [37, 42]. However, in a modularized setup, this could lead to problems. A VGA-resolution (640×480) depth input contains up to 307,200 points and easily exceeds the empirical global hash map capacity. As we have mentioned, rehashing will be triggered under such circumstances, which is both time- and memory-consuming, especially for a hash map with memory-demanding voxel blocks as values.

To address this issue without changing the low-level implementation and sacrificing safety, we introduce a second hash map, the local hash map from Setup 5. This hash map is similar to that used in voxelization as it maps 3D coordinates unprojected from depths at the voxel block precision to an integer index. With this setup, a larger capacity of input is acceptable, as the local hash map is lightweight and can be cleared or constructed from scratch per iteration.

There are two main benefits to using a local hash map. First, it converts the input from the raw point scale to the voxel block scale, which is safe for the global hash map without rehashing. Second, as a byproduct, it keeps track of the active voxel blocks for the current frame, which can be directly used in the following integration and ray casting. The local and global hash maps can be linked through in-place value modification, where a query of a coordinate in the local map is redirected to the global map. Fig. 11 shows the roles of the two hash maps. Algorithm 2 details the construction and interaction between the two hash maps.

1:surface points estimated from input
2:local hash map with activation
3:global hash map with activation and query
7: Active local entries globally
9: Map local coords to global voxels
Algorithm 2 Double hash maps for allocation

Algorithm 2 summarizes the usage of hash maps in voxel block allocation. TSDF integration can then be implemented following Eq. 12 in a pure geometry function, either in a low-level CUDA kernel or a high-level vectorized Python script. In other words, spatial hashing is detached from the core geometric computation, providing more flexibility in performance optimization.
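
The interplay between the local and global hash maps can be sketched with plain Python dicts. This is an illustrative CPU model under stated assumptions: `allocate_blocks`, `global_buffer`, and `make_block` are hypothetical names, and the dicts stand in for the GPU hash maps.

```python
import math


def allocate_blocks(surface_points, block_size, global_map, global_buffer, make_block):
    """Activate blocks touched by surface_points; return their global buffer indices."""
    # Local map: deduplicate at voxel-block precision. It is rebuilt per frame.
    local_map = {}
    for point in surface_points:
        key = tuple(math.floor(c / block_size) for c in point)
        local_map.setdefault(key, len(local_map))

    # Global map: activate only the unique block keys, which are far fewer
    # than the raw points, so the global capacity is not exceeded.
    active_indices = []
    for key in local_map:
        if key not in global_map:
            global_map[key] = len(global_buffer)
            global_buffer.append(make_block())
        # Redirect the local entry to the global buffer index.
        local_map[key] = global_map[key]
        active_indices.append(global_map[key])
    return local_map, active_indices
```

The returned `active_indices` identify the voxel blocks seen in the current frame, so integration can iterate over them directly without further hash map queries.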

6.2.4 Ray Casting and Tracking

With known camera poses, TSDF integration is ready to use with the operations above. We now extend it to a dense SLAM system with ray casting and localization.

Ray casting renders depth images by marching a ray in the spatially hashed volumes and finding zero-crossing interfaces. It both provides a visualization of the currently reconstructed scene and enables frame-to-model tracking for a dense SLAM system. A vectorized implementation is described in Algorithm 3, where the hash map query and the geometric computations are conducted alternately in a batch fashion, which is ready for differentiable rendering when incorporated in an autodiff engine like PyTorch.

Fig. 12: Visualization of volumetric ray casting. Left: pseudo-colored depth image. Right: RGB color image. First row: rendering in fast mode without color. Second row: rendering in complete mode. Third row: raw RGBD input.
1:ray starting points , ray directions
2:voxel size , block length
3:voxel query function that mixes hash map query (Eq. 13) and array query (Eq. 14)
5:for t [0, …, max_step) and  do
6:      Query valid TSDF and masks
8:      += Invalid TSDF: march forward
10:      Valid TSDF: check zero-crossing
11:      +=
12:      Non-surface: march an adaptive step
14:end for
Algorithm 3 Volumetric ray marching
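
A batched version of the marching loop can be sketched in NumPy. This is a simplified illustration, not the paper's implementation: the voxel query is replaced by a hypothetical analytic sphere (`make_sphere_query`) whose TSDF is valid only near the surface, the TSDF is assumed to be normalized by the truncation distance, and the step for invalid samples is assumed to be no larger than the truncation distance so that the positive band is not skipped.

```python
import numpy as np


def make_sphere_query(center, radius, truncation):
    """Toy stand-in for the voxel query: TSDF is valid only near the surface."""
    center = np.asarray(center, dtype=np.float64)

    def query(points):
        sdf = np.linalg.norm(points - center, axis=1) - radius
        valid = np.abs(sdf) < truncation
        return np.clip(sdf / truncation, -1.0, 1.0), valid

    return query


def march_rays(origins, directions, query, voxel_size, block_length,
               truncation, max_steps=200):
    """Return per-ray hit depth (np.nan where no surface was found)."""
    num_rays = origins.shape[0]
    t = np.zeros(num_rays)                  # current distance along each ray
    prev_t = np.zeros(num_rays)
    prev_tsdf = np.zeros(num_rays)
    prev_valid = np.zeros(num_rays, dtype=bool)
    depth = np.full(num_rays, np.nan)
    active = np.ones(num_rays, dtype=bool)

    for _ in range(max_steps):
        if not active.any():
            break
        points = origins + t[:, None] * directions
        tsdf, valid = query(points)         # batched query: values and masks

        # Valid TSDF: a positive-to-negative sign change marks a surface.
        crossing = active & valid & prev_valid & (prev_tsdf > 0) & (tsdf <= 0)
        # Linear interpolation between the two samples bracketing the zero.
        ratio = prev_tsdf[crossing] / (prev_tsdf[crossing] - tsdf[crossing])
        depth[crossing] = prev_t[crossing] + ratio * (t[crossing] - prev_t[crossing])
        active &= ~crossing

        prev_t, prev_tsdf, prev_valid = t.copy(), tsdf, valid
        # Invalid TSDF: march forward by a block. Valid non-surface: adaptive step.
        step = np.where(valid, np.maximum(tsdf * truncation, voxel_size), block_length)
        t = np.where(active, t + step, t)
    return depth
```

Each iteration alternates one batched query with elementwise arithmetic, which is the structure that makes the loop straightforward to port to an autodiff tensor library.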

It is worth noting that while the vectorized ray casting is easy to implement and enables autodiff, an optimized version is better suited for a real-time reconstruction system, where device-level iterative hash map access is applied. Therefore, ASH-stdgpu is preferred over ASH-slab, since the warp-oriented operations entangle groups of rays and forbid early termination where surfaces are found.

Further accelerations can improve the speed of ray casting. As discussed in [42], adaptive spherical ray casting and a precomputed min-max range estimate will constrain the search range and boost performance. The latter can be conducted by simply reusing the active keys collected in Algorithm 2 and conducting projection without any involvement of hash maps.

In a performance-first application, we can squeeze more from our double-hash-map architecture. By default, the global hash map is used in Algorithm 3. Since the local hash map is associated with the global hash map with active indices, we can replace the global hash map with the local one accompanied by the buffer in the global hash map. With such a simple change, we can now query the local hash map, and access the global hash map in-place without touching the geometric computations. This operation does sacrifice rendering quality, as shown in Fig. 12, since out-of-frustum voxel blocks are ignored. However, we find that in practice it only affects the boundary regions, does not affect the tracking accuracy, and still produces high-quality scene reconstruction (see Fig. 13).

Fig. 13: Visualization of triangle mesh extracted from the real-time dense SLAM system on scene lounge and copyroom in the fast mode. Rendered with Mitsuba 2 [38].

With the rendered depth from a model, we can perform frame-to-model tracking as described in [36], obtain the camera pose for an incoming depth image on GPU [12], and complete the dense SLAM system.

6.2.5 Surface Extraction

A volumetric scene reconstruction is not usable for most software and solutions until the results are exported in the point cloud or triangle mesh form. Hence we implement a variation of Marching Cubes [31] that extracts vertices with triangle faces at zero crossings in a volume. In a spatially hashed implementation, boundary voxels in voxel blocks are harder to process since queries of neighbor voxel blocks are frequent, and shared vertices among triangles are hard to track.

One popular practice is to simply visit vertices at isosurfaces and disregard duplicates, but this usually results in a heavily redundant mesh [27, 42], or time-consuming post-processing to merge vertices [37]. Another method is to introduce an assistant volumetric data structure to atomically record the vertex-voxel association, but the implementations are overly complex and require frequent low-level hash map queries coupled with surface extraction [13, 12].

Now that we have a unified hash map interface, we simplify the voxel block neighbor search routine [12] and set up a 1-radius neighbor lookup table in advance, as described in Algorithm 4. Surface extraction is then detached from hash map access and can be optimized separately. As a low-hanging fruit, point cloud extraction is implemented with the same routine by ignoring the triangle generation step. In fact, surface extraction of a medium-scale scene shown in Fig. 13 takes less than 100ms, making interactive surface update possible in a real-time reconstruction system.

Fig. 14: Visualization of the real-time dense SLAM system with interactive surface reconstruction. Viewpoints can be changed by users to visualize the incremental reconstruction of the scene.
1:input , radius
2:hash map , query function
4:for  do
6:end for
Algorithm 4 1-radius nearest neighbor search in 3D
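
A precomputed neighbor lookup table of this kind can be sketched as follows. The function name `build_neighbor_table` and the `-1` marker for missing neighbors are assumptions for illustration, and a dict from block keys to buffer indices stands in for the hash map query.

```python
from itertools import product


def build_neighbor_table(keys, key_to_index, radius=1):
    """For each key, list the buffer index of every neighbor within `radius` (-1 if absent)."""
    # All integer offsets in the cube [-radius, radius]^3, including (0, 0, 0).
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    table = []
    for x, y, z in keys:
        row = [key_to_index.get((x + dx, y + dy, z + dz), -1)
               for dx, dy, dz in offsets]
        table.append(row)
    return offsets, table
```

Once the table is built, boundary voxels can reach adjacent blocks through plain array indexing, so the extraction kernel itself needs no hash map access.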
Fig. 15: Performance and LoC comparison of our real-time dense SLAM pipeline in the fast mode (ASH-fast) against state-of-the-art implementations: InfiniTAM [42], VoxelHashing [37], GPU-robust [13]. Evaluated on the lounge scene [8]. Left: detailed comparison of separating modules. Right: corresponding LoC comparison. Lower is better. Note the meshing LoC in InfiniTAM is significantly fewer since the implementation is over-simplified and requires further postprocessing. ASH-fast achieves a consistent fast speed with fewer LoC.

6.2.6 Switchable Modes and Reconstruction System

To demonstrate the ultimate performance we can achieve by making full use of the hash map, and for a fair comparison to the baselines [37, 42, 12], we develop the fast mode for the reconstruction system. In this setup, we use a less aggressive ray-based allocation strategy [42], and voxel blocks without color information, consistent with the baselines. We compare the performance of the fast mode against the state-of-the-art implementations with the same parameter setups: voxel size is 5.8mm, voxel block resolution is , TSDF truncation distance is 4cm, min/max acceptable range of depth scanning is 0.2m and 3m. Performance is profiled on the lounge scene shown in Fig. 13, since there are multiple loop closures in the sequence and the tracking is stable, resulting in minimal disturbance to the reconstruction side. Fig. 15 compares both the runtime and the LoC of the modules among the baselines (all code is reformatted with clang-format using a modified Google style). In most cases, we can see a significant performance gain, with fewer LoC to write thanks to the elimination of redundant hash map look-ups in geometry kernels.

                   Fast     Complete
Integration        0.75     1.80
Raycasting         0.97     3.72
Mesh extraction    146.47   190.57
Point extraction   30.17    87.78
TABLE IV: Performance breakdown (in ms) between ASH-fast and ASH-complete modes. Evaluated on the lounge scene [8].

In addition to the fast mode, we also implement a complete mode for richer volumetric information including color, and reduce noise by integrating depth into TSDF voxel blocks from a 1-radius-neighbor allocation [56]. A comparison of performance between the modes is shown in Table IV. The introduction of color requires double the memory and triple the computation cost in trilinear interpolation for ray casting and surface extraction. Hence the complete mode is slower, but it provides a better user experience. Fig. 14 shows the interactive reconstruction system where real-time point cloud reconstruction is provided in the complete mode. The system runs at frame rate on a mid-range laptop, including realistic rendering.

Note that to switch between the fast and complete modes, we only need to change the allocation function and several parameters, several dozen lines in total, without rewriting the core.

To conclude this subsection, we presented a dense SLAM system with a modular design, separating hash maps and geometric operations. Our system is faster, requires fewer LoC, and supports an easy trade-off between speed and fidelity. Our system also provides flexibility in optimizing the geometry functions without touching the hash map interface.

6.3 Non-Rigid Volumetric Deformation

The aforementioned modules are suitable for medium-scale reconstruction with 10K input VGA-resolution frames. For input sequences beyond that range or with heavy noise, drift is inevitable, and therefore offline reconstruction [8] is preferred. In a submap-based offline reconstruction system, long sequences are split into smaller subsets, each yielding a submap reconstruction with less drift. A global pose graph is then optimized after robust registration of such submaps to refine the poses. For details, we refer the readers to state-of-the-art reconstruction systems [8, 12].

We can simply reuse the integration and surface extraction components in the previous subsection for an offline system, with a fast processing speed for fragment estimation and scene reconstruction [8, 12]. However, issues still persist in challenging scenes, e.g., heavy misalignment due to the strong simulated noise in the Augmented ICL dataset [8], and the artifacts present in the large-scale indoor RGBD LiDAR dataset [40], as shown in Fig. 16.

Fig. 16: Visualization of scene reconstructions before and after ASH-SLAC. First row: before ASH-SLAC. Second row: after ASH-SLAC. Left: livingroom-1 from Augmented ICL [8]. Right: apartment from Indoor LiDAR RGBD [40]. Artifacts are eliminated by global pose adjustment and local deformation via deformable TSDF integration. Rendered with Mitsuba 2 [38].

To deal with this, non-rigid volumetric deformation is presented in Simultaneous Localization and Calibration (SLAC) [54, 53] to jointly optimize camera poses and refine geometry details by non-rigid deformation. In brief, SLAC attempts to embed fragment points in a control grid for deformation, and optimizes loss functions parameterized over the grids with as-rigid-as-possible regularizers:


where are corresponding 3D points between submaps obtained in nearest neighbor search after pose graph optimization, and deforms a point by embedding it in the control grid and performing a trilinear interpolation of neighbor grid deformations. accesses the offset value at a grid point . is the rigid rotation that minimizes locally, where is a 1-ring neighbor of . It controls local distortions in the as-rigid-as-possible regularizer.

This problem formulation is, however, complicated to realize in code, and in the original implementation, the deformation grid is a simplified regular 3D array where points out-of-bound are discarded during optimization. As of today, SLAC has never been reproduced apart from the original implementation. We observe that similar operations for TSDF grids can be applied here by ASH to generate a spatially hashed control grid.

Setup 6.

A volumetric deformation hash map maps deformation grid coordinates to grid position offsets, and the capacity of the hash map is typically to :

Equipped with ASH, the non-rigid deformation can be written in several lines, combining Algorithm 1 and modifying Algorithm 4 for 1-cube nearest neighbor search. We first voxelize the input point cloud with the deformation grid size. Then, instead of the 1-radius nearest neighbors ( entries), we look for 1-cube nearest neighbors (), where a point is enclosed in a cube formed by grid points. The interpolation ratio can be computed jointly. We can also adapt the 1-radius neighbor search to 1-ring neighbors for the regularizer. The embedding of a fragment point cloud is visualized in Fig. 17, where the edges indicate the association between points to grids, and the colors show the interpolation ratio. Note that this visualization is also made easy thanks to the simple interface of ASH.
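
The 1-cube embedding and its interpolation ratios can be sketched as follows. This is a minimal illustration assuming standard trilinear weights over the eight corners of the enclosing cube; the function name `embed_point` is hypothetical, and the returned corner keys are what would be looked up in the deformation hash map.

```python
import numpy as np
from itertools import product


def embed_point(point, grid_size):
    """Return the 8 enclosing grid corners and their trilinear weights."""
    scaled = np.asarray(point, dtype=np.float64) / grid_size
    base = np.floor(scaled).astype(np.int64)  # lowest corner of the enclosing cube
    frac = scaled - base                      # interpolation ratio in [0, 1)^3

    corners, weights = [], []
    for offset in product((0, 1), repeat=3):
        offset = np.array(offset)
        corners.append(tuple(int(c) for c in base + offset))
        # Use frac along axes where the corner is the upper one, else 1 - frac.
        weights.append(float(np.prod(np.where(offset == 1, frac, 1.0 - frac))))
    return corners, weights
```

The weights sum to one, and interpolating the corner positions with them reproduces the input point, so a deformed point is obtained by applying the same weights to the corner positions plus their offsets.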

Fig. 17: Visualization of a point cloud and its embedding in the volumetric deformation grids. Left: original point cloud. Right: embedding graph connecting the input points and associated deformation grid points. Each edge’s color indicates the interpolation weight: blue shows a lower weight (closer to ), while red shows a higher weight (closer to ).

With this parameterization, we reproduce SLAC after rewriting the non-linear least squares solver and jointly optimizing the grid points and submap poses given the correspondences. In addition, the hash map can be saved and loaded from the disk for further processing, including deformed TSDF integration that reconstructs the scene from the deformed input depth images embedded in the grids. Experiments show that with a modularized design and a spatial hash map, we are able to reproduce SLAC by reducing artifacts after optimization, as shown in Fig. 16.

We can see a gain in performance with fewer LoC in Table V on the livingroom-1 scene with heavy simulated noise (to control the experiment, we use the initial fragment pose graph from the baseline implementation). Note that while the hash map generalizes the deformation grid from bounded to unbounded, the LoC and time of the core non-linear least squares, which takes up the greatest portion of the optimization, are barely reduced. Our deformation and integration per frame is significantly faster (10.62 ms versus 125.38 ms), which is critical for large-scale sequences with thousands of frames. While being faster and easier to develop, our system achieves a higher reconstruction quality in terms of precision, recall, and F-score at a millimeter-scale distance threshold following [40].

Operation                Original SLAC         ASH-SLAC
                         Time (ms)   LoC       Time (ms)   LoC
Non-rigid optim.         2041.1      1585      1982.1      1535
Deformed integration     125.38      944       10.62       446

Reconstruction quality   Original SLAC         ASH-SLAC
Precision ()             29.19                 36.10
Recall ()                51.44                 61.34
F-score ()               37.24                 45.45
TABLE V: Performance and LoC (top) and reconstruction quality (bottom) comparison between ASH-SLAC and the original implementation [53] on the livingroom-1 scene [8]. ASH-SLAC is faster with fewer LoC, and produces a better reconstruction.

6.4 Shape-from-Shading Refinement

SLAC reduces artifacts for large-scale scenes. For small-scale objects, while volumetric reconstruction outputs smooth surfaces, fine details are often impaired due to the weight averaging of the TSDF.

(a) Normal map (b) Color map (c) Rendering with a ceramic material
Fig. 18: Appearance and geometry refinement before and after ASH-Intrinsic3D on lion [57]. First row: initial reconstruction from volumetric integration. Second row: refined reconstruction after optimization. Best viewed when enlarged in color.

Shape-from-Shading (SfS) refines details by jointly optimizing volumetric TSDF functions given the initial geometry and appearance [57]. It takes a reconstructed volumetric TSDF grid with a set of high resolution key frame RGB images and their poses as input, and outputs jointly optimized TSDF and albedo through an image formation model


where the estimated voxel-wise appearance is computed by (SH stands for spherical harmonics), and associated with the closest surface point


which is projected to image through after a rigid transformation . Here the voxel-wise normal is directly derived from with numerical gradients


Similar to SLAC, we use to access TSDF and albedo values at grid point . are coefficients for regularizing smoothness through the Laplacian, stability, and piece-wise albedo constancy via a weighted chromaticity regularizer , respectively [57].

While the image formation model is straightforward, similar to SLAC, the underlying data structure used in implementing the model can be complex mainly because of the prevalent nearest neighbor search in normal computation and neighbor voxel regularizers. As a result, to enable such a system without a modern hash map, one has to rely on low-level C++ implementation and is consequently limited to the low-level Ceres solver [1] for autodiff in optimization. Further, the spatially hashed voxels have to be bounded to reduce computation cost [32].

Now equipped with ASH, we provide a simplified solution that is built upon the hash map and advanced indexing. Unlike SLAC that requires time-consuming deformable TSDF re-integration for final scene reconstruction, SfS allows reusing accelerated surface extraction from the TSDF grids without further optimization. Therefore, we implement the SfS pipeline in pure Python as an example of fast prototyping of a differentiable rendering pipeline. Running on GPU, we lift the constraint of a user-defined bounding box and optimize the full reconstructed surface.

Without the requirement of extreme performance, we drop the hierarchical volumetric layout and use the simple voxel-based hash map:

Setup 7.

A voxel indexer is given by a hash set . The typical capacity is to .

With this setup, we can reuse the code in SLAC to look up the 1-ring neighborhood for normal estimation and Laplacian regularization. There is, however, another lookup required since we are minimizing the difference of appearance gradient in Eq. 16: we need to find the 1-ring neighbors that also have 1-ring neighbors. In other words, we have to find the intersection of two sets. While NumPy provides the functionality for 1D arrays through ordered sorting, our hash map allows unordered intersection that can be generalized to multi-dimensional inputs:

Setup 8.

With two input sets , the intersection is given by the following operations: initialize a hash set with ; query and obtain success mask ; return .
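The three steps of this setup can be sketched in plain Python, with a built-in `set` of tuples standing in for the GPU hash set. The names `a`, `b`, and `hash_intersection` are hypothetical labels chosen for this illustration; the sketch shows only the logic, not the ASH API.

```python
def hash_intersection(a, b):
    """Unordered intersection of two collections of multi-dimensional keys.

    a, b: iterables of integer coordinate tuples, e.g. (x, y, z).
    Returns the entries of b that also occur in a, plus the success mask.
    """
    # 1) Initialize a hash set with the keys of the first input.
    table = {tuple(key) for key in a}
    # 2) Query the second input and record which lookups succeed.
    mask = [tuple(key) in table for key in b]
    # 3) Return the entries of the second input selected by the mask.
    common = [key for key, hit in zip(b, mask) if hit]
    return common, mask


a = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
b = [(0, 1, 0), (2, 2, 2), (0, 0, 0)]
common, mask = hash_intersection(a, b)
```

No sorting is involved, so the result follows the order of the second input, and the keys may have any fixed dimensionality. In this sketch, duplicate keys in the second input are kept as they are.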

After data association is found and SH parameters are estimated in a preprocessing step, all the terms in Eq. 16 reduce to a simple combination of indexing and arithmetic operations. We can therefore take advantage of PyTorch’s autodiff and backpropagate the gradient through the built-in differentiable indexing layer. ADAM [26] with an initial learning rate is used. Thus the core volumetric SfS pipeline [57] is reproduced in pure Python.

An extension can be easily implemented by introducing spatially varying lights [32], wrapped up with a hash map.

Setup 9.

Spatially varying spherical harmonics (SVSH) (bands = 3) can be described by a hash map that maps lighting subvolume coordinates to the corresponding coefficients:

The embedding of an active voxel in an SVSH map is identical to SLAC, with 1-cube neighbor for the data term and 1-ring neighbor for the regularizer. Further description is omitted here as the formulation and implementation are similar to Eq. 16 [32].
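To make the SVSH map concrete, the sketch below uses a dictionary from integer subvolume coordinates to 9 SH coefficients (3 bands, a single color channel) and blends the coefficients of the eight corners of the enclosing cube. Reading the 1-cube embedding as trilinear interpolation is an assumption of this example, and `interpolate_svsh` and `subvolume_size` are hypothetical names, not part of ASH.

```python
import math

NUM_SH = 9  # 3 SH bands give 3**2 = 9 coefficients (one color channel assumed)


def interpolate_svsh(svsh, position, subvolume_size):
    """Trilinearly blend SH coefficients from the 8 enclosing subvolume corners.

    svsh: dict mapping integer subvolume coordinates to lists of NUM_SH floats.
    position: (x, y, z) position of the active voxel.
    Returns None when any of the 8 corners is absent from the map.
    """
    grid = [p / subvolume_size for p in position]
    base = [math.floor(g) for g in grid]
    frac = [g - b for g, b in zip(grid, base)]
    blended = [0.0] * NUM_SH
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                key = (base[0] + dx, base[1] + dy, base[2] + dz)
                coeffs = svsh.get(key)
                if coeffs is None:
                    return None  # voxel is not fully embedded in the map
                # Per-axis weight: frac toward the far corner, 1 - frac otherwise.
                weight = 1.0
                for d, f in zip((dx, dy, dz), frac):
                    weight *= f if d else 1.0 - f
                for i in range(NUM_SH):
                    blended[i] += weight * coeffs[i]
    return blended


# Toy map: the first coefficient equals the x coordinate of each corner.
svsh = {(x, y, z): [float(x)] + [0.0] * (NUM_SH - 1)
        for x in (0, 1) for y in (0, 1) for z in (0, 1)}
blended = interpolate_svsh(svsh, (0.25, 0.5, 0.5), 1.0)
missing = interpolate_svsh(svsh, (1.5, 0.5, 0.5), 1.0)
```

Because the toy coefficient varies linearly in x, the blended value reproduces the x coordinate of the query, and a query outside the stored cube returns `None`.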

With both SfS and SVSH optimization implemented [32] (pose optimization and voxel upsampling are currently disabled), we show the results on the scene lion in Fig. 18. Even without voxel grid upsampling, both the geometry and appearance details become sharper. Regarding performance and code complexity, we show in Table VI that our code is much shorter in pure Python, and faster per iteration thanks to the CUDA autodiff engine in PyTorch. Note that Ceres is a 2nd-order optimizer on CPU that empirically converges faster than the 1st-order ADAM optimizer. In practice, however, we found that ADAM converges well within 50 iterations, compared with the preset 10 iterations of the non-linear least squares solver. Thus the total optimization time of our implementation is still lower, even with more voxels to process (recall that we do not require an additional bounding box).

We also evaluate reconstruction quality in Table VI. We render our optimized mesh given the keyframe camera extrinsic and intrinsic parameters and compute RMSE against the raw input images. For the baseline [32], we follow a similar procedure and render the optimized mesh (not upsampled for fairness) given refined camera parameters. We use the same mask given by the baseline to ensure the same region of interest. The results show that our implementation produces improved RMSE despite the simplified development.

Operation            Original-Intrinsic3D       ASH-Intrinsic3D
                     Time (s)      LoC          Time (s)      LoC
SVSH optim.          0.503         605          0.092         254
Joint optim.         147.323       7399         0.916         1416

Rendering quality    Original-Intrinsic3D       ASH-Intrinsic3D
RMSE mean ()         0.677                      0.627
RMSE std ()          0.095                      0.120

TABLE VI: Performance per epoch (with 26 high-resolution keyframes) and LoC (top) and rendering quality (bottom) comparison between ASH-Intrinsic3D and the original implementation [32] on the lion scene [32]. ASH-Intrinsic3D is faster with fewer LoC, and results in comparable rendering from the refined reconstruction.

7 Conclusions and Future Work

We presented ASH, a performant and easy-to-use framework for spatial hashing. Both synthetic and real-world experiments demonstrate the power of the framework. With ASH, users can achieve the same or better performance in 3D perception tasks while writing less code.

There are various avenues for future work. At the architecture level, we seek to introduce the open-addressing variant [2, 24] of parallel hash maps for flexibility and for potentially higher-performance static hash maps. At the low level, we plan to further optimize the GPU backend and accelerate the CPU counterpart, potentially with cache-level optimization and code generation [23, 21]. We also plan to apply ASH to sparse convolution [10, 49] and neural rendering [14, 43], where spatially varying parameterizations are exploited.

ASH accelerates a variety of 3D perception workloads. We hope that the presented framework will serve both research and production applications.


  • [1] S. Agarwal, K. Mierle, et al. Ceres solver. Cited by: §6.4.
  • [2] D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta (2009) Real-time parallel hashing on the gpu. In ACM SIGGRAPH Asia, Cited by: §1, §2.1, TABLE I, §7.
  • [3] S. Ashkiani, M. Farach-Colton, and J. D. Owens (2018) A dynamic hash table for the gpu. In IEEE IPDPS, Cited by: §1, §2.1, TABLE I, §3, §4.4.
  • [4] J. L. Bentley (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (9), pp. 509–517. Cited by: §1, §2.2.
  • [5] E. Bylow, J. Sturm, C. Kerl, F. Kahl, and D. Cremers (2013) Direct camera pose tracking and 3d reconstruction with signed distance functions. In RSS, Cited by: §6.2.1, §6.2.1.
  • [6] P. Celis, P. Larson, and J. I. Munro (1985) Robin hood hashing. In Annual Symposium on Foundations of Computer Science, Cited by: §2.1.
  • [7] R. Chabra, J. E. Lenssen, E. Ilg, T. Schmidt, J. Straub, S. Lovegrove, and R. Newcombe (2020) Deep local shapes: learning local sdf priors for detailed 3d reconstruction. In ECCV, Cited by: §2.3, §6.2.2.
  • [8] S. Choi, Q. Zhou, and V. Koltun (2015) Robust reconstruction of indoor scenes. In CVPR, Cited by: Fig. 15, Fig. 16, §6.3, §6.3, TABLE IV, TABLE V.
  • [9] C. Choy, W. Dong, and V. Koltun (2020) Deep global registration. In CVPR, Cited by: §6.1.
  • [10] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio-temporal convnets: minkowski convolutional neural networks. In CVPR, Cited by: §6.1, §6.1, §7.
  • [11] B. Curless and M. Levoy (1996) A volumetric method for building complex models from range images. In ACM SIGGRAPH, Cited by: §2.3, §6.2.1.
  • [12] W. Dong, J. Park, Y. Yang, and M. Kaess (2019) GPU accelerated robust scene reconstruction. In IROS, Cited by: §2.2, TABLE I, §5.1, §6.2.2, §6.2.4, §6.2.5, §6.2.5, §6.2.6, §6.3, §6.3.
  • [13] W. Dong, J. Shi, W. Tang, X. Wang, and H. Zha (2018) An efficient volumetric mesh representation for real-time scene reconstruction using spatial hashing. In ICRA, Cited by: Fig. 15, §6.2.5.
  • [14] S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, and J. Valentin (2021) Fastnerf: high-fidelity neural rendering at 200fps. arXiv preprint arXiv:2103.10380. Cited by: §7.
  • [15] I. García, S. Lefebvre, S. Hornus, and A. Lasram (2011) Coherent parallel hashing. ACM Transactions on Graphics 30 (6), pp. 1–8. Cited by: §2.1.
  • [16] J. Gunther, S. Popov, H. Seidel, and P. Slusallek (2007) Real time ray tracing on gpu with bvh-based packet traversal. In Symposium on Interactive Ray Tracing, Cited by: §2.2.
  • [17] L. Han and L. Fang (2018) FlashFusion: real-time globally consistent dense 3d reconstruction using cpu computing.. In RSS, Cited by: §2.2.
  • [18] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020-09) Array programming with NumPy. Nature 585 (7825), pp. 357–362. Cited by: §3.
  • [19] R. K. Hoetzlein (2016) GVDB: raytracing sparse voxel database structures on the gpu. In Proceedings of High Performance Graphics, Cited by: §2.2.
  • [20] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard (2013) OctoMap: an efficient probabilistic 3d mapping framework based on octrees. Autonomous Robots 34 (3), pp. 189–206. Cited by: §2.2.
  • [21] Y. Hu, T. Li, L. Anderson, J. Ragan-Kelley, and F. Durand (2019) Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics 38 (6), pp. 1–16. Cited by: §2.2, §7.
  • [22] W. Jakob, J. Rhinelander, and D. Moldovan (2017) Pybind11 – seamless operability between c++11 and python. Cited by: §5.4.
  • [23] W. Jakob (2019) Enoki: structured vectorization and differentiation on modern processor architectures. Cited by: §7.
  • [24] D. Jünger, R. Kobus, A. Müller, C. Hundt, K. Xu, W. Liu, and B. Schmidt (2020) WarpCore: a library for fast hash tables on gpus. arXiv preprint arXiv:2009.07914. Cited by: §2.1, TABLE I, §7.
  • [25] T. Karnagel, R. Müller, and G. M. Lohman (2015) Optimizing gpu-accelerated group-by and aggregation.. VLDB 8, pp. 20. Cited by: §2.1.
  • [26] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.4.
  • [27] M. Klingensmith, I. Dryanovski, S. S. Srinivasa, and J. Xiao (2015) Chisel: real time large scale 3d reconstruction onboard a mobile device using spatially hashed signed distance fields.. In RSS, Cited by: §2.2, §6.2.5.
  • [28] A. Kukanov and M. J. Voss (2007) The foundations for scalable multi-core software in intel threading building blocks.. Intel Technology Journal 11 (4). Cited by: §3.
  • [29] C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha (2009) Fast bvh construction on gpus. In Computer Graphics Forum, Vol. 28, pp. 375–384. Cited by: §2.2.
  • [30] B. Lessley and H. Childs (2019) Data-parallel hashing techniques for gpu architectures. IEEE Transactions on Parallel and Distributed Systems 31 (1), pp. 237–250. Cited by: §2.1.
  • [31] W. E. Lorensen and H. E. Cline (1987) Marching cubes: a high resolution 3d surface construction algorithm. ACM SIGGRAPH 21 (4), pp. 163–169. Cited by: §6.2.5.
  • [32] R. Maier, K. Kim, D. Cremers, J. Kautz, and M. Nießner (2017) Intrinsic3D: high-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In ICCV, Cited by: §2.3, §6.4, §6.4, §6.4, §6.4, §6.4, TABLE VI.
  • [33] D. Meagher (1982) Geometric modeling using octree encoding. Computer Graphics and Image Processing 19 (2), pp. 129–147. Cited by: §1, §2.2.
  • [34] K. Museth, J. Lait, J. Johanson, J. Budsberg, R. Henderson, M. Alden, P. Cucka, D. Hill, and A. Pearce (2013) OpenVDB: an open-source data structure and toolkit for high-resolution volumes. In ACM SIGGRAPH Courses, pp. 1–1. Cited by: §2.2.
  • [35] R. A. Newcombe, D. Fox, and S. M. Seitz (2015) Dynamicfusion: reconstruction and tracking of non-rigid scenes in real-time. In CVPR, Cited by: §2.3.
  • [36] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon (2011) Kinectfusion: real-time dense surface mapping and tracking. In ISMAR, Cited by: §1, §6.2.1, §6.2.4, footnote 4.
  • [37] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger (2013) Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics 32 (6), pp. 1–11. Cited by: §1, §1, §2.2, §2.3, TABLE I, §5.1, Fig. 15, §6.2.2, §6.2.3, §6.2.5, §6.2.6.
  • [38] M. Nimier-David, D. Vicini, T. Zeltner, and W. Jakob (2019) Mitsuba 2: a retargetable forward and inverse renderer. ACM Transactions on Graphics 38 (6), pp. 1–17. Cited by: Fig. 13, Fig. 16.
  • [39] R. Pagh and F. F. Rodler (2004) Cuckoo hashing. Journal of Algorithms 51 (2), pp. 122–144. Cited by: §2.1.
  • [40] J. Park, Q. Zhou, and V. Koltun (2017) Colored point cloud registration revisited. In ICCV, Cited by: Fig. 16, §6.3, §6.3.
  • [41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §3, §4.5.
  • [42] V. A. Prisacariu, O. Kähler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. Torr, and D. W. Murray (2017) Infinitam v3: a framework for large-scale 3d reconstruction with loop closure. arXiv preprint arXiv:1708.00783. Cited by: §1, §1, §2.2, TABLE I, §5.1, Fig. 15, §6.2.2, §6.2.3, §6.2.4, §6.2.5, §6.2.6.
  • [43] C. Reiser, S. Peng, Y. Liao, and A. Geiger (2021) KiloNeRF: speeding up neural radiance fields with thousands of tiny mlps. arXiv preprint arXiv:2103.13744. Cited by: §7.
  • [44] S. Rusinkiewicz and M. Levoy (2001) Efficient variants of the icp algorithm. In 3DV, Cited by: §2.2.
  • [45] R. B. Rusu, N. Blodow, and M. Beetz (2009) Fast point feature histograms (fpfh) for 3d registration. In ICRA, Cited by: §2.2.
  • [46] R. B. Rusu and S. Cousins (2011) 3d is here: point cloud library (pcl). In ICRA, Cited by: §2.2.
  • [47] P. Stotko (2019) stdgpu: Efficient STL-like Data Structures on the GPU. arXiv:1908.05936. Cited by: §2.1, TABLE I, §3, §4.3.
  • [48] R. D. Team (2018) RAPIDS: collection of libraries for end to end gpu data science. Cited by: §2.1, TABLE I.
  • [49] B. Ummenhofer, L. Prantl, N. Thuerey, and V. Koltun (2020) Lagrangian fluid simulation with continuous convolutions. In ICLR, Cited by: §7.
  • [50] I. Wald, S. Boulos, and P. Shirley (2007) Ray tracing deformable scenes using dynamic bounding volume hierarchies. ACM Transactions on Graphics 26 (1), pp. 6–es. Cited by: §2.2.
  • [51] M. Zeng, F. Zhao, J. Zheng, and X. Liu (2013) Octree-based fusion for realtime 3d reconstruction. Graphical Models 75 (3), pp. 126–136. Cited by: §2.2.
  • [52] K. Zhou, Q. Hou, R. Wang, and B. Guo (2008) Real-time kd-tree construction on graphics hardware. ACM Transactions on Graphics 27 (5), pp. 1–11. Cited by: §2.2.
  • [53] Q. Zhou and V. Koltun (2014) Simultaneous localization and calibration: self-calibration of consumer depth cameras. In CVPR, Cited by: §2.3, §6.3, TABLE V.
  • [54] Q. Zhou, S. Miller, and V. Koltun (2013) Elastic fragments for dense scene reconstruction. In ICCV, Cited by: §6.3.
  • [55] Q. Zhou, J. Park, and V. Koltun (2016) Fast global registration. In ECCV, Cited by: §2.2.
  • [56] Q. Zhou, J. Park, and V. Koltun (2018) Open3D: A modern library for 3D data processing. arXiv:1801.09847. Cited by: §2.2, §6.1, §6.1, §6.2.6.
  • [57] M. Zollhöfer, A. Dai, M. Innmann, C. Wu, M. Stamminger, C. Theobalt, and M. Nießner (2015) Shading-based refinement on volumetric signed distance functions. ACM Transactions on Graphics 34 (4), pp. 1–14. Cited by: Fig. 18, §6.4, §6.4.
  • [58] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb (2018) State of the art on 3d reconstruction with rgb-d cameras. In Computer Graphics Forum, Vol. 37, pp. 625–652. Cited by: §2.3.