I Introduction
Point cloud registration is the process of estimating a rigid transform that best aligns a pair of point clouds. It is a key component of 3D reconstruction and robotic applications, e.g., odometry and SLAM, which are increasingly important in autonomous mobile robots with limited computational resources and power budgets. This necessitates a lightweight registration method running on low-power mobile devices such as FPGA SoCs.
Inspired by the tremendous success of deep learning, significant advancements have been made in learning-based registration methods over the past few years. PointNetLK [1] is a representative learning-based method, which combines Lucas-Kanade (LK)-based pose estimation with PointNet feature embedding. Conventional geometry-based methods such as ICP [2] and many other learning-based methods rely on point correspondences to obtain the rigid transform in closed form, which results in a computational complexity of O(N^2), where N is the number of points. In contrast, PointNetLK has O(N) complexity, and PointNet is a small and easy-to-implement neural network. Combined, these properties give PointNetLK better scalability and a certain performance advantage; it is therefore suitable for resource-limited computing platforms.
From these considerations, in this paper we propose a highly efficient design of PointNetLK targeting resource-limited FPGA SoCs. Since PointNet feature extraction is the performance bottleneck, we develop a dedicated accelerator IP core for PointNet and implement it on the FPGA logic. We use the Xilinx ZCU104 Evaluation Kit as an affordable mid-range FPGA SoC. We fully optimize the IP core design by leveraging the high degree of parallelism in PointNet. Experiments demonstrate that our accelerator improves the performance by a large margin without degrading the generalization ability or accuracy. We also conduct weight quantization to further reduce the resource usage, and show that the quantized IP core can be implemented on even smaller and lower-cost FPGAs (Avnet Ultra96v2).
The rest of the paper is organized as follows: Section II overviews related works. Section III formulates the registration problem and describes the PointNetLK algorithm. Section IV illustrates the design optimizations carried out for our FPGA SoC-based implementation, and Section V presents implementation details. Evaluation results in terms of speed, accuracy, resource utilization, and power consumption are presented in Section VI. Section VII concludes the paper.
II Related Works
II-A Deep Learning-based Point Cloud Registration
Deep learning techniques have been successfully applied to the registration problem, outperforming conventional geometry-based methods such as ICP and its variants [2, 3]. One line of work employs DNNs to predict a rigid transform from input point sets in an end-to-end fashion [4, 5, 6, 7, 8]. Another approach combines DNN feature extraction with non-learning-based closed-form pose estimation. LORAX [9] leverages a shallow autoencoder to extract a feature descriptor from a subset of points, and computes a rigid motion from the matched descriptor pairs. A number of methods
[10, 11, 12, 13, 14, 15, 16, 17] predict point correspondences using DNNs and perform SVD to compute a rigid transform. Aside from these, other learning-based methods perform iterative registration in the framework of the LK algorithm. PointNetLK [1] is a representative method; it iteratively refines a rigid transform by aligning global point cloud features extracted by PointNet. Li et al. [18] improve the generalization ability of PointNetLK by computing the Jacobian analytically instead of approximating it.

II-B FPGA-based Acceleration for Point Cloud Registration
Despite the growing importance, only a few works have investigated FPGA-based acceleration of 3D point cloud registration. Kosuge et al. [19] develop an accelerator for ICP-based object pose estimation, which is a critical process in picking robots. They focus on the nearest neighbor (NN) search, which constitutes a major bottleneck in ICP, and devise a novel hierarchical graph data structure for improved efficiency. The proposed accelerator combines a parallelized distance computation unit and a dedicated sorter unit to speed up the graph construction and NN search. Deng et al. [20] present an FPGA-based accelerator for the Normal Distributions Transform (NDT). NDT [21] models point clouds as a set of voxels, each of which represents a Gaussian distribution of points. They introduce a new hierarchical, memory-efficient data structure to accelerate the voxel search operations. Eisoldt et al. [22] implement a Truncated Signed Distance Function (TSDF)-based registration method and the TSDF map update process on the FPGA logic for efficient 3D TSDF-based LiDAR SLAM. These works successfully demonstrate the effectiveness of FPGA acceleration for well-established geometry-based registration methods. This paper is the first to explore an FPGA-based accelerator design for a learning-based method.

III Background
III-A PointNetLK Algorithm
In this section, we briefly describe the PointNetLK algorithm. We refer to [1, 18] for the detailed derivation.
The aim of point cloud registration is to align two 3D point clouds, referred to as a template $\mathcal{P}_T$ and a source $\mathcal{P}_S$, by estimating a 3D rigid transform $G \in \mathrm{SE}(3)$ from $\mathcal{P}_T$ to $\mathcal{P}_S$. PointNetLK finds an optimal transform $G$ such that the global features of the two point clouds are equal: $\phi(\mathcal{P}_T) = \phi(G \cdot \mathcal{P}_S)$. $\phi$ denotes a PointNet that maps a point cloud of $N$ points into a 1024D global feature. The transform is computed from a 6D twist $\boldsymbol{\xi} \in \mathbb{R}^6$ via the exponential map, $G = \exp(\boldsymbol{\xi}^\wedge)$. The definition of the wedge operator $(\cdot)^\wedge$ is found in [23]. For efficiency, PointNetLK swaps the roles of the template and source; it computes a twist $\boldsymbol{\xi}$ such that the rigid transform from $\mathcal{P}_T$ to $\mathcal{P}_S$ minimizes the difference between $\phi(\mathcal{P}_S)$ and $\phi(G \cdot \mathcal{P}_T)$:

$$\boldsymbol{\xi}^* = \arg\min_{\boldsymbol{\xi}} \left\| \phi(\mathcal{P}_S) - \phi(\exp(\boldsymbol{\xi}^\wedge) \cdot \mathcal{P}_T) \right\|^2 \quad (1)$$
By applying a first-order Taylor expansion around $\boldsymbol{\xi} = \mathbf{0}$, we linearize the term $\phi(\exp(\boldsymbol{\xi}^\wedge) \cdot \mathcal{P}_T)$:

$$\phi(\exp(\boldsymbol{\xi}^\wedge) \cdot \mathcal{P}_T) \approx \phi(\mathcal{P}_T) + J\boldsymbol{\xi} \quad (2)$$

where

$$J = \left. \frac{\partial \phi(\exp(\boldsymbol{\xi}^\wedge) \cdot \mathcal{P}_T)}{\partial \boldsymbol{\xi}^\top} \right|_{\boldsymbol{\xi} = \mathbf{0}}$$

is a Jacobian matrix. Each column vector $J_i$ of $J$ is computed by numerical gradient approximation as follows:

$$J_i \approx \frac{\phi(\exp(t_i \mathbf{e}_i^\wedge) \cdot \mathcal{P}_T) - \phi(\mathcal{P}_T)}{t_i} \quad (3)$$
where $t_i$ is an infinitesimal perturbation to the twist and $\mathbf{e}_i$ is a unit vector whose $i$-th element is 1 and the others are 0. Note that $J$ depends only on the template and is therefore computed only once at initialization. By substituting Eq. 2 into Eq. 1, we can solve for the optimal twist as follows:

$$\boldsymbol{\xi} = J^{\dagger} \left( \phi(\mathcal{P}_S) - \phi(\mathcal{P}_T) \right) \quad (4)$$

where $J^{\dagger} = (J^\top J)^{-1} J^\top$ is a pseudoinverse of $J$. We transform the source using $\Delta G = \exp(\boldsymbol{\xi}^\wedge)$ and proceed to the next iteration until convergence. The final solution is obtained as a product of all incremental transforms, i.e., $G_{\mathrm{est}} = \Delta G_K \cdots \Delta G_1$, where $\Delta G_k$ is the estimate at the $k$-th iteration and $K$ is the number of iterations.
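The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the PointNet extractor is replaced by a hypothetical toy feature function `phi`, and the se(3) exponential map is approximated by a truncated matrix power series (adequate for small twists).

```python
import numpy as np

def wedge(xi):
    """Map a 6-D twist (w, v) to a 4x4 se(3) matrix."""
    w, v = xi[:3], xi[3:]
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    T = np.zeros((4, 4))
    T[:3, :3] = W
    T[:3, 3] = v
    return T

def exp_se3(xi, terms=10):
    """Truncated power series for exp(xi^) -- fine for small twists."""
    A, out, X = np.eye(4), np.eye(4), wedge(xi)
    for k in range(1, terms):
        A = A @ X / k
        out = out + A
    return out

def transform(G, P):
    """Apply a 4x4 rigid transform to an (N, 3) point cloud."""
    return P @ G[:3, :3].T + G[:3, 3]

def phi(P, W):
    """Toy stand-in for PointNet: per-point ReLU layer + max-pooling."""
    return np.max(np.maximum(P @ W, 0.0), axis=0)

def pointnetlk(P_T, P_S, W, t=1e-2, iters=10):
    f_T = phi(P_T, W)
    # Jacobian (Eq. 3): forward differences, computed once at initialization.
    J = np.zeros((W.shape[1], 6))
    for i in range(6):
        e = np.zeros(6); e[i] = t
        J[:, i] = (phi(transform(exp_se3(e), P_T), W) - f_T) / t
    J_pinv = np.linalg.pinv(J)
    G_est = np.eye(4)
    for _ in range(iters):
        xi = J_pinv @ (phi(P_S, W) - f_T)   # Eq. 4
        dG = exp_se3(xi)
        P_S = transform(dG, P_S)            # move the source
        G_est = dG @ G_est                  # accumulate increments
    return G_est
```

Note how `J` and its pseudoinverse are computed only once, before the loop; each iteration then costs only one feature extraction plus a 6x1024-by-1024 product, which is the structural reason PointNet dominates the runtime.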
III-B Advantages of PointNetLK
PointNet is a simple yet powerful network for point cloud processing, which contributes to computational efficiency, low memory consumption, and ease of implementation. The network consists of five fully-connected layers (see Fig. 2), each of which is followed by batch normalization and a ReLU activation. These layers extract a 1024D point-wise local feature from each input point. A max-pooling layer is placed at the end to aggregate the point-wise local features into a global feature. Preprocessing such as normal estimation is not required, as PointNet directly processes raw 3D point coordinates.
PointNetLK does not depend on point correspondences, and is instead based on the alignment of global point cloud features. Importantly, PointNet has a computational and space complexity of O(N), and so does PointNetLK. This is a significant advantage over the correspondence-based methods mentioned in Section II-A, which have O(N^2) complexity due to the correspondence search. As shown later, the on-chip memory consumption of our PointNet accelerator is O(1), since it processes input points one-by-one and requires storage for only a single 1024D local feature. PointNet does not even contain convolutional layers, skip connections, or looping structures. Fully-connected layers are amenable to massive parallelization on the FPGA circuit. The data flows in one direction from the input to the output layer, so PointNet is well suited for inter-layer pipelining.
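The O(1) memory behavior follows from the fact that max-pooling is a running reduction: points can be streamed one at a time, with only the current global feature kept alive. A short NumPy sketch (the per-point feature `local_feature` is a hypothetical stand-in for the five MLP layers):

```python
import numpy as np

def local_feature(p, W):
    # Hypothetical per-point feature; the real design uses 5 FC+BN+ReLU layers.
    return np.maximum(p @ W, 0.0)

def global_feature_streaming(points, W):
    """Process points one-by-one; only one local feature exists at a time."""
    g = np.zeros(W.shape[1])          # 1024-D in the actual design
    for p in points:
        y = local_feature(p, W)       # point-wise local feature
        g = np.maximum(g, y)          # element-wise running max (Eq. 6)
    return g
```

The streaming result is identical to max-pooling over all local features at once, but the buffer size is independent of the number of points.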
IV Design of PointNet Accelerator
IV-A Overview of the Design
This section presents an FPGA SoC-based design of the PointNet accelerator, since PointNet feature extraction is the major bottleneck in PointNetLK, as shown in Section VI. Fig. 1 depicts a block diagram of our board-level implementation, which is partitioned into the processing system (PS) part and the programmable logic (PL) part. The proposed PointNet IP core and a Direct Memory Access (DMA) controller are instantiated inside the PL part, which computes a global feature from an input point cloud upon a request from the PS part. The PS part is responsible for setting up the IP core and triggering the DMA controller. Other steps such as Jacobian computation and coordinate transformation are also performed on the PS part. For high-speed data transfer, the DMA controller is connected to a 32-bit wide high-performance slave port (HPC port) and utilizes the AXI4-Stream protocol (red lines in Fig. 1). The control registers are accessible through the AXI4-Lite interface connected to a high-performance master port (HPM port, blue lines in Fig. 1).
Our PointNet core has two modes: weight initialization and feature extraction. In the weight initialization mode, the IP core receives the PointNet model parameters (i.e., weights and biases) through the AXI4-Stream interface and stores them in the on-chip BRAM buffer. The IP core returns a nonzero 32-bit value as an acknowledgement to notify the PS that initialization is complete and it is in the ready state. In the feature extraction mode, the 1024D global feature is first initialized with zeros. Then, as illustrated in Fig. 2, the IP core receives the 3D coordinates of each point and computes a 1024D point-wise local feature by propagating it through five consecutive MLP layers. The global feature is updated by taking the element-wise maximum of the current global feature and the local feature (Eq. 6). In this way, point-wise local features are aggregated into one global feature. After the computation is done for all points, the current global feature is returned to the PS as the final result. Our design takes advantage of a key property of PointNet: the computation for each point is independent except for the last max-pooling layer. This substantially reduces the BRAM consumption, as it obviates the need to keep intermediate results and local features for all points. The design is also flexible and scalable in the sense that it does not limit the number of input points. To prevent accuracy loss, our design uses a 32-bit fixed-point format.
IV-B Modules in the PointNet IP core
As shown in Fig. 2, the IP core is composed of three types of modules: FC, BN-ReLU, and MaxPool. FC corresponds to a fully-connected layer; it computes an output $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ from an input $\mathbf{x}$, where $W$ is a weight matrix and $\mathbf{b}$ is a bias term. BN-ReLU combines batch normalization and a ReLU activation: given an input $\mathbf{x}$, the $k$-th element of its output $\mathbf{y}$ is obtained as follows:

$$y_k = \max\left(0, \; \gamma_k \frac{x_k - \mu_k}{\sigma_k} + \beta_k\right) \quad (5)$$

where $\mu_k, \sigma_k$ are the mean and standard deviation, and $\gamma_k, \beta_k$ denote the weight and bias. MaxPool updates the global feature $\mathbf{g}$ using a point-wise local feature $\mathbf{y}$ as follows:

$$g_k \leftarrow \max(g_k, y_k) \quad (6)$$
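As a behavioral reference for Eqs. 5 and 6, the three module types can be expressed in a few lines of NumPy (a software sketch of the hardware modules, not HLS code; parameter names are ours):

```python
import numpy as np

def fc(x, W, b):
    """FC: fully-connected layer, y = Wx + b."""
    return W @ x + b

def bn_relu(x, gamma, beta, mean, std):
    """BN-ReLU (Eq. 5): normalize, scale and shift, then clamp at zero."""
    return np.maximum(0.0, gamma * (x - mean) / std + beta)

def max_pool(g, y):
    """MaxPool (Eq. 6): element-wise running maximum of the global feature."""
    return np.maximum(g, y)
```

In the IP core these are separate hardware modules chained in a dataflow pipeline, so each stage processes a different point concurrently (Section IV-D).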
IV-C Exploiting the Intra-layer Parallelism
FC involves a matrix-vector multiplication between a weight matrix $W$ and an input $\mathbf{x}$, represented by two nested loops over the output and input dimensions. We unroll the inner loop with an unrolling factor $B$, so that $B$ multiplications between weights and inputs are performed in parallel. The $B$ partial products are then accumulated using an adder tree, which takes $\lceil \log_2 B \rceil$ steps. In this way, the number of loop iterations is reduced by a factor of $B$, which yields a roughly $B\times$ speedup. This approach requires $B$ times more DSP blocks and the array partitioning of $W$ to increase the number of read operations per clock cycle. We further reduce the latency by pipelining the inner loop. BN-ReLU and MaxPool are easily parallelizable, as the computation of each output element ($y_k$ or $g_k$) is independent, as seen in Eqs. 5 and 6. We set an unrolling factor $B$ to compute $B$ output elements in parallel and obtain a $B\times$ performance improvement.
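The unrolled inner loop can be emulated in software to check the arithmetic: each iteration consumes a chunk of B inputs, multiplies them in parallel, and reduces the B products with a log2(B)-depth adder tree. A sketch (in hardware, the B multiplications and each tree level would complete in a single cycle):

```python
import numpy as np

def adder_tree(v):
    """Pairwise reduction: log2(len(v)) levels instead of a sequential sum."""
    v = list(v)
    while len(v) > 1:
        if len(v) % 2:                     # pad odd-length levels with zero
            v.append(0.0)
        v = [v[i] + v[i + 1] for i in range(0, len(v), 2)]
    return v[0]

def dot_unrolled(w, x, B=8):
    """Dot product processed in chunks of B, as the unrolled FC loop does."""
    acc = 0.0
    for j in range(0, len(w), B):
        prods = w[j:j + B] * x[j:j + B]    # B parallel multiplications
        acc += adder_tree(prods)
    return acc
```

The result matches an ordinary dot product; only the schedule changes, which is exactly why the unrolling is a pure latency/resource trade-off.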
IV-D Exploiting the Inter-layer Parallelism
We also exploit coarse-grained task-level parallelism to further improve the performance. As depicted in Fig. 3, the modules work in a pipelined manner: this allows the computations for multiple input points to overlap and hides the data transfer overhead. For instance, the fifth MLP layer (MLP5) computes a 1024D local feature of the first point while the fourth MLP layer (MLP4) computes a 128D local feature of the second point. We carefully choose a loop unrolling factor for each module to make the latencies of all modules as even as possible (i.e., the pipeline evenly divides the workload among modules) and maximize the effectiveness of pipelining. Table I lists the unrolling factors and latencies for the modules inside the core. As expected, the FC module for the last fully-connected layer is the bottleneck of the pipeline: we use the maximum possible unrolling factor to fully unroll its loop. For the other modules, we adjust the unrolling factors such that their latencies do not exceed that of the bottleneck FC module.
Module  Unroll factor  Latency (µs)  Module  Unroll factor  Latency (µs)

FC  1  5.77  BN-ReLU  1  0.68
FC  16  5.13  BN-ReLU  1  1.32
FC  32  7.69  BN-ReLU  2  5.16
FC  128  10.28  MaxPool  2  5.14
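With inter-layer pipelining, steady-state throughput is governed by the slowest stage: one point completes per initiation interval, which equals the maximum module latency. A quick check using the latencies in Table I:

```python
# Stage latencies in microseconds, taken from Table I.
latencies = [5.77, 0.68, 5.13, 1.32, 7.69, 5.16, 10.28, 5.14]

sequential = sum(latencies)   # per-point cost with no stage overlap
interval = max(latencies)     # pipelined: one point per slowest stage

print(f"sequential per point: {sequential:.2f} us")
print(f"pipelined interval:   {interval:.2f} us")
print(f"stage overlap gain:   {sequential / interval:.2f}x")
```

The resulting gain of roughly 4x is in line with the inter-layer pipelining speedup reported in Section VI-D, and it also shows why the unrolling factors are tuned to equalize stage latencies: any stage slower than the last FC would directly stretch the interval.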
V Implementation Details
We developed the custom accelerator for PointNet using Xilinx Vitis HLS 2020.2, and used Xilinx Vivado 2020.2 for synthesis and place-and-route. We chose Xilinx Zynq UltraScale+ MPSoC devices, namely the Xilinx ZCU104 Evaluation Kit (XCZU7EV-2FFVC1156) and the Avnet Ultra96v2 (ZU3EG A484), as target FPGA SoCs (Fig. 4); each integrates an FPGA fabric and a mobile CPU on the same chip. The specifications of these chips are listed in Table II. Both run Ubuntu 20.04-based Pynq Linux 2.7 on a quad-core ARM Cortex-A53 CPU at 1.2 GHz and have 2 GB of DRAM. We set the operating frequency of our accelerator to 100 MHz.
Board  BRAM  DSP  FF  LUT 

ZCU104  312  1728  460800  230400 
Ultra96v2  216  360  141120  70560 
We took the PointNetLK source code used in the original paper [1] and modified it to offload PointNet feature extraction to our FPGA accelerator. The code is implemented in Python 3.8.2 with PyTorch 1.10.2. For ZCU104 and Ultra96v2, PyTorch was compiled using GCC 9.3.0 with ARM Neon intrinsics enabled to take advantage of the quad-core CPU. The authors of [1] first pretrained the PointNet classification network and then fine-tuned its weights with a PointNetLK loss function. In contrast, we trained PointNetLK from scratch and did not apply a transfer-learning approach. We used the same hyperparameter settings as the original code and did not conduct a further parameter search. The number of training epochs was set to 250.
VI Evaluation
VI-A Accuracy
In this section, we evaluate the registration accuracy of PointNetLK using our proposed IP core, in comparison with the CPU version and ICP [2]. As done in the original paper [1], we trained PointNetLK on the training sets of 20 object classes (airplane to lamp) in ModelNet40 [24] and tested it on the test sets of the same 20 classes.
For each CAD model, we extracted a template point cloud from the vertices and normalized it to fit inside a unit cube. We rotated the template around a random axis by a constant initial angle, and then translated it by a random vector with uniformly distributed elements, to generate a source. From the ground-truth transform and the estimated transform, we computed rotational and translational errors. We downsampled (or upsampled) the input point clouds as necessary to fix the number of points in the template and source to 1024 for all data samples. In both ICP and PointNetLK, the same dataset and ground truth were used, and the maximum number of iterations was set to 20 for a fair comparison. Fig. 5 shows the results with varying initial angles. PointNetLK with our proposed IP core (magenta) achieves almost the same accuracy as the software implementation (red), and PointNetLK provides better accuracy than ICP for initial angles within the trained range. For larger initial angles, PointNetLK does not converge to correct solutions and shows larger rotational errors than ICP. This is an expected behavior; during training, we created the rigid transform between point clouds from a 6D twist vector with norm less than 0.8. In other words, PointNetLK was never trained on point cloud pairs with initial angles larger than 0.8 radians (about 45.8°).
We also trained PointNetLK on the training sets of the first 20 classes and tested it on the test sets of the other 20 classes (laptop to xbox). While this setting (cyan, green) shows a larger translational error than PointNetLK trained and tested on the same classes (magenta, red) for small initial angles, it still achieves the same level of accuracy, especially in rotation estimation. Besides, for large initial angles, the registration error is lower than that of ICP and closer to that of PointNetLK in the previous setting. This indicates that PointNetLK has the generalization ability to align point clouds that are distinct from the training dataset. As shown in Fig. 5, PointNetLK with FPGA acceleration (green) has almost the same accuracy as its software counterpart (cyan), meaning that our IP core yields a faster computation time without compromising accuracy. For qualitative analysis, Figs. 6 and 7 visualize the registration results obtained from PointNetLK with our IP core for ModelNet40 and the Stanford bunny [25], respectively.
VI-B Computation Time
PointNetLK is evaluated in terms of computation time to highlight its significant advantage over ICP. Fig. 8 shows the results with the number of input points N varying from 128 to 4096. We used the table category in ModelNet40 and plot the average wall-clock time. We include the data transfer overhead between the PS and PL for a fair comparison. The initial angle is fixed to a constant value. We also note that PointNetLK was trained on the first 20 categories in ModelNet40, which do not include the table category. The wall-clock time increases linearly in PointNetLK and quadratically in ICP, which stems from the fact that the computational complexities of PointNetLK and ICP are O(N) and O(N^2), respectively. It directly follows that PointNetLK provides a better performance advantage over ICP as the input size increases. At a moderate input size, the CPU version of PointNetLK was 1.36x slower than ICP (5.47 s vs. 4.04 s). The FPGA version (red) took only 366 ms per input, which was 14.98x faster than the CPU version (green) and eventually led to an 11.04x speedup over ICP (blue). As shown in Fig. 8, we obtained better results at the largest input sizes: compared to ICP, the CPU version was 3.26x faster (71.16 s vs. 21.82 s), and the FPGA version was 69.60x faster, which is attributed to the 21.34x speedup from FPGA offloading (21.82 s to 1.02 s).
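The quoted speedups follow directly from the measured wall-clock times; the tiny discrepancies below come from the truncated timings quoted in the text. A quick sanity check:

```python
# Wall-clock times (s) at the largest input size, from Fig. 8.
icp, cpu_lk, fpga_lk = 71.16, 21.82, 1.02

print(f"CPU PointNetLK vs ICP : {icp / cpu_lk:.2f}x")    # ~3.26x
print(f"FPGA offload speedup  : {cpu_lk / fpga_lk:.2f}x")  # ~21.4x
print(f"FPGA PointNetLK vs ICP: {icp / fpga_lk:.2f}x")   # ~69.8x
```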
Fig. 9 shows the breakdown of the processing time for PointNetLK with and without FPGA acceleration, for a small and a large initial angle (first two and last two rows, respectively). PointNet feature extraction (red + green) is inevitably the major bottleneck, accounting for 91.90% and 93.29% of the total time at the two angles; FPGA acceleration reduces these shares to 58.01% and 57.96%.
VI-C Effects of Quantization
This section analyzes the relationship between the number of quantization bits used in the IP core and the registration accuracy. Fig. 10 shows the PointNetLK registration errors, evaluated with five different bit widths from 16 to 32. Table III summarizes the FPGA resource utilization. Each IP core design uses an n-bit fixed-point format, with the bit width split between an integer part and a fraction part. We trained PointNetLK on the first 20 object classes and tested it with the table class in ModelNet40.
As is apparent in Fig. 10, the 16-bit quantized version exhibits larger errors than the others. In contrast, for small initial angles, the 20-bit version produces nearly the same results as the 32-bit version. Even for larger angles, the bit-width reduction from 32 to 20 bits introduces only a slight accuracy loss. Notably, the DSP usage is halved by reducing the width from 32 to 24 bits (Table III). The reduction from 24 to 20 bits further halves the DSP footprint (24.07% to 12.56%) and increases the LUT usage (9.91% to 12.87%), since arithmetic units such as multipliers are implemented with more LUTs and fewer DSPs. The results indicate that the 20-bit version strikes the best balance between accuracy and resource consumption. As seen in Table IV, the 20-bit version fits within a low-cost and resource-limited FPGA, the Avnet Ultra96v2, whereas the 32-bit version cannot be implemented due to the shortage of DSP and LUT resources.
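The accuracy side of this trade-off can be previewed in software by quantizing values to an n-bit fixed-point grid before inference. A minimal sketch (the integer/fraction split below is illustrative, not the split used in the IP core):

```python
import numpy as np

def to_fixed_point(x, total_bits=20, frac_bits=14):
    """Round to steps of 2^-frac_bits and saturate to the two's-complement range."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale

# Inside the representable range, rounding error is at most half a step.
w = np.random.default_rng(3).uniform(-1.0, 1.0, 1024)
err = np.abs(to_fixed_point(w) - w).max()
assert err <= 0.5 / (1 << 14)
```

Shrinking `frac_bits` doubles the rounding step per bit removed, which is why accuracy degrades gracefully down to some width (20 bits here) and then sharply below it.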
# of Bits  BRAM (%)  DSP (%)  FF (%)  LUT (%) 

32  55.13  48.50  5.46  16.31 
28  55.13  48.09  4.73  13.70 
24  44.87  24.07  4.05  9.91 
20  44.23  12.56  3.85  12.87 
16  27.40  12.21  2.83  7.74 
# of Bits  BRAM (%)  DSP (%)  FF (%)  LUT (%) 

20  64.81  60.28  12.81  43.52 
32  79.63  100.00  22.35  238.77 
VI-D Effects of Design Optimization
Here, we discuss the effects of the design optimizations described in Section IV on performance and FPGA resource utilization. In addition to the final design, we also consider the design without inter-layer pipelining and the naive design with no optimization as baselines. Fig. 11 plots the processing time with varying point cloud sizes, and Table V compares the resource utilization. In Fig. 11, we observe a linear increase of the processing time, and the naive design (blue) is 3.49x slower than the CPU (black) (1267.08 ms vs. 363.49 ms). By exploiting the intra-layer parallelism, the design (green) attains a 34.46x speedup over the unoptimized version (1267.08 ms to 36.77 ms), at the expense of a 14.38x increase in DSP usage (3.07% to 44.16%). The inter-layer pipelining brings a further 4.29x speedup (36.77 ms to 8.58 ms) with few additional resources, by overlapping data transfer and computation across modules. This leads to a total performance improvement of 147.68x and 42.36x over the unoptimized version and the CPU, respectively.
Design  BRAM (%)  DSP (%)  FF (%)  LUT (%) 

Naive  45.99  3.07  0.82  3.73 
Intra-layer  54.17  44.16  4.56  10.66
Inter- & intra-layer  55.13  48.50  5.46  16.31
VI-E Power Consumption
The power consumption of our accelerator was 722mW according to the estimates reported by Xilinx Vivado 2020.2.
VII Conclusion
In this paper, we present a resource-efficient FPGA-based implementation of 3D point cloud registration. We opt to use PointNetLK, which combines PointNet feature embedding with Lucas-Kanade (LK)-based pose estimation. We develop a custom PointNet accelerator and implement it on a mid-range FPGA SoC (Xilinx ZCU104). We exploit both intra- and inter-layer parallelism in PointNet to fully optimize the design, achieving O(N) computational complexity and an O(1) on-chip memory requirement. Experiments demonstrate that PointNetLK with our accelerator achieves up to 21.34x and 69.60x speedups compared to the CPU counterpart and ICP, respectively, without compromising the accuracy. Besides, it consumes only 722 mW at runtime and offers better scalability than ICP. The quantized design fits within even smaller FPGAs (Avnet Ultra96v2). Experiments also highlight the generalization ability of PointNetLK.
References

[1] Y. Aoki, H. Goforth, R. A. Srivatsan, and S. Lucey, “PointNetLK: Robust & Efficient Point Cloud Registration using PointNet,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 7156–7165.
[2] P. J. Besl and N. D. McKay, “A Method for Registration of 3-D Shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 14, no. 2, pp. 239–256, Feb. 1992.
[3] A. V. Segal, D. Haehnel, and S. Thrun, “Generalized-ICP,” in Proceedings of the Robotics: Science and Systems Conference (RSS), June 2009.
 [4] J. Li, H. Zhan, B. M. Chen, I. Reid, and G. H. Lee, “Deep Learning for 2D Scan Matching and Loop Closure,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2017, pp. 763–768.
[5] M. Valente, C. Joly, and A. de La Fortelle, “An LSTM Network for Real-Time Odometry Estimation,” in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), June 2019, pp. 1434–1440.
 [6] L. Ding and C. Feng, “DeepMapping: Unsupervised Map Estimation From Multiple Point Clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 8642–8651.
 [7] V. Sarode, X. Li, H. Goforth, Y. Aoki, R. A. Srivatsan, S. Lucey, and H. Choset, “PCRNet: Point Cloud Registration Network using PointNet Encoding,” arXiv Preprint 1908.07906, Aug. 2019.
 [8] G. D. Pais, S. Ramalingam, V. M. Govindu, J. C. Nascimento, R. Chellappa, and P. Miraldo, “3DRegNet: A Deep Neural Network for 3D Point Registration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 7193–7203.
 [9] G. Elbaz, T. Avraham, and A. Fischer, “3D Point Cloud Registration for Localization Using a Deep Neural Network AutoEncoder,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 4631–4640.
 [10] Y. Wang and J. M. Solomon, “Deep Closest Point: Learning Representations for Point Cloud Registration,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, pp. 3523–3532.
[11] W. Lu, G. Wan, Y. Zhou, X. Fu, P. Yuan, and S. Song, “DeepVCP: An End-to-End Deep Neural Network for Point Cloud Registration,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Feb. 2019, pp. 12–21.

[12] Y. Wang and J. M. Solomon, “PRNet: Self-Supervised Learning for Partial-to-Partial Registration,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Dec. 2019, pp. 8814–8826.
[13] Z. J. Yew and G. H. Lee, “RPM-Net: Robust Point Matching Using Learned Features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 11824–11833.
 [14] C. Choy, W. Dong, and V. Koltun, “Deep Global Registration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 2514–2523.
 [15] A. Kurobe, Y. Sekikawa, K. Ishikawa, and H. Saito, “CorsNet: 3D Point Cloud Registration by Deep Neural Network,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 3960–3966, Feb. 2020.
 [16] K. Fu, S. Liu, X. Luo, and M. Wang, “Robust Point Cloud Registration Framework Based on Deep Graph Matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 8893–8902.
 [17] T. Min, E. Kim, and I. Shim, “Geometry Guided Network for Point Cloud Registration,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7270–7277, Oct. 2021.
 [18] X. Li, J. K. Pontes, and S. Lucey, “PointNetLK Revisited,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 12 763–12 772.
[19] A. Kosuge, K. Yamamoto, Y. Akamine, and T. Oshima, “An SoC-FPGA-Based Iterative-Closest-Point Accelerator Enabling Faster Picking Robots,” IEEE Transactions on Industrial Electronics, vol. 68, no. 4, pp. 3567–3576, Mar. 2020.
[20] Q. Deng, H. Sun, F. Chen, Y. Shu, H. Wang, and Y. Ha, “An Optimized FPGA-Based Real-Time NDT for 3D-LiDAR Localization in Smart Vehicles,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 68, no. 9, pp. 3167–3171, July 2021.
 [21] P. Biber and W. Straßer, “The Normal Distributions Transform: A New Approach to Laser Scan Matching,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2003, pp. 2743–2748.
[22] M. Eisoldt, M. Flottmann, J. Gaal, P. Buschermöhle, S. Hinderink, M. Hillmann, A. Nitschmann, P. Hoffmann, T. Wiemann, and M. Porrmann, “HATSDF SLAM – Hardware-accelerated TSDF SLAM for Reconfigurable SoCs,” in Proceedings of the European Conference on Mobile Robots (ECMR), Aug. 2021.
 [23] T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.
 [24] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A Deep Representation for Volumetric Shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [25] G. Turk and M. Levoy, “The Stanford 3D Scanning Repository,” http://graphics.stanford.edu/data/3Dscanrep/.