
Orbeez-SLAM: A Real-time Monocular Visual SLAM with ORB Features and NeRF-realized Mapping

by Chi-Ming Chung, et al.

A spatial AI that can perform complex tasks through visual signals and cooperate with humans is highly anticipated. To achieve this, we need a visual SLAM that easily adapts to new scenes without pre-training and generates dense maps for downstream tasks in real time. None of the previous learning-based or non-learning-based visual SLAMs satisfy all these needs due to the intrinsic limitations of their components. In this work, we develop a visual SLAM named Orbeez-SLAM, which successfully combines an implicit neural representation (NeRF) with visual odometry to achieve our goals. Moreover, Orbeez-SLAM works with a monocular camera since it only needs RGB inputs, making it widely applicable to the real world. We validate its effectiveness on various challenging benchmarks. Results show that our SLAM is up to 800x faster than the strong baseline while producing superior rendering outcomes.




I Introduction

An intelligent spatial AI that can receive visual signals (RGB-D images) and cooperate with humans to solve complicated tasks is highly valued. To efficiently understand semantic knowledge from the environment and act as humans do, spatial AI requires a core component named visual simultaneous localization and mapping (SLAM). The visual SLAM should quickly adapt to new scenes without pre-training and generate fine-grained maps in real time for downstream tasks, such as domestic robots. However, traditional visual SLAMs [20, 7] mainly focus on localization accuracy and only provide crude maps. To this end, this work aims to develop a visual SLAM with the aforementioned properties.

To compute dense maps, a recent learning-based visual SLAM, Tandem [10], leverages truncated signed distance function (TSDF) fusion to provide a dense 3D map. As claimed in [10], Tandem achieves real-time inference and can work with a monocular camera. However, the TSDF fusion involves depth estimation, and Tandem's depth estimation module needs pre-training before inference, which significantly limits its adaptability to novel scenes that differ from the pre-training data.

Neural Radiance Field (NeRF) [16], another implicit neural representation, does not require depth supervision during training and can be trained from scratch at the target scene. This attribute makes NeRF a promising map representation for visual SLAM. Two recent NeRF-SLAMs [29, 35] echo our motivation. Among them, iMAP [29] is the first work to let NeRF serve as the map representation in SLAM; it also optimizes the camera pose via back-propagation of the NeRF photometric loss. NICE-SLAM [35] extends iMAP with a hierarchical feature grid module, which allows it to scale up to large scenes without catastrophic forgetting. Nevertheless, these NeRF-SLAMs need RGB-D inputs since they optimize camera poses purely through the neural network without visual odometry (VO), causing poor initial localizations. In other words, they still need depth information to guide the 3D geometry. Another notable shortcoming of NeRF is its slow convergence: training involves extensive volume rendering, which makes real-time NeRF training infeasible. Observing this, instant-ngp [17] addresses the training-speed issue; with the help of multi-resolution hash encoding and the CUDA framework [18], it can train NeRFs in a few seconds.
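To give a flavor of multi-resolution hash encoding, here is a minimal Python sketch. The table size, level count, and nearest-corner lookup are simplifying assumptions: instant-ngp trilinearly interpolates the eight corner features of each voxel and trains the tables jointly with the MLP.

```python
def hash_coords(ix, iy, iz, table_size):
    # Spatial hash in the spirit of instant-ngp: XOR coordinates scaled by large primes.
    primes = (1, 2654435761, 805459861)
    return (ix * primes[0] ^ iy * primes[1] ^ iz * primes[2]) % table_size

def encode(point, levels=4, base_res=2, growth=2.0, table_size=2**14, feat_dim=2, tables=None):
    """Look up one feature vector per resolution level (nearest corner only,
    no trilinear blend) and concatenate them into the MLP input encoding."""
    if tables is None:
        # One learnable feature table per level; zero-initialized here for illustration.
        tables = [[[0.0] * feat_dim for _ in range(table_size)] for _ in range(levels)]
    features = []
    for lvl in range(levels):
        res = int(base_res * growth ** lvl)           # finer grid at each level
        ix, iy, iz = (int(c * res) for c in point)    # voxel corner at this level
        idx = hash_coords(ix, iy, iz, table_size)
        features.extend(tables[lvl][idx])
    return features
```

Because the tables are tiny compared to a dense voxel grid and lookups are O(1), most of the network capacity moves from the MLP into these trainable tables, which is what makes the few-second training times possible.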

To tackle the above shortcomings, we seek to develop a monocular visual SLAM that is pre-training-free and achieves real-time inference for practical applications. To this end, we propose Orbeez-SLAM, which combines feature-based SLAM (e.g., ORB-SLAM2 [20]) with a NeRF based on the instant-ngp framework [17]. Different from [29, 35], we emphasize that VO (in ORB-SLAM2) provides good camera pose estimates even at the early stage of training, which lets Orbeez-SLAM work with monocular cameras, i.e., without depth supervision. Moreover, we simultaneously estimate camera poses via VO and update the NeRF network. Notably, the training process is online and real-time without pre-training, as depicted in Fig. 1. As a result, Orbeez-SLAM can render dense information such as the depth and color of scenes. It is validated in various indoor scenes and outperforms NeRF-SLAM baselines in speed, camera tracking, and reconstruction. To summarize, our contributions are threefold:

  • We propose Orbeez-SLAM, the first real-time monocular visual SLAM that is pre-training-free and provides dense maps, tailored for spatial AI applications.

  • By combining visual odometry and a fast NeRF framework, our method reaches real-time inference and produces dense maps.

  • We extensively validate Orbeez-SLAM with state-of-the-art (SOTA) baselines on challenging benchmarks, showing superior quantitative and qualitative results.

II Related Works

II-A Implicit neural representations

To represent a 3D scene, explicit representations (e.g., point clouds) require huge storage. By contrast, implicit surface representations, such as signed distance functions (SDFs), alleviate the space issue and have been widely developed in recent years. Some works [31, 22] leverage neural networks to learn the implicit function, known as implicit neural representations (INRs). Thanks to their continuous representation of signals, INRs offer several advantages: (a) they are not coupled to the spatial dimension/resolution of the input signals, and (b) they can predict/synthesize unobserved regions.

Besides, NeRF, a novel and popular INR, has demonstrated success in novel view synthesis [16, 14, 1, 2]. Nonetheless, most NeRFs assume the camera poses are known; thus, COLMAP [24, 25] is often used in NeRF-related works to estimate the intrinsics and extrinsics (camera poses). In addition, a few works [33, 9, 12] optimize camera poses via the NeRF photometric loss, but the process requires a long training time. Hence, as mentioned above, instant-ngp [17] provides a framework that can train NeRFs in a few seconds, leveraging multi-resolution hash encoding and the CUDA platform [18].

Intuitively, implicit surface representations can serve as maps in visual SLAM systems. For instance, some studies [21, 32] leverage the SDF to construct the map, and two recent NeRF-SLAMs [29, 35] pave the way for combining NeRFs with visual SLAM. However, they need RGB-D inputs and converge slowly, which does not satisfy our needs. Therefore, we aim to build a NeRF-SLAM that generates a dense map in real time, works with a monocular camera, and trains from scratch at the target scene without a lengthy pre-training process.

II-B Visual SLAM systems

Traditional visual SLAMs are strong in outdoor and large scenes and can rapidly compute accurate locations, but they lack fine-grained scene information. Visual SLAMs fall into two categories: feature-based and direct. Feature-based SLAMs [19, 20, 3] extract and match image features between frames and then minimize the reprojection error, while direct SLAMs [7] use pixel intensities to localize and minimize the photometric error.

To satisfy the needs of spatial AI, we require a visual SLAM that provides a dense map for complicated tasks. Several works [10, 29, 35] achieve this objective with deep learning techniques. However, they either need pre-training [10], which limits their adaptation capability, or optimize the camera poses and network parameters relying only on the NeRF photometric loss and depth supervision [29, 35], lacking the knowledge of VO. Thus, we develop Orbeez-SLAM, which eliminates these drawbacks by combining VO guidance with a fast NeRF implementation. Consequently, Orbeez-SLAM is pre-training-free for novel scenes and reaches real-time inference (with online training).

III Preliminaries

III-A NeRF

NeRF [16] reconstructs a 3D scene by training on a sequence of 2D images from distinct viewpoints. A continuous scene can be represented as a function $F_\Theta: (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$, where $\mathbf{x}$ is the 3D location and $\mathbf{d}$ is the 3D Cartesian unit vector representing the viewing direction. The outputs are the color $\mathbf{c}$ and the volume density $\sigma$. Such a continuous representation, as a function, can be approximated with an MLP network by optimizing the weights $\Theta$.

Following the above definitions, for any ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, the color within the near and far bounds $t_n$ and $t_f$ can be obtained through the integral of the products of the transmittance $T(t)$, the volume density $\sigma(\mathbf{r}(t))$, and the color $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$ at each point, i.e.,

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \quad (1)$$

$$T(t) = \exp\Big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\Big). \quad (2)$$

To feed the input rays into the neural network, discrete points are sampled to estimate the color (Eq. 1) with the quadrature rule [15]:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\,\mathbf{c}_i, \quad (3)$$

$$T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \quad (4)$$

where $\delta_i$ is the distance between adjacent samples. The weight of a sample point is denoted as

$$w_i = T_i \big(1 - \exp(-\sigma_i \delta_i)\big). \quad (5)$$
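The discrete color estimate and per-sample weights above can be written in a few lines of Python (an illustrative re-implementation, not the paper's CUDA code):

```python
import math

def render_ray(sigmas, colors, deltas):
    """Composite one ray's samples: alpha_i = 1 - exp(-sigma_i * delta_i),
    weight w_i = T_i * alpha_i with transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j),
    and the ray color is the weight-blended sum of sample colors."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # T_1 = 1: nothing occludes the first sample
    weights = []
    for sigma, c, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        w = transmittance * alpha
        weights.append(w)
        color = [acc + w * ch for acc, ch in zip(color, c)]
        transmittance *= 1.0 - alpha  # equals the cumulative exponential in closed form
    return color, weights
```

The multiplicative transmittance update is algebraically identical to the exponential of the running density sum, since exp(-sum sigma_j delta_j) is the product of the per-segment factors (1 - alpha_j).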

Fig. 2: Skip voxel strategy. When sampling positions along a cast ray, a voxel is skipped if it is unoccupied (marked as 0); we only sample within voxels that intersect the surface (marked as 1).

III-B Density grid

The rendering equation (1) in NeRF requires sampling positions on the ray. Only positions that intersect the surface are of real interest since they contribute most to (1). Some studies [16, 1, 2] leverage a coarse-to-fine strategy that first samples uniformly on the ray to find its density distribution by querying NeRF, and then samples only positions near the surface. However, these steps require frequent NeRF queries, which is time-consuming.

To tackle this, recent works [17, 13, 23, 30, 5] store the query results in a density grid and then usually apply the skip voxel strategy, as shown in Fig. 2. In this work, we further extend the skip voxel strategy with NeRF's density knowledge to perform ray-casting triangulation (see Section IV-C).
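A minimal sketch of the skip voxel strategy follows; the dict-based occupancy grid, step size, and ray bounds are illustrative stand-ins for the dense CUDA density-grid bitfields used in practice:

```python
def sample_with_skip(ray_origin, ray_dir, occupancy, voxel_size=1.0,
                     t_near=0.0, t_far=8.0, step=0.5):
    """March along the ray and keep only sample positions whose voxel is marked
    occupied (1) in the cached density grid; unoccupied voxels (0) are skipped,
    so the expensive NeRF network is never queried for empty space."""
    samples = []
    t = t_near
    while t < t_far:
        pos = tuple(o + t * d for o, d in zip(ray_origin, ray_dir))
        voxel = tuple(int(p // voxel_size) for p in pos)   # integer voxel index
        if occupancy.get(voxel, 0):                        # grid lookup, not a NeRF query
            samples.append(pos)
        t += step
    return samples
```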

IV Methodology

Fig. 3: System Pipeline. The tracking and mapping processes run concurrently. A frame from the image stream must satisfy two conditions to become a keyframe: the first filters out frames with weak tracking results, and the second drops the frame if the mapping process is busy. The tracking process provides camera pose estimates. The mapping process refines the camera poses and maintains the maps. We also show the dense point cloud generated from our proposed ray-casting triangulation, introduced in Section IV-C.

Unlike previous NeRF-SLAMs [29, 35], which require depth information to perceive geometry better, we develop Orbeez-SLAM, which leverages VO for accurate pose estimation to generate a dense map with a monocular camera. Besides, it achieves pre-training-free adaptation and real-time inference. The system overview is depicted in Sec. IV-A, the optimization objectives are described in Sec. IV-B, and ray-casting triangulation is introduced in Sec. IV-C.

IV-A System overview

Fig. 3 shows our system pipeline. The tracking process extracts image features from the input image stream and estimates camera poses via visual odometry. The mapping process generates map points with triangulation and optimizes camera poses and map points with bundle adjustment (reprojection error); these map points form a sparse point cloud. We then utilize the updated camera poses and map to train the NeRF. Since the process is differentiable, we can still optimize the camera poses from the NeRF photometric loss. In the end, the NeRF can generate a dense map for downstream tasks. Moreover, this pipeline should work for any SLAM that provides a sparse point cloud.

Fig. 4: Ray-casting triangulation in NeRF. We record the sample count for each density grid voxel. If the weight of a voxel (see (5) in Sec. III-A) exceeds the threshold for a surface candidate, we add 1 to the voxel's counter. Voxels with high sample counts are likely to contain a surface and are added as map points to the dense point cloud.

IV-B Optimization

The following objectives are used to optimize Orbeez-SLAM: (a) pose estimation, (b) bundle adjustment, and (c) NeRF regression. Among them, (a) is in the tracking process, and (b) and (c) are conducted in the mapping process.

IV-B1 Pose estimation

Reprojection error [4] is widely used in feature-based SLAM [19, 20, 3] to estimate the pose, and its formulation is as follows:

$$e_{i,j} = \mathbf{u}_{i,j} - \pi(T_i, P_j),$$

where $\mathbf{u}_{i,j}$ is the pixel position on the image, observed by the $i$-th camera and projected from the $j$-th 3D point. The projection $\pi$ maps the 3D map point $P_j$ to the pixel coordinate via $K T_i P_j$, where $K$ and $T_i$ are the intrinsics and extrinsics (world to camera) described by $[R|t]$. We optimize the camera poses by minimizing the reprojection error:

$$T_i^{*} = \arg\min_{T_i} \sum_{j} \rho\big(\lVert e_{i,j} \rVert^2\big). \quad (6)$$
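For concreteness, the reprojection of a map point and its pixel error can be sketched as follows (schematic Python with list-based matrices; the actual system evaluates this inside g2o over SE(3)):

```python
def project(point_w, K, R, t):
    """Project a 3D world point into pixel coordinates: u ~ K (R X + t)."""
    xc = [sum(R[r][c] * point_w[c] for c in range(3)) + t[r] for r in range(3)]
    u = K[0][0] * xc[0] / xc[2] + K[0][2]   # fx * x/z + cx
    v = K[1][1] * xc[1] / xc[2] + K[1][2]   # fy * y/z + cy
    return u, v

def reprojection_error(obs_uv, point_w, K, R, t):
    """e_ij: pixel distance between the observed feature and the projected map point."""
    u, v = project(point_w, K, R, t)
    return ((obs_uv[0] - u) ** 2 + (obs_uv[1] - v) ** 2) ** 0.5
```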


IV-B2 Bundle adjustment

After the triangulation step in VO, new map points are added to the local map. The bundle adjustment objective also minimizes the reprojection error, but over both the map point positions and the camera poses:

$$\{T_i, P_j\}^{*} = \arg\min_{T_i, P_j} \sum_{i,j} \rho\big(\lVert \mathbf{u}_{i,j} - \pi(T_i, P_j) \rVert^2\big). \quad (7)$$
Minimizing these objectives is a nonlinear least-squares problem; we solve both with the Levenberg-Marquardt method, following [20]. The bundle adjustment optimizes the camera poses of keyframes and the map points observable in those keyframes. These optimized keyframes and map points are then passed to the NeRF.
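To illustrate how such a nonlinear least-squares objective is minimized, here is a toy one-dimensional Levenberg-Marquardt loop (purely illustrative; the real system optimizes many pose and point variables jointly in g2o):

```python
def levenberg_marquardt(residual, jacobian, x, lam=1e-3, iters=50):
    """Damped Gauss-Newton on a scalar least-squares problem min_x residual(x)^2.
    The damping lam interpolates between gradient descent (large lam) and
    Gauss-Newton (small lam), adapted by whether a step reduces the cost."""
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        step = -J * r / (J * J + lam)          # (J^T J + lam I)^-1 J^T r, scalar case
        if residual(x + step) ** 2 < r ** 2:
            x, lam = x + step, lam * 0.5       # success: accept step, trust GN more
        else:
            lam *= 2.0                         # failure: increase damping, retry
    return x
```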

IV-B3 NeRF regression

NeRF minimizes the photometric error by regressing the image color. A ray $\mathbf{r}$ can be formulated given a keyframe $i$ and a pixel coordinate $(u, v)$:

$$\mathbf{r}(t) = \mathbf{o}_i + t\,\mathbf{d}_{i,u,v}, \quad (8)$$

where $\mathbf{o}_i$ is the camera center of keyframe $i$ and $\mathbf{d}_{i,u,v}$ is the viewing direction through pixel $(u, v)$. By applying the skip voxel strategy described in Fig. 2, we sample positions on the ray that are near the surface. Finally, the NeRF photometric loss is the squared L2 norm between the predicted color $\hat{C}(\mathbf{r})$ and the observed pixel color:

$$\mathcal{L}_{photo} = \lVert \hat{C}(\mathbf{r}) - C_{gt}(\mathbf{r}) \rVert_2^2, \quad (9)$$

where $C_{gt}(\mathbf{r})$ is the observed color of ray $\mathbf{r}$ in image $i$. Since (9) is differentiable, both the camera extrinsics $T_i$ and the network parameters $\Theta$ can be optimized by $\mathcal{L}_{photo}$. However, after examination (cf. TABLE V), we optimize the camera poses only by the reprojection error (6).

IV-C Ray-casting triangulation

In Fig. 2, we show that the density grid can accelerate the rendering process. However, that structure considers only a single ray and relies heavily on the density prediction of the NeRF model. We additionally store the number of sampling times for each density grid voxel. A voxel that frequently blocks cast rays is more likely to lie on the surface, as shown in Fig. 4. For noise rejection, we only triangulate points that lie within voxels scanned by rays frequently enough; in practice we chose 64 as the threshold since it gave the best visualization in our experience. We also utilize the map points generated from the sparse point cloud: since a map point's surroundings are much more likely to be the surface, we add a large value to the sample counter of the corresponding voxel. We claim that this method finds a more reliable surface than relying on NeRF alone. Map points generated by this method are not optimized in (7). We show the dense point cloud generated by this method in Fig. 3.
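The counting scheme above can be sketched as follows. The threshold of 64 is from the text, while the per-sample weight threshold and the bonus added for sparse map points are assumed values for illustration:

```python
from collections import defaultdict

SURFACE_WEIGHT_THRESH = 0.5   # hypothetical weight above which a sample counts as a surface hit
COUNT_THRESH = 64             # voxels sampled at least this often become dense map points
MAP_POINT_BONUS = 1000        # assumed large value added around sparse VO map points

counter = defaultdict(int)    # per-voxel sample counter

def accumulate(voxel, weight):
    # A sample whose weight suggests a surface hit bumps the voxel's counter.
    if weight > SURFACE_WEIGHT_THRESH:
        counter[voxel] += 1

def seed_from_map_point(voxel):
    # Sparse VO map points very likely lie on the surface: boost their voxels.
    counter[voxel] += MAP_POINT_BONUS

def dense_points():
    # Voxels hit often enough are triangulated into the dense point cloud.
    return [v for v, n in counter.items() if n >= COUNT_THRESH]
```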

V Experiments

V-A Experimental setup

V-A1 Datasets.

For a fair comparison, we conduct our experiments on three benchmarks, TUM RGB-D [28], Replica [27], and ScanNet [6], which provide extensive images, depths, and camera trajectories and are widely used in previous works.

V-A2 Baselines.

We compare the proposed Orbeez-SLAM with two categories of baselines: (a) learning-based SLAMs: DI-Fusion [8], iMAP [29], iMAP* (re-implemented in [35]), and NICE-SLAM [35]; (b) traditional SLAMs: BAD-SLAM [26], Kintinuous [34], and ORB-SLAM2 [20].

V-A3 Evaluation settings.

In practice, monocular SLAM works validate their effectiveness with the depth (RGB-D) version since they cannot estimate the correct scale of a scene without knowing depth. Also, all previous NeRF-SLAMs require depth supervision. Thus, all methods are verified in the depth version, although we still demonstrate that Orbeez-SLAM can work with monocular cameras. Moreover, we extensively examine efficacy from two aspects: tracking and mapping results.

V-A4 Metrics.

To evaluate tracking, we report the absolute trajectory error (ATE), the root mean square error (RMSE) between the ground truth (GT) trajectory and the aligned estimated trajectory. For mapping, we adopt two metrics often used in NeRF works: PSNR and Depth L1. PSNR assesses the distortion between NeRF-rendered and GT images along the GT trajectory. For Depth L1, we calculate the L1 error between estimated and GT depth along the GT trajectory.
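Both metrics reduce to a few lines; the sketch below assumes the estimated trajectory is already aligned to the GT and that images are flat lists of intensities:

```python
import math

def ate_rmse(gt_traj, est_traj):
    """Absolute trajectory error: RMSE of positional distances between
    GT and (already aligned) estimated camera positions."""
    sq = [sum((g - e) ** 2 for g, e in zip(gp, ep))
          for gp, ep in zip(gt_traj, est_traj)]
    return math.sqrt(sum(sq) / len(sq))

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio between two equally sized images:
    10 * log10(max_val^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    return 10.0 * math.log10(max_val ** 2 / mse)
```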

Unlike [29, 35], we do not evaluate on meshes. We argue that assessing performance on meshes may be unfair because the mesh generation process of post-processing NeRF outputs is not unified. In addition, our setting has the following advantages:

  • The number of sampled keyframes varies across works, whereas evaluating with the GT trajectory provides a consistent standard.

  • The depth and PSNR can effectively reflect the quality of the geometry and radiance learned by NeRF.

  • Our setting can still verify the methods on novel views since only keyframes (in the GT trajectory) are used during training. Besides, even if a model memorizes seen keyframes, the metrics still reveal errors when those keyframes are localized at the wrong viewpoints.

Method | fr1/desk | fr2/xyz | fr3/office
DI-Fusion [8] | 4.4 | 2.3 | 15.6
iMAP [29] | 4.9 | 2.0 | 5.8
iMAP* [35] | 7.2 | 2.1 | 9.0
NICE-SLAM [35] | 2.7 | 1.8 | 3.0
Orbeez-SLAM (Ours) | 1.9 | 0.3 | 1.0
BAD-SLAM [26] | 1.7 | 1.1 | 1.7
Kintinuous [34] | 3.7 | 2.9 | 3.0
ORB-SLAM2 [20] | 1.6 | 0.3 | 0.9
TABLE I: Tracking Results on TUM RGB-D. ATE [cm] (↓) is used. Learning-based SLAMs (upper block) and traditional SLAMs (lower block) are listed separately. iMAP* denotes the re-implementation of iMAP in [35]. Results of DI-Fusion, iMAP, iMAP*, NICE-SLAM, BAD-SLAM, and Kintinuous are from [35]. The best of 5 runs is reported.
Scene ID | 0000 | 0059 | 0106 | 0169 | 0181 | 0207 | Avg.
DI-Fusion [8] | 62.99 | 128.00 | 18.50 | 75.80 | 87.88 | 100.19 | 78.89
iMAP [29] | 55.95 | 32.06 | 17.50 | 70.51 | 32.10 | 11.91 | 36.67
NICE-SLAM [35] | 8.64 | 12.25 | 8.09 | 10.28 | 12.93 | 5.59 | 9.63
Orbeez-SLAM (Ours) | 7.22 | 7.15 | 8.05 | 6.58 | 15.77 | 7.16 | 8.655
ORB-SLAM2 [20] | 7.57 | 6.92 | 8.30 | 6.90 | 16.42 | 8.78 | 9.15
TABLE II: Tracking Results on ScanNet. ATE [cm] (↓) is used. Results of iMAP, DI-Fusion, and NICE-SLAM are from [35].
Method | Depth L1 w/o GT | Depth L1 w/ GT | PSNR w/o GT | PSNR w/ GT
NICE-SLAM [35] | 13.49 | 4.22 | 17.74 | 24.60
Orbeez-SLAM (Ours) | 11.88 | - | 29.25 | -
TABLE III: Reconstruction Results on Replica. Depth L1 [cm] (↓) and PSNR [dB] (↑) are used. The values are averaged over office 0-4 and room 0-2. Since NICE-SLAM uses GT depth when rendering color and depth, we show its results with and without GT depth.

V-A5 Implementation Details.

We conduct all experiments on a desktop PC with an Intel i7-9700 CPU and an NVIDIA RTX 3090 GPU. We follow the official code of ORB-SLAM2 [20] and instant-ngp [17] to implement Orbeez-SLAM. Note that Orbeez-SLAM inherits the loop-closing process from ORB-SLAM2 [20] to improve trajectory accuracy. Unlike ORB-SLAM2, we do not cull keyframes, ensuring that a keyframe is not eliminated after being passed to the NeRF. The code is written in C++ and CUDA. For the losses, the reprojection error is optimized via the g2o framework [11], and the NeRF photometric error via the tiny-cuda-nn framework [18].

Method | fr1/desk | fr2/xyz | fr3/office
# images | 613 | 3669 | 2585
NICE-SLAM [35] | 0.056 | 0.028 | 0.037
Orbeez-SLAM (Ours) | 19.210 | 22.725 | 21.542
TABLE IV: Runtime Comparison. Frames per second [fps] (↑) when running on TUM RGB-D. Orbeez-SLAM is much faster than the SOTA NeRF-SLAM.
Pose optimization | ATE [cm] | Depth L1 [cm] | PSNR [dB]
reprojection + photometric | 5.3 | 13.43 | 25.52
reprojection only | 0.8 | 11.88 | 29.25
TABLE V: Ablation study on Replica. Optimizing camera poses only from the reprojection error is better than optimizing from both the reprojection and photometric errors.

V-B Quantitative Results

Through the experiments, we aim to verify whether Orbeez-SLAM can produce accurate trajectories (ATE), precise 3D reconstructions (Depth), and detailed perception information (PSNR) under our challenging settings, i.e., real-time inference without pre-training. Notably, previous works mainly focus on the first two indicators. However, a dense map containing rich perception information is vital for spatial AI applications; thus, we also attend to this aspect.

V-B1 Evaluation on TUM RGB-D (small-scale) [28].

TABLE I lists the tracking results of all methods. Note that Orbeez-SLAM outperforms all deep-learning baselines by a significant gap (top half). ORB-SLAM2 is the upper bound of our tracking results since our method is built on it; nevertheless, Orbeez-SLAM shows only a minor performance drop while additionally providing a dense map generated by NeRF.

V-B2 Evaluation on ScanNet (large-scale) [6].

As revealed in TABLE II, we obtain the best average result across all scenes. We assume the performance difference between us and ORB-SLAM2 is due to randomness. In addition, NICE-SLAM performs best in some cases, echoing its claimed strength on scaled-up scenes [35]; in particular, scenes 0181 and 0207 contain multiple compartments. Improving performance in large scenes with many rooms is left for future work.

Fig. 5: Comparison of Rendering Results. RGB and depth renderings from Orbeez-SLAM (ours) and NICE-SLAM [35] are visualized. We provide results of Orbeez-SLAM (monocular and RGB-D) and NICE-SLAM (RGB-D, with and without GT depth during inference). Orbeez-SLAM generates high-quality RGB results rapidly, even under the monocular setting. Notably, we do not use depth information for NeRF rendering in the RGB-D setting (depth is only used for the tracking process); thus, NICE-SLAM provides better depth renderings.
Fig. 6: NeRF Results across Time. NeRF-rendered results from TUM-fr3/office, Replica-office, and ScanNet-0207 are listed across time, with the elapsed time indicated at the top left corner. The first, second, and third columns show the rendering at the beginning of training, at the end of the tracking process, and at full convergence of the loss values, respectively. Our NeRF computes good results on TUM and Replica but fails on ScanNet (a large scene): it successfully reconstructs the bed but fails to rebuild the desk on the left, revealing that large scenes are more challenging.

V-B3 Evaluation on Replica [27].

NICE-SLAM evaluates mapping results on Replica since it provides GT meshes. But, as stated before, we argue that the mesh generation process from NeRF is not standardized. Hence, we use metrics common in NeRF works: Depth L1 and PSNR.

As demonstrated in TABLE III, NICE-SLAM obtains the best Depth L1 values when GT depth is provided during depth rendering. However, our Depth L1 values outperform NICE-SLAM when it renders without GT depth. Note that our NeRF is never supervised by GT depth. Next, comparing the quality of images rendered from NeRF, we beat all variants of NICE-SLAM on PSNR, indicating that our method provides superior color results.

V-B4 Runtime Comparison.

TABLE IV depicts the throughput of Orbeez-SLAM and NICE-SLAM (the SOTA NeRF-SLAM) on the TUM RGB-D benchmark. Thanks to the VO estimating accurate camera poses at the early stage of training, Orbeez-SLAM is 360-800 times faster than NICE-SLAM, achieving real-time inference.

V-C Ablation Study

TABLE V illustrates the ablations. Camera poses guided only by the reprojection error achieve better results than those guided by both the reprojection and NeRF photometric errors. We attribute this to the much slower convergence of the photometric loss, which has a negative influence when leveraged during real-time inference. That is also why we do not provide a version guided only by the photometric loss: it produces poor results and is not viable under our setting. We refer interested readers to our demo video for more details.

V-D Qualitative Results

We deliver qualitative results in Fig. 5 and Fig. 6. As stated in Fig. 5, NICE-SLAM renders images with the help of GT depth; to be clear, GT depth is used during training in both NICE-SLAM cases. By contrast, Orbeez-SLAM does not use depth supervision to render images, even in the RGB-D case, where GT depth is only used for tracking. Notably, Orbeez-SLAM provides superior RGB results compared to NICE-SLAM under both settings. We highlight that NICE-SLAM produces better depth results because it accesses GT depth.

Besides, we show Orbeez-SLAM's rendered results at distinct timestamps in Fig. 6. After the real-time SLAM session ends (second column), we can continue offline NeRF training until the losses fully converge (third column). Orbeez-SLAM demonstrates excellent outcomes in the TUM and Replica cases (first two rows) but fails in the large-scale ScanNet case. We assume large-scale scenes are more challenging for Orbeez-SLAM and leave them as future work.

VI Conclusion

We aim to develop a core component of spatial AI: a pre-training-free visual SLAM that reaches real-time inference and provides dense maps for downstream tasks. To this end, we propose Orbeez-SLAM, which utilizes ORB features and NeRF-realized mapping. To achieve the above requirements, we combine visual odometry with a fast NeRF implementation on the instant-ngp platform. Moreover, Orbeez-SLAM can work with monocular cameras, enabling flexible, practical applications. We verify Orbeez-SLAM against SOTA NeRF-SLAM baselines on three challenging benchmarks, where it demonstrates superior performance on average. We also provide detailed qualitative results supporting our claims. We believe this work paves the way toward speeding up the development of spatial AI. Notably, how to effectively leverage dense maps in downstream tasks is interesting but out of the scope of this paper; we leave it as future work.

VII Acknowledgement

This work was supported in part by the National Science and Technology Council, under Grant MOST 110-2634-F-002-051, Qualcomm through a Taiwan University Research Collaboration Project, Mobile Drive Technology Co., Ltd (MobileDrive), and NOVATEK fellowship. We are grateful to the National Center for High-performance Computing.


  • [1] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan (2021) Mip-nerf: a multiscale representation for anti-aliasing neural radiance fields. ICCV. Cited by: §II-A, §III-B.
  • [2] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman (2022) Mip-nerf 360: unbounded anti-aliased neural radiance fields. CVPR. Cited by: §II-A, §III-B.
  • [3] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós (2021)

    ORB-slam3: an accurate open-source library for visual, visual–inertial, and multimap slam

    IEEE Transactions on Robotics 37 (6), pp. 1874–1890. External Links: Document Cited by: §II-B, §IV-B1.
  • [4] Y. Chen, Y. Chen, and G. Wang (2019) Bundle adjustment revisited. CoRR abs/1912.03858. External Links: Link, 1912.03858 Cited by: §IV-B1.
  • [5] R. Clark (2022-06) Volumetric bundle adjustment for online photorealistic scene capture. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    pp. 6124–6132. Cited by: §III-B.
  • [6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: §V-A1, §V-B2.
  • [7] J. Engel, T. Schöps, and D. Cremers (2014-09) LSD-SLAM: large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), Cited by: §I, §II-B.
  • [8] J. Huang, S. Huang, H. Song, and S. Hu (2021) DI-fusion: online implicit 3d reconstruction with deep priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §V-A2, TABLE I, TABLE II.
  • [9] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho, and J. Park (2021-10) Self-calibrating neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5846–5854. Cited by: §II-A.
  • [10] L. Koestler, N. Yang, N. Zeller, and D. Cremers (2021) TANDEM: tracking and dense mapping in real-time using deep multi-view stereo. In Conference on Robot Learning (CoRL), External Links: 2111.07418 Cited by: §I, §II-B.
  • [11] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard (2011) G2o: a general framework for graph optimization. In 2011 IEEE International Conference on Robotics and Automation, Vol. , pp. 3607–3613. External Links: Document Cited by: §V-A5.
  • [12] C. Lin, W. Ma, A. Torralba, and S. Lucey (2021) BARF: bundle-adjusting neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), Cited by: §II-A.
  • [13] L. Liu, J. Gu, K. Z. Lin, T. Chua, and C. Theobalt (2020) Neural sparse voxel fields. NeurIPS. Cited by: §III-B.
  • [14] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth (2021) NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, Cited by: §II-A.
  • [15] N. Max (1995) Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics 1 (2), pp. 99–108. External Links: Document Cited by: §III-A.
  • [16] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In ECCV, Cited by: §I, §II-A, §III-A, §III-B.
  • [17] T. Müller, A. Evans, C. Schied, and A. Keller (2022-07) Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41 (4), pp. 102:1–102:15. External Links: Link, Document Cited by: §I, §I, §II-A, §III-B, §V-A5.
  • [18] T. Müller (2021) Tiny CUDA neural network framework. Cited by: §I, §II-A, §V-A5.
  • [19] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. External Links: Document Cited by: §II-B, §IV-B1.
  • [20] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. External Links: Document Cited by: §I, §I, §II-B, §IV-B1, §IV-B2, §V-A2, §V-A5, TABLE I, TABLE II.
  • [21] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon (2011) KinectFusion: real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Vol. , pp. 127–136. External Links: Document Cited by: §II-A.
  • [22] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019-06) DeepSDF: learning continuous signed distance functions for shape representation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-A.
  • [23] Sara Fridovich-Keil and Alex Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa (2022) Plenoxels: radiance fields without neural networks. In CVPR, Cited by: §III-B.
  • [24] J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-A.
  • [25] J. L. Schönberger, E. Zheng, M. Pollefeys, and J. Frahm (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), Cited by: §II-A.
  • [26] T. Schöps, T. Sattler, and M. Pollefeys (2019) BAD slam: bundle adjusted direct rgb-d slam. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 134–144. External Links: Document Cited by: §V-A2, TABLE I.
  • [27] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797. Cited by: §V-A1, §V-B3.
  • [28] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012-Oct.) A benchmark for the evaluation of rgb-d slam systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Cited by: §V-A1, §V-B1.
  • [29] E. Sucar, S. Liu, J. Ortiz, and A. Davison (2021) iMAP: implicit mapping and positioning in real-time. In Proceedings of the International Conference on Computer Vision (ICCV), Cited by: §I, §I, §II-A, §II-B, §IV, §V-A2, §V-A4, TABLE I, TABLE II.
  • [30] C. Sun, M. Sun, and H. Chen (2022) Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In CVPR, Cited by: §III-B.
  • [31] T. Takikawa, J. Litalien, K. Yin, K. Kreis, C. Loop, D. Nowrouzezahrai, A. Jacobson, M. McGuire, and S. Fidler (2021) Neural geometric level of detail: real-time rendering with implicit 3D shapes. Cited by: §II-A.
  • [32] E. Vespa, N. Nikolov, M. Grimm, L. Nardi, P. H. J. Kelly, and S. Leutenegger (2018) Efficient octree-based volumetric slam supporting signed-distance and occupancy mapping. IEEE Robotics and Automation Letters 3 (2), pp. 1144–1151. External Links: Document Cited by: §II-A.
  • [33] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu (2021) NeRF--: neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064. Cited by: §II-A.
  • [34] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald (2012) Kintinuous: spatially extended kinectfusion. Cited by: §V-A2, TABLE I.
  • [35] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys (2022) NICE-slam: neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §I, §II-A, §II-B, §IV, Fig. 5, §V-A2, §V-A4, §V-B2, TABLE I, TABLE II, TABLE III, TABLE IV.