A spatial AI that can perform complex tasks through visual signals and cooperate with humans is highly anticipated. To achieve this, we need a visual SLAM that easily adapts to new scenes without pre-training and generates dense maps for downstream tasks in real-time. None of the previous learning-based and non-learning-based visual SLAMs satisfy all needs due to the intrinsic limitations of their components. In this work, we develop a visual SLAM named Orbeez-SLAM, which successfully collaborates with implicit neural representation (NeRF) and visual odometry to achieve our goals. Moreover, Orbeez-SLAM can work with the monocular camera since it only needs RGB inputs, making it widely applicable to the real world. We validate its effectiveness on various challenging benchmarks. Results show that our SLAM is up to 800x faster than the strong baseline with superior rendering outcomes.
An intelligent spatial AI that can receive visual signals (RGB-D images) and cooperate with humans to solve complicated tasks is highly valued. To efficiently understand semantic knowledge of its environment and act as humans do, spatial AI requires a core component named visual simultaneous localization and mapping (SLAM). The visual SLAM should quickly adapt to new scenes without pre-training and generate fine-grained maps in real time for downstream tasks, such as those of domestic robots. However, traditional visual SLAMs [20, 7] mainly focus on localization accuracy and only provide crude maps. To this end, this work aims to develop a visual SLAM with the aforementioned properties.
Tandem achieves real-time inference and can work with a monocular camera. However, depth estimation is involved in its TSDF fusion, and the depth estimation module in Tandem needs pre-training before inference, which limits its adaptability to novel scenes that differ significantly from the pre-training scenes.
Neural Radiance Field (NeRF), another implicit neural representation, does not require depth supervision during training and can be trained from scratch at the target scene. This attribute makes NeRF a promising map representation for visual SLAM. Two recent NeRF-SLAMs [29, 35] echo our motivations. Among them, iMAP is the first work that lets NeRF serve as the map representation in SLAM; it also optimizes the camera pose via back-propagation from the NeRF photometric loss. NICE-SLAM then extends iMAP with a hierarchical feature grid module, which allows NICE-SLAM to scale up to large scenes without catastrophic forgetting. Nevertheless, these NeRF-SLAMs need RGB-D inputs: they optimize the camera pose purely through the neural network without visual odometry (VO), which causes poor initial localization, so they still need depth information to guide the 3D geometry. Besides, a notable shortcoming of NeRF is its slow convergence. Specifically, training involves a large amount of rendering, which makes training NeRF in real time difficult. instant-ngp addresses this training speed issue. With the help of multi-resolution hash encoding and the CUDA framework, instant-ngp can train NeRFs in a few seconds.
To tackle the above shortcomings, we seek to develop a monocular visual SLAM that is pre-training-free and achieves real-time inference for practical applications. To this end, we propose Orbeez-SLAM, combining feature-based SLAM (e.g., ORB-SLAM2) and a NeRF based on the instant-ngp framework. Different from [29, 35], we emphasize that VO (in ORB-SLAM2) can provide a better camera pose estimation even at the early stage of training, which lets Orbeez-SLAM work with monocular cameras, i.e., without depth supervision. Moreover, we simultaneously estimate camera poses via VO and update the NeRF network. Notably, the training process is online and real-time without pre-training, as depicted in Fig. 1. As a result, Orbeez-SLAM can render dense information such as the depth and color of scenes. Besides, it is validated in various indoor scenes and outperforms NeRF-SLAM baselines in speed, camera tracking, and reconstruction. To summarize, our contributions are threefold:
We propose Orbeez-SLAM, the first real-time monocular visual SLAM that is pre-training-free and provides dense maps, tailored for spatial AI applications.
By combining visual odometry and a fast NeRF framework, our method reaches real-time inference and produces dense maps.
We extensively validate Orbeez-SLAM with state-of-the-art (SOTA) baselines on challenging benchmarks, showing superior quantitative and qualitative results.
To represent a 3D scene, explicit representations (e.g., point clouds) need huge space to store information. By contrast, implicit surface representations, such as signed distance functions (SDF), alleviate the space issue and have been widely developed in recent years. Among them, some works [31, 22] leverage neural networks to learn the implicit function, called implicit neural representations (INRs). With the property of continuous representation of the signals, INRs demonstrate several advantages: (a) they are not coupled to the spatial dimension/resolution of input signals, and (b) they can predict/synthesize the unobserved regions.
Besides, NeRF, a novel and popular INR, has demonstrated its success in novel view synthesis [16, 14, 1, 2]. Nonetheless, most NeRFs assume that the camera pose is known. Thus, COLMAP [24, 25] is often used to estimate the camera intrinsics and extrinsics (camera poses) in NeRF-related works. In addition, a few works [33, 9, 12] optimize camera poses via the NeRF photometric loss, but the process requires a long training time. Hence, as mentioned above, instant-ngp develops a framework that can train NeRFs in a few seconds, leveraging multi-resolution hash encoding and the CUDA platform.
Intuitively, implicit surface representations can serve as maps in visual SLAM systems. For instance, some studies [21, 32] leverage the SDF to construct the map. Besides, two recent NeRF-SLAMs [29, 35] pave the way for combining NeRFs and visual SLAM. However, they need RGB-D inputs and converge slowly, which does not satisfy our needs. Therefore, we aim to build a NeRF-SLAM that generates a dense map in real time. Moreover, our method works with a monocular camera and trains from scratch at the target scene without a lengthy pre-training process.
Traditional visual SLAMs show strengths in outdoor and large scenes. They can rapidly compute accurate locations but lack fine-grained information about the scenes. There are two categories of visual SLAMs: feature-based and direct SLAM. Feature-based SLAMs [19, 20, 3] extract and match image features between frames and then minimize the reprojection error. Direct SLAM, in contrast, uses pixel intensities to localize and minimizes the photometric error.
achieve this objective under deep learning techniques. However, they either need pre-training that limits the adaptation capability or optimize the camera poses and network parameters only relying on NeRF photometric loss and depth supervision [29, 35], lacking the knowledge of VO. Thus, we develop Orbeez-SLAM that eliminates these drawbacks by considering the VO guidance and fast NeRF implementation. Consequently, Orbeez-SLAM is pre-training-free for novel scenes and reaches real-time inference (with online training).
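NeRF reconstructs a 3D scene by training on a sequence of 2D images from distinct viewpoints. A continuous scene can be represented as a function $F: (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$, where $\mathbf{x}$ is the 3D location and $\mathbf{d}$ is the 3D Cartesian unit vector representing the direction. Outputs are color $\mathbf{c}$ and volume density $\sigma$. Such a continuous representation, as a function, can be approximated with an MLP network $F_{\Theta}$ by optimizing the weight $\Theta$.
Following the above definitions, for any ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, the color $C(\mathbf{r})$ within the near and far bounds $[t_n, t_f]$ can be obtained through the integral of products of transmittance $T(t)$, volume density $\sigma(\mathbf{r}(t))$, and the color $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$ at each point, i.e.,
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \tag{1}$$
where $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$.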
The rendering equation (1) in NeRF requires sampling positions on the ray. We are mainly interested in the positions that intersect the surface since they contribute more to (1). Some studies [16, 1, 2] leverage a coarse-to-fine strategy that first samples uniformly on the ray and queries NeRF to find the density distribution. Once the density distribution of the ray is known, they only sample positions near the surface. However, these steps require frequent NeRF querying, which is time-consuming.
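To make the role of the sampled positions concrete, the following minimal sketch evaluates (1) with the standard quadrature rule used in the NeRF literature (alpha compositing over discrete samples). The function name `render_ray` and its arguments are illustrative assumptions and are not taken from the Orbeez-SLAM code base.

```python
import math


def render_ray(densities, colors, deltas):
    """Approximate the color of one ray by alpha compositing its samples.

    densities: volume density sigma_i at each sample (non-negative)
    colors:    (r, g, b) color c_i at each sample
    deltas:    distance delta_i between consecutive samples
    (All names are hypothetical; this is not the authors' implementation.)
    """
    transmittance = 1.0  # T_i: fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

Samples with zero density receive zero weight, which is why positions away from the surface contribute little to the rendered color.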
To tackle this, recent works [17, 13, 23, 30, 5] store the query results in a density grid and then usually apply the skip voxel strategy, as shown in Fig. 2. In this work, we further extend the skip voxel strategy with knowledge of NeRF to perform ray-casting triangulation (see Section IV-C).
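The skip voxel idea can be sketched as follows, assuming a single-level grid stored as a set of occupied voxel indices and a fixed step along the ray. Both choices are simplifications made for illustration (instant-ngp, for instance, uses a more elaborate multi-scale structure), and the name `sample_positions` is hypothetical.

```python
def sample_positions(origin, direction, occupied, voxel_size, t_near, t_far, step):
    """Return the ray parameters t whose sample point lies in an occupied voxel.

    occupied: set of integer voxel indices (i, j, k) marked as non-empty.
    Samples that fall into empty voxels are skipped, so the network would
    never be queried at those positions.
    """
    kept = []
    n_steps = int((t_far - t_near) / step)
    for n in range(n_steps + 1):
        t = t_near + n * step
        point = [o + t * d for o, d in zip(origin, direction)]
        voxel = tuple(int(p // voxel_size) for p in point)
        if voxel in occupied:
            kept.append(t)
    return kept
```

For a ray starting at (0.25, 0.25, 0.25) along the x-axis with unit voxels and only voxel (3, 0, 0) occupied, stepping by 0.5 between t = 0 and t = 4 keeps just the two samples at t = 3.0 and t = 3.5.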
Unlike previous NeRF-SLAMs [29, 35] which require depth information to perceive geometry better, we develop Orbeez-SLAM that leverages VO for accurate pose estimations to generate a dense map with a monocular camera. Besides, it achieves pre-training-free adaptation and real-time inference. Next, the system overview is depicted in Sec. IV-A and the optimization objectives are described in Sec. IV-B. At last, ray-casting triangulation is introduced in Sec. IV-C.
Fig. 3 shows our system pipeline. The tracking process extracts image features from the input image stream and estimates the camera poses via visual odometry. The mapping system generates map points with triangulation and optimizes camera poses and map points with bundle adjustment (reprojection error). These map points represent a sparse point cloud. We then utilize the updated camera poses and the map to train NeRF. Since the process is differentiable, we can still optimize the camera poses from the NeRF photometric loss. In the end, the NeRF can generate a dense map for downstream tasks. Moreover, this pipeline should work for any SLAM that provides a sparse point cloud.
The following objectives are used to optimize Orbeez-SLAM: (a) pose estimation, (b) bundle adjustment, and (c) NeRF regression. Among them, (a) is in the tracking process, and (b) and (c) are conducted in the mapping process.
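The reprojection error is formulated as
$$e_{ij} = \left\| \mathbf{u}_{ij} - \pi(\mathcal{C}_j, \mathbf{P}_i) \right\|_2,$$
where $\mathbf{u}_{ij}$ is the pixel position on the image, which is observed by the $j$th camera and is projected from the $i$th 3D point. The function $\pi(\mathcal{C}_j, \mathbf{P}_i)$ projects the 3D map point $\mathbf{P}_i$ to the pixel coordinate via $K(R_j \mathbf{P}_i + \mathbf{t}_j)$, where $K$ and $[R|\mathbf{t}]_j$ are the intrinsic and extrinsic (world to camera) described by $\mathcal{C}_j$. We optimize the camera poses $\{[R|\mathbf{t}]_j\}$ by minimizing the reprojection error over all observations.
After the triangulation step in VO, new map points are added to the local map. The bundle adjustment objective also minimizes the reprojection error, now over both the map point positions $\{\mathbf{P}_i\}$ and the camera poses $\{[R|\mathbf{t}]_j\}$.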
Minimizing (6) is a nonlinear least-squares problem. We solve these two objectives with the Levenberg-Marquardt method. The bundle adjustment optimizes the camera poses of keyframes and the map points observable in these keyframes. Then, these optimized keyframes and map points are passed to the NeRF.
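As an illustration of how a reprojection objective can be minimized with Levenberg-Marquardt, the sketch below refines only the camera translation of a single view, with the rotation held fixed and a forward-difference Jacobian. This is a deliberately reduced toy problem under our own assumptions: the actual system optimizes full poses and map points through g2o, and all function names here (`project`, `residuals`, `refine_translation`, and so on) are hypothetical.

```python
def project(K, R, t, P):
    """Pinhole projection of world point P via K (R P + t) and perspective division."""
    fx, fy, cx, cy = K
    Xc = [sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3)]
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def residuals(K, R, t, points, pixels):
    """Stack the 2D reprojection residuals of all observations in one list."""
    res = []
    for P, (u, v) in zip(points, pixels):
        pu, pv = project(K, R, t, P)
        res.extend([u - pu, v - pv])
    return res


def cost(res):
    return sum(r * r for r in res)


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / M[r][r]
    return x


def refine_translation(K, R, t, points, pixels, iters=20, lam=1e-3, eps=1e-6):
    """Levenberg-Marquardt on the camera translation only (rotation held fixed)."""
    t = list(t)
    res = residuals(K, R, t, points, pixels)
    for _ in range(iters):
        # Forward-difference Jacobian: J[k] holds d(residuals)/d(t[k]).
        J = []
        for k in range(3):
            tp = t[:]
            tp[k] += eps
            rp = residuals(K, R, tp, points, pixels)
            J.append([(a - b) / eps for a, b in zip(rp, res)])
        # Damped normal equations: (J^T J + lam I) dx = -J^T r
        A = [[sum(x * y for x, y in zip(J[i], J[j])) + (lam if i == j else 0.0)
              for j in range(3)] for i in range(3)]
        g = [-sum(x * r for x, r in zip(J[i], res)) for i in range(3)]
        dx = solve3(A, g)
        t_new = [a + d for a, d in zip(t, dx)]
        res_new = residuals(K, R, t_new, points, pixels)
        if cost(res_new) < cost(res):
            t, res, lam = t_new, res_new, lam * 0.1  # accept step, reduce damping
        else:
            lam *= 10.0  # reject step, increase damping
    return t, cost(res)
```

The damping factor interpolates between Gauss-Newton (small `lam`) and gradient descent (large `lam`); a production solver would additionally use analytic Jacobians and a proper rotation parameterization.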
NeRF minimizes the photometric error by regressing the image color. A ray $\mathbf{r}$ is determined by a keyframe (i.e., its camera pose) and a pixel coordinate $(u, v)$.
By applying the skip voxel strategy shown in Fig. 2, we sample positions on the ray that are near the surface. Finally, the NeRF photometric loss is the L2 norm between the predicted color $\hat{C}(\mathbf{r})$ and the pixel color $C(\mathbf{r})$:
$$\mathcal{L}_{pht} = \sum_{\mathbf{r}} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2,$$
where $C(\mathbf{r})$ is the observed color of ray $\mathbf{r}$ in image $I$. Since (9) is differentiable, both the camera extrinsic $[R|\mathbf{t}]$ and the network parameters $\Theta$ can be optimized by $\mathcal{L}_{pht}$. However, after examination (cf. Tab. V), we only optimize $[R|\mathbf{t}]$ by (6).
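The two ingredients of the NeRF regression step can be sketched as below: a ray obtained from a keyframe pose and a pixel by standard pinhole back-projection, and a photometric loss between rendered and observed colors. The helper names are hypothetical, the back-projection is the textbook form rather than a formula confirmed by the paper, and the loss is written as a mean of squared L2 distances, which is an assumption about normalization.

```python
import math


def pixel_to_ray(K, R, t, u, v):
    """Back-project pixel (u, v) of a keyframe into a world-space ray.

    K = (fx, fy, cx, cy); R, t are the world-to-camera extrinsics.
    Returns the ray origin (camera center) and a unit direction.
    """
    fx, fy, cx, cy = K
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # K^-1 [u, v, 1]^T
    # Rotate into the world frame with R^T (the inverse of a rotation matrix).
    d = [sum(R[r][c] * d_cam[r] for r in range(3)) for c in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    # Camera center in world coordinates: -R^T t
    origin = [-sum(R[r][c] * t[r] for r in range(3)) for c in range(3)]
    return origin, [x / norm for x in d]


def photometric_loss(predicted, observed):
    """Mean squared L2 distance between rendered and observed RGB colors."""
    total = 0.0
    for p, o in zip(predicted, observed):
        total += sum((a - b) ** 2 for a, b in zip(p, o))
    return total / len(predicted)
```

Because both functions are differentiable in the pose and in the predicted colors, an automatic-differentiation framework could propagate this loss back to the camera extrinsics as well as to the network weights.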
In Fig. 2, we show that the density grid can accelerate the rendering process. However, this structure only considers a single ray and relies heavily on the density prediction of the NeRF model. We additionally store the number of times each voxel of the density grid is sampled. A voxel that frequently blocks the casting ray is more likely to be on the surface, as shown in Fig. 4. For noise rejection, we only triangulate points that lie within voxels that are scanned by rays frequently enough. We choose 64 as the threshold in our implementation since, in our experience, this value gives the best visualization. We also add the map points generated from the sparse point cloud to this data structure. Since the surroundings of a map point are much more likely to be on the surface, we add a significant number to the sample counter of the corresponding voxels. We claim that this method finds a more reliable surface than relying on NeRF alone. Map points generated by this method are not optimized in (8). We show the dense point cloud generated by this method in Fig. 3.
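The counting scheme described above can be sketched with a per-voxel counter. The threshold of 64 comes from the text; everything else is an assumption made for illustration: the class and method names, the bonus value of 1000 (the text only says "a significant number"), the use of `>=` for the comparison, and boosting only the voxel that contains the map point rather than its neighbourhood.

```python
from collections import defaultdict

SAMPLE_THRESHOLD = 64   # threshold reported in the text
MAP_POINT_BONUS = 1000  # assumed value; the text does not specify it


class SurfaceCounter:
    """Counts how often each voxel blocks a casting ray (hypothetical helper)."""

    def __init__(self, voxel_size):
        self.voxel_size = voxel_size
        self.counts = defaultdict(int)

    def voxel_of(self, point):
        return tuple(int(p // self.voxel_size) for p in point)

    def record_ray_hit(self, point):
        """Register one more ray that was blocked in the voxel containing point."""
        self.counts[self.voxel_of(point)] += 1

    def add_map_point(self, point):
        """Boost the counter of the voxel that contains a sparse SLAM map point."""
        self.counts[self.voxel_of(point)] += MAP_POINT_BONUS

    def is_surface(self, point):
        """A point is triangulated only if its voxel was hit often enough."""
        return self.counts.get(self.voxel_of(point), 0) >= SAMPLE_THRESHOLD
```

With this structure, a voxel hit 63 times is still rejected as noise, while a single sparse map point is enough to mark its voxel as a reliable surface candidate.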
In practice, monocular SLAM works validate their effectiveness on the depth version since they cannot estimate the correct scale of the scenes without knowing depth. Also, all previous NeRF-SLAMs require depth supervision. Thus, all methods are verified on the depth version. We still demonstrate that Orbeez-SLAM can work with monocular cameras. Moreover, we extensively examine its efficacy from two aspects: tracking and mapping results.
To evaluate the tracking results, we report the absolute trajectory error (ATE), which computes the root mean square error (RMSE) between the ground truth (GT) trajectory and the aligned estimated trajectory. For the mapping results, we modify two metrics that are often used in NeRF works: PSNR and Depth L1. PSNR assesses the distortion between NeRF-rendered images and GT images along the GT trajectory. As for Depth L1, we calculate the L1 error between the estimated and GT depth along the GT trajectory.
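For reference, the three metrics can be written down in their textbook form as below. This is a sketch under stated assumptions: the trajectory alignment that precedes ATE is omitted (the estimated positions are assumed to be aligned already), images and depth maps are passed as flat lists, and the modifications mentioned above (rendering along the GT trajectory) are not modelled. The function names are hypothetical.

```python
import math


def ate_rmse(gt_positions, est_positions):
    """RMSE of the translational error; est_positions must already be aligned to GT."""
    sq = [sum((g - e) ** 2 for g, e in zip(gp, ep))
          for gp, ep in zip(gt_positions, est_positions)]
    return math.sqrt(sum(sq) / len(sq))


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images given as flat lists."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def depth_l1(rendered, reference):
    """Mean absolute depth error over all pixels."""
    return sum(abs(a - b) for a, b in zip(rendered, reference)) / len(rendered)
```

Lower is better for ATE and Depth L1, while a higher PSNR indicates a rendered image closer to the reference.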
Unlike in [29, 35], we do not evaluate on meshes. We argue that assessing performance on meshes may be unfair because the process of generating meshes by post-processing NeRF is not unified. In addition, our setting has the following advantages:
The number of sampled keyframes varies across works, whereas evaluating with the GT trajectory provides a consistent standard.
The depth and PSNR can effectively reflect the quality of the geometry and radiance learned by NeRF.
Our setting can still verify the methods on novel views since only keyframes (in GT trajectory) are used during training. Besides, even if the model backs up seen keyframes, the metric can still reveal that when they are localized to the wrong viewpoints.
We conduct all experiments on a desktop PC with an Intel i7-9700 CPU and an NVIDIA RTX 3090 GPU. We follow the official code of ORB-SLAM2 (https://github.com/raulmur/ORB_SLAM2) and instant-ngp (https://github.com/NVlabs/instant-ngp) to implement Orbeez-SLAM. Note that Orbeez-SLAM inherits the loop-closing process from ORB-SLAM2 to improve the trajectory accuracy. Unlike ORB-SLAM2, we do not cull keyframes, which ensures that a keyframe is not eliminated after it has been passed to the NeRF. The code is written in C++ and CUDA. Regarding the losses, the reprojection error is optimized via the g2o framework, and the photometric error in NeRF is optimized via the tiny-cuda-nn framework.
Through the experiments, we aim to verify whether Orbeez-SLAM can produce accurate trajectories (ATE), precise 3D reconstructions (Depth), and detailed perception information (PSNR) under our challenging settings, i.e., real-time inference without pre-training. Notably, previous works mainly focus on the first two indicators. However, a dense map containing rich perception information is vital for spatial AI applications; thus, we also attend to this aspect.
TABLE I lists tracking results of all methods. Note that Orbeez-SLAM outperforms all deep-learning baselines with a significant gap (top half). Besides, ORB-SLAM2 is our upper bound on the tracking results since our method is built on it. Nevertheless, Orbeez-SLAM only shows a minor performance drop while it provides a dense map generated by NeRF.
As revealed in TABLE II, we obtain the best average results across all scenes. We assume the performance difference between our method and ORB-SLAM2 is due to randomness. In addition, NICE-SLAM performs best in some cases, echoing its claimed strength in scaling up to large scenes. In particular, scenes 0181 and 0207 contain compartments. Improving performance in large scenes with multiple rooms is left for future work.
NICE-SLAM evaluates the mapping results on Replica since it provides GT meshes. But, as stated before, we argue that the mesh generation process from NeRF is not unified and tricky. Hence, we use common metrics in NeRF works, Depth L1 and PSNR.
As demonstrated in TABLE III, NICE-SLAM obtains the best Depth L1 values when GT depth is available during depth rendering. However, our Depth L1 values outperform those of NICE-SLAM when it has no GT depth during rendering. Note that our NeRF is never supervised by GT depth. Next, when comparing the quality of rendered images from NeRF, we beat all variants of NICE-SLAM on PSNR, indicating that our method provides superior color results.
TABLE IV reports the elapsed time of our Orbeez-SLAM and NICE-SLAM (SOTA NeRF-SLAM) running on the TUM RGB-D benchmark. Thanks to the VO, which estimates an accurate camera pose at the early stage of training, Orbeez-SLAM is 360 to 800 times faster than NICE-SLAM, which allows it to achieve real-time inference.
TABLE V shows the ablations. The camera pose guided only by the reprojection error achieves a better result than the one guided by both the reprojection error and the photometric loss (from NeRF). We claim that the photometric loss converges much more slowly, which has a negative influence when it is leveraged in real-time inference. This is also why we do not provide a version guided only by the photometric loss: it produces very poor results and is not viable under our setting. We refer interested readers to our demo video for more details.
We present qualitative results in Fig. 5 and Fig. 6. As noted in Fig. 5, NICE-SLAM renders images with the help of GT depth. To be clear, GT depth is used during training in both NICE-SLAM cases. By contrast, our Orbeez-SLAM does not use depth supervision to render images, even in the RGB-D case, where the GT depth is only used for tracking. Notably, Orbeez-SLAM provides better RGB results than NICE-SLAM under both settings. We highlight that NICE-SLAM produces better depth results because it has access to GT depth.
Besides, we provide Orbeez-SLAM rendered results at distinct timestamps in Fig. 6. After the real-time SLAM has ended (second column), we can apply offline training for NeRF until the losses fully converge (third column). Orbeez-SLAM demonstrates excellent outcomes in the TUM and Replica cases (first two rows) but fails in the large-scale ScanNet case. We assume large-scale scenes are more challenging for Orbeez-SLAM, and we leave this for future work.
We aim to develop a core component in spatial AI, i.e., a pre-training-free visual SLAM that reaches real-time inference and provides dense maps for downstream tasks. To this end, we propose Orbeez-SLAM, which utilizes ORB features and NeRF-realized mapping. To meet the above requirements, we combine visual odometry with a fast NeRF implementation on the instant-ngp platform. Moreover, Orbeez-SLAM can work with monocular cameras, leading to flexible, practical applications. We compare Orbeez-SLAM with SOTA NeRF-SLAM baselines on three challenging benchmarks, where it demonstrates superior performance on average. Besides, we provide detailed qualitative results to show that our work satisfies these claims. We believe this work paves the way for faster development of spatial AI. Notably, how to effectively leverage dense maps in downstream tasks is interesting but out of the scope of this paper; we leave it as future work.
This work was supported in part by the National Science and Technology Council, under Grant MOST 110-2634-F-002-051, Qualcomm through a Taiwan University Research Collaboration Project, Mobile Drive Technology Co., Ltd (MobileDrive), and NOVATEK fellowship. We are grateful to the National Center for High-performance Computing.
ORB-SLAM3: an accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Transactions on Robotics 37 (6), pp. 1874–1890.