One of the most reliable cues for 3D perception from a monocular camera arises from camera motion, which induces multiple-view geometric constraints [hartley2003multiple] that encode the 3D scene structure. Simultaneous Localization And Mapping (SLAM) [davison2007monoslam, klein2007parallel, newcombe2011dtam] has long been studied as a way to simultaneously recover the 3D structure of the surrounding scene and estimate the ego-motion of the agent. With the advent of Convolutional Neural Networks (CNNs), unsupervised learning of single-view depth estimation [garg2016unsupervised, godard2017unsupervised, zhou2017unsupervised] has emerged as a promising alternative to the traditional geometric approaches. Such methods rely on CNNs to extract meaningful depth cues (e.g., shading, texture, and semantics) from a single image, yielding very promising results.
Despite the general maturity of monocular geometric SLAM [engel2014lsd, mur2015orb, engel2017direct] and the rapid advances in unsupervised monocular depth prediction approaches [mahjourian2018unsupervised, wang2018learning, yin2018geonet, bian2019unsupervised, godard2019digging, Sheng_2019_ICCV], they both still have their own limitations.
Monocular SLAM. Traditional monocular SLAM has well-known limitations in robustness and accuracy as compared to those leveraging active depth sensors, e.g., RGB-D SLAM [mur2017orb]. This performance issue is due to the inherent scale ambiguity of depth recovery from monocular cameras, which causes the so-called scale drift in both the camera trajectory and 3D scene depth, and thus lowers robustness and accuracy of conventional monocular SLAM. In addition, the triangulation-based depth estimation employed by traditional SLAM methods is degenerate under pure rotational camera motion [hartley2003multiple].
Unsupervised Monocular Depth Prediction. Most of the unsupervised and self-supervised depth estimation methods [zhou2017unsupervised, godard2017unsupervised, godard2019digging, bian2019unsupervised] formulate single image depth estimation as a novel-view synthesis problem, with appearance based photometric losses being central to the training strategy. Usually, these models train two networks, one each for pose and depth. As photometric losses largely rely on the brightness constancy assumption, nearly all existing self-supervised approaches operate in a narrow-baseline setting, optimizing the loss over a snippet of 2-5 consecutive frames. Consequently, models like Monodepth2 [godard2019digging] work very well for close range points, but generate inaccurate depth estimates for points that are farther away (e.g., see the results of the pre-trained model in Fig. 6). While it is well known that a wide baseline yields better depth estimates for points at larger depth, a straightforward extension of existing CNN based approaches is inadequate for the following two reasons. Firstly, a wide baseline in a video sequence implies a larger temporal window, which in most practical scenarios will violate the brightness constancy assumption, rendering the photometric loss ineffective. Secondly, larger temporal windows (wider baselines) also imply more occluded regions that behave as outliers. Unless these aspects are effectively handled, training of CNN based depth and pose networks in the wide baseline setting will lead to inaccuracies and biases.
In view of the limitations in both monocular geometric SLAM and unsupervised monocular depth estimation approaches, a particularly interesting question to ask is whether these two approaches can complement each other (see Sec. 5) and mitigate the issues discussed above. Our work makes contributions towards answering this question. Specifically, we propose a self-supervised, self-improving framework of these two tasks, which is shown to improve the robustness and accuracy on each of them.
While the performance gap between geometric SLAM and self-supervised learning-based SLAM methods is still large, incorporating depth information drastically improves the robustness of geometric SLAM methods (e.g., see RGB-D SLAM vs. RGB SLAM on the KITTI Odometry leaderboard [geiger2012we]). Inspired by this success of RGB-D SLAM, we postulate the use of an unsupervised CNN-based depth estimation model as a pseudo depth sensor, which allows us to design our self-supervised approach, pseudo RGB-D SLAM (pRGBD-SLAM) that only uses monocular cameras and yet achieves significant improvements in robustness and accuracy as compared to RGB SLAM.
Our fusion of geometric SLAM and CNN-based monocular depth estimation turns out to be symbiotic and this complementary nature sets the basis of our self-improving framework. To improve the depth predictions, we make use of two main modifications in the training strategy. First, we eschew the learning based pose estimates in favor of geometric SLAM based estimates (an illustrative motivation is shown in Fig. 1). Second, we make use of common tracked keypoints from neighboring keyframes and impose a symmetric depth transfer and a depth consistency loss on the CNN model. These adaptations are based on the observation that both pose estimates and sparse 3D feature point estimates from geometric SLAM are robust, as most techniques typically apply multiple bundle adjustment iterations over wide baseline depth estimates of common keypoints. This simple observation and the subsequent modification are key to our self-improving framework, which can leverage any unsupervised CNN-based depth estimation model and a modern monocular SLAM method. In this paper, we test our framework with ORB-SLAM [mur2017orb] as the geometric SLAM method and Monodepth2 [godard2019digging] as the CNN-based model. We show that our self-improving framework outperforms previously proposed self-supervised approaches that utilize monocular, stereo, and monocular-plus-stereo cues for self-supervision (see Tab. 1) and a strong feature based RGB-SLAM baseline (see Tab. 3).
The framework runs in a simple alternating update fashion: first, we use depth maps from the CNN-based depth network and run pRGBD-SLAM; second, we inject the outputs of pRGBD-SLAM, i.e., the relative camera poses and common tracked keypoints and keyframes to fine-tune the depth network parameters to improve the depth prediction; then, we repeat the process until we see no improvement. Our specific contributions are summarized here:
We propose a self-improving strategy to inject into depth prediction networks the supervision from SLAM outputs, which stem from more generally applicable geometric principles.
We introduce two wide baseline losses, i.e., the symmetric depth transfer loss and the depth consistency loss on common tracked points, and propose a joint narrow and wide baseline based depth prediction learning setup, where appearance based losses are computed on narrow baselines and purely geometric losses on wide baselines (non-consecutive temporally distant keyframes).
Through extensive experiments on KITTI [geiger2012we] and TUM RGB-D [sturm2012benchmark], our framework is shown to outperform both a monocular SLAM system (i.e., ORB-SLAM [mur2015orb]) and a state-of-the-art unsupervised single-view depth prediction network (i.e., Monodepth2 [godard2019digging]).
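The alternating update described above can be sketched as a small control loop. This is only an illustrative skeleton: the callables `run_prgbd_slam`, `fine_tune`, and `evaluate`, as well as the stopping threshold `min_gain`, are hypothetical stand-ins introduced here and are not part of the original system.

```python
from typing import Any, Callable, Tuple


def self_improving_loop(
    depth_net: Any,
    run_prgbd_slam: Callable[[Any], Any],
    fine_tune: Callable[[Any, Any], Any],
    evaluate: Callable[[Any, Any], float],
    max_loops: int = 5,
    min_gain: float = 1e-3,
) -> Tuple[Any, int]:
    """Alternate pose refinement and depth refinement until gains vanish.

    All callables are hypothetical placeholders:
      run_prgbd_slam(depth_net)     -> SLAM outputs (poses, keypoints, keyframes)
      fine_tune(depth_net, slam)    -> updated depth network
      evaluate(depth_net, slam)     -> scalar error, lower is better
    """
    slam_out = run_prgbd_slam(depth_net)          # initial pseudo RGB-D SLAM run
    best_err = evaluate(depth_net, slam_out)
    loops_done = 0
    for _ in range(max_loops):
        candidate = fine_tune(depth_net, slam_out)   # depth refinement step
        cand_slam = run_prgbd_slam(candidate)        # pose refinement step
        err = evaluate(candidate, cand_slam)
        if best_err - err < min_gain:                # no meaningful improvement
            break
        depth_net, slam_out, best_err = candidate, cand_slam, err
        loops_done += 1
    return depth_net, loops_done
```

In this sketch the loop keeps the last network that still produced a measurable gain, which is one simple way to realize "repeat until we see no improvement".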
2 Related Work
Visual SLAM has a long history of research in the computer vision community. Due to its well-understood underlying geometry, various geometric approaches have been proposed in the literature, ranging from the classical MonoSLAM [davison2007monoslam], PTAM [klein2007parallel], DTAM [newcombe2011dtam] to the more recent LSD-SLAM [engel2014lsd], ORB-SLAM [mur2015orb] and DSO [engel2017direct]. More recently, in view of the successful application of deep learning in a wide variety of areas, researchers have also started to exploit deep learning approaches for SLAM, in the hope that it can improve certain components of geometric approaches or even serve as a complete alternative. Our work makes further contributions along this line of research.
Monocular Depth Prediction. Inspired by the pioneering work by Eigen et al. [eigen2014depth] on learning single-view depth estimation, a large number of learning-based methods have emerged along this line of research. The earlier works often require ground truth depths for fully-supervised training. However, per-pixel depth ground truth is generally hard or prohibitively costly to obtain. Therefore, many self-supervised methods that make use of geometric constraints as supervision signals have been proposed. Specifically, thanks to the Spatial Transformer Network [jaderberg2015spatial], a differentiable photometric reconstruction loss has been successfully applied to monocular depth estimation. One example is the work by Godard et al. [godard2017unsupervised], which relies on the photo-consistency between the left and right cameras of a calibrated stereo rig. Zhou et al. [zhou2017unsupervised] go one step further to learn monocular depth prediction as well as ego-motion estimation, thereby permitting unsupervised learning with only a monocular camera. This pipeline has inspired a large number of follow-up works that utilize various additional heuristics, including 3D geometric constraints on point clouds [mahjourian2018unsupervised], direct visual odometry [wang2018learning], joint learning with optical flow [yin2018geonet], scale consistency [bian2019unsupervised], and others [godard2019digging, Sheng_2019_ICCV].
Using Depth to Improve Monocular SLAM. Approaches [tateno2017cnn, yin2017scale, yang2018deep, loo2019cnn] leveraging CNN-based depth estimates to tackle issues in monocular SLAM have been proposed. CNN-SLAM [tateno2017cnn] uses learned depth maps to initialize keyframes’ depth maps in LSD-SLAM [engel2014lsd] and refines them via a filtering framework. Yin et al. [yin2017scale] use a combination of CNNs and conditional random fields to recover scale from the depth predictions and iteratively refine ego-motion and depth estimates. Recently, DVSO [yang2018deep] trains a single CNN to predict both the left and right disparity maps, forming a virtual stereo pair. The CNN is trained with photo-consistency between stereo images and consistency with depths estimated by Stereo DSO [wang2017stereo]. More recently, CNN-SVO [loo2019cnn] uses depths learned from stereo images to initialize depths of keypoints and reduce their corresponding uncertainties in SVO [forster2014svo]. In contrast to our self-supervised approach, [tateno2017cnn, yin2017scale] use ground truth depths for training depth networks while [yang2018deep, loo2019cnn] need stereo images.
Using SLAM to Improve Monocular Depth Prediction. Depth estimates from geometric SLAM have been leveraged for training monocular depth estimation networks in recent works [klodt2018supervising, andraghetti2019enhancing]. In [andraghetti2019enhancing], sparse depth maps from Stereo ORB-SLAM [mur2017orb] are first converted into dense ones via an auto-encoder, which are then integrated into geometric constraints for training the depth network. Klodt and Vedaldi [klodt2018supervising] employ depths and poses from ORB-SLAM [mur2015orb] as supervision signals for training the depth and pose networks respectively. This approach only considers five consecutive frames, thus restricting its operation to the narrow-baseline setting.
3 Method: A Self-Improving Framework
Our self-improving framework leverages the strengths of both unsupervised single-image depth estimation and geometric SLAM, so that each mitigates the other's shortcomings. On one hand, the depth network typically generates reliable depth estimates for nearby points, which assist in improving the geometric SLAM estimates of poses and sparse 3D points (Sec. 3.1). On the other hand, geometric SLAM methods rely on a more holistic view of the scene to generate robust pose estimates as well as identify persistent 3D points that are visible across many frames, thus providing an opportunity to perform wide-baseline and reliable sparse depth estimation. Our framework leverages these sparse but robust estimates to improve the noisier depth estimates of the farther scene points by minimizing a blend of the symmetric transfer and depth consistency losses (Sec. 3.2) and the commonly used appearance based loss. In the following iteration, this improved depth estimate further enhances the capability of geometric SLAM, and the cycle continues until the improvements become negligible. Even in the absence of ground truth, our self-improving framework continues to produce better pose and depth estimates.
An overview of the proposed self-improving framework is shown in Fig. 2, which iterates between improving poses and improving depths. Our pose refinement and depth refinement steps are then detailed in Sec. 3.1 and 3.2 respectively. An overview of narrow and wide baseline losses we use for improving the depth network is shown in Fig. 3 and details are provided in Sec. 3.2.
3.1 Pose Refinement
Pseudo RGB-D for Improving Monocular SLAM. We employ a well explored and widely used geometry-based SLAM system, i.e., the RGB-D version of ORB-SLAM [mur2017orb], to process the pseudo RGB-D data, yielding camera poses as well as 3D map points and the associated 2D keypoints. Any other geometric SLAM system that provides these output estimates can also be used in place of ORB-SLAM. A trivial direct use of pseudo RGB-D data to run RGB-D ORB-SLAM is not possible, because the CNN might predict depth at a very different scale compared to depth measurements from real active sensors, e.g., LiDAR. Keeping the above difference in mind, we discuss an important adaptation in order for RGB-D ORB-SLAM to work well in our setting. We first note that RGB-D ORB-SLAM transforms the depth data into disparity on a virtual stereo to reuse the framework of stereo ORB-SLAM. Specifically, considering a keypoint with 2D coordinates $(u_l, v_l)$ (i.e., $u_l$ and $v_l$ denote the horizontal and vertical coordinates respectively) and a CNN-predicted depth $d_l$, the corresponding 2D keypoint coordinates $(u_r, v_r)$ on the virtual rectified right view are $u_r = u_l - \frac{f_x b}{d_l}$, $v_r = v_l$, where $f_x$ is the horizontal focal length and $b$ is the virtual stereo baseline.
Adaptation. In order to have a reasonable range of disparity, we mimic the setup of the KITTI dataset [geiger2012we] by making the baseline adaptive, $b = \frac{b^{\mathrm{KITTI}}}{d^{\mathrm{KITTI}}_{\max}} \, d_{\max}$, where $d_{\max}$ represents the maximum CNN-predicted depth of the input sequence, and $b^{\mathrm{KITTI}}$ and $d^{\mathrm{KITTI}}_{\max}$ (both in meters) are respectively the actual stereo baseline and empirical maximum depth value of the KITTI dataset.
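The virtual-stereo conversion and the adaptive baseline can be illustrated with a short sketch. The function names are hypothetical, and the reference values (a KITTI stereo baseline of roughly 0.54 m and a maximum depth of 80 m) are assumptions used only for illustration. The point of the example is that scaling all predicted depths by a constant leaves the resulting disparities unchanged, because the baseline scales with the maximum predicted depth.

```python
import numpy as np

B_KITTI = 0.54       # assumed KITTI stereo baseline in meters (illustrative)
D_MAX_KITTI = 80.0   # assumed empirical maximum depth in meters (illustrative)


def adaptive_baseline(d_max_seq, b_ref=B_KITTI, d_max_ref=D_MAX_KITTI):
    """Scale the virtual baseline with the maximum predicted depth of a sequence."""
    return b_ref / d_max_ref * d_max_seq


def virtual_right_keypoints(u_l, v_l, depth, fx, baseline):
    """Map left-view keypoints and depths to a virtual rectified right view."""
    u_l = np.asarray(u_l, dtype=float)
    v_l = np.asarray(v_l, dtype=float)
    depth = np.asarray(depth, dtype=float)
    if np.any(depth <= 0):
        raise ValueError("depth must be strictly positive")
    disparity = fx * baseline / depth      # horizontal shift in pixels
    return u_l - disparity, v_l            # rows are unchanged after rectification
```

For example, with `fx = 720.0`, a keypoint at `u_l = 100` with depth 10 and baseline 0.54 is shifted by 38.88 pixels. If the network instead predicted depths three times larger, the adaptive baseline would also be three times larger and the shift would stay the same.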
We also summarize the overall pipeline of RGB-D ORB-SLAM here. The 3D map is initialized at the very first frame of the sequence due to the availability of depth. After that, the following main tasks are performed: i) track the camera by matching 2D keypoints against the local map, ii) enhance the local map via local bundle adjustment, and iii) detect and close loops for pose-graph optimization and full bundle adjustment to improve camera poses and scene depths. As we will show in Sec. 4.4, using pseudo RGB-D data leads to better robustness and accuracy as compared to using only RGB data.
3.2 Depth Refinement
We start from the pre-trained depth network of Monodepth2 [godard2019digging], a state-of-the-art monocular depth estimation network, and fine-tune its network parameters with the camera poses, 3D map points and the associated 2D keypoints produced by the above pseudo RGB-D ORB-SLAM (pRGBD-SLAM). In contrast to Monodepth2, which relies only on the narrow baseline photometric reconstruction loss between adjacent frames for short-term consistencies, we propose wide baseline symmetric depth transfer and sparse depth consistency losses to introduce long-term consistencies. Our final loss (Eq. (4)) consists of both narrow and wide baseline losses. The narrow baseline losses, i.e., photometric and smoothness losses, involve the current keyframe $I_c$ and its temporally adjacent frames $I_{c-1}$ and $I_{c+1}$, while wide baseline losses are computed on the current keyframe $I_c$ and the two neighboring keyframes $I_{k1}$ and $I_{k2}$ that are temporally farther than $I_{c-1}$ and $I_{c+1}$ (see Fig. 3). Next, we introduce the notation and describe the losses in detail.
Notation. Let $\mathcal{X}$ represent the set of common tracked keypoints visible in all the three keyframes $I_{k1}$, $I_c$ and $I_{k2}$ obtained from pRGBD-SLAM. Note that $k1$ and $k2$ are two neighboring keyframes of the current frame $c$ (i.e., $k1 < c < k2$) in which keypoints are visible. Let $p^i_{k1}$, $p^i_c$ and $p^i_{k2}$ be the 2D coordinates of the $i$-th common tracked keypoint in the keyframes $k1$, $c$ and $k2$ respectively, and the associated depth values obtained from pRGBD-SLAM are represented by $d^i_{k1}(\mathrm{SLAM})$, $d^i_c(\mathrm{SLAM})$, and $d^i_{k2}(\mathrm{SLAM})$ respectively. The depth values corresponding to the keypoints $p^i_{k1}$, $p^i_c$ and $p^i_{k2}$ can also be obtained from the depth network and are represented by $d^i_{k1}(w)$, $d^i_c(w)$, and $d^i_{k2}(w)$ respectively, where $w$ stands for the depth network parameters.
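Symmetric Depth Transfer Loss. Given the camera intrinsic matrix $K$, and the depth value $d^i_c(w)$ of the keypoint $p^i_c$, the 2D coordinates of the keypoint $p^i_c$ can be back-projected to its corresponding 3D coordinates as: $X^i_c(w) = K^{-1} [p^i_c, 1]^T d^i_c(w)$. Let $T^{\mathrm{SLAM}}_{c \to k1}$ represent the relative camera pose of frame $k1$ w.r.t. frame $c$ obtained from pRGBD-SLAM. Using $T^{\mathrm{SLAM}}_{c \to k1}$, we can transfer the 3D point $X^i_c(w)$ from frame $c$ to $k1$ as: $[x^i_{c \to k1}(w), y^i_{c \to k1}(w), d^i_{c \to k1}(w)]^T = T^{\mathrm{SLAM}}_{c \to k1} X^i_c(w)$. Here, $d^i_{c \to k1}(w)$ is the transferred depth of the $i$-th keypoint from frame $c$ to frame $k1$. Following the above procedure, we can obtain the transferred depth $d^i_{k1 \to c}(w)$ of the same $i$-th keypoint from frame $k1$ to frame $c$. The symmetric depth transfer loss of the keypoint $p^i_c$ between frame pair $c$ and $k1$ is the sum of absolute errors ($\ell_1$ distance) between the transferred network-predicted depth $d^i_{c \to k1}(w)$ and the existing network-predicted depth $d^i_{k1}(w)$ in the target keyframe $k1$, and vice-versa. Mathematically, it can be written as:
$$\mathcal{T}^i_{c \leftrightarrow k1}(w) = \left| d^i_{c \to k1}(w) - d^i_{k1}(w) \right| + \left| d^i_{k1 \to c}(w) - d^i_c(w) \right|. \quad (1)$$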
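Similarly, we can compute the symmetric depth transfer loss of the same $i$-th keypoint between frame pair $c$ and $k2$, i.e., $\mathcal{T}^i_{c \leftrightarrow k2}(w)$, and between $k1$ and $k2$, i.e., $\mathcal{T}^i_{k1 \leftrightarrow k2}(w)$. We accumulate the total symmetric transfer loss between frame $c$ and $k1$ in $\mathcal{T}_{c \leftrightarrow k1}$, which is the loss of all the common tracked keypoints and the points within a fixed-size patch centered at each common tracked keypoint. Similarly, we compute the total symmetric depth transfer losses $\mathcal{T}_{c \leftrightarrow k2}$ and $\mathcal{T}_{k1 \leftrightarrow k2}$ between frame pairs $(c, k2)$ and $(k1, k2)$ respectively.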
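As a concrete illustration of the per-keypoint symmetric depth transfer described above, the following NumPy sketch back-projects a keypoint, moves it into the other keyframe with a 4x4 relative pose, and compares the transferred depth with the depth predicted there. The function names are hypothetical, the pose convention (a matrix mapping points from the source camera frame to the target camera frame) is an assumption, and the patch-based accumulation is omitted.

```python
import numpy as np


def backproject(K, p, depth):
    """Lift pixel p = (u, v) with the given depth to a 3D point in the camera frame."""
    return depth * (np.linalg.inv(K) @ np.array([p[0], p[1], 1.0]))


def transfer_depth(K, p_src, d_src, T_src_to_tgt):
    """Depth of the back-projected source keypoint as seen from the target frame."""
    X_src = backproject(K, p_src, d_src)
    X_tgt = T_src_to_tgt[:3, :3] @ X_src + T_src_to_tgt[:3, 3]
    return X_tgt[2]


def symmetric_depth_transfer(K, p_c, d_c, p_k, d_k, T_c_to_k):
    """Sum of absolute depth transfer errors in both directions for one keypoint."""
    T_k_to_c = np.linalg.inv(T_c_to_k)
    d_c_to_k = transfer_depth(K, p_c, d_c, T_c_to_k)
    d_k_to_c = transfer_depth(K, p_k, d_k, T_k_to_c)
    return abs(d_c_to_k - d_k) + abs(d_k_to_c - d_c)
```

When the two predicted depths are geometrically consistent with the relative pose, both terms vanish; any disagreement is penalized from both directions, which is what makes the loss symmetric.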
Depth Consistency Loss. The role of the depth consistency loss is to make the depth network's predictions consistent with the refined depth values obtained from pRGBD-SLAM. Note that depth values from pRGBD-SLAM undergo multiple optimizations over wide baselines, and hence are more accurate and capture long-term consistencies. We inject these long-term consistent depths from pRGBD-SLAM into the depth network through the depth consistency loss. The loss for the frame $c$ can be written as follows:
$$\mathcal{D}_c = \frac{\sum_{i \in \mathcal{X}} \left| d^i_c(w) - d^i_c(\mathrm{SLAM}) \right|}{|\mathcal{X}|}. \quad (2)$$
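A minimal sketch of a sparse depth consistency term is given below. It assumes the loss is the mean absolute difference between network-predicted depths sampled at the tracked keypoints and the corresponding SLAM depths; the function name and the nearest-pixel sampling are illustrative choices, not details taken from the original implementation.

```python
import numpy as np


def depth_consistency_loss(pred_depth_map, keypoints, slam_depths):
    """Mean absolute difference between predicted and SLAM depths at keypoints.

    pred_depth_map: (H, W) array of network-predicted depths.
    keypoints: iterable of (u, v) pixel coordinates (column, row).
    slam_depths: depths of the same keypoints from SLAM, same order.
    """
    kp = np.rint(np.asarray(keypoints, dtype=float)).astype(int)
    u, v = kp[:, 0], kp[:, 1]
    pred = pred_depth_map[v, u]            # row index is v, column index is u
    slam = np.asarray(slam_depths, dtype=float)
    return float(np.mean(np.abs(pred - slam)))
```

Because only tracked keypoints contribute, the supervision is sparse, which is why it is combined with dense appearance-based terms in the overall objective.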
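Photometric Reconstruction Loss. Denote the relative camera poses of frames $I_{c-1}$ and $I_{c+1}$ w.r.t. the current keyframe $I_c$ obtained from pRGBD-SLAM by $T^{\mathrm{SLAM}}_{c-1 \to c}$ and $T^{\mathrm{SLAM}}_{c+1 \to c}$ respectively. Using frame $I_{c+1}$, $T^{\mathrm{SLAM}}_{c+1 \to c}$, the network-predicted depth map $d_c(w)$ of the keyframe $I_c$, and the camera intrinsic $K$, we can synthesize the current frame $I_c$ [godard2019digging, godard2017unsupervised]. Let the synthesized frame be represented in the functional form as: $I_{c+1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c+1 \to c}, K)$. Similarly we can synthesize $I_{c-1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c-1 \to c}, K)$ using frame $I_{c-1}$. The photometric reconstruction error between the synthesized and the original current frame [garg2016unsupervised, godard2017unsupervised, zhou2017unsupervised] is then computed as:
$$\mathcal{P}_c = pe\left(I_{c+1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c+1 \to c}, K), I_c\right) + pe\left(I_{c-1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c-1 \to c}, K), I_c\right), \quad (3)$$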
where we follow [godard2017unsupervised, godard2019digging] to construct the photometric reconstruction error function $pe(\cdot, \cdot)$. Additionally, we adopt the more robust per-pixel minimum error, multi-scale strategy, auto-masking, and depth smoothness loss from [godard2019digging].
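The per-pixel minimum error mentioned above can be sketched as follows. This simplified version assumes the neighboring frames have already been warped into the current view and uses a plain L1 error only, whereas the cited Monodepth2 error additionally includes an SSIM term; the function names are hypothetical.

```python
import numpy as np


def l1_error(warped, target):
    """Per-pixel absolute error, averaged over color channels."""
    return np.mean(np.abs(warped - target), axis=-1)


def min_reprojection_loss(target, warped_list):
    """Average over pixels of the minimum error across all warped source frames."""
    errors = np.stack([l1_error(w, target) for w in warped_list], axis=0)
    per_pixel_min = np.min(errors, axis=0)   # choose the best source per pixel
    return float(np.mean(per_pixel_min))
```

Taking the minimum instead of the sum lets a pixel that is occluded in one neighboring frame be explained by the other frame, so it does not dominate the loss.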
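Our final loss for fine-tuning the depth network at the depth refinement step is the weighted sum of narrow baseline losses (i.e., photometric ($\mathcal{P}_c$) and smoothness loss ($\mathcal{S}_c$)), and wide baseline losses (i.e., symmetric depth transfer ($\mathcal{T}_{c \leftrightarrow k1}$, $\mathcal{T}_{c \leftrightarrow k2}$, $\mathcal{T}_{k1 \leftrightarrow k2}$) and depth consistency loss ($\mathcal{D}_c$)):
$$\mathcal{L} = \alpha \mathcal{P}_c + \beta \mathcal{S}_c + \gamma \mathcal{D}_c + \mu \left( \mathcal{T}_{c \leftrightarrow k1} + \mathcal{T}_{c \leftrightarrow k2} + \mathcal{T}_{k1 \leftrightarrow k2} \right). \quad (4)$$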
4 Experiments
We conduct experiments to evaluate the depth refinement and pose refinement steps of our self-improving framework against the state of the art in self-supervised depth estimation and RGB-SLAM based pose estimation, respectively.
4.1 Datasets and Evaluation Metrics
KITTI Dataset. Our experiments are mostly performed on the KITTI dataset [geiger2012we], which contains outdoor driving sequences. We further split the KITTI experiments into two parts: one focused on depth refinement evaluation and the other on pose refinement. For depth refinement evaluation, we train/fine-tune the depth network using the Eigen train split [eigen2014depth], which contains 28 training sequences, and evaluate depth prediction on the Eigen test split [eigen2014depth] following the baselines [zhou2017unsupervised, yang2017unsupervised, mahjourian2018unsupervised, yin2018geonet, wang2018learning, zou2018dfnet, yang2018lego, ranjan2019competitive, luo2018every, casser2019depth]. For pose refinement evaluation, we train/fine-tune the depth network using KITTI odometry sequences 00-08 and test on sequences 09-10 and 11-21. Note that for evaluation on sequences 09-10 we use the ground-truth trajectories provided by [geiger2012we], while for evaluation on sequences 11-21, since the ground truth is not available, we use pseudo ground-truth trajectories obtained by running the stereo version of ORB-SLAM on these sequences.
TUM RGB-D Dataset. For completeness and to demonstrate the capability of our self-improving framework on indoor scenes, we evaluate on the TUM RGB-D dataset [sturm2012benchmark], which consists of indoor sequences captured by a hand-held camera. We use 6 of 8 freiburg3 sequences to train/fine-tune the depth network and the remaining 2 for evaluation. We choose freiburg3 sequences because only they have undistorted RGB images and ground truth to train/fine-tune and evaluate respectively.
Metrics for Pose Evaluation. For quantitative pose evaluation, we compute the Root Mean Square Error (RMSE), Relative Translation (Rel Tr) error, and Relative Rotation (Rel Rot) error of the predicted camera trajectory. Since monocular SLAM systems can only recover camera poses up to a global scale, we align the camera trajectory estimated by each method with the ground truth one using the EVO toolbox [grupp2017evo]. We then use the official evaluation code from the KITTI Odometry benchmark to compute the Rel Tr and Rel Rot errors for all sub-trajectories with length in $\{100, 200, \ldots, 800\}$ meters.
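To make the scale-aware trajectory alignment concrete, the sketch below estimates a similarity transform (scale, rotation, translation) between estimated and ground-truth camera positions with the closed-form Umeyama method and then reports the RMSE of the aligned positions. It is a standalone NumPy illustration with hypothetical function names and does not reproduce the EVO toolbox interface.

```python
import numpy as np


def umeyama_alignment(src, dst):
    """Find s, R, t minimizing sum ||dst_i - (s * R @ src_i + t)||^2.

    src, dst: (N, 3) arrays of corresponding 3D positions.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance (dst vs. src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * (R @ mu_s)
    return s, R, t


def ate_rmse(est, gt):
    """RMSE of position error after similarity alignment of est to gt."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Because the scale factor is estimated jointly with rotation and translation, a monocular trajectory that is correct up to a global scale yields an error of zero in this sketch.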
Metrics for Depth Evaluation. For quantitative depth evaluation, we use the standard metrics, including the Absolute Relative (Abs Rel) error, Squared Relative (Sq Rel) error, RMSE, RMSE log, $\delta < 1.25$ (namely a1), $\delta < 1.25^2$ (namely a2), and $\delta < 1.25^3$ (namely a3) as defined in [eigen2014depth]. Again, since the depths from monocular images can only be estimated up to scale, we align the predicted depth map with the ground truth one using their median depth values. Following [eigen2014depth] and other baselines, we also clip the depths to 80 meters.
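The depth metrics with median scaling and depth capping can be sketched in a few lines of NumPy. The function name, the minimum depth of `1e-3`, and the exact validity mask are assumptions made for this illustration; the 80 meter cap follows the text above.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Standard monocular depth metrics after median scaling and clipping."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = (gt > min_depth) & (gt < max_depth)      # keep usable ground truth
    pred, gt = pred[valid], gt[valid]
    pred = pred * (np.median(gt) / np.median(pred))  # resolve the unknown scale
    pred = np.clip(pred, min_depth, max_depth)
    ratio = np.maximum(gt / pred, pred / gt)         # delta in the a1-a3 metrics
    err = gt - pred
    return {
        "abs_rel": float(np.mean(np.abs(err) / gt)),
        "sq_rel": float(np.mean(err ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),
        "a2": float(np.mean(ratio < 1.25 ** 2)),
        "a3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Passing a smaller `max_depth` (for instance 30) restricts the evaluation to nearby points, which is one way to compare near-range and far-range accuracy.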
Note. In all the tables, the best performance is shown in bold and the second best is underlined.
4.2 Implementation Details
We implement our framework based on Monodepth2 [godard2019digging] and ORB-SLAM [mur2017orb], i.e., we use the depth network of Monodepth2 and the RGB-D version of ORB-SLAM for depth refinement and pose refinement respectively. We would like to emphasize that our self-improving strategy is not specific to Monodepth2 or ORB-SLAM. Any other depth network that can incorporate SLAM outputs and any SLAM system that can provide the desired SLAM outputs can be put into the self-improving framework. We set the weight of the smoothness loss term of the final loss (Eq. (4)) as in [godard2019digging], and the weights of the remaining three terms (photometric, depth consistency, and symmetric depth transfer) to 1. The ablation study results on disabling different loss terms can be found in Tab. 6.
KITTI Eigen Split/Odometry Experiments. We pre-train Monodepth2 using monocular videos of the KITTI Eigen split training set with the hyper-parameters suggested in Monodepth2 [godard2019digging]. We use a reduced input/output resolution for pre-training/fine-tuning and scale it up to the original resolution while running pRGBD-SLAM. We use the same hyperparameters as for the KITTI Eigen split to pre-train/fine-tune the depth model on the KITTI Odometry train sequences mentioned in Sec. 4.1. During a self-improving loop, we discard the pose network of Monodepth2 and instead use camera poses from pRGBD-SLAM.
Outlier Removal. Before running a depth refinement step, we run an outlier removal step on the SLAM outputs. Specifically, we filter out outlier 3D map points and the associated 2D keypoints that satisfy at least one of the following conditions: i) the point is observed in fewer than 3 keyframes, ii) its reprojection error in the current keyframe is larger than 3 pixels.
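The two filtering conditions translate directly into a small predicate. The `MapPoint` record and its field names are hypothetical and serve only to illustrate the rule; a real SLAM system exposes this information through its own data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MapPoint:
    """Hypothetical record for one tracked 3D map point."""
    num_keyframe_obs: int      # number of keyframes observing the point
    reproj_error_px: float     # reprojection error in the current keyframe


def remove_outliers(points: List[MapPoint],
                    min_obs: int = 3,
                    max_reproj_px: float = 3.0) -> List[MapPoint]:
    """Keep points seen in at least min_obs keyframes with small reprojection error."""
    return [
        p for p in points
        if p.num_keyframe_obs >= min_obs and p.reproj_error_px <= max_reproj_px
    ]
```

A point is dropped as soon as either condition is violated, so only well-observed and geometrically consistent keypoints supervise the depth network.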
Camera Intrinsics. Monodepth2 computes the average camera intrinsics for the KITTI dataset and uses it for the training. However, for our fine-tuning of the depth network, using the average camera intrinsics leads to inferior performance, because we use the camera poses from pRGBD-SLAM, which runs with different camera intrinsics. Therefore, we use different camera intrinsics for different sequences when fine-tuning the depth network.
For fine-tuning the depth network pre-trained on KITTI Eigen split training sequences, we run pRGBD-SLAM on all the training sequences, and extract camera poses, 2D keypoints and the associated depths from keyframes. For pRGBD-SLAM (RGB-D ORB-SLAM), we use the default setting of ORB-SLAM, except for the adaptation described in Sec. 3.1. The same procedure is followed for the depth model pre-trained on KITTI Odometry training sequences. The average number of keyframes used in a self-improving loop is and
for KITTI Eigen split and KITTI Odometry experiments respectively. At each depth refinement step, we fine-tune the depth network parameters for 1 epoch only, using a learning rate of 1e-6 and keeping all the other hyperparameters the same as in pre-training. For both KITTI Eigen split and KITTI Odometry experiments we report results after 5 self-improving loops.
TUM RGB-D Experiments. For TUM RGB-D, we pre-train/fine-tune the depth network on 6 freiburg3 sequences, and test on 2 freiburg3 sequences. The average number of keyframes in a self-improving loop is . We use an input/output resolution of for pre-training/fine-tuning and scale it up to the original resolution while running pRGBD-SLAM. We report results after 3 self-improving loops. Other details can be found in the supplementary material.
4.3 Monocular Depth/Depth Refinement Evaluation
In the following, we evaluate the performance of our depth estimation on the KITTI Raw Eigen split test set and TUM RGB-D freiburg3 sequences.
Results on KITTI Eigen Split Test Set. We show the depth evaluation results on the Eigen split test set in Tab. 1. From the table, it is evident that our refined depth model (pRGBD-Refined) outperforms all the competing monocular (M) unsupervised methods by non-trivial margins, including the pre-trained depth model, i.e., MonoDepth2-M, and even surpasses the unsupervised methods with stereo (S) training, i.e., Monodepth2-S, and combined monocular-stereo (MS) training, i.e., MonoDepth2-MS, in most metrics. Our method also outperforms several ground-truth/auxiliary depth supervised methods [eigen2014depth, liu2015learning, klodt2018supervising, nath2018adadepth].
Abs Rel, Sq Rel, RMSE, and RMSE log: lower is better. a1, a2, and a3: higher is better.
|Method|Train|Abs Rel|Sq Rel|RMSE|RMSE log|a1|a2|a3|
|3Net (VGG) [poggi2018learning]|S|0.119|1.201|5.888|0.208|0.844|0.941|0.978|
The reason is probably that the aggregated cues from multiple views with wide baseline losses (e.g., our symmetric depth transfer and depth consistency losses) lead to more well-posed depth recovery, and hence even higher accuracy than learning with a pre-calibrated stereo rig with smaller baselines. Further analysis is provided in Sec. 5. Fig. 4 shows some qualitative results, where pRGBD-Refined shows visible improvements at occlusion boundaries and thin objects.
Results on TUM RGB-D Sequences. The depth evaluation results on the two TUM freiburg3 RGB-D sequences are shown in Tab. 5. Our refined depth model (pRGBD-Refined) outperforms pRGBD-Initial/Monodepth2-M in both sequences and all metrics. Due to space limitations, we have moved qualitative results to the supplementary material.
4.4 Monocular SLAM/Pose Refinement Evaluation
In this section, we evaluate pose estimation/refinement on the KITTI Odometry sequences 09 and 10, KITTI Odometry test set sequences 11-21, and two TUM freiburg3 RGB-D sequences.
|Method|Seq. 09 RMSE|Seq. 09 Rel Tr|Seq. 09 Rel Rot|Seq. 10 RMSE|Seq. 10 Rel Tr|Seq. 10 Rel Rot|
|Wang et al. [wang2019recurrent]|-|9.88|0.034|-|12.24|0.052|
|Li et al. [li2019pose]|-|8.10|0.028|-|12.90|0.032|
Results on KITTI Odometry Sequences 09 and 10. We show the quantitative results on seqs 09 and 10 in Tab. 2. It can be seen that our pRGBD-Initial outperforms RGB ORB-SLAM [mur2015orb] both in terms of RMSE and Rel Tr.
Our pRGBD-Refined further improves pRGBD-Initial in all metrics, which verifies the effectiveness of our self-improving mechanism in terms of pose estimation. The higher Rel Rot errors of our methods compared to RGB ORB-SLAM could be due to the high uncertainty of CNN-predicted depths for far-away points, which affects our rotation estimation [hartley2003multiple]. In addition, our methods outperform all the competing supervised and self-supervised methods by a large margin, except for the supervised method of [xue2019beyond] with lower Rel Tr than ours on sequence 10. Note that we evaluate the camera poses produced by the pose network of Monodepth2-M [godard2019digging] in Tab. 2, yielding much higher errors than ours.
Fig. 5(a) shows the camera trajectories estimated for sequence 09 by RGB ORB-SLAM, our pRGBD-Initial, and pRGBD-Refined. It is evident that, although all the methods perform loop closure successfully, our methods generate camera trajectories that align better with the ground truth.
Results on KITTI Odometry Test Set. The KITTI Odometry leaderboard requires complete camera trajectories of all frames of all the sequences. Because we keep the default ORB-SLAM settings, tracking fails in a few sequences. To facilitate quantitative evaluation on this test set (i.e., sequences 11-21), we therefore use the pseudo ground truth computed as described in Sec. 4.1 to evaluate all the competing methods in Tab. 3. From the results, RGB ORB-SLAM fails on three challenging sequences due to tracking failures, whereas our pRGBD-Initial fails on two sequences and our pRGBD-Refined fails only on one sequence. Among the sequences where all the competing methods succeed, our pRGBD-Initial reduces the RMSEs of RGB ORB-SLAM by a considerable margin for all sequences except for sequence 19. After our self-improving mechanism, our pRGBD-Refined further boosts the performance, reaching the best results both in terms of RMSE and Rel Tr. Fig. 5(b) shows qualitative comparisons on sequence 19.
Tab. 3. RMSE, Rel Tr, and Rel Rot are reported for each compared method.
Fig. 5. Estimated camera trajectories: (a) seq 09, (b) seq 19.
Results on TUM RGB-D Sequences. Performance of pose refinement step on the two TUM RGB-D sequences is shown in Tab. 4. The result shows increased robustness and accuracy by pRGBD-Refined. In particular, RGB ORB-SLAM fails on walking_xyz, while pRGBD-Refined succeeds and achieves the best performance on both sequences. Due to space limitation we have moved qualitative results to supplementary material.
Tab. 5. Depth evaluation on the TUM RGB-D sequences. Columns: Method, Abs Rel, Sq Rel, RMSE, RMSE log (lower is better), and a1, a2, a3 (higher is better).
5 Analysis of Self-Improving Loops
Fig. 6. Depth/pose evaluation metrics w.r.t. self-improving loops. Depth evaluation metrics in (a-c) are computed at different max depth caps ranging from 30 to 80 meters.
In this section, we analyze the behaviour of three different evaluation metrics for depth estimation: Squared Relative (Sq Rel) error, RMSE, and the accuracy metric a2, as defined in Sec. 4. The pose estimation is evaluated using the absolute trajectory pose error. In Fig. 6, we use the KITTI Eigen split dataset and report these metrics for each iteration of the self-improving loop. The evaluation metrics at the start of the self-improving loop, before any fine-tuning, are those of the pre-trained Monodepth2-M. We summarize the findings from the plots in Fig. 6 below:
A comparison of evaluation metrics of farther scene points (e.g. max depth 80) with nearby points (e.g. max depth 30) at the self-improving loop shows that the pre-trained MonoDepth2 performs poorly for farther scene points compared to nearby points.
In the subsequent self-improving loops, we can see the rate of reduction in the Sq Rel and RMSE error is significant for farther away points compared to nearby points, e.g., slope of error curves in Fig. 6(b-c) corresponding to max depth 80 is steeper than that of max depth 30. This validates our hypothesis of including wider baseline losses that help the depth network predict more accurate depth values for farther points. Overall, our joint narrow and wide baseline based learning setup helps improve the depth prediction of both the nearby and farther away points, and outperforms MonoDepth2 [godard2019digging].
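The capped evaluation discussed above can be illustrated with a minimal sketch. It assumes the standard definitions of the seven depth metrics commonly used on the KITTI Eigen split, which may differ in detail from those in Sec. 4. The function name `depth_metrics` and the `min_depth` threshold are hypothetical, and the median scaling usually applied to monocular predictions is omitted for brevity.

```python
import numpy as np

def depth_metrics(gt, pred, max_depth=80.0, min_depth=1e-3):
    """Standard depth metrics over pixels whose ground truth lies in (min_depth, max_depth).

    Hypothetical helper for illustration only, not the authors' evaluation code.
    """
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)

    # Keep only pixels with valid ground truth below the depth cap.
    mask = (gt > min_depth) & (gt < max_depth)
    gt, pred = gt[mask], pred[mask]

    # Clamp predictions to the evaluated range (a common convention, assumed here).
    pred = np.clip(pred, min_depth, max_depth)

    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "a1": np.mean(ratio < 1.25),
        "a2": np.mean(ratio < 1.25 ** 2),
        "a3": np.mean(ratio < 1.25 ** 3),
    }

# Toy example: errors occur only on the far points, so the cap changes the metrics.
gt_depth = np.array([10.0, 20.0, 40.0, 60.0])
pred_depth = np.array([10.0, 20.0, 50.0, 75.0])
metrics_by_cap = {cap: depth_metrics(gt_depth, pred_depth, max_depth=cap)
                  for cap in (30.0, 80.0)}
```

Evaluating the same prediction at several caps, as in `metrics_by_cap`, separates the error contributed by nearby points from that contributed by farther points.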
In this work, we propose a self-improving framework to couple geometrical and learning based methods for 3D perception. A win-win situation is achieved: both monocular SLAM and depth prediction are improved by a significant margin without any additional active depth sensor or ground-truth labels. Currently, our self-improving framework works only in an off-line mode, so developing an on-line, real-time self-improving system remains future work. Another avenue for future work is to move towards more challenging settings, e.g., uncalibrated cameras [zhuang2019degeneracy] or rolling shutter cameras [zhuang2019learning].
This work was part of L. Tiwari's internship at NEC Labs America, in San Jose. L. Tiwari was supported by the Visvesvaraya Ph.D. Fellowship. L. Tiwari and S. Anand were also supported by the Infosys Center for Artificial Intelligence, IIIT-Delhi.
Appendix 0.A Supplementary Material
This supplementary material is organized as follows. We first present depth refinement results on KITTI Odometry sequences in Sec. 0.A.1. Next, we give a comparison of our pose refinement with state-of-the-art RGB SLAM approaches in Sec. 0.A.2. We further evaluate pose refinement on KITTI Leaderboard in Sec. 0.A.3. Additional implementation details and qualitative results of TUM RGB-D experiments are included in Sec. 0.A.4. Additional analysis of the self-improving loop with all the 7 depth evaluation metrics is presented in Sec. 0.A.5. Some additional qualitative depth evaluation results of KITTI Eigen experiments and pose evaluation results of KITTI Odometry experiments are presented in Sec. 0.A.6 and Sec. 0.A.7 respectively. Finally, we provide some demo videos on KITTI Odometry and TUM RGB-D sequences in Sec. 0.A.8.
0.A.1 Depth Refinement Evaluation on KITTI Odometry
We evaluate the depth refinement step of our self-improving pipeline on KITTI Odometry sequences 09 and 10. The first block of Tab. 7 (i.e., MonoDepth2-M vs. pRGBD-Refined) shows the improved results after the depth refinement step. We also compare our method with DCNF [yin2017scale], a state-of-the-art depth refinement method. Note that DCNF uses ground-truth depths for pre-training the network, while our method uses only unlabelled monocular images and still outperforms DCNF (see the second block of Tab. 7). The results show that our self-improving framework with the wide-baseline losses (i.e., symmetric depth transfer and depth consistency losses) improves the depth prediction.
[Tab. 7 columns: Method; Train; Cap; Abs Rel, Sq Rel, RMSE, RMSE log (lower is better); a1, a2, a3 (higher is better). Table body missing.]
0.A.2 Comparison with State-Of-The-Art SLAM Methods
In this section, we compare our pRGBD-Initial and pRGBD-Refined methods against state-of-the-art RGB SLAM methods, i.e., Direct Sparse Odometry (DSO) [engel2017direct], Direct Sparse Odometry with Loop Closure (LDSO) [gao2018ldso], and Direct Sparse Odometry in Dynamic Environments (DSOD) [ma2019dsod]. The results are shown in Tab. 8. Our pRGBD-Refined outperforms all the competing methods in Absolute Trajectory Error (RMSE) and Relative Translation (Rel Tr) error. While the improvement in these two metrics is substantial, our Relative Rotation (Rel Rot) error is higher than that of the competing RGB SLAM methods. This could be due to the high uncertainty of CNN-predicted depths for far-away points, which affects our rotation estimation [hartley2003multiple]. However, comparing pRGBD-Initial with pRGBD-Refined, the Rel Rot error decreases as the depth prediction improves (see MonoDepth2-M/pRGBD-Initial vs. pRGBD-Refined in Tab. 7, and Tab. 8).
[Tab. 8 columns: Method; RMSE, Rel Tr, Rel Rot for Seq. 09; RMSE, Rel Tr, Rel Rot for Seq. 10. Table body missing.]
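The Absolute Trajectory Error (RMSE) used in this comparison can be sketched as follows. This is a minimal illustration, assuming the estimated camera positions are first aligned to the ground truth with a least-squares similarity transform (Umeyama's method), as is common for monocular trajectories with unknown scale. The exact alignment used for Tab. 8 is not stated here, and the function names `umeyama_alignment` and `ate_rmse` are hypothetical.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D positions.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    n = src.shape[0]

    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard so that R is a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = (D * np.diag(S)).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def ate_rmse(est, gt):
    """RMSE of position errors after similarity alignment of est to gt."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = (s * est) @ R.T + t
    errors = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```

Because the alignment absorbs a global scale, rotation, and translation, this metric measures only the residual shape error of the trajectory, including any scale drift along it.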
0.A.3 KITTI Odometry Leaderboard Results
In the main paper, we keep the default setting from ORB-SLAM, which leads to tracking failures of all methods in a few sequences (see Tab. 3 of the main paper). The KITTI Odometry leaderboard requires results on all test sequences (i.e., sequences 11-21) for evaluation. Therefore, we increase the minimum number of inliers for adding keyframes from 100 to 500 so that our pRGBD-Refined succeeds on all sequences. We report the results of our pRGBD-Refined on the KITTI Odometry leaderboard in Tab. 9. The results show that our method outperforms the competing monocular/LiDAR-based methods in terms of both relative translation and rotation errors.
|Method||Rel Tr||Rel Rot|
|VISO2-M+GP [geiger2011stereoscan, song2014robust]||7.46||0.0245|
0.A.4 Experiments on TUM RGB-D Sequences
0.A.4.1 Implementation Details
We pre-train/fine-tune the depth network on image resolution . For pre-training, we set the learning rate to initially, reduce it to after 20 epochs, and train for 30 epochs. For fine-tuning, we extract camera poses, 2D keypoints and the associated depths from keyframes while running RGB-D ORB-SLAM on the training sequences. We fine-tune the depth network with the fixed learning rate of . We use the following 6 sequences for pre-training/fine-tuning: 1. fr3/long_office_household, 2. fr3/long_office_household_validation, 3. fr3/sitting_xyz, 4. fr3/structure_texture_far, 5. fr3/structure_texture_near, 6. fr3/teddy, and the following 2 sequences for testing: 1. fr3/walking_xyz, 2. fr3/large_cabinet_validation. Note that these are the only 8 sequences with provided rectified images among the entire TUM RGB-D dataset.
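The pre-training schedule described above (one learning-rate reduction after 20 epochs, 30 epochs in total) can be sketched as a single step decay. The learning-rate values below are hypothetical placeholders chosen only for illustration, not the values used in our experiments, and `step_lr` is a hypothetical helper.

```python
def step_lr(epoch, base_lr, decayed_lr, decay_epoch=20):
    """Learning rate for a 0-indexed epoch under a single step decay."""
    return base_lr if epoch < decay_epoch else decayed_lr

# Hypothetical placeholder values, NOT the values used in the paper.
BASE_LR = 1e-3
DECAYED_LR = 1e-4
NUM_EPOCHS = 30

# Epochs 0-19 use BASE_LR, epochs 20-29 use DECAYED_LR.
schedule = [step_lr(epoch, BASE_LR, DECAYED_LR) for epoch in range(NUM_EPOCHS)]
```

Fine-tuning, by contrast, uses a fixed learning rate, which corresponds to a constant schedule with no decay step.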
0.A.4.2 Qualitative Results
Fig. 7(a) and Fig. 7(b) show qualitative pose evaluation results on the test sequences walking_xyz and large_cabinet_validation, respectively. The results show that pRGBD-Refined increases both robustness and accuracy. In particular, RGB ORB-SLAM fails on walking_xyz, while pRGBD-Refined succeeds and achieves the best performance on both sequences. Some qualitative depth refinement results are presented in Fig. 8. The disparity between the depth values of nearby and farther scene points becomes clearer, e.g., see the depth around the two monitors.
[Figure panels: (a) RGB, (b) pRGBD-Initial, (c) pRGBD-Refined.]
0.A.5 Additional Plots of Self-Improving Loop Analysis
In the main paper, we have shown the behaviour of 3 depth evaluation metrics, namely Sq Rel, RMSE, and a2. In this section, we present the behaviour of all 7 depth evaluation metrics together with the pose evaluation metrics. Our analysis in Sec. 5 of the main paper holds for all 7 depth evaluation metrics.
0.A.6 Additional Depth Refinement Qualitative Results
Fig. 10 shows visual improvements in the depth predictions of farther scene points. Fig. 11 shows additional qualitative results, where pRGBD-Refined yields visible improvements at occlusion boundaries and on thin objects. These improvements arise because cues aggregated from multiple views with wider baselines (e.g., through our depth transfer and depth consistency losses) lead to a more well-posed depth recovery.
0.A.7 Additional Pose Refinement Qualitative Results
Some additional qualitative pose refinement results are shown in Fig. 12. In all three sequences, our pRGBD-Refined aligns well with the ground-truth trajectory. Note that both RGB ORB-SLAM and our pRGBD-Initial fail on sequence 12, whereas our pRGBD-Refined succeeds, showing the enhanced robustness provided by our self-improving framework.
[Fig. 12 panels: (a) Seq 11, (b) Seq 12, (c) Seq 15.]
0.A.8 Demo Videos
We include example videos on sequences 11 and 19 of KITTI Odometry (i.e., http://tiny.cc/pRGBD_KITTI_11 and http://tiny.cc/pRGBD_KITTI_19, respectively) and on sequence fr3/large_cabinet_validation of TUM RGB-D (i.e., http://tiny.cc/pRGBD_TUM_LCV). In particular, we illustrate the improvements in depth prediction at frames 140 and 352 of pRGBD_KITTI_11, frames 1652, 3248, and 3529 of pRGBD_KITTI_19, and frames 153 and 678 of pRGBD_TUM_LCV. In addition, we highlight the failure of RGB ORB-SLAM at frame 2985 of pRGBD_KITTI_19.