1 Introduction
One of the most reliable cues for 3D perception from a monocular camera arises from camera motion, which induces multiple-view geometric constraints [hartley2003multiple] that encode the 3D scene structure. Simultaneous Localization And Mapping (SLAM) [davison2007monoslam, klein2007parallel, newcombe2011dtam] has long been studied as a way to simultaneously recover the 3D structure of the surrounding scene and estimate the ego-motion of the agent. With the advent of Convolutional Neural Networks (CNNs), unsupervised learning of single-view depth estimation [garg2016unsupervised, godard2017unsupervised, zhou2017unsupervised] has emerged as a promising alternative to the traditional geometric approaches. Such methods rely on CNNs to extract meaningful depth cues (e.g., shading, texture, and semantics) from a single image, yielding very promising results.
Despite the general maturity of monocular geometric SLAM [engel2014lsd, mur2015orb, engel2017direct] and the rapid advances in unsupervised monocular depth prediction approaches [mahjourian2018unsupervised, wang2018learning, yin2018geonet, bian2019unsupervised, godard2019digging, Sheng_2019_ICCV], both still have their own limitations.
Monocular SLAM. Traditional monocular SLAM has well-known limitations in robustness and accuracy compared to methods that leverage active depth sensors, e.g., RGB-D SLAM [mur2017orb]. This performance gap is due to the inherent scale ambiguity of depth recovery from monocular cameras, which causes the so-called scale drift in both the camera trajectory and the 3D scene depth, and thus lowers the robustness and accuracy of conventional monocular SLAM. In addition, the triangulation-based depth estimation employed by traditional SLAM methods is degenerate under pure rotational camera motion [hartley2003multiple].
Unsupervised Monocular Depth Prediction. Most unsupervised and self-supervised depth estimation methods [zhou2017unsupervised, godard2017unsupervised, godard2019digging, bian2019unsupervised] formulate single-image depth estimation as a novel-view synthesis problem, with appearance-based photometric losses being central to the training strategy. Usually, these models train two networks, one each for pose and depth. As photometric losses largely rely on the brightness constancy assumption, nearly all existing self-supervised approaches operate in a narrow-baseline setting, optimizing the loss over a snippet of 2-5 consecutive frames. Consequently, models like Monodepth2 [godard2019digging] work very well for close-range points, but generate inaccurate depth estimates for points that are farther away (e.g., see Fig. 6). While it is well known that a wide baseline yields better depth estimates for points at larger depth, a straightforward extension of existing CNN-based approaches is inadequate for two reasons. First, a wide baseline in a video sequence implies a larger temporal window, which in most practical scenarios will violate the brightness constancy assumption, rendering the photometric loss ineffective. Second, larger temporal windows (wider baselines) also imply more occluded regions that behave as outliers. Unless these aspects are handled effectively, training CNN-based depth and pose networks in the wide-baseline setting will lead to inaccuracies and biases.
In view of the limitations of both monocular geometric SLAM and unsupervised monocular depth estimation approaches, a particularly interesting question is whether these two approaches can complement each other (see Sec. 5) and mitigate the issues discussed above. Our work makes contributions towards answering this question. Specifically, we propose a self-supervised, self-improving framework for these two tasks, which is shown to improve the robustness and accuracy of each of them.
While the performance gap between geometric SLAM and self-supervised learning-based SLAM methods is still large, incorporating depth information drastically improves the robustness of geometric SLAM methods (e.g., see RGB-D SLAM vs. RGB SLAM on the KITTI Odometry leaderboard [geiger2012we]). Inspired by this success of RGB-D SLAM, we postulate the use of an unsupervised CNN-based depth estimation model as a pseudo depth sensor. This allows us to design our self-supervised approach, pseudo RGB-D SLAM (pRGBD-SLAM), which uses only monocular cameras and yet achieves significant improvements in robustness and accuracy compared to RGB SLAM.
Our fusion of geometric SLAM and CNN-based monocular depth estimation turns out to be symbiotic, and this complementary nature forms the basis of our self-improving framework. To improve the depth predictions, we make two main modifications to the training strategy. First, we eschew the learning-based pose estimates in favor of geometric SLAM-based estimates (an illustrative motivation is shown in Fig. 1). Second, we use common tracked keypoints from neighboring keyframes and impose a symmetric depth transfer loss and a depth consistency loss on the CNN model. These adaptations are based on the observation that both pose estimates and sparse 3D feature point estimates from geometric SLAM are robust, as most techniques apply multiple bundle adjustment iterations over wide-baseline depth estimates of common keypoints. This simple observation and the subsequent modification are key to our self-improving framework, which can leverage any unsupervised CNN-based depth estimation model and a modern monocular SLAM method. In this paper, we test our framework with ORB-SLAM [mur2017orb] as the geometric SLAM method and Monodepth2 [godard2019digging] as the CNN-based model. We show that our self-improving framework outperforms previously proposed self-supervised approaches that utilize monocular, stereo, and monocular-plus-stereo cues for self-supervision (see Tab. 1) as well as a strong feature-based RGB-SLAM baseline (see Tab. 3).
The framework runs in a simple alternating update fashion: first, we use depth maps from the CNN-based depth network to run pRGBD-SLAM; second, we inject the outputs of pRGBD-SLAM, i.e., the relative camera poses and the common tracked keypoints and keyframes, to fine-tune the depth network parameters and improve the depth prediction; then, we repeat the process until we see no further improvement. Our specific contributions are summarized here:

We propose a self-improving strategy to inject into depth prediction networks the supervision from SLAM outputs, which stem from more generally applicable geometric principles.

We introduce two wide-baseline losses, i.e., the symmetric depth transfer loss and the depth consistency loss on common tracked points, and propose a joint narrow- and wide-baseline depth prediction learning setup, where appearance-based losses are computed on narrow baselines and purely geometric losses on wide baselines (non-consecutive, temporally distant keyframes).

Through extensive experiments on KITTI [geiger2012we] and TUM RGB-D [sturm2012benchmark], our framework is shown to outperform both a monocular SLAM system (i.e., ORB-SLAM [mur2015orb]) and the state-of-the-art unsupervised single-view depth prediction network (i.e., Monodepth2 [godard2019digging]).
2 Related Work
Monocular SLAM.
Visual SLAM has a long history of research in the computer vision community. Due to its well-understood underlying geometry, various geometric approaches have been proposed in the literature, ranging from the classical MonoSLAM [davison2007monoslam], PTAM [klein2007parallel], and DTAM [newcombe2011dtam] to the more recent LSD-SLAM [engel2014lsd], ORB-SLAM [mur2015orb], and DSO [engel2017direct]. More recently, in view of the successful application of deep learning in a wide variety of areas, researchers have also started to exploit deep learning approaches for SLAM, in the hope that they can improve certain components of geometric approaches or even serve as a complete alternative. Our work makes further contributions along this line of research.
Monocular Depth Prediction. Inspired by the pioneering work of Eigen et al. [eigen2014depth] on learning single-view depth estimation, a large number of learning-based methods have emerged along this line of research. The earlier works often require ground truth depths for fully supervised training. However, per-pixel depth ground truth is generally hard or prohibitively costly to obtain. Therefore, many self-supervised methods that use geometric constraints as supervision signals have been proposed. Specifically, thanks to the Spatial Transformer Network [jaderberg2015spatial], a differentiable photometric reconstruction loss has been successfully applied to monocular depth estimation. One example is the work of Godard et al. [godard2017unsupervised], which relies on the photo-consistency between the left and right cameras of a calibrated stereo rig. Zhou et al. [zhou2017unsupervised] go one step further and learn monocular depth prediction together with ego-motion estimation, thereby permitting unsupervised learning with only a monocular camera. This pipeline has inspired a large number of follow-up works that utilize various additional heuristics, including 3D geometric constraints on point clouds [mahjourian2018unsupervised], direct visual odometry [wang2018learning], joint learning with optical flow [yin2018geonet], scale consistency [bian2019unsupervised], and others [godard2019digging, Sheng_2019_ICCV].
Using Depth to Improve Monocular SLAM. Approaches [tateno2017cnn, yin2017scale, yang2018deep, loo2019cnn] that leverage CNN-based depth estimates to tackle issues in monocular SLAM have been proposed. CNN-SLAM [tateno2017cnn] uses learned depth maps to initialize keyframes' depth maps in LSD-SLAM [engel2014lsd] and refines them via a filtering framework. Yin et al. [yin2017scale] use a combination of CNNs and conditional random fields to recover scale from the depth predictions and iteratively refine ego-motion and depth estimates. DVSO [yang2018deep] trains a single CNN to predict both the left and right disparity maps, forming a virtual stereo pair. The CNN is trained with photo-consistency between stereo images and consistency with depths estimated by Stereo DSO [wang2017stereo]. More recently, CNN-SVO [loo2019cnn] uses depths learned from stereo images to initialize the depths of keypoints and reduce their corresponding uncertainties in SVO [forster2014svo]. In contrast to our self-supervised approach, [tateno2017cnn, yin2017scale] use ground truth depths for training depth networks, while [yang2018deep, loo2019cnn] need stereo images.
Using SLAM to Improve Monocular Depth Prediction. Depth estimates from geometric SLAM have been leveraged for training monocular depth estimation networks in recent works [klodt2018supervising, andraghetti2019enhancing]. In [andraghetti2019enhancing], sparse depth maps from Stereo ORB-SLAM [mur2017orb] are first converted into dense ones via an auto-encoder, and are then integrated into geometric constraints for training the depth network. Klodt and Vedaldi [klodt2018supervising] employ depths and poses from ORB-SLAM [mur2015orb] as supervision signals for training the depth and pose networks respectively. This approach only considers five consecutive frames, which restricts it to the narrow-baseline setting.
3 Method: A Self-Improving Framework
Our self-improving framework leverages the strengths of unsupervised single-image depth estimation and of geometric SLAM to mitigate each other's shortcomings. On one hand, the depth network typically generates reliable depth estimates for nearby points, which help improve the geometric SLAM estimates of poses and sparse 3D points (Sec. 3.1). On the other hand, geometric SLAM methods rely on a more holistic view of the scene to generate robust pose estimates and to identify persistent 3D points that are visible across many frames, thus providing an opportunity to perform wide-baseline and reliable sparse depth estimation. Our framework leverages these sparse but robust estimates to improve the noisier depth estimates of the farther scene points by minimizing a blend of the symmetric transfer and depth consistency losses (Sec. 3.2) and the commonly used appearance-based loss. In the following iteration, this improved depth estimate further enhances the capability of geometric SLAM, and the cycle continues until the improvements become negligible. Even in the absence of ground truth, our self-improving framework continues to produce better pose and depth estimates.
An overview of the proposed self-improving framework is shown in Fig. 2, which iterates between improving poses and improving depths. Our pose refinement and depth refinement steps are detailed in Sec. 3.1 and 3.2 respectively. An overview of the narrow- and wide-baseline losses we use for improving the depth network is shown in Fig. 3, and details are provided in Sec. 3.2.
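The alternation between the two refinement steps can be sketched in a few lines. This is only an illustration of the control flow under stated assumptions: `run_slam`, `finetune`, and `score` are hypothetical callables standing in for pRGBD-SLAM, the depth fine-tuning step, and whatever improvement criterion is monitored (lower is better); `max_loops` and `tol` are likewise illustrative names and values.

```python
from typing import Any, Callable, Tuple


def self_improve(
    depth_model: Any,
    run_slam: Callable[[Any], Any],       # hypothetical: depth model -> SLAM outputs
    finetune: Callable[[Any, Any], Any],  # hypothetical: (model, SLAM outputs) -> new model
    score: Callable[[Any, Any], float],   # hypothetical criterion, lower is better
    max_loops: int = 5,
    tol: float = 1e-3,
) -> Tuple[Any, Any]:
    """Alternate pose refinement and depth refinement until the score stalls."""
    slam_out = run_slam(depth_model)              # pose refinement with current depths
    best = score(depth_model, slam_out)
    for _ in range(max_loops):
        candidate = finetune(depth_model, slam_out)   # depth refinement from SLAM outputs
        cand_out = run_slam(candidate)                # pose refinement with refined depths
        cand_score = score(candidate, cand_out)
        if best - cand_score <= tol:                  # negligible improvement: stop
            break
        depth_model, slam_out, best = candidate, cand_out, cand_score
    return depth_model, slam_out
```

The loop keeps the last accepted model and SLAM outputs, so a fine-tuning round that does not help is simply discarded.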
3.1 Pose Refinement
Pseudo RGB-D for Improving Monocular SLAM. We employ a well-explored and widely used geometry-based SLAM system, i.e., the RGB-D version of ORB-SLAM [mur2017orb], to process the pseudo RGB-D data, yielding camera poses as well as 3D map points and the associated 2D keypoints. Any other geometric SLAM system that provides these outputs can be used in place of ORB-SLAM. Directly feeding pseudo RGB-D data to RGB-D ORB-SLAM is not possible, because the CNN might predict depth at a very different scale compared to depth measurements from real active sensors, e.g., LiDAR. With this difference in mind, we discuss an important adaptation that allows RGB-D ORB-SLAM to work well in our setting. We first note that RGB-D ORB-SLAM transforms the depth data into disparity on a virtual stereo pair to reuse the framework of stereo ORB-SLAM. Specifically, considering a keypoint with 2D coordinates $(u_l, v_l)$ (i.e., $u_l$ and $v_l$ denote the horizontal and vertical coordinates respectively) and a CNN-predicted depth $d_l$, the corresponding 2D keypoint coordinates $(u_r, v_r)$ on the virtual rectified right view are $u_r = u_l - \frac{f_x b}{d_l}$, $v_r = v_l$, where $f_x$ is the horizontal focal length and $b$ is the virtual stereo baseline.
Adaptation. In order to have a reasonable range of disparity, we mimic the setup of the KITTI dataset [geiger2012we] by making the virtual stereo baseline $b$ adaptive, $b = \frac{b^{KITTI}}{d^{KITTI}_{max}} \cdot d_{max}$, where $d_{max}$ represents the maximum CNN-predicted depth of the input sequence, and $b^{KITTI}$ and $d^{KITTI}_{max}$ (both in meters) are respectively the actual stereo baseline and the empirical maximum depth value of the KITTI dataset.
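As a rough illustration of the depth-to-virtual-stereo conversion and the adaptive baseline, the sketch below assumes a proportional scaling of the baseline with the maximum predicted depth. The function names are hypothetical, and the default reference values (0.54 m baseline, 80 m maximum depth) are commonly quoted KITTI figures used here as assumptions, not values taken from the text above.

```python
import numpy as np


def adaptive_baseline(d_max, b_ref=0.54, d_max_ref=80.0):
    """Scale a reference baseline so predicted depths span a KITTI-like disparity range.

    b_ref and d_max_ref (meters) are assumed reference values, see the lead-in.
    """
    return b_ref / d_max_ref * d_max


def virtual_right_u(u_left, depth, fx, baseline):
    """Horizontal coordinate on the virtual rectified right view: u_l - fx * b / d."""
    u_left = np.asarray(u_left, dtype=float)
    depth = np.asarray(depth, dtype=float)
    return u_left - fx * baseline / depth
```

With this scaling, a sequence whose predicted depths peak at one tenth of the reference maximum gets one tenth of the reference baseline, so the resulting disparities stay in the same range.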
We also summarize the overall pipeline of RGB-D ORB-SLAM here. The 3D map is initialized at the very first frame of the sequence thanks to the availability of depth. After that, the following main tasks are performed: i) track the camera by matching 2D keypoints against the local map, ii) enhance the local map via local bundle adjustment, and iii) detect and close loops for pose-graph optimization and full bundle adjustment to improve camera poses and scene depths. As we will show in Sec. 4.4, using pseudo RGB-D data leads to better robustness and accuracy compared to using only RGB data.
3.2 Depth Refinement
We start from the pretrained depth network of Monodepth2 [godard2019digging], a state-of-the-art monocular depth estimation network, and fine-tune its parameters with the camera poses, 3D map points, and associated 2D keypoints produced by the pseudo RGB-D ORB-SLAM (pRGBD-SLAM) described above. In contrast to Monodepth2, which relies only on the narrow-baseline photometric reconstruction loss between adjacent frames for short-term consistency, we propose wide-baseline symmetric depth transfer and sparse depth consistency losses to introduce long-term consistency. Our final loss (Eq. (4)) consists of both narrow- and wide-baseline losses. The narrow-baseline losses, i.e., the photometric and smoothness losses, involve the current keyframe $I_c$ and its temporally adjacent frames $I_{c-1}$ and $I_{c+1}$, while the wide-baseline losses are computed on the current keyframe $I_c$ and the two neighboring keyframes $I_{k1}$ and $I_{k2}$ that are temporally farther away than $I_{c-1}$ and $I_{c+1}$ (see Fig. 3). Next, we introduce the notation and describe the losses in detail.
Notation. Let $\mathcal{X}$ represent the set of common tracked keypoints visible in all three keyframes $I_{k1}$, $I_c$, and $I_{k2}$ obtained from pRGBD-SLAM. Note that $k1$ and $k2$ are the two neighboring keyframes of the current frame $c$ (i.e., $k1 < c < k2$) in which the keypoints are visible. Let $p^i_{k1}$, $p^i_c$, and $p^i_{k2}$ be the 2D coordinates of the $i$-th common tracked keypoint in the keyframes $k1$, $c$, and $k2$ respectively, and let the associated depth values obtained from pRGBD-SLAM be $d^i_{k1}(\mathrm{SLAM})$, $d^i_c(\mathrm{SLAM})$, and $d^i_{k2}(\mathrm{SLAM})$ respectively. The depth values corresponding to the keypoints $p^i_{k1}$, $p^i_c$, and $p^i_{k2}$ can also be obtained from the depth network and are represented by $d^i_{k1}(w)$, $d^i_c(w)$, and $d^i_{k2}(w)$ respectively, where $w$ stands for the depth network parameters.
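Symmetric Depth Transfer Loss. Given the camera intrinsic matrix $K$ and the network-predicted depth value $d^i_c(w)$ of the keypoint $p^i_c$, the 2D coordinates of the keypoint can be back-projected to its corresponding 3D coordinates as $X^i_c(w) = K^{-1} [p^i_c, 1]^\top d^i_c(w)$. Let $T^{\mathrm{SLAM}}_{c \to k1}$ represent the relative camera pose of frame $k1$ w.r.t. frame $c$ obtained from pRGBD-SLAM. Using $T^{\mathrm{SLAM}}_{c \to k1}$, we can transfer the 3D point $X^i_c(w)$ from frame $c$ to frame $k1$ as $X^i_{c \to k1}(w) = [x^i_{c \to k1}(w), y^i_{c \to k1}(w), d^i_{c \to k1}(w)]^\top = T^{\mathrm{SLAM}}_{c \to k1} X^i_c(w)$. Here, $d^i_{c \to k1}(w)$ is the transferred depth of the keypoint from frame $c$ to frame $k1$. Following the same procedure, we can obtain the transferred depth $d^i_{k1 \to c}(w)$ of the same keypoint from frame $k1$ to frame $c$. The symmetric depth transfer loss of the keypoint between the frame pair $c$ and $k1$ is the sum of absolute errors ($\ell_1$ distance) between the transferred network-predicted depth $d^i_{c \to k1}(w)$ and the existing network-predicted depth $d^i_{k1}(w)$ in the target keyframe $k1$, and vice versa. Mathematically, it can be written as:
$$\mathcal{T}^i_{c \leftrightarrow k1}(w) = \left| d^i_{c \to k1}(w) - d^i_{k1}(w) \right| + \left| d^i_{k1 \to c}(w) - d^i_c(w) \right| \tag{1}$$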
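Similarly, we can compute the symmetric depth transfer loss of the same keypoint between the frame pair $c$ and $k2$, i.e., $\mathcal{T}^i_{c \leftrightarrow k2}(w)$, and between $k1$ and $k2$, i.e., $\mathcal{T}^i_{k1 \leftrightarrow k2}(w)$. We accumulate the total symmetric transfer loss between frames $c$ and $k1$ in $\mathcal{T}_{c \leftrightarrow k1}$, which is the loss over all the common tracked keypoints and the points within a local patch centered at each common tracked keypoint. Similarly, we compute the total symmetric depth transfer losses $\mathcal{T}_{c \leftrightarrow k2}$ and $\mathcal{T}_{k1 \leftrightarrow k2}$ between the frame pairs $(c, k2)$ and $(k1, k2)$ respectively.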
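For a single keypoint and a single frame pair, the back-project, transfer, and compare steps described above can be sketched as follows. This is a minimal NumPy illustration, not the training code: the function names are hypothetical, the relative pose is assumed to be a 4x4 homogeneous matrix mapping points from the first frame to the second, and the patch accumulation is omitted.

```python
import numpy as np


def backproject(p, depth, K):
    """Lift pixel p = (u, v) with the given depth to 3D camera coordinates."""
    return depth * (np.linalg.inv(K) @ np.array([p[0], p[1], 1.0]))


def transfer_depth(p, depth, K, T_src_to_dst):
    """Depth (z-coordinate) of the back-projected point after moving it to the other frame."""
    X = backproject(p, depth, K)
    X_dst = T_src_to_dst[:3, :3] @ X + T_src_to_dst[:3, 3]
    return X_dst[2]


def symmetric_transfer_loss(p_c, d_c, p_k, d_k, K, T_c_to_k):
    """L1 error of the transferred depth against the predicted depth, in both directions."""
    T_k_to_c = np.linalg.inv(T_c_to_k)            # reverse relative pose
    d_c_to_k = transfer_depth(p_c, d_c, K, T_c_to_k)
    d_k_to_c = transfer_depth(p_k, d_k, K, T_k_to_c)
    return float(abs(d_c_to_k - d_k) + abs(d_k_to_c - d_c))
```

If the two predicted depths are geometrically consistent with the relative pose, both terms vanish; any disagreement is penalized from both sides.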
Depth Consistency Loss. The role of the depth consistency loss is to make the depth network's predictions consistent with the refined depth values obtained from pRGBD-SLAM. Note that depth values from pRGBD-SLAM undergo multiple optimizations over wide baselines, and hence are more accurate and capture long-term consistency. We inject these long-term consistent depths from pRGBD-SLAM into the depth network through the depth consistency loss. The loss for the frame $c$ can be written as follows:
$$\mathcal{D}_c = \frac{\sum_{i \in \mathcal{X}} \left| d^i_c(w) - d^i_c(\mathrm{SLAM}) \right|}{|\mathcal{X}|} \tag{2}$$
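A depth consistency term of this kind can be sketched as a mean absolute difference over the tracked keypoints of one frame. The averaging over keypoints and the function name are assumptions made for illustration.

```python
import numpy as np


def depth_consistency_loss(d_net, d_slam):
    """Mean absolute difference between network depths and SLAM depths at keypoints."""
    d_net = np.asarray(d_net, dtype=float)
    d_slam = np.asarray(d_slam, dtype=float)
    if d_net.size == 0:          # no common tracked keypoints in this frame
        return 0.0
    return float(np.mean(np.abs(d_net - d_slam)))
```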
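Photometric Reconstruction Loss. Denote the relative camera poses of frames $I_{c-1}$ and $I_{c+1}$ w.r.t. the current keyframe $I_c$ obtained from pRGBD-SLAM by $T^{\mathrm{SLAM}}_{c-1 \to c}$ and $T^{\mathrm{SLAM}}_{c+1 \to c}$ respectively. Using frame $I_{c+1}$, $T^{\mathrm{SLAM}}_{c+1 \to c}$, the network-predicted depth map $d_c(w)$ of the keyframe $I_c$, and the camera intrinsic matrix $K$, we can synthesize the current frame $I_c$ [godard2019digging, godard2017unsupervised]. Let the synthesized frame be represented in the functional form $I_{c+1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c+1 \to c}, K)$. Similarly, we can synthesize $I_{c-1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c-1 \to c}, K)$ using frame $I_{c-1}$. The photometric reconstruction error between the synthesized and the original current frame [garg2016unsupervised, godard2017unsupervised, zhou2017unsupervised] is then computed as:
$$\mathcal{P}_c = pe\left(I_{c+1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c+1 \to c}, K), I_c\right) + pe\left(I_{c-1 \to c}(d_c(w), T^{\mathrm{SLAM}}_{c-1 \to c}, K), I_c\right) \tag{3}$$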
where we follow [godard2017unsupervised, godard2019digging] to construct the photometric reconstruction error function $pe(\cdot, \cdot)$. Additionally, we adopt the more robust per-pixel minimum error, multi-scale strategy, auto-masking, and depth smoothness loss from [godard2019digging].
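To make the per-pixel minimum idea concrete, the sketch below takes already-warped source frames and keeps, at every pixel, the smallest error across sources. It is a deliberate simplification: a plain L1 error stands in for the combined SSIM and L1 error used in Monodepth2, and warping, multi-scale handling, and auto-masking are omitted. The function name is hypothetical.

```python
import numpy as np


def min_reprojection_l1(target, warped_sources):
    """Per-pixel minimum over sources of the channel-averaged L1 error, then the image mean.

    target: H x W x C array; warped_sources: list of H x W x C arrays already
    warped into the target view (the warping itself is not shown here).
    """
    target = np.asarray(target, dtype=float)
    errors = [np.abs(np.asarray(w, dtype=float) - target).mean(axis=-1)
              for w in warped_sources]
    return float(np.minimum.reduce(errors).mean())
```

Taking the minimum rather than the sum lets a pixel that is occluded in one source frame be explained by the other.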
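Our final loss for fine-tuning the depth network at the depth refinement step is the weighted sum of the narrow-baseline losses (i.e., the photometric loss $\mathcal{P}_c$ and the smoothness loss $\mathcal{S}_c$) and the wide-baseline losses (i.e., the symmetric depth transfer losses $\mathcal{T}_{c \leftrightarrow k1}$, $\mathcal{T}_{c \leftrightarrow k2}$, $\mathcal{T}_{k1 \leftrightarrow k2}$ and the depth consistency loss $\mathcal{D}_c$), with weights $\alpha$, $\beta$, $\gamma$, and $\mu$:
$$\mathcal{L} = \alpha \mathcal{P}_c + \beta \mathcal{S}_c + \gamma \mathcal{D}_c + \mu \left( \mathcal{T}_{c \leftrightarrow k1} + \mathcal{T}_{c \leftrightarrow k2} + \mathcal{T}_{k1 \leftrightarrow k2} \right) \tag{4}$$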
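The weighted combination of the four loss terms can be sketched as below. Parameter names are hypothetical, and the default smoothness weight of 1e-3 is an assumption based on the usual Monodepth2 default rather than a value stated here.

```python
def total_loss(photometric, smoothness, depth_consistency, transfer_terms,
               w_photo=1.0, w_smooth=1e-3, w_depth=1.0, w_transfer=1.0):
    """Weighted sum of narrow-baseline (photometric, smoothness) and
    wide-baseline (depth consistency, symmetric transfer) losses.

    transfer_terms holds one symmetric transfer loss per keyframe pair.
    """
    return (w_photo * photometric
            + w_smooth * smoothness
            + w_depth * depth_consistency
            + w_transfer * sum(transfer_terms))
```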
4 Experiments
We conduct experiments to evaluate the depth refinement and pose refinement steps of our self-improving framework against the state of the art in self-supervised depth estimation and RGB-SLAM-based pose estimation respectively.
4.1 Datasets and Evaluation Metrics
KITTI Dataset. Our experiments are mostly performed on the KITTI dataset [geiger2012we], which contains outdoor driving sequences. We further split the KITTI experiments into two parts: one focused on depth refinement evaluation and the other on pose refinement. For depth refinement evaluation, we train/fine-tune the depth network using the Eigen train split [eigen2014depth], which contains 28 training sequences, and evaluate depth prediction on the Eigen test split [eigen2014depth], following the baselines [zhou2017unsupervised, yang2017unsupervised, mahjourian2018unsupervised, yin2018geonet, wang2018learning, zou2018dfnet, yang2018lego, ranjan2019competitive, luo2018every, casser2019depth]. For pose refinement evaluation, we train/fine-tune the depth network using KITTI Odometry sequences 00-08 and test on sequences 09-10 and 11-21. For evaluation on sequences 09-10, we use the ground truth trajectories provided by [geiger2012we]; for sequences 11-21, where ground truth is not available, we use pseudo ground truth trajectories obtained by running the stereo version of ORB-SLAM on these sequences.
TUM RGB-D Dataset. For completeness, and to demonstrate the capability of our self-improving framework on indoor scenes, we evaluate on the TUM RGB-D dataset [sturm2012benchmark], which consists of indoor sequences captured by a handheld camera. We use 6 of the 8 freiburg3 sequences to train/fine-tune the depth network and the remaining 2 for evaluation. We choose the freiburg3 sequences because only they provide both the undistorted RGB images needed for training/fine-tuning and the ground truth needed for evaluation.
Metrics for Pose Evaluation. For quantitative pose evaluation, we compute the Root Mean Square Error (RMSE), Relative Translation (Rel Tr) error, and Relative Rotation (Rel Rot) error of the predicted camera trajectory. Since monocular SLAM systems can only recover camera poses up to a global scale, we align the camera trajectory estimated by each method with the ground truth one using the EVO toolbox [grupp2017evo]. We then use the official evaluation code from the KITTI Odometry benchmark to compute the Rel Tr and Rel Rot errors for all sub-trajectories with lengths in $\{100, \ldots, 800\}$ meters.
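The scale-aware alignment followed by a trajectory RMSE can be illustrated with a closed-form similarity alignment (Umeyama's method) on camera positions. This sketch is not the EVO toolbox or the KITTI evaluation code; it is a self-contained NumPy approximation of the same idea, with hypothetical function names, operating on N x 3 position arrays.

```python
import numpy as np


def align_sim3(est, gt):
    """Least-squares scale s, rotation R, translation t with gt ~ s * R @ est + t."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                       # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (e ** 2).sum(axis=1).mean()
    t = mu_g - s * R @ mu_e
    return s, R, t


def ate_rmse(est, gt):
    """RMSE of position errors after similarity alignment of est onto gt."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    s, R, t = align_sim3(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```

Because the scale is estimated jointly with rotation and translation, a trajectory that differs from the ground truth only by a global scale yields an RMSE of zero.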
Metrics for Depth Evaluation. For quantitative depth evaluation, we use the standard metrics, including the Absolute Relative (Abs Rel) error, Squared Relative (Sq Rel) error, RMSE, RMSE log, $\delta < 1.25$ (namely a1), $\delta < 1.25^2$ (namely a2), and $\delta < 1.25^3$ (namely a3), as defined in [eigen2014depth]. Again, since depths from monocular images can only be estimated up to scale, we align the predicted depth map with the ground truth one using their median depth values. Following [eigen2014depth] and other baselines, we also clip the depths to 80 meters.
Note. In all the tables, the best performance is shown in bold and the second best is underlined.
4.2 Implementation Details
We implement our framework based on Monodepth2 [godard2019digging] and ORB-SLAM [mur2017orb], i.e., we use the depth network of Monodepth2 and the RGB-D version of ORB-SLAM for depth refinement and pose refinement respectively. We would like to emphasize that our self-improving strategy is not specific to Monodepth2 or ORB-SLAM. Any other depth network that can incorporate SLAM outputs and any SLAM system that can provide the desired outputs can be plugged into the self-improving framework. We set the weight of the smoothness loss term of the final loss (Eq. (4)) as in [godard2019digging] and set the weights of the remaining three terms (photometric, depth consistency, and symmetric depth transfer) to 1. The ablation study on disabling different loss terms can be found in Tab. 6.
KITTI Eigen Split/Odometry Experiments. We pretrain Monodepth2 using monocular videos of the KITTI Eigen split training set with the hyperparameters suggested in Monodepth2 [godard2019digging]. We pretrain/fine-tune at a fixed input/output resolution and scale the predictions up to the original resolution while running pRGBD-SLAM. We use the same hyperparameters as for the KITTI Eigen split to pretrain/fine-tune the depth model on the KITTI Odometry training sequences mentioned in Sec. 4.1. During a self-improving loop, we discard the pose network of Monodepth2 and instead use camera poses from pRGBD-SLAM.
Outlier Removal. Before running a depth refinement step, we run an outlier removal step on the SLAM outputs. Specifically, we filter out 3D map points and the associated 2D keypoints that satisfy at least one of the following conditions: i) the point is observed in fewer than 3 keyframes, ii) its reprojection error in the current keyframe is larger than 3 pixels.
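The two filtering conditions translate directly into a small predicate. The `MapPoint` container below is a hypothetical stand-in for whatever structure holds the SLAM outputs; it is not an ORB-SLAM type.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MapPoint:
    """Hypothetical summary of one 3D map point as seen from the current keyframe."""
    num_keyframes: int        # number of keyframes observing the point
    reproj_error_px: float    # reprojection error in the current keyframe, in pixels


def remove_outliers(points: List[MapPoint], min_obs: int = 3,
                    max_err_px: float = 3.0) -> List[MapPoint]:
    """Drop points seen in fewer than min_obs keyframes or with error above max_err_px."""
    return [p for p in points
            if p.num_keyframes >= min_obs and p.reproj_error_px <= max_err_px]
```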
Camera Intrinsics. Monodepth2 computes the average camera intrinsics for the KITTI dataset and uses them for training. However, for our fine-tuning of the depth network, using the average camera intrinsics leads to inferior performance, because we use the camera poses from pRGBD-SLAM, which runs with different camera intrinsics. Therefore, we use different camera intrinsics for different sequences when fine-tuning the depth network.
To fine-tune the depth network pretrained on the KITTI Eigen split training sequences, we run pRGBD-SLAM on all the training sequences and extract camera poses, 2D keypoints, and the associated depths from keyframes. For pRGBD-SLAM (RGB-D ORB-SLAM), we use the default settings of ORB-SLAM, except for the adjusted baseline described in Sec. 3.1. The same procedure is followed for the depth model pretrained on the KITTI Odometry training sequences. The average number of keyframes used in a self-improving loop is and for KITTI Eigen split and KITTI Odometry experiments respectively. At each depth refinement step, we fine-tune the depth network parameters for 1 epoch only, using a learning rate of 1e-6 and keeping all other hyperparameters the same as in pretraining. For both KITTI Eigen split and KITTI Odometry experiments, we report results after 5 self-improving loops.
TUM RGB-D Experiments. For TUM RGB-D, we pretrain/fine-tune the depth network on 6 freiburg3 sequences and test on 2 freiburg3 sequences. The average number of keyframes in a self-improving loop is . We use an input/output resolution of for pretraining/fine-tuning and scale it up to the original resolution while running pRGBD-SLAM. We report results after 3 self-improving loops. Other details can be found in the supplementary material.
4.3 Monocular Depth/Depth Refinement Evaluation
In the following, we evaluate the performance of our depth estimation on the KITTI Raw Eigen split test set and the TUM RGB-D freiburg3 sequences.
Results on KITTI Eigen Split Test Set. We show the depth evaluation results on the Eigen split test set in Tab. 1. From the table, it is evident that our refined depth model (pRGBD-Refined) outperforms all the competing monocular (M) unsupervised methods by non-trivial margins, including the pretrained depth model, i.e., Monodepth2-M, and even surpasses the unsupervised methods with stereo (S) training, i.e., Monodepth2-S, and combined monocular-stereo (MS) training, i.e., Monodepth2-MS, in most metrics. Our method also outperforms several ground-truth/auxiliary depth supervised methods [eigen2014depth, liu2015learning, klodt2018supervising, nath2018adadepth].
Abs Rel, Sq Rel, RMSE, RMSE log: lower is better. a1, a2, a3: higher is better. A dash marks a value that is not reported.

Self-supervised

| Method | Train | Abs Rel | Sq Rel | RMSE | RMSE log | a1 | a2 | a3 |
|---|---|---|---|---|---|---|---|---|
| Zhou [zhou2017unsupervised] | M | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959 |
| Yang [yang2017unsupervised] | M | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| Mahjourian [mahjourian2018unsupervised] | M | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| Klodt [klodt2018supervising] | M | 0.166 | 1.490 | 5.998 | - | 0.778 | 0.919 | 0.966 |
| GeoNet [yin2018geonet] | M | 0.149 | 1.060 | 5.567 | 0.226 | 0.796 | 0.935 | 0.975 |
| DDVO [wang2018learning] | M | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| DF-Net [zou2018dfnet] | M | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
| LEGO [yang2018lego] | M | 0.162 | 1.352 | 6.276 | 0.252 | - | - | - |
| Ranjan [ranjan2019competitive] | M | 0.148 | 1.149 | 5.464 | 0.226 | 0.815 | 0.935 | 0.973 |
| EPC++ [luo2018every] | M | 0.141 | 1.029 | 5.350 | 0.216 | 0.816 | 0.941 | 0.976 |
| Struct2depth (M) [casser2019depth] | M | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
| Monodepth2-M [godard2019digging] | M | 0.117 | 0.941 | 4.889 | 0.194 | 0.873 | 0.957 | 0.980 |
| pRGBD-Refined | M | 0.113 | 0.793 | 4.655 | 0.188 | 0.874 | 0.960 | 0.983 |
| Garg [garg2016unsupervised] | S | 0.152 | 1.226 | 5.849 | 0.246 | 0.784 | 0.921 | 0.967 |
| StrAT [mehta2018structured] | S | 0.128 | 1.019 | 5.403 | 0.227 | 0.827 | 0.935 | 0.971 |
| 3Net (R50) [poggi2018learning] | S | 0.129 | 0.996 | 5.281 | 0.223 | 0.831 | 0.939 | 0.974 |
| 3Net (VGG) [poggi2018learning] | S | 0.119 | 1.201 | 5.888 | 0.208 | 0.844 | 0.941 | 0.978 |
| SuperDepth [pillai2019superdepth] | S | 0.112 | 0.875 | 4.958 | 0.207 | 0.852 | 0.947 | 0.977 |
| Monodepth2-S [godard2019digging] | S | 0.109 | 0.873 | 4.960 | 0.209 | 0.864 | 0.948 | 0.975 |
| UnDeepVO | MS | 0.183 | 1.730 | 6.570 | 0.268 | - | - | - |
| EPC++ | MS | 0.128 | 0.935 | 5.011 | 0.209 | 0.831 | 0.945 | 0.979 |
| Monodepth2-MS [godard2019digging] | MS | 0.106 | 0.818 | 4.750 | 0.196 | 0.874 | 0.957 | 0.979 |

Supervised

| Method | Train | Abs Rel | Sq Rel | RMSE | RMSE log | a1 | a2 | a3 |
|---|---|---|---|---|---|---|---|---|
| Eigen [eigen2014depth] | D | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.890 |
| Liu [liu2015learning] | D | 0.201 | 1.584 | 6.471 | 0.273 | 0.680 | 0.898 | 0.967 |
| AdaDepth [nath2018adadepth] | D* | 0.167 | 1.257 | 5.578 | 0.237 | 0.771 | 0.922 | 0.971 |
| Kuznietsov [kuznietsov2017semi] | DS | 0.113 | 0.741 | 4.621 | 0.189 | 0.862 | 0.960 | 0.986 |
| DVSO [yang2018deep] | D*S | 0.097 | 0.734 | 4.442 | 0.187 | 0.888 | 0.958 | 0.980 |
| SVSM FT [luo2018every] | DS | 0.094 | 0.626 | 4.252 | 0.177 | 0.891 | 0.965 | 0.984 |
| Guo [guo2018learning] | DS | 0.096 | 0.641 | 4.095 | 0.168 | 0.892 | 0.967 | 0.986 |
| DORN [fu2018deep] | D | 0.072 | 0.307 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994 |
Fig. 4 panels: RGB, Monodepth2-S, Monodepth2-M, pRGBD-Refined.
The reason our method can surpass stereo-trained models is probably that the aggregated cues from multiple views with wide-baseline losses (e.g., our symmetric depth transfer and depth consistency losses) lead to better-posed depth recovery, and hence even higher accuracy than learning with a pre-calibrated stereo rig with a smaller baseline. Further analysis is provided in Sec. 5. Fig. 4 shows some qualitative results, where pRGBD-Refined shows visible improvements at occlusion boundaries and thin objects.
Results on TUM RGB-D Sequences. The depth evaluation results on the two TUM freiburg3 RGB-D sequences are shown in Tab. 5. Our refined depth model (pRGBD-Refined) outperforms pRGBD-Initial/Monodepth2-M on both sequences and in all metrics. Due to space limitations, we have moved the qualitative results to the supplementary material.
4.4 Monocular SLAM/Pose Refinement Evaluation
In this section, we evaluate pose estimation/refinement on the KITTI Odometry sequences 09 and 10, the KITTI Odometry test set sequences 11-21, and two TUM freiburg3 RGB-D sequences.
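A dash marks a value that is not reported.

Supervised

| Method | Seq. 09 RMSE | Seq. 09 Rel Tr | Seq. 09 Rel Rot | Seq. 10 RMSE | Seq. 10 Rel Tr | Seq. 10 Rel Rot |
|---|---|---|---|---|---|---|
| DeepVO [wang2017deepvo] | - | - | - | - | 8.11 | 0.088 |
| ESP-VO [wang2018end] | - | - | - | - | 9.77 | 0.102 |
| GFS-VO [xue2018guided] | - | - | - | - | 6.32 | 0.023 |
| GFS-VO-RNN [xue2018guided] | - | - | - | - | 7.44 | 0.032 |
| BeyondTracking [xue2019beyond] | - | - | - | - | 3.94 | 0.017 |
| DeepV2D [teed2018deepv2d] | 79.06 | 8.71 | 0.037 | 48.49 | 12.81 | 0.083 |

Self-supervised

| Method | Seq. 09 RMSE | Seq. 09 Rel Tr | Seq. 09 Rel Rot | Seq. 10 RMSE | Seq. 10 Rel Tr | Seq. 10 Rel Rot |
|---|---|---|---|---|---|---|
| SfMLearner [zhou2017unsupervised] | 24.31 | 8.28 | 0.031 | 20.87 | 12.20 | 0.030 |
| GeoNet [yin2018geonet] | 158.45 | 28.72 | 0.098 | 43.04 | 23.90 | 0.090 |
| DepthVO [zhan2018unsupervised] | - | 11.93 | 0.039 | - | 12.45 | 0.035 |
| vid2depth [mahjourian2018unsupervised] | - | - | - | - | 21.54 | 0.125 |
| UnDeepVO [li2018undeepvo] | - | 7.01 | 0.036 | - | 10.63 | 0.046 |
| Wang et al. [wang2019recurrent] | - | 9.88 | 0.034 | - | 12.24 | 0.052 |
| CC [ranjan2019competitive] | 29.00 | 6.92 | 0.018 | 13.77 | 7.97 | 0.031 |
| DeepMatchVO [shen2019icra] | 27.08 | 9.91 | 0.038 | 24.44 | 12.18 | 0.059 |
| Li et al. [li2019pose] | - | 8.10 | 0.028 | - | 12.90 | 0.032 |
| Monodepth2-M [godard2019digging] | 55.47 | 11.47 | 0.032 | 20.46 | 7.73 | 0.034 |
| SC-SfMLearner [bian2019unsupervised] | - | 11.2 | 0.034 | - | 10.1 | 0.050 |
| RGB ORB-SLAM | 18.34 | 7.42 | 0.004 | 8.90 | 5.85 | 0.004 |
| pRGBD-Initial | 12.21 | 4.26 | 0.011 | 8.30 | 5.55 | 0.017 |
| pRGBD-Refined | 11.97 | 4.20 | 0.010 | 6.35 | 4.40 | 0.016 |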
Results on KITTI Odometry Sequences 09 and 10. We show the quantitative results on sequences 09 and 10 in Tab. 2. It can be seen that our pRGBD-Initial outperforms RGB ORB-SLAM [mur2015orb] in terms of both RMSE and Rel Tr. Our pRGBD-Refined further improves on pRGBD-Initial in all metrics, which verifies the effectiveness of our self-improving mechanism in terms of pose estimation. The higher Rel Rot errors of our methods compared to RGB ORB-SLAM could be due to the high uncertainty of CNN-predicted depths for far-away points, which affects our rotation estimation [hartley2003multiple]. In addition, our methods outperform all the competing supervised and self-supervised methods by a large margin, except for the supervised method of [xue2019beyond], which has a lower Rel Tr than ours on sequence 10. Note that we evaluate the camera poses produced by the pose network of Monodepth2-M [godard2019digging] in Tab. 2, which yields much higher errors than ours.
Fig. 5(a) shows the camera trajectories estimated for sequence 09 by RGB ORBSLAM, our pRGBDInitial, and pRGBDRefined. It is evident that, although all the methods perform loop closure successfully, our methods generate camera trajectories that align better with the ground truth.
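To make the trajectory metric concrete, the following minimal sketch computes an absolute trajectory RMSE after a similarity alignment. It assumes that the estimated and ground-truth camera positions are given as matched N x 3 arrays, and that a 7-DoF (scale, rotation, translation) Umeyama alignment is applied first, which is a common choice for monocular SLAM because of its scale ambiguity; the exact protocol of Sec. 4.1 may differ. The function names `umeyama_alignment` and `ate_rmse` are illustrative and do not come from the paper.

```python
import numpy as np


def umeyama_alignment(est, gt):
    """Least-squares similarity transform (s, R, t) mapping est onto gt.

    est, gt: (N, 3) arrays of corresponding camera positions.
    """
    n = est.shape[0]
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / n                      # 3x3 cross-covariance (gt vs. est)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                     # avoid returning a reflection
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / n             # variance of the estimated positions
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t


def ate_rmse(est, gt):
    """RMSE of position errors after similarity alignment (assumed protocol)."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = s * est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))


if __name__ == "__main__":
    # Synthetic check: a noisy, scaled and shifted copy of a random trajectory.
    rng = np.random.default_rng(0)
    gt = np.cumsum(rng.normal(size=(200, 3)), axis=0)
    est = 0.5 * gt + np.array([1.0, -2.0, 3.0]) + 0.01 * rng.normal(size=gt.shape)
    print("ATE RMSE:", ate_rmse(est, gt))
```

Because the alignment absorbs a global scale, this kind of metric measures the shape of the trajectory rather than its absolute scale, which is why it can be applied to monocular methods at all.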
Results on KITTI Odometry Test Set. The KITTI Odometry leaderboard requires complete camera trajectories for all frames of all sequences. Because we keep the default ORBSLAM settings, tracking fails on a few sequences. To still enable a quantitative evaluation on this test set (i.e., sequences 11-21), we evaluate all the competing methods in Tab. 3 against the pseudo-ground-truth computed as described in Sec. 4.1.
RGB ORBSLAM fails on three challenging sequences due to tracking failures, whereas our pRGBDInitial fails on two sequences and our pRGBDRefined fails on only one. Among the sequences where all the competing methods succeed, our pRGBDInitial reduces the RMSE of RGB ORBSLAM on all sequences except 13 and 19, in several cases by a considerable margin. After our self-improving mechanism, our pRGBDRefined further boosts the performance, reaching the best results in terms of both RMSE and Rel Tr. Fig. 5(b) shows qualitative comparisons on sequence 19.
Seq  RGB ORBSLAM  pRGBDInitial  pRGBDRefined  
RMSE  Rel Tr  Rel Rot  RMSE  Rel Tr  Rel Rot  RMSE  Rel Tr  Rel Rot  
11  14.83  7.69  0.003  6.68  3.28  0.016  3.64  2.96  0.015 
13  6.58  2.39  0.006  6.83  2.52  0.008  6.43  2.31  0.007 
14  4.81  5.19  0.004  4.30  4.14  0.014  2.15  3.06  0.014 
15  3.67  1.78  0.004  2.58  1.61  0.005  2.07  1.33  0.004 
16  6.21  2.66  0.002  5.78  2.14  0.006  4.65  1.90  0.004 
18  6.63  2.38  0.002  5.50  2.30  0.008  4.37  2.21  0.006 
19  18.68  4.91  0.002  23.96  2.82  0.007  13.85  2.52  0.006 
20  9.19  6.74  0.016  8.94  5.43  0.027  7.03  4.50  0.022 
12  X  X  X  X  X  X  94.2  32.94  0.026 
17  X  X  X  14.71  8.98  0.011  12.23  7.23  0.011 
21  X  X  X  X  X  X  X  X  X 
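As a worked example of reading Tab. 3, the sketch below computes the relative RMSE reduction of each pRGBD variant with respect to RGB ORBSLAM on the sequences where all three methods succeed. The numbers are copied from Tab. 3; the helper name `relative_reduction` is introduced here only for illustration.

```python
# RMSE values from Tab. 3: seq -> (RGB ORBSLAM, pRGBDInitial, pRGBDRefined).
# Only sequences on which all three methods succeed are listed.
rmse = {
    11: (14.83, 6.68, 3.64),
    13: (6.58, 6.83, 6.43),
    14: (4.81, 4.30, 2.15),
    15: (3.67, 2.58, 2.07),
    16: (6.21, 5.78, 4.65),
    18: (6.63, 5.50, 4.37),
    19: (18.68, 23.96, 13.85),
    20: (9.19, 8.94, 7.03),
}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline` (negative = worse)."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    for seq, (orb, initial, refined) in sorted(rmse.items()):
        print(
            f"seq {seq}: initial {relative_reduction(orb, initial):+.1f}%, "
            f"refined {relative_reduction(orb, refined):+.1f}%"
        )
```

For sequence 11, for instance, the RMSE drops from 14.83 to 3.64 after refinement, a reduction of roughly 75%, while a negative value flags a sequence on which a variant is worse than the baseline.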
Fig. 5 panels: (a) seq 09  (b) seq 19
Results on TUM RGBD Sequences. Tab. 4 shows the performance of the pose refinement step on the two TUM RGBD test sequences. The results show that pRGBDRefined is both more robust and more accurate. In particular, RGB ORBSLAM fails on walking_xyz, while pRGBDRefined succeeds and achieves the best performance on both sequences. Due to space limitations, the qualitative results are provided in the supplementary material.
Seq  RGB ORBSLAM (RMSE)  pRGBDInitial (RMSE)  pRGBDRefined (RMSE)
walking_xyz  X  0.23  0.09
large_cabinet_validation  1.72  1.40  0.39
TUM RGBD Sequences  
Lower is better  Higher is better  
Method  Abs Rel  Sq Rel  RMSE  RMSE log  a1  a2  a3 
pRGBDInitial/MonoDepth2M  0.397  0.848  1.090  0.719  0.483  0.722  0.862 
pRGBDRefined  0.307  0.341  0.743  0.655  0.522  0.766  0.873 
5 Analysis of Self-Improving Loops
Figure 6 (panels a-d): Depth/pose evaluation metrics w.r.t. self-improving loops. Depth evaluation metrics in (a-c) are computed at different max depth caps ranging from 30 to 80 meters.
In this section, we analyze how three depth evaluation metrics behave across the self-improving loops: the Squared Relative (Sq Rel) error, the RMSE, and the accuracy metric a2, as defined in Sec. 4. Pose estimation is evaluated using the absolute trajectory error. In Fig. 6, we use the KITTI Eigen split and report these metrics for each iteration of the self-improving loop. The metrics at the initial loop (i.e., before any refinement) are those of the pretrained MonoDepth2M. We summarize the findings from the plots in Fig. 6 below:

Comparing the metrics for farther scene points (e.g., max depth 80) with those for nearby points (e.g., max depth 30) at the initial self-improving loop shows that the pretrained MonoDepth2 performs worse on farther scene points than on nearby points.

In the subsequent self-improving loops, the Sq Rel and RMSE errors decrease faster for farther points than for nearby points, e.g., the error curves in Fig. 6(b-c) for max depth 80 are steeper than those for max depth 30. This supports our hypothesis that including wider-baseline losses helps the depth network predict more accurate depth values for farther points. Overall, our joint narrow- and wide-baseline learning setup improves the depth prediction of both nearby and farther points, and outperforms MonoDepth2 [godard2019digging].
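The capped evaluation used in this analysis can be sketched as follows. This is a minimal illustration that assumes the standard definitions of the seven depth metrics (Abs Rel, Sq Rel, RMSE, RMSE log, a1, a2, a3) and assumes that a max depth cap restricts the evaluation to pixels whose ground-truth depth lies below the cap; the definitions in Sec. 4 take precedence if they differ. The function name `depth_metrics` and the `min_depth` default are illustrative choices.

```python
import numpy as np


def depth_metrics(gt, pred, max_depth=80.0, min_depth=1e-3):
    """Standard monocular depth metrics, restricted to min_depth < gt < max_depth."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)

    mask = (gt > min_depth) & (gt < max_depth)       # valid pixels under the cap
    gt = gt[mask]
    pred = np.clip(pred[mask], min_depth, max_depth)  # clamp predictions to range

    err = gt - pred
    ratio = np.maximum(gt / pred, pred / gt)          # threshold-accuracy ratio

    return {
        "abs_rel": float(np.mean(np.abs(err) / gt)),
        "sq_rel": float(np.mean(err ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),
        "a2": float(np.mean(ratio < 1.25 ** 2)),
        "a3": float(np.mean(ratio < 1.25 ** 3)),
    }


if __name__ == "__main__":
    # Toy data: the prediction error grows with distance.
    gt = np.array([5.0, 10.0, 20.0, 40.0, 60.0, 75.0])
    pred = np.array([5.1, 10.5, 22.0, 33.0, 45.0, 50.0])
    for cap in (30.0, 50.0, 80.0):
        m = depth_metrics(gt, pred, max_depth=cap)
        print(f"cap {cap:.0f} m: sq_rel={m['sq_rel']:.3f} rmse={m['rmse']:.3f} a2={m['a2']:.3f}")
```

Evaluating the same prediction at several caps, as in the loop above, separates errors on nearby points from errors on far-away points, which is the comparison made in this section.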
6 Conclusion
In this work, we propose a self-improving framework that couples geometric and learning-based methods for 3D perception. The result is a win-win situation: both monocular SLAM and depth prediction are improved by a significant margin without any additional active depth sensor or ground-truth label. Currently, our self-improving framework only works in an offline mode, so developing an online, real-time self-improving system remains future work. Another avenue for future work is to move towards more challenging settings, e.g., uncalibrated cameras [zhuang2019degeneracy] or rolling shutter cameras [zhuang2019learning].
Acknowledgement
This work was part of L. Tiwari’s internship at NEC Labs America, in San Jose. L. Tiwari was supported by the Visvesvaraya Ph.D. Fellowship. L. Tiwari and S. Anand were also supported by the Infosys Center for Artificial Intelligence, IIIT-Delhi.
Appendix 0.A Supplementary Material
This supplementary material is organized as follows. We first present depth refinement results on KITTI Odometry sequences in Sec. 0.A.1. Next, we compare our pose refinement with state-of-the-art RGB SLAM approaches in Sec. 0.A.2. We further evaluate pose refinement on the KITTI leaderboard in Sec. 0.A.3. Additional implementation details and qualitative results of the TUM RGBD experiments are included in Sec. 0.A.4. Additional analysis of the self-improving loop with all 7 depth evaluation metrics is presented in Sec. 0.A.5. Additional qualitative depth evaluation results of the KITTI Eigen experiments and pose evaluation results of the KITTI Odometry experiments are presented in Sec. 0.A.6 and Sec. 0.A.7, respectively. Finally, we provide demo videos on KITTI Odometry and TUM RGBD sequences in Sec. 0.A.8.
0.A.1 Depth Refinement Evaluation on KITTI Odometry
We evaluate the depth refinement step of our self-improving pipeline on KITTI Odometry sequences 09 and 10. The first block of Tab. 7 (i.e., MonoDepth2M vs. pRGBDRefined) shows the improved results after the depth refinement step. We also compare our method with a state-of-the-art depth refinement method, DCNF [yin2017scale]. Note that DCNF uses ground-truth depths for pretraining the network, while our method uses only unlabelled monocular images and still outperforms DCNF (see the second block of Tab. 7). The results show that our self-improving framework with the wide-baseline losses (i.e., symmetric depth transfer and depth consistency losses) improves the depth prediction.
Depth  Lower is better  Higher is better  

Method  Train  Cap  Abs Rel  Sq Rel  RMSE  RMSE log  a1  a2  a3 
MonoDepth2M [godard2019digging]  M  80  0.123  0.703  4.165  0.188  0.854  0.956  0.985 
pRGBDRefined  M  80  0.121  0.649  3.995  0.184  0.853  0.960  0.986 
DCNF [yin2017scale]  M  20  0.112    2.047         
pRGBDRefined  M  20  0.098  0.242  1.610  0.145  0.906  0.978  0.993 
0.A.2 Comparison with State-of-the-Art SLAM Methods
In this section, we compare our pRGBDInitial and pRGBDRefined methods against state-of-the-art RGB SLAM methods, i.e., Direct Sparse Odometry (DSO) [engel2017direct], Direct Sparse Odometry with Loop Closure (LDSO) [gao2018ldso], and Direct Sparse Odometry in Dynamic Environments (DSOD) [ma2019dsod]. The results are shown in Tab. 8. Our pRGBDRefined outperforms all the competing methods in Absolute Trajectory Error (RMSE) and Relative Translation (Rel Tr) error. While these improvements are substantial, our Relative Rotation (Rel Rot) error is higher than that of the competing RGB SLAM methods. This could be due to the high uncertainty of CNN-predicted depths for far-away points, which affects our rotation estimation [hartley2003multiple]. However, comparing pRGBDInitial with pRGBDRefined shows that the Rel Rot error also decreases as the depth prediction improves (see Tab. 7, MonoDepth2M/pRGBDInitial vs. pRGBDRefined, and Tab. 8).
Seq. 09  Seq. 10  
Method  RMSE  Rel Tr  Rel Rot  RMSE  Rel Tr  Rel Rot 
RGB ORBSLAM[mur2017orb]  18.34  7.42  0.004  8.90  5.85  0.004 
DSO[engel2017direct]  74.29  16.32  
LDSO[gao2018ldso]  21.64      17.36     
DSOD[ma2019dsod]    13.85  0.002    13.53  0.002 
pRGBDInitial  12.21  4.26  0.011  8.30  5.55  0.017 
pRGBDRefined  11.97  4.20  0.010  6.35  4.40  0.016 
0.A.3 KITTI Odometry Leaderboard Results
In the main paper, we keep the default ORBSLAM settings, which leads to tracking failures for all methods on a few sequences (see Tab. 3 of the main paper). The KITTI Odometry leaderboard requires results on all sequences (i.e., sequences 11-21) for evaluation. Therefore, we increase the minimum number of inliers for adding keyframes from 100 to 500 so that our pRGBDRefined succeeds on all sequences. We report the results of our pRGBDRefined on the KITTI Odometry leaderboard in Tab. 9. Our method outperforms the competing monocular/LiDAR-based methods in terms of both relative translation and rotation errors.
Method  Rel Tr  Rel Rot 

ORBSLAM2S [mur2017orb]  1.70  0.0028 
OABA [frost2016object]  20.95  0.0135 
VISO2M [geiger2011stereoscan]  11.94  0.0234 
BLO [velas2018cnn]  9.21  0.0163 
VISO2M+GP [geiger2011stereoscan, song2014robust]  7.46  0.0245 
pRGBDRefined  6.24  0.0097 
0.A.4 Experiments on TUM RGBD Sequences
0.A.4.1 Implementation Details
We pretrain/finetune the depth network on image resolution . For pretraining, we set the learning rate to initially, reduce it to after 20 epochs, and train for 30 epochs. For finetuning, we extract camera poses, 2D keypoints and the associated depths from keyframes while running RGBD ORBSLAM on the training sequences. We finetune the depth network with the fixed learning rate of . We use the following 6 sequences for pretraining/finetuning: 1. fr3/long_office_household, 2. fr3/long_office_household_validation, 3. fr3/sitting_xyz, 4. fr3/structure_texture_far, 5. fr3/structure_texture_near, 6. fr3/teddy, and the following 2 sequences for testing: 1. fr3/walking_xyz, 2. fr3/large_cabinet_validation. Note that these are the only 8 sequences with provided rectified images among the entire TUM RGBD dataset.
0.A.4.2 Qualitative Results
Fig. 7(a) and Fig. 7(b) show qualitative pose evaluation results on the test sequences walking_xyz and large_cabinet_validation, respectively. The results show the increased robustness and accuracy of pRGBDRefined. In particular, RGB ORBSLAM fails on walking_xyz, while pRGBDRefined succeeds and achieves the best performance on both sequences. Some qualitative depth refinement results are presented in Fig. 8. The difference between the depth values of nearby and farther scene points becomes clearer, e.g., see the depth around the two monitors.
Fig. 7 panels: (a) fr3/walking_xyz  (b) fr3/large_cabinet_validation
Fig. 8 panels: (a) RGB  (b) pRGBDInitial  (c) pRGBDRefined
0.A.5 Additional Plots of Self-Improving Loop Analysis
In the main paper, we showed the behaviour of 3 depth evaluation metrics, namely Sq Rel, RMSE, and a2. In this section, we present the behaviour of all 7 depth evaluation metrics together with the pose evaluation metric. Our analysis in Sec. 5 of the main paper holds for all 7 depth evaluation metrics.
Fig. 9 panels: (a)  (b)  (c)  (d)  (e)  (f)  (g)  (h)
0.A.6 Additional Depth Refinement Qualitative Results
Fig. 10 shows visual improvements in the depth predictions of farther scene points. Fig. 11 shows additional qualitative results, where pRGBDRefined yields visible improvements at occlusion boundaries and on thin objects. These improvements arise because cues aggregated from multiple views with wider baselines (e.g., via our depth transfer and depth consistency losses) lead to better-posed depth recovery.
Fig. 10 columns: RGB  MonoDepth2M  pRGBDRefined
Fig. 11 columns: RGB  MonoDepth2M  pRGBDRefined
0.A.7 Additional Pose Refinement Qualitative Results
Additional qualitative pose refinement results are shown in Fig. 12. In all three sequences, our pRGBDRefined aligns well with the ground-truth trajectory. Note that both RGB ORBSLAM and our pRGBDInitial fail on sequence 12, whereas our pRGBDRefined succeeds, showing the enhanced robustness of our self-improving framework.
Fig. 12 panels: (a) Seq 11  (b) Seq 12  (c) Seq 15
0.A.8 Demo Videos
We include example videos on sequences 11 and 19 of KITTI Odometry (i.e., http://tiny.cc/pRGBD_KITTI_11 and http://tiny.cc/pRGBD_KITTI_19, respectively) and on sequence fr3/large_cabinet_validation of TUM RGBD (i.e., http://tiny.cc/pRGBD_TUM_LCV). In particular, we illustrate the improvements in depth prediction at frames 140 and 352 of pRGBD_KITTI_11, frames 1652, 3248, and 3529 of pRGBD_KITTI_19, and frames 153 and 678 of pRGBD_TUM_LCV. In addition, we highlight the failure of RGB ORBSLAM at frame 2985 of pRGBD_KITTI_19.