Deep Online Correction for Monocular Visual Odometry

03/18/2021 ∙ by Jiaxin Zhang, et al. ∙ Huazhong University of Science & Technology

In this work, we propose a novel deep online correction (DOC) framework for monocular visual odometry. The pipeline has two stages: first, depth maps and initial poses are obtained from convolutional neural networks (CNNs) trained in a self-supervised manner; second, the poses predicted by the CNNs are further improved by minimizing photometric errors via gradient updates of the poses during the inference phase. The benefits of our proposed method are twofold: 1) Unlike online-learning methods, DOC does not need to propagate gradients to the parameters of the CNNs, which saves computational resources during inference. 2) Unlike hybrid methods that combine CNNs with traditional modules, DOC relies fully on deep learning (DL) frameworks. Even without complex back-end optimization modules, our method achieves outstanding performance with a relative translation error (RTE) of 2.0% on KITTI Odometry Seq. 09, which outperforms traditional monocular VO frameworks and is comparable to hybrid methods.


I Introduction

Monocular visual odometry (VO) has attracted increasing attention for its wide applications in robotics, autonomous driving and augmented reality (AR). As an effective complement to other sensors such as GPS, inertial navigation systems (INS) and wheel odometry, monocular VO is popular for its low cost and easy accessibility.

A large number of robust and accurate monocular VO systems have been developed in the past decades [27]. These methods can be roughly classified into three categories: traditional, DL-based and hybrid methods.

Fig. 1: Mapping results for KITTI Seq. 09 via our method. 3D points are first obtained by inverse projection with depth and camera intrinsics and then transformed into the global coordinate frame via the camera poses. The green points are LiDAR points shown only for comparison. The blue curve is the trajectory estimated by our method.

Fig. 2: The inference pipeline of our method for monocular VO. (1) A pair of consecutive frames is fed into the Depth-CNN and Pose-CNN to get depth maps and a pose initialization. (2) Photometric error maps are calculated by forward and backward warping. (3) The relative transform is optimized by minimizing the photometric errors. Gray arrows represent the error calculation path. Red arrows refer to gradient back-propagation.

Traditional monocular VO methods [19, 25, 4, 10, 9, 8] usually consist of tracking, local or global optimization and loop closure modules, which make full use of geometric constraints. Although some traditional methods have shown excellent performance in terms of robustness and accuracy, they inherently suffer from the loss of absolute scales without extra information. Moreover, no reliable ego-motion will be obtained unless the parallax between successive frames is large enough.

Deep learning based methods [20, 1, 30, 29, 22] try to tackle the above problems by training CNNs on large amounts of data. Rather than specifying geometric constraints manually, DL-based methods obtain them by exploiting prior knowledge from training data. As a result, reasonable poses and depth can be estimated even when the parallax is small. Besides, online learning can be utilized to further improve performance. Despite these advantages, the accuracy of ego-motion estimated by DL-based methods is still inferior to that of traditional approaches.

Hybrid methods try to combine the interpretability of traditional methods with the strong data-fitting abilities of DL-based methods. Existing hybrid methods usually leverage CNNs as initialization for traditional VO frameworks [23, 33, 32]. Though hybrid methods achieve state-of-the-art results, their heavy computational burden hinders further practical application.

In this paper, we propose a novel deep online correction (DOC) framework for monocular VO, which is composed of two CNNs and the proposed online correction module. The former provides depth estimates and initial poses, while the latter further improves the accuracy and robustness of the poses via gradient propagation. Different from existing hybrid methods, the whole pipeline is concise and does not involve traditional modules such as global bundle adjustment or pose graph optimization. Different from existing online-learning methods, the parameters of the depth and pose CNNs are not updated during the inference phase, which improves efficiency for real-time operation. This approach is much more effective because it reduces the dimension of the optimizable space from millions of parameters to only a 6DoF pose. In addition, it can be implemented in a two-frame or three-frame manner, which is quite flexible.

The contributions of our work are as follows:

  • We propose a fully DL-based VO which is composed of CNNs and an online correction module. The proposed method combines the advantages of both traditional and DL-based methods.

  • Another version named DOC+ is designed to further improve the VO performance. Two kinds of implementations are provided to show flexibility.

  • Our approach achieves state-of-the-art accuracy among current monocular methods. On the KITTI Odometry Seq. 09 (see Fig. 1 for visualization), DOC and DOC+ achieve performance of RTE=2.3% and RTE=2.0% respectively.

Fig. 3: Visualization of photometric errors before and after online correction. Sub-figures (a) and (b) are the original images and corresponding depth maps respectively. Dark to bright in the depth maps represents near to far. Sub-figures (c) and (d) are reconstructed images before and after online correction. Sub-figures (e) and (f) are photometric errors before and after online correction. Dark to red represents errors from small to large. Note that in the error maps, the errors around the curb and the manhole cover (circled in the error map) are reduced by the online correction.

II Related Works

II-A Traditional VO

Traditional VO is a vital component for most popular SLAM frameworks. They can be roughly divided into two categories: indirect methods [19, 25, 4] that recover depth and poses by minimizing geometric errors; direct methods [10, 8, 9] that minimize photometric errors. Despite great accuracy and efficiency, these methods usually can only estimate poses and depth up to an unknown scale factor in a monocular setup. Moreover, accurate poses may not be recovered in extreme conditions such as texture-less surfaces or dynamic scenes.

II-B Deep-Learning-Based VO

Recent years have witnessed the thriving development of deep learning based visual odometry. Some of the first DL-based VO estimation methods were proposed by Konda et al. [20] and Wang et al. [30]. Despite impressive results, their applications are limited by the requirement for labeled data.

To solve this problem, unsupervised methods have been proposed and become popular. SfM-Learner [36] proposed by Zhou et al. is a representative pioneer of unsupervised VO. It contains a Pose-CNN and a Depth-CNN, and the generated poses and depth are utilized to synthesize a new view as supervision. However, the performance of SfM-Learner is not competitive with traditional methods and it still suffers from scale ambiguity. Li et al. [21] and Bian et al. [1] solve the scale problem by introducing either stereo-image training or an extra scale-consistency loss term. The performance of unsupervised methods is further improved by several follow-up works: Monodepth2 [13] utilizes a minimum reprojection loss and a multi-scale strategy during training for better depth estimation; Zhao et al. [35] proposed a framework that utilizes optical flow to estimate poses; Wagstaff et al. [29] proposed a two-stage method, Deep Pose Correction (DPC), to further improve the results of a Pose-CNN. Since unsupervised learning does not rely on labeled data, online learning can be used to improve performance on test data: Li et al. [22] utilize meta-learning for better generalizability. Chen et al. [6] proposed the concepts of Output Fine-Tuning (OFT) and Parameter Fine-Tuning (PFT); however, their comparison was made only for depth refinement and OFT showed little improvement to the overall results. In contrast, our method exploits the potential of OFT in visual odometry by designing a specific optimization procedure for online correction. Despite promising results on visual odometry, DL-based methods are still inferior to some traditional frameworks in terms of generalizability and efficiency. We recommend the visual odometry survey [5] for more details.

II-C Hybrid VO

Since traditional VO and DL-based VO have complementary pros and cons, it is a natural idea to combine them to build a robust VO system. CNN-SVO [23] is one of the first hybrid VO frameworks, integrating a depth CNN into SVO [10]. DVSO [33] and D3VO [32] use CNNs for initialization and DSO [8] as the optimization backend. Benefiting from good initialization by CNNs and the strong robustness of traditional frameworks, hybrid methods have achieved state-of-the-art results on public benchmarks. Deep frame-to-frame visual odometry (DFVO) [34] utilizes CNNs to predict the optical flow and depth of a given image and then uses traditional geometric constraints to recover the 6DoF pose, achieving promising results. BANet [28] modifies bundle adjustment so that it is differentiable and can thus be integrated into an end-to-end framework. Our approach differs from existing hybrid methods in that we propose a novel online correction module based on DL frameworks without using global bundle adjustment.

III Approach

The core idea of the DOC framework is that the relative poses are directly optimized by minimizing photometric errors through gradient propagation. DOC does not rely on traditional frameworks and only needs to calculate gradients w.r.t. the 6DoF pose (see Fig. 2 for details). In this section, we first describe the training procedure for the Depth-CNN and Pose-CNN, and then describe the details of DOC (two-frame optimization) and DOC+ (three-frame optimization) as shown in Fig. 4.

III-A Training of Depth-CNN and Pose-CNN

The proposed Depth-CNN and Pose-CNN have architectures similar to Monodepth2 [13], but are tailored to the needs of online correction initialization. The Depth-CNN has a U-Net-like structure with skip connections; it takes a single RGB image and outputs a depth map. The Pose-CNN takes two concatenated images as input and outputs rotation and translation vectors. Stereo images are leveraged in the training phase to recover absolute scales, while only monocular images are fed into the networks during testing. Besides, to improve the performance of the networks and to support the online correction module, we use the "explainability" mask [36] instead of the auto-masking of the original paper.
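
As a rough illustration of the Pose-CNN interface (not the authors' released code), the sketch below wraps a ResNet-18 backbone, which Sec. IV-A states is the backbone of both networks, to map two concatenated frames to a 6DoF pose vector; the 6-channel first convolution and the 0.01 output scaling are assumptions borrowed from Monodepth2-style designs.

import torch
import torch.nn as nn
import torchvision.models as models

class PoseCNN(nn.Module):
    """Interface sketch: two concatenated RGB frames in, a 6-vector
    (rotation + translation) out, as described in Sec. III-A."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # First conv takes 6 channels: the two input frames stacked along channels.
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 6)
        self.backbone = backbone

    def forward(self, frame_a, frame_b):
        x = torch.cat([frame_a, frame_b], dim=1)   # (B, 6, H, W)
        return 0.01 * self.backbone(x)             # small initial pose estimate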

III-B DOC: Two-Frame-Based Optimization

For testing, DOC considers two consecutive frames in the online correction module. Given the input images {I_{i−1}, I_i}, the Depth-CNN and Pose-CNN are used to infer the depths {D_{i−1}, D_i} and the initial pose respectively. The online correction module then iteratively refines the ego-motion by minimizing a special photometric error defined as

E_i = E_i^f + E_i^b,    (1)

where E_i is the photometric error at time step i, and E_i^f and E_i^b are the forward and backward errors:

E_i^f = M_i ⊙ pe(I_i, ω(I_{i−1}, D_i, K, T_{i→i−1})),    (2)
E_i^b = M_{i−1} ⊙ pe(I_{i−1}, ω(I_i, D_{i−1}, K, T_{i−1→i})),    (3)

where ⊙ denotes element-wise multiplication. Here we take E_i^f for illustration; E_i^b can be obtained in a similar way.

ω(·) in (2) is the differentiable warping function proposed by Jaderberg et al. in [17]. It synthesizes a novel view from an input image I according to the corresponding depth D, the camera intrinsics K and a transform matrix T by

ω(I, D, K, T)(p) = I⟨ K T D(p) K^{−1} p̃ ⟩,    (4)

where p̃ is the homogeneous coordinate of pixel p and ⟨·⟩ denotes differentiable bilinear sampling at the projected (dehomogenized) location.
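
The warping in (4) can be sketched with standard PyTorch operations. The following is a minimal illustration under the notation above rather than the authors' implementation; the helper name inverse_warp, the border padding mode and the bilinear sampling settings are assumptions.

import torch
import torch.nn.functional as F

def inverse_warp(src_img, tgt_depth, K, T_tgt_to_src):
    """Synthesize the target view by sampling a source image, in the spirit of Eq. (4).
    src_img:      (B, 3, H, W) source frame to sample from
    tgt_depth:    (B, 1, H, W) depth of the target frame
    K:            (B, 3, 3)    camera intrinsics
    T_tgt_to_src: (B, 4, 4)    rigid transform from target to source frame
    """
    B, _, H, W = src_img.shape
    device = src_img.device

    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)

    # Unproject target pixels into 3D using the target depth.
    cam_pts = torch.linalg.inv(K) @ pix.unsqueeze(0)                  # (B, 3, H*W)
    cam_pts = cam_pts * tgt_depth.view(B, 1, -1)

    # Transform into the source frame and project with the intrinsics.
    cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_pts = (T_tgt_to_src @ cam_pts_h)[:, :3]                       # (B, 3, H*W)
    src_pix = K @ src_pts
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample the source image.
    u = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    v = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)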

pe(·,·) in (2) calculates the photometric error between two images:

pe(I_a, I_b) = ρ(|I_a − I_b|),    (5)

where ρ(·) is a function that behaves as an outlier rejector:

ρ(x) = x if x ≤ μ_x + σ_x, and ρ(x) = 0 otherwise,    (6)

where μ_x and σ_x are the mean and standard deviation of x respectively. Instead of using SSIM [31], we use this truncated loss because we find it achieves similar accuracy for the DOC module while being more computationally efficient.
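
A minimal sketch of this truncated photometric error is given below; the exact truncation rule (here, rejecting pixels whose error exceeds the mean plus one standard deviation of the error map) is an assumption, since only the use of the error statistics is stated above.

import torch

def truncated_photometric_error(img_a, img_b):
    # Per-pixel L1 error, averaged over color channels: shape (B, 1, H, W).
    err = (img_a - img_b).abs().mean(dim=1, keepdim=True)
    mu, sigma = err.mean(), err.std()
    # Outlier rejection: zero out pixels whose error is far above the image statistics.
    keep = (err <= mu + sigma).float()
    return err * keep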

M in (2) is a mask composed of an occlusion mask M_o and an explainability mask M_e, i.e., M = M_o ⊙ M_e. Similar to [14], the occlusion mask is defined as

M_o = [ D^{proj} − D < d_th ],    (7)

where [·] is the Iverson bracket, D^{proj} is the depth of the source frame projected into the target view, D is the depth predicted by the Depth-CNN, and d_th is a threshold set to 5 meters to filter out wrong occlusion-mask pixels caused by incorrect depth estimation, especially in the distance. M_e is the explainability mask predicted by the CNN following SfM-Learner [36]. As we can see from Fig. 5, M_o successfully covers the pixels with the "ghosting effect" where the double traffic rods appear, and M_e reduces photometric errors in high-frequency areas like roofs and vegetation.

Fig. 4: Illustrations of DOC (two-frame) and DOC+ (three-frame) frameworks. The blue box represents DOC which only minimizes photometric errors from two consecutive frames. The orange box represents DOC+ which uses reprojection errors from pairs of frames among three frames.

Equations (2) and (3) are connected by the following constraint:

T_{i−1→i} = T(ξ),  T_{i→i−1} = T(ξ)^{−1},    (8)

where T(ξ) represents a rigid transform parameterized by a six-dimensional vector ξ (three rotation and three translation components). The vector can easily be converted into a 4×4 transform matrix with the Rodrigues rotation formula.
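
For illustration, a differentiable conversion from the 6-vector to a 4×4 transform via the Rodrigues formula could look as follows; this is a sketch, and the axis-angle parameterization of the rotation part is an assumption consistent with the description above.

import torch

def pose_vec_to_mat(xi):
    # xi: 6-vector with rotation (axis-angle) in xi[:3] and translation in xi[3:].
    # Built from differentiable ops so gradients can flow back to xi.
    r, t = xi[:3], xi[3:]
    theta = r.norm() + 1e-8
    k = r / theta                                    # unit rotation axis
    zero = torch.zeros((), dtype=xi.dtype, device=xi.device)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])                                               # skew-symmetric matrix of k
    R = (torch.eye(3, dtype=xi.dtype, device=xi.device)
         + torch.sin(theta) * K
         + (1.0 - torch.cos(theta)) * (K @ K))       # Rodrigues' rotation formula
    Rt = torch.cat([R, t.view(3, 1)], dim=1)         # (3, 4)
    bottom = torch.tensor([[0.0, 0.0, 0.0, 1.0]], dtype=xi.dtype, device=xi.device)
    return torch.cat([Rt, bottom], dim=0)            # (4, 4) rigid transform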

Finally, the Adam [18] optimizer is used to update the relative poses. The maximum number of iterations is set to 20 in all our experiments. The full algorithm is presented in Alg. 1. The visualization and discussion of the error maps before and after optimization are shown in Fig. 3.

III-C DOC+: Three-Frame-Based Optimization

DOC+ takes three frames in the online correction module, which can further improve the pose accuracy.

For each frame I_i, we consider the two previous frames I_{i−1} and I_{i−2}. The relative pose T_{i−2→i−1} between the last two frames has already been optimized in the previous step. T_{i−1→i} is initialized by the Pose-CNN and will be further optimized. We include four photometric errors in the total energy function:

E = E_{⟨i−1,i⟩} + w · E_{⟨i−2,i⟩},    (9)

where E_{⟨a,b⟩} denotes the sum of the forward and backward photometric errors between frames a and b, defined as in (2) and (3), and w is a balancing factor between the current and previous frames, set to 0.8 in all experiments.

Require: Depth-CNN; Pose-CNN; Intrinsics K

Input: Image sequence [I_0, I_1, ..., I_N]

Output: Refined poses [T_{0→1}, ..., T_{N−1→N}]

Initialization:
for i = 1 to N do
     Get Depth-CNN predictions D_{i−1}, D_i
     Get Pose-CNN prediction ξ for the frame pair (I_{i−1}, I_i)
     Compute transform matrices T_{i−1→i} = T(ξ) and T_{i→i−1} = T(ξ)^{−1}
     for k = 1 to 20 do
          Warp I_{i−1} into view i: Î_{i−1→i} = ω(I_{i−1}, D_i, K, T_{i→i−1})
          Compute E_i^f by (2)
          Warp I_i into view i−1: Î_{i→i−1} = ω(I_i, D_{i−1}, K, T_{i−1→i})
          Compute E_i^b by (3)
          E_i = E_i^f + E_i^b
          Compute gradient of E_i w.r.t. ξ
          Use Adam optimizer to update ξ
     end for
     Output refined pose T_{i−1→i} computed from ξ
end for
Algorithm 1 Deep Online Correction (two-frame based)
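
Alg. 1 can be sketched in PyTorch as follows. This is a simplified illustration rather than the released implementation: it reuses the hypothetical helpers from the sketches above, omits the masks M for brevity, and assumes a learning rate; the 20 iterations and the pose-only optimization follow the paper.

import torch

def doc_two_frame(img_prev, img_curr, depth_prev, depth_curr, K,
                  xi_init, num_iters=20, lr=1e-3):
    # xi_init: initial 6DoF pose vector from the Pose-CNN (shape (6,));
    # K: (1, 3, 3) intrinsics. Only the pose vector is optimized; the CNN
    # weights are never touched.
    xi = xi_init.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([xi], lr=lr)

    for _ in range(num_iters):
        optimizer.zero_grad()
        T_prev_to_curr = pose_vec_to_mat(xi)               # frame i-1 -> i
        T_curr_to_prev = torch.linalg.inv(T_prev_to_curr)  # frame i -> i-1

        # Forward error: synthesize frame i from frame i-1 (cf. Eq. (2)).
        warped_prev = inverse_warp(img_prev, depth_curr, K, T_curr_to_prev.unsqueeze(0))
        e_fwd = truncated_photometric_error(img_curr, warped_prev)

        # Backward error: synthesize frame i-1 from frame i (cf. Eq. (3)).
        warped_curr = inverse_warp(img_curr, depth_prev, K, T_prev_to_curr.unsqueeze(0))
        e_bwd = truncated_photometric_error(img_prev, warped_curr)

        loss = e_fwd.mean() + e_bwd.mean()                 # Eq. (1), masks omitted
        loss.backward()                                    # gradients reach xi only
        optimizer.step()

    return pose_vec_to_mat(xi.detach())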

The photometric errors are defined similarly to (2) and (3). Once T_{i−1→i} is obtained, T_{i−2→i} is calculated by composition as

T_{i−2→i} = T_{i−1→i} · T_{i−2→i−1}.    (10)

The Adam optimizer is used to minimize (9) with respect to T_{i−2→i−1} and T_{i−1→i}. Since the former has already been updated during the previous optimization, it is natural to prevent it from being updated too far from its initial values. In traditional frameworks, this is usually achieved by marginalization. We achieve it by setting different learning rates for each pose: the learning rate for the previous-frame pose is empirically set 10 times smaller than that for the current frame.
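
In PyTorch, the two learning rates can be realized with Adam parameter groups. The absolute learning-rate values below are assumptions; only the 10x ratio comes from the text above, and the zero tensors are placeholders for the actual initial pose vectors.

import torch

# DOC+ sketch: two pose vectors are optimized jointly, but the previously
# refined pose receives a 10x smaller learning rate (a soft stand-in for
# marginalization in traditional frameworks).
xi_prev = torch.zeros(6, requires_grad=True)   # T_{i-2 -> i-1}, refined in the last step
xi_curr = torch.zeros(6, requires_grad=True)   # T_{i-1 -> i}, initialized by the Pose-CNN

optimizer = torch.optim.Adam([
    {"params": [xi_curr], "lr": 1e-3},
    {"params": [xi_prev], "lr": 1e-4},   # 10x smaller for the previous-frame pose
])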

IV Experiments

IV-A Implementation Details

We conduct experiments on both the KITTI odometry dataset [12, 11] and the EuRoC MAV dataset [3]. For the KITTI odometry dataset, the input images are resized to 832×256. The Pose-CNN and Depth-CNN are jointly trained on sequences 00-08, which contain 36671 training frames in total, and then evaluated on sequences 09 and 10 in combination with the online correction module. The relative translation error (RTE) and relative rotation error (RRE) are used for evaluation. RTE is the average translational root mean square error (RMSE) drift in percentage over trajectory lengths of 100, 200, ..., 800 m, while RRE is the average rotational RMSE drift (°/100m) over the same lengths. It is worth noting that we do not use the full KITTI Eigen split for training since it has some overlap with the KITTI odometry test set. For the EuRoC MAV dataset, all stereo images are rectified and then resized to 736×480. Sequences MH_03 and MH_05 are used for testing. All the other sequences, which contain 22067 frames, are used for training. The RMSE of the absolute trajectory error (ATE) is used as the evaluation metric.

Fig. 5: Visualization. From top to bottom: warped image, the combination of occlusion (black) and explainability (gray) masks, photometric error map. In the top row, we can see the double traffic rods in the warped image produced by occlusion during backward warping. The occlusion mask successfully identifies the pixels where occlusion happens; as a result, the occluded region is excluded from the photometric errors. The explainability mask in the middle row, produced by the CNN, mainly reduces photometric errors in high-frequency areas like roofs and vegetation.
(a) Trajectories of our methods and traditional methods.
(b) Trajectories of our methods, DL-based and hybrid methods.
Fig. 6: Comparison of our methods with traditional methods and hybrid methods. Comparative experiments are conducted on KITTI Odometry Seq. 09 (left) and Seq. 10 (right). Figures in (a) show results of our method and traditional methods, while figures in (b) demonstrate the trajectories of our methods and hybrid methods.

For both datasets, stereo images are utilized to train the CNNs while only monocular images are needed in the test phase. With the prior knowledge learned from stereo images, our method can recover absolute scales even for unseen images. The backbones of the Depth-CNN and Pose-CNN are ResNet-18 [16] with weights pretrained on ImageNet [7]. We train the Depth-CNN and Pose-CNN for 20 epochs and use the parameters from the last epoch to produce depth estimates and pose initializations for the test sequences. The Adam optimizer with a learning rate of 1e-4 for the first 15 epochs and 1e-5 for the last 5 epochs is used during training. The batch size is set to 8 for training and 1 for the online correction module. The whole framework is implemented in PyTorch [26] on a single NVIDIA TITAN Xp GPU. To speed up the online correction process, the procedure is slightly modified from Alg. 1: equation (4) is composed of three parts, unprojection, transformation and projection, and for every frame the unprojection part does not involve gradient propagation, so it can be pre-computed before the optimization iterations. The running speed of DOC and DOC+ is about 8 FPS and 5 FPS respectively, including both CNN inference and the online correction module.
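
A sketch of this pre-computation is shown below (assumed helper name and tensor layout). Because the unprojection depends only on the fixed depth map and intrinsics, not on the pose being optimized, it can run once per frame under torch.no_grad(); inside the correction loop, only the transform and projection steps then need to be differentiated w.r.t. the pose.

import torch

@torch.no_grad()
def precompute_unprojection(depth, K):
    # depth: (B, 1, H, W) predicted depth; K: (B, 3, 3) intrinsics.
    B, _, H, W = depth.shape
    device = depth.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)
    rays = torch.linalg.inv(K) @ pix.unsqueeze(0)       # back-projected pixel rays
    return rays * depth.view(B, 1, -1)                  # 3D camera points, (B, 3, H*W)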

Sequence 09 Sequence 10
Methods RTE (%) RRE (°/100m) ATE (m) RTE (%) RRE (°/100m) ATE (m)
DSO [8] 15.91 0.20 52.23 6.49 0.20 11.09
ORB-SLAM2 (w/o LC) [25] 9.3 0.26 38.77 2.57 0.32 5.42
ORB-SLAM2 (w/ LC) [25] 2.88 0.25 8.39 3.30 0.30 6.63
SC-SfM-Learner [2] 7.64 2.19 15.02 10.74 4.58 20.19
Monodepth2 [13] 14.89 3.44 68.75 11.29 4.97 21.93
Online Adaptation [22] 5.89 3.34 - 4.79 0.83 -
DPC [29] 2.82 0.76 - 3.81 1.34 -
DFVO (Stereo Trained) [34] 2.61 0.29 10.88 2.29 0.37 3.72
Gordon et al. [14] 2.7 - - 6.8 - -
Ours (DOC) 2.26 0.87 7.34 2.61 1.59 4.23
Ours (DOC+) 2.02 0.61 4.76 2.29 1.10 3.38
TABLE I: Monocular visual odometry comparison on KITTI Odometry Seq. 09 and Seq. 10 with different approaches, including traditional, DL-based and hybrid methods. RTE, RRE and ATE are abbreviations for relative translation error, relative rotation error and absolute trajectory error respectively.
Method Loss Frames RTE (%) RRE (°/100m) ATE (m)
(a) DOC w/o mask w/ truncation 2 2.66 1.00 12.92
(b) w/ explainability mask w/ truncation 2 2.53 0.93 12.54
(c) w/ occlusion mask w/ truncation 2 2.54 0.96 11.42
(d) DOC w/o truncation w/o truncation 2 2.81 1.07 13.30
(e) DOC w/ SSIM SSIM 2 2.30 0.82 7.31
(f) DOC w/ truncation 2 2.26 0.87 7.34
(g) DOC+ w/ SSIM SSIM 3 2.03 0.68 5.51
(h) DOC+ w/ truncation 3 2.02 0.61 4.76
TABLE II: Ablation results evaluated on KITTI Odometry Seq. 09. M_e and M_o refer to the explainability mask and occlusion mask respectively. "Loss" indicates which loss is used during online correction. "Frames" = 2 or 3 stands for two-frame or three-frame optimization.

IV-B Visual Odometry Evaluation

For the KITTI odometry dataset, we take the Pose-CNN from Monodepth2 as our baseline. As illustrated in Table I, our proposed DOC method outperforms traditional and DL-based methods and is competitive with existing hybrid methods. Trajectories of these methods are visualized in Fig. 6 using EVO [15]. Li et al. [22] use a meta-learning technique to update parameters at test time; in contrast, our approach only needs to propagate gradients to the 6DoF poses without updating the whole network. DFVO [34] achieves better results in RRE as it uses traditional essential-matrix estimation and a PnP method to recover poses from depth and flow. As we can see from Seq. 09, traditional methods usually suffer from scale drift in monocular setups without loop closure, while DL-based methods cannot guarantee accurate poses throughout the whole trajectory. It is observable that even with inferior performance on RTE, traditional methods usually show better performance on RRE than DL-based methods. The reason may be twofold: first, scale drift clearly does not affect RRE, as rotation is scale invariant; second, traditional methods explicitly model the optimization on the rotation manifold SO(3), while DL-based methods solve the problem numerically through gradient updates with the Adam optimizer. Compared to both traditional and existing DL-based methods, our methods (DOC and DOC+) clearly show better performance and are comparable to hybrid methods. Besides, the trajectories estimated by DOC+ show very small translational drift (with the lowest ATE among all methods) without loop closure.

For the EuRoC MAV dataset, we use ATE as the evaluation metric. It is a very challenging dataset as it contains large motions and various illumination conditions. As shown in Table III, the proposed methods clearly improve the odometry accuracy over the Monodepth2 initialization. Our methods also outperform traditional methods like DSO and are comparable with ORB-SLAM. Indoor datasets like EuRoC MAV usually have trajectories confined to one room or a small space; thus a full SLAM system with re-localization like ORB-SLAM usually performs better than pure VO methods.

Methods MH_03 MH_05
DSO [8] 0.17 0.11
ORB-SLAM [24] 0.08 0.16
Monodepth2 [13] 1.88 1.56
Ours (DOC) 0.15 0.11
Ours (DOC+) 0.13 0.09
TABLE III: Monocular visual odometry comparison on the EuRoC MAV dataset. The RMSE of the absolute trajectory error (ATE) is used as the evaluation metric.

IV-C Ablation Study

We conduct a detailed ablation study of the DOC module on the KITTI dataset (see Table II). First, we explore different loss functions for the photometric errors. We find that the truncated L1 loss achieves performance similar to the SSIM loss (e, f, g, h). The SSIM loss requires computing a local mean and variance for every pixel, while the L1 loss between two images is clearly more computationally efficient. Second, we find that both the explainability mask and the occlusion mask improve the overall result (a-c). Occlusion masks stop gradient propagation at pixels with the "ghosting effect" (see Fig. 5). Explainability masks produced by the Depth-CNN reduce the weights of pixels in high-frequency areas such as trees and roofs, or on non-Lambertian surfaces such as windows; these areas are noise for online correction and cannot be described by reprojection warping. Finally, with all the components mentioned above and using only consecutive frames for online correction, our method already achieves satisfying odometry results. With the three-frame-based optimization, DOC+ further achieves better results and smaller translational drift thanks to the reprojection constraints from more image pairs. Adding more frames to the optimization yields very little improvement in accuracy while increasing the overall running time.

V Conclusions

In this paper, we propose a novel monocular visual odometry algorithm with an online correction module. It relies on DL-based frameworks and leverages the advantages of both CNNs and geometric constraints. Specifically, the Depth-CNN and Pose-CNN are trained in a self-supervised manner to provide initial ego-motion and depth maps with absolute scales. Then a novel online correction module based on gradient back-propagation further improves the VO accuracy. Different from existing online-learning methods, our online correction module does not update the networks' parameters, which makes it more concise and computationally efficient. Experimental results on the KITTI dataset demonstrate that our method outperforms existing traditional and DL-based methods and is comparable with state-of-the-art hybrid methods.

References

  • [1] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Advances in neural information processing systems, pp. 35–45. Cited by: §I, §II-B.
  • [2] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Advances in neural information processing systems, pp. 35–45. Cited by: TABLE I.
  • [3] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart (2016) The euroc micro aerial vehicle datasets. The International Journal of Robotics Research 35 (10), pp. 1157–1163. Cited by: §IV-A.
  • [4] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós (2020) ORB-SLAM3: an accurate open-source library for visual, visual-inertial and multi-map SLAM. arXiv preprint arXiv:2007.11898. Cited by: §I, §II-A.
  • [5] C. Chen, B. Wang, C. X. Lu, N. Trigoni, and A. Markham (2020) A survey on deep learning for localization and mapping: towards the age of spatial machine intelligence. arXiv preprint arXiv:2006.12567. Cited by: §II-B.
  • [6] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7063–7072. Cited by: §II-B.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §IV-A.
  • [8] J. Engel, V. Koltun, and D. Cremers (2017) Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence 40 (3), pp. 611–625. Cited by: §I, §II-A, §II-C, TABLE I, TABLE III.
  • [9] J. Engel, T. Schöps, and D. Cremers (2014) LSD-slam: large-scale direct monocular slam. In European conference on computer vision, pp. 834–849. Cited by: §I, §II-A.
  • [10] C. Forster, M. Pizzoli, and D. Scaramuzza (2014) SVO: fast semi-direct monocular visual odometry. In 2014 IEEE international conference on robotics and automation (ICRA), pp. 15–22. Cited by: §I, §II-A, §II-C.
  • [11] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §IV-A.
  • [12] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §IV-A.
  • [13] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE international conference on computer vision, pp. 3828–3838. Cited by: §II-B, §III-A, TABLE I, TABLE III.
  • [14] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova (2019) Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8977–8986. Cited by: §III-B, TABLE I.
  • [15] M. Grupp (2017) Evo: python package for the evaluation of odometry and slam.. Note: https://github.com/MichaelGrupp/evo Cited by: §IV-B.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-A.
  • [17] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §III-B.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-B.
  • [19] G. Klein and D. Murray (2007) Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM international symposium on mixed and augmented reality, pp. 225–234. Cited by: §I, §II-A.
  • [20] K. R. Konda and R. Memisevic (2015) Learning visual odometry with a convolutional network.. In VISAPP (1), pp. 486–490. Cited by: §I, §II-B.
  • [21] R. Li, S. Wang, Z. Long, and D. Gu (2018) Undeepvo: monocular visual odometry through unsupervised deep learning. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 7286–7291. Cited by: §II-B.
  • [22] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha (2020) Self-supervised deep visual odometry with online adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6339–6348. Cited by: §I, §II-B, §IV-B, TABLE I.
  • [23] S. Y. Loo, A. J. Amiri, S. Mashohor, S. H. Tang, and H. Zhang (2019) CNN-svo: improving the mapping in semi-direct visual odometry using single-image depth prediction. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5218–5223. Cited by: §I, §II-C.
  • [24] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics 31 (5), pp. 1147–1163. Cited by: TABLE III.
  • [25] R. Mur-Artal and J. D. Tardós (2017) Orb-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §I, §II-A, TABLE I.
  • [26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §IV-A.
  • [27] D. Scaramuzza and F. Fraundorfer (2011) Visual odometry [tutorial]. IEEE robotics & automation magazine 18 (4), pp. 80–92. Cited by: §I.
  • [28] C. Tang and P. Tan (2018) Ba-net: dense bundle adjustment network. arXiv preprint arXiv:1806.04807. Cited by: §II-C.
  • [29] B. Wagstaff, V. Peretroukhin, and J. Kelly (2020) Self-supervised deep pose corrections for robust visual odometry. arXiv preprint arXiv:2002.12339. Cited by: §I, §II-B, TABLE I.
  • [30] S. Wang, R. Clark, H. Wen, and N. Trigoni (2017) Deepvo: towards end-to-end visual odometry with deep recurrent convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2043–2050. Cited by: §I, §II-B.
  • [31] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §III-B.
  • [32] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers (2020) D3VO: deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1281–1292. Cited by: §I, §II-C.
  • [33] N. Yang, R. Wang, J. Stuckler, and D. Cremers (2018) Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 817–833. Cited by: §I, §II-C.
  • [34] H. Zhan, C. S. Weerasekera, J. Bian, and I. Reid (2020) Visual odometry revisited: what should be learnt?. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4203–4210. Cited by: §II-C, §IV-B, TABLE I.
  • [35] W. Zhao, S. Liu, Y. Shu, and Y. Liu (2020) Towards better generalization: joint depth-pose learning without posenet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9151–9161. Cited by: §II-B.
  • [36] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §II-B, §III-A, §III-B.