1 Introduction
Estimating camera motion from monocular videos plays an essential role in many real-world applications, such as autonomous driving and robotics. This problem is usually solved by visual odometry (VO) or simultaneous localization and mapping (SLAM). Classic SLAM/VO methods [DSO, LSD, svo, orb] perform well in favorable conditions but often fail in challenging situations (e.g. textureless regions, dynamic objects) due to their reliance on low-level features and hand-crafted pipelines. Since deep neural networks can extract high-level features and infer end-to-end by learning from data, many learning-based VO methods [onlinevo, savo, GeoNet, SfMLearner] have been proposed to break through the limitations of classic SLAM/VO. Among them, self-supervised VO methods jointly learn camera pose, depth and optical flow by minimizing photometric error [GeoNet], and have shown promising results in recent years. However, learning-based VO often fails during inference when the scenes differ from the training data. The inability of pretrained VO to generalize to unseen environments limits its wide application [onlinevo, open]. To this end, pretrained networks are required to achieve real-time online adaptation in a self-supervised manner.
As a result, several previous works [onlinedepth, onlinevo, open] have been proposed to mitigate the domain generalization problem of stereo matching and VO. However, their accuracy is still much inferior to classic methods, and the pretrained networks suffer from slow convergence. These methods treat VO as a black box by learning all components (pose, depth, optical flow, etc.) while ignoring well-defined geometric computations and optimization methods, which leads to slow convergence during online adaptation.
Existing deep VO methods predict depth by single-view estimation, which is an ill-posed problem [savo]. The learned depth therefore relies strongly on the training dataset, while at inference time the camera intrinsics, scene layouts and distances are usually different. Meanwhile, the camera pose is learned rather than calculated analytically, which requires favorable camera motion with sufficient disparity (e.g. the KITTI dataset). Therefore, these methods tend to fail when faced with unseen or more complicated motion patterns. In addition, existing learning-based methods do not explicitly enforce multi-view geometric consistency during inference, which leads to large scale drift in trajectories.
In order to improve the online adaptation of VO to unseen environments, we propose a self-supervised framework that combines the advantages of deep learning and geometric computations. The proposed framework utilizes scene-agnostic 3D geometry constraints and Bayesian inference formulations to speed up online adaptation. During inference, the single-view depth estimate serves as a prior of the current scene geometry and is continuously improved with incoming observations by a probabilistic Bayesian updating framework. The refined depth is used as the Maximum A Posteriori (MAP) estimate to train DepthNet for better estimation at the next time step. Instead of predicting pose with a PoseNet, our framework solves pose analytically from optical flow and refined depth. Meanwhile, in order to deal with observation noise, the proposed method learns depth and photometric uncertainties online, which are used in the depth refinement process and the differentiable Gauss-Newton optimization, respectively. Finally, the optimized pose, depth and flow are used for online self-supervision. Our framework ensures scale consistency by exploiting multi-view geometric constraints. The well-defined, scene-agnostic computation helps our VO framework achieve good generalization across different scene conditions. Our contributions can be summarized as follows:
We propose a generalizable deep VO that uses scene-agnostic geometric formulations and Bayesian inference to speed up self-supervised online adaptation.

The predicted depth is continuously refined by a Bayesian fusion framework, and the refined depth is further used to supervise depth and optical flow during online learning.

We introduce online-learned depth and photometric uncertainties for better depth refinement and differentiable Gauss-Newton optimization.
Our method achieves much better generalization than state-of-the-art baselines when tested across different domains, including Cityscapes [cityscapes] to KITTI [kitti] and outdoor KITTI to indoor TUM [TUM]. Meanwhile, we also achieve state-of-the-art depth estimation results on the KITTI and NYUv2 [nyu] datasets.
2 Related Work
Learning-based VO has been widely studied in recent years and has shown impressive results [deepvo, beyond, deeptam]. DeepTAM [deeptam] mimics the parallel tracking and mapping framework of classic SLAM/VO by using two networks for depth and pose estimation simultaneously. Xue et al. [beyond] extend the VO pipeline with tracking, memory selection and refinement modules, which shows superior performance under challenging conditions. However, these methods require ground truth, which is often impractical to obtain. In order to alleviate the need for ground-truth data, self-supervised VO has been proposed. SfMLearner [SfMLearner] learns depth and pose simultaneously by minimizing the photometric loss between the warped and input images. Zhao et al. [towards] and Ranjan et al. [competitive] extend this idea to joint estimation of pose, depth and optical flow. Monodepth2 [monodepth2] explicitly handles non-rigid and occluded cases that violate the static-scene assumption. SAVO [savo] exploits spatial-temporal correlations over long sequences and utilizes an RNN to reduce scale drift. In this paper, we use the depth network of Monodepth2 [monodepth2] for single-view depth estimation.
Online adaptation
Most machine learning algorithms assume that the training and testing data are sampled from the same feature distribution. However, when the test data differ from the training set, most pretrained models suffer a significant reduction in performance. In this situation, online learning [domainshift, lifelong] is an effective way to address the domain shift problem. Previous methods use online gradient updates [onlineSGD] and probabilistic filtering [stream] to accelerate domain adaptation. In the computer vision field, Zhong et al. [open] propose a self-supervised framework for stereo matching in the open world. Li et al. [onlinevo] propose an online meta-learning algorithm for VO to continuously adapt to unseen environments. However, these methods learn all components with deep networks, leading to slow convergence and inferior performance. In contrast, our method combines the advantages of deep learning and well-defined geometric computations to achieve better generalization.
3D Geometric computations In classic 3D computer vision, the relative pose between two images and the scene depth can be solved analytically through multi-view geometric constraints. Given a set of correspondences, the pose can be solved by epipolar geometry [epi, sampson] with 2D-2D matching or Perspective-n-Point (PnP) [pnp] with 3D-2D matching. The depth of each correspondence can be recovered by midpoint triangulation [orb]. Alternatively, depth and pose can be solved by minimizing photometric error [DSO, LSD] via classic optimization. If more observations are available, the 3D map can be further refined by Bundle Adjustment (BA) [orb] or filtering [svo]. In this paper, we adopt a Bayesian depth fusion method to refine single-view depth estimates and propose a differentiable Gauss-Newton layer to minimize weighted photometric residuals.
3 Method
In this section, we introduce our framework in detail. The system overview is illustrated in Fig. 2. First, the FlowNet predicts dense optical flow between the keyframe and the current frame (Section 3.1), along with a photometric uncertainty map (Section 3.4) as a side output. Meanwhile, the DepthNet estimates the depth mean and uncertainty of the keyframe, providing a prior of the current scene geometry (Section 3.2). The relative pose is solved via the essential matrix or PnP from selected flow correspondences. During online adaptation, we first reconstruct the sparse depth of the keyframe with a differentiable triangulation module. The prior keyframe depth is then continuously improved by subsequent depth estimates in a Bayesian updating framework (Section 3.3). Next, the differentiable Gauss-Newton layer minimizes the photometric loss between the keyframe and the warped image, weighted by the predicted photometric uncertainty (Section 3.5). Finally, the optimized depth and flow are used as pseudo ground truth to supervise the online learning of DepthNet and FlowNet (Section 3.6).
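The following sketch summarizes this loop for each incoming frame. It only illustrates the data flow described above; all function and module names (flownet, depthnet, solve_pose, triangulate, bayesian_update, gauss_newton, total_loss, state) are hypothetical placeholders, not the authors' actual API.

```python
# A minimal sketch of one online-adaptation step, mirroring the data flow in
# Fig. 2. Every name here is a hypothetical placeholder for illustration.
def adaptation_step(keyframe, frame, flownet, depthnet, optimizer, state):
    flow, photo_unc = flownet(keyframe, frame)        # Sec. 3.1 / 3.4
    depth_mu, depth_unc = depthnet(keyframe)          # Sec. 3.2: scene prior
    pose = solve_pose(flow, depth_mu, state.K)        # essential matrix / PnP
    sparse_depth = triangulate(flow, pose, state.K)   # Sec. 3.3
    refined = bayesian_update(state.filter, sparse_depth)   # MAP depth (Eq. 5)
    pose, refined = gauss_newton(keyframe, frame, pose,
                                 refined, photo_unc)  # Sec. 3.5
    loss = total_loss(flow, depth_mu, depth_unc,
                      refined, pose, photo_unc)       # Sec. 3.6
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return pose, refined
```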
3.1 Pose recovery from optical flow
We use RAFT [raft] to learn dense optical flow between the keyframe and the current frame. The flow from the previous step is used as a prior to initialize the current flow prediction. However, the predicted flow is not accurate for all pixels, and the pose estimation error increases when the displacement becomes small. Thus we select robust correspondences with good forward-backward flow consistency and moderate flow magnitude [dfvo]:

$$\mathcal{P} = \left\{ p \;\middle|\; \big\| F_{k \to t}(p) + F_{t \to k}\big(p + F_{k \to t}(p)\big) \big\| < \delta \right\} \quad (1)$$

where $F_{k \to t}$ denotes the flow from the keyframe to the current frame and $\delta$ is the consistency threshold. We select the current frame as a new keyframe if the mean flow magnitude of the robust correspondences is larger than 30 pixels. Benefiting from this keyframe-based scheme, the motion disparity between the two frames is increased, enabling more accurate pose and depth estimation.
Given 2D correspondences between the two frames, the relative pose is computed by solving for the essential matrix with the RANSAC [ransac] algorithm:

$$\left(K^{-1} p_t\right)^{\top} E \left(K^{-1} p_k\right) = 0 \quad (2)$$

where $K$ denotes the camera intrinsics and $p_k, p_t$ are corresponding pixels. The scale of the up-to-scale pose is recovered by aligning the triangulated sparse depth (detailed in Section 3.3) with the keyframe depth. However, when confronted with small translation or pure rotation, the 2D-2D estimation fails. In these cases, we recover the pose with PnP [pnp] by minimizing the reprojection error:

$$T^{*} = \arg\min_{T} \sum_{i} \left\| \pi\big(K\, T\, X_i\big) - p_i \right\|^{2} \quad (3)$$

where $X_i$ are triangulated 3D points, $p_i$ their 2D correspondences and $\pi(\cdot)$ the projection function.
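This two-branch solver can be prototyped with OpenCV. The following is a minimal sketch under assumed interfaces (float32 point arrays of shape (N, 2) and (N, 3); the parallax threshold is an illustrative stand-in for the paper's degeneracy test), not the authors' implementation.

```python
import cv2
import numpy as np

def pose_from_correspondences(pts_key, pts_cur, K, pts_3d=None, min_parallax=1.0):
    # 2D-2D branch: essential matrix with RANSAC (Eq. 2); translation is
    # up to scale and is later aligned with the keyframe depth (Sec. 3.3).
    E, inliers = cv2.findEssentialMat(pts_key, pts_cur, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_key, pts_cur, K, mask=inliers)
    # Degenerate case (small translation / pure rotation): fall back to
    # 3D-2D PnP on previously triangulated points (Eq. 3).
    mean_disp = np.linalg.norm(pts_cur - pts_key, axis=1).mean()
    if mean_disp < min_parallax and pts_3d is not None:
        ok, rvec, tvec, _ = cv2.solvePnPRansac(pts_3d, pts_cur, K, None)
        if ok:
            R, _ = cv2.Rodrigues(rvec)
            t = tvec
    return R, t
```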
3.2 Depth modeling
In this paper, we model depth estimation and updating in a unified Bayesian framework. The inverse depth of every pixel is used since it follows a Gaussian-like distribution and is more robust to distant objects. For an inverse depth measurement $d_t$ at time $t$, we model a good measurement as a Gaussian distribution around the ground truth $\hat{d}$, while a bad one is regarded as observation noise uniformly distributed within the interval $[d_{\min}, d_{\max}]$. For every new observation, the probability of being an inlier is $\rho$. Thus $d_t$ is modeled as [svo]:

$$p\big(d_t \mid \hat{d}, \rho\big) = \rho\,\mathcal{N}\big(d_t \mid \hat{d}, \tau_t^{2}\big) + (1-\rho)\,\mathcal{U}\big(d_t \mid d_{\min}, d_{\max}\big) \quad (4)$$

where $\tau_t^{2}$ denotes the variance of a good measurement. We follow [svo] and derive the inverse depth variance from the photometric disparity error of one pixel.

During online inference, we seek the Maximum A Posteriori (MAP) estimate of $(\hat{d}, \rho)$ at each time step, which can be approximated [beta] by the product of a Gaussian distribution for $\hat{d}$ and a Beta distribution for the inlier ratio $\rho$:

$$q\big(\hat{d}, \rho\big) = \mathrm{Beta}\big(\rho \mid a_t, b_t\big)\,\mathcal{N}\big(\hat{d} \mid \mu_t, \sigma_t^{2}\big) \quad (5)$$

where $a_t, b_t$ are the parameters of the Beta distribution, and $\mu_t, \sigma_t^{2}$ the mean and variance of the Gaussian depth estimate.
The depth of the keyframe is initialized with the single-view estimate and inverse depth uncertainty from DepthNet as follows:

$$\mu_0 = \mu_{\theta}(I_k), \qquad \sigma_0^{2} = \sigma_{\theta}^{2}(I_k) \quad (6)$$

where $\mu_{\theta}$ and $\sigma_{\theta}^{2}$ denote the depth mean and uncertainty outputs of DepthNet for keyframe $I_k$.
During adaptation, DepthNet learns the prior knowledge of the new scene geometry online. Besides, the learned uncertainties also serve to gauge measurement reliability in the probabilistic depth fusion.
3.3 Online depth refinement
Given the relative pose and 2D correspondences, subsequent depth estimates of the keyframe can be calculated by two-view triangulation [orb]:

$$X^{*} = \arg\min_{X} \left[ \mathrm{dis}\big(X, r_k\big)^{2} + \mathrm{dis}\big(X, r_t\big)^{2} \right] \quad (7)$$

where $\mathrm{dis}(\cdot)$ denotes the distance between a 3D point $X$ and the two camera rays $r_k, r_t$ generated from a 2D correspondence. The midpoint triangulation is naturally differentiable, enabling our VO framework to perform end-to-end online learning.
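Midpoint triangulation admits a closed-form, fully differentiable solution: solve a 2x2 linear system for the closest points on the two rays and average them. The sketch below is our own PyTorch formulation of this textbook construction; tensor shapes and the eps guard against near-parallel rays are assumptions.

```python
import torch

def midpoint_triangulation(c1, d1, c2, d2, eps=1e-8):
    # c1, c2: (N,3) camera centers; d1, d2: (N,3) ray directions.
    # Solve the 2x2 normal equations for ray parameters (s, t) minimizing
    # || (c1 + s*d1) - (c2 + t*d2) ||^2, then return the segment midpoint.
    r = c2 - c1
    d11 = (d1 * d1).sum(-1)
    d12 = (d1 * d2).sum(-1)
    d22 = (d2 * d2).sum(-1)
    d1r = (d1 * r).sum(-1)
    d2r = (d2 * r).sum(-1)
    denom = d11 * d22 - d12 * d12          # ~0 for near-parallel rays
    s = (d22 * d1r - d12 * d2r) / (denom + eps)
    t = (d12 * d1r - d11 * d2r) / (denom + eps)
    p1 = c1 + s.unsqueeze(-1) * d1         # closest point on ray 1
    p2 = c2 + t.unsqueeze(-1) * d2         # closest point on ray 2
    return 0.5 * (p1 + p2)                 # differentiable midpoint (Eq. 7)
```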
The triangulated depth map is usually very sparse (about 2000 points), so we densify each point with a local patch, assuming the depth of each patch pixel to be the same as the central point. This patch-based representation allows a larger region for depth filtering and provides more valid gradients with a wider basin of convergence.
During online adaptation, the triangulated depth is used to update the prior depth estimate into a MAP estimate according to Eq. 5, as illustrated in Fig. 3. Meanwhile, the parameters in Eq. 5 are incrementally updated by the Bayesian formulation; the exact updating method can be found in the supplementary materials. We assume the inverse depth has converged to the ground truth once its uncertainty falls below a threshold.
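For concreteness, the sketch below reproduces the standard moment-matching update of the Gaussian x Beta approximation [beta], as used by the depth filter of [svo]. It is an assumed stand-in for the update deferred to the supplementary material, not the authors' exact derivation.

```python
import math

def update_depth_filter(f, x, tau2, d_min, d_max):
    # f: dict with keys a, b (Beta), mu, sigma2 (Gaussian) as in Eq. 5.
    # x: new inverse depth measurement; tau2: its variance;
    # [d_min, d_max]: range of the uniform outlier distribution (Eq. 4).
    a, b, mu, sigma2 = f["a"], f["b"], f["mu"], f["sigma2"]
    norm = 1.0 / math.sqrt(2.0 * math.pi * (sigma2 + tau2))
    c1 = a / (a + b) * norm * math.exp(-0.5 * (x - mu) ** 2 / (sigma2 + tau2))
    c2 = b / (a + b) / (d_max - d_min)          # uniform outlier likelihood
    c1, c2 = c1 / (c1 + c2), c2 / (c1 + c2)
    s2 = 1.0 / (1.0 / sigma2 + 1.0 / tau2)      # fused Gaussian variance
    m = s2 * (mu / sigma2 + x / tau2)           # fused Gaussian mean
    mu_new = c1 * m + c2 * mu
    sigma2_new = c1 * (s2 + m * m) + c2 * (sigma2 + mu * mu) - mu_new ** 2
    # Moment-match the Beta parameters governing the inlier ratio.
    F = c1 * (a + 1) / (a + b + 1) + c2 * a / (a + b + 1)
    E = (c1 * (a + 1) * (a + 2) / ((a + b + 1) * (a + b + 2))
         + c2 * a * (a + 1) / ((a + b + 1) * (a + b + 2)))
    a_new = (E - F) / (F - E / F)
    b_new = a_new * (1.0 - F) / F
    return {"a": a_new, "b": b_new, "mu": mu_new, "sigma2": sigma2_new}
```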
3.4 Photometric residuals with learned uncertainty
Given the estimated pose and the refined depth, one can synthesize the target view by warping the source image [SfMLearner]:

$$p_t \sim K\, T_{k \to t}\, D_k(p_k)\, K^{-1} p_k \quad (8)$$

where $p_k$ is a pixel in the keyframe, $D_k$ its refined depth and $T_{k \to t}$ the relative pose.
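Eq. 8 is typically implemented as differentiable inverse warping with grid sampling. The PyTorch sketch below backprojects keyframe pixels with the refined depth, transforms them by the relative pose, projects them into the current frame, and bilinearly samples it; shapes and conventions (T as a (B,3,4) matrix) are our assumptions.

```python
import torch
import torch.nn.functional as F

def warp_to_keyframe(img_cur, depth_key, T, K, K_inv):
    # img_cur: (B,3,H,W); depth_key: (B,1,H,W); T: (B,3,4) keyframe->current.
    B, _, H, W = depth_key.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()       # (3,H,W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(depth_key.device)   # (B,3,HW)
    cam = (K_inv @ pix) * depth_key.view(B, 1, -1)       # backproject (Eq. 8)
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], 1)   # homogeneous
    proj = K @ (T @ cam)                                  # project into source
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0                    # normalize to [-1,1]
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], -1).view(B, H, W, 2)
    return F.grid_sample(img_cur, grid, padding_mode="border", align_corners=True)
```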
However, view synthesis builds on the photometric constancy assumption, which is often violated in practice. In order to alleviate this issue, we regard these corner cases as observation noise and use a deep neural network to predict a posterior probability distribution for each RGB pixel, parametrized by a mean and variance over the ground-truth intensity. By assuming the observation noise to be Laplacian, the online learning process can be formulated as minimizing the negative log-likelihood, which converts to a weighted photometric loss:

$$\mathcal{L}_{photo} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \frac{\big| I_k(p) - \hat{I}_k(p) \big|}{\Sigma_I(p)} + \log \Sigma_I(p) \quad (9)$$

where $\Sigma_I$ denotes the photometric uncertainty map and $\hat{I}_k$ the synthesized image.
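Eq. 9 reduces to a few tensor operations; the sketch below assumes the uncertainty map is predicted as a strictly positive tensor broadcastable to the image.

```python
import torch

def weighted_photometric_loss(img_key, img_warped, sigma, eps=1e-6):
    # Laplacian negative log-likelihood (Eq. 9): residuals are down-weighted
    # where the predicted photometric uncertainty sigma is large, and the
    # log term keeps sigma from growing without bound.
    residual = (img_key - img_warped).abs()
    return (residual / (sigma + eps) + torch.log(sigma + eps)).mean()
```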
3.5 Differentiable GaussNewton optimization
Furthermore, we propose a differentiable Gauss-Newton [DSO] layer that minimizes the weighted photometric residuals for optimized depth and pose. The photometric uncertainty predicted in Eq. 9 improves robustness to illumination changes and occlusions. Specifically, starting with an initial depth and pose, we compute the weighted photometric residual for each pixel $p$ in all frames between two keyframes:

$$r_p = \frac{I_j(p') - I_i(p)}{\Sigma_I(p)}, \qquad p' = \pi\big(K\, T\, D(p)\, K^{-1} p\big) \quad (10)$$

The first-order derivatives with respect to pose $T$ and depth $D$ are:

$$J_p = \left[ \frac{\partial r_p}{\partial T}, \; \frac{\partial r_p}{\partial D} \right] \quad (11)$$

Thus the increment to the current estimate is:

$$\delta = -\left(J^{\top} J\right)^{-1} J^{\top} r \quad (12)$$

where $J$ denotes the stack of Jacobians $J_p$ and $r$ the stack of weighted photometric residuals $r_p$. The Gauss-Newton algorithm is naturally differentiable, and we implement it as a layer in the neural network. In practice, we find that it converges within only 3 iterations.
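A single damped Gauss-Newton update (Eq. 12) built from differentiable torch operations, so gradients can flow back into the depth and uncertainty predictions; the damping term is our addition for numerical safety.

```python
import torch

def gauss_newton_step(residuals, jacobian, damping=1e-6):
    # residuals: (N,) stacked weighted photometric residuals r_p (Eq. 10)
    # jacobian:  (N,D) stacked Jacobians J_p w.r.t. pose and depth (Eq. 11)
    JtJ = jacobian.t() @ jacobian
    JtJ = JtJ + damping * torch.eye(JtJ.shape[0], device=JtJ.device)
    Jtr = jacobian.t() @ residuals
    return torch.linalg.solve(JtJ, -Jtr)   # increment delta (Eq. 12)
```

In the paper's setting, this step is applied about three times per keyframe pair.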
3.6 Loss functions
We use the following loss functions to train DepthNet and FlowNet online in a self-supervised manner.
Smoothness loss We introduce an edge-aware loss for depth and flow to enforce local smoothness:

$$\mathcal{L}_{s} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left| \nabla X(p) \right| \cdot e^{-\left| \nabla I(p) \right|} \quad (13)$$

where $X$ denotes the optical flow or depth map and $\nabla$ the spatial gradient.
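Eq. 13 in code, using the common first-difference formulation (an assumption consistent with Monodepth-style losses):

```python
import torch

def edge_aware_smoothness(x, img):
    # x: (B,C,H,W) depth or flow map; img: (B,3,H,W) reference image.
    # Gradients of x are attenuated where the image itself has strong edges.
    dx = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs()
    dy = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs()
    ix = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    iy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```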
Depth loss We derive a loss function for depth by evaluating the negative log-likelihood of the estimated inverse depth with the uncertainty defined in Eq. 6. This allows the network to attenuate the cost of difficult regions and to focus on well-explained parts. We assume a Laplacian distribution of inverse depth residuals:

$$p\big(d^{*} \mid d, \sigma_d\big) = \frac{1}{2\sigma_d} \exp\left( -\frac{\left| d^{*} - d \right|}{\sigma_d} \right) \quad (14)$$

We use the refined inverse depth as the target $d^{*}$ for self-supervision. Thus the negative log-likelihood becomes:

$$\mathcal{L}_{d} = \frac{\left| d^{*} - d \right|}{\sigma_d} + \log \sigma_d \quad (15)$$
Intuitively, the network tunes the depth uncertainty to best minimize the depth loss while being subject to the regularization term $\log \sigma_d$. In order to enforce depth continuity, we modify Eq. 15 by also penalizing the gradient of the inverse depth residuals:

$$\mathcal{L}_{d} = \frac{\left| d^{*} - d \right| + \left| \nabla d^{*} - \nabla d \right|}{\sigma_d} + \log \sigma_d \quad (16)$$
Flow loss The optimized depth and pose can be used to synthesize a rigid optical flow $F_{rigid}$ by taking the difference between the warped pixel coordinates and the original ones. We use it to supervise FlowNet during online adaptation:

$$\mathcal{L}_{f} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left\| F(p) - F_{rigid}(p) \right\|_{1} \quad (17)$$
Photometric loss $\mathcal{L}_{photo}$ is defined in Eq. 9. Thus the total self-supervised loss is:

$$\mathcal{L} = \mathcal{L}_{photo} + \lambda_{s} \mathcal{L}_{s} + \lambda_{d} \mathcal{L}_{d} + \lambda_{f} \mathcal{L}_{f} \quad (18)$$

where $\lambda_{s}, \lambda_{d}, \lambda_{f}$ are weighting factors.
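Eq. 18 as code; this section does not state the weight values, so those below are placeholders.

```python
def total_loss(l_photo, l_smooth, l_depth, l_flow,
               lam_s=1e-3, lam_d=1.0, lam_f=1e-2):  # placeholder weights
    # Weighted sum of Eq. 18 over the four self-supervised terms.
    return l_photo + lam_s * l_smooth + lam_d * l_depth + lam_f * l_flow
```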
4 Experiments
4.1 Implementation details
Network Architectures Since our method focuses on improving the online adaptation of deep VO to achieve better generalization, we adopt networks similar to existing self-supervised VO methods. For DepthNet, we use the same architecture as Monodepth2 [monodepth2] and add a convolution layer at the output to predict the depth uncertainty map. The optical flow network is based on RAFT [raft], with an additional convolution + Sigmoid layer at the output to predict the photometric uncertainty.
Learning Settings
Our model is implemented in PyTorch [pytorch] on a single NVIDIA RTX 2080 Ti. Images are resized to a fixed resolution for the KITTI [kitti] and Cityscapes [cityscapes] datasets, and to a different one for the TUM dataset [TUM]. FlowNet and DepthNet are pretrained in a self-supervised manner following [competitive]. The Adam [adam] optimizer is used, and the learning objective (Eq. 18) drives both pretraining and online adaptation with a fixed learning rate. The uncertainty maps are also jointly trained by minimizing Eq. 18. During online adaptation, we retrain FlowNet and DepthNet for 2 iterations at every time step.
4.2 Cityscapes to KITTI
First, we test the generalization ability of our framework across different outdoor environments. We pretrain our method on the Cityscapes [cityscapes] dataset and test on the KITTI [kitti] dataset, which differ not only in scene content and white balance but also in camera intrinsics. We compare with recent self-supervised VO baselines: GeoNet [GeoNet], Vid2Depth [vid2depth], Zhan et al. [deepvofeat], SAVO [savo] and Li et al. [onlinevo], as well as classic methods: ORB-SLAM2 [orb] (with and without loop closure) and VISO2 [viso2]. Besides, we compare with Zhao et al. [towards] and DFVO [dfvo], state-of-the-art methods that combine the output of pretrained networks with a classic VO pipeline.
For pose estimation, we evaluate on 11 KITTI sequences with ground-truth poses [GeoNet]. It is worth noting that all the other VO baselines are pretrained on KITTI, while our method is pretrained only on Cityscapes and directly tested on KITTI. Despite this unfavorable setting, our method achieves state-of-the-art results even compared with ORB-SLAM2 (LC) (shown in Table 1 and Fig. 4). Meanwhile, unlike most self-supervised VO baselines, our method maintains a consistent scale over the entire trajectory. Thus, instead of calculating the absolute trajectory error (ATE) on short sequences as previous methods do, we align trajectories with the ground truth [kitti] by a single scaling factor and compute translation/rotation errors over the entire trajectory.
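The single-factor alignment described above has a closed-form least-squares solution; a sketch, with trajectories as (N, 3) arrays assumed to be already associated and origin-aligned:

```python
import numpy as np

def align_single_scale(pred_xyz, gt_xyz):
    # Least-squares scale factor: s = <gt, pred> / <pred, pred>
    s = np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz * pred_xyz)
    return s * pred_xyz
```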
Our method outperforms all the other baselines (both end-to-end learning and combinations with geometric computation) by a clear margin. The rotation and translation errors are an order of magnitude smaller than those of the other self-supervised baselines, indicating that coupling pose, depth and scale estimation with probabilistic geometric computation is much better than pure learning-based inference. As for the classic baselines, ORB-SLAM2 performs local map tracking with bundle adjustment (BA), and ORB-SLAM2 (LC) processes the entire sequence with loop closure, pose graph optimization and global BA to ensure good performance. Our method does not use any backend optimization techniques, yet it still achieves results comparable to ORB-SLAM2 (LC).
Table 1: Visual odometry results on KITTI. Each cell reports translation error / rotation error (lower is better); the D3VO row reports one value per sequence.

| Method | Seq.00 | Seq.01 | Seq.02 | Seq.03 | Seq.04 | Seq.05 | Seq.06 | Seq.07 | Seq.08 | Seq.09 | Seq.10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vid2Depth [vid2depth] | 59.97 / 22.59 | 9.34 / 4.18 | 55.20 / 14.61 | 27.02 / 10.39 | 1.89 / 1.19 | 51.14 / 21.86 | 58.07 / 26.83 | 51.21 / 36.64 | 45.82 / 18.10 | 44.52 / 12.11 | 21.45 / 12.50 |
| GeoNet [GeoNet] | 27.60 / 5.72 | 12.25 / 4.15 | 42.21 / 6.14 | 19.21 / 9.78 | 9.09 / 7.55 | 20.12 / 7.67 | 9.28 / 4.34 | 8.27 / 5.93 | 18.59 / 7.85 | 23.94 / 9.81 | 20.73 / 9.10 |
| Zhan et al. [deepvofeat] | 6.23 / 2.44 | 23.78 / 1.75 | 6.59 / 2.26 | 15.76 / 10.62 | 3.14 / 2.02 | 4.94 / 2.34 | 5.80 / 2.06 | 6.49 / 3.56 | 5.45 / 2.39 | 11.89 / 3.62 | 12.82 / 3.40 |
| SAVO [savo] | 18.67 / 3.12 | 9.86 / 1.23 | 17.58 / 4.29 | 15.01 / 6.54 | 3.35 / 1.18 | 9.82 / 2.53 | 5.27 / 4.30 | 9.85 / 4.03 | 21.37 / 3.65 | 9.52 / 3.64 | 6.45 / 2.41 |
| Li [onlinevo] | 8.42 / 3.91 | 17.36 / 4.60 | 14.38 / 2.62 | 18.24 / 0.92 | 3.28 / 4.40 | 7.58 / 3.31 | 4.36 / 2.28 | 5.58 / 3.12 | 7.51 / 2.63 | 5.89 / 3.34 | 4.79 / 0.83 |
| VISO2 [viso2] | 12.66 / 2.73 | 41.93 / 7.68 | 9.47 / 1.19 | 3.93 / 2.21 | 2.50 / 1.78 | 15.10 / 3.65 | 6.80 / 1.93 | 10.80 / 4.67 | 14.82 / 2.52 | 3.69 / 1.25 | 21.01 / 3.26 |
| DFVO [dfvo] | 2.25 / 0.58 | 66.98 / 17.04 | 3.60 / 0.52 | 2.67 / 0.50 | 1.43 / 0.29 | 1.10 / 0.30 | 1.03 / 0.30 | 0.97 / 0.27 | 1.60 / 0.32 | 2.61 / 0.29 | 2.29 / 0.37 |
| D3VO [d3vo] (stereo) | – | 1.07 | 0.80 | – | – | – | 0.67 | – | 1.00 | 0.78 | 0.62 |
| Zhao [towards] | 4.45 / 1.13 | 62.54 / 2.71 | 4.64 / 0.91 | 6.86 / 1.26 | 4.76 / 3.31 | 2.93 / 0.90 | 3.48 / 1.32 | 2.57 / 1.21 | 5.09 / 1.19 | 6.81 / 0.72 | 4.39 / 1.05 |
| ORB-SLAM2 [orb] | 11.43 / 0.58 | 107.57 / 0.89 | 10.34 / 0.26 | 0.97 / 0.19 | 1.30 / 0.27 | 9.04 / 0.26 | 14.56 / 0.26 | 9.77 / 0.36 | 11.46 / 0.28 | 9.30 / 0.26 | 2.57 / 0.32 |
| ORB-SLAM2 (LC) | 2.35 / 0.35 | 109.10 / 0.45 | 3.32 / 0.31 | 0.91 / 0.19 | 1.56 / 0.27 | 1.84 / 0.20 | 4.99 / 0.23 | 1.91 / 0.28 | 9.41 / 0.30 | 2.88 / 0.25 | 3.30 / 0.30 |
| Ours (w/o RDS) | 4.67 / 1.28 | 6.99 / 2.83 | 4.33 / 1.05 | 8.73 / 1.14 | 3.78 / 2.09 | 4.20 / 1.98 | 5.02 / 3.61 | 7.24 / 1.11 | 3.30 / 2.78 | 7.99 / 2.53 | 5.21 / 2.87 |
| Ours (w/o PU) | 2.28 / 0.87 | 5.42 / 1.40 | 3.98 / 1.87 | 7.76 / 0.99 | 2.92 / 1.04 | 3.63 / 1.28 | 4.92 / 2.07 | 8.25 / 2.39 | 3.28 / 1.69 | 4.60 / 1.13 | 3.25 / 1.70 |
| Ours | 1.32 / 0.45 | 2.83 / 0.65 | 1.42 / 0.45 | 1.77 / 0.39 | 1.22 / 0.27 | 1.07 / 0.44 | 1.02 / 0.41 | 2.06 / 1.18 | 1.50 / 0.42 | 1.87 / 0.46 | 1.93 / 0.30 |
Table 2: Results on TUM indoor sequences (X: tracking failure).

| Sequence | Vid2Depth [vid2depth] | GeoNet [GeoNet] | Zhan [deepvofeat] | SAVO [savo] | Li [onlinevo] | DFVO [dfvo] | Zhao [towards] | DSO [DSO] | ORB-SLAM2 (LC) [orb] | Ours (w/o RDS) | Ours (w/o PU) | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fr2/desk | 0.698 | 0.462 | 0.570 | 0.402 | 0.214 | 0.306 | 0.485 | X | X | 0.158 | 0.572 | 0.221 |
| fr2/pioneer_360 | 0.581 | 0.662 | 0.453 | 0.402 | 0.218 | 0.599 | 0.693 | X | X | 0.201 | 0.638 | 0.254 |
| fr2/pioneer_slam | 0.367 | 0.301 | 0.309 | 0.338 | 0.190 | 0.585 | 0.354 | 0.737 | X | 0.176 | 0.481 | 0.210 |
| fr2/360_kidnap | 0.564 | 0.579 | 0.430 | 0.421 | 0.357 | 0.745 | 0.468 | X | 0.582 | 0.384 | 0.605 | 0.371 |
| fr3/cabinet | 0.492 | 0.282 | 0.316 | 0.281 | 0.272 | 0.447 | 0.227 | X | X | 0.213 | 0.453 | 0.276 |
| fr3/long_office_hou_valid | 0.401 | 0.316 | 0.327 | 0.297 | 0.237 | 0.227 | 0.534 | 0.327 | 0.042 | 0.133 | 0.529 | 0.168 |
| fr3/nostr_texture_near_loop | 0.328 | 0.277 | 0.340 | 0.440 | 0.255 | 0.564 | 0.348 | 0.093 | 0.057 | 0.159 | 0.401 | 0.186 |
| fr3/str_notexture_far | 0.227 | 0.258 | 0.235 | 0.216 | 0.177 | 0.505 | 0.175 | 0.543 | X | 0.104 | 0.432 | 0.201 |
| fr3/str_notexture_near | 0.235 | 0.198 | 0.217 | 0.204 | 0.128 | 0.603 | 0.218 | 0.481 | X | 0.207 | 0.579 | 0.224 |
4.3 Outdoor KITTI to indoor TUM
In order to further evaluate the generalization ability in more complex indoor environments, we test on the TUM [TUM] dataset using networks pretrained on KITTI. The TUM indoor dataset contains much more complicated motion patterns and challenging conditions. As shown in Table 2 and Fig. 5, learning-based baselines produce large errors when confronted with the significant domain shift and different motion patterns (from fast planar motion to small motion along all axes). On the contrary, our method yields promising results due to fast online adaptation. Besides, our method is more robust than the classic methods (ORB-SLAM2 [orb] and DSO [DSO]) under textureless scenes, abrupt motion and illumination changes, indicating that it finds robust correspondences and learns depth/photometric uncertainties online in challenging conditions.
4.4 Depth evaluation on KITTI and NYUv2
We demonstrate the effectiveness of using the optimized depth for self-supervision by evaluating different single-view depth estimation methods on the KITTI [kitti] and NYUv2 [nyu] datasets. We only use triangulation and Bayesian updating for training. During testing, our method predicts single-view depth without refinement. For KITTI, we adopt the Eigen [eigen] split for training and testing. For NYUv2, we use the raw training set and evaluate depth prediction on the labeled test set. The predicted depth is multiplied by a scaling factor to match the median of the ground truth [eigen].
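The median scaling used here follows the standard protocol of [eigen]; in code (valid-pixel masking omitted for brevity):

```python
import numpy as np

def median_scale(pred_depth, gt_depth):
    # Per-image scale factor matching the prediction's median to ground truth.
    return pred_depth * (np.median(gt_depth) / np.median(pred_depth))
```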
Tables 3 and 4 and Fig. 6 show the depth evaluation results on the KITTI and NYUv2 datasets. Benefiting from the patch-based depth triangulation and the multi-frame refinement process, our method is able to synthesize refined depth for self-supervision. The learned depth is more accurate and preserves sharper edges with finer details than other methods. More qualitative results and analysis can be found in the supplementary materials.
Table 3: Single-view depth results on the KITTI Eigen split. Error metrics (Abs Rel, Sq Rel, RMSE, RMSE log) are lower-is-better; the three accuracy metrics (δ thresholds) are higher-is-better.

| Method | Supervision | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|---|
| SfMLearner [SfMLearner] | – | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |
| Garg [garg] | stereo | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 |
| Vid2Depth [vid2depth] | – | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| GeoNet [GeoNet] | – | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |
| Zhan [deepvofeat] | stereo | 0.135 | 1.132 | 5.585 | 0.229 | 0.820 | 0.933 | 0.971 |
| Mahjourian [mah] | – | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| SAVO [savo] | – | 0.150 | 1.127 | 5.564 | 0.229 | 0.823 | 0.936 | 0.974 |
| SC-SfMLearner [scsfm] | – | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975 |
| Zhao [towards] | – | 0.113 | 0.704 | 4.581 | 0.184 | 0.871 | 0.961 | 0.984 |
| Monodepth2 [monodepth2] (w/o pretrain) | – | 0.132 | 1.044 | 5.142 | 0.210 | 0.845 | 0.948 | 0.977 |
| Monodepth2 (ImageNet pretrain) | – | 0.115 | 0.882 | 4.701 | 0.190 | 0.879 | 0.961 | 0.982 |
| Ranjan [competitive] | – | 0.148 | 1.149 | 5.464 | 0.226 | 0.815 | 0.935 | 0.973 |
| Ours (w/o RDS) | – | 0.136 | 1.087 | 5.118 | 0.210 | 0.843 | 0.952 | 0.980 |
| Ours (w/o PU) | – | 0.115 | 0.799 | 4.282 | 0.253 | 0.882 | 0.965 | 0.981 |
| Ours | – | 0.106 | 0.701 | 4.129 | 0.210 | 0.889 | 0.967 | 0.984 |
Table 4: Single-view depth results on NYUv2. Columns group into error metrics (Rel, log10, RMSE; lower is better) and accuracy metrics (δ thresholds; higher is better).

| Method | Rel | log10 | RMSE | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|
| Make3D [make3d] | 0.349 | – | 1.214 | 0.447 | 0.745 | 0.987 |
| Li [libo] | 0.232 | 0.094 | 0.821 | 0.621 | 0.886 | 0.968 |
| MSCRF [mscrf] | 0.121 | 0.052 | 0.586 | 0.811 | 0.954 | 0.987 |
| DORN [dorn] | 0.115 | 0.051 | 0.509 | 0.828 | 0.965 | 0.992 |
| Zhou [moving] | 0.208 | 0.086 | 0.712 | 0.674 | 0.900 | 0.968 |
| Zhao [towards] | 0.201 | 0.085 | 0.708 | 0.687 | 0.903 | 0.968 |
| P²Net* [p2net] | 0.147 | 0.062 | 0.553 | 0.801 | 0.951 | 0.987 |
| Ours (w/o RDS) | 0.225 | 0.090 | 0.702 | 0.711 | 0.882 | 0.970 |
| Ours (w/o PU) | 0.142 | 0.087 | 0.631 | 0.784 | 0.923 | 0.976 |
| Ours | 0.139 | 0.071 | 0.528 | 0.805 | 0.967 | 0.989 |
4.5 Ablation studies
In order to demonstrate the effectiveness of each component, we present ablation studies on variants of our method on the KITTI, TUM and NYUv2 datasets (Tables 1, 2, 3, 4). 'w/o RDS' denotes removing the final step of retraining both DepthNet and FlowNet. Pose and depth estimation improve considerably when the refined depth is used to train DepthNet online. Besides, KITTI contains many moving objects (cars, people), and all these datasets include many sequences with changing camera exposure; accordingly, the online-learned photometric uncertainty (ablated in 'w/o PU') helps substantially on KITTI and TUM for pose estimation. We refer readers to the supplementary materials for more qualitative comparisons.
5 Conclusions
In this paper, we propose an online adaptation framework for deep VO assisted by scene-agnostic geometric computations and Bayesian inference. The predicted single-view depth is continuously improved with incoming observations by a Bayesian depth filter. Meanwhile, we explicitly model depth and photometric uncertainties to deal with observation noise. The optimized pose, depth and flow from the differentiable Gauss-Newton layer are used for online self-supervision. Extensive experiments across various environment shifts demonstrate that our method generalizes much better than state-of-the-art learning-based VO methods.
Acknowledgments This work is supported by the National Key Research and Development Program of China (2017YFB1002601) and National Natural Science Foundation of China (61632003, 61771026).