Estimating camera motion from monocular videos plays an essential role in many real-world applications, such as autonomous driving and robotics. This problem is usually solved by visual odometry (VO) or simultaneous localization and mapping (SLAM). Classic SLAM/VO methods [DSO, LSD, svo, orb] perform well in favorable conditions but often fail in challenging situations (e.g. textureless regions, dynamic objects) due to their reliance on low-level features and hand-crafted pipelines. Since deep neural networks are able to extract high-level features and infer end-to-end by learning from data, many learning-based VO methods [onlinevo, savo, GeoNet, SfMLearner] have been proposed to break through the limitations of classic SLAM/VO. Among them, self-supervised VO methods are able to jointly learn camera pose, depth and optical flow by minimizing photometric error [GeoNet], and have shown promising results in recent years.
However, learning-based VO often fails during inference when the scenes are different from the training data. The inability of pretrained VO to generalize to unseen environments limits its wide applications [onlinevo, open]. To this end, the pretrained networks are required to achieve real-time online adaptation in a self-supervised manner.
As a result, several previous works [onlinedepth, onlinevo, open] have been proposed to mitigate the domain generalization problem of stereo matching and VO. However, the performance is still much inferior to classic methods in terms of accuracy and the pretrained networks suffer from slow convergence. These methods treat VO as a black-box by learning all components (pose, depth, optical flow, etc.) but ignore well-defined geometric computations and optimization methods, which leads to slow convergence during online adaptation.
Existing deep VO methods predict depth by single-view estimation, which is an ill-posed problem [savo]. The learned depth has a strong reliance on the training dataset. During inference, the camera intrinsics, scene layouts and distances are usually different. Meanwhile, the camera pose is learned rather than calculated analytically, which requires favorable camera motion with sufficient disparity (e.g. KITTI dataset). Therefore, these methods tend to fail when faced with unseen or more complicated motion patterns. In addition, existing learning-based methods do not explicitly ensure multi-view geometric consistency during inference, which leads to large scale drift in trajectories.
In order to improve the online adaptation of VO to unseen environments, we propose a self-supervised framework that combines the advantages of deep learning and geometric computations. The proposed framework utilizes scene-agnostic 3D geometry constraints and Bayesian inference formulations to speed up online adaptation. During inference, the single-view depth estimation is used as a prior of the current scene geometry and is continuously improved with incoming observations by a probabilistic Bayesian updating framework. The refined depth is used as the Maximum A Posteriori (MAP) estimate to train DepthNet for better estimation at the next timestep. Instead of predicting pose by PoseNet, our framework solves pose analytically from optical flow and refined depth. Meanwhile, in order to deal with observation noise, the proposed method learns depth and photometric uncertainties online, which are used in the depth refinement process and the differentiable Gauss-Newton optimization, respectively. Finally, the optimized pose, depth and flow are used for online self-supervision. Our framework ensures scale consistency by exploiting multi-view geometric constraints. The well-defined scene-agnostic computation helps our VO framework achieve good generalization ability across different scene conditions. Our contributions can be summarized as follows:
We propose a generalizable deep VO that uses scene-agnostic geometric formulation and Bayesian inference to speed up self-supervised online adaptation.
The predicted depth is continuously refined by a Bayesian fusion framework, which is further used to train depth and optical flow during online learning.
We introduce online learned depth and photometric uncertainties for better depth refinement and differentiable Gauss-Newton optimization.
Our method achieves much better generalization than state-of-the-art baselines when tested across different domains, including Cityscapes [cityscapes] to KITTI [kitti] and outdoor KITTI to indoor TUM [TUM] datasets. Meanwhile, we also achieve state-of-the-art depth estimation results on the KITTI and NYUv2 [nyu] datasets.
2 Related works
Learning-based VO has been widely studied in recent years and shown impressive results [deepvo, beyond, deeptam]. DeepTAM [deeptam] mimics the framework of parallel tracking and mapping in classic SLAM/VO by using two networks for depth and pose estimation simultaneously. Xue et al. [beyond] extend the VO pipeline to tracking, memory selection and refining modules, which shows superior performance under challenging conditions. However, these methods require ground truth, which is often impractical to obtain. In order to alleviate the need for ground truth data, self-supervised VO has been proposed. SfMLearner [SfMLearner] learns depth and pose simultaneously by minimizing the photometric loss between the warped and input images. Zhao [towards] and Ranjan [competitive] extend this idea to joint estimation of pose, depth and optical flow. Monodepth2 [monodepth2] explicitly handles non-rigid and occluded cases, which violate the static-scene assumption. SAVO [savo] exploits spatial-temporal correlations over long sequences and utilizes an RNN to reduce scale drift. In this paper, we use the depth network of Monodepth2 [monodepth2] for single-view depth estimation.
Most machine learning algorithms assume that the training and testing data are sampled from the same feature distribution. However, when the test data are different from the training set, most pretrained models suffer from a significant reduction in performance. In this situation, online learning [domainshift, lifelong] is an effective method to solve the domain shift problem. Previous methods use online gradient update [onlineSGD] and probabilistic filtering [stream] to accelerate domain adaptation. In the computer vision field, Zhong [open] proposes a self-supervised framework for stereo matching in the open world. Li [onlinevo] proposes an online meta-learning algorithm for VO to continuously adapt to unseen environments. However, these methods learn all components by deep networks, leading to slow convergence and inferior performance. In contrast, our method combines the advantages of deep learning and well-defined geometric computations to achieve better generalization.
3D Geometric computations In classic 3D computer vision, the relative pose between two images and scene depth can be solved analytically by multi-view geometric constraints. Given a set of correspondences, the pose can be solved by epipolar geometry [epi, sampson] with 2D-2D matching or Perspective-n-Point (PnP) [pnp] with 3D-2D matching. The depth of each correspondence can be recovered by mid-point triangulation [orb]. On the other hand, the depth and pose can also be solved by minimizing photometric error [DSO, LSD] via classic optimizations. If more observations are available, the 3D map can be further refined by Bundle Adjustment (BA) [orb] or filtering [svo]. In this paper, we adopt a Bayesian depth fusion method to refine single-view depth estimation and propose a differentiable Gauss-Newton layer to minimize weighted photometric residuals.
In this section, we introduce our framework in detail. The system overview is illustrated in Fig. 2. Firstly, the FlowNet predicts dense optical flow between the keyframe and the current frame (Section 3.1), and predicts a photometric uncertainty map (Section 3.4) as a side output. Meanwhile, the DepthNet estimates the depth mean and uncertainty of the keyframe, providing a prior of the current scene geometry (Section 3.2). The relative pose is solved by essential matrix or PnP from selected flow correspondences. During online adaptation, we first reconstruct the sparse depth of the keyframe by a differentiable triangulation module. Then, the prior keyframe depth is continuously improved by subsequent depth estimations in a Bayesian updating framework (Section 3.3). Next, the differentiable Gauss-Newton layer minimizes the photometric loss between the keyframe and the warped image, weighted by the predicted photometric uncertainty (Section 3.5). Finally, the optimized depth and flow are used as pseudo ground truth to supervise the online learning of DepthNet and FlowNet (Section 3.6).
3.1 Pose recovery from optical flow
We use RAFT [raft] to learn dense optical flow between the keyframe and the current frame. The previously estimated optical flow is used as a prior to initialize the current flow prediction. However, the predicted flow is not accurate for all pixels, and the pose estimation error increases when the displacement becomes small. Thus we select robust correspondences with good forward-backward flow consistency and moderate flow magnitude [dfvo]:
where we set . We select the current frame as a new keyframe if the mean flow of robust correspondences is larger than 30. Benefiting from this keyframe-based scheme, the motion disparity between two frames is increased, enabling more accurate pose and depth estimation.
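As a minimal sketch of the selection rule described above, assuming the backward flow has already been sampled at the forward-warped positions: the function names and the three thresholds `fb_thresh`, `min_mag` and `max_mag` are hypothetical parameters introduced for illustration and are not the settings used in our experiments. Only the keyframe threshold of 30 comes from the text.

```python
import numpy as np

def select_robust_correspondences(flow_fw, flow_bw_warped, fb_thresh, min_mag, max_mag):
    """Boolean mask of pixels with consistent, moderate-magnitude flow.

    flow_fw:        (H, W, 2) forward flow, keyframe -> current frame
    flow_bw_warped: (H, W, 2) backward flow sampled at the forward-warped
                    positions, so fw + bw is close to zero when consistent
    fb_thresh, min_mag, max_mag: hypothetical thresholds (assumptions)
    """
    fb_error = np.linalg.norm(flow_fw + flow_bw_warped, axis=-1)
    magnitude = np.linalg.norm(flow_fw, axis=-1)
    return (fb_error < fb_thresh) & (magnitude > min_mag) & (magnitude < max_mag)

def is_new_keyframe(flow_fw, mask, mean_flow_thresh=30.0):
    """Keyframe test from the text: mean flow of robust correspondences > 30."""
    if not mask.any():
        return False
    return float(np.linalg.norm(flow_fw[mask], axis=-1).mean()) > mean_flow_thresh
```

Rejecting both very small and very large flows reflects the trade-off stated above: small displacements make pose estimation ill-conditioned, while large ones are more likely to be mispredicted.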
Given 2D correspondences between the two frames, the relative pose is computed by solving for the essential matrix with the RANSAC [ransac] algorithm:
where K denotes the camera intrinsics. The scale of the up-to-scale pose is recovered by aligning the triangulated sparse depth (detailed in Section 3.3) with the keyframe depth. However, when confronted with small translation or pure rotation, the 2D-2D estimation fails. In these cases, we recover the pose with PnP [pnp] by minimizing the reprojection error:
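For reference, the two pose solvers described in this section correspond to the following standard formulations. The notation is assumed here for illustration: p_k and p_t are homogeneous pixel coordinates of a correspondence in the keyframe and the current frame, K the intrinsics, E the essential matrix, (R, t) the relative pose with X_t = R X_k + t, \pi the perspective division, and X_i the 3D point of correspondence i obtained from the keyframe depth.

```latex
% Epipolar constraint solved inside RANSAC (2D-2D case)
p_t^{\top} K^{-\top} E \, K^{-1} p_k = 0, \qquad E = [t]_{\times} R

% Reprojection error minimized by PnP (3D-2D case)
(R, t) = \arg\min_{R,\, t} \sum_i \left\| \pi\!\left( K (R X_i + t) \right) - p_{t,i} \right\|_2^2
```

The essential matrix determines t only up to scale, which is why the scale alignment with the keyframe depth is needed in the 2D-2D case, whereas PnP inherits the scale of the 3D points directly.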
3.2 Depth modeling
In this paper, we model the depth estimation and updating in a unified Bayesian framework. The inverse depth of every pixel is used since it obeys a Gaussian-like distribution and is more robust to distant objects. For an inverse depth measurement at a given time, we model a good measurement as a Gaussian distribution around the ground truth, while a bad one is regarded as observation noise that is uniformly distributed within the inverse depth interval. Every new observation is an inlier with a probability given by the inlier ratio. Thus the measurement is modeled as a Gaussian-uniform mixture [svo]:
The Gaussian component carries the variance of a good measurement. We follow [svo] and set the inverse depth variance to that caused by a photometric disparity error of one pixel.
During online inference, we seek the Maximum A Posteriori (MAP) estimate of the inverse depth at each timestep, which can be approximated [beta] by the product of a Gaussian distribution for the inverse depth and a Beta distribution for the inlier ratio:
where the Beta distribution is described by its two parameters and the Gaussian depth estimate by its mean and variance.
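The measurement model and its posterior approximation described in this section can be written as follows. This is a sketch in the notation commonly used for this filter, and the symbols are assumptions: d_t is the inverse depth measurement at time t, \hat{d} the true inverse depth, \rho the inlier ratio, \tau_t^2 the variance of a good measurement, [d_{\min}, d_{\max}] the inverse depth interval, (a_t, b_t) the Beta parameters, and (\mu_t, \sigma_t^2) the mean and variance of the Gaussian depth estimate.

```latex
% Gaussian-uniform mixture for a single measurement
p(d_t \mid \hat{d}, \rho) = \rho \, \mathcal{N}\!\left(d_t \mid \hat{d}, \tau_t^2\right)
  + (1 - \rho) \, \mathcal{U}\!\left(d_t \mid d_{\min}, d_{\max}\right)

% Parametric approximation of the posterior
q(\hat{d}, \rho \mid a_t, b_t, \mu_t, \sigma_t^2)
  = \mathrm{Beta}(\rho \mid a_t, b_t) \, \mathcal{N}\!\left(\hat{d} \mid \mu_t, \sigma_t^2\right)
```

Under this approximation, each pixel carries only four numbers (a_t, b_t, \mu_t, \sigma_t^2), so the filter state can be stored as four dense maps and updated incrementally.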
The depth of keyframe is initialized with single-view estimation and inverse depth uncertainty from DepthNet as follows:
During adaptation, the DepthNet online learns the prior knowledge of the new scene geometry. Besides, the learned uncertainties can also serve to gauge the reliability in probabilistic depth fusion.
3.3 Online depth refinement
Given the relative pose and 2D correspondences, the subsequent depth estimation of keyframe can be further calculated by two-view triangulation [orb]:
where dis() denotes the distance between the 3D point and the two camera rays generated from the 2D correspondences. The mid-point triangulation is naturally differentiable, enabling our VO framework to perform end-to-end online learning.
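A minimal sketch of mid-point triangulation for one correspondence is given below, assuming both rays are expressed in the keyframe coordinate system. The function name and argument layout are illustrative. The linear system is singular for parallel rays, which corresponds to the pure-rotation case that is handled by PnP instead.

```python
import numpy as np

def midpoint_triangulate(o1, d1, o2, d2):
    """Midpoint of the closest points on rays o1 + s*d1 and o2 + u*d2."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Normal equations of min_{s,u} ||(o1 + s*d1) - (o2 + u*d2)||^2
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(o2 - o1) @ d1, (o2 - o1) @ d2])
    s, u = np.linalg.solve(A, b)
    # The triangulated point is halfway between the two closest points
    return 0.5 * ((o1 + s * d1) + (o2 + u * d2))
```

With the first ray origin at the keyframe camera center, the z-coordinate of the returned point is the keyframe depth of that pixel. Every step is a composition of differentiable linear algebra operations, which is what allows gradients to flow through the triangulation.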
The triangulated depth map is usually very sparse (2000 points), so we densify each point with a local patch. The depth of each patch pixel is assumed to be the same as that of the central point. The patch-based representation allows a larger region of depth filtering and provides more valid gradients with a wider basin of convergence.
During online adaptation, the triangulated depth is used to update the prior depth estimate to obtain a MAP estimate according to Eq. 5, as illustrated in Fig. 3. Meanwhile, the parameters in Eq. 5 are incrementally updated by the Bayesian formulation. The updating method can be found in the supplementary materials. We assume the inverse depth has converged to the ground truth once the uncertainty is lower than a threshold.
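Since the update equations are deferred to the supplementary materials, the following is only a sketch of what one update step may look like. It uses the moment-matching update of the Gaussian-Beta depth filter as popularized by [svo], which is an assumption about the form of the update rather than the exact rule used in our framework. All names are illustrative: `mu`, `sigma2` are the Gaussian mean and variance, `a`, `b` the Beta parameters, `x` the new inverse depth measurement with variance `tau2`, and `d_range` the width of the uniform outlier interval.

```python
import math

def update_depth_filter(mu, sigma2, a, b, x, tau2, d_range):
    """One Gaussian-Beta filter update with inverse depth measurement x."""
    # Gaussian fusion of the prior N(mu, sigma2) with the measurement N(x, tau2)
    s2 = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
    m = s2 * (mu / sigma2 + x / tau2)
    # Responsibilities of the inlier (c1) and outlier (c2) hypotheses
    var = sigma2 + tau2
    pdf = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    c1 = a / (a + b) * pdf
    c2 = b / (a + b) / d_range
    c1, c2 = c1 / (c1 + c2), c2 / (c1 + c2)
    # First and second moments of the inlier ratio, used for moment matching
    f = c1 * (a + 1.0) / (a + b + 1.0) + c2 * a / (a + b + 1.0)
    e = (c1 * (a + 1.0) * (a + 2.0) + c2 * a * (a + 1.0)) / ((a + b + 1.0) * (a + b + 2.0))
    # Moment-matched Gaussian and Beta parameters
    mu_new = c1 * m + c2 * mu
    sigma2_new = c1 * (s2 + m * m) + c2 * (sigma2 + mu * mu) - mu_new * mu_new
    a_new = (e - f) / (f - e / f)
    b_new = a_new * (1.0 - f) / f
    return mu_new, sigma2_new, a_new, b_new
```

A measurement that agrees with the prior shrinks the variance and raises the expected inlier ratio a / (a + b), whereas a distant measurement is mostly attributed to the uniform component and lowers it. The convergence test in the text then amounts to comparing the returned variance against a threshold.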
3.4 Photometric residuals with learned uncertainty
Given the estimated pose and the refined depth, one can synthesize the target image by warping the source image into the target view [SfMLearner]:
However, view synthesis builds on the photometric constancy assumption, which is often violated in practice. In order to alleviate this issue, we regard these corner cases as observation noise and use a deep neural network to predict a posterior probability distribution for each RGB pixel, parametrized by a mean and a variance over the ground truth intensity. By assuming the observation noise to be Laplacian, the online learning process can be formulated as minimizing the negative log-likelihood, which can be converted to a weighted photometric loss:
where the weighting term denotes the predicted photometric uncertainty map.
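To illustrate how a Laplacian noise assumption turns into a weighted photometric loss, the sketch below evaluates the negative log-likelihood of a Laplace distribution whose scale is the predicted uncertainty, dropping the additive constant. The function name, the argument `sigma` and the clamping constant `eps` are assumptions made for this example; the exact normalization in Eq. 9 may differ.

```python
import numpy as np

def weighted_photometric_loss(target, warped, sigma, eps=1e-6):
    """Mean Laplacian negative log-likelihood, up to an additive constant.

    target, warped: images of identical shape
    sigma:          per-pixel photometric uncertainty (Laplace scale), same shape
    """
    sigma = np.maximum(sigma, eps)          # avoid division by zero and log(0)
    residual = np.abs(target - warped)
    # |r| / sigma down-weights unreliable pixels; log(sigma) penalizes
    # the trivial solution of predicting a large uncertainty everywhere
    return float(np.mean(residual / sigma + np.log(sigma)))
```

For a fixed residual, the per-pixel loss is minimized when the predicted uncertainty equals the absolute residual, so the network is encouraged to report high uncertainty exactly where photometric constancy is violated.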
3.5 Differentiable Gauss-Newton optimization
Furthermore, we propose to use a differentiable Gauss-Newton [DSO] layer to minimize the weighted photometric loss for optimized depth and pose. The predicted photometric uncertainty in Eq. 9 improves the robustness to illumination changes and occlusions. Specifically, starting with an initial depth and pose, we compute the weighted photometric loss for each pixel in all frames between two keyframes:
The first-order derivatives with respect to depth and pose are:
Thus the increment to the current estimation is:
Here the Jacobians and the weighted photometric residuals are each stacked over all pixels. The Gauss-Newton algorithm is naturally differentiable, and we implement it as a layer in the neural network. In practice, we find that it converges within only 3 iterations.
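The iteration can be sketched generically as follows. This is a plain weighted Gauss-Newton loop on an arbitrary residual function, not our photometric layer: `residual_fn`, `jacobian_fn` and the diagonal `weights` (for example, inverse uncertainties) are placeholders introduced for illustration.

```python
import numpy as np

def gauss_newton(residual_fn, jacobian_fn, theta0, weights, num_iters=3):
    """Weighted Gauss-Newton: theta <- theta - (J^T W J)^{-1} J^T W r."""
    theta = np.asarray(theta0, dtype=float)
    W = np.diag(weights)
    for _ in range(num_iters):
        r = residual_fn(theta)            # stacked residuals, shape (N,)
        J = jacobian_fn(theta)            # stacked Jacobians, shape (N, P)
        H = J.T @ W @ J                   # Gauss-Newton approximation of the Hessian
        delta = -np.linalg.solve(H, J.T @ W @ r)
        theta = theta + delta
    return theta
```

As a usage example, fitting a line y = 2x + 1 with residuals `theta[0] * x + theta[1] - y` recovers the parameters (2, 1) from a zero initialization. Because each step consists only of matrix products and a linear solve, the same loop written with an automatic differentiation library is differentiable with respect to its inputs.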
3.6 Loss functions
We propose to use the following loss functions to online learn DepthNet and FlowNet in a self-supervised manner.
Smoothness loss We introduce an edge-aware loss for depth and flow to enforce local smoothness:
where the smoothed quantity is either the optical flow or the depth.
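A common form of edge-aware smoothness, shown here as an assumed example rather than the exact expression of our loss, penalizes first-order gradients of the predicted map and attenuates the penalty with the exponential of the negative image gradient. The function below operates on a single-channel map and a grayscale image for brevity.

```python
import numpy as np

def edge_aware_smoothness(value, image):
    """value: (H, W) depth or one flow channel; image: (H, W) grayscale image."""
    dv_x = np.abs(value[:, 1:] - value[:, :-1])
    dv_y = np.abs(value[1:, :] - value[:-1, :])
    di_x = np.abs(image[:, 1:] - image[:, :-1])
    di_y = np.abs(image[1:, :] - image[:-1, :])
    # Gradients of the prediction are penalized less across image edges
    return float(np.mean(dv_x * np.exp(-di_x)) + np.mean(dv_y * np.exp(-di_y)))
```

A depth discontinuity that coincides with an image edge therefore costs less than the same discontinuity inside a uniformly colored region.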
Depth loss We derive a loss function for depth by evaluating the negative log-likelihood of the estimated inverse depth with the uncertainty defined in Eq. 6. This allows the network to attenuate the cost of difficult regions and to focus more on well-explained parts. We assume a Laplacian distribution of inverse depth residuals:
We use the refined inverse depth as the pseudo ground truth for self-supervision. Thus the negative log-likelihood becomes:
Intuitively, the network will tune the depth uncertainty to best minimize the depth loss while being subject to the regularization term. In order to enforce depth continuity, we modify Eq. 15 to:
Flow loss The optimized depth and pose can be used to synthesize optical flow by calculating the difference between the warped coordinates and the original pixel coordinates. We use the synthesized flow to supervise FlowNet during online adaptation:
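The flow synthesis step can be sketched as follows for a static scene, assuming a pinhole camera and the pose convention X_cur = R X_key + t. The function name and argument layout are illustrative.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced by camera motion (R, t) on a static scene.

    depth: (H, W) keyframe depth; K: (3, 3) intrinsics
    Assumed convention: X_cur = R @ X_key + t
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels, (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                      # back-project to unit-depth rays
    points = rays * depth[..., None]                     # 3D points in the keyframe
    proj = (points @ R.T + t) @ K.T                      # transform and project
    warped = proj[..., :2] / proj[..., 2:3]              # perspective division
    return warped - pix[..., :2]                         # warped minus original coordinates
```

For a pure sideways translation the flow reduces to f_x t_x / z, so a focal length of 100, a translation of 0.1 and a depth of 2 give a horizontal flow of 5 pixels at every pixel.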
Photometric loss is defined in Eq. 9. Thus the total self-supervised loss is:
4.1 Implementation details
Network Architectures Since our method focuses on improving online adaptation of deep VO to achieve better generalization, we adopt networks similar to those of existing self-supervised VO methods. As for DepthNet, we use the same architecture as Monodepth2 [monodepth2] and add a convolution layer at the output to predict the depth uncertainty map. The optical flow network is based on RAFT [raft]. We add a convolution + Sigmoid layer at the output to predict the photometric uncertainty at the same time.
Our model is implemented in PyTorch [pytorch] on a single NVIDIA RTX 2080 Ti. The images are resized to one fixed resolution for the KITTI [kitti] and Cityscapes [cityscapes] datasets and to another for the TUM dataset [TUM]. The FlowNet and DepthNet are pretrained in a self-supervised manner following [competitive], using the Adam [adam] optimizer. The learning objective (Eq. 18) is used for both pretraining and online adaptation with the same learning rate. The uncertainty maps are also jointly trained by minimizing Eq. 18. During online adaptation, we retrain FlowNet and DepthNet for 2 iterations at every time step.
4.2 Cityscapes to KITTI
Firstly, we try to test the generalization ability of our framework to different outdoor environments. We pretrain our method on Cityscapes [cityscapes] dataset and test on KITTI [kitti] dataset, which differ not only in scene contents and white balance but also in camera intrinsics. We compare with recent self-supervised VO baselines: GeoNet [GeoNet], Vid2Depth [vid2depth], Zhan [deepvofeat], SAVO [savo] and Li [onlinevo] as well as classic methods: ORB-SLAM2 [orb] (with and without loop closure) and VISO2 [viso2]. Besides, we compare with Zhao [towards] and DF-VO [dfvo] which are state-of-the-art methods that combine the output of pretrained networks with classic VO pipeline.
As for pose estimation, we evaluate on 11 KITTI sequences with ground truth poses [GeoNet]. It is worth noting that all the other VO baselines are pretrained on KITTI, while our method is only pretrained on Cityscapes and directly tested on the KITTI dataset. Even under such unfavorable conditions, our method achieves state-of-the-art results, even compared with ORB-SLAM2 (LC) (shown in Table 1 and Fig. 4). Meanwhile, unlike most self-supervised VO baselines, our method maintains a consistent scale over the entire trajectory. Thus, instead of calculating absolute trajectory error (ATE) on short sequences as previous methods do, we align trajectories with ground truth [kitti] by a single scaling factor and compute translation/rotation errors on the entire trajectory.
Our method outperforms all the other baselines (including end-to-end learning and combination of geometric computation methods) by a clear margin. The rotation and translation errors are an order of magnitude smaller than the other self-supervised baselines, indicating that pose, depth and scale estimation collaborated with probabilistic geometric computation is much better than learning-based inference. As for classic baselines, ORB-SLAM2 is implemented by a local map tracking with bundle adjustment (BA) and ORB-SLAM2 (LC) processes the entire sequence with loop closure, pose graph optimization and global BA to ensure good performance. Our method doesn’t use any optimization backend techniques but it still achieves comparable results with ORB-SLAM2 (LC).
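|Method (translation / rotation error)||Seq. 00||Seq. 01||Seq. 02||Seq. 03||Seq. 04||Seq. 05||Seq. 06||Seq. 07||Seq. 08||Seq. 09||Seq. 10|
|Vid2Depth [vid2depth]||59.97 22.59||9.34 4.18||55.20 14.61||27.02 10.39||1.89 1.19||51.14 21.86||58.07 26.83||51.21 36.64||45.82 18.10||44.52 12.11||21.45 12.50|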
|GeoNet [GeoNet]||27.60 5.72||12.25 4.15||42.21 6.14||19.21 9.78||9.09 7.55||20.12 7.67||9.28 4.34||8.27 5.93||18.59 7.85||23.94 9.81||20.73 9.10|
|Zhan et al. [deepvofeat]||6.23 2.44||23.78 1.75||6.59 2.26||15.76 10.62||3.14 2.02||4.94 2.34||5.80 2.06||6.49 3.56||5.45 2.39||11.89 3.62||12.82 3.40|
|SAVO [savo]||18.67 3.12||9.86 1.23||17.58 4.29||15.01 6.54||3.35 1.18||9.82 2.53||5.27 4.30||9.85 4.03||21.37 3.65||9.52 3.64||6.45 2.41|
|Li [onlinevo]||8.42 3.91||17.36 4.60||14.38 2.62||18.24 0.92||3.28 4.40||7.58 3.31||4.36 2.28||5.58 3.12||7.51 2.63||5.89 3.34||4.79 0.83|
|VISO2 [viso2]||12.66 2.73||41.93 7.68||9.47 1.19||3.93 2.21||2.50 1.78||15.10 3.65||6.80 1.93||10.80 4.67||14.82 2.52||3.69 1.25||21.01 3.26|
|DF-VO [dfvo]||2.25 0.58||66.98 17.04||3.60 0.52||2.67 0.50||1.43 0.29||1.10 0.30||1.03 0.30||0.97 0.27||1.60 0.32||2.61 0.29||2.29 0.37|
|D3VO [d3vo] (stereo)||- -||1.07 -||0.80 -||- -||- -||- -||0.67 -||- -||1.00 -||0.78 -||0.62 -|
|Zhao [towards]||4.45 1.13||62.54 2.71||4.64 0.91||6.86 1.26||4.76 3.31||2.93 0.90||3.48 1.32||2.57 1.21||5.09 1.19||6.81 0.72||4.39 1.05|
|ORB-SLAM2 [orb]||11.43 0.58||107.57 0.89||10.34 0.26||0.97 0.19||1.30 0.27||9.04 0.26||14.56 0.26||9.77 0.36||11.46 0.28||9.30 0.26||2.57 0.32|
|ORB-SLAM2 (LC)||2.35 0.35||109.10 0.45||3.32 0.31||0.91 0.19||1.56 0.27||1.84 0.20||4.99 0.23||1.91 0.28||9.41 0.30||2.88 0.25||3.30 0.30|
|Ours (w/o RDS)||4.67 1.28||6.99 2.83||4.33 1.05||8.73 1.14||3.78 2.09||4.20 1.98||5.02 3.61||7.24 1.11||3.30 2.78||7.99 2.53||5.21 2.87|
|Ours (w/o PU)||2.28 0.87||5.42 1.40||3.98 1.87||7.76 0.99||2.92 1.04||3.63 1.28||4.92 2.07||8.25 2.39||3.28 1.69||4.60 1.13||3.25 1.70|
|Ours||1.32 0.45||2.83 0.65||1.42 0.45||1.77 0.39||1.22 0.27||1.07 0.44||1.02 0.41||2.06 1.18||1.50 0.42||1.87 0.46||1.93 0.30|
|Sequence||[vid2depth]||[GeoNet]||[deepvofeat]||[savo]||[onlinevo]||[dfvo]||[towards]||[DSO]||(LC) [orb]||(w/o RDS)||(w/o PU)|
4.3 Outdoor KITTI to indoor TUM
In order to further evaluate the generalization ability to more complex indoor environments, we test on the TUM [TUM] dataset using networks pretrained on KITTI. The TUM indoor dataset contains much more complicated motion patterns and challenging conditions. As shown in Table 2 and Fig. 5, learning-based baselines have large errors when confronted with significant domain shift and different motion patterns (from fast planar motion to small motion along the three axes). On the contrary, our method yields promising results due to fast online adaptation. Besides, our method is more robust than classic methods (ORB-SLAM2 [orb] and DSO [DSO]) in textureless scenes, abrupt motion and illumination changes, indicating that it tends to find robust correspondences and learns depth/photometric uncertainty online in challenging conditions.
4.4 Depth evaluation on KITTI and NYUv2
We demonstrate the effectiveness of using the optimized depth for self-supervision by evaluating different single-view depth estimation methods on the KITTI [kitti] and NYUv2 [nyu] datasets. We only use triangulation and Bayesian updating for training. During testing, our method predicts single-view depth without refinement. As for KITTI, we take the Eigen [eigen] split for training and testing. As for NYUv2, we use the raw training set and evaluate depth prediction results on the labeled test set. The predicted depth is multiplied by a scaling factor to match its median with that of the ground truth [eigen].
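The evaluation protocol can be sketched as follows, using the standard definitions of the reported error metrics and of the first accuracy threshold. The function name is illustrative, and treating non-positive ground truth values as invalid is an assumption of this example.

```python
import numpy as np

def evaluate_depth(pred, gt):
    """Median-scale the prediction, then compute standard depth metrics."""
    mask = gt > 0                                   # keep pixels with valid ground truth
    pred, gt = pred[mask], gt[mask]
    pred = pred * np.median(gt) / np.median(pred)   # median scaling
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta_1": float(np.mean(ratio < 1.25)),
    }
```

Median scaling removes the global scale ambiguity of monocular depth, so a prediction that is correct up to a constant factor yields zero error on all metrics.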
Tables 3 and 4 and Fig. 6 show the depth evaluation results on the KITTI and NYUv2 datasets. Benefiting from the patch-based depth triangulation and multi-frame refinement process, our method is able to synthesize refined depth for self-supervision. The learned depth is more accurate and preserves sharper edges and finer details than other methods. More qualitative results and analysis can be found in the supplementary materials.
|Method||Supervision||Abs Rel||Sq Rel||RMSE||RMSE log||δ < 1.25||δ < 1.25²||δ < 1.25³|
|Monodepth2 [monodepth2] (w/o pretrain)||-||0.132||1.044||5.142||0.210||0.845||0.948||0.977|
Monodepth2 (ImageNet pretrain)
|Ours (w/o RDS)||-||0.136||1.087||5.118||0.210||0.843||0.952||0.980|
|Ours (w/o PU)||-||0.115||0.799||4.282||0.253||0.882||0.965||0.981|
|Ours (w/o RDS)||0.225||0.090||0.702||0.711||0.882||0.970|
|Ours (w/o PU)||0.142||0.087||0.631||0.784||0.923||0.976|
4.5 Ablation studies
In order to demonstrate the effectiveness of each component, we present ablation studies on various versions of our method on the KITTI, TUM and NYUv2 datasets (shown in Tables 1-4). ‘w/o RDS’ means without the final step of retraining both DepthNet and FlowNet. The performance of pose and depth estimation improves considerably when the refined depth is used for online training of the DepthNet. Besides, KITTI contains many moving objects (cars, people), and all these datasets have many sequences with changing camera exposure time. The online learned photometric uncertainty (ablated in ‘w/o PU’) helps considerably on KITTI and TUM for pose estimation. We refer readers to the supplementary materials for more qualitative comparisons.
In this paper, we propose an online adaptation framework for deep VO with the assistance of scene-agnostic geometric computations and Bayesian inference. The predicted single-view depth is continuously improved with incoming observations by a Bayesian depth filter. Meanwhile, we explicitly model depth and photometric uncertainties to deal with the observation noise. The optimized pose, depth and flow from the differentiable Gauss-Newton layer are used for online self-supervision. Extensive experiments under various environment shifts demonstrate that our method has much better generalization ability than state-of-the-art learning-based VO methods.
Acknowledgments This work is supported by the National Key Research and Development Program of China (2017YFB1002601) and National Natural Science Foundation of China (61632003, 61771026).