1 Introduction
The joint learning of depth and relative pose from monocular videos [vijayanarasimhan2017sfm, yin2018geonet, zhou2017unsupervised] has been an active research area due to its key role in simultaneous localization and mapping (SLAM) and visual odometry
(VO) applications. Its simplicity and unsupervised nature make it a potential replacement for traditional approaches that involve complicated geometric computations. Given adjacent frames, this approach uses convolutional neural networks (CNNs) to jointly predict the depth map of the target image and the relative poses from the target image to its visible neighboring frames. With the predicted depth and relative poses, the photometric error is minimized between the original target image and the synthesized images formed by bilinear-sampling [jaderberg2015spatial] the adjacent views.
However, several existing problems hinder the performance of this approach. First, the photometric loss requires the scene to be static, without non-Lambertian surfaces or occlusions. This assumption is often violated in street-view datasets [cordts2016cityscapes, Geiger2013IJRR] with moving cars and pedestrians (see Figure 1(a) for some failure cases). To this end, we need other stable forms of supervision that are less affected when photometric consistency is invalid. Second, as monocular depth inference considers only single images, there is no guarantee that adjacent frames have consistent depth estimates. This increases the chance that the inferred outcome contains noisy depth values, and ignores information from adjacent views even when it is readily available. In addition, using pure color information is subject to the well-known gradient locality issue [bergen1992hierarchical]. When image regions with vastly different depth ranges have similar appearance (e.g. the road in Figure 1(b)), gradients inferred from photometric information cannot effectively guide the optimization, leading to erroneous patterns such as 'black holes' (erratic depth).
In this paper, we propose a novel formulation that emphasizes various consistency constraints arising from the interplay between depth and pose, seeking to resolve the photometric inconsistency issue. We derive geometric consistency supervision from sparse feature matches, which is robust to illumination changes and calibration errors. We also show that enforcing depth consistency across adjacent frames significantly improves depth estimation, with far fewer noisy pixels. The geometric information is implicitly embedded into the neural networks and brings no overhead at inference time.
The consistency of multi-view geometry has been widely applied to, and even forms the basis for, many sub-steps in SfM, from feature matching [baumberg2000reliable], view graph construction [shen2016graph, zach2010disambiguating], and motion averaging [govindu2006robustness] to bundle adjustment [triggs1999bundle]. Yet enforcing such consistency is non-trivial in learning-based settings. Instead of tweaking the network design or learning strategy, we seek a unified framework that effectively encodes geometry in different forms, and emphasize the efficacy of geometric reasoning for the resulting improvement. Our contributions are summarized as follows:
(1) We introduce traditional geometric quantities based on robust local descriptors into the learning pipeline, to complement the noisy photometric loss.
(2) We propose a simple method to enforce pairwise and trinocular depth consistency in the unsupervised setting when both depth and pose are unknown.
(3) Combined with a differentiable pixel selector mask, the proposed method outperforms previous methods for the joint learning of depth and motion from monocular sequences.
2 Related Works
Structure-from-Motion and Visual SLAM. Structure-from-Motion (SfM) [agarwal2009building] and visual SLAM aim to simultaneously recover the camera poses and 3D structure from images. Both problems are well studied, and different communities have produced practical systems [engel2017direct, mur2017orb, wu2011visualsfm] over the decades, with the latter emphasizing real-time performance. The self-supervised depth and motion learning framework derives from direct SLAM methods [engel2017direct, engel2014lsd, newcombe2011dtam]. Different from indirect methods [davison2007monoslam, konolige2008frameslam, mur2017orb] that use reliable sparse intermediate geometric quantities such as local features [rublee2011orb], direct methods optimize the geometry using dense pixels in the image. With accurate photometric calibration such as gamma and vignetting correction [kim2008robust], this formulation does not rely on sparse geometric computation and is able to generate finer-grained geometry. However, it is less robust than indirect methods when the photometric loss is not meaningful, e.g. when the scene contains moving or non-Lambertian objects.
Supervised Approaches for Learning Depth. Some early monocular depth estimation works rely on information from depth sensors [eigen2014depth, saxena2006learning] without the aid of geometric relations. Liu [liu2016learning] combine a deep CNN and a conditional random field for depth estimation from single monocular images. DeMoN [ummenhofer2017demon] is an iterative supervised approach that jointly estimates optical flow, depth, and motion. This coarse-to-fine process exploits stereopsis and produces good results given both depth and motion supervision.
Unsupervised Depth Estimation from Stereo Matching. Based on warping-based view synthesis [zitnick2004high], Garg [garg2016unsupervised] propose to learn depth using calibrated stereo camera pairs, in which per-pixel disparity is obtained by minimizing the image reconstruction loss. Godard [godard2017unsupervised] improve this training paradigm with left-right consistency checking. Pilzer [pilzer2019refine] propose knowledge distillation from cycle-inconsistency refinement. These methods use synchronized and calibrated stereo images, which are less affected by occlusion and photometric inconsistency. This task is therefore easier than ours, which uses temporal multi-view images and additionally outputs relative poses.
Unsupervised Depth and Pose Estimation. The joint unsupervised optimization of depth and pose starts from Zhou [zhou2017unsupervised] and Vijayanarasimhan [vijayanarasimhan2017sfm]. They propose similar approaches that use two CNNs to estimate depth and pose separately, and constrain the outcome with the photometric loss. Later, a series of improvements [klodt2018supervising, mahjourian2018unsupervised, wang2018learning, yin2018geonet, zhan2018unsupervised] are proposed. Wang [wang2018learning] discuss the scale ambiguity and combine the estimated depth with direct methods [steinbrucker2011real, engel2017direct]. Zhan [zhan2018unsupervised] consider warping deep features from the neural networks instead of raw pixel values. Klodt [klodt2018supervising] propose to integrate weak supervision from SfM methods. Mahjourian [mahjourian2018unsupervised] employ geometric constraints of the scene by enforcing an approximate ICP-based matching loss. In this work, we follow these good practices, with the major distinction that we incorporate gold standards from indirect methods and enforce consistency terms to achieve state-of-the-art results.
3 Method
3.1 Problem Formulation
We first formalize the problem and present effective practices employed by previous methods [klodt2018supervising, mahjourian2018unsupervised, vijayanarasimhan2017sfm, wang2018learning, yin2018geonet, zhan2018unsupervised, zhou2017unsupervised]. Given adjacent-view monocular image sequences (I_1, ..., I_N) (e.g. N = 3), the unsupervised depth and motion estimation problem aims to simultaneously estimate the depth map of the target (center) image (I_2 in the 3-view case) and the 6-DoF relative poses to the source views (T_{2→1} and T_{2→3}), using CNNs with photometric supervision.
For a source-target view pair (I_s, I_t), I_t can be inversely warped to the source frame given the estimated depth map D_t and the transformation from target to source T_{t→s}. Formally, given a pixel coordinate p_t in I_t that is co-visible in I_s, the pixel coordinate p_s in I_s is given by the following equation, which determines the warping transformation
p_s ≃ K_s T_{t→s} D_t(p_t) K_t^{-1} p_t    (1)
where ≃ denotes equality in homogeneous coordinates, K_t and K_s are the intrinsics for the input image pair, and D_t(p_t) is the depth of this particular pixel p_t in I_t.
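As a concrete illustration, Equation 1 can be sketched in a few lines of numpy (function and variable names are ours, not from the paper's code; a training pipeline applies this transform densely and batched):

```python
import numpy as np

def warp_pixel(p_t, depth, K_t, K_s, T_t2s):
    """Project a target pixel into the source view (sketch of Eq. 1).

    p_t: (u, v) pixel in the target image; depth: estimated depth D_t(p_t);
    K_t, K_s: 3x3 intrinsics; T_t2s: 4x4 target-to-source transform.
    """
    # Back-project to a 3D point in the target camera frame.
    p_h = np.array([p_t[0], p_t[1], 1.0])
    X_t = depth * np.linalg.inv(K_t) @ p_h
    # Transform into the source frame and project with the source intrinsics.
    X_s = T_t2s[:3, :3] @ X_t + T_t2s[:3, 3]
    p_s = K_s @ X_s
    return p_s[:2] / p_s[2]  # homogeneous normalization
```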
With this coordinate transformation, synthesized images I_{s→t} can be generated from the source view using the differentiable bilinear-sampling method [jaderberg2015spatial]. The unsupervised framework then minimizes the pixel error between the target view I_t and the synthesized image I_{s→t}
L_pixel = (1 / |V|) Σ_{p ∈ V} M(p) · | I_s⟨p_s(p)⟩ − I_t(p) |    (2)
where I_s⟨p_s(p)⟩, the bilinear-sampling operation that acquires the synthesized view given the relative motion and depth, represents the synthesized target image I_{s→t}(p) from the source image. I(p) denotes the function that maps an image coordinate p in image I to its pixel value. M is a binary mask that indicates whether the inverse warping falls into a valid region of the source image, and can be computed analytically given the per-pixel depth and relative transformation. |V| denotes the total number of valid pixels.
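The bilinear-sampling lookup ⟨·⟩ can be sketched as follows (a numpy illustration of the sampler in Equation 2; real pipelines use the framework's differentiable grid-sample op, and border handling is omitted here):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinear lookup at a non-integer coordinate (x, y).

    Interpolates between the four surrounding pixels; this is the
    differentiable sampler of [jaderberg2015spatial] in scalar form.
    """
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
            + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])
```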
In addition to the per-pixel error, the structural similarity (SSIM) [wang2004image] is shown to improve performance [godard2017unsupervised, yin2018geonet]; it is defined on local image patches P_t and P_{s→t} rather than on single pixels. We follow previous approaches [mahjourian2018unsupervised, yin2018geonet] to compute the SSIM loss on image patches as follows
L_SSIM = (1 / |V|) Σ_{p ∈ V} ( 1 − SSIM(P_t(p), P_{s→t}(p)) ) / 2    (3)
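A minimal sketch of the SSIM term and the resulting loss (global patch statistics instead of the usual Gaussian-weighted windows; c1 and c2 are the standard stabilizing constants, and names are ours):

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM between two image patches with intensities in [0, 1]."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_loss(p_t, p_syn):
    """Per-patch dissimilarity (1 - SSIM) / 2, the inner term of Eq. 3."""
    return (1.0 - ssim(p_t, p_syn)) / 2.0
```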
The depth map is further constrained by a smoothness loss that encourages gradients to propagate to nearby regions, mitigating the gradient locality issue [bergen1992hierarchical]. Specifically, we adopt the image-aware smoothness formulation [godard2017unsupervised, yin2018geonet], which allows sharper depth changes in edge regions
L_smooth = Σ_p | ∂_x D_t(p) | e^{−| ∂_x I_t(p) |} + | ∂_y D_t(p) | e^{−| ∂_y I_t(p) |}    (4)
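The edge-aware smoothness of Equation 4 can be sketched with forward differences (numpy; names are illustrative):

```python
import numpy as np

def smoothness_loss(disp, img):
    """Image-aware smoothness: depth gradients are down-weighted by
    exp(-|image gradient|), so depth edges are allowed where the image
    itself has edges."""
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])
    dx_i = np.abs(img[:, 1:] - img[:, :-1])
    dy_i = np.abs(img[1:, :] - img[:-1, :])
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```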
where ∂_x and ∂_y denote the 2D differential operators that compute image gradients. Optimizing a combination of the above loss terms wraps up the basic formulation of the training objectives, which forms the baseline written as
L_base = L_pixel + λ_SSIM L_SSIM + λ_smooth L_smooth    (5)
However, the basic formulation has several drawbacks. We now describe the key ingredients of our contributions.
3.2 Learning from Indirect Methods
The above view synthesis formulation relies on several important assumptions: 1) the scene being modeled should be static, without moving objects; 2) the surfaces in the scene should be Lambertian; 3) no occlusion exists between adjacent views; 4) cameras should be photometrically calibrated, a technique adopted in direct SLAM methods [engel2017direct, engel2014lsd] to compensate for vignetting [kim2008robust] and exposure time. Violation of any of these criteria leads to photometric inconsistency. The first three assumptions are inevitably violated to some extent, because it is hard to capture temporally static images with no occlusion in the real world. The fourth restriction is often neglected, as datasets provide no photometric calibration parameters.
To address these limitations, previous methods [klodt2018supervising, zhou2017unsupervised] additionally train a mask indicating whether the photometric loss is meaningful. We instead present a novel approach that tackles this issue by injecting indirect geometric information into the direct learning framework. Different from direct methods that rely on dense photometric consistency, indirect methods for SfM and visual SLAM are based on sparse local descriptors such as SIFT [wu2011visualsfm] and ORB [mur2017orb]. Local invariant features are much less affected by scale and illumination changes and can be implicitly embedded into the learning framework.
Symmetric epipolar error. Assuming the pinhole camera model, the feature matches (x_t, x_s) between the target and source views satisfy the epipolar constraint x_s^T E x_t = 0, where x_t and x_s are the calibrated homogeneous image coordinates. The loss for the feature matches under the estimated pose can be quantified using the symmetric epipolar distance [hartley2003multiple]
L_epi = Σ_{(x_t, x_s)} (x_s^T E x_t)² · [ 1 / ((E x_t)_1² + (E x_t)_2²) + 1 / ((E^T x_s)_1² + (E^T x_s)_2²) ]    (6)
where E = [t]_× R is the essential matrix computed from the estimated relative pose, [t]_× is the matrix representation of the cross product with t, and (·)_i denotes the i-th entry of a vector. We omit the sub-indices for conciseness (E for E_{t→s}, R for R_{t→s}, t for t_{t→s}).
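The symmetric epipolar distance for a single calibrated correspondence can be sketched as follows (homogeneous coordinates; function names are ours):

```python
import numpy as np

def skew(t):
    """Matrix representation [t]_x of the cross product with t."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def symmetric_epipolar_error(x1, x2, R, t):
    """Symmetric epipolar distance (inner term of Eq. 6) for calibrated
    homogeneous points x1, x2 related by X2 = R @ X1 + t."""
    E = skew(t) @ R
    l2 = E @ x1      # epipolar line of x1 in view 2
    l1 = E.T @ x2    # epipolar line of x2 in view 1
    num = (x2 @ E @ x1) ** 2
    return num * (1.0 / (l2[0] ** 2 + l2[1] ** 2)
                  + 1.0 / (l1[0] ** 2 + l1[1] ** 2))
```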
Reprojection error. The epipolar constraint does not involve depth in its formulation. To bring depth into the optimization using the feature match supervision, there are generally two options: 1) triangulate each correspondence using the optimal triangulation method [hartley2003multiple] under a Gaussian noise model, to obtain a 3D track for depth supervision; 2) back-project the 2D feature in one image using the estimated depth to obtain a 3D point, and reproject it into the other image to compute the reprojection error. We take the second option, because the estimated depth and pose are sufficient to compute the loss, and triangulation is often imprecise for ego-motion driving scenes [cordts2016cityscapes, Geiger2013IJRR] (see Figure 3 for an illustration and the Appendix for a mathematically rigorous explanation).
L_rep = Σ_{(x_t, x_s)} ‖ π( K_s ( R_{t→s} · D_t⟨x_t⟩ · K_t^{-1} x_t + t_{t→s} ) ) − x_s ‖_1    (7)
where D_t⟨x_t⟩ is the bilinearly sampled [jaderberg2015spatial] depth at the feature coordinate in the target depth map (the coordinate is not an integer), and π denotes the perspective division onto the image plane. Minimizing the reprojection error on feature matches can be viewed as creating sparse anchors between the weak geometric supervision and the estimated depth and pose. In contrast, Equation 6 does not involve the estimated depth.
Since outliers may exist if they lie close to the epipolar line, we use pairwise matches that are confirmed in three views [hartley1997defense]. Minimizing the epipolar and reprojection errors of all matches using CNNs mimics nonlinear pose estimation [bartoli2004non]. Our experiments show that this weak supervisory signal significantly improves pose estimation and is superior to other SfM supervision such as [klodt2018supervising].
3.3 Consistent Depth Estimation
In this section, we describe the depth estimation module. Previous methods, whether operating on three or five views, are pairwise approaches in essence, because loss terms are computed pairwise from the source frame to the target frame. Even though the pose network outputs all relative poses at once, it is unknown whether these relative poses are aligned to the same scale. We propose a motion-consistent depth estimation formulation to address this issue. Rather than only minimizing the loss between the target frame and adjacent source frames, our formulation also considers the depth and motion consistency between the adjacent frames themselves.
Forward-backward consistency. As shown in Figure 2, our network architecture estimates the depth map of the target image (D_2), as well as the forward and backward depth maps. Inspired by [godard2017unsupervised, poggi2018learning], which use left-right consistency on stereo images, we propose forward-backward consistency for monocular images. In addition to bilinear-sampling pixel values, it samples the estimated depth maps of the forward and backward images (D_3 and D_1). This process generates two synthesized depth maps that can be used to constrain the estimation of the target depth map D_2.
However, the availability of only monocular images makes the problem more challenging. When learning with stereo images, the images are rectified in advance, so scale ambiguity is not an issue. For monocular depth learning, in contrast, the estimated depth is determined only up to scale, so the depth scales must be aligned before constraining the depth discrepancy. We first normalize the target depth map by its mean to resolve the scale ambiguity of the target depth [wang2018learning], which determines the scale of the relative poses. Then we apply a mean alignment between the synthesized depth maps and the normalized target depth map within the corresponding region given by the analytical mask (Equation 2), and optimize the depth discrepancy
L_fb = (1 / |M|) ‖ (D̄_t − r · D̃_{s→t}) ⊙ M ‖_1,  with r = mean_M(D̄_t) / mean_M(D̃_{s→t})    (8)
where D̄_t = D_t / mean(D_t) is the normalized target depth, D̃_{s→t} is the synthesized depth map, ⊙ denotes element-wise multiplication, and the loss is averaged over all |M| valid pixels in the mask M.
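Equation 8 can be sketched as follows (numpy; mean alignment is restricted to the valid mask, and names are illustrative):

```python
import numpy as np

def depth_consistency(d_target, d_synth, mask):
    """Mean-align the synthesized depth to the normalized target depth
    inside the valid mask, then penalize the remaining discrepancy."""
    d_t = d_target / d_target.mean()  # scale-normalized target depth
    valid = mask > 0
    scale = d_t[valid].mean() / d_synth[valid].mean()
    return np.abs(d_t[valid] - scale * d_synth[valid]).mean()
```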
Multi-view consistency. The above losses are all defined on the single target image (e.g. the smoothness loss) or on image pairs, even though the input is an N-view image sequence. The pose network outputs the relative poses between the target and source images, but these relative poses are only weakly connected through the monocular depth. To strengthen the scale consistency of the triplet relation, we propose a multi-view consistency loss that penalizes the inconsistency of the forward depth and backward depth, using the target image as a bridge for scale alignment. Formally, given an image sequence (I_1, I_2, I_3) with target image I_2, and corresponding pose and depth predictions T_{2→1}, T_{2→3} and D_1, D_2, D_3, we again obtain the normalized depth maps with the scaling ratio r as used in Equation 8. The transformation from the backward image to the forward image is T_{1→3} = T_{2→3} · T_{2→1}^{-1}. The multi-view loss minimizes a depth consistency term and a photometric consistency term as
L_mv = L_pixel(I_{1→3}, I_3) + L_fb(D̄_{1→3}, D̄_3)    (9)
where I_{1→3} and D̄_{1→3} are the synthesized image and synthesized normalized depth given T_{1→3}. The sub-indices 1 and 3 are interchangeable in Equation 9. L_mv goes beyond the pairwise loss terms above because it utilizes the chained pose and pushes the two relative poses to be aligned to the same scale. This benefits monocular SLAM, since it facilitates incremental localization by aligning multi-view outputs, as we show in Section 4.4.
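The chained transform used above is a single matrix composition, assuming 4 × 4 homogeneous pose matrices (a sketch; the function name is ours):

```python
import numpy as np

def chain_pose(T_2to1, T_2to3):
    """Backward-to-forward transform for the multi-view loss:
    T_{1->3} = T_{2->3} @ inv(T_{2->1})."""
    return T_2to3 @ np.linalg.inv(T_2to1)
```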
Method  Supervision  Dataset  Cap (m)  Abs Rel  Sq Rel  RMSE  RMSE log  δ < 1.25  δ < 1.25²  δ < 1.25³
Eigen [eigen2014depth] Fine  Depth  K  80  0.203  1.548  6.307  0.282  0.702  0.890  0.958 
Liu [liu2016learning]  Depth  K  80  0.202  1.614  6.523  0.275  0.678  0.895  0.965 
Godard [godard2017unsupervised]  Stereo/Pose  K  80  0.148  1.344  5.927  0.247  0.803  0.922  0.964 
Godard [godard2017unsupervised]  Stereo/Pose  K + CS  80  0.114  0.898  4.935  0.206  0.861  0.949  0.976 
Zhou [zhou2017unsupervised] updated  No  K  80  0.183  1.595  6.709  0.270  0.734  0.902  0.959 
Zhou [zhou2017unsupervised] updated  No  K    0.185  2.170  6.999  0.271  0.734  0.901  0.959 
Klodt [klodt2018supervising]  No  K  80  0.166  1.490  5.998    0.778  0.919  0.966 
Mahjourian [mahjourian2018unsupervised]  No  K  80  0.163  1.24  6.22  0.25  0.762  0.916  0.968 
Wang [wang2018learning]  No  K  80  0.151  1.257  5.583  0.228  0.810  0.936  0.974 
Yin [yin2018geonet]  No  K  80  0.155  1.296  5.857  0.233  0.793  0.931  0.973 
Yin [yin2018geonet]  No  K    0.156  1.470  6.197  0.235  0.793  0.931  0.972 
Yin [yin2018geonet] updated  No  K + CS  80  0.149  1.060  5.567  0.226  0.796  0.935  0.975 
Ours  No  K  80  0.140  1.025  5.394  0.222  0.816  0.938  0.974 
Ours  No  K    0.140  1.026  5.397  0.222  0.816  0.937  0.974 
Ours  No  K + CS  80  0.139  0.964  5.309  0.215  0.818  0.941  0.977 
Garg [garg2016unsupervised]  Stereo/Pose  K  50  0.169  1.080  5.104  0.273  0.740  0.904  0.962 
Zhou [zhou2017unsupervised]  No  K  50  0.201  1.391  5.181  0.264  0.696  0.900  0.966 
Yin [yin2018geonet]  No  K + CS  50  0.147  0.936  4.348  0.218  0.810  0.941  0.977 
Ours  No  K  50  0.133  0.778  4.069  0.207  0.834  0.947  0.978 
3.4 Differentiable Sparse Feature Selection
Photometric inconsistency inevitably exists due to occlusion or non-Lambertian surfaces. Previous works employ an additional branch to regress an uncertainty map, which helps only marginally [zhou2017unsupervised]. Instead, we follow the explicit occlusion modeling approach [shen2019icra], which does not rely on data-driven uncertainty. We observe that photometric inconsistency, such as that caused by moving objects, usually incurs larger photometric errors (Figure 4(b)). On the other hand, image regions with small gradients offer little meaningful supervision because of the gradient locality issue [bergen1992hierarchical] (Figure 4(c)).
Therefore, we combine an error mask with a gradient mask to select meaningful sparse pixels, inspired by direct sparse odometry [engel2017direct] but fitting into the differentiable training pipeline. Given the pixel error map, we compute the error histogram and mask out pixels above a chosen percentile. We also compute the gradient mask and keep only the values above a chosen percentile. The final composite mask is the multiplication of both masks with dynamic thresholding. As shown in Figure 4(d), this masking filters out the majority of photometrically inconsistent regions, such as the moving car. The composite mask is only used for the final depth refinement, when the error suppression mask is stable; otherwise we observe a performance drop when training from scratch.
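The composite mask can be sketched as follows (the percentile thresholds here are illustrative defaults, not the paper's values, and names are ours):

```python
import numpy as np

def composite_mask(photo_err, img_grad, err_pct=80, grad_pct=20):
    """Pixel selector: drop the largest photometric errors (likely moving
    objects or occlusions) and the smallest image gradients (little
    supervisory signal), then intersect the two masks."""
    err_mask = photo_err <= np.percentile(photo_err, err_pct)
    grad_mask = img_grad >= np.percentile(img_grad, grad_pct)
    return (err_mask & grad_mask).astype(float)
```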
Our final formulation takes into account the basic losses in Equation 5, the geometric terms, as well as the consistency terms, written as
L_final = L_base + λ_epi L_epi + λ_rep L_rep + λ_fb L_fb + λ_mv L_mv    (10)
The weights for the different losses are set empirically, guided by the hyperparameters of previous methods and our own experiments. We also tried learning the optimal weighting via homoscedastic uncertainty [kendall2017geometric], but found no better result than setting the weights empirically.
4 Experiments
4.1 Training Dataset
KITTI. We evaluate our method on the KITTI datasets [Geiger2013IJRR, Menze2015CVPR], using the raw dataset with the Eigen split [eigen2014depth] for depth estimation, and the odometry dataset for pose estimation. Images are downsampled to 128 × 416 to facilitate training and provide a fair evaluation setting. For the Eigen split, we use 20129 images for training and 2214 images for validation. The 697 testing images are selected by [eigen2014depth] from 28 scenes whose images are excluded from the training set. For the KITTI odometry dataset, we follow the previous convention [yin2018geonet, zhou2017unsupervised] to train the model on sequences 00–08 and test on sequences 09–10. We further split sequences 00–08 into 18361 images for training and 2030 for validation.
Cityscapes. We also try pretraining the model on the Cityscapes [cordts2016cityscapes] dataset since starting from a pretrained model boosts the performance [zhou2017unsupervised]. The process is conducted without adding feature matches for 60k steps. 88084 images are used for training and 9659 images for validation.
4.2 Implementation Details
Data preparation. We extract SIFT [lowe2004distinctive] feature matches as weak geometric supervision using SiftGPU [wu2011visualsfm] offline. The putative matches are further filtered by geometric verification [hartley1997defense] with RANSAC [fischler1981random]. 100 feature pairs are randomly sampled and used for training. Matches are only used during training and are not needed at inference.
Learning. We implement our pipeline in TensorFlow [abadi2016tensorflow]. The depth estimation part follows [yin2018geonet], which uses ResNet50 [he2016deep] as the depth encoder. The relative pose network follows [zhou2017unsupervised, yin2018geonet]: a 7-layer CNN in which the spatial size of the feature maps is halved and the number of feature channels is doubled at each layer. If not explicitly specified, we train the neural networks on 3-view image sequences, as the photometric error accumulates for longer input sequences. We use the Adam [kingma2014adam] solver with a learning rate of 0.0001 and a batch size of 4.
Training efficiency. The proposed method takes longer per step due to the additional depth estimations and loss computations. On a single GTX 1080 Ti, training takes 0.35 s per step, compared with 0.19 s for the baseline approach based on Equation 5. Note that inference efficiency is the same as the baseline.
4.3 Depth Estimation
The evaluation of depth estimation follows previous works [mahjourian2018unsupervised, yin2018geonet, zhou2017unsupervised]. As shown in Table 1, our method achieves the best performance among all unsupervised methods that jointly learn depth and pose. Previous methods often filter the predicted depth map by setting a maximum depth of 50 m or 80 m (the ground-truth depth range is within 80 m) before computing the depth error, since distant pixels may contain prediction outliers. We also evaluate the performance without this filtering step, marked by an empty entry in the Cap (m) column. Without capping the maximum depth, [yin2018geonet, zhou2017unsupervised] become worse while our result barely changes, indicating that our consistent training yields depth predictions with little noise. Figure 5 provides a qualitative comparison of the predictions. We show the depth values (the nearer, the darker) instead of the inverse depth (disparity) parameterization, which highlights the distant areas.
Since both Klodt [klodt2018supervising] and our method use self-supervised weak supervision, we redo the experiments in [klodt2018supervising] that use self-generated poses and sparse depth maps from ORB-SLAM2 [mur2017orb] for weak supervision, fixing all other settings. We obtain slightly better statistics, which still lag behind the proposed method that uses feature matches. This implies that raw matches are more robust as a supervisory signal, whereas poses and depths computed by SfM/SLAM may introduce additional bias inherited from the PnP [lepetit2009epnp] or triangulation algorithms.
4.4 Pose Estimation
Method  Seq 09 (ATE)  Seq 10 (ATE)
ORB-SLAM2 [mur2017orb]  0.014 ± 0.008  0.012 ± 0.011
Zhou [zhou2017unsupervised] updated (5-frame)  0.016 ± 0.009  0.013 ± 0.009
Yin [yin2018geonet] (5-frame)  0.012 ± 0.007  0.012 ± 0.009
Mahjourian [mahjourian2018unsupervised], no ICP (3-frame)  0.014 ± 0.010  0.013 ± 0.011
Mahjourian [mahjourian2018unsupervised], with ICP (3-frame)  0.013 ± 0.010  0.012 ± 0.011
Klodt [klodt2018supervising] (5-frame)  0.014 ± 0.007  0.013 ± 0.009
Ours (3-frame)  0.009 ± 0.005  0.008 ± 0.007
We evaluate the performance of relative pose estimation on the KITTI odometry dataset. We observe that, with the pairwise matching supervision, motion estimation is substantially improved. We measure the Absolute Trajectory Error (ATE) over short frame snippets. The mean error and variance are averaged over the full sequence. As shown in Table 2, with the same underlying network structure, the proposed method outperforms state-of-the-art methods by a large margin.
However, the above comparison favors learning-based approaches, which only generate short pose segments, and is not fair to mature SLAM systems, which emphasize the accuracy of the full trajectory. To demonstrate that our method produces consistent pose estimation, we chain the relative poses by averaging the two overlapping frames of the 3-view snippets. We first align the chained motion with the ground-truth trajectory by estimating a similarity transformation [umeyama1991least], and then compute the average APE for each frame. As shown in Figure 6, even without global motion averaging techniques [govindu2006robustness], our method achieves comparable performance (8.82 m / 23.09 m median APE for Seq. 09/10) against monocular ORB-SLAM [mur2015orb] (36.83 m / 5.74 m median APE for Seq. 09/10) without loop closure. This provides a fair comparison in terms of the full sequence, yet by no means shows that the learning-based method has surpassed traditional VO methods. In fact, monocular ORB-SLAM with loop closure and global bundle adjustment achieves a much smaller 7.08 m median APE on Seq. 09 (Seq. 10 stays unchanged, as it has no loop).
4.5 Ablation Study
Loss configuration  Depth (KITTI raw, Eigen split)  Pose (KITTI odometry)
  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³  Seq 09  Seq 10
Baseline (Equation 5)  0.163  1.371  6.275  0.249  0.773  0.918  0.966  0.014 ± 0.009  0.012 ± 0.012
+ epipolar  0.159  1.287  5.725  0.239  0.791  0.927  0.969  0.010 ± 0.005  0.009 ± 0.008
+ reprojection  0.152  1.205  5.56  0.227  0.800  0.935  0.973  0.009 ± 0.005  0.009 ± 0.008
+ forward-backward  0.146  1.391  5.791  0.229  0.814  0.936  0.972  0.009 ± 0.005  0.008 ± 0.007
+ multi-view  0.143  1.114  5.681  0.225  0.816  0.938  0.974  0.009 ± 0.005  0.008 ± 0.007
+ mask (full)  0.140  1.025  5.394  0.222  0.816  0.938  0.974  0.009 ± 0.005  0.008 ± 0.007
Baseline (5-view)  0.169  1.607  6.129  0.255  0.779  0.917  0.963  0.014 ± 0.009  0.013 ± 0.009
+ geometric losses (5-view)  0.157  1.449  5.796  0.239  0.803  0.929  0.970  0.012 ± 0.008  0.010 ± 0.007
Performance with different modules. We conduct an ablation study to show the effect of each component. The models for depth and pose evaluation are trained solely on KITTI raw dataset and odometry dataset respectively. We choose an incremental order for the proposed techniques to avoid too many loss term combinations. As shown in Table 3, we have the following observations:


The re-implemented baseline model, using Equation 5, already surpasses several models [klodt2018supervising, mahjourian2018unsupervised, zhou2017unsupervised]. This can be attributed to the more capable depth encoder ResNet50, which is also used by [yin2018geonet].

Pose estimation is greatly improved by the epipolar loss term, showing the efficacy of raw feature matches as a weak supervisory signal. However, the improvement for depth estimation is not as significant as that for pose estimation.

The reprojection loss further improves depth inference. The improvements for pose estimation brought by ingredients other than the epipolar loss are marginal.

The forward-backward and multi-view consistency terms are the essential contributors to the improvement in depth estimation.
In summary, the epipolar geometric supervision helps the pose estimation most, while the geometric consistency terms in Section 3.3 essentially improve depth estimation.
Sequence length. The multi-view depth consistency loss boosts depth estimation. However, the performance boost could also be attributed to using longer image snippets, since similar second-order relations can be exploited by training on 5-view image sequences. We therefore further evaluate the performance of using 5-view images. As shown in Table 3, training on longer image sequences deteriorates the performance, because long sequences also contain larger photometric noise. This shows that the proposed formulation improves the results not through more data, but through the consistency embedded in the geometric relations.
4.6 Generalization on Make3D
Method  Supervision (depth / pose)  Abs Rel  Sq Rel  RMSE  log10
Karsch [karsch2014depth]†  depth  0.417  4.894  8.172  0.144
Liu [liu2014discrete]†  depth  0.462  6.625  9.972  0.161
Laina [laina2016deeper]†  depth  0.198  1.665  5.461  0.082
Godard [godard2017unsupervised]  pose (stereo)  0.443  7.112  8.860  0.142
Zhou [zhou2017unsupervised]  none  0.392  4.473  8.307  0.194
Ours  none  0.378  4.348  7.901  0.183
Generalization experiments on Make3D. The evaluation metrics are the same as those in Table 1, except the last one (log10), to conform with [karsch2014depth]. Methods marked with † are trained on Make3D. Depth estimation is evaluated with the maximum depth capped at 70 m. We use center-cropped images as in [godard2017unsupervised] and resize them to the network input resolution for inference.
To illustrate that the proposed method generalizes to datasets unseen during training, we compare against several supervised/self-supervised methods on the Make3D dataset [saxena2009make3d], using the same evaluation protocol as [godard2017unsupervised]. As shown in Table 4, our best model achieves reasonable generalization ability and even beats several supervised methods on some metrics. A qualitative comparison is shown in Figure 7.
5 Conclusion
We have presented an unsupervised pose and depth estimation pipeline that absorbs both geometric principles and learning-based metrics. We emphasize the consistency issue and propose novel ingredients that make the results more robust and reliable. Still, current learning-based methods are far from solving the SfM problem in an end-to-end fashion. Further investigations include enforcing consistency across the whole dataset, such as incorporating loop closure and bundle adjustment techniques into learning-based methods.
References
6 Appendix
6.1 Triangulation Uncertainty
As shown in Figure 3, the case with forward motion, which is often encountered in driving scenarios, renders high uncertainty in the triangulated 3D track. To model this uncertainty, we consider a simplified setting in which a point $\mathbf{x}$ on a plane is observed as two 1D image points $u_1$ and $u_2$ by two line cameras [hartley2003multiple]. A line camera projects plane points to line points, analogous to the normal pinhole camera model that projects 3D points to 2D points. Suppose the measurements $u_1$ and $u_2$ are corrupted by Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$; then the probability of obtaining $u_1$ given the 2D point $\mathbf{x}$ is

$$p(u_1 \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(u_1 - \pi_1(\mathbf{x})\right)^2}{2\sigma^2}\right), \quad (11)$$

where $\pi_1(\cdot)$ denotes the projection of the first line camera. We assume $u_2$ follows the same form of distribution; then the posterior probability of $\mathbf{x}$ given $u_1$ and $u_2$ is

$$p(\mathbf{x} \mid u_1, u_2) \propto p(u_1 \mid \mathbf{x})\, p(u_2 \mid \mathbf{x}), \quad (12)$$

assuming a uniform prior on $\mathbf{x}$ and that the two measurements $u_1$ and $u_2$ are independent and identically distributed (i.i.d.). The bias and variance of this distribution are illustrated by the shaded area in Figure 3, intuitively via the angle between the two rays, and are also discussed in [hartley2003multiple]. The variance of the triangulated point is high when the camera undergoes forward motion.
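This behavior can be probed numerically. Below is a minimal Monte-Carlo sketch (not from the paper) with two unit-focal-length line cameras observing a point on a plane: the same Gaussian measurement noise is applied under a lateral and a forward camera displacement, and the spread of the triangulated depth is compared. The noise level, point position, and baseline are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1e-3                    # std of the 1-D measurement noise (assumed)
X, Z = 0.5, 10.0                # true point on the plane (assumed)
t = 1.0                         # camera displacement (assumed)
n = 100_000                     # Monte-Carlo samples

# Lateral motion: second camera offset along X.
# u1 = X/Z, u2 = (X - t)/Z  =>  Z = t / (u1 - u2)
u1 = X / Z + rng.normal(0, sigma, n)
u2 = (X - t) / Z + rng.normal(0, sigma, n)
z_lat = t / (u1 - u2)

# Forward motion: second camera advanced along the viewing axis Z.
# u1 = X/Z, u2 = X/(Z - t)  =>  Z = u2 * t / (u2 - u1)
u1f = X / Z + rng.normal(0, sigma, n)
u2f = X / (Z - t) + rng.normal(0, sigma, n)
z_fwd = u2f * t / (u2f - u1f)

def spread(z):
    """Robust spread: half the central 68% interval (outlier-resistant)."""
    lo, hi = np.percentile(z, [16, 84])
    return (hi - lo) / 2

print(f"lateral depth spread: {spread(z_lat):.3f} m")
print(f"forward depth spread: {spread(z_fwd):.3f} m")
```

With these numbers the forward-motion spread is an order of magnitude larger than the lateral one, matching the intuition from the ray angle in Figure 3: forward motion leaves a tiny effective parallax in the denominator.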
Though using the triangulated point directly as depth supervision is also feasible, we argue that it is inferior to using the reprojection error because: 1) the depth of a triangulated track is not stable and may thus incur large gradients for distant tracks; 2) with the reprojection error, even wrong depth predictions for distant tracks incur little gradient error, which makes training more stable; 3) triangulation imposes additional computation during training, as it involves the relative pose and must be conducted online.
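To make point 2) concrete: a reprojection residual of the form u - fX/Z has derivative fX/Z² with respect to the depth Z, so its gradient shrinks quadratically for distant points, while a direct depth residual Z - Z_tri keeps a constant unit gradient regardless of distance. A small numeric sketch (the focal length, point, and depth values are assumed for illustration, not from the paper):

```python
import numpy as np

f = 720.0                                # focal length in pixels (assumed)
X = 2.0                                  # lateral offset of the 3-D point (assumed)
depths = np.array([5.0, 20.0, 80.0])     # near, mid, far (assumed)

# |d/dZ (u - f*X/Z)| = f*X / Z^2 : decays quadratically with depth.
grad_reproj = f * X / depths**2

# |d/dZ (Z - Z_tri)| = 1 : constant, so unstable far-range Z_tri
# passes its full error into the gradient.
grad_depth = np.ones_like(depths)

for z, gr, gd in zip(depths, grad_reproj, grad_depth):
    print(f"Z={z:5.1f} m  |dL_reproj/dZ|={gr:7.3f}  |dL_depth/dZ|={gd:.1f}")
```

The reprojection gradient drops from ~58 at 5m to ~0.2 at 80m, so noisy distant geometry barely perturbs training, whereas direct depth supervision would propagate the (highly uncertain) triangulated far-range error at full strength.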
6.2 Full Pose Prediction on KITTI
Accurate pose prediction is one of the significant improvements brought by the proposed method, and it is not feasible for supervised approaches [eigen2014depth, liu2016learning] or semi-supervised approaches that rely on stereo pairs [garg2016unsupervised, godard2017unsupervised]. Here we show more pose prediction comparisons with monocular ORB-SLAM [mur2015orb], one of the state-of-the-art indirect SLAM methods. We remove the loop-closure functionality from the original ORB-SLAM (denoted M-ORB-NOCL) for a fair comparison, since it is currently infeasible to apply such a technique to learning-based methods. For the proposed method, we simply average the rotation and translation of the adjacent 3-view pose predictions to obtain the full pose trajectory, fixing the first frame as the origin and incrementally aligning the 3-view poses. This simple alignment without global motion averaging [govindu2006robustness] may incur biases with respect to the starting point, but we would like to emphasize the consistency and potential of the proposed method. We evaluate the median absolute position error (in meters) on KITTI odometry sequences 00-10.
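The trajectory-building and evaluation steps described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: relative poses are taken as 4x4 transforms of frame i+1 expressed in frame i, overlapping rotation predictions are averaged with a chordal (SVD-projection) mean, and the median absolute position error is computed from the translation components after fixing frame 0 as the origin.

```python
import numpy as np

def average_rotations(Rs):
    """Chordal mean of 3x3 rotation matrices: sum, then project onto SO(3)."""
    U, _, Vt = np.linalg.svd(sum(Rs))
    R = U @ Vt
    if np.linalg.det(R) < 0:            # keep a proper rotation (det = +1)
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

def chain_trajectory(rel_poses):
    """Compose 4x4 relative transforms (pose of frame i+1 in frame i)
    into global poses, fixing the first frame as the origin."""
    traj = [np.eye(4)]
    for T in rel_poses:
        traj.append(traj[-1] @ T)
    return np.stack(traj)

def median_ape(traj, traj_gt):
    """Median absolute position error (meters) between two trajectories."""
    d = np.linalg.norm(traj[:, :3, 3] - traj_gt[:, :3, 3], axis=1)
    return float(np.median(d))

# Toy usage: a constant 1 m forward step chained over 10 frames.
step = np.eye(4)
step[2, 3] = 1.0
traj = chain_trajectory([step] * 10)
print("final position:", traj[-1][:3, 3])
```

Because poses are chained incrementally, any per-step error compounds along the sequence, which is why the text notes a possible bias with respect to the starting point in the absence of global motion averaging.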
Table 5: Full pose prediction on KITTI odometry sequences 00-10 ("F" denotes tracking failure).

Seq.                   00      01      02      03     04     05      06      07      08      09      10
#Frames               4541    1101    4661    801    271    2761    1101    1101    4071    1591    1201
Ours mAPE (m)        38.27   96.62   48.18   7.19   1.79    9.82    6.97    4.82   24.11    8.82   23.09
M-ORB-NOCL mAPE (m)  17.76     F     58.27   0.48   0.39   31.24   47.17   13.16   36.37   39.86    4.57
Ours time (s)         8.18    3.56    8.08   3.25   2.57    5.67    3.64    3.61    7.27    4.20    3.76
M-ORB-NOCL time (s) 481.83  122.30  493.95  90.59  35.18  295.41  121.89  122.20  431.92  173.14  132.71
As shown in Table 5, the proposed method achieves better performance on 7 of the 11 sequences. In addition, the proposed method is an order of magnitude faster than M-ORB-NOCL. Note that the proposed method runs on a GPU (GTX 1080 Ti) while ORB-SLAM runs on a CPU (Intel Core i7-4770K); even so, traditional methods can hardly compete with learning-based methods in terms of efficiency. Monocular ORB-SLAM fails on Seq. 01, a scene of a car moving on a highway, which is difficult for sparse feature detection and matching. In this case, learning-based methods have the advantage that a coarse result is guaranteed, even if it is inaccurate. Also note that monocular ORB-SLAM usually takes the first few frames for initialization (which we set to ), while learning-based methods do not need this initialization step. Figure 8 shows the side-by-side comparison of the full trajectories.