I Introduction
Simultaneous localization and mapping (SLAM) and visual odometry (VO) serve as the basis for many emerging technologies such as autonomous driving and virtual reality. Among implementations that rely on different sensors, the monocular approach is advantageous for mobile robots with limited budgets. Although it is sometimes unstable compared with stereo inputs or with fusing additional sensors such as IMU and GPS, it remains desirable for its low cost and broad applicability. The human visual system also serves as a proof of existence for an accurate monocular visual SLAM system: we are capable of perceiving the environment even when viewing a scene with one eye. Monocular cues such as motion parallax [8] and optical expansion [38] embed prior knowledge into depth sensing. Motivated by this biological resemblance, the joint inference of depth and relative motion [50, 42, 47] has recently attracted the attention of the visual SLAM community. Given adjacent frames, this line of work uses a CNN to predict the depth map of the target image and the relative motion from the target frame to the other source frames. With depth and pose, a source image can be projected onto the target frame to synthesize the target view, and training minimizes the error between the synthesized view and the actual image.
There are generally two sources of information that involve the interaction of depth and motion: photometric information such as intensity and color from images [6], and geometric information computed from stable local keypoints [27]. Most unsupervised or self-supervised methods for depth and motion estimation utilize an image reconstruction error based on photometric consistency. Given known camera intrinsics, this approach does not require large amounts of labelled data, making it more general and applicable to a broader range of applications. However, the unsupervised learning formulation enforces strong assumptions: the scene must be static without dynamic objects, the modeled surfaces must be Lambertian, and no occlusion may exist between adjacent views. These criteria generally do not hold in real-world scenarios, even for a very short camera baseline. For example, the state-of-the-art single-view depth estimation result is obtained by training with 3 consecutive frames, not on longer image sequences such as 5 frames, as demonstrated in several previous works [50, 47]. This implies that photometric error accumulates for wide baselines (5 rather than 3 frames), which further shows the limitation of using only photometric error as supervision. We show in this paper that self-generated geometric quantities can be implicitly embedded into the training process without breaking the simplicity of inference. Specifically, we explore intermediate geometric information, such as pairwise matching and weak geometry generated automatically, to improve the joint optimization of depth and motion. These intermediate geometric representations are much less likely to be affected by the intrinsic photometric limitations. We also analyze an intrinsic flaw of the per-pixel photometric error and propose a simple percentile mask to mitigate the problem. The method is evaluated on the KITTI dataset and achieves the best relative pose estimation performance of its kind. In addition, we demonstrate a VO system that chains and averages the predicted relative motions into a full trajectory, which even outperforms monocular ORB-SLAM2 without loop closure on KITTI Odometry Sequence 09.
II Related Works
In this section, we discuss related works on traditional VO/SLAM systems and on learning-based methods for visual odometry.
II-A Traditional visual SLAM approaches
Current state-of-the-art visual SLAM approaches can be generally characterized into two categories: indirect and direct formulations. Indirect methods approach the motion estimation problem by first computing stable intermediate geometric representations such as keypoints [31], edgelets [20] and optical flow [33]. Geometric error is then minimized over these reliable representations with either sliding-window or global bundle adjustment [40]. This is the most widely-used formulation for SLAM systems [4, 22, 31].
For visual odometry or visual SLAM (vSLAM), direct methods directly optimize the photometric error, which corresponds to the light values received by the actual sensor. Examples include [32, 7, 6]. Given accurate photometric calibration information (such as gamma correction and lens attenuation), this formulation spares the costly sparse geometric computation and can potentially generate finer-grained geometry such as per-pixel depth. However, it is less robust than indirect formulations in the presence of dynamic moving objects, reflective surfaces and inaccurate photometric calibration. Note that the self-supervised learning framework derives from the direct method.
II-B Learning Depth and Pose from Data
Most pioneering depth estimation works rely on supervision from depth sensors [35, 5]. Ummenhofer et al. [41] propose an iterative supervised approach to jointly estimate optical flow, depth and motion. This iterative process allows the use of stereopsis and gives fairly good results given depth and motion supervision.
The self-supervised approaches for structure and motion borrow ideas from warping-based view synthesis [52], a classical paradigm of which is to composite a novel view based on the underlying 3D geometry. Garg et al. [11] propose to learn depth using stereo camera pairs with known relative pose. Godard et al. [13] also rely on calibrated stereo to obtain monocular depth with left-right consistency checking. Zhan et al. [48] consider deep features from the neural nets in addition to the photometric error. The above three methods have limited usability in the monocular scenario where the pose is unknown. Zhou et al. [50] and Vijayanarasimhan et al. [42] develop similar joint learning methods for the traditional structure-from-motion (SfM) problem [37, 51], with the major difference that [42] can incorporate supervised information and directly solve for dynamic object motion. Later, [43] discuss the critical scale ambiguity issue for monocular depth estimation, which was neglected by previous works; to resolve it, the estimated depth is first normalized before being fed into the loss layer. Geometric constraints of the scene are enforced by an approximate ICP-based matching loss in [29]. For real-world applications, pose and depth estimation using CNNs have also been integrated into visual odometry systems [44, 24]. Ma et al. [28] combine sparse depth measurements with RGB data to reconstruct the full depth map. The above view-synthesis-based methods [50, 42, 24, 29] are based on the assumptions that the modeled scene is static and that the camera is carefully calibrated to remove photometric distortions such as automatic exposure changes and lens attenuation (vignetting) [18]. This problem becomes serious as most previous works train models on the KITTI [12] or Cityscapes [3] datasets, in which the camera calibration does not consider nonlinear response functions (gamma correction / white balancing) and vignetting. As the input image size is limited by GPU memory, the pixel value information is further degraded by downsampling.
These learning-based methods optimizing photometric error correspond to the direct methods [7, 6] for SLAM. Indirect methods [4, 31], on the other hand, decompose the structure and motion estimation problem by first generating an intermediate representation and then computing the desired quantities from a geometric loss. These intermediate representations, such as keypoints [27, 34], are typically stable and resilient to occlusions and photometric distortions. In this paper, we advocate importing geometric losses into the self-supervised depth and relative pose estimation problem.
III Methods
III-A Overview
Our method combines the accurate intermediate geometric representations of traditional monocular SLAM with self-supervised depth estimation to deliver a better formulation for joint depth and motion estimation. Figure 1 shows the architecture of our method, with three concatenated adjacent frames as input, and the predicted depth map of the target frame and the relative poses as output. We first give a brief overview of previous works that rely heavily on photometric errors.
Take two adjacent frames, the target frame $I_t$ and a source frame $I_s$, as an example (the case for any other source frame is the same). The pose module takes the concatenated images and outputs a 6-DoF relative pose $T_{t \to s}$ in an end-to-end fashion. The depth module, which is an encoder-decoder network, takes the target frame as input and generates the depth map for $I_t$, denoted as $D_t$.
The typical approach [11, 13, 50, 42, 47, 29, 48, 43] to unsupervised estimation of $D_t$ and $T_{t \to s}$ is to employ an image synthesis loss. Suppose $p_t$ denotes a pixel in $I_t$ that is also visible in $I_s$; its projection $p_s$ on $I_s$ is represented by
$p_s \sim K_s \, T_{t \to s} \, D_t(p_t) \, K_t^{-1} \, p_t$  (1)
where $\sim$ means ‘equal in homogeneous coordinates’ and $K_t$ and $K_s$ are the intrinsic matrices of the two images. Given this relation, we can obtain a synthesized image $\hat{I}_t$ from the source frame by bilinear sampling [17]. Depth and relative pose are then optimized through the image reconstruction loss between $I_t$ and $\hat{I}_t$. Early works usually adopt the $L_1$ loss on corresponding pixels, while later the structured similarity (SSIM) [45] is introduced to evaluate the quality of image predictions. We follow [47, 29] among others and use the combination of the $L_1$ loss and the SSIM loss as the image reconstruction loss $\mathcal{L}_{rec}$:
$\mathcal{L}_{rec} = \alpha \, \frac{1 - \mathrm{SSIM}(I_t, \hat{I}_t)}{2} + (1 - \alpha) \, \| I_t - \hat{I}_t \|_1$  (2)
where $\alpha$ is the balancing factor, which we set to 0.85 [47, 29]. This loss formulation should be accompanied by a smoothness term to resolve the gradient-locality issue in motion estimation [2] and to remove discontinuities of the learned depth in low-texture regions. We adopt the edge-aware depth smoothness loss of [47], which uses the image gradient to weight the depth gradient:
$\mathcal{L}_{smooth} = \sum_{p} |\nabla D_t(p)|^{\top} \cdot e^{-|\nabla I_t(p)|}$  (3)
where $p$ is a pixel on the depth map $D_t$ and image $I_t$, $\nabla$ denotes the 2D differential operator, and $|\cdot|$ is the element-wise absolute value.
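The reconstruction loss of Equation (2) and the edge-aware smoothness term of Equation (3) can be illustrated with a minimal NumPy sketch. For brevity this SSIM uses global image statistics, whereas practical implementations apply it over a local sliding window; the function names are illustrative, not the paper's code.

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM using global statistics; real implementations
    # compute it over a local (e.g. 3x3) window.
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def reconstruction_loss(target, synth, alpha=0.85):
    # Eq. (2): blend of SSIM dissimilarity and L1 photometric error.
    l1 = np.abs(target - synth).mean()
    return alpha * (1 - ssim(target, synth)) / 2 + (1 - alpha) * l1

def smoothness_loss(depth, image):
    # Eq. (3): depth gradients, down-weighted where the image has edges.
    loss = 0.0
    for axis in (0, 1):
        d_depth = np.abs(np.diff(depth, axis=axis))
        d_image = np.abs(np.diff(image, axis=axis))
        loss += (d_depth * np.exp(-d_image)).mean()
    return loss
```

A perfect synthesis gives zero reconstruction loss, and a constant depth map has zero smoothness penalty regardless of image content.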
III-B Geometric Error from Epipolar Geometry
One of the main reasons for the success of indirect SLAM methods is the use of stable invariants computed from the raw image input, such as keypoints and line segments. Though still computed from pixel values, descriptors for these stable image patches have strong invariance guaranteed by scale-space theory [25]. For learning-based approaches, these geometric ingredients can be preprocessed offline and implicitly integrated into CNNs. In this paper, we demonstrate that several geometric elements help overcome the intrinsic drawbacks of current approaches.
One of the fundamental building blocks of sparse-feature-based SLAM or SfM is pairwise matching with geometric verification. For a pair of overlapping images $(I_1, I_2)$ viewing the same scene with relative motion $(R, t)$, a set of feature matches $\{(p_1^i, p_2^i)\}$ in homogeneous image coordinates can be reliably obtained. Then the following epipolar geometry constraint holds:
$(p_2^i)^{\top} F \, p_1^i = 0, \quad F = K_2^{-\top} [t]_{\times} R \, K_1^{-1}$  (4)
where $F$ is the corresponding fundamental matrix, $p_1^i$ and $p_2^i$ represent the $i$-th matched points in homogeneous coordinates, $K_1$ and $K_2$ are their corresponding intrinsic matrices, and $[t]_{\times}$ is the matrix representation of the cross product with the 3-dimensional vector $t$.
We use the projection error from the first image to the second image as the supervision signal for relative pose estimation. $F p_1^i$ defines the epipolar line [14] on which $p_2^i$ must lie. Therefore, the geometric loss is defined as the sum of point-to-line distances over all (or sampled) corresponding matches:
$\mathcal{L}_{geo} = \sum_{i} d(p_2^i, F p_1^i)$  (5)
where the 2D point-to-line distance is defined by $d(p, l) = |p^{\top} l| / \sqrt{l_1^2 + l_2^2}$ for a homogeneous line $l = (l_1, l_2, l_3)^{\top}$, and the sum iterates over the corresponding image matches in adjacent frames.
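The epipolar constraint of Equation (4) and the point-to-line loss of Equation (5) can be sketched as follows. The function names are ours; the fundamental matrix is built here from a known calibration and motion purely for illustration, whereas in the paper it is estimated from feature matches.

```python
import numpy as np

def skew(t):
    # [t]_x: matrix representation of the cross product with t.
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    # Eq. (4): F = K2^{-T} [t]_x R K1^{-1}.
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

def epipolar_loss(F, p1, p2):
    # Eq. (5): sum of distances from p2 to the epipolar lines F p1.
    # p1, p2: 3xN homogeneous pixel coordinates of matched features.
    lines = F @ p1
    num = np.abs(np.sum(p2 * lines, axis=0))
    den = np.sqrt(lines[0] ** 2 + lines[1] ** 2)
    return np.sum(num / den)
```

For geometrically consistent matches the loss vanishes; perturbing the matched points makes it grow with the perpendicular pixel distance to the epipolar lines.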
III-C Other weak pairwise geometric supervisions
To incorporate geometric losses into the self-supervised framework, several intermediate geometric computations can be employed. Apart from using epipolar geometry (PairwiseMatching), indirect methods provide other forms of geometric supervision, such as the camera pose computed using perspective-n-point (PnP) algorithms [10, 23]. Since these properties can be computed offline, utilizing such weak geometric supervision still belongs to the self-supervised category. With 3D-point-to-2D-projection matches, we can obtain a set of inaccurate (weak) absolute camera poses $T'_t$ for each frame. We have explored two ways of incorporating the weak supervision. The first one, denoted DirectWeakPose, is to directly use the weak poses without explicitly learning the relative pose CNN. Since the weak poses are absolute with respect to the current scene (instead of the relative ones learned from the pose CNN), Equation (1) becomes
$p_s \sim K_s \, T'_s (T'_t)^{-1} D_t(p_t) \, K_t^{-1} \, p_t$  (6)
The second way is to use the weak pose as a prior [21], which we denote PriorWeakPose. Different from DirectWeakPose, the pose CNN is still used for relative pose estimation, while its deviation from the weak pose computed by traditional geometric methods is added to the optimization. Formally, PriorWeakPose considers one additional prior pose loss written as
$\mathcal{L}_{pose} = \lambda_r \| r - r' \| + \lambda_t \| t - t' \|$  (7)
where $(r, t)$ and $(r', t')$ are the estimated 6-DoF relative motion and the weak pose respectively, with rotations in Euler-angle form and translations normalized to unit length. $\lambda_r$ and $\lambda_t$ are the weights for the rotation part and the translation part respectively. Yet, we will show that both ways of using weak poses computed from traditional methods like [31] are inferior to the proposed method that utilizes raw feature matches.
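A minimal sketch of the prior pose loss of Equation (7), assuming poses packed as 6-vectors of Euler angles followed by translation; the packing, weights and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prior_pose_loss(pred, weak, w_rot=1.0, w_trans=1.0):
    # Eq. (7): deviation of the predicted 6-DoF motion from the weak pose.
    # pred, weak: [rx, ry, rz, tx, ty, tz] with Euler-angle rotations.
    r = np.asarray(pred[:3], dtype=float)
    t = np.asarray(pred[3:], dtype=float)
    r_w = np.asarray(weak[:3], dtype=float)
    t_w = np.asarray(weak[3:], dtype=float)
    # Normalize translations to unit length: monocular scale is undefined.
    t = t / np.linalg.norm(t)
    t_w = t_w / np.linalg.norm(t_w)
    return w_rot * np.linalg.norm(r - r_w) + w_trans * np.linalg.norm(t - t_w)
```

Note that the translation term is invariant to a global rescaling of either translation, which is exactly the scale ambiguity the normalization is meant to absorb.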
III-D Fixing the Photometric Error
As photometric error inevitably remains one of the major supervision signals, we also consider mitigating the systematic error rooted in the optimization process. To this end, we introduce a simple solution that works well in practice. Since occlusions and dynamic objects prevalently exist in images, previous works such as [50, 42] additionally train a network to mask out these erroneous regions. Yet, this approach brings only a marginal performance boost because it entangles with the depth and motion networks. Instead of learning the mask, we propose a deterministic mask that is computed on-the-fly. During training, we compute the mask based on the distribution of the image reconstruction loss, defined as
$M(p) = [\, \mathcal{L}_{rec}(p) \le P_{\eta} \,]$  (8)
where pixel positions whose photometric loss is above the $\eta$-percentile threshold are filtered out. This is based on the fact that objects or regions that do not obey the static assumption usually impose larger errors. Throughout the experiments, we fix $\eta$ to 0.99, a modest choice that filters out extremely erroneous regions while preserving much of the image content to facilitate the optimization (as shown in Figure 2). Experiments validate that this simple strategy improves depth estimation by better handling occlusions and reflections.
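The percentile mask of Equation (8) amounts to a one-line thresholding over the per-pixel loss map; the function name here is illustrative.

```python
import numpy as np

def percentile_mask(per_pixel_loss, eta=0.99):
    # Eq. (8): keep pixels whose reconstruction loss falls below the
    # eta-percentile; the worst (1 - eta) fraction, typically occlusions
    # and moving objects, is masked out of the loss.
    threshold = np.quantile(per_pixel_loss, eta)
    return (per_pixel_loss <= threshold).astype(np.float64)
```

The mask is deterministic and recomputed for every training batch, so no extra network has to be learned.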
In the end, the total loss becomes
$\mathcal{L} = \mathcal{L}_{rec} + \lambda_s \mathcal{L}_{smooth} + \lambda_g \mathcal{L}_{geo} + \lambda_p \mathcal{L}_{pose}$  (9)
where $\lambda_s$, $\lambda_g$ and $\lambda_p$ are the weights for the smoothness loss, the geometric loss and the weak pose loss respectively. The smoothness weight $\lambda_s$ is set to 0.1 throughout the evaluation. As for $\lambda_g$ and $\lambda_p$, since the weak geometric prior used in $\mathcal{L}_{pose}$ has the same functionality as the pairwise matching used for $\mathcal{L}_{geo}$, we add the two losses separately and compare their performance. We refer to the case with $\lambda_p = 0$ as the PairwiseMatching approach, and the case with $\lambda_g = 0$ as the PriorWeakPose approach. As described in Section III-C, we can also directly use the pose computed from PnP algorithms as the pose supervision. In this case (DirectWeakPose), we only train the depth network for monocular depth estimation. The performance comparison of these three approaches is shown in the ablation study.
IV Experiments
Method  Geometric Info  Cap (m)  Abs Rel  Sq Rel  RMSE  RMSE log  δ < 1.25  δ < 1.25²  δ < 1.25³
Baseline (w/o Mask)  No  80  0.171  1.512  6.332  0.250  0.764  0.918  0.967
Baseline (w/ Mask)  No  80  0.163  1.370  6.397  0.258  0.758  0.910  0.962
PairwiseMatching  Self-generated Matches  80  0.156  1.357  6.139  0.247  0.781  0.920  0.965
PriorWeakPose [21]  Self-generated Pose  80  0.163  1.371  6.275  0.249  0.773  0.918  0.967
DirectWeakPose  Self-generated Pose  80  0.162  1.46  6.27  0.249  0.775  0.919  0.965
IV-A Dataset
KITTI. We evaluate our method on the widely-used KITTI benchmark [12, 30], which includes a full set of input sources: raw images, 3D point clouds from LiDAR and camera trajectories. To conduct a fair comparison with related works, we adopt the Eigen split for the single-view depth benchmark and use the odometry sequences to evaluate visual odometry performance. All training and testing images are taken from the left camera of the stereo pair and downsampled.
Eigen Split. We evaluate single-view depth estimation on the test split of 697 images from 28 scenes, as in [5]. Images from the test scenes are excluded from the training set. Since the test scenes overlap with the KITTI odometry split (i.e., some test images of the Eigen split are contained in the KITTI odometry training set, and vice versa), we train the model solely on the Eigen split with 20129 training images and 2214 validation images.
KITTI odometry. The KITTI odometry dataset contains 11 driving sequences with ground-truth poses and depth available (and 11 sequences without ground truth). For pose estimation, we train the model on KITTI odometry sequences 00-08 and evaluate the pose error on sequences 09 and 10; 18361 images are used for training and 2030 for validation.
Cityscapes. We also try pre-training the model on the Cityscapes [3] dataset to boost performance. This is conducted without the geometric loss for 60k steps, with 88084 images for training and 9659 images for validation.
IV-B Implementation Details
Geometric Supervision. We use the SIFT descriptor for feature matching [49], which is widely used for SfM. The average number of features per image is around 8000. For weakly-supervised poses, we use the consecutive motions generated by the PnP algorithm used in stereo ORB-SLAM2 [31], which is essentially EPnP [23] with RANSAC [9]. We choose the stereo version rather than the monocular one because (1) it is more accurate (though it still cannot be viewed as ground truth) and (2) its initialization uses the initial stereo pair so that all frames get reconstructed, whereas the first few frames may be missing in the monocular version. For feature matching supervision, pairwise matching is conducted between adjacent frames and filtered by epipolar geometry using the normalized eight-point algorithm [15], which yields around 2000 fundamental matrix inliers per adjacent frame pair. We randomly sample 100 matched features of each image pair for training.
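The normalized eight-point algorithm [15] used here to verify matches can be sketched in NumPy as follows (the SIFT matching and RANSAC loop are omitted; function names are ours, not the paper's code).

```python
import numpy as np

def normalize(p):
    # Hartley normalization: move the centroid to the origin and scale
    # the mean distance from it to sqrt(2).
    c = p[:2].mean(axis=1)
    d = np.sqrt(((p[:2] - c[:, None]) ** 2).sum(axis=0)).mean()
    s = np.sqrt(2) / d
    T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
    return T @ p, T

def eight_point(p1, p2):
    # Estimate F from >= 8 homogeneous pixel matches (3xN arrays).
    p1n, T1 = normalize(p1)
    p2n, T2 = normalize(p2)
    # Each match contributes one row of the homogeneous system A f = 0.
    A = np.stack([p2n[i] * p1n[j] for i in range(3) for j in range(3)], axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)          # enforce rank 2
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1                    # undo the normalization
    return F / np.linalg.norm(F)
```

In practice this estimator runs inside RANSAC, and matches whose epipolar residual against the recovered F is small are kept as inliers.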
Learning. We implement the neural nets in the Tensorflow [1] framework. During training, we use the Adam [19] solver with the default momentum parameters, a learning rate of 0.0001 and a batch size of 4. We use ResNet50 [16] as the depth encoder and the same pose network architecture as [50]. Most training runs converge within 200k iterations. To address the gradient-locality issue, many works [50, 47] take a multi-scale approach to allow gradients to be derived from larger spatial regions. While this alleviates the problem somewhat, it also introduces new error since low-resolution images have inaccurate photometric values. We therefore use only one image scale for training without downsampling, and observe a slight improvement in depth estimation performance.

IV-C Ablation Study
We first show that adding the threshold mask (Section III-D) improves depth estimation (the first two rows of Table I), and then compare three ways of incorporating geometric information, namely PairwiseMatching, PriorWeakPose [21] and DirectWeakPose. Since pose data is more conveniently generated from the KITTI odometry sequences, we train the models on KITTI odometry sequences 00-08 and evaluate monocular depth estimation on the Eigen split test set. Since some test images of the Eigen split are included in the training sequences 00-08, we remove the in-training test samples using matchable image retrieval [36]. Therefore, the result is not comparable with Table II because the test sets differ. Note that here we do not directly compare pose estimation performance because DirectWeakPose does not even learn to estimate pose. The error measures conform with those used in [5]. As shown in Table I, PairwiseMatching achieves the best depth estimation performance among the three. This is explainable because PriorWeakPose and DirectWeakPose both inherit the geometric bias of the estimation algorithms, while PairwiseMatching uses the raw matches.
IV-D Depth Estimation
Method  Supervision  Dataset  Cap (m)  Abs Rel  Sq Rel  RMSE  RMSE log  δ < 1.25  δ < 1.25²  δ < 1.25³
Eigen et al. [5] Fine  Depth  K  80  0.203  1.548  6.307  0.282  0.702  0.890  0.958 
Liu et al. [26]  Depth  K  80  0.202  1.614  6.523  0.275  0.678  0.895  0.965 
Godard et al. [13]  Pose  K  80  0.148  1.344  5.927  0.247  0.803  0.922  0.964 
Zhou et al. [50] updated  No  K  80  0.183  1.595  6.709  0.270  0.734  0.902  0.959 
Mahjourian et al. [29]  No  K  80  0.163  1.24  6.22  0.25  0.762  0.916  0.968 
Yin et al. [47]  No  K  80  0.155  1.296  5.857  0.233  0.793  0.931  0.973 
Yin et al. [47]  No  K + CS  80  0.153  1.328  5.737  0.232  0.802  0.934  0.972 
Ours  No  K  80  0.156  1.309  5.73  0.236  0.797  0.929  0.969 
Ours  No  K + CS  80  0.152  1.205  5.564  0.227  0.8  0.935  0.973 
Garg et al. [11]  Stereo (Pose)  K  50  0.169  1.080  5.104  0.273  0.740  0.904  0.962 
Zhou et al. [50]  No  K  50  0.201  1.391  5.181  0.264  0.696  0.900  0.966 
Ours  No  K  50  0.149  1.01  4.36  0.222  0.812  0.937  0.973 
Further, we compare our model trained with the pairwise matching loss (PairwiseMatching) on the KITTI Eigen train/val split against approaches with depth supervision, pose supervision or no supervision (self-supervision). The evaluation protocol follows [50]: we match the medians of the predicted and ground-truth depths, since the predicted monocular depth is defined only up to scale. All ground-truth depth maps are capped at 80 m (maximum depth of 80 m), except for the comparison with [11], which is capped at 50 m. As shown in Table II, the match loss improves the baseline self-supervised approach [50] by a large margin and achieves state-of-the-art performance compared with methods using more sophisticated information such as optical flow [47] or ICP [29].
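The median-matching evaluation can be sketched as follows. This is a plausible reading of the standard error measures of [5] (median scaling, depth cap, and the first accuracy threshold), not the paper's exact evaluation code; names are illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, cap=80.0):
    # Keep valid ground-truth pixels within the cap.
    mask = (gt > 0) & (gt < cap)
    pred, gt = pred[mask], gt[mask]
    # Monocular depth is defined up to scale: match the medians.
    pred = pred * np.median(gt) / np.median(pred)
    pred = np.clip(pred, 1e-3, cap)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)       # accuracy under threshold 1.25
    return abs_rel, sq_rel, rmse, rmse_log, delta1
```

Because of the median scaling, a prediction that is a constant multiple of the ground truth scores perfectly on every measure.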
IV-E Visual Odometry Performance
Method  Seq. 09 Snippet  Seq. 09 Full (m)  Seq. 10 (no loop) Snippet  Seq. 10 (no loop) Full (m)
ORB-SLAM2 (full, w/ LC)  0.014 ± 0.008  7.08  0.012 ± 0.011  5.74
ORB-SLAM2 (full, w/o LC)  -  38.56  -  5.74
Zhou et al. [50] updated (5-frame)  0.016 ± 0.009  41.79  0.013 ± 0.009  23.78
Yin et al. [47] (5-frame)  0.012 ± 0.007  152.68  0.012 ± 0.009  48.19
Mahjourian et al. [29], no ICP (3-frame)  0.014 ± 0.010  -  0.013 ± 0.011  -
Mahjourian et al. [29], with ICP (3-frame)  0.013 ± 0.010  -  0.012 ± 0.011  -
Ours (3-frame)  0.0089 ± 0.0054  18.36  0.0084 ± 0.0071  16.15
Relative pose estimation is evaluated on KITTI odometry sequences 09/10 and compared with both learning-based methods and traditional ones such as ORB-SLAM2 [31]. Compared with depth estimation, we care more about the relative pose estimation ability, since the match loss directly interacts with it. We observe that with the pairwise matching supervision, the visual odometry result is substantially improved. We first measure the Absolute Trajectory Error (ATE) over N-frame snippets (N = 3 or 5), as in [50, 47, 29]. As shown in Table III (‘Snippet’ column), our method outperforms other state-of-the-art approaches by a large margin.
However, simply comparing ATE over snippets favors the learning-based methods, since traditional methods like ORB-SLAM2 utilize window-based optimization over a longer sequence. Therefore, we chain the relative motions given by adjacent frames and apply simple motion averaging to obtain the full trajectory (1591 frames for Seq. 09 and 1201 for Seq. 10). The full pose is compared with the full trajectory computed by the monocular ORB-SLAM2 approach without loop closure. Since the relative motion recovered by monocular visual odometry systems has an undefined scale, we first align the trajectories with the ground truth using a similarity transformation from the evaluation package evo (https://github.com/MichaelGrupp/evo).
As shown in Table III (‘Full’ represents the median translation error measured in meters), our method has the lowest full trajectory error among similar methods due to the geometric supervision. Compared with ORB-SLAM2, our method achieves a lower median ATE error on sequence 09 (18.36 m) but is worse on sequence 10 (16.15 m). Note that sequence 09 has a loop structure while sequence 10 does not, as shown in Figure 3. We also show the trajectories of sequences 11 and 12, where the ground-truth poses are unavailable, using stereo ORB-SLAM2 results as the reference. We observe that the learned model performs worse on large-angle rotations. This may be due to the lack of rotating motion in the KITTI training data, as forward motion dominates the car movement. Considering the smaller input image size and the simplicity of the implementation, this end-to-end visual odometry method still has great potential for future improvement.
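Chaining the predicted relative motions into a full trajectory, plus a simplified scale-only alignment, might look like the sketch below; the evo package additionally solves for rotation and translation (a full similarity transform), and the function names here are ours.

```python
import numpy as np

def chain_poses(relative):
    # Accumulate predicted relative motions (4x4 transforms; each maps
    # frame i+1's camera into frame i's coordinates) into absolute poses,
    # starting from the identity at the first frame.
    traj = [np.eye(4)]
    for T in relative:
        traj.append(traj[-1] @ T)
    return traj

def align_scale(est_xyz, gt_xyz):
    # Monocular scale is undefined: recover a single scalar by least
    # squares against the ground-truth positions (Nx3 arrays).
    s = np.sum(gt_xyz * est_xyz) / np.sum(est_xyz ** 2)
    return s * est_xyz
```

After the scale (or full similarity) alignment, the median translation error between corresponding positions gives the ‘Full’ ATE numbers of Table III.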
V Conclusions
In this paper, we first analyze the limitations of the loss formulations previously used for self-supervised depth and motion estimation. We then propose to incorporate intermediate geometric computations, such as feature matches, into the motion estimation problem. This paper is a preliminary exploration of the usability of geometric quantities in the self-supervised motion learning problem. Currently, we only consider two-view geometric relations. Future directions include fusing geometric quantities over longer image sequences as in bundle adjustment [40] and combining learning methods with traditional approaches as in [39, 46].
Acknowledgement. This work is supported by Hong Kong RGC T22603/15N and Hong Kong ITC PSKL12EG02. We also thank the generous support of Google Cloud Platform.
References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.
[2] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In ECCV, 1992.

[3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. PAMI, 2007.
[5] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[6] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. PAMI, 2017.
[7] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
[8] S. H. Ferris. Motion parallax and absolute distance. Journal of Experimental Psychology, 1972.
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
[10] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solution classification for the perspective-three-point problem. PAMI, 2003.
[11] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
[12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.

[14]
R. Hartley and A. Zisserman.
Multiple view geometry in computer vision
. Cambridge university press, 2003.  [15] R. I. Hartley. In defense of the eightpoint algorithm. PAMI, 1997.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[18] S. J. Kim and M. Pollefeys. Robust radiometric calibration and vignetting correction. PAMI, 2008.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] G. Klein and D. Murray. Improving the agility of keyframe-based SLAM. In ECCV, 2008.
[21] M. Klodt and A. Vedaldi. Supervising the new with the old: learning SfM from SfM. In ECCV, 2018.
[22] K. Konolige and M. Agrawal. FrameSLAM: From bundle adjustment to real-time visual mapping. IEEE Transactions on Robotics, 2008.
[23] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV, 2009.

[24]
R. Li, S. Wang, Z. Long, and D. Gu.
Undeepvo: Monocular visual odometry through unsupervised deep learning.
In ICRA, 2018.  [25] T. Lindeberg. Scalespace theory: A basic tool for analyzing structures at different scales. Journal of applied statistics, 1994.
[26] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. PAMI, 2016.
[27] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and L. Quan. GeoDesc: Learning local descriptors by integrating geometry constraints. In ECCV, 2018.
[28] F. Ma and S. Karaman. Sparse-to-dense: depth prediction from sparse depth samples and a single image. ICRA, 2017.
[29] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, 2018.
[30] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
[31] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 2017.
[32] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, 2011.
[33] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In CVPR, 2016.
[34] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[35] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS, 2006.
[36] T. Shen, Z. Luo, L. Zhou, R. Zhang, S. Zhu, T. Fang, and L. Quan. Matchable image retrieval by learning from surface reconstruction. In ACCV, 2018.
[37] T. Shen, S. Zhu, T. Fang, R. Zhang, and L. Quan. Graph-based consistent matching for structure-from-motion. In ECCV, 2016.
[38] M. T. Swanston and W. C. Gogel. Perceived size and motion in depth from optical expansion. Perception & Psychophysics, 1986.
[39] C. Tang and P. Tan. BA-Net: Dense bundle adjustment network. In ICLR, 2019.
[40] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle adjustment—a modern synthesis. In International Workshop on Vision Algorithms, 1999.
[41] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017.
[42] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
[43] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In CVPR, 2018.

[44]
S. Wang, R. Clark, H. Wen, and N. Trigoni.
Deepvo: Towards endtoend visual odometry with deep recurrent convolutional neural networks.
In ICRA, 2017.  [45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
[46] N. Yang, R. Wang, J. Stuckler, and D. Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In ECCV, 2018.
[47] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018.
[48] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
[49] L. Zhou, S. Zhu, T. Shen, J. Wang, T. Fang, and L. Quan. Progressive large scale-invariant image matching in scale space. In ICCV, 2017.
[50] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
[51] S. Zhu, R. Zhang, L. Zhou, T. Shen, T. Fang, P. Tan, and L. Quan. Very large-scale global SfM by distributed motion averaging. In CVPR, 2018.

[52]
C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski.
Highquality video view interpolation using a layered representation.
In TOG, 2004.