I Introduction
For autonomous systems such as Micro Aerial Vehicles (MAVs), depth and egomotion (pose) estimation are important capabilities. Depth measurements are vital when reasoning in 3D space, which is necessary for environmental understanding and obstacle avoidance, particularly for moving platforms. Egomotion estimation enables the robot to track its own motion, allowing it to follow particular trajectories. However, performing depth estimation onboard MAVs is challenging given the size, weight, and power (SWaP) constraints, which in turn impose computational constraints. Accurate depth sensors such as LiDARs provide precise depth estimates and are often used on larger UAV platforms [3] as well as in the autonomous vehicle domain, but they are typically heavy and expensive, limiting their use in MAV applications. In contrast, inexpensive depth sensors such as active and passive stereo cameras are lightweight, but their reduced weight comes at the cost of accuracy. Time-of-Flight (ToF) cameras are viable alternatives to LiDARs in certain situations and have been used on MAVs as they meet the size and weight constraints, but they are often limited in resolution and field of view compared to LiDARs [4].
In this study, we propose a method that uses noisy depth measurements from popular low-cost stereo cameras, RGB-D sensors, and sparse depth from LiDAR. We present a framework capable of filtering input noise and estimating pose simultaneously, and argue for jointly estimating pose and depth to improve the estimates in both domains.
State-of-the-art monocular algorithms such as DSO [5] show impressive performance using monocular images only. However, these methods do not provide an easy way to incorporate depth measurements from other sources. DSO has been extended to exploit stereo images [6], and other stereo odometry algorithms such as [7] have been developed, but these methods do not generalize well to a wide variety of depth data, from sparse, precise LiDAR depth to dense, noisy ToF depth.
Given the advances in deep neural networks, and in particular convolutional neural networks (CNNs), there has been increasing interest in applying CNNs to the depth and egomotion estimation problem. Unsupervised algorithms such as [8, 9, 10] are appealing as they train on regular camera images only and are capable of predicting both depth and egomotion. However, they can only provide depth up to scale. In addition, the performance of state-of-the-art methods, including [11], which ranks first on the KITTI depth prediction benchmark at the time of writing, is still far from usable in autonomous robots, where accurate depth estimation is crucial for fast, safe execution. Recent successes in fusing sparse depth measurements and RGB images [12, 13] have encouraged the application of depth completion to robotics. However, these methods rely heavily on the sparse depth input and on the regions where sparse depth exists, rendering them unable to extrapolate depth in regions without LiDAR measurements. They also isolate the pose and depth estimation problems from each other, which we believe amounts to throwing out information that could be used to improve both estimators. The method proposed by Ma et al. [13] uses Perspective-n-Point (PnP) to estimate pose after first estimating depth, but becomes susceptible to failure in situations with low image texture, given the limitations of PnP. Additionally, the authors assume that the sparse depth input is noiseless, which limits the method's usability with inexpensive, noisy depth sensors, and their method was trained and tested on accurate LiDAR depth only. In this work, we argue that the depth prediction CNN should be jointly trained with a pose prediction CNN, resulting in an end-to-end algorithm that fuses the sparse depth and the RGB image. Specifically, our contributions are as follows:

We propose an end-to-end framework that can learn to complete and refine depth images (as shown in Fig. 2) from a variety of depth sensors with high degrees of noise.

Our end-to-end framework learns to simultaneously generate refined, completed depth estimates along with accurate pose estimates of the camera.

We evaluate our proposed approach in comparison with state-of-the-art approaches on indoor and outdoor datasets, using both sparse LiDAR depth and dense depth from stereo.

Experimental results show that our method outperforms other competing methods on pose estimation and depth prediction in challenging environments.
II Related Work
II-A Depth Sensing
Depth measurements can be obtained from a variety of hardware, including LiDARs, Time-of-Flight (ToF) cameras, and active and passive stereo cameras. With these sensors, however, there is typically a tradeoff between size, weight, and cost on one axis, and accuracy, resolution, and range on the other. Sensors that are accurate, such as 3D LiDARs, are expensive and heavy, while small, lightweight sensors such as stereo cameras or structured light sensors have limited range and are often noisier.
An affordable and popular option among these depth sensors is the stereo camera, which, in both the active and passive cases, consists of two cameras separated by a baseline distance. Depth is estimated from the disparity of a 3D point between the two 2D images. Popular stereo reconstruction algorithms such as SGM [14] and local matching [15] triangulate 3D points from corresponding 2D image points observed by the two cameras to obtain depth. Compared to LiDARs and ToF cameras, stereo cameras can provide denser depth, since they are usually limited only by the raw image resolution of the stereo pair. However, their effective range is limited by the baseline, which is in turn limited by the robot's size. Even the best stereo methods are prone to noise and are susceptible to all the challenges of regular image-based perception algorithms. Some of the leading stereo algorithms are compared on the KITTI Stereo 2015 [1] and Middlebury [16] benchmarks.
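Concretely, the disparity-to-depth relation behind these triangulation methods can be sketched as follows (the function name and example values are illustrative, not taken from the paper; note how a short baseline shrinks disparity and therefore effective range):

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Triangulate per-pixel depth from a stereo disparity map.

    depth = f * B / d, so depth error grows quadratically with depth,
    which is why a short baseline limits the effective range.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0  # zero disparity: no match / point at infinity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# a point with 10 px disparity, 700 px focal length, 0.54 m baseline
d = depth_from_disparity(np.array([[10.0]]), focal_px=700.0, baseline_m=0.54)
```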
II-B Visual Odometry (Joint Depth and Pose Estimation)
3D reconstruction can also be performed by triangulating 3D points observed by the same camera at two consecutive time steps. Monocular depth estimation approaches obtain the pose or transformation between two consecutive camera frames either from other sensors, such as a GPS or IMU, or from a pose estimation process.
One prominent approach for monocular depth estimation is feature-based visual odometry [17, 18, 19], which determines pose from a set of well-established correspondences. These methods keep track of only a small set of features, often losing relevant contextual information about the scene. They also suffer under degraded imaging conditions such as image blur and low light, where detecting and matching features is challenging.
Direct visual odometry methods [20, 21, 22, 5], on the other hand, directly use pixel intensity values and exploit the locality assumption of the motion to formulate an energy function. The purpose of this energy function is to minimize the reprojection error and thereby obtain the pose of the camera. These approaches generally yield better results than feature-based algorithms, but are more sensitive to changes in illumination between successive frames. Direct visual odometry can also provide dense depth and is intuitively similar to our approach. However, purely monocular approaches suffer from an inherent scale ambiguity.
State-of-the-art approaches to deep-learning-based monocular dense depth estimation rely on DCNNs [23, 24]. However, as Godard et al. [25] note, relying solely on image reconstruction for monocular depth estimation results in low-quality depth images. They mitigate this problem by leveraging spatial constraints between the two images captured by a stereo camera during training, which in turn requires stereo images to be available at training time. The problem of scale ambiguity can be solved by utilizing additional information. Wang et al. [6] modify DSO [5] to take a stereo pair of images as input instead of purely monocular data. They demonstrate significantly improved depth estimation and odometry, but rely on stereo images and cannot generalize to other depth sensors. Yang et al. [26] improve upon this by incorporating a deep network for depth estimation using not only a left-right consistency loss but also a supervised loss from the stereo depth. Their superior results show that an additional supervised loss does help boost performance.
Kuznietsov et al. [27] alternatively solve the scale ambiguity problem of purely monocular depth estimation by adding sparse depth as a loss term to the overall loss function. Their loss function also contains a stereo loss term as in Godard [9]. Compared to this work, our approach takes advantage of the temporal constraint between pairs of sequential images at no extra cost, without the need for a stereo image pair. Our model can also take RGB images alone as input; results with these settings are reported in Section VI. While the use of stereo images limits Yang's and Godard's works to stereo sensors only, our proposed method can work with a wide range of depth sensors. Wang et al. [28] incorporate a Differentiable Direct Visual Odometry (DDVO) pose predictor as an alternative to a pose CNN, such that the backpropagation signals reaching the camera pose predictor can also be propagated to the depth estimator. Unlike ours, this framework requires no learnable parameters for the pose estimation branch, while still enabling joint training of pose and depth estimation. However, because of the nonlinear optimization in DDVO, a good initialization, such as a pretrained Pose-CNN, is required for DDVO to converge during training.
II-C Depth Completion
There has been recent success in generating dense depth images by exploiting sparse depth input from LiDAR sensors and high-resolution RGB imagery [12, 29, 30, 31] on the KITTI Depth Completion benchmark. However, these methods rely heavily on the supplied sparse depth input and on the accuracy of those measurements, making them susceptible to failure with noisy sparse input samples. They additionally often struggle to extrapolate depth values where there are no LiDAR measurements [32].
The method proposed by Ma et al. [13] is the most similar to ours. In that work, pose estimation relies on the PnP method, which is handled independently from the depth completion pipeline. PnP relies on feature detection and correspondence, which is likely to perform poorly in low-texture environments, and the method assumes that the input depth is relatively free of noise. Alternatively, our end-to-end framework utilizes temporal constraints to formulate a measure of reprojection error, providing a training signal to the pose estimation and depth completion branches simultaneously. This reprojection error signal handles the noisy sparse depth input, while the supervision from ground-truth depth provides the scale for the depth estimation.
III Problem Formulation
Notation: Let $I_t : \Omega \rightarrow \mathbb{R}^3$ be the color image captured at timestep $t$, where $\Omega \subset \mathbb{R}^2$ is the domain of the color image. The camera position and orientation at time $t$ is expressed as a rigid-body transformation $T^w_t \in SE(3)$ with respect to the world frame $w$. A relative transformation between two consecutive frames can be represented as $T_{t \rightarrow t+1} = (T^w_t)^{-1}\, T^w_{t+1}$.
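As a concrete illustration, the relative transformation between two consecutive camera-to-world poses can be sketched with 4×4 homogeneous matrices (the function name and the pure-translation example are illustrative):

```python
import numpy as np

def relative_transform(T_w_t, T_w_t1):
    """Relative rigid-body transform between consecutive frames,
    given both camera-to-world poses as 4x4 homogeneous matrices."""
    return np.linalg.inv(T_w_t) @ T_w_t1

# pure translation: the camera moves 1 m along z between frames
T0 = np.eye(4)
T1 = np.eye(4)
T1[2, 3] = 1.0
T_rel = relative_transform(T0, T1)
```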
Let $D^*$ be the true depth map that we want to estimate and $D_s$ be the corresponding projection of the depth measurement onto the image frame, where $D_s(x) = 0$ corresponds to pixels where the depth measurement is not available or invalid. Let $\mathbb{1}(x)$ be the indicator of the validity of the depth measurement at pixel $x$. A sparse and noisy depth measurement can then be represented as

$$D_s(x) = \mathbb{1}(x)\left( D^*(x) + n(x) \right) \tag{1}$$

where $n(x)$ is the zero-mean Gaussian noise model that we assume for depth sensors in this study. We model the noise with standard deviation proportional to the ground-truth depth at a given pixel location: $n(x) \sim \mathcal{N}\!\left(0, \left(\eta D^*(x)\right)^2\right)$, where $\eta$ controls the noise level of the sensors and is set to a fixed value in the experiments with the TUM dataset in Section VI. Given a measurement $D_s$, our goal is to estimate the depth and pose $\{\hat{D}, \hat{T}\}$ that maximize the likelihood of the measurement.
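This measurement model is straightforward to simulate. A minimal sketch (hypothetical function, not the paper's code) that samples a sparse, noisy depth map from ground truth under the model above:

```python
import numpy as np

def simulate_sparse_noisy_depth(depth_gt, keep_ratio=0.05, eta=0.1, rng=None):
    """Sample a sparse, noisy depth map from ground truth: keep a random
    subset of pixels (the validity indicator) and perturb each kept depth
    with zero-mean Gaussian noise whose std is eta times the true depth."""
    rng = np.random.default_rng(rng)
    mask = rng.random(depth_gt.shape) < keep_ratio      # validity indicator
    noise = rng.normal(0.0, 1.0, depth_gt.shape) * eta * depth_gt
    return np.where(mask, depth_gt + noise, 0.0), mask
```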
IV Framework
We develop a deep neural network to model the rigid-body pose transformation as well as the depth map. The network consists of two branches: one CNN that learns a function estimating the depth (the Depth-CNN), and one CNN that learns a function estimating the pose (the Pose-CNN).
This network takes as input the image sequence and the corresponding sparse depth maps, and outputs the transformation as well as the dense depth map. During training, the two sets of parameters are updated simultaneously by the training signal detailed in this section.
The architecture is depicted in Fig. 3. We adopt the revised depth network of Ma [13] as the Depth-CNN, where the last layer before the ReLU is replaced by normalization and an Exponential Linear Unit (ELU). The Pose-CNN is adapted from Sfmlearner [8]. The losses used to train the network are detailed as follows.

IV-A Supervised Loss
In this study, we assume that during training a semi-dense ground-truth depth $D^*$ is known and used to supervise the network by penalizing the difference between the depth prediction $D$ and $D^*$ itself. Note that this semi-dense ground-truth depth is not needed during testing. The supervised loss is applied to the set $\Omega_{D^*}$ of pixels with directly measured depth available. It reads:

$$\mathcal{L}_{supervised} = \frac{1}{|\Omega_{D^*}|} \sum_{x \in \Omega_{D^*}} \left\| D(x) - D^*(x) \right\|^2 \tag{2}$$
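A minimal sketch of a masked supervised depth loss of this kind (a numpy stand-in; the paper's training code is implemented in PyTorch, and the exact norm is an assumption):

```python
import numpy as np

def supervised_depth_loss(pred, gt, valid_mask):
    """L2 loss between predicted and semi-dense ground-truth depth,
    averaged only over the pixels where ground truth exists."""
    diff = (pred - gt)[valid_mask]
    return float(np.mean(diff ** 2))
```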
IV-B Photometric Loss
The semi-dense depth $D^*$ is often sparse due to hardware and software limitations of depth sensors, making the supervised loss incapable of generating a pixel-wise cost for the estimated dense depth $D$. To cope with this problem, we introduce an unsupervised loss: a photometric loss that enforces the temporal constraint between two consecutive image frames.
Photometric Loss
The unsupervised loss is similar to the residuals used in direct visual odometry, which are often computed at multiple image scales. The unsupervised loss computed at scale $s$ can be represented as follows:

$$\mathcal{L}^{s}_{photometric} = \frac{1}{|\Omega^s|} \sum_{x \in \Omega^s} \left| r^s(x) \right| \tag{3}$$

where $s$ indexes the image scales and the residual intensity $r^s(x)$ is defined by the photometric difference between pixels observed from the same 3D point in one image and in its warping under the transformation $T_{t \rightarrow t+1}$.
To find the corresponding transformation between the two frames needed to compute the loss in Eq. 3, Ma et al. [13] utilize a PnP framework, which is susceptible to failure in low-texture environments. In contrast, our framework introduces another DCNN that estimates the pose between two consecutive frames $I_t$ and $I_{t+1}$. In this way, both depth and egomotion estimation are differentiable, enabling effective training of both in an end-to-end scheme.
Residuals
Given the depth prediction $\hat{D}_t$ of the first frame and the estimated relative pose $\hat{T}$ between the current frame $I_t$ and the consecutive frame $I_{t+1}$, a warped image $\hat{I}_t$ can be generated by warping the frame $I_{t+1}$ back to the frame $I_t$. The intensity value of a pixel $x$ in the warped image can be computed using the pinhole camera model as follows:

$$\hat{I}_t(x) = I_{t+1}\!\left( K \hat{T} \hat{D}_t(x) K^{-1} \bar{x} \right) \tag{4}$$

where $\bar{x}$ is the homogeneous representation of pixel $x$ and $K$ is the camera intrinsic matrix. Similarly, the warped image $\hat{I}_{t+1}$ can be generated from the inverse pose $\hat{T}^{-1}$:

$$\hat{I}_{t+1}(x) = I_{t}\!\left( K \hat{T}^{-1} \hat{D}_{t+1}(x) K^{-1} \bar{x} \right) \tag{5}$$
The average residual is then computed from the residuals of these two warped images. The residual for scale $s$ is defined as:

$$r^s(x) = \frac{1}{2}\left( \left| I^s_t(x) - \hat{I}^s_t(x) \right| + \left| I^s_{t+1}(x) - \hat{I}^s_{t+1}(x) \right| \right)$$
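The warping procedure can be sketched per pixel as follows (nearest-neighbour sampling and the example intrinsics are simplifications of ours; real implementations warp whole images with differentiable bilinear sampling):

```python
import numpy as np

def warp_intensity(I_next, depth, K, T, x, y):
    """Photometric warping of a single pixel (x, y) of frame t:
    back-project it with its predicted depth, move it by the relative
    pose T (4x4), reproject with intrinsics K, and sample the next frame
    there (nearest neighbour for brevity)."""
    p = depth[y, x] * np.linalg.inv(K) @ np.array([x, y, 1.0])  # 3D point, frame t
    q = T[:3, :3] @ p + T[:3, 3]                                # point in frame t+1
    u, v, w = K @ q
    u, v = int(round(u / w)), int(round(v / w))
    h, wd = I_next.shape[:2]
    if 0 <= u < wd and 0 <= v < h:
        return I_next[v, u]
    return None  # point falls outside the next image

# identity motion: the warped pixel samples the same location in the next frame
K = np.array([[100.0, 0.0, 8.0],
              [0.0, 100.0, 6.0],
              [0.0, 0.0, 1.0]])
I_next = np.arange(12 * 16, dtype=float).reshape(12, 16)
val = warp_intensity(I_next, np.full((12, 16), 5.0), K, np.eye(4), x=3, y=4)
```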
IV-C Masked Photometric Loss
The synthesis of the images $\hat{I}_t$ and $\hat{I}_{t+1}$ implicitly assumes that the scene is static, a condition that does not always hold. To cope with moving objects in the scene, we introduce the explainability mask network of [8], which models the confidence $M(x)$ of each pixel in the synthesized image; a static object will have pixels with confidence close to 1. In practice, this mask network can be a DCNN branch attached to the pose network, keeping the whole model trainable in an end-to-end manner. The masked photometric loss, which takes this confidence into account and is therefore capable of coping with moving objects, is formulated as

$$\mathcal{L}^{s}_{masked} = \frac{1}{|\Omega^s|} \sum_{x \in \Omega^s} M^s(x) \left| r^s(x) \right| \tag{6}$$
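A sketch of how a confidence map down-weights per-pixel residuals (illustrative, not the paper's implementation):

```python
import numpy as np

def masked_photometric_loss(residual, confidence):
    """Per-pixel photometric residuals weighted by a predicted
    explainability/confidence map in [0, 1]; pixels the mask deems
    non-static (low confidence) contribute less to the loss."""
    return float(np.mean(confidence * np.abs(residual)))
```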
IV-D Smoothness Loss
We also enforce smoothness of the depth estimate with a smoothness loss term that penalizes discontinuities in the depth prediction. The norm of the second-order depth gradients is adopted as the smoothness loss $\mathcal{L}_{smooth}$, as in [33].
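A minimal numpy sketch of a second-order smoothness penalty of this kind (finite differences; the exact norm and weighting used in the paper may differ):

```python
import numpy as np

def second_order_smoothness(depth):
    """Penalize depth curvature: mean absolute second-order finite
    differences along both image axes. Planar (linear) depth incurs
    zero penalty; only curvature/discontinuities are penalized."""
    ddx = depth[:, 2:] - 2.0 * depth[:, 1:-1] + depth[:, :-2]
    ddy = depth[2:, :] - 2.0 * depth[1:-1, :] + depth[:-2, :]
    return float(np.abs(ddx).mean() + np.abs(ddy).mean())
```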
IV-E Training Loss Summary
In summary, the final loss for our entire framework is defined as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{supervised} + \lambda_2 \sum_{s} \mathcal{L}^{s}_{masked} + \lambda_3 \mathcal{L}_{smooth} + \lambda_4 \mathcal{L}_{mask} \tag{7}$$

where $\lambda_1, \ldots, \lambda_4$ are the weights for the corresponding losses, set empirically to 1.0, 0.1, 0.1, and 0.2. The additional loss term $\mathcal{L}_{mask}$ is adopted from Zhou [8] to avoid degeneration of the mask, which could otherwise collapse to zero to trivially minimize the masked photometric loss.
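Assuming the four reported weights map to the loss terms in the order listed (supervised, masked photometric, smoothness, mask regularization — an assumption on our part), the combination can be sketched as:

```python
def total_training_loss(l_supervised, l_photometric, l_smooth, l_mask,
                        weights=(1.0, 0.1, 0.1, 0.2)):
    """Weighted sum of the four training terms; the default weights are
    the empirical values reported in the paper, and their assignment to
    terms here is an assumed ordering."""
    terms = (l_supervised, l_photometric, l_smooth, l_mask)
    return sum(w * t for w, t in zip(weights, terms))
```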
V Experimental Settings
V-A Datasets
In this study, we conduct experiments to evaluate our approach in comparison with others on two datasets.
We first evaluate our approach on the KITTI raw dataset [1] on two different aspects: depth completion and pose estimation. We train our framework on 44,000 training samples and validate on 1,000 selected samples.
For depth completion, KITTI provides a separate test set with held-out ground truth for benchmarking. The predicted depth on the test set is submitted to KITTI's benchmark system to obtain the results given in Tab. I.
For pose estimation, we evaluate based on three different metrics: absolute trajectory error (ATE) [2], relative pose error (RE) [2], and average photometric loss. The comparison results are shown in Tab. II. We then evaluate the performance of our approach on the TUM RGB-D SLAM dataset [2]. We divide this dataset into two sets: the training set consists of the sequences freiburg2 pioneer slam and freiburg2 pioneer slam3, while the test set is freiburg2 pioneer slam2.
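For reference, the core of the ATE metric (omitting the rigid trajectory alignment performed by the TUM evaluation tools before this step) can be sketched as:

```python
import numpy as np

def absolute_trajectory_error(est_xyz, gt_xyz):
    """Root-mean-square Euclidean distance between estimated and
    ground-truth camera positions, after frame-wise association.
    The TUM tooling additionally aligns the trajectories first."""
    d = np.linalg.norm(np.asarray(est_xyz) - np.asarray(gt_xyz), axis=1)
    return float(np.sqrt(np.mean(d ** 2)))

# a trajectory offset by a constant (0.3, 0.4, 0) m has ATE 0.5 m
gt = np.zeros((4, 3))
est = gt + np.array([0.3, 0.4, 0.0])
ate = absolute_trajectory_error(est, gt)
```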
V-B Implementation Details
Our algorithm is implemented in PyTorch. We use the ADAM optimizer with a momentum of 0.9 and a weight decay of 0.0003; the remaining ADAM parameters are preselected. Two Tesla V100 GPUs with 32 GB of RAM are used for training with a batch size of 8, and 15 epochs take around 12 hours.
For the KITTI dataset, the input images and sparse depth maps share the same resolution. The semi-dense supervision depth map is formed by aggregating sparse depth from the nearby 10 frames. For the TUM robot SLAM dataset, the input images and sparse depth maps likewise share the same resolution. Input images are normalized, and the input sparse depth is scaled in value: by a factor of 0.01 on KITTI and 1/15 on TUM. The supervision depth has the same size as the input sparse depth but is not scaled.
VI Evaluation
VI-A Depth Completion
In this section, we benchmark our proposed method against other state-of-the-art methods that focus mostly on the depth completion task, specifically on the KITTI [1] Depth Completion benchmark. As Tab. I shows, our method achieves competitive depth completion performance, ranking 17th in terms of RMSE, but is able to fill in depth for more pixels without depth ground truth, as shown in Fig. 5. This is due to the incorporation of both the supervised and unsupervised losses. Moreover, the difference between our RMSE and that of the top-ranked approach amounts to only a small fraction of the maximum distance in the KITTI dataset.
We focus on the comparison with Ma [13], since this method is the closest to our approach; on the KITTI benchmark, it ranks above ours.
The first three rows of Tab. IV depict the performance of our approach in comparison with that of Ma [13] and Sfmlearner on the depth completion problem on the TUM dataset. On this dataset, our proposed method outperforms that of Ma [13].
Method  iRMSE [1/km]  iMAE [1/km]  RMSE [mm]  MAE [mm] 

RGB guide certainty  2.19  0.93  772.87  215.02 
DL2  2.79  1.25  775.52  245.28 
RGB guide certainty  2.35  1.01  775.62  223.49 
DL1  2.26  0.99  777.90  209.86 
DLNL  3.95  1.54  785.98  276.68 
RASP  2.60  1.21  810.62  256.00 
Ma [13]  2.80  1.21  814.73  249.95 
NConvCNNL2 (gd)  2.60  1.03  829.98  233.26 
MSFFNet  2.81  1.18  832.90  247.15 
DDP  2.10  0.85  832.94  203.96 
NConvCNNL1 (gd)  2.52  0.92  859.22  207.77 
LateFusion  4.59  2.25  885.92  347.61 
SpadeRGBsD  2.17  0.95  917.64  234.81 
glob guide certainty  2.80  1.07  922.93  249.11 
RDSS  3.22  1.44  927.40  312.23 
HMSNet  2.93  1.14  937.48  258.48 
Ours(RGBD)  3.21  1.39  943.89  304.17 
Ours (d)  4.48  1.78  1184.35  403.18 
VI-B Pose Estimation
Since most methods on the depth completion benchmark do not investigate joint pose estimation, we compare our trained pose estimator against Sfmlearner and Ma [13], whose pose estimator is essentially PnP. Sfmlearner is capable of simultaneously estimating both pose and depth, similar to our method. The parameters for the PnP algorithm are set according to Ma [13]. Tab. II and III show that our method achieves a lower photometric loss than all other mentioned methods on the KITTI [1] dataset while exhibiting ATE and RE values very competitive with the other methods. Compared with the results stated in the Sfmlearner paper [8], our depth prediction RMSE is far superior to that of Sfmlearner, which explains why our approach achieves a lower photometric loss.
To diversify the evaluation, we compare the same approaches on the TUM [2] dataset. The TUM RGB-D dataset is particularly challenging, as it contains large amounts of rotation, particularly at the beginning. In addition, there are several blurred frames, as well as points of rapid rotation causing significant jumps between frames. Finally, the dataset was gathered using a Kinect camera with a rolling shutter, which creates geometric distortion in the image. These challenges make the TUM RGB-D dataset ideal for evaluating the robustness of pose estimation algorithms.
We additionally benchmark monocular DSO [5] on the TUM dataset as a representative of the state of the art in monocular SLAM (Simultaneous Localization and Mapping). Monocular methods are particularly sensitive to rolling shutter distortion as well as calibration error; these factors cause DSO to diverge early in the dataset, so we omit its results here.
As Tab. III depicts, our approach outperforms both Ma [13] and Sfmlearner when considering ATE, RE, and photometric loss as metrics. This result, along with the experiments in Tab. IV, shows that our approach outperforms Ma [13] and Sfmlearner on both pose and depth estimation on the TUM dataset.
Method  ATE[m]  RE  Photometric Loss 

Ma [13]  0.0105 ± 0.0082  0.0011 ± 0.0006  0.1052 
Sfmlearner  0.0179 ± 0.0110  0.0018 ± 0.0009  0.1843 
Ground truth  0.0 ± 0.0  0.0 ± 0.0  0.1921 
Ours (rgbd)  0.0170 ± 0.0094  0.0046 ± 0.0031  0.0726 
Method  ATE[m]  RE  Photometric Loss 

Ma [13]  0.0116 ± 0.0067  0.0025 ± 0.0022  0.0820 
Sfmlearner  0.0152 ± 0.0092  0.0033 ± 0.0021  0.1043 
Ground truth  0.0 ± 0.0  0.0 ± 0.0  0.0533 
Ours (rgbd)  0.0101 ± 0.0051  0.0021 ± 0.0015  0.0369 
VI-C Depth Refinement
One of the prominent problems in robotic vision is noisy depth measurement resulting from lightweight, low-cost depth sensors. In this section, we simulate noisy input depth by adding Gaussian noise with standard deviation proportional to each depth value and evaluate the ability of different methods to handle this situation. In particular, we evaluate on the TUM robot SLAM dataset by adding to each valid depth point a Gaussian noise with standard deviation of up to a fixed fraction of the depth value. We then train and test with different settings and report the results in Tab. IV.
Sfmlearner does not use depth, so its results are identical with and without input depth noise. For our method, we evaluate two cases: one trained without noisy depth and one trained with noisy depth; both are tested with noisy depth. Note that the input depth is also sparse, as it is obtained by randomly sampling from the ground-truth depth. Qualitative results are shown in Fig. 4, where each number on an image is the RMSE of the depth estimation error for the corresponding approach. Quantitative comparisons are shown in the last rows of Tab. IV. Both qualitative and quantitative results favor our approach.
Method  Depth Input (train/test)  RMSE  MAE  iRMSE  iMAE 

Sfmlearner  O/O  3436.53  2839.16  2659.01  2577.77 
Ma [13]  O/O  119.53  60.01  15.23  8.97 
Ours (rgbd)  O/O  118.34  59.82  14.77  8.88 
Sfmlearner  Noise / Noise  3436.53  2839.16  2659.01  2577.77 
Ma [13]  Noise/Noise  209.81  122.24  60.75  27.05 
Ours (rgbd)  O/Noise  779.582  579.18  116.95  83.81 
Ours (rgbd)  Noise/Noise  180.63  100.20  45.54  21.08 
VII Conclusion
Depth and egomotion (pose) estimation are essential for autonomous robots to understand the environment and avoid obstacles. However, obtaining dense, accurate depth is challenging: depth sensors suitable for small robot platforms, such as stereo cameras, are often prone to noise, while accurate sensors such as LiDARs can only provide sparse depth measurements. In this work, we mitigate these constraints by introducing an end-to-end deep neural network to jointly estimate the camera pose and scene structure. We evaluate our proposed approach against state-of-the-art approaches on the KITTI and TUM datasets. The empirical results demonstrate the superior performance of our model under sparse and noisy depth input as well as its capability to work with multiple depth sensors. These capabilities are beneficial in various scenarios, from autonomous vehicles to MAVs.
References

[1] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3354–3361.
 [2] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
 [3] K. Mohta, M. Watterson, Y. Mulgaonkar, S. Liu, C. Qu, A. Makineni, K. Saulnier, K. Sun, A. Zhu, J. Delmerico, and et al., “Fast, autonomous flight in gpsdenied and cluttered environments,” Journal of Field Robotics, vol. 35, no. 1, p. 101–120, Dec 2017. [Online]. Available: http://dx.doi.org/10.1002/rob.21774
 [4] S. S. Shivakumar, K. Mohta, B. Pfrommer, V. Kumar, and C. J. Taylor, “Real time dense depth estimation by fusing stereo with sparse depth measurements,” 2018.
 [5] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” CoRR, vol. abs/1607.02565, 2016. [Online]. Available: http://arxiv.org/abs/1607.02565
 [6] R. Wang, M. Schwörer, and D. Cremers, “Stereo DSO: largescale direct sparse visual odometry with stereo cameras,” CoRR, vol. abs/1708.07878, 2017. [Online]. Available: http://arxiv.org/abs/1708.07878
 [7] W. Liu, G. Loianno, K. Mohta, K. Daniilidis, and V. Kumar, "Semi-dense visual-inertial odometry and mapping for quadrotors with SWaP constraints," May 2018, pp. 1–6.
 [8] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
 [9] R. Li, S. Wang, Z. Long, and D. Gu, "UnDeepVO: Monocular visual odometry through unsupervised deep learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7286–7291.
 [10] V. Prasad and B. Bhowmick, “Sfmlearner++: Learning monocular depth & egomotion using meaningful geometric constraints,” arXiv preprint arXiv:1812.08370, 2018.
 [11] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
 [12] W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool, “Sparse and noisy lidar completion with rgb guidance and uncertainty,” arXiv preprint arXiv:1902.05356, 2019.
 [13] F. Ma, G. V. Cavalheiro, and S. Karaman, “Selfsupervised sparsetodense: Selfsupervised depth completion from lidar and monocular camera,” arXiv preprint arXiv:1807.00275, 2018.
 [14] H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Transactions on pattern analysis and machine intelligence, vol. 30, no. 2, pp. 328–341, 2008.
 [15] N. Einecke and J. Eggert, “A multiblockmatching approach for stereo,” in 2015 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2015, pp. 585–592.
 [16] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," International Journal of Computer Vision, vol. 47, no. 1–3, pp. 7–42, 2002.
 [17] P. H. Torr and A. Zisserman, “Feature based methods for structure and motion estimation,” in International workshop on vision algorithms. Springer, 1999, pp. 278–294.
 [18] D. Nistér, O. Naroditsky, and J. Bergen, "Visual odometry," in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 1. IEEE, 2004, pp. I–I.
 [19] H. Badino, A. Yamamoto, and T. Kanade, “Visual odometry by multiframe feature integration,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 222–229.
 [20] B. D. Lucas, T. Kanade, et al., “An iterative image registration technique with an application to stereo vision,” 1981.
 [21] M. Irani and P. Anandan, “About direct methods,” in International Workshop on Vision Algorithms. Springer, 1999, pp. 267–277.
 [22] H. Alismail, M. Kaess, B. Browning, and S. Lucey, “Direct visual odometry in low light using binary descriptors,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 444–451, 2017.
 [23] Y. Cao, Z. Wu, and C. Shen, “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 11, pp. 3174–3182, 2018.
 [24] H. Fu, M. Gong, C. Wang, and D. Tao, “A compromise principle in deep monocular depth estimation,” arXiv preprint arXiv:1708.08267, 2017.
 [25] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with leftright consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.
 [26] N. Yang, R. Wang, J. Stuckler, and D. Cremers, “Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 817–833.
 [27] Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semisupervised deep learning for monocular depth map prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6647–6655.
 [28] C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey, “Learning depth from monocular videos using direct methods,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2022–2030.
 [29] A. Eldesokey, M. Felsberg, and F. S. Khan, “Confidence propagation through cnns for guided sparse depth regression,” arXiv preprint arXiv:1811.01791, 2018.
 [30] M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi, “Sparse and dense data with cnns: Depth completion and semantic segmentation,” in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 52–60.
 [31] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in 2017 International Conference on 3D Vision (3DV). IEEE, 2017, pp. 11–20.
 [32] S. S. Shivakumar, T. Nguyen, S. W. Chen, and C. J. Taylor, “Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion,” 2019.
 [33] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, “Sfmnet: Learning of structure and motion from video,” arXiv preprint arXiv:1704.07804, 2017.