1 Introduction
Figure 1: Visual odometry results on sampled sequences 09 and 10 from the KITTI Odometry dataset. We sample the original sequences with a large stride (stride = 3) to simulate fast camera ego-motion that is unseen during training. Surprisingly, all tested PoseNet-based methods fail similarly on trajectory estimation under this challenging scenario. Our system significantly improves generalization ability and robustness and still works reasonably well on both sequences. See more discussion in Sec. 4.4.
Reconstructing the underlying 3D scenes from a collection of video frames or multi-view images has been a long-standing fundamental topic named structure-from-motion (SfM), which serves as an essential module in many real-world applications such as autonomous vehicles, robotics, and augmented reality. While traditional methods are built on the golden rule of feature correspondence and multi-view geometry, a recent trend of deep-learning-based methods
[43, 15, 67] tries to jointly learn monocular depth prediction and ego-motion in a self-supervised manner, aiming to exploit the strong learning ability of deep networks to extract geometric priors from large amounts of training data. The key to these self-supervised methods is to build a task consistency for training separate CNNs, where depth and pose predictions are jointly constrained by depth reprojection and image reconstruction error. While achieving fairly good results, most existing methods assume that a consistent scale of CNN-based monocular depth prediction and relative pose estimation can be learned across all input samples, since relative pose estimation inherently has scale ambiguity. Although several recent proposals manage to mitigate this scale problem [2, 11], this strong hypothesis still makes the learning problem difficult and leads to severely degraded performance, especially in long-sequence visual odometry applications and indoor environments, where relative poses vary significantly across sequences.
Motivated by these observations, we propose a new self-supervised depth-pose learning system that explicitly disentangles scale from the joint estimation of depth and relative pose. Instead of using a CNN-based camera pose prediction module (e.g., PoseNet), we directly solve the fundamental matrix from optical flow correspondences and implement a differentiable two-view triangulation module to locally recover an up-to-scale 3D structure. This triangulated point cloud is later used to align the predicted depth map via a scale transformation for depth error computation and reprojection consistency checks.
Our system resolves the scale inconsistency problem by design. With two-view triangulation and explicit scale-aware depth adaptation, the scale of the predicted depth always matches that of the estimated pose, allowing us to remove the scale ambiguity from joint depth-pose learning. Moreover, we borrow the strengths of traditional two-view geometry to acquire more direct, accurate and robust depth supervision in a self-supervised end-to-end manner, where the depth and flow predictions benefit from each other. Furthermore, because our relative pose is directly solved from the optical flow, we simplify the learning process and do not require correspondence knowledge to be learned by a PoseNet architecture, giving our system better generalization ability in challenging scenarios. See an example in Figure 1.
Experiments show that our unified system significantly improves the robustness of self-supervised learning methods in challenging scenarios such as long video sequences, unseen camera ego-motions, and indoor environments. Specifically, our proposed method achieves significant performance gains on the NYUv2 and KITTI Odometry datasets over existing self-supervised learning-based methods, and maintains state-of-the-art performance on KITTI depth and flow estimation. We further test our framework on the TUM-RGBD dataset and again demonstrate its promising generalization ability compared to baselines.
2 Related Work
Monocular Depth Estimation.
Recovering 3D depth from a single monocular image is a fundamental problem in computer vision. Early methods
[46, 47] use feature vectors along with a probabilistic model to provide monocular cues. Later, with the advent of deep networks, a variety of systems
[8, 10, 43] were proposed to learn monocular depth estimation from ground-truth depth maps in a supervised manner. To resolve the data deficiency problem, [36] uses synthetic data to help disparity training, and several works [30, 26, 29, 27] leverage the standard structure-from-motion (SfM) pipeline [48, 49] to generate pseudo-ground-truth depth maps by reprojecting the reconstructed 3D structure. Recently, a number of self-supervised works [12, 15, 67] were proposed to jointly estimate other geometric entities that aid depth estimation via photometric reprojection error. However, although some recent works [55, 11] try to address the scale ambiguity of monocular depth estimation with either normalization or affine adaptation, self-supervised methods still suffer from scale inconsistency when applied to challenging scenarios. Our work combines the advantages of SfM-based unsupervised methods and self-supervised learning methods, essentially disentangling scale from the learning process and benefiting from the more accurate and robust triangulated structure given by two-view geometry.
Self-Supervised Depth-Pose Learning.
Structure-from-motion (SfM) is a gold standard for depth reconstruction and camera trajectory recovery from videos and image collections. Recently, many works [54, 3, 61, 53]
try to combine neural networks with the SfM pipeline to make use of geometric priors learned from training data. Building on several unsupervised methods
[12, 15], Zhou et al. [67] first propose a joint unsupervised learning framework of depth and camera ego-motion from monocular videos. The core idea is to use photometric error as the supervision signal to jointly train depth and ego-motion networks. Along this line, several methods
[62, 68, 35, 2, 42, 34, 5, 6] further improve the performance by incorporating better training strategies and additional constraints, including ICP regularization [35], collaborative competition [42], and dense online bundle adjustment [5, 6]. Most related to us, Bian et al. [2] introduce a geometry consistency loss to enforce scale-consistent depth learning. Different from them, our method avoids scale inconsistency by design, by directly solving the relative pose from optical flow correspondences. Our system designs and findings are orthogonal to existing depth-pose learning works, significantly improving those methods in both accuracy and generalization.
Two-view Geometry.
Establishing pixel-wise correspondences between two images is a long-standing visual problem. Traditional methods utilize hand-crafted descriptors [32, 1, 44] to build rough correspondences for the subsequent fundamental matrix estimation. Recently, building on classic optical flow works [21, 33], researchers [7, 22, 52]
have found deep neural networks powerful for feature extraction and dense correspondence estimation between adjacent frames. Likewise, several self-supervised methods
[23, 38, 31] supervise optical flow training with photometric consistency.
Another line of research combines learning-based methods with fundamental matrix estimation after establishing the correspondences. While some works [4, 41] focus on making RANSAC [9] differentiable, an alternative is to use an end-to-end pose estimation network [24]. However, recent findings [45, 66] on image-based localization show that the PoseNet design [24] can degrade generalization ability compared to geometry-based methods. Also, the inherent scale ambiguity of pose estimation makes it hard to decouple from the depth scale during joint training. In our work, we show that by building on conventional two-view geometry, our optical flow estimation module is able to accurately recover relative poses and can benefit from joint depth-pose learning.
3 Method
3.1 Motivation and System Overview
The central idea of existing self-supervised depth-pose learning methods is to train two separate networks for monocular depth and relative pose estimation by enforcing geometric constraints on image pairs. Specifically, the predicted depth is reprojected onto another image plane using the predicted relative camera pose, and the photometric error is then measured. However, this class of methods assumes a consistent scale of depth and pose across all images, which can make the learning problem difficult and lead to scale drift when applied to visual odometry.
Some recent proposals [55, 2] introduce additional consistency constraints to mitigate this scale problem. Nonetheless, the scale-inconsistency issue naturally exists because the scales of the estimated depth and pose from neural networks are hard to measure. Also, the photometric error on the image plane supervises the depth only implicitly, which can suffer from data noise when large textureless regions exist. Furthermore, echoing two recent findings [45, 66]
that CNN-based absolute pose estimation struggles to generalize beyond image retrieval, the performance of CNN-based ego-motion estimation also degrades significantly when applied to challenging scenarios.
To address the above challenges, we propose a novel system that explicitly disentangles scale at both training and inference. The overall pipeline of our method is shown in Figure 2. Instead of relying on CNN-based relative pose estimation, we first predict optical flow and solve the fundamental matrix from the dense flow correspondences, thereby recovering the relative camera pose. Then, we sample over the inlier regions and use a differentiable triangulation module to reconstruct an up-to-scale 3D structure. Finally, a depth error is directly computed after a scale adaptation from the predicted depth to the triangulated structure, and reprojection errors on depth and flow are measured to further enforce end-to-end joint training. Our training objective is formulated as follows:
$L = L_{flow} + L_{depth} + L_{reproj} + L_{smooth}$   (1)
$L_{flow}$ denotes the unsupervised loss on optical flow, where we follow the photometric error design (pixel + SSIM [57] + smoothness) of PWC-Net [52]. An occlusion mask $M_o$ is derived from the optical flow following [56]. We also add a forward-backward consistency check [62] to generate a score map $M_s$ for the subsequent fundamental matrix estimation. $L_{depth}$ is the loss between the triangulated depth and the predicted depth. $L_{reproj}$ is the reprojection error for image pairs, which consists of two parts: the depth map reconstruction error and the flow error between the optical flow and the rigid flow generated by depth reprojection. $L_{smooth}$ is the depth smoothness loss, which follows the same formulation as [2].
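The forward-backward consistency check used to score correspondences can be illustrated with a minimal NumPy sketch; the function name, the nearest-neighbor lookup, and the score mapping are our own simplifications, not the authors' exact formulation:

```python
import numpy as np

def fb_consistency_score(flow_fw, flow_bw):
    """Simplified forward-backward flow consistency score (illustrative).

    flow_fw, flow_bw: (H, W, 2) arrays holding (dx, dy) per pixel.
    Returns a per-pixel score in (0, 1]; 1 means perfectly consistent.
    """
    H, W, _ = flow_fw.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # where each pixel of frame 1 lands in frame 2 (nearest-neighbor lookup)
    x2 = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    y2 = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    # a consistent backward flow should bring the pixel back to its origin,
    # i.e. flow_fw(p) + flow_bw(p + flow_fw(p)) should be close to zero
    err = np.linalg.norm(flow_fw + flow_bw[y2, x2], axis=-1)
    return 1.0 / (1.0 + err)
```

Pixels with high scores (and outside the occlusion mask) are the ones sampled for fundamental matrix estimation.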
In the following parts, we first describe how we recover the relative pose via the fundamental matrix from optical flow. Then, we show how to use the recovered pose to build up self-supervision geometrically without scale ambiguity. Finally, we briefly describe the inference pipeline of our system when applied to visual odometry.
3.2 Fundamental Matrix from Correspondence
We recover the camera pose from optical flow correspondences via the traditional fundamental matrix computation algorithm. Optical flow offers a correspondence for every pixel, but some of them are noisy and thus not suitable for solving the fundamental matrix. We first select reliable correspondences using the occlusion mask $M_o$ and the forward-backward flow consistency score map $M_s$, which are both generated from our flow network. Specifically, we sample the correspondences that lie in non-occluded regions and have top 20% forward-backward scores. Then we randomly draw 6k samples out of the selected correspondences and solve the fundamental matrix via the simple normalized 8-point algorithm [19] in a RANSAC [9] loop. The fundamental matrix is then decomposed into the relative camera pose, denoted as $T_{1\to2} = [R \mid t]$. Note that there are 4 possible solutions for $[R \mid t]$, and we adopt the cheirality condition check, i.e., the triangulated 3D points must be in front of both cameras, to find the best solution. In this way, our predicted camera pose fully depends on the optical flow network, which generalizes better across image sequences and under challenging scenarios.
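The normalized 8-point step can be sketched in NumPy as follows. This is a minimal single-shot solver under our own naming; the RANSAC loop over 6k sampled correspondences, the essential-matrix decomposition, and the cheirality check are omitted:

```python
import numpy as np

def eight_point_fundamental(pts1, pts2):
    """Normalized 8-point algorithm [19] (minimal sketch, no RANSAC loop).

    pts1, pts2: (N, 2) matched pixel coordinates, N >= 8.
    Returns a unit-norm 3x3 fundamental matrix F with x2^T F x1 = 0.
    """
    def normalize(p):
        # translate to centroid and scale so the mean distance is sqrt(2)
        c = p.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(p - c, axis=1))
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        ph = np.hstack([p, np.ones((len(p), 1))]) @ T.T
        return ph, T

    x1, T1 = normalize(pts1)
    x2, T2 = normalize(pts2)
    # each correspondence gives one linear constraint on the 9 entries of F
    A = np.stack([x2[:, i] * x1[:, j] for i in range(3) for j in range(3)],
                 axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # enforce the rank-2 constraint of a fundamental matrix
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1  # undo the normalization
    return F / np.linalg.norm(F)
```

In the full system, F is converted to an essential matrix with the intrinsics and decomposed into the four $[R \mid t]$ candidates before the cheirality check.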
3.3 Twoview Triangulation as Depth Supervision
Recovering the relative camera pose via fundamental matrix estimation from optical flow formulates an easier learning problem and improves generalization, but cannot enforce scale-consistent prediction on its own. To complete this design, we propose to explicitly align the scales of depth and pose. Intuitively, two reasonable solutions for scale alignment exist: 1) aligning depth with pose, or 2) aligning pose with depth. We adopt the former, as it can be formulated as a linear problem using two-view triangulation [18].
Again, instead of using all pixel matches to perform dense triangulation, we first select the most accurate correspondences. Specifically, we generate an inlier score map $M_{in}$ by computing the distance map $M_d$ from each pixel to its corresponding epipolar line, which is helpful for masking out bad matches and non-rigid regions such as moving objects. This inlier score map is then combined with the occlusion mask $M_o$ and the optical flow forward-backward score $M_s$ to sample rigid, non-occluded and accurate correspondences. Here we also randomly draw 6k samples out of the top 20% correspondences and perform two-view triangulation to reconstruct an up-to-scale 3D structure. We adopt midpoint triangulation as it has a linear and robust solution, formulated as follows:
$\min_{s_1, s_2} \left\| (\mathbf{o}_1 + s_1 \mathbf{r}_1) - (\mathbf{o}_2 + s_2 \mathbf{r}_2) \right\|^2$   (2)
where $\mathbf{r}_1$ and $\mathbf{r}_2$ denote the two camera rays generated from an optical flow correspondence, emanating from the camera centers $\mathbf{o}_1$ and $\mathbf{o}_2$; the triangulated point is the midpoint of the shortest segment between the two rays. This problem can be solved analytically, and the solver is naturally differentiable, enabling our system to perform end-to-end joint training. The derivation of the analytical solution is included in the supplementary materials. We use the triangulated 3D structure as the depth supervision. To mitigate numerical issues, such as triangulating matches around the epipoles, we filter the correspondences online with respect to the angle between the camera rays. Also, we filter out the triangulated samples with negative or out-of-bound depth reprojection. Figure 3 visualizes samples of the depth reprojection of the dense triangulated structure. The quality of the depth is promising and feasible to use as a pseudo depth ground-truth signal to guide network learning. This design shares a similar spirit with many recent methods [30, 26, 29, 27] that supervise monocular depth estimation with offline SfM inference, where the reconstructed structure is also used as pseudo ground truth. Compared to those works, our online robust triangulation module explicitly handles occlusion, moving objects and bad matches, and is successfully integrated into the joint training system where correspondence generation and depth prediction benefit from each other.
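The midpoint solver of Eq. (2) is small enough to write out directly; this is our own illustrative single-point implementation, whereas the in-network module operates on batched tensors:

```python
import numpy as np

def midpoint_triangulate(o1, r1, o2, r2):
    """Midpoint triangulation (Eq. 2 sketch): closed form, hence differentiable.

    o1, o2: (3,) camera centers; r1, r2: (3,) ray directions (need not be unit).
    Returns the midpoint of the shortest segment between the two rays.
    """
    # solve min_{s1, s2} ||(o1 + s1 r1) - (o2 + s2 r2)||^2, a 2-unknown
    # linear least-squares problem
    A = np.stack([r1, -r2], axis=1)           # (3, 2)
    b = o2 - o1
    s = np.linalg.lstsq(A, b, rcond=None)[0]  # [s1, s2]
    p1 = o1 + s[0] * r1
    p2 = o2 + s[1] * r2
    return 0.5 * (p1 + p2)
```

When the rays intersect exactly, the two closest points coincide and the midpoint is the intersection itself; near-parallel rays (matches close to the epipoles) make the system ill-conditioned, which is why such correspondences are filtered out.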
3.4 Scale-invariant Design
As mentioned above, we resolve the scale-inconsistency problem by aligning the predicted depth with the triangulated structure. Specifically, we align the monocular depth estimate with a single scale transformation to minimize the error between the transformed depth and the pseudo ground-truth depth from triangulation, as in Eq. (3). The minimized error is then used as the depth loss $L_{depth}$ for backpropagation. This online fitting technique was also introduced in a recent work [11].
$L_{depth} = \min_{s} \sum_{p \in \Omega} \left| s \cdot D_1(p) - D^{tri}(p) \right|$   (3)
where $\Omega$ is the set of sampled correspondences, $D_1$ the predicted depth, and $D^{tri}$ the triangulated depth.
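When the fit is taken in the least-squares sense, the aligning scale of Eq. (3) has a simple closed form; the following sketch is our own simplification (the actual loss and any robust weighting may differ):

```python
import numpy as np

def align_scale(d_pred, d_tri):
    """Align predicted depth to triangulated depth with one scalar (sketch).

    d_pred: (N,) predicted depths at sampled correspondences.
    d_tri:  (N,) up-to-scale triangulated depths (pseudo ground truth).
    Returns (s, loss): the least-squares scale and the remaining mean error.
    """
    # closed-form minimizer of sum_i (s * d_pred_i - d_tri_i)^2
    s = np.dot(d_tri, d_pred) / np.dot(d_pred, d_pred)
    loss = np.mean(np.abs(s * d_pred - d_tri))
    return s, loss
```

Because s is a differentiable function of both depth sets, gradients flow through the alignment into the depth network during joint training.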
The transformed depth is explicitly aligned to the triangulated 3D structure, whose scale is determined by the relative pose scale; thus scale inconsistency is essentially disentangled from the system. The transformed depth can be further used for computing the dense reprojection error $L_{reproj}$, which is formulated in Eq. (4):
$L_{reproj} = L_{flow\text{-}reproj} + L_{depth\text{-}reproj}$   (4)
Given an image pair $(I_1, I_2)$, scale-transformed depth estimates $(D_1', D_2')$, the camera intrinsics $K$, and the relative pose $T_{1\to2}$ recovered from optical flow, the loss $L_{flow\text{-}reproj}$ measures the 2D error between the optical flow and the rigid flow generated by depth reprojection:
$L_{flow\text{-}reproj} = \dfrac{\sum_{p} M_{in}(p) \left\| \phi\left(K T_{1\to2} D_1'(p) K^{-1} \tilde{p}\right) - p - F_{1\to2}(p) \right\|_1}{\sum_{p} M_{in}(p)}$   (5)
where $p$ is a pixel coordinate in $I_1$, $\tilde{p}$ denotes the homogeneous coordinates of $p$, and $F_{1\to2}$ is the optical flow. The operator $\phi(\cdot)$ converts a homogeneous point to pixel coordinates. As mentioned in Sec. 3.3, $M_d$ is the distance map from each pixel to its corresponding epipolar line and $M_{in}$ is the derived inlier score map. $L_{flow\text{-}reproj}$ serves as a geometric regularization term that helps improve the correspondences. The denominator $\sum_p M_{in}(p)$ is for normalization. The depth reprojection error $L_{depth\text{-}reproj}$ is defined as:
$L_{depth\text{-}reproj} = \dfrac{\sum_{p} M_o(p) \left| D_{1\to2}(p') - \hat{D}_2'(p') \right|}{\sum_{p} M_o(p)}$   (6)
where $D_{1\to2}$ is the depth map reprojected by $D_1'$ and $T_{1\to2}$, and $\hat{D}_2'$
is the depth map interpolated from $D_2'$
to align with the reprojected pixel coordinates $p'$ from Eq. (5). $M_o$ is the occlusion mask derived from the optical flow.
3.5 Inference Pipeline on Video Sequences
At inference time, we use the same strategy for relative pose estimation, i.e., fundamental matrix estimation from optical flow correspondences. Then, the scale of the triangulated structure is aligned with that of the monocular depth estimate. When the optical flow magnitude is too small, we instead apply the perspective-n-point (PnP) method directly on the predicted depth. In this way, we essentially avoid scale inconsistency between depth and pose during inference. A recent paper [64] employs a similar visual odometry inference strategy to utilize neural network predictions. However, their depth and flow networks are pretrained separately using a PoseNet architecture, while our method builds a robust joint learning system to learn better depth, pose and flow predictions in a self-supervised manner.
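The PnP fallback can be illustrated with a basic DLT solver over points back-projected from the predicted depth. This is a hypothetical minimal stand-in under our own naming; a practical system would use a robust, RANSAC-wrapped PnP variant:

```python
import numpy as np

def pnp_dlt(X, uv, K):
    """DLT-based PnP (illustrative stand-in for the PnP step in Sec. 3.5).

    X: (N, 3) 3D points back-projected from the predicted depth map.
    uv: (N, 2) pixel observations in the next frame. K: (3, 3) intrinsics.
    Returns (R, t) such that uv ~ project(K (R X + t)). Needs N >= 6.
    """
    # work in normalized camera coordinates
    x = np.hstack([uv, np.ones((len(uv), 1))]) @ np.linalg.inv(K).T
    Xh = np.hstack([X, np.ones((len(X), 1))])
    rows = []
    for (u, v, _), Xi in zip(x, Xh):
        # u = (P1.Xh)/(P3.Xh), v = (P2.Xh)/(P3.Xh) -> two linear rows each
        rows.append(np.concatenate([Xi, np.zeros(4), -u * Xi]))
        rows.append(np.concatenate([np.zeros(4), Xi, -v * Xi]))
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    P = Vt[-1].reshape(3, 4)
    # enforce a valid rotation: orthonormalize the left 3x3 block
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt
    if np.linalg.det(R) < 0:  # fix the global sign so that det(R) = +1
        R, P = -R, -P
    t = P[:, 3] / S.mean()    # undo the DLT scale on the translation
    return R, t
```

Unlike the two-view route, the translation here carries the metric scale of the predicted depth, which is what makes PnP a natural fallback when flow is too small for a stable fundamental matrix.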
4 Experiments
4.1 Implementation Details
Dataset. We first validate our design on the KITTI dataset [13], then conduct extensive experiments on KITTI Odometry, NYUv2 [50] and TUM-RGBD [51] to demonstrate the robustness and generalization ability of our proposed system. For the original KITTI dataset, we use Eigen et al.'s split [8] of the raw data for training, consistent with related works [67, 42, 6, 14]. The images are resized to 832×256. We evaluate the depth network on Eigen et al.'s test split, and the optical flow network on the KITTI 2015 training set. For KITTI Odometry, we follow the standard setting [6, 67, 62] of using sequences 00-08 for training and 09-10 for testing. Since the camera ego-motions in KITTI Odometry are relatively regular and steady, we subsample the original test sequences into shorter versions, mimicking fast camera motions, to test the generalization ability of the networks on unseen data. NYUv2 [50] and TUM-RGBD [51] are two challenging indoor datasets that contain large textureless surfaces and much more complex camera ego-motions.
Network Architectures. Since our work focuses on an improved self-supervised depth-pose learning scheme, we adopt network designs that align with existing self-supervised learning methods. For the depth network, we use the same architecture as [14], which adopts ResNet18 [20] as the encoder and DispNet [15] as the decoder. The optical flow network is based on PWC-Net [52] and handles occlusion using the method described in [56]. The camera pose is computed from filtered optical flow correspondences in a non-parametric manner.
Training.
Our system is implemented in PyTorch
[40]. We use the Adam [25] optimizer with a batch size of 8. The whole training schedule consists of three stages. First, we train only the optical flow network in an unsupervised manner via the image reconstruction loss. After 20 epochs, we freeze the optical flow network and train the depth network for another 20 epochs. Finally, we jointly train both networks for 10 epochs.
4.2 Conventional KITTI Setting
Monocular Depth Estimation. We report monocular depth estimation results on Eigen et al.'s test split of the KITTI [13] dataset, summarized in Table 1. Our method achieves comparable or better performance than state-of-the-art methods [14, 16]. The performance gain comes from our system design, where scale is disentangled from training and robust supervision is acquired from the two-view triangulation module. We further explore the effects of different loss terms. The performance drops slightly without the reprojection loss $L_{reproj}$, as shown in Table 1, and training does not converge without the triangulation supervision loss $L_{depth}$. Figure 4 shows qualitative results of our depth prediction. Note that our method is orthogonal to many previous works, and could potentially be combined with advanced techniques such as online refinement [5, 6] and more effective architectures [17].
Error ↓  Accuracy ↑  
Method  AbsRel  SqRel  RMS  RMSlog  δ<1.25  δ<1.25²  δ<1.25³
Zhou et al. [67]  0.183  1.595  6.709  0.270  0.734  0.902  0.959 
Mahjourian et al. [35]  0.163  1.240  6.220  0.250  0.762  0.916  0.968 
Geonet [62]  0.155  1.296  5.857  0.233  0.793  0.931  0.973 
DDVO [55]  0.151  1.257  5.583  0.228  0.810  0.936  0.974 
DFNet [68]  0.150  1.124  5.507  0.223  0.806  0.933  0.973 
CC [42]  0.140  1.070  5.326  0.217  0.826  0.941  0.975 
EPC++ [34]  0.141  1.029  5.350  0.216  0.816  0.941  0.976 
Struct2depth (ref.) [5]  0.141  1.026  5.291  0.215  0.816  0.945  0.979 
GLNet (ref.) [6]  0.135  1.070  5.230  0.210  0.841  0.948  0.980 
SCSfMLearner [2]  0.137  1.089  5.439  0.217  0.830  0.942  0.975 
Gordon et al. [16]  0.128  0.959  5.230  0.212  0.845  0.947  0.976 
Monodepth2 (w/o pretrain) [14]  0.132  1.044  5.142  0.210  0.845  0.948  0.977 
Monodepth2 [14]  0.115  0.882  4.701  0.190  0.879  0.961  0.982 
Ours (w/o pretrain and $L_{reproj}$)  0.135  0.932  5.128  0.208  0.830  0.943  0.978
Ours (w/o pretrain)  0.130  0.893  5.062  0.205  0.832  0.949  0.981 
Ours  0.113  0.704  4.581  0.184  0.871  0.961  0.984 
Here, "pretrain" denotes ImageNet pretraining of the encoder.
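The error and accuracy columns of Table 1 follow the standard monocular depth evaluation protocol, which can be sketched as follows; per-image median scaling is the common convention for scale-ambiguous monocular predictions (our own helper, not the authors' evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics, as reported in Table 1 (sketch).

    pred, gt: (N,) positive depths over valid ground-truth pixels.
    Returns (abs_rel, sq_rel, rms, rms_log, [d<1.25, d<1.25^2, d<1.25^3]).
    """
    # median scaling: monocular predictions are only defined up to scale
    pred = pred * np.median(gt) / np.median(pred)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    rms_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rms, rms_log, acc
```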
Optical Flow Estimation. Table 2 summarizes optical flow estimation results on the KITTI 2015 training set. We also report the performance of training only our optical flow network, denoted as FlowNet-only. The results show that the optical flow module benefits from the joint depth-pose learning process, and therefore outperforms most previous unsupervised flow estimation methods and joint learning methods. Figure 4 shows some qualitative results.
Method  Noc (EPE)  All (EPE)  Fl (outlier %)
FlowNetS [7]  8.12  14.19   
FlowNet2 [22]  4.93  10.06  30.37% 
UnFlow [38]    8.10  23.27% 
Back2Future [23]    7.04  24.21% 
Geonet [62]  8.05  10.81   
DFNet [68]    8.98  26.01% 
EPC++ [34]    5.84   
CC [42]    5.66  20.93% 
GLNet [6]  4.86  8.35   
Ours (FlowNet-only)  4.96  8.97  25.84% 
Ours  3.60  5.72  18.05% 
4.3 Generalization on Long Sequences
We further extend our system to visual odometry applications. Most current depth-pose learning methods suffer from error drift when applied to long sequences, since the pose network is trained to predict relative poses over short snippets. Recently, Bian et al. [2] proposed a geometric consistency loss to enforce the long-term consistency of pose prediction and showed better results. We test our system against their method and other state-of-the-art depth-pose learning methods on the KITTI Odometry dataset. Since monocular systems lack a real-world scale factor, we align all predicted trajectories to the ground truth by applying a 7-DoF (scale + 6-DoF) transformation. Table 3 shows the results. Because our method essentially mitigates the scale drift of existing scale-inconsistent depth-pose learning methods, we achieve significant performance improvements over state-of-the-art depth-pose learning systems. Although our dense correspondences are learned in an unsupervised manner and no local BA or mapping is used at inference, we achieve results comparable to conventional SLAM systems [39]. Figure 5 shows the recovered trajectories on the two test sequences.
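The 7-DoF trajectory alignment used in this evaluation can be implemented with the Umeyama closed-form solution; a sketch follows (our own helper under assumed names, not the authors' evaluation code):

```python
import numpy as np

def umeyama_align(src, dst):
    """7-DoF (scale + rotation + translation) trajectory alignment (Sec. 4.3).

    src, dst: (N, 3) corresponding camera positions (estimate, ground truth).
    Returns (s, R, t) minimizing sum_i ||s * R @ src_i + t - dst_i||^2.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)            # cross-covariance of the two sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # keep a proper rotation, det = +1
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)    # variance of the source points
    s = (D * S.diagonal()).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The recovered (s, R, t) maps the estimated trajectory into the ground-truth frame before the translational and rotational errors are computed.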
4.4 Generalization on Unseen Ego-motions
To verify the robustness of our method, we design an experiment that tests visual odometry with unseen camera ego-motions. The original sequences in the KITTI Odometry dataset are recorded by cars driving at relatively steady velocities, so there are nearly no abrupt motions. Meanwhile, the distributions of relative poses in the test sequences are quite similar to those in the training set. We sample sequences 09 and 10 with different strides to mimic camera velocity changes, and directly test our method and other depth-pose learning methods, all trained on the original KITTI Odometry training split, on these new sequences. Table 4 shows the results for sequences 09 and 10 sampled with stride 3. Our method is clearly more robust and generalizes much better on this unseen data distribution, even compared to ORB-SLAM2 [39], which frequently fails and re-initializes under fast motion. More surprisingly, as shown in Figure 1, all existing depth-pose learning methods relying on PoseNet fail to predict reasonable and consistent poses, and produce relatively similar trajectories that drift far away from the ground truth. This might be because CNN-based pose estimation acts more like a retrieval method and cannot generalize to unseen data. This interesting finding shares a similar spirit with recent works [45, 66], where the generalization ability of CNN-based absolute pose estimation is studied in depth. With our scale-agnostic system design and the use of conventional two-view geometry, we achieve significantly more robust performance on videos with unseen per-frame ego-motions.
Methods  Seq. 09  Seq. 10
  t_err (%)  r_err (°/100m)  t_err (%)  r_err (°/100m)
ORB-SLAM2 [39]  9.31  0.26  2.66  0.39 
ORB-SLAM2 [39]  2.84  0.25  2.67  0.38 
Zhou et al. [67]  11.34  4.08  15.26  4.08 
Deep-VO-Feat [63]  9.07  3.80  9.60  3.41 
CC [42]  7.71  2.32  9.87  4.47 
SC-SfMLearner [2]  7.60  2.19  10.77  4.63 
Ours  6.93  0.44  4.66  0.62 
Methods  Seq. 09  Seq. 10
  t_err (%)  r_err (°/100m)  t_err (%)  r_err (°/100m)
ORB-SLAM2 [39]  X  X  X  X 
Zhou et al. [67]  49.62  13.69  33.55  16.21 
Deep-VO-Feat [63]  41.24  10.80  24.17  11.31 
CC [42]  41.99  11.47  30.08  14.68 
SC-SfMLearner [2]  52.05  14.39  37.22  18.91 
Ours  7.21  0.56  11.43  2.57 
4.5 Generalization on Indoor Datasets
To further test generalization, we evaluate our method on two indoor benchmarks: NYUv2 [50] and TUM-RGBD [51]. Indoor environments are challenging due to large textureless regions and much more complex ego-motion (compared to the relatively consistent ego-motion on KITTI [13]), which makes the training of most existing self-supervised depth-pose learning methods collapse, as shown in Figure 7. We train our network on the NYUv2 raw training set and evaluate the depth prediction on the labeled test set. Training images are resized to 192×256 by default. Quantitative results are shown in Table 5. Our method achieves state-of-the-art performance among unsupervised learning baselines. To further study the effects of our system design, we introduce two baseline methods in Table 5: the PoseNet baseline is built by substituting our optical flow and two-view triangulation modules with a PoseNet-like architecture, where the relative pose is directly predicted by a convolutional neural network, and the PoseNet-Flow baseline uses optical flow as input to a PoseNet branch to predict the relative pose. See the supplementary material for more details about these two baselines. Our proposed system achieves a large performance gain, indicating the effectiveness and robustness of our system design.
Error ↓  Accuracy ↑  
Method  rel  log10  rms  δ<1.25  δ<1.25²  δ<1.25³
Make3D [47]  0.349    1.214  0.447  0.745  0.897 
Li et al. [28]  0.232  0.094  0.821  0.621  0.886  0.968 
MS-CRF [59]  0.121  0.052  0.586  0.811  0.954  0.987 
DORN [10]  0.115  0.051  0.509  0.828  0.965  0.992 
Zhou et al. [65]  0.208  0.086  0.712  0.674  0.900  0.968 
PoseNet  0.283  0.122  0.867  0.567  0.818  0.912 
PoseNet-Flow  0.221  0.091  0.764  0.659  0.883  0.959 
Ours  0.201  0.085  0.708  0.687  0.903  0.968 
Ours (448×576)  0.189  0.079  0.686  0.701  0.912  0.978 
In addition, we test our method on the TUM-RGBD [51] dataset, which is widely used for evaluating visual odometry and SLAM systems [39, 58]. This dataset is collected mainly with handheld cameras in indoor environments, and contains various challenging conditions such as extremely textureless regions, moving objects, and abrupt motions. We follow the same train/test setting as [60]. Figure 6 shows four trajectory results. The PoseNet-like baseline fails to generalize under this setting and produces poor results. A conventional SLAM system like ORB-SLAM2 works well when rich texture exists, but tends to fail when large textureless regions occur, as in the first and third cases shown in Figure 6. In most cases, thanks to joint dense correspondence learning, our method can establish accurate pixel associations to recover camera ego-motions and produce reasonably good trajectories, again demonstrating our improved generalization.
4.6 Discussion
Our experiments show that, in addition to maintaining on-par or better performance on the widely tested KITTI benchmark, our method achieves significant improvements in robustness and generalization across a variety of settings. This gain in generalization comes from two novel designs: 1) direct camera ego-motion prediction from optical flow, and 2) explicit scale alignment between the depth and the triangulated 3D structure. Our findings suggest that optical flow, which naturally does not suffer from scale ambiguity, is a more robust visual cue than relative pose estimation for deep learning models, especially in challenging scenarios. Likewise, explicitly handling the scale of depth and pose remains crucial for deep-learning-based visual SLAM. However, our current system cannot handle multi-view images where the motion magnitude exceeds the cost volume range of the optical flow network, and pure rotation cannot be handled online by the two-view triangulation module.
5 Conclusion
In this paper, we propose a novel system that tackles scale inconsistency in self-supervised joint depth-pose learning by (1) directly recovering the relative pose from optical flow and (2) explicitly aligning the scales of depth and pose via triangulation. Experiments demonstrate that our method achieves significant improvements in both accuracy and generalization over existing methods. Handling the above-mentioned failure cases, developing general correspondence prediction, and integrating with back-end optimization are interesting future directions.
Acknowledgements
This work was partially supported by NSFC (61725204, 61521002), BNRist and the MOE Key Laboratory of Pervasive Computing.
References
 [1] (2006) Surf: speeded up robust features. In ECCV, pp. 404–417. Cited by: §2.
 [2] (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In NeurIPS, pp. 35–45. Cited by: Appendix C, Table 6, Table 7, §1, §2, §3.1, §3.1, §4.3, Table 1, Table 3, Table 4.
 [3] (2018) CodeSLAM: learning a compact, optimisable representation for dense visual SLAM. In CVPR, pp. 2560–2568. Cited by: §2.
 [4] (2017) DSAC: differentiable RANSAC for camera localization. In CVPR, pp. 6684–6692. Cited by: §2.
 [5] (2019) Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In AAAI, Vol. 33, pp. 8001–8008. Cited by: §2, §4.2, Table 1.
 [6] (2019) Selfsupervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In ICCV, pp. 7063–7072. Cited by: §2, §4.1, §4.2, Table 1, Table 2.
 [7] (2015) Flownet: learning optical flow with convolutional networks. In ICCV, pp. 2758–2766. Cited by: §2.
 [8] (2014) Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, pp. 2366–2374. Cited by: §2, §4.1.
 [9] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §2, §3.2.
 [10] (2018) Deep ordinal regression network for monocular depth estimation. In CVPR, pp. 2002–2011. Cited by: §2, Table 5.
 [11] (2019) Learning single camera depth estimation using dualpixels. In ICCV, pp. 7628–7637. Cited by: §1, §2, §3.4.
 [12] (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In ECCV, pp. 740–756. Cited by: §2, §2.
 [13] (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, pp. 3354–3361. Cited by: §4.1, §4.2, §4.5, Table 1.
 [14] (2019) Digging into selfsupervised monocular depth estimation. In CVPR, pp. 3828–3838. Cited by: Appendix E, §4.1, §4.1, §4.2, Table 1.
 [15] (2017) Unsupervised monocular depth estimation with leftright consistency. In CVPR, pp. 270–279. Cited by: Appendix E, §1, §2, §2, §4.1.
 [16] (2019) Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8977–8986. Cited by: §4.2, Table 1.
 [17] (2019) PackNetsfm: 3d packing for selfsupervised monocular depth estimation. arXiv preprint arXiv:1905.02693. Cited by: §4.2.
 [18] (1997) Triangulation. Computer Vision and Image Understanding 68 (2), pp. 146–157. Cited by: §3.3.
 [19] (1997) In defense of the eightpoint algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (6), pp. 580–593. Cited by: §3.2.
 [20] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.1.
 [21] (1981) Determining optical flow. Artificial Intelligence 17 (13), pp. 185–203. Cited by: §2.
 [22] (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In CVPR, pp. 2462–2470. Cited by: §2, Table 2.
 [23] (2018) Unsupervised learning of multiframe optical flow with occlusions. In ECCV, pp. 690–706. Cited by: §2, Table 2.
 [24] (2015) Posenet: a convolutional network for realtime 6dof camera relocalization. In ICCV, pp. 2938–2946. Cited by: Appendix C, §2.
 [25] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
 [26] (2018) Supervising the new with the old: learning sfm from sfm. In ECCV, pp. 698–713. Cited by: §2, §3.3.
 [27] (2019) Towards robust monocular depth estimation: mixing datasets for zeroshot crossdataset transfer. arXiv preprint arXiv:1907.01341. Cited by: §2, §3.3.

 [28] (2015) Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In CVPR, pp. 1119–1127. Cited by: Table 5.
 [29] (2019) Learning the depths of moving people by watching frozen people. In CVPR, pp. 4521–4530. Cited by: §2, §3.3.
 [30] (2018) MegaDepth: learning single-view depth prediction from internet photos. In CVPR, pp. 2041–2050. Cited by: §2, §3.3.
 [31] (2019) SelFlow: self-supervised learning of optical flow. In CVPR, pp. 4571–4580. Cited by: §2.
 [32] (1999) Object recognition from local scale-invariant features. In ICCV, Vol. 99, pp. 1150–1157. Cited by: §2.
 [33] (1981) An iterative image registration technique with an application to stereo vision. Cited by: §2.
 [34] (2018) Every pixel counts++: joint learning of geometry and motion with 3D holistic understanding. arXiv preprint arXiv:1810.06125. Cited by: §2, Table 1, Table 2.
 [35] (2018) Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, pp. 5667–5675. Cited by: §2, Table 1.
 [36] (2018) What makes good synthetic training data for learning disparity and optical flow estimation? IJCV 126 (9), pp. 942–960. Cited by: §2.
 [37] (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pp. 4040–4048. Cited by: Appendix E.
 [38] (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In AAAI. Cited by: §2, Table 2.

 [39] (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: Table 6, Table 7, Table 8, §4.3, §4.4, §4.5, Table 3, Table 4.
 [40] (2017) Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques. Cited by: §4.1.
 [41] (2018) Deep fundamental matrix estimation. In ECCV, pp. 284–299. Cited by: §2.
 [42] (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, pp. 12240–12249. Cited by: Table 6, Table 7, §2, §4.1, Table 1, Table 2, Table 3, Table 4.
 [43] (2016) Monocular depth estimation using neural regression forest. In CVPR, pp. 5506–5514. Cited by: §1, §2.
 [44] (2011) ORB: an efficient alternative to SIFT or SURF. In ICCV, pp. 2564–2571. Cited by: §2.
 [45] (2019) Understanding the limitations of CNN-based absolute camera pose regression. In CVPR, pp. 3302–3312. Cited by: §2, §3.1, §4.4.
 [46] (2006) Learning depth from single monocular images. In NeurIPS, pp. 1161–1168. Cited by: §2.
 [47] (2008) Make3D: learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5), pp. 824–840. Cited by: §2, Table 5.
 [48] (2016) Structure-from-motion revisited. In CVPR. Cited by: §2.
 [49] (2016) Pixelwise view selection for unstructured multi-view stereo. In ECCV. Cited by: §2.
 [50] (2012) Indoor segmentation and support inference from RGBD images. In ECCV, pp. 746–760. Cited by: §4.1, §4.5.
 [51] (2012) A benchmark for the evaluation of RGB-D SLAM systems. In IROS, pp. 573–580. Cited by: §4.1, §4.5, §4.5.
 [52] (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, pp. 8934–8943. Cited by: Appendix E, §2, §3.1, §4.1, Table 2.
 [53] (2019) BA-Net: dense bundle adjustment network. In ICLR. Cited by: §2.
 [54] (2017) CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In CVPR, pp. 6243–6252. Cited by: §2.
 [55] (2018) Learning depth from monocular videos using direct methods. In CVPR, pp. 2022–2030. Cited by: Appendix E, §2, §3.1, Table 1.
 [56] (2018) Occlusion aware unsupervised learning of optical flow. In CVPR, pp. 4884–4893. Cited by: Appendix E, §3.1, §4.1.
 [57] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: Appendix E, §3.1.
 [58] (2015) ElasticFusion: dense SLAM without a pose graph. Cited by: §4.5.
 [59] (2017) Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In CVPR, pp. 5354–5362. Cited by: Table 5.
 [60] (2019) Beyond tracking: selecting memory and refining poses for deep visual odometry. In CVPR, pp. 8575–8583. Cited by: §4.5.
 [61] (2018) Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In ECCV, pp. 817–833. Cited by: §2.
 [62] (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, pp. 1983–1992. Cited by: Appendix E, §2, §3.1, §4.1, Table 1, Table 2.
 [63] (2018) Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, pp. 340–349. Cited by: Table 6, Table 7, Table 3, Table 4.
 [64] (2019) Visual odometry revisited: what should be learnt? arXiv preprint arXiv:1909.09803. Cited by: §3.5.
 [65] (2019) Moving indoor: unsupervised video depth learning in challenging environments. In ICCV, pp. 8618–8627. Cited by: Appendix D, Figure 7, Table 5.
 [66] (2019) To learn or not to learn: visual localization from essential matrices. arXiv preprint arXiv:1908.01293. Cited by: §2, §3.1, §4.4.
 [67] (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, pp. 1851–1858. Cited by: Appendix A, Appendix C, Table 6, Table 7, §1, §2, §2, §4.1, Table 1, Table 3, Table 4.
 [68] (2018) DF-Net: unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, pp. 36–53. Cited by: Appendix E, §2, Table 1, Table 2.
Appendix
This document provides the supplemental materials that accompany the main paper.

Discussion on Scale-Invariant Design  We provide a more detailed discussion of the scale-invariant design in our system in Section A.

Derivation of Triangulation Module  We include the detailed derivation of the differentiable triangulation module in Section B.

Details for PoseNet and PoseNet-Flow  We introduce more details and results for PoseNet and PoseNet-Flow in Section C.

Additional Results and Discussion for PoseNet-Flow  We present additional experimental results for PoseNet-Flow on visual odometry in Section D.

Implementation Details  We provide more implementation details about network architectures and system hyperparameters in Section E.

Additional Comparison on Sampled KITTI Odometry Dataset  We show more comparison results on the sampled KITTI Odometry dataset in Section F.

Numerical Results on the TUM RGB-D Dataset  We report quantitative results on the TUM RGB-D dataset in Section G.

Additional Visualizations  In Section H, we provide additional visualizations generated by our system on different datasets.
Appendix A Discussion on Scale-Invariant Design
Given a pair of input images, assuming that the fundamental matrix can be accurately recovered from point correspondences and that no additional priors exist, the relative translation of the pair can only be recovered up to an arbitrary scale. On the other hand, monocular depth estimation aims to use priors learned from data to directly infer the corresponding depth image. Assuming that the intrinsic parameters of the camera are known and fixed, the system can make use of common priors such as the height of humans, the width of cars, as well as subtle structural clues to infer the monocular depth, which does not suffer from the scale ambiguity problem.
Most previous works (e.g., [67]) use two separate convolutional neural networks to learn monocular depth and relative pose, and directly impose a photometric consistency constraint by using the predicted relative pose to reproject the predicted depth. This assumes that the scale of the predicted relative pose corresponds to that of the predicted monocular depth, which means that the pose network is required not only to learn feature matching and relative pose recovery, but also to implicitly learn exactly the same scale priors as the monocular depth network. The network must first infer the scale from the two input images and then implicitly integrate the predicted scale into the recovered relative pose, which makes the learning of the pose prediction network extremely hard and degrades its generalization capability.
Our method explicitly resolves this problem with two novel designs:

i. We use an optical flow network to specifically learn pixel-wise matching, then solve the fundamental matrix and recover the relative pose up to an arbitrary scale.

ii. We triangulate the predicted correspondences and explicitly align the predicted depth to the triangulated point clouds to compute the error map.
In this way, the relative pose prediction is not required to implicitly learn the scale priors. This significantly improves generalization, both for training on indoor environments and for inference on video sequences with unseen camera ego-motion. Note that the two designs must be coupled together: if the system only employed design i, without aligning the depth to the triangulated point clouds, the joint training could not converge, because it is impossible to fit the scale of the depth estimation network to an arbitrary scale of relative pose.
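As an illustration of design ii, a minimal sketch of the depth scale alignment (the median-ratio estimator and all names here are illustrative assumptions, not the exact implementation):

```python
import numpy as np

def align_depth_scale(pred_depth, tri_depth, eps=1e-7):
    """Align the up-to-scale predicted depth to the triangulated depth
    with a single median scale factor, so the depth loss is invariant
    to the arbitrary scale of the recovered relative pose."""
    scale = np.median(tri_depth / (pred_depth + eps))
    return scale * pred_depth
```

The aligned depth is then compared against the triangulated structure to form the error map.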
Based on the previous discussion, our system is robust when the camera intrinsic parameters are known and fixed. When the intrinsic parameters vary across sequences at training and inference time, our method can still accurately recover the depth image only under the assumption that the monocular depth estimation network can automatically learn the camera calibration from structural clues in a single image. Otherwise, further designs on the monocular depth network are required to disentangle the influence of different camera fields of view and make the learning problem feasible.
Appendix B Derivation of Triangulation Module
We adopt the midpoint triangulation method to build an up-to-scale 3D structure from 2D correspondences and the relative pose. The midpoint triangulation problem can be solved with a simple linear algorithm. The objective function is as follows:

$$\min_{\mathbf{x},\,t_1,\,t_2}\; C(\mathbf{x}, t_1, t_2) = \|\mathbf{x} - \mathbf{r}_1(t_1)\|^2 + \|\mathbf{x} - \mathbf{r}_2(t_2)\|^2, \quad \mathbf{r}_i(t_i) = \mathbf{o}_i + t_i\mathbf{d}_i, \tag{7}$$

where $\mathbf{r}_1$ and $\mathbf{r}_2$ are the two camera rays generated from an optical flow correspondence, and $\|\cdot\|$ denotes the Euclidean distance. $\mathbf{o}_i = -R_i^{\top}\mathbf{t}_i$ is the ray origin, where $[R_i\,|\,\mathbf{t}_i]$ is the camera extrinsic, and $\mathbf{d}_i = R_i^{\top}K^{-1}\mathbf{p}_i$ is the ray direction, where $\mathbf{p}_i$ is the homogeneous pixel coordinate. The objective function can be written as:

$$C(\mathbf{x}, t_1, t_2) = \sum_{i=1}^{2}\|\mathbf{x} - \mathbf{o}_i - t_i\mathbf{d}_i\|^2. \tag{8}$$

To minimize $C$, we need $\partial C/\partial \mathbf{x} = 0$, which easily gives us:

$$\mathbf{x} = \tfrac{1}{2}\left(\mathbf{o}_1 + t_1\mathbf{d}_1 + \mathbf{o}_2 + t_2\mathbf{d}_2\right). \tag{9}$$

After substitution of $\mathbf{x}$, the cost function becomes:

$$C(t_1, t_2) = \tfrac{1}{2}\,\|\mathbf{o}_1 + t_1\mathbf{d}_1 - \mathbf{o}_2 - t_2\mathbf{d}_2\|^2. \tag{10}$$

Then, setting $\partial C/\partial t_1 = \partial C/\partial t_2 = 0$ and writing $\Delta\mathbf{o} = \mathbf{o}_2 - \mathbf{o}_1$, we have:

$$(\mathbf{d}_1^{\top}\mathbf{d}_1)\,t_1 - (\mathbf{d}_1^{\top}\mathbf{d}_2)\,t_2 = \mathbf{d}_1^{\top}\Delta\mathbf{o}, \qquad (\mathbf{d}_1^{\top}\mathbf{d}_2)\,t_1 - (\mathbf{d}_2^{\top}\mathbf{d}_2)\,t_2 = \mathbf{d}_2^{\top}\Delta\mathbf{o}. \tag{11}$$

From these two linear equations, the solutions of $t_1$ and $t_2$ can be expressed as:

$$t_1 = \frac{(\mathbf{d}_1^{\top}\Delta\mathbf{o})(\mathbf{d}_2^{\top}\mathbf{d}_2) - (\mathbf{d}_2^{\top}\Delta\mathbf{o})(\mathbf{d}_1^{\top}\mathbf{d}_2)}{(\mathbf{d}_1^{\top}\mathbf{d}_1)(\mathbf{d}_2^{\top}\mathbf{d}_2) - (\mathbf{d}_1^{\top}\mathbf{d}_2)^2}, \tag{12}$$

$$t_2 = \frac{(\mathbf{d}_1^{\top}\Delta\mathbf{o})(\mathbf{d}_1^{\top}\mathbf{d}_2) - (\mathbf{d}_2^{\top}\Delta\mathbf{o})(\mathbf{d}_1^{\top}\mathbf{d}_1)}{(\mathbf{d}_1^{\top}\mathbf{d}_1)(\mathbf{d}_2^{\top}\mathbf{d}_2) - (\mathbf{d}_1^{\top}\mathbf{d}_2)^2}. \tag{13}$$

The triangulated point is then computed with Eq. (9). In this way, the triangulation module is naturally differentiable.
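The closed-form solution above can be sketched directly in code. This is a NumPy sketch of Eqs. (9), (12) and (13) for a single ray pair; in the actual system the same arithmetic would run on batched tensors inside the network:

```python
import numpy as np

def midpoint_triangulate(o1, d1, o2, d2, eps=1e-10):
    """Midpoint triangulation of two rays r_i(t) = o_i + t * d_i."""
    d11, d22, d12 = d1 @ d1, d2 @ d2, d1 @ d2
    do = o2 - o1
    denom = d11 * d22 - d12 ** 2 + eps            # ~0 only for parallel rays
    t1 = ((d1 @ do) * d22 - (d2 @ do) * d12) / denom   # Eq. (12)
    t2 = ((d1 @ do) * d12 - (d2 @ do) * d11) / denom   # Eq. (13)
    return 0.5 * (o1 + t1 * d1 + o2 + t2 * d2)         # midpoint, Eq. (9)
```

Since every operation is a differentiable tensor op, gradients flow from the triangulated point back to the correspondences and the pose.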
Appendix C Details for PoseNet and PoseNet-Flow
We implement two baseline methods, named PoseNet and PoseNet-Flow, to compare with our method. The PoseNet system takes image pairs as input and predicts monocular depth and relative pose with a depth branch and a pose branch, respectively. The depth branch uses the same network as our system, and the pose branch adopts the standard PoseNet [24]. Following previous PoseNet-based unsupervised depth-pose joint learning methods [67, 2], we utilize the photometric loss and the depth reprojection loss to train the networks. For the PoseNet-Flow system, we add a flow network to generate optical flow, and feed the generated optical flow, rather than the RGB image pair, to PoseNet for relative pose estimation. The flow network is the same as that of our system; the depth network and the depth-pose training objectives remain the same as in the PoseNet system. We adopt a two-stage training strategy for the PoseNet-Flow system: in the first stage we train the optical flow network; then the flow network is frozen and the depth and pose networks are jointly trained.
Appendix D Additional Results and Discussion for PoseNet-Flow
Table 5 shows the depth estimation results of PoseNet and PoseNet-Flow on the indoor NYUv2 dataset. Due to complex camera motions and large textureless regions, the traditional PoseNet method fails to generate plausible predictions. PoseNet-Flow uses optical flow for pose regression, which improves the interpretability of the system and makes the learning problem easier; this is also discussed in [65]. To further explore the capacity of the PoseNet-Flow system, we conduct experiments on the KITTI Odometry dataset, using two consecutive images as training pairs. Figure 8 and Figure 9 show the results on the standard KITTI dataset and the KITTI dataset sampled with stride 3. While the PoseNet-Flow system produces feasible results on NYUv2 and standard KITTI, it still tends to fail on unseen ego-motions. This could be due to the nature of a trained PoseNet: it behaves more like image retrieval than like solving physical constraints, and thus works well only on test data similar to the training samples. In contrast, our method works well under all these challenging scenarios, showing much improved robustness and generalization ability.
Appendix E Implementation Details
Here we introduce more details about network architectures and training objectives used in our system.
For the depth estimation network, we adopt the same encoder-decoder architecture with skip connections as proposed in [14]. Specifically, ResNet-18 is used as the encoder and DispNet [37, 15] as the decoder, with ELU nonlinearities for all conv layers except the output layer, where we use sigmoids and convert the output disparity $\sigma$ to depth with $D = 1/(a\sigma + b)$; $a$ and $b$ are set so that the output depth is constrained between 0.1 and 100. We only supervise the largest scale of the depth output, and replace the nearest-neighbor upsampling layers in the decoder with bilinear upsampling, which makes training more stable. The depth loss consists of three parts: the triangulation depth loss $\mathcal{L}_{tri}$, the reprojection loss $\mathcal{L}_{rep}$ and the edge-aware depth smoothness loss $\mathcal{L}_{smooth}$. The detailed descriptions of $\mathcal{L}_{tri}$ and $\mathcal{L}_{rep}$ are included in the main paper. Given the image input $I$ and the disparity prediction $d$, the depth smoothness loss is computed as follows:

$$\mathcal{L}_{smooth} = |\partial_x d^{*}|\, e^{-|\partial_x I|} + |\partial_y d^{*}|\, e^{-|\partial_y I|}, \tag{14}$$
where $d^{*} = d/\bar{d}$ is the mean-normalized disparity prediction, used to avoid depth shrinking, as proposed by [55].
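Eq. (14) can be sketched as follows (a NumPy sketch for a single image; shapes and names are our own assumptions):

```python
import numpy as np

def smooth_loss(disp, img, eps=1e-7):
    """Edge-aware smoothness on mean-normalized disparity (cf. Eq. 14).
    disp: (H, W) disparity; img: (H, W, 3) image in [0, 1]."""
    d = disp / (disp.mean() + eps)                  # avoid depth shrinking [55]
    dx_d = np.abs(d[:, 1:] - d[:, :-1])             # horizontal disparity grads
    dy_d = np.abs(d[1:, :] - d[:-1, :])             # vertical disparity grads
    dx_i = np.abs(img[:, 1:] - img[:, :-1]).mean(axis=-1)
    dy_i = np.abs(img[1:, :] - img[:-1, :]).mean(axis=-1)
    # disparity gradients are down-weighted where the image has strong edges
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

A flat disparity map yields zero loss; discontinuities are penalized unless they coincide with image edges.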
For the flow estimation network, we adopt PWC-Net [52] as the backbone for predicting the forward and backward optical flow of an image pair. We utilize the backward warping method proposed in [56] to explicitly handle occlusions. The generated occlusion masks are used not only as better supervision for the optical flow, but also for sampling reliable pixel matches when solving the relative pose and triangulation. Optical flow is predicted and supervised at three different scales. Following [62, 68], we use a combination of an L1 loss, an SSIM loss [57] and a flow smoothness loss for flow supervision. Therefore, the total flow loss is expressed as:

$$\mathcal{L}_{flow} = \alpha\,\frac{1 - \mathrm{SSIM}(I, \hat{I})}{2} + (1 - \alpha)\,\|I - \hat{I}\|_1 + \lambda_s\,\mathcal{L}_{fs}, \tag{15}$$

where $\hat{I}$ is the image warped by the predicted flow and $\mathcal{L}_{fs}$ is the flow smoothness loss, which has a similar formulation as Eq. (14). $\alpha$ and $\lambda_s$ are set to 0.85 and 0.1 respectively.
[Figure panels (cf. Figures 10 and 11): Image, Epipolar Lines, Dense Triangulation, Angle Mask]
For relative pose estimation, we recover the pose by solving for the fundamental matrix. Specifically, we first compute an optical flow forward-backward distance map by flow warping, and generate a forward-backward score map from it. This score map is used for sampling accurate correspondences from the dense flow: we take the top 20% of correspondences according to the score map and then randomly sample 6k matches from them. We use this sampling strategy, rather than direct top-score sampling, to discourage spatial accumulation of the sampled matches. We then run the normalized eight-point algorithm in a RANSAC loop to solve for the fundamental matrix, with the RANSAC inlier threshold and the desired confidence set to 0.1 and 0.99 respectively. After solving for the fundamental matrix, we decompose it into four candidate relative poses and triangulate the matches under each of the four solutions, choosing the one with the most triangulated points in front of both cameras as the final relative pose. An inlier score map is generated from the fundamental matrix to mask out non-rigid regions such as moving objects and bad matches; see examples in Figure 11. Specifically, we compute the distance from each pixel to its corresponding epipolar line, resulting in a distance map, from which the inlier score map is computed. Again we perform top-score sampling and random sampling on the inlier score map to acquire 6k matches. We further filter out the matches which have extremely small ray angles or invalid reprojections. To be specific, given two camera rays $\mathbf{r}_1$ and $\mathbf{r}_2$, where $\mathbf{o}_i$ is the ray origin and $\mathbf{d}_i$ is the ray direction, we have $\mathbf{r}_i(t_i) = \mathbf{o}_i + t_i\mathbf{d}_i$. The cosine of the angle between $\mathbf{d}_1$ and $\mathbf{d}_2$ is then computed, and we filter out the regions where this value is smaller than 0.001. See an example in Figure 10. After filtering, the matches are triangulated into a 3D structure, which is then used for scale alignment and supervision of the depth prediction.
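The two-stage match sampling described above (top 20% by score, then 6k drawn at random) can be sketched as follows (a NumPy sketch; the function name and defaults are illustrative):

```python
import numpy as np

def sample_matches(score_map, top_ratio=0.2, n_samples=6000, seed=0):
    """Sample correspondences: keep the top-scoring pixels, then draw a
    random subset to discourage spatial accumulation of matches."""
    rng = np.random.default_rng(seed)
    scores = score_map.ravel()
    k = max(int(scores.size * top_ratio), n_samples)   # candidate pool size
    k = min(k, scores.size)
    pool = np.argpartition(-scores, k - 1)[:k]         # indices of top-k scores
    chosen = rng.choice(pool, size=min(n_samples, k), replace=False)
    ys, xs = np.unravel_index(chosen, score_map.shape)
    return np.stack([xs, ys], axis=1)                  # (n, 2) pixel coordinates
```

The same routine would be reused on the epipolar-distance inlier score map before triangulation.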
Table 6: Visual odometry results on KITTI Odometry sequences 09 and 10 sampled with stride 2.

Methods             | Seq. 09                  | Seq. 10
                    | t_err (%)  r_err (°/100m) | t_err (%)  r_err (°/100m)
ORB-SLAM2 [39]      | 11.12      0.33           | 2.97       0.36
ORB-SLAM2 [39]      | 2.37       0.40           | 2.97       0.36
Zhou et al. [67]    | 24.75      7.79           | 25.09      11.39
Depth-VO-Feat [63]  | 20.54      6.33           | 16.81      7.59
CC [42]             | 24.49      6.58           | 19.49      10.13
SC-SfMLearner [2]   | 33.35      8.21           | 27.21      14.04
Ours                | 7.02       0.45           | 4.94       0.64
Table 7: Visual odometry results on KITTI Odometry sequences 09 and 10 sampled with stride 4. "X" denotes tracking failure.

Methods             | Seq. 09                  | Seq. 10
                    | t_err (%)  r_err (°/100m) | t_err (%)  r_err (°/100m)
ORB-SLAM2 [39]      | X          X              | X          X
Zhou et al. [67]    | 61.24      18.32          | 38.94      19.62
Depth-VO-Feat [63]  | 42.33      11.88          | 25.83      11.58
CC [42]             | 51.45      14.39          | 34.97      17.09
SC-SfMLearner [2]   | 59.32      17.91          | 42.25      21.04
Ours                | 7.72       1.14           | 17.30      5.94
Appendix F Additional Comparison on Sampled KITTI Odometry Dataset
To better demonstrate the robustness of our system, we provide an additional comparison on the sampled KITTI Odometry dataset. The test sequences 09 and 10 are sampled with strides 2 and 4, and we run the PoseNet-based learning systems and ORB-SLAM2 on these sampled sequences without additional training. Tables 6 and 7 summarize the results for strides 2 and 4 respectively, and the trajectory results are shown in Figures 12 and 13. Again, our system shows improved robustness and generalization ability compared to the baselines. However, when the camera moves extremely fast, such as sampling with stride 4 or more, the optical flow estimation becomes the bottleneck and the performance degrades due to inaccurate correspondences.
Appendix G Numerical Results on the TUM RGB-D Dataset
Table 8: Quantitative results on TUM RGB-D sequences. "X" denotes tracking failure.

Sequences       | fr3/cabinet | fr2/desk | fr3/str_ntex_far | fr3/str_tex_far
PoseNet         | 1.45        | 1.51     | 0.32             | 0.38
ORB-SLAM2 [39]  | X           | 0.006    | X                | 0.009
Ours            | 1.09        | 0.52     | 0.24             | 0.14
In Table 8, we report the quantitative results on the TUM RGB-D dataset. Our method produces reasonable trajectories under challenging scenarios, while the PoseNet baseline fails to generalize. ORB-SLAM2 relies on sparse ORB features to establish correspondences, so it suffers on large textureless regions (fr3/cabinet, fr3/str_ntex_far); however, it works much better than ours when the scene contains rich textures (fr2/desk, fr3/str_tex_far). Our system could be further improved with better optical flow estimation and combination with back-end optimization. TUM RGB-D and NYUv2 are both indoor datasets and share somewhat similar data distributions. We trained our method and PoseNet on the TUM RGB-D dataset and tested directly on the NYUv2 dataset to demonstrate the transfer ability of the trained models. Experimental results show that our model achieves better transfer performance (AbsRel 0.276) than the PoseNet baseline (AbsRel 0.324). However, this transfer ability is still limited and leaves large room for future improvement.
Appendix H Additional Visualizations
[Figure panels: Image, Depth Estimation, Flow Estimation]

[Figure panels: Image, Baseline, Ours, Ground truth]