KP3D
Code for "Self-Supervised 3D Keypoint Learning for Ego-motion Estimation"
Generating reliable illumination and viewpoint invariant keypoints is critical for feature-based SLAM and SfM. State-of-the-art learning-based methods often rely on generating training samples by employing homography adaptation to create 2D synthetic views. While such approaches trivially solve data association between views, they cannot effectively learn from real illumination and non-planar 3D scenes. In this work, we propose a fully self-supervised approach towards learning depth-aware keypoints purely from unlabeled videos by incorporating a differentiable pose estimation module that jointly optimizes the keypoints and their depths in a Structure-from-Motion setting. We introduce 3D Multi-View Adaptation, a technique that exploits the temporal context in videos to self-supervise keypoint detection and matching in an end-to-end differentiable manner. Finally, we show how a fully self-supervised keypoint detection and description network can be trivially incorporated as a front-end into a state-of-the-art visual odometry framework that is robust and accurate.
Detecting interest points in RGB images and matching them across views is a fundamental capability of many robotic systems. Tasks such as Structure-from-Motion (SfM) [2], Visual Odometry (VO) or Simultaneous Localization and Mapping (SLAM) [10] assume that salient keypoints can be detected and re-identified in diverse settings, which requires strong invariance to lighting, viewpoint changes, scale, etc. Until recently, these tasks have mostly relied on hand-engineered keypoint features [34, 44], which have been limited in performance. Deep learning has recently revolutionized many computer vision applications in the supervised setting [25, 49, 29, 47]; however, these methods rely on strong supervision in the form of ground-truth labels, which are often expensive to acquire. Moreover, supervising interest point detection is unnatural, as a human annotator cannot readily identify salient regions in images, let alone the key signatures or descriptors that would allow their re-identification in diverse scenarios. Inspired by recent approaches to keypoint learning [14, 11, 5], we propose a fully self-supervised approach that exploits the temporal context in videos to learn to extract accurate and robust 3D keypoints from a single monocular image (Figure 1).

Our main contribution is a fully self-supervised framework for the learning of depth-aware keypoint detection and description purely from unlabeled videos. We propose a novel framework for the simultaneous learning of keypoint detection, matching and 3D lifting by incorporating a differentiable pose estimation module that tightly couples the two task networks for keypoint estimation (KeypointNet) and depth estimation (DepthNet). We show that by enforcing strong regularization in the form of sparse multi-view geometric constraints, the keypoint and depth networks strongly benefit from jointly optimizing for robust visual ego-motion. Our second contribution is the introduction of 3D Multi-View Adaptation, a novel adaptation technique that exploits the temporal context in videos to further boost the repeatability and matching performance of the keypoint network. For our final contribution, we show how our self-supervised depth-aware keypoint networks can be incorporated as a front-end into a visual odometry framework, enabling robust and accurate ego-motion estimation. We show that when integrating our method with a state-of-the-art tracking method such as Direct Sparse Odometry (DSO) [16], we achieve long-term tracking results that are on par with state-of-the-art stereo methods such as DVSO [53].
Through extensive experiments and ablative analysis, we show that the proposed self-supervised keypoint learning approach achieves state-of-the-art results on challenging benchmarks for keypoint detection, matching and visual odometry.
Until recently, hand-crafted image features such as SIFT [34] or ORB [44] have been the key enabler of feature-based SLAM [37] and SfM applications [1]. State-of-the-art learning-based keypoint detectors and descriptors, however, have increasingly demonstrated improved performance on challenging benchmarks [14, 11, 45, 5], setting a new standard for keypoint-based applications.
Learning-Based Methods for Keypoint Estimation. Rosten and Drummond [42, 43] pioneered learning-based image feature detection by learning a decision tree over image patches to accurately classify corner features with real-time constraints in mind. In TILDE [51], the authors introduced piecewise linear regression models to detect illumination-invariant features. LIFT [54] uses an off-the-shelf SfM algorithm to generate more realistic training data under extreme viewpoint configurations, and learns to describe features that are robust to significant viewpoint and illumination differences. In LF-Net [39], the authors introduced an end-to-end differentiable network which estimates the position, scale and orientation of features by jointly optimizing the detector and descriptor in a single module. More recently, Quad-networks [46] introduced an unsupervised keypoint learning method that learns to rank interest points invariant under diverse image transformations, and extracts keypoints from the top and bottom quantiles. In [14] the authors propose SuperPoint, a self-supervised framework for keypoint learning in which a shared encoder with detector and descriptor heads predicts interest points and descriptors simultaneously. In their work, the authors introduce Homographic Adaptation, a multi-scale homography-based augmentation approach to boosting interest point detection repeatability and cross-domain generalization using synthetic datasets. Building on this work, UnsuperPoint [11] proposed a similar method for efficient keypoint detection and description, trained in a fully self-supervised manner without the need for pseudo ground-truth keypoints. Other works, including the Self-Improving Visual Odometry algorithm [13], take advantage of classical SfM techniques to classify the stability and repeatability of keypoints based on their reprojection error. However, due to the non-differentiable nature of their method, training these models requires multiple iterations of updates with diminishing improvements to the keypoint model. Most recently, in [5], the authors incorporate an end-to-end differentiable and neurally-guided outlier-rejection mechanism (IO-Net) that explicitly generates an additional proxy supervisory signal for the matching keypoint pairs. This allows keypoint descriptions to be further refined as a result of the outlier-rejection network predictions occurring during the two-view matching stage.
Learning-Based Methods for Visual Odometry. Self-supervised methods for depth and ego-motion estimation are becoming increasingly popular, as accurate ground-truth measurements rely heavily on expensive and specialized equipment such as LiDAR and Inertial Navigation Systems (INS). One of the earliest works in self-supervised depth estimation [20] used the photometric loss as proxy supervision to learn a monocular depth network from stereo imagery. Zhou et al. [58] extended this self-supervision to the generalized multi-view case, leveraging constraints typically incorporated in SfM to simultaneously learn depth and camera ego-motion from monocular image sequences.
Several works have extended this approach further, engineering the loss function specifically to handle outliers. However, it has been shown that direct pose estimation (i.e. directly from input images [27]) is prone to overfitting and benefits from feature sparsification, as shown in [4]. Teed and Deng [48] proposed an iterative method to regress dense correspondences from pairs of depth frames and compute the 6-DoF estimate using a PnP [30] algorithm. More recently, the authors of [36] use a model-based pose estimation solution via Perspective-n-Point to recover 6-DoF pose estimates from monocular videos, and use the estimate as a form of supervision to enable semi-supervised depth learning from unlabeled videos and LiDAR. Our work borrows a similar concept; however, we take advantage of the model-based PnP solution and the established inliers to outfit a fully differentiable pose estimation module within the 3D keypoint learning framework. [56] uses PnP along with an estimation of the essential matrix to compute the ego-motion; however, they rely on estimating dense flow using multiple frames, while our method focuses on sparse keypoint detection and optimization using a single frame.

In this section, we introduce our fully self-supervised framework for monocular depth-aware keypoint learning for the task of ego-motion estimation. Notably, we perform depth-aware keypoint learning purely from watching large volumes of unlabeled videos, without any need for supervision in the form of ground-truth or pseudo ground-truth labels. As a consequence of learning the 2D-to-3D keypoint lifting function from monocular videos, we show that this capability can additionally be used to accurately estimate the ego-motion between temporally adjacent images. We illustrate the proposed monocular SfM-based keypoint learning framework in Figure 2.
We formulate monocular depth-aware keypoint learning as follows: given an input monocular image I, we aim to regress keypoint locations p, descriptors f, and scores s, along with a dense depth map D. Functionally, we define 3 components in our framework that are used to enable depth-aware keypoint learning in an end-to-end differentiable setting: (i) KeypointNet, which learns to regress output keypoint locations p, descriptors f and scores s given an input image I. (ii) DepthNet, which learns to predict the scale-ambiguous dense depth map D and, as a result, provides a mechanism to lift the sparse 2D keypoints p to 3D by directly sampling the predicted dense depth at the keypoint locations. We refer to the resulting 3D keypoints P, along with their associated descriptors and scores, as depth-aware keypoints. (iii) A fully differentiable ego-motion estimator that predicts the relative 6-DoF rigid-body transformation x_{t→s} between the target image I_t and the source image I_s. We use p̂_s to denote the keypoints warped from the target image to the source image via the transformation x_{t→s}.
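As a concrete sketch of the 2D-to-3D lifting step, the snippet below samples a dense depth map at keypoint locations and back-projects through a pinhole camera model. This is an illustrative NumPy re-implementation under assumed conventions (known intrinsics K, keypoints as (u, v) pixel coordinates), not the paper's PyTorch code:

```python
import numpy as np

def lift_keypoints_to_3d(keypoints_uv, depth_map, K):
    """Lift 2D keypoints to 3D camera coordinates.

    keypoints_uv: (N, 2) array of (u, v) pixel locations.
    depth_map:    (H, W) dense depth predicted by the depth network.
    K:            (3, 3) pinhole camera intrinsics.
    Returns (N, 3) points P = D(p) * K^{-1} [u, v, 1]^T.
    """
    u = keypoints_uv[:, 0].round().astype(int)
    v = keypoints_uv[:, 1].round().astype(int)
    d = depth_map[v, u]  # sample predicted depth at the keypoint locations
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).astype(float)  # (3, N)
    return (rays * d).T  # scale each unit-plane ray by its depth -> (N, 3)
```

Note that because the lifted 3D points are a smooth function of the sampled depth values, gradients can flow back into the depth network when this is implemented in an autodiff framework.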
Following [58], we compute the depth at multiple scales during training; however, when referring to the associated sparse depth for a set of keypoints, we refer to the depth from the scale with the highest resolution. Notably, in the monocular SfM setting, the predicted depth is ambiguous up to an unknown scale factor.
Multi-View Adaptation. Following the concept of leveraging known geometric transformations to self-supervise and boost keypoint learning [14], we introduce Multi-View Adaptation, a novel self-supervised adaptation technique that leverages epipolar constraints in two-view camera geometry for robust 3D keypoint learning. Crucially, we generalize the works of [5, 11] to self-supervised 3D keypoint learning that leverages the structured geometry of scenes in unlabeled monocular videos. An overview of the proposed pipeline is illustrated in Figure 2.
In the adaptation step, we are interested in computing the set of corresponding keypoints, i.e. keypoints p_t from the target image I_t along with their warped counterparts p̂_s in the source image I_s. We use the predicted keypoints p_t and p_s in the target and source images to compute correspondences via reciprocal matching in descriptor space. Given the set of corresponding keypoints, we compute the associated ego-motion x_{t→s} (see Section 3.3). Once x_{t→s} is known, we compute p̂_s by warping p_t, and we induce a combination of dense photometric losses via image synthesis and sparse geometric losses via reprojection in the monocular two-view setting.
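Reciprocal matching in descriptor space amounts to keeping only mutual nearest neighbors. A minimal brute-force NumPy sketch (the batched implementation used in practice will differ):

```python
import numpy as np

def reciprocal_match(desc_t, desc_s):
    """Mutual nearest-neighbor matching in descriptor space.

    desc_t: (N, D) target descriptors; desc_s: (M, D) source descriptors.
    Returns (k, 2) index pairs (i, j) such that j is the nearest source
    descriptor to target i AND i is the nearest target descriptor to source j.
    """
    # pairwise Euclidean distances, shape (N, M)
    dist = np.linalg.norm(desc_t[:, None, :] - desc_s[None, :, :], axis=-1)
    nn_ts = dist.argmin(axis=1)  # best source match for each target descriptor
    nn_st = dist.argmin(axis=0)  # best target match for each source descriptor
    pairs = [(i, j) for i, j in enumerate(nn_ts) if nn_st[j] == i]
    return np.array(pairs)
```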
Specifically, we use (i) a dense photometric loss, based on warping the source image into the target frame, aimed at optimizing the dense depth prediction of the DepthNet; and (ii) a sparse geometric loss aimed at minimizing the reprojection error between the corresponding keypoints predicted by the KeypointNet.
Homography Adaptation. Following [5, 11], the KeypointNet is additionally trained on image pairs related through a known homography transformation which warps pixels from the source image to the target image. The training image pairs are generated by randomly sampling from a set of predefined homographies. For every warped keypoint in the source image, we compute the corresponding keypoint in the target image based on Euclidean distance, and denote the resulting correspondence set accordingly. This set is then directly used to self-supervise the keypoints by imposing a loss on the consistency of known keypoint pair matches. See Figure 3 for a comparison of the two adaptation techniques.
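To illustrate, a random homography can be composed from simple transformations and used to generate exact keypoint correspondences by warping. The parameter ranges below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def warp_keypoints(kps, H):
    """Apply a 3x3 homography H to (N, 2) keypoints."""
    pts = np.column_stack([kps, np.ones(len(kps))])  # homogeneous coordinates
    warped = (H @ pts.T).T
    return warped[:, :2] / warped[:, 2:3]            # dehomogenize

def sample_homography(rng, max_angle=0.2, max_scale=0.2, max_shift=20.0):
    """Compose a random rotation / scale / translation into one homography."""
    a = rng.uniform(-max_angle, max_angle)
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    return np.array([[s * np.cos(a), -s * np.sin(a), tx],
                     [s * np.sin(a),  s * np.cos(a), ty],
                     [0.0,            0.0,           1.0]])
```

Since the homography is known exactly, warped keypoints give pixel-perfect correspondences, which is precisely why this adaptation cannot capture real non-planar geometry.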
Having computed the correspondences, we utilize a robust estimator to compute the 6-DoF rigid-body pose transformation x_{t→s} between the target and source views.
Pose Estimation via Perspective-n-Point. By lifting the 2D keypoints p_t from the target image to 3D with the associated depth, we use the PnP algorithm [30] to compute the initial relative pose transformation x_{t→s} that geometrically matches the keypoints in the target image to those in the source image. Specifically, we minimize:
E(x_{t→s}) = Σ_j ‖ p_s^j − π( x_{t→s} · P_t^j ) ‖²    (1)
where π is the standard pinhole camera projection model used to project the transformed 3D points onto the source image I_s.
The estimated relative pose x_{t→s} is obtained by minimizing the residual error in Equation (1) using the Gauss-Newton (GN) method (see supplementary material) with RANSAC to ensure robustness to outliers. This step allows us to compute the pose robustly; however, it renders the pose no longer differentiable with respect to the 3D keypoints used to estimate it. To alleviate this limitation, we show how the resulting pose estimate can be used as an initial guess for an end-to-end differentiable pose estimation module within the proposed self-supervised 3D keypoint learning framework.
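The reprojection residual that the PnP + Gauss-Newton step iterates over (Equation 1) can be sketched as follows; for brevity the pose is represented here directly as a rotation matrix and translation vector rather than the on-manifold parameterization a real Gauss-Newton solver would use:

```python
import numpy as np

def reprojection_residuals(P_t, p_s, R, t, K):
    """Per-point residuals of the PnP objective.

    P_t: (N, 3) lifted 3D target keypoints; p_s: (N, 2) matched source pixels.
    R (3, 3), t (3,): candidate rigid-body pose; K (3, 3): intrinsics.
    Returns (N, 2) pixel-space residuals that PnP + RANSAC minimizes.
    """
    P_cam = P_t @ R.T + t          # transform points into the source frame
    proj = P_cam @ K.T             # pinhole projection pi(.)
    uv = proj[:, :2] / proj[:, 2:3]
    return uv - p_s
```

RANSAC repeatedly evaluates these residuals on minimal point subsets and keeps the inlier set whose residual norms fall below a threshold.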
Differentiable Pose Estimation from 3D Keypoints. Inspired by recent monocular direct methods that perform frame-to-keyframe tracking [17], we show that by lifting the matched source keypoints to 3D points P_s and reprojecting the target keypoints via the initial pose estimate, a 3D residual can be formulated to recover the pose in closed form over the inlier set established by PnP:
r(R, t) = Σ_{j∈Ω} ‖ P_s^j − (R · P_t^j + t) ‖²    (2)

where Ω denotes the inlier set established by PnP, and (R, t) are the rotation and translation components of x_{t→s}.
The 3D residual above can be effectively minimized by estimating the rotation and translation separately using a closed-form solution on the established inlier set. We first estimate the rotation R by subtracting the means of the points and minimizing Equation (3), solved in closed form via SVD (otherwise known as the Orthogonal Procrustes problem [57]):
R* = argmin_R ‖ P′_s − R · P′_t ‖_F²   s.t.  RᵀR = I    (3)

where P′ = P − P̄ denotes the mean-subtracted point sets.
R* = U · diag(1, 1, det(U·Vᵀ)) · Vᵀ,   with  U·Σ·Vᵀ = SVD(P′_s · P′_tᵀ)    (4)
Once the rotation R is computed, the translation t can be directly recovered by minimizing:
t* = argmin_t Σ_{j∈Ω} ‖ P_s^j − R · P_t^j − t ‖²  =  P̄_s − R · P̄_t    (5)
Thus, the gradients for the pose rotation and translation can be effectively propagated with respect to the lifted 3D keypoint locations, making the overall pose estimation fully differentiable. The differentiable pose, estimated using the 2D keypoints from the source image and the 3D keypoints from the target image, tightly couples keypoint and depth estimation, thereby allowing both predictions to be further optimized using the overall keypoint learning objective.
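The closed-form pose recovery described above (Orthogonal Procrustes for the rotation, then the translation from the point means) can be sketched as:

```python
import numpy as np

def procrustes_pose(P_t, P_s):
    """Closed-form rigid pose (R, t) minimizing sum ||P_s - (R P_t + t)||^2.

    P_t, P_s: (N, 3) matched 3D inlier keypoints in target / source frames.
    Solves the Orthogonal Procrustes problem via SVD on mean-centered points.
    """
    mu_t, mu_s = P_t.mean(axis=0), P_s.mean(axis=0)
    Pc_t, Pc_s = P_t - mu_t, P_s - mu_s
    # cross-covariance of the centered point sets and its SVD
    U, _, Vt = np.linalg.svd(Pc_s.T @ Pc_t)
    # reflection correction keeps det(R) = +1 (a proper rotation)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ D @ Vt
    t = mu_s - R @ mu_t
    return R, t
```

Because every operation here (means, matrix products, SVD) is differentiable almost everywhere, the same computation in an autodiff framework yields gradients of the pose with respect to the 3D keypoints.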
In this work, we self-supervise the learning of depth-aware keypoints in a fully end-to-end differentiable manner using a combination of photometric and geometric losses. We optimize both the KeypointNet and DepthNet jointly using the following losses:
Keypoint Loss. Based on the descriptor-matched correspondences and the warped keypoints p̂_s obtained via 3D Multi-View Adaptation, we define a loss term that enforces geometric consistency between the 2D keypoints in the source view and the reprojected 3D keypoints of the target view of the same scene:
L_kp = Σ_j ‖ p̂_s^j − p_s^j ‖₂    (6)
Descriptor Loss. Following [5], we use nested hardest sample mining to self-supervise the keypoint descriptors between the two views. Given anchor descriptors f_a from the target frame and their associated positive descriptors f₊ in the source frame, we define the triplet loss:
L_desc = Σ_a max(0, ‖f_a − f₊‖₂ − ‖f_a − f₋‖₂ + m)    (7)
where f₋ is the hardest descriptor sample mined from the source frame, and m is the margin.
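A minimal sketch of the triplet loss with per-anchor hardest negative mining; row i of `anchor` and `positive` are assumed to be a matched descriptor pair (the paper's nested mining is more involved):

```python
import numpy as np

def hardest_triplet_loss(anchor, positive, margin=0.2):
    """Triplet loss with per-anchor hardest negative mining.

    anchor, positive: (N, D) matched descriptor pairs (rows correspond).
    For each anchor, the negative is the closest NON-matching positive
    descriptor; the loss pulls matches together and pushes the hardest
    negatives at least `margin` away.
    """
    dist = np.linalg.norm(anchor[:, None, :] - positive[None, :, :], axis=-1)
    pos = np.diag(dist)                      # distance to the true match
    neg = dist + np.eye(len(anchor)) * 1e9   # mask out the positive pair
    hardest_neg = neg.min(axis=1)            # closest wrong match per anchor
    return np.maximum(0.0, pos - hardest_neg + margin).mean()
```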
Score Loss. The score loss is introduced to identify reliable and repeatable keypoints in the matching process. In particular, we want to ensure that (i) feature pairs have consistent scores across matching views; and (ii) the network learns to predict high scores for good keypoints with low geometric error and strong repeatability. This objective is achieved by minimizing the squared distance between the scores of each matched keypoint pair, and by minimizing or maximizing the average score of a matched keypoint pair if the distance between the paired keypoints is greater or less than the average distance, respectively:
L_score = Σ_j [ (s_t^j + s_s^j)/2 · (d^j − d̄) + (s_t^j − s_s^j)² ]    (8)
where s_t and s_s are the scores of the target and source frames respectively, d^j is the reprojection error of the j-th matched pair, and d̄ is the average reprojection error of the associated points in the current frame, given by d̄ = (1/n) Σ_j d^j. Here, d refers to the 2D Euclidean distance between the matching keypoints.
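An illustrative reading of this objective in code: a consistency term on matched score pairs, plus a term that couples the pair's average score to the deviation of its reprojection error from the mean. This is our sketch of the loss as described above, not necessarily the exact form used in the paper:

```python
import numpy as np

def score_loss(s_t, s_s, reproj_err):
    """Score consistency / reliability loss (sketch).

    s_t, s_s:    (N,) scores of matched target / source keypoints.
    reproj_err:  (N,) 2D reprojection error of each matched pair.
    The squared term enforces consistent scores across views; the first
    term raises scores of pairs with below-average error and lowers
    scores of pairs with above-average error.
    """
    d_bar = reproj_err.mean()
    consistency = (s_t - s_s) ** 2
    reliability = 0.5 * (s_t + s_s) * (reproj_err - d_bar)
    return (reliability + consistency).mean()
```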
We define similar keypoint, descriptor and score loss terms for the Homography Adaptation (HA) case, using its correspondences.
Photometric Loss. In addition to the geometric losses, we impose a dense photometric loss to learn dense depth in the DepthNet. Following [21, 58, 26], we warp the source image I_s into the target frame via the predicted depth and ego-motion estimate x_{t→s}, and impose a structural similarity (SSIM) loss [52] between the synthesized target image Î_t and the original target image I_t. The resulting dense photometric loss is regularized with an L1 pixel-wise loss term (see Appendix for more details):
L_photo(I_t, Î_t) = γ · (1 − SSIM(I_t, Î_t)) / 2 + (1 − γ) · ‖I_t − Î_t‖₁    (9)
To account for parallax errors and the presence of dynamic objects in videos, we compute the pixel-wise minimum of the photometric loss over the set of synthesized source images (i.e. context images) and the target image [22]:
L_photo = min_s L_photo(I_t, Î_t^s)    (10)

where Î_t^s is the target image synthesized from source image I_s.
In addition, we mask out static pixels by removing those whose warped photometric loss L_photo(I_t, Î_t^s) is higher than their corresponding unwarped photometric loss L_photo(I_t, I_s), calculated using the original source image without view synthesis [22]. This has the effect of removing pixels with non-changing appearance, including static frames and dynamic objects with no relative motion:
M = [ min_s L_photo(I_t, Î_t^s) < min_s L_photo(I_t, I_s) ]    (11)

where [·] denotes the Iverson bracket.
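A simplified sketch of the per-pixel minimum reprojection loss with auto-masking, using a plain L1 photometric term in place of the SSIM + L1 mix for brevity:

```python
import numpy as np

def masked_min_photometric(target, synthesized, sources):
    """Per-pixel minimum photometric loss with auto-masking (L1 sketch).

    target:      (H, W) target image I_t.
    synthesized: list of (H, W) source images warped into the target view.
    sources:     list of (H, W) original (unwarped) source images.
    """
    warped = np.min([np.abs(target - s) for s in synthesized], axis=0)
    unwarped = np.min([np.abs(target - s) for s in sources], axis=0)
    mask = warped < unwarped  # discard static / non-moving pixels
    return (warped * mask).sum() / max(mask.sum(), 1)
```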
Depth Smoothness Loss. In order to regularize the depth in texture-less, low-image-gradient regions, we also incorporate an edge-aware smoothness term similar to [20]:
L_smooth = |∂_x d̂_t| · e^{−|∂_x I_t|} + |∂_y d̂_t| · e^{−|∂_y I_t|}    (12)

where d̂_t denotes the predicted inverse depth of the target frame.
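The edge-aware smoothness term can be sketched as follows (grayscale image assumed for brevity; mean normalization of the inverse depth omitted):

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Edge-aware first-order depth smoothness.

    Penalizes depth gradients, down-weighted where the image itself has
    strong gradients (likely genuine depth discontinuities).
    depth, image: (H, W) arrays.
    """
    dd_x = np.abs(np.diff(depth, axis=1))  # horizontal depth gradients
    dd_y = np.abs(np.diff(depth, axis=0))  # vertical depth gradients
    di_x = np.abs(np.diff(image, axis=1))  # horizontal image gradients
    di_y = np.abs(np.diff(image, axis=0))  # vertical image gradients
    return (dd_x * np.exp(-di_x)).mean() + (dd_y * np.exp(-di_y)).mean()
```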
Depth Consistency. Recall that the depth we regress is scale-ambiguous. While recovering scale-consistent depth is not a strict requirement for the proposed framework to learn 3D keypoints, scale-consistency has been shown to be crucial for tasks that involve accurate ego-motion estimation [9, 23]. To this end, we incorporate a depth consistency term that discourages scale drift between dense depth predictions in adjacent frames:
L_const = Σ_j |D_s(p_s^j) − D̂_s(p̂_s^j)| / (D_s(p_s^j) + D̂_s(p̂_s^j))    (13)

where D̂_s denotes the target depth map warped into the source frame via x_{t→s}.
Note that L_const is a sparse loss, defined only on the keypoint correspondences.
Overall Objective. The overall objective used to simultaneously optimize the KeypointNet and DepthNet becomes:
L = α · (L_photo + L_smooth + L_const) + β · (L_kp + L_desc + L_score)    (14)
where α and β are weights used to balance the depth and keypoint losses, and are kept fixed during training.
In this section, we explain how the fully self-supervised depth-aware keypoint network can be incorporated as the front-end of a visual SLAM framework. We show that by integrating our method into a state-of-the-art monocular visual SLAM framework such as DSO [16], we are able to achieve long-term tracking results that are on par with stereo methods such as DVSO [53] or ORB-SLAM2 [38]. Unlike other monocular visual odometry approaches, the superior keypoint matching and stable 3D lifting performance of our proposed method allows us to bootstrap the tracking system, rejecting false matches and outliers and avoiding significant scale drift, as demonstrated in Figures 5 and 1.
Figure 4 shows the whole pipeline of our Deep Semi-Direct Sparse Odometry (DS-DSO) system, which is built on top of the windowed sparse direct bundle adjustment formulation of DSO. As illustrated, we improve the depth initialization of keyframes in the original DSO implementation by using the depth estimated through our proposed self-supervised 3D keypoint network. In addition, we replace the hand-engineered direct semi-dense tracking component with the sparse and robust learned keypoint-based method introduced in this work.
We evaluate our system on the KITTI [18] dataset. We follow the training protocol of [53] and train on KITTI odometry sequences 01, 02, 06, 08, 09 and 10, and evaluate on sequences 00, 03, 04, 05 and 07. We report t_err, the average translational RMSE drift (%), and r_err, the average rotational RMSE drift (deg/100m), both on trajectories of length 100-800m. To evaluate the performance of our DepthNet, we use the Eigen [15] test split, which consists of 697 images with associated depth (we note that the Eigen test split does not overlap the KITTI odometry sequences we use for training).
To evaluate the performance of our keypoint detector and descriptor, we use the HPatches [6] dataset. HPatches consists of a set of 116 image sequences (illumination and viewpoint), each sequence containing a source image and five target images, for a total of 580 image pairs. We quantify detector performance through the Repeatability and Localization Error metrics, and descriptor performance through the Correctness and Matching Score metrics (the exact definitions of these metrics can be found in the Appendix). For a fair comparison, we evaluate the results generated without applying Non-Maxima Suppression (NMS). Following related work [5, 11, 14], we pretrain our KeypointNet on the COCO [33] dataset. We note that pretraining on COCO is completely self-supervised, using Homography Adaptation (more details in the following section).
We implement our networks in PyTorch [41] and train using the ADAM optimizer [28], optimizing KeypointNet and DepthNet jointly. We implement KeypointNet following [5], noting that we use an ImageNet-pretrained ResNet18 backbone, which we found to perform much better than the reference architecture. We follow [21] and implement DepthNet using an ImageNet [12] pretrained ResNet18 backbone along with a depth decoder that outputs inverse depth at 4 scales. However, at test time, only the highest-resolution scale is used for 2D-to-3D keypoint lifting.

We train on snippets of 3 images, with a target image I_t and the adjacent images as context images (otherwise referred to as source images). Using the pairs of target and source images generated via 3D Multi-View Adaptation, we compute the losses as defined in Section 3.4. The dense photometric loss is computed over the context as shown in Equation 11. Additionally, starting from the target image I_t, we also perform Homography Adaptation similar to [5], e.g. translation, rotation, scaling, cropping and symmetric perspective transforms. We also apply per-pixel Gaussian noise, color jitter and Gaussian blur to the images for additional robustness to image lighting.
Pretraining. We pretrain KeypointNet on COCO using Homography Adaptation, halving the learning rate partway through training. We refer to this as our baseline KeypointNet, and evaluate its performance in Table 1. To speed up convergence, we pretrain our DepthNet on the KITTI training sequences (i.e. 01, 02, 06, 08, 09 and 10) using the method described in [21]. We train for 200 epochs with a learning rate of 1e-4 which is periodically decayed. We refer to this as our baseline DepthNet, and we evaluate its performance in the experiments below.
Table 1 shows the performance of our keypoints and descriptors on HPatches [6]. We note that our baseline method, trained on COCO using Homography Adaptation, outperforms all classical as well as learning-based methods in terms of keypoint robustness (Repeatability and Localization Error) and descriptor matching performance (Correctness and Matching Score). As seen in the table, we show further improvements when training with the proposed 3D Multi-View Adaptation method. In addition to the superior VO results reported in Table 3, our method allows us to train a state-of-the-art keypoint detector with an associated descriptor that can robustly detect correspondences in challenging situations. We refer the reader to the supplementary material for additional qualitative results.
Method  240x320, 300 points  480x640, 1000 points
  Rep.  Loc.  Cor-3  M.Score  Rep.  Loc.  Cor-3  M.Score
ORB [44]  0.532  1.429  0.422  0.218  0.525  1.430  0.607  0.204 
SURF [7]  0.491  1.150  0.702  0.255  0.468  1.244  0.745  0.230 
BRISK [31]  0.566  1.077  0.767  0.258  0.653  1.207  0.746  0.211 
SIFT [34]  0.451  0.855  0.845  0.304  0.421  1.011  0.833  0.265 
LFNet(indoor) [40]  0.486  1.341  0.628  0.326  0.467  1.385  0.679  0.287 
LFNet(outdoor) [40]  0.538  1.084  0.728  0.296  0.523  1.183  0.745  0.241 
SuperPoint [14]  0.631  1.109  0.833  0.318  0.593  1.212  0.834  0.281 
UnsuperPoint [11]  0.645  0.832  0.855  0.424  0.612  0.991  0.843  0.383 
IONet [5]  0.686  0.890  0.867  0.544  0.684  0.970  0.851  0.510 
KeyPointNet (Baseline)  0.683  0.816  0.879  0.573  0.682  0.898  0.848  0.534 
KeyPointNet  0.686  0.799  0.858  0.578  0.674  0.886  0.867  0.529 
KPN: KeypointNet (baseline / trained)  DN: DepthNet (baseline / trained)  DP: differentiable pose  TR: tracking (DS-DSO)

Method  r_err train  r_err test  t_err train  t_err test  
1. Baseline  1.02  1.63  6.08  3.14  
2. Ours w/o TR, w/o DP  0.89  1.43  6.12  2.92  
3. Ours w/o TR, w/o KPN trained  0.93  1.61  5.94  2.88  
4. Ours w/o TR, w/o DN trained  0.91  1.58  5.38  2.88  
5. Ours w/o TR  0.83  1.56  5.61  2.68  
6. Ours  0.24  0.26  3.21  1.24 
We summarize our results and comparisons of visual odometry performance with state-of-the-art methods in Table 3. Our method outperforms all other monocular-trained methods, as well as all stereo-trained methods except for DVSO [53]. However, we emphasize that while DVSO is trained from a wide-baseline stereo setup which provides a very strong prior for outlier rejection, our system is trained in a fully self-supervised manner, relying purely on monocular videos, a significantly harder problem. The experimental results indicate that our depth-aware keypoints provide superior matching performance that even rivals state-of-the-art methods trained on stereo imagery.

In addition, we report frame-to-frame trajectory estimation results using the PnP formulation described in Section 3.3. Notably, our frame-to-frame (F2F) method outperforms all other monocular methods except for DFVO [56], which heavily relies on optical flow, RANSAC-based essential matrix estimation and hand-engineered scale-factor recovery. Comparing our F2F estimation results with PnP-based DFVO [56] (DFVO PnP in Table 3), we attribute the superior performance to the direct optimization of sparse 2D-3D keypoints, as opposed to [56], which relies purely on dense optical flow. We show qualitative results of our method in Figure 5, noting that our DS-DSO results accurately follow the ground-truth trajectory with minimal scale drift.
Method  Type  01  02  06  08  09  10  00  03  04  05  07  Train  Test 

 Average Translational RMSE drift (%) on trajectories of length 100-800m.  
ORBSLAMM [38]  Mono        32.40      25.29      26.01  24.53    27.05 
SfMLearner [58]  Mono  35.2  58.8  25.9  21.9  18.8  14.3  66.4  10.8  4.49  18.7  21.3  29.28  16.55 
Zhan et al [55]  Mono          11.9  12.6              12.30 
Bian et al [8]  Mono          11.2  10.1              10.7 
EPC++(mono) [35]  Mono          8.84  8.86              8.85 
Ambrus et al [4]  Mono  17.59  6.82  8.93  8.38  6.49  9.83  7.16  7.66  3.8  6.6  11.48  9.67  7.34 
Monodepth2 [21]  Mono  19.74  3.99  3.80  5.62  5.28  8.47  6.65  8.59  3.62  7.46  9.37  7.82  7.14 
DFVO [56] PnP  Mono          7.12  6.83              6.98 
DFVO [56]  Mono  66.98  3.60  1.03  2.23  2.47  1.96  2.25  2.67  1.43  1.15  0.93  10.2  2.21 
UnDeepVO [32]  Stereo  69.1  5.58  6.20  4.08  7.01  10.6  4.14  5.00  4.49  3.40  3.15  11.68  8.81 
SuperDepth [3]  Stereo  13.48  3.48  1.81  2.25  3.74  2.26  6.12  7.90  11.80  4.58  7.60  4.50  7.60 
Zhu et al [59]  Stereo  45.5  6.40  3.49  4.08  4.66  6.30  4.95  4.83  2.43  3.97  4.50  8.91  5.48 
DFVO [56]  Stereo  56.76  2.38  1.03  1.60  2.61  2.29  1.96  2.49  1.03  1.10  0.97  8.67  2.45 
DVSO [53]  Stereo  1.18  0.84  0.71  1.03  0.83  0.74  0.71  0.77  0.35  0.58  0.73  0.89  0.63 
Ours F2F  Mono  17.79  3.15  1.88  3.06  2.69  5.12  2.76  3.02  1.93  3.30  2.41  5.61  2.68 
Ours DSDSO  Mono  4.70  3.62  0.92  2.46  2.31  5.24  1.83  1.21  0.76  1.84  0.54  3.21  1.24 
 Average Rotational RMSE drift (deg/100m) on trajectories of length 100-800m.  
ORBSLAMM [38]  Mono        12.13      7.37      10.62  10.83    10.23 
Bian et al [8]  Mono          3.35  4.96              4.2 
Zhan et al [55]  Mono          3.60  3.43              3.52 
SfMLearner [58]  Mono  2.74  2.74  4.8  2.91  3.21  3.30  6.13  3.92  5.24  4.1  6.65  4.45  3.26 
EPC++(mono) [35]  Mono          3.34  3.18              3.26 
DFVO [56] PnP  Mono          2.43  3.88              3.12 
Monodepth2 [21]  Mono  1.97  1.56  1.09  1.90  1.60  2.26  2.62  4.77  2.66  2.92  5.38  3.67  1.73 
Ambrus et al [4]  Mono  1.01  0.87  0.39  0.61  0.86  0.98  1.70  3.49  0.42  0.90  2.05  0.79  1.71 
DFVO [56]  Mono  17.04  0.52  0.26  0.30  0.30  0.31  0.58  0.50  0.29  0.30  0.29  2.51  0.31 
UnDeepVO [32]  Stereo  1.60  2.44  1.98  1.79  3.61  4.65  1.92  6.17  2.13  1.5  2.48  2.45  4.13 
SuperDepth [3]  Stereo  1.97  1.10  0.78  0.84  1.19  1.03  2.72  4.30  1.90  1.67  5.17  1.15  3.15 
Zhu et al [59]  Stereo  1.78  1.92  1.02  1.17  1.69  1.59  1.39  2.11  1.16  1.2  1.78  1.50  1.64 
DFVO [56]  Stereo  13.93  0.55  0.30  0.32  0.29  0.37  0.60  0.39  0.25  0.30  0.27  2.11  0.33 
DVSO [53]  Stereo  0.11  0.22  0.20  0.25  0.21  0.21  0.24  0.18  0.06  0.22  0.35  0.20  0.21 
Ours F2F  Mono  0.72  1.01  0.80  0.76  0.61  1.07  1.17  2.45  1.93  1.11  1.16  0.83  1.56 
Ours DSDSO  Mono  0.16  0.22  0.13  0.31  0.30  0.29  0.33  0.33  0.18  0.22  0.23  0.24  0.26 
We summarize our ablative analysis in Table 2. Our baseline (KeypointNet pretrained on COCO and DepthNet trained on KITTI, but the two not optimized together) shows superior results compared to most monocular methods (see Table 3), thus motivating our approach of combining keypoints and depth in a self-supervised learning framework. We notice a significant improvement when training the two networks together (Row 2: Ours w/o TR, w/o DP). Adding the differentiable pose estimation (Row 5: Ours w/o TR) further improves the performance of our system on the t_err metric; we note that the r_err metric does not improve, mostly due to an error in one sequence (please refer to the supplementary material for detailed results for each version of our method on all the KITTI odometry sequences). We further ablate the KeypointNet (Row 3: Ours w/o TR, w/o KPN trained), i.e. we estimate the ego-motion using the DepthNet after training it together with the KeypointNet, but we use the original KeypointNet trained only on COCO. We perform a similar experiment ablating the trained DepthNet (Row 4: Ours w/o TR, w/o DN trained). In both cases we note a performance drop on both the t_err and r_err metrics, concluding that the Multi-View Adaptation training procedure along with the differentiable pose improves both the DepthNet and KeypointNet for the task of visual odometry. We note clear improvements in both the t_err and r_err metrics when comparing the proposed method with Multi-View Adaptation and differentiable pose against the baseline (Row 5 versus Row 1). Finally, we note that when using the DS-DSO tracking system (Row 6), our results improve significantly, which we attribute to the robustness of our features from both a geometry and an appearance perspective. We emphasize that all our results, including the pretraining of our networks, are obtained in a fully self-supervised fashion, without any ground-truth supervision.
In this paper, we proposed a fully self-supervised framework for depth-aware keypoint learning from unlabeled monocular videos, incorporating a novel differentiable pose estimation module that simultaneously optimizes the keypoints and their depths in a Structure-from-Motion setting. Unlike existing learned keypoint methods that employ simple homography adaptation, we introduce Multi-View Adaptation, which exploits the temporal context in videos to further boost the repeatability and matching performance of our proposed keypoint network. The resulting 3D keypoints and associated descriptors exhibit superior performance compared to all other traditional and learned methods, and are also able to learn from realistic non-planar 3D scenes. Finally, we show how our proposed network can be integrated with a monocular visual odometry system to achieve accurate, scale-aware, long-term tracking results which are on par with state-of-the-art stereo methods.
ResNet18 DepthNet. We provide a detailed description of our DepthNet architecture in Table 4. We follow [19] and use a ResNet18 encoder followed by a decoder that outputs inverse depth at 4 scales.
Layer | Description | K | Output Tensor Dim.
#0 | Input RGB image | | 3×H×W
ResidualBlock
 | Conv2d + BatchNorm + ReLU | 3 |
 | Conv2d + BatchNorm | 3 |
Depth Encoder
#1 | Conv2d (S2) + BatchNorm + ReLU | 7 | 64×H/2×W/2
#2 | Conv2d + BatchNorm + ReLU | 3 | 64×H/2×W/2
#3 | ResidualBlock (#2) ×2 | | 64×H/2×W/2
#4 | Max. Pooling (1/2) | 3 | 64×H/4×W/4
#5 | ResidualBlock (#3 + #2) ×2 | | 128×H/4×W/4
#6 | Max. Pooling (1/2) | 3 | 128×H/8×W/8
#7 | ResidualBlock (#4 + #3) ×2 | | 256×H/8×W/8
#8 | Max. Pooling (1/2) | 3 | 256×H/16×W/16
#9 | ResidualBlock (#5 + #4) ×2 | | 512×H/16×W/16
Depth Decoder
#10 | Conv2D + ELU (#9) | 3 | 128×H/16×W/16
#11 | Conv2D + Upsample (#10) | 3 | 128×H/8×W/8
#12 | Conv2D + Sigmoid | 3 | 1×H/8×W/8
#13 | Conv2D + ELU | 3 | 64×H/8×W/8
#14 | Conv2D + Upsample (#7 ⊕ #13) | 3 | 64×H/4×W/4
#15 | Conv2D + Sigmoid | 3 | 1×H/4×W/4
#16 | Conv2D + ELU | 3 | 32×H/4×W/4
#17 | Conv2D + Upsample (#5 ⊕ #16) | 3 | 32×H/2×W/2
#18 | Conv2D + Sigmoid | 3 | 1×H/2×W/2
#19 | Conv2D + ELU | 3 | 16×H/2×W/2
#20 | Conv2D + Upsample (#3 ⊕ #19) | 3 | 16×H×W
#21 | Conv2D + Sigmoid | 3 | 1×H×W
Upsample denotes a nearest-neighbor interpolation operation that doubles the spatial dimensions of the input tensor.
⊕ denotes feature concatenation for skip connections.

ResNet18 KeypointNet. Table 5 details the network architecture of our KeypointNet. We follow [5] but change the network encoder, using a ResNet18 architecture instead, which we found to perform better.
Layer | Description | K | Output Tensor Dim.
#0 | Input RGB image | | 3×H×W
ResidualBlock
 | Conv2d + BatchNorm + ReLU | 3 |
 | Conv2d + BatchNorm | 3 |
KeyPoint Encoder
#1 | Conv2d (S2) + BatchNorm + ReLU | 7 | 64×H/2×W/2
#2 | Conv2d + BatchNorm + ReLU | 3 | 64×H/2×W/2
#3 | ResidualBlock (#2) ×2 | | 64×H/2×W/2
#4 | Max. Pooling (1/2) | 3 | 64×H/4×W/4
#5 | ResidualBlock (#3 + #2) ×2 | | 128×H/4×W/4
#6 | Max. Pooling (1/2) | 3 | 128×H/8×W/8
#7 | ResidualBlock (#4 + #3) ×2 | | 256×H/8×W/8
#8 | Max. Pooling (1/2) | 3 | 256×H/16×W/16
#9 | ResidualBlock (#5 + #4) ×2 | | 512×H/16×W/16
KeyPoint Decoder
#10 | Conv2D + BatchNorm + LReLU (#9) | 3 | 256×H/16×W/16
#11 | Conv2D + Upsample (#10) | 3 | 256×H/8×W/8
#12 | Conv2D + BatchNorm + LReLU | 3 | 256×H/8×W/8
#13 | Conv2D + Upsample (#7 ⊕ #12) | 3 | 128×H/4×W/4
#14 | Conv2D + BatchNorm + LReLU | 3 | 128×H/4×W/4
#15 | Conv2D + Upsample (#5 ⊕ #14) | 3 | 64×H/2×W/2
#16 | Conv2D + BatchNorm + LReLU | 3 | 64×H/2×W/2
Score Head
#17 | Conv2d + BatchNorm + LReLU (#12) | 3 | 256×H/8×W/8
#18 | Conv2d + Sigmoid | 3 | 1×H/8×W/8
Location Head
#19 | Conv2d + BatchNorm + LReLU (#12) | 3 | 256×H/8×W/8
#20 | Conv2d + Tanh | 3 | 2×H/8×W/8
Descriptor Head
#21 | Conv2d + BatchNorm + LReLU (#16) | 3 | 64×H/2×W/2
#22 | Conv2d | 3 | 64×H/2×W/2
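As a sanity check on the Output Tensor Dim. columns of the two tables, the spatial downsampling of the shared ResNet18 encoder can be traced with a few lines of Python (an illustrative sketch; the 128×416 input resolution is only an example, not necessarily the training resolution):

```python
def encoder_shapes(h, w, channels=(64, 64, 128, 256, 512)):
    """Trace the (C, H, W) dimensions of the ResNet18 encoder stages.

    Layer #1 (stride-2 conv) and layers #4, #6, #8 (max pooling) each
    halve the spatial resolution, matching the tables above.
    """
    shapes = []
    it = iter(channels)
    ch = next(it)
    h, w = h // 2, w // 2          # #1: stride-2 conv -> 64 x H/2 x W/2
    shapes.append((ch, h, w))
    ch = next(it)
    shapes.append((ch, h, w))      # #2-#3: resolution unchanged
    for ch in it:                  # #4-#9: pool (1/2), then residual block
        h, w = h // 2, w // 2
        shapes.append((ch, h, w))
    return shapes

print(encoder_shapes(128, 416))
# [(64, 64, 208), (64, 64, 208), (128, 32, 104), (256, 16, 52), (512, 8, 26)]
```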
Method | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Monodepth2 [22] | 0.090 | 0.545 | 3.942 | 0.137 | 0.914 | 0.983 | 0.995
DepthNet baseline | 0.089 | 0.543 | 3.968 | 0.136 | 0.916 | 0.982 | 0.995
DepthNet fine-tuned | 0.094 | 0.572 | 3.805 | 0.138 | 0.912 | 0.981 | 0.994
We perform a quantitative evaluation of our DepthNet on the KITTI dataset, specifically on the Eigen [15] test split, and report the numbers in Table 6. We also include the numbers reported by [22] and note that our DepthNet baseline numbers are on par with those of [22] (this corresponds to row 1, Baseline, of Table 2 in the main text). Table 6 also shows our numbers after fine-tuning the DepthNet and KeypointNet through the proposed Multi-View Adaptation method (this corresponds to row 5, Ours TR, of Table 2 in the main text). We note a slight degradation in the Abs Rel and Sq Rel metrics, but otherwise the numbers are within the error margin of our baseline. These results provide an important sanity check: as the main focus of this work is sparse, depth-aware keypoint learning, we do not expect to see much variation when performing dense depth evaluation. We also note that sparsely evaluating the depth at the keypoints regressed by our method is not feasible with the depth available in the KITTI dataset: even using the denser depth maps provided by [50], only a small fraction of our keypoints have valid depths in the ground-truth maps, which amounts to a very small number of points per image.
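The dense depth metrics reported in Table 6 are the standard KITTI evaluation metrics. A minimal NumPy sketch of their computation (assuming pre-masked 1-D arrays of valid ground-truth and predicted depths; the depth capping and masking details of the Eigen split protocol are omitted here):

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics (Abs Rel, Sq Rel, RMSE,
    RMSE log, and the delta < 1.25^k accuracies)."""
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "d1": np.mean(thresh < 1.25),
        "d2": np.mean(thresh < 1.25 ** 2),
        "d3": np.mean(thresh < 1.25 ** 3),
    }
```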
We define the SSIM loss [52] between a target image I_t and the synthesized image Î_t as:

\mathcal{L}_{\text{SSIM}}(I_t, \hat{I}_t) = \frac{1 - \text{SSIM}(I_t, \hat{I}_t)}{2} \qquad (15)
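A simplified version of this loss can be sketched in NumPy as follows (using global image statistics instead of the local windowed statistics of the full SSIM [52], so this illustrates the shape of the loss rather than the exact implementation):

```python
import numpy as np

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM photometric loss between two images in [0, 1].

    Uses global statistics rather than the usual local windows; returns
    (1 - SSIM) / 2, clamped to [0, 1] as is common in self-supervised
    depth pipelines.
    """
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return np.clip((1 - ssim) / 2, 0.0, 1.0)
```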
Recall that we aim to minimize:

\hat{\xi} = \arg\min_{\xi} \sum_i \left\| r_i(\xi) \right\|^2 \qquad (16)

where r_i(\xi) is the residual of the i-th keypoint match, R \in SO(3) is the rotation matrix, and t \in \mathbb{R}^3 is the translation vector. Together they compose a rigid-body transform T \in SE(3), defined by T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}. The vector \xi \in \mathfrak{se}(3) is a member of the Lie algebra and is mapped to the Lie group SE(3) through the matrix exponential:

T = \exp(\xi^{\wedge}) \qquad (17)
The estimated relative pose can be obtained by optimizing the residual error in Equation (16). The Gauss-Newton (GN) method is used to solve this nonlinear least-squares problem, iteratively updating \xi as follows:

\xi_{k+1} = \xi_k - (J^{\top} J)^{-1} J^{\top} r(\xi_k) \qquad (18)

where J is the Jacobian matrix of the residuals with respect to \xi. RANSAC is used to obtain a robust estimate, rejecting the three major types of outliers that violate the ego-motion assumption: false-positive matches, dynamic objects, and points with incorrect depth estimates.
Table 7 provides detailed results on all the KITTI odometry sequences for each entry of our ablation study (Table 2 of the main text). We note that (i) the proposed contributions, Multi-View Adaptation (row 2 vs. row 1) and the differentiable pose (row 5 vs. row 1), consistently improve over the baseline; and (ii) swapping out the KeypointNet or DepthNet trained with the proposed Multi-View Adaptation for their baseline counterparts (rows 3 and 4) results in worse performance on both the translational and rotational drift metrics.
Method  Type  01  02  06  08  09  10  00  03  04  05  07  Train  Test 

 Average Translational RMSE drift (%) on trajectories of length 100-800 m. 
Baseline  Mono  18.96  3.35  2.16  3.80  3.15  5.06  3.50  3.64  2.33  3.25  3.00  6.08  3.14 
Ours TR, DP  Mono  20.17  3.37  2.15  3.01  2.61  5.39  2.89  3.10  2.88  3.09  2.66  6.12  2.92 
Ours TR, KPN trained  Mono  19.09  3.30  2.23  3.16  2.84  5.03  2.83  3.29  2.02  3.69  2.58  5.94  2.88 
Ours TR, DN trained  Mono  15.53  3.37  1.84  3.63  2.83  5.06  3.56  3.06  2.11  3.33  2.34  5.38  2.88 
Ours TR  Mono  17.79  3.15  1.88  3.06  2.69  5.12  2.76  3.02  1.93  3.30  2.41  5.61  2.68 
Ours  Mono  4.70  3.62  0.92  2.46  2.31  5.24  1.83  1.21  0.76  1.84  0.54  3.21  1.24 
 Average Rotational RMSE drift (°/100 m) on trajectories of length 100-800 m. 
Baseline  Mono  1.02  1.12  0.82  1.00  0.72  1.43  1.26  3.17  1.09  1.24  1.39  1.02  1.63 
Ours TR, DP  Mono  1.08  1.03  0.97  0.73  0.65  0.91  1.24  2.64  1.00  1.08  1.18  0.89  1.43 
Ours TR, KPN trained  Mono  0.84  1.12  0.98  0.77  0.64  1.24  1.23  2.81  1.56  1.24  1.23  0.93  1.61 
Ours TR, DN trained  Mono  0.66  1.13  0.68  0.88  0.62  1.44  1.20  2.74  1.73  1.22  1.04  0.91  1.58 
Ours TR  Mono  0.72  1.01  0.80  0.76  0.61  1.07  1.17  2.45  1.93  1.11  1.16  0.83  1.56 
Ours  Mono  0.16  0.22  0.13  0.31  0.30  0.29  0.33  0.33  0.18  0.22  0.23  0.24  0.26 
We follow [14] and use the Repeatability and Localization Error metrics to evaluate keypoint performance, and the Homography Accuracy and Matching Score metrics to evaluate descriptor performance. We note that all metrics use the same pixel-distance threshold. For homography estimation, consistent with other reported methods, we used the keypoints with the highest scores. Similarly, for frame-to-frame tracking we selected the highest-scoring keypoints to estimate the relative pose.
Repeatability is computed as the ratio of correctly associated keypoints after warping them onto the target frame. We consider a warped keypoint correctly associated if its Euclidean distance to the nearest keypoint in the target frame is below a certain threshold.
Localization Error is computed as the average Euclidean distance between warped and associated keypoints.
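Both keypoint metrics can be sketched as follows (an illustrative NumPy sketch; the threshold value here is a placeholder, not the one used in our evaluation):

```python
import numpy as np

def repeatability(warped_kp, target_kp, threshold=3.0):
    """Repeatability and localization error for N x 2 warped keypoints
    against M x 2 target keypoints (pixel coordinates).

    Returns (ratio of warped keypoints whose nearest target keypoint is
    within `threshold`, mean distance over those correct associations).
    """
    d = np.linalg.norm(warped_kp[:, None, :] - target_kp[None, :, :], axis=2)
    nearest = d.min(axis=1)          # distance to closest target keypoint
    correct = nearest < threshold
    rep = correct.mean()
    loc_err = nearest[correct].mean() if correct.any() else float("nan")
    return rep, loc_err
```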
Homography Accuracy. To compute the homography between two images, we perform reciprocal descriptor matching and use OpenCV’s findHomography method with RANSAC, with a maximum of 5000 iterations and an error threshold of 3. To compute the Homography Accuracy, we compare the estimated homography with the ground-truth homography: we warp the corners of the original image onto the target image using both homographies, compute the average distance between the two sets of warped corners, and check whether that average distance is below a certain threshold.
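The corner-distance computation can be sketched as follows (illustrative NumPy; estimating the homography itself is left to OpenCV’s findHomography):

```python
import numpy as np

def corner_error(H_est, H_gt, h, w):
    """Average distance between the four image corners warped by the
    estimated homography and by the ground-truth homography."""
    corners = np.array(
        [[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype=float
    )
    pts = np.hstack([corners, np.ones((4, 1))])  # homogeneous coordinates

    def warp(H):
        q = pts @ H.T
        return q[:, :2] / q[:, 2:3]

    return np.linalg.norm(warp(H_est) - warp(H_gt), axis=1).mean()
```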
Matching Score is computed as the ratio of successful keypoint associations between the two images, with the associations performed using Euclidean distance in descriptor space.
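A reciprocal (mutual nearest-neighbour) matching-score sketch (illustrative NumPy; the distance threshold is a placeholder, and normalizing over all keypoints is one possible convention):

```python
import numpy as np

def matching_score(desc_a, desc_b, kp_a_warped, kp_b, threshold=3.0):
    """Fraction of keypoints in image A whose descriptor match in image B
    is mutual (nearest neighbour in both directions) and whose warped
    location agrees with the matched keypoint within `threshold` pixels.
    """
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)                     # best match A -> B
    nn_ba = d.argmin(axis=0)                     # best match B -> A
    mutual = nn_ba[nn_ab] == np.arange(len(desc_a))
    dist = np.linalg.norm(kp_a_warped - kp_b[nn_ab], axis=1)
    return np.mean(mutual & (dist < threshold))
```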