
Self-Supervised 3D Keypoint Learning for Ego-motion Estimation

by   Jiexiong Tang, et al.
KTH Royal Institute of Technology
Toyota Research Institute

Generating reliable illumination and viewpoint invariant keypoints is critical for feature-based SLAM and SfM. State-of-the-art learning-based methods often rely on generating training samples by employing homography adaptation to create 2D synthetic views. While such approaches trivially solve data association between views, they cannot effectively learn from real illumination and non-planar 3D scenes. In this work, we propose a fully self-supervised approach towards learning depth-aware keypoints purely from unlabeled videos by incorporating a differentiable pose estimation module that jointly optimizes the keypoints and their depths in a Structure-from-Motion setting. We introduce 3D Multi-View Adaptation, a technique that exploits the temporal context in videos to self-supervise keypoint detection and matching in an end-to-end differentiable manner. Finally, we show how a fully self-supervised keypoint detection and description network can be trivially incorporated as a front-end into a state-of-the-art visual odometry framework that is robust and accurate.



1 Introduction

Detecting interest points in RGB images and matching them across views is a fundamental capability of many robotic systems. Tasks such as Structure-from-Motion (SfM) [2], Visual Odometry (VO) or Simultaneous Localization and Mapping (SLAM) [10] assume that salient keypoints can be detected and re-identified in diverse settings, which requires strong invariance to lighting, viewpoint changes, scale, etc. Until recently, these tasks have mostly relied on hand-engineered keypoint features [34, 44], which have been limited in performance. Deep learning has recently revolutionized many computer vision applications in the supervised setting [25, 49, 29, 47]; however, these methods rely on strong supervision in the form of ground-truth labels, which are often expensive to acquire. Moreover, supervising interest point detection is unnatural, as a human annotator cannot readily identify salient regions in images, nor the key signatures or descriptors that would allow their re-identification in diverse scenarios. Inspired by recent approaches to keypoint learning [14, 11, 5], we propose a fully self-supervised approach that exploits the temporal context in videos to learn to extract accurate and robust 3D keypoints from a single monocular image (Figure 1).

Our main contribution is a fully self-supervised framework for the learning of depth-aware keypoint detection and description purely from unlabeled videos. We propose a novel framework for the simultaneous learning of keypoint detection, matching and 3D lifting by incorporating a differentiable pose estimation module that tightly couples the two task networks for keypoint estimation (KeypointNet) and depth estimation (DepthNet). We show that by enforcing strong regularization in the form of sparse multi-view geometric constraints, the keypoint and depth networks strongly benefit from jointly optimizing for robust visual ego-motion. Our second contribution is the introduction of 3D multi-view adaptation, a novel adaptation technique that exploits the temporal context in videos to further boost the repeatability and matching performance of the keypoint network. For our final contribution, we show how our self-supervised depth-aware keypoint networks can be incorporated as a front-end into a visual odometry framework, enabling robust and accurate ego-motion estimation. We show that when integrating our method with a state-of-the-art tracking method such as Direct Sparse Odometry (DSO) [16], we achieve long-term tracking results that are on par with state-of-the-art stereo methods such as DVSO [53]. Through extensive experiments and ablative analysis, we show that the proposed self-supervised keypoint learning achieves state-of-the-art results on challenging benchmarks for keypoint detection, matching and visual odometry.

2 Related Work

Until recently, handcrafted image features such as SIFT [34] or ORB [44] have been the key enabler of feature-based SLAM [37] and SfM applications [1]. State-of-the-art learning-based keypoint detectors and descriptors, however, have increasingly been demonstrating improved performance on challenging benchmarks [14, 11, 45, 5], setting a new standard for keypoint-based applications.

Learning-Based Methods for Keypoint Estimation  Rosten and Drummond [42, 43] pioneered the detection of learning-based image features by learning a decision tree over image patches and accurately classifying corner features with real-time considerations. In TILDE, the authors introduced piece-wise linear regression models to detect illumination-invariant features. LIFT [54] uses an off-the-shelf SfM algorithm to generate more realistic training data under extreme viewpoint configurations, and learns to describe features that are robust to significant viewpoint and illumination differences. In LF-Net [39], the authors introduced an end-to-end differentiable network which estimates position, scale and orientation of features by jointly optimizing the detector and descriptor in a single module.

More recently, Quad-networks [46] introduced an unsupervised keypoint learning method that learns to rank invariant interest points under diverse image transformations and extracts keypoints from the top and bottom quantiles. In [14] the authors propose SuperPoint, a self-supervised framework for keypoint learning that uses a shared encoder with detector and descriptor heads to predict interest points and descriptors simultaneously. In their work, the authors introduce Homographic Adaptation, a multi-scale homography-based augmentation approach for boosting interest point detection repeatability and cross-domain generalization using synthetic datasets. Building on this work, UnsuperPoint [11] proposed a similar method for efficient keypoint detection and description trained in a fully self-supervised manner without the need for pseudo ground-truth keypoints.

Other works, including the Self-Improving Visual Odometry algorithm [13], take advantage of classical SfM techniques to classify the stability and repeatability of keypoints based on their re-projection error. However, due to the non-differentiable nature of their method, training these models requires multiple iterations of updates with diminishing improvements to the keypoint model. Most recently, in [5], the authors incorporate an end-to-end differentiable and neurally-guided outlier-rejection mechanism (IO-Net) that explicitly generates an additional proxy supervisory signal for the matching keypoint pairs. This allows keypoint descriptors to be further refined as a result of the outlier-rejection network predictions made during the two-view matching stage.

Learning-based Methods for Visual Odometry Self-supervised methods for depth and ego-motion estimation are becoming increasingly popular, as accurate ground-truth measurements rely heavily on expensive and specialized equipment such as LiDAR and Inertial Navigation Systems (INS). One of the earliest works in self-supervised depth estimation [20] used the photometric loss as proxy supervision to learn a monocular depth network from stereo imagery. Zhou et al. [58] extended this self-supervision to the generalized multi-view case, leveraging constraints typically incorporated in SfM to simultaneously learn depth and camera ego-motion from monocular image sequences.

Several works have extended this approach further, engineering the loss function specifically to handle outliers. However, it has been shown that direct pose estimation (i.e. directly from input images [27]) is prone to over-fitting and benefits from feature sparsification, as shown in [4]. Teed and Deng [48] proposed an iterative method to regress dense correspondences from pairs of depth frames and compute the 6-DoF estimate using a PnP [30] algorithm. More recently, the authors of [36] use a model-based pose estimation solution via Perspective-n-Point to recover 6-DoF pose estimates from monocular videos and use the estimate as a form of supervision to enable semi-supervised depth learning from unlabeled videos and LiDAR. Our work borrows a similar concept; however, we take advantage of the model-based PnP solution and the inliers it establishes to build a fully differentiable pose estimation module within the 3D keypoint learning framework. The method of [56] uses PnP along with an estimation of the essential matrix to compute the ego-motion; however, it relies on estimating dense flow using multiple frames, while our method focuses on sparse keypoint detection and optimization using a single frame.

3 Self-Supervised 3D Keypoint Learning for Ego-motion Estimation

Figure 2: Monocular SfM-based 3D Keypoint Learning. We illustrate the overall architecture of our proposed method that uses two consecutive images (target $I_t$ and source $I_c$) as input to self-supervise 3D keypoint learning for monocular ego-motion estimation. We train the DepthNet and KeypointNet simultaneously in an end-to-end fashion with a combination of photometric and multi-view geometric losses (Section 3.4), to develop a robust 3D keypoint estimator for long-term ego-motion estimation.

In this section, we introduce the fully self-supervised framework for monocular depth-aware keypoint learning for the task of ego-motion estimation. Notably, we perform depth-aware keypoint learning purely from watching large volumes of unlabeled videos, without any need for supervision in the form of ground-truth or pseudo ground-truth labels. As a consequence of learning the 2D-to-3D keypoint lifting function from monocular videos, we show that this capability can be additionally used to accurately estimate the ego-motion between temporally adjacent images. We illustrate the proposed monocular SfM-based keypoint learning framework in Figure 2.

3.1 Notation

We formulate monocular depth-aware keypoint learning as follows: given an input monocular image $I$, we aim to regress keypoint locations $p$, descriptors $f$, and scores $s$ along with a dense depth map $D$. Functionally, we define three components in our framework that enable depth-aware keypoint learning in an end-to-end differentiable setting: (i) KeypointNet $f_p$, which learns to regress output keypoint locations $p$, descriptors $f$ and scores $s$ given an input image $I$. (ii) DepthNet $f_D$, which learns to predict the scale-ambiguous dense depth map $D = f_D(I)$ and, as a result, provides a mechanism to lift the sparse 2D keypoints $p$ to 3D by directly sampling from the predicted dense depth $D$, $p^d = \pi^{-1}(p, D(p))$. We refer to the resulting 3D keypoints $p^d$ along with their associated descriptors and scores as $k^d$. (iii) A fully-differentiable ego-motion estimator $f_x$, which predicts the relative 6-DoF rigid-body transformation $x_{t \to c}$ between the target image $I_t$ and the source image $I_c$. We use $p_t^*$ to denote the warped keypoints $p_t$ from the target image $I_t$ to the source image $I_c$ via the transformation $x_{t \to c}$.
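The 2D-to-3D lifting step described in (ii) can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with known intrinsics `K` and uses nearest-pixel depth sampling for brevity; the function name `lift_keypoints` and the toy values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lift_keypoints(keypoints_uv, depth_map, K):
    """Back-project N pixel keypoints (u, v) to 3D camera coordinates.

    keypoints_uv: (N, 2) pixel coordinates (u = column, v = row).
    depth_map:    (H, W) dense depth predicted for the same image.
    K:            (3, 3) pinhole intrinsics (assumed known).
    """
    u = keypoints_uv[:, 0]
    v = keypoints_uv[:, 1]
    # Nearest-pixel sampling keeps the sketch short; a differentiable
    # pipeline would typically use bilinear interpolation instead.
    cols = np.clip(np.rint(u).astype(int), 0, depth_map.shape[1] - 1)
    rows = np.clip(np.rint(v).astype(int), 0, depth_map.shape[0] - 1)
    z = depth_map[rows, cols]
    homogeneous = np.stack([u, v, np.ones_like(u)], axis=0)  # (3, N)
    rays = np.linalg.inv(K) @ homogeneous                     # (3, N)
    return (rays * z).T                                        # (N, 3)

# Toy example: constant depth of 5 and a principal point at (32, 24).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 5.0)
points_3d = lift_keypoints(np.array([[32.0, 24.0], [42.0, 24.0]]), depth, K)
```

A keypoint at the principal point lifts onto the optical axis, while one 10 pixels to its right lands 0.5 units to the side at the same depth.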

Following [58], we compute the depth at multiple scales during training; however, when referring to the associated sparse depth for a set of descriptors, we refer to the depth from the scale with the highest resolution. Notably, in the monocular SfM setting, depth is recovered only up to an unknown scale factor.

3.2 Adaptations for Keypoint Learning

Multi-View Adaptation  Following the concept of leveraging known geometric transformations to self-supervise and boost keypoint learning [14], we introduce Multi-View Adaptation, a novel self-supervised adaptation technique that leverages epipolar constraints in two-view camera geometry for robust 3D keypoint learning. Crucially, we generalize the works of [5, 11] and self-supervise 3D keypoint learning in a way that leverages the structured geometry of scenes in unlabeled monocular videos. An overview of the proposed pipeline is illustrated in Figure 2.

In the adaptation step, we are interested in computing the set of corresponding keypoints, i.e. the keypoints $p_t$ from the target image $I_t$ along with their warped counterparts $p_t^*$ in the source image $I_c$. We use the predicted keypoints $k_t$ and $k_c$ in the target and source images to compute an initial set of correspondences via reciprocal matching in descriptor space. Given this set of corresponding keypoints, we compute the associated ego-motion $x_{t \to c}$ (see Section 3.3). Once $x_{t \to c}$ is known, we compute $p_t^*$ by warping the lifted target keypoints, and we induce a combination of dense photometric losses via image synthesis and sparse geometric losses via re-projection in the monocular two-view setting.
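Reciprocal matching in descriptor space can be sketched as a mutual nearest-neighbour check. The snippet below is an illustrative assumption (Euclidean descriptor distance, brute-force search, hypothetical function name), not the matching code used by the authors.

```python
import numpy as np

def reciprocal_matches(desc_t, desc_c):
    """Return index pairs (i, j) that are mutual nearest neighbours.

    desc_t: (N, C) descriptors from the target image.
    desc_c: (M, C) descriptors from the source image.
    """
    # Pairwise Euclidean distances, shape (N, M).
    dist = np.linalg.norm(desc_t[:, None, :] - desc_c[None, :, :], axis=2)
    nn_t_to_c = dist.argmin(axis=1)  # best source match per target descriptor
    nn_c_to_t = dist.argmin(axis=0)  # best target match per source descriptor
    idx_t = np.arange(desc_t.shape[0])
    # Keep a pair only if each descriptor is the other's nearest neighbour.
    keep = nn_c_to_t[nn_t_to_c] == idx_t
    return np.stack([idx_t[keep], nn_t_to_c[keep]], axis=1)

desc_t = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
desc_c = np.array([[0.0, 1.0], [1.0, 0.0]])
matches = reciprocal_matches(desc_t, desc_c)
```

In this toy case the third target descriptor is closest to the second source descriptor, but that source descriptor prefers the first target descriptor, so the pair is rejected.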

(a) Homography Adaptation
(b) Multi-View Adaptation
Figure 3: Adaptations for Keypoint Learning. We contrast Homography Adaptation, where the warped keypoints can be trivially computed from the known homography, with Multi-View Adaptation, where we first compute the ego-motion $x_{t \to c}$ via the descriptor-matched correspondence set.

Specifically, we use (i) a dense photometric loss based on the warped projection of the predicted depth $D_t$ in the source image $I_c$, aimed at optimizing the dense depth prediction by the DepthNet; and (ii) a sparse geometric loss aimed at minimizing the re-projection error between corresponding keypoints in the target and source images predicted by the KeypointNet.

Homography Adaptation  Following [5, 11], the KeypointNet is additionally trained on image pairs related through a known homography transformation $H$ which warps pixels from the source image to the target image. The training image pairs are generated by randomly sampling from a set of predefined homographies. For every warped keypoint, we find the corresponding predicted keypoint in the other image based on Euclidean distance, and collect the resulting pairs into a correspondence set. This correspondence set is then directly used to self-supervise the keypoints by imposing a loss on the consistency of known keypoint pair matches. See Figure 3 for a comparison of the two adaptation techniques.
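The association step under a known homography can be sketched as follows. The distance threshold `max_dist` and both function names are hypothetical choices made for illustration; the source does not specify them.

```python
import numpy as np

def warp_with_homography(points_uv, H):
    """Apply a 3x3 homography to (N, 2) pixel coordinates."""
    ones = np.ones((points_uv.shape[0], 1))
    warped = (H @ np.hstack([points_uv, ones]).T).T
    return warped[:, :2] / warped[:, 2:3]

def associate(warped_uv, detected_uv, max_dist=4.0):
    """Pair each warped keypoint with its nearest detection.

    max_dist is an assumed pixel threshold for accepting a pair.
    Returns (K, 2) rows of (warped index, detection index).
    """
    dist = np.linalg.norm(warped_uv[:, None] - detected_uv[None], axis=2)
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(warped_uv)), nearest] < max_dist
    return np.stack([np.nonzero(keep)[0], nearest[keep]], axis=1)

# Toy homography: a pure translation by (10, 5) pixels.
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 1.0]])
warped = warp_with_homography(np.array([[0.0, 0.0], [20.0, 20.0]]), H)
pairs = associate(warped, np.array([[30.5, 25.0], [100.0, 100.0]]))
```

Only the second warped keypoint has a detection within the threshold, so a single pair survives.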

3.3 Pose Estimation from 3D Keypoints

Having computed the correspondences between target and source keypoints, we utilize a robust estimator to compute the 6-DoF rigid-body pose transformation between the target and source views.

Pose Estimation via Perspective-n-Point  By lifting the 2D keypoints from the target image to 3D with the associated depth $D_t$, we use the PnP algorithm [30] to compute the initial relative pose transformation $x^0_{t \to c}$ to geometrically match the keypoints in the target image to those in the source image. Specifically, we minimize:


where $\pi(\cdot)$ is the standard pinhole camera projection model used to project the warped points onto the source image $I_c$.

The estimated relative pose is obtained by minimizing the residual error in Equation (1) using the Gauss-Newton (GN) method (see supplementary material) with RANSAC to ensure robustness to outliers. This step allows us to compute the pose robustly; however, it makes the pose no longer differentiable with respect to the 3D keypoints used to estimate it. To alleviate this limitation, we show how the resulting pose estimate can be used as an initial guess for an end-to-end differentiable pose estimation module within the proposed self-supervised 3D keypoint learning framework.
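The quantity that PnP drives down is a 2D re-projection residual: lifted target keypoints are moved by a candidate pose, projected with the pinhole model, and compared with the matched source keypoints. The sketch below evaluates that residual for a given pose; it does not implement the Gauss-Newton or RANSAC loop, and the function names and toy values are assumptions made for illustration.

```python
import numpy as np

def project(points_3d, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    pixels = (K @ points_3d.T).T
    return pixels[:, :2] / pixels[:, 2:3]

def reprojection_residuals(points_t_3d, keypoints_c_uv, R, t, K):
    """Per-keypoint pixel error after moving target points into the source frame."""
    warped = points_t_3d @ R.T + t  # rigid transform, target -> source
    return np.linalg.norm(keypoints_c_uv - project(warped, K), axis=1)

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
points_t = np.array([[0.0, 0.0, 5.0], [0.5, 0.0, 5.0]])
observed_c = np.array([[42.0, 24.0], [52.0, 27.0]])
residuals = reprojection_residuals(
    points_t, observed_c, np.eye(3), np.array([0.5, 0.0, 0.0]), K)
```

A robust estimator would evaluate residuals like these for many candidate poses and treat correspondences with large errors (here the second one, 3 pixels off) as potential outliers.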

Differentiable Pose Estimation from 3D Keypoints  Inspired by recent monocular direct methods that perform frame-to-keyframe tracking [17], we show that by calculating the re-projected source 3D keypoints from the target keypoints via the initial pose estimate $x^0_{t \to c}$, a 3D residual can be formulated to recover the pose in closed form for the inlier set established by PnP:


The 3D residual above can be effectively minimized by estimating the rotation and translation separately using a closed-form solution on the established inlier set. We first estimate the rotation by subtracting the means of the points and minimizing Equation (3) by solving an SVD in closed form (otherwise known as the Orthogonal Procrustes problem [57]):


Once the rotation is computed, the translation can be directly recovered by minimizing:


Thus, the gradients for the pose rotation and translation can be effectively propagated with respect to the lifted 3D keypoint locations, making the overall pose estimation fully differentiable. The differentiable pose, estimated using the 2D keypoints from the source image and the 3D keypoints from the target image, tightly couples keypoint and depth estimation, thereby allowing both predictions to be further optimized using the overall keypoint learning objective.
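The closed-form rotation-then-translation recovery described above corresponds to the classical Orthogonal Procrustes (Kabsch) solution. The following NumPy sketch shows that textbook solution on two matched 3D point sets; it is an illustration under that assumption, not the authors' code, and a training pipeline would use an autograd-capable SVD so that gradients reach the keypoints.

```python
import numpy as np

def procrustes_pose(points_src, points_dst):
    """Closed-form R, t minimising sum ||R @ p_src + t - p_dst||^2.

    points_src, points_dst: (N, 3) matched 3D points (inliers only).
    """
    mean_src = points_src.mean(axis=0)
    mean_dst = points_dst.mean(axis=0)
    # Cross-covariance of the mean-centred point sets, shape (3, 3).
    cov = (points_src - mean_src).T @ (points_dst - mean_dst)
    U, _, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # With R fixed, the optimal translation aligns the two centroids.
    t = mean_dst - R @ mean_src
    return R, t

angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
pts = np.array([[0.0, 0.0, 4.0], [1.0, 0.0, 5.0],
                [0.0, 1.0, 6.0], [1.0, 1.0, 3.0]])
R_est, t_est = procrustes_pose(pts, pts @ R_true.T + t_true)
```

On noise-free correspondences the estimate reproduces the ground-truth rotation and translation up to numerical precision.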

3.4 Keypoint Learning Objective

In this work, we self-supervise the learning of depth-aware keypoints in a fully end-to-end differentiable manner using a combination of photometric and geometric losses. We optimize both the KeypointNet and DepthNet jointly using the following losses:

Keypoint Loss  Based on the descriptor-matched correspondences and the 3D adapted keypoints, we define a loss term that enforces geometric consistency between the 2D keypoints in the source view and the 3D keypoints in the target view of the same scene:


Descriptor Loss  Following [5], we use nested hardest sample mining to self-supervise the keypoint descriptors between the two views. Given anchor descriptors $f_t$ from the target frame and their associated positive descriptors $f_c^+$ in the source frame, we define the triplet loss:


where $f_c^-$ is the hardest negative descriptor sample mined from the source-frame descriptors, with margin $m$.
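A triplet loss with hardest-negative mining can be sketched as below. This is a simplified, single-direction version written for illustration; the margin value and function name are assumptions, and the nested mining scheme of [5] is not reproduced.

```python
import numpy as np

def triplet_loss_hardest(anchors, positives, margin=0.2):
    """anchors[i] and positives[i] are matched descriptors, shape (N, C).

    margin=0.2 is an assumed value for illustration only.
    """
    dist = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    pos_dist = np.diag(dist)
    # Exclude the true match, then take the closest remaining descriptor
    # as the hardest negative for every anchor.
    masked = dist + np.diag(np.full(len(anchors), np.inf))
    neg_dist = masked.min(axis=1)
    return np.maximum(0.0, pos_dist - neg_dist + margin).mean()

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_good = triplet_loss_hardest(anchors, anchors.copy())
loss_bad = triplet_loss_hardest(anchors, anchors[::-1].copy())
```

Well-separated descriptors incur no penalty, while swapping the positives makes every negative closer than its positive and the loss grows accordingly.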

Score Loss  The score loss is introduced to identify reliable and repeatable keypoints in the matching process. In particular, we want to ensure that (i) the feature-pairs have consistent scores across matching views; and (ii) the network learns to predict high scores for good keypoints with low geometric error and strong repeatability. This objective is achieved by minimizing the squared distance between scores for each matched keypoint-pair, and minimizing or maximizing the average score of a matched keypoint-pair if the distance between the paired keypoints is greater or less than the average distance respectively:


where $s_c$ and $s_t$ are the scores of the source and target frames respectively, and $\bar{d}$ is the average re-projection error of associated points in the current frame, given by the mean of the per-pair distances $d$. Here, $d$ refers to the 2D Euclidean distance in feature space between the matching keypoints.
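The two stated goals of the score loss, consistent scores across views and high scores for pairs with below-average distance, can be combined as in the sketch below. The exact form and normalisation are assumptions consistent with the description rather than the authors' formula.

```python
import numpy as np

def score_loss(scores_t, scores_c, distances):
    """scores_t, scores_c: (L,) scores of matched pairs; distances: (L,) pair errors."""
    mean_dist = distances.mean()
    # (i) matched keypoints should receive similar scores in both views.
    consistency = (scores_t - scores_c) ** 2
    # (ii) the mean score is pushed up when the pair is better than average
    # (distance below the mean) and pushed down otherwise.
    usefulness = 0.5 * (scores_t + scores_c) * (distances - mean_dist)
    return (consistency + usefulness).mean()

distances = np.array([1.0, 3.0])
loss_aligned = score_loss(np.array([0.9, 0.1]), np.array([0.9, 0.1]), distances)
loss_inverted = score_loss(np.array([0.1, 0.9]), np.array([0.1, 0.9]), distances)
```

Assigning the high score to the accurate pair yields the lower loss, which is the behaviour the text describes.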

We define similar keypoint, descriptor and score loss terms in the Homography Adaptation (HA) case using the homography-based correspondences.

Photometric Loss  In addition to the geometric losses, we impose a dense photometric loss to learn dense depth in the DepthNet. Following [21, 58, 26], we warp the depth $D_t$ of the target frame $I_t$ via the predicted ego-motion estimate $x_{t \to c}$ to the source frame $I_c$, and impose a structural similarity (SSIM) loss [52] between the synthesized target image $\hat{I}_t$ and the original target image $I_t$. The resulting dense photometric loss is regularized with an L1 pixel-wise loss term (see Appendix for more details):


To account for parallax errors and the presence of dynamic objects in videos, we compute the pixel-wise minimum of the photometric loss between the set of synthesized source images (i.e. context images) and the target image $I_t$ [22].


In addition, we mask out static pixels by removing those which have a warped photometric loss higher than their corresponding unwarped photometric loss, calculated using the original source image without view synthesis [22]. This has the effect of removing pixels with non-changing appearance, including static frames and dynamic objects with no relative motion.
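The per-pixel minimum over context images and the static-pixel mask can be sketched together. For brevity this illustration uses a plain L1 error on grayscale images and omits the SSIM term; the function name and toy arrays are assumptions.

```python
import numpy as np

def masked_min_photometric(target, synthesized, sources):
    """target: (H, W); synthesized, sources: (S, H, W) stacks over context images.

    Returns the mean masked loss and the boolean keep-mask.
    """
    # Per-pixel minimum over the view-synthesised images.
    warped_err = np.abs(synthesized - target[None]).min(axis=0)
    # Same error computed with the original, unwarped source images.
    unwarped_err = np.abs(sources - target[None]).min(axis=0)
    # Keep a pixel only if view synthesis explains it better than doing nothing.
    mask = warped_err < unwarped_err
    loss = warped_err[mask].mean() if mask.any() else 0.0
    return loss, mask

target = np.array([[1.0, 2.0], [3.0, 4.0]])
synthesized = np.stack([target + 0.1, target + 0.5])
sources = np.stack([target + np.array([[1.0, 1.0], [0.0, 1.0]]), target + 2.0])
loss, mask = masked_min_photometric(target, synthesized, sources)
```

The bottom-left pixel looks identical in one unwarped source image, so it is treated as static and excluded from the loss.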


Depth Smoothness Loss  In order to regularize the depth in texture-less, low-gradient image regions, we also incorporate an edge-aware term similar to [20]:
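A common form of edge-aware smoothness weights depth gradients by the exponential of the negative image gradients, so that depth discontinuities are tolerated at image edges. The sketch below assumes that form and a grayscale image; it is illustrative and may differ in detail from the term used by the authors.

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """depth, image: (H, W) arrays; image is assumed grayscale here."""
    d_dx = np.abs(depth[:, 1:] - depth[:, :-1])
    d_dy = np.abs(depth[1:, :] - depth[:-1, :])
    i_dx = np.abs(image[:, 1:] - image[:, :-1])
    i_dy = np.abs(image[1:, :] - image[:-1, :])
    # Depth gradients are down-weighted wherever the image itself has an edge.
    return (d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean()

depth = np.tile(np.array([1.0, 1.0, 2.0, 2.0]), (3, 1))
flat_image = np.zeros((3, 4))
edge_image = np.tile(np.array([0.0, 0.0, 5.0, 5.0]), (3, 1))
penalty_flat = edge_aware_smoothness(depth, flat_image)
penalty_edge = edge_aware_smoothness(depth, edge_image)
```

The same depth step is penalised heavily on a texture-less image and almost not at all when it coincides with an image edge.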


Depth Consistency   Recall that the depth we regress is scale-ambiguous. While recovering scale-consistent depth is not a strict requirement for the proposed framework to learn 3D keypoints, scale-consistency has been shown to be crucial for tasks that involve accurate ego-motion estimation [9, 23]. To this end, we incorporate a depth consistency term that discourages scale-drift between dense depth predictions in adjacent frames:


Note that the depth consistency term is a sparse loss defined on the keypoint correspondences.

Overall Objective   The overall objective used to simultaneously optimize the KeypointNet and DepthNet becomes:


where and are weights used to balance the depth and keypoint losses, and they are chosen as and respectively during the training.

Figure 4: The proposed DS-DSO pipeline. DS-DSO leverages the depth initialization and robust feature tracking provided by our self-supervised depth-aware keypoint detection and description. The red block and arrows show where the 3D keypoints affect the original DSO system; the purple text shows where 2D and 3D information is utilized.

4 Deep Semi-Direct Sparse Odometry

In this section, we explain how the fully self-supervised depth-aware keypoint network can be incorporated as the front-end into a visual SLAM framework. We show that by integrating our method into a state-of-the-art monocular visual SLAM framework such as DSO [16], we are able to achieve long-term tracking results that are on par with stereo methods such as DVSO [53] or ORB-SLAM2 [38]. Unlike other monocular visual odometry approaches, the superior keypoint matching and stable 3D lifting performance of our proposed method allows us to bootstrap the tracking system, rejecting false matches and outliers and avoiding significant scale drift, as demonstrated in Figures 1 and 5.

Figure 4 shows the whole pipeline of our Deep Semi-Direct Sparse Odometry (DS-DSO) system, which is built on top of the windowed sparse direct bundle adjustment formulation of DSO. As illustrated, we improve the depth initialization of keyframes in the original DSO implementation by using the depth estimated through our proposed self-supervised 3D keypoint network. In addition, we replace the hand-engineered direct semi-dense tracking component with the sparse and robust learned keypoint-based method introduced in this work.

5 Experiments

5.1 Datasets

We evaluate our system on the KITTI [18] dataset. We follow the training protocol of [53] and train on KITTI odometry sequences 01, 02, 06, 08, 09 and 10, and evaluate on sequences 00, 03, 04, 05 and 07. We report $t_{rel}$, the average translational RMSE drift (%) on trajectories of length 100-800m, and $r_{rel}$, the average rotational RMSE drift (deg/100m) on trajectories of length 100-800m. To evaluate the performance of our DepthNet we use the Eigen [15] test split, which consists of 697 images with associated depth (we note that the Eigen test split does not overlap the KITTI odometry sequences we use for training).

To evaluate the performance of our keypoint detector and descriptor we use the HPatches [6] dataset. HPatches consists of a set of 116 image sequences (illumination and viewpoint), each sequence containing a source image and five target images, for a total of 580 image pairs. We quantify detector performance through the Repeatability and Localization Error metrics and descriptor performance through the Correctness and Matching Score metrics (the exact definition of these metrics can be found in the Appendix). For a fair comparison, we evaluate the results generated without applying Non-Maxima Suppression (NMS). Following related work [5, 11, 14], we pre-train our KeypointNet on the COCO [33] dataset, which contains training images. We note that pretraining on COCO is completely self-supervised, using Homography Adaptation (more details in the following section).

5.2 Implementation Details

We implement our networks in PyTorch [41] using the ADAM optimizer [28]. We use as the learning rate and train KeypointNet and DepthNet jointly for epochs with a batch size of . We implement KeypointNet following [5], except that we use an ImageNet pre-trained ResNet-18 backbone, which we found to perform much better than the reference architecture. We follow [21] and implement DepthNet using an ImageNet [12] pre-trained ResNet-18 backbone along with a depth decoder that outputs inverse depth at 4 scales. However, at test time, only the highest resolution scale is used for 2D-to-3D keypoint lifting.

We train on snippets of 3 images $I_{t-1}, I_t, I_{t+1}$, with target image $I_t$ and images $I_{t-1}$ and $I_{t+1}$ as context images (otherwise referred to as source images). Using the pair of target and source images generated via 3D Multi-View Adaptation, we compute the losses as defined in Section 3.4. The dense photometric loss is computed over the context as shown in Equation 11. Starting from the target image $I_t$, we also perform Homography Adaptation similar to [5], using e.g. translation, rotation, scaling, cropping and symmetric perspective transforms. Additionally, we apply per-pixel Gaussian noise, color jitter and Gaussian blur to the images for additional robustness to image lighting.

Pretraining  We pre-train KeypointNet on COCO using Homography Adaptation for epochs using a learning rate of which is later halved after epochs. We refer to this as our baseline KeypointNet, and evaluate its performance in Table 1. To speed up convergence, we pre-train our DepthNet on the KITTI training sequences (i.e. 01, 02, 06, 08, 09 and 10) using the method described in [21]. We train for 200 epochs with a learning rate of 1e-4 which is decayed every epochs. We refer to this as our baseline DepthNet, and we evaluate its performance in the experiments below.

5.3 Keypoint Detector and Descriptor Performance

Table 1 shows the performance of our keypoints and descriptors on HPatches [6]. We note that our baseline method, trained on COCO using Homography Adaptation, outperforms all classical as well as learning-based methods in terms of keypoint robustness (Repeatability and Localization Error) and descriptor matching performance (Correctness and Matching Score). As seen in the table, we show further improvements when training using the proposed 3D Multi-View Adaptation method. In addition to the superior VO results reported in Table 3, our method allows us to train a state-of-the-art keypoint detector with an associated descriptor that can robustly detect correspondences in challenging situations. We refer the reader to the supplementary materials for additional qualitative results.

The first four metric columns are for 240x320 images with 300 points; the last four are for 480x640 images with 1000 points.

| Method | Rep. (240x320) | Loc. (240x320) | Cor-3 (240x320) | M.Score (240x320) | Rep. (480x640) | Loc. (480x640) | Cor-3 (480x640) | M.Score (480x640) |
|---|---|---|---|---|---|---|---|---|
| ORB [44] | 0.532 | 1.429 | 0.422 | 0.218 | 0.525 | 1.430 | 0.607 | 0.204 |
| SURF [7] | 0.491 | 1.150 | 0.702 | 0.255 | 0.468 | 1.244 | 0.745 | 0.230 |
| BRISK [31] | 0.566 | 1.077 | 0.767 | 0.258 | 0.746 | 0.211 | 1.207 | 0.653 |
| SIFT [34] | 0.451 | 0.855 | 0.845 | 0.304 | 0.421 | 1.011 | 0.833 | 0.265 |
| LF-Net (indoor) [40] | 0.486 | 1.341 | 0.628 | 0.326 | 0.467 | 1.385 | 0.679 | 0.287 |
| LF-Net (outdoor) [40] | 0.538 | 1.084 | 0.728 | 0.296 | 0.523 | 1.183 | 0.745 | 0.241 |
| SuperPoint [14] | 0.631 | 1.109 | 0.833 | 0.318 | 0.593 | 1.212 | 0.834 | 0.281 |
| UnsuperPoint [11] | 0.645 | 0.832 | 0.855 | 0.424 | 0.612 | 0.991 | 0.843 | 0.383 |
| IO-Net [5] | 0.686 | 0.890 | 0.867 | 0.544 | 0.684 | 0.970 | 0.851 | 0.510 |
| KeyPointNet (Baseline) | 0.683 | 0.816 | 0.879 | 0.573 | 0.682 | 0.898 | 0.848 | 0.534 |
| KeyPointNet | 0.686 | 0.799 | 0.858 | 0.578 | 0.674 | 0.886 | 0.867 | 0.529 |

Table 1: Keypoint and descriptor performance on HPatches [6]. Repeatability and Localization Error measure keypoint performance while Correctness (pixel threshold 3) and Matching Score measure descriptor performance. Higher is better for all metrics except Localization Error.
Components: KPN baseline, DN baseline, KPN trained, DN trained, DP: Diff. Pose, TR: Tracking.

| Method | $r_{rel}$ train | $r_{rel}$ test | $t_{rel}$ train | $t_{rel}$ test |
|---|---|---|---|---|
| 1. Baseline | 1.02 | 1.63 | 6.08 | 3.14 |
| 2. Ours w/o TR, DP | 0.89 | 1.43 | 6.12 | 2.92 |
| 3. Ours w/o TR, KPN trained | 0.93 | 1.61 | 5.94 | 2.88 |
| 4. Ours w/o TR, DN trained | 0.91 | 1.58 | 5.38 | 2.88 |
| 5. Ours w/o TR | 0.83 | 1.56 | 5.61 | 2.68 |
| 6. Ours | 0.24 | 0.26 | 3.21 | 1.24 |

Table 2: Ablative analysis. We ablate from our method: TR - the tracking component (Section 4), DP - the differentiable pose component (Section 3.3), KPN trained - the trained version of the KeyPointNet (i.e. using only the baseline), DN trained - the trained version of the DepthNet. All results are obtained by performing a Sim(3) alignment step [24].

5.4 Visual Odometry Performance

We summarize our results and comparisons of the visual odometry performance with state-of-the-art methods in Table 3. Our method outperforms all other monocular-trained methods, as well as all other stereo-trained methods except for DVSO [53]. However, we emphasize that while DVSO is trained from a wide-baseline stereo setup which provides a very strong prior for outlier rejection, our system is trained in a fully self-supervised manner purely relying on monocular videos - a significantly harder problem. The experimental results indicate that our depth-aware keypoints provide superior matching performance that even rivals state-of-the-art methods trained on stereo imagery.

In addition, we report frame-to-frame trajectory estimation results using the PnP formulation described in Section 3.3. Notably, our frame-to-frame (F2F) method outperforms all other monocular methods except for DF-VO [56], which heavily relies on optical flow, RANSAC-based essential matrix estimation and hand-engineered scale-factor recovery. Comparing our F2F estimation results with PnP-based DF-VO [56] (DF-VO PnP in Table 3), we attribute the superior performance to the direct optimization of sparse 2D-3D keypoints, as opposed to [56], which relies purely on dense optical flow. We show qualitative results of our method in Figure 5, noting that our DS-DSO results accurately follow the ground truth trajectory with minimal scale drift.

$t_{rel}$ - Average Translational RMSE drift (%) on trajectories of length 100-800m.

| Method | Type | 01 | 02 | 06 | 08 | 09 | 10 | 00 | 03 | 04 | 05 | 07 | Train | Test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ORB-SLAM-M [38] | Mono | - | - | - | 32.40 | - | - | 25.29 | - | - | 26.01 | 24.53 | - | 27.05 |
| SfMLearner [58] | Mono | 35.2 | 58.8 | 25.9 | 21.9 | 18.8 | 14.3 | 66.4 | 10.8 | 4.49 | 18.7 | 21.3 | 29.28 | 16.55 |
| Zhan et al [55] | Mono | - | - | - | - | 11.9 | 12.6 | - | - | - | - | - | - | 12.30 |
| Bian et al [8] | Mono | - | - | - | - | 11.2 | 10.1 | - | - | - | - | - | - | 10.7 |
| EPC++(mono) [35] | Mono | - | - | - | - | 8.84 | 8.86 | - | - | - | - | - | - | 8.85 |
| Ambrus et al [4] | Mono | 17.59 | 6.82 | 8.93 | 8.38 | 6.49 | 9.83 | 7.16 | 7.66 | 3.8 | 6.6 | 11.48 | 9.67 | 7.34 |
| Monodepth2 [21] | Mono | 19.74 | 3.99 | 3.80 | 5.62 | 5.28 | 8.47 | 6.65 | 8.59 | 3.62 | 7.46 | 9.37 | 7.82 | 7.14 |
| DF-VO [56] PnP | Mono | - | - | - | - | 7.12 | 6.83 | - | - | - | - | - | - | 6.98 |
| DF-VO [56] | Mono | 66.98 | 3.60 | 1.03 | 2.23 | 2.47 | 1.96 | 2.25 | 2.67 | 1.43 | 1.15 | 0.93 | 10.2 | 2.21 |
| UnDeepVO [32] | Stereo | 69.1 | 5.58 | 6.20 | 4.08 | 7.01 | 10.6 | 4.14 | 5.00 | 4.49 | 3.40 | 3.15 | 11.68 | 8.81 |
| SuperDepth [3] | Stereo | 13.48 | 3.48 | 1.81 | 2.25 | 3.74 | 2.26 | 6.12 | 7.90 | 11.80 | 4.58 | 7.60 | 4.50 | 7.60 |
| Zhu et al [59] | Stereo | 45.5 | 6.40 | 3.49 | 4.08 | 4.66 | 6.30 | 4.95 | 4.83 | 2.43 | 3.97 | 4.50 | 8.91 | 5.48 |
| DF-VO [56] | Stereo | 56.76 | 2.38 | 1.03 | 1.60 | 2.61 | 2.29 | 1.96 | 2.49 | 1.03 | 1.10 | 0.97 | 8.67 | 2.45 |
| DVSO [53] | Stereo | 1.18 | 0.84 | 0.71 | 1.03 | 0.83 | 0.74 | 0.71 | 0.77 | 0.35 | 0.58 | 0.73 | 0.89 | 0.63 |
| Ours F2F | Mono | 17.79 | 3.15 | 1.88 | 3.06 | 2.69 | 5.12 | 2.76 | 3.02 | 1.93 | 3.30 | 2.41 | 5.61 | 2.68 |
| Ours DS-DSO | Mono | 4.70 | 3.62 | 0.92 | 2.46 | 2.31 | 5.24 | 1.83 | 1.21 | 0.76 | 1.84 | 0.54 | 3.21 | 1.24 |

$r_{rel}$ - Average Rotational RMSE drift (deg/100m) on trajectories of length 100-800m.

| Method | Type | 01 | 02 | 06 | 08 | 09 | 10 | 00 | 03 | 04 | 05 | 07 | Train | Test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ORB-SLAM-M [38] | Mono | - | - | - | 12.13 | - | - | 7.37 | - | - | 10.62 | 10.83 | - | 10.23 |
| Bian et al [8] | Mono | - | - | - | - | 3.35 | 4.96 | - | - | - | - | - | - | 4.2 |
| Zhan et al [55] | Mono | - | - | - | - | 3.60 | 3.43 | - | - | - | - | - | - | 3.52 |
| SfMLearner [58] | Mono | 2.74 | 2.74 | 4.8 | 2.91 | 3.21 | 3.30 | 6.13 | 3.92 | 5.24 | 4.1 | 6.65 | 4.45 | 3.26 |
| EPC++(mono) [35] | Mono | - | - | - | - | 3.34 | 3.18 | - | - | - | - | - | - | 3.26 |
| DF-VO [56] PnP | Mono | - | - | - | - | 2.43 | 3.88 | - | - | - | - | - | - | 3.12 |
| Monodepth2 [21] | Mono | 1.97 | 1.56 | 1.09 | 1.90 | 1.60 | 2.26 | 2.62 | 4.77 | 2.66 | 2.92 | 5.38 | 1.73 | 3.67 |
| Ambrus et al [4] | Mono | 1.01 | 0.87 | 0.39 | 0.61 | 0.86 | 0.98 | 1.70 | 3.49 | 0.42 | 0.90 | 2.05 | 0.79 | 1.71 |
| DF-VO [56] | Mono | 17.04 | 0.52 | 0.26 | 0.30 | 0.30 | 0.31 | 0.58 | 0.50 | 0.29 | 0.30 | 0.29 | 2.51 | 0.31 |
| UnDeepVO [32] | Stereo | 1.60 | 2.44 | 1.98 | 1.79 | 3.61 | 4.65 | 1.92 | 6.17 | 2.13 | 1.5 | 2.48 | 2.45 | 4.13 |
| SuperDepth [3] | Stereo | 1.97 | 1.10 | 0.78 | 0.84 | 1.19 | 1.03 | 2.72 | 4.30 | 1.90 | 1.67 | 5.17 | 1.15 | 3.15 |
| Zhu et al [59] | Stereo | 1.78 | 1.92 | 1.02 | 1.17 | 1.69 | 1.59 | 1.39 | 2.11 | 1.16 | 1.2 | 1.78 | 1.50 | 1.64 |
| DF-VO [56] | Stereo | 13.93 | 0.55 | 0.30 | 0.32 | 0.29 | 0.37 | 0.60 | 0.39 | 0.25 | 0.30 | 0.27 | 2.11 | 0.33 |
| DVSO [53] | Stereo | 0.11 | 0.22 | 0.20 | 0.25 | 0.21 | 0.21 | 0.24 | 0.18 | 0.06 | 0.22 | 0.35 | 0.20 | 0.21 |
| Ours F2F | Mono | 0.72 | 1.01 | 0.80 | 0.76 | 0.61 | 1.07 | 1.17 | 2.45 | 1.93 | 1.11 | 1.16 | 0.83 | 1.56 |
| Ours DS-DSO | Mono | 0.16 | 0.22 | 0.13 | 0.31 | 0.30 | 0.29 | 0.33 | 0.33 | 0.18 | 0.22 | 0.23 | 0.24 | 0.26 |
Table 3: Comparison of vision-based trajectory estimation with state-of-the-art methods. The Type column indicates the data used at training time. Note: all methods are evaluated on monocular data. Our results are obtained after performing a single Sim(3) alignment step [24] with respect to the ground truth trajectories. Bold text denotes the best method trained on monocular data; underlined text denotes the best overall method. For our method, as well as for [3, 53], Sequences 01, 02, 06, 08, 09 and 10 are the train sequences and Sequences 00, 03, 04, 05 and 07 are the test sequences. [58, 55, 35, 32, 59, 56] are trained on Sequences 00-08 and tested on Sequences 09 and 10. The numbers for [38] are reported from [32]. The numbers for [21] are based on our own implementation.
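The Train and Test columns of Table 3 appear to be plain means over each method's own split. The following is a minimal sketch, assuming the split for our method that can be inferred from the row values (Sequences 01, 02, 06, 08, 09 and 10 for training, the remaining ones for testing); the helper name `split_means` is hypothetical.

```python
from statistics import mean

# Column order of Table 3 and the translational drift (%) of the "Ours F2F" row.
SEQUENCES = ["01", "02", "06", "08", "09", "10", "00", "03", "04", "05", "07"]
OURS_F2F = [17.79, 3.15, 1.88, 3.06, 2.69, 5.12, 2.76, 3.02, 1.93, 3.30, 2.41]

# Assumed train split for our method (inferred from the table, not quoted from it).
TRAIN_SEQUENCES = {"01", "02", "06", "08", "09", "10"}


def split_means(values, sequences=SEQUENCES, train=TRAIN_SEQUENCES):
    """Return (train mean, test mean) of per-sequence drift values."""
    train_vals = [v for s, v in zip(sequences, values) if s in train]
    test_vals = [v for s, v in zip(sequences, values) if s not in train]
    return mean(train_vals), mean(test_vals)


train_mean, test_mean = split_means(OURS_F2F)
print(f"train {train_mean:.3f}  test {test_mean:.3f}")
```

Under this assumption the two means agree with the Train and Test entries of the row (5.61 and 2.68) to within rounding.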
Figure 5: Qualitative trajectory estimation results on the KITTI Odometry Seq. 03, 04, 05 and 07. We compare trajectory estimation results obtained via hand-engineered keypoint matching methods against our depth-aware learned keypoint matching, with a common visual odometry backend such as DSO. As illustrated in the figure, our self-supervised method is able to accurately and robustly track stable keypoints for the task of long-term trajectory estimation.

5.5 Ablation Study

We summarize our ablative analysis in Table 2. Our baseline, in which KeypointNet is pre-trained on COCO and DepthNet is trained on KITTI but the two are not optimized together, shows superior results compared to most monocular methods (see Table 3), thus motivating our approach of combining keypoints and depth in a self-supervised learning framework. We notice a significant improvement when training the two networks together (Row 2: Ours TR, DP). Adding the differentiable pose estimation (Row 5: Ours TR) further improves the performance of our system for the translational drift metric; we note that the rotational drift metric does not improve, mostly due to an error in Sequence 04 (please refer to the supplementary material for detailed results for each version of our method on all the KITTI odometry sequences). We further ablate the KeypointNet (Row 3: Ours TR, KPN trained), i.e. we estimate the ego-motion using the DepthNet after training together with the KeypointNet, but we use the original KeypointNet trained only on COCO. We perform a similar experiment ablating the trained DepthNet (Row 4: Ours TR, DepthNet trained).

In both cases we note a performance drop for both the translational and rotational drift metrics, concluding that the Multi-View Adaptation training procedure along with the differentiable pose improves both the DepthNet and KeypointNet for the task of visual odometry. Comparing the proposed method with Multi-View Adaptation and differentiable pose against the baseline (Row 5 versus Row 1), we note an improvement in both metrics. Finally, we note that when using the DS-DSO tracking system (Row 6) our results improve significantly, which we attribute to the robustness of our features from both a geometry and an appearance perspective. We emphasize that all our results, including the pretraining of our networks, are obtained in a self-supervised fashion.

6 Conclusion

In this paper, we propose a fully self-supervised framework for depth-aware keypoint learning from unlabeled monocular video, by incorporating a novel differentiable pose estimation module that simultaneously optimizes the keypoints and their depths in a structure-from-motion setting. Unlike existing learned keypoint methods that employ simple homography adaptation, we introduce multi-view adaptation that exploits the temporal context in videos to further boost the repeatability and matching performance of our proposed keypoint network. The resulting 3D keypoints and associated descriptors exhibit superior performance compared to all other traditional and learned methods, and the network is also able to learn from realistic non-planar 3D scenes. Finally, we show how our proposed network can be integrated with a monocular visual odometry system to achieve accurate, scale-aware, long-term tracking results which are on par with state-of-the-art stereo methods.

Supplementary Materials

Appendix A Architecture Diagram

ResNet18 DepthNet. We provide a detailed description of our DepthNet architecture in Table 4, and note that we follow [19] and use a ResNet18 encoder followed by a decoder which outputs inverse depth at 4 scales.

Layer Description K Output Tensor Dim.
#0 Input RGB image 3×H×W
Conv2d + BatchNorm + ReLU 3
Conv2d + BatchNorm 3
Depth Encoder
#1 Conv2d (S2) + BatchNorm + ReLU 7 64×H/2×W/2
#2 Conv2d + BatchNorm + ReLU 3 64×H/2×W/2
#3 ResidualBlock (#2) x2 - 64×H/2×W/2
#4 Max. Pooling (1/2) 3 64×H/4×W/4
#5 ResidualBlock (#3 + #2) x2 - 128×H/4×W/4
#6 Max. Pooling (1/2) 3 128×H/8×W/8
#7 ResidualBlock (#4 + #3) x2 - 256×H/8×W/8
#8 Max. Pooling (1/2) 3 256×H/16×W/16
#9 ResidualBlock (#5 + #4) x2 - 512×H/16×W/16
Depth Decoder
#10 Conv2D + ELU (#9) 3 128×H/16×W/16
#11 Conv2D + Upsample (#10) 3 128×H/8×W/8
#12 Conv2D + Sigmoid 3 1×H/8×W/8
#13 Conv2D + ELU 3 64×H/8×W/8
#14 Conv2D + Upsample (#7 #13) 3 64×H/4×W/4
#15 Conv2D + Sigmoid 3 1×H/4×W/4
#16 Conv2D + ELU 3 32×H/4×W/4
#17 Conv2D + Upsample (#5 #16) 3 32×H/2×W/2
#18 Conv2D + Sigmoid 3 1×H/2×W/2
#19 Conv2D + ELU 3 16×H/2×W/2
#20 Conv2D + Upsample (#3 #19) 3 16×H×W
#21 Conv2D + Sigmoid 3 1×H×W
Table 4: DepthNet diagram. Layers #12, #15, #18 and #21 output inverse depth at the four scales. Upsample is a nearest-neighbor interpolation operation that doubles the spatial dimensions of the input tensor. Two layer indices listed together, as in (#7 #13), denote feature concatenation for skip connections.

ResNet18 KeypointNet. Table 5 details the network architecture of our KeypointNet. We follow [5] but change the network encoder and use a ResNet18 architecture instead, which we found to perform better.

Layer Description K Output Tensor Dim.
#0 Input RGB image 3×H×W
Conv2d + BatchNorm + ReLU 3
Conv2d + BatchNorm 3
KeyPoint Encoder
#1 Conv2d (S2) + BatchNorm + ReLU 7 64×H/2×W/2
#2 Conv2d + BatchNorm + ReLU 3 64×H/2×W/2
#3 ResidualBlock (#2) x2 - 64×H/2×W/2
#4 Max. Pooling (1/2) 3 64×H/4×W/4
#5 ResidualBlock (#3 + #2) x2 - 128×H/4×W/4
#6 Max. Pooling (1/2) 3 128×H/8×W/8
#7 ResidualBlock (#4 + #3) x2 - 256×H/8×W/8
#8 Max. Pooling (1/2) 3 256×H/16×W/16
#9 ResidualBlock (#5 + #4) x2 - 512×H/16×W/16
KeyPoint Decoder
#10 Conv2D + BatchNorm + LReLU (#9) 3 256×H/16×W/16
#11 Conv2D + Upsample (#10) 3 256×H/8×W/8
#12 Conv2D + BatchNorm + LReLU 3 256×H/8×W/8
#13 Conv2D + Upsample (#7 #12) 3 128×H/4×W/4
#14 Conv2D + BatchNorm + LReLU 3 128×H/4×W/4
#15 Conv2D + Upsample (#5 #14) 3 64×H/2×W/2
#16 Conv2D + BatchNorm + LReLU 3 64×H/2×W/2
Score Head
#12 Conv2d + BatchNorm + LReLU (#12) 3 256×H/8×W/8
#13 Conv2d + Sigmoid 3 1×H/8×W/8
Location Head
#14 Conv2d + BatchNorm + LReLU (#12) 3 256×H/8×W/8
#15 Conv2d + Tanh 3 2×H/8×W/8
Descriptor Head
#16 Conv2d + BatchNorm + LReLU (#16) 3 64×H/2×W/2
#17 Conv2d 3 64×H/2×W/2
Table 5: KeypointNet diagram. Upsample is a nearest-neighbor interpolation operation that doubles the spatial dimensions of the input tensor. Two layer indices listed together, as in (#7 #12), denote feature concatenation for skip connections.

Appendix B Dense Depth Evaluation

Method Abs Rel Sq Rel RMSE RMSE log δ < 1.25 δ < 1.25² δ < 1.25³
Monodepth2 [22] 0.090 0.545 3.942 0.137 0.914 0.983 0.995
DepthNet baseline 0.089 0.543 3.968 0.136 0.916 0.982 0.995
DepthNet finetuned 0.094 0.572 3.805 0.138 0.912 0.981 0.994
Table 6: Quantitative performance comparison of depth estimation on the KITTI dataset for reported depths of up to 80m. For Abs Rel, Sq Rel, RMSE and RMSE log lower is better, and for δ < 1.25, δ < 1.25² and δ < 1.25³ higher is better. All networks are pre-trained on ImageNet [12]. We evaluate on the annotated KITTI depth maps from [50]. At test time, the scale for all the methods is corrected using the median ground-truth depth from the LiDAR.

We perform a quantitative evaluation of our DepthNet on the KITTI dataset, specifically on the Eigen [15] test split, and report the numbers in Table 6. We also include the numbers reported by [22] and note that our DepthNet baseline numbers are on par with those of [22] (this corresponds to Row 1, Baseline, of Table 2 in the main text). Table 6 also shows our numbers after fine-tuning the DepthNet and KeypointNet through the proposed Multi-View Adaptation method (this corresponds to Row 5, Ours TR, of Table 2 in the main text). We note a slight degradation in the Abs Rel and Sq Rel metrics (from 0.089 to 0.094 and from 0.543 to 0.572), but otherwise the numbers are within error margin with respect to our baseline. These results provide an important sanity check: as the main focus of this work is sparse, depth-aware keypoint learning, we don’t expect to see much variation when performing dense depth evaluation. We mention that sparsely evaluating the depth using the keypoints regressed by our method is not feasible using the depth available in the KITTI dataset: even using the denser depth maps provided by [50], only a fraction of our keypoints have valid depths in the ground truth maps, which amounts to a very small number of points per image.
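The dense depth metrics of Table 6 follow the usual definitions for this benchmark. The sketch below is a minimal pure-Python version under stated assumptions: depths are passed as flat lists over pixels, predictions are strictly positive, pixels without ground truth are encoded as 0, and median scaling uses the ratio of medians. The function name `depth_metrics` and the dictionary keys are hypothetical.

```python
import math
from statistics import median


def depth_metrics(pred, gt, max_depth=80.0):
    """Dense depth metrics over flat lists of predicted and ground-truth depths."""
    # Keep only pixels with a valid ground-truth depth of at most max_depth.
    pairs = [(p, g) for p, g in zip(pred, gt) if 0.0 < g <= max_depth]
    # Median scaling: align the scale of the prediction to the ground truth.
    scale = median(g for _, g in pairs) / median(p for p, _ in pairs)
    pairs = [(p * scale, g) for p, g in pairs]
    n = len(pairs)

    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    # Threshold accuracy: fraction of pixels with max(p/g, g/p) < 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    d1, d2, d3 = (sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))

    return {
        "abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
        "rmse_log": rmse_log, "d1": d1, "d2": d2, "d3": d3,
    }


# A prediction that is correct up to a global scale is perfect after median scaling.
gt_depth = [2.0, 4.0, 8.0, 16.0, 100.0]    # the 100 m pixel is beyond the 80 m cap
pred_depth = [1.0, 2.0, 4.0, 8.0, 3.0]
metrics = depth_metrics(pred_depth, gt_depth)
print(metrics)
```

Because of the median scaling step, these metrics are insensitive to a global scale error in the predicted depth, which is why they complement rather than replace the trajectory drift metrics.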

Appendix C Structural Similarity (SSIM) loss

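We define the SSIM loss [52] as:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

with $C_1$ and $C_2$ the stabilizing constants of [52]. To compute the per-patch mean $\mu$ and standard deviation $\sigma$ we use a block filter.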
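As an illustration of the SSIM computation with a block filter, the following sketch evaluates the standard SSIM expression of [52] on every k×k block of two small images stored as nested lists. The constants C1 = 0.01² and C2 = 0.03², the 3×3 block size, and the names `ssim_patch` and `ssim_map` are assumptions chosen for the example (common choices for images scaled to [0, 1]), not values taken from the text.

```python
def ssim_patch(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM between two patches given as flat lists of equal length."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return num / den


def ssim_map(img_a, img_b, k=3):
    """Slide a k x k block over both images and return the per-block SSIM."""
    rows, cols = len(img_a), len(img_a[0])
    out = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            pa = [img_a[r + i][c + j] for i in range(k) for j in range(k)]
            pb = [img_b[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(ssim_patch(pa, pb))
        out.append(out_row)
    return out


image = [[0.1, 0.2, 0.3, 0.4],
         [0.2, 0.3, 0.4, 0.5],
         [0.3, 0.4, 0.5, 0.6],
         [0.4, 0.5, 0.6, 0.7]]
flipped = [row[::-1] for row in image]

same = ssim_map(image, image)       # identical images
diff = ssim_map(image, flipped)     # horizontally mirrored image
```

Identical images give an SSIM of 1 in every block, while the mirrored image scores lower; a dissimilarity such as (1 - SSIM) / 2 is a common way to turn the map into a loss term, and is mentioned here only as an assumed convention.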

Appendix D Pose Estimation

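Recall that we aim to minimize the residual error of Equation (16) over the relative pose, where $\mathbf{R}$ is the rotation matrix and $\mathbf{t}$ is the translation vector. Together they compose a rigid body transform $\mathbf{T} \in SE(3)$, which is defined by $\boldsymbol{\xi} \in \mathfrak{se}(3)$. $\boldsymbol{\xi}$ is a member of the Lie algebra and is mapped to the Lie group through the matrix exponential:

$$\mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} = \exp(\hat{\boldsymbol{\xi}})$$

where $\hat{\boldsymbol{\xi}}$ is the matrix form of $\boldsymbol{\xi}$, built from the skew-symmetric matrix of its rotational part.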


The estimated relative pose can be obtained by optimizing the residual error in Equation (16). The Gauss-Newton (GN) method is used to solve this non-linear least-squares problem. GN calculates $\boldsymbol{\xi}$ iteratively as follows:

$$\boldsymbol{\xi} \leftarrow \boldsymbol{\xi} - (\mathbf{J}^\top \mathbf{J})^{-1} \mathbf{J}^\top \mathbf{r}(\boldsymbol{\xi})$$

where $\mathbf{J}$ is the Jacobian matrix of the residual measurements $\mathbf{r}$. RANSAC is performed to achieve a robust estimation, rejecting three major types of outliers that violate the ego-motion assumption: false-positive matching pairs, dynamic objects, and points with wrong depth estimates.
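To make the Gauss-Newton iteration concrete, the sketch below solves a deliberately reduced version of the problem: the rotation is fixed to the identity, the camera is a pinhole with unit focal length, and only the translation is estimated from 2D-3D correspondences. This is a toy illustration under those assumptions, not the 6-DoF solver with RANSAC described above; the names `project`, `solve3` and `gauss_newton_translation` are hypothetical.

```python
def project(point):
    """Pinhole projection with unit focal length (an assumption of this sketch)."""
    x, y, z = point
    return (x / z, y / z)


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for k in range(c, 4):
                m[r][k] -= f * m[c][k]
    x = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        s = m[r][3] - sum(m[r][k] * x[k] for k in range(r + 1, 3))
        x[r] = s / m[r][r]
    return x


def gauss_newton_translation(points3d, observations, iters=10):
    """Estimate a translation t minimizing the reprojection residual of P + t."""
    t = [0.0, 0.0, 0.0]
    for _ in range(iters):
        jtj = [[0.0] * 3 for _ in range(3)]
        jtr = [0.0] * 3
        for p, (u_obs, v_obs) in zip(points3d, observations):
            x, y, z = p[0] + t[0], p[1] + t[1], p[2] + t[2]
            residual = (x / z - u_obs, y / z - v_obs)
            # Rows of the Jacobian of the projection with respect to t.
            jac = ((1.0 / z, 0.0, -x / z ** 2), (0.0, 1.0 / z, -y / z ** 2))
            for row, r_i in zip(jac, residual):
                for a in range(3):
                    jtr[a] += row[a] * r_i
                    for b in range(3):
                        jtj[a][b] += row[a] * row[b]
        # Gauss-Newton step: t <- t - (J^T J)^{-1} J^T r
        delta = solve3(jtj, jtr)
        t = [ti - di for ti, di in zip(t, delta)]
    return t


points = [(1.0, 0.0, 5.0), (-1.0, 0.5, 6.0), (0.0, -1.0, 4.0),
          (0.5, 1.0, 7.0), (-0.5, -0.5, 5.5)]
t_true = (0.2, -0.1, 0.3)
obs = [project((x + t_true[0], y + t_true[1], z + t_true[2])) for x, y, z in points]
t_est = gauss_newton_translation(points, obs)
```

With noise-free correspondences the residual vanishes at the true translation, so the iteration converges in a few steps; in practice the same update would be applied to all six pose parameters and wrapped in RANSAC to reject the outlier types listed above.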

Appendix E Detailed Results for the Pose Ablation Study

Table 7 provides detailed results on all the KITTI odometry sequences for each entry of our ablation study (Table 2 of the main text). We note that (i) the proposed contributions, Multi-View Adaptation (Row 2 vs Row 1) and Differentiable Pose (Row 5 vs Row 1), consistently improve over the baseline; and that (ii) swapping out the KeypointNet or DepthNet trained using the proposed Multi-View Adaptation for their baseline counterparts (Rows 3 and 4) results in worse performance for both the translational and rotational drift metrics.

Method Type 01 02 06 08 09 10 00 03 04 05 07 Train Test
- Average Translational RMSE drift (%) on trajectories of length 100-800m.
Baseline Mono 18.96 3.35 2.16 3.80 3.15 5.06 3.50 3.64 2.33 3.25 3.00 6.08 3.14
Ours TR, DP Mono 20.17 3.37 2.15 3.01 2.61 5.39 2.89 3.10 2.88 3.09 2.66 6.12 2.92
Ours TR, KPN trained Mono 19.09 3.30 2.23 3.16 2.84 5.03 2.83 3.29 2.02 3.69 2.58 5.94 2.88
Ours TR, DN trained Mono 15.53 3.37 1.84 3.63 2.83 5.06 3.56 3.06 2.11 3.33 2.34 5.38 2.88
Ours TR Mono 17.79 3.15 1.88 3.06 2.69 5.12 2.76 3.02 1.93 3.30 2.41 5.61 2.68
Ours Mono 4.70 3.62 0.92 2.46 2.31 5.24 1.83 1.21 0.76 1.84 0.54 3.21 1.24
- Average Rotational RMSE drift (°/100m) on trajectories of length 100-800m.
Baseline Mono 1.02 1.12 0.82 1.00 0.72 1.43 1.26 3.17 1.09 1.24 1.39 1.02 1.63
Ours TR, DP Mono 1.08 1.03 0.97 0.73 0.65 0.91 1.24 2.64 1.00 1.08 1.18 0.89 1.43
Ours TR, KPN trained Mono 0.84 1.12 0.98 0.77 0.64 1.24 1.23 2.81 1.56 1.24 1.23 0.93 1.61
Ours TR, DN trained Mono 0.66 1.13 0.68 0.88 0.62 1.44 1.20 2.74 1.73 1.22 1.04 0.91 1.58
Ours TR Mono 0.72 1.01 0.80 0.76 0.61 1.07 1.17 2.45 1.93 1.11 1.16 0.83 1.56
Ours Mono 0.16 0.22 0.13 0.31 0.30 0.29 0.33 0.33 0.18 0.22 0.23 0.24 0.26
Table 7: Detailed results of our Pose Ablative Analysis. Note: our results are obtained after performing a single Sim(3) alignment step [24] with respect to the ground truth trajectories.

Appendix F Qualitative Results on HPatches

Figure 6: Qualitative matching results of our method on the HPatches dataset [6].

Appendix G Keypoint Detector and Descriptor Evaluation Metrics

We follow [14] and use the Repeatability and Localization Error metrics to estimate keypoint performance, and the Homography Accuracy and Matching Score metrics to estimate descriptor performance. We use the same distance threshold for all metrics. For the homography estimation, consistent with other reported methods, we keep only the keypoints with the highest scores. Similarly, for the frame-to-frame tracking we select a fixed number of keypoints to estimate the relative pose.

Repeatability is computed as the ratio of correctly associated keypoints after warping onto the target frame. We consider a warped keypoint correctly associated if the nearest keypoint in the target frame (based on Euclidean distance) is below a certain threshold.

Localization Error is computed as the average Euclidean distance between warped and associated keypoints.
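The two keypoint metrics above can be sketched as follows. This is a minimal one-directional version that follows the wording of the definitions (warped source keypoints associated to their nearest target keypoint); the function names are hypothetical, and the distance threshold is a required argument because its value is not fixed here.

```python
import math


def associate(warped, target, threshold):
    """Distances of warped keypoints whose nearest target keypoint is within threshold."""
    distances = []
    for wx, wy in warped:
        nearest = min(math.hypot(wx - tx, wy - ty) for tx, ty in target)
        if nearest < threshold:
            distances.append(nearest)
    return distances


def repeatability_and_localization_error(warped, target, threshold):
    """Return (repeatability, localization error) for one image pair."""
    distances = associate(warped, target, threshold)
    repeatability = len(distances) / len(warped)
    # The localization error is undefined when nothing was associated.
    loc_error = sum(distances) / len(distances) if distances else float("nan")
    return repeatability, loc_error


warped_kp = [(0.0, 0.0), (10.0, 10.0), (50.0, 50.0)]
target_kp = [(1.0, 0.0), (10.0, 12.0), (100.0, 100.0)]
rep, loc = repeatability_and_localization_error(warped_kp, target_kp, threshold=3.0)
```

In this example two of the three warped keypoints find a target keypoint within 3 pixels, at distances of 1 and 2 pixels, so the repeatability is 2/3 and the localization error is 1.5 pixels.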

Homography Accuracy. To compute the homography between two images we perform reciprocal descriptor matching and use OpenCV’s findHomography method with RANSAC, with a maximum of 5000 iterations and an error threshold of 3. To compute the Homography Accuracy we compare the estimated homography with the ground truth homography. Specifically, we warp the image corners of the original image onto the target image using both the estimated homography and the ground truth homography, and we compute the average distance between the two sets of warped image corners, noting whether the average distance is below a certain threshold.
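The corner-based comparison can be sketched without any external dependency. In the example below the estimated homography is simply given as a 3×3 nested list (in practice it would come from the RANSAC estimation described above); the function names, the corner convention (0 to width - 1, 0 to height - 1) and the image size are assumptions made for illustration.

```python
import math


def warp_point(h, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


def homography_accuracy(h_est, h_gt, width, height, threshold):
    """Return (mean corner distance, whether it is below the threshold)."""
    corners = [(0.0, 0.0), (width - 1.0, 0.0),
               (0.0, height - 1.0), (width - 1.0, height - 1.0)]
    distances = []
    for corner in corners:
        ue, ve = warp_point(h_est, corner)
        ug, vg = warp_point(h_gt, corner)
        distances.append(math.hypot(ue - ug, ve - vg))
    mean_distance = sum(distances) / len(distances)
    return mean_distance, mean_distance < threshold


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]]  # shift by (1, 2)

dist, ok_at_3 = homography_accuracy(shifted, identity, 320, 240, threshold=3.0)
_, ok_at_1 = homography_accuracy(shifted, identity, 320, 240, threshold=1.0)
```

A pure shift of (1, 2) pixels moves every corner by sqrt(5) pixels, so the estimate counts as correct at a threshold of 3 pixels but not at a threshold of 1 pixel.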

Matching Score is computed as the ratio between successful keypoint associations between the two images, with the association being performed using Euclidean distance in descriptor space.
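One way to compute such a score is sketched below, under assumptions that go beyond the one-line definition: matches are mutual nearest neighbors in descriptor space, a match is successful when the warped source keypoint lies within a pixel threshold of its matched target keypoint, and the count is normalized by the number of source keypoints. The function names are hypothetical.

```python
import math


def nearest_index(descriptor, pool):
    """Index of the descriptor in pool closest to the query (Euclidean distance)."""
    return min(range(len(pool)), key=lambda i: math.dist(descriptor, pool[i]))


def mutual_matches(desc_a, desc_b):
    """Reciprocal nearest-neighbor matches as (index in a, index in b) pairs."""
    a_to_b = [nearest_index(d, desc_b) for d in desc_a]
    b_to_a = [nearest_index(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


def matching_score(kp_a_warped, kp_b, desc_a, desc_b, threshold):
    """Fraction of source keypoints with a geometrically consistent mutual match."""
    matches = mutual_matches(desc_a, desc_b)
    good = sum(
        1 for i, j in matches if math.dist(kp_a_warped[i], kp_b[j]) < threshold
    )
    return good / len(kp_a_warped)


desc_a = [(1.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
desc_b = [(0.9, 0.1), (0.1, 0.9), (-1.0, -1.0)]
kp_a_warped = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
kp_b = [(11.0, 10.0), (40.0, 40.0), (0.0, 0.0)]

score = matching_score(kp_a_warped, kp_b, desc_a, desc_b, threshold=3.0)
```

Here two descriptor pairs are mutual nearest neighbors, but only the first is also consistent in image space, so one of the three source keypoints counts as a successful association.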


  • [1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski (2011) Building rome in a day. Communications of the ACM 54 (10), pp. 105–112. Cited by: §2.
  • [2] S. Agarwal, N. Snavely, S. M. Seitz, and R. Szeliski (2010) Bundle adjustment in the large. In European conference on computer vision, pp. 29–42. Cited by: §1.
  • [3] R. Ambrus and A. Gaidon (2019) SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation. In ICRA, Cited by: Table 3.
  • [4] R. Ambrus, V. Guizilini, J. Li, S. Pillai, and A. Gaidon (2019) Two stream networks for self-supervised ego-motion estimation. In Proceedings of the 3rd International Conference on Robot Learning (CoRL), Cited by: §2, Table 3.
  • [5] Anonymous (2020) Neural outlier rejection for self-supervised keypoint learning. In Submitted to International Conference on Learning Representations, Note: under review. Cited by: Appendix A, §1, §2, §2, §3.2, §3.2, §3.4, §5.1, §5.2, §5.2, Table 1.
  • [6] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5173–5182. Cited by: Figure 6, §5.1, §5.3, Table 1.
  • [7] H. Bay, T. Tuytelaars, and L. Van Gool (2006) Surf: speeded up robust features. In European conference on computer vision, pp. 404–417. Cited by: Table 1.
  • [8] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. arXiv preprint arXiv:1908.10553. Cited by: Table 3.
  • [9] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M. Cheng, and I. Reid (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. External Links: 1908.10553 Cited by: §3.4.
  • [10] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Transactions on robotics 32 (6), pp. 1309–1332. Cited by: §1.
  • [11] P. H. Christiansen, M. F. Kragh, Y. Brodskiy, and H. Karstoft (2019) UnsuperPoint: end-to-end unsupervised interest point detector and descriptor. arXiv preprint arXiv:1907.04011. Cited by: §1, §2, §2, §3.2, §3.2, §5.1, Table 1.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 6, §5.2.
  • [13] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Self-improving visual odometry. arXiv preprint arXiv:1812.03245. Cited by: §2.
  • [14] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236. Cited by: Appendix G, §1, §2, §2, §3.2, §5.1, Table 1.
  • [15] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, pp. 2366–2374. Cited by: Appendix B, §5.1.
  • [16] J. Engel, V. Koltun, and D. Cremers (2017) Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence 40 (3), pp. 611–625. Cited by: §1, §4.
  • [17] J. Engel, V. Koltun, and D. Cremers (2018) Direct sparse odometry. IEEE TPAMI. Cited by: §3.3.
  • [18] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §5.1.
  • [19] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: Appendix A.
  • [20] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Vol. 2, pp. 7. Cited by: §2, §3.4.
  • [21] C. Godard, O. Mac Aodha, and G. Brostow (2018) Digging into self-supervised monocular depth estimation. arXiv preprint arXiv:1806.01260. Cited by: §3.4, §5.2, §5.2, Table 3.
  • [22] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2018) Digging into self-supervised monocular depth prediction. arXiv:1806.01260. Cited by: Table 6, Appendix B, §3.4, §3.4.
  • [23] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova (2019) Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. External Links: 1904.04998 Cited by: §3.4.
  • [24] M. Grupp (2017) Evo: python package for the evaluation of odometry and slam. Cited by: Table 7, Table 2, Table 3.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1.
  • [26] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In NIPS, Cited by: §3.4.
  • [27] A. Kendall, M. Grimes, and R. Cipolla (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pp. 2938–2946. Cited by: §2.
  • [28] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.
  • [29] A. Kirillov, R. Girshick, K. He, and P. Dollár (2019) Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6399–6408. Cited by: §1.
  • [30] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) EPnP: An accurate o (n) solution to the PnP problem. IJCV. Cited by: §2, §3.3.
  • [31] S. Leutenegger, M. Chli, and R. Siegwart (2011) BRISK: binary robust invariant scalable keypoints. In 2011 IEEE international conference on computer vision (ICCV), pp. 2548–2555. Cited by: Table 1.
  • [32] R. Li, S. Wang, Z. Long, and D. Gu (2017) UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv preprint arXiv:1709.06841. Cited by: Table 3.
  • [33] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.1.
  • [34] D. G. Lowe et al. (1999) Object recognition from local scale-invariant features.. In iccv, Vol. 99, pp. 1150–1157. Cited by: §1, §2, Table 1.
  • [35] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille (2018) Every pixel counts++: joint learning of geometry and motion with 3d holistic understanding. arXiv preprint arXiv:1810.06125. Cited by: Table 3.
  • [36] F. Ma, G. V. Cavalheiro, and S. Karaman (2018) Self-supervised sparse-to-dense: self-supervised depth completion from lidar and monocular camera. arXiv preprint arXiv:1807.00275. Cited by: §2.
  • [37] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós (2015-10) ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §2.
  • [38] R. Mur-Artal and J. D. Tardós (2017) Orb-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §4, Table 3.
  • [39] Y. Ono, E. Trulls, P. Fua, and K. M. Yi (2018) LF-Net: learning local features from images. In NIPS, Cited by: §2.
  • [40] Y. Ono, E. Trulls, P. Fua, and K. M. Yi (2018) LF-net: learning local features from images. In Advances in Neural Information Processing Systems, pp. 6234–6244. Cited by: Table 1.
  • [41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §5.2.
  • [42] E. Rosten and T. Drummond (2006) Machine learning for high-speed corner detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 430–443. Cited by: §2.
  • [43] E. Rosten, R. B. Porter, and T. Drummond (2010) Faster and better: a machine learning approach to corner detection.. IEEE Trans. Pattern Anal. Mach. Intell. 32 (1), pp. 105–119. External Links: Link Cited by: §2.
  • [44] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski (2011) ORB: an efficient alternative to sift or surf.. In ICCV, Vol. 11, pp. 2. Cited by: §1, §2, Table 1.
  • [45] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019-06) From coarse to fine: robust hierarchical localization at large scale. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [46] N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys (2017-07) Quad-networks: unsupervised learning to rank for interest point detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [47] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212. Cited by: §1.
  • [48] Z. Teed and J. Deng (2018) DeepV2D: video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605. Cited by: §2.
  • [49] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: §1.
  • [50] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger (2017) Sparsity invariant cnns. In 2017 International Conference on 3D Vision (3DV), pp. 11–20. Cited by: Table 6, Appendix B.
  • [51] Y. Verdie, K. Yi, P. Fua, and V. Lepetit (2015-06) TILDE: a temporally invariant learned detector. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [52] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE TIP. Cited by: Appendix C, §3.4.
  • [53] N. Yang, R. Wang, J. Stückler, and D. Cremers (2018) Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry. In ECCV, Cited by: §1, §4, §5.1, §5.4, Table 3.
  • [54] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) LIFT: Learned Invariant Feature Transform. In ECCV, Cited by: §2.
  • [55] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid (2018) Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, Cited by: Table 3.
  • [56] H. Zhan, C. S. Weerasekera, J. Bian, and I. Reid (2019) Visual odometry revisited: what should be learnt?. arXiv preprint arXiv:1909.09803. Cited by: §2, §5.4, Table 3.
  • [57] Z. Zhang (2000) A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence 22. Cited by: §3.3.
  • [58] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, Vol. 2, pp. 7. Cited by: §2, §3.1, §3.4, Table 3.
  • [59] A. Z. Zhu, W. Liu, Z. Wang, V. Kumar, and K. Daniilidis (2018) Robustness meets deep learning: an end-to-end hybrid pipeline for unsupervised learning of egomotion. arXiv preprint arXiv:1812.08351. Cited by: Table 3.