1 Introduction
Most existing VO systems are either geometric or learning-based. In this paper, we argue that a truly robust VO system should combine the best of both worlds (i.e., geometry and learning). In particular, we propose a self-supervised method to learn monocular VO with long-term modeling, where the training scheme is directly inspired by traditional geometric methods (see Figure 1).
At the heart of state-of-the-art VO systems [engel2017direct, engel2014lsd, forster2014svo, murORB2] is the incorporation of several long-studied geometric modules, including keypoint tracking, motion estimation, keyframe insertion, and bundle adjustment (BA) [triggs1999bundle]. With all these modules, a key insight is to optimize the states (e.g., 6-DoF camera poses) over long-term observations so that the system suffers less from error accumulation [scaramuzza2011visual]. While robust in normal scenarios, monocular VO still suffers from initialization difficulties under slow motion [mur2015orb], and tracking tends to fail miserably in unconstrained environments with large texture-less regions, fast movements, or other adverse factors [yang2018challenges] such as rolling shutter effects [schubert2018direct, zhuang2019learning] and unknown camera intrinsics [bogdan2018deepcalib, zhuang2019degeneracy].
In contrast, learning-based VO methods [wang2017deepvo, wang2018end, xue2018guided, xue2019beyond] have the potential to be more robust to the aforementioned challenges by harnessing rich priors from data. However, training neural networks in a supervised way requires collecting large-scale, diverse datasets with ground-truth annotations, which can be labor-intensive and time-consuming. Recently, self-supervised methods [godard2019digging, li2018undeepvo, ranjan2019competitive, wang2019recurrent, zhan2018unsupervised, zhou2017unsupervised, zou2018dfnet] have been proposed to tackle this task. Instead of supervising the networks with ground-truth labels, the idea is to couple the depth and pose networks with photometric errors across adjacent frames and jointly train them in an end-to-end manner. Nonetheless, the performance of these methods still falls behind that of geometric methods [mur2015orb] in general scenarios. One potential reason for this performance gap is that the pose networks do not exploit the temporal coherence over long sequences. During training, these networks receive short snippets (e.g., 3 or 5 frames) as input and predict ego-motions that are optimized locally for the current snippet. When evaluated on short snippets, these methods compare favorably even with state-of-the-art geometric methods [mur2015orb]. However, when we concatenate all the predictions to form the full trajectory, the learning-based methods often produce much larger pose errors, as illustrated in Figure 2.


In this paper, we argue that learning VO requires explicit long-term modeling to infuse the insights from geometric methods [engel2017direct, engel2014lsd, forster2014svo, murORB2]. To this end, we propose a novel self-supervised VO learning framework that draws inspiration from geometric modules. Specifically, we build our learning framework upon a depth network of an auto-encoder structure with skip-connections [godard2019digging] and a pose network with a two-layer LSTM module [xue2019beyond]. In contrast to the supervised method by Xue et al. [xue2019beyond], our method incorporates extra depth information and uses a completely different training scheme, leading to a purely self-supervised learning framework. To mitigate error accumulation, we propose a cycle consistency constraint between the two-layer predictions, mimicking a mini loop closure module, which improves the pose consistency over the sequence. In order to model long-term dependency in VO, we propose a two-stage training strategy, which considers both short-term and long-term constraints. The proposed two training stages correspond to the local and global bundle adjustment modules in the geometric VO, allowing us to refine the poses within a large temporal range.
In summary, our contributions are:
- We propose a novel self-supervised VO learning framework that explicitly models long-term temporal dependency.
- We build connections between our method and key building blocks of geometric VO systems and demonstrate well-motivated designs.
- We evaluate the full pose trajectories produced by our method against state-of-the-art geometric and learning-based baselines and achieve competitive results on standard VO datasets, including KITTI and TUM RGB-D.
To the best of our knowledge, our method is the first of its kind that is able to learn from “truly” long sequences (e.g., 100 frames) during training. Our experiments show that the proposed method yields significant empirical benefits by explicitly considering long-term modeling.
2 Related Work
Geometric Methods. Visual odometry is a long-standing problem of estimating ego-motion incrementally from visual input [nister2004visual, scaramuzza2011visual]. A conventional geometric VO system usually consists of the following components [scaramuzza2011visual]: feature detection, feature matching (or tracking), motion estimation (e.g., triangulation [hartley1997triangulation]), and local optimization (e.g., bundle adjustment). A keyframe mechanism [klein2007parallel] is also adopted for improved robustness in motion estimation. When incorporated with a mapping component that reconstructs the 3D scene structure, a VO system becomes a Simultaneous Localization and Mapping (SLAM) system [cadena2016past]. The key to the robustness of modern VO/SLAM systems [mur2015orb, tiwari2020pseudo] lies in their capability to extract reliable image measurements and optimize the states (e.g., 6-DoF camera poses) over a large number of frames. In this work, we leverage these geometric insights to design a robust learning-based VO system.
Fully-Supervised Methods. With the success of deep neural networks, end-to-end learning-based methods [wang2017deepvo, wang2018end, xue2018guided, xue2019beyond] have been proposed to tackle the visual odometry problem. These methods often rely on a supervised loss with ground-truth poses to regress the 6-DoF relative camera pose from a pair of consecutive images. Recently, some methods [bloesch2018codeslam, tang2018ba, teed2018deepv2d, ummenhofer2017demon, zhou2018deeptam] exploit CNNs to predict the scene depth and camera pose jointly, utilizing the geometric connection between structure and motion. This corresponds to learning Structure-from-Motion (SfM) in a supervised manner. Although the methods above achieve good performance, they require ground-truth annotations to train the networks. In contrast, our method is self-supervised, requiring nothing but monocular video frames.
Self-Supervised Methods. To mitigate the requirement of data annotations, self-supervised methods [godard2019digging, li2018undeepvo, ranjan2019competitive, wang2019recurrent, zhan2018unsupervised, zhou2017unsupervised] have been proposed to tackle the SfM task. The main supervisory signal of these methods comes from the photometric consistency between corresponding pixels of neighboring frames. While they achieve good performance on single-view depth estimation, their ego-motion estimation still lags behind traditional SLAM/VO methods. Recently, Bian et al. [bian2019unsupervised] argue that pose networks cannot provide full camera trajectories over long sequences due to the inconsistent scale of per-frame estimations and thus propose a geometry consistency constraint. However, their method only enforces globally consistent trajectories by propagating the consistency across overlapping short snippets during training. In contrast, our method directly optimizes over long sequences via long-term modeling. Inspired by the keyframe mechanism in geometric methods, Sheng et al. [sheng2019unsupervised] propose to jointly learn depth, ego-motion, and keyframe selection in a self-supervised manner. Similarly, the training of this method considers only short snippets and is thus unable to model long-term dependency.
Sequential Modeling. Sequential modeling based on recurrent neural networks (RNNs) has been successfully applied to many applications, such as speech recognition [chorowski2014end], machine translation [graves2014towards], and video prediction [srivastava2015unsupervised, villegas2017learning]. Aiming to estimate the full trajectory over a long sequence of frames, VO can be naturally formulated as a sequential learning problem and thus modeled with RNNs [wang2019recurrent, wang2017deepvo, wang2018end, xue2018guided]. Recently, Xue et al. [xue2019beyond] propose to use a two-layer LSTM network for pose estimation, where the first layer estimates the relative motion between consecutive frames and the second layer estimates global absolute poses. Despite using a similar pose network, our method differs from Xue et al. [xue2019beyond] in being self-supervised rather than fully-supervised, as well as in the associated training strategies. First, our method further incorporates depth information, while the method in [xue2019beyond] relies only on pose features. Second, in addition to the photometric discrepancy used as the supervisory signal, we enforce a cycle consistency between the predictions of the two LSTM layers, which serves as a mini “loop closure” module mimicking geometric VO systems. More importantly, we decouple our network training into two stages, allowing our method to optimize over long sequences (more than 90 frames) during training, whereas the method in [xue2019beyond] only trains with 11-frame snippets. To our knowledge, this is the first deep-learning approach to visual odometry that takes long sequences as input during training.
3 Method


Figure 3 provides a high-level overview of the proposed monocular VO system. Our system has two major components: a depth network and a pose network. (Since accurate pose prediction is the primary focus of this paper, we refer to our method as VO rather than SfM or SLAM.) The single-image depth network employs an encoder-decoder structure with skip-connections [godard2019digging]. The pose network consists of a FlowNet backbone [dosovitskiy2015flownet], a two-layer LSTM module [xue2019beyond], and two pose prediction heads (one after each of the LSTM layers). In the two-layer recurrent architecture, the first layer focuses on predicting motions between consecutive frames, while the second layer refines the estimates from the first layer [xue2019beyond].
3.1 Background
We formulate the monocular visual odometry task as a view synthesis problem, by training the networks to predict a target image from the source image with the estimated depth and camera pose. Such a system typically consists of two components: a depth network which takes a single RGB image as input to predict the depth map, and a pose network which takes a concatenation of two consecutive frames as input to estimate the 6-DoF ego-motion.
Given two input images $I_t$ and $I_s$, the estimated depth map $D_t$ of $I_t$, and the relative camera pose $T_{t \to s}$, we can then compute the per-pixel correspondence between the two input images. Assume a known camera intrinsic matrix $K$, and let $p_t$ represent the 2D homogeneous coordinate of a pixel in $I_t$. We can find the corresponding point $p_s$ of $p_t$ in $I_s$ following the equation [zhou2017unsupervised]:
$$p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} p_t. \qquad (1)$$
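For illustration, the warping step of Eq. (1) can be implemented as follows. This is a minimal PyTorch sketch with our own function name and tensor conventions, not the exact code of our system; it assumes batched inputs and returns a sampling grid for torch.nn.functional.grid_sample.

```python
import torch

def warp_coordinates(depth, T, K):
    """Project pixels of the target frame into the source frame, following Eq. (1).

    depth : (B, 1, H, W) predicted depth of the target frame.
    T     : (B, 4, 4) relative camera pose from the target to the source frame.
    K     : (B, 3, 3) camera intrinsics.
    Returns a normalized (x, y) sampling grid of shape (B, H, W, 2).
    """
    B, _, H, W = depth.shape
    device, dtype = depth.device, depth.dtype

    # Homogeneous pixel coordinates p_t of the target image, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=dtype),
                            torch.arange(W, device=device, dtype=dtype),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(3, -1)
    pix = pix.unsqueeze(0).expand(B, -1, -1)

    # Back-project to 3D camera coordinates: D_t(p_t) * K^{-1} p_t.
    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)

    # Transform into the source camera frame and project with K (Eq. 1).
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device, dtype=dtype)], dim=1)
    proj = K @ (T @ cam_h)[:, :3, :]
    xy = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)

    # Normalize to [-1, 1] so the grid can be used with grid_sample.
    x_norm = 2.0 * xy[:, 0, :] / (W - 1) - 1.0
    y_norm = 2.0 * xy[:, 1, :] / (H - 1) - 1.0
    return torch.stack([x_norm, y_norm], dim=-1).view(B, H, W, 2)
```

Sampling the source image at the returned grid (e.g., with torch.nn.functional.grid_sample) yields the synthesized view $\hat{I}_{s \to t}$ used by the losses below.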
Appearance loss. For a self-supervised visual odometry system, the primary supervision comes from the appearance dissimilarity between the synthesized image and the target image. To effectively handle occlusion, we use three consecutive frames to compute the per-pixel minimum photometric reprojection loss [godard2019digging], i.e.,
$$\mathcal{L}_{a}^{1} = \sum_{p} \min_{s \in \{t-1,\, t+1\}} pe\big(I_t(p),\ \hat{I}_{s \to t}(p)\big), \qquad (2)$$
where $pe(\cdot,\cdot)$ is a weighted combination of the L2 loss and the Structured SIMilarity (SSIM) loss, and $\hat{I}_{s \to t}$ denotes the frame synthesized from $I_s$ using Eq. (1). To handle static pixels, we adopt the auto-masking mechanism following Godard et al. [godard2019digging].
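A sketch of the per-pixel minimum reprojection loss in Eq. (2) is given below. The SSIM window size, the weighting `alpha`, and the exact intensity term are illustrative choices of this sketch, and the auto-masking of static pixels is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def ssim_dissimilarity(x, y):
    """Per-pixel SSIM dissimilarity computed over 3x3 average-pooling windows."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2))
    return torch.clamp((1 - ssim) / 2, 0, 1)

def photometric_error(pred, target, alpha=0.85):
    """pe(.,.): weighted combination of SSIM and pixel-wise intensity difference."""
    l_pix = (pred - target).pow(2).mean(dim=1, keepdim=True)
    l_ssim = ssim_dissimilarity(pred, target).mean(dim=1, keepdim=True)
    return alpha * l_ssim + (1 - alpha) * l_pix

def appearance_loss(target, synthesized_views):
    """Eq. (2): per-pixel minimum over the views synthesized from the source frames."""
    errors = torch.stack([photometric_error(v, target) for v in synthesized_views])
    min_error, _ = errors.min(dim=0)  # the per-pixel minimum handles occlusion
    return min_error.mean()
```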
Smoothness loss. Since the appearance loss cannot provide meaningful supervision for texture-less or homogeneous regions of the scene, a smoothness prior on the disparity is incorporated. Here, we use the edge-aware smoothness loss $\mathcal{L}_{s}$ as in Wang et al. [wang2018learning].
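The edge-aware smoothness term can be sketched as follows; the mean normalization of the disparity follows common practice in self-supervised depth code and is an assumption of this sketch rather than a detail stated above.

```python
import torch

def edge_aware_smoothness(disp, img):
    """Edge-aware first-order smoothness on the (mean-normalized) disparity.

    disp : (B, 1, H, W) predicted disparity; img : (B, 3, H, W) corresponding RGB frame.
    """
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)

    grad_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    grad_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)

    # Penalize disparity gradients less where the image itself has strong edges.
    return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
           (grad_disp_y * torch.exp(-grad_img_y)).mean()
```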
Remark 1
The appearance loss in Eq. (2) corresponds to a local photometric bundle adjustment objective, which is also commonly used in the geometric direct VO/SLAM systems [engel2017direct, engel2014lsd, wang2017stereo].
3.2 Cycle consistency within memory-aided sequential modeling
With the above setting, current state-of-the-art self-supervised methods estimate the ego-motion only within a local range, discarding the sequential dependence and dynamics of long sequences. Such information, however, is essential for a pose network to recover the entire trajectory in a consistent manner. We thus adopt a recurrent structure for our pose network to utilize temporal information.
Sequential modeling. To learn to utilize temporal information, we adopt a recurrent network structure with a convolutional LSTM (ConvLSTM) module [xingjian2015convolutional]. Previously, the pose network took the concatenation of two frames and output the 6-DoF camera pose directly. After incorporating the ConvLSTM module, the pose network also takes the previous estimation information into account when predicting the output. Formally, we have
$$x_t = f_{P}(I_{t-1}, I_t), \qquad (3)$$
$$(o^{1}_{t},\, h^{1}_{t}) = \mathrm{ConvLSTM}^{1}\big(x_t,\, h^{1}_{t-1}\big), \qquad (4)$$
$$T_{t-1 \to t} = g^{1}\big(o^{1}_{t}\big), \qquad (5)$$
where $f_{P}$ is the pose encoder, $o^{1}_{t}$ and $h^{1}_{t}$ denote the output and hidden state of the ConvLSTM at time $t$, and $g^{1}$ is a linear layer predicting the 6-DoF motion $T_{t-1 \to t}$. By doing this, the network implicitly learns to aggregate temporal information and to capture the motion pattern.
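The first-layer recurrence of Eqs. (3)-(5) can be sketched as below. The ConvLSTM cell interface (returning an output and a new state), the global average pooling, and the linear head are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FirstLayerPose(nn.Module):
    """Sketch of Eqs. (3)-(5): relative poses between consecutive frames."""

    def __init__(self, pose_encoder, convlstm, feat_channels):
        super().__init__()
        self.pose_encoder = pose_encoder   # FlowNet-style CNN applied to a frame pair
        self.convlstm = convlstm           # ConvLSTM cell: (x, state) -> (out, state)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(feat_channels, 6)  # axis-angle rotation + translation

    def forward(self, frames):
        """frames: list of (B, 3, H, W) images of a snippet."""
        rel_poses, states, hidden = [], [], None
        for prev, cur in zip(frames[:-1], frames[1:]):
            x = self.pose_encoder(torch.cat([prev, cur], dim=1))    # Eq. (3)
            out, hidden = self.convlstm(x, hidden)                  # Eq. (4)
            rel_poses.append(self.head(self.pool(out).flatten(1)))  # Eq. (5)
            states.append(out)  # kept in the memory buffer for the second layer
        return rel_poses, states
```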
Memory buffer and refinement. In the sequential modeling setting above, the pose network estimates the relative pose for every two consecutive frames. However, the motions between consecutive frames are often tiny, which results in difficulties in extracting good features for relative pose estimation. Thus, predicting the camera pose from a non-adjacent “anchor” frame to the current frame could be a better option. Note that many traditional SLAM systems [mur2015orb, murORB2] adopt a keyframe mechanism and always compute camera poses from the current frame to the most recent keyframe.
Inspired by the keyframe mechanism, we incorporate a second-layer ConvLSTM and adopt the memory module proposed by Xue et al. [xue2019beyond]. After each step of the first-layer ConvLSTM, we store the hidden state tensor in a memory buffer $\mathcal{M}$, whose length is set to the length of the input snippet. When we read out from the memory buffer, we compute a weighted average of all the memory slots as in [xue2019beyond]. We also compute the depth and pose features of the first frame and the current frame as additional input to the second-layer ConvLSTM. This can be formally written as
$$m_t = \mathrm{Read}(\mathcal{M}), \qquad (6)$$
$$y_t = \big[\, m_t,\ f_{D}(I_1),\ f_{D}(I_t),\ f_{P}(I_1, I_t) \,\big], \qquad (7)$$
$$(o^{2}_{t},\, h^{2}_{t}) = \mathrm{ConvLSTM}^{2}\big(y_t,\, h^{2}_{t-1}\big), \qquad (8)$$
$$T_{1 \to t} = g^{2}\big(o^{2}_{t}\big), \qquad (9)$$
where $f_{D}$ is the depth encoder, $m_t$ is the read-out memory, $o^{2}_{t}$ and $h^{2}_{t}$ denote the output and hidden state of the second layer at time $t$, and $g^{2}$ is another linear layer predicting the absolute pose $T_{1 \to t}$ within the current snippet.
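The second layer and the memory read-out of Eqs. (6)-(9) are sketched below. The softmax-weighted read-out is a simplified stand-in for the weighting scheme of Xue et al. [xue2019beyond], and we assume for brevity that all feature maps share the same channel count and spatial size.

```python
import torch
import torch.nn as nn

class SecondLayerPose(nn.Module):
    """Sketch of Eqs. (6)-(9): absolute pose w.r.t. the first frame of the snippet."""

    def __init__(self, depth_encoder, pose_encoder, convlstm2, feat_channels):
        super().__init__()
        self.depth_encoder = depth_encoder
        self.pose_encoder = pose_encoder
        self.convlstm2 = convlstm2
        self.score = nn.Conv2d(feat_channels, 1, 1)  # one attention score per memory slot
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(feat_channels, 6)

    def read_memory(self, memory):
        """Weighted average over the stored first-layer states (Eq. 6)."""
        stacked = torch.stack(memory, dim=1)          # (B, T, C, h, w)
        B, T = stacked.shape[:2]
        scores = self.score(stacked.flatten(0, 1)).mean(dim=(1, 2, 3)).view(B, T)
        weights = torch.softmax(scores, dim=1)
        return (weights[:, :, None, None, None] * stacked).sum(dim=1)

    def forward(self, memory, first_frame, cur_frame, hidden2):
        m = self.read_memory(memory)
        feat = torch.cat([m,                                             # Eq. (7)
                          self.depth_encoder(first_frame),
                          self.depth_encoder(cur_frame),
                          self.pose_encoder(torch.cat([first_frame, cur_frame], 1))],
                         dim=1)
        out2, hidden2 = self.convlstm2(feat, hidden2)                    # Eq. (8)
        abs_pose = self.head(self.pool(out2).flatten(1))                 # Eq. (9)
        return abs_pose, hidden2
```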
Remark 2
The ConvLSTMs explicitly model the sequential nature of the VO problem and, at the same time, facilitate the implementation of a keyframe mechanism. Compared to the memory module of Xue et al. [xue2019beyond], which only considers pose features, our memory module accommodates both depth and pose features. As verified in our experiments (Table 6 in the supplementary material), incorporating the additional depth information in the memory improves the overall performance.
Cycle consistency over two-layer poses. To train the second-layer ConvLSTM, we utilize the photometric error between the first frame and the other frames of the input snippet, i.e.,
$$\mathcal{L}_{a}^{2} = \sum_{t=2}^{N} \sum_{p} pe\big(I_1(p),\ \hat{I}_{t \to 1}(p)\big), \qquad (10)$$
where $N$ is the number of frames of the input snippet, which is set to 7 in our model.
Also, according to the transitivity of camera transformations, we have an additional constraint to ensure consistency between the first- and second-layer ConvLSTMs (as shown in Figure 4), i.e.,
$$\mathcal{L}_{c} = \sum_{t=2}^{N} \big\| T_{1 \to t} - T_{t-1 \to t}\, T_{t-2 \to t-1} \cdots T_{1 \to 2} \big\|_{1}. \qquad (11)$$
Thus, the overall objective is
$$\mathcal{L} = \mathcal{L}_{a}^{1} + \lambda_{1}\, \mathcal{L}_{a}^{2} + \lambda_{2}\, \mathcal{L}_{c} + \lambda_{3}\, \mathcal{L}_{s}, \qquad (12)$$
where $\mathcal{L}_{a}^{1}$ is the per-pixel minimum reprojection loss of Eq. (2), $\mathcal{L}_{s}$ is the smoothness loss, and $\lambda_{1}, \lambda_{2}, \lambda_{3}$ are hyper-parameters that balance the scale of the different terms and are set empirically.
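The cycle-consistency term of Eq. (11) amounts to composing the first-layer relative poses and comparing the result with the second-layer absolute pose. A minimal sketch is given below; the axis-angle pose parameterization and the L1 matrix difference are assumptions of this illustration.

```python
import torch

def pose_vec_to_mat(vec):
    """Convert a 6-DoF vector (axis-angle rotation + translation) to a 4x4 matrix."""
    B = vec.shape[0]
    rot, trans = vec[:, :3], vec[:, 3:]
    theta = rot.norm(dim=1, keepdim=True).clamp(min=1e-8)
    k = rot / theta
    skew = torch.zeros(B, 3, 3, device=vec.device, dtype=vec.dtype)
    skew[:, 0, 1], skew[:, 0, 2] = -k[:, 2], k[:, 1]
    skew[:, 1, 0], skew[:, 1, 2] = k[:, 2], -k[:, 0]
    skew[:, 2, 0], skew[:, 2, 1] = -k[:, 1], k[:, 0]
    eye = torch.eye(3, device=vec.device, dtype=vec.dtype).expand(B, 3, 3)
    R = eye + torch.sin(theta)[..., None] * skew \
            + (1 - torch.cos(theta))[..., None] * (skew @ skew)   # Rodrigues formula
    T = torch.eye(4, device=vec.device, dtype=vec.dtype).repeat(B, 1, 1)
    T[:, :3, :3], T[:, :3, 3] = R, trans
    return T

def cycle_consistency_loss(rel_poses, abs_poses):
    """Eq. (11): composed relative poses should match the absolute poses T_{1->t}.

    rel_poses : list of (B, 6) first-layer predictions [T_{1->2}, T_{2->3}, ...].
    abs_poses : list of (B, 6) second-layer predictions [T_{1->2}, T_{1->3}, ...].
    """
    loss, composed = 0.0, None
    for rel_vec, abs_vec in zip(rel_poses, abs_poses):
        rel_mat = pose_vec_to_mat(rel_vec)
        composed = rel_mat if composed is None else rel_mat @ composed
        loss = loss + (composed - pose_vec_to_mat(abs_vec)).abs().mean()
    return loss / len(rel_poses)
```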
Remark 3
The loss in Eq. (11) can be thought of as a mini “loop closure” module that enforces cycle consistency between the outputs of the two ConvLSTM layers. Note that our method is also compatible with existing full loop closure techniques [kummerle2011g], which we will consider in future work.
3.3 Long-range constraints via stage-wise training
Although we adopt a recurrent network structure to aggregate temporal information for better performance, the network sees only short snippets, never long sequences, during training. Thus, the network may not learn to fully utilize the long-term temporal context. The hurdle that prevents us from taking long sequences as input is the limited memory of modern GPUs. To tackle this problem and train a long-term model, we propose a two-stage training strategy. We first train our whole model with the full objective using short snippets.
Once the first stage of training is finished, we run this model on each sequence in the dataset separately to extract and store the required inputs for the second-layer ConvLSTM. After that, we only fine-tune the lightweight second-layer ConvLSTM, without the heavy feature extraction and depth networks, which saves a large amount of memory. By doing this, we can now feed long sequences into the network at training time, allowing the network to better learn how to utilize the temporal context. Since only the second-layer ConvLSTM is optimized, our loss for the second stage of training is
$$\mathcal{L}_{\text{stage-2}} = \sum_{j=1}^{M} \sum_{t=2}^{N} \sum_{p} pe\big(I_{j,1}(p),\ \hat{I}_{j,\,t \to 1}(p)\big), \qquad (13)$$
where $N$ is the number of frames of each snippet, which is set to 7, and $M$ is the number of snippets in the input sequence, which is set to 16. Note that consecutive snippets have one frame in common, and thus the total number of frames in the input sequence is 97. The synthesized image is a function of depth and pose, where the pose encodes long-range constraints through the hidden states of the ConvLSTMs, yielding an effective window of 97 frames. We summarize our method in Algorithm 1.
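The second training stage can be summarized by the loop sketched below. The data layout (cached per-frame frames, depths, and first-layer memory states) and the `photometric_loss` helper are hypothetical; we assume the optimizer is built over the second-layer ConvLSTM parameters only, all other modules are frozen, and for brevity the second layer is called on raw frames even though in practice the encoder features would be pre-computed offline.

```python
def train_stage2(second_layer, cached_sequences, photometric_loss, optimizer,
                 snippet_len=7, snippets_per_seq=16):
    """Fine-tune only the lightweight second-layer ConvLSTM on long sequences (Eq. 13)."""
    for seq in cached_sequences:   # each seq: dict with 'frames', 'depths', 'memory'
        optimizer.zero_grad()
        hidden, loss = None, 0.0
        # Consecutive snippets share one frame: 16 * (7 - 1) + 1 = 97 frames in total.
        for j in range(snippets_per_seq):
            s = j * (snippet_len - 1)
            for t in range(1, snippet_len):
                abs_pose, hidden = second_layer(seq['memory'][s:s + t + 1],
                                                seq['frames'][s],
                                                seq['frames'][s + t],
                                                hidden)
                # Photometric error between the snippet's first frame and frame t,
                # using the cached depth of frame t and the predicted absolute pose.
                loss = loss + photometric_loss(seq['frames'][s],
                                               seq['frames'][s + t],
                                               seq['depths'][s + t],
                                               abs_pose)
        loss.backward()   # gradients reach only the second-layer parameters
        optimizer.step()
```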
Remark 4
The second training stage can be viewed as a motion-only bundle adjustment module [mur2015orb] that considers long-term modeling.
4 Experimental Results
4.1 Settings
Datasets. To evaluate our method, we conduct the main experiments on the KITTI dataset [Geiger2013IJRR, Geiger2012CVPR], which consists of urban and highway driving sequences for road scene understanding [song2014robust, dhiman2016continuous]. The odometry split of KITTI is a widely used benchmark for odometry/SLAM evaluation. It contains 22 sequences, among which Sequences 00-10 have ground-truth trajectory labels, while the annotations of the remaining sequences are not publicly available. Following Zhou et al. [zhou2017unsupervised], we use Sequences 00-08 as our training set and validate the models on Sequences 09 and 10. In addition, we select 18 more sequences from the KITTI raw data, which have no overlap with the odometry split, for further evaluation. Since the ground-truth trajectories of Sequences 11-21 are not available, we run ORB-SLAM2 (stereo version) to obtain predictions as (pseudo) ground truth for evaluation. In addition to these outdoor scenes, we also train and evaluate our model on the TUM RGB-D dataset [sturm2012benchmark]. This dataset is collected by hand-held cameras in indoor environments with challenging conditions. We use the same train/test split as Xue et al. [xue2019beyond].
Evaluation metrics. For the KITTI dataset, we adopt the absolute trajectory RMSE and relative translation/rotation errors over all possible subsequences of length (100, 200, …, 800) meters. For the TUM RGB-D dataset, we use the translational RMSE as our evaluation metric. For self-supervised monocular methods, since the absolute scale is unknown, we align the trajectory globally using the evo toolbox [grupp2017evo].
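For reference, the absolute trajectory RMSE after a global similarity alignment can be computed with the closed-form Umeyama solution; the snippet below is a minimal numpy illustration of the metric under that assumption, not the evo implementation.

```python
import numpy as np

def ate_rmse(pred_xyz, gt_xyz):
    """Absolute trajectory RMSE after a similarity (scale + rigid) alignment.

    pred_xyz, gt_xyz : (N, 3) arrays of camera positions; the scale alignment is
    needed because monocular self-supervised VO is scale-ambiguous.
    """
    mu_p, mu_g = pred_xyz.mean(0), gt_xyz.mean(0)
    P, G = pred_xyz - mu_p, gt_xyz - mu_g

    # Optimal rotation from the SVD of the cross-covariance matrix (Umeyama).
    U, S, Vt = np.linalg.svd(G.T @ P / len(P))
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    # Optimal scale and translation.
    s = np.trace(np.diag(S) @ D) / (P ** 2).sum() * len(P)
    t = mu_g - s * R @ mu_p

    aligned = (s * (R @ pred_xyz.T)).T + t
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))
```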
Implementation details. We use an ImageNet-pretrained ResNet-18 as our depth encoder. Our depth decoder structure is the same as in Godard et al. [godard2019digging]. For the pose encoder, we take the encoder of the FlowNet-S structure up to its last layer, which is pre-trained on FlyingChairs [dosovitskiy2015flownet] for optical flow estimation. We implement our system using the publicly available PyTorch framework and conduct all our experiments on a single TitanXP GPU. For both stages, we train the network with the Adam optimizer [kingma2014adam] for 20 epochs. The learning rate is set to 5e-5 for the first 15 epochs and drops to 5e-6 for the remaining epochs. The input size is 640×192, and the batch size is set to 2. In the first stage of training, the number of frames is set to 7, while it is set to 97 in the second stage. Note that the long-term optimization only happens at training time. At test time, our model runs at 14.3 frames per second.
4.2 Results
Seq. 09 | Seq. 10 | |||||
Method | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) |
Baseline | 22.71 | 7.55 | 0.028 | 17.87 | 10.43 | 0.046 |
One-layer ConvLSTM | 23.45 | 5.59 | 0.016 | 11.93 | 7.23 | 0.023 |
Two-layer ConvLSTM | 9.77 | 4.23 | 0.013 | 12.68 | 6.02 | 0.023 |
Two-layer ConvLSTM + Two-stage training | 11.30 | 3.49 | 0.010 | 11.80 | 5.81 | 0.018 |

Figure 5: (a) Full trajectories; (b) Zoom-in.
Ablation study. To validate our design choice, we perform an ablation study on Sequence 09 and 10 of the KITTI Odometry dataset. We consider the following variants. 1) Baseline: the pose network takes as input the concatenation of two consecutive frames to generate pose estimation; 2) One-layer ConvLSTM: we incorporate a one-layer ConvLSTM for the pose network; 3) Two-layer ConvLSTM: we use the full model, but only conduct the first stage of training; 4) Two-layer ConvLSTM + Two-stage training: our final model with the two-stage training strategy.
As shown in Table 1, the performance gradually improves as we add more components. Specifically, adding a recurrent module improves the overall performance over the baseline; adding the second layer LSTM leads to further improvement, which validates the effectiveness of the second layer in the self-supervised learning setting; applying our second stage long-term training again boosts the performance, achieving a new state-of-the-art for self-supervised methods. We show a visual comparison in Figure 5.
Seq. 09 | Seq. 10 | ||||||
Method | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) | |
Geo. | ORB-SLAM2-M (w/o LC) [murORB2] | 44.10 | 9.67 | 0.003 | 6.43 | 4.04 | 0.003 |
ORB-SLAM2-M [murORB2] | 8.84 | 3.22 | 0.004 | 8.51 | 4.25 | 0.003 | |
Sup. | DeepVO [wang2017deepvo] | - | - | - | - | 8.11 | 0.088 |
ESP-VO [wang2018end] | - | - | - | - | 9.77 | 0.102 | |
GFS-VO [xue2018guided] | - | - | - | - | 6.32 | 0.023 | |
GFS-VO-RNN [xue2018guided] | - | - | - | - | 7.44 | 0.032 | |
BeyondTracking [xue2019beyond] | - | - | - | - | 3.94 | 0.017 | |
DeepV2D [teed2018deepv2d] | 79.06 | 8.71 | 0.037 | 48.49 | 12.81 | 0.083 | |
Self-Sup. | SfMLearner [zhou2017unsupervised] | 24.31 | 8.28 | 0.031 | 20.87 | 12.20 | 0.030 |
GeoNet [yin2018geonet] | 158.45 | 28.72 | 0.098 | 43.04 | 23.90 | 0.090 | |
Depth-VO-Feat [zhan2018unsupervised] | - | 11.93 | 0.039 | - | 12.45 | 0.035 | |
vid2depth [mahjourian2018unsupervised] | - | - | - | - | 21.54 | 0.125 | |
UnDeepVO [li2018undeepvo] | - | 7.01 | 0.036 | - | 10.63 | 0.046 | |
Wang et al. [wang2019recurrent] | - | 9.88 | 0.034 | - | 12.24 | 0.052 | |
CC [ranjan2019competitive] | 29.00 | 6.92 | 0.018 | 13.77 | 7.97 | 0.031 | |
DeepMatchVO [shen2019icra] | 27.08 | 9.91 | 0.038 | 24.44 | 12.18 | 0.059 | |
PoseGraph [li2019pose] | - | 8.10 | 0.028 | - | 12.90 | 0.032 | |
MonoDepth2 [godard2019digging] | 55.47 | 11.47 | 0.032 | 20.46 | 7.73 | 0.034 | |
SC-SfMLearer [bian2019unsupervised] | - | 11.2 | 0.034 | - | 10.1 | 0.050 | |
Ours | 11.30 | 3.49 | 0.010 | 11.80 | 5.81 | 0.018 |
Comparison with the state-of-the-art methods. For comparison, we select several state-of-the-art methods, including the monocular version of ORB-SLAM2 [murORB2] (denoted as ORB-SLAM2-M, with and without loop closure optimization), several supervised learning methods [teed2018deepv2d, wang2017deepvo, wang2018end, xue2018guided, xue2019beyond], and other self-supervised methods [bian2019unsupervised, godard2019digging, li2018undeepvo, li2019pose, mahjourian2018unsupervised, ranjan2019competitive, shen2019icra, wang2019recurrent, yin2018geonet, zhan2018unsupervised, zhou2017unsupervised]. Note that all supervised methods are trained on Sequences 00, 02, 08, and 09 of the KITTI Odometry dataset [Geiger2012CVPR], except DeepV2D [teed2018deepv2d], which is trained on the Eigen split of the KITTI raw dataset [Geiger2013IJRR]. As we can see in Table 2, our final model outperforms other self-supervised methods by a significant margin. In particular, our method outperforms the recently proposed SC-SfMLearner [bian2019unsupervised], which aims to reconstruct a scale-consistent camera trajectory. This indicates that the explicit long-term modeling in our approach is more effective than propagating a geometric constraint among overlapping short snippets as in SC-SfMLearner [bian2019unsupervised]. Our model also compares favorably with the geometric method and outperforms all supervised methods except Xue et al. [xue2019beyond] on Sequence 10.
Results on additional KITTI sequences. From the raw data of the KITTI dataset, we select 18 short sequences that have no overlap with either the training or the test split of the KITTI Odometry dataset (the sequence names are available in the supplementary material). We then apply the same pre-trained models to these test sequences. As we can see in Table 3, our method outperforms the other learning-based methods and even compares favorably with ORB-SLAM2-M in terms of RMSE and relative translation error.
Method | RMSE (m) | Rel. trans (%) | Rel. rot. (deg/m) | |
Geo. | ORB-SLAM2-M (w/o LC) [murORB2] | 7.17 | 9.41 | 0.008 |
ORB-SLAM2-M [murORB2] | 8.12 | 10.64 | 0.008 | |
Sup. | DeepV2D [teed2018deepv2d] | 10.94 | 11.81 | 0.028 |
Self-Sup. | SfMLearner [zhou2017unsupervised] | 10.79 | 13.82 | 0.041 |
GeoNet [yin2018geonet] | 14.41 | 18.99 | 0.076 | |
CC [ranjan2019competitive] | 7.51 | 10.49 | 0.024 | |
DeepMatchVO [shen2019icra] | 8.53 | 12.76 | 0.033 | |
MonoDepth2 [godard2019digging] | 7.51 | 11.99 | 0.028 | |
Ours | 6.47 | 9.99 | 0.019 |
Method | RMSE (m) | Rel. trans (%) | Rel. rot. (deg/m) | |
Geo. | ORB-SLAM2-M (w/o LC) [murORB2] | 81.20 | 19.60 | 0.009 |
ORB-SLAM2-M [murORB2] | 44.09 | 12.96 | 0.007 | |
Sup. | DeepV2D [teed2018deepv2d] | 221.33 | 24.61 | 0.041 |
Self-Sup. | SfMLearner [zhou2017unsupervised] | 75.00 | 26.54 | 0.045 |
GeoNet [yin2018geonet] | 94.98 | 29.11 | 0.062 | |
CC [ranjan2019competitive] | 55.44 | 16.65 | 0.032 | |
DeepMatchVO [shen2019icra] | 95.79 | 17.31 | 0.038 | |
MonoDepth2 [godard2019digging] | 99.36 | 12.28 | 0.031 | |
Ours | 71.63 | 7.28 | 0.014 |




Results on KITTI test sequences. Since the ground truth trajectories of Sequence 11-21 on the KITTI Odometry dataset are not available, we cannot directly recover the global scale using similarity transformation. Thus, we choose to run the stereo version of ORB-SLAM2 (denoted as ORB-SLAM2-S), which is one of the state-of-the-art methods on these sequences. To compare with other methods, we treat the estimations from ORB-SLAM2-S as pseudo ground truth. As we can see in Table 4, our method achieves state-of-the-art performance among the learning-based methods and even compares favorably with ORB-SLAM2-M in terms of the relative translation error. We visualize trajectories of four sequences in Figure 6. As we can see, our method better aligns with the reference trajectories from ORB-SLAM2-S. We also submit the globally aligned results to the KITTI evaluation server and get similar numbers, which indicates using ORB-SLAM2-S as pseudo ground truth for evaluation is reasonable. Please refer to the supplementary material for more details.


Results on the TUM RGB-D dataset. The TUM RGB-D dataset was created to evaluate the performance of RGB-D SLAM and is thus very challenging for monocular methods. To test our model in indoor environments, we compare it against several strong baselines. For traditional methods, we choose the monocular version of ORB-SLAM2 (denoted as ORB-SLAM2-M) and DSO [engel2017direct]. For learning-based methods, we choose the BeyondTracking method of Xue et al. [xue2019beyond] and the recent DeepV2D [teed2018deepv2d]. Since DeepV2D is trained on a different indoor dataset and may not generalize its scale to TUM RGB-D, we report it under two settings: with and without global scale alignment.
Table 5 shows that the traditional methods perform well on some of the sequences but fail to produce results on the remaining ones due to tracking failures. Our method outperforms the supervised baseline DeepV2D [teed2018deepv2d] on most of the sequences, but falls short of the supervised VO method of Xue et al. [xue2019beyond]. We conjecture that this is due to the limited amount of training data available in the TUM RGB-D dataset: we currently use the same amount of training data as the supervised methods, and adding more unlabeled video data to the training might lead to better performance for our method. We also notice that the rolling shutter effect in this dataset makes the photometric consistency assumption less accurate, which could hurt the performance of both the proposed method and DSO.
Method | Seq. 1 | Seq. 2 | Seq. 3 | Seq. 4 | Seq. 5 | Seq. 6 | Seq. 7 | Seq. 8 | Seq. 9 | Seq. 10 | Avg. | |
Geo. | ORB-SLAM2-M [murORB2] | 0.041 | 0.184 | - | - | - | - | - | 0.057 | - | 0.018 | - |
DSO [engel2017direct] | - | 0.197 | - | 0.737 | - | 0.082 | - | 0.093 | 0.543 | 0.040 | - | |
Sup. | BeyondTracking [xue2019beyond] | 0.153 | 0.208 | 0.056 | 0.070 | 0.172 | 0.015 | 0.123 | 0.007 | 0.035 | 0.042 | 0.088 |
DeepV2D [teed2018deepv2d] | 0.232 | 0.651 | 0.186 | 0.167 | 0.171 | 0.029 | 0.435 | 0.106 | 0.085 | 0.082 | 0.214 | |
DeepV2D (aligned) [teed2018deepv2d] | 0.087 | 0.300 | 0.114 | 0.106 | 0.181 | 0.013 | 0.380 | 0.110 | 0.094 | 0.098 | 0.148 | |
Self-Sup. | Ours | 0.192 | 0.190 | 0.083 | 0.122 | 0.177 | 0.016 | 0.219 | 0.102 | 0.179 | 0.107 | 0.139 |
Discussions. Our learning framework is motivated by geometric VO methods. The FlowNet backbone mimics the tracking module to extract pair-wise image features, and the LSTMs model the sequential nature of the VO problem. The design of the two-layer LSTM module also resembles the keyframe mechanism of geometric VO in the sense that the second LSTM predicts the motion between a keyframe and a non-keyframe, refining the initial consecutive estimations from the first LSTM. The cycle consistency constraint between the two-layer LSTM estimations serves as a mini loop closure to enforce the transitivity consistency of poses. The second stage of training allows our network to explicitly optimize over long sequences, which resembles the motion-only bundle adjustment module. We combine the best of both geometry and learning by building a self-supervised VO framework whose components (network, loss, training scheme) are fully inspired by the well-studied geometric modules. As verified in our experiments, these geometry inspired designs lead to significantly better results than the existing self-supervised baselines.
Limitations. Although the proposed system achieves good camera pose estimation performance in terms of translation error, the improvement in rotation prediction is not as substantial as that in translation. We conjecture that the larger rotation error may be due to a bias in the training data: in driving scenarios, translational motions occur much more frequently than rotational ones. Training on more diverse video sequences or synthetic data could potentially alleviate this inherent bias in existing datasets. We also observe that our system fails in over-exposed scenes, since our method still relies on visual input to extract information.
5 Conclusions
In this work, we learn a monocular visual odometry system in a self-supervised manner that mimics critical modules of traditional geometric methods. We first adopt a two-layer convolutional LSTM module to model the long-term dependency in pose estimation. To allow the network to see beyond short snippets (e.g., 3 or 5 frames) during training, we propose a stage-wise training strategy. Combining the recurrent architecture and the proposed decoupled training scheme, our system achieves state-of-the-art performance among self-supervised methods. In its current form, our system does not have a mechanism to detect loops and perform full loop closure; we plan to study how to incorporate it into our learning framework in the future.
Acknowledgment. This work was part of Y. Zou’s internship at NEC Labs America, in San Jose. Y. Zou and J.-B. Huang were also supported in part by NSF under Grant No. (#1755785).
References
Supplementary Material
In this supplementary document, we provide additional experimental results and information to complement the main manuscript. First, we conduct additional ablation experiments to further validate our design choices. Second, we show our results on the KITTI Odometry leaderboard. Third, we show results on the KITTI Odometry training split. Fourth, we show results on the snippet-level pose and single-view depth estimation for completeness. Lastly, we provide the list of sequences we selected from KITTI raw data. We also provide a demo video showing the trajectories of several challenging sequences in the KITTI Odometry dataset. Please refer to the attached file supp_video.mp4.
A Ablation Study
In Table 6, we conduct an ablation study to validate the effectiveness of the incorporated cycle consistency constraint, the pose features (from the first and current frames), the depth features, and the memory buffer in our two-layer ConvLSTM module. As we can see, all the components help improve the performance in the first stage of training.
Seq. 09 | Seq. 10 | |||||
Method | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) |
Two-layer ConvLSTM (w/o cycle consistency) | 20.37 | 5.02 | 0.016 | 16.63 | 6.88 | 0.035
Two-layer ConvLSTM (w/o pose features) | 14.26 | 5.64 | 0.018 | 14.47 | 7.52 | 0.030
Two-layer ConvLSTM (w/o depth features) | 11.53 | 4.54 | 0.015 | 14.07 | 6.54 | 0.031
Two-layer ConvLSTM (w/o memory buffer) | 12.54 | 5.12 | 0.014 | 13.96 | 7.20 | 0.026
Two-layer ConvLSTM | 9.77 | 4.23 | 0.013 | 12.68 | 6.02 | 0.023
In Table 7, we conduct an ablation study on the input sequence length used in the second stage of training. The results show that the performance gradually improves as we increase the number of input frames during training. The best performance is achieved at 97 frames, our default setting, which is also the limit imposed by GPU memory. Training the model on a GPU with larger memory could potentially improve the performance further.
Seq. 09 | Seq. 10 | |||||
Method | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) | RMSE (m) | Rel. trans. (%) | Rel. rot. (deg/m) |
49-frame | 12.50 | 3.83 | 0.011 | 12.30 | 5.99 | 0.018
73-frame | 12.42 | 3.69 | 0.010 | 12.06 | 5.89 | 0.018
97-frame (default) | 11.30 | 3.49 | 0.010 | 11.80 | 5.81 | 0.018
B Results on KITTI Odometry Test Set
In Table 8, we provide results on the KITTI Odometry leaderboard. It may be observed that the performance of our method is close to Table 5 in the main manuscript. This suggests that using ORB-SLAM2-S as pseudo ground truth is a reasonable choice for evaluation.
In addition to our method, we select two state-of-the-art self-supervised methods (CC [ranjan2019competitive] and MonoDepth2 [godard2019digging]) and submit the estimated results to the server as well. Our method compares favorably with these two self-supervised methods. Our method also outperforms the supervised method DeepVO [wang2017deepvo] by a large margin.
Method | Rel. trans (%) | Rel. rot. (deg/m) | |
Geo. | ORB-SLAM2-S [murORB2] | 1.70 | 0.0028
VISO2-M [geiger2011stereoscan] | 11.94 | 0.0234
VISO2-M+GP [geiger2011stereoscan, song2014robust] | 7.46 | 0.0245
Sup. | DeepVO [wang2017deepvo] | 24.55 | 0.0489
Self-Sup. | CC [ranjan2019competitive] | 16.06 | 0.0320
MonoDepth2 [godard2019digging] | 12.59 | 0.0312
Ours | 7.40 | 0.0142
In Figure 8, we show qualitative results on the remaining 7 sequences (other than those shown in the main manuscript) from the KITTI Odometry test set. Our method aligns best with the reference ORB-SLAM2-S trajectories.


Figure 8: Estimated trajectories on the KITTI Odometry test set: Seq. 11, 12, 14, 15, 17, 20, and 21.
C Results on KITTI Odometry Training Set
RMSE (m) | Seq. 00 | Seq. 01 | Seq. 02 | Seq. 03 | Seq. 04 | Seq. 05 | Seq. 06 | Seq. 07 | Seq. 08 | |
Geo. | ORB-SLAM2-M (w/o LC) | 54.94 | 568.63 | 58.55 | 1.41 | 2.41 | 29.32 | 51.87 | 16.83 | 36.90
ORB-SLAM2-M | 9.02 | 529.28 | 17.96 | 2.07 | 1.56 | 5.20 | 14.07 | 2.88 | 37.83
Sup. | DeepV2D [teed2018deepv2d] | 101.08 | 484.87 | 121.02 | 3.62 | 8.86 | 35.23 | 113.31 | 12.86 | 55.69
Self-Sup. | SfMLearner [zhou2017unsupervised] | 97.81 | 108.09 | 152.15 | 7.47 | 2.49 | 48.13 | 39.56 | 21.28 | 32.56
GeoNet [yin2018geonet] | 148.81 | 168.90 | 293.46 | 17.58 | 7.26 | 86.94 | 17.69 | 13.88 | 138.00
CC [ranjan2019competitive] | 68.31 | 50.41 | 59.19 | 8.89 | 2.25 | 22.49 | 13.02 | 11.31 | 49.29
DeepMatchVO [shen2019icra] | 51.34 | 85.96 | 127.99 | 11.03 | 3.09 | 27.59 | 20.98 | 16.71 | 38.71
MonoDepth2 [godard2019digging] | 82.05 | 30.81 | 86.64 | 2.40 | 2.00 | 21.49 | 5.16 | 10.42 | 51.83
Ours | 13.13 | 41.38 | 12.61 | 1.61 | 2.22 | 8.24 | 9.16 | 9.92 | 13.98
Rel. trans (%) | Seq. 00 | Seq. 01 | Seq. 02 | Seq. 03 | Seq. 04 | Seq. 05 | Seq. 06 | Seq. 07 | Seq. 08 | |
Geo. | ORB-SLAM2-M (w/o LC) | 14.11 | 131.75 | 12.70 | 1.21 | 2.40 | 9.12 | 18.50 | 10.34 | 9.72
ORB-SLAM2-M | 3.23 | 125.63 | 3.69 | 1.73 | 1.97 | 2.31 | 5.92 | 2.15 | 11.68
Sup. | DeepVO [wang2017deepvo] | - | - | - | 8.49 | 7.19 | 2.62 | 5.42 | 3.91 | -
ESP-VO [wang2018end] | - | - | - | 6.72 | 6.33 | 3.35 | 7.24 | 3.52 | -
GFS-VO [xue2018guided] | - | - | - | 5.44 | 2.91 | 3.27 | 8.50 | 3.37 | -
GFS-VO-RNN [xue2018guided] | - | - | - | 6.36 | 5.95 | 5.85 | 14.58 | 5.88 | -
BeyondTracking [xue2019beyond] | - | - | - | 3.32 | 2.96 | 2.59 | 4.93 | 3.07 | -
DeepV2D [teed2018deepv2d] | 12.38 | 56.26 | 7.79 | 4.07 | 8.22 | 6.35 | 16.67 | 4.96 | 6.63
Self-Sup. | SfMLearner [zhou2017unsupervised] | 19.27 | 21.71 | 18.99 | 9.73 | 3.17 | 10.02 | 11.00 | 11.68 | 8.67
GeoNet [yin2018geonet] | 33.63 | 22.96 | 54.00 | 19.41 | 10.81 | 22.68 | 9.90 | 9.82 | 22.26
CC [ranjan2019competitive] | 10.42 | 15.64 | 8.08 | 8.49 | 2.90 | 5.70 | 4.38 | 5.91 | 7.16
DeepMatchVO [shen2019icra] | 5.31 | 29.57 | 15.94 | 9.67 | 4.15 | 7.42 | 5.69 | 7.62 | 9.43
MonoDepth2 [godard2019digging] | 7.64 | 10.06 | 8.34 | 5.30 | 3.20 | 4.66 | 2.48 | 4.58 | 7.32
Ours | 2.60 | 13.27 | 2.49 | 1.59 | 2.52 | 2.63 | 2.64 | 6.43 | 3.61
Rel. rot (deg/m) | Seq. 00 | Seq. 01 | Seq. 02 | Seq. 03 | Seq. 04 | Seq. 05 | Seq. 06 | Seq. 07 | Seq. 08 | |
Geo. | ORB-SLAM2-M (w/o LC) | 0.003 | 0.010 | 0.003 | 0.002 | 0.002 | 0.002 | 0.003 | 0.003 | 0.003
ORB-SLAM2-M | 0.003 | 0.012 | 0.004 | 0.002 | 0.002 | 0.003 | 0.002 | 0.005 | 0.003
Sup. | DeepVO [wang2017deepvo] | - | - | - | 0.069 | 0.070 | 0.036 | 0.058 | 0.046 | -
ESP-VO [wang2018end] | - | - | - | 0.065 | 0.061 | 0.049 | 0.073 | 0.050 | -
GFS-VO [xue2018guided] | - | - | - | 0.033 | 0.013 | 0.016 | 0.027 | 0.022 | -
GFS-VO-RNN [xue2018guided] | - | - | - | 0.036 | 0.024 | 0.025 | 0.050 | 0.026 | -
BeyondTracking [xue2019beyond] | - | - | - | 0.021 | 0.018 | 0.012 | 0.019 | 0.018 | -
DeepV2D [teed2018deepv2d] | 0.051 | 0.051 | 0.030 | 0.021 | 0.034 | 0.027 | 0.073 | 0.030 | 0.031
Self-Sup. | SfMLearner [zhou2017unsupervised] | 0.057 | 0.026 | 0.033 | 0.035 | 0.033 | 0.036 | 0.038 | 0.059 | 0.026
GeoNet [yin2018geonet] | 0.057 | 0.041 | 0.061 | 0.098 | 0.070 | 0.077 | 0.043 | 0.059 | 0.078
CC [ranjan2019competitive] | 0.035 | 0.011 | 0.016 | 0.041 | 0.012 | 0.022 | 0.008 | 0.031 | 0.023
DeepMatchVO [shen2019icra] | 0.013 | 0.013 | 0.024 | 0.046 | 0.020 | 0.017 | 0.022 | 0.037 | 0.012
MonoDepth2 [godard2019digging] | 0.021 | 0.010 | 0.015 | 0.014 | 0.008 | 0.017 | 0.004 | 0.026 | 0.024
Ours | 0.005 | 0.003 | 0.003 | 0.006 | 0.005 | 0.005 | 0.007 | 0.021 | 0.003
In Table 9, we compare results on the training set of the KITTI Odometry dataset. Note that all supervised methods are trained on Sequences 00, 02, 08, and 09 of the KITTI Odometry dataset [Geiger2012CVPR], except DeepV2D [teed2018deepv2d], which is trained on the Eigen split of the KITTI raw dataset [Geiger2013IJRR]. Compared to other self-supervised approaches, our method achieves smaller errors on the training set, indicating that the proposed system can effectively learn to model the camera pose trajectory during training. Our method also compares favorably against the geometric method ORB-SLAM2.
In Figure 9, we show the qualitative results of our method on Seq. 00-08 on the KITTI Odometry dataset.



Figure 9: Estimated trajectories of our method on KITTI Odometry Seq. 00-08.
D Snippet-level Pose Results and Depth Results
For completeness, we provide the pose estimation results evaluated on 5-frame snippets in Table 10 and the single-view depth estimation results in Table 11. Note that the depth network is fixed during the second stage of training, so for the depth evaluation we only train our model for the first stage on the Eigen split of the KITTI raw dataset. As we can see in Table 10, although CC [ranjan2019competitive] and DeepMatchVO [shen2019icra] achieve good results at the snippet level, their results at the video level are no longer state-of-the-art. This indicates that evaluating camera pose estimation at the snippet level can be misleading, and the whole trajectory needs to be evaluated to reflect the holistic performance. In Table 11, we also observe that our method slightly outperforms the current self-supervised state-of-the-art MonoDepth2 [godard2019digging], which indicates that a better pose estimation module can lead to better depth estimation.
Seq. 09 | Seq. 10 | |
ORB-SLAM (full) | 0.014 ± 0.008 | 0.012 ± 0.011
SfMLearner [zhou2017unsupervised] | 0.021 ± 0.017 | 0.020 ± 0.015
vid2depth [mahjourian2018unsupervised] | 0.013 ± 0.010 | 0.012 ± 0.011
GeoNet [yin2018geonet] | 0.012 ± 0.007 | 0.012 ± 0.009
DF-Net [zou2018dfnet] | 0.017 ± 0.007 | 0.015 ± 0.009
CC [ranjan2019competitive] | 0.012 ± 0.007 | 0.012 ± 0.008
DeepMatchVO [shen2019icra] | 0.009 ± 0.005 | 0.008 ± 0.007
MonoDepth2 [godard2019digging] | 0.017 ± 0.008 | 0.015 ± 0.010
Ours | 0.015 ± 0.006 | 0.015 ± 0.009
Error metric | Accuracy metric | |||||
Method | Abs Rel | Sq Rel | RMSE | log RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³
SfMLearner [zhou2017unsupervised] | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
vid2depth [mahjourian2018unsupervised] | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968
GeoNet [yin2018geonet] | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973
DF-Net [zou2018dfnet] | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973
CC [ranjan2019competitive] | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975
DeepMatchVO [shen2019icra] | 0.156 | 1.309 | 5.73 | 0.236 | 0.797 | 0.929 | 0.969
MonoDepth2 [godard2019digging] | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981
SC-SfMLearner [bian2019unsupervised] | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975
Ours | 0.115 | 0.871 | 4.778 | 0.191 | 0.874 | 0.961 | 0.982
E Additional KITTI Sequences
As mentioned in the main manuscript, we selected 18 sequences from KITTI raw data to further evaluate the methods, which have no overlaps with either KITTI Odometry split or Eigen split. We list the sequence names in Table 12.
Sequence names |
2011_09_26_drive_0036 |
2011_09_26_drive_0086 |
2011_09_26_drive_0101 |
2011_09_26_drive_0117 |
2011_09_29_drive_0071 |
2011_10_03_drive_0047 |
2011_09_26_drive_0059 |
2011_09_26_drive_0027 |
2011_09_26_drive_0009 |
2011_09_26_drive_0013 |
2011_09_26_drive_0029 |
2011_09_26_drive_0064 |
2011_09_26_drive_0084 |
2011_09_26_drive_0096 |
2011_09_26_drive_0106 |
2011_09_26_drive_0056 |
2011_09_26_drive_0023 |
2011_09_26_drive_0093 |