Log In Sign Up

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

by   Yuliang Zou, et al.

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel a loss that incorporates temporally distant (e.g., O(100)) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a "global" loss given the first stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.


page 8

page 14

page 24


Learning by Inertia: Self-supervised Monocular Visual Odometry for Road Vehicles

In this paper, we present iDVO (inertia-embedded deep visual odometry), ...

MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints

We present a novel self-supervised algorithm named MotionHint for monocu...

Geometric Consistency for Self-Supervised End-to-End Visual Odometry

With the success of deep learning based approaches in tackling challengi...

PoseConvGRU: A Monocular Approach for Visual Ego-motion Estimation by Learning

While many visual ego-motion algorithm variants have been proposed in th...

Deep Visual Odometry with Adaptive Memory

We propose a novel deep visual odometry (VO) method that considers globa...

Deep Direct Visual Odometry

Monocular direct visual odometry (DVO) relies heavily on high-quality im...

A Consistency-Based Loss for Deep Odometry Through Uncertainty Propagation

The incremental poses computed through odometry can be integrated over t...

1 Introduction

Most existing VO systems are either geometric or learning-based. In this paper, we argue that a truly robust VO system should combine the best of both worlds (i.e., geometry and learning). In particular, we propose a self-supervised method to learn monocular VO with long-term modeling, where the training scheme is directly inspired by traditional geometric methods (see Figure 1).

At the heart of the state-of-the-art VO systems [engel2017direct, engel2014lsd, forster2014svo, murORB2] is the incorporation of several long-studied geometric modules, including keypoint tracking, motion estimation, keyframe insertion, and bundle adjustment (BA) [triggs1999bundle]. With all these modules, a key insight is to optimize the states (e.g., 6-DoF camera poses) over long-term observations such that the system suffers less from error accumulation [scaramuzza2011visual]. While being robust in normal scenarios, monocular VO still suffers from the difficulty in initialization for slow motions [mur2015orb], and the tracking tends to fail miserably in unconstrained environments with large texture-less regions, fast movements, or other adverse factors [yang2018challenges] such as rolling shutter effect [schubert2018direct, zhuang2019learning] and unknown camera intrinsics [bogdan2018deepcalib, zhuang2019degeneracy].

In contrast, learning-based VO methods [wang2017deepvo, wang2018end, xue2018guided, xue2019beyond]

have the potential of being more robust to the aforementioned challenges by harnessing the rich priors from data. However, training neural networks in a supervised way involves collecting large-scale, diverse datasets with ground truth annotations, which could be labor-extensive and time-consuming. Recently, self-supervised methods 

[godard2019digging, li2018undeepvo, ranjan2019competitive, wang2019recurrent, zhan2018unsupervised, zhou2017unsupervised, zou2018dfnet] have been proposed to tackle this task. Instead of supervising the networks with ground truth labels, the idea is to couple the depth and pose networks with photometric errors across adjacent frames and jointly train them in an end-to-end manner. Nonetheless, the performance of these methods still falls behind that of geometric methods [mur2015orb] for general scenarios.

One of the potential reasons for their performance gap is that the pose networks do not exploit the temporal coherence over long sequences. During training, these networks receive short snippets (e.g., 3-frame or 5-frame) as input and predict the ego-motions that are optimized locally for the current snippet. When evaluating these methods in short snippets, they compare favorably even with the state-of-the-art geometric methods [mur2015orb]. However, if we concatenate all the predictions to form the full trajectory, it is often the case that the learning-based methods generate much larger pose errors, as illustrated in Figure 2.

Evaluation 5-frame video-level SfMLearner [zhou2017unsupervised] 0.019 97.81 Our method 0.015 13.13 (a) SfMLearner [zhou2017unsupervised] (b) Our method (c) RMSE (m)
Figure 2: 5-frame v.s. video-level evaluation. Evaluating visual odometry requires having a global picture of recovered trajectories. However, most self-supervised methods only evaluate the trajectories within a short snippet, which may not reflect the holistic performance. “5-frame” means that we evaluate the results using 5-frame snippets, and “video-level” means that we evaluate on the entire trajectory.

In this paper, we argue that learning VO requires explicit long-term modeling to infuse the insights from geometric methods [engel2017direct, engel2014lsd, forster2014svo, murORB2]. To this end, we propose a novel self-supervised VO learning framework that draws inspiration from geometric modules. Specifically, we build our learning framework upon a depth network of an auto-encoder structure with skip-connections [godard2019digging] and a pose network with a two-layer LSTM module [xue2019beyond]. In contrast to the supervised method by Xue et al. [xue2019beyond], our method incorporates extra depth information and uses a completely different training scheme, leading to a purely self-supervised learning framework. To mitigate error accumulation, we propose a cycle consistency constraint between the two-layer predictions, mimicking a mini loop closure module, which improves the pose consistency over the sequence. In order to model long-term dependency in VO, we propose a two-stage training strategy, which considers both short-term and long-term constraints. The proposed two training stages correspond to the local and global bundle adjustment modules in the geometric VO, allowing us to refine the poses within a large temporal range.

In summary, our contributions are:

  • We propose a novel self-supervised VO learning framework that explicitly models long-term temporal dependency.

  • We build connections between our method and key building blocks of geometric VO systems and demonstrate well-motivated designs.

  • We evaluate the full pose trajectories by our method, against the state-of-the-art geometric and learning-based baselines, and achieve competitive results on standard VO datasets, including KITTI and TUM RGB-D.

To the best of our knowledge, our method is the first of the kind that is able to learn from “truly” long sequences (e.g., 100 frames) in the training stage. Our experiments show that our proposed method gives rise to significant empirical benefits by explicitly considering long-term modeling.

2 Related Work

Geometric Methods. Visual odometry is a long-standing problem that estimates the ego-motion incrementally [nister2004visual, scaramuzza2011visual] using visual input. A conventional geometric VO system usually consists of the following components [scaramuzza2011visual]: feature detection, feature matching (or tracking), motion estimation (e.g., triangulation [hartley1997triangulation]), and local optimization (e.g., bundle adjustment). A keyframe mechanism [klein2007parallel] is also adopted for improved robustness in motion estimation. Incorporated with a mapping system that reconstructs the 3D scene structures, a VO system turns to a system called Simultaneous Localization and Mapping (SLAM) [cadena2016past]. The key to the robustness of the modern VO/SLAM systems [mur2015orb, tiwari2020pseudo] lies in their capability to extract reliable image measurements and optimize the states (e.g., 6-DoF camera poses) over a large number of frames. In this work, we leverage these geometric insights to design a robust learning-based VO system.

Fully-Supervised Methods. With the success of deep neural networks, end-to-end learning-based methods [wang2017deepvo, wang2018end, xue2018guided, xue2019beyond] have been proposed to tackle the visual odometry problem. These methods often rely on a supervised loss using the ground-truths to regress the 6-DoF camera relative pose from a pair of consecutive images. Recently, some methods [bloesch2018codeslam, tang2018ba, teed2018deepv2d, ummenhofer2017demon, zhou2018deeptam] exploit CNNs to predict the scene depth and camera pose jointly, utilizing the geometric connection between the structure and the motion. This corresponds to learning Structure-from-Motion (SfM) in a supervised manner. Although the methods above achieve good performance, they require ground-truth annotations to train the networks. In contrast, our method is self-supervised, requiring nothing but the monocular video frames.

Self-Supervised Methods. To mitigate the requirement of data annotations, self-supervised methods [godard2019digging, li2018undeepvo, ranjan2019competitive, wang2019recurrent, zhan2018unsupervised, zhou2017unsupervised] have been proposed to tackle the SfM task. The main supervisory signal of these methods comes from the photometric-consistency between corresponding pixels of neighboring frames. While they achieve good performance on single-view depth estimation, the performance of ego-motion estimation still lags behind the traditional SLAM/VO methods. Recently, Bian et al. [bian2019unsupervised] argue that the pose networks cannot provide full camera trajectories over long sequences due to the inconsistent scale of per-frame estimations and thus propose a geometry consistency constraint. However, their method only enforces the globally consistent trajectories by propagating the consistency on overlapping short snippets during training. In contrast, our method directly optimizes over long sequences via long-term modeling. Inspired by the keyframe mechanism in geometric methods, Sheng et al. [sheng2019unsupervised] propose to jointly learn depth, ego-motion, and keyframe selection simultaneously in a self-supervised manner. Similarly, the training of this method considers only short snippets and thus is unable to model long-term dependency.

Sequential Modeling.

Sequential modeling based on recurrent neural networks (RNNs) has been successfully applied to many applications, such as speech recognition 

[chorowski2014end], machine translation [graves2014towards], and video prediction [srivastava2015unsupervised, villegas2017learning]. Aiming to estimate the full trajectory over a long sequence of frames, VO can be naturally formulated as a sequential learning problem and thus modeled with RNNs [wang2019recurrent, wang2017deepvo, wang2018end, xue2018guided]. Recently, Xue et al. [xue2019beyond] propose to use a two-layer LSTM network for pose estimation, where the first layer estimates the relative motion between consecutive frames, and the second layer estimates global absolute poses.

Despite using a similar pose network, our method differs from Xue et al. [xue2019beyond] in being self-supervised v.s. full-supervised and the associated training strategies. First of all, our method further incorporates depth information, while the method in [xue2019beyond] relies only on pose features. Apart from the photometric discrepancy as to the supervisory signal, we enforce a cycle consistency between the predictions from the two-layer LSTM modules, which serves as a mini “loop closure” module that mimics the geometry VO system. More importantly, we decouple our network training into two stages, allowing our method to optimize over long sequences (more than 90 frames) during training, whereas the method in [xue2019beyond] only trains with 11-frame snippets. To our knowledge, this is the firstdeep learning approach for visual odometry that takes long sequences as input in the training stage.

3 Method

(a) 1 stage of training (short snippet) (b) 2 stage of training (long sequence)
Figure 3: Overview. Our method adopts a stage-wise training strategy. (a) In the first stage of training, we jointly train all the components: the depth encoder, the depth decoder, the pose encoder, the first and second layer of ConvLSTM (Sec. 3.2); (b) In the second stage of training, we pre-extract features as input and fine-tune the second layer of the ConvLSTM module only (Sec. 3.3).

Figure 3 provides a high-level overview of the proposed monocular VO system. Our system has two major components: a depth network and a pose network.222Since accurate pose prediction is the primary focus of this paper, we name our method as VO instead of SfM or SLAM. The single-image depth network employs an encoder-decoder structure with skip-connections [godard2019digging]. The pose network consists of a FlowNet backbone [dosovitskiy2015flownet], a two-layer LSTM module [xue2019beyond], and two pose prediction heads (with one after each of the LSTM layers). In the two-layer recurrent architecture, the first-layer focuses on predicting consecutive frame motions, while the second-layer refines the estimations from the first-layer [xue2019beyond].

3.1 Background

We formulate the monocular visual odometry task as a view synthesis problem, by training the networks to predict a target image from the source image with the estimated depth and camera pose. Such a system typically consists of two components: a depth network which takes a single RGB image as input to predict the depth map, and a pose network which takes a concatenation of two consecutive frames as input to estimate the 6-DoF ego-motion.

Given two input images and , the estimated depth map and camera pose , we can then compute the per-pixel correspondence between the two input images. Assume a known camera intrinsic matrix , and let represent the 2D homogeneous coordinate of a pixel in . We can find the corresponding point of in following the equation [zhou2017unsupervised]:


Appearance loss. For a self-supervised visual odometry system, the primary supervision comes from the appearance dissimilarity between the synthesis image and the target image. To effectively handle occlusion, we use three consecutive frames to compute the per-pixel minimum photometric reprojection loss [godard2019digging], i.e.,


where is a weighted combination of the L2 loss and the Structured SIMilarity (SSIM) loss, denotes the frame synthesized from using Eq. (1). To handle static pixels, we adopt the auto-masking mechanism following Godard et al. [godard2019digging].

Smoothness loss. Since the appearance loss cannot provide meaningful supervision for texture-less or homogeneous regions of the scene, a smoothness prior of disparity is incorporated. We here use the edge-aware smoothness loss as in Wang et al. [wang2018learning].

Remark 1

The appearance loss in Eq. (2) corresponds to a local photometric bundle adjustment objective, which is also commonly used in the geometric direct VO/SLAM systems [engel2017direct, engel2014lsd, wang2017stereo].

3.2 Cycle consistency within memory-aided sequential modeling

With the above setting, current state-of-the-art self-supervised methods estimate the ego-motion within a local range, discarding the sequential dependence and dynamics in the long sequences. Such information, however, is essential for a pose network to recover the entire trajectory in a consistent manner. We thus adopt a recurrent structure of our pose network to utilize the temporal information.

Sequential modeling. To learn to utilize the temporal information, we adopt the recurrent network structure with a convolution LSTM (ConvLSTM) module [xingjian2015convolutional]. Previously, the pose network takes the concatenation of two frames and outputs the 6-DoF camera pose directly. After incorporating the ConvLSTM module, the pose network also takes the previous estimation information into account when predicting the output. Formally, we have


where is the pose encoder, denotes the output and hidden state of ConvLSTM at time , is a linear layer to predict the 6-DoF motion . By doing this, the network implicitly learns to aggregate temporal information and learns the motion pattern.

Memory buffer and refinement. In the sequential modeling setting above, the pose network estimates the relative pose for every two consecutive frames. However, the motions between consecutive frames are often tiny, which results in difficulties in extracting good features for relative pose estimation. Thus, predicting the camera pose from a non-adjacent “anchor” frame to the current frame could be a better option. Note that many traditional SLAM systems [mur2015orb, murORB2] adopt a keyframe mechanism and always compute camera poses from the current frame to the most recent keyframe.

Inspired by the keyframe mechanism, we incorporate the second-layer ConvLSTM and adopt the memory module proposed by Xue et al. [xue2019beyond]

. After each step in the first-layer ConvLSTM, we store the hidden state tensor in a memory buffer, whose length is set to the length of the input snippet. When we read out from the memory buffer, we compute the weighted average of all the memory slots in the memory buffer as in 


We also compute the depth and pose features for the first frame and the current frame as additional input to the second-layer ConvLSTM. This can be formally written as


where is the depth encoder, is the read-out memory, , denote the output and hidden state from the second-layer at time , and is another linear layer predicting the absolute pose in the current snippet.

Remark 2

The ConvLSTMs explicitly model the sequential nature of the VO problem and meanwhile facilitate the implementation of a keyframe mechanism. Compared to the memory module by Xue et al. [xue2019beyond], which only considers pose features, our memory module accommodates both depth and pose features. As verified in our experiments  333Table 6 in the supplementary material., incorporating the additional depth information in memory improves the overall performance.


Figure 4: Cycle-consistency over two-layer poses. In our model, the first layer ConvLSTM estimates the relative pose between consecutive frames, and the second layer ConvLSTM predicts the “absolute” pose within the current snippet. By exploiting the transitivity of camera poses, we incorporate a cycle-consistency constraint between the two layers. Animation can be viewed in Adobe Reader.

Cycle consistency over two-layer poses. To train the second-layer ConvLSTM, we utilize the photometric error between the first frame and the other frames of the input snippet, i.e.,


where is the number of frames for the input snippet, which is set to 7 in our model.

Also, according to the transitivity of the camera transformation, we have an additional constraint to ensure the consistency between the first and second layer ConvLSTM (as shown in Figure 4), i.e.,


Thus, the overall objective is


where are the hyper-parameters to balance the scale of different terms, which are empirically set to .

Remark 3

The loss in Eq. (11) can be thought of as a mini “loop closure” module that enforces the cycle-consistency between the outputs of two ConvLSTM layers. Note that our method is also compatible with the existing full loop closure techniques [kummerle2011g], which we will consider in the future work.

3.3 Long-range constraints via stage-wise training

Although we adopt a recurrent network structure to aggregate temporal information for better performance, the network has never seen long sequences but only short snippets during the training time. Thus, the network may not learn how to fully utilize the long-term temporal context. The hurdle that prevents us from taking long sequences as input is the limited memory volume of modern GPUs. To tackle this problem for training a long-term model, we propose a two-stage training strategy. We first train our whole model with the full objective using short snippets.

Once the first stage of training is finished, we run this model on each sequence in the dataset separately, to extract the required input for the second-layer ConvLSTM and store them. After that, we only fine-tune the lightweight second-layer ConvLSTM, without the heavy feature extraction and depth networks, which saves us a lot of memories. By doing this, we can now feed long sequences into the network during the training time, allowing the network to better learn how to utilize the temporal context. Since only the second-layer ConvLSTM is optimized, our loss for the second stage of training is


where is the number of frames of each snippet, which is set to 7; is the number of snippets in the input sequence, which is set to 16. Note that consecutive snippets have one frame in common, and thus the total number of frames in the input sequence is 97. The synthesized image is a function of depth and pose, where pose encodes long-range constraints through hidden states of ConvLSTMs, yielding an effective window of 97 frames. We summarize our method in Algorithm 1.

Remark 4

The second training stage can be viewed as a motion-only bundle adjustment module [mur2015orb] that considers long-term modeling.

1 First stage: short-range training ;
Input : 7-frame snippet
Trainable : Depth Encoder, Depth Decoder, Pose Encoder, First-layer ConvLSTM, Second-layer ConvLSTM
Objective :  (Eq. (12))
2 Second stage: long-range training ;
Input : 97-frame sequence
Trainable : Second-layer ConvLSTM
Objective :  (Eq. (13))
Algorithm 1 Stage-wise training strategy

4 Experimental Results

4.1 Settings

Datasets. To evaluate our method, we conduct the main experiments on the KITTI dataset [Geiger2013IJRR, Geiger2012CVPR]

, which consists of urban and highway driving sequences for road scene understanding 

[song2014robust, dhiman2016continuous]. The odometry split of KITTI is a widely used benchmark for odometry/SLAM evaluation. It contains 22 sequences, among which Sequence 00-10 have ground truth trajectory labels, and the annotations of the remaining sequences are not publicly available. Following Zhou et al.[zhou2017unsupervised], we use Sequence 00-08 as our training set and validate the models on Sequence 09 and 10. Besides, we select 18 more sequences from KITTI raw data, which have no overlaps with the odometry split, for further evaluation. Since the ground truth trajectories of Sequence 11-21 are not available, we run ORB-SLAM2 (stereo version) to get predictions as (pseudo) ground truth for evaluation. In addition to these outdoor scenes, we also train and evaluate our model on the TUM RGB-D dataset [sturm2012benchmark]. This dataset is collected by hand-held cameras in indoor environments with challenging conditions. We use the same train/test split as in Xue et al. [xue2019beyond].

Evaluation metrics.

For the KITTI dataset, we adopt the absolute trajectory RMSE and relative translation/rotation errors for all possible subsequences of length (100, 200, …, 800 meters). For the TUM RGB-D dataset, we use the translational RMSE as our evaluation metrics. For self-supervised monocular methods, since the absolute scale is unknown, we align the trajectory globally using the evo toolbox 


Implementation details.

We use ImageNet pretrained ResNet-18 as our depth encoder. Our depth decoder structure is the same as Godard et al. 

[godard2019digging]. For the pose encoder, we take the encoder of FlowNet-S structure until the last layer, which is pre-trained on FlyingChairs [dosovitskiy2015flownet]

for optical flow estimation. We implement our system using the publicly available PyTorch framework and conduct all our experiments with a single TitanXP GPU. For both stages, we train the network with the Adam optimizer 


for 20 epochs. The learning rate is set to 5e-5 for the first 15 epochs and drops to 5e-6 for the remaining epochs. The input size is 640

192, and the batch size is set to 2. In the first stage of training, the number of frames is set to 7, while the number of frames is set to 97 in the second stage. Note that the long-term optimization only happens in the training time. At the test time, our model runs at 14.3 frames per second.

4.2 Results

Seq. 09 Seq. 10
Method RMSE (m) Rel. trans. (%) Rel. rot. (deg/m) RMSE (m) Rel. trans. (%) Rel. rot. (deg/m)
Baseline 22.71 7.55 0.028 17.87 10.43 0.046
One-layer ConvLSTM 23.45 5.59 0.016 11.93 7.23 0.023
Two-layer ConvLSTM 9.77 4.23 0.013 12.68 6.02 0.023
Two-layer ConvLSTM + Two-stage training 11.30 3.49 0.010 11.80 5.81 0.018
Table 1: Ablation study. We evaluate different variants of the proposed method on sequences 09 and 10 of the KITTI Odometry dataset [Geiger2012CVPR]. The best performance is in bold and the second best is underlined.

(a) Full trajectories (b) Zoom-in
Figure 5: Visual comparison on Sequence 09. We compare different variants of our method to validate the design choice. As we can see, adding each of the components gradually improves the overall performance. Best viewed in color.

Ablation study. To validate our design choice, we perform an ablation study on Sequence 09 and 10 of the KITTI Odometry dataset. We consider the following variants. 1) Baseline: the pose network takes as input the concatenation of two consecutive frames to generate pose estimation; 2) One-layer ConvLSTM: we incorporate a one-layer ConvLSTM for the pose network; 3) Two-layer ConvLSTM: we use the full model, but only conduct the first stage of training; 4) Two-layer ConvLSTM + Two-stage training: our final model with the two-stage training strategy.

As shown in Table 1, the performance gradually improves as we add more components. Specifically, adding a recurrent module improves the overall performance over the baseline; adding the second layer LSTM leads to further improvement, which validates the effectiveness of the second layer in the self-supervised learning setting; applying our second stage long-term training again boosts the performance, achieving a new state-of-the-art for self-supervised methods. We show a visual comparison in Figure 5.

Seq. 09 Seq. 10
Method RMSE (m) Rel. trans. (%) Rel. rot. (deg/m) RMSE (m) Rel. trans. (%) Rel. rot. (deg/m)
Geo. ORB-SLAM2-M (w/o LC)  [murORB2] 44.10 9.67 0.003 6.43 4.04 0.003
ORB-SLAM2-M  [murORB2] 8.84 3.22 0.004 8.51 4.25 0.003
Sup. DeepVO  [wang2017deepvo] - - - - 8.11 0.088
ESP-VO  [wang2018end] - - - - 9.77 0.102
GFS-VO  [xue2018guided] - - - - 6.32 0.023
GFS-VO-RNN  [xue2018guided] - - - - 7.44 0.032
BeyondTracking  [xue2019beyond] - - - - 3.94 0.017
DeepV2D  [teed2018deepv2d] 79.06 8.71 0.037 48.49 12.81 0.083
Self-Sup. SfMLearner  [zhou2017unsupervised] 24.31 8.28 0.031 20.87 12.20 0.030
GeoNet  [yin2018geonet] 158.45 28.72 0.098 43.04 23.90 0.090
Depth-VO-Feat  [zhan2018unsupervised] - 11.93 0.039 - 12.45 0.035
vid2depth  [mahjourian2018unsupervised] - - - - 21.54 0.125
UnDeepVO  [li2018undeepvo] - 7.01 0.036 - 10.63 0.046
Wang et al. [wang2019recurrent] - 9.88 0.034 - 12.24 0.052
CC  [ranjan2019competitive] 29.00 6.92 0.018 13.77 7.97 0.031
DeepMatchVO  [shen2019icra] 27.08 9.91 0.038 24.44 12.18 0.059
PoseGraph  [li2019pose] - 8.10 0.028 - 12.90 0.032
MonoDepth2  [godard2019digging] 55.47 11.47 0.032 20.46 7.73 0.034
SC-SfMLearer [bian2019unsupervised] - 11.2 0.034 - 10.1 0.050
Ours 11.30 3.49 0.010 11.80 5.81 0.018
Table 2: Comparison with the state-of-the-art. The results of ORB-SLAM2-M methods are the medians of 5 runs. ‘-’ means the results are not available from that paper. For DeepV2D [teed2018deepv2d], SfMLearner [zhou2017unsupervised], GeoNet [yin2018geonet], CC [ranjan2019competitive], DeepMatchVO [shen2019icra], and MonoDepth2 [godard2019digging], we take the pre-trained models and run on the sequences to get the results. The best performance of each block is in bold, and the second best is underlined.

Comparison with the state-of-the-art methods. For comparison, we select several state-of-the-art methods, including the monocular version of ORB-SLAM2 [murORB2] (denoted as ORB-SLAM2-M) (with and without loop closure optimization), several supervised learning methods [teed2018deepv2d, wang2017deepvo, wang2018end, xue2018guided, xue2019beyond], and some other self-supervised methods [bian2019unsupervised, godard2019digging, li2018undeepvo, li2019pose, mahjourian2018unsupervised, ranjan2019competitive, shen2019icra, wang2019recurrent, yin2018geonet, zhan2018unsupervised, zhou2017unsupervised]. Note that all supervised methods are trained on Sequence 00, 02, 08, 09 of the KITTI Odometry dataset [Geiger2012CVPR], except DeepV2D [teed2018deepv2d], which is trained on the Eigen split of KITTI raw dataset [Geiger2013IJRR]. As we can see in Table 2, our final model outperforms other self-supervised methods by a significant margin. In particular, our method outperforms the recent proposed SC-SfMLearner [bian2019unsupervised], which aims to reconstruct the scale-consistent camera trajectory. This indicates that the explicit long-term modeling used in our approach is more effective than propagating the geometric constraint among overlapping short snippets in SC-SfMLearner [bian2019unsupervised]. Our model also compares favorably with the geometric method and outperforms all supervised methods except Xue et al. [xue2019beyond] on Sequence 10.

Results on additional KITTI sequences. From the raw data of the KITTI dataset, we select 18 short sequences that have no overlaps with either the training or test split of the KITTI Odometry dataset.444The sequence names are available in the supplementary material. We then apply the same pre-trained models on these test sequences. As we can see in Table 3, our method outperforms other learning-based methods and even compares favorably with ORB-SLAM2-M methods in terms of RMSE and relative translation error.

Method RMSE (m) Rel. trans (%) Rel. rot. (deg/m)
Geo. ORB-SLAM2-M (w/o LC)  [murORB2] 7.17 9.41 0.008
ORB-SLAM2-M  [murORB2] 8.12 10.64 0.008
Sup. DeepV2D  [teed2018deepv2d] 10.94 11.81 0.028
Self-Sup. SfMLearner  [zhou2017unsupervised] 10.79 13.82 0.041
GeoNet  [yin2018geonet] 14.41 18.99 0.076
CC  [ranjan2019competitive] 7.51 10.49 0.024
DeepMatchVO  [shen2019icra] 8.53 12.76 0.033
MonoDepth2  [godard2019digging] 7.51 11.99 0.028
Ours 6.47 9.99 0.019
Table 3: Average results on 18 additional KITTI sequences. The results of ORB-SLAM2-M methods are the medians of 5 times. The best performance of each block is in bold, and the second best is underlined.
Method RMSE (m) Rel. trans (%) Rel. rot. (deg/m)
Geo. ORB-SLAM2-M (w/o LC)  [murORB2] 81.20 19.60 0.009
ORB-SLAM2-M  [murORB2] 44.09 12.96 0.007
Sup. DeepV2D  [teed2018deepv2d] 221.33 24.61 0.041
Self-Sup. SfMLearner  [zhou2017unsupervised] 75.00 26.54 0.045
GeoNet  [yin2018geonet] 94.98 29.11 0.062
CC  [ranjan2019competitive] 55.44 16.65 0.032
DeepMatchVO  [shen2019icra] 95.79 17.31 0.038
MonoDepth2  [godard2019digging] 99.36 12.28 0.031
Ours 71.63 7.28 0.014
Table 4: Results on KITTI Odometry official test split. Since ground truth trajectories are not publicly available, we use estimations from the stereo version of ORB-SLAM2 as pseudo ground truth. The best performance of each block is in bold, and the second best is underlined.
(a) Seq. 13 (b) Seq. 16 (c) Seq. 18 (d) Seq. 19
Figure 6: Visual comparison on the KITTI Odometry test set. We show the trajectories of ORB-SLAM2-S, CC [ranjan2019competitive], MonoDepth2 [godard2019digging] and our method. Our method aligns best with the reference ORB-SLAM2-S trajectories. Best viewed in color.

Results on KITTI test sequences. Since the ground truth trajectories of Sequence 11-21 on the KITTI Odometry dataset are not available, we cannot directly recover the global scale using similarity transformation. Thus, we choose to run the stereo version of ORB-SLAM2 (denoted as ORB-SLAM2-S), which is one of the state-of-the-art methods on these sequences. To compare with other methods, we treat the estimations from ORB-SLAM2-S as pseudo ground truth. As we can see in Table 4, our method achieves state-of-the-art performance among the learning-based methods and even compares favorably with ORB-SLAM2-M in terms of the relative translation error. We visualize trajectories of four sequences in Figure 6. As we can see, our method better aligns with the reference trajectories from ORB-SLAM2-S. We also submit the globally aligned results to the KITTI evaluation server and get similar numbers, which indicates using ORB-SLAM2-S as pseudo ground truth for evaluation is reasonable. Please refer to the supplementary material for more details.

Figure 7: Qualitative Results on the TUM RGB-D dataset. We overlay the first frame of each sequence with the trajectory. Best viewed in color.

Results on the TUM RGB-D dataset. The TUM RGB-D dataset was created to evaluate the performance of RGB-D SLAM and is thus very challenging for monocular methods. To test our model in indoor environments, we instead compare our model with several strong baselines. For traditional methods, we choose the monocular version of ORB-SLAM2 (denoted as ORB-SLAM2-M) and DSO [engel2017direct]. For learning-based methods, we choose the BeyondTracking method from Xue et al. [xue2019beyond] and the recent DeepV2D [teed2018deepv2d]. For DeepV2D, since it is trained on another indoor dataset, in case it cannot generalize its scale to TUM RGB-D, we provide two options: with and without global scale alignment.

Table 5 shows that traditional methods perform well on some of the sequences, but they failed to produce results from the remaining ones due to tracking failure. Our method outperforms the supervised baseline DeepV2D [teed2018deepv2d] on most of the sequences, but compares less favorably than the supervised VO method in [xue2019beyond]. We conjecture that this is due to the limited amount of training data available on the TUM RGB-D dataset. Currently, we use the same amount of training data as supervised methods. Adding more unlabeled video data to the training might lead to better performance for our method. We also notice that the rolling shutter issue in this dataset makes the photometric consistency assumption less accurate, which could potentially hurt the performance of both the proposed method and DSO.

Method Seq. 1 Seq. 2 Seq. 3 Seq. 4 Seq. 5 Seq. 6 Seq. 7 Seq. 8 Seq. 9 Seq. 10 Avg.
Geo. ORB-SLAM2-M  [murORB2] 0.041 0.184 - - - - - 0.057 - 0.018 -
DSO  [engel2017direct] - 0.197 - 0.737 - 0.082 - 0.093 0.543 0.040 -
Sup. BeyondTracking  [xue2019beyond] 0.153 0.208 0.056 0.070 0.172 0.015 0.123 0.007 0.035 0.042 0.088
DeepV2D  [teed2018deepv2d] 0.232 0.651 0.186 0.167 0.171 0.029 0.435 0.106 0.085 0.082 0.214
DeepV2D (aligned)  [teed2018deepv2d] 0.087 0.300 0.114 0.106 0.181 0.013 0.380 0.110 0.094 0.098 0.148
Self-Sup. Ours 0.192 0.190 0.083 0.122 0.177 0.016 0.219 0.102 0.179 0.107 0.139
Table 5: Results on the TUM-RGBD dataset [sturm2012benchmark]. ‘-’ means that traditional method fails in that sequence. The best performance of each block is in bold, and the second best is underlined.

Discussions. Our learning framework is motivated by geometric VO methods. The FlowNet backbone mimics the tracking module to extract pair-wise image features, and the LSTMs model the sequential nature of the VO problem. The design of the two-layer LSTM module also resembles the keyframe mechanism of geometric VO in the sense that the second LSTM predicts the motion between a keyframe and a non-keyframe, refining the initial consecutive estimations from the first LSTM. The cycle consistency constraint between the two-layer LSTM estimations serves as a mini loop closure to enforce the transitivity consistency of poses. The second stage of training allows our network to explicitly optimize over long sequences, which resembles the motion-only bundle adjustment module. We combine the best of both geometry and learning by building a self-supervised VO framework whose components (network, loss, training scheme) are fully inspired by the well-studied geometric modules. As verified in our experiments, these geometry inspired designs lead to significantly better results than the existing self-supervised baselines.

Limitations. Although the proposed system achieves a good camera pose estimation performance in terms of translation error, the improvement on the rotation prediction is not as substantial as the translation. We conjecture that the large rotation error may be due to the bias within the training data. Specifically, for driving scenarios, the translational motions occur much more frequently than the rotational ones. Training on more diverse video sequences or synthetic data could potentially alleviate the inherent bias in the existing datasets. Also, we observe that our system fails under the over-exposure scenarios since our method still relies on visual input to extract information.

5 Conclusions

In this work, we learn a monocular visual odometry system in a self-supervised manner to mimic critical modules in traditional geometric methods. We first adopt a two-layer convolutional LSTM module to model the long-term dependency in the pose estimation. To allow the network to see beyond short snippets (e.g., 3 or 5 frames) during the training time, we propose a stage-wise training strategy. Combining the recurrent architecture and the proposed decoupled training scheme, our system achieves state-of-the-art performance among self-supervised methods. In the current form, we do not have a mechanism to detect loops and perform full loop closure. In the future, we plan to study how to incorporate it into our learning framework.

Acknowledgment. This work was part of Y. Zou’s internship at NEC Labs America, in San Jose. Y. Zou and J.-B. Huang were also supported in part by NSF under Grant No. (#1755785).


Supplementary Material

In this supplementary document, we provide additional experimental results and information to complement the main manuscript. First, we conduct additional ablation experiments to further validate our design choices. Second, we show our results on the KITTI Odometry leaderboard. Third, we show results on the KITTI Odometry training split. Fourth, we show results on the snippet-level pose and single-view depth estimation for completeness. Lastly, we provide the list of sequences we selected from KITTI raw data. We also provide a demo video showing the trajectories of several challenging sequences in the KITTI Odometry dataset. Please refer to the attached file supp_video.mp4.

A Ablation Study

In Table 6, we conduct an ablation study to validate the effectiveness of the incorporated cycle consistency constraint, pose features (from and ), depth features, and the memory buffer in our two-layer ConvLSTM module. As we can see, all the components help improve performance in the first stage of training.

Seq. 09 Seq. 10
Method RMSE (m) Rel. trans. (%) Rel. rot. (deg/m) RMSE (m) Rel. trans. (%) Rel. rot. (deg/m)

Two-layer ConvLSTM (w/o cycle consistency)
20.37 5.02 0.016 16.63 6.88 0.035

Two-layer ConvLSTM (w/o pose features)
14.26 5.64 0.018 14.47 7.52 0.030

Two-layer ConvLSTM (w/o depth features)
11.53 4.54 0.015 14.07 6.54 0.031

Two-layer ConvLSTM (w/o memory buffer)
12.54 5.12 0.014 13.96 7.20 0.026

Two-layer ConvLSTM
9.77 4.23 0.013 12.68 6.02 0.023

Table 6: Ablation study on different components of the second-layer ConvLSTM. The best performance is in bold and the second best is underlined.

In Table 7, we conduct an ablation study to show the performance of different input sequence lengths of the second stage of training. Our results show that the performance gradually improves as we increase the number of input frames during training. When the number of frames reaches the GPU memory limitations (e.g., our default setting, 97-frame), we achieve the best performance. Training the model on a GPU with larger memory could potentially improve the performance further.

Seq. 09 Seq. 10
Method RMSE (m) Rel. trans. (%) Rel. rot. (deg/m) RMSE (m) Rel. trans. (%) Rel. rot. (deg/m)

12.50 3.83 0.011 12.30 5.99 0.018

12.42 3.69 0.010 12.06 5.89 0.018

97-frame (default)
11.30 3.49 0.010 11.80 5.81 0.018

Table 7: Ablation study on different input sequence length of the second-stage of training. The best performance is in bold and the second best is underlined.

B Results on KITTI Odometry Test Set

In Table 8, we provide results on the KITTI Odometry leaderboard. It may be observed that the performance of our method is close to Table 5 in the main manuscript. This suggests that using ORB-SLAM2-S as pseudo ground truth is a reasonable choice for evaluation.

In addition to our method, we select two state-of-the-art self-supervised methods (CC [ranjan2019competitive] and MonoDepth2 [godard2019digging]) and submit the estimated results to the server as well. Our method compares favorably with these two self-supervised methods. Our method also outperforms the supervised method DeepVO [wang2017deepvo] by a large margin.

Method Rel. trans (%) Rel. rot. (deg/m)


ORB-SLAM2-S [murORB2] 1.70 0.0028

VISO2-M [geiger2011stereoscan] 11.94 0.0234

VISO2-M+GP [geiger2011stereoscan, song2014robust] 7.46 0.0245


DeepVO [wang2017deepvo] 24.55 0.0489


CC [ranjan2019competitive] 16.06 0.0320

MonoDepth2 [godard2019digging] 12.59 0.0312

Ours 7.40 0.0142

Table 8: Results on KITTI Odometry leaderboard. Note that we use the estimations from ORB-SLAM2-S [murORB2] to align scale globally for the self-supervised methods.

In Figure 8, we show qualitative results on the remaining 7 sequences (other than those shown in the main manuscript) from the KITTI Odometry test set. Our method aligns best with the reference ORB-SLAM2-S trajectories.

(a) Seq. 11 (b) Seq. 12

(a) Seq. 14 (b) Seq. 15

(a) Seq. 17 (b) Seq. 20

(b) Seq. 21
Figure 8: Visual comparison on the KITTI Odometry test set. We show the trajectories of ORB-SLAM2-S, CC [ranjan2019competitive], MonoDepth2 [godard2019digging] and our method. Our method aligns best with the reference ORB-SLAM2-S trajectories.

C Results on KITTI Odometry Training Set

Table 9: Pose evalution on training split of KITTI Odometry dataest [Geiger2012CVPR]. The results of ORB-SLAM2-M methods are the medians of 5 times. ‘-’ means the results are not available from that paper. For DeepV2D [teed2018deepv2d], SfMLearner [zhou2017unsupervised], GeoNet [yin2018geonet], CC [ranjan2019competitive], DeepMatchVO [shen2019icra], and MonoDepth2 [godard2019digging], we take the pre-trained models and run on the sequences to get the results. The best performance of each block is in bold, and the second best is underlined.

RMSE (m) Seq. 00 Seq. 01 Seq. 02 Seq. 03 Seq. 04 Seq. 05 Seq. 06 Seq. 07 Seq. 08


ORB-SLAM2-M (w/o LC) 54.94 568.63 58.55 1.41 2.41 29.32 51.87 16.83 36.90
ORB-SLAM2-M 9.02 529.28 17.96 2.07 1.56 5.20 14.07 2.88 37.83
Sup. DeepV2D [teed2018deepv2d] 101.08 484.87 121.02 3.62 8.86 35.23 113.31 12.86 55.69


SfMLearner [zhou2017unsupervised] 97.81 108.09 152.15 7.47 2.49 48.13 39.56 21.28 32.56

GeoNet [yin2018geonet] 148.81 168.90 293.46 17.58 7.26 86.94 17.69 13.88 138.00

CC [ranjan2019competitive] 68.31 50.41 59.19 8.89 2.25 22.49 13.02 11.31 49.29

DeepMatchVO [shen2019icra] 51.34 85.96 127.99 11.03 3.09 27.59 20.98 16.71 38.71

MonoDepth2 [godard2019digging] 82.05 30.81 86.64 2.40 2.00 21.49 5.16 10.42 51.83

Ours 13.13 41.38 12.61 1.61 2.22 8.24 9.16 9.92 13.98

Rel. trans (%) Seq. 00 Seq. 01 Seq. 02 Seq. 03 Seq. 04 Seq. 05 Seq. 06 Seq. 07 Seq. 08


ORB-SLAM2-M (w/o LC) 14.11 131.75 12.70 1.21 2.40 9.12 18.50 10.34 9.72
ORB-SLAM2-M 3.23 125.63 3.69 1.73 1.97 2.31 5.92 2.15 11.68


DeepVO [wang2017deepvo] - - - 8.49 7.19 2.62 5.42 3.91 -

ESP-VO [wang2018end] - - - 6.72 6.33 3.35 7.24 3.52 -

GFS-VO [xue2018guided] - - - 5.44 2.91 3.27 8.50 3.37 -

GFS-VO-RNN [xue2018guided] - - - 6.36 5.95 5.85 14.58 5.88 -

BeyondTracking [xue2019beyond] - - - 3.32 2.96 2.59 4.93 3.07 -

DeepV2D [teed2018deepv2d] 12.38 56.26 7.79 4.07 8.22 6.35 16.67 4.96 6.63


SfMLearner [zhou2017unsupervised] 19.27 21.71 18.99 9.73 3.17 10.02 11.00 11.68 8.67

GeoNet [yin2018geonet] 33.63 22.96 54.00 19.41 10.81 22.68 9.90 9.82 22.26

CC [ranjan2019competitive] 10.42 15.64 8.08 8.49 2.90 5.70 4.38 5.91 7.16

DeepMatchVO [shen2019icra] 5.31 29.57 15.94 9.67 4.15 7.42 5.69 7.62 9.43

MonoDepth2 [godard2019digging] 7.64 10.06 8.34 5.30 3.20 4.66 2.48 4.58 7.32

Ours 2.60 13.27 2.49 1.59 2.52 2.63 2.64 6.43 3.61

Rel. rot (deg/m) Seq. 00 Seq. 01 Seq. 02 Seq. 03 Seq. 04 Seq. 05 Seq. 06 Seq. 07 Seq. 08


ORB-SLAM2-M (w/o LC) 0.003 0.010 0.003 0.002 0.002 0.002 0.003 0.003 0.003
ORB-SLAM2-M 0.003 0.012 0.004 0.002 0.002 0.003 0.002 0.005 0.003


DeepVO [wang2017deepvo] - - - 0.069 0.070 0.036 0.058 0.046 -

ESP-VO [wang2018end] - - - 0.065 0.061 0.049 0.073 0.050 -

GFS-VO [xue2018guided] - - - 0.033 0.013 0.016 0.027 0.022 -

GFS-VO-RNN [xue2018guided] - - - 0.036 0.024 0.025 0.050 0.026 -

BeyondTracking [xue2019beyond] - - - 0.021 0.018 0.012 0.019 0.018 -

DeepV2D [teed2018deepv2d] 0.051 0.051 0.030 0.021 0.034 0.027 0.073 0.030 0.031


SfMLearner [zhou2017unsupervised] 0.057 0.026 0.033 0.035 0.033 0.036 0.038 0.059 0.026

GeoNet [yin2018geonet] 0.057 0.041 0.061 0.098 0.070 0.077 0.043 0.059 0.078

CC [ranjan2019competitive] 0.035 0.011 0.016 0.041 0.012 0.022 0.008 0.031 0.023

DeepMatchVO [shen2019icra] 0.013 0.013 0.024 0.046 0.020 0.017 0.022 0.037 0.012

MonoDepth2 [godard2019digging] 0.021 0.010 0.015 0.014 0.008 0.017 0.004 0.026 0.024

Ours 0.005 0.003 0.003 0.006 0.005 0.005 0.007 0.021 0.003

In Table 9, we compare the results on the training set of the KITTI Odometry dataset. Note that all supervised methods are trained on Sequence 00, 02, 08, 09 of the KITTI Odometry dataset [Geiger2012CVPR], except DeepV2D [teed2018deepv2d], which is trained on the Eigen split of KITTI raw dataset [Geiger2013IJRR]. Comparing to other self-supervised approaches, our method achieves smaller errors on the training set, indicating that the proposed system can effectively learn to model the camera pose trajectory during training time. Our method also compares favorably against the geometric-based method ORB-SLAM2.

In Figure 9, we show the qualitative results of our method on Seq. 00-08 on the KITTI Odometry dataset.

(a) Seq. 00 (b) Seq. 01 (b) Seq. 02

(a) Seq. 03 (b) Seq. 04 (b) Seq. 05

(a) Seq. 06 (b) Seq. 07 (b) Seq. 08
Figure 9: Visual comparison on the KITTI Odometry training set. We show the trajectories of ORB-SLAM2-M, CC [ranjan2019competitive], MonoDepth2 [godard2019digging] and our method.

D Snippet-level Pose Results and Depth Results

For completeness, we provide the pose estimation results when evaluating on 5-frame snippets in Table 10 and the single-view depth estimation results in Table 11. Note that the depth network is fixed during the second stage of training, so for the depth evaluation, we only train our model for the first stage on the Eigen split of the KITTI raw dataset. As we can see in Table 10, although CC [ranjan2019competitive] and DeepMatchVO [shen2019icra] achieve good results on the snippet-level, their results on the video-level are no longer the state-of-the-art. This indicates that evaluating camera pose estimation performance on the snippet-level could be inaccurate, and thus we need to evaluate the whole trajectory to reflect the holistic performance. In Table 11, we also observe that our method slightly outperforms the current self-supervised state-of-the-art MonoDepth2 [godard2019digging], which indicates that a better pose estimation module could lead to a better depth estimation performance.

Table 10: 5-frame snippet-level results on KITTI Odometry datest [Geiger2012CVPR].

Seq. 09 Seq. 10
ORB-SLAM (full) 0.0140.008 0.0120.011
SfMLearner [zhou2017unsupervised] 0.0210.017 0.0200.015
vid2depth [mahjourian2018unsupervised] 0.0130.010 0.0120.011
GeoNet [yin2018geonet] 0.0120.007 0.0120.009
DF-Net [zou2018dfnet] 0.0170.007 0.0150.009
CC [ranjan2019competitive] 0.0120.007 0.0120.008
DeepMatchVO [shen2019icra] 0.0090.005 0.0080.007
MonoDepth2 [godard2019digging] 0.0170.008 0.0150.010

0.0150.006 0.0150.009

Error metric Accuracy metric
Method Abs Rel Sq Rel RMSE log RMSE

SfMLearner [zhou2017unsupervised]
0.208 1.768 6.856 0.283 0.678 0.885 0.957
vid2depth [mahjourian2018unsupervised] 0.163 1.240 6.220 0.250 0.762 0.916 0.968
GeoNet [yin2018geonet] 0.155 1.296 5.857 0.233 0.793 0.931 0.973
DF-Net [zou2018dfnet] 0.150 1.124 5.507 0.223 0.806 0.933 0.973

CC [ranjan2019competitive]
0.140 1.070 5.326 0.217 0.826 0.941 0.975

DeepMatchVO [shen2019icra]
0.156 1.309 5.73 0.236 0.797 0.929 0.969

MonoDepth2 [godard2019digging]
0.115 0.903 4.863 0.193 0.877 0.959 0.981

SC-SfMLearner [bian2019unsupervised]
0.137 1.089 5.439 0.217 0.830 0.942 0.975

0.115 0.871 4.778 0.191 0.874 0.961 0.982

Table 11: Single-view depth estimation results on Eigen test split of KITTI raw dataset [Geiger2013IJRR].

E Additional KITTI Sequences

As mentioned in the main manuscript, we selected 18 sequences from KITTI raw data to further evaluate the methods, which have no overlaps with either KITTI Odometry split or Eigen split. We list the sequence names in Table 12.

Table 12: Names of 18 additional KITTI sequences.

Sequence names