AD-VO: Scale-Resilient Visual Odometry Using Attentive Disparity Map

Joosung Lee et al.
Yonsei University

Visual odometry is an essential component of the localization module in SLAM systems. However, previous methods require tuning the system to adapt to environmental changes. In this paper, we propose a learning-based approach for frame-to-frame monocular visual odometry estimation. The proposed network is trained only on disparity maps, which both accommodates environmental changes and addresses the scale problem. Furthermore, an attention block and a skip-ordering scheme are introduced to achieve robust performance in various driving environments. Our network is compared with conventional methods that use common domains such as color or optical flow. Experimental results confirm that the proposed network outperforms the other approaches, with higher and more stable accuracy.




1 Introduction

In a Simultaneous Localization and Mapping (SLAM) system, it is important to acquire the position and orientation of the camera, a task called visual odometry (VO). Generally, VO in SLAM systems has been studied through feature-based methods [19, 13, 4] and direct methods [6]. These methods focus on camera trajectory optimization rather than on frame-to-frame (F2F) estimation.

Our approach aims to estimate F2F VO without optimization methods such as loop closing [19] and bundle adjustment [7]. Geiger et al. [9] suggested a feature-based F2F VO that generates a consistent 3D point cloud through feature matching between frames. Ciarfuglia et al. [2] presented the correlation between optical flow and camera ego-motion in a non-geometric method by adopting a support vector machine.

Figure 1: RGB images [8] and disparity maps [10] (images are in temporal order from top to bottom).

Recently, deep networks have shown remarkable improvements in computer vision [20, 16]. One of the first convolutional neural network (CNN) based methods was proposed by Konda et al. [15, 14], who showed the feasibility of learning F2F visual odometry. Moreover, Costante et al. [3] (P-CNN) and Muller et al. [18] (Flowdometry) predicted F2F camera ego-motion using pre-built optical flow images [1, 5]. They argued that optical flow images are more adequate for training than the RGB domain because they contain displacement information. However, these methods still suffer from the scale problem. SFMLearner [21] and UnDeepVO [17] predict depth information and camera ego-motion through unsupervised learning. Considering the potential of deep learning-based methods, we design an end-to-end deep convolutional neural network to estimate F2F monocular visual odometry.

The contributions of our paper are summarized as follows. First, our visual odometry network uses only the disparity map (right column of Fig. 1). Since the disparity map carries spatial cues in each frame, we can effectively address the scale problem and obtain better performance than existing optical flow-based networks such as P-CNN [3] and Flowdometry [18]. In addition, our network is designed to extract both an attention map and a feature map through two parallel blocks; the attention block enables the network to focus on sensitive regions of the image. Second, a skip-ordering scheme is introduced to learn larger frame displacements by training on additional image pairs. These contributions not only improve performance but also make the camera ego-motion estimation robust across diverse driving environments.

Figure 2: Frame-to-frame odometry model based on disparity maps. Our network has four blocks: ‘Frame feature’, ‘Attention’, ‘Translation’ and ‘Rotation’.

2 Proposed method

In this section, we describe a deep learning network that predicts F2F camera ego-motion, as illustrated in Fig. 2. The proposed network focuses on improving odometry estimation accuracy based on disparity maps with skip-ordering (SO) strategy. In this study, monocular depth estimation based on [10] is employed to generate disparity maps. The network consists of four blocks: frame feature, attention, translation, and rotation. The details of the proposed network are explained in the following subsections.

2.1 Network architecture

The front part of the network is designed to extract frame feature and attention maps through two parallel blocks, as illustrated in Fig. 2. A frame feature map contains general information for camera ego-motion prediction. An attention map reflects how sensitive the camera ego-motion estimate is to each region of the frame feature map. Since the intensity values in a disparity map depend strongly on the displacement of the camera, the attention map is learned by exploiting this property. For instance, camera ego-motion is difficult to predict from distant objects in the scene, such as buildings, trees, and sky. After each extraction, the attention map and frame feature map are multiplied pixel-wise for each frame, and the two resulting maps are concatenated.
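As a concrete illustration, the gate-and-concatenate step can be sketched in NumPy. Shapes, function names, and the random inputs here are ours, not from the paper:

```python
import numpy as np

def gate_and_concat(feat_a, attn_a, feat_b, attn_b):
    """Pixel-wise gate each frame's feature map by its attention map
    (values in [0, 1]), then concatenate along the channel axis.
    Shapes are illustrative: (H, W, C) features, (H, W, 1) attention."""
    gated_a = feat_a * attn_a  # single-channel map broadcasts over channels
    gated_b = feat_b * attn_b
    return np.concatenate([gated_a, gated_b], axis=-1)

# Toy example: two frames, 4x4 maps with 8 channels each.
h, w, c = 4, 4, 8
feat_t  = np.random.rand(h, w, c)
feat_t1 = np.random.rand(h, w, c)
# Sigmoid of random logits stands in for the attention block's output.
attn_t  = 1.0 / (1.0 + np.exp(-np.random.randn(h, w, 1)))
attn_t1 = 1.0 / (1.0 + np.exp(-np.random.randn(h, w, 1)))

fused = gate_and_concat(feat_t, attn_t, feat_t1, attn_t1)
print(fused.shape)  # (4, 4, 16)
```

In the real network the attention maps come from the sigmoid-terminated attention block, so their values already lie in [0, 1]; broadcasting the single-channel map over all feature channels implements the pixel-wise multiplication.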

The translation and rotation blocks use the concatenated map as their input. The two blocks learn rotation and translation in parallel to make the feature learning robust. The rotation block has two more layers than the translation block because rotation has higher nonlinearity than translation [11]. Moreover, we normalize the losses to balance the rotation and translation errors, as described in detail in the next section.

All layers use ReLU as the activation function except the attention layers. Each attention layer is followed by a sigmoid activation function so that the attention map takes values between 0 and 1.

2.2 Training and testing

The KITTI dataset [8] is used in the training (sequences 00 to 07) and test (08 to 10) phases [3, 18]. The KITTI ground truth provides 12 values per image, of which 9 form the rotation matrix and 3 the position. These describe each frame with respect to the first frame of the sequence in the world coordinate system. Therefore, it is necessary to convert this information into relative motion between frames to train the network. Eq. 1 explains how the relative rotation matrix R and translation T are obtained from the absolute rotations R_i and positions P_i:

R_{i,j} = R_i^T R_j,    T_{i,j} = R_i^T (P_j - P_i)    (1)

where R_{i,j} and T_{i,j} are the rotation and translation of the j-th frame with respect to the i-th frame.
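The conversion of world-frame ground-truth poses into frame-to-frame motion can be sketched in NumPy as follows; the function name is ours, and using the transpose as the inverse relies only on R being a rotation matrix:

```python
import numpy as np

def relative_pose(R_i, P_i, R_j, P_j):
    """Rotation and translation of frame j with respect to frame i,
    given world-frame poses (R, P) from the ground-truth file."""
    R_ij = R_i.T @ R_j          # R_i^{-1} = R_i^T for rotation matrices
    T_ij = R_i.T @ (P_j - P_i)  # express the displacement in frame i
    return R_ij, T_ij

# Sanity check: identical poses yield identity rotation, zero translation.
R = np.eye(3)
P = np.array([1.0, 2.0, 3.0])
R_rel, T_rel = relative_pose(R, P, R, P)
print(np.allclose(R_rel, np.eye(3)), np.allclose(T_rel, 0))  # True True
```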

A rotation matrix is generally represented by Euler angles, a quaternion, or the Lie group SO(3). However, we empirically found that quaternion vectors have limitations in learning the rotation. When the amount of rotation is small, one of the four quaternion elements takes an abnormally larger value than the others, which can bias the result. For SO(3), the one-to-one mapping between a rotation matrix and its Lie-algebra representation breaks down when no rotation occurs. Therefore, Euler angles are suitable for F2F camera ego-motion estimation in our network.
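For reference, recovering Euler angles from a rotation matrix can be sketched as below. This is the standard textbook ZYX conversion, not code from the paper, and the gimbal-lock case (|pitch| = 90°) is ignored for brevity:

```python
import numpy as np

def euler_zyx(R):
    """Recover (yaw, pitch, roll) Euler angles from a 3x3 rotation
    matrix, assuming the convention R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arcsin(-R[2, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

# A pure 90-degree yaw rotation should be recovered exactly.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
yaw, pitch, roll = euler_zyx(Rz)
```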

2.2.1 Training

In the training phase, three consecutive disparity maps are used as an input (d_t, d_{t+1}, d_{t+2}), where d represents the disparity map and t denotes the time order. As described above, the skip-ordering scheme is introduced to learn larger frame displacements by training on additional image pairs. Thus, (d_t, d_{t+2}) serves as the skip-ordering (SO) pair, and the pairs (d_t, d_{t+1}), (d_{t+1}, d_{t+2}), and (d_t, d_{t+2}) are all used for training. Through the SO strategy, the network becomes robust against various motion changes.
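The pairing rule can be sketched as a small helper (a hypothetical function of ours illustrating the scheme, not code from the paper):

```python
def training_pairs(t):
    """Frame-index pairs formed from three consecutive frames
    (t, t+1, t+2): the two consecutive pairs plus the skip-ordering pair."""
    consecutive = [(t, t + 1), (t + 1, t + 2)]
    skip = [(t, t + 2)]  # skip-ordering (SO): larger displacement
    return consecutive + skip

print(training_pairs(0))  # [(0, 1), (1, 2), (0, 2)]
```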

Our network loss is a weighted sum of rotation and translation parts, which can be expressed as:

Loss = L_t + β L_r    (2)

where L_r and L_t are the rotation and translation losses, respectively. The weighting factor β is known to require a large value to balance the rotation and translation losses, because the former has higher nonlinearity than the latter [11]. We set β experimentally. The mean squared error L is described in Eq. 3:

L = (1/N) Σ_{i=1}^{N} ||y_i - ŷ_i||^2    (3)

in which N is the number of training samples, y_i is the ground truth, and ŷ_i is the network prediction.
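A minimal NumPy sketch of this loss, using a mean-squared error over pose vectors; the beta value used here is illustrative only, since the paper's exact setting is not given in this text:

```python
import numpy as np

def mse(pred, gt):
    """Mean squared error over N sample vectors."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return np.mean(np.sum((pred - gt) ** 2, axis=-1))

def pose_loss(t_pred, t_gt, e_pred, e_gt, beta=100.0):
    """Translation MSE plus beta-weighted Euler-angle MSE;
    beta=100 is an assumed default for illustration."""
    return mse(t_pred, t_gt) + beta * mse(e_pred, e_gt)

# Toy check: unit translation error, 0.1 rad rotation error, beta=10,
# giving 1.0 + 10 * 0.01 = 1.1 (up to float rounding).
loss = pose_loss([[1.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]],
                 [[0.1, 0.0, 0.0]], [[0.0, 0.0, 0.0]], beta=10.0)
```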

The loss is minimized with the Adam optimizer [12]. The learning rate starts from 1e-5 and is reduced by a factor of 2 every 5 epochs, up to 30 epochs.
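Reading this schedule as a 1e-5 starting rate that is halved every 5 epochs (our interpretation of the text), it can be sketched as:

```python
def learning_rate(epoch, base_lr=1e-5):
    """Halve the base learning rate every 5 epochs, capped at epoch 30."""
    return base_lr * 0.5 ** (min(epoch, 30) // 5)

for e in (0, 5, 10, 29):
    print(e, learning_rate(e))
```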

2.2.2 Testing

The network uses two consecutive disparity maps as an input and yields the relative translation and Euler angles between the frames.

To obtain the rotation matrix and position vector from the first to the (t+1)-th frame, the following equations are used:

R_{1,t+1} = R_{1,t} R_{t,t+1},    P_{t+1} = P_t + R_{1,t} T_{t,t+1}    (4)

The rotation matrix of the first frame (R_1) is set to the 3x3 identity matrix, and the first frame is located at the origin of the world coordinate system, (0, 0, 0). The obtained results are evaluated using the evaluation code provided with the KITTI benchmark [8].


3 Experiments

The proposed method has been compared with the handcrafted method VISO2-M [9] and three learning-based methods: SVR-VO [2], P-CNN [3], and Flowdometry [18].

All methods were evaluated using the KITTI benchmark metrics. VISO2-M and SVR-VO were run using the code provided in [9, 3], respectively. The results of P-CNN and Flowdometry are those reported in [3, 18]. Since VISO2-M does not solve the scale problem, we recovered the scale using the range of positions in the ground truth. Table 1 shows the performance of the compared algorithms.

To evaluate the structure of our network, RGB-VO and D-VO were additionally tested. Each network was trained end-to-end on monocular RGB images or disparity maps with SO but without attention layers. As presented in Table 2, RGB-VO shows a better average translation error than VISO2-M and SVR-VO, but it is worse than the other deep network approaches. D-VO, which simply replaces the input domain with disparity maps, achieves a better average translation error than RGB-VO: the translation error is reduced from 13.83% to 12.44%, indicating that disparity maps are effective for translation accuracy. However, since no significant improvement in rotation is found, merely using disparity maps is not sufficient.

To evaluate the value of the attention block, AD-VO with SO was additionally evaluated. Compared with D-VO with SO, the translation error of AD-VO with SO was reduced from 12.44% to 8.59% and the rotation error from 0.0474 deg/m to 0.0334 deg/m. These results show that the attention block is effective. Accordingly, AD-VO with SO achieves the best average translation error, as well as the best rotation accuracy on sequence 10. A further experiment on D-VO without SO was conducted to determine the effect of SO. Comparing D-VO without and with SO, the translation error was reduced from 16.02% to 12.44% and the rotation error from 0.0562 deg/m to 0.0474 deg/m. These comparisons show that both the attention layer and SO improve performance.

The attention layer and SO not only improve performance but also stabilize the results. We analyzed the standard deviation of each algorithm and network across the test sequences. AD-VO with SO has the smallest standard deviations among the algorithms, 0.868 in translation and 0.001 in rotation. Unlike the other networks, our method shows small variation in translation error between test sequences and stable results regardless of the environment. An analysis of the attention block shows that it determines which regions should be considered. Fig. 3 shows the reconstructed trajectories on the test sequences.

(a) Sequence 08
(b) Sequence 09
(c) Sequence 10
Figure 3: Reconstructed trajectories of the test sequences (08, 09, 10). Our algorithm is AD-VO with skip-ordering.
       VISO2-M                 SVR-VO                  Flowdometry             P-CNN
Seq    Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]
08     26.213     0.0247       15.42      0.0363       9.98       0.0544       7.60       0.0187
09     4.09       0.0124       10.50      0.0445       12.64      0.0804       6.75       0.0252
10     60.02      0.0669       21.97      0.0545       11.65      0.0728       21.23      0.0405
avg    24.91      0.0266       15.02      0.0401       10.77      0.0623       8.96       0.0235
std    15.305     0.015        3.165      0.0061       1.136      0.0113       4.689      0.0067
Table 1: Performance of the comparison algorithms: translation and rotation errors on the test sequences using the KITTI devkit.
       RGB-VO w/ SO            D-VO w/o SO             D-VO w/ SO              AD-VO w/ SO
Seq    Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]
08     14.39      0.0452       16.41      0.0547       13.36      0.0478       9.21       0.0341
09     9.59       0.0397       13.70      0.0417       8.79       0.0365       7.43       0.0320
10     19.18      0.0438       18.48      0.0954       14.36      0.0671       7.26       0.0317
avg    13.83      0.0438       16.02      0.0562       12.44      0.0474       8.59       0.0334
std    2.718      0.0023       1.407      0.0147       1.995      0.0083       0.868      0.001
Table 2: Performance of our algorithms: translation and rotation errors on the test sequences using the KITTI devkit. AD-VO works reliably in a variety of driving environments. ‘D’ denotes that the model uses disparity maps and ‘A’ that it adopts the attention block.

4 Conclusion

In this paper, we presented a novel system to obtain F2F camera ego-motion from monocular images. We studied four different algorithms and compared their performance using the evaluation metrics provided by the KITTI benchmark. Our system is designed not only to resolve the scale problem but also to achieve stable results in various environments. The attention block lets the network focus on influential regions of the image. Moreover, we suggested a skip-ordering scheme to train on larger displacements of the camera ego-motion. To the best of our knowledge, AD-VO is the first F2F camera ego-motion network using only disparity maps. In the future, we aim to enhance the rotation performance by combining optical flow and disparity maps.


  • [1] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert (2004) High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision, pp. 25–36. Cited by: §1.
  • [2] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci (2014) Evaluation of non-geometric methods for visual odometry. Robotics and Autonomous Systems 62 (12), pp. 1717–1730. Cited by: §1, §3.
  • [3] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia (2016) Exploring representation learning with cnns for frame-to-frame ego-motion estimation. IEEE robotics and automation letters 1 (1), pp. 18–25. Cited by: §1, §1, §2.2, §3, §3.
  • [4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse (2007) MonoSLAM: real-time single camera slam. IEEE transactions on pattern analysis and machine intelligence 29 (6), pp. 1052–1067. Cited by: §1.
  • [5] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766. Cited by: §1.
  • [6] J. Engel, T. Schöps, and D. Cremers (2014) LSD-slam: large-scale direct monocular slam. In European Conference on Computer Vision, pp. 834–849. Cited by: §1.
  • [7] C. Engels, H. Stewénius, and D. Nistér (2006) Bundle adjustment rules. Photogrammetric computer vision 2 (2006). Cited by: §1.
  • [8] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3354–3361. Cited by: Figure 1, §2.2, §2.2.2.
  • [9] A. Geiger, J. Ziegler, and C. Stiller (2011) Stereoscan: dense 3d reconstruction in real-time. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pp. 963–968. Cited by: §1, §3, §3.
  • [10] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Vol. 2, pp. 7. Cited by: Figure 1, §2.
  • [11] A. Kendall, M. Grimes, and R. Cipolla (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 2938–2946. Cited by: §2.1, §2.2.1.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.2.1.
  • [13] G. Klein and D. Murray (2007) Parallel tracking and mapping for small ar workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225–234. Cited by: §1.
  • [14] K. Konda and R. Memisevic (2013) Unsupervised learning of depth and motion. arXiv preprint arXiv:1312.3429. Cited by: §1.
  • [15] K. R. Konda and R. Memisevic (2015) Learning visual odometry with a convolutional network.. In VISAPP (1), pp. 486–490. Cited by: §1.
  • [16] K. Lee, J. Lee, J. Lee, S. Hwang, and S. Lee (2017) Brightness-based convolutional neural network for thermal image enhancement. IEEE Access 5, pp. 26867–26879. Cited by: §1.
  • [17] R. Li, S. Wang, Z. Long, and D. Gu (2017) UnDeepVO: monocular visual odometry through unsupervised deep learning. arXiv preprint arXiv:1709.06841. Cited by: §1.
  • [18] P. Muller and A. Savakis (2017) Flowdometry: an optical flow and deep learning based approach to visual odometry. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 624–631. Cited by: §1, §1, §2.2, §3, §3.
  • [19] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §1, §1.
  • [20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [21] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, Vol. 2, pp. 7. Cited by: §1.