In Simultaneous Localization and Mapping (SLAM) systems, it is important to estimate the position and orientation of the camera, a task known as visual odometry (VO). VO in SLAM systems has generally been studied through feature-based methods [19, 13, 4] and direct methods . These methods focus on camera trajectory optimization rather than on frame-to-frame (F2F) estimation.
Our approach aims to estimate F2F VO without optimization techniques such as loop closing  and bundle adjustment . Geiger et al.  suggested a feature-based F2F VO that generates a consistent 3D point cloud through feature matching between frames. Ciarfuglia et al.  presented the correlation between optical flow and camera ego-motion in a non-geometric method by adopting a support vector machine.
Recently, deep networks have shown remarkable improvements in computer vision [20, 16]. One of the first convolutional neural network (CNN) based methods was proposed by Konda et al. [15, 14], who showed the feasibility of learning F2F visual odometry. Moreover, Costante et al.  (P-CNN) and Muller et al.  (Flowdometry) predicted F2F camera ego-motion using pre-built optical flow images [1, 5]. They argued that training on optical flow images is more suitable than the RGB domain because the flow contains displacement information. However, these methods still suffer from the scale problem. SFMLearner  and UnDeepVO  predict depth information and camera ego-motion through unsupervised learning. Considering the potential of deep learning-based methods, we design an end-to-end deep convolutional neural network to estimate F2F monocular visual odometry.
The contributions of our paper are summarized as follows. First, our visual odometry network operates on disparity maps alone (right column of Fig. 1). Since a disparity map carries spatial cues in each frame, we can effectively address the scale problem and obtain better performance than existing optical flow-based networks such as P-CNN  and Flowdometry . In addition, our network is designed to extract both an attention map and a feature map through two parallel blocks; the attention block enables the network to focus on sensitive regions of the image. Second, a skip-ordering scheme is introduced to learn larger frame displacements by training on additional image pairs. These contributions not only improve performance but also make the camera ego-motion estimation robust across diverse driving environments.
2 Proposed method
In this section, we describe a deep learning network that predicts F2F camera ego-motion, as illustrated in Fig. 2. The proposed network focuses on improving odometry estimation accuracy based on disparity maps with skip-ordering (SO) strategy. In this study, monocular depth estimation based on  is employed to generate disparity maps. The network consists of four blocks: frame feature, attention, translation, and rotation. The details of the proposed network are explained in the following subsections.
2.1 Network architecture
The front part of the network is designed to extract frame feature and attention maps through two parallel blocks, as illustrated in Fig. 2. A frame feature map contains general information for camera ego-motion prediction. An attention map reflects the sensitivity of the camera ego-motion estimation across the frame feature map. Since the intensity values in the disparity map are highly dependent on the camera motion displacement, the attention map is learned by exploiting this property. For instance, camera ego-motion is difficult to predict from distant objects such as buildings, trees, and sky. After each extraction, the attention map and frame feature map are multiplied pixel-wise per frame, and then the two multiplied maps are concatenated.
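The pixel-wise fusion of the two parallel blocks can be sketched as follows, a minimal NumPy illustration in which the array shapes, the two-frame lists, and the function name are our own assumptions rather than the paper's implementation:

```python
import numpy as np

def fuse(frame_feats, attentions):
    """Weight each frame's feature map by its attention map (pixel-wise
    multiply, broadcast over channels), then concatenate the two weighted
    maps along the channel axis."""
    # frame_feats: two arrays of shape (C, H, W); attentions: two of (1, H, W)
    weighted = [f * a for f, a in zip(frame_feats, attentions)]
    return np.concatenate(weighted, axis=0)  # shape (2C, H, W)

# Toy example: uniform features, one half-on and one fully-off attention map.
feats = [np.ones((8, 4, 4)), np.ones((8, 4, 4))]
atts = [np.full((1, 4, 4), 0.5), np.zeros((1, 4, 4))]
fused = fuse(feats, atts)
```

The attention map thus acts as a spatial gate: regions with near-zero attention contribute nothing to the concatenated representation fed to the later blocks.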
Translation and rotation blocks use the concatenated map as their input. Each block learns translation and rotation information in parallel, which makes feature learning more robust. The rotation block has two more layers than the translation block because rotation has higher nonlinearity than translation . Moreover, we normalize the losses to balance rotation and translation errors, as described in detail in the next section.
2.2 Training and testing
The KITTI dataset  is used in the training (sequences 00 to 07) and testing (sequences 08 to 10) phases [3, 18]. The KITTI ground truth provides 12 values per image, of which 9 form the rotation matrix and 3 the position. These values describe each frame in the world coordinate system anchored at the first frame of the sequence. Therefore, the absolute poses must be converted into relative information between frames to train the network. Eq. 1 shows how the relative rotation matrix and translation are obtained from the rotation matrix $R$ and position $P$ of each frame:

$$R_{t,t+1} = R_t^{\top} R_{t+1}, \qquad T_{t,t+1} = R_t^{\top}\,(P_{t+1} - P_t), \tag{1}$$

where $R_{t,t+1}$ and $T_{t,t+1}$ denote the rotation matrix and translation vector of the $(t{+}1)$-th frame with respect to the $t$-th frame.
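Under this pose convention, the conversion from world-frame ground truth to frame-to-frame supervision can be sketched in NumPy (the function name and argument layout are hypothetical):

```python
import numpy as np

def relative_pose(R_t, P_t, R_t1, P_t1):
    """Relative rotation and translation of frame t+1 with respect to
    frame t, given world-frame rotation matrices R and positions P."""
    R_rel = R_t.T @ R_t1              # rotate world motion into frame t
    T_rel = R_t.T @ (P_t1 - P_t)      # displacement expressed in frame t
    return R_rel, T_rel
```

For an identity camera orientation this simply returns the positional difference, as expected.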
A rotation matrix is generally represented by Euler angles, a quaternion, or the Lie algebra $\mathfrak{so}(3)$. However, we empirically found that quaternion vectors limit learning of the rotation. When the amount of rotation is small, one of the four quaternion elements takes an abnormally higher value than the others, which can bias the result. For $\mathfrak{so}(3)$, the one-to-one mapping with the rotation matrix breaks down when no rotation occurs. Therefore, Euler angles are suitable for F2F camera ego-motion estimation in our network.
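For completeness, the extraction of ZYX Euler angles from a rotation matrix, valid away from gimbal lock, can be sketched as follows (a standard conversion, not code from the paper):

```python
import numpy as np

def rotmat_to_euler(R):
    """ZYX Euler angles (roll, pitch, yaw) from a 3x3 rotation matrix.
    Assumes pitch is away from +/-90 degrees (no gimbal lock)."""
    pitch = -np.arcsin(R[2, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([roll, pitch, yaw])
```

Because F2F rotations between consecutive driving frames are small, this conversion stays far from the singular configuration in practice.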
In the training phase, three consecutive disparity maps $(d_t, d_{t+1}, d_{t+2})$ are used as input, where $d$ represents the disparity map and $t$ denotes the time order. As described above, the skip-ordering scheme is introduced to learn larger frame displacements by training on additional image pairs. Thus, $(d_t, d_{t+2})$ serves as the skip-ordering (SO) pair, and all three pairs $(d_t, d_{t+1})$, $(d_{t+1}, d_{t+2})$, and $(d_t, d_{t+2})$ are used for training. Through the SO strategy, the network becomes robust against various motion changes.
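Generating the consecutive and skip-ordering pairs from a frame index list can be sketched as follows (the helper name and the `skip` parameter are our own illustration):

```python
def make_pairs(frame_indices, skip=2):
    """Consecutive training pairs plus skip-ordering (SO) pairs that
    jump `skip` frames, exposing the network to larger displacements."""
    consecutive = [(t, t + 1) for t in frame_indices[:-1]]
    skipped = [(t, t + skip) for t in frame_indices[:-skip]]
    return consecutive + skipped
```

For a triple of frames this yields exactly the three pairs listed above.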
Our network loss consists of a weighted sum of rotation and translation parts, which can be expressed as:

$$L = L_t + \beta L_r, \tag{2}$$

where $L_r$ and $L_t$ are the rotation and translation losses, respectively. The weighting factor $\beta$ must take a large value to balance the rotation and translation losses, because the former has higher nonlinearity than the latter . We set $\beta$ experimentally. Each loss is the mean squared error described in Eq. 3:

$$L(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert^2, \tag{3}$$

in which $N$ is the number of training samples.
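The weighted loss of Eqs. 2 and 3 can be sketched in NumPy; since the paper sets the weighting factor experimentally, `beta` is left as a required argument rather than given a default:

```python
import numpy as np

def vo_loss(t_pred, t_gt, r_pred, r_gt, beta):
    """Eq. 2: L = L_t + beta * L_r, where each term is the mean squared
    error of Eq. 3 over the N samples (rows) of the batch."""
    l_t = np.mean(np.sum((t_pred - t_gt) ** 2, axis=1))  # translation MSE
    l_r = np.mean(np.sum((r_pred - r_gt) ** 2, axis=1))  # rotation MSE
    return l_t + beta * l_r
```

A large `beta` scales the small, highly nonlinear rotation residuals so they contribute on par with translation during training.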
The network uses two consecutive disparity maps as input and yields the relative translation and Euler angles between the frames. To obtain the rotation matrix and position vector from the 1st to the $t$-th frame, the relative poses are accumulated as:

$$R_t = R_{t-1}\, R_{t-1,t}, \qquad P_t = P_{t-1} + R_{t-1}\, T_{t-1,t}. \tag{4}$$
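Chaining the frame-to-frame estimates back into a full trajectory can be sketched as follows (a NumPy illustration; the function name is ours, not from the paper):

```python
import numpy as np

def accumulate(rel_poses):
    """Chain frame-to-frame (R_rel, T_rel) estimates into world-frame
    rotation and positions, starting from the identity pose."""
    R = np.eye(3)
    P = np.zeros(3)
    trajectory = [P.copy()]
    for R_rel, T_rel in rel_poses:
        P = P + R @ T_rel   # move along the current heading
        R = R @ R_rel       # then update the heading
        trajectory.append(P.copy())
    return R, np.array(trajectory)
```

This is the inverse of the relative-pose conversion: composing the per-frame estimates recovers the camera path that is plotted against the ground truth.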
All methods were evaluated using the KITTI benchmark metrics. VISO2-M and SVR-VO were run using the code provided in [9, 3], respectively. The results of P-CNN and Flowdometry are those reported in [3, 18]. Since VISO2-M does not solve the scale problem, we recovered the scale using the range of positions in the ground truth. Table 1 shows the performance of the compared algorithms.
To evaluate the structure of our network, RGB-VO and D-VO were additionally tested. Each network was trained end-to-end on monocular RGB images and disparity maps, respectively, using SO without attention layers. As presented in Table 2, RGB-VO achieves a better average translation error than VISO2-M and SVR-VO, but it is worse than the other deep network approaches. D-VO, which simply replaces the input domain with disparity maps, has a better average translation error than RGB-VO: the error drops from 13.83% to 12.44%, indicating that the disparity map is effective for translation accuracy. However, since no significant improvement in rotation is found, merely using the disparity map is not sufficient.
To evaluate the value of the attention block, AD-VO with SO was also tested. Compared with D-VO, AD-VO with SO reduces the translation error from 12.44% to 8.59% and the rotation error from 0.0474 to 0.0334 deg/m. These results demonstrate that the attention block is effective. Accordingly, AD-VO with SO achieves the best average translation error and the best rotation accuracy in sequence 10. A further experiment on D-VO was conducted without SO to isolate the effect of SO. Comparing D-VO without and with SO, the translation error is reduced from 16.02% to 12.44% and the rotation error from 0.0562 to 0.0474 deg/m. These comparisons show that both the attention layer and SO improve performance.
The attention layer and SO not only improve performance but also stabilize the results. We analyzed the standard deviation of each algorithm and network across the test sequences. AD-VO with SO shows the smallest standard deviations among the compared algorithms, 0.868 in translation and 0.001 in rotation. Unlike the other networks, our method has small variation in translation error between test sequences and yields stable results regardless of the environment, since the attention block determines which regions should be considered. Fig. 3 shows the reconstructed trajectories on the test sequences.
[Tables 1 and 2: per-sequence translation error [%] and rotation error [deg/m] for the compared methods, with and without skip-ordering; data rows not recovered from the source.]
In this paper, we presented a novel system to obtain F2F camera ego-motion from monocular images. We studied four different algorithms and compared their performance using the evaluation metrics provided by the KITTI benchmark. Our system is designed not only to address the scale problem but also to achieve stable results in various environments. The attention block guides the network to focus on influential regions of the image. Moreover, we suggested a skip-ordering scheme to train on larger displacements of camera ego-motion. To the best of our knowledge, AD-VO is the first F2F camera ego-motion network using only disparity maps. In the future, we aim to enhance the rotation performance by combining optical flow and disparity maps.
-  (2004) High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision, pp. 25–36. Cited by: §1.
-  (2014) Evaluation of non-geometric methods for visual odometry. Robotics and Autonomous Systems 62 (12), pp. 1717–1730. Cited by: §1, §3.
-  (2016) Exploring representation learning with cnns for frame-to-frame ego-motion estimation. IEEE robotics and automation letters 1 (1), pp. 18–25. Cited by: §1, §1, §2.2, §3, §3.
-  (2007) MonoSLAM: real-time single camera slam. IEEE transactions on pattern analysis and machine intelligence 29 (6), pp. 1052–1067. Cited by: §1.
-  (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766. Cited by: §1.
-  (2014) LSD-slam: large-scale direct monocular slam. In European Conference on Computer Vision, pp. 834–849. Cited by: §1.
-  (2006) Bundle adjustment rules. Photogrammetric computer vision 2 (2006). Cited by: §1.
-  (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3354–3361. Cited by: Figure 1, §2.2.2, §2.2.
-  (2011) Stereoscan: dense 3d reconstruction in real-time. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pp. 963–968. Cited by: §1, §3, §3.
-  (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Vol. 2, pp. 7. Cited by: Figure 1, §2.
-  (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 2938–2946. Cited by: §2.1, §2.2.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.2.1.
-  (2007) Parallel tracking and mapping for small ar workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225–234. Cited by: §1.
-  (2013) Unsupervised learning of depth and motion. arXiv preprint arXiv:1312.3429. Cited by: §1.
-  (2015) Learning visual odometry with a convolutional network.. In VISAPP (1), pp. 486–490. Cited by: §1.
-  (2017) Brightness-based convolutional neural network for thermal image enhancement. IEEE Access 5, pp. 26867–26879. Cited by: §1.
-  (2017) UnDeepVO: monocular visual odometry through unsupervised deep learning. arXiv preprint arXiv:1709.06841. Cited by: §1.
-  (2017) Flowdometry: an optical flow and deep learning based approach to visual odometry. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 624–631. Cited by: §1, §1, §2.2, §3, §3.
-  (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §1, §1.
-  (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
-  (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, Vol. 2, pp. 7. Cited by: §1.