I. Introduction
Learning-based Visual Odometry (L-VO) has seen increasing attention from the robotics community in the last few years because of its desirable properties of robustness to image noise and independence from camera calibration [1], mostly thanks to the representational power of Convolutional Neural Networks (CNNs), which can complement current geometric solutions [2]. While current results are very promising, making these solutions easily applicable to different environments still presents challenges. One of them is that most of the approaches explored so far have not shown strong domain independence and suffer from high dataset bias, i.e. performance degrades considerably when testing on sequences whose motion dynamics and scene depth differ significantly from the training data [3]. In the context of L-VO this bias manifests as different Optical Flow (OF) field distributions in training and test data, due to differences in scene depth and in the general motion of the camera sensor.

One possible explanation for the poor performance of learned methods in unseen contexts is that most current learning architectures try to extract both visual features and a motion estimate as a single training problem, coupling appearance and scene depth with the actual camera motion information contained in the OF input. Some works have addressed the problem with an unsupervised, or semi-supervised, approach, trying to learn the motion representation and scene depth directly from some kind of frame-to-frame photometric error [4] [5] [6]. While very promising, these approaches are devised mainly for scene depth estimation and still fall short in terms of general performance on Ego-Motion estimation.
At the same time, previous research has shown that OF fields have a bilinear dependence on motion and inverse scene depth [7]. We suggest that this is the main reason for the low generalization shown by learned algorithms so far. Past research has also shown that the high dimensional OF field, when scene depth can be considered locally constant, can be projected onto a much lower dimensional linear space [8] [9]. However, when these conditions do not hold, the OF field subspace still exists but is highly non-linear.
In this work we propose to exploit this knowledge, estimating the latent OF representation using an AutoEncoder (AE) neural network as a non-linear subspace approximator. AE networks are able to extract latent variable representations of high dimensional inputs. Since our aim is to make the Ego-Motion estimation more robust to OF fields that show high variability in their distribution, we do not simply use this subspace to directly produce the motion prediction. Instead, we propose a novel architecture that jointly trains the subspace estimation and the Ego-Motion estimation, so that the two network tasks mutually reinforce each other and at the same time better generalize the OF field representation. The conceptual architecture is shown in Figure 1. To demonstrate the improved performance and reduced dataset bias with respect to high dynamical variation of the OF field, we test the proposed approach on a challenging scenario: we subsample the datasets, producing sequences that simulate high speed variations, then we train and test on sequences that differ both in appearance and in subsampling rate.
II. Related Works
II-A Ego-Motion Estimation
II-A1 Geometric Visual Odometry
Geometric Visual Odometry (G-VO) has a long history of solutions. While the first approaches were based on sparse feature tracking, mainly for computational reasons, nowadays direct or semi-direct approaches are preferred. These approaches use the photometric error as an optimization objective. Research on this topic is very active. Engel et al. developed one of the most successful direct approaches, LSD-SLAM, both for monocular and stereoscopic cameras [10], [11]. Forster et al. developed the Semi-Direct VO (SVO) [12] and its more recent update [13], which is a direct method but tracks only a subset of features on the image and runs at a very high frame rate compared to full direct methods. Even if direct methods have gained most of the attention in the last few years, the ORB-SLAM algorithm by Mur-Artal et al. [14] reverted to sparse feature tracking and reached impressive robustness and accuracy, comparable with direct approaches.
II-A2 Learned Visual Odometry
Learned approaches go back to the early explorations by Roberts et al. [8, 15], Guizilini et al. [16, 17], and Ciarfuglia et al. [18]. As in the geometric case, the initial proposals focused on sparse OF features and, faithful to the “no free lunch” theorem, explored the performance of different learning algorithms, such as SVMs, Gaussian Processes and others. While these early approaches already showed some of the strengths of L-VO, it was only more recently, when Costante et al. [1] introduced the use of CNNs for feature extraction from dense optical flow, that learned methods started to attract more interest. Since then, a number of methods have been proposed. Muller and Savakis [19] added the FlowNet architecture to the estimation network, producing one of the first end-to-end approaches. Clark et al. [20] proposed an end-to-end approach that merges camera inputs with IMU readings using an LSTM network. Through this sensor fusion, the resulting algorithm is able to give good results, but it requires sensors other than a single monocular camera. The use of LSTMs is further explored by Wang et al. in [21], this time without any sensor fusion. The resulting architecture again gives good performance on KITTI sequences, but no experiments are shown on environments with an appearance different from the training sequences. On a different track is the work of Pillai et al. [22], who, like [17], looked at the task as a generative probabilistic problem. Pillai proposes an architecture based on an MDN network and a Variational AutoEncoder (VAE) to estimate the motion density given the OF inputs as a GMM. While frame-to-frame (F2F) performance is on a par with other approaches, they also introduce a loss term on the whole trajectory that mimics the bundle optimization often used in G-VO. The results of the complete system are thus very good. However, they use sparse KLT optical flow as input, since the joint density estimation for dense OF would become computationally intractable, meaning that they could be more prone to OF noise than dense methods.

Most of the described approaches claim independence from camera parameters. While this is true, we note that this is more an intrinsic feature of the learning approach than the merit of a particular architecture. The learned model implicitly learns the camera parameters as well, but it then fails on images collected with other camera optics. This parameter generalization issue remains an open problem for L-VO.
II-B Semi-supervised Approaches
Since dataset bias and domain independence are critical challenges for L-VO, it is not surprising that a number of unsupervised and semi-supervised methods have recently been proposed. However, all these architectures have been proposed as a way of solving the more general problem of joint scene depth and motion estimation, and motion estimation is considered more as a way of improving depth estimation. Konda and Memisevic [23] used a stereo pair to learn VO, but the architecture was conceived only for stereo cameras. Ummenhofer and Zhou [4] propose the DeMoN architecture, a solution for F2F Structure from Motion (SfM) that trains a network end-to-end on image pairs, leveraging motion parallax. Zhou et al. [5] proposed an end-to-end unsupervised system based on a loss that minimizes the image warping error from one frame to the next. A similar approach is used by Vijayanarasimhan et al. [6] with their SfM-Net. All these approaches are devised mainly for depth estimation, and the authors give little or no attention to performance on VO tasks. Nonetheless, the semi-supervised approach is one of the most relevant future directions for achieving domain independence in L-VO, and we expect that it will be integrated into current research on this topic.
II-C Optical Flow Latent Space Estimation
The semi-supervised approaches described in Section II-B make evident an intrinsic aspect of monocular camera motion estimation: even when the scene is static, the OF field depends both on camera motion and on scene depth. This relationship between inverse depth and motion is bilinear and well known [24], and is at the root of scale ambiguity in monocular VO. However, locally and under certain hypotheses of depth regularity, it is possible to express the OF field in terms of a linear subspace of OF basis vectors. Roberts et al. [15] used Probabilistic PCA to learn a lower dimensional dense OF subspace without supervision, then used it to compute dense OF templates starting from sparse optical flow, and from these computed Ego-Motion. Herdtweck and Curio extended the result and used Expert Systems to estimate motion [25]. More recently, a similar approach to OF field computation was proposed by Wulff and Black [9], who complemented the PCA with an MRF, while Ochs et al. [26] did the same by including prior knowledge with a MAP approach. These methods suggest that the OF field, which is an intrinsically high dimensional space generated by a non-linear process, lies on an ideal lower dimensional manifold that can sometimes be locally linearly approximated. Modern deep networks are able to find latent representations of high dimensional image inputs, and in this work we use this intuition to explore the estimation of this OF latent space.

III. Contribution
Inspired by the early work of Roberts on OF subspaces [7], and by recent advances in deep latent space learning [27], we propose a network architecture that jointly estimates a low dimensional representation of the dense OF field using an AutoEncoder (AE) and, at the same time, computes the camera Ego-Motion estimate with a standard convolutional network, as in [1]. The two networks share the feature representation in the decoder part of the AE, and this constrains the training process to learn features that are compatible with a general latent subspace. We show through experiments that this joint training improves Ego-Motion estimation performance and generalization. In particular, we show that learning the latent space and concatenating it to the feature vector makes the resulting estimation considerably more robust to domain change, both in appearance and in OF field dynamic range and distribution.
We train our network both in an end-to-end version, using deep OF estimation, and with a standard OF field input, in order to explore their relative advantages and weaknesses. We show that, while the end-to-end approach is more general, precomputed OF still has some performance advantages.
In summary, our contributions are:

- A novel end-to-end architecture that jointly learns the OF latent space and the camera Ego-Motion estimate. We call this architecture Latent Space Visual Odometry (LS-VO).
- An experimental demonstration of the strength of the proposed architecture, for appearance changes, blur, and large camera speed changes.
- A comparison of the effects of geometrically computed OF fields with end-to-end architectures in all cases.
- A demonstration of the adaptability of the proposed approach to other end-to-end architectures, without increasing their chance of overfitting due to an increase in parameters.
IV. Learning Optical Flow Subspaces
Given an optical flow vector $\mathbf{x}$ from a given OF field, [7] [9] approximate it with a linear relationship:

$$\mathbf{x} \approx W \mathbf{z} \qquad (1)$$

where the columns of $W$ are the basis vectors that form the OF linear subspace and $\mathbf{z}$ is a vector of latent variables. This approximation is valid only if the scene depth shows some regularity, and it is applicable only to local patches of the image. The real subspace is non-linear in nature and, in this work, we express it as a generic function that we learn from data by using the architecture described in the following.
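As an illustration of the linear approximation in Eq. (1) (not the method used in this work, which learns a non-linear subspace), the basis $W$ and latents $\mathbf{z}$ can be recovered from flattened OF fields with plain PCA; all data and dimensions below are synthetic stand-ins:

```python
import numpy as np

# Synthetic "dataset" of flattened OF fields that truly lies on a
# 3-dimensional linear subspace plus a small amount of noise.
rng = np.random.default_rng(0)
n_pixels2, n_samples, k = 200, 500, 3          # 100 pixels x 2 flow components
basis_true = rng.standard_normal((n_pixels2, k))
latents = rng.standard_normal((k, n_samples))
flows = basis_true @ latents + 0.01 * rng.standard_normal((n_pixels2, n_samples))

# PCA: the top-k left singular vectors give the basis W of Eq. (1).
mean = flows.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(flows - mean, full_matrices=False)
W = U[:, :k]                                   # OF subspace basis

# Encode one flow field to latent variables z and decode it back.
x = flows[:, [0]]
z = W.T @ (x - mean)                           # latent representation
x_hat = W @ z + mean                           # linear reconstruction
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

The reconstruction is nearly exact because the toy data really is (close to) linear; on real OF fields this only holds locally, which is the motivation for the non-linear AE subspace used here.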
IV-A Latent Space Estimation with AutoEncoder Networks
Let $\mathbf{m}$ be the camera motion vector and $\mathcal{F}$ the input OF field, computed with some dense method, where $\mathbf{x}(u, v)$ is the 2-dimensional vector of the field at image coordinates $(u, v)$. Both can be viewed as random variables with their own distributions. In particular, we make the hypothesis that the input images lie on a lower dimensional manifold, as in [28], and thus the OF field also lies on a lower dimensional space $\mathcal{M}$ with a distance function $d_{\mathcal{M}}$, where $\dim(\mathcal{M})$ is much smaller than the dimensionality of the raw field. The true manifold is very difficult to compute, so we look for an estimate $\hat{\mathcal{M}}$ using the model extracted by an encoding neural network.

Let $\mathbf{z}$ be a vector of latent random variables that encodes the variability of the OF fields lying on this approximate space. The decoder part of the AE can be seen as a function
$$\hat{\mathcal{F}} = D(\mathbf{z}; \theta_D) \qquad (2)$$
where $\theta_D$ is the set of learnable parameters of the network (with up-convolutional layers) that is able to generate a dense optical flow field from a vector of latent variables $\mathbf{z}$. Note that the AE works similarly to a non-linear version of PCA [27]. We define the set generated by the decoder as our approximation $\hat{\mathcal{M}}$ of the OF field manifold, and we use the logarithmic Euclidean distance (described in Section IV-B as a loss function) as an approximation of $d_{\mathcal{M}}$. Within this framework, the estimation of the latent space is carried out by the AE network, whose Encoder part can be defined as the function $\mathbf{z} = E(\mathcal{F}; \theta_E)$.

While in [22] the AE is used to estimate motion, and the latent variables are the camera translation and rotation, here we follow a different strategy. We compute the latent space for a twofold purpose: we use the latent variables as an input feature to the motion estimation network, and we learn this latent space together with the estimator, thus forcing the estimator to learn features compatible with the encoder representation. Together, these two aspects make the representation more robust to domain changes.
IV-B Network Architecture
The LS-VO network architecture in its end-to-end form is shown in Figure 2. It is composed of two main branches: one is the AE network and the other is the convolutional network that computes the regression of the motion vector. The OF extraction section is FlowNet [29], for which we use the pretrained weights. We ran tests fine-tuning this part of the network on the KITTI [30] and Malaga [31] datasets, but the result was degraded performance due to overfitting.
The next layers are convolutions that extract features from the computed OF field. After the first convolutional layers (conv1, conv2 and conv3), the network splits into the AE network and the estimation network. The two branches share part of the feature extraction convolutions, so the entire network is constrained to learn a general representation that is good both for estimation and for latent variable extraction. The Encoder is completed by another convolutional layer that brings the input to the desired latent representation, whose output is fed both into the Decoder and concatenated with the features extracted before. The resulting feature vector, composed of latent variables and convolutional features, is fed into a fully connected network that performs motion estimation. The details are summarized in Table I.
The AE is trained with a pixel-wise Root Mean Squared Log Error (RMSLE) loss:
$$\mathcal{L}_{AE} = \frac{1}{N} \sum_{i=1}^{N} \left\| \log(\hat{\mathbf{x}}_i + \mathbf{1}) - \log(\mathbf{x}_i + \mathbf{1}) \right\|_2^2 \qquad (3)$$
where $\hat{\mathbf{x}}_i$ is the predicted OF vector for the $i$-th pixel, $\mathbf{x}_i$ is the corresponding input to the network, and the logarithm is intended as an element-wise operation. This loss penalizes the ratio, rather than the absolute difference, between the estimated OF and the real one, so that the flow vectors of distant points are taken into account and not smoothed away.
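A minimal sketch of such a ratio-penalizing log loss (the exact offset and the handling of flow-vector signs used in the paper are assumptions here, with the log taken on magnitudes to keep it well-defined):

```python
import numpy as np

def rmsle_loss(flow_pred, flow_true, eps=1.0):
    """Pixel-wise squared log-error between predicted and input OF.

    Sign handling is an assumption: the element-wise log is taken on
    magnitudes, so the loss penalizes the *ratio* between estimated
    and true flow rather than their absolute difference.
    """
    log_pred = np.log(np.abs(flow_pred) + eps)
    log_true = np.log(np.abs(flow_true) + eps)
    return np.mean((log_pred - log_true) ** 2)

# Ratio-penalizing behaviour: a 1 px error on a small flow vector
# costs more than the same 1 px error on a large one.
small = rmsle_loss(np.array([2.0]), np.array([1.0]))
large = rmsle_loss(np.array([101.0]), np.array([100.0]))
```

This is exactly the property claimed in the text: small flows (distant points) are not smoothed away, because an absolute error on them is relatively large.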
For the motion estimation branch we use the loss introduced by Kendall et al. in [32]:
$$\mathcal{L}_{VO} = \left\| \hat{\mathbf{t}} - \mathbf{t} \right\|_2 + \beta \left\| \hat{\mathbf{r}} - \mathbf{r} \right\|_2 \qquad (4)$$
where $\mathbf{t}$ is the camera translation vector in meters, $\mathbf{r}$ is the rotation vector in Euler notation in radians, and $\beta$ is a scale factor that balances the angular and translational errors. $\beta$ has been cross-validated on the trajectory reconstruction error, so that the propagation of the frame-to-frame error to the whole trajectory is taken into account. The use of a Euclidean loss with an Euler angle representation works well in the case of autonomous cars, since the yaw angle is the only one with significant changes. For more general cases, it is better to use a quaternion distance metric [33].
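A direct sketch of the loss in Eq. (4); the value of beta below is an arbitrary example, not the cross-validated one:

```python
import numpy as np

def pose_loss(t_pred, t_true, r_pred, r_true, beta=10.0):
    """Euclidean pose loss in the style of Kendall et al. [32].

    t_* are translations in meters, r_* rotations as Euler angles in
    radians; beta (an arbitrary example value here) balances the
    translational and angular error terms.
    """
    t_err = np.linalg.norm(t_pred - t_true)
    r_err = np.linalg.norm(r_pred - r_true)
    return t_err + beta * r_err

t_hat, t_gt = np.array([1.0, 0.0, 0.1]), np.array([1.1, 0.0, 0.0])
r_hat, r_gt = np.array([0.0, 0.01, 0.0]), np.array([0.0, 0.0, 0.0])
loss = pose_loss(t_hat, t_gt, r_hat, r_gt)
```

Because rotation errors in radians are numerically tiny compared to translations in meters, a beta well above 1 is what keeps the angular term from being ignored during training.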
Layer name | Kernel size | Stride | Output size
Input
LS-VO
Shared Features Layer: conv1
conv2
conv3
AutoEncoder: conv4
upconv1
crop
upconv2
Estimator: maxpool
concat
dense1
dense2
dense3
ST-VO
Feature Extraction: st_conv1
st_maxpool1
st_conv2
st_maxpool2
Estimation: concat
st_dense1
st_dense2
IV-C OF Field Distribution
As mentioned in Section IV-A, the OF field has a probability distribution that lies on a manifold of lower dimensionality than the number of pixels in the image. We can argue that the actual density depends on the motion of the camera as much as on the scene depth of the collected images. In this work, we test the generalization properties of the network with respect to both aspects:

- For appearance, we use the standard approach of testing on completely different sequences than the ones used in training.
- For motion dynamics, we subsample the sequences, thus multiplying the OF dynamics by the same factor.
- To further test OF distribution robustness, we also test the architecture on downsampled blurred images, as in [1].
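The sequence subsampling used to stress the motion dynamics can be sketched as follows; frame and position values are illustrative:

```python
# Temporal subsampling: keeping every k-th frame of a sequence multiplies
# the frame-to-frame displacement (and hence the OF magnitudes) by
# roughly k, simulating a k-times faster camera.
def subsample(frames, k):
    """Return the virtual sequence obtained by keeping every k-th frame."""
    return frames[::k]

frames = list(range(10))           # stand-in for a 10-frame sequence
seq_x2 = subsample(frames, 2)      # simulates ~2x camera speed
seq_x3 = subsample(frames, 3)      # simulates ~3x camera speed

# Displacement check: constant 0.5 m of travel per original frame.
positions = [0.5 * i for i in range(10)]
base_step = positions[1] - positions[0]
sub = subsample(positions, 2)
step_x2 = sub[1] - sub[0]          # twice the original inter-frame motion
```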
Examples of the resulting OF fields are shown in Figure 3, while an example of a blurred OF field is shown in Figure 4. In both images there are evident differences in both hue and saturation, meaning that both the modulus and the phase of the OF vectors change.





V. Experimental Results
V-A Data and Experiment Setup
We perform experiments on two different datasets, the KITTI Visual Odometry benchmark [30] and the Malaga 2013 dataset [31]. Both datasets are taken from cars travelling in city suburbs and countryside; however, the illumination conditions and camera setups are different. For the KITTI dataset we used the sequences to for training and the , and for test, as is common practice. The images are all around , and we resize them to . The frame rate is Hz. For the Malaga dataset we use the sequences , and as test set, and the , , , , , and as training set. In this case the images are , and we resize them to . The frame rate is Hz. For the Malaga dataset there is no high precision GPS ground truth, so we use the ORB-SLAM2 stereo VO [14] as ground truth, since its performance, with bundle adjustment and loop closing, is much better than that of any monocular method.
The networks are implemented in Keras/TensorFlow and trained using an Nvidia Titan Xp. Training of the ST-VO variant takes , while LS-VO takes . The ST-VO memory occupancy is on average MB, while LS-VO requires MB. At test time, computing FlowNet and BF features takes on average ms and ms per sample, while the prediction requires, on average, ms for both ST-VO and LS-VO. The total time, when considering FlowNet features, amounts to ms for ST-VO and ms for LS-VO. Hence, we can observe that the increased complexity does not much affect computational performance at test time.

For all the experiments described in the following section, we tested the LS-VO architecture and the ST-VO baseline. Furthermore, in all KITTI experiments we tested with both FlowNet and BF features. While the contribution of this work relates mainly to showing the increased robustness of the proposed method with respect to learned architectures, we also sampled the performance of SotA geometrical methods, namely VISO2-M [35] and ORB-SLAM2-M [14], in order to have a general baseline.
V-B Experiments
As mentioned in Section IV-C, on both datasets we perform three kinds of experiments, of increasing difficulty. We observe that the original sequences show some variability in speed, since in both datasets the car travels at speeds of up to Km/h, but the distribution of the OF field is still limited. This implies that the possible combinations of linear and rotational speeds are limited. We extend the variability of the OF field distribution by performing some data augmentation. Firstly, we subsample the sequences by 2 and 3 times, to generate virtual sequences whose OF vectors have very different intensities. In Figure 3, an example of the different dynamics is shown. In both the KITTI and Malaga datasets we indicate the standard sequences by the subscript, and the sequences subsampled by and times by and , respectively. In addition to this, we generate blurred versions of the test sequences, with Gaussian blur, as in [1]. We then perform three kinds of experiment and compare the results. The first is a standard training and test on sequences. This kind of test explores the generalization properties with respect to appearance changes alone. In the second kind of experiment we train all the networks on the sequences and and test on . This helps us understand how the networks perform when both appearance and OF dynamics change. The third experiment is training on and sequences, and testing on the blurred versions of the test set (Figure 4).
The proposed architecture is end-to-end, since it computes the OF field through a FlowNet network. However, as a baseline, we also tested the performance of all the architectures on a standard geometrical OF input, computed as in [34] and indicated as BF in the following. In addition, we train the BF version on the RGB representation of the OF, since in our experiments it performs slightly better than the floating point one.
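The colour coding commonly used to render an OF field as an RGB image can be sketched as below; this standard encoding (flow direction mapped to hue, magnitude to saturation) is an assumption, not necessarily the exact representation used in our experiments:

```python
import colorsys
import math

def flow_to_rgb(u, v, max_mag):
    """One common OF colour coding (an assumption here): flow direction
    -> hue, normalized magnitude -> saturation, with full value, so that
    changes in modulus and phase show up as saturation and hue changes."""
    mag = math.hypot(u, v)
    hue = (math.atan2(v, u) / (2 * math.pi)) % 1.0
    sat = min(mag / max_mag, 1.0)
    return colorsys.hsv_to_rgb(hue, sat, 1.0)

# A zero-flow pixel maps to white; a strong rightward flow saturates to red.
white = flow_to_rgb(0.0, 0.0, max_mag=10.0)
strong = flow_to_rgb(10.0, 0.0, max_mag=10.0)
```

Under this encoding, the hue and saturation differences mentioned for Figures 3 and 4 correspond directly to changes in the phase and modulus of the OF vectors.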
VISO2-M [35] | ORB-SLAM2-M [14] | ST-VO (Flow) | ST-VO (BF) | LS-VO (Flow) | LS-VO (BF)
Transl. Rot. | Transl. Rot. | Transl. Rot. | Transl. Rot. | Transl. Rot. | Transl. Rot.
KITTI |
KITTI | fail fail
KITTI + blur | fail fail
VISO2-M [35] | ORB-SLAM2-M [14] | ST-VO (Flow) | LS-VO (Flow)
Transl. Rot. | Transl. Rot. | Transl. Rot. | Transl. Rot.
Malaga |
Malaga | fail fail
Malaga + blur | fail fail | fail fail
V-C Discussion
The experiments described in Section V-B on both datasets have been evaluated with the KITTI devkit [30], and the output plots are reported in Figures 5, 6, 7, 8 and 9. In all Figures except 7, the upper subplots, (a) and (b), represent the translational and rotational errors averaged over subsequences of length m up to m. The lower plots represent the same errors, but averaged over vehicle speed (Km/h). The horizontal axis limits of the lower plots in the Figures relative to the downsampled experiments are different, since the subsampling is seen by the evaluation software as an increase in vehicle speed. Tables II and III report the total average translational and rotational errors for all the experiments.
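A simplified sketch of how such length-based average errors can be computed from cumulative travelled distances (an approximation of the devkit metric, not the devkit code; all trajectory values are toy data):

```python
def avg_translation_error(gt_dists, est_dists, seg_len):
    """Average relative translation error over all subsequences whose
    ground-truth length is at least seg_len metres, given cumulative
    travelled distances per frame (a simplification of the devkit,
    which evaluates full 6-DOF poses over several segment lengths)."""
    errs = []
    for start in range(len(gt_dists)):
        # First frame at least seg_len metres ahead of 'start', if any.
        end = next((j for j in range(start + 1, len(gt_dists))
                    if gt_dists[j] - gt_dists[start] >= seg_len), None)
        if end is None:
            break  # no long-enough subsequence starts here or later
        gt_seg = gt_dists[end] - gt_dists[start]
        est_seg = est_dists[end] - est_dists[start]
        errs.append(abs(est_seg - gt_seg) / gt_seg)
    return sum(errs) / len(errs)

# Toy trajectory: 1 m of ground-truth travel per frame; the estimate
# has a uniform 5% scale error, so every 10 m segment is 5% off.
gt = [float(i) for i in range(21)]
est = [1.05 * i for i in range(21)]
err = avg_translation_error(gt, est, seg_len=10.0)
```

This illustrates why the metric averages over many overlapping subsequences: a systematic scale error shows up uniformly at every segment length.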
Figure 5 summarises the performance of all methods on KITTI without frame downsampling. From Figures 5(a) and 5(b) we observe that the BF-fed architectures outperform the FlowNet-fed networks by a good margin. This is expected, since the BF OF fields have been tuned on the dataset to be usable, while FlowNet has not been fine-tuned on KITTI sequences. In addition, the LS-VO networks almost always perform better than, or on a par with, the corresponding ST-VO networks. When we consider Figures 5(c) and 5(d), we observe that the increase in performance from ST-VO to LS-VO appears to be slight, except in the rotational errors for the FlowNet architecture. However, the difference between the length errors and the speed errors is coherent if we consider that the errors are averaged. The speed values that are less represented in the dataset are probably the ones that are more difficult to estimate, but at the same time their effect on the overall trajectory estimation is consequently less important.
The geometrical methods do not work on frame pairs only, but perform local bundle adjustment and possibly scale estimation. Even if the comparison with learned methods is not completely fair, it is informative nonetheless. In particular, we observe (see Figure 5) that the geometrical methods achieve top performance on angular estimation, because they work on full-resolution images and because there is no scale error on the angular rate. On the contrary, on average they perform considerably worse than learned methods on translational errors. This is also expected, since monocular geometrical methods cannot estimate scale, while learned methods are able to infer it from appearance. Similar results are obtained for the Malaga dataset. The complete set of experiments is available online [36].
When we consider the second type of experiment, we expect the general performance of all architectures and methods to decrease, since the task is more challenging. At the same time, we are interested in probing the robustness and generalization properties of the LS-VO architectures over the ST-VO ones. Figure 6 shows the KITTI results. From Figures 6(a) and 6(b) we notice that, while all the average errors for each length increase with respect to the previous experiments, they increase much more for the two ST-VO variants. If we consider the errors depicted in Figures 6(c) and 6(d), we observe that the LS-VO networks perform better than the ST-VO ones, except at speeds around 60 Km/h, where they are on par. This is understandable, since the networks have been trained on and , which correspond to very low and very high speeds, so the OF fields in between are the least represented in the training set. However, the most important consideration here is that the LS-VO architectures show more robustness to domain shifts. The plots of the performance on Malaga can be found online [36], and the same considerations apply.
The last experiment is on the downsampled and blurred images. On these datasets both VISO2-M and ORB-SLAM2-M fail to produce any trajectory, due to the lack of keypoints, while learned methods always give reasonable results. The results are shown in Figures 8 and 9 for the KITTI and the Malaga dataset, respectively. In both the KITTI and Malaga experiments we observe a huge improvement in performance of LS-VO over ST-VO. Due to the smaller sample variety in Malaga with respect to KITTI, we observe overfitting of the more complex network (LS-VO) on the less represented linear speeds (above Km/h).
These experiments demonstrate that the LS-VO architecture is particularly apt at helping end-to-end networks extract a robust OF representation. This is an important result, since this architecture can easily be included in other end-to-end approaches, increasing estimation performance by a good margin without significantly increasing the number of parameters for the estimation task, thus making them more robust to overfitting, as mentioned in Section IV-B.
VI. Conclusions
This work presented LS-VO, a novel network architecture for estimating monocular camera Ego-Motion. The architecture is composed of two branches that jointly learn a latent space representation of the input OF field and the camera motion estimate. The joint training allows the learning of OF features that take into account the underlying structure of a lower dimensional OF manifold. The proposed architecture has been tested on the KITTI and Malaga datasets, with challenging alterations, in order to test the robustness to domain variability in both appearance and OF dynamic range. Compared to the other data-driven architectures, the LS-VO network outperformed the single-branch network on most benchmarks, and performed at the same level on the others. Compared to geometrical methods, the learned methods show outstanding robustness to non-ideal conditions and reasonable performance, given that they work only on frame-to-frame estimation and on smaller input images. The new architecture is lean, easy to train, and shows good generalization performance. The results provided here are promising and encourage further exploration of OF field latent space learning for the purpose of estimating camera Ego-Motion. All the code, datasets and trained models are made available online [36].
References
 [1] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring Representation Learning with CNNs for Frame-to-Frame Ego-Motion Estimation,” Robotics and Automation Letters, IEEE, vol. 1, no. 1, pp. 18–25, 2016.
 [2] R. Gomez-Ojeda, Z. Zhang, J. Gonzalez-Jimenez, and D. Scaramuzza, “Learning-based image enhancement for visual odometry in challenging HDR environments,” arXiv preprint arXiv:1707.01274, 2017.
 [3] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars, “A deeper look at dataset bias,” in Pattern Recognition: 37th German Conference, GCPR 2015, Aachen, Germany, October 7-10, 2015, Proceedings. Springer International Publishing, 2015, pp. 504–516.

 [4] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “DeMoN: Depth and Motion Network for Learning Monocular Stereo,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [5] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [6] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, “SfM-Net: Learning of structure and motion from video,” arXiv preprint arXiv:1704.07804, 2017.
 [7] R. J. W. Roberts, “Optical flow templates for mobile robot environment understanding,” Ph.D. dissertation, GIT, 2014.
 [8] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. R. Balch, “Memory-based learning for visual odometry,” in 2008 IEEE International Conference on Robotics and Automation (ICRA), 2008.
 [9] J. Wulff and M. J. Black, “Efficient sparse-to-dense optical flow estimation using a learned basis and layers,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [10] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 834–849.
 [11] J. Engel, J. Stückler, and D. Cremers, “Large-scale direct SLAM with stereo cameras,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1935–1942.
 [12] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 15–22.
 [13] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “SVO: Semi-direct visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2017.
 [14] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 [15] R. Roberts, C. Potthast, and F. Dellaert, “Learning general optical flow subspaces for ego-motion estimation and detection of motion anomalies,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

 [17] ——, “Semiparametric models for visual odometry,” in 2012 IEEE International Conference on Robotics and Automation (ICRA), 2012.
 [18] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.

 [19] P. Muller and A. Savakis, “Flowdometry: An optical flow and deep learning based approach to visual odometry,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
 [20] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, “VINet: Visual-inertial odometry as a sequence-to-sequence learning problem,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 3995–4001.
 [21] S. Wang, R. Clark, H. Wen, and N. Trigoni, “DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 2043–2050.
 [22] S. Pillai and J. J. Leonard, “Towards visual ego-motion learning in robots,” arXiv preprint arXiv:1705.10279, 2017.
 [23] K. R. Konda and R. Memisevic, “Learning visual odometry with a convolutional network.” in International Conference on Computer Vision Theory and Applications (VISAPP), 2015, pp. 486–490.
 [24] D. J. Heeger and A. D. Jepson, “Subspace methods for recovering rigid motion I: Algorithm and implementation,” International Journal of Computer Vision, vol. 7, no. 2, pp. 95–117, 1992.
 [25] C. Herdtweck and C. Curio, “Experts of probabilistic flow subspaces for robust monocular odometry in urban areas,” in 2012 IEEE Intelligent Vehicles Symposium, June 2012, pp. 661–667.

 [26] M. Ochs, H. Bradler, and R. Mester, “Learning rank reduced interpolation with Principal Component Analysis,” in Intelligent Vehicles Symposium (IV), 2017 IEEE, June 2017, pp. 1126–1133.
 [27] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
 [28] J.Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in European Conference on Computer Vision (ECCV), 2016.
 [29] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 2758–2766.
 [30] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, 2013.
 [31] J.-L. Blanco, F.-A. Moreno, and J. Gonzalez-Jimenez, “The Málaga urban dataset: High-rate stereo and lidars in a realistic urban scenario,” International Journal of Robotics Research, vol. 33, no. 2, pp. 207–214, 2014.
 [32] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
 [33] J. J. Kuffner, “Effective sampling and distance metrics for 3D rigid body path planning,” in 2004 IEEE International Conference on Robotics and Automation (ICRA), vol. 4, April 2004, pp. 3993–3998.
 [34] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in European Conference on Computer Vision (ECCV). Springer, 2004, pp. 25–36.
 [35] A. Geiger, J. Ziegler, and C. Stiller, “StereoScan: Dense 3D reconstruction in real-time,” in 2011 IEEE Intelligent Vehicles Symposium (IV), June 2011, pp. 963–968.
 [36] ISARLab @ UniPG website. [Online]. Available: http://isar.unipg.it/