I Introduction
Visual odometry (VO) [1], commonly referred to as ego-motion estimation, is a fundamental capability that enables robots to reliably navigate their immediate environment. With the widespread adoption of cameras in various robotics applications, there has been an evolution in visual odometry algorithms, with a wide set of variants including monocular VO [1, 2], stereo VO [3, 4], and even non-overlapping n-camera VO [5, 6]. Furthermore, each of these algorithms has been custom-tailored for specific camera optics (pinhole, fisheye, catadioptric) and the range of motions observed by these cameras mounted on various platforms [7].
With increasing levels of model specification for each domain, we expect these algorithms to perform differently from one another while offering less generality across various optics and camera configurations. Moreover, the strong dependence of these algorithms on their model specification limits the ability to actively monitor and optimize their intrinsic and extrinsic model parameters in an online fashion. In addition, autonomous systems today use several sensors with varied intrinsic and extrinsic properties that make system characterization tedious. Furthermore, these algorithms and their parameters are fine-tuned on specific datasets while providing few guarantees on their generalization performance on new data.
To this end, we propose a fully trainable architecture for visual odometry estimation in generic cameras with varied camera optics (pinhole, fisheye, and catadioptric lenses). In this work, we take a geometric approach by posing the regression task of ego-motion as a density estimation problem. By tracking salient features in the image induced by the ego-motion (via Kanade-Lucas-Tomasi (KLT) feature tracking), we learn the mapping from these tracked flow features to a probability mass over the range of likely ego-motion. We make the following contributions:

A fully trainable ego-motion estimator: We introduce a fully-differentiable density estimation model for visual ego-motion estimation that robustly captures the inherent ambiguity and uncertainty in relative camera pose estimation (see Figure 1).

Ego-motion for generic camera optics: Without imposing any constraints on the type of camera optics, we propose an approach that is able to recover ego-motion for a variety of camera models including pinhole, fisheye, and catadioptric lenses.

Bootstrapped ego-motion training and refinement: We propose a bootstrapping mechanism for autonomous systems whereby a robot self-supervises the ego-motion regression task. By fusing information from other sensor sources, including GPS and INS (Inertial Navigation Systems), these indirectly inferred trajectory estimates serve as ground-truth target poses/outputs for the aforementioned regression task. Any newly introduced camera sensor can now leverage this information to learn to provide visual ego-motion estimates without relying on an externally provided ground-truth source.

Introspective reasoning via scene-flow predictions: We develop a generative model for optical flow prediction that can be utilized to perform outlier rejection and scene-flow reasoning.
Through experiments, we provide a thorough analysis of ego-motion recovery from a variety of camera models including pinhole, fisheye, and catadioptric cameras. We expect our general-purpose approach to be robust and easily tunable for accuracy during operation. We illustrate the robustness and generality of our approach and provide our findings in Section IV.
II Related Work
Recovering relative camera poses from a set of images is a well-studied problem in the context of Structure-from-Motion (SfM) [8, 9]. SfM is usually treated as a non-linear optimization problem, where the camera poses (extrinsics), camera model parameters (intrinsics), and the 3D scene structure are jointly optimized via non-linear least-squares [8].
Unconstrained VO: Visual odometry, unlike incremental Structure-from-Motion, focuses only on determining the 3D camera pose from sequential images or video imagery observed by a monocular camera. Most of the early work in VO was done primarily to determine vehicle ego-motion [10, 11, 12] in 6-DOF, especially for the Mars planetary rover. Over the years several variants of the VO algorithm were proposed, leading up to the work of Nister et al. [1], where the authors proposed the first real-time and scalable VO algorithm. In their work, they developed a 5-point minimal solver coupled with a RANSAC-based outlier rejection scheme [13] that is still extensively used today. Other researchers [14] have extended this work to various camera types including catadioptric and fisheye lenses.
Constrained VO: While the classical VO objective does not impose any constraints regarding the underlying motion manifold or camera model, it contains several failure modes that make it especially difficult to ensure robust operation under arbitrary scene and lighting conditions. As a result, imposing ego-motion constraints has been shown to considerably improve accuracy, robustness, and run-time performance. One particularly popular strategy for VO estimation in vehicles is to enforce planar homographies during matching of features on the ground plane [15, 16], thereby robustly recovering both relative orientation and absolute scale. For example, Scaramuzza et al. [7, 17] introduced a novel 1-point solver by imposing the vehicle's non-holonomic motion constraints, thereby speeding up VO estimation to run at up to 400 Hz.
Data-driven VO: While several model-based methods have been developed specifically for the VO problem, a few have attempted to solve it with a data-driven approach. Typical approaches have leveraged dimensionality-reduction techniques by learning a reduced-dimensional subspace of the optical flow vectors induced by the ego-motion [18]. In [19], Ciarfuglia et al. employ Support Vector Regression (SVR) to recover vehicle ego-motion (3-DOF). The authors further build upon their previous result by swapping out the SVR module with an end-to-end trainable convolutional neural network [20], showing improvements in overall performance on the KITTI odometry benchmark [21]. Recently, Clarke et al. [22] introduced a visual-inertial odometry solution that takes advantage of a neural-network architecture to learn a mapping from raw inertial measurements and sequential imagery to 6-DOF pose estimates. By posing visual-inertial odometry (VIO) as a sequence-to-sequence learning problem, they developed a neural network architecture that combines convolutional neural networks with Long Short-Term Memory (LSTM) units to fuse the independent sensor measurements into a reliable 6-DOF pose estimate for ego-motion. Our work closely relates to these recently developed data-driven approaches. We provide a qualitative comparison of how our approach is positioned within the visual ego-motion estimation landscape in Table I.

III Ego-motion regression
As with most ego-motion estimation solutions, it is imperative to determine the minimal parameterization of the underlying motion manifold. Several variants of ego-motion estimation have been proposed for restricted scene structures or motion manifolds [7, 15, 16, 17]. However, since we consider cameras with varied optics, we are interested in determining the full range of ego-motion (possibly restricted) that induces the pixel-level optical flow. This allows the freedom to model the various unconstrained and partially constrained motions that typically affect the overall robustness of existing ego-motion algorithms. While model-based approaches have shown tremendous progress in accuracy, robustness, and run-time performance, a few recent data-driven approaches have been shown to produce equally compelling results [20, 22, 24]. An adaptive and trainable solution for relative pose estimation or ego-motion can be especially advantageous for several reasons: (i) a general-purpose, end-to-end trainable model architecture that applies to a variety of camera optics including pinhole, fisheye, and catadioptric lenses; (ii) simultaneous and continuous optimization over both ego-motion estimation and camera parameters (intrinsics and extrinsics that are implicitly modeled); and (iii) amenability to joint reasoning over resource-aware computation and accuracy within the same architecture. We envision that such an approach is especially beneficial in the context of bootstrapped (or weakly-supervised) learning in robots, where the supervision for ego-motion estimation with a particular camera can be obtained from the fusion of measurements from other robot sensors (GPS, wheel encoders, etc.).
Our approach is motivated by previous minimally parameterized models [7, 17] that are able to recover ego-motion from a single tracked feature. We find this representation especially appealing due to the simplicity and flexibility of pixel-level computation. Despite the reduced complexity of the input space for the mapping problem, recovering the full 6-DOF ego-motion is ill-posed due to the inherently under-constrained system. However, it has been previously shown that under non-holonomic vehicle motion, camera ego-motion may be fully recoverable up to a sufficient degree of accuracy using a single point [7, 17].
We now focus on the specifics of the ego-motion regression objective. Due to the under-constrained nature of the prescribed regression problem, pose estimation is modeled as a density estimation problem over the range of possible ego-motions, conditioned on the input flow features.^1 It is important to note that the output of the proposed model is a density estimate for every feature tracked between subsequent frames.

^1 Although the parameterization is maintained in SE(3), the nature of most autonomous-car datasets involves a lower-dimensional (SE(2)) motion manifold.
III-A Density estimation for ego-motion
In typical associative mapping problems, the joint probability density p(x, y) is decomposed into the product of two terms: (i) p(y | x): the conditional density of the target pose y conditioned on the input feature correspondence x obtained from sparse optical flow (KLT) [25]; and (ii) p(x): the unconditional density of the input data x. While we are particularly interested in the first term, which predicts the range of possible values for y given new values of x, the density p(x) provides a measure of how well the prediction is captured by the trained model.
The critical component in estimating the ego-motion belief is the ability to accurately predict the conditional probability distribution p(y | x) of the pose estimates induced by the given input feature and its flow. Due to its powerful and rich modeling capabilities, we use a Mixture Density Network (MDN) [26] to parametrize the conditional density estimate. MDNs are a class of end-to-end trainable (fully-differentiable) density estimation techniques that leverage conventional neural networks to regress the parameters of a generative model such as a finite Gaussian Mixture Model (GMM). The powerful representational capacity of neural networks, coupled with the rich probabilistic modeling that GMMs admit, allows us to model the multi-valued or multi-modal beliefs that typically arise in inverse problems such as visual ego-motion.
For each of the input flow features x extracted via KLT, the conditional probability density of the target pose data y (Eqn 1) is represented as a convex combination of K Gaussian components,

    p(y | x) = \sum_{k=1}^{K} \pi_k(x) \, \mathcal{N}(y \mid \mu_k(x), \sigma_k^2(x))    (1)

where \pi_k(x) is the mixing coefficient for the k-th component as specified in a typical GMM. The Gaussian kernels are parameterized by their mean vector \mu_k(x) and diagonal covariance \sigma_k(x). It is important to note that the parameters \pi_k(x), \mu_k(x), and \sigma_k(x) are general and continuous functions of x. This allows us to model these parameters as the outputs (a^\pi, a^\mu, a^\sigma) of a conventional neural network which takes x as its input. Following [26], the outputs of the neural network are constrained as follows: (i) the mixing coefficients must sum to 1, i.e. \sum_k \pi_k(x) = 1 with 0 \le \pi_k(x) \le 1; this is accomplished via the softmax activation as seen in Eqn 2. (ii) The variances \sigma_k(x) are strictly positive via the exponential activation (Eqn 3).

    \pi_k(x) = \frac{\exp(a_k^\pi)}{\sum_{l=1}^{K} \exp(a_l^\pi)}    (2)

    \sigma_k(x) = \exp(a_k^\sigma)    (3)

    \mathcal{L}_{MDN} = -\sum_{n=1}^{N} \log \Big( \sum_{k=1}^{K} \pi_k(x_n) \, \mathcal{N}(y_n \mid \mu_k(x_n), \sigma_k^2(x_n)) \Big)    (4)

The proposed model is learned end-to-end by maximizing the data log-likelihood, or alternatively minimizing the negative log-likelihood (denoted as \mathcal{L}_{MDN} in Eqn 4), given the input feature tracks (x) and the expected ego-motion estimates y. The resulting ego-motion density estimates obtained from the individual flow vectors are then fused by taking the product of their densities. However, to maintain tractability of density products, only the mean and covariance corresponding to the largest mixture coefficient (i.e. the most likely mixture mode) of each feature are considered for subsequent trajectory optimization (Eqn 5):

    p(y | x_1, \ldots, x_N) \propto \prod_{i=1}^{N} \mathcal{N}(y \mid \mu_{k^*}(x_i), \sigma_{k^*}^2(x_i)), \quad k^* = \arg\max_k \pi_k(x_i)    (5)
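To make the mixture formulation concrete, the following is a minimal pure-Python sketch (not the paper's implementation) of evaluating the MDN density of Eqn 1 and fusing per-feature estimates by multiplying the Gaussians of their most likely modes as in Eqn 5. For simplicity it assumes a single scalar variance per mixture component rather than a full diagonal covariance.

```python
import math

def softmax(a):
    """Mixing-coefficient constraint of Eqn 2: outputs sum to 1."""
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

def mdn_density(y, pis, mus, sigmas):
    """Evaluate p(y | x) = sum_k pi_k * N(y; mu_k, sigma_k^2 I) (Eqn 1).
    Simplification: one scalar sigma per component (isotropic Gaussian)."""
    d = len(y)
    total = 0.0
    for pi, mu, sigma in zip(pis, mus, sigmas):
        norm = (2.0 * math.pi) ** (-d / 2.0) * sigma ** (-d)
        sq = sum((yj - mj) ** 2 for yj, mj in zip(y, mu))
        total += pi * norm * math.exp(-0.5 * sq / sigma ** 2)
    return total

def fuse_most_likely_modes(estimates):
    """Fuse per-feature densities by multiplying the Gaussian of the most
    likely mode of each (Eqn 5); a product of Gaussians is Gaussian with a
    precision-weighted mean. `estimates` is a list of (pis, mus, sigmas)."""
    dims = len(estimates[0][1][0])
    prec_sum = [0.0] * dims
    weighted = [0.0] * dims
    for pis, mus, sigmas in estimates:
        k = max(range(len(pis)), key=lambda i: pis[i])  # most likely mode
        mu, var = mus[k], sigmas[k] ** 2
        for j in range(dims):
            prec_sum[j] += 1.0 / var
            weighted[j] += mu[j] / var
    mean = [w / p for w, p in zip(weighted, prec_sum)]
    var = [1.0 / p for p in prec_sum]
    return mean, var
```

Note that fusing two identical unit-variance Gaussians leaves the mean unchanged while halving the variance, which matches the intuition that agreeing features sharpen the ego-motion belief.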
III-B Trajectory optimization
While minimizing the MDN loss (\mathcal{L}_{MDN}) as described above provides a reasonable regressor for ego-motion estimation, it is evident that optimizing frame-to-frame measurements does not ensure long-term consistency in the ego-motion trajectories obtained by integrating these regressed estimates. As one would expect, the integrated trajectories are sensitive to even negligible biases in the ego-motion regressor.
Two-stage optimization: To circumvent the aforementioned issue, we introduce a second optimization stage that jointly minimizes the local objective (\mathcal{L}_{MDN}) with a global objective (\mathcal{L}_{TRAJ}) that minimizes the error incurred between the overall trajectory and the trajectory obtained by integrating the regressed pose estimates from the local optimization. This allows the global optimization stage to have a warm start with an almost-correct initial guess for the network parameters.
As seen in Eqn 6, \mathcal{L}_{TRAJ} pertains to the overall trajectory error incurred by integrating the individual regressed estimates over a batched window (we typically consider 200 to 1000 frames). This allows us to fine-tune the regressor to predict valid estimates that integrate towards accurate long-term ego-motion trajectories. As expected, after Stage 1 the model is able to roughly learn the curved trajectory path; however, it is not able to make accurate predictions when integrated over longer time windows (due to the lack of the global objective loss term in Stage 1). Figure 2 provides a high-level overview of the input-output relationships of the training procedure, including the various network losses incorporated in the ego-motion encoder/regressor. For illustrative purposes, we refer the reader to Figure 3, where we validate this two-stage approach on a simulated dataset [27].
In Eqn 6, \hat{z}_t is the frame-to-frame ego-motion estimate and the regression target/output of the MDN function F, where \hat{z}_t = F(x_t). \hat{y}_t is the overall trajectory pose predicted by integrating the individually regressed frame-to-frame ego-motion estimates, defined by \hat{y}_t = \hat{z}_1 \oplus \hat{z}_2 \oplus \ldots \oplus \hat{z}_t.

    \mathcal{L}_{TRAJ} = \sum_{t} \| y_t \ominus (\hat{z}_1 \oplus \hat{z}_2 \oplus \ldots \oplus \hat{z}_t) \|^2    (6)
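The pose integration underlying the global objective can be illustrated with a small sketch. We use SE(2) poses (x, y, theta) here purely for readability (the paper's estimates are 6-DOF), and the squared-error form of the loss is our simplification of the trajectory error.

```python
import math

def compose_se2(pose, delta):
    """Compose an absolute SE(2) pose (x, y, theta) with a frame-to-frame
    ego-motion estimate expressed in the previous frame's coordinates."""
    x, y, th = pose
    dx, dy, dth = delta
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            th + dth)

def integrate(deltas, start=(0.0, 0.0, 0.0)):
    """z_1 (+) z_2 (+) ... (+) z_t: integrate the regressed frame-to-frame
    estimates into a full trajectory."""
    traj = [start]
    for d in deltas:
        traj.append(compose_se2(traj[-1], d))
    return traj

def trajectory_loss(pred_deltas, gt_traj):
    """Global objective (cf. Eqn 6): accumulated positional error between the
    integrated predicted trajectory and ground truth over a batched window."""
    pred_traj = integrate(pred_deltas, start=gt_traj[0])
    return sum((px - gx) ** 2 + (py - gy) ** 2
               for (px, py, _), (gx, gy, _) in zip(pred_traj, gt_traj))
```

A small constant bias in each `delta` produces zero per-frame error under a local loss but a steadily growing `trajectory_loss`, which is exactly the failure mode the second stage penalizes.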
[Figure 3 panels: Stage 1 (Final); Stage 2 (Epoch 4); Stage 2 (Epoch 8); Stage 2 (Epoch 18)]
III-C Bootstrapped learning for ego-motion estimation
Typical robot navigation systems consider the fusion of visual odometry estimates with other modalities, including estimates derived from wheel encoders, IMUs, GPS, etc. When odometry estimates (e.g. from wheel encoders) are taken as-is, the uncertainties in open-loop chains grow in an unbounded manner. Furthermore, relative pose estimation may also be inherently biased due to calibration errors that eventually contribute to the overall error incurred. GPS, despite being noise-ridden, provides an absolute sensor reference measurement that is especially complementary to the open-loop chain maintained with odometry estimates. The probabilistic fusion of these two relatively uncorrelated measurement modalities allows us to recover a sufficiently accurate trajectory estimate that can be directly used as ground-truth data (Figure 4) for our supervised regression problem.
The indirect recovery of training data from the fusion of other sensor modalities in robots falls within the self-supervised or bootstrapped learning paradigm. We envision this capability to be especially beneficial in the context of life-long learning in future autonomous systems. Using the fused and optimized pose estimates (recovered from GPS and odometry estimates), we are able to recover the required input-output relationships for training visual ego-motion for a completely new sensor (as illustrated in Figure 4). Figure 5 illustrates the realization of the learned model in a typical autonomous system, where it is treated as an additional sensor source. Through experiments in Section IV-C, we illustrate this concept with the recovery of ego-motion in a robot car equipped with a GPS/INS unit and a single camera.
III-D Introspective Reasoning for Scene-Flow Prediction
Scene flow is a fundamental capability that provides directly measurable quantities for ego-motion analysis. The flow observed by sensors mounted on vehicles is a function of the inherent scene depth, the relative ego-motion undergone by the vehicle, and the intrinsic and extrinsic properties of the camera used to capture it. As with any measured quantity, one needs to deal with sensor-level noise propagated through the model in order to provide robust estimates. While the input flow features are an indication of ego-motion, some of the features may be corrupted due to a lack of (or ambiguous) visual texture, or due to flow induced by the dynamics of objects other than the ego-motion itself. Evidently, we observe that the dominant flow is generally induced by the ego-motion itself, and it is this flow that we intend to fully recover via a conditional variational autoencoder (CVAE). By inverting the regression problem, we develop a generative model able to predict the most-likely flow induced, given an ego-motion estimate y and feature location x. We propose a scene-flow-specific autoencoder that encodes the implicit ego-motion observed by the sensor, while jointly reasoning over the latent depth of each of the individual tracked features.
    \mathcal{L}_{CVAE} = \mathbb{E}_{q(z \mid \Delta x, x, y)} \big[ \log p(\Delta x \mid z, x, y) \big] - D_{KL}\big( q(z \mid \Delta x, x, y) \,\|\, p(z \mid x, y) \big)    (7)
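As a toy stand-in for the outlier reasoning that this generative flow model enables (the actual model is a learned CVAE; the threshold below is purely illustrative and not from the paper), outlier flags can be derived by comparing observed KLT flow against the flow the generative model predicts for the estimated ego-motion:

```python
def flow_outlier_mask(observed, predicted, thresh=0.5):
    """Flag features whose observed flow deviates strongly from the flow
    predicted by the generative model for the current ego-motion estimate,
    e.g. flow on independently moving objects. `thresh` is illustrative."""
    mask = []
    for (ou, ov), (pu, pv) in zip(observed, predicted):
        err = ((ou - pu) ** 2 + (ov - pv) ** 2) ** 0.5  # endpoint error
        mask.append(err > thresh)
    return mask
```

Features flagged by such a mask can be excluded from the density-product fusion, which is the practical payoff of the introspection mechanism.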
Through the proposed denoising autoencoder model, we are also able to attain an introspection mechanism for the presence of outliers. We incorporate this additional module via an auxiliary loss as specified in Eqn 7. An illustration of these flow predictions is shown in Figure 8.

IV Experiments
In this section, we provide detailed experiments on the performance, robustness, and flexibility of our proposed approach on various datasets. Our approach differentiates itself from existing solutions on various fronts, as shown in Table I. We evaluate the performance of our proposed approach on various publicly-available datasets, including the KITTI dataset [21], the Multi-FOV synthetic dataset [27] (pinhole, fisheye, and catadioptric lenses), an omnidirectional-camera dataset [28], and the Oxford RobotCar 1000 km dataset [29].
Navigation solutions in autonomous systems today typically fuse various modalities, including GPS, odometry from wheel encoders, and INS, to provide robust trajectory estimates over extended periods of operation. We provide a similar solution by leveraging the learned ego-motion capability described in this work and fusing it with intermittent GPS updates.^2 While maintaining similar performance capabilities (Table II), we re-emphasize the benefits of our approach over existing solutions:

^2 For evaluation purposes only, the absolute ground-truth locations were added as weak priors on datasets without GPS measurements (Section IV-A).

Versatile: With a fully trainable model, our approach is able to simultaneously reason over both ego-motion and the implicitly modeled camera parameters (intrinsics and extrinsics). Furthermore, online calibration and parameter tuning are implicitly encoded within the same learning framework.

Model-free: Without imposing any constraints on the type of camera optics, our approach is able to recover ego-motion for a variety of camera models, including pinhole, fisheye, and catadioptric lenses (Section IV-B).

Bootstrapped training and refinement: We illustrate a bootstrapped learning example whereby a robot self-supervises the proposed ego-motion regression task by fusing information from other sensor sources, including GPS and INS (Section IV-C).

Introspective reasoning for scene-flow prediction: Via the CVAE generative model, we are able to reason/introspect over the predicted flow vectors in the image given an ego-motion estimate. This provides an obvious advantage in robust outlier detection and in identifying dynamic objects whose flow vectors need to be disambiguated from the ego-motion scene flow (Figure 8).
[Figure 6 panels: (a) Multi-FOV Synthetic Dataset; (b) Omnicam Dataset; (c) Oxford 1000km; (d) KITTI 00; (e) KITTI 05; (f) KITTI 07; (g) KITTI 08; (h) KITTI 09]
IV-A Evaluating ego-motion performance with sensor fusion
In this section, we evaluate our approach against a few state-of-the-art algorithms for monocular visual odometry [4]. On the KITTI dataset [21], the pre-trained estimator is used to robustly and accurately predict ego-motion from KLT features tracked over the dataset image sequences. The frame-to-frame ego-motion estimates are integrated for each session to recover the full trajectory estimate, and simultaneously fused with intermittent GPS updates (incorporated every 150 frames). In Figure 6, we show the qualitative performance of the overall trajectory obtained with our method. The entire pose-optimized trajectory is compared against the ground-truth trajectory. The translational errors are computed for each of the ground-truth and predicted pose pairs, and their median value is reported in Table II for a variety of datasets with varied camera optics.
IV-B Varied camera optics
Most existing implementations of VO estimation are restricted to a single class of camera optics and generally avoid implementing a general-purpose VO estimator for varied camera optics. Our approach, on the other hand, has shown the ability to provide accurate VO with intermittent-GPS trajectory estimation while simultaneously being applicable to a varied range of camera models. In Figure 7, we compare the intermittent-GPS trajectory estimates for all three camera models and verify their accuracy against ground truth. In our experiments, we found that while our proposed solution was sufficiently powerful to model different camera optics, it was significantly better at modeling pinhole lenses than fisheye and catadioptric cameras (see Table II). In future work, we would like to investigate further extensions that improve the accuracy for both fisheye and catadioptric lenses.
Dataset | Camera Optics | Median Trajectory Error
KITTI-00 | Pinhole | 0.19 m
KITTI-02 | Pinhole | 0.30 m
KITTI-05 | Pinhole | 0.12 m
KITTI-07 | Pinhole | 0.18 m
KITTI-08 | Pinhole | 0.63 m
KITTI-09 | Pinhole | 0.30 m
Multi-FOV [27] | Pinhole | 0.18 m
Multi-FOV [27] | Fisheye | 0.48 m
Multi-FOV [27] | Catadioptric | 0.36 m
Omnidirectional [28] | Catadioptric | 0.52 m
Oxford 1000km [29] | Pinhole | 0.03 m
[Figure 7 panels: trajectory estimates for Pinhole, Fisheye, and Catadioptric camera models]
IV-C Self-supervised Visual Ego-motion Learning in Robots
We envision the capability of robots to self-supervise tasks such as visual ego-motion estimation to be especially beneficial in the context of life-long learning and autonomy. We experiment with and validate this concept through a concrete example using the 1000 km Oxford RobotCar dataset [29]. We train the task of visual ego-motion on a new camera sensor by leveraging the fused GPS and INS information collected on the robot car as ground-truth trajectories (6-DOF), and by extracting feature trajectories (via KLT) from image sequences obtained from the new camera sensor. The camera timestamps are synchronized with respect to the timestamps of the fused GPS and INS information in order to obtain a one-to-one mapping for training purposes. We train on the stereo_centre (pinhole) camera dataset and present our results in Table II. As seen in Figure 6, we are able to achieve considerably accurate long-term state estimates by fusing our proposed visual ego-motion estimates with even sparser GPS updates (every 2-3 seconds, instead of 50 Hz GPS/INS readings). This allows the robot to reduce its reliance on GPS/INS alone for robust, long-term trajectory estimation.
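The timestamp association step above can be sketched as a nearest-neighbor match between camera timestamps and fused GPS/INS timestamps. The tolerance parameter `max_dt` below is an illustrative assumption, not a value from the paper.

```python
import bisect

def synchronize(cam_ts, ins_ts, max_dt=0.05):
    """Associate each camera timestamp with the nearest fused GPS/INS
    timestamp (both sorted, in seconds) to obtain a one-to-one mapping for
    training; pairs further apart than `max_dt` are dropped."""
    pairs = []
    for t in cam_ts:
        i = bisect.bisect_left(ins_ts, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(ins_ts)]
        j = min(candidates, key=lambda j: abs(ins_ts[j] - t))
        if abs(ins_ts[j] - t) <= max_dt:
            pairs.append((t, ins_ts[j]))
    return pairs
```

Each returned pair yields one training example: KLT features from the image at the camera timestamp, with the interpolated fused pose as the regression target.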
[Figure 8: input images and forward-motion flow predictions for (a) Pinhole, (b) Fisheye, and (c) Catadioptric camera models]
IV-D Implementation Details
In this section, we describe the details of our proposed model, training methodology, and parameters used. The input to the density-based ego-motion estimator is a set of feature tracks extracted via Kanade-Lucas-Tomasi (KLT) feature tracking over the raw camera image sequences. The input feature positions and flow vectors are normalized using the dimensions of the input image. We evaluate sparse LK (Lucas-Kanade) optical flow over 7 pyramidal scales with a fixed scale factor between scales. As the features are extracted, the corresponding robot pose (available either via GPS or via GPS/INS/wheel-odometry sensor fusion) is synchronized and recorded in SE(3) for training purposes. The input KLT features and the corresponding relative pose estimates used for training are parameterized as (t, r) in R^6, with a Euclidean translation vector t in R^3 and an Euler rotation vector r in R^3.
Network and training: The proposed architecture consists of a set of fully-connected stacked layers (with 1024, 128, and 32 units) followed by a Mixture Density Network with 32 hidden units and K = 5 mixture components. Each of the initial fully-connected layers is followed by a tanh activation and a dropout layer with a dropout rate of 0.1. The final output layer of the MDN produces the parameters (π, μ, σ), whose dimensionality scales with K and the desired number of states d being estimated.
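The constrained MDN output head (Eqns 2-3) can be sketched as below. The K*(d+2) output layout with one scalar sigma per component is a simplifying assumption for illustration, not necessarily the paper's exact parameterization.

```python
import math

def mdn_head(raw, K, d):
    """Split a raw network output vector into MDN parameters for K mixture
    components over a d-dimensional pose: mixing coefficients via softmax
    (Eqn 2), means unconstrained, variances positive via exp (Eqn 3).
    Assumes K*(d+2) raw outputs: K for pi, K*d for mu, K for sigma."""
    assert len(raw) == K * (d + 2)
    a_pi, rest = raw[:K], raw[K:]
    m = max(a_pi)
    e = [math.exp(a - m) for a in a_pi]
    s = sum(e)
    pis = [x / s for x in e]                            # sum to 1
    mus = [rest[i * d:(i + 1) * d] for i in range(K)]   # unconstrained means
    sigmas = [math.exp(a) for a in rest[K * d:]]        # strictly positive
    return pis, mus, sigmas
```

In a Keras/TensorFlow implementation the same constraints would be applied as activations on the final dense layer; the negative log-likelihood of Eqn 4 is then computed from these three parameter groups.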
The network is trained (in Stage 1) with loss weights of 10, 0.1, and 1 corresponding to the losses described in the previous sections. The training data is provided in batches of 100 frame-to-frame subsequent image pairs, each consisting of approximately 50 feature matches randomly sampled from the KLT tracks. We use Adam as the optimizer. On the synthetic Multi-FOV dataset and the KITTI dataset, training most models took roughly an hour and a half (3000 epochs), independent of the KLT feature-extraction step.
Two-stage optimization: We found the one-shot joint optimization of the local ego-motion estimation and the global trajectory optimization to have sufficiently low convergence rates during training. One possible explanation is the high sensitivity of the loss-weight parameters used to combine the local and global losses into a single objective. As previously addressed in Section III-B, we separate the training into two stages, thereby alleviating the aforementioned issues and maintaining fast convergence rates in Stage 1. Furthermore, we note that the second stage requires only a few tens of iterations to produce sufficiently accurate ego-motion trajectories. In order to optimize over a larger time window in Stage 2, we set the batch size to 1000 frame-to-frame image matches, again randomly sampled from the training set as before. Due to the large integration window and memory limitations, we train this stage purely on the CPU for only 100 epochs, each taking roughly 30 s. Additionally, in Stage 2, the loss weight for \mathcal{L}_{TRAJ} is increased to 100 in order to achieve faster convergence to the global trajectory. The remaining loss weights are left unchanged.
Trajectory fusion: We use GTSAM^3 to construct the underlying factor graph for pose-graph optimization. Odometry constraints obtained from the frame-to-frame ego-motion are incorporated as 6-DOF constraints parameterized in SE(3), with fixed rotational noise (in rad) and translational noise (in m). As with typical autonomous navigation solutions, we expect measurement updates in the form of GPS (absolute reference updates) in order to correct for the long-term drift incurred in open-loop odometry chains. We incorporate absolute prior updates only every 150 frames, with a weak translation prior (in m). The constraints are incrementally added and solved using iSAM2 [30] as the measurements are streamed in, with updates performed every 10 frames.

^3 http://collab.cc.gatech.edu/borg/gtsam
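As a heavily simplified, one-dimensional stand-in for the GTSAM-based fusion above (no factor graph; the blending weight `alpha` is purely illustrative), the effect of intermittent absolute updates on a biased open-loop odometry chain can be demonstrated as:

```python
def fuse_with_gps(deltas, gps, every=150, alpha=0.8):
    """Integrate 1-D odometry increments and, every `every` frames, blend the
    running estimate toward the absolute GPS measurement at that frame.
    `gps` maps frame index -> absolute position (or None to disable)."""
    x, traj = 0.0, []
    for i, d in enumerate(deltas):
        x += d  # open-loop odometry integration (drift accumulates)
        if gps is not None and (i + 1) % every == 0:
            x = alpha * gps[i] + (1 - alpha) * x  # absolute correction
        traj.append(x)
    return traj
```

With a 10% bias in the odometry increments, the corrected trajectory ends far closer to the true endpoint than the open-loop one, mirroring how intermittent GPS priors bound the drift of the integrated ego-motion estimates.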
While the proposed MDN is parameterized in Euler angles, the trajectory integration module parameterizes the rotation vectors as quaternions for robust and unambiguous long-term trajectory estimation. All rigid-body transformations are implemented directly in TensorFlow for pure-GPU training support.
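A standard Euler-to-quaternion conversion of the kind this module would perform looks like the following (roll-pitch-yaw convention assumed; the paper does not specify its convention):

```python
import math

def euler_to_quaternion(roll, pitch, yaw):
    """Convert an Euler rotation (roll, pitch, yaw in radians) to a unit
    quaternion (w, x, y, z), avoiding the ambiguities of Euler angles when
    composing rotations over long trajectories."""
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (cr * cp * cy + sr * sp * sy,
            sr * cp * cy - cr * sp * sy,
            cr * sp * cy + sr * cp * sy,
            cr * cp * sy - sr * sp * cy)
```

The result is always unit-norm, so successive frame-to-frame rotations can be composed by quaternion multiplication without renormalization drift dominating.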
Run-time performance: We are particularly interested in the run-time/test-time performance of our approach on CPU architectures for mostly resource-constrained settings. Independent of the KLT feature-tracking run-time, we are able to recover ego-motion estimates in roughly 3 ms on a consumer-grade Intel(R) Core(TM) i7-3920XM CPU @ 2.90 GHz.
Source code and pre-trained weights: We implemented the MDN-based ego-motion estimator with Keras and TensorFlow, and trained our models using a combination of CPUs and GPUs (NVIDIA Titan X). All models were trained on a server-grade Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40 GHz and tested on the consumer-grade machine mentioned above to emulate potential real-world use cases. The source code and pre-trained models will be made available shortly.^4

^4 See http://people.csail.mit.edu/spillai/learningegomotion and https://github.com/spillai/learningegomotion

V Discussion
Our initial results in bootstrapped learning for visual ego-motion have motivated new directions towards life-long learning in autonomous robots. While our visual ego-motion model architecture is shown to be sufficiently powerful to recover ego-motion for non-linear camera optics such as fisheye and catadioptric lenses, we continue to investigate further improvements to match existing state-of-the-art models for these lens types. Our current model does not yet capture distortion effects; however, this is very much a direction we would like to pursue in future work. Another consideration is the resource-constrained setting, where the optimization objective incorporates an additional regularization term on the number of parameters used and the computational load consumed. We hope for this resource-aware capability to transfer to real-world, limited-resource robots and to have a significant impact on the adaptability of robots for long-term autonomy.
VI Conclusion
While many visual ego-motion algorithm variants have been proposed in the past decade, we envision that a fully end-to-end trainable algorithm for generic camera ego-motion estimation will have far-reaching implications in several domains, especially autonomous systems. Furthermore, we expect our method to seamlessly operate under resource-constrained situations in the near future by leveraging existing solutions in model reduction and dynamic model-architecture tuning. With the availability of multiple sensors on these autonomous systems, we also foresee our approach to bootstrapped task learning (visual ego-motion) potentially enabling robots to learn from experience, and to use the new models learned from these experiences to encode redundancy and fault-tolerance, all within the same framework.
References
 [1] David Nistér, Oleg Naroditsky, and James Bergen. Visual odometry. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 1, pages I–652. IEEE, 2004.
 [2] Kurt Konolige, Motilal Agrawal, and Joan Sola. Large-scale visual odometry for rough terrain. In Robotics Research, pages 201–212. Springer, 2010.
 [3] Andrew Howard. Real-time stereo visual odometry for autonomous ground vehicles. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3946–3952. IEEE, 2008.
 [4] Bernd Kitt, Andreas Geiger, and Henning Lategahn. Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In Intelligent Vehicles Symposium, pages 486–492, 2010.
 [5] Gim Hee Lee, Friedrich Faundorfer, and Marc Pollefeys. Motion estimation for self-driving cars with a generalized camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2746–2753, 2013.
 [6] Laurent Kneip, Paul Furgale, and Roland Siegwart. Using multi-camera systems in robotics: Efficient solutions to the nPnP problem. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 3770–3776. IEEE, 2013.
 [7] Davide Scaramuzza. 1-point-RANSAC structure from motion for vehicle-mounted cameras by exploiting non-holonomic constraints. Int'l J. of Computer Vision, 95(1):74–85, 2011.
 [8] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment — A modern synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999.
 [9] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
 [10] Hans P Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. Technical report, DTIC Document, 1980.
 [11] Larry Henry Matthies. Dynamic stereo vision. PhD thesis, Carnegie Mellon University, 1989.
 [12] Clark F Olson, Larry H Matthies, Marcel Schoppers, and Mark W Maimone. Robust stereo ego-motion for long distance navigation. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 453–458. IEEE, 2000.
 [13] Martin A Fischler and Robert C Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
 [14] Peter Corke, Dennis Strelow, and Sanjiv Singh. Omnidirectional visual odometry for a planetary rover. In Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, volume 4, pages 4007–4012. IEEE, 2004.
 [15] Bojian Liang and Nick Pears. Visual navigation using planar homographies. In Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, volume 1, pages 205–210. IEEE, 2002.
 [16] Qifa Ke and Takeo Kanade. Transforming camera geometry to a virtual downward-looking camera: Robust ego-motion estimation and ground-layer detection. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–390. IEEE, 2003.
 [17] Davide Scaramuzza, Friedrich Fraundorfer, and Roland Siegwart. Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 4293–4299. IEEE, 2009.
 [18] Richard Roberts, Christian Potthast, and Frank Dellaert. Learning general optical flow subspaces for egomotion estimation and detection of motion anomalies. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 57–64. IEEE, 2009.
 [19] Thomas A Ciarfuglia, Gabriele Costante, Paolo Valigi, and Elisa Ricci. Evaluation of non-geometric methods for visual odometry. Robotics and Autonomous Systems, 62(12):1717–1730, 2014.
 [20] Gabriele Costante, Michele Mancini, Paolo Valigi, and Thomas A Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robotics and Automation Letters, 1(1):18–25, 2016.
 [21] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
 [22] Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, and Niki Trigoni. VINet: Visual-inertial odometry as a sequence-to-sequence learning problem. AAAI, 2016.
 [23] Davide Scaramuzza and Friedrich Fraundorfer. Visual odometry [tutorial]. IEEE Robotics & Automation Magazine, 18(4):80–92, 2011.
 [24] Kishore Konda and Roland Memisevic. Learning visual odometry with a convolutional network. In International Conference on Computer Vision Theory and Applications, 2015.
 [25] Stan Birchfield. KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker, 2007.
 [26] Christopher M Bishop. Mixture density networks. Technical report, Aston University, 1994.
 [27] Zichao Zhang, Henri Rebecq, Christian Forster, and Davide Scaramuzza. Benefit of large field-of-view cameras for visual odometry. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016.
 [28] Miriam Schönbein and Andreas Geiger. Omnidirectional 3D reconstruction in augmented Manhattan worlds. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pages 716–723. IEEE, 2014.
 [29] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The Oxford RobotCar dataset. Int’l J. of Robotics Research, page 0278364916679498, 2016.
 [30] Michael Kaess, Hordur Johannsson, Richard Roberts, Viorela Ila, John J Leonard, and Frank Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. Int’l J. of Robotics Research, 31(2):216–235, 2012.