Deep learning-based techniques have been adopted successfully to solve many standard computer vision problems, including image classification, object detection and segmentation. Despite the widespread success of these approaches, they have not yet been widely exploited for the standard perception problems encountered in autonomous navigation, such as Visual Odometry (VO), Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM). This paper analyzes the problem of Monocular Visual Odometry using a Deep Learning-based framework instead of the regular 'feature detection and tracking' pipeline approaches. Several experiments were performed to understand the influence of a known/unknown environment, a conventional trackable feature and pre-trained activations tuned for object classification on the network's ability to accurately estimate the motion trajectory of the camera (or the vehicle). Based on these observations, we propose a Convolutional Neural Network architecture that is best suited for estimating the object's pose under known environment conditions and that displays promising results when it comes to inferring the actual scale using just a single camera in real-time.
In recent years, Convolutional Neural Networks (CNNs) have been employed successfully for numerous applications in Computer Vision and Robotics, such as object detection, classification, semantic segmentation and many others, often outperforming the conventional feature-based methods. However, a few exceptions to this trend exist: Structure from Motion (SfM), Simultaneous Localization and Mapping (SLAM) and Visual Odometry (VO) are traditional perception problems for which deep learning techniques have not yet been widely exploited. In this paper, we analyze the problem of Visual Odometry using a Deep Learning-based framework.
In robot navigation, odometry is defined as the process of fusing data from different motion sensors to estimate the change in the robot’s position over time. This process of determining the trajectory plays an important part in robotics, forming the basis of path planning and control. Traditionally, this problem has been tackled using data from rotary encoders, IMU and GPS. While this approach has been practically successful in solving the problem at hand, it is still prone to failure under unfavorable conditions such as wheel slip on uneven terrain and loss of GPS signal. More recently, the problem has been solved using only data from a camera (a sequence of images). This process of incrementally estimating the robot’s pose (position and orientation) by analyzing the motion changes in the associated camera images is known as visual odometry.
A standard Visual Odometry approach generally consists of the following steps (for both monocular and stereo vision cases):
1. Image acquisition at two time instances
2. Image correction such as rectification and lens distortion removal
3. Feature tracking between the two images to obtain the optical flow
4. Estimation of motion using the obtained optical flow and the camera parameters
On the deep learning front, there have been huge technological advancements in the applications of CNNs. It has been shown that these deep networks are adept at extracting various abstract features from images.
Our work proposes a Deep Learning-based framework for analyzing the problem of visual odometry, motivated by the observation that CNNs can be used to extract high-level features from images in place of geometric feature descriptors. Using these features, we estimate the transformation between two consecutive scenes to recreate the vehicle’s trajectory. Another significant contribution of this paper is the use of monocular vision alone to estimate the vehicle’s position in true scale, which cannot be done by purely geometry-based methods. This is possible because the trained network is able to learn the camera intrinsic parameters and the scale. We hope that this framework will open up further research into the associated fields of Simultaneous Localization and Mapping (SLAM) and Structure from Motion (SfM) as well.
The problem of visual odometry has traditionally been tackled by two kinds of methods: feature-based and direct ("appearance-based"). While the first approach relies on detecting and tracking a sparse set of salient image features such as lines and corners, the latter relies directly on the pixel intensity values to extract motion information.
Feature-based methods use a variety of feature detectors to detect salient feature points, such as FAST (Features from Accelerated Segment Test), SURF (Speeded Up Robust Features), BRIEF (Binary Robust Independent Elementary Features), ORB (Oriented FAST and Rotated BRIEF) and Harris corner detectors. These feature points are then tracked in the next sequential frame using a feature point tracker, the most common one being the KLT tracker. The result thus obtained is the optical flow, from which the ego-motion can then be estimated using the camera parameters, as proposed by Nister. This general approach of detecting feature points and tracking them is followed by most papers (in both monocular vision and stereo vision based approaches). More recent works in this area employ the PTAM approach, a robust feature tracking-based SLAM algorithm with the added advantage of running in real-time by parallelizing the motion estimation and mapping tasks.
Direct or "appearance-based" methods for visual odometry rely directly on the pixel intensity values in an image and minimize errors directly in sensor space, thereby avoiding feature matching and tracking. These methods, however, require a planarity assumption (e.g. homography). Early direct monocular SLAM methods made use of filtering algorithms for Structure from Motion, while other works used non-linear least squares estimation. Other approaches like DTAM compute a dense depth-map for each key-frame, which is then used for aligning the whole image to find the camera pose. This is done by minimizing a global energy function. Since this approach is computationally intensive, heavy GPU parallelization is required, and alternative methods have been proposed to mitigate this computational requirement. Recently, fast direct monocular SLAM has also been achieved by the LSD-SLAM algorithm.
Aside from these two approaches, the other notable method is a semi-direct approach to the problem, which combines the successful factors of feature-based methods (tracking many features, parallel tracking and mapping) with the accuracy and speed of direct methods. This was explored in the work by Scaramuzza et al.
With the advent of CNNs, numerous computer vision tasks have been solved very efficiently and with higher accuracy by these architectures as compared to traditional geometry-based approaches. Classification problems such as the ImageNet Large Scale Visual Recognition Competition (ILSVRC), regression problems like depth regression, object detection and segmentation problems have all been solved by these networks.
However, the domains of Structure from Motion, SLAM and Visual Odometry remain largely untouched by the advances in deep learning. Recently, optical flow between two images has been obtained by methods such as FlowNet and EpicFlow. The homography between two images has also been estimated using deep networks. Nicolai, Skeele et al. applied deep learning techniques to learn odometry, but using laser data from a LIDAR. The only visual odometry approach using deep learning that the authors are aware of is the work of Konda and Memisevic. Their approach, however, is limited to stereo visual odometry. Agrawal et al. propose the use of the ego-motion vector as a weak supervisory signal for feature learning. For inferring ego-motion, their training approach treats the whole problem as a classification task. In contrast, we treat visual odometry estimation as a regression problem.
The pipeline can be divided into two stages: Data Preprocessing and the CNN Framework, the latter designed specifically for the different experiments.
For our experiments, the KITTI Vision benchmark was used. The visual odometry dataset provided by KITTI consists of stereo-vision sequences collected while driving the vehicle in different environments. Since this work focuses on monocular vision, only the video sequences collected from a single camera were considered. Of the 21 sequences available, the 11 sequences with ground truth trajectories were used. These 11 sequences were further sorted into training and testing datasets, as per the needs of our experiments. The original ground truth pose information is available as a sequence of $3 \times 4$ transformation matrices which describe the motion of the vehicle between time step 0 and time step $t$. These matrices were processed to generate the ground truth data in a new form describing the differential changes in the motion $(\Delta x, \Delta z, \Delta\theta)$ of the vehicle between the images of each consecutive pair $I_t$ and $I_{t+1}$ (where $I_t$ is the image at time step $t$ and $I_{t+1}$ is the image at time step $t+1$), with the translational components taken along two designated translational axes $(x, z)$. Each of the original images of size $1241 \times 376$ was warped and downsampled to $256 \times 256$, as the architecture we propose was inspired by AlexNet, which restricts inputs to square images only. Later, a dataset of image pairs was generated consisting of the image at time step $t$ and the corresponding image at time step $t+1$. Thus, each sample of the final processed dataset can be represented as the tuple $(I_t, I_{t+1}, \Delta x, \Delta z, \Delta\theta)$.
This was the base input image and ground truth label format. However, for different experiments, this base data was converted into other realizable formats, or augmented with additional data, which are explained in the later subsections.
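The conversion from absolute poses to per-pair labels described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: it assumes each pose is given as one line of 12 row-major floats (a 3 x 4 matrix), that the third label component is the change in heading about the camera's vertical axis, and that the differences are expressed in the camera frame at time step $t$. The function names are hypothetical.

```python
import math
import numpy as np

def parse_pose(line):
    """Turn one pose line (12 floats, row-major 3x4 [R|t]) into a 4x4 matrix."""
    values = np.array(line.split(), dtype=float)
    pose = np.eye(4)
    pose[:3, :4] = values.reshape(3, 4)
    return pose

def relative_motion(pose_t, pose_t1):
    """Return (dx, dz, dtheta) of frame t+1 expressed in the frame at t.

    Assumption: heading is the rotation about the vertical (y) axis, so it
    can be read from the x-z entries of the relative rotation matrix.
    """
    rel = np.linalg.inv(pose_t) @ pose_t1
    dx, dz = rel[0, 3], rel[2, 3]
    dtheta = math.atan2(rel[0, 2], rel[2, 2])
    return float(dx), float(dz), dtheta

def make_labels(pose_lines):
    """Build one (dx, dz, dtheta) label for every consecutive pose pair."""
    poses = [parse_pose(line) for line in pose_lines]
    return [relative_motion(a, b) for a, b in zip(poses, poses[1:])]
```

A sequence with N poses yields N - 1 labels, one per image pair $(I_t, I_{t+1})$.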
All the demonstrated experiments were performed on an Intel Xeon (4 x 3.3 GHz) machine with 32 GB of DDR3 RAM and an NVIDIA GTX 970. To implement and evaluate our approach for learning visual odometry on the GPU, we chose Caffe, developed by the Berkeley Vision and Learning Center. All data pre-processing was programmed in Python, using associated libraries for compatibility with the Python bindings of Caffe.
We designed a CNN architecture, partly based on the original AlexNet, tuned to take the paired images in sequence $(I_t, I_{t+1})$ as simultaneous inputs, with the objective of regressing the target labels $(\Delta x, \Delta z, \Delta\theta)$. All weights in the network’s convolutional layers had a Gaussian initialization, whereas the fully connected layers were initialized using the Xavier algorithm. The network was designed to compute an L2 (Euclidean) loss. Based on the different experiments performed for the proposed analysis, the network architecture was further tuned to each task, with the details described below.
From the 11 sequences in the dataset, 7 were considered for training and 4 for testing. Here, the testing sequences were chosen such that they belonged to different environmental conditions from the training sequences. The network architecture consists of two parallel AlexNet-based cascades of convolutional layers, whose outputs are concatenated after the final convolutional layer and fed to a stack of fully connected layers that regresses the target variables $(\Delta x, \Delta z, \Delta\theta)$ (Figure 1).
The network takes 3 inputs: $I_t$, $I_{t+1}$ and the pose $(\Delta x, \Delta z, \Delta\theta)$ between them. The two image inputs were fed into the convolutional cascades, which ran in parallel, and their outputs were concatenated at the end to generate a flattened vector of size (image batch size x 8192). This vector was fed into custom designed fully connected layers that converged to an output of size (image batch size x 3), which was fed along with the ground truth label to a Euclidean loss layer to minimize the loss. The same architecture, ignoring the dropout layers, was used in the test phase.
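The shape flow at the end of the network and the loss it minimizes can be illustrated with a short NumPy sketch. Several simplifications are assumptions made for illustration only: each convolutional stream is taken to contribute 4096 values (so that the concatenation is 8192 wide), the stack of fully connected layers is replaced by a single linear layer, and the loss uses the 1/(2N) normalization of Caffe's Euclidean loss layer. The function names are hypothetical.

```python
import numpy as np

def euclidean_loss(pred, target):
    """Sum of squared differences divided by 2N, N being the batch size."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n = pred.shape[0]
    return float(np.sum((pred - target) ** 2) / (2.0 * n))

def regress_pose(feat_t, feat_t1, weights, bias):
    """Concatenate both stream features and map them to 3 outputs.

    feat_t, feat_t1: (batch, 4096) flattened features (assumed width).
    weights: (8192, 3), bias: (3,). A single linear layer stands in for
    the stacked fully connected layers of the actual network.
    """
    fused = np.concatenate([feat_t, feat_t1], axis=1)  # (batch, 8192)
    return fused @ weights + bias                      # (batch, 3)
```

For example, a batch of two predictions that differ from the ground truth by one unit in a single component gives a loss of 1 / (2 * 2) = 0.25.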
The training and testing sets were taken from a random permutation of each of the 11 sequences individually, split in two different proportions: 80:20 and 50:50. This ensured that both the training and test sets contained sequences from similar environments.
The network architecture adopted was exactly the same as in the previous experiment. The only difference was in the preparation of the training and testing sets, with the motivation of observing the network’s behavior in a known versus an unknown environment. This provides an insight into the nature of the Visual Odometry problem: the experiment helps in understanding whether the proposed network architecture is robust to new environments or requires prior knowledge of the scene.
The model was trained twice independently, once for the 80:20 and once for 50:50 training to testing set ratio scenario. The major motivation for training the model in two different ratios was to analyze the amount of data required by the network to sufficiently learn about the environment to be able to accurately estimate the trajectory.
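The per-sequence split can be sketched as below. This is only an illustration of the idea: the seeding scheme, the integer sequence identifiers and the function names are assumptions, and the pair counts in any call are placeholders rather than the actual KITTI sequence lengths.

```python
import random

def split_sequence(num_pairs, train_fraction, seed=0):
    """Randomly permute the pair indices of one sequence and cut them in two."""
    indices = list(range(num_pairs))
    random.Random(seed).shuffle(indices)
    cut = int(round(num_pairs * train_fraction))
    return sorted(indices[:cut]), sorted(indices[cut:])

def split_dataset(pairs_per_sequence, train_fraction, seed=0):
    """Split every sequence on its own so each one appears in both sets.

    pairs_per_sequence maps an integer sequence id to its number of pairs.
    """
    return {
        seq_id: split_sequence(num_pairs, train_fraction, seed + seq_id)
        for seq_id, num_pairs in pairs_per_sequence.items()
    }
```

Calling `split_dataset(counts, 0.8)` and `split_dataset(counts, 0.5)` on the same hypothetical `counts` dictionary produces the two training-to-testing ratios, with every environment represented on both sides of the split.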
For this task, in addition to the schema used in the first experiment, FAST features were added as a prior input to the network (Figure 2). The features for each image were appended to the RGB data to generate a 4-channel input for each image. The image data thus obtained and the ground truth poses were segregated into 7 training and 4 test sequences. The network architecture and the procedure were otherwise the same as in the first experiment. This experiment was performed to observe whether a prior feature, of the kind conventionally used in feature-based approaches to visual odometry, improves the accuracy of pose estimation.
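One way to append detected features to the RGB data is to rasterize the keypoints into an extra channel. The sketch below assumes exactly that encoding (a binary mask with 255 at keypoint locations), which the text does not specify, and it takes the keypoint coordinates as given instead of running a FAST detector. The function name is hypothetical.

```python
import numpy as np

def append_keypoint_channel(rgb, keypoints):
    """Stack a binary keypoint mask onto an RGB image as a fourth channel.

    rgb: (height, width, 3) array.
    keypoints: iterable of (x, y) pixel coordinates, e.g. from a corner
    detector; points falling outside the image are ignored.
    """
    height, width, _ = rgb.shape
    mask = np.zeros((height, width), dtype=rgb.dtype)
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            mask[row, col] = 255
    return np.dstack([rgb, mask])  # (height, width, 4)
```

The first convolutional layer of each stream then has to accept 4 input channels instead of 3.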
This experiment was performed using a network architecture consisting of two AlexNet-based cascades of convolutional layers pre-trained on the ImageNet database. The network was fine-tuned by training on part of the dataset sequences, while the rest were used as test sequences. Here, the output activations of the final convolutional layer of the original AlexNet architecture were extracted and served as the input instead of a standard RGB image. The learnable part of the architecture comprised 1 convolutional layer and 4 fully connected layers (Figure 3). This experiment was designed to understand the effect of activations pre-trained on object classification labels on the task of estimating the odometry vector.
For the experiments described in section 3.3, the results compare the network predictions with the ground truth and show the loss in the training and testing phases. The network was observed to pass an arbitrary image pair through its layers, compute the layer activations and estimate the odometry vector in 9 ms on average, displaying real-time capability. It was further observed that this did not depend on the nature of the scene.
For this evaluation, testing was performed on an environment completely unknown to the network. Under such conditions, the estimated position deviates substantially from the ground truth, as shown in Figure 4.
The training and test losses for this network are shown in Figure 5. As can be observed from the plot, the training loss declines very quickly with the number of iterations. The test loss, on the other hand, oscillates around a fixed value with small variations. This shows that although the network is able to reduce the loss on a known environment, what it learns does not transfer to estimating the odometry vector in a scene it has not seen. Therefore, even after a significant number of iterations, the test loss does not fall.
This experiment was performed on a known environment, with the data segregated into training and testing sets in ratios of 80-20 and 50-50. Figures 6 and 9 show a significant improvement in the prediction of the odometry vector for a sequence, part of which is already known to the network. Figures 6, 7 and 8 are the results for data split in the 50-50 ratio.
Figure 7 gives an insight into the deviation, which is observed to be increasing with time. Therefore, it can be concluded that the error in odometry accumulates over time resulting in the predicted trajectory drifting away from the ground truth.
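The accumulation of error follows from the way a trajectory is recovered: each predicted step is chained onto all previous ones, so a small bias in every step compounds. The sketch below illustrates this with a made-up heading bias of 0.01 rad per step. It assumes planar motion, steps expressed in the camera frame of the previous time step, and heading measured about the vertical axis; the function names and the numbers are illustrative only.

```python
import math

def integrate_trajectory(motions, start=(0.0, 0.0, 0.0)):
    """Chain per-frame (dx, dz, dtheta) steps into global (x, z, theta) poses."""
    x, z, theta = start
    trajectory = [(x, z, theta)]
    for dx, dz, dtheta in motions:
        # Rotate the local step into the global frame, then accumulate it.
        x += math.cos(theta) * dx + math.sin(theta) * dz
        z += -math.sin(theta) * dx + math.cos(theta) * dz
        theta += dtheta
        trajectory.append((x, z, theta))
    return trajectory

def position_errors(reference, estimate):
    """Euclidean distance between matching positions of two trajectories."""
    return [math.hypot(rx - ex, rz - ez)
            for (rx, rz, _), (ex, ez, _) in zip(reference, estimate)]

# Ground truth: 100 unit steps straight ahead. Estimate: same steps with a
# small constant heading bias (hypothetical value) added to every step.
truth = integrate_trajectory([(0.0, 1.0, 0.0)] * 100)
drifted = integrate_trajectory([(0.0, 1.0, 0.01)] * 100)
errors = position_errors(truth, drifted)
```

Here `errors` starts at zero and keeps growing along the sequence, even though the per-step error stays constant.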
The loss, like the deviation, shows a great improvement in a known environment over an unknown environment. The test loss follows the training loss and shows a steep drop as the number of iterations increases.
In this part, we used FAST features as priors along with the RGB images. As observed from Figure 12, this network displays similar behavior in terms of training and test loss as that of a network in an unknown environment. This experiment consisted of fewer test iteration cases.
The results from the experiments performed are highly encouraging. The authors believe that the results not only suggest that the architecture presented can be tried out on robotic platforms, but also provide us a deep understanding of how this network deals with the visual odometry problem.
From the results of testing on a known environment, it is clear that the more the network learns about a particular environment, the better it gets at predicting the visual odometry. This is in line with general expectations. It also supports the hypothesis that the network treats the problem of visual odometry as specific to a particular scene. This is further supported by comparing these results with those of the first experiment: when predicting visual odometry on unseen images, the network performs fairly poorly.
Inspired by this finding, the authors delve deeper into understanding the significance of the features required for scene understanding. Agrawal et al. present the use of the ego-motion vector as a weak supervisory signal for feature learning. They show the effectiveness of the learnt features on simple tasks like scene and object recognition. Motivated by this, the authors used the pre-trained weights of AlexNet, trained on object classification, for the presented network. However, the results obtained do not support this hypothesis, showing that the features extracted from the pre-trained network are not generic to the problem of visual odometry.
The authors try out the idea of providing prior information about the scene to improve the prediction accuracy on unknown environments. Therefore, the FAST features of the scene were used along with the features extracted by the convolutional layers of the network.
The results of predicting visual odometry in a known environment show the error drifting with time. Consequently, the predicted trajectory deviates further from the ground truth as time passes. To tackle this issue, the authors feel that the use of a recurrent network would be more appropriate. The presence of recurrent connections would enable the network to continuously correct the error with respect to the ground truth.
It would also be interesting to further explore the fusion of conventional trackable features, as a prior, with the higher-level features generated by the CNNs.
Generative networks that predict the next scene from an estimated ego-motion vector, and then update the ego-motion vector through a feedback loop, could be used to correct the accumulating error. This mechanism is known to function in the human brain, and a similar architecture can be used in artificial systems too.
The proposed network demonstrates promising results when provided with prior knowledge of the environment, while displaying the expected opposite response in the case of an unknown environment. The network, when provided with a prior of FAST features and tested on an unknown environment, behaves similarly to the network subjected to an unknown environment without any prior. It may be concluded that the proposed CNN designed for the purpose of Visual Odometry is able to learn features similar to FAST, and that a manual addition of these features only contributes redundancy. When deployed on known environments, the network architecture is able to learn the actual scale in real time, which is not possible for monocular visual odometry using geometric methods.