1 Introduction
The importance of navigation and mapping to the fields of robotics and computer vision has only increased since their inception. Vision-based navigation in particular is an extremely interesting field of research due to its resemblance to human navigation and the wealth of information an image contains. Although creating a machine that understands structure and motion purely from RGB images is challenging, the computer vision community has developed a plethora of algorithms to replicate useful aspects of human vision using a computer. Tracking and mapping remains an unsolved problem, with many popular approaches. Photometric-based techniques rely on establishing correspondences across different viewpoints of the same scene; the matching points are then used to perform triangulation. Based on the density of the map, the field can be divided into dense [28], semi-dense [6] and sparse [26] approaches, each of which comes with advantages and disadvantages.
Applying machine learning techniques to solve vision problems has been another popular area of research. Great advances have been made in the fields of image classification [20, 12, 13] and semantic segmentation [25, 41], and this has led geometry-based machine learning methods to follow suit. The massive growth in neural network driven research has largely been facilitated by the increased availability of low-cost, high-performance GPUs as well as the relative accessibility of machine learning frameworks such as TensorFlow and Caffe.
In this work we draw from both machine learning approaches and SfM techniques to create a unified framework which is capable of predicting the depth of a scene and the motion parameters governing the camera motion between an image pair. We construct our framework incrementally: the network is first trained to predict depths given a single color image. Then a color image pair and the associated depth predictions are provided to a flow estimation network, which produces an optical flow map along with an estimated measure of confidence in x and y motion. Finally, the pose estimation block utilises the outputs of the previous networks to estimate a motion vector corresponding to the logarithm of the Special Euclidean Transformation in SE(3), which describes the relative camera motion from the first image to the second. We summarise the contributions made in this paper as follows:

We also present the first approach to use a neural network to predict the full information matrix which represents the confidence of our optical flow estimate.
2 Related Work
Estimating motion and structure from two or more views is a well-studied vision problem. In order to reconstruct the world and estimate camera motion, sparse feature-based systems [19, 26] compute correspondences through feature matching, while the denser approaches [6, 28] rely on brightness constancy across multiple viewpoints. In this work, we leverage CNNs to solve the aforementioned tasks, and we summarise the existing works in the literature that are related to the ideas presented in this paper.
2.1 Single Image Depth Prediction
Predicting depth from a single RGB image using learning-based approaches has been explored even prior to the resurgence of CNNs. In [33], Saxena et al. employed a Markov Random Field (MRF) to combine global and local image features. Similar to our approach, Eigen et al. [5] introduced a common CNN architecture capable of predicting depth maps for both indoor and outdoor environments. This concept was later extended to a multi-stage coarse-to-fine network by Eigen et al. in [4]. Further advances were made by combining graphical models with CNNs [24] to improve the accuracy of depth maps, through the use of related geometric tasks [3], and by making architectural improvements specifically designed for depth prediction [22]. Kendall et al. demonstrated in [17] that predicting depths and uncertainties improves the overall accuracy. While most of these methods demonstrated impressive results, an explicit notion of geometry was not used during any stage of the pipeline, which opened the way for geometry-based depth prediction approaches.
In one of the earliest works to predict depth using geometry in an unsupervised fashion, Garg et al. used the photometric difference between a stereo image pair, where the target image was synthesized using the predicted disparity and the known baseline [8]. Left-right consistency was explicitly enforced in the unsupervised framework of Godard et al. [10] as well as in the semi-supervised framework of Kuznietsov et al. [21], which is a technique we also found to be beneficial during training on sparse ground truth data.
2.2 Optical Flow Prediction
An early work in optical flow prediction using CNNs was FlowNet [7]. This was later extended by Ilg et al. to FlowNet 2.0 [14], which included stacked FlowNets [7] as well as warping layers. Ranjan and Black proposed a spatial pyramid based optical flow prediction network [30]. More recently, Sun et al. proposed a framework which uses principles from geometry-based flow estimation techniques, such as image pyramids, warping and cost volumes [36]. As our end goal revolves around predicting camera pose, it becomes necessary to isolate the flow that was caused purely by camera motion. In order to achieve this, we extend these previous works to predict both the optical flow and the associated information matrix of the flow. Although not in a CNN context, [39] showed the usefulness of estimating flow and uncertainty.
2.3 Pose Estimation
CNNs have been successfully used to estimate various components of a Structure from Motion pipeline. Earlier works focused on learning discriminative image based features suitable for egomotion estimation [2, 15]. Yi et al.[40] showed a full feature detection framework can be implemented using deep neural networks. Rad and Lepetit in BB8[29] showed the pose of objects can be predicted even under partial occlusion and highlighted the increased difficulty of predicting 3D quantities over 2D quantities. Kendall and Cipolla demonstrated that camera pose prediction from a single image catered for relocalization scenarios [16].
However, each of the above works lacks a representation of structure, as they do not explicitly predict depths. Our work is more closely related to that of Zhou et al. [42] and Ummenhofer et al. [38] and their frameworks SfMLearner and DeMoN. Both of these approaches also predict a single confidence map, in contrast to ours, which estimates the confidence in the x and y directions separately. Since our framework predicts metric depths, unlike theirs, we are able to produce far more accurate visual odometry and combat scale drift. CNN-SLAM by Tateno et al. [37] incorporated the depth predictions of [22] into a SLAM framework. Our method performs competitively with CNN-SLAM as well as ORB-SLAM [27] and LSD-SLAM [6], despite solely computing sequential frame-to-frame alignments, whereas those systems have the added advantage of performing loop closures and local/global bundle adjustments.
3 Method
3.1 Network Architecture
The overall architecture consists of 3 main subsystems in the form of a depth, flow and camera pose network. A large percentage of the model capacity is invested in the depth prediction component for two reasons. Firstly, the output of the depth network also serves as an additional input to the other subsystems. Secondly, we wanted to achieve superior depths for indoor and outdoor environments using a common architecture (although there are separate models for indoor and outdoor scenes, the underlying architecture is common). In order to preserve space and to provide an overall understanding of the data flow, a high-level diagram of the network is shown in Figure 1. An expanded architecture with layer definitions for each of the subsystems is included in the supplementary materials.
3.2 Depth Prediction
The depth prediction network consists of an encoder and a decoder module. The encoder network is largely based on the DenseNet-161 architecture described in [13]. In particular, we use the variant pretrained on ImageNet [32] and slightly increase the receptive field of the pooling layers. As the original input is downsampled 4 times by the encoder, during the decoding stage the feature maps are upsampled back 4 times to make the model fully convolutional. We employ skip connections in order to reintroduce the finer details lost during pooling. Since the first downsampling operation is done at a very early stage of the pipeline and its activations closely resemble the image features, these activations are not reused inside the decoder. Up-project blocks are used to perform upsampling in our network, which provide better depth maps compared to deconvolutional layers as shown in [22].
Due to the availability of dense ground truth data for indoor datasets (e.g. NYUv2 [34], RGBD [35]), this network can be directly utilised to perform supervised learning. Unfortunately, the ground truth data for the outdoor dataset (KITTI) is much sparser, which meant we had to incorporate a semi-supervised learning approach in order to provide a strong training signal. Therefore, during training on KITTI, we use a Siamese version of the depth network with complete weight-sharing, and enforce photometric consistency between the left-right image pairs through an additional loss function. This is similar to the previous approaches [8, 21] and is only required during the training stage; during inference only a single input image is required to perform depth estimation using our network.
3.3 Flow Prediction
The flow network provides an estimation of the optical flow along with the associated confidences given an image pair. These outputs, combined with the predicted depths, allow us to predict the camera pose. As part of our ablation studies we integrated the flow predictions of [14] with our depths; however, the main limitation of this approach was the lack of a mechanism to filter out the dynamic objects which are abundant in outdoor environments. This was solved by estimating confidence, specifically the information matrix, in addition to the optical flow. More concretely, for each pixel our flow network predicts 5 quantities: the optical flow in the x and y directions, and three further quantities which are required to compute the information matrix (the inverse of the covariance matrix) as shown below.
(1) 
This parametrisation guarantees that the information matrix is positive-definite, and it can be used to parametrise any information matrix. We found that the gradients are much more stable compared to predicting the information matrix directly, as the determinant of the matrix is always greater than zero.
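To illustrate why a constrained parametrisation of this kind is convenient, the following is a minimal sketch that assumes a Cholesky-style construction. This is an assumption made purely for illustration and is not necessarily the parametrisation of Equation 1; the function names and the symbols `a`, `b` and `c` are hypothetical. Three unconstrained values are mapped to a lower-triangular factor with an exponentiated diagonal, so the resulting 2x2 matrix is symmetric with a strictly positive determinant.

```python
import math


def information_matrix(a, b, c):
    """Map three unconstrained values to a 2x2 symmetric positive-definite matrix.

    Assumed (hypothetical) parametrisation, not taken from the paper:
    L = [[exp(a), 0], [c, exp(b)]] and the information matrix is L @ L^T.
    """
    l11 = math.exp(a)  # strictly positive diagonal entry
    l22 = math.exp(b)  # strictly positive diagonal entry
    i_xx = l11 * l11
    i_xy = l11 * c
    i_yy = c * c + l22 * l22
    return [[i_xx, i_xy], [i_xy, i_yy]]


def determinant(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Any real inputs give a valid information matrix; the determinant equals
# exp(2 * (a + b)) and therefore never reaches zero for finite inputs.
info = information_matrix(-3.0, 2.0, 50.0)
```

Under this assumed construction the network outputs never need to be clipped or projected, which is one plausible reason such parametrisations tend to train more stably than regressing the matrix entries directly.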
With respect to the architecture, we borrow elements from FlowNet [7] as well as FlowNet 2.0 [14]. As mentioned in [14], FlowNet 2.0 was unable to reliably estimate small motions, which we address with two key changes. Firstly, our flow network takes the predicted depth map as an input, allowing the network to learn the relationship between depth and flow explicitly, including the fact that closer objects appear to move more than objects that are further away from the camera. Secondly, we use "warp-concatenation", where coarse flow estimates are used to warp the CNN features during the decoder stage. This appears to resolve small motions more effectively, particularly on the TUM [35] dataset.
3.4 Pose Estimation
We take two approaches to pose estimation, shown in Figure 2: an iterative approach and a fully-connected (FC) approach. This contrasts the ability of a neural network to estimate pose using the available information with the simplicity of a standard computer vision approach using the available predicted quantities. We use FC layers to provide the network with as wide a receptive field as possible, so that it can be compared more fairly against the iterative approach, which uses all of the inferred quantities.
Iterative
This approach uses a more conventional method for computing relative pose estimates. We use a standard reweighted least squares solver based on the residual flow, given an estimate of the relative transformation. More concretely, we attempt to minimise the following error function with respect to the relative transformation parameters:
(2) 
where the residual term is the total residual flow in normalised camera coordinates, the subscript indicates that only the first two dimensions of the vector are used, each point of the ordered point cloud is expressed with an inverse depth coordinate, the two flow terms are the predicted flow and the estimated flow respectively, and the remaining coordinate term is the pixel coordinate. The current transformation estimate can be expressed as the matrix exponential of a weighted sum of generator matrices, where each weight is the corresponding component of the motion vector, which is a member of the Lie algebra se(3), and each generator matrix corresponds to the relevant motion parameter. As this pipeline is implemented in TensorFlow [1], it allows us to train the network end to end. Please see the supplementary material for a more detailed explanation.
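The quantities described above can be sketched for a single point as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the transformation is assumed to be a 4x4 rigid-body matrix, the point is assumed to be the homogeneous vector [x, y, 1, q] with q the inverse depth, and the per-point cost is assumed to be the residual weighted by a 2x2 information matrix. All function names are hypothetical.

```python
def transform_point(T, x, y, q):
    """Apply the 4x4 transform T to the point [x, y, 1, q] and re-project.

    (x, y) are normalised camera coordinates and q is the inverse depth.
    Only the first two dimensions of the projected point are returned.
    """
    p = [x, y, 1.0, q]
    out = [sum(T[r][k] * p[k] for k in range(4)) for r in range(3)]
    return out[0] / out[2], out[1] / out[2]


def estimated_flow(T, x, y, q):
    """Flow induced purely by the camera motion T at the point (x, y, q)."""
    x2, y2 = transform_point(T, x, y, q)
    return x2 - x, y2 - y


def weighted_squared_residual(T, x, y, q, predicted_flow, info):
    """Information-weighted squared residual r^T I r for one point (assumed form)."""
    ex, ey = estimated_flow(T, x, y, q)
    rx = predicted_flow[0] - ex
    ry = predicted_flow[1] - ey
    return info[0][0] * rx * rx + 2.0 * info[0][1] * rx * ry + info[1][1] * ry * ry


# A pure sideways translation moves near points (large q) further in the
# image than distant points (small q).
T_shift = [[1.0, 0.0, 0.0, 0.1],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
near = estimated_flow(T_shift, 0.0, 0.0, 2.0)
far = estimated_flow(T_shift, 0.0, 0.0, 0.5)
```

A full solver would sum this cost over all pixels and repeatedly update the motion parameters; that loop is omitted here.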
FullyConnected
Similar to Zhou et al. [42] and Ummenhofer et al. [38], we also constructed a pose estimation network based on fully connected layers. This network utilises 3 stacked fully connected layers and uses the same inputs as our iterative method. While we outperform the pose estimation benchmarks of [42] and [38] using this network, the iterative network is our recommended approach due to its close resemblance to conventional geometry-based techniques.
3.5 Loss Functions
Depth Losses
For supervised training on indoor and outdoor datasets we use a reverse Huber loss function [22] defined by
(3) 
where the two depth terms represent the predicted and the ground truth depth respectively. For the KITTI dataset we employed an additional photometric loss during training, as the ground truth is highly sparse. This unsupervised loss term enforces left-right consistency between stereo pairs, defined by
(4) 
where the loss is defined over the left and right images and their corresponding depth maps, together with a normalisation function, the camera intrinsic matrix K, the transformation from pixel to camera coordinates, and the relative transformation matrices from left-to-right and right-to-left respectively. In this case the rotation is assumed to be the identity and the matrices purely translate in the x-direction. Additionally, we use a smoothness term defined by
(5) 
where the two gradient terms are the horizontal and vertical gradients of the predicted depth. This provides qualitatively better depths as well as faster convergence. The final loss function used to train KITTI depths is given by
(6) 
where two of the loss terms are computed on the left and right images separately.
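For reference, the reverse Huber (berHu) loss used for supervised depth training can be sketched as below. The sketch follows the commonly used definition from [22], where the residual is penalised linearly below a threshold and quadratically above it. The threshold of 20% of the largest residual and the mean reduction are assumptions for illustration, and `berhu_loss` is a hypothetical name; handling of pixels without ground truth is omitted.

```python
def berhu_loss(pred, target, threshold_ratio=0.2):
    """Reverse Huber loss over two equally sized lists of depth values.

    Residuals up to the threshold c are penalised with |r| (L1 behaviour);
    larger residuals are penalised with (r^2 + c^2) / (2c) (L2 behaviour).
    The threshold c = threshold_ratio * max|r| is an assumed choice.
    """
    residuals = [abs(p - t) for p, t in zip(pred, target)]
    c = threshold_ratio * max(residuals)
    if c == 0.0:
        return 0.0  # perfect prediction, nothing to penalise
    total = 0.0
    for r in residuals:
        if r <= c:
            total += r
        else:
            total += (r * r + c * c) / (2.0 * c)
    return total / len(residuals)


# Residuals are 0, 0.1 and 1.0, so c = 0.2 and only the last one falls in
# the quadratic branch.
loss = berhu_loss([1.0, 2.0, 4.0], [1.0, 2.1, 3.0])
```

The two branches meet at r = c with the same value and slope, so the loss stays continuous while still putting more weight on large depth errors than a plain L1 loss.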
Flow Loss
The probability distribution of a multivariate Gaussian in 2D can be defined as follows.
(7) 
where the matrix in the exponent is the information matrix, i.e. the inverse of the covariance matrix. The flow loss criterion can now be defined by
(8) 
where the two flow terms are the predicted flow and the ground truth flow respectively. This optimises the network by maximising the log-likelihood of the probability distribution over the residual flow error.
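As a worked illustration of this criterion, the negative log-likelihood of a single flow residual under a zero-mean 2D Gaussian with a given information matrix can be computed as below. This is a sketch under the standard Gaussian density; the constant term and any scaling used in Equation 8 may differ, and `flow_nll` is a hypothetical name.

```python
import math


def flow_nll(pred_flow, gt_flow, info):
    """Negative log-likelihood of the flow residual under a 2D Gaussian.

    info is the 2x2 information matrix (inverse covariance) as nested lists.
    NLL = 0.5 * r^T I r - 0.5 * log(det I) + log(2 * pi).
    """
    rx = pred_flow[0] - gt_flow[0]
    ry = pred_flow[1] - gt_flow[1]
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    if det <= 0.0:
        raise ValueError("information matrix must be positive-definite")
    mahalanobis = (info[0][0] * rx * rx
                   + 2.0 * info[0][1] * rx * ry
                   + info[1][1] * ry * ry)
    return 0.5 * mahalanobis - 0.5 * math.log(det) + math.log(2.0 * math.pi)


# High confidence (a large information matrix) is rewarded when the residual
# is small and penalised heavily when the residual is large.
confident = [[100.0, 0.0], [0.0, 100.0]]
uncertain = [[1.0, 0.0], [0.0, 1.0]]
small_error_confident = flow_nll((0.01, 0.0), (0.0, 0.0), confident)
large_error_confident = flow_nll((1.0, 0.0), (0.0, 0.0), confident)
large_error_uncertain = flow_nll((1.0, 0.0), (0.0, 0.0), uncertain)
```

The log-determinant term is what stops the network from driving all confidences to zero, while the quadratic term stops it from claiming high confidence on pixels where the flow is wrong, such as moving objects.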
Pose Loss
Given two input images, the predicted depth map of one of them and the predicted relative pose, the unsupervised loss and the pose loss can be defined as
(9)  
(10) 
where the logarithm map takes a transformation T from the Lie group SE(3) to the Lie algebra se(3), such that the transformation can be represented by its constituent motion parameters, and the remaining term is the ground truth relative pose parameters.
3.6 Training Regime
We train our network end-to-end on the NYUv2 [34], TUM [35] and KITTI [9] datasets. We use the standard test/train split for NYUv2 and KITTI and define our own scene split for TUM. It is worth mentioning that the amount of training data we used is radically reduced compared to [42] and [38]: for both NYUv2 and KITTI we use only a fraction of the full dataset. We use the Adam optimiser [18] with an initial learning rate of 1e-4 for all experiments, chose TensorFlow [1] as the learning framework, and train using an NVIDIA DGX-1. We provide a detailed training schedule and breakdown in the supplementary material.
4 Results
In this section we summarise the single-image depth prediction and relative pose estimation performance of our system on several popular machine learning and SLAM datasets. We also investigate the effect of using alternative optical flow estimates from [14] and [31] in our pose estimation pipeline as an ablation study. The entire model contains 130M parameters. Our depth estimator runs at 5 fps on an NVIDIA GTX 1080Ti, while the other sub-networks run at 30 fps.
4.1 Depth Estimation
We summarise the results of evaluating our single-image depth estimation on the datasets NYUv2 [34], RGBD [35] and KITTI [9] in Tables 1, 2 and 3 respectively, using the established metrics of [5].
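For readers unfamiliar with these metrics, the sketch below computes the error and accuracy measures that are commonly reported following [5]: root mean squared error, its logarithmic variant, the absolute relative error, and the fraction of pixels whose depth ratio falls below the thresholds 1.25, 1.25^2 and 1.25^3. Which of these correspond to which table column is not stated here, so the selection and the name `depth_metrics` are assumptions for illustration. Depth values are assumed to be strictly positive.

```python
import math


def depth_metrics(pred, gt):
    """Common single-image depth metrics over two equally sized lists."""
    n = len(gt)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
    )
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # Threshold accuracy: share of pixels with max(p/g, g/p) < 1.25^k.
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    deltas = [sum(1 for r in ratios if r < 1.25 ** k) / n for k in (1, 2, 3)]
    return {
        "rmse": rmse,
        "rmse_log": rmse_log,
        "abs_rel": abs_rel,
        "delta_1": deltas[0],
        "delta_2": deltas[1],
        "delta_3": deltas[2],
    }


# One of three pixels is over-estimated by a factor of two.
metrics = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

The error measures are lower-is-better and the threshold accuracies are higher-is-better, which matches the grouping used in the tables.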
We train the Ours(baseline) model purely using the depth loss. This is then extended to use the full end-to-end training loss (depth + flow + pose losses) in the Ours(full) model, which demonstrates a consistent improvement across all datasets, most notably in Tables 2 and 3, for which ground truth pose data was available for training. This validates our approach for improving single-image depth estimation performance, and demonstrates that a network can be improved by enforcing more geometric priors in the loss functions. We would like to mention that the improvement we gain from Ours(baseline) to Ours(full) is purely due to the novel combined loss terms, as the flow and pose sub-networks do not increase the model capacity of the depth sub-network itself.
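| Method | lower is better | lower is better | lower is better | higher is better | higher is better | higher is better |
|---|---|---|---|---|---|---|
| Eigen et al. [4] | 0.641 | 0.214 | 0.16 | 76.9% | 95.0% | 98.8% |
| Laina et al. [22] | 0.573 | 0.195 | 0.13 | 81.1% | 95.3% | 98.8% |
| Kendall et al. [17] | 0.506 | - | 0.110 | 81.7% | 95.9% | 98.9% |
| Ours (baseline) | 0.487 | 0.164 | 0.113 | 86.7% | 97.7% | 99.4% |
| Ours (full) | 0.478 | 0.161 | 0.111 | 87.2% | 97.8% | 99.5% |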
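| Cap | Method | lower is better | lower is better | lower is better | higher is better | higher is better | higher is better |
|---|---|---|---|---|---|---|---|
| 0-80m | Zhou et al. [42] | 6.856 | 0.283 | 0.208 | 67.8% | 88.5% | 95.7% |
| 0-80m | Godard et al. [10] | 4.935 | 0.206 | 0.141 | 86.1% | 94.9% | 97.6% |
| 0-80m | Kuznietsov et al. [21] | 4.621 | 0.189 | 0.113 | 86.2% | 96.0% | 98.6% |
| 0-80m | Ours (baseline) | 4.394 | 0.178 | 0.095 | 89.4% | 96.6% | 98.6% |
| 0-80m | Ours (full) | 4.301 | 0.173 | 0.096 | 89.5% | 96.8% | 98.7% |
| 0-50m | Zhou et al. [42] | 5.181 | 0.264 | 0.201 | 69.6% | 90.0% | 96.6% |
| 0-50m | Garg et al. [8] | 5.104 | 0.273 | 0.169 | 74.0% | 90.4% | 96.2% |
| 0-50m | Godard et al. [10] | 3.729 | 0.194 | 0.108 | 87.3% | 95.4% | 97.9% |
| 0-50m | Kuznietsov et al. [21] | 3.518 | 0.179 | 0.108 | 87.5% | 96.4% | 98.8% |
| 0-50m | Ours (baseline) | 3.359 | 0.168 | 0.092 | 90.5% | 97.0% | 98.8% |
| 0-50m | Ours (full) | 3.284 | 0.164 | 0.092 | 90.6% | 97.1% | 98.9% |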
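| Method | lower is better | lower is better | lower is better | lower is better | higher is better | higher is better | higher is better |
|---|---|---|---|---|---|---|---|
| Laina et al. [22] | 1.275 | 0.481 | 0.189 | 0.371 | 75.3% | 89.1% | 91.8% |
| DeMoN(est) [38] | 2.980 | 0.910 | 1.413 | 5.109 | 21.0% | 36.6% | 48.9% |
| DeMoN(gt) [38] | 1.584 | 0.555 | 0.301 | 0.581 | 52.7% | 70.7% | 80.7% |
| Ours (baseline) | 1.068 | 0.353 | 0.128 | 0.236 | 86.9% | 92.2% | 93.5% |
| Ours (full) | 0.996 | 0.329 | 0.108 | 0.194 | 90.3% | 93.6% | 94.5% |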
Additionally, we include qualitative results for NYUv2 [34] and KITTI [9] in Figures 3 and 4 respectively, each of which illustrates a noticeable improvement over previous methods. We also demonstrate that the improvement goes beyond the numbers, as our approach generates more convincing depths even when the RMSE may be higher, as is the case in the second row of Figure 3, where [22] achieves a lower RMSE. More impressive still are the results in Figure 4, where we compare against previous approaches that are trained on much larger training sets than our own, and still show noticeable qualitative and quantitative improvements.
4.2 Pose Estimation
To demonstrate the ability of our approach to perform accurate relative pose estimation, we evaluate it on several unseen sequences from the datasets for which ground-truth poses were available. To quantitatively evaluate the trajectories we use the absolute trajectory error (ATE) and the relative pose error (RPE) as proposed in [35]. To mitigate the effect of scale drift on these quantities, we scale all poses to the associated ground-truth poses during evaluation. Using both metrics provides an estimate of the consistency of each pose estimation approach. We summarise the results of this quantitative analysis for KITTI [9] in Table 4 and for RGBD [35] in Table 5. We include comparisons against other state-of-the-art pose estimation networks, namely SfMLearner [42] and DeMoN [38]. Additionally, we include results from current state-of-the-art SLAM systems, namely ORB-SLAM2 [26] and LSD-SLAM [6].
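The scaled trajectory evaluation described above can be sketched as follows. This is a simplified illustration and not the evaluation tooling of [35]: it assumes the estimated and ground-truth positions are already time-associated and rigidly aligned, fits a single least-squares scale factor, and reports the ATE as the RMSE of the remaining translational differences. The exact scaling procedure used for the reported numbers is not specified here, and the function names are hypothetical.

```python
import math


def scale_to_ground_truth(est, gt):
    """Least-squares scale s minimising sum ||s * est_i - gt_i||^2."""
    num = sum(e[i] * g[i] for e, g in zip(est, gt) for i in range(3))
    den = sum(e[i] * e[i] for e in est for i in range(3))
    if den == 0.0:
        raise ValueError("estimated trajectory has no extent to scale")
    return num / den


def absolute_trajectory_error(est, gt):
    """RMSE of position differences after scaling est to gt."""
    s = scale_to_ground_truth(est, gt)
    squared = [
        sum((s * e[i] - g[i]) ** 2 for i in range(3)) for e, g in zip(est, gt)
    ]
    return math.sqrt(sum(squared) / len(squared))


# A trajectory that is correct up to a global scale of 2 has zero ATE once
# the scale has been compensated.
gt_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
half_scale = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
ate = absolute_trajectory_error(half_scale, gt_positions)
```

Because a single global scale is removed before the error is computed, a system with a consistent but wrong scale is not penalised, whereas drift in scale along the trajectory still shows up in the ATE.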
In Table 4 we show that our approach performs comparably to state-of-the-art SLAM systems. We demonstrate a noticeable improvement over SfMLearner on both sequences in all metrics. We evaluate SfMLearner on its frame-to-frame tracking performance for adjacent frames (SfMLearner(1)) and for separations of 5 frames (SfMLearner(5)), as they train their approach to estimate this size of frame gap. Even with the massive reduction in accumulation error expected from taking larger frame gaps (demonstrated by the reduced ATE), our system still produces more accurate pose estimates.
We show the resulting scaled trajectories of sequence 09 in Figure 5, as well as the relative scaling of each trajectory's poses in a box plot. The spread of scales present for SfMLearner indicates that scale is essentially ignored by their system, with scale drifts ranging across a full log scale, while the spreads for ORB-SLAM and our approach are barely visible at this scale. Another thing to note is that our scale is centered around 1.0, as we estimate scale directly by estimating metric depths. This seems to provide a strong benefit in terms of reducing scale drift, and we believe it makes our system more usable in practice.
In Table 5 we show a significant improvement in performance against existing machine learning approaches across several sequences from the RGBD dataset [35]. We evaluate against DeMoN [38] in two ways: frame-to-frame (DeMoN(1)), and, to provide the same advantage to DeMoN as to SfMLearner, with a wider baseline using a frame gap of 10 (DeMoN(10)), which they claim improves their depth estimations [38]. It can be observed that even with the massive reduction in accumulation error that this gives DeMoN over our frame-to-frame approach, we still manage to significantly outperform their approach in ATE, even surpassing LSD-SLAM on the sequence fr1xyz. ORB-SLAM is still the clear winner, as it benefits massively from the ability to perform local bundle adjustments on the sequences used, which are short trajectories of small scenes. We include an example of a frame from the sequence fr3walkxyz in Figure 7, which shows that this scene is not static, but our system has the ability to deal with this through the flow confidence estimates, discussed in Section 4.3.
| Method | Seq. 09 ATE(m) | Seq. 09 RPE(m) | Seq. 09 RPE(°) | Seq. 10 ATE(m) | Seq. 10 RPE(m) | Seq. 10 RPE(°) |
|---|---|---|---|---|---|---|
| ORB-SLAM (no loop) [26] | 57.57 | 0.040 | 0.103 | 8.090 | 0.033 | 0.105 |
| ORB-SLAM (full) [26] | 9.104 | 0.056 | 0.084 | 7.349 | 0.031 | 0.100 |
| SfMLearner(5) [42] | 58.31 | 0.077 | 0.803 | 31.75 | 0.069 | 1.242 |
| SfMLearner(1) [42] | 81.09 | 0.050 | 0.976 | 75.89 | 0.045 | 1.599 |
| Ours (fully connected) | 41.50 | 0.087 | 0.387 | 29.29 | 0.081 | 0.486 |
| Ours (full) | 16.55 | 0.047 | 0.128 | 9.846 | 0.039 | 0.138 |
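| Method | fr1xyz ATE(m) | fr1xyz RPE(m) | fr1xyz RPE(°) | fr2360hs ATE(m) | fr2360hs RPE(m) | fr2360hs RPE(°) | fr3walkxyz ATE(m) | fr3walkxyz RPE(m) | fr3walkxyz RPE(°) |
|---|---|---|---|---|---|---|---|---|---|
| LSD-SLAM [6] | 0.090 | - | - | - | - | - | 0.124 | - | - |
| ORB-SLAM [26] | 0.009 | 0.007 | 0.645 | - | - | - | 0.012 | 0.013 | 0.694 |
| DeMoN(10) [38] | 0.178 | 0.021 | 1.193 | 0.601 | 0.035 | 2.243 | 0.265 | 0.049 | 1.447 |
| DeMoN(1) [38] | 0.183 | 0.037 | 3.612 | 0.669 | 0.032 | 3.233 | 0.279 | 0.040 | 3.174 |
| Ours (fully connected) | 0.169 | 0.028 | 1.887 | 0.883 | 0.030 | 1.799 | 0.268 | 0.044 | 1.698 |
| Ours (iterative) | 0.071 | 0.024 | 1.237 | 0.461 | 0.020 | 0.736 | 0.240 | 0.026 | 0.811 |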
4.3 Ablation Experiments
In order to examine the contribution of each component of our pose estimation network, we compare the pose estimates under various configurations on sequences 09 and 10 of the KITTI odometry dataset [9], summarised in Table 6. We examine the relative improvement of iterating our pose estimation until convergence against a single weighted least-squares iteration, which demonstrates that iterating has a significantly positive effect. We demonstrate the improved utility of our flows by replacing our flow estimates with other state-of-the-art flow estimation methods from [14] and [31] in our pose estimation pipeline, and consistently demonstrate an improvement using our approach. We show the result of optimising with and without our estimated confidences, demonstrating quantitatively how important they are to pose estimation accuracy, with significant reductions in error across all metrics.
| Method | Seq. 09 ATE(m) | Seq. 09 RPE(m) | Seq. 09 RPE(°) | Seq. 10 ATE(m) | Seq. 10 RPE(m) | Seq. 10 RPE(°) |
|---|---|---|---|---|---|---|
| Ours (no conf) | 53.40 | 0.356 | 0.931 | 58.50 | 0.308 | 1.058 |
| Ours (no conf, iterative) | 33.18 | 0.248 | 0.421 | 35.87 | 0.280 | 0.803 |
| FlowNet 2.0 [14] | 29.64 | 0.349 | 0.838 | 51.90 | 0.222 | 0.954 |
| FlowNet 2.0 (iterative) [14] | 24.61 | 0.185 | 0.400 | 22.61 | 0.142 | 0.484 |
| EpicFlow [31] | 119.0 | 0.566 | 0.931 | 20.98 | 0.199 | 0.853 |
| EpicFlow (iterative) [31] | 59.79 | 0.379 | 0.459 | 14.80 | 0.154 | 0.581 |
| Ours (full, single iteration) | 31.20 | 0.089 | 0.324 | 24.10 | 0.095 | 0.389 |
| Ours (full, until convergence) | 16.55 | 0.047 | 0.128 | 9.846 | 0.039 | 0.138 |
We also demonstrate qualitatively one of the ways in which estimating confidence improves our pose estimation in Figure 7. This shows that our system has learned that the confidence on moving objects is lower than on their surroundings and that the confidence on edges is higher, helping our system focus on salient information during optimisation in an approach similar to [6].
5 Conclusion and Further Work
We present the first piece of work that performs least squares based pose estimation inside a neural network. Instead of replacing every component of the SLAM pipeline with CNNs, we argue it is better to use CNNs for tasks that greatly benefit from feature extraction (depth and flow prediction) and to use geometry for tasks where it is proven to work well (motion estimation given the depths and flow). Our formulation is fully differentiable and is trained end-to-end. We achieve state-of-the-art performance on single-image depth prediction for both the NYUv2 [34] and KITTI [9] datasets. We demonstrate both qualitatively and quantitatively that our system is capable of producing better visual odometry that considerably reduces scale drift by predicting metric depths.
Supplementary Material
I Dataset Evaluation Analysis
In this section we evaluate and analyse the relative performance on each dataset as well as correlations in the dataset and how they relate to overall performance.
i.i NYUv2[34]
The NYUv2 dataset [34] has been a popular benchmark for indoor depth estimation and semantic segmentation since the work of Eigen et al. [4]. We provide several qualitative and quantitative results from the evaluation of our approach in Figures 9, 10 and 22. These show our strongest, median and worst performing images, as well as each prediction's RMSE in meters. This reveals two insights about our system's performance: we perform better on images with closer median depths, and our largest errors occur when we incorrectly estimate the overall scale of the scene. The relationship to median depth is evident in Figure 8, where the RMSE is strongly correlated with the median scene depth. We also observe a similarly strong correlation in the performance of all three approaches, although our approach overwhelmingly outperforms the competitors.
What conclusions can we draw from these results? This is largely a consequence of the choice of error metric used to rank the results. As we rank by RMSE, we would expect the images with larger depths to have the largest error, as only very large predictions or very large ground truth values can generate large RMSE values. This also indicates that our network tends to behave conservatively, estimating that the scene is closer on average rather than further. This is probably a direct result of the depth value distribution in the training set, potentially biasing the depths towards the lower end.
i.ii KITTI [9]
Our most impressive performance is perhaps on the KITTI benchmark dataset [9], where, as shown in Figure 12 (left), we consistently outperform the competing approaches on almost all test images. The scale of the depth error had to be changed in order to capture the full range of errors. This could be because the competing approaches estimate inverse depth/disparity and invert the predicted values to compute their loss function. This can lead to unstable performance at large distances, due to the non-linearity of this section, as opposed to our approach, which is linear in all depths.
Again we include the analysis of the median depths sorted against the RMSE in Figure 12 (right), as we did for NYUv2. In this case the relationship between error and depth is largely reduced. This is most likely due to the nature of the dataset, which contains a very similar spread of data for most images in the training set, as they film similar scenarios. However, the relationship is still visible in Figure 13, where the scenes contain comparatively low depth values, indicating again that our system behaves conservatively in estimating depths.
i.iii Comparison using the architecture of Kuznietsov [21]
We replaced the architecture of our depth estimation network with that of Kuznietsov et al. [21]. As can be seen below, by using the full training loss we are able to improve the accuracy of the depth estimation results, indicating the generality of the approach.
II Pose Trajectories
ii.i KITTI Trajectories
We include the trajectory from sequence 10 of the odometry dataset from KITTI [9]. For the quantitative results please refer to the main paper. The resulting trajectories in Figure 16 indicate the comparatively strong performance of our approach, and show that our iterative approach (bottom-left) is significantly more accurate than the FC approach (top-right). This trajectory contains no loops, and as such could lead to significant scale drift in some SLAM systems; however, in this case the frequent local bundle adjustments performed by ORB-SLAM seem to have helped to maintain the map quality throughout.
ii.ii RGBD Trajectories
We show the estimated aligned trajectories for several sequences from the RGBD dataset [35], to demonstrate the qualitative performance of our system against previous approaches. We summarise the trajectories in Figure 17, which demonstrates our favourable performance against DeMoN [38]. This is despite our method estimating only frame-to-frame relative poses from adjacent frames, while DeMoN(10) uses a larger baseline and thus should estimate a smoother trajectory given the reduction in accumulation error. Ultimately ORB-SLAM [26] is still the clear winner, as it uses information from multiple frames, and iteratively aggregates error across short sections. However, as our approach is purely VO, we were able to obtain a trajectory for fr2360hs, which we were unable to do for ORB-SLAM due to the challenging nature of the camera motion and rapid lighting changes.
ii.iii Comparison With CNNSLAM
In this table we include a comparison of our approach on the datasets used by CNN-SLAM [37]. We would like to point out that our method performs competitively despite solely computing sequential frame-to-frame alignments, and does not (yet) take advantage of the loop closures and local/global bundle adjustments used by the competing methods.
| Method | ATE TUM/seq1 | ATE TUM/seq2 | ATE TUM/seq3 |
|---|---|---|---|
| CNN-SLAM | 0.542 | 0.243 | 0.214 |
| LSD-SLAM | 1.826 | 0.436 | 0.937 |
| ORB-SLAM | 1.206 | 0.495 | 0.733 |
| Ours (fc) | 1.043 | 0.672 | 0.186 |
| Ours (full) | 0.799 | 0.587 | 0.157 |
III Optical Flow
IV Network Architectures
iv.i Depth Network
The encoder takes a global-mean-subtracted RGB image as input. During the feature encoding stage the resolution of the activations is reduced by a factor of 16. The first downsampling operation is performed using a strided convolutional layer, the next with a max-pooling layer and the final two with average-pooling layers. Upsampling is performed using the up-project blocks proposed in [22]. Since the first downsampling operation is performed by the very first convolutional layer, whose activations closely resemble image features, these activations are not provided to the decoder via a skip connection. It should be noted that ours is not the first work to predict depth using a DenseNet architecture: Kendall et al. [17] also used a DenseNet variant, and the gains that we obtain are predominantly due to the loss functions we employed. Appendix I shows the full breakdown of the architecture.
iv.ii Flow Network
The flow network has three streams. The first stream takes the left image and its predicted depth map as input, the second stream receives the right image and the corresponding predicted depth map and, finally, the third stream receives both the left and right images and their associated depth predictions. Barring the first layer, all other layers of each stream share their weights. During the decoder stage the predicted flow is used to perform warp-concatenations, where the right image's activations are warped and concatenated with those of the left image. Since we estimate optical flow in a coarse-to-fine manner, where the latter layers compute a residual to be added to the initial flow estimate, warp-concatenations help to capture small displacements more effectively.
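The warp-concatenation and residual refinement described above can be sketched in a few lines of NumPy. This is a hedged illustration, not the network code: it assumes the flow is expressed in pixels at the current resolution and maps left-image pixels to right-image pixels, and the bilinear sampling, border clamping and nearest-neighbour upsampling are choices made for this sketch that the text does not specify. All function names are hypothetical.

```python
import numpy as np

def bilinear_warp(features, flow):
    """Sample `features` (H, W, C) at positions displaced by `flow` (H, W, 2).

    flow[..., 0] is the x displacement and flow[..., 1] the y displacement,
    in pixels. Sample positions outside the image are clamped to the border.
    """
    h, w, _ = features.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * features[y0, x0] + wx * features[y0, x1]
    bottom = (1 - wx) * features[y1, x0] + wx * features[y1, x1]
    return (1 - wy) * top + wy * bottom

def warp_concat(left_feat, right_feat, flow):
    """Warp the right activations towards the left view and concatenate."""
    warped = bilinear_warp(right_feat, flow)
    return np.concatenate([left_feat, warped], axis=-1)

def refine_flow(coarse_flow, residual):
    """Upsample a coarse flow by 2 (nearest neighbour) and add a residual."""
    # Displacements double in magnitude when the resolution doubles.
    up = np.repeat(np.repeat(coarse_flow, 2, axis=0), 2, axis=1) * 2.0
    return up + residual
```

Once the right activations have been warped by the current flow estimate, the remaining misalignment is small, so the later layers only need to predict a small residual.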
V Pose Network
v.i Iterative Reweighted Approach
As described in the main body of the paper, we are attempting to minimise the following error function with respect to the relative transformation parameters ():
(11) 
For simplicity, we express the values in terms of normalised camera coordinates. The estimated flow is computed from the normalised camera coordinate and the current estimated transformation as shown in Equation 11. To simplify the mathematics we can represent the transformation using a matrix exponential as , where is the component of the motion vector , which is a member of the Lie algebra , and is the generator matrix corresponding to the relevant motion parameter. We can now differentiate the residual function with respect to the motion parameters to generate the following Jacobian:
(12) 
where is the Jacobian, which can be stacked to form a larger Jacobian matrix , additionally the residual vectors can be stacked . This allows us to iteratively reduce the loss function using a standard Gauss-Newton approach given by
(13) 
where is the additive update to the motion parameters , and W is a diagonal weight matrix . is the weight matrix defined by
(14) 
where is the confidence value in the x-direction, is a constant that is computed from the residual (Equation 11), to be the mean residual magnitude of a single image, and is the residual in the x-direction. This pipeline is implemented in TensorFlow [1] and allows us to train the network end to end.
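The two ingredients of this section, the matrix-exponential parameterisation and the weighted Gauss-Newton update, can be illustrated with a small NumPy sketch. Several things here are assumptions rather than details taken from the text: the ordering of the six generators (translations first, then rotations), the sign convention of the update (delta = -(J^T W J)^{-1} J^T W r), and in particular the `cauchy_weights` function, which is a generic robust weight standing in for Equation 14 and not a reconstruction of it. All function names are hypothetical.

```python
import numpy as np

def se3_generators():
    """Six 4x4 generator matrices: three translations, then three rotations."""
    G = np.zeros((6, 4, 4))
    G[0, 0, 3] = G[1, 1, 3] = G[2, 2, 3] = 1.0   # translation along x, y, z
    G[3, 2, 1], G[3, 1, 2] = 1.0, -1.0           # rotation about x
    G[4, 0, 2], G[4, 2, 0] = 1.0, -1.0           # rotation about y
    G[5, 1, 0], G[5, 0, 1] = 1.0, -1.0           # rotation about z
    return G

def se3_exp(alpha, terms=30):
    """Matrix exponential of sum_i alpha_i G_i via a truncated power series."""
    A = np.tensordot(np.asarray(alpha, dtype=float), se3_generators(), axes=1)
    T = np.eye(4)
    term = np.eye(4)
    for k in range(1, terms):
        term = term @ A / k          # accumulates A^k / k!
        T = T + term
    return T

def weighted_gauss_newton_step(J, r, w):
    """Solve (J^T W J) delta = -J^T W r for a diagonal W = diag(w)."""
    JW = J.T * w                     # equals J.T @ diag(w) without forming W
    return np.linalg.solve(JW @ J, -JW @ r)

def cauchy_weights(r, eps=1e-12):
    """Stand-in robust weights (NOT Equation 14): small for large residuals."""
    c = np.mean(np.abs(r)) + eps     # scale taken from the mean residual magnitude
    return 1.0 / (1.0 + (r / c) ** 2)

def irls(residual_fn, jacobian_fn, theta0, iterations=10):
    """Iteratively reweighted Gauss-Newton with an additive parameter update."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        r = residual_fn(theta)
        w = cauchy_weights(r)        # weights are recomputed at every iteration
        theta = theta + weighted_gauss_newton_step(jacobian_fn(theta), r, w)
    return theta
```

Because the weights are recomputed from the current residuals at every iteration, measurements that disagree with the current motion estimate contribute progressively less to the normal equations.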
v.ii Network Based Approach
This section was addressed in detail in the main body.
Vi Training Procedure
vi.i Depth Training
All of the DenseNet-161 [13] layers of the depth network are initialised using ImageNet [32] pretrained weights. The remaining layers are initialised using MSRA [11] initialisation. The NYUv2 [34] and TUM [35] models are trained purely using the supervised loss term. The network is regularized using a weight decay of 1 throughout training and the learning rate schedule is shown below:
Out of the 400,000 images in the NYU dataset, we use only 12,000 during training. We augment the data four times (a total training set of 48,000 images) using color shifts, random crops and left-right flips. Although data augmentation can be implemented during training, we noticed a considerable speedup by performing it offline. The training images and the corresponding ground truth are downsampled by a factor of 2; hence, the resolution of each training example becomes 320×240. Each training batch contains 8 images and we use 4 GPUs, resulting in an overall batch size of 32. In terms of training speed we observe on average 19.3 examples/sec or 0.415 sec/batch.
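The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes the native NYUv2 frame size of 640×480 and reuses the factor-of-16 encoder reduction described for the depth network; the variable names are illustrative. One observation that follows from the arithmetic, offered tentatively: 8 images divided by 0.415 sec/batch gives about 19.3 examples/sec, which suggests the quoted throughput refers to an 8-image batch.

```python
# Training-set size after offline augmentation.
selected_images = 12_000
augmentation_factor = 4
training_examples = selected_images * augmentation_factor

# Resolution of each training example (640x480 is an assumed native size).
full_width, full_height = 640, 480
downsample = 2
train_width = full_width // downsample
train_height = full_height // downsample

# Batch size across GPUs.
per_gpu_batch, num_gpus = 8, 4
overall_batch = per_gpu_batch * num_gpus

# Throughput implied by the reported time per batch, for an 8-image batch.
seconds_per_batch = 0.415
examples_per_second = per_gpu_batch / seconds_per_batch

# Spatial size of the encoder output after a 16x reduction.
encoder_reduction = 16
bottleneck = (train_width // encoder_reduction, train_height // encoder_reduction)

print(training_examples, (train_width, train_height), overall_batch)
print(round(examples_per_second, 1), bottleneck)
```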
For the KITTI dataset we use 10,000 training images. From the training images defined in [5], we further prune our training set to exclude any images that are part of the odometry test set. We adopt a learning rate schedule that spans half the duration of the NYU schedule. This is primarily to avoid overfitting, as we are now working with a comparatively small training dataset.
vi.i.1 Optical Flow Training
To compute the ground truth optical flow image for the NYUv2 [34] dataset, we first compute the camera pose using the Iterative Closest Point (ICP) algorithm; the pose can then be used with the ground truth depth map to compute optical flow. This process is slightly simpler for the TUM [35] dataset, as the ground truth pose is provided. The network is then trained using the optical flow loss criterion. All the layers of the flow network are initialised using MSRA [11] initialisation and the learning rate schedule is shown in Figure 24. As can be seen, the training duration is much shorter than that of the depth network, as the primary objective at this stage is to obtain a crude representation of both the optical flow and the information matrix. Complete end-to-end fine-tuning happens when the network is trained using the pose loss criterion.
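The step from depth and pose to optical flow can be made concrete with a short NumPy sketch: back-project each pixel using its depth, apply the rigid motion, re-project, and take the pixel displacement. This is an illustration under stated assumptions, not the data-generation code: the scene is assumed static, `K` is a hypothetical pinhole intrinsic matrix, and `T` is assumed to map points from the first camera frame to the second. The function name is illustrative.

```python
import numpy as np

def flow_from_depth_and_pose(depth, K, T):
    """Optical flow induced by a rigid camera motion on a static scene.

    depth: (H, W) depth map of the first view.
    K:     3x3 pinhole intrinsics (assumed, shared by both views).
    T:     4x4 transform taking points from the first camera frame to the second.
    Returns an (H, W, 2) array of pixel displacements (x, y).
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pixels @ np.linalg.inv(K).T                    # normalised camera coordinates
    points = rays * depth[..., None]                      # back-project with depth
    moved = points @ T[:3, :3].T + T[:3, 3]               # apply the rigid motion
    projected = moved @ K.T                               # re-project into the image
    uv2 = projected[..., :2] / projected[..., 2:3]
    return uv2 - pixels[..., :2]
```

For a fronto-parallel plane at depth Z and a pure sideways translation t_x, this reduces to a uniform horizontal flow of f_x t_x / Z pixels, which is a convenient sanity check.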
vi.i.2 Pose Training
We optimize the full network end-to-end using the pose loss and demonstrate that the state-of-the-art depths can be further improved using the knowledge of pose. We train the network for 20,000 iterations with an initial learning rate that is halved at the halfway point.
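The step schedule described above is simple enough to state as a function. In this minimal sketch the initial learning rate is left as a parameter, since its value is not recoverable here, and the function name is hypothetical.

```python
def pose_learning_rate(iteration, initial_lr, total_iterations=20_000):
    """Step schedule: `initial_lr` for the first half, half of it afterwards."""
    if not 0 <= iteration < total_iterations:
        raise ValueError("iteration outside the training run")
    halfway = total_iterations // 2
    return initial_lr if iteration < halfway else initial_lr * 0.5
```

With the default run length, iterations 0 to 9,999 use the initial rate and iterations 10,000 to 19,999 use half of it.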
Appendix I
We include a full table description of our depth network, for completeness.
Depth Net
Model Architecture Breakdown
Layer  Channels (I/O)  Scaling  Inputs
conv1  3/96  2  Input Image 
pool1  96/96  4  conv1 
conv2_1_x1  96/192  4  pool1 
conv2_1_x2  192/48  4  conv2_1_x1 
concat2_1  144/144  4  conv2_1_x2, pool1 
DenseBlk_1  144/48  4  concat2_1 
concat2_2  192/192  4  DenseBlk_1, concat2_1 
DenseBlk_2  192/48  4  concat2_2 
concat2_3  240/240  4  DenseBlk_2, concat2_2 
DenseBlk_3  240/48  4  concat2_3 
concat2_4  288/288  4  DenseBlk_3, concat2_3 
DenseBlk_4  288/48  4  concat2_4 
concat2_5  336/336  4  DenseBlk_4, concat2_4 
DenseBlk_5  336/48  4  concat2_5 
concat2_6  384/384  4  DenseBlk_5, concat2_5 
conv2_blk  384/192  4  concat2_6 
pool2  192/192  8  conv2_blk
DenseBlk_6  192/48  8  pool2 
concat3_1  240/240  8  DenseBlk_6, pool2 
DenseBlk_7  240/48  8  concat3_1 
concat3_2  288/288  8  DenseBlk_7, concat3_1 
DenseBlk_8  288/48  8  concat3_2 
concat3_3  336/336  8  DenseBlk_8, concat3_2 
DenseBlk_9  336/48  8  concat3_3
concat3_4  384/384  8  DenseBlk_9, concat3_3 
DenseBlk_10  384/48  8  concat3_4 
concat3_5  432/432  8  DenseBlk_10, concat3_4 
DenseBlk_11  432/48  8  concat3_5 
concat3_6  480/480  8  DenseBlk_11, concat3_5 
DenseBlk_12  480/48  8  concat3_6 
concat3_7  528/528  8  DenseBlk_12, concat3_6 
DenseBlk_13  528/48  8  concat3_7 
concat3_8  576/576  8  DenseBlk_13, concat3_7 
DenseBlk_14  576/48  8  concat3_8 
concat3_9  624/624  8  DenseBlk_14, concat3_8 
DenseBlk_15  624/48  8  concat3_9 
concat3_10  672/672  8  DenseBlk_15, concat3_9 
DenseBlk_16  672/48  8  concat3_10 
concat3_11  720/720  8  DenseBlk_16, concat3_10 
DenseBlk_17  720/48  8  concat3_11 
concat3_12  768/768  8  DenseBlk_17, concat3_11 
conv3_blk  768/384  8  concat3_12 
pool3  384/384  16  conv3_blk 
DenseBlk_18  384/48  16  pool3
concat4_1  432/432  16  DenseBlk_18, pool3
DenseBlk_19  432/48  16  concat4_1
concat4_2  480/480  16  DenseBlk_19, concat4_1
DenseBlk_20  480/48  16  concat4_2
concat4_3  528/528  16  DenseBlk_20, concat4_2
DenseBlk_21  528/48  16  concat4_3
concat4_4  576/576  16  DenseBlk_21, concat4_3
DenseBlk_22  576/48  16  concat4_4
concat4_5  624/624  16  DenseBlk_22, concat4_4
DenseBlk_23  624/48  16  concat4_5
concat4_6  672/672  16  DenseBlk_23, concat4_5
DenseBlk_24  672/48  16  concat4_6
concat4_7  720/720  16  DenseBlk_24, concat4_6
DenseBlk_25  720/48  16  concat4_7
concat4_8  768/768  16  DenseBlk_25, concat4_7
DenseBlk_26  768/48  16  concat4_8
concat4_9  816/816  16  DenseBlk_26, concat4_8
DenseBlk_27  816/48  16  concat4_9
concat4_10  864/864  16  DenseBlk_27, concat4_9
DenseBlk_28  864/48  16  concat4_10
concat4_11  912/912  16  DenseBlk_28, concat4_10
DenseBlk_29  912/48  16  concat4_11
concat4_12  960/960  16  DenseBlk_29, concat4_11
DenseBlk_30  960/48  16  concat4_12
concat4_13  1008/1008  16  DenseBlk_30, concat4_12
DenseBlk_31  1008/48  16  concat4_13
concat4_14  1056/1056  16  DenseBlk_31, concat4_13
DenseBlk_32  1056/48  16  concat4_14
concat4_15  1104/1104  16  DenseBlk_32, concat4_14
DenseBlk_33  1104/48  16  concat4_15
concat4_16  1152/1152  16  DenseBlk_33, concat4_15
DenseBlk_34  1152/48  16  concat4_16
concat4_17  1200/1200  16  DenseBlk_34, concat4_16
DenseBlk_35  1200/48  16  concat4_17
concat4_18  1248/1248  16  DenseBlk_35, concat4_17
DenseBlk_36  1248/48  16  concat4_18
concat4_19  1296/1296  16  DenseBlk_36, concat4_18
DenseBlk_37  1296/48  16  concat4_19
concat4_20  1344/1344  16  DenseBlk_37, concat4_19
DenseBlk_38  1344/48  16  concat4_20
concat4_21  1392/1392  16  DenseBlk_38, concat4_20
DenseBlk_39  1392/48  16  concat4_21
concat4_22  1440/1440  16  DenseBlk_39, concat4_21
DenseBlk_40  1440/48  16  concat4_22
concat4_23  1488/1488  16  DenseBlk_40, concat4_22
DenseBlk_41  1488/48  16  concat4_23
concat4_24  1536/1536  16  DenseBlk_41, concat4_23
DenseBlk_42  1536/48  16  concat4_24
concat4_25  1584/1584  16  DenseBlk_42, concat4_24
DenseBlk_43  1584/48  16  concat4_25
concat4_26  1632/1632  16  DenseBlk_43, concat4_25
DenseBlk_44  1632/48  16  concat4_26
concat4_27  1680/1680  16  DenseBlk_44, concat4_26
DenseBlk_45  1680/48  16  concat4_27
concat4_28  1728/1728  16  DenseBlk_45, concat4_27
DenseBlk_46  1728/48  16  concat4_28
concat4_29  1776/1776  16  DenseBlk_46, concat4_28
DenseBlk_47  1776/48  16  concat4_29
concat4_30  1824/1824  16  DenseBlk_47, concat4_29
DenseBlk_48  1824/48  16  concat4_30
concat4_31  1872/1872  16  DenseBlk_48, concat4_30
DenseBlk_49  1872/48  16  concat4_31
concat4_32  1920/1920  16  DenseBlk_49, concat4_31
DenseBlk_50  1920/48  16  concat4_32
concat4_33  1968/1968  16  DenseBlk_50, concat4_32
DenseBlk_51  1968/48  16  concat4_33
concat4_34  2016/2016  16  DenseBlk_51, concat4_33
DenseBlk_52  2016/48  16  concat4_34
concat4_35  2064/2064  16  DenseBlk_52, concat4_34
DenseBlk_53  2064/48  16  concat4_35
concat4_36  2112/2112  16  DenseBlk_53, concat4_35
conv4_blk  2112/1056  16  concat4_36
DenseBlk_54  1056/48  16  conv4_blk 
concat5_1  1104/1104  16  conv4_blk, DenseBlk_54 
DenseBlk_55  1104/48  16  concat5_1
concat5_2  1152/1152  16  DenseBlk_55, concat5_1
DenseBlk_56  1152/48  16  concat5_2
concat5_3  1200/1200  16  DenseBlk_56, concat5_2
DenseBlk_57  1200/48  16  concat5_3
concat5_4  1248/1248  16  DenseBlk_57, concat5_3
DenseBlk_58  1248/48  16  concat5_4
concat5_5  1296/1296  16  DenseBlk_58, concat5_4
DenseBlk_59  1296/48  16  concat5_5
concat5_6  1344/1344  16  DenseBlk_59, concat5_5
DenseBlk_60  1344/48  16  concat5_6
concat5_7  1392/1392  16  DenseBlk_60, concat5_6
DenseBlk_61  1392/48  16  concat5_7
concat5_8  1440/1440  16  DenseBlk_61, concat5_7
DenseBlk_62  1440/48  16  concat5_8
concat5_9  1488/1488  16  DenseBlk_62, concat5_8
DenseBlk_63  1488/48  16  concat5_9
concat5_10  1536/1536  16  DenseBlk_63, concat5_9
DenseBlk_64  1536/48  16  concat5_10
concat5_11  1584/1584  16  DenseBlk_64, concat5_10
DenseBlk_65  1584/48  16  concat5_11
concat5_12  1632/1632  16  DenseBlk_65, concat5_11
DenseBlk_66  1632/48  16  concat5_12
concat5_13  1680/1680  16  DenseBlk_66, concat5_12
DenseBlk_67  1680/48  16  concat5_13
concat5_14  1728/1728  16  DenseBlk_67, concat5_13
DenseBlk_68  1728/48  16  concat5_14
concat5_15  1776/1776  16  DenseBlk_68, concat5_14
DenseBlk_69  1776/48  16  concat5_15
concat5_16  1824/1824  16  DenseBlk_69, concat5_15
DenseBlk_70  1824/48  16  concat5_16
concat5_17  1872/1872  16  DenseBlk_70, concat5_16
DenseBlk_71  1872/48  16  concat5_17
concat5_18  1920/1920  16  DenseBlk_71, concat5_17
DenseBlk_72  1920/48  16  concat5_18
concat5_19  1968/1968  16  DenseBlk_72, concat5_18
DenseBlk_73  1968/48  16  concat5_19
concat5_20  2016/2016  16  DenseBlk_73, concat5_19
DenseBlk_74  2016/48  16  concat5_20
concat5_21  2064/2064  16  DenseBlk_74, concat5_20
DenseBlk_75  2064/48  16  concat5_21
concat5_22  2112/2112  16  DenseBlk_75, concat5_21
DenseBlk_76  2112/48  16  concat5_22
concat5_23  2160/2160  16  DenseBlk_76, concat5_22
DenseBlk_77  2160/48  16  concat5_23
concat5_24  2208/2208  16  DenseBlk_77, concat5_23 
conv5_blk  2208/1024  16  concat5_24 
upproject_1  1024/512  8  conv5_blk
concat_up_2  896/896  8  upproject_1, conv3_blk 
upproject_2  896/584  4  concat_up_2 
concat_up_3  776/776  4  upproject_2, conv2_blk 
upproject_3  776/256  2  concat_up_3 
upproject_4  256/128  1  upproject_3 
conv_pred  128/1  1  upproject_4 
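The encoder channel counts in the table follow from simple bookkeeping: every dense layer concatenates a fixed number of new feature maps onto its input, and every transition between blocks halves the channel count. A minimal Python sketch of that arithmetic is given below, assuming the DenseNet-161 configuration of [13] (96 initial feature maps, growth rate 48, block sizes 6, 12, 36 and 24); the function name is illustrative. Under these assumptions the third block ends at 2112 channels, consistent with the 1056-channel output listed for conv4_blk. The final reduction to 1024 channels in conv5_blk is specific to this network and is not modelled here.

```python
def densenet_channels(init_features=96, growth=48, block_layers=(6, 12, 36, 24)):
    """Return (channels at end of block, channels after transition) per block.

    The last block has no halving transition in this sketch, so its second
    entry is None.
    """
    summary = []
    channels = init_features
    for index, layers in enumerate(block_layers):
        # Each dense layer concatenates `growth` new feature maps.
        channels += layers * growth
        if index < len(block_layers) - 1:
            # A transition layer halves the number of channels.
            summary.append((channels, channels // 2))
            channels //= 2
        else:
            summary.append((channels, None))
    return summary

for block, (end, after) in enumerate(densenet_channels(), start=1):
    print(f"dense block {block}: {end} channels, after transition: {after}")
```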
References
 [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
 [2] Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: IEEE International Conference on Computer Vision (ICCV) (2015)
 [3] Dharmasiri, T., Spek, A., Drummond, T.: Joint prediction of depths, normals and surface curvature from rgb images using cnns. arXiv preprint arXiv:1706.07593 (2017)
 [4] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: IEEE International Conference on Computer Vision (ICCV) (2015)
 [5] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NIPS) (2014)
 [6] Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular SLAM. In: European Conference on Computer Vision (ECCV) (2014)
 [7] Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., Van der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. IEEE International Conference on Computer Vision (ICCV) (2015)
 [8] Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (ECCV) (2016)
 [9] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR) (2013)
 [10] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [11] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision (ICCV) (2015)
 [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
 [13] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [14] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [15] Jayaraman, D., Grauman, K.: Learning image representations tied to ego-motion. In: IEEE International Conference on Computer Vision (ICCV) (2015)
 [16] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 4762–4769 (2016)
 [17] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems (NIPS) (2017)
 [18] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [19] Klein, G., Murray, D.: Parallel tracking and mapping for small ar workspaces. In: IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR) (2007)
 [20] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS) (2012)
 [21] Kuznietsov, Y., Stückler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [22] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: International Conference on 3D Vision (3DV) (2016)
 [23] Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. In: ACM Transactions on Graphics (2004)
 [24] Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
 [25] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
 [26] Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31(5) (2015)
 [27] Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics 33(5) (2017)
 [28] Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: DTAM: Dense tracking and mapping in real-time. In: IEEE International Conference on Computer Vision (ICCV) (2011)
 [29] Rad, M., Lepetit, V.: BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In: IEEE International Conference on Computer Vision (ICCV) (2017)
 [30] Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [31] Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
 [32] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
 [33] Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems (2006)
 [34] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European Conference on Computer Vision (ECCV). pp. 1–14 (2012)
 [35] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of RGB-D SLAM systems. In: International Conference on Intelligent Robot Systems (IROS) (2012)
 [36] Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371 (2017)
 [37] Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [38] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: DeMoN: Depth and motion network for learning monocular stereo. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [39] Wannenwetsch, A.S., Keuper, M., Roth, S.: Probflow: Joint optical flow and uncertainty estimation. In: IEEE International Conference on Computer Vision (ICCV) (2017)
 [40] Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: Learned Invariant Feature Transform. In: European Conference on Computer Vision (ECCV) (2016)
 [41] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [42] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)