The importance of navigation and mapping to the fields of robotics and computer vision has only increased since its inception. Vision-based navigation in particular is an extremely interesting field of research due to its resemblance to human navigation and the wealth of information an image contains. Although creating a machine that understands structure and motion purely from RGB images is challenging, the computer vision community has developed a plethora of algorithms that replicate useful aspects of human vision on a computer. Tracking and mapping remains an unsolved problem with many popular approaches. Photometric-based techniques rely on establishing correspondences across different viewpoints of the same scene; the matching points are then used to perform triangulation. Based on the density of the map, the field can be divided into dense, semi-dense and sparse approaches, each of which comes with advantages and disadvantages.
Applying machine learning techniques to solve vision problems has been another popular area of research. Great advances have been made in the fields of image classification [20, 12, 13] and semantic segmentation [25, 41], and this has led geometry based machine learning methods to follow suit. The massive growth in neural network driven research has largely been facilitated by the increased availability of low-cost, high-performance GPUs as well as the relative accessibility of machine learning frameworks such as Tensorflow and Caffe.
In this work we draw from both machine learning approaches and SfM techniques to create a unified framework which is capable of predicting the depth of a scene and the motion parameters governing the camera motion between an image pair. We construct our framework incrementally: the network is first trained to predict depths given a single color image. Then a color image pair and their associated depth predictions are provided to a flow estimation network, which produces an optical flow map along with an estimated measure of confidence in x and y motion. Finally, the pose estimation block utilises the outputs of the previous networks to estimate a motion vector corresponding to the logarithm of the Special Euclidean transformation in SE(3), which describes the relative camera motion from the first image to the second.
We summarise the contributions made in this paper as follows:
We also present the first approach to use a neural network to predict the full information matrix which represents the confidence of our optical flow estimate.
2 Related Work
Estimating motion and structure from two or more views is a well-studied vision problem. In order to reconstruct the world and estimate camera motion, sparse feature-based systems [19, 26] compute correspondences through feature matching, while the denser approaches [6, 28] rely on brightness constancy across multiple viewpoints. In this work we leverage CNNs to solve the aforementioned tasks, and we summarise the existing works in the literature that are related to the ideas presented in this paper.
2.1 Single Image Depth Prediction
Predicting depth from a single RGB image using learning-based approaches had been explored even prior to the resurgence of CNNs. Saxena et al. employed a Markov Random Field (MRF) to combine global and local image features. Similar to our approach, Eigen et al. introduced a common CNN architecture capable of predicting depth maps for both indoor and outdoor environments. This concept was later extended to a multi-stage coarse-to-fine network by Eigen et al. Further advances were made by combining graphical models with CNNs to improve the accuracy of depth maps, through the use of related geometric tasks, and by making architectural improvements specifically designed for depth prediction. Kendall et al. demonstrated that predicting uncertainties alongside depths improves the overall accuracy. While most of these methods demonstrated impressive results, no explicit notion of geometry was used during any stage of the pipeline, which opened the way for geometry-based depth prediction approaches.
In one of the earliest works to predict depth using geometry in an unsupervised fashion, Garg et al. used the photometric difference between a stereo image pair, where the target image was synthesised using the predicted disparity and the known baseline. Left-right consistency was explicitly enforced in the unsupervised framework of Godard et al. as well as in the semi-supervised framework of Kuznietsov et al., a technique we also found to be beneficial when training on sparse ground truth data.
2.2 Optical Flow Prediction
An early work in optical flow prediction using CNNs was FlowNet. This was later extended by Ilg et al. to FlowNet 2.0, which included stacked FlowNets as well as warping layers. Ranjan and Black proposed a spatial-pyramid-based optical flow prediction network. More recently, Sun et al. proposed a framework which uses principles from geometry-based flow estimation techniques such as image pyramids, warping and cost volumes. As our end goal revolves around predicting camera pose, it becomes necessary to isolate the flow that was caused purely by camera motion. To achieve this we extend these previous works to predict both the optical flow and the associated information matrix of the flow. The usefulness of estimating flow together with its uncertainty has also been shown outside a CNN context.
2.3 Pose Estimation
CNNs have been successfully used to estimate various components of a Structure from Motion pipeline. Earlier works focused on learning discriminative image-based features suitable for ego-motion estimation [2, 15]. Yi et al. showed that a full feature detection framework can be implemented using deep neural networks. Rad and Lepetit in BB8 showed that the pose of objects can be predicted even under partial occlusion, and highlighted the increased difficulty of predicting 3D quantities over 2D quantities. Kendall and Cipolla demonstrated camera pose prediction from a single image, catering for relocalization scenarios.
However, each of the above works lacks a representation of structure, as they do not explicitly predict depths. Our work is more closely related to that of Zhou et al. and Ummenhofer et al. and their frameworks SfM-Learner and DeMoN. Both of these approaches also predict a single confidence map, in contrast to ours, which estimates the confidence in the x and y directions separately. Since our framework predicts metric depths, unlike theirs, we are able to produce far more accurate visual odometry and combat scale drift. CNN-SLAM by Tateno et al. incorporated learned depth predictions into a SLAM framework. Although our method solely computes sequential frame-to-frame alignments, it performs competitively with CNN-SLAM as well as ORB-SLAM and LSD-SLAM, which have the added advantage of performing loop closures and local/global bundle adjustments.
3.1 Network Architecture
The overall architecture consists of 3 main subsystems in the form of a depth, flow and camera pose network. A large percentage of the model capacity is invested into the depth prediction component for two reasons. Firstly, the output of the depth network also serves as an additional input to the other subsystems. Secondly, we wanted to achieve superior depths for indoor and outdoor environments using a common architecture (although there are separate models for indoor and outdoor scenes, the underlying architecture is common). In order to preserve space and to provide an overall understanding of the data flow, a high-level diagram of the network is shown in Figure 1. An expanded architecture with layer definitions for each of the subsystems is included in the supplementary materials.
3.2 Depth Prediction
The depth prediction network consists of an encoder and a decoder module. The encoder network is largely based on the DenseNet161 architecture. In particular, we use the variant pre-trained on ImageNet and slightly increase the receptive field of the pooling layers. As the original input is down-sampled 4 times by the encoder, during the decoding stage the feature maps are up-sampled back 4 times to make the model fully convolutional. We employ skip connections in order to re-introduce the finer details lost during pooling. Since the first down-sampling operation is done at a very early stage of the pipeline and its activations closely resemble the image features, these activations are not reused inside the decoder. Up-project blocks are used to perform up-sampling in our network, as they have been shown to provide better depth maps than de-convolutional layers.
Where dense ground truth depth is available, as for the indoor datasets, this network can be directly utilised to perform supervised learning. Unfortunately, the ground truth data for the outdoor dataset (KITTI) is much sparser, which meant we had to incorporate a semi-supervised learning approach in order to provide a strong training signal. Therefore, during training on KITTI, we use a Siamese version of the depth network with complete weight-sharing, and enforce photometric consistency between the left-right image pairs through an additional loss function. This is similar to previous approaches [8, 21] and is only required during the training stage; during inference only a single input image is required to perform depth estimation using our network.
3.3 Flow Prediction
The flow network provides an estimation of the optical flow along with the associated confidences given an image pair. These outputs, combined with the predicted depths, allow us to predict the camera pose. As part of our ablation studies we integrated the flow predictions of an existing method with our depths; however, the main limitation of this approach was the lack of a mechanism to filter out the dynamic objects which are abundant in outdoor environments. This was solved by estimating confidence, specifically the information matrix, in addition to the optical flow. More concretely, for each pixel our flow network predicts 5 quantities: the optical flow in the x and y directions, and three further quantities from which the information matrix (the inverse of the covariance matrix) is computed.
This parametrisation guarantees that the information matrix is positive-definite, and it can be used to parametrise any information matrix. We found that the gradients are much more stable compared to predicting the information matrix directly, as the determinant of the matrix is always greater than zero.
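To illustrate how three unconstrained network outputs can always yield a valid information matrix, the following is a minimal sketch of one such parametrisation. It assumes a Cholesky-style construction with an exponentiated diagonal; this is an illustrative choice and not necessarily the parametrisation used in this work, and the names `information_matrix`, `a`, `b` and `c` are hypothetical.

```python
import math


def information_matrix(a, b, c):
    """Build a symmetric 2x2 information matrix from three unconstrained numbers.

    Assumed parametrisation (for illustration only): Lambda = L L^T with
    L = [[exp(a), 0], [b, exp(c)]], a lower-triangular factor whose diagonal
    is strictly positive for any real a and c.
    """
    l11, l21, l22 = math.exp(a), b, math.exp(c)
    return [[l11 * l11, l11 * l21],
            [l11 * l21, l21 * l21 + l22 * l22]]


def determinant(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


lam = information_matrix(0.3, -1.2, -0.5)
print(lam, determinant(lam))
```

Under this assumed construction the determinant equals exp(2a + 2c), so it stays strictly positive for any real inputs, which is the kind of property the text relies on for stable gradients.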
With respect to the architecture we borrow elements from FlowNet as well as FlowNet 2.0. FlowNet 2.0 has been reported to be unable to reliably estimate small motions, which we address with two key changes. Firstly, our flow network takes the predicted depth map as an input, allowing the network to learn the relationship between depth and flow explicitly, including that closer objects appear to move more than objects that are further away from the camera. Secondly, we use “warp-concatenation”, where coarse flow estimates are used to warp the CNN features during the decoder stage. This appears to resolve small motions more effectively, particularly on the TUM dataset.
3.4 Pose Estimation
We take two approaches to pose estimation, shown in Figure 2: an iterative approach and a fully-connected (FC) approach. This contrasts the ability of a neural network to estimate pose using the available information with the simplicity of a standard computer vision approach using the available predicted quantities. We use FC layers to provide the network with as wide a receptive field as possible, so that it can be compared more equivalently against the iterative approach, which uses the inferred quantities directly.
The iterative approach uses a more conventional method for computing relative pose estimates. We use a standard re-weighted least squares solver based on the residual flow, given an estimate of the relative transformation. More concretely, we attempt to minimise the following error function with respect to the relative transformation parameters
The error is defined on the total residual flow in normalised camera coordinates, of which only the first two dimensions of the vector are used. It is computed from the inverse depth coordinate of an ordered point cloud, the predicted flow and the estimated flow, and the pixel coordinate. The current transformation estimate can be expressed by the matrix exponential as $\mathbf{T} = \exp\left(\sum_{i=1}^{6} \xi_i \mathbf{G}_i\right)$, where $\xi_i$ is the $i$-th component of the motion vector $\boldsymbol{\xi}$, which is a member of the Lie algebra $\mathfrak{se}(3)$, and $\mathbf{G}_i$ is the generator matrix corresponding to the relevant motion parameter. As this pipeline is implemented in Tensorflow, it allows us to train the network end to end. Please see the supplementary material for a more detailed explanation.
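The matrix-exponential representation of the transformation can be sketched as follows. This is a minimal pure-Python illustration, assuming the motion vector is ordered as three translations followed by three rotations and using a truncated Taylor series for the exponential; the ordering, the helper names (`generators`, `se3_exp`, `mat_mul`) and the number of series terms are all assumptions made for the example, not details taken from this work.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def generators():
    """Six 4x4 generators of se(3); the (tx, ty, tz, rx, ry, rz) order is assumed."""
    g = [[[0.0] * 4 for _ in range(4)] for _ in range(6)]
    g[0][0][3] = 1.0                      # translation along x
    g[1][1][3] = 1.0                      # translation along y
    g[2][2][3] = 1.0                      # translation along z
    g[3][2][1], g[3][1][2] = 1.0, -1.0    # rotation about x
    g[4][0][2], g[4][2][0] = 1.0, -1.0    # rotation about y
    g[5][1][0], g[5][0][1] = 1.0, -1.0    # rotation about z
    return g


def se3_exp(xi, terms=30):
    """Return exp(sum_i xi[i] * G_i) as a 4x4 matrix via a truncated Taylor series."""
    g = generators()
    a = [[sum(xi[k] * g[k][i][j] for k in range(6)) for j in range(4)]
         for i in range(4)]
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        # term now holds A^n / n!
        term = [[v / n for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return result


# A quarter turn about the z-axis combined with no translation.
for row in se3_exp([0.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2]):
    print([round(v, 6) for v in row])
```

A truncated series is adequate for the small frame-to-frame motions considered here; a closed-form (Rodrigues-style) exponential would be the usual choice in a production solver.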
Similar to Zhou et al. and Ummenhofer et al., we also constructed a fully connected layer based pose estimation network. This network utilises 3 stacked fully connected layers and uses the same inputs as our iterative method. While we outperform the pose estimation benchmarks of both works using this network, the iterative network is our recommended approach due to its close resemblance to conventional geometry-based techniques.
3.5 Loss Functions
For supervised training on indoor and outdoor datasets we use a reverse Huber loss function defined by
Here the loss is evaluated on the predicted and the ground truth depth. For the KITTI dataset we employed an additional photometric loss during training, as the ground truth is highly sparse. This unsupervised loss term enforces left-right consistency between stereo pairs, defined by
The loss compares the left and right images using their corresponding depth maps. It involves a normalisation function, the camera intrinsic matrix K, the transformation from pixel to camera coordinates, and the relative transformation matrices from left-to-right and right-to-left. In this case the rotation is assumed to be the identity and the matrices purely translate in the x-direction. Additionally, we use a smoothness term defined by
The smoothness term is built from the horizontal and vertical gradients of the predicted depth. This provides qualitatively better depths as well as faster convergence. The final loss function used to train KITTI depths is given by
where the per-image terms are computed on both the left and right images separately.
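For reference, the reverse Huber (berHu) penalty mentioned above behaves like an L1 loss for small residuals and like a scaled L2 loss for large ones: it equals |x| when |x| ≤ c and (x² + c²)/(2c) otherwise. The sketch below is a minimal illustration; setting the threshold c to a fixed fraction of the largest absolute residual is an assumption borrowed from common usage in the depth prediction literature, and the function name `berhu` and the `ratio` parameter are hypothetical.

```python
def berhu(residuals, ratio=0.2):
    """Mean reverse Huber penalty over a list of depth residuals.

    The threshold c = ratio * max|r| is an assumed choice for illustration;
    the threshold used in this work is not reproduced here.
    """
    abs_res = [abs(r) for r in residuals]
    c = ratio * max(abs_res)
    if c == 0.0:
        return 0.0  # all residuals are zero
    total = 0.0
    for a in abs_res:
        # L1 below the threshold, scaled L2 above it (continuous at a == c).
        total += a if a <= c else (a * a + c * c) / (2.0 * c)
    return total / len(abs_res)


# Residuals = predicted depth minus ground truth depth, in metres.
print(berhu([0.1, -0.5, 2.0]))
```

Compared with a plain L2 loss, this puts relatively more weight on small residuals, while still penalising large errors quadratically.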
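The probability density of a multivariate Gaussian in 2D can be defined as $p(\mathbf{x}) = \frac{|\Lambda|^{1/2}}{2\pi} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \Lambda (\mathbf{x}-\boldsymbol{\mu})\right)$,
where $\Lambda$ is the information matrix, or inverse covariance matrix $\Sigma^{-1}$. The flow loss criterion can now be defined by
which is evaluated on the predicted flow and the ground truth flow. This optimises the network by maximising the log-likelihood of the probability distribution over the residual flow error.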
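Maximising the Gaussian log-likelihood of the residual flow is equivalent to minimising its negative log-likelihood, which for a 2D residual r and information matrix Λ is ½ rᵀΛr − ½ log|Λ| + log 2π. The sketch below evaluates this per-pixel quantity; it is a minimal illustration that keeps the constant term and assumes a symmetric 2×2 information matrix, and it may differ from the exact loss used in this work (for example in constants or in how pixels are aggregated). The name `flow_nll` is hypothetical.

```python
import math


def flow_nll(pred_flow, gt_flow, info):
    """Negative log-likelihood of a 2D flow residual under a zero-mean Gaussian.

    `info` is a symmetric, positive-definite 2x2 information matrix
    (inverse covariance) given as nested lists.
    """
    rx = gt_flow[0] - pred_flow[0]
    ry = gt_flow[1] - pred_flow[1]
    # r^T * info * r for a symmetric 2x2 matrix.
    mahalanobis = (info[0][0] * rx * rx
                   + 2.0 * info[0][1] * rx * ry
                   + info[1][1] * ry * ry)
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    return 0.5 * mahalanobis - 0.5 * math.log(det) + math.log(2.0 * math.pi)


identity = [[1.0, 0.0], [0.0, 1.0]]
confident = [[4.0, 0.0], [0.0, 4.0]]
# High confidence is rewarded when the flow is right and punished when it is wrong.
print(flow_nll((0.0, 0.0), (0.0, 0.0), confident), flow_nll((0.0, 0.0), (0.0, 0.0), identity))
print(flow_nll((0.0, 0.0), (2.0, 0.0), confident), flow_nll((0.0, 0.0), (2.0, 0.0), identity))
```

The log-determinant term is what stops the network from trivially predicting zero confidence everywhere: lowering the confidence reduces the Mahalanobis term but is penalised through −½ log|Λ|.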
Given two input images, the predicted depth map of one image and the predicted relative pose, the unsupervised loss and pose loss can be defined as
where the logarithm maps a transformation T from the Lie group SE(3) to the Lie algebra se(3), such that the transformation can be represented by its constituent motion parameters, which are compared against the ground truth relative pose parameters.
3.6 Training Regime
We train our network end-to-end on the NYUv2, TUM and KITTI datasets. We use the standard test/train split for NYUv2 and KITTI and define our own scene split for TUM. It is worth mentioning that the amount of training data we used is radically reduced compared to previous approaches. More concretely, we use only a fraction of the full dataset for both NYUv2 and KITTI. We use the Adam optimiser with an initial learning rate of 1e-4 for all experiments, chose Tensorflow as the learning framework, and train using an NVIDIA DGX-1. We provide a detailed training schedule and breakdown in the supplementary material.
In this section we summarise the single-image depth prediction and relative pose estimation performance of our system on several popular machine learning and SLAM datasets. We also investigate the effect of using alternative optical flow estimates from existing methods in our pose estimation pipeline as an ablation study. The entire model contains 130M parameters. Our depth estimator runs at 5 fps on an NVIDIA GTX 1080Ti, while the other sub-networks run at 30 fps.
4.1 Depth Estimation
We train the Ours(baseline) model to showcase the improvement we get by purely using the depth loss. This is then extended to use the full end-to-end training loss (depth + flow + pose losses) in the Ours(full) model, which demonstrates a consistent improvement across all datasets, most notably in Tables 2 and 3, for which ground truth pose data was available for training. This validates our approach for improving single image depth estimation performance, and demonstrates that a network can be improved by enforcing more geometric priors through the loss functions. We would like to mention that the improvement we gain from Ours(baseline) to Ours(full) is purely due to the novel combined loss terms, as the flow and pose sub-networks do not increase the model capacity of the depth subnet itself.
|Method|Lower is better|||Higher is better|||
|Laina et al.|0.573|0.195|0.13|81.1%|95.3%|98.8%|
|Kendall et al.|0.506|-|0.110|81.7%|95.9%|98.9%|

|Cap|Method|Lower is better|||Higher is better|||
|0-80m|Zhou et al.|6.856|0.283|0.208|67.8%|88.5%|95.7%|
| |Godard et al.|4.935|0.206|0.141|86.1%|94.9%|97.6%|
| |Kuznietsov et al.|4.621|0.189|0.113|86.2%|96.0%|98.6%|
|0-50m|Zhou et al.|5.181|0.264|0.201|69.6%|90.0%|96.6%|
| |Garg et al.|5.104|0.273|0.169|74.0%|90.4%|96.2%|
| |Godard et al.|3.729|0.194|0.108|87.3%|95.4%|97.9%|
| |Kuznietsov et al.|3.518|0.179|0.108|87.5%|96.4%|98.8%|

|Method|Lower is better||||Higher is better|||
|Laina et al.|1.275|0.481|0.189|0.371|75.3%|89.1%|91.8%|
Additionally, we include qualitative results for NYUv2 and KITTI in Figures 3 and 4 respectively, each of which illustrates a noticeable improvement over previous methods. We also demonstrate that the improvement goes beyond the numbers, as our approach generates more convincing depths even when the RMSE may be higher, as is the case in the second row of Figure 3, where a competing method achieves a lower RMSE. More impressive still are the results in Figure 4, where we compare against previous approaches that are trained on much larger training sets than our own, and still show noticeable qualitative and quantitative improvements.
4.2 Pose Estimation
To demonstrate the ability of our approach to perform accurate relative pose estimation, we evaluate it on several unseen sequences from the datasets for which ground-truth poses were available. To quantitatively evaluate the trajectories we use the absolute trajectory error (ATE) and the relative pose error (RPE). To mitigate the effect of scale-drift on these quantities we scale all poses to the associated ground-truth poses during evaluation. Using both metrics provides an estimate of the consistency of each pose estimation approach. We summarise the results of this quantitative analysis for KITTI in Table 4 and for RGB-D in Table 5. We include comparisons against other state-of-the-art pose estimation networks, namely SfM-Learner and DeMoN. Additionally, we include results from current state-of-the-art SLAM systems, namely ORB-SLAM2 and LSD-SLAM.
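The two trajectory metrics can be sketched in simplified form as below. This is an illustrative, translation-only version: it assumes the estimated and ground-truth positions are already time-associated and expressed in a common frame, and it recovers a single least-squares scale factor before computing the errors. The standard benchmark tools additionally perform a rigid-body alignment and account for rotation, which is omitted here; all function names are hypothetical.

```python
import math


def scale_to_ground_truth(est, gt):
    """Least-squares scale s minimising sum ||gt_i - s * est_i||^2."""
    num = sum(e[k] * g[k] for e, g in zip(est, gt) for k in range(3))
    den = sum(e[k] * e[k] for e in est for k in range(3))
    return num / den


def ate_rmse(est, gt):
    """Root-mean-square absolute position error after scale correction."""
    s = scale_to_ground_truth(est, gt)
    sq = [sum((g[k] - s * e[k]) ** 2 for k in range(3)) for e, g in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rpe_rmse(est, gt, delta=1):
    """Root-mean-square error of relative translations over a gap of `delta` frames."""
    s = scale_to_ground_truth(est, gt)
    sq = []
    for i in range(len(est) - delta):
        d_est = [s * (est[i + delta][k] - est[i][k]) for k in range(3)]
        d_gt = [gt[i + delta][k] - gt[i][k] for k in range(3)]
        sq.append(sum((a - b) ** 2 for a, b in zip(d_est, d_gt)))
    return math.sqrt(sum(sq) / len(sq))


gt = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.0, 0.0)]  # half scale, slight wobble
print(scale_to_ground_truth(est, gt), ate_rmse(est, gt), rpe_rmse(est, gt))
```

ATE measures global consistency of the whole trajectory, whereas RPE measures local drift between nearby frames, which is why reporting both gives a fuller picture.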
In Table 4 we show that the performance of our approach is most comparable to that of state-of-the-art SLAM systems. We demonstrate a noticeable improvement over SfM-Learner on both sequences in all metrics. We evaluate SfM-Learner on its frame-to-frame tracking performance for adjacent frames (SfM-Learner(1)) and for separations of 5 frames (SfM-Learner(5)), as their approach is trained to estimate this size of frame gap. Even with the massive reduction in accumulation error expected from taking larger frame gaps (demonstrated by the reduced ATE), our system still produces more accurate pose estimates.
We show the resulting scaled trajectories of sequence 09 in Figure 5, as well as the relative scaling of each trajectory's poses in a box-plot. The spread of scales present for SfM-Learner indicates that scale is essentially ignored by their system, with scale drifts ranging across a full log scale, while the spreads for ORB-SLAM and our approach are barely visible at this scale. Another thing to note is that our scale is centred around 1.0, as we estimate scale directly by estimating metric depths. This seems to provide a strong benefit in terms of reducing scale-drift, and we believe it makes our system more usable in practice.
In Table 5 we show a significant improvement in performance against existing machine learning approaches across several sequences from the RGB-D dataset. We evaluate against DeMoN in two ways: frame-to-frame (DeMoN(1)), and, to provide DeMoN with the same advantage as SfM-Learner, using a wider baseline with a frame gap of 10 (DeMoN(10)), which they claim improves their depth estimations. It can be observed that even with the massive reduction in accumulation error this gives them over our frame-to-frame approach, we still manage to significantly out-perform their approach in ATE, even surpassing LSD-SLAM on the sequence fr1-xyz. ORB-SLAM is still the clear winner, as it benefits massively from the ability to perform local bundle-adjustments on the sequences used, which are short trajectories of small scenes. We include an example of a frame from the sequence fr3-walk-xyz in Figure 7, which shows this scene is not static, but our system has the ability to deal with this through the flow confidence estimates, discussed in Section 4.3.
4.3 Ablation Experiments
In order to examine the contribution of each component of our pose estimation network, we compare the pose estimates under various configurations on sequences 09 and 10 of the KITTI odometry dataset, summarised in Table 6. We examine the relative improvement of iterating on our pose estimate until convergence against a single weighted-least-squares iteration, which demonstrates that iterating has a significantly positive effect. We demonstrate the improved utility of our flows by replacing our flow estimates with those of other state-of-the-art flow estimation methods in our pose estimation pipeline, and consistently demonstrate an improvement using our approach. We show the result of optimising with and without our estimated confidences, demonstrating quantitatively how important they are to pose estimation accuracy, with significant reductions across all metrics.
We also demonstrate qualitatively one of the ways in which estimating confidence improves our pose estimation in Figure 7. This shows that our system has learned that the confidence on moving objects is lower than on their surroundings and that the confidence on edges is higher, helping our system focus on salient information during optimisation, in a manner similar to prior work.
5 Conclusion and Further Work
We present the first piece of work that performs least-squares-based pose estimation inside a neural network. Instead of replacing every component of the SLAM pipeline with CNNs, we argue it is better to use CNNs for tasks that greatly benefit from feature extraction (depth and flow prediction) and to use geometry for tasks where it is proven to work well (motion estimation given the depths and flow). Our formulation is fully differentiable and is trained end-to-end. We achieve state-of-the-art performance on single image depth prediction for both the NYUv2 and KITTI datasets. We demonstrate both qualitatively and quantitatively that our system is capable of producing better visual odometry that considerably reduces scale-drift by predicting metric depths.
I Dataset Evaluation Analysis
In this section we evaluate and analyse the relative performance on each dataset as well as correlations in the dataset and how they relate to overall performance.
The NYUv2 dataset has been a popular benchmark for indoor depth estimation and semantic segmentation since the work of Eigen et al. We provide several qualitative and quantitative results from the evaluation of our approach in Figures 9, 10 and 22. These show our strongest, median and worst performing images, as well as each prediction's RMSE in meters. This reveals two insights about our system's performance: we perform better on images with closer median depths, and our largest errors occur when we incorrectly estimate the overall scale of the scene. The relationship to median depth is evident in Figure 8, where the RMSE is strongly correlated with the median scene depth. We also observe a similarly strong correlation in the performance of all three approaches, although our approach overwhelmingly outperforms the competitors.
What conclusions can we draw from these results? This is a rather clear consequence of the choice of error metric used in ranking the results. As we rank by the RMSE, we would expect the images with larger depths to have the largest errors, as only very large predictions or very large ground truth values can generate large RMSE values. This also indicates that our network tends to behave conservatively, estimating the scene to be closer on average rather than further. This is probably a direct result of the depth value distribution in the training set, potentially biasing the depths towards the lower end.
i.ii KITTI
Our most impressive performance is perhaps on the KITTI benchmark dataset where, as shown in Figure 12 (left), we consistently out-perform the competing approaches on almost all test images. The scale of the depth error axis had to be changed in order to capture the full range of errors. This could be because the competing approaches estimate inverse depth/disparity and invert the predicted values to compute their loss function. This can lead to unstable performance at large distances, due to the non-linearity of the inversion at these distances, as opposed to our approach, which is linear in all depths.
Again we include the analysis of the median depths sorted against the RMSE in Figure 12 (right), as we did for NYUv2. In this case the relationship between error and depth is largely reduced. This is most likely due to the nature of the dataset, which contains a very similar spread of data for most images in the training set, as they film similar scenarios. However, the relationship is still visible in Figure 13, where the scenes contain comparatively low depth values, indicating again that our system behaves conservatively in estimating depths.
i.iii Comparison using the architecture of Kuznietsov 
We replaced the architecture of our depth estimation network with that of Kuznietsov et al. As can be seen below, by using the full training loss we are able to improve the accuracy of the depth estimation results, indicating the generality of the approach.
II Pose Trajectories
ii.i KITTI Trajectories
We include the trajectory from sequence 10 of the KITTI odometry dataset. For the quantitative results please refer to the main paper. The resulting trajectories in Figure 16 indicate the comparatively strong performance of our approach, and show that our iterative approach (bottom-left) is significantly more accurate than the FC approach (top-right). This trajectory contains no loops, and as such could lead to significant scale drift in some SLAM systems; however, in this case the frequent local bundle adjustments performed by ORB-SLAM seem to have helped to maintain the map quality throughout.
ii.ii RGB-D Trajectories
We show the estimated aligned trajectories for several sequences from the RGB-D dataset to demonstrate the qualitative performance of our system against previous approaches. We summarise the trajectories in Figure 17, which demonstrates our favourable performance compared to DeMoN. This is despite our method estimating only frame-to-frame relative poses from adjacent frames, while DeMoN(10) uses a larger baseline and thus should estimate a smoother trajectory given the reduction in accumulation error. Ultimately ORB-SLAM is still the clear winner, as it uses information from multiple frames and iteratively aggregates error across short sections. However, as our approach is purely VO, we were able to obtain a trajectory for fr2-360-hs, which we were unable to do for ORB-SLAM due to the challenging nature of the camera motion and rapid lighting changes.
ii.iii Comparison With CNN-SLAM
In this table we include a comparison of our approach on the datasets used by CNN-SLAM . We would like to point out that our method performs competitively despite solely computing sequential frame-to-frame alignments and does not (yet) take advantage of the loop closures and local/global bundle adjustments used by the competing methods.
|Method||Absolute Trajectory Error|
III Optical Flow
IV Network Architectures
iv.i Depth Network
The encoder takes a global-mean-subtracted RGB image as input. During the feature encoding stage the resolution of the activations is reduced by a factor of 16. The first down-sampling operation is performed using a strided convolutional layer, the next with a max-pooling layer and the final two with average-pooling layers. The up-sampling process is performed using up-project blocks. Since the first down-sampling operation is performed by the very first convolutional layer and its activations closely resemble image features, these activations are not provided to the decoder via a skip connection. It should be noted that ours isn’t the first piece of work to predict depth using a DenseNet architecture. Kendall et al. also used a DenseNet variant, and the gains that we obtain are predominantly due to the loss functions we employed. Appendix I shows the full breakdown of the architecture.
iv.ii Flow Network
The flow network has three streams. The first stream takes the left image and its predicted depth map as input, the second stream receives the right image and the corresponding predicted depth map, and the third stream receives both the left and right images and their associated depth predictions. Barring the first layer, all other layers of each stream share their weights. During the decoder stage the predicted flow is used to perform warp-concatenations, where the right image's activations are warped and concatenated with those of the left image. Since we are estimating optical flow in a coarse-to-fine manner, where the latter layers compute a residual to be added to the initial flow estimate, warp-concatenations help to capture small displacements more effectively.
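The warp-concatenation step can be illustrated with the following minimal sketch. It assumes single-channel feature maps stored as nested lists, bilinear sampling with border clamping, and a flow field that maps each left-image location to its match in the right image; these choices and the names `bilinear_sample` and `warp_concatenate` are assumptions for the example rather than details of the actual network layers.

```python
def bilinear_sample(feat, x, y):
    """Sample a 2D feature map at a real-valued (x, y), clamping at the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_concatenate(left_feat, right_feat, flow):
    """Pair each left feature with the right feature fetched at (x + u, y + v)."""
    out = []
    for y, row in enumerate(left_feat):
        out_row = []
        for x, left_value in enumerate(row):
            u, v = flow[y][x]
            warped = bilinear_sample(right_feat, x + u, y + v)
            out_row.append((left_value, warped))  # "concatenation" along channels
        out.append(out_row)
    return out


left = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
right = [[0.0, 1.0, 2.0], [0.0, 4.0, 5.0]]      # left content shifted right by one pixel
flow = [[(1.0, 0.0)] * 3 for _ in range(2)]      # coarse flow estimate: one pixel in x
print(warp_concatenate(left, right, flow))
```

When the coarse flow is roughly correct, the warped right features line up with the left features, so later layers only need to explain the small remaining displacement.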
V Pose Network
v.i Iterative Re-weighted Approach
As described in the main body of the paper, we are attempting to minimise the following error function with respect to the relative transformation parameters:
For simplicity, we express the values in terms of normalised camera coordinates. The estimated flow is computed from the normalised camera coordinate and the current estimated transformation as shown in Equation 11. To simplify the mathematics we can represent the transformation using a matrix exponential as $\mathbf{T} = \exp\left(\sum_{i=1}^{6} \xi_i \mathbf{G}_i\right)$, where $\xi_i$ is the $i$-th component of the motion vector $\boldsymbol{\xi}$, which is a member of the Lie algebra $\mathfrak{se}(3)$, and $\mathbf{G}_i$ is the generator matrix corresponding to the relevant motion parameter. We can now differentiate the residual function with respect to the motion parameters to generate the following Jacobian
The per-point Jacobians can be stacked to form a larger Jacobian matrix, and the residual vectors can be stacked likewise. This allows us to iteratively reduce the loss function using a standard Gauss-Newton approach given by
which yields the additive update to the motion parameters, weighted by a diagonal weight matrix W. The weight matrix is defined by
where the weight in the x-direction depends on the confidence value in the x-direction, a constant computed from the residual (Equation 11) to be the mean residual magnitude of a single image, and the residual in the x-direction. This pipeline is implemented in Tensorflow and allows us to train the network end to end.
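The interplay between confidences, residual-based re-weighting and the Gauss-Newton update can be seen in a deliberately small toy problem. The sketch below estimates a single 2D offset that best explains a set of flow vectors, so the residual is r = f − θ, the Jacobian is −I, and the weighted normal equations (JᵀWJ)δ = −JᵀWr reduce to a weighted mean per axis. The weight function used, confidence × k / (k + |r|) with k the mean residual magnitude, is an assumed robust form chosen for illustration and is not the weight definition from this work; the full method solves for six motion parameters rather than two.

```python
def irls_translation(flows, confidences, iterations=10):
    """Toy iteratively re-weighted least squares for a 2D offset theta.

    flows:        list of (u, v) predicted flow vectors
    confidences:  list of (cx, cy) per-axis confidence values
    """
    theta = [0.0, 0.0]
    for _ in range(iterations):
        residuals = [(f[0] - theta[0], f[1] - theta[1]) for f in flows]
        # k: mean residual magnitude over the whole "image" (assumed definition).
        k = sum(abs(r[0]) + abs(r[1]) for r in residuals) / (2 * len(residuals))
        k = max(k, 1e-12)
        for axis in range(2):
            w = [c[axis] * k / (k + abs(r[axis]))
                 for c, r in zip(confidences, residuals)]
            # Gauss-Newton step for J = -I: weighted mean of the residuals.
            delta = sum(wi * r[axis] for wi, r in zip(w, residuals)) / sum(w)
            theta[axis] += delta
    return theta


# Eight static points share one flow; two points on a moving object disagree.
flows = [(1.0, 0.5)] * 8 + [(6.0, -4.0)] * 2
uniform = [(1.0, 1.0)] * 10
learned = [(1.0, 1.0)] * 8 + [(0.01, 0.01)] * 2
print(irls_translation(flows, uniform))
print(irls_translation(flows, learned))
```

With uniform confidences the moving object pulls the estimate away from the static-scene flow, whereas down-weighting those pixels recovers an estimate close to (1.0, 0.5), mirroring the role of the predicted confidences in the ablation study.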
v.ii Network Based Approach
This section was addressed in detail in the main body.
VI Training Procedure
vi.i Depth Training
All of the DenseNet-161 layers  of the depth nets are initialised using Imagenet pretrained weights. Remainder of the layers are intialised using MSRC initialisation. NYUv2 and TUM models are trained purely using the supervised loss term. The network is regularized using a weight decay of 1 through out training and the learning rate schedule is shown below :
Out of the 400,000 images in the NYU dataset, we use only 12,000 for training. We augment the data four times (for a total training set of 48,000 images) using color shifts, random crops and left-right flips. Although data augmentation can be performed during training, we observed a considerable speedup from performing it offline. The training images and the corresponding ground truth are downsampled by a factor of 2, so the resolution of each training example becomes 320×240. Each training batch contains 8 images and we use 4 GPUs, resulting in an overall batch size of 32. In terms of training speed, we observe on average 19.3 examples/sec, or 0.415 sec/batch.
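An offline augmentation pass of this kind can be sketched as follows. The crop size, the range of the colour gains, the flip probability and the function names are all hypothetical choices made for illustration. The sketch does reflect one general requirement: geometric transforms (crop and flip) must be applied identically to an image and its depth map, whereas colour shifts apply to the image only.

```python
import random

def flip_lr(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]

def crop(img, top, left, h, w):
    """Extract an h x w window whose top-left corner is (top, left)."""
    return [row[left:left + w] for row in img[top:top + h]]

def color_shift(img, gains):
    """Scale each RGB channel by a per-channel gain, clamped to [0, 255]."""
    return [[tuple(min(255, max(0, int(round(c * g)))) for c, g in zip(px, gains))
             for px in row] for row in img]

def augment(image, depth, rng, crop_hw, copies=4):
    """Produce `copies` augmented (image, depth) pairs from one example."""
    h, w = len(image), len(image[0])
    ch, cw = crop_hw
    out = []
    for _ in range(copies):
        top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
        img, dep = crop(image, top, left, ch, cw), crop(depth, top, left, ch, cw)
        if rng.random() < 0.5:                               # shared flip decision
            img, dep = flip_lr(img), flip_lr(dep)
        gains = [rng.uniform(0.8, 1.2) for _ in range(3)]    # hypothetical range
        out.append((color_shift(img, gains), dep))
    return out

# Tiny synthetic example: a 4x6 RGB image and its depth map.
image = [[(10 * y + x, 100, 200) for x in range(6)] for y in range(4)]
depth = [[float(10 * y + x) for x in range(6)] for y in range(4)]
pairs = augment(image, depth, random.Random(0), crop_hw=(3, 4))
```

Applied to 12,000 source images with four copies each, such a pass yields the 48,000 training examples quoted above.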
For the KITTI dataset we use 10,000 training images. Starting from the training images defined in prior work, we further prune our training set to exclude any images that are part of the odometry test set. We adopt a learning rate schedule that spans half the duration of the NYU schedule, primarily to avoid overfitting, as we are now working with a comparatively small training dataset.
vi.i.1 Optical Flow Training
In order to compute the ground truth optical flow for the NYUv2 dataset, we first compute the camera pose using the Iterative Closest Point (ICP) algorithm; the pose can then be used with the ground truth depth map to compute optical flow. This process is simpler for the TUM dataset, as the ground truth pose is provided. The network is then trained using the optical flow loss criterion. All layers of the flow network are initialised using MSRC initialisation, and the learning rate schedule is shown in Figure 24. As can be seen, the training duration is much shorter than that of the depth network, as the primary objective at this stage is to obtain a crude representation of both the optical flow and the information matrix. Complete end-to-end fine-tuning happens when the network is trained using the pose loss criterion.
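Given a depth map and a relative pose, the ground truth flow at a pixel follows from back-projection, a rigid transform and re-projection. The sketch below assumes a pinhole camera with intrinsics (fx, fy, cx, cy) and a pose (R, t) that maps points from the first camera frame into the second; this convention and the function name are assumptions made for illustration.

```python
def flow_from_depth_and_pose(u, v, depth, K, R, t):
    """Optical flow at pixel (u, v) of the first image, given its depth and
    the relative pose (R, t) taking first-camera points to the second camera."""
    fx, fy, cx, cy = K
    # back-project the pixel to a 3D point in the first camera frame
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # rigid transform into the second camera frame
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # project into the second image
    u2 = fx * Y[0] / Y[2] + cx
    v2 = fy * Y[1] / Y[2] + cy
    return (u2 - u, v2 - v)

# Hypothetical intrinsics and a pure sideways translation.
K = (500.0, 500.0, 320.0, 240.0)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
du, dv = flow_from_depth_and_pose(100.0, 50.0, 2.0, K, I3, [0.1, 0.0, 0.0])
```

In practice, pixels with missing depth, or whose transformed point falls behind the second camera, have no valid flow and would need to be masked out of the loss.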
vi.i.2 Pose Training
We optimize the full network end-to-end using the pose loss and demonstrate that the state-of-the-art depths can be further improved using knowledge of the pose. We train the network for 20,000 iterations, with an initial learning rate that is halved at the half-way point.
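The schedule used for pose training amounts to a single step decay, sketched below. The initial learning rate is left as a parameter, `base_lr`, and the function name is hypothetical.

```python
def learning_rate(step, base_lr, total_steps=20000):
    """Step schedule: `base_lr` for the first half of training, half of it afterwards."""
    return base_lr if step < total_steps // 2 else 0.5 * base_lr
```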
We include a full table description of our depth network, for completeness.
|Model Architecture Breakdown|
-  Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
-  Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: IEEE International Conference on Computer Vision (ICCV) (2015)
-  Dharmasiri, T., Spek, A., Drummond, T.: Joint prediction of depths, normals and surface curvature from rgb images using cnns. arXiv preprint arXiv:1706.07593 (2017)
-  Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: IEEE International Conference on Computer Vision (ICCV) (2015)
-  Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in neural information processing systems (NIPS) (2014)
-  Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular slam. In: European Conference on Computer Vision (ECCV) (2014)
-  Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., Van der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. IEEE International Conference on Computer Vision (ICCV) (2015)
-  Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (ECCV) (2016)
-  Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR) (2013)
-  Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: IEEE International Conference on Computer Vision (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
-  Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  Jayaraman, D., Grauman, K.: Learning image representations tied to ego-motion. In: IEEE International Conference on Computer Vision (ICCV) (2015)
-  Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 4762–4769 (2016)
-  Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems (NIPS) (2017)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Klein, G., Murray, D.: Parallel tracking and mapping for small ar workspaces. In: IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR) (2007)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS) (2012)
-  Kuznietsov, Y., Stückler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: International Conference on 3D Vision (3DV) (2016)
-  Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. In: ACM Transactions on Graphics (2004)
-  Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
-  Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31(5) (2015)
-  Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics 33(5) (2017)
-  Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: DTAM: Dense tracking and mapping in real-time. In: IEEE International Conference on Computer Vision (ICCV) (2011)
-  Rad, M., Lepetit, V.: BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In: IEEE International Conference on Computer Vision (ICCV) (2017)
-  Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
-  Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
-  Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems (2006)
-  Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European Conference on Computer Vision (ECCV). pp. 1–14 (2012)
-  Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: International Conference on Intelligent Robot Systems (IROS) (2012)
-  Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371 (2017)
-  Tateno, K., Tombari, F., Laina, I., Navab, N.: Cnn-slam: Real-time dense monocular slam with learned depth prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: Demon: Depth and motion network for learning monocular stereo. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  Wannenwetsch, A.S., Keuper, M., Roth, S.: Probflow: Joint optical flow and uncertainty estimation. In: IEEE International Conference on Computer Vision (ICCV) (2017)
-  Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: Learned Invariant Feature Transform. In: European Conference on Computer Vision (ECCV) (2016)
-  Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Computer Vision and Pattern Recognition (CVPR) (2017)
-  Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)