
ENG: End-to-end Neural Geometry for Robust Depth and Pose Estimation using CNNs

Recovering structure and motion parameters given an image pair or a sequence of images is a well studied problem in computer vision. This is often achieved by employing Structure from Motion (SfM) or Simultaneous Localization and Mapping (SLAM) algorithms, depending on the real-time requirements. Recently, with the advent of Convolutional Neural Networks (CNNs), researchers have explored the possibility of using machine learning techniques to reconstruct the 3D structure of a scene and jointly predict the camera pose. In this work, we present a framework that achieves state-of-the-art performance on single image depth prediction for both indoor and outdoor scenes. The depth prediction system is then extended to predict optical flow and ultimately the camera pose, and is trained end-to-end. Our motion estimation framework outperforms the previous motion prediction systems, and we also demonstrate that the state-of-the-art metric depths can be further improved using the knowledge of pose.





1 Introduction

The importance of navigation and mapping to the fields of robotics and computer vision has only increased since its inception. Vision based navigation in particular is an extremely interesting field of research due to its discernible resemblance to human navigation and the wealth of information an image contains. Although creating a machine that understands structure and motion purely from RGB images is challenging, the computer vision community has developed a plethora of algorithms to replicate useful aspects of human vision using a computer. Tracking and mapping remains an unsolved problem, with many popular approaches. Photometric based techniques rely on establishing correspondences across different viewpoints of the same scene, and the matching points are then used to perform triangulation. Based on the density of the map, the field can be divided into dense [28], semi-dense [6] and sparse [26] approaches, each with its own advantages and disadvantages.

Applying machine learning techniques to solve vision problems has been another popular area of research. Great advances have been made in the fields of image classification [20, 12, 13] and semantic segmentation [25, 41], and this has led geometry based machine learning methods to follow suit. The massive growth in neural network driven research has largely been facilitated by the increased availability of low-cost high performance GPUs as well as the relative accessibility of machine learning frameworks such as Tensorflow and Caffe.

In this work we draw from both machine learning approaches and SfM techniques to create a unified framework which is capable of predicting the depth of a scene and the motion parameters governing the camera motion between an image pair. We construct our framework incrementally: the network is first trained to predict depths given a single color image. Then a color image pair and the associated depth predictions are provided to a flow estimation network, which produces an optical flow map along with an estimated measure of confidence in x and y motion. Finally, the pose estimation block utilises the outputs of the previous networks to estimate a motion vector corresponding to the logarithm of the Special Euclidean transformation (an element of SE(3)), which describes the relative camera motion from the first image to the second.

We summarise the contributions made in this paper as follows:

  • We achieve state-of-the-art results for single image depth prediction on both NYUv2 (indoor) and KITTI (outdoor) datasets. [Section 4: Table 1 and Table 2]

  • We outperform previous camera motion prediction frameworks on both TUM and KITTI datasets. [Section 4: Table 5 and Table 4]

  • We also present the first approach to use a neural network to predict the full information matrix which represents the confidence of our optical flow estimate.

2 Related Work

Estimating motion and structure from two or more views is a well studied vision problem. In order to reconstruct the world and estimate camera motion, sparse feature based systems [19, 26] compute correspondences through feature matching while the denser approaches[6, 28] rely on brightness constancy across multiple viewpoints. In this work, we leverage CNNs to solve the aforementioned tasks and we summarize the existing works in the literature that are related to the ideas presented in this paper.

2.1 Single Image Depth Prediction

Predicting depth from a single RGB image using learning based approaches has been explored even prior to the resurgence of CNNs. In [33], Saxena et al. employed a Markov Random Field (MRF) to combine global and local image features. Similar to our approach, Eigen et al. [5] introduced a common CNN architecture capable of predicting depth maps for both indoor and outdoor environments. This concept was later extended to a multi-stage coarse to fine network by Eigen et al. in [4]. Further advances were made by combining graphical models with CNNs [24] to improve the accuracy of depth maps, through the use of related geometric tasks [3], and by making architectural improvements specifically designed for depth prediction [22]. Kendall et al. demonstrated in [17] that predicting depths together with uncertainties improves the overall accuracy. While most of these methods demonstrated impressive results, no explicit notion of geometry was used during any stage of the pipeline, which opened the way for geometry based depth prediction approaches.

In one of the earliest works to predict depth using geometry in an unsupervised fashion, Garg et al. used the photometric difference between a stereo image pair, where the target image was synthesized using the predicted disparity and the known baseline [8]. Left-right consistency was explicitly enforced in the unsupervised framework of Godard et al. [10] as well as in the semi-supervised framework of Kuznietsov et al. [21], a technique we also found to be beneficial when training on sparse ground truth data.

2.2 Optical Flow Prediction

An early work in optical flow prediction using CNNs was FlowNet [7]. This was later extended by Ilg et al. to FlowNet 2.0 [14], which included stacked FlowNets [7] as well as warping layers. Ranjan and Black proposed a spatial pyramid based optical flow prediction network [30]. More recently, Sun et al. proposed a framework which uses principles from geometry based flow estimation techniques, such as image pyramids, warping and cost volumes [36]. As our end goal revolves around predicting camera pose, it becomes necessary to isolate the flow that was caused purely by camera motion. To achieve this, we extend these previous works to predict both the optical flow and the associated information matrix of the flow. Although not in a CNN context, [39] showed the usefulness of estimating flow and uncertainty.

2.3 Pose Estimation

CNNs have been successfully used to estimate various components of a Structure from Motion pipeline. Earlier works focused on learning discriminative image based features suitable for ego-motion estimation [2, 15]. Yi et al.[40] showed a full feature detection framework can be implemented using deep neural networks. Rad and Lepetit in BB8[29] showed the pose of objects can be predicted even under partial occlusion and highlighted the increased difficulty of predicting 3D quantities over 2D quantities. Kendall and Cipolla demonstrated that camera pose prediction from a single image catered for relocalization scenarios [16].

However, each of the above works lacks a representation of structure, as they do not explicitly predict depths. Our work is more closely related to that of Zhou et al. [42] and Ummenhofer et al. [38] and their frameworks SfM-Learner and DeMoN. Both of these approaches also predict a single confidence map, in contrast to ours, which estimates the confidence in x and y directions separately. Since our framework predicts metric depths, unlike theirs, we are able to produce far more accurate visual odometry and combat scale drift. CNN-SLAM by Tateno et al. [37] incorporated the depth predictions of [22] into a SLAM framework. Despite solely computing sequential frame-to-frame alignments, our method performs competitively with CNN-SLAM as well as ORB-SLAM[27] and LSD-SLAM[6], which have the added advantage of performing loop closures and local/global bundle adjustments.

3 Method

3.1 Network Architecture

The overall architecture consists of 3 main subsystems in the form of a depth, flow and camera pose network. A large percentage of the model capacity is invested into the depth prediction component for two reasons. Firstly, the output of the depth network also serves as an additional input to the other subsystems. Secondly, we wanted to achieve superior depths for indoor and outdoor environments using a common architecture (although there are separate models for indoor and outdoor scenes, the underlying architecture is common). In order to preserve space and to provide an overall understanding of the data flow, a high level diagram of the network is shown in Figure 1. An expanded architecture with layer definitions for each of the subsystems is included in the supplementary materials.

Figure 1: Overview of our system full pipeline. Please note that we use the notation to indicate from to

3.2 Depth Prediction

The depth prediction network consists of an encoder and a decoder module. The encoder network is largely based on the DenseNet161 architecture described in [13]. In particular we use the variant pre-trained on ImageNet [32] and slightly increase the receptive field of the pooling layers. As the original input is down-sampled 4 times by the encoder, during the decoding stage the feature maps are up-sampled back 4 times to make the model fully convolutional. We employ skip connections in order to re-introduce the finer details lost during pooling. Since the first down-sampling operation is done at a very early stage of the pipeline and its activations closely resemble the image features, these activations are not reused inside the decoder. Up-project blocks are used to perform up-sampling in our network, which provide better depth maps compared to de-convolutional layers, as shown in [22].

Due to the availability of dense ground truth data for indoor datasets (e.g. NYUv2 [34], RGB-D [35]), this network can be directly utilised to perform supervised learning. Unfortunately, the ground truth data for the outdoor datasets (KITTI) is much sparser, which meant we had to incorporate a semi-supervised learning approach in order to provide a strong training signal. Therefore, during training on KITTI, we use a Siamese version of the depth network with complete weight-sharing, and enforce photometric consistency between the left-right image pairs through an additional loss function. This is similar to the previous approaches [8, 21] and is only required during the training stage. During inference only a single input image is required to perform depth estimation using our network.
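The left-right photometric consistency idea can be illustrated on a single image row. This is a minimal sketch under stated assumptions, not the training code of this work: it assumes a rectified stereo pair, converts depth to disparity with the standard relation disparity = focal length × baseline / depth, and uses linear interpolation with border clamping. The function names and the L1 penalty are hypothetical choices made for illustration.

```python
def depth_to_disparity(depth, focal_px, baseline_m):
    """Disparity in pixels for a rectified stereo pair (standard pinhole relation)."""
    return focal_px * baseline_m / depth


def synthesise_left_row(right_row, left_depth_row, focal_px, baseline_m):
    """Rebuild one row of the left image by sampling the right image.

    Assumption: a pixel at column x in the left image appears at column
    x - disparity in the right image (rectified pair, pure x translation).
    """
    width = len(right_row)
    out = []
    for x, depth in enumerate(left_depth_row):
        src = x - depth_to_disparity(depth, focal_px, baseline_m)
        src = min(max(src, 0.0), width - 1.0)  # clamp to the image border
        x0 = int(src)                          # floor, valid because src >= 0
        x1 = min(x0 + 1, width - 1)
        w = src - x0                           # linear interpolation weight
        out.append((1.0 - w) * right_row[x0] + w * right_row[x1])
    return out


def photometric_l1(left_row, synthesised_row):
    """Mean absolute intensity difference between real and synthesised rows."""
    return sum(abs(a - b) for a, b in zip(left_row, synthesised_row)) / len(left_row)


# Toy example: every pixel is 50 m away, which gives a disparity of one pixel.
right = [0.0, 10.0, 20.0, 30.0, 40.0]
left = [0.0, 0.0, 10.0, 20.0, 30.0]
synth = synthesise_left_row(right, [50.0] * 5, focal_px=100.0, baseline_m=0.5)
loss = photometric_l1(left, synth)
```

A correct depth map makes the synthesised row match the real left row, so the loss falls to zero; an incorrect depth shifts the samples and the loss grows, which is what supplies a training signal where ground truth is sparse.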

3.3 Flow Prediction

The flow network provides an estimation of the optical flow along with the associated confidences given an image pair. These outputs, combined with the predicted depths, allow us to predict the camera pose. As part of our ablation studies we integrated the flow predictions of [14] with our depths; however, the main limitation of this approach was the lack of a mechanism to filter out the dynamic objects which are abundant in outdoor environments. This was solved by estimating confidence, specifically the information matrix, in addition to the optical flow. More concretely, for each pixel our flow network predicts 5 quantities: the optical flow in the x and y direction, and three further quantities which are required to compute the information matrix, or the inverse of the covariance matrix.


This parametrisation guarantees that the information matrix is positive-definite and can be used to parametrise any information matrix. We found that the gradients are much more stable compared to predicting the information matrix directly, as the determinant of the matrix is always greater than zero.
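To make the idea of building a positive-definite 2×2 matrix from three unconstrained network outputs concrete, the sketch below uses a Cholesky-style factor with an exponentiated diagonal. This is an assumption chosen for illustration and not necessarily the parametrisation used in this work; the names `a`, `b` and `c` are hypothetical.

```python
import math


def information_matrix(a, b, c):
    """Build a symmetric positive-definite 2x2 matrix from three real numbers.

    Hypothetical parametrisation: Omega = L * L^T with the lower-triangular
    factor L = [[exp(a), 0], [c, exp(b)]]. The diagonal of L is strictly
    positive, so Omega is positive-definite for any real a, b, c.
    """
    l11, l22 = math.exp(a), math.exp(b)
    return [[l11 * l11, l11 * c],
            [l11 * c, c * c + l22 * l22]]


def determinant(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


omega = information_matrix(0.3, -1.2, 2.0)
det = determinant(omega)  # equals exp(2 * (a + b)), hence always positive
```

With this choice the determinant reduces to exp(2(a + b)), which can never reach zero for finite inputs, and any symmetric positive-definite matrix can be reached because every such matrix has a Cholesky factor with a positive diagonal.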

With respect to the architecture, we borrow elements from FlowNet [7] as well as FlowNet 2.0 [14]. As mentioned in [14], FlowNet 2.0 was unable to reliably estimate small motions, which we address with two key changes. Firstly, our flow network takes the predicted depth map as an input, allowing the network to learn the relationship between depth and flow explicitly, including the fact that closer objects appear to move more than objects that are further away from the camera. Secondly, we use “warp-concatenation”, where coarse flow estimates are used to warp the CNN features during the decoder stage. This appears to resolve small motions more effectively, particularly on the TUM [35] dataset.

Figure 2: We detail the two approaches we took to estimating the relative pose alignment between adjacent frames (best viewed in colour). Left shows the iterative approach we took, that incorporates a re-weighted least-squares solver (RWLS) into a pose estimation loop. Right shows our fully-connected (FC) approach, which incorporates a succession of strided convolutions, followed by several FC layers.

3.4 Pose Estimation

We take two approaches to pose estimation, shown in Figure 2: an iterative approach and a fully-connected (FC) approach. This contrasts the ability of a neural network to estimate pose from the available information with the simplicity of a standard computer vision approach using the available predicted quantities. We use FC layers to provide the network with as wide a receptive field as possible, so that it can be compared more fairly against the iterative approach, which uses the inferred quantities directly.


The iterative approach uses a more conventional method for computing relative pose estimates. We use a standard re-weighted least squares solver based on the residual flow, given an estimate of the relative transformation. More concretely, we attempt to minimise an error function of the residual flow with respect to the relative transformation parameters.


In this error function the total residual flow is expressed in normalised camera coordinates, and only the first two dimensions of the projected vector are used. The scene is represented as an ordered point cloud with an inverse depth coordinate, and the residual compares the predicted flow with the flow estimated from the current transformation at each pixel coordinate. The current transformation estimate can be expressed by the matrix exponential of a linear combination of generator matrices, weighted by the components of the motion vector, which is a member of the Lie-algebra se(3); each generator matrix corresponds to the relevant motion parameter. As this pipeline is implemented in Tensorflow [1], it allows us to train the network end to end. Please see the supplementary material for a more detailed explanation.
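The mapping from a six-component motion vector to a rigid transformation through generator matrices can be sketched as follows. This is an illustrative stand-alone example rather than the Tensorflow implementation referred to above: the ordering of the motion vector (three translations followed by three rotations) and the use of a truncated Taylor series for the matrix exponential are assumptions, and the function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def generators():
    """Six 4x4 generators of se(3): translations x, y, z, then rotations x, y, z."""
    gens = []
    for axis in range(3):
        g = [[0.0] * 4 for _ in range(4)]
        g[axis][3] = 1.0                      # unit translation along one axis
        gens.append(g)
    for i, j in ((1, 2), (2, 0), (0, 1)):
        g = [[0.0] * 4 for _ in range(4)]
        g[j][i], g[i][j] = 1.0, -1.0          # skew-symmetric rotation block
        gens.append(g)
    return gens


def se3_exp(xi, terms=30):
    """Return exp(sum_i xi[i] * G_i) as a 4x4 matrix (truncated Taylor series)."""
    gens = generators()
    a = [[sum(xi[k] * gens[k][r][c] for k in range(6)) for c in range(4)]
         for r in range(4)]
    result = [[float(r == c) for c in range(4)] for r in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        term = [[v / n for v in row] for row in mat_mul(term, a)]  # A^n / n!
        result = [[result[r][c] + term[r][c] for c in range(4)] for r in range(4)]
    return result


pure_translation = se3_exp([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
quarter_turn_z = se3_exp([0.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2])
```

Because every step is built from additions and multiplications, the same construction remains differentiable with respect to the motion vector when expressed in an automatic differentiation framework, which is what permits end-to-end training through the solver.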


For the fully-connected approach, similar to Zhou et al. [42] and Ummenhofer et al. [38], we also constructed a fully connected layer based pose estimation network. This network utilises 3 stacked fully connected layers and uses the same inputs as our iterative method. While we outperform the pose estimation benchmarks of [42] and [38] using this network, the iterative network is our recommended approach due to its close resemblance to conventional geometry based techniques.

3.5 Loss Functions

Depth Losses

For supervised training on indoor and outdoor datasets we use the reverse Huber loss function of [22], evaluated on the difference between the predicted and the ground truth depth. For the KITTI dataset we employed an additional photometric loss during training, as the ground truth is highly sparse. This unsupervised loss term enforces left-right consistency between stereo pairs.


This loss compares the left and right images with the views synthesised from their corresponding depth maps. It makes use of a normalisation function, the camera intrinsic matrix K, the transformation from pixel to camera coordinates, and the relative transformation matrices from left-to-right and right-to-left respectively. In this case the rotation is assumed to be the identity and the matrices purely translate in the x-direction. Additionally, we use a smoothness term.


The smoothness term is defined on the horizontal and vertical gradients of the predicted depth. This provides qualitatively better depths as well as faster convergence. The final loss function used to train KITTI depths combines the terms above, with the image-dependent terms computed on both left and right images separately.
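The reverse Huber (berHu) loss referred to in this section can be sketched as below. The form follows the commonly cited definition from [22]: absolute error below a threshold c and a scaled quadratic above it. Setting c to one fifth of the largest absolute residual and averaging over pixels are assumptions made for this sketch, and the function name is hypothetical.

```python
def berhu_loss(pred, target):
    """Reverse Huber loss over two equal-length sequences of depths.

    Residuals with |r| <= c are penalised linearly, larger residuals
    quadratically as (r^2 + c^2) / (2c). The threshold c is assumed to be
    20% of the largest absolute residual in the batch.
    """
    residuals = [abs(p - t) for p, t in zip(pred, target)]
    c = 0.2 * max(residuals)
    if c == 0.0:
        return 0.0  # perfect prediction, nothing to penalise
    total = 0.0
    for r in residuals:
        total += r if r <= c else (r * r + c * c) / (2.0 * c)
    return total / len(residuals)


# Residuals are 0.0, 0.5 and 2.0, so the threshold c is 0.4.
example = berhu_loss([1.0, 2.0, 4.0], [1.0, 2.5, 2.0])
```

The two branches meet at |r| = c, so the loss stays continuous while large errors receive a stronger, quadratic penalty than they would under a plain L1 loss.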

Flow Loss

The probability density of a multivariate Gaussian in 2D can be written as

$$p(\mathbf{x}) = \frac{|\Omega|^{1/2}}{2\pi} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\,\Omega\,(\mathbf{x}-\boldsymbol{\mu})\right),$$

where $\Omega$ is the information matrix, or inverse covariance matrix $\Sigma^{-1}$. The flow loss criterion is defined on the residual between the predicted flow and the ground truth flow, and optimises the network by maximising the log-likelihood of this probability distribution over the residual flow error.
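A per-pixel version of this criterion can be written as the negative log-likelihood of the flow residual under a zero-mean 2D Gaussian with a given information matrix. This is a minimal sketch: whether constant terms are kept and how the loss is averaged over pixels are not stated here, so the constant log(2π) is retained and a single pixel is evaluated. The function name is hypothetical.

```python
import math


def flow_nll(pred_flow, gt_flow, omega):
    """Negative log-likelihood of a 2D flow residual.

    omega is a symmetric positive-definite 2x2 information matrix given as
    nested lists. The result is
        0.5 * r^T * omega * r - 0.5 * log(det(omega)) + log(2 * pi).
    """
    rx = pred_flow[0] - gt_flow[0]
    ry = pred_flow[1] - gt_flow[1]
    mahalanobis = (omega[0][0] * rx * rx
                   + 2.0 * omega[0][1] * rx * ry
                   + omega[1][1] * ry * ry)
    det = omega[0][0] * omega[1][1] - omega[0][1] * omega[1][0]
    return 0.5 * mahalanobis - 0.5 * math.log(det) + math.log(2.0 * math.pi)


identity = [[1.0, 0.0], [0.0, 1.0]]
confident = [[100.0, 0.0], [0.0, 100.0]]
good_and_confident = flow_nll((1.0, 0.0), (1.0, 0.0), confident)
bad_and_confident = flow_nll((3.0, 0.0), (1.0, 0.0), confident)
bad_but_uncertain = flow_nll((3.0, 0.0), (1.0, 0.0), identity)
```

The log-determinant term rewards high confidence when the residual is small, while the quadratic term punishes high confidence when the residual is large, so the network is encouraged to report low confidence where its flow is unreliable.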

Pose Loss

Given two input images, the predicted depth map of one of them and the predicted relative pose, the unsupervised loss and the pose loss can be defined. The pose loss uses the logarithm map, which takes a transformation T from the Lie-group SE(3) to the Lie-algebra se(3), such that the transformation can be represented by its constituent motion parameters and compared with the ground truth relative pose parameters.

3.6 Training Regime

We train our network end-to-end on the NYUv2 [34], TUM [35] and KITTI [9] datasets. We use the standard test/train split for NYUv2 and KITTI and define our own scene split for TUM. It is worth mentioning that the amount of training data we used is radically reduced compared to [42] and [38]. More concretely, for both NYUv2 and KITTI we use only a fraction of the full dataset. We use the Adam optimiser [18] with an initial learning rate of 1e-4 for all experiments, chose Tensorflow [1] as the learning framework, and train using an NVIDIA DGX-1. We provide a detailed training schedule and breakdown in the supplementary material.

4 Results

In this section we summarise the single-image depth prediction and relative pose estimation performance of our system on several popular machine learning and SLAM datasets. We also investigate the effect of using alternative optical flow estimates from [14] and [31] in our pose estimation pipeline as an ablation study. The entire model contains 130M parameters. Our depth estimator runs at 5fps on an NVIDIA GTX 1080Ti, while other sub-networks run at 30fps.

4.1 Depth Estimation

We summarise the results of evaluating our single-image depth estimation on the datasets NYUv2 [34], KITTI [9] and RGB-D [35] in Tables 1, 2 and 3 respectively, using the established metrics of [5].

We train the Ours(baseline) model to showcase the performance we get by purely using the depth loss. This is then extended to use the full end-to-end training loss (depth + flow + pose losses) in the Ours(full) model, which demonstrates a consistent improvement across all datasets, most notably in Tables 2 and 3, for which ground truth pose data was available for training. This validates our approach for improving single image depth estimation performance, and demonstrates that a network can be improved by enforcing more geometric priors on the loss functions. We would like to mention that the improvement we gain from Ours(baseline) to Ours(full) is purely due to the novel combined loss terms, as the flow and pose sub-networks do not increase the model capacity of the depth subnet itself.
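For reference, the error and accuracy measures commonly reported for single-image depth estimation can be computed as in the sketch below. It assumes the usual definitions (RMSE, RMSE in log space, absolute and squared relative error, and the fraction of pixels whose ratio to ground truth is below 1.25, 1.25² and 1.25³) and assumes that pixels with non-positive ground truth are treated as missing; the function and key names are hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Standard depth evaluation measures over flat sequences of depths."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]  # skip missing ground truth
    n = len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {"rmse": rmse, "rmse_log": rmse_log, "abs_rel": abs_rel,
            "sq_rel": sq_rel, "delta1": deltas[0], "delta2": deltas[1],
            "delta3": deltas[2]}


# The last pixel has no ground truth and is ignored.
metrics = depth_metrics([1.0, 2.0, 4.0, 3.0], [1.0, 2.0, 2.0, 0.0])
```

The first four measures are errors, so lower values are better, while the three threshold accuracies are fractions of pixels, so higher values are better.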

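| Method | RMSE (lower is better) | RMSE log (lower is better) | Abs Rel (lower is better) | δ < 1.25 (higher is better) | δ < 1.25² (higher is better) | δ < 1.25³ (higher is better) |
|---|---|---|---|---|---|---|
| Eigen et al. [4] | 0.641 | 0.214 | 0.16 | 76.9% | 95.0% | 98.8% |
| Laina et al. [22] | 0.573 | 0.195 | 0.13 | 81.1% | 95.3% | 98.8% |
| Kendall et al. [17] | 0.506 | - | 0.110 | 81.7% | 95.9% | 98.9% |
| Ours (baseline) | 0.487 | 0.164 | 0.113 | 86.7% | 97.7% | 99.4% |
| Ours (full) | 0.478 | 0.161 | 0.111 | 87.2% | 97.8% | 99.5% |

Table 1: The performance of several approaches evaluated on single-image depth estimation using the standard testset of NYUv2 [34] proposed in [4].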
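| Cap | Method | RMSE (lower is better) | RMSE log (lower is better) | Abs Rel (lower is better) | δ < 1.25 (higher is better) | δ < 1.25² (higher is better) | δ < 1.25³ (higher is better) |
|---|---|---|---|---|---|---|---|
| 0-80m | Zhou et al. [42] | 6.856 | 0.283 | 0.208 | 67.8% | 88.5% | 95.7% |
| 0-80m | Godard et al. [10] | 4.935 | 0.206 | 0.141 | 86.1% | 94.9% | 97.6% |
| 0-80m | Kuznietsov et al. [21] | 4.621 | 0.189 | 0.113 | 86.2% | 96.0% | 98.6% |
| 0-80m | Ours (baseline) | 4.394 | 0.178 | 0.095 | 89.4% | 96.6% | 98.6% |
| 0-80m | Ours (full) | 4.301 | 0.173 | 0.096 | 89.5% | 96.8% | 98.7% |
| 0-50m | Zhou et al. [42] | 5.181 | 0.264 | 0.201 | 69.6% | 90.0% | 96.6% |
| 0-50m | Garg et al. [8] | 5.104 | 0.273 | 0.169 | 74.0% | 90.4% | 96.2% |
| 0-50m | Godard et al. [10] | 3.729 | 0.194 | 0.108 | 87.3% | 95.4% | 97.9% |
| 0-50m | Kuznietsov et al. [21] | 3.518 | 0.179 | 0.108 | 87.5% | 96.4% | 98.8% |
| 0-50m | Ours (baseline) | 3.359 | 0.168 | 0.092 | 90.5% | 97.0% | 98.8% |
| 0-50m | Ours (full) | 3.284 | 0.164 | 0.092 | 90.6% | 97.1% | 98.9% |

Table 2: The performance of previous state-of-the-art approaches evaluated on the standard testset of the KITTI dataset [9].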
Figure 3: Resulting single image depth estimation for several approaches and ours against the ground truth on the dataset NYUv2[34]. The RMSE for each prediction is included
| Method | RMSE (lower is better) | RMSE log (lower is better) | Abs Rel (lower is better) | Sq Rel (lower is better) | δ < 1.25 (higher is better) | δ < 1.25² (higher is better) | δ < 1.25³ (higher is better) |
|---|---|---|---|---|---|---|---|
| Laina et al. [22] | 1.275 | 0.481 | 0.189 | 0.371 | 75.3% | 89.1% | 91.8% |
| DeMoN(est) [38] | 2.980 | 0.910 | 1.413 | 5.109 | 21.0% | 36.6% | 48.9% |
| DeMoN(gt) [38] | 1.584 | 0.555 | 0.301 | 0.581 | 52.7% | 70.7% | 80.7% |
| Ours (baseline) | 1.068 | 0.353 | 0.128 | 0.236 | 86.9% | 92.2% | 93.5% |
| Ours (full) | 0.996 | 0.329 | 0.108 | 0.194 | 90.3% | 93.6% | 94.5% |

Table 3: The performance of previous state-of-the-art approaches on a randomly selected subset of the frames from the RGB-D dataset [35]. We post separate entries for DeMoN(est) and DeMoN(gt): the former is scaled by the estimated scale of their system while the latter is scaled by the median groundtruth depth.
Figure 4: The resulting single image depth estimation for several approaches including Zhou et al. [42] (SfM-Learner), Kuznietsov et al. [21] and Ours against a ground truth filled using [23] on the testset of the KITTI dataset [9]. We include the RMSE values for each method's prediction. Filled depths are included for visualisation purposes only; during evaluation the predictions are evaluated against the sparse Velodyne ground truth data.

Additionally we include qualitative results for NYUv2 [34] and KITTI [9] in Figures 3 and 4 respectively, each of which illustrates a noticeable improvement over previous methods. We also demonstrate that the improvement goes beyond the numbers, as our approach generates more convincing depths even when the RMSE may be higher, as is the case in the second row of Figure 3, where [22] achieves a lower RMSE. More impressive still are the results in Figure 4, where we compare against previous approaches that are both trained on much larger training sets than our own, and still show noticeable qualitative and quantitative improvements.

4.2 Pose Estimation

To demonstrate the ability of our approach to perform accurate relative pose estimation, we evaluate it on several unseen sequences from the datasets for which ground-truth poses were available. To quantitatively evaluate the trajectories we use the absolute trajectory error (ATE) and the relative pose error (RPE) as proposed in [35]. To mitigate the effect of scale-drift on these quantities we scale all poses to the associated groundtruth poses during evaluation. Using both metrics provides an estimate of the consistency of each pose estimation approach. We summarise the results of this quantitative analysis for KITTI [9] in Table 4 and for RGB-D [35] in Table 5. We include comparisons against other state-of-the-art pose estimation networks, namely SfM-Learner [42] and DeMoN [38]. Additionally we include results from current state-of-the-art SLAM systems, namely ORB-SLAM2 [26] and LSD-SLAM [6].
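The effect of scaling an estimated trajectory to the ground truth before measuring the absolute trajectory error can be illustrated with the simplified sketch below. It is not the evaluation tooling of [35]: it assumes both trajectories are already expressed in the same coordinate frame with a shared origin, fits only a single least-squares scale factor, and omits the rigid alignment step. The function names are hypothetical.

```python
import math


def scale_to_ground_truth(pred, gt):
    """Least-squares scale s that minimises sum ||s * pred_i - gt_i||^2."""
    num = sum(p[i] * g[i] for p, g in zip(pred, gt) for i in range(3))
    den = sum(p[i] * p[i] for p in pred for i in range(3))
    return num / den


def ate_rmse(pred, gt):
    """Scale the predicted positions, then return (scale, RMSE of position error)."""
    s = scale_to_ground_truth(pred, gt)
    squared = [sum((s * p[i] - g[i]) ** 2 for i in range(3))
               for p, g in zip(pred, gt)]
    return s, math.sqrt(sum(squared) / len(squared))


# A trajectory with the right shape but half the true scale.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
scale, ate = ate_rmse(predicted, ground_truth)
```

A fitted scale far from 1.0 indicates that a method does not recover metric scale, even if the error after scaling is small.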

Figure 5: Top the scaled and aligned trajectories for Zhou et al.[42](SfM-Learner), ORB-SLAM2 [26] (without loop-closure enabled) and Ours respectively. Bottom box-plots of the relative pose scaling required to bring the predicted translation to the same magnitude as the ground-truth pose

In Table 4 we show the most comparable performance of our approach to state-of-the-art SLAM systems. We demonstrate a noticeable improvement over SfM-Learner on both sequences in all metrics. We evaluate SfM-Learner on its frame-to-frame tracking performance for adjacent frames (SFM-Learner(1)) and separations of 5 frames (SFM-Learner(5)), as they train their approach to estimate this size frame gap. Even with the massive reduction in accumulation error expected by taking larger frame gaps (demonstrated in reduced ATE) our system still produces more accurate pose estimates.

Figure 6: Resulting trajectories using our iterative approach on 3 additional sequences of KITTI [9]. Sequence 05 shows a failure for our approach, where accumulated drift causes the trajectory to not be well aligned. These sequences are not used for training

We show the resulting scaled trajectories of sequence 09 in Figure 5, as well as the relative scaling of each trajectory's poses in a box-plot. The spread of scales present for SfM-Learner indicates that scale is essentially ignored by their system, with scale drifts ranging across a full log scale, while the spreads of ORB-SLAM and our approach are barely visible at this scale. Another thing to note is that our scale is centered around 1.0, as we estimate scale directly by estimating metric depths. This seems to provide a strong benefit in terms of reducing scale-drift, and we believe it makes our system more usable in practice.

In Table 5 we show a significant improvement in performance against existing machine learning approaches across several sequences from the RGB-D dataset [35]. We evaluate against DeMoN [38] in two ways: frame-to-frame (DeMoN(1)), and, to give DeMoN the same advantage as SfM-Learner, with a wider baseline using a frame gap of 10 (DeMoN(10)), which they claim improves their depth estimations [38]. It can be observed that even with the massive reduction in accumulation error this gives them over our frame-to-frame approach, we still manage to significantly outperform their approach in ATE, even surpassing LSD-SLAM on the sequence fr1-xyz. ORB-SLAM is still the clear winner, as it massively benefits from the ability to perform local bundle-adjustments on the sequences used, which are short trajectories of small scenes. We include an example of a frame from the sequence fr3-walk-xyz in Figure 7, which shows this scene is not static, but our system has the ability to deal with this through the flow confidence estimates, discussed in Section 4.3.

| Method | Seq. 09 ATE(m) | Seq. 09 RPE(m) | Seq. 09 RPE(°) | Seq. 10 ATE(m) | Seq. 10 RPE(m) | Seq. 10 RPE(°) |
|---|---|---|---|---|---|---|
| ORB-SLAM(no-loop) [26] | 57.57 | 0.040 | 0.103 | 8.090 | 0.033 | 0.105 |
| ORB-SLAM(full) [26] | 9.104 | 0.056 | 0.084 | 7.349 | 0.031 | 0.100 |
| SfM-learner(5) [42] | 58.31 | 0.077 | 0.803 | 31.75 | 0.069 | 1.242 |
| SfM-learner(1) [42] | 81.09 | 0.050 | 0.976 | 75.89 | 0.045 | 1.599 |
| Ours(fully connected) | 41.50 | 0.087 | 0.387 | 29.29 | 0.081 | 0.486 |
| Ours(full) | 16.55 | 0.047 | 0.128 | 9.846 | 0.039 | 0.138 |

Table 4: Performance of several approaches evaluated on two sequences of the KITTI dataset [9]. SfM-Learner(1) and SfM-Learner(5) indicate the different frame gaps used to construct the trajectories. The results are separated by SLAM and machine learning approaches. We highlight the strongest results in bold for each type of approach.
| Method | fr1-xyz ATE(m) | fr1-xyz RPE(m) | fr1-xyz RPE(°) | fr2-360-hs ATE(m) | fr2-360-hs RPE(m) | fr2-360-hs RPE(°) | fr3-walk-xyz ATE(m) | fr3-walk-xyz RPE(m) | fr3-walk-xyz RPE(°) |
|---|---|---|---|---|---|---|---|---|---|
| LSD-SLAM [6] | 0.090 | - | - | - | - | - | 0.124 | - | - |
| ORB-SLAM [26] | 0.009 | 0.007 | 0.645 | - | - | - | 0.012 | 0.013 | 0.694 |
| DeMoN(10) [38] | 0.178 | 0.021 | 1.193 | 0.601 | 0.035 | 2.243 | 0.265 | 0.049 | 1.447 |
| DeMoN(1) [38] | 0.183 | 0.037 | 3.612 | 0.669 | 0.032 | 3.233 | 0.279 | 0.040 | 3.174 |
| Ours(fully connected) | 0.169 | 0.028 | 1.887 | 0.883 | 0.030 | 1.799 | 0.268 | 0.044 | 1.698 |
| Ours(iterative) | 0.071 | 0.024 | 1.237 | 0.461 | 0.020 | 0.736 | 0.240 | 0.026 | 0.811 |

Table 5: Performance of pose estimation on several sequences from the RGB-D dataset [35]. DeMoN(1) and DeMoN(10) indicate the trajectories were constructed with a frame gap of 1 and 10 respectively. Both [26] and [6] fail to track on fr2-360-hs. The results are separated by SLAM and machine learning approaches. We highlight the strongest results in bold for each type of approach.

4.3 Ablation Experiments

In order to examine the contribution of using each component of our pose estimation network, we compare the pose estimates under various configurations on sequences 09 and 10 of the KITTI odometry dataset[9], summarised in Table 6. We examine the relative improvement of iterating on our pose estimation till convergence, against a single weighted-least-squares iteration, which demonstrates iterating has a significantly positive effect. We demonstrate the improved utility of our flows by replacing our flow estimates with other state-of-the-art flow estimation methods from [14] and [31] in our pose estimation pipeline, and consistently demonstrate an improvement using our approach. We show the result of optimising with and without our estimated confidences, demonstrating quantitatively how important they are to pose estimation accuracy, with significant reductions across all metrics.

| Method | Seq. 09 ATE(m) | Seq. 09 RPE(m) | Seq. 09 RPE(°) | Seq. 10 ATE(m) | Seq. 10 RPE(m) | Seq. 10 RPE(°) |
|---|---|---|---|---|---|---|
| Ours(noconf) | 53.40 | 0.356 | 0.931 | 58.50 | 0.308 | 1.058 |
| Ours(noconf,iterative) | 33.18 | 0.248 | 0.421 | 35.87 | 0.280 | 0.803 |
| Flownet2.0 [14] | 29.64 | 0.349 | 0.838 | 51.90 | 0.222 | 0.954 |
| Flownet2.0(iterative) [14] | 24.61 | 0.185 | 0.400 | 22.61 | 0.142 | 0.484 |
| EpicFlow [31] | 119.0 | 0.566 | 0.931 | 20.98 | 0.199 | 0.853 |
| EpicFlow(iterative) [31] | 59.79 | 0.379 | 0.459 | 14.80 | 0.154 | 0.581 |
| Ours(full-single iteration) | 31.20 | 0.089 | 0.324 | 24.10 | 0.095 | 0.389 |
| Ours(full-til convergence) | 16.55 | 0.047 | 0.128 | 9.846 | 0.039 | 0.138 |

Table 6: Results of pose estimation on KITTI [9] with various components of the network removed or replaced. We highlight the strongest results in bold.

We also demonstrate qualitatively one of the ways in which estimating confidence improves our pose estimation in Figure 7. This shows that our system has learned the confidence on moving objects is lower than its surroundings and the confidences of edges are higher, helping our system focus on salient information during optimisation in an approach similar to [6].
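The way per-pixel confidences suppress moving objects can be seen in a toy weighted least-squares problem. The sketch below is a deliberately reduced illustration and not the pose solver of this work: it estimates a single 2D motion shared by all pixels, with each flow vector weighted by its 2×2 information matrix, using the closed form t = (Σ Ωᵢ)⁻¹ Σ Ωᵢ fᵢ. The function name and the numbers are hypothetical.

```python
def weighted_flow_mean(flows, infos):
    """Information-weighted least-squares estimate of one shared 2D motion.

    flows is a list of (fx, fy) vectors and infos the matching list of 2x2
    information matrices (nested lists). Solves (sum Omega_i) t = sum Omega_i f_i.
    """
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for (fx, fy), om in zip(flows, infos):
        for r in range(2):
            a[r][0] += om[r][0]
            a[r][1] += om[r][1]
            b[r] += om[r][0] * fx + om[r][1] * fy
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    tx = (a[1][1] * b[0] - a[0][1] * b[1]) / det   # 2x2 solve by Cramer's rule
    ty = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return tx, ty


# Two static pixels agree on the motion; the third lies on a moving object.
flows = [(1.0, 0.0), (1.0, 0.0), (9.0, 4.0)]
high = [[1.0, 0.0], [0.0, 1.0]]
low = [[0.01, 0.0], [0.0, 0.01]]
with_confidence = weighted_flow_mean(flows, [high, high, low])
without_confidence = weighted_flow_mean(flows, [high, high, high])
```

With equal weights the outlier drags the estimate far from the motion of the static pixels, whereas a low information matrix on the outlier leaves the estimate close to (1, 0).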

Figure 7: A frame pair from the sequence fr3-walk-xyz, shown with the estimated optical flow from the first frame to the second and the estimated flow confidences in the x and y directions respectively.

5 Conclusion and Further Work

We present the first piece of work that performs least squares based pose estimation inside a neural network. Instead of replacing every component of the SLAM pipeline with CNNs, we argue it’s better to use CNNs for tasks that greatly benefit from feature extraction (depth and flow prediction) and to use geometry for tasks where it is proven to work well (motion estimation given the depths and flow). Our formulation is fully differentiable and is trained end-to-end. We achieve state-of-the-art performance on single image depth prediction for both the NYUv2 [34] and KITTI [9] datasets. We demonstrate both qualitatively and quantitatively that our system is capable of producing better visual odometry that considerably reduces scale-drift by predicting metric depths.

Supplementary Material

I Dataset Evaluation Analysis

In this section we evaluate and analyse the relative performance on each dataset as well as correlations in the dataset and how they relate to overall performance.

i.i NYUv2[34]

Figure 8: Left: The RMSE error on each image of the test set, sorted by our performance on the NYUv2 dataset. We include two competing approaches, as well as marking which side of the line indicates we are better (‘We Win’) and which we are worse (‘We Lose’). Right: The median ground-truth depth of each image in the test set also sorted by our RMSE performance. We include an approximate trend-line to show the relationship between depth and RMSE in our system

The NYUv2 dataset [34] has been a popular benchmark for indoor depth estimation and semantic segmentation since the work of Eigen et al. [4]. We provide several qualitative and quantitative results from the evaluation of our approach in Figures 9, 10 and 11. These show our strongest, median and worst performing images, as well as each prediction's RMSE in meters. This reveals two insights about our system's performance: we perform better on images with closer median depths, and our largest errors occur when we incorrectly estimate the overall scale of the scene. The relationship to median depth is evident in Figure 8, where the RMSE is strongly correlated with the median scene depth. We also observe a similarly strong correlation in the performance of all three approaches, although our approach overwhelmingly outperforms the competitors.

Figure 9: The 6 highest performing images from the NYUv2 [34] testset, based on RMSE error. All images are of varying scenes, but contain lower median depth values on average.
Figure 10: The middle 6 images from the NYUv2 [34] testset, based on RMSE error. All images have RMSE values individually lower than the full testset (0.480m), indicating a small number of outliers, which is apparent in Figure

Figure 11: The 6 lowest performing images from the NYUv2 [34] testset, based on RMSE error. In general these images contain higher median depth values, but the way in which our network gets it wrong appears to be in estimating the overall scene scale. This quantity is challenging to estimate, and we can observe qualitatively the system still produces believable relative depth estimates

What conclusions can we draw from these results? To a large extent this is a direct consequence of ranking the results by the chosen error metric. As we rank by RMSE, we would expect the images with the largest errors to be those with higher depths, since only very large predictions or very large ground truth values can generate large RMSE values. The results also indicate that our network tends to behave conservatively, estimating that the scene is closer on average rather than further away. This is probably a direct result of the depth value distribution in the training set, potentially biasing the depths towards the lower end.
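The ranking analysis above can be illustrated with a small NumPy sketch. The data here is synthetic and the 10% scale underestimate is a purely hypothetical stand-in for a conservative network; the function names are ours and not part of the evaluated system. It shows that when the dominant error is a global scale error, the per-image RMSE grows with median scene depth, so sorting by RMSE also sorts by depth.

```python
import numpy as np

def per_image_rmse(pred, gt):
    """RMSE per image over valid (gt > 0) pixels; pred and gt have shape (N, H, W).

    Assumes every image has at least one valid ground-truth pixel.
    """
    valid = gt > 0
    squared_error = np.where(valid, (pred - gt) ** 2, 0.0)
    return np.sqrt(squared_error.sum(axis=(1, 2)) / valid.sum(axis=(1, 2)))

def median_depths(gt):
    """Median valid ground-truth depth of each image."""
    return np.array([np.median(g[g > 0]) for g in gt])

# Synthetic stand-in for a test set: 20 constant-depth "images" between 0.5 m and 10 m.
rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=(20, 1, 1)) * np.ones((20, 8, 8))
pred = 0.9 * gt  # hypothetical conservative network: scene estimated 10% too close

rmse = per_image_rmse(pred, gt)
order = np.argsort(rmse)            # ranking used for the sorted plots
median = median_depths(gt)
correlation = np.corrcoef(median, rmse)[0, 1]
```

With a pure scale error the RMSE is proportional to depth, so `correlation` is essentially 1 and `median[order]` is non-decreasing.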

i.ii KITTI [9]

Figure 12: Left: The RMSE error on each image of the test set, sorted by our performance on the KITTI dataset. We include two competing approaches, as well as marking which side of the line indicates we are better (‘We Win’) and which we are worse (‘We Lose’). Right: The median ground-truth depth of each image in the test set also sorted by our RMSE performance. We include an approximate trend-line to show the relationship between depth and RMSE in our system
Figure 13: The highest performing 6 images from the KITTI [9] testset, based on RMSE error. Surprisingly not all of these contain a great deal of scale context, in particular rows 2 and 6, where they face a dirt ramp, which is atypical of the predominantly road facing dataset. This indicates strongly that the approach is genuinely learning about the geometry of the scenes
Figure 14: The middle 6 performing images from the KITTI [9] testset, based on RMSE error. The RMSE values are hovering around the value achieved for the dataset and represent the typical performance. Note the system's ability to estimate depth in the top half of the scene, which never receives a ground truth training signal, as the LIDAR only scans below the horizon line
Figure 15: The lowest 6 performing images from the KITTI [9] testset, based on RMSE error. Although the RMSE of each of these images is comparatively high, the depth predictions produced are convincing qualitatively

Our most impressive performance is perhaps on the KITTI benchmark dataset [9] where, as shown in Figure 12 (left), we consistently outperform the competing approaches on almost all test images. The scale of the depth error had to be changed in order to capture the full range of errors. The large errors of the competing approaches could arise because they estimate inverse depth/disparity and invert the predicted values to compute their loss function. This can lead to unstable performance at large distances, due to the non-linearity of the inversion in this range, as opposed to our approach, which is linear at all depths.
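The instability argument can be made concrete with a short sketch. It assumes a predictor that outputs inverse depth with a small additive error and asks how large the resulting depth error is after inversion. The error value of 0.001 (in 1/m) and the function name are illustrative assumptions, not quantities taken from the competing methods.

```python
import numpy as np

def depth_error_from_inverse_depth_error(depth, inverse_depth_error):
    """Absolute depth error caused by an additive error on the predicted inverse depth."""
    predicted_inverse = 1.0 / depth + inverse_depth_error
    return np.abs(1.0 / predicted_inverse - depth)

depths = np.array([2.0, 10.0, 40.0, 80.0])  # metres
errors = depth_error_from_inverse_depth_error(depths, inverse_depth_error=0.001)
# errors is roughly [0.004, 0.099, 1.54, 5.93] metres
```

The same inverse-depth error therefore costs millimetres at 2 m but several metres at 80 m (to first order the depth error scales with the square of the depth), whereas an error made directly in depth stays the same size at every range.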

Again we include the analysis of the median depths sorted against the RMSE in Figure 12 (right), as we did for NYUv2. In this case the relationship between error and depth is much weaker. This is most likely due to the nature of the dataset, which contains a very similar spread of data for most images in the training set, as they film similar scenarios. However, the relationship is still visible in Figure 13, where the scenes contain comparatively low depth values, indicating again that our system behaves conservatively in estimating depths.

i.iii Comparison using the architecture of Kuznietsov [21]

We replaced the architecture of our depth estimation network with that of Kuznietsov et al. [21]. As can be seen below, using the full training loss improves the accuracy of the depth estimation results, indicating the generality of the approach.

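| Metric | NYU [34] Baseline | NYU [34] Full | TUM [35] Baseline | TUM [35] Full | KITTI [9] Baseline | KITTI [9] Full |
|---|---|---|---|---|---|---|
| RMSE | 0.536 | 0.525 | 1.096 | 1.015 | 3.518 | 3.425 |
| δ < 1.25 (%) | 82.5 | 82.8 | 79.9 | 81.1 | 87.5 | 89.5 |
| δ < 1.25² (%) | 96.3 | 96.7 | 90.4 | 91.8 | 96.4 | 96.9 |
| δ < 1.25³ (%) | 99.2 | 99.3 | 93.8 | 94.6 | 98.8 | 98.8 |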

II Pose Trajectories

ii.i KITTI Trajectories

We include the trajectory from sequence 10 of the odometry dataset from KITTI [9]. For the quantitative results please refer to the main paper. The resulting trajectories in Figure 16 indicate the comparatively strong performance of our approach, and show that our iterative approach (bottom-left) is significantly more accurate than the FC approach (top-right). This trajectory contains no loops, and as such could lead to significant scale drift in some SLAM systems. In this case, however, the frequent local bundle adjustments performed by ORB-SLAM seem to have helped to maintain the map quality throughout.

Figure 16: Trajectories of our method in both configurations, as well as the resulting trajectories of ORB-SLAM (full) [26] and SfM-Learner [42]. We demonstrate comparable quality to ORB-SLAM, and significantly outperform SfM-Learner

ii.ii RGB-D Trajectories

We show the estimated aligned trajectories for several sequences from the RGB-D dataset [35], to demonstrate the qualitative performance of our system against previous approaches. We summarise the trajectories in Figure 17, which demonstrates that our approach compares favourably against DeMoN [38]. This is despite our method estimating only frame-to-frame relative poses from adjacent frames, while DeMoN(10) uses a larger baseline and thus should estimate a smoother trajectory given the reduction in accumulated error. Ultimately ORB-SLAM [26] is still the clear winner, as it uses information from multiple frames, and iteratively aggregates error across short sections. However, as our approach is purely VO, we were able to obtain a trajectory for fr2-360-hs, which we were unable to do with ORB-SLAM due to the challenging nature of the camera motion and rapid lighting changes.
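To make the notion of accumulated frame-to-frame error concrete, the sketch below chains relative poses into a trajectory. It is a generic illustration rather than the evaluation code used here: it assumes each relative pose is a 4x4 homogeneous matrix that maps coordinates of frame k into frame k-1, and the function name is ours. Any error in one relative pose is carried into every later pose, which is why a wider baseline (fewer compositions) tends to give a smoother trajectory.

```python
import numpy as np

def accumulate_trajectory(relative_poses):
    """Compose frame-to-frame poses into absolute poses.

    relative_poses: iterable of 4x4 matrices, each assumed to map frame k
    coordinates into frame k-1. Returns an array of shape (N + 1, 4, 4) whose
    first entry is the identity (the starting frame).
    """
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for relative in relative_poses:
        pose = pose @ relative
        trajectory.append(pose.copy())
    return np.stack(trajectory)

# Usage: four identical steps, each moving 1 m forward and turning 90 degrees,
# trace out a square and return to the starting pose.
step = np.array([[0.0, -1.0, 0.0, 1.0],
                 [1.0,  0.0, 0.0, 0.0],
                 [0.0,  0.0, 1.0, 0.0],
                 [0.0,  0.0, 0.0, 1.0]])
trajectory = accumulate_trajectory([step] * 4)
positions = trajectory[:, :3, 3]   # camera centres along the path
```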

Figure 17: Trajectories of our method against ORB-SLAM [26] and DeMoN(10) [38], for the evaluated sequences from the RGB-D dataset [35]. We demonstrate a marked improvement upon DeMoN which, despite being given a slight advantage in some respects by widening the baseline and reducing accumulated pose error, still performs poorly. Against ORB-SLAM, however, both methods come up a little short, as ORB-SLAM is able to perform local bundle adjustments across multiple keyframes, which greatly reduces the overall error.

ii.iii Comparison With CNN-SLAM

The table below compares our approach on the datasets used by CNN-SLAM [37]. We would like to point out that our method performs competitively despite solely computing sequential frame-to-frame alignments, and that it does not (yet) take advantage of the loop closures and local/global bundle adjustments used by the competing methods.

Absolute Trajectory Error

| Method | TUM/seq1 | TUM/seq2 | TUM/seq3 |
|---|---|---|---|
| CNN-SLAM | 0.542 | 0.243 | 0.214 |
| LSD-SLAM | 1.826 | 0.436 | 0.937 |
| ORB-SLAM | 1.206 | 0.495 | 0.733 |
| Ours (fc) | 1.043 | 0.672 | 0.186 |
| Ours (full) | 0.799 | 0.587 | 0.157 |
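For readers unfamiliar with the metric, the following is a minimal sketch of how an absolute trajectory error is commonly computed: the estimated camera positions are rigidly aligned to the ground truth in the least-squares sense and the RMSE of the remaining position differences is reported. Whether the numbers above use rigid or scale-aware alignment is not stated here, so the rigid (rotation plus translation) alignment below is an assumption, and the function names are ours.

```python
import numpy as np

def align_rigid(estimated, ground_truth):
    """Least-squares rotation R and translation t with R @ est_i + t close to gt_i.

    Both inputs are (N, 3) arrays of time-associated camera positions.
    Uses the SVD-based (Kabsch/Horn) closed-form solution.
    """
    mean_est = estimated.mean(axis=0)
    mean_gt = ground_truth.mean(axis=0)
    cross_cov = (estimated - mean_est).T @ (ground_truth - mean_gt)  # 3x3
    u, _, vt = np.linalg.svd(cross_cov)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = mean_gt - rotation @ mean_est
    return rotation, translation

def absolute_trajectory_error(estimated, ground_truth):
    """RMSE of position differences after rigid alignment."""
    rotation, translation = align_rigid(estimated, ground_truth)
    aligned = estimated @ rotation.T + translation
    return float(np.sqrt(np.mean(np.sum((aligned - ground_truth) ** 2, axis=1))))

# Usage: a rotated and shifted copy of a trajectory has (numerically) zero ATE.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(100, 3)), axis=0)
angle = 0.7
rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
est = gt @ rz.T + np.array([1.0, -2.0, 0.5])
ate_exact = absolute_trajectory_error(est, gt)
ate_noisy = absolute_trajectory_error(est + rng.normal(scale=0.1, size=est.shape), gt)
```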

III Optical Flow

Figure 18: A selection of optical flow predictions made by our framework on the TUM dataset [35]. For each frame pair we show the estimated optical flow from the first frame to the second, and the estimated flow confidences in the x and y directions respectively. The first two rows correspond to static scenes where only the camera moves, resulting in uniform flow across the image. The third row shows an example of a dynamic scene
Figure 19: A selection of optical flow predictions made by our framework on the KITTI dataset[9]. Dynamic objects and the objects that do not appear in both frames due to large camera motion have low confidence
Figure 20: Optical flow color coding

IV Network Architectures

iv.i Depth Network

Figure 21: The Depth Prediction Network. We include a summary of all operations (bottom-left) as well as description of the up-project blocks used in the decoder (bottom-right)

The encoder takes a global-mean-subtracted RGB image as input. During the feature encoding stage the resolution of the activations is reduced by a factor of 16. The first downsampling operation is performed using a strided convolutional layer, the next with a max-pooling layer and the final two with average-pooling layers. Upsampling is performed using the up-project blocks proposed in [22]. Since the first downsampling operation is performed by the very first convolutional layer, whose activations closely resemble image features, these activations are not provided to the decoder via a skip connection. It should be noted that ours is not the first piece of work to predict depth using a DenseNet architecture: Kendall et al. [17] also used a DenseNet variant, and the gains that we obtain are predominantly due to the loss functions we employed. Appendix I shows the full breakdown of the architecture.
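As a quick sanity check of the factor-of-16 reduction, the sketch below tracks the activation resolution through the four downsampling stages named above. It assumes each stage halves both dimensions exactly (that is, padding is chosen so that no pixels are lost); the stage labels follow the appendix table and the function name is ours.

```python
def encoder_resolutions(width, height):
    """Resolution after each of the four halving stages of the encoder."""
    stages = ("conv1", "pool1", "pool2", "pool3")  # strided conv, max-pool, two avg-pools
    resolutions = []
    for name in stages:
        width, height = width // 2, height // 2   # assumed exact halving per stage
        resolutions.append((name, width, height))
    return resolutions

# Usage with a 320x240 input: 160x120, 80x60, 40x30 and finally 20x15.
stages_320x240 = encoder_resolutions(320, 240)
```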

iv.ii Flow Network

Figure 22: The optical flow prediction network. We include a summary of all operations (bottom-left) as well as description of the flow-conv and flow-deconv blocks (bottom-right)

The flow network has three streams. The first stream takes the left image and its predicted depth map as input, the second stream receives the right image and the corresponding predicted depth map and, finally, the third stream receives both the left and right images and their associated depth predictions. Barring the first layer, all other layers of each stream share their weights. During the decoder stage the predicted flow is used to perform warp-concatenations, where the right image's activations are warped and concatenated with those of the left image. Since we estimate optical flow in a coarse-to-fine manner, where the later layers compute a residual to be added to the initial flow estimate, warp-concatenations help to capture small displacements more effectively.
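A warp-concatenation can be sketched in NumPy as follows. This is an illustrative stand-alone version, not the network's TensorFlow layer: it assumes feature maps of shape (H, W, C), a flow field of shape (H, W, 2) storing horizontal then vertical displacement from the left to the right image, bilinear sampling, and clamping at the image border. All function names are ours.

```python
import numpy as np

def warp_bilinear(features, flow):
    """Sample `features` (H, W, C) at pixel positions displaced by `flow` (H, W, 2)."""
    h, w, _ = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)   # clamp sample positions to the image
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]                   # fractional offsets, broadcast over channels
    wy = (y - y0)[..., None]
    top = (1 - wx) * features[y0, x0] + wx * features[y0, x1]
    bottom = (1 - wx) * features[y1, x0] + wx * features[y1, x1]
    return (1 - wy) * top + wy * bottom

def warp_concat(left_features, right_features, flow):
    """Warp the right activations towards the left view and stack along channels."""
    return np.concatenate([left_features, warp_bilinear(right_features, flow)], axis=-1)

# Usage: the right features are the left ones shifted by one pixel, so a flow of
# (+1, 0) brings them back into alignment everywhere except the clamped last column.
rng = np.random.default_rng(0)
left = rng.random((6, 8, 4))
right = np.roll(left, 1, axis=1)
flow = np.zeros((6, 8, 2))
flow[..., 0] = 1.0
stacked = warp_concat(left, right, flow)
```

After warping, the two halves of `stacked` agree wherever the flow is correct, so later layers only need to explain the small residual displacement.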

V Pose Network

v.i Iterative Re-weighted Approach

As described in the main body of the paper, we minimise an error function over the flow residuals with respect to the relative transformation parameters. For simplicity, we express the values in terms of normalised camera coordinates. The estimated flow is computed from the normalised camera coordinate and the current estimated transformation, as shown in Equation 11. To simplify the mathematics we represent the transformation using a matrix exponential,

$$T = \exp\left(\sum_{i=1}^{6} \alpha_i G_i\right),$$

where $\alpha_i$ is the $i$-th component of the motion vector $\boldsymbol{\alpha}$, which is a member of the Lie algebra $\mathfrak{se}(3)$, and $G_i$ is the generator matrix corresponding to the relevant motion parameter. Differentiating the residual function with respect to the motion parameters yields a Jacobian for each residual. These Jacobians can be stacked to form a larger Jacobian matrix $J$, and the residual vectors can likewise be stacked into a single vector $\mathbf{r}$. This allows us to iteratively reduce the loss function using a standard Gauss-Newton approach given by

$$\delta = -\left(J^{\top} W J\right)^{-1} J^{\top} W \mathbf{r},$$

where $\delta$ is the additive update to the motion parameters and $W$ is a diagonal weight matrix. Each diagonal entry of $W$ is computed from three quantities: the confidence value in the x-direction, a constant computed from the residual (Equation 11) as the mean residual magnitude of a single image, and the residual in the x-direction. This pipeline is implemented in TensorFlow [1] and allows us to train the network end to end.
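The iteratively re-weighted Gauss-Newton loop can be illustrated on a toy problem. The sketch below is not the TensorFlow implementation and does not use the SE(3) parameterisation: it estimates a 2D rigid motion (angle and translation) from point correspondences, some of which are gross outliers standing in for moving objects. The weight function, a confidence scaled by a Cauchy-style term whose scale is the mean residual magnitude, is an assumption chosen to mirror the ingredients named above; the exact weight definition should be taken from the paper. All names are ours.

```python
import numpy as np

def residuals_and_jacobian(params, src, dst):
    """Stacked residuals r = R(theta) src + t - dst and Jacobian w.r.t. (theta, tx, ty)."""
    theta, tx, ty = params
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    drot = np.array([[-s, -c], [c, -s]])          # derivative of rot w.r.t. theta
    r = (src @ rot.T + np.array([tx, ty]) - dst).reshape(-1)  # [x0, y0, x1, y1, ...]
    jac = np.zeros((r.size, 3))
    jac[:, 0] = (src @ drot.T).reshape(-1)
    jac[0::2, 1] = 1.0                            # x residuals depend on tx
    jac[1::2, 2] = 1.0                            # y residuals depend on ty
    return r, jac

def irls_gauss_newton(src, dst, confidence, iterations=20):
    """Weighted Gauss-Newton with weights recomputed from the residuals each iteration."""
    params = np.zeros(3)
    for _ in range(iterations):
        r, jac = residuals_and_jacobian(params, src, dst)
        scale = np.mean(np.abs(r)) + 1e-12        # mean residual magnitude
        weights = confidence / (1.0 + (r / scale) ** 2)  # ASSUMED Cauchy-style weight
        jtw = jac.T * weights                     # J^T W for diagonal W
        delta = -np.linalg.solve(jtw @ jac, jtw @ r)
        params = params + delta                   # additive update
        if np.linalg.norm(delta) < 1e-10:
            break
    return params

# Usage: 50 correspondences, 5 of which are corrupted by large random offsets.
rng = np.random.default_rng(0)
true_params = np.array([0.1, 0.5, -0.3])
src = rng.uniform(-1.0, 1.0, size=(50, 2))
c, s = np.cos(true_params[0]), np.sin(true_params[0])
dst = src @ np.array([[c, -s], [s, c]]).T + true_params[1:]
dst[:5] += rng.uniform(-5.0, 5.0, size=(5, 2))
estimate = irls_gauss_newton(src, dst, confidence=np.ones(2 * len(src)))
```

Because large residuals receive small weights, the corrupted correspondences contribute little to the normal equations and the estimate stays close to the true motion.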

v.ii Network Based Approach

This section was addressed in detail in the main body.

VI Training Procedure

vi.i Depth Training

All of the DenseNet-161 layers [13] of the depth nets are initialised using ImageNet [32] pretrained weights. The remainder of the layers are initialised using MSRA [11] initialisation. The NYUv2 [34] and TUM [35] models are trained purely using the supervised loss term. The network is regularized using a weight decay of 1 throughout training and the learning rate schedule is shown below:

Figure 23: Learning rate schedule for NYUv2[34] depth training

Out of 400,000 images in the NYU dataset, we only use 12,000 during training. We perform data augmentation 4 times (a total training set of 48,000 images) using color shifts, random crops and left-right flips. Although data augmentation can be implemented during training, we noticed a considerable speedup by performing it offline. The training images and the corresponding ground truth are downsampled by a factor of 2. Hence, the resolution of each training example becomes 320×240. Each training batch contains 8 images and we use 4 GPUs, resulting in an overall batch size of 32. In terms of training speed we observe on average 19.3 examples/sec or 0.415 sec/batch.
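An offline augmentation step of the kind described above could look like the sketch below. It is only an illustration: the crop size, the range of the per-channel color gains and the function name are hypothetical values chosen for the example, not the settings used for training. The important detail is that the crop and the flip are applied identically to the image and its depth map, while the color shift touches the image only.

```python
import numpy as np

def augment(image, depth, rng, crop_hw=(224, 288), max_shift=0.1):
    """Random crop, left-right flip and color shift for one RGB/depth pair.

    image: (H, W, 3) floats in [0, 1]; depth: (H, W). The crop size and
    max_shift defaults are illustrative assumptions.
    """
    h, w = depth.shape
    crop_h, crop_w = crop_hw
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    img = image[top:top + crop_h, left:left + crop_w].astype(np.float64)
    dep = depth[top:top + crop_h, left:left + crop_w].copy()
    if rng.random() < 0.5:                       # flip both so they stay aligned
        img = img[:, ::-1]
        dep = dep[:, ::-1]
    gains = 1.0 + rng.uniform(-max_shift, max_shift, size=3)  # per-channel color shift
    img = np.clip(img * gains, 0.0, 1.0)
    return img, dep

# Usage: four augmented copies of one 320x240 example (12,000 x 4 = 48,000 overall).
rng = np.random.default_rng(0)
image = rng.random((240, 320, 3))
depth = rng.uniform(0.5, 10.0, size=(240, 320))
augmented = [augment(image, depth, rng) for _ in range(4)]
```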

For the KITTI dataset we use 10,000 training images. Out of the training images that were defined in [5], we further prune our training set to exclude any images that are part of the odometry test set. We adopt a learning rate schedule which spans half the duration of the NYU schedule. This is primarily to avoid overfitting, as we are now working with a comparatively small training dataset.

vi.i.1 Optical Flow Training

In order to compute the ground truth optical flow image for the NYUv2 [34] dataset, we first compute the camera pose using the Iterative Closest Point (ICP) algorithm, which can then be used with the ground truth depth map to compute optical flow. This process is slightly simpler for the TUM [35] dataset, as the ground truth pose is provided. The network is then trained using the optical flow loss criterion. All the layers of the flow network are initialised using the MSRA [11] initialisation and the learning rate schedule is shown in Figure 24. As can be seen, the training duration is much shorter than that of the depth network, as the primary objective at this stage is to obtain a crude representation for both optical flow and the information matrix. Complete end-to-end fine tuning happens when the network is trained using the pose loss criterion.
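Given a depth map and a relative camera pose, the ground truth flow follows from back-projecting each pixel, moving it into the second camera frame and re-projecting it. The sketch below shows that computation under stated assumptions: a pinhole camera with intrinsics `K`, a pose `(R, t)` that maps points from the first camera frame into the second, and a static scene. The ICP step that produces the pose for NYUv2 is not shown, and the function name and intrinsics are ours.

```python
import numpy as np

def flow_from_depth_and_pose(depth, K, R, t):
    """Optical flow (H, W, 2) induced by camera motion for a static scene.

    depth: (H, W) metric depth of the first frame; K: 3x3 intrinsics;
    R, t: rotation and translation taking frame-1 points into frame 2 (assumed).
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T           # normalised camera coordinates
    points = rays * depth[..., None]             # 3D points in frame 1
    moved = points @ R.T + t                     # same points in frame 2
    projected = moved @ K.T
    uv2 = projected[..., :2] / projected[..., 2:3]
    return uv2 - pixels[..., :2]

# Usage: a fronto-parallel plane 2 m away and a 0.1 m sideways translation give a
# uniform horizontal flow of fx * tx / Z = 500 * 0.1 / 2 = 25 pixels.
K = np.array([[500.0, 0.0, 160.0],
              [0.0, 500.0, 120.0],
              [0.0, 0.0, 1.0]])
depth = np.full((240, 320), 2.0)
flow = flow_from_depth_and_pose(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```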

Figure 24: Learning rate schedule for optical flow training

vi.i.2 Pose Training

We optimize the full network end-to-end using the pose loss and demonstrate that the state-of-the-art depths can be further improved using the knowledge of pose. We train the network for 20,000 iterations with an initial learning rate that is halved at the half-way point.
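The step schedule for pose training can be written as a small helper. The initial learning rate is left as a parameter, and the value used in the usage lines is purely illustrative and not the value used in training; the function name is ours.

```python
def pose_learning_rate(iteration, initial_lr, total_iterations=20_000):
    """Learning rate that is halved once, at the half-way point of training."""
    if not 0 <= iteration < total_iterations:
        raise ValueError("iteration must lie in [0, total_iterations)")
    if iteration < total_iterations // 2:
        return initial_lr
    return initial_lr / 2.0

# Usage with a hypothetical initial rate of 1e-4.
early = pose_learning_rate(0, initial_lr=1e-4)
late = pose_learning_rate(10_000, initial_lr=1e-4)
```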

Appendix I

We include a full table description of our depth network, for completeness.

Depth Net

Model Architecture Breakdown
Layer Channels (I/O) Scaling Inputs
conv1 3/96 2 Input Image
pool1 96/96 4 conv1
conv2_1_x1 96/192 4 pool1
conv2_1_x2 192/48 4 conv2_1_x1
concat2_1 144/144 4 conv2_1_x2, pool1
DenseBlk_1 144/48 4 concat2_1
concat2_2 192/192 4 DenseBlk_1, concat2_1
DenseBlk_2 192/48 4 concat2_2
concat2_3 240/240 4 DenseBlk_2, concat2_2
DenseBlk_3 240/48 4 concat2_3
concat2_4 288/288 4 DenseBlk_3, concat2_3
DenseBlk_4 288/48 4 concat2_4
concat2_5 336/336 4 DenseBlk_4, concat2_4
DenseBlk_5 336/48 4 concat2_5
concat2_6 384/384 4 DenseBlk_5, concat2_5
conv2_blk 384/192 4 concat2_6
pool2 192/192 8 conv2_blk
DenseBlk_6 192/48 8 pool2
concat3_1 240/240 8 DenseBlk_6, pool2
DenseBlk_7 240/48 8 concat3_1
concat3_2 288/288 8 DenseBlk_7, concat3_1
DenseBlk_8 288/48 8 concat3_2
concat3_3 336/336 8 DenseBlk_8, concat3_2
DenseBlk_9 336/48 8 concat3_3
concat3_4 384/384 8 DenseBlk_9, concat3_3
DenseBlk_10 384/48 8 concat3_4
concat3_5 432/432 8 DenseBlk_10, concat3_4
DenseBlk_11 432/48 8 concat3_5
concat3_6 480/480 8 DenseBlk_11, concat3_5
DenseBlk_12 480/48 8 concat3_6
concat3_7 528/528 8 DenseBlk_12, concat3_6
DenseBlk_13 528/48 8 concat3_7
concat3_8 576/576 8 DenseBlk_13, concat3_7
DenseBlk_14 576/48 8 concat3_8
concat3_9 624/624 8 DenseBlk_14, concat3_8
DenseBlk_15 624/48 8 concat3_9
concat3_10 672/672 8 DenseBlk_15, concat3_9
DenseBlk_16 672/48 8 concat3_10
concat3_11 720/720 8 DenseBlk_16, concat3_10
DenseBlk_17 720/48 8 concat3_11
concat3_12 768/768 8 DenseBlk_17, concat3_11
conv3_blk 768/384 8 concat3_12
pool3 384/384 16 conv3_blk
DenseBlk_18 384/48 16 pool3
concat4_1 432/432 16 DenseBlk_18, pool3
DenseBlk_19 480/48 16 concat4_2
concat4_2 528/528 16 DenseBlk_19, concat4_1
DenseBlk_20 528/48 16 concat4_3
concat4_3 576/576 16 DenseBlk_20, concat4_2
DenseBlk_21 576/48 16 concat4_4
concat4_4 624/624 16 DenseBlk_21, concat4_3
DenseBlk_22 624/48 16 concat4_5
concat4_5 672/672 16 DenseBlk_22, concat4_4
DenseBlk_23 672/48 16 concat4_6
concat4_6 720/720 16 DenseBlk_23, concat4_5
DenseBlk_24 720/48 16 concat4_7
concat4_7 768/768 16 DenseBlk_24, concat4_6
DenseBlk_25 768/48 16 concat4_8
concat4_8 816/816 16 DenseBlk_25, concat4_7
DenseBlk_26 816/48 16 concat4_9
concat4_9 864/864 16 DenseBlk_26, concat4_8
DenseBlk_27 864/48 16 concat4_10
concat4_10 912/912 16 DenseBlk_27, concat4_9
DenseBlk_28 912/48 16 concat4_11
concat4_11 960/960 16 DenseBlk_28, concat4_10
DenseBlk_29 960/48 16 concat4_12
concat4_12 1008/1008 16 DenseBlk_29, concat4_11
DenseBlk_30 1008/48 16 concat4_13
concat4_13 1056/1056 16 DenseBlk_30, concat4_12
DenseBlk_31 1056/48 16 concat4_14
concat4_14 1104/1104 16 DenseBlk_31, concat4_13
DenseBlk_32 1104/48 16 concat4_15
concat4_15 1152/1152 16 DenseBlk_32, concat4_14
DenseBlk_33 1152/48 16 concat4_16
concat4_16 1200/1200 16 DenseBlk_33, concat4_15
DenseBlk_34 1200/48 16 concat4_17
concat4_17 1248/1248 16 DenseBlk_34, concat4_16
DenseBlk_35 1248/48 16 concat4_18
concat4_18 1296/1296 16 DenseBlk_35, concat4_17
DenseBlk_36 1296/48 16 concat4_19
concat4_19 1344/1344 16 DenseBlk_36, concat4_18
DenseBlk_37 1344/48 16 concat4_20
concat4_20 1392/1392 16 DenseBlk_37, concat4_19
DenseBlk_38 1392/48 16 concat4_21
concat4_21 1440/1440 16 DenseBlk_38, concat4_20
DenseBlk_39 1440/48 16 concat4_22
concat4_22 1488/1488 16 DenseBlk_39, concat4_21
DenseBlk_40 1488/48 16 concat4_23
concat4_23 1536/1536 16 DenseBlk_40, concat4_22
DenseBlk_41 1536/48 16 concat4_24
concat4_24 1584/1584 16 DenseBlk_41, concat4_23
DenseBlk_42 1584/48 16 concat4_25
concat4_25 1632/1632 16 DenseBlk_42, concat4_24
DenseBlk_43 1632/48 16 concat4_26
concat4_26 1680/1680 16 DenseBlk_43, concat4_25
DenseBlk_44 1680/48 16 concat4_27
concat4_27 1728/1728 16 DenseBlk_44, concat4_26
DenseBlk_45 1728/48 16 concat4_28
concat4_28 1776/1776 16 DenseBlk_45, concat4_27
DenseBlk_46 1776/48 16 concat4_29
concat4_29 1824/1824 16 DenseBlk_46, concat4_28
DenseBlk_47 1824/48 16 concat4_30
concat4_30 1872/1872 16 DenseBlk_47, concat4_29
DenseBlk_48 1872/48 16 concat4_31
concat4_31 1920/1920 16 DenseBlk_48, concat4_30
DenseBlk_49 1920/48 16 concat4_32
concat4_32 1968/1968 16 DenseBlk_49, concat4_31
DenseBlk_50 1968/48 16 concat4_33
concat4_33 2016/2016 16 DenseBlk_50, concat4_32
DenseBlk_51 2016/48 16 concat4_34
concat4_34 2064/2064 16 DenseBlk_51, concat4_33
DenseBlk_52 2064/48 16 concat4_35
concat4_35 2112/2112 16 DenseBlk_52, concat4_34
DenseBlk_53 2112/48 16 concat4_36
concat4_36 2160/2160 16 DenseBlk_53, concat4_35
conv4_blk 2160/1056 16 concat4_36
DenseBlk_54 1056/48 16 conv4_blk
concat5_1 1104/1104 16 conv4_blk, DenseBlk_54
DenseBlk_55 1104/48 16 concat5_2
concat5_2 1152/1152 16 DenseBlk_55, concat5_1
DenseBlk_56 1152/48 16 concat5_3
concat5_3 1200/1200 16 DenseBlk_56, concat5_2
DenseBlk_57 1200/48 16 concat5_4
concat5_4 1248/1248 16 DenseBlk_57, concat5_3
DenseBlk_58 1248/48 16 concat5_5
concat5_5 1296/1296 16 DenseBlk_58, concat5_4
DenseBlk_59 1296/48 16 concat5_6
concat5_6 1344/1344 16 DenseBlk_59, concat5_5
DenseBlk_60 1344/48 16 concat5_7
concat5_7 1392/1392 16 DenseBlk_60, concat5_6
DenseBlk_61 1392/48 16 concat5_8
concat5_8 1440/1440 16 DenseBlk_61, concat5_7
DenseBlk_62 1440/48 16 concat5_9
concat5_9 1488/1488 16 DenseBlk_62, concat5_8
DenseBlk_63 1488/48 16 concat5_10
concat5_10 1536/1536 16 DenseBlk_63, concat5_9
DenseBlk_64 1536/48 16 concat5_11
concat5_11 1584/1584 16 DenseBlk_64, concat5_10
DenseBlk_65 1584/48 16 concat5_12
concat5_12 1632/1632 16 DenseBlk_65, concat5_11
DenseBlk_66 1632/48 16 concat5_13
concat5_13 1680/1680 16 DenseBlk_66, concat5_12
DenseBlk_67 1680/48 16 concat5_14
concat5_14 1728/1728 16 DenseBlk_67, concat5_13
DenseBlk_68 1728/48 16 concat5_15
concat5_15 1776/1776 16 DenseBlk_68, concat5_14
DenseBlk_69 1776/48 16 concat5_16
concat5_16 1824/1824 16 DenseBlk_69, concat5_15
DenseBlk_70 1824/48 16 concat5_17
concat5_17 1872/1872 16 DenseBlk_70, concat5_16
DenseBlk_71 1872/48 16 concat5_18
concat5_18 1920/1920 16 DenseBlk_71, concat5_17
DenseBlk_72 1920/48 16 concat5_19
concat5_19 1968/1968 16 DenseBlk_72, concat5_18
DenseBlk_73 1968/48 16 concat5_20
concat5_20 2016/2016 16 DenseBlk_73, concat5_19
DenseBlk_74 2016/48 16 concat5_21
concat5_21 2064/2064 16 DenseBlk_74, concat5_20
DenseBlk_75 2064/48 16 concat5_22
concat5_22 2112/2112 16 DenseBlk_75, concat5_21
DenseBlk_76 2112/48 16 concat5_23
concat5_23 2160/2160 16 DenseBlk_76, concat5_22
DenseBlk_77 2160/48 16 concat5_24
concat5_24 2208/2208 16 DenseBlk_77, concat5_23
conv5_blk 2208/1024 16 concat5_24
upproject_1 1024/512 8 concat5_24
concat_up_2 896/896 8 upproject_1, conv3_blk
upproject_2 896/584 4 concat_up_2
concat_up_3 776/776 4 upproject_2, conv2_blk
upproject_3 776/256 2 concat_up_3
upproject_4 256/128 1 upproject_3
conv_pred 128/1 1 upproject_4
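The channel counts in the first part of the table follow the usual dense-block bookkeeping: every dense layer contributes 48 new channels that are concatenated onto its input, and the convolution that closes a stage halves the channel count. The sketch below reproduces the numbers for the first two dense stages under that reading of the table; the helper name is ours.

```python
def dense_stage_widths(in_channels, num_layers, growth=48):
    """Channel count after each concatenation in a dense stage."""
    widths = []
    channels = in_channels
    for _ in range(num_layers):
        channels += growth            # each dense layer adds `growth` channels
        widths.append(channels)
    return widths

stage2 = dense_stage_widths(96, 6)     # concat2_1 ... concat2_6
after_conv2 = stage2[-1] // 2          # conv2_blk: 384 -> 192
stage3 = dense_stage_widths(after_conv2, 12)   # concat3_1 ... concat3_12
after_conv3 = stage3[-1] // 2          # conv3_blk: 768 -> 384
```

Here `stage2` is `[144, 192, 240, 288, 336, 384]` and `stage3` ends at 768, matching the concat2 and concat3 rows above.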


  • [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
  • [2] Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: IEEE International Conference on Computer Vision (ICCV) (2015)
  • [3] Dharmasiri, T., Spek, A., Drummond, T.: Joint prediction of depths, normals and surface curvature from rgb images using cnns. arXiv preprint arXiv:1706.07593 (2017)
  • [4] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: IEEE International Conference on Computer Vision (ICCV) (2015)
  • [5] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in neural information processing systems (NIPS) (2014)
  • [6] Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular slam. In: European Conference on Computer Vision (ECCV) (2014)
  • [7] Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., Van der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. IEEE International Conference on Computer Vision (ICCV) (2015)
  • [8] Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (ECCV) (2016)
  • [9] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR) (2013)
  • [10] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: IEEE International Conference on Computer Vision (2015)
  • [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [13] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [14] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [15] Jayaraman, D., Grauman, K.: Learning image representations tied to ego-motion. In: IEEE International Conference on Computer Vision (ICCV) (2015)
  • [16] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 4762–4769 (2016)
  • [17] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems (NIPS) (2017)
  • [18] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [19] Klein, G., Murray, D.: Parallel tracking and mapping for small ar workspaces. In: IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR) (2007)
  • [20] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS) (2012)
  • [21] Kuznietsov, Y., Stückler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [22] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: International Conference on 3D Vision (3DV) (2016)
  • [23] Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. In: ACM Transactions on Graphics (2004)
  • [24] Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  • [25] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  • [26] Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31(5) (2015)
  • [27] Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics 33(5) (2017)
  • [28] Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: DTAM: Dense tracking and mapping in real-time. In: IEEE International Conference on Computer Vision (ICCV) (2011)
  • [29] Rad, M., Lepetit, V.: BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In: IEEE International Conference on Computer Vision (ICCV) (2017)
  • [30] Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • [31] Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  • [32] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015).
  • [33] Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems (2006)
  • [34] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European Conference on Computer Vision (ECCV). pp. 1–14 (2012)
  • [35] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: International Conference on Intelligent Robot Systems (IROS) (2012)
  • [36] Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371 (2017)
  • [37] Tateno, K., Tombari, F., Laina, I., Navab, N.: Cnn-slam: Real-time dense monocular slam with learned depth prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [38] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: Demon: Depth and motion network for learning monocular stereo. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [39] Wannenwetsch, A.S., Keuper, M., Roth, S.: Probflow: Joint optical flow and uncertainty estimation. In: IEEE International Conference on Computer Vision (ICCV) (2017)
  • [40] Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: Learned Invariant Feature Transform. In: European Conference on Computer Vision (ECCV) (2016)
  • [41] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Computer Vision and Pattern Recognition (CVPR) (2017)
  • [42] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)