Humans comprehend 3D from a single viewpoint by leveraging knowledge of context together with shape and appearance priors. Similar to human visual perception, robust computer vision systems require the ability to perceive the environment in 3D. This fact has motivated research in monocular depth estimation. The problem of recovering depth from a single image is ill-posed due to the projection ambiguity. Supervised methods have been proposed to estimate depth from a monocular image, demonstrating promising results by training on a large amount of dense ground-truth depth data [12, 14, 15]. However, such data is expensive and impractical to acquire for real-world scenes. An alternative approach is to generate synthetic data by rendering from computer generated models [41, 5, 49, 58] or 3D scans [8, 10, 1], but it is challenging to create data that represents the variety and detail of real-world appearance. In addition, transfer from training on synthetic data to real scenes remains a challenging open problem. Instead of regressing depth from raw pixels, self-supervised learning methods reformulate depth estimation as an image reconstruction problem by re-synthesising a target view from a single source view without ground-truth depth [16, 18]. Commonly, these methods use stereo image pairs with a fixed camera baseline for training and require scenes with a fixed depth range. They learn to estimate inter-image disparity and then use camera calibration to estimate depth.
Learning monocular depth across diverse scenes is a challenging problem due to large changes in depth range. Typically, indoor scenes have a much smaller depth range than outdoor scenes, which commonly extend to 100s of meters. Fig. 1 shows some typical scenes with different depth ranges. Monocular depth estimation should be able to estimate depth across scenes with a wide variation in the depth range. It is infeasible to train a deep network to regress depth values from raw pixels when the output spaces of the training scenes are incompatible. Existing self-supervised methods can be trained only on data sets with similar depth ranges [19, 63, 37, 62, 40], limiting the number of images that can be used for training. As a result, they demonstrate poor generalisation performance and can only perform specific tasks, such as depth estimation in outdoor driving scenes with a fixed stereo baseline. Recently, some supervised approaches [32, 33, 7] addressed this issue by normalizing the ground-truth to have the same scale, which allows learning relative depth values on moving camera data sets with a predefined depth range. However, these methods still depend on ground-truth depth values in order to estimate depth from an image.
In this paper, we propose a self-supervised method, RealMonoDepth, that allows a single network to estimate depth for indoor and outdoor scenes, demonstrating improved accuracy over previous self-supervised monocular depth estimation approaches. The proposed network learns real depth from stereo-pair and moving camera data sets for scenes with a diverse depth range and without a fixed camera baseline. An overview of our approach for scenes with varying depth ranges (small, medium and large) is depicted in Fig. 1. To enable self-supervised learning of depth estimation across multiple scene scales, we introduce a novel loss function to learn relative depth together with the real depth. The network learns the real depth by transforming relative depth through scaling and warping, which is used to compute the reconstruction loss for self-supervised training, as shown in Fig. 2. The contributions of this work are:
- A self-supervised monocular depth estimation method that is able to generalise learning across scenes with different depth ranges.
- A novel loss function over real depth for self-supervised learning of depth from a single image.
- Evaluation on five benchmark data sets that demonstrates generalisation across indoor and outdoor scenes with improved performance over previous work.
2 Related Work
Although it is not possible to perform metric reconstruction from a single image due to the projective scale ambiguity, recently proposed learning-based methods have demonstrated that a reliable estimate of the scene geometry can be generated by having prior knowledge of the scale of objects in the scene. This section reviews the supervised, unsupervised and self-supervised approaches that take a single RGB image as input and estimate per-pixel depth as output.
2.1 Supervised Depth Estimation
Supervised single image dense depth prediction methods exploit depth values obtained from active sensors such as Kinect and LIDAR as ground-truth for training. Eigen et al. and Fu et al. exploited a regression loss to estimate depth for indoor and outdoor scenes respectively. Mayer et al. demonstrated that training a fully convolutional network is an effective approach to learn disparity from stereo images. They also generated a large synthetic data set to train their network, called DispNet, demonstrating improved performance on scenes with diverse depth ranges. However, the network trained on synthetic data gives limited performance on real in-the-wild scenes. Huang et al. proposed a deep CNN that leverages multi-view stereo images with known camera poses and calibration to generate patch plane sweep volumes with respect to a reference view. Matching is performed between these patches to produce the inverse disparity map for the reference view. Instead of generating plane sweep volumes, Yao et al. applied differentiable homography warping on 2D feature maps learned from multiple input views to produce a 3D cost volume. However, these methods suffer from the limited availability of multi-view data of real scenes with ground-truth depth, and therefore rely heavily on synthetic data to train their networks, which leads to poor generalisation capability on complex real scenes. Hence these supervised methods require ground-truth depth or synthetic data for monocular depth estimation on real scenes.
2.2 Unsupervised and Self-Supervised Depth Estimation
Recently, unsupervised and self-supervised methods for depth estimation have gained attention because they eliminate the requirement of ground-truth depth data for real scenes. Unsupervised methods simultaneously estimate the depth and pose from a single RGB image, whereas self-supervised methods exploit camera pose information estimated as a pre-process to obtain depth from a single monocular image.
Zhou et al. pioneered the work in unsupervised depth estimation by proposing separate deep CNN networks for pose estimation between unlabelled video sequences and for single view depth estimation. Instead of training an additional pose estimator network, Wang et al. implemented a differentiable version of Direct Visual Odometry (DVO), which is popularly used in current SLAM [39, 11] algorithms. Specifically, DVO solves for the pose by minimizing the warping loss of the reference frame from the source frame given the reference frame depth. Furthermore, they introduced a depth normalisation layer in order to address the scale ambiguity problem, which significantly outperformed the previous approach. Inspired by the relation between depth, pose and optical flow tasks in 3D scene geometry, Yin et al. jointly trained an unsupervised end-to-end deep network to predict pose and depth for non-rigid objects. An optical flow consistency check is imposed between backward and forward flow estimations for reliable estimation. However, all these unsupervised methods give limited performance on general scenes because of the ambiguity in the projection scale introduced by both unknown depth and unknown pose.
Self-supervised methods exploit known camera pose information to resolve the depth ambiguity given estimates of disparity. Garg et al. introduced a self-supervised learning approach to train a deep network from stereo pairs by exploiting the epipolar relation between the cameras, given a known calibration, in order to generate an inverse warp of the left view to reconstruct the right view. Although their results are impressive, they use a non-differentiable Taylor series expansion to perform warping of disparity. Godard et al. [18, 19] imposed left-right consistency as a constraint for disparity regularisation and established a differentiable optimisation by leveraging spatial transformer networks for bilinear sampling. Given a stereo pair as input for training, they estimate two disparity maps: left view disparity with respect to the right view and right view disparity with respect to the left view. Then, they reconstruct both views and also the disparity maps to achieve left-right consistency. Moreover, they imposed edge-aware smoothness as another regulariser along with left-right consistency and leveraged an SSIM loss in addition to the L1 reconstruction loss. Poggi et al. proposed an improvement to this approach by leveraging three rectified views instead of a stereo pair in order to establish additional disparity consistency. Inspired by recent successes of deep learning in single image super-resolution, Pillai et al. employ sub-pixel convolutional layers instead of resizing convolution layers. Contrary to previous methods that are limited to low-resolution operation, their method exploits high fidelity for better self-supervision. Existing self-supervised approaches estimate disparity and assume a fixed camera baseline during training. This limits the approaches to training for scenes with a similar depth range and does not allow generalisation across diverse indoor and outdoor scenes, or the use of diverse data sets for training. In this paper, we introduce an approach to self-supervised learning using a loss function based on estimates of the true scene depth. This allows generalisation across both stereo pair and moving camera data sets for scenes with different depth ranges.
Both self-supervised and unsupervised methods suffer from two limitations: the requirement of a fixed camera baseline, and limited generalisation performance on scenes with varying depth ranges (indoor/outdoor). In addition, self-supervised methods estimate disparity and can only be trained on stereo image pairs with a fixed baseline, which limits the training set. The proposed self-supervised depth estimation method addresses all of these limitations by generalising learning across scenes with different depth ranges, and works for both stereo and moving camera data sets, improving the generalisation and accuracy of depth estimation over previous approaches.
This paper introduces a self-supervised single image depth estimation approach that is able to generalise learning across scenes with diverse depth ranges. The method is trained on both stereo image pairs and moving camera data sets, giving state-of-the-art performance across five benchmark data sets. The method estimates depth from a single view image; an overview is shown in Fig. 2. The proposed network is trained on two views of a static scene. The Relative DepthNet network estimates relative depth maps from the two views, inspired by previous work that estimates disparity maps between a stereo pair. As a pre-process, camera calibration is estimated from sparse correspondences between the two views using COLMAP, an existing visual SFM method. The sparse correspondences are used to estimate the median scene depth for the scale transformation, which is applied to the relative depth maps to obtain real depth maps (true scene depth). The real depth maps and images are warped between views using the calibration to estimate the loss. This enables the network to be trained on moving camera and stereo data sets of real scenes with varying depth ranges and camera view baselines, so that performance generalises across a variety of indoor and outdoor scenes in the wild.
3.1 Training Framework
The training framework for the proposed approach is illustrated in Fig. 3 for learning single view depth estimation from two viewpoint images. Given two images of a scene from different viewpoints, the depth network (Relative DepthNet) predicts the corresponding per-pixel relative depth maps using shared weights. The relative depth maps are transformed to real depth maps by the scale transform module. Using the estimated calibration and camera poses, the real depth values allow us to reconstruct both the input images and the depth maps in the other view. The self-supervised loss is then computed from these warped images and warped real depth estimates, which are interpolated to compute the photometric loss and the geometric consistency loss that supervise the depth network. SSIM and smoothness losses are introduced to regularise the depth estimation. The loss function for the proposed method (Equation 1) combines these terms over different image scales, using weighting terms to balance them.
Extension to training with multiple viewpoint images (>2) is straightforward, using the loss based on the estimated real scene depth for each input image.
While existing self-supervised methods are only suited for training on fixed baseline cameras and scenes with similar depth range, we overcome this limitation by stabilizing the output space of the depth network, allowing training on indoor and outdoor scenes with diverse depth ranges.
Relative DepthNet Architecture:
Based on the U-Net architecture, and in order to effectively capture both local and global information, we use a multi-scale encoder-decoder network with skip connections. Inspired by [21, 19, 27], we select ResNet50 for the encoder, initialised with weights pre-trained on ImageNet, and randomly initialise the decoder weights. Unlike popular self-supervised/unsupervised single image depth estimation approaches [18, 19, 37, 63, 40, 31], our network predicts relative depth instead of inverse depth or disparity. We observe that applying a sigmoid activation at the network output slows convergence for close and far depth values due to the vanishing gradient problem. Instead, we replace the sigmoid with an identity activation and handle negative values with an exponential mapping, detailed in the Scale Transform section. Apart from the multi-scale output layers of the decoder, we apply batch normalisation and ReLU nonlinearities in all layers. Full details of the network architecture required for implementation are presented in the supplementary material.
To learn to estimate depth from images across diverse scenes with varying depth ranges, we normalise depth across the images and data sets using a non-linear scale transform and train the network to estimate relative depth. Given the relative depth map prediction as input, our scale transform module outputs the real depth map (Equation 2). The transform combines an exponential mapping of the predicted relative depth with the median depth value for the two images, and its outputs are the real scene depth maps. Inspired by supervised methods [14, 15], we use the exponential mapping in Equation 2 in order to reduce the penalisation of the deep network from distant and ambiguous depth values.
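The role of the median depth in the scale transform can be illustrated with a minimal sketch. It assumes a multiplicative form (real depth = median depth × exp(relative depth)); this form, the function name `scale_transform` and the `use_median` flag are illustrative assumptions rather than the exact equation of the method.

```python
import numpy as np

def scale_transform(relative_depth, median_depth, use_median=True):
    """Map an unbounded relative depth prediction to a positive real depth.

    ASSUMPTION: multiplicative form real = median * exp(relative).
    The names and the exact form are illustrative only.
    """
    real = np.exp(relative_depth)       # exponential mapping keeps depth > 0
    if use_median:
        real = median_depth * real      # rescale to the depth range of the scene
    return real

# The same relative prediction maps to different real depths depending on
# the scene median (hypothetical values for an indoor and an outdoor scene).
relative = np.array([[-0.5, 0.0], [0.5, 1.0]])
indoor = scale_transform(relative, median_depth=2.0)
outdoor = scale_transform(relative, median_depth=40.0)
```

Under this assumption the network output stays in a similar numeric range for every scene, while the median carries the scene-specific scale.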
During training, camera calibration is required to estimate the median depth value for the scale transform. For data sets with unknown calibration (e.g. the Mannequin Challenge data set), the off-the-shelf SFM method COLMAP is used to solve for camera calibration and sparse correspondences between views. If the camera calibration is known (e.g. the KITTI data set), we use the calibration to compute sparse correspondences between views. The sparse correspondences are triangulated in 3D using the camera poses to obtain a sparse reconstruction of the scene. The sparse 3D points are projected onto each view to obtain sparse depth maps for each viewpoint. Depth values are then sorted and the median depth value is estimated, as shown in Fig. 1. The median depth enables prediction of real depth maps, which are used together with the input images to estimate the loss in Equation 1.
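The median depth computation described above can be sketched as follows. This is a simplified illustration that assumes the sparse 3D points and world-to-camera poses are already available (for example from an SFM pre-process) and that the median is pooled over both views; the helper names are hypothetical and the check against image bounds is omitted.

```python
import numpy as np

def sparse_depths(points_world, R, t):
    """Depth (z in the camera frame) of 3D points for a camera with pose [R|t]."""
    cam = points_world @ R.T + t        # world -> camera coordinates
    z = cam[:, 2]
    return z[z > 0]                     # keep points in front of the camera

def median_scene_depth(points_world, poses):
    """Median of the sparse depths pooled over all views (an assumption)."""
    depths = [sparse_depths(points_world, R, t) for R, t in poses]
    return float(np.median(np.concatenate(depths)))

# Toy example: three triangulated points seen from two hypothetical views.
points = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 4.0], [0.0, 1.0, 6.0]])
view_a = (np.eye(3), np.zeros(3))
view_b = (np.eye(3), np.array([0.0, 0.0, 1.0]))   # every point 1 unit further
mu = median_scene_depth(points, [view_a, view_b])
```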
Photometric consistency: This loss enables estimation of depth and is inspired by self-supervised learning approaches that reformulate depth estimation as an image reconstruction problem [16, 18]. The underlying intuition is that every viewpoint is a 2D projection of the same 3D scene, so one view can be reconstructed from another view given knowledge of depth. With known calibration between views, depth is treated as an intermediate variable to perform novel-view synthesis with a deep network. Each of the two views is reconstructed from the other. Given the intrinsic parameter matrices of the two cameras, the relative camera pose matrices between the views and the depth value at each pixel, the homogeneous pixel-wise projection relations between the two views (Equation 3) are obtained by back-projecting each pixel with its depth and the intrinsics of its own view, applying the relative pose, and projecting with the intrinsics of the other view.
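The projection relation between two views can be sketched in a few lines. This is a minimal NumPy illustration of the standard pinhole formulation (back-project with depth, apply the relative pose, re-project); the function and argument names are hypothetical, and positive depth is assumed so that the perspective divide is well defined.

```python
import numpy as np

def project_pixels(depth, K_src, K_dst, T_src_to_dst):
    """Project every pixel of the source view into the destination view.

    depth: (H, W) real depth of the source view (assumed > 0)
    K_src, K_dst: (3, 3) intrinsics; T_src_to_dst: (4, 4) relative pose
    Returns (H, W, 2) pixel coordinates and (H, W) projected depth.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K_src).T             # back-project to unit-depth rays
    pts = rays * depth.reshape(-1, 1)               # 3D points in the source frame
    pts = pts @ T_src_to_dst[:3, :3].T + T_src_to_dst[:3, 3]  # into the dst frame
    proj = pts @ K_dst.T                            # homogeneous pixel coordinates
    z = proj[:, 2]
    uv = proj[:, :2] / z[:, None]                   # perspective divide
    return uv.reshape(h, w, 2), z.reshape(h, w)

# Toy check: a fronto-parallel plane at depth 5 and a sideways camera shift.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 5.0)
T = np.eye(4)
T[0, 3] = 0.5                                       # translate 0.5 along x
uv_same, z_same = project_pixels(depth, K, K, np.eye(4))
uv_shift, z_shift = project_pixels(depth, K, K, T)
```

With an identity pose each pixel maps to itself, and the sideways shift moves every pixel by focal length × translation / depth pixels, which is the disparity relation that stereo-based methods rely on.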
Since the projected pixel coordinates are continuous, we use a spatial transformer network to perform differentiable bilinear sampling, approximating each reconstructed view by interpolating the other input image from the four neighboring corner pixels of each projected location.
The interpolation uses the top left, top right, bottom left and bottom right pixels around each projected pixel, each with a corresponding weight that decreases with its spatial distance from the projected location. Also, due to occlusions between viewpoints, some pixels in the reference view will be projected outside of the image plane boundary in the source view. In order to prevent these unresolved regions from penalising the network, similar to Mahjourian et al., a binary validity mask is computed to exclude these pixels from the training loss. Hence, the L1 image reconstruction loss for the two views (Equation 4) is computed only over the valid pixels.
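The sampling and masking steps can be sketched as below. This is a plain NumPy illustration, so unlike the spatial transformer used for training it is not differentiable; the function names, the averaging over colour channels and the normalisation by the number of valid pixels are assumptions made for the example.

```python
import numpy as np

def bilinear_sample(image, uv):
    """Sample image (H, W) or (H, W, C) at continuous pixel coords uv (..., 2).

    Returns the sampled values and a boolean validity mask that is False
    where the coordinates fall outside the image plane.
    """
    h, w = image.shape[:2]
    u, v = uv[..., 0], uv[..., 1]
    valid = (u >= 0) & (u <= w - 1) & (v >= 0) & (v <= h - 1)
    u0 = np.clip(np.floor(u).astype(int), 0, w - 2)   # top-left corner index
    v0 = np.clip(np.floor(v).astype(int), 0, h - 2)
    du = np.clip(u - u0, 0.0, 1.0)                    # horizontal offset in [0, 1]
    dv = np.clip(v - v0, 0.0, 1.0)                    # vertical offset in [0, 1]
    if image.ndim == 3:
        du, dv = du[..., None], dv[..., None]
    out = (image[v0, u0] * (1 - du) * (1 - dv)
           + image[v0, u0 + 1] * du * (1 - dv)
           + image[v0 + 1, u0] * (1 - du) * dv
           + image[v0 + 1, u0 + 1] * du * dv)
    return out, valid

def masked_l1(target, reconstruction, valid):
    """Mean absolute error over valid pixels only."""
    diff = np.abs(target - reconstruction)
    if diff.ndim == 3:
        diff = diff.mean(axis=-1)                     # average colour channels
    return float((diff * valid).sum() / max(valid.sum(), 1))

# Toy example: one in-bounds sample and one that projects outside the image.
img = np.arange(12.0).reshape(3, 4)
uv = np.array([[[1.5, 0.5], [-1.0, 0.0]]])
sampled, valid = bilinear_sample(img, uv)
loss = masked_l1(np.array([[3.0, 100.0]]), sampled, valid)
```

The out-of-bounds sample contributes nothing to the loss, so large errors in regions that cannot be reconstructed do not penalise the network.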
Also, an SSIM loss is applied to 3x3 image patches in order to regularise the noisy artifacts caused by the L1 loss.
Smoothness: This loss term ensures the depth maps are smooth, reducing the noise in the depth maps. Sobel gradients are used to calculate the loss instead of the horizontal and vertical gradients used in Wong et al. Sobel gradients allow depth to be smooth horizontally, vertically and diagonally. In most cases, regions that have a higher reconstruction error are only visible in one view. Applying smoothness regularisation to these unresolvable regions induces a false penalty in the network training. In order to overcome this problem, a spatially varying adaptive smoothness regularisation weight is used for every pixel, as in previous work. Based on Equation 4, the adaptive weight for a pixel in each view is computed from the reconstruction residual, a scale factor that determines the range of the weight, and the global residual. The weight tends to be small where the residuals are high, and its average approaches 1 as the training converges. Based on the adaptive weight, the smoothness regularisation objective is defined for the two views.
Here, the depth values are enforced to be locally smooth, where smoothness is measured by the x and y Sobel gradients of the relative depth maps.
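A Sobel-based smoothness term of this kind can be sketched as follows. The sketch assumes an L1 penalty on the x and y Sobel responses and takes the per-pixel adaptive weight as a given input rather than computing it; the function names are hypothetical.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation of a 2D array."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

def smoothness_loss(depth, weight=None):
    """Mean absolute Sobel response, optionally weighted per pixel.

    ASSUMPTION: L1 penalty on both gradient directions; `weight` is an
    (H, W) adaptive weight map supplied by the caller.
    """
    penalty = np.abs(filter3x3(depth, SOBEL_X)) + np.abs(filter3x3(depth, SOBEL_Y))
    if weight is not None:
        penalty = penalty * weight[1:-1, 1:-1]      # crop to the valid region
    return float(penalty.mean())

flat = np.full((4, 5), 3.0)                         # constant depth
ramp = np.tile(np.arange(5.0), (4, 1))              # depth increasing along x
```

Because the Sobel kernels span a 3x3 neighbourhood, a diagonal depth edge produces a response in both kernels, which is why they also encourage diagonal smoothness.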
The per-view predicted depth maps may not be consistent with the same 3D geometry. This causes depth discontinuities and outliers on the surfaces and boundaries of objects. In order to address this issue, we learn the real depth map for each viewpoint simultaneously with consistency checks. Inspired by Bian et al., we enforce geometric consistency symmetrically by sampling different viewpoints in the same training batch. Based on Equation 3, we interpolate the projected depth maps with bilinear sampling in order to approximate their values on the pixel grid.
Here, similar to previous work, we use a normalised symmetric loss function to achieve depth consistency between the predicted depth maps for each viewpoint.
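A normalised symmetric depth difference can be sketched as below. The form used here, the absolute difference divided by the sum of the two depths, follows the depth inconsistency measure of Bian et al. and is an assumption about the exact loss; the function name and the validity-mask handling are illustrative.

```python
import numpy as np

def geometric_consistency(depth_projected, depth_sampled, valid):
    """Normalised symmetric difference between two positive depth maps.

    depth_projected: depth of view A's points expressed in view B
    depth_sampled:   view B's predicted depth sampled at the same locations
    valid:           boolean mask of pixels that project inside view B
    """
    diff = np.abs(depth_projected - depth_sampled) / (depth_projected + depth_sampled)
    return float((diff * valid).sum() / max(valid.sum(), 1))

a = np.array([[2.0, 4.0]])
b = np.array([[2.0, 12.0]])
mask = np.array([[True, True]])
loss = geometric_consistency(a, b, mask)
loss_scaled = geometric_consistency(10 * a, 10 * b, mask)   # same scene, 10x scale
```

The normalisation bounds each per-pixel term between 0 and 1 and makes the loss invariant to the absolute depth scale, so near and far scenes are penalised comparably.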
|data set||Indoor||Outdoor||Dynamic||Video||Depth||Diversity||Annotation||# Images|
|DIW ||✓||✓||✓||Ordinal Pair||High||User clicks||496K|
|Megadepth ||✓||✓||No scale||Medium||SFM||130K|
|MC ||✓||✓||✓||✓||No scale||High||SFM||115K|
Qualitative and quantitative results are presented on five benchmark data sets against state-of-the-art supervised, unsupervised and self-supervised methods. We demonstrate that the proposed self-supervised loss function using real depth substantially improves generalisation performance when trained jointly on both moving camera (Mannequin Challenge (MC), mostly indoor) and stereo (KITTI, outdoor) data sets. These data sets contain both indoor (1–10m) and outdoor (1–1000m) scenes with a wide variation in depth range. We test the same trained model on four benchmark data sets: the KITTI Eigen test split (street scenes), Make3D (outdoor buildings), the NYUDv2 test split (indoor) and the dynamic subset of TUM-RGBD (humans in indoor environments). Key attributes of the data sets used in the experiments are listed in Table 1. Additional qualitative results on a wide variety of in-the-wild scene images from the DIW data set are presented in the supplementary material, together with a comparative performance evaluation on a diverse range of challenging in-the-wild videos in the supplementary video.
4.1 Implementation Details
Our model is implemented in Tensorflow and trained using the Adam optimiser for 25 epochs with a fixed input/output resolution and batch size. Each batch sample consists of different viewpoint images of the same scene, and the number of viewpoint images per sample is fixed, which determines the number of images in each batch. The learning rate is decayed from its initial value with a linear scheduler. The weights for the loss terms are determined empirically and involve the downsampling factor for each scale, and the scale factor for the adaptive regularisation term is chosen following previous work. For data augmentation, we perform horizontal flipping, random scaling and cropping on part of the samples, and apply random brightness, contrast and saturation changes to the rest, with the same range of values as in previous work. Full details of the network implementation are given in the supplementary material.
4.2 Experimental Setup
We train our model based on the data split of Eigen et al. for KITTI and that of Li et al. for the Mannequin Challenge (MC) data set. We perform both individual and mixed training on these data sets, with and without our proposed scale transform approach, in order to evaluate the differences in generalisation capability and compare with previous methods. The underlying motivation for combining these data sets is threefold: 1) they are both large data sets suitable for self-supervised training; 2) their sizes are comparable, which makes them favourable to mix for joint training; 3) they represent different varieties of appearance: outdoor street scenes in KITTI and humans (mostly indoors) in MC.
We select 23,488 stereo pairs of KITTI for training, and the remaining 697 images are used as the test set for evaluating single view depth estimation. Following the MC training split, we select 2463 scenes of MC for training. Since some of the videos on YouTube were deleted by their owners, we were not able to access all of the video URLs provided with the data set, so our training set is slightly smaller. In order to ensure a well balanced distribution across scenes, we randomly sample 40 viewpoint images from each scene. Images are resampled for scenes that have a lower number of viewpoints.
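The per-scene balancing step can be sketched as follows. This is a minimal illustration that assumes sampling without replacement when a scene has enough frames and sampling with replacement otherwise; the function name and the frame identifiers are hypothetical.

```python
import random

def sample_viewpoints(frames, n=40, rng=None):
    """Pick n viewpoint images from one scene.

    ASSUMPTION: scenes with fewer than n frames are resampled with
    replacement so that every scene contributes the same number of images.
    """
    rng = rng or random.Random()
    if len(frames) >= n:
        return rng.sample(frames, n)                 # distinct frames
    return [rng.choice(frames) for _ in range(n)]    # repeats allowed

rng = random.Random(0)
long_scene = [f"frame_{i:03d}" for i in range(100)]
short_scene = [f"frame_{i:03d}" for i in range(5)]
picked_long = sample_viewpoints(long_scene, rng=rng)
picked_short = sample_viewpoints(short_scene, rng=rng)
```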
We quantitatively evaluate the single view depth estimation of our model following the error metrics of Eigen et al.: mean absolute relative error (Abs Rel), mean squared relative error (Sq Rel), root mean squared error (RMS), and root mean squared log10 error (RMS(log)). Following Zhou et al., we scale our relative single-view depth map predictions to match the median of the ground-truth.
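The evaluation protocol can be sketched as below. This is a minimal illustration that assumes pixels without ground-truth are marked by non-positive values and that no depth cap is applied; the function name and the dictionary keys are hypothetical.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Abs Rel, Sq Rel, RMS and RMS(log) after median scaling.

    ASSUMPTION: gt <= 0 marks pixels without ground-truth; the prediction
    is assumed positive everywhere.
    """
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    pred = pred * np.median(gt) / np.median(pred)    # match the ground-truth median
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
        "rms": float(np.sqrt(np.mean((pred - gt) ** 2))),
        # log10 follows the wording of the text above
        "rms_log": float(np.sqrt(np.mean((np.log10(pred) - np.log10(gt)) ** 2))),
    }

gt = np.array([[1.0, 2.0], [4.0, 0.0]])              # 0.0 marks a missing value
pred = np.array([[3.0, 6.0], [12.0, 5.0]])           # correct up to a scale of 3
metrics = depth_metrics(pred, gt)
```

Because of the median scaling, a prediction that is correct up to a global scale factor, as in the toy example, receives zero error on all four metrics.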
4.3 Comparison with State-of-the-art
This section presents quantitative results for models trained with the proposed scale transform method on various data set combinations (KITTI, MC, MC+KITTI). Note that previous self-supervised estimation methods [18, 20, 42, 16] estimate disparity rather than depth and assume a fixed camera baseline, so it is not possible to train them on data sets with different scene scales.
KITTI: We report the results for the test set of KITTI based on the Eigen split in Table 2. Our model trained on MC+KITTI shows the best generalisation performance, outperforming all state-of-the-art supervised methods that are not trained on KITTI, including those trained on large scale diverse data sets such as Chen et al. and Li et al. We also demonstrate that the training loss function based on real depth allows generalisation over data sets for indoor and outdoor scenes with different scales and allows the combination of data sets during training without degrading test performance. Our model trained on MC+KITTI outperforms other state-of-the-art self-supervised/unsupervised methods on KITTI even when they are trained on KITTI.
|Method||Supervision||Training set||Abs Rel||Sq Rel||RMS||RMS(log)|
Make3D: Next, we evaluate on the Make3D data set. In Table 3, our model trained on MC+KITTI demonstrates the best generalisation performance on an unseen data set compared to other methods, both those that are [35, 34] and those that are not [14, 7] trained on Make3D. Qualitative comparisons in Fig. 4 show that our method trained on MC+KITTI with the training loss based on real depth achieves improved performance.
|Method||Supervision||Training set||Abs Rel||RMS||Method||Supervision||Training set||Abs Rel||RMS|
|Xu ||Depth||Make3D||0.184||4.38||Xu ||Depth||NYU||0.121||0.586|
|Li ||Depth||Make3D||0.278||7.19||Li ||Depth||NYU||0.139||0.505|
|Laina ||Depth||Make3D||0.176||4.45||Laina ||Depth||NYU||0.129||0.583|
|Liu ||Depth||Make3D||0.314||8.60||Liu ||Depth||NYU||0.230||0.824|
|Liu ||Depth||Make3D||0.335||9.49||Liu ||Depth||NYU||0.335||1.06|
|Laina ||Depth||NYU||0.669||7.31||Eigen ||Depth||NYU||0.215||0.907|
|Liu ||Depth||NYU||0.669||7.20||Eigen ||Depth||NYU||0.158||0.641|
|Eigen ||Depth||NYU||0.505||6.89||Roy ||Depth||NYU||0.187||0.744|
|Chen ||Depth||DIW||0.550||7.25||Wang ||Depth||NYU||0.220||0.745|
|Li ||Depth||Megadepth||0.402||6.23||Jafari ||Depth||NYU||0.157||0.673|
|Monodepth ||Pose||KITTI||0.525||9.88||Monodepth2 ||Pose||KITTI||0.342||1.183|
|DDVO ||KITTI||0.387||8.09||Ours||Pose||MC + KITTI||0.193||0.686|
NYUDv2: Here, we show that our proposed method also generalises well to indoor scenes. We evaluate performance against the state-of-the-art self-supervised monocular method Monodepth2 on the NYUDv2 test split. In Table 3, our model outperforms Monodepth2 by a significant margin and achieves competitive accuracy against supervised methods that are trained on a different split of the same data set. We also provide qualitative comparisons of our model trained on MC+KITTI in Fig. 5, demonstrating improved performance.
TUM: Finally, we present quantitative and qualitative results on the dynamic subset of the TUM-RGBD data set in Table 4 and Fig. 6 respectively. Our model ranks second-best compared to supervised methods and best among self-supervised methods. In order to make a fair comparison with Li et al., we only include their result trained on a single image without any additional prior knowledge except the depth ground-truth.
|Method||Supervision||Training set||Abs Rel||RMS|
|DeMoN ||Depth||TUM RGBD+MVS||0.220||0.866|
|Li (single image)||Depth||MC||0.204||0.840|
4.4 Ablation Study
Here, we evaluate the effect on generalisation performance across scenes with different depth ranges of our self-supervised training using a loss function based on real depth estimates. For models trained without our proposed method, we omit the scene median depth term, so the scale transform module maps the relative depth estimated by our Relative DepthNet network directly to the depth map that is used for warping the other viewpoint image in order to compute the loss for training. We train our network with four different configurations of training data sets and usage of the proposed scale transform (ST) during training: 1) MC w/o ST, 2) MC+KITTI w/o ST, 3) MC w/ ST and 4) MC+KITTI w/ ST. Then, we test each of these four models on the four benchmark data sets, as in Section 4.3. We report numerical ablation results in Table 5 and show qualitative comparisons between models trained on MC+KITTI with and without the scale transform in Fig. 5, Fig. 6 and Fig. 7. Our models trained on MC+KITTI with and without the scale transform achieve similar numerical results on the test split of KITTI. This is reasonable since the KITTI data set is collected with stereo cameras that have a fixed baseline between them. On the other hand, the MC data set consists of diverse scenes with no fixed baseline between viewpoint images in each scene. For the other test sets, models trained with the proposed loss function significantly outperform the models that are trained without the scale transform. Moreover, our proposed method also allows training on a combination of different data sets to generalise across scene depth ranges for indoor and outdoor scenes.
|Test set||Error Measure||Training Set||Test set||Error Measure||Training Set|
|MC||MC + KITTI||MC||MC + KITTI|
|w/o ST||ST||w/o ST||ST||w/o ST||ST||w/o ST||ST|
|KITTI||Abs Rel||0.426||0.276||0.116||0.108||NYU||Abs Rel||0.282||0.201||0.333||0.193|
|Make3D||Abs Rel||0.424||0.346||0.347||0.289||TUM Dynamic||Abs Rel||0.252||0.207||0.273||0.201|
5 Conclusion and Future Work
We present a generalised self-supervised monocular depth estimation method (RealMonoDepth) that overcomes the restriction of existing self-supervised methods [20, 63, 18, 61, 16] to scenes with fixed scale and depth range. These methods cannot be trained on moving camera data sets due to the assumption of a fixed baseline and are unable to generalise to unseen data sets with different depth ranges. RealMonoDepth addresses all of these limitations by allowing simultaneous training on a combination of indoor and outdoor scenes with varying depth ranges. This leads to significantly improved generalisation performance across indoor and outdoor scenes, including scenes which are unseen during training, and removes the dependence on a fixed camera baseline. The proposed method allows mixing stereo and moving camera data sets (MC + KITTI), improving on state-of-the-art performance in single view depth estimation across five benchmark data sets, including data sets with varying depth range.
The success of deep networks depends on the use of large data sets. The proposed self-supervised training from sequences captured with a single camera allows us to train the network on diverse, uncontrolled in-the-wild data sets, such as the Mannequin Challenge (MC) data set used in this work. This opens the door to further generalisation through training across even larger and more diverse scenes. A limitation of the proposed method is that it requires static scenes during training, as with all the other single-image depth estimation methods. At test time we estimate depth from a single image and are therefore able to handle dynamic scenes with a wide variety of scales (see supplementary video). An interesting direction for future work is to extend the training of our method to dynamic scenes in order to increase the diversity of the data.
-  (2016) Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pp. 1–16. Cited by: §1.
Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §4.1.
-  (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Advances in Neural Information Processing Systems, pp. 35–45. Cited by: §3.1, §3.1.
Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8001–8008. Cited by: Table 2.
-  (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §1.
Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2624–2632. Cited by: §4.3.
-  (2016) Single-image depth perception in the wild. In Advances in neural information processing systems, pp. 730–738. Cited by: §1, Table 1, §4.3, Table 2, Table 3, Table 4, §4.
-  (2016) A large dataset of object scans. arXiv:1602.02481. Cited by: §1.
-  (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: Table 1.
-  (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: §1.
-  (2003) Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1403. Cited by: §2.2.
-  (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §1.
-  (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE international conference on computer vision, pp. 2650–2658. Cited by: Table 3, §4.
-  (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pp. 2366–2374. Cited by: §1, §2.1, §3.1, §4.2, §4.2, §4.3, §4.3, Table 2, Table 3.
-  (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011. Cited by: §1, §2.1, §3.1, Table 2, Table 4.
-  (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European Conference on Computer Vision, pp. 740–756. Cited by: §1, §2.2, §3.1, §4.3, Table 2, §5.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: Table 1, §4.
-  (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279. Cited by: §1, §2.2, §3.1, §3.1, §4.3, Table 2, Table 3, §5.
-  (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3838. Cited by: §1, §2.2, §3.1, §3.1, §3, §4.1, §4.3, §4.3, Table 2, Table 3, Table 4.
-  (2019) Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. arXiv preprint arXiv:1904.04998. Cited by: §4.3, §5.
-  (2018) Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 484–500. Cited by: §3.1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
-  (2018) Deepmvs: learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830. Cited by: §2.1.
-  (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §2.2, §3.1.
-  (2017) Analyzing modular cnn architectures for joint depth prediction and semantic segmentation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4620–4627. Cited by: Table 3.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2017) Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6647–6655. Cited by: §3.1.
-  (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international conference on 3D vision (3DV), pp. 239–248. Cited by: Table 2, Table 3, Table 4.
-  (2018) Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognition 83, pp. 328–339. Cited by: Table 3.
-  (2015) Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1119–1127. Cited by: Table 3.
-  (2018) Undeepvo: monocular visual odometry through unsupervised deep learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7286–7291. Cited by: §3.1.
-  (2019) Learning the depths of moving people by watching frozen people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4521–4530. Cited by: §1, Table 1, §4.2, §4.2, §4.3, Table 4, §4.
-  (2018) Megadepth: learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2041–2050. Cited by: §1, Table 1, §4.3, Table 2, Table 3.
-  (2015) Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162–5170. Cited by: §4.3, Table 2, Table 3.
-  (2014) Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–723. Cited by: §4.3, Table 3.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.1.
-  (2018) Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5667–5675. Cited by: §1, §3.1, §3.1.
-  (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: §2.1.
-  (2002) FastSLAM: a factored solution to the simultaneous localization and mapping problem. In AAAI/IAAI, pp. 593–598. Cited by: §2.2.
-  (2019) Superdepth: self-supervised, super-resolved monocular depth estimation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9250–9256. Cited by: §1, §2.2, §3.1.
-  (2017) Depthsynth: real-time realistic synthetic data generation from cad models for 2.5 d recognition. In 2017 International Conference on 3D Vision (3DV), pp. 1–10. Cited by: §1.
-  (2018) Learning monocular depth estimation with unsupervised trinocular assumptions. In 2018 International Conference on 3D Vision (3DV), pp. 324–333. Cited by: §2.2, §4.3, Table 2.
-  (2018) Monocular depth estimation using multi-scale continuous crfs as sequential deep networks. IEEE transactions on pattern analysis and machine intelligence 41 (6), pp. 1426–1440. Cited by: Table 4.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.1.
-  (2016) Monocular depth estimation using neural regression forest. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5506–5514. Cited by: Table 3.
-  (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §3.1.
-  (2008) Make3d: learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence 31 (5), pp. 824–840. Cited by: Table 1, §4.
-  (2016) Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1, §3.
-  (2004) The princeton shape benchmark. In Proceedings Shape Modeling Applications, 2004., pp. 167–178. Cited by: §1.
-  (2012) Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746–760. Cited by: Table 1, §4.
-  (2011) Real-time visual odometry from dense rgb-d images. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 719–722. Cited by: §2.2.
-  (2012) A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573–580. Cited by: Table 1.
-  (2017) Demon: depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5038–5047. Cited by: Table 4.
-  (2018) Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2022–2030. Cited by: §2.2, Table 2, Table 3.
-  (2015) Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2800–2809. Cited by: Table 3.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §2.2, §3.1.
-  (2019) Bilateral cyclic constraint and adaptive regularization for unsupervised monocular depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5644–5653. Cited by: §3.1, §4.1.
-  (2014) Beyond pascal: a benchmark for 3d object detection in the wild. In IEEE winter conference on applications of computer vision, pp. 75–82. Cited by: §1.
-  (2017) Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5354–5362. Cited by: Table 3.
-  (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783. Cited by: §2.1.
-  (2018) Geonet: unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992. Cited by: §2.2, Table 2, §5.
-  (2018) Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 340–349. Cited by: §1.
-  (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §1, §2.2, §3.1, §4.2, Table 2, Table 3, §5.
6 Network Implementation Details
We use the standard pretrained ResNet50 encoder officially provided by TensorFlow. The decoder weights are initialised randomly, and the details of our architecture are shown in Table 6. Our decoder takes the final encoder output together with skip connections from the encoder, and estimates depth maps at four resolutions in order to exploit both local and global information to resolve higher resolution details. Code will be released upon publication.
|Layer|Output Size|Kernel Size|Stride|Input|BatchNorm|Activation|
|---|---|---|---|---|---|---|
|iconv5| |3|1|upconv5 + econv4|Yes|ReLU|
|iconv4| |3|1|Upsample(upconv4) + econv3|Yes|ReLU|
|iconv3| |3|1|upconv3 + econv2 + Upsample(depth4)|Yes|ReLU|
|iconv2| |3|1|upconv2 + econv1 + Upsample(depth3)|Yes|ReLU|
|iconv1| |3|1|upconv1 + Upsample(depth2)|Yes|ReLU|
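The Input column of the table above can be read as a small dependency graph. The following Python sketch is purely illustrative and is not the authors' implementation: the names `DECODER_INPUTS`, `parse_inputs` and `summarise` are hypothetical, and the code only encodes the wiring listed in the table (no tensor shapes or channel counts, which are not recoverable here). It reports, for each fusion layer, which encoder skip connection and which coarser depth prediction feed into it.

```python
import re

# Input column of the decoder table, copied verbatim (layer names as in the table).
DECODER_INPUTS = {
    "iconv5": "upconv5 + econv4",
    "iconv4": "Upsample(upconv4) + econv3",
    "iconv3": "upconv3 + econv2 + Upsample(depth4)",
    "iconv2": "upconv2 + econv1 + Upsample(depth3)",
    "iconv1": "upconv1 + Upsample(depth2)",
}


def parse_inputs(spec):
    """Split an input spec into (tensor_name, is_upsampled) pairs."""
    parts = []
    for term in spec.split("+"):
        term = term.strip()
        match = re.fullmatch(r"Upsample\((\w+)\)", term)
        if match:
            parts.append((match.group(1), True))
        else:
            parts.append((term, False))
    return parts


def summarise(layer):
    """Group the inputs of one fusion layer by their role in the decoder."""
    names = [name for name, _ in parse_inputs(DECODER_INPUTS[layer])]
    return {
        "upconv": [n for n in names if n.startswith("upconv")],
        "skip": [n for n in names if n.startswith("econv")],
        "coarser_depth": [n for n in names if n.startswith("depth")],
    }


if __name__ == "__main__":
    for layer in DECODER_INPUTS:
        print(layer, summarise(layer))
```

Read this way, the three finest fusion layers (iconv3, iconv2, iconv1) each reuse the depth map predicted one scale coarser, while the encoder skip connections stop at iconv2.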
7 Additional Results
We provide additional qualitative results to showcase the generalisation ability of the proposed model trained on MC+KITTI using our novel loss function.
Supplementary video. We generate depth predictions with our model and compare them against the current state-of-the-art Monodepth2 on sample YouTube videos, which consist of diverse scenes with dynamic objects and varying depth range. Results are presented in the supplementary video. Note that the video scenes are monocular and were not seen during training. These videos were recorded with standard handheld monocular cameras and do not have ground-truth depth. Each frame was processed independently, i.e. temporal relations between frames are not used.
Diverse scene images. We also show qualitative results on the test set of the DIW data set in Figs. 8, 9 and 10. These images contain rich and diverse content, including indoor, natural and street scenes with various objects, taken from arbitrary camera angles with uncontrolled lighting conditions and scene appearance. The results demonstrate plausible depth estimation for general scenes, with performance comparable to that of humans or of previous depth estimation methods using supervised learning.