Estimating an accurate depth map from a single RGB image is of great significance for 3D scene understanding as well as for many real-world applications such as augmented reality and autonomous driving. Compared to traditional methods based on hand-crafted features, supervised [5, 7, 17, 16, 33, 1] and stereo self-supervised [11, 20, 22, 10] learning has been shown to achieve better performance on this task. Unfortunately, these methods either require a large amount of high-quality annotated ground truth, which is difficult to obtain, or need complex stereo calibration. Monocular self-supervised learning has therefore become a focus of research, and recent works [3, 31, 9, 36] have revealed its great potential for monocular depth estimation.
Despite their potential to reach satisfying performance, current methods have two shortcomings. First, they can only estimate relative depth rather than absolute depth. For evaluation, a scale factor is calculated as the ratio between the medians of the ground-truth depth (given by LiDAR) and the predicted depth [3, 31, 9, 36], as can be seen in Fig. 1. In theory, this is a decent solution. In practice, however, obtaining ground truth from other sensors not only raises cost but also complicates the system, leading to involved joint calibration procedures and synchronization problems.
Second, because the decoders of current methods predict depth at different resolutions separately, some object-level details are lost. For example, object boundaries can be blurred, and the depth of a texture on an object may be predicted differently from that of the object itself.
In this paper, we propose DNet, a novel self-supervised monocular depth estimation pipeline that exploits densely connected hierarchical features to obtain more precise object-level depth inference and uses a dense geometrical constraint to perform scale recovery without depending on additional sensors or depth ground truth, which makes it easier to bring into practical use.
Our contributions are listed as follows:
- We improve the former multi-scale estimation strategy by proposing a novel densely connected prediction (DCP) layer. Instead of predicting depth and computing the reconstruction loss separately at different scales, the proposed DCP layer exploits hierarchical features so that object-level depth inference is based on multi-scale prediction features, refining object boundaries and reducing visual artifacts.
- A novel dense geometrical constraints (DGC) module is introduced to perform high-quality scale recovery for autonomous driving. Based on the relative depth estimation, the DGC module performs per-pixel ground segmentation and estimates a camera height from every ground point. A statistical method is then applied to determine the camera height so that outliers in ground point extraction are robustly suppressed. The scale factor is determined by comparing the given camera height with the estimated one.
- DNet is extensively evaluated on the KITTI Eigen split, where the results not only show that the DCP layer improves object-level depth inference, but also show that DNet with the DGC module is competitive with methods that use depth ground truth to determine the scale factor. Ablation studies demonstrate the effectiveness of each module as well as the sensitivity of DNet to the ground point ratio.
II Related Works
II-A Self-supervised monocular depth estimation
Monocular depth estimation has always been an important aspect of scene understanding. Some works apply supervised [5, 7, 17, 16, 33, 1] or stereo self-supervised [11, 20, 22, 10] methods to tackle the problem. However, because training the depth estimation network then requires either large amounts of labeled data or complex stereo calibration, monocular self-supervised methods were proposed instead [3, 31, 9, 36].
Proposed by the pioneering work [36], the basic idea is to use a photometric reconstruction loss calculated by comparing the target image with the target view reconstructed from nearby source views. However, this assumes that the scene is static and that no occlusion occurs between consecutive frames. [18, 34, 27] explicitly established different motion models to handle moving scenes. [32] introduced 3D surface normals by constructing two additional layers for better depth estimation. [9] replaced the original photometric reconstruction error with a per-pixel minimum reprojection error, which partially enabled it to tackle occlusion. It also used up-sampling and proposed auto-masking of stationary pixels to avoid ’holes’ of infinite depth generated by low-texture regions and moving objects, respectively.
However, all of the aforementioned works predict only relative depth, which means a scale gap still exists between the prediction and the true depth. For evaluation purposes, the ratio between the medians of the ground truth and the current prediction is used to obtain absolute depth. Unfortunately, in real application scenarios, ground truth is either too difficult or too expensive to obtain. A scale recovery approach that is free of depth ground truth is therefore called for.
II-B Monocular scale recovery
Scale uncertainty has always been a problem in monocular 3D vision. To recover the scale factor and achieve absolute depth estimation, Pinard et al. [21] utilize pose information and Roussel et al. [24] use stereo data to pretrain the network; both introduce additional sensor information, yet the results are not as satisfactory. Besides depth estimation, a typical example of this problem is monocular visual SLAM. To mitigate it, [6, 26] integrate object detection algorithms into monocular visual SLAM systems and use object size priors to recover scale. However, in addition to significantly increasing computational complexity, these methods show limited robustness in scenes without known object classes.
Handling the geometrical relationship between the camera and the ground is also an effective approach to this problem. This geometrical constraint is widely used in autonomous driving tasks, since the ground is commonly visible in images captured by on-board cameras. The main task of these methods is to estimate a relative camera height using camera-ground geometrical constraints, and thus to infer scale from a prior on the absolute camera height.
One method extracts the ground using a trained classifier, but it does not generalize well. Another extracts ground points densely in a region of interest, similar to [15, 35], but it requires dense stereo to be added to the system, which can raise cost and increase complexity. The work most similar to ours uses surface normals to extract ground points and thus calculate the camera height. However, due to the sparsity inherent in its key-point-based strategy, data association across consecutive frames is needed, which makes this method hard to integrate into monocular depth estimation tasks that use only a single image as input. In addition, this method regards the ground as a single flat plane with one surface normal, which is a strong assumption for autonomous driving scenarios. In contrast, our method is free of data association, which means it can be integrated into both monocular depth estimation and visual SLAM tasks. Furthermore, our method achieves per-pixel surface normal calculation and ground segmentation, which makes the algorithm robust to different road conditions in autonomous driving.
In this section, a novel pipeline called DNet, specifically designed for monocular absolute depth estimation in autonomous driving applications, is proposed. The pipeline can be divided into two parts: relative depth estimation, with a densely connected prediction (DCP) layer to improve object-level depth inference, and scale recovery based on a dense geometrical constraint, which needs no additional sensor signals or depth ground truth. An overview of DNet is shown in Fig. 2.
III-A Relative depth estimation
The proposed DNet is based on Monodepth2 [9]. As with all self-supervised depth estimation methods, its object-level inference can still suffer from texture copy and imprecise object boundaries. In this section, we first introduce Monodepth2 and then address this issue by introducing the DCP layer to replace the full-resolution module used in Monodepth2.
III-A1 Baseline: Monodepth2 w/o full resolution
Architecture: Two networks are required in the monocular self-supervised architecture: a depth network and a pose network. A single image of the current frame is taken as the input of the depth network, which outputs a dense relative depth map. The pose network takes the current frame together with two nearby source frames as inputs and outputs the camera pose of the current image relative to each of the source images.
Self-supervision loss: The overall loss consists of two parts: a per-pixel minimum reconstruction loss and an inverse depth smoothness loss. The reconstruction loss is calculated by first inverse-warping the source images to rebuild two versions of the target image. The photometric error (PE) between each reconstructed image and the target image is then calculated by combining the structural similarity index (SSIM) [30] and the L1 norm between the two images as follows:
where a weighting parameter is used to balance the two terms.
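For reference, the photometric error described above is commonly written in the following form in Monodepth2-style pipelines. The symbols $I_a$, $I_b$ (the two images being compared) and $\alpha$ (the weighting parameter) are notation assumed here for illustration and may differ from the authors' original notation:

```latex
PE(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1
```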
The per-pixel minimum loss is then calculated as follows:
It is combined with the edge-aware inverse depth smoothness loss:
where the smoothness term is computed on the mean-normalized inverse depth. The overall loss is then constructed with two hyper-parameters as:
where the subscript denotes the different resolution layers of the decoder, and one of the hyper-parameters is determined according to the resolution.
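The two loss terms described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the photometric error is reduced to a plain L1 term (the SSIM part is omitted), images are `(H, W, 3)` arrays, and all function names are hypothetical.

```python
import numpy as np

def l1_photometric_error(target, reconstructed):
    """Per-pixel L1 error averaged over colour channels (SSIM term omitted)."""
    return np.abs(target - reconstructed).mean(axis=-1)

def min_reconstruction_loss(target, reconstructions):
    """Per-pixel minimum of the error over all reconstructed target images."""
    errors = np.stack([l1_photometric_error(target, r) for r in reconstructions])
    return errors.min(axis=0).mean()

def edge_aware_smoothness(inv_depth, image):
    """Smoothness of the mean-normalized inverse depth, down-weighted at image edges."""
    d = inv_depth / (inv_depth.mean() + 1e-7)
    grad_d_x = np.abs(d[:, 1:] - d[:, :-1])
    grad_d_y = np.abs(d[1:, :] - d[:-1, :])
    grad_i_x = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    grad_i_y = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    return (grad_d_x * np.exp(-grad_i_x)).mean() + (grad_d_y * np.exp(-grad_i_y)).mean()
```

Taking the minimum per pixel, rather than averaging over the reconstructions, lets a pixel that is occluded in one source view still be supervised by the other.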
III-A2 DNet & Densely connected prediction layer
Overall loss: Because the photometric error of a low-resolution depth prediction can result from either a wrong network prediction or aliasing caused by down-sampling, using the same loss weight for low-res and high-res results can mislead the network into converging to non-optimal values. Additionally, considering that lower-resolution features are reused multiple times, the weight of the error in lower-resolution depth predictions is reduced as follows:
where a weight adjustment parameter is introduced.
DCP layer: In order to handle the local gradient caused by bilinear sampling [13] and local minima, current works [3, 31, 9, 36], including our baseline Monodepth2, use a multi-scale depth prediction strategy. This strategy uses features at different scales independently, which tends to produce depth artifacts (Fig. 9). To reduce these artifacts and obtain more reasonable object-level depth inference, we propose a novel DCP layer that hierarchically combines features at different scales. The intuition is based on the observation that the low-res layers of the decoder network provide more reliable object-level depth inference, whereas the high-res layers focus more on local depth details.
Formally, the number of feature channels at each scale is reduced to eight using a convolutional layer in the DCP layer, so that the channel counts are unified and subsequent calculations are simplified. Features in low-res layers are then up-sampled and concatenated to the features of higher-res layers. In this way, we introduce more precise object-level inference into the higher-resolution depth predictions, which originally attend less to object-level depth. The final depth estimation is performed on the hierarchical features provided by the densely connected feature layers. The detailed structure is shown in Fig. 3.
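The channel reduction, up-sampling and concatenation steps described above can be illustrated with a small NumPy sketch. Several details are assumptions made only for this example: the convolution is taken to be 1x1, up-sampling is nearest-neighbour, each scale is exactly twice the size of the previous one, and the function names are hypothetical. The actual layer design is the one shown in Fig. 3.

```python
import numpy as np

def reduce_channels(feat, weight):
    """1x1 convolution: (C_in, H, W) -> (C_out, H, W) via a per-pixel linear map."""
    return np.einsum("oc,chw->ohw", weight, feat)

def upsample2x(feat):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def dense_connect(features, weights):
    """features: decoder maps ordered from lowest to highest resolution.
    Returns, per scale, the reduced features concatenated with the
    up-sampled hierarchical features of all lower resolutions."""
    outputs, carried = [], None
    for feat, w in zip(features, weights):
        reduced = reduce_channels(feat, w)
        if carried is not None:
            reduced = np.concatenate([upsample2x(carried), reduced], axis=0)
        outputs.append(reduced)
        carried = reduced
    return outputs
```

With eight channels per scale, the hierarchical feature at the k-th scale in this sketch has 8k channels, so the highest-resolution prediction sees features from every coarser scale.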
III-B Scale recovery
Scale recovery is performed after the relative depth has been predicted, so that an absolute depth map can be generated relying solely on a monocular image. The dense geometrical constraint (DGC) is introduced for this purpose. DGC is specifically designed for autonomous driving applications and works under the assumption that there are enough ground points in the monocular image, which is usually the case in autonomous driving. Unlike the scale recovery employed by feature-based visual odometry, DGC extracts ground points densely from the monocular image to form a dense ground point map. Each point in the map is used to estimate one camera height, as can be seen in Fig. 5, so a large number of camera heights is obtained. By applying statistical methods to estimate the overall camera height, outliers have little effect on the resulting scale factor.
III-B1 Surface normal calculation
The first step is to determine a surface normal for each pixel in the input image. To do so, every pixel is first projected into 3D space according to the following equation:
where the pixel at a given row and column is expressed in 2D homogeneous coordinates and is mapped to its corresponding 3D point using the depth of that point and the camera intrinsic matrix.
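The projection of pixels into 3D space can be sketched as the standard pinhole back-projection, in which a 3D point is the depth times the inverse intrinsic matrix applied to the homogeneous pixel. The function name and the pixel ordering `(column, row, 1)` are assumptions of this sketch.

```python
import numpy as np

def back_project(depth, K):
    """Lift every pixel of an (H, W) depth map to a 3D point: P = D * K^-1 * p."""
    h, w = depth.shape
    cols, rows = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates (u, v, 1) for every pixel, shape (H, W, 3).
    pixels = np.stack([cols, rows, np.ones_like(cols)], axis=-1).astype(float)
    rays = pixels @ np.linalg.inv(K).T
    return rays * depth[..., None]
```

For example, a pixel at the principal point with depth 2 maps to the point (0, 0, 2) in the camera frame.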
Similar to [32], for each pixel point, the 8-neighbor convention is used to determine several planes around it, as in Fig. 4. All 8 neighbors of the point are grouped into 4 pairs. The two vectors connecting the point to the two neighbors in one pair form a 90-degree angle. The four pairs of vectors define 4 surfaces and thus 4 surface normals, which can be calculated by:
where each neighbor is indexed by its pair and by its position within that pair.
The final normalized surface normal of the point is obtained by normalizing and averaging the four estimated normals:
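A minimal NumPy sketch of this per-pixel normal computation is given below. The particular grouping of the 8 neighbors into 4 perpendicular pairs is an assumption chosen for illustration (the grouping actually used is the one in Fig. 4), border pixels are simply skipped, and the function name is hypothetical.

```python
import numpy as np

# Neighbour offsets (d_row, d_col) grouped into 4 pairs whose image-plane
# directions are perpendicular; this particular grouping is an assumption.
PAIRS = [((-1, 0), (0, 1)),    # up, right
         ((1, 0), (0, -1)),    # down, left
         ((-1, 1), (1, 1)),    # up-right, down-right
         ((1, -1), (-1, -1))]  # down-left, up-left

def surface_normals(points):
    """points: (H, W, 3) array of 3D points. Returns (H-2, W-2, 3) unit normals
    for the interior pixels, averaging the 4 normalized pair-wise normals."""
    points = np.asarray(points, dtype=float)
    h, w, _ = points.shape
    centre = points[1:-1, 1:-1]

    def shifted(offset):
        # Vector from each interior point to its neighbour at the given offset.
        dr, dc = offset
        return points[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc] - centre

    total = np.zeros_like(centre)
    for first, second in PAIRS:
        n = np.cross(shifted(first), shifted(second))
        total += n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12)
    return total / (np.linalg.norm(total, axis=-1, keepdims=True) + 1e-12)
```

The pairs are ordered with the same rotation sense so that the four cross products point to the same side of the surface before they are averaged.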
III-B2 Ground point detection
Ground points usually refer to points whose normalized normal is close to the ideal ground normal. Given this ideal target normal and the calculated normalized surface normal, we propose a similarity function based on the absolute value of the cosine function. The calculated similarity can be used as a simple criterion to determine whether a point is a ground point or not.
| Method | Scale factor | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Zhou et al. CVPR’17 | GT | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959 |
| Yang et al. AAAI’18 | GT | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| Mahjourian et al. CVPR’18 | GT | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| Bian et al. NIPS’19 | GT | 0.128 | 1.047 | 5.234 | 0.208 | 0.846 | 0.947 | 0.976 |
| Pinard et al. ECCV’18 | P | 0.271 | 4.495 | 7.312 | 0.345 | 0.678 | 0.856 | 0.924 |
| Roussel et al. IROS’19 | S | 0.175 | 1.585 | 6.901 | 0.281 | 0.751 | 0.905 | 0.959 |

Abs Rel, Sq Rel, RMSE and RMSE log: lower is better. Last three columns: higher is better.
where the operator denotes the inner product.
Considering the uncertainty introduced by estimating the surface normal, and the fact that the y-axis of the camera coordinate system is not strictly perpendicular to the ground (Fig. 5), an angle threshold is set. A pixel point that satisfies this threshold is considered a ground point. After this test has been applied to all pixel points, a set of ground points is obtained, whose definition uses the y-axis value of each point. A ground mask is thereafter generated.
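The ground test described above can be sketched as follows. Both the ideal ground normal `(0, 1, 0)` and the 5-degree default threshold are illustrative assumptions, since the original values are not reproduced in this text. The sketch thresholds the angle recovered from the absolute cosine similarity, which may differ in form from the authors' similarity function, and the function name is hypothetical.

```python
import numpy as np

def ground_mask(normals, ideal_normal=(0.0, 1.0, 0.0), max_angle_deg=5.0):
    """normals: (..., 3) unit normals. A pixel is marked as ground when the
    angle between its normal and the ideal ground normal (sign ignored)
    is below max_angle_deg."""
    ideal = np.asarray(ideal_normal, dtype=float)
    ideal = ideal / np.linalg.norm(ideal)
    similarity = np.abs(normals @ ideal)  # |cos(angle)| via the inner product
    angle = np.degrees(np.arccos(np.clip(similarity, 0.0, 1.0)))
    return angle < max_angle_deg
```

Using the absolute value makes the test insensitive to whether the estimated normal points up or down.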
III-B3 Camera height estimation
Once all ground points have been densely identified in the image, the geometrical relationship between the ground points and the camera can be exploited. As can be seen from Fig. 5, the camera height is the projection of the vector from the camera center to a ground point onto the direction of that point's surface normal. Therefore, the camera height for each ground point can be calculated as follows:
This operation is performed for all detected ground points.
A set of camera heights, with as many elements as there are ground points, is thus obtained. For the overall scale factor, however, a single camera height must be estimated for the relative depth map. After careful experiments, the median of all estimated camera heights is selected as the final camera height.
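The per-point height estimate and the median selection can be sketched as below. The sketch assumes the camera center is the origin of the coordinate frame, so the vector to a ground point is the point itself; the function name is hypothetical.

```python
import numpy as np

def estimate_camera_height(points, normals, mask):
    """points, normals: (..., 3) arrays; mask: boolean ground mask of matching shape.
    Each ground point votes with |n . P|, the length of the projection of the
    camera-to-point vector onto the point's unit normal; the median is returned."""
    heights = np.abs(np.sum(points[mask] * normals[mask], axis=-1))
    return float(np.median(heights))
```

Because the median is used, a minority of wrongly segmented points (for example, a point with an implausible height of 9 among points at 1.5) does not change the result.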
III-B4 Scale factor calculation
Given the camera height estimated for the current relative depth map, all that is still needed to calculate the scale factor is the real height of the camera. The scale factor for the current relative depth estimate is simply determined as follows:
III-C Absolute depth estimation
After the scale factor for the current relative depth map has been estimated, the absolute depth can be calculated pixel-wise:
where the result is the absolute depth estimated for the current image.
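The last two steps amount to a ratio and a pixel-wise multiplication. The sketch below assumes the scale factor is the real camera height divided by the estimated one; the function name and the example height of 1.65 m in the usage note are hypothetical.

```python
import numpy as np

def recover_absolute_depth(relative_depth, estimated_height, real_height):
    """Scale a relative depth map by real_height / estimated_height."""
    scale = real_height / estimated_height
    return scale, relative_depth * scale
```

For instance, if the camera is mounted 1.65 m above the ground and the relative depth map implies a height of 0.5, every relative depth value is multiplied by 3.3.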
Thorough experiments are presented here to evaluate the DNet pipeline. Quantitative results show that the proposed DNet achieves competitive performance on both relative depth estimation and scale recovery. An ablation study is also performed to demonstrate the effectiveness of the proposed DCP layer. Because DGC depends on sufficient visible ground, experiments under different ground point ratios are conducted to show the robustness of the DGC scale recovery module.
IV-A Implementation details
The same training parameters and method as in Monodepth2 are used, including the loss hyper-parameters and the SSIM weight. Only monocular image sequences are used during training. An angle threshold is applied for scale recovery, and low values are assigned to the two loss weights for low-res predictions.
The experiments are run on a computer with Intel Xeon 8163 CPU (2.5GHz) and NVIDIA RTX 2080 Ti.
IV-B Evaluation dataset
All experiments for the evaluation of DNet are conducted on the Eigen split of KITTI 2015, which contains 697 test images. For evaluating depth estimation results, it provides ground truth obtained by projecting LiDAR 3D point clouds onto 2D depth maps. However, there is no ground truth for the scale factors that convert relative depth maps into absolute ones. The common practice is to use the ratio between the medians of the LiDAR depth values and the estimated ones as the ground-truth scale factor.
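The median-ratio scale factor used as ground truth can be sketched as follows. The sketch assumes that pixels without a LiDAR return are stored as zeros in the ground-truth depth map; the function name is hypothetical.

```python
import numpy as np

def median_scale_factor(gt_depth, pred_depth):
    """Ratio between the medians of ground-truth and predicted depth,
    computed only where LiDAR ground truth exists (gt_depth > 0)."""
    valid = gt_depth > 0
    return float(np.median(gt_depth[valid]) / np.median(pred_depth[valid]))
```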
IV-C Quantitative evaluation
A thorough quantitative evaluation is presented to show the overall performance of the DNet pipeline on both relative and absolute depth estimation. Commonly used metrics are adopted for evaluation.
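The text does not spell out the metrics, so the sketch below assumes the definitions widely used for KITTI depth evaluation: Abs Rel, Sq Rel, RMSE, RMSE log and the three threshold accuracies. The function name and dictionary keys are hypothetical.

```python
import numpy as np

def depth_metrics(gt, pred):
    """gt, pred: 1-D arrays of positive depths at valid pixels."""
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "delta_1": np.mean(ratio < 1.25),       # fraction of pixels within 1.25x
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }
```

The first four values are errors (lower is better), while the last three are accuracies (higher is better).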
Table I shows the overall depth estimation performance of DNet, using both ground-truth (GT) and DGC-based scale recovery, in comparison with 14 self-supervised monocular depth estimators. DNet with GT scale recovery is evaluated first to demonstrate its relative depth estimation performance. As can be seen from the table, DNet with GT scale recovery achieves a satisfactory result. Compared with Monodepth2, it improves the first four metrics by 1.74%, 4.32%, 1.05% and 1.04%, respectively.
In terms of absolute depth estimation, DGC performs almost as well as GT-based scale recovery. Compared with Roussel et al., DNet improves the first four metrics by 32.57%, 41.64%, 28.73% and 29.18%, respectively. DNet with the DGC module even outperforms most depth estimators that use GT scale recovery. These results indicate that the DGC scale recovery method, in spite of its simplicity, achieves satisfactory scale recovery.
IV-D Ablation study
To better show the benefit of the proposed modules and the robustness against the ground point ratio, a comprehensive ablation study is conducted.
TABLE II (baseline vs. DNet) and TABLE III (baseline vs. DNet on object-level depth prediction) share the columns: Method; Abs Rel, Sq Rel, RMSE, RMSE log (lower is better); accuracy metrics (higher is better).
IV-D1 Benefit of densely connected prediction layer:
To show the effectiveness of the hierarchical features generated by the densely connected prediction layer, the baseline and DNet are compared in Table II. The proposed densely connected prediction layer improves the first four metrics by 3.42%, 3.36%, 1.78% and 2.05%, respectively.
IV-D2 Benefit of densely connected prediction layer on object-level prediction:
Depth estimation on objects can be challenging because of irregular boundaries and texture copy effects. To show the improvement brought by the DCP layer on object-level prediction, Mask-RCNN is used to generate object masks on the test files, as shown in Fig. 6, and the error metrics are calculated only within the masked areas. Table III compares the baseline and DNet on object-level depth prediction. The proposed densely connected prediction layer improves object-level prediction on the first four metrics by 11.01%, 23.45%, 5.80% and 2.14%, respectively.
IV-E Robustness of DGC scale recovery against visible ground:
Since DGC scale recovery largely depends on ground point extraction, the relationship between its performance and the proportion of ground points in a single frame should be carefully evaluated. The evaluation result is shown in Fig. 7, where performance is plotted against the ground point ratio on the x-axis. It can be seen that when the ground point ratio is larger than 1.03%, the proposed DGC module performs consistently and robustly under different driving conditions, at a level comparable to GT scale recovery.
|DGC scale recovery||4.1ms|
IV-F Qualitative evaluation
Qualitative results are shown in Fig. 8 and Fig. 9. Fig. 8 shows the overall absolute depth estimation results as well as intermediate results such as the surface normal map and the ground point mask. Fig. 9 illustrates the improvement brought by the DCP layer in comparison with our baseline. It can be seen that object boundaries are more precise and depth artifacts are reduced to some extent.
IV-G Additional DGC and GT comparisons
There are also results showing that in some cases DGC scale recovery works even better than GT scale recovery, especially in scenes where the ground point ratio is relatively large. Some examples of such scenes can be seen in Fig. 10, and the performance on those frames is given in Table V. Surprisingly, depending on the metric, the DGC scale recovery module performs better than GT scale recovery in between 31.7% and 45.2% of the frames across the four metrics. The detailed ratios of frames in which DGC performs favorably against GT scale recovery can be seen in Table VI.
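TABLE V (per-frame comparison of DGC and GT scale recovery): columns are Frame; Scale factor; Abs Rel, Sq Rel, RMSE, RMSE log (lower is better).
TABLE VI (ratio of frames where DGC performs favorably against GT scale recovery): columns are Abs Rel, Sq Rel, RMSE, RMSE log.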
In this work, a novel pipeline for self-supervised monocular absolute depth estimation is presented. The DCP layer is proposed to generate hierarchical features for high-resolution depth inference, so that object boundaries are more accurate and depth artifacts are better handled. To make self-supervised monocular depth estimation easier to adapt to and use in autonomous driving applications, the DGC module is introduced to perform absolute depth prediction without additional sensors or depth ground truth. Extensive experiments were conducted to demonstrate the effectiveness and robustness of the proposed DNet pipeline as well as of the DCP and DGC modules. This work provides intuition for better use of hierarchical features and can serve as a basis for further exploration of scale recovery methods.
[1] High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941.
[2] (2019) Unsupervised scale-consistent depth and ego-motion learning from monocular video. arXiv preprint arXiv:1908.10553.
[3] Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8001–8008.
[4] (2011) What does ground tell us? Monocular visual odometry under planar motion constraint. In 2011 11th International Conference on Control, Automation and Systems, pp. 1480–1485.
[5] (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pp. 2366–2374.
[6] (2016) Object-aware bundle adjustment for correcting monocular scale drift. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 4770–4776.
[7] (2018) Deep ordinal regression network for monocular depth estimation. pp. 2002–2011.
[8] (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
[9] (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3838.
[10] (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279.
[11] (2019) Learn stereo, infer mono: siamese networks for self-supervised, monocular, depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
[12] (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[13] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
[14] (2012) Depth extraction from video using non-parametric sampling-supplemental material. In European Conference on Computer Vision.
[15] (2011) Monocular visual odometry using a planar road model to solve scale ambiguity.
[16] (2019) From big to small: multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326.
[17] (2015) Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2024–2039.
[18] (2018) Every pixel counts++: joint learning of geometry and motion with 3D holistic understanding. arXiv preprint arXiv:1810.06125.
[19] (2018) Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5667–5675.
[20] (2019) SuperDepth: self-supervised, super-resolved monocular depth estimation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9250–9256.
[21] (2018) Learning structure-from-motion from motion. In Proceedings of the European Conference on Computer Vision (ECCV).
[22] (2018) Learning monocular depth estimation with unsupervised trinocular assumptions. In 2018 International Conference on 3D Vision (3DV), pp. 324–333.
[23] (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12240–12249.
[24] (2019) Monocular depth estimation in new environments with absolute scale. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1735–1741.
[25] (2015) High accuracy monocular SfM and scale correction for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (4), pp. 730–743.
[26] (2017) Probabilistic global scale estimation for MonoSLAM based on generic object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 48–56.
[27] (2017) SfM-Net: learning of structure and motion from video. arXiv preprint arXiv:1704.07804.
[28] (2018) Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2022–2030.
[29] (2018) Monocular visual odometry scale recovery using geometrical constraint. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 988–995.
[30] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
[31] (2018) LEGO: learning edge with geometry all at once by watching videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 225–234.
[32] (2017) Unsupervised learning of geometry with edge-aware depth-normal consistency. arXiv preprint arXiv:1711.03665.
[33] (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5684–5693.
[34] (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992.
[35] (2019) Ground-plane-based absolute scale estimation for monocular visual odometry. IEEE Transactions on Intelligent Transportation Systems.
[36] (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858.
[37] (2018) DF-Net: unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 36–53.