SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation

05/21/2019 ∙ by Michaël Ramamonjisoa, et al. ∙ Université de Bordeaux 2

We introduce SharpNet, a method that predicts an accurate depth map for an input color image, with particular attention to the reconstruction of occluding contours. Occluding contours are an important cue for object recognition, and for realistic integration of virtual objects in Augmented Reality, but they are also notoriously difficult to reconstruct accurately. For example, they are a challenge for stereo-based reconstruction methods, as points around an occluding contour are visible in only one image. Inspired by recent methods that introduce normal estimation to improve depth prediction, we introduce a novel term that constrains depth and occluding contours predictions. Since ground truth depth is difficult to obtain with pixel-perfect accuracy along occluding contours, we use synthetic images for training, followed by fine-tuning on real data. We demonstrate our approach on the challenging NYUv2-Depth dataset, and show that our method outperforms the state-of-the-art along occluding contours, while performing on par with the best recent methods for the rest of the images. Its accuracy along the occluding contours is actually better than the “ground truth” acquired by a depth camera based on structured light. We show this by introducing a new benchmark based on NYUv2-Depth for evaluating occluding contours in monocular reconstruction, which is our second contribution.




1 Introduction

Monocular depth estimation is a very ill-posed yet highly desirable task for applications such as robotics, augmented or mixed reality, autonomous driving, and scene understanding in general. Recently, many methods have been proposed to solve this problem using Deep Learning approaches, either relying on supervised learning [5, 4, 17, 7] or on self-learning [9, 29, 20], and these methods already often obtain very impressive results.

Figure 1 (panel labels: Jiao et al. [14]; NYUv2-Depth Ground Truth Depth; Manual insertion): Our SharpNet method shows significant improvement over state-of-the-art methods in terms of occluding contours accuracy, while being competitive with state-of-the-art on global scene monocular depth estimation. We show augmentation with a virtual Stanford rabbit on an NYUv2 [22] RGB image using different depth maps for occlusion-aware integration. The first three rows show the depth map used for occlusion-aware insertion (left) and the resulting augmentation (right). An error of only a few pixels can significantly degrade the realism of the integration. The last row shows an insertion obtained with a manually drawn binary mask (in black) for reference. Note that augmentation is significantly better with SharpNet than with the ground truth obtained with a structured-light camera (second row).

However, as shown in Fig. 1, occluding contours remain difficult to reconstruct correctly, even though they are an important cue for object recognition, and for augmented reality or path planning, for example. This is due to several reasons. First, the depth annotations of training images are likely to be inaccurate along the occluding contours if they are obtained with a stereo reconstruction method or a structured light camera. This is for example the case for the NYUv2-Depth dataset [22], which is an important benchmark, used by many recent works for evaluation. The inaccuracy arises because on one or both sides of the occluding contours lie 3D points that are visible in only one image, challenging the 3D reconstruction [24]. Structured light cameras essentially rely on stereo reconstruction, where one image is replaced by a known pattern [10], and therefore suffer from the same problem. Second, occluding contours, despite their importance, represent a small part of the images, and may not influence the loss function used during training if they are not handled with special care.

In this paper, we show that it is possible to learn to reconstruct occluding contours more accurately by adding a simple term that constrains the depth predictions together with the occluding contours during learning. This approach is inspired by recent works that predict the depths and normals for an input image, and enforce constraints between them [27, 21]. A similar constraint between depths and occluding contours can be introduced, and we show that this results in better reconstructions along the occluding contours, without degrading the accuracy of the rest of the reconstruction.

More exactly, we train a network to predict depths, normals, and occluding contours for an input image, by minimizing a loss function that integrates constraints between the depths and the occluding contours, and also between the depths and the normals. We show that these two constraints can be integrated in a very similar way with simple terms in the loss function. At run-time, we can predict only the depth values, making our method much faster than many state-of-the-art methods, since it runs at 150 fps on images and is thus suitable for real-time applications.

We show that each aspect of our training procedure improves the depth output. In particular, our experiments show that the constraint between depths and occluding contours is important, and that the improvement does not simply come from an effect of multi-task learning. Learning to predict the normals in addition to the depths and the occluding contours helps the convergence of training towards good depth predictions.

We demonstrate our approach on the NYUv2-Depth dataset, in order to compare it against previous methods. Since the depth annotations are noisy especially along the occluding contours, as already mentioned above, we use synthetic images for initializing the network before fine-tuning on NYUv2-Depth. As training data for the locations of the occluding contours, we simply use the object instance boundaries given by the synthetic dataset. However, we only use the depth ground truth as training data when finetuning on the NYUv2-Depth dataset.

A proper evaluation of the accuracy of the occluding contours is difficult. Since the “ground truth” depth data is typically noisy along occluding contours, as is the case for NYUv2-Depth, an evaluation based on this data would not be representative of the actual quality. Even with better depth data, automatically identifying occluding contours in ground truth depth data, as depth discontinuities for example, would be sensitive to the parameters used by the identification method (see Fig. 4).

We therefore decided to manually annotate the occluding contours in a randomly selected subset of 30 images from the NYUv2-Depth test data, which we call the NYUv2-OC dataset. We will make our annotations and our code for the evaluation of the occluding contours publicly available for future comparison. We evaluate our method on this data in terms of 2D localization, in addition to evaluating on the NYUv2-Depth validation set on more standard depth estimation metrics [5, 4, 17]. Our experiments show that while achieving competitive results on those metrics on the NYUv2-Depth benchmark by placing second on all of them, we outperform all previous methods in terms of occluding contours 2D localization, especially the current leading method on monocular depth estimation [14].

2 Related Work

Monocular depth estimation for images has made significant progress recently. Below, we mostly discuss the most recent methods, along with several techniques that help monocular depth estimation: learning from synthetic data, using normals for learning to predict depths, and refinement based on CRFs.

2.1 Supervised and Self-Supervised Monocular Depth Estimation

With the development of large datasets of images annotated with depth data [22, 8, 23, 33], many supervised methods have been proposed. Eigen et al. [5, 4] used multi-scale depth estimation to capture global and local information to help depth prediction. Given the remarkable performances they achieved on both popular benchmarks NYUv2-Depth [22] and KITTI [8], more work extended this multi-scale approach [18, 30]. Previous work also considers ordinal depth classification [7] or pair-wise depth-map comparisons [2] to add local and non-local constraints. Our approach relies on a simpler monoscale architecture, making it efficient at run-time. Our constraints between depths, normals, and occluding contours guide learning towards good depth prediction for the whole image.

Laina et al. [17] exploit the power of deep residual neural networks [11] and show that using the more appropriate BerHu [19, 34] reconstruction loss yields better performances. However, their end results are quite smooth around occluding contours, making their method inappropriate for realistic occlusion-aware augmented reality.

Jiao et al. [14] noticed that the depth distribution of the NYUv2 dataset is heavy-tailed. The authors therefore proposed an attention-driven loss for the network supervision, and paired the depth estimation task with semantic segmentation to improve performances on the dataset. However, while they currently achieve the best performance on the NYUv2-Depth dataset, their approach suffers from a bias towards high-depth areas such as windows, corridors or mirrors. While this translates into a significant decrease of the final error, it also produces blurry depth maps, as can be seen in Fig. 1. By contrast, our reconstructions tend to be much sharper along the occluding boundaries as desired, and our method is much faster, making it suitable for real-time applications.

Self-learning methods have also become popular for monocular reconstruction, and exploit the consistency between multiple views [9, 29, 20, 31, 32, 25]. While such an approach is very exciting, it does not yet reach the accuracy of supervised methods in general, and it should be preferred only when no annotated data is available for supervised learning.

2.2 Edge- and Occlusion-Aware Depth Estimation

Wang et al. [27] introduced their SURGE method to improve scene reconstruction on planar and edge regions by learning to jointly predict depth and normal maps, as well as edges and planar regions, then refining the depth prediction by solving an optimization problem using a Dense Conditional Random Field (DCRF). While their method yields appealing reconstruction results on planar regions, it still underperforms state-of-the-art methods on global metrics, and the use of DCRF makes it unsuited for real-time applications. Furthermore, SURGE [27] is evaluated on the reconstruction quality around edges using standard depth error metrics, but not on the 2D localization of their occluding contours.

Many self-supervised methods [31, 32, 25, 9] have incorporated edge- or occlusion-aware geometry constraints, which exist when working with stereo pairs or sequences of images as provided in the very popular KITTI depth estimation benchmark [8]. However, although these methods can perform monocular depth estimation at test time, they require multiple calibrated views at training time. They are therefore unable to work on monocular RGB-D datasets such as NYUv2-Depth [22] or SUN-RGBD [23].

The authors of [28, 13] worked on occlusion-aware depth estimation to improve reconstruction for augmented reality applications. While achieving spectacular results, these methods require one or multiple light-field images, which are more costly to obtain than ubiquitous RGB images.

Conscious of the lack of evaluation metrics and benchmarks for the quality of edge and plane reconstruction from monocular depth estimates, Koch et al. [15] introduced the iBims-v1 dataset, a high quality benchmark of 100 RGB images with their associated depth map. With their work, they tackle the low quality of depth maps of other RGB-D datasets such as [23] and [22], while also introducing annotations and metrics for occluding contours and planarity of planar regions. We build on top of their work for our evaluation of occluding contour reconstruction quality.

3 Method

Figure 2: The architecture of our multi-task encoder-decoder network. We use a single ResNet50 encoder which learns an intermediate representation that is shared by all decoders. With this setting, the representation generalizes better for all tasks. We use skip connections between features of the encoder and of the decoder at corresponding scales.

As shown in Fig. 2, we train a network to predict, for a training color image, a depth map, a map of occluding contour probabilities, and a map of normals.

Figure 3: We compare contours extracted from object instance boundaries in the original annotated NYUv2-Depth dataset. The instance boundaries are more accurate than the occluding boundaries extracted automatically from the Kinect-v1 depth maps of NYU and we use them instead for training, even if they do not contain the occluding boundaries within the objects’ silhouettes.

3.1 Training Overview

We first train on the synthetic dataset PBRS [33], which provides the ground truth depth map, normals map, and binary map of object instance contours for each training image. Since occluding contours are not directly provided in the PBRS dataset, we use the object instance contours as a proxy instead. We argue that on a macroscopic scale, a large proportion of occluding contours in an image are due to objects occluding one another, as can be seen in Fig. 3. However, we show that we can also enable our network to learn internal occluding contours within objects even without “pure” occluding contours supervision. Indeed, we make use of constraints on the depth map and occluding contour predictions (see Section 3.4 for more details) to force the contour estimation task to also predict intra-object occluding boundaries.

We then finetune on the NYUv2-Depth dataset without direct supervised losses on the occluding contours or normals (the corresponding supervision terms are described below). Even though [16] and [22] produce ground truth normal maps with different estimation methods operating on the Kinect-v1 depth maps, their results are generally noisy. Occluding contours are not given in the original NYUv2-Depth dataset. Although one could automatically extract them using edge detectors [1, 3] on depth maps, such extraction is very sensitive to the detector’s parameters (see Figure 4). Instead, we introduce consensus terms that explicitly constrain the predicted contours, normals and depth maps together (also described below) at training time.

At test-time, we can choose to use only the depth stream of the network if we are interested in neither the normals nor the boundaries, making inference very fast.

Figure 4: Left column: A sample from our manually annotated NYUv2-OC dataset with an RGB image from NYUv2, its Kinect-v1 depth map and our manually annotated occluding contours. Right column: Our NYUv2-OC occluding contours (in red) on top of the edges detected on the ground truth Kinect-v1 depth map (in black) for various Canny detector parameters (low and high thresholds). The more permissive the detector gets, the more occluding contours are detected (although never all of them), and the more erroneous they become. This shows that automatically extracted contours, even on ground truth depth maps, are problematic when evaluating the accuracy of occluding contours, motivating our NYUv2-OC dataset made of manually drawn contours.

3.2 Loss Function

We estimate the parameters of the network by minimizing, over all the training images, a loss function that combines two groups of terms:

  • Supervision terms for the depth, the occluding contours, and the normals respectively. We adjust the weights of these three terms during training so that we focus first on learning local geometry (normals and boundaries) then on depth. See Section 4.1 for more details.

  • Consensus terms, which introduce constraints between the predicted depth map and the predicted contours, and between the predicted depth map and the predicted normals respectively.

We detail these losses below. All losses are computed using only valid pixel locations. The PBRS synthetic dataset provides such a mask. When finetuning on NYUv2-Depth, we mask out the white pixels on the images border.
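As a rough illustration of how such terms could be combined, the sketch below sums three weighted supervision terms and two consensus terms, and shows the fine-tuning case in which the contour and normal supervision weights are set to zero. The function name, the dictionary keys, the numeric values, and the unit weight on the consensus terms are all assumptions made for this example.

```python
def combine_losses(terms, weights, consensus_weight=1.0):
    """Weighted sum of supervision terms plus consensus terms.

    `terms` maps hypothetical names to already-computed scalar losses.
    `weights` holds the supervision weights for "depth", "contours", "normals".
    """
    supervised = sum(weights[k] * terms[k] for k in ("depth", "contours", "normals"))
    consensus = terms["depth_contours"] + terms["depth_normals"]
    return supervised + consensus_weight * consensus


# Made-up scalar values, only to exercise the function.
terms = {"depth": 2.0, "contours": 0.5, "normals": 0.25,
         "depth_contours": 0.1, "depth_normals": 0.2}

# Synthetic pre-training: all three supervision terms are active.
pretrain = combine_losses(terms, {"depth": 1.0, "contours": 1.0, "normals": 1.0})

# Fine-tuning on real data: contour and normal supervision are disabled,
# while the consensus terms stay active.
finetune = combine_losses(terms, {"depth": 1.0, "contours": 0.0, "normals": 0.0})
```

Keeping the consensus terms outside the weight dictionary makes it explicit that they remain active in both training stages.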

3.3 Supervision Terms

The supervision terms on the predicted depth and normal maps are drawn from previous works on monocular depth prediction. For our term on occluding contours prediction, we rely on previous work for edge prediction.

Depth Prediction Loss.

As in recent works, our loss on depth prediction applies to log-distances. We use the BerHu loss function [19, 34], as it was shown in [17] to result in faster convergence and better solutions. The loss is summed over all the valid pixel locations. The BerHu (also known as reverse Huber) function behaves as an L2 loss for large deviations and as an L1 loss for small ones. We set the parameter of the BerHu function as in [17].
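The behavior of the BerHu function on log-depth residuals can be sketched as follows. This is a minimal pure-Python illustration, not our training code: the function names are hypothetical, depths are given as flat lists, and the threshold is assumed to be one fifth of the largest absolute residual, which is the choice made in [17].

```python
import math


def berhu(residual, c):
    """Reverse Huber: L1 for |x| <= c, scaled L2 for |x| > c."""
    a = abs(residual)
    if a <= c:
        return a
    return (a * a + c * c) / (2.0 * c)


def depth_loss(pred_depth, gt_depth, valid):
    """Sum of BerHu penalties on log-depth residuals over valid pixels."""
    residuals = [
        math.log(p) - math.log(g)
        for p, g, v in zip(pred_depth, gt_depth, valid)
        if v
    ]
    if not residuals:
        return 0.0
    # Assumed threshold, following [17]: one fifth of the largest absolute residual.
    c = 0.2 * max(abs(r) for r in residuals)
    if c == 0.0:
        return 0.0  # perfect prediction on all valid pixels
    return sum(berhu(r, c) for r in residuals)
```

The two branches of `berhu` meet at `|x| = c`, so the penalty is continuous while growing quadratically for large residuals.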

Occluding Contours Prediction Loss.

We use the recent attention loss from [26], which was developed for 2D edge detection, to learn to predict the occluding contours. This attention loss helps deal with the imbalance between edge pixels and non-edge pixels. Its hyper-parameters are set to the values used by the authors of [26], and its balancing weight is computed image per image as the proportion of contour pixels. The occluding contour prediction loss applies this pixel-wise attention loss at all valid pixel locations.

As mentioned above, this loss is disabled when finetuning on the NYUv2-Depth dataset.

Normals Prediction Loss.

For normals prediction, we use a common method introduced by Eigen et al. [4], which is to minimize at all valid pixels the angle between the predicted normals and their ground truth counterparts by maximizing their dot-product. Our loss slightly differs from the one of [4] as we limit it to positive values. As mentioned earlier, this loss is disabled when finetuning on the NYUv2-Depth dataset.

3.4 Consensus Terms

Depth-Contours Consensus Term.

To force the network to predict sharp depth edges at occluding contours, where strong depth discontinuities occur, we propose a consensus loss between the predicted occluding contours probability map and the predicted depth map. This encourages the network to associate pixels with large depth gradients with occluding contours: high-gradient areas will lead to a large loss unless the occluding contour probability is close to one. [9, 12] also used this type of edge-aware gradient loss, although they used it to impose consensus between photometric gradients and depth gradients. However, relying on photometric gradients can be dangerous, as strong image gradients do not necessarily correspond to occluding contours, and occluding contours do not necessarily correspond to strong image gradients.

By enforcing this constraint on both predictions, we introduce a bias on boundary prediction from instance boundaries towards occluding contours.

Since this loss does not involve ground truth occluding contours, it can be used when finetuning on the NYUv2-Depth dataset, thus allowing semi-supervised learning of the occluding contours on NYUv2-Depth. Because it involves both depth and contour prediction streams, it forces the depth and contours decoders to become consistent, but also helps the ResNet50 encoder to produce a more general and powerful representation.
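To make the intended behavior concrete, the sketch below implements one possible penalty with the property described above: a large depth gradient is costly unless the contour probability at that pixel is close to one. The specific form (gradient magnitude multiplied by one minus the contour probability), the forward finite differences, and all names are assumptions chosen for illustration; this is not a transcription of our loss.

```python
def forward_diff(img, y, x):
    """Forward finite differences (dx, dy) of a 2D list; zero at the far border."""
    h, w = len(img), len(img[0])
    dx = img[y][x + 1] - img[y][x] if x + 1 < w else 0.0
    dy = img[y + 1][x] - img[y][x] if y + 1 < h else 0.0
    return dx, dy


def depth_contour_penalty(depth, contour_prob):
    """Hypothetical edge-aware penalty on depth gradients.

    depth: 2D list of floats; contour_prob: 2D list of values in [0, 1].
    """
    total = 0.0
    for y in range(len(depth)):
        for x in range(len(depth[0])):
            dx, dy = forward_diff(depth, y, x)
            grad_mag = abs(dx) + abs(dy)
            # Gradients are free where a contour is predicted with probability 1.
            total += grad_mag * (1.0 - contour_prob[y][x])
    return total
```

With a depth step between two pixels, the penalty vanishes only when the contour map fires at the step itself, which is the consensus the term is meant to encourage.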

Depth-Normals Consensus Loss.

Depth and normals are two highly correlated entities. Thus, to impose geometric consistency during prediction between the normal and depth predictions, we use a consensus loss that relates a quantity extracted from the predicted 3D normal vector to the 2D gradient of the depth map estimate, computed using finite differences. This term enforces consistency between the normal and depth predictions at all pixels except those predicted as boundaries. Our formulation of depth-normals consensus is much simpler than those proposed in [27, 31, 6] as these works express their constraint in 3D world coordinates, thus requiring the camera calibration matrix.

Again, imposing this constraint during finetuning allows us to constrain normals, boundaries, and depth, even when the ground truth normals and boundaries are not available.
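The relation between depth gradients and normals can be illustrated with a simple consistency check in image coordinates. The sketch assumes the depth map is treated as a surface z = D(x, y), whose normal is parallel to (-dD/dx, -dD/dy, 1), and it down-weights pixels predicted as boundaries. The residual form, the forward differences, and the function name are assumptions for this example and may differ from our actual term.

```python
def depth_normal_penalty(depth, normals, contour_prob):
    """Hypothetical consistency penalty between depth gradients and normals.

    depth: 2D list of floats; normals: 2D list of (nx, ny, nz) unit vectors;
    contour_prob: 2D list of probabilities in [0, 1].
    """
    h, w = len(depth), len(depth[0])
    total = 0.0
    # Only pixels with a right and a bottom neighbour have forward differences.
    for y in range(h - 1):
        for x in range(w - 1):
            dx = depth[y][x + 1] - depth[y][x]
            dy = depth[y + 1][x] - depth[y][x]
            nx, ny, nz = normals[y][x]
            # For z = D(x, y) the normal is parallel to (-dx, -dy, 1), so both
            # residuals vanish when the two predictions agree.
            residual = abs(nx + dx * nz) + abs(ny + dy * nz)
            total += (1.0 - contour_prob[y][x]) * residual
    return total
```

No camera calibration appears in this check, since everything is expressed per pixel in the image plane.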

4 Experiments

We evaluate our method and compare it to previous work using standard metrics, but also the depth boundary edge (DBE) accuracy metric introduced by Koch et al. [15] (see following Section 4.2 and Eq. (8) for more details). We show that our method achieves the best trade-off between global reconstruction error and DBE.

4.1 Implementation Details

We implemented our work in PyTorch and will make our pretrained weights, training and evaluation code publicly available. Both training and evaluation are done on a single high-end NVIDIA GTX 1080 Ti GPU.


We first train our network on the synthetic PBRS [33] dataset, using depth and normals maps, along with object instance boundaries which we use as a proxy to occluding contours. We split the PBRS dataset in training/validation/test sets using an 80%/10%/10% ratio. We then finetune our network on the NYUv2-Depth training set using only depth data. Finally, we use the NYUv2-Depth validation set for depth evaluation and our new NYUv2-OC for occluding contours accuracy evaluation.


Training a multi-task network requires some caution: since several loss terms are involved, and in particular one for each task, one should pay special attention to any suboptimal solution for one task due to ‘over-learning’ another. To monitor each task individually, we track each individual loss along with the global training loss and check that all of them decrease. When setting all loss coefficients equal to one, we noticed that the normals loss decreased faster than others. Similarly, we found that learning boundaries was much faster than learning depth. We argue that this is because local features such as contours or local planes, i.e. where normals are constant, are easier to learn since they appear in almost all training examples. Training depth, however, requires the network to exploit context data such as room layout in order to regress a globally consistent depth map.

Building on these observations, we choose to learn the easier tasks first, then use them as guidance to the more complex task of depth estimation through our novel consensus loss terms of Eqs. (7) and (6). See supplementary material for more details on the training procedure.

4.2 Evaluation Method

We evaluate our method on the benchmark dataset NYUv2-Depth [22]. The most common metrics are: thresholded accuracies (δ < 1.25, δ < 1.25², δ < 1.25³), linear and logarithmic Root Mean Squared Error (RMSE (lin) and RMSE (log) respectively), Absolute Relative difference (rel), and average logarithmic error (log10).
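For reference, these metrics can be computed as in the sketch below, which follows the definitions commonly used since Eigen et al. [5]. It assumes flat lists of strictly positive depths, thresholds of 1.25, 1.25² and 1.25³ for the accuracies, and a natural logarithm for the logarithmic RMSE; the function name and dictionary keys are hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Standard monocular depth metrics over flat lists of positive depths."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    # Symmetric ratio used by the thresholded accuracies.
    ratios = [max(p / g, g / p) for p, g in pairs]
    return {
        "acc_1.25": sum(r < 1.25 for r in ratios) / n,
        "acc_1.25^2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "acc_1.25^3": sum(r < 1.25 ** 3 for r in ratios) / n,
        "rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n,
        "rmse_lin": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
        ),
    }
```

Accuracies are fractions in [0, 1] where higher is better, while the four error measures are better when lower.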

Columns δ through RMSE (log) are evaluated on the full NYUv2-Depth (accuracy, then error); the four DBE columns (px) are evaluated on our NYUv2-OC with different Canny parameters.

| Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | rel | log10 | RMSE (lin) | RMSE (log) | DBE #1 | DBE #2 | DBE #3 | DBE #4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Eigen et al. [4] (VGG) | 0.766 | 0.949 | 0.988 | 0.195 | 0.068 | 0.660 | 0.217 | 2.830 | 2.917 | 3.039 | 3.068 |
| Eigen et al. [4] (AlexNet) | 0.690 | 0.911 | 0.977 | 0.250 | 0.082 | 0.755 | 0.259 | 2.683 | 2.862 | 3.048 | 3.108 |
| Laina et al. [17] | 0.818 | 0.955 | 0.988 | 0.170 | 0.059 | 0.602 | 0.200 | 3.901 | 3.791 | 3.910 | 3.939 |
| Fu et al. [7] | 0.850 | 0.957 | 0.985 | 0.150 | 0.052 | 0.578 | 0.194 | 3.605 | 3.657 | 3.940 | 3.942 |
| Jiao et al. [14] | 0.909 | 0.981 | 0.995 | 0.133 | 0.042 | 0.401 | 0.146 | 5.041 | 3.756 | 3.961 | 3.986 |
| Ours | 0.884 | 0.980 | 0.995 | 0.148 | 0.048 | 0.496 | 0.159 | 2.088 | 2.378 | 2.764 | 2.838 |

Table 1: Our final evaluation results. Results in red, blue and green achieve first, second and third place respectively. Numerical results might vary from the original papers, as we evaluated all methods based on the authors’ depth map predictions, in the center crop proposed by [4], without clipping predictions to a fixed depth range.

NYUv2-Depth benchmark evaluation.

We summarize our comparative study between our method and previous ones in Table 1.

Since authors evaluating on the NYUv2-Depth benchmark often use different evaluation methods, fair comparison is difficult to perform. For instance, [30] and [7] evaluate on crops of the image where the projection maps are available, i.e. they remove a border of the image for evaluation: those regions are provided by Eigen et al. [4] in their evaluation toolkit. Some authors also clip the resulting depth maps to the valid depth sensor range [0.7m; 10m]. Finally, not all the authors make their prediction and/or evaluation code publicly available. The authors of [14] kindly shared their predictions on the NYUv2-Depth dataset with us, and the following results for their method were obtained on the depth map predictions they provided us. All other methods have released their predictions online.

Fair comparison is ensured by evaluating each method solely from its associated depth map predictions with one single evaluation code. An important note is that RMSE values are computed over all pixels in all images, i.e. the sum under the square root is computed over all pixels in the validation set. This differs from some papers, which compute one RMSE value per image and then compute the mean over all images. As mentioned above, we will make our evaluation code and results publicly available.
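The difference between the two RMSE conventions can be seen on a toy example. The sketch below assumes each image is a flat list of depths and uses hypothetical function names; it only illustrates why numbers computed under the two conventions are not directly comparable.

```python
import math


def pooled_rmse(pred_images, gt_images):
    """RMSE with the squared errors pooled over all pixels of all images."""
    sq_sum, count = 0.0, 0
    for pred, gt in zip(pred_images, gt_images):
        for p, g in zip(pred, gt):
            sq_sum += (p - g) ** 2
            count += 1
    return math.sqrt(sq_sum / count)


def per_image_mean_rmse(pred_images, gt_images):
    """Mean over images of the RMSE computed separately on each image."""
    values = [
        math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))
        for pred, gt in zip(pred_images, gt_images)
    ]
    return sum(values) / len(values)


# Two 2-pixel images: one predicted perfectly, one off by 2 everywhere.
preds = [[0.0, 0.0], [0.0, 0.0]]
gts = [[0.0, 0.0], [2.0, 2.0]]
pooled = pooled_rmse(preds, gts)             # sqrt(8 / 4)
per_image = per_image_mean_rmse(preds, gts)  # (0 + 2) / 2
```

On this example the pooled value is larger than the per-image mean, because pooling weights large errors more heavily before the square root is taken.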

Occluding contours location accuracy.

To evaluate occluding contours location accuracy, we follow the work of Koch et al. [15], as they proposed an experimental method for such evaluation. Since it is important to examine whether predicted depth maps are able to represent all occluding contours as depth discontinuities in an accurate way, they analyzed occluding contours accuracy by detecting and comparing edges in predicted and ground truth depth maps.

Since in the NYUv2-Depth dataset the Kinect-v1 depth map is quite noisy, especially around occluding boundaries, we chose to manually annotate a subset of the dataset in terms of occluding contours, building our NYUv2-OC dataset, and use it for evaluation. Several samples of our dataset are shown in Fig. 4, Fig. 3 and Fig. 11. In order to evaluate the predicted depth maps’ quality in terms of occluding contours reconstruction, binary edges are first extracted from the predicted depth map with a Canny detector. They are then compared to the “ground truth” binary edges extracted from the ground truth depth map with the same detection algorithm (edges are extracted from depth maps with normalized dynamic range). This comparison is done via the truncated chamfer distance between the two binary edge images. More precisely, a Euclidean distance transform is applied to the “ground truth” binary edge image and distances are truncated to a maximum of 10 pixels. Predicted edge pixels farther than 10 pixels from the ground truth edges are ignored, in order to evaluate predicted edges only in the local neighborhood of the ground truth edges (we refer the reader to the original paper [15] for further details). Finally, the depth boundary accuracy error is computed from these truncated distances, following [15], with the sum performed over all valid pixels.

We compare our method on this metric with state-of-the-art depth estimation methods using different Canny parameters. Evaluation results are shown in Table 1: we outperform all state-of-the-art methods on occluding contours accuracy, while being a competitive second best on standard depth estimation evaluation metrics.
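The comparison step can be sketched in a few lines once the two binary edge maps are available. The code below is a brute-force illustration under stated assumptions: edge maps are 2D lists of 0/1 values (the Canny step is not shown), the 10-pixel cut-off is the one mentioned above, and the error is taken as the mean distance over the predicted edge pixels that are kept, in the spirit of [15]. Function names are hypothetical, and a real evaluation would use an efficient distance transform instead of the nested loops.

```python
import math


def nearest_edge_distance(gt_edges, y, x):
    """Euclidean distance from (y, x) to the closest ground truth edge pixel."""
    best = math.inf
    for ey, row in enumerate(gt_edges):
        for ex, is_edge in enumerate(row):
            if is_edge:
                best = min(best, math.hypot(y - ey, x - ex))
    return best


def depth_boundary_error(pred_edges, gt_edges, max_dist=10.0):
    """Mean distance (px) from predicted edge pixels to ground truth edges.

    Predicted edge pixels farther than `max_dist` from every ground truth
    edge are ignored. Returns None when no predicted edge pixel is kept.
    """
    kept = []
    for y, row in enumerate(pred_edges):
        for x, is_edge in enumerate(row):
            if is_edge:
                d = nearest_edge_distance(gt_edges, y, x)
                if d <= max_dist:
                    kept.append(d)
    return sum(kept) / len(kept) if kept else None
```

Because far-away predicted edges are discarded rather than penalized, this error measures localization near true contours and says nothing about spurious edges elsewhere.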

Since the detected edges are highly sensitive to the edge detector’s parameters (see Fig. 4), we evaluate the depth boundary accuracy error when varying the low and high threshold parameters of the Canny edge detector. The results are shown in Fig. 5.

Figure 5: Our method achieves the best trade-off between global depth reconstruction error and occluding boundary accuracy, as it achieves second place in depth reconstruction error and first place in occluding contours accuracy. Box plots are drawn using many random combinations of the Canny threshold parameters.
Figure 6 (columns: RGB, Laina et al. [17], Fu et al. [7], Jiao et al. [14], GT (NYUv2), SharpNet): Several examples of images from our NYUv2-OC dataset and their associated depth map estimates for different methods. The second row for each image shows the edges detected on those estimates using a Canny edge detector (in black), overlaid on our manually annotated ground truth (in red). Our SharpNet method not only creates sharper occluding contours, leading to fewer spurious and erroneous contours than with [7] or the Kinect-v1 depth map; it also leads to much better located edges than other methods.

4.3 Ablation Study

To assess the impact of our geometry consensus terms, we performed an ablation study analyzing the contribution of training with synthetic and real data, as well as that of our novel geometry consensus terms. The evaluation of different models on our NYUv2-OC dataset is shown in Table 2, confirming their contribution to both improved depth reconstruction results over the whole NYUv2-Depth dataset and occluding contours accuracy.

| Method | Training Dataset | RMSE (log) | DBE #1 (px) | DBE #2 (px) | DBE #3 (px) | DBE #4 (px) |
|---|---|---|---|---|---|---|
| w/o consensus | PBRS | 0.298 | 2.279 | 2.815 | 3.192 | 3.255 |
| w/ consensus | PBRS | 0.272 | 2.479 | 2.571 | 2.859 | 2.889 |
| w/o consensus | PBRS + NYUv2 | 0.165 | 2.279 | 2.876 | 3.301 | 3.374 |
| w/ consensus | PBRS + NYUv2 | 0.159 | 2.088 | 2.378 | 2.764 | 2.838 |

Table 2: Ablation study of the contribution of the geometry consensus terms and of the training method. We see that our added geometry consensus terms bring a significant performance boost by guiding the depth towards learning accurate occluding contours. These terms also help to keep a good trade-off between occluding contours accuracy and depth reconstruction during the necessary fine-tuning on NYUv2-Depth. RMSE (log) is computed over the full NYUv2-Depth dataset.

5 Conclusion

In this paper, we show that our SharpNet method is able to achieve competitive depth reconstruction from a single RGB image with particular attention to occluding contours thanks to geometry consensus terms introduced during multi-task training. Our high quality depth estimation and high accuracy occluding contours reconstruction allow for realistic integration of virtual objects in augmented reality in real-time as we achieve 150 fps inference speed. We show the superiority of our SharpNet to state-of-the-art by introducing a first version of our new NYUv2-OC occluding contour dataset, which we plan to extend in future work.


  • [1] J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6):679–698, Nov 1986.
  • [2] Y. Cao, T. Zhao, K. Xian, C. Shen, and Z. Cao. Monocular Depth Estimation with Augmented Ordinal Depth Relationships. IEEE Transactions on Image Processing, 2018.
  • [3] P. Dollár and C. L. Zitnick. Fast Edge Detection Using Structured Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8):1558–1570, August 2015.
  • [4] D. Eigen and R. Fergus. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. In International Conference on Computer Vision, pages 2650–2658, 2015.
  • [5] D. Eigen, C. Puhrsch, and R. Fergus. Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
  • [6] X. Fei, A. Wang, and S. Soatto. Geo-Supervised Visual Depth Prediction. IEEE Robotics and Automation Letters, 4:1661–1668, 2018.
  • [7] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep Ordinal Regression Network for Monocular Depth Estimation. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research, 2013.
  • [9] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [10] J. Han, L. Shao, D. Xu, and J. Shotton. Enhanced Computer Vision With Microsoft Kinect Sensor: A Review. IEEE Transactions on Cybernetics, 2013.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [12] P. Heise, S. Klose, B. Jensen, and A. Knoll. PM-Huber: PatchMatch with Huber Regularization for Stereo Matching. In IEEE International Conference on Computer Vision, pages 2360–2367, 2013.
  • [13] X. Jiang, M. L. Pendu, and C. Guillemot. Depth Estimation with Occlusion Handling from a Sparse Set of Light Field Views. IEEE International Conference on Image Processing, pages 634–638, 2018.
  • [14] J. Jiao, Y. Cao, Y. Song, and R. W. H. Lau. Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss. In European Conference on Computer Vision, 2018.
  • [15] T. Koch, L. Liebel, F. Fraundorfer, and M. Körner. Evaluation of CNN-Based Single-Image Depth Estimation Methods. In European Conference on Computer Vision, 2018.
  • [16] L. Ladicky, J. Shi, and M. Pollefeys. Pulling Things Out of Perspective. In Conference on Computer Vision and Pattern Recognition, pages 89–96, 2014.
  • [17] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper Depth Prediction with Fully Convolutional Residual Networks. In International Conference on 3D Vision, pages 239–248, 2016.
  • [18] J. Li, R. Klein, and A. Yao. A Two-Streamed Network for Estimating Fine-Scaled Depth Maps from Single RGB Images. In International Conference on Computer Vision, pages 3392–3400, 2017.
  • [19] B. Owen. A Robust Hybrid of Lasso and Ridge Regression. Contemp. Math., 443, 01 2007.
  • [20] M. Poggi, F. Tosi, and S. Mattoccia. Learning Monocular Depth Estimation with Unsupervised Trinocular Assumptions. In International Conference on 3D Vision, pages 324–333, 2018.
  • [21] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation. In Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.
  • [22] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor Segmentation and Support Inference from RGBD Images. In European Conference on Computer Vision, 2012.
  • [23] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. In Conference on Computer Vision and Pattern Recognition, pages 567–576, June 2015.
  • [24] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2011.
  • [25] Q. Teng, Y. Chen, and C. Huang. Occlusion-Aware Unsupervised Learning of Monocular Depth, Optical Flow and Camera Pose with Geometric Constraints. Future Internet, 10:92, 09 2018.
  • [26] G. Wang, X. Liang, and F. W. B. Li. DOOBNet: Deep Object Occlusion Boundary Detection from an Image. CoRR, abs/1806.03772, 2018.
  • [27] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille. SURGE: Surface Regularized Geometry Estimation from a Single Image. In Advances in Neural Information Processing Systems, pages 172–180, 2016.
  • [28] T. Wang, A. A. Efros, and R. Ramamoorthi. Depth Estimation with Occlusion Modeling Using Light-Field Cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2170–2181, Nov 2016.
  • [29] J. Xie, R. Girshick, and A. Farhadi. Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks. In European Conference on Computer Vision, pages 842–857, 2016.
  • [30] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Multi-Scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [31] Z. Yang, W. Xu, L. Zhao, and R. Nevatia. Unsupervised Learning of Geometry From Videos With Edge-Aware Depth-Normal Consistency. In AAAI, 2018.
  • [32] Z. Yin and J. Shi. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In Conference on Computer Vision and Pattern Recognition, pages 1983–1992, 2018.
  • [33] Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser. Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [34] L. Zwald and S. Lambert-Lacroix. The Berhu Penalty and the Grouped Effect. In arXiv Preprint, 2012.

6 Supplementary Materials

6.1 Synthetic Dataset

To train our network to predict geometrically consistent normals, depth, and occluding contours, we used the synthetic PBRS dataset of Zhang et al. [33]. Perfect consistency between the outputs is important to ensure high-quality depth maps at occluding contours, and also to ensure good generalization from synthetic to real data. Figure 7 shows a sample from the PBRS dataset [33], and Figure 8 illustrates our geometric constraints, which force the network to predict mutually consistent outputs on both synthetic and real images.

Figure 7: A sample from the synthetic PBRS dataset [33].
Figure 8: Our geometric constraints on depth, normals and occluding contours applied during training.

6.2 Training Details

Here we present some details about our training method. Since we perform multi-task learning, each task-specific loss is weighted in the global loss. We found that learning normals and boundaries first gave the best results. We also present the data augmentation strategy used during training.

6.3 Data Augmentation

We used the following standard data augmentation to train both on PBRS [33] and NYUv2-Depth [22]:

  • random scale with scale factor : the depth map is divided by and the coordinate of the normal map is multiplied by ,

  • random rotation of angle : all corresponding maps are rotated in the camera 2D plane, and normal maps are recomputed in camera coordinates (the rotation matrix is applied to each pixel of the rotated normal map),

  • random crops of size ,

  • random gamma adjustment using the Torchvision transforms package, using ratio
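The geometric part of these augmentations can be sketched with NumPy. This is a minimal illustration, not the authors' code: the function names `rotate_normal_vectors` and `scale_depth` are hypothetical, the rotation sign convention is an assumption, and resampling of the pixel grid (which a full rotation or rescaling also requires) is left out.

```python
import numpy as np


def rotate_normal_vectors(normals, angle_deg):
    """Apply an in-plane rotation to every vector of an (H, W, 3) normal map.

    Only the vectors are rotated here; rotating the image grid itself is a
    separate resampling step that this sketch does not perform.
    """
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    # Rotation about the camera's optical (z) axis. The sign convention
    # (counter-clockwise for positive angles) is an assumption.
    rot = np.array([[c, -s, 0.0],
                    [s, c, 0.0],
                    [0.0, 0.0, 1.0]])
    # Each pixel holds a row vector n, so n @ rot.T equals (rot @ n).
    return normals @ rot.T


def scale_depth(depth, scale):
    """Divide the depth map by the scale factor, as described in the list above."""
    return depth / scale
```

Because the rotation matrix is orthonormal, unit normals stay unit length after the transform, so no renormalization is needed for this step.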

6.4 Training Parameters

We recall the loss function equation:

We detail all training steps in Table 3. For all experiments, we used polynomial learning rate decay with power 0.9. We also used weight decay with a decay rate of .
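As an illustration of the learning rate schedule, the sketch below assumes the common "poly" policy, in which the rate decays from its base value to zero over the training run. The function name `poly_lr` and its arguments are hypothetical; the text specifies only the power of 0.9.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power.

    Assumes the usual "poly" policy; only the power (0.9) comes from the text.
    """
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power
```

With this form the rate equals `base_lr` at iteration 0 and reaches zero at `max_iterations`, decreasing monotonically in between.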

Dataset Iterations batch size iter size learning rate
PBRS [33] 400,000 4 3 1 0.01 20 OFF OFF
PBRS [33] 350,000 5 3 1 0.005 0.5 ON ON
NYUv2-Depth [22] 25,000 6 3 1 0 0 ON ON
Table 3: Training details for each step of our training method. iter size stands for the number of batches used per iteration for backpropagation (performed using the average loss computed from each batch loss).

values are this low to rescale the attention loss of [26].
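The role of iter size can be illustrated with a toy example: one parameter update is computed from the average loss over several batches. The sketch below uses a hypothetical linear least-squares model in NumPy rather than the actual network, and the helper names are assumptions made for illustration only.

```python
import numpy as np


def batch_loss_and_grad(w, x, y):
    """Mean squared error of the linear model x @ w and its gradient w.r.t. w."""
    residual = x @ w - y
    loss = np.mean(residual ** 2)
    grad = 2.0 * x.T @ residual / len(y)
    return loss, grad


def accumulated_step(w, batches, lr):
    """One gradient step from the average loss over len(batches) batches.

    len(batches) plays the role of "iter size": the gradient of the averaged
    loss is the average of the per-batch gradients.
    """
    losses, grads = zip(*(batch_loss_and_grad(w, x, y) for x, y in batches))
    mean_grad = np.mean(grads, axis=0)
    return w - lr * mean_grad, float(np.mean(losses))
```

When all batches have the same size, this update matches a single step on one larger batch made of their concatenation, which is why an iter size of 3 with a batch size of 4 behaves like an effective batch of 12 for a loss that is a per-sample mean.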

6.5 Additional Qualitative Results

We show in Figure 10 some results of image augmentation: we add a virtual object to each image using the RGB image and a depth map of the scene together with a depth map of the virtual object. The object is inserted with consideration of occlusions, i.e., we only fill pixels where the virtual object is closer to the camera than the scene with RGB pixels from the rendered object.
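The occlusion-aware insertion amounts to a per-pixel depth test between the scene and the rendered object. The following is a minimal NumPy sketch under stated assumptions: the function name `insert_virtual_object` is hypothetical, and the object's depth map is assumed to be `+inf` wherever the object is absent.

```python
import numpy as np


def insert_virtual_object(scene_rgb, scene_depth, object_rgb, object_depth):
    """Composite a rendered object into a scene, respecting occlusions.

    scene_rgb, object_rgb: (H, W, 3) arrays; scene_depth, object_depth: (H, W).
    object_depth is assumed to be +inf at pixels not covered by the object.
    """
    # The object is visible only where it is closer to the camera than the scene.
    visible = object_depth < scene_depth
    out_rgb = np.where(visible[..., None], object_rgb, scene_rgb)
    out_depth = np.where(visible, object_depth, scene_depth)
    return out_rgb, out_depth
```

Since the test is applied pixel by pixel, any error in the scene depth near an occluding contour shows up directly as a misplaced object boundary, which is why sharp depth edges matter for this application.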

Figure 9: Illustration of our occlusion-aware virtual object insertion. Top row (left to right): the original RGB image, the virtual object , an object insertion ignoring occlusion. Bottom row (left to right): our estimated depth map , the augmented depth map when using , the final result.
Figure 10: More examples of occlusion-aware virtual object insertion using depth maps predicted by different methods, along with the Kinect ground truth depth map from the original NYUv2-Depth dataset. Columns (left to right): Eigen et al. [4], Laina et al. [17], Fu et al. [7], Jiao et al. [14], GT (NYUv2), SharpNet.
Figure 11: More examples of images from our NYUv2-OC dataset and their associated depth map estimates for different methods. Edges (in black) were detected using a Canny edge detector with and . Our manually annotated ground truth is shown in red. Columns (left to right): RGB, Laina et al. [17], Fu et al. [7], Jiao et al. [14], GT (NYUv2), SharpNet.