Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation
We introduce SharpNet, a method that predicts an accurate depth map for an input color image, with particular attention to the reconstruction of occluding contours. Occluding contours are an important cue for object recognition and for the realistic integration of virtual objects in Augmented Reality, but they are also notoriously difficult to reconstruct accurately. For example, they are a challenge for stereo-based reconstruction methods, as points around an occluding contour are visible in only one image. Inspired by recent methods that introduce normal estimation to improve depth prediction, we introduce a novel term that constrains the depth and occluding contour predictions. Since ground truth depth is difficult to obtain with pixel-perfect accuracy along occluding contours, we use synthetic images for training, followed by fine-tuning on real data. We demonstrate our approach on the challenging NYUv2-Depth dataset, and show that our method outperforms the state-of-the-art along occluding contours, while performing on par with the best recent methods on the rest of the images. Its accuracy along the occluding contours is actually better than the `ground truth' acquired by a depth camera based on structured light. We show this by introducing a new benchmark based on NYUv2-Depth for evaluating occluding contours in monocular reconstruction, which is our second contribution.
Monocular depth estimation is a very ill-posed yet highly desirable task for applications such as robotics, augmented or mixed reality, autonomous driving, and scene understanding in general. Recently, many methods have been proposed to solve this problem using Deep Learning approaches, either relying on supervised learning[5, 4, 17, 7] or on self-learning [9, 29, 20], and these methods already often obtain very impressive results.
(Fig. 1: comparison of depth predictions by Jiao et al. and the NYUv2-Depth ground truth depth.)
However, as shown in Fig. 1, occluding contours remain difficult to reconstruct correctly, while they are an important cue for object recognition, and for augmented reality or path planning, for example. This is due to several reasons. First, the depth annotations of training images are likely to be inaccurate along the occluding contours if they are obtained with a stereo reconstruction method or a structured light camera. This is for example the case for the NYUv2-Depth dataset, an important benchmark used by many recent works for evaluation. On one or both sides of an occluding contour lie 3D points that are visible in only one image, which challenges 3D reconstruction. Structured light cameras essentially rely on stereo reconstruction, where one image is replaced by a known pattern, and therefore suffer from the same problem. Second, occluding contours, despite their importance, represent a small part of the images, and may not influence the loss function used during training if they are not handled with special care.
In this paper, we show that it is possible to learn to reconstruct more accurately occluding contours by adding a simple term that constrains the depth predictions together with the occluding contours during learning. This approach is inspired by recent works that predict the depths and normals for an input image, and enforce constraints between them [27, 21]. A similar constraint between depths and occluding contours can be introduced, and we show that this results in better reconstructions along the occluding contours, without degrading the accuracy of the rest of the reconstruction.
More exactly, we train a network to predict depths, normals, and occluding contours for an input image, by minimizing a loss function that integrates constraints between the depths and the occluding contours, and also between the depths and the normals. We show that these two constraints can be integrated in a very similar way with simple terms in the loss function. At run-time, we can predict only the depth values, making our method much faster than many state-of-the-art methods: it runs at 150 fps and is thus suitable for real-time applications.
We show that each aspect of our training procedure improves the depth output. In particular, our experiments show that the constraint between depths and occluding contours is important, and that the improvement does not come simply from an effect of multi-task learning. Learning to predict the normals in addition to the depths and the occluding contours helps the convergence of training towards good depth predictions.
We demonstrate our approach on the NYUv2-Depth dataset, in order to compare it against previous methods. Since the depth annotations are noisy especially along the occluding contours, as already mentioned above, we use synthetic images for initializing the network before fine-tuning on NYUv2-Depth. As training data for the locations of the occluding contours, we simply use the object instance boundaries given by the synthetic dataset. However, we only use the depth ground truth as training data when finetuning on the NYUv2-Depth dataset.
A proper evaluation of the accuracy of the occluding contours is difficult. Since the “ground truth” depth data is typically noisy along occluding contours, as is the case for NYUv2-Depth, an evaluation based on this data would not be representative of the actual quality. Even with better depth data, automatically identifying occluding contours in ground truth depth data, for example as depth discontinuities, would be sensitive to the parameters used by the identification method (see Fig. 4).
We therefore decided to manually annotate the occluding contours in a randomly selected subset of 30 images from the NYUv2-Depth test data, which we call the NYUv2-OC dataset. We will make our annotations and our code for the evaluation of the occluding contours publicly available for future comparison. We evaluate our method on this data in terms of 2D localization, in addition to evaluating on the NYUv2-Depth validation set with more standard depth estimation metrics [5, 4, 17]. Our experiments show that, while achieving competitive results on those metrics on the NYUv2-Depth benchmark by placing second on all of them, we outperform all previous methods in terms of 2D localization of the occluding contours, including the current leading method on monocular depth estimation.
Monocular depth estimation from images has made significant progress recently. We discuss below mostly the most recent methods, and several techniques that help monocular depth estimation: learning from synthetic data, using normals for learning to predict depths, and refinement based on CRFs.
With the development of large datasets of images annotated with depth data [22, 8, 23, 33], many supervised methods have been proposed. Eigen et al. [5, 4] used multi-scale depth estimation to capture global and local information to help depth prediction. Given the remarkable performance they achieved on both popular benchmarks, NYUv2-Depth and KITTI, more work extended this multi-scale approach [18, 30]. Previous works also consider ordinal depth classification or pair-wise depth-map comparisons to add local and non-local constraints. Our approach relies on a simpler monoscale architecture, making it efficient at run-time. Our constraints between depths, normals, and occluding contours guide learning towards good depth prediction for the whole image.
Laina et al. exploit the power of deep residual neural networks and show that using the more appropriate BerHu [19, 34] reconstruction loss yields better performance. However, their end results are quite smooth around occluding contours, making their method inappropriate for realistic occlusion-aware augmented reality.
Jiao et al. noticed that the depth distribution of the NYUv2 dataset is heavy-tailed. The authors therefore proposed an attention-driven loss for the network supervision, and pair the depth estimation task with semantic segmentation to improve performance on the dataset. However, while they currently achieve the best performance on the NYUv2-Depth dataset, their approach suffers from a bias towards high-depth areas such as windows, corridors, or mirrors. While this translates into a significant decrease of the final error, it also produces blurry depth maps, as can be seen in Fig. 1. By contrast, our reconstructions tend to be much sharper along the occluding boundaries as desired, and our method is much faster, making it suitable for real-time applications.
Self-learning methods have also become popular for monocular reconstruction, and exploit the consistency between multiple views [9, 29, 20, 31, 32, 25]. While this approach is very exciting, it does not yet reach the accuracy of supervised methods in general, and it should be preferred only when no annotated data is available for supervised learning.
Wang et al. introduced their SURGE method to improve scene reconstruction on planar and edge regions by learning to jointly predict depth and normal maps, as well as edges and planar regions, then refining the depth prediction by solving an optimization problem using a Dense Conditional Random Field (DCRF). While their method yields appealing reconstruction results on planar regions, it still underperforms state-of-the-art methods on global metrics, and the use of a DCRF makes it unsuited for real-time applications. Furthermore, SURGE is evaluated on the reconstruction quality around edges using standard depth error metrics, but not on the 2D localization of the occluding contours.
Many self-supervised methods [31, 32, 25, 9] have incorporated edge- or occlusion-aware geometry constraints that exist when working with stereo pairs or sequences of images, as provided in the very popular KITTI depth estimation benchmark. However, although these methods can perform monocular depth estimation at test time, they require multiple calibrated views at training time. They are therefore unable to work on monocular RGB-D datasets such as NYUv2-Depth or SUN-RGBD.
Other works [28, 13] address occlusion-aware depth estimation to improve reconstruction for augmented reality applications. While achieving spectacular results, they require one or multiple light-field images, which are more costly to obtain than ubiquitous RGB images.
Conscious of the lack of evaluation metrics and benchmarks for the quality of edge and plane reconstruction from monocular depth estimates, Koch et al. introduced the iBims-v1 dataset, a high-quality benchmark of 100 RGB images with their associated depth maps. Their work addresses the low quality of the depth maps of other RGB-D datasets, while also introducing annotations and metrics for occluding contours and the planarity of planar regions. We build on their work for our evaluation of occluding contour reconstruction quality.
As shown in Fig. 2, we train a network to predict, for an input color image, a depth map, a map of occluding contour probabilities, and a map of normals.
We first train on the synthetic dataset PBRS, which provides the ground truth for the depth map, the normal map, and the binary map of object instance contours for each training image. Since occluding contours are not directly provided in the PBRS dataset, we use the object instance contours as a proxy. We argue that on a macroscopic scale, a large proportion of occluding contours in an image are due to objects occluding one another, as can be seen in Fig. 3. However, we show that we can also enable our network to learn internal occluding contours within objects, even without “pure” occluding contour supervision. Indeed, we make use of constraints between the depth map and occluding contour predictions (see Section 3.4 for more details) to push the contour estimation task to also predict intra-object occluding boundaries.
We then finetune on the NYUv2-Depth dataset without the direct supervised losses on the occluding contours or normals described below. Even though ground truth normal maps have been produced for NYUv2-Depth with different estimation methods operating on the Kinect-v1 depth maps, their outputs are generally noisy. Occluding contours are not given in the original NYUv2-Depth dataset either. Although one could automatically extract them by running edge detectors [1, 3] on the depth maps, such extraction is very sensitive to the detector’s parameters (see Figure 4). Instead, at training time we rely on the consensus terms described below, which explicitly constrain the predicted contours, normals, and depth maps together.
At test time, we can choose to run only the depth stream if we are not interested in the normals or the boundaries, making inference very fast.
We estimate the parameters of the network by minimizing the following loss function over all the training images:
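As a hedged reconstruction (the weight symbols λ and μ are illustrative, not necessarily the paper's notation), the total loss can be written as a weighted sum of the five terms described below:

```latex
\mathcal{L} \;=\; \lambda_d\,\mathcal{L}_{\text{depth}}
           \;+\; \lambda_c\,\mathcal{L}_{\text{contours}}
           \;+\; \lambda_n\,\mathcal{L}_{\text{normals}}
           \;+\; \mu_{dc}\,\mathcal{L}_{\text{depth--contours}}
           \;+\; \mu_{dn}\,\mathcal{L}_{\text{depth--normals}}
```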
This loss combines supervision terms for the depth, the occluding contours, and the normals respectively. We adjust the corresponding weights during training so that we focus first on learning local geometry (normals and boundaries), then on depth. See Section 4.1 for more details.
Two additional consensus terms introduce constraints between the predicted depth map and the predicted contours, and between the predicted depth map and the predicted normals, respectively.
We detail these losses below. All losses are computed using only valid pixel locations. The PBRS synthetic dataset provides such a mask. When finetuning on NYUv2-Depth, we mask out the white pixels on the image borders.
The supervision terms on the predicted depth and normal maps are drawn from previous work on monocular depth prediction. For our term on occluding contour prediction, we rely on previous work on edge prediction.
The sum is over all valid pixel locations. The BerHu (also known as reverse Huber) function behaves as an ℓ2 loss for large deviations, and as an ℓ1 loss for small ones. As in previous work, we set the threshold of the BerHu function adaptively from the residuals in each batch.
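A minimal NumPy sketch of the BerHu loss, with the threshold set adaptively from the batch residuals (the 0.2 ratio is an assumption borrowed from common practice, not taken from the text):

```python
import numpy as np

def berhu_loss(pred, target, ratio=0.2):
    """BerHu (reverse Huber): L1 for small residuals, scaled L2 for large
    ones. The threshold c is a fixed fraction of the largest residual."""
    residual = np.abs(pred - target)
    c = ratio * residual.max()
    if c == 0:
        return 0.0  # perfect prediction
    l2 = (residual ** 2 + c ** 2) / (2 * c)
    return np.where(residual <= c, residual, l2).mean()
```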
We use a recent attention loss, originally developed for 2D edge detection, to learn to predict the occluding contours. This attention loss helps deal with the imbalance between edge pixels and non-edge pixels. Its hyper-parameters are set to the values recommended by its authors, and the class-balancing weight is computed image by image as the proportion of contour pixels. We use this pixel-wise attention loss to define the occluding contour prediction loss.
As mentioned above, this loss is disabled when finetuning on the NYUv2-Depth dataset.
For normals prediction, we use a common approach introduced by Eigen et al., which minimizes at all valid pixels the angle between the predicted normals and their ground truth counterparts by maximizing their dot product. We therefore use a loss based on this dot product.
This loss slightly differs from the original one, as we limit it to positive values. As mentioned earlier, it is disabled when finetuning on the NYUv2-Depth dataset.
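A plausible NumPy sketch of this loss; the clipping to non-negative values is our reading of the description, and the exact formulation in the paper may differ:

```python
import numpy as np

def normals_loss(pred, gt, eps=1e-8):
    """Penalize the angle between predicted and ground-truth normals by
    maximizing their dot product: per-pixel loss 1 - <n, n*>, clipped to
    non-negative values. pred, gt: (H, W, 3) normal maps."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    gt = gt / (np.linalg.norm(gt, axis=-1, keepdims=True) + eps)
    dot = np.sum(pred * gt, axis=-1)
    return np.maximum(1.0 - dot, 0.0).mean()
```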
To force the network to predict sharp depth edges at occluding contours, where strong depth discontinuities occur, we propose a consensus loss between the predicted occluding contour probability map and the predicted depth map.
This encourages the network to associate pixels with large depth gradients with occluding contours: High-gradient areas will lead to a large loss unless the occluding contour probability is close to one. [9, 12] also used this type of edge-aware gradient-loss, although they used it to impose consensus between photometric gradients and depth gradients. However, relying on photometric gradients can be dangerous, as strong image gradients do not necessarily correspond to occluding contours, and occluding contours do not necessarily correspond to strong image gradients.
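One plausible form of this consensus term, as a hedged NumPy sketch (the (1 − c) weighting is our choice to match the description; the paper's exact function may differ):

```python
import numpy as np

def depth_contour_consensus(depth, contour_prob):
    """Penalize large depth gradients wherever the predicted occluding
    contour probability is low: the penalty vanishes where the probability
    approaches one. contour_prob: values in [0, 1]."""
    gy, gx = np.gradient(depth)
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return (grad_mag * (1.0 - contour_prob)).mean()
```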
By enforcing this constraint on the contour and depth predictions, we bias the boundary prediction away from instance boundaries and towards occluding contours.
Since this loss does not involve ground truth occluding contours, it can be used when finetuning on the NYUv2-Depth dataset, thus allowing semi-supervised learning of the occluding contours on NYUv2-Depth. Because it involves both the depth and contour prediction streams, it enforces consistency between the depth and contour decoders, but also helps the ResNet50 encoder to produce a more general and powerful representation.
Depth and normals are two highly correlated quantities. Thus, to impose geometric consistency between the normal and depth predictions, we use an additional consensus loss.
The 2D part of each predicted normal is extracted from the 3D vector and compared to the 2D gradient of the depth map estimate, computed using finite differences. This term enforces consistency between the normals and depth predictions at all pixels except those predicted as boundaries. Our formulation of the depth-normals consensus is much simpler than those proposed in [27, 31, 6], as these works express their constraint in 3D world coordinates, thus requiring the camera calibration matrix.
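A hedged NumPy sketch of this term; the sign convention relating the normal's (x, y) part to the depth gradient, the masking threshold, and the squared-difference form are assumptions:

```python
import numpy as np

def depth_normals_consensus(depth, normals, contour_prob, thresh=0.5):
    """Compare the (x, y) part of each predicted normal with the 2D depth
    gradient (finite differences), skipping pixels predicted as boundaries."""
    gy, gx = np.gradient(depth)
    depth_grad = np.stack([gx, gy], axis=-1)
    n_xy = normals[..., :2]
    mask = (contour_prob < thresh).astype(float)  # ignore boundary pixels
    diff = np.sum((n_xy - depth_grad) ** 2, axis=-1)
    return (mask * diff).sum() / max(mask.sum(), 1.0)
```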
Again, imposing this constraint during finetuning allows us to constrain normals, boundaries, and depth, even when the ground truth normals and boundaries are not available.
We evaluate our method and compare it to previous work using standard metrics, but also the depth boundary error (DBE) metric introduced by Koch et al. (see Section 4.2 and Eq. (8) for more details). We show that our method achieves the best trade-off between global reconstruction error and DBE.
We implemented our work in PyTorch, and will make our pretrained weights, as well as our training and evaluation code, publicly available (https://github.com/MichaelRamamonjisoa/SharpNet). Both training and evaluation are done on a single high-end NVIDIA GTX 1080 Ti GPU.
We first train our network on the synthetic PBRS  dataset, using depth and normals maps, along with object instance boundaries which we use as a proxy to occluding contours. We split the PBRS dataset in training/validation/test sets using a 80%/10%/10% ratio. We then finetune our network on the NYUv2-Depth training set using only depth data. Finally, we use the NYUv2-Depth validation set for depth evaluation and our new NYUv2-OC for occluding contours accuracy evaluation.
Training a multi-task network requires some caution: since several loss terms are involved, and in particular one for each task, one should watch for suboptimal solutions for one task caused by ‘over-learning’ another. To monitor each task individually, we track each individual loss along with the global training loss and check that all of them decrease. When setting all loss coefficients equal to one, we noticed that the normals loss decreased faster than the others. Similarly, we found that learning boundaries was much faster than learning depth. We argue that this is because local features such as contours or local planes, i.e., regions where normals are constant, are easier to learn since they appear in almost all training examples. Learning depth, however, requires the network to exploit context such as the room layout in order to regress a globally consistent depth map.
We evaluate our method on the benchmark dataset NYUv2-Depth. The most common metrics are: thresholded accuracies (δ < 1.25, δ < 1.25², δ < 1.25³), linear and logarithmic Root Mean Squared Errors (RMSE (lin) and RMSE (log)), Absolute Relative difference (rel), and average log10 error.
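For reference, these metrics can be sketched in NumPy as follows (one common formulation; crops, clipping, and per-image vs. global averaging conventions vary between papers, as discussed below):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth estimation metrics over valid pixels."""
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta1": (ratio < 1.25).mean(),
        "delta2": (ratio < 1.25 ** 2).mean(),
        "delta3": (ratio < 1.25 ** 3).mean(),
        "rel": (np.abs(pred - gt) / gt).mean(),
        "log10": np.abs(np.log10(pred) - np.log10(gt)).mean(),
        "rmse_lin": np.sqrt(((pred - gt) ** 2).mean()),
        "rmse_log": np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean()),
    }
```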
Table 1: Left: standard metrics on the full NYUv2-Depth test set (δ accuracies: higher is better; errors: lower is better). Right: occluding contour error ε_DBE on our NYUv2-OC (lower is better), for four Canny detector settings.

| Method | δ<1.25 | δ<1.25² | δ<1.25³ | rel | log10 | RMSE (lin) | RMSE (log) | ε_DBE (four Canny settings) |
| Eigen et al. (VGG) | 0.766 | 0.949 | 0.988 | 0.195 | 0.068 | 0.660 | 0.217 | 2.830 | 2.917 | 3.039 | 3.068 |
| Eigen et al. (AlexNet) | 0.690 | 0.911 | 0.977 | 0.250 | 0.082 | 0.755 | 0.259 | 2.683 | 2.862 | 3.048 | 3.108 |
| Laina et al. | 0.818 | 0.955 | 0.988 | 0.170 | 0.059 | 0.602 | 0.200 | 3.901 | 3.791 | 3.910 | 3.939 |
| Fu et al. | 0.850 | 0.957 | 0.985 | 0.150 | 0.052 | 0.578 | 0.194 | 3.605 | 3.657 | 3.940 | 3.942 |
| Jiao et al. | 0.909 | 0.981 | 0.995 | 0.133 | 0.042 | 0.401 | 0.146 | 5.041 | 3.756 | 3.961 | 3.986 |
We summarize our comparative study between our method and previous ones in Table 1.
Since authors evaluating on the NYUv2-Depth benchmark often use different evaluation protocols, fair comparison is difficult to perform. For instance, some methods evaluate on crops of the image where the projection maps are available, i.e., they remove a border of the image for evaluation: those regions are provided by Eigen et al. in their evaluation toolkit. Some authors also clip the resulting depth maps to the valid depth sensor range [0.7m; 10m]. Finally, not all authors make their predictions and/or evaluation code publicly available. The authors of the current leading method kindly shared their predictions on the NYUv2-Depth dataset with us, and the results for their method were obtained on the depth map predictions they provided us. All other methods have released their predictions online.
Fair comparison is ensured by evaluating each method solely from its depth map predictions, with one single evaluation code. An important note is that RMSE values are computed over all pixels in all images, i.e., the sum under the square root is computed over all pixels of the validation set. This differs from some papers, which compute one RMSE value per image and then average over all images. As mentioned above, we will make our evaluation code and results publicly available.
To evaluate the accuracy of occluding contour locations, we follow the work of Koch et al., who proposed an experimental method for such an evaluation. Since it is important to examine whether predicted depth maps represent all occluding contours as accurate depth discontinuities, they analyze occluding contour accuracy by detecting and comparing edges in the predicted and ground truth depth maps.
Since the Kinect-v1 depth maps of the NYUv2-Depth dataset are quite noisy, especially around occluding boundaries, we chose to manually annotate the occluding contours in a subset of the dataset, building our NYUv2-OC dataset, and to use it for evaluation. Several samples of our dataset are shown in Fig. 4, Fig. 3, and Fig. 11. To evaluate the quality of the predicted depth maps in terms of occluding contour reconstruction, binary edges are first extracted from the predicted depth map with a Canny detector. They are then compared to the “ground truth” binary edges extracted from the ground truth depth map with the same detection algorithm (edges are extracted from depth maps with normalized dynamic range). This comparison is done via the truncated chamfer distance between the two binary edge images. More precisely, a Euclidean distance transform is applied to the “ground truth” binary edge image, and distances are truncated to a maximum of 10 pixels. Predicted edge pixels exceeding a distance of 10 pixels are ignored, in order to evaluate predicted edges only in the local neighborhood of the ground truth edges (we refer the reader to the original paper for further details). Finally, the depth boundary accuracy error is defined as the mean of these truncated distances:
where the sum is performed over all valid pixels. We compare our method on this metric with state-of-the-art depth estimation methods using different Canny parameters. Evaluation results are shown in Table 1: We outperform all state-of-the-art methods on occluding contours accuracy, while being a competitive second best on standard depth estimation evaluation metrics.
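The metric just defined can be sketched as follows (a brute-force NumPy sketch; an efficient implementation would use a distance transform such as `scipy.ndimage.distance_transform_edt`):

```python
import numpy as np

def dbe_accuracy(pred_edges, gt_edges, trunc=10.0):
    """Mean truncated chamfer distance from predicted edge pixels to the
    nearest ground-truth edge pixel; predicted pixels farther than `trunc`
    are ignored. pred_edges, gt_edges: boolean (H, W) edge maps, e.g.
    Canny edges extracted from the predicted and ground-truth depth maps."""
    gt_pts = np.argwhere(gt_edges)
    pred_pts = np.argwhere(pred_edges)
    if len(gt_pts) == 0 or len(pred_pts) == 0:
        return 0.0
    # Distance from every predicted edge pixel to its closest GT edge pixel.
    dists = np.linalg.norm(
        pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1).min(axis=1)
    dists = dists[dists <= trunc]  # truncation: ignore far-away pixels
    return dists.mean() if len(dists) else 0.0
```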
Since the detected edges are highly sensitive to the edge detector's parameters (see Fig. 4), we evaluate the depth boundary accuracy error when varying the low and high threshold parameters of the Canny edge detector. The results are shown in Fig. 5.
To assess the impact of our geometry consensus terms, we performed an ablation study analyzing the contributions of training with synthetic and real data, as well as of our novel geometry consensus terms. The evaluation of the different models on our NYUv2-OC dataset is shown in Table 2, confirming their contribution to both improved depth reconstruction over the whole NYUv2-Depth dataset and improved occluding contour accuracy.
| Model | Training data | rel | ε_DBE (four Canny settings) |
| w/o consensus | PBRS + NYUv2 | 0.165 | 2.279 | 2.876 | 3.301 | 3.374 |
| w/ consensus | PBRS + NYUv2 | 0.159 | 2.088 | 2.378 | 2.764 | 2.838 |
In this paper, we showed that our SharpNet method achieves competitive depth reconstruction from a single RGB image, with particular attention to occluding contours, thanks to geometry consensus terms introduced during multi-task training. Our high-quality depth estimation and highly accurate occluding contour reconstruction allow for realistic integration of virtual objects in augmented reality in real time, as we achieve an inference speed of 150 fps. We showed the superiority of our SharpNet over the state of the art by introducing a first version of our new NYUv2-OC occluding contour dataset, which we plan to extend in future work.
International Conference on Computer Vision, pages 2650–2658, 2015.
Conference on Computer Vision and Pattern Recognition, 2018.
A Robust Hybrid of Lasso and Ridge Regression. Contemp. Math., 443, 01 2007.
Occlusion-Aware Unsupervised Learning of Monocular Depth, Optical Flow and Camera Pose with Geometric Constraints. Future Internet, 10:92, 09 2018.
Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks. In European Conference on Computer Vision, pages 842–857, 2016.
To train our network to predict geometrically consistent normals, depth, and occluding contours, we used the synthetic PBRS dataset from Zhang et al. Indeed, perfect consistency between the outputs is important to ensure high-quality depth maps at occluding contours, but also to ensure good generalization from synthetic to real data. We show in Figure 7 a sample from the PBRS dataset, and in Figure 8 our geometric constraints, which push the network to predict outputs consistent with each other, on both synthetic and real images.
Here we present some details about our training method. Since we are performing multi-task learning, each task-attached loss is weighted in the global loss term. We found that learning normals and boundaries first brought the best results. We also present the data augmentation strategy we used while training.
random scaling by a factor s: the depth map is divided by s, and the corresponding coordinate of the normal map is multiplied accordingly,
random rotation by an angle θ: all corresponding maps are rotated in the camera 2D plane, and the normal maps are recomputed in the camera coordinates (the rotation matrix is applied to each pixel of the rotated normal map),
random crops of fixed size,
random Gamma adjustment using Torchvision
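As an illustration of the scale augmentation above, a minimal NumPy sketch (nearest-neighbor zoom toward the top-left corner for brevity; the function name and resizing details are ours, not the paper's):

```python
import numpy as np

def scale_augment(image, depth, scale):
    """Zoom image and depth by `scale`; the scene appears `scale` times
    closer, so the depth map is divided by `scale` to stay consistent."""
    h, w = image.shape[:2]
    ys = np.clip((np.arange(h) / scale).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / scale).astype(int), 0, w - 1)
    return image[np.ix_(ys, xs)], depth[np.ix_(ys, xs)] / scale
```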
We recall that the total loss combines the supervised depth, contour, and normals terms with the two consensus terms introduced above.
We detail all training steps in Table 3. For all experiments, we used polynomial learning rate decay with power 0.9, as well as weight decay.
|Dataset||Iterations||batch size||iter size||learning rate|
Here, “iter size” stands for the number of batches used per iteration for backpropagation (performed using the average of the per-batch losses). The learning rate values are this low in order to rescale the attention loss.
We show in Figure 10 some results of image augmentation: we add a virtual object to the images, using the RGB image together with a depth map of the scene and a depth map of the virtual object. The object is inserted with consideration of occlusions, i.e., we only fill the pixels where the virtual object is closer than the scene with RGB pixels from the rendered object.
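The occlusion-aware insertion can be sketched as follows (a minimal NumPy sketch; names and conventions are ours):

```python
import numpy as np

def composite_virtual_object(scene_rgb, scene_depth, obj_rgb, obj_depth):
    """Draw the virtual object only where it is closer than the scene.
    obj_depth is np.inf wherever the object is absent."""
    visible = obj_depth < scene_depth  # object occludes the scene here
    out = scene_rgb.copy()
    out[visible] = obj_rgb[visible]
    return out
```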