Dense Deep Depth Estimation Network (D3-Net) in PyTorch.
Depth estimation is of critical interest for scene understanding and accurate 3D reconstruction. Most recent approaches in depth estimation with deep learning exploit geometrical structures of standard sharp images to predict corresponding depth maps. However, cameras can also produce images with defocus blur depending on the depth of the objects and camera settings. Hence, these features may represent an important hint for learning to predict depth. In this paper, we propose a full system for single-image depth prediction in the wild using depth-from-defocus and neural networks. We carry out thorough experiments to test deep convolutional networks on real and simulated defocused images using a realistic model of blur variation with respect to depth. We also investigate the influence of blur on depth prediction observing model uncertainty with a Bayesian neural network approach. From these studies, we show that out-of-focus blur greatly improves the depth-prediction network performances. Furthermore, we transfer the ability learned on a synthetic, indoor dataset to real, indoor and outdoor images. For this purpose, we present a new dataset containing real all-focus and defocused images from a Digital Single-Lens Reflex (DSLR) camera, paired with ground truth depth maps obtained with an active 3D sensor for indoor scenes. The proposed approach is successfully validated on both this new dataset and standard ones as NYUv2 or Depth-in-the-Wild. Code and new datasets are available at https://github.com/marcelampc/d3net_depth_estimation.READ FULL TEXT VIEW PDF
Dense Deep Depth Estimation Network (D3-Net) in PyTorch.
3D reconstruction has a large field of applications such as in human computer interaction, augmented reality and robotics, which have driven research on the topic. This reconstruction usually relies on on accurate depth estimates to process the 3D shape of an object or a scene. Traditional depth estimation approaches exploit different physical aspects to extract 3D information from perception, such as stereoscopic vision, structure from motion, structured light and other depth cues in 2D images [1, 2]. However, some of these techniques impose restrictions that depend on the environment (e.g. sun, texture) or even require several devices (e.g. camera, projector), leading to cumbersome systems. Many efforts have been made to build compact systems: the most notable are perhaps the light-field cameras which use a microlens array in front of the sensor, from which a depth map can be extracted .
In recent years, several approaches for depth estimation based on deep learning, referred to as deep depth estimation, have been proposed, starting from 
. These methods use a single point of view (a single image) and thus lead to compact, standard systems. Most of them exploit depth cues in the image based on geometrical aspects of the scene to estimate the 3D structure with the use of convolutional neural networks (CNNs)[5, 6, 7, 8]. A few ones can also make use of additional depth cues such as stereo information to train the network  and improve predictions.
Another important cue for depth estimation has for long been the defocus blur. Indeed, Depth from Defocus (DFD) has been widely investigated in the past [10, 11, 12, 13, 14, 15]. It led to various analytical methods and corresponding optical systems for depth prediction. However, DFD with a conventional camera and a single image suffers from ambiguity in depth estimation with respect to the focal plane and dead zone, due to the camera depth of field, where no blur can be measured. Moreover, DFD requires a scene model and an explicit calibration between blur level and depth value to estimate 3D information from an unknown scene. It is tempting to integrate defocus blur with the power of neural networks, which leads to the question: does defocus blur improve deep depth estimation performances?
In this paper, we use a dense neural network, D3-Net , in order to study the influence of defocus blur on depth estimation. Depth estimation performance is first tested on a synthetically defocused dataset created from NYUv2, with optically realistic blur variation, which allows to compare several optical settings and study their influence. We further examine the uncertainty of the CNN predictions to better understand the main difficulties of the trained models while learning the proposed task with and without blur. We then propose to explore real defocused data with a new dataset which comprises of indoor all-in-focus and defocused images, and corresponding depth maps. Finally, we verify how the deep model behaves when confronted to challenging images in the wild with the Depth-in-the-Wild  dataset and further outdoor images, with and without learning from defocus blur.
These experiments show that defocused information is exploited by neural networks and is indeed an important hint to improve deep depth estimation. Moreover, the joint use of structural and blur information proposed in this paper overcomes current limitations of single-image DFD such as ambiguity and dead zone, with respect to the focal plane. Finally, we show that these findings can be used in a dedicated device with real defocus blur to actually predict depth indoors and outdoors with good generalization.
Deep monocular depth estimation.
Several works have been developed to perform monocular depth estimation based on techniques of machine learning. One of the first solutions was presented by Saxenaet al. , which formulate the depth estimation for the Make3D dataset as a Markov Random Field (MRF) problem with horizontally aligned images using a multi-scale architecture. More recent solutions are based on deep convolutional networks to exploit spatial correlation by enforcing a local connectivity through convolutional operations. Eigen et al. [5, 4] proposed a multi-scale architecture capable of extracting global and local information from the scene to estimate the corresponding depth map. In , Cao et al. used a Conditional Random Field (CRF) to post-process the output of a deep residual network (ResNet)  in order to improve the reliability of the predictions. Xu et al. 
adopted a deeply supervised approach connecting intermediate outputs of a ResNet to a continuous CRF fusion module to combine depth prediction at different scales achieving high performance for the task. Also adopting residual connections, Lainaet al.  proposed an encoder-decoder architecture with fast up-projection blocks. More recently, Jung et al.  introduced generative adversarial networks  (GANs) to the deep depth estimation field, adapting an adversarial loss to refine the depth map predictions. With a different strategy, [9, 25, 26] propose to investigate the epipolar geometry using deep networks. DeMoN  jointly estimates a depth map and camera motion given a sequential pair of images exploring optical flow. The works in [25, 26]
use unsupervised learning to reconstruct stereo information and predict depth. More recently, Kendall and Gal and Carvalho et al.  explore the reuse of feature maps during learning, building upon an encoder decoder with dense and skip connections . In , they propose a regression function that captures the uncertainty of the data and in 
the network adopts an end-to-end adversarial generative loss function to improve prediction.
The aforementioned techniques for monocular depth estimation with neural networks base their learning capabilities on structured information (e.g., textures, linear perspective, statistics of objects and their positions). However, depth perception can use another well-know cue: defocus blur. We present in the following section state-of-the-art approaches from this domain.
Depth estimation using DFD. In computational photography, several works investigated the use of defocus blur to infer depth, starting from . Indeed, the amount of defocus blur of an object can be related to its depth using geometrical optics , where stands for the focal length, the distance of the object with respect to the lens, the distance between the sensor and the lens and the lens diameter. , where is the f-number.
Recent works usually use DFD with a single image (SIDFD). Although the acquisition is simple, it also leads to more complex processing as both the scene and the blur are unknown. State of the art approaches use analytical models for the scene such as sharp edges models  or statistical scene Gaussian priors [29, 12]. Coded apertures have also been proposed to improve depth estimation accuracy with respect to standard optics [11, 30, 31, 14].
Nevertheless, SIDFD suffers from two main limitations: first, there is an ambiguity related to the object’s position in front or behind the in-focus plane; second, blur variation cannot be measured in the camera depth of field, leading to a dead zone. Ambiguity can be solved using asymmetrical coded aperture , or even by setting the focus at infinity, at a cost of reducing the light intensity that reaches the sensor or large depth of field (i.e., dead zone), respectively. Second, dead zones can be overcome using several images with various in-focus planes. In a single snapshot context, this can be obtained with unconventional optics such as a plenoptic camera  or a lens with chromatic aberration [33, 12], but both at the cost of image quality (low resolution or chromatic aberration).
Indeed, inferring depth from the amount of defocus blur with model-based techniques requires a tedious explicit calibration step, usually conducted using point sources or a known high frequency pattern [34, 11] at each potential depth.
These constraints lead us to investigate data-based methods using deep learning techniques to explore structured information together with blur cues to execute the proposed task.
Learning depth from defocus blur.
The existence of common datasets for depth estimation [35, 1, 32], containing pairs of RGB images and corresponding depth maps, facilitates the creation of synthetic defocused images using real camera parameters. Hence, a deep learning approach can be used.
To the best of our knowledge, only a few papers in the literature use defocus blur as a cue in learning depth from a single image. Srinivasan et al..  uses defocus blur to train a network dedicated to monocular depth estimation: the model measures the consistency of simulated defocused images, generated from the estimated depth map and all-in-focus image, with true defocused images. However, the final network is used to conduct depth estimation from all-in-focus images. Hazirbas et al.  propose to conduct depth estimation using a focal stack, which is more related to depth from focus approaches than DFD. Finally,  presents a network for depth estimation and deblurring using a single defocused image.
This work shows that networks can integrate blur interpretation. However,  creates a synthetically defocused dataset from real NYUv2 images without consideration of a realistic blur variation with respect to the depth, nor sensor settings (e.g., camera aperture, focal distance).
However, there has not been much investigation about how defocus blur influence on depth estimation, nor how can these experiments improve depth prediction in the wild.
In contrast with previous works, to the best of our knowledge, we present the first system for deep depth from defocus (Deep-DFD): i.e. single-image depth prediction in the wild using deep learning and depth-from-defocus. In section 3, we study the influence of defocus blur on deep depth estimation performances. (i) We run tests on a synthetically defocused dataset generated from a set of true depth maps and all-in-focus images. The amount of defocus blur with respect to depth varies according to a physical optical model to better relate to realistic examples. (ii) We also compare performances of deep depth estimation with several optical settings: here we compare the case of all-in-focus images with the case of defocused images from three different focus settings. (iii) We analyse the influence of defocus blur on neural networks using uncertainty maps and diagrams of errors per depth. In section 4, (iv) we carry out validation and analysis of the estimation results on a new dataset with pairs of real images and depth maps obtained with a Digital Single Lens Reflex (DSLR) camera and an RGB-D (Red Green Blue Depth) sensor. At last, in section 5, (v) we show the network is able to generalized to images in the wild.
In this section, we perform a series of experiments with synthetic and real defocused data exploring the power of deep learning to depth prediction. As we are interested in using blur as a cue, we do not apply any image processing for data augmentation capable of modifying out-of-focus information. Hence, for all experiments, we extract random crops of 224x224 from the original images and apply horizontal flip with a probability of 50%. Tests are realized using the full-resolution image.
. We use the PyTorch framework on a NVIDIA TITAN X GPU with 12GB of memory. We initialize the D3-Net encoder, corresponding to DenseNet-121, with pretrained weights on Imagenet dataset and D3-Net decoder with random weights from a normal distribution with zero-mean and 0.2 variance. We add dropout regularization with a probability of 0.5 to the first four convolutional layers of the decoder as we noticed it improves generalization. We also adopt dropout layers to posteriorly study model’s uncertainty.
The NYU-Depth V2 (NYUv2) dataset  has approximately 230k pairs of images from 249 scenes for training and 215 scenes for testing. In , D3-Net reaches its best performances when trained with the complete dataset. However, NYUv2 also contains a smaller split with 1449 pairs of aligned RGB and depth images, of which 795 pairs are used for training and 654 pairs for testing. Therefore, experiments in this section were performed using this smaller dataset to fasten experiments. Original frames from Microsoft Kinect output have the resolution of 640x480. Pairs of images from the RGB and Depth sensors are posteriorly aligned, cropped and processed to fill-in invalid depth values. Final resolution is 561x427.
To generate physically realistic out-of-focus images, we choose the parameters corresponding to a synthetic camera with a focal length of 15mm, f-number 2.8 and pixel size of 5.6m. Three settings of in-focus plane are tested, respectively at 2m, 4m and 8m from the camera. Figure 4 shows the variation of the blur diameter with respect to depth, for both settings and Figure 5 shows examples of synthetic defocused images. As illustrated in Figure 4, setting the in-focus plane at 2m corresponds to a camera with small depth of field. The objects in the depth range of 1 to 10m will present small defocus blur amounts, apart from the objects in the camera depth of field, which remain sharp. Note that this configuration suffers from depth ambiguity caused by the blur estimation. Setting the in-focus plane at a larger depth, here 4m or 8m, corresponds to a camera with larger depth of field. Only the closest objects will show defocus blur, with the blur amount in the approximate depth range 0-3m that will be larger than with the 2m setting. This can be observed in the extracted details of images in Figure 5.
To create the out-of-focus dataset, we adopt the layered approach of  where each defocused image is the sum of blurred images multiplied by masks taking into account local object depth and occlusion of foreground objects according to:
where is the defocus blur at depth , is the all-in-focus image, is the mask corresponding to object at depth and the layer extension behind occluders, obtained by inpainting. Finally models the cumulative occlusions defined as:
Following , we chose to model the blur as a disk function of which the diameter varies with the depth.
As will be discussed later in this paper, the proposed approach can be disputable as the true depth map is used to generate the out-of focus image. However, this strategy allows us easily perform various experiments to analyze the influence of blur corresponding to different in-focus settings in the image.
Table 1 shows performance results of D3-Net first using all-focuses and then defocused images with proposed settings. Note that as illustrated in Figure 4 when the in-focus plane is at 8m, there is no observable ambiguity. Hence performance comparison with SIDFD methods can then be made. In such manner, we include error metrics of two methods from the SIDFD literature [15, 41] which estimate the amount of local blur using either sharp edge model or gaussian prior on the scene gradients.
Several conclusions can be drawn from Table 1. First, as already stated by Anwar et al., there is a significant improvement on the performance of depth estimation when using out-of-focus images instead of all-in-focus images. Second, D3-Net outperforms the standard model-based SIDFD methods, which can also be observed in figure 7, without requiring an analytical scene model nor explicit blur calibration. Indeed, the neural network makes use of both parameters without being specifically designed to. Furthermore, there is also a sensitivity of the depth estimation performance with respect to the position of the in-focus plane. The best setting for these tests is the in-focus plane at 2m, which corresponds to a significant amount of blur for most of the objects but near the focal plane. This shows that the network actually uses blur cue and is able to overcome depth ambiguity using geometrical structural information. Figure 7 also illustrates this conclusion: the presented scene has mainly three depth levels with a foreground below 2m, a background after 2m, and intermediate level around 2m. The corresponding out-of-focus image is generated using an in-focus plane at 2m. Using , the background and the foreground are at the same depth, while D3-Net shows no such error in the depth map.
Finally, we also trained and tested D3-Net with the dataset proposed in Anwar et al. . However, differently from the method explored in this paper, the out-of-focus images were generate without any regard to camera settings. The last two lines from Table 1 shows that D3-Net also outperforms the network in .
|Original RGB images|
|RGB images with additional blur|
|D3-Net 2m focus||0.068||0.028||0.274||0.110||96.1%||99.0%||99.6%|
|D3-Net 4m focus||0.085||0.036||0.398||0.125||92.5%||99.0%||99.8%|
|D3-Net 8m focus||0.060||-||0.324||-||95.2%||99.1%||99.9%|
|Zhuo et al.  8m focus||0.273||-||0.981||-||51.7%||83.1%||95.1%|
|Trouvé et al.  8m focus||0.429||0.289||1.743||0.956||39.2%||52.7%||61.5%|
|RGB images with additional blur proposed by |
|Anwar et al. ||0.094||0.039||0.347||-||-||-||-|
In addition, Figure 6 and columns 3 and 6 from Figure 9 show examples of predicted depth maps. The depth maps obtained with out-of-focus images are sharper than using all-in-focus images. Indeed, defocus blur provides local depth information to the network leading to a better depth map segmentation.
Per depth error analysis. There is an intrinsic relation between the number of examples a network can learn from and its performance when observing similar samples to them. Here, we study the prediction error per depth range when using all-in-focus images or defocused images and observe relation to depth data distribution. Figure 8 shows in the same plot repartition the RMS per depth in meters and the depth distribution for testing and training images for the NYUv2 dataset.
For all-in-focus images, the errors seem to be highly correlated to the number of examples in the dataset. Indeed, a minimum error is obtained for 2m, corresponding to the depth with the highest number of examples. On the other hand, using defocus blur, errors repartition is more similar to a quadratic increase of error with depth, which is the usual error repartition of passive depth estimation.
Furthermore, the 2m focus setting does not show an error increase at 2m (its focal plane position), though it corresponds to the dead zone of SIDFD. This surprising result shows that the proposed approach overcomes this issue probably because the neural network relies on context and geometric features. In general, 2m, 4m and 8m focus have similar performance for depth range between 0 to 3m. After this depth, the 2m focus presents lowest errors. When focus is at 4m, we observe a drop in all metrics performances compared to 2m and 8m. The reason for this can be observed when comparing both Figures 4 and 8. This configuration presents worst RMS performances between 3 and 7m, when blur information is too small to be used by the network and there is not enough data to overcome the missing cue, but enough to worsen results. The same happens to the model at 8m, where results are more prone to errors after approximately 7m.
To go further in the analysis of understanding the influence of blur in depth prediction, we present a study on model uncertainties following [42, 27, 43]. More precisely, we evaluate the epistemic uncertainty of the deep network model, or how ignorant is the model with respect to the dataset probabilistic distribution.
To perform this experiment, we place a prior distribution over the network weights to replace the deterministic weight parameters at test time . We adopt the Monte Carlo dropout method  to measure variational inference placing dropout layers during train and also during test phases. Following , we produce 50 samples for each image, calculate the mean prediction and the variance of these predictions to generate the model uncertainty.
Figure 9 presents examples of the network prediction, mean error and epistemic uncertainty for the NYUv2 dataset with sharp images and with focus at 2m. Mean error is produced using the ground truth image, while the variance only depends on the model’s prior distribution. For both configurations, highest variances are observed in non-textured areas and edges, as predictable. However, the model with blur has less diffuse uncertainty: it is concentrated on the object edges, and these objects are better segmented. In the second row of the figure, we observe that the all-in-focus model has difficulties to find an object near the window, while this is overcome with blur cues present on the defocused model. In the first row, we observe high levels of uncertainty at the zones near the bookcase, defocused model reduce some of this variance with defocus information. Finally, the last row presents a hard example where both models have high prediction variances mainly in the top middle part, where there is a hole. However the all-in-focus model also presents high mean error and variance in the bottom zone unlike the model with blur.
In section 3, several experiments were performed using a synthetic version of NYUv2. However, when adopting convolutional neural networks, it can be a little tricky to use the desired output (depth) to create blur information on the input of the network. So, in this section, we propose to validate our method on real defocused data from a DSLR camera paired with the respective depth map from a calibrated RGB-D sensor.
Dataset creation. To create a DFD dataset, we paired a DSLR Nikon D200 with an Asus Xtion sensor to produce out-of-focus data and corresponding depth maps, respectively. Our platform can be observed in Figure 10. We carefully calibrate the depth sensor to the DSLR coordinates to produce RGB images paired with the corresponding depth map. The proposed dataset contains 110 images from indoor scenes, with 81 images for training and 29 images for testing. Each scene is acquired with two camera apertures: =2.8 and =8, providing respectively out-of-focus and all-in-focus images.
As the DFD dataset contains a small amount of images, we pretrain the network using simulated images from NYUv2 dataset and then conduct a finetuning of the network using the real dataset. The DSLR camera originally captures images of high resolution 3872x2592; but to reduce the calculation burden, we downsample the DSLR images to 645x432. In order to simulate defocused images from NYUv2 as similar as possible as those provided by the DSLR, the image from the Kinect are upsampled and cropped to have the same resolution and the same field of view as the 645x432 DSLR images. Then defocus blur is applied to the images using the same method as in section 3 but with a blur variation with that fits the real blur variation of the DSLR, obtained experimentally.
Performance results. Using the real images dataset, we perform three experiments: first we train D3-Net with the in-focus dataset and defocused dataset respectively, using same patch approach from last experiments. We also test D3-Net with the in-focus dataset using an strategy that explores the global information of the scene using a series of preprocessing methods: we resize input images to 320x256 and performance data augmentation suggested in  to improve generalization.
In Table 2, the performance metrics from the proposed models can be compared. The results show that defocus blur does improve the network performance increasing 10 to 20 percentual points in accuracy and also gives qualitative results with better segmentation as illustrated in Figure 11.
The network is capable to find a relation between depth and defocus blur and predict better results, even thought the network may miss from global information when being trained with small patches. When feeding the network with resized images, filters from the last layers of the encoder, as from the first layers of the decoder, can understand the global information as they are fed with feature maps from the entire scene in a low resolution. However, this relation is not enough to give better predictions. As we can observe in the first examples of the third row in Figure 11, the DFD D3-Net used defocus to find the contours of the object, meanwhile the D3-Net with resize wrongly predicts the form of a chair, as it is an object constantly present in front of a desk. Our experiments show that the Deep-DFD model is more robust to generalization and less prone to overfitting than traditional methods trained and finetuned on all-in-focus images.
In the era of autonomous driving vehicles (on land, on water, or in the air), there has been an increasing demand of less intrusive, more robust sensors and processing techniques to embed in systems able to evolve in the wild. Previously, we validated our approach with several experiments on indoor scenes and we proved that blur can be learned by a neural network to improve prediction and also to improve the model’s confidence to its estimations. In this section, we now propose to tackle the general case of uncontrolled scenes. We first assess the ability of the standard D3-Net, trained without defocus blur, to generalize to ”in-the-wild” images using the Depth-in-the-Wild dataset  (DiW). Second, we use the whole system, D3-Net trained on indoor defocused images and the DSLR camera described from section 4, in uncontrolled, outdoor environments.
Depth-in-the-Wild dataset (DiW). The ground truth of the DiW dataset is not dense; indeed, only two points of each RGB image are relatively annotated as being closer or farther from the camera, or at the same distance. To adapt the network, we replace the objective function of D3-Net by the one proposed by the authors of the dataset . Then, for training, we take the weights of D3-Net trained on all-in-focus NYUv2 , and finetune the model on DiW using the modified network. We show the results of this model on the test set of DiW in figure 12. The predicted depths present sharp edges for people and objects and give plausible estimates of the 3D structure of the given scenes. However, as the network was mostly trained on indoor scenes, it cannot give accurate depth predictions on sky regions. This shows that the a neural network has inherent capacity to predict depth in the wild. We will now see that we can improve this capacity by integrating physical cues of the sensor.
Deep-DFD in the wild. We now propose to observe how deep models trained with blurred indoor images behave when confronted to challenging outdoor scenes. These experiments explore the model’s capability to adapt predictions to new scenarios, never seen during training. To perform our tests, we first acquire new data using the DSLR camera with defocus optics (from section 4) and keeping the same camera settings. As the depth sensor from the proposed platform works poorly outdoor, this new set of images does not contain respective depth ground truth. Thus, the model is neither trained on the new data, nor finetuned. Indeed, we use directly the models finetuned on indoor data with defocus blur (section 4).
Results from the CNN models and from Zhuo’s  analytical method are shown in Figure 13. With D3-Net trained on all-in-focus images, the model constantly fails to extract information from new objects, as can be observed in the images with the road and also with the tree trunk. As expected, this model tries to base prediction on objects similar to what those seen during training or during finetuning, which are mostly non-existent in these new scenes. On the contrary, though the model trained with defocus blur information has equally never seen these new scenarios, the predictions give results relatively close to the expected depth maps. Indeed, the Deep-DFD model notably extracts and uses blur information to help prediction, as geometric features are unknown for the trained network. Finally, Zhuo’s method also gives encouraging results, but constantly fails duo to defocus blur ambiguity to the focal plane (as on the handrail on the top left example of fig. 13). As can be deduced from our experiments, the combined use of geometric, statistical and defocus blur is a promising method to generalize learning capabilities.
In this paper, we have studied the influence of defocus blur as a cue in a monocular depth estimation using deep learning approach. We have shown that using blurred images outperforms the use of all-in-focus images, without requiring any scene model nor blur calibration. Besides, the combined use of defocus blur and geometrical structure information on the image, brought by the use of a deep network, avoids the classical limitations of Depth from Defocus with a conventional camera, such as depth ambiguity or dead zones. We have proposed different tools to visualize the benefit of defocus blur on the network performance, such as per depth error statistics and uncertainty maps. These tools have shown that depth estimation with defocus blur is most significantly improved at short depths, resulting in better depth map segmentations. We have also compared performance of deep depth estimation with defocus blur from several optical settings to better understand the influence of the camera parameters to deep depth prediction. In our tests, the best performances are obtained for a close in-focus plane, which leads to really small camera depths of field and thus defocus blur on most of the objects in the dataset.
Besides synthetic data, this paper also provides excellent results on both indoor and outdoor real defocused images from a new set of DSLR images. These experiments on real defocused data proved that defocus blur combined to neural networks are more robust to training data and domain generalization, reducing possible constraints of actual acquisition models with active sensors and stereo systems. Notably, results on the challenging domain of outdoor scenes without further calibration, or finetuning prove that this new system can be widely used in the wild to combine physical information (defocus blur) and cues already used by standard neural networks, such as geometry and perspective. These observations open the way to further studies on the optimization of the camera parameters and acquisition modalities for 3D estimation using defocus blur and deep learning.
International Journal of Computer Vision104(1) (2013) 38–68
Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs
Depth prediction from a single image with conditional adversarial networks.In: ICIP. (2017)