Single Image Depth Estimation Trained via Depth from Defocus Cues
Estimating depth from a single RGB image is a fundamental task in computer vision, which is most directly solved using supervised deep learning. In the field of unsupervised learning of depth from a single RGB image, depth is not given explicitly. Existing work in the field receives either a stereo pair, a monocular video, or multiple views, and, using losses that are based on structure-from-motion, trains a depth estimation network. In this work, we rely on depth from focus cues instead of different views. Learning is based on a novel Point Spread Function convolutional layer, which applies location-specific kernels that arise from the Circle-Of-Confusion at each image location. We evaluate our method on data derived from five common datasets for depth estimation and lightfield images, and present results that are on par with supervised methods on the KITTI and Make3D datasets and outperform unsupervised learning approaches. Since the phenomenon of depth from defocus is not dataset specific, we hypothesize that learning based on it would overfit less to the specific content in each dataset. Our experiments show that this is indeed the case, and an estimator learned on one dataset using our method provides better results on other datasets than the directly supervised methods.
In classical computer vision, many depth cues were used in order to recover depth from a given set of images. These shape-from-X methods include structure-from-motion, which is based on multi-view geometry, shape from structured light, in which the known light source plays the role of an additional view, shape from shadow, and, most relevant to our work, shape from defocus. In machine learning based computer vision, the interest has mostly shifted to depth from a single image, treating the problem as a multivariate image-to-depth regression problem, with an additional emphasis on using deep learning.
Learning depth from a single image takes two forms. There are supervised methods, in which the target information (the depth) is explicitly given, and unsupervised methods, in which the depth information is given implicitly. The most common approach in unsupervised learning is to provide the learning algorithm with stereo pairs or other forms of multiple views [37, 41]. In these methods, the training set consists of multiple scenes, where for each scene, we are given a set of views. The output of the method, similar to the supervised case, is a function that, given a single image, estimates depth at every point.
In this work, we rely on shape from defocus instead of multiple view geometry. The input to our method, during training, is an all-in-focus image and one or more focused images of the same scene from the same viewing point. The algorithm then learns a regression function, which, given an all-in-focus image, estimates depth by reconstructing the given focused images. In classical computer vision, research in this area led to a variety of applications [44, 35, 32], such as estimating depth from mobile phone images. A deep learning based approach was presented by Anwar et al., who employ synthetic focus images in supervised depth learning, and an aperture supervision approach to depth learning was presented by Srinivasan et al., who employ lightfield images in the same way we use defocus images.
Our method relies on a novel Point Spread Function (PSF) layer, which performs a local operation over an image, with a location dependent kernel that is computed “on-the-fly”, according to the estimated parameters of the PSF at each location. More specifically, the layer receives three inputs: an all-in-focus image, an estimated depth-map, and camera parameters, and outputs an image at one specific focus. This image is then compared to the training images to compute a loss. Both the forward and backward operations of the layer are efficiently computed using a dedicated CUDA kernel. This layer is then used as part of a novel architecture, building on the successful ASPP architecture [5, 9]. To improve the ASPP block, we add dense connections, followed by self-attention.
We evaluate our method on all relevant benchmarks we were able to obtain. These include the flower lightfield dataset and the multifocus indoor and outdoor scene dataset, for which we compare the ability to generate unseen focus images with other methods. We also evaluate on KITTI, NYU, and Make3D, which are monocular depth estimation datasets. In all cases, we show an improved performance in comparison to methods with a similar level of supervision, and performance that is on par with the best directly supervised methods on the KITTI and Make3D datasets. We note that our method uses focus cues for depth estimation, hence the task of defocusing in itself is not evaluated.
When learning depth from a single image, the most dominant cue is often the content of the image. For example, in street view images one can obtain a good estimate of the depth based on the type of object (sidewalk, road, building, car) and its location in the image. We hypothesize that when learning from focus data, the role of local image statistics becomes more dominant, and that these image statistics are shared more widely across different visual domains. We therefore conduct experiments in which a depth estimator trained on one dataset is evaluated on another. Our experiments show a clear advantage to our method, in comparison to the state-of-the-art supervised monocular method DORN.
In monocular depth estimation, a single image is given as input, and the output is the predicted depth associated with that image. Supervised training methods learn from the ground truth depth directly and the so-called unsupervised methods employ other data cues, such as stereo image pairs. One of the first methods in the field was presented by Saxena et al., who applied supervised learning and proposed a patch-based model and a Markov Random Field (MRF). Following this work, a variety of approaches have been presented using hand crafted representations [29, 18, 26, 11]. Recent methods use convolutional neural networks (CNN), starting from learning features for a conditional random field (CRF) model as in Liu et al., to learning end-to-end CNN models refined by CRFs, as in [2, 40].
Many models employ an autoencoder structure [7, 12, 17, 19, 39, 9], with an added advantage to very deep networks that employ ResNets. Eigen et al. [8, 7] showed that using multi-scale depth predictions compensates for the decrease in spatial resolution that occurs in the encoder, and improves depth estimation. Other work uses different regression losses, such as the reversed Huber loss used by Laina et al. to lower the smoothness effect of the norm, and the recent work by Fu et al., who use ordinal regression for each pixel with their spacing-increasing discretization (SID) strategy to discretize depth.
Unsupervised depth estimation. Modern methods for unsupervised depth estimation have relied on the geometry of the scene. Garg et al., for example, proposed using stereo pairs for learning, introducing the differentiable inverse warping. Godard et al. added the Left-Right consistency constraint to the loss function, exploiting another geometrical cue. Zhou et al. additionally learned the ego-motion of the scene, and GeoNet also used the optical flow of the scene. Wang et al. recently showed that using direct visual odometry along with depth normalization substantially improves prediction performance.
Depth from focus/defocus. The difference between depth from focus and depth from defocus is that, in the first case, camera parameters can be changed during the depth estimation process, whereas in the second case this is not allowed. Unlike the motion based methods above, these methods obtain depth using the structure of the optical geometry of the lens and light rays, as described in Sec. 3.1. Work in this field mainly focuses on analytical techniques. Zhuo et al., for example, estimated the amount of spatially varying defocus blur at edge locations. The use of coded apertures has been proposed by [20, 36, 30] to improve depth estimation. Later work in this field, such as Suwajanakorn et al., Tang et al., and Surh et al., employed focal stacks (sets of images of the same scene with different focus distances) and estimated depth based on a variety of blurring models, such as the Ring Difference Filter. These methods first reconstruct an all-in-focus image and then optimize a depth map that best explains the re-rendering of the focal stack images out of the all-in-focus image.
There are few deep learning works in the field. Srinivasan et al. presented a new lightfield dataset of flower images. They used the ground truth lightfield images to render focused images and employed a regression model to estimate depth from defocus by reconstruction of the rendered focused images. While Srinivasan et al. did not compare on other RGB-D datasets [13, 27, 28, 23], their method can take as input any all-in-focus image. We evaluate their rendering process using our network on the KITTI dataset. Anwar et al. utilized the provided depth of those datasets to integrate focus rendering within a fully supervised depth learning scheme.
We review the relevant optical geometry on which our PSF layer relies and then move to the layer itself.
Depth from focus methods are mostly based on the thin-lens model and geometry, as shown in Fig. 1(a). The figure illustrates light-ray trajectories and the blurring effect caused by out-of-focus objects. The plane of focus is defined such that light rays emerging from it towards the lens fall at the same point on the camera sensor plane. An object is said to be in focus if its distance from the lens falls inside the camera’s depth-of-field (DoF), which is the range of distances around the plane of focus where objects appear acceptably sharp to the human eye. Objects outside the DoF appear blurred on the image plane, an effect caused by the spread of light rays coming from the unfocused objects and forming what is called the “Circle-Of-Confusion” (CoC), as marked by C in Fig. 1(a). In this paper, we will use the following terminology: an all-in-focus image is an image where all objects appear in focus, and a focused image is one where blurring effects caused by the lens configuration are observed.
In this model, we consider the following parameters to describe a specific camera: the focal-length, which is the distance between the lens plane and the point where initially parallel rays are brought to a focus; the aperture, which is the diameter of the lens (or of the opening through which light travels); and the plane of focus (or focus distance), which is the distance between the lens plane and the plane where all points are in focus. Following the thin-lens model, the size of blur, i.e., the diameter of the CoC, which we denote as C, is given by Eq. 1 as a function of these parameters, of the distance between an object and the lens plane, and of what is known as the f-number of the camera.
While the CoC is usually measured in millimeters, we transform its size to pixels (Eq. 2) by considering the camera pixel-size and a camera output scale, which is the ratio between sensor size and output image size.
The CoC is directly related to the depth, as illustrated in Fig. 1(b), where each line represents a different focus distance. As can be seen, the relation is not one-to-one and will cause ambiguity in depth estimation. Moreover, different camera settings are required for different scenes in terms of the scene’s maximum depth, e.g., for KITTI we consider a maximum depth of 80 meters, and for NYU 10 meters. We also consider a constant f-number and a different focal-length for all datasets, in order to lower depth ambiguity by lowering the DoF range (see Sec. 5.2 for more details).
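As a rough illustration of the relation described above, the following sketch evaluates the standard thin-lens CoC formula and its conversion to pixels. The function names, argument names, and camera values (a 35 mm lens at f/2.8 focused at 2 m) are assumptions made for this example and are not taken from the paper; the pixel conversion assumes that one output pixel covers `output_scale` sensor pixels.

```python
def coc_mm(obj_dist_mm, focus_dist_mm, focal_len_mm, f_number):
    """Thin-lens circle-of-confusion diameter, in millimetres."""
    # The f-number is the focal length divided by the aperture diameter.
    aperture_mm = focal_len_mm / f_number
    return (aperture_mm
            * abs(obj_dist_mm - focus_dist_mm) / obj_dist_mm
            * focal_len_mm / (focus_dist_mm - focal_len_mm))


def coc_pixels(c_mm, pixel_size_mm, output_scale):
    """Convert a CoC diameter from millimetres to output-image pixels."""
    return c_mm / (pixel_size_mm * output_scale)


# Hypothetical camera settings, chosen only for illustration.
CAMERA = dict(focus_dist_mm=2000.0, focal_len_mm=35.0, f_number=2.8)

in_focus = coc_mm(2000.0, **CAMERA)  # object on the plane of focus
near = coc_mm(1500.0, **CAMERA)      # object in front of the plane of focus
far = coc_mm(3000.0, **CAMERA)       # object behind the plane of focus
print(in_focus, near, far)
```

With these settings, the objects at 1.5 m and 3 m produce the same blur diameter, which is one concrete instance of the depth ambiguity: a single CoC value can correspond to one depth in front of the plane of focus and another behind it.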
We now refer to one more measurement named CoC-limit, defined as the largest blur spot that will still be perceived by the human eye as a point, when viewed on a final image from a standard viewing distance. The CoC-limit also limits the kernel size used for rendering and is, therefore, highly influential on the run time (bigger kernels lead to more computations). We employ a fixed kernel size that reflects a standard CoC-limit.
Because we work in pixel space, if the diameter is less than one pixel (C < 1), we ignore the blurring effect.
According to the above formulation, a focused image can be generated from an all-in-focus image and a depth-map, as commonly done in graphics rendering. Given an all-in-focus image, a depth-map, the CoC-map derived from it, and the camera parameters, the rendered focused image is defined as the convolution of the all-in-focus image with a functional kernel (Eq. 5). Each output pixel is computed over a set of offsets bounded by the kernel size, and the kernel (Eq. 4) is a function of both the offset indices and the CoC at the given image location.
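The rendering described above can be sketched in plain Python as a convolution whose kernel changes per pixel. This is only an illustration under stated assumptions: the kernel is taken to be a normalized Gaussian whose standard deviation equals the local CoC, the default `kernel_size=7` is a hypothetical value, borders are handled by clamping, and `render_focused` is a name introduced here rather than the paper's API.

```python
import math


def render_focused(image, coc_map, kernel_size=7):
    """Blur a grayscale image (list of rows) with a per-pixel Gaussian kernel.

    Assumption: the kernel at (y, x) is a Gaussian with sigma = coc_map[y][x].
    Pixels with a CoC below one pixel are copied unchanged.
    """
    height, width = len(image), len(image[0])
    radius = kernel_size // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            coc = coc_map[y][x]
            if coc < 1.0:  # blur smaller than a pixel is ignored
                out[y][x] = image[y][x]
                continue
            acc, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp at borders
                    xx = min(max(x + dx, 0), width - 1)
                    weight = math.exp(-(dy * dy + dx * dx) / (2.0 * coc * coc))
                    acc += weight * image[yy][xx]
                    norm += weight
            out[y][x] = acc / norm
    return out


# A single bright pixel is spread out when the whole image is out of focus.
impulse = [[0.0] * 9 for _ in range(9)]
impulse[4][4] = 1.0
blurred = render_focused(impulse, [[2.0] * 9 for _ in range(9)])
print(round(blurred[4][4], 4), round(blurred[4][5], 4))
```

A dedicated GPU kernel would parallelize the two outer loops over image locations; the nested Python loops here are meant only to make the per-location kernel explicit.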
Based on Eq. 5, given a set of focused images of the same scene, one may optimize a model to predict the all-in-focus image and the depth map. Alternatively, given a focused image and its corresponding all-in-focus image, we predict the scene depth by reconstructing the focused image.
The PSF layer we employ can be seen as a particular case of locally connected layers, with a few differences: first, in the PSF layer, the same operator is applied across all channels, while in the locally-connected layer, as well as in conventional layers (excluding depth-wise convolution), the local operator varies between the input channels. Additionally, the PSF layer does not sum the outcomes, and returns the same number of channels in the output tensor as in the input tensor.
The PSF convolutional layer, designed for the task of Depth from Defocus (DfD), is based on Eq. 5, where kernels vary between locations and are calculated “on-the-fly”, according to the kernel function defined in Eq. 4. The kernel is, therefore, a local function of the object’s distance, with a blur kernel applied to out-of-focus pixels. The layer takes as input an all-in-focus image, a depth-map, and the camera parameters vector, which contains the aperture, the focal length, and the focal depth. The layer then outputs a focused image. As mentioned before, we fix the near and far distance limits to fit each dataset and use the fixed pixel size mentioned above. The rendering process begins by first calculating the CoC-map according to Eq. 1, and then applying the functional kernel convolution defined in Eq. 5. We implement this operation in CUDA, together with its derivatives (Eqs. 7 and 8).
A detailed explanation of the forward and backward pass is provided in the supplementary material.
In this section, we describe the training method and the model architecture, which extends the ASPP architecture to include both self-attention and dense connections. We then describe the training procedure.
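Given an all-in-focus image, a (real-world) focused version of it, and a predicted focused version of it, we train a regression model to minimize the reconstruction loss between the real and the predicted focused images. Our model consists of a learned network and a fixed network. Both networks take part in the loss, and backpropagation through the fixed network is performed using Eqs. 7 and 8.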
The learned network is applied to an all-in-focus image and returns a predicted depth. The fixed network consists of the PSF layer, as described in Sec. 3.2. It takes as input an all-in-focus image, a depth (estimated or not), and the camera parameters vector. It outputs a focused version of the all-in-focus image according to the given depth and camera parameters. We distinguish between a focused image rendered from ground truth depth (the same notation is also used for real focused images) and a focused image rendered from predicted depth.
The training procedure has two cases, training with real data or with generated data, depending on the training dataset at hand. In both cases, training is performed end-to-end by running the learned network and the fixed network sequentially. First, the learned network is applied to an all-in-focus image and outputs the predicted depth-map. Using this map, the all-in-focus image, and the camera parameters, the fixed network renders the predicted focused image. A reconstruction error is then computed between the predicted focused image and the training focused image, where, for depth-based datasets, we render the training focused images according to the ground truth depth-map and the camera specifications. Fig. 2 shows the training scheme, where the blue dashed rectangle illustrates the second case, in which the training focused image is rendered from the ground truth depth.
In the first case, since we compare with the work of , we use a single focused image during training, although more can be used. In the second case, we compare with fully supervised methods, that benefit from a direct access to the depth information, and we report results for 1, 2, 6 and 10 rendered focused images.
The reconstruction loss combines the Structural Similarity (SSIM) measure with a second loss term, with a weighting parameter that controls the balance between the two.
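A minimal sketch of such a combined loss is given below. It assumes that the second term is an L1 difference, that the two terms are mixed as `alpha * (1 - SSIM) / 2 + (1 - alpha) * L1`, and that SSIM is computed over a single window covering the whole (flattened) patch; the weighting form, the default `alpha=0.85`, and the function names are assumptions borrowed from common practice in the depth estimation literature, not values stated in the text.

```python
def ssim_single_window(a, b, data_range=1.0):
    """SSIM between two flattened patches, using one window over all pixels."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))


def reconstruction_loss(pred, target, alpha=0.85):
    """Weighted sum of an SSIM-based term and an L1 term (assumed form)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    ssim_term = (1.0 - ssim_single_window(pred, target)) / 2.0
    return alpha * ssim_term + (1.0 - alpha) * l1


target = [0.1, 0.4, 0.8, 0.3]
print(reconstruction_loss(target, target))
print(reconstruction_loss([0.2, 0.3, 0.6, 0.5], target))
```

In practice SSIM is usually computed over small sliding windows and averaged; the single-window version keeps the example short while preserving the structure of the loss.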
The reconstruction loss above does not take into account the blurriness in some parts of the image, which arises from regions that are out of focus. We, therefore, add a sharpness measure, which considers the sharpness of each pixel. It contains three parts: (i) the image Laplacian, (ii) the image Contrast Visibility, and (iii) the image Variance, where the average pixel value is taken over a local window of fixed size. The sharpness measure is computed from these three parts and yields an additional loss term.
The final loss term combines the reconstruction loss and the sharpness loss, with the same hyperparameter values used for all experiments.
Our network is illustrated in Fig. 3. It consists of an encoder-decoder architecture, where we rely on the DeepLabV3+ [4, 5] model, which was found to be effective for semantic segmentation and depth estimation tasks. The encoder has two parts: a ResNet backbone and a subsequent Atrous Spatial Pyramid Pooling (ASPP) module. Unlike prior work, we do not employ a pretrained ResNet and instead learn it end-to-end.
The Atrous convolutions (also called dilated convolutions) add padding between kernel cells to enlarge the receptive field from earlier layers, while keeping the weight size constant. ASPP contains several parallel Atrous convolutions with different dilations. As advised in prior work, we also replace all pooling layers of the encoder with convolution layers with an appropriate stride.
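To make the effect of dilation concrete, the sketch below computes the effective extent of a dilated kernel and applies a one-dimensional dilated convolution. The helper names and the dilation rates in the loop are illustrative choices, not the configuration used in the network.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent covered by a dilated kernel; the number of weights is unchanged."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution (cross-correlation, no kernel flip)."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


# The same three weights cover a wider extent as the dilation grows.
for rate in (1, 6, 12, 18):
    print(rate, effective_kernel_size(3, rate))

print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], dilation=2))
```

Running several such convolutions in parallel with different dilation rates, as ASPP does, lets one block aggregate context at multiple scales without adding weights per branch.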
The loss is computed at the highest resolution, to support higher quality outputs. However, to comply with GPU memory constraints, the network takes as input a downsampled image of half the original size. The network’s output is then upsampled to the original image size.
Dense ASPP with Self-Attention. The original ASPP consists of three or more independent layers: average pooling followed by convolution, convolution, and four Atrous layers. Each convolution layer has 256 channels, and the four outputs of these layers, along with the output of the pool+conv layer, are concatenated together to form a single tensor. We propose two additional modifications from different parts of the literature: dense connections and self-attention.
We add dense connections between the convolution and all Atrous convolution layers of the ASPP module, sequentially connecting all layers from smallest to the largest dilation layer. Each layer, therefore, receives as the input tensor not just the output of the previous layer, but the concatenation of the output tensors of all preceding layers. This is illustrated as the skip connection arrows in Fig. 3.
Self-Attention aims to integrate local features with their global dependencies, and as shown in previous work [42, 10], it improves results in image segmentation and generation. Our implementation is based on dual-attention.
The decoder consists of three upsampling blocks, each having three convolution layers followed by bilinear upsampling. A skip connection from a low level layer of the backbone is concatenated with the input of the second block. The output of the decoder is the predicted depth.
We divide our experiments into two types, DoF supervision and DoF supervision from rendered data, as mentioned in the previous section. We further experiment with cross domain evaluation, where we evaluate our method in comparison to the state-of-the-art supervised method DORN. Here the models are trained on domain A and tested on domain B, denoted as A → B. We show that learning depth from focus cues, though not achieving better results than the supervised methods, is comparable with top methods on the KITTI and Make3D datasets and achieves better generalization, expressed by higher results in cross domain evaluation.
The network is trained on a single Titan-X Pascal GPU with a batch size of 3, using the Adam optimizer with weight decay. The dedicated CUDA implementation of the PSF layer runs 80x faster than the optimized PyTorch implementation.
The following five benchmarks are used:
Lightfield dataset. The dataset contains lightfield images of flowers and plants, taken with a Lytro Illum camera. From the lightfield images, we follow the procedure of Srinivasan et al. to generate the all-in-focus and shallow DoF images, and split the dataset into 3143 and 300 images for train and test.
DSLR dataset. This dataset contains 110 images and ground truth depth from indoor scenes, with 81 images for training and 29 images for testing, and 34 images from outdoor scenes without ground truth depth. Each scene is acquired with two camera apertures, providing focused and all-in-focus images.
KITTI. This benchmark contains RGB-D images taken in an outdoor environment; we refer to their native resolution as the full resolution output size. The train/test splits we employ follow Eigen et al., with 23,000 training images and 697 test images. The input depth-maps and images are cropped to obtain valid depth values, and resized to half-size.
NYU DepthV2. This benchmark contains about 120K indoor RGB and depth images captured with a Microsoft Kinect. The dataset consists of 249 scenes for training and 215 scenes for testing. We report results on 654 test images from a small subset of 1449 aligned RGB-depth pairs, as done in previous work.
Make3D [27, 28]. The Make3D benchmark contains 534 RGB-depth pairs, split into 400 pairs for training and 134 for testing. The input images are provided at a high resolution, while the depth-maps are at low resolution. Therefore, data is resized as proposed by [27, 28]. Following previous work, results are evaluated in two settings: for a depth cap of 0-70, and for a depth cap of 0-80.
DoF supervision. We first report results on the Lightfield dataset, which provides focused and all-in-focus image pairs with no ground truth depth. The performance is evaluated using the PSNR and SSIM measures. Our results are shown in Tab. 1. As can be seen, we significantly outperform the literature baselines.
| Method | Supervision | PSNR | SSIM |
| Image Regression | DoF | 24.60 | 0.895 |
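For reference, PSNR between a rendered and a target image can be computed as in the sketch below, which assumes pixel values normalized to [0, 1] and images given as flattened lists; the function name and arguments are chosen for this example.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in decibels between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Higher values indicate a closer reconstruction, and because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error.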
Rendered DoF supervision
For rendered DoF supervision, we consider four datasets [8, 27, 23, 3] with ground truth depth, where we render focused images with different focus distances. We denote by F1, F2, F6, and F10 the four training setups, which differ by the number of rendered focused images used in training. The order in which focal distances are selected is defined by a fixed focal sequence, where each number represents the fraction of the maximum depth used for each dataset. For example, F2 employs focal distances of 0.2 and 0.8 times the maximal depth.
We perform two types of evaluations. First, we evaluate our method for each dataset with different numbers of focused images during training, and compare our results with other unsupervised methods, as well as with supervised ones. The evaluation measures are those commonly used in the literature [13, 27, 28] and include various RMSE measures and a thresholded error rate.
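The sketch below shows how measures of this kind are commonly computed for monocular depth estimation. The exact set of metrics and the threshold value of 1.25 follow the usual convention in the literature and are assumptions here, as are the function and key names.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Common depth-estimation measures over flattened, valid depth values."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose prediction is within a ratio of `threshold`.
    accuracy = sum(max(p / g, g / p) < threshold for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse,
            "rmse_log": rmse_log, "accuracy": accuracy}


print(depth_metrics([2.0, 2.0, 4.0], [1.0, 2.0, 4.0]))
```

The error measures are lower-is-better, while the thresholded measure is higher-is-better, so result tables typically mark the direction of each column.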
Tab. 2 and 3 show that our method outperforms monocular and stereo supervision methods on the KITTI and Make3D datasets. This also holds when the previous methods are trained with additional data obtained from the Cityscapes dataset. In comparison to the depth supervised methods, we outperform all but one method on KITTI, and outperform [9, 21] on Make3D. In Fig. 4, we present qualitative results of our method compared to the state-of-the-art unsupervised method on the KITTI dataset. As can be seen in Tab. 4, no unsupervised methods in the literature report results on the NYU dataset, where we are slightly outperformed by the supervised methods.
We next perform cross domain evaluation compared to the published models of the state-of-the-art supervised method DORN, where training is performed on KITTI or NYU, and testing on different datasets. These tests are meant to evaluate the specificity of the learned network to a particular dataset. Since the absolute depth differs between datasets, we evaluate the methods by computing the Pearson correlation metric. Results are shown in Tab. 5. As can be seen, when transferring from both KITTI and NYU, we outperform the directly supervised method. The gap is especially visible for the NYU network.
| Train → Test | Method | Pearson correlation |
|---|---|---|
| KITTI → NYU | DORN | 0.423 ± 0.010 |
| | Ours F1 | 0.121 ± 0.006 |
| | Ours F10 | 0.429 ± 0.009 |
| KITTI → Make3D | DORN | 0.616 ± 0.011 |
| | Ours F1 | 0.484 ± 0.019 |
| | Ours F10 | 0.642 ± 0.014 |
| KITTI → D3Net | DORN | 0.145 ± 0.048 |
| | Ours F1 | 0.148 ± 0.032 |
| | Ours F10 | 0.275 ± 0.054 |
| NYU → KITTI | DORN | 0.456 ± 0.006 |
| | Ours F1 | 0.567 ± 0.006 |
| | Ours F10 | 0.634 ± 0.005 |
| NYU → Make3D | DORN | 0.250 ± 0.019 |
| | Ours F1 | 0.249 ± 0.032 |
| | Ours F10 | 0.456 ± 0.022 |
| NYU → D3Net | DORN | 0.260 ± 0.054 |
| | Ours F1 | 0.530 ± 0.048 |
| | Ours F10 | 0.434 ± 0.052 |
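The Pearson correlation used for the cross domain comparison can be computed as in the following sketch, shown here for flattened depth values. The function name is an assumption of this example; the point it illustrates is that the score is unaffected by a positive scale and offset of the prediction, which is why it suits datasets with different absolute depth ranges.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


gt_depth = [1.0, 2.0, 3.0, 4.0]
# A prediction that is correct up to an unknown scale and offset.
pred_depth = [0.1 * d + 5.0 for d in gt_depth]
print(pearson(pred_depth, gt_depth))
```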
We also provide cross-domain results for the outdoor images of the DSLR dataset, where no ground truth depth is provided, using the PSNR and SSIM metrics. Tab. 6 shows that, in this case, our method transfers better from NYU and only slightly better from KITTI in comparison to DORN.
The Effect of Focal Distance. Because the focus distance and DoF range are positively correlated, training with a far focus distance increases the DoF and puts a large range of distances in focus. As a result, focus cues are weakened, causing performance to decrease. In Fig. 5 we present, for the Make3D dataset, the accuracy of F1 training with different focus distances. A clear decrease in performance is seen at mid-range, followed by an increase. The increase results from the dataset’s maximum depth capping the far DoF distance, which lowers the DoF range and increases focus cues for closer objects.
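The positive correlation between focus distance and DoF range can be checked with the standard thin-lens depth-of-field formulas, as in the sketch below. The camera values (35 mm lens, f/2.8, CoC-limit of 0.03 mm) and the function name are hypothetical and serve only to show the trend.

```python
import math


def dof_limits(focus_dist_mm, focal_len_mm, f_number, coc_limit_mm):
    """Near and far limits of acceptable sharpness (thin-lens approximation)."""
    hyperfocal = focal_len_mm ** 2 / (f_number * coc_limit_mm) + focal_len_mm
    scaled = focus_dist_mm * (hyperfocal - focal_len_mm)
    near = scaled / (hyperfocal + focus_dist_mm - 2.0 * focal_len_mm)
    if focus_dist_mm >= hyperfocal:
        far = math.inf  # everything beyond the near limit is in focus
    else:
        far = scaled / (hyperfocal - focus_dist_mm)
    return near, far


# Hypothetical camera: the DoF range widens as the focus distance grows.
for focus_mm in (2000.0, 5000.0, 10000.0):
    near, far = dof_limits(focus_mm, 35.0, 2.8, 0.03)
    print(focus_mm, round(near), round(far), round(far - near))
```

When the far limit exceeds the maximum depth present in a dataset, the usable in-focus range is cut off at that maximum depth.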
Dense ASPP with Self-Attention. We evaluate our dense ASPP with self-attention in comparison to three versions of the original ASPP model: vanilla ASPP, ASPP with dense connections, and ASPP with self-attention. In order to differentiate between different ambiguity scenarios, training is performed with the F1, F2, F6, and F10 methods. As can be seen in Tab. 7, our model outperforms the different ASPP versions. However, as the number of focused images increases, the gaps are reduced.
Different rendering methods. To further compare with Srinivasan et al., we conducted a test on the KITTI dataset, where we replaced our rendering network with their compositional rendering, and modified the last layer of our depth network to output 80 depth probabilities (similar to their method). As shown in Tab. 8, their compositional method performs poorly on KITTI in the F1 and F2 settings.
We propose a method for learning to estimate depth from a single image, based on focus cues. Our method outperforms the similarly supervised method and all other unsupervised methods in the literature. In most cases, it matches the performance of directly supervised methods, when evaluated on test images from the training domain. Since focus cues are more generic than content cues, our method outperforms the state-of-the-art supervised method in cross domain evaluation on all available literature datasets.
We introduce a differentiable PSF convolutional layer, which propagates image based losses back to the estimated depth. We also contribute a new architecture that introduces dense connections and Self-Attention to the ASPP module. Our code is available as part of the supplementary material, and on GitHub at https://github.com/shirgur/UnsupervisedDepthFromFocus.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of the first author is part of a Ph.D. thesis research conducted at Tel Aviv University.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 89–96.
Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. pp. 1119–1127.
A robust hybrid of lasso and ridge regression. Contemporary Mathematics 443 (7), pp. 59–72.