Single Image Depth Estimation Trained via Depth from Defocus Cues

by   Shir Gur, et al.
Tel Aviv University

Estimating depth from a single RGB image is a fundamental task in computer vision, which is most directly solved using supervised deep learning. In the field of unsupervised learning of depth from a single RGB image, depth is not given explicitly. Existing work in the field receives either a stereo pair, a monocular video, or multiple views, and, using losses that are based on structure-from-motion, trains a depth estimation network. In this work, instead of different views, we rely on depth from focus cues. Learning is based on a novel Point Spread Function convolutional layer, which applies location specific kernels that arise from the Circle-Of-Confusion in each image location. We evaluate our method on data derived from five common datasets for depth estimation and lightfield images, and present results that are on par with supervised methods on the KITTI and Make3D datasets and outperform unsupervised learning approaches. Since the phenomenon of depth from defocus is not dataset specific, we hypothesize that learning based on it would overfit less to the specific content in each dataset. Our experiments show that this is indeed the case, and an estimator learned on one dataset using our method provides better results on other datasets than the directly supervised methods.



1 Introduction

In classical computer vision, many depth cues were used in order to recover depth from a given set of images. These shape from X methods include structure-from-motion, which is based on multi-view geometry, shape from structured light, in which the known light source plays the role of an additional view, shape from shadow, and, most relevant to our work, shape from defocus. In machine learning based computer vision, the interest has mostly shifted to depth from a single image, treating the problem as a multivariate image-to-depth regression problem, with an additional emphasis on using deep learning.

Learning depth from a single image takes two forms. There are supervised methods, in which the target information (the depth) is explicitly given, and unsupervised methods, in which the depth information is given implicitly. The most common approach in unsupervised learning is to provide the learning algorithm with stereo pairs or other forms of multiple views [37, 41]. In these methods, the training set consists of multiple scenes, where for each scene, we are given a set of views. The output of the method, similar to the supervised case, is a function that, given a single image, estimates depth at every point.

In this work, instead of multiple view geometry, we rely on shape from defocus. The input to our method, during training, is an all-in-focus image and one or more focused images of the same scene from the same viewing point. The algorithm then learns a regression function, which, given an all-in-focus image, estimates depth by reconstructing the given focused images. In classical computer vision, research in this area led to a variety of applications [44, 35, 32], such as estimating depth from mobile phone images [33]. A deep learning based approach was presented by Anwar et al. [1], who employ synthetic focus images in supervised depth learning, and an aperture supervision approach to depth learning was presented by Srinivasan et al. [31], who employ lightfield images in the same way we use defocus images.

Our method relies on a novel Point Spread Function (PSF) layer, which performs a local operation over an image, with a location dependent kernel that is computed “on-the-fly”, according to the estimated parameters of the PSF at each location. More specifically, the layer receives three inputs: an all-in-focus image, an estimated depth-map and camera parameters, and outputs an image at one specific focus. This image is then compared to the training images to compute a loss. Both the forward and backward operations of the layer are efficiently computed using a dedicated CUDA kernel. This layer is then used as part of a novel architecture, which builds on the successful ASPP architecture [5, 9]. To improve the ASPP block, we add dense connections [16], followed by self-attention [42].

We evaluate our method on all relevant benchmarks we were able to obtain. These include the flower lightfield dataset and the multifocus indoor and outdoor scene dataset, for which we compare the ability to generate unseen focus images with other methods. We also evaluate on KITTI, NYU, and Make3D, which are monocular depth estimation datasets. In all cases, we show improved performance in comparison to methods with a similar level of supervision, and performance that is on par with the best directly supervised methods on the KITTI and Make3D datasets. We note that our method uses focus cues for depth estimation, hence the task of defocusing in itself is not evaluated.

When learning depth from a single image, the most dominant cue is often the content of the image. For example, in street view images one can obtain a good estimate of the depth based on the type of object (sidewalk, road, building, car) and its location in the image. We hypothesize that when learning from focus data, the role of local image statistics becomes more dominant, and that these image statistics are shared more widely across different visual domains. We therefore conduct experiments in which a depth estimator trained on one dataset is evaluated on another. Our experiments show a clear advantage for our method, in comparison to the state-of-the-art supervised monocular method of [9].

2 Related Work

Learning based monocular depth estimation

In monocular depth estimation, a single image is given as input, and the output is the predicted depth associated with that image. Supervised training methods learn from the ground truth depth directly, whereas the so-called unsupervised methods employ other data cues, such as stereo image pairs. One of the first methods in the field was presented by Saxena et al. [27], who applied supervised learning and proposed a patch-based model and a Markov Random Field (MRF). Following this work, a variety of approaches have been presented using hand-crafted representations [29, 18, 26, 11]. Recent methods use convolutional neural networks (CNNs), starting from learning features for a conditional random field (CRF) model, as in Liu et al. [22], to learning end-to-end CNN models refined by CRFs, as in [2, 40].

Many models employ an autoencoder structure [7, 12, 17, 19, 39, 9], with an added advantage to very deep networks that employ ResNets [15]. Eigen et al. [8, 7] showed that using multi-scale depth predictions helps to compensate for the decrease in spatial resolution that occurs in the encoder, and improves depth estimation. Other work uses different losses for regression, such as the reversed Huber [24] used by Laina et al. [19] to lower the smoothness effect of the norm, and the recent work by Fu et al. [9], who use ordinal regression for each pixel with their spacing-increasing discretization (SID) strategy to discretize depth.

Unsupervised depth estimation

Modern methods for unsupervised depth estimation rely on the geometry of the scene. Garg et al. [12], for example, proposed using stereo pairs for learning, introducing differentiable inverse warping. Godard et al. [14] added the Left-Right consistency constraint to the loss function, exploiting another geometrical cue. Zhou et al. [43] additionally learned the ego-motion of the scene, and GeoNet [41] also used the optical flow of the scene. Wang et al. [37] recently showed that using direct visual odometry along with depth normalization substantially improves prediction performance.

Depth from focus/defocus

The difference between depth from focus and depth from defocus is that, in the first case, camera parameters can be changed during the depth estimation process. In the second case, this is not allowed. Unlike the motion based methods above, these methods obtain depth using the structure of the optical geometry of the lens and light ray, as described in Sec. 3.1. Work in this field mainly focuses on analytical techniques. Zhuo et al. [44], for example, estimated the amount of spatially varying defocus blur at edge locations. The use of Coded Aperture was proposed by [20, 36, 30] to improve depth estimation. Later work in this field, such as Suwajanakorn et al. [33], Tang et al. [35] and Surh et al. [32], employed focal stacks (sets of images of the same scene with different focus distances) and estimated depth based on a variety of blurring models, such as the Ring Difference Filter [32]. These methods first reconstruct an all-in-focus image and then optimize a depth map that best explains the re-rendering of the focal stack images out of the all-in-focus image.

There are not many deep learning works in the field. Srinivasan et al. [31] presented a new lightfield dataset of flower images. They used the ground truth lightfield images to render focused images and employed a regression model to estimate depth from defocus by reconstruction of the rendered focused images. While Srinivasan et al. [31] did not compare to other RGB-D datasets [13, 27, 28, 23], their method can take as input any all-in-focus image. We evaluate the rendering process of [31] using our network on the KITTI dataset. Anwar et al. [1] utilized the provided depth of those datasets to integrate focus rendering within a fully supervised depth learning scheme.

3 Differentiable Optical Model

We review the relevant optical geometry on which our PSF layer relies and then move to the layer itself.

3.1 Depth From Defocus

Figure 1: (a) Illustration of lens principles. Blue beams represent an object in focus. Red beams represent an object further away and out of focus. See text for symbol definitions. (b) CoC diameter w.r.t. object distance as seen in KITTI. Camera settings are: , , and . (c) Sample blur kernel. The green line represents a depth edge; blue colors represent the relative blur contribution w.r.t. CoC.

Depth from focus methods are mostly based on the thin-lens model and geometry, as shown in Fig. 1(a). The figure illustrates light-ray trajectories and the blurring effect caused by out-of-focus objects. The plane of focus is defined such that light rays emerging from it towards the lens fall at the same point on the camera sensor plane. An object is said to be in focus if its distance from the lens falls inside the camera’s depth-of-field (DoF), which is the range of distances about the plane of focus where objects appear acceptably sharp to the human eye. Objects outside the DoF appear blurred on the image plane, an effect caused by the spread of light rays coming from the unfocused objects and forming what is called the “Circle-Of-Confusion” (CoC), as marked by C in Fig. 1(a). In this paper, we use the following terminology: an all-in-focus image is an image where all objects appear in focus, and a focused image is one where blurring effects caused by the lens configuration are observed.

In this model, we consider the following parameters to describe a specific camera: focal-length , which is the distance between the lens plane and the point where initially parallel rays are brought to a focus, aperture , which is the diameter of the lens (or an opening through which light travels), and the plane of focus (or focus distance), which is the distance between the lens plane and the plane where all points are in focus. Following the thin-lens model, we define the size of blur, i.e., the diameter of the CoC, which we denote as , according to the following equation:


where is the distance between an object to the lens plane, and where is what is known as the f-number of the camera. While CoC is usually measured in millimeters (), we transform its size to pixels by considering a camera pixel-size of as in [3], and a camera output scale , which is the ratio between sensor size and output image size. The final CoC size in pixels is computed as follows:


The CoC is directly related to the depth, as illustrated in Fig. 1(b), where each line represents a different focus distance . As can be seen, the relation is not one-to-one and will cause ambiguity in depth estimation. Moreover, different camera settings are required for different scenes in terms of the scene’s maximum depth, e.g., for KITTI, we consider a maximum depth of 80 meters, and 10 meters for NYU. We also consider a constant f-number of and a different focal-length for all datasets, in order to lower depth ambiguity by lowering the DoF range (see Sec. 5.2 for more details).
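As a minimal sketch of the thin-lens relation discussed above, the following Python functions compute a CoC diameter in millimeters and convert it to pixels. The parameter names (`obj_mm`, `focus_mm`, `focal_mm`, `f_number`, `pixel_mm`, `scale`) and all numeric values are assumptions chosen for illustration, and the formula is the standard textbook thin-lens expression, which may differ in notation from the equations of this paper.

```python
def coc_mm(obj_mm, focus_mm, focal_mm, f_number):
    """Standard thin-lens CoC diameter (mm) for an object at distance obj_mm."""
    aperture_mm = focal_mm / f_number  # aperture diameter = focal length / f-number
    return (aperture_mm * abs(obj_mm - focus_mm) / obj_mm
            * focal_mm / (focus_mm - focal_mm))


def coc_pixels(obj_mm, focus_mm, focal_mm, f_number, pixel_mm, scale=1.0):
    """Convert the CoC diameter to pixels (pixel size and scale are assumed)."""
    return coc_mm(obj_mm, focus_mm, focal_mm, f_number) / (pixel_mm * scale)


# Hypothetical settings: 35 mm lens at f/2.8, focused at 10 m, 5.6 um pixels.
settings = dict(focus_mm=10_000.0, focal_mm=35.0, f_number=2.8, pixel_mm=0.0056)

in_focus = coc_pixels(10_000.0, **settings)    # object on the plane of focus
near = coc_pixels(8_000.0, **settings)         # object in front of the plane
far = coc_pixels(40_000.0 / 3.0, **settings)   # object behind the plane
```

Under these assumed settings, `near` and `far` are equal even though the two objects lie at different distances, which is one concrete instance of the ambiguity noted above: a single CoC value can correspond to a depth on either side of the plane of focus.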

We now refer to one more measurement named CoC-limit, defined as the largest blur spot that will still be perceived by the human eye as a point, when viewed on a final image from a standard viewing distance. The CoC-limit also limits the kernel size used for rendering and is, therefore, highly influential on the run time (bigger kernels lead to more computations). We employ a kernel of size , which reflects a standard CoC-limit of .

In this work, following [33, 35], we consider the blur model to be a disc-shaped point spread function (PSF), modeled by a Gaussian kernel with radius and kernel’s location indices :


Because we work in pixel space, if the diameter is less than one pixel, we ignore the blurring effect.
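The blur model above can be sketched as a small Gaussian kernel built from a CoC diameter. This is an illustrative sketch only: the function name, the default kernel size of 7, the choice of the CoC radius as the Gaussian's standard deviation, and the final normalization to unit sum are all assumptions, not values taken from the paper.

```python
import math


def gaussian_psf_kernel(coc_diameter_px, kernel_size=7):
    """Return a kernel_size x kernel_size Gaussian PSF as nested lists.

    Assumption: the Gaussian's standard deviation equals the CoC radius.
    """
    half = kernel_size // 2
    offsets = range(-half, half + 1)
    if coc_diameter_px < 1.0:
        # Sub-pixel blur is ignored: return an identity (delta) kernel.
        return [[1.0 if (u == 0 and v == 0) else 0.0 for v in offsets]
                for u in offsets]
    r = coc_diameter_px / 2.0
    kernel = [[math.exp(-(u * u + v * v) / (2.0 * r * r)) / (2.0 * math.pi * r * r)
               for v in offsets] for u in offsets]
    total = sum(sum(row) for row in kernel)
    # Normalizing to unit sum (an assumption) keeps image brightness unchanged.
    return [[w / total for w in row] for row in kernel]


sharp_kernel = gaussian_psf_kernel(0.5)   # below one pixel: no blur
blur_kernel = gaussian_psf_kernel(4.0)    # four-pixel CoC diameter
```

A larger CoC diameter spreads the weights further from the kernel center, which is what makes the rendered blur informative about depth.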

According to the above formulation, a focused image can be generated from an all-in-focus image and depth-map, as commonly done in graphics rendering. Let be an all-in-focus image and be a rendered focused image derived from depth-map , CoC-map , camera parameters , and , we define as follows:


where is an offsets set related to a kernel of size :


We denote by the convolution operation with a functional kernel , by the image location indices, and by the offset indices bounded by the kernel size.

Based on Eq. 5, given a set of focused images of the same scene, one may optimize a model to predict the all-in-focus image and the depth map. Alternatively, given a focused image and its corresponding all-in-focus image, we predict the scene depth by reconstructing the focused image.

While [31] uses a weighted sum of disk kernels to render blur, our blur kernel is a Gaussian composition of different blur contributions from all neighbors (Eq. 5), where each kernel coefficient is calculated by a Gaussian function w.r.t. a different estimated CoC, as illustrated in Fig. 1(c).
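The idea of composing blur contributions from neighbors, each weighted according to that neighbor's own CoC, can be sketched in plain Python for a small grayscale image. This is a simplified illustration, not the CUDA layer described in the paper: the function name, the nested-list image format, the 3x3 default kernel, the use of the CoC radius as the Gaussian's standard deviation, and the per-pixel weight normalization are all assumptions.

```python
import math


def render_focused(image, coc_map, kernel_size=3):
    """Render a focused image from an all-in-focus grayscale image.

    image and coc_map are nested lists of equal shape; coc_map holds the
    CoC diameter (in pixels) estimated at every location.
    """
    h, w = len(image), len(image[0])
    half = kernel_size // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, norm = 0.0, 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue  # skip neighbors outside the image
                    c = coc_map[ny][nx]
                    if c < 1.0:
                        # A sharp neighbor contributes only to its own location.
                        weight = 1.0 if (dx == 0 and dy == 0) else 0.0
                    else:
                        r = c / 2.0  # Gaussian spread set by the neighbor's CoC
                        weight = (math.exp(-(dx * dx + dy * dy) / (2.0 * r * r))
                                  / (2.0 * math.pi * r * r))
                    acc += weight * image[ny][nx]
                    norm += weight
            out[y][x] = acc / norm  # normalization is an assumption
    return out


impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
sharp = render_focused(impulse, [[0.0] * 3 for _ in range(3)])
blurred = render_focused(impulse, [[3.0] * 3 for _ in range(3)])
```

With a zero CoC map the image is returned unchanged, while a uniform three-pixel CoC spreads the central impulse into its neighbors.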

3.2 The PSF Convolutional layer

The PSF layer we employ can be seen as a particular case of the locally connected layers of [34], with a few differences: first, in the PSF layer, the same operator is applied across all channels, while in the locally-connected layer, as well as in conventional layers (excluding depth-convolution [6]), the local operator varies between the input channels. Additionally, the PSF layer does not sum the outcomes, and returns the same number of channels in the output tensor as in the input tensor.

The PSF convolutional layer, designed for the task of Depth from Defocus (DfD), is based on Eq. 5, where kernels vary between locations and are calculated “on-the-fly”, according to function , which is defined in Eq. 4. The kernel is, therefore, a local function of the object’s distance, with a blur kernel applied to out-of-focus pixels. The layer takes as input an all-in-focus image , depth-map and the camera parameters vector , which contains the aperture , the focal length and the focal depth . The layer then outputs a focused image . As mentioned before, we fix the near and far distance limits to fit each dataset and use the fixed pixel size mentioned above. The rendering process begins by first calculating the CoC-map according to Eq. 1, and then applying the functional kernel convolution defined in Eq. 5. We implement the above operation in CUDA and compute its derivative as follows:


A detailed explanation of the forward and backward pass is provided in the supplementary material.
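To give a feel for what the backward pass has to compute, the sketch below differentiates a single Gaussian kernel weight with respect to its radius and checks the result by finite differences. It is an illustration under stated assumptions only: the weight form (a 2D Gaussian whose standard deviation is the blur radius `r`) and the function names are hypothetical, and the derivative actually used by the layer is the one given in the paper and its supplementary material. In a full layer, this term would be chained with the derivative of the radius with respect to depth.

```python
import math


def gaussian_weight(r, u, v):
    """Gaussian kernel weight at offset (u, v) for blur radius r (assumed form)."""
    return math.exp(-(u * u + v * v) / (2.0 * r * r)) / (2.0 * math.pi * r * r)


def gaussian_weight_grad_r(r, u, v):
    """Analytic derivative: d(weight)/dr = weight * ((u^2 + v^2) / r^3 - 2 / r)."""
    return gaussian_weight(r, u, v) * ((u * u + v * v) / r ** 3 - 2.0 / r)


def numeric_grad_r(r, u, v, eps=1e-6):
    """Central finite difference, used only to sanity-check the analytic form."""
    return (gaussian_weight(r + eps, u, v) - gaussian_weight(r - eps, u, v)) / (2.0 * eps)


analytic = gaussian_weight_grad_r(1.5, 1, 2)
numeric = numeric_grad_r(1.5, 1, 2)
```

The two values agree to within finite-difference error, which is the kind of check one would also run against a hand-written CUDA backward kernel.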

4 Approach

In this section, we describe the training method and the model architecture, which extends the ASPP architecture to include both self-attention and dense connections. We then describe the training procedure.

4.1 General Architecture and the Training Loss

Let be a (real-world) focused version of , and be a predicted focused version of . We train a regression model to minimize the reconstruction loss of and .

We define two networks, and , for depth estimation and focus rendering respectively. While is learned, implements Eq. 4 and 5. Both networks take part in the loss, and backpropagation through is performed using Eq. 7 and 8.

The learned network is applied to an all-in-focus image and returns a predicted depth . The fixed network consists of the PSF layer, as described in Sec. 3.2. It takes as input an all-in-focus , a depth (estimated or not) and the camera parameters vector . It outputs , which is a focused version of according to depth and camera parameters . We distinguish between a rendered focused image from ground truth depth which we denote as (also used for real focused images), and a rendered focused image from predicted depth , which we denote as .

The training procedure has two cases, training with real data or on generated data, depending on the training dataset at hand. In both cases, training is performed end-to-end by running and sequentially. First, is applied to an all-in-focus image and outputs the predicted depth-map . Using this map, the all-in-focus image and camera parameters , renders the predicted focused image . A reconstruction error is then applied with and , where for the case of depth-based datasets, we render the training focused images , according to ground truth depth-map and camera specifications . Fig. 2 shows the training scheme, where the blue dashed rectangle illustrates the second case, where is rendered from the ground truth depth.

Figure 2: Training scheme. Blue region represents the rendering branch, which is used for depth-based datasets.

In the first case, since we compare with the work of [31], we use a single focused image during training, although more can be used. In the second case, we compare with fully supervised methods that benefit from direct access to the depth information, and we report results for 1, 2, 6 and 10 rendered focused images.

Training loss

We first consider the reconstruction loss and the depth smoothness [38, 14] w.r.t. the input image , the predicted focused image , the focused image , and the estimated depth map :


where is the Structural Similarity measure [38], and controls the balance w.r.t. to loss.
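As a rough sketch of how a structural similarity term can be balanced against a pixel-wise term in a reconstruction loss, the code below blends a single-window SSIM with a mean absolute error. The blend follows a form that is common in the unsupervised depth literature and is an assumption here, as are the function names, the default weight `alpha=0.85`, and the use of one global window instead of local windows; the exact loss of this paper is the one defined by its equations.

```python
def _mean(values):
    return sum(values) / len(values)


def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM between two flat lists of pixel values."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mx, my = _mean(x), _mean(y)
    vx = _mean([(a - mx) * (a - mx) for a in x])
    vy = _mean([(b - my) * (b - my) for b in y])
    cov = _mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return (((2.0 * mx * my + c1) * (2.0 * cov + c2))
            / ((mx * mx + my * my + c1) * (vx + vy + c2)))


def reconstruction_loss(pred, target, alpha=0.85):
    """Blend of an SSIM term and an L1 term; alpha is a hypothetical weight."""
    l1 = _mean([abs(a - b) for a, b in zip(pred, target)])
    return alpha * (1.0 - ssim_global(pred, target)) / 2.0 + (1.0 - alpha) * l1


focused = [0.1, 0.4, 0.8, 0.3, 0.6, 0.9]
predicted = [0.2, 0.4, 0.6, 0.3, 0.7, 0.8]
loss_same = reconstruction_loss(focused, focused)
loss_diff = reconstruction_loss(predicted, focused)
```

The loss vanishes when the predicted focused image matches the target and grows as either the structure or the intensities diverge.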

The reconstruction loss above does not take into account the blurriness in some parts of image , which arise from regions that are out of focus. We, therefore, add a sharpness measure similar to [25], which considers the sharpness of each pixel. It contains three parts: (i) the image Laplacian , (ii) the image Contrast Visibility , and (iii) the image Variance , where is the average pixel value in a window of size pixels. The sharpness measure is given by , and the loss term is:


The final loss term is then:


For all experiments, we set .

4.2 Model Architecture

Figure 3: Dense ASPP with an added attention block.

Our network is illustrated in Fig. 3. It consists of an encoder-decoder architecture, where we rely on the DeepLabV3+ [4, 5] model, which was found to be effective for semantic segmentation and depth estimation tasks [9]. The encoder has two parts: a ResNet [15] backbone and a subsequent Atrous Spatial Pyramid Pooling (ASPP) module. Unlike [9], we do not employ a pretrained ResNet, but instead learn it end-to-end.

Atrous convolutions (also called dilated convolutions) insert gaps between kernel cells to enlarge the receptive field relative to earlier layers, while keeping the number of weights constant. ASPP contains several parallel Atrous convolutions with different dilations. As advised in prior work, we also replace all pooling layers of the encoder with convolution layers with an appropriate stride.
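The effect of dilation on the receptive field can be illustrated with a one-dimensional toy example. This is a generic sketch of atrous convolution, not code from the network described here; the function names and the sample signal are made up for illustration.

```python
def atrous_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D atrous convolution (cross-correlation form, as used in CNNs)."""
    span = (len(kernel) - 1) * dilation + 1  # input extent seen by one output
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by a single dilated kernel."""
    return (kernel_size - 1) * dilation + 1


signal = list(range(10))   # a simple ramp: 0, 1, ..., 9
kernel = [1, 0, -1]        # three weights regardless of dilation

dense = atrous_conv1d(signal, kernel, dilation=1)    # spans 3 samples
dilated = atrous_conv1d(signal, kernel, dilation=3)  # spans 7 samples
```

The same three weights cover three input samples at dilation 1 and seven at dilation 3, which is why running several dilations in parallel, as ASPP does, captures context at multiple scales without adding weights per branch.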

The loss is computed at the highest resolution, to support higher quality outputs. However, to comply with GPU memory constraints, the network takes as input a downsampled image of half the original size. The network’s output is then upsampled to the original image size.

Dense ASPP with Self-Attention The original ASPP consists of three or more independent layers - average pooling followed by convolution, convolution, and four Atrous layers. Each convolution layer has 256 channels and the four outputs of these layers, along with the pool+conv layer are concatenated together to form a tensor with channel size . We propose two additional modifications from different parts of the literature: dense connections [16] and self attention [42].

We add dense connections between the convolution and all Atrous convolution layers of the ASPP module, sequentially connecting all layers from smallest to the largest dilation layer. Each layer, therefore, receives as the input tensor not just the output of the previous layer, but the concatenation of the output tensors of all preceding layers. This is illustrated as the skip connection arrows in Fig. 3.

Self-Attention aims to integrate local features with their global dependencies, and, as shown in previous work [42, 10], it improves results in image segmentation and generation. Our implementation is based on the dual-attention of [10].

The decoder part of consists of three upsampling blocks, each having three convolution layers followed by bilinear upsampling. A skip connection from a low level layer of the backbone is concatenated with the input of the second block. The output of the decoder is the predicted depth.

5 Experiments

We divide our experiments into two types, DoF supervision and DoF supervision from rendered data, as mentioned in the previous section. We further experiment with cross domain evaluation, where we evaluate our method in comparison to the state-of-the-art supervised method [9]. Here the models are trained on domain A and tested on domain B, denoted as . We show that learning depth from focus cues, while not surpassing the supervised methods (though it is comparable with the top methods on the KITTI and Make3D datasets), generalizes better, as expressed by higher scores in cross domain evaluation.

The network is trained on a single Titan-X Pascal GPU with a batch size of 3, using Adam for optimization with a learning rate of and weight decay of . The dedicated CUDA implementation of the PSF layer runs 80x faster than the optimized PyTorch implementation.

The following five benchmarks are used:

Lightfield dataset [31] The dataset contains lightfield flowers and plants images, taken with a Lytro Illum camera. From the lightfield images, we follow the procedure of [31] to generate the all-in-focus and shallow DoF images, and split the dataset into 3143 and 300 images for train and test.

DSLR dataset [3] This dataset contains 110 images and ground truth depth from indoor scenes, with 81 images for training and 29 images for testing, and 34 images from outdoor scenes without ground truth depth. Each scene is acquired with two camera apertures: and , providing focused and all-in-focus images.

KITTI [13] This benchmark contains RGB-D images taken in an outdoor environment at resolution of roughly which we refer to as the full resolution output size. The train/test splits we employ follow Eigen et al[8], with 23,000 training images and 697 test images. The input depth-maps and images are cropped, according to [8] to obtain valid depth values, and resized to half-size.

NYU DepthV2 [23] This benchmark contains about 120K indoor RGB and depth images captured with a Microsoft Kinect. The dataset consists of 249 scenes for training and 215 scenes for testing. We report results on 654 test images from a small subset of 1449 aligned RGB-depth pairs, as done in previous work.

Make3D [27, 28] The Make3D benchmark contains 534 RGB-depth pairs, split into 400 pairs for training and 134 for testing. The input images are provided at a high resolution, while the depth-maps are at low resolution. Therefore, data is resized to , as proposed by [27, 28]. Following [27], results are evaluated in two settings: for depth cap of 0-70, and for depth cap 0-80.

5.1 Results

DoF supervision

We first report results on the Lightfield dataset, which provides focused and all-in-focus image pairs with no ground truth depth. The performance is evaluated using the PSNR and SSIM measures. Our results are shown in Tab. 1. As can be seen, we significantly outperform the literature baselines provided by [31].

Algorithm                Supervision   PSNR    SSIM
Image Regression [31]    DoF           24.60   0.895
Multi-View [31]          DoF           34.49   0.960
Lightfield [31]          DoF           36.68   0.967
Compositional [31]       DoF           36.90   0.966
Ours                     DoF           38.33   0.979

Table 1: Quantitative results on the Lightfield test set, reported as a mean value of PSNR and SSIM of the reconstructed focused image.
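For reference, PSNR as used in such comparisons can be computed as in the following sketch. The function name, the flat-list image format, and the assumption that pixel values lie in [0, 1] are illustrative choices, not details from the evaluation code of the paper.

```python
import math


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between two flat lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


reference = [0.5, 0.2, 0.9, 0.4]
shifted = [0.4, 0.1, 0.8, 0.3]   # every pixel is off by 0.1

score = psnr(shifted, reference)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few dB corresponds to a substantial reduction in reconstruction error.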

Rendered DoF supervision

Table 2: KITTI: Quantitative results on the KITTI Eigen split. Top - Unsupervised methods where ‘S’ and ‘M’ stand for stereo and video (monocular) supervision, and ‘K+CS’ stands for training with the added data from the CityScapes dataset. Middle - Our method. Bottom - Supervised methods.
Table 3: Make3D: Quantitative results on Make3D [27, 28] dataset. Top - Unsupervised methods where ‘S’ and ‘M’ stand for stereo and video (monocular) supervision. Middle - Our method. Bottom - Supervised methods.
Table 4: NYU: Quantitative results on NYU V2 [23] dataset. Top - Our method. Bottom - Supervised methods.
Figure 4: KITTI: Qualitative results on the KITTI Eigen Split. All images are cropped to the valid depth region as proposed in [8]. From left to right: reference image and ground truth, Wang et al. [37], and ours.

For rendered DoF supervision, we consider four datasets [8, 27, 23, 3] with ground truth depth, where we render focused images with different focus distances. We denote by F1, F2, F6, F10 the four training setups, which differ by the number of rendered focused images used in training. The order in which focal distances are selected, is defined by the following focal sequence , where each number represents the percent of the maximum depth used for each dataset. For example, F2 employs focal distances of 0.2 and 0.8 times the maximal depth.

We perform two types of evaluations. First, we evaluate our method for each dataset with different numbers of focused images during training, and compare our results with other unsupervised methods, as well as with supervised ones. The evaluation measures are those commonly used in the literature [13, 27, 28] and include various RMSE measures and a thresholded error rate.
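The kinds of measures mentioned above can be sketched as follows. The specific choices here (linear RMSE, log RMSE, and a threshold accuracy with a ratio threshold of 1.25) follow common conventions in the monocular depth literature and are assumptions for illustration; the function names and sample values are hypothetical.

```python
import math


def rmse(pred, gt):
    """Root mean squared error in linear depth units."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def rmse_log(pred, gt):
    """Root mean squared error of log-depth (depths must be positive)."""
    return math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / len(gt)
    )


def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose ratio max(pred/gt, gt/pred) is below threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


gt_depth = [1.0, 2.0, 4.0, 8.0]
pred_depth = [1.0, 2.0, 4.0, 16.0]   # one badly overestimated pixel

err = rmse(pred_depth, gt_depth)
acc = threshold_accuracy(pred_depth, gt_depth)
```

Linear RMSE is dominated by errors at large depths, whereas the log variant and the ratio threshold treat relative errors equally across the depth range.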

Tab. 2 and 3 show that our method outperforms monocular and stereo supervision methods on the KITTI and Make3D datasets. This also holds when the previous methods are trained with additional data obtained from the Cityscapes dataset. In comparison to the depth supervised methods, we outperform all methods on KITTI, with the exception of [9], and outperform [9, 21] on Make3D. In Fig. 4, we present qualitative results of our method compared to the state-of-the-art unsupervised method [37] on the KITTI dataset. As can be seen in Tab. 4, no unsupervised methods are reported in the literature for the NYU dataset, where we are slightly outperformed by the supervised methods.

We next perform cross domain evaluation compared to the published models of the state-of-the-art supervised method [9], where training is performed on KITTI or NYU, and testing is performed on different datasets. These tests are meant to evaluate the specificity of the learned network to a particular dataset. Since the absolute depth differs between datasets, we evaluate the methods by computing the Pearson correlation metric. Results are shown in Tab. 5. As can be seen, when transferring from both KITTI and NYU, we outperform the directly supervised method. The gap is especially visible for the NYU network.
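The reason a correlation measure suits cross domain comparison is that it is unaffected by a global scale or offset in the predicted depth. A minimal sketch of the Pearson correlation between predicted and ground truth depth values is given below; the function name and sample values are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


gt_depth = [1.0, 2.5, 4.0, 7.0, 10.0]
# A prediction that is correct up to an unknown scale and offset.
rescaled = [3.0 * d + 2.0 for d in gt_depth]
# A prediction with the depth ordering reversed.
inverted = [-d for d in gt_depth]

corr_rescaled = pearson(rescaled, gt_depth)
corr_inverted = pearson(inverted, gt_depth)
```

A model trained on indoor scenes with depths of a few meters can therefore still score well on outdoor scenes if it recovers the relative depth structure.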

Transition        Algorithm   Correlation
KITTI → NYU       DORN [9]    0.423 ± 0.010
                  Ours F1     0.121 ± 0.006
                  Ours F10    0.429 ± 0.009
KITTI → Make3D    DORN [9]    0.616 ± 0.011
                  Ours F1     0.484 ± 0.019
                  Ours F10    0.642 ± 0.014
KITTI → D3Net     DORN [9]    0.145 ± 0.048
                  Ours F1     0.148 ± 0.032
                  Ours F10    0.275 ± 0.054
NYU → KITTI       DORN [9]    0.456 ± 0.006
                  Ours F1     0.567 ± 0.006
                  Ours F10    0.634 ± 0.005
NYU → Make3D      DORN [9]    0.250 ± 0.019
                  Ours F1     0.249 ± 0.032
                  Ours F10    0.456 ± 0.022
NYU → D3Net       DORN [9]    0.260 ± 0.054
                  Ours F1     0.530 ± 0.048
                  Ours F10    0.434 ± 0.052

Table 5: Quantitative results for cross domain evaluation. Models are trained on domain A and tested on domain B (A → B). Reported numbers are mean ± standard error.

We also provide cross-domain results for the outdoor images of the DSLR dataset, where no ground truth depth is provided, using the PSNR and SSIM metrics. Tab. 6 shows that, in this case, our method transfers better from NYU and only slightly better from KITTI in comparison to [9].

Table 7: A comparison on KITTI between the original ASPP and our dense ASPP with self-attention. We denote ‘D’ for Dense connections and ‘SA’ for Self-Attention. RMSE is shown for focused image stacks of different sizes.
Table 8: A comparison on KITTI dataset between different blur methods on top of our network. BF= bilateral filtering.
Figure 5: (a) , higher is better, for training F1 with different focus distance. (b) RMSE, lower is better.
Table 6: Quantitative results on the outdoor DSLR [3] test set, reported as mean value of PSNR and SSIM of the reconstructed focused image.

5.2 Ablation Studies

The Effect of Focal Distance

Because the focus distance and DoF range are positively correlated, training with a far focus distance increases the DoF and puts a large range of distances in focus. As a result, focus cues are weakened, causing performance to decrease. In Fig. 5 we present, for the Make3D dataset, the accuracy of F1 training with different focus distances. A clear decrease in performance is seen at mid-range, followed by an increase, which results from the dataset’s maximum depth capping the far DoF distance, i.e., lowering the DoF range and increasing focus cues for closer objects.
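The positive correlation between focus distance and DoF range can be checked numerically with the standard thin-lens depth-of-field formulas. The sketch below is illustrative only: the function names and the camera values (35 mm focal length, f/2.8, a CoC-limit of 0.03 mm) are assumptions and are not the settings used in these experiments.

```python
import math


def dof_limits(focus_mm, focal_mm, f_number, coc_limit_mm):
    """Near and far limits of acceptable sharpness (standard thin-lens formulas)."""
    f2 = focal_mm ** 2
    blur_term = f_number * coc_limit_mm * (focus_mm - focal_mm)
    near = focus_mm * f2 / (f2 + blur_term)
    # Beyond the hyperfocal distance, everything up to infinity is acceptably sharp.
    far = focus_mm * f2 / (f2 - blur_term) if f2 > blur_term else math.inf
    return near, far


def dof_range(focus_mm, focal_mm=35.0, f_number=2.8, coc_limit_mm=0.03):
    """Width of the in-focus range; default camera values are hypothetical."""
    near, far = dof_limits(focus_mm, focal_mm, f_number, coc_limit_mm)
    return far - near


close_focus = dof_range(2_000.0)    # focused at 2 m
mid_focus = dof_range(5_000.0)      # focused at 5 m
far_focus = dof_range(20_000.0)     # focused at 20 m
```

Under these assumed settings the in-focus range widens as the focus distance grows and becomes unbounded at 20 m, so fewer pixels carry blur and the focus cue weakens. A finite maximum scene depth caps the far limit, which is consistent with the recovery at far focus distances described above.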

Dense ASPP with Self-Attention

We evaluate our dense ASPP with self-attention in comparison to three versions of the original ASPP model: vanilla ASPP, ASPP with dense connections and ASPP with self-attention. In order to differentiate between different ambiguity scenarios, training is performed with the F1, F2, F6 and F10 methods. As can be seen in Tab. 7, our model outperforms the different ASPP versions. However, as the number of focused images increases, the gaps are reduced.

Different rendering methods

To further compare with [31], we conducted a test on the KITTI dataset, where we replaced our rendering network with their compositional rendering, and modified the last layer of our depth network to output 80 depth probabilities (similar to [31]). As shown in Tab. 8, the compositional method of [31] performs poorly on KITTI in the F1 and F2 settings.

6 Conclusion

We propose a method for learning to estimate depth from a single image, based on focus cues. Our method outperforms the similarly supervised method [31] and all other unsupervised methods in the literature. In most cases, it matches the performance of directly supervised methods when evaluated on test images from the training domain. Since focus cues are more generic than content cues, our method outperforms the state-of-the-art supervised method in cross-domain evaluation on all of the datasets we evaluated from the literature.

We introduce a differentiable PSF convolutional layer, which propagates image-based losses back to the estimated depth. We also contribute a new architecture that introduces dense connections and self-attention to the ASPP module. Our code is available as part of the supplementary material, and on GitHub.


This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of the first author is part of a Ph.D. thesis research conducted at Tel Aviv University.


  • [1] S. Anwar, Z. Hayder, and F. Porikli (2017) Depth estimation and blur removal from a single out-of-focus image. In BMVC, Cited by: §1, §2.
  • [2] Y. Cao, Z. Wu, and C. Shen (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.
  • [3] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, A. Almansa, and F. Champagnat (2018) Deep depth from defocus: how can defocus blur improve 3D estimation using dense neural networks?. 3DRW ECCV Workshop. Cited by: §3.1, §5.1, §5.1, §5.
  • [4] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §4.2.
  • [5] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611. Cited by: §1, §4.2, §4.2.
  • [6] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357. Cited by: §3.2.
  • [7] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2658. Cited by: §2.
  • [8] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. pp. 2366–2374. Cited by: §2, Figure 4, §5.1, §5.
  • [9] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. pp. 2002–2011. Cited by: §1, §1, §2, §4.2, Figure 4, §5.1, §5.1, §5.1, §5.1, Table 5, §5.
  • [10] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu (2018) Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983. Cited by: §4.2.
  • [11] R. Furukawa, R. Sagawa, and H. Kawasaki (2017) Depth estimation using structured light flow–analysis of projected pattern flow on an object’s surface–. arXiv preprint arXiv:1710.00513. Cited by: §2.
  • [12] R. Garg, V. K. BG, G. Carneiro, and I. Reid (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European Conference on Computer Vision, pp. 740–756. Cited by: §2, §2.
  • [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR). Cited by: §2, §5.1, §5.
  • [14] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. 2 (6), pp. 7. Cited by: §2, §4.1, Figure 4.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. pp. 770–778. Cited by: §2, §4.2.
  • [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Vol. 1, pp. 3. Cited by: §1, §4.2.
  • [17] Y. Kuznietsov, J. Stückler, and B. Leibe (2017) Semi-supervised deep learning for monocular depth map prediction. pp. 6647–6655. Cited by: §2, Figure 4.
  • [18] L. Ladicky, J. Shi, and M. Pollefeys (2014) Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 89–96. Cited by: §2.
  • [19] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. pp. 239–248. Cited by: §2.
  • [20] A. Levin, R. Fergus, F. Durand, and W. T. Freeman (2007) Image and depth from a conventional camera with a coded aperture. ACM transactions on graphics (TOG) 26 (3), pp. 70. Cited by: §2.
  • [21] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He (2015) Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. pp. 1119–1127. Cited by: Figure 4, §5.1.
  • [22] F. Liu, C. Shen, G. Lin, and I. D. Reid (2016) Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 38 (10), pp. 2024–2039. Cited by: §2, Figure 4.
  • [23] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §2, §5.1, Table 4, §5.
  • [24] A. B. Owen (2007) A robust hybrid of lasso and ridge regression. Contemporary Mathematics 443 (7), pp. 59–72. Cited by: §2.
  • [25] M. Pagidimarry and K. A. Babu (2011) An all approach for multi-focus image fusion using neural network. Artificial Intelligent Systems and Machine Learning 3 (12), pp. 732–739. Cited by: §4.1.
  • [26] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun (2016) Dense monocular depth estimation in complex dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4058–4066. Cited by: §2.
  • [27] A. Saxena, S. H. Chung, and A. Y. Ng (2006) Learning depth from single monocular images. In Advances in neural information processing systems, pp. 1161–1168. Cited by: §2, §2, §5.1, §5.1, Table 3, §5.
  • [28] A. Saxena, M. Sun, and A. Y. Ng (2007) Learning 3-d scene structure from a single still image. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1–8. Cited by: §2, §5.1, Table 3, §5.
  • [29] A. Saxena, M. Sun, and A. Y. Ng (2009) Make3d: learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence 31 (5), pp. 824–840. Cited by: §2.
  • [30] A. Sellent and P. Favaro (2014) Which side of the focal plane are you on?. In 2014 IEEE international conference on computational photography (ICCP), pp. 1–8. Cited by: §2.
  • [31] P. P. Srinivasan, R. Garg, N. Wadhwa, R. Ng, and J. T. Barron (2018) Aperture supervision for monocular depth estimation. pp. 6393–6401. Cited by: §1, §2, §3.1, §4.1, §5.1, §5.1, §5.2, Table 1, §5, §6.
  • [32] J. Surh, H. Jeon, Y. Park, S. Im, H. Ha, and I. S. Kweon (2017) Noise robust depth from focus using a ring difference filter. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [33] S. Suwajanakorn, C. Hernandez, and S. M. Seitz (2015) Depth from focus with your mobile phone. pp. 3497–3506. Cited by: §1, §2, §3.1.
  • [34] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708. Cited by: §3.2.
  • [35] H. Tang, S. Cohen, B. L. Price, S. Schiller, and K. N. Kutulakos (2017) Depth from defocus in the wild. pp. 4773–4781. Cited by: §1, §2, §3.1.
  • [36] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan, and J. Tumblin (2007) Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing. In ACM transactions on graphics (TOG), Vol. 26, pp. 69. Cited by: §2.
  • [37] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey (2018) Learning depth from monocular videos using direct methods. pp. 2022–2030. Cited by: §1, §2, Figure 4, §5.1.
  • [38] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §4.1.
  • [39] J. Xie, R. Girshick, and A. Farhadi (2016) Deep3d: fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In European Conference on Computer Vision, pp. 842–857. Cited by: §2.
  • [40] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe (2017) Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. 1. Cited by: §2, Figure 4.
  • [41] Z. Yin and J. Shi (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. 2. Cited by: §1, §2, Figure 4.
  • [42] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §1, §4.2, §4.2.
  • [43] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. 2 (6), pp. 7. Cited by: §2, Figure 4.
  • [44] S. Zhuo and T. Sim (2011) Defocus map estimation from a single image. Pattern Recognition 44 (9), pp. 1852–1858. Cited by: §1, §2.