
Deep Eyes: Binocular Depth-from-Focus on Focal Stack Pairs

by Xinqing Guo, et al.
University of Delaware
ShanghaiTech University

The human visual system relies on both binocular stereo cues and monocular focusness cues to gain effective 3D perception. In computer vision, the two problems are traditionally solved in separate tracks. In this paper, we present a unified learning-based technique that simultaneously uses both types of cues for depth inference. Specifically, we use a pair of focal stacks as input to emulate human perception. We first construct a comprehensive focal stack training dataset synthesized by depth-guided light field rendering. We then construct three individual networks: a FocusNet to extract depth from a single focal stack, an EDoFNet to obtain the extended depth of field (EDoF) image from the focal stack, and a StereoNet to conduct stereo matching. We then integrate them into a unified solution to obtain high quality depth maps. Comprehensive experiments show that our approach outperforms the state-of-the-art in both accuracy and speed and effectively emulates human vision systems.


1 Introduction

The human visual system relies on a variety of depth cues to gain 3D perception. The most important ones are binocular, defocus, and motion cues. Binocular cues such as stereopsis, eye convergence, and disparity yield depth from binocular vision through exploitation of parallax. The defocus cue allows depth perception even with a single eye by correlating variation of defocus blurs with the motion of the ciliary muscles surrounding the lens. Motion parallax also provides useful input to assess depth, but arrives over time and depends on texture gradients.

Computer vision algorithms such as stereo matching [34, 1] and depth-from-focus/defocus [28, 29, 23, 6, 7] seek to directly employ binocular and defocus cues which are available without scene statistics. Recent studies have shown that the two types of cues complement each other to provide 3D perception [13]. In this paper, we seek to develop learning based approaches to emulate this process.

To exploit binocular cues, traditional stereo matching algorithms rely on feature matching and optimization to maintain the Markov Random Field property: the disparity field should be smooth everywhere with abrupt changes at the occlusion boundaries. Existing solutions such as graph-cut and belief propagation [19, 39], although effective, tend to be slow. In contrast, depth-from-focus (DfF) exploits differences in sharpness at each pixel across a focal stack and assigns the layer with the highest sharpness as its depth. Compared with stereo, DfF generally yields a low-fidelity estimate due to depth layer discretization. Earlier DfF techniques use a focal sweep camera to produce a coarse focal stack due to mechanical limitations, whereas more recent ones attempt to use a light field to synthetically produce a denser focal stack.
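The classic DfF rule described above (pick, per pixel, the focal slice with the highest sharpness) can be sketched in a few lines. This is only an illustrative baseline, not the learned method of this paper: the 4-neighbour Laplacian focus measure and the function names are assumptions chosen for brevity.

```python
import numpy as np

def laplacian_sharpness(img):
    """Absolute response of a 4-neighbour Laplacian, used as a simple focus measure."""
    padded = np.pad(img, 1, mode="edge")
    lap = (padded[:-2, 1:-1] + padded[2:, 1:-1]
           + padded[1:-1, :-2] + padded[1:-1, 2:]
           - 4.0 * img)
    return np.abs(lap)

def depth_from_focus(focal_stack):
    """focal_stack: array of shape (layers, height, width), grayscale.

    Returns, for every pixel, the index of the layer with the highest sharpness.
    The result is a discrete layer index, which is why classic DfF is coarse.
    """
    sharpness = np.stack([laplacian_sharpness(s) for s in focal_stack])
    return np.argmax(sharpness, axis=0)
```

Because the output is a layer index, its depth resolution is bounded by the number of slices, which illustrates the discretization issue mentioned above.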

Our solution benefits from recent advances in computational photography, and we present an efficient and reliable learning based technique to conduct depth inference from a focal stack pair, emulating how human eyes work. We call our technique binocular DfF or B-DfF. Our approach leverages deep learning techniques that can effectively extract features learned from large amounts of imagery data. Such a deep representation has shown great promise in stereo matching [49, 48, 22]. Little work, however, has been proposed on using deep learning for DfF or, more importantly, on integrating stereo and DfF. This is mainly due to the lack of fully annotated DfF datasets.

Figure 1: BDfFNet integrates FocusNet, EDoFNet and StereoNet to predict high quality depth map from binocular focal stacks.
Figure 2: A binocular focal stack pair consists of two horizontally rectified focal stacks. The upper and lower triangles show corresponding slices focusing at respective depths. Bottom shows the ground truth color and depth images. We add Poisson noise to training data, a critical step for handling real scenes.

We first construct a comprehensive focal stack dataset. Our dataset is based on the highly diversified dataset from [24], which contains both stereo color images and ground truth disparity maps. We then adopt the algorithm from Virtual DSLR [46] to generate the refocused images. Virtual DSLR uses a color and depth image pair as input for light field synthesis and rendering, but without the need to actually create the light field. The quality of the rendered focal stacks is comparable to that of stacks captured by an expensive DSLR camera. Next, we propose three individual networks: (1) FocusNet, a multi-scale network to extract depth from a single focal stack; (2) EDoFNet, a deep network consisting of small convolution kernels to obtain the extended depth of field (EDoF) image from the focal stack; and (3) StereoNet, to obtain depth directly from a stereo pair. The EDoF image from EDoFNet serves both to guide the refinement of the depth from FocusNet and to provide inputs for StereoNet. We also show how to integrate them into a unified solution, BDfFNet, to obtain high quality depth maps. Fig. 1 illustrates the pipeline.

We evaluate our approach on both synthetic and real data. To physically implement B-DfF, we construct a light field stereo pair by using two Lytro Illum cameras. Light field rendering is then applied to produce the two focal stacks as input to our framework. Comprehensive experiments show that our technique outperforms the state-of-the-art techniques in both accuracy and speed. More importantly, we believe our solution provides important insights on developing future sensors and companion 3D reconstruction solutions analogous to human eyes.

2 Related Work

Our work is closely related to depth from focus/defocus and stereo. The strengths and weaknesses of the two approaches have been extensively discussed in [35, 43].

Depth from Focus/Defocus. Blur carries information about the object’s distance. Depth from Focus/Defocus (DfF/DfD) recovers scene depth from a collection of images captured under varying focus settings. In general, DfF [28, 29, 23] determines the depth by analyzing the most in-focus slice in the focal stack, while DfD [6, 7] infers depth based on the amount of spatially varying blur at each pixel. To avoid ambiguity in textureless regions, Moreno-Noguer et al. [26] used active illumination to project a sparse set of dots onto the scene. The defocus of the dots offers a depth cue, which could be further used for realistic refocusing. Hasinoff and Kutulakos [10] combined the focal stack with varying aperture to recover scene geometry. Moeller et al. [25] applied an efficient nonconvex minimization technique to solve DfD in a variational framework. Suwajanakorn et al. [40] proposed DfF with a mobile phone under an uncalibrated setting. They first aligned the focal stack, then jointly optimized the camera parameters and depth map, and further refined the depth map using anisotropic regularization.

A major difference between these methods and our approach is that they rely on hand-crafted features to estimate focusness or the blur kernel, whereas in this paper we leverage a neural network to learn more discriminative features from the focal stack and directly predict depth at lower computational cost.

Learning based Stereo. Depth from stereo has been studied extensively by the computer vision community for decades. We refer the readers to the comprehensive surveys [34, 1] for more details. Here we only discuss recent methods based on Convolutional Neural Networks (CNNs).

Deep learning benefits stereo matching at various stages. A number of approaches exploit CNNs to improve the matching cost. The seminal work by Žbontar and LeCun [49] computed a similarity score from patches using a CNN, then applied traditional cost aggregation and optimization to solve the energy function. Han et al. [9] jointly learned feature representations and feature comparison functions in a unified network, which improved on previous results with less storage requirement. Luo et al. [22] sped up the matching process by using a product layer, and treated disparity estimation as a multi-class classification problem. [3, 48, 21, 32] conducted similar work but with different network architectures. Alternatively, CNNs can also help predict the confidence of the disparity map to remove outliers. Seki and Pollefeys [36] leveraged a CNN for stereo confidence measure, and incorporated the predicted confidence into Semi-Global Matching by adjusting its parameters. In order to automatically generate the dataset for learning based confidence measure, Mostegel et al. [27] checked the consistency of multiple depth maps of the same scene obtained with the same stereo approach, and collected labeled confidence maps as the training data.

End-to-end network architectures have also been explored. Mayer et al. [24] adopted and extended the architecture of the FlowNet [5], which consists of a contractive part and an expanding part to learn depth at multiple scales. They also created three synthetic datasets to facilitate the training process. Knöbelreiter et al. [18] learned unary and pairwise cost of stereo using CNNs, then posed the optimization as a conditional random field (CRF) problem. The hybrid CNN-CRF model was trained in image’s full resolution in an end-to-end fashion.

Combining DfF/DfD and stereo matching has also been studied, although not within the learning framework. Early work [17, 38] attempted to utilize the depth map from focus/defocus to reduce the search space for stereo and solve the correspondence problem more efficiently. [33] simultaneously recovered depth and restored the original focused image from a defocused stereo pair. Recently, Tao et al. [42] analyzed the epipolar image (EPI) from a light field camera to infer depth. They found that the horizontal variances after vertical integration of the EPI encode the defocus cue, while the vertical variances encode the disparity cue. The two cues were then jointly optimized in an MRF framework. To obtain high resolution depth in a semi-calibrated manner, Wang et al. [44] proposed a hybrid camera system that consists of two calibrated auxiliary cameras and an uncalibrated main camera. They first transferred the depth from the auxiliary cameras to the viewpoint of the main camera by rectifying three images simultaneously, and further improved the depth map along occlusion boundaries using the defocus cue.

The aforementioned approaches leave the combination and optimization of focus and disparity cues to postprocessing. In contrast, we resort to extra network layers to infer the optimized depth efficiently, at low computational cost.

3 Dual Focal Stack Dataset

With the rapid advance of data-driven methods, numerous datasets have been created for various applications. However, resources on focal stacks remain limited. To this end, we generate our dual focal stack dataset based on FlyingThings3D from [24]. FlyingThings3D is an entirely synthetic dataset, consisting of everyday objects flying along randomized 3D paths. Their 3D models and textures are separated into disjoint training and testing parts. In total, the dataset contains about 25,000 stereo images with ground truth disparity. To make the data tractable, we select stereo frames whose largest disparity is less than 100 pixels and then normalize the disparity.

Takeda et al. [41] demonstrate that in a stereo setup, the disparity $d$ and the diameter $c$ of the circle of confusion have a linear relationship:

$$d = \frac{l}{a}\, c$$

where $l$ is the baseline length and $a$ is the aperture size. Based on the above observation, we adopt the Virtual DSLR approach from [46] to generate synthetic focal stacks. Virtual DSLR requires a color and disparity image pair as input, and outputs refocused images with quality comparable to those captured by a regular, expensive DSLR. The advantage of the algorithm is that it resembles light field synthesis and refocusing but does not require actual creation of the light field, hence reducing both memory and computational load. In addition, the method takes special care of occlusion boundaries to avoid the color bleeding and discontinuity commonly observed in brute-force blur-based defocus synthesis. To better explain the approach, we list the formulation below:


$$c = a\,\frac{|s - s_z|}{s_z} = a\, s \left| \frac{1}{z_s} - \frac{1}{z} \right|$$

To simulate a scene point with depth $z$ projected to a circular region on the sensor, we assume the focal length $f$, an aperture size $a$, sensor to lens distance $s$, and the circular region diameter $c$. Here $s_z = (1/f - 1/z)^{-1}$ and $z_s = (1/f - 1/s)^{-1}$ according to the thin lens law. The diameter of the circular region measures the size of the blur kernel, and it is linear in the absolute difference of the inverses of the distances $z$ and $z_s$. For the scope of this paper, we use only circular apertures, although more complex ones can easily be synthesized. To emulate the pupil of the eye in varying lighting conditions, we randomly select the size of the blur kernel for each stereo pair, but limit the largest diameter of the blur kernel to 31 pixels. We also evenly separate the scene into 16 depth layers and render a refocused image for each layer. After generating the focal stacks, we add Poisson noise to the images to simulate real images captured by a camera. This turns out to be critical in real scene experiments, as described in section 6.2. Finally, we split the generated dual focal stacks into 750 training data and 70 testing data. Figure 2 shows two slices from the dual focal stack and their corresponding color and depth images.
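As a small numeric illustration of the thin lens blur model described above, the sketch below computes the blur-circle diameter for a point at a given depth. The function and parameter names, as well as the example lens values, are hypothetical; only the thin lens law itself is taken from the text.

```python
def sensor_distance(focus_depth, focal_length):
    """Sensor-to-lens distance that brings focus_depth into focus (thin lens law)."""
    return 1.0 / (1.0 / focal_length - 1.0 / focus_depth)

def blur_diameter(depth, focus_depth, focal_length, aperture):
    """Diameter of the blur circle on the sensor for a point at `depth`.

    All lengths share one unit. The result is linear in the absolute
    difference of the inverse depth and the inverse in-focus depth.
    """
    s = sensor_distance(focus_depth, focal_length)
    return aperture * s * abs(1.0 / focus_depth - 1.0 / depth)

if __name__ == "__main__":
    # Hypothetical lens: 50 mm focal length, 25 mm aperture, focused at 2 m.
    for depth_mm in (1000.0, 2000.0, 4000.0):
        print(depth_mm, blur_diameter(depth_mm, 2000.0, 50.0, 25.0))
```

Sweeping `focus_depth` over a set of layers while keeping the scene fixed gives the per-pixel blur sizes needed to render one refocused slice per layer.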

4 B-DfF Network Architecture

Convolutional neural networks are very efficient at learning non-linear mapping between the input and the output. Therefore, we aim to take an end-to-end approach to predict a depth map. [37] shows that a deep network with small kernels is very effective in image recognition tasks. Although a small kernel has limited spatial support, a deep network by stacking multiple layers of such kernels could substantially enlarge the receptive field while reducing the number of parameters to avoid overfitting. Therefore, a general principle in designing our network is to use deep architecture with small convolutional kernels.

As already mentioned, the input to our neural network is a pair of rectified focal stacks. To extract depth from defocus and disparity, our solution is composed of three individual networks. We start in section 4.1 by describing FocusNet, a multi-scale network that estimates depth from a single focal stack. Then in section 4.2 we further enhance the result using the extended depth of field images from EDoFNet. Finally, we combine StereoNet and FocusNet in section 4.3 to infer high quality depth from binocular focal stacks.

4.1 FocusNet for DfF/DfD

Figure 3: FocusNet is a multi-scale network for conducting depth-from-focus.
Figure 4: Left: EDoFNet consists of 20 convolutional layers to form an extended depth-of-field (EDoF) image from the focal stack. Right: FocusNet-v2 combines FocusNet and EDoFNet by using the EDoF image to refine the depth estimation.
Figure 5: (a) StereoNet follows the Hourglass network architecture, which consists of the max pooling layer (yellow), the deconvolution layer (green) and the residual module (blue). (b) shows the detailed residual module.

Motivated by the success of multi-scale networks, we propose FocusNet, a multi-scale network to extract depth from a single focal stack. Specifically, FocusNet consists of four branches of various scales. Except for the first branch, the branches subsample the image by using different strides in the convolutional layer, enabling aggregation of information over large areas. Therefore, both the high-level information from the coarse feature maps and the fine details can be preserved. At the end of each branch, a deconvolutional layer is introduced to upsample the image to its original resolution. Compared with traditional bicubic upsampling, the deconvolution layer automatically learns upsampling kernels that are better suited for the application. Finally, we stack the multi-scale feature maps together, resulting in a concatenated per-pixel feature vector. The feature vectors are further fused by convolutional layers to predict the final depth value.

An illustration of the network architecture is shown in Fig. 3. We use $3 \times 3$ kernels for most layers except those convolutional layers used for downsampling and upsampling, where a larger kernel is used to cover more pixels. Spatial padding is also applied for each convolution layer to preserve the resolution. Following [37], the number of feature maps increases as the image resolution decreases. Between the convolutional layers we insert PReLU layers [11] to increase the network’s nonlinearity. For the input of the network we simply stack the focal stack images together along the channel dimension.

4.2 Guided Depth Refinement by EDoF Image

There exist many approaches [8, 14] to refine/upsample depth image with the guidance of an intensity image. The observation is that homogeneous texture regions often correspond to homogeneous surface parts, while depth edges often occur at high intensity variations. With this in mind, we set out to first extract the EDoF image from the focal stack, then guide the refinement of the depth image. Several methods [20, 40] have been proposed to extract the EDoF image from the focal stack. However, the post processing is suboptimal in terms of computational efficiency and elegance. Thus, we seek to directly output an EDoF image from a separate network, which we termed EDoFNet.


EDoFNet is composed of 20 convolutional layers, with PReLU as its activation function. The input of EDoFNet is the focal stack, the same as the input of FocusNet, and the output is the EDoF image. With a kernel size of $3 \times 3$, a 20-layer convolutional network produces a receptive field of $41 \times 41$, which is larger than the size of the largest blur kernel (31 pixels). Fig. 4 shows the architecture of EDoFNet.
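The receptive field argument can be checked with a short calculation. The sketch below assumes $3 \times 3$ kernels with stride 1 (an assumption here, in line with the small-kernel design of [37]); the helper name is hypothetical.

```python
def receptive_field(num_layers, kernel_size, stride=1):
    """Receptive field (pixels per axis) of a stack of identical conv layers."""
    field, jump = 1, 1
    for _ in range(num_layers):
        field += (kernel_size - 1) * jump  # each layer widens the field
        jump *= stride                     # spacing between neighbouring outputs
    return field

print(receptive_field(20, 3))  # 41
```

Under these assumptions each layer adds two pixels per axis, so 20 layers see a 41-pixel window, which covers the largest 31-pixel blur kernel used when rendering the dataset.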

Finally, we concatenate the depth image from FocusNet and the EDoF image from the EDoFNet, and fuse them by using another 10 layer convolutional network. We call the new network FocusNet-v2. The architecture of FocusNet-v2 is illustrated in Fig. 4.

4.3 StereoNet and BDfFNet for Depth from Binocular Focal Stack

Given the EDoF stereo pair from EDoFNet, we set out to estimate depth from stereo using another network, termed StereoNet. For stereo matching, it is critical to consolidate both local and global cues to generate precise pixel-wise disparity. To this end, we propose StereoNet by adopting the Hourglass network architecture [30], as shown in Fig. 5. The advantage of this network is that it can attentively evaluate the coherence of features across scales by utilizing a large number of residual modules [12]. The network is composed of a downsampling part and an upsampling part. The downsampling part consists of a series of max pooling layers interleaved with residual modules, while the upsampling part mirrors the downsampling part, with max pooling replaced by a deconvolution layer for upsampling. Between each pair of corresponding max pooling and upsampling layers, there is a connection layer comprising a residual module. Elementwise addition follows to add processed lower-level features to higher-level features. In this way, the network learns a more holistic representation of the input images. The prediction is generated at the end of the upsampling part. One round of downsampling and upsampling can be viewed as one iteration of prediction, whereas additional rounds can be stacked to refine the initial estimates. For StereoNet, we use two rounds of downsampling and upsampling as they already give good performance, while further rounds improve the result only marginally at the cost of more training time. Note that the weights are not shared between the two rounds.

Different from [30], we do not downsample input images before the first downsampling part. This stems from the difference in problem settings: our solution aims for pixel-wise precision while [30] only requires structured understanding of images. Throughout the network, we use small convolution filters ($3 \times 3$ or $1 \times 1$). After each pair of downsampling and upsampling parts, supervision is applied using the same ground truth disparity map. The final output is of the same resolution as the input images.

Finally, we construct BDfFNet by concatenating the results from StereoNet, FocusNet-v2 and EDoFNet, and adding more convolutional layers. The convolutional layers serve to find the optimal combination from focus cue and disparity cue.

5 Implementation

Optimization. Given the focal stack as input and the ground truth color/depth image as label, we train all the networks end-to-end. In our implementation, we first train each network individually, then fine-tune the concatenated network with the pre-trained weights as initialization. Because FocusNet and FocusNet-v2 contain multiple convolutional layers for downsampling, the input image needs to be cropped so that both its height and width are the nearest multiple of 8. We use the mean square error (MSE) with $\ell_2$-norm regularization as the loss for all models, which leads to the following objective function

$$\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N} \left\| F(S_i;\theta) - D_i \right\|_2^2 + \frac{\lambda}{2}\left\|\theta\right\|_2^2$$

where $S_i$ and $D_i$ are the $i$-th focal stack and depth image, $F$ is the function represented by the network, $\theta$ are the learned weights, $N$ is the number of training samples, and $\lambda$ is the regularization weight. Although there are works [50] suggesting the mean absolute error (MAE) might be a better loss function, our experiment shows that results from MAE are inferior to MSE.
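For concreteness, the loss described above can be written out numerically. This is a minimal sketch rather than the training code: it assumes an L2 (squared) weight penalty scaled by one half of a hypothetical `weight_decay` coefficient, and all names are illustrative.

```python
import numpy as np

def regularized_mse(pred, target, weights, weight_decay):
    """Mean squared error plus an L2 penalty on the network weights.

    pred, target: arrays of the same shape (predicted and ground truth images).
    weights: iterable of weight arrays; weight_decay: regularization strength.
    """
    data_term = np.mean((pred - target) ** 2)
    penalty = 0.5 * weight_decay * sum(np.sum(w ** 2) for w in weights)
    return data_term + penalty
```

Swapping `np.mean((pred - target) ** 2)` for `np.mean(np.abs(pred - target))` gives the MAE variant that the text reports as performing worse in these experiments.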

Following [15], we apply batch normalization after the convolution layer and before the PReLU layer. We initialize the weights using the technique from [11]. We employ MXNet [2] as the learning framework and train and test the networks on an NVIDIA K80 graphics card. We make use of the Adam optimizer [16] and set the weight decay to 0.002, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. The initial learning rate is set to 0.001. We first train each sub-network of BDfFNet separately and then combine them for further training. All the networks are trained for 80 epochs.
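To make the optimizer settings concrete, the following sketch performs one Adam update [16] for a single scalar parameter with the learning rate and moment coefficients quoted above. It is a textbook form of the update, not the MXNet implementation used for training; weight decay is left out, and `eps` is an assumed default.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient magnitude.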

Data augmentation and preprocessing. For FocusNet and EDoFNet, the size of the analyzed patches determines the largest sensible blur kernel size. Therefore, we randomly crop from the image a patch that contains enough contextual information to extract the depth and EDoF image. For StereoNet, a larger patch is used to accommodate the large disparity between stereo images. To facilitate the generalization of the network, we augment the data by flipping the patches horizontally and vertically. All the data augmentations are performed on the fly at almost no extra cost. Finally, the range of all images is normalized.
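The cropping and flipping steps for a single focal stack might look like the sketch below. The patch size is left as a parameter and the helper names are hypothetical; the snippet also covers only the single-stack case, since flipping a stereo pair horizontally would additionally require handling the left/right views.

```python
import numpy as np

def crop_to_multiple(img, multiple=8):
    """Crop height and width down to the nearest multiple of `multiple`."""
    h, w = img.shape[:2]
    return img[: h - h % multiple, : w - w % multiple]

def random_patch(stack, depth, patch, rng):
    """Random crop plus random flips, applied identically to input and label.

    stack: (H, W, C) focal stack with slices stacked along channels.
    depth: (H, W) ground truth depth. patch: side length of the square crop.
    """
    h, w = depth.shape
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    s = stack[top:top + patch, left:left + patch]
    d = depth[top:top + patch, left:left + patch]
    if rng.random() < 0.5:  # horizontal flip
        s, d = s[:, ::-1], d[:, ::-1]
    if rng.random() < 0.5:  # vertical flip
        s, d = s[::-1], d[::-1]
    return s, d
```

Both operations are slicing views, which is consistent with the remark that augmentation on the fly adds almost no cost.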

Figure 6: Results of our EDoFNet. First row shows two slices of the focal stack focusing at different depth. Second and third row show the EDoF and ground truth image respectively.

6 Experiments

Figure 7: Comparisons on FocusNet vs. FocusNet-v2, i.e., without and with the guide of an all-focus image.

6.1 Extract the EDoF Image from Focal Stack

We train EDoFNet on a single focal stack of 16 slices. Although the network has a simple structure, the output EDoF image is of high quality. Our network also runs much faster than conventional methods based on global optimization: at the evaluated image resolution it runs at 4 frames per second. Fig. 6 shows the result of EDoFNet. Compared with the ground truth image, the produced EDoF image is slightly blurry. However, given a very noisy focal stack as input, the resulting EDoF image removes a large part of the noise. Our experiments also show that the EDoF image suffices to guide the refinement of the depth image and to serve as the input of StereoNet.

6.2 Depth Estimation from Focal Stack

As mentioned in section 4.2, to construct FocusNet-v2, we first train FocusNet and EDoFNet separately, then concatenate their outputs with more fusion layers and train the combination. Fig. 7 shows the results of both FocusNet and FocusNet-v2. We observe that FocusNet produces results with splotchy artifacts, and depth bleeds across object boundaries. FocusNet-v2, however, utilizes the EDoF color image to assist depth refinement, alleviating the artifacts and leading to clearer depth boundaries. It is worth noting that we also trained a network with a structure identical to FocusNet-v2 from scratch, but the result is of inferior quality. We suspect the advantage of FocusNet-v2 is due to the good initialization provided by the pre-trained models.

We compare our results with [40] and [25] using the data provided by the authors of [40]. We select 16 images from their focal stack for DfF. Fig. 8 illustrates the results. Our FocusNet-v2 is capable of predicting disparity value with higher quality, while using significantly less time (0.9 second) than [40] (10 mins) and [25] (4 seconds).

We also train FocusNet-v2 on a clean dataset without Poisson noise. It performs better on synthetic data, but exhibits severe noise patterns on real images, as shown in Fig. 9. The experiment confirms the necessity of adding noise to the dataset to simulate real images.
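One common way to add Poisson (shot) noise to a rendered image is sketched below. The `peak` parameter, which sets the simulated photon count for a full-intensity pixel, is a hypothetical knob; the noise level used for the dataset is not stated in the text.

```python
import numpy as np

def add_poisson_noise(img, peak=255.0, rng=None):
    """Add signal-dependent Poisson noise to an image with values in [0, 1].

    peak: assumed photon count for a pixel of value 1.0 (lower means noisier).
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = rng.poisson(img * peak) / peak
    return np.clip(noisy, 0.0, 1.0)
```

Because the variance of Poisson noise grows with the signal, bright regions receive more absolute noise than dark ones, unlike additive Gaussian noise.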

Figure 8: Comparisons on depth estimation from a single focal stack using our FocusNet-v2 (last column) vs. [40] (second column) and [25] (third column). FocusNet-v2 is able to maintain smoothness on flat regions while preserving sharp occlusion boundaries.
Figure 9: Results from FocusNet-v2 trained on the clean dataset without Poisson noise.

6.3 Depth Estimation from Stereo and Binocular Focal Stack

Figure 10 shows the results from StereoNet and BDfFNet. Compared with FocusNet-v2, StereoNet gives better depth estimation. This is expected since StereoNet takes binocular focal stacks as input, while FocusNet-v2 only uses a single focal stack. However, StereoNet exhibits blocky artifacts and overly smoothed boundaries. In contrast, the depth prediction from BDfFNet features sharp edges. The depth in flat surface regions is also smoother than that of FocusNet-v2.

Table 1 reports the mean absolute error (MAE) and running time of all models.

Figure 10: Comparisons on results only using StereoNet vs. the composed BDfFNet. BDfFNet produces much sharper boundaries while reducing blocky artifacts.

| | FocusNet | FocusNet-v2 | StereoNet | BDfFNet |
|---|---|---|---|---|
| MAE | 0.045 | 0.031 | 0.024 | 0.021 |
| Time (s) | 0.6 | 0.9 | 2.8 | 9.7 |

Table 1: MAE and running time of models.
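The MAE metric in Table 1 is the per-pixel absolute difference between predicted and ground truth disparity, averaged over the image. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def mean_absolute_error(pred, gt):
    """Average absolute per-pixel difference between prediction and ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt)))
```

The value is expressed in the same units as the disparity maps being compared.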

6.4 Real Scene Experiment

We further conduct tests on real scenes. To physically implement B-DfF, we construct a light field stereo pair by using two Lytro Illum cameras, as illustrated in Fig. 12. Compared with stereo focal sweeping, the Lytro pair can conduct high quality post-capture refocusing without the need for accurately synchronized mechanical control of the focal length. In our experiment the two light field cameras share the same configuration, including the zoom and focus settings. The raw images are preprocessed using the Light Field Toolbox [4]. Finally, we conduct refocusing using the shift-and-add algorithm [31] to synthesize the focal stack.
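The shift-and-add idea can be sketched as follows: each sub-aperture view is shifted in proportion to its angular offset from the central view, and the shifted views are averaged. This is a simplified illustration with integer shifts and wrap-around borders (via `np.roll`), not the exact procedure of [31]; the array layout and names are assumptions.

```python
import numpy as np

def shift_and_add(light_field, slope):
    """Refocus a light field by shifting and averaging its views.

    light_field: array (U, V, H, W) of grayscale sub-aperture views.
    slope: pixel shift per unit of angular offset; it selects the focal plane.
    """
    U, V, H, W = light_field.shape
    uc, vc = (U - 1) / 2.0, (V - 1) / 2.0
    out = np.zeros((H, W), dtype=float)
    for u in range(U):
        for v in range(V):
            dy = int(round(slope * (u - uc)))
            dx = int(round(slope * (v - vc)))
            # np.roll wraps pixels around the border, a simplification here.
            out += np.roll(light_field[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (U * V)

def focal_stack(light_field, slopes):
    """One refocused slice per slope, stacked as (len(slopes), H, W)."""
    return np.stack([shift_and_add(light_field, s) for s in slopes])
```

Calling `focal_stack` with 16 evenly spaced slopes would yield a 16-slice stack of the kind the networks take as input; points whose disparity matches the slope align across views and appear sharp, while others are averaged into blur.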

Figure 11: Comparisons of real scene results from FocusNet-v2, StereoNet and BDfFNet.

Figure 11 shows the predicted depth from FocusNet-v2, StereoNet and BDfFNet. Results show that BDfFNet benefits from both FocusNet-v2 and StereoNet to offer smoother depth with sharp edges. The experiments also demonstrate that models learned from our dataset could be transferred to predict real scene depth.

Figure 12: To emulate our B-DfF setup, we combine a pair of Lytro Illum cameras into a stereo setup.

7 Discussions and Future Work

Our Deep Eyes solution exploits efficient learning and computational light field imaging to infer depth from a focal stack pair. Our technique mimics the human vision system, which simultaneously employs binocular stereo matching and monocular depth-from-focus. Comprehensive experiments show that our technique is able to produce high quality depth estimation orders of magnitude faster than the prior art. In addition, we have created a large dual focal stack database with ground truth disparity.

Our current implementation limits the network input to focal stacks of 16 layers. In our experiments, we have shown that it is able to produce high fidelity depth estimation under our setup. To handle denser focal stacks, one possibility is to concatenate all images in the stack as a 3D $(x, y, s)$ focal cube or volume [51], where $x$ and $y$ are the width and height and $s$ is the index of a layer. We can then downsample the volume along the $s$ dimension to 16 slices using light field compression or simplification techniques such as tensor [45] and triangulation [47]. Another important future direction we plan to explore is to replace one of the two focal stacks with an all-focus image. This would further reduce the computational cost for constructing the network but would require adjusting the architecture. Finally, aside from computer vision, we hope our work will stimulate significant future work in human perception and the biological nature of human eyes.

8 Acknowledgments

This project was supported by the National Science Foundation under grant CBET-1706130.


  • [1] M. Z. Brown, D. Burschka, and G. D. Hager. Advances in computational stereo. TPAMI, 25(8):993–1008, 2003.
  • [2] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
  • [3] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang. A deep visual correspondence embedding model for stereo matching costs. In ICCV, pages 972–980, 2015.
  • [4] D. Dansereau, O. Pizarro, and S. Williams. Decoding, calibration and rectification for lenselet-based plenoptic cameras. In CVPR, pages 1027–1034, 2013.
  • [5] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, pages 2758–2766, 2015.
  • [6] P. Favaro and S. Soatto. A geometric approach to shape from defocus. TPAMI, 27(3):406–417, 2005.
  • [7] P. Favaro, S. Soatto, M. Burger, and S. J. Osher. Shape from defocus via diffusion. TPAMI, 30(3):518–531, 2007.
  • [8] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. In ICCV, pages 993–1000, 2013.
  • [9] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In CVPR, pages 3279–3286, 2015.
  • [10] S. W. Hasinoff and K. N. Kutulakos. Confocal stereo. International Journal of Computer Vision, 81(1):82–104, 2009.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [13] R. Held, E. Cooper, and M. Banks. Blur and disparity are complementary cues to depth. Current Biology, 22(5):426 – 431, 2012.
  • [14] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In ECCV, 2016.
  • [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [17] W. N. Klarquist, W. S. Geisler, and A. C. Bovik. Maximum-likelihood depth-from-defocus for active vision. In International Conference on Intelligent Robots and Systems, pages 374–379 vol.3, 1995.
  • [18] P. Knöbelreiter, C. Reinbacher, A. Shekhovtsov, and T. Pock. End-to-end training of hybrid cnn-crf models for stereo. arXiv preprint arXiv:1611.10229, 2016.
  • [19] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In ECCV, pages 82–96, 2002.
  • [20] S. Kuthirummal, H. Nagahara, C. Zhou, and S. K. Nayar. Flexible depth of field photography. TPAMI, 33(1):58–71, 2011.
  • [21] Z. Liu, Z. Li, J. Zhang, and L. Liu. Euclidean and hamming embedding for image patch description with convolutional networks. In CVPR Workshops, pages 72–78, 2016.
  • [22] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In CVPR, pages 5695–5703, 2016.
  • [23] A. S. Malik, S. O. Shim, and T. S. Choi. Depth map estimation using a robust focus measure. In ICIP, pages 564–567, 2007.
  • [24] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pages 4040–4048, 2016.
  • [25] M. Moeller, M. Benning, C. B. Schoenlieb, and D. Cremers. Variational depth from focus reconstruction. IEEE Transactions on Image Processing, 24(12):5369–5378, 2015.
  • [26] F. Moreno-Noguer, P. N. Belhumeur, and S. K. Nayar. Active refocusing of images and videos. ACM Trans. Graph., 26(3), 2007.
  • [27] C. Mostegel, M. Rumpler, F. Fraundorfer, and H. Bischof. Using self-contradiction to learn confidence measures in stereo vision. In CVPR, pages 4067–4076, 2016.
  • [28] S. K. Nayar. Shape from focus system. In CVPR, pages 302–308, 1992.
  • [29] S. K. Nayar and Y. Nakagawa. Shape from focus. TPAMI, 16(8):824–831, 1994.
  • [30] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499, 2016.
  • [31] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan. Light field photography with a hand-held plenoptic camera. Stanford University Computer Science Tech Report, 2:1–11, 2005.
  • [32] H. Park and K. M. Lee. Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters, 2016.
  • [33] A. N. Rajagopalan, S. Chaudhuri, and U. Mudenagudi. Depth estimation and image restoration using defocused stereo pairs. TPAMI, 26(11):1521–1525, 2004.
  • [34] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3), 2002.
  • [35] Y. Y. Schechner and N. Kiryati. Depth from defocus vs. stereo: How different really are they? International Journal of Computer Vision, 39(2):141–162, 2000.
  • [36] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In BMVC, volume 10, 2016.
  • [37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [38] M. Subbarao, T. Yuan, and J. Tyan. Integration of defocus and focus analysis with stereo for 3d shape recovery. In Proc. SPIE, volume 3204, pages 11–23, 1997.
  • [39] J. Sun, N. N. Zheng, and H. Y. Shum. Stereo matching using belief propagation. TPAMI, 25(7):787–800, 2003.
  • [40] S. Suwajanakorn, C. Hernandez, and S. M. Seitz. Depth from focus with your mobile phone. In CVPR, pages 3497–3506, 2015.
  • [41] Y. Takeda, S. Hiura, and K. Sato. Fusing depth from defocus and stereo with coded apertures. In CVPR, pages 209–216, 2013.
  • [42] M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi. Depth from combining defocus and correspondence using light-field cameras. In ICCV, pages 673–680, 2013.
  • [43] V. Vaish, M. Levoy, R. Szeliski, C. L. Zitnick, and S. B. Kang. Reconstructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures. In CVPR, pages 2331–2338, 2006.
  • [44] T. C. Wang, M. Srikanth, and R. Ramamoorthi. Depth from semi-calibrated stereo and defocus. In CVPR, pages 3717–3726, 2016.
  • [45] S. Wanner and B. Goldluecke. Globally consistent depth labeling of 4d light fields. In CVPR, pages 41–48, 2012.
  • [46] Y. Yang, H. Lin, Z. Yu, S. Paris, and J. Yu. Virtual DSLR: high quality dynamic depth-of-field synthesis on mobile platforms. In Digital Photography and Mobile Imaging XII, pages 1–9, 2016.
  • [47] Z. Yu, X. Guo, H. Ling, A. Lumsdaine, and J. Yu. Line assisted light field triangulation and stereo matching. In ICCV, pages 2792–2799, 2013.
  • [48] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, pages 4353–4361, 2015.
  • [49] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, pages 1592–1599, 2015.
  • [50] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2017.
  • [51] C. Zhou, D. Miau, and S. K. Nayar. Focal sweep camera for space-time refocusing. Technical Report, Department of Computer Science, Columbia University, 2012.