Learning to Autofocus

04/26/2020 ∙ by Charles Herrmann, et al. ∙ Google ∙ Cornell University

Autofocus is an important task for digital cameras, yet current approaches often exhibit poor performance. We propose a learning-based approach to this problem, and provide a realistic dataset of sufficient size for effective learning. Our dataset is labeled with per-pixel depths obtained from multi-view stereo, following "Learning single camera depth estimation using dual-pixels". Using this dataset, we apply modern deep classification models and an ordinal regression loss to obtain an efficient learning-based autofocus technique. We demonstrate that our approach provides a significant improvement compared with previous learned and non-learned methods: our model reduces the mean absolute error by a factor of 3.6 over the best comparable baseline algorithm. Our dataset and code are publicly available.


1 Introduction

In a scene with variable depth, any camera lens with a finite-size aperture can only focus at one scene depth (the focus distance), and the rest of the scene will contain blur. This blur is difficult to remove via post-processing, and so selecting an appropriate focus distance is crucial for image quality.

There are two main, independent tasks that a camera must address when focusing. First, the camera must determine the salient region that should be in focus. The user may choose such a region explicitly, e.g., by tapping on the screen of a smartphone, or it may be detected automatically by, for example, a face detector. Second, given a salient region (which camera manufacturers often refer to as “autofocus points”) and one or more possibly out-of-focus observations, the camera must predict the most suitable focus distance for the lens that brings that particular region into focus. This second task is called autofocus.

Conventional autofocus algorithms generally fall into two major categories: contrast-based and phase-based methods. Contrast-based methods define a sharpness metric, and identify the ideal focus distance by maximizing the sharpness metric across a range of focus distances. Such methods are necessarily slow in practice, as they must make a large number of observations, each of which requires physical lens movement. In addition, they suffer from a few important weaknesses, which we discuss in Section 4.

Modern phase-based methods leverage disparity from the dual-pixel sensors that are increasingly available on smartphones and DSLR cameras. These sensors are essentially two-view plenoptic cameras [27] with left and right sub-images that receive light from the two halves of the aperture. These methods operate under the assumption that in-focus objects will produce similar left and right sub-images, whereas out-of-focus objects will produce sub-images with a displacement or disparity that is proportional to the degree of defocus. Naively, one could search for the focus distance that minimizes the left/right mismatch, like the contrast-based methods. Alternatively, some methods use calibration to model the relationship between disparity and depth, and make a prediction with just one input. However, accurate estimation of disparity between the dual-pixel sub-images is challenging due to the small effective baseline. Further, it is difficult to characterize the relationship between disparity and depth accurately due to optical effects that are hard to model, resulting in errors [10].

In this paper, we introduce a novel learning-based approach to autofocus: a ConvNet that takes as input raw sensor data, optionally including the dual-pixel data, and predicts the ideal focus distance. Deep learning is well-suited to this task, as modern ConvNets are able to utilize subtle defocus cues (such as irregularly-shaped point spread functions) in the data that often mislead heuristic contrast-based autofocus methods. Unlike phase-based methods, a learned model can also directly estimate the position the lens should be moved to, instead of determining it from disparity using a hand-crafted model and calibration, a strategy which may be prone to errors.

In order to train and evaluate our network, we also introduce a large and realistic dataset captured using a smartphone camera and labeled with per-pixel depth computed using multi-view stereo. The dataset consists of focal stacks: sequences of image patches of the same scene, varying only in focus distance. We will formulate the autofocus problem precisely in Section 3, but note that the output of autofocus is a focal index which specifies one of the patches in the focal stack. Both regular and dual-pixel raw image data are included, allowing evaluation of both contrast- and phase-based methods. Our dataset is larger than most previous efforts [4, 15, 24], and contains a wider range of realistic scenes. Notably, we include outdoor scenes (which are particularly difficult to capture with a depth sensor like Kinect) as well as scenes with different levels of illumination.

We show that our models achieve a significant improvement in accuracy on all versions of the autofocus problem, especially on challenging imagery. On our test set, the best baseline algorithm that takes one frame as input produces a mean absolute error of 11.3 (out of 49 possible focal indices). Our model with the same input has an error of 3.1, and thus reduces the mean absolute error by a factor of 3.6.

2 Related Work

There has been surprisingly little work in the computer vision community on autofocus algorithms. There are a number of non-learning techniques in the image processing literature [5, 47, 46, 20, 21], but the only learning approach [24] uses classical instead of deep learning.

A natural way to use computer vision techniques for autofocus would be to first compute metric depth. Within the vast body of literature on depth estimation, the most closely related work of course relies on focus.

Most monocular depth techniques that use focus take a complete focal stack as input and then estimate depth by scoring each focal slice according to some measure of sharpness [17, 26, 41]. Though acquiring a complete focal stack of a static scene with a static camera is onerous, these techniques can be made tractable by accounting for parallax [39]. More recently, deep learning-based methods [15] have yielded improved results with a full focal stack approach.

Instead of using a full focal stack, some early work attempted to use the focal cues in just one or two images to estimate depth at each pixel, by relating the apparent blur of the image to its disparity [11, 29], though these techniques are necessarily limited in their accuracy compared to those with access to a complete focal stack. Both energy minimization [40] and deep learning [4, 36] have also been applied to single-image approaches for estimating depth from focus, with significantly improved accuracy. Similarly, much progress has been made in the more general problem of using learning for monocular depth estimation using depth cues besides focus [9, 34], including dual-pixel cues [43, 10].

In this work, we address the related problem of autofocus by applying deep learning. A key aspect of the autofocus problem is that commodity focus modules require a single focus estimate to guide them, and this estimate may have only a tenuous connection with a predicted depth map due to hardware issues (see Section 4). Many algorithms predict non-metric depth maps, making the task harder, e.g., scale-invariant monocular depth prediction [9] or affine-invariant depth prediction using dual-pixel data [10]. Hence, instead of predicting a dense depth map, we directly predict a single estimate of focal depth that can be used to guide the focus module. This prediction is done end to end with deep learning.

3 Problem Formulation

(a) Single-Slice
(b) Focal Stack
(c) Two-Step
Figure 1: Three different autofocus subproblems; in each, the goal is to estimate the in-focus slice by taking the argmax (orange) of a set of scores produced for each possible focal slice (blue). In the single-slice problem (a), the algorithm is given a single observed slice (red). In the focal stack problem (b), the algorithm is given the entire stack. In the multi-step problem (c), here shown with just two steps, the problem is solved in stages: given an initial lens position and image, we decide where to focus next, obtain a new observation, and then make a final estimate of the in-focus slice using both observed images.

In the natural formulation of the autofocus problem, the lens can move continuously, producing an infinite set of possible focus distances corresponding to different focal planes. We discretize the continuous lens positions into $n$ focus distances, and from each position we extract an image patch $I_k$, $k \in \{1, \dots, n\}$, corresponding to the region of interest. We assume the location of the patch has been determined by a user or some external saliency algorithm, and so we consider this image patch to be “the image” and will refer to it as such throughout the paper. Further, the image can contain either the dual-pixel subimages as two channels or just the green channel, depending on the type of input being considered. We refer to the set of images obtained at different focus distances $\{I_k\}$ as a focal stack, an individual image $I_k$ as a focal slice, and $k$ as the focal index. We assume each focal stack has exactly one focal index $k^*$ whose slice is in focus.

Standard autofocus algorithms can be naturally partitioned according to the number of focal slices they require as input. For example, contrast-based methods often require the entire focal stack (or a large subset), whereas phase-based or depth-from-defocus algorithms can estimate a focus distance given just a single focal slice. Motivated by the differences in input space among standard autofocus algorithms, we define three representative sub-problems (visualized in Figure 1), which all try to predict the correct focal index but vary based primarily on their input.

Focal Stack:

$$f : \{I_k \mid k = 1, \dots, n\} \mapsto k^* \qquad (1)$$

This is the simplest formulation, where the algorithm is given a completely observed focal stack. Algorithms of this type typically define a sharpness or contrast metric and pick the focal index which maximizes the chosen metric.

Single Slice:

$$f : I_k \mapsto k^* \quad \forall k \in \{1, \dots, n\} \qquad (2)$$

This is the most challenging formulation, as the algorithm is given only a single, random focal slice, which can be thought of as the starting position of the lens. In this formulation, algorithms generally try to estimate blur size or use geometric cues to estimate a measure of depth that is then translated to a focal index.

Multi-Step:

$$f_1 : I_{k_0} \mapsto k_1, \quad f_2 : I_{k_0}, I_{k_1} \mapsto k_2, \quad \dots, \quad f_m : I_{k_0}, \dots, I_{k_{m-1}} \mapsto k_m \qquad (3)$$

where $k_0 \in \{1, \dots, n\}$, and $m$ is a predetermined constant controlling the total number of steps. The multi-step problem is a mix between the previous two problems. The algorithm is given an initial focal index, acquires and analyzes the image at that focus distance, and then is permitted to move to an additional focal index of its choice, repeating the process at most $m$ times. This formulation approximates the online problem of moving the lens to the correct position with as few attempts as possible. This multi-step formulation resembles the “hybrid” autofocus algorithms that are often used by camera manufacturers, in which a coarse focus estimate is produced by some phase-based system (or a direct depth sensor if available) which is then refined by a contrast-based solution that uses a constrained and abbreviated focal stack as input.
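The multi-step formulation can be read as a short control loop. The following is a minimal Python sketch under stated assumptions: `capture` and `predict` are hypothetical callables (they are not taken from the paper's released code) standing in for the camera and for whatever predictor chooses the next focal index, and indices are 0-based.

```python
from typing import Callable, Dict, List, Tuple


def multi_step_autofocus(
    capture: Callable[[int], object],
    predict: Callable[[Dict[int, object]], int],
    start_index: int,
    num_steps: int,
) -> Tuple[int, List[int]]:
    """Make at most `num_steps` predictions, observing one new slice per step."""
    # Observations so far, keyed by focal index (dicts keep insertion order).
    observed: Dict[int, object] = {start_index: capture(start_index)}
    visited: List[int] = [start_index]
    for step in range(num_steps):
        # Decide where to focus next from everything seen so far.
        next_index = predict(observed)
        visited.append(next_index)
        # Only capture again if another prediction step remains.
        if step < num_steps - 1 and next_index not in observed:
            observed[next_index] = capture(next_index)
    return visited[-1], visited


# Toy stand-ins (hypothetical): the "image" is the signed offset to the true index.
TRUE_INDEX = 20


def toy_capture(index: int) -> int:
    return TRUE_INDEX - index


def toy_predict(observed: Dict[int, int]) -> int:
    last_index, offset = list(observed.items())[-1]
    return last_index + offset


final_index, path = multi_step_autofocus(
    toy_capture, toy_predict, start_index=5, num_steps=2
)
```

The toy predictor is deliberately trivial; in practice `predict` would be a learned model or a hybrid phase/contrast heuristic.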

4 Autofocus Challenges

We now describe the challenges in real cameras that make the autofocus problem hard in practice. With the thin-lens and paraxial approximations, the amount of defocus blur is specified by

$$\frac{L f}{1 - f/g} \left| \frac{1}{g} - \frac{1}{Z} \right| \qquad (4)$$

where $L$ is the aperture size, $f$ the focal length, $Z$ the depth of a scene point and $g$ the focus distance (Figure 2(a)). $g$ is related to the distance $g_o$ between the lens and the sensor by the thin-lens equation. This implies that if the depth $Z$ is known, one can focus, i.e., reduce the defocus blur to zero, by choosing an appropriate $g$, which can be achieved by physically adjusting the distance $g_o$ between the lens and the sensor. This suggests that recovering depth ($Z$) is sufficient to focus. Dual-pixel sensors can aid in the task of finding $Z$ as they produce two images, each of which sees a slightly different viewpoint of the scene (Figure 2(b)). The disparity $d$ between these viewpoints [10] is

$$d = \alpha \frac{L f}{1 - f/g} \left( \frac{1}{g} - \frac{1}{Z} \right) \qquad (5)$$

where $\alpha$ is a constant of proportionality.
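To make the thin-lens model concrete, here is a small Python sketch. It assumes the standard thin-lens form of the blur, $\frac{L f}{1 - f/g}\left|\frac{1}{g} - \frac{1}{Z}\right|$, with aperture $L$, focal length $f$, focus distance $g$ and depth $Z$, and a disparity equal to a constant $\alpha$ times the signed version of the same quantity. The numeric values are hypothetical and only roughly phone-like.

```python
def sensor_distance(focal_length: float, focus_distance: float) -> float:
    """Lens-to-sensor distance g_o from the thin-lens equation 1/f = 1/g + 1/g_o."""
    return 1.0 / (1.0 / focal_length - 1.0 / focus_distance)


def signed_defocus(aperture: float, focal_length: float,
                   focus_distance: float, depth: float) -> float:
    """Signed blur size: zero in focus, sign flips between near and far points."""
    scale = aperture * focal_length / (1.0 - focal_length / focus_distance)
    return scale * (1.0 / focus_distance - 1.0 / depth)


def defocus_blur(aperture: float, focal_length: float,
                 focus_distance: float, depth: float) -> float:
    """Magnitude of the defocus blur under the thin-lens approximation."""
    return abs(signed_defocus(aperture, focal_length, focus_distance, depth))


def dual_pixel_disparity(alpha: float, aperture: float, focal_length: float,
                         focus_distance: float, depth: float) -> float:
    """Dual-pixel disparity, proportional to the signed defocus."""
    return alpha * signed_defocus(aperture, focal_length, focus_distance, depth)


# Hypothetical values, all in metres: 4.4 mm lens, 2.4 mm aperture, focused at 1 m.
APERTURE, FOCAL_LENGTH, FOCUS_DISTANCE = 0.0024, 0.0044, 1.0

blur_in_focus = defocus_blur(APERTURE, FOCAL_LENGTH, FOCUS_DISTANCE, depth=1.0)
blur_near = defocus_blur(APERTURE, FOCAL_LENGTH, FOCUS_DISTANCE, depth=0.5)
disparity_near = dual_pixel_disparity(1.0, APERTURE, FOCAL_LENGTH, FOCUS_DISTANCE, 0.5)
disparity_far = dual_pixel_disparity(1.0, APERTURE, FOCAL_LENGTH, FOCUS_DISTANCE, 2.0)
```

Under this idealized model, a scene point at the focus distance has zero blur and zero disparity, and points in front of and behind the focal plane produce disparities of opposite sign.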

(a) Ordinary Sensor
(b) Dual-Pixel Sensor
Figure 2: Cameras (a) focus by moving the sensor or lens, and only produce sharp images at a single depth ($g$ in this case). Dual-pixel sensors (b) split each pixel into two halves that each collect light from the two halves of the lens, which aids autofocus.

This theoretical model is often used in the academic pursuit of autofocus (or more often, depth-from-defocus) algorithms. However, the paraxial and thin lens approximations are significant simplifications of camera hardware design and of the physics of image formation. Here we detail some of the issues ignored by this model and existing approaches, and explain how they are of critical importance in the design of an effective, practical autofocus algorithm.

Unrealistic PSF Models.

One core assumption underlying contrast-based algorithms is that, as the subject being imaged moves further out of focus, the high-frequency image content corresponding to the subject is reduced. The assumption that in-focus content results in sharp edges while out-of-focus content results in blurry edges has only been shown to be true for Gaussian point spread functions (PSFs) [23, 49]. However, this assumption can be broken by real-world PSFs, which may be disc- or hexagon-shaped with the goal of producing an aesthetically pleasing “bokeh”, or may have some irregular shape that defies characterization, as a side effect of the hardware and cost constraints of modern smartphone camera construction. In the case of a disc-shaped PSF, for example, an out-of-focus delta function may actually have more gradient energy than an in-focus delta function, especially when pixels are saturated (see Figure 3).

(a) Im
(b) Blur
(c) Disc
Figure 3: Many contrast-based autofocus algorithms return the focus distance that maximizes image sharpness, measured here as the norm of the image gradient $\lVert \nabla I \rVert$. This works well for some camera PSFs, as a sharp image (such as the saturated delta function in (a)) will likely have more gradient energy than the same image seen out of focus under a Gaussian PSF (such as in (b)). But actual cameras tend to have irregular PSFs that more closely resemble discs than Gaussians, and as a result an out-of-focus image may have a higher gradient energy than an in-focus image (such as the delta function convolved with a disc filter in (c)). This is one reason why simple contrast-based autofocus algorithms often fail in practice.
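The saturated point-light case can be reproduced numerically. The sketch below is an illustration under assumptions, not the paper's experiment: the scene is a single very bright pixel, the sensor clips at 1, the PSF is an ideal disc of radius 4 pixels, and sharpness is the L2 norm of finite-difference gradients. All names and constants are hypothetical.

```python
import numpy as np


def disc_kernel(radius: int) -> np.ndarray:
    """Normalised disc-shaped PSF of the given pixel radius."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    mask = (x ** 2 + y ** 2 <= radius ** 2).astype(np.float64)
    return mask / mask.sum()


def circular_convolve(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Small-kernel convolution by shift-and-add (wraps at the borders)."""
    radius = kernel.shape[0] // 2
    out = np.zeros_like(image)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * np.roll(image, (i - radius, j - radius), axis=(0, 1))
    return out


def gradient_norm(image: np.ndarray) -> float:
    """L2 norm of the finite-difference image gradient."""
    dx = np.diff(image, axis=1)
    dy = np.diff(image, axis=0)
    return float(np.sqrt((dx ** 2).sum() + (dy ** 2).sum()))


# A point light whose radiance is far above the sensor's clipping level of 1.
scene = np.zeros((33, 33))
scene[16, 16] = 1000.0

in_focus = np.clip(scene, 0.0, 1.0)
out_of_focus = np.clip(circular_convolve(scene, disc_kernel(4)), 0.0, 1.0)

sharp_energy = gradient_norm(in_focus)
disc_energy = gradient_norm(out_of_focus)
```

Because the defocused light still saturates every pixel inside the disc, the out-of-focus image is a hard-edged disc with a longer boundary than the single in-focus pixel, so `disc_energy` exceeds `sharp_energy` and a contrast maximizer would prefer the wrong slice.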

Noise in Low Light Environments.

Images taken in dim environments often contain significant noise, a problem that is exacerbated by the small aperture sizes and small pixel pitch of consumer cameras [14]. Prior work in low-light imaging has noted that conventional autofocus algorithms systematically break in such conditions [22]. This appears to be due to the gradient energy resulting from sensor noise randomly happening to exceed that of the actual structure in the image, which causes contrast-based autofocus algorithms (which seek to maximize contrast) to be misled. See Figure 4 for a visualization of this issue.

(a) Contrast metric
(b) Predicted
(c) Ground truth
Figure 4: Image noise misleads contrast-based focus measures, making it difficult to focus in low light. There is no obvious peak in a contrast measure (a) applied to the noisy patches in (b) and (c). As a result, the argmax index results in patch (b) that is out of focus, instead of the in-focus ground-truth patch (c), which contains subtle high-frequency texture.

Focal Breathing.

A camera’s field of view depends on its focus distance, a phenomenon called focal breathing (sometimes also referred to as focus breathing or lens breathing). This occurs because conventional cameras focus by changing the distance between the image plane and the lens, which induces a zoom-like effect as shown in Figure 5. This effect can be problematic for contrast-based autofocus algorithms, as edges and gradients can leave or enter the field of view of the camera over the course of a focal sweep, even when the camera and scene are stationary. While it is possible to calibrate for focal breathing by modeling it as a zoom and crop, applying such a calibration increases latency, may be inaccurate due to unknown radial distortion, and may introduce resampling artifacts that interfere with contrast-based metrics.

(a) Optics
(b) Focused, gradient energy = 0.88
(c) Unfocused, gradient energy = 1.02
Figure 5: The optics of image formation mean that modifying the focus of a lens causes “focal breathing”: a change of the camera’s field of view. Consider light from two points that is being imaged at three different focus distances, as in the top of (a). Because the light is spreading away from the center of the sensor, focusing causes the positions of the points on the imaging plane to shift inwards as the distance between the imaging plane and lens (i.e., focus distance) decreases. This occurs in real image patches and can mislead contrast-based metrics: the in-focus image patch (b) has less gradient energy than the out-of-focus image patch (c) because edges move in and out of the patch when focusing. (Gradient energy is only computed within the red rectangles of (b) and (c).)

Hardware Support.

Nearly all smartphone cameras use voice coil motors (VCMs) to focus: the lens sits within a barrel, where it is attached to a coil spring and positioned near an electromagnet, and the electromagnet’s voltage is adjusted to move the lens along the 1D axis of the spring and barrel, thereby changing the focus distance of the camera. Though VCMs are inexpensive and ubiquitous, they pose a number of issues for the design of an autofocus or depth-from-defocus algorithm. 1) Most VCM autofocus modules are “open loop”: a voltage can be specified, but it is not possible to determine the actual metric focus distance that is then induced by this voltage. 2) Due to variation in temperature, the orientation of the lens relative to gravity, cross talk with other components (e.g., the coils and magnets in the optical image stabilization (OIS) module), and simple wear-and-tear on the VCM’s spring, the mapping from a specified voltage to its resulting metric focus distance can be grossly inaccurate. 3) The lens may move “off-axis” (perpendicular to the spring) during autofocus due to OIS, changing both the lens’s focus distance and its principal point.

Unknown and uncalibrated PSFs, noise, focal breathing, and the large uncertainty in how the VCM behaves make it difficult to manually engineer a reliable solution to the autofocus problem. This suggests a learning-based approach using a modern neural network.

5 Dataset

Our data capture procedure generally follows the approach of [10], with the main difference being that we capture and process focal stacks instead of individual in-focus captures. Specifically, we use the smartphone camera synchronization system of [1] to synchronize captures from five Google Pixel 3 devices arranged in a cross pattern (Figure 6(a)). We capture a static scene with all five cameras at 49 focal depths sampled uniformly in inverse depth space from 0.102 meters to 3.91 meters. We jointly estimate intrinsics and extrinsics of all cameras using structure from motion [13], and then compute depth (Figure 6(c)) for each image using a modified form of the multi-view stereo pipeline of [10]. We sample fixed-size patches at a regular stride from the central camera capture, yielding focal stacks of 49 slices each. We then calculate the ground-truth index for each stack by taking the median of the corresponding stack in the associated depth maps and finding the focal index with the closest focus distance in inverse-depth space. The median is robust to errors in depth and a reasonable proxy for other sources of ground truth that might require more effort, e.g., manual annotation. We then filter these patches by the median confidence of the depth maps. Please see the supplemental material for more details.
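The ground-truth labeling rule described above can be sketched in a few lines of Python. This is an illustration under assumptions: the 49 focus distances are taken to be exactly uniform in inverse depth between 0.102 m and 3.91 m, index 0 is assumed to be the nearest focus distance, and `ground_truth_index` is a hypothetical helper name.

```python
import numpy as np

NUM_SLICES = 49
NEAR_M, FAR_M = 0.102, 3.91

# Focus distances sampled uniformly in inverse depth (1/metres).
# Assumption: index 0 is the nearest focus distance.
inverse_focus = np.linspace(1.0 / NEAR_M, 1.0 / FAR_M, NUM_SLICES)


def ground_truth_index(depth_patch: np.ndarray) -> int:
    """Focal index whose focus distance is closest, in inverse depth,
    to the median depth of the patch."""
    median_depth = float(np.median(depth_patch))
    return int(np.argmin(np.abs(inverse_focus - 1.0 / median_depth)))


# A patch at roughly 1 m with one grossly wrong depth estimate.
clean_patch = np.full((3, 3), 1.0)
noisy_patch = clean_patch.copy()
noisy_patch[0, 0] = 100.0

clean_index = ground_truth_index(clean_patch)
noisy_index = ground_truth_index(noisy_patch)
```

The single outlier does not move the median, so both patches receive the same label, which is the robustness property the text attributes to the median.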

Our dataset has 51 scenes, with 10 stacks per scene containing different compositions, for a total of 443,800 patches. These devices capture both RGB and dual-pixel data. Since autofocus is usually performed on raw sensor data (and not a demosaiced RGB image), we use only the raw dual-pixel data and their sum, which is equivalent to the raw green channel. To generate a train and test set, we randomly selected 5 scenes out of the 51 to be the test set; as such, our train set contains 460 focal stacks (387,000 patches) and our test set contains 50 (56,800 patches).
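A scene-level split like the one described can be sketched as follows. The seed and the helper name `split_scenes` are hypothetical; the sketch only illustrates holding out whole scenes, so that patches from one scene never appear in both sets, and checks the stack counts quoted above.

```python
import random
from typing import List, Tuple


def split_scenes(num_scenes: int = 51, num_test: int = 5,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly hold out whole scenes for testing."""
    rng = random.Random(seed)
    test = sorted(rng.sample(range(num_scenes), num_test))
    held_out = set(test)
    train = [scene for scene in range(num_scenes) if scene not in held_out]
    return train, test


STACKS_PER_SCENE = 10
train_scenes, test_scenes = split_scenes()
num_train_stacks = len(train_scenes) * STACKS_PER_SCENE
num_test_stacks = len(test_scenes) * STACKS_PER_SCENE
```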

Our portable capture rig allows us to capture a semantically diverse dataset with focal stacks from both indoor and outdoor scenes using a consumer camera (Figure 6), making the dataset one of the first of its kind. Compared to other datasets primarily intended for autofocus [4, 24], our dataset is substantially larger, a key requirement for deep learning techniques. Our dataset is comparable in size to [15], which uses a Lytro for lightfield capture and a Kinect for metric depth. However, we have significantly more scenes (51 vs 12) and use a standard phone camera instead of a plenoptic camera. The latter has a lower resolution than our dual-pixel data, and “focal stacks” generated by algorithmic refocusing do not exhibit issues such as focal breathing, hardware control, noise, PSFs, etc., which are present when focusing a standard camera. These issues are some of the core challenges of autofocus, as described above in Section 4.

(a) Our capture rig
(b) RGB
(c) Depth
(d) Example focal stacks
Figure 6: Our portable rig (a) with 5 synchronized cameras, similar to the one in [10], allows us to capture outdoor scenes (b) and compute ground truth depth (c) using multi-view stereo. In (d) we show 7 of the 49 slices from three focal stacks at different depths corresponding to the patches marked in (b). The ground truth patches (the in-focus patches according to our estimated depth) are marked in yellow.

6 Our Model

We build our model upon the MobileNetV2 architecture [32], which has been designed to take as input a conventional 3-channel RGB image. In our use case, we need to represent a complete focal stack, which contains 49 images. We encode each slice of the focal stack as a separate channel, so the model can reason about each image in the focal stack. In our experiments where we give the model access to dual pixel data, each image in the focal stack is a 2-channel image where the channels correspond to the left and right dual-pixel images respectively. In our ablations where the model is deprived of dual-pixel data, each image in the focal stack is a 1-channel image that contains the sum of the left and right views (which is equivalent to the green channel of the raw RGB image). To accommodate this much “wider” number of channels in the input to our network, we increase the number of channels by 4 times the original amount (width multiplier of 4) to prevent a contraction in the number of channels between the input and the first layer. In practice, the network runs quickly: 32.5 ms on a flagship smartphone.

In the setup where the full focal stack is available as input, the model is given a 98-channel tensor (49 slices with 2 dual-pixel views each) for dual-pixel data, and a 49-channel tensor for traditional green-channel sensor data. In the task where only one focal slice is observable, we use one-hot encoding along the channel dimension as input: the input is a 98-channel tensor (or 49 for green-channel only input) where the channels that correspond to unobserved slices in the focal stack are all zeros. We use this same encoding in the first step of our multi-step model, but we add an additional one-hot encoding for each subsequent step of the model, thereby giving the model access to all previously-observed images in the focal stack. We train this network by taking a completed single-slice network and evaluating it on all possible focal stacks and input indices. We then feed a new network this one-hot encoding, so the new network sees the first input index and the prediction of the single-slice network.
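The channel-wise encoding of observed slices can be sketched as below. The channel layout (the two dual-pixel views of slice `k` stored in channels `2k` and `2k + 1`) and the 8×8 patch size are assumptions made for illustration; the text only specifies that unobserved slices are all zeros.

```python
from typing import Dict, Tuple

import numpy as np

NUM_SLICES = 49
VIEWS = 2  # left and right dual-pixel views


def encode_observed_slices(observed: Dict[int, np.ndarray],
                           patch_size: Tuple[int, int]) -> np.ndarray:
    """Place each observed slice in its own channels; all others stay zero.

    `observed` maps a 0-based focal index to an (H, W, VIEWS) array.
    """
    height, width = patch_size
    stack = np.zeros((height, width, NUM_SLICES * VIEWS), dtype=np.float32)
    for index, image in observed.items():
        stack[:, :, index * VIEWS:(index + 1) * VIEWS] = image
    return stack


# Single-slice input: only focal index 7 has been observed.
single = encode_observed_slices({7: np.ones((8, 8, VIEWS))}, (8, 8))

# Second step of a multi-step model: indices 7 and 30 have been observed.
two_step = encode_observed_slices(
    {7: np.ones((8, 8, VIEWS)), 30: np.ones((8, 8, VIEWS))}, (8, 8)
)
```

With this layout the network can tell which focal index an observation came from purely from where the non-zero channels sit.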

We model autofocus as an ordinal regression problem: we treat each focal index as its own discrete distinct class, but we assume that there is an ordinal relationship between the class labels corresponding to each focal index (e.g., index 6 is closer to index 7 than it is to index 15). The output of all versions of our network is 49 logits. We train our model by minimizing the ordinal regression loss of [8], which is similar to the cross-entropy used by traditional logistic regression against unordered labels, but where instead of calculating cross-entropy with respect to a Kronecker delta function representing the ground-truth label, that delta function is convolved with a Laplacian distribution. This encourages the model to make predictions that are as close as possible to the ground-truth, while using traditional cross-entropy would incorrectly model any prediction other than the ground-truth (even those immediately adjacent) as being equally costly.
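The soft-label idea can be illustrated with a short NumPy sketch. This follows the description above (cross-entropy against a delta function smoothed by a Laplacian-shaped kernel) and is not claimed to match the exact loss of [8]; the `scale` parameter and the function names are hypothetical.

```python
import numpy as np


def soft_ordinal_target(true_index: int, num_classes: int = 49,
                        scale: float = 1.0) -> np.ndarray:
    """Laplacian-shaped soft label centred on the ground-truth focal index."""
    indices = np.arange(num_classes)
    weights = np.exp(-np.abs(indices - true_index) / scale)
    return weights / weights.sum()


def ordinal_loss(logits: np.ndarray, true_index: int, scale: float = 1.0) -> float:
    """Cross-entropy between the soft label and softmax(logits)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    target = soft_ordinal_target(true_index, logits.shape[0], scale)
    return float(-(target * log_probs).sum())


def peaked_logits(index: int, num_classes: int = 49) -> np.ndarray:
    """Logits of a model that is confident about a single focal index."""
    logits = np.zeros(num_classes)
    logits[index] = 10.0
    return logits


TRUE_INDEX = 6
loss_adjacent = ordinal_loss(peaked_logits(7), TRUE_INDEX)   # off by one
loss_distant = ordinal_loss(peaked_logits(15), TRUE_INDEX)   # off by nine
```

Unlike plain cross-entropy against a one-hot label, which would penalize both wrong predictions identically, this loss is smaller for the off-by-one prediction than for the distant one.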

For training, we use Adam [18] with default parameters, a batch size of 128, and 20k global steps. For the ordinal regression loss, we use the L2 cost metric of [8] with a coefficient of 1.

7 Results

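Columns: input type, algorithm, fraction of patches with focal-index error = 0, ≤ 1, ≤ 2, ≤ 4 (higher is better), then MAE and RMSE (lower is better).
Input Algorithm =0 ≤1 ≤2 ≤4 MAE RMSE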
I* DCT Reduced Energy Ratio [21] 0.034 0.082 0.122 0.186 18.673 22.855
I* Total Variation (L1) [25, 31] 0.048 0.136 0.208 0.316 15.817 21.013
I* Histogram Entropy [19] 0.087 0.230 0.326 0.432 14.013 20.223
I* Modified DCT [20] 0.033 0.091 0.142 0.235 15.713 20.197
I* Gradient Count () [19] 0.109 0.312 0.453 0.612 9.543 16.448
I* Gradient Count () [19] 0.126 0.347 0.493 0.645 9.103 16.218
I* DCT Energy Ratio [6] 0.110 0.286 0.410 0.554 9.556 15.286
I* Eigenvalue Trace [44] 0.116 0.303 0.434 0.580 8.827 14.594
I* Intensity Variance [19] 0.116 0.303 0.434 0.580 8.825 14.593
I* Intensity Coefficient of Variation 0.125 0.327 0.469 0.624 8.068 13.808
I* Percentile Range () [33] 0.110 0.293 0.422 0.570 8.404 13.761
I* Percentile Range () [33] 0.123 0.326 0.470 0.633 7.126 12.312
I* Percentile Range () [33] 0.134 0.347 0.502 0.672 6.372 11.456
I* Total Variation (L2) [31] 0.167 0.442 0.611 0.770 5.488 11.409
I* Sum of Modified Laplacian [26] 0.209 0.524 0.706 0.852 4.169 9.781
I* Diagonal Laplacian [42] 0.210 0.528 0.709 0.857 4.006 9.467
I* Laplacian Energy [38] 0.208 0.520 0.701 0.852 3.917 9.062
I* Laplacian Variance [28] 0.195 0.496 0.672 0.832 3.795 8.239
I* Mean Local Log-Ratio () 0.220 0.559 0.751 0.906 2.652 6.396
I* Mean Local Ratio () [16] 0.220 0.559 0.751 0.906 2.645 6.374
I* Mean Local Norm-Dist-Sq () 0.219 0.562 0.752 0.907 2.526 5.924
I* Wavelet Sum () [48] 0.210 0.547 0.752 0.918 2.392 5.650
I* Mean Gradient Magnitude [41] 0.210 0.545 0.747 0.915 2.359 5.284
I* Wavelet Variance () [48] 0.198 0.522 0.731 0.906 2.398 5.105
I* Gradient Magnitude Variance [28] 0.205 0.536 0.739 0.909 2.374 5.103
I* Wavelet Variance () [48] 0.162 0.429 0.636 0.854 2.761 5.006
I* Wavelet Ratio () [45] 0.161 0.430 0.640 0.862 2.706 4.856
I* Mean Wavelet Log-Ratio () 0.208 0.544 0.753 0.927 2.191 4.843
I* Mean Local Ratio () [16] 0.221 0.570 0.772 0.931 2.072 4.569
I* Wavelet Ratio () [45] 0.199 0.527 0.734 0.911 2.265 4.559
I* Mean Local Log-Ratio () 0.221 0.571 0.772 0.931 2.067 4.554
I* Wavelet Sum () [48] 0.170 0.458 0.672 0.888 2.446 4.531
I* Mean Local Norm-Dist-Sq () 0.221 0.572 0.770 0.929 2.056 4.395
I* Mean Local Ratio () [16] 0.210 0.550 0.755 0.927 2.085 4.309
I* Mean Local Log-Ratio () 0.211 0.551 0.755 0.927 2.083 4.305
I* Mean Wavelet Log-Ratio () 0.169 0.458 0.672 0.891 2.358 4.174
I* Mean Local Norm-Dist-Sq () 0.212 0.555 0.760 0.928 2.059 4.164
I* Our Model 0.233 0.600 0.798 0.957 1.600 2.446
D* Normalized SAD [12] 0.166 0.443 0.636 0.819 4.280 8.981
D* Ternary Census (L1, ) [37] 0.171 0.450 0.633 0.802 4.347 8.794
D* Normalized Cross-Correlation [2, 12] 0.168 0.446 0.639 0.824 4.149 8.740
D* Rank Transform (L1) [50] 0.172 0.451 0.633 0.811 4.138 8.558
D* Census Transform (Hamming) [50] 0.179 0.473 0.663 0.842 3.737 8.126
D* Ternary Census (L1, ) [37] 0.178 0.472 0.664 0.841 3.645 7.804
D* Normalized Envelope (L2) [3] 0.155 0.432 0.633 0.856 2.945 5.665
D* Normalized Envelope (L1) [3] 0.165 0.448 0.653 0.870 2.731 5.218
D* Our Model 0.241 0.606 0.807 0.955 1.611 2.674
D1 ZNCC Disparity with Calibration 0.064 0.181 0.286 0.448 8.879 12.911
D1 SSD Disparity [43] 0.097 0.262 0.393 0.547 7.537 11.374
D1 Learned Depth [10] 0.108 0.289 0.428 0.586 7.176 11.351
D1 Our Model 0.164 0.455 0.653 0.885 2.235 3.112
I1 Our Model 0.115 0.318 0.597 0.691 4.321 6.737
Table 1: Results of our model and baselines on the test set for four different versions of the autofocus problem. The leftmost column indicates the problem type: in I*, the full focal stack of green-channel images is passed to the algorithm; in D*, the full focal stack of dual-pixel data is passed; in D1, a randomly chosen dual-pixel focal slice is passed; and in I1, a randomly chosen green-channel slice is passed. Results are sorted by RMSE independently for each input type. The top three techniques for each metric are highlighted, with single-slice techniques grouped together. Results for the two D1 baselines based on prior work were computed on patches inside a 1.5x crop of the entire image.
(a) Dual-pixel input
(b) Baseline
(c) Ours
(d) GT
Figure 7: Qualitative results using Learned Depth [10] and our D1 model. Given a defocused dual-pixel patch (a), the baseline predicts out-of-focus slices (b); our model predicts in-focus slices (c) that are similar to the ground truth (d).
(a) Input original (b) Input brightened (c) Baseline (d) Ours (e) GT
Figure 8: Qualitative results on low-light examples using ZNCC disparity as baseline and our D1 model on an example patch for a dark scene. The images have been brightened for visualization.
(a) Input stack I* 
(b) Baseline
(c) Ours
(d) GT
Figure 9: Qualitative result on an example patch (a) for I*. All images are passed as input. The output (b) from the I* baseline Mean Local Norm-Dist-Sq () is out of focus. There is less dark image content in the output due to focal breathing, which fools the contrast-based baseline. The output (c) from our I* model is the same as the ground truth (d).

We demonstrate that our approach is better than numerous baselines on several variants of the autofocus problem.

We use error metrics similar to those of the Middlebury stereo dataset [35]: the fraction of patches whose predicted focal indices have an error of no more than 0, 1, 2, or 4, as well as the mean absolute error (MAE) and root mean square error (RMSE). For the focal stack problem, all algorithms are run on all elements of the test set and aggregated. For the single-slice problem, an algorithm is run on each focal slice $I_k$ for all focal indices $k$. For the multi-step problem, each patch in the test set is evaluated 49 different times, with different focal indices acting as the starting position.
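These metrics are straightforward to compute from predicted and ground-truth focal indices. A minimal NumPy sketch follows; the function and key names are hypothetical.

```python
from typing import Dict, Sequence

import numpy as np


def autofocus_metrics(predicted: Sequence[int],
                      ground_truth: Sequence[int]) -> Dict[str, float]:
    """Within-threshold accuracies, MAE and RMSE over focal-index errors."""
    error = np.abs(np.asarray(predicted, dtype=np.float64)
                   - np.asarray(ground_truth, dtype=np.float64))
    # Fraction of patches whose absolute index error is at most 0, 1, 2 or 4.
    metrics = {f"within_{t}": float(np.mean(error <= t)) for t in (0, 1, 2, 4)}
    metrics["mae"] = float(error.mean())
    metrics["rmse"] = float(np.sqrt((error ** 2).mean()))
    return metrics


# Three patches with absolute errors of 0, 1 and 4 focal indices.
example = autofocus_metrics(predicted=[3, 5, 10], ground_truth=[3, 6, 14])
```

RMSE weights large misses more heavily than MAE, which is why the two columns can rank algorithms differently.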

We compare our model’s performance against a wide range of baselines. For the baselines labeled as I* , we take all images (i.e., the sum of the two dual-pixel images) from the input focal stack, evaluate a sharpness metric for each image, and then take the top-scoring image as the predicted focal depth for the stack. This is basically contrast-based depth-from-defocus. We take the top performing techniques from a recent survey paper [30].
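A contrast-based baseline of this kind reduces to scoring every slice and taking the argmax. The sketch below uses gradient energy as the sharpness metric, chosen here only as a simple stand-in for the surveyed metrics, and evaluates it on a synthetic focal stack built by repeatedly box-blurring a random texture; all names and the synthetic data are hypothetical.

```python
from typing import List

import numpy as np


def gradient_energy(image: np.ndarray) -> float:
    """Sum of squared finite differences (with wrap-around at the borders)."""
    dx = image - np.roll(image, 1, axis=1)
    dy = image - np.roll(image, 1, axis=0)
    return float((dx ** 2).sum() + (dy ** 2).sum())


def contrast_autofocus(focal_stack: List[np.ndarray]) -> int:
    """Return the index of the slice with the highest sharpness score."""
    scores = [gradient_energy(focal_slice) for focal_slice in focal_stack]
    return int(np.argmax(scores))


def box_blur(image: np.ndarray, times: int) -> np.ndarray:
    """Apply a separable 3-tap box blur `times` times (stand-in for defocus)."""
    out = image.copy()
    for _ in range(times):
        out = (np.roll(out, 1, axis=0) + out + np.roll(out, -1, axis=0)) / 3.0
        out = (np.roll(out, 1, axis=1) + out + np.roll(out, -1, axis=1)) / 3.0
    return out


# Synthetic stack: blur grows with distance from the in-focus index.
rng = np.random.default_rng(0)
texture = rng.random((32, 32))
IN_FOCUS = 3
synthetic_stack = [box_blur(texture, abs(k - IN_FOCUS)) for k in range(7)]

predicted_index = contrast_autofocus(synthetic_stack)
```

On this idealized, noise-free stack the argmax recovers the in-focus index; the failure modes discussed in Section 4 arise when the blur is not a smooth low-pass filter or when noise dominates the gradients.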

The baselines labeled as D* use dual-pixel images as input. Instead of maximizing contrast, they attempt to identify which slice in the dual-pixel focal stack has the most similar-looking left and right sub-images, under the assumption that the two sub-images of an in-focus image are identical. Because there is little prior work on dual-pixel autofocus or depth-from-focus using the entire focus stack, we use classical techniques in stereo image-matching to produce a similarity metric between the left and right images that we maximize.

Finally, the D1 baselines try to predict the in-focus index given only one dual-pixel image pair. These baselines compute a disparity between the left and right views. As these baselines lack the global knowledge of the entire focal stack, they require calibration mapping this disparity to focus distances in the physical world. This calibration is spatially-varying and typically less accurate in the periphery of the field-of-view [43]. Two of the baselines based on prior work only work in the center 1.5x crop of the image. We evaluate these baselines only in the crop region. This only helps those baselines, as issues like focal breathing and irregular PSFs are worse at the periphery. Please see the supplemental material for a description of the baselines.

7.1 Performance

Table 1 presents our model’s performance for the full-focal green (I*), full-focal dual-pixel (D*), single-slice green (I1), and single-slice dual-pixel (D1) problems. Our D1 model significantly outperforms other single-slice algorithms, with an RMSE of 3.112 compared to the closest baseline value of 11.351, and an MAE of 2.235 compared to 7.176. In other words, baselines were wrong on average by 14.6% of the focal sweep, whereas our learned model was wrong by only 4.5%. We also demonstrate improved performance for the full-focal sweep problem, with an MAE of 1.60 compared to 2.06 for Mean Local Norm-Dist-Sq. Our D* model also outperforms the baselines in its category but performs about the same as our I* model: despite having better within-0, within-1, and within-2 scores, it has slightly worse MAE and RMSE. In a visual comparison, we observed that both of our full-focal models produced patches which were visually very similar to the ground truth and were rarely blatantly incorrect. This suggests that both I* and D* have enough information to make an accurate prediction; as such, the additional information in D* does not provide a significant advantage.

7.2 Multi-step

Table 2 presents the results for the multi-step problem. Two D1 baselines were extended into multi-step algorithms by re-evaluating them on the results of the previous run’s output. Both improve substantially from the additional step. In particular, these algorithms are more accurate on indices with less defocus blur (indices close to the ground truth). The first step serves to move the algorithm from a high blur slice to a lower blur slice and the second step then fine-tunes. We see similar behavior from our I1 model, which also improves substantially in the second step. We attribute this gain to the model solving the focus-blur ambiguity which we discuss more in Section 7.4. Our D1 model improves but by a smaller amount than other techniques, likely because it already has high performance in the first step. It also gains much less information from the second slice than the I1 model since there is no ambiguity to resolve.
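The re-evaluation loop used to extend single-slice baselines can be sketched as follows. This is only an illustration: `multi_step_focus` and `toy_predictor` are hypothetical names, and the toy predictor simply assumes an error proportional to the current defocus, which mimics the observation that predictions are more accurate near the in-focus slice.

```python
def multi_step_focus(predict_index, start_index, num_steps=2, num_slices=49):
    """Repeatedly move the lens to the index predicted from the current slice."""
    index = start_index
    visited = [index]
    for _ in range(num_steps):
        # Re-run the single-slice predictor on the slice selected in the previous step,
        # clamping the result to the valid range of focal indices.
        index = max(0, min(num_slices - 1, predict_index(index)))
        visited.append(index)
    return visited


def toy_predictor(current_index, true_index=30):
    """Hypothetical predictor whose error is a quarter of the current defocus."""
    offset = current_index - true_index
    return true_index + int(offset / 4)
```

Starting from index 2 with the toy predictor, the first step lands on index 23 (a lower-blur slice) and the second step refines this to index 29.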

Accuracy columns (≤0, ≤1, ≤2, ≤4): higher is better. MAE and RMSE: lower is better.
Algorithm                            # of steps  ≤0     ≤1     ≤2     ≤4     MAE    RMSE
D1 ZNCC Disparity with Calibration   1           0.064  0.181  0.286  0.448  8.879  12.911
D1 ZNCC Disparity with Calibration   2           0.100  0.278  0.426  0.617  6.662  10.993
D1 Learned Depth [10]                1           0.108  0.289  0.428  0.586  7.176  11.351
D1 Learned Depth [10]                2           0.172  0.433  0.618  0.802  3.876  7.410
D1 Our model                         1           0.164  0.455  0.653  0.885  2.235  3.112
D1 Our model                         2           0.201  0.519  0.723  0.916  1.931  2.772
I1 Our model                         1           0.115  0.318  0.597  0.691  4.321  6.737
I1 Our model                         2           0.138  0.377  0.567  0.807  2.855  4.088
Table 2: Multi-step problem. Note that the D1 Learned Depth model uses a 1.5x center crop on the images it evaluates; it is therefore evaluated on a subset of the test set that generally has fewer artifacts (e.g., focal breathing, radial distortion, etc.).

7.3 Performance with Registration

As stated in Section 4, focal breathing can cause errors in contrast-based techniques. Here, we estimate the magnitude of this problem by registering the focal stack to compensate for focal breathing and then re-evaluating the algorithms on the registered focal stack.

Accuracy columns (≤0, ≤1, ≤2, ≤4): higher is better. MAE and RMSE: lower is better.
Algorithm                        ≤0     ≤1     ≤2     ≤4     MAE    RMSE
I* Mean Local Ratio () [16]      0.222  0.578  0.776  0.932  2.181  5.184
I* Mean Local Log-Ratio ()       0.222  0.579  0.776  0.932  2.176  5.178
I* Mean Local Norm-Dist-Sq ()    0.221  0.576  0.773  0.928  2.202  5.097
I* Mean Local Ratio () [16]      0.212  0.565  0.773  0.940  1.923  3.920
I* Mean Local Log-Ratio ()       0.213  0.566  0.774  0.941  1.916  3.917
I* Wavelet Sum () [48]           0.194  0.520  0.731  0.922  2.019  3.558
I* Mean Wavelet Log-Ratio ()     0.185  0.504  0.718  0.922  2.003  3.239
I* Our Model                     0.251  0.610  0.809  0.957  1.570  2.529
Table 3: Ablation study with regard to registration. Existing techniques perform better when the focal stack has undergone a simple registration. However, our model trained on the registered data still performs better than the baselines.

Theoretically, the change in FoV due to focal breathing can be removed using a zoom-and-crop registration calibrated by the camera’s focal distance. However, in practice, this registration is far from perfect and can introduce artifacts into the scene. Additionally, any noise in the measurement of focal distance means that a calibration-based registration may be imperfect. To evaluate this approach, we tested two different registrations: a zoom-and-crop registration calibrated by reported focal distance, and a grid search over zoom-and-crop registration parameters to minimize the L2 difference between the images. We note that both of these techniques led to registrations that eliminated some but not all of the change in FOV.
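The grid-search variant can be illustrated with a small sketch. It assumes a nearest-neighbour zoom about the image center and a search over a single zoom factor; the actual interpolation scheme and parameter grid are not specified in the text, and all function names here are hypothetical.

```python
def zoom_and_crop(image, zoom):
    """Zoom about the image center by `zoom`, keeping the original size (nearest neighbour)."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = []
    for y in range(h):
        # Map each output row back to its source row, clamped to the image bounds.
        sy = min(h - 1, max(0, round(cy + (y - cy) / zoom)))
        row = []
        for x in range(w):
            sx = min(w - 1, max(0, round(cx + (x - cx) / zoom)))
            row.append(image[sy][sx])
        out.append(row)
    return out


def l2_difference(a, b):
    """Sum of squared pixel differences between two equally sized images."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def register_by_grid_search(moving, reference, zoom_candidates):
    """Return the candidate zoom whose zoom-and-crop of `moving` best matches `reference`."""
    return min(
        zoom_candidates,
        key=lambda z: l2_difference(zoom_and_crop(moving, z), reference),
    )
```

In this toy setting the search recovers the zoom exactly when the reference was produced by the same warp; on real focal stacks the residual is never zero, which is consistent with the imperfect registrations noted above.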

Table 3 shows the performance of a model we trained and the best contrast techniques on the registered data. Most of the contrast algorithms improved when run on the registered focal stack, with their MAE improving by approximately 0.1. This suggests that focal breathing affects their performance. In addition, our model trained and evaluated on registered data outperforms our model trained and evaluated on non-registered data.

7.4 Single-slice Focus-blur Ambiguity

Figure 10: I1 and D1 model predictions for a patch given focal slice 25 as input. The I1 model outputs a bimodal distribution, as it struggles to disambiguate between the focal indices in front of and behind the current slice that can generate the same focus blur. The D1 distribution is unimodal, as dual-pixel data helps disambiguate between the two. The I1 model’s predictions for different input slices of the same patch are also visualized. For focal slices towards the near or the far end, the model predicts correctly because one of the two candidate indices lies outside the valid range, whereas the ambiguity is problematic for input slices in the middle. Inside this problematic range, the model tends to predict focal indices corresponding to depths which, while on the wrong side of the in-focus plane, would produce the same size circle of confusion.

In the single-slice problem, an algorithm given only the green channel faces a fundamental ambiguity: out-of-focus image content may be on either side of the in-focus plane, due to the absolute value in Equation 4. In contrast, a model with dual-pixel data can resolve this ambiguity since dual-pixel disparity is signed (Equation 5). This can be seen in the I1 vs. D1 results in Table 2, where the single-step I1 results are significantly worse than the single-step D1 results, but the gap narrows in the two-step case, where the ambiguity can be resolved by looking at two slices.
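The ambiguity can be illustrated with a toy thin-lens calculation. This sketch assumes the standard result that, for a fixed lens setting, blur size is proportional to the absolute difference between the inverse focus distance and the inverse object distance; the notation and the lumped constant `gain` are assumptions and may differ from Equations 4 and 5.

```python
def blur_and_disparity(focus_diopters, object_diopters, gain=1.0):
    """Toy thin-lens model in inverse-depth units (diopters).

    `gain` is a hypothetical constant lumping together aperture, focal length,
    and sensor terms. Returns (blur size, signed dual-pixel-like disparity).
    """
    signed = gain * (focus_diopters - object_diopters)
    # Blur size discards the sign, which is what creates the ambiguity.
    return abs(signed), signed


# An object at 1.0 diopters, with the lens focused too near and too far
# by the same amount in diopters.
blur_near, disparity_near = blur_and_disparity(1.5, 1.0)
blur_far, disparity_far = blur_and_disparity(0.5, 1.0)
```

Both lens settings yield the same blur size, so blur alone cannot tell which way to move the lens, whereas the two disparities have opposite signs.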

The ambiguity is also visualized in Figure 10 for a particular patch, where the I1 model outputs a bimodal distribution while the D1 model’s output probability is unimodal. Interestingly, this ambiguity is only problematic for focal slices where both candidate indices are plausible, i.e., lie between 0 and 49, as shown in Figure 10.

References

  • [1] S. Ansari, N. Wadhwa, R. Garg, and J. Chen (2019) Wireless software synchronization of multiple distributed cameras. ICCP. Cited by: §5.
  • [2] D. I. Barnea and H. F. Silverman (1972) A class of algorithms for fast digital image registration. Transactions on Computers. Cited by: §A.2, §A.3, Table 1.
  • [3] S. Birchfield and C. Tomasi (1998) A pixel dissimilarity measure that is insensitive to image sampling. TPAMI. Cited by: §A.2, §A.2, Table 1.
  • [4] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, A. Almansa, and F. Champagnat (2018) Deep depth from defocus: how can defocus blur improve 3d estimation using dense neural networks?. ECCV. Cited by: §1, §2, §5.
  • [5] C. Chan, S. Huang, and H. H. Chen (2017) Enhancement of phase detection for autofocus. ICIP, pp. 41–45. Cited by: §2.
  • [6] Chun-Hung Shen and H. H. Chen (2006) Robust focus measure for low-contrast images. International Conference on Consumer Electronics. Cited by: §A.1, Table 1.
  • [7] A. Cohen, I. Daubechies, and J. Feauveau (1992) Biorthogonal bases of compactly supported wavelets. Communications on pure and applied mathematics. Cited by: §A.1.
  • [8] R. Diaz and A. Marathe (2019) Soft labels for ordinal regression. CVPR. Cited by: §6, §6.
  • [9] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. NIPS. Cited by: §2, §2.
  • [10] R. Garg, N. Wadhwa, S. Ansari, and J. T. Barron (2019) Learning single camera depth estimation using dual-pixels. ICCV. Cited by: §A.3, Table 4, Appendix C, Table 5, Learning to Autofocus, §1, §2, §2, §4, Figure 6, §5, Figure 7, Table 1, Table 2.
  • [11] P. Grossmann (1987) Depth from focus. Pattern recognition letters. Cited by: §2.
  • [12] M. J. Hannah (1974) Computer matching of areas in stereo images.. Ph.D. Thesis, Stanford University. Cited by: §A.2, §A.2, §A.3, Table 1.
  • [13] R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. Cambridge university press. Cited by: §5.
  • [14] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. SIGGRAPH Asia. Cited by: §4.
  • [15] C. Hazirbas, S. G. Soyer, M. C. Staab, L. Leal-Taixé, and D. Cremers (2018) Deep depth from focus. ACCV. Cited by: §1, §2, §5.
  • [16] F. Helmli and S. Scherer (2001) Adaptive shape from focus with an error estimation in light microscopy. International Symposium on Image and Signal Processing and Analysis. Cited by: §A.1, Table 1, Table 3.
  • [17] B. K.P. Horn (1968) Focusing. Cited by: §2.
  • [18] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §6.
  • [19] E. Krotkov (1988) Focusing. IJCV. Cited by: §A.1, §A.1, §A.1, §A.1, Table 1.
  • [20] S. Y. Lee, Y. Kumar, J. M. Cho, S. W. Lee, and S. Kim (2008) Enhanced autofocus algorithm using robust focus measure and fuzzy reasoning. Transactions on Circuits and Systems for Video Technology. Cited by: §A.1, §2, Table 1.
  • [21] S. Y. Lee, J. T. Yoo, and S. Kim (2009) Reduced energy-ratio measure for robust autofocusing in digital camera. Signal Processing Letters. Cited by: §A.1, §2, Table 1.
  • [22] O. Liba, K. Murthy, Y. Tsai, T. Brooks, T. Xue, N. Karnad, Q. He, J. T. Barron, D. Sharlet, R. Geiss, S. W. Hasinoff, Y. Pritch, and M. Levoy (2019) Handheld mobile photography in very low light. SIGGRAPH Asia. Cited by: §4.
  • [23] T. Lindeberg (1990) Scale-space for discrete signals. TPAMI. Cited by: §4.
  • [24] H. Mir, P. Xu, R. Chen, and P. van Beek (2015) An autofocus heuristic for digital cameras based on supervised machine learning. Journal of Heuristics. Cited by: §1, §2, §5.
  • [25] H. Nanda and R. Cutler (2001) Practical calibrations for a real-time digital omnidirectional camera. Technical report In Technical Sketches, Computer Vision and Pattern Recognition. Cited by: §A.1, Table 1.
  • [26] S. K. Nayar and Y. Nakagawa (1994) Shape from focus. TPAMI. Cited by: §A.1, §2, Table 1.
  • [27] R. Ng, M. Levoy, M. Brédif, G. Duval, M. E. Horowitz, and P. Hanrahan (2005) Light field photography with a hand-held plenoptic camera. Technical report Stanford University. Cited by: §1.
  • [28] J. L. Pech-Pacheco, G. Cristóbal, J. Chamorro-Martínez, and J. Fernández-Valdivia (2000) Diatom autofocusing in brightfield microscopy: a comparative study. ICPR. Cited by: §A.1, §A.1, Table 1.
  • [29] A. P. Pentland (1987) A new sense for depth of field. TPAMI. Cited by: §2.
  • [30] S. Pertuz, D. Puig, and M. García (2012) Analysis of focus measure operators in shape-from-focus. Pattern Recognition. Cited by: §A.1, §7.
  • [31] L. I. Rudin, S. Osher, and E. Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena. Cited by: §A.1, §A.1, Table 1.
  • [32] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. CVPR. Cited by: §6.
  • [33] A. B. Santos, C. O. de Solorzano, J. J. Vaquero, J. M. Peña, N. Malpica, and F. del Pozo (1997) Evaluation of autofocus functions in molecular cytogenetic analysis. Journal of microscopy. Cited by: Table 1.
  • [34] A. Saxena, M. Sun, and A. Y. Ng (2008) Make3d: learning 3d scene structure from a single still image. TPAMI. Cited by: §2.
  • [35] D. Scharstein and R. Szeliski (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV. Cited by: §7.
  • [36] P. P. Srinivasan, R. Garg, N. Wadhwa, R. Ng, and J. T. Barron (2018) Aperture supervision for monocular depth estimation. CVPR. Cited by: §2.
  • [37] F. Stein (2004) Efficient computation of optical flow using the census transform. Pattern Recognition. Cited by: §A.2, Table 1.
  • [38] M. Subbarao, T. Choi, and A. Nikzad (1993) Focusing techniques. Optical Engineering. Cited by: §A.1, Table 1.
  • [39] S. Suwajanakorn, C. Hernandez, and S. M. Seitz (2015) Depth from focus with your mobile phone. CVPR. Cited by: §2.
  • [40] H. Tang, S. Cohen, B. Price, S. Schiller, and K. N. Kutulakos (2017) Depth from defocus in the wild. CVPR. Cited by: §2.
  • [41] J. M. Tenenbaum (1971) Accommodation in computer vision. Ph.D. Thesis, Stanford University. Cited by: §A.1, §2, Table 1.
  • [42] A. Thelen, S. Frey, S. Hirsch, and P. Hering (2009) Improvements in shape-from-focus for holographic reconstructions with regard to focus operators, neighborhood-size, and height value interpolation. TIP. Cited by: §A.1, Table 1.
  • [43] N. Wadhwa, R. Garg, D. E. Jacobs, B. E. Feldman, N. Kanazawa, R. Carroll, Y. Movshovitz-Attias, J. T. Barron, Y. Pritch, and M. Levoy (2018) Synthetic depth-of-field with a single-camera mobile phone. SIGGRAPH. Cited by: §A.3, §A.3, §A.3, §A.3, Table 4, Table 5, §2, Table 1, §7.
  • [44] C. Wee and R. Paramesran (2007) Measure of image sharpness using eigenvalues. Information Sciences. Cited by: §A.1, Table 1.
  • [45] H. Xie, W. Rong, and L. Sun (2006) Wavelet-based focus measure and 3-d surface reconstruction method for microscopy images. IROS. Cited by: §A.1, Table 1.
  • [46] C. Yang and H. H. Chen (2016) Gaussian noise approximation for disparity-based autofocus. ICIP. Cited by: §2.
  • [47] C. Yang, S. Huang, K. Shih, and H. H. Chen (2018) Analysis of disparity error for stereo autofocus. IEEE Transactions on Image Processing. Cited by: §2.
  • [48] G. Yang and B. J. Nelson (2003) Wavelet-based autofocusing and unsupervised segmentation of microscopic images. IROS. Cited by: §A.1, §A.1, Table 1, Table 3.
  • [49] A. L. Yuille and T. A. Poggio (1986) Scaling theorems for zero crossings. TPAMI. Cited by: §4.
  • [50] R. Zabih and J. Woodfill (1994) Non-parametric local transforms for computing visual correspondence. ECCV. Cited by: §A.2, §A.2, Table 1.

Appendix A Baseline Algorithms

Here we document the algorithms taken from prior work that we use as baselines for our proposed model.

A.1 Contrast-Based Baseline Algorithms

As a point of comparison for our proposed model, we implemented a number of contrast-based autofocus algorithms (or equivalently, patch-based depth-from-defocus algorithms) and evaluated them as baselines on our task. When selecting which baselines to implement, we prioritized top-performing techniques according to a relatively recent survey paper [30]. Given a focal stack of images, we compute a contrast score for each image, and we return the index into the focal stack that maximizes that score.

Intensity Variance [19]:

The variance of the intensity values of the entire image.

(6)

Intensity Coefficient of Variation [19]:

The coefficient of variation of the intensity values of the entire image, which is the standard deviation of the intensity values divided by their mean. Similar metrics are sometimes referred to in past work as “normalized variance”.

(7)
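The general recipe shared by all contrast-based baselines (score every slice, return the top-scoring index) can be sketched with the two intensity statistics just described. This is an illustrative sketch on nested Python lists; the function names are hypothetical, and the use of population (rather than sample) statistics is an assumption.

```python
import statistics


def intensity_variance(image):
    """Variance of all pixel intensities in the image."""
    pixels = [p for row in image for p in row]
    return statistics.pvariance(pixels)


def coefficient_of_variation(image):
    """Standard deviation of the intensities divided by their mean."""
    pixels = [p for row in image for p in row]
    return statistics.pstdev(pixels) / statistics.fmean(pixels)


def sharpest_index(focal_stack, score=intensity_variance):
    """Return the index of the focal slice that maximizes the contrast score."""
    scores = [score(image) for image in focal_stack]
    return max(range(len(scores)), key=scores.__getitem__)
```

Any of the other contrast measures in this section could be passed as `score` in the same way.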

Total Variation (L1) [25, 31]:

The total absolute difference between the intensity value of all pixels and their (4-connected) neighbors:

(8)

Total Variation (L2) [31]:

The total squared difference between the intensity value of all pixels and their (4-connected) neighbors. This is sometimes referred to as “gradient energy”:

(9)
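Both total-variation measures can be sketched with one function. This illustration counts each pair of 4-connected neighbours once (right and lower neighbour of every pixel), which is an assumption; counting each pair twice would only rescale the score. The name `total_variation` and its `power` argument are hypothetical.

```python
def total_variation(image, power=1):
    """Sum of |difference| ** power over horizontally and vertically adjacent pixels.

    power=1 gives the L1 variant and power=2 the L2 ("gradient energy") variant.
    """
    h, w = len(image), len(image[0])
    total = 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # right neighbour
                total += abs(image[y][x + 1] - image[y][x]) ** power
            if y + 1 < h:  # lower neighbour
                total += abs(image[y + 1][x] - image[y][x]) ** power
    return total
```

A 2x2 checkerboard with values 0 and 10 has four neighbouring pairs, each differing by 10, giving an L1 score of 40 and an L2 score of 400.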

Energy of Laplacian [38]:

The image is convolved by a discrete Laplace operator, and the responses are squared and summed.

(10)

Laplacian Variance [28]:

The image is convolved by a discrete Laplace operator, and the global variance of the response is computed.

(11)
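The two Laplacian-based scores can be sketched as below. The 4-neighbour stencil and the restriction to interior pixels are assumptions, since the exact kernel and border handling are not stated here; the function names are hypothetical.

```python
import statistics


def laplacian_response(image):
    """Responses of the 4-neighbour discrete Laplacian at all interior pixels."""
    h, w = len(image), len(image[0])
    return [
        image[y - 1][x] + image[y + 1][x] + image[y][x - 1] + image[y][x + 1]
        - 4 * image[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]


def energy_of_laplacian(image):
    """Sum of squared Laplacian responses."""
    return sum(r * r for r in laplacian_response(image))


def laplacian_variance(image):
    """Global (population) variance of the Laplacian responses."""
    return statistics.pvariance(laplacian_response(image))
```

A linear intensity ramp has zero Laplacian response everywhere, so both scores vanish for it, while an isolated bright pixel produces a large response.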

Sum of Modified Laplacian [26]:

The image is convolved by a 1D discrete Laplace operator in x and y, and the absolute values of each filter response are summed.

(12)

Diagonal Laplacian [42]:

This is the same as the “sum of modified Laplacian” approach, but augmented with diagonal Laplacian filters as well.

(13)

Mean Gradient Magnitude [41]:

The mean gradient magnitude, where the gradient is computed using the norm of the response of Sobel filters. This is sometimes referred to as “Tenengrad”.

(14)

Gradient Count [19]:

The total number of edges in the image whose magnitude is above some threshold, where the gradient magnitude is again computed using Sobel filters.

(15)

Gradient Magnitude Variance [28]:

The global variance of gradient magnitudes, where gradients are again computed using Sobel filters.

(16)

Percentile Range:

The difference between an upper percentile and a lower percentile of intensity values in the image. In the extreme case, this is the difference between the maximum and minimum pixel intensities in the image.

(17)

Histogram Entropy [19]:

The Shannon entropy of all pixel intensities in the image.

(18)
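A minimal sketch of the histogram entropy follows. It assumes one histogram bin per distinct intensity value and a base-2 logarithm; the binning actually used is not stated here, and the function name is hypothetical.

```python
import math
from collections import Counter


def histogram_entropy(image):
    """Shannon entropy (in bits) of the distribution of pixel intensities."""
    pixels = [p for row in image for p in row]
    counts = Counter(pixels)  # one bin per distinct intensity value
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

An image with two equally frequent intensities has an entropy of 1 bit, and a constant image has an entropy of 0.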

DCT Energy Ratio [6]:

The squared sum of all DCT coefficients of the image without the DC component, divided by the squared DC component.

(19)

DCT Reduced Energy Ratio [21]:

The squared sum of the 5 lowest order DCT coefficients (excluding the DC component) divided by the squared DC component.

(20)

Modified DCT [20]:

The total filter response of the image convolved with a checkerboard-like filter, which is somewhat related to the DCT of the image.

(21)

Wavelet Sum [48]:

The sum of the absolute values of the high-frequency components of a given level of the wavelet decomposition of the image. In our experiments, we use CDF9/7 wavelets [7].

(22)

Wavelet Variance [48]:

The variance of the high-frequency components of a given level of the wavelet decomposition of the image.

(23)

Wavelet Ratio [45]:

The ratio of the squared norm of the high-frequency components of a given level of the wavelet decomposition of the image to the squared norm of the low-frequency components.

(24)

Mean Wavelet Log-Ratio:

This is a baseline of our own design in which we modify the “Wavelet Ratio” model to compute a local log-ratio between the high-frequency and low-frequency energy at each spatial location in one level of a wavelet decomposition, and then compute the mean of those log-ratios. We add a small constant to the denominator to prevent numerical issues.

(25)

Eigenvalue Trace [44]:

The image is reduced to a matrix where each column is a vector containing the intensity values of one non-overlapping, fixed-size patch in the image. The trace of the sample covariance of that matrix is then used as a measure of sharpness.

(26)

Mean Local Ratio [16]:

A local measure of contrast is computed at each pixel by considering the ratio of each pixel intensity to a local average, and the overall contrast is computed as the average of those ratios (rectified if they fall below a threshold) across the image. The numerator and denominator of each ratio are incremented by a small constant to avoid numerical issues.

(27)

Here, the local average is obtained by applying a Gaussian blur of a given standard deviation to the image.

Mean Local Log-Ratio:

This is a baseline of our own design in which we modify the “Mean Local Ratio” technique above, by using the geometric mean of ratios instead of the arithmetic mean.

(28)

Mean Local Norm-Dist-Sq:

This is another baseline of our own design, in which we modify the “Mean Local Ratio” technique to use normalized squared distance (similar to a Coefficient of Variation) instead of ratios, which improves performance.

(29)

A.2 Dual-Pixel / Stereo Baseline Algorithms

Because our images are taken from a dual-pixel (DP) sensor, our focal stack can be thought of as a stack of left and right images forming stereo pairs. When a patch is in focus, the left and right DP images should resemble each other. It is therefore possible to construct simple autofocus algorithms by taking each left/right image pair in a DP focal stack, computing some measure of mismatch between those two images, and returning the focal index that minimizes that loss. In this section, we describe the baseline algorithms we use for this approach. Because patches of the left and right DP images may have drastically different global brightnesses due to lens shading (especially when the patches are taken from the periphery of the entire image frame), these stereo-like algorithms must be invariant to global transformations of the input images. For this reason, we center each image by its mean and divide by its standard deviation before computing all stereo measures:

(30)

This has no effect on some models (such as census and rank transformations) but is critical for other models.
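The normalization step and two of the mismatch measures that depend on it (normalized cross-correlation and normalized SAD, both described below) can be sketched as follows. The function names are hypothetical, dividing the inner product by the pixel count is an assumption that only rescales the loss, and constant patches (zero standard deviation) are not handled.

```python
import statistics


def normalize(image):
    """Flatten an image, subtract its mean, and divide by its standard deviation."""
    pixels = [float(p) for row in image for p in row]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    return [(p - mean) / std for p in pixels]


def ncc_loss(left, right):
    """Negated inner product of the normalized images (lower means more similar)."""
    a, b = normalize(left), normalize(right)
    return -sum(x * y for x, y in zip(a, b)) / len(a)


def sad_loss(left, right):
    """Sum of absolute deviations between the normalized images."""
    a, b = normalize(left), normalize(right)
    return sum(abs(x - y) for x, y in zip(a, b))


def best_focus_index(dp_stack, loss=ncc_loss):
    """Return the index of the (left, right) pair with the smallest mismatch."""
    losses = [loss(left, right) for left, right in dp_stack]
    return min(range(len(losses)), key=losses.__getitem__)
```

Because of the normalization, a right image that is a brightened and contrast-scaled copy of the left image still yields the minimum loss, which is the invariance motivated above.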

Census Transform (Hamming) [50]:

We apply the census transformation to the left and right DP images, wherein each pixel is represented by an 8-length binary vector representing whether or not the pixel is greater than each of its 8 neighbors. We score each pair according to the total Hamming distance between the two census-transformed images.

(31)
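A small sketch of the census-based score follows. Restricting the transform to interior pixels is an assumption about border handling, and the function names are hypothetical.

```python
def census(image):
    """8-bit census code per interior pixel: a bit is 1 when the pixel exceeds that neighbour."""
    h, w = len(image), len(image[0])
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    return [
        [
            tuple(int(image[y][x] > image[y + dy][x + dx]) for dy, dx in offsets)
            for x in range(1, w - 1)
        ]
        for y in range(1, h - 1)
    ]


def census_hamming(left, right):
    """Total Hamming distance between the census codes of two images."""
    codes_left, codes_right = census(left), census(right)
    return sum(
        a != b
        for row_l, row_r in zip(codes_left, codes_right)
        for code_l, code_r in zip(row_l, row_r)
        for a, b in zip(code_l, code_r)
    )
```

Since the census code depends only on the ordering of intensities, a uniformly brightened or contrast-scaled copy of an image has a Hamming distance of zero to the original, which is why the normalization above has no effect on this measure.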

Rank Transform (L1) [50]:

We apply the rank transformation (the number of set bits in the census transformation, i.e., a norm of that binary vector) to the left and right DP images, and score each pair according to the L1 distance between the two rank-transformed images.

(32)
(33)

Ternary Census [37]:

We apply the ternary census transformation to the left and right DP images, wherein each pixel is represented by an 8-length ternary vector representing whether the pixel is greater than, less than, or close to (according to some threshold) each of its 8 neighbors. We then score each pair according to the total L1 distance between the two census-transformed images.

(34)

Normalized Cross-Correlation [2, 12]:

NCC is just the inner product of these two normalized images, with its sign flipped such that minimization results in maximum cross-correlation. This is equivalent to minimizing the normalized sum of squared distances between the two images.

(35)

Normalized SAD [12]:

The sum of absolute deviations between the two normalized images.

(36)

Normalized Envelope (L1) [3]:

Pixel matching techniques can be made invariant to the discrete sampling of the sensor by adapting them to operate on smooth upper and lower envelopes of image intensities. Here we compute an upper and lower envelope of the left and right images, and from them compute the total L1 distance between the extents of the left and right envelopes.

(37)

where the operators in the equation are a “max” filter (i.e., max pooling), a “min” filter (i.e., min pooling), and a box filter (i.e., average pooling). The corresponding quantities for the other image are defined similarly.

Normalized Envelope (L2) [3]:

Similarly, we can compute the total squared distance between the extents of the left and right envelopes.

(38)

A.3 Single-Slice Baseline Algorithms

The baseline methods above infer the in-focus index by either maximizing contrast (for contrast-based methods) or minimizing stereo mismatch (for dual-pixel methods). Hence, they all require the knowledge of the entire focal stack before making a prediction.

However, the DP algorithms can be extended to predict the in-focus index with just one input DP image pair, if we can establish the relationship between left/right disparity and ideal focus distance. We list a few such algorithms below.

SSD Disparity:

We use the block matching approach of [43] to estimate disparity. To convert the disparity of a patch to a focal depth, we fit a linear model that estimates focal depth from the median patch disparity. The linear model is robustly estimated from all training patches using RANSAC. This method computes depth over a reduced field of view, and we report results only on patches contained within that field of view. Restricting evaluation to a narrower field of view does not disadvantage the baseline, as PSF variations and focal breathing are worse near the periphery.
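The robust linear fit can be sketched with a basic RANSAC loop. This is a generic illustration, not the fitting code used for the baseline: the function name, the inlier tolerance, the iteration count, and the toy data are all hypothetical.

```python
import random


def ransac_line(xs, ys, iterations=200, inlier_tol=1.0, seed=0):
    """Robustly fit y ≈ slope * x + intercept by repeatedly sampling point pairs."""
    rng = random.Random(seed)
    best, best_inliers = None, -1
    n = len(xs)
    for _ in range(iterations):
        i, j = rng.sample(range(n), 2)
        if xs[i] == xs[j]:
            continue  # a vertical line cannot map disparity to focal depth
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        # Count the points that the candidate line explains within the tolerance.
        inliers = sum(
            abs(slope * x + intercept - y) <= inlier_tol for x, y in zip(xs, ys)
        )
        if inliers > best_inliers:
            best, best_inliers = (slope, intercept), inliers
    return best


# Toy data: median patch disparities and focal indices lying on y = 5x + 24,
# with two gross outliers at positions 1 and 6.
disparities = [-3, -2, -1, 0, 1, 2, 3, 4]
focal_indices = [9, 40, 19, 24, 29, 34, 0, 44]
slope, intercept = ransac_line(disparities, focal_indices)
```

Here the fit recovers the underlying line despite the two outliers, which an ordinary least-squares fit would not.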

Learned Depth:

We use the neural-network-based approach of [10] to predict depth from dual-pixel images. The model from [10] predicts depth maps up to an unknown affine transform, which we estimate by solving a least squares problem that minimizes the distance between the affine-transformed depth map and the disparity from [43], as the two are known to be linearly related. We use the same fitting procedure described in SSD Disparity and restrict evaluation to the same reduced field of view.

ZNCC Disparity with Calibration:

We zero-normalize the input DP image pair (using Equation 30). Then, we compute the disparity between the two normalized images [2, 12] and apply a precomputed calibration to convert disparity to focal distance. Specifically, to compute the disparity, we evaluate the following

(39)

for integer shifts in a small range around zero. We then refine the result to get sub-pixel resolution by fitting a quadratic near the peak and finding its supremum.
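The integer search followed by quadratic refinement can be sketched on 1D signals. This is an illustration only: the score (mean product of overlapping normalized samples), the sign convention of the shift, and all function names are assumptions rather than a transcription of Equation 39.

```python
import statistics


def normalize(signal):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    return [(v - mean) / std for v in signal]


def correlation_at_shift(left, right, shift):
    """Mean product of overlapping samples when `right` is read at offset `shift`."""
    pairs = [
        (left[i], right[i + shift])
        for i in range(len(left))
        if 0 <= i + shift < len(right)
    ]
    return sum(a * b for a, b in pairs) / len(pairs)


def subpixel_disparity(left, right, max_shift=3):
    """Best integer shift, refined by the vertex of a parabola through the peak."""
    a, b = normalize(left), normalize(right)
    shifts = list(range(-max_shift, max_shift + 1))
    scores = [correlation_at_shift(a, b, s) for s in shifts]
    k = max(range(len(scores)), key=scores.__getitem__)
    if k == 0 or k == len(scores) - 1:
        return float(shifts[k])  # peak on the border: no neighbours to fit
    y0, y1, y2 = scores[k - 1], scores[k], scores[k + 1]
    denom = y0 - 2 * y1 + y2
    offset = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom
    return shifts[k] + offset
```

For a signal and a copy of it shifted by two samples, the integer search peaks at a shift of 2 and the parabola fit adjusts that estimate by a small fraction of a sample.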

Under paraxial and thin-lens approximations, and assuming constant aperture and focal length, signed disparity and ideal focus distance are related by an affine transform [43]:

(40)

where the two remaining quantities in Equation 40 are a calibration constant and the lens’s current focus distance.

The assumption that the calibration term is constant breaks down for real lenses, as they do not satisfy the paraxial and thin-lens approximations. In fact, its value varies significantly across the field of view, due to optical aberration, vignetting, changes in optical blur kernels, etc., as shown in [43]. The camera device we use embeds a factory calibration table that specifies the measured values sparsely across the field of view. We obtain the calibration value for each input patch by bilinearly interpolating the low-resolution calibration table.

With the knowledge of the disparity, the calibration coefficient, and the current focus distance, we can easily solve for the ideal focus distance in Equation 40.

Appendix B Generalization to other phones

To show that our technique generalizes, we use the data captured in the paper to create a new test set using the “left” camera, which has a different calibration and PSF.

This left test set contains the same scenes as the test set in the original paper; however, the overall attributes of the set may be different. The “left” phone is positioned in front of the “center” phone by 1.1 cm (on the z-axis, it is 1.1 cm closer to objects in the scene). In addition, the computed depth has an overall lower confidence than that of the center camera, since fewer cameras see all the pixels captured by the left camera. This problem is particularly apparent on the left side of the capture. Furthermore, because we keep the same confidence threshold as used for the center camera, fewer patches are generated. In general, it may be difficult to compare the raw numbers from the test set using the center camera and the test set using the left camera.

As shown in Table 4, all techniques perform slightly worse. This indicates that the “left” test set may be more difficult than the “center” test set due to the aforementioned changes. Despite this, our model still outperforms the baselines. Additionally, several simple techniques, like adding calibration data to the model or a brief fine-tuning stage for each camera, could be easily added to our approach and potentially lead to improved per-device performance.

For this run, ZNCC Disparity uses calibration for the “left” camera, and linear models to convert to focal depths for SSD Disparity and Learned Depth were estimated using the training data patches from the “left” camera.

Accuracy columns (≤0, ≤1, ≤2, ≤4): higher is better. MAE and RMSE: lower is better.
Algorithm                 ≤0     ≤1     ≤2     ≤4     MAE    RMSE
D1 Learned Depth [10]     0.070  0.206  0.340  0.564  7.224  11.010
D1 SSD Disparity [43]     0.068  0.200  0.333  0.550  7.377  10.951
D1 ZNCC Disparity         0.046  0.136  0.224  0.379  9.436  13.138
D1 Our model              0.105  0.322  0.513  0.807  2.912  3.867
Table 4: Evaluating techniques on the “left” version of the test set. This tests whether the technique generalizes to other phones. Our model still outperforms the baselines, and performance went down for all techniques, indicating that the “left” version of the test set is harder. See text for explanation. As described in Appendix A.3, the SSD Disparity and Learned Depth baselines are evaluated only on patches within a reduced field of view.

Appendix C Multi-step problem

In Figure C, we obtain improved results on the multi-step problem.

Accuracy columns (≤0, ≤1, ≤2, ≤4): higher is better. MAE and RMSE: lower is better.
Algorithm                            # of steps  ≤0     ≤1     ≤2     ≤4     MAE    RMSE
D1 ZNCC Disparity with Calibration   1           0.064  0.181  0.286  0.448  8.879  12.911
D1 ZNCC Disparity with Calibration   2           0.100  0.278  0.426  0.617  6.662  10.993
D1 Learned Depth [10]                1           0.108  0.289  0.428  0.586  7.176  11.351
D1 Learned Depth [10]                2           0.172  0.433  0.618  0.802  3.876  7.410
I1 Our model                         1           0.115  0.318  0.597  0.691  4.321  6.737
I1 Our model                         2           0.138  0.377  0.567  0.807  2.855  4.088
D1 Our model                         1           0.164  0.455  0.653  0.885  2.235  3.112
D1 Our model                         2           0.201  0.519  0.723  0.916  1.931  2.772

Appendix D Light and Dark Scenes

In Figure 8 in the main paper, we presented examples on particularly dark images. In Table 5, we present the full numeric breakdown of the performance of single-index algorithms on scenes with a normal amount of light versus scenes with low light.

To capture these, we placed the rig in a fixed position and then captured two focal stacks: one with the light on and one with the light turned off. As a result, these captures should be perfectly registered and should be identical apart from the presence or absence of light. We then used the ground truth depth from the light image to eliminate any possible mistakes that the SFM pipeline would make with the darker images.

Accuracy columns (≤0, ≤1, ≤2, ≤4): higher is better. MAE and RMSE: lower is better.
Setting  Algorithm                 ≤0     ≤1     ≤2     ≤4     MAE    RMSE
Light    D1 SSD Disparity [43]     0.079  0.228  0.355  0.528  6.732  9.577
Light    D1 Learned Depth [10]     0.094  0.264  0.401  0.576  6.262  9.376
Light    D1 ZNCC Disparity         0.064  0.188  0.304  0.486  7.222  10.179
Light    D1 Our model              0.126  0.369  0.578  0.832  2.654  3.563
Dark     D1 Learned Depth [10]     0.061  0.178  0.286  0.442  9.104  12.793
Dark     D1 SSD Disparity [43]     0.055  0.162  0.252  0.396  9.343  12.669
Dark     D1 ZNCC Disparity         0.056  0.167  0.272  0.443  7.972  11.080
Dark     D1 Our model              0.112  0.323  0.497  0.729  3.479  4.957
Table 5: Performance for scenes in high and low light. Note that our technique is the most resistant to dark scenes. As described in Appendix A.3, the SSD Disparity and Learned Depth baselines are evaluated only on patches within a reduced field of view.

Appendix E Example Images

E.1 Single slice as input

In Figures 11, 12, 13, and 14, we provide a random selection of inputs (drawn from those inside the 1.5x center crop, so that the PD baselines are available) together with the predictions from all baselines and our models. The “Input” is what the algorithm is given. The focal stack identification key is directly above each row. The title of each focal slice gives the name of the algorithm, the index, and “Err” followed by the number of indices away from the ground truth.
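
As a small illustration of the title format described above, the sketch below builds one such title. The exact separators and the use of an absolute index difference are assumptions, and the function name and sample values are hypothetical.

```python
def slice_title(algorithm, predicted_index, ground_truth_index):
    """Build a title: algorithm name, index, and error in indices."""
    # Error is taken as the absolute distance in focal-stack indices
    # (an assumption; the source only says "number of indices away").
    err = abs(predicted_index - ground_truth_index)
    return f"{algorithm}, {predicted_index} (Err {err})"


# Hypothetical prediction two indices away from the ground truth.
title = slice_title("D1 Our model", 23, 21)
print(title)
```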

Figure 11: Algorithms given a single index as input. Example page 1.
Figure 12: Algorithms given a single index as input. Example page 2.
Figure 13: Algorithms given a single index as input. Example page 3.
Figure 14: Algorithms given a single index as input. Example page 4.

E.2 Focal stack as input

In Figures 15, 16, 17, and 18, we provide a random selection of inputs from the test set. The figures contain an “Input” category; however, it simply displays another element of the focal stack, since all of these algorithms receive the full focal stack as input. The focal stack identification key is directly above each row. The title of each focal slice gives the name of the algorithm, the index, and “Err” followed by the number of indices away from the ground truth.

Figure 15: Algorithms given the full focal stack as input. Example page 1.
Figure 16: Algorithms given the full focal stack as input. Example page 2.
Figure 17: Algorithms given the full focal stack as input. Example page 3.
Figure 18: Algorithms given the full focal stack as input. Example page 4.

References

  • [1] S. Ansari, N. Wadhwa, R. Garg, and J. Chen (2019) Wireless software synchronization of multiple distributed cameras. ICCP. Cited by: §5.
  • [2] D. I. Barnea and H. F. Silverman (1972) A class of algorithms for fast digital image registration. Transactions on Computers. Cited by: §A.2, §A.3, Table 1.
  • [3] S. Birchfield and C. Tomasi (1998) A pixel dissimilarity measure that is insensitive to image sampling. TPAMI. Cited by: §A.2, §A.2, Table 1.
  • [4] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, A. Almansa, and F. Champagnat (2018) Deep depth from defocus: how can defocus blur improve 3d estimation using dense neural networks?. ECCV. Cited by: §1, §2, §5.
  • [5] C. Chan, S. Huang, and H. H. Chen (2017) Enhancement of phase detection for autofocus. ICIP, pp. 41–45. Cited by: §2.
  • [6] C. Shen and H. H. Chen (2006) Robust focus measure for low-contrast images. International Conference on Consumer Electronics. Cited by: §A.1, Table 1.
  • [7] A. Cohen, I. Daubechies, and J. Feauveau (1992) Biorthogonal bases of compactly supported wavelets. Communications on pure and applied mathematics. Cited by: §A.1.
  • [8] R. Diaz and A. Marathe (2019) Soft labels for ordinal regression. CVPR. Cited by: §6, §6.
  • [9] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. NIPS. Cited by: §2, §2.
  • [10] R. Garg, N. Wadhwa, S. Ansari, and J. T. Barron (2019) Learning single camera depth estimation using dual-pixels. ICCV. Cited by: §A.3, Table 4, Appendix C, Table 5, Learning to Autofocus, §1, §2, §2, §4, Figure 6, §5, Figure 7, Table 1, Table 2.
  • [11] P. Grossmann (1987) Depth from focus. Pattern recognition letters. Cited by: §2.
  • [12] M. J. Hannah (1974) Computer matching of areas in stereo images. Ph.D. Thesis, Stanford University. Cited by: §A.2, §A.2, §A.3, Table 1.
  • [13] R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. Cambridge university press. Cited by: §5.
  • [14] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. SIGGRAPH Asia. Cited by: §4.
  • [15] C. Hazirbas, S. G. Soyer, M. C. Staab, L. Leal-Taixé, and D. Cremers (2018) Deep depth from focus. ACCV. Cited by: §1, §2, §5.
  • [16] F. Helmli and S. Scherer (2001) Adaptive shape from focus with an error estimation in light microscopy. International Symposium on Image and Signal Processing and Analysis. Cited by: §A.1, Table 1, Table 3.
  • [17] B. K.P. Horn (1968) Focusing. Cited by: §2.
  • [18] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §6.
  • [19] E. Krotkov (1988) Focusing. IJCV. Cited by: §A.1, §A.1, §A.1, §A.1, Table 1.
  • [20] S. Y. Lee, Y. Kumar, J. M. Cho, S. W. Lee, and S. Kim (2008) Enhanced autofocus algorithm using robust focus measure and fuzzy reasoning. Transactions on Circuits and Systems for Video Technology. Cited by: §A.1, §2, Table 1.
  • [21] S. Y. Lee, J. T. Yoo, and S. Kim (2009) Reduced energy-ratio measure for robust autofocusing in digital camera. Signal Processing Letters. Cited by: §A.1, §2, Table 1.
  • [22] O. Liba, K. Murthy, Y. Tsai, T. Brooks, T. Xue, N. Karnad, Q. He, J. T. Barron, D. Sharlet, R. Geiss, S. W. Hasinoff, Y. Pritch, and M. Levoy (2019) Handheld mobile photography in very low light. SIGGRAPH Asia. Cited by: §4.
  • [23] T. Lindeberg (1990) Scale-space for discrete signals. TPAMI. Cited by: §4.
  • [24] H. Mir, P. Xu, R. Chen, and P. van Beek (2015) An autofocus heuristic for digital cameras based on supervised machine learning. Journal of Heuristics. Cited by: §1, §2, §5.
  • [25] H. Nanda and R. Cutler (2001) Practical calibrations for a real-time digital omnidirectional camera. Technical Sketches, CVPR. Cited by: §A.1, Table 1.
  • [26] S. K. Nayar and Y. Nakagawa (1994) Shape from focus. TPAMI. Cited by: §A.1, §2, Table 1.
  • [27] R. Ng, M. Levoy, M. Brédif, G. Duval, M. E. Horowitz, and P. Hanrahan (2005) Light field photography with a hand-held plenoptic camera. Technical report Stanford University. Cited by: §1.
  • [28] J. L. Pech-Pacheco, G. Cristóbal, J. Chamorro-Martínez, and J. Fernández-Valdivia (2000) Diatom autofocusing in brightfield microscopy: a comparative study. ICPR. Cited by: §A.1, §A.1, Table 1.
  • [29] A. P. Pentland (1987) A new sense for depth of field. TPAMI. Cited by: §2.
  • [30] S. Pertuz, D. Puig, and M. García (2012) Analysis of focus measure operators in shape-from-focus. Pattern Recognition. Cited by: §A.1, §7.
  • [31] L. I. Rudin, S. Osher, and E. Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena. Cited by: §A.1, §A.1, Table 1.
  • [32] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. CVPR. Cited by: §6.
  • [33] A. B. Santos, C. O. de Solorzano, J. J. Vaquero, J. M. Peña, N. Malpica, and F. del Pozo (1997) Evaluation of autofocus functions in molecular cytogenetic analysis. Journal of microscopy. Cited by: Table 1.
  • [34] A. Saxena, M. Sun, and A. Y. Ng (2008) Make3d: learning 3d scene structure from a single still image. TPAMI. Cited by: §2.
  • [35] D. Scharstein and R. Szeliski (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV. Cited by: §7.
  • [36] P. P. Srinivasan, R. Garg, N. Wadhwa, R. Ng, and J. T. Barron (2018) Aperture supervision for monocular depth estimation. CVPR. Cited by: §2.
  • [37] F. Stein (2004) Efficient computation of optical flow using the census transform. Pattern Recognition. Cited by: §A.2, Table 1.
  • [38] M. Subbarao, T. Choi, and A. Nikzad (1993) Focusing techniques. Optical Engineering. Cited by: §A.1, Table 1.
  • [39] S. Suwajanakorn, C. Hernandez, and S. M. Seitz (2015) Depth from focus with your mobile phone. CVPR. Cited by: §2.
  • [40] H. Tang, S. Cohen, B. Price, S. Schiller, and K. N. Kutulakos (2017) Depth from defocus in the wild. CVPR. Cited by: §2.
  • [41] J. M. Tenenbaum (1971) Accommodation in computer vision. Ph.D. Thesis, Stanford University. Cited by: §A.1, §2, Table 1.
  • [42] A. Thelen, S. Frey, S. Hirsch, and P. Hering (2009) Improvements in shape-from-focus for holographic reconstructions with regard to focus operators, neighborhood-size, and height value interpolation. TIP. Cited by: §A.1, Table 1.
  • [43] N. Wadhwa, R. Garg, D. E. Jacobs, B. E. Feldman, N. Kanazawa, R. Carroll, Y. Movshovitz-Attias, J. T. Barron, Y. Pritch, and M. Levoy (2018) Synthetic depth-of-field with a single-camera mobile phone. SIGGRAPH. Cited by: §A.3, §A.3, §A.3, §A.3, Table 4, Table 5, §2, Table 1, §7.
  • [44] C. Wee and R. Paramesran (2007) Measure of image sharpness using eigenvalues. Information Sciences. Cited by: §A.1, Table 1.
  • [45] H. Xie, W. Rong, and L. Sun (2006) Wavelet-based focus measure and 3-d surface reconstruction method for microscopy images. IROS. Cited by: §A.1, Table 1.
  • [46] C. Yang and H. H. Chen (2016) Gaussian noise approximation for disparity-based autofocus. ICIP. Cited by: §2.
  • [47] C. Yang, S. Huang, K. Shih, and H. H. Chen (2018) Analysis of disparity error for stereo autofocus. IEEE Transactions on Image Processing. Cited by: §2.
  • [48] G. Yang and B. J. Nelson (2003) Wavelet-based autofocusing and unsupervised segmentation of microscopic images. IROS. Cited by: §A.1, §A.1, Table 1, Table 3.
  • [49] A. L. Yuille and T. A. Poggio (1986) Scaling theorems for zero crossings. TPAMI. Cited by: §4.
  • [50] R. Zabih and J. Woodfill (1994) Non-parametric local transforms for computing visual correspondence. ECCV. Cited by: §A.2, §A.2, Table 1.