Compared with multi-shot depth measurement methods such as structure from motion (SfM) [agarwal2009building] and depth from defocus (DfD) [subbarao1994depth, tang2017depth, guo2017focal], a single-shot method is suitable for moving objects. One of the most successful single-shot methods is deep monocular depth estimation. Despite the remarkable progress of deep monocular depth estimation in recent years [eigen2014depth, godard2017unsupervised, fu2018deep], it cannot estimate a correct depth map without sufficient contextual information due to the lack of a physical depth cue, for instance, in a scene without the ground.
Instead of utilizing contextual information, color-coded aperture (CCA) methods acquire a depth map from a physical depth cue encoded in a single-shot image by inserting different types of color filters [bando2008extracting, kim2012multifocusing, chakrabarti2012depth, lee2013single, martinello2015dual, paramonov2016depth, moriuchi201723] into the lens aperture, as shown in Figure 2b. As shown in Figure 2a, the color and radius of the blur vary with the distance from the focus distance. Conventional CCA methods assume an ideal lens to simplify the analytical modeling of defocus blur. However, actual lenses have shift-variant point spread functions (PSFs) that are distorted depending on the position on the image sensor by lens aberrations such as field curvature, coma, or lateral chromatic aberration, as shown in Figure 2e. Furthermore, the depth cues often disappear because of several uncertainties, such as saturation, soft shadow, dark color, and large blur. These uncertainties are not treated explicitly in conventional methods [bando2008extracting, kim2012multifocusing, chakrabarti2012depth, lee2013single, martinello2015dual, paramonov2016depth, moriuchi201723, haim2018depth].
In this paper, we propose a physical cue-based deep learning method to overcome the differences between the analytical model and the actual one. To estimate a correct depth map under shift-variant PSFs, we add positional information as an additional branch through a self-attention mechanism [zhang2018self]. The network also couples additional color channels to remove the dependency on object colors, so that various complex color patterns are handled correctly. To handle various uncertainties, we modify the loss function based on Bayesian deep learning [kendall2017uncertainties] to stabilize the training. As shown in Figure 1, we demonstrate that our method is superior to a conventional method in quantitative and qualitative experiments, including various outdoor scenes. Furthermore, compared to a long-baseline stereo camera, the proposed method avoids the depth errors that occur at close range, as there is no blind spot between the left and right cameras.
The contributions of this paper are as follows. (a) We propose a CNN-based depth estimation network that does not infer the depth from the contextual information but physically measures the depth given by an optical cue. (b) We add positional information as additional channels by a self-attention mechanism to handle shift-variant aberrations. (c) We train the network with additional color channels using many pictures taken by an actual lens to handle various complex color patterns correctly. (d) To handle various uncertainties, we propose a Bayes L1 loss instead of the conventional heteroscedastic variance to stabilize the training.
There are two limitations to this paper. First, our method is not applicable to regions with small gradients, since they contain no depth cues. Second, our method requires much computation time, for example, about 50 seconds (NVIDIA GeForce GTX 1080 Ti), because of the patch-based architecture [bailer2018fast]. Improving these points is left for future work.
2 Background (Color-coded aperture photography)
CCA methods [bando2008extracting, kim2012multifocusing, chakrabarti2012depth, lee2013single, martinello2015dual, paramonov2016depth, moriuchi201723] are categorized as a computational photography (CP) technique [zhou2011computational] developed in the last decade. The image quality of CCA is higher than that of coded aperture [levin2007image], which produces an unnatural blur shape due to the special shape of its aperture. To acquire the depth map, CCA methods use disparity [bando2008extracting, kim2012multifocusing, lee2013single, paramonov2016depth] or defocus blur [chakrabarti2012depth, martinello2015dual, moriuchi201723].
Figure 2a shows how the optical path through the lens changes with cyan and yellow color filters [paramonov2016depth, moriuchi201723]. The color direction of the defocus blur is inverted at the focus distance between near and far objects. This change of the defocus blur allows the distance to be retrieved both in front of and behind the focal plane. The depth from defocus technique [subbarao1994depth] is applied to estimate an accurate depth map for CCA [martinello2015dual, moriuchi201723], which is called depth from analytical defocus (DfAD). These methods assume Gaussian blur, as shown in Figure 2c. The blur radius is estimated from the zero-mean normalized cross-correlation (ZNCC) of deformed images, which are obtained with convolution kernels that deform the asymmetric Gaussian blur of the R and B images to the Gaussian blur of the G image [moriuchi201723]. However, conventional CCA methods assume an ideal lens to simplify the analytical modeling of the defocus blur. Because of the difference between the analytical model and the actual one, DfAD gives a distorted depth map under a shift-variant PSF (Figure 2d), as shown in Figure 3c. Figure 3f shows the errors caused by the differences between the ideal analytical model and the actual one. Although recent work [paramonov2016depth] modeled the aberration effect with a double-Gauss lens model, the handcrafted model must be reconstructed when it is applied to other lenses.
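As a point of reference for the matching score used by DfAD, the following is a minimal NumPy sketch of zero-mean normalized cross-correlation between two equally sized patches. The function name `zncc` and the `eps` guard are illustrative assumptions; the convolution kernels and the search over blur radii used in [moriuchi201723] are not reproduced here.

```python
import numpy as np

def zncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()  # remove the mean so that offsets do not matter
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # eps (an assumption of this sketch) guards against flat patches,
    # for which the score is otherwise undefined.
    return float((a * b).sum() / (denom + eps))

rng = np.random.default_rng(0)
patch = rng.random((16, 16))
rescaled = 2.5 * patch + 0.3  # same structure, different gain and offset
score = zncc(patch, rescaled)
```

Because the score is invariant to gain and offset, it compares the structure of two color channels rather than their brightness, which is why it suits matching across differently filtered channels.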
In this section, to overcome the differences between the analytical model and the actual one, we propose a physical cue-based deaberration network.
Baseline network. We adopt a patch-based architecture [zagoruyko2015learning, simo2015discriminative, zbontar2016stereo, luo2016efficient, bailer2016cnn, bailer2018fast] for our network so that it learns only the defocus blur instead of the contextual information and can be trained easily. The architecture takes a patch extracted from a captured image as input and outputs a single depth value corresponding to the patch. Since this network has no access to information from neighboring patches, it does not learn the contextual information. Therefore, the learned network has high generalization performance. This network can be trained using only patchwise images with flat depth data. Such data can be collected easily by our training system (see Section 4.1).
Our network structure is based on ResNet [he2016deep]. The baseline network structure is indicated by the red dotted rectangle in Figure 4. In a preprocessing stage, the gradient of an image patch is calculated with respect to the horizontal and vertical axes, and all of the gradients are concatenated. It is well known that gradients give better results than color images do [bando2008extracting, chakrabarti2012depth, martinello2015dual, paramonov2016depth, moriuchi201723], and our experiment also showed this result (see also Section 4.2). ... [lin2013network] and a fully connected layer (dense layer). The network infers the defocus blur from the concatenated gradients with learnable weight parameters.
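The preprocessing step can be sketched as follows. This is a minimal illustration that assumes central finite differences (via `numpy.gradient`) and a particular channel ordering; the text does not specify which gradient operator or ordering is used, and the function name `gradient_channels` is hypothetical.

```python
import numpy as np

def gradient_channels(patch_rgb):
    """Stack horizontal and vertical gradients of every color channel.

    patch_rgb: float array of shape (H, W, 3).
    Returns shape (H, W, 6): d/dx of R, G, B followed by d/dy of R, G, B.
    """
    patch_rgb = np.asarray(patch_rgb, dtype=np.float32)
    # np.gradient returns one array per requested axis, in axis order:
    # axis 0 is vertical (rows), axis 1 is horizontal (columns).
    gy, gx = np.gradient(patch_rgb, axis=(0, 1))
    return np.concatenate([gx, gy], axis=-1)

# A horizontal ramp has unit horizontal gradient and zero vertical gradient.
ramp = np.tile(np.arange(16, dtype=np.float32)[None, :, None], (16, 1, 3))
grads = gradient_channels(ramp)
```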
Deep deaberration network. The aberration effect varies with the wavelength of the color and with the horizontal and vertical position in the lens. To handle the aberration effect, we add the positional and color information to the gradients, as shown in Figure 4. The position is broadcast to the same size as the patch. To add color information, the input image patch is converted to hue and saturation maps with the same shape as the patch.
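The two auxiliary inputs can be sketched as below. The normalization of the position to [0, 1], the use of the patch center, and the HSV definition of hue and saturation are assumptions of this sketch; both function names are hypothetical.

```python
import numpy as np

def position_channels(cx, cy, image_w, image_h, patch=16):
    """Broadcast the normalized patch position to two (patch, patch) maps."""
    x = np.full((patch, patch), cx / (image_w - 1), dtype=np.float32)
    y = np.full((patch, patch), cy / (image_h - 1), dtype=np.float32)
    return np.stack([x, y], axis=-1)

def hue_saturation_channels(patch_rgb):
    """Convert an RGB patch with values in [0, 1] to hue and saturation maps."""
    rgb = np.asarray(patch_rgb, dtype=np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    delta = mx - mn
    safe = np.where(delta > 0, delta, 1.0)  # avoid division by zero on gray
    hue = np.where(mx == r, ((g - b) / safe) % 6.0,
                   np.where(mx == g, (b - r) / safe + 2.0,
                            (r - g) / safe + 4.0))
    hue = np.where(delta > 0, hue / 6.0, 0.0)  # hue in [0, 1), 0 for gray
    sat = np.where(mx > 0, delta / np.where(mx > 0, mx, 1.0), 0.0)
    return np.stack([hue, sat], axis=-1)

pixels = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                   [[0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]])
hs = hue_saturation_channels(pixels)
pos = position_channels(0, 1231, 1845, 1232)
```

Both outputs have the same spatial shape as the patch, so they can be concatenated with the gradient channels along the last axis.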
To handle the lens aberration efficiently, we introduce the self-attention mechanism [zhang2018self] into our deep deaberration network (DDN), indicated by the green dotted rectangle in Figure 4. Since the lens aberration changes the shape of the blur, the important features vary with the position of the image patch. The attention mechanism is trained to put large weights on these important features, and shift-invariant features are extracted as a result. The color branch handles the dependency on object colors in the same way. After the positional and color information is concatenated to the gradients, the attention maps are calculated from each feature map by sigmoid functions. The feature map of the main branch is multiplied by these two attention maps before its ResBlock.
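The gating step can be illustrated with plain arrays. In this sketch the branch feature maps are assumed to be already computed and to share the shape of the main feature map; in the network they would come from convolutional layers, which are omitted here. The function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_main_features(main_feat, pos_feat, color_feat):
    """Multiply main-branch features by two sigmoid attention maps."""
    pos_att = sigmoid(pos_feat)      # attention from the positional branch
    color_att = sigmoid(color_feat)  # attention from the color branch
    return main_feat * pos_att * color_att

main = np.ones((16, 16, 32))
zeros = np.zeros_like(main)
half_gated = gate_main_features(main, zeros, zeros)        # each gate is 0.5
open_gated = gate_main_features(main, zeros + 50, zeros + 50)  # gates near 1
```

Because each attention value lies in (0, 1), the gates can only attenuate main-branch features, which lets the network suppress features that are unreliable at a given image position or for a given object color.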
The proposed networks are trained as a regression problem with supervision, similar to stereo matching [kendall2017end, chang2018pyramid] and deep monocular depth estimation [kuznietsov2017semi, Atapour-Abarghouei_2018_CVPR]. The ground truth distance recorded by the training system is converted to the blur radius by using the lens maker's formula. Each element of the training data-set is a tuple of an input and its ground-truth blur radius. The L1 loss function is defined as the mean absolute error between the predicted and ground-truth blur radii over all training patches.
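The distance-to-blur conversion can be sketched with the thin-lens equation. This is an illustration under stated assumptions: an ideal thin lens, a signed radius that is negative in front of the focus distance and positive behind it, and a result expressed in millimetres on the sensor (converting to pixels would need the pixel pitch, which is not given in the text). The default values are the lens settings reported in Section 4.1; the function names are hypothetical.

```python
import numpy as np

def signed_blur_radius_mm(d_mm, focus_mm=1500.0, focal_mm=50.0, f_number=4.0):
    """Signed defocus blur radius on the sensor for an object at d_mm."""
    d = np.asarray(d_mm, dtype=np.float64)
    aperture = focal_mm / f_number  # aperture diameter
    # Thin-lens blur circle diameter, kept signed around the focus distance.
    diameter = aperture * focal_mm * (d - focus_mm) / (d * (focus_mm - focal_mm))
    return 0.5 * diameter

def l1_loss(pred, target):
    """Mean absolute error between predicted and ground-truth blur radii."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(np.abs(pred - target)))

distances = np.array([1100.0, 1500.0, 2400.0])
radii = signed_blur_radius_mm(distances)
```

The radius changes sign at the focus distance, which mirrors the inversion of the blur color direction described in Section 2.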
Reliability prediction. In actual CCA optics, the depth cues often disappear because of several uncertainties, such as saturation, soft shadow, dark color, and large blur. In the literature on Bayesian deep learning [gal2016dropout, kendall2017uncertainties], such uncertainties are categorized as heteroscedastic uncertainty [kendall2017uncertainties]. To handle the heteroscedastic uncertainty, the network is changed to also output a variance prediction, and the loss function is defined with the heteroscedastic variance [kendall2017uncertainties]. However, this loss function shows significant instability in the training of our task, as shown in Figure 8 (indicated by Bayes L2). This instability often causes training to fail. As training progresses, the predicted variance becomes noticeably smaller, and the error is also expected to become small. However, an outlier error then makes the loss very large, because the variance in the denominator is very small at the same time. The loss function therefore diverges with the second order of the error.
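For reference, the heteroscedastic regression loss of [kendall2017uncertainties] is commonly written in the following form. The symbols are notation assumed here for illustration, since the original symbols are not available in this text: $b_i$ is the ground-truth blur radius of the $i$-th patch, $\hat{b}_i$ the predicted blur radius, $\hat{\sigma}_i^{2}$ the predicted variance, and $N$ the number of training patches.

```latex
\mathcal{L}_{\mathrm{Bayes\,L2}}
  = \frac{1}{N}\sum_{i=1}^{N}
    \left[
      \frac{\left( b_i - \hat{b}_i \right)^{2}}{2\,\hat{\sigma}_i^{2}}
      + \frac{1}{2}\log \hat{\sigma}_i^{2}
    \right]
```

In this form the first term grows with the squared error divided by the variance, so a single outlier combined with a very small predicted variance dominates the whole batch.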
To stabilize the training, we propose a new loss function based on the heteroscedastic absolute standard deviation. To reduce the order of the divergence, we convert the loss function by replacing the squared error and the variance with the absolute error and the absolute standard deviation.
To output the uncertainty prediction, an additional final layer is added to the end of the main branch in DDN, indicated by the blue dotted rectangle in Figure 4. The reliability is derived from this prediction. Figure 8 shows that our new loss function stabilizes the training significantly (indicated by Bayes L1).
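The difference in growth order between the two losses can be checked numerically. The sketch below is an assumption-laden illustration: `bayes_l2` follows the commonly used form of the heteroscedastic loss in [kendall2017uncertainties], while `bayes_l1` is one plausible first-order variant obtained by swapping the squared error and variance for the absolute error and standard deviation. The exact constants of the proposed loss are not recoverable from this text, so `bayes_l1` should not be read as the authors' definition.

```python
import numpy as np

def bayes_l2(err, sigma):
    """Heteroscedastic loss with squared error over variance."""
    err = np.asarray(err, dtype=np.float64)
    sigma = np.asarray(sigma, dtype=np.float64)
    return float(np.mean(err**2 / (2.0 * sigma**2) + 0.5 * np.log(sigma**2)))

def bayes_l1(err, sigma):
    """Assumed first-order variant: absolute error over standard deviation."""
    err = np.asarray(err, dtype=np.float64)
    sigma = np.asarray(sigma, dtype=np.float64)
    return float(np.mean(np.abs(err) / sigma + np.log(sigma)))

sigma = 0.01  # a small predicted standard deviation late in training
for outlier in (0.1, 1.0, 10.0):
    print(outlier, bayes_l2(outlier, sigma), bayes_l1(outlier, sigma))
```

For a fixed small standard deviation, a tenfold larger outlier multiplies the data term of `bayes_l2` by one hundred but that of `bayes_l1` only by ten, which is the reduction in order that the text describes.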
4.1 Training system, data and details
Training system. We have developed a training system that automatically takes many pictures with actual lenses, as shown in Figure 5. This system consists of four 8K displays (LC-70X500) and a 12[m] slide stage arranged orthogonally to the displays. As the four displays are arranged 2 x 2, the screen size and resolution become 140 inches and 15360 x 8640, respectively. To make the deep network learn only the defocus blur information instead of the contextual information, we introduce various randomization techniques into our training recipe. We use many images sampled randomly from the MSCOCO data-set [vinyals2017show]. They are arranged in a matrix form, as shown in Figure 5. Horizontal/vertical flipping and random scaling are applied to each image to remove its shape and scale information.
Training data. We used a digital single-lens reflex (DSLR) camera: Nikon AI AF Nikkor 50mm f/1.8D (lens), Nikon D810 (body). The f-number was set to 4.0 throughout all experiments. The focus distance was set to 1500[mm], and the images were taken at 100 positions spaced at regular intervals in the blur space from 1100[mm] to 2400[mm]. Four different images were taken at each position: three for the training data and one for the test data. This process took only about three hours. The captured images were resized to 1845x1232. We randomly collected non-overlapping image patches only from edge and texture regions. The training and test data-sets include around 150,000 and 15,000 patches, respectively.
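One way to collect non-overlapping patches from edge and texture regions is sketched below. The grid layout, the mean-gradient criterion, and the threshold `min_mean_grad` are assumptions of this sketch, not values from the experiments; a random subset of the returned corners could then be drawn to obtain the training patches.

```python
import numpy as np

def select_textured_patches(gray, patch=16, min_mean_grad=0.02):
    """Return top-left (row, col) corners of non-overlapping textured patches."""
    gray = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(gray)       # vertical and horizontal gradients
    magnitude = np.hypot(gx, gy)
    corners = []
    # Stepping by the patch size guarantees that patches never overlap.
    for r in range(0, gray.shape[0] - patch + 1, patch):
        for c in range(0, gray.shape[1] - patch + 1, patch):
            if magnitude[r:r + patch, c:c + patch].mean() >= min_mean_grad:
                corners.append((r, c))
    return corners

rng = np.random.default_rng(0)
image = np.full((32, 64), 0.5)          # flat left half: no depth cue
image[:, 32:] = rng.random((32, 32))    # textured right half
corners = select_textured_patches(image)
```

Rejecting flat patches matches the first limitation stated in the introduction: regions with small gradients carry no depth cue.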
Implementation and training details. Our DDN operates on an input patch size of 16x16 pixels with five ResBlocks for each branch. The convolutional layers in all of our networks have 3x3 kernels and a stride of 1. The number of channels is fixed to 32 from the beginning to the end. To train our networks, we use Adam [kingma2014adam] with the default parameters and a batch size of 128. Although several data augmentation techniques [krizhevsky2012imagenet] are usually applied to avoid overfitting, many of them deform the shape of the PSF that the network should learn. We therefore use only random crop [buslaev2018albumentations], brightness [buslaev2018albumentations] and random erasing [zhong2017random], which do not affect the PSF. We trained DDN for 1500 epochs with the above training data and our training recipe. Finally, the test accuracy reached 0.72 (=8.1[mm] at 1500[mm]).
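The three augmentations can be sketched in NumPy as follows. None of them resamples or filters the image, so the blur shape inside the patch is left unchanged. The parameter ranges (`max_gain`, `erase_prob`, and the erased block size) are assumptions of this sketch, and the function `augment_patch` is hypothetical rather than the albumentations implementation cited in the text.

```python
import numpy as np

def augment_patch(image, rng, patch=16, max_gain=0.2, erase_prob=0.5):
    """Random crop, brightness gain, and random erasing without resampling."""
    h, w = image.shape[:2]
    r = rng.integers(0, h - patch + 1)
    c = rng.integers(0, w - patch + 1)
    # astype returns a copy, so the source image is never modified.
    out = image[r:r + patch, c:c + patch].astype(np.float32)
    out *= 1.0 + rng.uniform(-max_gain, max_gain)   # brightness gain
    if rng.random() < erase_prob:                   # random erasing
        eh, ew = rng.integers(2, patch // 2 + 1, size=2)
        er = rng.integers(0, patch - eh + 1)
        ec = rng.integers(0, patch - ew + 1)
        out[er:er + eh, ec:ec + ew] = rng.random((eh, ew) + out.shape[2:])
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
source = rng.random((32, 32, 3)).astype(np.float32)
original = source.copy()
augmented = augment_patch(source, rng)
```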
Ablation study. We show the contributions of the proposed components through an ablation study. The test accuracy curves during training are shown in Figure 8. The results show that the positional branch significantly affects the accuracy. The Bayes L1 loss and the color branch also affect the accuracy, whereas the gradient affects it only slightly.
Effectiveness of positional branch. We trained and tested the network with and without the positional branch using images of several sizes and shapes composed of several blocks, as shown in Figure 10. For a large area, training becomes hard because the network needs to handle the shift-variant PSF. Figure 10 shows that the accuracy without the positional branch drops quickly as the area becomes large, and, as shown in Figure 11b, the distortion caused by the shift-variant PSF remains. In contrast, with the positional branch, the accuracy stays high and the distortion disappears from the depth map, as shown in Figure 11d.
Effectiveness of color branch. We evaluate the network with and without the color branch on black-and-white and color subjects, as shown in Figure 10. For the black-and-white subject, the two networks have the same accuracy. The accuracy drops for the color subject because highly saturated colors confuse the network. Since the color branch helps the network to discriminate between the defocus blur and highly saturated colors, the network with the color branch achieves higher accuracy. This is also visible in the depth maps in Figure 11g and h.
Effectiveness of reliability. We verify the effectiveness of the learned reliability in actual scenes. Figure 12 shows several samples of uncertainty. Several types of uncertainty cause depth errors. The reliability prediction captures the depth errors correctly even though all of these scenes are unseen by the learned network. The threshold gives a good balance between the reliable and unreliable regions.
4.3 Quantitative and qualitative results
In the quantitative and qualitative experiments, the DDN is trained only on the indoor data-set. After the training, we changed the focus distance from 1500[mm] to 7000[mm] to apply it to outdoor scenes. We compared our DDN with DfAD [moriuchi201723] and a stereo camera composed of two prototype CCA cameras with a 20cm baseline. Since DfAD applies the DfD technique to the color channels, the comparison with DfAD also covers typical DfD methods. Coded aperture (CA) [levin2007image] and focal track (FT) [guo2017focal] are related to our method, but it is difficult to apply them to our CCA for the following reasons. The image of CCA has fewer zeros in the frequency domain than CA requires. FT is based on the time derivative of defocus blur pairs obtained with a slightly oscillating lens, and it is not applicable to CCA because of the large differences in blur between the color channels.
Quantitative evaluation. We quantitatively evaluated depth errors using our training system. To obtain the stereo depth, we used semi-global matching (SGM) [hirschmuller2005accurate] as implemented in [opencv]. Note that SGM uses strong spatial regularization, whereas DDN and DfAD do not. The evaluation covered the range from 2000[mm] to 12000[mm], which differs from the training range. Figure 8 shows the error curves over the target distance. The error of DDN is much smaller than that of DfAD but larger than that of the stereo camera. However, judging from the aperture size, the theoretical accuracy of our CCA camera is equivalent to that of a stereo camera with a 1.25cm baseline. Considering the aperture size, the accuracy of DDN is sufficiently high.
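The 1.25cm figure is consistent with the aperture diameter implied by the lens settings in Section 4.1, as the short calculation below shows. Treating the aperture diameter as the equivalent baseline, and assuming that disparity scales linearly with the baseline, are simplifications adopted for this sketch; the function names are hypothetical.

```python
def aperture_diameter_mm(focal_length_mm, f_number):
    """Aperture diameter of a lens: focal length divided by f-number."""
    return focal_length_mm / f_number

def baseline_ratio(stereo_baseline_mm, equivalent_baseline_mm):
    """How many times longer the stereo baseline is than the CCA one."""
    return stereo_baseline_mm / equivalent_baseline_mm

cca_baseline_mm = aperture_diameter_mm(50.0, 4.0)  # 50mm lens at f/4.0
ratio = baseline_ratio(200.0, cca_baseline_mm)     # 20cm stereo baseline
```

Under these assumptions, the 20cm stereo camera has a baseline sixteen times longer than the equivalent baseline of the CCA camera, which puts the remaining gap in error between DDN and stereo into perspective.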
Qualitative evaluation. We qualitatively evaluated the depth maps in actual outdoor scenes. Figure 13 shows the qualitative results. Gray indicates that there is no depth cue. The depth maps from the stereo camera have high resolution at a great distance. However, depth errors often occur at a small distance (within 3[m]), as shown in (i) and (ii). They are caused by occlusion (iii). This is a problem specific to stereo matching. DfAD has insufficient resolution in the distant region ((i), (ii), (iii) and (iv)). It also shows several errors caused by horizontal edges ((i) and (iv)) and slanted edges ((iv)). In contrast, DDN gives improved depth maps for the failure cases of both the stereo camera and DfAD.
Robustness against individual differences and other focal lengths. To verify the robustness against individual differences between lenses, we also trained DDN with Lens B and Lens C (Nikon AI AF Nikkor 50mm f/1.8D). We applied these trained DDNs to an image captured with Lens A (Nikon AI AF Nikkor 50mm f/1.8D), as shown in Figure 14. There is almost no difference among the resulting depth maps. We also applied our method to an f=14mm lens (AI AF Nikkor 14mm f/2.8D ED) and an f=150mm lens (SP 150-600mm F/5-6.3 Di VC USD G2). As shown in Figure 15, DDN gives clear depth maps not only for the f=50mm lens but also for the f=14mm and f=150mm lenses.
With a view to realizing single-shot depth measurement with a monocular camera, we have improved the depth measurement of CCA by using deep learning. We have proposed DDN with a self-attention mechanism to learn lens aberration efficiently. We have also proposed a Bayes L1 loss function to handle the uncertainty more accurately. We have confirmed that DDN has a great advantage over the baseline in terms of accuracy and that the Bayes L1 loss gives better accuracy than the L1 loss. The learned reliability captures the errors caused by uncertainty correctly, even in unseen outdoor scenes. In the quantitative results, the error of DDN was significantly lower than that of DfAD. In the qualitative results, DDN was superior to DfAD for various outdoor scenes.