
Learning Sub-Pixel Disparity Distribution for Light Field Depth Estimation

08/20/2022
by   Wentao Chao, et al.
National University of Defense Technology
Beijing Normal University

Existing light field (LF) depth estimation methods generally consider depth estimation as a regression problem, supervised by a pixel-wise L1 loss between the regressed disparity map and the groundtruth one. However, the disparity map is only a sub-space projection (i.e., an expectation) of the disparity distribution, and the latter is more essential for models to learn. In this paper, we propose a simple yet effective method to learn the sub-pixel disparity distribution by fully utilizing the power of deep networks. In our method, we construct the cost volume at the sub-pixel level to produce a finer depth distribution and design an uncertainty-aware focal loss to supervise the predicted disparity distribution so that it stays close to the groundtruth one. Extensive experimental results demonstrate the effectiveness of our method. Our method, called SubFocal, ranks first among 99 submitted algorithms on the HCI 4D LF Benchmark in terms of all five accuracy metrics (i.e., BadPix 0.01, BadPix 0.03, BadPix 0.07, MSE and Q25), and significantly outperforms recent state-of-the-art LF depth methods such as OACC-Net and AttMLFNet. Code and model are available at https://github.com/chaowentao/SubFocal.




Introduction

Light field (LF) depth estimation is a fundamental task in image processing with many subsequent applications, such as refocusing Ng et al. (2005); Wang et al. (2018), super-resolution Zhang et al. (2019); Jin et al. (2020a); Cheng et al. (2021), view synthesis Wu et al. (2018); Meng et al. (2019); Jin et al. (2020b), 3D reconstruction Kim et al. (2013) and virtual reality Yu (2017).

Figure 1: BadPix 0.07 and MSE scores achieved by different methods on the HCI 4D LF benchmark. Lower scores represent higher accuracy.

Deep learning has been rapidly developed and applied to LF depth estimation in recent years Heber and Pock (2016); Shin et al. (2018); Chen et al. (2021); Tsai et al. (2020); Wang et al. (2022a). The mainstream deep learning-based methods consist of four steps: feature extraction, cost construction, cost aggregation, and depth (disparity) regression.¹ Cost construction provides initial similarity measures of possible matched pixels among different views. Existing methods construct cost volumes by shifting the features of each view with specific integer values, and regress the final disparity values (in sub-pixel accuracy) by performing a weighted average over the candidate disparities. Although these methods achieve remarkable performance by minimizing the pixel-wise loss between the predicted disparity and the groundtruth one, the disparity distribution issue has been ignored.

¹ Depth $Z$ and disparity $d$ can be transformed into each other directly according to $Z = \frac{b \cdot f}{d}$, where $b$ and $f$ stand for the baseline length and the focal length of the LF camera, respectively. The following parts will not distinguish between them.
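As a quick sanity check of the footnote's conversion formula, here is a tiny Python snippet (ours; the baseline and focal-length values are made up purely for illustration, not taken from the paper):

```python
def disparity_to_depth(d, baseline, focal_length):
    """Depth from disparity via Z = b * f / d (footnote 1).

    `baseline` (b) and `focal_length` (f) are properties of the LF camera;
    the values used below are arbitrary and purely illustrative.
    """
    return baseline * focal_length / d

# A disparity of 2 px with b = 0.05 and f = 100 px gives a depth of 2.5 units.
print(disparity_to_depth(2.0, baseline=0.05, focal_length=100.0))
```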

The disparity distribution can be considered as a set of probability values indicating how likely a pixel belongs to each candidate disparity. The regressed disparity is only a sub-space projection (more precisely, an expectation) of the predicted disparity distribution. Since unlimited disparity distributions can produce the same expectation, simply minimizing the loss in the disparity domain can hinder deep networks from learning the core information and yield sub-optimal depth estimation results.
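To make this sub-space-projection argument concrete, the short NumPy check below (ours, not from the paper) builds two very different disparity distributions over the same candidate set and shows that they share the same expectation, so a loss defined only on the regressed disparity cannot tell them apart.

```python
import numpy as np

# Candidate disparities sampled over the paper's [-4, 4] range at a 0.5 interval.
candidates = np.arange(-4.0, 4.0 + 0.5, 0.5)

# A sharp, Dirac-like distribution centred at d = 1.0.
p_sharp = np.zeros_like(candidates)
p_sharp[candidates == 1.0] = 1.0

# A broad, bimodal distribution engineered to have the same expectation.
p_broad = np.zeros_like(candidates)
p_broad[candidates == -2.0] = 0.5
p_broad[candidates == 4.0] = 0.5

print(np.sum(candidates * p_sharp))  # 1.0
print(np.sum(candidates * p_broad))  # 1.0 -> identical regressed disparity
```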

In this paper, we propose a novel method to learn the disparity distribution for LF depth estimation. In our method, we propose an interpolation-based sub-pixel cost volume construction approach to achieve sub-pixel level sampling of the disparity distribution, and then design an uncertainty-aware focal loss to reduce the differences between the estimated and groundtruth disparity distributions. Since the groundtruth disparity distribution is unavailable in public LF datasets, we follow the common settings in Li et al. (2020b) to model the groundtruth disparity distribution as a Dirac delta distribution (i.e., the probability of the groundtruth disparity is 1 while the others are 0), and further discretize the Dirac delta distribution onto two adjacent sampling points to facilitate the training of our method. We conduct extensive experiments to validate the effectiveness of our method. As shown in Figure 1, our method achieves superior performance as compared to state-of-the-art LF depth estimation methods (e.g., OACC-Net Wang et al. (2022a), AttMLFNet Chen et al. (2021)).

In summary, the contributions of this paper are as follows:

  • We rethink LF depth estimation as a pixel-wise distribution alignment task and make the first attempt to learn the sub-pixel level disparity distribution using deep neural networks.

  • We propose a simple yet effective method for sub-pixel disparity distribution learning, in which a sub-pixel level cost volume construction approach and an uncertainty-aware focal loss are developed to supervise the learning of the disparity distribution.

  • Experimental results demonstrate the effectiveness of our method. Our method, called SubFocal, ranks first among all 99 submitted algorithms on the HCI 4D LF benchmark in terms of all five accuracy metrics (i.e., BadPix 0.01, BadPix 0.03, BadPix 0.07, MSE and Q25).

Related Work

In this section, we review traditional and deep learning-based methods in LF depth estimation.

Traditional Methods

Traditional depth estimation algorithms can be divided into three categories based on the properties of LF images. Multi-view stereo matching (MVS) based methods Yu et al. (2013); Jeon et al. (2015) exploit the multi-view information of the LFs for depth estimation. Epipolar plane image (EPI) based methods Zhang et al. (2016); Schilling et al. (2018) calculate the slopes using the epipolar geometric properties of the LFs to predict the depth. Defocus-based methods Tao et al. (2013); Wang et al. (2015) calculate the depth of a pixel by measuring the consistency at different focal stacks. However, traditional methods involve nonlinear optimization and manually-designed features, which are computationally demanding and sensitive to occlusions, weak textures and highlights.

Deep Learning-based Methods

Recently, deep learning has developed rapidly and has gradually been used for LF depth estimation Heber and Pock (2016); Shin et al. (2018); Chen et al. (2021); Tsai et al. (2020); Wang et al. (2022a). Some learning-based methods estimate the slope of the line from the EPI. Luo et al. Luo et al. (2017) used pairs of EPI image patches (horizontal and vertical) to train a CNN network. Li et al. Li et al. (2020a) designed a network to learn the relationship between the slopes of straight lines in horizontal and vertical EPIs. Leistner et al. Leistner et al. (2019) introduced the idea of EPI-Shift, which virtually shifts the LF stack, predicting the integer disparity and the offset separately and then combining them. These EPI-based methods only consider the features in the horizontal and vertical directions of angular views, and require complex subsequent refinement.

Other methods design networks by directly exploring the correspondence between views in LF images. Shin et al. Shin et al. (2018) introduced a multi-stream input structure to concatenate views from four directions for depth estimation. Tsai et al. Tsai et al. (2020) proposed an attention-based viewpoint selection network that generates attention maps to represent the importance of each viewpoint. Chen et al. Chen et al. (2021) proposed an attention-based multi-level fusion network, using an intra-branch fusion strategy and an inter-branch fusion strategy to perform hierarchical fusion of effective features from different perspectives. Wang et al. Wang et al. (2022a) constructed an occlusion-aware cost volume based on dilated convolution Yu and Koltun (2015), which achieved the highest accuracy in terms of the MSE metric with fast speed.

However, these state-of-the-art methods Tsai et al. (2020); Chen et al. (2021); Wang et al. (2022a) treat depth estimation as a regression task supervised by a pixel-wise L1 loss and lack explicit supervision of the disparity distribution. In this paper, we design a sub-pixel cost volume and an uncertainty-aware focal loss to learn the sub-pixel disparity distribution for LF depth estimation.

Figure 2: The architecture of our SubFocal network. First, a feature extractor is used to extract the features of each SAI and form the feature map. Second, a sub-pixel cost volume is constructed through a sub-pixel view shift based on bilinear interpolation. Third, a cost aggregation module is used to fuse information in the sub-pixel cost volume. A disparity regression module is then adopted to produce the predicted disparity distribution and the disparity map. The uncertainty map is derived by calculating the JS divergence between the predicted disparity distribution and the groundtruth one. Finally, the proposed uncertainty-aware focal loss is used to supervise the disparity distribution and the disparity map.

Method

Figure 2 depicts the architecture of our proposed method. First, the features of each sub-aperture image (SAI) are extracted using a shared feature extractor. Second, the sub-pixel view shift is performed to construct the sub-pixel cost volume. Third, the cost aggregation module is used to aggregate cost volume information. Moreover, the predicted disparity distribution and disparity map are then produced by attaching a disparity regression module, in which the Jensen-Shannon (JS) divergence is used to calculate the uncertainty map. Finally, the proposed uncertainty-aware focal loss is used to simultaneously supervise the disparity distribution and the disparity map. We will describe each module in detail below.

Network

Feature Extraction

In order to extract multi-scale features of each SAI, the widely used Spatial Pyramid Pooling (SPP) module Tsai et al. (2020); Chen et al. (2021) is adopted as the feature extraction module in our network. The features of the SPP module contain hierarchical context information and are concatenated to form the feature map.
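For readers unfamiliar with the SPP module, here is a rough tf.keras sketch of an SPP-style feature extractor for a single grayscale SAI. The layer counts, channel widths and pooling scales are our own illustrative assumptions, not the exact configuration used by SubFocal or LFattNet.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spp_feature_extractor(height=32, width=32, channels=4, pool_sizes=(2, 4, 8, 16)):
    """SPP-style multi-scale features for one sub-aperture image (illustrative)."""
    inp = layers.Input(shape=(height, width, 1))
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(inp)
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)

    branches = [x]
    for p in pool_sizes:
        b = layers.AveragePooling2D(pool_size=p)(x)                    # coarser context
        b = layers.Conv2D(channels, 1, activation="relu")(b)           # channel mixing
        b = layers.UpSampling2D(size=p, interpolation="bilinear")(b)   # back to (H, W)
        branches.append(b)

    # Hierarchical context information is concatenated into the final feature map.
    feat = layers.Concatenate()(branches)
    return tf.keras.Model(inp, feat)

spp_feature_extractor().summary()
```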

Sub-pixel Cost Volume

An LF can be represented using the two-plane model, i.e., $L(x, y, u, v)$, where $(x, y)$ stands for the spatial coordinate and $(u, v)$ for the angular coordinate. The LF disparity structure can be formulated as:

$L(x, y, u, v) = L\big(x + d\,(u_c - u),\; y + d\,(v_c - v),\; u_c, v_c\big)$    (1)

where $d$ denotes the disparity between the central view $(u_c, v_c)$ and the adjacent view at $(u, v)$.

The cost volume is constructed by shifting the features of each view according to Equation 1 within a predefined disparity range (usually from -4 to 4). It can be transformed into the disparity distribution by the subsequent cost aggregation and disparity regression modules. To obtain a finer disparity distribution, a sub-pixel level disparity sampling interval is required. In this paper, we construct a sub-pixel cost volume based on bilinear interpolation.
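The NumPy sketch below (our simplification, single view, single feature channel) illustrates the sub-pixel view shift: a view's feature map is translated by the fractional offset implied by a candidate disparity and the view's angular offset, using bilinear interpolation, and repeating this for every candidate stacks up a sub-pixel cost volume. The real network operates on multi-channel features of all views and fuses them rather than treating one view in isolation.

```python
import numpy as np

def shift_bilinear(feat, dy, dx):
    """Translate a (H, W) feature map by a fractional (dy, dx) offset
    using bilinear interpolation; border samples are clamped."""
    h, w = feat.shape
    ys = np.clip(np.arange(h) - dy, 0, h - 1)
    xs = np.clip(np.arange(w) - dx, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    return ((1 - wy) * (1 - wx) * feat[np.ix_(y0, x0)] +
            (1 - wy) * wx       * feat[np.ix_(y0, x1)] +
            wy       * (1 - wx) * feat[np.ix_(y1, x0)] +
            wy       * wx       * feat[np.ix_(y1, x1)])

# Candidate disparities sampled at a sub-pixel interval of 0.5 over [-4, 4].
candidates = np.arange(-4.0, 4.0 + 0.5, 0.5)

feat = np.random.rand(32, 32)   # toy single-channel features of one view
du, dv = 1, 2                   # angular offset of this view relative to the centre
cost_volume = np.stack([shift_bilinear(feat, d * dv, d * du) for d in candidates])
print(cost_volume.shape)        # (17, 32, 32): one slice per candidate disparity
```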

It is worth noting that a smaller sampling interval can generate a finer disparity distribution but causes more expensive computation and longer inference time. Consequently, we conduct ablation experiments to investigate the trade-off between accuracy and speed with respect to the sampling interval, which are described in detail in the ablation study.

Cost Aggregation and Disparity Regression

We employ a 3D CNN to aggregate the sub-pixel cost volume. Following Tsai et al. (2020), our cost aggregation consists of eight convolutional layers and two residual blocks. After passing through these 3D convolutional layers, we obtain the final cost volume, which is normalized by a softmax operation along the disparity dimension to produce the disparity probability distribution $P(d)$. Finally, the output disparity can be calculated according to:

$\hat{d} = \sum_{d = d_{\min}}^{d_{\max}} d \cdot P(d)$    (2)

where $\hat{d}$ denotes the estimated center-view disparity, $d_{\min}$ and $d_{\max}$ stand for the minimum and maximum disparity values, and $d$ is the sampling value between $d_{\min}$ and $d_{\max}$.
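A minimal NumPy version of the regression in Equation 2 is shown below (ours): the aggregated cost is turned into a disparity distribution with a softmax along the disparity dimension, and the output disparity is its expectation over the candidate values. The toy cost values are random and purely for demonstration.

```python
import numpy as np

def regress_disparity(cost, candidates):
    """cost: (D, H, W) aggregated cost volume; candidates: (D,) disparity samples.
    Returns the expected sub-pixel disparity per pixel, as in Equation 2."""
    cost = cost - cost.max(axis=0, keepdims=True)                    # numerical stability
    prob = np.exp(cost) / np.exp(cost).sum(axis=0, keepdims=True)    # softmax over D
    return np.tensordot(candidates, prob, axes=(0, 0))               # sum_d d * P(d)

candidates = np.arange(-4.0, 4.0 + 0.5, 0.5)
cost = np.random.randn(len(candidates), 8, 8)     # toy aggregated cost for an 8x8 crop
disparity = regress_disparity(cost, candidates)
print(disparity.shape)                            # (8, 8), values within [-4, 4]
```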

Uncertainty-aware Focal Loss

Previous methods Shin et al. (2018); Tsai et al. (2020); Chen et al. (2021); Wang et al. (2022a) consider depth estimation as a simple regression process, and treat each point equally via a pixel-wise L1 loss. However, unlimited disparity distributions can produce the same expectation of disparity. Our aim is for models to learn reasonable disparity distributions (e.g., the Dirac delta distribution Li et al. (2020b)) for depth estimation. It is challenging to use the simple L1 loss to supervise the disparity distribution, especially for difficult spatial areas. Here, we visualize the disparity distributions of our method in Figure 3. In simple, non-occluded regions, a reasonable disparity distribution and accurate disparity can be obtained by directly using the L1 loss for training. In contrast, in difficult occluded regions, directly using the L1 loss cannot yield a reasonable disparity distribution and accurate disparity. This indicates an imbalance issue in learning the disparity distribution and the disparity.

Figure 3: Visualization of disparity distributions. We compare the disparity distributions of selected points with and without using UAFL. The disparity distribution with UAFL is closer to the groundtruth in the occlusion region.

To address this issue, we design an uncertainty-aware focal loss (UAFL) to supervise the disparity distribution. Similar to Focal Loss Lin et al. (2017); Li et al. (2020b), we adopt a dynamic weight adjustment strategy to make the model focus more on the hard points with significant differences in disparity distributions. Figure 3 compares the disparity distributions with and without using UAFL. By using UAFL, our model can predict a reasonable disparity distribution and accurate disparity in challenging regions.

To quantitatively evaluate the difficulty of different regions, we adopt the JS divergence to measure the difference between the predicted and the groundtruth disparity distributions, as presented in Equation 3:

$U = \mathrm{JS}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right)$    (3)

where $U$ is the uncertainty map, $P$ represents the predicted disparity distribution, and $Q$ is the groundtruth disparity distribution. Since we follow Li et al. (2020b) to consider the groundtruth disparity distribution as a Dirac delta distribution, we discretize it onto the two adjacent sampling points $d_l$ and $d_r$, according to:

$Q(d_l) = \dfrac{d_r - d^{gt}}{d_r - d_l}, \qquad Q(d_r) = \dfrac{d^{gt} - d_l}{d_r - d_l}$    (4)

where $d_l$ and $d_r$ are the left and right adjacent sampling points, respectively, and $Q(d_l)$ and $Q(d_r)$ represent their corresponding probability values. $d^{gt}$ is the groundtruth disparity. Finally, we design the uncertainty-aware focal loss to supervise both the disparity distribution and the disparity map:

$\mathcal{L}_{\mathrm{UAFL}} = U^{\lambda} \odot \big|\hat{d} - d^{gt}\big|$    (5)

where $U$ is the uncertainty map, $\hat{d}$ is the predicted disparity, and $d^{gt}$ is the groundtruth disparity. $\odot$ stands for the element-wise multiplication operation, and $\lambda$ is the coefficient factor that controls the ratio of dynamic weight assignment. When $\lambda = 0$, UAFL degenerates to the standard L1 loss. In the ablation experiments, we analyze the effect of different $\lambda$ values.
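To tie Equations 3–5 together, the per-pixel NumPy sketch below (ours, following the reconstruction of Equation 5 above) encodes the groundtruth disparity on its two neighbouring samples, measures the JS divergence between the predicted and groundtruth distributions as the uncertainty U, and uses U^λ to rescale the L1 error; with λ = 0 it reduces to the plain L1 loss.

```python
import numpy as np

candidates = np.arange(-4.0, 4.0 + 0.5, 0.5)   # sub-pixel samples, interval 0.5

def encode_groundtruth(d_gt):
    """Equation 4: spread a Dirac delta at d_gt over its two adjacent samples."""
    q = np.zeros_like(candidates)
    r = np.searchsorted(candidates, d_gt)
    l = max(r - 1, 0)
    if l == r or candidates[r] == d_gt:         # d_gt falls exactly on a sample
        q[r] = 1.0
    else:
        dl, dr = candidates[l], candidates[r]
        q[l] = (dr - d_gt) / (dr - dl)
        q[r] = (d_gt - dl) / (dr - dl)
    return q

def js_divergence(p, q, eps=1e-12):
    """Equation 3: JS divergence between predicted and groundtruth distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def uafl(pred_disp, pred_dist, d_gt, lam=0.1):
    """Equation 5 for one pixel: U**lam rescales the L1 error (lam = 0 -> plain L1)."""
    u = js_divergence(pred_dist, encode_groundtruth(d_gt))
    return (u ** lam) * abs(pred_disp - d_gt)

# Toy prediction: a softmax-shaped distribution peaking near d = 1.2.
logits = -np.abs(candidates - 1.2)
p = np.exp(logits) / np.exp(logits).sum()
print(uafl(np.sum(candidates * p), p, d_gt=1.0, lam=0.1))
```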

Backgammon | Dots | Pyramids | Stripes
(per scene: BadPix 0.07, BadPix 0.03, BadPix 0.01, MSE ×100)
SPO Zhang et al. (2016) 3.781 8.639 49.94 4.587 16.27 35.06 58.07 5.238 0.861 6.263 79.20 0.043 14.99 15.46 21.87 6.955
CAE Park et al. (2017) 3.924 4.313 17.32 6.074 12.40 42.50 83.70 5.082 1.681 7.162 27.54 0.048 7.872 16.90 39.95 3.556
PS_RF Jeon et al. (2018) 7.142 13.94 74.66 6.892 7.975 17.54 78.80 8.338 0.107 6.235 83.23 0.043 2.964 5.790 41.65 1.382
OBER-cross-ANP Schilling et al. (2018) 3.413 4.956 13.66 4.700 0.974 37.66 73.13 1.757 0.364 1.130 8.171 0.008 3.065 9.352 44.72 1.435
OAVC Han et al. (2021) 3.121 5.117 49.05 3.835 69.11 75.38 92.33 16.58 0.831 9.027 33.66 0.040 2.903 19.88 28.14 1.316
Epinet-fcn-m Shin et al. (2018) 3.501 5.563 19.43 3.705 2.490 9.117 35.61 1.475 0.159 0.874 11.42 0.007 2.457 2.711 11.77 0.932
EPI_ORM Li et al. (2020a) 3.988 7.238 34.32 3.411 36.10 47.93 65.71 14.48 0.324 1.301 19.06 0.016 6.871 13.94 55.14 1.744
LFattNet Tsai et al. (2020) 3.126 3.984 11.58 3.648 1.432 3.012 15.06 1.425 0.195 0.489 2.063 0.004 2.933 5.417 18.21 0.892
FastLFNet Huang et al. (2021) 5.138 11.41 39.84 3.986 21.17 41.11 68.15 3.407 0.620 2.193 22.19 0.018 9.442 32.60 63.04 0.984
DistgDisp Wang et al. (2022b) 5.824 10.54 26.17 4.712 1.826 4.464 25.37 1.367 0.108 0.539 4.953 0.004 3.913 6.885 19.25 0.917
OACC-Net Wang et al. (2022a) 3.931 6.640 21.61 3.938 1.510 3.040 21.02 1.418 0.157 0.536 3.852 0.004 2.920 4.644 15.24 0.845
SubFocal(Ours) 3.194 4.281 12.47 3.667 0.899 1.524 15.51 1.301 0.220 0.411 1.867 0.005 2.464 3.568 9.386 0.821
Boxes | Cotton | Dino | Sideboard
(per scene: BadPix 0.07, BadPix 0.03, BadPix 0.01, MSE ×100)
SPO Zhang et al. (2016) 15.89 29.52 73.23 9.107 2.594 13.71 69.05 1.313 2.184 16.36 69.87 0.310 9.297 28.81 73.36 1.024
CAE Park et al. (2017) 17.89 40.40 72.69 8.424 3.369 15.50 59.22 1.506 4.968 21.30 61.06 0.382 9.845 26.85 56.92 0.876
PS_RF Jeon et al. (2018) 18.95 35.23 76.39 9.043 2.425 14.98 70.41 1.161 4.379 16.44 75.97 0.751 11.75 36.28 79.98 1.945
OBER-cross-ANP Schilling et al. (2018) 10.76 17.92 44.96 4.750 1.108 7.722 36.79 0.555 2.070 6.161 22.76 0.336 5.671 12.48 32.79 0.941
OAVC Han et al. (2021) 16.14 33.68 71.91 6.988 2.550 20.79 61.35 0.598 3.936 19.03 61.82 0.267 12.42 37.83 73.85 1.047
Epinet-fcn-m Shin et al. (2018) 12.34 18.11 46.09 5.968 0.447 2.076 25.72 0.197 1.207 3.105 19.39 0.157 4.462 10.86 36.49 0.798
EPI_ORM Li et al. (2020a) 13.37 25.33 59.68 4.189 0.856 5.564 42.94 0.287 2.814 8.993 41.04 0.336 5.583 14.61 52.59 0.778
LFattNet Tsai et al. (2020) 11.04 18.97 37.04 3.996 0.272 0.697 3.644 0.209 0.848 2.340 12.22 0.093 2.870 7.243 20.73 0.530
FastLFNet Huang et al. (2021) 18.70 37.45 71.82 4.395 0.714 6.785 49.34 0.322 2.407 13.27 56.24 0.189 7.032 21.62 61.96 0.747
DistgDisp Wang et al. (2022b) 13.31 21.13 41.62 3.325 0.489 1.478 7.594 0.184 1.414 4.018 20.46 0.099 4.051 9.575 28.28 0.713
OACC-Net Wang et al. (2022a) 10.70 18.16 43.48 2.892 0.312 0.829 10.45 0.162 0.967 2.874 22.11 0.083 3.350 8.065 28.64 0.542
SubFocal(Ours) 8.536 16.44 32.03 2.993 0.257 0.611 3.337 0.188 0.777 2.052 10.23 0.141 2.360 6.113 18.95 0.404
Bedroom | Bicycle | Herbs | Origami
(per scene: BadPix 0.07, BadPix 0.03, BadPix 0.01, MSE ×100)
SPO Zhang et al. (2016) 4.864 23.53 72.37 0.209 10.91 26.90 71.13 5.570 8.260 30.62 86.62 11.23 11.69 32.71 75.58 2.032
CAE Park et al. (2017) 5.788 25.36 68.59 0.234 11.22 23.62 59.64 5.135 9.550 23.16 59.24 11.67 10.03 28.35 64.16 1.778
PS_RF Jeon et al. (2018) 6.015 22.45 80.68 0.288 17.17 32.32 79.80 7.926 10.48 21.90 66.47 15.25 13.57 36.45 80.32 2.393
OBER-cross-ANP Schilling et al. (2018) 3.329 9.558 28.91 0.185 8.683 16.17 37.83 4.314 7.120 14.06 36.83 10.44 8.665 20.03 42.16 1.493
OAVC Han et al. (2021) 4.915 19.09 64.76 0.212 12.22 25.46 64.74 4.886 8.733 29.65 74.76 10.36 12.56 30.59 69.35 1.478
Epinet-fcn-m Shin et al. (2018) 2.299 6.345 31.82 0.204 9.614 16.83 42.83 4.603 10.96 25.85 59.93 9.491 5.807 13.00 42.21 1.478
EPI_ORM Li et al. (2020a) 5.492 14.66 51.02 0.298 11.12 21.20 51.22 3.489 8.515 24.60 68.79 4.468 8.661 22.95 56.57 1.826
LFattNet Tsai et al. (2020) 2.792 5.318 13.33 0.366 9.511 15.99 31.35 3.350 5.219 9.473 19.27 6.605 4.824 8.925 22.19 1.733
FastLFNet Huang et al. (2021) 4.903 15.92 52.88 0.202 15.38 28.45 59.24 4.715 10.72 23.39 59.98 8.285 12.64 33.65 72.36 2.228
DistgDisp Wang et al. (2022b) 2.349 5.925 17.66 0.111 9.856 17.58 35.72 3.419 6.846 12.44 24.44 6.846 4.270 9.816 28.42 1.053
OACC-Net Wang et al. (2022a) 2.308 5.707 21.97 0.148 8.078 14.40 32.74 2.907 6.515 46.78 86.41 6.561 4.065 9.717 32.25 0.878
SubFocal(Ours) 2.234 4.364 11.81 0.141 7.277 12.75 28.39 2.670 4.297 8.190 17.36 6.126 2.961 6.917 19.33 0.956
Table 1: Quantitative comparison results with the top ranked methods on HCI 4D LF benchmark. Best results are in bold faces and the second best results are underlined.
Figure 4: Screenshot of the benchmark website (https://lightfield-analysis.uni-konstanz.de/) taken in August 2022. We submitted our result, SubFocal, to the benchmark website. The red boxes indicate that our method ranks first on the five widely used accuracy metrics.
Figure 5: Qualitative comparison of our method with the state-of-the-art methods. The even-numbered rows show the results of these methods, while the odd-numbered rows present the corresponding BadPix 0.07 error maps. Red pixels on the error maps mark areas where the error exceeds the threshold, i.e., 0.07.

Experiments

In this section, we first describe the dataset, evaluation metrics, and implementation details. The performance of our method is then compared with the state-of-the-art methods. Finally, we design ablation experiments to analyze the effectiveness of the proposed method.

Datasets

The 4D LF Benchmark (HCI 4D) Honauer et al. (2016) is the current mainstream benchmark dataset for evaluating LF disparity estimation. The dataset is rendered with Blender and provides 28 synthetic, densely sampled 4D LFs, including 16 Additional, 4 Stratified, 4 Training, and 4 Test scenes. The Additional, Stratified and Training scenes have groundtruth disparity maps, while the Test scenes do not provide groundtruth disparity annotations. The spatial resolution is 512×512, and the angular resolution is 9×9.

Evaluation

For quantitative evaluation, we adopt the standard metrics, including mean square error (MSE ×100) and bad pixel ratio (BadPix(ε)). Specifically, MSE ×100 is the mean square error of all pixels within a given mask, multiplied by 100. BadPix(ε) is the percentage of pixels within a given mask whose absolute difference between the groundtruth label and the algorithm's prediction surpasses a threshold ε; ε is usually chosen as 0.01, 0.03 and 0.07.
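A small NumPy helper (ours) implements the two metrics exactly as described, evaluating them only over a given boolean mask; the random maps at the end are just a usage example.

```python
import numpy as np

def mse_x100(pred, gt, mask):
    """Mean square error over the masked pixels, multiplied by 100."""
    err = (pred - gt)[mask]
    return 100.0 * np.mean(err ** 2)

def badpix(pred, gt, mask, eps=0.07):
    """Percentage of masked pixels whose absolute error exceeds the threshold eps."""
    err = np.abs(pred - gt)[mask]
    return 100.0 * np.mean(err > eps)

# Usage example with synthetic maps at the benchmark's 512x512 resolution.
gt = np.random.uniform(-4.0, 4.0, (512, 512))
pred = gt + np.random.normal(0.0, 0.05, gt.shape)
mask = np.ones_like(gt, dtype=bool)
print(mse_x100(pred, gt, mask), badpix(pred, gt, mask, eps=0.07))
```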

Implementation Details

We follow the previous methods Shin et al. (2018); Tsai et al. (2020); Chen et al. (2021); Wang et al. (2022a) and use the 16 Additional scenes as the training set, the 8 Stratified and Training scenes as the validation set, and the 4 Test scenes as the testing set. We use LFattNet Tsai et al. (2020) as the baseline model. During training, LF images are randomly cropped into 32×32 grayscale patches. The same data augmentation strategy Shin et al. (2018); Tsai et al. (2020) is used to improve the model performance. We set the batch size to 32 and use the Adam Kingma and Ba (2014) optimizer. The disparity range is set to [-4, 4], and the sampling interval is set to 0.5 by default. To speed up model training, a two-stage strategy is used. In the first stage, the model is trained for 50 epochs using the L1 loss. In the second stage, the L1 loss is replaced with our UAFL, and the model is fine-tuned for 10 epochs. We use the TensorFlow Abadi et al. (2016) framework to implement our network. The training process takes roughly 65 hours on an NVIDIA RTX 3090 GPU.

Figure 6: Evaluation on real scenes. Compared to EPINet Shin et al. (2018) and OACC-Net Wang et al. (2022a), our disparity maps show better performance, especially on the edges of objects.

Compared with the state-of-the-art methods

Quantitative Comparison

We perform a quantitative comparison with state-of-the-art methods, including SPO Zhang et al. (2016), CAE Park et al. (2017), PS_RF Jeon et al. (2018), OBER-cross-ANP Schilling et al. (2018), OAVC Han et al. (2021), Epinet-fcn-m Shin et al. (2018), EPI_ORM Li et al. (2020a), LFattNet Tsai et al. (2020), FastLFNet Huang et al. (2021), AttMLFNet Chen et al. (2021), DistgDisp Wang et al. (2022b), and OACC-Net Wang et al. (2022a). Figure 4 demonstrates that our method ranks first on all five commonly evaluated metrics (BadPix 0.01, BadPix 0.03, BadPix 0.07, MSE ×100, and Q25) on the HCI 4D LF benchmark. Table 1 shows the quantitative comparison results on the validation and test sets. Our method achieves promising performance on different scenes, especially on the unlabeled test set.

Qualitative Comparison

We show the qualitative results in Figure 5 for three scenes. The BadPix 0.07 error maps show that our method has substantially less error than previous methods, especially in occlusion and edge regions.

Performance on Real Scenes

We evaluate our method on real-world LF images taken with a Lytro Illum camera Bok et al. (2016) and a moving camera mounted on a gantry Vaish and Adams (2008). We directly use the model trained on the HCI dataset for inference. Figure 6 illustrates that our method produces more accurate disparity maps than EPINet Shin et al. (2018) and OACC-Net Wang et al. (2022a), such as on the boom of the bulldozer (middle). This clearly demonstrates the generalization capability of our method. Please refer to the supplemental material for additional comparisons.

Ablation Study

We conduct extensive ablation experiments to analyze the effectiveness of our method. Our ablation studies include investigations of the disparity interval, combinations of loss functions, and the hyper-parameter λ of UAFL.

Disparity Interval of Sub-pixel Cost Volume

Disp. Num. Disp. Interv. BP 0.07 MSE Training Time Inference Time
9 1 2.79 1.245 39h 3.5s
17 0.5 2.04 0.845 54h 5.7s
33 0.25 1.92 0.839 110h 18.7s
81 0.1 1.73 0.748 330h 56.3s
Table 2: Comparative results achieved by our method with different disparity sampling densities. Training and inference are performed on an NVIDIA RTX 3090 GPU. Some settings perform patch cropping for inference.
Figure 7: Visual comparison of the results with different disparity intervals.

The selection of the disparity interval determines the sampling density and further affects the inference speed of the model. We need to choose a proper disparity interval to achieve a good trade-off between accuracy and speed. Therefore, we set the disparity interval to 1, 0.5, 0.25 and 0.1, respectively, train our model for 50 epochs on the HCI 4D LF benchmark, and then evaluate the performance on the validation set. Table 2 shows the results of the model at different disparity intervals and sample numbers. As the disparity interval decreases and the number of samples increases, the accuracy of the model improves, but the training time also increases. Specifically, reducing the disparity interval from 1 to 0.5 improves BadPix 0.07 and MSE by 0.75 and 0.4, respectively. As shown in Figure 7, the disparity map at an interval of 0.5 is more accurate than that at an interval of 1 on the heavily occluded region of the scene. In addition, we find no significant visual difference between the results at disparity intervals of 0.25 and 0.5. Consequently, considering both accuracy and speed, we choose a disparity interval of 0.5 for further experiments.

Loss baseline L1 MSE JS KL UAFL BP 0.07 MSE
2.04 0.845
1.95 0.777
2.64 0.698
2.08 0.745
1.94 0.771
2.36 0.691
1.93 0.768
Table 3: Quantitative comparison of the results with different combinations of loss functions.

Combinations of Loss Functions

How to choose appropriate loss functions is an important issue in our method. Since other metrics, e.g., the Kullback-Leibler (KL) divergence, can also measure the distance between two distributions, we test several combinations of these losses, including the L1 loss, MSE loss, KL divergence loss, and JS divergence loss. We use the model with a disparity interval of 0.5 as the pre-trained baseline. The model is then fine-tuned for 10 epochs. Table 3 shows the results of the different loss functions. Although the MSE loss leads to a better MSE metric, the BadPix 0.07 metric increases sharply. Our UAFL improves both the MSE and BadPix 0.07 metrics compared to the baseline, and the overall performance of UAFL is superior to that of the other loss functions.

λ        1      0.5    0.4    0.3    0.2    0.1
BP 0.07  3.23   2.26   2.16   2.04   1.98   1.93
MSE      0.671  0.728  0.734  0.747  0.756  0.768
Table 4: Experimental results of UAFL using different values of the hyper-parameter λ.
Figure 8: Qualitative and quantitative analysis of the results for λ = 1 and λ = 0.1. (a) Disparity maps and BadPix 0.07 error maps. (b) Histograms of the MSE error maps.

Hyper-parameter λ of UAFL

The function of UAFL is to dynamically assign weights based on the uncertainty map to supervise the training of the disparity distribution and the disparity simultaneously. We introduce the hyper-parameter λ to adjust the ratio of the assigned weights. Based on a disparity interval of 0.5, we choose different λ values to investigate their influence, and the results are shown in Table 4. We observe that as λ increases from 0.1 to 1, the MSE metric decreases, while the BadPix 0.07 metric increases. As shown in Figure 8 (a), the BadPix 0.07 error map with λ = 0.1 is significantly better than that with λ = 1. The histogram of MSE errors in Figure 8 (b) shows fewer error outliers for λ = 1 and a slight increase in error outliers for λ = 0.1, so the mean MSE of λ = 1 is better than that of λ = 0.1. We analyze that a larger λ increases the ratio of weights assigned to difficult areas, and thus makes the model focus more on challenging regions (e.g., outliers), resulting in a smaller overall MSE error. The BadPix metric, on the other hand, calculates the proportion of pixels with an error larger than a threshold, such as 0.07. With a larger λ, the constraint on easy regions is relaxed because they receive lower weights, leading to a slight increase in the BadPix metric. Considering that the BadPix metric is prioritized for depth estimation, we finally set λ = 0.1 in our main model.

Conclusion

This paper presents a simple and effective method to learn the disparity distribution for LF depth estimation. On the one hand, we construct an interpolation-based cost volume at the sub-pixel level, which yields a finer disparity distribution. On the other hand, we design an uncertainty-aware focal loss based on JS divergence to supervise the disparity distribution. Extensive experiments validate the effectiveness of our method. Our method achieves superior performance as compared to state-of-the-art LF depth estimation methods.

References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: Implementation Details.
  • Y. Bok, H. Jeon, and I. S. Kweon (2016) Geometric calibration of micro-lens-based light field cameras using line features. IEEE transactions on pattern analysis and machine intelligence 39 (2), pp. 287–300. Cited by: Performance on Real Scenes.
  • J. Chen, S. Zhang, and Y. Lin (2021) Attention-based multi-level fusion network for light field depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1009–1017. Cited by: Introduction, Deep Learning-based Methods, Feature Extraction, Uncertainty-aware Focal Loss, Implementation Details, Quantitative Comparison.
  • Z. Cheng, Z. Xiong, C. Chen, D. Liu, and Z. Zha (2021) Light field super-resolution with zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10010–10019. Cited by: Introduction.
  • K. Han, W. Xiang, E. Wang, and T. Huang (2021) A novel occlusion-aware vote cost for light field depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Table 1, Quantitative Comparison.
  • S. Heber and T. Pock (2016) Convolutional networks for shape from light field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3746–3754. Cited by: Introduction, Deep Learning-based Methods.
  • K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke (2016) A dataset and evaluation methodology for depth estimation on 4d light fields. In Asian conference on computer vision, pp. 19–34. Cited by: Datasets.
  • Z. Huang, X. Hu, Z. Xue, W. Xu, and T. Yue (2021) Fast light-field disparity estimation with multi-disparity-scale cost aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6320–6329. Cited by: Table 1, Quantitative Comparison.
  • H. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. Tai, and I. S. Kweon (2018) Depth from a light field image with learning-based matching costs. IEEE transactions on pattern analysis and machine intelligence, pp. 297–310. Cited by: Table 1, Quantitative Comparison.
  • H. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. Tai, and I. So Kweon (2015) Accurate depth map estimation from a lenslet light field camera. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1547–1555. Cited by: Traditional Methods.
  • J. Jin, J. Hou, J. Chen, and S. Kwong (2020a) Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2260–2269. Cited by: Introduction.
  • J. Jin, J. Hou, J. Chen, H. Zeng, S. Kwong, and J. Yu (2020b) Deep coarse-to-fine dense light field reconstruction with flexible sampling and geometry-aware fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Introduction.
  • C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. H. Gross (2013) Scene reconstruction from high spatio-angular resolution light fields.. ACM Transactions on Graphics (TOG) 32 (4), pp. 73–1. Cited by: Introduction.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Implementation Details.
  • T. Leistner, H. Schilling, R. Mackowiak, S. Gumhold, and C. Rother (2019) Learning to think outside the box: wide-baseline light field depth estimation with epi-shift. In 2019 International Conference on 3D Vision (3DV), pp. 249–257. Cited by: Deep Learning-based Methods.
  • K. Li, J. Zhang, R. Sun, X. Zhang, and J. Gao (2020a) Epi-based oriented relation networks for light field depth estimation. arXiv preprint arXiv:2007.04538. Cited by: Deep Learning-based Methods, Table 1, Quantitative Comparison.
  • X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang (2020b) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems 33, pp. 21002–21012. Cited by: Introduction, Uncertainty-aware Focal Loss, Uncertainty-aware Focal Loss, Uncertainty-aware Focal Loss.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: Uncertainty-aware Focal Loss.
  • Y. Luo, W. Zhou, J. Fang, L. Liang, H. Zhang, and G. Dai (2017) EPI-patch based convolutional neural network for depth estimation on 4D light field. In International Conference on Neural Information Processing, pp. 642–652. Cited by: Deep Learning-based Methods.
  • N. Meng, H. K. So, X. Sun, and E. Y. Lam (2019) High-dimensional dense residual convolutional neural network for light field reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (3), pp. 873–886. Cited by: Introduction.
  • R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan (2005) Light field photography with a hand-held plenoptic camera. Ph.D. Thesis, Stanford University. Cited by: Introduction.
  • I. K. Park, K. M. Lee, et al. (2017) Robust light field depth estimation using occlusion-noise aware data costs. IEEE transactions on pattern analysis and machine intelligence 40 (10), pp. 2484–2497. Cited by: Table 1, Quantitative Comparison.
  • H. Schilling, M. Diebold, C. Rother, and B. Jähne (2018) Trust your model: light field depth estimation with inline occlusion handling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4530–4538. Cited by: Traditional Methods, Table 1, Quantitative Comparison.
  • C. Shin, H. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim (2018) Epinet: a fully-convolutional neural network using epipolar geometry for depth from light field images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4748–4757. Cited by: Introduction, Deep Learning-based Methods, Deep Learning-based Methods, Uncertainty-aware Focal Loss, Table 1, Figure 6, Implementation Details, Quantitative Comparison, Performance on Real Scenes.
  • M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi (2013) Depth from combining defocus and correspondence using light-field cameras. In Proceedings of the IEEE International Conference on Computer Vision, pp. 673–680. Cited by: Traditional Methods.
  • Y. Tsai, Y. Liu, M. Ouhyoung, and Y. Chuang (2020) Attention-based view selection networks for light-field disparity estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12095–12103. Cited by: Introduction, Deep Learning-based Methods, Deep Learning-based Methods, Deep Learning-based Methods, Feature Extraction, Cost Aggregation and Disparity Regression, Uncertainty-aware Focal Loss, Table 1, Implementation Details, Quantitative Comparison.
  • V. Vaish and A. Adams (2008) The (new) stanford light field archive. Computer Graphics Laboratory, Stanford University 6 (7), pp. 3. Cited by: Performance on Real Scenes.
  • T. Wang, A. A. Efros, and R. Ramamoorthi (2015) Occlusion-aware depth estimation using light-field cameras. In Proceedings of the IEEE international conference on computer vision, pp. 3487–3495. Cited by: Traditional Methods.
  • Y. Wang, L. Wang, Z. Liang, J. Yang, W. An, and Y. Guo (2022a) Occlusion-aware cost constructor for light field depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19809–19818. Cited by: Introduction, Introduction, Deep Learning-based Methods, Deep Learning-based Methods, Deep Learning-based Methods, Uncertainty-aware Focal Loss, Table 1, Figure 6, Implementation Details, Quantitative Comparison, Performance on Real Scenes.
  • Y. Wang, L. Wang, G. Wu, J. Yang, W. An, J. Yu, and Y. Guo (2022b) Disentangling light fields for super-resolution and disparity estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Table 1, Quantitative Comparison.
  • Y. Wang, J. Yang, Y. Guo, C. Xiao, and W. An (2018) Selective light field refocusing for camera arrays using bokeh rendering and superresolution. IEEE Signal Processing Letters 26 (1), pp. 204–208. Cited by: Introduction.
  • G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai (2018) Light field reconstruction using convolutional network on epi and extended applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7), pp. 1681–1694. Cited by: Introduction.
  • F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: Deep Learning-based Methods.
  • J. Yu (2017) A light-field journey to virtual reality. IEEE MultiMedia 24 (2), pp. 104–112. Cited by: Introduction.
  • Z. Yu, X. Guo, H. Lin, A. Lumsdaine, and J. Yu (2013) Line assisted light field triangulation and stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2792–2799. Cited by: Traditional Methods.
  • S. Zhang, Y. Lin, and H. Sheng (2019) Residual networks for light field image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11046–11055. Cited by: Introduction.
  • S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong (2016) Robust depth estimation for light field via spinning parallelogram operator. Computer Vision and Image Understanding 145, pp. 148–159. Cited by: Traditional Methods, Table 1, Quantitative Comparison.