Learning Sub-Pixel Disparity Distribution for Light Field Depth Estimation
Existing light field (LF) depth estimation methods generally treat depth estimation as a regression problem, supervised by a pixel-wise L1 loss between the regressed disparity map and the groundtruth one. However, the disparity map is only a sub-space projection (i.e., an expectation) of the disparity distribution, and the distribution itself is more essential for models to learn. In this paper, we propose a simple yet effective method to learn the sub-pixel disparity distribution by fully utilizing the power of deep networks. In our method, we construct the cost volume at the sub-pixel level to produce a finer depth distribution and design an uncertainty-aware focal loss to supervise the disparity distribution to be close to the groundtruth one. Extensive experimental results demonstrate the effectiveness of our method. Our method, called SubFocal, ranks first among 99 submitted algorithms on the HCI 4D LF Benchmark in terms of all five accuracy metrics (i.e., BadPix0.01, BadPix0.03, BadPix0.07, MSE and Q25), and significantly outperforms recent state-of-the-art LF depth methods such as OACC-Net and AttMLFNet. Code and model are available at https://github.com/chaowentao/SubFocal.
The mainstream deep learning-based methods consist of four steps: feature extraction, cost construction, cost aggregation and depth (disparity) regression. (Depth $\gamma$ and disparity $d$ can be transformed directly into each other according to $\gamma = fB/d$, where $B$ and $f$ stand for the baseline length and the focal length of the LF camera, respectively. The following part will not distinguish between them.) Cost construction provides initial similarity measures of possible matched pixels among different views. Existing methods construct cost volumes by shifting the features of each view by specific integer values, and regress the final disparity values (with sub-pixel accuracy) by performing a weighted average over the candidate disparities. Although these methods achieve remarkable performance by minimizing the pixel-wise loss between the predicted disparity and the groundtruth one, the disparity distribution itself has been ignored.
The disparity distribution can be regarded as the set of probabilities that a pixel takes each of the candidate disparities. The regressed disparity is only a sub-space projection (more precisely, an expectation) of the predicted disparity distribution. Since infinitely many disparity distributions produce the same expectation, simply minimizing the loss in the disparity domain can hinder deep networks from learning the core information and yield sub-optimal depth estimation results.
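The ambiguity described above can be illustrated with a small sketch. The candidate disparities and both probability vectors below are made-up values chosen for illustration, not data from the paper: a sharply peaked distribution and a bimodal one yield exactly the same expected disparity, so a loss on the expectation alone cannot tell them apart.

```python
def expectation(candidates, probs):
    """Expected disparity of a discrete disparity distribution."""
    return sum(d * p for d, p in zip(candidates, probs))


# Hypothetical candidate disparities and two hypothetical distributions.
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
peaked = [0.0, 0.0, 1.0, 0.0, 0.0]    # all mass on disparity 0
bimodal = [0.5, 0.0, 0.0, 0.0, 0.5]   # mass split between -1 and +1

print(expectation(candidates, peaked))   # 0.0
print(expectation(candidates, bimodal))  # 0.0
```

A pixel-wise L1 loss on the regressed disparity assigns both distributions the same error, although only the first one is a reasonable description of a single surface.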
In this paper, we propose a novel method to learn the disparity distribution for LF depth estimation. In our method, we propose an interpolation-based sub-pixel cost volume construction approach to achieve sub-pixel level sampling of the disparity distribution, and then design an uncertainty-aware focal loss to reduce the difference between the estimated and groundtruth disparity distributions. Since the groundtruth disparity distribution is unavailable in public LF datasets, we follow the common setting in Li et al. (2020b) and model the groundtruth disparity distribution as a Dirac delta distribution (i.e., the probability of the groundtruth disparity is 1 while all others are 0), and further discretize the Dirac delta distribution onto two adjacent sampling points to facilitate the training of our method. We conduct extensive experiments to validate the effectiveness of our method. As shown in Figure 1, our method achieves superior performance compared to state-of-the-art LF depth estimation methods (e.g., OACC-Net Wang et al. (2022a), AttMLFNet Chen et al. (2021)).
In summary, the contributions of this paper are as follows:
We rethink LF depth estimation as a pixel-wise distribution alignment task and make the first attempt to learn the sub-pixel level disparity distribution using deep neural networks.
We propose a simple-yet-effective method for sub-pixel disparity distribution learning, in which a sub-pixel level cost volume construction approach and an uncertainty-aware focal loss are developed to supervise the learning of disparity distribution.
Experimental results demonstrate the effectiveness of our method. Our method, called SubFocal, ranks first among all 99 submitted algorithms on the HCI 4D LF benchmark in terms of all five accuracy metrics (i.e., BadPix0.01, BadPix0.03, BadPix0.07, MSE and Q25).
In this section, we review traditional and deep learning-based methods in LF depth estimation.
Traditional depth estimation algorithms can be divided into three categories based on the properties of LF images. Multi-view stereo matching (MVS) based methods Yu et al. (2013); Jeon et al. (2015) exploit the multi-view information of the LFs for depth estimation. Epipolar plane image (EPI) based methods Zhang et al. (2016); Schilling et al. (2018) calculate the slopes using the epipolar geometric properties of the LFs to predict the depth. Defocus-based methods Tao et al. (2013); Wang et al. (2015) calculate the depth of a pixel by measuring the consistency at different focal stacks. However, traditional methods involve nonlinear optimization and manually-designed features, which are computationally demanding and sensitive to occlusions, weak textures and highlights.
Recently, deep learning has developed rapidly and has gradually been adopted for LF depth estimation Heber and Pock (2016); Shin et al. (2018); Chen et al. (2021); Tsai et al. (2020); Wang et al. (2022a). Some learning-based methods estimate the slope of the line from the EPI. Luo et al. (2017) used pairs of EPI image patches (horizontal and vertical) to train a CNN. Li et al. (2020a) designed a network to learn the relationship between the slopes of straight lines in horizontal and vertical EPIs. Leistner et al. (2019) introduced the idea of EPI-Shift, a virtual stack of shifted LFs, predicting the integer disparity and the offset separately and then combining them. These EPI-based methods only consider the features in the horizontal and vertical directions of angular views, and require complex subsequent refinement.
Other methods design networks by directly exploring the correspondence between views in LF images. Shin et al. (2018) introduced a multi-stream input structure to concatenate views from four directions for depth estimation. Tsai et al. (2020) proposed an attention-based viewpoint selection network that generates attention maps to represent the importance of each viewpoint. Chen et al. (2021) proposed an attention-based multi-level fusion network, using an intra-branch fusion strategy and an inter-branch fusion strategy to perform hierarchical fusion of effective features from different perspectives. Wang et al. (2022a) constructed an occlusion-aware cost volume based on dilated convolution Yu and Koltun (2015), which achieves the highest accuracy in terms of the MSE metric at a fast speed.
However, the state-of-the-art methods Tsai et al. (2020); Chen et al. (2021); Wang et al. (2022a) treat depth estimation as a regression task supervised by a pixel-wise L1 loss and lack explicit supervision of the disparity distribution. In this paper, we design a sub-pixel cost volume and an uncertainty-aware focal loss to learn the sub-pixel disparity distribution for LF depth estimation.
Figure 2 depicts the architecture of our proposed method. First, the features of each sub-aperture image (SAI) are extracted using a shared feature extractor. Second, the sub-pixel view shift is performed to construct the sub-pixel cost volume. Third, the cost aggregation module is used to aggregate the cost volume information. Fourth, a disparity regression module produces the predicted disparity distribution and disparity map, and the Jensen-Shannon (JS) divergence is used to calculate the uncertainty map. Finally, the proposed uncertainty-aware focal loss is used to simultaneously supervise the disparity distribution and the disparity map. We describe each module in detail below.
To extract multi-scale features of each SAI, the widely used Spatial Pyramid Pooling (SPP) module Tsai et al. (2020); Chen et al. (2021) is adopted as the feature extraction module in our network. The features of the SPP module contain hierarchical context information and are concatenated to form the final feature map.
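An LF can be represented using the two-plane model, i.e., $L(x, y, u, v)$, where $(x, y)$ stands for the spatial coordinate and $(u, v)$ for the angular coordinate. The LF disparity structure can be formulated as:
$$L(x, y, 0, 0) = L\big(x + d(x, y)\,u,\; y + d(x, y)\,v,\; u,\; v\big), \tag{1}$$
where $d(x, y)$ denotes the disparity between the central view and the adjacent view at $(x, y)$.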
The cost volume is constructed by shifting the features according to Equation 1 within a predefined disparity range (usually from -4 to 4). It can be transformed into the disparity distribution by the subsequent cost aggregation and disparity regression modules. To obtain a finer disparity distribution, a sub-pixel disparity sampling interval is required. In this paper, we construct a sub-pixel cost volume based on bilinear interpolation.
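The sub-pixel view shift can be sketched as follows. This is a minimal pure-Python illustration, not the authors' TensorFlow implementation: the function names, the zero padding outside the map, and the sign convention (view $(u, v)$ is resampled at $(x + d\,u,\, y + d\,v)$) are assumptions made for the example.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample a 2D feature map (list of rows) at real-valued (y, x).

    Taps that fall outside the map read as 0.0 (zero padding, an assumption).
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    wy, wx = y - y0, x - x0

    def tap(r, c):
        return feat[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return ((1 - wy) * (1 - wx) * tap(y0, x0)
            + (1 - wy) * wx * tap(y0, x0 + 1)
            + wy * (1 - wx) * tap(y0 + 1, x0)
            + wy * wx * tap(y0 + 1, x0 + 1))


def shift_view(feat, u, v, d):
    """Resample view (u, v) at (x + d*u, y + d*v) for every center-view pixel (x, y)."""
    h, w = len(feat), len(feat[0])
    return [[bilinear_sample(feat, y + d * v, x + d * u) for x in range(w)]
            for y in range(h)]


def build_cost_volume(views, candidates):
    """views maps an angular offset (u, v) to a 2D feature map.

    Returns one entry per candidate disparity, each holding the shifted maps of all views.
    """
    return [[shift_view(feat, u, v, d) for (u, v), feat in views.items()]
            for d in candidates]


# Toy data: a one-row ramp used as the "feature map" of two views.
ramp = [[0.0, 1.0, 2.0, 3.0]]
views = {(0, 0): ramp, (1, 0): ramp}
volume = build_cost_volume(views, candidates=[-0.5, 0.0, 0.5])
print(volume[2][1])  # view (1, 0) shifted by d = 0.5 -> [[0.5, 1.5, 2.5, 1.5]]
```

With integer candidates the interpolation weights collapse to 0 or 1 and the shift reduces to the integer shift used by earlier methods; fractional candidates are what make the sampling sub-pixel.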
It is worth noting that a smaller sampling interval generates a finer disparity distribution but incurs more expensive computation and a longer inference time. Consequently, we conduct ablation experiments to investigate the trade-off between accuracy and speed with respect to the sampling interval; these are described in detail in the ablation study.
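As a rough illustration of this trade-off, the number of candidate disparities grows inversely with the sampling interval. The sketch below assumes that both endpoints of the range $[-4, 4]$ are sampled; the helper name is hypothetical.

```python
def num_disparity_samples(d_min, d_max, interval):
    """Number of candidate disparities when both endpoints are sampled (an assumption)."""
    return round((d_max - d_min) / interval) + 1


for interval in (1, 0.5, 0.25, 0.1):
    print(interval, num_disparity_samples(-4, 4, interval))
# prints 9, 17, 33 and 81 samples, respectively
```

Since the cost volume has one slice per candidate, memory and aggregation cost scale roughly linearly with this count.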
The shape of the sub-pixel cost volume is $D \times H \times W \times C$, where $D$ is the number of candidate disparities, $H \times W$ is the spatial size and $C$ is the number of feature channels, and we employ a 3D CNN to aggregate it. Following Tsai et al. (2020), our cost aggregation consists of eight convolutional layers and two residual blocks. After passing through these 3D convolutional layers, we obtain the final cost volume $C_f \in \mathbb{R}^{D \times H \times W}$. We normalize $C_f$ with a softmax operation along dimension $D$ to produce the disparity distribution $P_{dis}$. Finally, the output disparity is calculated according to:
$$\hat{d} = \sum_{d_k = D_{min}}^{D_{max}} d_k \times P_{dis}(d_k), \tag{2}$$
where $\hat{d}$ denotes the estimated center-view disparity, $D_{min}$ and $D_{max}$ stand for the minimum and maximum disparity values, and $d_k$ is the sampling value between $D_{min}$ and $D_{max}$.
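The regression step for a single pixel can be sketched as a softmax followed by an expectation. The aggregated scores below are hypothetical values peaking near a disparity of 1.2, and treating a larger score as a better match is an assumption of this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regress_disparity(scores, candidates):
    """Return the expected disparity and the disparity distribution for one pixel."""
    probs = softmax(scores)
    d_hat = sum(d * p for d, p in zip(candidates, probs))
    return d_hat, probs


# Candidates on [-4, 4] with a 0.5 interval (17 samples).
candidates = [-4.0 + 0.5 * k for k in range(17)]
# Hypothetical aggregated scores that peak near disparity 1.2.
scores = [-10.0 * (d - 1.2) ** 2 for d in candidates]

d_hat, probs = regress_disparity(scores, candidates)
print(round(d_hat, 2))
```

The estimate falls between the two nearest sampling points (1.0 and 1.5), which is how a discrete distribution yields a sub-pixel disparity.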
Previous methods Shin et al. (2018); Tsai et al. (2020); Chen et al. (2021); Wang et al. (2022a) consider depth estimation as a simple regression process and treat each point equally via a pixel-wise L1 loss. However, infinitely many disparity distributions can produce the same expected disparity. Our aim is for the model to learn a reasonable disparity distribution (e.g., a Dirac delta distribution Li et al. (2020b)) for depth estimation. It is challenging to supervise the disparity distribution with the simple L1 loss, especially in difficult spatial areas. We visualize the disparity distributions of our method in Figure 3. In simple, non-occluded regions (the easy point marked in Figure 3), a reasonable disparity distribution and an accurate disparity can be obtained by training directly with the L1 loss. In contrast, in difficult occluded regions (the hard point marked in Figure 3), training directly with the L1 loss yields neither a reasonable disparity distribution nor an accurate disparity. This indicates an imbalance between easy and hard regions when learning the disparity distribution and the disparity.
To address this issue, we design an uncertainty-aware focal loss (UAFL) to supervise the disparity distribution. Similar to Focal Loss Lin et al. (2017); Li et al. (2020b), we adopt a dynamic weight adjustment strategy to make the model focus more on the hard points with significant differences in disparity distributions. Figure 3 compares the disparity distributions with and without UAFL. By using UAFL, our model can predict a reasonable disparity distribution and an accurate disparity in challenging regions.
To quantitatively evaluate the difficulty of different regions, we adopt the JS divergence to measure the difference between the predicted and the groundtruth disparity distributions:
$$F = \mathrm{JS}(P_{dis} \,\|\, P_{gt}), \tag{3}$$
where $F$ is the uncertainty map, $P_{dis}$ represents the predicted disparity distribution, and $P_{gt}$ is the groundtruth disparity distribution. Since we follow Li et al. (2020b) and consider the groundtruth disparity distribution to be a Dirac delta distribution, we discretize it onto the two adjacent sampling points $d_l$ and $d_r$ according to:
$$P_{gt}(d_l) = \frac{d_r - d_{gt}}{d_r - d_l}, \qquad P_{gt}(d_r) = \frac{d_{gt} - d_l}{d_r - d_l}, \tag{4}$$
where $d_l$ and $d_r$ are the left and right adjacent points, respectively, $P_{gt}(d_l)$ and $P_{gt}(d_r)$ represent their corresponding probability values, and $d_{gt}$ is the groundtruth disparity. Finally, we design the uncertainty-aware focal loss to supervise both the disparity distribution and the disparity map:
$$\mathcal{L}_{UAFL} = |F|^{\beta} \odot |\hat{d} - d_{gt}|, \tag{5}$$
where $F$ is the uncertainty map, $\hat{d}$ is the predicted disparity, and $d_{gt}$ is the groundtruth disparity. $\odot$ stands for the element-wise multiplication operation, and $\beta$ is the coefficient that controls the ratio of dynamic weight assignment. When $\beta = 0$, UAFL degenerates to the standard L1 loss. In the ablation experiments, we analyze the effect of different $\beta$ values.
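The loss described above can be sketched for a single pixel as follows. This is an illustrative reading, not the released implementation: the function names, the symbol `beta` for the weighting exponent, the natural-log JS divergence, the clamping of out-of-range groundtruth, and the per-pixel (unreduced) form are all assumptions of this example.

```python
import math


def two_point_target(d_gt, candidates):
    """Spread a Dirac delta at d_gt over its two neighbouring sampling points."""
    d_gt = min(max(d_gt, candidates[0]), candidates[-1])  # clamp into the sampled range
    target = [0.0] * len(candidates)
    for k in range(len(candidates) - 1):
        d_l, d_r = candidates[k], candidates[k + 1]
        if d_l <= d_gt <= d_r:
            target[k] = (d_r - d_gt) / (d_r - d_l)
            target[k + 1] = (d_gt - d_l) / (d_r - d_l)
            break
    return target


def kl(p, q):
    """KL divergence; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log, so bounded by ln 2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def uafl(pred_probs, d_gt, candidates, beta):
    """Uncertainty-weighted L1 error for one pixel: JS^beta * |d_hat - d_gt|."""
    d_hat = sum(d * p for d, p in zip(candidates, pred_probs))
    f = js_divergence(pred_probs, two_point_target(d_gt, candidates))
    return (f ** beta) * abs(d_hat - d_gt)


candidates = [-4.0 + 0.5 * k for k in range(17)]
target = two_point_target(1.2, candidates)   # mass only on 1.0 and 1.5
uniform = [1.0 / 17] * 17                    # a maximally uncertain prediction

print(uafl(uniform, 1.2, candidates, beta=0.0))  # plain L1 error
print(uafl(uniform, 1.2, candidates, beta=1.0))  # scaled by the JS divergence
```

The two-point target keeps the expectation equal to the groundtruth disparity, so a prediction that matches it exactly has zero JS divergence and zero L1 error at the same time.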
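|Method||Backgammon: BP 0.07||BP 0.03||BP 0.01||MSE||Dots: BP 0.07||BP 0.03||BP 0.01||MSE||Pyramids: BP 0.07||BP 0.03||BP 0.01||MSE||Stripes: BP 0.07||BP 0.03||BP 0.01||MSE|
|SPO Zhang et al. (2016)||3.781||8.639||49.94||4.587||16.27||35.06||58.07||5.238||0.861||6.263||79.20||0.043||14.99||15.46||21.87||6.955|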
|CAE Park et al. (2017)||3.924||4.313||17.32||6.074||12.40||42.50||83.70||5.082||1.681||7.162||27.54||0.048||7.872||16.90||39.95||3.556|
|PS_RF Jeon et al. (2018)||7.142||13.94||74.66||6.892||7.975||17.54||78.80||8.338||0.107||6.235||83.23||0.043||2.964||5.790||41.65||1.382|
|OBER-cross-ANP Schilling et al. (2018)||3.413||4.956||13.66||4.700||0.974||37.66||73.13||1.757||0.364||1.130||8.171||0.008||3.065||9.352||44.72||1.435|
|OAVC Han et al. (2021)||3.121||5.117||49.05||3.835||69.11||75.38||92.33||16.58||0.831||9.027||33.66||0.040||2.903||19.88||28.14||1.316|
|Epinet-fcn-m Shin et al. (2018)||3.501||5.563||19.43||3.705||2.490||9.117||35.61||1.475||0.159||0.874||11.42||0.007||2.457||2.711||11.77||0.932|
|EPI_ORM Li et al. (2020a)||3.988||7.238||34.32||3.411||36.10||47.93||65.71||14.48||0.324||1.301||19.06||0.016||6.871||13.94||55.14||1.744|
|LFattNet Tsai et al. (2020)||3.126||3.984||11.58||3.648||1.432||3.012||15.06||1.425||0.195||0.489||2.063||0.004||2.933||5.417||18.21||0.892|
|FastLFNet Huang et al. (2021)||5.138||11.41||39.84||3.986||21.17||41.11||68.15||3.407||0.620||2.193||22.19||0.018||9.442||32.60||63.04||0.984|
|DistgDisp Wang et al. (2022b)||5.824||10.54||26.17||4.712||1.826||4.464||25.37||1.367||0.108||0.539||4.953||0.004||3.913||6.885||19.25||0.917|
|OACC-Net Wang et al. (2022a)||3.931||6.640||21.61||3.938||1.510||3.040||21.02||1.418||0.157||0.536||3.852||0.004||2.920||4.644||15.24||0.845|
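|Method||Boxes: BP 0.07||BP 0.03||BP 0.01||MSE||Cotton: BP 0.07||BP 0.03||BP 0.01||MSE||Dino: BP 0.07||BP 0.03||BP 0.01||MSE||Sideboard: BP 0.07||BP 0.03||BP 0.01||MSE|
|SPO Zhang et al. (2016)||15.89||29.52||73.23||9.107||2.594||13.71||69.05||1.313||2.184||16.36||69.87||0.310||9.297||28.81||73.36||1.024|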
|CAE Park et al. (2017)||17.89||40.40||72.69||8.424||3.369||15.50||59.22||1.506||4.968||21.30||61.06||0.382||9.845||26.85||56.92||0.876|
|PS_RF Jeon et al. (2018)||18.95||35.23||76.39||9.043||2.425||14.98||70.41||1.161||4.379||16.44||75.97||0.751||11.75||36.28||79.98||1.945|
|OBER-cross-ANP Schilling et al. (2018)||10.76||17.92||44.96||4.750||1.108||7.722||36.79||0.555||2.070||6.161||22.76||0.336||5.671||12.48||32.79||0.941|
|OAVC Han et al. (2021)||16.14||33.68||71.91||6.988||2.550||20.79||61.35||0.598||3.936||19.03||61.82||0.267||12.42||37.83||73.85||1.047|
|Epinet-fcn-m Shin et al. (2018)||12.34||18.11||46.09||5.968||0.447||2.076||25.72||0.197||1.207||3.105||19.39||0.157||4.462||10.86||36.49||0.798|
|EPI_ORM Li et al. (2020a)||13.37||25.33||59.68||4.189||0.856||5.564||42.94||0.287||2.814||8.993||41.04||0.336||5.583||14.61||52.59||0.778|
|LFattNet Tsai et al. (2020)||11.04||18.97||37.04||3.996||0.272||0.697||3.644||0.209||0.848||2.340||12.22||0.093||2.870||7.243||20.73||0.530|
|FastLFNet Huang et al. (2021)||18.70||37.45||71.82||4.395||0.714||6.785||49.34||0.322||2.407||13.27||56.24||0.189||7.032||21.62||61.96||0.747|
|DistgDisp Wang et al. (2022b)||13.31||21.13||41.62||3.325||0.489||1.478||7.594||0.184||1.414||4.018||20.46||0.099||4.051||9.575||28.28||0.713|
|OACC-Net Wang et al. (2022a)||10.70||18.16||43.48||2.892||0.312||0.829||10.45||0.162||0.967||2.874||22.11||0.083||3.350||8.065||28.64||0.542|
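|Method||Bedroom: BP 0.07||BP 0.03||BP 0.01||MSE||Bicycle: BP 0.07||BP 0.03||BP 0.01||MSE||Herbs: BP 0.07||BP 0.03||BP 0.01||MSE||Origami: BP 0.07||BP 0.03||BP 0.01||MSE|
|SPO Zhang et al. (2016)||4.864||23.53||72.37||0.209||10.91||26.90||71.13||5.570||8.260||30.62||86.62||11.23||11.69||32.71||75.58||2.032|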
|CAE Park et al. (2017)||5.788||25.36||68.59||0.234||11.22||23.62||59.64||5.135||9.550||23.16||59.24||11.67||10.03||28.35||64.16||1.778|
|PS_RF Jeon et al. (2018)||6.015||22.45||80.68||0.288||17.17||32.32||79.80||7.926||10.48||21.90||66.47||15.25||13.57||36.45||80.32||2.393|
|OBER-cross-ANP Schilling et al. (2018)||3.329||9.558||28.91||0.185||8.683||16.17||37.83||4.314||7.120||14.06||36.83||10.44||8.665||20.03||42.16||1.493|
|OAVC Han et al. (2021)||4.915||19.09||64.76||0.212||12.22||25.46||64.74||4.886||8.733||29.65||74.76||10.36||12.56||30.59||69.35||1.478|
|Epinet-fcn-m Shin et al. (2018)||2.299||6.345||31.82||0.204||9.614||16.83||42.83||4.603||10.96||25.85||59.93||9.491||5.807||13.00||42.21||1.478|
|EPI_ORM Li et al. (2020a)||5.492||14.66||51.02||0.298||11.12||21.20||51.22||3.489||8.515||24.60||68.79||4.468||8.661||22.95||56.57||1.826|
|LFattNet Tsai et al. (2020)||2.792||5.318||13.33||0.366||9.511||15.99||31.35||3.350||5.219||9.473||19.27||6.605||4.824||8.925||22.19||1.733|
|FastLFNet Huang et al. (2021)||4.903||15.92||52.88||0.202||15.38||28.45||59.24||4.715||10.72||23.39||59.98||8.285||12.64||33.65||72.36||2.228|
|DistgDisp Wang et al. (2022b)||2.349||5.925||17.66||0.111||9.856||17.58||35.72||3.419||6.846||12.44||24.44||6.846||4.270||9.816||28.42||1.053|
|OACC-Net Wang et al. (2022a)||2.308||5.707||21.97||0.148||8.078||14.40||32.74||2.907||6.515||46.78||86.41||6.561||4.065||9.717||32.25||0.878|
In this section, we first describe the dataset, evaluation metrics, and implementation details. The performance of our method is then compared with the state-of-the-art methods. Finally, we design ablation experiments to analyze the effectiveness of the proposed method.
The 4D LF Benchmark (HCI 4D) Honauer et al. (2016) is the current mainstream benchmark dataset for evaluating LF disparity estimation. The dataset is rendered with Blender and provides 28 synthetic, densely sampled 4D LFs, including 16 additional, 4 training, 4 test, and 4 stratified scenes. The additional, training and stratified scenes have groundtruth disparity maps, while the test scenes do not provide groundtruth disparity annotations. The spatial resolution is 512×512, and the angular resolution is 9×9.
For the quantitative evaluation, we adopt the standard metrics, including the mean square error (MSE ×100) and the bad pixel ratio (BadPix($\epsilon$)). Specifically, MSE ×100 is the mean square error over all pixels within a given mask, multiplied by 100. BadPix($\epsilon$) is the percentage of pixels within a given mask whose absolute difference between the groundtruth label and the algorithm's predicted result exceeds a threshold $\epsilon$, which is usually chosen as 0.01, 0.03 or 0.07.
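The two metrics can be sketched directly from their definitions. The helper names and the flat-list representation of the disparity maps are assumptions of this example, and the toy values are invented purely to exercise the code.

```python
def _masked_pairs(pred, gt, mask):
    """Pair up predictions and labels, keeping only pixels where the mask is true."""
    if mask is None:
        mask = [True] * len(gt)
    return [(p, g) for p, g, m in zip(pred, gt, mask) if m]


def mse_x100(pred, gt, mask=None):
    """Mean square error over the masked pixels, multiplied by 100."""
    pairs = _masked_pairs(pred, gt, mask)
    return 100.0 * sum((p - g) ** 2 for p, g in pairs) / len(pairs)


def badpix(pred, gt, eps, mask=None):
    """Percentage of masked pixels whose absolute error exceeds eps."""
    pairs = _masked_pairs(pred, gt, mask)
    return 100.0 * sum(abs(p - g) > eps for p, g in pairs) / len(pairs)


# Toy disparities with absolute errors of 0, 0.05, 0.02 and 0.5.
pred = [0.00, 0.50, 1.02, 2.00]
gt = [0.00, 0.55, 1.00, 1.50]

for eps in (0.01, 0.03, 0.07):
    print(eps, badpix(pred, gt, eps))   # 75.0, 50.0 and 25.0 percent
print(mse_x100(pred, gt))
```

The single large error dominates MSE ×100 but counts as just one bad pixel, which is why the two metrics can move in opposite directions.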
We follow previous methods Shin et al. (2018); Tsai et al. (2020); Chen et al. (2021); Wang et al. (2022a) and use the 16 additional scenes as the training set, the 8 stratified and training scenes as the validation set, and the 4 test scenes as the testing set. We use LFattNet Tsai et al. (2020) as the baseline model. During training, LF images are randomly cropped into 32×32 grayscale patches. The same data augmentation strategy Shin et al. (2018); Tsai et al. (2020) is used to improve the model performance. We set the batch size to 32 and use the Adam Kingma and Ba (2014) optimization method. The disparity range is set to [-4,4], and the sampling interval is set to 0.5 by default. To speed up model training, a two-stage strategy is used. In the first stage, the model is trained for 50 epochs using the L1 loss, and the learning rate is 110. In the second stage, the L1 loss is replaced with our UAFL, the model is finetuned for 10 epochs, and the learning rate is set to 110. We use the TensorFlow Abadi et al. (2016) framework to implement our network. The training process takes roughly 65 hours on an NVIDIA RTX 3090 GPU.
We perform a quantitative comparison with state-of-the-art methods, including SPO Zhang et al. (2016), CAE Park et al. (2017), PS_RF Jeon et al. (2018), OBER-cross-ANP Schilling et al. (2018), OAVC Han et al. (2021), Epinet-fcn-m Shin et al. (2018), EPI_ORM Li et al. (2020a), LFattNet Tsai et al. (2020), FastLFNet Huang et al. (2021), AttMLFNet Chen et al. (2021), DistgDisp Wang et al. (2022b), and OACC-Net Wang et al. (2022a). Figure 4 shows that our method ranks first on all five commonly evaluated metrics (BadPix 0.01, BadPix 0.03, BadPix 0.07, MSE ×100, and Q25) on the HCI 4D LF benchmark. Table 1 shows the quantitative comparison results on the validation and test sets. Our method achieves promising performance on different scenes, especially on the unlabeled test set.
We show qualitative results on three scenes in Figure 5. The BadPix 0.07 error maps show that our method has substantially less error than previous methods, especially in occluded and edge regions.
We evaluate our method on real-world LF images taken with a Lytro Illum camera Bok et al. (2016) and a moving camera mounted on a gantry Vaish and Adams (2008). We directly use the model trained on the HCI dataset for inference. Figure 6 illustrates that our method produces more accurate disparity maps than EPINet Shin et al. (2018) and OACC-Net Wang et al. (2022a), for example on the boom of the bulldozer (middle). This demonstrates the generalization capability of our method. Please refer to the supplemental material for additional comparisons.
We conduct extensive ablation experiments to analyze the effectiveness of our method. Our ablation studies include investigations of the disparity interval, combinations of loss functions, and the hyper-parameter of UAFL.
|Disp. Num.||Disp. Interv.||BP 0.07||MSE||Training Time||Inference Time|
The selection of the disparity interval determines the sampling density and further affects the inference speed of the model. We need to choose a proper disparity interval to achieve a good trade-off between accuracy and speed. Therefore, we set the disparity interval to 1, 0.5, 0.25 and 0.1, respectively, train our model for 50 epochs on the HCI 4D LF benchmark, and then evaluate its performance on the validation set. Table 2 shows the results of the model at different disparity intervals and numbers of samples. As the disparity interval decreases and the number of samples increases, the accuracy of the model increases, while the training time also increases. Specifically, reducing the disparity interval from 1 to 0.5 improves the BadPix 0.07 and MSE by 0.75 and 0.4, respectively. As shown in Figure 7, the disparity map at the interval of 0.5 is more accurate than that at the interval of 1 in the heavily occluded region of the scene. In addition, we find no significant visual difference between the results of disparity interval 0.25 and disparity interval 0.5. Consequently, considering both accuracy and speed, we finally choose the disparity interval of 0.5 for further experiments.
Choosing appropriate loss functions is an important issue in our method. Since other metrics (e.g., the Kullback-Leibler (KL) divergence) can also measure the distance between two distributions, we test several combinations of losses, including the L1 loss, MSE loss, KL divergence loss, and JS divergence loss. We use the model with a disparity interval of 0.5 as the pre-trained baseline. The model is then finetuned for 10 epochs with the learning rate set to 110. Table 3 shows the results of different loss functions. Although the MSE loss leads to a better MSE metric, the BadPix 0.07 metric increases sharply. Our UAFL improves both the MSE and BadPix 0.07 metrics compared to the baseline, and the overall performance of UAFL is superior to that of the other loss functions.
The function of UAFL is to dynamically assign weights based on the uncertainty map so that the disparity distribution and the disparity are supervised simultaneously. We introduce the hyper-parameter $\beta$ to adjust the ratio of the assigned weights. Based on the disparity interval of 0.5, we choose different values of $\beta$ to investigate its influence, and the results are shown in Table 4. We observe that as $\beta$ increases from 0.1 to 1, the MSE metric decreases, while the BadPix 0.07 metric increases. As shown in Figure 8 (a), the BadPix 0.07 error map with the smaller $\beta$ is significantly better than that with the larger $\beta$. The histogram of MSE errors in Figure 8 (b) shows fewer error outliers for the larger $\beta$ and slightly more error outliers for the smaller $\beta$, so the mean MSE of the larger $\beta$ is better. Our analysis is that a larger $\beta$ increases the ratio of weights assigned to difficult areas and thus makes the model focus more on challenging regions (e.g., outliers), resulting in a lower overall MSE. The BadPix metric, on the other hand, measures the proportion of pixels whose error is larger than a threshold, such as 0.07. With a larger $\beta$, the model relaxes the constraint and assigns lower weights to easy regions, leading to a slight increase in the BadPix metric. Considering that the BadPix metric is prioritized for depth estimation, we finally choose a small $\beta$ in our main model.
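The effect of the exponent on the weight ratio can be illustrated with a short calculation. Assuming the dynamic weight of a pixel is its JS divergence raised to the power `beta`, the weight of a hard pixel relative to an easy one is the ratio of their divergences raised to `beta`. The two divergence values below are hypothetical (a natural-log JS divergence is at most ln 2, about 0.693).

```python
def relative_weight(f_hard, f_easy, beta):
    """Weight of a hard pixel relative to an easy one under a JS^beta weighting."""
    return (f_hard / f_easy) ** beta


# Hypothetical JS divergences for one hard and one easy pixel.
f_hard, f_easy = 0.6, 0.01

for beta in (0.0, 0.1, 0.5, 1.0):
    print(beta, relative_weight(f_hard, f_easy, beta))
```

At `beta = 0` both pixels are weighted equally (the L1 case); as `beta` grows, the hard pixel increasingly dominates the loss, which matches the reported trend of lower MSE but slightly higher BadPix.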
This paper presents a simple and effective method to learn the disparity distribution for LF depth estimation. First, we construct an interpolation-based cost volume at the sub-pixel level, which yields a finer disparity distribution. Second, we design an uncertainty-aware focal loss based on the JS divergence to supervise the disparity distribution. Extensive experiments validate the effectiveness of our method, which achieves superior performance compared to state-of-the-art LF depth estimation methods.
Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1009–1017.
Epi-patch based convolutional neural network for depth estimation on 4d light field. In International Conference on Neural Information Processing, pp. 642–652.