The light field  records both 2D spatial and 2D angular information of the observed scene. The lenslet-based light field camera , a compact and hand-held device, achieves dense sampling of viewpoints by means of a micro-lens array inserted between the main lens and the photo sensor. The captured 4D light field data implicitly encodes geometric characteristics such as multi-view and epipolar geometry, which has attracted much attention in recent years for improving depth estimation from light fields.
To visualize light fields and extract light field features, the 4D light field data is often converted into various 2D representations such as multi-view sub-aperture images , Epipolar Plane Images (EPIs) , and focal stacks . Some representative methods [19, 20, 23] exploit different depth cues from sub-aperture images and focal stacks for depth estimation. However, it is difficult to acquire dense and accurate depth maps from lenslet-based cameras owing to optical distortions  and the narrow baseline  between sub-aperture images. Besides, these methods are usually accompanied by heavy computational burdens and carefully designed optimization measures. To avoid these issues, some methods [22, 26, 4, 14] exploit EPIs, which exhibit patterns of oriented lines with constant colors. Each of these lines corresponds to the projection of a single 3D scene point, and its slope is called the disparity. Therefore, one can infer the depth of the corresponding scene point by analyzing the slope of the oriented line in the EPI. Moreover, the oriented line and its neighboring pixels share a similar linear structure, which makes it beneficial to estimate the slope by modeling the relationship between the center region of the EPI and its neighborhood. Nonetheless, current methods predict depth maps by extracting the optimal slope of EPIs while ignoring the relationship between neighboring pixels in EPIs, which makes the results inaccurate. It has been well recognized that relation information offers important visual cues for computer vision tasks, such as spatial and channel relations in semantic segmentation and object detection , and temporal relations in activity recognition .
In this paper, we propose an end-to-end fully convolutional network to estimate the depth value of the intersection point on the horizontal and vertical EPIs, as shown in Figure 1.
We design a siamese network without shared weights (i.e., pseudo-siamese ) so that the convolution weights for the horizontal and vertical EPIs can be learned separately. Specifically, we propose a new feature extraction module, called the Oriented Relation Module (ORM), to learn and reason about the relationship between oriented lines in EPIs by extracting oriented relation features between the center pixel and its neighborhood from EPI patches. Our network is trained on the 4D light field benchmark dataset , where ground-truth disparities are available. However, we find it hard to train such a deep network with insufficient data. To mitigate this issue, we propose a data augmentation method that refocuses EPIs, so that EPIs with different slopes, together with the corresponding ground-truth disparities, can be obtained for the same scene point. We show that the newly proposed ORM and the EPI-based data augmentation both boost the performance of light field depth estimation. Code and depth map predictions will be made publicly available.
2 Related Work
Conventional depth estimation from light fields mainly relies on different assumptions [20, 13] and handcrafted depth features [19, 23] based on sub-aperture images and focal stacks. In this section, we restrict ourselves to methods that exploit EPIs, and review some representative works on relation reasoning.
Light field depth estimation based on EPIs.
used a structure tensor to compute the slope of each line in vertical and horizontal EPIs. Zhang et al.  introduced the Spinning Parallelogram Operator (SPO) to find matching lines in EPIs; lines with different slopes are located by maximizing the distribution distances of the regions. Zhang et al.  located the optimal slope of each line segment on EPIs by using locally linear embedding. Different from these methods, several works applied CNNs to extract light field features from EPIs. Sun et al.  presented a data-driven approach to estimate object depths from an enhanced EPI feature using a CNN. Heber and Pock  used CNNs to predict 2D per-pixel hyperplane slope orientations in EPIs. Based on this work, Heber et al. [6, 5] improved it by utilizing a U-shaped network and EPI volumes to predict the depth map. Luo et al.  designed an EPI-patch based CNN architecture to estimate the depth of each pixel. Feng et al.  proposed a two-stream network that learns to estimate the depth values of multiple correlated neighborhood pixels from EPI patches. In addition, some of these methods [6, 14, 2] require data pre-processing and subsequent optimization. In contrast, we present an end-to-end fully convolutional network that predicts the depth values of center pixels from the corresponding horizontal and vertical EPIs. We exploit the similar linear structure in EPIs and model the relationship between the oriented lines and their neighboring pixels, which helps to estimate the slope of the oriented line.
A few recent papers [28, 8, 15] have shown that relation reasoning can improve the performance of computer vision tasks. Zhou et al.  proposed a temporal relation network to learn and reason about temporal dependencies between video frames at multiple time scales. Hu et al.  proposed an object relation module to model relationships between sets of objects for object detection. Mou et al.  proposed spatial and channel relation modules to learn and reason about global relationships between any two spatial positions or feature maps, producing relation-augmented feature representations for semantic segmentation. Motivated by these works, we propose an oriented relation module to model the relationship between the center pixel and its neighborhood in the EPI, which allows the network to explicitly learn the relationship between line orientations and improves depth estimation.
3 Proposed Method
In this paper, we present an end-to-end fully convolutional network to predict the depth values of center pixels in EPIs of light fields. Two branches are designed to process the horizontal and vertical EPIs separately. The newly proposed oriented relation module is capable of modeling the relationships between the neighboring pixels in EPIs. A refocusing-based EPI augmentation method is also proposed to facilitate training and improve the performance of depth estimation. An overview of the network architecture is shown in Figure 1.
3.1 EPI Patches for Learning
The light field, indicated as , is generally represented by the two-plane parameterization . Here,  and  are the spatial and angular coordinates, respectively. The central sub-aperture (center view) image is formed by the rays passing through the optical center of the camera main lens (). As shown in Figure 2, given a pixel in the center view image, the horizontal EPI of the row view  can be formulated as , centered at . Similarly, the vertical EPI of the column view , centered at , is written as .
can be obtained by analyzing the slope of the line, where  is the focal distance and  is the depth value of the point . The slope of the oriented line is shown in the EPI patch of Figure 2.
To learn the slope of the oriented line of , we extract patches of size from and as inputs. Here, and indicate height and width of the patch, respectively, and is the channel dimension. The size of the patch is determined by the range of disparities. The proposed network predicts the depth of the center pixel from the pair of EPI patches.
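As a concrete illustration, the EPI patch extraction described above can be sketched in a few lines of NumPy. The array layout `lf[u, v, x, y]` (angular coordinates first), the helper name, and the patch sizes are our own assumptions for the sketch, not the paper's implementation:

```python
import numpy as np

def extract_epi_patches(lf, u_c, v_c, x_c, y_c, h, w):
    """Extract the horizontal and vertical EPI patches centered on a
    given center-view pixel from a 4D light field.

    lf is assumed to be indexed as lf[u, v, x, y], with angular
    coordinates (u, v) and spatial coordinates (x, y)."""
    # Horizontal EPI: fix the row view v = v_c and the row y = y_c,
    # keeping the (u, x) plane; oriented lines here encode disparity.
    epi_h = lf[:, v_c, :, y_c]                      # shape (U, X)
    # Vertical EPI: fix the column view u = u_c and the column x = x_c.
    epi_v = lf[u_c, :, x_c, :]                      # shape (V, Y)
    # Crop h x w patches centered at the reference pixel.
    patch_h = epi_h[u_c - h // 2 : u_c + h // 2 + 1,
                    x_c - w // 2 : x_c + w // 2 + 1]
    patch_v = epi_v[v_c - h // 2 : v_c + h // 2 + 1,
                    y_c - w // 2 : y_c + w // 2 + 1]
    return patch_h, patch_v
```

For a 9x9-view light field, the patch height would equal the angular resolution (9), while the patch width is chosen to cover the disparity range.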
3.2 Network Architecture
As shown in Figure 1, the proposed network shares a similar structure with the pseudo-siamese network proposed in , where two branches learn the weights for the horizontal and vertical EPI patches, respectively. Each branch contains two oriented relation modules (ORMs), a set of seven convolutional blocks, a residual module (RM), and a merging block. The ORM will be discussed in Section 3.3. The convolutional block is composed of 'Conv-ReLU-Conv-BN-ReLU'. To handle small EPI slopes, we apply convolutional filters of size  or  with stride  so that small depth values can be measured. However, detailed information about the EPI slope is lost as the network goes deeper. Inspired by residual learning , which introduces detailed information from shallower layers into deeper layers and effectively improves network performance, we design a residual module for each branch. The residual module consists of six residual blocks, each containing one convolutional block and one skip connection. We implement the skip connection with a slicing operation that extracts the center region of the input feature. The final merging block, containing two different convolutional blocks ('Conv-ReLU-Conv-BN-ReLU' and 'Conv-ReLU-Conv'), fuses the horizontal and vertical EPI features to predict the depth value of each pixel.
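The slicing-based skip connection can be illustrated with a small NumPy sketch. Because the convolutions use no padding, their output is smaller than the block input, so the identity path must be center-cropped before the addition; the helper name and shapes below are hypothetical:

```python
import numpy as np

def center_slice_skip(x, conv_out):
    """Slicing skip connection: crop the center region of the block
    input x so it matches the (smaller) unpadded conv output, then add
    the two, as in a residual block without padding."""
    dh = (x.shape[0] - conv_out.shape[0]) // 2
    dw = (x.shape[1] - conv_out.shape[1]) // 2
    cropped = x[dh:x.shape[0] - dh, dw:x.shape[1] - dw]
    return cropped + conv_out
```

For example, on a 9x25 EPI feature map whose convolutional block shrinks the width to 21, the skip path keeps rows intact and slices off two columns on each side.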
3.3 Oriented Relation Module
We propose a new Oriented Relation Module (ORM) to reason about the relationship between the center pixel and its neighborhood in each EPI patch. As shown in Figure 3, given an EPI patch of size , we apply two single-layer convolutions with kernel size  to model a compact relationship within the EPI patch. The output features are reshaped into  and , respectively, and a dot product between them constructs the oriented relation feature  of size . Furthermore, to obtain the relationship between the center pixel and its neighborhood in , we extract the feature  of size  from the relational feature . Then, we apply reshaping and ReLU activation to obtain a new feature  of size . Finally, we concatenate the original EPI patch with this feature to obtain the output feature of size .
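A minimal NumPy sketch may clarify the tensor shapes involved in the ORM. The random matrices stand in for the two learned convolutions, and the single-channel input and embedding size are simplifying assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def oriented_relation_module(patch, c_embed=4):
    """Sketch of the ORM on a single-channel EPI patch of shape (h, w).

    Two 1x1 projections embed every pixel, a dot product yields the
    (h*w, h*w) pairwise relation matrix, the row belonging to the
    center pixel is reshaped back to (h, w), passed through ReLU, and
    concatenated with the input patch along the channel axis."""
    h, w = patch.shape
    flat = patch.reshape(h * w, 1)                  # pixels as vectors
    # Stand-ins for the two learned 1x1 convolutions.
    w1 = rng.standard_normal((1, c_embed))
    w2 = rng.standard_normal((1, c_embed))
    f1 = flat @ w1                                  # (h*w, c_embed)
    f2 = flat @ w2                                  # (h*w, c_embed)
    relation = f1 @ f2.T                            # (h*w, h*w) relations
    center = relation[(h // 2) * w + (w // 2)]      # center-pixel row
    center = np.maximum(center, 0.0).reshape(h, w)  # ReLU + reshape
    # Concatenate with the original patch: one extra output channel.
    return np.stack([patch, center], axis=-1)       # (h, w, 2)
```

The extracted row of the relation matrix is exactly the set of similarities between the center pixel and every pixel in its neighborhood, which is what lets the network reason about the orientation of the line through the center.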
3.4 EPI Refocusing-based Data Augmentation
To alleviate the problems of insufficient data and overfitting, we propose a new data augmentation method based on refocusing EPIs. Light field refocusing shifts the sub-aperture images to obtain images focused at different depth planes . Figure 4 shows sub-aperture images at the same horizontal or vertical views stacked together. Lines with different slopes (i.e., the lines in EPIs) correspond to scene points at different depth planes in the sub-aperture images. The line at the focal depth is vertical (slope = 0), while the other lines are inclined (slope > 0 or slope < 0). Taking the center view as the reference, the disparity shift between sub-aperture images changes the slope of each line. Thus, refocusing at a different depth plane changes the orientation of the line structures in the EPI.
Here, we assume that the center view is the reference view. For the sake of simplicity, we also assume that lenslet-based cameras have the same focal length and the same baseline for the neighboring views. Similarly, we can obtain the disparity shift . Then we refocus the EPI based on the refocusing principle ,
The EPI patches in Figure 2 show three horizontal EPIs at different refocused depths.
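The refocusing step amounts to shearing the EPI: each angular row is shifted in proportion to its distance from the reference view, which changes every line's slope (and hence the ground-truth disparity) by the same amount. A sketch of this, with a function name of our own and integer wrap-around shifts in place of the sub-pixel interpolation a real implementation would use:

```python
import numpy as np

def refocus_epi(epi, d_shift, u_c=None):
    """Shear an EPI to synthesize refocusing at a different depth plane.

    Row u of the EPI (one angular sample) is shifted horizontally by
    (u - u_c) * d_shift pixels; the reference (center) row stays fixed,
    and the slope of every oriented line changes by d_shift."""
    U, _ = epi.shape
    if u_c is None:
        u_c = U // 2                      # center view as reference
    out = np.empty_like(epi)
    for u in range(U):
        out[u] = np.roll(epi[u], int(round((u - u_c) * d_shift)))
    return out
```

Applying this with several values of `d_shift` to a training EPI yields new EPIs whose ground-truth disparities are the original values offset by `d_shift`, which is how the augmentation produces extra labeled samples for the same scene point.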
4 Experiments
4.1 Implementation Details
We use the 4D light field benchmark  as our experimental dataset, which provides highly accurate disparity ground truth and performance evaluation metrics. The dataset includes  carefully designed scenes with ground-truth disparity maps. Each scene has an angular resolution of  and a spatial resolution of .  scenes are used for training and the remaining scenes for testing. We randomly sample horizontal and vertical EPI patch pairs of size  from each scene as inputs. To avoid overfitting, we enlarge the training data to  times the original amount via the proposed EPI refocusing-based data augmentation.
The bad pixel ratio (BadPix) , which denotes the percentage of pixels whose disparity error is larger than 0.07 pixels, and the Mean Squared Error (MSE) are used as evaluation metrics. Given an estimated disparity map , the ground-truth disparity map , and an evaluation region , BadPix is defined as , and MSE is defined as . Lower scores are better for both metrics.
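For reference, the two metrics can be computed as follows. The function names are ours, and we omit the x100 scaling that the benchmark applies when reporting MSE:

```python
import numpy as np

def badpix(est, gt, mask=None, thresh=0.07):
    """Percentage of pixels in the evaluation region whose absolute
    disparity error exceeds the threshold (0.07 px by default)."""
    err = np.abs(est - gt)
    if mask is not None:
        err = err[mask]
    return 100.0 * np.mean(err > thresh)

def mse(est, gt, mask=None):
    """Mean squared disparity error over the evaluation region.
    (The benchmark reports this value multiplied by 100.)"""
    diff = est - gt
    if mask is not None:
        diff = diff[mask]
    return np.mean(diff ** 2)
```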
We use the Keras library with the mean absolute error (MAE) loss to train the proposed network from scratch. We formulate depth estimation as a regression problem that predicts the depth value of a single pixel. Note that the network is trained end-to-end and requires no pre- or post-processing. We use the RMSprop optimizer  and set the weight decay rate to  and the batch size to . Training takes one day on an NVIDIA GTX 1080Ti.
4.2 Ablation Study
We use the proposed network without the oriented relation module (ORM) and data augmentation based on EPI refocusing (EPIR) as the Baseline.
Effect of the oriented relation module.
Table 1 shows that the network using the ORM brings a significant improvement over the baseline, reducing BadPix by around .
| Metric | Baseline | w/ ORM | w/ EPIR | Full model |
Figure 5 shows qualitative results for comparison.
The Boxes and Cotton scenes show that the ORM reduces streaking artifacts and improves accuracy in weakly textured areas. The occlusion boundaries in Backgammon, a scene with multiple occlusions, are also better recovered with the ORM.
Effect of EPI refocusing-based data augmentation.
From Table 1, we can see that the network using the EPIR outperforms the baseline. Moreover, using both the ORM and the EPIR boosts performance further. To further show the effect of EPI refocusing, we compare the performance when varying the number of refocusing operations in Table 2. We refocus the training data to the foreground and the background of the original depth plane. From the table, we observe performance gains as the number of refocusing operations increases. However, the gain is marginal when comparing EPIR with EPIR.
4.3 Comparison with State-of-the-Art Methods
We compare our approach with other state-of-the-art methods: Jeon et al. , Williem et al. , Wang et al. , Zhang et al. , Luo et al. , and Shin et al. . The qualitative comparison is shown in Figure 6.
The Cotton scene contains smooth surfaces and textureless regions, while the Boxes scene contains occlusions with depth discontinuities. As can be seen from the figure, our approach reconstructs both smooth surfaces and regions with sharp depth discontinuities better than the other methods. For the Sideboard scene, with its complex shapes and textures, our approach preserves more details and sharper boundaries by distinguishing subtle differences in EPI slopes. In addition, our approach obtains better disparity maps on the Boxes and Sideboard scenes than the recent state-of-the-art method , which uses the vertical, horizontal, left-diagonal, and right-diagonal viewpoints as inputs, almost double the number of viewpoints used by our approach.
| Scenes | Jeon et al.  | Williem et al.  | Wang et al.  | Zhang et al.  | Luo et al.  | Shin et al.  | Ours |
| Scenes | Jeon et al.  | Williem et al.  | Wang et al.  | Zhang et al.  | Luo et al.  | Shin et al.  | Ours |
In particular, the proposed approach predicts more accurate disparity values on the Boxes and Backgammon scenes, which contain multiple occlusions. However, for the Dots scene, which contains heavy noise, false straight-line estimation on the EPI patches leads to inaccurate disparity values in our approach; this is a common downside of applying EPIs in CNN-based methods (e.g., Luo et al. ). Note that we do not apply any post-processing for depth optimization, while most other methods [9, 23, 20, 26, 14] rely on post-optimization.
5 Conclusion
In this paper, we propose an end-to-end fully convolutional network for depth estimation from light fields by exploiting horizontal and vertical EPIs. We introduce a new relational reasoning module to model the relationship between oriented lines in EPIs. In addition, we propose a new data augmentation method based on refocusing EPIs. We demonstrate the effectiveness of our approach on the 4D Light Field Benchmark . Our approach is competitive with state-of-the-art methods, and predicts more accurate disparity maps in challenging scenes such as Boxes and Sideboard without any post-processing.
-  M. Diebold and B. Goldluecke. Epipolar Plane Image Refocusing for Improved Depth Estimation and Occlusion Handling. In Vision, Modelling & Visualization, 2013.
-  M. Feng, Y. Wang, J. Liu, L. Zhang, H. F. M. Zaki, and A. Mian. Benchmark Data Set and Method for Depth Estimation From Light Field Images. IEEE Transactions on Image Processing, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Heber and T. Pock. Convolutional Networks for Shape from Light Field. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Heber, W. Yu, and T. Pock. Neural EPI-Volume Networks for Shape from Light Field. In Proceedings of International Conference on Computer Vision (ICCV), 2017.
-  S. Heber, W. Yu, and T. Pock. U-shaped Networks for Shape from Light Field. In Proceedings of British Machine Vision Conference, 2016.
-  K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke. A Dataset and Evaluation Methodology for Depth Estimation on 4D Light Fields. In Proceedings of Asian Conference on Computer Vision (ACCV), 2016.
-  H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation Networks for Object Detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  H. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. Tai, and I. S. Kweon. Accurate depth map estimation from a lenslet light field camera. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  Z. Jiang and G. Shen. In Proceedings of International Conference on Systems and Informatics (ICSAI), 2019.
-  O. Johannsen, C. Heinze, B. Goldluecke, and C. Perwass. On the Calibration of Focused Plenoptic Cameras. Springer Berlin Heidelberg, 2013.
-  M. Levoy and P. Hanrahan. Light Field Rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996.
-  H. Lin, C. Chen, S. B. Kang, and J. Yu. Depth Recovery from Light Field Using Focal Stack Symmetry. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  Y. Luo, W. Zhou, J. Fang, L. Liang, H. Zhang, and G. Dai. EPI-Patch Based Convolutional Neural Network for Depth Estimation on 4D Light Field. In International Conference on Neural Information Processing, 2017.
-  L. Mou, Y. Hua, and X. X. Zhu. A Relation-Augmented Fully Convolutional Network for Semantic Segmentation in Aerial Scenes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  R. Ng, M. Levoy, G. Duval, M. Horowitz, and P. Hanrahan. Light Field Photography with a Hand-held Plenoptic Camera. Computer Science Technical Report CSTR, 2005.
-  R. Ng. Digital Light Field Photography. PhD thesis, Stanford University, 2006.
-  C. Shin, H. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim. EPINET: A Fully-Convolutional Neural Network Using Epipolar Geometry for Depth from Light Field Images. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi. Depth from Combining Defocus and Correspondence Using Light-Field Cameras. In Proceedings of International Conference on Computer Vision (ICCV), 2013.
-  T.-C. Wang, A. Efros, and R. Ramamoorthi. Occlusion-Aware Depth Estimation Using Light-Field Cameras. In Proceedings of International Conference on Computer Vision (ICCV), 2017.
-  S. Wanner and B. Goldluecke. Globally Consistent Depth Labeling of 4D Light Fields. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  S. Wanner and B. Goldluecke. Variational Light Field Analysis for Disparity Estimation and Super-Resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.
-  W. Williem and I. K. Park. Robust Light Field Depth Estimation for Noisy Scene with Occlusion. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  X. Sun, Z. Xu, N. Meng, E. Y. Lam, and H. K.-H. So. Data-driven light field depth estimation using deep Convolutional Neural Networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN), 2016.
-  S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong. Robust depth estimation for light field via spinning parallelogram operator. Computer Vision and Image Understanding (CVIU), 2016.
-  Y. Zhang, H. Lv, Y. Liu, H. Wang, X. Wang, Q. Huang, X. Xiang, and Q. Dai. Light-Field Depth Estimation via Epipolar Plane Image Analysis and Locally Linear Embedding. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
-  B. Zhou, A. Andonian, and A. Torralba. Temporal Relational Reasoning in Videos. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
-  F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A Sufficient Condition for Convergences of Adam and RMSProp. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.