1 Introduction
A dense light field contains detailed multi-perspective information about a real-world scene. Utilizing this information, previous work has demonstrated many exciting applications of light fields, including changing the focus [1], depth estimation [2, 3, 4, 5], and saliency detection [6]. However, it is difficult for existing devices to properly capture such a large quantity of information. In early light field capturing methods, light fields are recorded by multi-camera arrays or light field gantries [7], which are bulky and expensive. In recent years, commercial light field cameras such as Lytro [8] and Raytrix [9] have been introduced to the general public, but they are still unable to efficiently sample a dense light field due to their trade-off between angular and spatial resolution.
Many methods have been proposed to synthesize novel views from a set of sparsely sampled views of a light field [10, 11, 12, 13], but these methods only increase the view density within a single light field. Kalantari et al. [12] proposed a learning-based method to synthesize a novel view at an arbitrary position in a light field using the views at its four corners. Recently, Wu et al. [13] proposed a learning-based method to synthesize novel views by increasing the resolution of the EPI. These methods outperform other state-of-the-art methods [10, 11] on view synthesis. However, all of them can only synthesize novel views within a single light field. Besides, in these methods the baseline between sampled views has to be sufficiently narrow; they cannot properly reconstruct a large quantity of novel light rays across a wide baseline.
In this paper, we explore dense light field reconstruction from sparse sampling. We propose a novel learning-based method to synthesize a great number of novel light rays between two distant input light fields whose view planes are coplanar. Using the disparity consistency between light fields, we first model the relationship between the EPIs of the dense and sparse light fields. Then, instead of performing error-sensitive disparity estimation, we extend the disparity consistency between EPIs of the sparse light field by employing a ResNet. Finally, we reconstruct a large quantity of light rays between the input light fields. The proposed method is capable of rendering a dense light field from multiple input light fields captured by a commercial light field camera, and it requires neither depth estimation nor other priors. Experimental results on real-world scenes demonstrate the performance of the proposed method, which is capable of reconstructing up to four novel light fields between two input light fields. Besides, in terms of the quality of synthesized view images, our method outperforms state-of-the-art methods both quantitatively and qualitatively.
Our main contributions are:
1) We present a learning-based method for reconstructing a dense light field from a sparse set of light fields sampled by a commercial light field camera.
2) Our method is able to reconstruct a large quantity of light rays, including occluded regions, between two distant input light fields.
3) We introduce a high-angular-resolution light field dataset whose angular resolution is the highest among light field benchmark datasets to date.
2 Related Work
A densely sampled light field is needed for many computer vision applications. However, acquiring and storing such a massive number of light rays costs a great deal of time and space with existing devices and algorithms. Many research groups have focused on increasing a camera-captured light field's resolution from a set of samples [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. Here, we survey some state-of-the-art methods.
2.1 View-based vs. EPI-based Angular Interpolation
Wanner and Goldluecke [10] used an estimated depth map to warp input views to novel views. However, the quality of the synthesized view is easily affected by the accuracy of the depth map. Levin et al. [14] used a dimensionality-gap prior to render a 4D light field from a 3D focal stack. Shi et al. [16] took advantage of the sparsity of light fields in the continuous Fourier domain, sampling multiple 1D viewpoint trajectories with special patterns to reconstruct a full 4D light field. Zhang et al. [11] introduced a phase-based method to reconstruct a full light field from a micro-baseline image pair. Schedl et al. [17] reconstructed a full light field by searching for best-matching multidimensional patches within the dataset. However, due to the limitations of their specific sampling patterns and algorithmic complexity, these methods are unable to properly generate a dense light field: the angular resolution of the synthesized light field is at most 20×20 with commercial light field cameras. Marwah et al. [15] proposed a method to reconstruct a light field from a coded 2D projection, but they need specially designed equipment to capture the compressive light field.
Recently, learning-based methods have been explored for light field super-resolution. Kalantari et al. [12] introduced a learning-based method that uses four corner views to synthesize an arbitrary view in a single light field; they used two sequential networks to estimate the depth and color values of pixels in the novel view. Srinivasan et al. [20] proposed a learning-based method to synthesize a full 4D light field from a single view image. However, these methods heavily rely on the accuracy of the depth map. Yoon et al. [18] trained several CNNs to increase spatial and angular resolution simultaneously, but the method can only synthesize one novel view between two or four input views. Wang et al. [21] proposed a learning-based hybrid imaging system to reconstruct light field video. Although that work did not directly aim at novel view synthesis, it in fact synthesized novel frames containing different views of a light field; in their system, a DSLR provided prior information equivalent to the central view of each synthesized light field frame. Instead of using an extra prior to guide light field reconstruction, our method only uses light fields captured by a commercial light field camera as input.
The EPI has a strongly linear structure, and many methods have explored light field processing based on it. However, fewer works focus on angular interpolation of the light field. Wu et al. [13] trained a residual network to increase the angular resolution of the EPI. They employ a 1D blur kernel to remove high spatial frequencies from the sparsely sampled EPI before feeding it to the CNN, and then perform non-blind deblurring to restore the high-frequency details removed by the blur kernel. However, due to the limited size of the blur kernel and the interpolation algorithm used in preprocessing, this method also fails when the baseline of the input views is wide.
2.2 View Synthesis vs. Light Field Reconstruction
Many methods focus on light field view synthesis, such as [10, 11, 12, 13, 14, 15, 16, 17, 20]. The key observation is that all these methods synthesize novel views in the interior of a light field. Due to the narrow baseline between the views of existing commercial light field cameras, the input views in these methods have a high overlapping ratio with each other. These methods exploit the redundant information between input views; therefore, they are only able to synthesize novel views within the span of the input views. In contrast, we propose a novel method that reconstructs multiple light fields between two input light fields instead of reconstructing views inside a single light field. Therefore, our method is capable of reconstructing a dense light field from a sparse set of input light fields. Besides, it is able to reconstruct a large number of occluded regions in the reconstructed light fields without requiring depth estimation.
We model novel light field reconstruction on the 2D EPI. Different from other light field view synthesis methods, our method directly reconstructs multiple novel light fields, as shown in Fig. 1. Moreover, the view planes of the input light fields do not overlap, so there are hundreds of missing views between two input light fields. The difficulty therefore lies in reconstructing a large quantity of light rays and occluded regions.
3 Problem Formulation
In the two-parallel-plane parameterization model, a light field is formulated as a 4D function L(u, v, s, t), where the pair (s, t) represents the intersection of a light ray with the view plane and (u, v) represents its intersection with the image plane. In this paper, we assume that the input light fields' view planes are coplanar (see Fig. 2) and non-overlapping. Our task is to interpolate a great number of novel light rays between these light fields.
All views in a light field are assumed to be perspective cameras with identical intrinsic parameters. The transformation between views within the same light field is merely a translation without rotation. Therefore, a 3D point P = (X, Y, Z) in the real-world scene is mapped to the pixel (u, v) in the image plane of the light field as follows:

    u = ku + d·s,   v = kv + d·t,                                (1)

where ku and kv denote constants determined by P, d refers to the disparity of P, and f, the interval between the view plane and the image plane, determines d together with the depth Z. Then, the constraint on the light rays emitted from P in the light field can be described as

    u − d·s = ku,   v − d·t = kv.                                (2)
In fact, Eq. 2 is the mathematical description of the EPI, a 2D slice cut from the 4D light field. Its simple linear structure makes the EPI easy to analyze. As a specific representation of the light field, the EPI includes both angular and spatial information, and it has two significant properties. One is that a scene point is represented by a straight line whose slope is the constant value d. The other is that pixels on such a straight line refer to different light rays emitted from the same point. In fact, the slope of a line in the EPI reflects the disparity of a point observed in different views. We define this linear constraint between u and s in the EPI as disparity consistency, as formulated in Eq. 2. Based on disparity consistency, an EPI E(u, s) can be further formulated as

    E(u, s) = E(u + d·Δs, s + Δs).                               (3)
For any two light fields in our assumption, the transformation between them is merely a translation without rotation:

    P2 = R·P1 + T,  with R = I,                                  (4)

where I is an identity matrix, T is the translation between the two light fields, and P1 and P2 are the coordinates of the same scene point in the two light fields. Besides, the disparity of an identical scene point stays the same in multiple light fields. The transformation between the EPIs of any two light fields in our assumption is therefore formulated as

    E2(u, s) = E1(u + Δu, s + Δs),                               (5)

where Δu is equal to d·Δs. Therefore, under the condition that the two light fields' view planes are coplanar, their EPIs can represent each other through disparity consistency.
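The disparity-consistency line structure described above can be checked with a toy example: a minimal NumPy sketch (the disparity value and EPI size below are illustrative, not from the paper) that builds an EPI for a single scene point and verifies that the lit pixels form a line of slope d.

```python
import numpy as np

# Toy EPI E(u, s): 9 views along s, 64 samples along u (t and v held fixed).
S, U = 9, 64
epi = np.zeros((S, U))

# A scene point with disparity d traces the straight line u = u0 + d*s.
d, u0 = 2, 10
for s in range(S):
    epi[s, u0 + d * s] = 1.0  # different light rays emitted from one point

# Disparity consistency: the slope between consecutive lit pixels equals d.
coords = np.argwhere(epi == 1.0)          # one (s, u) pair per view
slopes = np.diff(coords[:, 1]) / np.diff(coords[:, 0])
```

Every entry of `slopes` equals the disparity d, which is exactly the linear constraint of Eq. 2.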
In our model, there are two kinds of light fields (see Fig. 2). One is the sparse light field, which is made up of the input light fields. The other is the dense light field, which is reconstructed based on a sparse light field. In terms of the universe of light rays, the sparse light field is actually a subset of the dense light field. Besides, both of them meet the condition that their view planes are coplanar under our assumption. Therefore, their EPIs can also represent each other through disparity consistency:

    Ed(u, s) = Es(u + Δu, s + Δs),                               (6)

where Ed is the dense light field's EPI and Es is the EPI of the sparse light field. Thus, we are able to reconstruct a dense light field by extending the disparity consistency in the sparse light field.
4 Reconstruction based on Residual Network
To obtain disparity consistency, disparity estimation with existing algorithms is an error-sensitive solution. In our method, we instead extend the disparity consistency among light fields by employing a neural network.
Compared with the dense light field, many light rays are absent from the sparse light field composed of the input light fields (see Fig. 2). Entire rows of pixels need to be reconstructed in its EPI, and these missing rows form a blank band. We initially set the pixel values in the blank band to zero, as shown in Fig. 2. Our task is to find an operation that predicts the pixel values in the blank band.
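Concretely, the zero-initialized sparse EPI can be formed as follows (a minimal NumPy sketch; the row counts and band position are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
S_dense, W = 21, 64                       # dense EPI: 21 view rows, width 64
E_dense = rng.random((S_dense, W))

# Suppose only the first and last 5 views are captured (the two input light
# fields); the 11 rows in between form the blank band.
sampled = np.zeros(S_dense, dtype=bool)
sampled[:5] = sampled[-5:] = True

# Keep the sampled rows, zero-initialize the blank band.
E_sparse = np.where(sampled[:, None], E_dense, 0.0)
```

The network's job is then to fill in the zeroed rows while the sampled rows pass through unchanged.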
4.1 Network Architecture
In fact, the pixel values of the input light fields remain unchanged in the sparse light field's EPI during the reconstruction procedure; we only need to predict the pixel values in the blank band. Therefore, we regard the pixel values in the blank band as the residual between the EPI of the dense light field and the EPI of the sparse light field:

    Ed = Es + R,                                                 (7)

where R refers to the residual between Ed and Es. Thus, we employ a ResNet to predict the residual. Due to its residual blocks and shortcuts, the network only needs to model the residual between input and output, and it preserves the high-frequency details of the EPI. We reformulate the reconstruction of the dense light field's EPI as follows:

    Ed ≈ Êd = Es + f(Es; θ),                                     (8)

where f refers to the operation of the residual network that solves the residual between input and output, and θ refers to the parameters of the convolution layers in the network. Therefore, the residual between Ed and Es can be solved by iteratively minimizing the difference between the output and the ground truth, which refer to Êd and Ed respectively in Eq. 8.
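The residual formulation above (Eq. 8) leaves the input rows untouched and only fills the blank band; a quick NumPy sanity check with a stand-in for the network output f(Es; θ):

```python
import numpy as np

rng = np.random.default_rng(1)
E_dense = rng.random((21, 64))                       # ground-truth EPI
sampled = np.zeros(21, dtype=bool)
sampled[:5] = sampled[-5:] = True
E_sparse = np.where(sampled[:, None], E_dense, 0.0)  # zero-filled blank band

# The ground-truth residual is non-zero only inside the blank band.
residual = E_dense - E_sparse

# A perfect network output would equal this residual, making the
# reconstruction E_out = E_sparse + f(E_sparse) exact.
E_out = E_sparse + residual
mse = np.mean((E_out - E_dense) ** 2)
```

Because the sampled rows already match the ground truth, the residual vanishes there, which is why predicting only the blank band suffices.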
The structure of the supervised network is shown in Fig. 3. The network contains 32 convolutional layers. The input and output are a single RGB image of the incomplete EPI and of the intact EPI, respectively. The main part of the network contains 5 convolutional sections, and each section has 3 residual blocks as introduced by He et al. [24]. The layers within a section have the same number of filters and the same filter size. To preserve high-frequency details in the EPI, we remove all pooling operations from the network and keep each layer's input and output at the same size.
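The building block can be sketched as follows: a single-channel NumPy version of a residual block (an illustrative sketch, not the authors' TensorFlow implementation; the filter contents are arbitrary), showing that 'same' padding and the identity shortcut keep the EPI at its input size with no pooling:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2D convolution with 'same' zero padding."""
    kh, kw = kernel.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def residual_block(x, k1, k2):
    """y = x + F(x): the shortcut carries the input through unchanged,
    so the convolutional layers only have to model the residual."""
    h = np.maximum(conv2d_same(x, k1), 0.0)  # conv + ReLU
    return x + conv2d_same(h, k2)            # identity shortcut

epi = np.random.default_rng(2).random((16, 32))
k1 = np.full((3, 3), 1.0 / 9.0)              # arbitrary 3x3 filter
k2 = np.zeros((3, 3))                        # zero second conv => F(x) = 0
y = residual_block(epi, k1, k2)
```

With the second filter zeroed, F(x) = 0 and the block reduces to the identity, which is the property that lets the network preserve the high-frequency content of the input EPI.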
4.2 Training Details
We have modelled light field reconstruction between input light fields as a learning-based end-to-end regression. To minimize the error between the output of the network and the ground truth, we use the mean squared error (MSE) as the loss function of our network:

    L = (1/N) Σᵢ ‖Êdⁱ − Edⁱ‖²,                                   (9)

where N is the number of input EPIs. Since the training is a supervised process, we use EPIs cut from the dense light fields in our dataset (see Section 5) as the ground truth to guide the training.
In the training process, to make the model converge efficiently and improve accuracy, we initialize the network's filter parameters using the Xavier method [25] and optimize them with the ADAM algorithm [26]. Besides, to prevent overfitting, we augment the training data by randomly adjusting the brightness of the EPIs and adding Gaussian noise to them. We train the network for 5 epochs of 2256 iterations each. The learning rate is initially set to 1e-4 and is then decreased by a factor of 0.96 every 2256 iterations to make the model converge more quickly. Each training batch contains 30 EPIs. Training takes about 23 hours on 6 GTX 1080 Ti GPUs with TensorFlow.
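The staircase decay schedule described above can be written as (a sketch; `base`, `decay` and `interval` mirror the values stated in the text):

```python
def learning_rate(step, base=1e-4, decay=0.96, interval=2256):
    """Learning rate after `step` iterations: 1e-4, multiplied by 0.96
    every 2256 iterations (staircase exponential decay)."""
    return base * decay ** (step // interval)

# 5 epochs x 2256 iterations => the rate has been decayed 4 times by the end.
final_lr = learning_rate(5 * 2256 - 1)
```

Since one decay interval equals one epoch, the rate drops once per epoch over the 5-epoch schedule.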
5 Experiments
In this section, we first explain the capturing process of our high-angular-resolution light field dataset. Then, we evaluate the proposed method on this dataset using a sparse sampling pattern. In addition, we test the method's capacity for reconstructing dense light fields with different sampling patterns.
5.0.1 Dataset Capturing.
The angular resolution of existing light field datasets is too low to verify our method. Besides, since our training is a supervised process, we need ground-truth dense light fields during training. We therefore create a dense light field dataset composed of 13 indoor scenes and 13 outdoor scenes. It contains plenty of real-world static scenes, such as bicycles, toys and plants, with abundant colors and complicated occlusions. Each scene contains 100 light fields captured by a Lytro ILLUM, giving 2600 light fields in total.
For each scene, to make all the captured light fields' view planes coplanar, we mount the Lytro ILLUM on a translation stage and move the camera along a line during capturing; our capturing system is shown in Fig. 4(a). Furthermore, to obtain a dense light field for each scene, we set a proper step size for the translation stage while moving the camera, which ensures that the view planes of each pair of adjacent light fields overlap, as shown in Fig. 4(b). In our experiments, 5 views overlap between each pair of adjacent light fields. The camera focuses at infinity during capturing. All light fields are decoded by Lytro Power Tools. For each light field, the central views are extracted from the views provided by the raw data to maintain imaging quality. Then, we fuse the overlapping views between each pair of adjacent light fields to merge all the light fields together. After merging, each scene is recorded by a 405-view high-angular-resolution light field. From another perspective, this high-angular-resolution light field is composed of 45 low-angular-resolution light fields whose view planes connect with each other but share no overlapping views.
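The figures above are mutually consistent: with 405 merged views split into 45 sub-light-fields, each capture contributes 9 views along the translation axis (405/45; this per-capture count is inferred rather than stated explicitly), and with 5 overlapping views each of the 99 subsequent captures adds 4 new views:

```python
views_per_capture = 405 // 45   # inferred views per light field along the axis
overlap = 5                     # overlapping views between adjacent captures
captures = 100                  # light fields captured per scene

new_per_capture = views_per_capture - overlap          # 4 new views each
merged_views = views_per_capture + (captures - 1) * new_per_capture
sub_light_fields = merged_views // views_per_capture   # non-overlapping LFs
```

This reproduces the stated totals of 405 merged views and 45 non-overlapping light fields per scene.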
5.0.2 Real-world Scenes Evaluation
We design three sparse sampling patterns to evaluate the proposed method on our real-world dataset; with different sampling patterns, the number of light fields to be reconstructed between each pair of input light fields differs. First, we sparsely sample multiple light fields from each scene's dense light field to make up a sparse light field. Then, we use the EPI of the sparse light field as our network's input to generate the intact EPI and reconstruct a dense light field. The sampling patterns are shown in Fig. 6. We choose 20 scenes as training data, which contain 67,680 EPIs. The other 6 scenes (see Fig. 5) are used to test our trained model and the other methods. For our method, we reconstruct 2 novel light fields between each pair of input light fields to verify the proposed method. The methods of Kalantari et al. [12] and Wu et al. [13] perform better than other state-of-the-art methods, so we use them to evaluate the quality of views in the reconstructed light fields. When evaluating these two methods on our real-world dataset, we carefully fine-tune all parameters to obtain the best performance among their results, and we set the same up-sampling factor in their code.
The average PSNR and SSIM values calculated on each testing scene's reconstructed view images are listed in Table 1. In the method of Kalantari et al. [12], the quality of the synthesized view is heavily dependent on the accuracy of the depth map; it tends to fail on the Basket and Shrub data, since these scenes are challenging cases for depth estimation. The method proposed by Wu et al. [13] uses a "blur-deblur" scheme to increase the resolution of the EPI instead of estimating depth, and achieves better performance than Kalantari et al. [12] on these scenes. However, their method has to increase the light field's angular resolution sequentially: the result of each lower-resolution reconstruction is used as the input of the next higher-resolution reconstruction, so the reconstruction error accumulates as the angular resolution increases. Our method requires no error-sensitive depth estimation, and all the light rays in the reconstructed light fields are synthesized at once. Therefore, in terms of quantitative evaluation, the results indicate that our method is significantly better than the other methods in the quality of synthesized views.
| Metric | Method                | Six testing scenes                            |
| PSNR   | Kalantari et al. [12] | 34.42 | 30.26 | 30.64 | 31.50 | 28.55 | 31.82 |
| PSNR   | Wu et al. [13]        | 35.21 | 32.87 | 35.74 | 32.82 | 30.73 | 38.67 |
| SSIM   | Kalantari et al. [12] | 0.897 | 0.922 | 0.862 | 0.877 | 0.878 | 0.850 |
| SSIM   | Wu et al. [13]        | 0.919 | 0.958 | 0.943 | 0.904 | 0.910 | 0.947 |
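For reference, the PSNR figures in Table 1 follow the standard per-pixel MSE definition, sketched here in NumPy (the SSIM computation is more involved and omitted):

```python
import numpy as np

def psnr(reference, reconstructed, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((reference - reconstructed) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

gt = np.zeros((8, 8))
approx = gt + 0.01          # uniform error of 0.01 -> MSE = 1e-4 -> 40 dB
value = psnr(gt, approx)
```

Higher PSNR means a smaller mean squared error between the synthesized view and the ground-truth view.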
Fig. 7 shows view images from the reconstructed light fields. The Toys scene contains plenty of textureless areas. Kalantari et al. [12]'s result shows heavy artifacts on the dog's mouth and the bottle, as shown in the blue and yellow boxes in the view image; the dog's mouth is torn apart in their result, while ours remains faithful in these areas. The Basket scene is a challenging case due to the hollowed-out grids on the baskets, whose gridlines generate plenty of occlusions. The result of Kalantari et al.'s method shows visual incoherence in the grid areas of the baskets, as shown in Fig. 7(b), and the grids reconstructed by Wu et al.'s method are twisted. Besides, the views synthesized by both methods show blurring artifacts around the handle of the basket, as shown in Fig. 7(b)(c). Our results perform better in all the areas mentioned above, and our method finely reconstructs the high-frequency details of the scenes. The Flower scene contains many leaves and petals with complex shapes, which generate numerous occlusions. The results of Kalantari et al. [12] and Wu et al. [13] show ghosting around the petals and occlusion edges, whereas our method shows high accuracy in these occluded and textureless areas, such as where two petals of the same color overlap (see the yellow boxes of the Flower scene in Fig. 7).
Fig. 8 shows details of the reconstructed EPIs for the Flower and Basket cases; for our method, the EPIs in Fig. 8 are cropped from our results. The Flower EPI contains many thin tubes formed by pixels from the flower's stamens. These thin tubes are blended together in Kalantari et al.'s result, and Wu et al.'s result shows cracked artifacts on the EPI tubes, while the tubes in our result remain straight and clear. The Basket scene is a challenging case for EPI-based methods: the grids on the basket also generate grids in the EPI, so EPI-based methods are challenged by its complex structure. According to the results, Wu et al.'s method produces many curved tubes in its EPI, and Kalantari et al.'s result loses many details around the tubes, while our result shows a well-structured EPI.
| Pattern A | Pattern B | Pattern C |
5.0.3 Method Capacity.
As shown in Fig. 6, with different sampling patterns, the number of light fields to be reconstructed between each pair of input light fields differs. To test our method's capacity for reconstructing a dense light field, we separately train the network with the sampling patterns in Fig. 6 to reconstruct 2, 3 and 4 novel light fields between each two input light fields. Then, we evaluate the results over the 6 testing scenes. Table 2 indicates that the PSNR and SSIM values averaged over the 6 testing scenes decrease as the number of light fields reconstructed between each pair of input light fields increases. Fig. 9 depicts error maps of the view images in the reconstructed light fields for the different sampling patterns; it likewise indicates that the quality of the reconstructed light fields decreases as the reconstruction number increases. More results are shown in our supplementary material.
6 Conclusions and Future Work
We propose a novel learning-based method for reconstructing a dense light field from sparse sampling. We model novel light field reconstruction as extending the disparity consistency between dense and sparse light fields. Besides, we introduce a densely sampled light field dataset whose light fields have the highest angular resolution to date. The experimental results show that our method can not only reconstruct a dense light field using 2 input light fields, but also extends to multiple input light fields. In addition, our method achieves higher quality of synthesized views than state-of-the-art view synthesis methods.
Currently, our method can only deal with light fields captured along a sliding track without rotation of the principal axis. In the future, we will generalize our method to light field camera motion with more degrees of freedom. Furthermore, it would be interesting to automatically choose a suitable sampling rate of light rays and reconstruct a dense light field for a specific scene.
- [1] Ng, R.: Digital light field photography. Stanford University (2006)
- [2] Wanner, S., Goldluecke, B.: Globally consistent depth labeling of 4D light fields. In: IEEE CVPR. (2012) 41–48
- [3] Wang, T.C., Efros, A.A., Ramamoorthi, R.: Depth estimation with occlusion modeling using light-field cameras. IEEE T-PAMI 38(11) (2016) 2170–2181
- [4] Jeon, H.G., Park, J., Choe, G., Park, J., Bok, Y., Tai, Y.W., So Kweon, I.: Accurate depth map estimation from a lenslet light field camera. In: IEEE CVPR. (2015) 1547–1555
- [5] Zhu, H., Wang, Q.: An efficient anti-occlusion depth estimation using generalized EPI representation in light field. In: SPIE/COS Photonics Asia
- [6] Li, N., Ye, J., Ji, Y., Ling, H., Yu, J.: Saliency detection on light field. In: IEEE CVPR. (2014) 2806–2813
- [7] Wilburn, B., Joshi, N., Vaish, V., Talvala, E.V., Antunez, E., Barth, A., Adams, A., Horowitz, M., Levoy, M.: High performance imaging using large camera arrays. In: ACM TOG. Volume 24. (2005) 765–776
- [8] Lytro. https://www.lytro.com/ (2016)
- [9] RayTrix: 3D light field camera technology. https://raytrix.de/
- [10] Wanner, S., Goldluecke, B.: Variational light field analysis for disparity estimation and super-resolution. IEEE T-PAMI 36(3) (2014) 606–619
- [11] Zhang, Z., Liu, Y., Dai, Q.: Light field from micro-baseline image pair. In: IEEE CVPR. (2015) 3800–3809
- [12] Kalantari, N.K., Wang, T.C., Ramamoorthi, R.: Learning-based view synthesis for light field cameras. ACM TOG 35(6) (2016) 193
- [13] Wu, G., Zhao, M., Wang, L., Dai, Q., Chai, T., Liu, Y.: Light field reconstruction using deep convolutional network on EPI. In: IEEE CVPR. (2017)
- [14] Levin, A., Durand, F.: Linear view synthesis using a dimensionality gap light field prior. In: IEEE CVPR. (2010) 1831–1838
- [15] Marwah, K., Wetzstein, G., Bando, Y., Raskar, R.: Compressive light field photography using overcomplete dictionaries and optimized projections. ACM TOG 32(4) (2013) 46
- [16] Shi, L., Hassanieh, H., Davis, A., Katabi, D., Durand, F.: Light field reconstruction using sparsity in the continuous fourier domain. ACM TOG 34(1) (2014) 12
- [17] Schedl, D.C., Birklbauer, C., Bimber, O.: Directional super-resolution by means of coded sampling and guided upsampling. In: IEEE ICCP. (2015) 1–10
- [18] Yoon, Y., Jeon, H.G., Yoo, D., Lee, J.Y., So Kweon, I.: Learning a deep convolutional network for light-field image super-resolution. In: IEEE ICCV Workshops. (2015) 24–32
- [19] Zhang, F.L., Wang, J., Shechtman, E., Zhou, Z.Y., Shi, J.X., Hu, S.M.: PlenoPatch: Patch-based plenoptic image manipulation. IEEE T-VCG 23(5) (2017) 1561–1573
- [20] Srinivasan, P.P., Wang, T., Sreelal, A., Ramamoorthi, R., Ng, R.: Learning to synthesize a 4D RGBD light field from a single image. In: IEEE ICCV. (2017)
- [21] Wang, T.C., Zhu, J.Y., Kalantari, N.K., Efros, A.A., Ramamoorthi, R.: Light field video capture using a learning-based hybrid imaging system. ACM TOG 36(4) (2017) 133
- [22] Levoy, M., Hanrahan, P.: Light field rendering. In: ACM SIGGRAPH. (1996) 31–42
- [23] Bolles, R.C., Baker, H.H., Marimont, D.H.: Epipolar-plane image analysis: An approach to determining structure from motion. IJCV 1(1) (1987) 7–55
- [24] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE CVPR. (2016) 770–778
- [25] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS. (2010) 249–256
- [26] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)