Flexible, Fast and Accurate Densely-Sampled Light Field Reconstruction Network

The densely-sampled light field (LF) is highly desirable in various applications, such as 3-D reconstruction, post-capture refocusing and virtual reality. However, it is costly to acquire such data. Although many computational methods have been proposed to reconstruct a densely-sampled LF from a sparsely-sampled one, they still suffer from either low reconstruction quality, low computational efficiency, or the restriction on the regularity of the sampling pattern. To this end, we propose a novel learning-based method, which accepts sparsely-sampled LFs with irregular structures, and produces densely-sampled LFs with arbitrary angular resolution accurately and efficiently. Our proposed method, an end-to-end trainable network, reconstructs a densely-sampled LF in a coarse-to-fine manner. Specifically, the coarse sub-aperture image (SAI) synthesis module first explores the scene geometry from an unstructured sparsely-sampled LF and leverages it to independently synthesize novel SAIs, giving an intermediate densely-sampled LF. Then, the efficient LF refinement module learns the angular relations within the intermediate result to recover the LF parallax structure. Comprehensive experimental evaluations demonstrate the superiority of our method on both real-world and synthetic LF images when compared with state-of-the-art methods. In addition, we illustrate the benefits and advantages of the proposed approach when applied in various LF-based applications, including image-based rendering, depth estimation enhancement, and LF compression.


1 Introduction

The light field (LF) is a high-dimensional function describing light rays through every point traveling in every direction in free space [24, 12]. The function was originally introduced for LF rendering, an attractive method for generating novel views from a set of pre-acquired views. In contrast to traditional image-based rendering (IBR) methods, LF rendering treats the captured images as samples of the LF function, and novel views are generated by re-sampling a slice from the function in real time, without requiring any geometry information. To avoid ghosting effects, the LF must be densely sampled [1]. Densely-sampled LFs, which contain sufficient information, also facilitate a wide range of applications, such as accurate depth inference [50, 4], 3-D scene reconstruction [20] and post-capture refocusing [10]. In addition, with the rapid development of virtual reality technology, a densely-sampled LF becomes vital as it provides smooth angular parallax shift as well as natural focus details, both of which are important for a satisfying immersive viewing experience [60, 17, 33].

The densely-sampled LF is highly desirable but poses great challenges for acquisition. For example, LF images with high angular resolution can be captured using a camera array [52] for simultaneous sampling from different viewpoints, or a computer-controlled gantry [43] for time-sequential sampling at different positions. However, the former is expensive and bulky, while the latter is limited to static scenes. The commercialization of hand-held LF cameras such as Lytro [27] and Raytrix [36] makes it convenient to acquire LF images. These cameras are cheaper and more portable because they encode the 4-D LF data onto a single 2-D sensor. However, due to the limited sensor resolution, they suffer from a trade-off between spatial and angular resolution.

Instead of relying on the development of hardware, many computational methods have been proposed for reconstructing a densely-sampled LF from a sparse one, which can be realized with low-cost commercial devices. Previous works [48, 31, 28, 38, 63, 46] either estimate disparity maps as auxiliary information or use specific priors, such as sparsity in a transform domain, for dense reconstruction. With the recent development of deep learning for visual modeling, several learning-based methods [59, 19, 54, 53] have been proposed. However, most existing methods require the input sub-aperture images (SAIs) to be sampled with a specific or regular pattern, which raises difficulties for practical acquisition. Moreover, since the scene geometry is not explicitly and sufficiently modeled in these methods, aliasing becomes serious in the reconstructed images when the input LF is extremely under-sampled, i.e., when the samples have large baselines.

In a preliminary work [58], we proposed a learning-based model for densely-sampled LF reconstruction, in which all novel SAIs are reconstructed in one forward pass so that the intrinsic LF structural information among them is fully explored (see Section 2.2 for details). Although this method produces impressive, state-of-the-art results on extensive real-world images captured by the Lytro Illum camera, the performance degradation caused by large-baseline sampling and the lack of flexibility still remain. In this paper, built upon [58], we make several notable improvements, enabling flexible and accurate reconstruction of a densely-sampled LF from very sparse sampling. We inherit the coarse-to-fine framework of [58]: the proposed model consists of two modules, namely coarse SAI synthesis and efficient LF refinement. Specifically, the coarse SAI synthesis module independently synthesizes novel SAIs using geometry-based warping, where we take sampling patterns with large baselines and arbitrary positions into consideration. We also propose a novel strategy for handling occluded regions when blending the images warped from different viewpoints. This module synthesizes novel SAIs based on photo-consistency and thus only produces intermediate LF images under the Lambertian assumption. We further refine the coarse results using efficient pseudo 4-D filters, which preserve the intrinsic structure of the LF images based on the complementary information extracted between SAIs.

In summary, the main contributions of this paper are as follows:

  • we propose an end-to-end learning-based method for the reconstruction of densely-sampled LFs from sparsely-sampled LFs. Our method maintains high reconstruction quality when the sampling baseline increases, and improves the generality by enabling flexible input positions as well as flexible output angular resolution. We also propose effective strategies for occlusion handling and LF parallax structure preservation;

  • we investigate the relations between the input sampling patterns and the reconstruction quality, and provide a simple yet effective method for optimizing the sampling pattern;

  • we design extensive experiments to comprehensively evaluate and analyze our proposed method as well as the methods under comparison; and

  • we demonstrate and discuss the benefits of the proposed approach to subsequent LF-based applications.

The rest of this paper is organized as follows. Sec. 2 comprehensively reviews existing methods for view synthesis and densely-sampled LF reconstruction. Sec. 3 presents the proposed approach and investigates the optimization for sampling patterns. In Sec. 4, extensive experiments are carried out to evaluate the performance of the proposed approach. The benefits of the proposed approach to practical LF-based applications are validated and discussed in Sec. 5. Finally, Sec. 6 concludes this paper.

2 Related Work

2.1 View Synthesis

View synthesis, which takes one or more views as input to render novel views, is a long-standing problem in computer graphics and computer vision. Most algorithms leverage scene geometry information for view synthesis, that is, they extract or learn the global/local geometry from the input viewpoints and use the resulting geometry to warp the input views, followed by blending for novel view rendering [5, 30]. However, the forward warping operation typically leads to a hole-filling problem in occluded areas. Flynn et al. [11] proposed to project input views onto a set of depth planes and learn the weights to average the color of each plane. This method needs to learn specific geometry for different target viewpoints. To overcome this limitation, several methods based on 3-D scene representations were proposed. Penner et al. [35] presented a soft 3-D representation that preserves depth uncertainty. Tulsiani et al. [44] modeled the 3-D structure of the scene by learning to predict a layer-based representation, which encodes multiple ordered depths per pixel along with color values. Zhou et al. [64] proposed to use multiplane images, where each plane encodes color and transparency maps. With these methods, novel views at varying positions can be rendered by simply forward-projecting the corresponding representations. Besides, many methods aim at reconstructing 3-D scenes and synthesizing novel views from a single image (e.g., [65, 45, 34, 42]). However, these methods are still limited to simple and non-photorealistic synthetic objects.

Fig. 1: The flowchart of the proposed method for reconstructing a densely-sampled LF from a sparsely- and arbitrarily-sampled LF. Our proposed model consists of two phases, i.e., coarse SAI synthesis and efficient LF refinement.

2.2 LF Reconstruction

LF rendering requires densely-sampled LFs as input. In what follows, we only focus on methods that reconstruct a densely-sampled LF from a sparsely-sampled one. Available solutions can be roughly classified into two categories: non-learning-based methods and learning-based methods.

Non-learning based methods. Many traditional tools originally adopted for natural image processing, such as Gaussian models and sparse representations, have been explored for LF processing tasks. Among them, Mitra et al. [31] modeled LF patches using a Gaussian mixture model to address several LF processing tasks. Although it achieves promising results to a certain extent, it is not robust against noise. Shi et al. [38] exploited sparsity in the continuous Fourier domain to reconstruct densely-sampled LFs from a small set of samples. Vagharshakyan et al. [46] proposed an approach using the sparse representation of epipolar-plane images (EPIs) in the shearlet transform domain. These methods require the sparsely-sampled LF to be sampled on a regular grid. Moreover, some methods explore compressive LF photography. Marwah et al. [28] proposed a compressive LF camera architecture that allows LF reconstruction based on overcomplete dictionaries. To reduce the computational cost of dictionary learning, Kamal et al. [14] exploited a joint tensor low-rank and sparse prior for compressive reconstruction. These methods were specifically designed for coded LF acquisition.

Many works leverage explicit depth information for LF reconstruction. Zhang et al. [63] proposed a depth-assisted, phase-based synthesis strategy for a micro-baseline stereo pair. A patch-based synthesis method was presented by Zhang et al. [62], in which the center SAI is decomposed into different depth layers and LF editing is performed on all layers. However, this method has limited performance for view synthesis, especially for complex scenes. Some works warp the given SAIs to novel SAIs guided by an estimated disparity map. Wanner and Goldluecke [50] formulated SAI synthesis as an energy minimization problem with a total variation prior, where the disparity map is obtained through global optimization with a structure tensor computed on 2-D EPI slices. This approach treats disparity estimation as a step separate from view synthesis, which makes the reconstruction quality heavily dependent on the accuracy of the estimated disparity maps. Although subsequent research [18, 48, 4] has produced significantly better disparity estimates, ghosting and tearing artifacts are still present.

Learning-based methods. With the great success of deep convolutional neural networks in image processing [9, 39, 21, 23], many learning-based methods have been proposed for densely-sampled LF reconstruction. Yoon et al. [59] jointly super-resolved the LF image in both the spatial and angular domains using a network that closely resembles the model proposed in [8]. Their approach is limited to 2× angular super-resolution and cannot flexibly adapt to very sparsely-sampled LF input. Following the idea of single-image super-resolution, Wu et al. [55, 54] proposed an LF reconstruction method that focuses on recovering the high-frequency details of bicubically upsampled EPIs, together with a blur-deblur scheme to address the information asymmetry caused by sparse angular sampling. Based on the observation that an EPI shows a clear structure when sheared with the correct disparity value, Wu et al. [53] proposed to fuse a set of sheared EPIs for LF reconstruction. However, since each EPI is a 2-D slice of the 4-D LF, the spatial and angular information accessible to these EPI-based models is severely restricted. Moreover, for these models, novel SAIs must be synthesized horizontally or vertically in the 2-D angular domain, resulting in accumulated errors. Yeung et al. [58] proposed an end-to-end network for densely-sampled LF reconstruction. By exploring the relations between SAIs with pseudo 4-D filters, this method achieves state-of-the-art performance over a large number of real-world scenes captured by the Lytro camera.

In addition, depth information is also utilized in some learning-based methods for LF reconstruction. Srinivasan et al. [40] proposed to synthesize a 4-D LF image from a 2-D RGB image based on an estimated 4-D ray depth. However, this method requires a large training dataset and only works on simple scenes, since the information contained in a single 2-D image is extremely limited. Kalantari et al. [19] proposed to synthesize novel SAIs with two sequential networks that perform depth estimation and color prediction successively. Although this method achieves good performance on LF images captured by the Lytro camera, the depth estimation and color prediction modules are implemented in a straightforward manner, which leaves room for improvement.

3 The Proposed Approach

3.1 4-D LF and Problem Formulation

A 4-D LF can be represented with the two-plane parameterization, which uniquely describes the propagation direction of a light ray via its intersections with two parallel planes, i.e., the angular plane $(u, v)$ and the spatial plane $(x, y)$. Let $\mathcal{L}_d$ denote a densely-sampled LF containing $M \times N$ SAIs of spatial dimension $W \times H$, sampled on the angular plane over a regular 2-D grid of size $M \times N$. Let $\mathcal{A}_d$ be the set of 2-D angular coordinates of the SAIs in $\mathcal{L}_d$, i.e., $\mathcal{A}_d = \{\mathbf{u}_i = (u_i, v_i)\}_{i=1}^{M \times N}$. The SAI at $\mathbf{u} \in \mathcal{A}_d$ is denoted as $L(\mathbf{u})$. Let $\mathcal{L}_s$ denote a sparsely-sampled LF with $K$ SAIs, $\mathcal{A}_s$ be the set of the 2-D angular coordinates of the SAIs in $\mathcal{L}_s$, i.e., $\mathcal{A}_s = \{\mathbf{u}_k\}_{k=1}^{K}$, and $L(\mathbf{u}_k)$ be the SAI in $\mathcal{L}_s$ located at $\mathbf{u}_k$. Moreover, the SAIs of a sparsely-sampled LF are assumed to be arbitrarily sampled from a certain densely-sampled LF, i.e., $\mathcal{L}_s \subset \mathcal{L}_d$ and $\mathcal{A}_s \subset \mathcal{A}_d$. The unsampled SAIs, which belong to $\mathcal{L}_d$ but do not appear in $\mathcal{L}_s$, are denoted by $\mathcal{L}_n = \mathcal{L}_d \setminus \mathcal{L}_s$, with the operator $\setminus$ returning the difference between two sets; their angular coordinates are $\mathcal{A}_n = \mathcal{A}_d \setminus \mathcal{A}_s$.

Our goal is to learn $\widehat{\mathcal{L}}_n$ as close to $\mathcal{L}_n$ as possible based on $\mathcal{L}_s$, such that a densely-sampled LF, denoted by $\widehat{\mathcal{L}}_d$, can be reconstructed together with $\mathcal{L}_s$. This problem can be implicitly formulated as:

$$\widehat{\mathcal{L}}_d = f(\mathcal{L}_s) \cup \mathcal{L}_s, \tag{1}$$

where $f(\cdot)$ denotes the mapping function to be learned, and $\cup$ is the operator combining two sets.

3.2 Overview of the Proposed Method

SAIs in $\mathcal{L}_d$ are correlated to each other, which reveals the LF parallax structure. Specifically, under the Lambertian assumption and in the absence of occlusions, the relation between the SAIs of $\mathcal{L}_d$ can be expressed as

$$L(\mathbf{x}, \mathbf{u}) = L\big(\mathbf{x} + d(\mathbf{x})(\mathbf{u}' - \mathbf{u}),\, \mathbf{u}'\big), \tag{2}$$

where $\mathbf{x} = (x, y)$ is the spatial coordinate, and $d(\mathbf{x})$ is the disparity at pixel $\mathbf{x}$. Being aware of this unique characteristic as well as the great success of deep learning, we propose a learning-based approach to explore the LF parallax structure for densely-sampled LF reconstruction, i.e., constructing a deep network to learn $f(\cdot)$, as shown in Fig. 1. Our approach consists of two modules, namely the coarse SAI synthesis network $f_1$ and the LF refinement network $f_2$, which predict $\widehat{\mathcal{L}}_d$ in a coarse-to-fine manner. To be specific, by explicitly learning the scene geometry from the input SAIs, the coarse SAI synthesis network individually generates novel SAIs, giving an intermediate densely-sampled LF denoted as $\overline{\mathcal{L}}_d$:

$$\overline{\mathcal{L}}_d = f_1(\mathcal{L}_s) \cup \mathcal{L}_s. \tag{3}$$

The independent synthesis of the novel SAIs greatly saves computational time and memory usage during the testing stage. Then, the efficient refinement network $f_2$ learns residuals for $\overline{\mathcal{L}}_d$ by exploring the complementary information between the SAIs to recover the LF parallax structure, leading to the final output:

$$\widehat{\mathcal{L}}_d = \overline{\mathcal{L}}_d + f_2(\overline{\mathcal{L}}_d). \tag{4}$$
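To make the two-stage formulation concrete, the following is a minimal PyTorch sketch of how the coarse synthesis and refinement stages of Eqs. (3) and (4) compose. The module names (coarse_net, refine_net), their call signatures and the tensor layout are illustrative assumptions, not the actual implementation.

```python
import torch

def reconstruct_dense_lf(coarse_net, refine_net, sparse_sais, sparse_coords, novel_coords):
    """Coarse-to-fine reconstruction sketch (Eqs. (3)-(4)).

    sparse_sais:   [K, 3, H, W] input SAIs.
    sparse_coords: [K, 2] angular positions of the inputs.
    novel_coords:  [Q, 2] angular positions to synthesize.
    All module names and signatures are illustrative assumptions.
    """
    # Coarse stage: each novel SAI is synthesized independently (Eq. (3)).
    coarse_sais = torch.stack(
        [coarse_net(sparse_sais, sparse_coords, uv) for uv in novel_coords], dim=0
    )
    # Assemble the intermediate densely-sampled LF from inputs and coarse SAIs.
    intermediate_lf = torch.cat([sparse_sais, coarse_sais], dim=0)  # [K+Q, 3, H, W]
    # Refinement stage: residuals are predicted jointly over all SAIs (Eq. (4)).
    residual = refine_net(intermediate_lf)
    return intermediate_lf + residual
```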

By characterizing the sparsely- and densely-sampled LFs, our approach improves the flexibility and accuracy of the reconstruction of a densely-sampled LF. Specifically, our approach has the following characteristics:

  • it overcomes the aliasing problem caused by wide-baseline sampling, making it possible to accept sparsely-sampled LFs with different angular sampling rates as inputs;

  • it enables SAIs with arbitrary angular sampling patterns to be used as inputs, which brings more flexibility to densely-sampled LF reconstruction. Moreover, we further investigate how to optimize the sampling pattern to improve reconstruction quality;

  • beyond the aforementioned goal, our method can produce densely-sampled LFs with user-defined angular resolution, making it more flexible for densely-sampled LF reconstruction in various scenarios; and

  • it is able to accurately recover the valuable LF parallax structure, which is crucial for various applications based on a densely-sampled LF.

In the following, the details of the proposed approach are presented step-by-step.

3.3 Coarse SAI Synthesis

This module aims at independently synthesizing the intermediate novel SAIs $\overline{L}(\mathbf{u})$, which is formulated as

$$\overline{L}(\mathbf{u}) = f_1(\mathcal{L}_s; \mathbf{u}), \quad \mathbf{u} \in \mathcal{A}_n. \tag{5}$$

To handle inputs with wide baselines, we explicitly utilize geometry information for novel SAI synthesis. That is, we learn the disparity map at the target position $\mathbf{u}$ from $\mathcal{L}_s$ and synthesize the target SAI via backward warping. To deal with the challenge posed by irregular sampling patterns, we build the disparity estimation network on correspondences learned from plane-sweep volumes (PSVs) [6]. We also propose a new strategy for blending the warped images, which alleviates the artifacts around occlusion boundaries caused by warping. To this end, this module consists of three steps: PSV construction, disparity estimation, and warping and blending.

PSV construction. A naive way to estimate the disparity is to directly extract features from $\mathcal{L}_s$ with sequential convolutional layers. However, for randomly-sampled SAI inputs, i.e., when the angular position set $\mathcal{A}_s$ varies, it is difficult to properly provide the network with indicators of the sampling and target positions, making the prediction unreliable (see results in Fig. 7). Instead, we use PSVs for disparity estimation. The PSV with respect to a target position $\mathbf{u}$ is constructed by backward warping, i.e., reprojecting each input SAI with respect to a set of disparity planes $\mathcal{D} = \{d_l\}_{l=1}^{L}$, resulting in a set of warped images:

$$W_k^{l}(\mathbf{x}) = L\big(\mathbf{x} + d_l(\mathbf{u}_k - \mathbf{u}),\, \mathbf{u}_k\big), \quad \mathbf{u}_k \in \mathcal{A}_s, \; d_l \in \mathcal{D}. \tag{6}$$

In this way, the arbitrary sampling positions of the input SAIs, as well as the target position for synthesis, are encoded into the PSV during its construction.
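Below is a minimal sketch of the PSV construction in Eq. (6), using PyTorch's grid_sample for backward warping. The function name, the tensor layout, and the mapping of the angular offset (u, v) onto the horizontal/vertical image axes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def build_psv(input_sais, input_coords, target_coord, disparity_planes):
    """Build a plane-sweep volume at a target angular position (Eq. (6)) -- a sketch.

    input_sais:       [K, 3, H, W] input SAIs.
    input_coords:     [K, 2] angular coordinates (u, v) of the inputs (float tensor).
    target_coord:     [2] angular coordinate of the novel SAI.
    disparity_planes: iterable of candidate disparities d_l (pixels per unit angular offset).
    Returns a tensor of shape [L, K, 3, H, W] of warped images.
    """
    K, _, H, W = input_sais.shape
    # Base sampling grid in the normalized [-1, 1] coordinates used by grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base_grid = torch.stack([xs, ys], dim=-1)                      # [H, W, 2]

    volumes = []
    for d in disparity_planes:
        # Per-view shift d_l * (u_k - u), converted from pixels to grid units.
        delta = (input_coords - target_coord) * d                  # [K, 2]
        delta_norm = delta / torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0])
        grid = base_grid.unsqueeze(0) + delta_norm.view(K, 1, 1, 2)
        warped = F.grid_sample(input_sais, grid, align_corners=True,
                               padding_mode="border")              # [K, 3, H, W]
        volumes.append(warped)
    return torch.stack(volumes, dim=0)                             # [L, K, 3, H, W]
```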

The disparity inference from a PSV is based on the principle of photo-consistency. However, in occluded areas or on non-Lambertian surfaces, the relations between matching patches of different SAIs are complicated. We propose to feed the whole PSV into the disparity estimation network, which is different from the approach adopted in [19], where simple hand-crafted features, such as the mean and standard deviation of the PSV across disparity planes, are used. With a convolutional network's powerful representation-learning ability, we are able to accurately estimate the disparity maps in challenging regions from the rich information provided by the PSVs.

Disparity estimation. The disparity estimation network is designed to predict a disparity map at the target position from the PSV. The network consists of a cost calculator that learns a matching cost for each disparity plane, and an estimator that predicts the disparity value.

In the cost calculator, several convolutional layers with shared weights are applied to each disparity plane. For a given disparity plane, features measuring the similarity and diversity between the images warped from different input SAIs are extracted from the corresponding slice of the PSV. We use a large kernel size in the cost calculator to obtain a relatively large receptive field, and fix the number of channels in its final layer. In the disparity estimator, the features from all disparity planes are concatenated together, and sequential convolutional layers are used to predict the disparity value. Instead of selecting the disparity value with the minimum cost from the predefined disparity set, we let the network regress the disparity value, so that the number of predefined disparity planes, as well as the width of the network (i.e., the number of channels), can be reduced. The number of channels in the hidden layers of the estimator is gradually decreased toward the final layer, which outputs the disparity map.
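The following sketch illustrates the cost-calculator/estimator split described above. The layer counts, kernel sizes and channel widths are illustrative assumptions and do not reproduce the values used in the paper; only the overall structure, i.e., shared 2-D convolutions per disparity plane followed by concatenation and disparity regression, follows the description.

```python
import torch
import torch.nn as nn

class DisparityEstimator(nn.Module):
    """Sketch of the cost-calculator / estimator design (hyper-parameters are assumptions)."""

    def __init__(self, num_inputs, num_planes, feat_ch=16):
        super().__init__()
        # Cost calculator: shared 2-D convs applied to each disparity plane,
        # whose input is the K warped RGB images stacked along channels.
        self.cost = nn.Sequential(
            nn.Conv2d(3 * num_inputs, feat_ch, kernel_size=7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=7, padding=3),
            nn.ReLU(inplace=True),
        )
        # Estimator: fuses the per-plane costs and regresses the disparity map.
        self.estimator = nn.Sequential(
            nn.Conv2d(feat_ch * num_planes, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, psv):
        # psv: [P, K, 3, H, W] plane-sweep volume from build_psv() (batch size 1 assumed).
        P, K, C, H, W = psv.shape
        per_plane = self.cost(psv.reshape(P, K * C, H, W))   # shared weights across planes
        fused = per_plane.reshape(1, -1, H, W)                # concatenate plane features
        return self.estimator(fused)                          # [1, 1, H, W] disparity map
```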

Warping and blending. The novel SAI at the target position $\mathbf{u}$ can be synthesized by warping the input SAIs in $\mathcal{L}_s$ using the predicted disparity map $D_{\mathbf{u}}$. Specifically, the image obtained by warping $L(\mathbf{u}_k)$ to the target position can be expressed as

$$\widetilde{L}_k(\mathbf{x}, \mathbf{u}) = L\big(\mathbf{x} + D_{\mathbf{u}}(\mathbf{x})(\mathbf{u}_k - \mathbf{u}),\, \mathbf{u}_k\big). \tag{7}$$

Since the input SAIs capture valuable information of the scene from different viewpoints, they contribute to different areas of the target SAI. The warped images inevitably show artifacts around occlusion boundaries, and the locations of these artifacts vary among the source SAIs. Directly combining the images warped from different viewpoints by simple averaging, or with the convolutional layers adopted in [19], produces blurry results caused by the convolution and the loss function [29], especially when the input SAIs have large baselines. Therefore, we propose a blending strategy that fuses the images warped from different input SAIs using adaptive dense confidence maps. Specifically, the confidence maps are learned to indicate the pixel-wise accuracy of the images warped from different input SAIs, and the warped images are then blended by combining their accurate regions according to the confidence maps. This strategy properly handles the occlusion problem after warping and preserves clear textures in the synthesized novel SAI (see details in Sec. 4.3).

Considering that the disparity estimation network learns the relations between the input SAIs and implicitly models their relations with the target SAI, we let the final layer of the disparity estimation network predict the confidence maps corresponding to the input SAIs along with the disparity map. The blending is then formulated as:

$$\overline{L}(\mathbf{x}, \mathbf{u}) = \sum_{k=1}^{K} C_k(\mathbf{x}) \odot \widetilde{L}_k(\mathbf{x}, \mathbf{u}), \tag{8}$$

where $C_k$ is the dense confidence map for the $k$-th input SAI, and $\odot$ is the element-wise multiplication operator.
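A sketch of the warping in Eq. (7) and the confidence-based blending in Eq. (8) is given below, assuming the confidence maps are produced alongside the disparity map as described above. The softmax normalization of the confidences across views is an illustrative choice and is not stated in the text.

```python
import torch
import torch.nn.functional as F

def warp_and_blend(input_sais, input_coords, target_coord, disparity, confidences):
    """Sketch of warping (Eq. (7)) and confidence-based blending (Eq. (8)).

    disparity:   [1, 1, H, W] disparity map predicted at the target position.
    confidences: [1, K, H, W] raw confidence maps for the K input SAIs.
    The softmax normalization over views is an illustrative assumption.
    """
    K, _, H, W = input_sais.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base_grid = torch.stack([xs, ys], dim=-1)                        # [H, W, 2]

    # Per-pixel shift D(x) * (u_k - u), converted to normalized grid units.
    delta = input_coords - target_coord                              # [K, 2]
    scale = torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0])
    shift = disparity.permute(0, 2, 3, 1) * (delta / scale).view(K, 1, 1, 2)
    grid = base_grid.unsqueeze(0) + shift                            # [K, H, W, 2]
    warped = F.grid_sample(input_sais, grid, align_corners=True,
                           padding_mode="border")                    # [K, 3, H, W]

    # Blend the warped images with per-view, per-pixel confidence weights.
    weights = torch.softmax(confidences, dim=1).permute(1, 0, 2, 3)  # [K, 1, H, W]
    return (weights * warped).sum(dim=0, keepdim=True)               # [1, 3, H, W]
```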

3.4 Efficient LF Refinement

In the coarse SAI synthesis phase, novel SAIs are synthesized independently, and the particular LF parallax structure among them is not well taken into account, resulting in possible photometric inconsistencies between SAIs in the intermediate LF image $\overline{\mathcal{L}}_d$. Therefore, an efficient refinement network is designed to further exploit the structure of $\overline{\mathcal{L}}_d$, which is expected to recover photo-consistency and contribute positively to the reconstruction quality of the densely-sampled LF. Since the goal is to correct flaws that are inconsistent across SAIs while preserving high-frequency textures, residual learning is used in this module. In summary, we first exploit the LF parallax structure of $\overline{\mathcal{L}}_d$ and then reconstruct residual maps for it, as formulated in Eq. (4).

The LF parallax structure. To exploit the LF parallax structure within $\overline{\mathcal{L}}_d$, 4-D convolution is a straightforward choice. However, the computational cost of 4-D convolution is very high. Instead, pseudo filters or separable filters, which reduce model complexity by approximating a high-dimensional filter with filters of a lower dimension, have been applied to various computer vision problems, such as image structure extraction [37], 3-D rendering [56] and video frame interpolation [32]. They have recently been adopted in [49] for LF material classification and in [57] for LF spatial super-resolution, which verify that pseudo 4-D filters can achieve performance comparable to 4-D filters.

To prevent the potential overfitting and long training time caused by full 4-D filters, while still characterizing the 4-D information of the LF, we adopt pseudo 4-D filters, each of which approximates a single 4-D filtering step with two 2-D filters. Specifically, the intermediate feature maps are reshaped between a stack of spatial images and a stack of angular patches, so that convolution is performed alternately on the spatial and angular domains. Such a design requires only a fraction of the computation of a 4-D convolution while still exploiting all available information in the LF image.
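The alternating spatial-angular convolution can be sketched as follows; one such block approximates a single 4-D filtering step. Channel widths and kernel sizes are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAngularConv(nn.Module):
    """One pseudo 4-D filtering step: a 2-D spatial conv followed by a 2-D angular conv,
    with reshapes in between. A sketch; hyper-parameters are illustrative assumptions."""

    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.angular = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, feat, ang_size):
        # feat: [U*V, C, H, W] -- features of all SAIs, one per angular position.
        U, V = ang_size
        UV, C, H, W = feat.shape
        feat = self.act(self.spatial(feat))                       # convolve each SAI
        # Reshape to a stack of angular patches: one [U, V] patch per pixel.
        feat = feat.reshape(U, V, C, H, W).permute(3, 4, 2, 0, 1)
        feat = feat.reshape(H * W, C, U, V)
        feat = self.act(self.angular(feat))                       # convolve each patch
        # Reshape back to the stack of spatial images.
        feat = feat.reshape(H, W, C, U, V).permute(3, 4, 2, 0, 1)
        return feat.reshape(U * V, C, H, W)
```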

Residual reconstruction. After exploring the relations along the angular dimensions, the residual maps are reconstructed separately for each SAI of the intermediate LF image. Several 2-D spatial convolutional layers are applied to learn a residual map for each SAI from the extracted spatial-angular deep features. Each SAI is processed independently here for two reasons. First, we believe the preceding spatial-angular convolutions are capable of exploiting the LF parallax structure. Second, and more importantly, this keeps the network fully convolutional in both the spatial and angular dimensions, so that flexible output angular resolution is achieved. Finally, the reconstructed residual map is added to the previously synthesized intermediate LF image to produce the final reconstructed LF.

Fig. 2: Illustration of the relation between the minimum distance of the sampling patterns and the reconstruction quality. The blue dots denote the patterns generated randomly. The green dots and their marks correspond to the patterns in Fig. 3. The results of the selected optimal patterns are highlighted as red stars.
Fig. 3: Illustration of different sampling patterns. From top to bottom are sampling patterns with 4, 3 and 2 input SAIs, respectively. (f), (l) and (r) depict the optimal sampling patterns selected by our algorithm for the three corresponding tasks.

3.5 The Loss Function

All modules in our approach are differentiable, leading to an end-to-end trainable network. The loss function for training the network consists of three parts. The first part provides supervision for the intermediate LF by penalizing the absolute error between the intermediate LF images and the ground-truth ones, i.e.,

$$\ell_{\mathrm{inter}} = \sum_{\mathbf{u} \in \mathcal{A}_n} \big\| \overline{L}(\mathbf{u}) - L(\mathbf{u}) \big\|_1. \tag{9}$$

To promote the smoothness of the predicted ray disparity, we penalize the norm of its second-order gradients [47], denoted as $\ell_{\mathrm{smooth}}$:

$$\ell_{\mathrm{smooth}} = \sum_{\mathbf{u} \in \mathcal{A}_n} \sum_{\mathbf{x}} \big( |\partial_{xx} D_{\mathbf{u}}(\mathbf{x})| + |\partial_{xy} D_{\mathbf{u}}(\mathbf{x})| + |\partial_{yy} D_{\mathbf{u}}(\mathbf{x})| \big), \tag{10}$$

where $\partial_{xx} D_{\mathbf{u}}$, $\partial_{xy} D_{\mathbf{u}}$ and $\partial_{yy} D_{\mathbf{u}}$ are the second-order gradients over the spatial domain of the disparity map $D_{\mathbf{u}}$. Finally, the output reconstructed LF image is optimized by minimizing the absolute error:

$$\ell_{\mathrm{final}} = \sum_{\mathbf{u} \in \mathcal{A}_n} \big\| \widehat{L}(\mathbf{u}) - L(\mathbf{u}) \big\|_1. \tag{11}$$

Thus, our final objective is written as

$$\ell = \lambda_1 \ell_{\mathrm{final}} + \lambda_2 \ell_{\mathrm{inter}} + \lambda_3 \ell_{\mathrm{smooth}}, \tag{12}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights for the reconstruction accuracy and the disparity smoothness, and are set empirically.
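A sketch of the overall objective in Eqs. (9)-(12) is given below. The second-order gradients are computed with finite differences, mean absolute errors are used for brevity, and the weights correspond to $\lambda_1$-$\lambda_3$ (whose values are not reproduced here).

```python
import torch
import torch.nn.functional as F

def total_loss(inter_lf, final_lf, gt_lf, disparities, w_final, w_inter, w_smooth):
    """Sketch of the training objective in Eqs. (9)-(12)."""
    # Eq. (9): absolute error of the intermediate (coarsely synthesized) LF.
    loss_inter = F.l1_loss(inter_lf, gt_lf)
    # Eq. (11): absolute error of the final reconstructed LF.
    loss_final = F.l1_loss(final_lf, gt_lf)
    # Eq. (10): penalize second-order gradients of each predicted disparity map.
    d = disparities                                   # [Q, 1, H, W]
    dxx = d[..., :, 2:] - 2 * d[..., :, 1:-1] + d[..., :, :-2]
    dyy = d[..., 2:, :] - 2 * d[..., 1:-1, :] + d[..., :-2, :]
    dxy = (d[..., 1:, 1:] - d[..., 1:, :-1]) - (d[..., :-1, 1:] - d[..., :-1, :-1])
    loss_smooth = dxx.abs().mean() + dyy.abs().mean() + dxy.abs().mean()
    return w_final * loss_final + w_inter * loss_inter + w_smooth * loss_smooth
```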

Fig. 4: Visual comparisons of different methods on the synthesized center SAI for the task (fixed models). Selected regions have been zoomed in for better comparison. It is recommended to view this figure by zooming in.
Algorithms learning-based geometry-based flexible input flexible output
Vagharshakyan et al. [46] - - -
Wu et al. [54] - -
Wu et al. [53] -
Kalantari et al. [19]
Yeung et al. [58] - - -
Ours
TABLE I: Comparison of attributes for densely-sampled LF reconstruction algorithms, where flexible input means whether the method is feasible for an arbitrary sampling pattern, and flexible output means whether the method can produce densely-sampled LFs with flexible angular resolution.
Test set Vagharshakyan et al. [46] Wu et al. [54] Wu et al. [53] Kalantari et al. [19] Yeung et al. [58] Ours (fixed)
HCI 26.98/0.734 26.64/0.744 31.84/0.898 32.85/0.909 32.30/0.900 37.14/0.966
HCI old 32.47/0.853 31.43/0.850 37.61/0.942 38.58/0.944 39.69/0.941 41.80/0.974
30scenes 34.17/0.907 33.66/0.918 39.17/0.975 41.40/0.982 42.77/0.986 42.75/0.986
Occlusions 32.64/0.923 32.72/0.924 34.41/0.955 37.25/0.972 38.88/0.980 38.51/0.979
Reflective 35.34/0.935 34.76/0.930 36.38/0.944 38.09/0.953 38.33/0.960 38.35/0.957
TABLE II: Quantitative comparisons (PSNR/SSIM) of the proposed approach with state-of-the-art methods under the fixed sampling pattern, where the input sparsely-sampled LFs are sampled at the four corners during both training and testing.

3.6 The Optimized Sampling Pattern

Optimizing the sampling pattern for densely-sampled LF reconstruction is a valuable topic, as it can further realize the full potential of the reconstruction algorithm and improve reconstruction quality with as few hardware resources as possible. Additionally, optimizing the sampling pattern is greatly beneficial to LF compression (see more details in Sec. 5). In this section, we first investigate, both qualitatively and experimentally, how the sampling pattern affects the reconstruction, and then propose a simple yet effective method for seeking the optimal sampling pattern tailored to our reconstruction model.

Intuitively, the reconstruction quality is influenced by how thoroughly the scene content has been recorded by the sparsely-sampled input. Since most foreground objects can be captured completely from different viewpoints, the occluded regions are the critical challenge. Several factors determine how many occluded regions are recorded. One factor is the overall distance between the novel SAIs and the sampled SAIs: nearby SAIs provide more references for novel SAI reconstruction than those far away. Additionally, sampling patterns whose SAIs are distributed at more diverse locations along the horizontal and vertical directions are better than their counterparts with less variation, as the former see more occluded regions. Finally, this issue is also related to the scene content; factors such as the geometric complexity between objects can play an important role.

We experimentally investigated the effect of the sampling pattern on reconstruction quality. First, we define a metric, namely the minimum distance, which is the average of the angular Euclidean distances of all novel SAIs to their nearest input SAI on the 2-D sampling grid. We then randomly selected sampling patterns for the three dense reconstruction tasks and fit the relation between their minimum distance and their reconstruction quality with a second-degree polynomial. Fig. 2 illustrates the results, where we can see that, in general, the reconstruction quality decreases as the minimum distance of the sampling pattern increases. Moreover, the sampling patterns corresponding to the green dots are illustrated in Fig. 3. It can be seen that patterns with smaller variation along the horizontal or vertical direction always stay below the fitted curve, which indicates that this divergence is indeed a factor influencing the reconstruction quality.
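The minimum-distance metric can be computed as in the following sketch, where coordinates are assumed to be integer positions on the reconstruction grid. For example, minimum_distance([(0, 0), (0, 6), (6, 0), (6, 6)], (7, 7)) would return the metric for a four-corner pattern on a hypothetical 7x7 grid.

```python
import numpy as np

def minimum_distance(sample_coords, grid_size):
    """Average angular distance from every novel SAI to its nearest input SAI,
    i.e., the "minimum distance" metric used above (a sketch)."""
    u, v = np.meshgrid(np.arange(grid_size[0]), np.arange(grid_size[1]), indexing="ij")
    all_coords = np.stack([u.ravel(), v.ravel()], axis=1)          # every grid position
    samples = np.asarray(sample_coords, dtype=float)
    # Keep only the novel (unsampled) positions.
    is_sample = (all_coords[:, None, :] == samples[None, :, :]).all(-1).any(-1)
    novel = all_coords[~is_sample].astype(float)
    # Distance of each novel SAI to its nearest sample, averaged over all novel SAIs.
    dists = np.linalg.norm(novel[:, None, :] - samples[None, :, :], axis=-1)
    return dists.min(axis=1).mean()
```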

Based on the above observations, we propose a simple yet effective strategy for optimizing the sampling pattern, which is formulated as:

$$\min_{\mathcal{A}_s} \; \sum_{\mathbf{u}_j \in \mathcal{A}_n} \sum_{\mathbf{u}_k \in \mathcal{A}_s} M_{kj} \, \| \mathbf{u}_k - \mathbf{u}_j \|_2, \tag{13}$$

where $M_{kj}$ is the $(k, j)$-th entry of the indicator matrix $\mathbf{M}$, which indicates whether the $k$-th sampled SAI is the nearest one among all samples to the $j$-th novel SAI. We first find a solution of the optimization problem in Eq. (13) using a deterministic-annealing-based method [15, 26]. As the solution varies with the initialization, we select the one producing the minimum objective value after repeating the algorithm with random initialization 5 times. In addition, as the resulting optimal positions may not be located on the grid, we consider the divergence along both the horizontal and vertical directions when rounding the solutions. In this way, we selected the optimal sampling patterns depicted in Figs. 3(f), 3(l) and 3(r). As demonstrated in Fig. 2, the reconstruction quality under the sampling patterns selected by our algorithm is the highest among all tested patterns, which indicates the effectiveness of our algorithm for optimizing the sampling pattern.
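As a simple illustration of Eq. (13), the sketch below exhaustively searches the on-grid sampling patterns that minimize the minimum-distance metric defined above. It is a brute-force stand-in for the deterministic-annealing solver used in the paper, omits the horizontal/vertical divergence tie-breaking, and is only feasible for small grids and few samples.

```python
from itertools import combinations, product

def best_pattern_exhaustive(grid_size, num_samples):
    """Brute-force stand-in for the solver of Eq. (13): pick the set of on-grid
    sampling positions minimizing the minimum-distance metric (illustration only)."""
    positions = list(product(range(grid_size[0]), range(grid_size[1])))
    best, best_score = None, float("inf")
    for pattern in combinations(positions, num_samples):
        score = minimum_distance(pattern, grid_size)   # metric defined above
        if score < best_score:
            best, best_score = pattern, score
    return best, best_score
```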

4 Experimental Results

4.1 Datasets and Implementation Details

Both synthetic LF images from the 4-D light field benchmarks [13], [51] and real-world LF images captured with a Lytro Illum camera, provided by the Stanford Lytro LF Archive [41] and Kalantari et al. [19], were employed for training and testing. Specifically, 20 synthetic images and 100 real-world images were used for training. For testing, we used 9 synthetic LF images, including 4 from the HCI dataset [13] and 5 from the old HCI dataset [51], together with 3 real-world datasets containing 70 LF images in total captured with a Lytro Illum camera, namely 30scenes [19], Occlusions [41] and Reflective [41]. These datasets cover several important factors in evaluating light field reconstruction methods. Specifically, the synthetic datasets contain high-resolution textures for measuring the ability to maintain high-frequency details, while the real-world datasets evaluate performance under natural illumination and practical camera distortion. Moreover, the HCI dataset contains LF images with large baselines, which measure the ability to handle very sparse sampling, and the Occlusions and Reflective datasets focus on challenging scenes in which the photo-consistency assumption does not hold.

During training, patches of fixed spatial size were randomly cropped, and the batch size was set to 1 due to the limited computational memory. We adopted the ADAM optimizer [22]. The learning rate was reduced by half whenever the loss stopped decreasing. The spatial resolution of the model output was kept unchanged by zero padding. We implemented the model in PyTorch, and the code will be made publicly available.

Test set Kalantari et al. [19] Ours Kalantari et al. [19] Ours Kalantari et al. [19] Ours
HCI 32.22/0.908 36.54/0.961 33.98/0.929 37.70/0.966 34.28/0.933 38.68/0.971
HCI old 37.47/0.941 41.13/0.976 38.43/0.951 41.80/0.979 38.87/0.954 43.06/0.984
30scenes 40.06/0.978 41.18/0.982 40.72/0.981 42.02/0.985 40.91/0.982 42.83/0.986
Occlusions 35.17/0.962 36.45/0.970 36.90/ 0.971 38.45/0.977 36.88/0.971 39.57/0.981
Reflective 36.38/0.941 37.05/0.946 38.60/0.957 39.41/0.960 38.64/0.956 40.15/0.961
TABLE III: Quantitative comparisons of the proposed approach with Kalantari et al. [19] on the reconstruction with arbitrary sampling patterns using 4 input SAIs. Sampling patterns (a), (c) and (f) (depicted in Fig. 3) are used for comparison.
Test set Kalantari et al. [19] Ours Kalantari et al. [19] Ours Kalantari et al. [19] Ours
HCI 31.02/0.883 36.38/0.960 33.23/0.918 37.99/0.967 33.49/0.922 38.43/0.970
HCI old 36.33/0.927 41.22/0.976 38.02/0.947 42.48/0.981 38.49/0.949 43.09/0.983
30scenes 38.95/0.973 40.65/0.981 40.56/0.980 41.86/0.984 40.86/0.981 42.57/0.986
Occlusions 34.05/0.951 35.80/0.967 36.14/0.967 38.00/0.976 36.63/0.970 39.12/0.980
Reflective 35.49/0.936 36.43/0.948 38.30/0.951 39.41/0.958 38.77/0.954 40.00/0.961
TABLE IV: Quantitative comparisons of the proposed approach with Kalantari et al. [19] on the reconstruction with arbitrary sampling patterns using 3 input SAIs. Sampling patterns (g), (j) and (l) (depicted in Fig. 3) are used for comparison.
Test set Kalantari et al. [19] Ours Kalantari et al. [19] Ours Kalantari et al. [19] Ours
HCI 30.69/0.877 33.93/0.946 31.65/0.897 35.27/0.957 32.50/0.906 37.02/0.963
HCI old 36.05/0.927 40.44/0.967 36.27/0.933 39.88/0.961 36.46/0.939 41.30/0.977
30scenes 37.42/0.964 40.05/0.979 38.83/0.974 40.79/0.981 38.54/0.973 40.98/0.982
Occlusions 32.95/0.936 35.11/0.960 34.88/0.958 36.69/0.970 34.83/0.958 37.08/0.971
Reflective 34.88/0.929 36.53/0.944 36.15/0.945 38.35/0.956 36.82/0.950 38.45/0.956
TABLE V: Quantitative comparisons of the proposed approach with Kalantari et al. [19] on the reconstruction with arbitrary sampling patterns using 2 input SAIs. Sampling patterns (m), (p) and (r) (depicted in Fig. 3) are used for comparison.
Fig. 5: Visual comparisons of different methods on the synthesized center SAI for the task (flexible models). Selected regions have been zoomed in for better comparison. It is recommended to view this figure by zooming in.

4.2 Comparisons with State-of-the-Art Methods

Besides our preliminary work, Yeung et al. [58], we also compared with 4 state-of-the-art learning-based methods specifically designed for densely-sampled LF reconstruction, i.e., Vagharshakyan et al. [46], Wu et al. [54], Wu et al. [53] and Kalantari et al. [19]. Note that the methods with released training code, i.e., Wu et al. [53], Kalantari et al. [19] and Yeung et al. [58], were retrained with the same training data for fair comparison; for the method without released training code, i.e., Wu et al. [54], we used the trained model provided by the authors. The retrained models achieve performance comparable to that of the models provided by the authors. Table I compares these algorithms in terms of whether they are learning-based, whether they are geometry-based, whether they accept flexible (arbitrary) input patterns, and whether they can produce reconstructions with flexible angular resolution. We conducted the following experiments for comparison:

  • since four of the five methods under comparison, i.e., Vagharshakyan et al. [46], Wu et al. [54], Wu et al. [53] and Yeung et al. [58], cannot handle inputs with flexible and irregular sampling patterns, we first designed an experiment in which the same fixed sampling pattern was used during both training and testing, so that all compared methods could be evaluated. We name our method under this training setting Ours (fixed). See subsection 1);

  • since both Ours and Kalantari et al. [19] can accept flexible and irregular sampling patterns, we designed experiments in which sparsely-sampled LFs containing SAIs at arbitrary positions and with arbitrary structures were fed into the network during training, and several of the patterns illustrated in Fig. 3 were used during testing. Here we considered three cases, with 4, 3 and 2 input SAIs, respectively. See subsection 2); and

  • we compared the ability of different methods to preserve the LF parallax structure, both quantitatively and qualitatively, and also evaluated the running time of the different methods. See subsection 3).

Comparisons on the reconstruction with fixed input sampling patterns.

This comparison was performed on the task of reconstructing a densely-sampled LF from a sparsely-sampled LF with regularly distributed SAIs, where the input SAIs are located at the four corners of the densely-sampled LF to be reconstructed. We used the average PSNR and SSIM over all synthesized novel SAIs to quantitatively measure the quality of the reconstructed densely-sampled LFs, and the corresponding results are listed in Table II, where it can be observed that:

  • EPI-based methods, including Vagharshakyan et al. [46], Wu et al. [54] and Wu et al. [53], are inferior to the others. A possible reason is that only 2 rows or columns of pixels are available during the reconstruction of each EPI, making it difficult to recover the intermediate linear structures without modeling the 2-D spatial structure, especially for complicated scenes. Among them, Wu et al. [53] performs relatively better, as depth information is utilized as guidance;

  • Kalantari et al. [19] achieves good results on the real-world datasets, which indicates the effectiveness of geometry-based warping. However, it fails on the HCI dataset with larger baselines. The reason is that Kalantari et al. [19] uses hand-crafted features to estimate the disparity and simple convolutional layers to combine the warped images, which makes it difficult to build long-distance connections between SAIs with large baselines;

  • Yeung et al. [58] achieves the best results on the real-world datasets, indicating that the pseudo 4-D filters effectively explore the spatial and angular relations between the input SAIs. However, this method also does not work well on the HCI dataset, because it relies entirely on deep regression for novel view synthesis, which highlights the importance of explicit geometric modeling for reconstruction from large-baseline sampling; and

  • our approach achieves the highest PSNR/SSIM on the HCI and HCI old datasets, and performance comparable to Yeung et al. [58] on the 30scenes, Occlusions and Reflective datasets, demonstrating the advantages of the proposed framework.

We also visually compared the reconstruction results of the different algorithms, as shown in Fig. 4. It can be observed that Wu et al. [54] and Wu et al. [53] fail to recover delicate structures, such as the leaves and the textures on the wall, while Kalantari et al. [19] and Yeung et al. [58] struggle with large disparities. By contrast, our approach produces accurate estimates that are closer to the ground truth.

Algorithms Vagharshakyan et al. [46] Wu et al. [54] Wu et al. [53] Kalantari et al. [19] Yeung et al. [58] Ours
HCI 924.52 257.70 101.70 168.86 0.85 40.21
TABLE VI: Comparisons of the running time (in seconds) of different methods for reconstructing a densely-sampled LF.

Comparisons on the reconstruction with flexible input sampling patterns.

We compared our approach with Kalantari et al. [19] using randomly positioned inputs. During training, the input SAIs were selected at random positions, and the input patterns illustrated in Fig. 3 were used for testing. The quantitative results of the three tasks are reported in Tables III, IV and V, respectively. It can be observed that our method improves the PSNR by around 4 dB on the synthetic datasets and by around 0.4-1 dB on the real-world datasets.

To visually compare the outputs of Kalantari et al. [19] with those of our method, we show the error maps of the reconstructed center SAI in Fig. 5. The results further demonstrate the advantages of our proposed approach. As shown in the results on synthetic data in Fig. 5 (first row), basic textures are severely blurred or distorted in the SAIs reconstructed by Kalantari et al. [19] when the sampling baselines are large, while our method reconstructs most of the high-frequency details. For real-world LF reconstruction in Fig. 5 (second row), Kalantari et al. [19] produces artifacts near the boundaries of foreground objects, while fine edges and small objects are well preserved in the results of our method.

Fig. 6: Quantitative comparison of the LF parallax structure via the parallax content PR curves of different methods.
Fig. 7: Visual comparisons of the intermediate disparity maps estimated by directly applying convolutional layers to the input SAIs, Kalantari et al. [19] and our network.
Fig. 8: Demonstration of the effectiveness of our blending strategy. The estimated disparity map, the zoom-in of the images warped from the input SAIs, the learned confidence maps and the blended images are presented.

Comparisons of the LF parallax structure.

The most valuable information in LF images is the LF parallax structure, which implicitly represents the scene geometry. We compared the LF parallax structure of the densely-sampled LFs reconstructed by the different algorithms. In Figs. 4 and 5, the EPIs of the reconstructed LF images are compared. It can be seen that the EPIs produced by our method preserve clearer linear structures and are closer to the ground truth.

We also quantitatively evaluated the LF parallax structure by using the LF parallax edge precision-recall (PR) curves [3]. Fig. 6 shows the comparisons on PR curves of the densely-sampled LF reconstructed from different algorithms with fixed and flexible sampling. It can be observed that the PR curves of our method are closer to the top right corner than others, indicating that our method preserves the LF parallax structure best.

Comparisons of running time.

We compared the running time (in seconds) of different methods for reconstructing a densely-sampled LF; Table VI lists the results. All methods were tested on a desktop with an Intel i7-8700 CPU @ 3.70GHz, 32 GB RAM and an NVIDIA GeForce RTX 2080 Ti. From Table VI, it can be observed that our approach is much faster than the other methods (except our preliminary work, Yeung et al. [58]), taking only about 0.8 seconds to generate a novel SAI. Although Yeung et al. [58] is the fastest, considering its limitations in accuracy and flexibility, our approach still holds an advantage.

Test set without refinement with refinement without refinement with refinement
HCI 35.60/0.954 36.54/0.961 37.33/0.965 38.68/0.971
30scenes 40.12/0.979 41.18/0.982 41.57/0.983 42.83/0.986
HCI 35.39/0.953 36.38/0.960 37.15/0.963 38.43/0.970
30scenes 39.77/0.977 40.65/0.981 41.49/0.983 42.57/0.986
TABLE VII: Effectiveness verification of the refinement module in our approach. We compare the reconstruction quality of the LF images generated by our method without the refinement module against those generated with all modules, under two reconstruction tasks, over the HCI and 30scenes datasets.

4.3 Ablation study

In this section, we experimentally validate the effectiveness of three components of our network, including the disparity estimation module, the blending strategy and the refinement module.

The effectiveness of the disparity estimation module.

In our approach, the disparity maps are estimated from PSVs, which are fed into the subsequent network. Alternative designs include applying convolutional layers directly to the input SAIs, or extracting hand-crafted features from the PSVs as the network input [19]. To validate the advantages of our disparity estimation module, we visually compared the by-product disparity maps estimated in these three manners. As shown in Fig. 7, our method produces disparity maps with far fewer errors in both the background and around occlusion boundaries.

The effectiveness of the blending strategy.

The blending strategy in our approach is designed to address occlusion issues during the fusion of the images warped from different input SAIs. To validate its effectiveness, the intermediate results before and after blending are visualized in Fig. 8. It can be observed that the errors around occlusion boundaries in the images warped from different source SAIs are closely related to the locations of the source SAIs and appear at different positions. The learned confidence maps are able to indicate these error areas in each warped image and provide guidance for the fusion of the warped images. After blending according to the learned confidence maps, these errors are removed, while the correct regions of each warped image are maintained.

The effectiveness of the refinement module.

To demonstrate the effectiveness of the refinement module, we quantitatively compared the quality of the LF images generated by our method without the refinement module and with all modules; Table VII lists the results. It can be seen that the refinement provides around 1 dB of PSNR improvement, which indicates that the refinement module successfully takes advantage of the complementary information between the synthesized SAIs and improves the intermediate LF images. Moreover, Fig. 6 shows the comparisons of the parallax content PR curves, which demonstrate that the refinement helps recover the LF parallax structure of the reconstructed densely-sampled LFs.

Fig. 9: Visual comparisons of LF reconstruction with flexible output angular resolution. We present the results of reconstruction from 4 corner SAIs for two different sampling grids (top and bottom). The center SAIs of the LF images reconstructed by different algorithms are presented. Horizontal and vertical EPIs corresponding to the colored lines are shown below the center SAI, and regions with obvious artifacts or blurring are highlighted with yellow boxes. It is recommended to view this figure by zooming in.
Fig. 10: Visual comparisons of the depth estimation results. The center SAIs of the LF images, the depth maps estimated from the ground truth densely-sampled LFs, the sparsely-sampled LFs, the reconstructed densely-sampled LFs by different algorithms are presented from left to right. It is recommended to view this figure by zooming in.

5 Applications

In this section, we will discuss three applications, which will benefit from our accurate, flexible and efficient method for the reconstruction of densely-sampled LFs.

5.1 Image-based rendering (IBR)

IBR aims at generating novel views from a set of captured images; a comprehensive review of IBR can be found in [61]. Among IBR techniques, LF rendering is attractive because novel views can be generated by straightforward interpolation without any geometric information, so that real-time rendering can be achieved. To produce novel views without ghosting artifacts, LF rendering requires the LF to be densely sampled, such that the disparities between neighboring views are less than 1 pixel [1]. Therefore, for a sparsely-sampled LF that does not meet this sampling requirement, our method can reconstruct a densely-sampled LF with the desired angular resolution to enable subsequent LF rendering. More generally, as our method is capable of generating novel views at arbitrary viewpoints from a set of sparsely-sampled SAIs, it can realize IBR directly.

To validate the effectiveness of our approach for the IBR application, we compared dense reconstruction under different sampling baselines and output angular resolutions. Specifically, we compared the performance of different algorithms when reconstructing densely-sampled LFs from 4 corner SAIs sampled on grids of two different sizes on the HCI dataset. As ground-truth images are unavailable, we visually compared the center SAIs of the reconstructed LF images. Moreover, to compare the ability to preserve the LF parallax structure, horizontal and vertical EPIs are presented. Fig. 9 shows the results, from which it can be observed that our method produces novel SAIs with sharp textures and EPIs with clear linear structures, even when the input sampling baselines are extremely large.

5.2 Depth estimation enhancement

The value of an LF image lies in the implicitly encoded scene geometry information. By finding correspondences between different SAIs, depth maps can be estimated from LF images. A densely-sampled LF leads to more accurate and more robust depth inference, as matching points can be detected more easily and occlusion problems can be alleviated by the multiple viewpoints. Therefore, the proposed method can be used to enhance LF depth estimation.

Here, we present the depth maps estimated from sparsely-sampled LF images as well as those estimated from the densely-sampled LF images reconstructed by different algorithms. The state-of-the-art depth estimation algorithm [4] was applied, and Fig. 10 shows the results. It can be observed that the reconstructed densely-sampled LFs enable better depth estimation than the sparsely-sampled ones, and the depth maps obtained from our method are more accurate than those from the others, especially in regions containing detailed objects and occluded boundaries. Additionally, the high accuracy of the estimated depth maps further validates the advantage of our method in preserving the LF parallax structure.

5.3 Light field compression

Although an LF image contains much richer scene/object information than a traditional 2-D image, its huge data size poses great challenges to both storage and transmission. For example, the commercial LF camera Lytro Illum has a sensor of approximately 40 million pixels, and the resulting LF image is around 110 MB. Thus, the compression of LF data is becoming an urgent task, which is attracting attention from both academia and industry [16, 2, 25, 7]. In particular, Hou et al. [16] proposed to partition the LF image into key SAIs and non-key SAIs, and to use an LF reconstruction method to synthesize the non-key SAIs from the key SAIs as prediction; only the key SAIs and the residuals of the non-key SAIs are encoded with a typical video encoder. This framework yields the current state-of-the-art performance. However, its compression performance highly depends on the adopted LF reconstruction method.

Our framework has great potential to contribute to LF data compression from two perspectives: (1) a proper selection of key SAIs can improve the reconstruction quality of the non-key SAIs for the same number of key SAIs, and thus decrease the bits needed to encode the residuals of the non-key SAIs under the same bit budget for the key SAIs. Our framework, which adapts to flexible inputs, can naturally address this issue by optimizing the combination of key SAIs without re-training a model for each combination; and (2) according to our quantitative experiments, 3 key SAIs under the optimized sampling pattern can achieve performance comparable to 4 key SAIs (see the last columns of Tables III, IV and V). Therefore, one can select only 3 key SAIs, so that bits for encoding the key SAIs are saved, leading to higher compression performance under the same bit budget for the residuals of the non-key SAIs.

6 Conclusion

We have presented a novel learning-based algorithm for the reconstruction of densely-sampled LFs from sparsely-sampled ones. Owing to the deep, effective and comprehensive modeling of the unique LF parallax structure, including geometry-based SAI synthesis using position-aware PSVs, an adaptive blending strategy and an efficient LF refinement network, our method overcomes the obstacles of arbitrary sampling patterns and large-baseline sampling, achieving over 4 dB improvement on synthetic data and around 1 dB improvement on real-world data, while preserving the valuable LF parallax structure. Furthermore, we proposed a simple yet effective algorithm for determining the optimal sampling pattern tailored to our method. Last but not least, the potential of our method for improving subsequent LF-based applications has been validated and discussed.

References

  • [1] J. Chai, X. Tong, S. Chan, and H. Shum (2000) Plenoptic sampling. In SIGGRAPH, pp. 307–318. Cited by: §1, §5.1.
  • [2] J. Chen, J. Hou, and L. Chau (2018) Light field compression with disparity-guided sparse coding based on structural key views. IEEE Transactions on Image Processing 27 (1), pp. 314–324. Cited by: §5.3.
  • [3] J. Chen, J. Hou, and L. Chau (2018) Light field denoising via anisotropic parallax analysis in a cnn framework. IEEE Signal Processing Letters 25 (9), pp. 1403–1407. Cited by: §4.2.
  • [4] J. Chen, J. Hou, Y. Ni, and L. Chau (2018) Accurate light field depth estimation with superpixel regularization over partially occluded regions. IEEE Transactions on Image Processing 27 (10), pp. 4889–4900. Cited by: §1, §2.2, §5.2.
  • [5] S. E. Chen and L. Williams (1993) View interpolation for image synthesis. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’93), pp. 279–288. Cited by: §2.1.
  • [6] R. T. Collins (1996) A space-sweep approach to true multi-image matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 358–363. Cited by: §3.3.
  • [7] C. Conti, P. Nunes, and L. D. Soares (2016) HEVC-based light field image coding with bi-predicted self-similarity compensation. In 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–4. Cited by: §5.3.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: §2.2.
  • [9] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307. Cited by: §2.2.
  • [10] J. Fiss, B. Curless, and R. Szeliski (2014) Refocusing plenoptic images using depth-adaptive splatting. In IEEE International Conference on Computational Photography (ICCP), pp. 1–9. Cited by: §1.
  • [11] J. Flynn, I. Neulander, J. Philbin, and N. Snavely (2016) DeepStereo: learning to predict new views from the world’s imagery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [12] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen (1996) The lumigraph. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 43–54. Cited by: §1.
  • [13] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke (2016) A dataset and evaluation methodology for depth estimation on 4d light fields. In Asian Conference on Computer Vision (ACCV), pp. 19–34. Cited by: §4.1.
  • [14] M. Hosseini Kamal, B. Heshmat, R. Raskar, P. Vandergheynst, and G. Wetzstein (2016) Tensor low-rank and sparse light field photography. Computer Vision and Image Understanding 145 (C), pp. 172–181. Cited by: §2.2.
  • [15] J. Hou, L. Chau, N. Magnenat-Thalmann, and Y. He (2015) Human motion capture data tailored transform coding. IEEE Transactions on Visualization and Computer Graphics 21 (7), pp. 848–859. Cited by: §3.6.
  • [16] J. Hou, J. Chen, and L. Chau (2018) Light field image compression based on bi-level view compensation with rate-distortion optimization. IEEE Transactions on Circuits and Systems for Video Technology 29 (2), pp. 517–530. Cited by: §5.3.
  • [17] F. Huang, K. Chen, and G. Wetzstein (2015) The light field stereoscope: immersive computer graphics via factored near-eye light field displays with focus cues. ACM Transactions on Graphics 34 (4), pp. 60:1–60:12. Cited by: §1.
  • [18] H. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. Tai, and I. So Kweon (2015) Accurate depth map estimation from a lenslet light field camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1547–1555. Cited by: §2.2.
  • [19] N. K. Kalantari, T. Wang, and R. Ramamoorthi (2016) Learning-based view synthesis for light field cameras. ACM Transactions on Graphics 35 (6), pp. 193:1–193:10. Cited by: §1, §2.2, §3.3, §3.3, TABLE I, TABLE II, Fig. 7, 2nd item, 2nd item, §4.1, §4.2, §4.2, §4.2, §4.2, §4.3, TABLE III, TABLE IV, TABLE V, TABLE VI, footnote 1.
  • [20] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross (2013) Scene reconstruction from high spatio-angular resolution light fields. ACM Transactions on Graphics 32 (4), pp. 73:1–73:12. Cited by: §1.
  • [21] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654. Cited by: §2.2.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [23] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep Laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 624–632. Cited by: §2.2.
  • [24] M. Levoy and P. Hanrahan (1996) Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 31–42. Cited by: §1.
  • [25] Y. Li, M. Sjöström, R. Olsson, and U. Jennehag (2016) Coding of focused plenoptic contents by displacement intra prediction. IEEE Transactions on Circuits and Systems for Video Technology 26 (7), pp. 1308–1319. Cited by: §5.3.
  • [26] S. Lloyd (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–137. Cited by: §3.6.
  • [27] Lytro Illum. Note: https://www.lytro.com/ [Online] Cited by: §1.
  • [28] K. Marwah, G. Wetzstein, Y. Bando, and R. Raskar (2013) Compressive light field photography using overcomplete dictionaries and optimized projections. ACM Transactions on Graphics 32 (4), pp. 46:1–46:12. Cited by: §1, §2.2.
  • [29] M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §3.3.
  • [30] L. McMillan and G. Bishop (1995) Plenoptic modeling: an image-based rendering system. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’95), pp. 39–46. Cited by: §2.1.
  • [31] K. Mitra and A. Veeraraghavan (2012) Light field denoising, light field superresolution and stereo camera based refocusing using a GMM light field patch prior. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 22–28. Cited by: §1, §2.2.
  • [32] S. Niklaus, L. Mai, and F. Liu (2017) Video frame interpolation via adaptive separable convolution. In IEEE International Conference on Computer Vision (ICCV), pp. 261–270. Cited by: §3.4.
  • [33] R. S. Overbeck, D. Erickson, D. Evangelakos, M. Pharr, and P. Debevec (2018) A system for acquiring, processing, and rendering panoramic light field stills for virtual reality. ACM Transactions on Graphics 37 (6), pp. 197:1–197:15. Cited by: §1.
  • [34] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg (2017) Transformation-grounded image generation network for novel 3d view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3500–3509. Cited by: §2.1.
  • [35] E. Penner and L. Zhang (2017) Soft 3d reconstruction for view synthesis. ACM Transactions on Graphics 36 (6), pp. 235:1–235:11. Cited by: §2.1.
  • [36] Raytrix. Note: https://www.raytrix.de/ [Online] Cited by: §1.
  • [37] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua (2013) Learning separable filters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2754–2761. Cited by: §3.4.
  • [38] L. Shi, H. Hassanieh, A. Davis, D. Katabi, and F. Durand (2014) Light field reconstruction using sparsity in the continuous Fourier domain. ACM Transactions on Graphics 34 (1), pp. 12:1–12:13. Cited by: §1, §2.2.
  • [39] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1874–1883. Cited by: §2.2.
  • [40] P. P. Srinivasan, T. Wang, A. Sreelal, R. Ramamoorthi, and R. Ng (2017) Learning to synthesize a 4d rgbd light field from a single image. In IEEE International Conference on Computer Vision (ICCV), pp. 2243–2251. Cited by: §2.2.
  • [41] R. Sunder, M. Lowney, R. Shah, and G. Wetzstein. Stanford Lytro light field archive. Note: http://lightfields.stanford.edu/LF2016.html [Online] Cited by: §4.1.
  • [42] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2016) Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision (ECCV), pp. 322–337. Cited by: §2.1.
  • [43] The (New) Stanford Light Field Archive. Note: http://lightfield.stanford.edu/acq.html [Online] Cited by: §1.
  • [44] S. Tulsiani, R. Tucker, and N. Snavely (2018) Layer-structured 3d scene inference via view synthesis. In European Conference on Computer Vision (ECCV), pp. 302–317. Cited by: §2.1.
  • [45] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik (2017) Multi-view supervision for single-view reconstruction via differentiable ray consistency. In IEEE conference on computer vision and pattern recognition (CVPR), pp. 2626–2634. Cited by: §2.1.
  • [46] S. Vagharshakyan, R. Bregovic, and A. Gotchev (2018) Light field reconstruction using shearlet transform. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (1), pp. 133–147. Cited by: §1, §2.2, TABLE I, TABLE II, 1st item, 1st item, §4.2, TABLE VI.
  • [47] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki (2017) Sfm-net: learning of structure and motion from video. arXiv preprint arXiv:1704.07804. Cited by: §3.5.
  • [48] T. Wang, A. A. Efros, and R. Ramamoorthi (2015) Occlusion-aware depth estimation using light-field cameras. In IEEE International Conference on Computer Vision (ICCV), pp. 3487–3495. Cited by: §1, §2.2.
  • [49] T. Wang, J. Zhu, E. Hiroaki, M. Chandraker, A. A. Efros, and R. Ramamoorthi (2016) A 4d light-field dataset and cnn architectures for material recognition. In European Conference on Computer Vision (ECCV), pp. 121–138. Cited by: §3.4.
  • [50] S. Wanner and B. Goldluecke (2014) Variational light field analysis for disparity estimation and super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 606–619. Cited by: §1, §2.2.
  • [51] S. Wanner, S. Meister, and B. Goldluecke (2013) Datasets and benchmarks for densely sampled 4d light fields. In VMV, pp. 225–226. Cited by: §4.1.
  • [52] B. Wilburn, N. Joshi, V. Vaish, E. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy (2005) High performance imaging using large camera arrays. ACM Transactions on Graphics 24 (3), pp. 765–776. Cited by: §1.
  • [53] G. Wu, Y. Liu, Q. Dai, and T. Chai (2019) Learning sheared epi structure for light field reconstruction. IEEE Transactions on Image Processing 28 (7), pp. 3261–3273. Cited by: §1, §2.2, TABLE I, TABLE II, 1st item, 1st item, §4.2, §4.2, TABLE VI, footnote 1.
  • [54] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai (2019) Light field reconstruction using convolutional network on epi and extended applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7), pp. 1681–1694. Cited by: §1, §2.2, TABLE I, TABLE II, 1st item, 1st item, §4.2, §4.2, TABLE VI, footnote 1.
  • [55] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, and Y. Liu (2017) Light field reconstruction using deep convolutional network on epi. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1638–1646. Cited by: §2.2.
  • [56] L. Yan, S. U. Mehta, R. Ramamoorthi, and F. Durand (2015) Fast 4d sheared filtering for interactive rendering of distribution effects. ACM Transactions on Graphics 35 (1), pp. 7:1–7:13. Cited by: §3.4.
  • [57] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung (2019) Light field spatial super-resolution using deep efficient spatial-angular separable convolution. IEEE Transactions on Image Processing 28 (5), pp. 2319–2330. Cited by: §3.4.
  • [58] H. W. F. Yeung, J. Hou, J. Chen, Y. Y. Chung, and X. Chen (2018) Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues. In European Conference on Computer Vision (ECCV), pp. 137–152. Cited by: §1, §2.2, TABLE I, TABLE II, 1st item, 3rd item, 4th item, §4.2, §4.2, §4.2, TABLE VI, footnote 1.
  • [59] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. So Kweon (2015) Learning a deep convolutional network for light-field image super-resolution. In IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 24–32. Cited by: §1, §2.2.
  • [60] J. Yu (2017) A light-field journey to virtual reality. IEEE MultiMedia 24 (2), pp. 104–112. Cited by: §1.
  • [61] C. Zhang and T. Chen (2004) A survey on image-based rendering—representation, sampling and compression. Signal Processing: Image Communication 19 (1), pp. 1–28. Cited by: §5.1.
  • [62] F. Zhang, J. Wang, E. Shechtman, Z. Zhou, J. Shi, and S. Hu (2017) PlenoPatch: patch-based plenoptic image manipulation. IEEE Transactions on Visualization and Computer Graphics 23 (5), pp. 1561–1573. Cited by: §2.2.
  • [63] Z. Zhang, Y. Liu, and Q. Dai (2015) Light field from micro-baseline image pair. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3800–3809. Cited by: §1, §2.2.
  • [64] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely (2018) Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics 37 (4), pp. 65:1–65:12. Cited by: §2.1.
  • [65] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros (2016) View synthesis by appearance flow. In European conference on computer vision (ECCV), pp. 286–301. Cited by: §2.1.