Light Field Image Super-Resolution Using Deformable Convolution

07/07/2020
by   Yingqian Wang, et al.

Light field (LF) cameras can record scenes from multiple perspectives, and thus introduce beneficial angular information for image super-resolution (SR). However, it is challenging to incorporate angular information due to disparities among LF images. In this paper, we propose a deformable convolution network (i.e., LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) for feature-level alignment. Based on ADAM, we further propose a collect-and-distribute approach to perform bidirectional alignment between the center-view feature and each side-view feature. Using our approach, angular information can be well incorporated and encoded into features of each view, which benefits the SR reconstruction of all LF images. Moreover, we develop a baseline-adjustable LF dataset to evaluate SR performance under different disparities. Experiments on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet can generate high-resolution images with more faithful details and achieve state-of-the-art reconstruction accuracy. Besides, our LF-DFnet is more robust to disparity variations, an issue that has not been well addressed in the literature.


I Introduction

Although light field (LF) cameras enable many attractive functions such as post-capture refocusing [20, 77, 65], depth sensing [37, 7, 96, 34, 48, 46], saliency detection [43, 86, 88, 59, 31], and de-occlusion [32, 28, 64], the resolution of a sub-aperture image (SAI) is much lower than the total sensor resolution. This low spatial resolution hinders the development of LF imaging [74]. Since high-resolution (HR) images are required in various LF applications, it is necessary to reconstruct HR images from low-resolution (LR) observations, namely, to perform LF image super-resolution (SR).

To achieve high SR performance, information both within a single view (i.e., spatial information) and among different views (i.e., angular information) is important. Several models have been proposed in early LF image SR methods, such as the variational model [69], the Gaussian mixture model [38], and the PCA analysis model [13]. Although different delicately handcrafted image priors have been investigated in these traditional methods [69, 38, 13, 45, 16, 2], their performance is relatively limited due to their inferiority in spatial information exploitation. In contrast, recent deep learning-based methods [84, 83, 85, 89, 67, 80, 63] enhance spatial information exploitation via cascaded convolutions, and thus achieve improved performance as compared to traditional methods. Yoon et al. [84, 83] proposed the first CNN-based method, LFCNN, for LF image SR. Specifically, SAIs are first super-resolved using SRCNN [11], and then fine-tuned in pairs to incorporate angular information. Similarly, Yuan et al. [85] super-resolved each SAI separately using EDSR [33], and then proposed an EPI-enhancement network to refine the results. Although several recent deep learning-based methods [67, 80, 89, 63] achieve state-of-the-art performance, the disparity issue in LF image SR is still under-investigated [84, 83, 85, 67, 80, 89, 63].

Fig. 1: Results achieved by bicubic interpolation, RCAN [92], LFSSR [80], resLF [89], LF-ATO [24], LF-InterNet [63], and our LF-DFnet for 2× and 4× SR. Here, scene Origami from the HCInew dataset [19] and scene Cards from the STFgantry dataset [54] are used for evaluation. Our method achieves superior visual performance with significant improvements in terms of PSNR and SSIM.

In real-world scenes, objects at different depths have different disparity values in LF images. Existing CNN-based LF image SR methods [84, 83, 85, 67, 80, 89, 63] do not explicitly address the disparity issue. Instead, they use cascaded convolutions to achieve a large receptive field to cover the disparity range. As demonstrated in [56, 57], it is difficult for SR networks to learn the non-linear mapping between LR and HR images under complex motion patterns. Consequently, the misalignment impedes the incorporation of angular information and leads to performance degradation. Therefore, specific mechanisms should be designed to handle the disparity problem in LF image SR.

Inspired by the success of deformable convolution [8, 97] in video SR [51, 60, 55, 76, 81], in this paper, we propose a deformable convolution network (namely, LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) and a collect-and-distribute approach to achieve feature-level alignment and angular information incorporation. In ADAM, all side-view features are first aligned with the center-view feature to achieve feature collection. These collected features are then fused and distributed to their corresponding views by performing alignment with their original features. Through feature collection and distribution, angular information can be incorporated and encoded into each view. Consequently, the SR performance is evenly improved among different views. Moreover, we develop a novel LF dataset named NUDT to evaluate the performance of LF image SR methods under different disparity variations. All scenes in our NUDT dataset are rendered using 3dsMax (https://www.autodesk.eu/products/3ds-max/overview) and the baseline of the virtual camera arrays is adjustable. In summary, the main contributions of this paper are as follows:

  • We propose LF-DFnet, which achieves state-of-the-art LF image SR performance (as shown in Fig. 1) by addressing the disparity problem.

  • We propose an angular deformable alignment module and a collect-and-distribute approach to achieve high-quality reconstruction of each LF image. Compared to [89], our approach avoids repetitive feature extraction and can exploit angular information from all SAIs.

  • We develop a novel NUDT dataset by rendering synthetic scenes with adjustable camera baselines. Experiments on the NUDT dataset have demonstrated the robustness of our method with respect to disparity variations.

The rest of this paper is organized as follows: In Section II, we briefly review the related work. In Section III, we introduce the architecture of our LF-DFnet in detail. In Section IV, we introduce our self-developed dataset. Experimental results are presented in Section V. Finally, we conclude this paper in Section VI.

II Related Work

In this section, we briefly review the major works in single image SR (SISR), LF image SR, and deformable convolution.

II-A Single Image SR

The task of SISR is to generate a clear HR image from its blurry LR counterpart. Since an input LR image can be associated with multiple HR outputs, SISR is a highly ill-posed problem. Recently, several surveys [4, 79, 68] have been published to comprehensively review SISR methods. Here, we only describe several milestone works in the literature.

Since Dong et al. [10, 11] proposed SRCNN, the seminal CNN-based SISR method, deep learning-based methods have dominated this area due to their remarkable performance in terms of both accuracy and efficiency. Since then, various networks have been proposed to continuously improve SISR performance. Kim et al. [26] proposed a very deep SR network (i.e., VDSR) and achieved a significant performance improvement over SRCNN [10, 11]. Lim et al. [33] proposed an enhanced deep SR network (i.e., EDSR). With the combination of local and global residual connections, EDSR [33] won the NTIRE 2017 SISR challenge [52]. Zhang et al. [93, 94] proposed a residual dense network (i.e., RDN), which achieved a further improvement over the state of the art at that time. Subsequently, Zhang et al. [92] proposed a residual channel attention network (i.e., RCAN) by introducing a channel attention module and a residual-in-residual mechanism. More recently, Dai et al. [9] proposed SAN by applying a second-order attention mechanism to SISR. Note that, RCAN [92] and SAN [9] achieve the state-of-the-art SISR performance to date in terms of PSNR and SSIM.

In summary, SISR networks are becoming increasingly deep and complicated, resulting in continuously improved capability in spatial information exploitation. Note that, performing SISR on LF images is a straightforward scheme to achieve LF image SR. However, the angular information is discarded in this scheme, resulting in limited performance.

II-B LF Image SR

In the area of LF image SR, both traditional and deep learning-based methods have been widely studied. For traditional methods, various models have been developed for problem formulation. Wanner et al. [69] proposed a variational method for LF image SR based on estimated depth information. Mitra et al. [38] encoded the LF structure via a Gaussian mixture model to achieve depth estimation, view synthesis, and LF image SR. Farrugia et al. [13] decomposed HR-LR patches into subspaces and proposed a linear subspace projection method for LF image SR. Alain et al. proposed LFBM5D for LF image denoising [1] and LF image SR [2] by extending BM3D filtering [12] to LFs. Rossi et al. [45] developed a graph-based method to achieve LF image SR via graph optimization. Although the LF structure is well encoded by these models [69, 38, 13, 2, 45], the spatial information cannot be fully exploited due to the limited representation capability of these handcrafted image priors.

Recently, deep learning-based SISR methods have been demonstrated to be superior to traditional methods in spatial information exploitation. Inspired by these works, recent LF image SR methods adopt deep CNNs to improve their performance. In the pioneering work LFCNN [84, 83], SAIs were first separately super-resolved via SRCNN [10], and then fine-tuned in pairs to enhance both spatial and angular resolution. Subsequently, Yuan et al. [85] proposed LF-DCNN to improve LFCNN by super-resolving each SAI via a more powerful SISR network, EDSR [33], and fine-tuning the initial results using a specially designed EPI-enhancement network. Apart from these two-stage SR methods, a number of one-stage network architectures have been designed for LF image SR. Wang et al. proposed a bidirectional recurrent network, LFNet [67], by extending BRCN [21] to LFs. Zhang et al. [89] proposed a multi-stream residual network, resLF, which stacks SAIs along different angular directions as inputs to super-resolve the center-view SAI. Yeung et al. [80] proposed LFSSR to alternately shuffle LF features between the SAI pattern and the macro-pixel image pattern for convolution. More recently, Jin et al. [24] proposed an all-to-one LF image SR method (i.e., LF-ATO) and performed structural consistency regularization to preserve the parallax structure among reconstructed views. Wang et al. [63] proposed LF-InterNet, which interacts spatial and angular information for LF image SR. LF-ATO [24] and LF-InterNet [63] are the state-of-the-art LF image SR methods to date and achieve high reconstruction accuracy.

Although the performance is continuously improved by recent networks, the disparity problem has not been well addressed in the literature. Several methods [84, 83, 85, 89] use stacked SAIs as their inputs, so that pixels of the same objects vary in spatial location across views. In LFSSR [80] and LF-InterNet [63], LF features are organized into a macro-pixel image pattern to incorporate angular information. However, pixels can fall into different macro-pixels due to the disparity problem. In summary, due to the lack of a disparity handling mechanism, the performance of these methods degrades when handling scenes with large disparities. Note that, LFNet [67] achieves LF image SR in a video SR framework and implicitly addresses the disparity issue via recurrent networks. Although all angular views can contribute to the final SR performance, the recurrent mechanism in LFNet [67] only takes SAIs from the same row or column as its inputs. Therefore, the angular information in LFs cannot be efficiently used.

II-C Deformable Convolution

The fixed kernel configuration in regular CNNs hinders the exploitation of long-range information. To address this problem, Dai et al. [8] proposed deformable convolution by introducing additional offsets, which can be learned adaptively to make the convolution kernel process features far away from its local neighborhood. Deformable convolutions have been applied to both high-level vision tasks [5, 8, 95, 50] and low-level vision tasks such as video SR [51, 60, 55, 76]. Specifically, Tian et al. [51] proposed a temporally deformable alignment network (i.e., TDAN) by applying deformable convolution to align input video frames without explicit motion estimation. Wang et al. [60] proposed an enhanced deformable video restoration network (i.e., EDVR) by introducing a pyramid, cascading and deformable alignment module to handle large motions between frames. EDVR [60] won the NTIRE 2019 video restoration and enhancement challenges [41]. More recently, deformable convolution has been integrated with non-local operations [55], convolutional LSTMs [76], and 3D convolutions [81] to further enhance video SR performance.

In summary, existing deformable convolution-based video SR methods [51, 60, 55, 76, 81] only perform unidirectional alignment to align neighboring frames to the reference frame. However, in LF image SR, it is computationally expensive to repetitively perform unidirectional alignment for each view in order to super-resolve all LF images. Consequently, we propose a collect-and-distribute approach to achieve bidirectional alignment using deformable convolutions. To the best of our knowledge, this is the first work to apply deformable convolutions to LF image SR.

III Network Architecture

In this section, we introduce our LF-DFnet in detail. Following [89, 67, 80, 85, 63], we convert input images from the RGB color space to the YCbCr color space and only super-resolve the Y channel images, leaving the Cb and Cr channel images to be bicubicly upscaled. Consequently, without considering the channel dimension, an LF can be formulated as a 4D tensor L ∈ R^{U×V×H×W}, where U and V represent the angular dimensions and H and W represent the spatial dimensions. Specifically, a 4D LF can be considered as a U×V array of SAIs, and the resolution of each SAI is H×W. Following [89, 67, 80, 85, 63], we achieve LF image SR using SAIs distributed in a square array, i.e., U = V = A.
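To make this representation concrete, the following minimal sketch (not the paper's code) converts a set of RGB SAIs to Y-channel images using the BT.601 luma weights and stacks them into the 4D tensor L ∈ R^{U×V×H×W}; the list-based input layout and variable names are illustrative assumptions.

```python
import torch

def rgb_to_y(rgb):
    """Luma (Y) channel from an RGB image in [0, 1], using BT.601 weights."""
    r, g, b = rgb.unbind(dim=-3)
    return 0.299 * r + 0.587 * g + 0.114 * b

def lf_to_tensor(sai_list, A):
    """Stack an A*A list of RGB SAIs (each 3 x H x W) into a 4D LF tensor
    of shape (A, A, H, W) holding only the Y channel."""
    y_views = [rgb_to_y(sai) for sai in sai_list]            # each H x W
    return torch.stack(y_views, dim=0).view(A, A, *y_views[0].shape)

# toy usage: a 5 x 5 light field with 64 x 64 sub-aperture images
sais = [torch.rand(3, 64, 64) for _ in range(25)]
lf = lf_to_tensor(sais, A=5)                                 # (5, 5, 64, 64)
```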

As illustrated in Fig. 2(a), our LF-DFnet takes the A×A LR SAIs as its inputs and sequentially performs feature extraction (Section III-A), angular deformable alignment (Section III-B), and reconstruction and upsampling (Section III-C).

Fig. 2: An overview of our LF-DFnet.

III-A Feature Extraction Module

Discriminative feature representation with rich spatial context information is beneficial to the subsequent feature alignment and SR reconstruction steps. Therefore, a large receptive field with a dense pixel sampling rate is required to extract hierarchical features. To this end, we follow [58] and use a residual atrous spatial pyramid pooling (ASPP) module as the feature extraction module in our LF-DFnet.

As shown in Fig. 2(a), input SAIs are first processed by a convolution to generate initial features, and then fed to residual ASPP modules (Fig. 2(b)) and residual blocks (Fig. 2(c)) for deep feature extraction. Note that, each view is processed separately and the weights of the feature extraction module are shared among views. In each residual ASPP block, three dilated convolutions (with dilation rates of 1, 2, and 4, respectively) are combined in parallel to extract hierarchical features with dense sampling rates. After activation with a Leaky ReLU layer (with a leaky factor of 0.1), the features of these three branches are concatenated and fused by a convolution. Finally, both the center-view feature F_c and the side-view features F_i (i = 1, …, A²−1) are generated by the feature extraction module. Following [89], we set the feature depth to 32 (i.e., C = 32). The effectiveness of the residual ASPP module is demonstrated in Section V-C.
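For illustration, a minimal PyTorch sketch of one residual ASPP block as described above is given below; the 3×3 kernel size, the 1×1 fusion layer, and the placement of the residual connection are assumptions inferred from the description rather than the released implementation.

```python
import torch
import torch.nn as nn

class ResidualASPPBlock(nn.Module):
    """Three parallel 3x3 dilated convolutions (rates 1, 2, 4), Leaky ReLU
    (slope 0.1), concatenation, 1x1 fusion, and a residual connection."""
    def __init__(self, channels=32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4))
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        feats = [self.act(branch(x)) for branch in self.branches]
        return x + self.fuse(torch.cat(feats, dim=1))

# toy usage on a 32-channel feature map
out = ResidualASPPBlock(32)(torch.randn(1, 32, 64, 64))      # (1, 32, 64, 64)
```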

III-B Angular Deformable Alignment Module (ADAM)

Given the features generated by the feature extraction module, the main objective of ADAM is to perform alignment between the center-view feature and each side-view feature. Here, we propose a bidirectional alignment approach (i.e., collect-and-distribute) to incorporate angular information. Specifically, side-view features are first warped to the center view and aligned with the center-view feature (i.e., feature collection). These aligned features are fused by a convolution to incorporate angular information. Afterwards, the fused feature is warped back to the side views by performing alignment with their original features (i.e., feature distribution). In this way, angular information can be jointly incorporated into each angular view, and the SR performance of all perspectives can be evenly improved. In this paper, we cascade several ADAMs (four in our final model, see Section V-C) to perform repetitive feature collection and distribution. Without loss of generality, we take a single ADAM as an example to introduce its mechanism, as shown in Fig. 2(d).

The core component of ADAM is the deformable convolution, which is used to align features according to their corresponding offsets. In our implementation, we use one deformable convolution for feature collection and another deformable convolution with opposite offset values for feature distribution. The first deformable convolution, which is used for feature collection, takes the side-view feature F_i and the learned offsets Δ_i as its inputs to generate the collected feature F_{i→c} (which is aligned to the center view). That is,

F_{i→c} = DCN_col(F_i, Δ_i),   (1)

where DCN_col(·) represents the deformable convolution for feature collection and Δ_i is the offset of F_i with respect to the center-view feature F_c. More specifically, for each position p_0 on F_{i→c}, we have

F_{i→c}(p_0) = Σ_{p_k ∈ R} w(p_k) · F_i(p_0 + p_k + Δp_k),   (2)

where R denotes the regular sampling grid of the convolution kernel (so that p_0 + p_k covers a neighborhood region centered at p_0), p_k is the predefined integral offset, w(p_k) is the corresponding kernel weight, and Δp_k is an additional learnable offset, which is added to the predefined offset to make the sampling positions of the deformable kernel spatially-variant. Thus, information far away from p_0 can be adaptively processed by the deformable convolution. Since p_0 + p_k + Δp_k can be fractional, we follow [8] to perform bilinear interpolation in our implementation.

Since an accurate offset is beneficial to deformable alignment, we design an offset generation branch to learn the offset Δ_i in Eq. (1). As illustrated in Fig. 2(d), the side-view feature F_i is first concatenated with the center-view feature F_c, and then fed to a convolution for feature depth reduction. To handle the complicated and large motions between F_i and F_c, a residual ASPP module (which is identical to that in Section III-A) is applied to enlarge the receptive field while maintaining a dense sampling rate. The residual ASPP module enhances the exploitation of angular dependencies between the center view and the side views, resulting in improved SR performance. The effectiveness of the residual ASPP module in the offset generation branch is investigated in Section V-C. Finally, another convolution is used to generate the offset feature Δ_i, with two offset channels (a horizontal and a vertical component) for each sampling location of the deformable kernel.

Once all the side-view features are aligned to the center view, a convolution is performed to fuse the angular information in these aligned features:

F_fuse = H_fuse([F_{1→c}, F_{2→c}, …, F_{(A²−1)→c}, F_c]),   (3)

where [·] denotes concatenation and H_fuse(·) denotes the fusing convolution.

To super-resolve all LF images, the incorporated angular information needs to be encoded into each side view. Consequently, we perform feature distribution to propagate the incorporated angular information to the side views. Since the disparities between the side-view features and the center-view feature are mutual, we do not perform additional offset learning. Instead, we use the opposite offset −Δ_i to warp the fused center-view feature to the i-th side view. That is,

F_{c→i} = DCN_dist(F_fuse, −Δ_i).   (4)

After feature distribution, both the center-view feature and the side-view features are produced by the ADAM. In this paper, we cascade four ADAMs to achieve repetitive feature collection and distribution. Consequently, angular information can be repetitively incorporated into the center view and then propagated to all side views, resulting in continuous performance improvements (see Section V-C).
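Putting the two stages together, a schematic (and deliberately simplified) collect-and-distribute pass over all side views could look as follows; a single deformable convolution is reused for both collection and distribution here for brevity, whereas the paper uses a separate one for distribution, and the offset predictor and fusion width are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CollectAndDistribute(nn.Module):
    """One schematic ADAM pass: collect all side-view features onto the center
    view, fuse them, then distribute the fused feature back to each side view
    using the negated offsets."""
    def __init__(self, channels=32, num_views=25, k=3):
        super().__init__()
        self.offset = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, k, padding=k // 2)
        self.fuse = nn.Conv2d(num_views * channels, channels, 1)

    def forward(self, center, sides):                      # sides: list of side-view features
        collected, offsets = [], []
        for f in sides:                                    # feature collection (Eq. (1))
            off = self.offset(torch.cat([f, center], 1))
            collected.append(self.dcn(f, off))
            offsets.append(off)
        fused = self.fuse(torch.cat(collected + [center], 1))       # fusion (Eq. (3))
        distributed = [self.dcn(fused, -off) for off in offsets]    # distribution (Eq. (4))
        return fused, distributed

center = torch.randn(1, 32, 64, 64)
sides = [torch.randn(1, 32, 64, 64) for _ in range(24)]    # 5x5 LF: 24 side views
fused, distributed = CollectAndDistribute()(center, sides)
```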

Fig. 3: Example images and their groundtruth depth maps in our NUDT dataset.

III-C Reconstruction & Upsampling Module

To achieve high reconstruction accuracy, the spatial and angular information has to be incorporated. Since the preceding modules in our LF-DFnet have produced angular-aligned hierarchical features, a reconstruction module is needed to fuse these features for LF image SR. Following [22], we propose a reconstruction module with information multi-distillation blocks (IMDBs). By adopting a distillation mechanism to gradually extract and process hierarchical features, superior SR performance can be achieved with a small number of parameters and a low computational cost [87].

The overall architecture of our reconstruction module is illustrated in Fig. 2(e). For each view, the outputs of the feature extraction module and each ADAM are concatenated and processed by a convolution for coarse fusion. Then, the coarsely-fused feature (with 128 channels) is fed to several stacked IMDBs for deep feature fusion. The structure of an IMDB is illustrated in Fig. 2(f). Specifically, in each IMDB, the input feature is first processed by a convolution and a Leaky ReLU. The processed feature is then split into two parts along the channel dimension, resulting in a narrow feature (with 32 channels) and a wide feature (with 96 channels). The narrow feature is preserved and directly fed to the final bottleneck of the IMDB, while the wide feature is fed to a convolution to enlarge its channel number to 128 for further refinement. In this way, useful information can be gradually distilled, and the SR performance is improved in an efficient manner. Finally, the features of different stages in the IMDB are concatenated and processed by a convolution for local residual learning. Moreover, the feature produced by the last IMDB is processed by a convolution to reduce its depth from 128 to 32 for global residual learning.
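A minimal sketch of one IMDB following the description above is given below; the number of distillation stages and the final distillation layer follow the IMDN design [22] and are assumptions here, not the exact LF-DFnet configuration.

```python
import torch
import torch.nn as nn

class IMDB(nn.Module):
    """Information multi-distillation block (sketch). At each stage, a 3x3 conv
    plus Leaky ReLU processes the feature; 32 channels are distilled and kept,
    while the remaining 96 channels are expanded back to 128 and refined further.
    The distilled features are concatenated, fused, and added back residually."""
    def __init__(self, channels=128, distill=32, stages=3):
        super().__init__()
        self.process = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.LeakyReLU(0.1, inplace=True)) for _ in range(stages))
        self.expand = nn.ModuleList(
            nn.Conv2d(channels - distill, channels, 3, padding=1) for _ in range(stages))
        self.last = nn.Conv2d(channels, distill, 3, padding=1)    # final distillation
        self.fuse = nn.Conv2d((stages + 1) * distill, channels, 1)
        self.distill = distill

    def forward(self, x):
        kept, feat = [], x
        for process, expand in zip(self.process, self.expand):
            narrow, wide = torch.split(process(feat),
                                       [self.distill, feat.shape[1] - self.distill], dim=1)
            kept.append(narrow)                                   # distilled 32-channel part
            feat = expand(wide)                                   # 96 -> 128 channels
        kept.append(self.last(feat))
        return x + self.fuse(torch.cat(kept, dim=1))              # local residual learning

out = IMDB()(torch.randn(1, 128, 32, 32))                          # (1, 128, 32, 32)
```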

The features obtained from the reconstruction module are finally fed to an upsampling module. Specifically, a convolution is first applied to the reconstructed features to extend their depth to C × α², where α is the upsampling factor. Then, pixel shuffling is performed to upscale the reconstructed feature to the target resolution αH × αW. Finally, a convolution is applied to squeeze the number of feature channels to 1 to generate the super-resolved SAIs.
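The upsampling step can be sketched as follows; the kernel sizes are assumptions, and α denotes the upsampling factor.

```python
import torch
import torch.nn as nn

class Upsampler(nn.Module):
    """Reconstructed feature -> HR sub-aperture image: expand to C * alpha^2
    channels, pixel-shuffle to the target resolution, squeeze to one channel."""
    def __init__(self, channels=32, alpha=2):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * alpha ** 2, 1)
        self.shuffle = nn.PixelShuffle(alpha)
        self.to_image = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, feat):
        return self.to_image(self.shuffle(self.expand(feat)))

sr_y = Upsampler(32, alpha=2)(torch.randn(1, 32, 32, 32))     # (1, 1, 64, 64)
```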

Fig. 4: An illustration of the concentric configuration. (a) Configuration of scene Robots. Here, camera arrays of 5 different settings of baselines are used as examples. Blocks on the translucent yellow plane denote virtual cameras, where camera arrays of different baselines are drawn in different colors. (b) concentric configuration with 5 different settings of baselines. (c) concentric configuration with 3 different settings of baselines.
Datasets Type #Scenes AngRes SpaRes GT Depth BRISQUE (↓) NIQE (↓) CEIQ (↑) ENIQA (↓)
EPFL [44] real (lytro) 119 14×14 0.034 Mpx No 47.19 5.820 3.286 0.212
HCInew [19] synthetic 24 9×9 0.026 Mpx Yes 14.80 3.833 3.153 0.087
HCIold [70] synthetic 12 9×9 0.070 Mpx Yes 24.17 2.985 3.369 0.117
INRIA [28] real (lytro) 57 14×14 0.027 Mpx No 23.56 5.338 3.184 0.160
STFgantry [54] real (gantry) 12 17×17 0.118 Mpx No 25.28 4.246 2.781 0.232
NUDT (Ours) synthetic 32 9×9 0.105 Mpx Yes 8.901 3.593 3.375 0.041

  Note: 1) Mpx denotes mega-pixels per image. 2) The best results are in bold faces and the second best results are underlined. 3) Lower scores of   BRISQUE [39], NIQE [40], ENIQA [6] and higher scores of CEIQ [78] indicate better perceptual quality.

TABLE I: Main characteristics of several popular LF datasets. Note that, average scores are reported for spatial resolution (SpaRes) and perceptual quality metrics (i.e., BRISQUE [39], NIQE [40], CEIQ [78], and ENIQA [6]). Images in our NUDT dataset have high spatial resolution and perceptual quality.

IV The NUDT Dataset

LF images captured by different devices (especially camera arrays) usually have significantly different baseline lengths. It is therefore necessary to know how existing LF algorithms work under baseline variations, including those developed for depth estimation [49, 42, 30, 23, 47, 90], view synthesis [75, 73, 72, 53, 71, 66, 91, 35, 25], and image SR [36, 24, 15, 14, 18]. However, all existing LF datasets [44, 19, 70, 28, 54] only include images with fixed baselines. To facilitate the study of LF algorithms under baseline variations, we introduce a novel LF dataset (namely, the NUDT dataset) with adjustable baselines, which is available at: https://github.com/YingqianWang/NUDT-Dataset.

IV-A Technical Details

Our NUDT dataset has 32 synthetic scenes and covers diverse scenarios (see Fig. 3). All scenes in our dataset are rendered using the 3dsMax software (https://www.autodesk.eu/products/3ds-max/overview), and have an angular resolution of 9×9 and a high spatial resolution (see Table I). Groundtruth depth maps are provided for LF depth/disparity estimation methods. During the image rendering process, all virtual cameras in the array have identical internal parameters and are coplanar, with parallel optical axes. To capture LF images with different baselines, we used a concentric configuration to align the camera arrays at the center view. In this way, LF images of different baselines share the same center-view SAI and groundtruth depth map. An illustration of our concentric configuration is shown in Fig. 4. For each scene, we rendered LF images with 10 different baselines. Note that, we tuned the rendering parameters (e.g., lighting and depth range) to better reflect real scenes. Consequently, our dataset has a high perceptual quality, as demonstrated in the next subsection.

IV-B Comparison to Existing Datasets

In this section, we compare our NUDT dataset to several popular LF datasets [44, 19, 70, 28, 54]. Following [62], we use four no-reference image quality assessment (NRIQA) metrics to evaluate the perceptual quality of LF images in these datasets. These NRIQA metrics, including the blind/referenceless image spatial quality evaluator (BRISQUE) [39], the natural image quality evaluator (NIQE) [40], the contrast enhancement based image quality evaluator (CEIQ) [78], and the entropy-based image quality assessment (ENIQA) [6], are highly correlated with human perception. As shown in Table I, our NUDT dataset achieves the best scores in BRISQUE [39], CEIQ [78], and ENIQA [6], and the second best score in NIQE [40]. That is, images in our NUDT dataset have high perceptual quality. Meanwhile, our dataset has more scenes (see #Scenes) and a higher image resolution (see SpaRes) than the HCInew [19] and HCIold [70] datasets.

Datasets      #Training      #Test
EPFL [44] 70 10
HCInew [19] 20 4
HCIold [70] 10 2
INRIA [28] 35 5
STFgantry [54] 9 2
Total 144 23
TABLE II: Public datasets used in our experiments.

V Experiments

In this section, we first introduce our implementation details. Then, we compare our LF-DFnet to state-of-the-art SISR and LF image SR methods. Finally, we present ablation studies to investigate our network.

Method    EPFL [44]    HCInew [19]    HCIold [70]    INRIA [28]    STFgantry [54]
(each entry: PSNR SSIM)
2× SR:
Bicubic 29.50 0.935 31.69 0.934 37.46 0.978 31.10 0.956 30.82 0.947
VDSR [26] 32.01 0.959 34.37 0.956 40.34 0.986 33.80 0.972 35.80 0.980
EDSR [33] 32.86 0.965 35.02 0.961 41.11 0.988 34.61 0.977 37.08 0.985
RCAN [92] 33.46 0.967 35.56 0.963 41.59 0.989 35.18 0.978 38.18 0.988
SAN [9] 33.36 0.967 35.51 0.963 41.47 0.989 35.15 0.978 37.98 0.987
LFBM5D [2] 31.15 0.956 33.72 0.955 39.62 0.985 32.85 0.969 33.55 0.972
GB [45] 31.22 0.959 35.25 0.969 40.21 0.988 32.76 0.972 35.44 0.984
LFNet [67] 31.79 0.950 33.52 0.943 39.44 0.982 33.49 0.966 32.76 0.957
LFSSR [80] 34.15 0.973 36.98 0.974 43.29 0.993 35.76 0.982 37.67 0.989
resLF [89] 33.22 0.969 35.79 0.969 42.30 0.991 34.86 0.979 36.28 0.985
LF-ATO [24] 34.49 0.976 37.28 0.977 43.76 0.994 36.21 0.984 39.06 0.992
LF-InterNet [63] 34.76 0.976 37.20 0.976 44.65 0.995 36.64 0.984 38.48 0.991
LF-DFnet (Ours) 34.37 0.977 37.77 0.979 44.64 0.995 36.17 0.985 40.17 0.994
4× SR:
Bicubic 25.14 0.831 27.61 0.851 32.42 0.934 26.82 0.886 25.93 0.843
VDSR [26] 26.82 0.869 29.12 0.876 34.01 0.943 28.87 0.914 28.31 0.893
EDSR [33] 27.82 0.892 29.94 0.893 35.53 0.957 29.86 0.931 29.43 0.921
RCAN [92] 28.31 0.899 30.25 0.896 35.89 0.959 30.36 0.936 30.25 0.934
SAN [9] 28.30 0.899 30.25 0.898 35.88 0.960 30.29 0.936 30.30 0.933
SRGAN [29] 26.85 0.870 28.95 0.873 34.03 0.942 28.85 0.916 28.19 0.898
ESRGAN [61] 25.59 0.836 26.96 0.819 33.53 0.933 27.54 0.880 28.00 0.905
LFBM5D [2] 26.61 0.869 29.13 0.882 34.23 0.951 28.49 0.914 28.30 0.900
GB [45] 26.02 0.863 28.92 0.884 33.74 0.950 27.73 0.909 28.11 0.901
LFNet [67] 25.95 0.854 28.14 0.862 33.17 0.941 27.79 0.904 26.60 0.858
LFSSR [80] 29.16 0.915 30.88 0.913 36.90 0.970 31.03 0.944 30.14 0.937
resLF [89] 27.86 0.899 30.37 0.907 36.12 0.966 29.72 0.936 29.64 0.927
LF-ATO [24] 29.16 0.917 31.08 0.917 37.23 0.971 31.21 0.950 30.78 0.944
LF-InterNet [63] 29.52 0.917 31.01 0.917 37.23 0.972 31.65 0.950 30.44 0.941
LF-DFnet (Ours) 28.92 0.919 31.33 0.921 37.46 0.973 31.00 0.952 31.29 0.952
TABLE III: PSNR and SSIM values achieved by different methods for 2× and 4× SR. The best results are in red and the second best results are in blue.

V-A Implementation Details

As listed in Table II, we used 5 public LF datasets in our experiments for both training and test. The LFs in these datasets have angular resolutions of 9×9 or higher (see Table I). In the training stage, we cropped each SAI into HR patches with a stride of 32, and used bicubic downsampling to generate the corresponding LR patches. We performed random horizontal flipping, vertical flipping, and 90-degree rotation to augment the training data by 8 times. Note that, both the spatial and angular dimensions need to be flipped or rotated during data augmentation to maintain LF structures.
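As a sketch of such LF-aware augmentation on a 4D tensor of shape (U, V, H, W), spatial flips and rotations are mirrored in the angular dimensions; the exact pairing of angular and spatial axes depends on the LF coordinate convention and is an assumption here.

```python
import torch

def augment_lf(lf, hflip=False, vflip=False, rot90=False):
    """Light-field-aware augmentation on a (U, V, H, W) tensor: every spatial
    flip/rotation is mirrored in the angular dimensions so that the epipolar
    structure of the LF is preserved."""
    if hflip:
        lf = torch.flip(lf, dims=(1, 3))           # flip V together with W
    if vflip:
        lf = torch.flip(lf, dims=(0, 2))           # flip U together with H
    if rot90:
        lf = torch.rot90(lf, k=1, dims=(2, 3))     # rotate spatial dims ...
        lf = torch.rot90(lf, k=1, dims=(0, 1))     # ... and angular dims accordingly
    return lf

lf = torch.rand(5, 5, 64, 64)
aug = augment_lf(lf, hflip=True, rot90=True)        # still (5, 5, 64, 64)
```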

By default, we used the model with four cascaded ADAMs, a feature depth of C = 32, and an angular resolution of 5×5 for both 2× and 4× SR. We also investigated several variants of our LF-DFnet in Section V-C. The L1 loss function was used to train our network due to its robustness to outliers [3].

Following [89, 67, 63, 24, 92, 82], we used PSNR and SSIM as quantitative metrics for performance evaluation. Both PSNR and SSIM were calculated separately on the Y channel of each SAI. To obtain the overall metric score for a dataset with M test scenes (each scene with an angular resolution of A×A), we first obtained the score for each scene by averaging its A×A per-view scores, and then generated the overall score by averaging the scores of all M scenes.
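A minimal sketch of this averaging scheme is given below, using scikit-image's PSNR and SSIM as stand-ins (their SSIM settings may differ slightly from the implementation used in the paper).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def dataset_score(scenes):
    """scenes: list of (sr, gt) pairs, each an (A, A, H, W) array of Y-channel
    SAIs in [0, 1]. PSNR/SSIM are computed per view, averaged over the A x A
    views of each scene, then averaged over all scenes."""
    psnrs, ssims = [], []
    for sr, gt in scenes:
        A = sr.shape[0]
        p = np.mean([peak_signal_noise_ratio(gt[u, v], sr[u, v], data_range=1.0)
                     for u in range(A) for v in range(A)])
        s = np.mean([structural_similarity(gt[u, v], sr[u, v], data_range=1.0)
                     for u in range(A) for v in range(A)])
        psnrs.append(p); ssims.append(s)
    return float(np.mean(psnrs)), float(np.mean(ssims))

# toy usage with random "predictions" on two 5x5 scenes of 64x64 views
scenes = [(np.random.rand(5, 5, 64, 64), np.random.rand(5, 5, 64, 64)) for _ in range(2)]
print(dataset_score(scenes))
```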

Our LF-DFnet was implemented in PyTorch on a PC with two NVIDIA RTX GPUs. Our model was initialized using the Xavier method [17] and optimized using the Adam method [27]. The batch size was set to 8, and the learning rate was decreased by a factor of 0.5 every 10 epochs. The training was stopped after 50 epochs and took about 1.5 days.
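The optimization recipe can be sketched as follows; the model and data are toy stand-ins, and the initial learning rate is an assumption since its value is not given above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for LF-DFnet and its training patches; only the recipe described
# in the paper is illustrated (L1 loss, Adam, batch size 8, LR halved every
# 10 epochs, 50 epochs in total).
model = nn.Conv2d(1, 1, 3, padding=1)
data = TensorDataset(torch.rand(64, 1, 32, 32), torch.rand(64, 1, 32, 32))
loader = DataLoader(data, batch_size=8, shuffle=True)

criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # initial LR is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(50):
    for lr_patch, hr_patch in loader:
        optimizer.zero_grad()
        criterion(model(lr_patch), hr_patch).backward()
        optimizer.step()
    scheduler.step()
```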

Fig. 5: Visual results of 2× SR.
Fig. 6: Visual results of 4× SR.

V-B Comparison to the State-of-the-arts

We compare our method to several state-of-the-art methods, including 6 single image SR methods (i.e., VDSR [26], EDSR [33], RCAN [92], SAN [9], SRGAN [29], and ESRGAN [61]) and 7 LF image SR methods (i.e., LFBM5D [2], GB [45], LFNet [67], LFSSR [80], resLF [89], LF-ATO [24], and LF-InterNet [63]). We also use the bicubic interpolation method to provide baseline results.

V-B1 Quantitative Results

Quantitative results are presented in Table III. Our LF-DFnet achieves the highest SSIM scores on all 5 datasets for both 2× and 4× SR. In terms of PSNR, our method achieves the best performance on the HCInew and STFgantry datasets for both 2× and 4× SR, and on the HCIold dataset for 4× SR. On the datasets captured by Lytro cameras (i.e., EPFL and INRIA), our method is marginally inferior to LF-ATO and LF-InterNet but significantly better than the other methods (e.g., RCAN, SAN, and resLF). It is worth noting that the superiority of our LF-DFnet is most significant on the STFgantry dataset. That is because scenes in the STFgantry dataset are captured by a moving camera mounted on a gantry, and thus have relatively large baselines and significant disparity variations. Our LF-DFnet can handle this disparity problem by using deformable convolutions for angular alignment, while maintaining promising performance for LFs with small baselines (e.g., LFs in the EPFL and INRIA datasets). More analyses with respect to different baseline lengths are presented in Section V-B5.

Method Scale #Params. FLOPs(G) PSNR / SSIM
RCAN [92] 2× 15.44M 15.71×25 36.79 / 0.977
SAN [9] 2× 15.71M 16.05×25 36.69 / 0.977
resLF [89] 2× 6.35M 37.06 36.49 / 0.979
LF-ATO [24] 2× 1.51M 597.66 38.16 / 0.985
LF-InterNet [63] 2× 4.80M 47.46 38.35 / 0.985
LF-DFnet 2× 4.62M 57.21 38.62 / 0.986
RCAN [92] 4× 15.59M 16.34×25 31.01 / 0.925
SAN [9] 4× 15.86M 16.67×25 31.00 / 0.925
resLF [89] 4× 6.79M 39.70 30.74 / 0.927
LF-ATO [24] 4× 1.66M 686.99 31.89 / 0.940
LF-InterNet [63] 4× 5.23M 50.10 31.97 / 0.939
LF-DFnet 4× 4.64M 59.85 32.00 / 0.943
TABLE IV: Comparisons of the number of parameters (i.e., #Params.) and FLOPs for 2× and 4× SR. Note that, FLOPs are calculated on an input LF with an angular resolution of 5×5 (the per-view FLOPs of SISR methods are multiplied by the 25 views). Here, we use PSNR and SSIM values averaged over the 5 datasets [44, 19, 70, 28, 54] to represent reconstruction accuracy.
Fig. 7: PSNR values achieved by RCAN [92], resLF [89], and our LF-DFnet on each SAI of the HCInew dataset [19]. Here, the central 3×3, 5×5, 7×7, and 9×9 input views are used to perform 2× SR. We use the standard deviation (Std) value to measure the uniformity of the PSNR distribution. Note that, our LF-DFnet achieves high reconstruction quality (i.e., high PSNR values) and a balanced distribution (i.e., low Std scores) among different perspectives.

V-B2 Qualitative Results

Qualitative results for 2× and 4× SR are shown in Figs. 5 and 6, respectively. As compared to state-of-the-art SISR and LF image SR methods, our method produces images with more faithful details and fewer artifacts. Specifically, for 2× SR, the images generated by our LF-DFnet are very close to the groundtruth images. Note that, the stairway in scene INRIA_Sculpture is faithfully recovered by our method without blurring or artifacts, and the tile edges in scene HCIold_buddha are as sharp as in the groundtruth image. For 4× SR, the state-of-the-art SISR methods RCAN and SAN produce blurry results with warped textures, and the perceptual-oriented SISR method ESRGAN generates images with fake textures. That is because the SR problem becomes highly ill-posed at the 4× scale, and the spatial information in a single image is insufficient to reconstruct high-quality HR images. In contrast, our LF-DFnet can use the complementary information among different views to recover missing details, and thus achieves superior SR performance.

V-B3 Computational Efficiency

We compare our LF-DFnet to several competitive methods [92, 9, 89, 24, 63] in terms of the number of parameters (i.e., #Params) and FLOPs. As shown in Table IV, our method achieves the highest PSNR and SSIM scores with a small number of parameters and FLOPs. Note that, the FLOPs of our method are significantly lower than RCAN, SAN, and LF-ATO but marginally higher than resLF and LF-InterNet. That is because, our LF-DFnet uses more complicated feature extraction and reconstruction modules than resLF and LF-InterNet. These modules introduce a notable performance improvement at the cost of a reasonable increase of FLOPs.

V-B4 Performance w.r.t. Perspectives

Since LF image SR methods aim at super-resolving all SAIs in an LF, we compare our method to resLF under different perspectives. We used the central 3×3, 5×5, 7×7, and 9×9 SAIs in the HCInew dataset to perform 2× SR, and used PSNR values for performance evaluation, as visualized in Fig. 7. Note that, due to the changing perspectives, the contents of different SAIs are not identical, resulting in inherent PSNR variations among perspectives. Therefore, we evaluate this variation by using RCAN to perform SISR on each SAI. As shown in Fig. 7, RCAN achieves a relatively balanced PSNR distribution (Std=0.0327 for 9×9 LFs). This demonstrates that the inherent PSNR variation among perspectives is relatively small. It can be observed that resLF achieves notable performance improvements over RCAN under all angular resolutions. However, since resLF uses only part of the views for LF image SR, the PSNR scores achieved by resLF on side views are relatively low.

As compared to RCAN and resLF, our method uses all SAIs to super-resolve each view and explicitly handles the disparity problem. Consequently, our method achieves better SR performance (i.e., higher PSNR values) with a more balanced distribution (i.e., lower Std scores). It can also be observed that our SR performance is continuously improved as the angular resolution increases from 3×3 to 7×7. That is because the additional views introduce more angular information, which is beneficial to SR reconstruction. Note that, our LF-DFnet achieves comparable performance with 7×7 and 9×9 input views (37.91 vs. 37.89 in average PSNR). That is, the angular information tends to be saturated when the angular resolution exceeds 7×7, and a further increase in angular resolution cannot introduce significant performance improvement.

Fig. 8: PSNR and SSIM values achieved by resLF [89] and LF-DFnet on 4 scenes from our NUDT dataset under linearly increased disparities for 2× SR. Our LF-DFnet achieves better performance than resLF, especially on LF images with large disparity variations (i.e., wide baselines).

V-B5 Performance w.r.t. Disparity Variations

We selected 4 scenes (see Fig. 8) from our NUDT dataset, and rendered them with linearly increased baselines. Note that, the disparities in a specific scene are proportional to the baseline length when the camera intrinsic parameters (e.g., focal length) are fixed. Consequently, we can investigate the performance of LF image SR algorithms with respect to disparity variations by straightforwardly applying them to the same scenes rendered with different baselines. Following the HCInew dataset [19], we calculated the disparity range of each scene using its groundtruth depth values. As shown in Fig. 8, the reconstruction accuracy (i.e., PSNR) of both methods tends to decrease with increasing disparities, except for LF-DFnet on scene Robots (see Fig. 8(c)). Note that, the superiority of our LF-DFnet becomes more significant on LF images with larger disparity variations (i.e., wider baselines). That is because large disparities result in large misalignments among LF images and thus introduce difficulties in angular information exploitation. Since deformable convolution is used by our method to perform angular alignment, our LF-DFnet is more robust to disparity variations, and thus achieves better performance on LF images with wide baselines.
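For reference, the standard rectified-camera relation behind this proportionality is:

```latex
% Disparity d (in pixels) of a scene point at depth z, for two cameras with
% focal length f (in pixels) and baseline B:
d = \frac{f\,B}{z}
% With f and z fixed, d \propto B, so scaling the baseline by a factor s
% scales all disparities in the scene by the same factor.
```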

V-B6 Performance Under Real-World Degradations

We compare our method to RCAN, ESRGAN, resLF, LF-ATO, and LF-InterNet under real-world degradation by directly applying them to LFs from the EPFL dataset. Since no groundtruth HR images are available in this dataset, we compare the visual performance in Fig. 9. Our LF-DFnet recovers much finer details from the input LF images, and produces fewer artifacts than RCAN and ESRGAN. Since the LF structure remains unchanged under bicubic and real-world degradations, our method can successfully learn to incorporate spatial and angular information from bicubicly downsampled training data, and generalizes well to LF images under real degradation.

Fig. 9: Visual results achieved by different methods under real-world degradation.

V-C Ablation Study

In this subsection, we compare our LF-DFnet with several variants to investigate the potential benefits introduced by our network modules.

Fig. 10: Comparative results achieved on the 5 datasets [44, 19, 70, 28, 54] by LF-DFnet and its variants with 1–5 ADAMs for 2× SR. Here, the center-view, average, and minimum PSNR and SSIM values are reported to comprehensively evaluate the reconstruction accuracy. Moreover, the number of parameters (i.e., #Params.) and FLOPs (calculated on a fixed input SAI size) are also reported to show their computational efficiency.

V-C1 ADAMs

As the core component of our LF-DFnet, ADAM can perform feature alignment between the center view and each side view. Here, we investigate ADAM by introducing the following three variants:

  • RegularCNN: it is introduced by replacing the deformable convolutions with regular convolutions in both feature collection and feature distribution stages.

  • RegularDist: it is introduced by replacing the deformable convolutions with regular convolutions only in the feature distribution stage.

  • LearnDist: it is introduced by performing offset learning during feature distribution rather than using their opposite values.

Apart from these three variants, SR performance is also influenced by the number of ADAMs in the network. To investigate the effect of these coupled factors, we trained 20 models with 4 design options and 5 different numbers of ADAMs from scratch. Comparative results of these 20 models are shown in Fig. 10.

It can be observed from Figs. 10(a) and 10(d) that RegularDist, LearnDist, and our model achieve comparable results in center-view PSNR/SSIM, while RegularCNN achieves relatively lower scores. That is because, by using deformable convolutions for feature collection, contributive information can be effectively collected and used to reconstruct the center views. Similarly, deformable convolutions in the feature distribution stage also play an important role in LF image SR. As shown in Figs. 10(c) and 10(f), RegularCNN and RegularDist achieve lower PSNR/SSIM scores than LearnDist and the proposed model. That is because, without deformable convolutions for feature distribution, the incorporated information cannot be effectively distributed to the side views, resulting in a lower minimum reconstruction accuracy. In terms of averaged PSNR/SSIM (see Figs. 10(b) and 10(e)), the proposed model and LearnDist achieve comparable results. However, LearnDist performs offset learning twice in each ADAM, and thus has a larger model size and higher FLOPs than the proposed model, as shown in Figs. 10(g) and 10(h). In summary, our proposed model achieves good SR performance on all angular views while maintaining a reasonable computational cost.

Moreover, it can be observed in Figs. 10(a)–10(f) that the reconstruction accuracy is improved as the number of ADAMs increases. However, the performance tends to saturate when the number of ADAMs is increased from 4 to 5. Since the model size (i.e., #Params.) and computational cost (i.e., FLOPs) grow linearly with the number of ADAMs (as shown in Figs. 10(g) and 10(h)), we finally use 4 ADAMs in our network to achieve a tradeoff between reconstruction accuracy and computational efficiency.

Model PSNR SSIM #Params. FLOPs
LF-DFnet_woASPPinFEM 38.46 0.9856 4.60M 56.6G
LF-DFnet_woASPPinOFS 38.44 0.9856 4.57M 56.1G
LF-DFnet 38.59 0.9858 4.62M 57.2G
TABLE V: Average PSNR and SSIM values achieved on the 5 datasets [44, 19, 70, 28, 54] by LF-DFnet and its variants for 2× SR. Here, the number of parameters (i.e., #Params.) and FLOPs (calculated on a fixed input SAI size) are reported to show their computational efficiency.

V-C2 Residual ASPP Module

The residual ASPP module is used in our LF-DFnet for both feature extraction and offset learning. To demonstrate its effectiveness, we introduced two variants, LF-DFnet_woASPPinFEM and LF-DFnet_woASPPinOFS, by replacing the residual ASPP blocks with regular residual blocks in the feature extraction module and the offset learning branch, respectively. As shown in Table V, LF-DFnet_woASPPinFEM suffers a 0.13 dB decrease in PSNR as compared to LF-DFnet. That is because the residual ASPP module can extract hierarchical features from input images, which are beneficial to LF image SR. Similarly, a 0.16 dB PSNR decrease is introduced when the ASPP module is removed from the offset learning branch. That is because the ASPP module enables accurate offset learning through multi-scale feature representation and an enlarged receptive field.

VI Conclusion

In this paper, we proposed LF-DFnet to handle the disparity problem in LF image SR. By performing feature alignment using our angular deformable alignment module, angular information can be well incorporated and the SR performance is significantly improved. Moreover, we developed a baseline-adjustable LF dataset for performance evaluation under disparity variations. Experimental results on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet achieves state-of-the-art quantitative and qualitative SR performance, and is more robust to disparity variations.

References

  • [1] M. Alain and A. Smolic (2017) Light field denoising by sparse 5d transform domain collaborative filtering. In International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6. Cited by: §II-B.
  • [2] M. Alain and A. Smolic (2018) Light field super-resolution via lfbm5d sparse coding. In IEEE International Conference on Image Processing (ICIP), pp. 2501–2505. Cited by: §I, §II-B, §V-B, TABLE III.
  • [3] Y. Anagun, S. Isik, and E. Seke (2019) SRLibrary: comparing different loss functions for super-resolution over various convolutional architectures. Journal of Visual Communication and Image Representation 61, pp. 178–187. Cited by: §V-A.
  • [4] S. Anwar, S. Khan, and N. Barnes (2020) A deep journey into super-resolution: a survey. ACM Computing Surveys (CSUR) 53 (3), pp. 1–34. Cited by: §II-A.
  • [5] G. Bertasius, L. Torresani, and J. Shi (2018) Object detection in video with spatiotemporal sampling networks. In European Conference on Computer Vision (ECCV), pp. 331–346. Cited by: §II-C.
  • [6] X. Chen, Q. Zhang, M. Lin, G. Yang, and C. He (2019) No-reference color image quality assessment: from entropy to perceptual quality. EURASIP Journal on Image and Video Processing 2019 (1), pp. 77. Cited by: TABLE I, §IV-B.
  • [7] A. Chuchvara, A. Barsi, and A. Gotchev (2019) Fast and accurate depth estimation from sparse light fields. IEEE Transactions on Image Processing. Cited by: §I.
  • [8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pp. 764–773. Cited by: §I, §II-C, §III-B.
  • [9] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang (2019) Second-order attention network for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11065–11074. Cited by: §II-A, §V-B3, §V-B, TABLE III, TABLE IV.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: §II-A, §II-B.
  • [11] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307. Cited by: §I, §II-A.
  • [12] K. Egiazarian and V. Katkovnik (2015) Single image super-resolution via bm3d sparse coding. In European Signal Processing Conference (EUSIPCO), pp. 2849–2853. Cited by: §II-B.
  • [13] R. A. Farrugia, C. Galea, and C. Guillemot (2017) Super resolution of light field images using linear subspace projection of patch-volumes. IEEE Journal of Selected Topics in Signal Processing 11 (7), pp. 1058–1071. Cited by: §I, §II-B.
  • [14] R. A. Farrugia and C. Guillemot (2020) A simple framework to leverage state-of-the-art single-image super-resolution methods to restore light fields. Signal Processing: Image Communication 80, pp. 115638. Cited by: §IV.
  • [15] R. Farrugia and C. Guillemot (2019) Light field super-resolution using a low-rank prior and deep convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §IV.
  • [16] V. K. Ghassab and N. Bouguila (2019) Light field super-resolution using edge-preserved graph-based regularization. IEEE Transactions on Multimedia. Cited by: §I.
  • [17] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §V-A.
  • [18] M. S. K. Gul and B. K. Gunturk (2018) Spatial and angular resolution enhancement of light fields using convolutional neural networks. IEEE Transactions on Image Processing 27 (5), pp. 2146–2159. Cited by: §IV.
  • [19] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke (2016) A dataset and evaluation methodology for depth estimation on 4d light fields. In Asian Conference on Computer Vision (ACCV), pp. 19–34. Cited by: Fig. 1, TABLE I, §IV-B, TABLE II, §IV, Fig. 10, Fig. 7, §V-B5, TABLE III, TABLE IV, TABLE V.
  • [20] C. Huang, Y. Wang, L. Huang, J. Chin, and L. Chen (2016) Fast physically correct refocusing for sparse light fields using block-based multi-rate view interpolation. IEEE Transactions on Image Processing 26 (2), pp. 603–618. Cited by: §I.
  • [21] Y. Huang, W. Wang, and L. Wang (2015) Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems (NeurIPS), pp. 235–243. Cited by: §II-B.
  • [22] Z. Hui, X. Gao, Y. Yang, and X. Wang (2019) Lightweight image super-resolution with information multi-distillation network. In The 27th ACM International Conference on Multimedia, pp. 2024–2032. Cited by: §III-C.
  • [23] H. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. Tai, and I. S. Kweon (2018) Depth from a light field image with learning-based matching costs. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 297–310. Cited by: §IV.
  • [24] J. Jin, J. Hou, J. Chen, and S. Kwong (2020) Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2260–2269. Cited by: Fig. 1, §II-B, §IV, §V-A, §V-B3, §V-B, TABLE III, TABLE IV.
  • [25] J. Jin, J. Hou, H. Yuan, and S. Kwong (2020) Learning light field angular super-resolution via a geometry-aware network. In AAAI Conference on Artificial Intelligence, Cited by: §IV.
  • [26] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654. Cited by: §II-A, §V-B, TABLE III.
  • [27] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning and Representation (ICLR). Cited by: §V-A.
  • [28] M. Le Pendu, X. Jiang, and C. Guillemot (2018) Light field inpainting propagation via low rank matrix completion. IEEE Transactions on Image Processing 27 (4), pp. 1981–1993. Cited by: §I, TABLE I, §IV-B, TABLE II, §IV, Fig. 10, TABLE III, TABLE IV, TABLE V.
  • [29] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4681–4690. Cited by: §V-B, TABLE III.
  • [30] J. Y. Lee and R. Park (2019) Complex-valued disparity: unified depth model of depth from stereo, depth from focus, and depth from defocus based on the light field gradient. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §IV.
  • [31] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu (2014) Saliency detection on light field. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2806–2813. Cited by: §I.
  • [32] T. Li, D. P. Lun, Y. Chan, et al. (2018) Robust reflection removal based on light field imaging. IEEE Transactions on Image Processing 28 (4), pp. 1798–1812. Cited by: §I.
  • [33] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 136–144. Cited by: §I, §II-A, §II-B, §V-B, TABLE III.
  • [34] F. Liu, S. Zhou, Y. Wang, G. Hou, Z. Sun, and T. Tan (2019) Binocular light-field: imaging theory and occlusion-robust depth perception application. IEEE Transactions on Image Processing 29, pp. 1628–1640. Cited by: §I.
  • [35] N. Meng, H. K. So, X. Sun, and E. Lam (2019) High-dimensional dense residual convolutional neural network for light field reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §IV.
  • [36] N. Meng, X. Wu, J. Liu, and E. Y. Lam (2020) High-order residual network for light field super-resolution. AAAI Conference on Artificial Intelligence. Cited by: §IV.
  • [37] K. Mishiba (2020) Fast depth estimation for light field cameras. IEEE Transactions on Image Processing 29, pp. 4232–4242. Cited by: §I.
  • [38] K. Mitra and A. Veeraraghavan (2012) Light field denoising, light field superresolution and stereo camera based refocussing using a gmm light field patch prior. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 22–28. Cited by: §I, §II-B.
  • [39] A. Mittal, A. K. Moorthy, and A. C. Bovik (2012) No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing 21 (12), pp. 4695–4708. Cited by: TABLE I, §IV-B.
  • [40] A. Mittal, R. Soundararajan, and A. C. Bovik (2012) Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20 (3), pp. 209–212. Cited by: TABLE I, §IV-B.
  • [41] S. Nah, R. Timofte, S. Baik, S. Hong, G. Moon, S. Son, and K. Mu Lee (2019) Ntire 2019 challenge on video deblurring: methods and results. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 0–0. Cited by: §II-C.
  • [42] I. K. Park, K. M. Lee, et al. (2017) Robust light field depth estimation using occlusion-noise aware data costs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (10), pp. 2484–2497. Cited by: §IV.
  • [43] Y. Piao, X. Li, M. Zhang, J. Yu, and H. Lu (2019) Saliency detection via depth-induced cellular automata on light field. IEEE Transactions on Image Processing 29, pp. 1879–1889. Cited by: §I.
  • [44] M. Rerabek and T. Ebrahimi (2016) New light field image dataset. In International Conference on Quality of Multimedia Experience (QoMEX), Cited by: TABLE I, §IV-B, TABLE II, §IV, Fig. 10, TABLE III, TABLE IV, TABLE V.
  • [45] M. Rossi and P. Frossard (2018) Geometry-consistent light field super-resolution via graph-based regularization. IEEE Transactions on Image Processing 27 (9), pp. 4207–4218. Cited by: §I, §II-B, §V-B, TABLE III.
  • [46] H. Sheng, S. Zhang, X. Cao, Y. Fang, and Z. Xiong (2017) Geometric occlusion analysis in depth estimation using integral guided filter for light-field image. IEEE Transactions on Image Processing 26 (12), pp. 5758–5771. Cited by: §I.
  • [47] H. Sheng, P. Zhao, S. Zhang, J. Zhang, and D. Yang (2018) Occlusion-aware depth estimation for light field using multi-orientation epis. Pattern Recognition 74, pp. 587–599. Cited by: §IV.
  • [48] J. Shi, X. Jiang, and C. Guillemot (2019) A framework for learning depth from a flexible subset of dense and sparse light field views. IEEE Transactions on Image Processing 28 (12), pp. 5867–5880. Cited by: §I.
  • [49] C. Shin, H. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim (2018) EPINet: a fully-convolutional neural network using epipolar geometry for depth from light field images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4748–4757. Cited by: §IV.
  • [50] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In European Conference on Computer Vision (ECCV), pp. 529–545. Cited by: §II-C.
  • [51] Y. Tian, Y. Zhang, Y. Fu, and C. Xu (2020) TDAN: temporally deformable alignment network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I, §II-C, §II-C.
  • [52] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 114–125. Cited by: §II-A.
  • [53] S. Vagharshakyan, R. Bregovic, and A. Gotchev (2017) Light field reconstruction using shearlet transform. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (1), pp. 133–147. Cited by: §IV.
  • [54] V. Vaish and A. Adams (2008) The (new) Stanford light field archive. Computer Graphics Laboratory, Stanford University 6 (7). Cited by: Fig. 1, TABLE I, §IV-B, TABLE II, §IV, Fig. 10, TABLE III, TABLE IV, TABLE V.
  • [55] H. Wang, D. Su, C. Liu, L. Jin, X. Sun, and X. Peng (2019) Deformable non-local network for video super-resolution. IEEE Access 7, pp. 177734–177744. Cited by: §I, §II-C, §II-C.
  • [56] L. Wang, Y. Guo, Z. Lin, X. Deng, and W. An (2018) Learning for video super-resolution through HR optical flow estimation. In Asian Conference on Computer Vision (ACCV), pp. 514–529. Cited by: §I.
  • [57] L. Wang, Y. Guo, L. Liu, Z. Lin, X. Deng, and W. An (2020) Deep video super-resolution using HR optical flow estimation. IEEE Transactions on Image Processing. Cited by: §I.
  • [58] L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo (2019) Learning parallax attention for stereo image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §III-A.
  • [59] T. Wang, Y. Piao, X. Li, L. Zhang, and H. Lu (2019) Deep learning for light field saliency detection. In IEEE International Conference on Computer Vision (ICCV), pp. 8838–8848. Cited by: §I.
  • [60] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy (2019) EDVR: video restoration with enhanced deformable convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Cited by: §I, §II-C, §II-C.
  • [61] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In European Conference on Computer Vision (ECCV). Cited by: §V-B, TABLE III.
  • [62] Y. Wang, L. Wang, J. Yang, W. An, and Y. Guo (2019) Flickr1024: a large-scale dataset for stereo image super-resolution. In IEEE International Conference on Computer Vision Workshops (ICCVW). Cited by: §IV-B.
  • [63] Y. Wang, L. Wang, J. Yang, W. An, J. Yu, and Y. Guo (2020) Spatial-angular interaction for light field image super-resolution. In European Conference on Computer Vision (ECCV). Cited by: Fig. 1, §I, §I, §II-B, §II-B, §III, §V-A, §V-B3, §V-B, TABLE III, TABLE IV.
  • [64] Y. Wang, T. Wu, J. Yang, L. Wang, W. An, and Y. Guo (2020) DeOccNet: learning to see through foreground occlusions in light fields. In IEEE Winter Conference on Applications of Computer Vision (WACV). Cited by: §I.
  • [65] Y. Wang, J. Yang, Y. Guo, C. Xiao, and W. An (2018) Selective light field refocusing for camera arrays using bokeh rendering and superresolution. IEEE Signal Processing Letters 26 (1), pp. 204–208. Cited by: §I.
  • [66] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, and T. Tan (2018) End-to-end view synthesis for light field imaging with pseudo 4DCNN. In European Conference on Computer Vision (ECCV), pp. 333–348. Cited by: §IV.
  • [67] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan (2018) LFNet: a novel bidirectional recurrent convolutional neural network for light-field image super-resolution. IEEE Transactions on Image Processing 27 (9), pp. 4274–4286. Cited by: §I, §I, §II-B, §II-B, §III, §V-A, §V-B, TABLE III.
  • [68] Z. Wang, J. Chen, and S. C. Hoi (2020) Deep learning for image super-resolution: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A.
  • [69] S. Wanner and B. Goldluecke (2013) Variational light field analysis for disparity estimation and super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 606–619. Cited by: §I, §II-B.
  • [70] S. Wanner, S. Meister, and B. Goldluecke (2013) Datasets and benchmarks for densely sampled 4D light fields. In Vision, Modelling and Visualization (VMV), Vol. 13, pp. 225–226. Cited by: TABLE I, §IV-B, TABLE II, §IV, Fig. 10, TABLE III, TABLE IV, TABLE V.
  • [71] H. W. F. Yeung, J. Hou, J. Chen, Y. Y. Chung, and X. Chen (2018) Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues. In European Conference on Computer Vision (ECCV), pp. 137–152. Cited by: §IV.
  • [72] G. Wu, Y. Liu, Q. Dai, and T. Chai (2019) Learning sheared EPI structure for light field reconstruction. IEEE Transactions on Image Processing 28 (7), pp. 3261–3273. Cited by: §IV.
  • [73] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai (2018) Light field reconstruction using convolutional network on EPI and extended applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7), pp. 1681–1694. Cited by: §IV.
  • [74] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, and Y. Liu (2017) Light field image processing: an overview. IEEE Journal of Selected Topics in Signal Processing 11 (7), pp. 926–954. Cited by: §I.
  • [75] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, and Y. Liu (2017) Light field reconstruction using deep convolutional network on EPI. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6319–6327. Cited by: §IV.
  • [76] X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu (2020) Zooming Slow-Mo: fast and accurate one-stage space-time video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I, §II-C, §II-C.
  • [77] Z. Xiao, Q. Wang, G. Zhou, and J. Yu (2017) Aliasing detection and reduction scheme on angularly undersampled light fields. IEEE Transactions on Image Processing 26 (5), pp. 2103–2115. Cited by: §I.
  • [78] J. Yan, J. Li, and X. Fu (2019) No-reference quality assessment of contrast-distorted images using contrast enhancement. arXiv preprint. Cited by: TABLE I, §IV-B.
  • [79] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, and Q. Liao (2019) Deep learning for single image super-resolution: a brief review. IEEE Transactions on Multimedia. Cited by: §II-A.
  • [80] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung (2018) Light field spatial super-resolution using deep efficient spatial-angular separable convolution. IEEE Transactions on Image Processing 28 (5), pp. 2319–2330. Cited by: Fig. 1, §I, §I, §II-B, §II-B, §III, §V-B, TABLE III.
  • [81] X. Ying, L. Wang, Y. Wang, W. Sheng, W. An, and Y. Guo (2020) Deformable 3D convolution for video super-resolution. arXiv preprint. Cited by: §I, §II-C, §II-C.
  • [82] X. Ying, Y. Wang, L. Wang, W. Sheng, W. An, and Y. Guo (2020) A stereo attention module for stereo image super-resolution. IEEE Signal Processing Letters 27, pp. 496–500. Cited by: §V-A.
  • [83] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. S. Kweon (2017) Light-field image super-resolution using convolutional neural network. IEEE Signal Processing Letters 24 (6), pp. 848–852. Cited by: §I, §I, §II-B, §II-B.
  • [84] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. S. Kweon (2015) Learning a deep convolutional network for light-field image super-resolution. In IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 24–32. Cited by: §I, §I, §II-B, §II-B.
  • [85] Y. Yuan, Z. Cao, and L. Su (2018) Light-field image superresolution using a combined deep CNN based on EPI. IEEE Signal Processing Letters 25 (9), pp. 1359–1363. Cited by: §I, §I, §II-B, §II-B, §III.
  • [86] J. Zhang, Y. Liu, S. Zhang, R. Poppe, and M. Wang (2020) Light field saliency detection with deep convolutional networks. IEEE Transactions on Image Processing 29, pp. 4421–4434. Cited by: §I.
  • [87] K. Zhang, S. Gu, R. Timofte, Z. Hui, X. Wang, X. Gao, D. Xiong, S. Liu, R. Gang, N. Nan, et al. (2019) AIM 2019 challenge on constrained super-resolution: methods and results. In IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 3565–3574. Cited by: §III-C.
  • [88] M. Zhang, J. Li, J. Wei, Y. Piao, and H. Lu (2019) Memory-oriented decoder for light field salient object detection. In Advances in Neural Information Processing Systems (NeurIPS), pp. 896–906. Cited by: §I.
  • [89] S. Zhang, Y. Lin, and H. Sheng (2019) Residual networks for light field image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11046–11055. Cited by: Fig. 1, 2nd item, §I, §I, §II-B, §II-B, §III-A, §III, Fig. 7, Fig. 8, §V-A, §V-B3, §V-B, TABLE III, TABLE IV.
  • [90] S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong (2016) Robust depth estimation for light field via spinning parallelogram operator. Computer Vision and Image Understanding 145, pp. 148–159. Cited by: §IV.
  • [91] S. Zhang, H. Sheng, D. Yang, J. Zhang, and Z. Xiong (2017) Micro-lens-based matching for scene recovery in lenslet cameras. IEEE Transactions on Image Processing 27 (3), pp. 1060–1075. Cited by: §IV.
  • [92] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision (ECCV), pp. 286–301. Cited by: Fig. 1, §II-A, Fig. 7, §V-A, §V-B3, §V-B, TABLE III, TABLE IV.
  • [93] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2472–2481. Cited by: §II-A.
  • [94] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2020) Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A.
  • [95] Y. Zhao, Y. Xiong, and D. Lin (2018) Trajectory convolution for action recognition. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2204–2215. Cited by: §II-C.
  • [96] W. Zhou, E. Zhou, G. Liu, L. Lin, and A. Lumsdaine (2019) Unsupervised monocular depth estimation from light field image. IEEE Transactions on Image Processing 29, pp. 1606–1617. Cited by: §I.
  • [97] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable ConvNets v2: more deformable, better results. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9308–9316. Cited by: §I.