Spatial-Angular Interaction for Light Field Image Super-Resolution

12/17/2019 ∙ by Yingqian Wang, et al.

Light field (LF) cameras record both the intensity and the directions of light rays, and capture a scene from a number of viewpoints. Both the information within each perspective (i.e., spatial information) and the information among different perspectives (i.e., angular information) are beneficial to image super-resolution (SR). In this paper, we propose a spatial-angular interactive network (namely, LF-InterNet) for LF image SR. In our method, spatial and angular features are separately extracted from the input LF using two specifically designed convolutions. These extracted features then interact repetitively to incorporate both spatial and angular information. Finally, the interacted spatial and angular features are fused to super-resolve each sub-aperture image. Experiments on 6 public LF datasets demonstrate the superiority of our method. Compared to existing LF and single image SR methods, our method recovers much more detail and achieves significant improvements over the state of the art in terms of PSNR and SSIM.


1 Introduction

Light field (LF) cameras provide multiple views of a scene, and thus enable many attractive applications such as post-capture refocusing [32], depth sensing [26], saliency detection [19], and de-occlusion [31]. However, LF cameras face a trade-off between spatial and angular resolutions [49]. That is, they either provide dense angular samplings with a low image resolution (e.g., Lytro (https://www.lytro.com) and RayTrix (https://www.raytrix.de)), or capture high-resolution (HR) sub-aperture images (SAIs) with sparse angular samplings (e.g., camera arrays [37, 30]). Consequently, many efforts have been made to improve the angular resolution through LF reconstruction [39, 38], or the spatial resolution through LF image super-resolution (SR) [1, 46, 23, 33, 41]. In this paper, we focus on the LF image SR problem, namely, to reconstruct HR SAIs from their corresponding low-resolution (LR) SAIs.

Figure 1: Average PSNR and SSIM values achieved by state-of-the-art SR methods on 6 public LF datasets [22, 10, 36, 17, 29, 21]. Note that, our LF-InterNet improves PSNR and SSIM values by a large margin as compared to single image SR methods (VDSR [15], EDSR [20], RCAN [47]) and LF image SR methods (LFBM5D [1], resLF [46], GBSQ [23], LFSSR_4D [41]).

Image SR is a long-standing problem in computer vision. To achieve high reconstruction performance, SR methods need to incorporate as much useful information as possible from LR inputs. In the area of single image SR, good performance can be achieved by fully exploiting the neighborhood context (i.e., spatial information) in an image. Using the spatial information, single image SR methods [5, 15, 20, 47] can successfully hallucinate missing details. In contrast, LF cameras capture scenes from multiple views. The complementary information among different views (i.e., angular information) can be used to further improve the performance of LF image SR.

However, due to the complicated 4D structure of LFs [18], it is highly challenging to incorporate spatial and angular information in an LF. Existing LF image SR methods fail to fully exploit both the angular information and the spatial information, resulting in limited SR performance. Specifically, in [43, 42, 44], SAIs are first super-resolved separately using single image SR methods [5, 20], and then fine-tuned together to incorporate the angular information. The angular information is therefore ignored by these two-stage methods [43, 42, 44] during their upsampling process. In [33, 46], only part of the SAIs are used to super-resolve one view, so the angular information in the discarded views is not incorporated. In contrast, Rossi et al. proposed a graph-based method [23] to consider all angular views in an optimization process. However, this method [23] cannot fully use the spatial information, and is inferior to deep learning-based SR methods [20, 47, 46, 41]. It is worth noting that, even when all views are fed to a deep network, it is still challenging to achieve superior performance. Yeung et al. proposed a deep network named LFSSR [41] to consider all views for LF image SR. However, as shown in Fig. 1, LFSSR [41] is inferior to resLF [46], EDSR [20], and RCAN [47].

The spatial information and the angular information are highly coupled in 4D LFs, and contribute to LF image SR in different manners. Consequently, it is difficult for networks to perform well by directly using this coupled information. To efficiently incorporate spatial and angular information, we propose a spatial-angular interactive network (i.e., LF-InterNet) for LF image SR. We first design two specific convolutions to decouple spatial and angular features from an input LF. Then, we develop LF-InterNet to repetitively interact and incorporate spatial and angular information. Extensive ablation studies have been conducted to validate our designs. We compare our method to the state-of-the-art single and LF image SR methods on 6 public LF datasets. As shown in Fig. 1, our LF-InterNet substantially improves the PSNR and SSIM performance as compared to existing SR methods.

2 Related Works

In this section, we review several major works on single image SR and LF image SR.

2.1 Single Image SR

In the area of single image SR, deep learning-based methods have been extensively explored. Readers are referred to recent surveys [34, 3, 40] for more details on single image SR. Here, we only review several milestone works. Dong et al. proposed the first CNN-based SR method (i.e., SRCNN [5]) by cascading 3 convolutional layers. Although SRCNN [5] is shallow and simple, it achieves significant improvements over traditional SR methods [28, 14, 45]. Afterwards, SR networks became increasingly deep and complex, and thus more powerful in exploiting spatial information. Kim et al. proposed a very deep SR network (i.e., VDSR [15]) with 20 convolutional layers. Global residual learning is applied in VDSR [15] to avoid slow convergence. Lim et al. proposed an enhanced deep SR network (i.e., EDSR [20]) with 65 convolutional layers by cascading residual blocks [9]. EDSR substantially improves its performance by applying both local and global residual learning, and won the NTIRE 2017 challenge on single image SR [27]. More recently, Zhang et al. proposed a residual dense network (i.e., RDN [48]) with 149 convolutional layers by combining ResNet [9] with DenseNet [12]. Using residual dense connections, RDN [48] can fully extract hierarchical features for image SR, and thus achieves further improvements over EDSR [20]. Subsequently, Zhang et al. proposed a residual channel attention network (i.e., RCAN) [47] by applying both a recursive residual mechanism and a channel attention module [11]. RCAN [47] has 500 convolutional layers, and is one of the most powerful SR methods to date.

2.2 LF image SR

In the area of LF image SR, different paradigms have been proposed. Most early works follow the traditional paradigm. Bishop et al. [4] first estimated the scene depth and then used a de-convolution approach to estimate HR SAIs. Wanner et al. [35] proposed a variational LF image SR framework using the estimated disparity map. Farrugia et al. [7] decomposed HR-LR patches into several subspaces, and achieved LF image SR via PCA analysis. Alain et al. extended SR-BM3D [6] to LFs, and super-resolved SAIs using LFBM5D filtering [1]. Rossi et al. [23] formulated LF image SR as a graph optimization problem. These traditional methods [4, 35, 7, 1, 23] use different approaches to exploit angular information, but cannot fully exploit spatial information.

In contrast, deep learning-based SR methods are more effective in exploiting spatial information, and thus can achieve promising performance. Many deep learning-based methods have been recently developed for LF image SR. In the pioneering work proposed by Yoon et al. (i.e., LFCNN [43]), SAIs are first super-resolved separately using SRCNN [5], and then fine-tuned in pairs to incorporate angular information. Similarly, Yuan et al. proposed LF-DCNN [44], in which they used EDSR [20] to super-resolve each SAI and then fine-tuned the results. Both LFCNN [43] and LF-DCNN [44] handle the LF image SR problem in two stages, and do not use angular information in the first stage. Different from [43, 44], Wang et al. proposed LFNet [33] by extending BRCN [13] to LF image SR. In their method, SAIs from the same row (or column) are fed to a recurrent network to incorporate the angular information. Zhang et al. stacked SAIs along different angular directions to generate input volumes, and then proposed a multi-stream residual network named resLF [46]. Both LFNet [33] and resLF [46] reduce 4D LF to 3D LF by using part of SAIs to super-resolve one view. Consequently, the angular information in these discarded views cannot be incorporated. To consider all views for LF image SR, Yeung et al. proposed LFSSR [41] to alternately shuffle LF features between SAI pattern and macro-pixel image (MPI) pattern for convolution. However, the complicated LF structure and coupled information have hindered the performance gain of LFSSR [41].

3 Method

In this section, we first introduce the approach to decouple spatial and angular features in Section 3.1, and then present our network in detail in Section 3.2.

3.1 Spatial-Angular Feature Decoupling

Figure 2: SAI array (left) and MPI (right) representations of LFs. Both the SAI array and the MPI representations have the same size. Note that, to convert an SAI array representation into an MPI representation, the pixels at the same spatial coordinates of each SAI are extracted and organized according to their angular coordinates to generate a macro-pixel. Then, an MPI can be generated by organizing these macro-pixels according to their spatial coordinates. More details are presented in the supplemental material.

An LF has a 4D structure and can be denoted as $\mathcal{L} \in \mathbb{R}^{U \times V \times H \times W}$, where $U$ and $V$ represent the angular dimensions, and $H$ and $W$ represent the height and width of each SAI. Intuitively, an LF can be considered as a 2D angular collection of SAIs, and the SAI at angular coordinate $(u, v)$ can be denoted as $\mathcal{L}(u, v) \in \mathbb{R}^{H \times W}$. Similarly, an LF can also be organized into an MPI, namely, a 2D spatial collection of macro-pixels. The macro-pixel at spatial coordinate $(h, w)$ can be denoted as $\mathcal{L}(h, w) \in \mathbb{R}^{U \times V}$. An illustration of these two LF representations is shown in Fig. 2. Note that, when an LF is organized as a 2D SAI array, the angular information is implicitly contained among different SAIs and is thus hard to extract. Therefore, we use the MPI representation in our method, and specifically design two convolutions (i.e., an Angular Feature Extractor (AFE) and a Spatial Feature Extractor (SFE)) to extract and decouple angular and spatial features.

Since most methods use SAIs distributed in a square array as their input, we follow [1, 23, 43, 42, 41, 46] and set $U = V = A$ in our method, where $A$ denotes the angular resolution. Given an LF of size $A \times A \times H \times W$, an MPI of size $AH \times AW$ can be generated by organizing the $H \times W$ macro-pixels (each of size $A \times A$) according to their spatial coordinates. Here, we use a toy example in Fig. 3 to illustrate the process of angular and spatial feature extraction. Specifically, AFE is defined as a convolution with a kernel size of $A \times A$ and a stride of $A$. Padding is not performed, so that the features generated by AFE have a size of $H \times W \times C$, where $C$ represents the feature depth. In contrast, SFE is defined as a convolution with a kernel size of $3 \times 3$, a stride of $1$, and a dilation of $A$. We perform zero padding to ensure that the output features have the same spatial size as the input MPI. It is worth noting that, during angular feature extraction, each macro-pixel is convolved exactly once by AFE, so the information across different macro-pixels is not aliased. Similarly, during spatial feature extraction, the pixels within each SAI are convolved by the SFE, while the angular information is not involved. In this way, the spatial and angular information in an LF is decoupled.
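As a concrete illustration, the two extractors can be written as plain PyTorch convolutions operating on the MPI tensor. This is a minimal sketch, not the released implementation; the module names, the channel count of 64, and the default angular resolution of 5 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AngularFeatureExtractor(nn.Module):
    """AFE sketch: kernel A x A, stride A, no padding, so each macro-pixel is convolved exactly once."""
    def __init__(self, in_channels=1, channels=64, ang_res=5):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, channels,
                              kernel_size=ang_res, stride=ang_res, padding=0)

    def forward(self, mpi):          # mpi: (B, 1, A*H, A*W)
        return self.conv(mpi)        # -> (B, C, H, W), one feature vector per macro-pixel

class SpatialFeatureExtractor(nn.Module):
    """SFE sketch: 3 x 3 kernel with dilation A, so every tap stays within the same view of the MPI."""
    def __init__(self, in_channels=1, channels=64, ang_res=5):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, channels, kernel_size=3,
                              stride=1, dilation=ang_res, padding=ang_res)

    def forward(self, mpi):          # mpi: (B, 1, A*H, A*W)
        return self.conv(mpi)        # -> (B, C, A*H, A*W), same spatial size as the input MPI

# example: a 5x5 LF with 32x32 sub-aperture images
mpi = torch.randn(1, 1, 5 * 32, 5 * 32)
f_a = AngularFeatureExtractor()(mpi)   # (1, 64, 32, 32)
f_s = SpatialFeatureExtractor()(mpi)   # (1, 64, 160, 160)
```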

Due to the 3D nature of real scenes, objects at different depths have different disparity values in LFs. Consequently, the pixels of an object in different views do not always fall within a single macro-pixel. To handle this problem, we apply AFE and SFE multiple times (i.e., performing spatial-angular interaction) in our network. As shown in Fig. 4, the receptive field is thereby enlarged to cover pixels with different disparities.

Figure 3: An illustration of the angular and spatial feature extractors. Here, a small LF is used as a toy example. For better visualization, pixels from different SAIs are represented with different labels (e.g., red arrays or green squares), while different macro-pixels are painted with different background colors. Note that, AFE only extracts angular features and SFE only extracts spatial features, resulting in spatial-angular information decoupling.
Figure 4: A visualization of the receptive field of our LF-InterNet using the Grad-CAM method [24]. We performed SR on the central views of scene HCInew_bicycle [10], and investigated the contribution of input pixels to the specified output pixel (marked in the zoom-in image of (a)). (a) Center-view SAI. (b) Heat maps generated by Grad-CAM [24]. The contributive pixels are highlighted. (c) Epipolar-plane images (EPIs) of the output LF and the heat maps. In summary, our LF-InterNet can handle the disparity problem in LF image SR, and its receptive field can cover the corresponding pixels in each LR image.
Figure 5: An overview of our LF-InterNet.

3.2 Network Design

Our LF-InterNet takes an LR MPI of size $AH \times AW$ as its input and produces an HR SAI array of size $A \times A \times \alpha H \times \alpha W$, where $\alpha$ denotes the upsampling factor. Following [46, 41, 33, 44], we convert RGB images into the YCbCr color space, and only super-resolve the Y channel images. An overview of our network is shown in Fig. 5.

3.2.1 Overall Architecture

Given an LR MPI $\mathcal{I}_{LR}$, the angular and spatial features are first extracted by AFE and SFE, respectively:

$F_{A,0} = \mathcal{H}_{AFE}\left(\mathcal{I}_{LR}\right), \quad F_{S,0} = \mathcal{H}_{SFE}\left(\mathcal{I}_{LR}\right)$    (1)

where $F_{A,0}$ and $F_{S,0}$ respectively represent the extracted angular and spatial features, and $\mathcal{H}_{AFE}$ and $\mathcal{H}_{SFE}$ respectively represent the angular and spatial feature extractors (as described in Section 3.1). After initial feature extraction, the features $F_{A,0}$ and $F_{S,0}$ are further processed by a series of interaction groups (i.e., Inter-Groups, see Section 3.2.2) to achieve spatial-angular feature interaction:

$\left(F_{A,g},\ F_{S,g}\right) = \mathcal{H}_{IG}^{(g)}\left(F_{A,g-1},\ F_{S,g-1}\right), \quad g = 1, 2, \cdots, G$    (2)

where $\mathcal{H}_{IG}^{(g)}$ denotes the $g$-th Inter-Group and $G$ denotes the total number of Inter-Groups.

Inspired by RDN [48], we cascade all these Inter-Groups to fully use the information interacted at different stages. Specifically, the features generated by each Inter-Group are concatenated and fed to a bottleneck block to fuse the interacted information. The feature generated by the bottleneck block is further added to the initial spatial feature to achieve global residual learning. The fused feature $F_{fuse}$ can be obtained by

$F_{fuse} = \mathcal{H}_{B}\left(\left[F_{A,1}, F_{S,1}, \cdots, F_{A,G}, F_{S,G}\right]\right) + F_{S,0}$    (3)

where $\mathcal{H}_{B}$ denotes the bottleneck block and $[\cdot]$ denotes the concatenation operation. Finally, the fused feature is fed to the reconstruction module, and an HR SAI array $\mathcal{I}_{SR}$ can be obtained by

$\mathcal{I}_{SR} = \mathcal{H}_{conv}\left(\mathcal{H}_{PS}\left(\mathcal{H}_{LFS}\left(F_{fuse}\right)\right)\right)$    (4)

where $\mathcal{H}_{LFS}$, $\mathcal{H}_{PS}$, and $\mathcal{H}_{conv}$ represent LF shuffle, pixel shuffle, and convolution, respectively. More details about feature fusion and reconstruction are introduced in Section 3.2.3.
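For clarity, the data flow of Eqs. (1)-(4) can be summarized as a short forward pass. The function below is only a sketch of the pipeline structure; the arguments (afe, sfe, inter_groups, bottleneck, reconstruct) stand for the modules described in this section and are assumed to be supplied as callables.

```python
def lf_internet_forward(lr_mpi, afe, sfe, inter_groups, bottleneck, reconstruct):
    """Sketch of the overall pipeline in Eqs. (1)-(4); all arguments are assumed callables."""
    f_a, f_s = afe(lr_mpi), sfe(lr_mpi)          # Eq. (1): initial angular/spatial features
    f_s0 = f_s                                   # kept for global residual learning
    feats = []
    for group in inter_groups:                   # Eq. (2): cascaded Inter-Groups
        f_a, f_s = group(f_a, f_s)
        feats.append((f_a, f_s))
    f_fuse = bottleneck(feats) + f_s0            # Eq. (3): bottleneck fusion + global residual
    return reconstruct(f_fuse)                   # Eq. (4): SFE -> LF shuffle -> pixel shuffle -> conv
```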

3.2.2 Spatial-Angular Feature Interaction

The basic module for spatial-angular interaction is the interaction block (i.e., Inter-Block). As shown in Fig. 5 (b), the Inter-Block takes a pair of angular and spatial features as inputs to achieve interaction. Specifically, the input angular feature is first upsampled by a factor of $A$. Here, a convolution followed by a pixel shuffle layer is used for upsampling. Then, the upsampled angular feature is concatenated with the input spatial feature, and further fed to an SFE to incorporate the spatial and angular information. In this way, the complementary angular information can be used to guide spatial feature extraction. Simultaneously, a new angular feature is extracted from the input spatial feature by an AFE, and then concatenated with the input angular feature. The concatenated angular feature is further fed to a $1 \times 1$ convolution to integrate and update the angular information. Note that, the fused angular and spatial features are added to their respective input features to achieve local residual learning. In this paper, we cascade several Inter-Blocks in an Inter-Group, i.e., the output of an Inter-Block forms the input of its subsequent Inter-Block. In summary, the spatial-angular feature interaction can be formulated as

$F_{S}^{(b)} = \mathcal{H}_{SFE}\left(\left[F_{S}^{(b-1)},\ \mathcal{H}_{\uparrow}\left(F_{A}^{(b-1)}\right)\right]\right) + F_{S}^{(b-1)}, \quad F_{A}^{(b)} = \mathcal{H}_{1\times1}\left(\left[F_{A}^{(b-1)},\ \mathcal{H}_{AFE}\left(F_{S}^{(b-1)}\right)\right]\right) + F_{A}^{(b-1)}$    (5)

where $F_{S}^{(b)}$ and $F_{A}^{(b)}$ represent the output spatial and angular features of the $b$-th Inter-Block in an Inter-Group, respectively, and $\mathcal{H}_{\uparrow}$ represents the upsampling operation.
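The interaction in Eq. (5) can be sketched as a small PyTorch module. Only the overall wiring (upsample-concatenate-SFE on the spatial path, AFE-concatenate-1x1 convolution on the angular path, plus local residuals) follows the description above; details such as the 1x1 convolution in the upsampling branch, the ReLU activations, and the channel count are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class InterBlock(nn.Module):
    """Sketch of one Inter-Block (Eq. (5)); channel count and activations are illustrative."""
    def __init__(self, channels=64, ang_res=5):
        super().__init__()
        # upsample the angular feature (H x W) back to the MPI grid (A*H x A*W)
        self.up = nn.Sequential(
            nn.Conv2d(channels, channels * ang_res * ang_res, kernel_size=1),
            nn.PixelShuffle(ang_res))
        # spatial path: fuse [spatial, upsampled angular] with an SFE (3x3, dilation A)
        self.sfe = nn.Conv2d(2 * channels, channels, kernel_size=3,
                             dilation=ang_res, padding=ang_res)
        # angular path: extract a new angular feature from the spatial feature (kernel A, stride A)
        self.afe = nn.Conv2d(channels, channels, kernel_size=ang_res, stride=ang_res)
        # 1x1 convolution to integrate the concatenated angular features
        self.fuse_a = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_a, f_s):                 # f_a: (B, C, H, W), f_s: (B, C, A*H, A*W)
        f_s_new = self.relu(self.sfe(torch.cat([f_s, self.up(f_a)], dim=1))) + f_s
        f_a_new = self.relu(self.fuse_a(torch.cat([f_a, self.afe(f_s)], dim=1))) + f_a
        return f_a_new, f_s_new
```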

3.2.3 Feature Fusion and Reconstruction

The objective of this stage is to fuse the interacted features to reconstruct an HR SAI array. The fusion and reconstruction stage mainly consists of bottleneck fusion (Fig. 5 (c)), channel extension, LF shuffle (Fig. 5 (d)), pixel shuffle (Fig. 5 (e)), and final reconstruction.

In the bottleneck, the concatenated angular features are first fed to a $1 \times 1$ convolution and a ReLU layer to generate a squeezed angular feature. Then, the squeezed angular feature is upsampled and concatenated with the spatial features. The final fused feature can be obtained as

$F_{fuse} = \mathcal{H}_{SFE}\left(\left[\mathcal{H}_{\uparrow}\left(\mathcal{H}_{1\times1}\left(\left[F_{A,1}, \cdots, F_{A,G}\right]\right)\right),\ F_{S,1}, \cdots, F_{S,G}\right]\right) + F_{S,0}$    (6)

After the bottleneck, we apply another SFE layer to extend the channel size of $F_{fuse}$ by a factor of $\alpha^2$ for pixel shuffle [25]. However, since $F_{fuse}$ is organized in the MPI pattern, we apply LF shuffle to convert it into an SAI array representation before pixel shuffle. To achieve LF shuffle, we first extract the pixels with the same angular coordinates in the MPI feature, and then re-organize these pixels according to their spatial coordinates, which can be formulated as

$F_{SAI}\left(x, y\right) = F_{MPI}\left(x', y'\right)$    (7)

where

$x' = A\left(x - H\left\lfloor x/H \right\rfloor\right) + \left\lfloor x/H \right\rfloor, \quad y' = A\left(y - W\left\lfloor y/W \right\rfloor\right) + \left\lfloor y/W \right\rfloor$    (8)

Here, $x$ and $y$ denote the pixel coordinates in the shuffled SAI array, $x'$ and $y'$ denote the corresponding coordinates in the input MPI, and $\lfloor \cdot \rfloor$ represents the round-down operation. The derivation of Eqs. (7) and (8) is presented in the supplemental material.
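The LF shuffle in Eqs. (7)-(8) is a pure re-indexing operation and can be implemented with reshape and permute instead of explicit pixel indexing. The snippet below is a reference sketch under that reading; the function name and tensor layout are assumptions, not the authors' released code.

```python
import torch

def lf_shuffle(mpi_feat, ang_res):
    """Convert an MPI-organized feature map (B, C, A*H, A*W) into the SAI-array layout
    of the same size, i.e., the index mapping of Eqs. (7)-(8)."""
    b, c, ah, aw = mpi_feat.shape
    a = ang_res
    h, w = ah // a, aw // a
    x = mpi_feat.view(b, c, h, a, w, a)      # split rows/cols into (spatial h, angular u) and (w, v)
    x = x.permute(0, 1, 3, 2, 5, 4)          # (B, C, u, h, v, w): group pixels by view
    return x.reshape(b, c, a * h, a * w)     # views tiled in an A x A grid (SAI array)

# example: a 5x5 LF with 32x32 SAIs and 64 feature channels
sai_feat = lf_shuffle(torch.randn(1, 64, 160, 160), ang_res=5)   # same shape, re-organized
```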

Finally, a convolution layer is applied to squeeze the number of feature channels to $1$ to reconstruct the HR SAIs.

4 Experiments

In this section, we first introduce the datasets and our implementation details, then conduct ablation studies to investigate our network. Finally, we compare our LF-InterNet to recent LF image SR and single image SR methods.

4.1 Datasets and Implementation Details

Datasets Type Training Test
EPFL [22] real-world
HCInew [10] synthetic
HCIold [36] synthetic
INRIA [17] real-world
STFgantry [29] real-world
STFlytro [21] real-world
Total
Table 1: Datasets used in our experiments.

As listed in Table 1, we used 6 public LF datasets in our experiments. All the LFs in the training and test sets share the same angular resolution. In the training stage, we first cropped each SAI into patches, and then used bicubic downsampling to generate the LR patches. The generated LR patches were re-organized into the MPI pattern to form the input of our network. The $L_1$ loss function was used since it generates good results for the SR task and is robust to outliers [2]. Following recent works [46, 26], we augmented the training data using random horizontal flipping, vertical flipping, and 90-degree rotation. Note that, during data augmentation, all SAIs need to be flipped and rotated along both the spatial and angular directions to maintain the LF structure.
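As an illustration of this data preparation step, the snippet below bicubic-downsamples a stack of HR SAIs and packs the LR views into an MPI. The tensor layout (views stored as (B, A*A, H, W) in row-major angular order), the use of torch's bicubic interpolation, and the function names are assumptions for this sketch, not the authors' pipeline.

```python
import torch
import torch.nn.functional as F

def sai_stack_to_mpi(sais, ang_res):
    """Re-organize a stack of SAIs (B, A*A, H, W) into a macro-pixel image (B, 1, A*H, A*W)."""
    b, n, h, w = sais.shape
    a = ang_res
    x = sais.view(b, a, a, h, w)          # (B, u, v, h, w)
    x = x.permute(0, 3, 1, 4, 2)          # (B, h, u, w, v): interleave spatial and angular axes
    return x.reshape(b, 1, a * h, a * w)  # macro-pixel image

def make_lr_mpi(hr_sais, ang_res, scale):
    """Bicubic-downsample each HR SAI and pack the LR views into the network input."""
    lr = F.interpolate(hr_sais, scale_factor=1.0 / scale,
                       mode='bicubic', align_corners=False)
    return sai_stack_to_mpi(lr, ang_res)

# example: 5x5 views, 128x128 HR patches, 2x downsampling
lr_mpi = make_lr_mpi(torch.randn(1, 25, 128, 128), ang_res=5, scale=2)   # (1, 1, 320, 320)
```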

By default, we used the model with a feature depth of 64 (i.e., LF-InterNet_64) for both 2× and 4× SR. We also investigated the performance of other variants of our LF-InterNet in Section 4.2. We used PSNR and SSIM as quantitative metrics for performance evaluation. Note that, PSNR and SSIM were calculated separately on the Y channel of each SAI. To obtain the overall metric score for a dataset, we first obtain the score of each scene by averaging the scores of all its SAIs, and then obtain the overall score by averaging the scores of all the scenes in the dataset.
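The aggregation described above can be summarized in a few lines. The helper below is only a reference for how scores are averaged (per SAI, then per scene, then per dataset), not the authors' evaluation script; it assumes Y-channel images with values in [0, 1].

```python
import numpy as np

def scene_psnr(sr_sais, hr_sais):
    """Average PSNR over all SAIs of one scene (Y channel, values in [0, 1])."""
    scores = []
    for sr, hr in zip(sr_sais, hr_sais):                       # each SAI: (H, W) array
        mse = np.mean((np.asarray(sr, np.float64) - np.asarray(hr, np.float64)) ** 2)
        scores.append(10.0 * np.log10(1.0 / mse))
    return float(np.mean(scores))

# dataset score: average the per-scene scores
# dataset_psnr = np.mean([scene_psnr(sr, hr) for (sr, hr) in scenes])
```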

Our LF-InterNet was implemented in PyTorch on a PC with an Nvidia RTX 2080Ti GPU. Our model was initialized using the Xavier method [8] and optimized using the Adam method [16]. The learning rate was decreased by a fixed factor every several epochs, and the training took about one day.

4.2 Ablation Study

In this subsection, we compare the performance of our LF-InterNet with different architectures and angular resolutions to investigate the potential benefits introduced by different modules.

4.2.1 Network Architecture

Model PSNR SSIM Params.
LF-InterNet-onlySpatial
LF-InterNet-onlyAngular
LF-InterNet-SAcoupled
LF-InterNet
Bicubic
VDSR [15]
EDSR [20]
Table 2: Comparative results achieved on the STFlytro dataset [21] by our LF-InterNet with different settings. Note that, the results of bicubic interpolation, VDSR [15], and EDSR [20] are also listed as baselines.

Angular information. We investigated the benefit of angular information by removing the angular path in LF-InterNet. That is, we only used SFEs for LF image SR. Consequently, the network is identical to a single image SR network, and can only incorporate the spatial information within each SAI. As shown in Table 2, using only the spatial information, the network (i.e., LF-InterNet-onlySpatial) achieves a notably lower PSNR and SSIM than the full LF-InterNet. Both the performance and the parameter number of LF-InterNet-onlySpatial are between those of VDSR [15] and EDSR [20].

Spatial information. To investigate the benefit introduced by spatial information, we changed the kernel size of all SFEs from $3 \times 3$ to $1 \times 1$. In this case, the spatial information cannot be exploited and integrated by the convolutions. As shown in Table 2, the performance of LF-InterNet-onlyAngular is even inferior to that of bicubic interpolation. That is because the neighborhood context in an image is highly significant in recovering details. Consequently, spatial information plays a major role in LF image SR, while angular information serves only as a complement to spatial information and cannot be used alone.

Information decoupling. To investigate the benefit of spatial-angular information decoupling, we stacked all SAIs along the channel dimension as input, and used ordinary convolutions to extract both spatial and angular information from these stacked images. Note that, the cascaded framework with global and local residual learning was maintained to keep the overall network architecture unchanged, and the feature depth was adjusted to keep the number of parameters comparable to that of LF-InterNet. As shown in Table 2, LF-InterNet-SAcoupled is inferior to LF-InterNet. That is, with a comparable number of parameters, LF-InterNet can handle the 4D LF structure and achieve LF image SR in a more efficient way.

IG_1 IG_2 IG_3 IG_4 PSNR SSIM Params.
Table 3: Comparative results achieved on the STFlytro dataset [21] by our LF-InterNet with different numbers of interactions.
AngRes Scale PSNR SSIM Params.
Table 4: Comparative results achieved on the STFlytro dataset [21] by our LF-InterNet with different angular resolutions for 2× and 4× SR.
Method | Scale | EPFL [22] | HCInew [10] | HCIold [36] | INRIA [17] | STFgantry [29] | STFlytro [21] | Average
Bicubic | ×2 | 29.50/0.935 | 31.69/0.934 | 37.46/0.978 | 31.10/0.956 | 30.82/0.947 | 33.02/0.950 | 32.27/0.950
VDSR [15] | ×2 | 32.01/0.959 | 34.37/0.956 | 40.34/0.985 | 33.80/0.972 | 35.80/0.980 | 35.91/0.970 | 35.37/0.970
EDSR [20] | ×2 | 32.86/0.965 | 35.02/0.961 | 41.11/0.988 | 34.61/0.977 | 37.08/0.985 | 36.87/0.975 | 36.26/0.975
RCAN [47] | ×2 | 33.46/0.967 | 35.56/0.963 | 41.59/0.989 | 35.18/0.978 | 38.18/0.988 | 37.32/0.977 | 36.88/0.977
LFBM5D [1] | ×2 | 31.15/0.955 | 33.72/0.955 | 39.62/0.985 | 32.85/0.969 | 33.55/0.972 | 35.01/0.966 | 34.32/0.967
GBSQ [23] | ×2 | 31.22/0.959 | 35.25/0.969 | 40.21/0.988 | 32.76/0.972 | 35.44/0.983 | 35.04/0.956 | 34.99/0.971
LFSSR_4D [41] | ×2 | 32.56/0.967 | 34.47/0.960 | 41.04/0.989 | 34.06/0.976 | 34.08/0.975 | 36.62/0.976 | 35.47/0.974
resLF [46] | ×2 | 33.22/0.969 | 35.79/0.969 | 42.30/0.991 | 34.86/0.979 | 36.28/0.985 | 35.80/0.970 | 36.38/0.977
LF-InterNet_32 | ×2 | 34.43/0.975 | 36.96/0.974 | 43.99/0.994 | 36.31/0.983 | 37.40/0.989 | 38.47/0.982 | 37.88/0.983
LF-InterNet_64 | ×2 | 34.76/0.976 | 37.20/0.976 | 44.65/0.995 | 36.64/0.984 | 38.48/0.991 | 38.81/0.983 | 38.42/0.984
Bicubic | ×4 | 25.14/0.831 | 27.61/0.851 | 32.42/0.934 | 26.82/0.886 | 25.93/0.843 | 27.84/0.855 | 27.63/0.867
VDSR [15] | ×4 | 26.82/0.869 | 29.12/0.876 | 34.01/0.943 | 28.87/0.914 | 28.31/0.893 | 29.17/0.880 | 29.38/0.896
EDSR [20] | ×4 | 27.82/0.892 | 29.94/0.893 | 35.53/0.957 | 29.86/0.931 | 29.43/0.921 | 30.29/0.903 | 30.48/0.916
RCAN [47] | ×4 | 28.31/0.899 | 30.25/0.896 | 35.89/0.959 | 30.36/0.936 | 30.25/0.934 | 30.66/0.909 | 30.95/0.922
LFBM5D [1] | ×4 | 26.61/0.869 | 29.13/0.882 | 34.23/0.951 | 28.49/0.914 | 28.30/0.900 | 29.07/0.881 | 29.31/0.900
GBSQ [23] | ×4 | 26.02/0.863 | 28.92/0.884 | 33.74/0.950 | 27.73/0.909 | 28.11/0.901 | 28.37/0.973 | 28.82/0.913
LFSSR_4D [41] | ×4 | 27.39/0.894 | 29.61/0.893 | 35.40/0.962 | 29.26/0.930 | 28.53/0.908 | 30.26/0.908 | 30.08/0.916
resLF [46] | ×4 | 27.86/0.899 | 30.37/0.907 | 36.12/0.966 | 29.72/0.936 | 29.64/0.927 | 28.94/0.891 | 30.44/0.921
LF-InterNet_32 | ×4 | 29.16/0.912 | 30.74/0.913 | 36.78/0.970 | 31.30/0.947 | 29.92/0.934 | 31.49/0.923 | 31.57/0.933
LF-InterNet_64 | ×4 | 29.52/0.917 | 31.01/0.917 | 37.23/0.972 | 31.65/0.950 | 30.44/0.941 | 31.84/0.927 | 31.95/0.937

    Note: Since the 4× SR model of resLF [46] is unavailable, we cascaded two 2× SR models for 4× SR.

Table 5: PSNR/SSIM values achieved by different methods for 2× and 4× SR.

Spatial-angular interaction. We investigated the benefits introduced by our spatial-angular interaction mechanism. Specifically, we canceled the feature interaction in each Inter-Group by removing the upsampling and AFE modules in each Inter-Block (see Fig. 5 (b)). In this case, spatial and angular features can only be processed separately. When all interactions are removed, the spatial and angular features can only be incorporated by the bottleneck block. Table 3 presents the results achieved by our LF-InterNet with different numbers of interactions. It can be observed that, without any feature interaction, our network achieves performance comparable to that of the LF-InterNet-onlySpatial model in both PSNR and SSIM. That is, the angular and spatial information cannot be effectively incorporated by the bottleneck block alone. As the number of interactions increases, the performance steadily improves. This clearly demonstrates the effectiveness of our spatial-angular feature interaction mechanism.

4.2.2 Angular Resolution

We compared the performance of LF-InterNet with different angular resolutions. Specifically, we extracted central SAIs of different angular resolutions from the input LFs, and trained different models for both 2× and 4× SR. As shown in Table 4, the PSNR and SSIM values for both 2× and 4× SR improve as the angular resolution increases. That is because additional views provide rich angular information for LF image SR. It is also notable that the improvements tend to saturate at the largest angular resolutions, with only a minor PSNR gain for the final increase. That is because the complementary information provided by the additional views is already sufficient. Once the angular information is fully exploited, further increasing the number of views can only provide minor performance improvements.
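For reference, extracting a central sub-grid of views from a densely sampled LF only requires symmetric cropping along the two angular dimensions. A minimal sketch, assuming the LF is stored as an array of shape (U, V, H, W):

```python
def central_views(lf, ang_res_out):
    """Extract the central ang_res_out x ang_res_out SAIs from an LF stored as (U, V, H, W)."""
    u, v = lf.shape[0], lf.shape[1]
    su = (u - ang_res_out) // 2
    sv = (v - ang_res_out) // 2
    return lf[su:su + ang_res_out, sv:sv + ang_res_out]
```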

4.3 Comparison to the State-of-the-arts

Figure 6: Visual results of 2× SR.
Figure 7: Visual results of 4× SR.

We compare our method to three milestone single image SR methods (i.e., VDSR [15], EDSR [20], and RCAN [47]) and four state-of-the-art LF image SR methods (i.e., LFBM5D [1], GBSQ [23], LFSSR [41], and resLF [46]). All these methods were implemented using their released codes and pre-trained models. We also present the results of bicubic interpolation as the baseline. For simplicity, we only present the results for 2× and 4× SR. Since the angular resolution of LFSSR [41] is fixed, we use its original version with its default number of input SAIs.

Quantitative Results. Quantitative results are presented in Table 5. For both 2× and 4× SR, our method (i.e., LF-InterNet_64) achieves the best results on all the datasets and surpasses existing methods by a large margin. For example, average PSNR improvements of 2.04 dB and 1.51 dB over the state-of-the-art LF image SR method resLF [46] can be observed for 2× and 4× SR, respectively. It is worth noting that, even when the feature depth of our model is halved to 32, our method (i.e., LF-InterNet_32) still achieves the highest SSIM scores on all the datasets and the highest PSNR scores on 5 out of the 6 datasets as compared to existing methods. Moreover, the number of parameters of LF-InterNet_32 is significantly smaller than that of recent deep learning-based SR methods [47, 41, 46].

Qualitative Results. Qualitative results for 2× and 4× SR are shown in Figs. 6 and 7, with more visual comparisons provided in our supplemental material. It can be observed from Fig. 6 that our method can well preserve the textures and details (e.g., the horizontal stripes in the scene HCInew_origami and the stairway in the scene INRIA_Sculpture) in the super-resolved images. In contrast, although the single image SR method RCAN [47] achieves high PSNR and SSIM scores, the images generated by RCAN [47] are over-smoothed and poor in detail. It can be observed from Fig. 7 that the visual superiority of our method is even more obvious for 4× SR. Since the input LR images are severely degraded by the down-sampling operation, 4× SR is highly ill-posed. Single image SR methods use spatial information only to hallucinate missing details, and they usually generate ambiguous and even fake textures (e.g., the window frame in the scene EPFL_Palais generated by RCAN [47]). In contrast, LF image SR methods can use the complementary angular information among different views to produce faithful results. However, the results generated by existing LF image SR methods [23, 41, 46] are relatively blurry. As compared to these single image and LF image SR methods, the results produced by our LF-InterNet are much closer to the groundtruth images.

Figure 8: Visualizations of the PSNR and SSIM values achieved by resLF [46] and LF-InterNet on each perspective of the scene HCIold_MonasRoom [36] for both 2× and 4× SR. Our LF-InterNet achieves high reconstruction quality with a balanced distribution among different SAIs.

Performance w.r.t. Perspectives. Since our LF-InterNet super-resolves all SAIs in an LF, we further investigate the reconstruction quality with respect to different perspectives. We used the central views of the scene HCIold_MonasRoom [36] as input to perform both 2× and 4× SR. The PSNR and SSIM values are calculated for each perspective and visualized in Fig. 8. Since resLF [46] uses only part of the views to super-resolve each perspective, the reconstruction quality of resLF [46] for non-central views is relatively low. In contrast, our LF-InterNet jointly uses the angular information from all input views to super-resolve each perspective, and thus achieves much higher reconstruction quality with a more balanced distribution among different perspectives.

5 Conclusion

In this paper, we proposed a deep convolutional network, namely LF-InterNet, for LF image SR. We first introduced an approach to extract and decouple spatial and angular features, and then designed a feature interaction mechanism to incorporate the spatial and angular information. Experimental results have clearly demonstrated the superiority of our method. Our LF-InterNet outperforms state-of-the-art SR methods by a large margin in terms of PSNR and SSIM, and can recover rich details in the reconstructed images.

6 Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (No. 61972435, 61602499), Natural Science Foundation of Guangdong Province, Fundamental Research Funds for the Central Universities (No. 18lgzd06).

References

  • [1] M. Alain and A. Smolic (2018) Light field super-resolution via LFBM5D sparse coding. In IEEE International Conference on Image Processing (ICIP), pp. 2501–2505.
  • [2] Y. Anagun, S. Isik, and E. Seke (2019) SRLibrary: comparing different loss functions for super-resolution over various convolutional architectures. Journal of Visual Communication and Image Representation 61, pp. 178–187.
  • [3] S. Anwar, S. Khan, and N. Barnes (2019) A deep journey into super-resolution: a survey. arXiv preprint arXiv:1904.07523.
  • [4] T. E. Bishop and P. Favaro (2011) The light field camera: extended depth of field, aliasing, and superresolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (5), pp. 972–986.
  • [5] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), pp. 184–199.
  • [6] K. Egiazarian and V. Katkovnik (2015) Single image super-resolution via BM3D sparse coding. In European Signal Processing Conference (EUSIPCO), pp. 2849–2853.
  • [7] R. A. Farrugia, C. Galea, and C. Guillemot (2017) Super resolution of light field images using linear subspace projection of patch-volumes. IEEE Journal of Selected Topics in Signal Processing 11 (7), pp. 1058–1071.
  • [8] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 249–256.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [10] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke (2016) A dataset and evaluation methodology for depth estimation on 4D light fields. In Asian Conference on Computer Vision (ACCV), pp. 19–34.
  • [11] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708.
  • [13] Y. Huang, W. Wang, and L. Wang (2015) Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems (NeurIPS), pp. 235–243.
  • [14] Y. Jianchao, W. John, H. Thomas, and M. Yi (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19 (11), pp. 2861–2873.
  • [15] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654.
  • [16] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [17] M. Le Pendu, X. Jiang, and C. Guillemot (2018) Light field inpainting propagation via low rank matrix completion. IEEE Transactions on Image Processing 27 (4), pp. 1981–1993.
  • [18] M. Levoy and P. Hanrahan (1996) Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 31–42.
  • [19] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu (2014) Saliency detection on light field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2806–2813.
  • [20] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 136–144.
  • [21] A. S. Raj, M. Lowney, R. Shah, and G. Wetzstein (2016) Stanford Lytro light field archive.
  • [22] M. Rerabek and T. Ebrahimi (2016) New light field image dataset. In International Conference on Quality of Multimedia Experience (QoMEX).
  • [23] M. Rossi and P. Frossard (2018) Geometry-consistent light field super-resolution via graph-based regularization. IEEE Transactions on Image Processing 27 (9), pp. 4207–4218.
  • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
  • [25] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1874–1883.
  • [26] C. Shin, H. Jeon, Y. Yoon, I. So Kweon, and S. Joo Kim (2018) EPINET: a fully-convolutional neural network using epipolar geometry for depth from light field images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4748–4757.
  • [27] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 114–125.
  • [28] R. Timofte, V. De Smet, and L. Van Gool (2013) Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1920–1927.
  • [29] V. Vaish and A. Adams (2008) The (new) Stanford light field archive. Computer Graphics Laboratory, Stanford University 6 (7).
  • [30] K. Venkataraman, D. Lelescu, J. Duparré, A. McMahon, G. Molina, P. Chatterjee, R. Mullis, and S. Nayar (2013) PiCam: an ultra-thin high performance monolithic camera array. ACM Transactions on Graphics 32 (6), pp. 166.
  • [31] Y. Wang, T. Wu, J. Yang, L. Wang, W. An, and Y. Guo (2020) DeOccNet: learning to see through foreground occlusions in light fields. In Winter Conference on Applications of Computer Vision (WACV).
  • [32] Y. Wang, J. Yang, Y. Guo, C. Xiao, and W. An (2018) Selective light field refocusing for camera arrays using bokeh rendering and superresolution. IEEE Signal Processing Letters 26 (1), pp. 204–208.
  • [33] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan (2018) LFNet: a novel bidirectional recurrent convolutional neural network for light-field image super-resolution. IEEE Transactions on Image Processing 27 (9), pp. 4274–4286.
  • [34] Z. Wang, J. Chen, and S. C. Hoi (2019) Deep learning for image super-resolution: a survey. arXiv preprint arXiv:1902.06068.
  • [35] S. Wanner and B. Goldluecke (2013) Variational light field analysis for disparity estimation and super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 606–619.
  • [36] S. Wanner, S. Meister, and B. Goldluecke (2013) Datasets and benchmarks for densely sampled 4D light fields. In Vision, Modelling and Visualization (VMV), Vol. 13, pp. 225–226.
  • [37] B. Wilburn, N. Joshi, V. Vaish, E. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy (2005) High performance imaging using large camera arrays. ACM Transactions on Graphics 24, pp. 765–776.
  • [38] G. Wu, Y. Liu, Q. Dai, and T. Chai (2019) Learning sheared EPI structure for light field reconstruction. IEEE Transactions on Image Processing 28 (7), pp. 3261–3273.
  • [39] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, and Y. Liu (2017) Light field reconstruction using deep convolutional network on EPI. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6319–6327.
  • [40] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, and Q. Liao (2019) Deep learning for single image super-resolution: a brief review. IEEE Transactions on Multimedia.
  • [41] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung (2018) Light field spatial super-resolution using deep efficient spatial-angular separable convolution. IEEE Transactions on Image Processing 28 (5), pp. 2319–2330.
  • [42] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. S. Kweon (2017) Light-field image super-resolution using convolutional neural network. IEEE Signal Processing Letters 24 (6), pp. 848–852.
  • [43] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. So Kweon (2015) Learning a deep convolutional network for light-field image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 24–32.
  • [44] Y. Yuan, Z. Cao, and L. Su (2018) Light-field image superresolution using a combined deep CNN based on EPI. IEEE Signal Processing Letters 25 (9), pp. 1359–1363.
  • [45] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730.
  • [46] S. Zhang, Y. Lin, and H. Sheng (2019) Residual networks for light field image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11046–11055.
  • [47] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301.
  • [48] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2472–2481.
  • [49] H. Zhu, M. Guo, H. Li, Q. Wang, and A. Robles-Kelly (2019) Revisiting spatio-angular trade-off in light field cameras and extended applications in super-resolution. IEEE Transactions on Visualization and Computer Graphics.