Light Field Image Super-Resolution with Transformers

by Zhengyu Liang, et al.

Light field (LF) image super-resolution (SR) aims at reconstructing high-resolution LF images from their low-resolution counterparts. Although CNN-based methods have achieved remarkable performance in LF image SR, these methods cannot fully model the non-local properties of the 4D LF data. In this paper, we propose a simple but effective Transformer-based method for LF image SR. In our method, an angular Transformer is designed to incorporate complementary information among different views, and a spatial Transformer is developed to capture both local and long-range dependencies within each sub-aperture image. With the proposed angular and spatial Transformers, the beneficial information in an LF can be fully exploited and the SR performance is boosted. We validate the effectiveness of our angular and spatial Transformers through extensive ablation studies, and compare our method to recent state-of-the-art methods on five public LF datasets. Our method achieves superior SR performance with a small model size and low computational cost.




I. Introduction

Light field (LF) cameras record both intensity and directions of light rays, and enable many applications such as post-capture refocusing [1, 2], depth sensing [3, 4], saliency detection [5], and de-occlusion [6, 7]. Since high-resolution (HR) images are required in these applications, it is necessary to exploit the complementary information among different views (i.e., angular information) to achieve LF image super-resolution (SR).

In the past few years, convolutional neural networks (CNNs) have been widely used for LF image SR and have achieved promising performance [8, 9, 10, 11, 12, 13, 14, 15]. Yoon et al. [8] proposed the first CNN-based method called LFCNN to improve the resolution of LF images. Yuan et al. [9] applied EDSR [16] to super-resolve each sub-aperture image (SAI) independently, and developed an EPI-enhancement network to refine the super-resolved images. Zhang et al. [11] proposed a multi-branch residual network to incorporate the multi-directional epipolar geometry prior for LF image SR. Since both view-wise angular information and image-wise spatial information contribute to the SR performance, state-of-the-art CNN-based methods [12, 14, 13, 15] designed different network structures to leverage both angular and spatial information for LF image SR.

Although continuous progress has been achieved in reconstruction accuracy via delicate network designs, existing CNN-based LF image SR methods have two limitations. First, these methods either use only part of the views to reduce the complexity of the 4D LF structure [8, 9, 10, 11], or integrate angular information without considering view position and image content [12, 14, 13]. The under-use of the rich angular information results in performance degradation, especially on complex scenes (e.g., occlusions and non-Lambertian surfaces). Second, existing CNN-based methods extract spatial features by applying (cascaded) convolutions on SAIs. The local receptive field of convolutions hinders these methods from capturing long-range spatial dependencies in the input images. In summary, existing CNN-based LF image SR methods cannot fully exploit both angular and spatial information, and thus face a bottleneck for further performance improvement.

Recently, Transformers have been demonstrated effective in modeling positional and long-range correlations, and were applied to various computer vision tasks such as image classification [17, 18], object detection [17, 19], semantic segmentation [20], and depth estimation [21]. In the area of low-level vision, Chen et al. [22] developed an image processing transformer with multi-heads and multi-tails. Their method achieves state-of-the-art performance on image denoising, deraining and SR. Wang et al. [23] proposed a hierarchical U-shaped Transformer to capture both local and non-local context information for image restoration. Cao et al. [24] proposed a Transformer-based network to exploit correlations among different frames for video SR.

Inspired by the recent advances of Transformers, in this paper, we propose a Transformer-based network (i.e., LFT) to address the aforementioned limitations of CNN-based methods. Specifically, we design an angular Transformer to model the relationship among different views, and design a spatial Transformer to capture both local and non-local context information within each SAI. Compared to CNN-based methods, our LFT can discriminately incorporate the information from all angular views, and capture long-range spatial dependencies in each SAI.

The contributions of this paper can be summarized as: 1) We make the first attempt to adapt Transformers to LF image processing, and propose a Transformer-based network for LF image SR. 2) We propose a novel paradigm (i.e., angular and spatial Transformers) to incorporate angular and spatial information in an LF. The effectiveness of our paradigm is validated through extensive ablation studies. 3) With a small model size and low computational cost, our LFT achieves superior SR performance than other state-of-the-art methods.

Fig. 1: An overview of our network.

II. Method

We formulate an LF as a 4D tensor $\mathcal{L} \in \mathbb{R}^{U \times V \times H \times W}$, where $U$ and $V$ represent the angular dimensions, and $H$ and $W$ represent the spatial dimensions. Specifically, an LF can be considered as a $U \times V$ array of SAIs of size $H \times W$. Following [11, 25, 15, 13, 12, 14], we achieve LF image SR using SAIs distributed in a square array (i.e., $U = V = A$). As shown in Fig. 1, our network consists of three stages: initial feature extraction, Transformer-based feature incorporation (in our LFT, we cascade four angular Transformers with four spatial Transformers alternately for deep feature extraction), and up-sampling.
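As a concrete illustration of the two views of a 4D LF (a minimal NumPy sketch with hypothetical sizes, not the authors' code), the tensor can be rearranged between its SAI-array layout and a token-friendly layout:

```python
import numpy as np

# Hypothetical sizes for illustration: a 5x5 LF of 32x32 SAIs
U, V, H, W = 5, 5, 32, 32
lf = np.random.rand(U, V, H, W).astype(np.float32)  # 4D LF tensor

# SAI-array view: A^2 = U*V sub-aperture images of size H x W
sais = lf.reshape(U * V, H, W)

# Angular-token view: one length-A^2 sequence per spatial location (B = H*W)
ang_tokens = lf.reshape(U * V, H * W).transpose()

assert sais.shape == (25, 32, 32)
assert ang_tokens.shape == (1024, 25)
```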

II-A. Angular Transformer

The input LF images are first processed by cascaded 3×3 convolutions to generate initial features $\mathbf{F}$ with $C$ channels. The extracted features are then fed to the angular Transformer to model the angular dependencies. Our angular Transformer is designed to correlate highly relevant features along the angular dimension, and can fully exploit the complementary information among all input views.

Specifically, feature $\mathbf{F}$ is first reshaped into a sequence of angular tokens $\mathbf{T}_{ang} \in \mathbb{R}^{B \times A^2 \times C}$, where $B = H \times W$ represents the batch dimension, $A^2$ is the length of the sequence, and $C$ denotes the embedding dimension of each angular token. Then, we perform angular positional encoding to model the positional correlation of different views [26], i.e.,

$$P_{(a,\,2i)} = \sin\!\left(a / 10000^{2i/C}\right), \qquad P_{(a,\,2i+1)} = \cos\!\left(a / 10000^{2i/C}\right),$$

where $a$ represents the angular position and $i$ denotes the channel index in the embedding dimension.
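The sine/cosine positional codes above follow the standard scheme of [26]; a minimal NumPy sketch (with a hypothetical sequence length and an even embedding dimension assumed for simplicity) is:

```python
import numpy as np

def sinusoidal_encoding(length, dim):
    """Standard sine/cosine positional codes (Vaswani et al. [26]).

    length: number of positions (e.g., A^2 angular positions);
    dim: embedding dimension C (assumed even here for simplicity).
    """
    pos = np.arange(length)[:, None]           # position index a
    i = np.arange(dim // 2)[None, :]           # channel-pair index
    angle = pos / np.power(10000.0, 2.0 * i / dim)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(angle)               # even channels: sine
    enc[:, 1::2] = np.cos(angle)               # odd channels: cosine
    return enc

codes = sinusoidal_encoding(25, 64)  # e.g., 5x5 angular positions, C = 64
assert codes.shape == (25, 64)
```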

The angular position codes $\mathbf{P}_{ang}$ are directly added to $\mathbf{T}_{ang}$ and passed through a layer normalization (LN) to generate query $\mathbf{Q}$ and key $\mathbf{K}$, i.e., $\mathbf{Q} = \mathbf{K} = \mathrm{LN}(\mathbf{T}_{ang} + \mathbf{P}_{ang})$. Value $\mathbf{V}$ is directly assigned as $\mathbf{T}_{ang}$, i.e., $\mathbf{V} = \mathbf{T}_{ang}$. Afterward, we apply multi-head self-attention (MHSA) to learn the relationship among different angular tokens. Similar to other MHSA approaches [26, 22, 18], the embedding dimension of $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ is split into $N_h$ groups, where $N_h$ is the number of heads. For each attention head, the calculation can be formulated as:

$$\mathrm{head}_k = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_k \mathbf{W}_k^{Q} \left(\mathbf{K}_k \mathbf{W}_k^{K}\right)^{T}}{\sqrt{C/N_h}}\right) \mathbf{V}_k \mathbf{W}_k^{V},$$

where $k$ denotes the index of the head groups, and $\mathbf{W}_k^{Q}$, $\mathbf{W}_k^{K}$ and $\mathbf{W}_k^{V}$ are the linear projection matrices. In summary, the MHSA can be formulated as:

$$\mathrm{MHSA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_{N_h}\right)\mathbf{W}^{O},$$

where $\mathbf{W}^{O}$ is the output projection matrix and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation.

To further incorporate the correlations built by the MHSA, the tokens are fed to a feed-forward network (FFN), which consists of an LN and a multi-layer perceptron (MLP) layer. In summary, the calculation process of our angular Transformer can be formulated as:

$$\mathbf{T}'_{ang} = \mathrm{MHSA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) + \mathbf{T}_{ang}, \qquad \hat{\mathbf{T}}_{ang} = \mathrm{MLP}\!\left(\mathrm{LN}(\mathbf{T}'_{ang})\right) + \mathbf{T}'_{ang}.$$

Finally, $\hat{\mathbf{T}}_{ang}$ is reshaped back into the feature form of $\mathbf{F}$ and fed to the subsequent spatial Transformer to incorporate spatial context information.
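The angular Transformer described above can be sketched in NumPy as follows. This is a simplified single-head illustration with hypothetical dimensions, not the authors' implementation; the residual connections around the attention and the MLP follow common Transformer practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def angular_block(tokens, pos, Wq, Wk, Wv, W1, W2):
    """One simplified angular Transformer block (single attention head).

    Q and K come from the position-aware normalized tokens; V is the
    tokens themselves, as described in the text.
    """
    qk_in = layer_norm(tokens + pos)
    q, k, v = qk_in @ Wq, qk_in @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))      # (A^2, A^2) angular attention
    t = attn @ v + tokens                               # attention + residual
    return np.maximum(layer_norm(t) @ W1, 0) @ W2 + t   # FFN (LN + 2-layer MLP) + residual

A2, C = 25, 32  # hypothetical: 5x5 views, embedding dimension 32
tokens = rng.standard_normal((A2, C))
pos = rng.standard_normal((A2, C))
Wq, Wk, Wv = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
W1, W2 = 0.1 * rng.standard_normal((C, 2 * C)), 0.1 * rng.standard_normal((2 * C, C))
out = angular_block(tokens, pos, Wq, Wk, Wv, W1, W2)
assert out.shape == (A2, C)
```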

II-B. Spatial Transformer

The goal of our spatial Transformer is to leverage both local context information and long-range spatial dependencies within each SAI. Specifically, the input feature is first unfolded in each 3×3 neighborhood [27], and then fed to an MLP to achieve local feature embedding. That is,

$$\mathbf{F}_{un}(x, y) = \mathrm{MLP}\!\left(\mathrm{Concat}\left(\{\mathbf{F}(x+i,\, y+j)\}_{i,j \in \{-1,0,1\}}\right)\right),$$

where $(x, y)$ denotes an arbitrary spatial coordinate on feature $\mathbf{F}$. The locally assembled feature is then cropped into overlapping spatial tokens $\mathbf{T}_{spa} \in \mathbb{R}^{B \times N \times C}$, where $B$ denotes the batch dimension, $N$ represents the length of the sequence, and $C$ represents the embedding dimension of the spatial tokens.
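The 3×3 unfolding step can be sketched in NumPy (a minimal illustration with hypothetical feature sizes and zero padding at the borders; an actual implementation would use, e.g., `torch.nn.Unfold`):

```python
import numpy as np

def unfold3x3(feat):
    """Gather each pixel's 3x3 neighborhood along the channel axis.

    feat: (H, W, C) feature map; returns (H, W, 9*C), zero-padded at borders.
    """
    H, W, C = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    patches = [padded[i:i + H, j:j + W] for i in range(3) for j in range(3)]
    return np.concatenate(patches, axis=-1)

feat = np.random.rand(8, 8, 4).astype(np.float32)  # hypothetical SAI feature
local = unfold3x3(feat)
assert local.shape == (8, 8, 36)
# An MLP would then project the 9*C assembled channels back to C
```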

Method           | 2×SR: EPFL  | HCInew      | HCIold      | INRIA       | STFgantry   | 4×SR: EPFL  | HCInew      | HCIold      | INRIA       | STFgantry
Bicubic          | 29.74/0.941 | 31.89/0.939 | 37.69/0.979 | 31.33/0.959 | 31.06/0.954 | 25.14/0.833 | 27.61/0.853 | 32.42/0.931 | 26.82/0.886 | 25.93/0.847
VDSR [29]        | 32.50/0.960 | 34.37/0.956 | 40.61/0.987 | 34.43/0.974 | 35.54/0.979 | 27.25/0.878 | 29.31/0.883 | 34.81/0.952 | 29.19/0.921 | 28.51/0.901
EDSR [16]        | 33.09/0.963 | 34.83/0.959 | 41.01/0.988 | 34.97/0.977 | 36.29/0.982 | 27.84/0.886 | 29.60/0.887 | 35.18/0.954 | 29.66/0.926 | 28.70/0.908
RCAN [30]        | 33.16/0.964 | 34.98/0.960 | 41.05/0.988 | 35.01/0.977 | 36.33/0.983 | 27.88/0.886 | 29.63/0.888 | 35.20/0.954 | 29.76/0.927 | 28.90/0.911
resLF [11]       | 33.62/0.971 | 36.69/0.974 | 43.42/0.993 | 35.39/0.981 | 38.36/0.990 | 28.27/0.904 | 30.73/0.911 | 36.71/0.968 | 30.34/0.941 | 30.19/0.937
LFSSR [25]       | 33.68/0.974 | 36.81/0.975 | 43.81/0.994 | 35.28/0.983 | 37.95/0.990 | 28.27/0.908 | 30.72/0.912 | 36.70/0.969 | 30.31/0.945 | 30.15/0.939
LF-ATO [13]      | 34.27/0.976 | 37.24/0.977 | 44.20/0.994 | 36.15/0.984 | 39.64/0.993 | 28.52/0.912 | 30.88/0.914 | 37.00/0.970 | 30.71/0.949 | 30.61/0.943
LF-InterNet [12] | 34.14/0.972 | 37.28/0.977 | 44.45/0.995 | 35.80/0.985 | 38.72/0.992 | 28.67/0.914 | 30.98/0.917 | 37.11/0.972 | 30.64/0.949 | 30.53/0.943
LF-DFnet [14]    | 34.44/0.977 | 37.44/0.979 | 44.23/0.994 | 36.36/0.984 | 39.61/0.993 | 28.77/0.917 | 31.23/0.920 | 37.32/0.972 | 30.83/0.950 | 31.15/0.949
LFT (ours)       | 34.56/0.978 | 37.74/0.979 | 44.55/0.995 | 36.44/0.985 | 40.25/0.994 | 29.02/0.918 | 31.25/0.921 | 37.47/0.973 | 31.01/0.950 | 31.47/0.951
TABLE I: PSNR/SSIM values achieved by different methods for 2×SR and 4×SR on the EPFL [31], HCInew [32], HCIold [33], INRIA [34] and STFgantry [35] datasets. The best results are in bold faces and the second best results are underlined.

By performing feature unfolding and overlapping cropping, the local context information can be fully integrated into the generated spatial tokens, which enables our spatial Transformer to capture both local and non-local dependencies. To further model the spatial position information, we perform 2D positional encoding on the spatial tokens:

$$P_{(x,\,2i)} = \sin\!\left(x / 10000^{4i/C}\right), \qquad P_{(x,\,2i+1)} = \cos\!\left(x / 10000^{4i/C}\right),$$

with the analogous encoding of the vertical coordinate $y$ occupying the other half of the embedding dimension, where $(x, y)$ denotes the spatial position and $i$ denotes the index in the embedding dimension. Then, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ can be calculated according to:

$$\mathbf{Q} = \mathbf{K} = \mathrm{LN}(\mathbf{T}_{spa} + \mathbf{P}_{spa}), \qquad \mathbf{V} = \mathbf{T}_{spa}.$$

Similar to the proposed angular Transformer, we use the MHSA and FFN to build our spatial Transformer. That is,

$$\mathbf{T}'_{spa} = \mathrm{MHSA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) + \mathbf{T}_{spa}, \qquad \hat{\mathbf{T}}_{spa} = \mathrm{MLP}\!\left(\mathrm{LN}(\mathbf{T}'_{spa})\right) + \mathbf{T}'_{spa}.$$
Then, $\hat{\mathbf{T}}_{spa}$ is reshaped back into feature form and fed to the next angular Transformer. After passing through all the angular and spatial Transformers, both angular and spatial information in an LF can be fully incorporated. Finally, we apply pixel shuffling [28] to achieve feature up-sampling and obtain the super-resolved LF image.
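The pixel shuffling (sub-pixel convolution) step of [28] rearranges channel groups into spatial sub-pixels; a minimal NumPy sketch with hypothetical sizes:

```python
import numpy as np

def pixel_shuffle(feat, r):
    """Rearrange (H, W, C*r^2) features into an (H*r, W*r, C) image,
    as in the sub-pixel convolution of Shi et al. [28]."""
    H, W, Cr2 = feat.shape
    C = Cr2 // (r * r)
    x = feat.reshape(H, W, r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)  # interleave the r x r sub-pixels
    return x.reshape(H * r, W * r, C)

feat = np.random.rand(32, 32, 4).astype(np.float32)  # hypothetical: C=1, r=2
sr = pixel_shuffle(feat, 2)
assert sr.shape == (64, 64, 1)
```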

III. Experiments

In this section, we first introduce our implementation details, then compare our LFT to state-of-the-art SR methods. Finally, we conduct ablation studies to validate our design choices.

III-A. Implementation Details

Following [14], we used 5 public LF datasets [31, 32, 33, 34, 35] to validate our method. All LF images in the training and test sets have an angular resolution of 5×5. In the training stage, we cropped LF images into patches of 64×64 for 2×SR and 128×128 for 4×SR, and used bicubic downsampling to generate LR patches of size 32×32.

We used peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [36] as quantitative metrics for performance evaluation. To obtain the metric score for a dataset, we calculated the metrics on the SAIs of each scene separately, and obtained the score for that dataset by averaging over all scenes.
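The averaging protocol above can be sketched as follows (a minimal NumPy illustration, with a hypothetical `dataset_psnr` helper that is not part of the paper's code):

```python
import numpy as np

def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def dataset_psnr(scenes):
    """scenes: list of scenes; each scene is a list of (ref_sai, sr_sai) pairs.
    Each scene's score is averaged over its SAIs, then over all scenes."""
    scene_scores = [np.mean([psnr(r, s) for r, s in sais]) for sais in scenes]
    return float(np.mean(scene_scores))

# Hypothetical example: a uniform error of 0.1 gives 10*log10(1/0.01) = 20 dB
score = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
```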

All experiments were implemented in PyTorch on a PC with four Nvidia GTX 1080Ti GPUs. The weights of our network were initialized using the Xavier method [37] and optimized using the Adam method [38]. The batch size was set to 48 for both 2×SR and 4×SR. The learning rate was initially set to 2×10⁻⁴ and halved every 15 epochs. The training was stopped after 50 epochs.

Fig. 2: Visual results achieved by different methods for 4×SR.

III-B. Comparison to State-of-the-Art Methods

We compare our LFT to several state-of-the-art methods, including 3 single image SR methods [29, 16, 30] and 5 LF image SR methods [11, 25, 13, 12, 14]. We retrained all these methods on the same training datasets as our LFT.

III-B1. Quantitative Results

Table I shows the quantitative results achieved by our method and other state-of-the-art SR methods. Our LFT achieves the highest PSNR and SSIM on all 5 datasets for both 2×SR and 4×SR. Note that the superiority of our LFT is most significant on the STFgantry dataset [35] (i.e., 0.64 dB and 0.32 dB higher than the second top-performing method [14] for 2×SR and 4×SR, respectively). That is because LFs in the STFgantry dataset have more complex structures and larger disparity variations. By using our angular and spatial Transformers, our method can well handle these complex scenes while maintaining state-of-the-art performance on the other datasets.

III-B2. Qualitative Results

Figure 2 shows the qualitative results achieved by different methods. Our LFT well preserves the textures and details in the SR images and achieves competitive visual performance. Readers can refer to the accompanying video for a visual comparison of angular consistency.

III-B3. Efficiency

We compare our LFT to several competitive LF image SR methods [11, 13, 12, 14] in terms of the number of parameters and FLOPs. As shown in Table II, compared to other methods, our LFT achieves higher reconstruction accuracy with a smaller model size and lower computational cost, which clearly demonstrates the high efficiency of our method.

Method      | 2×SR: #Param. | FLOPs    | PSNR/SSIM   | 4×SR: #Param. | FLOPs    | PSNR/SSIM
resLF       | 7.98M         | 79.63G   | 37.50/0.982 | 8.65M         | 85.47G   | 31.25/0.932
LFSSR       | 0.89M         | 91.06G   | 37.51/0.983 | 1.77M         | 455.04G  | 31.56/0.937
LF-ATO      | 1.22M         | 1815.36G | 38.30/0.985 | 1.36M         | 1898.91G | 31.54/0.938
LF-InterNet | 4.91M         | 38.97G   | 38.08/0.985 | 4.96M         | 40.25G   | 31.59/0.939
LF-DFnet    | 3.94M         | 57.22G   | 38.42/0.985 | 3.99M         | 58.49G   | 31.86/0.942
LFT (ours)  | 1.11M         | 29.48G   | 38.87/0.986 | 1.16M         | 30.94G   | 32.04/0.943
TABLE II: The number of parameters (#Param.), FLOPs, and average PSNR/SSIM scores achieved by state-of-the-art methods for 2×SR and 4×SR. Note that FLOPs are computed with an input LF of size 5×5×32×32. The best results are in bold faces and the second best results are underlined.
Model | AngTr | AngPos | SpaTr | SpaPos | #Param. | EPFL  | HCIold | INRIA
1     |       |        |       |        | 1.49M   | 28.63 | 37.00  | 30.66
2     |   ✓   |        |       |        | 1.42M   | 28.85 | 37.29  | 30.93
3     |   ✓   |   ✓    |       |        | 1.42M   | 28.98 | 37.38  | 30.93
4     |       |        |   ✓   |        | 1.28M   | 28.93 | 37.30  | 30.97
5     |       |        |   ✓   |   ✓    | 1.28M   | 28.95 | 37.41  | 30.98
6     |   ✓   |   ✓    |   ✓   |   ✓    | 1.16M   | 29.02 | 37.47  | 31.01
TABLE III: PSNR results achieved on the EPFL [31], HCIold [33] and INRIA [34] datasets by several variants of LFT for 4×SR. Note that AngTr and SpaTr represent models using the angular Transformer and spatial Transformer, respectively. AngPos and SpaPos denote models using positional encoding in AngTr and SpaTr, respectively. #Param. represents the number of parameters of each variant.

III-C. Ablation Study

We introduce several variants with different architectures to validate the effectiveness of our method. As shown in Table III, we first introduce a baseline model (i.e., model-1) without angular and spatial Transformers (spatial-angular alternate convolutions [25, 39, 40] are used in model-1 to keep its model size comparable to the other variants), and then separately add the angular Transformer (i.e., model-2) and the spatial Transformer (i.e., model-4) to the baseline model. Moreover, we introduce model-3 and model-5 to validate the effectiveness of the angular and spatial positional encodings.

III-C1. Angular Transformer

We compare the performance of model-2 to model-1 and model-6 to model-5 to validate the effectiveness of the angular Transformer. As shown in Table III, by using the angular Transformer, model-2 achieves a 0.2–0.3 dB PSNR improvement over model-1. When the angular positional encoding is introduced, model-3 further achieves a 0.1 dB improvement over model-2 on the EPFL [31] and HCIold [33] datasets. By comparing the performance of model-5 and model-6, we can see that removing the angular Transformer (and angular positional encoding) from our LFT causes a notable PSNR drop (around 0.05 dB). The above experiments demonstrate that our angular Transformer and angular positional encoding are beneficial to the SR performance.

Moreover, we investigate the spatial-aware modeling capability of our angular Transformer by visualizing the local angular attention maps. Specifically, we selected two patches from scene Cards [35], and obtained the attention maps (a 25×25 matrix for each spatial location in a 5×5 LF) produced by the MHSA in the first angular Transformer at each spatial location in the patches. Note that larger values in the attention maps represent higher similarities between a pair of angular tokens. We then define "local angular attention" by calculating the ratio of similar tokens (with attention scores larger than 0.025) in the selected patches. Finally, we visualize the local angular attention map in Fig. 3(b) by assembling the calculated attention values into attention maps. It can be observed in Fig. 3(b) (top) that the attention values in the occlusion area (red patch) are distributed unevenly, where the non-occluded pixels share larger values. This demonstrates that our angular Transformer can adapt to different image contents and achieve spatial-aware angular modeling.
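The thresholded ratio described above can be sketched as follows (a minimal NumPy illustration with a hypothetical attention row; only the 0.025 threshold and the 25-view setting come from the text):

```python
import numpy as np

def local_angular_attention(attn_row, thresh=0.025):
    """Fraction of angular tokens whose attention score exceeds the threshold.

    attn_row: one row of a 25x25 angular attention map (softmax scores sum
    to 1 over the A^2 = 25 views); 0.025 is the threshold used in the text.
    """
    return float(np.mean(attn_row > thresh))

# Hypothetical attention row: one dominant view plus a near-uniform remainder,
# as might occur at an occluded pixel
row = np.full(25, 0.02)
row[0] = 0.52
assert abs(row.sum() - 1.0) < 1e-9
```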

Fig. 3: Visualization of local angular similarities generated by our angular Transformer. (a) Center view and epipolar plane images of scene Cards. (b) Local angular similarity maps of the red patch (top) and green patch (bottom) in (a). Each tile in the map illustrates the similarities between the current view (at the same position as the tile) and all the views (different pixels within the tile).

III-C2. Spatial Transformer

We demonstrate the effectiveness of the spatial Transformer by comparing the performance of model-4 to model-1 and model-6 to model-3. As shown in Table III, model-4 achieves a 0.3 dB improvement in PSNR over model-1. Moreover, when the spatial Transformer is removed from our LFT, model-3 suffers a notable performance degradation (0.04–0.09 dB in PSNR). That is because, compared to cascaded convolutions, the proposed spatial Transformer can better exploit long-range context information with a global receptive field, and can capture more beneficial spatial information for image SR.

IV. Conclusion

In this paper, we propose a Transformer-based network (i.e., LFT) for LF image SR. By using our proposed angular and spatial Transformers, the complementary angular information among all the views and the long-range spatial dependencies within each SAI can be effectively incorporated. Experimental results have demonstrated the superior performance of our LFT over state-of-the-art CNN-based SR methods.


  • [1] Y. Wang, J. Yang, Y. Guo, C. Xiao, and W. An, “Selective light field refocusing for camera arrays using bokeh rendering and superresolution,” IEEE Signal Processing Letters, vol. 26, no. 1, pp. 204–208, 2018.
  • [2] S. Jayaweera, C. Edussooriya, C. Wijenayake, P. Agathoklis, and L. Bruton, “Multi-volumetric refocusing of light fields,” IEEE Signal Processing Letters, vol. 28, pp. 31–35, 2020.
  • [3] W. Wang, Y. Lin, and S. Zhang, “Enhanced spinning parallelogram operator combining color constraint and histogram integration for robust light field depth estimation,” IEEE Signal Processing Letters, vol. 28, pp. 1080–1084, 2021.
  • [4] J. Lee and R. Park, “Reduction of aliasing artifacts by sign function approximation in light field depth estimation based on foreground–background separation,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1750–1754, 2018.
  • [5] A. Wang, “Three-stream cross-modal feature aggregation network for light field salient object detection,” IEEE Signal Processing Letters, vol. 28, pp. 46–50, 2020.
  • [6] Y. Wang, T. Wu, J. Yang, L. Wang, W. An, and Y. Guo, “Deoccnet: Learning to see through foreground occlusions in light fields,” in Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 118–127.
  • [7] S. Zhang, Z. Shen, and Y. Lin, “Removing foreground occlusions in light field using micro-lens dynamic filter,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2021, pp. 1302–1308.
  • [8] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. Kweon, “Light-field image super-resolution using convolutional neural network,” IEEE Signal Processing Letters, vol. 24, no. 6, pp. 848–852, 2017.
  • [9] Y. Yuan, Z. Cao, and L. Su, “Light-field image superresolution using a combined deep cnn based on epi,” IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1359–1363, 2018.
  • [10] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan, “Lfnet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4274–4286, 2018.
  • [11] S. Zhang, Y. Lin, and H. Sheng, “Residual networks for light field image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11046–11055.
  • [12] Y. Wang, L. Wang, J. Yang, W. An, J. Yu, and Y. Guo, “Spatial-angular interaction for light field image super-resolution,” in European Conference on Computer Vision (ECCV).   Springer, 2020, pp. 290–308.
  • [13] J. Jin, J. Hou, J. Chen, and S. Kwong, “Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2260–2269.
  • [14] Y. Wang, J. Yang, L. Wang, X. Ying, T. Wu, W. An, and Y. Guo, “Light field image super-resolution using deformable convolution,” IEEE Transactions on Image Processing, vol. 30, pp. 1057–1071, 2020.
  • [15] N. Meng, H. So, X. Sun, and E. Lam, “High-dimensional dense residual convolutional neural network for light field reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [16] B. Lim, S. Son, H. Kim, S. Nah, and K. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 136–144.
  • [17] M. Zheng, P. Gao, X. Wang, H. Li, and H. Dong, “End-to-end object detection with adaptive clustering transformer,” arXiv preprint arXiv:2011.09315, 2020.
  • [18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [19] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision (ECCV).   Springer, 2020, pp. 213–229.
  • [20] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6881–6890.
  • [21] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” arXiv preprint arXiv:2103.13413, 2021.
  • [22] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12299–12310.
  • [23] Z. Wang, X. Cun, J. Bao, and J. Liu, “Uformer: A general u-shaped transformer for image restoration,” arXiv preprint arXiv:2106.03106, 2021.
  • [24] J. Cao, Y. Li, K. Zhang, and L. Van Gool, “Video super-resolution transformer,” arXiv preprint arXiv:2106.06847, 2021.
  • [25] H. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Chung, “Light field spatial super-resolution using deep efficient spatial-angular separable convolution,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2319–2330, 2018.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [27] Y. Chen, S. Liu, and X. Wang, “Learning continuous image representation with local implicit image function,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8628–8638.
  • [28] W. Shi, J. Caballero, F. Huszár, J. Totz, A. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
  • [29] J. Kim, J. Lee, and K. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
  • [30] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
  • [31] M. Rerabek and T. Ebrahimi, “New light field image dataset,” in International Conference on Quality of Multimedia Experience (QoMEX), 2016.
  • [32] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A dataset and evaluation methodology for depth estimation on 4d light fields,” in Asian Conference on Computer Vision (ACCV).   Springer, 2016, pp. 19–34.
  • [33] S. Wanner, S. Meister, and B. Goldluecke, “Datasets and benchmarks for densely sampled 4d light fields,” in Vision, Modelling and Visualization (VMV), vol. 13. Citeseer, 2013, pp. 225–226.
  • [34] M. Pendu, X. Jiang, and C. Guillemot, “Light field inpainting propagation via low rank matrix completion,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1981–1993, 2018.
  • [35] V. Vaish and A. Adams, “The (new) stanford light field archive,” Computer Graphics Laboratory, Stanford University, vol. 6, no. 7, 2008.
  • [36] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [37] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
  • [38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Proceedings of the International Conference on Learning and Representation (ICLR), 2015.
  • [39] H. Yeung, J. Hou, J. Chen, Y. Chung, and X. Chen, “Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues,” in European Conference on Computer Vision (ECCV), 2018, pp. 137–152.
  • [40] M. Guo, J. Hou, J. Jin, J. Chen, and L. Chau, “Deep spatial-angular regularization for light field imaging, denoising, and super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.