I. Introduction
Light field (LF) cameras record both the intensity and directions of light rays, and enable many applications such as post-capture refocusing [1, 2], depth sensing [3, 4], saliency detection [5] and de-occlusion [6, 7]. Since high-resolution (HR) images are required in various applications, it is necessary to use the complementary information among different views (i.e., angular information) to achieve LF image super-resolution (SR).
In the past few years, convolutional neural networks (CNNs) have been widely used for LF image SR and have achieved promising performance [8, 9, 10, 11, 12, 13, 14, 15]. Yoon et al. [8] proposed the first CNN-based method, called LFCNN, to improve the resolution of LF images. Yuan et al. [9] applied EDSR [16] to super-resolve each sub-aperture image (SAI) independently, and developed an EPI-enhancement network to refine the super-resolved images. Zhang et al. [11] proposed a multi-branch residual network to incorporate the multi-directional epipolar geometry prior for LF image SR. Since both view-wise angular information and image-wise spatial information contribute to the SR performance, state-of-the-art CNN-based methods [12, 14, 13, 15] designed different network structures to leverage both angular and spatial information for LF image SR.
Although continuous progress has been achieved in reconstruction accuracy via delicate network designs, existing CNN-based LF image SR methods have the following two limitations. First, these methods either use only part of the views to reduce the complexity of the 4D LF structure [8, 9, 10, 11], or integrate angular information without considering view position and image content [12, 14, 13]. The under-use of the rich angular information results in performance degradation, especially on complex scenes (e.g., occlusions and non-Lambertian surfaces). Second, existing CNN-based methods extract spatial features by applying (cascaded) convolutions on SAIs. The local receptive field of convolutions hinders these methods from capturing long-range spatial dependencies in the input images. In summary, existing CNN-based LF image SR methods cannot fully exploit both angular and spatial information, and thus face a bottleneck for further performance improvement.
Recently, Transformers have been demonstrated to be effective in modeling positional and long-range correlations, and have been applied to various computer vision tasks such as image classification [17, 18], object detection [17, 19], semantic segmentation [20] and depth estimation [21]. In the area of low-level vision, Chen et al. [22] developed an image processing Transformer with multiple heads and tails. Their method achieves state-of-the-art performance on image denoising, deraining and SR. Wang et al. [23] proposed a hierarchical U-shaped Transformer to capture both local and non-local context information for image restoration. Cao et al. [24] proposed a Transformer-based network to exploit correlations among different frames for video SR.
Inspired by the recent advances of Transformers, in this paper, we propose a Transformer-based network (i.e., LFT) to address the aforementioned limitations of CNN-based methods. Specifically, we design an angular Transformer to model the relationship among different views, and a spatial Transformer to capture both local and non-local context information within each SAI. Compared to CNN-based methods, our LFT can discriminatively incorporate the information from all angular views, and capture long-range spatial dependencies within each SAI.
The contributions of this paper can be summarized as follows: 1) We make the first attempt to adapt Transformers to LF image processing, and propose a Transformer-based network for LF image SR. 2) We propose a novel paradigm (i.e., angular and spatial Transformers) to incorporate angular and spatial information in an LF. The effectiveness of this paradigm is validated through extensive ablation studies. 3) With a small model size and low computational cost, our LFT achieves superior SR performance compared with other state-of-the-art methods.
II. Method
We formulate an LF as a 4D tensor $\mathcal{L} \in \mathbb{R}^{U \times V \times H \times W}$, where $U$ and $V$ represent the angular dimensions, and $H$ and $W$ represent the spatial dimensions. Specifically, an LF can be considered as a $U \times V$ array of SAIs of size $H \times W$. Following [11, 25, 15, 13, 12, 14], we achieve LF image SR using SAIs distributed in a square array (i.e., $U = V = A$). As shown in Fig. 1, our network consists of three stages: initial feature extraction, Transformer-based feature incorporation¹, and upsampling.
¹ In our LFT, we alternately cascade four angular Transformers and four spatial Transformers for deep feature extraction.
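To make the three-stage design concrete, the following PyTorch-style sketch mirrors the pipeline described above: cascaded 3×3 convolutions, four alternating pairs of angular and spatial Transformers, and pixel-shuffle upsampling. The class names, channel sizes, and the two Transformer stubs are illustrative placeholders rather than the authors' released implementation; the stubs are fleshed out by the sketches in Secs. II-A and II-B.

```python
import torch
import torch.nn as nn

# Placeholder blocks standing in for the angular and spatial Transformers
# detailed in Secs. II-A and II-B (nn.Identity ignores its constructor arguments).
class AngularTransformer(nn.Identity): pass
class SpatialTransformer(nn.Identity): pass

class LFTSketch(nn.Module):
    def __init__(self, channels=64, scale=2, num_views=25, n_pairs=4):
        super().__init__()
        # Stage 1: initial feature extraction with cascaded 3x3 convolutions
        self.init_conv = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, 3, padding=1))
        # Stage 2: four angular and four spatial Transformers, cascaded alternately
        blocks = []
        for _ in range(n_pairs):
            blocks += [AngularTransformer(channels, num_views), SpatialTransformer(channels)]
        self.blocks = nn.Sequential(*blocks)
        # Stage 3: pixel-shuffle upsampling to the target resolution
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, sais):                 # sais: (U*V, 1, H, W) stack of SAIs
        feat = self.init_conv(sais)          # (U*V, C, H, W)
        feat = self.blocks(feat)             # alternating angular/spatial modeling
        return self.upsample(feat)           # (U*V, 1, scale*H, scale*W)

lr = torch.rand(25, 1, 32, 32)               # a 5x5 LF of 32x32 SAIs
sr = LFTSketch(scale=2)(lr)                  # (25, 1, 64, 64)
```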
II-A. Angular Transformer
The input LF images are first processed by cascaded 3×3 convolutions to generate initial features $F \in \mathbb{R}^{UV \times H \times W \times C}$. The extracted features are then fed to the angular Transformer to model the angular dependencies. Our angular Transformer is designed to correlate highly relevant features in the angular dimension, and can fully exploit the complementary information among all the input views.
Specifically, feature $F$ is first reshaped into a sequence of angular tokens $T_{ang} \in \mathbb{R}^{HW \times UV \times C}$, where $HW$ represents the batch dimension, $UV$ is the length of the sequence, and $C$ denotes the embedding dimension of each angular token. Then, we perform angular positional encoding to model the positional correlation of different views [26], i.e.,
$P_{ang}(p,\, 2i) = \sin\!\left(p \,/\, 10000^{2i/C}\right)$,   (1)
$P_{ang}(p,\, 2i+1) = \cos\!\left(p \,/\, 10000^{2i/C}\right)$,   (2)
where $p$ represents the angular position and $i$ denotes the channel index in the embedding dimension.
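A minimal sketch of the sinusoidal encoding in Eqs. (1)–(2), following the standard formulation of [26]; the function name and the 5×5 / 64-channel example are illustrative assumptions.

```python
import torch

def angular_positional_encoding(num_views: int, embed_dim: int) -> torch.Tensor:
    """Return a (num_views, embed_dim) table of sin/cos angular position codes."""
    pos = torch.arange(num_views, dtype=torch.float32).unsqueeze(1)   # position p
    i = torch.arange(0, embed_dim, 2, dtype=torch.float32)            # channel index 2i
    div = torch.pow(10000.0, i / embed_dim)                           # 10000^(2i/C)
    pe = torch.zeros(num_views, embed_dim)
    pe[:, 0::2] = torch.sin(pos / div)   # even channels, Eq. (1)
    pe[:, 1::2] = torch.cos(pos / div)   # odd channels,  Eq. (2)
    return pe

# e.g., codes for the U*V = 25 angular tokens of a 5x5 LF with C = 64 channels
ang_pos = angular_positional_encoding(num_views=25, embed_dim=64)
```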
The angular position codes $P_{ang}$ are directly added to $T_{ang}$ and passed through a layer normalization (LN) to generate the query $Q$ and key $K$, i.e., $Q = K = \mathrm{LN}(T_{ang} + P_{ang})$. The value $V$ is directly assigned as $T_{ang}$, i.e., $V = T_{ang}$. Afterward, we apply multi-head self-attention (MHSA) to learn the relationship among different angular tokens. Similar to other MHSA approaches [26, 22, 18], the embedding dimension of $Q$, $K$ and $V$ is split into $N_h$ groups, where $N_h$ is the number of heads. For each attention head, the calculation can be formulated as:
$\mathrm{head}_i = \mathrm{SoftMax}\!\left(\frac{(Q W_i^{Q})(K W_i^{K})^{T}}{\sqrt{C / N_h}}\right) V W_i^{V}$,   (3)
where $i$ denotes the index of the head groups, and $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the linear projection matrices. In summary, the MHSA can be formulated as:
$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_{N_h}\right) W^{O}$,   (4)
where $W^{O}$ is the output projection matrix and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation.
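The following sketch spells out Eqs. (3)–(4) with explicit per-head projections. The module and weight names (W_q, W_k, W_v, W_o) are illustrative; PyTorch's built-in nn.MultiheadAttention implements the same computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, embed_dim // num_heads
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim, bias=False)   # output projection W^O

    def forward(self, q, k, v):              # each: (batch, length, embed_dim)
        B, L, _ = q.shape
        # split the embedding dimension into num_heads groups
        def split(x):
            return x.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.W_q(q)), split(self.W_k(k)), split(self.W_v(v))
        # scaled dot-product attention per head, Eq. (3)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, -1)        # concat heads, Eq. (4)
        return self.W_o(out)
```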
To further exploit the correlations built by the MHSA, the tokens are then fed to a feed-forward network (FFN), which consists of an LN and a multi-layer perceptron (MLP). In summary, the calculation process of our angular Transformer can be formulated as:
$\hat{T}_{ang} = \mathrm{MHSA}(Q, K, V) + T_{ang}$,   (5)
$T_{ang}' = \mathrm{MLP}\!\left(\mathrm{LN}(\hat{T}_{ang})\right) + \hat{T}_{ang}$.   (6)
Finally, $T_{ang}'$ is reshaped into $F_{ang} \in \mathbb{R}^{UV \times H \times W \times C}$ and fed to the subsequent spatial Transformer to incorporate spatial context information.
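Putting Eqs. (1)–(6) together, the sketch below shows one way to realize an angular Transformer: the LF feature is reshaped so that the U·V views form the token sequence while each spatial location acts as a batch element, followed by MHSA and the FFN with residual connections. It relies on torch.nn.MultiheadAttention for brevity and uses a placeholder positional table; hyperparameters are assumptions, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class AngularTransformer(nn.Module):
    def __init__(self, channels=64, num_views=25, num_heads=8, mlp_ratio=2):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(channels),
            nn.Linear(channels, channels * mlp_ratio), nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels))
        # placeholder for the sinusoidal table of Eqs. (1)-(2)
        self.register_buffer('pos', torch.randn(num_views, channels))

    def forward(self, feat):                  # feat: (U*V, C, H, W)
        UV, C, H, W = feat.shape
        # to angular tokens: (H*W, U*V, C), i.e. sequence length U*V, batch H*W
        tok = feat.permute(2, 3, 0, 1).reshape(H * W, UV, C)
        qk = self.norm(tok + self.pos)        # query = key = LN(tokens + position)
        tok = tok + self.attn(qk, qk, tok, need_weights=False)[0]   # Eq. (5)
        tok = tok + self.ffn(tok)                                    # Eq. (6)
        # back to an LF feature of shape (U*V, C, H, W)
        return tok.reshape(H, W, UV, C).permute(2, 3, 0, 1)
```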
II-B. Spatial Transformer
The goal of our spatial Transformer is to leverage both local context information and long-range spatial dependencies within each SAI. Specifically, the input feature $F_{ang}$ is first unfolded in each 3×3 neighborhood [27], and then fed to an MLP to achieve local feature embedding. That is,
$F_{loc}(h, w) = \mathrm{MLP}\!\left(\mathrm{Unfold}\!\left(F_{ang}\!\left(\mathcal{N}_{3\times 3}(h, w)\right)\right)\right)$,   (7)
where $(h, w)$ denotes an arbitrary spatial coordinate on feature $F_{ang}$, and $\mathcal{N}_{3\times 3}(h, w)$ denotes its 3×3 neighborhood. The locally assembled feature $F_{loc}$ is then cropped into overlapping spatial tokens $T_{spa} \in \mathbb{R}^{UV \times L \times C}$, where $UV$ denotes the batch dimension, $L$ represents the length of the sequence, and $C$ represents the embedding dimension of the spatial tokens.
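A small sketch of the local feature embedding in Eq. (7): 3×3 neighborhoods are gathered with an unfold operation and an MLP maps each assembled vector back to the embedding dimension. Layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_embedding(feat: torch.Tensor, mlp: nn.Module) -> torch.Tensor:
    """feat: (U*V, C, H, W) -> locally assembled feature of the same spatial size."""
    UV, C, H, W = feat.shape
    patches = F.unfold(feat, kernel_size=3, padding=1)   # (U*V, C*9, H*W)
    patches = patches.transpose(1, 2)                    # (U*V, H*W, C*9)
    emb = mlp(patches)                                    # (U*V, H*W, C)
    return emb.transpose(1, 2).reshape(UV, -1, H, W)

C = 64
mlp = nn.Sequential(nn.Linear(C * 9, C), nn.GELU(), nn.Linear(C, C))
feat = torch.randn(25, C, 32, 32)                # features of a 5x5 LF
local_feat = local_embedding(feat, mlp)          # (25, 64, 32, 32)
```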
TABLE I: PSNR/SSIM values achieved by different methods for 2× and 4× SR.

Methods | EPFL (2×) | HCInew (2×) | HCIold (2×) | INRIA (2×) | STFgantry (2×) | EPFL (4×) | HCInew (4×) | HCIold (4×) | INRIA (4×) | STFgantry (4×)
Bicubic | 29.74/0.941 | 31.89/0.939 | 37.69/0.979 | 31.33/0.959 | 31.06/0.954 | 25.14/0.833 | 27.61/0.853 | 32.42/0.931 | 26.82/0.886 | 25.93/0.847
VDSR [29] | 32.50/0.960 | 34.37/0.956 | 40.61/0.987 | 34.43/0.974 | 35.54/0.979 | 27.25/0.878 | 29.31/0.883 | 34.81/0.952 | 29.19/0.921 | 28.51/0.901
EDSR [16] | 33.09/0.963 | 34.83/0.959 | 41.01/0.988 | 34.97/0.977 | 36.29/0.982 | 27.84/0.886 | 29.60/0.887 | 35.18/0.954 | 29.66/0.926 | 28.70/0.908
RCAN [30] | 33.16/0.964 | 34.98/0.960 | 41.05/0.988 | 35.01/0.977 | 36.33/0.983 | 27.88/0.886 | 29.63/0.888 | 35.20/0.954 | 29.76/0.927 | 28.90/0.911
resLF [11] | 33.62/0.971 | 36.69/0.974 | 43.42/0.993 | 35.39/0.981 | 38.36/0.990 | 28.27/0.904 | 30.73/0.911 | 36.71/0.968 | 30.34/0.941 | 30.19/0.937
LFSSR [25] | 33.68/0.974 | 36.81/0.975 | 43.81/0.994 | 35.28/0.983 | 37.95/0.990 | 28.27/0.908 | 30.72/0.912 | 36.70/0.969 | 30.31/0.945 | 30.15/0.939
LF-ATO [13] | 34.27/0.976 | 37.24/0.977 | 44.20/0.994 | 36.15/0.984 | 39.64/0.993 | 28.52/0.912 | 30.88/0.914 | 37.00/0.970 | 30.71/0.949 | 30.61/0.943
LF-InterNet [12] | 34.14/0.972 | 37.28/0.977 | 44.45/0.995 | 35.80/0.985 | 38.72/0.992 | 28.67/0.914 | 30.98/0.917 | 37.11/0.972 | 30.64/0.949 | 30.53/0.943
LF-DFnet [14] | 34.44/0.977 | 37.44/0.979 | 44.23/0.994 | 36.36/0.984 | 39.61/0.993 | 28.77/0.917 | 31.23/0.920 | 37.32/0.972 | 30.83/0.950 | 31.15/0.949
LFT (ours) | 34.56/0.978 | 37.74/0.979 | 44.55/0.995 | 36.44/0.985 | 40.25/0.994 | 29.02/0.918 | 31.25/0.921 | 37.47/0.973 | 31.01/0.950 | 31.47/0.951
By performing feature unfolding and overlap cropping, the local context information can be fully integrated into the generated spatial tokens, which enables our spatial Transformer to capture both local and non-local dependencies. To further model the spatial position information, we perform 2D positional encoding on the spatial tokens:
$P_{spa}\!\left((h, w),\, 2i\right) = \sin\!\left(h \,/\, 10000^{2i/C}\right) + \sin\!\left(w \,/\, 10000^{2i/C}\right)$,   (8)
$P_{spa}\!\left((h, w),\, 2i+1\right) = \cos\!\left(h \,/\, 10000^{2i/C}\right) + \cos\!\left(w \,/\, 10000^{2i/C}\right)$,   (9)
where $(h, w)$ denotes the spatial position and $i$ denotes the index in the embedding dimension. Then, $Q$, $K$ and $V$ can be calculated according to:
$Q = K = \mathrm{LN}(T_{spa} + P_{spa})$,   (10)
$V = T_{spa}$.   (11)
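A sketch of one possible realization of the 2D encoding in Eqs. (8)–(9), where each spatial token at (h, w) receives sin/cos codes built from both coordinates; the exact form used by the authors may differ.

```python
import torch

def spatial_positional_encoding(H: int, W: int, embed_dim: int) -> torch.Tensor:
    """Return an (H*W, embed_dim) table of 2D spatial position codes."""
    h = torch.arange(H, dtype=torch.float32).view(H, 1, 1)
    w = torch.arange(W, dtype=torch.float32).view(1, W, 1)
    i = torch.arange(0, embed_dim, 2, dtype=torch.float32)
    div = torch.pow(10000.0, i / embed_dim)                  # 10000^(2i/C)
    pe = torch.zeros(H, W, embed_dim)
    pe[..., 0::2] = torch.sin(h / div) + torch.sin(w / div)  # even channels, Eq. (8)
    pe[..., 1::2] = torch.cos(h / div) + torch.cos(w / div)  # odd channels,  Eq. (9)
    return pe.view(H * W, embed_dim)

spa_pos = spatial_positional_encoding(H=32, W=32, embed_dim=64)   # (1024, 64)
```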
Similar to the proposed angular Transformer, we use the MHSA and FFN to build our spatial Transformer. That is,
$\hat{T}_{spa} = \mathrm{MHSA}(Q, K, V) + T_{spa}$,   (12)
$T_{spa}' = \mathrm{MLP}\!\left(\mathrm{LN}(\hat{T}_{spa})\right) + \hat{T}_{spa}$.   (13)
Then, $T_{spa}'$ is reshaped into $F_{spa} \in \mathbb{R}^{UV \times H \times W \times C}$ and fed to the next angular Transformer. After passing through all the angular and spatial Transformers, both the angular and spatial information in an LF can be fully incorporated. Finally, we apply pixel shuffling [28] to achieve feature upsampling, and obtain the super-resolved LF image $\mathcal{L}_{SR} \in \mathbb{R}^{U \times V \times \alpha H \times \alpha W}$, where $\alpha$ is the upsampling factor.
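A minimal sketch of the pixel-shuffle upsampling stage [28]: a 3×3 convolution expands the channels by α², and nn.PixelShuffle rearranges them into an α-times larger feature map. Channel sizes and the final single-channel reconstruction layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

scale, C = 2, 64
upsampler = nn.Sequential(
    nn.Conv2d(C, C * scale ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(scale),                      # (C*s^2, H, W) -> (C, s*H, s*W)
    nn.Conv2d(C, 1, kernel_size=3, padding=1))   # final SAI reconstruction

feat = torch.randn(25, C, 32, 32)                # incorporated features of a 5x5 LF
sr = upsampler(feat)                             # (25, 1, 64, 64) super-resolved SAIs
```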
III. Experiments
In this section, we first introduce our implementation details, and then compare our LFT to state-of-the-art SR methods. Finally, we conduct ablation studies to validate our design choices.
III-A. Implementation Details
Following [14], we used 5 public LF datasets [31, 32, 33, 34, 35] to validate our method. All LF images in the training and test sets have an angular resolution of 5×5. In the training stage, we cropped LF images into patches of size 64×64 / 128×128 for 2× / 4× SR, and used the bicubic downsampling approach to generate LR patches of size 32×32.
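A small sketch of this patch preparation, assuming torch.nn.functional.interpolate as the bicubic resampler (the authors may have used a different bicubic implementation); the crop coordinates are arbitrary.

```python
import torch
import torch.nn.functional as F

def make_lr_patch(hr_sai: torch.Tensor, top: int, left: int, scale: int) -> torch.Tensor:
    """hr_sai: (1, 1, H, W) single SAI; returns a (1, 1, 32, 32) LR patch."""
    size = 32 * scale                                   # 64 for 2x SR, 128 for 4x SR
    hr_patch = hr_sai[:, :, top:top + size, left:left + size]
    lr_patch = F.interpolate(hr_patch, scale_factor=1 / scale,
                             mode='bicubic', align_corners=False)
    return lr_patch

hr = torch.rand(1, 1, 512, 512)
lr = make_lr_patch(hr, top=100, left=100, scale=2)       # (1, 1, 32, 32)
```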
We used peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [36] as quantitative metrics for performance evaluation. To obtain the metric score for a dataset, we calculated the metrics on the SAIs of each scene separately, and obtained the score for the dataset by averaging all the scores.
All experiments were implemented in PyTorch on a PC with four Nvidia GTX 1080Ti GPUs. The weights of our network were initialized using the Xavier method [37] and optimized using the Adam method [38]. The batch size was set to 4/8 for 2×/4× SR. The learning rate was initially set to 2×10⁻⁴ and halved every 15 epochs. The training was stopped after 50 epochs.
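A hedged sketch of the optimization setup described above (Adam, initial learning rate 2×10⁻⁴ halved every 15 epochs, 50 epochs in total). The model, data loader, and L1 loss are placeholders; the section does not specify the loss function.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)                      # placeholder for LFT
criterion = nn.L1Loss()                                    # assumed reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)

for epoch in range(50):
    for lr_patch, hr_patch in []:                          # placeholder data loader
        optimizer.zero_grad()
        loss = criterion(model(lr_patch), hr_patch)
        loss.backward()
        optimizer.step()
    scheduler.step()                                       # halve the learning rate every 15 epochs
```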
III-B. Comparison to State-of-the-Art Methods
We compare our LFT to several state-of-the-art methods, including 3 single image SR methods [29, 16, 30] and 5 LF image SR methods [11, 25, 13, 12, 14]. We retrained all these methods on the same training datasets as our LFT.
III-B1. Quantitative Results
Table I shows the quantitative results achieved by our method and other state-of-the-art SR methods. Our LFT achieves the highest PSNR and SSIM scores on all 5 datasets for both 2× and 4× SR. Note that the superiority of our LFT is particularly significant on the STFgantry dataset [35] (i.e., 0.64 dB and 0.32 dB higher than the second top-performing method [14] for 2× and 4× SR, respectively). That is because LFs in the STFgantry dataset have more complex structures and larger disparity variations. By using our angular and spatial Transformers, our method can well handle these complex scenes while maintaining state-of-the-art performance on the other datasets.
III-B2. Qualitative Results
Figure 2 shows the qualitative results achieved by different methods. Our LFT can well preserve the textures and details in the SR images and achieves competitive visual performance. Readers can refer to our video demo for a visual comparison of the angular consistency.
III-B3. Efficiency
We compare our LFT to several competitive LF image SR methods [11, 13, 12, 14] in terms of the number of parameters and FLOPs. As shown in Table II, compared to other methods, our LFT achieves higher reconstruction accuracy with a smaller model size and a lower computational cost, which clearly demonstrates the high efficiency of our method.
TABLE II: Number of parameters (#Param.), FLOPs, and PSNR/SSIM (averaged over the five datasets) achieved by different methods for 2× and 4× SR.

Methods | #Param. (2×) | FLOPs (2×) | PSNR/SSIM (2×) | #Param. (4×) | FLOPs (4×) | PSNR/SSIM (4×)
resLF | 7.98M | 79.63G | 37.50/0.982 | 8.65M | 85.47G | 31.25/0.932
LFSSR | 0.89M | 91.06G | 37.51/0.983 | 1.77M | 455.04G | 31.56/0.937
LF-ATO | 1.22M | 1815.36G | 38.30/0.985 | 1.36M | 1898.91G | 31.54/0.938
LF-InterNet | 4.91M | 38.97G | 38.08/0.985 | 4.96M | 40.25G | 31.59/0.939
LF-DFnet | 3.94M | 57.22G | 38.42/0.985 | 3.99M | 58.49G | 31.86/0.942
LFT (ours) | 1.11M | 29.48G | 38.87/0.986 | 1.16M | 30.94G | 32.04/0.943
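For reference, the "#Param." column of Table II can be reproduced for any of the compared models with a one-line parameter count; FLOPs are usually measured with an external profiler and are not reproduced here. The model below is only a placeholder.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

placeholder = nn.Conv2d(64, 64, 3, padding=1)    # stand-in for a full SR network
print(f'{count_parameters(placeholder):.2f}M parameters')
```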
TABLE III: PSNR values achieved by several variants of our LFT for 4× SR. AngTr/SpaTr denote the angular/spatial Transformers, and AngPos/SpaPos denote the corresponding positional encodings.

Model | AngTr | AngPos | SpaTr | SpaPos | #Param. | EPFL | HCIold | INRIA
1 | | | | | 1.49M | 28.63 | 37.00 | 30.66
2 | ✓ | | | | 1.42M | 28.85 | 37.29 | 30.93
3 | ✓ | ✓ | | | 1.42M | 28.98 | 37.38 | 30.93
4 | | | ✓ | | 1.28M | 28.93 | 37.30 | 30.97
5 | | | ✓ | ✓ | 1.28M | 28.95 | 37.41 | 30.98
6 | ✓ | ✓ | ✓ | ✓ | 1.16M | 29.02 | 37.47 | 31.01
III-C. Ablation Study
We introduce several variants with different architectures to validate the effectiveness of our method. As shown in Table III, we first introduce a baseline model (i.e., model 1) without angular and spatial Transformers², and then separately add the angular Transformer (i.e., model 2) and the spatial Transformer (i.e., model 4) to the baseline model. Moreover, we introduce model 3 and model 5 to validate the effectiveness of the angular and spatial positional encodings.
² Spatial-angular alternating convolutions [25, 39, 40] are used in model 1 to keep its model size comparable to that of the other variants.
III-C1. Angular Transformer
We compare the performance of model 2 to model 1 and model 6 to model 5 to validate the effectiveness of the angular Transformer. As shown in Table III, by using the angular Transformer, model 2 achieves a 0.2–0.3 dB PSNR improvement over model 1. When the angular positional encoding is introduced, model 3 further achieves an improvement of around 0.1 dB over model 2 on the EPFL [31] and HCIold [33] datasets. By comparing the performance of model 5 and model 6, we can see that removing the angular Transformer (and the angular positional encoding) from our LFT causes a notable PSNR drop (around 0.05 dB). The above experiments demonstrate that our angular Transformer and angular positional encoding are beneficial to the SR performance.
Moreover, we investigate the spatial-aware modeling capability of our angular Transformer by visualizing the local angular attention maps. Specifically, we selected two patches from scene Cards [35], and obtained the attention maps (a 25×25 matrix for each spatial location in a 5×5 LF) produced by the MHSA in the first angular Transformer at each spatial location of the patches. Note that larger values in the attention maps represent higher similarities between a pair of angular tokens. We then define the “local angular attention” by calculating the ratio of similar tokens (with attention scores larger than 0.025) at each location in the selected patches. Finally, we visualize the local angular attention map in Fig. 3(b) by assembling the calculated values into a map. It can be observed in Fig. 3(b) (top) that the attention values in the occlusion area (red patch) are distributed unevenly, and the non-occluded pixels share larger values. This demonstrates that our angular Transformer can adapt to different image contents and achieve spatial-aware angular modeling.
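A hedged sketch of this analysis: given the 25×25 angular attention matrices of a patch (one per spatial location, taken from the first angular Transformer of a trained model), the ratio of entries above the 0.025 threshold is computed at each location and assembled into a map. The function name and the random stand-in maps are illustrative.

```python
import torch

def local_angular_attention(attn_maps: torch.Tensor, thresh: float = 0.025) -> torch.Tensor:
    """attn_maps: (H, W, U*V, U*V) per-location angular attention matrices.
    Returns an (H, W) map with the ratio of scores above the threshold."""
    H, W, UV, _ = attn_maps.shape
    return (attn_maps > thresh).float().sum(dim=(-2, -1)) / (UV * UV)

# random stand-in attention maps for a 64x64 patch of a 5x5 LF
maps = torch.softmax(torch.randn(64, 64, 25, 25), dim=-1)
laa = local_angular_attention(maps)          # (64, 64) local angular attention map
```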
III-C2. Spatial Transformer
We demonstrate the effectiveness of the spatial Transformer by comparing the performance of model 4 to model 1 and model 6 to model 3. As shown in Table III, model 4 achieves a 0.3 dB improvement in PSNR over model 1. Moreover, when the spatial Transformer is removed from our LFT (i.e., model 3), the performance degrades notably (by 0.04–0.09 dB in PSNR). That is because, compared with cascaded convolutions, the proposed spatial Transformer can better exploit long-range context information with a global receptive field, and can capture more beneficial spatial information for image SR.
IV. Conclusion
In this paper, we propose a Transformer-based network (i.e., LFT) for LF image SR. By using the proposed angular and spatial Transformers, the complementary angular information among all the views and the long-range spatial dependencies within each SAI can be effectively incorporated. Experimental results have demonstrated the superior performance of our LFT over state-of-the-art CNN-based SR methods.
References
[1] Y. Wang, J. Yang, Y. Guo, C. Xiao, and W. An, “Selective light field refocusing for camera arrays using bokeh rendering and super-resolution,” IEEE Signal Processing Letters, vol. 26, no. 1, pp. 204–208, 2018.
[2] S. Jayaweera, C. Edussooriya, C. Wijenayake, P. Agathoklis, and L. Bruton, “Multi-volumetric refocusing of light fields,” IEEE Signal Processing Letters, vol. 28, pp. 31–35, 2020.
 [3] W. Wang, Y. Lin, and S. Zhang, “Enhanced spinning parallelogram operator combining color constraint and histogram integration for robust light field depth estimation,” IEEE Signal Processing Letters, vol. 28, pp. 1080–1084, 2021.
 [4] J. Lee and R. Park, “Reduction of aliasing artifacts by sign function approximation in light field depth estimation based on foreground–background separation,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1750–1754, 2018.
[5] A. Wang, “Three-stream cross-modal feature aggregation network for light field salient object detection,” IEEE Signal Processing Letters, vol. 28, pp. 46–50, 2020.
[6] Y. Wang, T. Wu, J. Yang, L. Wang, W. An, and Y. Guo, “DeOccNet: Learning to see through foreground occlusions in light fields,” in Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 118–127.

[7] S. Zhang, Z. Shen, and Y. Lin, “Removing foreground occlusions in light field using micro-lens dynamic filter,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2021, pp. 1302–1308.
[8] Y. Yoon, H. Jeon, D. Yoo, J. Lee, and I. Kweon, “Light-field image super-resolution using convolutional neural network,” IEEE Signal Processing Letters, vol. 24, no. 6, pp. 848–852, 2017.
[9] Y. Yuan, Z. Cao, and L. Su, “Light-field image super-resolution using a combined deep CNN based on EPI,” IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1359–1363, 2018.
[10] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan, “LFNet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4274–4286, 2018.

[11] S. Zhang, Y. Lin, and H. Sheng, “Residual networks for light field image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11046–11055.
[12] Y. Wang, L. Wang, J. Yang, W. An, J. Yu, and Y. Guo, “Spatial-angular interaction for light field image super-resolution,” in European Conference on Computer Vision (ECCV). Springer, 2020, pp. 290–308.
[13] J. Jin, J. Hou, J. Chen, and S. Kwong, “Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2260–2269.
[14] Y. Wang, J. Yang, L. Wang, X. Ying, T. Wu, W. An, and Y. Guo, “Light field image super-resolution using deformable convolution,” IEEE Transactions on Image Processing, vol. 30, pp. 1057–1071, 2020.
[15] N. Meng, H. So, X. Sun, and E. Lam, “High-dimensional dense residual convolutional neural network for light field reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[16] B. Lim, S. Son, H. Kim, S. Nah, and K. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 136–144.
[17] M. Zheng, P. Gao, X. Wang, H. Li, and H. Dong, “End-to-end object detection with adaptive clustering transformer,” arXiv preprint arXiv:2011.09315, 2020.
 [18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[19] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision (ECCV). Springer, 2020, pp. 213–229.
[20] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6881–6890.
 [21] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” arXiv preprint arXiv:2103.13413, 2021.
[22] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12299–12310.
[23] Z. Wang, X. Cun, J. Bao, and J. Liu, “Uformer: A general U-shaped transformer for image restoration,” arXiv preprint arXiv:2106.03106, 2021.
[24] J. Cao, Y. Li, K. Zhang, and L. Van Gool, “Video super-resolution transformer,” arXiv preprint arXiv:2106.06847, 2021.
[25] H. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Chung, “Light field spatial super-resolution using deep efficient spatial-angular separable convolution,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2319–2330, 2018.
 [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
 [27] Y. Chen, S. Liu, and X. Wang, “Learning continuous image representation with local implicit image function,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8628–8638.
[28] W. Shi, J. Caballero, F. Huszár, J. Totz, A. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
[29] J. Kim, J. Lee, and K. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
[30] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
 [31] M. Rerabek and T. Ebrahimi, “New light field image dataset,” in International Conference on Quality of Multimedia Experience (QoMEX), 2016.
 [32] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A dataset and evaluation methodology for depth estimation on 4d light fields,” in Asian Conference on Computer Vision (ACCV). Springer, 2016, pp. 19–34.

[33] S. Wanner, S. Meister, and B. Goldluecke, “Datasets and benchmarks for densely sampled 4D light fields,” in Vision, Modelling and Visualization (VMV), vol. 13. Citeseer, 2013, pp. 225–226.
 [34] M. Pendu, X. Jiang, and C. Guillemot, “Light field inpainting propagation via low rank matrix completion,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1981–1993, 2018.
 [35] V. Vaish and A. Adams, “The (new) stanford light field archive,” Computer Graphics Laboratory, Stanford University, vol. 6, no. 7, 2008.
 [36] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 [37] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
[38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[39] H. Yeung, J. Hou, J. Chen, Y. Chung, and X. Chen, “Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues,” in European Conference on Computer Vision (ECCV), 2018, pp. 137–152.
[40] M. Guo, J. Hou, J. Jin, J. Chen, and L. Chau, “Deep spatial-angular regularization for light field imaging, denoising, and super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.