Hyperspectral images (HSIs) have recently received considerable attention in many fields because the rich spectral information of each pixel in a scene provides more abundant and reliable knowledge than multispectral images (e.g., RGB images). As a result, many researchers have begun to exploit HSIs in various computer vision tasks, such as classification, segmentation, and object tracking, for better performance. However, current physical imaging systems face an unavoidable trade-off: high spatial resolution and high spectral resolution cannot be obtained at the same time [dian2020recent]. In practice, the imaging mechanism can capture either an image with high spatial resolution but limited spectral bands, i.e., a high-resolution multispectral image (HR-MSI), or an image with low spatial resolution but higher spectral resolution, i.e., a low-resolution hyperspectral image (LR-HSI). Thus, fusing the high spatial resolution of the MSI with the high spectral resolution of the HSI becomes a promising technique to generate the desired high-resolution hyperspectral image (HR-HSI).
Many methods have been proposed from various perspectives to address the HSI super-resolution problem in the last few years. They can be roughly divided into two classes: traditional methods and deep learning-based methods. For the traditional methods, many researchers introduce different prior knowledge into optimization models under the maximum a posteriori (MAP) framework to exploit the intrinsic properties of the images.
Since matrix and tensor factorization-based methods show reliable and promising performance in many computer vision and image processing tasks [jiang2020framelet, 8948303, 7579662, Jiang_2017_CVPR], many researchers have also introduced matrix or tensor factorization into the hyperspectral image super-resolution and pansharpening problems [han2017hyperspectral, GLP-HS, 1518950, CSTF, CNN-FUS, UTV, pan2019multispectral, LTMR, LTTR, Wu, 9106801, deng2018variational, deng2019fusion]. Although those methods have achieved excellent performance, they often require certain quantities to be known or estimated beforehand, which are difficult to obtain in practice. Furthermore, the representation ability of handcrafted regularization terms is usually limited in practice, and their optimal parameters need to be tuned for different devices.
Recently, deep learning-based methods have shown satisfactory performance in different fields due to their remarkable feature representation ability. Thus, many researchers have applied deep learning, especially CNNs, to fusing the LR-HSI and HR-MSI [MHFnetpami, dian2018deep, shao2018remote, li2017hyperspectral, vitale2019cnn, han2019hyperspectral, liu2018deep, huang2015new, liu2018psgan, CNN-FUS, SSRNET, ResTFNet], and these approaches have outperformed many traditional methods. Nevertheless, because convolution has a limited ability to extract information beyond its local receptive field, there is still room for improvement.
In this work, we notice that the transformer and its various modifications have obtained outstanding achievements in natural language processing and computer vision tasks [vaswani2017attention, chen2021pre, dosovitskiy2020image]. Hence, we propose a transformer-based network architecture called Fusformer, which solves the hyperspectral image super-resolution problem by fusing the HR-MSI and LR-HSI. Our method integrates a self-attention mechanism that can exploit more global relationships among the data than the convolution operation with its limited receptive field. Furthermore, we force Fusformer to estimate the residuals instead of reconstructing the whole HR-HSI, enabling the network to learn in a smaller mapping space. To sum up, the contributions of this paper are as follows:
To our knowledge, this is the first time the transformer has been used to solve the hyperspectral image super-resolution problem. The self-attention mechanism in the transformer enables our network to represent more global information than previous CNN architectures.
The proposed approach focuses on the residual domain instead of the primitive image domain, which leads to a smaller mapping space. It relieves the network of the burden of reconstructing the desired HR-HSI directly.
Only a few parameters are involved in the network, and the computation is light, making our approach practical in real-life applications. Furthermore, the network is plain and easy to follow, so future research can build on our simple yet effective architecture.
The rest of this article is organized as follows. Section II introduces our Fusformer architecture in detail. In Section III, extensive experiments on different datasets are presented to validate the superiority of our network. Finally, we draw the conclusion in Section IV. In this paper, we use non-bold, bold upper-case, and calligraphic upper-case letters to denote scalars, matrices, and tensors, respectively.
2 Network Architecture
With the rapid development of deep learning techniques, CNN-based approaches have been used to solve many computer vision and image processing tasks, including the pansharpening and HSI-MSI fusion problems [MHFnetpami, dian2018deep, shao2018remote, li2017hyperspectral, vitale2019cnn, han2019hyperspectral, liu2018deep, huang2015new, liu2018psgan, CNN-FUS, SSRNET, ResTFNet]. They have obtained state-of-the-art performance in recent years due to their powerful feature extraction ability. Notwithstanding these remarkable achievements, the core elements of such networks are convolution kernels with localized kernel sizes. Thus, the region of interest of a convolution is restricted to a small area, i.e., the convolution operation is conducted locally, and the global structure is neglected even though it contains valuable information. Given this limitation of convolution, how to better extract and understand the global information becomes a difficult but vital issue.
The transformer model was introduced by Vaswani et al. in 2017 [vaswani2017attention] to capture long-term information better than recurrent neural networks and convolutional neural networks. The transformer outperforms other methods and has proven crucial in natural language processing (NLP) tasks. Furthermore, motivated by the success of the transformer architecture in NLP, Dosovitskiy et al. [dosovitskiy2020image] propose the vision transformer (ViT) for image classification, and Chen et al. [chen2021pre] design the image processing transformer (IPT) to address low-level vision tasks. Both report the best results compared with the existing approaches. The achievements of the transformer in various computer vision tasks inspire us to design a transformer-based network that solves the hyperspectral image super-resolution problem by exploiting its superior ability to capture long-term information and relationships.
2.2 Proposed Method
Because the regular convolution in a CNN operates on a local region of interest, the global information is barely used. Hence, we use the transformer model to consider the features and information effectively and globally. The flowchart of our network structure is presented in Fig. 1. Note that the inputs of our Fusformer are the upsampled LR-HSI and the HR-MSI, since the upsampled LR-HSI has a structure similar to that of the ground-truth HR-HSI.
2.2.1 Input of Transformer
The HR-MSI is then concatenated with the upsampled LR-HSI along the spectral dimension. After obtaining the data cube containing the raw spectral and spatial information, we unfold the tensor into a matrix, as required by the input dimensions of the transformer model. It is worth noting that each vector in the matrix has a physical meaning, i.e., it represents a pixel in the image with its spectral structure and corresponding spatial information. In contrast, other transformer-based methods for computer vision tasks [dosovitskiy2020image, chen2021pre] reshape a small image patch into a vector directly. On the one hand, a hyperspectral image contains more spectral bands than a natural RGB image, so a vector reshaped from an image patch would be too long to compute with. On the other hand, we believe a vector representing pixel information instead of a patch is also suitable for our pixel-wise super-resolution problem. Hence, the transformer model is quite consistent with the characteristics of the hyperspectral image super-resolution problem: every pixel can be naturally represented as a vector, and the transformer architecture enables the network to discover and consider the relationships among all the pixels globally. With a simple fully connected layer, the matrix is then embedded into a feature matrix with a fixed number of feature channels. Next, we send the embedded vectors to the transformer model, which is represented in Fig. 1-(b).
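The input construction described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names (`concat_spectral`, `unfold`, `linear_embed`), the toy spatial size, the three-band HR-MSI, and the embedding width `d = 8` are all assumptions chosen for readability. Only the 31 spectral bands match the datasets used in this paper.

```python
import random

def concat_spectral(up_lrhsi, hrmsi):
    """Concatenate two H x W x C cubes along the last (spectral) axis."""
    return [[px_a + px_b for px_a, px_b in zip(row_a, row_b)]
            for row_a, row_b in zip(up_lrhsi, hrmsi)]

def unfold(cube):
    """Unfold an H x W x C cube into an (H*W) x C matrix, one row per pixel."""
    return [pixel for row in cube for pixel in row]

def linear_embed(tokens, weight, bias):
    """Fully connected layer: weight[j] holds the C weights of output channel j."""
    return [[sum(x * w for x, w in zip(tok, w_row)) + b
             for w_row, b in zip(weight, bias)]
            for tok in tokens]

# Toy sizes (assumed): 4 x 4 pixels, 31 HSI bands, 3 MSI bands, 8 feature channels.
H, W, S, s, d = 4, 4, 31, 3, 8
rng = random.Random(0)

def random_cube(height, width, bands):
    return [[[rng.random() for _ in range(bands)] for _ in range(width)]
            for _ in range(height)]

up_lrhsi = random_cube(H, W, S)   # stands in for the upsampled LR-HSI
hrmsi = random_cube(H, W, s)      # stands in for the HR-MSI

cube = concat_spectral(up_lrhsi, hrmsi)   # H x W x (S + s)
tokens = unfold(cube)                     # (H*W) x (S + s), one vector per pixel
weight = [[rng.uniform(-0.1, 0.1) for _ in range(S + s)] for _ in range(d)]
bias = [0.0] * d
embedded = linear_embed(tokens, weight, bias)   # (H*W) x d

print(len(tokens), len(tokens[0]), len(embedded[0]))  # prints: 16 34 8
```

Each row of `tokens` is the full spectral vector of one pixel, so the sequence length equals the number of pixels and the per-token length stays at the band count, which is the property the paragraph above contrasts with patch-based tokenization.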
The transformer model, shown in Fig. 1-(b), is the main part of our architecture. Here we use both the encoder and the decoder parts of the original transformer. In the encoder (see the top of Fig. 1-(b)), LN indicates layer normalization, which is widely used in transformer-based methods [vaswani2017attention, chen2021pre, dosovitskiy2020image] for training stability. Multi-head attention is the self-attention mechanism with multiple heads, which enables the network to capture and consider relationships globally. The general procedure of the self-attention mechanism is as follows.
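In the standard scaled dot-product form [vaswani2017attention], with $\mathbf{X}$ denoting the input of the attention module,
$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q,\quad \mathbf{K} = \mathbf{X}\mathbf{W}_K,\quad \mathbf{V} = \mathbf{X}\mathbf{W}_V,\qquad \mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right)\mathbf{V},$$
where $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ denote the corresponding weight matrices that need to be trained, and $d$ represents the dimension of $\mathbf{K}$, used for scaling. As for the multi-head attention, $h$ heads denote $h$ individual groups of $(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V)$ and $(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ with $h$ attention values. Furthermore, the softmax gives attention values from 0 to 1, which differentiate levels of importance in $\mathbf{V}$. The whole algorithm of the encoder can be described as follows.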
Here, LN represents the layer normalization, the attention module is the multi-head attention described above, and the MLP is the multi-layer perceptron plotted in Fig. 1-(c). Similarly, the decoder can be described as follows.
After obtaining the learned features, we reshape them back into a 3D tensor and feed it to a refinement module to obtain the desired residual. Finally, the output is obtained by adding the learned residual to the upsampled LR-HSI.
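The attention step and the final residual addition described in this section can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions rather than the Fusformer implementation: it uses a single attention head, omits layer normalization, the MLP, and the decoder, and replaces the refinement module with a hypothetical linear projection `w_refine`. All function names and sizes are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows (pixels) of x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))                      # sqrt of the key dimension
    k_t = [list(col) for col in zip(*k)]              # transpose of k
    scores = matmul(q, k_t)                           # N x N: every pixel vs. every pixel
    attn = [softmax([sc / scale for sc in row]) for row in scores]
    return matmul(attn, v), attn

def fold(tokens, height, width):
    """Reshape an (H*W) x C matrix back into an H x W x C cube."""
    return [tokens[r * width:(r + 1) * width] for r in range(height)]

def add_residual(upsampled, residual):
    """Output cube = upsampled LR-HSI + learned residual, element by element."""
    return [[[u + r for u, r in zip(pu, pr)] for pu, pr in zip(row_u, row_r)]
            for row_u, row_r in zip(upsampled, residual)]

rng = random.Random(0)
H, W, d, S = 2, 3, 4, 5   # toy sizes (assumed)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(H * W, d)                              # embedded pixel vectors
features, attn = self_attention(x, rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d))
w_refine = rand_matrix(d, S)                           # hypothetical stand-in for the refine module
residual = fold(matmul(features, w_refine), H, W)      # H x W x S
upsampled = fold(rand_matrix(H * W, S), H, W)          # stands in for the upsampled LR-HSI
output = add_residual(upsampled, residual)
```

Each row of `attn` has one weight per pixel in the image, which is what lets every output pixel draw on all other pixels at once. A multi-head variant would run several such groups of weight matrices in parallel and combine their outputs.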
3 Experiment Results
To verify the effectiveness of the proposed Fusformer, we compare it with classic and state-of-the-art methods: 1) traditional approaches: FUSE [FUSE], GLP-HS [GLP-HS], CSTF [CSTF], and CNN-FUS [CNN-FUS]; and 2) deep learning-based approaches: SSRNet [SSRNET], ResTFNet [ResTFNet], MHF-Net [MHFnetpami], and HSRnet [Hutnnls]. The comparison is conducted on two hyperspectral image datasets, i.e., the CAVE dataset [yasuma2010generalized] and the Harvard dataset [chakrabarti2011statistics]. Both contain hyperspectral images with 31 spectral channels. Images of the CAVE dataset have a size of 512 × 512, while images of the Harvard dataset are cropped to a spatial size of 1000 × 1000. We select 20 images from the CAVE dataset for training, and use 11 images from the CAVE dataset and 10 images from the Harvard dataset for testing. Note that all the training images are from the CAVE dataset. The images from the Harvard dataset can thus be used for the generalization test, which is crucial for deep learning-based methods.
Tab. 1 and Tab. 2 list the quantitative comparisons on the CAVE and Harvard datasets. The proposed Fusformer obtains the best results on almost every quality index (QI) while involving only 0.1 million parameters, which makes our network more practical. We also show the visual results and the corresponding absolute residual maps of two samples selected from the CAVE and Harvard datasets in Fig. 2. Our approach yields the darkest residual maps, i.e., the smallest errors among the compared methods.
3.1 Generalization Ability
Deep learning-based methods usually perform poorly on examples that differ from the training dataset, so their generalization ability is crucial. Even so, the results of our Fusformer remain satisfactory, and ERGAS is the only index on which it does not achieve the smallest value. The generalization performance of HSRnet is close to that of the proposed Fusformer, but HSRnet involves many more parameters than Fusformer.
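For readers unfamiliar with ERGAS (the relative dimensionless global error in synthesis), the sketch below computes it under its commonly used definition, 100 / ratio × sqrt(mean over bands of MSE_b / μ_b²), where μ_b is the mean of band b in the reference image and smaller values are better. This is an assumption about the definition: the text above does not state the formula or the spatial scale `ratio`, so the function name, its signature, and the value `ratio = 4` are illustrative only.

```python
import math

def ergas(reference, estimate, ratio):
    """ERGAS between two images given as lists of pixels, each a list of B band values.

    `ratio` is the spatial scale factor between the HR and LR images (assumed here).
    """
    n_pixels = len(reference)
    n_bands = len(reference[0])
    total = 0.0
    for b in range(n_bands):
        # Per-band mean squared error and per-band mean of the reference.
        mse = sum((ref[b] - est[b]) ** 2
                  for ref, est in zip(reference, estimate)) / n_pixels
        mean = sum(ref[b] for ref in reference) / n_pixels
        total += mse / mean ** 2
    return 100.0 / ratio * math.sqrt(total / n_bands)

# Two pixels with two bands each (toy data).
reference = [[1.0, 2.0], [3.0, 2.0]]
perturbed = [[1.1, 2.0], [2.9, 2.0]]

perfect_score = ergas(reference, reference, ratio=4)   # identical images give 0.0
perturbed_score = ergas(reference, perturbed, ratio=4)
```

Because every band error is normalized by that band's mean intensity, ERGAS can rank methods differently from indices that average absolute errors, which is one way a method can lead on most indices but not on this one.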
3.2 Ablation Study
Our Fusformer is expected to learn the residuals between the upsampled LR-HSI and the ground truth instead of reconstructing the whole HSI. We conduct a simple experiment to verify the residual learning strategy (RLS), i.e., adding the upsampled LR-HSI to the outcome learned by the network. Tab. 3 records the results of the architecture with and without the RLS. It is clear that adding the upsampled LR-HSI is vital for the network: the rough information of the upsampled LR-HSI helps the network boost its performance and strengthens its stability.
Tab. 3: Average QIs and related standard deviations of the results on the CAVE dataset using the proposed method with and without the residual learning strategy. The best values are highlighted in boldface.
In this work, a transformer-based network architecture called Fusformer is proposed. Compared with previous CNN-based methods, our method can consider global information instead of only the local information within a receptive field limited by the kernel size. To the best of our knowledge, this is the first time the transformer model has been adopted for the hyperspectral image super-resolution problem. Our method is simple yet effective and contains few parameters. Future research can further exploit the potential of our proposed framework.
The work is supported by National Natural Science Foundation of China grants 12001446, 61702083, 12171072 and 61876203, and the Fundamental Research Funds for the Central Universities JBK2001011.