1 Introduction
Recent advances in hyperspectral (HS) imaging technology have enabled the availability of enormous HS images (HSIs) with a densely sampled spectrum [25]. Benefiting from the abundant spectral information contained in those hundreds of band measurements, HSI holds great promise for delivering faithful representations of real-world materials and objects. Thus, the pursuit of effective and efficient processing of HS data has long been recognized as a prominent topic in the field of computer vision [10]. Physically, however, the insufficient spatial resolution of HS instruments, combined with an inherently intimate mixing effect, severely hampers the abilities of HSI in various real applications [2, 34]. Fortunately, multispectral (MS) imaging systems (e.g., RGB cameras, spaceborne MS sensors) are capable of providing complementary products, which preserve much finer spatial information at the cost of reduced spectral resolution [12]. Accordingly, the research on enhancing the spatial resolution (henceforth, resolution refers to spatial resolution) of an observable low-resolution HSI (LrHSI) by merging a high-resolution MSI (HrMSI) under the same scene, which is referred to as hyperspectral image super-resolution (HSI-SR), has been gaining considerable attention [14, 15].
The last decade has witnessed a dominant development of optimization-based methods, from either deterministic or stochastic perspectives, to tackle the HSI-SR issue [36]. To mitigate the severe ill-posedness of such an inverse problem, the majority of prevailing methods focus on exploiting various handcrafted priors to characterize the spatial and spectral information underlying the desired solution. Moreover, their dependency on knowledge of relevant sensor characteristics, such as the spectral response function (SRF) and point spread function (PSF), inevitably compromises their transparency and practicability.
More recently, growing interest has been paid to leveraging deep learning (DL), exploiting its merits in low-level vision applications. Among them, the best results are achieved by investigators who resort to performing HSI-SR progressively in a supervised fashion [33]. However, the demand for sufficient training image pairs acquired with different sensors inevitably limits their practicability. On the other hand, though rarely studied, the existing unsupervised works rely on either a complicated multi-stage alternating optimization [24] or an external camera spectral response (CSR) dataset in the context of RGB image guidance [11]; the latter also loses generality when confronting other kinds of data with higher spectral resolution than RGB.
To address the aforementioned challenges, we propose a novel coupled unmixing network with cross-attention (CUCaNet) for unsupervised HSI-SR. The contributions of this paper are briefly summarized as follows:

We propose a novel unsupervised HSI-SR model, called CUCaNet, which is built on a coupled convolutional autoencoder network. CUCaNet models the physical mixing properties of HS imaging within the networks to transfer the spatial information of the MSI to the HSI while simultaneously preserving the latter's high spectral resolution in a coupled fashion.

We devise an effective cross-attention module to extract significant spectral (or spatial) information from the HSI (or MSI) and transfer it to the other branch, yielding a fuller blending of spatial-spectral information.

Beyond previous coupled HSI-SR models, the proposed CUCaNet is capable of adaptively learning PSFs and SRFs across MS-HS sensors with a high ability to generalize. To find the local optimum of the network more effectively, we shrink the solution space by designing a closed-loop consistency regularization in the networks, acting on both the spatial and spectral domains.
2 Related Work
The investigated HSI-SR problem is closely associated with the pansharpening task, which aims at generating a HrMSI (or HrHSI) by fusing the LrMSI (or LrHSI) with a corresponding panchromatic image of higher resolution [27, 21]. Therefore, pioneering studies emerged naturally by adapting the extensively studied pansharpening techniques to HSI-SR [5]. Still, as a result of the separate sharpening operation, these methods usually fail to capture the global continuity in the spectral profiles well, which brings unignorable performance degradation and thus leaves much room to be desired.
2.1 Conventional Methods
Apace with the advances in statistical modeling and machine learning, recent optimization-based methods have evidently lifted the achievable HSI-SR ratio. According to a subspace assumption, the Bayesian approach was first introduced by Eismann et al. utilizing a stochastic mixing model [9], and was developed through subsequent research exploiting more inherent characteristics [26, 32]. Another class of actively investigated methods stems from the idea of spectral unmixing [13], which takes the intimate mixing effect into consideration. Yokoya et al. brought up coupled nonnegative matrix factorization (CNMF) [37] to estimate the spectral signatures of the underlying materials and the corresponding coefficients alternately. On the basis of CNMF, Kawakami et al. [4] employed sparse regularization, and an effective projected gradient solver was devised by Lanaras et al. [19]. Besides, [2, 8] adopted dictionary learning and sparse coding techniques in this context. Various kinds of tensor factorization strategies have also been studied, such as the Tucker decomposition adopted by Dian et al. [6] and Li et al. [20] to model nonlocal and coupled structure information, respectively.

2.2 DL-Based Methods
To avoid the tedious handcrafted prior modeling of conventional methods, DL-based methods have attracted increasing interest in recent years. In the class of supervised methods, Dian et al. [7] employed a CNN with prior training to finely tune the result acquired by solving a conventional optimization problem, while Xie et al. [33] introduced a deep unfolding network based on a novel HSI degradation model. Unsupervised methods are more rarely studied. Qu et al. [24] developed an unsupervised HSI-SR network with a Dirichlet distribution-induced layer embedded, which results in a multi-stage alternating optimization. Under the guidance of an RGB image and an external CSR database, Fu et al. [11] designed a unified CNN framework with a particular CSR optimization layer. Albeit demonstrated to be comparatively effective, these methods require either large training data for supervision or knowledge of the PSFs or SRFs, both of which are unrealistic in real HSI-SR scenarios. Very recently, Zheng et al. [39] proposed a coupled CNN that adaptively learns the PSFs and SRFs for unsupervised HSI-SR. However, due to the lack of effective regularizations or constraints, the two to-be-estimated functions inevitably introduce more degrees of freedom, limiting further performance improvement.
3 Coupled Unmixing Nets with Cross-Attention
In this section, we present the proposed coupled unmixing networks with a cross-attention module implanted, called CUCaNet for short. For mathematical brevity, we resort to a 2-D representation of the 3-D image cube, that is, the spectrum of each pixel is stacked row by row.
3.1 Method Overview
The proposed CUCaNet builds on a two-stream convolutional autoencoder backbone, which aims at jointly decomposing MS and HS data into a spectrally meaningful basis and corresponding coefficients. In accordance with the idea of coupled spectral unmixing, the fused HrHSI is obtained by feeding the decoder of the HSI branch with the encoded maps of the MSI branch. Two additional convolution layers are incorporated to simulate the spatial and spectral downsampling processes across MS-HS sensors. To guarantee that CUCaNet can converge to a faithful product through unsupervised training, reasonable consistency and necessary unmixing constraints are integrated smoothly without imposing evident redundancy. Moreover, we introduce the attention mechanism into HSI-SR for the first time. More specifically, a cross-attention module is devised to transfer significant spectral (or spatial) information from the HSI (or MSI) to the other MSI (or HSI) branch, which fully exploits the advantageous spectral (or spatial) guidance for performance improvement.
3.2 Problem Formulation
Given the LrHSI $\mathbf{X} \in \mathbb{R}^{L \times hw}$ and the HrMSI $\mathbf{Y} \in \mathbb{R}^{l \times HW}$, the goal of HSI-SR is to recover the latent HrHSI $\mathbf{Z} \in \mathbb{R}^{L \times HW}$, where $h$, $w$, and $l$ are the reduced height, width, and number of spectral bands, respectively, and $H$, $W$, and $L$ are the corresponding upsampled versions. Based on the linear mixing model that well explains the phenomenon of mixed pixels involved in $\mathbf{Z}$, we then have the following NMF-based representation,

$\mathbf{Z} \approx \mathbf{E}\mathbf{A}$,  (1)

where $\mathbf{E} \in \mathbb{R}^{L \times K}$ and $\mathbf{A} \in \mathbb{R}^{K \times HW}$ are a collection of spectral signatures of $K$ pure materials (or say, endmembers) and their fractional coefficients (or say, abundances), respectively.

On the other hand, the degradation processes in the spatial ($\mathbf{X}$) and the spectral ($\mathbf{Y}$) observations can be modeled as

$\mathbf{X} = \mathbf{Z}\mathbf{S} \approx \mathbf{E}(\mathbf{A}\mathbf{S}) = \mathbf{E}\tilde{\mathbf{A}}$,  (2)
$\mathbf{Y} = \mathbf{R}\mathbf{Z} \approx (\mathbf{R}\mathbf{E})\mathbf{A} = \tilde{\mathbf{E}}\mathbf{A}$,  (3)

where $\mathbf{S} \in \mathbb{R}^{HW \times hw}$ and $\mathbf{R} \in \mathbb{R}^{l \times L}$ represent the PSF and SRF degrading the HrHSI to the LrHSI and the HrMSI, respectively. Since $\mathbf{S}$ and $\mathbf{R}$ are nonnegative and normalized, $\tilde{\mathbf{A}} = \mathbf{A}\mathbf{S}$ and $\tilde{\mathbf{E}} = \mathbf{R}\mathbf{E}$ can be regarded as spatially downsampled abundances and spectrally downsampled endmembers, respectively. Therefore, an intuitive solution is to unmix $\mathbf{X}$ and $\mathbf{Y}$ based on Eq. (2) and Eq. (3) alternately, coupled with the prior knowledge of $\mathbf{S}$ and $\mathbf{R}$. Such a principle has been exploited in various optimization formulations, obtaining state-of-the-art fusion performance by linear approximation with converged $\mathbf{E}$ and $\mathbf{A}$.
Constraints. Still, the issued HSI-SR problem involves the inversions from $\mathbf{X}$ and $\mathbf{Y}$ to $\mathbf{E}$ and $\mathbf{A}$, which are highly ill-posed. To narrow the solution space, several physically meaningful constraints are commonly adopted, namely the abundance sum-to-one constraint (ASC), the abundance nonnegativity constraint (ANC), and the nonnegativity constraint on endmembers, i.e.,

$\mathbf{1}^\top \mathbf{A} = \mathbf{1}^\top, \quad \mathbf{A} \succeq \mathbf{0}, \quad \mathbf{E} \succeq \mathbf{0}$,  (4)

where $\succeq$ marks elementwise inequality and $\mathbf{1}$ represents an all-one vector of compatible length. It is worth mentioning that the combination of ASC and ANC promotes the sparsity of abundances, which well characterizes the rule that only a few endmembers contribute to the spectrum of each pixel.

Yet in practice, the prior knowledge of PSFs and SRFs for numerous kinds of imaging systems is hardly available. This restriction motivates us to extend the current coupled unmixing model to a fully end-to-end framework that only requires the LrHSI and HrMSI. To estimate $\mathbf{S}$ and $\mathbf{R}$ in an unsupervised manner, we introduce the following consistency constraint,

$\mathbf{R}\mathbf{X} = \mathbf{Y}\mathbf{S} \triangleq \bar{\mathbf{Y}}$,  (5)

where $\bar{\mathbf{Y}} \in \mathbb{R}^{l \times hw}$ denotes the latent LrMSI.
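To make the formulation concrete, the degradation model of Eqs. (1)-(3) and the consistency constraint of Eq. (5) can be simulated numerically. The sketch below, with toy sizes not tied to any dataset, verifies that applying the SRF to the LrHSI and the PSF to the HrMSI yields the same latent LrMSI:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, l = 5, 31, 4              # endmembers, HS bands, MS bands (toy sizes)
H = W = 32; ratio = 8; h, w = H // ratio, W // ratio

# Eq. (1): linear mixing model, Z = E A
E = np.abs(rng.standard_normal((L, K)))       # nonnegative endmembers
A = rng.random((K, H * W)); A /= A.sum(0)     # abundances satisfy ANC and ASC
Z = E @ A                                     # latent HrHSI (L x HW)

# Eq. (3): spectral degradation by a row-normalized SRF R
R = rng.random((l, L)); R /= R.sum(1, keepdims=True)
Y = R @ Z                                     # HrMSI (l x HW)

def psf_downsample(img, r):
    """Idealized PSF of Eq. (2): average over disjoint r x r blocks."""
    C, Hh, Ww = img.shape
    return img.reshape(C, Hh // r, r, Ww // r, r).mean((2, 4))

# Eq. (2): spatial degradation, giving the LrHSI X
X = psf_downsample(Z.reshape(L, H, W), ratio).reshape(L, h * w)

# Eq. (5): both degradation paths meet at the same latent LrMSI
lrmsi_srf = R @ X                             # SRF applied to the LrHSI
lrmsi_psf = psf_downsample(Y.reshape(l, H, W), ratio).reshape(l, h * w)
assert np.allclose(lrmsi_srf, lrmsi_psf)
```

The final assertion holds exactly because the SRF acts on bands while the PSF acts on pixels, so the two linear operators commute; this is precisely what makes Eq. (5) a usable supervision signal when neither function is known.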
3.3 Network Architecture
Inspired by the recent success of deep networks on visual processing tasks, we first perform coupled spectral unmixing with the established two-stream convolutional autoencoder for the two-modal inputs. That is, we consider two deep subnetworks: $\{\mathrm{Enc}_h, \mathrm{Dec}_h\}$ with weights $\Theta_h$ to self-express the LrHSI, and $\{\mathrm{Enc}_m, \mathrm{Dec}_m\}$ with weights $\Theta_m$ for the HrMSI; the fused result can then be obtained by $\hat{\mathbf{Z}} = \mathrm{Dec}_h(\mathrm{Enc}_m(\mathbf{Y}))$, herein each $\Theta$ collects the weights of the corresponding subpart.
Specifically, as shown in Fig. 1, both encoders $\mathrm{Enc}_h$ and $\mathrm{Enc}_m$ are constructed by cascading “Convolution+LReLU” blocks with an additional convolution layer. We set the sizes of the convolutional kernels in $\mathrm{Enc}_h$ all to $1 \times 1$, while those in $\mathrm{Enc}_m$ have larger but descending scales of the receptive field. The idea behind this setting is to take the low fidelity of the spatial information in the LrHSI into consideration and simultaneously map the cross-channel and spatial correlations underlying the HrMSI. Furthermore, to ensure that the encoded maps possess the properties of abundances, an additional activation layer using the clamp function in the range $[0, 1]$ is concatenated after each encoder. As for the structure of the decoders $\mathrm{Dec}_h$ and $\mathrm{Dec}_m$, we simply adopt a $1 \times 1$ convolution layer without any nonlinear activation, making the weights interpretable as the endmembers $\mathbf{E}$ and $\tilde{\mathbf{E}}$ according to Eq. (2) and Eq. (3). Through backward gradient descent-based optimization, our backbone network not only avoids the need for the good initialization required by conventional unmixing algorithms but also enjoys the amelioration brought by its capability for local perception and nonlinear processing.
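A minimal PyTorch sketch of this two-stream backbone is given below; the layer widths, kernel sizes, and number of endmembers are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Cascaded Conv+LReLU blocks, a final convolution, then a clamp to
    [0, 1] so the encoded maps behave like abundances."""
    def __init__(self, in_ch, n_end, ksizes, width=64):
        super().__init__()
        layers, ch = [], in_ch
        for k in ksizes:
            layers += [nn.Conv2d(ch, width, k, padding=k // 2),
                       nn.LeakyReLU(0.2)]
            ch = width
        layers.append(nn.Conv2d(ch, n_end, 1))
        self.body = nn.Sequential(*layers)
    def forward(self, x):
        return self.body(x).clamp(0.0, 1.0)   # ANC-friendly range

class Decoder(nn.Module):
    """A single 1x1 convolution without activation; its weight plays the
    role of the endmember matrix."""
    def __init__(self, n_end, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(n_end, out_ch, 1, bias=False)
    def forward(self, a):
        return self.mix(a)

n_end, hs_bands, ms_bands = 30, 31, 4
enc_hs = Encoder(hs_bands, n_end, ksizes=[1, 1, 1])  # 1x1 kernels for LrHSI
enc_ms = Encoder(ms_bands, n_end, ksizes=[7, 5, 3])  # descending receptive fields
dec_hs = Decoder(n_end, hs_bands)
dec_ms = Decoder(n_end, ms_bands)

lr_hsi = torch.rand(1, hs_bands, 8, 8)
hr_msi = torch.rand(1, ms_bands, 64, 64)
x_rec = dec_hs(enc_hs(lr_hsi))     # self-reconstruction of the LrHSI
y_rec = dec_ms(enc_ms(hr_msi))     # self-reconstruction of the HrMSI
z_hat = dec_hs(enc_ms(hr_msi))     # fusion: MS abundances + HS endmembers
```

The last line is the coupled-unmixing step: the HrMSI's encoded abundances are decoded with the HSI branch's endmembers to produce the HrHSI estimate.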
Cross-Attention. To further exploit the advantageous information of the two modalities, we devise an effective cross-attention module to enrich the features across modalities. As shown in Fig. 2, the cross-attention module is employed on high-level features within the encoder part, with three steps to follow. First, we compute the spectral attention from the LrHSI branch and the spatial attention from the HrMSI branch, since these branches provide the more faithful spectral and spatial guidance, respectively. Next, we multiply the original features of each branch by the attention maps from the other branch to transfer the significant information. Lastly, we concatenate the original features with the above cross-multiplications in each branch to construct the input of the next layer, in the form of such preserved and refined representations.
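The three steps above can be sketched as follows; for simplicity the two feature maps are assumed to share spatial size and channel width (in practice the low-resolution HS features would first be brought to a common scale), and all layer shapes are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Cross-attention exchange between the HS and MS encoder branches:
    the HS branch donates a channel-wise (spectral) attention map, the MS
    branch a spatial one; each branch then concatenates its own features
    with the cross-weighted ones."""
    def __init__(self, ch):
        super().__init__()
        self.global_pool = nn.AdaptiveAvgPool2d(1)        # channel statistics
        self.local_conv = nn.Conv2d(ch, 1, 3, padding=1)  # spatial statistics
    def forward(self, f_hs, f_ms):
        # spectral attention from the HS branch (softmax over channels)
        a_spe = torch.softmax(self.global_pool(f_hs), dim=1)    # (B, C, 1, 1)
        # spatial attention from the MS branch (softmax over positions)
        m = self.local_conv(f_ms)
        a_spa = torch.softmax(m.flatten(2), dim=-1).view_as(m)  # (B, 1, H, W)
        # cross-multiply, then concatenate with the original features
        g_hs = torch.cat([f_hs, f_hs * a_spa], dim=1)  # HS gets the spatial cue
        g_ms = torch.cat([f_ms, f_ms * a_spe], dim=1)  # MS gets the spectral cue
        return g_hs, g_ms

f_hs = torch.rand(1, 32, 16, 16)   # high-level HS features (assumed upsampled)
f_ms = torch.rand(1, 32, 16, 16)   # high-level MS features
g_hs, g_ms = CrossAttention(32)(f_hs, f_ms)   # channels double: 32 -> 64
```

The concatenation preserves the original representation alongside the attention-refined one, so the next layer can fall back on the unattended features if the cross-branch guidance is unreliable.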
Formally, the output features of the $i$-th layer in the encoder part, taking the HSI branch for example, are formulated as

$\mathbf{F}_h^{(i)} = \mathrm{LReLU}\big(\Theta_h^{(i)} * \mathbf{F}_h^{(i-1)}\big)$,  (6)

which is similar for obtaining $\mathbf{F}_m^{(i)}$ from $\mathbf{F}_m^{(i-1)}$. To gather the spectrally and spatially significant information, we adopt global and local convolutions to generate channel-wise and spatial statistics, respectively, as

$\mathbf{s} = \mathbf{W}_g * \mathbf{F}_h^{(i)}, \quad \mathbf{M} = \mathbf{W}_l * \mathbf{F}_m^{(i)}$,  (7)

where $\mathbf{W}_g$ is a set of global convolution filters whose spatial size equals that of the feature maps, collapsing each channel of $\mathbf{F}_h^{(i)}$ into a single statistic, and $\mathbf{W}_l$ is a convolution filter with a small spatial size, producing the spatial statistic map $\mathbf{M}$. Then we apply a softmax layer to the above statistics to get the attention maps $\mathbf{a}_{\mathrm{spe}} = \sigma(\mathbf{s})$ and $\mathbf{A}_{\mathrm{spa}} = \sigma(\mathbf{M})$, where $\sigma(\cdot)$ denotes the softmax activation function. The original features are finally fused into the inputs of the next layer as $[\mathbf{F}_h^{(i)}, \mathbf{F}_h^{(i)} \odot \mathbf{A}_{\mathrm{spa}}]$ and $[\mathbf{F}_m^{(i)}, \mathbf{F}_m^{(i)} \odot \mathbf{a}_{\mathrm{spe}}]$, where $[\cdot\,, \cdot]$ denotes concatenation and $\odot$ denotes pointwise multiplication.

Spatial-Spectral Consistency. An essential part that tends to be ignored relates to the coupled factors caused by the PSFs and SRFs. Previous research typically assumes an ideal average spatial downsampling and prior knowledge of the SRFs, which rarely hold in reality. Unlike them, we introduce a spatial-spectral consistency module into the networks to better simulate the to-be-estimated PSF and SRF, implemented by simple yet effective convolution layers.
We can rewrite the spectral resampling from the HS sensor to the MS sensor, revisiting the left part of Eq. (3) more accurately, as follows. Given the spectrum $\mathbf{z}_i$ of the $i$-th pixel in the HrHSI, the radiance of the $j$-th channel in the corresponding HrMSI is defined as

$\bar{y}_{j,i} = \frac{1}{c_j} \sum_{\lambda \in \Lambda_j} r_j(\lambda)\, z_i(\lambda)$,  (8)

where $\Lambda_j$ denotes the support set that the wavelength $\lambda$ belongs to, and $c_j = \sum_{\lambda \in \Lambda_j} r_j(\lambda)$ denotes the normalization constant. We directly replace $r_j(\cdot)$ with a set of convolution kernels whose weights are collected in $\Theta_{\mathrm{SRF}}$. Therefore, the SRF layer can be well defined as follows,

$g_{\mathrm{SRF}}(\mathbf{Z}) = \mathrm{Norm}(\Theta_{\mathrm{SRF}} * \mathbf{Z})$,  (9)

where $\mathrm{Norm}(\cdot)$ corresponds to the additional normalization by $c_j$. The PSF layer for spatial downsampling is more straightforward. Note that the PSF generally indicates that each pixel in the LrHSI is produced by combining neighboring pixels in the HrHSI with unknown weights in a disjoint manner [29]. To simulate this process, we realize $g_{\mathrm{PSF}}(\cdot)$ by means of a channel-wise convolution layer whose kernel size and stride both equal the scaling ratio.
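A possible PyTorch realization of the two degradation layers is sketched below; the softmax-based normalization of the SRF weights is one convenient way to keep them nonnegative and summing to one, not necessarily the paper's exact choice:

```python
import torch
import torch.nn as nn

class SRFLayer(nn.Module):
    """Learnable SRF (cf. Eq. (9)) as a 1x1 mixing over bands; a softmax
    over the input-band axis realizes the nonnegativity and normalization."""
    def __init__(self, hs_bands, ms_bands):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(ms_bands, hs_bands))
    def forward(self, z):
        w = torch.softmax(self.weight, dim=1)     # normalized responses
        return torch.einsum('ml,blhw->bmhw', w, z)

class PSFLayer(nn.Module):
    """Learnable PSF as a channel-wise (grouped) convolution whose kernel
    size and stride both equal the scaling ratio, so each LrHSI pixel is a
    weighted blend of one disjoint neighborhood in the HrHSI."""
    def __init__(self, bands, ratio):
        super().__init__()
        self.conv = nn.Conv2d(bands, bands, ratio, stride=ratio,
                              groups=bands, bias=False)
    def forward(self, z):
        return self.conv(z)

z = torch.rand(1, 31, 64, 64)          # a stand-in HrHSI
y = SRFLayer(31, 3)(z)                 # spectral downsampling -> (1, 3, 64, 64)
x = PSFLayer(31, ratio=8)(z)           # spatial downsampling  -> (1, 31, 8, 8)
```

Setting stride equal to kernel size makes the receptive fields disjoint, matching the blending behavior described above.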
To sum up, multiple consistency constraints derived from the statements in Section 3.2, either spectral or spatial, can be defined in our networks as

$g_{\mathrm{PSF}}(\hat{\mathbf{Z}}) \approx \mathbf{X}, \quad g_{\mathrm{SRF}}(\hat{\mathbf{Z}}) \approx \mathbf{Y}, \quad g_{\mathrm{SRF}}(\mathbf{X}) \approx g_{\mathrm{PSF}}(\mathbf{Y})$,  (10)

which enables the whole network to be trained within a closed loop.
3.4 Network Training
Loss Function. As shown in Fig. 1, our CUCaNet mainly consists of two autoencoders for the hyperspectral and multispectral data, respectively, leading to the following reconstruction loss:

$\mathcal{L}_{\mathrm{rec}} = \|\mathbf{X} - \mathrm{Dec}_h(\mathrm{Enc}_h(\mathbf{X}))\|_1 + \|\mathbf{Y} - \mathrm{Dec}_m(\mathrm{Enc}_m(\mathbf{Y}))\|_1$,  (11)

in which the $\ell_1$ norm is selected as the loss criterion for its perceptually satisfying performance in low-level image processing tasks [38].
The important physically meaningful constraints in spectral unmixing are also considered. Building on Eq. (4), we derive the second, ASC loss as

$\mathcal{L}_{\mathrm{ASC}} = \|\mathbf{1}^\top \mathbf{A}_h - \mathbf{1}^\top\|_1 + \|\mathbf{1}^\top \mathbf{A}_m - \mathbf{1}^\top\|_1$,  (12)

where $\mathbf{A}_h = \mathrm{Enc}_h(\mathbf{X})$ and $\mathbf{A}_m = \mathrm{Enc}_m(\mathbf{Y})$ denote the encoded abundance maps of the two streams; the ANC is reflected through the activation layer used behind the encoders.
To promote the sparsity of the abundances of both streams, we adopt a Kullback-Leibler (KL) divergence-based sparsity loss term by penalizing the discrepancies between their average activations and a tiny scalar $\epsilon$,

$\mathcal{L}_{\mathrm{spa}} = \sum_{k=1}^{K} \big[\mathrm{KL}(\epsilon \,\|\, \bar{a}_{h,k}) + \mathrm{KL}(\epsilon \,\|\, \bar{a}_{m,k})\big]$,  (13)

where $\bar{a}_{h,k}$ and $\bar{a}_{m,k}$ denote the average activation of the $k$-th abundance map in each stream and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the standard KL divergence [23].
Last but not least, we adopt the $\ell_1$ norm to define the spatial-spectral consistency loss based on Eq. (10) as follows,

$\mathcal{L}_{\mathrm{con}} = \|g_{\mathrm{PSF}}(\hat{\mathbf{Z}}) - \mathbf{X}\|_1 + \|g_{\mathrm{SRF}}(\hat{\mathbf{Z}}) - \mathbf{Y}\|_1 + \|g_{\mathrm{SRF}}(\mathbf{X}) - g_{\mathrm{PSF}}(\mathbf{Y})\|_1$.  (14)
By integrating all the above-mentioned loss terms, the final objective function for training CUCaNet is given by

$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \alpha \mathcal{L}_{\mathrm{ASC}} + \beta \mathcal{L}_{\mathrm{spa}} + \gamma \mathcal{L}_{\mathrm{con}}$,  (15)

where we use $\alpha$, $\beta$, and $\gamma$ to trade off the effects of the different constituents.
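The overall objective can be assembled as in the sketch below; the trade-off weights and the exact per-term forms follow the descriptions above but remain illustrative:

```python
import torch
import torch.nn.functional as F

def kl_sparsity(a, eps=0.03):
    """KL-divergence sparsity (cf. Eq. (13)): push the mean activation of
    each abundance map toward a tiny scalar eps (0.03 is an assumed value)."""
    rho = a.mean(dim=(0, 2, 3)).clamp(1e-6, 1 - 1e-6)
    return (eps * torch.log(eps / rho)
            + (1 - eps) * torch.log((1 - eps) / (1 - rho))).sum()

def total_loss(x, x_rec, y, y_rec, a_hs, a_ms, consis_pairs,
               alpha=1e-2, beta=1e-3, gamma=1.0):
    """Assembles Eqs. (11)-(15); alpha, beta, gamma are illustrative."""
    l_rec = F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)            # Eq. (11)
    l_asc = ((a_hs.sum(1) - 1).abs().mean()
             + (a_ms.sum(1) - 1).abs().mean())                   # Eq. (12)
    l_spa = kl_sparsity(a_hs) + kl_sparsity(a_ms)                # Eq. (13)
    l_con = sum(F.l1_loss(p, q) for p, q in consis_pairs)        # Eq. (14)
    return l_rec + alpha * l_asc + beta * l_spa + gamma * l_con  # Eq. (15)

x, y = torch.rand(1, 31, 8, 8), torch.rand(1, 4, 64, 64)
a_hs, a_ms = torch.rand(1, 30, 8, 8), torch.rand(1, 30, 64, 64)
loss = total_loss(x, x + 0.05, y, y, a_hs, a_ms, consis_pairs=[(x, x)])
```

Every term is nonnegative, so the weights only balance their magnitudes rather than sign-cancel each other.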
Implementation Details. Our network is implemented in the PyTorch framework. We choose the Adam optimizer [17] under its default parameter setting for training, with the batch size set to 1. The learning rate is initialized to 0.005, and a drop-step schedule with linear decay from epoch 2000 to epoch 10000 is applied [22]. We determined the hyperparameters using a grid search on the validation set and stopped training once the validation loss failed to decrease.
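One reading of this schedule, as a PyTorch LambdaLR, is sketched below (the stand-in model and the exact shape of the decay are assumptions):

```python
import torch

model = torch.nn.Conv2d(3, 3, 1)       # stand-in for CUCaNet's parameters
opt = torch.optim.Adam(model.parameters(), lr=0.005)  # default betas/eps

def lr_lambda(epoch):
    """Flat for the first 2000 epochs, then linear decay to zero at 10000."""
    if epoch < 2000:
        return 1.0
    return max(0.0, (10000 - epoch) / (10000 - 2000))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# training loop: opt.step() each batch, sched.step() once per epoch
```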
4 Experimental Results
In this section, we first review the HSI-MSI datasets and setup adopted in our experiments. Then, we provide an ablation study to verify the effectiveness of the proposed modules. Extensive comparisons with state-of-the-art methods on indoor and remotely sensed images are reported at last.
Dataset and Experimental Setting. Three widely used HSI-MSI datasets are investigated in this section: the CAVE dataset [35] (http://www.cs.columbia.edu/CAVE/databases/multispectral), the Pavia University dataset, and the Chikusei dataset [36] (http://naotoyokoya.com/Download.html). The CAVE dataset captures 32 different indoor scenes. Each image consists of 512×512 pixels with 31 spectral bands uniformly measured in wavelengths ranging from 400nm to 700nm. In our experiments, 16 scenes are randomly selected to report the performance. The Pavia University dataset was acquired by the ROSIS airborne sensor over the University of Pavia, Italy, in 2003. The original HSI comprises 610×340 pixels and 115 spectral bands. The top-left corner of the HSI, with 336×336 pixels and 103 bands (after removing 12 noisy bands) covering the spectral range from 430nm to 838nm, is used. The Chikusei dataset was taken by a Visible and Near-Infrared (VNIR) imaging sensor over Chikusei, Japan, in 2014. The original HSI consists of 2,517×2,335 pixels and 128 bands with a spectral range of 363nm to 1,018nm. We crop 6 non-overlapping parts of 576×448 pixels from the bottom part for testing.
Considering the diversity of MS sensors in generating the HrMS images, we employ the SRFs of the Nikon D700 camera [24] for the CAVE dataset and of the Landsat-8 spaceborne MS sensor [3] (http://landsat.gsfc.nasa.gov/?p=5779) for the two remotely sensed datasets (we select the spectral radiance responses of the blue-green-red (BGR) bands for the Pavia University experiments and of the BGR-NIR bands for the Chikusei experiments). We adopt a Gaussian filter to obtain the LrHS images, constructing the filter with a width equal to the SR ratio and a standard deviation of 0.5. The SR ratios are set to 16 for the Pavia University dataset and 32 for the other two datasets.
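The spatial simulation step can be sketched as a Gaussian-weighted block average whose kernel width and stride both equal the SR ratio (one reading of the description above; a full Gaussian blur with overlapping support would be an alternative):

```python
import numpy as np

def gaussian_kernel(width, sigma=0.5):
    """1-D Gaussian of the given width, normalized to sum to one."""
    ax = np.arange(width) - (width - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def simulate_lr_hsi(hr_hsi, ratio, sigma=0.5):
    """Per band, take a Gaussian-weighted average of each disjoint
    ratio x ratio block (kernel width = stride = SR ratio)."""
    k = gaussian_kernel(ratio, sigma)
    k2d = np.outer(k, k)                      # separable 2-D kernel
    C, H, W = hr_hsi.shape
    blocks = hr_hsi.reshape(C, H // ratio, ratio, W // ratio, ratio)
    return np.einsum('chawb,ab->chw', blocks, k2d)

hr = np.random.default_rng(0).random((31, 64, 64))   # stand-in HrHSI
lr = simulate_lr_hsi(hr, ratio=16)                   # -> (31, 4, 4)
```

Because the kernel is normalized, a spatially constant image is left unchanged by the simulation, which is a quick sanity check on any such degradation operator.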
Table 1. Ablation analysis on the Pavia University dataset (CNMF serves as the baseline).

Method   | Clamp | SSC | CA | PSNR  | SAM  | ERGAS | SSIM  | UQI
CNMF     |  –    |  –  | –  | 32.73 | 7.05 | 1.18  | 0.830 | 0.973
CUCaNet  |  ✗    |  ✗  | ✗  | 34.25 | 6.58 | 1.01  | 0.862 | 0.975
CUCaNet  |  ✓    |  ✗  | ✗  | 35.67 | 5.51 | 0.92  | 0.897 | 0.981
CUCaNet  |  ✓    |  ✓  | ✗  | 36.55 | 4.76 | 0.85  | 0.904 | 0.991
CUCaNet  |  ✓    |  ✗  | ✓  | 36.49 | 4.63 | 0.86  | 0.902 | 0.989
CUCaNet  |  ✓    |  ✓  | ✓  | 37.22 | 4.43 | 0.82  | 0.914 | 0.991
We use the following five complementary and widely used picture quality indices (PQIs) for quantitative HSI-SR assessment: peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM) [18], erreur relative globale adimensionnelle de synthèse (ERGAS) [28], structural similarity (SSIM) [30], and the universal image quality index (UIQI) [31]. SAM reflects the spectral similarity by calculating the average angle between the estimated and reference spectra at each pixel. PSNR, ERGAS, and SSIM are mean square error (MSE)-based band-wise PQIs indicating spatial fidelity, global quality, and perceptual consistency, respectively. UIQI is also applied band-wise to measure complex distortions among monochromatic images.

4.1 Ablation Study
Our CUCaNet consists of a baseline network (coupled convolutional autoencoder networks) and two newly proposed modules, i.e., the spatial-spectral consistency (SSC) module and the cross-attention (CA) module. To investigate the performance gain of the different components, we perform an ablation analysis on the Pavia University dataset. We also study the effect of replacing the clamp function with a conventional softmax activation at the end of each encoder. Table 1 details the quantitative results, in which CNMF is adopted as the baseline method.
As shown in Table 1, even the plain CUCaNet outperforms CNMF in all metrics, owing to the benefit of employing deep networks. We find that the performance is further improved remarkably by the use of the clamp function. Meanwhile, the SSC module alone performs better than the CA module alone except in SAM, which suggests that the CA module tends to favor spectral consistency. By jointly employing the two modules, the proposed CUCaNet achieves the best results, demonstrating the effectiveness of the whole network architecture.
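For reference, two of the PQIs used throughout these comparisons, band-wise PSNR and SAM, can be computed as in the following sketch (reference and estimate are arranged as bands × height × width):

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Band-wise PSNR (MSE-based), averaged over bands."""
    mse = ((ref - est) ** 2).mean(axis=(1, 2))
    return float((10 * np.log10(peak ** 2 / mse)).mean())

def sam(ref, est, eps=1e-12):
    """Mean spectral angle (in degrees) over all pixels."""
    r = ref.reshape(ref.shape[0], -1)
    e = est.reshape(est.shape[0], -1)
    cos = (r * e).sum(0) / (np.linalg.norm(r, axis=0)
                            * np.linalg.norm(e, axis=0) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())
```

Note that SAM is invariant to per-pixel scaling of the spectra, which is why it complements the intensity-sensitive PSNR when judging spectral fidelity.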
4.2 Comparative Experiments
Compared Methods. Here, we make a comprehensive comparison with the following eleven state-of-the-art (SOTA) methods on HSI-SR tasks: the pioneering work GSA [1] (http://naotoyokoya.com/Download.html); the NMF-based approaches CNMF [37] and CSU [19] (https://github.com/lanha/SupResPALM); the Bayesian approaches FUSE [32] (https://github.com/qw245/BlindFuse) and HySure [26] (https://github.com/alfaiate/HySure); the dictionary learning-based approach NSSR [8] (http://see.xidian.edu.cn/faculty/wsdong); the tensor-based approaches STEREO [16] (https://github.com/marhar19/HSR_via_tensor_decomposition), CSTF [20] (https://sites.google.com/view/renweidian), and LTTR [6]; and the DL-based methods, unsupervised uSDN [24] (https://github.com/aicip/uSDN) and supervised MHFnet [33] (https://github.com/XieQi2015/MHFnet). The parameters of all compared methods are tuned to the best of our ability. As for the supervised deep method MHFnet, we use the remaining part of each dataset for training, following the strategies in [33].
Table 2. Ability of the compared methods to learn the SRF and PSF.

Function | GSA | CNMF | CSU | FUSE | HySure | NSSR | STEREO | CSTF | LTTR | uSDN | MHFnet | CUCaNet
SRF      |  ✗  |  ✗   |  ✗  |  ✗   |   ✓    |  ✗   |   ✗    |  ✗   |  ✗   |  ✗   |   ✓    |   ✓
PSF      |  ✗  |  ✗   |  ✗  |  ✗   |   ✓    |  ✗   |   ✗    |  ✗   |  ✗   |  –   |   ✓    |   ✓
Note that most of the above methods rely on prior knowledge of the SRFs and PSFs. We summarize the ability of all compared methods to learn the SRFs and PSFs in Table 2, where only HySure and MHFnet are capable of learning the two unknown functions. More specifically, HySure adopts a multi-stage method, and MHFnet models them as convolution layers under a supervised framework. Hence, our CUCaNet serves as the first unsupervised method that can simultaneously learn SRFs and PSFs in an end-to-end fashion.
Table 3. Average quantitative results of the compared methods on the CAVE dataset.

Metric | GSA   | CNMF  | CSU   | FUSE  | HySure | NSSR  | STEREO | CSTF  | LTTR  | uSDN  | MHFnet | CUCaNet
PSNR   | 27.89 | 30.11 | 30.26 | 29.87 | 31.26  | 33.52 | 30.88  | 32.74 | 35.45 | 34.67 | 37.30  | 37.51
SAM    | 19.71 | 9.98  | 11.03 | 16.05 | 14.59  | 12.09 | 15.87  | 13.13 | 9.69  | 10.02 | 7.75   | 7.49
ERGAS  | 1.11  | 0.69  | 0.65  | 0.77  | 0.72   | 0.69  | 0.75   | 0.64  | 0.53  | 0.52  | 0.49   | 0.47
SSIM   | 0.713 | 0.919 | 0.911 | 0.876 | 0.905  | 0.912 | 0.896  | 0.914 | 0.949 | 0.921 | 0.961  | 0.959
UQI    | 0.757 | 0.911 | 0.898 | 0.860 | 0.891  | 0.904 | 0.873  | 0.902 | 0.942 | 0.905 | 0.949  | 0.955
Indoor Dataset. We first conduct experiments on indoor images from the CAVE dataset. The average quantitative results over the 16 testing images are summarized in Table 3, with the best ones highlighted in bold. From the table, we observe that LTTR and CSTF obtain better reconstruction results than the other conventional methods, mainly by virtue of their complex regularizations under a tensorial framework. Note that the SAM values of the earlier methods CNMF and CSU are still relatively low because they consider the coupled unmixing mechanism. As for the DL-based methods, the supervised MHFnet evidently outperforms the unsupervised uSDN, while our proposed CUCaNet achieves the best results in terms of four major metrics. Only the SSIM value of ours is slightly worse than that of the most powerful competing method, MHFnet, owing to its extra exploitation of supervised information.
The visual comparison on two selected scenes shown in Fig. 3 and Fig. 4 exhibits a consistent tendency. From the figures, we can conclude that the results of CUCaNet maintain the highest fidelity to the ground truth (GT) compared to the other methods. Specifically, for certain bands, our method can not only estimate the background more accurately but also maintain the texture details on different objects. The SAM values of CUCaNet on the two images are obviously smaller than the others', which validates the superiority of the proposed method in capturing spectral characteristics by means of joint coupled unmixing and degradation-function learning.
Remotely Sensed Datasets. We then carry out more experiments using airborne HS data to further evaluate the generality of our method. The quantitative evaluation results on the Pavia University and Chikusei datasets are provided in Table 4 and Table 5, respectively. Generally, we observe significant performance improvements over the CAVE results, since more spectral information can be used as the number of HS bands increases. For the same reason, the NMF-based and Bayesian methods show competitive performance owing to their accurate estimation of high-resolution subspace coefficients [36]. The limited performance of the tensor-based methods suggests that they may lack robustness to the spectral distortions in real cases. The multi-stage unsupervised training of uSDN makes it prone to local minima, which results in performance merely comparable to state-of-the-art conventional methods such as HySure and FUSE. It is particularly evident that MHFnet performs better on Chikusei than on Pavia University. This can be explained by the fact that training data are relatively adequate on Chikusei, so the tested patterns are more likely to be well learned. We have to admit, however, that MHFnet requires extremely rich training samples, which greatly restricts its practical applicability. Remarkably, our CUCaNet achieves better performance in most cases, especially showing an advantage in the spectral quality measured by SAM, which confirms that our method is good at capturing spectral properties and hence attains a better reconstruction of the HrHSI.
Table 4. Quantitative results of the compared methods on the Pavia University dataset.

Metric | GSA   | CNMF  | CSU   | FUSE  | HySure | NSSR  | STEREO | CSTF  | LTTR  | uSDN  | MHFnet | CUCaNet
PSNR   | 30.29 | 32.73 | 33.18 | 33.24 | 35.02  | 34.74 | 31.34  | 30.97 | 29.98 | 34.87 | 36.34  | 37.22
SAM    | 9.14  | 7.05  | 6.97  | 7.78  | 6.54   | 7.21  | 9.97   | 7.69  | 6.92  | 5.80  | 5.15   | 4.43
ERGAS  | 1.31  | 1.18  | 1.17  | 1.27  | 1.10   | 1.06  | 1.35   | 1.23  | 1.30  | 1.02  | 0.89   | 0.82
SSIM   | 0.784 | 0.830 | 0.815 | 0.828 | 0.861  | 0.831 | 0.751  | 0.782 | 0.775 | 0.871 | 0.919  | 0.914
UQI    | 0.965 | 0.973 | 0.972 | 0.969 | 0.975  | 0.966 | 0.938  | 0.969 | 0.967 | 0.982 | 0.987  | 0.991
Table 5. Quantitative results of the compared methods on the Chikusei dataset.

Metric | GSA   | CNMF  | CSU   | FUSE  | HySure | NSSR  | STEREO | CSTF  | LTTR  | uSDN  | MHFnet | CUCaNet
PSNR   | 32.07 | 38.03 | 37.89 | 39.25 | 39.97  | 38.35 | 32.40  | 36.52 | 35.54 | 38.32 | 43.71  | 42.70
SAM    | 10.44 | 4.81  | 5.03  | 4.50  | 4.35   | 4.97  | 8.52   | 6.33  | 7.31  | 3.89  | 3.51   | 3.13
ERGAS  | 0.98  | 0.58  | 0.61  | 0.47  | 0.45   | 0.63  | 0.74   | 0.66  | 0.70  | 0.51  | 0.42   | 0.40
SSIM   | 0.903 | 0.961 | 0.945 | 0.970 | 0.974  | 0.961 | 0.897  | 0.929 | 0.918 | 0.964 | 0.985  | 0.988
UQI    | 0.909 | 0.976 | 0.977 | 0.977 | 0.976  | 0.914 | 0.902  | 0.915 | 0.917 | 0.976 | 0.992  | 0.990
Fig. 5 and Fig. 6 show the HSI-SR results in false color on these two datasets. Since it is hard to visually discern the differences among most fused results, we display RMSE-based residual images of local windows compared with the GT for better visual evaluation. For both datasets, we observe that GSA and STEREO yield poor results with relatively high errors. CNMF and CSU show evident patterns in their residuals that resemble the original image, indicating that their results miss actual details. The block-pattern-like errors in CSTF and LTTR make their reconstructions unsmooth. Note that the residual images of CUCaNet and MHFnet exhibit more dark blue areas than the other methods', meaning the errors are small and the fused results are more reliable.
5 Conclusion
In this paper, we put forth CUCaNet for the HSI-SR task by integrating the advantages of coupled spectral unmixing and deep learning techniques. For the first time, the learning of unknown SRFs and PSFs across MS-HS sensors is introduced into an unsupervised coupled unmixing network. Meanwhile, a cross-attention module and reasonable consistency enforcement are employed jointly to enrich feature extraction and guarantee a faithful product. Extensive experiments on both indoor and airborne HS datasets under diverse simulations validate the superiority of the proposed CUCaNet, with evident performance improvements over competitive methods both quantitatively and perceptually. In future work, we will investigate more theoretical insights into explaining the effectiveness of the proposed network.
Acknowledgements. This work has been supported in part by projects of the NSFC (No. 61721002) and the China Scholarship Council.
References
 [1] (2007) Improving component substitution pansharpening through multivariate regression of MS+Pan data. IEEE Transactions on Geoscience and Remote Sensing 45 (10), pp. 3230–3239. Cited by: §4.2.
 [2] (2014) Sparse spatio-spectral representation for hyperspectral image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 63–78. Cited by: §1, §2.1.
 [3] (2014) The spectral response of the Landsat-8 operational land imager. Remote Sensing 6 (10), pp. 10232–10251. Cited by: §4.
 [4] (2011) Hyperspectral image resolution enhancement based on spectral unmixing and information fusion. In ISPRS Hannover Workshop 2011, Cited by: §2.1.
 [5] (2014) Fusion of hyperspectral and multispectral images: a novel framework based on generalization of pansharpening methods. IEEE Geoscience and Remote Sensing Letters 11 (8), pp. 1418–1422. Cited by: §2.

 [6] (2019) Learning a low tensor-train rank representation for hyperspectral image super-resolution. IEEE Transactions on Neural Networks and Learning Systems 30 (9), pp. 2672–2683. Cited by: §2.1, §4.2.
 [7] (2018) Deep hyperspectral image sharpening. IEEE Transactions on Neural Networks and Learning Systems 29 (99), pp. 1–11. Cited by: §2.2.
 [8] (2016) Hyperspectral image super-resolution via nonnegative structured sparse representation. IEEE Transactions on Image Processing 25 (5), pp. 2337–2352. Cited by: §2.1, §4.2.

 [9] (2004) Resolution enhancement of hyperspectral imagery using maximum a posteriori estimation with a stochastic mixing model. Ph.D. Thesis, University of Dayton. Cited by: §2.1.
 [10] (2018) Joint camera spectral sensitivity selection and hyperspectral image recovery. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 788–804. Cited by: §1.

 [11] (2019) Hyperspectral image super-resolution with optimized rgb guidance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11661–11670. Cited by: §1, §2.2.
 [12] (2015) A novel hierarchical approach for multispectral palmprint recognition. Neurocomputing 151, pp. 511–521. Cited by: §1.
 [13] (2019) An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing 28 (4), pp. 1923–1938. Cited by: §2.1.
 [14] (2019) Cospace: common subspace learning from hyperspectral-multispectral correspondences. IEEE Transactions on Geoscience and Remote Sensing 57 (7), pp. 4349–4359. Cited by: §1.
 [15] (2019) Learnable manifold alignment (lema): a semi-supervised cross-modality learning framework for land cover and land use classification. ISPRS Journal of Photogrammetry and Remote Sensing 147, pp. 193–205. Cited by: §1.
 [16] (2018) Hyperspectral super-resolution: a coupled tensor factorization approach. IEEE Transactions on Signal Processing 66 (24), pp. 6503–6517. Cited by: §4.2.
 [17] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §3.4.
 [18] (1993) The spectral image processing system (SIPS): interactive visualization and analysis of imaging spectrometer data. Remote Sensing of Environment 44 (2–3), pp. 145–163. Cited by: §4.
 [19] (2015) Hyperspectral super-resolution by coupled spectral unmixing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3586–3594. Cited by: §2.1, §4.2.
 [20] (2018) Fusing hyperspectral and multispectral images via coupled sparse tensor factorization. IEEE Transactions on Image Processing 27 (8), pp. 4118–4130. Cited by: §2.1, §4.2.
 [21] (2015) Hyperspectral pansharpening: a review. IEEE Geoscience and Remote Sensing Magazine 3 (3), pp. 27–46. Cited by: §2.
 [22] (2019) Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), Cited by: §3.4.
 [23] (2011) Sparse autoencoder. CS294A Lecture Notes 72 (2011), pp. 1–19. Cited by: §3.4.
 [24] (2018) Unsupervised sparse Dirichlet-Net for hyperspectral image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2511–2520. Cited by: §1, §2.2, §4.2, §4.
 [25] (2020) Feature extraction for hyperspectral imagery: the evolution from shallow to deep (overview and toolbox). IEEE Geoscience and Remote Sensing Magazine. Note: DOI: 10.1109/MGRS.2020.2979764 Cited by: §1.
 [26] (2014) A convex formulation for hyperspectral image super-resolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing 53 (6), pp. 3373–3388. Cited by: §2.1, §4.2.
 [27] (2014) A critical comparison among pansharpening algorithms. IEEE Transactions on Geoscience and Remote Sensing 53 (5), pp. 2565–2586. Cited by: §2.
 [28] (2000) Quality of high resolution synthesised images: is there a simple criterion?. In 3rd Conference Fusion Earth Data: Merging Point Measurements, Raster Maps, and Remotely Sensed Images, Cited by: §4.
 [29] (2017) The effect of the point spread function on sub-pixel mapping. Remote Sensing of Environment 193, pp. 127–137. Cited by: §3.3.
 [30] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.
 [31] (2002) A universal image quality index. IEEE Signal Processing Letters 9 (3), pp. 81–84. Cited by: §4.
 [32] (2015) Fast fusion of multi-band images based on solving a Sylvester equation. IEEE Transactions on Image Processing 24 (11), pp. 4109–4121. Cited by: §2.1, §4.2.
 [33] (2019) Multispectral and hyperspectral image fusion by MS/HS fusion net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1585–1594. Cited by: §1, §2.2, §4.2.
 [34] (2019) Nonconvex-sparsity and nonlocal-smoothness-based blind hyperspectral unmixing. IEEE Transactions on Image Processing 28 (6), pp. 2991–3006. Cited by: §1.
 [35] (2010) Generalized assorted pixel camera: post-capture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing 19 (9), pp. 2241–2253. Cited by: §4.
 [36] (2017) Hyperspectral and multispectral data fusion: a comparative review of the recent literature. IEEE Geoscience and Remote Sensing Magazine 5 (2), pp. 29–56. Cited by: §1, §4.2, §4.
 [37] (2011) Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing 50 (2), pp. 528–537. Cited by: §2.1, §4.2.
 [38] (2016) Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging 3 (1), pp. 47–57. Cited by: §3.4.
 [39] (2020) Coupled convolutional neural network with adaptive response function learning for unsupervised hyperspectral super-resolution. IEEE Transactions on Geoscience and Remote Sensing. Note: DOI: 10.1109/TGRS.2020.3006534 Cited by: §2.2.