Cross-Attention in Coupled Unmixing Nets for Unsupervised Hyperspectral Super-Resolution

07/10/2020 ∙ by Jing Yao, et al. ∙ DLR ∙ Grenoble Institute of Technology ∙ Xi'an Jiaotong University

The recent advancement of deep learning techniques has brought great progress to hyperspectral image super-resolution (HSI-SR). Yet the development of unsupervised deep networks remains challenging for this task. To this end, we propose a novel coupled unmixing network with a cross-attention mechanism, CUCaNet for short, to enhance the spatial resolution of HSI by means of a higher-spatial-resolution multispectral image (MSI). Inspired by coupled spectral unmixing, a two-stream convolutional autoencoder framework is taken as the backbone to jointly decompose the MS and HS data into a spectrally meaningful basis and corresponding coefficients. CUCaNet is capable of adaptively learning the spectral and spatial response functions from HS-MS correspondences by enforcing reasonable consistency assumptions on the networks. Moreover, a cross-attention module is devised to yield more effective spatial-spectral information transfer in the networks. Extensive experiments are conducted on three widely used HS-MS datasets in comparison with state-of-the-art HSI-SR models, demonstrating the superiority of CUCaNet in the HSI-SR application. The codes and datasets are available at: https://github.com/danfenghong/ECCV2020_CUCaNet.


1 Introduction

Recent advances in hyperspectral (HS) imaging technology have enabled the availability of enormous HS images (HSIs) with a densely sampled spectrum [25]. Benefiting from the abundant spectral information contained in those hundreds of band measurements, HSI shows great promise in delivering a faithful representation of real-world materials and objects. Thus, the pursuit of effective and efficient processing of HS data has long been recognized as a prominent topic in the field of computer vision [10].

Physically, however, the insufficient spatial resolution of HS instruments, combined with an inherently intimate mixing effect, severely hampers the ability of HSI in various real applications [2, 34]. Fortunately, multispectral (MS) imaging systems (e.g., RGB cameras, spaceborne MS sensors) are capable of providing complementary products, which preserve much finer spatial information at the cost of reduced spectral resolution [12]. Accordingly, the research on enhancing the spatial resolution (henceforth, resolution refers to spatial resolution) of an observable low-resolution HSI (LrHSI) by merging a high-resolution MSI (HrMSI) of the same scene, referred to as hyperspectral image super-resolution (HSI-SR), has been gaining considerable attention [14, 15].

The last decade has witnessed a dominant development of optimization-based methods, from either deterministic or stochastic perspectives, to tackle the HSI-SR issue [36]. To mitigate the severe ill-posedness of such an inverse problem, the majority of prevailing methods put their focus on exploiting various hand-crafted priors to characterize spatial and spectral information underlying the desired solution. Moreover, the dependency on the knowledge of relevant sensor characteristics, such as spectral response function (SRF) and point spread function (PSF), inevitably compromises their transparency and practicability.

More recently, growing interest has been devoted to leveraging deep learning (DL), exploiting its merits in low-level vision applications. Among these works, the best results are achieved by investigators who resort to performing HSI-SR progressively in a supervised fashion [33]. However, the demand for sufficient training image pairs acquired with different sensors inevitably limits their practicability. On the other hand, though rarely studied, the existing unsupervised works rely on either a complicated multi-stage alternating optimization [24] or an external camera spectral response (CSR) dataset in the context of RGB image guidance [11]; the latter also loses generality when confronting other kinds of data with higher spectral resolution than RGB.

To address the aforementioned challenges, we propose a novel coupled unmixing network with cross-attention (CUCaNet) for unsupervised HSI-SR. The contributions of this paper are briefly summarized as follows:

  1. We propose a novel unsupervised HSI-SR model, called CUCaNet, which is built on a coupled convolutional autoencoder network. CUCaNet models the physical mixing properties of HS imaging in the networks, transferring the spatial information of the MSI to the HSI while simultaneously preserving the high spectral resolution of the latter in a coupled fashion.

  2. We devise an effective cross-attention module to extract significant spectral (or spatial) information from the HSI (or MSI) and transfer it to the other branch, yielding a more sufficient blending of spatial-spectral information.

  3. Beyond previous coupled HSI-SR models, the proposed CUCaNet is capable of adaptively learning PSFs and SRFs across MS-HS sensors with a high ability to generalize. To find a good local optimum of the network more effectively, we shrink the solution space by designing a closed-loop consistency regularization acting on both the spatial and spectral domains.

2 Related Work

The investigated HSI-SR problem is closely associated with the pan-sharpening task, which aims at generating an HrMSI (or HrHSI) by fusing the LrMSI (or LrHSI) with a corresponding panchromatic image of higher resolution [27, 21]. Pioneering research therefore emerged naturally by adapting the extensively studied pan-sharpening techniques to HSI-SR [5]. Still, as a result of the separate band-wise sharpening operation, these methods usually fail to capture the global continuity of the spectral profiles well, which brings non-negligible performance degradation and thus leaves much room for improvement.

2.1 Conventional Methods

Apace with the advances in statistical modeling and machine learning, recent optimization-based methods have evidently lifted the achievable HSI-SR ratio. Under a subspace assumption, a Bayesian approach was first introduced by Eismann et al. utilizing a stochastic mixing model [9] and developed through subsequent research exploiting more inherent characteristics [26, 32]. Another class of actively investigated methods stems from the idea of spectral unmixing [13], which takes the intimate mixing effect into consideration. Yokoya et al. brought up coupled non-negative matrix factorization (CNMF) [37] to estimate the spectral signatures of the underlying materials and the corresponding coefficients alternately. On the basis of CNMF, Kawakami et al. [4] employed sparse regularization, and an effective projected gradient solver was devised by Lanaras et al. [19]. Besides, [2, 8] adopted dictionary learning and sparse coding techniques in this context. Various kinds of tensor factorization strategies have also been studied, such as the Tucker decompositions adopted by Dian et al. [6] and Li et al. [20] to model non-local and coupled structure information, respectively.

2.2 DL-Based Methods

To avoid the tedious modeling of hand-crafted priors in conventional methods, DL-based methods have attracted increasing interest in recent years. In the class of supervised methods, Dian et al. [7] employed a CNN with prior training to finely tune the result acquired by solving a conventional optimization problem, while Xie et al. [33] introduced a deep unfolding network based on a novel HSI degradation model. Unsupervised methods are more rarely studied. Qu et al. [24] developed an unsupervised HSI-SR net with a Dirichlet distribution-induced layer embedded, which results in a multi-stage alternating optimization. Under the guidance of an RGB image and an external CSR database, Fu et al. [11] designed a unified CNN framework with a particular CSR optimization layer. Albeit demonstrated to be comparatively effective, these methods require either large training data for supervision or knowledge of the PSFs or SRFs, both of which are unrealistic in real HSI-SR scenarios. Very recently, Zheng et al. [39] proposed a coupled CNN that adaptively learns the PSFs and SRFs for unsupervised HSI-SR. However, due to the lack of effective regularizations or constraints, the two to-be-estimated functions inevitably introduce more degrees of freedom, limiting further performance improvement.

Figure 1: An illustration of the proposed end-to-end CUCaNet inspired by spectral unmixing techniques, which mainly consists of two important modules: cross-attention and spatial-spectral consistency.

3 Coupled Unmixing Nets with Cross-Attention

In this section, we present the proposed coupled unmixing networks with a cross-attention module implanted, called CUCaNet for short. For mathematical brevity, we resort to a 2D representation of the 3D image cube; that is, the spectrum of each pixel is stacked row by row.

3.1 Method Overview

The proposed CUCaNet builds on a two-stream convolutional autoencoder backbone, which aims at jointly decomposing the MS and HS data into a spectrally meaningful basis and corresponding coefficients. In accordance with the idea of coupled spectral unmixing, the fused HrHSI is obtained by feeding the decoder of the HSI branch with the encoded maps of the MSI branch. Two additional convolution layers are incorporated to simulate the spatial and spectral downsampling processes across MS-HS sensors. To guarantee that CUCaNet converges to a faithful product through unsupervised training, reasonable consistency and necessary unmixing constraints are integrated smoothly without imposing evident redundancy. Moreover, we introduce the attention mechanism into HSI-SR for the first time. More specifically, a cross-attention module is devised to transfer significant spectral (or spatial) information from the HSI (or MSI) to the other MSI (or HSI) branch, which fully exploits the advantageous spectral (or spatial) guidance for performance improvement.

Figure 2: Detailed unfolding of the two modules in the network: spatial-spectral consistency (left) and cross-attention (right).

3.2 Problem Formulation

Given the LrHSI $\mathbf{X} \in \mathbb{R}^{hw \times L}$ and the HrMSI $\mathbf{Y} \in \mathbb{R}^{HW \times l}$, the goal of HSI-SR is to recover the latent HrHSI $\mathbf{Z} \in \mathbb{R}^{HW \times L}$, where $h$, $w$, and $l$ are the reduced height, width, and number of spectral bands, respectively, and $H$, $W$, and $L$ are the corresponding upsampled versions. Based on the linear mixing model that well explains the phenomenon of mixed pixels involved in $\mathbf{Z}$, we then have the following NMF-based representation,

$$\mathbf{Z} = \mathbf{A}\mathbf{E}, \qquad (1)$$

where $\mathbf{E} \in \mathbb{R}^{p \times L}$ and $\mathbf{A} \in \mathbb{R}^{HW \times p}$ are a collection of spectral signatures of $p$ pure materials (or say, endmembers) and their fractional coefficients (or say, abundances), respectively.

On the other hand, the degradation processes in the spatial ($\mathbf{X}$) and the spectral ($\mathbf{Y}$) observations can be modeled as

$$\mathbf{X} = \mathbf{P}\mathbf{Z} = \mathbf{P}\mathbf{A}\mathbf{E} = \mathbf{A}_s\mathbf{E}, \qquad (2)$$
$$\mathbf{Y} = \mathbf{Z}\mathbf{R} = \mathbf{A}\mathbf{E}\mathbf{R} = \mathbf{A}\mathbf{E}_s, \qquad (3)$$

where $\mathbf{P} \in \mathbb{R}^{hw \times HW}$ and $\mathbf{R} \in \mathbb{R}^{L \times l}$ represent the PSF and SRF from the HrHSI to the LrHSI and the HrMSI, respectively. Since $\mathbf{P}$ and $\mathbf{R}$ are non-negative and normalized, $\mathbf{A}_s = \mathbf{P}\mathbf{A}$ and $\mathbf{E}_s = \mathbf{E}\mathbf{R}$ can be regarded as spatially downsampled abundances and spectrally downsampled endmembers, respectively. Therefore, an intuitive solution is to unmix $\mathbf{X}$ and $\mathbf{Y}$ based on Eq. (2) and Eq. (3) alternately, coupled with the prior knowledge of $\mathbf{P}$ and $\mathbf{R}$. Such a principle has been exploited in various optimization formulations, obtaining state-of-the-art fusion performance by linear approximation with converged $\mathbf{E}$ and $\mathbf{A}$.
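To make the notation concrete, the following NumPy sketch instantiates Eqs. (1)-(3) with illustrative sizes; the dimensions, the uniform disjoint-averaging PSF, and the random SRF are demonstration assumptions, not values from the paper:

```python
import numpy as np

# Illustrative sizes (assumptions): 64x64 HrHSI with L=100 bands,
# 16x16 LrHSI, l=4 MS bands, p=8 endmembers
H, W, h, w, L, l, p = 64, 64, 16, 16, 100, 4, 8
rng = np.random.default_rng(0)

E = rng.random((p, L))                      # endmembers (non-negative)
A = rng.dirichlet(np.ones(p), size=H * W)   # abundances satisfying ANC and ASC
Z = A @ E                                   # Eq. (1): Z = AE, the latent HrHSI

ratio = H // h
P = np.zeros((h * w, H * W))                # PSF as disjoint uniform averaging
for i in range(h):
    for j in range(w):
        for di in range(ratio):
            for dj in range(ratio):
                P[i * w + j, (i * ratio + di) * W + (j * ratio + dj)] = 1 / ratio**2

R = rng.random((L, l))                      # SRF, columns normalized to sum to one
R /= R.sum(axis=0, keepdims=True)

X = P @ Z                                   # Eq. (2): LrHSI, X = PZ = (PA)E
Y = Z @ R                                   # Eq. (3): HrMSI, Y = ZR = A(ER)
```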

Constraints. Still, the issued HSI-SR problem involves the inversions from $\mathbf{X}$ and $\mathbf{Y}$ to $\mathbf{E}$ and $\mathbf{A}$, which are highly ill-posed. To narrow the solution space, several physically meaningful constraints are commonly adopted, namely the abundance sum-to-one constraint (ASC), the abundance non-negativity constraint (ANC), and the non-negativity constraint on endmembers, i.e.,

$$\mathbf{A}\mathbf{1}_p = \mathbf{1}_{HW}, \quad \mathbf{A} \succeq \mathbf{0}, \quad \mathbf{E} \succeq \mathbf{0}, \qquad (4)$$

where $\succeq$ marks element-wise inequality, and $\mathbf{1}_p$ represents a $p$-length all-one vector. It is worth mentioning that the combination of ASC and ANC promotes the sparsity of abundances, which well characterizes the rule that only a few endmembers contribute to the spectrum of each pixel.

Yet in practice, the prior knowledge of PSFs and SRFs is hardly available for the numerous kinds of imaging systems in use. This restriction motivates us to extend the current coupled unmixing model to a fully end-to-end framework that needs only the LrHSI and HrMSI. To estimate $\mathbf{P}$ and $\mathbf{R}$ in an unsupervised manner, we introduce the following consistency constraint,

$$\tilde{\mathbf{Y}} = \mathbf{P}\mathbf{Y} = \mathbf{X}\mathbf{R}, \qquad (5)$$

where $\tilde{\mathbf{Y}} \in \mathbb{R}^{hw \times l}$ denotes the latent LrMSI.
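Continuing the NumPy sketch above, the consistency of Eq. (5) can be checked numerically: spatially degrading the HrMSI and spectrally degrading the LrHSI both yield the same latent LrMSI.

```python
# Eq. (5): both degradation paths reach the same latent LrMSI
assert np.allclose(P @ Y, X @ R)   # P(ZR) == (PZ)R
```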

3.3 Network Architecture

Inspired by the recent success of deep networks on visual processing tasks, we first perform coupled spectral unmixing with the established two-stream convolutional autoencoder for the two-modal inputs. That is, we consider two deep subnetworks, with $\hat{\mathbf{X}} = \mathrm{Dec}_h(\mathrm{Enc}_h(\mathbf{X}; \Theta_h^e); \Theta_h^d)$ to self-express the LrHSI and $\hat{\mathbf{Y}} = \mathrm{Dec}_m(\mathrm{Enc}_m(\mathbf{Y}; \Theta_m^e); \Theta_m^d)$ for the HrMSI; the fused result can be obtained by $\hat{\mathbf{Z}} = \mathrm{Dec}_h(\mathrm{Enc}_m(\mathbf{Y}; \Theta_m^e); \Theta_h^d)$, herein $\Theta_\bullet^\bullet$ collects the weights of the corresponding subpart.

Specifically, as shown in Fig. 1, both encoders $\mathrm{Enc}_h$ and $\mathrm{Enc}_m$ are constructed by cascading “Convolution+LReLU” blocks with an additional convolution layer. We set the sizes of the convolutional kernels in $\mathrm{Enc}_h$ to $1 \times 1$, while those in $\mathrm{Enc}_m$ have larger but descending scales of receptive field. The idea behind this setting is to take the low fidelity of spatial information in the LrHSI into consideration while mapping the cross-channel and spatial correlations underlying the HrMSI. Furthermore, to ensure that the encoded maps possess the properties of abundances, an additional activation layer using the clamp function in the range $[0, 1]$ is concatenated after each encoder. As for the structure of the decoders $\mathrm{Dec}_h$ and $\mathrm{Dec}_m$, we simply adopt a $1 \times 1$ convolution layer without any nonlinear activation, making the weights $\Theta_h^d$ and $\Theta_m^d$ interpretable as the endmembers $\mathbf{E}$ and $\mathbf{E}_s$ according to Eq. (2) and Eq. (3). Through backward gradient-descent-based optimization, our backbone network not only avoids the need for the good initialization required by conventional unmixing algorithms but also benefits from its capability of local perception and nonlinear processing.
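A minimal PyTorch sketch of this backbone is given below; the layer counts, channel widths, and kernel sizes are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Cascaded "Convolution+LReLU" blocks plus an extra convolution; the final
    clamp to [0, 1] lets the encoded maps act as abundances (ANC)."""
    def __init__(self, in_ch, p, kernels):
        super().__init__()
        layers, ch = [], in_ch
        for k in kernels:
            layers += [nn.Conv2d(ch, p, k, padding=k // 2), nn.LeakyReLU(0.02)]
            ch = p
        layers.append(nn.Conv2d(ch, p, 1))      # additional convolution layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.clamp(self.net(x), 0.0, 1.0)

class Decoder(nn.Module):
    """A single 1x1 convolution without activation; its weights play the role
    of the endmembers E (HS branch) or E_s (MS branch)."""
    def __init__(self, p, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(p, out_ch, 1, bias=False)

    def forward(self, a):
        return self.conv(a)

L_bands, l_bands, p = 100, 4, 8                                     # illustrative sizes
enc_h, dec_h = Encoder(L_bands, p, [1, 1, 1]), Decoder(p, L_bands)  # 1x1 kernels (HS)
enc_m, dec_m = Encoder(l_bands, p, [7, 5, 3]), Decoder(p, l_bands)  # descending (MS)

X, Y = torch.rand(1, L_bands, 16, 16), torch.rand(1, l_bands, 64, 64)
X_hat, Y_hat = dec_h(enc_h(X)), dec_m(enc_m(Y))  # self-expressions of LrHSI / HrMSI
Z_hat = dec_h(enc_m(Y))                          # fused HrHSI: HS decoder on MS codes
```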

Cross-Attention. To further exploit the advantageous information from the two modalities, we devise an effective cross-attention module to enrich the features across modalities. As shown in Fig. 2, the cross-attention module is employed on high-level features within the encoder part, with three steps to follow. First, we compute the spectral attention from the LrHSI branch and the spatial attention from the HrMSI branch, since they provide more faithful spectral and spatial guidance, respectively. Next, we multiply the original features of each branch with the attention maps from the other branch to transfer the significant information. Lastly, we concatenate the original features with the above cross-multiplications in each branch to construct the input of the next layer in such a preserved and refined representation.

Formally, the output features of the $k$-th layer in the encoder part, taking the HSI branch $\mathbf{F}_h^{(k)}$ for example, are formulated as

$$\mathbf{F}_h^{(k)} = \mathrm{LReLU}\big(\mathbf{W}_h^{(k)} \ast \mathbf{F}_h^{(k-1)}\big), \qquad (6)$$

which is similar for obtaining $\mathbf{F}_m^{(k)}$ from $\mathbf{F}_m^{(k-1)}$. To gather the spatially and spectrally significant information, we adopt global and local convolution to generate channel-wise and spatial statistics, respectively, as

$$\mathbf{z}_c = \mathbf{W}_g \ast \mathbf{F}_h^{(k)}, \qquad \mathbf{Z}_s = \mathbf{W}_l \ast \mathbf{F}_m^{(k)}, \qquad (7)$$

where $\mathbf{W}_g$ is a set of convolution filters with size equal to that of the feature maps, and $\mathbf{W}_l^{(i)}$ is the $i$-th channel of a 3D convolution filter with spatial size $1 \times 1$. Then we apply a softmax layer to the above statistics to get the attention maps $\mathbf{M}_c = \sigma(\mathbf{z}_c)$ and $\mathbf{M}_s = \sigma(\mathbf{Z}_s)$, where $\sigma$ denotes the softmax activation function. The original features are finally fused into the input of the next layer as $\hat{\mathbf{F}}_h^{(k)} = \big[\mathbf{F}_h^{(k)}, \mathbf{F}_h^{(k)} \odot \mathbf{M}_s\big]$ and $\hat{\mathbf{F}}_m^{(k)} = \big[\mathbf{F}_m^{(k)}, \mathbf{F}_m^{(k)} \odot \mathbf{M}_c\big]$, where $[\cdot, \cdot]$ denotes the concatenation and $\odot$ denotes the point-wise multiplication.
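The following PyTorch sketch shows one plausible realization of this cross-attention step under the above equations; the exact filter shapes and the resizing of the spatial attention map to the coarser HS feature grid are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Sketch of the cross-attention step: channel (spectral) attention from the
    HS features, spatial attention from the MS features, crossed and concatenated."""
    def __init__(self, ch, hs_size):
        super().__init__()
        # global convolution: kernel covers the whole HS feature map -> channel stats
        self.global_conv = nn.Conv2d(ch, ch, hs_size, groups=ch)
        # local 1x1 convolution on the MS feature map -> spatial statistics
        self.local_conv = nn.Conv2d(ch, 1, 1)

    def forward(self, f_h, f_m):
        z_c = self.global_conv(f_h)                        # (B, C, 1, 1)
        m_c = F.softmax(z_c.flatten(1), 1).view_as(z_c)    # channel attention M_c
        z_s = self.local_conv(f_m)                         # (B, 1, H, W)
        m_s = F.softmax(z_s.flatten(1), 1).view_as(z_s)    # spatial attention M_s
        # resize the spatial map to the (coarser) HS feature grid before crossing
        m_s = F.interpolate(m_s, size=f_h.shape[-2:], mode='bilinear',
                            align_corners=False)
        f_h_out = torch.cat([f_h, f_h * m_s], dim=1)       # MS -> HS spatial guidance
        f_m_out = torch.cat([f_m, f_m * m_c], dim=1)       # HS -> MS spectral guidance
        return f_h_out, f_m_out

ca = CrossAttention(ch=8, hs_size=16)
f_h, f_m = torch.rand(1, 8, 16, 16), torch.rand(1, 8, 64, 64)
f_h2, f_m2 = ca(f_h, f_m)   # concatenation doubles the channels of each branch
```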

Spatial-Spectral Consistency. An essential part that tends to be ignored relates to the coupled factors caused by PSFs and SRFs. Previous research typically assumes an ideal average spatial downsampling and prior knowledge of the SRFs, which rarely holds in reality. Unlike them, we introduce a spatial-spectral consistency module into the network to better simulate the to-be-estimated PSF and SRF, implemented by simple yet effective convolution layers.

We can rewrite the spectral resampling from the HS sensor to the MS sensor more accurately by revisiting the left part of Eq. (3) as follows. Given the spectrum $\mathbf{z}_i$ of the $i$-th pixel in the HrHSI, the radiance of the $j$-th channel in the corresponding HrMSI is defined as

$$y_{i,j} = \frac{1}{c_j} \sum_{\lambda \in \Lambda_j} r_j(\lambda)\, z_i(\lambda), \qquad (8)$$

where $\Lambda_j$ denotes the support set that the wavelength $\lambda$ belongs to, and $c_j = \sum_{\lambda \in \Lambda_j} r_j(\lambda)$ denotes the normalization constant. We directly replace $r_j$ with a set of $1 \times 1$ convolution kernels whose weights are collected in $\Theta_r$. Therefore, the SRF layer can be well defined as follows,

$$f_{\mathrm{SRF}}(\mathbf{Z}) = \tilde{\Theta}_r \ast \mathbf{Z}, \qquad (9)$$

where $\tilde{\Theta}_r$ corresponds to an additional normalization with $\tilde{\Theta}_r^{(j)} = \Theta_r^{(j)} / \sum_{\lambda} \Theta_r^{(j)}(\lambda)$. The PSF layer for spatial downsampling is more straightforward. Note that the PSF generally indicates that each pixel in the LrHSI is produced by combining neighboring pixels in the HrHSI with unknown weights in a disjoint manner [29]. To simulate this process, we implement $f_{\mathrm{PSF}}$ by means of a channel-wise convolution layer with kernel size and stride both equal to the scaling ratio.
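A hedged PyTorch sketch of the two degradation layers follows; the non-negativity handling of the SRF weights and the per-channel PSF kernels are implementation assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRFLayer(nn.Module):
    """SRF layer of Eq. (9): a 1x1 convolution whose weights are kept
    non-negative and normalized to sum to one per MS channel."""
    def __init__(self, hs_bands, ms_bands):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(ms_bands, hs_bands, 1, 1))

    def forward(self, z):
        w = self.weight.abs()                 # non-negative response (assumption)
        w = w / w.sum(dim=1, keepdim=True)    # the normalization constant c_j
        return F.conv2d(z, w)

class PSFLayer(nn.Module):
    """PSF layer: channel-wise convolution with kernel size and stride both
    equal to the scaling ratio, i.e., disjoint neighborhoods."""
    def __init__(self, bands, ratio):
        super().__init__()
        self.conv = nn.Conv2d(bands, bands, ratio, stride=ratio,
                              groups=bands, bias=False)

    def forward(self, z):
        return self.conv(z)

srf, psf = SRFLayer(100, 4), PSFLayer(100, 4)   # illustrative band counts / ratio
Z = torch.rand(1, 100, 64, 64)
Y_hat, X_hat = srf(Z), psf(Z)                   # (1, 4, 64, 64), (1, 100, 16, 16)
```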

To sum up, multiple consistency constraints derived from the statements in Section 3.2, either spectral or spatial, can be defined in our networks as

$$f_{\mathrm{PSF}}(\hat{\mathbf{Z}}) = \mathbf{X}, \qquad f_{\mathrm{SRF}}(\hat{\mathbf{Z}}) = \mathbf{Y}, \qquad f_{\mathrm{PSF}}(\mathbf{Y}) = f_{\mathrm{SRF}}(\mathbf{X}) = \tilde{\mathbf{Y}}, \qquad (10)$$

which enables the whole network to be trained within a closed loop.

3.4 Network Training

Loss Function. As shown in Fig. 1, our CUCaNet mainly consists of two autoencoders for the hyperspectral and multispectral data, respectively, leading to the following reconstruction loss:

$$\mathcal{L}_{\mathrm{rec}} = \big\|\mathbf{X} - \hat{\mathbf{X}}\big\|_1 + \big\|\mathbf{Y} - \hat{\mathbf{Y}}\big\|_1, \qquad (11)$$

in which the $\ell_1$-norm is selected as the loss criterion for its perceptually satisfying performance in low-level image processing tasks [38].

The physically meaningful constraints important in spectral unmixing are also considered. Building on Eq. (4), we derive the second, ASC loss as

$$\mathcal{L}_{\mathrm{ASC}} = \big\|\mathbf{A}\mathbf{1}_p - \mathbf{1}_{HW}\big\|_1 + \big\|\mathbf{A}_s\mathbf{1}_p - \mathbf{1}_{hw}\big\|_1, \qquad (12)$$

while the ANC is enforced through the clamp activation layer placed behind the encoders.

To promote the sparsity of the abundances of both streams, we adopt a Kullback-Leibler (KL) divergence-based sparsity loss that penalizes the discrepancy between the mean abundance activations $\hat{\rho}_j$ and a tiny scalar $\epsilon$,

$$\mathcal{L}_{\mathrm{sparse}} = \sum_{j} \mathrm{KL}\big(\epsilon \,\big\|\, \hat{\rho}_j\big), \qquad (13)$$

where $\mathrm{KL}(\cdot \| \cdot)$ is the standard KL divergence [23].

Last but not least, we adopt the $\ell_1$-norm to define the spatial-spectral consistency loss based on Eq. (10) as follows,

$$\mathcal{L}_{\mathrm{con}} = \big\|f_{\mathrm{PSF}}(\hat{\mathbf{Z}}) - \mathbf{X}\big\|_1 + \big\|f_{\mathrm{SRF}}(\hat{\mathbf{Z}}) - \mathbf{Y}\big\|_1 + \big\|f_{\mathrm{PSF}}(\mathbf{Y}) - f_{\mathrm{SRF}}(\mathbf{X})\big\|_1. \qquad (14)$$

By integrating all the above-mentioned loss terms, the final objective function for training CUCaNet is given by

$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \alpha\,\mathcal{L}_{\mathrm{ASC}} + \beta\,\mathcal{L}_{\mathrm{sparse}} + \gamma\,\mathcal{L}_{\mathrm{con}}, \qquad (15)$$

where we use $\alpha$, $\beta$, and $\gamma$ to trade off the effects of the different constituents.
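Putting Eqs. (11)-(15) together, a sketch of the training objective might look as follows; the trade-off weights, the mean-activation form of the KL penalty (the standard sparse-autoencoder recipe of [23]), and the use of separate PSF callables for the HS and MS band counts are assumptions:

```python
import torch
import torch.nn.functional as F

def kl_sparsity(a, eps=0.01):
    """Eq. (13): KL divergence between a tiny target eps and the mean activation
    of each abundance map (standard sparse-autoencoder form, an assumption)."""
    rho = a.mean(dim=(0, 2, 3)).clamp(1e-6, 1 - 1e-6)
    return (eps * torch.log(eps / rho)
            + (1 - eps) * torch.log((1 - eps) / (1 - rho))).sum()

def cucanet_loss(X, Y, X_hat, Y_hat, A_h, A_m, Z_hat,
                 psf_h, psf_m, srf, alpha=0.1, beta=0.01, gamma=1.0):
    """Eqs. (11)-(15); the trade-off weights are placeholders, and psf_h / psf_m
    are PSF layers matching the HS / MS band counts."""
    l_rec = F.l1_loss(X_hat, X) + F.l1_loss(Y_hat, Y)               # Eq. (11)
    l_asc = ((A_h.sum(dim=1) - 1).abs().mean()
             + (A_m.sum(dim=1) - 1).abs().mean())                   # Eq. (12)
    l_sparse = kl_sparsity(A_h) + kl_sparsity(A_m)                  # Eq. (13)
    l_con = (F.l1_loss(psf_h(Z_hat), X) + F.l1_loss(srf(Z_hat), Y)
             + F.l1_loss(psf_m(Y), srf(X)))                         # Eqs. (10), (14)
    return l_rec + alpha * l_asc + beta * l_sparse + gamma * l_con  # Eq. (15)
```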

Implementation Details. Our network is implemented in the PyTorch framework. We choose the Adam optimizer [17] under its default parameter settings for training, with a batch size of 1. The learning rate is initialized to 0.005, and a linear-decay drop-step schedule is applied from 2000 to 10000 epochs [22]. We determined the hyperparameters using a grid search on the validation set and stopped training once the validation loss failed to decrease.
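A minimal sketch of this training setup; the exact shape of the decay schedule is an assumption:

```python
import torch

model = torch.nn.Linear(1, 1)   # stand-in for the CUCaNet modules sketched above
opt = torch.optim.Adam(model.parameters(), lr=0.005)  # defaults otherwise
# linear decay of the learning rate between epochs 2000 and 10000
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda e: 1.0 if e < 2000 else max(0.0, (10000 - e) / 8000))
```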

4 Experimental Results

In this section, we first review the HSI-MSI datasets and the setup adopted in our experiments. Then, we provide an ablation study to verify the effectiveness of the proposed modules. Finally, extensive comparisons with state-of-the-art methods on indoor and remotely sensed images are reported.

Dataset and Experimental Setting. Three widely used HSI-MSI datasets are investigated in this section, including the CAVE dataset [35] (http://www.cs.columbia.edu/CAVE/databases/multispectral), the Pavia University dataset, and the Chikusei dataset [36] (http://naotoyokoya.com/Download.html). The CAVE dataset captures 32 different indoor scenes. Each image consists of 512×512 pixels with 31 spectral bands uniformly measured in the wavelength range from 400nm to 700nm. In our experiments, 16 scenes are randomly selected to report the performance. The Pavia University dataset was acquired by the ROSIS airborne sensor over the University of Pavia, Italy, in 2003. The original HSI comprises 610×340 pixels and 115 spectral bands. The top-left corner of the HSI with 336×336 pixels and 103 bands (after removing 12 noisy bands), covering the spectral range from 430nm to 838nm, is used. The Chikusei dataset was taken by a Visible and Near-Infrared (VNIR) imaging sensor over Chikusei, Japan, in 2014. The original HSI consists of 2,517×2,335 pixels and 128 bands with a spectral range of 363nm to 1,018nm. We crop six non-overlapping parts with a size of 576×448 pixels from the bottom part for testing.

Considering the diversity of MS sensors in generating the HrMS images, we employ the SRFs of the Nikon D700 camera [24] for the CAVE dataset and of the Landsat-8 spaceborne MS sensor [3] (http://landsat.gsfc.nasa.gov/?p=5779) for the two remotely sensed datasets; for the latter, we select the spectral radiance responses of the blue-green-red (BGR) bands for Pavia University and of the BGR-NIR bands for Chikusei. We adopt a Gaussian filter to obtain the LrHS images, constructing the filter with width equal to the SR ratio and standard deviation 0.5. The SR ratios are set to 16 for the Pavia University dataset and 32 for the other two datasets.
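A sketch of this simulation protocol for generating the LrHS images; the exact kernel truncation is an assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_lr_hsi(hr_hsi, ratio):
    """Blur each band with a Gaussian of sigma 0.5 whose kernel half-width is
    about ratio/2 (via `truncate`), then decimate by the SR ratio."""
    blurred = np.stack([gaussian_filter(band, sigma=0.5, truncate=float(ratio))
                        for band in hr_hsi])        # hr_hsi: (L, H, W)
    return blurred[:, ::ratio, ::ratio]
```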

Method    Clamp  SSC  CA    PSNR   SAM   ERGAS  SSIM   UQI
CNMF        -     -    -    32.73  7.05   1.18  0.830  0.973
CUCaNet     ✗     ✗    ✗    34.25  6.58   1.01  0.862  0.975
CUCaNet     ✓     ✗    ✗    35.67  5.51   0.92  0.897  0.981
CUCaNet     ✓     ✓    ✗    36.55  4.76   0.85  0.904  0.991
CUCaNet     ✓     ✗    ✓    36.49  4.63   0.86  0.902  0.989
CUCaNet     ✓     ✓    ✓    37.22  4.43   0.82  0.914  0.991
Table 1: Ablation study on the Pavia University dataset by our CUCaNet with different modules and the CNMF baseline. The best results are shown in bold.

Evaluation Metrics. We use the following five complementary and widely used picture quality indices (PQIs) for the quantitative HSI-SR assessment: peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM) [18], erreur relative globale adimensionnelle de synthèse (ERGAS) [28], structural similarity (SSIM) [30], and the universal image quality index (UIQI) [31]. SAM reflects the spectral similarity by calculating the average angle between the estimated and reference spectra at each pixel. PSNR, ERGAS, and SSIM are mean-square-error (MSE)-based band-wise PQIs indicating spatial fidelity, global quality, and perceptual consistency, respectively. UIQI is also applied band-wise to measure complex distortions among monochromatic images.
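For reference, compact implementations of the two most-cited indices, PSNR and SAM, consistent with the definitions above (the (L, H, W) array layout is an assumption):

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Band-wise PSNR averaged over bands; ref/est: (L, H, W) arrays in [0, peak]."""
    mse = ((ref - est) ** 2).mean(axis=(1, 2))
    return float((10 * np.log10(peak ** 2 / mse)).mean())

def sam(ref, est, eps=1e-8):
    """Average spectral angle (degrees) between per-pixel spectra."""
    r, e = ref.reshape(ref.shape[0], -1), est.reshape(est.shape[0], -1)
    cos = (r * e).sum(0) / (np.linalg.norm(r, axis=0)
                            * np.linalg.norm(e, axis=0) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())
```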

4.1 Ablation Study

Our CUCaNet consists of a baseline network (coupled convolutional autoencoder networks) and two newly proposed modules, i.e., the spatial-spectral consistency (SSC) module and the cross-attention (CA) module. To investigate the performance gain of the different components, we perform an ablation analysis on the Pavia University dataset. We also study the effect of replacing the clamp function with the conventional softmax activation function at the end of each encoder. Table 1 details the quantitative results, in which CNMF is adopted as the baseline method.

As shown in Table 1, even the plain CUCaNet outperforms CNMF in all metrics owing to the benefit of employing deep networks. We find that the performance is further improved remarkably by the use of the clamp function. Meanwhile, the single SSC module performs better than the single CA module except in SAM, which suggests that the CA module tends to favor spectral consistency. By jointly employing the two modules, the proposed CUCaNet achieves the best results, demonstrating the effectiveness of our whole network architecture.

4.2 Comparative Experiments

Compared Methods. Here, we make a comprehensive comparison with the following eleven state-of-the-art (SOTA) methods in HSI-SR tasks: the pioneering work GSA [1] (http://naotoyokoya.com/Download.html); the NMF-based approaches CNMF [37] (http://naotoyokoya.com/Download.html) and CSU [19] (https://github.com/lanha/SupResPALM); the Bayesian-based approaches FUSE [32] (https://github.com/qw245/BlindFuse) and HySure [26] (https://github.com/alfaiate/HySure); the dictionary learning-based approach NSSR [8] (http://see.xidian.edu.cn/faculty/wsdong); the tensor-based approaches STEREO [16] (https://github.com/marhar19/HSR_via_tensor_decomposition), CSTF [20] (https://sites.google.com/view/renweidian), and LTTR [6] (https://sites.google.com/view/renweidian); and the DL-based methods uSDN (unsupervised) [24] (https://github.com/aicip/uSDN) and MHFnet (supervised) [33] (https://github.com/XieQi2015/MHF-net). The parameters of all compared methods are tuned to the best of our ability. For the supervised deep method MHFnet, we use the remaining part of each dataset for training, following the strategies in [33].

Functions   GSA  CNMF  CSU  FUSE  HySure  NSSR  STEREO  CSTF  LTTR  uSDN  MHFnet  CUCaNet
SRF          ✗    ✗     ✗    ✗     ✓       ✗     ✗       ✗     ✗     ✗     ✓       ✓
PSF          -    ✗     ✗    ✗     ✓       ✗     ✗       ✗     ✗     ✗     ✓       ✓
Table 2: The ability of learning unknown SRFs and PSFs of the competing methods (✓: learned; ✗: requires prior knowledge; -: not applicable).

Note that most of the above methods rely on the prior knowledge of SRFs and PSFs. We summarize the properties of all compared methods in learning SRFs and PSFs (see Table 2), where only HySure and MHFnet are capable of learning the two unknown functions. More specifically, HySure adopts a multi-stage method and MHFnet models them as convolution layers under a supervised framework. Hence our CUCaNet serves as the first unsupervised method that can simultaneously learn SRFs and PSFs in an end-to-end fashion.

Figure 3: The HSI-SR performance on the CAVE dataset (fake and real food) of CUCaNet in comparison with SOTA methods. For each HSI, the 20th (590nm) band image is displayed with two demarcated areas zoomed in 3 times for better visual assessment, and two main scores (PSNR/SAM) are reported with the best results in bold.
Metric  GSA  CNMF  CSU  FUSE  HySure  NSSR  STEREO  CSTF  LTTR  uSDN  MHFnet  CUCaNet
PSNR 27.89 30.11 30.26 29.87 31.26 33.52 30.88 32.74 35.45 34.67 37.30 37.51
SAM 19.71 9.98 11.03 16.05 14.59 12.09 15.87 13.13 9.69 10.02 7.75 7.49
ERGAS 1.11 0.69 0.65 0.77 0.72 0.69 0.75 0.64 0.53 0.52 0.49 0.47
SSIM 0.713 0.919 0.911 0.876 0.905 0.912 0.896 0.914 0.949 0.921 0.961 0.959
UQI 0.757 0.911 0.898 0.860 0.891 0.904 0.873 0.902 0.942 0.905 0.949 0.955
Table 3: Quantitative performance comparison with the investigated methods on the CAVE dataset. The best results are shown in bold.

Indoor Dataset. We first conduct experiments on indoor images of the CAVE dataset. The average quantitative results over the 16 testing images are summarized in Table 3 with the best ones highlighted in bold. From the table, we observe that LTTR and CSTF obtain better reconstruction results than the other conventional methods, mainly by virtue of their complex regularizations under the tensorial framework. Note that the SAM values of the earlier methods CNMF and CSU are still relatively low because they consider the coupled unmixing mechanism. As for the DL-based methods, the supervised MHFnet outperforms the unsupervised uSDN evidently, while our proposed CUCaNet achieves the best results in terms of the four major metrics. Only the SSIM value of ours is slightly worse than that of the most powerful competing method, MHFnet, owing to its extra exploitation of supervised information.

The visual comparison on two selected scenes in Fig. 3 and Fig. 4 exhibits a consistent tendency. From the figures, we can conclude that the results of CUCaNet maintain the highest fidelity to the ground truth (GT) among all methods. Specifically, for certain bands, our method can not only estimate the background more accurately but also maintain the texture details on different objects. The SAM values of CUCaNet on the two images are clearly lower than the others', which validates the superiority of the proposed method in capturing spectral characteristics by means of joint coupled unmixing and degradation-function learning.

Figure 4: The HSI-SR performance on the CAVE dataset (chart and staffed toy) of CUCaNet in comparison with SOTA methods. For each HSI, the 7th (460nm) band image is displayed with two demarcated areas zoomed in 3.5 times for better visual assessment, and two main scores (PSNR/SAM) are reported with the best results in bold.
Figure 5: The HSI-SR performance on the Pavia University dataset (cropped area) of all competing methods. The false-color image with bands 61-36-10 as R-G-B channels is displayed. One demarcated area (red frame) as well as its RMSE-based residual image (blue frame) with respect to GT are zoomed in 3 times for better visual assessment.

Remotely Sensed Datasets. We then carry out more experiments using airborne HS data to further evaluate the generality of our method. The quantitative evaluation results on the Pavia University and Chikusei datasets are provided in Table 4 and Table 5, respectively. Generally, we observe a significant performance improvement over the CAVE results, since more spectral information can be used as the number of HS bands increases. For the same reason, NMF-based and Bayesian-based methods show competitive performance owing to their accurate estimation of high-resolution subspace coefficients [36]. The limited performance of the tensor-based methods suggests that they may lack robustness to the spectral distortions in real cases. The multi-stage unsupervised training of uSDN makes it easily trapped in local minima, which results in performance only comparable to state-of-the-art conventional methods such as HySure and FUSE. It is particularly evident that MHFnet performs better on Chikusei than on Pavia University. This can be explained by the fact that training data are relatively adequate on Chikusei, so the tested patterns are more likely to be well learned. We have to admit, however, that MHFnet requires extremely rich training samples, which restricts its practical applicability to a great extent. Remarkably, our CUCaNet achieves better performance in most cases, showing a particular advantage in the spectral quality measured by SAM, which confirms that our method is good at capturing spectral properties and hence attains a better reconstruction of the HrHSI.

Metric  GSA  CNMF  CSU  FUSE  HySure  NSSR  STEREO  CSTF  LTTR  uSDN  MHFnet  CUCaNet
PSNR 30.29 32.73 33.18 33.24 35.02 34.74 31.34 30.97 29.98 34.87 36.34 37.22
SAM 9.14 7.05 6.97 7.78 6.54 7.21 9.97 7.69 6.92 5.80 5.15 4.43
ERGAS 1.31 1.18 1.17 1.27 1.10 1.06 1.35 1.23 1.30 1.02 0.89 0.82
SSIM 0.784 0.830 0.815 0.828 0.861 0.831 0.751 0.782 0.775 0.871 0.919 0.914
UQI 0.965 0.973 0.972 0.969 0.975 0.966 0.938 0.969 0.967 0.982 0.987 0.991
Table 4: Quantitative performance comparison with the investigated methods on the Pavia University dataset. The best results are shown in bold.
Figure 6: The HSI-SR performance on the Chikusei dataset (cropped area) of all competing methods. The false-color image with bands 61-36-10 as R-G-B channels is displayed. One demarcated area (red frame) as well as its RMSE-based residual image (blue frame) with respect to GT are zoomed in 3 times for better visual assessment.
Metric  GSA  CNMF  CSU  FUSE  HySure  NSSR  STEREO  CSTF  LTTR  uSDN  MHFnet  CUCaNet
PSNR 32.07 38.03 37.89 39.25 39.97 38.35 32.40 36.52 35.54 38.32 43.71 42.70
SAM 10.44 4.81 5.03 4.50 4.35 4.97 8.52 6.33 7.31 3.89 3.51 3.13
ERGAS 0.98 0.58 0.61 0.47 0.45 0.63 0.74 0.66 0.70 0.51 0.42 0.40
SSIM 0.903 0.961 0.945 0.970 0.974 0.961 0.897 0.929 0.918 0.964 0.985 0.988
UQI 0.909 0.976 0.977 0.977 0.976 0.914 0.902 0.915 0.917 0.976 0.992 0.990
Table 5: Quantitative performance comparison with the investigated methods on the Chikusei dataset. The best results are shown in bold.

Fig. 5 and Fig. 6 show the HSI-SR results in false color on these two datasets. Since it is hard to visually discern the differences among most fused results, we display the RMSE-based residual images of local windows with respect to the GT for better visual evaluation. For both datasets, we observe that GSA and STEREO yield poor results with relatively high errors. CNMF and CSU show evident patterns in their residuals that resemble the original image, indicating that their results miss actual details. The block-pattern-like errors in CSTF and LTTR make their reconstructions unsmooth. Note that the residual images of CUCaNet and MHFnet exhibit more dark blue areas than the other methods, meaning the errors are small and the fused results more reliable.

5 Conclusion

In this paper, we put forth CUCaNet for the task of HSI-SR by integrating the advantages of coupled spectral unmixing and deep learning techniques. For the first time, the learning of unknown SRFs and PSFs across MS-HS sensors is introduced into an unsupervised coupled unmixing network. Meanwhile, a cross-attention module and reasonable consistency enforcement are jointly employed to enrich feature extraction and guarantee a faithful product. Extensive experiments on both indoor and airborne HS datasets with diverse simulations validate the superiority of the proposed CUCaNet, with evident performance improvements over competitive methods both quantitatively and perceptually. In future work, we will investigate more theoretical insights to explain the effectiveness of the proposed network.

Acknowledgements. This work has been supported in part by projects of the NSFC (No. 61721002) and the China Scholarship Council.

References

  • [1] B. Aiazzi, S. Baronti, and M. Selva (2007) Improving component substitution pansharpening through multivariate regression of MS+Pan data. IEEE Transactions on Geoscience and Remote Sensing 45 (10), pp. 3230–3239. Cited by: §4.2.
  • [2] N. Akhtar, F. Shafait, and A. Mian (2014) Sparse spatio-spectral representation for hyperspectral image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 63–78. Cited by: §1, §2.1.
  • [3] J. A. Barsi, K. Lee, G. Kvaran, B. L. Markham, and J. A. Pedelty (2014) The spectral response of the landsat-8 operational land imager. Remote Sensing 6 (10), pp. 10232–10251. Cited by: §4.
  • [4] J. Bieniarz, D. Cerra, J. Avbelj, P. Reinartz, and R. Müller (2011) Hyperspectral image resolution enhancement based on spectral unmixing and information fusion. In ISPRS Hannover Workshop 2011, Cited by: §2.1.
  • [5] Z. Chen, H. Pu, B. Wang, and G. Jiang (2014) Fusion of hyperspectral and multispectral images: a novel framework based on generalization of pan-sharpening methods. IEEE Geoscience and Remote Sensing Letters 11 (8), pp. 1418–1422. Cited by: §2.
  • [6] R. Dian, S. Li, and L. Fang (2019) Learning a low tensor-train rank representation for hyperspectral image super-resolution. IEEE Transactions on Neural Networks and Learning Systems 30 (9), pp. 2672–2683. Cited by: §2.1, §4.2.
  • [7] R. Dian, S. Li, A. Guo, and L. Fang (2018) Deep hyperspectral image sharpening. IEEE Transactions on Neural Networks and Learning Systems 29 (99), pp. 1–11. Cited by: §2.2.
  • [8] W. Dong, F. Fu, G. Shi, X. Cao, J. Wu, G. Li, and X. Li (2016) Hyperspectral image super-resolution via non-negative structured sparse representation. IEEE Transactions on Image Processing 25 (5), pp. 2337–2352. Cited by: §2.1, §4.2.
  • [9] M. T. Eismann (2004) Resolution enhancement of hyperspectral imagery using maximum a posteriori estimation with a stochastic mixing model. Ph.D. Thesis, University of Dayton. Cited by: §2.1.
  • [10] Y. Fu, T. Zhang, Y. Zheng, D. Zhang, and H. Huang (2018) Joint camera spectral sensitivity selection and hyperspectral image recovery. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 788–804. Cited by: §1.
  • [11] Y. Fu, T. Zhang, Y. Zheng, D. Zhang, and H. Huang (2019) Hyperspectral image super-resolution with optimized RGB guidance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11661–11670. Cited by: §1, §2.2.
  • [12] D. Hong, W. Liu, J. Su, Z. Pan, and G. Wang (2015) A novel hierarchical approach for multispectral palmprint recognition. Neurocomputing 151, pp. 511–521. Cited by: §1.
  • [13] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu (2019) An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing 28 (4), pp. 1923–1938. Cited by: §2.1.
  • [14] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu (2019) Cospace: common subspace learning from hyperspectral-multispectral correspondences. IEEE Transactions on Geoscience and Remote Sensing 57 (7), pp. 4349–4359. Cited by: §1.
  • [15] D. Hong, N. Yokoya, N. Ge, J. Chanussot, and X. X. Zhu (2019) Learnable manifold alignment (lema): a semi-supervised cross-modality learning framework for land cover and land use classification. ISPRS Journal of Photogrammetry and Remote Sensing 147, pp. 193–205. Cited by: §1.
  • [16] C. I. Kanatsoulis, X. Fu, N. D. Sidiropoulos, and W. Ma (2018) Hyperspectral super-resolution: a coupled tensor factorization approach. IEEE Transactions on Signal Processing 66 (24), pp. 6503–6517. Cited by: §4.2.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §3.4.
  • [18] F. A. Kruse, A. Lefkoff, J. Boardman, K. Heidebrecht, A. Shapiro, P. Barloon, and A. Goetz (1993) The spectral image processing system (sips)-interactive visualization and analysis of imaging spectrometer data. Remote Sensing of Environment 44 (2-3), pp. 145–163. Cited by: §4.
  • [19] C. Lanaras, E. Baltsavias, and K. Schindler (2015) Hyperspectral super-resolution by coupled spectral unmixing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3586–3594. Cited by: §2.1, §4.2.
  • [20] S. Li, R. Dian, L. Fang, and J. M. Bioucas-Dias (2018) Fusing hyperspectral and multispectral images via coupled sparse tensor factorization. IEEE Transactions on Image Processing 27 (8), pp. 4118–4130. Cited by: §2.1, §4.2.
  • [21] L. Loncan, L. B. De Almeida, J. M. Bioucas-Dias, X. Briottet, J. Chanussot, N. Dobigeon, S. Fabre, W. Liao, G. A. Licciardi, M. Simoes, et al. (2015) Hyperspectral pansharpening: a review. IEEE Geoscience and Remote Sensing Magazine 3 (3), pp. 27–46. Cited by: §2.
  • [22] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), Cited by: §3.4.
  • [23] A. Ng et al. (2011) Sparse autoencoder. CS294A Lecture Notes 72 (2011), pp. 1–19. Cited by: §3.4.
  • [24] Y. Qu, H. Qi, and C. Kwan (2018) Unsupervised sparse dirichlet-net for hyperspectral image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2511–2520. Cited by: §1, §2.2, §4.2, §4.
  • [25] B. Rasti, D. Hong, R. Hang, P. Ghamisi, X. Kang, J. Chanussot, and J. A. Benediktsson (2020) Feature extraction for hyperspectral imagery: the evolution from shallow to deep (overview and toolbox). IEEE Geoscience and Remote Sensing Magazine. Note: DOI: 10.1109/MGRS.2020.2979764 Cited by: §1.
  • [26] M. Simoes, J. Bioucas-Dias, L. B. Almeida, and J. Chanussot (2014) A convex formulation for hyperspectral image superresolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing 53 (6), pp. 3373–3388. Cited by: §2.1, §4.2.
  • [27] G. Vivone, L. Alparone, J. Chanussot, M. Dalla Mura, A. Garzelli, G. A. Licciardi, R. Restaino, and L. Wald (2014) A critical comparison among pansharpening algorithms. IEEE Transactions on Geoscience and Remote Sensing 53 (5), pp. 2565–2586. Cited by: §2.
  • [28] L. Wald (2000) Quality of high resolution synthesised images: is there a simple criterion?. In 3rd Conference Fusion Earth Data: Merging Point Measurements, Raster Maps, and Remotely Sensed Images, Cited by: §4.
  • [29] Q. Wang and P. M. Atkinson (2017) The effect of the point spread function on sub-pixel mapping. Remote Sensing of Environment 193, pp. 127–137. Cited by: §3.3.
  • [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.
  • [31] Z. Wang and A. C. Bovik (2002) A universal image quality index. IEEE Signal Processing Letters 9 (3), pp. 81–84. Cited by: §4.
  • [32] Q. Wei, N. Dobigeon, and J. Tourneret (2015) Fast fusion of multi-band images based on solving a sylvester equation. IEEE Transactions on Image Processing 24 (11), pp. 4109–4121. Cited by: §2.1, §4.2.
  • [33] Q. Xie, M. Zhou, Q. Zhao, D. Meng, W. Zuo, and Z. Xu (2019) Multispectral and hyperspectral image fusion by ms/hs fusion net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1585–1594. Cited by: §1, §2.2, §4.2.
  • [34] J. Yao, D. Meng, Q. Zhao, W. Cao, and Z. Xu (2019) Nonconvex-sparsity and nonlocal-smoothness-based blind hyperspectral unmixing. IEEE Transactions on Image Processing 28 (6), pp. 2991–3006. Cited by: §1.
  • [35] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar (2010) Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing 19 (9), pp. 2241–2253. Cited by: §4.
  • [36] N. Yokoya, C. Grohnfeldt, and J. Chanussot (2017) Hyperspectral and multispectral data fusion: a comparative review of the recent literature. IEEE Geoscience and Remote Sensing Magazine 5 (2), pp. 29–56. Cited by: §1, §4.2, §4.
  • [37] N. Yokoya, T. Yairi, and A. Iwasaki (2011) Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing 50 (2), pp. 528–537. Cited by: §2.1, §4.2.
  • [38] H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2016) Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging 3 (1), pp. 47–57. Cited by: §3.4.
  • [39] K. Zheng, L. Gao, W. Liao, D. Hong, B. Zhang, X. Cui, and J. Chanussot (2020) Coupled convolutional neural network with adaptive response function learning for unsupervised hyperspectral super-resolution. IEEE Transactions on Geoscience and Remote Sensing. Note: DOI: 10.1109/TGRS.2020.3006534. Cited by: §2.2.