
Hyperspectral Image Super-resolution via Deep Progressive Zero-centric Residual Learning

This paper explores the problem of hyperspectral image (HSI) super-resolution that merges a low resolution HSI (LR-HSI) and a high resolution multispectral image (HR-MSI). The cross-modality distribution of the spatial and spectral information makes the problem challenging. Inspired by classic wavelet decomposition-based image fusion, we propose a novel lightweight deep neural network-based framework, namely the progressive zero-centric residual network (PZRes-Net), to address this problem efficiently and effectively. Specifically, PZRes-Net learns a high resolution and zero-centric residual image, which contains high-frequency spatial details of the scene across all spectral bands, from both inputs in a progressive fashion along the spectral dimension. The resulting residual image is then superimposed onto the up-sampled LR-HSI in a mean-value invariant manner, leading to a coarse HR-HSI, which is further refined by exploring the coherence across all spectral bands simultaneously. To learn the residual image efficiently and effectively, we employ spectral-spatial separable convolution with dense connections. In addition, we propose zero-mean normalization, implemented on the feature maps of each layer, to realize the zero-mean characteristic of the residual image. Extensive experiments over both real and synthetic benchmark datasets demonstrate that our PZRes-Net outperforms state-of-the-art methods to a significant extent in terms of 4 quantitative metrics and visual quality; e.g., our PZRes-Net improves the PSNR by more than 3 dB, while using 2.3× fewer parameters and consuming 15× fewer FLOPs.





I Introduction

Hyperspectral imaging is aimed at collecting information from across the electromagnetic spectrum for each pixel of the image of a scene [6, 21]. The rich spectral information of the recorded hyperspectral image (HSI) enables it to deliver more faithful knowledge of a targeted scene than conventional imaging modalities [54]. As a result, the HSI has grown increasingly popular over the past ten years in various fields, such as military, industrial, and scientific arenas. The HSI has also boosted the performance of applications in computer vision, such as tracking [43], segmentation [26, 27], and object recognition [40].

However, due to the hardware limitations of existing imaging systems, there is an inevitable trade-off between spectral and spatial resolution [9]. A specific optical system can only record an image with either high spatial resolution together with very limited spectral bands, e.g., the high resolution multispectral image (HR-MSI), or dense spectral bands with reduced spatial resolution, e.g., the low resolution HSI (LR-HSI). Hence, as illustrated in Fig. 1, HSI super-resolution (a.k.a. MSI/HSI fusion) that merges an HR-MSI and an LR-HSI has become a promising way to obtain HR-HSIs [52, 47, 9].

To tackle this challenge, various methods have been proposed in the last few decades. From the perspective of signal processing, multi-scale decomposition-based methods that consume only very limited computation resources have demonstrated their abilities in information fusion [28], e.g., pyramid-based [20] and wavelet decomposition-based [36, 61] methods. Based on the prior knowledge of HSIs, e.g., the sparse prior and the low-rank prior, plenty of matrix factorization-based methods have emerged [59, 10, 14]. Recently, owing to their remarkable representation learning ability, deep neural network (DNN)-based methods have been introduced [52, 9, 47, 12], whose performance exceeds that of traditional/non-DNN methods to a large extent (see Section II for more details). However, the reconstruction quality of the current state-of-the-art methods is still not satisfactory, due to insufficient utilization/modeling of the cross-modality information.

Fig. 1: The illustration of HSI super-resolution from an LR-HSI and an HR-MSI that capture the same scene.

Inspired by the classic wavelet decomposition-based methods [36, 13], we propose a novel DNN-based framework, namely the Progressive Zero-centric Residual Network (PZRes-Net), to achieve HSI super-resolution in an efficient and effective manner. As shown in Fig. 2, the input LR-HSI is first up-sampled in a mean-value invariant manner. Following that, a zero-centric residual image is progressively learned along the spectral dimension from both the up-sampled LR-HSI and the HR-MSI with a zero-centric residual learning module, in which spectral-spatial separable convolutions with dense aggregation extract spectral-spatial information efficiently and effectively. Moreover, zero-mean normalization is applied to promote the zero-centric characteristic of the learned residual image. The resulting residual image is further superimposed on the up-sampled LR-HSI, leading to a coarse HR-HSI, which is finally refined by exploring the coherency among all spectral bands simultaneously. We conduct extensive experiments and comparisons to evaluate and analyze the proposed PZRes-Net comprehensively. It is concluded that our PZRes-Net remarkably outperforms state-of-the-art methods both quantitatively and qualitatively across multiple real and synthetic benchmark datasets. Especially, our PZRes-Net improves the PSNR by more than 3 dB, while using 2.3× fewer parameters and consuming 15× fewer FLOPs.

The rest of this paper is organized as follows. Section II briefly reviews existing methods. Section III presents the proposed method, namely PZRes-Net, followed by extensive experimental results as well as comprehensive analyses on both synthetic and real data in Section IV. Finally, Section V concludes this paper.

II Related Work

We classify existing methods into two categories, i.e., (1) traditional methods, including multi-scale decomposition-based [20], [61, 38, 29], [55, 62] and optimization-based [3]; and (2) deep learning-based methods [8, 52, 47]. In the following, we review them in detail.

II-A Traditional Methods

Multi-scale decomposition-based methods focus on representing the spatial structures of an image with multiple layers. Various wavelet decomposition (WD)-based methods have been proposed in the past few decades [36, 20, 61, 37, 58, 35]. For example, Nunez et al. [36] proposed two types of WD-based methods for MSI pansharpening: the additive method and the substitution method. Specifically, the former consists of the following steps: (1) register an LR-MSI with an HR-panchromatic image and upsample it to the same resolution in order to be superimposed; (2) decompose the HR-panchromatic image into several wavelet planes containing high-frequency spatial information which follows a zero-mean distribution; and (3) add the wavelet planes to the up-sampled LR-MSI. In contrast, the latter decomposes both inputs and then replaces the wavelet planes of the up-sampled LR-MSI with those of the HR-panchromatic image. They also experimentally demonstrated that the additive method preserves spatial information better than the substitution method. Gonzalo et al. [13] also proposed an adaptive WD-based method, which fuses the wavelet planes with different weights.
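As a minimal numerical sketch of the additive method, the following uses a periodic box blur as a stand-in for the shift-invariant wavelet transform (array names and the blur choice are illustrative assumptions, not the method of [36]):

```python
import numpy as np

def box_blur_periodic(img, radius=1):
    """Box blur with periodic padding; each roll is a permutation of the pixels,
    so the image mean is preserved exactly."""
    k = 2 * radius + 1
    acc = np.zeros_like(img, dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            acc += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return acc / (k * k)

def additive_fusion(lr_up, hr_pan):
    """Steps (2)-(3): extract a zero-mean high-frequency plane from the HR
    panchromatic image and add it to the registered, up-sampled LR band."""
    plane = hr_pan - box_blur_periodic(hr_pan)  # zero-mean detail plane
    return lr_up + plane

rng = np.random.default_rng(0)
hr_pan = rng.random((32, 32))
lr_up = box_blur_periodic(rng.random((32, 32)))  # toy up-sampled LR band
fused = additive_fusion(lr_up, hr_pan)
```

Because the detail plane has (numerically) zero mean, the fused band keeps the total flux of the LR band, which is the property the additive scheme relies on.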

Based on the prior knowledge of HSIs [52], a considerable number of traditional machine learning-based methods have been proposed. For example, matrix factorization-based methods assume that each spectrum can be linearly represented with several spectral atoms [8]. Under the assumption that an HSI lies in a low-dimensional subspace, Wei et al. [49] used spatial dictionaries learned from the HR-MSI to promote the spatial similarities. Fang et al. [11] proposed a super-pixel-based sparse representation. Kanatsoulis et al. [18] established a coupled tensor factorization framework. Han et al. [15] utilized a self-similarity prior as the constraint for the sparse representation of the input HSI and MSI. Akhtar et al. [3] first learned a non-negative dictionary, and then introduced a simultaneous greedy pursuit algorithm to estimate coefficients for each local patch. Dong et al. [10] proposed a matrix factorization-based method which imposes joint sparsity and non-negativity constraints on the learned representation coefficients.

However, these traditional methods are usually constructed based on certain priors, e.g., the sparse prior, the low-rank prior, and the global similarity prior, which may not be consistent with complex real-world scenarios [52].

II-B DNN-based Methods

Powered by their strong representation learning ability, DNNs have become an emerging tool for HSI super-resolution [8]. Palsson et al. [39] proposed a 3-D convolutional neural network (CNN)-based MSI/HSI fusion approach and reduced the computational cost by using principal component analysis (PCA). Dian et al. [9] used deep priors learned by residual learning-based DNNs and reconstructed the HR-HSI by solving optimization problems. Mei et al. [33] proposed a 3-D CNN to exploit both the spatial context and the spectral correlation. Arun et al. [4] explored DNNs to jointly optimize the unmixing and mapping operations in a supervised manner. Xie et al. [52] proved that an HR-HSI can be represented by the linearly transformed HR-MSI and a to-be-estimated residual image, and then unfolded an iterative algorithm that solves for these two components within a deep learning framework for HSI/MSI fusion. Wang et al. [46, 45] tackled HSI reconstruction using compressive sensing-based methods. Xiong et al. [53] proposed CNN-based methods to recover HSIs from RGB images. Qu et al. [41] solved the HSI super-resolution problem using an unsupervised encoder-decoder architecture.

Most of the above-mentioned methods are not blind, meaning that they are trained with the knowledge of HR-HSIs or degradation models. Although Wang et al. [47] proposed a blind fusion model similar to [52], their performance leaves much to be desired. Moreover, due to the iterative HSI refinement, their computation burdens are very high, which may restrict practical implementations.

Fig. 2: (a) The overall flowchart of the proposed PZRes-Net for HSI super-resolution that merges an LR-HSI and an HR-MSI. (b) The network architecture of a stage contained in the zero-centric residual learning module. (c) The network architecture of the refinement module.

III The Proposed Method

Let $\mathbf{Z} \in \mathbb{R}^{HW \times S}$ be an $S$-spectral-band HR-HSI to be reconstructed, each column of which is the vectorial representation of a spectral band of spatial resolution $H \times W$. The degradation models for the observed HR-MSI with $s$ ($s \ll S$) spectral bands, denoted by $\mathbf{Y} \in \mathbb{R}^{HW \times s}$, and the LR-HSI of spatial resolution $h \times w$ ($h \ll H$, $w \ll W$), denoted by $\mathbf{X} \in \mathbb{R}^{hw \times S}$, from $\mathbf{Z}$ can be formulated as:

$$\mathbf{Y} = \mathbf{Z}\mathbf{C} + \mathbf{N}_y, \quad (1)$$

$$\mathbf{X} = \mathbf{D}\mathbf{B}\mathbf{Z} + \mathbf{N}_x, \quad (2)$$

where $\mathbf{C} \in \mathbb{R}^{S \times s}$ is the camera spectral response function that integrates over the spectral dimension of the HSI to produce the MSI; $\mathbf{B} \in \mathbb{R}^{HW \times HW}$ represents the blurring matrix applied on the HR-HSI before performing spatial decimation via the matrix $\mathbf{D} \in \mathbb{R}^{hw \times HW}$; and $\mathbf{N}_y$ and $\mathbf{N}_x$ are the noises in $\mathbf{Y}$ and $\mathbf{X}$, respectively. From Eqs. (1) and (2), it can be seen that $\mathbf{Y}$ contains high-resolution spatial context, while $\mathbf{X}$ keeps dense spectral details. Thus, the challenge of HSI super-resolution, i.e., reconstructing $\mathbf{Z}$ from $\mathbf{X}$ under the assistance of $\mathbf{Y}$, boils down to “how to leverage the spatial advantage of $\mathbf{Y}$ and propagate it across the densely sampled spectral bands of $\mathbf{X}$ effectively.”
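To make the two degradation models concrete, the following toy sketch simulates both on a random HR-HSI; a plain r×r mean pooling stands in for the blur-plus-decimation operator, and all array names and sizes are illustrative assumptions:

```python
import numpy as np

H, W, S, s, r = 32, 32, 31, 3, 8              # spatial size, HSI/MSI bands, scale
rng = np.random.default_rng(0)
Z = rng.random((H, W, S))                     # hypothetical ground-truth HR-HSI

# Eq. (1): HR-MSI via spectral integration; each column of C sums to 1, so every
# MSI channel is a convex combination of the S spectral bands.
C = rng.random((S, s))
C /= C.sum(axis=0, keepdims=True)
Y = Z @ C                                     # HR-MSI, shape (H, W, s)

# Eq. (2): LR-HSI via blur + decimation, sketched as r x r mean pooling per band.
X = Z.reshape(H // r, r, W // r, r, S).mean(axis=(1, 3))  # shape (H/r, W/r, S)
```

Note that mean pooling preserves the per-band mean value exactly, which anticipates the mean-value invariance discussed in Section III-B.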

III-A Motivation and Overview

Multi-scale decomposition-based methods have demonstrated their effectiveness in image fusion [64], [63], [28]. Particularly, the classic wavelet decomposition-based scheme for enhancing an LR-image with an HR-image from another modality contains the following procedures: the LR-image is first up-sampled to the same resolution as the HR-image in order to be superimposed; and wavelet planes containing zero-centric/mean high-frequency details are then obtained by decomposing the HR-image with a shift-invariant wavelet transform, which are further superimposed onto the up-sampled LR-image. Such a scheme is able to utilize the spatial detail information from both images. Moreover, as the wavelet planes are designed to have zero mean-values, the total flux of the enhanced LR-image will be preserved. Inspired by the principles of traditional multi-scale decomposition-based methods, we will study the HSI super-resolution by exploring the powerful representation capabilities of DNNs to learn such zero-centric high-frequency details adaptively. In addition to inheriting the advantage of the traditional methods, it is expected that the data-adaptive characteristic of such a learning manner can boost the performance, compared with the pre-defined and data-independent decomposition process in traditional frameworks.

(a) DIV2K dataset
(b) COCO dataset
Fig. 3: Statistics of the difference between the mean-values of 800 and 40K pairs of HR and LR RGB images respectively from DIV2K [1], and COCO datasets [32]. All images are normalized to [0,1]. For each dataset, the LR-images were obtained by downsampling the corresponding HR ones with factor 8.

As shown in Fig. 2, the proposed framework, namely the progressive zero-centric residual network (PZRes-Net), is mainly composed of three modules: a mean-value invariant up-sampling module, a progressive zero-centric residual learning module, and a refinement module. Specifically, the up-sampling module first lifts the input LR-HSI to the same resolution as the input HR-MSI in a mean-value invariant manner. Then, the residual learning module estimates a residual image from both the input HR-MSI and the up-sampled LR-HSI progressively along the spectral dimension, in which zero-mean normalization is applied to the input of each convolutional layer to enforce the mean-value of each band of the predicted residual image to be zero. The resulting zero-centric residual image is further superimposed onto the up-sampled LR-HSI, leading to a coarse HR-HSI, which is finally fed into the refinement module, where the coherence across all spectral bands of the coarse HR-HSI is simultaneously explored in a residual learning manner to further improve the reconstruction quality. In what follows, we introduce each module in detail.

Remark: It is worth pointing out that the residual learning manner of our framework is essentially different from the residual learning widely exploited in various networks and image/video applications [60], [24]. Traditional residual learning was introduced to facilitate neural network optimization or enhance feature extraction, while that of our PZRes-Net mimics the classic multi-scale decomposition-based fusion methods. Moreover, our PZRes-Net learns a zero-centric residual, while there is no such constraint in traditional residual learning.

III-B Mean-value Invariant Up-sampling

This module aims to lift the input LR-HSI to the same spatial resolution as the HR-MSI for subsequent residual superimposition. The Law of Large Numbers dictates that the average of observations of a random variable should be close to its expected value when it is based on a large number of trials. Generally, image pixels from LR and HR captures are repetitive samples of the same real scene, and are thus expected to have approximately identical mean-values aligned with this expected value. Such an observation is also experimentally validated in Fig. 3, where the histograms show the statistics of the difference between the mean values of 800 and 40K pairs of LR and HR RGB images respectively from DIV2K [1] and COCO [32], two commonly-used benchmark image datasets.

Recall that in our framework, a zero-mean residual image predicted by our zero-centric residual learning module is superimposed on the up-sampled LR-HSI to form the HR-HSI. Therefore, to avoid distortion, the up-sampling process should be mean-value invariant, i.e., each band of the up-sampled LR-HSI should have an approximately identical mean-value to the corresponding band of the input LR-HSI. To achieve this objective, the transposed convolutional layer widely used in image super-resolution could be employed, but with an additional restriction on its learnable kernel: the sum of the kernel elements that simultaneously convolve with the pixels of the input LR-image should be equal to 1 as the kernel slides over the interpolated LR-image. For simplicity, in this paper we adopt bi-linear interpolation to realize the mean-value invariant up-sampling, which has been experimentally demonstrated to work very well.
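A minimal sketch of the mean-value invariance property is given below; it uses nearest-neighbour replication via np.kron because that is exactly mean-invariant and dependency-free (the paper itself uses bi-linear interpolation, which behaves the same up to boundary effects):

```python
import numpy as np

def upsample_nn(band, r):
    """Replicate each LR pixel r x r times; the band mean is preserved exactly,
    since every LR pixel contributes the same number of HR pixels."""
    return np.kron(band, np.ones((r, r)))

rng = np.random.default_rng(0)
lr_band = rng.random((16, 16))
up_band = upsample_nn(lr_band, 8)             # shape (128, 128)

# Superimposing a zero-centric residual leaves the band mean unchanged.
residual = rng.random((128, 128))
residual -= residual.mean()                   # enforce the zero-mean property
coarse = up_band + residual
```

This is why the zero-mean constraint on the residual matters: any deviation of the residual's mean from zero would directly bias the flux of the reconstructed band.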

III-C Progressive Zero-centric Residual Learning

In this module, a zero-centric residual image containing high-frequency spatial details of the captured scene is regressed by deeply extracting spatial context information from both the input HR-MSI and the up-sampled LR-HSI. Inspired by the success of the progressive reconstruction strategy in image super-resolution [2, 23, 19], we embed the spectral bands of the up-sampled LR-HSI in a progressive fashion, rather than feeding all of them into the network at the beginning. That is, as shown in Fig. 2, the spectral bands of the up-sampled LR-HSI that are embedded into the network at different stages are regularly decimated with strides that decrease exponentially as the number of stages increases. Taking an HSI with 32 spectral bands as an example, the numbers of decimated spectral bands from the first to the third stage are 8, 16, and 32, respectively.
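The progressive band embedding can be illustrated with simple strided slicing; the concrete stride values (4, 2, 1) are an assumption matching the 32-band example above:

```python
import numpy as np

hsi_up = np.zeros((4, 4, 32))                  # toy up-sampled LR-HSI, 32 bands
# Strides decrease exponentially over the stages: 4, 2, 1, so the stages see
# 8, 16, and finally all 32 regularly decimated spectral bands.
stage_inputs = [hsi_up[:, :, ::stride] for stride in (4, 2, 1)]
```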

Specifically, at each stage, 1-D convolution is first applied to the HR-MSI over the spectral domain to lift the number of feature channels to the same level as the input spectral bands decimated from the up-sampled LR-HSI, during which high-order details are scattered to all channels. Then, the resulting feature maps from the HR-MSI and the decimated spectral bands from the up-sampled LR-HSI are concatenated along the spectral dimension and further fed into a sequence of spectral-spatial separable convolutional layers aggregated with dense connections [16, 60] for efficient and comprehensive spectral-spatial feature extraction. During the feature extraction, the spatial details are mainly provided by the HR-MSI and propagated to all spectral bands with reference to the spectral information from the LR-HSI. Moreover, to obtain a zero-centric residual image, zero-mean normalization is applied to the input feature maps of each convolutional layer over the spatial dimension independently. In addition, an identity mapping directly connects the output of the first spectral convolutional layer to the output of the stage in an additive manner to further enhance the flow of the high-frequency spatial details. Note that for stage-$t$ ($t > 1$), the output of stage-$(t-1)$ is also concatenated as its input.

Table I outlines the architecture details of the first stage. In the following, we elaborate on the spectral-spatial separable convolution and zero-mean normalization in more detail.

III-C1 Spectral-spatial separable convolution with dense connections

As mentioned earlier, the input to the residual learning module is a 3-D tensor, i.e., the concatenation of the 3-D HR-MSI and the up-sampled LR-HSI. To comprehensively explore the information from both spectral and spatial domains, 3-D convolution is an intuitive choice for constructing the residual learning module, and it has demonstrated its effectiveness [33, 30]. However, compared with conventional 2-D convolution, 3-D convolution significantly increases the parameter size, which may cause over-fitting and consume huge computation resources. Analogous to the approximation of a high-dimensional filter with multiple low-dimensional filters in the field of signal processing, we use spectral-spatial separable (3S) convolutions to process the 3-D tensor for efficient spectral-spatial feature extraction. Note that separable convolutions have also demonstrated their effectiveness and efficiency in other deep learning-based image processing tasks [57, 50, 34].

Specifically, the 3S convolution conducts 1-D spectral convolution (i.e., 1-D convolution over the spectral domain) and 2-D spatial convolution (i.e., independent 2-D convolution over the spatial domain of each feature map) sequentially, with an activation layer inserted between the two kinds of convolutions. The spectral convolution is equipped with a kernel of size $C \times 1 \times 1$, with $C$ being the number of feature channels, while the spatial convolution employs a group of 2-D kernels of size $3 \times 3$. Moreover, to enhance the feature extraction ability of the network for residual learning, we densely connect the 3S convolutional layers within a stage [16]. That is, the feature maps obtained from all the preceding layers are concatenated along the spectral dimension and passed to the current layer. Additionally, such dense connections could potentially improve the information flow and reduce overfitting [16].
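A dependency-free sketch of one 3S layer follows: a spectral 1×1 mixing step, a ReLU, then a depthwise 3×3 spatial convolution (periodic padding, array shapes, and names are illustrative assumptions, not the exact layer of the paper):

```python
import numpy as np

def spectral_conv(x, w):
    """1-D spectral convolution: mix the C_in channels of every pixel.
    x: (H, W, C_in), w: (C_in, C_out)."""
    return np.tensordot(x, w, axes=([2], [0]))

def spatial_conv_depthwise(x, k):
    """2-D spatial convolution applied independently per channel.
    x: (H, W, C), k: (C, 3, 3); periodic padding for brevity."""
    out = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(np.roll(x, dy, axis=0), dx, axis=1) * k[:, dy + 1, dx + 1]
    return out

def s3_layer(x, w_spec, k_spat):
    """One 3S convolution: spectral conv -> activation -> spatial conv."""
    return spatial_conv_depthwise(np.maximum(spectral_conv(x, w_spec), 0.0), k_spat)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 16))
y = s3_layer(x, rng.random((16, 16)), rng.random((16, 3, 3)))

# Parameter comparison for C=16 channels: the separable pair needs far fewer
# weights than a dense 3-D kernel of the same receptive field.
params_3s = 16 * 16 + 16 * 3 * 3
params_3d = 16 * 16 * 3 * 3
```

The parameter counts make the efficiency argument concrete: the separable pair grows as C² + 9C rather than 9C².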

First stage of the residual learning module:

| Layer | Kernel shape | # Input channels | # Output channels | Output shape | ReLU | ZM-norm |
| LinearTransform | 8×3×1×1 | 3 | 8 | 128×128×8 | | |
| Concatenate | – | 8+8=16 | 16 | 128×128×16 | | |
| 3S Convolution-ℓ: Spectral convolution | 16ℓ×16×1×1 | 16ℓ | 16 | 128×128×16 | | |
| 3S Convolution-ℓ: Spatial convolution | 16×1×3×3 | 16 | 16 | 128×128×16 | | |
| 3S Convolution (without activation): Spectral convolution | 128×8×1×1 | 128 | 8 | 128×128×8 | | |
| 3S Convolution (without activation): Spatial convolution | 8×1×3×3 | 8 | 8 | 128×128×8 | | |
| Residual Addition | – | 8+8=16 | 8 | 128×128×8 | | |

Refinement module:

| Layer | Kernel shape | # Input channels | # Output channels | Output shape | ReLU | ZM-norm |
| Input | – | 31 | 31 | 128×128×31 | | |
| 3S Convolution-ℓ: Spectral convolution | 31ℓ×31×1×1 | 31ℓ | 31 | 128×128×31 | | |
| 3S Convolution-ℓ: Spatial convolution | 31×1×3×3 | 31 | 31 | 128×128×31 | | |
| 3S Convolution (without activation): Spectral convolution | 31×31×1×1 | 31 | 31 | 128×128×31 | | |
| 3S Convolution (without activation): Spatial convolution | 31×1×3×3 | 31 | 31 | 128×128×31 | | |
| Residual Addition | – | 31+31=62 | 31 | 128×128×31 | | |

TABLE I: The architecture details of the first stage of the residual learning module and the refinement module. “3S Convolution-ℓ” indicates the ℓ-th spectral-spatial convolutional layer in the stage.

III-C2 Zero-mean normalization

Our objective is to learn a zero-centric residual image. However, the non-linear activation layers (e.g., ReLU and Swish) involved in the residual learning module make the output feature maps non-negative, so that their mean-values deviate from zero, and likewise for the estimated residual image. Besides, without additional constraints on the learned convolutional kernels, the convolution operation may also affect the mean-values of the output of each layer.¹

¹ Generally, the mean-value of a feature map is preserved by a convolution operation only if the sum of the elements of the involved kernel is equal to 1.

To achieve this objective, we propose a novel feature normalization process, namely Zero-Mean normalization (ZM-norm), which is performed on the spatial domain of each feature channel involved in the residual learning module. Specifically, in the forward propagation, the ZM-norm behaves as

$$\hat{f}_{b,i,c} = f_{b,i,c} - \frac{1}{HW}\sum_{j=1}^{HW} f_{b,j,c}, \quad (3)$$

where $f_{b,i,c}$ is the $(b,i,c)$-th entry of $\mathbf{F}$, the input feature maps to a typical convolutional layer ($1 \le b \le B$, $1 \le i \le HW$, $1 \le c \le C$). Here $b$, $i$, and $c$ indicate the mini-batch number, the spatial location, and the channel number, respectively. The gradient of the training loss can be back-propagated through ZM-norm according to Eq. (4):

$$\frac{\partial \mathcal{L}}{\partial f_{b,i,c}} = \frac{\partial \mathcal{L}}{\partial \hat{f}_{b,i,c}} - \frac{1}{HW}\sum_{j=1}^{HW} \frac{\partial \mathcal{L}}{\partial \hat{f}_{b,j,c}}. \quad (4)$$
The proposed ZM-norm can be easily and efficiently implemented and integrated with existing deep learning architectures. Moreover, ZM-norm introduces an additional advantage: it accelerates the training process. The underlying reason is that during backpropagation, gradients are related to their corresponding feature values, and a zero-centric feature distribution limits the gradient magnitude [17, 25], therefore leading to more stable updates and faster convergence.
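Both the forward and backward passes of ZM-norm reduce to a single mean subtraction over the spatial axes; a minimal numpy sketch (the (B, H, W, C) layout is an assumption made for illustration):

```python
import numpy as np

def zm_norm(f):
    """Forward pass: subtract the spatial mean of every (batch, channel) slice,
    so each feature channel becomes zero-centric."""
    return f - f.mean(axis=(1, 2), keepdims=True)

def zm_norm_backward(grad_out):
    """Backward pass: the upstream gradient is itself re-centred over the
    spatial domain, mirroring the forward mean subtraction."""
    return grad_out - grad_out.mean(axis=(1, 2), keepdims=True)

rng = np.random.default_rng(0)
f = rng.random((2, 8, 8, 4))
f_hat = zm_norm(f)
```

A useful sanity check: a spatially constant upstream gradient is annihilated by the backward pass, since the mean subtraction leaves no direction along which a constant shift of the features changes the loss.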

Remark: We would like to point out that our ZM-norm is different from existing feature normalization methods, such as Layer Normalization [5], Group Normalization [51], and Instance Normalization [44]. Those normalization methods were mainly proposed to speed up the training process and improve model generalization. Generally, they force the intermediate feature maps into a certain learnable distribution by calculating the standard deviation and rescaling the feature magnitudes accordingly, which may cause loss of scale information [31]. Our ZM-norm, however, focuses on eliminating the mean value of feature maps to regress a zero-centric residual image that captures the high-frequency spatial details.

III-D Refinement Module

In the residual learning module, the residual image of each spectral band is independently synthesized and then superimposed on the corresponding band of the up-sampled LR-HSI, leading to a coarse HR-HSI denoted by $\tilde{\mathbf{Z}}$. However, the coherence among the bands of $\tilde{\mathbf{Z}}$ cannot be well preserved. Therefore, as shown in Fig. 2(c), we further introduce a refinement module, in which all the spectral bands of $\tilde{\mathbf{Z}}$ are simultaneously explored to enhance the reconstruction quality.

The overall process of the refinement module can be formulated as

$$\hat{\mathbf{Z}} = f_r(\tilde{\mathbf{Z}}; \mathbf{W}_r) + \tilde{\mathbf{Z}},$$

where $\tilde{\mathbf{Z}}$ is the coarse HR-HSI, $\mathbf{W}_r$ denotes the weights to be learned, and $\hat{\mathbf{Z}}$ is the finally reconstructed HR-HSI. More specifically, three 3S convolutional layers are employed to achieve the feature extraction $f_r(\cdot)$. Note that ZM-norm is no longer used in this module. The detailed implementation of this module is summarized in Table I.

III-E Loss Function for Training

Our PZRes-Net is trained end-to-end with the following loss function:

$$\mathcal{L} = \lambda \left\| \mu(\mathbf{Z}_{\mathrm{res}}) \right\|_1 + \left\| \hat{\mathbf{Z}} - \mathbf{Z} \right\|_1, \quad (5)$$

where $\mathbf{Z}_{\mathrm{res}}$ is the learned residual image, $\mu(\cdot)$ returns the mean-value of each of its spectral bands, $\lambda$ is the parameter to balance the two terms, which is empirically set to 1, and $\|\cdot\|_1$ is the $\ell_1$-norm of a matrix, which returns the sum of the absolute values of its elements. The first term enforces PZRes-Net to learn the zero-centric residual image, while the second term encourages the reconstructed HR-HSI to be close to the ground-truth one in the sense of the mean absolute error.
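Assuming the zero-centric term penalises the per-band means of the residual (symbol and function names below are illustrative), the loss can be sketched as:

```python
import numpy as np

def pzres_loss(z_hat, z_gt, z_res, lam=1.0):
    """Sketch of the training loss: an l1 penalty on the per-band mean of the
    learned residual (zero-centric term, weighted by lam, empirically 1) plus
    the l1-norm of the reconstruction error (mean-absolute-error term).
    All tensors are (H, W, S)."""
    zero_centric = np.abs(z_res.mean(axis=(0, 1))).sum()
    recon = np.abs(z_hat - z_gt).sum()
    return lam * zero_centric + recon

rng = np.random.default_rng(0)
z = rng.random((8, 8, 4))
res = rng.random((8, 8, 4))
res -= res.mean(axis=(0, 1), keepdims=True)   # a perfectly zero-centric residual
```

With a perfect reconstruction and a zero-centric residual, both terms vanish; any bias in a residual band or error in the reconstruction raises the loss.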

(a) Toys
(b) Peppers
(c) Tomatoes
(d) Img f5
(e) Img h2
(f) Img e0
Fig. 4: Quantitative comparisons of the proposed PZRes-Net with state-of-the-art methods in terms of the PSNR value of each spectral band of the reconstructed HR-HSI. Note that these 6 results correspond to the 6 images illustrated in Fig. 5.

IV Experiments

IV-A Experiment Settings

IV-A1 Implementation details

We adopted the ADAM [22] optimizer. The learning rate of our PZRes-Net was decreased gradually from its initial value using the cosine annealing decay strategy. During training, we kept the number of iterations fixed at 32,000. We implemented the model with PyTorch, and the batch size was set to 10 for CAVE and 30 for HARVARD. All the experiments were conducted on Ubuntu 18.04 with an Intel Xeon E5-2360 CPU and NVIDIA 2080Ti GPUs. The code will be publicly available. Table I summarizes the implementation details of our network architecture.

IV-A2 Compared Methods

We compared our PZRes-Net with 6 state-of-the-art HSI super-resolution approaches, including 3 traditional methods, i.e., hyperspectral super-resolution (HySure) [42], nonnegative structured sparse representation (NSSR) [10], and the low tensor-train rank-based method (LTTR) [8], and 3 DNN-based methods, i.e., deep HSI sharpening (DHSIS) [9], the multispectral and hyperspectral image fusion network (MHF) [52], and the deep blind hyperspectral image fusion network (DBIN+) [47]. For fair comparisons, the same data pre-processing was applied in all methods; the DNN-based methods under comparison were trained with the code provided by their authors, using the suggested parameters, over the same training data as ours; and the same protocol as in [8, 47] was used to evaluate the experimental results of all methods.

IV-A3 Quantitative Metrics

For a comprehensive evaluation, we adopted four commonly-used quantitative metrics:

  • Peak Signal-to-Noise Ratio (PSNR):

    $$\mathrm{PSNR} = \frac{1}{S}\sum_{s=1}^{S} 10\log_{10}\frac{\max(\mathbf{z}_s)^2}{\mathrm{MSE}(\hat{\mathbf{z}}_s, \mathbf{z}_s)},$$

    where $\hat{\mathbf{z}}_s$ and $\mathbf{z}_s$ are the $s$-th ($1 \le s \le S$) spectral bands of the reconstructed HR-HSI $\hat{\mathbf{Z}}$ and the ground-truth $\mathbf{Z}$, respectively, and $\mathrm{MSE}(\cdot,\cdot)$ returns the mean squared error between the inputs.

  • Average Structural Similarity Index (ASSIM):

    $$\mathrm{ASSIM} = \frac{1}{S}\sum_{s=1}^{S}\mathrm{SSIM}(\hat{\mathbf{z}}_s, \mathbf{z}_s),$$

    where $\mathrm{SSIM}(\cdot,\cdot)$ [48] computes the SSIM value of a typical spectral band.

    Fig. 5: Comparisons of the error maps between the spectral bands of reconstructed HR-HSIs by different methods and the corresponding ground-truth ones. (a)-(c): spectral bands of images from the CAVE dataset at wavelength 600 nm; (d) the spectral band of the image from the HARVARD dataset at wavelength 420 nm; (e)-(f): spectral bands of images from the HARVARD dataset at wavelength 520 nm.
  • Spectral Angle Mapper (SAM):

    $$\mathrm{SAM} = \frac{1}{HW}\sum_{i=1}^{HW}\arccos\frac{\hat{\mathbf{p}}_i^{\top}\mathbf{p}_i}{\|\hat{\mathbf{p}}_i\|_2\,\|\mathbf{p}_i\|_2},$$

    where $\hat{\mathbf{p}}_i$ and $\mathbf{p}_i$ are the spectral signatures of the $i$-th ($1 \le i \le HW$) pixels of $\hat{\mathbf{Z}}$ and $\mathbf{Z}$, respectively, and $\|\cdot\|_2$ is the $\ell_2$-norm of a vector.

  • Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS):

    $$\mathrm{ERGAS} = \frac{100}{\alpha}\sqrt{\frac{1}{S}\sum_{s=1}^{S}\frac{\mathrm{MSE}(\hat{\mathbf{z}}_s, \mathbf{z}_s)}{\mu_s^2}},$$

    where $\mu_s$ is the mean-value of the $s$-th spectral band of $\mathbf{Z}$, and $\alpha$ is the scale factor.
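These metrics can be implemented in a few lines; below is a hedged numpy sketch (images are assumed to be (H, W, S) arrays in [0, 1]; ASSIM is omitted since SSIM needs a windowed implementation, e.g. from scikit-image):

```python
import numpy as np

def psnr(z_hat, z_gt, peak=1.0):
    """Band-wise PSNR averaged over the S spectral bands."""
    mse = ((z_hat - z_gt) ** 2).mean(axis=(0, 1))
    return float((10 * np.log10(peak ** 2 / mse)).mean())

def sam(z_hat, z_gt, eps=1e-12):
    """Mean spectral angle (in radians) over all HW pixel spectra."""
    a = z_hat.reshape(-1, z_hat.shape[-1])
    b = z_gt.reshape(-1, z_gt.shape[-1])
    cos = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean())

def ergas(z_hat, z_gt, scale):
    """ERGAS: band-wise relative RMSE, scaled by 100 divided by the scale factor."""
    mse = ((z_hat - z_gt) ** 2).mean(axis=(0, 1))
    mu = z_gt.mean(axis=(0, 1))
    return float(100.0 / scale * np.sqrt((mse / mu ** 2).mean()))

rng = np.random.default_rng(0)
z_gt = rng.random((16, 16, 8))
z_hat = np.clip(z_gt + 0.01 * rng.standard_normal(z_gt.shape), 0, 1)
```

Note that SAM depends only on the direction of each spectrum, which is why dark pixels with small norms can produce large angles for the same absolute error.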

| Method | Scale | # Params | FLOPs (CAVE) | PSNR | ASSIM | SAM | ERGAS | FLOPs (HARVARD) | PSNR | ASSIM | SAM | ERGAS |
| NSSR [10] | 8 | – | – | 43.82 | 0.987 | 4.07 | 0.84 | – | 45.92 | 0.980 | 3.46 | 1.20 |
| HySure [42] | 8 | – | – | 37.35 | 0.945 | 9.87 | 2.01 | – | 43.88 | 0.975 | 4.20 | 1.56 |
| LTTR [8] | 8 | – | – | 46.35 | 0.990 | 3.76 | 0.80 | – | 46.45 | 0.982 | 3.46 | 1.20 |
| DHSIS [9] | 8 | 0.5M | 270G | 45.59 | 0.990 | 3.91 | 0.73 | 1.17T | 46.42 | 0.982 | 3.54 | 1.17 |
| MHF [52] | 8 | 1.3M | 476G | 46.81 | 0.992 | 3.83 | 0.75 | 2.04T | 46.40 | 0.982 | 3.45 | 1.20 |
| DBIN+ [47] | 8 | 0.7M | 4,095G | 47.51 | 0.992 | 3.18 | 0.58 | 17.58T | 46.67 | 0.983 | 3.42 | 1.15 |
| Ours | 8 | 0.7M | 271G | **50.94** | **0.998** | **2.63** | **0.43** | 1.08T | **47.52** | **0.989** | **2.83** | **1.07** |
| NSSR [10] | 32 | – | – | 39.89 | 0.958 | 8.33 | 0.63 | – | 40.17 | 0.956 | 5.13 | 0.3503 |
| HySure [42] | 32 | – | – | 30.14 | 0.913 | 10.35 | 2.35 | – | 35.17 | 0.937 | 7.15 | 0.5041 |
| LTTR [8] | 32 | – | – | 41.03 | 0.970 | 8.29 | 0.39 | – | 43.25 | 0.975 | 5.45 | 0.3062 |
| DHSIS [9] | 32 | 0.5M | 270G | 40.51 | 0.967 | 8.01 | 0.41 | 1.17T | 43.73 | 0.978 | 5.31 | 0.3107 |
| MHF [52] | 32 | 1.7M | 673G | 40.12 | 0.962 | 7.30 | 0.45 | 2.46T | 44.31 | 0.981 | 4.90 | 0.2955 |
| DBIN+ [47] | 32 | 1.67M | 4,571G | 42.83 | 0.988 | 6.63 | 0.27 | 31.05T | 44.24 | 0.980 | 5.32 | 0.2964 |
| Ours | 32 | 0.7M | 271G | **45.77** | **0.991** | **6.25** | **0.23** | 1.08T | **45.47** | **0.986** | **4.19** | **0.2878** |

TABLE II: Quantitative comparisons of different methods in terms of 4 metrics over the 12 and 20 testing samples from the CAVE and HARVARD datasets, respectively. For PSNR and ASSIM, the larger, the better; for SAM and ERGAS, the smaller, the better. Note that ERGAS can be compared only under the same scale. For FLOPs and # params, the smaller, the more efficient. Within each scale group, the top three methods are traditional, and the remaining ones, as well as ours, are DNN-based. The best results are in bold.

IV-B Evaluation on Synthetic Data

In this scenario, two commonly used benchmark HSI datasets, i.e., CAVE [56] and HARVARD [7], were used to generate synthetic hybrid inputs. Specifically, the CAVE dataset contains 32 indoor HSIs of spatial resolution 512×512 with 31 spectral bands, captured by a generalized assorted pixel camera at a wavelength interval of 10 nm in the range of 400-700 nm. The HARVARD dataset contains 50 indoor and outdoor HSIs recorded under daylight illumination, and 27 images under artificial or mixed illumination. Each HSI consists of 31 spectral bands of spatial resolution 1024×1392, whose wavelengths range from 420 to 720 nm. Following [9, 47], only the 50 daylight-illumination images were used in our experiments. Moreover, the first 30 HSIs are used for training, and the last 20 for testing. Following [47, 8, 52], all the LR-HSIs (with down-sampling scale $\alpha$) used in this scenario were acquired through the following two steps: (1) a Gaussian kernel is used to blur the HSIs; and (2) the blurred HSIs were regularly decimated every $\alpha$ pixels in the spatial domain. We simulated the HR-MSI (RGB image) of the same scene by integrating the spectral bands of an HSI with the widely used response function of the Nikon D700 camera.

IV-B1 Results on the CAVE dataset

Fig. 6: Experimental results on the real dataset, WV-2; the visualized image corresponds to the bottom-left part of WV-2. The scale factor of this dataset equals 8, and the corresponding regions of the input HR-MSI are also provided for reference. The images at the top are the MSIs synthesized from the HR-HSIs reconstructed by different methods via the same learned spectral response function for easier comparison. We zoomed in on the selected regions within the colored boxes using 'nearest' interpolation at 5× magnification for a better comparison.

As listed on the left side of Table II, the proposed method with only 0.7M parameters consistently and significantly surpasses all the compared methods in terms of all four metrics under both up-sampling scales. Although the parameter-sharing strategy in DBIN+ helps to save parameters to some extent, its iterative refinement costs far more computational resources, as demonstrated by its high FLOPs. Specifically, DBIN+ consumes 4095G FLOPs, about 15× that of the proposed method, which may severely restrict its practical utilization. The traditional methods, e.g., NSSR and LTTR, show good reconstruction performance at the up-sampling scale of 8. However, as the up-sampling scale rises to 32, the matrix factorization-based method LTTR shows a sharp deterioration, probably caused by the model's limited representation ability or by the failure of its prior knowledge. While the performance of all methods drops at the larger up-sampling scale, the proposed PZRes-Net still maintains the highest performance in terms of all 4 metrics.

IV-B2 Results on the HARVARD dataset

The experimental results of different methods on the HARVARD dataset are listed on the right side of Table II. Note that the HARVARD dataset is more challenging than the CAVE dataset due to its higher resolution and more complex scenes. The significant superiority of our method over state-of-the-art methods is further validated. Specifically, the two state-of-the-art DNN-based methods, i.e., MHF and DBIN+, are only slightly better than the traditional methods, whereas our PZRes-Net, with a PSNR of 47.52 dB (resp. 45.47 dB) and a SAM of 2.83 (resp. 4.19), greatly pushes forward the limits at the scale of 8 (resp. 32). Moreover, compared with the CAVE dataset, the SAM values of all methods on the HARVARD dataset are smaller, which may be caused by the dark background in the CAVE dataset: for the same absolute error, pixels with low intensities are prone to producing high SAM values.

IV-B3 Visual comparisons

The error maps between the spectral bands of the HR-HSIs reconstructed by different methods and the ground-truth ones are shown in Fig. 5. Accordingly, Fig. 4 provides the PSNR value of each spectral band of the visualized images in Fig. 5 for reference. For the images from the CAVE dataset, i.e., Figs. 5(a)-(c), the traditional methods, e.g., NSSR and LTTR, achieve performance comparable with the two state-of-the-art DNN-based methods, i.e., DBIN+ and MHF, but the shapes of objects can be easily inferred from the error maps of all the compared methods due to the large errors at object boundaries. In contrast, our PZRes-Net consistently produces the smallest errors over all the images at a low computational cost, with only subtle, nearly invisible errors. Similar observations hold for the images from the HARVARD dataset, i.e., Figs. 5(d)-(f). These results convincingly demonstrate the advantage of our method.

IV-C Evaluation on Real Data

In this scenario, we evaluated our PZRes-Net on a real dataset, World View-2 (WV-2), which contains a pair consisting of an LR-HSI with 8 spectral bands and an HR-MSI (i.e., an RGB image) of higher spatial resolution. As ground-truth data are not available, following [52], we generated the training data as follows. We first degraded the image pair to spatial resolutions of 104×164 and 416×656, and trained all models equally on the top halves of the degraded HR-MSI and LR-HSI. During training, the corresponding part of the original LR-HSI was used as the supervision. The bottom half of the original data, excluded from the training process, was used for testing. As different methods can only be compared visually in this scenario, we also simultaneously trained a spectral function that maps the reconstructed HR-HSI to an RGB image for better visualization. Here, only the results of the two state-of-the-art DNN-based methods, i.e., DBIN+ and MHF, which always achieve the top two performances among all compared methods in the previous scenario, are shown for comparison.

As visualized at the top of Fig. 6, which shows the MSIs with 3 spectral bands synthesized from the output HR-HSIs of different methods, our PZRes-Net performs better than MHF and DBIN+, as both the colors and the spatial patterns of objects produced by our method are closer to those of the input HR-MSI.

Moreover, the remaining three sub-images in Fig. 6 visualize three spectral bands of the HR-HSIs reconstructed by different methods, where it can be observed that our PZRes-Net reconstructs both high-frequency details (e.g., the eaves in the magenta box) and smooth parts (e.g., the roofs in the magenta and green boxes) very well. In contrast, both DBIN+ and MHF fail to reconstruct high-frequency spatial details, resulting in blurred boundaries, and unexpected visual artifacts also appear in smooth regions. All these visually pleasing results of our method are credited to its advantage in modeling the cross-modality information.

IV-D Ablation Studies

We carried out extensive ablation studies to comprehensively analyze the three key components of our PZRes-Net model over the CAVE dataset.

Variant                         PSNR   ASSIM  SAM   ERGAS
w/o Refinement                  50.56  0.994  3.07  0.47
w/o residual-dense aggregation  50.45  0.995  2.95  0.47
Full PZRes-Net                  50.94  0.998  2.63  0.43
TABLE III: Ablation studies towards the refinement module and the residual-dense architecture. The best results are bold.
# Stages      PSNR   ASSIM  SAM   ERGAS
One stage     50.47  0.994  2.83  0.47
Two stages    50.82  0.995  2.71  0.45
Three stages  50.94  0.998  2.63  0.43
TABLE IV: Comparisons of our PZRes-Net with different numbers of stages in the residual learning module. The best results are bold.
Conv. type  #Params  FLOPs  PSNR   ASSIM  SAM   ERGAS
2-D conv.   1.4M     292G   48.77  0.990  3.10  0.55
3-D conv.   0.5M     565G   50.59  0.997  2.71  0.44
3S conv.    0.7M     271G   50.94  0.998  2.63  0.43
TABLE V: Comparisons of our PZRes-Net equipped with different kinds of convolutional layers for feature extraction. The best results are bold.
Mean-value invariant up-sampling  ZM-norm  PSNR   ASSIM  SAM   ERGAS
                                           47.60  0.992  3.10  0.77
                                           47.45  0.992  3.14  0.77
                                           46.42  0.989  3.75  0.81
                                           50.94  0.998  2.63  0.43
TABLE VI: Illustration of the necessity of the mean-value invariant up-sampling and ZM-norm in our framework. A restriction-free transposed convolutional layer without the mean-value invariant property is learned for up-sampling in the rows where that component is disabled; similarly, ZM-norm is not applied in the rows where it is disabled. The best results are bold.
Up-sampling          PSNR   ASSIM  SAM   ERGAS
Learned up-sampling  47.45  0.992  3.14  0.77
Bi-cubic             50.79  0.997  2.93  0.72
Bi-linear            50.94  0.998  2.63  0.43
TABLE VII: Investigations of the up-sampling process on reconstruction quality. The best results are bold.

IV-D1 The refinement module

As aforementioned, we use a refinement module to boost reconstruction performance by simultaneously exploring the coherence among all spectral bands of the resulting coarse HR-HSI. We also trained PZRes-Net without (w/o) this module. Comparing the "w/o Refinement" and "Full PZRes-Net" rows of Table III validates the effectiveness of this module.

IV-D2 The residual-dense architecture

To facilitate feature extraction, residual-dense aggregation is embedded in our network. Here, we investigated the contribution of this architecture by training our PZRes-Net without any of the identity mappings and dense connections in each stage. For a fair comparison, we also widened the modified network so that it has approximately the same number of parameters, i.e., we increased the widths of the three stages from 16, 32, and 62 to 24, 48, and 93, respectively. As shown in Table III, compared with the full PZRes-Net, the PSNR and ASSIM values of PZRes-Net w/o residual-dense aggregation decrease by about 0.5 dB and 0.003, respectively, and the SAM and ERGAS values increase by 0.32 and 0.04, respectively, validating the advantage of the residual-dense architecture.

IV-D3 The progressive spectral embedding scheme

Inspired by the progressive strategy in image super-resolution, in our framework the spectral information of the input HSI is progressively fed into the network to reconstruct the HR-HSI. To validate the advantage of this progressive manner, we trained our PZRes-Net with different numbers of stages, where "one stage" means all the spectral bands of the up-sampled LR-HSI are fed into the network at the beginning. Note that, for fair comparisons, we kept the modified models under the different settings at the same number of parameters by varying the widths of the networks. As shown in Table IV, the reconstruction quality gradually improves as the number of stages increases, while the growth rate diminishes, convincingly validating the advantage of the proposed progressive embedding scheme. Based on this observation, we use 3 stages in our framework.

IV-D4 The 3S convolution

In our PZRes-Net, 3S convolution enables efficient HSI processing. To investigate its efficiency and effectiveness, we trained our model by replacing the 3S convolutional layers with 2-D and 3-D convolutional layers, respectively, while keeping approximately the same number of parameters or FLOPs. The experimental results are listed in Table V, from which it can be seen that with 2-D convolution, the performance drops sharply from 50.94 dB to 48.77 dB, because 2-D convolution has a very limited ability to capture spectral information. Although 3-D convolution achieves good performance, it consumes much more computational resources and is quite time-consuming. Our PZRes-Net with 3S convolution balances efficiency and effectiveness well.
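The intuition behind spectral-spatial separable (3S) convolution can be illustrated with a minimal NumPy sketch that factors a full 3-D filter into a 1-D spectral kernel followed by a 2-D spatial kernel; the function names and the 'same' zero-padding are our assumptions, not the paper's implementation:

```python
import numpy as np

def spatial_conv(x, k):
    # x: (H, W, B); k: (kh, kw) spatial kernel shared across bands, 'same' padding
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    H, W, B = x.shape
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i:i + H, j:j + W, :]
    return out

def spectral_conv(x, v):
    # v: (kb,) 1-D kernel applied along the spectral axis, 'same' padding
    kb = v.shape[0]
    pb = kb // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pb, pb)))
    H, W, B = x.shape
    out = np.zeros_like(x)
    for t in range(kb):
        out += v[t] * xp[:, :, t:t + B]
    return out

def s3_conv(x, k_spatial, k_spectral):
    # spectral-spatial separable filtering: 1-D spectral then 2-D spatial
    return spatial_conv(spectral_conv(x, k_spectral), k_spatial)
```

The efficiency gain is that one filter pair needs kh·kw + kb weights rather than the kh·kw·kb of a full 3-D kernel, which mirrors the #Params/FLOPs advantage reported in Table V.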

(a) Testing PSNR
(b) Training Loss
Fig. 7: The advantages introduced by ZM-norm, illustrated via the testing PSNR and training loss across epochs of our PZRes-Net with and without ZM-norm.

IV-D5 ZM-norm and mean-value invariant up-sampling

To predict a zero-mean residual image, ZM-norm is applied to the feature maps. In addition, the input LR-HSI is up-sampled with a mean-value invariant up-sampling process to avoid distortion. Here, we experimentally verified the necessity of this combination by either removing ZM-norm during training or using a learned, restriction-free transposed convolutional layer without the mean-value invariant property for up-sampling. From Table VI, we observe that our method achieves the best performance only when both ZM-norm and mean-value invariant up-sampling are applied. Comparing the rows with and without ZM-norm shows that ZM-norm affects the performance of our framework severely: our PZRes-Net is built upon the classic wavelet decomposition-based fusion scheme, which focuses on extracting a zero-mean high-frequency residual image, and without ZM-norm it is hard to maintain this unique property of the residual image, leading to poor performance. Meanwhile, comparing the rows with and without mean-value invariant up-sampling, we conclude that the mean-value invariant characteristic is necessary: without it, it is hard to keep the mean values of the spectral bands close to the ground-truth ones, and thus distortion is introduced. Last but not least, Fig. 7 also indicates that ZM-norm accelerates the training process.
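ZM-norm itself is simple to state: each feature map is re-centered to zero mean over its spatial dimensions, so the network's intermediate representations stay consistent with the zero-centric residual target. A minimal sketch (the channel-first tensor layout is an assumption):

```python
import numpy as np

def zm_norm(feat):
    # feat: (C, H, W) feature maps; subtract each channel's spatial mean
    # so every feature map is zero-centric
    return feat - feat.mean(axis=(1, 2), keepdims=True)
```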

IV-D6 The choice of the up-sampling process

We used the bi-linear operator to realize the mean-value invariant up-sampling for its simplicity. We also investigated another mean-value invariant interpolation operator, i.e., the bi-cubic operator. The experimental results in Table VII indicate that PZRes-Net with bi-cubic and bi-linear interpolation achieves comparable performance, demonstrating the robustness of our framework. Compared with bi-cubic interpolation, bi-linear interpolation is more computationally efficient.
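The mean-value invariant property can also be enforced explicitly by re-centering each up-sampled band to its LR mean. The wrapper below is an assumption-level illustration of the property (the paper realizes it via bi-linear interpolation; the nearest-neighbour operator here is only a stand-in):

```python
import numpy as np

def mean_invariant_upsample(lr_band, scale, upsample):
    # wrap any up-sampling operator so the band's mean value is exactly preserved
    hr = upsample(lr_band, scale)
    return hr - hr.mean() + lr_band.mean()

# stand-in operator: nearest-neighbour up-sampling via Kronecker product
nearest = lambda x, s: np.kron(x, np.ones((s, s)))
```

Preserving the band mean matters because the residual superimposed on the up-sampled LR-HSI is zero-centric, so any mean shift introduced during up-sampling would propagate directly into the reconstructed HR-HSI.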

V Conclusions

In this paper, we have presented the progressive zero-centric residual network (PZRes-Net), which efficiently and effectively restores HR-HSIs from hybrid inputs, i.e., an LR-HSI and an HR-MSI. Our PZRes-Net is mainly inspired by the classic wavelet decomposition-based image fusion method and mimics it in an adaptive, learned manner. That is, our PZRes-Net learns a zero-centric residual image from both inputs, which contains the high-frequency spatial details of the scene across all spectral bands. To achieve this objective, we have proposed ZM-norm, mean-value invariant up-sampling, spectral-spatial separable convolution with dense aggregation, and progressive spectral information embedding. Extensive experimental results as well as comprehensive ablation studies on both synthetic and real benchmark datasets demonstrate that our PZRes-Net raises the state-of-the-art performance to a new level both quantitatively and qualitatively. Moreover, our PZRes-Net is a lightweight network that is much more computationally efficient than state-of-the-art deep learning-based methods, which validates its practicality.

Encouraged by the impressive reconstruction quality, we are interested in investigating the potential of our zero-centric residual learning scheme on other high-order feature extraction tasks, e.g., image denoising.


  • [1] E. Agustsson and R. Timofte (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Cited by: Fig. 3, §III-B.
  • [2] N. Ahn, B. Kang, and K. Sohn (2018) Image super-resolution via progressive cascading residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 791–799. Cited by: §III-C.
  • [3] N. Akhtar, F. Shafait, and A. Mian (2014) Sparse spatio-spectral representation for hyperspectral image super-resolution. In European Conference on Computer Vision, pp. 63–78. Cited by: §II-A, §II.
  • [4] P. Arun, K. M. Buddhiraju, and A. Porwal (2018) CNN based sub-pixel mapping for hyperspectral images. Neurocomputing 311, pp. 51–64. Cited by: §II-B.
  • [5] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §III-C2.
  • [6] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot (2013) Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and remote sensing magazine 1 (2), pp. 6–36. Cited by: §I.
  • [7] A. Chakrabarti and T. Zickler (2011) Statistics of real-world hyperspectral images. In CVPR 2011, pp. 193–200. Cited by: §IV-B.
  • [8] R. Dian, S. Li, and L. Fang (2019) Learning a low tensor-train rank representation for hyperspectral image super-resolution. IEEE transactions on neural networks and learning systems 30 (9), pp. 2672–2683. Cited by: §II-A, §II-B, §II, §IV-A2, §IV-B, TABLE II.
  • [9] R. Dian, S. Li, A. Guo, and L. Fang (2018) Deep hyperspectral image sharpening. IEEE transactions on neural networks and learning systems (99), pp. 1–11. Cited by: §I, §I, §II-B, §IV-A2, §IV-B, TABLE II.
  • [10] W. Dong, F. Fu, G. Shi, X. Cao, J. Wu, G. Li, and X. Li (2016) Hyperspectral image super-resolution via non-negative structured sparse representation. IEEE Transactions on Image Processing 25 (5), pp. 2337–2352. Cited by: §I, §II-A, §IV-A2, TABLE II.
  • [11] L. Fang, H. Zhuo, and S. Li (2018) Super-resolution of hyperspectral image via superpixel-based sparse representation. Neurocomputing 273, pp. 171–177. Cited by: §II-A.
  • [12] Y. Fu, Y. Zheng, H. Huang, I. Sato, and Y. Sato (2018) Hyperspectral image super-resolution with a mosaic rgb image. IEEE Transactions on Image Processing 27 (11), pp. 5539–5552. Cited by: §I.
  • [13] C. Gonzalo and M. Lillo-Saavedra (2004) Customized fusion of satellite images based on the á trous algorithm. In Image and Signal Processing for Remote Sensing X, Vol. 5573, pp. 444–451. Cited by: §I, §II-A.
  • [14] C. Grohnfeldt, X. X. Zhu, and R. Bamler (2013) Jointly sparse fusion of hyperspectral and multispectral imagery. In 2013 IEEE International Geoscience and Remote Sensing Symposium-IGARSS, pp. 4090–4093. Cited by: §I.
  • [15] X. Han, B. Shi, and Y. Zheng (2018) Self-similarity constrained sparse representation for hyperspectral image super-resolution. IEEE Transactions on Image Processing 27 (11), pp. 5625–5637. Cited by: §II-A.
  • [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §III-C1, §III-C.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §III-C2.
  • [18] C. I. Kanatsoulis, X. Fu, N. D. Sidiropoulos, and W. Ma (2018) Hyperspectral super-resolution: a coupled tensor factorization approach. IEEE Transactions on Signal Processing 66 (24), pp. 6503–6517. Cited by: §II-A.
  • [19] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §III-C.
  • [20] J. Kautz, T. Mertens, and F. V. Reeth (2007) Exposure fusion. In Computer Graphics and Applications, Pacific Conference on, pp. 382–390. Cited by: §I, §II-A, §II.
  • [21] M. J. Khan, H. S. Khan, A. Yousaf, K. Khurshid, and A. Abbas (2018) Modern trends in hyperspectral image analysis: a review. IEEE Access 6, pp. 14118–14129. Cited by: §I.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A1.
  • [23] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §III-C.
  • [24] R. Lan, L. Sun, Z. Liu, H. Lu, Z. Su, C. Pang, and X. Luo (2020) Cascading and enhanced residual networks for accurate single-image super-resolution. IEEE Transactions on Cybernetics. Cited by: §III-A.
  • [25] Y. A. LeCun, L. Bottou, G. B. Orr, and K. Müller (2012) Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Cited by: §III-C2.
  • [26] J. Li, J. M. Bioucas-Dias, and A. Plaza (2010) Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE Transactions on Geoscience and Remote Sensing 48 (11), pp. 4085–4098. Cited by: §I.
  • [27] J. Li and A. Plaza (2011) Hyperspectral image segmentation using a new bayesian approach with active learning. IEEE Transactions on Geoscience and Remote Sensing 49 (10), pp. 3947–3960. Cited by: §I.
  • [28] S. Li, X. Kang, L. Fang, J. Hu, and H. Yin (2017) Pixel-level image fusion: a survey of the state of the art. information Fusion 33, pp. 100–112. Cited by: §I, §III-A.
  • [29] S. Li, J. T. Kwok, and Y. Wang (2002) Using the discrete wavelet frame transform to merge landsat tm and spot panchromatic images. Information Fusion 3 (1), pp. 17–23. Cited by: §II.
  • [30] Y. Li, H. Zhang, and Q. Shen (2017) Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sensing 9 (1), pp. 67. Cited by: §III-C1.
  • [31] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §III-C2.
  • [32] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: Fig. 3, §III-B.
  • [33] S. Mei, X. Yuan, J. Ji, Y. Zhang, S. Wan, and Q. Du (2017) Hyperspectral image spatial super-resolution via 3d full convolutional neural network. Remote Sensing 9 (11), pp. 1139. Cited by: §II-B, §III-C1.
  • [34] S. Niklaus, L. Mai, and F. Liu (2017) Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 261–270. Cited by: §III-C1.
  • [35] J. Núnez, X. Otazu, O. Fors, A. Prades, V. Pala, and R. Arbiol (1999) Image fusion with additive multiresolution wavelet decomposition. applications to spot+ landsat images. JOSA A 16 (3), pp. 467–474. Cited by: §II-A.
  • [36] J. Nunez, X. Otazu, O. Fors, A. Prades, V. Pala, and R. Arbiol (1999) Multiresolution-based image fusion with additive wavelet decomposition. IEEE Transactions on Geoscience and Remote sensing 37 (3), pp. 1204–1211. Cited by: §I, §I, §II-A.
  • [37] J. Nunez, X. Otazu, A. Prades, et al. (1997) Simultaneous image fusion and reconstruction using wavelets applications to spot+ landsat images. Vistas in Astronomy 41 (3), pp. 351–357. Cited by: §II-A.
  • [38] G. Pajares and J. M. De La Cruz (2004) A wavelet-based image fusion tutorial. Pattern recognition 37 (9), pp. 1855–1872. Cited by: §II.
  • [39] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson (2017) Multispectral and hyperspectral image fusion using a 3-d-convolutional neural network. IEEE Geoscience and Remote Sensing Letters 14 (5), pp. 639–643. Cited by: §II-B.
  • [40] X. Pantazi, D. Moshou, and C. Bravo (2016) Active learning system for weed species recognition based on hyperspectral sensing. Biosystems Engineering 146, pp. 193–202. Cited by: §I.
  • [41] Y. Qu, H. Qi, and C. Kwan (2018) Unsupervised sparse dirichlet-net for hyperspectral image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2511–2520. Cited by: §II-B.
  • [42] M. Simoes, J. Bioucas-Dias, L. B. Almeida, and J. Chanussot (2014) A convex formulation for hyperspectral image superresolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing 53 (6), pp. 3373–3388. Cited by: §IV-A2, TABLE II.
  • [43] G. Tochon, J. Chanussot, M. Dalla Mura, and A. L. Bertozzi (2017) Object tracking by hierarchical decomposition of hyperspectral video sequences: application to chemical gas plume tracking. IEEE Transactions on Geoscience and Remote Sensing 55 (8), pp. 4567–4585. Cited by: §I.
  • [44] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §III-C2.
  • [45] L. Wang, Z. Xiong, D. Gao, G. Shi, and F. Wu (2015) Dual-camera design for coded aperture snapshot spectral imaging. Applied optics 54 (4), pp. 848–858. Cited by: §II-B.
  • [46] L. Wang, T. Zhang, Y. Fu, and H. Huang (2018) Hyperreconnet: joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging. IEEE Transactions on Image Processing 28 (5), pp. 2257–2270. Cited by: §II-B.
  • [47] W. Wang, W. Zeng, Y. Huang, X. Ding, and J. Paisley (2019) Deep blind hyperspectral image fusion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4150–4159. Cited by: §I, §I, §II-B, §II, §IV-A2, §IV-B, TABLE II.
  • [48] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: 2nd item.
  • [49] Q. Wei, J. Bioucas-Dias, N. Dobigeon, and J. Tourneret (2015) Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Transactions on Geoscience and Remote Sensing 53 (7), pp. 3658–3668. Cited by: §II-A.
  • [50] H. Wing Fung Yeung, J. Hou, J. Chen, Y. Ying Chung, and X. Chen (2018) Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 137–152. Cited by: §III-C1.
  • [51] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §III-C2.
  • [52] Q. Xie, M. Zhou, Q. Zhao, D. Meng, W. Zuo, and Z. Xu (2019) Multispectral and hyperspectral image fusion by ms/hs fusion net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1585–1594. Cited by: §I, §I, §II-A, §II-A, §II-B, §II-B, §II, §IV-A2, §IV-B, §IV-C, TABLE II.
  • [53] Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu (2017) Hscnn: cnn-based hyperspectral image recovery from spectrally undersampled projections. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 518–525. Cited by: §II-B.
  • [54] C. Yan, L. Li, C. Zhang, B. Liu, Y. Zhang, and Q. Dai (2019) Cross-modality bridging and knowledge transferring for image understanding. IEEE Transactions on Multimedia 21 (10), pp. 2675–2685. Cited by: §I.
  • [55] S. Yang, M. Wang, and L. Jiao (2012) Fusion of multispectral and panchromatic images based on support value transform and adaptive principal component analysis. Information Fusion 13 (3), pp. 177–184. Cited by: §II.
  • [56] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar (2010) Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE transactions on image processing 19 (9), pp. 2241–2253. Cited by: §IV-B.
  • [57] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung (2018) Light field spatial super-resolution using deep efficient spatial-angular separable convolution. IEEE Transactions on Image Processing 28 (5), pp. 2319–2330. Cited by: §III-C1.
  • [58] D. A. Yocky (1996) Multiresolution wavelet decomposition image merger of landsat thematic mapper and spot panchromatic data. Photogrammetric Engineering & Remote Sensing 62 (9), pp. 1067–1074. Cited by: §II-A.
  • [59] K. Zhang, M. Wang, and S. Yang (2016) Multispectral and hyperspectral image fusion based on group spectral embedding and low-rank factorization. IEEE Transactions on Geoscience and Remote Sensing 55 (3), pp. 1363–1371. Cited by: §I.
  • [60] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2020) Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §III-A, §III-C.
  • [61] Z. Zhang and R. S. Blum (1999) A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application. Proceedings of the IEEE 87 (8), pp. 1315–1326. Cited by: §I, §II-A, §II.
  • [62] S. Zheng, W. Shi, J. Liu, G. Zhu, and J. Tian (2007) Multisource image fusion method using support value transform. IEEE Transactions on Image Processing 16 (7), pp. 1831–1839. Cited by: §II.
  • [63] Y. Zheng (2009) Multi-scale fusion algorithm comparisons: pyramid, dwt and iterative dwt. In 2009 12th International Conference on Information Fusion, pp. 1060–1067. Cited by: §III-A.
  • [64] Z. Zhou, B. Wang, S. Li, and M. Dong (2016) Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with gaussian and bilateral filters. Information Fusion 30, pp. 15–26. Cited by: §III-A.