I. Introduction
Hyperspectral imaging aims to collect information from across the electromagnetic spectrum for each pixel of the image of a scene [6, 21]. The rich spectral information of the recorded hyperspectral image (HSI) enables it to deliver more faithful knowledge of a targeted scene than conventional imaging modalities [54]. As a result, the HSI has grown increasingly popular over the past ten years in various fields, such as the military, industrial, and scientific arenas. The HSI has also boosted the performance of computer vision applications, such as tracking [43], segmentation [26, 27], and object recognition [40]. However, due to the hardware limitations of existing imaging systems, there is an inevitable trade-off between spectral and spatial resolution [9]. A specific optical system can only record an image with either high spatial resolution and very limited spectral bands, e.g., the high-resolution multispectral image (HR-MSI), or dense spectral bands with reduced spatial resolution, e.g., the low-resolution HSI (LR-HSI). Hence, as illustrated in Fig. 1, HSI super-resolution (a.k.a. MSI/HSI fusion), which merges an HR-MSI and an LR-HSI, has become a promising way to obtain HR-HSIs [52, 47, 9].
To tackle this challenge, various methods have been proposed in the last few decades. From the perspective of signal processing, multi-scale decomposition-based methods that only consume very limited computation resources have demonstrated their abilities in information fusion [28], e.g., pyramid-based [20] and wavelet decomposition-based methods [36, 61]. Based on the prior knowledge of HSIs, e.g., the sparse prior and the low-rank prior, plenty of matrix factorization-based methods have emerged [59, 10, 14]. Recently, owing to their remarkable representation learning ability, deep neural network (DNN)-based methods have been introduced [52, 9, 47, 12], whose performance exceeds that of traditional/non-DNN methods to a large extent (see Section II for more details). However, the reconstruction quality of the current state-of-the-art methods is still not satisfactory, due to the insufficient utilization/modeling of the cross-modality information.
Inspired by the classic wavelet decomposition-based methods [36, 13], we propose a novel DNN-based framework, namely the Progressive Zero-centric Residual Network (PZRes-Net), to achieve HSI super-resolution both efficiently and effectively. As shown in Fig. 2, the input LR-HSI is first upsampled in a mean-value invariant manner. Following that, a zero-centric residual image is progressively learned along the spectral dimension from both the upsampled LR-HSI and the HR-MSI with a zero-centric residual learning module, in which spectral-spatial separable convolutions with dense aggregation extract spectral-spatial information efficiently and effectively. Moreover, zero-mean normalization is applied to promote the zero-centric characteristic of the learned residual image. The resulting residual image is further superimposed on the upsampled LR-HSI, leading to a coarse HR-HSI, which is finally refined by exploring the coherence among all spectral bands simultaneously. We conduct extensive experiments and comparisons to evaluate and analyze the proposed PZRes-Net comprehensively, and conclude that it remarkably outperforms state-of-the-art methods both quantitatively and qualitatively across multiple real and synthetic benchmark datasets. In particular, our PZRes-Net improves the PSNR by more than 3 dB, while using 2.3× fewer parameters and consuming 15× fewer FLOPs.
The rest of this paper is organized as follows. Section II briefly reviews existing methods. Section III presents the proposed method, namely PZRes-Net, followed by extensive experimental results as well as comprehensive analyses on both synthetic and real data in Section IV. Finally, Section V concludes this paper.
II. Related Work
We classify existing methods into two categories, i.e., (1) traditional methods, including multi-scale decomposition-based [20], [61, 38, 29], [55, 62] and optimization-based [3] ones; and (2) deep learning-based methods [8, 52, 47]. In the following, we review them in detail.

II-A. Traditional Methods
Multi-scale decomposition methods focus on representing image spatial structures with multiple layers. Various wavelet decomposition (WD)-based methods have been proposed in the past few decades [36, 20, 61, 37, 58, 35]. For example, Nunez et al. [36] proposed two types of WD-based methods for MSI pansharpening: the additive method and the substitution method. Specifically, the former consists of the following steps: (1) register an LR-MSI with an HR panchromatic image and upsample it to the same resolution so that the two can be superimposed; (2) decompose the HR panchromatic image into several wavelet planes containing high-frequency spatial information, which follows a zero-mean distribution; and (3) add the wavelet planes to the upsampled LR-MSI. In contrast, the latter decomposes both inputs and then replaces the wavelet planes of the upsampled LR-MSI with those of the HR panchromatic image. They also experimentally demonstrated that the additive method preserves spatial information better than the substitution method. Gonzalo et al. [13] further proposed an adaptive WD-based method, which fuses the wavelet planes with different weights.
Based on the prior knowledge of HSIs [52], a considerable number of traditional machine learning-based methods have been proposed. For example, matrix factorization-based methods assume that each spectrum can be linearly represented by several spectral atoms [8]. Under the assumption that an HSI lies in a low-dimensional subspace, Wei et al. [49] used spatial dictionaries learned from the HR-MSI to promote spatial similarities. Fang et al. [11] proposed a superpixel-based sparse representation. Kanatsoulis et al. [18] established a coupled tensor factorization framework. Han et al. [15] utilized a self-similarity prior as the constraint for the sparse representation of the input HSI and MSI. Akhtar et al. [3] first learned a non-negative dictionary, and then introduced a simultaneous greedy pursuit algorithm to estimate the coefficients of each local patch. Dong et al. [10] proposed a matrix factorization-based method which imposes joint sparsity and non-negativity constraints on the learned representation coefficients. However, these traditional methods are usually built upon certain priors, e.g., the sparse prior, the low-rank prior, and the global similarity prior, which may not be consistent with complex real-world scenarios [52].
II-B. DNN-based Methods
Powered by their strong representation learning ability, DNNs have become an emerging tool for HSI super-resolution [8]. Palsson et al. [39] proposed a 3D convolutional neural network (CNN)-based MSI/HSI fusion approach and reduced the computational cost by using principal component analysis (PCA). Dian et al. [9] used deep priors learned by residual learning-based DNNs and reconstructed the HR-HSI by solving optimization problems. Mei et al. [33] proposed a 3D CNN to exploit both the spatial context and the spectral correlation. Arun et al. [4] explored DNNs to jointly optimize the unmixing and mapping operations in a supervised manner. Xie et al. [52] proved that an HR-HSI can be represented by the linearly transformed HR-MSI plus a to-be-estimated residual image, and then unfolded an iterative algorithm that solves for these two components within a deep learning framework for HSI/MSI fusion. Wang et al. [46, 45] solved the HSI reconstruction problem using compressive sensing-based methods. Xiong et al. [53] proposed CNN-based methods to recover HSIs from RGB images. Qu et al. [41] solved the HSI super-resolution problem using an unsupervised encoder-decoder architecture. Most of the above-mentioned methods are not blind, meaning that they are trained with the knowledge of HR-HSIs or degradation models. Although Wang et al. [47] proposed a blind fusion model similar to [52], its performance leaves much to be desired. Moreover, due to the iterative HSI refinement, the computational burdens of these methods are very high, which may restrict practical deployment.
III. The Proposed Method
Let $\mathbf{X}\in\mathbb{R}^{HW\times S}$ be the HR-HSI with $S$ spectral bands to be reconstructed, each column of which is the vectorial representation of a spectral band of spatial resolution $H\times W$. The degradation models for the observed HR-MSI with $s$ ($s\ll S$) spectral bands, denoted by $\mathbf{Y}\in\mathbb{R}^{HW\times s}$, and the LR-HSI of spatial resolution $h\times w$, denoted by $\mathbf{Z}\in\mathbb{R}^{hw\times S}$ ($h\ll H$, $w\ll W$), from $\mathbf{X}$ can be formulated as

$$\mathbf{Y} = \mathbf{X}\mathbf{R} + \mathbf{N}_y, \qquad (1)$$

$$\mathbf{Z} = \mathbf{D}\mathbf{B}\mathbf{X} + \mathbf{N}_z, \qquad (2)$$

where $\mathbf{R}\in\mathbb{R}^{S\times s}$ is the camera spectral response function that integrates over the spectral dimension of the HSI to produce the MSI; $\mathbf{B}\in\mathbb{R}^{HW\times HW}$ represents the blurring matrix applied on the HR-HSI before performing spatial decimation via the matrix $\mathbf{D}\in\mathbb{R}^{hw\times HW}$; and $\mathbf{N}_y$ and $\mathbf{N}_z$ are the noises in $\mathbf{Y}$ and $\mathbf{Z}$, respectively. From Eqs. (1) and (2), it can be seen that $\mathbf{Y}$ contains high-resolution spatial context, while $\mathbf{Z}$ keeps dense spectral details. Thus, the challenge of HSI super-resolution, i.e., reconstructing $\mathbf{X}$ from $\mathbf{Z}$ with the assistance of $\mathbf{Y}$, boils down to "how to leverage the spatial advantage of $\mathbf{Y}$ and propagate it across the densely sampled spectral bands of $\mathbf{Z}$ effectively."
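To make the degradation models concrete, here is a minimal NumPy sketch of Eqs. (1) and (2) with toy stand-ins: a random column-normalized matrix plays the role of the spectral response, and blurring followed by decimation is approximated by block averaging. All shapes and names are illustrative, and the noise terms are omitted.

```python
import numpy as np

# Hypothetical shapes: HR-HSI X with S bands, each column a vectorised H x W band.
H, W, S, s, sigma = 8, 8, 31, 3, 4          # sigma: spatial downsampling scale

rng = np.random.default_rng(0)
X = rng.random((H * W, S))                   # HR-HSI, columns = spectral bands

R = rng.random((S, s))
R /= R.sum(axis=0, keepdims=True)            # each MSI band integrates the spectrum

Y = X @ R                                    # Eq. (1) without noise: HR-MSI

# Stand-in for B then D: average each sigma x sigma spatial block of a band
def decimate(band_2d, sigma):
    h, w = band_2d.shape
    return band_2d.reshape(h // sigma, sigma, w // sigma, sigma).mean(axis=(1, 3))

# Eq. (2) without noise: LR-HSI, band by band
Z = np.stack([decimate(X[:, k].reshape(H, W), sigma).ravel() for k in range(S)],
             axis=1)

print(Y.shape, Z.shape)                      # (HW, s) and ((H/sigma)(W/sigma), S)
```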
III-A. Motivation and Overview
Multi-scale decomposition-based methods have demonstrated their effectiveness in image fusion [64], [63], [28]. Particularly, the classic wavelet decomposition-based scheme for enhancing an LR image with an HR image from another modality contains the following procedures: the LR image is first upsampled to the same resolution as the HR image in order to be superimposed; wavelet planes containing zero-centric/zero-mean high-frequency details are then obtained by decomposing the HR image with a shift-invariant wavelet transform, and are further superimposed onto the upsampled LR image. Such a scheme is able to utilize the spatial detail information from both images. Moreover, as the wavelet planes are designed to have zero mean values, the total flux of the enhanced LR image is preserved. Inspired by the principles of traditional multi-scale decomposition-based methods, we study HSI super-resolution by exploring the powerful representation capabilities of DNNs to learn such zero-centric high-frequency details adaptively. In addition to inheriting the advantage of the traditional methods, it is expected that the data-adaptive characteristic of such a learning manner can boost the performance, compared with the predefined and data-independent decomposition process in traditional frameworks.
As shown in Fig. 2, the proposed framework, namely the progressive zero-centric residual network (PZRes-Net), is mainly composed of three modules: a mean-value invariant upsampling module, a progressive zero-centric residual learning module, and a refinement module. Specifically, the upsampling module first lifts the input LR-HSI to the same resolution as the input HR-MSI in a mean-value invariant manner. Then, the residual learning module estimates a residual image from both the input HR-MSI and the upsampled LR-HSI progressively along the spectral dimension, in which zero-mean normalization is applied to the input of each convolutional layer to enforce the mean value of each band of the predicted residual image to be zero. The resulting zero-centric residual image is further superimposed onto the upsampled LR-HSI, leading to a coarse HR-HSI, which is finally fed into the refinement module, where the coherence across all spectral bands of the coarse HR-HSI is simultaneously explored in a residual learning manner to further augment reconstruction quality. In what follows, we introduce each module in detail.
Remark: It is worth pointing out that the residual learning manner of our framework is essentially different from the residual learning widely exploited in various networks and image/video applications [60], [24]. Traditional residual learning was introduced to facilitate neural network optimization or to enhance feature extraction, whereas the residual learning in our PZRes-Net mimics the classic multi-scale decomposition-based fusion methods. Moreover, our PZRes-Net learns a zero-centric residual, while traditional residual learning imposes no such constraint.
III-B. Mean-value Invariant Upsampling
This module aims to lift the input LR-HSI to the same spatial resolution as the HR-MSI for the subsequent residual superimposition. The Law of Large Numbers dictates that the average of a large number of observations of a random variable should be close to its expected value. Generally, image pixels from LR and HR captures are repeated samples of the same real scene, and are thus expected to have approximately identical mean values aligned with that expected value. Such an observation is also experimentally validated in Fig. 3, where the histograms show the statistics of the difference between the mean values of 800 and 40K pairs of LR and HR RGB images from DIV2K [1] and COCO [32], respectively, two commonly-used benchmark image datasets.

Recall that in our framework, a zero-mean residual image predicted by our zero-centric residual learning module is superimposed on the upsampled LR-HSI to form the HR-HSI. Therefore, to avoid distortion, the upsampling process should be mean-value invariant, i.e., each band of the upsampled LR-HSI should have an approximately identical mean value to the corresponding band of the input LR-HSI. To achieve this objective, the transposed convolution layer widely used in image super-resolution could be employed, but with an additional restriction on its learnable kernel: the sum of the kernel elements that simultaneously convolve with the pixels of the input LR image should be equal to 1 as the kernel slides over the interpolated LR image. For simplicity, in this paper we adopt bilinear interpolation to realize the mean-value invariant upsampling, which has been experimentally demonstrated to work very well.
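As a sanity check of the mean-value invariance requirement, the snippet below upsamples a band by nearest-neighbour replication, the simplest scheme whose kernel weights sum to 1 at every position, and verifies that the band mean is preserved; bilinear interpolation behaves approximately the same way. Shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
band = rng.random((16, 16))                  # one band of the LR-HSI

# 4x upsampling by replication: each pixel is copied into a 4 x 4 block,
# so the mean over the upsampled band equals the mean over the input band.
up = np.kron(band, np.ones((4, 4)))

assert up.shape == (64, 64)
assert abs(up.mean() - band.mean()) < 1e-12  # mean value is preserved
```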
III-C. Progressive Zero-centric Residual Learning
In this module, a zero-centric residual image containing high-frequency spatial details of the captured scene is regressed by deeply extracting spatial context information from both the input HR-MSI and the upsampled LR-HSI. Inspired by the success of the progressive reconstruction strategy in image super-resolution [2, 23, 19], we embed the spectral bands of the upsampled LR-HSI in a progressive fashion, rather than feeding all of them into the network at the beginning. That is, as shown in Fig. 2, the spectral bands of the upsampled LR-HSI embedded into the network at different stages are regularly decimated with strides that decrease exponentially as the number of stages increases. Taking an HSI with 32 spectral bands as an example, the numbers of decimated spectral bands from the first to the third stage are 8, 16, and 32, respectively.
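The band-embedding schedule described above can be sketched as follows; the helper name is hypothetical.

```python
import numpy as np

def bands_per_stage(num_bands, num_stages):
    """Indices of the regularly decimated bands fed in at each stage.

    The decimation stride halves from stage to stage, so the number of
    embedded bands doubles until all bands are used at the last stage.
    """
    schedule = []
    for stage in range(1, num_stages + 1):
        stride = 2 ** (num_stages - stage)
        schedule.append(np.arange(0, num_bands, stride))
    return schedule

sched = bands_per_stage(32, 3)
print([len(s) for s in sched])               # [8, 16, 32], as in the example above
```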
Specifically, at each stage, a 1D convolution is first applied to the HR-MSI over the spectral domain to lift the number of feature channels to the same level as that of the spectral bands decimated from the upsampled LR-HSI, during which the high-order details are scattered to all channels. Then, the resulting feature maps from the HR-MSI and the decimated spectral bands from the upsampled LR-HSI are concatenated along the spectral dimension and fed into a sequence of spectral-spatial separable convolutional layers aggregated with dense connections [16, 60] for efficient and comprehensive spectral-spatial feature extraction. During the feature extraction, the spatial details are mainly provided by the HR-MSI and propagated to all spectral bands with reference to the spectral information of the LR-HSI. Moreover, to obtain a zero-centric residual image, zero-mean normalization is applied to the input feature maps of each convolutional layer over the spatial dimension independently. In addition, an identity mapping directly connects the output of the first spectral convolutional layer to the output of the stage in an additive manner to further enhance the flow of high-frequency spatial details. Note that for stage $t$ ($t \ge 2$), the output of stage $t-1$ is also concatenated as its input.
Table I outlines the architecture details of the first stage. In the following, we elaborate on the spectral-spatial separable convolution and the zero-mean normalization.
III-C1. Spectral-spatial separable convolution with dense connections
As mentioned earlier, the input to the residual learning module is a 3D tensor, i.e., the concatenation of the 3D HR-MSI features and the upsampled LR-HSI. In order to comprehensively explore the information from both the spectral and spatial domains, 3D convolution is an intuitive choice for constructing the residual learning module, and it has demonstrated its effectiveness [33, 30]. However, compared with conventional 2D convolution, 3D convolution results in a significant increase in parameter size, which may cause overfitting and consume huge computation resources. Analogous to the approximation of a high-dimensional filter with multiple low-dimensional filters in the field of signal processing, we use spectral-spatial separable (3S) convolutions to process the 3D tensor for efficient spectral-spatial feature extraction. Note that separable convolutions have also demonstrated their effectiveness and efficiency in other deep learning-based image processing tasks [57, 50, 34].
Specifically, the 3S convolution conducts a 1D spectral convolution (i.e., 1D convolution over the spectral domain) and a 2D spatial convolution (i.e., independent 2D convolution over the spatial domain of each feature map) sequentially, with an activation layer inserted between the two convolutions. The spectral convolution is equipped with a kernel of size $C\times 1\times 1$, with $C$ being the number of feature channels, while the spatial convolution uses a group of 2D kernels of size $3\times 3$. Moreover, to enhance the feature extraction ability of the network for residual learning, we densely connect the 3S convolutional layers within a stage [16]. That is, the feature maps obtained from all the preceding layers are concatenated along the spectral dimension and passed to the current layer. Such dense connections could also improve the information flow and reduce overfitting [16].
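To see why the factorization saves parameters, the following back-of-the-envelope comparison contrasts a full 3D convolution with its 3S counterpart. The counting is a hedged sketch under simplifying assumptions: $C$ input and output channels, a $k\times k\times k$ full 3D kernel per channel pair, a $1\times 1$ spectral convolution, one $k\times k$ depthwise kernel per output channel, and no bias terms.

```python
def params_3d(c_in, c_out, k):
    # Full 3D convolution: one k x k x k kernel per (input, output) channel pair
    return c_in * c_out * k * k * k

def params_3s(c_in, c_out, k):
    spectral = c_in * c_out            # 1x1 convolution over the spectral axis
    spatial = c_out * k * k            # one k x k kernel per output channel
    return spectral + spatial

c, k = 16, 3
print(params_3d(c, c, k), params_3s(c, c, k))   # 6912 vs 400
```

Under these assumptions the separable form needs roughly an order of magnitude fewer parameters, which is consistent with the parameter counts reported for 3D versus 3S convolutions in the ablation study of Section IV.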
Table I. Architecture details of the first stage of the residual learning module (top) and the refinement module (bottom).

| Layer (first stage) | Kernel shape | # Input channels | # Output channels | Output shape | ReLU | ZM-norm |
|---|---|---|---|---|---|---|
| Linear transform | 8×3×1×1 | 3 | 8 | 128×128×8 | – | – |
| Concatenate | – | 8+8=16 | 16 | 128×128×16 | – | – |
| 3S convolution: spectral convolution | 16×16×1×1 | 16 | 16 | 128×128×16 | ✓ | ✓ |
| 3S convolution: spatial convolution | 8×1×3×3 | 16 | 16 | 128×128×8 | ✓ | ✓ |
| 3S convolution (w/o activation): spectral convolution | 128×8×1×1 | 128 | 8 | 128×128×16 | – | ✓ |
| 3S convolution (w/o activation): spatial convolution | 8×1×3×3 | 8 | 8 | 128×128×16 | – | ✓ |
| Residual addition | – | 8+8=16 | 8 | – | – | – |

| Layer (refinement module) | Kernel shape | # Input channels | # Output channels | Output shape | ReLU | ZM-norm |
|---|---|---|---|---|---|---|
| Input | – | 31 | 31 | 128×128×31 | – | – |
| 3S convolution: spectral convolution | 31×31×1×1 | 31 | 31 | 128×128×31 | ✓ | – |
| 3S convolution: spatial convolution | 31×1×3×3 | 31 | 31 | 128×128×31 | ✓ | – |
| 3S convolution (w/o activation): spectral convolution | 31×31×1×1 | 31 | 31 | 128×128×31 | – | – |
| 3S convolution (w/o activation): spatial convolution | 31×31×3×3 | 31 | 31 | 128×128×31 | – | – |
| Residual addition | – | 31+31=62 | 31 | – | – | – |
III-C2. Zero-mean normalization
Our objective is to learn a zero-centric residual image. However, the non-linear activation layers (e.g., ReLU and Swish) involved in the residual learning module make the output feature maps non-negative, so their mean values, and likewise that of the estimated residual image, deviate from zero. Besides, without additional constraints on the learned convolutional kernels, the convolution operation may also affect the mean values of the output of each layer.¹

¹ Generally, the mean value of a feature map is preserved after a convolution operation only if the sum of the elements of the involved kernel is equal to 1.
To achieve this objective, we propose a novel feature normalization process, namely Zero-Mean normalization (ZM-norm), which is performed over the spatial domain of each feature channel involved in the residual learning module. Specifically, in the forward propagation, the ZM-norm of the input feature maps $\mathbf{F}\in\mathbb{R}^{N\times HW\times C}$ of a typical convolutional layer behaves as

$$\hat{\mathbf{F}}_{n,i,c} = \mathbf{F}_{n,i,c} - \frac{1}{HW}\sum_{j=1}^{HW}\mathbf{F}_{n,j,c}, \qquad (3)$$

where $\mathbf{F}_{n,i,c}$ is the $(n,i,c)$-th entry of $\mathbf{F}$ ($1\le n\le N$, $1\le i\le HW$, $1\le c\le C$). Here $n$, $i$, and $c$ index the mini-batch sample, the spatial location, and the channel, respectively. The gradient of the training loss $\mathcal{L}$ can be back-propagated through ZM-norm according to Eq. (4):

$$\frac{\partial\mathcal{L}}{\partial\mathbf{F}_{n,i,c}} = \frac{\partial\mathcal{L}}{\partial\hat{\mathbf{F}}_{n,i,c}} - \frac{1}{HW}\sum_{j=1}^{HW}\frac{\partial\mathcal{L}}{\partial\hat{\mathbf{F}}_{n,j,c}}. \qquad (4)$$
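Since the Jacobian of the per-channel centring operation is $I - \frac{1}{HW}\mathbf{1}\mathbf{1}^{\mathsf{T}}$, the backward pass is itself a re-centring of the incoming gradient. A minimal NumPy sketch, with an assumed `(N, HW, C)` layout:

```python
import numpy as np

def zm_norm_forward(F):                       # F: (N, HW, C)
    # Subtract each channel's spatial mean, as in Eq. (3)
    return F - F.mean(axis=1, keepdims=True)

def zm_norm_backward(grad_out):               # gradient w.r.t. the ZM-norm output
    # The Jacobian is symmetric and idempotent, so the backward pass
    # simply re-centres the incoming gradient, as in Eq. (4)
    return grad_out - grad_out.mean(axis=1, keepdims=True)

rng = np.random.default_rng(2)
F = rng.random((2, 64, 8))
out = zm_norm_forward(F)
assert np.allclose(out.mean(axis=1), 0.0)     # each channel is zero-centric
```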
The proposed ZM-norm can be easily and efficiently implemented and integrated with existing deep learning architectures. Moreover, ZM-norm introduces an additional advantage: it accelerates the training process. The underlying reason is that during back-propagation, gradients are related to their corresponding feature values, and a zero-centric feature distribution can limit the gradient magnitude [17, 25], leading to more stable updates and thus faster convergence.

Remark: We would like to point out that our ZM-norm is different from existing feature normalization methods, such as Layer Normalization [5], Group Normalization [51], and Instance Normalization [44]. Those normalization methods were mainly proposed to speed up the training process and improve model generalization. Generally, they force the intermediate feature maps into a certain learnable distribution by calculating the standard deviation and rescaling the feature magnitudes accordingly, which may cause loss of scale information [31]. Our ZM-norm, however, focuses on eliminating the mean value of the feature maps for regressing a zero-centric residual image that captures the high-frequency spatial details.

III-D. Refinement Module
In the residual learning module, the residual image of each spectral band is independently synthesized and then superimposed on the corresponding band of the upsampled LR-HSI, leading to a coarse HR-HSI denoted by $\tilde{\mathbf{X}}$. However, the coherence among the bands of $\tilde{\mathbf{X}}$ cannot be well preserved in this manner. Therefore, as shown in Fig. 2(c), we further introduce a refinement module, in which all the spectral bands of $\tilde{\mathbf{X}}$ are explored simultaneously to enhance the reconstruction quality.
The overall process of the refinement module can be formulated as

$$\hat{\mathbf{X}} = g\big(\tilde{\mathbf{X}};\,\mathbf{W}\big) + \tilde{\mathbf{X}}, \qquad (5)$$

where $\tilde{\mathbf{X}}$ is the coarse HR-HSI, $g(\cdot)$ with weights $\mathbf{W}$ to be learned denotes the mapping realized by the module, and $\hat{\mathbf{X}}$ is the finally reconstructed HR-HSI. More specifically, three 3S convolutional layers are employed for feature extraction. Note that ZM-norm is no longer used in this module. The detailed implementation of this module is summarized in Table I.
III-E. Loss Function for Training
Our PZRes-Net is trained end-to-end with the following loss function:

$$\mathcal{L} = \lambda\,\big\|\tilde{\mathbf{X}} - \mathbf{X}\big\|_1 + \big\|\hat{\mathbf{X}} - \mathbf{X}\big\|_1, \qquad (6)$$

where $\lambda$ is the parameter balancing the two terms, which is empirically set to 1, $\|\cdot\|_1$ is the $\ell_1$ norm of a matrix, which returns the sum of the absolute values of its elements, and $\tilde{\mathbf{X}}$ and $\hat{\mathbf{X}}$ are the coarse and finally reconstructed HR-HSIs, respectively. The first term enforces PZRes-Net to learn the zero-centric residual image, while the second term encourages the reconstructed HR-HSI to be close to the ground-truth one in the sense of the mean absolute error.
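As a concrete reading of the loss, here is a minimal NumPy sketch, assuming `X_coarse` is the residual-superimposed coarse HR-HSI, `X_hat` the refined output, `X_gt` the ground truth, and `lambda_ = 1` as stated above; the helper name is hypothetical.

```python
import numpy as np

def training_loss(X_coarse, X_hat, X_gt, lambda_=1.0):
    # First term: drives the coarse output, and hence the residual, to be correct
    coarse_term = np.abs(X_coarse - X_gt).sum()
    # Second term: MAE-style fidelity of the final, refined output
    refine_term = np.abs(X_hat - X_gt).sum()
    return lambda_ * coarse_term + refine_term

# Toy check: coarse off by 1 everywhere (sum 4), refined off by 0.5 (sum 2)
loss = training_loss(np.ones((2, 2)), np.full((2, 2), 0.5), np.zeros((2, 2)))
print(loss)                                  # 6.0
```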
IV. Experiments
IV-A. Experiment Settings
IV-A1. Implementation details
We adopted the ADAM optimizer [22]. The learning rate of our PZRes-Net was decreased gradually with the cosine annealing decay strategy. During training, the number of iterations was fixed to 32,000 for all experiments. We implemented the model with PyTorch, and the batch size was set to 10 for CAVE and 30 for HARVARD. All the experiments were conducted on Ubuntu 18.04 with an Intel Xeon E5-2360 CPU and NVIDIA 2080 Ti GPUs. The code will be made publicly available. Table I summarizes the implementation details of our network architecture.

IV-A2. Compared methods
We compared our PZRes-Net with 6 state-of-the-art HSI super-resolution approaches, including 3 traditional methods, i.e., hyperspectral super-resolution (HySure) [42], non-negative structured sparse representation (NSSR) [10], and the low tensor-train rank-based method (LTTR) [8], and 3 DNN-based methods, i.e., deep HSI sharpening (DHSIS) [9], the multispectral and hyperspectral image fusion network (MHF) [52], and the deep blind hyperspectral image fusion network (DBIN+) [47]. For fair comparisons, the same data pre-processing was applied to all methods; the DNN-based methods under comparison were trained with the codes provided by the authors, using the suggested parameters and the same training data as ours; and the same protocol as in [8, 47] was used to evaluate the experimental results of all methods.
IV-A3. Quantitative metrics
For a comprehensive evaluation, we adopted four commonly-used quantitative metrics:

Peak Signal-to-Noise Ratio (PSNR):

$$\mathrm{PSNR} = \frac{1}{S}\sum_{s=1}^{S} 10\log_{10}\frac{\max^2\!\big(\mathbf{X}_s\big)}{\mathrm{MSE}\big(\hat{\mathbf{X}}_s,\mathbf{X}_s\big)}, \qquad (7)$$

where $\hat{\mathbf{X}}_s$ and $\mathbf{X}_s$ are the $s$-th ($1\le s\le S$) spectral bands of $\hat{\mathbf{X}}$ and $\mathbf{X}$, respectively, and $\mathrm{MSE}(\cdot,\cdot)$ returns the mean squared error between its inputs.

Average Structural Similarity Index (ASSIM):

$$\mathrm{ASSIM} = \frac{1}{S}\sum_{s=1}^{S}\mathrm{SSIM}\big(\hat{\mathbf{X}}_s,\mathbf{X}_s\big), \qquad (8)$$

where $\mathrm{SSIM}(\cdot,\cdot)$ [48] computes the SSIM value of a typical spectral band.

Spectral Angle Mapper (SAM):

$$\mathrm{SAM} = \frac{1}{HW}\sum_{i=1}^{HW}\arccos\frac{\hat{\mathbf{x}}_i^{\mathsf{T}}\mathbf{x}_i}{\|\hat{\mathbf{x}}_i\|_2\,\|\mathbf{x}_i\|_2}, \qquad (9)$$

where $\hat{\mathbf{x}}_i$ and $\mathbf{x}_i$ are the spectral signatures of the $i$-th ($1\le i\le HW$) pixels of $\hat{\mathbf{X}}$ and $\mathbf{X}$, respectively, and $\|\cdot\|_2$ is the $\ell_2$ norm of a vector.

Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS):

$$\mathrm{ERGAS} = \frac{100}{\sigma}\sqrt{\frac{1}{S}\sum_{s=1}^{S}\frac{\mathrm{MSE}\big(\hat{\mathbf{X}}_s,\mathbf{X}_s\big)}{\mu_s^2}}, \qquad (10)$$

where $\mu_s$ is the mean value of the $s$-th spectral band of $\mathbf{X}$, and $\sigma$ is the scale factor.
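Three of the four metrics can be sketched in NumPy as follows (SSIM is omitted, as it requires a windowed implementation); the `(H, W, S)` array layout and the helper names are our assumptions.

```python
import numpy as np

def psnr(X_hat, X):
    # Per-band PSNR averaged over the S spectral bands, as in Eq. (7)
    vals = []
    for s in range(X.shape[2]):
        mse = np.mean((X_hat[:, :, s] - X[:, :, s]) ** 2)
        vals.append(10 * np.log10(X[:, :, s].max() ** 2 / mse))
    return np.mean(vals)

def sam(X_hat, X):
    # Mean spectral angle (degrees) over all pixels, as in Eq. (9)
    a = X_hat.reshape(-1, X.shape[2])
    b = X.reshape(-1, X.shape[2])
    cos = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()

def ergas(X_hat, X, sigma):
    # Relative per-band MSE aggregated as in Eq. (10); sigma is the scale factor
    terms = [np.mean((X_hat[:, :, s] - X[:, :, s]) ** 2) / X[:, :, s].mean() ** 2
             for s in range(X.shape[2])]
    return (100.0 / sigma) * np.sqrt(np.mean(terms))

X = np.random.default_rng(3).random((8, 8, 5)) + 0.5
assert sam(X, X) < 1e-3                      # identical inputs -> zero spectral angle
```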
Table II. Quantitative comparison on the CAVE and HARVARD datasets at upsampling scales of 8× (top) and 32× (bottom). In each row, the first FLOPs column corresponds to CAVE and the second to HARVARD.

| Methods | Scale | # Param | FLOPs | PSNR | ASSIM | SAM | ERGAS | FLOPs | PSNR | ASSIM | SAM | ERGAS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NSSR [10] | 8 | – | – | 43.82 | 0.987 | 4.07 | 0.84 | – | 45.92 | 0.980 | 3.46 | 1.20 |
| HySure [42] | 8 | – | – | 37.35 | 0.945 | 9.87 | 2.01 | – | 43.88 | 0.975 | 4.20 | 1.56 |
| LTTR [8] | 8 | – | – | 46.35 | 0.990 | 3.76 | 0.80 | – | 46.45 | 0.982 | 3.46 | 1.20 |
| DHSIS [9] | 8 | 0.5M | 270G | 45.59 | 0.990 | 3.91 | 0.73 | 1.17T | 46.42 | 0.982 | 3.54 | 1.17 |
| MHF [52] | 8 | 1.3M | 476G | 46.81 | 0.992 | 3.83 | 0.75 | 2.04T | 46.40 | 0.982 | 3.45 | 1.20 |
| DBIN+ [47] | 8 | 0.7M | 4,095G | 47.51 | 0.992 | 3.18 | 0.58 | 17.58T | 46.67 | 0.983 | 3.42 | 1.15 |
| Ours | 8 | 0.7M | 271G | 50.94 | 0.998 | 2.63 | 0.43 | 1.08T | 47.52 | 0.989 | 2.83 | 1.07 |
| NSSR [10] | 32 | – | – | 39.89 | 0.958 | 8.33 | 0.63 | – | 40.17 | 0.956 | 5.13 | 0.3503 |
| HySure [42] | 32 | – | – | 30.14 | 0.913 | 10.35 | 2.35 | – | 35.17 | 0.937 | 7.15 | 0.5041 |
| LTTR [8] | 32 | – | – | 41.03 | 0.970 | 8.29 | 0.39 | – | 43.25 | 0.975 | 5.45 | 0.3062 |
| DHSIS [9] | 32 | 0.5M | 270G | 40.51 | 0.967 | 8.01 | 0.41 | 1.17T | 43.73 | 0.978 | 5.31 | 0.3107 |
| MHF [52] | 32 | 1.7M | 673G | 40.12 | 0.962 | 7.30 | 0.45 | 2.46T | 44.31 | 0.981 | 4.90 | 0.2955 |
| DBIN+ [47] | 32 | 1.67M | 4,571G | 42.83 | 0.988 | 6.63 | 0.27 | 31.05T | 44.24 | 0.980 | 5.32 | 0.2964 |
| Ours | 32 | 0.7M | 271G | 45.77 | 0.991 | 6.25 | 0.23 | 1.08T | 45.47 | 0.986 | 4.19 | 0.2878 |
IV-B. Evaluation on Synthetic Data
In this scenario, two commonly used benchmark HSI datasets, i.e., CAVE² [56] and HARVARD³ [7], were used to generate synthetic hybrid inputs. Specifically, the CAVE dataset contains 32 indoor HSIs of spatial resolution 512×512 with 31 spectral bands, captured by a generalized assorted pixel camera at a wavelength interval of 10 nm in the range of 400-700 nm. The HARVARD dataset contains 50 indoor and outdoor HSIs recorded under daylight illumination, and 27 images under artificial or mixed illumination. Each HSI consists of 31 spectral bands of spatial resolution 1024×1392, whose wavelengths range from 420 to 720 nm. Following [9, 47], only the 50 daylight-illumination images were used in our experiments. Moreover, the first 30 HSIs were used for training, and the last 20 for testing. Following [47, 8, 52], all the LR-HSIs (with downsampling scale $\sigma$) used in this scenario were acquired through the following two steps: (1) a Gaussian kernel is used to blur the HSIs; and (2) the blurred HSIs are regularly decimated every $\sigma$ pixels in the spatial domain. We simulated the HR-MSI (RGB image) of the same scene by integrating the spectral bands of an HSI with the widely used response function of the Nikon D700 camera⁴.

² http://www.cs.columbia.edu/CAVE/databases/
³ http://vision.seas.harvard.edu/hyperspec/
⁴ http://www.maxmax.com/spectral_response.htm
IV-B1. Results on the CAVE dataset
As listed in the left side of Table II, the proposed method, with only 0.7M parameters, consistently and significantly surpasses all the compared methods in terms of all four metrics under both upsampling scales. Although the parameter sharing strategy in DBIN+ helps to save parameters to some extent, its iterative refinement costs much more computation, as demonstrated by its high FLOPs value. To be specific, DBIN+ consumes 4,095G FLOPs, about 15× that of the proposed method, and thus its utilization in practice may be severely restricted. The traditional methods, e.g., NSSR and LTTR, show good reconstruction performance at the 8× scale. However, as the upsampling scale rises to 32×, the matrix factorization-based method, e.g., LTTR, shows a sharp deterioration, which is probably caused by the model's limited representation ability or by the failure of its prior knowledge. While the performance of all methods drops at the larger upsampling scale, the proposed PZRes-Net still maintains the highest performance in terms of all four metrics.
IV-B2. Results on the HARVARD dataset
The experimental results of different methods on the HARVARD dataset are listed in the right side of Table II. Note that the HARVARD dataset is more challenging than the CAVE dataset due to its higher resolution and more complex scenes. The significant superiority of our method over state-of-the-art methods is further validated. Specifically, the two state-of-the-art DNN-based methods, i.e., MHF and DBIN+, are only slightly better than the traditional methods, whereas our PZRes-Net, with a PSNR of 47.52 dB (resp. 45.47 dB) and a SAM of 2.83 (resp. 4.19), greatly pushes forward the limits at the scale of 8× (resp. 32×). Moreover, compared with those on the CAVE dataset, the SAM values of all methods on the HARVARD dataset are smaller, which may be explained by the dark background of the CAVE dataset: for the same absolute error, pixels with low intensities tend to produce high SAM values.
IV-B3. Visual comparisons
The error maps between the spectral bands of the HR-HSIs reconstructed by different methods and the ground-truth ones are shown in Fig. 5. Accordingly, Fig. 4 provides the PSNR values of each spectral band of the visualized images in Fig. 5 for reference. For the images from the CAVE dataset, i.e., Figs. 5(a)-(c), the traditional methods, e.g., NSSR and LTTR, achieve performance comparable to the two state-of-the-art DNN-based methods, i.e., DBIN+ and MHF, but the shapes of objects can be easily inferred from the error maps of all the compared methods due to the large errors at object boundaries. In contrast, our PZRes-Net consistently produces the smallest errors over all the images at a low computational cost, with only subtle, nearly invisible errors. Similar observations hold for the images from the HARVARD dataset, i.e., Figs. 5(d)-(f). These results convincingly demonstrate the advantage of our method.
IV-C. Evaluation on Real Data
In this scenario, we evaluated our PZRes-Net on a real dataset, i.e., WorldView-2 (WV-2)⁵, which contains a pair of an LR-HSI with 8 spectral bands and an HR-MSI (i.e., an RGB image) of much higher spatial resolution. As ground-truth data are not available, following [52], we generated the training data as follows. We first degraded the image pair to spatial resolutions of 104×164 and 416×656, respectively, and trained all models equally on the top half parts of the degraded HR-MSI and LR-HSI. During training, the corresponding part of the original LR-HSI was used as the supervision. The bottom half part of the original data, excluded from the training process, was used for testing. As we can only compare different methods visually in this scenario, we also trained a spectral function simultaneously, which maps the reconstructed HR-HSI to an RGB image for better visualization. Here, only the results of the two state-of-the-art DNN-based methods, i.e., DBIN+ and MHF, which always achieve the top two performances among the compared methods in the previous scenario, are shown for comparison.

⁵ https://www.harrisgeospatial.com/DataImagery/SatelliteImagery/HighResolution/WorldView2
The top of Fig. 6 visualizes the MSIs with 3 spectral bands synthesized from the HR-HSIs output by different methods. Our PZRes-Net outperforms MHF and DBIN+, as both the colors and the spatial patterns of objects produced by our method are closer to those of the input HR-MSI.
Moreover, the remaining three sub-images in Fig. 6 visualize three spectral bands of the HR-HSIs reconstructed by different methods, where it can be observed that our PZRes-Net reconstructs both high-frequency details (the eaves in the magenta box) and smooth regions (the roof in the magenta and green boxes) very well. In contrast, both DBIN+ and MHF fail to reconstruct high-frequency spatial details, resulting in blurred boundaries, and unexpected visual artifacts also appear in the smooth regions. These visually pleasing results of our method are credited to its advantage in modeling the cross-modality information.
IV-D Ablation Studies
We carried out extensive ablation studies on the CAVE dataset to comprehensively analyze the key components of our PZRes-Net model.
TABLE III: Ablation study on the refinement module and the residual-dense aggregation.
Models                         | PSNR  | ASSIM | SAM  | ERGAS
w/o Refinement                 | 50.56 | 0.994 | 3.07 | 0.47
w/o residual-dense aggregation | 50.45 | 0.995 | 2.95 | 0.47
Full PZRes-Net                 | 50.94 | 0.998 | 2.63 | 0.43
TABLE IV: Effect of the number of stages in the progressive spectral embedding.
Cases        | PSNR  | SSIM  | SAM  | ERGAS
One stage    | 50.47 | 0.994 | 2.83 | 0.47
Two stages   | 50.82 | 0.995 | 2.71 | 0.45
Three stages | 50.94 | 0.998 | 2.63 | 0.43
TABLE V: Comparison of convolution types under approximately equal parameter or FLOP budgets.
Case     | # Param | FLOPs | PSNR  | ASSIM | SAM  | ERGAS
2D conv. | 1.4M    | 292G  | 48.77 | 0.990 | 3.10 | 0.55
3D conv. | 0.5M    | 565G  | 50.59 | 0.997 | 2.71 | 0.44
3S conv. | 0.7M    | 271G  | 50.94 | 0.998 | 2.63 | 0.43
TABLE VI: Effect of ZM-norm and the mean-value invariant upsampling.
Mean-value invariant upsampling | ZM-norm | PSNR  | ASSIM | SAM  | ERGAS
                                |         | 47.60 | 0.992 | 3.10 | 0.77
                                |         | 47.45 | 0.992 | 3.14 | 0.77
                                |         | 46.42 | 0.989 | 3.75 | 0.81
                                |         | 50.94 | 0.998 | 2.63 | 0.43
TABLE VII: Choice of the upsampling operator.
Models             | PSNR  | ASSIM | SAM  | ERGAS
Learned upsampling | 47.45 | 0.992 | 3.14 | 0.77
Bicubic            | 50.79 | 0.997 | 2.93 | 0.72
Bilinear           | 50.94 | 0.998 | 2.63 | 0.43
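The SAM and ERGAS metrics reported throughout the tables above can be sketched as follows (a common convention; the paper's exact implementation may differ, e.g., in the choice of the spatial ratio r):

```python
import numpy as np

def sam_degrees(ref, rec, eps=1e-12):
    """Spectral Angle Mapper: mean angle (degrees) between per-pixel spectra."""
    r = ref.reshape(-1, ref.shape[-1])
    x = rec.reshape(-1, rec.shape[-1])
    cos = np.sum(r * x, axis=1) / (
        np.linalg.norm(r, axis=1) * np.linalg.norm(x, axis=1) + eps)
    return np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

def ergas(ref, rec, ratio=8, eps=1e-12):
    """ERGAS with r the spatial downsampling factor (common convention):
    100/r * sqrt(mean over bands of (RMSE_b / mean_b)^2)."""
    rmse = np.sqrt(np.mean((ref - rec) ** 2, axis=(0, 1)))
    mu = np.mean(ref, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean((rmse / (mu + eps)) ** 2))
```

Lower is better for both: SAM is 0 for spectra that differ only by a scale factor, and ERGAS is 0 for a perfect reconstruction.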
IV-D1 The refinement module
As aforementioned, we used a refinement module to boost the reconstruction performance by simultaneously exploring the coherence among all spectral bands of the resulting coarse HR-HSI. We trained PZRes-Net without (w/o) this module. Comparing the "w/o Refinement" and "Full PZRes-Net" rows of Table III validates the effectiveness of this module.
IV-D2 The residual-dense architecture
To facilitate feature extraction, residual-dense aggregation is embedded in our network. Here we investigated the contribution of this architecture by training our PZRes-Net without all the identity mappings and dense connections of each stage. Additionally, we widened the network such that the modified network has approximately the same number of parameters for a fair comparison, i.e., we increased the widths of the three stages from 16, 32, and 62 to 24, 48, and 93, respectively. As shown in Table III, compared with the full PZRes-Net, the PSNR and ASSIM values of PZRes-Net w/o residual-dense aggregation decrease by about 0.5 dB and 0.003, respectively, and the SAM and ERGAS values increase by 0.32 and 0.04, respectively, validating the advantage of the residual-dense architecture.
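The residual-dense aggregation described above can be illustrated with a toy block in which every layer consumes the concatenation of all preceding feature maps and an identity mapping is added at the end; `toy_layer` is a hypothetical stand-in for a real convolutional layer, not the paper's layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(x, out_ch):
    """Stand-in for a conv layer: a random channel-mixing linear map + ReLU."""
    w = rng.standard_normal((x.shape[-1], out_ch)) * 0.1
    return np.maximum(x @ w, 0.0)

def residual_dense_block(x, growth=16, n_layers=3):
    """Dense aggregation: each layer sees the concatenation of the block input
    and all previous layer outputs; an identity mapping is added at the end."""
    feats = [x]
    for _ in range(n_layers):
        feats.append(toy_layer(np.concatenate(feats, axis=-1), growth))
    fused = toy_layer(np.concatenate(feats, axis=-1), x.shape[-1])  # 1x1-style fusion
    return x + fused                      # residual (identity) connection
```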
IV-D3 The progressive spectral embedding scheme
Inspired by the progressive strategy in image super-resolution, our framework progressively feeds the spectral information of the input HSI into the network to reconstruct the HR-HSI. To validate the advantage of this progressive manner, we trained our PZRes-Net with different numbers of stages. "One stage" means that all the spectral bands of the upsampled LR-HSI are fed into the network at the beginning. Note that we kept the modified models under different settings with the same number of parameters through varying the width of the networks for fair comparisons. As shown in Table IV, the reconstruction quality gradually improves as the number of stages increases, while the growth rate diminishes. Thus, the advantage of the proposed progressive embedding scheme is convincingly validated. Based on this observation, we use 3 stages in our framework.
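The progressive spectral embedding can be sketched as a loop that feeds the bands of the upsampled LR-HSI into the network group by group; the refinement performed by each real stage is omitted here, so this toy only illustrates the band scheduling:

```python
import numpy as np

def progressive_embedding(lrhsi_up, n_stages=3):
    """Split the spectral bands into n_stages contiguous groups and feed them
    stage by stage, accumulating a running estimate (toy: no refinement)."""
    groups = np.array_split(np.arange(lrhsi_up.shape[-1]), n_stages)
    estimate = None
    for g in groups:
        new_bands = lrhsi_up[..., g]
        estimate = (new_bands if estimate is None
                    else np.concatenate([estimate, new_bands], axis=-1))
        # a real stage would refine `estimate` with conv layers here
    return estimate
```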
IV-D4 The 3S convolution
In our PZRes-Net, 3S convolution enables efficient HSI processing. To investigate its efficiency and effectiveness, we trained our model by replacing the 3S convolutional layers with 2-D and 3-D convolutional layers, respectively, while keeping approximately the same number of parameters or FLOPs. The experimental results are listed in Table V, from which it can be seen that with 2-D convolution the performance drops sharply from 50.94 dB to 48.77 dB, because 2-D convolution has a very limited ability to capture spectral information. Although 3-D convolution achieves good performance, it consumes much more computational resources and is quite time-consuming. Our PZRes-Net with 3S convolution balances efficiency and effectiveness well.
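A back-of-the-envelope parameter count illustrates why a spectral-spatial separable factorization is cheaper than a full 3-D convolution; the particular split into a k×k×1 spatial convolution followed by a 1×1×k spectral convolution is an assumption for illustration and may differ in detail from the paper's 3S layer design:

```python
def conv_params(c_in, c_out, k=3):
    """Parameter counts for a k x k 2-D conv, a k x k x k 3-D conv, and a
    separable variant (k x k x 1 spatial conv + 1 x 1 x k spectral conv)."""
    p2d = k * k * c_in * c_out
    p3d = k * k * k * c_in * c_out
    p3s = k * k * c_in * c_out + k * c_out * c_out
    return p2d, p3d, p3s

# Example: c_in = c_out = 32, k = 3 -> the separable variant needs fewer
# parameters than the full 3-D conv (12288 vs. 27648).
p2d, p3d, p3s = conv_params(32, 32)
```

FLOPs scale with these counts multiplied by the number of spatial-spectral positions, which matches the trend of Table V: the 3S variant achieves the lowest FLOPs while retaining the spectral modeling ability the 2-D variant lacks.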
Fig. 7. Illustration of the advantages of our ZM-norm: the behaviors of the testing PSNR and the training loss over epochs of our PZRes-Net with and w/o ZM-norm.
IV-D5 ZM-norm and mean-value invariant upsampling
To predict a zero-mean residual image, ZM-norm is applied to the feature maps. Also, the input LR-HSI is upsampled with a mean-value invariant upsampling process to avoid distortion. Here we experimentally verified the necessity of this combination by either removing ZM-norm during training or using a learned, restriction-free transposed convolutional layer w/o the mean-value invariant property for upsampling. From Table VI, we can observe that our method achieves the best performance when both ZM-norm and mean-value invariant upsampling are applied. Removing ZM-norm severely affects the performance of our framework: PZRes-Net is built upon the classic wavelet decomposition-based fusion method, which focuses on extracting the zero-mean high-frequency residual image, and without ZM-norm it is hard to maintain this unique property of the high-order residual image, thus leading to poor performance. Meanwhile, the mean-value invariant characteristic of the upsampling process is also necessary: without this property, it is hard to keep the mean values of the spectral bands close to the ground-truth ones, and thus distortion is introduced. Last but not least, Fig. 7 also indicates that ZM-norm can accelerate the training process.
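ZM-norm, as described above, keeps the predicted residual zero-centric by removing the mean of every feature map; a minimal sketch, assuming per-channel spatial mean subtraction with no learned scaling or shifting:

```python
import numpy as np

def zm_norm(feat):
    """Zero-mean normalization: subtract the spatial mean of each feature map
    so that the predicted residual image stays zero-centric (sketch)."""
    mean = feat.mean(axis=(0, 1), keepdims=True)   # per-channel spatial mean
    return feat - mean
```

Unlike batch or instance normalization, this sketch applies no variance scaling, since the goal is only to enforce the zero-mean property of the high-frequency residual.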
IV-D6 The choice of the upsampling process
We used the bilinear operator to realize the mean-value invariant upsampling for its simplicity. We also investigated another mean-value invariant interpolation operator, i.e., the bicubic operator. Experimental results in Table VII indicate that PZRes-Net with the bicubic and bilinear interpolations achieves comparable performance, demonstrating the robustness of our framework. Compared with the bicubic interpolation, the bilinear interpolation is more computationally efficient.
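The mean-value invariance of bilinear upsampling can be checked on a 1-D toy: a stride-2 transposed convolution with the bilinear kernel [0.25, 0.75, 0.75, 0.25] and periodic boundary handling preserves the mean exactly (real 2-D layers need matching boundary handling for exact invariance; this toy is an illustration, not the paper's implementation):

```python
import numpy as np

def bilinear_up1d_periodic(x):
    """2x bilinear upsampling of a 1-D signal as a stride-2 transposed
    convolution with kernel [0.25, 0.75, 0.75, 0.25], periodic boundary.

    Each input sample contributes a total weight of 2 and the output is
    twice as long, so the mean is preserved exactly."""
    k = np.array([0.25, 0.75, 0.75, 0.25])
    n = len(x)
    y = np.zeros(2 * n)
    for i, v in enumerate(x):
        for j, w in enumerate(k):
            y[(2 * i + j - 1) % (2 * n)] += w * v
    return y
```

A learned, restriction-free transposed convolution has no such guarantee, which is consistent with the degraded "Learned upsampling" row of Table VII.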
V Conclusions
In this paper, we have presented a progressive zero-centric residual network (PZRes-Net), which is capable of efficiently and effectively reconstructing HR-HSIs from hybrid inputs, i.e., an LR-HSI and an HR-MSI. Our PZRes-Net is mainly inspired by the classic wavelet decomposition-based image fusion method and mimics it in an adaptive learning manner. That is, our PZRes-Net learns from both inputs a zero-centric residual image, which contains the high-frequency spatial details of the scene across all spectral bands. We have proposed ZM-norm, mean-value invariant upsampling, spectral-spatial separable convolution with dense aggregation, and progressive spectral information embedding to achieve this objective. Extensive experimental results as well as comprehensive ablation studies on both synthetic and real benchmark datasets demonstrate that our PZRes-Net lifts the current state-of-the-art performance to a new level both quantitatively and qualitatively. Moreover, our PZRes-Net is a lightweight network that is much more computationally efficient than state-of-the-art deep learning-based methods, which validates its practicality.
Encouraged by the impressive reconstruction quality, we are interested in investigating the potential of our zero-centric residual learning scheme on other high-order feature extraction tasks, e.g., image denoising.
References

 [1] (2017) NTIRE 2017 challenge on single image superresolution: dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Cited by: Fig. 3, §IIIB.
 [2] (2018) Image superresolution via progressive cascading residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 791–799. Cited by: §IIIC.
 [3] (2014) Sparse spatiospectral representation for hyperspectral image superresolution. In European Conference on Computer Vision, pp. 63–78. Cited by: §IIA, §II.
 [4] (2018) CNN based subpixel mapping for hyperspectral images. Neurocomputing 311, pp. 51–64. Cited by: §IIB.
 [5] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §IIIC2.
 [6] (2013) Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and remote sensing magazine 1 (2), pp. 6–36. Cited by: §I.
 [7] (2011) Statistics of realworld hyperspectral images. In CVPR 2011, pp. 193–200. Cited by: §IVB.
 [8] (2019) Learning a low tensortrain rank representation for hyperspectral image superresolution. IEEE transactions on neural networks and learning systems 30 (9), pp. 2672–2683. Cited by: §IIA, §IIB, §II, §IVA2, §IVB, TABLE II.
 [9] (2018) Deep hyperspectral image sharpening. IEEE transactions on neural networks and learning systems (99), pp. 1–11. Cited by: §I, §I, §IIB, §IVA2, §IVB, TABLE II.
 [10] (2016) Hyperspectral image superresolution via nonnegative structured sparse representation. IEEE Transactions on Image Processing 25 (5), pp. 2337–2352. Cited by: §I, §IIA, §IVA2, TABLE II.
 [11] (2018) Superresolution of hyperspectral image via superpixelbased sparse representation. Neurocomputing 273, pp. 171–177. Cited by: §IIA.
 [12] (2018) Hyperspectral image superresolution with a mosaic rgb image. IEEE Transactions on Image Processing 27 (11), pp. 5539–5552. Cited by: §I.
 [13] (2004) Customized fusion of satellite images based on the á trous algorithm. In Image and Signal Processing for Remote Sensing X, Vol. 5573, pp. 444–451. Cited by: §I, §IIA.
 [14] (2013) Jointly sparse fusion of hyperspectral and multispectral imagery. In 2013 IEEE International Geoscience and Remote Sensing SymposiumIGARSS, pp. 4090–4093. Cited by: §I.
 [15] (2018) Selfsimilarity constrained sparse representation for hyperspectral image superresolution. IEEE Transactions on Image Processing 27 (11), pp. 5625–5637. Cited by: §IIA.
 [16] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §IIIC1, §IIIC.
 [17] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §IIIC2.
 [18] (2018) Hyperspectral superresolution: a coupled tensor factorization approach. IEEE Transactions on Signal Processing 66 (24), pp. 6503–6517. Cited by: §IIA.
 [19] (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §IIIC.
 [20] (2007) Exposure fusion. In Computer Graphics and Applications, Pacific Conference on, pp. 382–390. Cited by: §I, §IIA, §II.
 [21] (2018) Modern trends in hyperspectral image analysis: a review. IEEE Access 6, pp. 14118–14129. Cited by: §I.
 [22] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IVA1.
 [23] (2017) Deep laplacian pyramid networks for fast and accurate superresolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §IIIC.
 [24] (2020) Cascading and enhanced residual networks for accurate singleimage superresolution. IEEE Transactions on Cybernetics. Cited by: §IIIA.
 [25] (2012) Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Cited by: §IIIC2.

 [26] (2010) Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE Transactions on Geoscience and Remote Sensing 48 (11), pp. 4085–4098. Cited by: §I.
 [27] (2011) Hyperspectral image segmentation using a new bayesian approach with active learning. IEEE Transactions on Geoscience and Remote Sensing 49 (10), pp. 3947–3960. Cited by: §I.
 [28] (2017) Pixellevel image fusion: a survey of the state of the art. information Fusion 33, pp. 100–112. Cited by: §I, §IIIA.
 [29] (2002) Using the discrete wavelet frame transform to merge landsat tm and spot panchromatic images. Information Fusion 3 (1), pp. 17–23. Cited by: §II.
 [30] (2017) Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sensing 9 (1), pp. 67. Cited by: §IIIC1.
 [31] (2017) Enhanced deep residual networks for single image superresolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §IIIC2.
 [32] (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: Fig. 3, §IIIB.
 [33] (2017) Hyperspectral image spatial superresolution via 3d full convolutional neural network. Remote Sensing 9 (11), pp. 1139. Cited by: §IIB, §IIIC1.
 [34] (2017) Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 261–270. Cited by: §IIIC1.
 [35] (1999) Image fusion with additive multiresolution wavelet decomposition. applications to spot+ landsat images. JOSA A 16 (3), pp. 467–474. Cited by: §IIA.
 [36] (1999) Multiresolutionbased image fusion with additive wavelet decomposition. IEEE Transactions on Geoscience and Remote sensing 37 (3), pp. 1204–1211. Cited by: §I, §I, §IIA.
 [37] (1997) Simultaneous image fusion and reconstruction using wavelets applications to spot+ landsat images. Vistas in Astronomy 41 (3), pp. 351–357. Cited by: §IIA.
 [38] (2004) A waveletbased image fusion tutorial. Pattern recognition 37 (9), pp. 1855–1872. Cited by: §II.
 [39] (2017) Multispectral and hyperspectral image fusion using a 3dconvolutional neural network. IEEE Geoscience and Remote Sensing Letters 14 (5), pp. 639–643. Cited by: §IIB.
 [40] (2016) Active learning system for weed species recognition based on hyperspectral sensing. Biosystems Engineering 146, pp. 193–202. Cited by: §I.
 [41] (2018) Unsupervised sparse dirichletnet for hyperspectral image superresolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2511–2520. Cited by: §IIB.
 [42] (2014) A convex formulation for hyperspectral image superresolution via subspacebased regularization. IEEE Transactions on Geoscience and Remote Sensing 53 (6), pp. 3373–3388. Cited by: §IVA2, TABLE II.
 [43] (2017) Object tracking by hierarchical decomposition of hyperspectral video sequences: application to chemical gas plume tracking. IEEE Transactions on Geoscience and Remote Sensing 55 (8), pp. 4567–4585. Cited by: §I.
 [44] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §IIIC2.
 [45] (2015) Dualcamera design for coded aperture snapshot spectral imaging. Applied optics 54 (4), pp. 848–858. Cited by: §IIB.
 [46] (2018) Hyperreconnet: joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging. IEEE Transactions on Image Processing 28 (5), pp. 2257–2270. Cited by: §IIB.
 [47] (2019) Deep blind hyperspectral image fusion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4150–4159. Cited by: §I, §I, §IIB, §II, §IVA2, §IVB, TABLE II.
 [48] (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: 2nd item.
 [49] (2015) Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Transactions on Geoscience and Remote Sensing 53 (7), pp. 3658–3668. Cited by: §IIA.
 [50] (2018) Fast light field reconstruction with deep coarsetofine modeling of spatialangular clues. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 137–152. Cited by: §IIIC1.
 [51] (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §IIIC2.
 [52] (2019) Multispectral and hyperspectral image fusion by ms/hs fusion net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1585–1594. Cited by: §I, §I, §IIA, §IIA, §IIB, §IIB, §II, §IVA2, §IVB, §IVC, TABLE II.
 [53] (2017) Hscnn: cnnbased hyperspectral image recovery from spectrally undersampled projections. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 518–525. Cited by: §IIB.
 [54] (2019) Crossmodality bridging and knowledge transferring for image understanding. IEEE Transactions on Multimedia 21 (10), pp. 2675–2685. Cited by: §I.
 [55] (2012) Fusion of multispectral and panchromatic images based on support value transform and adaptive principal component analysis. Information Fusion 13 (3), pp. 177–184. Cited by: §II.
 [56] (2010) Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE transactions on image processing 19 (9), pp. 2241–2253. Cited by: §IVB.
 [57] (2018) Light field spatial superresolution using deep efficient spatialangular separable convolution. IEEE Transactions on Image Processing 28 (5), pp. 2319–2330. Cited by: §IIIC1.
 [58] (1996) Multiresolution wavelet decomposition image merger of landsat thematic mapper and spot panchromatic data. Photogrammetric Engineering & Remote Sensing 62 (9), pp. 1067–1074. Cited by: §IIA.
 [59] (2016) Multispectral and hyperspectral image fusion based on group spectral embedding and lowrank factorization. IEEE Transactions on Geoscience and Remote Sensing 55 (3), pp. 1363–1371. Cited by: §I.
 [60] (2020) Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §IIIA, §IIIC.
 [61] (1999) A categorization of multiscaledecompositionbased image fusion schemes with a performance study for a digital camera application. Proceedings of the IEEE 87 (8), pp. 1315–1326. Cited by: §I, §IIA, §II.
 [62] (2007) Multisource image fusion method using support value transform. IEEE Transactions on Image Processing 16 (7), pp. 1831–1839. Cited by: §II.
 [63] (2009) Multiscale fusion algorithm comparisons: pyramid, dwt and iterative dwt. In 2009 12th International Conference on Information Fusion, pp. 1060–1067. Cited by: §IIIA.
 [64] (2016) Perceptual fusion of infrared and visible images through a hybrid multiscale decomposition with gaussian and bilateral filters. Information Fusion 30, pp. 15–26. Cited by: §IIIA.