DIP-HyperKite: Hyperspectral Pansharpening Based on Improved Deep Image Prior and Residual Reconstruction
Hyperspectral pansharpening aims to synthesize a low-resolution hyperspectral image (LR-HSI) with a registered panchromatic image (PAN) to generate an enhanced HSI with high spectral and spatial resolution. Recently proposed HS pansharpening methods have obtained remarkable results using deep convolutional networks (ConvNets), which typically consist of three steps: (1) up-sampling the LR-HSI, (2) predicting the residual image via a ConvNet, and (3) obtaining the final fused HSI by adding the outputs from first and second steps. Recent methods have leveraged Deep Image Prior (DIP) to up-sample the LR-HSI due to its excellent ability to preserve both spatial and spectral information, without learning from large data sets. However, we observed that the quality of up-sampled HSIs can be further improved by introducing an additional spatial-domain constraint to the conventional spectral-domain energy function. We define our spatial-domain constraint as the L_1 distance between the predicted PAN image and the actual PAN image. To estimate the PAN image of the up-sampled HSI, we also propose a learnable spectral response function (SRF). Moreover, we noticed that the residual image between the up-sampled HSI and the reference HSI mainly consists of edge information and very fine structures. In order to accurately estimate fine information, we propose a novel over-complete network, called HyperKite, which focuses on learning high-level features by constraining the receptive from increasing in the deep layers. We perform experiments on three HSI datasets to demonstrate the superiority of our DIP-HyperKite over the state-of-the-art pansharpening methods. The deployment codes, pre-trained models, and final fusion outputs of our DIP-HyperKite and the methods used for the comparisons will be publicly made available at https://github.com/wgcban/DIP-HyperKite.git.READ FULL TEXT VIEW PDF
DIP-HyperKite: Hyperspectral Pansharpening Based on Improved Deep Image Prior and Residual Reconstruction
Hyperspectral images (HSIs) with a large number of spectral bands have gained immense attention in the field of remote sensing due to its applications in broad research areas such as classification , unmixing 3], change detection 
, etc. However, due to the limited incident energy available when capturing an image, hyperspectral imaging systems face trade-offs between spectral resolution, spatial resolution, and signal-to-noise ratio (SNR). For this reason, hyperspectral imaging systems can provide images with high spectral resolution but with low spatial resolution. In contrast, multispectral imaging systems can provide data with high spatial resolution but with fewer spectral bands (e.g., panchromatic images or multispectral images(MSIs) with three or four spectral bands). Low spatial resolution in HSIs leads to relatively poor performance in some practical remote sensing applications, such as road topology extraction , and spectral unmixing 
. Therefore, full-resolution HSIs with high spatial and spectral resolution are desired. One way to obtain such ideal HSIs is to fuse high spectral resolution HSIs with high spatial resolution PAN/MSIs. This fusion process is called HS pansharpening in the remote sensing literature, which is indeed a form of super-resolution.
Traditional pansharpening methods can be mainly divided into four classes : (1) component substitution (CS), (2) multi-resolution analysis (MRA), (3) Bayesian, and (4) matrix factorization. Component substitution methods rely on substituting the spatial component of the HSI with the MSI/PAN image. The family of CS contains algorithms such as Gram–Schmidt adaptive (GSA) [8, 9]
, principal component analysis (PCA)[10, 11, 12], and intensity-hue-saturation (IHS) . Even though the CS methods usually generate pansharpened HSIs with accurate spatial information, sometimes they suffer from critical spectral distortions. The MRA approaches are based on injecting the spatial details obtained through the multi-scale decomposition of the MSI/PAN image into the HSI. In order to extract the spatial details from the PAN image, several algorithms have been proposed in the literature, such as decimated wavelet transform (DWT) , undecimated wavelet transform (UDWT) , smoothing filter-based intensity modulation (SFIM) , modulation transfer function with generalized Laplacian pyramid (MTF-GLP) , and MTF-GLP with high-pass modulation (MTF-GLP-HPM) . In contrast to the CS methods, the MRA family performs better in spectral preservation, but is more sensitive to registration errors which may cause critical distortions in the spatial domain. Due to these inherent advantages and disadvantages of CS and MRA approaches, there have been works which attempted to combine both CS and MRA methods. One of the representatives of hybrid CS and MRA algorithm is guided filter PCA (GFPCA) 
. The Bayesian-based methods also provide a convenient way to regularize the fusion methods by modeling the posterior distribution of the target HSI provided that the LR-HSI and MSI/PAN image. Examples of the algorithms based on the Bayesian inference framework include convex regularization under a Bayesian framework (abbreviated as Hysure), naive Bayesian Gaussian prior (abbreviated as BF) , and sparsity promoted Gaussian prior (abbreviated as BFS) . Finally, the coupled non-negative matrix factorization (abbreviated as CNMF) is one of the examples for matrix factorization-based methods, which regularizes the fusion problem by using the priors of spectral unmixing . However, the fusion performance of traditional pansharpening approaches is generally limited due to their inadequate representation ability. In addition, the algorithms mentioned above may result in severe quality degradation when the assumptions do not align with a particular dataset. Furthermore, most traditional pansharpening approaches typically reach the optimal solution through an iterative process, which is time-consuming and inefficient.
Recently, deep learning (DL) models based on convolutional neural networks (ConvNets) have also been introduced for the HS pansharpening problem due to ConvNets’ excellent ability to learn high-level features automatically. ConvNet-based HS pansharpening methods generally consist of three steps,
Up-sampling step: Up-sampling the LR-HSI to the spatial resolution of the PAN image,
Residual reconstruction step: Concatenating the up-sampled HSI and PAN image along the spectral dimension and passing it through a residual learning network to learn the residual image,
Final fusion step: Obtaining the final fused HSI by adding the up-sampled HSI and the residual image.
There have been many methods proposed to up-sample LR-HSI to the spatial resolution of PAN. In the earliest studies, nearest-neighbor and bicubic interpolation were the famous methods to perform up-sampling. However, the methods mentioned above conduct upsampling on each band of the LR-HSI successively, thus ignoring the high spectral correlation of HSIs which may lead to spectral distortions[24, 25]. In order to minimize the spectral distortion, data-driven up-sampling techniques (i.e., deep super-resolution networks) have also been utilized in HS pansharpening. The LapSRN  network is an example of such a data-driven super-resolution method, which progressively super-resolves a LR image in a coarse-to-fine manner in a Laplacian pyramid framework. However, the LapSRN method requires a large number of images for training which is impractical in the HS domain due to the limited number of datasets available to the public. A remedy to the problem mentioned above was proposed by Ulyanov et al.  where they proposed a deep learning-based super-resolution framework called deep image prior (DIP). The proposed method uses a randomly initialized ConvNet to upsample an image, using its structure as an image prior, similar to bicubic upsampling. However, this method does not require any training but produces much cleaner results with sharper edges. Motivated by the super-resolution performance of DIP in the RGB domain, researchers have applied DIP to the HS pansharpening problem [27, 24] and achieved impressive results. However, we observed that the energy function defined in HS DIP up-sampling directly applies the energy function formulated for the RGB DIP process, where they only impose spectral-domain constraint by computing the distance between the down-sampled version of the target up-sampled HSI and the LR-HSI. However, the existing HS DIP methods do not impose any spatial-domain constraint by utilizing the available PAN image. We address this issue by introducing an additional spatial-domain constraint to the HS DIP process as our first contribution.
For residual reconstruction, various ConvNet architectures have been proposed in the literature to accurately predict the residual component between the up-sampled HSI and the reference HSI with less spectral and spatial distortion. Among those, Giuseppe et al.  was the first to introduce simple three-layer ConvNet architecture for the residual learning. Further, Lin et al.  improved the spatial and spectral prediction capability of Giuseppe’s work (abbreviated as HyperPNN) by introducing spectral and spatial prediction modules. To further enhance the representational power of ConvNets, attention mechanisms  have also been introduced. Among those, Zheng et al.  proposed a spatial and spectral attention mechanism (abbreviated as DHP-DARN) for the residual learning in which they cascade several channel-spatial-attention residual blocks to adaptively learn more informative channel-wise and spatial-domain features simultaneously. More recently, Xu et al.  proposed a design (abbreviated as SDPNet) based on two encoder-decoder networks to extract deep-level features from two types of source images with densely connected blocks to strengthen feature propagation. However, we experimentally observed that most of the existing residual learning methods fail when predicting the high-frequency information, such as edges and delicate structures in the residual image. The main reason for this observation is due to the fact that the increasing receptive field of the network in the deep layers. Motivated by this observation, we introduce an over-complete network, called HyperKite, for residual reconstruction task as our second contribution, which constrains the receptive field from increasing in deep layers thus extracting more high-frequency information.
The main contributions of this paper are summarized as follows:
A novel spatial constraint is introduced for the DIP up-sampling process. To the best of our knowledge, this is the first study that integrates both spatial and spectral constraints to the DIP up-sampling. The proposed spatial constraint significantly improves the spatial and spectral performance measures of the up-sampled HSIs.
An over-complete network, called HyperKite, is proposed for the residual reconstruction, which is highly capable of extracting high-frequency information of the residual image by appropriately constraining the receptive field of the network.
We conduct extensive experiments to clearly demonstrate the improvements brought in from our contributions to the HS pansharpening. We compared the fusion performance of DIP-HyperKite with both conventional and deep learning-based approaches. The deployment codes, pre-trained models, and final fusion results of our DIP-HyperKite as well as the comparison methods in the results and discussion will be publicly made available at https://github.com/wgcban/DIP-HyperKite.git.
The rest of this paper is organized as follows. Section II provides some basics DIP and over-complete representations. In Section III the proposed DIP-HyperKite is described in detail. Section IV describes the datasets and performance metrics that we used in the experiments. In Section V, the experimental results and discussions of different data sets are presented. Finally, the conclusions are drawn in Section VI.
Generally, ConvNets have an excellent ability to learn realistic image priors from a large amount of visual data, placing them in leading positions on the benchmarks of various image processing tasks [32, 33]. Contrary to the general opinion on deep networks that they require large data to capture image priors, DIP  has shown that a randomly initialized network can capture low-level image statistics before any training. Concretely, in HS pansharpening, DIP can generate the up-sampled HSI of the LR-HSI with spatial up-sampling factor
by taking a fixed randomly initialized vectoras the input, and utilizing the deep network as a parametric function . Next, the network is optimized over its parameters to obtain the up-sampled HSI as follows:
where is an energy function that controls the fidelity toward the LR-HSI , and is a regularization function based on prior knowledge. In , it has been shown that the regularization term can be implicitly substituted by the deep network. Therefore, the minimization problem in (1) has simplified to optimizing the network over its parameters as follows:
where denotes the optimal set of parameters of the network. Furthermore, the most straightforward and commonly utilized energy function in HS pansharpening is that the distance  between the down-sampled version of the up-sampled HSI and the LR-HSI as follows:
where denotes the down-sampling operator by a factor of .
Most of the current architectures in deep learning are “encoder-decoder” [34, 35, 36] based. Here, the encoder translates the high-dimensional input to a low-dimensional latent space while the decoder learns to take the latent low-dimensional representation back to a high-dimensional output. These type of architectures learn low-level features at their initial layers and high-level features at their deeper layers. These are termed under-complete networks as the input is taken to a lower spatial dimension in the latent space.
In signal processing, over-complete dictionaries are widely used for their highly robust characteristic . The number of basis functions here are more than the number of input signal samples which enables a higher flexibility for capturing structure in data. In , over-complete auto-encoders were found to be better feature extractors for denoising when compared to under-complete auto-encoders. In an over-complete network , the encoder takes the input data to a higher spatial dimension unlike a traditional encoder. This is achieved by using an upsampling layer after every convolutional layer in the encoder. Using upsampling layers in the encoder causes the receptive field to be constrained in the deep layers. This causes the deep layers in the network to learn more fine-context high-frequency information when compared to under-complete networks. Increase in receptive field for an over-complete network can be generalized in an layer as follows:
where the initial receptive field of the conv filter is assumed to be on the image . This phenomenon has been visualized in Fig 1. As shown in Figure 1 (b), by employing an upsampling layer after every convolutional layer in the encoder, the over-complete network restricts the receptive field size to a smaller region which forces the network to learn very fine edges as it tries to focus heavily on smaller regions. This is completely different from the conventional over-complete architectures where they perform downsampling after each convolution block which makes the network to focus on a much larger region in the input as shown in Figure 1 (a).
Over-complete networks in deep learning is a new topic and was initially proposed for medical image segmentation of small anatomy . It has since been successfully extended to solve fine-context requiring tasks like fine edge segmentation of 3D volumes , deep subspace clustering , MRI reconstruction , adversarial defense against videos  and image restoration problems like single image de-raining .
The overall flowchart of the proposed DIP-HyperKite for HS pansharpening is shown in Figure 2. As can be seen from Figure 2 the proposed method consists of two main steps. In the first step, the LR-HSI with pixels and spectral bands is up-sampled to the spatial resolution of the PAN image , where denotes the ratio between spatial resolution of and . We denote the output from the DIP process as . In the second step, we train an over-complete deep network which takes up-sampled HSI and the corresponding PAN images as inputs to predict the residual component between the up-sampled HSI and the reference HSI .
As shown in Figure 2, the low resolution HSI is up-sampled to the spatial resolution of the PAN image using the DIP. This recently introduced DIP method is different from the other existing up-sampling techniques such as bicubic interpolation, and LapSNR . The main advantage of DIP over these conventional methods is that it does not require a large dataset for training. In other words, for each LR image , the DIP network takes a fixed random tensor as an input and optimize the network parameters
by minimizing the loss functionwhich is defined in terms of the output up-sampled image and available LR-HSI as given in (3). In contrast, the LapSRN network utilized in  is highly relied upon the RGB image datasets and the knowledge adaptation techniques. Furthermore, the bicubic and LapSNR methods up-sample each band in the HSI separately; thus ignoring the high spatial correlation between the spectral bands, which results in the loss of spatial details. Although the DIP method is capable of producing high-quality upsampling images compared to the other existing methods, it only utilizes the information from the LR-HSI , thus only imposing constraint on the spectral domain. However, we observed that the quality of the sampled HSIs can be further improved by incorporating an additional spatial constraint in the loss function using the available PAN image . In the next section we explain our novel spatial+spectral loss function.
As we discussed in Section II-A, the energy function given in (3) enforces a constraint only in spectral domain by defining the distance between up-sampled HSI and the LR HSI . Instead, we propose a loss function (denoted by ) for HS DIP, which enforces the constraints in both spatial and spectral domains as follows:
where denotes the spectral response function, (scalar) is the spectral response of -th band, is the -th band image of the up-sampled HSI , is the element-wise multiplication, and is a regularization constant. The first term in (5) enforces the spectral constraint on as in (3), and the additional second term enforces the constraint in spatial domain on by utilizing the available PAN image .
In the simplest case, the spectral response function can be approximated as the average across all spectral bands (i.e. ) [47, 48]. In this scenario, the spatial loss term in (5) enforces that the average across all the spectral bands in up-sampled HSI to be close as possible to the PAN image , thus assuming a flat (i.e. uniform) spectral response. However, in general, this assumption is not valid as spectral response varies with wavelength coverage and different spectral bands describe the same semantic information across a wide spectral range with varying quality (i.e. PSNR) .
A recent attempt  estimate the spectral response function
by utilizing the larger eigenvalue of the structure tensor (ST) matrix (originally proposed in Harris corner detection algorithm). However, this method cannot be directly utilized in an end-to-end deep learning network due to the difficulties encountered while performing back-propagation In addition, it is highly computationally complex as it requires to compute derivatives of each band image along both -and -directions at each iteration of learning as part of constructing the structure tensor matrix. Instead, we propose a computationally lightweight and learnable spectral response function which can be easily integrated into the spatial loss term in (5) and can be simultaneously learned with DIP.
In this part we describe our novel way of estimating the spectral response function which is computationally lightweight, differentiable, and can be easily integrated into the existing DIP learning process. The overall computational procedure of estimating the spectral response function and thereby evaluating the spatial energy that we introduced for the DIP process in (5) is graphically depicted in Figure 3. First, we assume that the spectral response is proportional to the ratio of information in each spectral band. The next problem arises with this assumption is how do we quantify the information embedded in each spectral band. Motivated by recently proposed Squeeze-and-Excitation networks, we utilize global average pooling to quantify the global information present in each band. Formally, a statistic which quantifies the informative features in each spectral band is generated by shrinking the up-sampled HSI through its spatial dimensions such that the -th element in is calculated as:
Next, we use a simple gating mechanism to capture the dependencies among spectral bands using the band-wise descriptor that we obtained in the previous step. We parameterize the spectral response function by forming a bottleneck with two fully-connected (FC) layers around the non-linearity as follows:
where is the Sigmoid activation function,
is the ReLU non-linearity, andare the learnable weight matrices. Here, we use Sigmoid activation to guarantee that the spectral responses of all the bands sump up to one.
Figure 4 illustrates a U-Net like deep network that we used for the DIP method. The DIP network includes five down-sampling blocks , five upsampling blocks , and five skip-connection blocks (
). We use stride convolutions as the down-sampling operator, bi-linear up-sampling as the upsampling operator, and Lanczos2 as non-linearity. We initialize the input noise vector with uniform noise betweenand . The Table I
tabulates the values of all the hyperparameters of DIP network.
|Number of iterations||1300|
Our motivation to design an over-complete network for the residual learning task emerged after observing the residual images between DIP up-sampled image and reference HSI as visualized in Figure 5. As we can see from Figure 5, the residual images correspond to different wavelength band mainly consists of boundary information like edges and other high-frequency components. In order to accurately capture this fine information, we design an over-complete HyperKite for the residual learning as shown in Figure 6.
The proposed HyperKite consists of an Initial Feature Extraction Network (IFEN), a High-dimensional Feature Mapping Network (HDFMN), and a Final Residual Reconstruction Network (FRRN). The input to the HyperKiteis obtained by concatenating the up-sampled HSI and the PAN image along the spectral dimension (denoted as ). The HyperKite starts with the IFEN layer, where one
convolutional layer is applied followed by Batch Normalization (BN) and LeakyReLU non-linearity to extract initial feature representation as:
where denotes the convolution followed by LeakyReLU and batch normalization, denotes the extracted features transformed from in dimensional pixel-space, and is the number of filters in the convolutional layer. Figure 7 (a) shows six example feature maps of for the 20-th patch of the Pavia Center dataset that we will introduce in Section IV. As we can see from the figure, the initial feature extraction network extract low-level feature of the input . In order to capture high-level features that that required for the residual learning, we successively transform the output of IFEN into three higher-dimensional pixel-spaces by utilizing the “bilinear” up-sampling denoted as , , and . Then we perform convolution followed by BN and LeakyReLU to extract meaningful high-level features at each higher-dimensional space as,
where denotes the “bilinear” interpolation by a factor of , denotes the convolution layer followed by BN and LeakyReLU at the d-th higher-dimensional feature space. Next, we successively transform the extracted high-level features to the original dimensional space by employing “bilinear” downsampling and skip connections. Formally, we can define the operations of HDFMN as,
where denotes the “bilinear” downsampling by a factor of , denotes the feature concatenation operator, denotes the convolution followed by BN and LeakyReLU at the d-th dimensional feature space, and is the most relevant high-level features obtained at space. After flowing through all the downsampling layers (decoder blocks), a convolutional layer is employed to recover the spectral dimension, and reconstruct the residual image as:
where denotes the convolutional layer followed by BN and LeakyReLU employed at FRNN.
After carrying out DIP up-sampling and residual prediction of our DIP-HyperKite, the DIP up-sampled HSIs and are created. Finally, we can obtain the fused HSI by using and as:
To this end, we utilize loss to optimize HyperKite, which has been demonstrated as a superior choice for remote sensing image SR [29, 5] and also experimentally verified to be effective for improving the fusion accuracy. For the training set where is the -th input, is the corresponding reference HSI, and is the total number of training HSIs in the training set. The loss function utilized for HyperKite training can be defined as follows:
Moreover, all the parameter details of our proposed HyperKite are summarized in Table II
. We train our network in Pytorch framework using an NVIDIA Quadro 8000 GPU. We use Adam optimizer with a learning rate of 0.001, weight decay of 0.0001 and momentum 0.9 to train HyperKite. We use a batch size of 4 and train the network for 2500 epcochs.
To evaluate the performance of our proposed DIP-HyperKite for HS pansharpening, we conduct a series of experiments on three HS data sets, which are described in detail below.
The Pavia Center scene was captured by the ROSIS camera . The original HSI consists of spectral bands spanning from to nm. The spatial size of the original image is pixels, where a single pixel is equivalent to geometric resolution of m. The thirteen noisy spectral bands in the original HSI were discarded, thus resulting in a HSI with spectral bands spanning from to nm. In addition, a rectangular area of size pixels with no information at the center of the original HSI was also discarded, and the resulting “two-part” image with size of was used for the experiments. Following the same experimental procedure outlined in , we also used only the top-left corner of the HSI with size of , and partitioned it into cubic patches of size with no overlap, which constituted the reference images () of Pavia Center data set. In order to generate PAN images () and LR-HSIs () corresponding to each HR-HSI, we utilize Wald’s protocol . Following the Wald’s protocol, we generate PAN images () of size by averaging first spectral bands of HR reference HSI. In order to generate LR-HSIs of size , we spatially blurred the HR reference HSI with an Gaussian filter, and then downsampled the result. The scaling factor was set to for the Pavia Center dataset. We randomly select cubic patches for the training, and the rest of the seven patches forms the testing set of the Pavia Center dataset.
The Botswana scene was acquired by the Hyperion sensor on the NASA’s Earth Observing 1 (EO-1) satellite. The original Botswana HSI consists of spectral bands spanning from to nm with spectral resolution of nm. The spatial size of the original Botswana image is pixels. We remove the uncalibrated and noisy spectral bands in the original image, thus resulting in a HSI with spectral bands. Following the same experimental procedure outlined in , we also use only the top-left corner of the HSI with size of , and partitioned it into 20 cubic patches of size with no overlap, which constitute the reference images of the Botswana dataset. In order to generate PAN images and the LR-HSIs corresponding to each HR reference image, we follow the Wald’s protocol. We generate PAN images of size by averaging first spectral bands of HR-HSI. To generate LR-HSIs , we spatially blur the HR-HSI with an window, and perform down-sampling. For the Botswana dataset, we set the down-sampling factor to 3. We randomly select cubic patches for training, and the rest of the patches are utilized for testing.
The Chikusei scene was captured by the Headwall Hyperspec-VNIR-C imaging sensor over the agricultural and urban areas in Chikusei, Japan. The original Chikusei HSI consists of 128 spectral bands spanning from to nm. The spatial size of the Chikusei HSI is pixels, where a single pixel is equivalent to geometric resolution of m. We used top-left corner of the HSI with size of , and partitioned it into cubic patches of size with no overlap, which constituted the reference images of Chikusei dataset. Following the Wald’s protocol, we generate PAN images of size by averaging first spectral bands of high resolution HSI. To generate LR-HSIs , we spatially blur the HR-HSI with an window, and perform down-sampling. For the Chikusei dataset, we set the down-sampling factor to . We randomly select 61 cubic patches for training, and the rest of the patches are utilized for testing.
In order to evaluate the quality of the proposed pansharpening method, we use different image quality measures. Following , we use Cross-Correlation (CC), Spectral Angle Mapping (SAM), Root Mean Square Error (RMSE), Errur Relative Globale Adimensionnelle Desynthese (ERGAS), and Peak Signal to Noise Ratio (PSNR). These measures have been widely used in the HSI processing community and are appropriate for evaluating fusion in spectral and spatial resolutions.
The CC metric characterizes the geometric distortion, and is defined as:
where, CCS denotes the cross-correlation for a single-band image as follows:
where, is the sample mean of . The ideal value of CC is , which indicates that the two HSIs are highly correlated.
SAM is a spectral measure which is defined as:
where given the vectors , ,
where denotes the inner product between and , and is the norm. The SAM is a measure of the spectral shape preservation. The SAM values reported in our experiments are in degrees and thus belongs to . The optimal value of SAM is . The values of SAM reported in our experiments have obtained by averaging the values for all image pixels.
The reconstruction SNR (RSNR) or root mean square error (RMSE) is related to the difference between the reference and fuse images, which is defined as follows:
Relative dimensionless global error in synthesis (ERGAS) calculates the amount of spectral distortion in the image. The ERGAS measure is defined as:
where is the ratio between the linear resolution of the PAN image and the HSIs. defined as:
, and i the sample mean of the -th band of . The ideal value of ERGAS is 0.
PSNR also assess the fusion quality of each bans, and the average PSNR is calculated as:
where is the maximum pixel value in the th band of . A larger value of PSNR indicates a higher reconstruction quality in spatial information of the fusion result.
This section presents the results of our proposed DIP-HyperKite for HS pansharpening, and compares it with the state-of-the-art methods on the Pavia Center, Botswana, and Chikusei datasets. For better clarity, we divide this section into two parts. In the first part (section V-A), we highlight the contribution from our proposed spatial+spectral energy function for the DIP up-sampling process and compare it with available state-of-the-art up-sampling techniques such as nearest-neighbor, bicubic, LapSRN, and DIP with only spectral loss. In the second part (section V-B), we present the final fusion results that we obtain from our proposed HyperKite network and compare it with classical and deep-learning-based pansharpening approaches.
As we discussed in Section III-A, the recently proposed pansharpening methods such as DHP-DARN  and DHP  utilized the DIP process to up-sample the LR-HSI instead of using the nearest-neighbor, bicubic, or LapSRN techniques due to its excellent performance. However, we have observed that the quality of up-sampled HSI can be further improved by carefully redesigning the loss function used in the DIP optimization. Instead of only utilizing spectral constraint in the DIP loss function, we derived a novel loss function with spectral and spatial constraints. This section demonstrates the performance improvement brought by our proposed spatial+spectral loss function to the DIP up-sampling process. We compare DIP with the proposed spatial+spectral loss against the DIP with spectral loss only. Furthermore, to make the analysis more comprehensive, we also added a conventional up-sampling techniques used in the HS pansharpening domain, such as nearest-neighbor, and bicubic. Further, motivated by the experimental discussion in , we also added the results from LapSRN , which is trained on a large amount of RGB images.
We start our discussion with the effect of the regularization constant in our proposed spatial+spectral loss function as defined in (5). The variation of CC, SAM, RMSE, ERGAS and PSNR values when varying the regularization parameter from to for the Pavia Center, Botswana and Chikusei datasets are shown in Figure 8, Figure 9, and Figure 10, respectively. As can be seen from these figures, as the value of the regularization constant increases, the performance metrics also begin to improve, then hit a saturation point, and then degrade, for all three data sets. Therefore, we carefully select the regularization constant for each dataset by considering all the performance metrics. For example, consider the variation of the performance metrics with the regularization constant for the Pavia Center dataset which is shown in Figure 8. As we can see, when the value of the regularization constant increases from to , we can see that CC, RMSE, ERGAS, and PSNR start to improve, and when increases beyond the performance metrics start to degrade. Therefore, we set as the optimal value of the regularization constant of our proposed spatial+spectral energy term for the Pavia Center dataset. The variation in performance metrics with the regularization parameter for the Botswana and Chikusei datasets are also shown in Figure 9 and Figure 10, respectively. Following the same analysis we described for the Pavia Center dataset, we select as the optimal value of the regularization constant for the Botswana and Chikusei datasets. Note that the performance improvement bringing from our proposed spatial+spectral loss function for the DIP upsampling process. Under the optimal regularization constant (), our spatial+spectral energy function improves the quality of up-sampled HSIs over the spectral loss (equivalent to point in Figure 8, 9, and 10) in-terms of CC, RMSE, ERGAS, and PSNR metrics by 6.64%, 37.8%, 63.4%, 37.3%, and 19.5%, respectively for the Pavia Center dataset. For the Botswana dataset, our proposed loss function improves CC, SAM, RMSE, ERGAS, and PSNR metrics over the DIP with spectral loss by 3.3%, 4.2%, 19.1%, 14.7%, 7.0%, and , respectively. Similarly for the Chikusei dataset, our method improves CC, RMSE, ERGAS, and PSNR metrics compared to DIP with spectral loss by 1.8%, 26.3%, 22.9%, 26.2%, and 8.9%, respectively.
Let us first consider the case where the regularization constant
is set to zero. This is equivalent to the case where we only have the spectral constraint. In this case, the DIP network minimizes the distance between the down-sampled version of the up-sampled HSI and the LR-HSI. Since the down-sampling operator acts as a low-pass filter in the frequency domain, what DIP network actually minimizes is that the distance between the low-pass version of the up-sampled HSI and the LR-HSI. Because of this reason, the up-sampled HSI from the DIP network trained only with spectral constraint lacks the high frequency components such as edge information and fine structures. Now let us consider the case where we have both spatial and spectral constraint in the DIP loss function. As we described in SectionIII-A, we combined the spatial and spectral constraints via regularization parameter . The value of controls the fidelity of the predicted PAN image towards the actual PAN image. Since the predicted PAN image and the up-sampled HSI are coupled via spectral response function, to make the predicted PAN image close as possible to the actual PAN image, the DIP network tries to predict some of the high-frequency components such as edges and fine structures in the PAN image, while maintaining the low-pass version of the up-sampled HSI close to the LR-HSI. Therefore, the regularization constant what actually controls is the amount of high-frequency components fused from PAN image to the up-sampled HSI. This explain the observation that we made from Figure 8, Figure 9, and Figure 10, where when the value of the regularization parameter increases the DIP network embed some of the high-frequency information to the up-sampled HSI, which ultimately helps to improve the quality of the up-sampled image. However, when the value of the regularization constant is large, the spatial loss term starts to dominate the loss function, and resulting in drop of spectral-domain performance metrics such as SAM and ERGAS. Therefore, we can achieve high-quality up-sampled HSIs by appropriately controlling the regularization parameter in spatial+spectral energy function.
In the previous section, we determined the optimal value of the regularization constant for our proposed spatial+spectral loss function for the three datasets. In this section, we compare DIP with our spatial+spectral loss against DIP with only spectral loss, and other commonly used up-sampling techniques such as nearest neighbor, bicubic, and LapSRN, both qualitatively and quantitatively.
Table III summarizes the quantitative results of nearest-neighbor, bicubic, LapSRN, and DIP up-sampling methods for the Pavia Center dataset. For this dataset, our proposed DIP method improves the quality of up-sampled images in terms of CC, SAM, RSNR, ERGAS, and PSNR performance measures by , , , and , respectively. We have also noticed that this improvement is accompanied by a drop in the SAM index which is around 1.8% compared to the DIP with spectral loss. This fall in the SAM index is not that significant compared to the improvements we have achieved in terms of all other performance measures. Further, we can cross-verify these quantitative results with qualitative results that we have shown in Figure 11 for the Pavia Center dataset. We can see that the DIP up-sampled images with our proposed spatial+spectral constraint looks much more closer to the reference image, and have predicted very fine structures and edges compared to other upsampling methods.
We also summarize the quantitative results for different up-sampling methods for the Botswana dataset in Table IV. As we can see, DIP with the proposed spatial+spectral loss improves the quality of up-sampled images in terms of all the performance metrics by a significant margin: CC value increased by 0.4%, SAM value reduced by 4.3%, RMSE value reduced by 14.0%, RSNR value improved by 11.2%, ERGAS value reduced by 1.7%, and PSNR value value increased by 5.2%. Also, we can verify these quantitative results with the qualitative results shown in Figure 12 for the Botswana dataset. Similar to the qualitative results that we have observed for the Pavia Center dataset, we can see the the up-sampled images using DIP with our proposed spatial+spectral loss is much more closer to the reference HSI.
Finally, we summarize the quantitative results for different upsampling methods for the Chikusei dataset in Table V. In this case also, the performance of DIP up-sampled images with our proposed spatial+spectral loss outperforms five out of six performance measures that we considered for the analysis. As we can see from the Table V, our DIP method has increased the value of CC by 0.1%, has decreased the value of RMSE by 26.3%, has increased the RSNR by 22.9%, has decreased the ERGAS by 26.2%, and has increased the PSNR by 8.9% over the state-of-the-art results. Similar to the Pavia Center dataset, in this dataset also we have observed that the drop in SAM index; however this is negligible compared to the performance gained in terms of the other quantitative measures. Furthermore, following the similar trend with other datasets, we have included the qualitative results in Figure 13 for the Chikusei dataset. From the qualitative results also we can see that DIP with our spatial+spectral constraint is able to predict very fine structures and edges more accurately than the other methods.
In summary, we have shown that the DIP method with our proposed spatial+spectral constraints outperforms the state-of-the-art up-sampling methods with a significant margin in all the datasets that we have considered in this study. In the next section, we present final fusion results and compare them with state-of-the-art pansharpening algorithms, qualitatively and quantitatively.
In this section we compare final fusion results from our DIP-HyperKite with the state-of-the-art pansharpening approaches such as PCA , GFPCA , BF , BFS , SFIM , GS , GSA , MTF-GLP-HPM , CNMF , MTF-GLP , HySure , HyperPNN , and DHP-DARN  for Pavia Center, Botswana, and Chikusei datasets.
The average quantitative results for different pansharpening approaches on the testing set of the Pavia Center dataset are shown in Table VI. As can be seen from Table VI, our proposed HyperKite achieves the highest CC value compared to all the other pansharpening approaches that we have considered in this study. A higher CC value indicates that the fused HSI is closer to the actual HSI with less geometric distortion. Furthermore, our proposed DIP-HyperKite achieved the smallest values for SAM, RMSE, and ERGAS performance measures, indicating the best fusion performance over the other pansharpening approaches. Especially the smallest SAM and ERGAS indicate that our DIP-HyperKite can fuse HSIs with less spectral distortion than the state-of-the-art methods. In addition, our DIP-HyperKite improved the PSNR metric by 3.6% over the state-of-the-art value. To further verify the fusion quality of our proposed DIP-HyperKite, we present qualitative results in Figure 14 for the Pavia Center dataset. To better highlight the fusion quality between different pansharpening approaches, we have shown the error plots along with the RGB composite image for each fused HSI. According to the figure, the error maps corresponding to our DIP-HyperKite are much purple than the other pansharpening approaches, indicating minor fusion error. This is mainly because of the ability of our HyperKite network to predict very fine structures and edges by constraining the receptive field of the deep network.
The Table VII summarizes the average quantitative results of different fusion methods on the Botswana dataset. Similar to the Pavia Center dataset, we can see that our DIP-HyperKite outperforms all the other HS pansharpening approaches by a considerable margin. Concretely, our DIP-HyperKite has improved the CC by 1.03%, and PSNR by 3.71%. In addition, our method has reduced the SAM by 9.68%, RMSE by 8.57%, and ERGAS by 10.85%. Furthermore, we have shown qualitative results related to different pansharpening approaches on Botswana dataset in Figure 15. By observing the RGB images and error plots in Figure 15, we can see that the fusion results related to our method are much closer to reference image than the other pansharpening approaches.
In this section, we compare the qualitative and quantitative results on the Chikusei dataset. The average quantitative results of different pansharpening approaches on the Chikusei dataset is listed in Table VIII. Similar to the results we have observed for the other two datasets, for this dataset also our proposed DIP-HyperKite outperforms all the pansharpening approaches that we considered for the analysis. Our pansharpening method improves the CC, SAM, RMSE, RSNR, ERGAS, and PSNR performance measures over the state-of-the-art results by 1.45%, 19.0%, 6.67%, 0.67%, 18.5%, and 0.90%, respectively. To further highlight the fusion quality of our method we present the qualitative results of selected panshaprpening approaches for the Chikusei dataset is shown in Figure 16. By observing the RGB composite image and the error maps we can clearly see that the fusion quality of the proposed DIP-HyperKite is higher than the other pansharpening approaches.
In this paper, we have presented a novel approach for HS pansharpening, which mainly consists of three steps: (1) Up-sampling the LR-HSI via DIP, (2) Predicting the residual image via over-complete HyperKite, and (3) Obtaining the final fused HSI by summation. The previously proposed DIP methods for HS up-sampling only impose a constraint in the spectral-domain by utilizing LR-HSI. To better preserve both spatial and spectral information, we first exploited an additional spatial constraint to DIP by utilizing the available PAN image, thereby introduced both spatial and spectral constraints to the DIP. The comprehensive experiments conducted on three HS datasets showed that our proposed spatial+spectral loss function significantly improved the quality of up-sampled HSIs in CC, RMSE, RSNR, SAM, ERGAS, and PSNR performance measures. Next, in the residual prediction task, we have shown that the residual component between up-sampled HSI and the reference HSI primarily consists of edge information and very fine structures. Motivated by this observation, we proposed a novel over-complete deep-learning network for the residual prediction task. In contrast to the conventional under-complete representations, we have shown that our over-complete network is competent to focus on high-level features such as edges and fine structures by constraining the receptive field of the network. Finally, the fused HSI is obtained by adding the residual HSI and the up-sampled HSI. The comprehensive experiments conducted on three HS datasets demonstrated the superiority of our DIP-HyperKite over the other state-of-the-art results in terms of qualitative and quantitative evaluations.
P. Ghamisi, J. Plaza, Y. Chen, J. Li, and A. J. Plaza, “Advanced spectral classifiers for hyperspectral images: A review,”IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 1, pp. 8–32, 2017.
G. Masi, D. Cozzolino, L. Verdoliva, and G. Scarpa, “Pansharpening by convolutional neural networks,”Remote Sensing, vol. 8, no. 7, p. 594, 2016.
, “Photo-realistic single image super-resolution using a generative adversarial network,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.
P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in
Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103.
, vol. 15, no. 50. Citeseer, 1988, pp. 10–5244.