The hyperspectral image (HSI) has been widely used in earth observation applications because of the rich information in its abundant spectral bands. However, due to the cost and hardware limitations of imaging systems, the spatial resolution of an HSI decreases when numerous spectral signals are collected simultaneously [deng2021m2h, lanaras2018super, palsson2017multispectral]. Because of this drawback, the HSI does not always meet the demands of high-accuracy earth observation tasks. HSI super-resolution, which aims to estimate a high-resolution (HR) image from a single low-resolution (LR) counterpart, is one of the promising solutions. Currently, there are two main approaches to HSI super-resolution: 1) HSI fusion with an HR auxiliary image (e.g. a panchromatic image) and 2) single HSI super-resolution without any auxiliary information. Generally, the image fusion approach implements super-resolution with filter-based methods that integrate the high-frequency details of the HR auxiliary image into the target LR HSI [yokoya2017hyperspectral, ghamisi2019multisource], such as component substitution [qu2017hyperspectral, khademi2017multi], spectral unmixing [brezini2021hypersharpening, lanaras2017hyperspectral], and Bayesian probability [benediktsson2018remote, wang2018high]. However, this approach relies heavily on a high-quality auxiliary image with a high imaging cost, which limits its practical applications. In contrast, single HSI super-resolution does not need any other prior or auxiliary information and therefore has greater practical feasibility.
In recent years, single HSI super-resolution technologies have attracted increasing attention in remotely sensed data enhancement [RN17]. In particular, deep learning (DL)-based single image super-resolution (SISR) methods have achieved significant performance improvements [LN02]. The first DL-based method for single image super-resolution, the super-resolution convolutional neural network (SRCNN), was proposed by Dong et al. [LN01]. To recover finer texture details from low-resolution images at large upscaling factors, Ledig et al. [RN2] proposed a super-resolution generative adversarial network (SRGAN) by introducing a generative adversarial network (GAN). Since then, various GAN-based deep learning models have been developed and proven effective in improving the quality of image super-resolution [RN8, RN3, RN10, RN6].
However, existing GAN-based super-resolution approaches have mainly focused on RGB images, and the reflectance radiance characteristics between neighbouring spectral channels were not considered in their training processes. Therefore, applying these models directly to HSI super-resolution leads to the loss of spectral-spatial details in the generated images. For example, Fig. 1 shows a comparison between an original high-resolution HSI and its super-resolution counterpart generated by the SRGAN model [LN02]. Obvious spectral-spatial distortions can be observed in the generated super-resolution HSI (see the red and yellow frames in Fig. 1). Mathematically, recovering spectral-spatial details in a super-resolution HSI is an under-determined inverse problem in which a large number of plausible details in the high-resolution image need to be characterised from low-resolution information. The complexity of this under-determined problem increases exponentially with the upscaling factor. With high upscaling factors (e.g. higher than 8), the spectral-spatial details of the generated super-resolution HSIs can be distorted.
A likely cause of the spectral-spatial distortion is mode collapse in the optimisation process of GANs [xiong2020improved, li2020tackling], in which a GAN model gets stuck in a local minimum and learns only a limited number of modes of the data distribution. Some studies have attempted to address mode collapse in GAN models. For instance, Hou et al. [hou2020unsupervised] improved the diversity of the generator and attempted to avoid mode collapse by adding a reverse generating module and an adaptive domain distance measurement module to the GAN framework. Their findings showed that these modules helped to address the insufficient diversity of GAN models in remote sensing image super-resolution. Ma et al. [ma2019super] introduced a memory mechanism into GAN models to save feedforward features and extract local dense features between convolutional layers, which showed some effectiveness in increasing spatial details during reconstruction.
To benefit from the remarkable super-resolution performance of GAN-based models and address the spectral-spatial distortions in HSI super-resolution, in this study, we proposed a novel latent encoder coupled GAN architecture. We treated an HSI as a high-dimensional manifold embedded in a higher-dimensional ambient latent space. The optimisation of GAN models was converted to a problem of learning the feature distributions of high-resolution HSIs in the latent space, making the spectral-spatial feature distributions of the generated super-resolution HSIs close to those of their original high-resolution counterparts. Our contributions are as follows:
1) A novel GAN-based framework has been proposed to improve HSI super-resolution quality. The improvement was achieved in two ways. Firstly, to improve the spectral-spatial fidelity, a short-term spectral-spatial relationship window (STSSRW) mechanism has been introduced into the generator in order to facilitate spectral-spatial consistency between the generated super-resolution and real high-resolution HSIs in the training process. Secondly, to alleviate the spectral-spatial distortion, a latent encoder has been introduced into the GAN framework as an extra module so that the generator can better estimate local spectral-spatial invariance in the latent space.
2) A spectral-spatial realistic perceptual (SSRP) loss has been proposed to guide the optimisation of the under-determined inverse problem, alleviate the spectral-spatial mode collapse that occurs in the HSI super-resolution process, and help retrieve high-quality spectral-spatial details in the super-resolution HSI, especially for high upscaling factors (e.g. 8). The SSRP loss function was able to enforce spectral-spatial invariance in the end-to-end learning process and made the generated super-resolution features closer to the manifold neighbourhood of the targeted high-resolution HSI.
The rest of this work is organised as follows: Section II introduces related work on existing GAN-based methods for HSI super-resolution tasks; Section III details the proposed approach; Section IV presents experimental evaluation results; Section V concludes the work.
II Related work
A traditional GAN-based super-resolution model contains two neural networks: a generator that produces sample images from low-resolution images and a discriminator that distinguishes real images from generated ones [RN18]. The generator and discriminator are trained in an adversarial fashion to reach a Nash equilibrium in which the generated super-resolution samples become indistinguishable from real high-resolution samples.
Focusing on the spectral and spatial characteristics of HSI data, various adversarial strategies have been proposed to improve GAN performance on HSI super-resolution tasks [RN20]. For example, Zhu et al. [RN21] proposed a 3D-GAN to improve the generalisation capability of the discriminator in spectral and spatial feature classification with limited ground truth HSI data. Jiang et al. [RN13] designed a spectral and spatial block inserted before the GAN generator in order to extract high-frequency spectral-spatial details for reconstructing super-resolution HSI data.
Some methods for improving the overall visual quality of generated HSIs were also proposed by constructing a reliable mapping function between LR and HR HSI pairs. For example, Li et al. [RN22] proposed a GAN-based model for multi-temporal HSI data enhancement. In their model, a 3D-CNN based upscaling block was used to collect more texture information in the upscaling process. Huang et al. [RN23] integrated the residual learning based gradient features between an LR and HR HSI pair with a mapping function in the GAN model, and achieved HSI super-resolution with improved spectral and spatial fidelity.
The performance of a GAN-based model mainly depends on its generator, its discriminator and its loss functions. Therefore, existing studies on improving GAN-based models for HSI super-resolution have focused on their design and optimisation.
II-A Design of the generator and the discriminator
In the generator, where LR data are upscaled to a desired size, upscaling filters are the most important components that influence the performance of the generator in terms of accuracy and speed [RN25, RN24]. Ledig et al. [RN2] employed a deep ResNet with a skip-connection in the generator to produce super-resolution images at a given upscaling factor. Jiang et al. [RN8] proposed an edge-enhancement GAN generator in which a group of dense layers was introduced in order to capture intermediate high-frequency features and recover the high-frequency edge details of HSI data.
With regard to the discriminator, it was found that a deeper network architecture had greater potential in discriminating real images from generated ones [RN26, RN2]. For example, Rangnekar et al. [RN27] proposed a GAN-based deep convolutional neural network with seven convolutional layers in the discriminator for aerial spectral super-resolution. Arun et al. [RN28] used six 3D convolutional filters and three deconvolution filters in the discriminator to discriminate the spectral-spatial features of real HR HSIs from the generated counterparts.
In the design of the generator and the discriminator, the computational cost needs to be considered. The upscaling process in the generator can significantly increase the computational cost, at the scale of times for an upscaling factor of . Meanwhile, a deep learning-based discriminator typically requires a large amount of computational time and memory for extracting and discriminating the high-dimensional non-linear mapping features of the input data. A more efficient generator and discriminator are therefore required for fast and accurate HSI super-resolution.
II-B Design of loss functions
The loss function plays a very important role in optimising the performance of GAN models [RN30, RN29]. In the traditional GAN model, the generator and the discriminator are trained simultaneously to find a Nash equilibrium of a two-player non-cooperative game. A min-max loss function is used, which is equivalent to minimising the Jensen-Shannon (JS) divergence between the distributions of generated data and real samples when the discriminator is optimal. However, GAN training is hard and can be slow and unstable. The original GAN model has several issues, such as the difficulty of reaching a Nash equilibrium, the low-dimensional supports of the sample distributions, and mode collapse [liu2020posegan, zhang2020ndvi]. To improve training stability and address mode collapse in the original GAN model, several improved adversarial loss functions have been developed, which can be divided into three categories: 1) the pixel-wise loss, 2) the perceptual loss, and 3) the probabilistic latent space loss.
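The relationship between the min-max objective and the JS divergence mentioned above can be checked numerically. The following is a minimal sketch on discrete toy distributions, not part of the proposed method; the function names and the four-mode example are illustrative assumptions. It relies on the standard result that, for the optimal discriminator, the objective equals 2 JSD - log 4.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def minimax_value_at_optimal_discriminator(p_real, p_gen):
    """E_real[log D*] + E_gen[log(1 - D*)] with D* = p_real / (p_real + p_gen)."""
    p_real = np.asarray(p_real, dtype=float)
    p_gen = np.asarray(p_gen, dtype=float)
    total = p_real + p_gen
    real_mask, gen_mask = p_real > 0, p_gen > 0
    real_term = np.sum(p_real[real_mask] * np.log(p_real[real_mask] / total[real_mask]))
    gen_term = np.sum(p_gen[gen_mask] * np.log(p_gen[gen_mask] / total[gen_mask]))
    return float(real_term + gen_term)

# Hypothetical toy data: four equally likely real modes, a generator covering two.
p_real = np.array([0.25, 0.25, 0.25, 0.25])
p_gen = np.array([0.70, 0.30, 0.00, 0.00])

value = minimax_value_at_optimal_discriminator(p_real, p_gen)
jsd = js_divergence(p_real, p_gen)
print(value, 2.0 * jsd - np.log(4.0))  # the two numbers agree
```

Because the generator in this toy example misses two of the four modes, the JS divergence is strictly positive, which is the situation the improved loss functions below try to avoid.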
In the first category, the pixel-wise mean squared error (MSE) loss is commonly used for measuring the discriminative difference between real and generated data in GAN models [RN2]. However, the MSE has some issues, such as the loss of high-frequency details, over-smoothing, and the sparsity of weights [jiang2019respiratory, yarlagadda2018satellite, bhattacharjee2018posix]. Some studies have attempted to solve these issues. Chen et al. [chen2018attention] introduced a sparse MSE function into the GAN model in order to measure the high-frequency information in the spatial attention maps of images. Their results showed that the GAN with the sparse MSE loss was able to provide more viable segmentation annotations of images. Zhang et al. [zhang2020supervised] emphasised that the MSE loss always led to over-smoothing in the GAN optimisation process. Therefore, they introduced a supervised identity-based loss function to measure the semantic differences between pixels in the GAN model. Lei et al. [lei2019gcn] attempted to solve the sparsity of the pixel-wise weights in the GAN model, and proposed two additional metrics, the edge-wise KL-divergence and the mismatch rate, for measuring the sparsity of pixel-wise weights and the wide-value-range property of edge weights.
In the second category, existing studies used different perceptual losses to balance perceptual similarity, based on high-level features, and pixel similarity. Cha et al. [cha2017adversarial] proposed three perceptual loss functions in order to enforce the perceptual similarity between real and generated images. These functions achieved improved performance in generating high-resolution images with GAN models. Luo et al. [luo2018bi] introduced a novel perceptual loss into a GAN-based SR model, named Bi-branch GANs with soft-thresholding (Bi-GANs-ST), to improve the objective performance. Blau et al. [blau2018perception] proposed a perceptual-distortion loss function in which the generative and perceptual quality of GAN models were jointly quantified. Rad et al. [rad2019srobb] proposed a pixel-wise segmentation annotation to optimise the perceptual loss in a more objective way. Their model achieved strong performance in finding targeted perceptual features.
In the third category, Bojanowski et al. [RN11] investigated the effectiveness of latent optimisation for GAN models, and proposed a Generative Latent Optimisation (GLO) strategy for mapping a learnable noise vector to the generated images by minimising a simple reconstruction loss. Compared to a classical GAN, GLO obtained competitive results without using the adversarial optimisation scheme, which is sensitive to random initialisation, model architecture, and the choice of hyper-parameter settings. Training a stable GAN model is challenging. Therefore, the Wasserstein GAN (WGAN) [LN04] was proposed to improve the stability of learning and reduce mode collapse. The WGAN replaces the discriminator with a critic, which scores the realness of a given image in the probabilistic latent space and is trained using the Wasserstein loss. Rather than discriminating between real and generated images (i.e. predicting the probability of a generated image being real), the critic maximises the difference between its predictions for real images and generated images (i.e. it predicts a "realness" score for a generated image). Gulrajani et al. [RN32] further improved WGAN training by adding a regularisation term that penalises the deviation of the critic's gradient norm with respect to the input from one, and the model was named WGAN-GP.
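The critic objective with a gradient penalty described above can be illustrated with a small NumPy sketch. It assumes a linear critic, a hypothetical stand-in chosen only because its gradient with respect to the input is known in closed form; it is not the architecture of any cited work. The penalty weight of 10 is the commonly used default and is likewise an assumption here.

```python
import numpy as np

def critic(x, w, b):
    """Hypothetical linear critic f(x) = x . w + b, applied row-wise."""
    return x @ w + b

def wgan_gp_critic_loss(real, fake, w, b, gp_weight=10.0, rng=None):
    """Critic loss: E[f(fake)] - E[f(real)] + gp_weight * E[(||grad f|| - 1)^2]."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Random points on the lines between real and generated samples.
    eps = rng.uniform(size=(real.shape[0], 1))
    interpolates = eps * real + (1.0 - eps) * fake
    # For a linear critic the input gradient equals w at every interpolate;
    # a deep critic would need automatic differentiation here.
    grads = np.broadcast_to(w, interpolates.shape)
    grad_norms = np.linalg.norm(grads, axis=1)
    penalty = gp_weight * np.mean((grad_norms - 1.0) ** 2)
    wasserstein_term = critic(fake, w, b).mean() - critic(real, w, b).mean()
    return float(wasserstein_term + penalty)

# Toy samples: four real vectors of ones, four generated vectors of zeros.
real = np.ones((4, 3))
fake = np.zeros((4, 3))

unit_loss = wgan_gp_critic_loss(real, fake, np.array([1.0, 0.0, 0.0]), 0.0)
steep_loss = wgan_gp_critic_loss(real, fake, np.array([2.0, 0.0, 0.0]), 0.0)
print(unit_loss, steep_loss)
```

With a unit-norm weight vector the penalty vanishes and only the score difference remains; doubling the weight norm separates the scores further but is outweighed by the penalty, which is how the regulariser keeps the critic close to 1-Lipschitz.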
III The proposed LE-GAN for single HSI super-resolution
To address the challenge of spectral-spatial distortions caused by mode collapse during the optimisation process, we proposed a novel GAN model coupled with a latent encoder, named LE-GAN. In the proposed framework, the generator and discriminator were designed to improve the super-resolution performance and reduce the computational complexity. Inspired by the encoder coupled GAN [RN11], we developed a latent encoder embedded in our GAN framework to help the generator achieve a better approximation of feature maps and thus generate better super-resolution results. In addition, we designed a spectral-spatial realistic perceptual (SSRP) loss function to optimise the under-determined inverse problem by providing a trade-off between aligning the distributions of the generated super-resolution and targeted high-resolution HSIs and increasing the spectral-spatial consistency between them.
III-A Model architecture
We have made two major changes to the traditional GAN framework: 1) proposed an improved generator, denoted as , with a simplified ResNet structure, and 2) introduced a latent encoder, denoted as , into the GAN framework. The network architecture is shown in Fig.2. It consists of a generator, a discriminator and an encoder.
III-A1 The architecture of the generator model
To improve the spectral-spatial reconstruction quality with low distortion and reduce the computational complexity, a short-term spectral-spatial relationship window (STSSRW) derived model was proposed, denoted as in our GAN framework. The architecture of the proposed generator is shown in Fig. 3. It serves three functions: low-resolution spectral-spatial feature extraction, residual learning with a contiguous memory mechanism, and super-resolution HSI reconstruction.
Firstly, a feedforward 3D convolutional filter was introduced for the low-resolution spectral-spatial feature extraction. Unlike traditional RGB image super-resolution approaches that use 2D convolutional filters for spatial feature extraction, the HSI super-resolution requires processing continuous spectral channels and capturing spectral-spatial joint features from a data cube. Therefore, a 3D convolutional filter is a better choice for modelling both the spectral correlation characteristics and spatial texture information, and yielding the feature representation of the spectral-spatial hierarchical structures from low-resolution HSIs. In this study, the convolutional kernel is set to for a band HSI input. Nah et al. [nah2017deep] found that the batch normalisation (BN) layer would normalise the features and get rid of the range flexibility for upscaling features. Therefore, no BN layer is used here to avoid blurring the spectral-spatial information hidden in the convolutional features.
Secondly, residual blocks (ResBlocks) were employed for the residual learning. The architecture of a ResBlock is shown in Fig. 3b. Each ResBlock comprises two 3D convolutional filters, a ReLU activation layer, and a scaling layer. The two 3D convolutional filters are used to alleviate the effect of noise and strengthen the spectral-spatial hierarchical characteristics at a deeper level. The ReLU activation between the convolutional filters is used to regularise deep features. The scaling layer after the last convolutional layer in the ResBlock is used to scale the features, which increases the diversity of spectral-spatial hierarchical modes in the deep features. As in the initial feature extraction, no BN layer is used in the ResBlocks. The skip connections of the ResBlocks form an STSSRW mechanism, in which the spectral-spatial features extracted from previous ResBlocks are skip-connected with the features extracted from the current ResBlock. The skip-connected features not only represent the extracted local spectral-spatial invariances, but also retain a memory of the high-frequency spectral-spatial information represented by the high dynamic range of previous features. This STSSRW mechanism enriches the spectral-spatial details with hierarchical structures and stabilises the learning process. It is noteworthy that, for a given ResBlock, each convolutional layer with features and kernels has parameters, requiring memory. Although increasing the number of ResBlocks is an effective way to improve the performance of a generator in extracting deep features, it leads to a high memory requirement. Moreover, the learning process may become numerically unstable as the number of ResBlocks increases [arjovsky2017towards]. Therefore, in this study, we set the number of ResBlocks to 34 to balance model performance and cost.
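The memory argument above can be made concrete with a rough parameter count. The sketch below uses the standard count for a 3D convolutional layer (input features times output features times kernel volume, plus biases). The feature width of 32 and the 3x3x3 kernel are hypothetical values chosen for illustration, not settings confirmed by the text, and activation memory, which usually dominates during training, is not included.

```python
def conv3d_param_count(in_features, out_features, kernel_size, bias=True):
    """Number of learnable parameters in one 3D convolutional layer."""
    kd, kh, kw = kernel_size
    weights = in_features * out_features * kd * kh * kw
    return weights + (out_features if bias else 0)

def resblock_stack_memory_mb(num_blocks, features, kernel_size, bytes_per_param=4):
    """Parameter memory (MiB) of a stack of ResBlocks with two conv layers each."""
    per_block = 2 * conv3d_param_count(features, features, kernel_size)
    return num_blocks * per_block * bytes_per_param / (1024 ** 2)

# Hypothetical configuration: 34 ResBlocks, 32 features, 3x3x3 kernels, float32.
params_per_layer = conv3d_param_count(32, 32, (3, 3, 3))
stack_memory = resblock_stack_memory_mb(34, 32, (3, 3, 3))
print(params_per_layer, round(stack_memory, 2))
```

The count grows linearly with the number of ResBlocks and quadratically with the feature width, which is why depth and width have to be traded off against memory.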
Finally, the spectral-spatial feature maps extracted from the ResBlocks were fed into an UpscaleBlock to generate the super-resolution spectral-spatial features for the super-resolution HSI reconstruction. As shown in Fig. 3c, the UpscaleBlock is a combination of a 3D convolutional filter and a shuffle layer, in which convolutional filters with a depth of 32 are used to exploit features for an upscaling factor . The shuffle layer is used to arrange all the features corresponding to each sub-pixel position in a pre-determined order and aggregate them into super-resolution areas. The size of each super-resolution area is . After this operation, the final feature maps with a size of are arranged into the super-resolution feature maps with a size of , where and are the height and width of the super-resolution HSI, respectively. Finally, a deconvolution filter is used to decode the feature maps in each area, yielding the super-resolution HSI with enhanced spectral-spatial fidelity.
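The rearrangement performed by a shuffle layer can be sketched in a few lines of NumPy. The example below shows the usual sub-pixel (pixel shuffle) operation on a single 2D feature stack, as a simplified assumption: the layer in the proposed generator operates on 3D spectral-spatial features, and the function name and channel ordering here are illustrative only.

```python
import numpy as np

def pixel_shuffle(features, r):
    """Rearrange (C * r * r, H, W) feature maps into (C, H * r, W * r).

    Channel c * r * r + i * r + j of the input supplies the sub-pixel at
    offset (i, j) inside each r x r super-resolution area of output channel c.
    """
    c_r2, h, w = features.shape
    if c_r2 % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = c_r2 // (r * r)
    x = features.reshape(c, r, r, h, w)   # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)        # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)     # merge (h, i) and (w, j)

# Four low-resolution maps of size 2 x 3 become one map of size 4 x 6.
low_res = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
high_res = pixel_shuffle(low_res, r=2)
print(high_res.shape)
```

No values are interpolated in this step: every output pixel is copied from exactly one input feature, so the upscaling quality depends entirely on the convolution that precedes the shuffle.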
III-A2 The architecture of the discriminator
The architecture of the proposed discriminator, , shown in Fig. 4a, is similar to that used in [RN2]. However, there is no sigmoid layer in our model, because the latent space optimisation requires the raw membership without compression. Thus, the proposed mainly contains one convolutional layer, Maxpool blocks ( in this study) and two dense layers. The Maxpool block is a combination of a convolutional layer, a BN layer, and a ReLU layer (see Fig. 4b). The Maxpool block aims to extract the high-level features of input data, and the resultant feature maps are input into two dense layers to obtain a membership distribution of the feature maps for real or generated HSIs.
III-A3 The architecture of the latent encoder
The latent encoder, , is developed and introduced into the GAN architecture to prevent mode collapse by mapping the generated spectral-spatial features from the image space to the latent space and producing latent regularisation components in the learning process. Mathematically, we regard the spectral-spatial features as , and the singular value decomposition of in the latent space can be written as:
where and are the left and right singular vectors of , respectively, and can be expressed as:
where represents the spectral-spatial distribution of . If mode collapse occurs in the learning process, will concentrate on the first singular value and the remaining singular values will be close to zero. Therefore, in order to avoid mode collapse, we use a latent encoder to automatically generate latent regularisation components for of the real data and the generated data, denoted as and , to compensate the singular values of .
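The link between mode collapse and a singular value spectrum that concentrates on its first entry can be illustrated numerically. The sketch below is an assumption-laden toy example rather than the authors' procedure: random matrices stand in for spectral-spatial features, and `leading_singular_share` is a hypothetical diagnostic name.

```python
import numpy as np

def leading_singular_share(features):
    """Fraction of the total singular value mass carried by the first singular value."""
    s = np.linalg.svd(features, compute_uv=False)  # returned in descending order
    return float(s[0] / s.sum())

rng = np.random.default_rng(0)
# Diverse features: a full-rank random matrix with many comparable singular values.
diverse = rng.normal(size=(64, 32))
# Collapsed features: a rank-one matrix, so every row is a multiple of one pattern.
collapsed = np.outer(rng.normal(size=64), rng.normal(size=32))

print(leading_singular_share(diverse), leading_singular_share(collapsed))
```

For the rank-one matrix nearly all of the mass sits in the first singular value, whereas the full-rank matrix spreads it across many directions; the latent regularisation components are intended to push the generated features towards the second regime.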
The architecture of the latent encoder is shown in Fig. 5. It consists of eight convolutional layers whose kernel depth increases by a factor of 2 across layers, from 64 to 512. A striding operation is used to reduce the number of features each time the kernel depth is doubled. The resulting 512 feature maps are input into two dense layers so that the outputs match the dimension of the HSI. As shown in Fig. 2, receives signals from the generator, , and the targeted data, . The outputs of the encoder are used to calculate an L2 loss to regularise the loss function of the discriminator, defined as:
where denotes L2 norm, and are the output regularisation components in the latent space, respectively, corresponding to the inputs from real data and the generated data from the discriminator, , parametrised by . The encoder is simultaneously optimised with the generator, . To make sure that the outputs of and the real high-resolution HSI in the latent space have the same dimension, is pre-trained by real HSI data. This speeds up the formal optimisation process.
III-B Model optimisation with spectral-spatial realistic perceptual loss
In this study, we treat a low-resolution image as a low-dimension manifold embedded in the latent space, thus the super-resolution HSI can be generated by the parametrised latent space learnt by the model. Theoretically, the generated super-resolution sample, , from a low-resolution sample, , by the generator will be located in a neighbourhood area of its target, , in the latent space.
Previous studies [chen2018attention, zhang2020supervised, RN29] used the difference between and as the generator loss function, described as:
However, there are two drawbacks to using this loss function in the HSI super-resolution optimisation process. Firstly, the activated features in the latent space are very sparse, and distance-based losses rarely consider the spectral-spatial consistency between and , which leads to spectral-spatial distortion in the generated super-resolution HSI. Secondly, directly bounding the difference between and makes it hard to converge because is usually disturbed by network impairments or random noise.
In order to overcome the aforementioned drawbacks, we have designed a spectral-spatial realistic perceptual (SSRP) loss to comprehensively measure the spectral-spatial consistency between and in the latent space. The formula of the SSRP loss is defined as the weighted sum of the spectral contextual loss, the spatial texture loss, the adversarial loss, and a latent regularisation component, and is shown as follows:
where is the spectral contextual loss, is the spatial texture loss, is the adversarial loss, and is the latent regularisation component.
Based on the SSRP loss, the min-max problem in the GAN model can be described as follows:
The details of , , , and are provided below.
III-B1 Spectral contextual loss
is designed to measure the spectral directional similarity between and in the latent space, which is defined as follows:
where is the band number of an HSI, denotes the feature maps obtained from the convolutional layer before the first Maxpooling layer of the discriminator, .
III-B2 Spatial texture loss
In GAN models, if the loss function only measures the spatial resemblance of the generated and targeted samples, it usually leads to blurry super-resolution results. In this study, we introduce a spatial texture loss to measure the texture differences between the feature maps of and in the latent space. In the , the feature maps of and before activation are used because they contain sharper details. is defined as:
where denotes the feature maps obtained from the convolutional layer after the last Maxpooling layer of the discriminator .
III-B3 Adversarial loss
Along with the spectral contextual loss and the spatial texture loss, an adversarial loss is introduced to help the generator reconstruct the image in the ambient manifold space and fool the discriminator network. is defined based on the Wasserstein distance [LN04] between the probability distributions of the real data, , and the generated data, . Theoretically, is effective in alleviating mode collapse during the training process, because the Wasserstein distance that evaluates the similarity between and relies on the whole sample distributions rather than on individual samples. In other words, a penalty is triggered when only covers a fraction of , which encourages diversity in the generated super-resolution HSIs. The goal of is to minimise the Wasserstein distance, , which is defined as:
where is the K-Lipschitz function. Suppose we have a parametrised family of functions, , that are all K-Lipschitz for some , then the can be written as:
where is chosen such that the Lipschitz constant of is smaller than a constant, . If the probability densities of and satisfy the Lipschitz continuous condition (LCC) [RN34], there is a solution . Thus, the discriminator is trained to learn a K-Lipschitz continuous function to help compute the Wasserstein distance. The LCC is a strong pre-requisite for calculating . Therefore, the parameters, , should lie in a -dimensional manifold in order to meet this constraint.
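The claim that the Wasserstein distance penalises a generator that covers only part of the real distribution can be seen in a one-dimensional toy case. The sketch below is a simplification under stated assumptions: it uses scalar samples rather than HSI features, and it relies on the closed form for two equal-sized empirical samples, where the Wasserstein-1 distance is the mean absolute difference of the sorted values. In the model itself the distance is estimated by the trained critic, not computed this way.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1D empirical samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this closed form needs samples of equal size")
    return float(np.mean(np.abs(a - b)))

# Hypothetical real data with two modes, at -2 and +2.
real = np.concatenate([np.full(50, -2.0), np.full(50, 2.0)])
# A generator that covers both modes, and one that collapsed onto a single mode.
full_cover = real.copy()
collapsed = np.full(100, 2.0)

print(wasserstein_1d(real, full_cover), wasserstein_1d(real, collapsed))
```

Every collapsed sample is individually realistic, yet the distance is large because half of the probability mass has to be moved to the missing mode; a per-sample criterion would not register this.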
III-B4 The latent regularisation component
In our proposed model, is a ResNet with global Lipschitz continuity. As described in Section III-A3, we have introduced a latent encoder, , to compensate the singular values of the spectral-spatial features of to the desired . In addition, in the optimisation process, the Lipschitz continuous condition (LCC) is employed to enforce the local spectral-spatial invariances of , and map the latent manifold space to a more regularised latent space in case of mode collapse, described as:
Thus, the proposed loss function (see Eq. 2) is penalised if the singular values of the spectral-spatial features of a generated super-resolution HSI are updated in a particular direction. In other words, LCC-derived updating is able to prevent the learning process of each layer from becoming sensitive to a limited set of directions, which mathematically alleviates mode collapse and in turn stabilises the optimisation process.
IV Experimental Evaluation
In this section, we evaluate the proposed LE-GAN and determine whether it improves the super-resolution quality and facilitates manifold mapping for solving the problem of mode collapse. The developed SSRP loss function plays a key role in both of these aspects. A total of three experiments are designed. The first experiment determines the optimal parameter combination for the SSRP loss in our proposed model, the second evaluates the super-resolution quality, and the last evaluates mode collapse in model training.
The proposed model was trained and tested on real HSI datasets from different sensors. It was also compared with five state-of-the-art super-resolution models: the hyperspectral coupled network (HyCoNet) [zheng2020coupled], the low tensor-train rank (LTTR) network [dian2019learning], the band attention GAN (BAGAN) [RN14], the super-resolution generative adversarial network (SRGAN) [RN2], and the Wasserstein GAN (WGAN) [LN04]. Among them, HyCoNet, LTTR and BAGAN are state-of-the-art models for HSI super-resolution, while SRGAN and WGAN are the most widely used GAN frameworks for image super-resolution. In order to fit the HSI into the SRGAN and WGAN models, a band-wise strategy was employed [mei2017hyperspectral].
IV-A HSI data descriptions
In our experiments, two types of datasets obtained from different sensors were used, one from the public AVIRIS archive, the other from the privately measured UHD-185 data of Guyuan Potato Field (GPF).
IV-A1 AVIRIS datasets
Two publicly available HSIs from the AVIRIS data archive were chosen: the Indian Pines (IP) data and the Kennedy Space Center (KSC) data. Each of them contains hyperspectral bands from . The HSIs in the KSC dataset were collected over the Kennedy Space Center, Florida, on March 23, 1996, with a spatial resolution of 18 m. The HSIs of IP covered crop planting areas in North-Western Indiana, USA, with a spatial resolution of 20 m. In this study, to keep the spectral consistency between different datasets, only the wavelength range from visible to near-infrared () was considered in our experiments.
IV-A2 UHD-185 dataset
The UHD-185 dataset contained three privately measured HSIs, denoted as , , and , acquired in Guyuan Potato Field, Hebei, China. Each of the HSIs was collected in 2019 by a UHD-185 imaging spectrometer (Cubert GmbH, Ulm, Baden-Württemberg, Germany) mounted on a DJI S1000 UAV system (SZ DJI Technology Co Ltd., Guangdong, China). All the images were obtained at a flight height of 30 m, with 220 bands from visible to near-infrared between and nm and a spatial resolution close to 0.25 m per pixel.
IV-B Evaluation metrics
The evaluation metrics include 1) the metrics for evaluating super-resolution quality and robustness and 2) the metrics for evaluating mode collapse of GANs.
IV-B1 Evaluation metrics for super-resolution quality and robustness assessment
In total, five spectral-spatial evaluation metrics were employed for the super-resolution quality assessment.
These five metrics are 1) the information entropy associated peak signal-to-noise ratio (PSNR), 2) the spatial texture associated structural similarity index (SSIM), 3) the perception-distortion associated perceptual index (PI), 4) the spectral reality associated spectral angle mapper (SAM), and 5) the spectral consistency associated spectral relative error (SRE). Among them, the PSNR and SSIM are widely used in the evaluation of image quality [RN41]: the larger the PSNR or SSIM score, the higher the image quality.
The PSNR is defined as:
where is the mean squared error between the real HR HSI, , and the generated HR HSI through super-resolution, . The PSNR goes to infinity as the MSE goes to zero.
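As an illustration of the definition above, the following is a minimal NumPy sketch of the standard PSNR formula, 10 log10(MAX^2 / MSE). The function name `psnr` and the `max_value` argument (the peak signal value, assumed here to be 1.0 for reflectance data scaled to [0, 1]) are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape.

    max_value is the assumed peak signal value (1.0 for data scaled to [0, 1]).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        # Identical images: the PSNR goes to infinity as the MSE goes to zero.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

For example, a constant error of 0.1 on data scaled to [0, 1] gives an MSE of 0.01 and therefore a PSNR of 20 dB.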
The SSIM is defined as:
where , , and are the comparison measures for luminance, contrast, and structure between real and generated HR HSI pairs, respectively. The details can be found in [RN44].
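The sketch below shows how the luminance, contrast, and structure comparisons of the standard SSIM combine into a single score. It is a simplified, single-window (global) variant computed over the whole image rather than over sliding local windows, so it is an approximation of the metric in [RN44], not a replacement for it. The constants `k1 = 0.01` and `k2 = 0.03` are the commonly used defaults; `max_value` is an assumed peak value.

```python
import numpy as np

def global_ssim(x, y, max_value=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole array (simplified illustration)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    c1 = (k1 * max_value) ** 2  # stabiliser for the luminance term
    c2 = (k2 * max_value) ** 2  # stabiliser for the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    luminance = (2.0 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    # With unit exponents, the contrast and structure terms merge into one ratio.
    contrast_structure = (2.0 * cov_xy + c2) / (var_x + var_y + c2)
    return float(luminance * contrast_structure)
```

Identical inputs give a score of 1, and the score decreases as the means, variances, or covariance of the two images diverge.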
However, the numerical scores of PSNR and SSIM do not always correlate well with the subjective image quality. Therefore, Blau et al. [RN45] proposed the perceptual index (PI) as a complementary reference for the image quality evaluation. The lower the PI value is, the higher the perceptual quality of the image. The PI is defined by two no-reference image quality measurements, Ma [RN46] and NIQE [RN47], described as:
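For reference, the PI introduced by Blau et al. combines the two no-reference scores as PI = ((10 - Ma) + NIQE) / 2. A minimal sketch is given below; it assumes that the Ma and NIQE scores have already been computed by external implementations, and the function name is chosen only for this example.

```python
def perceptual_index(ma_score, niqe_score):
    """Perceptual index from precomputed Ma and NIQE scores (lower is better)."""
    return 0.5 * ((10.0 - ma_score) + niqe_score)
```

A higher Ma score (better perceived quality) and a lower NIQE score (closer to natural image statistics) both reduce the PI.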
In order to measure the spectral distortion, the spectral angle mapper (SAM) was used to calculate the average angle between a super-resolution HSI and its targeted high-resolution HSI. The SAM is defined as:
where is the pixel number of the HSI.
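A minimal NumPy sketch of the SAM computation is given below. It treats each pixel as a spectral vector, computes the angle between the reference and generated spectra, and averages over all pixels. Reporting the angle in degrees and the `(H, W, B)` array layout are assumptions of this example.

```python
import numpy as np

def spectral_angle_mapper(reference, estimate, eps=1e-12):
    """Mean spectral angle in degrees between two HSIs shaped (H, W, B)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    ref = ref.reshape(-1, ref.shape[-1])  # one row per pixel spectrum
    est = est.reshape(-1, est.shape[-1])
    dot = np.sum(ref * est, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(est, axis=1)
    # eps avoids division by zero; clipping guards arccos against rounding.
    cosine = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cosine))))
```

Because the angle ignores vector length, a spectrum that is only scaled in brightness yields an angle close to zero, whereas a change in spectral shape increases it.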
To evaluate the pixel-wise spectral reconstruction quality, the spectral relative error (SRE) was also used as a metric, defined as:
where the is the band number of an HSI.
IV-B2 Evaluation metrics for mode collapse of GANs
Two metrics for GANs, Inception Score (IS) and Frechet Inception Distance (FID), were employed to measure the mode collapse by monitoring the image quality and diversity in the model training process [liu2019spectral, hartmann2018eeg]. The IS measures both the image quality of generated HSIs and their diversity, reflecting the probability of mode collapse in the model training process. In GANs, it is desirable for the conditional probability to be highly predictable (low entropy), that is, the probability density function is less uniform. The diversity of the generated images can be measured with the marginal probability: the less uniform (lower entropy) the marginal probability is, the lower the diversity of the generated images. By computing the KL-divergence between these two probability distributions, the IS is computed with the equation below:
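The IS computation described above can be sketched as follows, using the standard form IS = exp(E_x[KL(p(y|x) || p(y))]). The input `class_probs` is a hypothetical `(N, K)` array whose rows are the predicted class probabilities for N generated samples; how these probabilities are obtained for HSIs is not covered by this sketch.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """Inception Score from an (N, K) array of conditional probabilities p(y|x)."""
    p_yx = np.asarray(class_probs, dtype=np.float64)
    # Marginal p(y): average of the conditional distributions over all samples.
    p_y = p_yx.mean(axis=0, keepdims=True)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

With confident predictions spread evenly over K classes the score approaches K, whereas a collapsed generator whose samples all receive the same prediction scores 1.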
The Frechet Inception Distance (FID) is a metric that calculates the distance between the feature vectors extracted from real and generated images. It was used to evaluate the quality of GAN-generated images: the lower the FID value is, the better the image quality and diversity are. The FID is also sensitive to mode collapse. By modelling the distribution of the features extracted from an intermediate layer with a multivariate Gaussian distribution, the FID between the real and generated images is calculated using the following equation:
where and refer to the feature-wise means of the real high-resolution HSI and the generated super-resolution HSI in the discriminator model, respectively, and and are the covariance matrices of the real and generated feature vectors, respectively.
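The following sketch computes the Frechet distance between two Gaussian fits in its standard form, ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2)). The inputs are assumed to be `(N, D)` arrays of feature vectors that have already been extracted from the network; the feature extraction itself is outside the scope of this example. The trace of the matrix square root is obtained from the eigenvalues of the covariance product, which avoids a dependency on SciPy.

```python
import numpy as np

def frechet_distance(real_features, generated_features):
    """Frechet distance between Gaussian fits of two (N, D) feature sets."""
    real = np.asarray(real_features, dtype=np.float64)
    gen = np.asarray(generated_features, dtype=np.float64)
    mu_r, mu_g = real.mean(axis=0), gen.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(gen, rowvar=False))
    # Tr((Sigma_r Sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of zero, and shifting one set by a constant vector increases the distance by the squared length of that shift.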
IV-C Experimental configuration
In our experiments, the raw HSIs were labelled as HR samples. The LR samples were generated by down-sampling the HR samples with three scaling factors, , and , based on the bi-cubic interpolation approach [RN38]. For the AVIRIS datasets, the KSC data was used for model training and testing, and the IP data was used for the independent test. For the UHD-185 dataset, GPF-1 and GPF-2 were used for training and testing, and GPF-3 was used for the independent test. More specifically, for the training/test datasets, the HR HSI was cropped into a series of sub-images with a spatial size of , and the corresponding LR data was respectively cropped to , , and . After this operation, a total of 196 HR and LR HSI pairs were generated from the AVIRIS dataset, and 352 HR and LR HSI pairs were generated from the UHD-185 dataset, in which of image pairs were randomly selected as the training set and the rest of image pairs were used as the test set.
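The data preparation described above can be sketched as follows. This is a simplified illustration under stated assumptions: block averaging is used as a dependency-free stand-in for bi-cubic interpolation, each HR patch is down-sampled individually rather than cropping a down-sampled full image, and the patch size, scaling factor, and training fraction in the usage note are hypothetical values, not those of the study.

```python
import numpy as np

def block_mean_downsample(hsi, factor):
    """Stand-in for bi-cubic down-sampling: average factor x factor blocks."""
    h, w, b = hsi.shape
    h, w = h - h % factor, w - w % factor  # drop rows/columns that do not fit
    blocks = hsi[:h, :w].reshape(h // factor, factor, w // factor, factor, b)
    return blocks.mean(axis=(1, 3))

def crop_patches(hsi, patch_size):
    """Crop non-overlapping square patches from an (H, W, B) image."""
    h, w, _ = hsi.shape
    return [hsi[r:r + patch_size, c:c + patch_size]
            for r in range(0, h - patch_size + 1, patch_size)
            for c in range(0, w - patch_size + 1, patch_size)]

def make_pairs(hr_image, patch_size, factor, train_fraction, seed=0):
    """Build (LR, HR) patch pairs and split them randomly into train/test."""
    pairs = [(block_mean_downsample(p, factor), p)
             for p in crop_patches(hr_image, patch_size)]
    order = np.random.default_rng(seed).permutation(len(pairs))
    n_train = int(round(train_fraction * len(pairs)))
    train = [pairs[i] for i in order[:n_train]]
    test = [pairs[i] for i in order[n_train:]]
    return train, test
```

For instance, `make_pairs(hr, patch_size=16, factor=4, train_fraction=0.75)` on a 64 x 64 image yields 16 pairs, of which 12 are assigned to training and 4 to testing.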
The training process was divided into two stages. In the first stage, the discriminator and the latent encoder were pre-trained over 5,000 iterations on the raw HR HSI dataset to get initial weights. The Adam optimiser was used by setting the forgetting factors, and , a small scalar constant and the learning rate [RN39]. In the second stage, the discriminator, the generator, and the latent encoder were jointly trained for over 10,000 iterations, until they converged. The Adam optimiser with the same parameters was used. All training was performed on NVIDIA 1080Ti GPUs. We also investigated the effect of the hyper-parameter on the optimisation performance. We found a value in the range of could generate high-quality HSI data. In this study, was used for training the proposed model.
IV-D Experiment 1: the parameter selection for the SSRP loss function
To achieve an optimal performance, an optimised combination of the parameters in the SSRP loss function in Eq. (4), , , and , needs to be found. In this study, a traversal method was employed to search for the optimal parameter combination. These parameters were traversed in the range of to with a fixed step of for the range of to , and a fixed step of for the range of to . The selection of parameter combinations was based on the spectral-spatial quality of generated super-resolution HSIs measured with five evaluation metrics, PSNR, SSIM, PI, SAM, and SRE. Table I lists the top five parameter combinations and the corresponding values of these metrics for generating the super-resolution HSI with the scaling factors of , and . It can be observed that all the parameters after optimisation are located in a relatively small range, for example, for and , for and for . In the following experiments, we employed the average values of the best parameters for various scaling factors, thus, , and .
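The traversal described above amounts to an exhaustive grid search in which each parameter range is sampled with two different step sizes. A minimal sketch is given below; the helper names, the ranges, the step sizes, and the toy scoring function in the usage note are all hypothetical, and in practice each evaluation of `score_fn` would involve training and scoring a model.

```python
import itertools
import numpy as np

def two_step_grid(low, split, high, fine_step, coarse_step):
    """Sample [low, split) with fine_step and [split, high] with coarse_step."""
    fine = np.arange(low, split, fine_step)
    coarse = np.arange(split, high + coarse_step / 2.0, coarse_step)
    return np.concatenate([fine, coarse])

def traverse(score_fn, grids):
    """Evaluate score_fn on every parameter combination and keep the best."""
    best_score, best_params = -np.inf, None
    for params in itertools.product(*grids):
        score = score_fn(*params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

As a usage example, `traverse(lambda a, b: -(a - 2.0) ** 2 - (b - 0.5) ** 2, [grid, grid])` with `grid = two_step_grid(0.0, 1.0, 10.0, 0.5, 1.0)` returns the grid point closest to the maximum of the toy score.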
|Scaling factor|No.|(, , , and )|PSNR|SSIM|PI|SAM|SRE|
| |1|(12.8, 12.9, 0.008, 0.015)|31.738|0.982|3.782|5.011|8.383|
| |2|(12.8, 12.8, 0.009, 0.016)|31.716|0.945|3.884|5.155|8.461|
| |3|(12.7, 12.8, 0.007, 0.014)|31.712|0.963|3.87|5.115|8.469|
| |4|(12.8, 12.8, 0.008, 0.014)|31.712|0.943|3.849|5.161|8.482|
| |5|(12.6, 12.8, 0.006, 0.017)|31.708|0.926|3.876|5.174|8.499|
| |1|(12.4, 12.4, 0.006, 0.015)|31.417|0.903|3.765|4.942|8.219|
| |2|(12.4, 12.5, 0.009, 0.014)|31.395|0.901|3.764|5.075|8.267|
| |3|(12.4, 12.3, 0.007, 0.015)|31.375|0.893|3.765|5.013|8.279|
| |4|(12.5, 12.8, 0.007, 0.014)|31.359|0.891|3.767|5.017|8.276|
| |5|(12.5, 12.8, 0.006, 0.017)|31.322|0.898|3.819|5.065|8.331|
| |1|(12.3, 12.3, 0.005, 0.015)|29.881|0.931|3.672|4.741|8.672|
| |2|(12.4, 12.3, 0.006, 0.014)|29.851|0.902|3.663|4.828|8.726|
| |3|(12.4, 12.2, 0.004, 0.014)|29.816|0.885|3.583|4.797|8.753|
| |4|(12.5, 12.5, 0.005, 0.014)|29.828|0.923|3.617|4.866|8.679|
| |5|(12.4, 12.6, 0.005, 0.015)|29.791|0.885|3.634|4.817|8.733|
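The averaging step mentioned in the text can be illustrated with the rows of Table I. The sketch below averages the top-ranked (No. 1) parameter combination of each of the three scaling factors. Whether the study averaged exactly these rows, or rounded the result, is not stated, so the output should be read as an illustration of the procedure only.

```python
import numpy as np

# Top-ranked (No. 1) parameter combination for each of the three scaling
# factors, copied from Table I in the order the groups appear.
best_per_scale = np.array([
    [12.8, 12.9, 0.008, 0.015],
    [12.4, 12.4, 0.006, 0.015],
    [12.3, 12.3, 0.005, 0.015],
])

# Column-wise mean: one averaged value per loss-function parameter.
averaged = best_per_scale.mean(axis=0)
print(np.round(averaged, 4))
```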
IV-E Experiment 2: model robustness and super-resolution quality assessment
To evaluate the robustness and generalizability of the proposed model, we have evaluated our model on both testing datasets and independent datasets.
IV-E1 Model assessment on the testing datasets
As described in Section IV-C, we divided the dataset into the training and testing datasets. The performance of the proposed model for hyperspectral super-resolution with three upscaling factors (, and ) was evaluated on testing datasets including AVIRIS (KSC) and UHD-185 (GPF-1 and GPF-2), and compared with five state-of-the-art competing models. To assess the model robustness to noise, the models were also evaluated on the datasets with Gaussian white noise artificially added at three levels (, and ) to each of the spectral bands of the low-resolution HSIs. To facilitate ranking the models in terms of reconstruction quality, the five most widely used evaluation metrics, PSNR, SSIM, PI, SAM, and SRE, were chosen. Specifically, PSNR, SSIM and PI were used to measure the spatial reconstruction quality from the aspects of information entropy, spatial similarity, and perception distortion, respectively. Higher PSNR and SSIM scores and lower PI scores indicate higher spatial reconstruction quality. In addition, the SAM and SRE scores were used for the spectral distortion measurement from the aspects of spectral angle offset and amplitude difference, respectively. Lower SAM and SRE scores indicate higher spectral reconstruction quality.
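The noise injection described above can be sketched as follows: for each spectral band, zero-mean Gaussian white noise is added with a variance chosen so that the band reaches a target signal-to-noise ratio in dB. Defining the signal power as the mean squared band value is an assumption of this example, as are the function name and the fixed random seed.

```python
import numpy as np

def add_gaussian_noise(hsi, snr_db, seed=0):
    """Add white Gaussian noise to each band of an (H, W, B) HSI at snr_db."""
    rng = np.random.default_rng(seed)
    hsi = np.asarray(hsi, dtype=np.float64)
    noisy = np.empty_like(hsi)
    for band in range(hsi.shape[-1]):
        signal = hsi[..., band]
        signal_power = np.mean(signal ** 2)
        # SNR(dB) = 10 log10(signal_power / noise_power)
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        noisy[..., band] = signal + rng.normal(0.0, np.sqrt(noise_power), signal.shape)
    return noisy
```

Setting the noise level per band keeps the SNR comparable across bands even when their reflectance levels differ.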
Table II and Table III provide the average scores of PSNR, SSIM, PI, SAM, and SRE of HSI super-resolution results from the proposed model and its five competitors using the AVIRIS and UHD-185 testing datasets, respectively. In general, the results on both datasets consistently show that the proposed LE-GAN model achieves the highest PSNR and SSIM values and the lowest PI, SAM and SRE values for all three different upscaling factors and three added noise levels (see the highlighted values in Table II and Table III). This means that LE-GAN achieves the best spectral and spatial fidelity and super-resolution quality.
A more detailed analysis of the results for the model performance evaluation was performed from two aspects: (1) super-resolution performance under various upscaling factors and (2) model robustness against different noise levels. Since the results in Table II and Table III have similar patterns for all the models, here we only present the analyses and assessment using the results on AVIRIS data (i.e. Table II):
(1) Among three upscaling factors, the LE-GAN based super-resolution with the smallest upscaling factor and without added noise (i.e. dB) achieves the best spectral and spatial reconstruction quality. The best scores of PSNR, SSIM, PI, SAM, and SRE are , , , , and , respectively, which are closer to the real high-resolution HSI (i.e. for PSNR, for SSIM, for PI, for SAM, and for SRE), compared to its competitors. The similarities (i.e. the ratio between the super-resolution HSI and the real high-resolution HSI) reach , , , , and , respectively. In addition, for a given added noise level, the spectral and spatial quality of the LE-GAN generated super-resolution HSIs is more stable between the upscaling factors of and . For example, under the added noise level of , the PSNR, SSIM, PI, SAM and SRE scores are , , , , and for upscaling factor, and increasing the upscaling factor to only causes slight changes to these scores, which are , , , and , respectively. The consistency ratios (i.e. the ratio between the and super-resolution HSI) are , , , , and , respectively. In contrast, a larger performance degradation occurs on the spectral and spatial reconstruction quality of the competitors. For example, with regard to the WGAN, the second best model in terms of PSNR and SSIM, the scores of PSNR, SSIM, PI, SAM, and SRE under non-added noise level are , , , , with upscaling, but change to , , , , with upscaling, showing the performance degradations of , , , , and , respectively. Although the degradations of PSNR, SSIM, PI, SAM, SRE scores can be observed on all the models for upscaling, the degradation rate of these scores from the proposed LE-GAN is the smallest. For example, under the non-added noise ( dB), the PSNR, SSIM, PI, SAM, SRE scores of LE-GAN based super-resolution HSIs for upscaling are , , , and , respectively, which are , , , and higher than those based on the second best models (i.e. the WGAN in terms of SSIM () and the BAGAN in terms of PSNR (), PI (), SAM (), and SRE ()).
(2) With regard to the model robustness to noise, the proposed LE-GAN shows the best performance on the spectral and spatial reconstruction for a given upscaling factor in comparison with its competitors, although degradation is observed with increased noise levels. The smaller the upscaling factor is, the more robust the model is. The most robust results against noise are at the upscaling factor of . Only , , , , and degradations of the PSNR, SSIM, PI, SAM, and SRE scores of LE-GAN-based super-resolution results occur when the added noise level increases from non-added ( dB) to dB (see Table II). In contrast, the added noise-induced degradations to the results from the WGAN (the second best model for upscaling factor) are much higher, reaching , , , , and , respectively. In addition, when the upscaling factor increases from to , the added noise-induced degradations on the PSNR, SSIM, PI, SAM and SRE scores of the LE-GAN super-resolution results are , , , , and , which are acceptable for super-resolution with a high upscaling factor and a high noise level. In contrast, a more serious deterioration can be observed in the results from its competitors. For example, the added noise-induced degradations on the PSNR, SSIM, PI, SAM, SRE of the BAGAN-based super-resolution results, the second best model, are , , , and , respectively, for an upscaling factor of , but change to , , , , for an upscaling factor of .
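The similarity ratios, consistency ratios, and degradations discussed in the two points above are all simple relative comparisons between two metric values. A minimal sketch of how such percentages can be computed is given below; the function names and the example scores are hypothetical, and the exact definitions used for the reported percentages are assumed rather than confirmed.

```python
def ratio_percent(value, reference):
    """Ratio of a metric value to a reference value, in percent."""
    return value / reference * 100.0

def relative_change_percent(baseline, degraded):
    """Signed relative change from a baseline score to a degraded score."""
    return (degraded - baseline) / abs(baseline) * 100.0
```

For example, a PSNR that falls from a hypothetical 32.0 to 30.4 when noise is added corresponds to a relative change of about -5%, that is, a 5% degradation.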
IV-E2 Model assessment on the independent test datasets
The proposed model has also been evaluated on two independent test datasets, AVIRIS (IP) and UHD-185 (GPF-3), which were not involved in the model training. Fig. 6 illustrates a comparison of five evaluation metrics (PSNR, SSIM, PI, SAM and SRE) between the proposed model and its five competitors. The average value and standard deviation of each metric were calculated based on the measures at three noise levels, , and . Compared to its competitors, the proposed model achieves the highest average values and lowest standard deviations for PSNR and SSIM, and the lowest average values and lowest standard deviations for PI, SAM and SRE, across three upscaling factors on both the AVIRIS test dataset (see Fig. 6a) and the UHD-185 test dataset (see Fig. 6b). That is, the proposed model achieves the best super-resolution performance. Similar to the evaluation results in Subsection IV-E1, overall the second best model on the independent test datasets is WGAN for the spatial information reconstruction measures (e.g. PSNR, SSIM), and BAGAN for the spectral information reconstruction measures (e.g. SAM, SRE).
It can also be observed that the changes of these metrics are relatively small with the increase of the upscaling factor. When the upscaling factor increases from to , the average values of SSIM, PI, SAM, and SRE from the proposed model almost stay the same; when the upscaling factor increases from to , the changes of these metrics are much smaller compared to those from its competitors.
These findings suggest that the proposed model overcomes the drawback associated with spectral-spatial reconstruction under noise interference compared to its competitors. Moreover, the proposed model is less sensitive to the upscaling factor, and performs well even with a large upscaling factor (e.g. ).
IV-E3 Visual analysis of generated super-resolution HSIs with a large upscaling factor ()
To demonstrate the performance improvement of the proposed model in spectral-spatial fidelity, visual analyses on generated super-resolution HSI samples have been performed. Fig. 7 displays the results from independent test datasets (IP and GPF-3). Although the visualisation results from the proposed method and its competitors are similar, the image edges from the LE-GAN are sharper than those from the competitors. For example, the internal textures of the bare soil, shown as grey in the false-colour images, almost disappear in the super-resolution IP images generated by the HyCoNet, LTTR, and BAGAN (the second, third, and fourth images in the first row of Fig. 7). These findings suggest that the LE-GAN provides improved spatial quality in general.
To further visualise the super-resolution details on spatial and spectral fidelity, some representative false-colour composite image patches and the spectral curves of the super-resolved HSI patches from independent test dataset (GPF-3) are shown in Fig. 8. It is obvious that the brightness, contrast, and internal structures of the false-colour images generated by the LE-GAN are more faithful to real HR data. For example, the land cover textures in the LE-GAN generated image (the second image from the right in the second row of images in Fig. 8) are clearer, compared to the images generated by the competitors (e.g. the HyCoNet and LTTR based images) in which the edges of streets are fuzzy. Moreover, the spectral curves from the LE-GAN generated images are more consistent with those from real HR HSI data. For example, the typical vegetation spectral curves in the images generated by the HyCoNet, SRGAN, and WGAN reveal distinct biases in the range of red-edge to near-infrared with real HR data (see the images of the first row in Fig. 8). In contrast, the vegetation spectral curves from the LE-GAN super-resolution are more consistent with those from real HR HSI. A detailed analysis of the spectral residual and standard deviation between the generated HSI and real HR HSI from the independent test dataset is shown in Fig. 9. It can be found that the residual error between the LE-GAN generated HSI and HR HSI is close to zero in the range of 450 to 780 nm and lower than in the range of 780 to 950 nm, and the deviation is lower than . All these results suggest that the proposed model provides a better performance in HSI super-resolution without losing the spectral details. The second and third best spectral residuals are achieved by the BAGAN and LTTR, respectively, and the spectral biases in the range of 630 to 950 nm and the average deviations reach and , respectively.
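The per-band residual analysis described above can be sketched as follows. For each spectral band, the sketch computes the mean and the standard deviation of the difference between the generated and the real HR HSI over all pixels. The `(H, W, B)` array layout and the sign convention (generated minus real) are assumptions of this example.

```python
import numpy as np

def spectral_residual_stats(real_hsi, generated_hsi):
    """Per-band mean and standard deviation of (generated - real)."""
    real = np.asarray(real_hsi, dtype=np.float64)
    gen = np.asarray(generated_hsi, dtype=np.float64)
    diff = (gen - real).reshape(-1, real.shape[-1])  # one row per pixel
    return diff.mean(axis=0), diff.std(axis=0)
```

Plotting the two returned vectors against the band wavelengths gives a residual curve with a deviation band, in which a residual close to zero indicates that the spectral shape is preserved.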
IV-F Experiment 3: Mode collapse evaluation
Generally, mode collapse mainly happens in the training process when the super-resolution HSI produced by the generator only partially covers the target domain. When the generator learns a type of spectral-spatial pattern that is able to fool the discriminator, it may keep generating this kind of pattern so that the learning process is over-learned. The distance and distribution of the generated super-resolution HSI provide the most direct evidence for determining whether the mode collapse occurs in the generator. In this section, we evaluated the effect of the proposed LE-GAN on alleviating mode collapse from three aspects: 1) a quantitative evaluation on the diversity of the generated super-resolution HSI, based on the distance-derived IS and FID metrics, 2) a smoothness monitoring on the generator iterations during the network training process, and 3) a visualisation of the distributions of the real high-resolution HSI samples and the generated super-resolution samples.
Firstly, the quantitative evaluation of the diversity of the generated super-resolution HSI was conducted on the testing dataset and independent dataset mentioned in Section IV-C. In addition, in order to assess the potential effects of different upscaling factors and added-noise levels on the occurrence of mode collapse, all of the experiments were conducted on three upscaling factors (, and ) with three Gaussian white noise levels (, and ), and compared with five state-of-the-art competing models. The IS and FID were used as the evaluation metrics for assessing the diversity of the super-resolution HSIs and determining the existence of mode collapse. A higher IS and a lower FID indicate better diversity of the generated super-resolution HSI and an alleviation of mode collapse.
Table IV lists the IS and FID measurements on the proposed LE-GAN and five selected competitors using the testing datasets. The evaluation results on AVIRIS and UHD-185 testing datasets demonstrate that the proposed LE-GAN model outperforms its competitors in terms of the IS and FID measurements for all three upscaling factors and added noise combinations (see the highlighted values in Table IV). They also indicate that the proposed model performs better in alleviating the mode collapse that affects the spectral-spatial diversity of the generated HSIs. The differences of these measurements between the proposed model and its competitors are particularly significant for the cases of low added noise and low upscaling levels. For example, the proposed LE-GAN achieves the highest IS ( for AVIRIS data and for UHD-185 data) and the lowest FID ( for AVIRIS data and for UHD-185 data) on the datasets with the upscaling factor and non-added noise of SNR.
In addition, Table IV also reveals the IS and FID degradations with the increase of upscaling factor and added noise level for all of the models. The comparison of these degradations can help to explore the model robustness in preventing the mode collapse issue. Specifically, an average of IS drop and FID increase are observed from the LE-GAN-based super-resolution results when increasing the upscaling factor from to and decreasing SNR from to . Meanwhile, the results from the WGAN, the second best model in terms of super-resolution fidelity (see Section IV-E), show an average of IS drop and FID increase. These findings suggest that the proposed LE-GAN achieves the best performance in preventing mode collapse under the higher upscaling factor and noise interference.
Table V provides the scores of the IS and FID from the proposed LE-GAN and its five competitors using the independent datasets. Similar to the results shown in Table IV, the results highlighted in Table V illustrate that the LE-GAN model achieves the best and most robust performance in terms of IS and FID for all the upscaling factor and added noise combinations. In the case of a upscaling factor and non-added noise, the LE-GAN achieves the best IS and FID measurements (IS of and FID of for AVIRIS and IS of and FID of for UHD-185 dataset), with the smallest IS drop () and FID increase (). These results are consistent with the mode collapse assessment reported in Table IV, suggesting that the LE-GAN derived super-resolution HSIs have the best spectral-spatial diversity with alleviated mode collapse.
Secondly, a smoothness monitoring on the generator iteration was used to determine if mode collapse occurred during the training process. According to the results illustrated in Table IV and Table V, high noise levels and large upscaling factors lead to more serious mode collapse. To demonstrate the performance difference between the proposed model and its competitors in alleviating mode collapse, a comparison was conducted under a high added noise level (e.g. ) and a high upscaling factor (e.g. ). Fig. 10 illustrates the IS and FID iterations of the generated HSIs from the proposed LE-GAN and the two best competitors, i.e. WGAN and BAGAN, during the training process. The IS and FID curves from the proposed model are smoother and more stable than those from WGAN or BAGAN as the iteration number increases. Unlike the curves from WGAN or BAGAN, the IS curves from the proposed model, LE-GAN, steadily increase and the FID curves steadily decrease for both AVIRIS and UHD-185 datasets. This indicates that no significant mode collapse occurs during the training of LE-GAN. However, a big drop of IS is observed during the training of BAGAN (e.g. after 3500 iterations, as shown in Fig. 10a) and during the training of WGAN (e.g. after 2000 iterations, as shown in Fig. 10c). Moreover, the FID curves do not steadily decrease during the training of WGAN or BAGAN. These observations indicate that mode collapse occurs in the training of representative GAN models (e.g. WGAN and BAGAN), and that the proposed model is more effective in alleviating it.
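The curve inspection described above is visual, but it can be complemented by simple numerical indicators. The sketch below is one possible way to do this and is not part of the original study: `largest_drop` measures the biggest fall of a curve below its running maximum (a large value on an IS curve would flag a sudden loss of quality or diversity), and `roughness` measures the mean absolute second difference (zero for a perfectly linear trend). Both names are hypothetical.

```python
import numpy as np

def largest_drop(curve):
    """Largest decrease of the curve below its running maximum."""
    curve = np.asarray(curve, dtype=np.float64)
    running_max = np.maximum.accumulate(curve)
    return float(np.max(running_max - curve))

def roughness(curve):
    """Mean absolute second difference; larger values mean a less smooth curve."""
    curve = np.asarray(curve, dtype=np.float64)
    return float(np.mean(np.abs(np.diff(curve, n=2))))
```

For an FID curve, where lower is better, the same `largest_drop` check can be applied to the negated curve.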
Finally, to further understand and assess the performance of the generator model in dealing with the mode collapse issue, the distributions of the real high-resolution HSI () and the generated super-resolution HSI () were visualised in the feature space, where the probability densities of the discriminator eigenvalues of and , denoted as and , were used to represent the sample distributions. With and as inputs separately fed into the discriminator described in Section III-A2, the outputs from the last Maxpool layer, denoted as and , represent the eigenvalues of the inputs and in the high-level discriminating space, and the probability densities of and represent the sample distributions of and . The overlap of the probability densities of and represents the mode similarity of and , indicating whether mode collapse occurs in the generator.
Fig. 11 illustrates the probability density curves of and obtained for three GAN models, LE-GAN and its two best competitors, WGAN and BAGAN, after training the models on AVIRIS and UHD-185 datasets with an SNR level of and an upscaling factor of . In comparison with the other two models, the probability density curves of generated by LE-GAN are much closer to those of the real for both AVIRIS (Fig. 11a) and UHD-185 datasets (Fig. 11d). However, the probability density curves of the generated by WGAN (Fig. 11b and e) and BAGAN (Fig. 11c and f) tend to shift towards the right and have a higher peak (i.e. a lower standard deviation). This means that the generated by WGAN or BAGAN can be more easily discriminated from the real by (i.e. low spectral-spatial fidelity), and that the generated only covers limited spectral-spatial patterns of the real (i.e. mode collapse exists). These observations show that the proposed model outperforms the competitors in generating diverse super-resolution samples and alleviating mode collapse.
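The closeness of two probability density curves, as compared visually above, can also be summarised by an overlap coefficient. The sketch below estimates both densities with histograms on a shared set of bins and integrates their pointwise minimum. This is an illustrative measure chosen for this example, not one reported in the study; the inputs are assumed to be one-dimensional arrays of discriminator feature values.

```python
import numpy as np

def density_overlap(real_values, generated_values, bins=50):
    """Overlap coefficient in [0, 1] between two histogram density estimates."""
    real = np.asarray(real_values, dtype=np.float64).ravel()
    gen = np.asarray(generated_values, dtype=np.float64).ravel()
    low = min(real.min(), gen.min())
    high = max(real.max(), gen.max())
    if high == low:
        return 1.0  # all values identical: the distributions coincide
    edges = np.linspace(low, high, bins + 1)
    p, _ = np.histogram(real, bins=edges, density=True)
    q, _ = np.histogram(gen, bins=edges, density=True)
    width = edges[1] - edges[0]
    return float(np.sum(np.minimum(p, q)) * width)
```

A generated distribution that is shifted and narrower than the real one, as described for the competitors, produces a low overlap, while matching distributions give a value close to 1.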
The challenge for GANs in improving the spectral and spatial fidelity of HSI super-resolution and addressing the issue of mode collapse lies in how to make the generator learn the real spectral-spatial patterns while preventing it from over-learning limited patterns. Since there are no such constraints in the JS distance based loss functions, it is hard for the original GAN to generate high-fidelity super-resolution HSIs, and it easily suffers from mode collapse. In this study, we proposed a novel GAN model, named LE-GAN, by improving the GAN baseline and introducing a new SSRP loss function. The new SSRP loss was used to guide the optimisation and alleviate the spectral-spatial mode collapse issue that occurs in the HSI super-resolution process. The model validation and evaluation were conducted using the datasets from two hyperspectral sensors (i.e. AVIRIS and UHD-185) with various upscaling factors (, , and ) and added noise levels ( dB, dB, and dB). The evaluation results showed that the proposed LE-GAN can achieve high-fidelity HSI super-resolution for relatively high upscaling factors, with better robustness against noise and better generalizability to various sensors.
V-A The ablation analysis of the improved modules
In the proposed model, a total of five different modifications have been made to improve the GAN baseline including: 1) using 3D-convolutional filters in , 2) adding an UpscaleBlock in , 3) removing the sigmoid in , 4) adding a novel network, and 5) using a new loss function to optimise the model.
To evaluate the effects of these improvements on the performance of the proposed LE-GAN, we have conducted an ablation analysis in which we gradually substituted the traditional GAN components with the proposed modules and compared their effects based on six evaluation metrics, PSNR, SSIM, PI, SAM, SRE, and computing time (CT). Each improvement is an incremental modification to the original GAN model, thus forming five different models: Model 1 to Model 5. The details of the five models and their influences on the six evaluation metrics for the testing datasets (AVIRIS and UHD-185) with scale factor are presented in Fig. 12. The super-resolution results of three example patches are also displayed for the visual comparison.
V-A1 Model 1: using 3D-convolutional filters in
In order to process continuous spectral channels and learn joint spectral-spatial features in the ResBlock in , 3D-convolutional filters are used. Theoretically, this modification is able to extract both the spectral correlation characteristics and spatial texture information.
V-A2 Model 2: Adding an UpscaleBlock in
In a super-resolution network, the most important strategy to improve the performance is to increase the information (e.g. the dimensionality of feature maps) of an LR HSI to match that of the corresponding HR HSI. However, the traditional approaches gradually increase the feature dimensionality across all the intermediate layers, which increases the computational complexity and cost. In contrast, we proposed an UpscaleBlock to super-resolve the detailed spectral-spatial information only at the end of the generator network (see Fig. 3). This adjustment removes the computational and memory overhead of performing super-resolution operations in the intermediate layers. Thus, a smaller filter size can be effectively used in our generator network for the extraction of super-resolution features. The results of Model 2 (the third column in Fig. 12) reveal a performance improvement after adding the UpscaleBlock. Compared to Model 1, the computation time has a reduction on average without losing the super-resolution quality; Model 2 even has a better super-resolution quality in terms of PSNR, SSIM, PI, SAM and SRE.
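The saving obtained by upscaling only at the end of the generator can be illustrated with a rough operation count. The sketch below counts the multiply-accumulate operations (MACs) of a single 2D convolution layer as height x width x input channels x output channels x kernel area; a 3D convolution would add a spectral kernel dimension but scale with spatial size in the same way. All layer sizes in the usage note are hypothetical.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size):
    """Multiply-accumulate count of one 2D convolution with 'same' output size."""
    return height * width * in_channels * out_channels * kernel_size ** 2
```

For example, the same 3 x 3 layer with 64 input and 64 output channels applied to a 32 x 32 LR feature map costs 16 times fewer MACs than when it is applied to the 128 x 128 map obtained after a 4x upscaling, since the cost grows with the square of the scaling factor.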
V-A3 Model 3: Removing the sigmoid function from the discriminator
In the traditional GAN framework, the sigmoid-activated features often skew the original feature distribution and result in lower reconstructed spectral values. Therefore, in this study, we removed the sigmoid activation in the discriminator network for two reasons. Firstly, using the features before activation can benefit the accurate reconstruction of the spectral and spatial features of the input. Secondly, the proposed latent space distance requires the real feature distribution of the input HSI in the low-dimensional manifold in order to measure the divergence between the generated super-resolution HSI and the real HR HSI. This modification, as shown in Model 3 in the fourth column of Fig. 12, contributes to an approximately reduction in SAM and reduction in SRE. These findings suggest that removing the sigmoid activation can help keep the spectral consistency between the LR and HR HSIs.
V-A4 Model 4: Adding a newly developed La network
The network is developed to produce a latent regularisation term, which holds up the manifold space of the generated super-resolution HSI so that the dimensionality of the generated HSI is consistent with that of the real HR HSI. In addition, the network makes the divergence of the generated HSI and real HSI satisfy the Lipschitz condition for optimisation. The generated super-resolution HSI patches from Model 4 (see the fifth column of Fig. 12) indicate that, after adding the network into the original GAN framework, both SAM and SRE have a significant reduction, with a drop of and , respectively. Besides, there is a slight improvement in the PSNR, SSIM, and PI (the PSNR and SSIM respectively increase and , and the PI declines ). These results indicate that the regularisation term produced by the network makes a great contribution to reconstructing spectral-spatial details consistent with the real HR HSI. However, the network needs to occupy a certain amount of computational and memory resources, and the computation time subsequently increases .
V-A5 Model 5: Using the new loss function to optimise the model
The most important contribution of our work is to develop an SSRP loss function with a latent regularisation to optimise the whole model. Model 5 (see the last column of Fig. 12), the final version of the LE-GAN, improves all of the evaluation metrics. The increases of PSNR and SSIM are and , respectively, while the decreases of SAM and SRE are and , respectively. However, it leads to a increase of computation time. These findings suggest that the proposed SSRP loss function with the latent space regularisation can boost the performance in measuring the divergence of the generated HSI and real HSI in both spectral and spatial dimensions.
V-B The Evaluation of the loss function
The proposed loss function, which introduces latent regularisation into the Wasserstein loss function, optimises the GAN in the latent manifold space and addresses the problem of mode collapse. In order to verify the effectiveness of the proposed loss function, we trained the proposed LE-GAN model with three kinds of losses: 1) the traditional JS divergence-based loss, 2) the Wasserstein distance-based loss, and 3) the proposed improved Wasserstein loss with latent regularisation, and plotted their loss curves on both the training and validation sets in Fig. 13.
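The difference between the first two kinds of losses can be sketched in a few lines. The code below shows the standard textbook forms of the discriminator objective for a JS divergence-based GAN and for a Wasserstein GAN; it is an illustration only and not the training code of LE-GAN. The optional `regulariser` argument is a hypothetical placeholder for an already computed penalty value: the exact form of the latent regularisation term is defined elsewhere in the paper and is not reproduced here.

```python
import numpy as np


def js_discriminator_loss(real_logits, fake_logits):
    """Standard GAN discriminator loss (sigmoid cross-entropy on logits).

    Uses -log(sigmoid(x)) = logaddexp(0, -x) and
    -log(1 - sigmoid(x)) = logaddexp(0, x) for numerical stability.
    """
    real_term = np.mean(np.logaddexp(0.0, -np.asarray(real_logits)))
    fake_term = np.mean(np.logaddexp(0.0, np.asarray(fake_logits)))
    return float(real_term + fake_term)


def wasserstein_critic_loss(real_scores, fake_scores, regulariser=0.0, weight=0.0):
    """Wasserstein critic loss on raw (non-sigmoid) scores.

    `regulariser` and `weight` are hypothetical hooks for an additional
    penalty term; with the defaults this is the plain Wasserstein loss.
    """
    gap = np.mean(fake_scores) - np.mean(real_scores)
    return float(gap + weight * regulariser)


# An undecided discriminator (all logits 0) gives the JS-based loss 2*log(2).
undecided = np.zeros(8)
print(js_discriminator_loss(undecided, undecided))

# A critic that scores real samples 1 and generated samples 0 gives -1.
print(wasserstein_critic_loss(np.ones(8), np.zeros(8)))
```

Note that the Wasserstein critic operates on unbounded scores, so no sigmoid is applied to its output.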
The training process of the model with the JS divergence-based loss, shown in Fig. 13a, is clearly unstable and volatile. The reason is that, while the discriminative capability of the discriminator is being maximised, the JS divergence leads to the supports of the real and generated distributions being disjoint in the low-dimensional manifolds, which causes gradient fluctuation. In contrast, the Wasserstein distance-based loss functions, shown in Fig. 13b and c, improve the stability of learning and lead the loss to converge to the minimum. This finding is consistent with the studies of Arjovsky et al. [LN04] and Ishaan et al. [RN32]. In addition, it is noteworthy that the loss curve of the proposed model is more stable and smoother than that of the traditional Wasserstein distance-based loss. The underlying theory is that introducing the latent regularisation terms into the training process provides a non-singular support for the generated sample sets in the corresponding low-dimensional manifolds. The Wasserstein distance is expected to perform better when the divergence between the generated and real distributions is continuous and differentiable. With the latent regularisation, the max-min game of LE-GAN yields a generated probability distribution in a low-dimensional manifold that has a joint support with the real distribution, and minimising the Wasserstein distance facilitates the gradient descent of the trainable parameters in the generator because valid gradients can be captured from the optimised discriminator in the low-dimensional manifold. Therefore, the latent regularisation-derived Wasserstein loss is regarded as a more sensible loss function for HSI super-resolution than the JS divergence loss and the traditional Wasserstein loss.
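The effect of disjoint supports can be reproduced with a toy example in the spirit of the one used by Arjovsky et al. The sketch below is an illustration on one-dimensional point masses, not an experiment from this study; the helper names are our own. When the two distributions do not overlap, the JS divergence stays at log 2 no matter how far apart they are, so it provides no useful gradient, whereas the Wasserstein-1 distance grows in proportion to the shift.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q, grid):
    """Wasserstein-1 distance between discrete distributions on a sorted 1-D grid."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))
    return float(np.sum(cdf_gap[:-1] * np.diff(grid)))


def point_mass(index, size):
    """Distribution with all probability at one grid position."""
    d = np.zeros(size)
    d[index] = 1.0
    return d


grid = np.linspace(0.0, 1.0, 11)
real = point_mass(0, grid.size)

# Move the "generated" point mass further and further away from the real one.
shifts = [2, 5, 10]
js_values = [js_divergence(real, point_mass(k, grid.size)) for k in shifts]
w_values = [wasserstein_1d(real, point_mass(k, grid.size), grid) for k in shifts]

for k, js, w in zip(shifts, js_values, w_values):
    print(f"shift={grid[k]:.1f}  JS={js:.4f}  W1={w:.4f}")
```

The constant JS value across all shifts corresponds to the flat, uninformative gradient described above, while the Wasserstein distance changes smoothly with the shift.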
The subplots above the learning curves in Fig. 13 show the images generated during the optimisation process when the three different losses are used. The super-resolved HSI subplots optimised by the JS divergence-based loss (see Fig. 13a) clearly do not reach the quality of spatial texture reconstruction achieved by the proposed model (see Fig. 13c). The proposed latent regularisation term makes the dimensionality of the generated HSI manifold more consistent with that of the HR HSI during the optimisation process.
To address the challenge of spectral-spatial distortions caused by mode collapse during the optimisation process, this work has developed a latent encoder coupled GAN for spectral-spatial realistic HSI super-resolution. In the proposed GAN architecture, the generator is designed based on an STSSRW mechanism that takes the spectral-spatial hierarchical structures into account during the upscaling process. In addition, a latent regularised encoder is embedded in the GAN framework to map the generated spectral-spatial features into a latent manifold space and to enable the generator to better estimate the local spectral-spatial invariances in the latent space. For the model optimisation, an SSRP loss has been introduced to avoid spectral-spatial distortion in the super-resolution HSI. With the SSRP loss, both the spectral-spatial perceptual differences and the adversarial loss in the latent space are measured during the optimisation process. More importantly, a latent regularisation component is coupled with the optimisation process to maintain the continuity and non-singularity of the generated spectral-spatial feature distribution in the latent space and to increase the diversity of the super-resolution features. We have conducted experimental evaluations in terms of both mode collapse and performance. The proposed approach has been tested and validated on the AVIRIS and UHD-185 HSI datasets and compared with five state-of-the-art super-resolution methods. The results show that the proposed model outperforms the existing methods, is more robust to noise, and is less sensitive to the upscaling factor. The proposed model is capable of not only generating high-quality super-resolution HSIs (in terms of both spatial texture and spectral consistency) but also alleviating the mode collapse issue.
This research was supported by BBSRC (BB/R019983/1) and BBSRC (BB/S020969/1). The work is also supported by the Newton Fund Institutional Links grant, ID 332438911, under the Newton-Ungku Omar Fund partnership (the grant is funded by the UK Department of Business, Energy, and Industrial Strategy (BEIS)) and by the Open Research Fund of Key Laboratory of Digital Earth Science, Chinese Academy of Sciences (No. 2019LDE003). For further information, please visit www.newtonfund.ac.uk.