PSGAN: A Generative Adversarial Network for Remote Sensing Image Pan-Sharpening

05/09/2018 ∙ by Xiangyu Liu, et al. ∙ Beihang University

Remote sensing image fusion (also known as pan-sharpening) aims to generate a high resolution multi-spectral image from a high spatial resolution single band panchromatic (PAN) image and a low spatial resolution multi-spectral (MS) image. In this paper, we propose PSGAN, a generative adversarial network (GAN) for remote sensing image pan-sharpening. To the best of our knowledge, this is the first attempt at producing high quality pan-sharpened images with GANs. PSGAN consists of two parts: a two-stream fusion architecture that generates the desired high resolution multi-spectral images, and a fully convolutional network that serves as a discriminator to distinguish "real" from "pan-sharpened" MS images. Experiments on images acquired by the QuickBird and GaoFen-1 satellites demonstrate that the proposed PSGAN fuses PAN and MS images effectively and significantly improves on state-of-the-art traditional and CNN based pan-sharpening methods.




1 Introduction

Recently, many high resolution (HR) optical Earth observation satellites, such as QuickBird, GeoEye and GaoFen-1, have been launched, providing a large amount of data for research fields including geography and land surveying. Many of these applications require images at the highest resolution in both the spatial and spectral domains. However, due to technical limitations [1], satellites usually acquire images in two different yet complementary modalities: a high resolution panchromatic (PAN) image and a low resolution (LR) multi-spectral (MS) image. Pan-sharpening (i.e. panchromatic and multi-spectral image fusion), which aims at generating high spatial resolution MS images by combining the spatial and spectral information of PAN and MS images, provides a good solution to this problem.

Many methods have been developed during the last decades. A comprehensive review of pan-sharpening techniques can be found in [2]. In recent years, inspired by the great success of deep learning in various computer vision tasks [3, 4, 5], researchers in the remote sensing community have also attempted to apply deep learning techniques to the pan-sharpening task. For instance, Masi et al. [6] proposed a pan-sharpening method based on convolutional neural networks (CNN), using a three-layer CNN architecture modified from SRCNN [3]. Zhong et al. [7] presented a CNN based hybrid pan-sharpening method: CNNs enhance the spatial resolution of the MS images, and the Gram-Schmidt transformation then fuses the enhanced MS and PAN images. The network used in that study is also a three-layer CNN similar to SRCNN. Yang et al. [8] presented a deep network architecture named PanNet for pan-sharpening, in which domain knowledge is incorporated to improve performance. Although promising results can be obtained, these methods either simply treat pan-sharpening as a super-resolution problem or train the networks by minimizing the Euclidean distance between the predicted and reference images, which leads to blurred results.

Figure 1: Example results of the proposed method. (c) are high resolution MS images generated from (a) PAN and (b) MS. (d) are ground truth HR MS images.

More recently, Goodfellow et al. [9] introduced generative adversarial networks (GANs) to generate images that are indistinguishable from real ones. GANs learn a discriminator that tries to distinguish whether an image is real or fake, while simultaneously training a generator to minimize the difference between the generated and real images. Later, Isola et al. [10] proposed a general-purpose solution to image-to-image translation, which applied GANs to learn mapping functions for different image translation problems. In addition to [10], GANs have been successfully applied to many vision tasks such as image super-resolution [11] and visual saliency detection [12].

To further enhance the performance of pan-sharpening networks and obtain realistic pan-sharpened images, in this paper we explore the use of the GAN framework to address the remote sensing image pan-sharpening problem. To the best of our knowledge, this is the first attempt to apply GANs to the pan-sharpening task. The contributions of this study are threefold: 1) for the first time, a GAN is applied to the remote sensing image pan-sharpening task; 2) to accomplish pan-sharpening with a GAN, we design a two-stream CNN architecture as the generator to produce high quality pan-sharpened images and employ a fully convolutional discriminator to learn an adaptive loss function that improves the quality of the pan-sharpened images; and 3) we demonstrate that the proposed PSGAN outperforms traditional and CNN based methods on the pan-sharpening problem. Fig. 1 shows some example results produced by our method.

The remainder of this paper is organized as follows. Section 2 formulates pan-sharpening from the perspective of generative adversarial learning and gives details of the proposed PSGAN architecture. Experiments are presented in Section 3, and Section 4 concludes the paper.

2 Pan-sharpening generative adversarial network

2.1 Pan-sharpening as GAN

Pan-sharpening aims to estimate a pan-sharpened HR MS image $\hat{P}$ from a LR MS image $X$ and a HR PAN image $Y$. The output images should be as close as possible to the ideal HR MS images $P$. We describe $X$ by a real-valued tensor of size $w \times h \times b$, $Y$ by $rw \times rh \times 1$, and $\hat{P}$, $P$ by $rw \times rh \times b$ respectively, where $r$ is the spatial resolution ratio between the LR MS and HR PAN images ($r = 4$ in this paper) and $b$ is the number of bands. The ultimate goal of pan-sharpening takes the general form

$$\hat{P} = f(X, Y; \Theta), \tag{1}$$

where $f$ is a pan-sharpening model which takes $X$ and $Y$ as input and produces the desired HR MS image $\hat{P}$, and $\Theta$ is the collection of parameters of this model. Eq. 1 can be solved by minimizing the following loss function:

$$\ell(\Theta) = \sum_{i=1}^{N} \left\| f(X_i, Y_i; \Theta) - P_i \right\|, \tag{2}$$

where $N$ is the number of training samples. As an example, Eq. 1 can be realized from the perspective of compressed sensing and Eq. 2 can be solved using dictionary learning [13].

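From Eq. 1 we can see that $f$ can be considered as a mapping function from $(X, Y)$ to $P$. Thus, we can reformulate pan-sharpening as a conditional image generation problem that can be solved using a conditional GAN [10]. Following [9] and [10], we define a generative network $G$ that maps the joint distribution $p_d(X, Y)$ to the target distribution $p_r(P)$. The generator $G$ tries to produce a pan-sharpened image $\hat{P}$ that cannot be distinguished from the reference image $P$ by an adversarially trained discriminative network $D$. This can be expressed as a mini-max game problem,

$$\min_{G} \max_{D} \; \mathbb{E}_{(X, Y) \sim p_d(X, Y),\, P \sim p_r(P)} \left[ \log D(X, Y, P) \right] + \mathbb{E}_{(X, Y) \sim p_d(X, Y)} \left[ \log \left( 1 - D(X, Y, G(X, Y)) \right) \right]. \tag{3}$$

To train the generative network $G$, many existing works employ $\ell_2$ as the loss function, which results in blurring effects. In this paper we adopt $\ell_1$ to train $G$:

$$\mathcal{L}_{\ell_1}(G) = \sum_{i=1}^{N} \left\| P_i - G(X_i, Y_i) \right\|_1. \tag{4}$$

Finally, the loss function for $G$ takes the form:

$$\mathcal{L}(G) = \sum_{i=1}^{N} \left[ -\alpha \log D(X_i, Y_i, G(X_i, Y_i)) + \beta \left\| P_i - G(X_i, Y_i) \right\|_1 \right], \tag{5}$$

where $N$ is the number of training samples, and $\alpha$, $\beta$ are hyper-parameters set to 1 and 100 in the experiments. In the next two subsections we give the detailed architectures of the generator $G$ and discriminator $D$.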
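As a rough illustration of the training objective, the following NumPy sketch evaluates a pix2pix-style generator loss (an adversarial term weighted by 1 plus an ℓ1 term weighted by 100) together with a standard binary cross-entropy discriminator loss. The function names, the averaging over pixels and samples, and the `EPS` constant are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

ALPHA, BETA = 1.0, 100.0  # weights reported in the text (adversarial, l1)
EPS = 1e-8                # assumed constant that keeps log() finite


def generator_loss(d_fake, fake, reference, alpha=ALPHA, beta=BETA):
    """Adversarial term plus weighted l1 term, averaged over the batch.

    d_fake    : discriminator probabilities for generated images, in (0, 1)
    fake      : generated images, shape (N, H, W, B)
    reference : reference HR MS images, same shape as `fake`
    """
    n = fake.shape[0]
    adv = -np.log(d_fake + EPS).reshape(n, -1).mean(axis=1)
    l1 = np.abs(reference - fake).reshape(n, -1).mean(axis=1)
    return float(np.mean(alpha * adv + beta * l1))


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images -> 1, generated images -> 0."""
    real_term = -np.log(d_real + EPS).mean()
    fake_term = -np.log(1.0 - d_fake + EPS).mean()
    return float(real_term + fake_term)


rng = np.random.default_rng(0)
reference = rng.random((2, 8, 8, 4))
perfect = reference.copy()
blurry = np.full_like(reference, reference.mean())

# A perfect reconstruction that fully fools the discriminator gives a loss near 0.
loss_perfect = generator_loss(np.ones((2, 1)), perfect, reference)
# A flat, over-smoothed output is penalised by the l1 term.
loss_blurry = generator_loss(np.full((2, 1), 0.5), blurry, reference)
```

With the ℓ1 weight at 100, the reconstruction term dominates early training, while the adversarial term mainly sharpens details once the output is already close to the reference.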

Figure 2: Detailed architectures of the (a) generator $G$ and (b) discriminator $D$. The numbers of feature maps are indicated on the left and filter sizes on the right. LeakyReLU is used throughout except for the last layers of the two networks.

2.2 Two-stream generator

One possible way of designing the generator is to use a network architecture similar to [6]. However, spectral distortion may occur with that design. In this paper, we develop a novel two-stream network as the generator to produce pan-sharpened images [14]. Its architecture is shown in the left part of Fig. 2. Instead of performing pan-sharpening at the pixel level [6, 7, 8], we accomplish fusion in the feature domain, which reduces spectral distortion. This is because PAN and MS images contain different information: the PAN image carries geometric detail (spatial) information, while the MS image preserves spectral information. To make the best use of spatial and spectral information, we utilize two sub-networks to extract hierarchical features that capture the complementary information of the PAN and MS images. After that, the network proceeds as an auto-encoder: the encoder fuses the information extracted from the PAN and MS images, and the decoder reconstructs the HR MS images from the fused features.

The two sub-networks have similar architectures but different weights. Each of them consists of two successive convolutional layers followed by a leaky rectified linear unit (LeakyReLU) [15] and a down-sampling layer. A convolutional layer with a stride of 2, rather than a simple pooling strategy such as max pooling, is used to down-sample the feature maps. The feature maps of the two sub-networks are first concatenated, then fused by subsequent convolutional layers. Finally, a decoder-like network comprised of 2 transposed convolutional layers and 3 flat convolutional layers is applied to reconstruct the desired HR MS images.

Inspired by the U-Net [16], we adapt the network by adding skip connections. The skip connections not only pass details to higher layers but also ease the training. In the last layer, ReLU is used to guarantee that the output is non-negative. The detailed architecture and parameters can be found in the left part of Fig. 2.
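To make the encoder-decoder layout easier to picture, the sketch below follows one spatial dimension of a 128-pixel input through stride-2 convolutions and transposed convolutions. The number and placement of the down-sampling steps are assumptions, chosen so that two transposed convolutions restore the input size; "same" padding is also assumed, and the helper names are hypothetical.

```python
import math


def conv_out(size, stride):
    """Spatial size after a 'same'-padded convolution (padding is assumed)."""
    return math.ceil(size / stride)


def deconv_out(size, stride):
    """Spatial size after a 'same'-padded transposed convolution."""
    return size * stride


def trace_generator(size=128):
    """Follow one spatial dimension through an assumed encoder-decoder layout."""
    trace = [("input", size)]
    size = conv_out(size, 2)      # stride-2 conv at the end of each sub-network
    trace.append(("sub-network down-sample", size))
    size = conv_out(size, 2)      # assumed second down-sampling in the fusion encoder
    trace.append(("fusion down-sample", size))
    size = deconv_out(size, 2)    # first transposed convolution
    trace.append(("decoder up-sample 1", size))
    size = deconv_out(size, 2)    # second transposed convolution
    trace.append(("decoder up-sample 2", size))
    return trace


sizes = dict(trace_generator(128))
# Skip connections can only join feature maps of equal spatial size.
skip_ok = sizes["sub-network down-sample"] == sizes["decoder up-sample 1"]
```

Under these assumptions the feature maps shrink from 128 to 32 and grow back to 128, and the encoder map at size 64 lines up with the first decoder output, which is where a U-Net style skip connection would attach.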

The generator $G$ takes a pair of LR MS image $X$ and HR PAN image $Y$ as input. $X$ is up-sampled to match the size of $Y$. The output of the generator is a pan-sharpened image with the same shape as the up-sampled $X$.
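The input preparation can be sketched in a few lines of NumPy. The text does not say which interpolation is used for up-sampling, so nearest-neighbour repetition is used here purely for brevity; the patch sizes and the band count of 4 are likewise assumptions for the example.

```python
import numpy as np


def upsample_nearest(ms, r):
    """Repeat every pixel r times along height and width (nearest neighbour)."""
    return np.repeat(np.repeat(ms, r, axis=0), r, axis=1)


r, b = 4, 4                         # resolution ratio from the text; 4 bands assumed
lr_ms = np.random.rand(32, 32, b)   # LR MS patch
pan = np.random.rand(128, 128, 1)   # HR PAN patch

ms_up = upsample_nearest(lr_ms, r)
assert ms_up.shape[:2] == pan.shape[:2]

# The two streams receive ms_up and pan separately; the generator output
# is expected to have the same shape as the up-sampled MS input.
expected_output_shape = ms_up.shape
```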

2.3 Fully convolutional discriminator

In addition to the generator, a conditional discriminator network $D$ is trained simultaneously to discriminate the reference MS images from the generated pan-sharpened images. Similar to [10], we use a fully convolutional discriminator, which consists of five layers with kernels of 3 × 3. The strides of the first three layers are set to 2, and those of the last two to 1. Except for the last layer, all the convolution layers are activated through LeakyReLU. In the last layer, a sigmoid is used to predict, for each patch, the probability of being a real HR MS or a pan-sharpened MS image.
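The per-patch output of such a discriminator can be illustrated by tracing the spatial size through the five layers. The sketch assumes "same" padding (not stated in the text) and a 128 × 128 input; `patch_map_size` and `sigmoid` are hypothetical helper names, and the zero logits are only a placeholder for real network outputs.

```python
import math
import numpy as np

STRIDES = [2, 2, 2, 1, 1]  # five 3x3 convolution layers, as described in the text


def patch_map_size(size, strides=STRIDES):
    """Output size of the discriminator for one spatial dimension.

    Assumes 'same' padding, so each layer maps size -> ceil(size / stride).
    """
    for s in strides:
        size = math.ceil(size / s)
    return size


def sigmoid(x):
    """Logistic function mapping raw scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


side = patch_map_size(128)
logits = np.zeros((side, side))   # placeholder for the last layer's raw outputs
patch_probs = sigmoid(logits)     # one real/pan-sharpened probability per patch
```

Under these assumptions the discriminator emits a 16 × 16 grid of probabilities rather than a single score, so each entry judges one local region of the image.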

2.4 Implementation details

PSGAN is implemented in TensorFlow and trained on a single NVIDIA Titan XP GPU. We use the Adam optimizer [17] with an initial learning rate of 0.0002 and a momentum of 0.5 to minimize the loss. The mini-batch size is set to 32. It takes about 10 hours to train our network. The source code for this work is available at
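For reference, the optimizer settings can be written out as a minimal Adam update in NumPy. The sketch assumes that the "momentum" of 0.5 refers to Adam's first-moment decay rate (`beta1`), as is common in GAN training; `beta2` and `eps` are assumed to be the usual defaults, and the class is illustrative rather than the training code.

```python
import numpy as np


class Adam:
    """Minimal Adam update for a single parameter array."""

    def __init__(self, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
        # lr and beta1 follow the text; beta2 and eps are assumed defaults.
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None   # running mean of gradients
        self.v = None   # running mean of squared gradients
        self.t = 0      # step counter used for bias correction

    def step(self, param, grad):
        if self.m is None:
            self.m = np.zeros_like(param)
            self.v = np.zeros_like(param)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


opt = Adam()
w = np.array([1.0, -2.0])
w_next = opt.step(w, grad=np.array([0.3, -0.7]))
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by roughly the learning rate in the direction opposite to its gradient sign.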

3 Experiments

3.1 Datasets and evaluation indexes

We train and test our network on two datasets comprised of images acquired by the QuickBird and GaoFen-1 (GF-1) satellites. The spatial resolution of QuickBird is 0.6 m ground sampling distance (GSD) for PAN and 2.4 m GSD for MS; for GF-1 it is 2 m GSD at nadir for PAN and 8 m GSD at nadir for MS.

The QuickBird dataset contains 9 pairs of large scale MS and PAN images; 8 of them are used to generate the training set and the last one is used for testing. The GF-1 dataset contains 5 pairs of images, 4 of which are used for training. Since the desired HR MS images are not available, we follow Wald’s protocol [18] to down-sample both the MS and PAN images by a factor of $r$ ($r = 4$ in this paper). The original MS images are then used as reference images for comparison. Finally, training samples with a size of 128 × 128 are randomly cropped from the training images to train the two-stream generator $G$ and the fully convolutional discriminator $D$. The number of training samples is 40,000 for each dataset. All the results reported in the following subsections are based on the test sets, which are independent of the training images.
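The reduced-resolution sample generation can be sketched as follows. Block averaging stands in for the low-pass filter and decimation, which the text does not specify, and the 128 × 128 patch size is interpreted here as the size of the reference (and degraded PAN) patch; both choices, along with the function names, are assumptions.

```python
import numpy as np


def block_mean(img, r):
    """Down-sample an (H, W, C) image by averaging r x r blocks."""
    h, w, c = img.shape
    h, w = h - h % r, w - w % r          # drop rows/columns that do not fill a block
    blocks = img[:h, :w].reshape(h // r, r, w // r, r, c)
    return blocks.mean(axis=(1, 3))


def make_sample(ms, pan, r=4, patch=128, rng=None):
    """Build one (LR MS, PAN, reference) training triple at reduced resolution."""
    rng = np.random.default_rng() if rng is None else rng
    lr_ms = block_mean(ms, r)            # degraded MS becomes the network input
    lr_pan = block_mean(pan, r)          # degraded PAN now matches the original MS grid
    top = rng.integers(0, ms.shape[0] - patch + 1)
    left = rng.integers(0, ms.shape[1] - patch + 1)
    top, left = top - top % r, left - left % r   # keep the crop aligned to the LR grid
    ref = ms[top:top + patch, left:left + patch]
    pan_patch = lr_pan[top:top + patch, left:left + patch]
    ms_patch = lr_ms[top // r:(top + patch) // r, left // r:(left + patch) // r]
    return ms_patch, pan_patch, ref


ms = np.random.rand(256, 256, 4)          # original MS image (4 bands assumed)
pan = np.random.rand(1024, 1024, 1)       # original PAN image on a 4x finer grid
ms_patch, pan_patch, ref = make_sample(ms, pan, rng=np.random.default_rng(0))
```

The original MS crop serves as the reference, so the network is trained at a scale where ground truth exists and is then applied at full resolution.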

We use five widely used metrics to evaluate the performance of the proposed and other methods on the two datasets, including Q [19], ERGAS [18], RASE [20], sCC [21] and SAM [22].
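Three of these metrics have short closed forms that are easy to write down. The sketch below follows the commonly used definitions of SAM, ERGAS and RASE, which may differ in detail from the evaluation code behind the reported numbers; Q and sCC are omitted for brevity, and the toy data are made up for the example.

```python
import numpy as np


def sam_degrees(ref, fused, eps=1e-12):
    """Mean spectral angle (degrees) between per-pixel band vectors."""
    dot = np.sum(ref * fused, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


def band_rmse(ref, fused):
    """Root-mean-square error of each band, shape (B,)."""
    return np.sqrt(np.mean((ref - fused) ** 2, axis=(0, 1)))


def ergas(ref, fused, ratio=4):
    """ERGAS = 100 / ratio * sqrt(mean over bands of (RMSE_b / mean_b)^2)."""
    means = ref.mean(axis=(0, 1))
    return float(100.0 / ratio * np.sqrt(np.mean((band_rmse(ref, fused) / means) ** 2)))


def rase(ref, fused):
    """RASE = 100 / mean(ref) * sqrt(mean over bands of RMSE_b^2)."""
    return float(100.0 / ref.mean() * np.sqrt(np.mean(band_rmse(ref, fused) ** 2)))


rng = np.random.default_rng(0)
ref = rng.random((16, 16, 4)) + 0.5   # strictly positive toy reference
scaled = 1.1 * ref                    # brightness change, same spectral direction
```

For SAM, ERGAS and RASE lower values are better. A pure brightness scaling leaves SAM near zero while ERGAS and RASE increase, which shows that SAM isolates spectral distortion from radiometric error.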

3.2 Impact of loss functions

The $\ell_2$ loss has been widely used in low level vision tasks [3, 6, 7, 8]. However, in this work we demonstrate that the $\ell_1$ loss generates better results than $\ell_2$ on the pan-sharpening task. Firstly, we conduct experiments on the two datasets utilizing only the generator ($G$). The results are shown in Table 1. It can be seen that the $\ell_1$ loss achieves better results than the $\ell_2$ loss in terms of all evaluation metrics. Furthermore, the results can be significantly improved via adversarial training. The results on the two datasets are also given in Table 1, in which the proposed PSGAN shows clear advantages over the non-adversarial networks, i.e. the generator trained alone.

Figure 3: Pan-sharpening results on the test sites of QuickBird and GF-1. The first two rows are results cropped from QuickBird and the last two rows show results of GF-1. Please zoom in to see more details.
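
Table 1: Performance of the generator $G$ and PSGAN with different loss functions.

| Dataset | Method | Q | ERGAS | RASE | sCC | SAM |
|---|---|---|---|---|---|---|
| QB | $G$-$\ell_2$ | 0.952 | 6.953 | 25.973 | 0.958 | 6.198 |
| QB | $G$-$\ell_1$ | 0.955 | 6.622 | 23.891 | 0.961 | 5.947 |
| QB | PSGAN-$\ell_2$ | 0.965 | 6.197 | 22.601 | 0.968 | 5.505 |
| QB | PSGAN-$\ell_1$ | 0.971 | 5.549 | 20.560 | 0.972 | 5.371 |
| GF-1 | $G$-$\ell_2$ | 0.988 | 9.691 | 26.490 | 0.960 | 5.446 |
| GF-1 | $G$-$\ell_1$ | 0.950 | 9.318 | 26.188 | 0.964 | 4.853 |
| GF-1 | PSGAN-$\ell_2$ | 0.969 | 7.734 | 19.658 | 0.970 | 4.772 |
| GF-1 | PSGAN-$\ell_1$ | 0.974 | 7.030 | 17.890 | 0.973 | 4.353 |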
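
Table 2: Performance comparisons on the test datasets.

| Dataset | Method | Q | ERGAS | RASE | sCC | SAM |
|---|---|---|---|---|---|---|
| QB | APCA [23] | 0.859 | 11.815 | 44.116 | 0.909 | 7.563 |
| QB | AIHS [24] | 0.841 | 11.055 | 41.306 | 0.916 | 7.082 |
| QB | AWLP [25] | 0.902 | 11.701 | 41.049 | 0.915 | 7.744 |
| QB | PCNN [6] | 0.952 | 7.022 | 26.208 | 0.957 | 6.614 |
| QB | $G$-$\ell_1$ | 0.955 | 6.622 | 23.891 | 0.961 | 5.947 |
| QB | PSGAN-$\ell_1$ | 0.971 | 5.549 | 20.560 | 0.972 | 5.371 |
| GF-1 | APCA [23] | 0.709 | 25.196 | 62.960 | 0.748 | 8.175 |
| GF-1 | AIHS [24] | 0.675 | 20.353 | 54.546 | 0.703 | 6.855 |
| GF-1 | AWLP [25] | 0.554 | 59.869 | 83.611 | 0.535 | 8.829 |
| GF-1 | PCNN [6] | 0.936 | 11.770 | 28.161 | 0.934 | 6.537 |
| GF-1 | $G$-$\ell_1$ | 0.950 | 9.318 | 26.188 | 0.964 | 4.853 |
| GF-1 | PSGAN-$\ell_1$ | 0.974 | 7.030 | 17.890 | 0.973 | 4.353 |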

3.3 Comparison with other methods

In this subsection we compare the proposed PSGAN with four widely used techniques: APCA [23], AIHS [24], AWLP [25] and PCNN [6]. We re-implement the first three methods in MATLAB and PCNN in TensorFlow. All hyper-parameters are set as suggested in the respective papers. Table 2 lists the quantitative evaluations on the two datasets in terms of all five metrics, from which we can see that deep learning based pan-sharpening methods outperform traditional ones. As one of the first attempts at applying deep learning techniques to pan-sharpening, PCNN achieves promising results on both satellites. The proposed PSGAN gives the best results on both datasets, especially on GF-1. We can also see that even the generator part of PSGAN alone outperforms PCNN.
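The size of the gap between PSGAN and PCNN can be made concrete with a small calculation on the ERGAS values listed in Table 2. The helper name `relative_reduction` is hypothetical, and ERGAS is picked only as one representative error metric.

```python
# ERGAS values copied from Table 2 (lower is better).
ERGAS = {
    "QB": {"PCNN": 7.022, "PSGAN": 5.549},
    "GF-1": {"PCNN": 11.770, "PSGAN": 7.030},
}


def relative_reduction(baseline, value):
    """Fractional reduction of an error metric with respect to a baseline."""
    return (baseline - value) / baseline


reductions = {
    name: relative_reduction(row["PCNN"], row["PSGAN"])
    for name, row in ERGAS.items()
}
for name, red in reductions.items():
    print(f"{name}: ERGAS reduced by {100 * red:.1f}% relative to PCNN")
```

The reduction is roughly twice as large on GF-1 as on QB, which is consistent with the observation that the improvement is most pronounced on GF-1.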

3.4 Visual comparisons

Fig. 3 shows some results cropped from the test sites of QuickBird and GF-1. All the images are displayed in true color (RGB). In Fig. 3, APCA [23] and AIHS [24] produce images with obvious blurring. AWLP [25] returns results similar to the reference images with fine spatial details, but suffers from strong spectral distortions. The results of PCNN [6] are similar to ours, but some spatial details are noticeably missing. PSGAN produces the best results, with less blurring and reduced spectral distortions.

4 Conclusion

In this paper, we have proposed PSGAN, which consists of a two-stream fusion network as the generator and a patch discriminator, to solve the pan-sharpening problem. Experiments on different sensors demonstrate the effectiveness of the proposed method, and comparisons with other methods show its clear advantage. In the future, we would like to apply deeper networks, such as residual neural networks, to our approach. We will also explore the generalization of the model by testing it on data from satellites that were never used for training.


This work was supported by the Natural Science Foundation of China (NSFC) under Grant No. 61601011.


  • [1] Yun Zhang, “Understanding image fusion,” PE&RS, vol. 70, no. 6, pp. 657–661, 2004.
  • [2] Hassan Ghassemian, “A review of remote sensing image fusion methods,” Information Fusion, vol. 32, pp. 75–89, 2016.
  • [3] Chao Dong, Chen Change Loy, Kaiming He, et al., “Image super-resolution using deep convolutional networks,” TPAMI, vol. 38, no. 2, pp. 295–307, 2016.
  • [4] Qinchuan Zhang, Yunhong Wang, Qingjie Liu, et al., “CNN based suburban building detection using monocular high resolution google earth images,” in IGARSS, 2016, pp. 661–664.
  • [5] Suhas Sreehari, SV Venkatakrishnan, Katherine L Bouman, et al., “Multi-resolution data fusion for super-resolution electron microscopy,” in CVPRW. IEEE, 2017, pp. 1084–1092.
  • [6] Giuseppe Masi, Davide Cozzolino, Luisa Verdoliva, et al., “Pansharpening by convolutional neural networks,” Remote Sens., vol. 8, no. 7, pp. 594, 2016.
  • [7] Jinying Zhong, Bin Yang, Guoyu Huang, et al., “Remote sensing image fusion with convolutional neural network,” Sensing and Imaging, vol. 17, no. 1, pp. 10, 2016.
  • [8] Junfeng Yang, Xueyang Fu, Yuwen Hu, et al., “PanNet: A deep network architecture for pan-sharpening,” in ICCV, 2017, pp. 5449–5457.
  • [9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al., “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
  • [10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al., “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017, pp. 1125–1134.
  • [11] Christian Ledig, Lucas Theis, Ferenc Huszár, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017, pp. 4681–4690.
  • [12] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, et al., “Salgan: Visual saliency prediction with adversarial networks,” in CVPR SUNw, 2017.
  • [13] Jin Xie, Yue Huang, John Paisley, et al., “Pan-sharpening based on nonparametric bayesian adaptive dictionary learning,” in ICIP, 2013, pp. 2039–2042.
  • [14] Xiangyu Liu, Yunhong Wang, and Qingjie Liu, “Remote sensing image fusion based on two-stream fusion network,” in MMM, 2018, pp. 428–439.
  • [15] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, 2013, vol. 30.
  • [16] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, pp. 234–241.
  • [17] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
  • [18] Lucien Wald, “Quality of high resolution synthesised images: Is there a simple criterion?,” in Proceedings of the Fusion of Earth Data: Merging Point Measurements, Raster Maps, and Remotely Sensed image, 2000, pp. 99–103.
  • [19] Zhou Wang, Alan C. Bovik, and Ligang Lu, “A universal image quality index,” Signal Process. Lett., vol. 9, no. 2, pp. 81–84, 2002.
  • [20] María González-Audícana, José Luis Saleta, Raquel García Catalán, et al., “Fusion of multispectral and panchromatic images using improved ihs and pca mergers based on wavelet decomposition,” TGRS, vol. 42, no. 6, pp. 1291–1299, 2004.
  • [21] J Zhou, DL Civco, and JA Silander, “A wavelet transform method to merge landsat tm and spot panchromatic data,” IJRS, vol. 19, no. 4, pp. 743–757, 1998.
  • [22] R. H. Yuhas, A. F. H. Goetz, and J.W. Boardman, “Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm,” in Proc. Summaries 3rd Annu. JPL Airborne Geosci. Workshop, Jun. 1992, pp. 147 – 149.
  • [23] Vijay P. Shah, Nicolas H. Younan, and Roger L. King, “An efficient pan-sharpening method via a combined adaptive PCA approach and contourlets,” TGRS, vol. 46, no. 5, pp. 1323–1335, 2008.
  • [24] Sheida Rahmani, Melissa Strait, Daria Merkurjev, et al., “An adaptive ihs pan-sharpening method,” GRSL, vol. 7, no. 4, pp. 746–750, 2010.
  • [25] Xavier Otazu, María González-Audícana, Octavi Fors, et al., “Introduction of sensor spectral response into image fusion methods. application to wavelet-based methods,” TGRS, vol. 43, no. 10, pp. 2376–2385, 2005.