Recently, many high resolution (HR) optical Earth observation satellites, such as QuickBird, GeoEye and GaoFen-1, have been launched, providing abundant data for various research fields including geography and land surveying. Many of these applications require images at the highest resolution in both the spatial and spectral domains. However, due to technical limitations, satellites usually acquire images in two different yet complementary modalities: high resolution panchromatic (PAN) images and low resolution (LR) multi-spectral (MS) images. Pan-sharpening (i.e., panchromatic and multi-spectral image fusion), which aims to generate high spatial resolution MS images by combining the spatial and spectral information of PAN and MS images, provides a good solution to this problem.
Many methods have been developed over the last decades; a comprehensive review of pan-sharpening techniques can be found in [1, 2]. In recent years, inspired by the great success of deep learning in various computer vision tasks [3, 4, 5], researchers in the remote sensing community have also attempted to harness the capability of deep learning techniques and apply them to the pan-sharpening task. For instance, Masi et al. [6]
proposed a pan-sharpening method based on convolutional neural networks (CNNs). They utilized a three-layer CNN architecture modified from SRCNN [3] to achieve pan-sharpening. Zhong et al. [7] presented a CNN-based hybrid pan-sharpening method: CNNs are employed to enhance the spatial resolution of the MS images, and the Gram-Schmidt transformation is then utilized to fuse the enhanced MS and PAN images. The network used in that study is also a three-layer CNN similar to SRCNN. Yang et al. [8]
presented a deep network architecture named PanNet for pan-sharpening, in which domain knowledge is incorporated to improve performance. Although promising results can be obtained, these methods either simply treat pan-sharpening as a super-resolution problem or train the networks by minimizing the Euclidean distance between the predicted and reference images, which leads to blurred results.
More recently, Goodfellow et al. [9] introduced generative adversarial networks (GANs) to generate images that are indistinguishable from real ones. A GAN learns a discriminator that tries to distinguish whether an output image is real or fake, while simultaneously training a generator to minimize the difference between generated and real images. Later, Isola et al. [10]
proposed a general-purpose solution to image-to-image translation, which applies GANs to learn mapping functions for different image translation problems. In addition, GANs have been successfully applied to many vision tasks such as image super-resolution [11] and visual saliency detection [12].
To further enhance the performance of pan-sharpening networks and obtain realistic pan-sharpened images, in this paper we explore the GAN framework for the remote sensing image pan-sharpening problem. To the best of our knowledge, this is the first attempt to apply GANs to the pan-sharpening task. The contributions of this study are threefold: 1) for the first time, a GAN is applied to the remote sensing image pan-sharpening task; 2) to accomplish pan-sharpening with a GAN, we design a two-stream CNN architecture as the generator to produce high quality pan-sharpened images, and employ a fully convolutional discriminator to learn an adaptive loss function that further improves their quality; and 3) we demonstrate that the proposed PSGAN produces high-quality results on the pan-sharpening problem. Fig. 1 shows some example results produced by our method.
2 Pan-sharpening generative adversarial network
2.1 Pan-sharpening as GAN
Pan-sharpening aims to estimate a pan-sharpened HR MS image $\hat{Y}$ from a LR MS image $X$ and a HR PAN image $P$. The output images should be as close as possible to the ideal HR MS images $Y$. We describe $X$ by a real-valued tensor of size $w \times h \times b$, $P$ by $rw \times rh \times 1$, and $Y$ by $rw \times rh \times b$, respectively, where $r$ is the spatial resolution ratio between the LR MS and HR PAN images ($r = 4$ in this paper) and $b$ is the number of bands. The ultimate goal of pan-sharpening takes the general form

$\hat{Y} = G_{\theta}(X, P), \qquad (1)$
where $G_{\theta}$ is a pan-sharpening model that takes $X$ and $P$ as input and produces the desired HR MS image $\hat{Y}$, and $\theta$ is the collection of parameters of this model. Eq. (1) can be solved by minimizing the following loss function:

$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(G_{\theta}(X_i, P_i),\, Y_i\big). \qquad (2)$
From Eq. (1) we can see that $G_{\theta}$ can be considered as a mapping function from $(X, P)$ to $Y$. Thus, we can reformulate pan-sharpening as a conditional image generation problem that can be solved using a conditional GAN. Following [9] and [10], we define a generative network $G$
that maps the joint distribution of $(X, P)$ to the target distribution of $Y$. The generator tries to produce pan-sharpened images that cannot be distinguished from the reference images by an adversarially trained discriminative network $D$. This can be expressed as a mini-max game problem:

$\min_{G} \max_{D} \; \mathbb{E}_{Y}\big[\log D(X, Y)\big] + \mathbb{E}_{X, P}\big[\log\big(1 - D(X, G(X, P))\big)\big]. \qquad (3)$
To train the generative network $G$, many existing works employ the $\ell_2$ distance as the loss function, which results in blurring effects. In this paper we adopt the $\ell_1$ distance to train $G$:

$\mathcal{L}_{\ell_1}(G) = \frac{1}{N} \sum_{i=1}^{N} \big\lVert G(X_i, P_i) - Y_i \big\rVert_1. \qquad (4)$
Finally, the loss function for $G$ takes the form

$\mathcal{L}(G) = \alpha \, \mathcal{L}_{adv}(G) + \beta \, \mathcal{L}_{\ell_1}(G), \qquad (5)$

where $N$ above is the number of training samples, $\mathcal{L}_{adv}(G)$ is the adversarial loss corresponding to Eq. (3), and $\alpha$, $\beta$ are hyper-parameters set to 1 and 100, respectively, in the experiments. In the next two subsections we give the detailed architectures of the generator $G$ and the discriminator $D$.
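As a sanity check, the combined objective above (an adversarial term plus an $\ell_1$ term weighted 1:100) can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation: the non-saturating form of the adversarial term is our assumption, and `d_fake` stands in for the discriminator's probabilities on generated images.

```python
import numpy as np

def generator_loss(d_fake, fake, ref, alpha=1.0, beta=100.0):
    """Sketch of the generator objective: adversarial term + beta-weighted l1 term.

    d_fake: assumed discriminator outputs in (0, 1] for the generated batch
    fake, ref: generated and reference HR MS batches of identical shape
    """
    adv = -np.mean(np.log(d_fake + 1e-12))  # generator wants D to output 1 on fakes
    l1 = np.mean(np.abs(fake - ref))        # l1 distance; blurs less than l2
    return alpha * adv + beta * l1
```

With a perfectly fooled discriminator (`d_fake` near 1) and `fake == ref`, the loss approaches 0; the 1:100 weighting lets the $\ell_1$ term dominate early training while the adversarial term sharpens details.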
2.2 Two-stream generator
One possible way of designing the generator is to adopt an architecture similar to existing image-to-image translation networks [10]. However, spectral distortion may occur with such designs. In this paper, we develop a novel two-stream network as the generator to produce pan-sharpened images; its architecture is shown in the left part of Fig. 2. Instead of performing pan-sharpening at the pixel level [6, 7, 8], we accomplish fusion in the feature domain, which reduces spectral distortion. This is because PAN and MS images contain different information: the PAN image is a carrier of geometric detail (spatial) information, while the MS image preserves spectral information. To make the best use of the spatial and spectral information, we utilize two sub-networks to extract hierarchical features that capture the complementary information of the PAN and MS images. After that, the network proceeds as an auto-encoder: the encoder fuses the information extracted from the PAN and MS images, and the decoder reconstructs the HR MS images from the fused features in the final part.
The two sub-networks have similar architectures but different weights. Each of them consists of two successive convolutional layers, each followed by a leaky rectified linear unit (LeakyReLU) [15], and a down-sampling layer. A convolutional layer with a stride of 2, instead of a simple pooling strategy such as max pooling, is used to down-sample the feature maps. The feature maps of the two streams are first concatenated and then fused by subsequent convolutional layers. Finally, a decoder-like network comprised of 2 transposed convolutional layers and 3 flat convolutional layers is applied to reconstruct the desired HR MS images.
Inspired by U-Net [16], we adapt the network by adding skip connections. The skip connections not only compensate details to higher layers but also ease training. In the last layer, ReLU is used to guarantee that the output is non-negative. The detailed architecture and parameters can be found in the left part of Fig. 2.
The generator takes a pair of LR MS image $X$ and HR PAN image $P$ as input; $X$ is up-sampled to match the size of $P$. The output of the generator is a pan-sharpened image with the same shape as the up-sampled $X$.
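To see that the strided down-sampling and the two transposed-convolution stages restore the input resolution, the spatial sizes through the generator can be checked with a short computation. The 128 × 128 input matches the training patch size; the assumption of exactly two stride-2 stages (matching the two transposed convolutions) is ours, and channel counts are omitted:

```python
def conv_out(size, k=3, s=1, p=1):
    # Output spatial size of a convolution with kernel k, stride s, padding p.
    return (size + 2 * p - k) // s + 1

size = 128                   # up-sampled MS / PAN patch size
size = conv_out(size, s=2)   # stride-2 conv (used instead of pooling): 128 -> 64
size = conv_out(size, s=2)   # second down-sampling stage:               64 -> 32
# ...the PAN and MS stream features are concatenated and fused by stride-1 convs...
size *= 2                    # transposed conv, factor 2:                32 -> 64
size *= 2                    # transposed conv, factor 2:                64 -> 128
assert size == 128           # decoder output matches the up-sampled MS input
```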
2.3 Fully convolutional discriminator
In addition to the generator, a conditional discriminator network is trained simultaneously to discriminate the reference MS images from the generated pan-sharpened images. Similar to [10], we use a fully convolutional discriminator, which consists of five layers with 3 × 3 kernels. The strides of the first three layers are set to 2, and those of the last two to 1. Except for the last layer, all convolutional layers are activated through LeakyReLU. A sigmoid is used to predict, for each patch, the probability of its being a real HR MS image or a pan-sharpened one.
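Under the layer configuration just described (five 3 × 3 convolutions with strides 2, 2, 2, 1, 1), each sigmoid output scores a local image patch rather than the whole image. The patch size is simply the stack's receptive field, computed below with a standard recurrence (the 47 is derived from the stated configuration, not a figure quoted from the paper):

```python
def receptive_field(kernels, strides):
    # Receptive field of a stack of convolutions (no dilation):
    # each layer adds (k - 1) * cumulative_stride pixels.
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Five 3x3 layers with strides 2, 2, 2, 1, 1 as in Sec. 2.3
rf = receptive_field([3] * 5, [2, 2, 2, 1, 1])  # each output scores a 47 x 47 patch
```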
2.4 Implementation details
The PSGAN is implemented in TensorFlow and trained on a single NVIDIA Titan Xp GPU. We use the Adam optimizer [17] with an initial learning rate of 0.0002 and a momentum of 0.5 to minimize the loss. The mini-batch size is set to 32. It takes about 10 hours to train our network. The source code for this work is available at https://github.com/liouxy/PSGan.
3 Experiments
3.1 Datasets and evaluation indexes
We train and test our network on two datasets comprised of images acquired by the QuickBird and GaoFen-1 (GF-1) satellites. The spatial resolution of QuickBird is 0.6 m ground sampling distance (GSD) for PAN and 2.4 m GSD for MS; for GF-1 it is 2 m GSD at nadir for PAN and 8 m GSD at nadir for MS.
The QuickBird dataset contains 9 pairs of large-scale MS and PAN images; 8 of the 9 are used to generate the training set and the remaining one is used for testing. The GF-1 dataset contains 5 pairs of images, of which 4 are used for training. Since the desired HR MS images are not available, we follow Wald's protocol [18] to down-sample both the MS and PAN images by a factor of $r$ ($r = 4$ in this paper); the original MS images are then used as the reference images for comparison. Finally, training samples with a size of 128 × 128 are randomly cropped from the training images to train the two-stream generator $G$ and the fully convolutional discriminator $D$. The number of training samples is 40,000 for each dataset. All results reported in the following subsections are based on the test sets, which are independent of the training images.
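The training-pair construction under Wald's protocol can be sketched as below. Block averaging is our simplification of the low-pass filtering and decimation step, and the image sizes are illustrative:

```python
import numpy as np

def wald_downsample(img, r=4):
    """Down-sample by averaging r x r blocks (a simple stand-in for the
    low-pass filtering + decimation used under Wald's protocol)."""
    h = img.shape[0] - img.shape[0] % r
    w = img.shape[1] - img.shape[1] % r
    img = img[:h, :w]
    return img.reshape(h // r, r, w // r, r, *img.shape[2:]).mean(axis=(1, 3))

# The original MS image serves as the reference Y; its down-sampled version
# plays the role of the LR MS input X (the PAN image is treated the same way).
ms = np.random.rand(512, 512, 4)   # stand-in for an original 4-band MS image
lr_ms = wald_downsample(ms)        # 128 x 128 x 4 LR MS input
```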
3.2 Impact of loss functions
The $\ell_2$ loss has been widely used for low-level vision tasks [3, 6, 7, 8]. However, in this work we demonstrate that the $\ell_1$ loss generates better results than $\ell_2$ on the pan-sharpening task. First, we conduct experiments on the two datasets utilizing only the generator ($G$). The results are shown in Table 1: the $\ell_1$ loss achieves better results than the $\ell_2$ loss in terms of all evaluation metrics. Furthermore, the results can be significantly improved via adversarial training. The corresponding results on the two datasets are also given in Table 1, in which the proposed PSGAN shows obvious advantages over the non-adversarial network, i.e. $G$ alone.
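A toy computation illustrates why $\ell_2$ blurs: when a pixel has several plausible values (e.g. along a sharp edge), the $\ell_2$-optimal prediction is their mean (a smeared intermediate value), whereas the $\ell_1$-optimal prediction is their median (one of the sharp values). The sample values here are our own illustration:

```python
import numpy as np

# Plausible intensities for one edge pixel: mostly dark, sometimes bright.
samples = np.array([0.0, 0.0, 1.0])
grid = np.linspace(0.0, 1.0, 1001)

# Brute-force the constant prediction minimizing each loss over the samples.
l2_best = grid[np.argmin([np.mean((samples - g) ** 2) for g in grid])]
l1_best = grid[np.argmin([np.mean(np.abs(samples - g)) for g in grid])]
# l2_best lands near the mean (~0.333, a blurred value);
# l1_best lands on the median (0.0, a sharp value).
```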
3.3 Comparison with other methods
In this subsection we compare the proposed PSGAN with four widely used techniques: APCA [23], AIHS [24], AWLP [25] and PCNN [6]. We re-implement the first three methods in MATLAB and PCNN in TensorFlow. All hyper-parameters are set as suggested in their respective papers. Table 2 lists the quantitative evaluations on the two datasets in terms of all five metrics, from which we can see that deep learning based pan-sharpening methods outperform the traditional ones. As one of the first attempts at applying deep learning techniques to pan-sharpening, PCNN achieves promising results on both satellites. The proposed PSGAN gives the best results on both datasets, especially on GF-1. We can also see that even the generator part of PSGAN alone achieves superior performance to PCNN.
3.4 Visual comparisons
Fig. 3 shows some results cropped from the test sites of QuickBird and GF-1. All images are displayed in true color (RGB). In Fig. 3, APCA [23] and AIHS [24] produce images with obvious blurring. AWLP [25] returns results similar to the reference images with fine spatial details, but suffers from strong spectral distortions. The results of PCNN [6] are similar to ours, but some missing spatial details are noticeable. PSGAN produces the best results, with less blurring and reduced spectral distortion.
4 Conclusion
In this paper, we have proposed PSGAN, which consists of a two-stream fusion network as the generator and a patch discriminator, to solve the pan-sharpening problem. Experiments on different sensors demonstrate the effectiveness of the proposed method, and comparisons with other methods also show its superiority. In the future, we would like to apply deeper networks, such as residual networks, to our approach. We will also explore the generalization ability of the model by testing it on data from satellites never used for training.
This work was supported by the Natural Science Foundation of China (NSFC) under Grant No. 61601011.
-  Yun Zhang, “Understanding image fusion,” PE&RS, vol. 70, no. 6, pp. 657–661, 2004.
-  Hassan Ghassemian, “A review of remote sensing image fusion methods,” Information Fusion, vol. 32, pp. 75–89, 2016.
-  Chao Dong, Chen Change Loy, Kaiming He, et al., “Image super-resolution using deep convolutional networks,” TPAMI, vol. 38, no. 2, pp. 295–307, 2016.
-  Qinchuan Zhang, Yunhong Wang, Qingjie Liu, et al., “CNN based suburban building detection using monocular high resolution google earth images,” in IGARSS, 2016, pp. 661–664.
-  Suhas Sreehari, SV Venkatakrishnan, Katherine L Bouman, et al., “Multi-resolution data fusion for super-resolution electron microscopy,” in CVPRW. IEEE, 2017, pp. 1084–1092.
-  Giuseppe Masi, Davide Cozzolino, Luisa Verdoliva, et al., “Pansharpening by convolutional neural networks,” Remote Sens., vol. 8, no. 7, pp. 594, 2016.
-  Jinying Zhong, Bin Yang, Guoyu Huang, et al., “Remote sensing image fusion with convolutional neural network,” Sensing and Imaging, vol. 17, no. 1, pp. 10, 2016.
-  Junfeng Yang, Xueyang Fu, Yuwen Hu, et al., “PanNet: A deep network architecture for pan-sharpening,” in ICCV, 2017, pp. 5449–5457.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al., “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al., “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017, pp. 1125–1134.
-  Christian Ledig, Lucas Theis, Ferenc Huszár, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017, pp. 4681–4690.
-  Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, et al., “Salgan: Visual saliency prediction with adversarial networks,” in CVPR SUNw, 2017.
-  Jin Xie, Yue Huang, John Paisley, et al., “Pan-sharpening based on nonparametric bayesian adaptive dictionary learning,” in ICIP, 2013, pp. 2039–2042.
-  Xiangyu Liu, Yunhong Wang, and Qingjie Liu, “Remote sensing image fusion based on two-stream fusion network,” in MMM, 2018, pp. 428–439.
-  Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, 2013, vol. 30.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, pp. 234–241.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
-  Lucien Wald, “Quality of high resolution synthesised images: Is there a simple criterion?,” in Proceedings of the Fusion of Earth Data: Merging Point Measurements, Raster Maps, and Remotely Sensed image, 2000, pp. 99–103.
-  Zhou Wang, Alan C. Bovik, and Ligang Lu, “A universal image quality index,” Signal Process. Lett., vol. 9, no. 2, pp. 81–84, 2002.
-  María González-Audícana, José Luis Saleta, Raquel García Catalán, et al., “Fusion of multispectral and panchromatic images using improved ihs and pca mergers based on wavelet decomposition,” TGRS, vol. 42, no. 6, pp. 1291–1299, 2004.
-  J Zhou, DL Civco, and JA Silander, “A wavelet transform method to merge landsat tm and spot panchromatic data,” IJRS, vol. 19, no. 4, pp. 743–757, 1998.
-  R. H. Yuhas, A. F. H. Goetz, and J.W. Boardman, “Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm,” in Proc. Summaries 3rd Annu. JPL Airborne Geosci. Workshop, Jun. 1992, pp. 147 – 149.
-  Vijay P. Shah, Nicolas H. Younan, and Roger L. King, “An efficient pan-sharpening method via a combined adaptive PCA approach and contourlets,” TGRS, vol. 46, no. 5, pp. 1323–1335, 2008.
-  Sheida Rahmani, Melissa Strait, Daria Merkurjev, et al., “An adaptive ihs pan-sharpening method,” GRSL, vol. 7, no. 4, pp. 746–750, 2010.
-  Xavier Otazu, María González-Audícana, Octavi Fors, et al., “Introduction of sensor spectral response into image fusion methods. application to wavelet-based methods,” TGRS, vol. 43, no. 10, pp. 2376–2385, 2005.