Due to their sensor resolution constraints and bandwidth limitation, satellites often acquire multi-resolution multi-spectral images of the same target areas. In general, satellite images include pairs of a low-resolution (LR) multi-spectral image (MS) of longer ground sample distance (GSD), and a high-resolution (HR) panchromatic (PAN) image of shorter GSD. By extracting high-quality spatial structures from a PAN image and multi-spectral information from an MS image, one can generate a pan-sharpened (PS) image which has the same GSD as that of the PAN image but with the spectral information of the MS image. This is known as pan-sharpening or pan-colorization.
I-A Related Works
Traditional pan-sharpening methods can be grouped into component substitution, multiresolution analysis, and machine-learning [8, 9, 10] based approaches. Component substitution and multiresolution analysis based approaches have been compared thoroughly in the literature. Component substitution based methods often incorporate the Brovey transform, the intensity-hue-saturation (IHS) transform, principal component analysis (PCA), or matting models for pan-sharpening. In multiresolution analysis based methods, the spatial structures of PAN images are decomposed using wavelet or undecimated wavelet decomposition techniques, and are fused with up-sampled MS images to produce PS images. These methods have relatively low computational complexity but tend to produce PS images with mismatched spectral information. Machine-learning based methods [8, 9, 10] learn pan-sharpening models by optimizing a loss function of inputs and targets with some regularization terms.
With the advent of deep learning, recent pan-sharpening methods [15, 16, 17, 18, 19] have incorporated various types of convolutional neural network (CNN) structures and show large quality improvements over traditional pan-sharpening methods. Most of these CNN-based methods use network structures that proved effective in classification [20, 21] and super-resolution (SR) [22, 23, 24] tasks. Because the goal of pan-sharpening is to increase the resolution of MS inputs, many of them borrow network structures from earlier CNN-based SR methods [22, 23, 24]. Pan-sharpening CNN (PNN) is known as the first method to apply a CNN to pan-sharpening. PNN uses a shallow three-layer network adopted from SRCNN, the first CNN-based SR method, and was trained and tested on the Ikonos, GeoEye-1 and WorldView-2 satellite image datasets. Inspired by the success of ResNet in classification, PanNet incorporated the ResNet architecture with a smaller number of filter parameters to perform pan-sharpening. Recently, Lanaras et al. employed the state-of-the-art SR network, EDSR, and proposed a moderately deep network version (DSen2) and a very deep network version (VDSen2) for pan-sharpening. Both PanNet and DSen2 showed state-of-the-art performance at the MS and PAN resolutions.
I-B Our Contributions
Since the state-of-the-art CNN-based pan-sharpening methods, PanNet and DSen2, were trained using an L2 loss that minimizes the reconstruction error between generated images and MS target images, their PS results often suffer from visually unpleasant artifacts along building edges and on moving cars when the GSD is short, as in the WorldView-3 dataset. This is because, as the GSD becomes smaller, pixel misalignments between PAN and MS inputs tend to get larger due to the inevitable acquisition time difference and mosaicked sensor arrays. In such scenarios, conventional loss terms such as L1 and L2 losses between network outputs and MS target images are insufficient for training, resulting in PS images of low visual quality.
In this letter, we propose a novel loss term, called a spectral-spatial structure (S3) loss, which can be effectively utilized for training of pan-sharpening CNNs to learn spectral information of MS targets while preserving the spatial structure of PAN inputs. Our S3 loss consists of two loss functions: a spectral loss between network outputs and MS targets, and a spatial loss between network outputs and PAN inputs. Here, both spectral and spatial losses are computed based on the correlation maps between MS targets and PAN inputs. The spectral loss is selectively applied for the areas where averaged MS targets and PAN inputs are highly correlated. The spatial loss only considers edge maps of generated images (network output) and PAN inputs. In doing so, our network using the S3 loss can generate PS images where double-edge artifacts and ghosting artifacts on moving cars are significantly reduced. Finally, we show that our S3 loss can effectively work with various pan-sharpening CNNs. Fig. 1 shows a CNN-based pan-sharpening architecture with our proposed S3 loss.
II Proposed Method
Most satellite imagery datasets include PAN images of higher resolution (smaller GSD) $P_0$, and the corresponding MS images of lower resolution (larger GSD) $M_1$. Here, the subscripts of $P_0$ and $M_1$ denote the level of resolution, where a smaller number means a higher resolution. We have two scenarios in terms of scales (resolutions): (i) Our final goal in pan-sharpening is to use both $P_0$ and $M_1$ inputs to generate a high-quality PS image $Y_0$, which has the same resolution as $P_0$ while preserving the spectral information of $M_1$. This case corresponds to the original scale scenario in [16, 19]; (ii) Now we consider a pan-sharpening model that requires training with input and target pairs. For target images, we use $M_1$. For input images, we use $P_1$ and $M_2$, which are down-scaled versions of $P_0$ and $M_1$, respectively, obtained with a degradation model. The pan-sharpening CNN takes $P_1$ and $M_2$ as inputs, and generates $Y_1$. This case corresponds to the lower scale scenario in [16, 19]. In short, training and testing of the pan-sharpening networks are performed under the lower and original scale scenarios, respectively. Conventional pan-sharpening networks were trained by simply minimizing an L2 or L1 loss between network outputs and MS targets under the lower scale scenario.
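The lower scale scenario can be sketched in a few lines of NumPy. This is only an illustration: the degradation model is not specified in the text here, so plain block averaging is assumed, and the names `block_average` and `make_lower_scale_sample` as well as the scale factor of 4 (taken from the ratio of the WorldView-3 GSDs) are hypothetical choices.

```python
import numpy as np

def block_average(img, factor):
    """Down-scale an (H, W) or (H, W, C) array by averaging factor x factor blocks.

    Assumed stand-in for the degradation model; the paper may use a different one.
    """
    h, w = img.shape[:2]
    assert h % factor == 0 and w % factor == 0, "size must be divisible by factor"
    blocks = img.reshape(h // factor, factor, w // factor, factor, *img.shape[2:])
    return blocks.mean(axis=(1, 3))

def make_lower_scale_sample(pan0, ms1, factor=4):
    """Build one training pair: inputs (down-scaled MS, down-scaled PAN), target = original MS."""
    pan1 = block_average(pan0, factor)   # PAN brought to the MS resolution
    ms2 = block_average(ms1, factor)     # MS one level coarser
    return (ms2, pan1), ms1

rng = np.random.default_rng(0)
pan0 = rng.random((512, 512))        # original-scale PAN
ms1 = rng.random((128, 128, 3))      # original-scale MS (RGB)
(ms2, pan1), target = make_lower_scale_sample(pan0, ms1)
print(ms2.shape, pan1.shape, target.shape)
```

At test time the same network would instead receive the original-scale MS and PAN images, for which no ground-truth output exists.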
II-B Proposed S3 Loss
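We now define our spectral-spatial structure (S3) loss, which can be used for training any pan-sharpening CNN to yield high-quality PS images $Y_1$ at the lower scale, and ultimately $Y_0$ at the original scale. First, we define our feedforward pan-sharpening operation as
$$Y_1 = g(M_2, P_1; \theta),$$
where $g$ is the pan-sharpening CNN with filter parameters $\theta$, and $M_2$ and $P_1$ are the down-scaled MS and PAN inputs. Conventionally, $\theta$ is learned by minimizing an L2 (or L1) reconstruction loss between $Y_1$ and the MS target $M_1$.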
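However, solely using this loss function for training often leads to artifacts in the resultant images $Y_1$, due to inherent misalignments between the MS target $M_1$ and the PAN input $P_1$. To overcome this limitation, we propose the S3 loss consisting of two loss functions: a spectral loss $L_c$ between $Y_1$ and $M_1$; and a spatial loss $L_a$ between $Y_1$ and $P_1$. First, the spectral loss is computed only for the areas where grayed $M_1$ (denoted as $\bar{M}_1$) and $P_1$ are highly correlated. The correlation map $S$ can be formulated as
$$S = \left| \frac{m(\bar{M}_1 \circ P_1) - m(\bar{M}_1) \circ m(P_1)}{\sqrt{m(\bar{M}_1 \circ \bar{M}_1) - m(\bar{M}_1) \circ m(\bar{M}_1)} \circ \sqrt{m(P_1 \circ P_1) - m(P_1) \circ m(P_1)} + e} \right|^{\gamma},$$
where $m$ is a mean filter, $\circ$ denotes an element-wise multiplication, $\gamma$ is a control parameter, and $e$ is a very small value. We empirically set $\gamma$ to 4. Using $S$, our spectral loss is then defined as
$$L_c = \| (Y_1 - M_1) \circ S \|_1.$$
Here, we try to minimize the spectral loss between $Y_1$ and $M_1$ only for pixel areas where $\bar{M}_1$ and $P_1$ have large positive or negative correlations. Note that $S$ is not trainable.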
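A local correlation map of this kind can be sketched with NumPy as follows. This is a minimal sketch, not the authors' implementation: the window size `k`, the default `eps`, the reflect padding, and the exact form (absolute local Pearson correlation raised to the control parameter) are assumptions chosen to match the description of a mean filter, element-wise products, a control parameter of 4, and a small stabilizing value.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mean_filter(x, k=5):
    """k x k box (mean) filter on a 2-D array, with reflect padding (assumed)."""
    pad = k // 2
    xp = np.pad(x, pad, mode="reflect")
    return sliding_window_view(xp, (k, k)).mean(axis=(-2, -1))

def correlation_map(ms_gray, pan, k=5, gamma=4.0, eps=1e-10):
    """|local correlation| ** gamma between a grayed MS image and a PAN image.

    k and eps are hypothetical defaults; gamma=4 follows the text.
    """
    mu_m, mu_p = mean_filter(ms_gray, k), mean_filter(pan, k)
    cov = mean_filter(ms_gray * pan, k) - mu_m * mu_p
    var_m = np.maximum(mean_filter(ms_gray ** 2, k) - mu_m ** 2, 0.0)
    var_p = np.maximum(mean_filter(pan ** 2, k) - mu_p ** 2, 0.0)
    corr = cov / (np.sqrt(var_m * var_p) + eps)
    return np.abs(corr) ** gamma

rng = np.random.default_rng(0)
pan = rng.random((64, 64))
s_aligned = correlation_map(pan, pan)             # perfectly aligned content
s_inverted = correlation_map(1.0 - pan, pan)      # strong negative correlation
s_noise = correlation_map(rng.random((64, 64)), pan)  # unrelated content
print(s_aligned.mean(), s_inverted.mean(), s_noise.mean())
```

Aligned and inverted inputs both give values close to 1, while unrelated content gives values close to 0, so a spectral loss weighted by this map mostly ignores regions where the MS and PAN images disagree.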
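For our spatial loss $L_a$, we try to minimize the difference between the edge map of grayed $Y_1$ (denoted as $\bar{Y}_1$) and that of the PAN input $P_1$, which is formulated as
$$L_a = \| (\mathrm{grad}(\bar{Y}_1) - \mathrm{grad}(P_1)) \circ (2 - S) \|_1, \tag{9}$$
where $S$ is the correlation map, $\circ$ denotes an element-wise multiplication, and $\mathrm{grad}(X)$ for an image $X$ is a function defined with the mean filter $m$ as
$$\mathrm{grad}(X) = X - m(X).$$
We incorporated $(2 - S)$ into $L_a$ in (9), so that $L_a$ focuses more on those areas where the spectral loss $L_c$ is less focused. Finally, combining $L_c$ and $L_a$, we have our final S3 loss as
$$L_{S3} = L_c + \lambda L_a,$$
where $\lambda$ is a weighting value. We empirically set $\lambda$ to 1.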
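The combination of the two terms can be sketched as below. Several details are assumptions made for illustration only: a mean absolute error for both terms, an edge map obtained by subtracting a mean-filtered image, a spatial weight of two minus the correlation map, and the function names. The correlation map `s` is passed in as a fixed (non-trainable) array with values in [0, 1].

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mean_filter(x, k=5):
    """k x k box (mean) filter on a 2-D array, with reflect padding (assumed)."""
    xp = np.pad(x, k // 2, mode="reflect")
    return sliding_window_view(xp, (k, k)).mean(axis=(-2, -1))

def edge_map(x, k=5):
    """High-pass edge map: image minus its local mean (assumed form)."""
    return x - mean_filter(x, k)

def s3_loss(y, ms, pan, s, lam=1.0, k=5):
    """Sketch of a spectral-spatial loss.

    y, ms: (H, W, C) network output and MS target; pan, s: (H, W).
    """
    # Spectral term: penalize colour differences only where MS and PAN agree.
    spectral = np.mean(np.abs(y - ms) * s[..., None])
    # Spatial term: match edges of the grayed output to PAN edges,
    # weighted more strongly where the spectral term is weak.
    y_gray = y.mean(axis=-1)
    spatial = np.mean(np.abs(edge_map(y_gray, k) - edge_map(pan, k)) * (2.0 - s))
    return spectral + lam * spatial

rng = np.random.default_rng(0)
pan = rng.random((32, 32))
ms = np.stack([pan, pan, pan], axis=-1)
y = ms + 0.1                                    # output with a constant colour offset
loss_trusted = s3_loss(y, ms, pan, np.ones_like(pan))     # correlation map = 1
loss_ignored = s3_loss(y, ms, pan, np.zeros_like(pan))    # correlation map = 0
print(loss_trusted, loss_ignored)
```

With the map set to one the colour offset is penalized, while with the map set to zero it is not, and the edge term is unaffected because a constant offset does not change the edge map.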
To show the effectiveness of our S3 loss, we incorporated it into the state-of-the-art pan-sharpening network, DSen2, and call the resulting network DSen2-S3 in our experiments. The DSen2 network has 14 convolutional layers with 128 channels, amounting to about 1.8M filter parameters. Fig. 1-(b) shows our network based on DSen2.
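The quoted parameter count can be checked with a short calculation. The kernel size (3×3), the presence of bias terms, and the input and output channel counts (4 inputs for RGB plus PAN, 3 outputs) are assumptions not stated in the text; under them, 14 layers with 128 channels give roughly 1.8M parameters.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out

# Assumed layout: one input layer, twelve 128-to-128 layers, one output layer.
layers = [(4, 128)] + [(128, 128)] * 12 + [(128, 3)]
total = sum(conv_params(c_in, c_out) for c_in, c_out in layers)
print(len(layers), total)  # 14 1779203
```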
III Experiment Results and Discussions
III-A Experiment Settings
All the networks including ours and the baselines were trained and tested on the WorldView-3 satellite image dataset, whose PAN images have a GSD of about 0.3 m and whose MS images have a GSD of about 1.2 m. PS images of 0.3 m GSD are also provided in the dataset, but they are used only for visual comparison with our results. Note that the WorldView-3 satellite image dataset has the shortest GSD (highest resolution) among the aforementioned datasets. We selected and used the WorldView-3 satellite image dataset from the SpaceNet Challenge dataset. The RGB channels of the MS images were used for all experiments. A total of 13K MS-PAN image pairs were used for training the networks, where cropping and various data augmentations were conducted on the fly during training. The MS-PAN training subimages were created by applying a down-scaling method in . The cropped MS subimages used for training are 32×32 in size, while the PAN subimages are 128×128. Before being fed into the networks, the training image pairs were normalized to the range between 0 and 1. Training was done in the lower scale scenario.
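On-the-fly cropping of spatially aligned MS and PAN patches, followed by normalization, might look like the sketch below. The function names, the scale factor of 4, and the use of a caller-supplied `max_value` for normalization are assumptions; the text only states the patch sizes and the target range.

```python
import numpy as np

def random_aligned_crop(ms, pan, rng, ms_size=32, scale=4):
    """Crop an ms_size x ms_size MS patch and the PAN patch covering the same area."""
    top = rng.integers(0, ms.shape[0] - ms_size + 1)
    left = rng.integers(0, ms.shape[1] - ms_size + 1)
    ms_crop = ms[top:top + ms_size, left:left + ms_size]
    pan_crop = pan[scale * top:scale * (top + ms_size),
                   scale * left:scale * (left + ms_size)]
    return ms_crop, pan_crop

def normalize(x, max_value):
    """Map raw pixel values to [0, 1]; max_value is a hypothetical sensor maximum."""
    return np.clip(x.astype(np.float32) / max_value, 0.0, 1.0)

rng = np.random.default_rng(0)
ms = rng.integers(0, 2048, size=(100, 100, 3))
pan = rng.integers(0, 2048, size=(400, 400))
ms_crop, pan_crop = random_aligned_crop(ms, pan, rng)
ms_crop, pan_crop = normalize(ms_crop, 2047), normalize(pan_crop, 2047)
print(ms_crop.shape, pan_crop.shape)
```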
We trained all the networks using the decoupled AdamW optimization with an initial learning rate of , initial weight decay of , and the other hyper-parameters as defaults. The mini-batch size was set to 2. We employed a uniform weight initialization technique in . All the networks including our proposed networks were implemented using TensorFlow, and were trained and tested on an Nvidia Titan Xp GPU. The networks were trained for total iterations, where the learning rate and weight decay were lowered by a factor of 10 after iterations.
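The single-step schedule described above, applied to both the learning rate and the weight decay, can be expressed as a small helper. The function name is hypothetical, and the numeric values in the usage lines are placeholders for illustration only, not the settings used in the experiments.

```python
def step_decay(step, initial_value, drop_step, factor=10.0):
    """Return initial_value before drop_step iterations, and initial_value / factor afterwards."""
    return initial_value / factor if step >= drop_step else initial_value

# Placeholder values, purely illustrative.
initial_lr, initial_wd, drop_step = 1.0, 1.0, 100
for step in (0, 99, 100, 150):
    lr = step_decay(step, initial_lr, drop_step)
    wd = step_decay(step, initial_wd, drop_step)
    print(step, lr, wd)
```

In decoupled weight decay the decay term is applied directly to the weights rather than through the gradient, which is why the two quantities are scheduled together here.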
TABLE I: Average quality metric scores of the compared methods.
| Method / Metric | Avg. ERGAS | Avg. SCC | Avg. SCC |
III-B Comparisons and Discussions
We now compare our proposed method, DSen2-S3, with conventional pan-sharpening methods including bicubic interpolation, the PS images provided in the WorldView-3 dataset, the Brovey transform (BT), PanNet and DSen2. We implemented PanNet and DSen2 according to their technical descriptions, and trained them on the WorldView-3 dataset. At test time, for MS input images of size 160×160, the average computation time of our DSen2-S3 on a GPU is about 2 sec per image.
As in [11, 16], we use two popular metrics: Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) for measuring spectral distortion, and the spatial correlation coefficient (SCC) for measuring spatial distortion. Note that in the original scale scenario, there are no ground truth PS images for comparison. Therefore, in this letter, given a PS output at the original scale, three scores were computed: SCC between the network output and the PAN input, ERGAS between the down-scaled network output and its MS input, and SCC between the down-scaled network output and the down-scaled PAN input. Table I shows the average quality metric scores (with standard errors) of ERGAS and the two SCC variants for PS results at the PAN resolution (0.3 m GSD). Here, 100 MS-PAN pairs from the WorldView-3 satellite test dataset were selected for testing in the original scale scenario. As shown in Table I, the PS results of PanNet and DSen2 have lower ERGAS values, showing lower spectral distortion, but lower SCC values, indicating higher spatial distortion. On the other hand, our DSen2-S3 generated PS images with much higher SCC values, but with slightly higher spectral distortion. Note that these metrics simply compute the scores using MS and PAN test image pairs that are misaligned with unknown magnitudes and directions. Therefore, the ERGAS and SCC metrics are not effective in measuring the distortions for misaligned MS-PAN pairs.
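For reference, the two metrics can be sketched in NumPy as follows. This uses the commonly cited definitions (ERGAS as a band-averaged relative RMSE scaled by the resolution ratio, SCC as the correlation coefficient of Laplacian-filtered images); the exact filter, padding, and band handling used in the experiments are not given in the text, so those details are assumptions.

```python
import numpy as np

def ergas(ref, out, ratio):
    """ERGAS between reference and output, both (H, W, C).

    ratio = high-resolution GSD / low-resolution GSD (e.g. 0.25 for 0.3 m vs 1.2 m).
    """
    rmse2 = np.mean((ref - out) ** 2, axis=(0, 1))   # per-band squared RMSE
    mu2 = np.mean(ref, axis=(0, 1)) ** 2             # per-band squared mean
    return 100.0 * ratio * np.sqrt(np.mean(rmse2 / mu2))

def laplacian(x):
    """3 x 3 Laplacian high-pass filter (centre 8, neighbours -1), reflect padding."""
    h, w = x.shape
    xp = np.pad(x, 1, mode="reflect")
    window_sum = sum(xp[1 + di:1 + di + h, 1 + dj:1 + dj + w]
                     for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return 9.0 * x - window_sum   # equals 8 * centre minus the 8 neighbours

def scc(a, b):
    """Spatial correlation coefficient between two (H, W) images."""
    return np.corrcoef(laplacian(a).ravel(), laplacian(b).ravel())[0, 1]

rng = np.random.default_rng(0)
ref = np.ones((16, 16, 3))
print(ergas(ref, 1.1 * ref, ratio=0.25))   # about 2.5 for a 10% error in every band
img = rng.random((16, 16))
print(scc(img, img), scc(img, -img))
```

Lower ERGAS means lower spectral distortion, and SCC closer to 1 means the high-frequency structure of the two images agrees more closely.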
We now visually compare several pan-sharpening methods including ours. Fig. 2 shows PS images at the original scale for bicubic-interpolated MS, PS provided in the dataset, BT, PanNet, DSen2 and our DSen2-S3. First, the PS images provided in the dataset show high spectral distortion, with a blue glow around cars. Because they were trained using a simple loss between network outputs and MS targets, PanNet and DSen2 tend to perform poorly on misaligned MS-PAN test inputs, creating unpleasant artifacts around strong edges and moving objects in the PS images. On the other hand, our method using the proposed S3 loss can reconstruct PS images with highly sharpened edges, rooftops, roads and cars with far fewer artifacts, visually outperforming the conventional PanNet and DSen2 methods.
To verify the effectiveness of using the correlation maps in our S3 loss, we conducted the following experiment. Instead of using an MS target image as the network target, a correlation map between the MS target image and its corresponding PAN image was used as the network target. This training was performed at the lower scale with the same network structure. In doing so, we were trying to estimate correlation maps at the original scale, where there is no ground truth correlation map. Fig. 3 shows our results regarding the correlation maps. As shown in Fig. 3, the true correlation maps computed at the lower scale do not carry useful information, whereas the maps estimated at the original scale contain meaningful information, especially in the areas of moving cars. Therefore, by incorporating the correlation maps into our S3 loss function, our network was trained not to restore the spectral information for the areas with low correlations between the MS and PAN inputs, thus generating moving cars with fewer artifacts in the output PS images.
We proposed a novel spectral-spatial structure (S3) loss that can be effectively applied to CNN-based pan-sharpening methods. Our S3 loss combines measures of spectral, spatial and structural distortions, so that CNN-based pan-sharpening networks can be effectively trained to generate highly detailed PS images with fewer artifacts, compared to conventional losses based simply on the difference between network outputs and MS targets.
-  A. R. Gillespie, A. B. Kahle, and R. E. Walker, “Color enhancement of highly correlated images. i. decorrelation and HSI contrast stretches,” Remote Sensing of Environment, vol. 20, no. 3, pp. 209–235, Dec. 1986.
-  W. J. Carper, T. M. Lillesand, and P. W. Kiefer, “The use of intensity-hue-saturation transformations for merging spot panchromatic and multispectral image data,” Photogrammetric Engineering and Remote Sensing, vol. 56, Jan. 1990.
-  V. P. Shah, N. H. Younan, and R. L. King, “An efficient pan-sharpening method via a combined adaptive PCA approach and contourlets,” IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 5, pp. 1323–1335, May 2008.
-  X. Kang, S. Li, and J. A. Benediktsson, “Pansharpening with matting model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, pp. 5088–5099, Aug. 2014.
-  Q. Xu, B. Li, Y. Zhang, and L. Ding, “High-fidelity component substitution pansharpening by the fitting of substitution data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 11, pp. 7380–7392, Nov. 2014.
-  S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 7, pp. 674–693, 1989.
-  J.-L. Starck, J. Fadili, and F. Murtagh, “The undecimated wavelet decomposition and its reconstruction,” IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 297–309, 2007.
-  Z. Pan, J. Yu, H. Huang, S. Hu, A. Zhang, H. Ma, and W. Sun, “Super-resolution based on compressive sensing and structural self-similarity for remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 9, pp. 4864–4876, Sep. 2013.
-  X. He, L. Condat, J. M. Bioucas-Dias, J. Chanussot, and J. Xia, “A new pansharpening method based on spatial and spectral sparsity priors,” IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 4160–4174, Sep. 2014.
-  F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, “A new pansharpening algorithm based on total variation,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 1, pp. 318–322, Jan. 2014.
-  G. Vivone, L. Alparone, J. Chanussot, M. D. Mura, A. Garzelli, G. A. Licciardi, R. Restaino, and L. Wald, “A critical comparison among pansharpening algorithms,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 5, pp. 2565–2586, May 2015.
-  A. Garzelli, F. Nencini, and L. Capobianco, “Optimal MMSE pan sharpening of very high resolution multispectral images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 1, pp. 228–236, Jan. 2008.
-  N. Brodu, “Super-resolving multiresolution images with band-independent geometry of multispectral pixels,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4610–4617, Aug. 2017.
-  L. Alparone, B. Aiazzi, S. Baronti, A. Garzelli, F. Nencini, and M. Selva, “Multispectral and panchromatic data fusion assessment without reference,” Photogrammetric Engineering and Remote Sensing, vol. 74, no. 2, pp. 193–200, Feb. 2008.
-  G. Masi, D. Cozzolino, L. Verdoliva, and G. Scarpa, “Pansharpening by convolutional neural networks,” Remote Sensing, vol. 8, no. 7, p. 594, Jul. 2016.
-  J. Yang, X. Fu, Y. Hu, Y. Huang, X. Ding, and J. Paisley, “PanNet: A deep network architecture for pan-sharpening,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, Oct. 2017.
-  W. Huang, L. Xiao, Z. Wei, H. Liu, and S. Tang, “A new pan-sharpening method with deep neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 5, pp. 1037–1041, May 2015.
-  G. Scarpa, S. Vitale, and D. Cozzolino, “Target-adaptive CNN-based pansharpening,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 9, pp. 5443–5457, Sep. 2018.
-  C. Lanaras, J. Bioucas-Dias, S. Galliani, E. Baltsavias, and K. Schindler, “Super-resolution of sentinel-2 images: Learning a globally applicable deep neural network,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 146, pp. 305–319, Dec. 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conf. Comp. Vis., 2014, pp. 184–199.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Conf. Comp. Vis. Pattern Recog., 2016, pp. 1646–1654.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comp. Vis. Pattern Recog. Workshops, vol. 1, no. 2, 2017, p. 3.
-  SpaceNet on Amazon Web Services (AWS), “Datasets,” Apr. 2018, [Accessed 1-July-2018]. [Online]. Available: https://spacenetchallenge.github.io/datasets/datasetHomePage.html
-  I. Loshchilov and F. Hutter, “Fixing weight decay regularization in adam,” arXiv preprint arXiv:1711.05101, 2017.
-  Y. Jia, E. Shelhamer, J. Donahue et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Int. Conf. Mul., 2014, pp. 675–678.
-  M. Abadi, P. Barham, J. Chen et al., “Tensorflow: A system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
-  W. Lucien, “Data fusion: Definitions and architectures,” Les Presses de l’Ecole des Mines, 2002.
-  J. Zhou, D. Civco, and J. Silander, “A wavelet transform method to merge landsat tm and spot panchromatic data,” International Journal of Remote Sensing, vol. 19, no. 4, pp. 743–757, 1998.