Multispectral Pan-sharpening via Dual-Channel Convolutional Network with Convolutional LSTM Based Hierarchical Spatial-Spectral Feature Fusion

by   Dong Wang, et al.
The University of Melbourne

Multispectral pan-sharpening aims to produce a multispectral (MS) image with high resolution (HR) in both the spatial and spectral domains by fusing a panchromatic (PAN) image with a corresponding MS image. In this paper, we propose a novel dual-channel network (DCNet) framework for MS pan-sharpening. In DCNet, the dual-channel backbone comprises a spatial channel that captures spatial information with a 2D CNN and a spectral channel that extracts spectral information with a 3D CNN. This heterogeneous 2D/3D CNN architecture reduces the spectral distortion that typically arises in conventional 2D CNN models. To fully integrate the spatial and spectral features captured at different levels, we introduce a multi-level fusion strategy. Specifically, a spatial-spectral CLSTM (S^2-CLSTM) module is proposed for fusing the hierarchical spatial and spectral features; it effectively captures correlations among multi-level features. The S^2-CLSTM module provides two fusion modes: intra-level fusion via bi-directional lateral connections and inter-level fusion via the cell state of the S^2-CLSTM. Finally, the ideal HR-MS image is recovered by a reconstruction module. Extensive experiments have been conducted at both the simulated lower scale and the original scale of real-world datasets. Compared with state-of-the-art methods, the proposed DCNet achieves superior or competitive performance.





1 Introduction

High resolution (HR) multispectral (MS) images in both spectral and spatial domains are desirable for many practical applications, such as environmental monitoring bullock2020monitoring , object detection gong2019context , and classification fang2018semi ; li2017spectral ; fang2019hyperspectral . However, due to the hardware limitations, it is hard to provide such ideal images, and only panchromatic (PAN) images with high spatial resolution and low spatial resolution MS images can be captured by sensors, e.g., IKONOS, GaoFen-2, and WorldView-2. Multispectral pan-sharpening refers to the technology of obtaining the HR-MS image by fusing the PAN image and the corresponding MS image liu2018deep .

Over the last decades, a wide variety of MS pan-sharpening methods have been proposed, e.g., component substitution (CS) based methods garzelli2007optimal ; aiazzi2007improving , multi-resolution analysis (MRA) based methods khan2008indusion ; ranchin2003image , model-based methods palsson2019model ; fu2019variational , etc. CS-based methods assume that spatial information and spectral information are separable. First, they up-sample the MS images to the same spatial resolution as the PAN images, and then the MS and PAN images are projected into another feature domain. Finally, the LR component of the MS images is substituted by its counterpart from the PAN images thomas2008synthesis . The underlying assumption of MRA methods is that the high-frequency details missing from MS images can be supplemented by PAN images. They therefore extract the high-frequency information of the PAN images and inject it into the MS images while retaining all the original spectral information of the MS images. Generally speaking, model-based methods can obtain fused images of relatively high quality. However, they are slow at inferring the desired HR-MS images because their optimization algorithms are usually computationally intensive.
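The MRA idea described above (extract the high-frequency part of the PAN image and add it to the up-sampled MS band) can be sketched on a 1D intensity profile. This is only an illustration: the moving-average filter, the `gain` parameter, and the function names are assumptions made for the example and are not taken from any of the cited methods.

```python
def box_lowpass(signal, radius=1):
    """Moving-average low-pass filter; indices are clamped at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def mra_inject(pan, ms_upsampled, gain=1.0):
    """Add the high-frequency part of PAN to an already up-sampled MS band."""
    low = box_lowpass(pan)
    details = [p - l for p, l in zip(pan, low)]  # PAN minus its low-pass version
    return [m + gain * d for m, d in zip(ms_upsampled, details)]


pan = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]        # sharp edge profile (hypothetical data)
ms_band = [0.2, 0.3, 0.5, 0.5, 0.3, 0.2]    # blurred, up-sampled MS band
fused = mra_inject(pan, ms_band)
```

A constant PAN profile has no high-frequency content, so in that case the MS band is returned unchanged, which matches the statement that the original spectral information is retained.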

Recently, a large number of convolutional neural network (CNN) based methods have been proposed for pan-sharpening and have shown great potential. The high nonlinearity of deep CNNs facilitates modeling the fusion of PAN and MS images. Although these methods have achieved promising performance, there is still room for reducing the spatial and spectral distortions. For example, the L-shaped object in the green bounding box of Fig. 1(b) differs from the lunar shape of the ground truth in Fig. 1(a), which is referred to as spatial distortion. The spectral distortion can be inferred from the color difference between the ground truth and the fused image, e.g., the rectangular object in Fig. 1(c) is whiter than the pink ground truth in Fig. 1(a).

(a) Ground truth
(b) Spatial distortion
(c) Spectral distortion
Figure 1: An example of spatial and spectral distortions. The left image is the ground truth. The green bounding boxes indicate these distortions.

Spatial and spectral distortions in the pan-sharpened MS image have many causes. One of them is that most methods employ 2D CNNs as the spectral feature extractor for 3D MS images. A 2D CNN is an effective spatial feature extractor for 2D PAN images. When applied to 3D MS images, however, it mixes the different spectral bands, which makes it challenging to recover each band of the HR-MS image from the 2D feature maps. Besides, most existing methods fuse spatial and spectral information only at an early stage or a late stage. The hierarchical CNN features, which are powerful representations, should also be exploited for spatial and spectral information fusion. A better fusion of these hierarchical features can significantly contribute to reducing the spatial and spectral distortions.
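The band-mixing argument can be made concrete for a single pixel. The sketch below is a simplified illustration with 1x1 (respectively 1x1x1) kernels and hypothetical function names; it is not the network's actual layer configuration. In a 2D convolution the bands act as input channels and are summed into each output value, whereas a 3D convolution slides its kernel along the band axis and keeps that axis in the output.

```python
def conv2d_1x1(pixel_bands, weights):
    """2D convolution with a 1x1 kernel: bands are input channels and get summed."""
    return sum(w * x for w, x in zip(weights, pixel_bands))


def conv3d_1x1x1(pixel_bands, weight):
    """3D convolution with a 1x1x1 kernel: the same weight visits each band."""
    return [weight * x for x in pixel_bands]


pixel = [0.1, 0.4, 0.3, 0.8]  # one pixel of a 4-band MS image (hypothetical values)
mixed = conv2d_1x1(pixel, [0.25, 0.25, 0.25, 0.25])  # one number, bands collapsed
kept = conv3d_1x1x1(pixel, 2.0)                      # four numbers, one per band
```

With the 2D kernel, the per-band structure has to be re-learned by later layers; with the 3D kernel, it is preserved by construction.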

To this end, we propose a dual-channel network (DCNet) that includes three main components: the dual-channel backbone for spatial and spectral feature extraction, a novel spatial-spectral CLSTM (S^2-CLSTM) for hierarchical fusion, and the reconstruction module for HR-MS image synthesis. Specifically, our model first uses the dual-channel backbone to capture spatial and spectral information from 2D PAN images and 3D MS images, respectively. Then, the extracted hierarchical spatial and spectral features are fused by the S^2-CLSTM module, which carries out both intra-level and inter-level fusion. Finally, a reconstruction module generates the ideal HR-MS image. The main contributions of this paper are summarized as follows:

  • A heterogeneous dual-channel backbone is proposed for feature extraction in multispectral pan-sharpening. The spatial channel is designed to capture spatial details of 2D PAN image with a 2D CNN, while the spectral channel is responsible for obtaining spectral information in 3D MS images with a 3D CNN.

  • Different from most existing CNN-based pan-sharpening approaches, which employ a single-level fusion strategy, i.e., early fusion or late fusion, the proposed DCNet adopts a hierarchical fusion strategy to integrate spatial and spectral features level by level.

  • An S^2-CLSTM module is proposed to effectively capture correlations among hierarchical spatial and spectral features and fully integrate these features. In the S^2-CLSTM, the intra-level and inter-level spatial and spectral feature fusion are carried out via bi-directional lateral connections and the cell state of the S^2-CLSTM, respectively. To the best of our knowledge, CLSTM is employed in the field of pan-sharpening for the first time.

  • Extensive experiments on three datasets, i.e., IKONOS, GaoFen-2, and WorldView-2, have been conducted at both the lower scale and the original scale. In the lower-scale experiments, the proposed DCNet substantially outperforms state-of-the-art methods. Experimental results at the original scale also demonstrate that DCNet achieves superior or competitive performance.

The remainder of this paper is organized as follows. Section 2 briefly introduces the background knowledge of the CLSTM and the existing CNN-based pan-sharpening methods. A detailed description of the proposed method is presented in Section 3. The experimental results are presented and discussed in Section 4. Finally, the conclusion of this paper is given in Section 5.

2 Related Work

2.1 CNN for multispectral pan-sharpening

Recently, CNNs have become very popular in the pan-sharpening community. Most CNN-based methods employ 2D CNNs for feature extraction from both PAN and MS images. Besides, spatial and spectral information are normally fused at an early stage or a late stage. Few methods have realized the potential of the hierarchical spatial and spectral features. These methods are described below.

Inspired by SRCNN dong2015image , Masi et al. masi2016pansharpening proposed PNN, which is the first pan-sharpening model based on CNN. In the underlying architecture of PNN, the LR MS image is first upsampled by interpolation. The upsampled MS bands and the PAN band are then concatenated and processed by a three-layer 2D CNN, which means that the network fuses the PAN and MS images at the early stage. PNN works at HR from the beginning, and the output comprises four bands corresponding to the HR-MS image. Scarpa et al. scarpa2018target extended PNN with residual learning and obtained a significant performance gain over PNN. Different from PNN, the proposed PNN+ has a skip connection from the input to the output of the network. Besides, a target-adaptive tuning phase is introduced to address the problem of insufficient data and allows users to apply the proposed architecture to their own datasets. Wei et al. wei2017boosting introduced a deep convolutional neural network with residual learning (DRPNN). The network takes the concatenation of PAN and MS images as input, as PNN does. DRPNN is a very deep convolutional neural network designed to make full use of the high nonlinearity of deep learning models. All of these networks employ 2D CNNs for pan-sharpening, and they fuse MS and PAN images at the early stage.

Yang et al. yang2017pannet designed a pan-sharpening network called PanNet that takes the high-pass components of the PAN and MS images as input instead of the original images. In PanNet, domain-specific knowledge is incorporated to preserve spectral and spatial information. For spectral preservation, the up-sampled multispectral image is added to the network output, which directly propagates the spectral information to the reconstructed image. For spatial preservation, the network is trained in the high-pass filtering domain rather than the image domain: the input is the concatenation of the high-pass components of the PAN and upsampled LR MS images. The spatial and spectral information are therefore fused at an early stage in this method. Later, the multi-scale and multi-depth convolutional neural network (MSDCNN) was proposed by Yuan et al. yuan2018multiscale . They also concatenated the PAN band and the MS bands and fed the result into the network, as in PNN, but they employed a shallow branch and a deep multi-scale branch to model pan-sharpening.

Unlike the methods mentioned above, Liu et al. liu2020remote proposed a two-stream fusion network (TFNet) that extracts CNN features from PAN and MS images and then fuses them at the late stage. TFNet has three modules whose functions are feature extraction, feature fusion, and image reconstruction, respectively. TFNet first extracts spatial and spectral features with two 2D CNNs and then fuses these features. All convolutional layers in TFNet are 2D, and the features of the two streams are fused only at the late stage. Zhang et al. zhang2019pan presented a bi-directional pyramid network (BDPN) for pan-sharpening, which fuses the features of PAN and MS images at two stages. However, they inject the spatial information of PAN images into MS images in the image domain instead of the feature domain, which prevents BDPN from leveraging the hierarchical features. Shao et al. shao2018remote proposed a remote sensing image fusion network named RSIFNN that can adequately extract spectral and spatial features from source images. Although they studied the effect of different depths of each branch, they fuse the spatial and spectral features only at the late stage and fail to leverage the hierarchical features of the PAN and MS branches.

2.2 CLSTM

Long short-term memory (LSTM) has achieved great success for sequence modeling in various tasks, e.g., language processing, speech recognition graves2013speech , and visual question answering bai2020decomvqanet . With the cell state and gates, LSTMs can remove information from or add information to the cell state and remember long-term dependencies. However, LSTMs only take 1-D vectors as input and thus cannot be applied to 2D feature maps. Shi et al. xingjian2015convolutional introduced the 2D convolution operation into LSTM and proposed CLSTM, which can process 2-D feature maps and automatically capture temporal dependencies.
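For reference, the standard CLSTM update of Shi et al. can be written as follows, where $*$ denotes convolution and $\circ$ the Hadamard product. The notation follows the common presentation of that model ($X_t$ input, $H_t$ hidden state, $C_t$ cell state); it is given here as background and is not the S^2-CLSTM formulation.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

The Hadamard terms $W_{c\cdot} \circ C$ require the weights to have the same spatial size as the cell state, which ties the model to a fixed input size.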

CLSTMs can also be used for 3D data processing. Song et al. song2018pyramid proposed a fast video salient object detection model based on pyramid dilated bidirectional ConvLSTM (PDB-ConvLSTM). In liu2016spatio , Liu et al. proposed a powerful tree-structure based traversal method to model the 3D skeleton and used CLSTM to handle the noise and occlusions in 3D skeleton data. Jiang et al. jiang2017predicting proposed an object-to-motion convolutional neural network (OM-CNN), in which a two-layer convolutional long short-term memory (2C-LSTM) network is used to predict video saliency.

3 Methodology

In this section, we first present the proposed DCNet. Then we describe the three main components of DCNet in detail: the dual-channel backbone, the S^2-CLSTM, and the reconstruction module. Finally, the objective function is introduced.

3.1 Fusion framework

The purpose of multispectral pan-sharpening is to obtain an HR-MS image by fusing a PAN image and a corresponding MS image, which has 4 bands for the IKONOS and GaoFen-2 satellites and 8 bands for the WorldView-2 satellite. In this paper, the observed PAN image is a single-band image of a given height and width. The corresponding MS image is smaller in each spatial dimension by a spatial reduction ratio of 4. The pan-sharpened HR-MS image has the height and width of the PAN image and the same number of bands as the MS image.

A detailed illustration of the proposed DCNet can be found in Fig. 2. Specifically, a dual-channel backbone, which takes the PAN and MS images as input, is utilized to obtain the hierarchical spatial and spectral features. Then, the S^2-CLSTM, located between the two channels in the figure, fuses the spatial and spectral information. Finally, the reconstruction module takes the high-level features of the spatial channel, the spectral channel, and the S^2-CLSTM as input and synthesizes the desired HR-MS image. The coordinate system indicating the dimensions of width, height, bands, and channels is at the left of Fig. 2.

Figure 2: The architecture of the proposed DCNet for multispectral pan-sharpening. The figure annotates the height, the width, the number of bands, the number of channels, and the filter number ratio between the spatial channel and the spectral channel.

3.2 The dual-channel Backbone

3.2.1 Spatial channel

The spatial channel contains a stem layer and stacked 2D residual blocks. The stem layer comprises a 2D convolutional layer and a parametric ReLU (PReLU), which is a type of leaky ReLU. The residual blocks are indexed by level, starting from 1 for the first residual block. The stem layer extracts the initial spatial feature from the PAN image, and the residual block at each level then processes the spatial feature of the preceding level together with the feature that the S^2-CLSTM provides to the spatial channel.

Each 2D residual block has two successive convolutional layers with learnable parameters, applied to the input of the block.

3.2.2 Spectral channel

The overall architecture of the spectral channel is the same as that of the spatial channel. The first layer is a stem, which consists of a 3D convolutional layer and a PReLU, and the others are 3D residual blocks. The 3D residual blocks are indexed by level, and each one produces the output spectral feature at its level. Each 3D residual block processes 3D information with 3D convolutional layers and an activation function.

3.3 S^2-CLSTM module

Once the spatial and spectral representations are obtained, the S^2-CLSTM module is utilized to fully integrate these features. The S^2-CLSTM module provides two fusion modes: intra-level fusion and inter-level fusion. The former is carried out via the bi-directional lateral connections and the latter via the cell state of the S^2-CLSTM.

Figure 3: The architecture of the S^2-CLSTM fusion module. The figure uses dedicated symbols for element-wise multiplication, element-wise summation, the sigmoid activation function, and the tanh function.

The architecture of the S^2-CLSTM is shown in Fig. 3. The S^2-CLSTM has two inputs (the spatial and spectral features of the dual channels at the current level), two outputs (one output feature for each channel), and three gates (forget gate, input gate, and output gate). It also maintains a hidden state and a cell state at each level, both of which are initialized to zero. Each of these gates can be thought of as a "standard" neuron, and they are connected by black lines in Fig. 3.

The other three main components of the S^2-CLSTM are the input flow, the output flow, and the cell state. As lateral connections, the input flow (red lines) and the output flow (green lines) bi-directionally connect the spatial and spectral channels. The cell state (the dark yellow line) integrates features from different levels. The forget gate, input gate, and output gate are used to select features for the lateral connections and the cell state.

3.3.1 Intra-level fusion via lateral connections

In the input flow, the spatial and spectral information from the two channels is merged by element-wise addition. The fused spatial-spectral information then passes through the activation function. Finally, the spatial-spectral feature is selected by the input gate, whose activations at each level are computed with convolution and deconvolution operations that have learnable parameters. It is worth noting that one of the convolutions used here is a grouped convolution in which the number of groups equals the channel dimension. One advantage of the grouped convolution is that it removes the restriction on the input image size imposed by the Hadamard product in the original CLSTM. Because the input images can therefore have any size, the block effect in the pan-sharpened image is eliminated.

The output flow is the information flow from the S^2-CLSTM to the spatial and spectral channels, which is represented by the green lines in Fig. 3. The cell state memorizes the low-level features, and the S^2-CLSTM automatically extracts the hidden state through the output gate. Since the spectral channel and the S^2-CLSTM operate on 3D data, the output feature for the spatial channel needs to be transformed to 2D. This transformation is defined for each spatial location of the pixels in the feature maps and uses a concatenation operation, while the output feature for the spectral channel is taken from the hidden state.

3.3.2 Inter-level fusion via the cell state

As the fused spatial-spectral features of previous levels are memorized in the cell state, the cell state acts as a bridge that connects the current level with previous levels. It can therefore effectively capture the underlying correlations among multi-level representations and facilitate inter-level fusion. Dark yellow lines show the information flow of inter-level fusion in Fig. 3. In this procedure, the gate activations are computed by convolutional layers with learnable parameters and are combined with the cell state by element-wise multiplication.
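The two fusion modes can be sketched in a few lines. The code below assumes the standard LSTM-style update (forget gate times previous cell state plus input gate times candidate) and replaces all convolutions with fixed scalar gate pre-activations, so it illustrates only the data flow; `fusion_step` and `gate_logits` are hypothetical names, and the exact S^2-CLSTM equations may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fusion_step(spatial, spectral, cell_prev, gate_logits):
    """One fusion level on flat feature lists.

    gate_logits holds (forget, input, output) pre-activations; in a real
    network these would come from convolutional layers (assumption).
    """
    f, i, o = (sigmoid(z) for z in gate_logits)
    # Intra-level fusion: element-wise addition, then activation.
    candidate = [math.tanh(a + b) for a, b in zip(spatial, spectral)]
    # Inter-level fusion: the cell state carries earlier levels forward.
    cell = [f * c + i * g for c, g in zip(cell_prev, candidate)]
    hidden = [o * math.tanh(c) for c in cell]
    return hidden, cell


cell = [0.0, 0.0]  # cell state initialized to zero
levels = [([0.2, 0.1], [0.3, -0.1]), ([0.4, 0.0], [0.1, 0.2])]  # toy features
for spatial, spectral in levels:
    hidden, cell = fusion_step(spatial, spectral, cell, (1.0, 0.5, 0.0))
```

After the loop, `cell` depends on the features of both levels, which is the sense in which the cell state bridges the current level and the previous ones.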

3.4 Reconstruction module

As the last part of the proposed DCNet, the reconstruction module recovers the desired HR-MS image from the output features of the spatial channel, the spectral channel, and the S^2-CLSTM. Fig. 4 illustrates the reconstruction module, which can be divided into four components: a 3D de-convolutional layer, a bottleneck layer, a 3D residual block, and a 3D convolutional layer without activation. First, the feature from the spatial channel is projected into 3D by a de-convolutional layer. Next, we concatenate it with the features from the spectral channel and the S^2-CLSTM. Then, the bottleneck layer weights the three 3D features with its filters. After that, the output of this layer is fed into a 3D residual block. Finally, the filters of the last convolutional layer recover the ideal HR-MS image.

Figure 4: The architecture of the reconstruction module

3.5 Objective function

In the training phase, the objective is to find the parameters of the DCNet that minimize a loss function over the training set, which consists of triplets of a PAN image, an MS image, and the corresponding ground truth. The first part of the loss function is an l1 loss between the network output and the ground truth, which is efficient and edge-sensitive. To prevent overfitting, we add an l2 penalty on the parameters as a regularization term. A balancing parameter controls the relative importance of the l1 loss and the regularization term. Upon convergence, the parameters are frozen and can be used for tests on both the lower-scale and the original-scale data.
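As a minimal sketch of this objective, the function below combines a mean absolute error with a weighted sum of squared parameters. The averaging convention, the default value of `lam`, and the function names are assumptions made for the example; the source does not state them.

```python
def l1_loss(pred, target):
    """Mean absolute error between flattened prediction and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_penalty(params):
    """Sum of squared parameters (the regularization term)."""
    return sum(w * w for w in params)


def objective(pred, target, params, lam=1e-4):
    """l1 data term plus lam times the l2 penalty; lam is a hypothetical value."""
    return l1_loss(pred, target) + lam * l2_penalty(params)


loss = objective([1.0, 2.0], [1.0, 4.0], [3.0], lam=0.5)
```

In the toy call, the data term is 1.0 and the penalty contributes 0.5 * 9.0, so the balancing parameter directly scales how much large weights are discouraged.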

4 Experiments

4.1 Data sets

In this section, three datasets are constructed to compare the performance of DCNet with state-of-the-art networks. The original data were acquired by three satellites: IKONOS, GaoFen-2, and WorldView-2. Each satellite carries a PAN sensor and an MS sensor. The details of these datasets are described below.

The first dataset was acquired by the IKONOS satellite over a mountainous area in the west of Sichuan Province, China, in 2008. The spectral resolutions of PAN, blue, green, red, and near-infrared bands are nm, nm, nm, nm, and nm, respectively. We obtained two pairs of PAN and MS images. The PAN images consist of and pixels, respectively, while the sizes of the MS images are and . The spatial resolutions of these PAN and MS images are 1 m and 4 m, respectively.

The second dataset was taken over the Guangzhou, China mall by GaoFen-2 in 2016 with PAN and MS dimensions are and respectively. The spectral resolutions of PAN, blue, green, red, and near-infrared bands are , , , , and nm, respectively. The spatial resolutions are the same as IKONOS.

The third dataset was acquired by the WorldView-2 satellite over Washington, DC, USA, in 2016. The spectral resolutions of the Coastal, Blue, Green, Yellow, Red, Red edge, Near-IR1, Near-IR2, and PAN bands are , , , , , , , , and nm, respectively. We obtained one pair of PAN and MS images. The PAN image consists of pixels, while the size of the MS image is . Different from IKONOS and GaoFen-2, WorldView-2 has higher spatial resolutions of 0.5 m and 2 m for the PAN and MS images, respectively.

Dataset Train set Validation set Test set
IKONOS 400 100 50
GaoFen-2 400 100 50
WorldView-2 400 100 50
Table 1: The distribution of images for training, validation, and testing.

From the above, we can see that the spectral resolutions of IKONOS, GaoFen-2, and WorldView-2 differ from each other. Thus, the datasets cannot be merged for training and testing. The remote sensing images mentioned above are cut into patches. Because of the size of our network, the height and width of the patches in the training and validation sets are 128, whereas the test patches have a larger height and width of 256. To better test generalization, the resulting datasets cover different types of areas, e.g., urban, mountain, lake, and so on. In total, 1,650 images were collected, 550 for each dataset.

Following huang2015new , the original MS images with 4 or 8 bands from IKONOS, GaoFen-2, and WorldView-2 were used as the ground truth, and they were down-sampled with the bicubic interpolation algorithm to obtain the simulated MS images with low spatial resolution according to Wald's protocol Wald1997Fusion . The PAN images were down-sampled using the same process. The distribution of images in the resulting datasets is listed in Table 1.
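The reduced-scale construction can be sketched as follows. To stay dependency-free, the example uses block averaging as a stand-in for the bicubic interpolation used in the paper, and it handles a single MS band; both simplifications, and the function names, are assumptions of this sketch.

```python
def downsample(image, ratio=4):
    """Reduce a 2D image by block averaging (stand-in for bicubic resampling)."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[r * ratio + dr][c * ratio + dc]
                for dr in range(ratio) for dc in range(ratio)) / ratio ** 2
            for c in range(w // ratio)
        ]
        for r in range(h // ratio)
    ]


def make_reduced_scale_sample(pan, ms_band, ratio=4):
    """Return (degraded PAN, degraded MS band, ground truth = original MS band)."""
    return downsample(pan, ratio), downsample(ms_band, ratio), ms_band


pan = [[float(r * 16 + c) for c in range(16)] for r in range(16)]  # toy 16x16 PAN
ms_band = [[1.0] * 4 for _ in range(4)]                            # toy 4x4 MS band
lr_pan, lr_ms, gt = make_reduced_scale_sample(pan, ms_band)
```

After degradation, the PAN image has the same spatial size as the original MS band, so the original MS band can serve as the reference for the fused output.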

4.2 Experiment setting

Channel Stem Level 1, 2, and 3 Level 4


Table 2: Hyper-parameters of the DCNet for the IKONOS and GaoFen-2 datasets.
Channel Stem Level 1, 2, and 3 Level 4
Table 3: Hyper-parameters of the DCNet for the WorldView-2 dataset.

We implemented the proposed network using the PyTorch framework. For each dataset, the proposed model was trained for 1600 epochs over the entire dataset with the Adam optimizer kingma2014adam . The experiments were carried out on a GPU server, and two NVIDIA GeForce TITAN Xp GPUs (12 GB memory per GPU) were used for training. The batch size was set to 20. The learning rate was initially set to 0.001 and reduced by 20% every 150 epochs. The other hyper-parameters of DCNet are shown in Tables 2 and 3, which list the kernel, channel, and stride sizes of the spatial channel. In the spectral channel, the kernels and strides have an additional dimension for the number of bands. The tables also give the shapes of the spatial and spectral features.
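The learning-rate schedule described above can be written as a step decay. The sketch assumes that "reduced by 20%" means multiplying the rate by 0.8 at each step; if the intended meaning were a reduction to 20%, the factor would be 0.2 instead. The function name and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.001, drop=0.20, step=150):
    """Step decay: multiply the rate by (1 - drop) every `step` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // step)


# Learning rate at the start, after the first decay, and in the last epoch.
schedule = [learning_rate(e) for e in (0, 150, 1599)]
```

Under this assumption, 1600 epochs contain ten decay steps, so the final rate is roughly a tenth of the initial one.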

4.3 Evaluation at lower scale

As mentioned above, the training samples come from the IKONOS, GaoFen-2, and WorldView-2 satellites. In this section, the proposed DCNet is compared with six state-of-the-art methods: PNN masi2016pansharpening , PNN+ scarpa2018target , DRPNN wei2017boosting , PanNet yang2017pannet , MSDCNN yuan2018multiscale , and ResTFNet liu2020remote . We assess these methods by visual evaluation and by quantitative evaluation with standard metrics.

(a) Ground truth
(b) DCNet(ours)
(c) ResTFNet
(e) PNN
(f) PanNet
(g) PNN+
Figure 5: Pan-sharpened images by different methods on the IKONOS dataset.

For visual evaluation, the fused images are visualized to check for spatial and spectral distortions. We begin with the IKONOS dataset. Fig. 5 shows an example of an experiment performed on an IKONOS image. Since MS images have more than three bands, only the red, green, and blue bands are extracted to synthesize the true-color images. The ground truth is shown in Fig. 5(a). Fig. 5(b)-(h) display the pan-sharpened images produced by the different methods. The proposed DCNet produces the pan-sharpened image with the best visual quality in terms of spectral preservation, e.g., the yellow part reconstructed by the proposed network is the closest to the ground truth. The proposed DCNet also provides images with more precise spatial details.

(a) Ground truth
(b) DCNet(ours)
(c) ResTFNet
(e) PNN
(f) PanNet
(g) PNN+
Figure 6: Pan-sharpened images by different methods on the GaoFen-2 dataset.
(a) Ground truth
(b) DCNet(ours)
(c) ResTFNet
(e) PNN
(f) PanNet
(g) PNN+
Figure 7: Pan-sharpened images by different methods on the WorldView-2 dataset.

Fig. 6 illustrates an experiment performed on a GaoFen-2 image. The observations are similar to those for Fig. 5. PNN, PNN+, DRPNN, PanNet, and MSDCNN produce spectral distortion, as indicated by the white rectangular object, which is pink in the reference MS image. TFNet can retain most of the spectral information of the pink rectangular object, but the lunar-shaped object changes to an L shape. The fused image produced by LGC is blurred. The proposed network obtains the pan-sharpened image closest to the ground truth for both the white lunar-shaped object and the pink rectangle. DCNet produces the least distortion in both the spatial and spectral domains. Although no apparent difference among the pan-sharpened images can be identified visually in Fig. 7, the quantitative evaluation in Table 6 demonstrates the excellent performance of the proposed DCNet.

To quantitatively evaluate the performance of DCNet and the state-of-the-art methods, five popular indices have been employed: Q4 (for 4-band images) or Q8 (for 8-band images) zeng2010fusion , the universal image quality index (UIQI) wang2002universal , the spectral angle mapper (SAM) dennison2004comparison , the relative dimensionless global error in synthesis (ERGAS) ayhan2012spectral , and the spatial correlation coefficient (SCC) zhou1998wavelet . Q4/Q8, UIQI, and ERGAS comprehensively evaluate the spectral and spatial quality of the fused image. SCC is a widely used index for measuring the spatial quality of the fused image. In contrast, SAM measures the spectral distortion of the fused image relative to the reference image.
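Two of these indices are simple enough to sketch directly. The code below uses the commonly cited definitions of SAM (mean angle between reference and fused spectral vectors, in degrees) and ERGAS (scaled root of the band-averaged squared relative RMSE); the exact variants used in the experiments are not stated in the text, so the formulas, the `ratio` argument, and the function names should be read as assumptions. Pixels are lists of band values, and zero-norm pixels or zero-mean bands are not handled.

```python
import math


def sam_degrees(ref_pixels, fused_pixels):
    """Mean spectral angle in degrees over all pixels."""
    angles = []
    for r, f in zip(ref_pixels, fused_pixels):
        dot = sum(a * b for a, b in zip(r, f))
        norm = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in f))
        cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
        angles.append(math.degrees(math.acos(cos)))
    return sum(angles) / len(angles)


def ergas(ref_pixels, fused_pixels, ratio=4):
    """100 / ratio * sqrt(mean over bands of (RMSE_b / mean_b)^2)."""
    n, bands = len(ref_pixels), len(ref_pixels[0])
    total = 0.0
    for b in range(bands):
        mse = sum((r[b] - f[b]) ** 2 for r, f in zip(ref_pixels, fused_pixels)) / n
        mean = sum(r[b] for r in ref_pixels) / n
        total += mse / mean ** 2
    return 100.0 / ratio * math.sqrt(total / bands)


reference = [[2.0, 4.0], [2.0, 4.0]]   # two toy pixels with two bands each
fused = [[1.0, 4.0], [3.0, 4.0]]
sam_value = sam_degrees(reference, fused)
ergas_value = ergas(reference, fused)
```

Both indices are zero for a perfect reconstruction, which matches the ideal values listed in the tables.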

Method Q4 UIQI SAM ERGAS SCC
PNN masi2016pansharpening 0.5474 0.8458 4.9105 4.2016 0.8010
PNN+ scarpa2018target 0.5518 0.8599 4.2989 3.9579 0.8173
DRPNN wei2017boosting 0.5320 0.8465 4.5075 4.1639 0.8055
PanNet yang2017pannet 0.5293 0.8393 4.7777 4.2836 0.8014
MSDCNN yuan2018multiscale 0.5683 0.8686 4.1819 3.7883 0.8328
ResTFNet liu2020remote 0.6592 0.8964 4.1303 3.3761 0.9186
DCNet(ours) 0.7236 0.9242 3.6596 2.7353 0.9502
Ideal value 1 1 0 0 1
Table 4: Quantitative evaluation results of different methods on the IKONOS dataset. The optimal and the sub-optimal results are in red and blue, respectively.
Method Q4 UIQI SAM ERGAS SCC
PNN masi2016pansharpening 0.7390 0.9094 4.3010 4.3037 0.8776
PNN+ scarpa2018target 0.7843 0.9488 3.0594 3.5287 0.9092
DRPNN wei2017boosting 0.7275 0.8719 4.6670 4.8960 0.8515
PanNet yang2017pannet 0.7504 0.9008 4.1774 4.4150 0.8752
MSDCNN yuan2018multiscale 0.8031 0.9507 3.0032 3.2756 0.9213
ResTFNet liu2020remote 0.6907 0.9164 3.7689 3.9698 0.9139
DCNet(ours) 0.8707 0.9741 2.1171 2.3805 0.9655
Ideal value 1 1 0 0 1
Table 5: Quantitative evaluation results of different methods on the GaoFen-2 dataset. The optimal and the sub-optimal results are in red and blue, respectively.
Method Q8 UIQI SAM ERGAS SCC
PNN masi2016pansharpening 0.5520 0.8717 7.0039 4.1212 0.8709
PNN+ scarpa2018target 0.6209 0.9027 5.6871 3.4700 0.9116
DRPNN wei2017boosting 0.6399 0.9025 5.6503 3.3753 0.9226
PanNet yang2017pannet 0.5604 0.8740 6.9058 4.1172 0.8742
MSDCNN yuan2018multiscale 0.6628 0.9124 5.3072 3.1768 0.9303
ResTFNet liu2020remote 0.6587 0.9107 5.3367 3.3563 0.9180
DCNet(ours) 0.6982 0.9249 4.6668 2.8387 0.9476
Ideal value 1 1 0 0 1
Table 6: Quantitative evaluation results of different methods on the WorldView-2 dataset. The optimal and the sub-optimal results are in red and blue, respectively.

The quantitative evaluation results are shown in Tables 4-6. The optimal and sub-optimal results are shown in red and blue, respectively. For the spectral metric SAM, the spatial metric SCC, and the global metrics, DCNet outperforms the other methods on all three datasets. For example, for the SAM index on the GaoFen-2 dataset, DCNet achieves 2.1171, whereas the sub-optimal result is 3.0032. These results demonstrate that the proposed DCNet exceeds all the compared state-of-the-art methods at the lower scale and that the images pan-sharpened by DCNet have the least spatial and spectral distortion.

4.4 Evaluation at original scale

In this section, we compare the proposed DCNet with state-of-the-art methods at the original scale of the PAN and MS images. As there are no ground truth images, we use the models trained on the lower-scale images, and the original PAN and MS images are adopted as the spatial and spectral references, respectively. PNN, PNN+, DRPNN, PanNet, MSDCNN, ResTFNet, and the proposed network are evaluated and compared by both visual evaluation and quantitative evaluation.

Figure 8: Pan-sharpened images by different methods on the WorldView-2 dataset. Panels: (a) PAN, (b) MS, (c) DCNet (ours), (d) ResTFNet, (f) PNN, (g) PanNet, (h) PNN+.

For visual evaluation, we enlarge a small region of each sub-image in Fig. 8. The results of PNN, PNN+, DRPNN, PanNet, and MSDCNN in Figs. 8(e-i) show apparent spectral distortion in the green circle, because these methods adopt early fusion: they concatenate the PAN image and the bands of the MS image as input, which makes it hard for the model to distinguish each band. The pan-sharpened image in Fig. 8(d) suffers from spatial distortion, as it contains more edges than the PAN image. In contrast, the proposed DCNet makes full use of the spatial information provided by the PAN image while preventing distortion of the spectral content.

Dataset Method D_λ D_s QNR
IKONOS PNN masi2016pansharpening 0.0485 0.0608 0.8938
PNN+ scarpa2018target 0.0182 0.0718 0.9114
DRPNN wei2017boosting 0.0219 0.0726 0.9072
PanNet yang2017pannet 0.0235 0.0605 0.9175
MSDCNN yuan2018multiscale 0.0142 0.0579 0.9287
ResTFNet liu2020remote 0.0215 0.0561 0.9236
DCNet(ours) 0.0174 0.0521 0.9314
GaoFen-2 PNN masi2016pansharpening 0.0336 0.0592 0.9092
PNN+ scarpa2018target 0.0223 0.0546 0.9244
DRPNN wei2017boosting 0.0657 0.0911 0.8493
PanNet yang2017pannet 0.0394 0.0590 0.9040
MSDCNN yuan2018multiscale 0.0177 0.0421 0.9410
ResTFNet liu2020remote 0.0141 0.0312 0.9552
DCNet(ours) 0.0128 0.0307 0.9569
WorldView-2 PNN masi2016pansharpening 0.0277 0.0825 0.8922
PNN+ scarpa2018target 0.0185 0.0952 0.8884
DRPNN wei2017boosting 0.0414 0.0904 0.8727
PanNet yang2017pannet 0.0257 0.0980 0.8793
MSDCNN yuan2018multiscale 0.0346 0.0784 0.8900
ResTFNet liu2020remote 0.0202 0.0824 0.8996
DCNet(ours) 0.0176 0.0863 0.8976
Ideal value 0 0 1
Table 7: Quantitative results evaluated at the original scale on three datasets. The optimal and the sub-optimal results are in red and blue, respectively.

Furthermore, we use the reference-free measurement QNR alparone2008multispectral to assess the pan-sharpened images. The QNR index is composed of two components: the spectral distortion index D_λ and the spatial distortion index D_s. The comparison results in Table 7 are obtained by calculating the mean over 50 images for each dataset. The optimal and the sub-optimal results are in red and blue, respectively. Compared with other state-of-the-art methods, DCNet achieves competitive or superior performance in terms of the D_λ, D_s, and QNR metrics. In other words, DCNet achieves better fusion results in terms of spatial and spectral preservation.
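As a minimal sketch of how the three columns of Table 7 relate, the QNR index of alparone2008multispectral combines the spectral distortion D_λ and the spatial distortion D_s as QNR = (1 - D_λ)^α (1 - D_s)^β. The exponents α = β = 1 are an assumption here (a common default), and the function name `qnr` is illustrative. Since Table 7 reports means over 50 images, recomputing QNR from the mean distortions should match the tabulated values only approximately.

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Quality with No Reference from spectral and spatial distortion."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta


# Mean distortions and QNR reported for DCNet in Table 7.
table7_dcnet = {
    "IKONOS": (0.0174, 0.0521, 0.9314),
    "GaoFen-2": (0.0128, 0.0307, 0.9569),
    "WorldView-2": (0.0176, 0.0863, 0.8976),
}
for dataset, (d_lambda, d_s, reported) in table7_dcnet.items():
    recomputed = qnr(d_lambda, d_s)
    print(f"{dataset}: recomputed {recomputed:.4f}, reported {reported:.4f}")
```

Both distortion indexes are ideally 0, which gives the ideal QNR of 1 listed in the last row of the table.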

4.5 Experimental analysis

4.5.1 Effect of 2D/3D backbone

In this part, we discuss the effectiveness of the heterogeneous 2D/3D dual-channel backbone for spatial and spectral feature extraction. In contrast to the proposed DCNet, which extracts spatial and spectral features with a 2D CNN and a 3D CNN, respectively, some existing two-stream methods shao2018remote adopt 2D CNNs in both channels. We compare the heterogeneous architecture (denoted as 2D/3D) with its homogeneous counterpart (denoted as 2D/2D) and report the results in Table 8. In this experiment, the 2D/3D backbone has the same number of convolutional layers as the homogeneous one, with the same kernel size at each layer. The S-CLSTM module is also applied to the compared model to fuse the hierarchical features, and the same experimental settings are used for a fair comparison. The network equipped with the 2D/3D backbone outperforms the 2D/2D counterpart on all evaluation indexes, indicating the effectiveness of the heterogeneous design.

Dataset Backbone Q4/Q8 UIQI SAM ERGAS SCC
IKONOS 2D/2D 0.6009 0.8768 4.1365 3.7250 0.8565
2D/3D 0.7236 0.9242 3.6596 2.7353 0.9502
GaoFen-2 2D/2D 0.7970 0.9517 2.9076 3.3160 0.9160
2D/3D 0.8707 0.9741 2.1171 2.3805 0.9655
WorldView-2 2D/2D 0.6807 0.9207 4.8672 2.9484 0.9420
2D/3D 0.6982 0.9249 4.6668 2.8387 0.9476
Table 8: Effect of the 2D/3D backbone on the IKONOS, GaoFen-2, and WorldView-2 datasets. The optimal results are in red.
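The difference between the two kinds of backbone can be illustrated with a toy example. A 2D convolution treats the MS bands as input channels and sums over all of them, so the band axis disappears after the first layer, whereas a 3D convolution slides a kernel along the band axis and keeps it. The sketch below is a simplified illustration, not the DCNet layers: it assumes 1x1 spatial kernels, a single output channel, an odd spectral kernel length, and hypothetical function names.

```python
def conv2d_1x1(ms, band_weights):
    """2D convolution with a 1x1 spatial kernel; bands are input channels.

    ms is indexed as ms[band][row][col]. Every output pixel is a weighted
    sum over ALL bands, so the output has no band axis.
    """
    bands, rows, cols = len(ms), len(ms[0]), len(ms[0][0])
    return [[sum(band_weights[b] * ms[b][r][c] for b in range(bands))
             for c in range(cols)]
            for r in range(rows)]


def conv3d_spectral(ms, kernel):
    """3D convolution with a (k, 1, 1) kernel slid along the band axis.

    Zero padding keeps the number of bands unchanged, so the output is
    still indexed as out[band][row][col].
    """
    bands, rows, cols = len(ms), len(ms[0]), len(ms[0][0])
    pad = len(kernel) // 2
    out = []
    for b in range(bands):
        plane = [[0.0] * cols for _ in range(rows)]
        for k, w in enumerate(kernel):
            src = b + k - pad
            if 0 <= src < bands:  # positions outside the band range are zero
                for r in range(rows):
                    for c in range(cols):
                        plane[r][c] += w * ms[src][r][c]
        out.append(plane)
    return out


# Toy 4-band, 2x2 MS patch: band b is filled with the value b + 1.
ms = [[[float(b + 1)] * 2 for _ in range(2)] for b in range(4)]

mixed = conv2d_1x1(ms, band_weights=[0.25, 0.25, 0.25, 0.25])
kept = conv3d_spectral(ms, kernel=[0.0, 1.0, 0.0])  # identity along bands

print(len(mixed), len(mixed[0]))                 # rows, cols (no band axis)
print(len(kept), len(kept[0]), len(kept[0][0]))  # bands, rows, cols
```

In the 2D case the four bands are averaged into one map, while the 3D case still returns one map per band, which is the property the spectral channel relies on to avoid mixing bands.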

4.5.2 Effect of hierarchical feature fusion

In this experiment, we test the effectiveness of hierarchical feature fusion. We select different sets of levels for fusion and list the results in Table 9. The level set {1,2,3,4} denotes the full hierarchical fusion of DCNet, in which the S-CLSTM integrates the two channels at every level. {4} means that the S-CLSTM fuses the spatial and spectral features only at level 4, while {3,4} and {2,3,4} mean that the S-CLSTM merges the two channels at levels 3 and 4 and at levels 2, 3, and 4, respectively. The same experimental settings are used for every configuration, e.g., the Adam optimizer, the learning rate, and the number of epochs. The results show that the model with full hierarchical fusion achieves the best results on both spatial and spectral indexes.

Dataset Fusion Levels Q4/Q8 UIQI SAM ERGAS SCC
IKONOS {4} 0.1630 0.3103 15.4476 25.6835 0.6087
{3,4} 0.3105 0.6667 10.3325 12.7822 0.7482
{2,3,4} 0.6481 0.8989 3.9113 3.2781 0.8997
{1,2,3,4} 0.7236 0.9242 3.6596 2.7353 0.9502
GaoFen-2 {4} 0.3085 0.3625 23.1641 30.8456 0.6705
{3,4} 0.0862 0.5095 18.5059 18.2513 0.6677
{2,3,4} 0.8181 0.9636 2.4609 2.8779 0.9340
{1,2,3,4} 0.8707 0.9741 2.1171 2.3805 0.9655
WorldView-2 {4} 0.4827 0.1863 62.4558 32.9168 0.7715
{3,4} 0.5284 0.7866 15.4305 8.2610 0.8565
{2,3,4} 0.5536 0.8493 7.4289 6.1733 0.8603
{1,2,3,4} 0.6982 0.9249 4.6668 2.8387 0.9476
Table 9: The effect of hierarchical features. The optimal results are in red.

4.6 Effect of S-CLSTM

To evaluate the effect of the proposed S-CLSTM fusion module, we replace it with several standard fusion methods: element-wise summation (sum) fusion, element-wise maximization (max) fusion, element-wise average fusion, element-wise product fusion, and Conv fusion feichtenhofer2016convolutional. As with the S-CLSTM, for each replacement fusion operation the fused features at every level except the last are fed back into the two channels, and the fused feature at the last level is directly injected into the reconstruction network. The results are shown in Table 10. They demonstrate that the S-CLSTM fusion strategy boosts the performance of our DCNet on each dataset.

Dataset Fusion method Q4/Q8 UIQI SAM ERGAS SCC
IKONOS Sum 0.5649 0.8515 5.0278 4.3132 0.8256
Max 0.5885 0.8701 4.3609 3.8540 0.8474
Average 0.6111 0.8819 4.1098 3.6417 0.8669
Product 0.5529 0.8462 5.0558 4.2705 0.8209
Conv 0.6402 0.8957 3.9526 3.3367 0.8936
S-CLSTM 0.7236 0.9242 3.6596 2.7353 0.9503
GaoFen-2 Sum 0.7736 0.9379 3.4308 3.8346 0.8990
Max 0.7860 0.9441 3.2131 3.5801 0.9072
Average 0.8044 0.9565 2.7750 3.1697 0.9231
Product 0.8133 0.9611 2.5893 2.9790 0.9312
Conv 0.8301 0.9659 2.4278 2.7677 0.9453
S-CLSTM 0.8707 0.9741 2.1171 2.3805 0.9655
WorldView-2 Sum 0.6071 0.8981 5.9973 3.646 0.9093
Max 0.6344 0.9060 5.5230 3.3531 0.9224
Average 0.6568 0.9137 5.2045 3.1561 0.9313
Product 0.6719 0.9195 4.9014 2.9752 0.9410
Conv 0.6771 0.9216 4.7897 2.9107 0.9442
S-CLSTM 0.6982 0.9249 4.6668 2.8387 0.9476
Table 10: The results of DCNet with different fusion methods. The optimal results are in red.
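The element-wise baselines in Table 10 can be sketched as follows for a single pair of same-sized feature maps. This is an illustrative sketch on nested lists with a hypothetical function name `fuse`; Conv fusion is omitted because it relies on learned convolution weights, and the feedback of fused features into the two channels is not modeled.

```python
def fuse(spatial, spectral, mode):
    """Element-wise fusion of two same-sized 2D feature maps."""
    ops = {
        "sum": lambda a, b: a + b,
        "max": lambda a, b: max(a, b),
        "average": lambda a, b: (a + b) / 2.0,
        "product": lambda a, b: a * b,
    }
    op = ops[mode]  # raises KeyError for an unsupported mode
    return [[op(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(spatial, spectral)]


spatial = [[1.0, 4.0], [2.0, 0.5]]
spectral = [[3.0, 2.0], [2.0, 1.5]]
for mode in ("sum", "max", "average", "product"):
    print(mode, fuse(spatial, spectral, mode))
```

All four operations are parameter-free and treat every level independently, which is one way to read the gap to the S-CLSTM rows: the S-CLSTM additionally carries information across levels through its cell state.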
Figure 9: The results of models with different numbers of levels. The models with one, two, three, four, five, and six levels are shown in cyan, green, red, blue, purple, and black, respectively. Panels: (a) Q, (b) UIQI, (c) SAM, (e) SCC.

4.6.1 Impact of the number of levels

One of the most critical hyper-parameters in DCNet is the number of levels. It is well known that the nonlinearity of a CNN can be increased by increasing its depth; e.g., in dong2014learning, deep CNNs brought substantial progress to the super-resolution task. In wei2017boosting, residual learning is introduced to construct a very deep CNN that further improves performance by making full use of the high nonlinearity of deep models. Thus, the depth of our model needs to be discussed. Experimental results on the impact of the number of levels are shown in Fig. 9. The model with 4 levels achieves superior performance compared to the other four models with 1, 2, 3, and 5 levels, while the model with 6 levels yields only small gains. We therefore adopt the four-level model as a trade-off between performance and computation cost.

5 Conclusion

In this paper, we have proposed a novel DCNet for pan-sharpening. Instead of employing 2D CNNs for processing both PAN and MS images, we develop a heterogeneous dual-channel backbone with a 2D CNN and a 3D CNN for spatial and spectral information extraction, respectively. The 3D CNN in the spectral channel avoids mixing different spectral bands, which facilitates the HR-MS image reconstruction. The S-CLSTM fuses spatial and spectral information at each level via bi-directional lateral connections and exploits the multi-level correlation via the cell state. Compared with state-of-the-art methods, the proposed network demonstrates a strong ability to extract and fuse spatial and spectral information on three datasets. For future work, within the current framework, loss functions and unsupervised methods will be explored for the real-world task.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Dong Wang: Conceptualization, Methodology, Data curation, Software, Writing - Original draft preparation. Yunpeng Bai: Visualization, Investigation, Validation, Writing - Reviewing and Editing. Ying Li: Supervision, Software, Writing - Reviewing and Editing.


The work was supported in part by the National Natural Science Foundation of China (61871460), the Shaanxi Provincial Key R&D Program (2020KW-003), and the Fundamental Research Funds for the Central Universities (3102019ghxm016).


  • [1] E. L. Bullock, C. E. Woodcock, P. Olofsson, Monitoring tropical forest degradation using spectral unmixing and landsat time series analysis, Remote Sensing of Environment 238 (2020) 110968.
  • [2] Y. Gong, Z. Xiao, X. Tan, H. Sui, C. Xu, H. Duan, D. Li, Context-aware convolutional neural network for object detection in vhr remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing 58 (1) (2019) 34–44.
  • [3] B. Fang, Y. Li, H. Zhang, J. C.-W. Chan, Semi-supervised deep learning classification for hyperspectral image based on dual-strategy sample selection, Remote Sensing 10 (4) (2018) 574.
  • [4] Y. Li, H. Zhang, Q. Shen, Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network, Remote Sensing 9 (1) (2017) 67.
  • [5] B. Fang, Y. Li, H. Zhang, J. C.-W. Chan, Hyperspectral images classification based on dense convolutional networks with spectral-wise attention mechanism, Remote Sensing 11 (2) (2019) 159.
  • [6] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, X. Wang, Deep learning for pixel-level image fusion: Recent advances and future prospects, Information Fusion 42 (2018) 158–173.
  • [7] A. Garzelli, F. Nencini, L. Capobianco, Optimal mmse pan sharpening of very high resolution multispectral images, IEEE Transactions on Geoscience and Remote Sensing 46 (1) (2007) 228–236.
  • [8] B. Aiazzi, S. Baronti, M. Selva, Improving component substitution pansharpening through multivariate regression of ms pan data, IEEE Transactions on Geoscience and Remote Sensing 45 (10) (2007) 3230–3239.
  • [9] M. M. Khan, J. Chanussot, L. Condat, A. Montanvert, Indusion: Fusion of multispectral and panchromatic images using the induction scaling technique, IEEE Geoscience and Remote Sensing Letters 5 (1) (2008) 98–102.
  • [10] T. Ranchin, B. Aiazzi, L. Alparone, S. Baronti, L. Wald, Image fusion—the arsis concept and some successful implementation schemes, ISPRS Journal of Photogrammetry and Remote Sensing 58 (1-2) (2003) 4–18.
  • [11] F. Palsson, M. O. Ulfarsson, J. R. Sveinsson, Model-based reduced-rank pansharpening, IEEE Geoscience and Remote Sensing Letters 17 (4) (2019) 656–660.
  • [12] X. Fu, Z. Lin, Y. Huang, X. Ding, A variational pan-sharpening with local gradient constraints, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10265–10274.
  • [13] C. Thomas, T. Ranchin, L. Wald, J. Chanussot, Synthesis of multispectral images to high spatial resolution: A critical review of fusion methods based on remote sensing physics, IEEE Transactions on Geoscience and Remote Sensing 46 (5) (2008) 1301–1312.
  • [14] C. Dong, C. C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE transactions on pattern analysis and machine intelligence 38 (2) (2015) 295–307.
  • [15] G. Masi, D. Cozzolino, L. Verdoliva, G. Scarpa, Pansharpening by convolutional neural networks, Remote Sensing 8 (7) (2016) 594.
  • [16] G. Scarpa, S. Vitale, D. Cozzolino, Target-adaptive cnn-based pansharpening, IEEE Transactions on Geoscience and Remote Sensing 56 (9) (2018) 5443–5457.
  • [17] Y. Wei, Q. Yuan, H. Shen, L. Zhang, Boosting the accuracy of multispectral image pansharpening by learning a deep residual network, IEEE Geoscience and Remote Sensing Letters 14 (10) (2017) 1795–1799.
  • [18] J. Yang, X. Fu, Y. Hu, Y. Huang, X. Ding, J. Paisley, Pannet: A deep network architecture for pan-sharpening, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5449–5457.
  • [19] Q. Yuan, Y. Wei, X. Meng, H. Shen, L. Zhang, A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (3) (2018) 978–989.
  • [20] X. Liu, Q. Liu, Y. Wang, Remote sensing image fusion based on two-stream fusion network, Information Fusion 55 (2020) 1–15.
  • [21] Y. Zhang, C. Liu, M. Sun, Y. Ou, Pan-sharpening using an efficient bidirectional pyramid network, IEEE Transactions on Geoscience and Remote Sensing 57 (8) (2019) 5549–5563.
  • [22] Z. Shao, J. Cai, Remote sensing image fusion with deep convolutional neural network, IEEE journal of selected topics in applied earth observations and remote sensing 11 (5) (2018) 1656–1669.
  • [23] A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: 2013 IEEE international conference on acoustics, speech and signal processing, IEEE, 2013, pp. 6645–6649.
  • [24] Z. Bai, Y. Li, M. Woźniak, M. Zhou, D. Li, Decomvqanet: Decomposing visual question answering deep network via tensor decomposition and regression, Pattern Recognition (2020) 107538.
  • [25] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional lstm network: A machine learning approach for precipitation nowcasting, in: Advances in neural information processing systems, 2015, pp. 802–810.
  • [26] H. Song, W. Wang, S. Zhao, J. Shen, K.-M. Lam, Pyramid dilated deeper convlstm for video salient object detection, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 715–731.
  • [27] J. Liu, A. Shahroudy, D. Xu, G. Wang, Spatio-temporal lstm with trust gates for 3d human action recognition, in: European conference on computer vision, Springer, 2016, pp. 816–833.
  • [28] L. Jiang, M. Xu, Z. Wang, Predicting video saliency with object-to-motion cnn and two-layer convolutional lstm, arXiv preprint arXiv:1709.06316 (2017).
  • [29] W. Huang, L. Xiao, Z. Wei, H. Liu, S. Tang, A new pan-sharpening method with deep neural networks, IEEE Geoscience and Remote Sensing Letters 12 (5) (2015) 1037–1041.
  • [30] L. Wald, T. Ranchin, M. Mangolini, Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images, American Society for Photogrammetry and Remote Sensing 63 (6) (1997) 691–699.
  • [31] N. Ketkar, Introduction to pytorch, in: Deep learning with python, Springer, 2017, pp. 195–208.
  • [32] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
  • [33] Y. Zeng, W. Huang, M. Liu, H. Zhang, B. Zou, Fusion of satellite images in urban area: Assessing the quality of resulting images, in: 2010 18th International Conference on Geoinformatics, IEEE, 2010, pp. 1–4.
  • [34] Z. Wang, A. C. Bovik, A universal image quality index, IEEE signal processing letters 9 (3) (2002) 81–84.
  • [35] P. E. Dennison, K. Q. Halligan, D. A. Roberts, A comparison of error metrics and constraints for multiple endmember spectral mixture analysis and spectral angle mapper, Remote Sensing of Environment 93 (3) (2004) 359–367.
  • [36] E. Ayhan, G. Atay, Spectral and spatial quality analysis in pan sharpening process, Journal of the Indian Society of Remote Sensing 40 (3) (2012) 379–388.
  • [37] J. Zhou, D. Civco, J. Silander, A wavelet transform method to merge landsat tm and spot panchromatic data, International journal of remote sensing 19 (4) (1998) 743–757.
  • [38] L. Alparone, B. Aiazzi, S. Baronti, A. Garzelli, F. Nencini, M. Selva, Multispectral and panchromatic data fusion assessment without reference, Photogrammetric Engineering & Remote Sensing 74 (2) (2008) 193–200.
  • [39] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1933–1941.
  • [40] C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: European conference on computer vision, Springer, 2014, pp. 184–199.