Modeling the relationship between pixels, between patches, and even between images is a fundamental problem in computer vision, with applications ranging from single image super-resolution and image restoration to image fusion. Convolution is an effective tool for this problem. By performing convolution operations on an image, important features can be extracted, and by stacking convolutional layers, the learning ability of the network is strengthened so that an end-to-end mapping can be realized. However, for refined pixel-wise tasks, standard convolution is often unable to focus on each pixel of the image, and accuracy can only be improved by deepening the network, which results in a cumbersome model.
This paper focuses on image fusion, which aims to improve the spatial resolution and geometric accuracy of images through appropriate fusion strategies. The task has a wide range of vital applications, including the following two: 1) Remote sensing image pansharpening, which fuses a low-resolution multispectral image (LR-MSI) and a high-resolution panchromatic image (HR-PANI), compensates for the deficiencies of any single kind of remote sensing data and broadens the applicability of remote sensing images. 2) Hyperspectral image super-resolution (HISR), which fuses a low-resolution hyperspectral image (LR-HSI) and a high-resolution multispectral image (HR-MSI), takes advantage of both types of images to obtain a high-resolution hyperspectral image (HR-HSI). In this work, we mainly address these two fusion tasks.
For both pansharpening and HISR, the difficulty lies in balancing spatial resolution and spectral resolution. An ideal solution attends to pixel-level information and builds a feature representation that is unique to each pixel, while ensuring that the global information is not distorted. Methods based on convolutional neural networks (CNNs) have undeniably advanced image fusion. However, most current methods focus on changing the network structure and depth, and their performance depends mainly on the number of network parameters.
In addition, the existing CNN-based methods for image fusion have several fundamental problems. Firstly, low-frequency components make up most of an image, whereas the high-frequency components that represent texture account for a much smaller proportion. Therefore, to minimize the loss, the network concentrates on the majority low-frequency patches during learning. In other words, the convolution kernels update their parameters in the direction that favors the super-resolution of the low-frequency patches. The features of the high-frequency patches are consequently ignored, which leads to undesirable smoothness in the fusion results, even though the evaluation of image quality is highly dependent on these high-frequency components. Secondly, in standard convolution, the bias is updated and optimized based on the average value of the pixels in the feature map, and the bias of a given feature map is fixed across different samples, which restricts the flexibility of the network. Therefore, for fusion tasks, the local characteristics are exactly what needs attention, and the conventional bias can also be improved. The key to these tasks is the mutual coordination and supplementation of high-frequency information in the spatial and spectral dimensions, together with the repair of local details.
In this paper, we propose a novel local adaptive convolution (LAConv), in which each kernel is scaled by an adaptive weight generated from the local patch. In addition, a dynamic bias (DYB), which is generated from the whole input feature map, is adopted to provide global information. To a certain extent, DYB can be seen as a novel channel attention mechanism. By combining the weighted convolution kernel with DYB, LAConv becomes more flexible and gains a more powerful capability for local feature representation. Furthermore, we embed LAConv into a residual structure to construct a simple network that achieves satisfactory results at a small computational cost. Extensive experiments indicate the effectiveness and efficiency of LAConv in image fusion tasks.
In brief, the contributions of this work are as follows:
We propose a new local adaptive convolution (LAConv), which generates an adaptive kernel based on each pixel and its neighbors. LAConv not only inherits all the advantages of standard convolution but also enhances the ability to focus on local features.
A dynamic bias (DYB) strategy is introduced in LAConv. Compared with the conventional bias in standard convolution, the DYB supplements the local features with global information, thus mitigating the subtle distortion caused by spatial discontinuities and making the network more flexible.
A simple residual network based on LAConv (LAResNet) is designed and applied to two image fusion tasks. Experiments demonstrate that, benefiting from the low-cost and easy-to-implement LAConv, LAResNet outperforms the state-of-the-art methods even though it has neither deep layers nor a huge number of parameters.
2 Related Works
In this section, we first present the distinction between our work and other dynamic convolution methods for other applications and then introduce a series of previous advanced works on pansharpening and HISR. Finally, our motivation is stated.
2.1 Dynamic Convolution
To improve model performance, existing standard convolutional neural networks mainly increase the number of parameters, the depth, and the number of channels, which makes the model overly complex. Many researchers have become aware of this bottleneck. In [condconv], Yang et al. proposed conditionally parameterized convolutions (CondConv), which break the static nature of traditional convolution by computing the convolution kernel parameters from the input sample while remaining efficient at inference. Unlike our LAConv, the kernel dynamically generated by CondConv applies to the entire sample, not to specific regions or pixels. Another notable work is dynamic convolution (DYConv), proposed in [dynet], which aggregates multiple convolution kernels according to input-dependent attention. Compared with standard convolution, it significantly improves the representation ability of the network but considerably increases the number of parameters. In addition, it still focuses on the entire sample rather than on specific regions or pixels. Recently, dynamic region-aware convolution (DRConv) was proposed in [drconv]; it automatically allocates multiple filters to spatial regions with similar characteristics and achieves satisfactory performance on many tasks (e.g., classification, segmentation, and face recognition). Like DRConv, the proposed LAConv is translation-invariant; in contrast, LAConv only assigns weights to the convolution kernel and thus retains the core of standard convolution more completely. Another difference is that DRConv needs an extra step to partition the input feature map into regions, which is redundant in the fusion task.
Beyond the differences above, an important point is that all three of these dynamic convolutions include a global average pooling (GAP) operation. We hold that GAP causes information distortion in the spatial dimension and is therefore not suitable for image fusion tasks.
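As a small illustration of the concern about GAP, the following NumPy sketch shows that two feature maps with different spatial layouts collapse to the same pooled value. The two toy feature maps are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Two hypothetical 4x4 single-channel feature maps holding the same values
# in different places: a bright left half versus a bright top half.
left_edge = np.zeros((4, 4))
left_edge[:, :2] = 1.0
top_edge = left_edge.T.copy()

# Global average pooling reduces each map to a single number per channel.
gap_left = left_edge.mean()
gap_top = top_edge.mean()

print(gap_left, gap_top)                      # both are 0.5
print(np.array_equal(left_edge, top_edge))    # False
```

Any kernel that is computed only from the pooled value cannot tell these two layouts apart, which is the spatial information loss referred to above.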
Pansharpening is a challenging task in the field of remote sensing. The existing methods can be divided into traditional methods and data-driven deep learning (DL) methods. Some classic traditional methods are the smoothing filter-based intensity modulation (SFIM) [sfim], the generalized Laplacian pyramid (GLP) [glp] with MTF-matched filter [mtf] and regression-based injection model (GLP-CBD) [glp_cbd], and the band-dependent spatial-detail with local parameter estimation (BDSD) [bdsd].
Numerous CNN-based DL methods have emerged recently, pushing pansharpening into a new era and alleviating the distortions present to varying degrees in traditional methods. Typical works are PanNet [pannet], DiCNN1 [dicnn], DMDNet [dmdnet], and FusionNet [fusionnet]. What they have in common is that the same convolution kernel and a conventional bias are used for feature extraction at every position, which limits the learning capability of the network.
Similarly, previous work on HISR can be classified into traditional methods and DL methods. State-of-the-art traditional methods include FUSE [FUSE], the coupled sparse tensor factorization (CSTF) [CSTF] method, and the CNN denoiser-based method (CNN-FUS) [CNN-FUS]. Advanced DL methods include SSRNet [SSRNET], ResTFNet [ResTFnet], and MHFNet [MHFNet]. They all show exceptional performance, but HISR also requires the characteristics of each pixel to be modeled separately in order to achieve better super-resolution.
The essence of the image fusion problem lies in the mutual complementation of information and the enhancement of spatial and spectral resolution. After in-depth consideration, we hold that a uniform convolution kernel and a conventional bias are not adequate for these pixel-to-pixel tasks. Specifically, the key high-frequency information cannot be given special treatment and may even be ignored during learning, leading to undesirable smoothness in the results of super-resolution tasks. Moreover, adding a fixed value to the entire feature map is probably of little use in this task. To address these issues, we propose LAConv and DYB, which are introduced in detail in the following section.
In this section, we first detail the procedure of LAConv and DYB, and then describe the structure of the proposed LAResNet.
3.1 Local Adaptive Convolution
Image fusion needs to determine the value of each pixel accurately. Specifically, the restoration and reconstruction of a pixel depend mainly on its neighbors and only weakly on pixels far away from it. To fully explore the local information of a pixel, we change the design of the convolution kernel: while retaining the original convolution kernel, we update its state for each pixel. The specific operation is described below.
Standard Convolution. Firstly, let us review the standard convolution. Consider a standard convolution without bias operating on a pixel located at spatial coordinates $(i, j)$. Its local patch is defined as $X_{ij} \in \mathbb{R}^{C_{in} \times k \times k}$, where $C_{in}$ and $k \times k$ denote the number of channels of the input feature map and the patch size, respectively. During the standard convolution operation, all the local patches of the input feature map use the same kernels $K \in \mathbb{R}^{C_{out} \times C_{in} \times k \times k}$. Thus the operation can be expressed as follows:
$$Y_{ij} = X_{ij} \otimes K,$$
where $K$ can be viewed as $C_{out}$ convolution kernels of size $C_{in} \times k \times k$ on one layer, $\otimes$ represents the convolution operation in a conventional CNN (which can also be viewed as a matrix multiplication (MatMul)), and $Y_{ij} \in \mathbb{R}^{C_{out}}$ is the result of the convolution, where $C_{out}$ denotes the number of channels of the output feature map.
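The per-pixel view of standard convolution described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the sizes, the zero padding at the borders, and the helper names `patch` and `conv_at` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, k = 3, 4, 3      # illustrative sizes, not the paper's settings
H, W = 5, 5

x = rng.standard_normal((c_in, H, W))          # input feature map
K = rng.standard_normal((c_out, c_in, k, k))   # kernels shared by all pixels


def patch(x, i, j, k):
    """Return the k x k local patch centered at (i, j), zero-padded at borders."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    return xp[:, i:i + k, j:j + k]


def conv_at(x, K, i, j):
    """Standard convolution at one pixel, written as a matrix product."""
    x_ij = patch(x, i, j, K.shape[-1]).reshape(-1)   # (c_in * k * k,)
    K_mat = K.reshape(K.shape[0], -1)                # (c_out, c_in * k * k)
    return K_mat @ x_ij                              # (c_out,)


# Every pixel reuses the same K; only the patch changes.
y = np.array([[conv_at(x, K, i, j) for j in range(W)] for i in range(H)])
print(y.shape)  # (5, 5, 4)
```

The point of the sketch is that `K` never depends on `(i, j)`, which is exactly the property that a locally adaptive kernel relaxes.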
LAConv. Different from the standard convolution, the kernel in our LAConv is adjusted automatically depending on the local patch. Let $\tilde{K}_{ij}$ denote the kernel used to perform the convolution on the local patch $X_{ij}$; the proposed LAConv can then be expressed as follows:
$$Y_{ij} = X_{ij} \otimes \tilde{K}_{ij}.$$
In particular, the generation of $\tilde{K}_{ij}$ contains the following three steps, as shown in Fig. 2. Firstly, $X_{ij}$ is sent to a convolutional layer with ReLU activation to yield its shallow feature. Secondly, the shallow feature is sent to fully connected (FC) layers with ReLU and sigmoid activations, which learn a weight $W_{ij}$ that can perceive the potential relationship between the central pixel and its neighbors. Thirdly, $W_{ij}$ is reshaped to $k \times k$ and used as the scaling factor for every kernel in $K$. The scaled kernel $\tilde{K}_{ij}$ is generated as follows:
$$\tilde{K}_{ij} = \hat{W}_{ij} \odot K,$$
where $\odot$ represents the pixel-wise multiplication and $\hat{W}_{ij}$ is $W_{ij}$ duplicated to match the size of $K$. For more details, please refer to the top part of Fig. 2.
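The three steps above can be sketched for a single pixel in NumPy. This is a hedged illustration rather than the authors' code: the layer widths (`hidden`), the random initial weights, and the function names `local_weight` and `laconv_at` are all assumptions, and the weight branch is reduced to one convolution over the patch followed by two FC layers.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, k = 3, 4, 3          # illustrative sizes
hidden = k * k                    # assumed width of the weight branch

K = rng.standard_normal((c_out, c_in, k, k))                # shared kernels
conv_w = 0.1 * rng.standard_normal((hidden, c_in, k, k))    # conv on the patch
fc1 = 0.1 * rng.standard_normal((hidden, hidden))
fc2 = 0.1 * rng.standard_normal((k * k, hidden))


def relu(v):
    return np.maximum(v, 0.0)


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def local_weight(x_ij):
    """Map a (c_in, k, k) patch to a k x k weight with entries in (0, 1)."""
    feat = relu(np.einsum('fcij,cij->f', conv_w, x_ij))   # shallow feature
    w = sigmoid(fc2 @ relu(fc1 @ feat))                   # FC + ReLU, FC + sigmoid
    return w.reshape(k, k)


def laconv_at(x_ij):
    """LAConv response at one pixel: scale every kernel in K, then convolve."""
    w_ij = local_weight(x_ij)
    K_ij = w_ij[None, None, :, :] * K      # duplicate over c_out and c_in
    return np.einsum('ocij,cij->o', K_ij, x_ij)


x_a = rng.standard_normal((c_in, k, k))
x_b = rng.standard_normal((c_in, k, k))
print(local_weight(x_a).round(3))
print(laconv_at(x_a).shape)  # (4,)
```

Because the same k x k weight scales every kernel, applying the scaled kernel to a patch gives the same result as applying the shared kernel to the patch multiplied by that weight, which is one way to implement the operation without materializing a separate kernel per pixel.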
3.2 Dynamic Bias
In this section, we design the DYB for our LAConv. Different from the conventional bias, the DYB is generated from the global input feature, which is denoted as $X \in \mathbb{R}^{C_{in} \times H \times W}$. The operation of LAConv with the DYB can be expressed as follows:
$$Y_{ij} = X_{ij} \otimes \tilde{K}_{ij} + D,$$
where $\tilde{K}_{ij}$ is the scaled kernel at pixel $(i, j)$ and $D \in \mathbb{R}^{C_{out}}$ is defined as the DYB, which is generated by the following two steps. Firstly, the input feature $X$ passes through a global average pooling (GAP) layer to obtain $\bar{X} \in \mathbb{R}^{C_{in}}$. Secondly, $\bar{X}$ is sent to the FC layers with ReLU activations, and the output is $D$. For more details, please refer to the bottom part of Fig. 2.
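The two steps that generate the DYB can be sketched as follows. This is an assumption-laden illustration: the FC width `hidden`, the random weights, the choice to apply ReLU only after the first FC layer, and the names `dynamic_bias` and `add_bias` are not specified by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, H, W = 3, 4, 8, 8     # illustrative sizes
hidden = 8                         # assumed FC width

fc1 = rng.standard_normal((hidden, c_in))
fc2 = rng.standard_normal((c_out, hidden))


def dynamic_bias(x):
    """Compute one bias value per output channel from the whole feature map."""
    pooled = x.mean(axis=(1, 2))                 # GAP: (c_in, H, W) -> (c_in,)
    return fc2 @ np.maximum(fc1 @ pooled, 0.0)   # FC + ReLU, then FC -> (c_out,)


def add_bias(y, d):
    """Add the same per-channel bias d to every spatial position of y."""
    return y + d[:, None, None]


x1 = rng.standard_normal((c_in, H, W))
x2 = rng.standard_normal((c_in, H, W))
y1 = np.zeros((c_out, H, W))                     # stand-in for the LAConv output
out = add_bias(y1, dynamic_bias(x1))
print(out.shape)  # (4, 8, 8)
```

Unlike a conventional bias, which is a learned constant, the value returned by `dynamic_bias` changes with the input sample while still being shared by all pixels of that sample.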
3.3 Proposed LAResNet
In this work, we mainly perform two different image fusion tasks, namely pansharpening and HISR, whose source data are different. For convenience of explanation, we uniformly denote the LR-MSI in pansharpening and the LR-HSI in HISR as $\mathrm{LR}$, and the HR-PANI in pansharpening and the HR-MSI in HISR as $\mathrm{HR}$. We aim to develop a simple and efficient image fusion network that takes an upsampled $\mathrm{LR}$ (denoted as $\mathrm{LR}^{U}$) and an $\mathrm{HR}$ as input and outputs a fused image $\mathrm{SR}$. Fig. 3 shows the detailed architecture of the proposed LAResNet.
Before introducing the architecture, it is necessary to describe the key component of the network, called LAResBlock. LAResBlock is exactly the same as the original ResBlock [resnet], except that the standard convolution in ResBlock is replaced by the proposed LAConv. In what follows, we introduce the overall architecture of LAResNet. As shown in Fig. 3, the proposed network has three stages. The first stage contains a LAConv layer and an activation layer, the second stage consists of several stacked LAResBlocks, and the last stage is again a LAConv layer. Specifically, the high-resolution input $\mathrm{HR}$ and the upsampled low-resolution input $\mathrm{LR}^{U}$ are first concatenated to obtain a feature map $M$, which contains the information of the two input images. After that, $M$ passes through the network stage by stage. Finally, the output of the network is added to $\mathrm{LR}^{U}$ to give the final $\mathrm{SR}$ image. The whole process can be expressed by the following equation:
$$\mathrm{SR} = \mathcal{F}_{\theta}(\mathrm{LR}^{U}, \mathrm{HR}) + \mathrm{LR}^{U},$$
where $\mathcal{F}_{\theta}$ represents the mapping function with its parameters $\theta$, which are updated to minimize the distance between $\mathrm{SR}$ and the ground-truth ($\mathrm{GT}$) image. Here we simply choose the mean square error (MSE) loss, which can be expressed as follows:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left\| \mathcal{F}_{\theta}(\mathrm{LR}^{U}_{n}, \mathrm{HR}_{n}) + \mathrm{LR}^{U}_{n} - \mathrm{GT}_{n} \right\|_{F}^{2},$$
where $N$ is the number of training examples and $\|\cdot\|_{F}$ is the Frobenius norm.
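The residual fusion and the MSE loss can be sketched in NumPy as below. The `mapping` argument is a hypothetical stand-in for the trained network, and the toy shapes and function names (`fuse`, `mse_loss`) are assumptions for illustration only.

```python
import numpy as np


def fuse(lr_up, hr, mapping):
    """Residual fusion: concatenate the inputs, map them, add the upsampled LR back."""
    m = np.concatenate([hr, lr_up], axis=0)   # feature map M, channels first
    return mapping(m) + lr_up


def mse_loss(pred, gt):
    """Mean over the batch of the squared Frobenius norm of the error."""
    return np.sum((pred - gt) ** 2) / pred.shape[0]


# Toy pansharpening-like shapes: 1-band HR, 8-band upsampled LR.
hr = np.ones((1, 64, 64))
lr_up = np.full((8, 64, 64), 0.5)

# Hypothetical stand-in for the trained network: predicts a zero residual.
zero_mapping = lambda m: np.zeros((8,) + m.shape[1:])

sr = fuse(lr_up, hr, zero_mapping)
loss = mse_loss(sr[None], lr_up[None])        # batch of one sample
print(sr.shape, loss)  # (8, 64, 64) 0.0
```

With a zero residual the output equals the upsampled low-resolution input, which shows that the network only has to learn the missing detail rather than the whole image.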
In this section, we separately discuss the quantitative and qualitative results of the experiments for the two tasks, i.e., pansharpening and HISR.
| Method | SAM | ERGAS | SCC | Q8 |
|---|---|---|---|---|
| SFIM [sfim] | 5.452 ± 1.903 | 4.690 ± 6.574 | 0.866 ± 0.067 | 0.798 ± 0.122 |
| GLP-CBD [glp_cbd] | 5.286 ± 1.958 | 4.163 ± 1.775 | 0.890 ± 0.070 | 0.854 ± 0.114 |
| BDSD [bdsd] | 7.000 ± 2.853 | 5.167 ± 2.248 | 0.871 ± 0.080 | 0.813 ± 0.123 |
| PanNet [pannet] | 4.092 ± 1.273 | 2.952 ± 0.978 | 0.949 ± 0.046 | 0.894 ± 0.117 |
| DiCNN1 [dicnn] | 3.981 ± 1.318 | 2.737 ± 1.016 | 0.952 ± 0.047 | 0.910 ± 0.112 |
| DMDNet [dmdnet] | 3.971 ± 1.248 | 2.857 ± 0.966 | 0.953 ± 0.045 | 0.913 ± 0.115 |
| FusionNet [fusionnet] | 3.744 ± 1.226 | 2.568 ± 0.944 | 0.958 ± 0.045 | 0.914 ± 0.112 |
| LAResNet | 3.473 ± 1.197 | 2.338 ± 0.911 | 0.965 ± 0.043 | 0.923 ± 0.114 |
4.1 Results for Pansharpening
In this section, we first introduce the training implementation, then describe the datasets and evaluation indicators, and finally present our pansharpening results.
4.1.1 Training Details and Parameters
The models are implemented with PyTorch. For the parameters of LAResNet, the number of LAResBlocks is set to 5 (i.e.,), while the channels of the LAConv and the kernel size are 32 and (i.e., ), respectively. Besides, we set 1000 epochs for the network training, while the learning rate in the first 500 epochs and in the last 500 epochs. The FC layers used in the LAConv consist of two dense layers with neurons, and the FC layers in the DYB consist of two dense layers with neurons. The Adam optimizer is used for training with a batch size of 32, while $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively.
| Method | SAM | ERGAS | SCC | Q4 |
|---|---|---|---|---|
| SFIM [sfim] | 7.718 ± 1.872 | 8.778 ± 2.380 | 0.832 ± 0.105 | 0.767 ± 0.119 |
| GLP-CBD [glp_cbd] | 7.398 ± 1.783 | 7.297 ± 0.932 | 0.854 ± 0.064 | 0.819 ± 0.128 |
| BDSD [bdsd] | 7.671 ± 1.911 | 7.466 ± 0.991 | 0.851 ± 0.062 | 0.813 ± 0.136 |
| PanNet [pannet] | 5.314 ± 1.018 | 5.162 ± 0.681 | 0.930 ± 0.059 | 0.883 ± 0.140 |
| DiCNN1 [dicnn] | 5.307 ± 0.996 | 5.231 ± 0.541 | 0.922 ± 0.051 | 0.882 ± 0.143 |
| DMDNet [dmdnet] | 5.120 ± 0.940 | 4.738 ± 0.649 | 0.935 ± 0.065 | 0.891 ± 0.146 |
| FusionNet [fusionnet] | 4.540 ± 0.779 | 4.051 ± 0.267 | 0.955 ± 0.046 | 0.910 ± 0.136 |
| LAResNet | 4.378 ± 0.727 | 3.740 ± 0.298 | 0.959 ± 0.047 | 0.916 ± 0.134 |
4.1.2 Datasets and Evaluation Metrics
To benchmark the effectiveness of LAResNet for pansharpening, we adopt a wide range of datasets, including 8-band datasets captured by WorldView-3 (WV3) and 4-band datasets captured by the GaoFen-2 (GF2) and QuickBird (QB) satellites. As ground truth (GT) images are not available, Wald’s protocol [exp4] is applied to generate the reference images. All the source data can be downloaded from the public website. For WV3 data, we obtain 12580 HR-PANI/LR-MSI/GT image pairs (// as training/validation/testing dataset) with sizes of 64×64×1, 16×16×8, and 64×64×8, respectively. For GF2 data, we use 10000 PAN/MS/GT image pairs (// as training/validation/testing dataset) with sizes of 64×64×1, 16×16×4, and 64×64×4, respectively. For QB data, 20000 PAN/MS/GT image pairs (// as training/validation/testing dataset) with sizes of 64×64×1, 16×16×4, and 64×64×4 are adopted.
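The way Wald's protocol produces reduced-resolution training triplets with the sizes listed above can be sketched as follows. This is a simplified illustration: the box-filter `downsample` is a stand-in for the sensor-matched low-pass filter usually used, the ratio of 4 is inferred from the 64-to-16 patch sizes, and the function names are hypothetical.

```python
import numpy as np


def downsample(img, ratio):
    """Average-pool a (C, H, W) array by `ratio` in both spatial dimensions.

    This box filter is a simple stand-in for the sensor-matched low-pass
    filter and decimation normally used in Wald's protocol.
    """
    c, h, w = img.shape
    return img.reshape(c, h // ratio, ratio, w // ratio, ratio).mean(axis=(2, 4))


def wald_triplet(ms, pan, ratio=4):
    """Build one reduced-resolution (PAN, LR-MS, GT) training triplet."""
    gt = ms                          # the original MS image becomes the reference
    lr_ms = downsample(ms, ratio)    # degraded MS input
    lr_pan = downsample(pan, ratio)  # degraded PAN, now at the GT resolution
    return lr_pan, lr_ms, gt


rng = np.random.default_rng(0)
ms = rng.random((8, 64, 64))         # toy 8-band MS patch
pan = rng.random((1, 256, 256))      # toy PAN patch at 4x the MS resolution
lr_pan, lr_ms, gt = wald_triplet(ms, pan)
print(lr_pan.shape, lr_ms.shape, gt.shape)  # (1, 64, 64) (8, 16, 16) (8, 64, 64)
```

The arrays here are channels-first, so the printed shapes correspond to the 64×64×1, 16×16×8, and 64×64×8 sizes quoted for WV3.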
The quality evaluation is conducted at both reduced and full resolution. For the reduced resolution test, the spectral angle mapper (SAM) [sam], the relative dimensionless global error in synthesis (ERGAS) [ergas], the spatial correlation coefficient (SCC) [SCC], and the quality index for 4-band images (Q4) [q2n] and 8-band images (Q8) [q2n] are used to assess the quality of the results. In addition, to assess the performance of all involved methods at full resolution, the QNR [QNR], the $D_{\lambda}$ [QNR], and the $D_{s}$ [QNR] indexes are applied.
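For readers unfamiliar with the two spectral metrics, the following NumPy sketch computes SAM and ERGAS using their commonly cited definitions. It is not the evaluation code used in the paper; the function names, the `eps` guard, and the default ratio of 4 are assumptions.

```python
import numpy as np


def sam_degrees(pred, gt, eps=1e-12):
    """Mean spectral angle in degrees between (C, H, W) images, per pixel."""
    dot = (pred * gt).sum(axis=0)
    norms = np.linalg.norm(pred, axis=0) * np.linalg.norm(gt, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


def ergas(pred, gt, ratio=4):
    """ERGAS: band-wise RMSE relative to the band mean, scaled by 100 / ratio."""
    mse = ((pred - gt) ** 2).mean(axis=(1, 2))
    mean_sq = gt.mean(axis=(1, 2)) ** 2
    return float(100.0 / ratio * np.sqrt((mse / mean_sq).mean()))


gt = np.full((3, 4, 4), 2.0)
print(ergas(gt + 1.0, gt))        # 12.5
print(sam_degrees(2.0 * gt, gt))  # close to 0: SAM ignores brightness scaling
```

Lower values are better for both metrics. SAM measures only the direction of each spectral vector, which is why it is paired with ERGAS, which is sensitive to magnitude errors.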
4.1.3 Comparison with State-of-the-art
| Method | SAM | ERGAS | SCC | Q4 |
|---|---|---|---|---|
| SFIM [sfim] | 2.297 ± 0.638 | 2.189 ± 0.695 | 0.861 ± 0.054 | 0.865 ± 0.040 |
| GLP-CBD [glp_cbd] | 2.274 ± 0.733 | 2.046 ± 0.620 | 0.873 ± 0.053 | 0.877 ± 0.041 |
| BDSD [bdsd] | 2.307 ± 0.670 | 2.070 ± 0.610 | 0.877 ± 0.052 | 0.876 ± 0.042 |
| PanNet [pannet] | 1.400 ± 0.326 | 1.224 ± 0.283 | 0.956 ± 0.012 | 0.947 ± 0.022 |
| DiCNN1 [dicnn] | 1.495 ± 0.381 | 1.320 ± 0.354 | 0.946 ± 0.022 | 0.945 ± 0.021 |
| DMDNet [dmdnet] | 1.297 ± 0.316 | 1.128 ± 0.267 | 0.964 ± 0.010 | 0.953 ± 0.022 |
| FusionNet [fusionnet] | 1.180 ± 0.271 | 1.002 ± 0.227 | 0.971 ± 0.007 | 0.963 ± 0.017 |
| LAResNet | 1.085 ± 0.238 | 0.912 ± 0.206 | 0.977 ± 0.006 | 0.970 ± 0.016 |
In this section, we compare the results obtained on various datasets by our LAResNet and by several competitive methods (both traditional and DL-based), which were introduced in Sec. 2.2.
Evaluation on 8-band reduced resolution dataset. We compare the quantitative performance of the proposed method with recent state-of-the-art pansharpening methods on the 1258 WV3 testing samples. The results of the compared methods and LAResNet are reported in Tab. 1. It can be observed that LAResNet achieves the best performance. We also compare the approaches on the Rio dataset (WV3); the visual results are shown in Fig. 4. It can be seen that our result is the closest to the GT image.
Evaluation on 8-band full resolution dataset. We further perform a full resolution test on the WV3 dataset with 50 pairs. The quantitative results are reported in Tab. 4, and the visual results are shown in Fig. 5. Again, our method surpasses the other methods both in visual comparison and in quantitative indicators.
Evaluation on 4-band reduced resolution dataset. To demonstrate the wide applicability of LAResNet, we also conduct experiments on the 4-band GF2 and QB datasets. The comparison of quantitative indicators is shown in Tab. 2 and Tab. 3, which indicates that our method produces the best results on both the GF2 and QB data.
4.2 Results for HISR
4.2.1 Training Details and Parameters
We train the network for 550 epochs under the PyTorch framework, and the learning rate is fixed as during the training process. For the parameters of LAResNet, the number of LAResBlocks is set to 3 (i.e., ), while the number of channels of the LAConv is set to 64. The rest of the settings and parameters are the same as those in Sec. 4.1.1.
| Method | QNR | $D_{\lambda}$ | $D_{s}$ |
|---|---|---|---|
| SFIM [sfim] | 0.9282 ± 0.0512 | 0.0254 ± 0.0287 | 0.0485 ± 0.0283 |
| GLP-CBD [glp_cbd] | 0.9113 ± 0.0671 | 0.0331 ± 0.0338 | 0.0590 ± 0.0432 |
| BDSD [bdsd] | 0.9300 ± 0.0491 | 0.0177 ± 0.0130 | 0.0537 ± 0.0404 |
| PanNet [pannet] | 0.9521 ± 0.0219 | 0.0260 ± 0.0114 | 0.0226 ± 0.0123 |
| DiCNN1 [dicnn] | 0.9436 ± 0.0458 | 0.0185 ± 0.0210 | 0.0392 ± 0.0299 |
| DMDNet [dmdnet] | 0.9554 ± 0.0200 | 0.0215 ± 0.0099 | 0.0237 ± 0.0118 |
| FusionNet [fusionnet] | 0.9556 ± 0.0316 | 0.0198 ± 0.0168 | 0.0254 ± 0.0183 |
| LAResNet | 0.9637 ± 0.0119 | 0.0147 ± 0.0077 | 0.0220 ± 0.0064 |
| Method | PSNR | SAM | ERGAS | SSIM |
|---|---|---|---|---|
| FUSE [FUSE] | 39.72 ± 3.52 | 5.83 ± 2.02 | 4.18 ± 3.08 | 0.975 ± 0.018 |
| GLP-HS [GLP-HS] | 37.81 ± 3.06 | 5.36 ± 1.78 | 4.66 ± 2.71 | 0.972 ± 0.015 |
| CSTF [CSTF] | 42.14 ± 3.04 | 9.92 ± 4.11 | 3.08 ± 1.56 | 0.964 ± 0.027 |
| CNN-FUS [CNN-FUS] | 42.66 ± 3.46 | 6.44 ± 2.31 | 2.95 ± 2.24 | 0.982 ± 0.007 |
| SSRNet [SSRNET] | 45.28 ± 3.13 | 4.72 ± 1.76 | 2.06 ± 1.30 | 0.990 ± 0.004 |
| ResTFNet [ResTFnet] | 45.35 ± 3.68 | 3.76 ± 1.31 | 1.98 ± 1.62 | 0.993 ± 0.003 |
| MHFNet [MHFNet] | 46.32 ± 2.76 | 4.33 ± 1.48 | 1.74 ± 1.44 | 0.992 ± 0.006 |
| LAResNet | 47.68 ± 3.37 | 3.07 ± 0.97 | 1.49 ± 0.96 | 0.995 ± 0.002 |
| Setting | SAM | ERGAS | SCC | Q8 |
|---|---|---|---|---|
| SC + NB | 4.2564 | 3.1026 | 0.9628 | 0.9511 |
| SC + CB | 4.3483 | 3.1302 | 0.9614 | 0.9511 |
| SC + DYB | 4.2267 | 3.0575 | 0.9637 | 0.9524 |
| LAC + NB | 4.0354 | 2.9495 | 0.9676 | 0.9571 |
| LAC + CB | 4.0264 | 2.9129 | 0.9684 | 0.9568 |
| LAC + DYB | 3.9740 | 2.9010 | 0.9692 | 0.9584 |
4.2.2 Datasets and Evaluation Metrics
In this work, we mainly adopt the CAVE dataset [cavedata] for training the network. We simulate a total of 3920 HR-MSI/LR-HSI/GT image pairs (/ as training/testing dataset) with sizes of 64×64×3, 16×16×31, and 64×64×31, respectively. The data generation process contains the following three steps: 1) Crop 3920 overlapping patches from the original CAVE dataset as GT, also called HR-HSI; 2) Apply a Gaussian blur with a kernel size of 3×3 and a standard deviation of 0.5 to the HR-HSI patches, and then downsample the blurred patches to generate the LR-HSI patches; 3) Use the spectral response function of the Nikon D700 camera [nikon1, CNN-FUS, MHFNet, nikon2] to generate the RGB patches. Besides, to evaluate the performance of HISR, we adopt the following indicators: SAM, ERGAS, the peak signal-to-noise ratio (PSNR), and the structure similarity (SSIM) [SSIM].
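Steps 2 and 3 of the data generation can be sketched in NumPy as below. This is a hedged illustration only: the spectral response matrix `srf` is a random placeholder rather than the Nikon D700 response, the downsampling factor of 4 is inferred from the 64-to-16 patch sizes, decimation by striding and edge padding are assumptions, and all function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(size=3, sigma=0.5):
    """Normalized 2-D Gaussian kernel (3x3 with std 0.5 by default)."""
    ax = np.arange(size) - (size - 1) / 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2 = np.outer(g, g)
    return k2 / k2.sum()


def blur(img, kernel):
    """Blur each band of a (C, H, W) image, replicating edge pixels."""
    p = kernel.shape[0] // 2
    padded = np.pad(img, ((0, 0), (p, p), (p, p)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            out += kernel[di, dj] * padded[:, di:di + img.shape[1], dj:dj + img.shape[2]]
    return out


def simulate_pair(hr_hsi, srf, ratio=4):
    """Return (HR-MSI, LR-HSI) simulated from one HR-HSI patch."""
    lr_hsi = blur(hr_hsi, gaussian_kernel())[:, ::ratio, ::ratio]
    hr_msi = np.tensordot(srf, hr_hsi, axes=([1], [0]))   # mix 31 bands into 3
    return hr_msi, lr_hsi


rng = np.random.default_rng(0)
hr_hsi = rng.random((31, 64, 64))            # toy GT patch with 31 bands
srf = rng.random((3, 31))                    # placeholder spectral response
srf /= srf.sum(axis=1, keepdims=True)
hr_msi, lr_hsi = simulate_pair(hr_hsi, srf)
print(hr_msi.shape, lr_hsi.shape)  # (3, 64, 64) (31, 16, 16)
```

With channels-first arrays, the printed shapes correspond to the 64×64×3 and 16×16×31 sizes stated above, and the input patch plays the role of the 64×64×31 GT.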
4.2.3 Comparison with State-of-the-art
In this section, we report the comparison of the results on the CAVE dataset produced by our LAResNet and by several advanced methods (both traditional and DL-based), which were introduced in Sec. 2.3. Quantitative and qualitative evaluation results of these approaches are summarized in Tab. 5 and Fig. 6. As can be observed, our method clearly exceeds the state-of-the-art methods in visual quality. Besides its improvement in pansharpening performance, the proposed LAResNet also performs favorably on HISR. We believe that LAConv can also achieve satisfactory results on more super-resolution tasks.
4.3 Ablation Study
To verify the effectiveness of LAConv and DYB, we perform six groups of ablation experiments on the WV3 dataset, all with the same number of residual blocks and channels. The specific settings are standard convolution + no bias (SC+NB), standard convolution + conventional bias (SC+CB), standard convolution + dynamic bias (SC+DYB), LAConv + no bias (LAC+NB), LAConv + conventional bias (LAC+CB), and LAConv + dynamic bias (LAC+DYB, i.e., the proposed one). The experimental results are shown in Fig. 7 and Tab. 6. It is clear that the network with LAConv works better than the network with standard convolution. The comparison between conventional bias and no bias shows that, with standard convolution, adding a conventional bias does not help and slightly degrades most metrics, while with LAConv the two settings are nearly equal, which suggests that the conventional bias is not well suited to this image fusion task. The network with dynamic bias, on the other hand, achieves the best results for both types of convolution under the same settings.
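The ablation numbers can be checked with a few lines of plain Python. The values below are copied from the ablation table; the metric order (SAM, ERGAS, SCC, Q8) and the lower-is-better flags are assumptions based on the metrics listed for the reduced resolution test.

```python
# Values copied from the ablation table; the metric order is assumed to be
# SAM, ERGAS, SCC, Q8.
ablation = {
    "SC + NB": (4.2564, 3.1026, 0.9628, 0.9511),
    "SC + CB": (4.3483, 3.1302, 0.9614, 0.9511),
    "SC + DYB": (4.2267, 3.0575, 0.9637, 0.9524),
    "LAC + NB": (4.0354, 2.9495, 0.9676, 0.9571),
    "LAC + CB": (4.0264, 2.9129, 0.9684, 0.9568),
    "LAC + DYB": (3.9740, 2.9010, 0.9692, 0.9584),
}
metrics = ("SAM", "ERGAS", "SCC", "Q8")
lower_is_better = {"SAM": True, "ERGAS": True, "SCC": False, "Q8": False}


def best_setting(metric):
    """Return the setting with the best value for the given metric."""
    idx = metrics.index(metric)
    pick = min if lower_is_better[metric] else max
    return pick(ablation, key=lambda name: ablation[name][idx])


for m in metrics:
    print(m, best_setting(m))     # LAC + DYB for every metric

sam_drop = 1.0 - ablation["LAC + DYB"][0] / ablation["SC + CB"][0]
print(f"relative SAM reduction vs. SC + CB: {sam_drop:.1%}")  # 8.6%
```

Under these assumptions, the proposed LAC + DYB setting is best on every metric, and its SAM is about 8.6% lower than that of a standard convolution with a conventional bias.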
4.4 More Discussions
Exploration on LAConv
To better illustrate the effectiveness of LAConv in the fusion process, we present the average and variance of the weights generated by LAConv for each convolution layer in Fig. 8. The analysis shows that the weights generated by LAConv differ across local areas. More specifically, LAConv mainly focuses on the overall information of objects in the shallow layers and then focuses on local high-frequency features such as edges as the layers go deeper.
Report on the number of parameters. The number of parameters (NoPs) of all the compared CNNs is presented in Tab. 7. It can be seen that LAResNet has the fewest parameters while achieving the best results.
We have presented a novel local adaptive convolution (LAConv) and dynamic bias (DYB) for image fusion. Leveraging the locally adaptive dynamic convolution kernel, LAConv has powerful local focusing and feature representation capabilities. We further propose a simple residual network equipped with LAConv and DYB, called LAResNet, for two image fusion tasks. Experiments demonstrate that our method achieves state-of-the-art results in both pansharpening and HISR. The adaptive local focusing mechanism and the translation-invariance property of LAConv suggest considerable potential for other pixel-wise vision tasks, such as single image super-resolution or image classification.
This work is supported by NSFC (61702083), Key Projects of Applied Basic Research in Sichuan Province (Grant No. 2020YJ0216), and National Key Research and Development Program of China (Grant No. 2020YFA0714001).