
Cross Attention-guided Dense Network for Images Fusion

by   Zhengwen Shen, et al.
NetEase, Inc

In recent years, various applications in computer vision have achieved substantial progress based on deep learning, which has been widely used for image fusion and shown to achieve adequate performance. However, because of their limited ability to model the spatial correspondence between different source images, it remains a great challenge for existing unsupervised image fusion models to extract appropriate features and achieve adaptive and balanced fusion. In this paper, we propose a novel cross attention-guided image fusion network, which is a unified and unsupervised framework for multi-modal image fusion, multi-exposure image fusion, and multi-focus image fusion. Different from the existing self-attention module, our cross attention module focuses on modelling the cross-correlation between different source images. Using the proposed cross attention module as the core block, a densely connected cross attention-guided network is built to dynamically learn the spatial correspondence and derive better alignment of important details from different input images. Meanwhile, an auxiliary branch is designed to model long-range information, and a merging network is attached to finally reconstruct the fused image. Extensive experiments have been carried out on publicly available datasets, and the results demonstrate that the proposed model outperforms the state-of-the-art quantitatively and qualitatively.



1 Introduction

Image fusion aims to unify the feature information from different source images into an efficient representation and to improve visual perception for both humans and machines. It has a wide range of applications in computer vision, including multi-modal image fusion, such as infrared and visible image fusion (Li and Wu, 2018), RGB and depth image fusion for semantic segmentation (Wang et al., 2020), and medical image fusion (Ma et al., 2020); multi-focus image fusion (Zhang et al., 2021); and multi-exposure image fusion (Ram Prabhakar et al., 2017).
In recent years, traditional methods for image fusion have taken the lead and can be grouped into three main directions: transform domain approaches (Cao et al., 2014), sparse representation (Bin et al., 2016; Zhang et al., 2018), and component analysis (Kuncheva and Faithfull, 2013). Although many traditional methods perform well on image fusion tasks, they have shortcomings and limitations: feature extraction relies on handcrafted design and lacks flexibility and generalization ability. With the development and wide application of deep learning, these limitations have been partly overcome, and many deep learning-based image fusion methods have been proposed (Ram Prabhakar et al., 2017; Li and Wu, 2018; Ma et al., 2019). Although these methods improve performance and generalization ability, image fusion still suffers from difficulties in feature extraction and in modelling the spatial correspondence between source images. For example, designing a fusion strategy requires considerable manual effort for each fusion task, and it is difficult to balance feature extraction against effective joint representation learning from the source images.

To solve these problems, we propose a novel cross attention-guided dense network for multi-task image fusion, which consists of a cross attention-guided densenet, an auxiliary network, and a merging network. To model the cross-correlation, the cross attention-guided dense network focuses on the spatial correspondence between source images through a densely connected strategy. To avoid losing global information, we adopt an auxiliary network to extract long-range information. To better reconstruct the fused image, we adopt a merging network with residual connections to aggregate the fused features from the source images.

The main contributions of this paper are summarized as follows:

  • A. We propose a novel multi-task image fusion method, named Cross Attention-guided Dense Network for Images Fusion (CADNIF). It exploits the advantages of deep neural networks with attention mechanisms and improves feature extraction and the modelling of spatial correspondence between source images effectively and robustly.

  • B. We propose a new attention-guided dense network to model the spatial correspondence between source images. To preserve more global information and details from the source images, we employ an auxiliary network to capture long-range information and dilated residual dense blocks with a larger receptive field to reconstruct the fused image.

  • C. Extensive experiments on different datasets validate the superiority of the proposed CADNIF, which is successfully applied to multiple image fusion tasks with superior performance against the state-of-the-art methods.

2 Related Work

In this section, we group the image fusion methods into two categories, traditional methods and deep learning-based approaches for image fusion. We also briefly introduce the dense block and attention mechanisms, which are highly related to our proposed method.

Image fusion.

In the field of image fusion, the key question for any fusion algorithm is how to extract and fuse features effectively. Many traditional image fusion methods have been proposed to solve the feature extraction problem, and they fall into three main categories: transform domain approaches, such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT); sparse representation domain approaches; and component analysis, such as principal component analysis (PCA) and independent component analysis (ICA). However, these feature extraction methods lack flexibility and generalizability as complexity increases. Moreover, traditional methods require careful design of fusion rules that suit the features of each specific kind of source image.

With the wide application of deep learning in high-level vision tasks, the limitations of traditional methods have been overcome to a certain extent. More and more CNN-based approaches have been introduced as feature extraction backbones for image fusion tasks, combined with various fusion strategies. Deep learning-based approaches have been applied successfully in several areas, such as multi-modal image fusion, multi-exposure image fusion, multi-focus image fusion, and medical image fusion. Ram Prabhakar et al. (2017) proposed DeepFuse, a deep learning-based unsupervised framework for exposure and extreme exposure image fusion. Li and Wu (2018) proposed an infrared and visible image fusion method named DenseFuse, which builds on DeepFuse and incorporates a dense block in the encoding network for feature extraction. Ma et al. (2019) proposed FusionGAN, based on the generative adversarial network, for infrared and visible image fusion.

Ma et al. (2020) extended this work with a dual-discriminator strategy for fusing multi-modality images of different resolutions. Recently, some works have focused on unified, unsupervised, end-to-end image fusion networks: Xu et al. (2020) and Zhang et al. (2020a) each proposed a unified network for tasks such as multi-modal, multi-exposure, and multi-focus image fusion. However, these methods focus on the significance of the differences between the source images and do not consider an adaptive balance of the information that each source contributes to the final fused image. In this paper, we propose a more generic unified unsupervised approach to multiple image fusion tasks, which jointly learns spatial correspondence information from the source images; the outputs of each cross attention module make it possible to balance the contributions of the source images in the fused result.

Image fusion with dense block.

Although deep convolutional neural networks have achieved remarkable results in computer vision, their wide application and increasing depth have gradually exposed many problems, such as performance degradation, vanishing gradients, exploding gradients, and insufficient reuse of earlier features. To address these problems, He et al. (2016) proposed the deep residual learning network. Building on He's work, Huang et al. (2017) proposed a new network framework, densenet, in which each layer takes the concatenation of the outputs of all preceding layers as input; this strengthens feature propagation, encourages feature reuse, and alleviates the gradient problem. Because of these advantages, the dense block has been widely incorporated into many computer vision tasks, such as semantic segmentation, image classification, and object detection. In image fusion, the dense block is widely used for feature extraction. Li and Wu (2018) incorporated the dense block into the encoding network to preserve useful information from the middle layers. Ma et al. (2020) proposed a novel GAN framework for image fusion in which the architecture of the generator is based on densenet. Instead of directly using densenet as a feature extraction network, we propose a cross attention-guided dense network, which effectively learns cross-correlation information between different source images.
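
The dense connectivity pattern described above can be illustrated with a minimal NumPy sketch. It is not the network of Huang et al. (2017) or of this paper: the 1x1 convolutions, the growth rate of 8, and the names `conv1x1_relu` and `dense_block` are assumptions chosen only to show how each layer receives the concatenation of all earlier outputs.

```python
import numpy as np

def conv1x1_relu(x, weight):
    """1x1 convolution followed by ReLU; x is (H, W, C_in), weight is (C_in, C_out)."""
    return np.maximum(x @ weight, 0.0)

def dense_block(x, weights):
    """Each layer sees the channel-wise concatenation of the input and all earlier layer outputs."""
    features = [x]
    for w in weights:
        layer_input = np.concatenate(features, axis=-1)
        features.append(conv1x1_relu(layer_input, w))
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
in_channels, growth, num_layers = 4, 8, 3  # illustrative sizes (assumed)
x = rng.standard_normal((16, 16, in_channels))
# Layer i receives in_channels + i * growth channels and emits `growth` channels.
weights = [rng.standard_normal((in_channels + i * growth, growth)) * 0.1
           for i in range(num_layers)]
out = dense_block(x, weights)
print(out.shape)  # (16, 16, 28): 4 input channels + 3 layers x 8 channels
```

Because earlier feature maps are carried forward unchanged, the first `in_channels` channels of the output are the input itself, which is what makes feature reuse possible.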

Attention mechanisms in deep learning methods.

In recent years, the attention mechanism, a technique that enables models to focus on important information and learn from it fully, has been widely used in computer vision, natural language processing, and image and video understanding. Bastidas and Tang (2019) proposed a novel channel attention network for multi-spectral imagery. Tian et al. (2021) proposed multi-level attention to address the problem of crowd counting. Song et al. (2021) proposed cross-modal attention for MRI and ultrasound volume registration. Our work explores cross attention for basic mutual feature extraction from source images.

Figure 1: The overall architecture of the proposed CADNIF, which mainly consists of three components: the attention-guided network, the auxiliary network, and the merging network. First, the attention-guided network extracts cross attention features from the source images. Simultaneously, the auxiliary network extracts long-range feature representations via a series of convolutions. Finally, the merging network merges the global feature representations and the spatial correspondence via a dilated residual dense block and a global residual connection.

3 Methods

We propose a novel cross attention-guided image fusion framework, which focuses on improving the representation of cross-significance information, modelling the spatial correspondence, capturing long-range information, and merging the features from the source images to reconstruct the fused image. As shown in Figure 1, the framework comprises three networks: the cross attention-guided dense network, the auxiliary network, and the merging network. The attention-guided network captures spatial correspondence features using five attention blocks based on densenet; the auxiliary network is responsible for capturing long-range information; and the merging network relies on a series of dilated residual dense blocks to use the image features effectively and recover more details in the fused image. Finally, the proposed attention-guided fusion network fuses the source images, and the following experiments show that the fusion results perform well.

3.1 Cross Attention Guided Densenet

Unlike previous methods, which use densenet directly as a feature extraction network, our proposed cross attention guided densenet obtains cross-attention information from each source image. As shown in Figure 2, the attention-guided densenet consists of five major cross attention blocks. Each of the two input images is formed by concatenating three one-channel gray images of the same source. We first concatenate the input images to obtain a 6-channel concatenated image. As shown in Figure 2a, the input images and the concatenated image are fed into the attention-guided network, and the final cross attention features are obtained via dense connection. As shown in Figure 2b, the cross attention guided block consists of three convolution layers that produce feature maps, a concatenation operation, and an attention module. Details of the cross attention block are provided below. The final cross attention maps can be obtained via:


where denotes the point-wise multiplication, and denote the original input images, and denote each original image concatenate with , and denote the attention map, denote the final cross attention feature maps.

Figure 2: (a) The architecture of the proposed attention-guided dense network, the network consists of five attention blocks for modelling the spatial correspondence. (b) shows the cross attention blocks. (c) shows the attention module.

3.2 Attention Module

Attention modules play an important role in the cross attention block. As shown in Figure 2c, the attention module consists of two convolution layers and a Sigmoid layer and produces a one-to-one correspondence map for each source image. Following Eq. 3, the input of the attention module passes through two convolution layers to produce feature maps. The first convolution is followed by a ReLU activation and the second by a Sigmoid activation. As a result, we finally obtain the attention map with values in the range [0, 1].
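
A minimal NumPy sketch of the attention module, and of how its output gates the features by point-wise multiplication, is given here under stated assumptions. Kernel sizes and channel widths are not recoverable from the text, so point-wise (1x1) convolutions with 16 hidden channels are used; the module is applied directly to the images instead of to convolutional feature maps; and every function name is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, weight, bias):
    """Point-wise (1x1) convolution: x is (H, W, C_in), weight is (C_in, C_out)."""
    return x @ weight + bias

def attention_module(x, params):
    """Two convolutions: ReLU after the first, Sigmoid after the second."""
    (w1, b1), (w2, b2) = params
    hidden = np.maximum(conv1x1(x, w1, b1), 0.0)
    return sigmoid(conv1x1(hidden, w2, b2))

def cross_attention(feat, feat_cat, params):
    """Gate one source's features with a map computed from the concatenated input."""
    attn = attention_module(feat_cat, params)
    return feat * attn, attn  # point-wise multiplication

rng = np.random.default_rng(0)
gray_a = rng.random((32, 32, 1))
gray_b = rng.random((32, 32, 1))
# Each input is three copies of a one-channel gray image, as described in Section 3.1.
img_a = np.repeat(gray_a, 3, axis=-1)
img_b = np.repeat(gray_b, 3, axis=-1)
img_cat = np.concatenate([img_a, img_b], axis=-1)  # 6-channel concatenation

params = [
    (rng.standard_normal((6, 16)) * 0.1, np.zeros(16)),  # assumed hidden width
    (rng.standard_normal((16, 3)) * 0.1, np.zeros(3)),
]
gated_a, attn_a = cross_attention(img_a, img_cat, params)
print(gated_a.shape, float(attn_a.min()), float(attn_a.max()))
```

Since the attention map depends on both sources through the concatenated input, the gating applied to one image is informed by the other, which is the sense in which the attention is "cross" in this sketch.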

3.3 Auxiliary network

As shown in Figure 3, the auxiliary network first applies a convolution to the concatenation of the two input images to obtain 32-channel feature maps. It then uses a convolution block consisting of three convolution branches with different kernel sizes, each followed by layer normalization. Finally, the branch outputs are concatenated and passed through a further convolution followed by layer normalization to obtain the final fused feature maps.
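
The multi-branch structure of the auxiliary network can be sketched in NumPy as follows. This is a sketch under assumptions, not the authors' implementation: the kernel sizes did not survive in the text, so 1x1 is assumed for the first and last convolutions and 3x3, 5x5, and 7x7 for the three branches; each branch is assumed to emit 32 channels; layer normalization is taken over the channel axis without learned scale or offset; and all names are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Stride-1, zero-padded convolution. x: (H, W, C_in); kernel: (k, k, C_in, C_out), k odd."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = x.shape
    out = np.zeros((h, w, kernel.shape[-1]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + w, :] @ kernel[i, j]
    return out

def layer_norm(x, eps=1e-5):
    """Normalize each pixel's channel vector to zero mean and unit variance (assumed axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def auxiliary_network(img_a, img_b, params):
    x = np.concatenate([img_a, img_b], axis=-1)
    x = conv2d_same(x, params["stem"])  # 32-channel feature maps
    branches = [layer_norm(conv2d_same(x, k)) for k in params["branches"]]
    merged = np.concatenate(branches, axis=-1)
    return layer_norm(conv2d_same(merged, params["fuse"]))

rng = np.random.default_rng(0)

def kernel(k, c_in, c_out):
    return rng.standard_normal((k, k, c_in, c_out)) * 0.1

params = {
    "stem": kernel(1, 6, 32),
    "branches": [kernel(3, 32, 32), kernel(5, 32, 32), kernel(7, 32, 32)],  # assumed sizes
    "fuse": kernel(1, 96, 32),
}
img_a = rng.random((16, 16, 3))
img_b = rng.random((16, 16, 3))
fused = auxiliary_network(img_a, img_b, params)
print(fused.shape)  # (16, 16, 32)
```

Using several kernel sizes in parallel lets the block aggregate context at more than one spatial scale before the branches are merged.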

Figure 3: The architecture of the auxiliary network.

3.4 Merging Network

To better reconstruct the fused image and to integrate the cross-correlation information and the long-range information so that the feature maps retain more details, we adopt the dilated residual dense block proposed in (Yan et al., 2019) as the reconstruction network. The merging network takes as input the feature maps of the attention-guided network and the auxiliary network, concatenated and passed through a convolution layer. As shown in Figure 4, the dilated residual dense block consists of three convolution layers, each followed by ReLU activation, concatenation-based skip connections similar to densenet, and a convolution layer that produces the output. Unlike normal convolution, each of these convolution layers uses a 2-dilated convolution. Finally, we apply a global residual connection strategy to concatenate the output with the concatenated source images.
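
The effect of dilation on the receptive field can be checked with a short calculation. The kernel size of the dilated convolutions is missing from the text, so a 3x3 kernel is assumed here; the function names are illustrative only.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(num_layers, kernel_size, dilation):
    """Receptive field of a chain of stride-1 convolutions with identical settings."""
    rf = 1
    for _ in range(num_layers):
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf

plain = receptive_field(num_layers=3, kernel_size=3, dilation=1)
dilated = receptive_field(num_layers=3, kernel_size=3, dilation=2)
print(plain, dilated)  # 7 13
```

Under this assumption, three stacked 2-dilated layers see a 13x13 neighbourhood instead of 7x7 while using the same number of weights per layer.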

Figure 4: The detail architecture of the merging network.

3.5 Training Loss

As described in the section above, the proposed CADNIF method can be trained to capture the structural context details and the background details of the different source images. In the proposed method, the loss function consists of a Mean Square Error (MSE) loss and a Structural Similarity (SSIM) loss. SSIM is an index that measures the similarity of two images: the larger the value, the better, with a maximum of 1. Generally, a value of 1 means that the fused image retains more of the structural details of the input images. The SSIM loss is defined by Eq. 4,


In Eq. 4, the structural similarity is measured between the input source image and the fused image. The MSE loss emphasizes the matching of each corresponding pixel between the input image and the output image, and it is calculated as Eq. 5,


Eq. 5 is written in terms of the input source image, the fused image, and a norm. Finally, the total loss of the proposed model can be expressed as follows:


The weighting parameters denote the contribution of each loss to the whole objective function. In the experiments below, the four weighting parameters are set to 1, 0.5, 0.03, and 0.03 in the infrared and visible fusion experiment; 1, 1, 0.01, and 0 in the MRI and PET fusion experiment; 2, 5, 1, and 1 in the multi-focus fusion experiment; and 0.5, 0.7, 1.3, and 1 in the multi-exposure fusion experiment.
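
Because the loss equations did not survive in this copy, the following NumPy sketch shows only one plausible form of a combined SSIM and MSE objective and should not be read as the paper's exact definition. It assumes that the SSIM loss is 1 minus SSIM, that SSIM is computed from global image statistics instead of local windows, and that the four weights apply to the SSIM terms of the two sources followed by the MSE terms of the two sources. The function names are hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM from whole-image statistics (a simplification of the windowed index)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mse(x, y):
    return np.mean((x - y) ** 2)

def fusion_loss(fused, src_a, src_b, weights):
    """Weighted sum of SSIM and MSE terms against both sources (assumed weight order)."""
    l1, l2, l3, l4 = weights
    return (l1 * (1.0 - global_ssim(fused, src_a))
            + l2 * (1.0 - global_ssim(fused, src_b))
            + l3 * mse(fused, src_a)
            + l4 * mse(fused, src_b))

rng = np.random.default_rng(0)
src_a = rng.random((64, 64))
src_b = rng.random((64, 64))
fused = 0.5 * (src_a + src_b)
weights = (1.0, 0.5, 0.03, 0.03)  # values quoted for the infrared and visible experiment
loss = fusion_loss(fused, src_a, src_b, weights)
print(float(loss))
```

With this form, a fused image identical to both sources gives a loss of zero, and each weight controls how strongly the result is pulled toward the structure or the pixel values of one source.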

4 Experiments

To evaluate the proposed method, we conduct experiments on five publicly available datasets: the TNO and RoadScene datasets for the visible and infrared image fusion task, where TNO is used for training and RoadScene is used entirely for testing; the MFF and SICE datasets for the multi-exposure image fusion task; the MRI and PET datasets for the medical image fusion task; and Lytro for the multi-focus image fusion task. To address the lack of training data, we adopt the expansion strategy of (Zhang et al., 2020a). The numbers of image pairs used for testing are 17, 50, 19, 20, and 18, respectively.

For all experiments, we set the batch size, learning rate, and number of epochs to 16, , and 10, respectively. The proposed method was implemented in TensorFlow and run on an NVIDIA GeForce RTX 2080 Ti GPU.

4.1 Infrared and visible image fusion

In the infrared and visible image fusion experiment, most of the TNO dataset is used to train the proposed image fusion method and 17 image pairs are used for testing; 50 pairs from the RoadScene dataset are used to test our proposed method and verify its generalization and robustness. We compare our proposed method with CE (Zhou et al., 2016), DDLatLRR (Li et al., 2020), DenseFuse (Li and Wu, 2018), PMGI (Zhang et al., 2020a), and U2Fusion (Xu et al., 2020).
To evaluate the fusion quality quantitatively, we adopt five indicators to compare the methods, namely entropy (EN), the sum of the correlations of differences (SCD) (Aslantas and Bendes, 2015), standard deviation (SD), Qabf, which represents the ratio of noise added to the final image, and mutual information (MI). Table 1 and Table 2 show the quantitative results on the TNO and RoadScene datasets; the red font in italic represents the best value and the bold black font in italic represents the second best value. The proposed method achieves the best values in the EN, SCD, SD, and MI metrics on both datasets. This demonstrates that the proposed method maintains abundant spatial correspondence information and detail information from the source images.
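
Two of the indicators above, EN and SD, have simple standard definitions that can be sketched in NumPy. The sketch assumes 8-bit grayscale fused images and the usual histogram-based Shannon entropy in bits; the function names are illustrative, and the exact evaluation code used for the tables is not given in the text.

```python
import numpy as np

def entropy(img_u8):
    """Shannon entropy (bits) of the gray-level histogram of an 8-bit image."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img_u8):
    """Standard deviation of the pixel intensities."""
    return float(img_u8.astype(np.float64).std())

flat = np.full((4, 4), 128, dtype=np.uint8)   # a single gray level
two_level = np.zeros((4, 4), dtype=np.uint8)  # half black, half white
two_level[:, 2:] = 255

print(entropy(flat), entropy(two_level))  # a flat image carries no information
print(standard_deviation(two_level))      # 127.5
```

Higher EN indicates a richer gray-level distribution and higher SD indicates stronger contrast, which is why both are reported as larger-is-better in the tables.
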
The qualitative fusion results on two typical image pairs from each test dataset are illustrated in Figure 5 and Figure 6, where red boxes highlight local fine features. From the results on the TNO and RoadScene datasets, we can observe that the proposed method captures abundant corresponding semantic information compared with the state-of-the-art methods. Zooming in on the highlighted regions, we can infer that local fine information from the source images, such as the sky, windows, license plates, and treetops, is retained well. In particular, with respect to the illumination information and texture details of each source image, the fusion results show that the proposed method performs well thanks to the adaptive balance strategy of the cross attention-guided network.

Figure 5: Comparison results of infrared and visible image fusion on the TNO dataset. From left to right: infrared image, visible image, and the results of CE, DDLatLRR, DenseFuse, PMGI, U2Fusion, and the proposed method.
Figure 6: Comparison results of infrared and visible image fusion on the RoadScene dataset. From left to right: infrared image, visible image, and the results of CE, DDLatLRR, DenseFuse, PMGI, U2Fusion, and the proposed method.
Methods EN SCD SD Qabf MI
CE 7.1939 1.5846 90.6729 0.4798 14.3879
DDLatLRR 6.8565 1.7255 76.3305 0.4881 13.7131
DenseFuse 6.8425 1.6183 89.3237 0.5350 13.6851
PMGI 7.1191 0.1062 99.6564 0.1535 14.1382
U2Fusion 6.8959 1.5877 76.4380 0.3941 13.7919
Proposed 7.2841 1.8318 104.3948 0.3256 14.5683
Table 1: Quantitative evaluation of each method on 17 pairs of infrared and visible images from the TNO dataset.
Methods EN SCD SD Qabf MI
CE 7.3635 1.5893 75.8509 0.4587 14.7271
DDLatLRR 7.1485 1.6482 67.6889 0.5017 14.2971
DenseFuse 6.8549 1.2950 65.5098 0.5074 13.7099
PMGI 7.2492 1.7410 78.2691 0.4230 14.6985
U2Fusion 7.1968 1.7692 68.0393 0.4791 14.3937
Proposed 7.5315 1.7785 97.9758 0.4571 15.0631
Table 2: Quantitative evaluation of each method on 50 pairs of infrared and visible images from the RoadScene dataset.

4.2 MRI and PET image fusion

Methods EN SD CC Qabf MI
DCHWT 5.7501 84.9344 0.7302 0.6849 11.1003
SA 4.9705 84.6888 0.7079 0.7415 9.9411
DDCTPCA 4.8017 84.9032 0.8030 0.4486 9.6035
PMGI 5.3341 83.9781 0.7646 0.6917 10.6684
IFCNN 4.8187 98.1931 0.7726 0.6662 9.6374
Proposed 5.5149 80.2054 0.8180 0.6211 11.1298
Table 3: Quantitative evaluation of each method on 20 pairs of MRI and PET images.
Figure 7: Comparison result of MRI and PET image fusion. From left to right: MRI image, PET image, the results of DCHWT, Structure Aware(SA), DDCTPCA, PMGI, IFCNN and Proposed.

In the MRI and PET image fusion experiment, the number of test image pairs is 20. We compare our proposed method with DCHWT (Kumar, 2013), Structure Aware (Li et al., 2018), DDCTPCA (Naidu, 2014), PMGI (Zhang et al., 2020a), and IFCNN (Zhang et al., 2020b) using quantitative and qualitative evaluation.

We adopt five indicators to compare the methods, namely entropy (EN), standard deviation (SD), correlation coefficient (CC), Qabf, which represents the ratio of noise added to the final image, and mutual information (MI). As shown in Table 3, the proposed fusion method achieves the best values in the CC and MI metrics, and in the EN metric it performs better than all other methods except DCHWT. Qualitative results are shown in Figure 7: the spatial correspondence details are preserved, and the local context details of the source input images are retained.
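
For reference, the pairwise forms of CC and MI can be sketched in NumPy as below. This is a sketch under assumptions: it uses the Pearson correlation coefficient and a joint-histogram estimate of mutual information on 8-bit images, and it evaluates one source image against one other image. How the scores for the two sources are combined into the single values reported in the tables is not stated in the text, so no aggregation is shown. The function names are illustrative.

```python
import numpy as np

def correlation_coefficient(x, y):
    """Pearson correlation between the pixel intensities of two images."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def mutual_information(x_u8, y_u8, bins=256):
    """Mutual information (bits) estimated from the joint gray-level histogram."""
    joint, _, _ = np.histogram2d(x_u8.ravel(), y_u8.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of the first image
    py = pxy.sum(axis=0, keepdims=True)  # marginal of the second image
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

x = np.zeros((4, 4), dtype=np.uint8)
x[:, 2:] = 255                                # half black, half white
inverted = 255 - x
flat = np.full((4, 4), 128, dtype=np.uint8)

print(correlation_coefficient(x, x), correlation_coefficient(x, inverted))
print(mutual_information(x, x), mutual_information(x, flat))
```

An inverted copy has a correlation of -1 but the same mutual information as the original, which illustrates why the two indicators capture different aspects of how much source content a fused image retains.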

4.3 Multi-focus Image Fusion

Methods EN FMI SD MI CC
VSMWLS 7.4934 0.8850 106.0445 15.0068 0.9564
CBF 7.4699 0.8927 106.5535 14.9399 0.9514
SA 7.3825 0.8943 107.3123 14.7652 0.9495
ConvSR 7.4615 0.8940 107.3435 14.9230 0.9496
MFF 7.5666 0.8826 111.8743 15.1333 0.9515
Proposed 7.5005 0.8788 113.1872 15.0811 0.9530
Table 4: Quantitative evaluation of each specialized method on 18 pairs of far-focused and near-focused images.
Figure 8: Comparison results of multi-focus image fusion. From left to right: far-focused image, near-focused image, and the results of VSMWLS, CBF, Structure Aware, ConvSR, MFF, and the proposed method.

To verify the feature extraction ability, versatility, and generalization of the proposed method, we conducted fusion experiments on near-focused and far-focused image datasets. Our proposed method was compared with methods specialized for the far- and near-focused image fusion task. We employ five existing methods for comparison: VSMWLS (Ma et al., 2017), CBF (Kumar, 2015), Structure Aware (Li et al., 2018), ConvSR (Liu et al., 2016), and MFF (Zhang et al., 2021).
For quantitative analysis, we adopt five indicators to compare the methods, namely entropy (EN), feature mutual information (FMI), standard deviation (SD), mutual information (MI), and correlation coefficient (CC). As shown in Table 4, the proposed fusion method achieves the best value in the SD metric and ranks second in the EN and MI metrics (behind MFF) and in the CC metric (behind VSMWLS). For qualitative analysis, Figure 8 shows that our proposed method extracts illumination information and spatial detail information from multi-focus images well compared with the specialized methods.

4.4 Multi-exposure Image Fusion

In the multi-exposure image fusion experiment, we employ five existing methods for comparison with our method: FMMR (Li and Kang, 2012), DSIFT (Hayat and Imran, 2019), DeepFuse (Ram Prabhakar et al., 2017), MGFF (Bavirisetti et al., 2019), and IFCNN (Zhang et al., 2020b).
Table 5 compares the proposed method with the five existing methods by entropy (EN), feature mutual information (FMI), standard deviation (SD), correlation coefficient (CC), and mutual information (MI). The proposed fusion method achieves the best values in the SD and CC metrics and ranks second in the EN metric (behind MGFF) and in the MI metric (behind DeepFuse). Visual comparisons of the image fusion methods are shown in Figure 9. From the perspective of global image information, the proposed method adaptively balances the illumination information well; likewise, the local context details of the source images are retained.

Methods EN FMI SD CC MI
FMMR 6.3172 0.8736 92.5337 0.5448 12.6344
DSIFT 6.5245 0.8934 101.3824 0.5445 13.0491
DeepFuse 6.8183 0.8926 116.3688 0.7945 13.8566
MGFF 6.8792 0.8874 116.2462 0.7976 13.7586
IFCNN 6.7259 0.8838 113.8987 0.7572 13.4519
Proposed 6.8598 0.8619 122.6287 0.8202 13.8398
Table 5: Quantitative comparison on 19 pairs of underexposed and overexposed images using five metrics.
Figure 9: Comparison results of multi-exposure image fusion. From left to right: underexposed image, overexposed image, and the fusion results of FMMR, DSIFT, DeepFuse, MGFF, IFCNN, and the proposed method.

4.5 Ablation Study and Visualization

In this section, we present an ablation study, taking the infrared and visible image fusion experiment as an example. We compare the proposed method with a variant without the dilated residual dense block (DRDB) and with the original densenet without the cross attention-guided module. As shown in Table 6, the bold black font in italic represents the best value. The results indicate that the proposed method models the spatial correspondence effectively for the reconstructed fused image, and that the DRDB module, by using dilated convolution to obtain a larger receptive field, gathers sufficient information to recover illumination and details in the reconstructed image.
As shown in Figure 10, we visualize the cross attention maps obtained by each cross attention module. Taking the infrared and visible image fusion task as an example, we can infer from the figure that the background, detail information, and salient features of the fused image are enhanced by each attention module. This shows that the cross attention guidance is effective for extracting spatial correspondence features from the source images.

Methods EN SCD SD Qabf MI
Densenet 7.0573 0.5774 100.39 0.3050 14.1147
Without DRDB 7.1020 1.3297 97.9295 0.2757 14.2040
Proposed 7.2841 1.8318 104.3948 0.3256 14.5683
Table 6: Ablation Study on cross attention-guided network for infrared and visible image fusion.
Figure 10: The attention map of cross attention unit from the source images.

5 Conclusion

This work introduced a novel unsupervised framework for the challenging multi-image fusion task, which can serve as a unified framework for four tasks: infrared and visible image fusion, medical image fusion, multi-focus image fusion, and multi-exposure image fusion. To learn semantic features for image fusion more effectively, a cross attention-guided dense network is proposed to learn the spatial correspondence explicitly, so that only the essential details are learned and aligned attentively. We also proposed an auxiliary branch and a global merging block to better model long-range relationships and leverage global information to reconstruct the fused image. In quantitative comparison and qualitative analysis, the proposed method achieves better results than the state-of-the-art fusion methods.

6 Acknowledgments

This work was partially supported by the Scientific Innovation 2030 Major Project for New Generation of AI under Grant 2020AAA0107300.

References


  • V. Aslantas and E. Bendes (2015) A new image quality metric for image fusion: the sum of the correlations of differences. Aeu-international Journal of electronics and communications 69 (12), pp. 1890–1896. Cited by: §4.1.
  • A. A. Bastidas and H. Tang (2019) Channel attention networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 881–888. Cited by: §2.
  • D. P. Bavirisetti, G. Xiao, J. Zhao, R. Dhuli, and G. Liu (2019) Multi-scale guided image and video fusion: a fast and efficient approach. Circuits, Systems, and Signal Processing 38 (12), pp. 5576–5605. Cited by: §4.4.
  • Y. Bin, Y. Chao, and H. Guoyu (2016) Efficient image fusion with approximate sparse representation. International Journal of Wavelets, Multiresolution and Information Processing 14 (04), pp. 1650024. Cited by: §1.
  • L. Cao, L. Jin, H. Tao, G. Li, Z. Zhuang, and Y. Zhang (2014) Multi-focus image fusion based on spatial frequency in discrete cosine transform domain. IEEE signal processing letters 22 (2), pp. 220–224. Cited by: §1.
  • N. Hayat and M. Imran (2019) Ghost-free multi exposure image fusion technique using dense sift descriptor and guided filter. Journal of Visual Communication and Image Representation 62, pp. 295–308. Cited by: §4.4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.
  • B. S. Kumar (2013) Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing 7 (6), pp. 1125–1143. Cited by: §4.2.
  • B. S. Kumar (2015) Image fusion based on pixel significance using cross bilateral filter. Signal, image and video processing 9 (5), pp. 1193–1204. Cited by: §4.3.
  • L. I. Kuncheva and W. J. Faithfull (2013) PCA feature extraction for change detection in multidimensional unlabeled data. IEEE transactions on neural networks and learning systems 25 (1), pp. 69–80. Cited by: §1.
  • H. Li, X. Wu, and J. Kittler (2020) MDLatLRR: a novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing. Note: doi: 10.1109/TIP.2020.2975984 Cited by: §4.1.
  • H. Li and X. Wu (2018) DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623. Cited by: §1, §2, §2, §4.1.
  • S. Li and X. Kang (2012) Fast multi-exposure image fusion with median filter and recursive filter. IEEE Transactions on Consumer Electronics 58 (2), pp. 626–632. Cited by: §4.4.
  • W. Li, Y. Xie, H. Zhou, Y. Han, and K. Zhan (2018) Structure-aware image fusion. Optik 172, pp. 1–11. Cited by: §4.2, §4.3.
  • Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang (2016) Image fusion with convolutional sparse representation. IEEE signal processing letters 23 (12), pp. 1882–1886. Cited by: §4.3.
  • J. Ma, H. Xu, J. Jiang, X. Mei, and X. Zhang (2020) DDcGAN: a dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing 29, pp. 4980–4995. Cited by: §1, §2, §2.
  • J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: §1, §2.
  • J. Ma, Z. Zhou, B. Wang, and H. Zong (2017) Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Physics & Technology 82, pp. 8–17. Cited by: §4.3.
  • V. Naidu (2014) Hybrid ddct-pca based multi sensor image fusion. Journal of Optics 43 (1), pp. 48–61. Cited by: §4.2.
  • K. Ram Prabhakar, V. Sai Srikar, and R. Venkatesh Babu (2017) Deepfuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE international conference on computer vision, pp. 4714–4722. Cited by: §1, §2, §4.4.
  • X. Song, H. Guo, X. Xu, H. Chao, S. Xu, B. Turkbey, B. J. Wood, G. Wang, and P. Yan (2021) Cross-modal attention for mri and ultrasound volume registration. arXiv preprint arXiv:2107.04548. Cited by: §2.
  • M. Tian, H. Guo, and C. Long (2021) Multi-level attentive convoluntional neural network for crowd counting. arXiv preprint arXiv:2105.11422. Cited by: §2.
  • Y. Wang, W. Huang, F. Sun, T. Xu, Y. Rong, and J. Huang (2020) Deep multimodal fusion by channel exchanging. Advances in Neural Information Processing Systems 33. Cited by: §1.
  • H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling (2020) U2Fusion: a unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2, §4.1.
  • Q. Yan, D. Gong, Q. Shi, A. v. d. Hengel, C. Shen, I. Reid, and Y. Zhang (2019) Attention-guided network for ghost-free high dynamic range imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1751–1760. Cited by: §3.4.
  • H. Zhang, Z. Le, Z. Shao, H. Xu, and J. Ma (2021) MFF-gan: an unsupervised generative adversarial network with adaptive and gradient joint constraints for multi-focus image fusion. Information Fusion 66, pp. 40–53. Cited by: §1, §4.3.
  • H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma (2020a) Rethinking the image fusion: a fast unified image fusion network based on proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12797–12804. Cited by: §2, §4.1, §4.2, §4.
  • Q. Zhang, Y. Liu, R. S. Blum, J. Han, and D. Tao (2018) Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: a review. Information Fusion 40, pp. 57–75. Cited by: §1.
  • Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang (2020b) IFCNN: a general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118. Cited by: §4.2, §4.4.
  • Z. Zhou, M. Dong, X. Xie, and Z. Gao (2016) Fusion of infrared and visible images for night-vision context enhancement. Applied optics 55 (23), pp. 6480–6490. Cited by: §4.1.