Image fusion is the technique of integrating information from different types of images obtained from different sensors, so as to improve the accuracy and richness of the information contained in a single image. Image fusion can compensate for the limitations of single imaging sensors, and the technique has developed rapidly in recent years [4, 18, 23]. In the process of image fusion, the selection of the active feature maps and the fusion rules are the two key factors determining the quality of the fused image. The feature maps contain a measurement of the activity level at each pixel location, serving as the basis for weight allocation among the different sources, and the fusion rule also plays an indispensable role. Recently, with the continuous progress of image fusion algorithms and the wide availability of different kinds of imaging devices, the applications of image fusion have become more and more extensive. For example, in medical imaging applications, images of different modalities can be fused to achieve more reliable and precise medical diagnosis. In military surveillance applications, image fusion can integrate information from different bands of the electromagnetic spectrum (such as the visible and infrared bands) to achieve night vision. Due to the rapid development of artificial intelligence, multi-sensor image fusion has become a hot spot in clinical diagnosis, industrial production and military research.
From the perspective of obtaining active feature maps, image fusion can be divided into two categories: conventional and deep learning-based image fusion algorithms.
Conventional methods include transform domain algorithms and spatial domain algorithms. In transform domain algorithms, the active feature maps are represented by the decomposition coefficients of a multi-scale transform. In contrast, spatial domain algorithms transform the image into a single-scale feature through advanced signal representation methods. In both families, the activity level measurement is obtained through a specific hand-crafted filter. However, because of computational cost and implementation difficulty, it remains demanding to design an ideal activity level measurement method or fusion strategy for practical applications that takes all the key issues of image fusion into full consideration.
Nowadays, deep learning has been widely employed in image processing and computer vision, for example in image segmentation [1, 5, 26, 37], classification [2, 31] and object detection [35, 20, 19, 6]. Traditional pattern recognition contains three key steps, namely feature extraction, selection and prediction, which correspond to a large extent to image transformation, activity level measurement and fusion rules in image fusion. Meanwhile, a convolutional neural network (CNN) can learn the most effective features from a large amount of training data to better solve pattern recognition problems. Therefore, the application of CNNs to image fusion also has great theoretical potential and introduces a new perspective on the measurement of activity level. That is, a CNN can be used to automatically extract the features used for fusion and learn the direct mapping from source images to active feature maps.
In recent deep learning-based image fusion research [22, 32, 17, 41, 3, 23, 13, 14], fusion using features learned by CNNs achieved higher quality than traditional fusion approaches, but some existing drawbacks hinder further improvement. First, in existing studies the feature maps obtained through deep learning are not fully utilized, with only a weighted average of the feature maps being calculated. Additionally, traditional image fusion techniques (such as multi-resolution analysis and consistency verification) cannot be ignored, but so far no studies have explored the combination of conventional image fusion algorithms and deep learning algorithms. Finally, among the existing deep learning-based fusion algorithms, previous works [22, 32, 17, 41, 13, 14] failed to address the problems of training time and memory cost. To solve these problems, we propose an unsupervised image fusion algorithm that combines a multi-scale discrete wavelet transform (DWT) based on regional energy with deep learning. To the best of our knowledge, this is the first integration of conventional image fusion techniques and deep learning. We propose an architecture consisting of an encoder, a DWT-based fusion layer and a decoder. To make the best of the information in the feature maps, we use the DWT in the fusion layer to transform the feature maps obtained from the encoder into the wavelet domain. Adaptive fusion rules are then applied to the low-frequency and high-frequency components. Finally, the inverse wavelet transform is used to reconstruct the final feature map, which is decoded by the decoder to obtain the final fused image. With the additional processing of the feature maps by DWT, the quality of the fused image is remarkably improved.
The main contributions are summarized as follows:
(1) An unsupervised multi-scene image fusion architecture is proposed based on the combination of multi-scale discrete wavelet transform and deep learning.
(2) With multi-level decomposition in DWT, the useful information in the feature maps can be fully exploited. Moreover, a region-based fusion strategy is adopted to capture more detail information. Extensive experiments demonstrate the superiority of our network over state-of-the-art fusion methods.
(3) Our network can be trained on a comparatively small dataset at low computational cost, with fusion performance comparable to that obtained by training on the COCO dataset. Our experiments show that both the quality of the fused images and the training efficiency are improved markedly.
Our paper is structured as follows. In Section II, we briefly review related works. In Section III, the proposed network and its feasibility are introduced in detail. Section IV introduces the experimental results and analysis. In the last section, we give the conclusions of our paper.
II Related Works
In deep learning-based fusion methods, a CNN is designed to capture deep features from the source images effectively. Liu et al. proposed a fusion method based on convolutional sparse representation (CSR), where multi-scale and multi-layer features are employed to construct the fused output. CNNs were applied to multi-focus image fusion for the first time by Liu et al. This method directly learns the mapping from the source image to the focus map through deep learning. By virtue of the CNN model, the activity level measurement and the fusion rule can be selected simultaneously, thus overcoming the difficulty that existing fusion methods face in fulfilling these two tasks at the same time.
Following this work, Liu et al. extended the CNN model to multi-modal medical image fusion. A CNN is used to generate a weight map representing the pixel activity information of the source images, and the fusion process is performed in a multi-scale manner through image pyramids, which is more consistent with human visual perception. In addition, a strategy based on local similarity is applied to adjust the fusion rules adaptively for the decomposed coefficients. Du and Gao [3] proposed a novel multi-focus image fusion method based on image segmentation through a multi-scale CNN (MSCNN). Yang et al. proposed a unified framework for simultaneous image fusion and super-resolution. DeepFuse is the first network for multi-exposure image fusion using deep learning methods. This network can effectively fuse images of different exposure levels with no artifacts and high fusion quality.
In DeepFuse, the network fuses only the features extracted from its last layer, losing much useful feature information from the middle layers. To address this issue, Li et al. proposed two novel fusion frameworks based on pretrained networks (VGG-19 and ResNet50 [6]), which are used for deep feature extraction. A new deep fusion framework based on zero-phase component analysis (ZCA) was also proposed, in which the residual network and ZCA are used to extract deep features from the source image and to obtain an initial weight map, respectively. DenseFuse [17] is a novel deep learning fusion network for infrared and visible images, where densely connected blocks [9] are utilized to propagate the information of the middle layers to the last layer, improving both the flow of information between layers and the flow of gradients through the network. Moreover, the DenseFuse model is a typical encoder-decoder architecture. The encoding network is composed of convolutional layers, dense blocks and fusion layers, where the output of each layer is used as the input to the next layer. During encoding, more useful features are obtained from the source images, and a new l1-Norm fusion strategy is introduced to fuse the features. Finally, the decoding network is employed to reconstruct the fused image. Compared with the aforementioned fusion methods, DenseFuse achieves state-of-the-art performance in both objective and subjective evaluation. Considering that DenseFuse only works on a single scale, Song et al. proposed a multi-scale medical image fusion framework, MSDNet. Three different filters are applied to extract features at different scales in the encoding layer. More image details are obtained by increasing the width of the encoding network. Owing to the availability and effectiveness of conditional GANs, FusionGAN was proposed by Ma et al. to fuse infrared and visible images using a generative adversarial network. The fused image produced by the generator is expected to capture more of the details present in the visible image, with the discriminator applied to distinguish the differences between them.
Other networks are trained with a mixed loss function consisting of the modified structural similarity metric (MS-SSIM) and the mean square error (MSE). VIF-Net [8] is an end-to-end model based on a robust mixed loss function including MS-SSIM and the total variation (TV), which can adaptively fuse thermal radiation and texture details and suppress noise interference. Jung et al. [10] employed the structure tensor to compute the loss, which is defined as the sum of an intensity fidelity term and a structure tensor fidelity term. Consequently, the network outputs an image that preserves the overall contrast of the multiple images while keeping a naturalistic intensity of the putative image.
III Proposed Method
In this section, the proposed deep learning-based fusion network is introduced in detail.
III-A Background and Motivation
WaveFuse is a novel network model that introduces the wavelet transform and adds more convolutional layers to its backbone network, DenseFuse [17]. DenseFuse achieved promising results for the fusion of infrared and visible images, which proves the effectiveness of the model to a large extent. However, existing deep learning-based methods, DenseFuse included, still suffer from ineffective utilization of the extracted feature maps, which significantly limits the generalization of the learned features. Firstly, the architecture of DenseFuse is relatively simple. Compared with current large-scale networks, DenseFuse is not very deep, which limits its capability to extract more features from images. Secondly, although a comparatively effective l1-Norm-based fusion strategy is adopted in DenseFuse, it only performs a simple weighted average on the extracted feature maps and fails to adequately utilize and integrate their local information. To enable effective use of local information, we introduce a discrete wavelet transform (DWT) based decomposition and reconstruction module and adopt region-based fusion rules in the fusion layer. The architecture of the proposed network is shown in Fig.1.
III-B Network Architecture
WaveFuse is a typical encoder-decoder structure consisting of three components: an encoder, a DWT-based fusion layer and a decoder. The two input images, indexed by k = 1, 2, have been spatially aligned. Feature maps are obtained by extracting features from the input source images through the encoder, and the extracted features are subsequently transformed into the wavelet domain. After that, the adaptive fusion method is used to obtain the fused feature map. Finally, the fused image is generated by the decoder.
The encoder is mainly composed of two convolutional layers, C1 and G1, a max-pooling layer and a DenseBlock [9] module. To address the small number of layers and the insufficient feature extraction of the DenseFuse model, we added the convolutional layer G1 and a pooling layer to the encoder, and the layer G2 with a deconvolutional layer to the decoder. The kernels of both C1 and G1 are 3×3. C1 is used to initially extract features from the image, and G1 is used to generate the feature map for the wavelet decomposition.
In the DWT-based fusion layer, the feature maps are decomposed by the wavelet decomposition layer into wavelet components, namely the low-frequency component L1k and the high-frequency components: the horizontal component H1k, the vertical component V1k and the diagonal component D1k. Different fusion strategies are then employed for different components to obtain the fused wavelet components F, which comprise the low-frequency component L2 and the high-frequency components H2, V2 and D2. In our previous research, we obtained the optimal fusion strategy for the wavelet transform: the low-frequency component adopts an adaptive weighted average algorithm based on regional energy, and for the high-frequency components the one with the larger variance is selected. Finally, the fused low-frequency and high-frequency components are integrated by wavelet reconstruction to obtain the final fused feature map.
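As a rough illustration of the decompose, fuse and reconstruct steps of this layer, the sketch below applies a single-level Haar (db1) transform to one feature-map channel in plain Python. The function names, the variance computed over a whole sub-band, and the plain average used for the low-frequency band are simplifying assumptions made for illustration only; the network itself uses the regional-energy rule of Section III-E for the low-frequency component.

```python
from statistics import pvariance


def haar_dwt2(x):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns the low-frequency band L and the detail bands H, V, D
    (the H/V naming convention differs between libraries).
    """
    rows, cols = len(x) // 2, len(x[0]) // 2
    L, H, V, D = ([[0.0] * cols for _ in range(rows)] for _ in range(4))
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            L[i][j] = (a + b + c + d) / 2
            H[i][j] = (a + b - c - d) / 2
            V[i][j] = (a - b + c - d) / 2
            D[i][j] = (a - b - c + d) / 2
    return L, H, V, D


def haar_idwt2(L, H, V, D):
    """Inverse of haar_dwt2: rebuilds the full-resolution 2D list."""
    rows, cols = len(L), len(L[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            l, h, v, d = L[i][j], H[i][j], V[i][j], D[i][j]
            x[2 * i][2 * j] = (l + h + v + d) / 2
            x[2 * i][2 * j + 1] = (l + h - v - d) / 2
            x[2 * i + 1][2 * j] = (l - h + v - d) / 2
            x[2 * i + 1][2 * j + 1] = (l - h - v + d) / 2
    return x


def band_variance(band):
    """Population variance of all coefficients in one sub-band."""
    return pvariance([v for row in band for v in row])


def fuse_feature_maps(f1, f2):
    """Fuse two single-channel feature maps in the Haar wavelet domain."""
    L1, *high1 = haar_dwt2(f1)
    L2, *high2 = haar_dwt2(f2)
    # Placeholder low-frequency rule (assumption): plain average of the two bands.
    low = [[(p + q) / 2 for p, q in zip(r1, r2)] for r1, r2 in zip(L1, L2)]
    # High-frequency rule: keep the band with the larger variance.
    high = [h1 if band_variance(h1) >= band_variance(h2) else h2
            for h1, h2 in zip(high1, high2)]
    return haar_idwt2(low, *high)
```

Fusing a feature map with itself returns the map unchanged, which is a quick sanity check that the decomposition and reconstruction are exact inverses.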
The decoder is mainly composed of the deconvolution layer G2 and convolutional layers C2-C5. The fused feature map is first enlarged through G2 and upsampled by the deconvolutional layer. Then, the reconstructed fused image is finally obtained by C2-C5.
In the field of image fusion, it is challenging to obtain effective fusion rules under end-to-end supervision. Thus, the main goal of our training process is to ensure that the decoder can reconstruct the image from the features encoded by the encoder with the lowest loss of image quality, so that the effective features learned during training can subsequently be leveraged for fusion. Additionally, no trainable parameters are involved in the fusion layer. Therefore, the DWT-based fusion part in Fig.1 is discarded in training. The training model is illustrated in Fig.2. We train our network on 70,000 images from COCO, all of which are resized to 256×256 and converted to grayscale. The batch size and the number of epochs are set to 64 and 50, respectively. The learning rate is . The proposed method is implemented in PyTorch 1.1.0 with Adam as the optimizer, and an NVIDIA RTX 2080 Ti GPU is used for training. In our practical training process, we find that a comparatively small dataset of 300-700 images chosen randomly from COCO can still achieve comparable fusion quality. The learning parameters in this case are as follows: the learning rate is set as , and the batch size and the number of epochs are 4 and 500, respectively. Therefore, the computational training cost of our network can be reduced remarkably, and our model clearly outperforms current deep learning models in training cost.
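To make the difference in training cost between the two settings concrete, the short calculation below counts optimizer iterations and image passes from the dataset sizes, batch sizes and epoch counts stated above. The helper name `training_budget` is hypothetical, and the calculation assumes that a final partial batch is still processed; it counts images seen and says nothing about wall-clock time.

```python
import math


def training_budget(num_images, batch_size, epochs):
    """Count optimizer iterations and total image passes for one training run."""
    iters_per_epoch = math.ceil(num_images / batch_size)  # last partial batch kept
    return {
        "iterations": iters_per_epoch * epochs,
        "image_passes": num_images * epochs,
    }


# Full setting: 70,000 COCO images, batch size 64, 50 epochs.
full = training_budget(70_000, 64, 50)

# Small-dataset settings: 300, 500 or 700 images, batch size 4, 500 epochs.
mini = {n: training_budget(n, 4, 500) for n in (300, 500, 700)}

for n, budget in mini.items():
    ratio = full["image_passes"] / budget["image_passes"]
    print(f"{n} images: {budget['image_passes']} image passes, "
          f"{ratio:.1f}x fewer than the full setting")
```

Under these assumptions the full setting processes 3,500,000 images in total, while the small-dataset settings process between 150,000 and 350,000, roughly 10 to 23 times fewer.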
III-D Loss Function
The loss function used to train WaveFuse is given in Eq.1. It is a weighted combination of a pixel loss and a structural similarity loss, where the best weight is set to 1000 according to [17].
The pixel loss is given by Eq.2 and is computed between the input image to the encoder and the output image of the decoder. The structural similarity loss is calculated by Eq.3. The structural similarity index metric (SSIM) is a widely used perceptual image quality metric, which combines the three components of luminance, structure and contrast to comprehensively measure image quality.
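For readers who want the three equations in explicit form, the description above matches the training loss of DenseFuse [17], from which the weight value is taken. The following is a sketch under that assumption: the symbols $\lambda$ (weight), $I$ (encoder input) and $O$ (decoder output) are notation chosen here, the weight is assumed to multiply the structural similarity term as in DenseFuse, and the exact norm and SSIM settings used for WaveFuse may differ.

```latex
\begin{align}
L &= \lambda \, L_{ssim} + L_{p}, \qquad \lambda = 1000 \\
L_{p} &= \lVert O - I \rVert_{2} \\
L_{ssim} &= 1 - \mathrm{SSIM}(O, I)
\end{align}
```

With this form, a perfect reconstruction gives $L_{p} = 0$ and $\mathrm{SSIM}(O, I) = 1$, so the total loss is zero.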
III-E Fusion Strategy
As mentioned above, the selection of fusion rules largely determines the quality of the fused images. Existing deep learning-based image fusion algorithms mostly add the feature maps directly, leaving the information in the feature maps not fully mined. In our method, a multi-scale wavelet transform based on regional energy is applied to the feature maps, which are thereby processed at different scales, leading to a prominent improvement in the quality of the fused image.
The energy of the region centered at (m,n) is defined in Eq.4:
where the summation limits are the maximum row and column indices of the local region, and the window entries are weighting coefficients. The matching degree of the two feature maps is defined in Eq.5-6:
where the terms entering the sums are the wavelet coefficients of the wavelet decomposition. For the energy matching degree defined in Eq.6, an appropriate matching threshold T (between 0.5 and 1) should be selected, and T is set to 0.8 in our network. When the matching degree is below T, the energies of the two feature maps in this region differ greatly. In this case, the central pixel of the region with the larger energy is selected as the central pixel of the fused feature map, as calculated by Eq.7.
On the contrary, when the matching degree is at least T, the two feature maps have similar energy in this region. Consequently, a weighted fusion strategy is used to determine the central pixel of the fused feature map, as shown in Eq.8-9.
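The selection and weighting rule described in this subsection can be sketched as follows. This is a minimal illustration that assumes the classical regional-energy formulation (a weighted local energy, a normalized cross-energy as the matching degree, and weights that move from 0.5 towards 1 as the match falls to the threshold). The 3×3 window, its weights, the border clamping and all function names are assumptions and are not taken from the paper.

```python
# Assumed 3x3 weighting window; the paper's actual coefficients are not given here.
GAUSS_3X3 = [[1 / 16, 2 / 16, 1 / 16],
             [2 / 16, 4 / 16, 2 / 16],
             [1 / 16, 2 / 16, 1 / 16]]


def window_sum(a, b, m, n, weights):
    """Weighted sum of a*b over the window centred at (m, n); borders are clamped."""
    r = len(weights) // 2
    rows, cols = len(a), len(a[0])
    total = 0.0
    for dm in range(-r, r + 1):
        for dn in range(-r, r + 1):
            i = min(max(m + dm, 0), rows - 1)
            j = min(max(n + dn, 0), cols - 1)
            total += weights[dm + r][dn + r] * a[i][j] * b[i][j]
    return total


def fuse_low_frequency(c1, c2, threshold=0.8, weights=GAUSS_3X3):
    """Fuse two coefficient maps with a regional-energy rule (threshold must be < 1)."""
    rows, cols = len(c1), len(c1[0])
    fused = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            e1 = window_sum(c1, c1, m, n, weights)  # regional energy of map 1
            e2 = window_sum(c2, c2, m, n, weights)  # regional energy of map 2
            if e1 + e2 == 0:
                continue  # both regions are all zero, so the fused value stays 0
            match = 2 * window_sum(c1, c2, m, n, weights) / (e1 + e2)
            if e1 >= e2:
                big, small = c1[m][n], c2[m][n]
            else:
                big, small = c2[m][n], c1[m][n]
            if match < threshold:
                # Energies differ strongly: select the higher-energy coefficient.
                fused[m][n] = big
            else:
                # Energies are similar: weighted average favouring the larger one.
                w_min = 0.5 - 0.5 * (1 - match) / (1 - threshold)
                fused[m][n] = (1 - w_min) * big + w_min * small
    return fused
```

With identical inputs the matching degree is 1 and both weights are 0.5, so the input is returned unchanged; when one input is zero everywhere, the matching degree is 0 and the non-zero input is selected.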
To preserve more structural information and make the fused image more natural, we also apply the l1-Norm strategy in our network, producing a second fused feature map. The final fused features are calculated by Eq.10 as a combination of the two, with a weighting coefficient that is set to different values for different scenarios to achieve the optimal fusion performance. In our experiments, the coefficient is set to 0.6 for infrared and visible images, 1 for multi-focus images and 0.4 for multi-modal medical images.
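The following sketch shows one way the l1-Norm branch and the final combination could look. It follows the l1-norm weighting idea of DenseFuse (per-pixel l1-norm across channels, a local mean, then normalized weights). The convex blend, and in particular the assumption that the coefficient multiplies the wavelet-domain result and its complement multiplies the l1-Norm result, are illustrative guesses rather than the paper's Eq.10. All names, including `ALPHA`, are hypothetical.

```python
def l1_activity(features):
    """Per-pixel l1-norm across channels; features is indexed [channel][row][col]."""
    rows, cols = len(features[0]), len(features[0][0])
    return [[sum(abs(ch[i][j]) for ch in features) for j in range(cols)]
            for i in range(rows)]


def box_mean(a, r=1):
    """Mean over a (2r+1)x(2r+1) window, truncated at the borders (assumption)."""
    rows, cols = len(a), len(a[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            vals = [a[p][q]
                    for p in range(max(i - r, 0), min(i + r, rows - 1) + 1)
                    for q in range(max(j - r, 0), min(j + r, cols - 1) + 1)]
            out[i][j] = sum(vals) / len(vals)
    return out


def l1_norm_fuse(f1, f2):
    """Weighted average of two feature stacks with l1-norm activity weights."""
    a1 = box_mean(l1_activity(f1))
    a2 = box_mean(l1_activity(f2))
    fused = []
    for ch1, ch2 in zip(f1, f2):
        out = []
        for i, (r1, r2) in enumerate(zip(ch1, ch2)):
            row = []
            for j, (p, q) in enumerate(zip(r1, r2)):
                total = a1[i][j] + a2[i][j]
                w1 = a1[i][j] / total if total > 0 else 0.5
                row.append(w1 * p + (1 - w1) * q)
            out.append(row)
        fused.append(out)
    return fused


def blend(wavelet_fused, l1_fused, alpha):
    """Assumed convex combination: alpha * wavelet branch + (1 - alpha) * l1 branch."""
    return [[[alpha * w + (1 - alpha) * l for w, l in zip(rw, rl)]
             for rw, rl in zip(cw, cl)]
            for cw, cl in zip(wavelet_fused, l1_fused)]


# Coefficient values reported in the text for the three scenarios.
ALPHA = {"infrared_visible": 0.6, "multi_focus": 1.0, "medical": 0.4}
```

Under this assumed form, a coefficient of 1 for multi-focus images means the final features come from a single branch only.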
IV Experimental Results and Analysis
In this section, to validate the effectiveness and generalization of WaveFuse, we first compare it with several state-of-the-art methods on three fusion tasks: infrared and visible, multi-focus and multi-modal medical image fusion. For quantitative comparison, we use eight metrics to evaluate the fusion results. Moreover, we evaluate the fusion performance of the proposed method when trained with small datasets. Finally, we also conduct fine-tuning experiments on the wavelet parameters to further improve fusion performance.
IV-A Experimental Parameters Setting
The test data for the three scenarios are available online, and each scenario contains six pairs of images. The source images of the three scenarios are shown in Fig.3.
For comparison, WaveFuse is evaluated against seven representative methods: discrete wavelet transform (DWT) [12], the cross bilateral filter method (CBF) [11], convolutional sparse representation (ConvSR), the weighted least square optimization-based method (WLS), the ResNet50 and zero-phase component analysis fusion framework (ResZCA) [15], the GAN-based fusion algorithm (FusionGAN) and DenseFuse [17]. All seven comparative methods were implemented from publicly available code, with the parameters set according to the original papers. Note that ResZCA and FusionGAN are designed for infrared and visible images, so they are only compared in the infrared and visible image fusion task.
Due to the diversity of image fusion scenarios, it is difficult to evaluate the quality of fused images objectively and comprehensively with a unified framework of metrics. The commonly used evaluation methods can be classified into two categories: subjective evaluation and objective evaluation. Subjective visual evaluation is susceptible to human factors, such as eyesight, subjective preference and individual emotion. Furthermore, in most cases no prominent difference among the fusion results can be observed by subjective evaluation. In contrast, objective evaluation is a relatively accurate and quantitative approach based on mathematical and statistical models. In our experiments, we adopted the following objective evaluation metrics: information entropy (EN), mutual information (MI), Qabf, multi-scale structural similarity (MS-SSIM), visual information fidelity (VIF), standard deviation (STD), average gradient (AVG) and edge intensity (EIN).
EN and MI are used to measure the richness of information in the fused image. The larger EN and MI are, the richer the information contained in the image and the higher the quality of the fused image. Qabf is an objective no-reference quality evaluation index for fused images. It uses local metrics to estimate how well significant information from the inputs is represented in the fused image. A higher Qabf value means better quality of the fused image. MS-SSIM is an extension of SSIM and is more consistent with the perception of the human visual system. VIF follows the human visual system and computes the distortion between two random variables. STD is based on statistical characteristics. A larger STD indicates a higher gray-level dispersion of the image, and hence higher information richness. AVG and EIN are based on gradients, reflecting the differences in image details and the texture changes, respectively. For all eight quality metrics, larger values indicate better fusion results.
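Three of the simpler metrics can be computed directly from pixel values. The sketch below uses common textbook definitions of EN, STD and AVG for a grayscale image stored as a 2D list of integer gray levels. These definitions are assumptions for illustration, since the exact formulas and normalizations used in the experiments are not spelled out here.

```python
import math
from collections import Counter


def entropy(img):
    """EN: Shannon entropy (in bits) of the gray-level histogram."""
    pixels = [v for row in img for v in row]
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def std_dev(img):
    """STD: population standard deviation of the gray levels."""
    pixels = [v for row in img for v in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((v - mean) ** 2 for v in pixels) / len(pixels))


def average_gradient(img):
    """AVG: mean of sqrt((dx^2 + dy^2) / 2) over forward differences."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for i in range(rows - 1):
        for j in range(cols - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += math.sqrt((dx * dx + dy * dy) / 2)
    return total / ((rows - 1) * (cols - 1))


# A two-level test image: half the pixels are 0, half are 255.
sample = [[0, 255], [0, 255]]
print(entropy(sample), std_dev(sample), average_gradient(sample))
```

A constant image scores zero on all three metrics, which matches the intuition that it carries no information, no contrast and no detail.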
IV-B Comparison to Other Methods
IV-B1 Subjective Evaluation
Examples of the original image pairs and the fusion results obtained by each comparative method for the three scenarios are shown in Fig.4. The red boxes mark the regions of interest to focus on in the fusion results.
Infrared/Visible Image Fusion: Visible images capture more detail information than infrared images. However, objects of interest cannot be easily observed in visible images, especially under low contrast and insufficient light. Infrared images provide thermal radiation information, making it easy to detect salient objects even against complex backgrounds. Thus, the fused image can provide more complementary information. Fig.4 (a1-a8) shows the infrared and visible image fusion results of the compared methods. CBF and ConvSR exhibit significant artifacts and unclear salient objects. For the "door" boxed in red, the results of DWT and DenseFuse show weakened contrast. WaveFuse preserves more details with high contrast and brightness.
Multi-focus Image Fusion: Multi-focus image fusion aims to reconstruct a fully focused image from partly focused images of the same scene. From Fig.4 (b1-b6), we can observe that all the compared methods perform well. For the number "8" in the red box, CBF and WaveFuse outperform the other methods, with higher resolution.
Multi-modal Medical Image Fusion: Multi-modal medical image fusion can offer more accurate and effective information for biomedical research and clinical applications. A good multi-modal medical fused image should combine the features of both sources sufficiently and preserve their significant textural features. As shown in Fig.4 (c1-c6), ConvSR shows obvious artifacts over the whole image. DWT and CBF fail to preserve the crucial features of the source images. DenseFuse shows better visual results than the aforementioned methods, but it still weakens the details and brightness. Information-rich fused images can be obtained by WLS. In contrast, our method preserves the details and edge information of both source images, which is more in line with the perceptual characteristics of human vision than the other fusion methods.
IV-B2 Objective Evaluation
The main purpose of image fusion is to increase the richness of image information, so EN and MI are the most important evaluation metrics in the three fusion tasks. Given the differences among scenarios, the emphasis placed on the evaluation metrics should differ between fusion tasks. For infrared and visible images, SSIM and VIF are also important metrics, ensuring the information retention and visual information integrity of each band. In multi-focus images, detail information should be preserved, so STD and EIN are more worthy of reference. In multi-modal medical images, STD and AVG should be given priority. Besides, from Fig.4 (a2-a3) and (b2-b3), we can observe that the fusion results of CBF and ConvSR for infrared and visible images and medical images show poor visual quality owing to considerable artificial noise, so their objective quality metrics are not taken into account in the quantitative evaluation.
Table I shows the average values of the fusion quality metrics for the three fusion tasks and the different fusion methods. In infrared and visible image fusion, our method has the highest EN, MI and STD, while MS-SSIM, AVG, EIN and VIF rank second. In multi-focus image fusion, EN, MI, STD, AVG and EIN rank first, and VIF ranks fourth. In multi-modal medical image fusion, EN, MI, STD and EIN rank first, MS-SSIM, AVG and VIF rank second, and Qabf ranks third. Fig.5-7 show the curves of the fusion quality metrics obtained for the three kinds of fused images. In Fig.5-7, the results show the consistency of our fusion performance across the six pairs of images, which demonstrates the robustness and universality of our method. Therefore, given our emphasis on different fusion metrics in different scenarios, our proposed method achieves the best performance.
IV-C Comparison of Using Different Training Datasets
In order to further demonstrate the effectiveness and robustness of our network, we conducted experiments on three additional training minisets, MINI1-MINI3, which contain 300, 500 and 700 images, respectively, chosen randomly from COCO. The fusion results are shown in Fig.8 and Table II. We compared the fusion results of DenseFuse and WaveFuse trained on COCO and on MINI1-MINI3. The same three sets of images and parameter values are used for testing. The fusion performance was compared and analyzed using the averaged fusion quality metrics. From the fusion results in Fig.8, no obvious visual difference can be found among the fused images of either DenseFuse or WaveFuse when different training sets are chosen. However, compared with DenseFuse, the fused images obtained by WaveFuse exhibit higher resolution and contrast. Objective metrics are then employed to evaluate the fusion performance.
The fusion metrics of DenseFuse and WaveFuse trained with the different training datasets are shown in Table II. For the three sets of fused images, the objective metrics show no significant difference across training sets for either DenseFuse or WaveFuse. Overall, higher fusion metrics are obtained with COCO than with MINI1-MINI3, and WaveFuse outperforms DenseFuse in all scenarios for most of the quality metrics. For WaveFuse, some metrics are even higher when training on the minisets. Accordingly, our proposed network is robust both to the size of the training dataset and to the selection of training images, so it can be trained at lower computational cost.
IV-D Comparison of Using Different Wavelet Decomposition Layers and Bases
In the wavelet transform, the number of decomposition layers and the choice of wavelet base can greatly affect the effectiveness of the transform. In the following experiments on COCO, different numbers of decomposition layers and different wavelet bases are selected to further optimize the proposed method.
IV-D1 Experiments on Different Wavelet Decomposition Layers
We vary the number of decomposition layers from 1 to 4, with the wavelet base set to sym2 in this experiment.
From Table III, we can clearly see that higher fusion metrics, together with higher brightness and contrast of the fused images, are obtained as the number of decomposition layers increases. However, when the number of decomposition layers is set to 3, the results exhibit a little artificial noise in the visible and infrared images, and when it is set to 4, all fused images except the multi-focus ones contain obvious noise. From the above analysis, more decomposition layers are not always better, and the evaluation of fused images should combine subjective and objective evaluation methods.
IV-D2 Experiments on Different Wavelet Bases
For the comparison of different wavelet bases, we set the number of decomposition layers to 3 and choose four bases: sym2, sym3, db1 and rbio6.8. From a subjective point of view, it is difficult to distinguish which wavelet base achieves better fusion performance. According to the objective evaluation metrics in Table IV, the fusion quality of the wavelet base db1 is the highest in the three application scenarios. These two experiments show that our proposed method can be further improved by selecting an appropriate number of decomposition layers and an appropriate wavelet base, providing a new direction for follow-up improvement of our method.
In this paper, we propose a novel image fusion method that combines a multi-scale wavelet transform based on regional energy with deep learning. To the best of our knowledge, this is the first time that a conventional technique has been integrated into the pipeline of a deep learning-based image fusion method, and we think there are still many possibilities to explore in this direction.
Our network consists of three parts: an encoder, a DWT-based fusion layer and a decoder. The features of the input images are extracted by the encoder, the adaptive fusion strategy is then applied in the fusion layer to obtain the fused features, and finally the fused image is reconstructed by the decoder. Compared with current leading fusion algorithms, our proposed method achieves better performance. Additionally, our network has strong universality and can be applied to image fusion in various scenarios. At the same time, it does not rely on big datasets: it can be trained on comparatively small datasets, with shorter training time and higher efficiency, to obtain fusion results comparable to those obtained by training on large datasets. Moreover, extensive experiments on different wavelet decomposition layers and bases show that our method can be improved further. Therefore, our network has conspicuous advantages over current deep learning-based algorithms.
[1] (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495.
[2] (2015) PCANet: a simple deep learning baseline for image classification?. IEEE Transactions on Image Processing 24 (12), pp. 5017–5032.
[3] (2017) Image segmentation-based multi-focus image fusion through multi-scale convolutional neural network. IEEE Access 5, pp. 15750–15761.
[4] (2007) Image fusion: advances in the state of the art. Information Fusion 8 (2), pp. 114–118.
[5] (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[6] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[7] (2008) Comments on 'Information measure for performance of image fusion'. Electronics Letters 44 (18), pp. 1066–1067.
[8] (2020) VIF-Net: an unsupervised framework for infrared and visible image fusion. IEEE Transactions on Computational Imaging 6, pp. 640–651.
[9] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
[10] (2020) Unsupervised deep image fusion with structure tensor representations. IEEE Transactions on Image Processing 29, pp. 3845–3858.
[11] (2015) Image fusion based on pixel significance using cross bilateral filter. Signal, Image and Video Processing 9 (5), pp. 1193–1204.
[12] (1995) Multisensor image fusion using the wavelet transform. Graphical Models and Image Processing 57 (3), pp. 235–245.
[13] (2019) Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Physics & Technology 102, pp. 103039.
[14] (2019) Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Physics & Technology 102, pp. 103039.
[15] (2019) Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Physics & Technology 102, pp. 103039.
[16] (2018) Infrared and visible image fusion using a deep learning framework. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2705–2710.
[17] (2018) DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623.
[18] (2017) Pixel-level image fusion: a survey of the state of the art. Information Fusion 33, pp. 100–112.
[19] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
-  (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §I.
-  (2017) A medical image fusion method based on convolutional neural networks. In 2017 20th International Conference on Information Fusion (Fusion), pp. 1–7. Cited by: §II.
-  (2017) Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207. Cited by: §I, §I, §I, §II, §II.
-  (2018) Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173. Cited by: §I, §I, §I.
-  (2016) CONVSR. IEEE signal processing letters 23 (12), pp. 1882–1886. Cited by: §II.
-  (2016) Image fusion with convolutional sparse representation. IEEE signal processing letters 23 (12), pp. 1882–1886. Cited by: §IV-A.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §I.
-  (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: §IV-A.
-  (2017) Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Physics & Technology 82, pp. 8–17. Cited by: §IV-A.
-  (2015) Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing 24 (11), pp. 3345–3356. Cited by: §IV-A, §IV-A.
-  (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §II.
-  (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621. Cited by: §I.
-  (2017) DeepFuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs.. In ICCV, pp. 4724–4732. Cited by: §I, §II, §II.
-  (2018) Hybrid multimodality medical image fusion technique for feature enhancement in medical diagnosis. International Journal of Engineering Science Invention 2 (Special issue), pp. 52–60. Cited by: §IV-A.
-  (1997) In-fibre bragg grating sensors. Measurement science and technology 8 (4), pp. 355. Cited by: §IV-A.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §I.
-  (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522. Cited by: §IV-A.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §I.
-  (2006) Image information and visual quality. IEEE Transactions on image processing 15 (2), pp. 430–444. Cited by: §IV-A.
-  (2006) Improved on the approach of image fusion based on region-energy [j]. Journal of Projectiles, Rockets, Missiles and Guidance 4. Cited by: §III-E, §III-E.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §II.
-  (2019) MSDNet for medical image fusion. In International Conference on Image and Graphics, pp. 278–288. Cited by: §I, §I, §II, §II.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §III-D, §IV-A.
-  (2000) Objective image fusion performance measure. Electronics letters 36 (4), pp. 308–309. Cited by: §IV-A, §IV-A.
-  (2017) Multi-focus image fusion and super-resolution with convolutional neural network. International Journal of Wavelets, Multiresolution and Information Processing 15 (04), pp. 1750037. Cited by: §II.