
WaveFuse: A Unified Deep Framework for Image Fusion with Wavelet Transform

We propose an unsupervised image fusion architecture for multiple application scenarios that combines the multi-scale discrete wavelet transform based on regional energy with deep learning. To the best of our knowledge, this is the first time a conventional image fusion method has been combined with deep learning. In the proposed method, the useful information in the feature maps is exploited adequately through the multi-scale discrete wavelet transform. Compared with other state-of-the-art fusion methods, the proposed algorithm exhibits better fusion performance in both subjective and objective evaluation. Moreover, fusion performance comparable to that obtained by training on the COCO dataset can be achieved by training on a much smaller dataset of only a few hundred images chosen randomly from COCO. Hence, the training time is shortened substantially, improving both the practicality and the training efficiency of the model.



I Introduction

Image fusion is the technique of integrating information from different types of images obtained from different sensors, so as to improve the accuracy and richness of the information contained in one image [4]. Image fusion can compensate for the limitations of single imaging sensors, and the technique has developed rapidly in recent years [4, 18, 23]. In the process of image fusion, the selection of the active feature maps and the fusion rules are the two key factors determining the quality of the fused image [22]. The feature maps contain a measurement of the activity level at each pixel location, serving as the basis for weight allocation among the different sources, and the fusion rule also plays an indispensable role [23]. Recently, with the continuous progress of image fusion algorithms and the wide availability of different kinds of imaging devices, the application of image fusion has become more and more extensive. For example, in medical imaging applications, images of different modalities can be fused to achieve more reliable and precise medical diagnosis [41]. In military surveillance applications, image fusion can integrate information from different electromagnetic spectra (such as the visible and infrared bands) to achieve night vision [18]. Due to the rapid development of artificial intelligence, multi-sensor image fusion has become a hot spot in clinical diagnosis, industrial production and military research.


From the perspective of obtaining active feature maps, image fusion can be divided into two categories: conventional and deep learning-based image fusion algorithms.

Conventional methods include transform domain algorithms and spatial domain algorithms [18]. In transform domain algorithms, the active feature maps are represented by the decomposition coefficients of the multi-scale transform. Unlike the transform domain algorithms, the spatial domain algorithms transform the image into a single-scale feature through advanced signal representation methods. Regardless of whether it is a transform domain algorithm or a spatial domain algorithm, the measurement of the activity level is obtained through a specific hand-crafted filter. However, due to the limitation of computational cost and difficulty in implementation, it is still a demanding task to design an ideal activity level measurement method or fusion strategy in practical applications, taking all the key issues of image fusion into full consideration [22].

Nowadays, deep learning has been widely employed in image processing and computer vision, for tasks such as image segmentation [1, 5, 26, 37], classification [2, 31] and object detection [35, 20, 19, 6]. Traditional pattern recognition contains three key steps, namely feature extraction, selection and prediction, which correspond to a large extent to image transformation, activity level measurement and fusion rules in image fusion. Meanwhile, the convolutional neural network (CNN) can learn the most effective features from a large amount of training data to better solve the problem of pattern recognition. Therefore, the application of CNNs to image fusion also has great theoretical potential, introducing a new perspective on the measurement of activity level. That is, a CNN can be used to automatically extract the fused features and learn the direct mapping from source images to active feature maps.

In recent image fusion research based on deep learning [22, 32, 17, 41, 3, 23, 13, 14], fusion using features learned by a CNN achieved higher quality than traditional fusion approaches, but some existing drawbacks hinder further improvement. First, in the existing studies, the feature maps obtained through deep learning were not fully utilized, with only a weighted average of the feature maps being calculated. Second, traditional image fusion techniques (such as multi-resolution analysis and consistency verification) cannot be ignored, but no studies have yet explored the combination of conventional image fusion algorithms and deep learning algorithms. Finally, previous deep learning-based works [22, 32, 17, 41, 13, 14] failed to address the problems of training time and memory cost.

To solve these problems, we propose an unsupervised image fusion algorithm that combines the multi-scale discrete wavelet transform (DWT) based on regional energy with deep learning. To the best of our knowledge, this is the first integration of conventional image fusion techniques and deep learning. We propose an architecture consisting of an encoder, a DWT-based fusion layer and a decoder. To make the best use of the information in the feature maps, we use the DWT in the fusion layer to transform the feature maps obtained from the encoder into the wavelet domain, where adaptive fusion rules are applied separately to the low and high frequencies. Finally, the inverse wavelet transform is used to reconstruct the fused feature map, which is decoded by the decoder to obtain the final fused image. With this additional processing of the feature maps by DWT, the quality of the fused image is remarkably improved.

The main contributions are summarized as follows:

(1) An unsupervised multi-scene image fusion architecture is proposed based on the combination of multi-scale discrete wavelet transform and deep learning.

(2) With multi-level decomposition in DWT, the useful information in the feature maps can be fully exploited. Moreover, a region-based fusion strategy is adopted to capture more detail information. Extensive experiments demonstrate the superiority of our network over state-of-the-art fusion methods.

(3) Our network can be trained on a comparatively small dataset at low computational cost, with fusion performance comparable to that obtained by training on the COCO dataset. Our experiments show that both the quality of the fused images and the training efficiency are improved.

Our paper is structured as follows. In Section II, we briefly review related works. In Section III, the proposed network and its feasibility are introduced in detail. Section IV presents the experimental results and analysis. In the last section, we give the conclusions of our paper.

II Related Works

In deep learning-based fusion methods, a CNN is designed to capture deep features from the source images effectively. In [24], Liu et al. proposed a fusion method based on convolutional sparse representation (CSR), where multi-scale and multi-layer features are employed to construct the fused output. CNNs were applied to multi-focus image fusion for the first time by Liu et al. [22]. This method directly learns the mapping from the source image to the focus map through deep learning. By virtue of the CNN model, the activity level measurement and the fusion rules can be determined simultaneously, thus overcoming the difficulty that existing fusion methods face in fulfilling these two tasks at the same time.

Following [22], Liu et al. [21] extended the CNN model to multi-modal medical image fusion. CNNs are used to generate a weight map representing the pixel activity information of the source images, and the fusion process is performed in a multi-scale way through an image pyramid, which is more consistent with human visual perception. In addition, a strategy based on local similarity is applied to adjust the fusion rules adaptively through the decomposed coefficients. Du and Gao [3] proposed a novel multi-focus image fusion method based on image segmentation through a multi-scale CNN (MSCNN). Yang et al. [44] proposed a unified framework for simultaneous image fusion and super-resolution. DeepFuse [32] is the first network for multi-exposure image fusion using deep learning methods. This network can effectively fuse images from different exposure levels with no artifacts and high fusion quality.

In DeepFuse [32], the network fuses only the features extracted from its last layer, losing much useful feature information from the middle layers. To resolve this issue, Li et al. [16], [15] proposed two novel fusion frameworks based on pretrained networks (VGG-19 [40] and ResNet-50 [6]), which are used for deep feature extraction. In [15], a new deep fusion framework based on zero-phase component analysis (ZCA) was proposed. The residual network and ZCA are used to extract deep features from the source image and to obtain an initial weight map, respectively.

DenseFuse [17] is a deep learning fusion network for infrared and visible images, where densely connected blocks [9] are utilized to propagate the information in the middle layers to the last layer, improving both the flow of information between layers and the flow of gradients through the network. The DenseFuse model is a typical encoder-decoder architecture. The encoding network is composed of convolutional layers, dense blocks and fusion layers, where the output of each layer is used as the input to the next layer. In the process of encoding, more useful features are obtained from the source image, and a new l1-Norm fusion strategy is introduced to fuse the features. Finally, the decoding network is employed to reconstruct the fused image. Compared with the aforementioned fusion methods, DenseFuse achieves state-of-the-art performance in both objective and subjective evaluation. Considering that DenseFuse only works on a single scale, Song et al. [41] proposed a multi-scale medical image fusion framework, MSDNet. Three different filters are applied to extract features at different scales in the encoding layer, and more image details are obtained by increasing the width of the encoding network.

Owing to the availability and effectiveness of conditional GANs [30], FusionGAN was proposed by Ma et al. [27] to fuse infrared and visible images using a generative adversarial network. The fused image produced by the generator is expected to capture more of the details present in the visible image, with the discriminator being applied to distinguish differences between them.

Fig. 1: Architecture of the proposed WaveFuse image fusion network. The feature maps learned by the encoder from the input images are processed by the multi-scale wavelet transform, and finally the fused feature maps are used by the decoder to reconstruct the fused image.

Additionally, due to the lack of labeled datasets, the existing deep learning-based architectures [17], [13], [41], [16] are trained with a mixed loss function consisting of the modified structural similarity metric (MS-SSIM) and the mean square error (MSE). VIF-Net [8] is an end-to-end model based on a robust mixed loss function including MS-SSIM and the total variation (TV), which can adaptively fuse thermal radiation and texture details and suppress noise interference. In [10], Jung et al. employed the structure tensor to compute the loss, which is defined as the sum of an intensity fidelity term and a structure tensor fidelity term. Consequently, the network outputs an image preserving the overall contrast of the multiple images, while containing a naturalistic intensity of the putative image [10].

III Proposed Method

In this section, the proposed deep learning-based fusion network is introduced in detail.

Fig. 2: The architecture of the training process. The objective of training is to obtain an encoder-decoder network that tries to make the input and the output image as similar as possible.

III-A Background and Motivation

WaveFuse is a novel network model that introduces the wavelet transform and adds more convolution layers to its backbone network, DenseFuse [17]. DenseFuse achieved promising results for the fusion of infrared and visible images, which demonstrates the effectiveness of the model to a large extent. However, existing deep learning-based methods such as DenseFuse still suffer from ineffective utilization of the extracted feature maps, which significantly limits the generalization of the learned features. Firstly, the architecture of this model is relatively simple. Compared with current large-scale networks, DenseFuse is not very deep, which limits its capability to extract more features from images. Secondly, although a comparatively effective l1-Norm-based fusion strategy is adopted in DenseFuse, it only performs a simple weighted average of the extracted feature maps and fails to utilize and integrate their local information adequately. To enable effective use of local information, we introduce a decomposition and reconstruction module based on the discrete wavelet transform (DWT) and adopt region-based fusion rules in the fusion layer. The architecture of the proposed network is shown in Fig.1.

III-B Network Architecture

WaveFuse is a typical encoder-decoder structure, consisting of three components: an encoder, a DWT-based fusion layer and a decoder. The two input images, indexed by k = 1, 2, are assumed to be spatially aligned. Feature maps are obtained by extracting features from the input source images through the encoder, and the extracted features are subsequently transformed into the wavelet domain. After that, the adaptive fusion method is used to obtain the fused feature map. Finally, the fused image is generated by the decoder.

The encoder is mainly composed of two convolutional layers, C1 and G1, a max-pooling layer and a DenseBlock [9] module. To address the small number of layers and the insufficient feature extraction of the DenseFuse model, we added the convolutional layers G1 and G2, together with pooling and deconvolutional layers, to the encoding and decoding processes, respectively. The kernels of both C1 and G1 are 3 × 3. C1 is used to initially extract features from the image, and G1 is used to generate the feature map for the wavelet decomposition.

In the DWT-based fusion layer, the feature maps are decomposed through the wavelet decomposition layer to obtain the wavelet components, which are divided into a low-frequency component L1k and three high-frequency components: the horizontal component H1k, the vertical component V1k and the diagonal component D1k. Different fusion strategies are employed for the different components to obtain the fused wavelet components F, which comprise the low-frequency component L2 and the high-frequency components H2, V2 and D2. In our previous research, we obtained the optimal fusion strategy for the wavelet transform: the low-frequency component is fused by an adaptive weighted average algorithm based on regional energy, and for the high-frequency components, the one with the larger variance is selected. Finally, the fused low-frequency and high-frequency components are integrated by wavelet reconstruction to obtain the final fused feature map.
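The decomposition into one low-frequency band and three directional high-frequency bands, followed by reconstruction, can be illustrated with a one-level 2-D Haar transform. This is only a minimal NumPy sketch: the function names `haar_dwt2` and `haar_idwt2` are hypothetical, the Haar (db1) wavelet is an assumption chosen for simplicity, and the labelling of the horizontal and vertical bands follows one common convention rather than the paper's implementation.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT of an array with even height and width.

    Returns (low, horiz, vert, diag), each half the size of x.
    """
    a = x[0::2, 0::2]  # top-left sample of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0    # local average (low frequency)
    horiz = (a + b - c - d) / 2.0  # difference between rows
    vert = (a - b + c - d) / 2.0   # difference between columns
    diag = (a - b - c + d) / 2.0   # diagonal difference
    return low, horiz, vert, diag

def haar_idwt2(low, horiz, vert, diag):
    """Inverse of haar_dwt2: rebuild the full-resolution array."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (low + horiz + vert + diag) / 2.0
    x[0::2, 1::2] = (low + horiz - vert - diag) / 2.0
    x[1::2, 0::2] = (low - horiz + vert - diag) / 2.0
    x[1::2, 1::2] = (low - horiz - vert + diag) / 2.0
    return x

if __name__ == "__main__":
    feature_map = np.random.default_rng(0).normal(size=(8, 8))
    bands = haar_dwt2(feature_map)
    restored = haar_idwt2(*bands)
    print(np.allclose(restored, feature_map))
```

Applying `haar_dwt2` again to the returned low-frequency band would give a multi-level decomposition, provided the spatial size stays even at every level.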


The decoder is mainly composed of the deconvolution layer G2 and the convolutional layers C2-C5. The fused feature map is first upsampled by the deconvolution layer G2. Then, the reconstructed fused image is obtained through C2-C5.

III-C Training

In the field of image fusion, it is challenging to obtain effective fusion rules under end-to-end supervision. Thus, the main goal of our training process is to ensure that the decoder can reconstruct the image from the features encoded by the encoder with the lowest loss of image quality, so that the features learned in training can subsequently be leveraged for fusion. Additionally, the fusion layer involves no trainable parameters, so the DWT-based fusion part in Fig.1 is discarded during training. The training model is illustrated in Fig.2. We train our network on 70,000 images from COCO [9], all of which are resized to 256 × 256 and converted to grayscale. The batch size and the number of epochs are set to 64 and 50, respectively, alongside a preset learning rate. The proposed method is implemented in PyTorch 1.1.0 with Adam as the optimizer and trained on an NVIDIA RTX 2080 Ti GPU. In practice, we find that a comparatively small dataset of 300-700 images chosen randomly from COCO still achieves comparable fusion quality. For this setting, the batch size and the number of epochs are 4 and 500, respectively, again with a preset learning rate. Therefore, the computational cost of training our network is considerably lower than that of current deep learning models.

III-D Loss Function

The loss function used to train WaveFuse is shown in Eq.1. It is a weighted combination of the pixel loss and the structural similarity loss, with a weight whose best value is set to 1000 following [17].

The pixel loss is obtained by Eq.2 from the input image to the encoder and the output image of the decoder. The structural similarity loss is calculated by Eq.3. The Structural Similarity Index Metric (SSIM) is a widely used perceptual image quality metric, which combines the three components of luminance, structure and contrast to comprehensively measure image quality [42].
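For reference, the description above matches the form of the loss used in DenseFuse [17]. A sketch under that assumption is given here, where the symbols $\lambda$ (the weight), $I$ (the encoder input) and $O$ (the decoder output) are assumed names rather than notation confirmed by this text:

```latex
\begin{align}
L &= L_{p} + \lambda \, L_{ssim}, \\
L_{p} &= \lVert O - I \rVert_{2}, \\
L_{ssim} &= 1 - \mathrm{SSIM}(O, I).
\end{align}
```

In this form, a larger $\lambda$ puts more emphasis on structural similarity relative to the pixel-wise error, which is consistent with the large value of 1000 quoted from [17].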


III-E Fusion Strategy

As mentioned above, the selection of fusion rules largely determines the quality of fused images. Existing image fusion algorithms based on deep learning basically add feature maps directly, leaving the information in the feature maps not fully mined. In our method, a multi-scale wavelet transform based on regional energy [39] is applied to the feature maps, so that they are processed at different scales, leading to a markedly improved quality of the fused image.

The energy of the region centered at (m,n) is defined in Eq.4 as a weighted sum over the local region, which is bounded by its maximum row and column indices, using a set of weighting coefficients. The matching degree of the two feature maps is defined in Eq.5-6.

Infrared and visible image fusion
Method          EN      MI       Qabf    MS_SSIM  STD      AVG     EIN      VIF
DWT             6.3189  12.6377  0.4289  0.8830   22.9205  3.7287  35.8938  0.3679
CBF             *       *        *       *        *        *       *        *
ConvSR          *       *        *       *        *        *       *        *
WLS             6.7412  13.4824  0.4755  0.9294   33.8465  4.5559  44.0795  0.7463
ResZCA          6.4447  12.8893  0.2387  0.6323   27.1435  2.2178  22.7138  0.2899
FusionGAN       6.5888  13.1775  0.1519  0.5477   28.3121  1.9382  20.1227  0.2826
DenseFuse       6.5140  13.0280  0.4338  0.8474   29.4581  2.5841  25.8742  0.4141
WaveFuse        6.8580  13.7161  0.3754  0.8723   35.5306  3.6939  38.4883  0.6339
WaveFuse_3_db1  6.9152  13.8304  0.3876  0.8884   36.2466  3.9474  41.3366  0.7155

Multi-focus image fusion
Method          EN      MI       Qabf    MS_SSIM  STD      AVG     EIN      VIF
DWT             7.2438  14.4877  0.4998  0.9547   44.4301  4.9644  51.9155  0.6918
CBF             7.2550  14.5100  0.6774  0.9886   45.3357  5.8665  61.0520  0.8483
ConvSR          7.2737  14.5474  0.6957  0.9867   46.2913  5.9747  63.1097  0.8794
WLS             7.2786  14.5571  0.6631  0.9839   45.6133  5.8128  60.5427  0.8681
DenseFuse       7.2142  14.4285  0.5958  0.9680   43.9375  4.3209  45.9510  0.7235
WaveFuse        7.3561  14.7123  0.5196  0.9594   48.1359  6.5495  69.8810  0.8380
WaveFuse_3_db1  7.3830  14.7660  0.5176  0.9556   48.9388  6.6945  71.3553  0.8363

Multi-modal medical image fusion
Method          EN      MI       Qabf    MS_SSIM  STD      AVG     EIN      VIF
DWT             5.3748  10.7497  0.4827  0.8789   53.4287  6.2316  63.1409  0.5942
CBF             *       *        *       *        *        *       *        *
ConvSR          *       *        *       *        *        *       *        *
WLS             5.5099  11.0197  0.5771  0.9226   63.5586  6.9586  71.3395  0.8305
DenseFuse       5.3300  10.6599  0.6055  0.8742   63.7758  5.4067  57.0015  0.6988
WaveFuse        5.5883  11.1767  0.5119  0.8854   69.6581  6.8353  71.3563  0.8057
WaveFuse_3_db1  5.5755  11.1509  0.5127  0.8870   70.1196  7.3028  75.3542  0.7986

TABLE I: The average values of fusion quality metrics for six fused images of three different scenarios. Red ones are the best results, and WaveFuse results among top three are marked orange. For all metrics, larger is better.

In these definitions, the quantities involved are the wavelet coefficients of the wavelet decomposition. For the energy matching degree defined in Eq.6, an appropriate matching threshold T (between 0.5 and 1) should be selected; T is set to 0.8 in our network. When the matching degree is below T, the energies of the two feature maps in this region differ greatly. In this case, the central pixel of the region with the larger energy is selected as the central pixel of the fused feature map, as calculated by Eq.7.

Conversely, when the matching degree is not below T, the two feature maps have similar energy in this region. Consequently, a weighted fusion strategy [39] is used to determine the central pixel of the fused feature map, as shown in Eq.8-9.
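The rule described in this subsection (regional energy, matching degree, then either selection or weighted averaging) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the classical regional-energy formulation in which the energy is a windowed sum of squared coefficients, the matching degree is twice the windowed cross-product divided by the sum of the two energies, and the weights vary linearly with the matching degree above the threshold. The uniform 3 × 3 window, the edge padding and the names `regional_sum` and `fuse_regional_energy` are all assumptions.

```python
import numpy as np

def regional_sum(x, w):
    """Weighted sum of x over the window w centred at each pixel (edge padding)."""
    r, c = w.shape[0] // 2, w.shape[1] // 2
    p = np.pad(x, ((r, r), (c, c)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            out += w[i, j] * p[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def fuse_regional_energy(c1, c2, T=0.8, w=None, eps=1e-12):
    """Fuse two coefficient maps with a regional-energy rule (assumed form).

    T must be strictly below 1; the text uses T = 0.8.
    """
    if w is None:
        w = np.ones((3, 3)) / 9.0          # assumed uniform weighting window
    e1 = regional_sum(c1 * c1, w)          # regional energy of map 1
    e2 = regional_sum(c2 * c2, w)          # regional energy of map 2
    match = 2.0 * regional_sum(c1 * c2, w) / (e1 + e2 + eps)
    larger = np.where(e1 >= e2, c1, c2)    # coefficient from the higher-energy map
    smaller = np.where(e1 >= e2, c2, c1)
    w_min = 0.5 - 0.5 * (1.0 - match) / (1.0 - T)
    w_max = 1.0 - w_min
    weighted = w_max * larger + w_min * smaller
    # Low matching degree: select; high matching degree: weighted average.
    return np.where(match < T, larger, weighted)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
    print(fuse_regional_energy(a, b).shape)
```

With this choice of weights the two branches agree at a matching degree equal to T (the smaller weight is 0 there) and reduce to a plain average when the matching degree is 1, so the fused map changes smoothly across the threshold.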


To preserve more structural information and make our fused image more natural, we also apply the l1-Norm strategy [17] in our proposed network to generate a second fused feature map. The final fused features are calculated by Eq.10 from the DWT-based and the l1-Norm-based fused feature maps using a weighting parameter, which is set to different values for different scenarios to achieve the optimal fusion performance. In our experiments, this parameter is set to 0.6 for infrared and visible images, 1 for multi-focus images and 0.4 for multi-modal medical images.
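The l1-Norm strategy and the final combination can be sketched as below. The activity measure (channel-wise l1 norm followed by a block average) follows the strategy described for DenseFuse [17] as best understood here. The blending step is an assumption: Eq.10 is taken to be a convex combination, and both the name `alpha` and the choice of which term it multiplies are hypothetical.

```python
import numpy as np

def l1_norm_fusion(f1, f2, r=1, eps=1e-12):
    """Fuse two (C, H, W) feature maps with l1-norm activity weights."""
    def activity(f):
        a = np.abs(f).sum(axis=0)              # l1 norm across channels
        p = np.pad(a, r, mode="edge")          # assumed edge padding
        k = 2 * r + 1
        out = np.zeros_like(a, dtype=float)
        for i in range(k):                     # (2r+1) x (2r+1) block average
            for j in range(k):
                out += p[i:i + a.shape[0], j:j + a.shape[1]]
        return out / (k * k)

    a1, a2 = activity(f1), activity(f2)
    w1 = a1 / (a1 + a2 + eps)                  # per-pixel weight of source 1
    w2 = 1.0 - w1
    return w1 * f1 + w2 * f2                   # weights broadcast over channels

def blend(f_dwt, f_l1, alpha):
    """Assumed form of the final combination: alpha weights the DWT branch."""
    return alpha * f_dwt + (1.0 - alpha) * f_l1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f1, f2 = rng.normal(size=(4, 8, 8)), rng.normal(size=(4, 8, 8))
    fused = blend(f1, l1_norm_fusion(f1, f2), alpha=0.6)
    print(fused.shape)
```

Under this assumed form, a value of 1 would use only the DWT-based feature map, while smaller values mix in more of the l1-Norm result.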


IV Experimental Results and Analysis

Fig. 3: Six pairs of source images in each fusion task. From top to bottom are infrared and visible images, multi-focus images and multi-modal medical images, respectively.

In this section, to validate the effectiveness and generalization of WaveFuse, we first compared it against several state-of-the-art methods on three fusion tasks: infrared and visible, multi-focus and multi-modal medical image fusion. For quantitative comparison, we used eight metrics to evaluate the fusion results. Moreover, we evaluated the fusion performance of the proposed method when trained with small datasets. Finally, we conducted fine-tuning experiments on the wavelet parameters to further improve fusion performance.

IV-A Experimental Parameters Setting

The test data are available online, and the test set for each scenario contains six pairs of images. The source images of the three scenarios are shown in Fig.3.

For comparison, WaveFuse is evaluated against 7 representative methods: discrete wavelet transform (DWT) [12], the cross bilateral filter method (CBF) [11], convolutional sparse representation (ConvSR) [25], the weighted least square optimization-based method (WLS) [28], the ResNet50 and zero-phase component analysis fusion framework (ResZCA) [15], the GAN-based fusion algorithm (FusionGAN) [27] and DenseFuse [17]. All seven comparative methods were implemented using publicly available code, with the parameters set according to the original papers. Note that ResZCA and FusionGAN are designed for infrared and visible images, so they are only compared in the infrared and visible image fusion task.

Due to the diversity of image fusion scenarios, it is difficult to evaluate the quality of fused images objectively and comprehensively with a unified framework of metrics. The commonly used evaluation methods can be classified into two categories: subjective evaluation and objective evaluation. Subjective visual evaluation is susceptible to human factors, such as eyesight, subjective preference and individual emotion. Furthermore, in most cases no prominent difference among the fusion results can be observed by subjective evaluation. In contrast, objective evaluation is a relatively accurate and quantitative approach based on mathematical and statistical models. In our experiments, we adopted the following objective evaluation metrics: information entropy (EN) [34], mutual information (MI) [7], Qabf [43], multiscale structural similarity (MS-SSIM) [29], visual information fidelity (VIF) [38], standard deviation (STD) [36], average gradient (AVG) and edge intensity (EIN) [33].

EN and MI are used to measure the richness of information in the fused image. The larger EN and MI are, the more information the image contains and the higher the quality of the fused image. Qabf is an objective non-reference quality evaluation index for the fused image. It uses local metrics to estimate how well the significant information from the inputs is represented in the fused image. A higher Qabf value means better quality of the fused image [43]. MS-SSIM [29] is an extension of SSIM [42] and is more consistent with the perception of the human visual system. VIF is designed to follow the human visual system in computing the distortion between two random variables. STD is based on statistical characteristics. A larger STD indicates higher gray-level dispersion of an image, and hence higher information richness. AVG and EIN are based on gradients, reflecting the difference in the details of the image and the texture changes, respectively. For all eight quality metrics, larger values indicate better fusion results.
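Three of the simpler metrics can be sketched directly from their usual textbook definitions. The exact variants used in the experiments are not specified here, so the following is an illustration under assumptions: EN is taken as the Shannon entropy of the 8-bit gray-level histogram, STD as the standard deviation of the pixel values, and AVG as a common forward-difference form of the average gradient. The function names are hypothetical.

```python
import numpy as np

def entropy(img, levels=256):
    """Shannon entropy (bits) of the gray-level histogram of an integer image."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                      # skip empty bins to avoid log2(0)
    return float(-(p * np.log2(p)).sum())

def std_dev(img):
    """Standard deviation of the pixel intensities."""
    return float(np.std(img.astype(float)))

def average_gradient(img):
    """Mean magnitude of forward differences (one common AVG definition)."""
    x = img.astype(float)
    gx = x[:-1, 1:] - x[:-1, :-1]     # horizontal forward difference
    gy = x[1:, :-1] - x[:-1, :-1]     # vertical forward difference
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fused = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
    print(entropy(fused), std_dev(fused), average_gradient(fused))
```

A constant image gives zero for all three measures, which matches the interpretation above that larger values indicate richer information and detail.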

Fig. 4: Fusion results by different methods. (1)-(6) are Infrared, Visible, Multi-Focus, MRI-T1 and MRI-T2 source images respectively. (a1)-(a8) are infrared and visible fused images obtained by eight different fusion methods. (b1)-(b6) are multi-focus fused images, and (c1)-(c6) are multi-modal medical fused images.
Fig. 5: Quantitative comparison of our WaveFuse for infrared and visible image fusion with 7 state-of-the-art methods.

IV-B Comparison to Other Methods

IV-B1 Subjective Evaluation

Examples of the original image pairs and the fusion results obtained by each comparative method for the three scenarios are shown in Fig.4. The red boxes mark the regions of interest to focus on in the fusion results.

Infrared/Visible Image Fusion: Visible images capture more detail information than infrared images. However, objects of interest cannot be easily observed in a visible image, especially under low contrast and insufficient light. Infrared images provide thermal radiation information, making it easy to detect salient objects even against a complex background. Thus, the fused image can provide more complementary information. Fig.4 (a1-a8) shows the infrared and visible image fusion results of the compared methods. CBF and ConvSR exhibit significant artifacts and unclear salient objects. Focusing on the "door" boxed in red, the results of DWT and DenseFuse weaken the contrast. We can see that WaveFuse preserves more details with high contrast and brightness.

Multi-focus Image Fusion: Multi-focus image fusion aims to reconstruct a fully focused image from partly focused images of the same scene. From Fig.4 (b1-b6), we can observe that all the compared methods perform well. Focusing on the number "8" in the red box, CBF and WaveFuse outperform the other methods with higher resolution.

Multi-modal Medical Image Fusion: Multi-modal medical image fusion can offer more accurate and effective information for biomedical research and clinical applications. A good multi-modal medical fused image should combine the features of both sources sufficiently and preserve the significant textural features of both. As shown in Fig.4 (c1-c6), ConvSR shows obvious artifacts across the whole image. DWT and CBF fail to preserve the crucial features of the source images. DenseFuse shows better visual results than the aforementioned methods, but still weakens the details and brightness. Information-rich fused images can be obtained by WLS. In contrast, our method preserves the details and edge information of both source images, which is more in line with the perceptual characteristics of human vision than the other fusion methods.

IV-B2 Objective Evaluation

The main purpose of image fusion is to increase the richness of image information, so EN and MI are the most important evaluation metrics in the three fusion tasks. Given the differences among scenarios, the emphasis placed on the evaluation metrics should differ between fusion tasks. For infrared and visible images, SSIM and VIF are also important metrics for ensuring the information retention and visual information integrity of each band. In multi-focus images, detail information should be preserved, so STD and EIN are more worthy of reference. In multi-modal medical images, STD and AVG should be given priority. Besides, from Fig.4 (a2-a3) and (b2-b3), we can observe that the fusion results of CBF and ConvSR for infrared and visible images and for medical images have poor visual quality owing to considerable artificial noise; in these cases, their objective quality metrics are not used for the quantitative evaluation.

Table I shows the average values of the fusion quality metrics for the three fusion tasks and the different fusion methods. In infrared and visible image fusion, our method achieves the highest EN, MI and STD, and ranks second in MS-SSIM, AVG, EIN and VIF. In multi-focus image fusion, it ranks first in EN, MI, STD, AVG and EIN, and fourth in VIF. In multi-modal medical image fusion, it ranks first in EN, MI, STD and EIN, second in MS-SSIM, AVG and VIF, and third in Qabf. Figures 5-7 show the curves of the fusion quality metrics obtained for the three kinds of fused images. In Fig.5-7, the results are consistent across the six pairs of images, which demonstrates the robustness and universality of our method. Therefore, given our emphasis on different fusion metrics in different scenarios, our proposed method achieves the best performance.

Fig. 6: Quantitative comparison of our WaveFuse for multi-focus image fusion with 5 state-of-the-art methods.
Fig. 7: Quantitative comparison of our WaveFuse for multi-modal medical image fusion with 5 state-of-the-art methods.
Fig. 8: Fusion results with different training datasets. The first two rows of results are fused by DenseFuse. The last two rows of results are fused by WaveFuse.
Application  Method     Dataset  EN      MI       Qabf    MS_SSIM  STD      AVG     EIN      VIF
IR_VIS       DenseFuse  COCO     6.5140  13.0280  0.4338  0.8474   29.4581  2.5841  25.8742  0.4141
IR_VIS       DenseFuse  MINI     6.5016  13.0032  0.4298  0.8454   29.1593  2.5641  25.6589  0.4084
IR_VIS       WaveFuse   COCO     6.8583  13.7167  0.3755  0.8722   35.5311  3.6948  38.4966  0.6338
IR_VIS       WaveFuse   MINI     6.8751  13.7502  0.3688  0.8677   35.8324  3.7616  39.2384  0.6576
Multi-focus  DenseFuse  COCO     7.2142  14.4285  0.5958  0.9680   43.9375  4.3209  45.9510  0.7235
Multi-focus  DenseFuse  MINI     7.2047  14.4095  0.5832  0.9662   43.5970  4.2467  45.2459  0.7168
Multi-focus  WaveFuse   COCO     7.3561  14.7123  0.5199  0.9594   48.1370  6.5520  69.9019  0.8383
Multi-focus  WaveFuse   MINI     7.3672  14.7345  0.5174  0.9541   48.4257  6.9271  73.8873  0.8686
Medical      DenseFuse  COCO     5.3300  10.6599  0.6055  0.8742   63.7758  5.4067  57.0015  0.6988
Medical      DenseFuse  MINI     5.3583  10.7167  0.5965  0.8679   63.6040  5.5630  57.9339  0.6808
Medical      WaveFuse   COCO     5.5881  11.1761  0.5119  0.8854   69.6581  6.8352  71.3558  0.8057
Medical      WaveFuse   MINI     5.6629  11.3258  0.5056  0.8529   70.6554  7.8938  81.2461  0.7995

TABLE II: Quantitative comparison of DenseFuse and WaveFuse with different training datasets. The metrics of MINI are obtained by the average of MINI1-MINI3. Red ones are the best results. For all metrics, larger is better.

IV-C Comparison of Using Different Training Datasets

To further demonstrate the effectiveness and robustness of our network, we conducted experiments on three additional training minisets, MINI1-MINI3, containing 300, 500 and 700 images respectively, chosen randomly from COCO. The fusion results are shown in Fig. 8 and Table II. We compared the fusion results of DenseFuse and WaveFuse trained on COCO and on MINI1-MINI3. The same three sets of test images and the same metrics are used in the testing process, and fusion performance is compared using the averaged fusion quality metrics. In Fig. 8, no obvious visual difference can be found among the fused images of either DenseFuse or WaveFuse when different training sets are used. However, compared with DenseFuse, the fused images obtained by WaveFuse exhibit higher resolution and contrast. Objective metrics are then employed to evaluate the fusion performance.
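The miniset construction described above can be sketched as follows. The function name, the fixed seed and the placeholder file names are hypothetical; the text states only that 300, 500 and 700 images were chosen randomly from COCO. Whether the three minisets were drawn independently is not specified, so this sketch simply draws them independently.

```python
import random

def sample_minisets(image_paths, sizes=(300, 500, 700), seed=0):
    """Draw one random subset per size, each without replacement."""
    rng = random.Random(seed)  # seeded only so that the sketch is reproducible
    return {f"MINI{i}": rng.sample(image_paths, k)
            for i, k in enumerate(sizes, start=1)}

# Placeholder list standing in for the paths of the COCO training images
all_images = [f"img_{n:06d}.jpg" for n in range(10_000)]
minisets = sample_minisets(all_images)

for name, subset in minisets.items():
    print(name, len(subset))
```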

The fusion metrics of DenseFuse and WaveFuse trained with the different training datasets are shown in Table II. Across the three sets of fused images, neither DenseFuse nor WaveFuse shows a significant difference in objective fusion performance between training sets. Overall, slightly higher metrics are obtained with COCO than with MINI1-MINI3, and WaveFuse outperforms DenseFuse in all scenarios on most of the quality metrics. For WaveFuse, training on the minisets even yields higher values on several metrics. Accordingly, our proposed network is robust both to the size of the training dataset and to the selection of training images, so it can be trained at lower computational cost.
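To make the comparison concrete, the following sketch recomputes the relative change from COCO to MINI for the infrared and visible (IR_VIS) rows of Table II. The values are copied from the table; the helper name is an illustrative assumption.

```python
METRICS = ["EN", "MI", "Qabf", "MS_SSIM", "STD", "AVG", "EIN", "VIF"]

# IR_VIS rows of Table II
table = {
    "DenseFuse": {
        "COCO": [6.5140, 13.0280, 0.4338, 0.8474, 29.4581, 2.5841, 25.8742, 0.4141],
        "MINI": [6.5016, 13.0032, 0.4298, 0.8454, 29.1593, 2.5641, 25.6589, 0.4084],
    },
    "WaveFuse": {
        "COCO": [6.8583, 13.7167, 0.3755, 0.8722, 35.5311, 3.6948, 38.4966, 0.6338],
        "MINI": [6.8751, 13.7502, 0.3688, 0.8677, 35.8324, 3.7616, 39.2384, 0.6576],
    },
}

def relative_change(method):
    """Percent change of each metric when training on MINI instead of COCO."""
    coco, mini = table[method]["COCO"], table[method]["MINI"]
    return {m: 100.0 * (b - a) / a for m, a, b in zip(METRICS, coco, mini)}

for method in table:
    changes = relative_change(method)
    improved = [m for m, c in changes.items() if c > 0]
    print(method, {m: round(c, 2) for m, c in changes.items()}, "improved:", improved)
```

For these rows every metric changes by less than 4% in magnitude. DenseFuse drops slightly on all eight metrics, while WaveFuse improves on six of the eight (all except Qabf and MS_SSIM).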

IV-D Comparison of Using Different Wavelet Decomposition Layers and Bases

In the wavelet transform, the number of decomposition layers and the choice of wavelet base can strongly affect the effectiveness of the transform. In the following experiments on COCO, different numbers of decomposition layers and different wavelet bases are evaluated to further optimize our proposed method.

IV-D1 Experiments on Different Wavelet Decomposition Layers

We vary the number of decomposition layers from 1 to 4, with the wavelet base set to sym2 in this experiment.

Fig. 9: Fusion results obtained by our WaveFuse with different wavelet decomposition layers.
Fig. 10: Fusion results obtained by our WaveFuse with different wavelet bases.

From Table III, we can see that most of the fusion metrics, as well as the brightness and contrast of the fused images, increase with the number of decomposition layers. However, when the number of decomposition layers is set to 3, the results exhibit slight artificial noise in the infrared and visible images, and when it is set to 4, all fused images except the multi-focus ones contain obvious noise. This analysis shows that more decomposition layers are not always better, and that fused images should be evaluated by combining subjective and objective methods.
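To illustrate what the number of decomposition layers controls, the sketch below applies a multi-level 2-D wavelet decomposition to a random 64x64 array. For simplicity it uses a hand-written orthonormal Haar (db1) transform in NumPy rather than sym2, and the function names and input size are assumptions chosen for illustration only.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D orthonormal Haar (db1) transform; needs even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left and top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left and bottom-right
    approx = (a + b + c + d) / 2.0
    col_diff = (a - b + c - d) / 2.0      # left column minus right column
    row_diff = (a + b - c - d) / 2.0      # top row minus bottom row
    diag_diff = (a - b - c + d) / 2.0     # diagonal difference
    return approx, (col_diff, row_diff, diag_diff)

def haar_wavedec2(x, levels):
    """Decompose the approximation band repeatedly, as in a multi-level DWT."""
    approx = np.asarray(x, dtype=np.float64)
    details = []
    for _ in range(levels):
        approx, bands = haar_dwt2(approx)
        details.append(bands)
    return approx, details

rng = np.random.default_rng(0)
image = rng.random((64, 64))
for levels in range(1, 5):
    approx, details = haar_wavedec2(image, levels)
    print(levels, approx.shape, [bands[0].shape for bands in details])
```

Each additional layer halves the side length of the approximation band, so with 4 layers a 64x64 input leaves only a 4x4 approximation, and every coefficient at that level covers a 16x16 region of the input.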

IV-D2 Experiments on Different Wavelet Bases

For the comparison of different wavelet bases, we set the number of decomposition layers to 3 and choose four bases: sym2, sym3, db1 and rbio6.8. Subjectively, it is difficult to tell which wavelet base achieves the best fusion performance. According to the objective evaluation metrics in Table IV, the wavelet base db1 gives the highest overall fusion quality in the three application scenarios. These two experiments show that our proposed method can be further improved by selecting an appropriate number of decomposition layers and an appropriate wavelet base, which provides a direction for follow-up work.
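One simple way to summarize Table IV is to count, for each scenario, the metrics on which each wavelet base has the highest value. The sketch below does this with the values copied from the table; the tallying rule and the function name are assumptions for illustration, not the authors' selection procedure.

```python
from collections import Counter

METRICS = ["EN", "MI", "Qabf", "MS_SSIM", "STD", "AVG", "EIN", "VIF"]

TABLE_IV = {
    "IR_VIS": {
        "sym2":    [6.8995, 13.7990, 0.3833, 0.8931, 35.9582, 3.7392, 39.0663, 0.6791],
        "sym3":    [6.8950, 13.7901, 0.3831, 0.8935, 35.9461, 3.6904, 38.5472, 0.6833],
        "db1":     [6.9152, 13.8304, 0.3876, 0.8884, 36.2466, 3.9474, 41.3366, 0.7155],
        "rbio6.8": [6.8989, 13.7977, 0.3812, 0.8887, 35.9574, 3.6857, 38.4857, 0.6747],
    },
    "Multi-Focus": {
        "sym2":    [7.3676, 14.7353, 0.5296, 0.9599, 48.6419, 6.6343, 70.7985, 0.8688],
        "sym3":    [7.3700, 14.7401, 0.5299, 0.9575, 48.6571, 6.6658, 71.1048, 0.8726],
        "db1":     [7.3830, 14.7660, 0.5176, 0.9556, 48.9388, 6.6945, 71.3553, 0.8363],
        "rbio6.8": [7.3609, 14.7218, 0.5328, 0.9624, 48.5020, 6.5911, 70.2980, 0.8756],
    },
    "Medical": {
        "sym2":    [5.6743, 11.3487, 0.5038, 0.8812, 70.0103, 7.2669, 75.2111, 0.8009],
        "sym3":    [5.7093, 11.4185, 0.4998, 0.8828, 69.9106, 7.1585, 74.0353, 0.7939],
        "db1":     [5.5755, 11.1509, 0.5127, 0.8870, 70.1196, 7.3028, 75.3542, 0.7986],
        "rbio6.8": [5.6831, 11.3662, 0.4810, 0.8604, 69.8820, 7.2228, 74.5406, 0.7656],
    },
}

def wins_per_base(rows):
    """Count, for each wavelet base, the metrics on which it has the highest value."""
    wins = Counter()
    for i in range(len(METRICS)):
        best = max(rows, key=lambda base: rows[base][i])
        wins[best] += 1
    return wins

for scenario, rows in TABLE_IV.items():
    print(scenario, dict(wins_per_base(rows)))
```

Under this tally db1 leads on 7, 5 and 5 of the 8 metrics in the three scenarios, which is consistent with the statement above, although it is not the top base on every metric (for example, EN and MI on medical images are highest with sym3).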

Application  Layer  EN      MI       Qabf    MS_SSIM  STD      AVG     EIN      VIF
IR_VIS       1      6.8251  13.6501  0.3716  0.8494   35.2559  3.5373  36.5797  0.5941
IR_VIS       2      6.8583  13.7167  0.3755  0.8722   35.5311  3.6948  38.4966  0.6338
IR_VIS       3      6.8994  13.7988  0.3829  0.8928   35.9602  3.7396  39.0690  0.6789
IR_VIS       4      6.9580  13.9159  0.3843  0.9023   36.7579  3.7598  39.2986  0.7237
Multi-Focus  1      7.3349  14.6699  0.4809  0.9494   47.4765  6.3098  66.8584  0.7533
Multi-Focus  2      7.3561  14.7123  0.5199  0.9594   48.1370  6.5520  69.9019  0.8383
Multi-Focus  3      7.3676  14.7353  0.5296  0.9599   48.6419  6.6343  70.7985  0.8688
Multi-Focus  4      7.3767  14.7534  0.5295  0.9587   48.9999  6.6592  71.0260  0.8712
Medical      1      5.5585  11.1170  0.5277  0.8805   69.6862  6.5261  68.4686  0.8081
Medical      2      5.5881  11.1761  0.5119  0.8854   69.6581  6.8352  71.3558  0.8057
Medical      3      5.6743  11.3487  0.5038  0.8812   70.0103  7.2669  75.2111  0.8009
Medical      4      5.7541  11.5081  0.5005  0.8731   70.3898  7.6128  78.2182  0.7820
TABLE III: Quantitative comparison with different wavelet decomposition layers in WaveFuse. The fusion quality metrics for six fused images of three different scenarios are obtained by average operation. Red ones are the best results. For all metrics, larger is better.
Application  Base     EN      MI       Qabf    MS_SSIM  STD      AVG     EIN      VIF
IR_VIS       sym2     6.8995  13.7990  0.3833  0.8931   35.9582  3.7392  39.0663  0.6791
IR_VIS       sym3     6.8950  13.7901  0.3831  0.8935   35.9461  3.6904  38.5472  0.6833
IR_VIS       db1      6.9152  13.8304  0.3876  0.8884   36.2466  3.9474  41.3366  0.7155
IR_VIS       rbio6.8  6.8989  13.7977  0.3812  0.8887   35.9574  3.6857  38.4857  0.6747
Multi-Focus  sym2     7.3676  14.7353  0.5296  0.9599   48.6419  6.6343  70.7985  0.8688
Multi-Focus  sym3     7.3700  14.7401  0.5299  0.9575   48.6571  6.6658  71.1048  0.8726
Multi-Focus  db1      7.3830  14.7660  0.5176  0.9556   48.9388  6.6945  71.3553  0.8363
Multi-Focus  rbio6.8  7.3609  14.7218  0.5328  0.9624   48.5020  6.5911  70.2980  0.8756
Medical      sym2     5.6743  11.3487  0.5038  0.8812   70.0103  7.2669  75.2111  0.8009
Medical      sym3     5.7093  11.4185  0.4998  0.8828   69.9106  7.1585  74.0353  0.7939
Medical      db1      5.5755  11.1509  0.5127  0.8870   70.1196  7.3028  75.3542  0.7986
Medical      rbio6.8  5.6831  11.3662  0.4810  0.8604   69.8820  7.2228  74.5406  0.7656
TABLE IV: Quantitative comparison with different wavelet bases in WaveFuse. The fusion quality metrics for six fused images of three different scenarios are obtained by average operation. Red ones are the best results. For all metrics, larger is better.

V Conclusions

In this paper, we propose a novel image fusion method that combines a multi-scale wavelet transform based on regional energy with deep learning. To the best of our knowledge, this is the first time that a conventional technique has been integrated into the pipeline of a deep learning-based image fusion method, and we believe there are still many possibilities to explore in this direction.

Our network consists of three parts: an encoder, a DWT-based fusion layer and a decoder. The encoder extracts features from the input images, the fusion layer applies the adaptive fusion strategy to obtain the fused features, and the decoder reconstructs the fused image. Compared with current state-of-the-art fusion algorithms, our proposed method achieves better performance. Additionally, our network has strong universality and can be applied to image fusion in various scenarios. It also does not rely on big datasets: trained on comparatively small datasets, it obtains fusion results comparable to those obtained with large datasets, with shorter training time and higher efficiency. Moreover, experiments on different numbers of wavelet decomposition layers and different wavelet bases show that our method can be improved further. Therefore, our network has clear advantages over current deep learning-based algorithms.
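The role of the DWT-based fusion layer can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration: a single-level Haar (db1) transform, a 3x3 regional-energy window, a choose-max rule per coefficient, and all function names. The adaptive strategy actually used in WaveFuse may combine the sources differently, and the trained encoder and decoder are omitted, so two random arrays stand in for encoder feature maps.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D orthonormal Haar transform: returns four sub-bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return [(a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2]

def haar_idwt2(bands):
    """Inverse of haar_dwt2 (the 2x2 Haar matrix is its own inverse)."""
    ll, lh, hl, hh = bands
    out = np.empty((ll.shape[0] * 2, ll.shape[1] * 2))
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

def regional_energy(band, radius=1):
    """Sum of squared coefficients in a (2*radius+1)^2 window, zero-padded at the border."""
    sq = np.pad(band ** 2, radius)
    size = 2 * radius + 1
    h, w = band.shape
    return sum(sq[i:i + h, j:j + w] for i in range(size) for j in range(size))

def fuse(feat_a, feat_b):
    """Per coefficient, keep the source whose neighbourhood carries more energy (assumed rule)."""
    fused = []
    for band_a, band_b in zip(haar_dwt2(feat_a), haar_dwt2(feat_b)):
        mask = regional_energy(band_a) >= regional_energy(band_b)
        fused.append(np.where(mask, band_a, band_b))
    return haar_idwt2(fused)

# Random arrays standing in for two encoder feature maps of the same size
rng = np.random.default_rng(0)
feat_a, feat_b = rng.random((8, 8)), rng.random((8, 8))
fused = fuse(feat_a, feat_b)
print(fused.shape)
```

Because the transform is invertible, fusing a feature map with itself returns it unchanged, which is a convenient sanity check for any fusion rule of this kind.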


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §I.
  • [2] T. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma (2015) PCANet: a simple deep learning baseline for image classification?. IEEE transactions on image processing 24 (12), pp. 5017–5032. Cited by: §I.
  • [3] C. Du and S. Gao (2017) Image segmentation-based multi-focus image fusion through multi-scale convolutional neural network. IEEE access 5, pp. 15750–15761. Cited by: §I, §II.
  • [4] A. A. Goshtasby and S. Nikolov (2007) Image fusion: advances in the state of the art. Information fusion 2 (8), pp. 114–118. Cited by: §I.
  • [5] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §I.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §II.
  • [7] M. Hossny, S. Nahavandi, and D. Creighton (2008) Comments on’information measure for performance of image fusion’. Electronics letters 44 (18), pp. 1066–1067. Cited by: §IV-A.
  • [8] R. Hou, D. Zhou, R. Nie, D. Liu, L. Xiong, Y. Guo, and C. Yu (2020) VIF-net: an unsupervised framework for infrared and visible image fusion. IEEE Transactions on Computational Imaging 6, pp. 640–651. Cited by: §II.
  • [9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §II, §III-B.
  • [10] H. Jung, Y. Kim, H. Jang, N. Ha, and K. Sohn (2020) Unsupervised deep image fusion with structure tensor representations. IEEE Transactions on Image Processing 29, pp. 3845–3858. Cited by: §II.
  • [11] B. S. Kumar (2015) Image fusion based on pixel significance using cross bilateral filter. Signal, image and video processing 9 (5), pp. 1193–1204. Cited by: §IV-A.
  • [12] H. Li, B. Manjunath, and S. K. Mitra (1995) Multisensor image fusion using the wavelet transform. Graphical models and image processing 57 (3), pp. 235–245. Cited by: §IV-A.
  • [13] H. Li, X. Wu, and T. S. Durrani (2019) Infrared and visible image fusion with resnet and zero-phase component analysis. Infrared Physics & Technology 102, pp. 103039. Cited by: §I, §II.
  • [14] H. Li, X. Wu, and T. S. Durrani (2019) Infrared and visible image fusion with resnet and zero-phase component analysis. Infrared Physics & Technology 102, pp. 103039. Cited by: §I.
  • [15] H. Li, X. Wu, and T. S. Durrani (2019) Infrared and visible image fusion with resnet and zero-phase component analysis. Infrared Physics & Technology 102, pp. 103039. Cited by: §II, §IV-A.
  • [16] H. Li, X. Wu, and J. Kittler (2018) Infrared and visible image fusion using a deep learning framework. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2705–2710. Cited by: §II.
  • [17] H. Li and X. Wu (2018) DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623. Cited by: §I, §II, §II, §III-A, §III-D, §III-E, §IV-A.
  • [18] S. Li, X. Kang, L. Fang, J. Hu, and H. Yin (2017) Pixel-level image fusion: a survey of the state of the art. information Fusion 33, pp. 100–112. Cited by: §I, §I.
  • [19] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §I.
  • [20] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §I.
  • [21] Y. Liu, X. Chen, J. Cheng, and H. Peng (2017) A medical image fusion method based on convolutional neural networks. In 2017 20th International Conference on Information Fusion (Fusion), pp. 1–7. Cited by: §II.
  • [22] Y. Liu, X. Chen, H. Peng, and Z. Wang (2017) Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207. Cited by: §I, §I, §I, §II, §II.
  • [23] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, and X. Wang (2018) Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173. Cited by: §I, §I, §I.
  • [24] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang (2016) CONVSR. IEEE signal processing letters 23 (12), pp. 1882–1886. Cited by: §II.
  • [25] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang (2016) Image fusion with convolutional sparse representation. IEEE signal processing letters 23 (12), pp. 1882–1886. Cited by: §IV-A.
  • [26] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §I.
  • [27] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: §IV-A.
  • [28] J. Ma, Z. Zhou, B. Wang, and H. Zong (2017) Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Physics & Technology 82, pp. 8–17. Cited by: §IV-A.
  • [29] K. Ma, K. Zeng, and Z. Wang (2015) Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing 24 (11), pp. 3345–3356. Cited by: §IV-A, §IV-A.
  • [30] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §II.
  • [31] L. Perez and J. Wang (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621. Cited by: §I.
  • [32] K. R. Prabhakar, V. S. Srikar, and R. V. Babu (2017) DeepFuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs.. In ICCV, pp. 4724–4732. Cited by: §I, §II, §II.
  • [33] B. Rajalingam and R. Priya (2018) Hybrid multimodality medical image fusion technique for feature enhancement in medical diagnosis. International Journal of Engineering Science Invention 2 (Special issue), pp. 52–60. Cited by: §IV-A.
  • [34] Y. Rao (1997) In-fibre bragg grating sensors. Measurement science and technology 8 (4), pp. 355. Cited by: §IV-A.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §I.
  • [36] J. W. Roberts, J. A. van Aardt, and F. B. Ahmed (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522. Cited by: §IV-A.
  • [37] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §I.
  • [38] H. R. Sheikh and A. C. Bovik (2006) Image information and visual quality. IEEE Transactions on image processing 15 (2), pp. 430–444. Cited by: §IV-A.
  • [39] X. SHEN, G. YANG, and H. ZHANG (2006) Improved on the approach of image fusion based on region-energy [j]. Journal of Projectiles, Rockets, Missiles and Guidance 4. Cited by: §III-E, §III-E.
  • [40] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §II.
  • [41] X. Song, X. Wu, and H. Li (2019) MSDNet for medical image fusion. In International Conference on Image and Graphics, pp. 278–288. Cited by: §I, §I, §II, §II.
  • [42] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §III-D, §IV-A.
  • [43] C. Xydeas and V. Petrovic (2000) Objective image fusion performance measure. Electronics letters 36 (4), pp. 308–309. Cited by: §IV-A, §IV-A.
  • [44] B. Yang, J. Zhong, Y. Li, and Z. Chen (2017) Multi-focus image fusion and super-resolution with convolutional neural network. International Journal of Wavelets, Multiresolution and Information Processing 15 (04), pp. 1750037. Cited by: §II.