A Deep Decomposition Network for Image Processing: A Case Study for Visible and Infrared Image Fusion

02/21/2021 ∙ by Yu Fu, et al. ∙ 13

Image decomposition is a crucial subject in the field of image processing because it can extract salient features from the source image. We propose a new image decomposition method based on a convolutional neural network. This method can be applied to many image processing tasks. In this paper, we apply the image decomposition network to the image fusion task. We input an infrared image and a visible light image and decompose each of them into three high-frequency feature images and one low-frequency feature image. The two sets of feature images are fused using a specific fusion strategy to obtain fused feature images. Finally, the fused feature images are reconstructed to obtain the fused image. Compared with state-of-the-art fusion methods, this method achieves better performance in both subjective and objective evaluation.




I Introduction

Image fusion is an important task in image processing. It aims to extract important features from images of multi-modality sources and uses certain fusion strategies to generate a fused image containing the complementary information of the source images. Our work addresses one of the common image fusion tasks, namely fusing visible light images and infrared images[21]. The fused images not only contain the radiation information of occluded objects, but also retain sufficient texture detail. At present, many advanced fusion methods are widely used in fields such as security monitoring, autonomous driving, target tracking and target recognition.

There are many excellent fusion methods, which can be divided into two categories: traditional methods and deep learning based methods[20]. Most of the traditional methods are based on signal processing methods to obtain high-frequency bands and low-frequency bands of the image and then merge them. With the development of deep learning, methods based on deep neural networks have also shown great potential in image fusion, because neural networks can extract features of source images and perform feature fusion.

Traditional methods can be broadly divided into two categories: methods based on multi-scale decomposition and methods based on representation learning. In the multi-scale domain, the image is decomposed into multi-scale feature maps, which are then fused through a specific fusion strategy. Finally, the corresponding inverse transform is used to obtain the fused image. There are many representative multi-scale decomposition methods, such as pyramid[25], curvelet[39], contourlet[32] and discrete wavelet transform[9].

In the representation learning domain, most methods are based on sparse representation, such as sparse representation (SR) combined with histogram of oriented gradients (HOG)[40], joint sparse representation (JSR)[37] and approximate sparse representation with a multi-selection strategy[2].

In the low-rank domain, Li and Wu et al. proposed a fusion method based on low-rank representation (LRR)[15]. The most recent approaches, such as MDLatLRR[14], are based on image decomposition with latent LRR (LatLRR), which can extract source image features in the low-rank domain.

Although the methods based on multi-scale decomposition and representation learning have achieved good performance, they still have some problems. These methods are very complicated, and dictionary learning is a time-consuming operation, especially for online training. If the source image is complex, these methods cannot extract the features well.

To solve these problems, many methods based on deep learning have been proposed in recent years[20], because of the powerful feature extraction capabilities of neural networks.

In 2017, Liu et al. proposed a method based on a convolutional neural network for multi-focus image fusion[19]. At ICCV 2017, Prabhakar et al. proposed DeepFuse[29] to solve the problem of multi-exposure image fusion. In 2018, Li and Wu et al. proposed a new infrared and visible light image fusion method based on dense blocks and an autoencoder structure[16]. In the next two years, with the rapid development of deep learning, a large number of excellent methods emerged, including IFCNN[38] proposed by Zhang et al., the GAN-based fusion network (FusionGan)[22] proposed by Ma et al., and the multi-scale fusion network framework (NestFuse)[13] proposed by Li et al. in 2020. Most of these methods use the powerful feature extraction capabilities of neural networks, perform fusion at the feature level, and obtain the final fused image with some specific fusion strategies.

However, the methods based on deep networks also have some shortcomings: 1. As a feature extraction tool, a neural network cannot explain the meaning of the extracted features. 2. The networks are complex and time-consuming. 3. Infrared and visible light datasets are small in number and scale, so many methods are trained on other datasets, which are not necessarily suitable for extracting features from infrared and visible light images.

To solve these problems, we propose a novel network that can be used to decompose images. Drawing on both traditional methods and deep learning based methods, our proposed network decomposes infrared and visible light images into high-frequency feature images and low-frequency feature images, and achieves a better decomposition effect than traditional methods. We also design some fusion rules to fuse the high-frequency and low-frequency feature images to obtain the fused feature images. Finally, these fused feature images are reconstructed into a fused image. The proposed method not only utilizes the powerful feature extraction capabilities of neural networks, but also realizes the decomposition of the image. Compared with state-of-the-art methods, our fusion framework achieves better performance in both subjective and objective evaluation.

This paper is structured as follows. In Section II, we introduce some related work. In Section III, we introduce our proposed fusion method in detail. In Section IV, we describe the experimental settings, and analyze and compare our experimental results. Finally, in Section V, we draw the conclusions of this paper.

II Related Works

Both traditional image signal processing methods and deep learning based methods are reasonable and effective. In this section, we introduce some related works that inspired us.

II-A Wavelet Decomposition and Laplacian Filter

Wavelet transform has been successfully applied to many image processing tasks. The most common wavelet transform technique for image fusion is the Discrete Wavelet Transform (DWT)[12][3].

DWT is a signal processing tool that can decompose signals into high-frequency information and low-frequency information. Generally speaking, low-frequency information contains the main characteristics of the signal, and high-frequency information includes the detailed information of the signal. In the field of image processing, 2-D DWT is usually used to decompose images. The wavelet decomposition of an image applies a low-pass filter and a high-pass filter to the input. The input signal is an image, which has signals in two directions, so high-pass and low-pass filtering are performed along the horizontal direction and the vertical direction respectively. As shown in Fig. 1, we get a low-frequency image, which is an approximate representation, and three high-frequency images, which are the vertical detail, diagonal detail and horizontal detail respectively.

Fig. 1: Wavelet Decomposition. We perform wavelet decomposition on the image to get a low-frequency image (b) and three high-frequency images (c)(d)(e) in three directions.
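
As a minimal illustration of this kind of decomposition, the sketch below implements one level of a 2-D DWT with Haar filters. The choice of the Haar wavelet and the function name `haar_dwt2` are assumptions made for this example only; the text does not fix a particular wavelet. The four outputs are named by the filter applied along each direction (L for low-pass, H for high-pass) rather than by detail orientation.

```python
import numpy as np

def haar_dwt2(image):
    """One level of a 2-D Haar DWT; returns the (LL, LH, HL, HH) sub-bands."""
    img = np.asarray(image, dtype=np.float64)
    if img.shape[0] % 2 or img.shape[1] % 2:
        raise ValueError("image height and width must be even")
    s = np.sqrt(2.0)
    # Filter and downsample along the horizontal direction (pairs of columns).
    lo = (img[:, 0::2] + img[:, 1::2]) / s  # low-pass
    hi = (img[:, 0::2] - img[:, 1::2]) / s  # high-pass
    # Filter and downsample along the vertical direction (pairs of rows).
    ll = (lo[0::2, :] + lo[1::2, :]) / s  # low-frequency approximation
    lh = (lo[0::2, :] - lo[1::2, :]) / s  # high-frequency detail
    hl = (hi[0::2, :] + hi[1::2, :]) / s  # high-frequency detail
    hh = (hi[0::2, :] - hi[1::2, :]) / s  # high-frequency detail
    return ll, lh, hl, hh

# A flat image has no detail: all three high-frequency sub-bands are zero.
flat_bands = haar_dwt2(np.full((4, 4), 3.0))
```

Because the Haar filters used here are orthonormal, the sum of squared coefficients over the four sub-bands equals the sum of squared pixel values of the input.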

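The Laplacian operator is a simple differential operator with rotation invariance. The Laplacian transform of a two-dimensional image function $f(x, y)$ is the isotropic second derivative, defined as:

$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \tag{2}$$

In order to be more suitable for digital image processing, the equation is approximated in a discrete form:

$$\nabla^2 f(x, y) = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4 f(x, y) \tag{3}$$

The Laplacian operator can also be expressed in the form of a convolution template, using it as a filtering kernel:

$$L_4 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \qquad L_8 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} \tag{4}$$

$L_4$ and $L_8$ are the template and the extended template of the discrete Laplacian operator, and the second differential characteristic of this template can be used to determine the position of the edge. They are often used in image edge detection and image sharpening, as shown in Fig. 2.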

Fig. 2: Laplacian Filter. We use the Laplacian extended template for image filtering to get its high-frequency image (b), and magnify a local area of the high-frequency image (c).

We can easily observe that traditional edge filtering is usually just high-frequency filtering: while highlighting the edges, it also highlights the noise.
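
The filtering described above can be sketched in a few lines of NumPy. This is only an illustration: the kernel constants use the standard 4-neighbour and 8-neighbour Laplacian templates, while the helper name `filter3x3` and the edge-replicating border handling are assumptions of this example.

```python
import numpy as np

# Standard discrete Laplacian template and its extended (8-neighbour) variant.
LAPLACIAN_4 = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
LAPLACIAN_8 = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=np.float64)

def filter3x3(image, kernel):
    """Filter a 2-D image with a 3x3 kernel, replicating the border pixels."""
    img = np.asarray(image, dtype=np.float64)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img)
    # Accumulate the nine shifted copies of the image, weighted by the kernel.
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

# A single "noisy" pixel on a flat background produces a strong response,
# which is why plain Laplacian filtering highlights noise as well as edges.
noisy = np.zeros((5, 5))
noisy[2, 2] = 1.0
response = filter3x3(noisy, LAPLACIAN_8)
```

Both templates are symmetric, so correlation and convolution give the same result here.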

II-B Decomposition-based Fusion Methods

Li and Wu et al. proposed a method[14] to decompose images using low-rank representation[18].

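First, LatLRR[18] can be described as the following optimization problem:

$$\min_{Z, L, E} \|Z\|_* + \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E \tag{5}$$

where $\lambda$ is a hyper-parameter, $\|\cdot\|_*$ is the nuclear norm, and $\|\cdot\|_1$ is the $\ell_1$ norm. $X$ is the observed data matrix, $Z$ is the low-rank coefficient matrix, $L$ is a projection matrix, and $E$ is a sparse noise matrix.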
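
To make the roles of the terms concrete, the sketch below evaluates the objective and the constraint residual for given matrices, assuming the standard LatLRR form (nuclear norms of the coefficient and projection matrices plus a weighted $\ell_1$ norm of the noise matrix, subject to $X = XZ + LX + E$). It does not solve the optimization problem, and the function names are hypothetical.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of singular values of a matrix."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

def latlrr_objective(Z, L, E, lam):
    """Objective value ||Z||_* + ||L||_* + lam * ||E||_1 (assumed standard form)."""
    return nuclear_norm(Z) + nuclear_norm(L) + lam * float(np.abs(E).sum())

def latlrr_residual(X, Z, L, E):
    """Frobenius norm of the constraint violation X - (XZ + LX + E)."""
    return float(np.linalg.norm(X - (X @ Z + L @ X + E)))

# Trivial feasible point: Z = I reproduces X exactly, with L = 0 and E = 0.
X = np.arange(12, dtype=np.float64).reshape(3, 4)
Z = np.eye(4)
L = np.zeros((3, 3))
E = np.zeros((3, 4))
value = latlrr_objective(Z, L, E, lam=0.5)
```

The trivial point is feasible but not low-rank; the optimization trades the rank of `Z` and `L` against the sparsity of `E`.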

The authors use this method to decompose the image into a detail image and a base image. We can see from Fig. 3 that the detail image is a high-frequency image and the base image is a low-frequency image.

Fig. 3: The framework of MDLatLRR.

As shown in Fig. 3, the low-frequency image is decomposed repeatedly to obtain several high-frequency images.

Finally, this method decomposes the infrared image and the visible light image to obtain high-frequency images and low-frequency images, which are then fused with specific strategies to get the fused image.

II-C Deep Learning-based Fusion Methods

In 2017, Liu et al. proposed a neural network-based method[19]. The authors divide the picture into many small patches. Then a CNN is used to predict whether each small patch is blurry or clear. The network builds a decision activation map to indicate which pixels of the original image are clear and focused. A well-trained network can accomplish multi-focus fusion tasks very well. However, due to the limitations of the network design, this method is only suitable for multi-focus image fusion.

In order to enable the network to fuse visible light images and infrared images, Li and Wu et al. proposed a deep neural network (DenseFuse)[16] based on an autoencoder. First, they train a sufficiently powerful encoder and decoder, which can fully extract the features of the original image and reconstruct the image while losing as little information as possible. Then the infrared image and the visible light image are fed into the encoder to obtain the coding features, and the two sets of features are fused with a specific strategy to obtain the fused features. Finally, the fused features are fed into the decoder to obtain the fused image. In short, such methods use the encoder to decompose the image into several latent features, which are then fused and reconstructed to obtain a fused image.

In the past few years, Generative Adversarial Networks (GANs) have also been applied to many fields, including image fusion. FusionGan[22] first uses GANs to generate a fused image. The generator takes infrared and visible light images as input and outputs a fused image. In order to improve the quality of the generated image, the authors designed an appropriate loss function. Finally, the generator can be used to fuse any infrared image and visible light image.

In view of the strengths of these two kinds of methods, we propose a multi-layer image decomposition method based on a neural network, and an image fusion framework for infrared and visible light images based on this method.

III Proposed Fusion Method

In this section, the proposed multi-scale decomposition-based fusion network is introduced in detail. Firstly, the network structure and the training phase are described in Section III-A. Next, the design of the loss function of the network is given in Section III-B. Then, the fusion framework is presented in Section III-C. Finally, we present different fusion strategies in Section III-D.

III-A Network Structure

In the training phase, we discard the fusion strategy and train the decomposition network.

Our training goal is to make the decomposition network better decompose the source image into several high-frequency and one low-frequency images, which are used for subsequent operations. The structure of the network is shown in Fig.4, and the detailed network settings are shown in Table I.

Fig. 4: The framework of training process.
Fig. 5: We use three convolutions to reduce the channels to one and get a high-frequency image.
Fig. 6: We use three convolutions and downsampling twice to reduce the channels to one to get a low-frequency image.

Block     Layer         Channel  Channel   Size      Size     Size      Activation
                        (input)  (output)  (kernel)  (input)  (output)
Cin       Conv(Cin-1)   1        16        3         256      256       LeakyReLU
          Conv(Cin-2)   16       32        3         256      256       LeakyReLU
          Conv(Cin-3)   32       64        3         256      256       LeakyReLU
C1        Conv(C1)      64       64        3         256      256       LeakyReLU
C2        Conv(C2)      64       64        3         256      256       LeakyReLU
C3        Conv(C3)      64       64        3         256      256       LeakyReLU
R1        Conv(R1)      64       64        1         256      256       -
R2        Conv(R2)      64       64        1         256      256       -
R3        Conv(R3)      64       64        1         256      256       -
Detail    Conv(D0)      64       32        3         256      256       LeakyReLU
          Conv(D1)      32       16        3         256      256       LeakyReLU
          Conv(D2)      16       1         3         256      256       Tanh
C-res     Conv(C-res1)  64       64        3         256      256       ReLU
          Conv(C-res2)  64       64        3         256      256       ReLU
          Conv(C-res3)  64       64        3         256      256       ReLU
Semantic  Conv(S0)      64       32        3         256      128       ReLU
          Conv(S1)      32       16        3         128      64        ReLU
          Conv(S2)      16       1         3         64       64        Tanh
Upsample  Upsample      1        1         -         64       256       -
TABLE I: The parameters of the network
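
The spatial sizes in Table I can be checked with the usual convolution output-size formula. The sketch below is only a consistency check: the padding of 1 for the 3×3 kernels and the padding of 0 for the 1×1 kernels are assumptions, since the table lists kernel sizes and input/output sizes but not the padding.

```python
def conv_output_size(size_in, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size_in + 2 * padding - kernel) // stride + 1

# Backbone, Detail and C-res layers: 3x3 kernel, stride 1, assumed padding 1.
same_size = conv_output_size(256, kernel=3, stride=1, padding=1)

# Semantic block: two stride-2 layers (S0, S1) followed by a stride-1 layer (S2).
size = 256
for stride in (2, 2, 1):
    size = conv_output_size(size, kernel=3, stride=stride, padding=1)

# The Upsample row maps the 64x64 semantic image back to 256x256 (factor 4).
upsampled = size * 4
```

Under these assumptions the Semantic block produces a 64×64 image, which matches the input size of the Upsample row.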

In Fig. 4 and Table I, the network takes the original image as input and outputs the reconstructed image. The backbone of the network consists of four feature extraction convolutional blocks (Cin, C1, C2, C3).

The backbone is followed by the low-frequency feature extraction part, that is, the Semantic block in the figure. As shown in Fig. 6, this block includes two down-sampling convolutional layers (S0, S1) with a stride of 2 and a common convolutional layer (S2), and generates a low-resolution semantic image. The semantic image is then up-sampled to the same size as the input image to obtain the low-frequency image.

We copy the features of different depths (the outputs of C1, C2 and C3) and reshuffle their channels with convolutional layers (R1, R2, R3). After that, we input them into the weight-sharing Detail branch to obtain three high-frequency images. The Detail branch is shown in Fig. 5 and Table I; it includes three convolutions (D0, D1, D2) that reduce the number of channels to 1 to obtain a high-frequency image.

The reason for adding the reshuffle layers (R1, R2, R3) is that the Detail block shares its weights, so the feature maps from which it extracts high-frequency information should follow the same channel distribution. We therefore add convolutional layers that do not share weights to reshuffle and sort the channels of the features, so that the features can adapt to the weight-sharing Detail block.

Finally, the three high-frequency images and the low-frequency image are added pixel by pixel to obtain the final reconstructed image.

Here we observe that the final reconstructed image is obtained by adding the high-frequency images and the low-frequency image. Therefore, the high-frequency images and the low-frequency image should be complementary in the data distribution space. When the network learns to generate images, the high-frequency image should be the residual data of the low-frequency image. So we design the residual branch (the C-res block). The C-res block provides a skip connection: its output is added to the features in the main path, and the sum is fed to the following layers. In this way, the branches learn the residual data between the source image and the semantic image, which is both enforced and natural. In order to make the skip-connected data match the deep features more closely, we perform three convolutions in the C-res block to increase the semantics of the skip-connected features.

As shown in the activation column of Table I, we take the properties of low-frequency and high-frequency images into account. We choose LeakyReLU[23] as the activation function of the convolutions of the backbone network and the high-frequency part (Detail block), and ReLU as the activation function of the convolutional layers of the residual branch (C-res block) and the Semantic block. The output of ReLU has a certain degree of sparseness, which allows our low-frequency features to filter out more useless information and retain more blurred but semantic information. Finally, in order to constrain the pixel values of the obtained images to a controllable range, we use the Tanh activation function after the last layer of the Detail and Semantic blocks.

In general, we first apply a convolution block (Cin) to obtain a set of feature maps containing various features. Then, after three identical convolution operations (C1, C2, C3), three sets of shallow features are obtained. After two down-sampling layers (S0, S1), the deep feature is obtained. We believe that shallow features contain more low-level information such as texture and detail. We reshuffle the channels and feed these three sets of shallow features into the high-frequency branch (Detail block) to obtain three high-frequency images. Moreover, we believe that the deep feature carries more semantic and global information, so we convolve and up-sample the deep feature to get our low-frequency image. At the same time, we use the residual branch (C-res block) to explicitly establish the residual relationship between the high-frequency features and the low-frequency features. Lastly, we add these feature images pixel by pixel to get the reconstructed image.

Fig. 7: The component of loss function.

III-B Loss Function

Fig. 8: Decomposition and fusion of infrared and visible images.

In the training phase, the loss function of our network consists of three parts: the gradient loss of the high-frequency images, the distribution loss of the low-frequency image and the content reconstruction loss of the reconstructed image. The total loss is a weighted sum of these three terms, in which two hyper-parameters balance the three losses.

As shown in Fig. 7, the gradient loss is the mean square error between each of the three high-frequency images and the gradient image of the original image, accumulated over the three high-frequency images. The gradient image of the original image is obtained by applying the Laplacian gradient operator to the input source image; the Laplacian operator is applied as the convolution in Equ. 4.
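
A minimal NumPy sketch of such a gradient loss is given below. It is an illustration under stated assumptions rather than the exact training code: the 4-neighbour Laplacian template, the edge-replicating border and the function names are choices made for this example, and no normalisation of the pixel range is applied.

```python
import numpy as np

def laplacian(image):
    """4-neighbour discrete Laplacian with replicated borders (assumed template)."""
    img = np.asarray(image, dtype=np.float64)
    p = np.pad(img, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * img

def gradient_loss(source, high_freq_images):
    """Sum over the high-frequency images of the MSE against the Laplacian of the source."""
    target = laplacian(source)
    return sum(
        float(np.mean((np.asarray(h, dtype=np.float64) - target) ** 2))
        for h in high_freq_images
    )

rng = np.random.default_rng(0)
source = rng.random((8, 8))
# If every high-frequency image equals the gradient image, the loss vanishes.
perfect = [laplacian(source)] * 3
loss_at_target = gradient_loss(source, perfect)
```

Because the three errors are accumulated rather than averaged, a constant offset of 1 in all three high-frequency images gives a loss of 3.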

The distribution loss in Equ. 6 constrains the data distribution of the low-frequency image. We calculate a strongly supervised loss on the high-frequency images above, and a strongly supervised loss on the reconstructed image below. For the low-frequency image, we want the Semantic block to learn to extract deep semantic information rather than to memorize a given answer. Moreover, we cannot provide a suitable low-frequency image to the network for reference, because the low-frequency information is not simply a down-sampled image. However, without any loss on this branch, it is difficult to get the low-frequency image we really want. Therefore, we use the down-sampled images as an approximate data distribution of low-frequency images, so that the low-frequency results generated by our network lie in the "low-frequency domain" space. The experiment in the next section shows that this loss is effective.

The distribution loss is an adversarial loss computed between the low-frequency semantic image generated by the network and the low-frequency blurred image obtained by down-sampling the source image twice. It is averaged over the number of images, and the adversarial loss function we use is the one defined in LSGAN[24].
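
For reference, the least-squares adversarial losses of LSGAN can be sketched as follows, assuming the common 0/1 label coding (target 1 for real samples, 0 for generated samples). In the setting described above, the "real" samples would be the down-sampled blurred images and the "fake" samples the low-frequency images produced by the network; the discriminator itself is not shown, and the function names are hypothetical.

```python
import numpy as np

def lsgan_discriminator_loss(d_real, d_fake):
    """0.5 * E[(D(real) - 1)^2] + 0.5 * E[D(fake)^2], averaged over the batch."""
    d_real = np.asarray(d_real, dtype=np.float64)
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return float(0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2))

def lsgan_generator_loss(d_fake):
    """0.5 * E[(D(fake) - 1)^2]: the generator is rewarded when D scores fakes as real."""
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return float(0.5 * np.mean((d_fake - 1.0) ** 2))

# A discriminator that separates the two sets perfectly has zero loss,
# while the generator is then penalised with the value 0.5.
d_loss = lsgan_discriminator_loss(np.ones(4), np.zeros(4))
g_loss = lsgan_generator_loss(np.zeros(4))
```

Only the distribution of the discriminator scores enters the loss, so the network is never given a pixel-wise low-frequency target.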

The content reconstruction loss in Equ. 6 is the image content loss of the reconstructed image. It consists of two parts: the pixel-level reconstruction loss and the structural similarity loss, balanced by a hyper-parameter.
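
A minimal sketch of a reconstruction loss of this kind is shown below. Several details are assumptions made for illustration: the pixel loss is taken as a mean square error, the SSIM is computed once over the whole image instead of over local windows, the structural term is written as 1 − SSIM, and the weight is applied to the structural term. The function names are hypothetical.

```python
import numpy as np

def pixel_loss(recon, source):
    """Pixel-level loss, assumed here to be the mean square error."""
    return float(np.mean((recon - source) ** 2))

def global_ssim(x, y, data_range=1.0):
    """SSIM computed from global image statistics (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2.0 * mx * my + c1) * (2.0 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)

def reconstruction_loss(recon, source, weight):
    """Pixel loss plus a weighted structural term (1 - SSIM)."""
    return pixel_loss(recon, source) + weight * (1.0 - global_ssim(recon, source))

rng = np.random.default_rng(1)
image = rng.random((16, 16))
noisy = np.clip(image + 0.1 * rng.standard_normal(image.shape), 0.0, 1.0)
loss_identical = reconstruction_loss(image, image, weight=10.0)
loss_noisy = reconstruction_loss(noisy, image, weight=10.0)
```

A perfect reconstruction gives zero for both terms, and any deviation increases at least the pixel term.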


As shown in Fig. 7, the total loss function combines the gradient loss, the distribution loss, the pixel-level reconstruction loss and the structural similarity loss, with hyper-parameters that balance the losses.

III-C Image Fusion

In the testing phase, our fusion structure is divided into two parts: decomposition and fusion, as shown in Fig. 9. The decomposition network decomposes each image into three high-frequency images and one low-frequency image. The fusion strategy ("FS" in Fig. 9) fuses the corresponding feature images and reconstructs them to obtain the final image.

Fig. 9: The framework of the proposed method. "Decomposition Network" decomposes the image and "FS" indicates the fusion strategy.

In Fig. 9, the two inputs are an infrared image and a visible light image. The two images are fed into the decomposition network to obtain two sets of feature images. One group comes from the visible light image and includes three visible light high-frequency images and one visible light low-frequency image. The other group comes from the infrared image and includes three infrared high-frequency images and one infrared low-frequency image. For the four corresponding pairs of feature images, our fusion strategy offers a variety of fusion methods to obtain the final fused image.

In the following subsection, we will introduce the fusion strategy.

III-D Fusion Strategy

We design a fusion strategy to get a fused image. As shown in Fig. 8, we first use the decomposition network to decompose the visible light image and the infrared image to obtain two sets of high-frequency and low-frequency feature images. The corresponding high-frequency and low-frequency feature images from the two sources are fused using different specific fusion strategies to obtain three fused high-frequency feature images and one fused low-frequency feature image. Finally, the fused feature images are added pixel by pixel to obtain the fused image, in the same way as an image is reconstructed in the training phase.

We design two fusion strategies for high-frequency images: pixel-wise addition (add) and pixel-wise maximum (max). We also design two fusion strategies for low-frequency images: pixel-wise averaging (avg) and pixel-wise maximum (max), as shown in Fig. 10.

Fig. 10: High and low frequency image fusion strategy. Here a and b are any pair of points from feature images, and c is the corresponding fused pixel.

The fused high-frequency features and the fused low-frequency feature are computed pixel by pixel, for each of the three high-frequency images and over all pixels in the image. For every pair of corresponding pixels in the three groups of high-frequency images and in the low-frequency images, we apply the chosen fusion rule to get the pixels of the fused high-frequency images and the fused low-frequency image. Finally, the fused feature images are added to obtain the final fused image.
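
The four fusion rules and the final pixel-wise summation can be sketched directly in NumPy. The function names and argument layout below are hypothetical; only the rules themselves (add or max for high-frequency images, avg or max for low-frequency images, followed by summation) come from the description above.

```python
import numpy as np

def fuse_high(a, b, mode="add"):
    """Fuse two high-frequency images by pixel-wise addition or maximum."""
    if mode == "add":
        return a + b
    if mode == "max":
        return np.maximum(a, b)
    raise ValueError(f"unknown high-frequency fusion mode: {mode}")

def fuse_low(a, b, mode="avg"):
    """Fuse two low-frequency images by pixel-wise average or maximum."""
    if mode == "avg":
        return (a + b) / 2.0
    if mode == "max":
        return np.maximum(a, b)
    raise ValueError(f"unknown low-frequency fusion mode: {mode}")

def fuse_images(ir_highs, ir_low, vis_highs, vis_low, high_mode="add", low_mode="max"):
    """Fuse corresponding feature images, then add them pixel by pixel."""
    fused_highs = [fuse_high(i, v, high_mode) for i, v in zip(ir_highs, vis_highs)]
    fused_low = fuse_low(ir_low, vis_low, low_mode)
    return sum(fused_highs) + fused_low

# Toy 1x2 "images" standing in for decomposed feature images.
a = np.array([[1.0, -2.0]])
b = np.array([[3.0, 1.0]])
fused = fuse_images([a, a, a], b, [b, b, b], a, high_mode="add", low_mode="max")
```

In this toy case each fused high-frequency image is `[[4, -1]]`, the fused low-frequency image is `[[3, 1]]`, and their sum is `[[15, -2]]`.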


IV Experiments and Analysis

IV-A Training and Testing Details

For the selection of hyper-parameters, we keep the values of the losses as close to the same order of magnitude as possible. So, in formula 12, we set the three hyper-parameters to 0.1, 100 and 10 respectively, by cross validation.

Our goal is to train a powerful decomposition network that can decompose images into high-frequency and low-frequency images well. Therefore, our input images in the training phase are not limited to infrared images and visible light images. We can also use MS-COCO, ImageNet[5] or other images to achieve this goal. In our experiment, we use MS-COCO as the training set to train our decomposition network. We select about 80,000 images as input images. These images are converted to grayscale and then resized to 256×256.

We select twelve pairs of infrared and visible light images from the TNO dataset[31] as our test images. The TNO dataset is not used as training data because it contains few pictures and is better suited for testing. In addition, we select fifty pairs of infrared and visible light images from the RoadScene dataset[34] for testing.

We feed a batch of 64 images to the network in every iteration. We use the Adam[10] optimizer and an adaptive learning rate decay method[36] as the learning rate scheduler. We set the initial learning rate to 1e-3, the attenuation factor to 0.5, the maximum patience to 5 iterations, and the minimum learning rate threshold to 1e-8. We set the maximum number of epochs to 1000.
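
The effect of these scheduler settings can be illustrated with a small stand-alone sketch of a reduce-on-plateau rule. The class below is a hypothetical simplification written for this example, not the scheduler implementation used in the experiments: it halves the learning rate once the monitored loss has failed to improve for more than `patience` consecutive steps, and never goes below `min_lr`.

```python
class PlateauDecay:
    """Toy reduce-on-plateau learning rate schedule (illustrative only)."""

    def __init__(self, lr=1e-3, factor=0.5, patience=5, min_lr=1e-8):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        """Record one monitored loss value and return the current learning rate."""
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                # Decay, but never below the minimum learning rate threshold.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_steps = 0
        return self.lr

scheduler = PlateauDecay()
# One improving step followed by five stagnant steps: still within patience.
rates = [scheduler.step(1.0) for _ in range(6)]
# The sixth stagnant step exceeds the patience and triggers the first decay.
after_plateau = scheduler.step(1.0)
```

With a factor of 0.5, about seventeen decays are enough to bring the rate from 1e-3 down to the 1e-8 floor.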

In the test phase, because our network is a fully convolutional network, we input infrared images and visible light images without preprocessing operations.

The experiment is conducted on two NVIDIA TITAN Xp GPUs and 128GB of CPU memory. We decompose 1000 images with 256×256 resolution one by one and compute the average time. It takes about 2ms to decompose each image.

IV-B The Role of the Adversarial Loss

As shown in Fig. 11, if we do not put constraints on the low-frequency image, it is difficult for the network to learn the semantic low-frequency image we want. Without the distribution loss function, the high-frequency images learned by the network contain too much semantic information, such as the distribution of colors, which is not high-frequency information, and the low-frequency images lose a lot of semantic information.

In order to allow the Semantic block to learn the real low-frequency information we want, we give it a hint in the form of a weakly supervised loss. As in Equ. 9, we regard the down-sampled image as an approximate solution of the low-frequency image, so that the low-frequency image generated by the network follows the distribution of low-frequency images.
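
The reference images used for this weak supervision are obtained by down-sampling the source image twice. A minimal sketch is given below; the use of 2×2 average pooling as the down-sampling operator and the function names are assumptions of this example, since the text does not specify the interpolation method.

```python
import numpy as np

def downsample2x(image):
    """Halve the resolution by 2x2 average pooling (assumed operator)."""
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    img = img[:h - h % 2, :w - w % 2]  # drop an odd trailing row or column
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def low_frequency_reference(image):
    """Down-sample the source image twice, e.g. 256x256 -> 64x64."""
    return downsample2x(downsample2x(image))

source = np.arange(16, dtype=np.float64).reshape(4, 4)
reference = low_frequency_reference(source)
```

For a 256×256 input the reference is 64×64, the same size as the low-resolution semantic image produced before up-sampling.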

Fig. 11: The effect of the adversarial loss. a) is the original image, b) and c) are the high-frequency image and the low-frequency image without the adversarial loss, and d) and e) are the results with the adversarial loss.

IV-C The Details of the Decomposed Images

Although the loss function of our high-frequency images is calculated with a gradient map obtained by the Laplacian operator, our high-frequency images are quite different from the Laplacian gradient image.

As shown in Fig. 12, we compare the high-frequency images decomposed by the proposed decomposition network and by the Laplacian operator. It can be seen that the Laplacian gradient images extract only part of the high-frequency information and contain a lot of noise.

Fig. 12: High-frequency images decomposed by the proposed decomposition network and the Laplacian operator.

The high-frequency image decomposed by our decomposition network not only retains almost all high-frequency information on the basis of the Laplacian gradient image, but also completely extracts the contour and detail information of the object. In addition, our high-frequency images have a certain degree of semantic recognition, and can clearly express the semantic features according to the outline of the objects.

Fig. 13: Experiment on street images.

IV-D Comparison with State-of-the-Art Methods

We select ten classic and state-of-the-art fusion methods to compare with our proposed method, including Curvelet Transform (CVT)[27], dual-tree complex wavelet transform (DTCWT)[11], Multi-resolution Singular Value Decomposition (MSVD)[26], DenseFuse[16], the GAN-based fusion network (FusionGAN)[22], a general end-to-end fusion network (IFCNN)[38], MDLatLRR[14], NestFuse[13], FusionDN[34] and U2Fusion[33]. We use the public code of these methods and the parameters given in the corresponding papers to obtain the fused images.

Because there are currently no definitive evaluation indicators for measuring the quality of a fused image, we compare the methods comprehensively using both subjective evaluation and objective evaluation.

IV-D1 Subjective Evaluation

In different fields and for different tasks, everyone has their own criteria for judging. We consider the subjective impression of the picture, such as lightness, fidelity, noise and clarity.

Fig. 14: Experiment on road images.

In Fig. 13 and Fig. 14, our method is compared with other methods. It can be clearly seen that our fused image not only retains the radiation information of the infrared image, but also retains the detailed texture information of the visible light image. More importantly, our image does not contain much noise. We marked some salient areas with red boxes. For example, in Fig. 13, the canopy of the shop has less noise. In Fig. 14, the outline of the person in the distance is clearly visible.

Method            EI       SF       EN      SCD     FMI-1   FMI-2   MI       SD       DF      AG      QG
CVT               42.9631  11.1129  6.4989  1.5812  0.4240  0.3945  12.9979  27.4613  5.4530  4.2802  0.4623
DTCWT             42.4889  11.1296  6.4791  1.5829  0.4419  0.3936  12.9583  27.3099  5.4229  4.2370  0.5024
MSVD              27.6098  8.5538   6.2807  1.5857  0.2828  0.2470  12.5613  24.0288  4.2283  2.8773  0.3375
DenseFuse         36.4838  9.3238   6.8526  1.5329  0.4389  0.3897  13.7053  38.0412  4.6176  3.6299  0.4569
FusionGan         32.5997  8.0476   6.5409  0.6876  0.4083  0.4142  13.0817  29.1495  4.2727  3.2803  0.2784
IFCNN             44.9725  11.8590  6.6454  1.6126  0.4052  0.3739  13.2909  33.0086  5.9808  4.5521  0.4864
MDLatLRR          28.0985  7.3383   6.3016  1.6043  0.4296  0.4080  12.6032  24.7217  3.5486  2.7938  0.4127
NestFuse          38.4401  9.7098   6.8856  1.5839  0.4504  0.3694  13.7713  38.3311  4.9099  3.8376  0.4895
FusionDN          61.3491  14.2256  7.4073  1.6148  0.3651  0.3159  13.6147  48.5659  7.4565  5.9832  0.3785
U2Fusion          48.4915  11.0368  6.7227  1.5946  0.3594  0.3381  13.4453  31.3794  5.8343  4.7392  0.4039
Ours (max + avg)  30.1185  8.0322   6.3621  0.7133  0.1282  0.0984  12.7242  25.2654  3.9611  3.0370  0.1032
Ours (max + max)  33.5991  8.4408   6.6711  0.6959  0.1299  0.0985  13.3422  38.5077  4.1961  3.3191  0.1023
Ours (add + avg)  46.0222  12.0192  6.5235  0.6629  0.1278  0.0984  13.0470  27.4135  6.1148  4.6537  0.1017
Ours (add + max)  48.5475  12.3100  6.8973  0.6748  0.1293  0.0984  13.7945  39.9003  6.2682  4.8526  0.1019
TABLE II: Average values of the objective evaluation indicators on the TNO dataset

Method            EI       SF       EN      SCD     FMI-1   FMI-2   MI       SD       DF      AG      QG
CVT               59.7642  14.7379  7.0159  1.3418  0.4138  0.3631  14.0319  36.0884  6.9618  5.7442  0.4499
DTCWT             57.3431  14.7318  6.9211  1.3329  0.3458  0.2383  13.8421  34.7264  6.7810  5.5228  0.4402
MSVD              36.0475  11.3182  6.6960  1.3458  0.2659  0.2195  13.3919  30.9643  5.0926  3.6171  0.3600
DenseFuse         34.0135  8.5541   6.6740  1.3491  0.4173  0.3857  13.3480  30.6655  3.9885  3.2740  0.3916
FusionGan         35.4048  8.6400   7.1753  0.8671  0.3410  0.3609  14.3507  42.3040  3.9243  3.3469  0.2591
IFCNN             57.6653  15.0677  6.9730  1.3801  0.4032  0.3456  13.9460  35.8183  7.0401  5.6242  0.5100
MDLatLRR          36.9468  9.3638   6.7171  1.3636  0.4241  0.3875  13.4342  31.3505  4.3216  3.5530  0.4483
NestFuse          53.9286  14.2820  7.3598  1.2597  0.4342  0.3484  14.7196  48.9920  6.2840  5.1834  0.4758
FusionDN          63.1690  16.7138  7.5323  1.1882  0.3621  0.3009  15.0646  55.0559  7.7925  6.4950  0.4631
U2Fusion          66.2529  15.8242  7.1969  1.3551  0.3717  0.3199  14.3938  42.9368  7.5930  6.3133  0.5112
Ours (max + avg)  39.4996  10.7670  6.7575  1.3394  0.3151  0.2311  13.5150  31.6977  4.8180  3.8643  0.3591
Ours (max + max)  39.7592  10.6215  6.8186  1.3223  0.3171  0.2217  13.6371  39.6907  4.7275  3.8575  0.3505
Ours (add + avg)  64.5832  16.5082  6.9535  1.4087  0.4295  0.3883  13.9071  35.7447  7.7606  6.2775  0.4969
Ours (add + max)  63.4740  16.1703  6.9666  1.3774  0.4076  0.3640  13.9332  43.2688  7.5247  6.1431  0.4478
TABLE III: Average values of the objective evaluation indicators on the RoadScene dataset

IV-D2 Objective Evaluation

Subjective impressions depend heavily on personal factors, so subjective evaluation alone is not sufficient. We select eleven objective evaluation indicators from the popular objective indicators for comprehensive evaluation. They are: Edge Intensity (EI)[35], Spatial Frequency (SF)[7], Entropy (EN)[30], Sum of Correlation Coefficients (SCD)[1], two variants of Fast Mutual Information (FMI)[8], Mutual Information (MI)[28], Standard Deviation of Image (SD), Definition (DF)[6], Average Gradient (AG)[4] and QG[35].

The objective indicators fall into two categories. The first evaluates the fused image on its own, measuring its edge intensity (EI), the number of abrupt intensity changes (SF), average gradient (AG), entropy (EN), clarity (DF) and contrast (SD). The second evaluates the fused image against the source images, using mutual information (MI and the two FMI variants) and more complex measures such as QG.
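As a rough illustration of the first category, the sketch below computes four single-image indicators for a tiny grayscale image stored as a list of rows. The formulas are commonly used definitions (Shannon entropy of the gray-level histogram, population standard deviation, spatial frequency from row and column differences, and mean gradient magnitude). The text does not spell out the exact formulas used in the experiments, so these definitions, their normalization, and all function names are assumptions.

```python
import math
from collections import Counter


def entropy(img):
    """EN: Shannon entropy (in bits) of the gray-level histogram."""
    pixels = [p for row in img for p in row]
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def standard_deviation(img):
    """SD: population standard deviation of the pixel values."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2), built from horizontal and vertical differences."""
    h, w = len(img), len(img[0])
    rf2 = sum((img[i][j] - img[i][j - 1]) ** 2
              for i in range(h) for j in range(1, w)) / (h * w)
    cf2 = sum((img[i][j] - img[i - 1][j]) ** 2
              for i in range(1, h) for j in range(w)) / (h * w)
    return math.sqrt(rf2 + cf2)


def average_gradient(img):
    """AG: mean of sqrt((dx^2 + dy^2) / 2) using forward differences."""
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(h - 1):
        for j in range(w - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += math.sqrt((dx * dx + dy * dy) / 2)
    return total / ((h - 1) * (w - 1))


# Two toy 2x2 images: a constant patch and a checkerboard.
flat = [[5, 5], [5, 5]]
checker = [[0, 255], [255, 0]]

for name, img in (("flat", flat), ("checker", checker)):
    print(name, entropy(img), standard_deviation(img),
          spatial_frequency(img), average_gradient(img))
```

All four values are zero for the constant patch and grow with the amount of detail, which is why larger values are usually read as a sharper, more informative fused image.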

We compare the proposed method with ten other methods. The average values over all fused images for the TNO and RoadScene datasets are shown in Table II and Table III, respectively. In each table, the best value is shown in bold red and the second-best value in bold italic.

Table II and Table III show that our proposed method obtains eleven best values and four second-best values. On the TNO dataset, our method obtains one best result and six second-best results. On the RoadScene dataset, our method with the add + avg strategy obtains two best results and four second-best results. Compared with the other operators in the fusion strategy, the addition operator for high-frequency information achieves better results on several indicators. Although our method does not achieve the best value on every indicator, it obtains the second-best value on many of them.
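The RoadScene counts can be checked directly against the rows of Table III. The sketch below copies those rows and tallies how often a given method ranks first or second in a column. It assumes that a larger value is better for every indicator and that the "add + avg" row is the variant being counted; the helper name `tally` is hypothetical.

```python
# Rows of Table III (RoadScene), copied from the table; one value per indicator.
TABLE_III = {
    "CVT":       [59.7642, 14.7379, 7.0159, 1.3418, 0.4138, 0.3631, 14.0319, 36.0884, 6.9618, 5.7442, 0.4499],
    "DTCWT":     [57.3431, 14.7318, 6.9211, 1.3329, 0.3458, 0.2383, 13.8421, 34.7264, 6.7810, 5.5228, 0.4402],
    "MSVD":      [36.0475, 11.3182, 6.6960, 1.3458, 0.2659, 0.2195, 13.3919, 30.9643, 5.0926, 3.6171, 0.3600],
    "DenseFuse": [34.0135, 8.5541, 6.6740, 1.3491, 0.4173, 0.3857, 13.3480, 30.6655, 3.9885, 3.2740, 0.3916],
    "FusionGAN": [35.4048, 8.6400, 7.1753, 0.8671, 0.3410, 0.3609, 14.3507, 42.3040, 3.9243, 3.3469, 0.2591],
    "IFCNN":     [57.6653, 15.0677, 6.9730, 1.3801, 0.4032, 0.3456, 13.9460, 35.8183, 7.0401, 5.6242, 0.5100],
    "MDLatLRR":  [36.9468, 9.3638, 6.7171, 1.3636, 0.4241, 0.3875, 13.4342, 31.3505, 4.3216, 3.5530, 0.4483],
    "NestFuse":  [53.9286, 14.2820, 7.3598, 1.2597, 0.4342, 0.3484, 14.7196, 48.9920, 6.2840, 5.1834, 0.4758],
    "FusionDN":  [63.1690, 16.7138, 7.5323, 1.1882, 0.3621, 0.3009, 15.0646, 55.0559, 7.7925, 6.4950, 0.4631],
    "U2Fusion":  [66.2529, 15.8242, 7.1969, 1.3551, 0.3717, 0.3199, 14.3938, 42.9368, 7.5930, 6.3133, 0.5112],
    "max + avg": [39.4996, 10.7670, 6.7575, 1.3394, 0.3151, 0.2311, 13.5150, 31.6977, 4.8180, 3.8643, 0.3591],
    "max + max": [39.7592, 10.6215, 6.8186, 1.3223, 0.3171, 0.2217, 13.6371, 39.6907, 4.7275, 3.8575, 0.3505],
    "add + avg": [64.5832, 16.5082, 6.9535, 1.4087, 0.4295, 0.3883, 13.9071, 35.7447, 7.7606, 6.2775, 0.4969],
    "add + max": [63.4740, 16.1703, 6.9666, 1.3774, 0.4076, 0.3640, 13.9332, 43.2688, 7.5247, 6.1431, 0.4478],
}


def tally(table, method):
    """Count the columns in which `method` ranks first or second.

    Assumes higher is better for every column (an assumption, not stated
    in the text).
    """
    best = second = 0
    n_cols = len(next(iter(table.values())))
    for col in range(n_cols):
        ranked = sorted(table, key=lambda name: table[name][col], reverse=True)
        if ranked[0] == method:
            best += 1
        elif ranked[1] == method:
            second += 1
    return best, second


print(tally(TABLE_III, "add + avg"))  # (2, 4)
```

Under these assumptions the add + avg row accounts for the two best and four second-best RoadScene results reported in the text.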

V Conclusions

In this paper, we propose a novel neural network for image decomposition and build on it a fusion framework for infrared and visible light images. First, the decomposition network decomposes the infrared image and the visible light image into multiple high-frequency feature images and one low-frequency feature image each. Second, corresponding feature images are fused with a specific fusion strategy to obtain the fused feature images. Finally, the fused feature images are added pixel by pixel to obtain the fused image. This kind of decomposition network is general: any number of images can be decomposed quickly and effectively by the neural network. Because neural networks can readily use GPUs to accelerate matrix computation, the decomposition can also be very fast. We have evaluated the proposed method both subjectively and objectively, and the experimental results show that it reaches the state of the art.

Although the network structure is simple, it demonstrates that a neural network can feasibly decompose an image. We conjecture that the CNN uses the semantics of the image to filter noise while preserving edges, and thereby obtains very good high-frequency information images. We will continue to study image decomposition based on deep learning, including simplifying image decomposition calculations that are currently complex, such as the wavelet transform and low-rank decomposition, and designing more suitable network structures for other image processing applications. We believe that the proposed network can be used for other image processing tasks, including multi-focus fusion, medical image fusion and multi-exposure fusion, as well as basic computer vision tasks such as detection, recognition and classification. We will test the method on these tasks in future work.
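The fusion and reconstruction steps summarized above can be sketched as follows. The decomposition network itself is not reproduced; the sketch starts from feature maps that have already been decomposed. It assumes that, in strategy names such as "add + avg", the first operator is applied to the high-frequency maps and the second to the low-frequency map, and that "max" means a plain element-wise maximum. The function names `fuse_maps` and `fuse_images` and the toy values are hypothetical.

```python
def fuse_maps(a, b, op):
    """Fuse two equally sized feature maps element-wise with 'add', 'max' or 'avg'."""
    ops = {
        "add": lambda x, y: x + y,
        "max": max,
        "avg": lambda x, y: (x + y) / 2,
    }
    f = ops[op]
    return [[f(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_images(ir_feats, vis_feats, high_op="add", low_op="avg"):
    """Fuse decomposed features and reconstruct the fused image.

    ir_feats and vis_feats are (high_frequency_maps, low_frequency_map) pairs,
    where high_frequency_maps is a list of 2-D maps (assumed layout).
    """
    ir_high, ir_low = ir_feats
    vis_high, vis_low = vis_feats
    fused = [fuse_maps(h_ir, h_vis, high_op) for h_ir, h_vis in zip(ir_high, vis_high)]
    fused.append(fuse_maps(ir_low, vis_low, low_op))
    # Reconstruction: add all fused feature maps pixel by pixel.
    h, w = len(fused[0]), len(fused[0][0])
    return [[sum(m[i][j] for m in fused) for j in range(w)] for i in range(h)]


# Toy 1x2 example with three high-frequency maps and one low-frequency map per source.
ir = ([[[1, -2]], [[0, 3]], [[2, 0]]], [[100, 50]])
vis = ([[[0, 1]], [[4, -1]], [[1, 1]]], [[60, 90]])

print(fuse_images(ir, vis, "add", "avg"))  # [[88.0, 72.0]]
print(fuse_images(ir, vis, "max", "max"))  # [[107, 95]]
```

Keeping the operator choice as a parameter makes it easy to compare the four strategy combinations on the same decomposed features.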


References

  • [1] V. Aslantas and E. Bendes (2015) A new image quality metric for image fusion: the sum of the correlations of differences. AEU-International Journal of Electronics and Communications 69 (12), pp. 1890–1896. Cited by: §IV-D2.
  • [2] Y. Bin, Y. Chao, and H. Guoyu (2016) Efficient image fusion with approximate sparse representation. International Journal of Wavelets, Multiresolution and Information Processing 14 (04), pp. 1650024. Cited by: §I.
  • [3] L. J. Chipman, T. M. Orr, and L. N. Graham (1995) Wavelets and image fusion. 3, pp. 3248. Cited by: §II-A.
  • [4] G. Cui, H. Feng, Z. Xu, Q. Li, and Y. Chen (2015) Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition. Optics Communications 341, pp. 199–209. Cited by: §IV-D2.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §IV-A.
  • [6] X. Desheng (2004) Research of measurement for digital image definition. Journal of Image and Graphics. Cited by: §IV-D2.
  • [7] A. M. Eskicioglu and P. S. Fisher (1995) Image quality measures and their performance. IEEE Transactions on communications 43 (12), pp. 2959–2965. Cited by: §IV-D2.
  • [8] M. Haghighat and M. A. Razian (2014) Fast-fmi: non-reference image fusion metric. In 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–3. Cited by: §IV-D2.
  • [9] A. B. Hamza, Y. He, H. Krim, and A. S. Willsky (2005) A multiscale approach to pixel-level image fusion. Computer-Aided Engineering 12 (2), pp. 135–146. Cited by: §I.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv: Learning. Cited by: §IV-A.
  • [11] J. J. Lewis, R. J. Callaghan, S. G. Nikolov, D. R. Bull, and N. Canagarajah (2007) Pixel-and region-based image fusion with complex wavelets. Information fusion 8 (2), pp. 119–130. Cited by: §IV-D.
  • [12] H. Li, B. S. Manjunath, and S. K. Mitra (1995) Multisensor image fusion using the wavelet transform. Graphical Models and Image Processing 57 (3), pp. 235–245. Cited by: §II-A.
  • [13] H. Li, X. Wu, and T. Durrani (2020) NestFuse: an infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Transactions on Instrumentation and Measurement. Cited by: §I, §IV-D.
  • [14] H. Li, X. Wu, and J. Kittler (2020) MDLatLRR: a novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing. Cited by: §I, §II-B, §IV-D.
  • [15] H. Li and X. Wu (2017) Multi-focus image fusion using dictionary learning and low-rank representation. In International Conference on Image and Graphics, pp. 675–686. Cited by: §I.
  • [16] H. Li and X. Wu (2018) Densefuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623. Cited by: §I, §II-C, §IV-D.
  • [17] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §IV-A.
  • [18] G. Liu and S. Yan (2011) Latent low-rank representation for subspace segmentation and feature extraction. pp. 1615–1622. Cited by: §II-B, §II-B.
  • [19] Y. Liu, X. Chen, H. Peng, and Z. Wang (2017) Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207. Cited by: §I, §II-C.
  • [20] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, and X. Wang (2018) Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173. Cited by: §I, §I.
  • [21] J. Ma, Y. Ma, and C. Li (2019) Infrared and visible image fusion methods and applications: a survey. Information Fusion 45, pp. 153–178. Cited by: §I.
  • [22] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: §I, §II-C, §IV-D.
  • [23] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3. Cited by: §III-A.
  • [24] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley (2017) Least squares generative adversarial networks. pp. 2813–2821. Cited by: §III-B.
  • [25] T. Mertens, J. Kautz, and F. Van Reeth (2009) Exposure fusion: a simple and practical alternative to high dynamic range photography. Computer Graphics Forum 28 (1), pp. 161–171. Cited by: §I.
  • [26] V. Naidu (2011) Image fusion technique using multi-resolution singular value decomposition. Defence Science Journal 61 (5), pp. 479. Cited by: §IV-D.
  • [27] F. Nencini, A. Garzelli, S. Baronti, and L. Alparone (2007) Remote sensing image fusion using the curvelet transform. Information fusion 8 (2), pp. 143–156. Cited by: §IV-D.
  • [28] H. Peng, F. Long, and C. Ding (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8), pp. 1226–1238. Cited by: §IV-D2.
  • [29] K. R. Prabhakar, V. S. Srikar, and R. V. Babu (2017) DeepFuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In ICCV, pp. 4724–4732. Cited by: §I.
  • [30] J. W. Roberts, J. A. van Aardt, and F. B. Ahmed (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522. Cited by: §IV-D2.
  • [31] A. Toet et al. (2014) TNO image fusion dataset. Figshare. data. Cited by: §IV-A.
  • [32] K. P. Upla, M. V. Joshi, and P. P. Gajjar (2014) An edge preserving multiresolution fusion: use of contourlet transform and mrf prior. IEEE Transactions on Geoscience and Remote Sensing 53 (6), pp. 3210–3220. Cited by: §I.
  • [33] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling (2020) U2fusion: a unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §IV-D.
  • [34] H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo (2020) FusionDN: a unified densely connected network for image fusion. In AAAI, pp. 12484–12491. Cited by: §IV-A, §IV-D.
  • [35] C. S. Xydeas and V. S. Petrovic (2000) Objective pixel-level image fusion performance measure. In Sensor Fusion: Architectures, Algorithms, and Applications IV, Vol. 4051, pp. 89–98. Cited by: §IV-D2.
  • [36] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv: Learning. Cited by: §IV-A.
  • [37] Q. Zhang, Y. Fu, H. Li, and J. Zou (2013) Dictionary learning method for joint sparse representation-based image fusion. Optical Engineering 52 (5), pp. 057006. Cited by: §I.
  • [38] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang (2020) IFCNN: a general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118. Cited by: §I, §IV-D.
  • [39] Z. Zhang and R. S. Blum (1999) A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application. Proceedings of the IEEE 87 (8), pp. 1315–1326. Cited by: §I.
  • [40] J. Zong and T. Qiu (2017) Medical image fusion based on sparse representation of classified image patches. Biomedical Signal Processing and Control 34, pp. 195–205. Cited by: §I.