DenseFuse: A Fusion Approach to Infrared and Visible Images

04/23/2018 ∙ by Hui Li, et al.

In this paper, we present a novel deep learning architecture for the infrared and visible image fusion problem. In contrast to conventional convolutional networks, our encoding network combines convolutional layers with a dense block, in which the output of each layer is connected to every other layer. We use this architecture to extract more useful features from the source images in the encoding process. Two fusion strategies are designed to fuse these features. Finally, the fused image is reconstructed by the decoder. Compared with existing fusion methods, the proposed fusion method achieves state-of-the-art performance in objective and subjective assessment. Code and pre-trained models are available at densefuse.





I Introduction

The infrared and visible image fusion task is an important problem in the image processing field. It attempts to extract salient features from the source images and then integrate these features into a single image using an appropriate fusion method[1]. For decades, these fusion methods have achieved extraordinary fusion performance and have been widely used in many applications, such as video surveillance and military applications.

As we all know, many signal processing methods have been applied to the image fusion task to extract salient image features, such as multi-scale decomposition-based methods[2, 3, 4, 5, 6, 7]. First, the salient features are extracted by image decomposition methods. Then an appropriate fusion strategy is utilized to obtain the final fused image.

In recent years, representation learning-based methods have also attracted great attention. In the sparse domain, many fusion methods have been presented, such as the sparse representation (SR) and Histogram of Oriented Gradients (HOG)-based fusion method[8], the joint sparse representation (JSR)-based fusion method[9] and the co-sparse representation-based method[10]. In the low-rank domain, Li et al.[11] proposed a low-rank representation (LRR)-based fusion method. They use LRR instead of SR to extract features, then the $l_1$-norm and the max selection strategy are used to reconstruct the fused image.

With the rise of deep learning, many fusion methods based on deep learning have been proposed. The convolutional neural network (CNN) is used to obtain the image features and reconstruct the fused image[12, 13]. In these CNN-based fusion methods, only the result of the last layer is used as the image features, and this operation loses much useful information obtained by the middle layers. We think this information is important for the fusion method.

To solve this problem, we propose a novel deep learning architecture which is constructed from an encoding network and a decoding network. We use the encoding network to extract image features, and the fused image is obtained by the decoding network. The encoding network is constructed from a convolutional layer and a dense block[14], in which the output of each layer is used as input by the layers that follow it. So in our deep learning architecture, the results of each layer in the encoding network are utilized to construct the feature maps. Finally, the fused image is reconstructed by the fusion strategy and the decoding network, which consists of four CNN layers.

Our paper is structured as follows. In Section II, we briefly review related works. In Section III, the proposed fusion method is introduced in detail. The experimental results are shown in Section IV. The conclusion of our paper, with discussion, is presented in Section V.

Fig. 1: The architecture of the proposed method.

II Related Works

Many fusion algorithms have been proposed in the last two years, especially deep learning-based ones. Unlike multi-scale decomposition-based methods and representation learning-based methods, deep learning-based algorithms use a large number of images to train a network, and the trained network is used to obtain the salient features.

In 2016, Yu Liu et al.[12] proposed a fusion method based on convolutional sparse representation (CSR). CSR is different from CNN-based methods, but this algorithm is still a deep learning-based algorithm, because it also extracts deep features. In this method, the authors use source images to learn several dictionaries at different scales and employ CSR to extract multi-layer features; the fused image is then generated from these features. In 2017, Yu Liu et al.[13] also presented a CNN-based fusion method for the multi-focus image fusion task. Image patches which contain different blurred versions of the input image are used to train the network, which is then used to obtain a decision map. Then, the fused image is obtained by using the decision map and the source images. However, this method is only suitable for multi-focus image fusion.

In ICCV 2017, Prabhakar et al.[15] presented a CNN-based approach for the exposure fusion problem. They proposed a simple CNN-based architecture which contains two CNN layers in the encoding network and three CNN layers in the decoding network. The encoding network has a siamese architecture and its weights are tied. The two input images are encoded by this network. Two feature map sequences are then obtained, and they are fused by an addition strategy. The final fused image is reconstructed by the three CNN layers which form the decoding network. Although this method achieves good performance, it still suffers from two main drawbacks: 1) the network architecture is too simple, and the salient features may not be extracted properly; 2) the method uses only the result calculated by the last layer of the encoding network, so much useful information obtained by the middle layers is lost, and this loss gets worse as the network becomes deeper.

To overcome these drawbacks, we propose a novel deep learning architecture based on CNN layers and a dense block. Our network takes infrared and visible image pairs as input. In the dense block, the feature maps obtained by each layer of the encoding network are cascaded as the next layer’s input.

In traditional CNN-based networks, a degradation problem[15] is exposed as the network depth increases, and much of the information extracted by the middle layers is not used thoroughly. To address the degradation problem, He et al.[16] introduced a deep residual learning framework. To further improve the information flow between layers, Huang et al.[14] proposed a novel architecture with a dense block, in which direct connections from any layer to all subsequent layers are used. The dense block architecture has three advantages: 1) it can preserve as much information as possible; 2) it improves the flow of information and gradients through the network, which makes the network easy to train; and 3) the dense connections have a regularizing effect, which reduces overfitting.

Based on these observations, we incorporate a dense block into our encoding network, which is the origin of the proposed name: DenseFuse. With this operation, our network can preserve more useful information from the middle layers and is easy to train. We introduce our fusion algorithm in detail in the next section.

Fig. 2: The framework of the training process.

III Proposed Fusion Method

In this section, the proposed deep learning-based fusion method is introduced in detail. Over the last five years, CNNs have achieved great success in the image processing field. They are also the foundation of our network.

The input infrared and visible images (gray level images) are denoted as $I_1$ and $I_2$, respectively. We assume that the input images are registered using existing algorithms. Our network architecture has three parts: encoder, fusion layer, and decoder. The architecture of the proposed network is shown in Fig.1.

As shown in Fig.1, the encoder is a siamese architecture network and it has two channels (C11 and DenseBlock11 for channel1, C12 and DenseBlock12 for channel2). The first layer (C11 and C12) contains $3 \times 3$ filters to extract rough features, and the dense block (DenseBlock11 and DenseBlock12) contains three convolution layers (each layer’s output is cascaded as the next layer’s input) which also contain $3 \times 3$ filters. The weights of the encoder channels are tied: C11 and C12 (DenseBlock11 and DenseBlock12) share the same weights. For each convolution layer in the encoding network, the channel number of the output feature maps is 16. The architecture of the encoder has two advantages. First, the filter size and stride of the convolutional operation are $3 \times 3$ and 1, respectively. With this strategy, the input image can be any size. Second, the dense block architecture preserves as many as possible of the features obtained by each convolution layer in the encoding network, and this operation makes sure that all the salient features are used in the fusion strategy.
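The channel bookkeeping of the encoder can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the weights are random rather than trained, zero padding is assumed so that the output keeps the input size, and the helper names (conv3x3_relu, random_weights, encode) are hypothetical. It only shows how concatenation makes the three dense-block layers see 16, 32 and 48 input channels and leaves 64 feature maps for an input of arbitrary size.

```python
import random


def conv3x3_relu(maps, weights):
    """Apply a 3x3, stride-1 convolution with zero padding, then ReLU.

    maps:    list of C_in feature maps, each an H x W list of lists
    weights: weights[o][c][dy][dx], one 3x3 kernel per (output, input) pair
    """
    h, w = len(maps[0]), len(maps[0][0])
    out = []
    for kernels in weights:
        plane = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for c, kernel in enumerate(kernels):
                    for dy in range(3):
                        for dx in range(3):
                            yy, xx = y + dy - 1, x + dx - 1
                            # Positions outside the image contribute zero (assumed padding).
                            if 0 <= yy < h and 0 <= xx < w:
                                acc += kernel[dy][dx] * maps[c][yy][xx]
                plane[y][x] = max(acc, 0.0)
        out.append(plane)
    return out


def random_weights(c_in, c_out, rng):
    """Untrained stand-in weights; a real model would load trained ones."""
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
             for _ in range(c_in)] for _ in range(c_out)]


def encode(image, rng):
    """C1 followed by a three-layer dense block; returns 64 feature maps."""
    features = conv3x3_relu([image], random_weights(1, 16, rng))  # C1: 1 -> 16
    for _ in range(3):  # DC1, DC2, DC3 receive 16, 32 and 48 channels
        new = conv3x3_relu(features, random_weights(len(features), 16, rng))
        features = features + new  # concatenate along the channel axis
    return features


rng = random.Random(0)
image = [[rng.random() for _ in range(7)] for _ in range(5)]  # a 5 x 7 gray image
features = encode(image, rng)
print(len(features), len(features[0]), len(features[0][0]))  # prints: 64 5 7
```

Because every layer uses stride 1 and keeps the spatial size, the same function accepts any image height and width; only the number of channels changes from layer to layer.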

We choose different fusion strategies in the fusion layer; these are introduced in Section III-B.

The decoder contains four convolution layers ($3 \times 3$ filters). The output of the fusion layer is the input of the decoder. We use this simple and effective architecture to reconstruct the final fused image.

III-A Training

In the training process, we consider only the encoder and decoder networks. We train the encoder and decoder to reconstruct the input image. The framework of our training process is shown in Fig.2, and the architecture of the training process is outlined in Table I.

In Fig.2 and Table I, C1 is the convolution layer in the encoder network, which contains $3 \times 3$ filters. DC1, DC2 and DC3 are the convolution layers in the dense block, and the output of each layer is connected to every other layer by a cascade operation. The decoder consists of C2, C3, C4 and C5, which are utilized to reconstruct the input image.

                     Layer      Size  Stride  Channel (input)  Channel (output)  Activation
Encoder              Conv(C1)   3     1       1                16                ReLu
                     Dense
Decoder              Conv(C2)   3     1       64               64                ReLu
                     Conv(C3)   3     1       64               32                ReLu
                     Conv(C4)   3     1       32               16                ReLu
                     Conv(C5)   3     1       16               1
Dense (dense block)  Conv(DC1)  3     1       16               16                ReLu
                     Conv(DC2)  3     1       32               16                ReLu
                     Conv(DC3)  3     1       48               16                ReLu
TABLE I: The architecture of the training process. Conv denotes the convolutional block (convolutional layer + activation); Dense denotes the dense block.

In order to reconstruct the input image more precisely, we minimize the loss function $L$ to train our encoder and decoder,

$$L = \lambda L_{ssim} + L_p \tag{1}$$

which is a weighted combination of pixel loss $L_p$ and structural similarity (SSIM) loss $L_{ssim}$ with the weight $\lambda$.

The pixel loss $L_p$ is calculated as,

$$L_p = \|O - I\|_2 \tag{2}$$

where $O$ and $I$ indicate the output and input images, respectively. It is the Euclidean distance between the output $O$ and the input $I$.

The SSIM loss $L_{ssim}$ is obtained by Eq.3,

$$L_{ssim} = 1 - SSIM(O, I) \tag{3}$$

where $SSIM(\cdot)$ represents the structural similarity operation[17] and it denotes the structural similarity of two images. Because the order of magnitude between pixel loss and SSIM loss is different, in the training process, $\lambda$ is set as 1, 10, 100 and 1000, respectively.
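A minimal sketch of this training loss is given below, taking the SSIM loss to be one minus the SSIM value and the pixel loss to be the Euclidean distance between output and input. The function names are hypothetical, and the SSIM here is a simplification: it uses a single global window with the usual constants of [17], whereas the reference index averages the statistic over local sliding windows.

```python
import math


def global_ssim(x, y, dynamic_range=255.0):
    """SSIM of two equally sized images computed over one global window.

    Simplification for illustration: the index in [17] is evaluated on
    local windows and then averaged.
    """
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilising constants from [17]
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def pixel_loss(output, target):
    """Euclidean distance between the output and input images."""
    return math.sqrt(sum((o - t) ** 2
                         for row_o, row_t in zip(output, target)
                         for o, t in zip(row_o, row_t)))


def total_loss(output, target, ssim_weight):
    """Weighted combination: ssim_weight * (1 - SSIM) + pixel loss."""
    ssim_loss = 1.0 - global_ssim(output, target)
    return ssim_weight * ssim_loss + pixel_loss(output, target)


target = [[10.0, 20.0], [30.0, 40.0]]
output = [[12.0, 18.0], [33.0, 41.0]]
for weight in (1, 10, 100, 1000):
    print(weight, total_loss(output, target, weight))
```

The SSIM term lies in a small fixed range while the pixel term grows with image size and intensity, which is why several orders of magnitude of the weight are tried.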

We train our network using MS-COCO[18] as input images, which contains 80000 images; all of them are resized to $256 \times 256$ and the RGB images are transformed to gray ones. The learning rate is set as $1 \times 10^{-4}$. The batch size and epochs are 2 and 4, respectively. Our method is implemented with GTX 1080Ti and 64GB RAM.
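The amount of training implied by these settings can be worked out with a small configuration sketch. The class and field names are hypothetical, and the assumption that one model is trained for each SSIM weight is inferred from the four weights listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSchedule:
    num_images: int = 80000  # MS-COCO images used for training
    batch_size: int = 2
    epochs: int = 4

    @property
    def steps_per_epoch(self) -> int:
        return self.num_images // self.batch_size

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


SSIM_WEIGHTS = (1, 10, 100, 1000)  # assumed: one trained model per weight

schedule = TrainingSchedule()
print(schedule.steps_per_epoch, schedule.total_steps)  # prints: 40000 160000
```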

III-B Fusion Strategy

Once the encoder and decoder networks are trained, in the testing process we use a two-stream architecture in the encoder, and the weights are tied. We choose two fusion strategies (addition strategy and $l_1$-norm strategy) to combine the salient feature maps obtained by the encoder.

In our network, $m \in \{1, 2, \cdots, M\}$, $M = 64$ represents the number of feature maps. $i \in \{1, 2\}$ indicates the input images or feature maps.

III-B1 Addition Strategy

The addition fusion strategy is just like the fusion strategy in [15]. The procedure of this strategy is shown in Fig.3.

Fig. 3: The procedure of addition strategy.

$\phi_1^m$ and $\phi_2^m$ indicate the feature maps obtained by the encoder from the two input images, and $f^m$ denotes the fused feature maps. The addition strategy is formulated by Eq.4,

$$f^m(x, y) = \sum_{i=1}^{2} \phi_i^m(x, y) \tag{4}$$

where $(x, y)$ denotes the corresponding position in the feature maps and fused feature maps. Then $f^m$ will be the input of the decoder, and the final fused image will be reconstructed by the decoder.
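The addition strategy amounts to an element-wise sum of corresponding feature maps. A minimal sketch, using nested lists and a hypothetical function name:

```python
def addition_fusion(feature_sets):
    """Element-wise sum of several sets of feature maps.

    feature_sets[i][m][y][x] is the value at position (x, y) of feature
    map m obtained from source image i.
    """
    first = feature_sets[0]
    return [[[sum(fs[m][y][x] for fs in feature_sets)
              for x in range(len(first[m][y]))]
             for y in range(len(first[m]))]
            for m in range(len(first))]


# Two sources, two feature maps each, every map of size 1 x 2.
infrared = [[[1.0, 2.0]], [[0.0, 4.0]]]
visible = [[[3.0, 0.5]], [[2.0, 1.0]]]
print(addition_fusion([infrared, visible]))  # prints: [[[4.0, 2.5]], [[2.0, 5.0]]]
```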

III-B2 $l_1$-norm Strategy

The performance of the addition strategy was proved in [15]. But this operation is a very rough fusion strategy for salient feature selection. We apply a new strategy, which is based on the $l_1$-norm and soft-max operation, in our network. The diagram of this strategy is shown in Fig.4.

Fig. 4: The diagram of $l_1$-norm and soft-max strategy.

In Fig.4, the feature maps are denoted by $\phi_i^m$, the activity level map $\hat{C}_i$ is calculated by the $l_1$-norm and a block-based average operator, and $f^m$ still denotes the fused feature maps.

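Inspired by [11], the $l_1$-norm of $\phi_i^{1:M}(x, y)$ can be the activity level measure of the feature maps. Thus, the initial activity level map $C_i$ is calculated by Eq.5,

$$C_i(x, y) = \|\phi_i^{1:M}(x, y)\|_1 \tag{5}$$

Then the block-based average operator is utilized to calculate the final activity level map $\hat{C}_i$ by Eq.6,

$$\hat{C}_i(x, y) = \frac{\sum_{a=-r}^{r} \sum_{b=-r}^{r} C_i(x + a, y + b)}{(2r + 1)^2} \tag{6}$$

where $r$ determines the block size and in our strategy $r = 1$.

After we get the final activity level map $\hat{C}_i$, $f^m$ is calculated by Eq.7,

$$f^m(x, y) = \sum_{i=1}^{2} w_i(x, y) \times \phi_i^m(x, y), \quad w_i(x, y) = \frac{\hat{C}_i(x, y)}{\sum_{n=1}^{2} \hat{C}_n(x, y)} \tag{7}$$

The final fused image is reconstructed by the decoder, which takes the fused feature maps $f^m$ as input.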
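The three steps of this strategy (activity level, block-based average, weighted sum) can be sketched as follows. Several points are assumptions of the sketch rather than details given in the text: the norm is taken to be the $l_1$-norm across channels, each weight is an activity map divided by the sum over sources, positions outside the map count as zero, equal weights are used where all activities are zero, and r defaults to 1, which gives a 3 x 3 block. The function name is hypothetical.

```python
def l1_norm_fusion(feature_sets, r=1):
    """Fuse feature maps with activity-based weights.

    feature_sets[i][m][y][x] is the value at position (x, y) of feature
    map m obtained from source image i.
    """
    k = len(feature_sets)
    num_maps = len(feature_sets[0])
    h, w = len(feature_sets[0][0]), len(feature_sets[0][0][0])

    # Initial activity level: l1-norm across the channel axis at each position.
    activity = [[[sum(abs(fs[m][y][x]) for m in range(num_maps))
                  for x in range(w)] for y in range(h)] for fs in feature_sets]

    # Block-based average over a (2r+1) x (2r+1) window; positions outside
    # the map contribute zero (assumed border handling).
    def block_average(c):
        area = (2 * r + 1) ** 2
        return [[sum(c[y + a][x + b]
                     for a in range(-r, r + 1) for b in range(-r, r + 1)
                     if 0 <= y + a < h and 0 <= x + b < w) / area
                 for x in range(w)] for y in range(h)]

    smoothed = [block_average(c) for c in activity]

    # Normalise the activity maps into per-position weights that sum to one,
    # then form the weighted sum of the source feature maps.
    fused = [[[0.0] * w for _ in range(h)] for _ in range(num_maps)]
    for y in range(h):
        for x in range(w):
            total = sum(s[y][x] for s in smoothed)
            for i in range(k):
                weight = smoothed[i][y][x] / total if total > 0 else 1.0 / k
                for m in range(num_maps):
                    fused[m][y][x] += weight * feature_sets[i][m][y][x]
    return fused


strong = [[[2.0] * 3 for _ in range(3)] for _ in range(2)]  # active everywhere
silent = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]  # no response
print(l1_norm_fusion([strong, silent])[0][1][1])  # prints: 2.0
```

Unlike plain addition, a source whose features are inactive around a position receives a small weight there, so its values barely affect the fused feature maps.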

IV Experimental results and analysis

The purpose of the experiment is to validate the proposed fusion method using subjective and objective criteria and to compare it with existing methods.

IV-A Experimental Settings

In our experiment, the source infrared and visible images were collected from [19] and [20]. Our experiment uses 20 pairs of source images; the infrared and visible images are available at [29]. A sample of these images is shown in Fig.5.

Fig. 5: Four pairs of source images. The top row contains infrared images, and the second row contains visible images.
Fig. 6: Experiment on “car” images. (a) Infrared image; (b) Visible image; (c) CBF; (d) JSR; (e) GTF; (f) JSRSD; (g) CNN; (h) DeepFuse. The last two rows contain the fused images which are obtained by the proposed method with different SSIM weights and fusion strategies.

We compare the proposed method with several typical fusion methods, including the cross bilateral filter fusion method (CBF)[21], the joint-sparse representation model (JSR)[22], gradient transfer and total variation minimization (GTF)[23], the JSR model with saliency detection fusion method (JSRSD)[24], the deep convolutional neural network-based method (CNN)[13] and the DeepFuse method (DeepFuse)[15]. In our experiment, the filter size is set as $3 \times 3$ for the DeepFuse method.

For the purpose of quantitative comparison between our fusion method and other existing algorithms, seven quality metrics are utilized. These are: entropy (En); Qabf[25]; the sum of the correlations of differences (SCD)[26]; $FMI_w$ and $FMI_{dct}$[27], which calculate mutual information (FMI) for the wavelet and discrete cosine features, respectively; modified structural similarity for no-reference image ($SSIM_a$); and a new no-reference image fusion performance measure (MS_SSIM)[28].
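As an illustration of the simplest of these metrics, the sketch below computes entropy, assuming En is the Shannon entropy of the gray-level histogram of the fused image, which is its usual definition in image fusion; the text does not spell out the formula, and the function name is hypothetical.

```python
import math
from collections import Counter


def entropy(image):
    """Shannon entropy, in bits, of the gray-level histogram of an image."""
    pixels = [int(v) for row in image for v in row]
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


flat = [[128, 128], [128, 128]]   # a single gray level
varied = [[0, 85], [170, 255]]    # four equally frequent gray levels
print(entropy(flat), entropy(varied))
```

A constant image has zero entropy, while an 8-bit image that uses all gray levels equally often reaches the maximum of 8 bits.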

In our experiment, the $SSIM_a$ is calculated by Eq.8,

$$SSIM_a(F) = \left(SSIM(F, I_1) + SSIM(F, I_2)\right) \times 0.5 \tag{8}$$

where $SSIM(\cdot)$ denotes the structural similarity operation[17], $F$ is the fused image, and $I_1$, $I_2$ are the source images. The value of $SSIM_a$ represents the ability to preserve structural information.

For all seven metrics, a larger value indicates better fusion performance.

Fig. 7: Experiment on “street” images. (a) Infrared image; (b) Visible image; (c) CBF; (d) JSR; (e) GTF; (f) JSRSD; (g) CNN; (h) DeepFuse. The last two rows contain the fused images which are obtained by the proposed method with different SSIM weights and fusion strategies.

IV-B Fusion methods Evaluation

The fused images obtained by the six existing methods and by the proposed method with different parameters are shown in Fig.6 and Fig.7. Due to the space limit, we evaluate the relative performance of the fusion methods on two pairs of images (“car” and “street”).

The fused images obtained by CBF, JSR and JSRSD contain more artificial noise and their salient features are not clear, such as the sky (orange and dotted) and floor (red and solid) in Fig.6 and the billboard (red box) in Fig.7.

On the other hand, the fused images obtained by the proposed method contain less noise in the red box no matter which parameters are chosen. Compared with GTF, CNN and DeepFuse, our fusion method preserves more detail information in the red box, as we can see from Fig.6.

In Fig.7, the fused image is darker than the other images when the CNN-based method is utilized to fuse images. The reason for this phenomenon is that the CNN-based method is not suitable for infrared and visible images. On the contrary, the fused images obtained by our method look more natural.

However, as there is no evident difference between DeepFuse and the proposed method to the human eye, we choose several objective metrics to evaluate the fusion performance next.

The average values of the seven metrics for the 20 fused images obtained by the existing methods and the proposed fusion method are shown in Table II.

The best values for quality metrics are indicated in bold and the second-best values are indicated in red and italic. As we can see, the proposed method with the addition and $l_1$-norm strategies has four best average values (En, Qabf, $FMI_{dct}$, $SSIM_a$) and three second-best values (SCD, $FMI_w$, MS_SSIM).

Our method has the best values in $FMI_{dct}$ and $SSIM_a$, which denotes that our method preserves more structural information and features. The fused images obtained by the proposed method are more natural and contain less artificial noise, as indicated by the best values of En and Qabf and the second-best values of SCD.

When different fusion strategies (addition and $l_1$-norm) are used in our network, our algorithm still has the best or second-best values in the seven quality metrics. This means our network is an effective architecture for the infrared and visible image fusion task.

Methods                          En       Qabf[25]  SCD[26]  $FMI_w$[27]  $FMI_{dct}$[27]  $SSIM_a$  MS_SSIM[28]
CBF[21]                          6.81494  0.44119   1.38963  0.32350      0.26309          0.60304   0.70879
JSR[22]                          6.78576  0.32572   1.59136  0.18506      0.14236          0.53906   0.75523
GTF[23]                          6.63597  0.40992   1.00488  0.41038      0.39787          0.70369   0.80844
JSRSD[24]                        6.78441  0.32553   1.59124  0.18498      0.14253          0.53963   0.75517
CNN[13]                          6.80593  0.29451   1.48060  0.54051      0.36658          0.71109   0.80772
DeepFuse[15]                     6.68170  0.43989   1.84525  0.42477      0.41501          0.72949   0.93353
ours  Addition    Densefuse_1e0  6.66280  0.44114   1.83459  0.42744      0.41689          0.73159   0.92909
                  Densefuse_1e1  6.65139  0.44039   1.83126  0.42741      0.41691          0.73246   0.92779
                  Densefuse_1e2  6.65426  0.44190   1.83502  0.42767      0.41727          0.73186   0.92896
                  Densefuse_1e3  6.64377  0.43831   1.82801  0.42735      0.41699          0.73259   0.92691
      $l_1$-norm  Densefuse_1e0  6.83278  0.47560   1.69380  0.43144      0.38394          0.71880   0.84846
                  Densefuse_1e1  6.81348  0.47680   1.69476  0.43176      0.38388          0.72052   0.84964
                  Densefuse_1e2  6.83091  0.47684   1.70044  0.43075      0.38443          0.71901   0.85137
                  Densefuse_1e3  6.84189  0.47595   1.70347  0.43098      0.38733          0.72106   0.85513
TABLE II: The average values of quality metrics for 20 fused images. Addition and $l_1$-norm denote the fusion strategies which we used in our method; Densefuse_1e0 - Densefuse_1e3 indicate the different SSIM loss weights ($\lambda$).
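The ranking read off Table II can be checked with a short script. The values are copied from the table; the names used for the fourth to sixth columns (FMI_w, FMI_dct, SSIM_a) follow the order of the metric list in Section IV-A, and treating the eight DenseFuse rows as one method by taking their best value per metric is an assumption of this sketch.

```python
METRICS = ["En", "Qabf", "SCD", "FMI_w", "FMI_dct", "SSIM_a", "MS_SSIM"]

EXISTING = {
    "CBF":      [6.81494, 0.44119, 1.38963, 0.32350, 0.26309, 0.60304, 0.70879],
    "JSR":      [6.78576, 0.32572, 1.59136, 0.18506, 0.14236, 0.53906, 0.75523],
    "GTF":      [6.63597, 0.40992, 1.00488, 0.41038, 0.39787, 0.70369, 0.80844],
    "JSRSD":    [6.78441, 0.32553, 1.59124, 0.18498, 0.14253, 0.53963, 0.75517],
    "CNN":      [6.80593, 0.29451, 1.48060, 0.54051, 0.36658, 0.71109, 0.80772],
    "DeepFuse": [6.68170, 0.43989, 1.84525, 0.42477, 0.41501, 0.72949, 0.93353],
}

DENSEFUSE = [  # four addition rows, then four l1-norm rows
    [6.66280, 0.44114, 1.83459, 0.42744, 0.41689, 0.73159, 0.92909],
    [6.65139, 0.44039, 1.83126, 0.42741, 0.41691, 0.73246, 0.92779],
    [6.65426, 0.44190, 1.83502, 0.42767, 0.41727, 0.73186, 0.92896],
    [6.64377, 0.43831, 1.82801, 0.42735, 0.41699, 0.73259, 0.92691],
    [6.83278, 0.47560, 1.69380, 0.43144, 0.38394, 0.71880, 0.84846],
    [6.81348, 0.47680, 1.69476, 0.43176, 0.38388, 0.72052, 0.84964],
    [6.83091, 0.47684, 1.70044, 0.43075, 0.38443, 0.71901, 0.85137],
    [6.84189, 0.47595, 1.70347, 0.43098, 0.38733, 0.72106, 0.85513],
]


def rank_of_proposed(metric_index):
    """Rank of the best DenseFuse value among all methods (1 = best)."""
    ours = max(row[metric_index] for row in DENSEFUSE)
    better = sum(1 for row in EXISTING.values() if row[metric_index] > ours)
    return better + 1


best = [m for i, m in enumerate(METRICS) if rank_of_proposed(i) == 1]
second = [m for i, m in enumerate(METRICS) if rank_of_proposed(i) == 2]
print("best:", best)
print("second-best:", second)
```

Under these assumptions the script reports En, Qabf, FMI_dct and SSIM_a as best and SCD, FMI_w and MS_SSIM as second-best, which agrees with the discussion of the table.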

V Conclusion

In this paper, we present a novel and effective deep learning architecture based on CNN and dense block for the infrared and visible image fusion problem. Our network has three parts: encoder, fusion layer and decoder. First, the source images (infrared and visible images) are used as the input of the encoder. The feature maps are obtained by the CNN layer and the dense block, and they are fused by a fusion strategy (addition or $l_1$-norm). After the fusion layer, the feature maps are integrated into one feature map which contains all salient features from the source images. Finally, the fused image is reconstructed by the decoder network. We use both subjective and objective quality metrics to evaluate our fusion method. The experimental results show that the proposed method exhibits state-of-the-art fusion performance.

We believe our network architecture can be applied to other image fusion problems, such as multi-exposure, multi-focus and medical image fusion. In the future, we will continue to research this fusion method and hope to achieve better fusion performance.


  • [1] Li S, Kang X, Fang L, et al. Pixel-level image fusion: A survey of the state of the art[J]. Information Fusion, 2017, 33: 100-112.
  • [2] Ben Hamza A, He Y, Krim H, et al. A multiscale approach to pixel-level image fusion[J]. Integrated Computer-Aided Engineering, 2005, 12(2): 135-146.
  • [3] Yang S, Wang M, Jiao L, et al. Image fusion based on a new contourlet packet[J]. Information Fusion, 2010, 11(2): 78-84.
  • [4] Wang L, Li B, Tian L F. EGGDD: An explicit dependency model for multi-modal medical image fusion in shift-invariant shearlet transform domain[J]. Information Fusion, 2014, 19: 29-37.
  • [5] Pang H, Zhu M, Guo L. Multifocus color image fusion using quaternion wavelet transform[C]//Image and Signal Processing (CISP), 2012 5th International Congress on. IEEE, 2012: 543-546.
  • [6] Bavirisetti D P, Dhuli R. Two-scale image fusion of visible and infrared images using saliency detection[J]. Infrared Physics & Technology, 2016, 76: 52-64.
  • [7] Li S, Kang X, Hu J. Image fusion with guided filtering[J]. IEEE Transactions on Image Processing, 2013, 22(7): 2864-2875.
  • [8] Zong J, Qiu T. Medical image fusion based on sparse representation of classified image patches[J]. Biomedical Signal Processing and Control, 2017, 34: 195-205.
  • [9] Zhang Q, Fu Y, Li H, et al. Dictionary learning method for joint sparse representation-based image fusion[J]. Optical Engineering, 2013, 52(5): 057006.
  • [10] Gao R, Vorobyov S A, Zhao H. Image fusion with cosparse analysis operator[J]. IEEE Signal Processing Letters, 2017, 24(7): 943-947.
  • [11] Li H, Wu X J. Multi-focus Image Fusion Using Dictionary Learning and Low-Rank Representation[C]//International Conference on Image and Graphics. Springer, Cham, 2017: 675-686.
  • [12] Liu Y, Chen X, Ward R K, et al. Image fusion with convolutional sparse representation[J]. IEEE signal processing letters, 2016, 23(12): 1882-1886.
  • [13] Liu Y, Chen X, Peng H, et al. Multi-focus image fusion with a deep convolutional neural network[J]. Information Fusion, 2017, 36: 191-207.
  • [14] Huang G, Liu Z, Weinberger K Q, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, 1(2): 3.
  • [15] Prabhakar K R, Srikar V S, Babu R V. DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs[C]//2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 4724-4732.
  • [16] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
  • [17] Wang Z, Bovik A C, Sheikh H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE transactions on image processing, 2004, 13(4): 600-612.
  • [18] Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects in context[C]//European conference on computer vision. Springer, Cham, 2014: 740-755.
  • [19] Ma J, Zhou Z, Wang B, et al. Infrared and visible image fusion based on visual saliency map and weighted least square optimization[J]. Infrared Physics & Technology, 2017, 82: 8-17.
  • [20] Toet A. TNO Image fusion dataset[J]. Figshare. data, 2014.
  • [21] Kumar B K S. Image fusion based on pixel significance using cross bilateral filter[J]. Signal, image and video processing, 2015, 9(5): 1193-1204.
  • [22] Zhang Q, Fu Y, Li H, et al. Dictionary learning method for joint sparse representation-based image fusion[J]. Optical Engineering, 2013, 52(5): 057006.
  • [23] Ma J, Chen C, Li C, et al. Infrared and visible image fusion via gradient transfer and total variation minimization[J]. Information Fusion, 2016, 31: 100-109.
  • [24] Liu C H, Qi Y, Ding W R. Infrared and visible image fusion method based on saliency detection in sparse domain[J]. Infrared Physics & Technology, 2017, 83: 94-102.
  • [25] Xydeas C S, Petrovic V. Objective image fusion performance measure[J]. Electronics letters, 2000, 36(4): 308-309.
  • [26] Aslantas V, Bendes E. A new image quality metric for image fusion: The sum of the correlations of differences[J]. AEU-International Journal of Electronics and Communications, 2015, 69(12): 1890-1896.
  • [27] Haghighat M, Razian M A. Fast-FMI: non-reference image fusion metric[C]//Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on. IEEE, 2014: 1-3.
  • [28] Ma K, Zeng K, Wang Z. Perceptual quality assessment for multi-exposure image fusion[J]. IEEE Transactions on Image Processing, 2015, 24(11): 3345-3356.
  • [29] Li H. CODE: DenseFuse-A Fusion Approach to Infrared and Visible Image.