Infrared and Visible Image Fusion using a Deep Learning Framework

04/19/2018 ∙ by Hui Li, et al. ∙ University of Surrey ∙ NetEase, Inc

In recent years, deep learning has become a very active research tool and is used in many image processing fields. In this paper, we propose an effective image fusion method using a deep learning framework to generate a single image which contains all the features of the infrared and visible images. First, the source images are decomposed into base parts and detail content. Then the base parts are fused by weighted averaging. For the detail content, we use a deep learning network to extract multi-layer features. Using these features, we apply an l_1-norm and weighted-average strategy to generate several candidates of the fused detail content. Once we get these candidates, the max selection strategy is used to get the final fused detail content. Finally, the fused image is reconstructed by combining the fused base part and detail content. The experimental results demonstrate that our proposed method achieves state-of-the-art performance in both objective assessment and visual quality. The code of our fusion method is available at [26].






I Introduction

The fusion of infrared and visible imaging is an important and frequently occurring problem. Recently, many fusion methods have been proposed to combine the features present in infrared and visible images into a single image [1]. These state-of-the-art methods are widely used in many applications, like image pre-processing, target recognition and image classification.

The key problem of image fusion is how to extract salient features from the source images and how to combine them to generate the fused image.

For decades, many signal processing methods have been applied in the image fusion field to extract image features, such as the discrete wavelet transform (DWT) [2], the contourlet transform [3], the shift-invariant shearlet transform [4] and the quaternion wavelet transform [5]. For the infrared and visible image fusion task, Bavirisetti et al. [6] proposed a two-scale decomposition and saliency detection-based fusion method, whereby mean and median filters are used to extract the base layers and detail layers. Then visual saliency is used to obtain weight maps. Finally, the fused image is obtained by combining these three parts.

Besides the above methods, the role of sparse representation (SR) and low-rank representation has also attracted great attention. Zong et al. [7] proposed a medical image fusion method based on SR, in which Histogram of Oriented Gradients (HOG) features are used to classify the image patches and learn several sub-dictionaries. The l_1-norm and the max selection strategy are then used to reconstruct the fused image. In addition, there are many methods based on combining SR with other tools for image fusion, such as pulse coupled neural networks (PCNN) [8] and the shearlet transform [9]. In the sparse domain, joint sparse representation [10] and cosparse representation [11] were also applied in the image fusion field. In the low-rank category, Li et al. [12] proposed a low-rank representation (LRR)-based fusion method. They use LRR instead of SR to extract features, then the l_1-norm and the max selection strategy are used to reconstruct the fused image.

With the rise of deep learning, deep features of the source images, which are also a kind of saliency feature, are used to reconstruct the fused image. In [13], Yu Liu et al. proposed a fusion method based on convolutional sparse representation (CSR). CSR is different from deep learning methods, but the features it extracts are still deep features. In their method, the authors employ CSR to extract multi-layer features, and then use these features to generate the fused image. In addition, Yu Liu et al. [14] also proposed a convolutional neural network (CNN)-based fusion method. They use image patches which contain different blur versions of the input image to train the network and use it to obtain a decision map. Finally, the fused image is obtained using the decision map and the source images. Although these deep learning-based methods achieve good performance, they still have drawbacks: 1) the method in [14] is only suitable for multi-focus image fusion; 2) these methods use only the result calculated by the last layer, so much of the useful information obtained by the middle layers is lost. The information loss tends to get worse as the network gets deeper.

In this paper, we propose a novel and effective fusion method based on a deep learning framework for infrared and visible image fusion. The source images are decomposed into base parts and detail content by the image decomposition approach in [15]. We use a weighted-averaging strategy to obtain the fused base part. To fuse the detail content, we first use a deep learning network to compute multi-layer features so as to preserve as much information as possible. For the features at each layer, we use a soft-max operator to obtain weight maps, from which a candidate fused detail content is obtained. Applying the same operation at multiple layers, we get several candidates for the fused detail content. The final fused detail content is generated by the max selection strategy, and the final fused image is reconstructed by combining the fused base part with the fused detail content.

This paper is structured as follows. In Section II, image style transfer using a deep learning framework is briefly reviewed. In Section III, the proposed deep learning-based image fusion method is introduced in detail. The experimental results are shown in Section IV. Finally, Section V concludes the paper.

II Image style transfer using a deep learning framework

As we all know, deep learning achieves state-of-the-art performance in many image processing tasks, such as image classification. In addition, deep learning can also be a useful tool for extracting image features which contain different information at each layer. Different applications of deep learning have received a lot of attention in the last two years. Hence, we believe deep learning can also be applied to the image fusion task.

In CVPR 2016, Gatys et al. [16] proposed an image style transfer method based on CNNs. They use the VGG-network [17] to extract deep features at different layers from the “content” image, “style” image and a generated image, respectively. The difference between the deep features extracted from the generated image and the source images is minimised by iteration. The generated image then contains the main object from the “content” image and texture features from the “style” image. Although this method can obtain a good stylized image, its speed is extremely slow even when using a GPU.

Due to these drawbacks, in ECCV 2016, Justin Johnson et al.[18] proposed a feed-forward network to solve the optimization problem formulated in [16] in real time. But in this method, each network is tied to a fixed style. To solve this problem, in ICCV 2017, Xun Huang et al.[19] used VGG-network and adaptive instance normalization to construct a new style transfer framework. In this framework, the stylized image can be of arbitrary style and the method is nearly three orders of magnitude faster than [16].

These methods have one thing in common: they all use multi-layer network features as a constraint condition. Inspired by them, multi-layer deep features are extracted by a VGG-network in our fusion method. We use the fixed VGG-19 [17], which is trained on ImageNet, to extract the features. The details of our proposed fusion method are introduced in the next section.

III The Proposed Fusion Method

The fusion processing of base parts and detail content is introduced in the next subsections.

Suppose that there are K preregistered source images; in our paper, we choose K = 2, but the fusion strategy is the same for K > 2. The source images will be denoted as I_k, k ∈ {1, 2}.

Compared with other image decomposition methods, like wavelet decomposition and latent low-rank decomposition, the optimization method of [15] is more effective and saves time. So in our paper, we use this method to decompose the source images.

For each source image I_k, the base part I_k^b and the detail content I_k^d are obtained separately following [15]. The base part is obtained by solving this optimization problem:

I_k^b = \arg\min_{I_k^b} \|I_k - I_k^b\|_F^2 + \lambda (\|g_x * I_k^b\|_F^2 + \|g_y * I_k^b\|_F^2)   (1)

where g_x = [-1, 1] and g_y = [-1, 1]^T are the horizontal and vertical gradient operators, respectively. The parameter \lambda is set to 5 in our paper.

After we get the base part I_k^b, the detail content is obtained by Eq.(2):

I_k^d = I_k - I_k^b   (2)
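Eq.(1) is a Tikhonov-regularised smoothing problem, and under a periodic-boundary assumption it admits a closed-form solution in the Fourier domain. The NumPy sketch below illustrates one standard way to carry out the decomposition; the function name is ours and the authors' released implementation is MATLAB, so this is only an illustration of Eqs.(1)-(2), not the reference code:

```python
import numpy as np

def decompose(image, lam=5.0):
    """Base/detail split: solve Eq.(1) in the Fourier domain
    (periodic boundaries assumed), then apply Eq.(2)."""
    h, w = image.shape
    # Gradient filters g_x = [-1, 1] and g_y = [-1, 1]^T embedded in
    # image-sized arrays so their DFTs match the image spectrum.
    gx = np.zeros((h, w)); gx[0, 0], gx[0, -1] = -1.0, 1.0
    gy = np.zeros((h, w)); gy[0, 0], gy[-1, 0] = -1.0, 1.0
    # Normal equation of Eq.(1): (1 + lam * (|G_x|^2 + |G_y|^2)) * B = I.
    denom = 1.0 + lam * (np.abs(np.fft.fft2(gx)) ** 2 +
                         np.abs(np.fft.fft2(gy)) ** 2)
    base = np.real(np.fft.ifft2(np.fft.fft2(image) / denom))
    detail = image - base  # Eq.(2)
    return base, detail
```

The base part keeps the low-frequency content, while the detail content holds the residual textures, exactly as Eq.(2) requires.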
The framework of the proposed fusion method is shown in Fig.1.

Fig. 1: The framework of proposed method.

As shown in Fig.1, the source images are denoted as I_1 and I_2. Firstly, the base part I_k^b and the detail content I_k^d for each source image are obtained by solving Eq.(1) and Eq.(2), where k ∈ {1, 2}. Then the base parts are fused by the weighted-averaging strategy and the fused detail content is reconstructed by our deep learning framework. Finally, the fused image F is reconstructed by adding the fused base part F_b and the fused detail content F_d.

III-A Fusion of base parts

The base parts which are extracted from the source images contain the common features and redundant information. In our paper, we choose the weighted-averaging strategy to fuse these base parts. The fused base part F_b is calculated by Eq.(3),

F_b(x, y) = \alpha_1 I_1^b(x, y) + \alpha_2 I_2^b(x, y)   (3)

where (x, y) denotes the corresponding position of the image intensity in I_1^b, I_2^b and F_b. \alpha_1 and \alpha_2 indicate the weight values for the pixel at (x, y) in I_1^b and I_2^b, respectively. To preserve the common features and reduce the redundant information, in this paper we choose \alpha_1 = 0.5 and \alpha_2 = 0.5.
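In code, the weighted-average base fusion with both weights set to 0.5 is a single line; this NumPy sketch (function name ours) only restates Eq.(3):

```python
import numpy as np

def fuse_base(base_1, base_2, a1=0.5, a2=0.5):
    """Weighted-average fusion of the base parts, Eq.(3).
    a1 = a2 = 0.5 keeps the common features while averaging
    out the redundant information."""
    return a1 * base_1 + a2 * base_2
```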

III-B The fusion of the detail content

For the detail content I_1^d and I_2^d, we propose a novel fusion strategy which uses a deep learning method (VGG-network) to extract deep features. This procedure is shown in Fig.2.

Fig. 2: The procedure of detail content fusion.

In Fig.2, we use VGG-19 to extract deep features. Then the weight maps are obtained by a multi-layer fusion strategy. Finally, the fused detail content is reconstructed by these weight maps and the detail content.

Now, we introduce our multi-layer fusion strategy in detail.

Consider the detail content I_k^d. \phi_k^{i,m} indicates the feature maps of the k-th detail content extracted at the i-th layer, and m is the channel index of the i-th layer, m ∈ {1, 2, ..., M}, M = 64 × 2^{i-1}:

\phi_k^{i,m} = \Phi_i(I_k^d)   (4)

where each \Phi_i(\cdot) denotes a layer in the VGG-network and i ∈ {1, 2, 3, 4} represents relu1_1, relu2_1, relu3_1 and relu4_1, respectively.

Let \phi_k^{i,1:M}(x, y) denote the contents of \phi_k^{i,m} at the position (x, y) in the feature maps. As we can see, \phi_k^{i,1:M}(x, y) is an M-dimensional vector. The procedure of our strategy is presented in Fig.3.


Fig. 3: The procedure of the fusion strategy for detailed parts.

As shown in Fig.3, after we get the deep features \phi_k^{i,1:M}, the activity level map C_k^i will be calculated by the l_1-norm and a block-based average operator, where k ∈ {1, 2} and i ∈ {1, 2, 3, 4}.

Inspired by [12], the l_1-norm of \phi_k^{i,1:M}(x, y) can serve as the activity level measure of the source detail content. Thus, the initial activity level map \hat{C}_k^i is obtained by Eq.(5):

\hat{C}_k^i(x, y) = \|\phi_k^{i,1:M}(x, y)\|_1   (5)
We then use the block-based average operator to calculate the final activity level map C_k^i in order to make our fusion method robust to misregistration:

C_k^i(x, y) = \frac{\sum_{\beta=-r}^{r} \sum_{\theta=-r}^{r} \hat{C}_k^i(x + \beta, y + \theta)}{(2r + 1)^2}   (6)

where r determines the block size. The fusion method will be more robust to misregistration if r is larger, but some detail could be lost. Thus, in our strategy r = 1.

Once we get the activity level map C_k^i, the initial weight maps W_k^i will be calculated by the soft-max operator, as shown in Eq.(7):

W_k^i(x, y) = \frac{C_k^i(x, y)}{\sum_{n=1}^{K} C_n^i(x, y)}   (7)

where K denotes the number of activity level maps, which in our paper is set to K = 2. W_k^i(x, y) indicates the initial weight map value, which lies in the range [0, 1].
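The chain from the l_1-norm activity measure through the block average to the soft-max weights can be sketched directly in NumPy. We assume the deep features of one layer arrive as an (M, H, W) array and the helper name is ours; this is an illustration of the strategy, not the authors' MATLAB code:

```python
import numpy as np

def weight_maps(features_1, features_2, r=1):
    """Compute the two soft-max weight maps of one layer.
    features_k: (M, H, W) deep features of the k-th detail content."""
    activity = []
    for feat in (features_1, features_2):
        c_hat = np.sum(np.abs(feat), axis=0)   # l1-norm over channels
        padded = np.pad(c_hat, r, mode='edge')
        c = np.zeros_like(c_hat)
        for dx in range(-r, r + 1):            # block-based average
            for dy in range(-r, r + 1):
                c += padded[r + dx:r + dx + c_hat.shape[0],
                            r + dy:r + dy + c_hat.shape[1]]
        activity.append(c / (2 * r + 1) ** 2)
    # Soft-max over the K = 2 activity maps; a tiny epsilon guards
    # against an all-zero activity region.
    total = activity[0] + activity[1] + 1e-12
    return activity[0] / total, activity[1] / total
```

By construction the two maps are non-negative and sum to one at every pixel, matching the soft-max normalisation above.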

As we all know, the pooling operator in the VGG-network is a kind of subsampling method. Each application of this operator resizes the feature maps to 1/s times the original size, where s is the stride of the pooling operator. In the VGG-network, the stride of the pooling operator is 2. So at different layers, the size of the feature maps is 1/2^{i-1} times the detail content size, where i ∈ {1, 2, 3, 4} indicates the layers relu1_1, relu2_1, relu3_1 and relu4_1, respectively. After we get each initial weight map W_k^i, we use an upsampling operator to resize the weight map to the input detail content size.

As shown in Fig.3, with the upsampling operator we will get the final weight map \hat{W}_k^i, the size of which equals the input detail content size. The final weight map is calculated by Eq.(8):

\hat{W}_k^i(x + p, y + q) = W_k^i(x, y),   p, q ∈ {0, 1, ..., 2^{i-1} - 1}   (8)
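This upsampling is plain nearest-neighbour replication: each weight is copied over a block whose side equals the accumulated pooling stride. A one-function NumPy sketch (name ours):

```python
import numpy as np

def upsample_weight_map(w, i):
    """Replicate each weight of the layer-i map over a
    (2^(i-1) x 2^(i-1)) block so it matches the
    detail-content resolution."""
    s = 2 ** (i - 1)  # accumulated pooling stride before relu_i_1
    return np.kron(w, np.ones((s, s)))
```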
Now we have four pairs of weight maps \hat{W}_1^i and \hat{W}_2^i, i ∈ {1, 2, 3, 4}. For each pair, the initial fused detail content F_d^i is obtained by Eq.(9):

F_d^i(x, y) = \sum_{n=1}^{K} \hat{W}_n^i(x, y) \times I_n^d(x, y)   (9)

Finally, the fused detail content F_d is obtained by Eq.(10), in which we choose the maximum value from the four initial fused detail contents at each position:

F_d(x, y) = \max[ F_d^i(x, y) \mid i ∈ {1, 2, 3, 4} ]   (10)
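The candidate generation and max selection steps above can be sketched together; the function and argument names are ours, and the four full-resolution weight-map pairs are assumed to have been computed already:

```python
import numpy as np

def fuse_details(detail_1, detail_2, weight_pairs):
    """Build the four candidate fused details (one per chosen VGG
    layer), then take the per-pixel maximum. weight_pairs holds the
    four full-resolution (W_1, W_2) weight-map pairs."""
    candidates = [w1 * detail_1 + w2 * detail_2
                  for w1, w2 in weight_pairs]
    return np.max(np.stack(candidates), axis=0)  # max selection
```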
III-C Reconstruction

Once the fused detail content F_d is obtained, we use the fused base part F_b and the fused detail content to reconstruct the final fused image, as shown in Eq.(11):

F(x, y) = F_b(x, y) + F_d(x, y)   (11)
III-D Summary of the Proposed Fusion Method

In this section, we summarize the proposed fusion method based on deep learning as follows:

III-D1 Image decomposition

The source images are decomposed by the image decomposition operation [15] to obtain the base part I_k^b and the detail content I_k^d, where k ∈ {1, 2}.

III-D2 Fusion of base parts

We choose the weighted-averaging fusion strategy to fuse the base parts, with the weight value for each base part set to 0.5.

III-D3 Fusion of detail content

The fused detail content F_d is obtained by the multi-layer fusion strategy.

III-D4 Reconstruction

Finally, the fused image F is given by Eq.(11).

IV Experimental results and analysis

The aim of the experiment is to validate the proposed method using subjective and objective criteria and to carry out a comparison with existing methods.

IV-A Experimental Settings

In our experiment, the source infrared and visible images were collected from [20] and [25]. There are 21 pairs of our source images and they are available at [26]. A sample of these images is shown in Fig.4.

Fig. 4: Four pairs of source images. The top row contains infrared images, and the second row contains visible images.

In the multi-layer fusion strategy, we choose four layers from a pre-trained VGG-19 network [17] to extract deep features. These layers are relu1_1, relu2_1, relu3_1 and relu4_1, respectively.
For comparison, we selected several recent and classical fusion methods to perform the same experiment, including: the cross bilateral filter fusion method (CBF) [21], the joint-sparse representation model (JSR) [10], the JSR model with saliency detection (JSRSD) [22], the weighted least square optimization-based method (WLS) [20] and the convolutional sparse representation model (ConvSR) [13].

All the fusion algorithms are implemented in MATLAB R2016a on a 3.2 GHz Intel(R) Core(TM) CPU with 12 GB RAM.

IV-B Subjective Evaluation

The fused images obtained by the five existing methods and the proposed method are shown in Fig.5 and Fig.6. Due to space limits, we evaluate the relative performance of the fusion methods on only two pairs of images (“street” and “people”).

Fig. 5: Results on “street” images. (a) Infrared image; (b) Visible image; (c) CBF; (d) JSR; (e) JSRSD; (f) WLS; (g) ConvSR; (h) The proposed method.
Fig. 6: Results on “people” images. (a) Infrared image; (b) Visible image; (c) CBF; (d) JSR; (e) JSRSD; (f) WLS; (g) ConvSR; (h) The proposed method.

As we can see from Fig.5(c-h), the fused image obtained by the proposed method preserves more detail information in the red window and contains less artificial noise. In Fig.6(c-h), the fused image obtained by the proposed method also contains less noise in the red box.

In summary, the fused images obtained by CBF have more artificial noise and the salient features are not clear. The fused images obtained by JSR, JSRSD and WLS contain artificial structures around the salient features and the image detail is blurred. In contrast, the fused images obtained by ConvSR and the proposed fusion method contain more salient features and preserve more detail information. Compared with these existing fusion methods, the fused images obtained by the proposed method look more natural. As there is no visible difference between ConvSR and the proposed fusion method to the human eye, we use several objective quality metrics to evaluate the fusion performance in the next section.

IV-C Objective Evaluation

For the purpose of quantitative comparison between the proposed method and the existing fusion methods, four quality metrics are utilized. These are: FMI_dct and FMI_w [23], which calculate feature mutual information (FMI) for the discrete cosine and wavelet features, respectively; N^{abf} [24], which denotes the rate of noise or artifacts added to the fused image by the fusion process; and the modified structural similarity (SSIM_a).

In our paper, SSIM_a is calculated by Eq.(12),

SSIM_a(F) = ( SSIM(F, I_1) + SSIM(F, I_2) ) \times 0.5   (12)

where SSIM(\cdot) represents the structural similarity operation, F is the fused image, and I_1, I_2 are the source images. The value of SSIM_a assesses the ability to preserve structural information.
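A minimal sketch of this averaged metric follows. Standard SSIM implementations average a local-window statistic over the image; the single-window ("global") version below is only meant to make the formula concrete, both function names are ours, and a [0, 255] intensity scale is assumed:

```python
import numpy as np

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM between two images on a [0, 255] scale."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2)) /
            ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def ssim_a(fused, src1, src2):
    """Average structural similarity of the fused image
    to the two source images."""
    return 0.5 * (ssim_global(fused, src1) + ssim_global(fused, src2))
```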

The fusion performance improves with increasing values of FMI_dct, FMI_w and SSIM_a. On the contrary, the fusion performance is better when the value of N^{abf} is small, which means the fused images contain less artificial information and noise.

The average values of FMI_dct, FMI_w, SSIM_a and N^{abf} obtained by the existing methods and the proposed method over the 21 fused images are shown in Table I.

Methods   CBF[21]   WLS[20]   JSR[10]   JSRSD[22]  ConvSR[13]  Proposed
FMI_dct   0.26309   0.33103   0.14236   0.14253    0.34640     0.40463
FMI_w     0.32350   0.37662   0.18506   0.18498    0.34640     0.41684
SSIM_a    0.59957   0.72360   0.54073   0.54127    0.75335     0.77799
N^{abf}   0.31727   0.21257   0.34712   0.34657    0.0196      0.00120
TABLE I: The average values of FMI_dct, FMI_w, SSIM_a and N^{abf} for the 21 fused images.

In Table I, the best value for each of FMI_dct, FMI_w, SSIM_a and N^{abf} is indicated in bold. As we can see, the proposed method achieves the best average value for all four metrics. These values indicate that the fused images obtained by the proposed method are more natural and contain less artificial noise. From the objective evaluation, our fusion method has better fusion performance than the existing methods.

Specifically, we show in Table II all values of N^{abf} for the 21 pairs produced by the respective methods. The graph plot of N^{abf} for all fused images is shown in Fig.7.

Methods CBF[21] WLS[20] JSR[10] JSRSD[22] ConvSR[13] Proposed
image1 0.23167 0.14494 0.34153 0.34153 0.0149 0.00013
image2 0.48700 0.16997 0.19749 0.19889 0.0220 0.00376
image3 0.54477 0.21469 0.38627 0.38627 0.0207 0.00622
image4 0.45288 0.22866 0.42353 0.42353 0.0238 0.00132
image5 0.43257 0.19188 0.49804 0.49804 0.0099 0.00020
image6 0.23932 0.22382 0.36619 0.36509 0.0230 0.00099
image7 0.41779 0.15368 0.52301 0.52220 0.0151 0.00188
image8 0.15233 0.23343 0.21640 0.21536 0.0340 0.00037
image9 0.11741 0.17177 0.30983 0.30761 0.0237 0.00029
image10 0.20090 0.22419 0.34329 0.34271 0.0201 0.00048
image11 0.47632 0.20588 0.33225 0.32941 0.0102 0.00109
image12 0.25544 0.22335 0.32488 0.32502 0.0154 0.00058
image13 0.36066 0.19607 0.28106 0.28220 0.0189 0.00035
image14 0.18971 0.20332 0.40615 0.40261 0.0204 0.00082
image15 0.21509 0.20378 0.35106 0.35013 0.0221 0.00060
image16 0.52783 0.30672 0.26907 0.26888 0.0194 0.00090
image17 0.52887 0.31160 0.33544 0.33720 0.0156 0.00122
image18 0.26649 0.25937 0.55761 0.55732 0.0150 0.00023
image19 0.12582 0.16205 0.27327 0.27302 0.0138 0.00002
image20 0.25892 0.18401 0.16588 0.16541 0.0257 0.00203
image21 0.18091 0.25074 0.38734 0.38546 0.0275 0.00171
TABLE II: The N^{abf} values for the 21 fused images obtained by each fusion method.
Fig. 7: Plot of N^{abf} for all fused images obtained by the fusion methods compared in the experiment.

From Table II and Fig.7, the values of N^{abf} produced by our method are nearly two orders of magnitude better than those of CBF, JSR and JSRSD. Even compared with ConvSR, the N^{abf} values of the proposed method are extremely small. This indicates that the fused images obtained by the proposed method contain less artificial information and noise.

V Conclusion

In this paper, we present a simple and effective fusion method based on a deep learning framework (VGG-network) for the infrared and visible image fusion task. Firstly, the source images are decomposed into base parts and detail content. The former contains low-frequency information and the latter contains texture information. The base parts are fused by the weighted-averaging strategy. For the detail content, we proposed a novel multi-layer fusion strategy based on a pre-trained VGG-19 network. The deep features of the detail content are obtained by this fixed VGG-19 network. The l_1-norm and a block-averaging operator are used to compute the activity level maps, and the weight maps are obtained by the soft-max operator. An initial fused detail content is generated for each pair of weight maps and the input detail content. The fused detail content is then reconstructed by applying the max selection operator to these initial fused detail contents. Finally, the fused image is reconstructed by adding the fused base part and the fused detail content. We use both subjective and objective methods to evaluate the proposed method. The experimental results show that the proposed method exhibits state-of-the-art fusion performance.

We believe our fusion method and the novel multi-layer fusion strategy can be applied to other image fusion tasks, such as medical image fusion, multi-exposure image fusion and multi-focus image fusion.


  • [1] Li S, Kang X, Fang L, et al. Pixel-level image fusion: A survey of the state of the art[J]. Information Fusion, 2017, 33: 100-112.
  • [2] Ben Hamza A, He Y, Krim H, et al. A multiscale approach to pixel-level image fusion[J]. Integrated Computer-Aided Engineering, 2005, 12(2): 135-146.
  • [3] Yang S, Wang M, Jiao L, et al. Image fusion based on a new contourlet packet[J]. Information Fusion, 2010, 11(2): 78-84.
  • [4] Wang L, Li B, Tian L F. EGGDD: An explicit dependency model for multi-modal medical image fusion in shift-invariant shearlet transform domain[J]. Information Fusion, 2014, 19: 29-37.
  • [5] Pang H, Zhu M, Guo L. Multifocus color image fusion using quaternion wavelet transform[C]//Image and Signal Processing (CISP), 2012 5th International Congress on. IEEE, 2012: 543-546.
  • [6] Bavirisetti D P, Dhuli R. Two-scale image fusion of visible and infrared images using saliency detection[J]. Infrared Physics & Technology, 2016, 76: 52-64.
  • [7] Zong J, Qiu T. Medical image fusion based on sparse representation of classified image patches[J]. Biomedical Signal Processing and Control, 2017, 34: 195-205.
  • [8] Lu X, Zhang B, Zhao Y, et al. The infrared and visible image fusion algorithm based on target separation and sparse representation[J]. Infrared Physics & Technology, 2014, 67: 397-407.
  • [9] Yin M, Duan P, Liu W, et al. A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation[J]. Neurocomputing, 2017, 226: 182-191.
  • [10] Zhang Q, Fu Y, Li H, et al. Dictionary learning method for joint sparse representation-based image fusion[J]. Optical Engineering, 2013, 52(5): 057006.
  • [11] Gao R, Vorobyov S A, Zhao H. Image fusion with cosparse analysis operator[J]. IEEE Signal Processing Letters, 2017, 24(7): 943-947.
  • [12] Li H, Wu X J. Multi-focus Image Fusion Using Dictionary Learning and Low-Rank Representation[C]//International Conference on Image and Graphics. Springer, Cham, 2017: 675-686.
  • [13] Liu Y, Chen X, Ward R K, et al. Image fusion with convolutional sparse representation[J]. IEEE signal processing letters, 2016, 23(12): 1882-1886.
  • [14] Liu Y, Chen X, Peng H, et al. Multi-focus image fusion with a deep convolutional neural network[J]. Information Fusion, 2017, 36: 191-207.
  • [15] Li S, Kang X, Hu J. Image fusion with guided filtering[J]. IEEE Transactions on Image Processing, 2013, 22(7): 2864-2875.
  • [16] Gatys L A, Ecker A S, Bethge M. Image style transfer using convolutional neural networks[C]//Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016: 2414-2423.
  • [17] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
  • [18] Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution[C]//European Conference on Computer Vision. Springer, Cham, 2016: 694-711.
  • [19] Xun Huang, Serge Belongie. Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization[C]//2017 The IEEE International Conference on Computer Vision (ICCV). IEEE, 2017:1501-1510.
  • [20] Ma J, Zhou Z, Wang B, et al. Infrared and visible image fusion based on visual saliency map and weighted least square optimization[J]. Infrared Physics & Technology, 2017, 82: 8-17.
  • [21] Kumar B K S. Image fusion based on pixel significance using cross bilateral filter[J]. Signal, image and video processing, 2015, 9(5): 1193-1204.
  • [22] Liu C H, Qi Y, Ding W R. Infrared and visible image fusion method based on saliency detection in sparse domain[J]. Infrared Physics & Technology, 2017, 83: 94-102.
  • [23] Haghighat M, Razian M A. Fast-FMI: non-reference image fusion metric[C]//Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on. IEEE, 2014: 1-3.
  • [24] Kumar B K S. Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform[J]. Signal, Image and Video Processing, 2013, 7(6): 1125-1143.
  • [25] Toet A. TNO Image Fusion Dataset[J]. Figshare data, 2014.
  • [26] Li H. CODE: Infrared and Visible Image Fusion using a Deep Learning Framework.