Infrared and visible image fusion using deep learning framework(multi-layers strategy)
In recent years, deep learning has become a very active research tool which is used in many image processing fields. In this paper, we propose an effective image fusion method using a deep learning framework to generate a single image which contains all the features from infrared and visible images. First, the source images are decomposed into base parts and detail content. Then the base parts are fused by weighted-averaging. For the detail content, we use a deep learning network to extract multi-layer features. Using these features, we use the $l_1$-norm and a weighted-average strategy to generate several candidates of the fused detail content. Once we get these candidates, the max selection strategy is used to get the final fused detail content. Finally, the fused image is reconstructed by combining the fused base part and detail content. The experimental results demonstrate that our proposed method achieves state-of-the-art performance in both objective assessment and visual quality. The code of our fusion method is available at https://github.com/exceptionLi/imagefusion_deeplearning.
The fusion of infrared and visible imaging is an important and frequently occurring problem. Recently, many fusion methods have been proposed to combine the features present in infrared and visible images into a single image. These state-of-the-art methods are widely used in many applications, such as image pre-processing, target recognition and image classification.
The key problem of image fusion is how to extract salient features from the source images and how to combine them to generate the fused image.
For decades, many signal processing methods have been applied in the image fusion field to extract image features, such as the discrete wavelet transform (DWT), contourlet transform, shift-invariant shearlet transform and quaternion wavelet transform. For the infrared and visible image fusion task, Bavirisetti et al. proposed a two-scale decomposition and saliency detection-based fusion method, whereby mean and median filters are used to extract the base layers and detail layers. Visual saliency is then used to obtain weight maps. Finally, the fused image is obtained by combining these three parts.
Besides the above methods, the role of sparse representation (SR) and low-rank representation has also attracted great attention. Zong et al. proposed a medical image fusion method based on SR, in which the Histogram of Oriented Gradients (HOG) features are used to classify the image patches and learn several sub-dictionaries. The $l_1$-norm and the max selection strategy are used to reconstruct the fused image. In addition, there are many methods based on combining SR and other tools for image fusion, such as the pulse coupled neural network (PCNN) and shearlet transform. In the sparse domain, the joint sparse representation and cosparse representation were also applied in the image fusion field. In the low-rank category, Li et al. proposed a low-rank representation (LRR)-based fusion method. They use LRR instead of SR to extract features, then the $l_1$-norm and the max selection strategy are used to reconstruct the fused image.
With the rise of deep learning, deep features of the source images, which are also a kind of saliency feature, are used to reconstruct the fused image. Yu Liu et al. proposed a fusion method based on convolutional sparse representation (CSR). The CSR is different from deep learning methods, but the features extracted by CSR are still deep features. In their method, the authors employ CSR to extract multi-layer features, and then use these features to generate the fused image. In addition, Yu Liu et al. also proposed a convolutional neural network (CNN)-based fusion method. They use image patches which contain different blur versions of the input image to train the network and use it to get a decision map. Finally, the fused image is obtained by using the decision map and the source images. Although the deep learning-based methods achieve better performance, these methods still have many drawbacks: 1) the CNN-based method is only suitable for multi-focus image fusion; 2) these methods use only the result calculated by the last layers, so a lot of useful information obtained by the middle layers is lost. The information loss tends to get worse when the network is deeper.
In this paper, we propose a novel and effective fusion method based on a deep learning framework for infrared and visible image fusion. The source images are decomposed into base parts and detail content by an image decomposition approach. We use a weighted-averaging strategy to obtain the fused base part. To fuse the detail content, we first use a deep learning network to compute multi-layer features so as to preserve as much information as possible. For the features at each layer, we use the soft-max operator to obtain weight maps, from which a candidate fused detail content is obtained. Applying the same operation at multiple layers, we get several candidates for the fused detail content. The final fused detail image is generated by the max selection strategy. The final fused image is reconstructed by fusing the base part with the detail content.
This paper is structured as follows. In Section II, image style transfer using a deep learning framework is presented. In Section III, the proposed deep learning based image fusion method is introduced in detail. The experimental results are shown in Section IV. Finally, Section V draws the paper to a conclusion.
It is well known that deep learning achieves state-of-the-art performance in many image processing tasks, such as image classification. In addition, deep learning can also be a useful tool for extracting image features which contain different information at each layer. Different applications of deep learning have received a lot of attention in the last two years. Hence, we believe deep learning can also be applied to the image fusion task.
In CVPR 2016, Gatys et al. proposed an image style transfer method based on CNN. They use a VGG-network to extract deep features at different layers from the “content” image, the “style” image and a generated image, respectively. The difference between the deep features extracted from the generated image and the source images is minimised by iteration. The generated image will contain the main object from the “content” image and texture features from the “style” image. Although this method can obtain good stylized images, it is extremely slow even when using a GPU.
Due to these drawbacks, in ECCV 2016, Justin Johnson et al. proposed a feed-forward network to solve the optimization problem formulated by Gatys et al. in real time. But in this method, each network is tied to a fixed style. To solve this problem, in ICCV 2017, Xun Huang et al. used a VGG-network and adaptive instance normalization to construct a new style transfer framework. In this framework, the stylized image can be of arbitrary style and the method is nearly three orders of magnitude faster than the method of Gatys et al.
These methods have one thing in common. They all use multi-layer network features as a constraint condition. Inspired by them, multi-layer deep features are extracted by a VGG-network in our fusion method. We use a fixed VGG-19 which is trained on ImageNet to extract the features. The details of our proposed fusion method are introduced in the next section.
The fusion processing of base parts and detail content is introduced in the next subsections.
Suppose that there are $K$ preregistered source images. In our paper, we choose $K=2$, but the fusion strategy is the same for $K>2$. The source images are denoted as $I_k$, $k \in \{1,2\}$.
Compared with other image decomposition methods, such as wavelet decomposition and latent low-rank decomposition, the optimization-based method is more effective and saves time. So in our paper, we use this method to decompose the source images.
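For each source image $I_k$, the base part $I_k^b$ and the detail content $I_k^d$ are separated by this decomposition. The base part is obtained by solving the following optimization problem:

$$I_k^b = \arg\min_{I_k^b} \|I_k - I_k^b\|_F^2 + \lambda\left(\|g_x * I_k^b\|_F^2 + \|g_y * I_k^b\|_F^2\right) \tag{1}$$

where $g_x = [-1\ \ 1]$ and $g_y = [-1\ \ 1]^T$ are the horizontal and vertical gradient operators, respectively. The parameter $\lambda$ is set to 5 in our paper.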
After we get the base part $I_k^b$, the detail content $I_k^d$ is obtained by Eq.2,

$$I_k^d = I_k - I_k^b \tag{2}$$
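The decomposition can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' MATLAB implementation: it assumes the objective is a squared data term plus a weighted sum of squared horizontal and vertical first differences, and it assumes periodic image boundaries so that the minimizer has a closed form in the Fourier domain. The function name `decompose` and the argument `lam` are illustrative.

```python
import numpy as np

def decompose(image, lam=5.0):
    """Split a 2-D image into (base, detail) with a quadratic smoothness prior.

    Assumption: periodic boundaries, so the normal equations diagonalize
    under the 2-D FFT.
    """
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape

    # Circular first-difference kernels (horizontal and vertical).
    gx = np.zeros((h, w))
    gx[0, 0] -= 1.0
    gx[0, -1] += 1.0
    gy = np.zeros((h, w))
    gy[0, 0] -= 1.0
    gy[-1, 0] += 1.0

    # Closed-form minimizer: divide the spectrum by 1 + lam * (|Gx|^2 + |Gy|^2).
    denom = 1.0 + lam * (np.abs(np.fft.fft2(gx)) ** 2 + np.abs(np.fft.fft2(gy)) ** 2)
    base = np.real(np.fft.ifft2(np.fft.fft2(image) / denom))

    # The detail content is whatever the smooth base part does not explain.
    detail = image - base
    return base, detail
```

A larger `lam` gives a smoother base part and pushes more texture into the detail content.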
The framework of the proposed fusion method is shown in Fig.1.
As shown in Fig.1, the source images are denoted as $I_1$ and $I_2$. Firstly, the base part $I_k^b$ and the detail content $I_k^d$ for each source image are obtained by solving Eq.(1) and Eq.(2), where $k \in \{1,2\}$. Then the base parts are fused by the weighted-averaging strategy and the detail content is reconstructed by our deep learning framework. Finally, the fused image $F$ is reconstructed by adding the fused base part $F_b$ and the fused detail content $F_d$.
The base parts which are extracted from the source images contain the common features and redundant information. In our paper, we choose the weighted-averaging strategy to fuse these base parts. The fused base part is calculated by Eq.3,

$$F_b(x,y) = \alpha_1 I_1^b(x,y) + \alpha_2 I_2^b(x,y) \tag{3}$$

where $(x,y)$ denotes the corresponding position of the image intensity in $I_1^b$, $I_2^b$ and $F_b$. $\alpha_1$ and $\alpha_2$ indicate the weight values for the pixel in $I_1^b$ and $I_2^b$, respectively. To preserve the common features and reduce the redundant information, in this paper, we choose $\alpha_1 = 0.5$ and $\alpha_2 = 0.5$.
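As a small illustration, the weighted-averaging of the two base parts can be written as below. This is only a sketch under the assumption that both base parts are arrays of equal size. The function name `fuse_base` and the argument names are illustrative, and the default weights follow the equal weighting stated in the text.

```python
import numpy as np

def fuse_base(base_1, base_2, alpha_1=0.5, alpha_2=0.5):
    """Pixel-wise weighted average of two base parts."""
    base_1 = np.asarray(base_1, dtype=np.float64)
    base_2 = np.asarray(base_2, dtype=np.float64)
    if base_1.shape != base_2.shape:
        raise ValueError("base parts must have the same shape")
    return alpha_1 * base_1 + alpha_2 * base_2
```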
For the detail content $I_1^d$ and $I_2^d$, we propose a novel fusion strategy which uses a deep learning method (VGG-network) to extract deep features. This procedure is shown in Fig.2.
In Fig.2, we use VGG-19 to extract deep features. Then the weight maps are obtained by a multi-layer fusion strategy. Finally, the fused detail content is reconstructed by these weight maps and the detail content.
Now, we introduce our multi-layer fusion strategy in detail.
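Consider the detail content $I_k^d$. $\phi_k^{i,m}$ indicates the feature maps of the $k$-th detail content extracted by the $i$-th layer and $m$ is the channel number of the $i$-th layer, $m \in \{1,2,\dots,M\}$, $M = 64 \times 2^{i-1}$,

$$\phi_k^{i,m} = \Phi_i(I_k^d) \tag{4}$$

where each $\Phi_i(\cdot)$ denotes a layer in the VGG-network and $i \in \{1,2,3,4\}$ represents relu_1_1, relu_2_1, relu_3_1 and relu_4_1, respectively.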
Let $\phi_k^{i,m}(x,y)$ denote the contents of $\phi_k^{i,m}$ at the position $(x,y)$ in the feature maps. As we can see, $\phi_k^{i,1:M}(x,y)$ is an $M$-dimensional vector. The procedure of our strategy is presented in Fig.3.
As shown in Fig.3, after we get the deep features $\phi_k^{i,m}$, the activity level map $\hat{C}_k^i$ is calculated by the $l_1$-norm and the block-based average operator, where $k \in \{1,2\}$ and $i \in \{1,2,3,4\}$.
The $l_1$-norm of $\phi_k^{i,1:M}(x,y)$ can serve as the activity level measure of the source detail content. Thus, the initial activity level map $C_k^i$ is obtained by

$$C_k^i(x,y) = \|\phi_k^{i,1:M}(x,y)\|_1 \tag{5}$$
We then use the block-based average operator to calculate the final activity level map $\hat{C}_k^i$ in order to make our fusion method robust to misregistration.

$$\hat{C}_k^i(x,y) = \frac{\sum_{\beta=-r}^{r}\sum_{\theta=-r}^{r} C_k^i(x+\beta, y+\theta)}{(2r+1)^2} \tag{6}$$

where $r$ determines the block size. The fusion method will be more robust to misregistration if $r$ is larger, but some detail could be lost. Thus, in our strategy $r=1$.
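The two steps above, a channel-wise $l_1$-norm followed by a block average, can be sketched as follows. This is a minimal sketch under stated assumptions: the features of one layer are given as an array of shape (channels, height, width), and the image border is handled by edge replication, which the text does not specify. The function name `activity_level` is illustrative.

```python
import numpy as np

def activity_level(features, r=1):
    """Activity level map of one layer from features of shape (M, H, W)."""
    features = np.asarray(features, dtype=np.float64)

    # Initial activity: l1-norm of the M-dimensional vector at each position.
    initial = np.abs(features).sum(axis=0)

    # Block average over a (2r+1) x (2r+1) window.
    # Assumption: borders are padded by edge replication.
    padded = np.pad(initial, r, mode="edge")
    h, w = initial.shape
    final = np.zeros_like(initial)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            final += padded[dy:dy + h, dx:dx + w]
    return final / (2 * r + 1) ** 2
```

With `r=1` each value is averaged over a 3 x 3 block, which trades a little spatial precision for tolerance to small misalignments between the source images.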
Once we get the activity level map $\hat{C}_k^i$, the initial weight maps $W_k^i$ are calculated by the soft-max operator, as shown in Eq.7,

$$W_k^i(x,y) = \frac{\hat{C}_k^i(x,y)}{\sum_{n=1}^{K}\hat{C}_n^i(x,y)} \tag{7}$$

where $K$ denotes the number of activity level maps, which in our paper is set to $K=2$. $W_k^i(x,y)$ indicates the initial weight map value in the range of [0,1].
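A sketch of this weighting step is given below. It assumes that the operator called soft-max here divides each activity level by the sum over the source images (no exponentials), which yields weights in [0,1] that sum to one at each position. The fallback to equal weights where all activities are zero is an added assumption, and the function name `weight_maps` is illustrative.

```python
import numpy as np

def weight_maps(activity_maps, eps=1e-12):
    """Normalize K non-negative activity maps of shape (K, H, W) into weights."""
    a = np.asarray(activity_maps, dtype=np.float64)
    k = a.shape[0]
    total = a.sum(axis=0, keepdims=True)

    # Assumption: where every activity is zero, fall back to equal weights.
    return np.where(total > eps, a / np.maximum(total, eps), 1.0 / k)
```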
The pooling operator in the VGG-network is a kind of subsampling method. Each time it is applied, this operator resizes the feature maps to $1/s$ times the original size, where $s$ is the stride of the pooling operator. In the VGG-network, the stride of the pooling operator is 2. So in different layers, the size of the feature maps is $1/2^{i-1}$ times the detail content size, where $i \in \{1,2,3,4\}$ indicates the layers relu_1_1, relu_2_1, relu_3_1 and relu_4_1, respectively. After we get each initial weight map $W_k^i$, we use an upsampling operator to resize the weight map to the input detail content size, which gives the final weight map $\hat{W}_k^i$.
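One simple way to realize the upsampling is to repeat each weight value over a square block whose side equals the accumulated pooling factor. The sketch below assumes this nearest-neighbour scheme and a 1-based layer index, so that the first selected layer needs no resizing. The edge padding for image sizes that are not divisible by the factor is an added assumption, and the function name `upsample_weight_map` is illustrative.

```python
import numpy as np

def upsample_weight_map(weight_map, layer_index, out_shape):
    """Resize a weight map of layer `layer_index` (1-based) to `out_shape`."""
    factor = 2 ** (layer_index - 1)  # stride-2 pooling applied (layer_index - 1) times
    w = np.asarray(weight_map, dtype=np.float64)

    # Copy every weight into a factor x factor block.
    up = np.repeat(np.repeat(w, factor, axis=0), factor, axis=1)

    # Assumption: if pooling dropped odd rows or columns, replicate the last
    # row/column to reach the target size, then crop any excess.
    out_h, out_w = out_shape
    pad_h = max(0, out_h - up.shape[0])
    pad_w = max(0, out_w - up.shape[1])
    up = np.pad(up, ((0, pad_h), (0, pad_w)), mode="edge")
    return up[:out_h, :out_w]
```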
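Now we have four pairs of weight maps $\hat{W}_k^i$, $k \in \{1,2\}$ and $i \in \{1,2,3,4\}$. For each pair $\hat{W}_k^i$, the initial fused detail content $F_d^i$ is obtained by Eq.9,

$$F_d^i(x,y) = \sum_{n=1}^{K} \hat{W}_n^i(x,y) \times I_n^d(x,y), \quad K=2 \tag{9}$$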
Finally, the fused detail content $F_d$ is obtained by Eq.10, in which we choose the maximum value from the four initial fused detail contents for each position.

$$F_d(x,y) = \max\left[F_d^i(x,y) \mid i \in \{1,2,3,4\}\right] \tag{10}$$
Once the fused detail content $F_d$ is obtained, we use the fused base part $F_b$ and the fused detail content $F_d$ to reconstruct the final fused image, as shown in Eq.11,

$$F(x,y) = F_b(x,y) + F_d(x,y) \tag{11}$$
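The last three steps (one candidate per layer, max selection across layers, and the final addition) can be sketched together. This is a sketch under the assumption that the weight maps have already been resized to the image size and are stacked as an array of shape (layers, sources, height, width). The function names `fuse_details` and `reconstruct` are illustrative.

```python
import numpy as np

def fuse_details(details, weights_per_layer):
    """Fuse K detail images (K, H, W) using weights of shape (L, K, H, W)."""
    d = np.asarray(details, dtype=np.float64)
    w = np.asarray(weights_per_layer, dtype=np.float64)

    # One candidate per layer: weighted sum of the detail images.
    candidates = (w * d[None, ...]).sum(axis=1)  # shape (L, H, W)

    # Max selection: keep the largest candidate value at each position.
    return candidates.max(axis=0)

def reconstruct(fused_base, fused_detail):
    """Final fused image: fused base part plus fused detail content."""
    return np.asarray(fused_base, dtype=np.float64) + np.asarray(fused_detail, dtype=np.float64)
```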
In this section, we summarize the proposed fusion method based on deep learning as follows:
1) The source images are decomposed by the image decomposition operation to obtain the base part $I_k^b$ and the detail content $I_k^d$, where $k \in \{1,2\}$.
2) We choose the weighted-averaging fusion strategy to fuse the base parts, with a weight value of 0.5 for each base part.
3) The fused detail content is obtained by the multi-layer fusion strategy.
4) Finally, the fused image is given by Eq.11.
The aim of the experiment is to validate the proposed method using subjective and objective criteria and to carry out a comparison with existing methods.
In our experiment, the source infrared and visible images were collected from previously published sources. There are 21 pairs of source images. A sample of these images is shown in Fig.4.
In the multi-layer fusion strategy, we choose a few layers from a pre-trained VGG-19 network to extract deep features. These layers are relu_1_1, relu_2_1, relu_3_1 and relu_4_1, respectively.
For comparison, we selected several recent and classical fusion methods to perform the same experiment, including: the cross bilateral filter fusion method (CBF), the joint-sparse representation model (JSR), the JSR model with saliency detection fusion method (JSRSD), the weighted least square optimization-based method (WLS) and the convolutional sparse representation model (ConvSR).
All the fusion algorithms are implemented in MATLAB R2016a on a 3.2 GHz Intel(R) Core(TM) CPU with 12 GB RAM.
The fused images which are obtained by the five existing methods and the proposed method are shown in Fig.5 and Fig.6. Due to the space limit, we evaluate the relative performance of the fusion methods only on a single pair of images (“street” and “people”).
As we can see from Fig.5(c-h), the fused image obtained by the proposed method preserves more detail information in the red window and contains less artificial noise. In Fig.6(c-h), the fused image obtained by the proposed method also contains less noise in the red box.
In summary, the fused images which are obtained by CBF have more artificial noise and the salient features are not clear. The fused images obtained by JSR, JSRSD and WLS, in addition, contain artificial structures around the salient features and the image detail is blurred. In contrast, the fused images obtained by ConvSR and the proposed fusion method contain more salient features and preserve more detail information. Compared with the four other existing fusion methods, the fused images obtained by the proposed method look more natural. As there is no difference between ConvSR and the proposed fusion method that is visible to the human eye, we use several objective quality metrics to evaluate the fusion performance in the next section.
For the purpose of quantitative comparison between the proposed method and existing fusion methods, four quality metrics are utilized. These are: $FMI_{dct}$ and $FMI_w$, which calculate mutual information (FMI) for the discrete cosine and wavelet features, respectively; $N_{abf}$, which denotes the rate of noise or artifacts added to the fused image by the fusion process; and modified structural similarity ($SSIM_a$).
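In our paper, $SSIM_a$ is calculated by Eq.12,

$$SSIM_a(F) = \left(SSIM(F, I_1) + SSIM(F, I_2)\right) \times 0.5 \tag{12}$$

where $SSIM(\cdot)$ represents the structural similarity operation, $F$ is the fused image, and $I_1$, $I_2$ are the source images. The value of $SSIM_a$ assesses the ability to preserve structural information.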
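The modified structural similarity can be illustrated with the sketch below. It assumes that the metric is the plain average of the structural similarity between the fused image and each of the two source images. To stay self-contained it uses a simplified, single-window (global) SSIM with the customary constants, rather than the locally windowed SSIM that is normally used, so its values will differ from those of a standard implementation. The function names `global_ssim` and `ssim_a` are illustrative.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Simplified SSIM computed over the whole image as a single window."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_a(fused, src_1, src_2, data_range=255.0):
    """Average structural similarity between the fused image and both sources."""
    return 0.5 * (global_ssim(fused, src_1, data_range) + global_ssim(fused, src_2, data_range))
```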
The performance improves as the values of $FMI_{dct}$, $FMI_w$ and $SSIM_a$ increase. On the contrary, the fusion performance is better when the value of $N_{abf}$ is small, which means the fused images contain less artificial information and noise.
The average values of $FMI_{dct}$, $FMI_w$, $SSIM_a$ and $N_{abf}$ obtained by the existing methods and the proposed method for the 21 fused images are shown in Table I.
In Table I, the best values for $FMI_{dct}$, $FMI_w$, $SSIM_a$ and $N_{abf}$ are indicated in bold. As we can see, the proposed method has the best average values for all of these metrics. These values indicate that the fused images obtained by the proposed method are more natural and contain less artificial noise. From the objective evaluation, our fusion method has better fusion performance than the existing methods.
From Table II and Fig.7, the values of $N_{abf}$ produced by our method are nearly two orders of magnitude better than those of CBF, JSR and JSRSD. Even compared with ConvSR, the values of the proposed method are extremely small. This indicates that the fused images obtained by the proposed method contain less artificial information and noise.
In this paper, we present a simple and effective fusion method based on a deep learning framework (VGG-network) for the infrared and visible image fusion task. Firstly, the source images are decomposed into base parts and detail content. The former contains low frequency information and the latter contains texture information. These base parts are fused by the weighted-averaging strategy. For the detail content, we propose a novel multi-layer fusion strategy based on a pre-trained VGG-19 network. The deep features of the detail content are obtained by this fixed VGG-19 network. The $l_1$-norm and block-averaging operator are used to get the initial weight maps. The final weight maps are obtained by the soft-max operator. The initial fused detail content is generated for each pair of weight maps and the input detail content. The fused detail content is reconstructed by the max selection operator applied to these initial fused detail contents. Finally, the fused image is reconstructed by adding the fused base part and the fused detail content. We use both subjective and objective methods to evaluate the proposed method. The experimental results show that the proposed method exhibits state-of-the-art fusion performance.
We believe our fusion method and the novel multi-layer fusion strategy can be applied to other image fusion tasks, such as medical image fusion, multi-exposure image fusion and multi-focus image fusion.
Gatys L A, Ecker A S, Bethge M. Image style transfer using convolutional neural networks[C]//Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016: 2414-2423.
Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution[C]//European Conference on Computer Vision. Springer, Cham, 2016: 694-711.