Infrared and Visible Image Fusion with ResNet and zero-phase component analysis

06/19/2018 ∙ by Hui Li, et al. ∙ 2

In image fusion tasks, feature extraction and processing are key to the fusion algorithm. Besides traditional feature extraction methods, deep learning-based methods have also been applied to image fusion to extract features. However, most of them use the deep features directly, without any feature processing, which degrades the fusion performance in some cases. In this paper, a novel fusion framework based on deep features and zero-phase component analysis (ZCA) is proposed. Firstly, the residual network (ResNet) is used to extract deep features from the source images. Then ZCA and the $l_1$-norm are utilized to normalize the deep features and obtain initial weight maps. The final weight maps are obtained from the initial weight maps by a soft-max operation. Finally, the fused image is reconstructed from the weight maps and the source images. Compared with existing fusion methods, experimental results demonstrate that our algorithm achieves better performance in both objective assessment and visual quality. The code of our fusion algorithm is available at




1 Introduction

Infrared and visible image fusion is a frequently occurring requirement in image fusion, and fusion methods for this task are widely used in many applications. These algorithms combine the salient features of the source images into a single image [1]. The fused image is then utilized in several computer vision tasks.

The extraction and processing of features are key tasks in infrared and visible image fusion, and the fusion performance is directly affected by the features chosen and by the way they are processed.

For decades, signal processing algorithms [2]-[5] were the most popular feature extraction tools in image fusion tasks. In 2016, a two-scale decomposition and saliency detection-based fusion method was proposed by Bavirisetti et al. [6]. The base layers and detail layers were extracted by a mean filter and a median filter. The visual salient features were used to obtain weight maps. Then the fused image was reconstructed by combining these three parts.

In recent years, representation learning-based fusion methods have attracted great attention and exhibited state-of-the-art fusion performance.

In the sparse representation (SR) domain, Zong et al. [7] proposed a novel medical image fusion method based on SR. In their paper, the sub-dictionaries are learned from Histogram of Oriented Gradients (HOG) features. Then the $l_1$-norm and the max selection strategy are used to reconstruct the fused image. In addition to this approach, the joint sparse representation [8], cosparse representation [9], pulse coupled neural network (PCNN) [10] and shearlet transform [11] have also been applied to image fusion in combination with SR.

In other representation learning domains, the low-rank representation (LRR) was applied to image fusion tasks for the first time by Li et al. [12]. In [12], they use HOG and a dictionary learning method to obtain a global dictionary. The dictionary is then used in LRR, and the fused low-rank coefficients are obtained by using the $l_1$-norm and a choose-max strategy. Finally, the fused image is reconstructed using the global dictionary and LRR. For infrared and visible image fusion, Li et al. [13] also proposed a simple and effective algorithm based on latent low-rank representation (LatLRR). Here the source images are decomposed into low-frequency and high-frequency coefficients by LatLRR, and the fused image is reconstructed by using a weighted-averaging strategy.

Although these representation learning-based methods exhibit good fusion performance, they still have two main drawbacks: 1) it is very difficult to learn a dictionary offline for representation learning-based methods; 2) the time efficiency of representation learning-based methods is very low, especially when online dictionary learning methods are used in the fusion algorithm. Recently, fusion algorithms have therefore been improved in two aspects: time efficiency and fusion performance.

In the last two years, deep learning has been applied to image fusion tasks, and has been shown to achieve better fusion performance and time efficiency than non-deep learning-based methods. Most deep learning-based fusion methods just treat deep learning as a feature extraction operation and use deep features obtained from a fixed network to reconstruct the fused image. In [14], a convolutional sparse representation (CSR) based fusion method was proposed by Yu Liu et al. The CSR is used to extract features which are obtained by different dictionaries. In addition, Yu Liu et al. [15] also proposed an algorithm based on a convolutional neural network (CNN). Image patches which contain different blurred versions of the input image are used to train the network, and a decision map is obtained. Finally, the fused image is obtained from the decision map and the source images. The obvious drawback of these two methods is that they are only suitable for the multi-focus image fusion task.

In ICCV 2017, Prabhakar et al. [16] proposed a simple and efficient CNN-based method for the exposure fusion problem. In their method, the encoding network has a siamese architecture in which the weights are tied. The input images are encoded by the encoding network. Then two feature map sequences are obtained and fused by an addition strategy. The final fused image is reconstructed by a decoding network which contains three CNN layers. This network is not only suitable for the exposure fusion problem; it also achieves good performance in other fusion tasks. However, the architecture is too simple, and the information contained in a deep network may not be fully utilized.

Li et al. [17] therefore proposed a VGG-based [18] fusion method which uses a deeper network and multi-layer deep features. Firstly, the source images are decomposed into base parts and detail content. The base parts are fused by a weighted-averaging strategy. The fixed VGG-19 network, which is trained on ImageNet, is used to extract multi-layer deep features from the detail content. Then initial weight maps are calculated from the multi-layer deep features by a soft-max operator, and several candidates for the fused detail content are obtained from the initial weight maps. The choose-max strategy is used to construct the final weight maps for the detail content, and the final weight maps are utilized to obtain the fused detail content. Finally, the fused image is reconstructed by combining the fused base part and the fused detail content.

Although the information from the middle layers is used by the VGG-based fusion method [17], the multi-layer combining method is still too simple, and much useful information is lost in feature extraction. This phenomenon gets worse when the network is deeper.

To solve these problems, we propose a fusion method to fully utilize and process the deep features. In this paper, a novel fusion algorithm based on the residual network (ResNet) [19] and zero-phase component analysis (ZCA) [20] is proposed for the infrared and visible image fusion task. Firstly, the source images are fed into a fixed ResNet to obtain the deep features. Due to the architecture of ResNet, the deep features already contain multi-layer information, so we just use the output of a single layer. Then the ZCA operation is utilized to project the deep features into a sparse domain, and initial weight maps are obtained by the $l_1$-norm. We use bicubic interpolation to resize the initial weight maps to the source image size. The final weight maps are obtained by a soft-max operation. Finally, the fused image is reconstructed from the final weight maps and the source images.

In Section 2, we review related work, and in Section 3 we describe our fusion algorithm. The experimental results are shown in Section 4. Finally, Section 5 draws the conclusions of the paper.

2 Related work

Deep residual network (ResNet). In CVPR 2016, He et al. [19] proposed a novel network architecture to address the degradation problem. With the shortcut connections and residual representations, their nets were easier to optimize than previous networks and offered better accuracy by increasing the depth. The residual block architecture is shown in Fig.1.

Figure 1: The architecture of residual block.

In Fig.1, $X$ indicates the input of the block, $F(X)$ denotes the network operation, which contains two weight layers, and "relu" represents the rectified linear unit. The output of the residual block is calculated by $F(X) + X$. With this structure, the multi-layer information is utilized. Furthermore, in image reconstruction tasks [21]-[23], the performance improves when the residual block is used. We also use this architecture in our fusion method.
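The residual mapping described above can be illustrated numerically. The snippet below is only a minimal NumPy sketch: dense weight layers stand in for the convolutional layers of [19], and the names `relu`, `residual_block`, `w1` and `w2` are hypothetical, chosen for this illustration.

```python
import numpy as np

def relu(x):
    """Rectified linear unit."""
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block (assumption: dense layers replace convolutions).

    F(X) is two weight layers with a ReLU in between; the shortcut
    connection adds the input X to F(X) before the final ReLU.
    """
    fx = relu(x @ w1) @ w2   # F(X): two weight layers
    return relu(fx + x)      # shortcut connection, then ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)
```

When both weight layers are zero, the block reduces to `relu(x)`, which shows how the shortcut carries the input forward even if the weight layers contribute nothing.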

Zero-phase component analysis (ZCA). In [20], Kessy et al. analyzed whitening and decorrelation by the ZCA operation. The ZCA operation projects a random vector into an uncorrelated sub-space, which is also called whitening. In the image processing field, ZCA is a very useful tool for processing features, and the processed features can improve algorithm performance. We introduce the ZCA operation briefly.

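Let $X = (x_1, x_2, \cdots, x_d)^T$ indicate the $d$-dimensional random vector and $\mu = (\mu_1, \mu_2, \cdots, \mu_d)^T$ represent the mean values. The covariance matrix $Co$ is calculated by $Co = (X - \mu)(X - \mu)^T$. Then the Singular Value Decomposition (SVD) is utilized to decompose $Co$, as shown in Eq.1,

$$Co = U \Sigma V^T \tag{1}$$

Finally, the new random vector $\hat{X}$ is calculated by Eq.2,

$$\hat{X} = U (\Sigma + \epsilon I)^{-\frac{1}{2}} U^T (X - \mu) \tag{2}$$

where $I$ denotes the identity matrix, and $\epsilon$ is a small value avoiding bad matrix inversion.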
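The whitening step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it assumes the random vector is observed as the columns of a sample matrix, estimates the covariance with a 1/n factor, and uses a hypothetical function name `zca_whiten` and an arbitrary small constant `eps`.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten a sample matrix X of shape (d, n).

    Assumption: each column is one observation of the d-dimensional
    random vector, and the covariance is estimated with a 1/n factor.
    """
    mu = X.mean(axis=1, keepdims=True)       # mean values
    Xc = X - mu                              # centred observations
    Co = Xc @ Xc.T / X.shape[1]              # covariance matrix (d x d)
    U, S, _ = np.linalg.svd(Co)              # Co = U diag(S) V^T
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return W @ Xc                            # whitened observations

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
X = A @ rng.standard_normal((3, 5000))       # correlated samples
X_hat = zca_whiten(X)
cov_hat = X_hat @ X_hat.T / X.shape[1]       # close to the identity matrix
```

After whitening, the sample covariance `cov_hat` is close to the identity, which is the decorrelation property the text refers to.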

ZCA utilization in image style transfer. Recently, ZCA has also been utilized in the image style transfer task, which is one of the most popular tasks in the image processing field.

Li et al. [24] proposed a universal style transfer algorithm which uses the ZCA operation to transfer the style of an artistic image to a content image. The encoder network is used to obtain the style features ($f_s$) and content features ($f_c$). Then the authors use the ZCA operation to project $f_s$ and $f_c$ into the same space. The final transferred features are obtained by a coloring transform, which is the reverse of the ZCA operation. Finally, the styled image is obtained from the transferred features by a decoder network.

In addition, in CVPR 2018, Sheng et al. [25] also use the ZCA operation in their style transfer method. The VGG network is utilized to extract image features, and ZCA is used to project the features into the same space. Then the transferred features are obtained by a reassembling operation based on patches. Finally, a decoder network trained on the MSCOCO [26] dataset is utilized to reconstruct the styled image from the transferred features.

The style transfer methods above show that the ZCA operation is a powerful tool for processing image features, especially in image reconstruction tasks. The ZCA operation projects image features into a sub-space in which they are easier to classify and reconstruct. Inspired by these methods, we also apply the ZCA operation to the image fusion task.

3 The Proposed Fusion Method

In this section, the proposed fusion method is introduced in detail.

Assume there are $K$ preregistered source images; in our paper, $K = 2$. Note that the fusion strategy is the same for $K > 2$. The source images are represented as $Source_k$, $k \in \{1, 2\}$. The framework of the proposed fusion method is shown in Fig.2.

Figure 2: The framework of proposed method.

As shown in Fig.2, the source images are indicated as $Source_1$ and $Source_2$. ResNet50 contains 50 weight layers, which include 5 convolutional blocks (conv1, conv2, conv3, conv4, conv5). ResNet50 is a fixed network trained on ImageNet [27], and we use it to extract the deep features. The output of the $i$-th block is indicated by the deep features $F_k^{i,1:C}$, which contain $C$ channels, $i \in \{1, 2, \cdots, 5\}$. We use ZCA and the $l_1$-norm to process $F_k^{i,1:C}$, and the initial weight maps $F_k^{i,*}$ are obtained by these operations. Then the weight maps $w_k^i$ are obtained by resizing (bicubic interpolation) and soft-max. Finally, the fused image is reconstructed using a weighted-average strategy. In our paper we choose $i = 4$ and $i = 5$ to evaluate our fusion framework.

3.1 ZCA operation for deep features

As we discussed earlier, ZCA projects the original features into the same space, and the features become more useful for the next processing. The ZCA operation for deep features is shown in Fig.3.

Figure 3: ZCA operation for deep features.

In Fig.3, we choose the output of the conv2 block, which contains 3 residual blocks, as an example to show the influence of the ZCA operation. Each block in the figure indicates one channel of the output. The original deep features have different orders of magnitude in each channel. We use ZCA to project the original features into the same space, and the features become more significant, as shown in Fig.3 (ZCA feature).

3.2 ZCA and $l_1$-norm operations

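After the deep features $F_k^{i,1:C}$ are obtained, we use ZCA to process them. When we obtain the processed features $\hat{F}_k^{i,1:C}$, we utilize the $l_1$-norm to calculate the initial weight map $F_k^{i,*}$. The procedure of the ZCA and $l_1$-norm operations is shown in Fig.4.

Figure 4: The procedure of ZCA and $l_1$-norm operation.

$F_k^{i,1:C}$ indicates the deep features obtained by the $i$-th convolutional block, which contains $C$ channels, $j \in \{1, 2, \cdots, C\}$. In the ZCA operation, the covariance matrix and its decomposition are calculated by Eq.3,

$$Co_k^{i,j} = F_k^{i,j} \times (F_k^{i,j})^T, \quad Co_k^{i,j} = U \Sigma V^T \tag{3}$$

where $j$ denotes the index of the channel in the deep features.

Then we use Eq.4 to obtain the processed features $\hat{F}_k^{i,1:C}$, which are combined from the channels $\hat{F}_k^{i,j}$,

$$\hat{F}_k^{i,j} = s_k^{i,j} \times F_k^{i,j}, \quad s_k^{i,j} = U (\Sigma + \epsilon I)^{-\frac{1}{2}} U^T \tag{4}$$

After we obtain the processed features $\hat{F}_k^{i,1:C}$, we utilize the local $l_1$-norm and an average operation to calculate the initial weight maps $F_k^{i,*}$ using Eq.5,

$$F_k^{i,*}(x, y) = \frac{\sum_{p=x-t}^{x+t} \sum_{q=y-t}^{y+t} \| \hat{F}_k^{i,1:C}(p, q) \|_1}{(2t+1) \times (2t+1)} \tag{5}$$

As shown in Fig.4, we choose a window of size $(2t+1) \times (2t+1)$ centred at $\hat{F}_k^{i,1:C}(x, y)$ to calculate the average $l_1$-norm, where $t$ sets the window radius.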
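The two steps of this subsection, channel-wise ZCA followed by a local average of the $l_1$-norm, can be sketched as follows. This is a hedged NumPy sketch rather than the authors' code: the function names `zca_channel` and `initial_weight_map` are hypothetical, each channel is whitened without mean removal, the border is zero-padded, and the default radius `t=2` is only an illustrative choice.

```python
import numpy as np

def zca_channel(F, eps=1e-5):
    """Whiten one feature channel F of shape (H, W).

    Assumption: the H x H matrix F F^T plays the role of the covariance.
    """
    Co = F @ F.T
    U, S, _ = np.linalg.svd(Co)
    s = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return s @ F

def initial_weight_map(features, t=2, eps=1e-5):
    """Local average l1-norm of ZCA-processed features of shape (C, H, W)."""
    F_hat = np.stack([zca_channel(F, eps) for F in features])
    l1 = np.abs(F_hat).sum(axis=0)           # l1-norm over channels, (H, W)
    H, W = l1.shape
    padded = np.pad(l1, t, mode="constant")  # zero padding (assumption)
    out = np.zeros_like(l1)
    for dp in range(2 * t + 1):              # sum over the window rows
        for dq in range(2 * t + 1):          # sum over the window columns
            out += padded[dp:dp + H, dq:dq + W]
    return out / (2 * t + 1) ** 2            # window average

rng = np.random.default_rng(0)
features = rng.standard_normal((3, 4, 50))   # toy features: C=3, H=4, W=50
weight_map = initial_weight_map(features, t=2)
```

With `t=0` the window shrinks to a single position and the map reduces to the plain channel-wise $l_1$-norm.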

3.3 Reconstruction

When the initial weight maps $F_1^{i,*}$ and $F_2^{i,*}$ have been calculated by ZCA and the $l_1$-norm, the upsampling and soft-max operations are applied to obtain the final weight maps $w_1^i$ and $w_2^i$, as shown in Fig.5.

Figure 5: Resize and soft-max operation.

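Firstly, the bicubic interpolation provided by Matlab is used to resize the initial weight maps to the source image size.

Then the final weight maps $w_k^i$ are obtained from the resized initial weight maps $F_k^{i,*}$ by Eq.6,

$$w_k^i(x, y) = \frac{F_k^{i,*}(x, y)}{F_1^{i,*}(x, y) + F_2^{i,*}(x, y)} \tag{6}$$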

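Finally, the fused image $Fused$ is reconstructed from the final weight maps $w_k^i$ and the source images $Source_k$ using Eq.7,

$$Fused(x, y) = \sum_{k=1}^{2} w_k^i(x, y) \, Source_k(x, y) \tag{7}$$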
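The reconstruction stage can be sketched as below. The sketch makes several assumptions that should be read as such: the soft-max step is taken to be a normalisation of each initial weight map by the sum of both maps, a nearest-neighbour resize stands in for the bicubic interpolation used in the paper, positions where both maps are zero receive equal weights, and the names `resize_nearest` and `fuse` are hypothetical.

```python
import numpy as np

def resize_nearest(m, shape):
    """Nearest-neighbour resize (a stand-in for bicubic interpolation)."""
    rows = np.arange(shape[0]) * m.shape[0] // shape[0]
    cols = np.arange(shape[1]) * m.shape[1] // shape[1]
    return m[np.ix_(rows, cols)]

def fuse(src1, src2, m1, m2):
    """Weighted-average fusion of two source images.

    m1, m2 are non-negative initial weight maps at feature resolution.
    """
    m1 = resize_nearest(m1, src1.shape)
    m2 = resize_nearest(m2, src2.shape)
    total = m1 + m2
    safe = np.where(total > 0, total, 1.0)        # avoid division by zero
    w1 = np.where(total > 0, m1 / safe, 0.5)      # equal weights if both are 0
    w2 = 1.0 - w1
    return w1 * src1 + w2 * src2, w1, w2

src1 = np.full((8, 8), 10.0)                      # toy "infrared" image
src2 = np.zeros((8, 8))                           # toy "visible" image
m1 = np.array([[1.0, 3.0], [0.0, 2.0]])
m2 = np.array([[1.0, 1.0], [0.0, 2.0]])
fused, w1, w2 = fuse(src1, src2, m1, m2)
```

Because the two weights sum to one at every pixel, each fused pixel is a convex combination of the corresponding source pixels.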


4 Experiments and Analysis

In this section, the source images and experimental environment are introduced first. Secondly, the effect of different networks and norms in our method is discussed. Then the influence of the ZCA operation is analyzed. Finally, the proposed algorithm is evaluated using subjective and objective criteria. We choose several existing state-of-the-art methods to compare with our fusion method.

4.1 Experimental Settings

We collect 21 pairs of source infrared and visible images from [28] and [29]. Our source images are available at [30]. Samples of these source images are shown in Fig.6.

Figure 6: Four pairs of source images. The top row contains infrared images, and the second row contains visible images.

In our experiment, DeepFuse [16] is implemented with TensorFlow on a GTX 1080Ti with 64 GB RAM. The other fusion algorithms are implemented in MATLAB R2017b on a 3.2 GHz Intel(R) Core(TM) CPU with 12 GB RAM. The details of our experiment are introduced in the next sections.

4.2 The effect of different networks and norms

In this section, we choose different networks (VGG19 [18], ResNet50 [19] and ResNet101 [19]) and different norms ($l_1$-norm, $l_2$-norm and nuclear-norm [31]) to evaluate our fusion framework.

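When the nuclear-norm is utilized in our framework, Eq.5 is rewritten as Eq.8,

$$F_k^{i,*}(x, y) = \| R(\hat{F}_k^{i,1:C}[(x-t):(x+t), (y-t):(y+t)]) \|_* \tag{8}$$

where $R(\cdot)$ indicates the reshape operation, $\| \cdot \|_*$ is the nuclear-norm and $t$ sets the window radius. The reshape and nuclear-norm operations are shown in Fig.7.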

Figure 7: The procedure of reshape and nuclear-norm operation.
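The reshape and nuclear-norm procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: the local window across all channels is reshaped into a matrix with one row per window position and one column per channel, the border is zero-padded, the radius `t=2` is an illustrative default, and the name `nuclear_weight_map` is hypothetical.

```python
import numpy as np

def nuclear_weight_map(F_hat, t=2):
    """Local nuclear-norm map of processed features F_hat of shape (C, H, W)."""
    C, H, W = F_hat.shape
    padded = np.pad(F_hat, ((0, 0), (t, t), (t, t)), mode="constant")
    out = np.empty((H, W))
    for x in range(H):
        for y in range(W):
            # (C, 2t+1, 2t+1) window centred at (x, y)
            window = padded[:, x:x + 2 * t + 1, y:y + 2 * t + 1]
            # reshape to ((2t+1)^2, C): assumed layout of the reshape step
            M = window.reshape(C, -1).T
            out[x, y] = np.linalg.norm(M, ord="nuc")  # sum of singular values
    return out

rng = np.random.default_rng(0)
F_hat = rng.standard_normal((3, 6, 6))     # toy processed features
nuc_map = nuclear_weight_map(F_hat, t=2)
```

With `t=0` each window is a single row vector, whose only singular value is its Euclidean length, so the map then coincides with the channel-wise $l_2$-norm.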

Five quality metrics are utilized to assess the performance. These are: $FMI_{pixel}$ [32], $FMI_{dct}$ [32] and $FMI_w$ [32], which calculate the feature mutual information (FMI) for the pixel, discrete cosine and wavelet features, respectively; $N_{abf}$ [33], which denotes the rate of noise or artifacts added to the fused image by the fusion process; and the modified structural similarity ($SSIM_a$) [17].

The performance improves as the values of $FMI_{pixel}$, $FMI_{dct}$, $FMI_w$ and $SSIM_a$ increase. In contrast, the fusion performance is better when the value of $N_{abf}$ is small, which means the fused images contain less artificial information and noise.

We calculate the average quality metric values over the 21 pairs of source images. In VGG19, the outputs of four layers (relu1_1, relu2_1, relu3_1, relu4_1) are used. In ResNet50 and ResNet101, we choose four convolutional blocks (Conv2, Conv3, Conv4, Conv5). These values are shown in Tables 1, 2 and 3.

| Network | Layer | $FMI_{pixel}$ | $FMI_{dct}$ | $FMI_w$ | $N_{abf}$ | $SSIM_a$ |
|---|---|---|---|---|---|---|
| VGG19 | relu1_1 | 0.90120 | 0.36208 | 0.35872 | 0.11696 | 0.73833 |
| | relu2_1 | 0.91030 | 0.39170 | 0.40032 | 0.04227 | 0.76469 |
| | relu3_1 | 0.91122 | 0.39399 | 0.40948 | 0.01309 | 0.77326 |
| | relu4_1 | 0.91057 | 0.39666 | 0.41306 | 0.00397 | 0.77613 |
| ResNet50 | Conv2 | 0.91257 | 0.39545 | 0.41126 | 0.01495 | 0.77251 |
| | Conv3 | 0.91156 | 0.39651 | 0.41442 | 0.00468 | 0.77561 |
| | Conv4 | 0.91093 | 0.40296 | 0.41652 | 0.00131 | 0.77749 |
| | Conv5 | 0.90921 | 0.40577 | 0.41689 | 0.00062 | 0.77825 |
| ResNet101 | Conv2 | 0.91255 | 0.39472 | 0.41089 | 0.01599 | 0.77215 |
| | Conv3 | 0.91135 | 0.39589 | 0.41383 | 0.00510 | 0.77544 |
| | Conv4 | 0.90961 | 0.40386 | 0.41643 | 0.00091 | 0.77791 |
| | Conv5 | 0.90934 | 0.40605 | 0.41706 | 0.00062 | 0.77821 |

Table 1: Quality metric values. Our fusion framework uses the $l_1$-norm and different networks.

| Network | Layer | $FMI_{pixel}$ | $FMI_{dct}$ | $FMI_w$ | $N_{abf}$ | $SSIM_a$ |
|---|---|---|---|---|---|---|
| VGG19 | relu1_1 | 0.90191 | 0.36500 | 0.36172 | 0.11063 | 0.74113 |
| | relu2_1 | 0.91042 | 0.39217 | 0.40078 | 0.04053 | 0.76529 |
| | relu3_1 | 0.91118 | 0.39433 | 0.40962 | 0.01272 | 0.77344 |
| | relu4_1 | 0.91054 | 0.39696 | 0.41312 | 0.00381 | 0.77622 |
| ResNet50 | Conv2 | 0.91255 | 0.39522 | 0.41088 | 0.01487 | 0.77263 |
| | Conv3 | 0.91146 | 0.39635 | 0.41430 | 0.00470 | 0.77562 |
| | Conv4 | 0.91087 | 0.40265 | 0.41645 | 0.00134 | 0.77746 |
| | Conv5 | 0.90925 | 0.40544 | 0.41654 | 0.00064 | 0.77825 |
| ResNet101 | Conv2 | 0.91254 | 0.39468 | 0.41059 | 0.01576 | 0.77230 |
| | Conv3 | 0.91126 | 0.39584 | 0.41366 | 0.00511 | 0.77547 |
| | Conv4 | 0.90965 | 0.40385 | 0.41643 | 0.00091 | 0.77792 |
| | Conv5 | 0.90932 | 0.40561 | 0.41666 | 0.00064 | 0.77825 |

Table 2: Quality metric values. Our fusion framework uses the $l_2$-norm and different networks.

| Network | Layer | $FMI_{pixel}$ | $FMI_{dct}$ | $FMI_w$ | $N_{abf}$ | $SSIM_a$ |
|---|---|---|---|---|---|---|
| VGG19 | relu1_1 | 0.90505 | 0.37650 | 0.37536 | 0.07989 | 0.75391 |
| | relu2_1 | 0.91040 | 0.39454 | 0.40288 | 0.03176 | 0.76845 |
| | relu3_1 | 0.91092 | 0.39546 | 0.41017 | 0.01125 | 0.77403 |
| | relu4_1 | 0.91045 | 0.39758 | 0.41338 | 0.00349 | 0.77637 |
| ResNet50 | Conv2 | 0.91274 | 0.39659 | 0.41178 | 0.01320 | 0.77329 |
| | Conv3 | 0.91177 | 0.39712 | 0.41483 | 0.00439 | 0.77566 |
| | Conv4 | 0.91110 | 0.40238 | 0.41673 | 0.00141 | 0.77726 |
| | Conv5 | 0.90932 | 0.40509 | 0.41648 | 0.00067 | 0.77817 |
| ResNet101 | Conv2 | 0.91266 | 0.39621 | 0.41166 | 0.01359 | 0.77304 |
| | Conv3 | 0.91148 | 0.39651 | 0.41410 | 0.00470 | 0.77557 |
| | Conv4 | 0.90971 | 0.40351 | 0.41641 | 0.00095 | 0.77784 |
| | Conv5 | 0.90943 | 0.40537 | 0.41655 | 0.00065 | 0.77820 |

Table 3: Quality metric values. Our fusion framework uses the nuclear-norm and different networks.

The best values are indicated in bold, and the second-best values are indicated in red font. As we can see, the ResNets (50/101) obtain all the best and second-best values for every norm. This means that ResNet achieves better fusion performance than VGG19 in our fusion framework.

Comparing ResNet50 with ResNet101 in Tables 1, 2 and 3, the quality metric values are very close. Considering time efficiency, ResNet50 is utilized in our method.

| Norm | Layer | $FMI_{pixel}$ | $FMI_{dct}$ | $FMI_w$ | $N_{abf}$ | $SSIM_a$ |
|---|---|---|---|---|---|---|
| $l_1$-norm | Conv2 | 0.91257 | 0.39545 | 0.41126 | 0.01495 | 0.77251 |
| | Conv3 | 0.91156 | 0.39651 | 0.41442 | 0.00468 | 0.77561 |
| | Conv4 | 0.91093 | 0.40296 | 0.41652 | 0.00131 | 0.77749 |
| | Conv5 | 0.90921 | 0.40577 | 0.41689 | 0.00062 | 0.77825 |
| $l_2$-norm | Conv2 | 0.91255 | 0.39522 | 0.41088 | 0.01487 | 0.77263 |
| | Conv3 | 0.91146 | 0.39635 | 0.41430 | 0.00470 | 0.77562 |
| | Conv4 | 0.91087 | 0.40265 | 0.41645 | 0.00134 | 0.77746 |
| | Conv5 | 0.90925 | 0.40544 | 0.41654 | 0.00064 | 0.77825 |
| nuclear-norm | Conv2 | 0.91274 | 0.39659 | 0.41178 | 0.01320 | 0.77329 |
| | Conv3 | 0.91177 | 0.39712 | 0.41483 | 0.00439 | 0.77566 |
| | Conv4 | 0.91110 | 0.40238 | 0.41673 | 0.00141 | 0.77726 |
| | Conv5 | 0.90932 | 0.40509 | 0.41648 | 0.00067 | 0.77817 |

Table 4: Quality metric values. Our fusion framework uses ResNet50 and different norms.

In Table 4, we evaluate the effect of different norms with ResNet50 in our method. In Table 4, the $l_1$-norm obtains four best values and one second-best value. This means that, in our fusion framework, the $l_1$-norm performs better than the other norms.

4.3 The influence of ZCA operation

In this section, we analyze the influence of the ZCA operation on our method. We choose ResNet50 and three norms ($l_1$-norm, $l_2$-norm and nuclear-norm [31]) to evaluate the performance with and without ZCA.

Ten quality metrics are chosen. These metrics include: En (entropy), MI (mutual information), $Q_{abf}$ [34], $FMI_{pixel}$ [32], $FMI_{dct}$ [32], $FMI_w$ [32], $N_{abf}$ [33], SCD [35], $SSIM_a$ [17], and MS_SSIM [36]. The performance improves as the values of En, MI, $Q_{abf}$, $FMI_{pixel}$, $FMI_{dct}$, $FMI_w$, SCD, $SSIM_a$ and MS_SSIM increase. However, it is better when the value of $N_{abf}$ is small.

| Norm | Layer | En | MI | $Q_{abf}$ | $FMI_{pixel}$ | $FMI_{dct}$ | $FMI_w$ | $N_{abf}$ | SCD | $SSIM_a$ | MS_SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $l_1$-norm | Conv2 | 6.29026 | 12.58052 | 0.40154 | 0.91257 | 0.39545 | 0.41126 | 0.01495 | 1.64113 | 0.77251 | 0.87204 |
| | Conv3 | 6.28155 | 12.56309 | 0.39314 | 0.91156 | 0.39651 | 0.41442 | 0.00468 | 1.64235 | 0.77561 | 0.88102 |
| | Conv4 | 6.23540 | 12.47081 | 0.37254 | 0.91093 | 0.40296 | 0.41652 | 0.00131 | 1.63949 | 0.77749 | 0.87962 |
| | Conv5 | 6.19527 | 12.39054 | 0.35098 | 0.90921 | 0.40577 | 0.41689 | 0.00062 | 1.63358 | 0.77825 | 0.87324 |
| $l_2$-norm | Conv2 | 6.28650 | 12.57299 | 0.39898 | 0.91255 | 0.39522 | 0.41088 | 0.01487 | 1.63984 | 0.77263 | 0.87103 |
| | Conv3 | 6.28027 | 12.56054 | 0.39215 | 0.91146 | 0.39635 | 0.41430 | 0.00470 | 1.64175 | 0.77562 | 0.88049 |
| | Conv4 | 6.23730 | 12.47459 | 0.37283 | 0.91087 | 0.40265 | 0.41645 | 0.00134 | 1.63951 | 0.77746 | 0.87954 |
| | Conv5 | 6.19689 | 12.39377 | 0.35080 | 0.90925 | 0.40544 | 0.41654 | 0.00064 | 1.63394 | 0.77825 | 0.87312 |
| nuclear-norm | Conv2 | 6.28192 | 12.56384 | 0.39986 | 0.91274 | 0.39659 | 0.41178 | 0.01320 | 1.63936 | 0.77329 | 0.87286 |
| | Conv3 | 6.28654 | 12.57309 | 0.39473 | 0.91177 | 0.39712 | 0.41483 | 0.00439 | 1.64087 | 0.77566 | 0.88094 |
| | Conv4 | 6.25057 | 12.50114 | 0.37713 | 0.91110 | 0.40238 | 0.41673 | 0.00141 | 1.63960 | 0.77726 | 0.88047 |
| | Conv5 | 6.20433 | 12.40865 | 0.35311 | 0.90932 | 0.40509 | 0.41648 | 0.00067 | 1.63431 | 0.77817 | 0.87374 |

Table 5: Quality metric values. Our fusion framework uses ResNet50 and the ZCA operation.

| Norm | Layer | En | MI | $Q_{abf}$ | $FMI_{pixel}$ | $FMI_{dct}$ | $FMI_w$ | $N_{abf}$ | SCD | $SSIM_a$ | MS_SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $l_1$-norm | Conv2 | 6.17245 | 12.34490 | 0.34191 | 0.90884 | 0.40631 | 0.41678 | 0.00060 | 1.62953 | 0.77848 | 0.87014 |
| | Conv3 | 6.17108 | 12.34216 | 0.34168 | 0.90866 | 0.40652 | 0.41701 | 0.00058 | 1.62854 | 0.77847 | 0.86995 |
| | Conv4 | 6.17751 | 12.35501 | 0.34468 | 0.90908 | 0.40690 | 0.41739 | 0.00051 | 1.63178 | 0.77844 | 0.87154 |
| | Conv5 | 6.17760 | 12.35519 | 0.34506 | 0.90885 | 0.40669 | 0.41737 | 0.00052 | 1.63037 | 0.77844 | 0.87125 |
| $l_2$-norm | Conv2 | 6.17225 | 12.34451 | 0.34157 | 0.90883 | 0.40647 | 0.41697 | 0.00057 | 1.62945 | 0.77850 | 0.87021 |
| | Conv3 | 6.17078 | 12.34157 | 0.34125 | 0.90864 | 0.40641 | 0.41697 | 0.00057 | 1.62824 | 0.77847 | 0.86982 |
| | Conv4 | 6.17612 | 12.35224 | 0.34361 | 0.90898 | 0.40671 | 0.41729 | 0.00052 | 1.63121 | 0.77846 | 0.87110 |
| | Conv5 | 6.17539 | 12.35078 | 0.34462 | 0.90877 | 0.40692 | 0.41751 | 0.00049 | 1.62996 | 0.77845 | 0.87126 |
| nuclear-norm | Conv2 | 6.18154 | 12.36308 | 0.35226 | 0.90969 | 0.40701 | 0.41731 | 0.00064 | 1.63449 | 0.77832 | 0.87363 |
| | Conv3 | 6.19141 | 12.38282 | 0.35886 | 0.91029 | 0.40645 | 0.41721 | 0.00067 | 1.63843 | 0.77823 | 0.87651 |
| | Conv4 | 6.18871 | 12.37742 | 0.35321 | 0.90978 | 0.40635 | 0.41721 | 0.00057 | 1.63578 | 0.77830 | 0.87460 |
| | Conv5 | 6.18116 | 12.36232 | 0.34817 | 0.90909 | 0.40679 | 0.41746 | 0.00052 | 1.63205 | 0.77840 | 0.87243 |

Table 6: Quality metric values. Our fusion framework uses ResNet50 without the ZCA operation.

Table 5 and Table 6 show the quality values with and without ZCA, respectively. In Table 6, when ZCA is not used, ResNet50 with the nuclear-norm achieves the best values. This means that the low-rank property is more useful than the other norms for the original deep features. However, in Table 5, when we use ZCA to project the deep features into a sub-space, the $l_1$-norm obtains most of the best values, even compared with the nuclear-norm without ZCA. We think that ZCA projects the original data into a sparse space, and in this situation the sparse metric ($l_1$-norm) obtains better performance than the low-rank metric (nuclear-norm).

Based on the above observations, we choose ResNet50 to extract deep features in our fusion method, and the ZCA and $l_1$-norm operations are used to obtain the initial weight maps.

4.4 Subjective Evaluation

In the subjective and objective evaluation, we choose nine existing fusion methods to compare with our algorithm. These fusion methods are: cross bilateral filter fusion method (CBF) [37], discrete cosine harmonic wavelet transform (DCHWT) [33], joint sparse representation (JSR) [8], saliency detection in sparse domain (JSRSD) [38], gradient transfer and total variation minimization (GTF) [39], weighted least square optimization (WLS) [28], convolutional sparse representation (ConvSR) [14], a deep learning framework based on VGG19 and multi-layers (VggML) [17], and DeepFuse [16].

In our fusion method, the convolutional blocks (Conv4 and Conv5) are chosen to obtain the fused images. The fused images are shown in Fig.8. As an example, we evaluate the relative performance of the fusion methods only on a single pair of images (“street”).

Figure 8: Experiment on “street” images. (a) Infrared image; (b) Visible image; (c) CBF; (d) DCHWT; (e) JSR; (f) JSRSD; (g) GTF; (h) WLS; (i) ConvSR; (j) VggML; (k) DeepFuse; (l) Ours (Conv4); (m) Ours (Conv5).

In Fig.8(c-m), the fused images obtained by CBF and DCHWT contain more noise, and some salient features are not clear. JSR, JSRSD, GTF and WLS obtain better performance and less noise, but their fused images still contain artificial information near the salient features. In contrast, the deep learning-based fusion methods, such as ConvSR, VggML, DeepFuse and ours, contain more salient features and preserve more detail information, and the fused images look more natural. As there is no obvious visual difference between these deep learning-based methods and the proposed algorithm, we choose several objective quality metrics to assess the fusion performance in the next section.

4.5 Objective Evaluation

For the purpose of quantitative comparison between the proposed method and existing fusion methods, three quality metrics are utilized. These are: $FMI_{pixel}$ [32], $N_{abf}$ [33] and $SSIM_a$ [17].

In this section, we choose 8 pairs of images to evaluate the existing methods and our fusion algorithm. The fused images are shown in Fig.9, and the values of $FMI_{pixel}$, $N_{abf}$ and $SSIM_a$ are presented in Table 7.

Figure 9: Experiment on 8 pairs of images. (a) CBF; (b) DCHWT; (c) JSR; (d) JSRSD; (e) GTF; (f) WLS; (g) ConvSR; (h) VggML; (i) DeepFuse; (j) Ours (Conv4); (k) Ours (Conv5).

| Images | Metrics | CBF | DCHWT | JSR | JSRSD | GTF | WLS | ConvSR | VggML | DeepFuse | Ours (Conv4) | Ours (Conv5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fig.9 (1) | $FMI_{pixel}$ | 0.87010 | 0.89000 | 0.85281 | 0.83392 | 0.88393 | 0.87897 | 0.89724 | 0.88532 | 0.88226 | 0.88517 | 0.88229 |
| | $N_{abf}$ | 0.23167 | 0.05118 | 0.34153 | 0.34153 | 0.07027 | 0.14494 | 0.01494 | 0.00013 | 0.03697 | 0.00011 | 0.00008 |
| | $SSIM_a$ | 0.62376 | 0.74834 | 0.52715 | 0.52715 | 0.69181 | 0.72827 | 0.74954 | 0.77758 | 0.73314 | 0.77765 | 0.77773 |
| Fig.9 (2) | $FMI_{pixel}$ | 0.89441 | 0.92106 | 0.91004 | 0.90446 | 0.91526 | 0.91144 | 0.92269 | 0.91849 | 0.91763 | 0.92068 | 0.91851 |
| | $N_{abf}$ | 0.48700 | 0.21840 | 0.19749 | 0.19889 | 0.11237 | 0.16997 | 0.02199 | 0.00376 | 0.00262 | 0.00367 | 0.00222 |
| | $SSIM_a$ | 0.49861 | 0.64468 | 0.62399 | 0.62353 | 0.61109 | 0.66873 | 0.67474 | 0.68041 | 0.68092 | 0.68130 | 0.68039 |
| Fig.9 (3) | $FMI_{pixel}$ | 0.80863 | 0.86116 | 0.78628 | 0.77685 | 0.83492 | 0.85603 | 0.87225 | 0.85276 | 0.84882 | 0.85248 | 0.84916 |
| | $N_{abf}$ | 0.43257 | 0.07415 | 0.49804 | 0.49804 | 0.08501 | 0.19188 | 0.00991 | 0.00020 | 0.09275 | 0.00013 | 0.00009 |
| | $SSIM_a$ | 0.59632 | 0.80619 | 0.46767 | 0.46767 | 0.73386 | 0.77506 | 0.81383 | 0.84569 | 0.81415 | 0.84607 | 0.84622 |
| Fig.9 (4) | $FMI_{pixel}$ | 0.85685 | 0.87010 | 0.84809 | 0.84340 | 0.83589 | 0.84038 | 0.86855 | 0.86309 | 0.86206 | 0.86329 | 0.86198 |
| | $N_{abf}$ | 0.15233 | 0.05781 | 0.21640 | 0.21536 | 0.12329 | 0.23343 | 0.03404 | 0.00037 | 0.11997 | 0.00007 | 0.00004 |
| | $SSIM_a$ | 0.52360 | 0.57614 | 0.45422 | 0.45458 | 0.50273 | 0.55427 | 0.56129 | 0.61117 | 0.59249 | 0.61280 | 0.61306 |
| Fig.9 (5) | $FMI_{pixel}$ | 0.89101 | 0.92772 | 0.90630 | 0.88746 | 0.93516 | 0.90851 | 0.94036 | 0.93248 | 0.93174 | 0.93244 | 0.93154 |
| | $N_{abf}$ | 0.47632 | 0.10340 | 0.33225 | 0.32941 | 0.07322 | 0.20588 | 0.01022 | 0.00109 | 0.50900 | 0.00044 | 0.00033 |
| | $SSIM_a$ | 0.66486 | 0.83815 | 0.70211 | 0.70365 | 0.80499 | 0.82727 | 0.85822 | 0.87250 | 0.59381 | 0.87303 | 0.87310 |
| Fig.9 (6) | $FMI_{pixel}$ | 0.89679 | 0.93556 | 0.91893 | 0.89321 | 0.93693 | 0.92803 | 0.94206 | 0.93491 | 0.93349 | 0.93889 | 0.93481 |
| | $N_{abf}$ | 0.25544 | 0.07260 | 0.32488 | 0.32502 | 0.03647 | 0.22335 | 0.01545 | 0.00058 | 0.00948 | 0.00055 | 0.00023 |
| | $SSIM_a$ | 0.64975 | 0.75453 | 0.58298 | 0.58333 | 0.70077 | 0.72693 | 0.76111 | 0.78692 | 0.72572 | 0.78709 | 0.78743 |
| Fig.9 (7) | $FMI_{pixel}$ | 0.82883 | 0.92680 | 0.91361 | 0.89055 | 0.92438 | 0.90933 | 0.93853 | 0.93161 | 0.93066 | 0.93309 | 0.93024 |
| | $N_{abf}$ | 0.52887 | 0.19714 | 0.33544 | 0.33720 | 0.03276 | 0.31160 | 0.01561 | 0.00122 | 0.29958 | 0.00120 | 0.00069 |
| | $SSIM_a$ | 0.50982 | 0.72735 | 0.60153 | 0.60078 | 0.69419 | 0.72919 | 0.77048 | 0.78256 | 0.73096 | 0.78201 | 0.78285 |
| Fig.9 (8) | $FMI_{pixel}$ | 0.87393 | 0.94915 | 0.93052 | 0.90994 | 0.91430 | 0.92903 | 0.94804 | 0.94660 | 0.92682 | 0.94501 | 0.94443 |
| | $N_{abf}$ | 0.25892 | 0.24507 | 0.16588 | 0.16541 | 0.09293 | 0.18401 | 0.02574 | 0.00203 | 0.00175 | 0.00706 | 0.00130 |
| | $SSIM_a$ | 0.53005 | 0.62304 | 0.57422 | 0.57412 | 0.62966 | 0.67908 | 0.70304 | 0.72860 | 0.71540 | 0.72708 | 0.72864 |

Table 7: The values of $FMI_{pixel}$ [32], $N_{abf}$ [33] and $SSIM_a$ for 8 pairs of images.

From Fig.9 and Table 7, our method achieves better fusion performance in subjective and objective evaluation. In Table 7, the best values are indicated in bold. Compared with the other existing fusion methods, our algorithm obtains almost all the best values in $N_{abf}$ and $SSIM_a$, which indicates that the fused images obtained by our method contain less noise and preserve more structural information from the source images. The advantage of our algorithm is more obvious when $N_{abf}$ is used to assess the fused images.

Although the $FMI_{pixel}$ values of our fused images are not the best, they are still very close to the best ones, which is acceptable given the improvement in fusion performance in terms of $N_{abf}$ and $SSIM_a$.

5 Conclusions

In this article we have proposed a novel fusion algorithm based on ResNet50 and the ZCA operation for infrared and visible image fusion. Firstly, the source images are directly fed into the ResNet50 network to obtain the deep features. Following this, the ZCA operation, which is also called whitening, is used to project the original deep features into a sparse subspace. The local average $l_1$-norm is utilized to obtain the initial weight maps. Then bicubic interpolation is used to resize the initial weight maps to the source image size. A soft-max operation is used to obtain the final weight maps. Finally, the fused image is reconstructed by a weighted-average strategy which combines the final weight maps and the source images. Experimental results show that the proposed fusion method achieves better fusion performance in both objective and subjective evaluation.

References



  • (1) S. Li, X. Kang, L. Fang, J. Hu, and H. Yin, “Pixel-level image fusion: A survey of the state of the art,” Inf. Fusion, vol. 33, pp. 100–112, 2017.
  • (2) A. Ben Hamza, Y. He, H. Krim, and A. Willsky, “A Multiscale Approach to Pixel-level Image Fusion,” Integr. Comput. Aided. Eng., vol. 12, pp. 135–146, 2005.
  • (3) S. Yang, M. Wang, L. Jiao, R. Wu, and Z. Wang, “Image fusion based on a new contourlet packet,” Inf. Fusion, vol. 11, no. 2, pp. 78–84, 2010.
  • (4) L. Wang, B. Li, and L. F. Tian, “EGGDD: An explicit dependency model for multi-modal medical image fusion in shift-invariant shearlet transform domain,” Inf. Fusion, vol. 19, no. 1, pp. 29–37, 2014.
  • (5) H. Pang, M. Zhu, and L. Guo, “Multifocus color image fusion using quaternion wavelet transform,” 2012 5th Int. Congr. Image Signal Process. CISP 2012, 2012.
  • (6) D. P. Bavirisetti and R. Dhuli, “Two-scale image fusion of visible and infrared images using saliency detection,” Infrared Phys. Technol., vol. 76, pp. 52–64, 2016.
  • (7) J. jing Zong and T. shuang Qiu, “Medical image fusion based on sparse representation of classified image patches,” Biomed. Signal Process. Control, vol. 34, pp. 195–205, 2017.
  • (8) Q. Zhang, Y. Fu, H. Li, and J. Zou, “Dictionary learning method for joint sparse representation-based image fusion,” Opt. Eng., vol. 52, no. 5, p. 057006, 2013.
  • (9) R. Gao, S. A. Vorobyov, and H. Zhao, “Image fusion with cosparse analysis operator,” IEEE Signal Process. Lett., vol. 24, no. 7, pp. 943–947, 2017.
  • (10) X. Lu, B. Zhang, Y. Zhao, H. Liu, and H. Pei, “The infrared and visible image fusion algorithm based on target separation and sparse representation,” Infrared Phys. Technol., vol. 67, pp. 397–407, 2014.
  • (11) M. Yin, P. Duan, W. Liu, and X. Liang, “A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation,” Neurocomputing, vol. 226, no. November 2016, pp. 182–191, 2017.
  • (12) H. Li and X.-J. Wu, “Multi-focus Image Fusion using dictionary learning and Low-Rank Representation,” in Image and Graphics. ICIG 2017. Lecture Notes in Computer Science, vol 10666. Springer, Cham., 2017, pp. 675–686.
  • (13) H. Li and X.-J. Wu, “Infrared and visible image fusion using Latent Low-Rank Representation,” arXiv Prepr. arXiv1804.08992, 2018.
  • (14) Y. Liu, X. Chen, R. K. Ward, and J. Wang, “Image Fusion with Convolutional Sparse Representation,” IEEE Signal Process. Lett., vol. 23, no. 12, pp. 1882–1886, 2016.
  • (15) Y. Liu, X. Chen, H. Peng, and Z. Wang, “Multi-focus image fusion with a deep convolutional neural network,” Inf. Fusion, vol. 36, pp. 191–207, 2017.
  • (16) K. R. Prabhakar, V. S. Srikar, and R. V. Babu, “DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs,” Proc. IEEE Int. Conf. Comput. Vis., pp. 4724–4732, 2017.
  • (17) H. Li, X.-J. Wu, and J. Kittler, “Infrared and Visible Image Fusion using a Deep Learning Framework,” arXiv preprint arXiv:1804.06992, 2018.
  • (18) K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in ICLR 2015, 2015.
  • (19) K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.
  • (20) A. Kessy, A. Lewin, and K. Strimmer, “Optimal Whitening and Decorrelation,” Am. Stat., vol. 72, no. 4, pp. 309–314, 2018.
  • (21) J. Cai, S. Gu, and L. Zhang, “Learning a deep single image contrast enhancer from multi-exposure images,” IEEE Trans. Image Process., vol. 27, no. 4, pp. 2049–2062, 2018.
  • (22) H. Zhang and K. Dana, “Multi-style Generative Network for Real-time Transfer,” arXiv preprint arXiv:1703.06953, 2017.
  • (23) X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia, “Detail-Revealing Deep Video Super-Resolution,” Proc. IEEE Int. Conf. Comput. Vis., pp. 4482–4490, 2017.
  • (24) Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal Style Transfer via Feature Transforms,” in Advances in Neural Information Processing Systems, 2017, pp. 385–395.
  • (25) L. Sheng, Z. Lin, J. Shao, and X. Wang, “Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8242–8250.
  • (26) T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, Cham, 2014, pp. 740–755.
  • (27) O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
  • (28) J. Ma, Z. Zhou, B. Wang, and H. Zong, “Infrared and visible image fusion based on visual saliency map and weighted least square optimization,” Infrared Phys. Technol., vol. 82, pp. 8–17, 2017.
  • (29) A. Toet et al., “TNO Image Fusion Dataset,” 2014.
  • (30) H. Li, Https:// 2018.
  • (31) G. Liu, Z. Lin, and Y. Yu, “Robust Subspace Segmentation by Low-Rank Representation,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 663–670.
  • (32) M. Haghighat and M. A. Razian, “Fast-FMI: Non-reference image fusion metric,” in 8th IEEE International Conference on Application of Information and Communication Technologies, AICT 2014 - Conference Proceedings, 2014.
  • (33) B. K. Shreyamsha Kumar, “Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform,” Signal, Image Video Process., vol. 7, no. 6, pp. 1125–1143, 2013.
  • (34) C. S. Xydeas and V. Petrović, “Objective image fusion performance measure,” Electron. Lett., 2000.
  • (35) V. Aslantas and E. Bendes, “A new image quality metric for image fusion: The sum of the correlations of differences,” AEU - Int. J. Electron. Commun., vol. 69, no. 12, pp. 1890–1896, 2015.
  • (36) K. Ma, K. Zeng, and Z. Wang, “Perceptual Quality Assessment for Multi-Exposure Image Fusion,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3345–3356, 2015.
  • (37) B. K. Shreyamsha Kumar, “Image fusion based on pixel significance using cross bilateral filter,” Signal, Image Video Process., 2015.
  • (38) C. H. Liu, Y. Qi, and W. R. Ding, “Infrared and visible image fusion method based on saliency detection in sparse domain,” Infrared Phys. Technol., vol. 83, pp. 94–102, 2017.
  • (39) J. Ma, C. Chen, C. Li, and J. Huang, “Infrared and visible image fusion via gradient transfer and total variation minimization,” Inf. Fusion, vol. 31, pp. 100–109, 2016.