Infrared and visible image fusion is a frequently occurring requirement in image fusion, and the fusion methods for this task are widely used in many applications. These algorithms combine the salient features of the source images into a single image1. The fused images are utilized in several computer vision tasks.
The extraction and processing of features are key tasks in infrared and visible image fusion, and the fusion performance is directly affected by the choice of features and processing methods.
For decades, signal processing algorithms2 3 4 5 were the most popular feature extraction tools in image fusion tasks. In 2016, a two-scale decomposition and saliency detection-based fusion method was proposed by Bavirisetti et al.6. The base layers and detail layers were extracted by a mean filter and a median filter, respectively. Visual salient features were used to obtain weight maps. Then the fused image was reconstructed by combining these three parts.
In recent years, representation learning-based fusion methods have attracted great attention and exhibited state-of-the-art fusion performance.
In the sparse representation (SR) domain, Zong et al.7 proposed a novel medical image fusion method based on SR. In their paper, the sub-dictionaries are learned from Histogram of Oriented Gradients (HOG) features. Then the $\ell_1$-norm and a max selection strategy are used to reconstruct the fused image. In addition to this approach, joint sparse representation8, cosparse representation9, the pulse coupled neural network (PCNN)10 and the shearlet transform11 are also applied to image fusion in combination with SR.
In another representation learning domain, low-rank representation (LRR) was applied to image fusion tasks for the first time by Li et al.12. In 12, they use HOG features and a dictionary learning method to obtain a global dictionary. The dictionary is then used in LRR, and the fused low-rank coefficients are obtained using the $\ell_1$-norm and a choose-max strategy. Finally, the fused image is reconstructed using the global dictionary and LRR. For infrared and visible image fusion, Li et al.13 also proposed an effective and simple algorithm based on latent low-rank representation (LatLRR). Here the source images are decomposed into low-frequency and high-frequency coefficients by LatLRR, and the fused image is reconstructed using a weighted-averaging strategy.
Although these representation learning-based methods exhibit good fusion performance, they still have two main drawbacks: 1) it is very difficult to learn a dictionary offline for representation learning-based methods; 2) the time efficiency of representation learning-based methods is very low, especially when online dictionary learning methods are used in the fusion algorithms. Recent work has therefore sought to improve fusion algorithms in two respects: time efficiency and fusion performance.
In the last two years, deep learning has been applied to image fusion tasks, and has been shown to achieve better fusion performance and time efficiency than non-deep learning-based methods. Most deep learning-based fusion methods simply treat deep learning as a feature extraction operation and use deep features, obtained from a fixed network, to reconstruct the fused image. In 14, a convolutional sparse representation (CSR)-based fusion method was proposed by Liu et al. CSR is used to extract features, which are obtained with different dictionaries. In addition, Liu et al.15 also proposed an algorithm based on a convolutional neural network (CNN). Image patches which contain different blur versions of the input image are used to train the network, and a decision map is obtained. Finally, the fused image is obtained from the decision map and the source images. The obvious drawback of these two methods is that they are only suitable for the multi-focus image fusion task.
In ICCV 2017, Prabhakar et al.16 proposed a simple and efficient CNN-based method for the exposure fusion problem. In their method, the encoding network has a siamese architecture in which the weights are tied. The input images are encoded by this network, yielding two feature map sequences, which are fused by an addition strategy. The final fused image is reconstructed by a decoding network which contains three CNN layers. This network is not only suitable for the exposure fusion problem; it also achieves good performance in other fusion tasks. However, the architecture is very simple, and the information contained in a deep network may not be fully utilized.
In 17, Li et al. proposed a fusion method which uses a deeper network and multi-layer deep features. Firstly, the source images are decomposed into base parts and detail content. The base parts are fused by a weighted-averaging strategy, and a fixed VGG-19 network, trained on ImageNet, is used to extract multi-layer deep features from the detail content. Initial weight maps are then calculated from the multi-layer deep features by a soft-max operator, and several candidate fused detail contents are obtained from the initial weight maps. A choose-max strategy is used to construct the final weight maps for the detail content, and these final weight maps are utilized to obtain the fused detail content. Finally, the fused image is reconstructed by combining the fused base part and the fused detail content.
Although middle-layer information is used by the VGG-based fusion method17, the multi-layer combination method is still too simple, and much useful information is lost during feature extraction. This phenomenon gets worse as the network gets deeper.
To solve these problems, we propose a fusion method that fully utilizes and processes the deep features. In this paper, a novel fusion algorithm based on the residual network (ResNet)19 and zero-phase component analysis (ZCA)20 is proposed for the infrared and visible image fusion task. Firstly, the source images are fed into a fixed ResNet to obtain the deep features. Due to the architecture of ResNet, the deep features already contain multi-layer information, so we simply use the output of a single layer. Then the ZCA operation is utilized to project the deep features into a sparse domain, and initial weight maps are obtained by the $\ell_1$-norm. We use bicubic interpolation to resize the initial weight maps to the source image size, and the final weight maps are obtained by a soft-max operation. Finally, the fused image is reconstructed from the final weight maps and the source images.
2 Related work
Deep residual network (ResNet). In CVPR 2016, He et al.19 proposed a novel network architecture to address the degradation problem. With the shortcut connections and residual representations, their nets were easier to optimize than previous networks and offered better accuracy by increasing the depth. The residual block architecture is shown in Fig.1.
$X$ indicates the input of the net block. With this structure, multi-layer information is utilized. Furthermore, in image reconstruction tasks, performance improves with the use of the residual block. We also use this architecture in our fusion method.
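The shortcut pattern of Fig.1 can be sketched as follows. This is a toy NumPy illustration with a placeholder residual mapping (a single linear layer plus ReLU), not ResNet's actual convolutional layers:

```python
import numpy as np

def residual_block(x, weights):
    """Sketch of the identity-shortcut residual block of Fig.1.

    The block learns a residual mapping F(x) and adds the input back,
    so the output is F(x) + x.  'weights' and the single linear layer
    are illustrative placeholders, not ResNet's conv layers.
    """
    def F(v):
        # the residual mapping: linear layer followed by ReLU
        return np.maximum(weights @ v, 0.0)
    return F(x) + x   # shortcut connection

x = np.array([1.0, 2.0])
W = np.zeros((2, 2))          # zero residual mapping
out = residual_block(x, W)
# with F(x) == 0 the block reduces to the identity, which is what
# makes very deep residual nets easy to optimize
```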
Zero-phase component analysis (ZCA). In 20, Kessy et al. analyzed whitening and decorrelation via the ZCA operation. The ZCA operation is used to project a random vector into an uncorrelated sub-space, a procedure also known as whitening. In the image processing field, ZCA is a very useful tool for processing features, yielding more useful features that improve algorithm performance. We briefly introduce the ZCA operation.
Let $X$ indicate the $d$-dimensional random vector and $\mu$ represent its mean values. The covariance matrix $Co$ is calculated by $Co = (X-\mu)(X-\mu)^T$. Then the Singular Value Decomposition (SVD) is utilized to decompose $Co$, as shown in Eq.1,

$$Co = U \Sigma V^T \quad (1)$$

Finally, the new random vector $\hat{X}$ is calculated by Eq.2,

$$\hat{X} = U(\Sigma + \epsilon I)^{-\frac{1}{2}} U^T (X - \mu) \quad (2)$$

where $I$ denotes the identity matrix, and $\epsilon$ is a small value avoiding a bad matrix inversion.
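A minimal NumPy sketch of Eq.1 and Eq.2 (the function name, the regularizer value and the sample data are our own illustration):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the columns of the d x n data matrix X.

    Follows Eq.1-2: SVD of the covariance, then rescale by the inverse
    square root of the singular values; eps is the small value that
    avoids a bad matrix inversion.
    """
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    Co = Xc @ Xc.T / Xc.shape[1]            # covariance matrix
    U, S, Vt = np.linalg.svd(Co)            # Co = U Sigma V^T   (Eq.1)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return W @ Xc                           # whitened vector    (Eq.2)

# channels with very different magnitudes, as in deep features
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500)) * np.array([[10.0], [1.0], [0.1]])
Xw = zca_whiten(X)
# after whitening, the sample covariance of Xw is (close to) the identity
```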
ZCA utilization in image style transfer. Recently, ZCA has also been utilized in the image style transfer task, which is one of the most popular tasks in the image processing field.
Li et al.24 proposed a universal style transfer algorithm which uses the ZCA operation to transfer the style of an artistic image onto a content image. An encoder network is used to obtain the style features and the content features. The authors then use the ZCA operation to project the style and content features into the same space. The final transferred features are obtained by a coloring transform, which is the reverse of the ZCA operation. Finally, the styled image is obtained from the transferred features by a decoder network.
In addition, in CVPR 2018, Sheng et al.25 also used the ZCA operation in their style transfer method. A VGG network is utilized to extract image features, and ZCA is used to project the features into the same space. Then the transferred features are obtained by a patch-based reassembling operation. Finally, a decoder network, trained on the MS-COCO26 dataset, is utilized to reconstruct the styled image from the transferred features.
As the above style transfer methods show, the ZCA operation is a powerful tool for processing image features, especially in image reconstruction tasks. The ZCA operation projects image features into a sub-space in which they are easier to classify and reconstruct. Inspired by these methods, we also apply the ZCA operation to the image fusion task.
3 The Proposed Fusion Method
In this section, the proposed fusion method is introduced in detail.
Assume there are $K$ preregistered source images; in our paper, $K = 2$. Note that the fusion strategy is the same for $K > 2$. The source images are represented as $I_k$, $k \in \{1, 2\}$. The framework of the proposed fusion method is shown in Fig.2.
As shown in Fig.2, the source images are indicated as $I_1$ and $I_2$, and ResNet50 contains 50 weight layers which include 5 convolutional blocks (conv1, conv2, conv3, conv4, conv5). ResNet50 is a fixed network trained on ImageNet27, and we use it to extract the deep features. The output of the $i$-th block is denoted by the deep features $\Phi_i$, which contain $C$ channels. We use ZCA and the $\ell_1$-norm to process $\Phi_i$, and the initial weight maps are obtained by these operations. Then the final weight maps are obtained by resizing (bicubic interpolation) and soft-max. Finally, the fused image is reconstructed using a weighted-average strategy. In our paper we choose conv4 and conv5 to evaluate our fusion framework.
3.1 ZCA operation for deep features
As discussed earlier, ZCA projects the original features into the same space, where they become more useful for subsequent processing. The ZCA operation for deep features is shown in Fig.3.
In Fig.3, we choose the output of the conv2 layer, which contains 3 residual blocks, as an example to illustrate the influence of the ZCA operation. Each block in the figure indicates one channel of the output. The original deep features have different orders of magnitude in each channel. We use ZCA to project the original features into the same space, after which the features become more significant, as shown in Fig.3 (ZCA feature).
3.2 ZCA and $\ell_1$-norm operations
After the deep features $\Phi_i$ are obtained, we use ZCA to process them. From the processed features $\hat{\Phi}_i$, we utilize the $\ell_1$-norm to calculate the initial weight maps $\hat{W}_i$. The procedure of the ZCA and $\ell_1$-norm operations is shown in Fig.4.
$\Phi_i = \{\phi_i^c\}$, $c \in \{1, 2, \cdots, C\}$, indicates the deep features obtained by the $i$-th convolutional block, which contains $C$ channels. In the ZCA operation, the covariance matrix and its decomposition are calculated by Eq.3,

$$Co^c = \phi_i^c (\phi_i^c)^T = U \Sigma V^T \quad (3)$$

where $c$ denotes the index of the channel in the deep features.
Then we use Eq.4 to obtain the processed features $\hat{\Phi}_i$, which are combined from the $\hat{\phi}_i^c$,

$$\hat{\phi}_i^c = U(\Sigma + \epsilon I)^{-\frac{1}{2}} U^T \phi_i^c \quad (4)$$
After we obtain the processed features $\hat{\Phi}_i$, we utilize the local $\ell_1$-norm and an average operation to calculate the initial weight maps $\hat{W}_i$ using Eq.5,

$$\hat{W}_i(x,y) = \frac{1}{(2r+1)^2} \sum_{a=-r}^{r} \sum_{b=-r}^{r} \left\| \hat{\Phi}_i(x+a, y+b) \right\|_1 \quad (5)$$

As shown in Fig.4, we choose a $(2r+1)\times(2r+1)$ window centered at position $(x,y)$ to calculate the average $\ell_1$-norm.
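The local average $\ell_1$-norm step can be sketched in NumPy as follows; the function name and the window radius `r` are illustrative placeholders (the paper's exact radius is not restated here):

```python
import numpy as np

def initial_weight_map(phi_hat, r=1):
    """Initial weight map from whitened deep features (Eq.5 sketch).

    phi_hat: whitened deep features of shape (C, H, W).
    r: window radius -- an illustrative placeholder value.
    Returns the (H, W) map of window-averaged l1-norms.
    """
    l1 = np.abs(phi_hat).sum(axis=0)       # ||.||_1 over the C channels
    padded = np.pad(l1, r, mode="edge")    # replicate borders for the window
    H, W = l1.shape
    out = np.zeros_like(l1)
    for a in range(2 * r + 1):             # accumulate the (2r+1)^2 window
        for b in range(2 * r + 1):
            out += padded[a:a + H, b:b + W]
    return out / (2 * r + 1) ** 2          # average l1-norm

feat = np.ones((4, 5, 5))      # toy "deep features": four constant channels
wmap = initial_weight_map(feat, r=1)
# every position averages identical l1 values (|1|*4), so the map is 4.0
```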
When the initial weight maps $\hat{W}^1$ and $\hat{W}^2$ for the two source images have been calculated by the ZCA and $\ell_1$-norm operations, upsampling and soft-max operations are applied to obtain the final weight maps $W^1$ and $W^2$, as shown in Fig.5.
Firstly, bicubic interpolation, as provided by Matlab, is used to resize the initial weight maps to the source image size.
Then the final weight maps $W^k$ are obtained by the soft-max operation in Eq.6,

$$W^k(x,y) = \frac{\hat{W}^k(x,y)}{\sum_{n=1}^{K} \hat{W}^n(x,y)}, \quad k \in \{1, \cdots, K\} \quad (6)$$

Finally, the fused image is reconstructed using Eq.7,

$$F(x,y) = \sum_{k=1}^{K} W^k(x,y) \times I_k(x,y) \quad (7)$$
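The soft-max and weighted-average steps of Eq.6 and Eq.7 can be sketched as follows (a minimal illustration for $K=2$; the small `eps` guard against a zero denominator is our own addition, not from the paper):

```python
import numpy as np

def fuse(I1, I2, w1_hat, w2_hat, eps=1e-12):
    """Soft-max the (already resized) initial weight maps, then fuse.

    I1, I2: source images; w1_hat, w2_hat: initial weight maps at image
    size.  A sketch of Eq.6 (weights normalized to sum to 1 per pixel)
    and Eq.7 (per-pixel weighted average).
    """
    denom = w1_hat + w2_hat + eps
    W1, W2 = w1_hat / denom, w2_hat / denom   # Eq.6
    return W1 * I1 + W2 * I2                  # Eq.7

I1 = np.full((4, 4), 100.0)                   # e.g. infrared image
I2 = np.full((4, 4), 200.0)                   # e.g. visible image
F = fuse(I1, I2, np.full((4, 4), 1.0), np.full((4, 4), 3.0))
# weights become 0.25 / 0.75, so every fused pixel is 0.25*100 + 0.75*200
```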
4 Experiments and Analysis
In this section, the source images and the experimental environment are introduced first. Secondly, the effect of different networks and norms in our method is discussed. Then the influence of the ZCA operation is analyzed. Finally, the proposed algorithm is evaluated using subjective and objective criteria, and we choose several existing state-of-the-art methods to compare against our fusion method.
4.1 Experimental Settings
4.2 The effect of different networks and norms
where the deep features are reshaped into a matrix before the nuclear-norm is applied. The reshape and nuclear-norm operations are shown in Fig.7.
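As an illustrative NumPy sketch (our own code, not the paper's implementation), the reshape and nuclear-norm step can be written as follows; the per-position windowing of Fig.7 is not reproduced here:

```python
import numpy as np

def nuclear_norm(features):
    """Reshape (C, H, W) deep features to a C x (H*W) matrix and return
    its nuclear-norm, i.e. the sum of its singular values."""
    C = features.shape[0]
    M = features.reshape(C, -1)              # the reshape operation
    s = np.linalg.svd(M, compute_uv=False)   # singular values of M
    return s.sum()                           # ||M||_*

# rank-1 toy features: the nuclear-norm equals the single non-zero
# singular value, which for an outer product u v^T is ||u|| * ||v||
u = np.array([3.0, 4.0])                     # ||u|| = 5
v = np.ones(6)                               # ||v|| = sqrt(6)
feat = np.outer(u, v).reshape(2, 2, 3)
# nuclear_norm(feat) == 5 * sqrt(6)
```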
Five quality metrics are utilized to assess the performance. These are: $FMI_{pixel}$32, $FMI_{dct}$32 and $FMI_{w}$32, which calculate the feature mutual information (FMI) for the pixel, discrete cosine and wavelet features, respectively; $N_{abf}$33, which denotes the rate of noise or artifacts added to the fused image by the fusion process; and the modified structural similarity ($SSIM_a$)17.
The performance improves with increasing values of $FMI_{pixel}$, $FMI_{dct}$, $FMI_{w}$ and $SSIM_a$. In contrast, the fusion performance is better when the value of $N_{abf}$ is small, which means the fused images contain less artificial information and noise.
We calculate the average quality metric values over 21 pairs of source images. For VGG19, the outputs of four layers (relu1_1, relu2_1, relu3_1, relu4_1) are used. For ResNet50 and ResNet101, we choose four convolutional blocks (conv2, conv3, conv4, conv5). These values are shown in Tables 1, 2 and 3.
The best values are indicated in bold and the second-best values in red font. As we can see, the ResNets (50/101) obtain all the best and second-best values under the different norms. This means that ResNet achieves better fusion performance than VGG19 in our fusion framework.
4.3 The influence of ZCA operation
In this section, we analyze the influence of the ZCA operation on our method. We choose ResNet50 and evaluate the performance with or without ZCA under different norms, including the $\ell_1$-norm and the nuclear-norm.
Ten quality metrics are chosen. These metrics include: En (entropy), MI (mutual information), $Q^{AB/F}$34, $FMI_{pixel}$32, $FMI_{dct}$32, $FMI_{w}$32, $N_{abf}$33, SCD35, $SSIM_a$17, and MS-SSIM36. The performance improves with increasing values of En, MI, $Q^{AB/F}$, $FMI_{pixel}$, $FMI_{dct}$, $FMI_{w}$, SCD, $SSIM_a$ and MS-SSIM. However, performance is better when the value of $N_{abf}$ is small.
Table 5 and Table 6 show the quality metric values with and without ZCA, respectively. In Table 6, when ZCA is not used, ResNet50 with the nuclear-norm achieves the best values. This means that the low-rank property is more useful than the other norms on the original deep features. However, in Table 5, when we use ZCA to project the deep features into a sub-space, the $\ell_1$-norm obtains most of the best values, even compared with the nuclear-norm without ZCA. We believe ZCA projects the original data into a sparse space, and in this situation the sparse metric ($\ell_1$-norm) obtains better performance than the low-rank metric (nuclear-norm).
Based on the above observations, we choose ResNet50 to extract the deep features in our fusion method, and the ZCA and $\ell_1$-norm operations are used to obtain the initial weight maps.
4.4 Subjective Evaluation
In subjective and objective evaluation, we choose nine existing fusion methods to compare with our algorithm. These fusion methods are: cross bilateral filter fusion method(CBF)37 , discrete cosine harmonic wavelet transform(DCHWT)33 , joint sparse representation(JSR)8 , saliency detection in sparse domain(JSRSD)38 , gradient transfer and total variation minimization(GTF)39 , weighted least square optimization(WLS)28 , convolutional sparse representation(ConvSR)14 , a Deep Learning Framework based on VGG19 and multi-layers(VggML)17 , and DeepFuse16 .
In our fusion method, the convolutional blocks (Conv4 and Conv5) are chosen to obtain the fused images. The fused images are shown in Fig.8. As an example, we evaluate the relative performance of the fusion methods only on a single pair of images (“street”).
In Fig.8(c-m), the fused images obtained by CBF and DCHWT contain more noise, and some saliency features are not clear. JSR, JSRSD, GTF and WLS obtain better performance with less noise, but their fused images still contain artificial information near the saliency features. In contrast, deep learning-based fusion methods, such as ConvSR, VggML, DeepFuse and ours, capture more saliency features, preserve more detail information, and produce fused images that look more natural. As there is no visible difference between these deep learning-based methods and the proposed algorithm in terms of human visual perception, we choose several objective quality metrics to assess the fusion performance in the next section.
4.5 Objective Evaluation
As Fig.9 and Table 7 show, our method achieves better fusion performance in both subjective and objective evaluation. In Table 7, the best values are indicated in bold. Compared with other existing fusion methods, our algorithm obtains almost all the best values in $N_{abf}$ and $SSIM_a$, which indicates that the fused images obtained by our method contain less noise and preserve more structure information from the source images. The advantage of our algorithm is more obvious when $N_{abf}$ is used to assess the fused images.
Although the other metric values for our fused images are not the best, they are still very close to the best ones, and the trade-off our method makes to improve the fusion performance in terms of $N_{abf}$ and $SSIM_a$ is acceptable.
5 Conclusion
In this article we have proposed a novel fusion algorithm based on ResNet50 and the ZCA operation for infrared and visible image fusion. Firstly, the source images are fed directly into the ResNet50 network to obtain the deep features. Following this, the ZCA operation, which is also called whitening, is used to project the original deep features into a sparse subspace. The local average $\ell_1$-norm is utilized to obtain the initial weight maps. Then bicubic interpolation is used to resize the initial weight maps to the source image size, and a soft-max operation is used to obtain the final weight maps. Finally, the fused image is reconstructed by a weighted-average strategy which combines the final weight maps and the source images. Experimental results show that the proposed fusion method delivers better fusion performance in both objective and subjective evaluation.
- (1) S. Li, X. Kang, L. Fang, J. Hu, and H. Yin, “Pixel-level image fusion: A survey of the state of the art,” Inf. Fusion, vol. 33, pp. 100–112, 2017.
- (2) A. Ben Hamza, Y. He, H. Krim, and A. Willsky, “A Multiscale Approach to Pixel-level Image Fusion,” Integr. Comput. Aided. Eng., vol. 12, pp. 135–146, 2005.
- (3) S. Yang, M. Wang, L. Jiao, R. Wu, and Z. Wang, “Image fusion based on a new contourlet packet,” Inf. Fusion, vol. 11, no. 2, pp. 78–84, 2010.
- (4) L. Wang, B. Li, and L. F. Tian, “EGGDD: An explicit dependency model for multi-modal medical image fusion in shift-invariant shearlet transform domain,” Inf. Fusion, vol. 19, no. 1, pp. 29–37, 2014.
- (5) H. Pang, M. Zhu, and L. Guo, “Multifocus color image fusion using quaternion wavelet transform,” 2012 5th Int. Congr. Image Signal Process. CISP 2012, 2012.
- (6) D. P. Bavirisetti and R. Dhuli, “Two-scale image fusion of visible and infrared images using saliency detection,” Infrared Phys. Technol., vol. 76, pp. 52–64, 2016.
- (7) J. jing Zong and T. shuang Qiu, “Medical image fusion based on sparse representation of classified image patches,” Biomed. Signal Process. Control, vol. 34, pp. 195–205, 2017.
- (8) Q. Zhang, Y. Fu, H. Li, and J. Zou, “Dictionary learning method for joint sparse representation-based image fusion,” Opt. Eng., vol. 52, no. 5, p. 057006, 2013.
- (9) R. Gao, S. A. Vorobyov, and H. Zhao, “Image fusion with cosparse analysis operator,” IEEE Signal Process. Lett., vol. 24, no. 7, pp. 943–947, 2017.
- (10) X. Lu, B. Zhang, Y. Zhao, H. Liu, and H. Pei, “The infrared and visible image fusion algorithm based on target separation and sparse representation,” Infrared Phys. Technol., vol. 67, pp. 397–407, 2014.
- (11) M. Yin, P. Duan, W. Liu, and X. Liang, “A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation,” Neurocomputing, vol. 226, no. November 2016, pp. 182–191, 2017.
- (12) H. Li and X.-J. Wu, “Multi-focus Image Fusion using dictionary learning and Low-Rank Representation,” in Image and Graphics. ICIG 2017. Lecture Notes in Computer Science, vol 10666. Springer, Cham., 2017, pp. 675–686.
- (13) H. Li and X.-J. Wu, “Infrared and visible image fusion using Latent Low-Rank Representation,” arXiv Prepr. arXiv1804.08992, 2018.
- (14) Y. Liu, X. Chen, R. K. Ward, and J. Wang, “Image Fusion with Convolutional Sparse Representation,” IEEE Signal Process. Lett., vol. 23, no. 12, pp. 1882–1886, 2016.
- (15) Y. Liu, X. Chen, H. Peng, and Z. Wang, “Multi-focus image fusion with a deep convolutional neural network,” Inf. Fusion, vol. 36, pp. 191–207, 2017.
- (16) K. R. Prabhakar, V. S. Srikar, and R. V. Babu, “DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2017–Octob, no. Ev 0, pp. 4724–4732, 2017.
- (17) H. Li, X.-J. Wu, and J. Kittler, “Infrared and Visible Image Fusion using a Deep Learning Framework,” in arXiv preprint arXiv:1804.06992, 2018.
- (18) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
- (19) K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.
- (20) A. Kessy, A. Lewin, and K. Strimmer, “Optimal Whitening and Decorrelation,” Am. Stat., vol. 1305, no. 2017, pp. 1–6, 2018.
- (21) J. Cai, S. Gu, and L. Zhang, “Learning a deep single image contrast enhancer from multi-exposure images,” IEEE Trans. Image Process., vol. 27, no. 4, pp. 2049–2062, 2018.
- (22) H. Zhang and K. Dana, “Multi-style Generative Network for Real-time Transfer,” arXiv Prepr. arXiv1703.06953, 2017.
- (23) X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia, “Detail-Revealing Deep Video Super-Resolution,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2017–Octob, no. 413113, pp. 4482–4490, 2017.
- (24) Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal Style Transfer via Feature Transforms,” in Advances in neural information processing systems, 2017, pp. 385–395.
- (25) L. Sheng, Z. Lin, J. Shao, and X. Wang, “Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8242–8250.
- (26) T. Y. Lin et al., “Microsoft COCO: Common objects in context,” in European conference on computer vision. Springer, Cham, 2014, vol. 8693 LNCS, no. PART 5, pp. 740–755.
- (27) O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
- (28) J. Ma, Z. Zhou, B. Wang, and H. Zong, “Infrared and visible image fusion based on visual saliency map and weighted least square optimization,” Infrared Phys. Technol., vol. 82, pp. 8–17, 2017.
- (29) Alexander Toet et al., TNO Image Fusion Dataset. https://figshare.com/articles/TN_Image_Fusion_Dataset/1008029. 2014.
- (30) H. Li, https://github.com/exceptionLi/imagefusion_resnet50/tree/master/IV_images. 2018.
- (31) G. Liu, Z. Lin, and Y. Yu, “Robust Subspace Segmentation by Low-Rank Representation,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 663–670.
- (32) M. Haghighat and M. A. Razian, “Fast-FMI: Non-reference image fusion metric,” in 8th IEEE International Conference on Application of Information and Communication Technologies, AICT 2014 - Conference Proceedings, 2014.
- (33) B. K. Shreyamsha Kumar, “Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform,” Signal, Image Video Process., vol. 7, no. 6, pp. 1125–1143, 2013.
- (34) C. S. Xydeas and V. Petrović, “Objective image fusion performance measure,” Electron. Lett., 2000.
- (35) V. Aslantas and E. Bendes, “A new image quality metric for image fusion: The sum of the correlations of differences,” AEU - Int. J. Electron. Commun., vol. 69, no. 12, pp. 1890–1896, 2015.
- (36) K. Ma, S. Member, K. Zeng, and Z. Wang, “Perceptual Quality Assessment for Multi-Exposure Image Fusion,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3345–3356, 2015.
- (37) B. K. Shreyamsha Kumar, “Image fusion based on pixel significance using cross bilateral filter,” Signal, Image Video Process., 2015.
- (38) C. H. Liu, Y. Qi, and W. R. Ding, “Infrared and visible image fusion method based on saliency detection in sparse domain,” Infrared Phys. Technol., vol. 83, pp. 94–102, 2017.
- (39) J. Ma, C. Chen, C. Li, and J. Huang, “Infrared and visible image fusion via gradient transfer and total variation minimization,” Inf. Fusion, vol. 31, pp. 100–109, 2016.