1 Introduction
Image fusion is an enhancement technique that combines two or more images into a single robust and informative image Ma et al. (2019), and it is an active topic in computer vision. It has a wide range of applications in pattern recognition Singh et al. (2008), medical imaging Zong and Qiu (2017), remote sensing Simone et al. (2002), and modern military systems Chen et al. (2014), as these applications require fusing two or more images of the same scene Li et al. (2017). In particular, the fusion of visible and infrared images improves the perception ability of the human visual system in target detection and recognition Kong et al. (2007). A visible image has rich appearance information, whereas features such as texture and detail are often not obvious in the corresponding infrared image. In contrast, an infrared image mainly reflects the heat radiation emitted by objects; it is less affected by illumination changes or artifacts and overcomes the obstacles to target detection at night. However, the spatial resolution of infrared images is typically lower than that of visible images. Consequently, fusing thermal radiation and texture detail information into a single image facilitates automatic detection and accurate positioning of targets Ma et al. (2016).
Broadly speaking, current algorithms for fusing visible and infrared images can be divided into four categories: multiscale transformation, sparse representation, subspace, and saliency methods Ma et al. (2019). Multiscale transformation based methods Liu et al. (2018); Li et al. (2011); Pajares and De La Cruz (2004); Zhang et al. (1999) decompose the source images into multiple levels, fuse the layers at each level with a specific fusion strategy, and finally recover the fused image by incorporating the fused layers. The second category is sparse representation based methods Yang and Li (2014); Wang et al. (2014); Li et al. (2012), which assume that a natural image can be expressed as a sparse linear combination of atoms in a dictionary, so that the fused image can be recovered by merging the sparse coefficients. The third category is subspace learning based methods Bavirisetti et al. (2017); Kong et al. (2014); Patil and Mudengudi (2011), which project high-dimensional input images into low-dimensional subspaces to capture the intrinsic features of the original images. The fourth category is saliency-based methods Bavirisetti and Dhuli (2016); Zhang et al. (2017); Zhao et al. (2014). Based on the prior knowledge that humans usually pay more attention to salient objects rather than surrounding areas, these methods fuse images while maintaining the integrity of the salient target regions.
To the best of our knowledge, no Bayesian model has previously been applied to the image fusion problem. Therefore, we present in this paper a novel Bayesian fusion model for infrared and visible images. In our model, the image fusion task is cast into a regression problem. To measure the variable uncertainty, we formulate the model in a hierarchical Bayesian manner. Besides, to make the fused image conform to the human visual system, the model incorporates a total variation (TV) penalty. The model is then efficiently inferred by the EM algorithm. We test our algorithm on the TNO and NIR image fusion datasets against several state-of-the-art approaches. Compared with the previous methods, our method generates fused images with highlighted targets and rich texture details, which can improve the reliability of automatic target detection and recognition systems.
2 Bayesian fusion model
In this section, we present a novel Bayesian fusion model for infrared and visible images. Then, this model is efficiently inferred by the EM algorithm Dempster et al. (1977).
2.1 Model formulation
Given a pair of pre-registered infrared and visible images, $u, v \in \mathbb{R}^{H \times W}$, the image fusion technique aims at obtaining an informative image $x$ from $u$ and $v$.

It is well-known that visible images satisfy human visual perception, while they are significantly sensitive to disturbances such as poor illumination and fog. In contrast, infrared images are robust to these disturbances but may lose part of the informative textures. In order to preserve the general profile of the two images, we minimize the difference between the fused and source images, that is,

$$\min_x\; \ell_1(x - u) + \ell_2(x - v),$$

where $\ell_1(\cdot)$ and $\ell_2(\cdot)$ are loss functions. Typically, we assume the difference is measured by the $\ell_1$ norm. Thus, the problem can be rewritten as

$$\min_x\; \|x - u\|_1 + \|x - v\|_1.$$

Let $y = u$, then we have

$$\min_x\; \|y - x\|_1 + \|x - v\|_1. \qquad (1)$$

Essentially, equation (1) corresponds to a linear regression model

$$y = x + \epsilon,$$

where $\epsilon$ denotes a Laplacian noise and $x$ is governed by a Laplacian distribution centered at the visible image $v$. By reformulating this problem in the Bayesian fashion, the conditional distribution of $y$ given $x$ is

$$p(y \mid x) = \prod_{i=1}^{H}\prod_{j=1}^{W} \frac{\gamma}{2}\exp\bigl(-\gamma\,|y_{ij} - x_{ij}|\bigr),$$

and the prior distribution of $x$ is

$$p(x) = \prod_{i=1}^{H}\prod_{j=1}^{W} \frac{\lambda}{2}\exp\bigl(-\lambda\,|x_{ij} - v_{ij}|\bigr).$$
To avoid the $\ell_1$-norm optimization, we reformulate the Laplacian distribution as a Gaussian scale mixture with an exponentially distributed prior on the variance, that is,

$$\frac{\gamma}{2}\exp\bigl(-\gamma\,|z - \mu|\bigr) = \int_0^{\infty} \mathcal{N}(z \mid \mu, m)\,\mathcal{E}\!\left(m \,\middle|\, \frac{\gamma^2}{2}\right)\mathrm{d}m, \qquad (2)$$

where $\mathcal{N}(z \mid \mu, m)$ denotes the Gaussian distribution with mean $\mu$ and variance $m$, and $\mathcal{E}(m \mid \theta)$ denotes the exponential distribution with rate parameter $\theta$, i.e., $p(m) = \theta e^{-\theta m}$. According to equation (2), the original model of $y$ and $x$ can be rewritten in the hierarchical Bayesian manner, that is,

$$y_{ij} \mid x_{ij}, m_{ij} \sim \mathcal{N}(x_{ij}, m_{ij}), \qquad m_{ij} \sim \mathcal{E}(\gamma^2/2),$$
$$x_{ij} \mid v_{ij}, n_{ij} \sim \mathcal{N}(v_{ij}, n_{ij}), \qquad n_{ij} \sim \mathcal{E}(\lambda^2/2),$$

for all $i = 1, \dots, H$ and $j = 1, \dots, W$, where $H$ and $W$ denote the height and the width of the input image. In what follows, we use matrices $M$ and $N$ to denote the collections of all latent variables $m_{ij}$ and $n_{ij}$, respectively.
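The Gaussian scale mixture identity in equation (2) can be checked numerically: integrating the Gaussian against the exponential mixing density over the variance should recover the Laplacian density. A small NumPy sketch (the integration grid and tolerance are arbitrary choices, not from the paper):

```python
import numpy as np

def laplace_pdf(z, mu, gamma):
    """Laplacian density (gamma/2) * exp(-gamma * |z - mu|)."""
    return 0.5 * gamma * np.exp(-gamma * np.abs(z - mu))

def gsm_pdf(z, mu, gamma, n_grid=200000, m_max=200.0):
    """Numerically integrate N(z | mu, m) * Exp(m | gamma^2 / 2) over m > 0."""
    m = np.linspace(1e-6, m_max, n_grid)      # grid over the latent variance m
    rate = 0.5 * gamma ** 2                   # rate of the exponential mixing prior
    gauss = np.exp(-0.5 * (z - mu) ** 2 / m) / np.sqrt(2.0 * np.pi * m)
    expo = rate * np.exp(-rate * m)
    return np.trapz(gauss * expo, m)

# The mixture should match the Laplacian density pointwise.
for z in (-1.0, 0.3, 2.0):
    print(abs(gsm_pdf(z, 0.0, 1.0) - laplace_pdf(z, 0.0, 1.0)))
```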
Besides modeling the general profiles, the image textures should be taken into consideration so as to make the fused image satisfy human visual perception. As discussed above, there is plenty of high-frequency information in visible images, but the corresponding areas often cannot be observed in infrared images. In order to preserve the edge information of visible images, we regularize the fused image in the gradient domain with a gradient sparsity regularizer expressed as

$$R(x) = \rho\,\|\nabla x - \nabla v\|_1,$$

where $\rho$ is a hyperparameter controlling the strength of the regularization, and $\nabla$ denotes the gradient operator. This regularizer makes the fused image have textures similar to those of the visible image.
By combining the modeling of general profiles and gradients, Fig. 1 displays the graphical expression of our hierarchical Bayesian model. Specifically, in the first level, $m_{ij}$ is a latent variable, while $y_{ij}$ and $x_{ij}$ are observed and unknown variables, respectively. In the second level, $n_{ij}$ and $v_{ij}$ are latent and observed variables, respectively. And $\gamma$ and $\lambda$ are hyperparameters. By ignoring the constants not depending on $x$, the log-likelihood of the model can be expressed as

$$\log p(y, x, M, N) = -\sum_{i=1}^{H}\sum_{j=1}^{W}\left[\frac{(y_{ij} - x_{ij})^2}{2m_{ij}} + \frac{(x_{ij} - v_{ij})^2}{2n_{ij}}\right] - \rho\,\|\nabla x - \nabla v\|_1 + \mathrm{const}.$$

In the next subsection, we will discuss how to infer this model.
2.2 Model inference
As is well-known, the EM algorithm is an effective tool to maximize the log-likelihood function of a problem involving latent variables. In detail, we first initialize the unknown variable $x$. Then, in the E-step, the algorithm calculates the expectation of the log-likelihood function with respect to the latent variables given the current estimate $x^{(t)}$, which is often referred to as the $Q$-function,

$$Q\bigl(x \mid x^{(t)}\bigr) = \mathbb{E}_{M, N \mid y, x^{(t)}}\bigl[\log p(y, x, M, N)\bigr].$$

In the M-step, we find $x^{(t+1)}$ to maximize the $Q$-function, i.e.,

$$x^{(t+1)} = \arg\max_x\, Q\bigl(x \mid x^{(t)}\bigr).$$
E-step: In order to obtain the $Q$-function in our model, $\mathbb{E}[1/m_{ij}]$ and $\mathbb{E}[1/n_{ij}]$ should be computed. For convenience, we compute the posterior distributions of $1/m_{ij}$ and $1/n_{ij}$. It has been assumed that the prior distribution of $m_{ij}$ is $\mathcal{E}(\gamma^2/2)$, so $1/m_{ij}$ is governed by an inverse gamma distribution with shape parameter $1$ and scale parameter $\gamma^2/2$. And the probability density function of $w = 1/m_{ij}$ is given by

$$p(w) = \frac{\gamma^2}{2}\,w^{-2}\exp\!\left(-\frac{\gamma^2}{2w}\right).$$

According to the Bayesian formula, the posterior of $1/m_{ij}$ is the inverse Gaussian distribution, that is,

$$1/m_{ij} \mid y_{ij}, x_{ij} \sim \mathcal{IN}\bigl(\alpha_{ij}, \beta\bigr),$$

where $\alpha_{ij} = \gamma / |y_{ij} - x_{ij}|$ and $\beta = \gamma^2$. As for $1/n_{ij}$, we can compute its posterior in the same way. Similarly, the posterior of $1/n_{ij}$ is

$$1/n_{ij} \mid x_{ij}, v_{ij} \sim \mathcal{IN}\bigl(\alpha'_{ij}, \beta'\bigr),$$

where $\alpha'_{ij} = \lambda / |x_{ij} - v_{ij}|$ and $\beta' = \lambda^2$. Note that the expectation of the inverse Gaussian distribution is its location parameter. Thus, we have

$$\bar{m}_{ij} := \mathbb{E}[1/m_{ij}] = \frac{\gamma}{|y_{ij} - x_{ij}|}, \qquad (3)$$

$$\bar{n}_{ij} := \mathbb{E}[1/n_{ij}] = \frac{\lambda}{|x_{ij} - v_{ij}|}. \qquad (4)$$
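In NumPy, equations (3) and (4) amount to two element-wise divisions. A minimal sketch; the small `eps` below is an assumed numerical safeguard against zero residuals, not part of the derivation:

```python
import numpy as np

def e_step(y, v, x, gamma, lam, eps=1e-8):
    """E-step: posterior expectations of 1/m_ij and 1/n_ij, equations (3)-(4)."""
    m_bar = gamma / (np.abs(y - x) + eps)  # E[1/m_ij] = gamma / |y_ij - x_ij|
    n_bar = lam / (np.abs(x - v) + eps)    # E[1/n_ij] = lambda / |x_ij - v_ij|
    return m_bar, n_bar
```

Pixels where the current fused estimate is close to a source image receive a large weight for that source in the subsequent M-step.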
Thereafter, in the E-step, the $Q$-function is given by

$$Q\bigl(x \mid x^{(t)}\bigr) = -\frac{1}{2}\bigl\|\sqrt{\bar{M}} \odot (y - x)\bigr\|_F^2 - \frac{1}{2}\bigl\|\sqrt{\bar{N}} \odot (x - v)\bigr\|_F^2 - \rho\,\|\nabla x - \nabla v\|_1 + \mathrm{const},$$

where the symbol $\odot$ means element-wise multiplication, and the $(i,j)$th entries of $\bar{M}$ and $\bar{N}$ are $\bar{m}_{ij}$ and $\bar{n}_{ij}$, respectively.
M-step: Here, we need to minimize the negative $Q$-function with respect to $x$. The half-quadratic splitting algorithm is employed to deal with this problem, i.e., we introduce auxiliary variables $k$ and $g$ and solve

$$\min_{x, k, g}\; \frac{1}{2}\bigl\|\sqrt{\bar{M}} \odot (y - x)\bigr\|_F^2 + \frac{1}{2}\bigl\|\sqrt{\bar{N}} \odot (x - v)\bigr\|_F^2 + \rho\,\|g\|_1, \quad \text{s.t. } k = x,\; g = \nabla k - \nabla v.$$

It can be further cast into the following unconstrained optimization problem,

$$\min_{x, k, g}\; \frac{1}{2}\bigl\|\sqrt{\bar{M}} \odot (y - x)\bigr\|_F^2 + \frac{1}{2}\bigl\|\sqrt{\bar{N}} \odot (x - v)\bigr\|_F^2 + \rho\,\|g\|_1 + \frac{\eta}{2}\|k - x\|_F^2 + \frac{\eta}{2}\bigl\|g - (\nabla k - \nabla v)\bigr\|_F^2.$$

The unknown variables can be solved iteratively in the coordinate descent fashion.
Update $x$: It is a least squares issue,

$$\min_x\; \frac{1}{2}\bigl\|\sqrt{\bar{M}} \odot (y - x)\bigr\|_F^2 + \frac{1}{2}\bigl\|\sqrt{\bar{N}} \odot (x - v)\bigr\|_F^2 + \frac{\eta}{2}\|k - x\|_F^2.$$

The solution of $x$ is

$$x = \bigl(\bar{M} \odot y + \bar{N} \odot v + \eta k\bigr) \oslash \bigl(\bar{M} + \bar{N} + \eta\bigr), \qquad (5)$$

where the symbol $\oslash$ means element-wise division.
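Equation (5) is a per-pixel weighted average of the infrared image, the visible image, and the auxiliary variable; a minimal NumPy sketch:

```python
import numpy as np

def update_x(y, v, k, m_bar, n_bar, eta):
    """Least-squares solution of equation (5), an element-wise division."""
    return (m_bar * y + n_bar * v + eta * k) / (m_bar + n_bar + eta)
```

Each pixel is pulled toward whichever source currently carries the larger posterior weight.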
Update $g$: It is an $\ell_1$-norm penalized regression issue,

$$\min_g\; \rho\,\|g\|_1 + \frac{\eta}{2}\|g - w\|_F^2.$$

The solution is given by the soft-thresholding operator,

$$g = \operatorname{sign}(w) \odot \max\bigl(|w| - \rho/\eta,\, 0\bigr), \qquad (6)$$

where $w = \nabla k - \nabla v$.
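The soft-thresholding step in equation (6) is a one-liner in NumPy:

```python
import numpy as np

def update_g(w, rho, eta):
    """Soft-thresholding, equation (6): shrink w toward zero by rho / eta."""
    return np.sign(w) * np.maximum(np.abs(w) - rho / eta, 0.0)
```

Gradient residuals smaller than the threshold are zeroed out, which is what enforces the gradient sparsity of the fused image.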
Update $k$: It is a deconvolution problem,

$$\min_k\; \frac{\eta}{2}\|k - x\|_F^2 + \frac{\eta}{2}\bigl\|g - (\nabla k - \nabla v)\bigr\|_F^2.$$

It can be efficiently solved by the fast Fourier transform (FFT) and inverse FFT (IFFT) operators, and the solution is

$$k = \operatorname{ifft}\!\left(\frac{\operatorname{fft}(x) + \overline{\operatorname{fft}(\nabla)} \odot \operatorname{fft}(g + \nabla v)}{1 + \overline{\operatorname{fft}(\nabla)} \odot \operatorname{fft}(\nabla)}\right), \qquad (7)$$

where $\overline{(\,\cdot\,)}$ denotes the complex conjugation.
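Equation (7) can be sketched in NumPy as follows. This sketch assumes periodic boundary conditions and forward-difference gradient filters in both directions (common choices for FFT-based solvers, not spelled out in the text):

```python
import numpy as np

def grad_h(img):
    """Horizontal forward difference with periodic boundary."""
    return np.roll(img, -1, axis=1) - img

def grad_v(img):
    """Vertical forward difference with periodic boundary."""
    return np.roll(img, -1, axis=0) - img

def diff_otf(shape):
    """FFTs of the horizontal and vertical difference filters on the image grid."""
    hh = np.zeros(shape); hh[0, 0] = -1.0; hh[0, -1] = 1.0
    hv = np.zeros(shape); hv[0, 0] = -1.0; hv[-1, 0] = 1.0
    return np.fft.fft2(hh), np.fft.fft2(hv)

def update_k(x, v, g_h, g_v):
    """Solve the deconvolution subproblem, equation (7), in the Fourier domain."""
    Dh, Dv = diff_otf(x.shape)
    num = (np.fft.fft2(x)
           + np.conj(Dh) * np.fft.fft2(g_h + grad_h(v))
           + np.conj(Dv) * np.fft.fft2(g_v + grad_v(v)))
    den = 1.0 + np.abs(Dh) ** 2 + np.abs(Dv) ** 2
    return np.real(np.fft.ifft2(num / den))
```

As a sanity check, when $g = 0$ and $v = x$ the normal equations are satisfied by $k = x$, so the solver should return the input unchanged.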
In order to make the model more flexible, the hyperparameters $\gamma$ and $\lambda$ are automatically updated. According to empirical Bayes, maximizing the expected complete-data log-likelihood with respect to $\gamma$ and $\lambda$ yields

$$\gamma = \sqrt{\frac{2HW}{\sum_{i,j} \mathbb{E}[m_{ij}]}}, \qquad (8)$$

and

$$\lambda = \sqrt{\frac{2HW}{\sum_{i,j} \mathbb{E}[n_{ij}]}}, \qquad (9)$$

where $\mathbb{E}[m_{ij}] = 1/\alpha_{ij} + 1/\beta$ and $\mathbb{E}[n_{ij}] = 1/\alpha'_{ij} + 1/\beta'$ are the posterior means of $m_{ij}$ and $n_{ij}$.
2.3 Algorithm and implementation details
Algorithm 1 summarizes the workflow of our proposed model, where the E-step and M-step alternate with each other until the maximum iteration number is reached. Since there is no analytic solution in the M-step, we maximize the $Q$-function by updating the variables $T$ times. It is found that $T$ does not affect performance very much, so to reduce computation we set $T$ to a small value. Furthermore, it is found that the algorithm generates a satisfactory result if the number of outer loop iterations is set to 15. Note that the hyperparameters $\rho$ and $\eta$ denote the strengths of the gradient and $\ell_2$-norm penalties, respectively, and empirical studies suggest fixing them to moderate values.
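Putting the pieces together, the workflow of Algorithm 1 can be sketched end-to-end in NumPy. This is a simplified illustration, not the authors' MATLAB implementation: the initialization of the fused image, the default values of `rho` and `eta`, the periodic-boundary gradients, and the `eps` safeguard are all assumptions:

```python
import numpy as np

def bayes_fusion(y, v, rho=0.1, eta=1.0, n_outer=15, n_inner=1, eps=1e-8):
    """EM fusion of an infrared image y and a visible image v, both in [0, 1]."""
    def grad(img, axis):
        return np.roll(img, -1, axis=axis) - img  # forward diff., periodic boundary

    x = 0.5 * (y + v)                 # initialize the fused image (assumed choice)
    k = x.copy()
    gamma = lam = 1.0                 # initial hyperparameters (assumed)
    # FFTs of the two difference filters, used by the k-update
    hh = np.zeros(x.shape); hh[0, 0] = -1.0; hh[0, -1] = 1.0
    hv = np.zeros(x.shape); hv[0, 0] = -1.0; hv[-1, 0] = 1.0
    Dh, Dv = np.fft.fft2(hh), np.fft.fft2(hv)
    den_k = 1.0 + np.abs(Dh) ** 2 + np.abs(Dv) ** 2

    for _ in range(n_outer):
        # E-step, equations (3)-(4)
        m_bar = gamma / (np.abs(y - x) + eps)
        n_bar = lam / (np.abs(x - v) + eps)
        for _ in range(n_inner):      # M-step via half-quadratic splitting
            # x-update, equation (5)
            x = (m_bar * y + n_bar * v + eta * k) / (m_bar + n_bar + eta)
            # g-update, equation (6): soft-threshold the gradient residual
            wh = grad(k, 1) - grad(v, 1)
            wv = grad(k, 0) - grad(v, 0)
            gh = np.sign(wh) * np.maximum(np.abs(wh) - rho / eta, 0.0)
            gv = np.sign(wv) * np.maximum(np.abs(wv) - rho / eta, 0.0)
            # k-update, equation (7): FFT deconvolution
            num = (np.fft.fft2(x)
                   + np.conj(Dh) * np.fft.fft2(gh + grad(v, 1))
                   + np.conj(Dv) * np.fft.fft2(gv + grad(v, 0)))
            k = np.real(np.fft.ifft2(num / den_k))
        # empirical-Bayes hyperparameter updates, equations (8)-(9)
        gamma = np.sqrt(2.0 / np.mean(np.abs(y - x) / gamma + 1.0 / gamma ** 2))
        lam = np.sqrt(2.0 / np.mean(np.abs(x - v) / lam + 1.0 / lam ** 2))
    return np.clip(x, 0.0, 1.0)
```

When the two inputs agree, the fused image should reproduce them; when they disagree, each pixel is driven toward whichever source the current estimate is closer to, while the TV term keeps the gradients close to those of the visible image.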
3 Experiments
This section studies the behavior of our proposed model and other popular counterparts, including CSR Liu et al. (2016), ADF Bavirisetti and Dhuli (2015), FPDE Bavirisetti et al. (2017), TSIFVS Bavirisetti and Dhuli (2016) and TVADMM Guo et al. (2017). All experiments are conducted with MATLAB on a computer with an Intel Core i7-9750H CPU @ 2.60 GHz.
3.1 Experimental data
In this experiment, we test our algorithm on the TNO image fusion dataset Toet and Hogervorst (2012) (https://figshare.com/articles/TNOImageFusionDataset/1008029) and the RGB-NIR Scene dataset Brown and Süsstrunk (2011) (https://ivrlwww.epfl.ch/supplementary_material/cvpr11/index.html). 20 pairs of infrared and visible images from the TNO dataset and 52 pairs from the "country" scene of the NIR dataset are employed. In the TNO dataset, the interesting objects cannot be observed in the visible images, as they were shot at night; in contrast, the objects are salient in the infrared images, but without textures. The NIR dataset was obtained in daylight, and on it we test whether the fused images contain more detailed information and highlighted targets.
3.2 Subjective visual evaluation
Figure 2 exhibits the qualitative fusion results. From left to right: "Soldier_in_trench_1", "Image_04" and "Marne_04" in the TNO dataset, and "Image_13" and "Image_35" in the NIR dataset. In the first column, the TSIFVS and ADF methods preserve almost no face details, the TVADMM method has low target brightness, and the backgrounds of the CSR and FPDE methods (such as the trenches) are not clear enough. In the fusion results for the second column, the house details of the CSR method are poor and the ground details of the ADF method are not obvious enough; meanwhile, the targets of the TVADMM and TSIFVS methods have low brightness, and the background details (e.g., the trees) of the FPDE method are not clear enough. In the results for the third column, the FPDE and ADF methods have lower brightness and fewer details, the TVADMM and CSR methods have poorer window details, and the TSIFVS method has less obvious edge contours. In the results for the fourth and fifth columns, the edge contours of the TSIFVS method do not fit the human visual system because of their overly sharp boundaries; the CSR and TVADMM methods are not salient enough in the tree/cloud details and edges; objects (trees and mountains) in the ADF method have poor highlighting effects; and the FPDE method exhibits visual blur with fewer details.
In short, compared with the previous methods, our proposed Bayesian fusion model generates better fused images with highlighted targets and rich texture details.
3.3 Objective quantitative evaluation
We calculate the averages over the selected image pairs of the Entropy (EN) Roberts et al. (2008), Mutual information (MI) Qu et al. (2002), edge-preservation ($Q^{AB/F}$), Standard deviation (SD) Rao (1997) and Structural similarity index measure (SSIM) Wang and Bovik (2002) metrics for our proposed model and the other popular counterparts. EN and SD measure how much information is contained in an image. $Q^{AB/F}$ reflects the edge information preserved in the fused image. MI measures the agreement between the source images and the fused image, and SSIM reports the consistency of structural similarities between the fused and source images. The larger the metric values are, the better a fused image is. Please refer to Ma et al. (2019) for more details on these metrics.

We show a quantitative comparison of these fusion methods in Table 1. On the TNO dataset, our method performs best in terms of the MI, $Q^{AB/F}$ and SD metrics, and is ranked second on the EN and SSIM indicators, for which the first places are taken by the TSIFVS and ADF methods, respectively. Meanwhile, on the RGB-NIR Scene dataset, we take first place in MI and SD, and second place in EN, $Q^{AB/F}$ and SSIM. This demonstrates the excellent performance of our method on infrared and visible image fusion compared with the other image fusion methods.
Table 1: Quantitative comparison of the fusion methods.

Dataset: TNO image fusion dataset

Metrics    TSIFVS   TVADMM   CSR      ADF      FPDE     BayesFusion
EN         6.500    6.206    6.225    6.180    6.255    6.432
MI         1.649    1.919    1.900    1.942    1.730    2.448
Q^{AB/F}   0.510    0.340    0.534    0.436    0.508    0.549
SD         25.910   21.078   21.459   20.578   21.327   26.285
SSIM       0.906    0.905    0.864    0.949    0.863    0.937

Dataset: RGB-NIR Scene dataset

Metrics    TSIFVS   TVADMM   CSR      ADF      FPDE     BayesFusion
EN         7.300    7.129    7.170    7.105    7.115    7.201
MI         3.285    3.673    3.699    3.944    3.877    4.078
Q^{AB/F}   0.571    0.530    0.626    0.553    0.580    0.587
SD         43.743   40.469   40.383   38.978   39.192   46.105
SSIM       1.157    1.241    1.130    1.274    1.249    1.251
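For reference, the EN and SD columns can be computed as follows. This is a common formulation for 8-bit grayscale images (exact implementations of fusion metrics vary slightly across papers):

```python
import numpy as np

def entropy(img):
    """Shannon entropy (EN) of an 8-bit grayscale image, in bits."""
    hist = np.bincount(img.ravel().astype(np.uint8), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins before taking the log
    return float(-np.sum(p * np.log2(p)))

def standard_deviation(img):
    """Standard deviation (SD) of the pixel intensities."""
    return float(np.std(img.astype(np.float64)))
```

A constant image has zero entropy, while an image split evenly between two gray levels has an entropy of exactly one bit.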
4 Conclusion
In this paper, we present a novel Bayesian fusion model for infrared and visible images. In our model, the image fusion task is transformed into a regression problem, and a hierarchical Bayesian model is established to solve it. Additionally, the TV penalty is used to make the fused image conform to the human visual system. The model is then efficiently inferred by the EM algorithm combined with the half-quadratic splitting algorithm. Compared with previous methods on the TNO and NIR datasets, our method generates better fused images with highlighted thermal radiation targets and abundant texture details, which can facilitate automatic detection and accurate positioning of targets.
Acknowledgements
The research of S. Xu is supported by the Fundamental Research Funds for the Central Universities under grant number xzy022019059. The research of C.X. Zhang is supported by the National Natural Science Foundation of China under grant 11671317 and the National Key Research and Development Program of China under grant 2018AAA0102201. The research of J.M. Liu is supported by the National Natural Science Foundation of China under grant 61877049 and the research of J.S. Zhang is supported by the National Key Research and Development Program of China under grant 2018YFC0809001, and the National Natural Science Foundation of China under grant 61976174.
References
Fusion of infrared and visible sensor images based on anisotropic diffusion and Karhunen-Loeve transform. IEEE Sensors Journal 16 (1), pp. 203–209.
Two-scale image fusion of visible and infrared images using saliency detection. Infrared Physics & Technology 76, pp. 52–64.
Multi-sensor image fusion based on fourth order partial differential equations. In 2017 20th International Conference on Information Fusion (Fusion), pp. 1–9.
Multispectral SIFT for scene category recognition. In CVPR 2011, pp. 177–184.
Image fusion with local spectral consistency and dynamic gradient sparsity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2760–2765.
Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22.
Infrared and visible image fusion based on total variation and augmented Lagrangian. Journal of the Optical Society of America A 34 (11), pp. 1961–1968.
Multiscale fusion of visible and thermal IR images for illumination-invariant face recognition. International Journal of Computer Vision 71 (2), pp. 215–233.
Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization. Infrared Physics & Technology 67, pp. 161–172.
Pixel-level image fusion: a survey of the state of the art. Information Fusion 33, pp. 100–112.
Performance comparison of different multi-resolution transforms for image fusion. Information Fusion 12 (2), pp. 74–84.
Group-sparse representation with dictionary learning for medical image denoising and fusion. IEEE Transactions on Biomedical Engineering 59 (12), pp. 3450–3459.
Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173.
Image fusion with convolutional sparse representation. IEEE Signal Processing Letters 23 (12), pp. 1882–1886.
Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion 31, pp. 100–109.
Infrared and visible image fusion methods and applications: a survey. Information Fusion 45, pp. 153–178.
A wavelet-based image fusion tutorial. Pattern Recognition 37 (9), pp. 1855–1872.
Image fusion using hierarchical PCA. In 2011 International Conference on Image Information Processing, pp. 1–6.
Information measure for performance of image fusion. Electronics Letters 38 (7), pp. 313–315.
In-fibre Bragg grating sensors. Measurement Science and Technology 8 (4), pp. 355.
Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522.
Image fusion techniques for remote sensing applications. Information Fusion 3 (1), pp. 3–15.
Integrated multilevel image fusion and match score fusion of visible and infrared face images for robust face recognition. Pattern Recognition 41 (3), pp. 880–893.
Progress in color night vision. Optical Engineering 51 (1), pp. 1–20.
Fusion method for infrared and visible images by using non-negative sparse representation. Infrared Physics & Technology 67, pp. 477–489.
A universal image quality index. IEEE Signal Processing Letters 9 (3), pp. 81–84.
Visual attention guided image fusion with sparse representation. Optik - International Journal for Light and Electron Optics 125 (17), pp. 4881–4888.
Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition. Journal of the Optical Society of America A 34 (8), pp. 1400–1410.
A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application. Proceedings of the IEEE 87 (8), pp. 1315–1326.
Infrared image enhancement through saliency feature analysis based on multiscale decomposition. Infrared Physics & Technology 62, pp. 86–93.
Medical image fusion based on sparse representation of classified image patches. Biomedical Signal Processing and Control 34, pp. 195–205.