Bayesian Fusion for Infrared and Visible Images

05/12/2020 ∙ by Zixiang Zhao, et al. ∙ Xi'an Jiaotong University

Infrared and visible image fusion is an active topic in image fusion. The task is to obtain a fused image that contains both the gradient and detailed texture information of the visible image and the thermal radiation and highlighted targets of the infrared image. In this paper, a novel Bayesian fusion model is established for infrared and visible images. In our model, the image fusion task is cast as a regression problem. To measure the uncertainty of the variables, we formulate the model in a hierarchical Bayesian manner. So that the fused image suits the human visual system, the model incorporates a total-variation (TV) penalty. The model is then efficiently inferred by the expectation-maximization (EM) algorithm. We test our algorithm on the TNO and NIR image fusion datasets against several state-of-the-art approaches. Compared with previous methods, the new model generates better fused images with highlighted targets and rich texture details, which can improve the reliability of automatic target detection and recognition systems.




1 Introduction

Image fusion is an image enhancement technique that produces a robust or informative image Ma et al. (2019), and it is an active topic in computer vision. It has a wide range of applications in pattern recognition Singh et al. (2008), medical imaging Zong and Qiu (2017), remote sensing Simone et al. (2002), and modern military systems Chen et al. (2014), all of which need to fuse two or more images of the same scene Li et al. (2017).

The fusion of visible and infrared images improves the perception ability of the human visual system in target detection and recognition Kong et al. (2007). A visible image has rich appearance information, whereas features such as texture and detail are often not obvious in the corresponding infrared image. In contrast, an infrared image mainly reflects the heat radiation emitted by objects; it is less affected by illumination changes or artifacts and overcomes the obstacles to target detection at night. However, the spatial resolution of infrared images is typically lower than that of visible images. Consequently, fusing thermal radiation and texture detail information into a single image facilitates automatic detection and accurate positioning of targets Ma et al. (2016).

Broadly speaking, current algorithms for fusing visible and infrared images can be divided into four categories: multi-scale transformation, sparse representation, subspace and saliency methods Ma et al. (2019). The multi-scale transformation-based methods Liu et al. (2018); Li et al. (2011); Pajares and De La Cruz (2004); Zhang et al. (1999) generally decompose the source images into multiple levels, fuse the layers at the same level with specific fusion strategies, and finally recover the fused image by combining the fused layers. The second category is sparse representation-based methods Yang and Li (2014); Wang et al. (2014); Li et al. (2012), which assume that a natural image can be represented as a sparse linear combination of atoms from an over-complete dictionary, so that the fused image can be recovered by merging the coefficients. The third category is subspace learning-based methods Bavirisetti et al. (2017); Kong et al. (2014); Patil and Mudengudi (2011), which aim to project high-dimensional input images into low-dimensional subspaces to capture the intrinsic features of the original images. The fourth category is saliency-based methods Bavirisetti and Dhuli (2016); Zhang et al. (2017); Zhao et al. (2014). Based on the prior knowledge that humans usually pay more attention to salient objects than to surrounding areas, they fuse images by maintaining the integrity of the salient target areas.

To the best of our knowledge, no Bayesian model has been applied to the image fusion problem. Therefore, we present in this paper a novel Bayesian fusion model for infrared and visible images. In our model, the image fusion task is cast as a regression problem. To measure the uncertainty of the variables, we formulate the model in a hierarchical Bayesian manner. In addition, so that the fused image suits the human visual system, the model incorporates the TV penalty. The model is then efficiently inferred by the EM algorithm. We test our algorithm on the TNO and NIR image fusion datasets against several state-of-the-art approaches. Compared with previous methods, our method generates fused images with highlighted targets and rich texture details, which can improve the reliability of automatic target detection and recognition systems.

The rest of the paper is organized as follows. In Section 2, we introduce the Bayesian fusion method. In Section 3, experiments are conducted to investigate the proposed method and compare it with several state-of-the-art techniques. Finally, conclusions are drawn in Section 4.

2 Bayesian fusion model

In this section, we present a novel Bayesian fusion model for infrared and visible images and then show how it is efficiently inferred by the EM algorithm Dempster et al. (1977).

2.1 Model formulation

Given a pair of pre-registered infrared and visible images, $I, V \in \mathbb{R}^{h \times w}$, image fusion aims at obtaining an informative image $F$ from $I$ and $V$.

It is well known that visible images suit human visual perception but are sensitive to disturbances such as poor illumination and fog. In contrast, infrared images are robust to these disturbances but may lose part of the informative textures. To preserve the general profile of the two images, we minimize the difference between the fused image and the source images, that is,

$$\min_{F} \; f(F, V) + g(F, I),$$

where $f$ and $g$ are loss functions. Typically, we assume that the difference is measured by the $\ell_1$ norm. Thus, the problem can be rewritten as

$$\min_{F} \; \|F - V\|_1 + \|F - I\|_1.$$

Let $X = F - V$ and $Y = I - V$; then we have

$$\min_{X} \; \|X\|_1 + \|Y - X\|_1. \tag{1}$$
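As a quick sanity check of this change of variables, the sketch below evaluates the objective before and after the substitution on random stand-in arrays. The array sizes, the random data and the helper name `l1` are illustrative assumptions and not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((4, 5))  # stand-in infrared image
V = rng.random((4, 5))  # stand-in visible image
F = rng.random((4, 5))  # an arbitrary candidate fused image


def l1(M):
    """Entry-wise l1 norm of a matrix."""
    return np.abs(M).sum()


# Substitution used in the text: X = F - V and Y = I - V.
X = F - V
Y = I - V

original = l1(F - V) + l1(F - I)   # objective in terms of F
substituted = l1(X) + l1(Y - X)    # objective in terms of X
```

Both values agree because $Y - X = I - F$, so the two forms describe the same problem.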


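Essentially, equation (1) corresponds to a linear regression model

$$Y = X + E,$$

where $E$ denotes Laplacian noise and $X$ is governed by a Laplacian distribution. Reformulating this problem in the Bayesian fashion, the conditional distribution of $Y$ given $X$ is

$$p(Y \mid X) = \prod_{i,j} \mathcal{L}\big(y_{ij} \mid x_{ij}, \sqrt{\lambda_a/2}\big),$$

and the prior distribution of $X$ is

$$p(X) = \prod_{i,j} \mathcal{L}\big(x_{ij} \mid 0, \sqrt{\lambda_b/2}\big),$$

where $\mathcal{L}(\xi \mid \mu, s)$ denotes the Laplacian distribution with location $\mu$ and scale $s$. To avoid $\ell_1$ norm optimization, we reformulate the Laplacian distribution as a Gaussian scale mixture with an exponentially distributed prior on the variance, that is,

$$\mathcal{L}\big(\xi \mid \mu, \sqrt{\lambda/2}\big) = \frac{1}{2}\sqrt{\frac{2}{\lambda}} \exp\Big(-\sqrt{\frac{2}{\lambda}}\,|\xi - \mu|\Big) = \int_0^{\infty} \mathcal{N}(\xi \mid \mu, a)\, \mathrm{Exp}(a \mid \lambda)\, da, \tag{2}$$

where $\mathcal{N}(\xi \mid \mu, a)$ denotes the Gaussian distribution with mean $\mu$ and variance $a$, and $\mathrm{Exp}(a \mid \lambda)$ denotes the exponential distribution with scale parameter $\lambda$. According to equation (2), the original model of $Y$ and $X$ can be rewritten in the hierarchical Bayesian manner, that is,

$$y_{ij} \mid x_{ij}, a_{ij} \sim \mathcal{N}(y_{ij} \mid x_{ij}, a_{ij}), \quad a_{ij} \sim \mathrm{Exp}(a_{ij} \mid \lambda_a), \quad x_{ij} \mid b_{ij} \sim \mathcal{N}(x_{ij} \mid 0, b_{ij}), \quad b_{ij} \sim \mathrm{Exp}(b_{ij} \mid \lambda_b),$$

for all $i = 1, \dots, h$ and $j = 1, \dots, w$, where $h$ and $w$ are the height and the width of the input image. In what follows, we use matrices $A$ and $B$ to denote the collections of all latent variables $a_{ij}$ and $b_{ij}$, respectively.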
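The scale-mixture representation can be checked numerically. The sketch below is a Monte Carlo illustration, assuming an exponential prior with scale `lam` on the variance: it draws a variance, then a zero-mean Gaussian with that variance, and compares two moments of the result with those of a Laplacian with scale $\sqrt{\lambda/2}$. The function name, the sample size and the value of `lam` are hypothetical choices for illustration.

```python
import numpy as np


def sample_scale_mixture(lam, n, rng):
    """Draw xi ~ N(0, a) with a ~ Exp(scale=lam), i.e. a Gaussian scale mixture."""
    a = rng.exponential(scale=lam, size=n)          # latent variances
    return rng.normal(loc=0.0, scale=np.sqrt(a))    # one Gaussian draw per variance


rng = np.random.default_rng(0)
lam = 2.0
xi = sample_scale_mixture(lam, 200_000, rng)

laplace_scale = np.sqrt(lam / 2.0)
mean_abs = np.abs(xi).mean()  # a Laplacian with scale s has E|xi| = s
variance = xi.var()           # and variance 2 * s**2, which equals lam here
```

With a large sample, `mean_abs` should be close to `laplace_scale` and `variance` close to `lam`, which is what the Laplacian marginal predicts.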

Besides modeling the general profiles, the image textures should be taken into consideration so that the fused image suits human visual perception. As discussed above, visible images contain plenty of high-frequency information, but the corresponding areas often cannot be observed in infrared images. To preserve the edge information of visible images, we regularize the fused image in the gradient domain with a gradient sparsity regularizer expressed as

$$\psi \|\nabla F - \nabla V\|_1 = \psi \|\nabla X\|_1,$$

where $\psi$ is a hyper-parameter controlling the strength of regularization and $\nabla$ denotes the gradient operator. This regularizer encourages the fused image to have textures similar to those of the visible image.
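A minimal sketch of such a gradient sparsity penalty is given below. It assumes forward differences with periodic boundaries and an anisotropic $\ell_1$ norm (the horizontal and vertical differences are summed separately); the discretization used by the authors is not stated in the text, and the function names and the weight `psi` are illustrative.

```python
import numpy as np


def gradient(img):
    """Forward differences with periodic wrap-around: (horizontal, vertical)."""
    dx = np.roll(img, -1, axis=1) - img
    dy = np.roll(img, -1, axis=0) - img
    return dx, dy


def gradient_l1(fused, visible, psi):
    """psi * ||grad(fused) - grad(visible)||_1, using linearity of the gradient."""
    dx, dy = gradient(fused - visible)
    return psi * (np.abs(dx).sum() + np.abs(dy).sum())
```

The penalty is zero whenever the fused image differs from the visible image only by a constant offset, and it grows with every edge that the two images do not share.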

Figure 1: Illustration of our Bayesian graph model.

By combining general profiles and gradients modeling, Fig. 1 displays the graphical expression of our hierarchical Bayesian model. Specifically, in the first level, is a latent variable, while and are observed and unknown variables, respectively. In the second level, and are latent and observed variables, respectively. And and are hyper-parameters. By ignoring the constant not depending on , the log-likelihood of the model can be expressed as

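$$\ell(X) = -\sum_{i,j} \left[ \frac{(x_{ij} - y_{ij})^2}{2 a_{ij}} + \frac{x_{ij}^2}{2 b_{ij}} \right] - \psi \|\nabla X\|_1.$$

In the next subsection, we discuss how to infer this model.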

2.2 Model inference

As is well known, the EM algorithm is an effective tool for maximizing the log-likelihood function of a problem that involves latent variables. In detail, we first initialize the unknown variable $X$. Then, in the E-step, we calculate the expectation of the log-likelihood function with respect to $p(A, B \mid X^{(t)}, Y)$, which is referred to as the $Q$-function,

$$Q(X \mid X^{(t)}) = \mathbb{E}_{A, B \mid X^{(t)}, Y}\big[\ell(X)\big].$$

In the M-step, we find $X$ to maximize the $Q$-function, i.e.,

$$X^{(t+1)} = \arg\max_{X} \; Q(X \mid X^{(t)}).$$

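E-step: To obtain the $Q$-function of our model, $\mathbb{E}[1/a_{ij}]$ and $\mathbb{E}[1/b_{ij}]$ must be computed. For convenience, we compute the posterior distributions of $\tilde{a}_{ij} = 1/a_{ij}$ and $\tilde{b}_{ij} = 1/b_{ij}$. Since the prior distribution of $a_{ij}$ is assumed to be $\mathrm{Exp}(a_{ij} \mid \lambda_a)$, $\tilde{a}_{ij}$ is governed by the inverse gamma distribution with shape parameter 1 and scale parameter $1/\lambda_a$, and its probability density function is given by

$$p(\tilde{a}_{ij}) = \frac{1}{\lambda_a}\, \tilde{a}_{ij}^{-2} \exp\Big(-\frac{1}{\lambda_a \tilde{a}_{ij}}\Big).$$

According to the Bayesian formula, the posterior of $\tilde{a}_{ij}$ is the inverse Gaussian distribution, that is,

$$p(\tilde{a}_{ij} \mid y_{ij}, x_{ij}) = \mathcal{IN}(\tilde{a}_{ij} \mid \alpha_{ij}, \kappa_a),$$

where $\alpha_{ij} = \sqrt{2 / \big(\lambda_a (y_{ij} - x_{ij})^2\big)}$ and $\kappa_a = 2/\lambda_a$. As for $\tilde{b}_{ij}$, we can compute its posterior in the same way. Its prior density is

$$p(\tilde{b}_{ij}) = \frac{1}{\lambda_b}\, \tilde{b}_{ij}^{-2} \exp\Big(-\frac{1}{\lambda_b \tilde{b}_{ij}}\Big).$$

Similarly, the posterior of $\tilde{b}_{ij}$ is

$$p(\tilde{b}_{ij} \mid x_{ij}) = \mathcal{IN}(\tilde{b}_{ij} \mid \beta_{ij}, \kappa_b),$$

where $\beta_{ij} = \sqrt{2 / (\lambda_b x_{ij}^2)}$ and $\kappa_b = 2/\lambda_b$. Note that the expectation of the inverse Gaussian distribution is its location parameter. Thus, we have

$$\mathbb{E}[1/a_{ij}] = \alpha_{ij} = \sqrt{\frac{2}{\lambda_a (y_{ij} - x_{ij})^2}}, \tag{3}$$

$$\mathbb{E}[1/b_{ij}] = \beta_{ij} = \sqrt{\frac{2}{\lambda_b x_{ij}^2}}, \tag{4}$$

both evaluated at the current estimate of $X$. Thereafter, in the E-step, the $Q$-function is given by

$$Q = -\frac{1}{2}\|W_1 \odot (X - Y)\|_F^2 - \frac{1}{2}\|W_2 \odot X\|_F^2 - \psi \|\nabla X\|_1,$$

where the symbol $\odot$ means element-wise multiplication, and the $(i,j)$th entries of $W_1$ and $W_2$ are $\sqrt{\mathbb{E}[1/a_{ij}]}$ and $\sqrt{\mathbb{E}[1/b_{ij}]}$, respectively.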
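The E-step expectations can be sketched as follows, assuming the hierarchical model with Gaussian likelihoods and exponential priors of scales `lam_a` and `lam_b` on the variances. Under that assumption the posterior mean of each inverse variance is $\sqrt{2/\lambda}$ divided by the absolute residual. The function name, the argument names and the small floor `eps` that guards against division by zero are illustrative additions.

```python
import numpy as np


def e_step_weights(X, Y, lam_a, lam_b, eps=1e-8):
    """Posterior means E[1/a_ij] and E[1/b_ij] (assumed closed forms).

    E[1/a_ij] = sqrt(2 / lam_a) / |y_ij - x_ij|
    E[1/b_ij] = sqrt(2 / lam_b) / |x_ij|
    eps floors the denominators; it is not part of the model.
    """
    exp_inv_a = np.sqrt(2.0 / lam_a) / np.maximum(np.abs(Y - X), eps)
    exp_inv_b = np.sqrt(2.0 / lam_b) / np.maximum(np.abs(X), eps)
    return exp_inv_a, exp_inv_b
```

The returned arrays are the squares of the weight matrices that multiply the two quadratic terms, so pixels with small residuals receive large weights, as in iteratively reweighted least squares for an $\ell_1$ objective.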

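M-step: Here, we need to minimize the negative $Q$-function with respect to $X$. The half-quadratic splitting algorithm is employed to deal with this problem, i.e.,

$$\min_{X, H, K} \; \frac{1}{2}\|W_1 \odot (X - Y)\|_F^2 + \frac{1}{2}\|W_2 \odot X\|_F^2 + \psi \|H\|_1, \quad \text{s.t. } H = \nabla K, \; K = X.$$

It can be further cast into the following unconstrained optimization problem,

$$\min_{X, H, K} \; \frac{1}{2}\|W_1 \odot (X - Y)\|_F^2 + \frac{1}{2}\|W_2 \odot X\|_F^2 + \psi \|H\|_1 + \frac{\rho}{2}\Big(\|H - \nabla K\|_F^2 + \|K - X\|_F^2\Big).$$

The unknown variables $X$, $H$ and $K$ can be solved iteratively in the coordinate descent fashion.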

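Update $X$: It is a least squares problem,

$$\min_{X} \; \frac{1}{2}\|W_1 \odot (X - Y)\|_F^2 + \frac{1}{2}\|W_2 \odot X\|_F^2 + \frac{\rho}{2}\|X - K\|_F^2.$$

The solution of $X$ is

$$X = \big(W_1^2 \odot Y + \rho K\big) \oslash \big(W_1^2 + W_2^2 + \rho\big), \tag{5}$$

where the symbol $\oslash$ means element-wise division and the squares are taken element-wise.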
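A weighted least squares update of this kind has a per-pixel closed form. The sketch below assumes the objective is the sum of two weighted quadratic terms (with weights `w1` and `w2` holding the expected inverse variances) and a quadratic coupling of strength `rho` to an auxiliary image `K`. The function and argument names are hypothetical.

```python
import numpy as np


def update_x(Y, K, w1, w2, rho):
    """Minimize 0.5*w1*(x-y)^2 + 0.5*w2*x^2 + 0.5*rho*(x-k)^2 for every pixel.

    Setting the derivative to zero gives x = (w1*y + rho*k) / (w1 + w2 + rho).
    """
    return (w1 * Y + rho * K) / (w1 + w2 + rho)
```

Because every operation is element-wise, the update costs one pass over the image and needs no linear solver.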

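Update $H$: It is an $\ell_1$ norm penalized regression problem,

$$\min_{H} \; \psi \|H\|_1 + \frac{\rho}{2}\|H - \nabla K\|_F^2.$$

The solution is

$$H = \mathcal{S}_{\psi/\rho}(\nabla K), \tag{6}$$

where $\mathcal{S}_{\tau}(z) = \mathrm{sign}(z)\max(|z| - \tau, 0)$ is the soft-thresholding operator, applied element-wise.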
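An $\ell_1$ penalized quadratic problem of this form is solved by element-wise soft-thresholding, which is the proximal operator of the $\ell_1$ norm. The sketch below shows that operator; the function name and the threshold argument `tau` are illustrative, and in the setting above the threshold would be the ratio of the $\ell_1$ weight to the quadratic weight.

```python
import numpy as np


def soft_threshold(z, tau):
    """Element-wise minimizer of tau*|h| + 0.5*(h - z)^2."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
```

Entries with magnitude below `tau` are set to zero and the rest are shrunk toward zero by `tau`, which is what makes the resulting gradient field sparse.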

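Update $K$: It is a deconvolution problem,

$$\min_{K} \; \|H - \nabla K\|_F^2 + \|K - X\|_F^2.$$

It can be efficiently solved by the fast Fourier transform (fft) and inverse fft (ifft) operators, and the solution is

$$K = \mathrm{ifft}\Big\{ \big(\mathrm{fft}(X) + \overline{\mathrm{fft}(\nabla)} \odot \mathrm{fft}(H)\big) \oslash \big(1 + \overline{\mathrm{fft}(\nabla)} \odot \mathrm{fft}(\nabla)\big) \Big\}, \tag{7}$$

where $\overline{\,\cdot\,}$ denotes the complex conjugation, and the products involving $\mathrm{fft}(\nabla)$ are summed over the horizontal and vertical gradient directions.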
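The Fourier-domain solve can be sketched as below. It assumes periodic boundary conditions and forward-difference gradients, so that the gradient acts as a circular convolution whose transfer function is obtained by transforming the gradient of a unit impulse. The function names `grad` and `solve_k` and the split of the gradient field into `Hx` and `Hy` are illustrative choices.

```python
import numpy as np


def grad(img):
    """Forward differences with periodic boundaries: (horizontal, vertical)."""
    return np.roll(img, -1, axis=1) - img, np.roll(img, -1, axis=0) - img


def solve_k(X, Hx, Hy):
    """Minimize ||K - X||^2 + ||grad(K) - (Hx, Hy)||^2 in closed form via the FFT."""
    delta = np.zeros_like(X)
    delta[0, 0] = 1.0
    # Transfer functions of the two difference operators.
    kx, ky = (np.fft.fft2(k) for k in grad(delta))
    numer = (np.fft.fft2(X)
             + np.conj(kx) * np.fft.fft2(Hx)
             + np.conj(ky) * np.fft.fft2(Hy))
    denom = 1.0 + np.abs(kx) ** 2 + np.abs(ky) ** 2
    return np.real(np.fft.ifft2(numer / denom))
```

A useful consistency check is that passing an image together with its own gradient returns the image unchanged, since that input already makes both quadratic terms zero.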

To make the model more flexible, the hyper-parameters $\lambda_a$ and $\lambda_b$ are updated automatically. According to empirical Bayes, we have
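One way such updates can arise, assuming the two hyper-parameters are the scale parameters of the exponential priors on $a_{ij}$ and $b_{ij}$ (written here as $\lambda_a$ and $\lambda_b$, a notational assumption), is to maximize the expected complete-data log-likelihood with respect to each scale. The derivation below is a sketch under that assumption. It uses the fact that $\mathbb{E}[1/Z] = 1/\mu + 1/\kappa$ for an inverse Gaussian variable $Z$ with mean $\mu$ and shape $\kappa$, applied to $Z = 1/a_{ij}$ with $\mu = \sqrt{2/\lambda_a}/|y_{ij} - x_{ij}|$ and $\kappa = 2/\lambda_a$.

```latex
\begin{aligned}
\lambda_a^{\text{new}}
  &= \arg\max_{\lambda} \sum_{i,j} \mathbb{E}\big[\log \mathrm{Exp}(a_{ij} \mid \lambda)\big]
   = \arg\max_{\lambda} \sum_{i,j} \Big(-\log \lambda - \frac{\mathbb{E}[a_{ij}]}{\lambda}\Big)
   = \frac{1}{hw} \sum_{i,j} \mathbb{E}[a_{ij}], \\
\mathbb{E}[a_{ij}]
  &= |y_{ij} - x_{ij}| \sqrt{\lambda_a / 2} + \lambda_a / 2, \\
\lambda_b^{\text{new}}
  &= \frac{1}{hw} \sum_{i,j} \mathbb{E}[b_{ij}], \qquad
\mathbb{E}[b_{ij}] = |x_{ij}| \sqrt{\lambda_b / 2} + \lambda_b / 2 .
\end{aligned}
```

At a fixed point of the first update, $\sqrt{\lambda_a/2}$ equals the mean absolute residual, which is the maximum likelihood estimate of a Laplacian scale.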




2.3 Algorithm and implement details

Algorithm 1 summarizes the workflow of our proposed model, where E-step and M-step alternate with each other until the maximum iteration number is reached. Since there is no analytic solution in M-step, we maximize -function by updating times. It is found that does not affect performance very much. To reduce computation, we set . Furthermore, it is found that algorithm generates a satisfactory result if the outer loop iterations is set to 15. Note that hyper-parameter and denote the strength of gradient and norm penalties, respectively. Empirical studies suggest to set and .

Algorithm 1 Bayesian Fusion
Input: infrared image $I$, visible image $V$, maximum iteration numbers of the outer and inner loops.
Output: fused image $F$.
$Y = I - V$; initialize the variables;
for each outer iteration do
   % (M-step)
   for each inner iteration do
      Update $X$, $H$ and $K$ with Eqs. (5), (6) and (7), respectively.
   end for
   % (E-step)
   Evaluate expectations by Eqs. (3) and (4).
   Update hyper-parameters by Eqs. (8) and (9).
end for
$F = X + V$.
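The workflow above can be sketched end to end in NumPy. This is an illustrative sketch, not the authors' implementation: it assumes periodic boundaries, forward-difference gradients, the closed-form updates and the empirical-Bayes scale updates described in the comments, and an initialization at the midpoint of the two sources. The default values of `psi`, `rho` and `n_inner` are hypothetical placeholders; only the 15 outer iterations come from the text.

```python
import numpy as np


def grad(img):
    """Forward differences with periodic boundaries: (horizontal, vertical)."""
    return np.roll(img, -1, axis=1) - img, np.roll(img, -1, axis=0) - img


def soft(z, tau):
    """Element-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def bayes_fusion_sketch(ir, vis, psi=0.05, rho=1.0, n_outer=15, n_inner=2, eps=1e-6):
    ir = np.asarray(ir, dtype=np.float64)
    vis = np.asarray(vis, dtype=np.float64)
    Y = ir - vis
    X = 0.5 * Y                      # assumed initialization
    K = X.copy()                     # auxiliary copy of X used by the splitting
    w1 = np.ones_like(Y)             # E[1/a_ij], assumed initial value
    w2 = np.ones_like(Y)             # E[1/b_ij], assumed initial value
    lam_a = lam_b = 1.0              # assumed initial scales

    # Transfer functions of the difference operators (periodic boundaries).
    delta = np.zeros_like(Y)
    delta[0, 0] = 1.0
    kx, ky = (np.fft.fft2(k) for k in grad(delta))
    denom = 1.0 + np.abs(kx) ** 2 + np.abs(ky) ** 2

    for _ in range(n_outer):
        # M-step: a few sweeps of coordinate descent on the split objective.
        for _ in range(n_inner):
            X = (w1 * Y + rho * K) / (w1 + w2 + rho)          # weighted least squares
            gx, gy = grad(K)
            Hx, Hy = soft(gx, psi / rho), soft(gy, psi / rho)  # l1 proximal step
            numer = (np.fft.fft2(X)
                     + np.conj(kx) * np.fft.fft2(Hx)
                     + np.conj(ky) * np.fft.fft2(Hy))
            K = np.real(np.fft.ifft2(numer / denom))           # FFT solve

        # E-step: posterior means of the inverse variances.
        res = np.maximum(np.abs(Y - X), eps)
        mag = np.maximum(np.abs(X), eps)
        w1 = np.sqrt(2.0 / lam_a) / res
        w2 = np.sqrt(2.0 / lam_b) / mag

        # Assumed empirical-Bayes updates of the exponential scales.
        lam_a = np.mean(res * np.sqrt(lam_a / 2.0) + lam_a / 2.0)
        lam_b = np.mean(mag * np.sqrt(lam_b / 2.0) + lam_b / 2.0)

    return X + vis                   # F = X + V
```

A call such as `fused = bayes_fusion_sketch(ir, vis)` expects two registered grayscale arrays of the same shape. When the two inputs are identical, the sketch returns that input unchanged, which is a convenient sanity check.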

3 Experiments

This section studies the behavior of our proposed model and of other popular counterparts, including CSR Liu et al. (2016), ADF Bavirisetti and Dhuli (2015), FPDE Bavirisetti et al. (2017), TSIFVS Bavirisetti and Dhuli (2016) and TVADMM Guo et al. (2017). All experiments are conducted with MATLAB on a computer with an Intel Core i7-9750H CPU @ 2.60GHz.

3.1 Experimental data

In this experiment, we test our algorithm on the TNO image fusion dataset Toet and Hogervorst (2012) and the RGB-NIR Scene dataset Brown and Süsstrunk (2011). We use 20 pairs of infrared and visible images from the TNO dataset and 52 pairs from the “country” scene of the NIR dataset. In the TNO dataset, the objects of interest cannot be observed in the visible images because they were shot at night, whereas they are salient in the infrared images but lack texture. The NIR dataset, in contrast, was captured in daylight, and we use it to test whether the fused image can carry more detailed and highlight information.

3.2 Subjective visual evaluation

Figure 2: Qualitative fusion results. From left to right: “Soldier_in_trench_1”, “Image_04” and “Marne_04” in TNO dataset, “Image_13” and “Image_35” in NIR dataset. From top to bottom: infrared images, visible images, results of our method, TSIFVS, CSR, ADF, FPDE and TVADMM methods.

Figure 2 shows the qualitative fusion results. In the first column, the TSIFVS and ADF results show almost no facial details, the TVADMM result has low target brightness, and the backgrounds of the CSR and FPDE results (such as the trenches) are not clear enough. In the second column, the house details in the CSR result are poor and the ground details in the ADF result are not obvious enough. Meanwhile, the targets in the TVADMM and TSIFVS results have low brightness, and the background details (e.g., the trees) in the FPDE result are not clear enough. In the third column, the FPDE and ADF results have lower brightness and fewer details, the TVADMM and CSR results have poorer window details, and the TSIFVS result has less obvious edge contours. In the fourth and fifth columns, the edge contours of the TSIFVS results do not suit the human visual system because of the overly clear boundaries. The CSR and TVADMM results are not salient enough in the details and edges of the trees and clouds. Objects (trees and mountains) in the ADF results are poorly highlighted, and the FPDE results are visually blurred with fewer details.

In short, compared with the previous methods, our proposed Bayesian fusion model generates better fused images with highlighted targets and rich texture details.

3.3 Objective quantitative evaluation

We calculate the averages over the selected image pairs of the entropy (EN) Roberts et al. (2008), mutual information (MI) Qu et al. (2002), $Q^{AB/F}$ Qu et al. (2002), standard deviation (SD) Rao (1997) and structural similarity index measure (SSIM) Wang and Bovik (2002) metrics for our proposed model and the other popular counterparts. EN and SD measure how much information is contained in an image. $Q^{AB/F}$ reflects the edge information preserved in the fused image. MI measures the agreement between the source images and the fused image, and SSIM reports the consistency in terms of structural similarity between the fused and source images. The larger the metric values are, the better the fused image is. Please refer to Ma et al. (2019) for more details on these metrics.
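For concreteness, the two single-image metrics can be sketched as follows. The sketch assumes 8-bit grayscale images, the Shannon entropy of the 256-bin gray-level histogram in bits, and the usual population standard deviation. The exact normalizations used in the experiments are not stated in the text, so these definitions and the function names are assumptions.

```python
import numpy as np


def entropy(img_u8):
    """Shannon entropy (bits) of the gray-level histogram of a uint8 image."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())


def standard_deviation(img):
    """Population standard deviation of the pixel intensities."""
    return float(np.std(np.asarray(img, dtype=np.float64)))
```

An image split evenly between two gray levels has an entropy of one bit, while a constant image has zero entropy and zero standard deviation.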

We show a quantitative comparison of these fusion methods in Table 1. On the TNO dataset, our method performs best in terms of the MI, $Q^{AB/F}$ and SD metrics and ranks second in the EN and SSIM metrics, where the TSIFVS and ADF methods rank first, respectively. On the RGB-NIR Scene dataset, our method ranks first in MI and SD and second in EN, $Q^{AB/F}$ and SSIM. These results demonstrate the strong performance of our method on infrared and visible image fusion compared with the other fusion methods.

Dataset: TNO image fusion dataset
EN          6.500    6.206    6.225    6.180    6.255    6.432
MI          1.649    1.919    1.900    1.942    1.730    2.448
Q^{AB/F}    0.510    0.340    0.534    0.436    0.508    0.549
SD         25.910   21.078   21.459   20.578   21.327   26.285
SSIM        0.906    0.905    0.864    0.949    0.863    0.937
Dataset: RGB-NIR Scene Dataset
EN          7.300    7.129    7.170    7.105    7.115    7.201
MI          3.285    3.673    3.699    3.944    3.877    4.078
Q^{AB/F}    0.571    0.530    0.626    0.553    0.580    0.587
SD         43.743   40.469   40.383   38.978   39.192   46.105
SSIM        1.157    1.241    1.130    1.274    1.249    1.251
Table 1: Quantitative results of different methods. The largest value is shown in bold, and the second largest value is underlined.

4 Conclusion

In this paper, we present a novel Bayesian fusion model for infrared and visible images. In our model, the image fusion task is transformed into a regression problem, and a hierarchical Bayesian formulation is established to solve it. Additionally, the TV penalty is used to make the fused image suit the human visual system. The model is then efficiently inferred by the EM algorithm combined with the half-quadratic splitting algorithm. Compared with previous methods on the TNO and NIR datasets, our method generates better fused images with highlighted thermal radiation targets and abundant texture details, which can facilitate automatic detection and accurate positioning of targets.


The research of S. Xu is supported by the Fundamental Research Funds for the Central Universities under grant number xzy022019059. The research of C.X. Zhang is supported by the National Natural Science Foundation of China under grant 11671317 and the National Key Research and Development Program of China under grant 2018AAA0102201. The research of J.M. Liu is supported by the National Natural Science Foundation of China under grant 61877049 and the research of J.S. Zhang is supported by the National Key Research and Development Program of China under grant 2018YFC0809001, and the National Natural Science Foundation of China under grant 61976174.


  • D. P. Bavirisetti and R. Dhuli (2015) Fusion of infrared and visible sensor images based on anisotropic diffusion and Karhunen-Loeve transform. IEEE Sensors Journal 16 (1), pp. 203–209.
  • D. P. Bavirisetti and R. Dhuli (2016) Two-scale image fusion of visible and infrared images using saliency detection. Infrared Physics & Technology 76, pp. 52–64.
  • D. P. Bavirisetti, G. Xiao, and G. Liu (2017) Multi-sensor image fusion based on fourth order partial differential equations. In 2017 20th International Conference on Information Fusion (Fusion), pp. 1–9.
  • M. Brown and S. Süsstrunk (2011) Multi-spectral SIFT for scene category recognition. In CVPR 2011, pp. 177–184.
  • C. Chen, Y. Li, W. Liu, and J. Huang (2014) Image fusion with local spectral consistency and dynamic gradient sparsity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2760–2765.
  • A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22.
  • H. Guo, Y. Ma, X. Mei, and J. Ma (2017) Infrared and visible image fusion based on total variation and augmented Lagrangian. Journal of the Optical Society of America A 34 (11), pp. 1961–1968.
  • S. G. Kong, J. Heo, F. Boughorbel, Y. Zheng, B. R. Abidi, A. Koschan, M. Yi, and M. A. Abidi (2007) Multiscale fusion of visible and thermal IR images for illumination-invariant face recognition. International Journal of Computer Vision 71 (2), pp. 215–233.
  • W. Kong, Y. Lei, and H. Zhao (2014) Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization. Infrared Physics & Technology 67, pp. 161–172.
  • S. Li, X. Kang, L. Fang, J. Hu, and H. Yin (2017) Pixel-level image fusion: a survey of the state of the art. Information Fusion 33, pp. 100–112.
  • S. Li, B. Yang, and J. Hu (2011) Performance comparison of different multi-resolution transforms for image fusion. Information Fusion 12 (2), pp. 74–84.
  • S. Li, H. Yin, and L. Fang (2012) Group-sparse representation with dictionary learning for medical image denoising and fusion. IEEE Transactions on Biomedical Engineering 59 (12), pp. 3450–3459.
  • Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, and X. Wang (2018) Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173.
  • Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang (2016) Image fusion with convolutional sparse representation. IEEE Signal Processing Letters 23 (12), pp. 1882–1886.
  • J. Ma, C. Chen, C. Li, and J. Huang (2016) Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion 31, pp. 100–109.
  • J. Ma, Y. Ma, and C. Li (2019) Infrared and visible image fusion methods and applications: a survey. Information Fusion 45, pp. 153–178.
  • G. Pajares and J. M. De La Cruz (2004) A wavelet-based image fusion tutorial. Pattern Recognition 37 (9), pp. 1855–1872.
  • U. Patil and U. Mudengudi (2011) Image fusion using hierarchical PCA. In 2011 International Conference on Image Information Processing, pp. 1–6.
  • G. Qu, D. Zhang, and P. Yan (2002) Information measure for performance of image fusion. Electronics Letters 38 (7), pp. 313–315.
  • Y. Rao (1997) In-fibre Bragg grating sensors. Measurement Science and Technology 8 (4), pp. 355.
  • J. W. Roberts, J. A. Van Aardt, and F. B. Ahmed (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522.
  • G. Simone, A. Farina, F. C. Morabito, S. B. Serpico, and L. Bruzzone (2002) Image fusion techniques for remote sensing applications. Information Fusion 3 (1), pp. 3–15.
  • R. Singh, M. Vatsa, and A. Noore (2008) Integrated multilevel image fusion and match score fusion of visible and infrared face images for robust face recognition. Pattern Recognition 41 (3), pp. 880–893.
  • A. Toet and M. A. Hogervorst (2012) Progress in color night vision. Optical Engineering 51 (1), pp. 1–20.
  • J. Wang, J. Peng, X. Feng, G. He, and J. Fan (2014) Fusion method for infrared and visible images by using non-negative sparse representation. Infrared Physics & Technology 67, pp. 477–489.
  • Z. Wang and A. C. Bovik (2002) A universal image quality index. IEEE Signal Processing Letters 9 (3), pp. 81–84.
  • B. Yang and S. Li (2014) Visual attention guided image fusion with sparse representation. Optik-International Journal for Light and Electron Optics 125 (17), pp. 4881–4888.
  • X. Zhang, Y. Ma, F. Fan, Y. Zhang, and J. Huang (2017) Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition. Journal of the Optical Society of America A 34 (8), pp. 1400–1410.
  • Z. Zhang, R. S. Blum, et al. (1999) A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application. Proceedings of the IEEE 87 (8), pp. 1315–1326.
  • J. Zhao, Y. Chen, H. Feng, Z. Xu, and Q. Li (2014) Infrared image enhancement through saliency feature analysis based on multi-scale decomposition. Infrared Physics & Technology 62, pp. 86–93.
  • J. Zong and T. Qiu (2017) Medical image fusion based on sparse representation of classified image patches. Biomedical Signal Processing and Control 34, pp. 195–205.