A Robust Non-Linear and Feature-Selection Image Fusion Theory

12/23/2019 ∙ by Aiqing Fang, et al.

The human visual perception system exhibits strong robustness in image fusion, a robustness grounded in two of its characteristics: feature selection and non-linear fusion of different features. To simulate the human visual perception mechanism in image fusion tasks, we propose a multi-source image fusion framework that combines illuminance factors and attention mechanisms, effectively integrating traditional image features with modern deep learning features. First, we perform multi-scale decomposition of the multi-source images. Next, the visual saliency map and the deep feature map are combined with the illuminance fusion factor to perform non-linear fusion of the high- and low-frequency components. The fused high- and low-frequency features are then passed through a channel attention network, which selects among them to obtain the final fused image. By simulating the non-linear and selection characteristics of the human visual perception system, the fused image conforms more closely to the human visual perception mechanism. Finally, we validate our fusion framework on public datasets of infrared and visible images, medical images and multi-focus images. The experimental results demonstrate the superiority of our framework over state-of-the-art methods in visual quality, objective fusion metrics and robustness.




1 Introduction

Robustness has long been a bottleneck that restricts the application and popularization of traditional image fusion technology, whereas human beings exhibit strong robustness in multi-source image fusion. From the perspective of cognitive psychology, the human visual perception system selectively attends to external stimuli, and the human brain fuses the perceived information non-linearly Treisman1980A . Owing to these two characteristics, human beings perform image fusion tasks with greater robustness.

In the past few decades, researchers have proposed many image fusion methods based on human visual perception characteristics: for example, multi-scale decomposition methods based on the sensitivity of human eyes to regions of different brightness Li2011Performance , convolutional neural network methods inspired by neurobiology Xiang2015A , and saliency methods based on the human visual attention mechanism ZhangInfrared . Among them, multi-scale decomposition focuses on hierarchical feature extraction; convolutional neural network methods focus on learning image characteristics in a data-driven manner; and visual saliency methods focus on feature extraction in salient regions. For their fusion criteria, these methods generally rely on weighted averaging, maximum selection or principal component analysis Ma2018Infrared , and few address non-linear feature fusion. As for feature selection, emphasis is placed on extracting and selecting image features in the early stage, while an effective selection of the fused features is lacking.

The human visual fusion perception system is a highly complex non-linear system. Its complexity is reflected not only in feature extraction but also in how the human brain fuses image information Treisman1980A . In a multi-source image fusion task, the human brain filters the perceived target features according to subjective intention, ignores uncertain signals, and fuses non-mutually-exclusive features according to prior knowledge. To make multi-source image fusion results conform more closely to the human visual fusion mechanism, and to narrow the gap between human visual fusion and multi-source image fusion, we propose a multi-source image fusion framework based on cognitive psychology theory. The framework is not a simple superposition of deep learning features and traditional image features: it combines the high- and low-frequency information, visual saliency information, deep learning features and illuminance information of the original images. The illuminance information serves as a non-linear fusion factor that simulates the non-linear fusion characteristics of the human visual system, while an attention network simulates the human eye's selection among fusion features.

To demonstrate the superiority of our framework over existing mainstream algorithmic frameworks, we give a representative example in Figure 1. The images are a thermal infrared and a visible image drawn from the FLIR FLIR traffic dataset. From the fused images, we can clearly see a strong boundary effect around the glare: mainstream image fusion algorithms cannot effectively remove it. By introducing a non-linear illuminance influence factor, our framework effectively removes the glare regions, and the fused image has higher clarity.

Figure 1: Schematic illustration of image fusion. From left to right: IR imageFLIR , Visible imageFLIR , JSRZhang2013Dictionary , OURS, WLSMa2017InfraredWLS , DDLatLRRLiu2011LatentLATLRR , LTLRRLi2018InfraredLTLRR , ZCALi2018Infrared , CNNLiu2017InfraredCNN , CVTNencini2007RemoteCVT , DLLi_2018DL , DTCWTLiu2015MultiDSIFT , FusionGANMaFusionGAN , GFShutao2013ImageGF , GTFMa2016InfraredGTF , LPBurt1987TheLP , FEZLahoud2019FastZERO , CBFShreyamsha2015ImageCBF , CSRLiu2016ImageCSR , JSRDLiu2017InfraredJSR-SD , LP-SRLiu2015ALPSR , MSVDNaidu2011Image , RPToet1989ImageRP , WaveletChipman1995Wavelets . Our method handles the highlight region well, and its fusion result coincides better with the human visual perception mechanism.

The main contributions of our work include the following three points:

Firstly, in the multi-source image fusion task, we propose an image fusion method with feature selection characteristics.

Secondly, based on the non-linear characteristics of human visual fusion perception, we propose a non-linear multi-source image fusion method combining illumination factors.

Finally, based on the human visual perception fusion mechanism, we propose a robust multi-source image fusion framework with traditional methods and deep learning knowledge.

2 Related work

2.1 Image fusion

Based on multi-scale decomposition theory, Bavirisetti 8009719 proposed an image fusion algorithm combining multi-scale decomposition and visual saliency. Compared with plain multi-scale decomposition, the algorithm yields better fusion quality, but its feature extraction remains insufficient. Liu Liu2016ImageCSR therefore proposed a multi-modal image fusion method based on convolutional sparse representation, which overcomes two shortcomings of sparse representation methods: insufficient extraction of high-frequency image features and high sensitivity to misregistered images. Compared with traditional sparse representation, this method achieves better results in both objective image quality metrics and visual evaluation. In addition, Liu Liu2017MultiCNN proposed a deep convolutional neural network framework for multi-focus image fusion, which extracts image features with a deep network and classifies pixels by focus to obtain a fully focused, clear image. However, the method is not universal, being suited only to multi-focus fusion; moreover, the network extracts insufficient features and does not make full use of the high-frequency information in its lower layers. Li Li_2018DL proposed a pre-trained deep convolutional neural network based on two-scale decomposition for infrared and visible images. The algorithm first decomposes the infrared and visible images into high and low frequencies, applies a weighted-average fusion criterion to the low frequencies, and fuses the two channels of high-frequency information with a pre-trained VGG19 model; the high- and low-frequency feature maps are then added. This method innovatively combines multi-scale transformation with a neural network, improving the fusion result to a certain extent, yet it still uses a simple weighted average for fusion and shares the earlier methods' weaknesses in deep feature extraction. To address these problems, Li Li2018DenseFuse proposed DenseFuse for infrared and visible images, which uses dense blocks with multiple skip connections to make full use of low-level image information. However, its loss function uses only a single evaluation index and its fusion rule is again a weighted average, so the improvement in image quality is limited. Meanwhile, Ma MaFusionGAN introduced generative adversarial networks into infrared and visible image fusion for the first time, concatenating the original images before training the adversarial network. Lahoud Lahoud2019FastZERO proposed a zero-learning image fusion method that combines traditional image features with deep learning features; it requires no specially designed network for training and offers very good real-time performance compared with other methods.

Existing image fusion algorithms suffer from the following problems. First, effective feature extraction is insufficient: most algorithms fuse features directly after extracting them, with no secondary selection of features Liu2017MultiCNN ; Liu2016ImageCSR ; Li_2018DL ; Lahoud2019FastZERO ; Li2018DenseFuse . Second, the fusion rules are simplistic: most algorithms combine the extracted features by simple weighted averaging, maximum selection or principal component analysis (PCA) Ma2018Infrared , and the non-linear relationships between features are not fully considered, which is inconsistent with the human visual perception mechanism. Finally, a large gap remains between the multi-source fusion results of mainstream algorithms and those of human visual fusion. To overcome these problems, and inspired by the selection characteristic of human visual perception, we use an attention module to select among the fused image features. Image quality is degraded by illumination, motion, underexposure and so on, so regions of an image where high-frequency information is lost often appear as dark areas or highlights. Inspired by the human visual system's sensitivity to brightness and contrast, we use image illuminance information to simulate the non-linear way the human eye combines different features, and we establish the non-linear fusion relationship on the basis of that illuminance information. By simulating these human visual perception characteristics, the fused images agree better with human subjective evaluation.

2.2 General framework

Figure 2: General block diagram of our framework. The feature layer includes two-scale decomposition, deep-learning feature map capture and global saliency feature map capture. The light index layer provides the non-linear fusion factor derived from light intensity density. The feature selection layer includes a residual-network feature extraction module and an attention module. The final stage produces the fused image. M, D, B and Sum denote, respectively, the multi-scale decomposition operation, the non-linear fusion function for the detail layer, the non-linear fusion function for the base layer, and the concatenation operation.

3 Method

As shown in Figure 2, our proposed image fusion algorithm consists of four steps. First, the images are decomposed to obtain base and detail layers. Second, the image illuminance is modeled to obtain the non-linear fusion coefficients. Third, the resulting weight maps are combined with the illuminance fusion factor for feature fusion. Finally, the fused feature map is passed through the channel attention module to obtain the final fused image.

3.1 Multi-scale image decomposition

Multi-scale image decomposition theory is widely used in computer vision and has achieved strong results in feature extraction. According to human visual perception theory, the human eye has different sensitivity to different regions of a degraded image. Therefore, in a multi-source image fusion task, we decompose the image at different levels; this effectively avoids the ringing artifacts caused by mixing high and low frequencies during processing. In our image fusion framework, we use the two-scale decomposition method proposed by Bavirisetti2016Two .
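As a concrete illustration, the two-scale split amounts to a mean-filter decomposition; the following is a minimal NumPy sketch under that assumption (the 31×31 averaging window is our choice, not necessarily the setting of Bavirisetti2016Two):

```python
import numpy as np

def two_scale_decompose(img, k=31):
    """Split an image into a base (low-frequency) layer, obtained with a
    k x k mean filter, and a detail (high-frequency) residual layer."""
    pad = k // 2
    p = np.pad(img.astype(np.float64), pad, mode="edge")
    base = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            base += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    base /= k * k
    detail = img - base  # residual keeps edges and texture
    return base, detail

img = np.linspace(0, 1, 64 * 64).reshape(64, 64)
base, detail = two_scale_decompose(img)
# By construction the two layers reconstruct the original: img == base + detail.
```

Because the detail layer is defined as a residual, the decomposition is exactly invertible, which is what lets the fused base and detail layers simply be added back together at the end of the pipeline.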

3.2 Visual saliency detection

As the study of the human visual perception mechanism has deepened, saliency detection based on human visual perception theory has been widely used in computer vision Li_2018DL . Image fusion usually employs bottom-up and top-down saliency models, realized through the high contrast of pixels against their surroundings. Current saliency-based multi-source fusion methods mainly follow two approaches: computing a saliency weight map corresponding to the original image Lahoud2019FastZERO ; Bavirisetti2016Two , or extracting the salient target Liu2015MultiDSIFT via saliency analysis. In this paper we adopt the bottom-up saliency model proposed by Bavirisetti2016Two , which has lower computational complexity than the alternatives.
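One simple reading of the bottom-up construction in Bavirisetti2016Two is a mean-minus-median contrast map; the sketch below implements that simplified form (the 3×3 window and the per-pixel weight normalization are assumptions):

```python
import numpy as np

def _shift_stack(img, k):
    """Stack the k*k shifted copies of img used by sliding-window filters."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    return np.stack([p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
                     for dy in range(k) for dx in range(k)])

def saliency_map(img, k=3):
    """Bottom-up saliency: absolute difference between a k x k mean filter
    and a k x k median filter of the image."""
    stack = _shift_stack(img.astype(np.float64), k)
    return np.abs(stack.mean(axis=0) - np.median(stack, axis=0))

def saliency_weights(images, k=3):
    """Normalize per-source saliency into fusion weight maps summing to 1."""
    sal = [saliency_map(im, k) for im in images]
    total = np.sum(sal, axis=0) + 1e-12  # avoid division by zero
    return [s / total for s in sal]
```

The normalized maps can be used directly as the per-source weight maps W_i in the fusion rules of Section 3.4.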

3.3 Illumination factor modeling

During image formation, weather, light and motion degrade image quality. This degradation corresponds to a loss of high-frequency information, and the affected regions often appear as highlights or dark areas. Figure 3 shows information loss caused by car headlights in a visible image, together with our visual modeling analysis of the light field.

Figure 3: Visualization analysis of high optical density images. From left to right indicate visible image, abnormal high light image block, abnormal 3D optical density map, average optical density curve F(x, y), normal image block, normal 3D optical density map, normal average optical density curve R(x, y).

From the average optical density curve, we can readily see that the highlight takes the form of a parabolic cross-section. Extensive validation on the FLIR dataset shows that this highlight pattern produced by circular light sources is common. However, given the diversity of light-source shapes in nature (rectangular, elliptical and irregular), a fixed illumination model cannot be used for highlight removal: we need an adaptive illuminance model, dynamically adjusted to the image. We validated the current mainstream image fusion algorithms on this task; as Figure 1 shows, they do not account for the highlight problem, so highlight blocks are not removed from the fused results and an obvious boundary effect remains, seriously degrading fusion quality. We therefore introduce illumination factors into image fusion to eliminate the influence of highlights on fusion quality.

According to physical optics, the light perceived by the human eye is mainly composed of ambient light, diffuse reflection, specular reflection, and the object's own emission Treisman1980A . To simplify the model, we regard the image as composed of two parts: an incident (illumination) image and a reflected image. Current incident-image estimation methods rest on the low-frequency assumption: illumination is taken to vary slowly, so the illumination image is mainly a low-frequency component, and many computer vision tasks estimate it on that basis. However, this approach has an obvious drawback: it ignores the non-smooth nature of illumination itself, which is especially prominent at illumination edges. Accordingly, our illuminance fusion factor is based on global illumination modeling rather than on low-frequency components alone, which effectively avoids false edge effects in image fusion.
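The low-frequency estimate described above can be sketched as a heavy separable Gaussian blur (a minimal NumPy version; σ = 8 is an arbitrary choice). Running it on an image with a sharp shadow or highlight boundary reproduces the false-edge drawback noted in the text:

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3*sigma, normalized to sum to 1."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1, dtype=np.float64)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def estimate_illumination(img, sigma=8.0):
    """Retinex-style incident-image estimate: treat illumination as the
    low-frequency component of the observed image (a separable blur).
    Sharp illumination boundaries are smeared by this estimate."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    p = np.pad(img.astype(np.float64), pad, mode="edge")
    rows = np.array([np.convolve(r, k, mode="valid") for r in p])          # blur rows
    return np.array([np.convolve(c, k, mode="valid") for c in rows.T]).T  # blur cols
```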

In the multi-source fusion setting, let I denote an original source image. Based on the multi-scale decomposition of I, the illumination density function of the highlight block and the reference image of the highlight block are established from the base-layer and detail-layer images respectively. To effectively recover the high-frequency information lost by image fusion, we use the reference density R(x, y) to fit the target function F(x, y). If the l2 loss is used as the objective loss function, then:

    L(R, F) = Σ_(x,y) ( R(x, y) − F(x, y) )²
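The fit can be carried out with ordinary linear least squares over a quadratic surface, matching the parabolic cross-section seen in Figure 3; the sketch below is an illustration for the circular-source case (the quadratic form and the patch size are our assumptions):

```python
import numpy as np

def fit_quadratic_surface(patch):
    """Fit R(x, y) = a*x^2 + b*y^2 + c*x*y + d*x + e*y + g to a highlight
    patch by minimizing the l2 loss sum((R - F)^2) with linear least
    squares, where F is the observed optical density of the patch."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    A = np.stack([xs**2, ys**2, xs * ys, xs, ys, np.ones_like(xs)],
                 axis=-1).reshape(-1, 6)
    coeffs, *_ = np.linalg.lstsq(A, patch.ravel(), rcond=None)
    return coeffs, (A @ coeffs).reshape(h, w)

# A synthetic circular highlight: an inverted paraboloid peaking at the center.
ys, xs = np.mgrid[0:21, 0:21].astype(np.float64)
patch = 1.0 - 0.01 * ((xs - 10)**2 + (ys - 10)**2)
coeffs, fitted = fit_quadratic_surface(patch)
```

Because the synthetic patch is itself quadratic, the least-squares fit recovers it essentially exactly; on real highlight blocks the residual measures how far the light source departs from the circular (parabolic) model.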


3.4 Feature fusion

The visual masking, brightness and contrast sensitivity characteristics of the human visual perception system indicate that human perception of external information depends chiefly on the brightness difference between target and background. Humans adapt their brightness perception in highlight areas, and the eye cannot detect distortion below the just noticeable distortion (JND) threshold Treisman1980A . Inspired by these two characteristics, we propose an innovative multi-source image fusion framework in which the fusion rules for the base and detail layers are no longer the traditional ones (weighted average, maximum, etc.) Ma2018Infrared but are driven by the illumination factor as a non-linear fusion factor. Introducing this factor simulates the non-linear fusion characteristics of human visual perception. Considering the diversity of light-source shapes in nature, we analyze the circular light source in detail as a representative case.

In the previous sections, we obtained the saliency weight maps W_i and the base layers B_i (with detail layers D_i). If the mainstream weighted-average fusion operator is used, the fusion criterion is as shown in formula 2:

    F^b(x, y) = Σ_i W_i^b(x, y) B_i(x, y),    F^d(x, y) = Σ_i W_i^d(x, y) D_i(x, y)

If we introduce our non-linear illumination factor, the fusion rule is as shown in Equation 3:

    F^b(x, y) = (1/C) Σ_i S(L_i(x, y)) W_i^b(x, y) B_i(x, y),    F^d(x, y) = (1/C) Σ_i S(L_i(x, y)) W_i^d(x, y) D_i(x, y)

In the formulas, D, B and F respectively denote the detail layer, the base layer, and the fused combination of base and detail layers; W denotes a weight map; i indexes the source images; the superscripts b and d mark base- and detail-layer quantities; (x, y) are pixel coordinates; S(L_i) is the activated illumination intensity of source i; and C is a pixel normalization constant.

In our image fusion framework, we use the normalized illumination intensity as the non-linear illumination coefficient. Since the non-linear fusion factor must lie in the range (0, 1), we use the sigmoid as the activation function. Note that the weighted-average and maximum fusion criteria applied to the base and detail layers by current mainstream algorithms are special cases of our rule: a constant illumination factor recovers the weighted-average fusion criterion, and maximizing the illumination factor recovers the maximum fusion criterion.
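The two criteria can be sketched side by side in NumPy, using a sigmoid of a per-source illumination map as the non-linear factor (the exact normalization constant used in the paper is not reproduced here; per-pixel weight normalization stands in for it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_weighted_average(layers, weights):
    """Mainstream criterion: F(x, y) = sum_i W_i(x, y) * I_i(x, y),
    with the weight maps normalized to sum to 1 at every pixel."""
    wsum = np.sum(weights, axis=0) + 1e-12
    return np.sum([w * l for w, l in zip(weights, layers)], axis=0) / wsum

def fuse_nonlinear(layers, weights, illum):
    """Proposed criterion: each weight map is modulated by a sigmoid
    illumination factor before normalization."""
    mod = [sigmoid(il) * w for il, w in zip(illum, weights)]
    return fuse_weighted_average(layers, mod)
```

When every illumination map is constant, the sigmoid factor cancels in the normalization and the rule reduces exactly to the plain weighted average, which is the special-case relationship stated above.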

3.5 Feature selection

Given the excellent performance of deep convolutional neural networks in feature extraction, CNN-based feature extraction has been widely adopted by researchers Ma2018Infrared . Among pre-trained-model approaches, the main examples are the deep-network infrared and visible fusion algorithm proposed by Li Li2018Infrared and the fast zero-learning method proposed by Lahoud Lahoud2019FastZERO ; by introducing a pre-trained network, such methods extract features fully while largely avoiding a dedicated training task for image fusion. Meanwhile, Hu hu2017squeezeandexcitation proposed the SENet channel attention network for feature selection, Zhang Zhang_2018RCAN proposed a residual channel attention module based on SENet, and Fu JunFu2018Dual proposed a dual attention network for image segmentation, adding a spatial attention module on top of channel attention. All of these methods draw on characteristics of the human visual perception system, whose inherent inference mechanism deduces content from prior knowledge in the brain and discards uncertain information. Inspired by this, we use a channel attention network Zhang_2018RCAN to simulate the feature selection characteristic of human visual perception in multi-source image fusion tasks: the attention network learns the complex, non-mutually-exclusive, non-linear relationships between features and assigns different weights to features receiving different degrees of attention.

Suppose the feature maps obtained by residual convolution after the preceding fusion step have size H x W x C. As shown in formula 4, a global average pooling (GP) operation is performed on each feature map to obtain its global receptive field, so that the network can ignore the spatial relationships within channels and focus on learning the non-linear relationships between different feature channels:

    z_k = (1 / (H x W)) Σ_x Σ_y f_k(x, y)

Figure 4: Feature selection attention module Zhang_2018RCAN .

Here f_k(x, y) denotes the pixel value at coordinates (x, y) in the k-th channel. As shown in Equation 4, after the global average pooling layer, the output of the attention module is obtained through a convolution, a ReLU activation, a second convolution, a Sigmoid activation, and a channel-wise dot product:

    s = σ( W_2 δ( W_1 z ) ),    f'_k = s_k · f_k

In the channel attention module, σ and δ denote the Sigmoid and ReLU activation functions respectively, W_1 and W_2 denote the weights of the two convolutions, and z denotes the output of the GP operation on the input feature map.
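A minimal NumPy sketch of this channel attention step, in the squeeze-and-excitation form of hu2017squeezeandexcitation (randomly initialized matrices stand in for the learned convolution weights, and the reduction ratio of 4 is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """feat: H x W x C feature map. Squeeze: global average pooling gives
    z (length C). Excitation: s = sigmoid(w2 @ relu(w1 @ z)). The feature
    map is then rescaled channel-wise by s."""
    z = feat.mean(axis=(0, 1))                  # squeeze (GP operation)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # excitation: FC-ReLU-FC-sigmoid
    return feat * s                             # channel-wise reweighting

rng = np.random.default_rng(0)
feat = rng.random((8, 8, 16))
w1 = rng.standard_normal((4, 16)) * 0.1  # reduction ratio r = 4 (assumption)
w2 = rng.standard_normal((16, 4)) * 0.1
out = channel_attention(feat, w1, w2)
```

Each channel is scaled by a factor in (0, 1), so channels the network deems uninformative are suppressed rather than discarded outright.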

4 Experiments

To evaluate the robustness of our framework, we performed experiments on datasets for different image fusion tasks. First, we obtained natural-scene infrared and visible images from the TNO TNO dataset, which includes 21 image pairs. Second, we obtained test images from the FLIR FLIR traffic dataset, which includes 10,000 pairs of thermal infrared and visible images; as the dataset is not registered, we used the control point selection toolbox in MATLAB for partial registration. Third, we obtained a medical image dataset Summers2003Harvard from the Harvard whole brain atlas, comprising 97 pairs of CT and MRI images. Finally, we obtained multi-focus images from the Lytro Nejati2015Multi dataset, which includes 20 commonly used multi-focus pairs. In all subsequent experiments we converted the images to grayscale before fusion. We emphasize that in all of the following tests our algorithm was not manually tuned for any specific dataset.
We compare against 20 mainstream algorithms: fast zero-learning (FEZ) Lahoud2019FastZERO , convolutional sparse representation (CSR) Liu2016ImageCSR , deep learning (DL) Li_2018DL , DenseFuse (DENSE) Li2018DenseFuse , generative adversarial network for image fusion (FusionGAN) Ma2018Infrared , Laplacian pyramid (LP) Burt1987TheLP , dual-tree complex wavelet transform (DTCWT) Liu2015MultiDSIFT , latent low-rank representation (LTLRR) Li2018InfraredLTLRR , multi-scale transform and sparse representation (LP-SR) Liu2015ALPSR , dense SIFT (DSIFT) Liu2015MultiDSIFT , convolutional neural network (CNN) Liu2017InfraredJSR-SD , curvelet transform (CVT) Nencini2007RemoteCVT , cross bilateral filter fusion (CBF) Shreyamsha2015ImageCBF , cross joint sparse representation (JSR) Zhang2013Dictionary , joint sparse representation with saliency detection (JSRSD) Liu2017InfraredJSR-SD , gradient transfer fusion (GTF) Ma2016InfraredGTF , weighted least squares optimization (WLS) Ma2017InfraredWLS , ratio of low-pass pyramid (RP) Toet1989ImageRP , wavelet Chipman1995Wavelets , and multi-resolution singular value decomposition (MSVD) Naidu2011Image , in addition to our non-linear variant (OURS) and our non-linear plus selection variant (OURS+). The set of compared algorithms varies slightly between experiments, and any changes are noted in the corresponding sections. All of these algorithms have published code, their parameters follow the settings in the published papers, and our own code and data will subsequently be published on GitHub. For our proposed algorithm, we also ran a comparative experiment with and without the channel attention module. Our experimental platform is a desktop with a 3.0 GHz i5-8500 CPU, an RTX 2070 GPU and 16 GB of memory.

4.1 Fusion metrics

To quantitatively evaluate the performance of the different algorithms, we mainly use six metrics: entropy (EN) 1576816 , structural similarity index measure (SSIM) 1284395 , mutual information (MI) Qu2002Information , visual information fidelity (VIF) Han2013A , information fidelity criterion (IFC) Sheikh2006An , and average gradient (AG) Cui2015Detail . For all six metrics, higher values indicate better image quality.
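Two of the six metrics have simple closed forms; below is a sketch of EN (Shannon entropy of the 256-bin gray-level histogram) and AG (mean local gradient magnitude) for 8-bit grayscale images. Exact formula variants of AG differ across papers; this one averages the RMS of horizontal and vertical differences:

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of the 256-bin gray-level histogram, in bits."""
    hist = np.bincount(img.ravel().astype(np.uint8), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def average_gradient(img):
    """AG: mean magnitude of horizontal/vertical intensity differences."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx**2 + gy**2) / 2.0)))
```

A uniform gray-level ramp attains the maximum entropy of 8 bits, while a constant image scores 0 on both metrics, matching the intuition that higher values reflect more information and sharper detail.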

4.2 Results on TNO dataset

On the TNO TNO dataset, we performed quantitative and qualitative analyses of its 21 pairs of infrared and visible images using the 20 image fusion methods listed in Section 4.

Figure 5: Visible and infrared source images with the fusion results obtained by different methods. From (a) to (v) : CNNLiu2017InfraredCNN , CVTNencini2007RemoteCVT , DLLi_2018DL , DTCWTLiu2015MultiDSIFT , FEZLahoud2019FastZERO , DSIFTLiu2015MultiDSIFT , CSRLiu2016ImageCSR , DFALi2018DenseFuse , DFL1Li2018DenseFuse , CBFShreyamsha2015ImageCBF , WLSMa2017InfraredWLS , JSRZhang2013Dictionary , JSRSDLiu2017InfraredJSR-SD , LATLRRLiu2011Latent , FusionGanMaFusionGAN , GTFMa2016InfraredGTF , LPBurt1987TheLP , LPSRLiu2015ALPSR , MSVDNaidu2011Image , RPToet1989ImageRP , WaveletChipman1995Wavelets , OURS+.
Figure 6: Six evaluation indicators for quantitative contrast between infrared and visible Images.

As shown in Figure 5, we analyzed the dataset qualitatively. In the tree-leaf window of the figure, the visible image contains an obvious highlight and its image information is seriously lost, while in the infrared image the detailed structure of the same region is relatively well preserved. Existing algorithms fail to recover the information lost in the visible image. Compared with the other algorithms, ours achieves very high definition in the highlighted tree region, and in the pedestrian window it keeps the pedestrians at high contrast. The images fused by our algorithm agree better with the human visual perception mechanism.

Following the qualitative analysis of the TNO dataset, we performed a quantitative analysis. Comparing the objective evaluation indicators of the different fusion algorithms in Figure 6, our algorithm performs well across the six objective metrics: although it scores relatively low on the structural similarity index (SSIM) 1284395 , it scores relatively high on the others. We also find that adding the attention-based feature selection module lowers the entropy value relative to the purely non-linear variant but improves the remaining indicators.

4.3 Results on FLIR dataset

To verify the robustness of our method, we also carried out experiments on the FLIR FLIR traffic dataset. The qualitative analysis is shown in Figure 7 and the quantitative analysis in Figure 8.

Figure 7: Qualitative fusion results on visible and thermal infrared images by different method.
Figure 8: Six evaluation indicators for quantitative contrast between thermal infrared and visible images.

From Figure 7, we can see that on the FLIR traffic dataset our algorithm achieves higher fusion quality in the highlight blocks than the other algorithms. Examining the structural evaluation index together with the highlight-block images, however, we find that the fused texture details of the CVT Nencini2007RemoteCVT , DTCWT Liu2015MultiDSIFT and RP Toet1989ImageRP algorithms are not repaired at all. Using the weighted-average SSIM with the infrared image as reference, as in other papers, we found that on the FLIR dataset the SSIM scores of these three algorithms are generally more than one percentage point higher than ours, and their visual fidelity scores are also higher than those of the other algorithms. We attribute this mainly to the brightness, contrast and visual masking characteristics of the human visual system: when an image is seriously degraded, the SSIM index diverges markedly from subjective evaluation. Hence, for degraded images, a higher SSIM value does not necessarily indicate better image quality.

4.4 Results on medical dataset

Figure 9: Qualitative fusion results on CT and MR images by different method.
Table 1: Quantitative fusion metrics on the medical dataset.

Method       | EN    | MI    | SSIM  | AG     | IFC   | VIF
CNN          | 4.760 | 9.520 | 0.779 | 7.551  | 2.450 | 0.377
FEZ          | 4.979 | 9.959 | 0.764 | 4.568  | 1.532 | 0.248
NSST_PAPCNN  | 4.612 | 9.224 | 0.778 | 7.205  | 1.721 | 0.286
GTF          | 4.164 | 8.327 | 0.688 | 4.225  | 0.534 | 0.080
LP           | 4.285 | 8.570 | 0.819 | 7.538  | 2.611 | 0.390
LPSR         | 4.663 | 9.326 | 0.808 | 7.614  | 2.568 | 0.390
MSVD         | 4.124 | 8.249 | 0.804 | 7.046  | 1.549 | 0.242
RP           | 4.199 | 8.397 | 0.612 | 11.263 | 0.715 | 0.084
WAVELET      | 3.933 | 7.866 | 0.759 | 4.513  | 1.555 | 0.212
OURS         | 4.394 | 8.789 | 0.872 | 9.411  | 2.935 | 0.503
OURS+        | 4.230 | 8.460 | 0.792 | 12.216 | 1.971 | 0.339

On the medical dataset, we added two algorithms: CNN Yu2017medicalCNN and the parameter-adaptive pulse-coupled neural network (NSST_PAPCNN) Yin2018MedicalNSSPAPCNN ; the other seven algorithms are described in Section 4. From Figure 9 we can see that our fusion results have good clarity and contrast, agreeing better with the human subjective visual system. From Table 1 we can see that our algorithm improves on the other methods in the SSIM, AG, IFC and VIF fusion metrics.

4.5 Results on Multi-Focus dataset

On this dataset, we added one algorithm, zero-phase component analysis (ZCA) Li2018Infrared ; the other 20 operators are described in Section 4.

Figure 10: Qualitative fusion results on multi-focus source images by different method.
Figure 11: Six evaluation indicators for quantitative contrast between multi-focus image.

As shown in Figures 10 and 11, we evaluated the various algorithms on the multi-focus image dataset; the experimental data come from the Lytro Nejati2015Multi dataset. Analysis of the results shows that our algorithm achieves higher entropy and gradient values than the other algorithms on multi-focus images, which demonstrates that the fused images carry more information and better clarity.

4.6 Average absolute error experiment

Figure 12: Average absolute difference of image evaluation indicators for different datasets.

In order to compare our results more clearly, we draw the absolute mean difference histogram shown in Figure 12, based on the experiments in Section 4. The experiment first compares six quality evaluation indicators across four datasets, and then computes each method's absolute difference from our algorithm. From the figure, we can see more clearly that across the four datasets our indicators are generally higher than the averages of the other algorithms, which demonstrates that our algorithm is more robust.
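As a sketch of how such absolute differences could be derived, the snippet below takes three rows from Table 1 (LP, MSVD, and OURS) and computes our method's absolute difference from the other methods' column-wise average; the choice of rows is illustrative, not the exact computation behind Figure 12:

```python
import numpy as np

# Three rows from Table 1 (medical dataset): six metric values per method.
table = {
    "LP":   [4.285, 8.570, 0.819, 7.538, 2.611, 0.390],
    "MSVD": [4.124, 8.249, 0.804, 7.046, 1.549, 0.242],
    "OURS": [4.394, 8.789, 0.872, 9.411, 2.935, 0.503],
}
ours = np.array(table["OURS"])
others = np.array([v for k, v in table.items() if k != "OURS"])
# Absolute difference between our scores and the others' column-wise mean.
abs_diff = np.abs(ours - others.mean(axis=0))
```

One such vector per dataset, drawn as bars, gives a histogram of the kind shown in Figure 12.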

5 Discussion

The extensive experiments in Section 4 show that the proposed general framework is more robust than existing methods. We attribute this to three factors. First, the collaboration of traditional and deep learning methods is effective in image fusion tasks. Second, using illumination as a non-linear factor of feature fusion is consistent with human visual perception characteristics. Finally, in image fusion tasks, feature selection matters not only in the initial feature-extraction stage but also in the later feature-fusion stage.
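To illustrate the second point, the sketch below fuses two aligned grayscale images with a per-pixel weight that depends non-linearly on local illuminance. The function name, the power-law weighting, and the parameter `gamma` are illustrative assumptions, not the paper's actual illuminance fusion factor:

```python
import numpy as np

def illuminance_weighted_fusion(a, b, gamma=2.0):
    """Fuse two aligned grayscale images with a non-linear,
    illuminance-dependent per-pixel weight (illustrative sketch:
    the power-law weight and gamma value are assumptions)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    wa = (a / 255.0) ** gamma       # brighter pixels get larger weight
    wb = (b / 255.0) ** gamma
    w = wa / (wa + wb + 1e-12)      # normalised fusion weight in [0, 1]
    return w * a + (1.0 - w) * b

bright = np.full((4, 4), 255, dtype=np.uint8)
dark = np.zeros((4, 4), dtype=np.uint8)
fused = illuminance_weighted_fusion(bright, dark)
# The well-lit source dominates the fused result.
assert np.allclose(fused, 255.0, atol=1e-6)
```

Because the weight is a non-linear function of intensity rather than a fixed average, well-lit structure is preserved instead of being diluted by a dark source, which is the behaviour the discussion attributes to human perception.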

6 Conclusion

Based on two characteristics of human visual perception theory, we propose a robust and efficient multi-source image fusion framework. The biggest difference between our framework and current mainstream frameworks is that ours does not require first training a dedicated image fusion network. We introduce the illuminance fusion factor to simulate the non-linear characteristics of human visual perception, for the first time in image fusion, and introduce an attention model to simulate its selection characteristics. Extensive experimental results demonstrate that our framework is more robust than existing mainstream frameworks. Although our framework does not fully simulate human visual perception, this first simulation of its characteristics in image fusion tasks is consistent with the human visual fusion mechanism. Although our framework achieves relatively good results compared with existing algorithms, how to better learn the non-linear relationship between features and spatial structure will be addressed in future work.
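The channel-wise selection mentioned above follows the squeeze-and-excitation pattern of Hu et al. The sketch below shows that pattern with random matrices standing in for the learned fully connected layers, so it illustrates the data flow only, not trained behaviour:

```python
import numpy as np

def channel_attention(feats, reduction=2, seed=0):
    """Squeeze-and-excitation-style channel selection over a (C, H, W)
    feature map. Random matrices stand in for the learned FC layers
    (an assumption for illustration; real weights come from training)."""
    c, h, w = feats.shape
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((c // reduction, c)) / np.sqrt(c)
    w2 = rng.standard_normal((c, c // reduction)) / np.sqrt(c // reduction)
    z = feats.mean(axis=(1, 2))                  # squeeze: global average pool
    gate = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))  # excitation
    return feats * gate[:, None, None]           # reweight each channel

out = channel_attention(np.ones((4, 8, 8)))
# Each channel is scaled by a gate in (0, 1).
assert out.shape == (4, 8, 8)
```

The sigmoid gate suppresses some channels and emphasises others, which is the "selection characteristic" the framework borrows from human visual perception.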


We are very grateful to Professor Roundtree and Dr. Wang Xiaoming for their help with the language of the paper. This work was supported by the National Natural Science Foundation of China under Grant no. 61871326 and the Shaanxi Natural Science Basic Research Program under Grant no. 2018JM6116.



  • (1) A. M. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychology 12 (1) (1980) 97–136.
  • (2) S. Li, B. Yang, J. Hu, Performance comparison of different multi-resolution transforms for image fusion, Information Fusion 12 (2) (2011) 74–84.
  • (3) T. Xiang, L. Yan, R. Gao, A fusion algorithm for infrared and visible images based on adaptive dual-channel unit-linking pcnn in nsct domain, Infrared Physics & Technology 69 (2015) 53–61.
  • (4) Z. Xiaoye, M. Yong, F. Fan, Z. Ying, H. Jun, Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition, Journal of the Optical Society of America. A, Optics, Image Science, and Vision 34 (8) (2017) 1400–1410.
  • (5) J. Ma, Y. Ma, C. Li, Infrared and visible image fusion methods and applications: A survey, Information Fusion 45 (2019) 153 – 178. doi:https://doi.org/10.1016/j.inffus.2018.02.004.
  • (6) AZoSensors, Flir releases starter thermal imaging dataset for machine learning advanced driver assistance development, https://www.flir.com/oem/adas/adas-dataset-form/ (2018).
  • (7) Q. Zhang, Y. Fu, H. Li, J. Zou, Dictionary learning method for joint sparse representation-based image fusion, Optical Engineering 52 (5) (2013) 7006.
  • (8) J. Ma, Z. Zhou, B. Wang, H. Zong, Infrared and visible image fusion based on visual saliency map and weighted least square optimization, Infrared Physics & Technology 82 (2017) 8–17.
  • (9) G. Liu, S. Yan, Latent low-rank representation for subspace segmentation and feature extraction, in: International Conference on Computer Vision, 2011, pp. 1615–1622. doi:10.1109/ICCV.2011.6126422.
  • (10) H. Li, X.-J. Wu, Infrared and visible image fusion using a novel deep decomposition method, IEEE Transactions on Image Processing.
  • (11) H. Li, X.-J. Wu, Infrared and visible image fusion with resnet and zero-phase component analysis, ArXiv abs/1806.07119.
  • (12) Y. Liu, X. Chen, J. Cheng, H. Peng, Z. Wang, Infrared and visible image fusion with convolutional neural networks, International Journal of Wavelets Multiresolution & Information Processing 16 (3) (2017) 1–20.
  • (13) F. Nencini, A. Garzelli, S. Baronti, L. Alparone, Remote sensing image fusion using the curvelet transform, Information Fusion 8 (2) (2007) 143–156.
  • (14) H. Li, X.-J. Wu, J. Kittler, Infrared and visible image fusion using a deep learning framework, in: 2018 24th International Conference on Pattern Recognition (ICPR), 2018.
  • (15) Y. Liu, S. Liu, Z. Wang, Multi-focus image fusion with dense sift, Information Fusion 23 (C) (2015) 139–155.
  • (16) M. Jiayi, Y. Wei, L. Pengwei, L. Chang, J. Junjun, Fusiongan: A generative adversarial network for infrared and visible image fusion, Information Fusion 48 (2019) 11 – 26. doi:https://doi.org/10.1016/j.inffus.2018.09.004.
  • (17) S. Li, K. Xudong, J. Hu, Image fusion with guided filtering, IEEE Transactions on Image Processing 22 (7) (2013) 2864–2875.
  • (18) J. Ma, C. Chen, C. Li, J. Huang, Infrared and visible image fusion via gradient transfer and total variation minimization, Information Fusion 31 (C) (2016) 100–109.
  • (19) P. J. Burt, E. H. Adelson, The laplacian pyramid as a compact image code, Readings in Computer Vision 31 (4) (1987) 671–679.
  • (20) F. Lahoud, S. Süsstrunk, Fast and efficient zero-learning image fusion, Information Fusion, arXiv:1905.03590.
  • (21) S. Kumar, B. K., Image fusion based on pixel significance using cross bilateral filter, Signal Image & Video Processing 9 (5) (2015) 1193–1204.
  • (22) Y. Liu, X. Chen, R. Ward, Z. J. Wang, Image fusion with convolutional sparse representation, IEEE Signal Processing Letters 23 (12) (2016) 1882–1886.
  • (23) C. Liu, Y. Qi, W. Ding, Infrared and visible image fusion method based on saliency detection in sparse domain, Infrared Physics & Technology 83 (2017) 94–102.
  • (24) Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion 24 (2015) 147–164.
  • (25) V. P. S. Naidu, Image fusion technique using multi-resolution singular value decomposition, Defence Science Journal 61 (5) (2011) 479–484.
  • (26) A. Toet, Image fusion by a ratio of low-pass pyramid, Pattern Recognition Letters 9 (4) (1989) 245–253.
  • (27) L. J. Chipman, T. M. Orr, L. N. Graham, Wavelets and image fusion, in: International Conference on Image Processing, 1995.
  • (28) D. P. Bavirisetti, G. Xiao, G. Liu, Multi-sensor image fusion based on fourth order partial differential equations, in: 2017 20th International Conference on Information Fusion (Fusion), 2017, pp. 1–9.
  • (29) Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network, Information Fusion 36 (2017) 191–207.
  • (30) H. Li, X. J. Wu, Densefuse: A fusion approach to infrared and visible images, IEEE Transactions on Image Processing 28 (5) (2018) 2614–2623.
  • (31) D. P. Bavirisetti, R. Dhuli, Two-scale image fusion of visible and infrared images using saliency detection, Infrared Physics & Technology 76 (2016) 52–64.
  • (32) J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-excitation networks (2017). arXiv:1709.01507.
  • (33) Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, Lecture Notes in Computer Science (2018) 294–310.
  • (34) J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, Computer Vision and Pattern Recognition.
  • (35) A. Toet, Tno dataset, https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029 (2018).
  • (36) D. Summers, Harvard whole brain atlas, www.med.harvard.edu/AANLIB/home.html, Journal of Neurology Neurosurgery & Psychiatry 74 (3) (2003) 288–288.
  • (37) M. Nejati, S. Samavi, S. Shirani, Multi-focus image fusion using dictionary-based sparse representation, Information Fusion 25 (2015) 72–84.
  • (38) H. R. Sheikh, A. C. Bovik, Image information and visual quality, IEEE Transactions on Image Processing 15 (2) (2006) 430–444. doi:10.1109/TIP.2005.859378.
  • (39) Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (4) (2004) 600–612. doi:10.1109/TIP.2003.819861.
  • (40) G. Qu, D. Zhang, P. Yan, Information measure for performance of image fusion, Electronics Letters 38 (7) (2002) 313–315.
  • (41) Y. Han, Y. Cai, Y. Cao, X. Xu, A new image fusion performance metric based on visual information fidelity, Information Fusion 14 (2) (2013) 127–135.
  • (42) H. R. Sheikh, A. C. Bovik, An information fidelity criterion for image quality assessment using natural scene statistics, IEEE Transactions on Image Processing 14 (12) (2006) 2117–2128.
  • (43) G. Cui, H. Feng, Z. Xu, Q. Li, Y. Chen, Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition, Optics Communications 341 (341) (2015) 199–209.
  • (44) G. Liu, S. Yan, Latent low-rank representation for subspace segmentation and feature extraction, in: International Conference on Computer Vision, 2011.
  • (45) L. Yu, C. Xun, J. Cheng, P. Hu, A medical image fusion method based on convolutional neural networks, in: International Conference on Information Fusion, 2017. doi:10.23919/ICIF.2017.8009769.
  • (46) M. Yin, L. Xiaoning, L. Yu, C. Xun, Medical image fusion with parameter-adaptive pulse coupled-neural network in nonsubsampled shearlet transform domain, IEEE Transactions on Instrumentation & Measurement 68 (1) (2018) 1–16.