1 Introduction
Pixel-level image fusion aims to combine different images of the same scene by mathematical techniques in order to create a single composite image that is more comprehensive and thus more useful for human or machine perception [1, 2]. For instance, multimodal image fusion [3] fuses images acquired via different sensor modalities, which exhibit diverse characteristics, for a more reliable and accurate medical diagnosis. Another typical application is multi-focus image fusion [4]. Since the depth-of-field (DoF) of bright-field microscopy is only about 1–2 micrometers while the specimen's profile covers a much larger range, the parts of the specimen that lie outside the object plane are blurred. Multi-focus image fusion can obtain an all-in-focus image from multiple images of the identical viewpoint taken at different distances between the object and the lens.
A good image fusion method should have the following properties. First, it preserves both the details of small objects and the integrity of large objects in the fused image, even when the size of the objects of interest varies greatly. For example, cervical cell images from the microscope contain both small isolated cells and large agglomerates, both of which are useful for cervical cytology [5]. Second, it should be efficient enough to handle large-scale data. For instance, whole-slide scanning in digital cytopathology [6] needs to process thousands of fields of view (FoV) in an acceptable time, which requires fusing a series of high-resolution images captured at each FoV very efficiently. Third, it should not produce obvious artifacts. Despite extensive study, to the best of our knowledge, existing fusion methods may not meet these requirements simultaneously.
In this paper, we propose a simple yet effective image fusion method which can deal with the case where the size of the objects of interest varies greatly in the image, as illustrated in Figure 1. It is a spatial domain method whose key idea is scale-invariant structure saliency generation based on the difference-of-Gaussian (DoG) pyramid [7]. After generating the saliency map of each image, a simple max operation is applied to them to generate the mask images, which are further refined by single-scale guided filtering [8] to exploit the spatial correlation among adjacent pixels, resulting in a scale-invariant estimation of activity maps. Without complicated processing involved, the proposed method is very fast and can be used to fuse high-resolution images in real-time applications. Experimental results demonstrate that, compared to many state-of-the-art methods, the proposed method is much faster while yielding competitive or even better results in terms of both visual and quantitative evaluations. Our contributions in this paper are as follows:

- We propose a scale-invariant structure saliency selection scheme based on the difference-of-Gaussian (DoG) pyramid of images. The resulting image fusion method keeps both the details of small objects and the integrity of large objects in images simultaneously.

- Our method is very efficient, easy to implement, and can be used for fast fusion of high-resolution images.

- Compared to many state-of-the-art methods, our method yields competitive or even better results in terms of both visual and quantitative evaluations on three datasets.
1.1 Related works
Various image fusion techniques have been proposed in the literature, which can be roughly classified into two categories [1]: transform domain methods and spatial domain methods. Transform domain methods are mainly based on the "decomposition-fusion-reconstruction" framework: they first transform each source image into a new domain by tools such as multi-scale decomposition (MSD) [9, 10], sparse coding [11, 12, 13] or other transformations like principal component analysis (PCA) [14], then construct a composite representation in the transform domain with specific fusion rules, and finally apply the inverse transform on the composite representation to obtain the fused image. Spatial domain methods take each pixel in the fused image as the weighted average of the corresponding pixels in the input images, where the weights or activity map are often determined according to the saliency of different pixels [15] and the corresponding spatial context information [16, 17, 18, 19]. Recently, deep learning-based fusion methods have emerged [20, 21, 22], which learn the saliency map and fusion rule with a convolutional neural network (CNN). Here we review some related methods and explicitly distinguish them from the proposed method. Some comments are given in order:

- Multi-scale fusion has the advantage of extracting and combining salient features at different scales and is widely used in transform domain methods. The pyramid transform and the wavelet transform are the two most widely used categories of multi-scale decomposition schemes. For instance, the non-subsampled contourlet transform (NSCT) [23] and the dual-tree complex wavelet transform (DTCWT) [24] are used to decompose the image into a series of subbands and then perform coefficient fusion in the transform domain, while the LP-SR method [25] uses the multi-scale Laplacian pyramid transform. Instead of fusing coefficients at each scale individually, the cross-scale coefficient selection method [26] calculates an optimal set of coefficients for each scale in the transform domain. Different from these methods, our method performs multi-scale fusion directly in the spatial domain.

- The multi-scale weighted gradient fusion (MWGF) method [24] tries to combine all the important gradient information from the input images into an optimal weighted gradient image based on a two-scale scheme, and then reconstructs a fused image from the approximately optimal weighted gradient image. However, such a simple two-scale scheme cannot handle large scale variation of objects, which results in blur or distortion in the final fused image. Moreover, the reconstruction needs to minimize an energy function, which may be time-consuming. Our method searches across the entire scale space, and its reconstruction only needs to calculate the weighted average of the corresponding pixels in the input images, so it can be very efficient.

- The GFF-based method [16] also employs guided filtering to combine pixel saliency and spatial context for image fusion. However, the pixel saliency calculated in the GFF-based method can only extract salient features at two scales. Moreover, for each source image, the GFF-based method needs to perform guided filtering twice to refine the weight maps of the base layer and the detail layer separately, which is somewhat time-consuming. Due to its scale-invariant saliency selection, our method can deal with large scale variation and is faster than the GFF-based method.

- The CNN-based image fusion methods [21, 22] use a CNN to learn the saliency map of each image. Specifically, CNN models for patch similarity comparison [27] are trained on high-quality image patches and their blurred versions to encode the mapping between a source image and the corresponding saliency map [21]. Obviously, these CNN-based methods need very complicated training. Moreover, to deal with more than two images, the CNN-based methods fuse them one by one in series, which makes inference very time-consuming.
2 The Proposed Method
2.1 Scale-Invariant Saliency Selection
Our scale-invariant saliency selection scheme is based on scale space theory [28]. It has been shown that, under a variety of reasonable assumptions, the only possible scale space kernel is the Gaussian function, and the scale space of an image can be produced by convolving the image with variable-scale Gaussians. Here we adopt the scheme in [7] to generate the sampled scale space. The initial image is incrementally convolved with Gaussian functions to produce images separated by a constant factor k in scale space, and each octave of scale space is divided into an integer number S of layers, where k = 2^{1/S}. Once a complete octave has been processed, the first Gaussian image of the next octave has twice the initial value of σ and is downsampled by taking every second pixel in each row and column of the Gaussian image in its previous octave.
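The octave construction described above might be sketched as follows (a minimal Python/SciPy illustration of Lowe's sampling scheme, not the paper's C++ implementation; parameter defaults such as sigma0 = 1.6 are our assumptions, not values stated by the authors):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_octaves(image, n_octaves=3, n_layers=3, sigma0=1.6):
    """Build a sampled Gaussian scale space in Lowe's style.

    Each octave holds n_layers + 1 images separated by a constant
    factor k = 2**(1/n_layers) in scale; the next octave starts from
    the doubled-scale image, downsampled by taking every second pixel.
    """
    k = 2.0 ** (1.0 / n_layers)
    octaves = []
    base = image.astype(np.float64)
    for _ in range(n_octaves):
        layers = [gaussian_filter(base, sigma0)]
        for i in range(1, n_layers + 1):
            # incremental blur so layer i has total scale sigma0 * k**i
            sigma_inc = sigma0 * k ** (i - 1) * np.sqrt(k * k - 1.0)
            layers.append(gaussian_filter(layers[-1], sigma_inc))
        octaves.append(np.stack(layers))
        # every second pixel of the last (doubled-scale) image
        base = layers[-1][::2, ::2]
    return octaves
```

Because k**n_layers = 2, the last layer of each octave sits exactly one octave above the first, so subsampling it by 2 keeps the sampling consistent.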
We then utilize the sampled scale space of image I to generate the corresponding scale-invariant saliency map S(x, y). Inspired by [7], we first define a scale-dependent saliency metric based on the difference-of-Gaussian (DoG) response, which reflects the local image structure at the current scale and is robust to noise, as follows
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y) = L(x, y, kσ) − L(x, y, σ),   (1)
where L(x, y, σ) = G(x, y, σ) * I(x, y) is the Gaussian image, G(x, y, σ) is the Gaussian function and * is the convolution operator. The absolute values of the DoG response are further averaged in the neighborhood of each point by smoothing with a Gaussian of parameter σ_I (the integration scale), giving the scale-dependent saliency s(x, y, σ) = G(x, y, σ_I) * |D(x, y, σ)|.
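For illustration, the scale-dependent saliency at a single scale might be computed as below (a sketch in Python rather than the authors' C++ code; the defaults sigma = 1.6, k = 2^(1/3) and integration scale sigma_i = 2.0 are our assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_saliency(image, sigma=1.6, k=2.0 ** (1.0 / 3.0), sigma_i=2.0):
    """Scale-dependent saliency at one scale: |DoG response|
    smoothed with a Gaussian at the integration scale sigma_i."""
    img = image.astype(np.float64)
    # DoG response: difference of two Gaussian-blurred versions
    dog = gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
    # average |DoG| over a neighborhood (integration scale)
    return gaussian_filter(np.abs(dog), sigma_i)
```

On a step edge, the response peaks near the edge and decays in flat regions, which is the structure-saliency behavior the metric is meant to capture.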
It should be noted that, instead of the DoG-based response, some derivative-based alternatives can be used [29]. For example, we can define the following gradient-based metric
s(x, y, σ) = ||∇L(x, y, σ)|| = sqrt(L_x(x, y, σ)^2 + L_y(x, y, σ)^2),   (2)
or the following scale-normalized metric based on the Laplacian of Gaussian (LoG)
s(x, y, σ) = σ^2 |L_xx(x, y, σ) + L_yy(x, y, σ)|.   (3)
Here we use the DoG-based metric (1) because the DoG operator is a close approximation of the LoG function but significantly accelerates the computation [7]. Another class of possible alternatives is based on the eigenvalues of the second moment matrix [29, 24] or the Hessian matrix [7], but these are far more complicated and less stable [30]. To construct the scale-invariant saliency map S(x, y) of image I, for each position we search for the maximum saliency metric across the scale space
S(x, y) = max_σ s(x, y, σ).   (4)
Since the image resolution of each octave is different, we first apply the max operation within each octave and then resize the resulting map to the size of the original image. The scale-invariant saliency map is finally obtained by applying the max operation across the octaves, as shown in Figure 2.
In our implementation, when generating the DoG pyramid, the number of octaves is O and the initial scale is σ0, and enough images are produced in the stack of Gaussian-blurred images for each octave that the saliency comparison per octave covers a complete octave. The integration filter is a Gaussian low-pass spatial filter of fixed size.
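The per-octave max and cross-octave combination described above can be sketched as follows (an illustrative NumPy/SciPy fragment; bilinear upsampling via scipy.ndimage.zoom is our assumption, since the paper does not specify the interpolation used when resizing the octave maps):

```python
import numpy as np
from scipy.ndimage import zoom

def combine_octave_saliency(octave_maps, out_shape):
    """Upsample each octave's max-saliency map to the original image
    size, then take the pixel-wise max across octaves."""
    resized = []
    for m in octave_maps:
        fy = out_shape[0] / m.shape[0]
        fx = out_shape[1] / m.shape[1]
        resized.append(zoom(m, (fy, fx), order=1))  # bilinear resize
    return np.max(np.stack(resized), axis=0)
```

Each entry of `octave_maps` would be the max of the smoothed |DoG| layers within one octave, at that octave's (progressively halved) resolution.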
2.2 Activity Map Generation and Fusion
A straightforward approach is to take the obtained scale-invariant saliency of each pixel as the corresponding activity or weight. However, this introduces blur in the fused image. We adopt simple non-max suppression to alleviate this problem, i.e., for the n-th image I_n, we determine a mask M_n as the initial activity map by comparing the obtained scale-invariant saliency maps
M_n(x, y) = 1 if S_n(x, y) = max_{1 ≤ k ≤ N} S_k(x, y), and M_n(x, y) = 0 otherwise,   (5)
where N is the number of input images. However, as the above procedure compares pixels individually without considering spatial context information, the resulting masks are usually noisy and introduce artifacts into the fused image. Moreover, there may exist more than one maximum at a spatial position in scale space (i.e., multi-scale structures exist). To deal with these situations, one sophisticated solution is to model the pixel saliency and spatial smoothness simultaneously in an energy function, which can be globally optimized by tools such as graph-cut techniques [31], but the optimization is often relatively inefficient. Another choice is to perform a morphological smoothing operation, which is very efficient but inaccurate and likely to introduce errors or artifacts [24]. Guided image filtering [8] or joint bilateral filtering [32] is an interesting alternative, which provides a trade-off between efficiency and accuracy. Following [16], we determine the final activity map A_n by applying guided filtering to the initial activity map as follows
A_n(i) = (1/|w|) Σ_{k : i ∈ w_k} (a_k I(i) + b_k),   (6)
where |w| is the number of pixels in a window w_k of size (2r + 1) × (2r + 1) centered at pixel k, and a_k and b_k are the constant coefficients of window w_k, which are determined by ridge regression
a_k = ((1/|w|) Σ_{i ∈ w_k} I(i) M_n(i) − μ_k M̄_{n,k}) / (σ_k^2 + ε),   (7)

b_k = M̄_{n,k} − a_k μ_k.   (8)
Here μ_k and σ_k^2 are the mean and variance of the guidance image I in window w_k, M̄_{n,k} is the mean of the initial activity map M_n in window w_k, and ε denotes the regularization parameter penalizing large a_k. The window radius r and the regularization parameter ε of the guided filtering are fixed in our implementation. For each input image I_n, we determine the corresponding activity map A_n and then obtain the final fused image F by
F(x, y) = Σ_{n=1}^{N} A_n(x, y) I_n(x, y) / Σ_{n=1}^{N} A_n(x, y).   (9)
For color input images, the activity map is repeated for the red, green and blue channels respectively to generate the final color fused image.
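Putting the mask generation, guided-filter refinement and weighted-average fusion together, a minimal grayscale sketch of this stage could look like the following (illustrative Python, not the released C++ implementation; box-filter means via scipy.ndimage.uniform_filter stand in for the window averages, and the radius/eps defaults are assumed):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, p, radius=8, eps=0.1):
    """Guided filtering of map p with guidance image I."""
    size = 2 * radius + 1
    mean = lambda x: uniform_filter(x.astype(np.float64), size)
    mu_I, mu_p = mean(I), mean(p)
    var_I = mean(I * I) - mu_I * mu_I
    cov_Ip = mean(I * p) - mu_I * mu_p
    a = cov_Ip / (var_I + eps)    # ridge-regression slope per window
    b = mu_p - a * mu_I           # intercept per window
    return mean(a) * I + mean(b)  # averaged local linear models

def fuse(images, saliency_maps, radius=8, eps=0.1):
    """Non-max-suppression masks, guided-filter refinement, and
    normalized weighted-average fusion of grayscale images."""
    winner = np.argmax(np.stack(saliency_maps), axis=0)
    fused = np.zeros(images[0].shape, dtype=np.float64)
    total = np.zeros(images[0].shape, dtype=np.float64)
    for n, img in enumerate(images):
        mask = (winner == n).astype(np.float64)   # initial activity map
        activity = guided_filter(img.astype(np.float64), mask,
                                 radius, eps)     # refined activity map
        fused += activity * img
        total += activity
    return fused / np.maximum(total, 1e-12)       # weighted average
```

For color inputs, the same activity map would be broadcast over the three channels, as described above.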



3 Experiments and Discussions
To demonstrate the effectiveness and efficiency of the proposed image fusion method, we conduct a set of comparative experiments on three image datasets. The first is composed of 8 pairs of multimodal medical images, and the second contains 15 pairs of multi-focus gray or color natural images. These two datasets are often used in related papers, and some examples are shown in Figure 2(a) and Figure 2(b). The third is a new multi-focus cervical cell image dataset collected by ourselves, which consists of 15 groups of color images; each group contains a series of multi-focus cervical cell images at several high resolutions. Some source examples are shown in Figure 2(c). Our source code, implemented in C++, along with the new multi-focus cervical cell image dataset, is available online.
We compare our method with several state-of-the-art algorithms: dense SIFT (DSIFT) [33], dual-tree complex wavelet transform (DTCWT) [24], guided filter fusion (GFF) [16], image matting (IM) [18], the CNN-based method [21], Laplacian pyramid sparse representation (LP-SR) [25], multi-scale weighted gradient fusion (MWGF) [24], non-subsampled contourlet transform (NSCT) [23] and the boundary finding based method (BF) [19]. All the methods are implemented in Matlab except the CNN-based method, which is implemented in C++ based on Caffe [34]; their parameters are set to the default values given by the authors. The results are compared in terms of both visual and objective quality. For the objective quality evaluation of image fusion, we adopt several commonly used metrics, including mutual information (MI) [35], structural similarity (SSIM) [36], the quality index (QI) [37], the edge information preservation value (Q^{AB/F}) [38], feature mutual information (FMI) [39] and visual information fidelity (VIF) [40].

We first evaluate the performance of the proposed method under a varying total number of octaves O and number of layers S sampled per octave. The fused images of a pair of multimodal medical images for different O and S are shown in Figure 4. In this example, on the one hand, when only 1 or 2 octaves are involved in constructing the DoG pyramid, the fused images fail to keep the integrity of large objects (e.g., the eyeballs), while increasing O preserves the integrity of the eyeballs. On the other hand, although the effect is not as significant as that of increasing the number of octaves O, the fused image contains more details as the number of layers S increases. The corresponding objective quality metrics are shown in Figure 5. As shown in Figure 5(a), most of the metric values improve as the number of octaves increases (with the number of layers fixed at 3), and each of them tends to stabilize once the number of octaves reaches 5. From Figure 5(b), some of the metric values, such as MI, SSIM, QI and VIF, already perform well when the number of layers is 3, and all the metric values change only slightly as the number of layers increases (with the number of octaves fixed at 5).
Since larger values of O and S incur a greater computational burden, and different kinds of source images behave differently under different parameter settings, we choose a trade-off and set these parameters separately for the multimodal dataset, the natural dataset and the multi-focus cell dataset in our experiments.
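As a concrete example of the objective evaluation, the MI metric of Qu et al. sums the mutual information between the fused image and each source image. A minimal histogram-based sketch of the pairwise term (our own illustrative implementation, not the evaluation code used in the experiments):

```python
import numpy as np

def mutual_information(a, b, bins=256):
    """Histogram-based mutual information (in bits) between two
    gray images, the building block of the MI fusion metric."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()                    # joint distribution
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals
    nz = pxy > 0                               # avoid log(0)
    return float(np.sum(pxy[nz] *
                 np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))
```

The fusion score for two sources A and B would then be mutual_information(F, A) + mutual_information(F, B).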
Figure 6 shows the fused images obtained by different methods from the multimodal source images shown in Figure 2(a). As shown in these figures, the proposed method produces images which preserve the complementary information of the different source images well. Moreover, due to the scale-invariant structure saliency selection, our method keeps the integrity of large objects and the visual details simultaneously. Although the fused images generated by the other methods also capture details to some extent, all of them fail to keep the integrity of large objects such as the eyeballs. Furthermore, from Figures 6(k)-6(t), the DTCWT, GFF, IM and NSCT methods may decrease the brightness and contrast, while the proposed method preserves these features and details without producing visible artifacts or brightness distortions.
Figure 7 and Figure 8 show the fused images produced by the different methods from the natural image pairs shown in Figure 2(b). A close-up view is presented at the bottom of each subpicture in Figure 7. These figures show that although all of these methods generate acceptable fused images, our method produces slightly better results than the others (see the halo artifacts in the magnified area of Figure 7).
Figure 9, Figure 10 and Figure 11 show the comparative fusion results for the multi-focus cell images shown in Figure 2(c). For clarity, we also present a close-up view at the bottom right of each subpicture in Figure 9 and Figure 10. As shown in the close-up views of Figure 9, the fused images from the DSIFT, IM, MWGF and BF methods are extremely blurred at the boundaries and fail to keep the details of the cell nuclei. Furthermore, the DTCWT- and NSCT-based methods produce halo artifacts in the fused images, while the GFF- and CNN-based methods fail to preserve the small cell nuclei. The LP-SR-based method works almost fine, keeping most of the details of the small cells, but the integrity of the clustered large cells is damaged. In contrast, our proposed method preserves the integrity of the clustered large cells and maintains most of the isolated small cells from the original images, demonstrating the best visual quality.
Similarly, as shown in the close-up views of Figure 10, the fused images from DSIFT, IM, MWGF and BF are blurred and lose some nucleus details, while the results from DTCWT, GFF, CNN and NSCT exhibit halo artifacts. The LP-SR-based method keeps details well but also produces halo artifacts and other noise. Our method preserves the focused areas of the different source images well without introducing any artifacts. For the example illustrated in Figure 11, the fused images generated by DSIFT, DTCWT, IM and NSCT all fail to preserve the focused areas of the different source images and are extremely blurred. The GFF-, CNN-, MWGF- and BF-based methods introduce considerable color distortion in the nucleus regions and obvious halo artifacts. The result of the LP-SR-based method is close to ours but introduces some odd color distortion. Again, our method produces a fused image which preserves the focused areas of the different source images well without introducing any artifacts.
Table 1: Objective quality metrics of different fusion methods on the three datasets.

Source | Index | DSIFT | DTCWT | GFF | IM | CNN | LP-SR | MWGF | NSCT | BF | Proposed
Multimodal dataset | MI | 1.4832 | 1.0889 | 1.2421 | 1.3495 | 0.8067 | 1.2566 | 1.3601 | 1.1233 | 0.7921 | 1.4632
 | SSIM | 0.6416 | 0.6104 | 0.6489 | 0.6460 | 0.6654 | 0.6443 | 0.6512 | 0.6342 | 0.6570 | 0.6646
 | Q^{AB/F} | 0.6017 | 0.5037 | 0.5600 | 0.5354 | 0.6183 | 0.5610 | 0.5983 | 0.5390 | 0.6224 | 0.5850
 | QI | 0.5109 | 0.4180 | 0.4807 | 0.5134 | 0.5962 | 0.4887 | 0.5236 | 0.4428 | 0.5931 | 0.5471
 | FMI | 0.8639 | 0.8514 | 0.8574 | 0.8560 | 0.8718 | 0.8569 | 0.8680 | 0.8514 | 0.8748 | 0.8648
 | VIF | 0.3731 | 0.2310 | 0.2789 | 0.3000 | 0.4165 | 0.2735 | 0.3561 | 0.2417 | 0.4444 | 0.3315
Natural multi-focus dataset | MI | 1.8910 | 1.8603 | 1.9176 | 1.9035 | 0.7881 | 1.8738 | 1.8840 | 1.8838 | 0.7894 | 2.1196
 | SSIM | 0.8328 | 0.8271 | 0.8305 | 0.8247 | 0.8431 | 0.8271 | 0.8219 | 0.8363 | 0.8203 | 0.8537
 | Q^{AB/F} | 0.6203 | 0.6196 | 0.6256 | 0.6226 | 0.6774 | 0.6206 | 0.6210 | 0.6233 | 0.6660 | 0.6749
 | QI | 0.6351 | 0.6182 | 0.6236 | 0.6175 | 0.6792 | 0.6239 | 0.6152 | 0.6381 | 0.6634 | 0.6891
 | FMI | 0.8597 | 0.8612 | 0.8616 | 0.8612 | 0.8686 | 0.8615 | 0.8619 | 0.8612 | 0.8583 | 0.8663
 | VIF | 0.4871 | 0.4820 | 0.4953 | 0.4905 | 0.5756 | 0.4827 | 0.4948 | 0.4857 | 0.5584 | 0.5457
Multi-focus cell dataset | MI | 1.3387 | 1.1634 | 1.3042 | 1.2152 | 0.6893 | 1.0834 | 1.1295 | 1.1507 | 0.6801 | 1.1422
 | SSIM | 0.6923 | 0.6568 | 0.6839 | 0.6552 | 0.6279 | 0.6436 | 0.6481 | 0.6566 | 0.6407 | 0.6422
 | Q^{AB/F} | 0.1870 | 0.2120 | 0.2100 | 0.1950 | 0.2051 | 0.2331 | 0.1886 | 0.2228 | 0.2102 | 0.2468
 | QI | 0.1915 | 0.1779 | 0.1844 | 0.1677 | 0.1662 | 0.2161 | 0.1814 | 0.2115 | 0.1855 | 0.2474
 | FMI | 0.7603 | 0.7570 | 0.7622 | 0.7475 | 0.7456 | 0.7579 | 0.7152 | 0.7573 | 0.7537 | 0.7605
 | VIF | 0.2261 | 0.1854 | 0.2205 | 0.1959 | 0.2087 | 0.2293 | 0.2225 | 0.1967 | 0.2270 | 0.2388
Table 2: Average running time of different fusion methods.

Method | DSIFT | DTCWT | IM | CNN | LP-SR | MWGF | NSCT | BF | GFF | GFF(C++) | Proposed
Time (s) | 1679.67 | 36.53 | 72.47 | 5276.09 | 34.29 | 259.30 | 414.24 | 453.39 | 10.66 | 6.44 | 2.08
The quantitative results of the different fusion methods are shown in Table 1. The proposed method yields competitive objective metrics on the natural multi-focus dataset and the multi-focus cell dataset. For the multimodal dataset, its metric values are not always the best but are close to the best performance, e.g., for MI, SSIM and FMI. We also compare the computational efficiency of each method on the high-resolution color cell images. Experiments are performed on a computer equipped with a 4.20 GHz CPU and 8 GB memory, and all codes are available online. The average running time of the different image fusion methods is compared in Table 2. As mentioned before, DSIFT, DTCWT, GFF, IM, LP-SR, MWGF, NSCT and BF are all implemented in Matlab while the CNN-based method and ours are based on C++, so, strictly speaking, the running time comparison is unfair. Here, we re-implement the GFF-based method in C++ and also include the corresponding running time in Table 2 to reveal, to some extent, the efficiency difference between Matlab and C++ implementations. As shown in Table 2, the guided filtering based methods, i.e., the GFF-based method and the proposed method, are the most efficient, while the CNN-based method is the most time-consuming. Compared to the original Matlab implementation, the GFF method is sped up by almost 40% with the C++ implementation, but it is still much slower than our method. We attribute this to the following reasons. First, the computational burden of the DoG-based scale-invariant saliency selection step is negligible compared to that of the activity map refinement step based on guided filtering; the GFF-based method needs to perform guided filtering twice (once each for the base and detail layers), while our method only needs to perform filtering once to refine the activity map of each source image. Second, instead of using the original color image, we use the gray one as the guidance image to accelerate the activity refinement step. Due to its extreme efficiency, our method can be applied in some near real-time applications such as digital cytopathology [6], and can be further accelerated through GPU programming.
4 Conclusion
In this paper, based on scale space theory, we propose a very simple yet effective multi-scale image fusion method in the spatial domain. To keep both the details of small objects and the integrity of large objects in the fused image, we first obtain a robust scale-invariant structure saliency map based on the DoG pyramid, which turns the detection of both details and object integrity into a strong response across scales. Then the activity map is constructed by a non-max suppression scheme based on the saliency maps and refined by guided filtering to capture the spatial context. Finally, the fused image is generated by combining the activity maps and the original input images. Experimental results demonstrate that our method is efficient and can produce a high-quality all-in-focus image which well preserves the details and integrity of objects of very different sizes. Meanwhile, due to its low time complexity, the proposed method can handle high-resolution images very efficiently and can be applied in real-time applications.
Acknowledgements
This research was partially supported by the Natural Science Foundation of Hunan Province, China (No. 14JJ2008), the National Natural Science Foundation of China under Grants No. 61602522, No. 61573380 and No. 61672542, and the Fundamental Research Funds of the Central Universities of Central South University under Grant No. 2018zzts577.
References
 [1] T. Stathaki, Image Fusion: Algorithms and Applications, Elsevier, 2011.
 [2] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: A survey of the state of the art, Information Fusion 33 (2017) 100–112.
 [3] J. Du, W. Li, K. Lu, B. Xiao, An overview of multi-modal medical image fusion, Neurocomputing 215 (2016) 3–20.
 [4] J. Duan, L. Chen, C. P. Chen, Multifocus image fusion with enhanced linear spectral clustering and fast depth map estimation, Neurocomputing 318 (2018) 43–54.
 [5] R. Nayar, D. C. Wilbur, The Bethesda System for Reporting Cervical Cytology: Definitions, Criteria, and Explanatory Notes, Springer, 2015.
 [6] L. Pantanowitz, M. Hornish, R. A. Goulart, The impact of digital imaging in the field of cytopathology, Cytojournal 6 (2009).
 [7] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
 [8] K. He, J. Sun, X. Tang, Guided image filtering, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (6) (2013) 1397–1409.
 [9] K. He, D. Zhou, X. Zhang, R. Nie, Multi-focus: Focused region finding and multi-scale transform for image fusion, Neurocomputing 320 (2018) 157–170.
 [10] X. Liu, W. Mei, H. Du, Structure tensor and nonsubsampled shearlet transform based algorithm for CT and MRI image fusion, Neurocomputing 235 (2017) 131–139.
 [11] B. Yang, S. Li, Multifocus image fusion and restoration with sparse representation, IEEE Transactions on Instrumentation and Measurement 59 (4) (2010) 884–892.
 [12] Y. Liu, X. Chen, R. K. Ward, Z. J. Wang, Image fusion with convolutional sparse representation, IEEE Signal Processing Letters 23 (12) (2016) 1882–1886.
 [13] Q. Zhang, T. Shi, F. Wang, R. S. Blum, J. Han, Robust sparse representation based multi-focus image fusion with dictionary construction and local spatial consistency, Pattern Recognition 83 (2018) 299–313.
 [14] H. R. Shahdoosti, H. Ghassemian, Combining the spectral PCA and spatial PCA fusion methods by an optimal filter, Information Fusion 27 (2016) 150–160.
 [15] V. N. Gangapure, S. Banerjee, A. S. Chowdhury, Steerable local frequency based multispectral multifocus image fusion, Information Fusion 23 (2015) 99–115.
 [16] S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Transactions on Image Processing 22 (7) (2013) 2864–2875.
 [17] Y. Chen, J. Guan, W.-K. Cham, Robust multi-focus image fusion using edge model and multi-matting, IEEE Transactions on Image Processing 27 (3) (2018) 1526–1541.
 [18] S. Li, X. Kang, J. Hu, B. Yang, Image matting for fusion of multi-focus images in dynamic scenes, Information Fusion 14 (2) (2013) 147–162.
 [19] Y. Zhang, X. Bai, T. Wang, Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure, Information Fusion 35 (2017) 81–101.
 [20] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, X. Wang, Deep learning for pixel-level image fusion: Recent advances and future prospects, Information Fusion 42 (2018) 158–173.
 [21] Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network, Information Fusion 36 (2017) 191–207.
 [22] C. Du, S. Gao, Image segmentation-based multi-focus image fusion through multi-scale convolutional neural network, IEEE Access 5 (2017) 15750–15761.
 [23] Q. Zhang, B.-l. Guo, Multifocus image fusion using the nonsubsampled contourlet transform, Signal Processing 89 (7) (2009) 1334–1346.
 [24] Z. Zhou, S. Li, B. Wang, Multi-scale weighted gradient-based fusion for multi-focus images, Information Fusion 20 (2014) 60–72.
 [25] Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion 24 (2015) 147–164.
 [26] R. Shen, I. Cheng, A. Basu, Cross-scale coefficient selection for volumetric medical image fusion, IEEE Transactions on Biomedical Engineering 60 (4) (2013) 1069–1079.
 [27] S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4353–4361.
 [28] T. Lindeberg, Scale-space theory: A basic tool for analysing structures at different scales, Journal of Applied Statistics 21 (1–2) (1994) 225–270.
 [29] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, International Journal of Computer Vision 60 (1) (2004) 63–86.
 [30] K. Mikolajczyk, Detection of local features invariant to affine transformations, Ph.D. thesis, Institut National Polytechnique de Grenoble (2002).
 [31] V. Kolmogorov, R. Zabih, What energy functions can be minimized via graph cuts?, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2) (2004) 147–159.
 [32] G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, K. Toyama, Digital photography with flash and no-flash image pairs, ACM Transactions on Graphics 23 (3) (2004) 664–672.
 [33] Y. Liu, S. Liu, Z. Wang, Multi-focus image fusion with dense SIFT, Information Fusion 23 (2015) 139–155.
 [34] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
 [35] G. Qu, D. Zhang, P. Yan, Information measure for performance of image fusion, Electronics Letters 38 (7) (2002) 313–315.
 [36] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE Transactions on Image Processing 13 (4) (2004) 600–612.
 [37] Z. Wang, A. C. Bovik, A universal image quality index, IEEE Signal Processing Letters 9 (3) (2002) 81–84.
 [38] C. Xydeas, V. Petrovic, Objective image fusion performance measure, Electronics Letters 36 (4) (2000) 308–309.
 [39] M. B. A. Haghighat, A. Aghagolzadeh, H. Seyedarabi, A non-reference image fusion metric based on mutual information of image features, Computers & Electrical Engineering 37 (5) (2011) 744–756.
 [40] H. R. Sheikh, A. C. Bovik, Image information and visual quality, IEEE Transactions on Image Processing 15 (2) (2006) 430–444.