1 Introduction
As the depthoffield (DoF) of the brightfield microscopy is often limited, we can often capture images focused on the partial scene while the parts of the specimen that lie outside the object plane are blurred. These images are not appropriate for further diagnose analysis. Multifocus image fusion provides one versatile solution to reconstruct an allinfocus image by merging a series of localfocused images taken under different distance from the object to the lens. Existing multifocus image fusion methods are usually based on the assumption that the source images are wellaligned miao2011novel ; li2017pixel ; li2013imageGFF or with only small misregistration liu2015multi ; zhou2014multi . But the conditions are not always satisfied for the multifocus microscopical images which are captured by moving the specimen using a step motor, where slight vibration will result in large misregistration. Figure 1 shows the fused results by several stateoftheart multifocus image fusion methods, including guided filtering (GFF) li2013imageGFF , the dense SIFT (DSIFT) liu2015multi , and multiscale weighted gradient (MWGF) zhou2014multi , where there are obvious visual artifacts and distortion.
The misregistration of images has become one of the serious bottleneck to improve the visual quality for multifocus image fusion. A straightforward solution is to register all input images first and then perform image fusion based on the registered images liu2015automatic ; laliberte2003registration ; kessler2006image . Especially, the images can be registered first using the SIFTbased image registration methods Gong2014A ; Guo2017Automatic , then the registered images can be fused using DSIFT liu2015multi . However, this scheme regards the registration and fusion as two completely independent parts, resulting in redundant operation during the process of the registration and fusion. In fact, during the SIFT feature detection for registration, it records the difference of gaussian (DoG) response for each pixel, which can also be regarded as salient response for the subsequent image fusion.
In this paper, we propose a novel method based on SpeedUp Robust Features (SURF) bay2006surf to fuse the unregistered multifocus microscopical images in a very efficient way. The key idea is to use the determinant of the approximate Hessian for both SURF detection and the saliency map generation. We first register all input images using the SURFbased registration method and then merge them by regarding the determinant of the approximated Hessian as the corresponding salient response, thus it enables nearly costfree saliency map generation. In addition, due to the adoption of SURF scale space representation, our method can generate scaleinvariant saliency map which is desired for scaleinvariant image fusion. Experimental results demonstrate our method is very fast to fuse the highresolution microscopical images. Furthermore, as shown in Figure 0(d), the fused image is of desirable visual quality that there is no geometric distortion and information artifacts.
1.1 Related Work
Registration Method. The goal of it is to establish the correspondence between sensed and reference images to maximize the similarity of the two images with the common content zitova2003image
. It can be roughly classified into intensitybased and featurebased methods
goshtasby20052 . The intensitybased methods compare intensity patterns in images via correlation metrics. The featurebased methods find correspondence between image feature, such as points, Tjunction, edge and contour to formulate the feature points, which are widely used due to their simplicity laliberte2003registration ; hill2001medical .Fusion Method. It can be mainly divided into transformation domain method and spatial domain method goshtasby2007image . The transformation methods are based on the framework ”decompositionfusionreconstruction”, which first transform the images by some mathematical transformation such as pyramid decomposition, wavelet decomposition or sparse coding pajares2004wavelet ; liu2015general , and then reconstruct the fused image based on the transform coefficients. They are often not efficient due to the multidecomposition and inverse construction aslantas2014pixel
. The spatial domain methods try to take each pixel in the fused image as the weighted average of the corresponding pixels in the input images, where the weights or activity map are often determined according to the saliency of different pixels. These methods are often fast and easy to be implemented, but the fused performance extremely depends on the accurate estimation of the weights of the pixels. The proposed method belongs to this category. Recently, there are some methods
Liu2018Deepbased on the convolution neural networks (CNNs) to regard the image fusion as a classification problem to avoid the handcraft design. However, it is not efficient and the performance highly depends on the network and the size of training dataset.
Joint Fusion and Registration. Recently, there are several joint methods chen2015sirf ; chen2011maximum ; ofir2017registration which combine the registration and fusion into a single frame and then use the alternating optimization between registration and fusion to minimize the resulting energy function. However, the repeated iteration is often timeconsuming, which makes these joint methods not appropriate to fuse the high resolution images in realtime or nearly realtime systems.
2 Methodology
The scheme of the proposed method is shown in Figure 2. First, following SURF bay2006surf we first approximate the Gaussian second order derivatives with box filters to calculate the Hessian matrix and construct a scale space representation for each image by simply varying the box filters size. Second, based on the SURF scale space representation, we can detect the repeatable feature points and generate the scaleinvariant saliency map simultaneously. Third, the simplified registration transformation model can be estimated using SURF matching followed by a Houghlike voting scheme. Based on the estimated registration model, all input images and the corresponding scaleinvariant saliency map can be registered individually. Then a simple nonmax suppression operation is applied to these registered saliency maps and the resulting mask images are further refined by a singlescale guided filtering He2013Guided to generate the optimal weighted maps. Finally, the fused image is constructed by combining the registered source images with the corresponding weighted maps.
2.1 SURF Scale Space Representation
The SURF relies on the determinant of the Hessian for both location and scale selection bay2006surf . However, during calculating the Hessian the SURF simply replaces the second order Gaussian partial derivative by box filters shown in the bottomleft of Figure 3, whereas these approximations can be evaluated very fast using integral images independently of size. Based on these approximations, we can efficiently construct the SURF scale space representation by upscaling the filter size rather than iteratively reducing the image size. To be specific, the scale space includes several octaves and each octave is filled with constant layers and for the th octave and th layer, the corresponding salient response image is then determined by the determinant of approximate Hessian matrix , i.e.,
(1) 
(2) 
where the is the convolution of the box filter in direction of size with the image in position , similar to and . The is the relative weight to balance the expression for the actual determinant of Hessian matrix, which is often set as , and is the size of box filters corresponding to the th octave and the th layer which is often determined by
(3) 
The smallest size of box filters is set to which are the approximations for Gaussian second order derivatives with and represents the lowest scale or the highest spatial resolution.
2.2 FeatureBased Registration
2.2.1 Registration Model
The featurebased registration hill2001medical aims to set up a parametric transformation that can relate the position of features in the sensed image or coordinate space with the position of the corresponding feature in the reference image or coordinate space. The general 2D planar transformation is modeled as perspective transform or homography, which operates on homogeneous coordinates
(4) 
where is an arbitrary matrix. Note that
is homogeneous, it has only eight degrees of freedom. As for our microscopical scenarios where the multifocus images are taken by moving the specimen using a step motor, the misregistration is mainly generated from the vibration of the step motor and the scale change caused by varying the distance from the specimen to the lens. Due to our scaleinvariant saliency map generation (Section
2.3.1), we neglect the scale change and further approximate the motor’s vibration as translation in vertical and horizonal directions for reasons of simplicity. Thus, the finally transformation can be simplified as(5) 
where the and are the only two freedoms that need to be estimated.
To determine the model and register the series of input images, we follow the basic pipeline of featurebased registration zitova2003image , i.e., feature detection, feature matching, and image registration.
2.2.2 Feature Detection and Description
Based on the SURF scale space representation, we can detect the feature points by applying nonmaximum suppression in the local region, as shown in Figure 3. To cover a complete octave, there are two additional layers added to each octave. After searching through all the layers, the scaleinvariant feature points can be detected to prepare for the feature matching. Then, the description for each feature point is calculated based on Haar wavelet response (refer to bay2006surf for details). Because there is no rotation transformation in the multifocus images, we adopt the upright SURF (USURF) to accelerate the calculation. Usually, the dimension of the descriptor is 64 () and can be easily changed to 36 () or 100 () by splitting up each region into square subregions with different size.
2.2.3 Reference Image selection and Feature Matching
We first need to choose an reference image from the series of input multifocus images. A reasonable way is to select the image with least blurred region or highest sharpness. A more sophisticated method liu2015automatic is to choose the image whose information entropy is closest to the root mean square values of all images. However, these methods will introduce additional computation to evaluate the sharpness or distances in terms of information entropy. Here we choose one with the largest number of detected feature points as the reference image and the others as the sensed images. The rationale behind this is the reference image should be selected to match the others as possible. Such a costfree alternative is a very simple but also effective in practice.
In the matching stage, for each feature point in the sensed image, we search the most similar point in the reference image by EuclideanDistance with the corresponding descriptors. In our implementation, we reject all matches in which the ratio of closest to secondclosest neighbors is greater than 0.8. As pointed out in bay2006surf , the matching can be accelerated by taking advantage of the sign of the Laplacian (i.e. the trace of the Hessian matrix) which distinguishes bright blobs on dark backgrounds from the reverse situation. If the trace of the for two feature points is opposite, there is no need to calculate the distance between these points and therefore can accelerate the feature matching.
2.2.4 Model Estimation and Image Registration
Based on the matched pairs, we can estimate the transformation parameters and then perform the registration between each sensed image and reference image. As shown in Figure 3(a), there are some obvious mismatched pairs due to the similar local appearance and the simplistic matching strategy. The RANdom SAmple Consensus (RANSAC) algorithm Fischler1981Random
is widely used to estimate the transformation parameters which are robust to these “outliers”. However, since RANSAC needs to iteratively fit the model and identify the “outliers”, it is a little timeconsuming. As our model only involves two parameters (i.e.
, ), here we adopt an alternative Houghlike voting scheme which is much faster. More precisely, we divide the parameter space into several grids constrained by an error tolerance, and calculate the for each matched pair which votes for the corresponding grid. The matched pairs who vote for the grid with the highest value are regarded as “inliers”, as shown in Figure 3(b). Then the final optimal is simply determined by the median or mean of the ones of these “inliers”, respectively corresponding to solution of least absolute deviations () or least square errors () loss. Based on the estimated translation parameters, the sensed images can be easily aligned to reference image.2.3 Spatial Domain Fusion
The spatialdomain fusion method often consists of the following three steps, i.e., saliency map generation, weighted map construction and image fusion.
2.3.1 Scaleinvariant Saliency Map Generation
The saliency map records the salient response for each pixel. A good saliency map should be robust to the noise and the varies of objects size. As mentioned above, we directly use the pixelwise determinant of the approximated Hessian as the corresponding salient response, therefore we can generate the scaleinvariant saliency map almost without extra computation. Specifically, for each pixel in the image , we achieve the scaleinvariant saliency by searching across the scale space
(6) 
where and are the number of layers and octaves, respectively. It should be noted that the scale space is constructed based on the unregistered input images and thereby the resulting scaleinvariant saliency is also needed aligned based on the above estimated translation parameters, resulting in the registered saliency map .
2.3.2 Weighted Map Construction and Fusion
A straightforward approach is to take the registered scaleinvariant saliency of each pixels as the corresponding weight. However, it will introduce blur in the fused image. We adopt the simple nonmax suppression to alleviate this problem. The initial weight map for the th image can be obtained with the aligned saliency map
(7) 
where is the number of input images.
Notice that the above procedure compares pixels individually without considering the spatial context information, which will introduce some geometric distortion and artifacts in the final fused image. Following li2013imageGFF , we determine the final weight by applying guided filtering He2013Guided on the initial weight map as follows
(8) 
where is the number of pixels in window with size of which is centered at pixel , is the aligned image, and are the const coefficients of window which are determined by ridge regression
(9)  
(10) 
Here and
are the mean and variance of registered image
in window , is the mean of the initial weight map in window and denotes the regularization parameter penalizing large .According to Eq. 8, we can determine the corresponding weight map for each registered image and thereby obtain the final fused image by
(11) 
3 Experiments and Discussion
To demonstrate the effectiveness and efficiency of the proposed method, we conduct a series of experiments on the dataset collected by ourselves, consisting of 10 groups of unregistered multifocus 4K ultra HD microscopic images with size of , which are available online along with the source code ^{1}^{1}1https://github.com/yiqingmy/JointRF implemented in C++. Some example images are shown in Figure 5. We first investigate the effect of different parameter settings and then compare our method with several stateoftheart multifocus image fusion methods including guided filtering (GFF) li2013imageGFF , the dense SIFT (DSIFT) liu2015multi , and multiscale weighted gradient (MWGF) zhou2014multi , both in terms of the visual performance and running time. These compared methods can be downloaded online and all of them are implemented in Matlab.
The number of octaves and number of layers sampled per octave will produce an effect on both the feature points detection and saliency map construction directly, and therefore on the final fusion result. In addition, the dimensionality of USURF descriptor will influence on the matching accuracy and speed. We evaluate the performance of our method under varying these parameters in terms of both matching accuracy and visual quality of the fused results. Here the matching accuracy metric is determined by the ratio of the number of ”inliers” () and the number of all matched pairs (), i.e.
(12) 
As usual, in our implementation only the top 20 matched pairs are used for hough voting and the matching accuracy evaluation.
The accuracy under varying total number of octaves and number of layers sampled per octave is shown in Figure 6, where the dimension of the USURF descriptors is 64. We can see that the accuracy is improved globally when the number of octaves and layers are increased, and finally saturated. Meanwhile, Figure 7 shows the final fused results based on the source images shown in Figure 5 with different and . It can be shown that when the number of octaves is less than 4, although the value of is large, there are lots of obvious artifacts in the fused images (See Figure 6(a)6(c)), whereas with the increasing of the value of with a few layers in each octave, the visual quality of the fused results is better and there are not obvious distortion and artifacts, as shown in Figure 6(d)6(h). Since there are no obvious distortion and artifacts when covering all the number of layers, we fix the value of to 5 when evaluating the effect of the number of layers per octave. As shown in the last row of Figure 7, the visual quality is improved obviously by vary the value of from 1 to 2, where the distortion and artifacts are prominently decreased (refer to the magnified region in Figure 6(i) and Figure 6(j)). The performance can be further improved slightly by set the or , as shown in the magnified region in Figure 6(k) and Figure 6(l). In the following experiments, we choose to set .
Dimension(D)  16  36  64  100  144 

Accuracy  0.293  0.768  0.813  0.057  0.057 
Matching Time(s)  0.002  0.003  0.006  0.008  0.014 
Then, we change the dimension of the feature descriptors and Table 1 lists the accuracy and the timecost of the feature matching. As the dimension increases, the matching time is increased and the accuracy can be increased gradually until the dimension is 64. Surprisedly, the accuracy would plummet when higher dimension of the USURF is used. The corresponding fused images are shown in Figure 8, and as it shown that the visual quality is very poor when the dimension of the descriptors is too low (i.e., =16) or too high (). Therefore, we set the dimension to 64 for the following comparative experiments.
Figure 9, Figure 10 and Figure 11 show three groups of source images and the comparative results produced by different methods respectively. Each of them displays the sequence of unregistered source images and corresponding results by different methods. The fused results based on DSIFT are shown in Figure 8(i), 9(e) and 10(e), which are extremely blurred obviously even worse than the input images. For the fused images by GFF, there are serious ghost effect that the information is destroyed from the input images, as shown in Figure 8(j), 9(f) and 10(f). Figure 8(k), 9(g) and 10(g) illustrate the results of MWGF, which show that the performance of them are better at a glimpse, but the performance is degraded by the ghost effect, where there are some geometrical distortion. And for clarity, we magnify several regions. Fortunately, as shown in Figure 8(l),9(h) and 10(h), the fused images are very salient by the proposed method. And the information of objects with very different size is reserved well that there are no artifacts and distortion.
We also compare the average running time by fusing the source images in Figure 9, Figure 10 and Figure 11 with these methods. We reimplement the GFFbased method with C++ to make a reference for different implementation. In addition, we also implement an intuitive approach which first performs registration and then fusion (RF) to validate the efficiency of our method ^{2}^{2}2The RF method produces similar visual fused images as ours and therefore are not included due to the limited space.. The experiments are conducted on a PC equipped with a 4.20GHz CPU and 8GB memory and the results are shown in Table 2. It can be seen that our method is extremely faster than others, though they can be speeded up by almost 66 implemented in C++. And it can be furthermore accelerated through GPU programming to make it applicable for realtime applications such as digital cytopathology Pantanowitz2009The .
Methods  DSIFT  MWGF  GFF  GFF(C++)  RF  Proposed 
Time(s)  2.27e+03  1.39e+03  1.05e+02  3.57e+01  2.17e+01  1.54e+01 
4 Conclusion
In this paper, we first propose a novel method to fuse the unregistered microscopical multifocus images in a very efficient way. Different from the existing methods in literature, the proposed method is base on the SURF scale space representation to eliminate the redundant operation between image registration and fusion. The scaleinvariant saliency map can be generated while the repeatable feature points are detected simultaneously. Meanwhile, due to the robust scale space, the image registration is accurate and the information of very different size objects is reserved well in the fused image. The experiments investigate how the parameters (i.e., the number of octaves and layers in the scale space and the dimension of the feature descriptor) influence the fusion results in our algorithm. Meanwhile, we compare our method with several stateoftheart methods and demonstrate our method is superior to others in the efficiency and visual performance. However, there are still some challenges that should be further explored, e.g., making the algorithm more robust to be applied to different practice and promoting the calculation to make it realtime.
Acknowledgements
This research was partially supported by the Natural Science Foundation of Hunan Province, China (No.14JJ2008), the National Natural Science Foundation of China under Grant No. 61602522, No. 61573380, No. 61672542 and the Fundamental Research Funds of the Central Universities of Central South University under Grant No. 2018zzts577.
References
References
 [1] Q.g. Miao, C. Shi, P.f. Xu, M. Yang, Y.b. Shi, A novel algorithm of image fusion using shearlets, Optics Communications 284 (6) (2011) 1540–1547.
 [2] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixellevel image fusion: A survey of the state of the art, Information Fusion 33 (2017) 100–112.
 [3] S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Transactions on Image Processing 22 (7) (2013) 2864–2875.
 [4] Y. Liu, S. Liu, Z. Wang, Multifocus image fusion with dense sift, Information Fusion 23 (2015) 139–155.
 [5] Z. Zhou, S. Li, B. Wang, Multiscale weighted gradientbased fusion for multifocus images, Information Fusion 20 (2014) 60–72.
 [6] Y. Liu, F. Yu, An automatic image fusion algorithm for unregistered multiply multifocus images, Optics Communications 341 (2015) 101–113.
 [7] F. Laliberté, L. Gagnon, Y. Sheng, Registration and fusion of retinal imagesan evaluation study, IEEE Transactions on Medical Imaging 22 (5) (2003) 661–673.
 [8] M. L. Kessler, Image registration and data fusion in radiation therapy, The British journal of radiology 79 (special_issue_1) (2006) S99–S108.
 [9] M. Gong, S. Zhao, L. Jiao, D. Tian, S. Wang, A novel coarsetofine scheme for automatic image registration based on sift and mutual information, IEEE Transactions on Geoscience and Remote Sensing 52 (7) (2014) 4328–4338.

[10]
F. Guo, X. Zhao, B. Zou, Y. Liang, Automatic retinal image registration using blood vessel segmentation and sift feature, International Journal of Pattern Recognition and Artificial Intelligence 31 (11) (2017) 10.

[11]
H. Bay, T. Tuytelaars, L. Van Gool, Surf: Speeded up robust features, in: European conference on computer vision, Springer, 2006, pp. 404–417.
 [12] B. Zitova, J. Flusser, Image registration methods: a survey, Image and vision computing 21 (11) (2003) 977–1000.
 [13] A. A. Goshtasby, 2D and 3D image registration: for medical, remote sensing, and industrial applications, John Wiley & Sons, 2005.
 [14] D. L. Hill, P. G. Batchelor, M. Holden, D. J. Hawkes, Medical image registration, Physics in medicine & biology 46 (3) (2001) R1.
 [15] A. A. Goshtasby, S. Nikolov, Image fusion: advances in the state of the art (2007).
 [16] G. Pajares, J. M. De La Cruz, A waveletbased image fusion tutorial, Pattern recognition 37 (9) (2004) 1855–1872.
 [17] Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multiscale transform and sparse representation, Information Fusion 24 (2015) 147–164.
 [18] V. Aslantas, A. N. Toprak, A pixel based multifocus image fusion method, optics communications 332 (2014) 350–358.

[19]
Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, X. Wang, Deep learning for pixellevel image fusion: Recent advances and future prospects, Information Fusion.
 [20] C. Chen, Y. Li, W. Liu, J. Huang, Sirf: simultaneous satellite image registration and fusion in a unified framework, IEEE Transactions on Image Processing 24 (11) (2015) 4213–4224.
 [21] S. Chen, Q. Guo, H. Leung, E. Bosse, A maximum likelihood approach to joint image registration and fusion, IEEE Transactions on Image Processing 20 (5) (2011) 1363–1372.
 [22] N. Ofir, S. Silberstein, D. Rozenbaum, S. D. Bar, Registration and fusion of multispectral images using a novel edge descriptor, arXiv preprint arXiv:1711.01543.
 [23] K. He, J. Sun, X. Tang, Guided image filtering, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (6) (2013) 1397–1409.
 [24] M. A. Fischler, R. C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, ACM, 1981.
 [25] L. Pantanowitz, M. Hornish, R. A. Goulart, The impact of digital imaging in the field of cytopathology, Cytojournal 6 (1) (2009) 59–70.