In recent years, multi-focus image fusion has become an important issue in the image processing field. Due to the limited depth-of-field (DOF) of optical lenses, it is difficult to keep all objects at quite different distances from the camera in focus within one shot. Therefore, many researchers have devoted themselves to designing algorithms that fuse multiple images of the same scene, captured with different focus points, into a single all-in-focus image. The fused image can be used by human or machine operators, and for further image-processing tasks such as segmentation, feature extraction and object recognition.
With the unprecedented success of deep learning, many fusion methods based on deep learning have been proposed. A CNN-based method was first presented for the multi-focus image fusion task: Gaussian filters were used to generate synthetic images with different blur levels to train a two-class image classification network, and with this supervised learning strategy the network could distinguish whether a patch is in focus. After that, DeepFuse was developed in an unsupervised manner to fuse multi-exposure images. DenseFuse was designed to fuse infrared and visible images: it uses an unsupervised encoder-decoder network to extract deep features of images and an L1-norm fusion strategy to fuse two feature maps; the decoder then reconstructs a fused image from the fused features. The basic assumption behind this approach is that the L1 norm of the feature vector at each position represents the activity level of that position, which suits the infrared and visible image fusion task. For the multi-focus task, however, it is commonly assumed that only the objects within the DOF have a sharp appearance in the photograph, while other objects are likely to be blurred. Therefore, we assume that in the multi-focus task, what really matters is the feature gradient, not the feature intensity.
In order to verify this assumption, we present a fusion method based on an unsupervised deep convolutional network. It uses deep features extracted by an encoder-decoder network, together with spatial frequency, to measure the activity level. Experimental results demonstrate that the proposed method achieves state-of-the-art fusion performance compared with 16 existing fusion methods in both objective and subjective assessment.
Our code and data can be found at https://github.com/Keep-Passion/SESF-Fuse.
The remainder of this paper is organized as follows. In Section II, we provide a brief review of related work. In Section III, the proposed fusion method is described in detail. The experimental results are shown in Section IV. We conclude the paper in Section V.
In the past decades, various image fusion methods have been presented, which can be classified into two groups: transform domain methods and spatial domain methods. The most classical transform domain fusion methods are based on multi-scale transform (MST) theories, such as the Laplacian pyramid (LP) and the ratio of low-pass pyramid (RP), wavelet-based ones like the discrete wavelet transform (DWT) and the dual-tree complex wavelet transform (DTCWT), the curvelet transform (CVT) and the nonsubsampled contourlet transform (NSCT), as well as sparse representation (SR) and image matting-based fusion (IMF). The key point behind these methods is that the activity level of the source images can be measured by the decomposed coefficients in a selected transform domain. Obviously, the selection of the transform domain plays a crucial role in these methods.
Spatial domain fusion methods measure the activity level based on gradient information. Early spatial domain fusion methods used a manually fixed block size to calculate the activity level, for example with spatial frequency, which usually causes undesirable artifacts. Many improved versions have been proposed on this topic, such as the adaptive block-based method that uses a differential evolution algorithm to obtain the optimal fixed block size. Recently, some pixel-based spatial domain methods based on gradient information have been proposed, such as the guided filtering (GF)-based one, the multi-scale weighted gradient (MWG)-based one and the dense SIFT (DSIFT)-based one.
Over the last five years, deep convolutional neural networks (CNNs) have achieved great success in image processing, and some works have tried to measure the activity level with high-capacity deep convolutional models. A CNN was first applied to multi-focus image fusion in a supervised manner. DeepFuse then introduced a CNN-based unsupervised approach for the exposure fusion problem. DenseFuse was presented to fuse infrared and visible images; it uses an unsupervised encoder-decoder strategy to obtain useful features and fuses them by L1-norm. Inspired by DeepFuse, we also train our network in an unsupervised encoder-decoder manner. Moreover, we apply spatial frequency as the fusion rule to obtain the activity level and decision map of the source images, which accords with the key assumption that only the objects within the depth-of-field have a sharp appearance.
Overview of Proposed Method
The schematic diagram of our algorithm is shown in Figure 1. In the training phase, we train an auto-encoder network to extract high-dimensional features. In the inference phase, we calculate the activity level from those deep features at the fusion layer. Finally, we obtain the decision map to fuse the two multi-focus source images. The algorithm presented here only aims to fuse two source images; however, more than two multi-focus images can be straightforwardly fused one by one in series.
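To make the serial, pairwise fusion concrete, the sketch below folds a two-image fusion routine over a list of pre-registered sources. The `fuse_pair` placeholder (a per-pixel maximum over toy data) merely stands in for the full method described in the following sections.

```python
from functools import reduce

def fuse_pair(img_a, img_b):
    """Stand-in for the proposed two-image fusion; a per-pixel max
    is used purely as a placeholder so the reduction is runnable."""
    return [max(a, b) for a, b in zip(img_a, img_b)]

def fuse_sequence(images):
    """Fuse N >= 2 pre-registered images one by one in series:
    F = fuse(...fuse(fuse(I1, I2), I3)..., IN)."""
    if len(images) < 2:
        raise ValueError("need at least two source images")
    return reduce(fuse_pair, images)

# Toy example with three flattened 'images'
stack = [[1, 5, 2], [4, 3, 3], [0, 6, 1]]
print(fuse_sequence(stack))  # [4, 6, 3]
```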
Extraction of Deep Feature
Inspired by DenseFuse, we use only the encoder and decoder to reconstruct the input image and discard the fusion operation in the training phase. After the encoder and decoder parameters are fixed, we use spatial frequency to calculate the activity level from the deep features obtained by the encoder.
As shown in Figure 1, the encoder consists of two parts (C1 and the SEDense block). C1 is a convolution layer in the encoder network. DC1, DC2 and DC3 are convolution layers in the SEDense block, and the output of each layer is connected to every other layer by a cascade operation. In order to reconstruct the image precisely, there are no pooling layers in the network. The Squeeze-and-Excitation (SE) block can enhance spatial encoding by adaptively re-calibrating channel-wise feature responses; the influence of this structure is shown in the experiments. The decoder consists of C2, C3, C4 and C5, which are utilized to reconstruct the input image. We train our encoder and decoder by minimizing a loss function that combines the pixel loss L_p and the structural similarity (SSIM) loss L_ssim, L = L_p + λ·L_ssim, where λ is a constant weight that balances the two losses.
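As an illustration of what the SE block computes, the following numpy sketch performs channel-wise recalibration on a (C, H, W) feature map: squeeze by global average pooling, excite with two fully-connected layers, then rescale each channel. The random weights and the reduction ratio `r` are hypothetical; in the actual network these are trained inside the encoder.

```python
import numpy as np

def se_recalibrate(features, w1, w2):
    """Squeeze-and-Excitation on features of shape (C, H, W):
    squeeze by global average pooling, excite with two FC layers
    (ReLU then sigmoid), and rescale each channel."""
    z = features.mean(axis=(1, 2))            # squeeze: (C,)
    s = np.maximum(z @ w1, 0.0)               # FC + ReLU: (C // r,)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))       # FC + sigmoid: (C,)
    return features * s[:, None, None]        # channel-wise rescale

rng = np.random.default_rng(0)
C, r = 8, 2                                   # r: assumed reduction ratio
feats = rng.standard_normal((C, 6, 6))
w1 = rng.standard_normal((C, C // r)) * 0.1   # hypothetical trained weights
w2 = rng.standard_normal((C // r, C)) * 0.1
out = se_recalibrate(feats, w1, w2)
print(out.shape)  # (8, 6, 6)
```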
The pixel loss calculates the Euclidean distance between the output O and the input I: L_p = ||O − I||_2.
The SSIM loss calculates the structural difference between O and I: L_ssim = 1 − SSIM(O, I), where SSIM(·) represents the structural similarity operation.
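A runnable sketch of such a combined objective is given below. For brevity it uses a single-window SSIM over the whole image (the standard SSIM is computed over sliding windows), and `lam` stands in for the constant weight; both are assumptions for illustration rather than the paper's settings.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Simplified single-window SSIM over the whole image
    (the standard SSIM averages over sliding local windows)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def reconstruction_loss(out, inp, lam=1.0):
    """Pixel loss + lam * SSIM loss; lam is a hypothetical weight."""
    pixel = np.sqrt(((out - inp) ** 2).sum())  # Euclidean (L2) distance
    ssim = 1.0 - ssim_global(out, inp)         # SSIM loss
    return pixel + lam * ssim

img = np.linspace(0.0, 1.0, 64).reshape(8, 8)
print(reconstruction_loss(img, img))  # perfect reconstruction -> 0.0
```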
Detailed Fusion Strategy
The detailed fusion strategy is shown in Figure 2. We utilize spatial frequency to calculate the initial decision map and apply commonly used consistency verification methods to remove small errors. Finally, we obtain the decision map used to fuse the two multi-focus source images.
Spatial Frequency Calculation using Deep Features
Different from the L1-norm in DenseFuse, we use the feature gradient instead of the feature intensity to calculate the activity level. Specifically, we apply spatial frequency to this task using deep features.
In this paper, the encoder provides a high-dimensional deep feature for each pixel of an image, whereas the original spatial frequency is calculated on a single-channel gray image; we therefore modify the spatial frequency calculation for deep features. Let F denote the deep features derived from the encoder block, f(i, j) one feature vector, and (i, j) the coordinates of that vector in the image. We calculate its spatial frequency as

SF(i, j) = sqrt(RF(i, j)^2 + CF(i, j)^2),

where RF and CF are the row and column vector frequencies, respectively:

RF(i, j)^2 = Σ_{x=i−r}^{i+r} Σ_{y=j−r}^{j+r} ||f(x, y) − f(x, y−1)||_2^2,
CF(i, j)^2 = Σ_{x=i−r}^{i+r} Σ_{y=j−r}^{j+r} ||f(x, y) − f(x−1, y)||_2^2.

Here r is the radius of the kernel. The original spatial frequency is a block-based method, while it is pixel-based in our method. Besides, we apply a 'same' padding strategy at the border of the feature maps.
Thus, we can compare the spatial frequencies SF_1(i, j) and SF_2(i, j) at corresponding positions, where the subscript k ∈ {1, 2} is the index of the source image. The initial decision map D is then obtained as

D(i, j) = 1 if SF_1(i, j) ≥ SF_2(i, j), and 0 otherwise.
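The per-pixel spatial frequency of a multi-channel feature map and the resulting initial decision map can be sketched in numpy as follows, assuming a (C, H, W) feature layout, a window radius `r`, and edge padding for the 'same' border strategy.

```python
import numpy as np

def spatial_frequency(feat, r=1):
    """Pixel-wise spatial frequency of deep features of shape (C, H, W).

    Row/column frequencies aggregate squared differences of neighbouring
    feature vectors inside a (2r+1) x (2r+1) window; 'same' (edge) padding
    keeps the output at H x W.
    """
    c, h, w = feat.shape
    # squared gradient magnitudes summed over all channels
    rf2 = np.zeros((h, w))
    cf2 = np.zeros((h, w))
    rf2[:, 1:] = ((feat[:, :, 1:] - feat[:, :, :-1]) ** 2).sum(axis=0)
    cf2[1:, :] = ((feat[:, 1:, :] - feat[:, :-1, :]) ** 2).sum(axis=0)

    def window_sum(a):
        # sum over the (2r+1) x (2r+1) window with 'same' (edge) padding
        p = np.pad(a, r, mode='edge')
        out = np.zeros((h, w))
        for dy in range(2 * r + 1):
            for dx in range(2 * r + 1):
                out += p[dy:dy + h, dx:dx + w]
        return out

    return np.sqrt(window_sum(rf2) + window_sum(cf2))

def initial_decision_map(feat1, feat2, r=1):
    """D(i, j) = 1 where source 1 has the higher spatial frequency, else 0."""
    return (spatial_frequency(feat1, r) >= spatial_frequency(feat2, r)).astype(float)

rng = np.random.default_rng(1)
f1 = rng.standard_normal((4, 6, 6))   # 'sharp' source: high feature variation
f2 = np.zeros((4, 6, 6))              # 'blurred' source: flat features
print(initial_decision_map(f1, f2).mean())  # 1.0 -- the sharp source wins everywhere
```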
There may be some small lines or burrs in the connection portions, and some adjacent regions may be disconnected by inappropriate decisions. Thus, alternating opening and closing operators with a small disk structuring element are applied to process the decision map. In this way, the small lines or burrs can be eliminated, the connection portions of the focused regions can be smoothed, and adjacent regions are combined into a whole region. We found that when the radius of the disk structuring element equals the spatial frequency kernel radius, the small lines or burrs are well removed and the adjacent regions are connected correctly. Besides, we apply the same small-region removal strategy as in previous work: specifically, we reverse any region smaller than an area threshold. In this paper, the threshold is set proportional to H × W, where H and W are the height and width of the source image, respectively.
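A possible implementation of this consistency verification with scipy's morphology tools is sketched below; `area_frac` is a hypothetical fraction standing in for the unspecified area-threshold constant.

```python
import numpy as np
from scipy import ndimage

def consistency_verification(d, radius=1, area_frac=0.01):
    """Opening/closing with a disk structuring element, then reversal of
    regions smaller than area_frac * H * W (area_frac is an assumption)."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = (x * x + y * y) <= radius * radius
    d = ndimage.grey_opening(d, footprint=disk)   # remove small lines/burrs
    d = ndimage.grey_closing(d, footprint=disk)   # connect adjacent regions
    thresh = area_frac * d.shape[0] * d.shape[1]
    out = d.copy()
    for value in (1.0, 0.0):                      # handle both labels
        labels, n = ndimage.label(d == value)
        for i in range(1, n + 1):
            region = labels == i
            if region.sum() < thresh:
                out[region] = 1.0 - value         # reverse the small region
    return out

d = np.ones((20, 20))
d[5, 5] = 0.0                                     # a single spurious decision
print(consistency_verification(d).min())  # 1.0 -- the stray pixel is cleaned up
```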
Generally, there are some undesirable artifacts around the boundaries between focused and defocused regions. Similar to previous work, we utilize an efficient edge-preserving filter, the guided filter, to improve the quality of the initial decision map; it can transfer the structural information of a guidance image into the filtering result of the input image. The initial fused image is employed as the guidance image to guide the filtering of the initial decision map. In this work, we experimentally set the local window radius to 4 and the regularization parameter to 0.1 in the guided filter algorithm.
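For reference, a compact box-filter implementation of the guided filter (following He et al.'s local linear model q = a·I + b) might look as follows, with the window radius and regularization parameter exposed as arguments.

```python
import numpy as np
from scipy import ndimage

def guided_filter(guide, src, radius=4, eps=0.1):
    """Guided filter: fit a local linear model q = a * I + b in each
    window of the guidance image I, using box (uniform) filters."""
    size = 2 * radius + 1
    mean = lambda a: ndimage.uniform_filter(a, size=size)
    mean_i, mean_p = mean(guide), mean(src)
    corr_i, corr_ip = mean(guide * guide), mean(guide * src)
    var_i = corr_i - mean_i * mean_i              # local variance of guide
    cov_ip = corr_ip - mean_i * mean_p            # local covariance
    a = cov_ip / (var_i + eps)
    b = mean_p - a * mean_i
    return mean(a) * guide + mean(b)              # averaged coefficients

guide = np.add.outer(np.arange(16.0), np.arange(16.0)) / 30.0
src = np.full((16, 16), 0.5)                      # constant input stays constant
out = guided_filter(guide, src, radius=2, eps=0.1)
```

In the fusion pipeline, `guided_filter(initial_fused, decision_map, radius=4, eps=0.1)` would smooth the decision map while preserving the edges of the guidance image.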
Finally, using the obtained decision map D, we calculate the fused result with the pixel-wise weighted-average rule F = D ⊙ I_1 + (1 − D) ⊙ I_2, where the pre-registered input images are denoted I_k and k ∈ {1, 2} is the index of the source images. Representative visualizations of fused images are shown in Figure 3.
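The weighted-average rule itself reduces to a one-line blend, sketched here for two toy images.

```python
import numpy as np

def fuse_images(img1, img2, decision_map):
    """Pixel-wise weighted average: F = D * I1 + (1 - D) * I2."""
    d = decision_map.astype(float)
    return d * img1 + (1.0 - d) * img2

i1 = np.array([[0.2, 0.8], [0.4, 0.6]])
i2 = np.array([[0.9, 0.1], [0.5, 0.3]])
d = np.array([[1.0, 0.0], [1.0, 0.0]])   # 1: take source 1, 0: take source 2
print(fuse_images(i1, i2, d))  # [[0.2 0.1]
                               #  [0.4 0.3]]
```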
Due to the unsupervised strategy, we first train the encoder-decoder network on MS-COCO. In this phase, about 82,783 images are utilized as the training set and 40,504 images are used to validate the reconstruction ability at every iteration. All of them are resized to a fixed resolution and converted to gray-scale images. The learning rate is decreased by a factor of 0.8 every two epochs. We set λ as in DenseFuse and optimize the objective function with respect to the weights of all network layers with Adam. The batch size and number of epochs are 48 and 30, respectively. We then use the learned parameters to perform SF-based fusion on the testing set described above.
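The epoch-wise decay described above can be expressed as a simple step schedule; the base learning rate is left unspecified in the text, so it is a parameter here.

```python
def learning_rate(base_lr, epoch, factor=0.8, step=2):
    """Decay the learning rate by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# With a hypothetical base rate of 1e-4:
rates = [learning_rate(1e-4, e) for e in range(6)]
print(rates)  # two epochs at each of 1e-4, 0.8e-4, 0.64e-4
```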
Objective Image Fusion Quality Metrics
The proposed fusion method is compared with 16 representative image fusion methods: the Laplacian pyramid (LP)-based one, the ratio of low-pass pyramid (RP)-based one, the nonsubsampled contourlet transform (NSCT)-based one, the discrete wavelet transform (DWT)-based one, the dual-tree complex wavelet transform (DTCWT)-based one, the sparse representation (SR)-based one, the curvelet transform (CVT)-based one, the guided filtering (GF)-based one, the multi-scale weighted gradient (MWG)-based one, the dense SIFT (DSIFT)-based one, the spatial frequency (SF)-based one, FocusStack, image matting fusion (IMF), DeepFuse, DenseFuse (with both the add and L1-norm fusion strategies) and CNN-Fuse. The implementations of GF, IMF, NSCT, CVT, DWT, DTCWT, LP, RP, SR and CNN-Fuse are obtained from publicly available code.
In order to assess the fusion performance of different methods objectively, we adopt three fusion quality metrics. For each of the three metrics, a larger value indicates better fusion performance. A good comprehensive survey of quality metrics can be found in the literature. For a fair comparison, we use the default parameters given in the related publications for these metrics, and all metric codes are obtained from publicly available implementations.
We first evaluate our method under different settings. We pick seven fusion modes to explore the usage of deep features: max, abs-max, average, l1-norm, se_sf, se_sf_dm and dense_sf_dm. DenseFuse investigated the add and L1-norm fusion strategies and concluded that the L1 norm of deep features can be used to fuse infrared-visible images; it utilizes the feature intensity to calculate the activity level. We found that the feature gradient (calculated by spatial frequency) is better suited to the multi-focus fusion task. Table 1 shows the average scores of the different modes. The bold value denotes the best performance among all fusion modes, and the digit within parentheses indicates the number of results on which the corresponding mode obtains first place. The se_sf mode outperforms the abs-max, max, average and l1-norm fusion modes in the metric evaluation. In addition, even though deep learning has promising representational ability, it cannot recover the image perfectly: if we use spatial frequency to fuse the deep features and feed them to the decoder to produce the result, the fused image cannot completely recover every detail of the in-focus regions. Therefore, we propose to use the deep features to calculate the decision map and then fuse the original images; as the experimental results show, the performance of se_sf_dm surpasses that of se_sf. Besides, we conduct an experiment to verify the influence of the SE architecture: the average scores of se_sf_dm on two of the three metrics are higher than those of dense_sf_dm, and se_sf_dm achieves the highest number of first places. We assume that the squeeze-and-excitation structure can dynamically recalibrate features, which leads to more robust results.
Comparison with other fusion methods
We first compare the performance of different fusion methods based on visual perception. For this purpose, four examples are provided in two manners to exhibit the differences among the methods.
In Figure 4, we visualize two fusion examples, the 'leaf' and 'Sydney Opera House' image pairs and their fused results. In each image, a region around the boundary between the focused and defocused parts is magnified and shown in the upper left corner. In the 'leaf' result, we can compare the border of the leaf across methods: DWT shows a 'serrated' shape, while CVT, DSIFT, SR, DenseFuse and CNN show undesirable artifacts. Besides, for DWT and DenseFuse, the luminance of the leaf in the upper right corner shows an abnormal increase, and the same region in MWG is out of focus, which means that the method cannot correctly detect the focused regions. In the 'Sydney Opera House' result, the ear of the koala is located at the border between the focused and defocused parts; all methods show smooth and blurred results there except SESF-Fuse.
For a better comparison, Figures 5 and 6 show the difference images obtained by subtracting the first source image from each fused image, with the values of each difference image normalized to the range of 0 to 1. If the near-focused region is completely detected, the difference image will not show any of its information. In Figure 5, the near-focused region is the beer bottle; CVT, DSIFT, DWT and DenseFuse-1e3-L1-Norm cannot perfectly detect the focused region. SR, MWG and CNN perform well except at the border of the bottle, where the contour of the near-focused region is still visible. Our SESF-Fuse performs well in both the center and border regions of the near-focused area. In Figure 6, the near-focused region is the man. Consistent with the observations above, CVT, DSIFT, DWT, NSCT and DenseFuse cannot perfectly detect the focused region, while MWG and CNN perform well except at the border of the person. Besides, the region surrounded by the arms is actually far-focused, and MWG cannot detect it correctly.
Table 2 lists the objective performance of the different fusion methods under the above three metrics. The CNN-based method and the proposed method clearly beat the other 15 methods on the average scores of two of the three fusion metrics, on which CNN-Fuse and SESF-Fuse achieve comparable performance. However, CNN-Fuse is a supervised method which needs to generate synthetic images with different blur levels to train a two-class image classification network; by contrast, our network only needs to train an unsupervised model without generating synthetic image data. For the remaining metric, the average score of SESF-Fuse is smaller than that of LP; however, the proposed method achieves the highest number of first places, which suggests it is more robust than the other methods.
Considering the above comparisons of subjective visual quality and objective evaluation metrics together, our proposed SESF-Fuse method generally outperforms the other methods, leading to state-of-the-art performance in multi-focus image fusion.
In this work, we propose an unsupervised deep learning model to address the multi-focus image fusion problem. First, we train an encoder-decoder network in an unsupervised manner to acquire deep features of the input images. We then utilize these features and spatial frequency to calculate the activity level and decision map to perform image fusion. Experimental results demonstrate that the proposed method achieves promising fusion performance compared with existing fusion methods in objective and subjective assessment. This paper demonstrates the viability of combining unsupervised learning with traditional image-processing algorithms, and our team will pursue this direction in subsequent work. Besides, we believe the same strategy can be applied to other image fusion tasks, such as multi-exposure fusion, infrared-visible fusion and medical image fusion.
The authors acknowledge the financial support from the National Key Research and Development Program of China (No. 2016YFB0700500), and the National Science Foundation of China (No. 61572075, No. 61702036, No. 61873299, No. 51574027), and Key Research Plan of Hainan Province (No. ZDYF2018139).
- (2010) Fusion of multi-focus images using differential evolution algorithm. Expert Systems with Applications 37(12), pp. 8861–8870.
- (1983) The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31(4), pp. 532–540.
- A new automated quality assessment algorithm for image fusion. Image and Vision Computing 27(10), pp. 1421–1432.
- (2006) Enhancing effective depth-of-field by image fusion using mathematical morphology. Image and Vision Computing 24(12), pp. 1278–1287.
- (2019) PyTorch. https://pytorch.org
- (2013) Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(6), pp. 1397–1409.
- The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- (2007) Pixel- and region-based image fusion with complex wavelets. Information Fusion 8(2), pp. 119–130.
- (1995) Multisensor image fusion using the wavelet transform. Graphical Models and Image Processing 57(3), pp. 235–245.
- (2019) DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28(5), pp. 2614–2623.
- (2013) Image fusion with guided filtering. IEEE Transactions on Image Processing 22(7), pp. 2864–2875.
- (2017) Pixel-level image fusion: a survey of the state of the art. Information Fusion 33, pp. 100–112.
- (2013) Image matting for fusion of multi-focus images in dynamic scenes. Information Fusion 14(2), pp. 147–162.
- (2001) Combination of images with diverse focuses using the spatial frequency. Information Fusion 2(3), pp. 169–176.
- (2014) Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, pp. 740–755.
- (2017) Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207.
- (2015) Multi-focus image fusion with dense SIFT. Information Fusion 23, pp. 139–155.
- (2019) Image fusion. http://www.escience.cn/people/liuyu1/Codes.html
- (2012) Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: a comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(1), pp. 94–109.
- (2012) Image fusion metrics. https://github.com/zhengliu6699/imageFusionMetrics
- (2015) Multi-focus image fusion using dictionary-based sparse representation. Information Fusion 25, pp. 72–84.
- (2007) Remote sensing image fusion using the curvelet transform. Information Fusion 8(2), pp. 143–156.
- (2008) A novel image fusion metric based on multi-scale analysis. In 9th International Conference on Signal Processing, pp. 965–968.
- (2017) DeepFuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In The IEEE International Conference on Computer Vision (ICCV).
- (2012) Multifocus image fusion based on empirical mode decomposition. In 19th IEEE International Conference on Systems, Signals and Image Processing (IWSSIP).
- (2011) Image fusion: algorithms and applications. Elsevier.
- (1989) Image fusion by a ratio of low-pass pyramid. Pattern Recognition Letters 9(4), pp. 245–253.
- (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), pp. 600–612.
- (2019) Focus stacking. https://github.com/cmcguinness/focusstack
- (2019) Image fusion. http://xudongkang.weebly.com/index.html
- (2000) Objective image fusion performance measure. Electronics Letters 36(4), pp. 308–309.
- (2010) Multifocus image fusion and restoration with sparse representation. IEEE Transactions on Instrumentation and Measurement 59(4), pp. 884–892.
- (2009) Multifocus image fusion using the nonsubsampled contourlet transform. Signal Processing 89(7), pp. 1334–1346.
- (2014) Multi-scale weighted gradient-based fusion for multi-focus images. Information Fusion 20, pp. 60–72.