SESF-Fuse: An Unsupervised Deep Model for Multi-Focus Image Fusion

08/05/2019 ∙ by Boyuan Ma, et al.

In this work, we propose a novel unsupervised deep learning model to address the multi-focus image fusion problem. First, we train an encoder-decoder network in an unsupervised manner to acquire deep features of the input images. We then use these features together with spatial frequency to measure the activity level and obtain a decision map. Finally, we apply consistency verification methods to adjust the decision map and produce the fused result. The key idea behind the proposed method is that only objects within the depth of field (DOF) appear sharp in a photograph, while other objects are likely to be blurred. In contrast to previous works, our method analyzes sharpness in the deep features instead of the original image. Experimental results demonstrate that the proposed method achieves state-of-the-art fusion performance compared with 16 existing fusion methods in both objective and subjective assessment.





In recent years, multi-focus image fusion has become an important issue in the image processing field. Due to the limited DOF of optical lenses, it is difficult to capture objects at very different distances from the camera all in focus within one shot [13]. Therefore, many researchers have devoted themselves to designing algorithms that fuse multiple images of the same scene, taken with different focus points, into an all-in-focus image. The fused image can be used by human or computer operators, and for further image processing tasks such as segmentation, feature extraction and object recognition.

With the unprecedented success of deep learning, many fusion methods based on deep learning have been proposed. The authors of [17] first presented a CNN-based method for the multi-focus image fusion task. They used Gaussian filters to generate synthetic images with different blur levels to train a two-class image classification network. With this supervised learning strategy, the network can distinguish whether a patch is in focus. After that, DeepFuse [25] was developed in an unsupervised manner to fuse multi-exposure images. DenseFuse [11] was designed to fuse infrared and visible images: it uses an unsupervised encoder-decoder network to extract deep features from the images, applies an L1-norm fusion strategy to fuse the two feature maps, and then lets the decoder reconstruct a fused image from the fused features. The basic assumption behind this approach is that the L1 norm of the feature vector at each position represents its activity level. This assumption suits the infrared and visible image fusion task. For the multi-focus task, however, it is commonly assumed that only objects within the DOF appear sharp in the photograph, while other objects are likely to be blurred [17]. Therefore, we assume that in the multi-focus task what really matters is the feature gradient, not the feature intensity.

To verify this assumption, we present a fusion method based on an unsupervised deep convolutional network. It uses deep features extracted by an encoder-decoder network, together with spatial frequency, to measure the activity level. Experimental results demonstrate that the proposed method achieves state-of-the-art fusion performance compared with 16 existing fusion methods in both objective and subjective assessment.

Our code and data can be found at

The remainder of this paper is organized as follows. In Section II, we provide a brief review of related work. In Section III, the proposed fusion method is described in detail. The experimental results are shown in Section IV. We conclude the paper in Section V.

Figure 1: The schematic diagram of proposed algorithm.

Related work

In the past decades, various image fusion methods have been presented, which can be classified into two groups: transform domain methods and spatial domain methods [27]. The most classical transform domain fusion methods are based on multi-scale transform (MST) theories, such as the Laplacian pyramid (LP) [2], the ratio of low-pass pyramid (RP) [28], wavelet-based ones like the discrete wavelet transform (DWT) [10] and the dual-tree complex wavelet transform (DTCWT) [9], the curvelet transform (CVT) [23], the nonsubsampled contourlet transform (NSCT) [34], the sparse representation (SR) [33], and the image matting based method (IMF) [14]. The key point behind these methods is that the activity level of the source images can be measured by the decomposed coefficients in a selected transform domain. The selection of the transform domain therefore plays a crucial role in these methods.

Spatial domain fusion methods measure the activity level based on gradient information. Early spatial domain fusion methods computed the activity level, for example the spatial frequency [15], over blocks of a manually fixed size, which usually causes undesirable artifacts. Many improved versions have been proposed on this topic, such as the adaptive block based method [1], which uses a differential evolution algorithm to obtain a fixed optimal block size. Recently, some pixel-based spatial domain methods based on gradient information have been proposed, such as the guided filtering (GF)-based one [12], the multi-scale weighted gradient (MWG)-based one [35] and the dense SIFT (DSIFT)-based one [18].

Over the last five years, deep convolutional neural networks (CNNs) have achieved great success in image processing. Some works have tried to measure the activity level with a high-capacity deep convolutional model. [17] first applied a convolutional neural network to multi-focus image fusion. [25] proposed a CNN-based unsupervised approach for the exposure fusion problem, called DeepFuse. [11] presented DenseFuse to fuse infrared and visible images, which used an unsupervised encoder-decoder strategy to obtain useful features and fused them by L1-norm. Inspired by DeepFuse, we also train our network in an unsupervised encoder-decoder manner. Moreover, we apply spatial frequency as the fusion rule to obtain the activity level and decision map of the source images, which is in accord with the key assumption that only objects within the depth of field appear sharp.


Overview of Proposed Method

The schematic diagram of our algorithm is shown in Figure 1. In the training phase, we train an auto-encoder network to extract high-dimensional features. In the inference phase, we calculate the activity level from those deep features at the fusion layer. Finally, we obtain the decision map used to fuse two multi-focus source images. The algorithm presented here only aims to fuse two source images. However, more than two multi-focus images can be handled straightforwardly by fusing them one by one in series.
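The series fusion of more than two images mentioned above can be sketched as a simple left fold. This is only an illustration: `fuse_pair` is a hypothetical placeholder for the two-image fusion pipeline, and `pairwise_max` is a toy stand-in used for the example, not the proposed method.

```python
from functools import reduce


def fuse_series(images, fuse_pair):
    """Fuse a list of images one by one: ((img0 + img1) + img2) + ...

    `fuse_pair` is a hypothetical callable that fuses exactly two images.
    """
    if not images:
        raise ValueError("at least one image is required")
    return reduce(fuse_pair, images)


def pairwise_max(a, b):
    """Toy stand-in for a two-image fusion step (element-wise maximum)."""
    return [max(p, q) for p, q in zip(a, b)]


fused = fuse_series([[1, 5], [3, 2], [0, 9]], pairwise_max)
```

With a real two-image fusion function in place of `pairwise_max`, each intermediate result is treated as a new source image for the next step.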

Extraction of Deep Feature

Inspired by DenseFuse [11], we use only the encoder and decoder to reconstruct the input image in the training phase and discard the fusion operation. After the encoder and decoder parameters are fixed, we use spatial frequency to calculate the activity level from the deep features produced by the encoder.

As shown in Figure 1, the encoder consists of two parts (C1 and the SEDense block). C1 is a convolution layer in the encoder network. DC1, DC2 and DC3 are convolution layers in the SEDense block, and the output of each layer is connected to every other layer by a cascade operation. In order to reconstruct the image precisely, there is no pooling layer in the network. The Squeeze-and-Excitation (SE) block can enhance spatial encoding by adaptively recalibrating channel-wise feature responses [7]; the influence of this structure is shown in the experiments. The decoder consists of C2, C3, C4 and C5, which are used to reconstruct the input image. We minimize the loss function $L$, which combines the pixel loss $L_p$ and the structural similarity (SSIM) loss $L_{ssim}$, to train our encoder and decoder. $\lambda$ is a constant weight that balances the two losses.

$$L = \lambda L_{ssim} + L_p \tag{1}$$

The pixel loss calculates the Euclidean distance between the output ($O$) and the input ($I$).

$$L_p = \left\| O - I \right\|_2 \tag{2}$$

The SSIM loss calculates the structural difference between $O$ and $I$, where $SSIM(\cdot)$ denotes the structural similarity operation [29].

$$L_{ssim} = 1 - SSIM(O, I) \tag{3}$$
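As a rough illustration of this training objective, the sketch below combines a pixel loss (Euclidean distance) with a weighted SSIM loss in NumPy. Several things here are assumptions rather than the authors' implementation: the weight `lam` is applied to the SSIM term (following the DenseFuse convention), the SSIM is a simplified single-window version computed from whole-image statistics instead of the usual sliding-window form, and pixel values are assumed to lie in [0, 1].

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from whole-image statistics (assumption: one global window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def reconstruction_loss(output, target, lam):
    """Pixel loss plus lam-weighted SSIM loss (lam placement is an assumption)."""
    pixel_loss = np.linalg.norm(output - target)      # Euclidean distance
    ssim_loss = 1.0 - global_ssim(output, target)     # 0 when structures match
    return pixel_loss + lam * ssim_loss
```

A perfect reconstruction gives a loss of (numerically) zero, and any deviation increases both terms.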


Detailed Fusion Strategy

The detailed fusion strategy is shown in Figure 2. We utilize spatial frequency to calculate initial decision map and apply some commonly used consistency verification methods to remove small errors. Finally, we obtain the decision map to fuse two multi-focus source images.

Figure 2: The detailed fusion strategy.

Spatial Frequency Calculation using Deep Features

Different from L1-norm in DenseFuse, we use feature gradient instead of feature intensity to calculate activity level. Specifically, we apply spatial frequency to handle this task using deep features.

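In this paper, the encoder provides a high-dimensional deep feature for each pixel in an image. However, the original spatial frequency is calculated on a single-channel gray image. Thus, we modify the spatial frequency calculation for deep features. Let $F$ denote the deep features produced by the encoder block, and let $F_{(x,y)}$ denote one feature vector, where $(x, y)$ are the coordinates of this vector in the image ($x$ indexing rows and $y$ indexing columns). We calculate its spatial frequency $SF$ using the formulas below, where $RF$ and $CF$ are the row and column vector frequencies, respectively.

$$SF_{(x,y)} = \sqrt{RF_{(x,y)}^2 + CF_{(x,y)}^2} \tag{4}$$

$$RF_{(x,y)} = \sqrt{\sum_{-r \le a,b \le r} \left\| F_{(x+a,\,y+b)} - F_{(x+a,\,y+b-1)} \right\|_2^2} \tag{5}$$

$$CF_{(x,y)} = \sqrt{\sum_{-r \le a,b \le r} \left\| F_{(x+a,\,y+b)} - F_{(x+a-1,\,y+b)} \right\|_2^2} \tag{6}$$

Here $r$ is the radius of the kernel. The original spatial frequency is a block-based method, while it is pixel-based in our method. In addition, we apply the 'same' padding strategy at the border of the feature maps.

Thus, we can compare the spatial frequencies $SF^1$ and $SF^2$ of the two corresponding feature maps, where $k \in \{1, 2\}$ in $SF^k$ is the index of the source image. Then we obtain the initial decision map $D$ with Eq. 7.

$$D_{(x,y)} = \begin{cases} 1, & SF^1_{(x,y)} > SF^2_{(x,y)} \\ 0, & \text{otherwise} \end{cases} \tag{7}$$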
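The pixel-wise spatial frequency on deep features and the resulting initial decision map can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: features are stored as a `(C, H, W)` array, the default radius `r=2` is a hypothetical value, differences that would reach outside the image are set to zero, and the 'same' padding is implemented as zero padding.

```python
import numpy as np


def box_sum(a, r):
    """Sum of `a` over a (2r+1)x(2r+1) window, zero-padded so the shape is kept."""
    h, w = a.shape
    padded = np.pad(a, r, mode="constant")
    out = np.zeros_like(a)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out


def spatial_frequency(feat, r=2):
    """Per-pixel spatial frequency of deep features with shape (C, H, W)."""
    feat = feat.astype(np.float64)
    _, h, w = feat.shape
    # Squared norm of the difference between horizontally adjacent feature vectors.
    row_diff = np.zeros((h, w))
    row_diff[:, 1:] = ((feat[:, :, 1:] - feat[:, :, :-1]) ** 2).sum(axis=0)
    # Squared norm of the difference between vertically adjacent feature vectors.
    col_diff = np.zeros((h, w))
    col_diff[1:, :] = ((feat[:, 1:, :] - feat[:, :-1, :]) ** 2).sum(axis=0)
    rf_squared = box_sum(row_diff, r)
    cf_squared = box_sum(col_diff, r)
    return np.sqrt(rf_squared + cf_squared)


def initial_decision_map(feat1, feat2, r=2):
    """1 where the first source has the larger spatial frequency, else 0."""
    return (spatial_frequency(feat1, r) > spatial_frequency(feat2, r)).astype(np.uint8)
```

A textured feature map compared against a constant one yields a decision map of all ones, since the constant map has zero spatial frequency everywhere.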

Figure 3: Visualization of fused results. The first row is near focused source image and the second row is far focused source image. The third row is decision map of our method and the final row is fused result.

Consistency Verification

There may be some small lines or burrs in the connection portions, and some adjacent regions may be disconnected by inappropriate decisions. Thus, alternating opening and closing operators with a small disk structuring element [4] are applied to the decision map. In this way, small lines or burrs are eliminated, the connection portions of the focused regions are smoothed, and adjacent regions are merged into a whole region. We found that when the radius of the disk structuring element equals the spatial frequency kernel radius, small lines or burrs are well detected and adjacent regions are correctly connected. In addition, we apply the small region removal strategy, which is the same as in [17]. Specifically, we reverse any region that is smaller than an area threshold. In this paper, the threshold is set relative to the image size, where H and W are the height and width of the source image, respectively.
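The consistency verification steps described above (opening and closing with a disk, then reversing small regions) can be sketched with NumPy only. The details are assumptions made for illustration: pixels outside the image are treated as foreground during erosion, regions are found with 4-connectivity, and `radius` and `min_area` are hypothetical parameters whose values are not taken from the paper.

```python
from collections import deque

import numpy as np


def disk(radius):
    """Boolean disk-shaped structuring element of the given radius."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x * x + y * y) <= radius * radius


def dilate(mask, selem):
    r = selem.shape[0] // 2
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy, dx in zip(*np.nonzero(selem)):
        out |= padded[dy:dy + h, dx:dx + w]
    return out


def erode(mask, selem):
    # Dual of dilation (the disk is symmetric); outside pixels count as foreground.
    return ~dilate(~mask, selem)


def flip_small_regions(mask, min_area):
    """Reverse every 4-connected region (of 0s or 1s) smaller than min_area."""
    out = mask.copy()
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    for sy in range(h):
        for sx in range(w):
            if seen[sy, sx]:
                continue
            value = mask[sy, sx]
            region, queue = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny, nx] and mask[ny, nx] == value):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for y, x in region:
                    out[y, x] = not value
    return out


def verify_consistency(decision, radius, min_area):
    mask = decision.astype(bool)
    selem = disk(radius)
    mask = dilate(erode(mask, selem), selem)   # opening: removes thin lines and burrs
    mask = erode(dilate(mask, selem), selem)   # closing: fills small gaps
    return flip_small_regions(mask, min_area)
```

The flood fill is written for clarity rather than speed; on full-size images a connected-component routine from an image processing library would be the more practical choice.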

Generally, there are some undesirable artifacts around the boundaries between focused and defocused regions. Similar to [22], we use an efficient edge-preserving filter, the guided filter [6], to improve the quality of the initial decision map. The guided filter transfers the structural information of a guidance image into the filtering result of the input image. The initial fused image is employed as the guidance image to guide the filtering of the initial decision map. In this work, we experimentally set the local window radius to 4 and the regularization parameter to 0.1 in the guided filter algorithm.
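For reference, a gray-scale guided filter in its standard box-filter formulation can be sketched as below, using the window radius 4 and regularization 0.1 quoted above as defaults. This is an illustrative NumPy version, not the implementation used in the experiments; it assumes a single-channel guidance image scaled to [0, 1] and averages only over pixels inside the image at the borders.

```python
import numpy as np


def box_mean(a, r):
    """Mean over a (2r+1)x(2r+1) window, counting only pixels inside the image."""
    h, w = a.shape
    padded = np.pad(a, r, mode="constant")
    counts = np.pad(np.ones_like(a), r, mode="constant")
    total = np.zeros_like(a)
    n = np.zeros_like(a)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            total += padded[dy:dy + h, dx:dx + w]
            n += counts[dy:dy + h, dx:dx + w]
    return total / n


def guided_filter(guide, src, r=4, eps=0.1):
    """Filter `src` (e.g. a decision map) using `guide` (e.g. the initial fused image)."""
    guide = guide.astype(np.float64)
    src = src.astype(np.float64)
    mean_g = box_mean(guide, r)
    mean_s = box_mean(src, r)
    cov_gs = box_mean(guide * src, r) - mean_g * mean_s
    var_g = box_mean(guide * guide, r) - mean_g ** 2
    a = cov_gs / (var_g + eps)        # local linear coefficient
    b = mean_s - a * mean_g           # local offset
    return box_mean(a, r) * guide + box_mean(b, r)
```

The output is a soft map in roughly the same range as the input decision map, with transitions that follow edges of the guidance image.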


Finally, using the obtained decision map $D$, we calculate the fused result $Fused$ with the following pixel-wise weighted-average rule. The input images, which are pre-registered, are denoted as $Img_k$, where $k \in \{1, 2\}$ is the index of the source image. Representative fused images are visualized in Figure 3.

$$Fused_{(x,y)} = D_{(x,y)} \, Img_{1\,(x,y)} + \left(1 - D_{(x,y)}\right) Img_{2\,(x,y)} \tag{8}$$
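The pixel-wise weighted-average rule amounts to a single blend per pixel. A minimal NumPy sketch, assuming the decision map holds weights in [0, 1] for the first source image and that color images are stored as `(H, W, C)` arrays:

```python
import numpy as np


def fuse_with_decision_map(img1, img2, decision):
    """Blend two pre-registered images: decision * img1 + (1 - decision) * img2."""
    d = decision.astype(np.float64)
    if img1.ndim == 3:
        d = d[..., None]  # broadcast the map over the color channels
    return d * img1 + (1.0 - d) * img2
```

With a binary map this simply copies each pixel from the source judged to be in focus; with a filtered (soft) map it blends the two sources near region boundaries.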



Experimental Settings

In our experiment, we use 38 pairs of multi-focus images as testing set for evaluation, which are publicly available online [22, 26].

Because of the unsupervised strategy, we first train the encoder-decoder network on MS-COCO [16]. In this phase, 82783 images are used as the training set and 40504 images are used to validate the reconstruction ability in every iteration. All of them are resized to a fixed size and converted to grayscale. The learning rate is decreased from its initial value by a factor of 0.8 every two epochs. The loss weight is set to the same value as in DenseFuse [11], and we optimize the objective function with respect to the weights of all network layers using Adam [8]. The batch size and the number of epochs are 48 and 30, respectively. We then use the learned parameters to perform SF fusion on the testing set above.
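The step decay described above (a factor of 0.8 every two epochs) can be written as a one-line schedule. In this sketch `base_lr` is a hypothetical parameter, since the initial learning rate is not given in the text here.

```python
def learning_rate(epoch, base_lr, decay=0.8, step=2):
    """Learning rate at a 0-based epoch: multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Relative schedule over the 30 training epochs, normalized to base_lr = 1.0.
schedule = [learning_rate(epoch, base_lr=1.0) for epoch in range(30)]
```

Over 30 epochs this applies the decay 14 times, so the final epochs train at roughly 4% of the initial rate.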

Our implementation of this algorithm is based on the publicly available PyTorch framework [5]. Training and testing of the network are performed on a system with 4 NVIDIA 1080Ti GPUs with 44 GB of memory.

Figure 4: Visualization of different ’leaf’ and ’Sydney Opera House’ fused results.

Objective Image Fusion Quality Metrics

The proposed fusion method is compared with 16 representative image fusion methods: the Laplacian pyramid (LP)-based one [2], the ratio of low-pass pyramid (RP)-based one [28], the nonsubsampled contourlet transform (NSCT)-based one [34], the discrete wavelet transform (DWT)-based one [10], the dual-tree complex wavelet transform (DTCWT)-based one [9], the sparse representation (SR)-based one [33], the curvelet transform (CVT)-based one [23], the guided filtering (GF)-based one [12], the multi-scale weighted gradient (MWG)-based one [35], the dense SIFT (DSIFT)-based one [18], the spatial frequency (SF)-based one [15], FocusStack [30], Image Matting Fusion (IMF) [14], DeepFuse [25], DenseFuse (with both the add and the L1-norm fusion strategies) [11] and CNN-Fuse [17]. The implementations of GF and IMF are obtained from [31], and those of NSCT, CVT, DWT, DTCWT, LP, RP, SR and CNN-Fuse from [19].

In order to assess the fusion performance of different methods objectively, we adopt three fusion quality metrics, proposed in [32], [24] and [3]. For each of these three metrics, a larger value indicates better fusion performance. A good comprehensive survey of quality metrics can be found in [20]. For fair comparison, we use the default parameters given in the related publications for these metrics, and all code is obtained from [21].

Figure 5: The difference images for each ’beer’ fused results

Ablation Experiments

We first evaluate our method under different settings to verify its design. We pick seven fusion modes to explore the usage of deep features: max, abs-max, average, L1-norm, sf, se_sf_dm, and dense_sf_dm. DenseFuse [11] investigated the add and L1-norm fusion strategies and concluded that the L1-norm of deep features can be used to fuse infrared-visible images. That is, they used feature intensity to calculate the activity level. We found that the feature gradient (calculated by spatial frequency) is better suited to the multi-focus fusion task. Table 1 shows the mean score of each method. In the original table, the bold value denotes the best performance among all fusion modes. The number in parentheses indicates the number of results on which the corresponding method obtains the first place. se_sf outperforms the abs-max, max, average and l1_norm fusion modes on all three metrics.

In addition, even though deep learning has promising representation ability, it cannot recover the image perfectly. Thus, if we use sf to fuse the deep features and feed them to the decoder to produce the result, the fused result cannot completely recover every detail of the in-focus region. Therefore, we propose to use the deep features to calculate the decision map and then fuse the original images. As shown in the experimental results, se_sf_dm outperforms se_sf.

We also conduct an experiment to verify the influence of the SE architecture [7]. The average scores of se_sf_dm on the first two metrics of Table 1 are higher than those of dense_sf_dm, and se_sf_dm obtains the most first places overall. We assume that the squeeze-and-excitation structure dynamically recalibrates the features, which leads to more robust results.

Methods       Metric 1     Metric 2     Metric 3
se_absmax     0.5204(0)    2.4880(0)    0.6019(0)
se_average    0.5033(0)    2.4835(0)    0.5963(0)
se_l1_norm    0.5124(0)    2.4961(0)    0.6020(0)
se_max        0.5059(0)    2.4851(0)    0.5980(0)
se_sf         0.6885(0)    2.7216(2)    0.7526(0)
se_sf_dm      0.7105(25)   2.8886(16)   0.7848(19)
dense_sf_dm   0.7103(13)   2.8872(20)   0.7852(19)
Table 1: Ablation experiments with different settings.
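The two numbers reported per cell (mean score, and in parentheses the number of test results on which a method ranks first) can be computed from a per-image score matrix as sketched below. The function name, the input layout and the tie handling (every tied method is credited with a first place) are assumptions for illustration; the toy scores are made up and are not values from the paper.

```python
import numpy as np


def summarize_scores(scores, names):
    """scores: (n_methods, n_images) array of one metric, larger is better.

    Returns {method name: (mean score, number of first places)}.
    """
    scores = np.asarray(scores, dtype=np.float64)
    means = scores.mean(axis=1)
    best_per_image = scores.max(axis=0)
    wins = (scores == best_per_image).sum(axis=1)  # ties credit every tied method
    return {name: (float(m), int(w)) for name, m, w in zip(names, means, wins)}


# Toy example with three hypothetical methods evaluated on three image pairs.
toy = [[0.7, 0.9, 0.5],
       [0.8, 0.6, 0.4],
       [0.1, 0.2, 0.3]]
summary = summarize_scores(toy, ["a", "b", "c"])
```

Reporting the first-place count next to the mean makes it visible when a method has a slightly lower average but wins on more individual image pairs.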

Comparison with other fusion methods

We first compare the performance of different fusion methods based on visual perception. For this purpose, four examples are presented in two ways, as fused results and as difference images, to show the differences among the methods.

In Figure 4, we visualize two fused examples, the 'leaf' and 'Sydney Opera House' image pairs, together with their fused results. In each image, a region around the boundary between the focused and defocused parts is magnified and shown in the upper left corner. In the 'leaf' result, we can compare the border of the leaf across methods. DWT shows a 'serrated' shape, and CVT, DSIFT, SR, DenseFuse and CNN show undesirable artifacts. In addition, for DWT and DenseFuse, the luminance of the leaf in the upper right corner shows an abnormal increase. The same region in MWG is out of focus, which means that the method cannot detect the focused regions well. In the 'Sydney Opera House' result, the ear of the koala is located at the border between the focused and defocused parts, and all methods except SESF-Fuse show smooth and blurred results there.

For a better comparison, Figure 5 and Figure 6 show the difference images obtained by subtracting the first source image from each fused image, with the values of each difference image normalized to the range of 0 to 1. If the near-focused region is completely detected, the difference image will not show any information from that region. In Figure 5, the near-focused region is the beer bottle. CVT, DSIFT, DWT and DenseFuse-1e3-L1-Norm cannot perfectly detect the focused region. SR, MWG and CNN perform well except at the border of the bottle, where the contour of the near-focused region is still visible. Our SESF-Fuse performs well in both the center and the border of the near-focused region. In Figure 6, the near-focused region is the man. As in the observation above, CVT, DSIFT, DWT, NSCT and DenseFuse cannot perfectly detect the focused region. MWG and CNN perform well except at the border of the person. In addition, the region surrounded by the arms is actually a far-focused region, and MWG does not detect it correctly.
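A difference image of this kind can be produced as sketched below. The text only states that the values are normalized to the range 0 to 1; the min-max normalization used here is an assumption, as is returning an all-zero image when the fused image equals the source.

```python
import numpy as np


def difference_image(fused, source):
    """Subtract `source` from `fused` and rescale the result to [0, 1] (min-max)."""
    diff = fused.astype(np.float64) - source.astype(np.float64)
    lo, hi = diff.min(), diff.max()
    if hi == lo:
        # No variation at all, e.g. the fused image equals the source.
        return np.zeros_like(diff)
    return (diff - lo) / (hi - lo)
```

Regions where the fused image copies the first source exactly share one constant value after normalization, so any remaining structure reveals pixels taken from the other source.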

Figure 6: The difference images for each ’golf’ fused results
Metrics    DeepFuse    FocusStack  SF          DenseFuse_1e3_add  DSIFT       DenseFuse_1e3_l1
Metric 1   0.4269(0)   0.4709(0)   0.5115(0)   0.5190(0)          0.5267(0)   0.5283(0)
Metric 2   2.4618(0)   2.8510(0)   2.8512(0)   2.8530(0)          2.8725(0)   2.8561(0)
Metric 3   0.5651(0)   0.6330(0)   0.6024(0)   0.6008(0)          0.6067(0)   0.5972(0)

Metrics    [method names for these six columns are missing from the source]
Metric 1   0.5631(0)   0.6187(0)   0.6222(0)   0.6324(2)          0.6478(0)   0.6529(0)
Metric 2   2.8506(0)   2.9563(0)   2.9465(1)   2.8844(0)          2.9460(0)   2.9583(0)
Metric 3   0.7008(3)   0.6908(0)   0.6712(0)   0.7362(4)          0.7101(0)   0.7126(0)

Metrics    NSCT        SR          LP          MWG                CNN-Fuse    SESF-fuse
Metric 1   0.6587(0)   0.6686(0)   0.6731(0)   0.6998(0)          0.7102(16)  0.7105(20)
Metric 2   2.9592(0)   2.9630(2)   2.9642(8)   2.9615(6)          2.9654(7)   2.8886(14)
Metric 3   0.7169(0)   0.7335(0)   0.7352(0)   0.7764(2)          0.7839(9)   0.7848(20)
Table 2: Comparison with other fusion methods.

Table 2 lists the objective performance of different fusion methods on the above three metrics. The CNN-based method and the proposed method clearly beat the other 15 methods on the average score of the first and third metrics in Table 2, and on these metrics CNN-Fuse and SESF-Fuse achieve comparable performance. However, CNN-Fuse is a supervised method, which needs to generate synthetic images with different blur levels to train a two-class image classification network. By contrast, our network only needs to train an unsupervised model and does not need synthetic image data. For the second metric, the average score of SESF-Fuse is lower than that of LP; however, the proposed method obtains the highest number of first places, which suggests it is more robust than the other methods.

Considering the above comparisons on subjective visual quality and objective evaluation metrics together, our proposed SESF-Fuse-based fusion method can generally outperform other methods, leading to state-of-the-art performance in multi-focus image fusion.


In this work, we propose an unsupervised deep learning model to address the multi-focus image fusion problem. First, we train an encoder-decoder network in an unsupervised manner to acquire deep features of the input images. We then use these features and spatial frequency to calculate the activity level and decision map and to perform image fusion. Experimental results demonstrate that the proposed method achieves promising fusion performance compared with existing fusion methods in both objective and subjective assessment. This paper demonstrates the viability of combining unsupervised learning with traditional image processing algorithms. Our team will continue this research in subsequent work. In addition, we believe that the same strategy could be applied to other image fusion tasks, such as multi-exposure fusion, infrared-visible fusion and medical image fusion.


The authors acknowledge the financial support from the National Key Research and Development Program of China (No. 2016YFB0700500), and the National Science Foundation of China (No. 61572075, No. 61702036, No. 61873299, No. 51574027), and Key Research Plan of Hainan Province (No. ZDYF2018139).