A Symmetric Encoder-Decoder with Residual Block for Infrared and Visible Image Fusion

05/27/2019 ∙ by Lihua Jian, et al. ∙ Sichuan University ∙ The University of British Columbia

In computer vision and image processing, image fusion has become an attractive research field. However, most existing image fusion methods are built on pixel-level operations, which may produce unacceptable artifacts and are time-consuming. In this paper, a symmetric encoder-decoder with a residual block (SEDR) for infrared and visible image fusion is proposed. In the training stage, the SEDR network is trained on a new dataset to obtain a fixed feature extractor. In the fusion stage, the trained model is first used to extract the intermediate features and compensation features of the two source images. The extracted intermediate features are then used to generate two attention maps, which are multiplied with the input features for refinement. In addition, the compensation features generated by the first two convolutional layers are merged and passed to the corresponding deconvolutional layers. Finally, the refined features are fused and decoded to reconstruct the final fused image. Experimental results demonstrate that the proposed fusion method (named SEDRFuse) outperforms state-of-the-art fusion methods in both subjective and objective evaluations.




I Introduction

Image fusion is a promising technique in computer vision that can be leveraged in many real applications, including object detection at night, disease diagnosis in medicine [1], photography [2, 3, 4], and remote sensing for mapping. Image fusion aims to obtain a more comprehensive and informative image by integrating multiple source images from various sensors. Recently, multi-sensor information such as thermal infrared and visible images has been widely applied to surveillance for both military and civilian use [5, 6]. In general, infrared images are derived from the thermal radiation of objects, whereas visible images are captured from the visual scene. However, each of them has its limitations in night vision, as a single sensor cannot capture the complete information of a scene. Therefore, it is essential to fuse the multi-sensor data to generate an informative image that provides end users with complementary information.

Numerous image fusion algorithms for infrared and visible images have been proposed in recent years. These fusion methods can be reviewed from three perspectives: fusion level, fusion domain, and fusion methodology, as shown in Fig. 1. In terms of implementation level, current fusion algorithms operate at three main levels: pixel level, feature level, and decision level. As the lowest-level fusion technique [7, 8, 9, 10, 11, 12, 13], pixel-level image fusion deals directly with the pixels of an image obtained from sensors. It aims to retain more original information of the source images for visual performance. However, in real applications, pixel-level fusion has two limitations. First, it takes more time to process the large amount of information. Second, without strict registration of the source images, it may cause serious degradation or distortion in the fused results. Therefore, feature-level image fusion [14, 15, 16] has become a promising direction with the development of deep learning techniques. This approach commonly extracts the most representative features of the source images by using specific filters or other representation learning methods. The fused image is then reconstructed by combining the useful features. Although this kind of fusion misses certain information during the process, it has advantages in real-time applications such as object detection. Decision-level fusion, the highest-level fusion mechanism [17, 18], is mainly based on machine learning. It faces many challenges due to the modality correlation between different source images. It provides end users with decisions instead of visual perception; namely, it does not identify specific objects in specific images. Hence, it may not be suitable for most current computer vision tasks.

Fig. 1: Classification for image fusion methods from various perspectives

In addition, according to the fusion domain, existing image fusion methods can be roughly classified into two categories: spatial-based and transform-based methods. Spatial-based fusion methods operate directly on the pixels of the source images. Examples are weighted averaging, morphological operations [19], and other matrix-computation methods [20]. These methods usually produce undesired effects such as spectral distortions due to unbalanced transfer between the source images. In contrast, transform-based image fusion methods can avoid some of these limitations, such as spectral degradation. Using appropriate transformation tools, these methods project the source images into various transformed components. Common transformation tools include the pyramid [21], wavelet [22, 23], curvelet [24], shearlet [25, 26], and contourlet [27] transforms. Subsequently, suitable fusion strategies are applied to merge the transformed components. Finally, the fused components are mapped back to the original spatial domain by an inverse transform.

In terms of methodology, recent image fusion methods can be grouped into four major families: (1) multi-scale decomposition or transformation (MSD or MST); (2) sparse coding or representation (SR); (3) hybrid fusion models (HFM); and (4) neural network (NN)-based methods.

Most existing fusion methods in recent years rely on the multi-scale decomposition framework [11, 28, 7, 29]. Typical MSD operations are as follows: (1) representation layers are extracted at different scales (including dual-scale) by specific transform tools such as pyramids, wavelets, and edge-preserving filters; (2) the extracted feature layers or transform coefficients are combined by specific fusion rules; (3) the fused layers are summed or inverse transformed to obtain the final fused image. For example, the method in [7] decomposes the source images into cartoon and texture components, which are fused by using Sum-modified-Laplacian (SML) and sparse representation fusion strategies, respectively.
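The three MSD steps can be illustrated with a minimal dual-scale sketch in NumPy. This is not the method of any cited work: the box filter, the averaging rule for the base layers, and the max-absolute rule for the detail layers are assumptions chosen only to make the decompose, fuse, and reconstruct pattern concrete, and the function names are hypothetical.

```python
import numpy as np

def box_blur(img, radius=2):
    """Mean filter with edge padding; a stand-in for any smoothing transform."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def two_scale_fuse(a, b, radius=2):
    """Fuse two registered grayscale images of equal shape."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    # (1) Decompose each source into a smooth base layer and a detail layer.
    base_a, base_b = box_blur(a, radius), box_blur(b, radius)
    detail_a, detail_b = a - base_a, b - base_b
    # (2) Fuse the layers: average the bases, keep the stronger detail (assumed rules).
    base_f = 0.5 * (base_a + base_b)
    detail_f = np.where(np.abs(detail_a) >= np.abs(detail_b), detail_a, detail_b)
    # (3) Reconstruct by summing the fused layers.
    return base_f + detail_f
```

Because the decomposition is additive, fusing an image with itself returns that image, which is a quick sanity check for any fusion rule of this kind.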

Recently, sparse coding has been applied in various signal processing fields [30, 12] to deal with two-dimensional images. Sparse representation learning may be considered one of the most effective feature-representation approaches. The source images are encoded on an over-complete dictionary to obtain sparse coefficients, which can be fused by different strategies such as the $\ell_1$-norm, choose-max, and weighted average [12]. The final fused image is restored from the merged coefficients on the same over-complete dictionary. In addition, the SR-based method can also serve as a fusion strategy.

To fully combine the special merits of each fusion method, a few hybrid fusion models (HFMs) [9, 31] have been successfully explored. An example of hybrid models combines the multi-scale transform (MST) and sparse representation (SR) [9]. This hybrid model aims to extract useful information such as low-frequency sub-band features via the MST tool. However, using weighted average and choose-max strategies for low-frequency integration will result in redundant information (visual artifacts) on the fused image, since the low-frequency components of an image indicate energy. At present, it has been proven that the use of SR techniques can express the energy without considering over-smoothing. To some extent, it can reduce the redundant information and improve the visual performance of the fused image.

Currently, the most promising and attractive direction for image fusion is deep learning-based methods. However, this direction is still under-explored in image fusion, even though some novel successful cases [32, 33] have been presented. Neural networks (NNs), such as autoencoders (AEs) and their variants, deep belief networks (DBNs), and convolutional neural networks (CNNs), are the core of deep learning techniques. All of these NN-based methods train a group of weights that act like a series of filters. These trained weights are then applied to extract different types of feature layers. Therefore, how to select a training model and how to use its abstract information effectively are critically important to the fusion results. Liu et al. [33] proposed a CNN-based fusion method for multi-focus images. They used CNNs to identify the clear and unclear parts of multi-focus images, which is in fact a binary classification problem. However, this framework does not generalize to other image fusion applications such as infrared and visible images. Ma et al. [32] presented FusionGAN, which fuses infrared and visible information using a generative adversarial network (GAN). However, this network changes the original information of the source images to some extent.

Most existing image fusion methods operate at the pixel level and thus encounter two key issues: they are time-consuming and they generate redundant information.

In this paper, to reduce the time cost and the quality degradation of the fused image, a novel deep learning method for infrared and visible image fusion is proposed. First, a symmetric encoder-decoder with residual block (SEDR) network is trained on available datasets (KAIST and FLIR) that include infrared and visible images. Then, the trained model is used to extract the intermediate features and compensation features of the source images. Subsequently, the intermediate features are used to generate the corresponding weight maps through a softmax function. Multiplying the features by the corresponding weight maps yields two attention maps, which are then employed to fuse the intermediate features. Additionally, the compensation features are merged and passed to the corresponding deconvolutional layers. Finally, all the merged features are fed into the decoder, where deconvolution operations reconstruct the final fused image.

The main contributions of the proposed fusion framework can be enumerated as follows:

  • We propose a symmetric encoder-decoder with residual block network.

  • We present an attention map-based feature fusion method.

  • We apply skip connections to compensate for the missing details of the reconstructed images.

  • We train the proposed network on the KAIST and FLIR datasets, which consist of infrared and visible images.

The rest of this article is arranged as follows. Section II reviews work related to our fusion framework. Section III details the fusion framework. Section IV presents the experiments and discussion. Section V concludes this article.

II Related Work

II-A CNN-based Work Review

In recent years, deep learning techniques have shown great advantages in many image processing applications, e.g., image super-resolution, image segmentation, and object detection. Liu et al. [33] applied convolutional neural networks (CNNs) to multi-focus image fusion for the first time and demonstrated the feasibility of CNNs for image fusion. However, their method simulates the negative examples (defocused images) by Gaussian blur, which makes the training dataset impractical [32]. In addition, this model cannot be generalized to other multi-modality image fusion tasks. For infrared and visible images, for example, it may be inappropriate to treat fusion as a classification problem. Moreover, there are not enough ground-truth images for supervised training.

Unlike supervised training, unsupervised learning methods commonly use the inputs themselves as targets (labels) to train a network, which compensates for insufficient labeled training data. For instance, an autoencoder compresses the input data into a latent representation (also known as a hidden feature) and then restores the original input with a low reconstruction error; the ground truth of this restoration is the input itself. Haeggstroem et al. [34] leveraged a deep encoder-decoder network to solve the inverse problem of PET reconstruction, which quickly and directly obtained high-quality, low-noise images from PET sinogram data. Du et al. [35] presented a stacked convolutional denoising auto-encoder for feature representation, which could learn a powerful feature extractor. Note that an encoder-decoder architecture has good reconstruction characteristics without supervised learning. More research attention should focus on the content and use of features, especially the feature-level fusion of multi-modality images.

Inspired by previous work, we use the restoration ability of the encoder-decoder network to obtain a fixed feature extractor that learns a hierarchy of complex representation features. Considering that these features convey different types of information from the source images, the framework is divided into three parts at the fusion stage. To begin with, we train a network to achieve a low reconstruction loss on the specific dataset. The encoder part of the trained network then serves as a feature extractor, while the decoder part serves as a generator for image reconstruction. The features obtained by each convolutional layer are fused by the related fusion strategies, and the fused features are transferred to the corresponding deconvolutional layers of the decoder. To the best of our knowledge, this is the first time that all features generated by the convolutional layers are considered for fusion, which fully retains each level of information in the fused image.

II-B Symmetric Encoder-Decoder with Residual Block

Many image restoration algorithms have been built on neural networks (NN), especially based on deep convolutional encoder-decoder networks. These encoder-decoder networks have superior performance due to their strong ability to learn the intermediate hidden information, which can be used to restore the original image with denoising.

In this work, a symmetric encoder-decoder with residual block (SEDR) network is trained on a dataset of infrared and visible images, as shown in Fig. 2. The goal of training is to accurately reconstruct the raw input data by minimizing the reconstruction loss: the smaller the reconstruction error, the more representative the extracted features are. At the training stage, the SEDR network involves two parts, an encoder and a decoder, without a fusion layer. Once the feature extractor has been trained, we add the fusion part to complete the framework (see Fig. 3). The basic units in the SEDR network are convolutional layers, deconvolutional layers, a residual block [36], skip connections, and the rectified linear unit (ReLU) function. Pooling layers are not used because they would discard useful details of the original data.

Fig. 2: Overall training architecture: a symmetric encoder-decoder with residual block (SEDR) network.

Encoder part.  Our encoder part consists of three convolutional layers and one residual block. The first convolutional layer does not change the input size, while the second and third convolutional layers (down-sampling) each halve the spatial size. All convolution operations act as feature extractors, retaining the texture and structural information of the source images. To compensate for image details lost during convolution, we follow ResNet [36] and reuse the previous features by adding one residual block after the last convolutional layer. The training input and the generated output have the same size (height, width, channel). The encoder outputs 256 intermediate feature maps at one quarter of the input height and width, retaining more raw structural details.

Decoder part.  To obtain an output image of the same size as the input, the decoder applies deconvolutions that are symmetric to the corresponding convolutions in the encoder. Deconvolution is usually used to reconstruct the original image from the extracted intermediate features by up-sampling. The kernel size of each deconvolutional layer must match that of its corresponding convolutional layer, and a single kernel size is used throughout this network. The decoder has only two types of units: deconvolutional layers and ReLU functions.

Skip connections. As described in [37], convolutional operations preserve the primary image content, while the texture details of the image may be lost. In addition, deconvolution can only restore the structural details of the image content from the extracted features, which have already lost a certain amount of information during down-sampling in the encoder. In general, the output of the decoder is therefore a filtered version of the input image, which leads to unsatisfactory image fusion performance. In our work, we use skip connections to transfer texture feature information from the convolutional layers to their corresponding deconvolutional layers in an element-wise, choose-max manner. These skip connections make the proposed framework easier to train and speed up convergence. More training details are discussed in Section IV-B.

III The Proposed Fusion Framework

Fig. 3 shows the proposed fusion framework, which is composed of four main phases: (1) feature extraction; (2) attention-based feature fusion; (3) compensation feature fusion; (4) image reconstruction. First, the source images are separately encoded by the trained model to extract a series of rough features (intermediate features). These intermediate features are used to compute two attention maps, which in turn refine the intermediate features. Subsequently, an attention-based fusion strategy is applied to combine the refined features. Meanwhile, the compensation features generated by the first and second convolutional layers are merged by an element-wise choose-max strategy. Finally, all the fused features are fed into the decoder to reconstruct the final fused image.

Fig. 3: The proposed fusion framework (SEDRFuse).

III-A Feature Extraction

Image features should be learned from an image dataset whose characteristics are similar to those of the target images. We describe the dataset collection in Section IV-A. In this phase, we use the trained model to extract the intermediate features of the source images. These initial intermediate features are the output of the encoder, i.e., the output of the residual layer. They are extracted separately for the infrared image and the visible image, and each input image yields the same total number of intermediate feature maps.

In addition, the output features of the first and second convolutions are extracted for each source image, with each layer producing its own number of feature maps. As mentioned above, these features convey the texture details of the source images. We call them compensation features.

On the decoder side, the features of the last and the penultimate deconvolutions are defined analogously.

However, not all information in these features, such as noise, is useful for reconstruction. Therefore, the intermediate features need to be given different weights for refinement by using the corresponding attention maps, which are discussed in the next section. Besides, certain details may be lost during feature extraction. Therefore, passing the features of each convolutional layer to the corresponding deconvolutional layer is important to compensate for the details of the reconstructed image.

III-B Attention-based Feature Fusion

Recently, attention mechanisms have been introduced into CNN architectures and applied to many visual tasks, especially those involving regions of interest (RoI) in a visual scene. The goal of infrared and visible image fusion is to retain the visual details and the salient thermal radiation regions simultaneously. Therefore, motivated by previous work, we use the rough intermediate features to obtain attention maps of the source images.

In our framework, the output of the encoder is a series of rough feature maps, each of which depicts a particular kind of information about the source image. To accurately reflect the salient features of the source images, we create attention maps from these feature maps. Each feature map has its own weight map, given by a softmax operation that computes a probability along the channel direction: at each spatial position, the softmax is applied to the vector formed by the values of all feature channels at that position, so the length of this vector equals the channel number. The softmax function maps each element of such a vector to its exponential divided by the sum of the exponentials of all elements.

All rough feature maps are multiplied by their corresponding weight maps and summed over the channels to generate the attention map of the source image. The attention map reflects the activity level measurement of the source image.

According to the salient mechanism, we use the attention maps to optimize the rough features before the feature-level fusion. For each input source image, the attention map gives an optimal weight map for the features of that image. The features of the two source images are weighted accordingly and combined into the fused intermediate features, which are then decoded to reconstruct the fused image.
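The weighting and fusion steps described in this section can be sketched in NumPy. The softmax along the channel direction and the weighted channel sum follow the description above. The final fusion rule is an assumption made here for illustration: each source's attention map is normalized by the sum of both attention maps and used as a per-pixel weight in a weighted sum of the two feature stacks. The function names, the (H, W, M) array layout, and the `eps` guard are hypothetical, and the features are assumed to be non-negative, as after a ReLU.

```python
import numpy as np

def channel_softmax(features):
    """Softmax along the channel axis of an (H, W, M) feature stack."""
    shifted = features - features.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def attention_map(features):
    """Weight each channel by its softmax weight and sum over channels -> (H, W)."""
    return (channel_softmax(features) * features).sum(axis=-1)

def fuse_intermediate(feat_ir, feat_vis, eps=1e-8):
    """Assumed fusion rule: per-pixel weights from normalized attention maps."""
    a_ir = attention_map(feat_ir)
    a_vis = attention_map(feat_vis)
    total = a_ir + a_vis + eps              # eps avoids division by zero
    w_ir = (a_ir / total)[..., None]        # broadcast the weight over channels
    w_vis = (a_vis / total)[..., None]
    return w_ir * feat_ir + w_vis * feat_vis
```

With this rule, positions where one source is more active contribute more of that source's features, and fusing a feature stack with itself returns it essentially unchanged.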

III-C Compensation Feature Fusion

The compensation features are used to reconstruct, in the decoder part, the details lost during convolution. As each feature value after compression represents a receptive field of the original image, the choose-max strategy is a good choice for merging them in an element-wise manner: for every feature map of the first convolution and every pixel coordinate, the fused feature takes the larger of the infrared and visible feature values (Eq. (6)). The features of the second convolution are merged in the same way (Eq. (7)).
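The element-wise choose-max rule can be sketched in a few lines of NumPy. The function name and the (H, W, N) array layout are assumptions for illustration; the same function would apply to the features of both the first and the second convolution.

```python
import numpy as np

def choose_max(feat_ir, feat_vis):
    """Element-wise choose-max of two compensation feature stacks of equal shape."""
    if feat_ir.shape != feat_vis.shape:
        raise ValueError("feature stacks must have the same shape")
    # At every channel and pixel coordinate, keep the larger activation.
    return np.maximum(feat_ir, feat_vis)
```

For example, merging `[[1, 5], [3, 2]]` with `[[4, 0], [3, 7]]` gives `[[4, 5], [3, 7]]`.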

III-D Image Reconstruction

Image reconstruction restores the two kinds of merged features separately.

To begin with, the fused intermediate features in Eq. (4) serve as the structural content of the source images. We pass them to the first deconvolution of the decoder part.

In addition, the compensation features of the first and second convolutional layers compensate for the visual details of the fused image. In Section III-C, we merged them in Eq. (6) and Eq. (7), respectively. The fused compensation features are passed to the corresponding deconvolutional layers by element-wise summation: the outputs of the first and second deconvolutional layers are each summed with the matching fused compensation features, and the results are the input features of the second and third deconvolutional layers, respectively.
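The data flow of this step can be written compactly. The symbols below are assumptions introduced for illustration: $D_1$ and $D_2$ stand for the outputs of the first and second deconvolutional layers, $F^{C_1}$ and $F^{C_2}$ for the fused compensation features of the first and second convolutions, and $\hat{D}_1$ and $\hat{D}_2$ for the inputs of the second and third deconvolutional layers. The pairing of each deconvolution output with a compensation feature is inferred from the symmetric layout, in which matching layers share the same feature size.

```latex
\begin{aligned}
\hat{D}_1 &= D_1 + F^{C_2},\\
\hat{D}_2 &= D_2 + F^{C_1}.
\end{aligned}
```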

Finally, the final fused image is recovered by decoding the two portions.

IV Experimental Design

IV-A Dataset Preparation

Instead of using only visible images, we select training images from the KAIST (https://soonminhwang.github.io/rgbt-ped-detection/) [38] and FLIR (https://www.flir.ca/oem/adas/adas-dataset-form/) datasets, as they contain both infrared and visible versions of each image, which is more reliable than using other image datasets. The KAIST benchmark consists of 95,000 color-thermal pairs taken from a vehicle. The FLIR dataset provides approximately 14,452 thermal and visible image pairs aimed at the automotive community. The two datasets are recorded at 20 Hz and 30 Hz, respectively, as sequences of video frames. The combined dataset covers a variety of scenarios such as campuses, roads, downtown areas, streets, and highways, captured during both daytime and nighttime.

To make the scene content of the training data more diverse, we enlarge the frame interval by re-sampling. The total number of infrared (IR) and visible (RGB) images is 52,000, including 50,000 training images and 2,000 validation images. The new combined dataset is summarized in Table I.

Dataset         Training                    Validation
KAIST           IR: 12,500; RGB: 12,500     IR: 500; RGB: 500
FLIR            IR: 12,500; RGB: 12,500     IR: 500; RGB: 500
Total Number    50,000                      2,000
TABLE I: Dataset Collection

All of these images are resized to a common size and converted to gray-scale images, which are further normalized to the [0, 1] interval.
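A minimal sketch of this preprocessing in NumPy is given below. The luma weights (ITU-R BT.601), the nearest-neighbour resizing, and the function names are assumptions made for illustration; the target size is left as a parameter because any concrete value used here would be hypothetical.

```python
import numpy as np

def to_gray_unit(rgb_uint8):
    """Convert an (H, W, 3) uint8 RGB image to float grayscale in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114])  # BT.601 luma weights (assumed choice)
    gray = rgb_uint8.astype(np.float64) @ weights
    return gray / 255.0

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize; a simple stand-in for a library resize call."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols]
```

A thermal image that is already single-channel would skip the color conversion and only be resized and divided by its maximum code value.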

IV-B Training Details

The training stage combines a symmetric encoder-decoder and a residual block (SEDR) into a deep network. The SEDR has six layers, three convolutional and three deconvolutional, with a residual block between them. In the encoder part, each down-sampling layer halves the size of its input and doubles the number of features of the previous layer. The decoder part mirrors the corresponding convolutional layers in reverse. The intermediate features are fully reused through the residual block, whose output forms the 256 intermediate feature maps.
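The size and channel bookkeeping described above can be traced with a short helper. This is a sketch under stated assumptions: the 64 channels of the first layer are inferred from the 256 encoder output maps and the doubling rule, the single input and output channel follows from the gray-scale preprocessing, and the function name and the 256 x 256 input used in the usage note are hypothetical.

```python
def trace_shapes(height, width, base_channels=64):
    """Return (layer, H, W, C) tuples for the symmetric encoder-decoder.

    Assumes height and width are divisible by 4 so both halvings are exact.
    """
    shapes = [("input", height, width, 1)]
    h, w, c = height, width, base_channels
    shapes.append(("conv1", h, w, c))          # keeps the input size
    for name in ("conv2", "conv3"):            # down-sampling layers
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((name, h, w, c))
    shapes.append(("residual", h, w, c))       # size and channels unchanged
    for name in ("deconv1", "deconv2"):        # mirror the down-sampling layers
        h, w, c = h * 2, w * 2, c // 2
        shapes.append((name, h, w, c))
    shapes.append(("deconv3", h, w, 1))        # back to a single-channel image
    return shapes
```

For a hypothetical 256 x 256 input, the residual block output is 64 x 64 x 256, and the outputs of `deconv1` and `deconv2` have the same shapes as those of `conv2` and `conv1`, which is what allows the skip connections to combine them element-wise.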

We train the SEDR network on the prepared dataset (see Table I). The batch size and the number of epochs are set to 2 and 50, respectively, with a fixed learning rate. Following Li et al. [15], we use the pixel loss and the SSIM loss to form the total loss function. These two losses constrain the reconstructed pixel error and edge error, respectively.

The total loss combines the pixel loss and the SSIM loss. The SSIM loss is obtained by subtracting from 1 the structural similarity value, computed as in [39], between the reconstructed data and the input training data. Eq. (12) is evaluated over every pixel location of an image of the given size.
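The structure of this loss can be sketched in NumPy. Several parts are assumptions made for illustration: the pixel loss is taken as the mean squared error over all pixel locations, the SSIM is computed once over the whole image rather than with the sliding window of [39], and `ssim_weight` is a hypothetical balancing factor that defaults to 1. The constants 0.01 and 0.03 are the usual SSIM stabilizers for data in [0, 1].

```python
import numpy as np

def pixel_loss(output, target):
    """Mean squared error over all pixel locations (assumed form)."""
    return float(np.mean((output - target) ** 2))

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (simplification of [39])."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)

def total_loss(output, target, ssim_weight=1.0):
    """Pixel loss plus (1 - SSIM), with a hypothetical weight on the SSIM term."""
    ssim_loss = 1.0 - global_ssim(output, target)
    return pixel_loss(output, target) + ssim_weight * ssim_loss
```

A perfect reconstruction gives an SSIM of 1 and a total loss of 0, and the loss grows as the reconstruction departs from the input in either pixel values or structure.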

Our framework is implemented on an NVIDIA GTX 1070Ti GPU, 32 GB of RAM, and an Intel Core i5-8500 CPU. The network is implemented in TensorFlow.

Fig. 4 shows the total training loss curve on the new dataset, with a total loss value output every 1000 iterations. From the curve, we can see that the total loss tends to be stable at around 48 epochs, which indicates that training has converged.
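The iteration counts behind this curve can be estimated from the stated settings. The worked arithmetic below assumes that each of the 50,000 training images is one training sample and that every sample is used once per epoch with the batch size of 2; both are assumptions, not settings confirmed by the text.

```latex
\frac{50{,}000\ \text{images}}{2\ \text{images per batch}} = 25{,}000\ \text{iterations per epoch},
\qquad
48 \times 25{,}000 = 1.2 \times 10^{6}\ \text{iterations}.
```

Under the same assumptions, one loss value per 1000 iterations corresponds to 25 plotted points per epoch.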

Fig. 4: The total loss curve in the training stage.

To investigate the effect of the number of residual blocks on the fusion results, we choose the SSIM metric as the performance measure, as shown in Fig. 5. The proposed framework achieves the best performance with one residual block (see the red curve). As the number of residual blocks increases, the SSIM value decreases. In addition, more residual blocks increase the training time. Therefore, in this work, we design the framework with one residual block.

Fig. 5: The relationship curve between the number of residual blocks and corresponding SSIM values.
Fig. 6: Metric change curves with the increasing of the training epoch number.

IV-C Experimental Setting

In this experiment, the test source images are taken from the TNO Image Fusion Dataset (https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029), from which we select 20 pairs of infrared and visible images containing different scenes. Before image fusion, all source image pairs should be strictly aligned. A portion of the source images used in this paper is shown in Fig. 7.

Fig. 7: A portion of the source images from the TNO dataset: the first and third rows are infrared images, the second and fourth rows are the corresponding visible images.

Our method is compared with five existing methods: CNN-based fusion [33], DeepFuse [40], DenseFuse [15], FusionGAN [32], and GFF [41]. The fusion results are shown in Figs. 8-9 and analyzed in Section IV-E. The parameters of all compared fusion methods strictly follow the settings given by their authors.

To validate the fusion results, apart from subjective visual comparison, we also use seven quantitative fusion metrics to confirm the effectiveness of the proposed method. These fusion metrics are described in the following section.

IV-D Fusion Evaluation Metrics

To evaluate the quality of image fusion, it is not enough to rely solely on subjective evaluation, because some fusion methods produce similar visual results. In this paper, in addition to the visual evaluation, we use seven quantitative metrics to objectively evaluate the fused results. These metrics are briefly described as follows:

  • Entropy (EN) [9].

    EN reflects the total information contained in a fused image in terms of information theory. The larger the EN value, the better the fusion result. However, this metric alone cannot determine the overall quality of the fused image.

  • Spatial frequency (SF) [42].

    SF measures the overall activity level of an image. It combines spatial frequencies in four directions: row, column, main diagonal, and secondary diagonal. It indicates the structural and textural information of the fused image. A high SF value indicates a good fusion result.

  • Standard deviation (SD).

    SD takes a statistical approach, measuring the distance between each individual pixel and the image mean, and thus reflects the dispersion of the pixel values around the mean. The larger the standard deviation, the better the image quality. In addition, a large SD value indicates high spatial contrast in the fused image.

  • Average gradient (AG).

    AG characterizes the sharpness of an image, reflecting the ability of the image to express contrast. Specifically, it reflects the change in the tiny details of the image, as well as the ratio of contrast and relative sharpness in the multi-dimensional direction of the image. The larger the AG value, the better the fusion performance.

  • Correlation coefficient (CC) [43].

    CC represents the degree of linear correlation between the fused image and the source images. A higher CC value signifies that the fused image is more similar to the source images and therefore indicates better fusion performance.

  • Structure similarity (SSIM) [39].

    SSIM is a combination of correlation, luminance, and contrast distortion. This metric is consistent with human visual sensitivity in terms of structure loss and distortion. The value ranges from -1 to 1, where -1 and 1 indicate the converse and the same structure as the reference image, respectively, and 0 represents no relationship with the reference image. A high positive value means good fusion quality.

  • Visual information fidelity (VIF) [44].

    VIF mainly computes the information fidelity of the fused image. This metric leverages different models, such as the human visual system (HVS) model, the natural scene statistics (NSS) model, and the distortion model, to extract mutual information from each block and sub-band. The larger the VIF value, the better the fusion result.
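Several of the simpler metrics in this list can be sketched directly in NumPy. These are common textbook forms, not necessarily the exact definitions of the cited references: the histogram-based entropy assumes 8-bit images, the 1/sqrt(2) distance weight on the diagonal terms of the spatial frequency and the factor of 2 inside the average gradient are assumed conventions, and all function names are hypothetical.

```python
import numpy as np

def entropy(img_uint8):
    """Shannon entropy (bits) of the 256-bin gray-level histogram."""
    hist = np.bincount(img_uint8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                       # skip empty bins: 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """Dispersion of the pixel values around the image mean."""
    return float(np.std(img.astype(np.float64)))

def spatial_frequency(img):
    """Four-direction spatial frequency (row, column, two diagonals)."""
    f = img.astype(np.float64)
    n = f.size
    wd = 1.0 / np.sqrt(2.0)            # assumed distance weight for diagonals
    rf2 = np.sum((f[:, 1:] - f[:, :-1]) ** 2) / n
    cf2 = np.sum((f[1:, :] - f[:-1, :]) ** 2) / n
    mdf2 = wd * np.sum((f[1:, 1:] - f[:-1, :-1]) ** 2) / n
    sdf2 = wd * np.sum((f[1:, :-1] - f[:-1, 1:]) ** 2) / n
    return float(np.sqrt(rf2 + cf2 + mdf2 + sdf2))

def average_gradient(img):
    """Mean magnitude of horizontal and vertical forward differences."""
    f = img.astype(np.float64)
    gx = f[:-1, 1:] - f[:-1, :-1]
    gy = f[1:, :-1] - f[:-1, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def correlation_coefficient(a, b):
    """Pearson correlation between two images of equal shape."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    return float(np.corrcoef(a, b)[0, 1])
```

For a fused image, the correlation coefficient would typically be computed against each source image and averaged; that averaging is a convention assumed here rather than one stated in the text. A constant image scores zero on entropy, standard deviation, spatial frequency, and average gradient, which matches the reading that larger values indicate more information and contrast.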

Fig. 6 shows the relationship between the number of training epochs and the fusion metrics on the test images. Together with the total training loss curve in Fig. 4, it shows that after 50 epochs almost all fusion metrics fluctuate within a small range, with a single exception. Overall, our proposed fusion method performs well in terms of quantitative evaluation. Therefore, setting the number of epochs to 50 is reasonable for our training stage.
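One simple way to make the "fluctuates within a small range" criterion concrete is sketched below. The helper name, the tolerance, and the sample values are hypothetical and only illustrate the idea; they are not the data behind Fig. 6.

```python
def first_stable_epoch(values, tol):
    """Return the first 1-based epoch from which all later values stay
    within a band of width `tol`, or None for an empty sequence."""
    for start in range(len(values)):
        tail = values[start:]
        if max(tail) - min(tail) <= tol:
            return start + 1
    return None


# Hypothetical metric curve sampled once per epoch.
curve = [1.0, 2.0, 3.0, 3.05, 2.98, 3.02]
stable_epoch = first_stable_epoch(curve, tol=0.1)
```

Applying such a check to every metric curve gives a per-metric stabilization epoch, and the largest of these is a natural candidate for the number of training epochs.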

IV-E Results Analysis

In this section, we use both subjective visual evaluation and objective quantitative evaluation to compare the fused results of six image fusion methods, including the proposed one. Because of space limitations, two infrared and visible image pairs are selected for the experimental comparison. The remaining image pairs show similar behavior.

Fig. 8: Fusion results on “”. Top row-from left to right: infrared image, visible image, results of GFF-, CNN- based methods; bottom row-from left to right: results of DeepFuse-, DenseFuse-, FusionGAN- based methods, and proposed fusion method.
Fig. 9: Fusion results on “”. Top row-from left to right: infrared image, visible image, results of GFF-, CNN- based methods; bottom row-from left to right: results of DeepFuse-, DenseFuse-, FusionGAN- based methods, and proposed fusion method.

From a visual perspective, in Figs. 8-9, the results of the six fusion methods (CNN-, DeepFuse-, DenseFuse-, FusionGAN-, and GFF-based methods, and the proposed method) are listed as (c) to (h). Figures (a) and (b) show the infrared and visible images, respectively. The results generated by the CNN-based method show obvious block effects (see Fig. 8(c) and Fig. 9(c)). This method is designed for the fusion of multi-focus images and does not generalize to other types of multi-modality image fusion. The DenseFuse-based method does not perform well. For example, the road and the roof are dim compared to the original source images (see Fig. 8(e)). In addition, the shape of the cloud is very unclear in Fig. 9(e). The fused results obtained by the FusionGAN-based method cannot reflect the visible details clearly (see Fig. 8(f) and Fig. 9(f)). The GFF-based method struggles to pass the infrared information to the fused image. The close-up areas of Fig. 8(g) have low brightness and contrast. Similarly, the close-up areas of Fig. 9(g) show that the cloud information is not transferred to the fused image. Although the DeepFuse-based method (see Fig. 8(d) and Fig. 9(d)) and the proposed method (see Fig. 8(h) and Fig. 9(h)) have very close visual performance, the results of the proposed method are more natural and show clearer structural features. Therefore, for infrared and visible image fusion, the proposed method is superior to the existing methods in visual evaluation.

Table II shows the objective evaluation of the six image fusion methods on the two selected image pairs. The proposed method has the largest values for most of the metrics. For one metric, our values are not the best on either image pair. This may be caused by other factors that affect the fusion outcome, such as the content of the scene. In general, however, our results on that metric are still competitive. The consistency between the subjective and objective evaluations indicates that the proposed framework is more effective than the existing fusion methods.

Source images Methods CNN DeepFuse DenseFuse FusionGAN GFF SEDRFuse (Ours)
3.7923 3.9980 4.1615 2.4636 3.6798 4.4323
0.5763 0.6773 0.6587 0.5290 0.6238 0.6699
6.8151 6.7326 6.9035 6.5267 6.3626 6.9828
11.1157 14.7569 16.0202 9.4881 14.7782 16.2545
0.7118 0.7354 0.7056 0.6987 0.7408 0.7340
0.2671 0.5697 0.4727 0.1413 0.2610 0.6121
31.2828 34.4454 36.7658 24.9111 25.8975 37.5175

2.7619 3.0682 2.6843 2.2036 2.4038 3.5356
0.2835 0.3777 0.2753 0.2700 0.2228 0.3903
7.2929 7.2917 7.0827 7.1004 7.0134 7.3159
10.8251 10.9930 10.4965 8.2041 9.6045 13.0429
0.6828 0.6718 0.7114 0.6608 0.6949 0.6650
0.3652 0.7078 0.3493 0.4229 0.1132 0.8277
49.0840 40.1985 36.6558 42.6331 36.2172 44.7425
TABLE II: Objective quality of the fusion results on the two selected image pairs by different methods

The proposed method is not limited to these few image pairs: it is validated across the entire test dataset, and the average evaluation values are reported in Table III. The proposed fusion method outperforms the previous fusion methods on all metrics except two, on which it places third and second, respectively. It also performs particularly well on two of the remaining metrics. In general, these results show that our fusion method can preserve the structural components of the source images while transferring the visual details to the fused results.


3.6984 0.3784 6.7758 14.2968 0.7126 0.2382 40.4855

3.5302 0.4907 6.6525 12.8103 0.7308 0.5281 33.0762

3.4587 0.4454 6.8080 13.2573 0.7196 0.3596 37.6247

2.1586 0.4194 6.3275 8.0845 0.6575 0.1827 25.7923

3.8225 0.3420 6.7877 14.5258 0.7243 0.2473 37.0476

SEDRFuse (Ours)
4.0054 0.4910 6.8179 14.7979 0.7219 0.5511 37.9516

TABLE III: Average performance of different image fusion methods on the TNO dataset.
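The ranking stated for Table III can be checked directly from the tabulated averages. The short sketch below copies the six rows of Table III in the order shown (five comparison methods, then SEDRFuse) and counts, per metric column, how many methods score higher than the proposed one. It assumes that a larger value is better for every column, as described for the metrics above.

```python
import numpy as np

# Rows as listed in Table III; the last row is SEDRFuse (Ours).
# Columns follow the metric order of the table.
table_iii = np.array([
    [3.6984, 0.3784, 6.7758, 14.2968, 0.7126, 0.2382, 40.4855],
    [3.5302, 0.4907, 6.6525, 12.8103, 0.7308, 0.5281, 33.0762],
    [3.4587, 0.4454, 6.8080, 13.2573, 0.7196, 0.3596, 37.6247],
    [2.1586, 0.4194, 6.3275, 8.0845, 0.6575, 0.1827, 25.7923],
    [3.8225, 0.3420, 6.7877, 14.5258, 0.7243, 0.2473, 37.0476],
    [4.0054, 0.4910, 6.8179, 14.7979, 0.7219, 0.5511, 37.9516],
])

ours = table_iii[-1]
# Rank 1 means no other method has a larger value in that column.
ranks = 1 + np.sum(table_iii[:-1] > ours, axis=0)
print(ranks.tolist())
```

The proposed method ranks first in five columns, third in the fifth column, and second in the last column, which agrees with the description in the text.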

We also verified the influence of different training datasets on the fusion results. Except for the training dataset, the comparison experiments adopt the same parameter settings and pre-processing. We use the MSCOCO dataset (http://cocodataset.org/download/) as a training dataset to demonstrate that training on visible images alone cannot match training on the task-specific datasets (FLIR and KAIST) for infrared and visible image fusion. Fig. 10 shows the fusion results obtained with the different training datasets. The model trained on the MSCOCO dataset performs poorly (see Fig. 10(c)). In contrast, the fusion results from the new training dataset tend to be more natural (see Fig. 10(d)). Table IV gives the average fusion performance on the test dataset for the different training datasets. The new combined dataset generally achieves better objective results than the MSCOCO dataset. Therefore, data selection is very important when training networks for different fusion tasks.

Fig. 10: Fusion results by using different training datasets. (a) infrared images; (b) visible images; (c) results by using the MSCOCO dataset; (d) results by using the new dataset.
MSCOCO 3.9384 0.4532 6.7586 14.4086 0.6914 0.4831 38.9456
New dataset 4.0054 0.4910 6.8179 14.7979 0.7219 0.5511 37.9516

TABLE IV: Average performance on the TNO dataset by different training datasets.

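Table V compares the average running time of the different fusion methods for fusing one infrared and visible image pair. The proposed method takes 1.635 s per image pair, which is comparable to FusionGAN (1.637 s) and faster than DenseFuse (2.333 s) and the CNN-based method (86.069 s), although slower than GFF (0.258 s) and DeepFuse (0.381 s).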

Methods GFF CNN DeepFuse DenseFuse FusionGAN SEDRFuse (Ours)
Time (seconds per pic) 0.258 86.069 0.381 2.333 1.637 1.635
TABLE V: Average running time of one image pair fusion by different methods
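An average running time of this kind can be measured with a small timing harness. The sketch below is only an assumption about how such a measurement might be set up: `fuse` stands for any callable that fuses one infrared and visible pair, and both it and the helper name are hypothetical placeholders, not the code used for Table V.

```python
import time


def average_fusion_time(fuse, pairs, repeats=1):
    """Average wall-clock seconds needed by `fuse` per image pair."""
    start = time.perf_counter()
    for _ in range(repeats):
        for infrared, visible in pairs:
            fuse(infrared, visible)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(pairs))


# Placeholder fusion function and inputs, used only to exercise the harness.
dummy_pairs = [(1, 2), (3, 4)]
seconds_per_pair = average_fusion_time(lambda a, b: a + b, dummy_pairs, repeats=3)
```

For GPU-based models, the device usually has to be synchronized before reading the clock, and a warm-up run is often excluded from the average.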

IV-F Other Types of Image Fusion

To demonstrate the generalization capability of the proposed method, we apply it to other types of multi-modality images, including multi-focus images (gray scale and color, see Fig. 11(a)-(b)), medical images (CT and MRI, see Fig. 11(c)), and multi-exposure images (over-exposure and under-exposure, see Fig. 11(d)). Although the model has not been trained on the relevant datasets, the fusion results still show good visual quality. The structures and details of the source images are transferred well into the fused image.

Fig. 11: Fusion results on other types of images.

These results demonstrate that our proposed method is also applicable to the fusion of images of other modalities, where it achieves good performance.

V Conclusion

Infrared and visible image fusion is a well-studied problem in image processing. This article explores a two-stage feature-level fusion method for infrared and visible images. Inspired by the restoration capability of the encoder-decoder network, we design a symmetric encoder-decoder network with a residual block (SEDR) as a fixed feature extractor. The purpose of training this SEDR network is to reduce the error between the input and output data. The intermediate output features are representative only once the training loss has reached a stable value. In our framework, the training data come from the publicly available KAIST and FLIR datasets. We extract the intermediate features of the source images and generate two attention maps, which are in turn used to fuse the intermediate features. The features generated by the first two convolutional layers are also fused to compensate for the detail lost in down-sampling. Finally, the fused intermediate features and the fused compensation features are fed into the decoder and the corresponding deconvolutional layers, respectively, to reconstruct the fused image.
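The fusion stage summarized above can be sketched in a few lines of NumPy. The concrete operators here are assumptions chosen for illustration, since this section does not restate them: the activity level is taken as the channel-wise L1 norm, the two attention maps are obtained with a softmax across the two sources, and the compensation features are merged with an element-wise maximum. All function names are hypothetical.

```python
import numpy as np


def attention_maps(feat_a, feat_b):
    """Two spatial attention maps from features of shape (C, H, W).

    Assumed rule: channel-wise L1 norm followed by a two-way softmax.
    """
    act_a = np.abs(feat_a).sum(axis=0)
    act_b = np.abs(feat_b).sum(axis=0)
    peak = np.maximum(act_a, act_b)            # subtract for numerical stability
    exp_a, exp_b = np.exp(act_a - peak), np.exp(act_b - peak)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


def fuse_intermediate(feat_a, feat_b):
    """Refine each feature set with its attention map, then sum."""
    w_a, w_b = attention_maps(feat_a, feat_b)
    return w_a[None] * feat_a + w_b[None] * feat_b


def fuse_compensation(feat_a, feat_b):
    """Merge shallow compensation features (assumed element-wise maximum)."""
    return np.maximum(feat_a, feat_b)
```

In the full pipeline, the output of `fuse_intermediate` would be passed to the decoder, and the outputs of `fuse_compensation` would be passed to the deconvolutional layers that mirror the first two convolutional layers.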

Our fusion framework focuses on combining infrared and visible images at the feature level. With our framework, the long running time of traditional pixel-based image fusion methods can be significantly reduced. Feature-level image fusion can also avoid, to some extent, generating redundant information in the fused image. Overall, the proposed fusion method achieves better results than the state-of-the-art methods in terms of visual and objective evaluations. Future work will focus on feature-level fusion and on improving the use of intermediate features to fuse images of other modalities.


This research is sponsored by the National Natural Science Foundation of China (No. 61701327, No. 61711540303, and No. 61601266) and the Science Foundation of Sichuan Science and Technology Department (No. 2018GZ0178). It is also supported by the Graduate Student’s Research and Innovation Fund of Sichuan University (Grant No. 2018YJSY058), the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) Fund, and the Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology (CICAEET) Fund. The authors thank the China Scholarship Council (Grant No. 201806240047) for its financial support, and Dr. Zheng Liu for his helpful guidance.


  • [1] G. Bhatnagar, Q. J. Wu, and Z. Liu, “Directive contrast based multimodal medical image fusion in NSCT domain,” IEEE Transactions on Multimedia, vol. 15, no. 5, pp. 1014–1024, 2013.
  • [2] X. Guo, R. Nie, J. Cao, D. Zhou, L. Mei, and K. He, “FuseGAN: Learning to fuse multi-focus image via conditional generative adversarial network,” IEEE Transactions on Multimedia, 2019.
  • [3] T.-H. Wang, C.-W. Chiu, W.-C. Wu, J.-W. Wang, C.-Y. Lin, C.-T. Chiu, and J.-J. Liou, “Pseudo-multiple-exposure-based tone fusion with local region adjustment,” IEEE Transactions on Multimedia, vol. 17, no. 4, pp. 470–484, 2015.
  • [4] F. Kou, Z. Wei, W. Chen, X. Wu, C. Wen, and Z. Li, “Intelligent detail enhancement for exposure fusion,” IEEE Transactions on Multimedia, vol. 20, no. 2, pp. 484–495, 2018.
  • [5] H.-M. Hu, J. Wu, B. Li, Q. Guo, and J. Zheng, “An adaptive fusion algorithm for visible and infrared videos based on entropy and the cumulative distribution of gray levels,” IEEE Transactions on Multimedia, vol. 19, no. 12, pp. 2706–2719, 2017.
  • [6] W. Zhao, H. Lu, and D. Wang, “Multisensor image fusion and enhancement in spectral total variation domain,” IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 866–879, 2018.
  • [7] Z. Zhu, H. Yin, Y. Chai, Y. Li, and G. Qi, “A novel multi-modality image fusion method based on image decomposition and sparse representation,” Information Sciences, vol. 432, pp. 516–529, 2018.
  • [8] L. Cao, L. Jin, H. Tao, G. Li, Z. Zhuang, and Y. Zhang, “Multi-focus image fusion based on spatial frequency in discrete cosine transform domain,” IEEE Signal Processing Letters, vol. 22, no. 2, pp. 220–224, 2015.
  • [9] Y. Liu, S. Liu, and Z. Wang, “A general framework for image fusion based on multi-scale transform and sparse representation,” Information Fusion, vol. 24, pp. 147–164, 2015.
  • [10] Y. Ma, J. Chen, C. Chen, F. Fan, and J. Ma, “Infrared and visible image fusion using total variation model,” Neurocomputing, vol. 202, pp. 12–19, 2016.
  • [11] L. Jian, X. Yang, Z. Zhou, K. Zhou, and K. Liu, “Multi-scale image fusion through rolling guidance filter,” Future Generation Computer Systems, vol. 83, pp. 310–325, 2018.
  • [12] M. Nejati, S. Samavi, and S. Shirani, “Multi-focus image fusion using dictionary-based sparse representation,” Information Fusion, vol. 25, pp. 72–84, 2015.
  • [13] G. Bhatnagar, Q. J. Wu, and Z. Liu, “A new contrast based multimodal medical image fusion framework,” Neurocomputing, vol. 157, pp. 143–152, 2015.
  • [14] Z. Shao and J. Cai, “Remote sensing image fusion with deep convolutional neural network,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 5, pp. 1656–1669, 2018.
  • [15] H. Li and X.-J. Wu, “DenseFuse: A fusion approach to infrared and visible images,” IEEE Transactions on Image Processing, 2018.
  • [16] Y. Liu, S. Liu, and Z. Wang, “Multi-focus image fusion with dense SIFT,” Information Fusion, vol. 23, pp. 139–155, 2015.
  • [17] N. B. Kolekar and R. Shelkikar, “Decision level based image fusion using wavelet transform and support vector machine,” International Journal of Scientific Engineering and Research (IJSER), vol. 4, no. 12, pp. 54–58, 2016.
  • [18] N. Kausar and A. Majid, “Random forest-based scheme using feature and decision levels information for multi-focus image fusion,” Pattern Analysis and Applications, vol. 19, no. 1, pp. 221–236, 2016.
  • [19] H. Li, L. Li, and J. Zhang, “Multi-focus image fusion based on sparse feature matrix decomposition and morphological filtering,” Optics Communications, vol. 342, pp. 1–11, 2015.
  • [20] Q. Wei, N. Dobigeon, and J.-Y. Tourneret, “Fast fusion of multi-band images based on solving a Sylvester equation,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4109–4121, 2015.
  • [21] J. Du, W. Li, B. Xiao, and Q. Nawaz, “Union Laplacian pyramid with multiple features for medical image fusion,” Neurocomputing, vol. 194, pp. 326–339, 2016.
  • [22] G. Pajares and J. M. De La Cruz, “A wavelet-based image fusion tutorial,” Pattern Recognition, vol. 37, no. 9, pp. 1855–1872, 2004.
  • [23] A. Sappa, J. Carvajal, C. Aguilera, M. Oliveira, D. Romero, and B. Vintimilla, “Wavelet-based visible and infrared image fusion: a comparative study,” Sensors, vol. 16, no. 6, p. 861, 2016.
  • [24] L. Dong, Q. Yang, H. Wu, H. Xiao, and M. Xu, “High quality multi-spectral and panchromatic image fusion technologies based on curvelet transform,” Neurocomputing, vol. 159, pp. 268–274, 2015.
  • [25] S. Singh, D. Gupta, R. Anand, and V. Kumar, “Nonsubsampled shearlet based CT and MR medical image fusion using biologically inspired spiking neural network,” Biomedical Signal Processing and Control, vol. 18, pp. 91–101, 2015.
  • [26] M. Yin, P. Duan, W. Liu, and X. Liang, “A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation,” Neurocomputing, vol. 226, pp. 182–191, 2017.
  • [27] P. Ganasala and V. Kumar, “CT and MR image fusion scheme in nonsubsampled contourlet transform domain,” Journal of Digital Imaging, vol. 27, no. 3, pp. 407–418, 2014.
  • [28] G. Cui, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition,” Optics Communications, vol. 341, pp. 199–209, 2015.
  • [29] D. P. Bavirisetti and R. Dhuli, “Two-scale image fusion of visible and infrared images using saliency detection,” Infrared Physics & Technology, vol. 76, pp. 52–64, 2016.
  • [30] J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution as sparse representation of raw image patches,” 2008.
  • [31] J. Cai, Q. Cheng, M. Peng, and Y. Song, “Fusion of infrared and visible images based on nonsubsampled contourlet transform and sparse K-SVD dictionary learning,” Infrared Physics & Technology, vol. 82, pp. 85–95, 2017.
  • [32] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, “FusionGAN: A generative adversarial network for infrared and visible image fusion,” Information Fusion, vol. 48, pp. 11–26, 2019.
  • [33] Y. Liu, X. Chen, H. Peng, and Z. Wang, “Multi-focus image fusion with a deep convolutional neural network,” Information Fusion, vol. 36, pp. 191–207, 2017.
  • [34] I. Haeggstroem, C. R. Schmidtlein, G. Campanella, and T. J. Fuchs, “Deeprec: A deep encoder-decoder network for directly solving the pet reconstruction inverse problem,” arXiv preprint arXiv:1804.07851, 2018.
  • [35] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao, “Stacked convolutional denoising auto-encoders for feature representation,” IEEE Transactions on Cybernetics, vol. 47, no. 4, pp. 1017–1027, 2017.
  • [36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [37] X. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in Advances in neural information processing systems, 2016, pp. 2802–2810.
  • [38] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon, “Multispectral pedestrian detection: Benchmark dataset and baseline,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1037–1045.
  • [39] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [40] K. Ram Prabhakar, V. Sai Srikar, and R. Venkatesh Babu, “DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4714–4722.
  • [41] S. Li, X. Kang, and J. Hu, “Image fusion with guided filtering,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2864–2875, 2013.
  • [42] Y. Zheng, E. A. Essock, B. C. Hansen, and A. M. Haun, “A new metric based on extended spatial frequency and its application to DWT based fusion algorithms,” Information Fusion, vol. 8, no. 2, pp. 177–192, 2007.
  • [43] M. Deshmukh and U. Bhosale, “Image fusion and image quality assessment of fused images,” International Journal of Image Processing (IJIP), vol. 4, no. 5, p. 484, 2010.
  • [44] Y. Han, Y. Cai, Y. Cao, and X. Xu, “A new image fusion performance metric based on visual information fidelity,” Information Fusion, vol. 14, no. 2, pp. 127–135, 2013.