Infrared and visible image fusion using CNN layers and dense block architecture. -- tensorflow
In this paper, we present a novel deep learning architecture for the infrared and visible image fusion problem. In contrast to conventional convolutional networks, our encoding network combines convolutional layers with a dense block in which the output of each layer is connected to every other layer. We use this architecture to extract more useful features from the source images in the encoding process. Two fusion strategies are designed to fuse these features. Finally, the fused image is reconstructed by the decoder. Compared with existing fusion methods, the proposed fusion method achieves state-of-the-art performance in objective and subjective assessment. Code and pre-trained models are available at https://github.com/exceptionLi/imagefusion_densefuse
The infrared and visible image fusion task is an important problem in the image processing field. It attempts to extract salient features from the source images and then integrate these features into a single image with an appropriate fusion method. For decades, such fusion methods have achieved extraordinary fusion performance and have been widely used in many applications, such as video surveillance and military applications.
Many signal processing methods have been applied to the image fusion task to extract salient image features, such as multi-scale decomposition-based methods[2, 3, 4, 5, 6, 7]. First, the salient features are extracted by image decomposition methods. Then an appropriate fusion strategy is utilized to obtain the final fused image.
In recent years, representation learning-based methods have also attracted great attention. In the sparse domain, many fusion methods have been presented, such as the sparse representation(SR) and Histogram of Oriented Gradients(HOG)-based fusion method, the joint sparse representation(JSR)-based fusion method and the co-sparse representation-based method. In the low-rank domain, Li et al. proposed a low-rank representation(LRR)-based fusion method. They use LRR instead of SR to extract features, and then the $l_1$-norm and the max selection strategy are used to reconstruct the fused image.
With the rise of deep learning, many deep learning-based fusion methods have been proposed. The convolutional neural network(CNN) is used to obtain the image features and reconstruct the fused image[12, 13]. In these CNN-based fusion methods, only the result of the last layer is used as the image features, and this operation loses much useful information obtained by the middle layers. We think this information is important for fusion.
To solve this problem, we propose a novel deep learning architecture constructed from an encoding network and a decoding network. We use the encoding network to extract image features, and the fused image is obtained by the decoding network. The encoding network is constructed from a convolutional layer and a dense block in which the output of each layer is used as the input of the next layer. So in our deep learning architecture, the results of each layer in the encoding network are utilized to construct feature maps. Finally, the fused image is reconstructed by a fusion strategy and a decoding network composed of four CNN layers.
Many fusion algorithms have been proposed in the last two years, especially ones based on deep learning. Unlike multi-scale decomposition-based methods and representation learning-based methods, the deep learning-based algorithms use a large number of images to train the network, and these networks are used to obtain the salient features.
In 2016, Yu Liu et al. proposed a fusion method based on convolutional sparse representation(CSR). CSR is different from CNN-based methods, but this algorithm is still a deep learning-based algorithm, because it also extracts deep features. In this method, the authors use source images to learn several dictionaries at different scales and employ CSR to extract multi-layer features; the fused image is then generated from these features. In 2017, Yu Liu et al. also presented a CNN-based fusion method for the multi-focus image fusion task. Image patches which contain different blurred versions of the input image are used to train the network, which is then used to obtain a decision map. The fused image is obtained from the decision map and the source images. However, this method is only suitable for multi-focus image fusion.
In ICCV 2017, Prabhakar et al. presented a CNN-based approach for the exposure fusion problem. They proposed a simple CNN-based architecture which contains two CNN layers in the encoding network and three CNN layers in the decoding network. The encoding network has a siamese architecture and its weights are tied. The two input images are encoded by this network, and the two resulting feature map sequences are fused by an addition strategy. The final fused image is reconstructed by the three CNN layers of the decoding network. Although this method achieves better performance, it still suffers from two main drawbacks: 1) the network architecture is too simple and the salient features may not be extracted properly; 2) the method uses only the result calculated by the last layer of the encoding network, so much useful information obtained by the middle layers is lost, and this phenomenon gets worse as the network becomes deeper.
To overcome these drawbacks, we propose a novel deep learning architecture based on CNN layers and a dense block. Our network takes infrared and visible image pairs as input. In the dense block, the feature maps obtained by each layer in the encoding network are cascaded as the next layer’s input.
In traditional CNN-based networks, a degradation problem is exposed as the network depth increases, and much of the information extracted by the middle layers is not used thoroughly. To address the degradation problem, He et al. introduced a deep residual learning framework. To further improve the information flow between layers, Huang et al. proposed a novel architecture with a dense block in which direct connections from any layer to all the subsequent layers are used. The dense block architecture has three advantages: 1) it can preserve as much information as possible; 2) it improves the flow of information and gradients through the network, which makes the network easy to train; and 3) the dense connections have a regularizing effect, which reduces overfitting.
Based on these observations, we incorporate a dense block in our encoding network, which is the origin of the proposed name: DenseFuse. With this operation, our network can preserve more useful information from the middle layers and is easy to train. We will introduce our fusion algorithm in detail in the next section.
In this section, the proposed deep learning-based fusion method is introduced in detail. Over the last 5 years, CNNs have achieved great success in the image processing field, and they are also the foundation of our network.
The input infrared and visible images (gray-level images) are denoted as $I_1$ and $I_2$, respectively. We assume that the input images are registered using existing algorithms. Our network architecture has three parts: encoder, fusion layer, and decoder. The architecture of the proposed network is shown in Fig.1.
As shown in Fig.1, the encoder is a siamese network with two channels (C11 and DenseBlock11 for channel 1, C12 and DenseBlock12 for channel 2). The first layer (C11 and C12) contains $3 \times 3$ filters to extract rough features, and the dense block (DenseBlock11 and DenseBlock12) contains three convolution layers (each layer’s output is cascaded as the next layer’s input), which also contain $3 \times 3$ filters. The weights of the encoder channels are tied: C11 and C12 (and likewise DenseBlock11 and DenseBlock12) share the same weights. For each convolution layer in the encoding network, the channel number of feature maps is 16. This encoder architecture has two advantages. First, the filter size and stride of the convolutional operation are $3 \times 3$ and 1, respectively, so the input image can be any size. Second, the dense block preserves as many as possible of the features obtained by each convolution layer in the encoding network, which makes sure that all the salient features are available to the fusion strategy.
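The encoder described above can be sketched in a few lines of NumPy to make the channel bookkeeping concrete. This is only an illustrative sketch, not the released implementation: the function names (`conv3x3`, `init_encoder`, `encode`), the reflection padding, the ReLU activation and the random weights are all assumptions. It shows that cascading a first layer and three dense layers of 16 channels each yields 64 feature maps, and that the spatial size of the input is preserved for any image size.

```python
import numpy as np


def conv3x3(x, w):
    """3x3 convolution, stride 1, same output size, followed by ReLU.

    x: (H, W, C_in) feature maps; w: (3, 3, C_in, C_out) filters.
    Reflection padding and ReLU are assumptions of this sketch.
    """
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="reflect")
    out = np.zeros((h, wd, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)
            out += xp[dy:dy + h, dx:dx + wd, :] @ w[dy, dx]
    return np.maximum(out, 0.0)


def init_encoder(rng, growth=16):
    """Random weights for the first layer and three dense layers (hypothetical)."""
    # First layer sees 1 channel; dense layers see 16, 32 and 48 channels.
    in_channels = [1, growth, 2 * growth, 3 * growth]
    return [rng.normal(0.0, 0.1, (3, 3, c, growth)) for c in in_channels]


def encode(img, weights):
    """Encode one gray-level image of shape (H, W) into (H, W, 64) features."""
    feats = conv3x3(img[..., None], weights[0])
    for w in weights[1:]:
        new = conv3x3(feats, w)
        # Dense connection: cascade the new maps onto everything seen so far.
        feats = np.concatenate([feats, new], axis=-1)
    return feats


rng = np.random.default_rng(0)
weights = init_encoder(rng)
# Tied weights: both source images go through the same parameters.
feats_ir = encode(rng.random((5, 7)), weights)
feats_vis = encode(rng.random((5, 7)), weights)
```

Because the same `weights` list is used for both calls, the two streams are tied by construction, which mirrors the siamese design in the text.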
We choose different fusion strategies in the fusion layer; these are introduced in Section III-B.
The decoder contains four convolution layers ($3 \times 3$ filters). The output of the fusion layer is the input of the decoder. We use this simple and effective architecture to reconstruct the final fused image.
In the training process, we consider only the encoder and decoder networks, and train them to reconstruct the input image. The framework of our training process is shown in Fig.2, and the architecture used in training is outlined in Table I.
In Fig.2 and Table I, C1 is the convolution layer in the encoder network, which contains $3 \times 3$ filters. DC1, DC2 and DC3 are convolution layers in the dense block, and the output of each layer is connected to every subsequent layer by a cascade operation. The decoder consists of C2, C3, C4 and C5, which are utilized to reconstruct the input image.
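The layer layout used in training can be checked with a small bookkeeping sketch. The encoder channel counts follow from the text (16 output channels per layer, with cascaded inputs). The decoder widths below (64, 32, 16, 1) are an assumption made for illustration, since the text only states that the decoder has four convolution layers. The helper names are hypothetical.

```python
# (name, input channels, output channels); every layer uses 3x3 filters, stride 1.
ENCODER = [("C1", 1, 16), ("DC1", 16, 16), ("DC2", 32, 16), ("DC3", 48, 16)]
# Assumed decoder widths (illustrative, not taken from the text above).
DECODER = [("C2", 64, 64), ("C3", 64, 32), ("C4", 32, 16), ("C5", 16, 1)]


def conv_params(c_in, c_out, k=3):
    """Number of weights plus biases in one k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def check_dense_inputs(layers):
    """Each dense layer must receive all feature maps produced before it.

    Returns the number of channels handed to the fusion layer or decoder.
    """
    produced = layers[0][2]
    for name, c_in, c_out in layers[1:]:
        if c_in != produced:
            raise ValueError(f"{name}: expected {produced} inputs, got {c_in}")
        produced += c_out
    return produced


def check_chain(layers):
    """Plain sequential layers: each input width equals the previous output."""
    for (_, _, prev_out), (name, c_in, _) in zip(layers, layers[1:]):
        if c_in != prev_out:
            raise ValueError(f"{name}: expected {prev_out} inputs, got {c_in}")
    return layers[-1][2]


encoder_out = check_dense_inputs(ENCODER)
decoder_out = check_chain(DECODER)
encoder_params = sum(conv_params(i, o) for _, i, o in ENCODER)
decoder_params = sum(conv_params(i, o) for _, i, o in DECODER)
```

Under these assumptions the encoder hands 64 feature maps to the decoder, and the decoder returns a single-channel image, which matches the reconstruction objective described in the text.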
In order to reconstruct the input image more precisely, we minimize the loss function $L$ to train our encoder and decoder,

$$L = \lambda L_{ssim} + L_p \tag{1}$$

which is a weighted combination of the pixel loss $L_p$ and the structural similarity (SSIM) loss $L_{ssim}$ with the weight $\lambda$.
The pixel loss $L_p$ is calculated as,

$$L_p = \| O - I \|_2 \tag{2}$$

where $O$ and $I$ indicate the output and input images, respectively. It is the Euclidean distance between the output $O$ and the input $I$.
The SSIM loss $L_{ssim}$ is obtained by Eq.3,

$$L_{ssim} = 1 - SSIM(O, I) \tag{3}$$

where $SSIM(\cdot)$ represents the structural similarity operation and denotes the structural similarity of two images. Because the pixel loss and the SSIM loss differ by orders of magnitude, in the training process $\lambda$ is set to 1, 10, 100 and 1000, respectively.
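A minimal sketch of this training loss is given below, assuming images scaled to [0, 1]. The SSIM here is a simplified single-window version computed from global image statistics, which is an assumption made to keep the example short; the standard SSIM averages the same expression over local windows. The function names `global_ssim` and `reconstruction_loss` are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from global statistics (one window over the whole image)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den


def reconstruction_loss(output, target, lam=100.0):
    """Weighted sum of the SSIM loss and the pixel loss for one image pair."""
    pixel = np.linalg.norm(output - target)         # Euclidean distance
    ssim_term = 1.0 - global_ssim(output, target)   # SSIM loss
    return lam * ssim_term + pixel


rng = np.random.default_rng(0)
target = rng.random((16, 16))
noisy = rng.random((16, 16))
loss_same = reconstruction_loss(target, target)
loss_diff = reconstruction_loss(noisy, target, lam=100.0)
```

A perfect reconstruction gives a loss of (numerically) zero, and a larger `lam` puts more emphasis on the structural term relative to the pixel term.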
Once the encoder and decoder networks are trained, in the testing process we use a two-stream architecture in the encoder with tied weights. We choose two fusion strategies (the addition strategy and the $l_1$-norm strategy) to combine the salient feature maps obtained by the encoder.
In our network, $m \in \{1, 2, \cdots, M\}$, where $M = 64$ represents the number of feature maps, and $k \in \{1, 2\}$ indexes the input images or feature maps.
$\phi_1^m$ and $\phi_2^m$ indicate the feature maps which are obtained by the encoder from the input images, and $f^m$ denotes the fused feature maps. The addition strategy is formulated by Eq.4,

$$f^m(x, y) = \phi_1^m(x, y) + \phi_2^m(x, y) \tag{4}$$

where $(x, y)$ denotes the corresponding position in the feature maps and fused feature maps. Then $f^m$ will be the input of the decoder, and the final fused image will be reconstructed by the decoder.
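The addition strategy amounts to an element-wise sum of the two encoded feature stacks. A minimal sketch, assuming the features are stored as NumPy arrays of shape (H, W, M); the function name `fuse_addition` is hypothetical.

```python
import numpy as np


def fuse_addition(phi_1, phi_2):
    """Element-wise sum of two feature stacks of identical shape (H, W, M)."""
    phi_1 = np.asarray(phi_1, dtype=float)
    phi_2 = np.asarray(phi_2, dtype=float)
    if phi_1.shape != phi_2.shape:
        raise ValueError("feature maps must have the same shape")
    return phi_1 + phi_2


phi_ir = np.full((4, 4, 64), 0.25)
phi_vis = np.full((4, 4, 64), 0.5)
fused = fuse_addition(phi_ir, phi_vis)
```

The fused stack keeps the same shape as each input, so it can be passed to the decoder unchanged.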
The performance of the addition strategy was demonstrated by Prabhakar et al. in DeepFuse. However, this operation is a very rough fusion strategy for salient feature selection. We apply a new strategy, based on the $l_1$-norm and a soft-max operation, in our network. The diagram of this strategy is shown in Fig.4.
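In Fig.4, the feature maps are denoted by $\phi_k^m$ (where $k \in \{1, 2\}$ indexes the source image and $m$ the feature map), the activity level map $\hat{C}_k$ will be calculated by the $l_1$-norm and a block-based average operator, and $f^m$ still denotes the fused feature maps.
The initial activity level map $C_k$ is calculated as the $l_1$-norm of the feature vector $\phi_k^{1:M}(x, y)$ at each position by Eq.5,

$$C_k(x, y) = \| \phi_k^{1:M}(x, y) \|_1 \tag{5}$$

Then the block-based average operator is utilized to calculate the final activity level map $\hat{C}_k$ by Eq.6,

$$\hat{C}_k(x, y) = \frac{\sum_{a=-r}^{r} \sum_{b=-r}^{r} C_k(x + a, y + b)}{(2r + 1)^2} \tag{6}$$

where $r$ determines the block size and in our strategy $r = 1$.
After we get the final activity level map $\hat{C}_k$, $f^m$ is calculated by Eq.7,

$$f^m(x, y) = \sum_{k=1}^{2} w_k(x, y) \times \phi_k^m(x, y), \qquad w_k(x, y) = \frac{\hat{C}_k(x, y)}{\sum_{n=1}^{2} \hat{C}_n(x, y)} \tag{7}$$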
The final fused image is reconstructed by the decoder, which takes the fused feature maps as its input.
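The $l_1$-norm strategy can be sketched as three steps: a per-position $l_1$-norm across channels, a block average over a $(2r+1) \times (2r+1)$ neighbourhood, and a normalisation of the two activity maps into weights. The sketch below assumes feature stacks of shape (H, W, M) and zero padding at the image borders; the border handling, the small `eps` guard against division by zero, and the function names are assumptions of this example.

```python
import numpy as np


def activity_map(phi, r=1):
    """Block-averaged l1-norm activity level map of a (H, W, M) feature stack."""
    c = np.abs(phi).sum(axis=-1).astype(float)   # l1-norm across the M channels
    h, w = c.shape
    cp = np.pad(c, r, mode="constant")           # zero padding at the borders (assumed)
    acc = np.zeros_like(c)
    for a in range(2 * r + 1):
        for b in range(2 * r + 1):
            acc += cp[a:a + h, b:b + w]
    return acc / (2 * r + 1) ** 2                # mean over the (2r+1) x (2r+1) block


def fuse_l1(phi_1, phi_2, r=1, eps=1e-12):
    """Weight each source by its share of the total activity at every position."""
    c1 = activity_map(phi_1, r)
    c2 = activity_map(phi_2, r)
    total = c1 + c2 + eps                        # eps avoids 0/0 where both are inactive
    w1 = c1 / total
    w2 = c2 / total
    # The same spatial weight is applied to all M channels of a source.
    return w1[..., None] * phi_1 + w2[..., None] * phi_2


ones = np.ones((4, 4, 3))
zeros = np.zeros((4, 4, 3))
fused_dominant = fuse_l1(ones, zeros)
fused_equal = fuse_l1(ones, ones)
```

When one source is inactive everywhere, the fused features follow the other source; when both sources are equally active, each receives a weight of one half.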
The purpose of the experiment is to validate the proposed fusion method using subjective and objective criteria and to carry out the comparison with existing methods.
In our experiment, the source infrared and visible images were collected from  and . There are 20 pairs of our source images for the experiment and infrared and visible images are available at . A sample of these images is shown in Fig.5.
We compare the proposed method with several typical fusion methods, including the cross bilateral filter fusion method(CBF), the joint-sparse representation model(JSR), gradient transfer and total variation minimization(GTF), the JSR model with saliency detection fusion method(JSRSD), the deep convolutional neural network-based method(CNN) and the DeepFuse method(DeepFuse). In our experiment, the filter size is set to $3 \times 3$ for the DeepFuse method.
For the purpose of quantitative comparison between our fusion method and other existing algorithms, seven quality metrics are utilized. These are: entropy(En); Qabf; the sum of the correlations of differences(SCD); $FMI_w$ and $FMI_{dct}$, which calculate mutual information (FMI) for the wavelet and discrete cosine features, respectively; modified structural similarity for no-reference image($SSIM_a$); and a new no-reference image fusion performance measure(MS_SSIM).
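As an illustration of the simplest of these metrics, the entropy of an 8-bit image can be computed from its gray-level histogram using the standard Shannon definition. This sketch is not the evaluation code used in the experiments, and the function name `entropy` is hypothetical.

```python
import numpy as np


def entropy(img_u8):
    """Shannon entropy (in bits) of the gray-level histogram of a uint8 image."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())


flat = np.full((4, 4), 7, dtype=np.uint8)                # one gray level
two_level = np.array([[0, 0], [255, 255]], dtype=np.uint8)
ramp = np.arange(256, dtype=np.uint8).reshape(16, 16)    # every level once
```

A constant image carries no information, an image split evenly between two gray levels carries one bit per pixel, and an image using all 256 levels equally often reaches the maximum of eight bits.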
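In our experiment, $SSIM_a$ is calculated by Eq.8,

$$SSIM_a(F) = \left( SSIM(F, I_1) + SSIM(F, I_2) \right) \times 0.5 \tag{8}$$

where $SSIM(\cdot)$ denotes the structural similarity operation, $F$ is the fused image, and $I_1$, $I_2$ are the source images. The value of $SSIM_a$ represents the ability to preserve structural information.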
For all seven metrics, a larger value indicates better fusion performance.
The fused images obtained by the six existing methods and by the proposed method with different parameters are shown in Fig.6 and Fig.7. Due to the space limit, we evaluate the relative performance of the fusion methods on two pairs of images (“car” and “street”).
The fused images obtained by CBF, JSR and JSRSD contain more artificial noise, and their salient features are not clear, such as the sky (orange, dotted box) and floor (red, solid box) in Fig.6 and the billboard (red box) in Fig.7.
On the other hand, the fused images obtained by the proposed method contain less noise in the red box no matter which parameters were chosen. Compared with GTF, CNN and DeepFuse, our fusion method preserves more detail in the red box, as we can see from Fig.6.
In Fig.7, the fused image is darker than the other images when the CNN-based method is used. The reason for this phenomenon is that the CNN-based method is not suitable for infrared and visible images. In contrast, the fused images obtained by our method look more natural.
However, as there is no evident difference between DeepFuse and the proposed method to the human eye, we next choose several objective metrics to evaluate the fusion performance.
The average values of seven metrics for 20 fused images which are obtained by existing methods and the proposed fusion method are shown in Table II.
The best values for the quality metrics are indicated in bold, and the second-best values are indicated in red and italic. As we can see, the proposed method with the addition and $l_1$-norm strategies has four best average values (including En and Qabf) and three second-best values (including SCD and MS_SSIM).
Our method has the best values in , , which indicates that it preserves more structural information and features. The best values of En and Qabf and the second-best value of SCD indicate that the fused images obtained by the proposed method are more natural and contain less artificial noise.
With either fusion strategy (addition or $l_1$-norm) in our network, our algorithm still has the best or second-best values in the seven quality metrics. This means our network is an effective architecture for the infrared and visible image fusion task.
In this paper, we present a novel and effective deep learning architecture based on CNN layers and a dense block for the infrared and visible image fusion problem. Our network has three parts: encoder, fusion layer and decoder. First, the source images (infrared and visible images) are used as the input of the encoder. The feature maps are obtained by the CNN layer and the dense block, and are then fused by a fusion strategy (addition or $l_1$-norm). After the fusion layer, the feature maps are integrated into one feature map which contains all salient features from the source images. Finally, the fused image is reconstructed by the decoder network. We use both subjective and objective quality metrics to evaluate our fusion method. The experimental results show that the proposed method exhibits state-of-the-art fusion performance.
We believe our network architecture can be applied to other image fusion problems, such as multi-exposure, multi-focus and medical image fusion. In the future, we will continue to research this fusion method and hope to achieve better fusion performance.
Zong J, Qiu T. Medical image fusion based on sparse representation of classified image patches[J]. Biomedical Signal Processing and Control, 2017, 34: 195-205.