NestFuse: An Infrared and Visible Image Fusion Architecture based on Nest Connection and Spatial/Channel Attention Models

07/01/2020, by Hui Li, et al.

In this paper we propose a novel method for infrared and visible image fusion in which we develop a nest connection-based network and spatial/channel attention models. The nest connection-based network can preserve significant amounts of information from the input data from a multi-scale perspective. The approach comprises three key elements: an encoder, a fusion strategy and a decoder. In our proposed fusion strategy, spatial attention models and channel attention models are developed to describe the importance of each spatial position and of each channel in the deep features. Firstly, the source images are fed into the encoder to extract multi-scale deep features. The novel fusion strategy is then applied to fuse these features at each scale. Finally, the fused image is reconstructed by the nest connection-based decoder. Experiments on publicly available datasets show that our proposed approach achieves better fusion performance than other state-of-the-art methods, in terms of both subjective and objective evaluation. The code of our fusion method is available at https://github.com/hli1221/imagefusion-nestfuse


I Introduction

Image fusion represents an important technique in image processing aimed at generating a single image containing salient features and complementary information from source images, by using appropriate feature extraction methods and fusion strategies [20]. Current state-of-the-art fusion algorithms are widely employed in many applications, such as self-driving vehicles, visual tracking [17][13][8] and video surveillance.

Fusion algorithms can be broadly classified into two categories: traditional methods [33][2][22][24][14][18] and deep learning-based methods [28][16][15][19]. Most traditional methods are based on signal processing operators and have achieved good performance. In recent years, deep learning-based methods have exhibited immense potential in image fusion tasks and often offer better performance than traditional algorithms.

Traditional methods generally fall into two categories: multi-scale methods, and sparse/low-rank representation learning-based methods. Multi-scale methods [33][2][48][21][41] usually decompose the source images into different scales to extract features and use appropriate fusion strategies to fuse the features at each scale. An inverse transform is then used to reconstruct the fused image. Although these methods achieve good fusion performance, their performance depends heavily on the choice of multi-scale decomposition.

Before the development of deep learning-based fusion methods, sparse representation (SR) [43] and low-rank representation (LRR) [25] attracted significant attention. Several SR-based fusion algorithms were developed [24][14][39]. In [24], Liu et al. proposed a fusion algorithm based on joint sparse representation (JSR) and a saliency detection operator. The JSR is used to extract common information and complementary features from the source images.

In the LRR domain, Li et al. [18] presented a multi-focus image fusion method based on LRR and dictionary learning. In this approach, the source images are first divided into image patches and histogram of oriented gradients (HOG) features are utilized to classify each patch. A global dictionary is learned by K-singular value decomposition (K-SVD) [1]. In addition, there are many other methods combining SR with other operators, such as the pulse coupled neural network (PCNN) [29] and the shearlet transform [49].

Although the SR and LRR based fusion methods deliver very good performance, they still have weaknesses: (1) the running time of the fusion algorithm depends heavily on the dictionary learning operator; (2) when the source images are complex, the representation performance degrades.

To address these drawbacks, many deep learning-based fusion methods have been proposed in the past several years. These methods can be separated into two categories: with and without a training phase.

First class | Second class | Reference | Advantages | Disadvantages

Traditional methods | Multi-scale | Wavelet [33], biorthogonal wavelet [2], contourlet [48], guided filtering [21], non-subsampled shearlet transform (NSST) [41] | The raw data is transformed into the frequency domain, which may extract more useful information to represent the source images; with appropriate fusion strategies, these methods may achieve better performance. | (1) Their performance depends heavily on the multi-scale method, and it is difficult to find an appropriate decomposition for different types of source images; (2) the transform processing may cause unrecoverable loss of data.

Traditional methods | SR/LRR | JSRSD [24], DDL [14], Sparse K-SVD [39], DLLRR [18], TS-SR [29], DCST-SR [49], ConvSR [28] | Unlike multi-scale transforms, the SR and LRR based methods perform fusion directly, without the transform processing, and can therefore avoid the unrecoverable loss of data. | (1) The running time of the fusion algorithm depends heavily on the dictionary learning operator; (2) when the source images are complex, the representation performance degrades.

Deep learning-based methods | Without training phase | VggML [16], ResNet-ZCA [15] | (1) Pre-trained deep neural networks, which can extract multi-level deep features, were utilized in the image fusion task for the first time; (2) the multi-level deep features contain richer information, which benefits the image fusion task. | (1) Since these pre-trained networks are trained for different tasks, they may not fit the image fusion task; (2) the deep feature extraction could be improved by training an appropriate fusion network.

Deep learning-based methods | With training phase | CNN [26], Unsupervised [47], DenseFuse [19], FusionGAN [31], IFCNN [51] | (1) With an appropriate fusion network, the deep features contain more useful information [26][47]; (2) the auto-encoder based fusion network [19] avoids the lack of training data in the image fusion task; (3) the end-to-end fusion frameworks [31][51] can generate the fused image without any handcrafted feature extraction. | (1) The networks have no down-sampling operator and cannot extract multi-scale features, and the deep features are not fully utilized; (2) the topology of the network architecture needs to be improved for multi-scale feature extraction; (3) the fusion strategy is not carefully designed for the fusion of deep features.

TABLE I: A summary of the existing fusion methods.

Methods without a training phase do not involve backpropagation; they use a pre-trained network to extract deep features, which are then used to generate a decision map. Based on this idea, Li et al. [16][15] proposed fusion frameworks that utilize pre-trained networks (VGG-19 [38] and ResNet50 [6]). This was the first time that multi-level deep features were used to address the infrared and visible image fusion task.

Since an appropriate model for the image fusion task can be trained to obtain better fusion performance, the latest deep learning methods all adopt this strategy. In 2017, Liu et al. [26] proposed a convolutional neural network (CNN) [9] based fusion framework for the multi-focus image fusion task. Yan et al. [47] also presented a fusion network based on a CNN and multi-level features. In the infrared and visible image fusion field, Li et al. [19] proposed a novel fusion framework based on a dense block [7] and an auto-encoder architecture. Ma et al. [31] applied a generative adversarial network (GAN) [3] to the infrared and visible image fusion task. Compared with existing fusion methods, these CNN or GAN based fusion frameworks have achieved extraordinary fusion performance.

However, these deep learning-based frameworks still have several drawbacks: (1) the networks have no down-sampling operator and cannot extract multi-scale features, and the deep features are not fully utilized; (2) the topology of the network architecture needs to be improved for multi-scale feature extraction; (3) the fusion strategy is not carefully designed to fuse the deep features. A summary of all the above existing fusion methods is given in Table I.

To address these drawbacks, we propose a novel fusion framework based on the nest connection architecture and an appropriate fusion strategy. The main contributions of our fusion framework are summarized as follows:

(1) The nest connection architecture [52] is applied to a CNN-based fusion framework. Our nest connection-based framework differs from existing nest connection-based frameworks: it contains three parts, namely an encoder network, a fusion strategy and a decoder network.

(2) Our nest connection architecture makes full use of deep features and preserves more information from different scale features which are extracted by the encoder network.

(3) For the fusion of multi-scale deep features, we propose a novel fusion strategy based on spatial attention and channel attention models.

(4) Compared with existing state-of-the-art fusion methods, our fusion framework has better performance in terms of both visual assessment and objective assessment.

The rest of our paper is structured as follows. In Section II, we briefly review related work on deep learning-based fusion methods. In Section III, we present the proposed fusion framework in detail. Section IV illustrates the experimental results. Finally, we draw conclusions in Section V.

II Related Works

With the rise of deep learning in recent years, many deep learning-based methods have been proposed for the image fusion task, some of which attempt to design an end-to-end network to directly generate fused images [27]. In this section, we first briefly review several classical and recent deep learning-based methods, and then present the nest connection architecture.

II-A Deep Learning-based Fusion Methods

In 2017, a CNN-based fusion network was proposed by Liu et al. [26]. In their paper, pairs of image patches containing different blur versions were used to train the network, with labels 1 and 0 for clear and blurred patches, respectively. The aim of this network was to generate a decision map indicating which source image is more in focus at each position. With the training phase, this CNN-based method obtained better fusion performance than earlier algorithms. However, due to the limitation of the training strategy, this method is only suitable for multi-focus images.

To overcome this weakness, Li et al. [19] proposed a novel auto-encoder based network (DenseFuse) for fusing infrared and visible images. It consists of three parts: an encoder, a fusion layer and a decoder. In the training phase, the fusion layer is discarded and DenseFuse degenerates into an auto-encoder network. The purpose of the training phase is to obtain two sub-networks in which the encoder fully extracts deep features from the source images and the decoder adaptively reconstructs the raw data from the encoded features. In the testing phase, the fusion layer is utilized to fuse the deep features, and the fused image is then reconstructed by the decoder network. To preserve more detail information, Zhang et al. [51] proposed a general end-to-end fusion network (IFCNN), a simple yet effective architecture for generating fused images.

The GAN architecture was introduced to the infrared and visible image fusion field (FusionGAN) by Ma et al. [31]. In the training phase, the source images are concatenated into a tensor and fed into the generator network, which produces the fused image. The loss function contains two terms: a content loss and a discriminator loss. With this adversarial strategy, the generator network can be trained to fuse arbitrary infrared and visible images.

II-B The Nest Connection Architecture

The nest connection architecture was proposed by Zhou et al. [52] for the task of medical image segmentation. In deep networks, the skip connection is a common operator for preserving information from previous layers. However, the semantic gap between encoder and decoder features causes unexpected results when long skip connections are used in the network architecture. To solve this problem, Zhou et al. presented a novel architecture (nest connection) which uses up-sampling and several short skip connections to replace a long skip connection. The framework of the nest connection is illustrated in Fig.1.

Fig. 1: The architecture of nest connection in UNet++.

With the nest connection, the influence of the semantic gap is constrained and more information is preserved to obtain better segmentation results.
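To make the nest connection concrete, the following PyTorch sketch shows one nested decoder node in the spirit of Fig.1: it up-samples the feature coming from one level deeper and concatenates it with the same-level features before two convolutions, so only short skip connections are needed. The class name and channel numbers are illustrative assumptions (the example channels happen to match DCB11 in Table II), not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestNode(nn.Module):
    """One decoder node of a UNet++-style nested architecture (illustrative sketch).

    The node receives the outputs of all same-level predecessor nodes plus the
    output of the node one level deeper; the deeper feature is up-sampled so
    that only short skip connections are needed.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, same_level_feats, deeper_feat):
        up = F.interpolate(deeper_feat, scale_factor=2, mode="nearest")
        return self.conv(torch.cat(list(same_level_feats) + [up], dim=1))

# Example: a node that combines a 64-channel same-level feature with an
# up-sampled 112-channel deeper feature (176 -> 64, as for DCB11 in Table II).
x_same = torch.randn(1, 64, 64, 64)
x_deep = torch.randn(1, 112, 32, 32)
node = NestNode(in_channels=64 + 112, out_channels=64)
print(node([x_same], x_deep).shape)   # torch.Size([1, 64, 64, 64])
```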

Inspired by this work, we introduce this architecture into the image fusion task and propose a modified nest connection-based fusion framework with a novel fusion strategy.

III Proposed Fusion Method

In this section, the proposed nest connection-based fusion network is introduced in detail. Firstly, the fusion framework is presented in Section III-A. Then, the details of the training phase are described in Section III-B. Finally, we present our novel fusion strategy based on two attention models.

Fig. 2: The framework of proposed method. “FS” indicates fusion strategy.

III-A Fusion Network

Our fusion network (see Fig.2) contains three main parts: the encoder (blue square), the fusion strategy (blue circle) and the decoder (the other blocks). The nest connection is utilized in the decoder network to process the multi-scale deep features extracted by the encoder.

In Fig.2, the two inputs indicate the source images and the output denotes the fused image. "Conv" denotes a single convolutional layer. "ECB" denotes an encoder convolutional block, which contains two convolutional layers and one max-pooling layer, and "DCB" indicates a decoder convolutional block without the pooling operator.

Firstly, the two input images are separately fed into the encoder network to obtain multi-scale deep features. Our fusion strategy is then applied to fuse the features at each scale. Finally, the nest connection-based decoder network is used to reconstruct the fused image from the fused multi-scale deep features.

In the following sections, we introduce the training phase and the novel fusion strategy, respectively.

III-B Training Phase

The training strategy is similar to that of DenseFuse [19]. In the training phase, the fusion strategy is discarded. We aim to train an auto-encoder network in which the encoder extracts multi-scale deep features and the decoder reconstructs the input image from these features. The training framework is shown in Fig.3, and the network settings are outlined in Table II.

Fig. 3: The framework of training process.

In Fig.3 and Table II, $I$ and $O$ denote the input image and the output image, respectively. The encoder network consists of one convolutional layer ("Conv") and four convolutional blocks ("ECB10", "ECB20", "ECB30" and "ECB40"). Each block contains two convolutional layers and one max-pooling operator, which ensures that the encoder can extract deep features at different scales.

The decoder network has six convolutional blocks ("DCB11", "DCB12", "DCB13"; "DCB21", "DCB22"; "DCB31") and one convolutional layer ("Conv"). The six convolutional blocks are connected by the nest connection architecture to avoid the semantic gap between encoder and decoder.

         Layer        Size  Stride  Channel (input)  Channel (output)  Activation
Encoder  Conv          3      1           1               16            ReLU
         ECB10         -      -          16               64             -
         ECB20         -      -          64              112             -
         ECB30         -      -         112              160             -
         ECB40         -      -         160              208             -
Decoder  DCB31         -      -         368              160             -
         DCB21         -      -         272              112             -
         DCB22         -      -         384              112             -
         DCB11         -      -         176               64             -
         DCB12         -      -         240               64             -
         DCB13         -      -         304               64             -
         Conv          1      1          64                1            ReLU
ECB      Conv 3 1 16 ReLU
         Conv 3 1 16 ReLU
         max-pooling
DCB      Conv 3 1 16 ReLU
         Conv 3 1 16 ReLU
TABLE II: The network settings of the encoder and decoder networks. "Conv" is a convolutional layer; "ECB" denotes an encoder convolutional block (two convolutional layers + max-pooling); "DCB" denotes a decoder convolutional block (without pooling); the input and output channel numbers of each "ECB" or "DCB" depend on its position in the encoder or decoder.
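The following PyTorch sketch illustrates how the "ECB"/"DCB" blocks and the encoder channel progression of Table II can be realized: each block stacks two 3x3 convolutions with ReLU, and the encoder block adds a max-pooling layer. The internal channel arrangement of the blocks, and whether the multi-scale features are taken before or after pooling, are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """3x3 convolution, stride 1, followed by ReLU (per Table II)."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

class ECB(nn.Module):
    """Encoder convolutional block: two conv layers followed by max-pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Splitting the channels as in_ch -> out_ch -> out_ch is an assumption.
        self.convs = nn.Sequential(conv_block(in_ch, out_ch), conv_block(out_ch, out_ch))
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.convs(x))

class DCB(nn.Module):
    """Decoder convolutional block: two conv layers, no pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(conv_block(in_ch, out_ch), conv_block(out_ch, out_ch))

    def forward(self, x):
        return self.convs(x)

# Encoder channel progression from Table II: 1 -> 16 -> 64 -> 112 -> 160 -> 208.
stem = conv_block(1, 16)
encoder = nn.ModuleList([ECB(16, 64), ECB(64, 112), ECB(112, 160), ECB(160, 208)])

x = torch.randn(1, 1, 256, 256)          # a toy single-channel input
feat, multi_scale = stem(x), []
for ecb in encoder:
    feat = ecb(feat)
    multi_scale.append(feat)             # four scales of deep features
print([tuple(f.shape) for f in multi_scale])
```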

In the training phase, the loss function is defined as follows,

$L_{total} = L_{pixel} + \lambda L_{ssim}$   (1)

where $L_{pixel}$ and $L_{ssim}$ indicate the pixel loss and the structural similarity (SSIM) loss between the input image $I$ and the output image $O$, and $\lambda$ denotes the trade-off between $L_{pixel}$ and $L_{ssim}$.

The pixel loss $L_{pixel}$ is calculated by Eq.2,

$L_{pixel} = \left\| O - I \right\|_{F}^{2}$   (2)

where $O$ and $I$ indicate the output and input images, respectively, and $\left\| \cdot \right\|_{F}$ is the Frobenius norm. $L_{pixel}$ measures the distance between $O$ and $I$ and ensures that the reconstructed image is close to the input image at the pixel level.

The SSIM loss is obtained by Eq.3,

$L_{ssim} = 1 - SSIM(O, I)$   (3)

where $SSIM(\cdot)$ denotes the structural similarity measure [42]. The output image $O$ and the input image $I$ are more similar in structure when $SSIM(O, I)$ becomes larger.

The aim of the training phase is to obtain a powerful encoder network and decoder network. Thus, the type of input images in the training phase is not limited to infrared and visible images. We use the MS-COCO dataset [23] to train our auto-encoder network, choosing 80000 images as the input images. These images are converted to gray scale and resized to a fixed resolution. As the orders of magnitude of $L_{pixel}$ and $L_{ssim}$ are different, the parameter $\lambda$ is set to 1, 10, 100 and 1000 to train our network. A detailed analysis of the training phase is given in the Ablation Study in Section IV-B.
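A minimal sketch of the training loss follows, assuming the pixel term is the squared Frobenius norm of Eq.2 and substituting a simplified, global-statistics SSIM for the windowed SSIM of [42]; `ssim_simple` and `total_loss` are hypothetical helper names.

```python
import torch

def ssim_simple(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global-statistics SSIM surrogate, used only to keep this sketch short;
    # the paper uses the standard windowed SSIM of [42].
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

def total_loss(output, target, lam=100.0):
    """L_total = L_pixel + lambda * L_ssim (Eqs. 1-3); lambda = 100 as selected in Section IV-B."""
    l_pixel = ((output - target) ** 2).sum()        # squared Frobenius norm, Eq. 2
    l_ssim = 1.0 - ssim_simple(output, target)      # Eq. 3
    return l_pixel + lam * l_ssim

# Toy reconstruction pair.
I = torch.rand(1, 1, 64, 64)
O = torch.rand(1, 1, 64, 64)
print(total_loss(O, I).item())
```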

III-C Fusion Strategy

Most fusion strategies are based on a weighted-average operator which generates a weighting map to fuse the source images. Under this formulation, the choice of the weighting map becomes the key issue.

The fusion network becomes more flexible when the fusion strategy is only applied in the test phase [19]; however, existing strategies are not designed for deep features, and attention mechanisms have not been considered.

To solve this problem, in this section we introduce a novel fusion strategy based on two attention models. In our fusion architecture, $m$ indicates the level of the multi-scale deep features, with $m \in \{1, 2, 3, 4\}$. The framework of our fusion strategy is shown in Fig.4.

Fig. 4: The procedure of attention model-based fusion strategy.

$\Phi_1^m$ and $\Phi_2^m$ are the multi-scale deep features extracted by the encoder from the two input images. $\tilde{\Phi}_s^m$ and $\tilde{\Phi}_c^m$ are the fused features obtained by the spatial attention model and the channel attention model, respectively. $\tilde{\Phi}_f^m$ is the final fused multi-scale deep feature, which is the input to the decoder network.

In our fusion strategy, we focus on two types of attention: the spatial attention model and the channel attention model. The extracted multi-scale deep features are processed by these two models.

When $\tilde{\Phi}_s^m$ and $\tilde{\Phi}_c^m$ have been obtained by the two attention models, the final features $\tilde{\Phi}_f^m$ are generated by Eq.4,

$\tilde{\Phi}_f^m = \left( \tilde{\Phi}_s^m + \tilde{\Phi}_c^m \right) \times 0.5$   (4)

Now, we will introduce our attention model-based fusion strategies in detail.

III-C1 Spatial Attention Model

In [28][16][19], a spatial-based fusion strategy is utilized for the image fusion task. In this paper, we extend this operation to fuse multi-scale deep features and call it the spatial attention model. The procedure of the spatial attention model-based fusion strategy is shown in Fig.5.

Fig. 5: The procedure of spatial attention-based fusion strategy.

$\alpha_1^m$ and $\alpha_2^m$ indicate the weighting maps, which are calculated from the deep features $\Phi_1^m$ and $\Phi_2^m$ by an $l_1$-norm and a soft-max operator. The weighting maps are formulated by Eq.5,

$\alpha_k^m(x, y) = \dfrac{\left\| \Phi_k^m(x, y) \right\|_1}{\sum_{i=1}^{2} \left\| \Phi_i^m(x, y) \right\|_1}$   (5)

where $\left\| \cdot \right\|_1$ denotes the $l_1$-norm and $k \in \{1, 2\}$. $(x, y)$ indicates the corresponding position in the multi-scale deep features ($\Phi_1^m$ and $\Phi_2^m$) and in the weighting maps ($\alpha_1^m$ and $\alpha_2^m$); each position denotes a $C$-dimensional vector in the deep features, where $C$ is the number of channels.

$\hat{\Phi}_1^m$ and $\hat{\Phi}_2^m$ denote the enhanced deep features, which are weighted by $\alpha_1^m$ and $\alpha_2^m$. The enhanced features are calculated by Eq.6,

$\hat{\Phi}_k^m(x, y) = \alpha_k^m(x, y) \times \Phi_k^m(x, y), \quad k \in \{1, 2\}$   (6)

Then the fused features are calculated by adding these enhanced deep features, as formulated in Eq.7,

$\tilde{\Phi}_s^m(x, y) = \sum_{k=1}^{2} \hat{\Phi}_k^m(x, y)$   (7)
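The spatial attention fusion of Eqs.5-7 can be sketched in PyTorch as follows, reading the "soft-max" as the normalization of the two $l_1$-norm maps so that the weights at each position sum to one; the function name is hypothetical.

```python
import torch

def spatial_attention_fuse(phi1, phi2, eps=1e-8):
    """Spatial attention fusion (Eqs. 5-7) for deep features of shape (B, C, H, W):
    per-position l1-norm over the C channels, soft-max over the two inputs,
    then a weighted sum of the deep features."""
    a1 = phi1.abs().sum(dim=1, keepdim=True)   # l1-norm map of input 1
    a2 = phi2.abs().sum(dim=1, keepdim=True)   # l1-norm map of input 2
    w1 = a1 / (a1 + a2 + eps)                  # weighting maps (Eq. 5)
    w2 = a2 / (a1 + a2 + eps)
    return w1 * phi1 + w2 * phi2               # enhance (Eq. 6) and sum (Eq. 7)

phi1 = torch.randn(1, 64, 32, 32)
phi2 = torch.randn(1, 64, 32, 32)
print(spatial_attention_fuse(phi1, phi2).shape)   # torch.Size([1, 64, 32, 32])
```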

III-C2 Channel Attention Model

In existing deep learning-based fusion methods, almost all fusion strategies only exploit spatial information. However, deep features are three-dimensional tensors, so not only the spatial information but also the channel information should be considered in the fusion strategy. Thus, we propose a channel attention-based fusion strategy. The diagram of this strategy is shown in Fig.6.

Fig. 6: The diagram of channel attention-based fusion strategy.

As discussed in Section III-C1, $\Phi_1^m$ and $\Phi_2^m$ are the multi-scale deep features. $\beta_1^m$ and $\beta_2^m$ are $C$-dimensional weighting vectors calculated by global pooling and a soft-max operator. The enhanced deep features are obtained by weighting $\Phi_1^m$ and $\Phi_2^m$ with these vectors, and $\tilde{\Phi}_c^m$ is the fused feature calculated by the channel attention-based fusion strategy.

Firstly, a global pooling operator is utilized to calculate the initial weighting vectors $\hat{\beta}_1^m$ and $\hat{\beta}_2^m$. The formulation is shown in Eq.8,

$\hat{\beta}_k^m(n) = P\left( \Phi_k^m(n) \right), \quad k \in \{1, 2\}$   (8)

where $n \in \{1, \cdots, C\}$ indicates the index of the channel in the deep features $\Phi_k^m$, and $P(\cdot)$ is the global pooling operator.

In our channel attention model, three global pooling operations are considered: (1) the average operator, which calculates the mean value of each channel; (2) the max operator, which calculates the maximum value of each channel; (3) the nuclear-norm operator ($\left\| \cdot \right\|_*$), which is the sum of the singular values of one channel. The influence of the different global pooling operators is discussed in the Ablation Study in Section IV-B.

Then, a soft-max operator (Eq.9) is used to obtain the final weighting vectors $\beta_1^m$ and $\beta_2^m$,

$\beta_k^m(n) = \dfrac{\hat{\beta}_k^m(n)}{\sum_{i=1}^{2} \hat{\beta}_i^m(n)}$   (9)

With the final weighting vectors, the fused features generated by the channel attention model are calculated by Eq.10,

$\tilde{\Phi}_c^m(n) = \sum_{k=1}^{2} \beta_k^m(n) \times \Phi_k^m(n)$   (10)
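A sketch of the channel attention fusion of Eqs.8-10 is given below, with the three global pooling choices (average, max and nuclear-norm); the function names are hypothetical, and the nuclear norm is computed as the sum of singular values of each HxW channel matrix.

```python
import torch

def global_pool(phi, mode="avg"):
    """Global pooling P(.) over each channel (Eq. 8): 'avg', 'max', or
    'nuclear' (sum of singular values of each H x W channel matrix)."""
    b, c, h, w = phi.shape
    if mode == "avg":
        return phi.mean(dim=(2, 3))
    if mode == "max":
        return phi.amax(dim=(2, 3))
    if mode == "nuclear":
        return torch.linalg.svdvals(phi.reshape(b * c, h, w)).sum(dim=-1).reshape(b, c)
    raise ValueError(f"unknown pooling mode: {mode}")

def channel_attention_fuse(phi1, phi2, mode="avg", eps=1e-8):
    """Channel attention fusion (Eqs. 8-10): pool each channel globally,
    soft-max the two pooled vectors, then take a channel-wise weighted sum."""
    b1, b2 = global_pool(phi1, mode), global_pool(phi2, mode)
    w1 = (b1 / (b1 + b2 + eps)).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
    w2 = (b2 / (b1 + b2 + eps)).unsqueeze(-1).unsqueeze(-1)
    return w1 * phi1 + w2 * phi2

# Encoder features are non-negative after ReLU; .abs() mimics that here.
phi1 = torch.randn(1, 64, 32, 32).abs()
phi2 = torch.randn(1, 64, 32, 32).abs()
print(channel_attention_fuse(phi1, phi2, mode="nuclear").shape)
```

The spatially fused and channel-wise fused tensors produced by these two sketches would then be combined at each scale (Eq.4) and passed to the decoder.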

IV Experimental Results

In this section, we first describe the experimental settings of the testing phase. Then, we present our ablation study. Finally, we compare our method with existing methods, using subjective evaluation as well as several quality metrics to evaluate the fusion performance objectively.

Fig. 7: Four pairs of source images. The top row contains infrared images, and the second row contains visible images.

IV-A Experimental Settings

In our experiments, 21 pairs of infrared and visible images (available at https://github.com/hli1221/imagefusion-nestfuse) were collected from [32] and [40]. A sample of these images is shown in Fig.7.

We choose twelve typical and state-of-the-art fusion methods to evaluate the fusion performance: the cross bilateral filter fusion method (CBF) [11], the discrete cosine harmonic wavelet transform fusion method (DCHWT) [10], the joint SR based fusion method (JSR) [50], the joint sparse representation model with saliency detection fusion method (JSRSD) [24], gradient transfer and total variation minimization (GTF) [30], the visual saliency map and weighted least square optimization based fusion method (WLS) [32], the convolutional sparse representation based fusion method (ConvSR) [28], the VGG-19 and multi-layer fusion strategy-based method (VggML) [16], DeepFuse [35], DenseFuse [19] (using the addition fusion strategy with its trade-off parameter set to 100), the GAN-based fusion network (FusionGAN) [31] and a general end-to-end fusion network (IFCNN) [51]. All these comparison methods are implemented using their publicly available code, with parameters set according to their papers.

Seven quality metrics are utilized for the quantitative comparison between our fusion method and the existing fusion methods: entropy ($En$) [37]; standard deviation ($SD$) [36]; mutual information ($MI$) [34]; $FMI_{dct}$ and $FMI_{w}$ [4], which calculate the feature mutual information ($FMI$) for the discrete cosine transform and the region features, respectively; the modified structural similarity for no-reference images ($SSIM_a$); and visual information fidelity ($VIF$) [5].

The $SSIM_a$ is calculated by Eq.11,

$SSIM_a(F) = \left( SSIM(F, I_1) + SSIM(F, I_2) \right) \times 0.5$   (11)

where $SSIM(\cdot)$ denotes the structural similarity measure [42], $F$ is the fused image, and $I_1$, $I_2$ are the source images.

For all seven metrics, a larger value indicates better fusion performance. Larger $En$ and $SD$ mean that the fused image contains more information. Larger $MI$, $FMI_{dct}$ and $FMI_w$ indicate that the fusion method preserves more of the raw information and features of the source images. Larger $SSIM_a$ and $VIF$ mean that the fusion algorithm preserves more structural information from the source images and generates more natural features.
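For completeness, $SSIM_a$ of Eq.11 can be computed as below; any SSIM implementation can be plugged in, and the global-statistics surrogate used here is only a stand-in for the windowed SSIM of [42].

```python
import torch

def ssim_simple(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global-statistics SSIM surrogate for illustration only; the evaluation in
    # the paper uses the standard windowed SSIM of [42].
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

def ssim_a(fused, src1, src2, ssim_fn=ssim_simple):
    """SSIM_a (Eq. 11): average structural similarity between the fused image
    and each of the two source images; larger values mean more preserved structure."""
    return 0.5 * (ssim_fn(fused, src1) + ssim_fn(fused, src2))

fused, ir, vis = (torch.rand(1, 1, 64, 64) for _ in range(3))
print(ssim_a(fused, ir, vis).item())
```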

IV-B Ablation Study

IV-B1 Parameter ($\lambda$) in the Loss Function

As discussed in Section III-B, the parameter $\lambda$ is set to 1, 10, 100 and 1000. The number of epochs and the batch size are 2 and 4, respectively. Our network is implemented in PyTorch and trained on an NVIDIA GTX 1080Ti GPU. The line charts of the loss values are shown in Fig.9.

In Fig.9, during the first 400 iterations, the auto-encoder network converges more rapidly as the parameter $\lambda$ increases; the pixel loss, SSIM loss and total loss converge faster when $\lambda$ is 100 or 1000. In addition, once the number of iterations exceeds 600, near-optimal network weights are obtained no matter which $\lambda$ is chosen. In general, our fusion network converges faster in the early stage as $\lambda$ increases.

We still need to choose one value of $\lambda$ for the image fusion task based on the test images. The seven metrics are used to evaluate the performance of the networks trained with different $\lambda$, and the global pooling operations utilized in the channel attention model are the average, max and nuclear-norm operators. The values are shown in Table III. The best values are indicated in bold and the second-best values are denoted in red italic.

From Table III, although different values of $\lambda$ have little effect on the convergence rate once the number of iterations becomes large, they still influence the fusion performance of our framework. When $\lambda$ is 100, our network achieves better fusion performance than with the other values of $\lambda$. Therefore, in our experiments, $\lambda$ is set to 100.

Fig. 8: The training framework with deep supervision. $O_1$, $O_2$ and $O_3$ are the outputs based on different scale features.
Fig. 9: The line charts of the pixel loss (a), SSIM loss (b) and total loss (c) in the training phase. Each point on the horizontal axis represents 10 iterations, and the first 1000 iterations are shown.
             En [37]   SD [36]    MI [34]   FMI_dct [4]  FMI_w [4]  SSIM_a   VIF [5]
λ = 1        6.91369   82.39563   13.82738  0.35118      0.43607    0.73154  0.78185
             6.88793   80.02630   13.77586  0.35450      0.43172    0.73512  0.74792
             6.89778   82.51583   13.79557  0.35700      0.43522    0.73323  0.75936
λ = 10       6.90552   81.79393   13.81103  0.34487      0.43387    0.73177  0.77405
             6.88281   79.80942   13.76562  0.34749      0.42985    0.73516  0.74333
             6.89021   82.11951   13.78042  0.34967      0.43307    0.73339  0.75367
λ = 100      6.91971   82.75242   13.83942  0.35801      0.43724    0.73199  0.78652
             6.89421   80.36372   13.78842  0.36080      0.43293    0.73532  0.75204
             6.90461   82.92572   13.80923  0.36277      0.43621    0.73360  0.76415
λ = 1000     6.91062   82.14301   13.82125  0.34819      0.43572    0.73211  0.77881
             6.88939   80.18043   13.77877  0.35079      0.43109    0.73547  0.74746
             6.89612   82.40198   13.79224  0.35295      0.43437    0.73393  0.75632
TABLE III: The metric values of NestFuse trained with different $\lambda$ and different global pooling operations; for each $\lambda$, the three rows correspond to the three global pooling operators in the channel attention model.
                       En [37]   SD [36]    MI [34]   FMI_dct [4]  FMI_w [4]  SSIM_a   VIF [5]
deep supervision       6.90495   81.95101   13.80989  0.34559      0.43414    0.73149  0.77568
                       6.88467   80.00623   13.76934  0.34870      0.43018    0.73479  0.74551
                       6.89084   82.19268   13.78168  0.35087      0.43337    0.73322  0.75491
                       6.91023   82.31554   13.82046  0.34433      0.43395    0.73182  0.77754
                       6.88734   80.10722   13.77469  0.34725      0.42984    0.73520  0.74612
                       6.89418   82.53775   13.78835  0.34941      0.43318    0.73350  0.75650
                       6.90866   82.23202   13.81732  0.34462      0.43399    0.73172  0.77702
                       6.88544   80.00264   13.77088  0.34753      0.42984    0.73512  0.74526
                       6.89351   82.47415   13.78702  0.34965      0.43307    0.73334  0.75639
w/o deep supervision   6.91971   82.75242   13.83942  0.35801      0.43724    0.73199  0.78652
                       6.89421   80.36372   13.78842  0.36080      0.43293    0.73532  0.75204
                       6.90461   82.92572   13.80923  0.36277      0.43621    0.73360  0.76415
TABLE IV: The objective evaluation of the outputs with deep supervision (nine rows: the three scale outputs, each with the three global pooling operators) and without (w/o) deep supervision.
Fig. 10: Experiment on "man" images. (a) Infrared image; (b) Visible image; (c) CBF; (d) DCHWT; (e) JSR; (f) JSRSD; (g) GTF; (h) WLS; (i) ConvSR; (j) VggML; (k) DeepFuse; (l) DenseFuse; (m) IFCNN; (n) FusionGAN; (o)-(q) NestFuse with the three global pooling operators.
Fig. 11: Experiment on "street" images. (a) Infrared image; (b) Visible image; (c) CBF; (d) DCHWT; (e) JSR; (f) JSRSD; (g) GTF; (h) WLS; (i) ConvSR; (j) VggML; (k) DeepFuse; (l) DenseFuse; (m) IFCNN; (n) FusionGAN; (o)-(q) NestFuse with the three global pooling operators.
Fig. 12: Experiment on "umbrella" images. (a) Infrared image; (b) Visible image; (c) CBF; (d) DCHWT; (e) JSR; (f) JSRSD; (g) GTF; (h) WLS; (i) ConvSR; (j) VggML; (k) DeepFuse; (l) DenseFuse; (m) IFCNN; (n) FusionGAN; (o)-(q) NestFuse with the three global pooling operators.
                 En [37]   SD [36]    MI [34]   FMI_dct [4]  FMI_w [4]  SSIM_a   VIF [5]
CBF [11]         6.85749   76.82410   13.71498  0.26309      0.32350    0.59957  0.71849
DCHWT [10]       6.56777   64.97891   13.13553  0.38568      0.40147    0.73132  0.50560
JSR [50]         6.72263   74.10783   12.72654  0.14236      0.18506    0.60642  0.63845
JSRSD [24]       6.72057   79.19536   13.38575  0.14253      0.18498    0.54097  0.67071
GTF [30]         6.63433   67.54361   13.26865  0.39787      0.41038    0.70016  0.41687
WLS [32]         6.64071   70.58894   13.28143  0.33103      0.37662    0.72360  0.72874
ConvSR [28]      6.25869   50.74372   12.51737  0.34640      0.34640    0.75335  0.39218
VggML [16]       6.18260   48.15779   12.36521  0.40463      0.41684    0.77803  0.29509
DeepFuse [35]    6.69935   68.79312   13.39869  0.41501      0.42477    0.72882  0.65773
DenseFuse [19]   6.67158   67.57282   13.34317  0.41727      0.42767    0.73150  0.64576
FusionGAN [31]   6.36285   54.35752   12.72570  0.36335      0.37083    0.65384  0.45355
IFCNN [51]       6.59545   66.87578   13.19090  0.37378      0.40166    0.73186  0.59029
NestFuse         6.91971   82.75242   13.83942  0.35801      0.43724    0.73199  0.78652
                 6.89421   80.36372   13.78842  0.36080      0.43293    0.73532  0.75204
                 6.90461   82.92572   13.80923  0.36277      0.43621    0.73360  0.76415
TABLE V: The average values of the quality metrics for the fused images. The three NestFuse rows correspond to the three global pooling operators $P(\cdot)$ (average, max and nuclear-norm) in the channel attention model-based fusion strategy.

IV-B2 The Influence of Multi-scale Deep Features

In this section, we analyze the influence of different scales of deep features on the fusion performance; the parameter $\lambda$ is set to 100.

To generate multiple outputs from different scales of deep features, we use the deeply supervised training strategy of UNet++ [52] to train our fusion network. The training framework of NestFuse with deep supervision is shown in Fig.8.

$O_1$, $O_2$ and $O_3$ are the outputs obtained by NestFuse with deep supervision, and the loss function is defined as follows,

$L_{ds} = \sum_{i=1}^{3} L_{total}(O_i, I)$   (12)

where $i \in \{1, 2, 3\}$ and $L_{total}$ is the total loss function discussed in Section III-B.
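A sketch of this deeply supervised objective is given below, under the assumption that Eq.12 simply sums the Section III-B total loss over the three scale outputs; the toy L1 stand-in is used only to make the snippet self-contained.

```python
import torch

def deep_supervision_loss(outputs, target, total_loss_fn):
    """Sketch of Eq. 12: sum the Section III-B total loss over every scale output
    (assumed form; the exact weighting of the terms is not asserted here)."""
    return sum(total_loss_fn(o, target) for o in outputs)

# Toy usage with an L1 stand-in for the real total loss of Section III-B.
toy_total_loss = lambda o, t: (o - t).abs().mean()
target = torch.rand(1, 1, 64, 64)
outputs = [torch.rand(1, 1, 64, 64) for _ in range(3)]   # O1, O2, O3
print(deep_supervision_loss(outputs, target, toy_total_loss).item())
```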

Seven quality metrics are again used to evaluate the fusion performance at the different scales of deep features. The values are shown in Table IV, with the best values indicated in bold. "w/o deep supervision" denotes the training phase without deep supervision, which was introduced in Section III-B.

From Table IV, with deep supervision the metric values are very close across the different scales ($O_1$, $O_2$ and $O_3$), so the advantage of the multi-scale deep features in NestFuse is not fully exploited. Specifically, compared with the deeper scale features, the shallow scale features obtain better evaluation on the information-related metrics, which indicates that shallow scale features contain more detail information. When deeper scale features are utilized in NestFuse, the fused images contain more structural features, yielding the best values on the structure-related metrics and comparable values elsewhere.

However, when we train NestFuse with the global optimization strategy, the fusion performance is boosted (it obtains all the best values), which means that the multi-scale mechanism is effective in our fusion network. This indicates that the deeply supervised strategy, although it achieves better performance in image segmentation, does not train a better model for the image fusion task.

Thus, our network is trained with global optimization strategy which fully utilizes the multi-scale features in NestFuse.

IV-C Results Analysis

The fused images obtained by the existing fusion methods and our fusion method (NestFuse) are shown in Fig.10 - Fig.12. We analyze the visual quality of the fused results on three pairs of infrared and visible images.

As shown in the red boxes of Fig.10, Fig.11 and Fig.12, compared with the proposed method, CBF, DCHWT, JSR and JSRSD generate much more noise in the fused images and some detail information is not clear. For GTF, WLS, ConvSR, VggML and FusionGAN, although some of the salient features are highlighted, some regions in the fused images are blurred, and the features in the red boxes are not satisfactory.

On the contrary, DeepFuse, DenseFuse, IFCNN and the proposed method obtain better fusion performance in the subjective evaluation than the other fusion methods. In addition, the fused images obtained by the proposed method have more reasonable luminance information.

For the objective evaluation, we use the seven objective metrics to evaluate the fusion performance of the twelve existing fusion methods and the proposed method.

The average values of the seven metrics over all fused images obtained by the existing methods and the proposed fusion method are shown in Table V. The best values are indicated in bold and the second-best values are denoted in red italic.

From Table V, the proposed fusion framework obtains five best values and five second-best values (the exceptions are $FMI_{dct}$ and $SSIM_a$). This indicates that the proposed fusion framework preserves more detail information ($En$, $SD$ and $VIF$) and more feature information ($MI$ and $FMI_w$) in the fused images.

The metric $En$ measures the amount of information in an image: a larger $En$ means the fused image contains more information. However, if a fusion method introduces noise during the fusion process, this also leads to a larger $En$ (Fig.10(c)-12(c)), which is why the fused images obtained by CBF achieve a large $En$. In contrast, the fused images obtained by our proposed method have more reasonable luminance information and contain less noise, which is why our proposed fusion method achieves the best value of $En$.

Compared with the max operator in the channel attention model-based fusion strategy, the average and nuclear-norm operators achieve almost all the best values of the objective metrics. In the channel attention model, these two operations are effective and can capture more structural information from the deep features.

Fig. 13: The tracking results on the VOT-RGBT2019 benchmark. The first and second rows show RGB and infrared images, respectively. Each column shows one pair of RGB and infrared frames selected from the video sequences 'biketwo', 'carLight', 'greyman' and 'elecbike'. 'RGB' and 'infrared' denote the cases where the input of SiamRPN++ is a single modality (RGB or infrared). 'NestFuse' denotes the case where the multi-scale fusion strategy is applied to SiamRPN++.
                          Type      EAO     Accuracy  Failures
SiamRPN++                 infrared  0.2831  0.5875    43.9274
                          RGB       0.3312  0.6104    37.5201
SiamRPN++ with NestFuse             0.3493  0.6661    40.9503
TABLE VI: Tracking results of SiamRPN++ and SiamRPN++ with NestFuse on the VOT-RGBT2019 dataset. The fusion strategy developed in NestFuse is utilized to fuse the multi-scale deep features.

IV-D An Application to Visual Object Tracking

The Visual Object Tracking (VOT) challenges address short-term or long-term, causal and model-free tracking [8][46][45][44].

In VOT2019, two new sub-challenges (VOT-RGBT and VOT-RGBD) were introduced by the committee. The VOT-RGBT sub-challenge focuses on short-term tracking with two modalities (RGB and thermal infrared). As mentioned in [17], infrared and visible image fusion methods are ideally suited to improving the tracking performance in this task.

According to our previous research [17], if the tracker engages a greater proportion of deep features for data representation, its performance improves when the fusion method focuses on feature-level fusion. This insight motivates applying our proposed fusion method to the RGBT tracking task.

Thus, in this experiment, we choose SiamRPN++ [12] as the base tracker and apply the fusion strategy proposed in this paper to perform feature-level fusion. SiamRPN++ is a deep learning-based tracker which achieved state-of-the-art tracking performance in 2019.

The VOT-RGBT benchmark [8] contains 60 video sequences. Examples of these frames and some tracking results are shown in Fig.13.

For the objective evaluation, three metrics [8] are selected to analyze the tracking performance: Expected Average Overlap (EAO), Accuracy and Failures. (1) EAO estimates the average overlap a tracker attains on a large collection of sequences with the same visual properties as the given dataset; (2) Accuracy denotes the average overlap between the predicted and ground-truth bounding boxes; (3) Failures evaluates the robustness of a tracker.

The evaluation results of SiamRPN++ with and without the proposed fusion method are shown in Table VI. Bold and red italic indicate the best and second-best values, respectively.

In the VOT challenge, EAO is the primary measure. As shown in Table VI, compared with the single-modality inputs ('RGB' and 'infrared'), the tracking performance (EAO) is improved by applying our fusion strategy to fuse the multi-scale deep features. This indicates that the proposed fusion method not only works for the image fusion task, but can also improve the tracking performance in the RGBT tracking task.

In future work, we will apply the proposed fusion method to other computer vision tasks to further evaluate the performance of the fusion algorithm.

V Conclusions

In this paper, we propose a novel image fusion architecture by developing a nest connection network and spatial/channel attention models. Firstly, with the pooling operators in the encoder network, multi-scale features are extracted, which represent the source images more richly. Then, the proposed spatial/channel attention models are utilized to fuse these multi-scale deep features at each scale. The fused features are fed into the nest connection-based decoder network to generate the fused image. With this novel network structure and the multi-scale deep feature fusion strategy, more salient features are preserved in the reconstruction process and the fusion performance is improved.

The experimental results and analyses show that the proposed fusion framework achieves state-of-the-art fusion performance. An additional experiment on the RGBT tracking task also shows that the proposed fusion strategy is effective in improving performance in other computer vision tasks.

References

  • [1] M. Aharon, M. Elad, A. Bruckstein, et al. (2006) K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing 54 (11), pp. 4311. Cited by: §I.
  • [2] A. Ben Hamza, Y. He, H. Krim, and A. Willsky (2005) A multiscale approach to pixel-level image fusion. Integrated Computer-Aided Engineering 12 (2), pp. 135–146. Cited by: TABLE I, §I, §I.
  • [3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §I.
  • [4] M. Haghighat and M. A. Razian (2014) Fast-FMI: non-reference image fusion metric. In 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–3. Cited by: §IV-A, TABLE III, TABLE IV, TABLE V.
  • [5] Y. Han, Y. Cai, Y. Cao, and X. Xu (2013) A new image fusion performance metric based on visual information fidelity. Information fusion 14 (2), pp. 127–135. Cited by: §IV-A, TABLE III, TABLE IV, TABLE V.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
  • [7] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §I.
  • [8] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J. Kamarainen, L. Cehovin Zajc, O. Drbohlav, A. Lukezic, A. Berg, et al. (2019) The seventh visual object tracking vot2019 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1–36. Cited by: §I, §IV-D, §IV-D, §IV-D.
  • [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
  • [10] B. S. Kumar (2013) Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing 7 (6), pp. 1125–1143. Cited by: §IV-A, TABLE V.
  • [11] B. S. Kumar (2015) Image fusion based on pixel significance using cross bilateral filter. Signal, image and video processing 9 (5), pp. 1193–1204. Cited by: §IV-A, TABLE V.
  • [12] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019) SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4282–4291. Cited by: §IV-D.
  • [13] C. Li, X. Liang, Y. Lu, N. Zhao, and J. Tang (2019) RGB-T object tracking: benchmark and baseline. Pattern Recognition 96, pp. 106977. Cited by: §I.
  • [14] H. Li, Y. Wang, Z. Yang, R. Wang, X. Li, and D. Tao (2019) Discriminative dictionary learning-based multiple component decomposition for detail-preserving noisy image fusion. IEEE Transactions on Instrumentation and Measurement. Note: doi: 10.1109/TIM.2019.2912239 Cited by: TABLE I, §I, §I.
  • [15] H. Li, X. Wu, and T. S. Durrani (2019) Infrared and Visible Image Fusion with ResNet and zero-phase component analysis. Infrared Physics & Technology, pp. 103039. Cited by: TABLE I, §I, §I.
  • [16] H. Li, X. Wu, and J. Kittler (2018) Infrared and Visible Image Fusion using a Deep Learning Framework. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2705–2710. Cited by: TABLE I, §I, §I, §III-C1, §IV-A, TABLE V.
  • [17] H. Li, X. Wu, and J. Kittler (2020) MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing. Note: doi: 10.1109/TIP.2020.2975984 Cited by: §I, §IV-D, §IV-D.
  • [18] H. Li and X. Wu (2017) Multi-focus image fusion using dictionary learning and low-rank representation. In International Conference on Image and Graphics, pp. 675–686. Cited by: TABLE I, §I, §I.
  • [19] H. Li and X. Wu (2018) DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623. Cited by: TABLE I, §I, §I, §II-A, §III-B, §III-C1, §III-C, §IV-A, TABLE V.
  • [20] S. Li, X. Kang, L. Fang, J. Hu, and H. Yin (2017) Pixel-level image fusion: A survey of the state of the art. Information Fusion 33, pp. 100–112. Cited by: §I.
  • [21] S. Li, X. Kang, and J. Hu (2013) Image fusion with guided filtering. IEEE Transactions on Image processing 22 (7), pp. 2864–2875. Cited by: TABLE I, §I.
  • [22] X. Li, X. Guo, P. Han, X. Wang, H. Li, and T. Luo (2020) Laplacian re-decomposition for multimodal medical image fusion. IEEE Transactions on Instrumentation and Measurement. Note: doi: 10.1109/TIM.2020.2975405 Cited by: §I.
  • [23] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §III-B.
  • [24] C. Liu, Y. Qi, and W. Ding (2017) Infrared and visible image fusion method based on saliency detection in sparse domain. Infrared Physics & Technology 83, pp. 94–102. Cited by: TABLE I, §I, §I, §IV-A, TABLE V.
  • [25] G. Liu, Z. Lin, and Y. Yu (2010) Robust subspace segmentation by low-rank representation. In ICML, Vol. 1, pp. 8. Cited by: §I.
  • [26] Y. Liu, X. Chen, H. Peng, and Z. Wang (2017) Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207. Cited by: TABLE I, §I, §II-A.
  • [27] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, and X. Wang (2018) Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173. Cited by: §II.
  • [28] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang (2016) Image fusion with convolutional sparse representation. IEEE signal processing letters 23 (12), pp. 1882–1886. Cited by: TABLE I, §I, §III-C1, §IV-A, TABLE V.
  • [29] X. Lu, B. Zhang, Y. Zhao, H. Liu, and H. Pei (2014) The infrared and visible image fusion algorithm based on target separation and sparse representation. Infrared Physics & Technology 67, pp. 397–407. Cited by: TABLE I, §I.
  • [30] J. Ma, C. Chen, C. Li, and J. Huang (2016) Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion 31, pp. 100–109. Cited by: §IV-A, TABLE V.
  • [31] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang (2019) FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: TABLE I, §I, §II-A, §IV-A, TABLE V.
  • [32] J. Ma, Z. Zhou, B. Wang, and H. Zong (2017) Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Physics & Technology 82, pp. 8–17. Cited by: §IV-A, §IV-A, TABLE V.
  • [33] G. Pajares and J. M. De La Cruz (2004) A wavelet-based image fusion tutorial. Pattern recognition 37 (9), pp. 1855–1872. Cited by: TABLE I, §I, §I.
  • [34] H. Peng, F. Long, and C. Ding (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence (8), pp. 1226–1238. Cited by: §IV-A, TABLE III, TABLE IV, TABLE V.
  • [35] K. Ram Prabhakar, V. Sai Srikar, and R. Venkatesh Babu (2017) Deepfuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4714–4722. Cited by: §IV-A, TABLE V.
  • [36] Y. Rao (1997) In-fibre bragg grating sensors. Measurement science and technology 8 (4), pp. 355. Cited by: §IV-A, TABLE III, TABLE IV, TABLE V.
  • [37] J. W. Roberts, J. A. Van Aardt, and F. B. Ahmed (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522. Cited by: §IV-A, TABLE III, TABLE IV, TABLE V.
  • [38] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I.
  • [39] S. Singh and R. Anand (2019) Multimodal medical image sensor fusion model using sparse k-svd dictionary learning in nonsubsampled shearlet domain. IEEE Transactions on Instrumentation and Measurement. Note: doi: 10.1109/TIM.2019.2902808 Cited by: TABLE I, §I.
  • [40] A. Toet (2014) TNO Image Fusion Dataset. Note: https://figshare.com/articles/TN_Image_Fusion_Dataset/1008029 Cited by: §IV-A.
  • [41] A. Vishwakarma and M. Bhuyan (2018) Image fusion using adjustable non-subsampled shearlet transform. IEEE Transactions on Instrumentation and Measurement 68 (9), pp. 3367–3378. Cited by: TABLE I, §I.
  • [42] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §III-B, §IV-A.
  • [43] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma (2008) Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence 31 (2), pp. 210–227. Cited by: §I.
  • [44] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) An Accelerated Correlation Filter Tracker. Pattern Recognition accepted. arXiv preprint arXiv:1912.02854. Cited by: §IV-D.
  • [45] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) Joint group feature selection and discriminative filter learning for robust visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7950–7960. Cited by: §IV-D.
  • [46] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) Learning Adaptive Discriminative Correlation Filters via Temporal Consistency preserving Spatial Feature Selection for Robust Visual Object Tracking. IEEE Transactions on Image Processing. Cited by: §IV-D.
  • [47] X. Yan, S. Z. Gilani, H. Qin, and A. Mian (2018) Unsupervised deep multi-focus image fusion. arXiv preprint arXiv:1806.07272. Cited by: TABLE I, §I.
  • [48] S. Yang, M. Wang, L. Jiao, R. Wu, and Z. Wang (2010) Image fusion based on a new contourlet packet. Information Fusion 11 (2), pp. 78–84. Cited by: TABLE I, §I.
  • [49] M. Yin, P. Duan, W. Liu, and X. Liang (2017) A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation. Neurocomputing 226, pp. 182–191. Cited by: TABLE I, §I.
  • [50] Q. Zhang, Y. Fu, H. Li, and J. Zou (2013) Dictionary learning method for joint sparse representation-based image fusion. Optical Engineering 52 (5), pp. 057006. Cited by: §IV-A, TABLE V.
  • [51] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang (2020) IFCNN: A general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118. Cited by: TABLE I, §II-A, §IV-A, TABLE V.
  • [52] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang (2018) Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11. Cited by: §I, §II-B, §IV-B2.