Image fusion is an important image processing technique that aims to generate a single image containing the salient features and complementary information of the source images, by using appropriate feature extraction methods and fusion strategies. Current state-of-the-art fusion algorithms are widely employed in many applications, such as self-driving vehicles, visual tracking and video surveillance.
Fusion algorithms can be broadly classified into two categories: traditional methods and deep learning-based methods. Most traditional methods are based on signal processing operators and have achieved good performance. In recent years, deep learning-based methods have exhibited immense potential in image fusion tasks and have been shown to offer better performance than traditional algorithms.
Traditional methods, in general, cover two approaches: multi-scale based methods, and sparse and low-rank representation learning-based methods. Multi-scale methods usually decompose the source images into different scales to extract features and use appropriate fusion strategies to fuse the features at each scale. An inverse operator is then used to reconstruct the fused image. Although these methods demonstrate good fusion performance, that performance is highly dependent on the choice of multi-scale decomposition.
Before the development of deep learning-based fusion methods, sparse representation (SR) and low-rank representation (LRR) had attracted significant attention, and several fusion algorithms were developed based on SR. Liu et al. proposed a fusion algorithm based on joint sparse representation (JSR) and a saliency detection operator, in which JSR is used to extract common information and complementary features from the source images.
In the LRR domain, Li et al. presented a multi-focus image fusion method based on LRR and dictionary learning. In this approach, the source images are first divided into image patches and histogram of oriented gradients (HOG) features are utilized to classify each patch; a global dictionary is then learned by K-singular value decomposition (K-SVD). In addition, there are many other methods combining SR with other operators, such as the pulse coupled neural network (PCNN) and the shearlet transform.
Although the SR and LRR based fusion methods achieve very good performance, they still have weaknesses: (1) the running time of the fusion algorithm is highly dependent on the dictionary learning operator; (2) when the source images are complex, the representation performance degrades.
To address these drawbacks, many deep learning-based fusion methods have been proposed in the past several years. These methods can be separated into two categories: with and without a training phase.
[Table I (summary of existing fusion methods): First class | Second class | Reference | Advantages | Disadvantages]
Methods without a training phase do not use backpropagation; instead, a pre-trained network is used to extract deep features, from which a decision map is generated. Based on this idea, Li et al. proposed a fusion framework that utilizes pre-trained networks (VGG-19 and ResNet50). This was the first time that multi-level deep features were used to address the infrared and visible image fusion task.
Since a model trained specifically for the image fusion task can obtain better fusion performance, the latest deep learning methods are all based on this strategy. In 2017, Liu et al. proposed a convolutional neural network (CNN) based fusion framework for the multi-focus image fusion task. Yan et al. also presented a fusion network based on a CNN and multi-level features. In the infrared and visible image fusion field, Li et al. proposed a novel fusion framework based on a dense block and an auto-encoder architecture. Ma et al. applied a generative adversarial network (GAN) to the infrared and visible image fusion task. Compared with existing fusion methods, these CNN or GAN based fusion frameworks have achieved extraordinary fusion performance.
However, these deep learning-based frameworks still have several drawbacks: (1) without a down-sampling operator, the network cannot extract multi-scale features, so the deep features are not fully utilized; (2) the topology of the network architecture needs to be improved for multi-scale feature extraction; (3) the fusion strategy is not carefully designed for fusing deep features. A summary of all the above existing fusion methods is shown in Table I.
In order to solve these drawbacks, we propose a novel fusion framework based on a novel connection architecture and an appropriate fusion strategy. The main contributions of our fusion framework are summarized as follows:
(1) The nest connection architecture is applied to a CNN based fusion framework. Our nest connection-based framework differs from existing nest connection based frameworks and contains three parts: an encoder network, a fusion strategy and a decoder network.
(2) Our nest connection architecture makes full use of the deep features and preserves more information from the different-scale features extracted by the encoder network.
(3) For the fusion of multi-scale deep features, we propose a novel fusion strategy based on spatial attention and channel attention models.
(4) Compared with existing state-of-the-art fusion methods, our fusion framework has better performance in terms of both visual assessment and objective assessment.
The rest of our paper is structured as follows. In Section II, we briefly review related work on deep learning-based fusion methods. In Section III, we present the proposed fusion framework in detail. In Section IV, we illustrate the experimental results. Finally, we draw conclusions in Section V.
II Related Works
With the rise of deep learning in recent years, many deep learning based methods have been proposed for the image fusion task. These methods attempt to design an end-to-end network to directly generate fused images. In this section, we first briefly introduce several classical methods and the latest deep learning-based methods. Then we present the nest connection approach.
II-A Deep Learning-based Fusion Methods
In 2017, a CNN-based fusion network was proposed by Liu et al. In their paper, pairs of image patches containing differently blurred versions were used to train the network, with clear and blurred patches labelled 1 and 0, respectively. The aim of this network was to generate a decision map indicating which source image is more in focus at each position. Thanks to the training phase, this CNN-based method obtained better fusion performance than earlier algorithms. However, due to the limitation of the training strategy, the method is only suitable for multi-focus images.
To overcome this weakness, Li et al. proposed a novel auto-encoder based network (DenseFuse) for fusing infrared and visible images. It consists of three parts: encoder, fusion layer and decoder. In the training phase, the fusion layer is discarded and DenseFuse degenerates into an auto-encoder network. The purpose of the training phase is to obtain two sub-networks in which the encoder fully extracts deep features from the source images and the decoder adaptively reconstructs the raw data from the encoded features. During the testing phase, the fusion layer is utilized to fuse the deep features, and the fused image is then reconstructed by the decoder network. To preserve more detail information, Zhang et al. proposed a general end-to-end fusion network with a simple yet effective architecture to generate fused images.
The GAN architecture was introduced to the infrared and visible image fusion field (FusionGAN) by Ma et al. In the training phase, the source images are concatenated into one tensor and fed into the generator network, which produces the fused image. The loss function contains two terms: a content loss and a discriminator loss. With this adversarial strategy, the generator network can be trained to fuse arbitrary infrared and visible images.
II-B The Nest Connection Architecture
The nest connection architecture was proposed by Zhou et al. for the task of medical image segmentation. In deep networks, the skip connection is a common operator used to preserve information from previous layers. However, the semantic gap causes unexpected results when long skip connections are used in the network architecture. To solve this problem, Zhou et al. presented a novel architecture (nest connection) which uses up-sampling and several short skip connections to replace a long skip connection. The framework of the nest connection is illustrated in Fig. 1.
With the nest connection, the influence of the semantic gap is constrained and more information is preserved to obtain better segmentation results.
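To make the wiring concrete, the nested node layout can be sketched in a few lines of Python. This is an illustrative sketch only: `block` and `upsample2x` below are stand-ins (channel averaging and nearest-neighbour upsampling) for the real convolutional blocks, and the channel count and image size are arbitrary.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def block(*inputs):
    """Stand-in for a convolutional block: concatenate the inputs along
    the channel axis, then mix back to the channel count of the first input."""
    cat = np.concatenate(inputs, axis=0)
    c = inputs[0].shape[0]
    return cat.reshape(-1, c, *cat.shape[1:]).mean(axis=0)

# Encoder features at four scales (node (i, 0) = scale i, 4 channels each).
x = {(i, 0): np.random.rand(4, 32 // 2**i, 32 // 2**i) for i in range(4)}

# Nest connection: each intermediate node (i, j) receives every earlier node
# at the same scale (short skip connections) plus an upsampled node from the
# scale below, instead of one long skip connection.
for j in range(1, 4):
    for i in range(4 - j):
        same_scale = [x[(i, k)] for k in range(j)]
        from_below = upsample2x(x[(i + 1, j - 1)])
        x[(i, j)] = block(*same_scale, from_below)

print(x[(0, 3)].shape)  # finest-scale node that feeds the output layer
```

The `(i, j)` indexing (scale `i`, depth `j`) mirrors the node naming used in nested architectures such as UNet++; the spatial sizes of the intermediate nodes follow directly from the short-skip plus upsampling wiring.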
Inspired by this work, we introduce this architecture into the image fusion task and propose a fusion framework based on a modified nest connection and a novel fusion strategy.
III Proposed Fusion Method
In this section, the proposed nest connection-based fusion network is introduced in detail. Firstly, the fusion framework is presented in Section III-A. Then, the details of the training phase are described in Section III-B. Finally, we present our novel fusion strategy based on two attention models in Section III-C.
III-A Fusion Network
Our fusion network (see Fig. 2) contains three main parts: the encoder (blue square), the fusion strategy (blue circle) and the decoder (the remaining blocks). The nest connection is utilized in the decoder network to process the multi-scale deep features extracted by the encoder.
In Fig. 2, the two inputs denote the source images and the output denotes the fused image. “Conv” means one convolutional layer. “ECB” denotes an encoder convolutional block, which contains two convolutional layers and one max-pooling layer, and “DCB” indicates a decoder convolutional block without a pooling operator.
Firstly, the two input images are separately fed into the encoder network to obtain multi-scale deep features. At each scale, our fusion strategy is utilized to fuse the resulting features. Finally, the nest connection-based decoder network reconstructs the fused image from the fused multi-scale deep features.
In the next sections, we introduce the training phase and the novel fusion strategy, respectively.
III-B Training Phase
The training strategy is similar to that of DenseFuse. In the training phase, the fusion strategy is discarded; we aim to train an auto-encoder network in which the encoder extracts multi-scale deep features and the decoder reconstructs the input image from these features. The training framework is shown in Fig. 3, and the fusion network settings are outlined in Table II.
In Fig. 3 and Table II, I and O are the input and output images, respectively. The encoder network consists of one convolutional layer (“Conv”) and four convolutional blocks (“ECB10”, “ECB20”, “ECB30” and “ECB40”). Each block contains two convolutional layers and one max-pooling operator, which ensures that the encoder network can extract deep features at different scales.
The decoder network has six convolutional blocks (“DCB11”, “DCB12”, “DCB13”; “DCB21”, “DCB22”; “DCB31”) and one convolutional layer (“Conv”). The six convolutional blocks are connected by the nest connection architecture to reduce the semantic gap between encoder and decoder.
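The role of the max-pooling operators in the encoder blocks can be illustrated by tracing the feature-map sizes through the four ECBs. This is a shape-only sketch under assumed dimensions: the channel count (16) and input size (64x64) are placeholders, not the values in Table II.

```python
import numpy as np

def max_pool2x(x):
    """2x2 max-pooling on a (C, H, W) array (H and W must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

# Trace feature-map sizes through the encoder: each ECB ends in max-pooling,
# so each block halves the spatial resolution, producing one scale per block.
feat = np.random.rand(16, 64, 64)      # output of the first "Conv" layer
scales = []
for name in ["ECB10", "ECB20", "ECB30", "ECB40"]:
    feat = max_pool2x(feat)
    scales.append((name, feat.shape))

for name, shape in scales:
    print(name, shape)
```

Each of the four tuples printed corresponds to one scale of deep features handed to the fusion strategy and decoder.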
In the training phase, the loss function is defined as follows,

L_total = L_pixel + λ × L_ssim   (1)

where L_pixel and L_ssim indicate the pixel loss and the structural similarity (SSIM) loss between the input image I and the output image O, and λ denotes the trade-off between L_pixel and L_ssim.

L_pixel is calculated by Eq. 2,

L_pixel = ||O − I||_F²   (2)

where O and I indicate the output and input images, respectively, and ||·||_F is the Frobenius norm. L_pixel measures the distance between O and I and ensures that the reconstructed image is similar to the input image at the pixel level.

The SSIM loss is obtained by Eq. 3,

L_ssim = 1 − SSIM(O, I)   (3)

where SSIM(·) denotes the structural similarity measure. The output image O and the input image I are more similar in structure when the value of SSIM(O, I) becomes larger.
The aim of the training phase is to obtain a powerful encoder network and decoder network. Thus, the type of input image in the training phase is not limited to infrared and visible images. In the training stage, the MS-COCO dataset is used to train our auto-encoder network, from which we choose 80000 images as input. These images are converted to gray scale and resized to a fixed resolution. As the orders of magnitude of the pixel loss and the SSIM loss differ, the trade-off parameter λ is set to 1, 10, 100 and 1000 to train our network. A detailed analysis of the training phase is given in the ablation study in Section IV-B.
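A minimal numpy sketch of the training loss is given below. The `ssim_global` helper is a simplified single-window SSIM (the real measure is computed over local windows), and the default `lam` value is only an example; both are assumptions for illustration.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Simplified single-window SSIM over whole images in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def train_loss(inp, out, lam=100.0):
    """L_total = L_pixel + lambda * L_ssim: Frobenius-norm pixel loss
    plus a weighted (1 - SSIM) structure loss."""
    l_pixel = np.linalg.norm(out - inp) ** 2   # squared Frobenius norm
    l_ssim = 1.0 - ssim_global(out, inp)
    return l_pixel + lam * l_ssim

img = np.random.rand(64, 64)
print(train_loss(img, img))   # perfect reconstruction -> loss near 0
```

A larger `lam` gives the structure term more weight relative to the pixel term, which is why the paper sweeps it over several orders of magnitude.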
III-C Fusion Strategy
Most fusion strategies are based on a weighted-average operator that generates a weighting map to fuse the source images. The choice of the weighting map therefore becomes a key issue.
The fusion network becomes more flexible when fusion strategies are applied only in the test phase; however, existing strategies are not designed for deep features and do not yet consider attention mechanisms.
To solve this problem, in this section we introduce a novel fusion strategy based on two attention models, applied to each level of the multi-scale deep features. The framework of our fusion strategy is shown in Fig. 4.
The multi-scale deep features extracted by the encoder from the two input images are fed into a spatial attention model and a channel attention model, each of which produces fused features. These are combined into the final fused multi-scale deep features, which become the input of the decoder network. Our fusion strategy thus exploits two types of information, spatial and channel, and the extracted multi-scale deep features are processed in these two phases.
Once the outputs of the two attention models are obtained, the final fused features are generated by combining them (Eq. 4). We now introduce our attention model-based fusion strategies in detail.
III-C1 Spatial Attention Model
A spatial-based fusion strategy has previously been utilized in the image fusion task. In this paper, we extend this operation to fuse multi-scale deep features and call it the spatial attention model. The procedure of the spatial attention model is shown in Fig. 5.
Let Φ^m_1 and Φ^m_2 denote the deep features of the two source images at scale m. The weighting maps β^m_1 and β^m_2 are calculated from Φ^m_1 and Φ^m_2 by an l1-norm and a soft-max operator, as formulated in Eq. 5,

β^m_k(x, y) = ||Φ^m_k(x, y)||_1 / ( ||Φ^m_1(x, y)||_1 + ||Φ^m_2(x, y)||_1 ),  k ∈ {1, 2}   (5)

where ||·||_1 denotes the l1-norm and (x, y) indicates the corresponding position in the multi-scale deep features and weighting maps; each position Φ^m_k(x, y) denotes a C-dimensional vector in the deep features.

The enhanced deep features, weighted by β^m_1 and β^m_2, are calculated by Eq. 6,

Φ̂^m_k(x, y) = β^m_k(x, y) × Φ^m_k(x, y)   (6)

Then the fused features are obtained by adding the enhanced deep features, as shown in Eq. 7,

Φ^m_fs(x, y) = Φ̂^m_1(x, y) + Φ̂^m_2(x, y)   (7)
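The spatial attention fusion described above can be sketched directly in numpy; a small `eps` is added to the denominator as a guard against division by zero (an implementation assumption, not part of the formulation).

```python
import numpy as np

def spatial_attention_fuse(phi1, phi2):
    """Fuse two (C, H, W) deep-feature maps with l1-norm spatial weights."""
    # l1-norm across channels gives one activity value per spatial position
    a1 = np.abs(phi1).sum(axis=0)
    a2 = np.abs(phi2).sum(axis=0)
    eps = 1e-8                        # guard against division by zero
    b1 = a1 / (a1 + a2 + eps)         # soft-max style weighting maps
    b2 = a2 / (a1 + a2 + eps)
    # weight every C-dimensional vector by its map value, then add
    return b1[None] * phi1 + b2[None] * phi2

f1 = np.random.rand(8, 16, 16)
f2 = np.random.rand(8, 16, 16)
fused = spatial_attention_fuse(f1, f2)
print(fused.shape)
```

Because the two weighting maps sum to (almost) one at every position, fusing a feature map with itself returns the map essentially unchanged, which is a quick sanity check on the weighting.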
III-C2 Channel Attention Model
In existing deep learning-based fusion methods, almost all fusion strategies only exploit spatial information. However, deep features are three-dimensional tensors, so not only the spatial information but also the channel information should be considered in the fusion strategy. Thus, we propose a channel attention-based fusion strategy. The diagram of this strategy is shown in Fig. 6.
As discussed in Section III-C1, Φ^m_1 and Φ^m_2 denote the multi-scale deep features of the two inputs. C-dimensional weighting vectors are calculated from them by global pooling and soft-max, the deep features are enhanced by these weighting vectors, and the enhanced features are combined into the fused features of the channel attention-based fusion strategy.
Firstly, a global pooling operator P(·) is utilized to calculate the initial weighting vectors α^m_1 and α^m_2, as shown in Eq. 8,

α^m_k(c) = P( Φ^m_k(c) ),  k ∈ {1, 2}   (8)

where c ∈ {1, 2, ..., C} indicates the index of the channel in the deep features Φ^m_k, and P(·) is the global pooling operator.
In our channel attention model, three global pooling operators are considered: (1) the average operator, which calculates the mean value of each channel; (2) the max operator, which calculates the maximum value of each channel; (3) the nuclear-norm operator, which sums the singular values of each channel. The influence of the different global pooling operators is discussed in the ablation study in Section IV-B.
Then, a soft-max operator (Eq. 9) is used to obtain the final weighting vectors α̂^m_1 and α̂^m_2,

α̂^m_k(c) = α^m_k(c) / ( α^m_1(c) + α^m_2(c) )   (9)

With the final weighting vectors, the fused features of the channel attention model are calculated by Eq. 10,

Φ^m_fc(c) = α̂^m_1(c) × Φ^m_1(c) + α̂^m_2(c) × Φ^m_2(c)   (10)
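The channel attention fusion, with all three global pooling choices, can be sketched as follows. As in the spatial sketch, the `eps` guard is an implementation assumption, and the nuclear norm is computed per channel as the sum of singular values of its H x W matrix.

```python
import numpy as np

def global_pool(phi, mode="avg"):
    """Per-channel global pooling of a (C, H, W) feature map."""
    if mode == "avg":
        return phi.mean(axis=(1, 2))
    if mode == "max":
        return phi.max(axis=(1, 2))
    if mode == "nuclear":
        # nuclear norm = sum of singular values of each channel's H x W matrix
        return np.array([np.linalg.svd(ch, compute_uv=False).sum() for ch in phi])
    raise ValueError(mode)

def channel_attention_fuse(phi1, phi2, mode="avg"):
    """Fuse two (C, H, W) feature maps with channel-wise weighting vectors."""
    a1, a2 = global_pool(phi1, mode), global_pool(phi2, mode)
    eps = 1e-8                        # guard against division by zero
    w1 = a1 / (a1 + a2 + eps)         # soft-max style weighting vectors
    w2 = a2 / (a1 + a2 + eps)
    return w1[:, None, None] * phi1 + w2[:, None, None] * phi2

f1, f2 = np.random.rand(8, 16, 16), np.random.rand(8, 16, 16)
fused = channel_attention_fuse(f1, f2, mode="nuclear")
print(fused.shape)
```

Unlike the spatial model, which weights each position, this model assigns one weight per channel, so entire channels that are more active in one source image dominate the corresponding channel of the fused features.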
IV Experimental Results
In this section, we first describe the experimental settings of the testing phase. Then, we introduce our ablation study. We compare our method with existing methods in subjective evaluation and utilize several quality metrics to evaluate the fusion performance objectively.
IV-A Experimental Settings
In our experiments, 21 pairs of infrared and visible images were collected from publicly available datasets (these images are available at https://github.com/hli1221/imagefusion-nestfuse). A sample of these images is shown in Fig. 7.
We choose twelve typical and state-of-the-art fusion methods to evaluate the fusion performance: the cross bilateral filter fusion method (CBF), the discrete cosine harmonic wavelet transform fusion method (DCHWT), the joint SR based fusion method (JSR), the joint sparse representation model with saliency detection fusion method (JSRSD), gradient transfer and total variation minimization (GTF), the visual saliency map and weighted least square optimization based fusion method (WLS), the convolutional sparse representation based fusion method (ConvSR), the VGG-19 and multi-layer fusion strategy-based method (VggML), DeepFuse, DenseFuse (with the addition fusion strategy and the trade-off parameter set to 100), the GAN-based fusion network (FusionGAN) and a general end-to-end fusion network (IFCNN). All these comparison methods are implemented using their publicly available code, with parameters set as in the corresponding papers.
Seven quality metrics are utilized for quantitative comparison between our fusion method and the existing methods: entropy (En); standard deviation (SD); mutual information (MI); FMI_dct and FMI_w, which calculate fast mutual information (FMI) for discrete cosine and region features, respectively; the modified structural similarity for no-reference images (SSIM_a); and visual information fidelity (VIF).
SSIM_a is calculated by Eq. 11,

SSIM_a(F) = ( SSIM(F, I_1) + SSIM(F, I_2) ) × 0.5   (11)

where SSIM(·) denotes the structural similarity measure, F is the fused image, and I_1, I_2 are the source images.
For all seven metrics, larger values indicate better fusion performance. Larger En and SD mean the fused image contains more information. Larger MI, FMI_dct and FMI_w indicate that the fusion method preserves more raw information and features from the source images. Larger SSIM_a and VIF indicate that the fusion algorithm preserves more structural information from the source images and generates more natural features.
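As an example of how one of these metrics is computed, the entropy En of an 8-bit grayscale image follows directly from its intensity histogram. This is a generic sketch of Shannon entropy, not the exact implementation used in the paper's evaluation code.

```python
import numpy as np

def entropy(img_u8):
    """Shannon entropy (En) of an 8-bit grayscale image, in bits."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()             # intensity probabilities
    p = p[p > 0]                      # ignore empty histogram bins
    return float(-(p * np.log2(p)).sum())

flat = np.zeros((8, 8), dtype=np.uint8)   # constant image: no information
print(entropy(flat))
```

A constant image has entropy 0, while an image using all 256 gray levels equally often reaches the maximum of 8 bits; this also illustrates why added noise inflates En.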
IV-B Ablation Study
IV-B1 Parameter (λ) in Loss Function
As discussed in Section III-B, the trade-off parameter λ is set to 1, 10, 100 and 1000. The number of epochs and the batch size are 2 and 4, respectively. Our network is implemented in PyTorch and trained on an NVIDIA GTX 1080Ti. The line chart of the loss values is shown in Fig. 9.
In Fig. 9, in the first 400 iterations the auto-encoder network converges faster as the parameter λ increases, with the fastest convergence for the larger values of λ. In addition, once the number of iterations exceeds 600, near-optimal network weights are obtained regardless of the choice of λ. In general, our fusion network converges faster in the early stage as λ increases.
We still need to choose one value of λ for the image fusion task based on the test images. The seven metrics are used to evaluate the performance of the networks trained with different λ, with the average, max and nuclear-norm operators utilized in the channel attention model. These values are shown in Table III. The best values are indicated in bold and the second-best values are denoted in red italic.
From Table III, although λ has no effect on the convergence rate once the number of iterations becomes large, it still influences the fusion performance of our framework. When λ is 100, our network achieves better fusion performance than with the other values. So, in our experiments, λ is set to 100.
[Table IV, excerpt:
| | En | SD | MI | FMI_dct | FMI_w | SSIM_a | VIF |
| w/o deep supervision | 6.91971 | 82.75242 | 13.83942 | 0.35801 | 0.43724 | 0.73199 | 0.78652 |]
IV-B2 The Influence of Multi-scale Deep Features
In this section, we analyze the influence of the different scales of deep features on fusion performance; the trade-off parameter λ is set to 100.
To generate outputs at multiple scales of deep features, we use the deeply supervised training strategy utilized in UNet++ to train our fusion network. The training framework of the deeply supervised NestFuse is shown in Fig. 8.
The outputs obtained by NestFuse at the different scales with deep supervision are denoted O_1, O_2 and O_3, and the loss function is defined as the sum, over these outputs, of the total loss L_total (discussed in Section III-B) between each output and the input image.
The seven quality metrics are again used to evaluate the fusion performance for the different scales of deep features. These values are shown in Table IV, with the best values indicated in bold. “w/o deep supervision” denotes the training phase without deep supervision, as introduced in Section III-B.
From Table IV, with deep supervision the metric values at the different scales are very close, and the advantage of the multi-scale deep features in NestFuse is not competitive. Specifically, compared with the deeper-scale features, the shallow-scale features obtain better evaluation on the information-related metrics, which indicates that shallow-scale features contain more detail information. When deeper-scale features are utilized in NestFuse, the fused images contain more structural features, delivering the best values on the structure-related metrics.
However, when we train NestFuse with the global optimization strategy, the fusion performance is boosted (it obtains all the best values), which means the multi-scale mechanism is effective in our fusion network. This indicates that while the deeply supervised strategy achieves better performance in image segmentation, it may not train a better model for the image fusion task.
Thus, our network is trained with the global optimization strategy, which fully utilizes the multi-scale features in NestFuse.
IV-C Results Analysis
The fused images obtained by the existing fusion methods and our fusion method (NestFuse) are shown in Fig. 10 to Fig. 12. We analyze the visual quality of the fused results on three pairs of infrared and visible images.
As shown in the red boxes of Fig. 10, Fig. 11 and Fig. 12, compared with the proposed method, CBF, DCHWT, JSR and JSRSD generate much more noise in the fused images and some detail information is not clear. For GTF, WLS, ConvSR, VggML and FusionGAN, although some of the saliency features are highlighted, some regions in the fused images are blurred, and the features in the red boxes are not satisfactory.
In contrast, DeepFuse, DenseFuse, IFCNN and the proposed method obtain better fusion performance in the subjective evaluation than the other fusion methods. In addition, the fused images obtained by the proposed method have more reasonable luminance information.
For the objective evaluation, we use the seven objective metrics to evaluate the fusion performance of the twelve existing fusion methods and the proposed method.
The average values of the seven metrics over all fused images obtained by the existing methods and the proposed fusion method are shown in Table V. The best values are indicated in bold and the second-best values are denoted in red italic.
From Table V, the proposed fusion framework achieves five best values and five second-best values across the seven metrics. This indicates that the proposed framework preserves more detail information and feature information in the fused images.
The metric En measures the amount of information in an image: a larger En means the fused image contains more information. However, noise generated during the fusion process also inflates En (Fig. 10(c)-12(c)), which is why the fused images obtained by CBF achieve a larger En. In contrast, the fused images obtained by our method have more reasonable luminance information and contain less noise, which explains why our method does not achieve the best value on En.
Compared with the average operator in the channel attention model-based fusion strategy, the max and nuclear-norm operators achieve almost all the best values on the objective metrics. In the channel attention model, these two operations are effective and capture more structural information from the deep features.
[Table VI, excerpt:
| Tracker | EAO | Accuracy | Failures |
| SiamRPN++ with NestFuse | 0.3493 | 0.6661 | 40.9503 |]
IV-D An Application to Visual Object Tracking
In VOT2019, two new sub-challenges (VOT-RGBT and VOT-RGBD) were introduced by the committee. The VOT-RGBT sub-challenge focuses on short-term tracking with two modalities (RGB and thermal infrared). Infrared and visible image fusion methods are ideally suited to improving the tracking performance in this task.
According to our previous research, if the tracker engages a greater proportion of deep features for data representation, its performance will improve when the fusion method focuses on feature-level fusion. This insight motivates applying our proposed fusion method to the RGBT tracking task.
Thus, in this experiment, we choose SiamRPN++ as the base tracker and apply the fusion strategy proposed in this paper to perform feature-level fusion. SiamRPN++ is a deep learning-based tracker that achieved state-of-the-art tracking performance in 2019.
For the objective evaluation, three metrics are selected to analyze the tracking performance: Expected Average Overlap (EAO), Accuracy and Failures. (1) EAO estimates the average overlap a tracker attains on a large collection of sequences with the same visual properties as the benchmark; (2) Accuracy denotes the average overlap between the predicted and ground-truth bounding boxes; (3) Failures evaluates the robustness of a tracker.
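The overlap underlying the Accuracy metric is the standard intersection-over-union of axis-aligned boxes; a minimal sketch follows. The `(x, y, w, h)` box convention and the helper names are assumptions for illustration, not the VOT toolkit's API.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) axis-aligned boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy(pred_boxes, gt_boxes):
    """Average overlap between predicted and ground-truth boxes."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)

print(iou((0, 0, 2, 2), (1, 1, 2, 2)))   # partially overlapping boxes
```

EAO builds on the same per-frame overlaps but averages them over sequences in a specific way defined by the VOT methodology, so it is not reproduced here.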
The evaluation measure values of SiamRPN++ with the proposed fusion method are shown in Table VI. Bold and red italic indicate the best and second-best values, respectively.
In the VOT challenge, EAO is the primary measure. As shown in Table VI, compared with the baseline trackers, the tracking performance (EAO) is improved by applying our fusion strategy to fuse the multi-scale deep features. This indicates that the proposed fusion method can improve performance not only in the image fusion task but also in the RGBT tracking task.
Furthermore, in future work we will apply the proposed fusion method to other computer vision tasks to evaluate its performance.
V Conclusion
In this paper, we have proposed a novel image fusion architecture based on a nest connection network and spatial/channel attention models. Firstly, with the pooling operators in the encoder network, multi-scale features are extracted, presenting richer features from the source images. Then, the proposed spatial/channel attention models are utilized to fuse these deep features at each scale. The fused features are fed into the nest connection-based decoder network to generate the fused image. With this novel network structure and the multi-scale deep feature fusion strategy, more salient features are preserved in the reconstruction process and the fusion performance is improved.
The experimental results and analyses show that the proposed fusion framework achieves state-of-the-art fusion performance. An additional experiment on the RGBT tracking task also shows that the proposed fusion strategy is effective in improving algorithm performance in other computer vision tasks.
-  (2006) K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing 54 (11), pp. 4311. Cited by: §I.
-  (2005) A multiscale approach to pixel-level image fusion. Integrated Computer-Aided Engineering 12 (2), pp. 135–146. Cited by: TABLE I, §I, §I.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §I.
-  (2014) Fast-FMI: non-reference image fusion metric. In 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–3. Cited by: §IV-A, TABLE III, TABLE IV, TABLE V.
-  (2013) A new image fusion performance metric based on visual information fidelity. Information fusion 14 (2), pp. 127–135. Cited by: §IV-A, TABLE III, TABLE IV, TABLE V.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §I.
-  (2019) The seventh visual object tracking VOT2019 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1–36. Cited by: §I, §IV-D, §IV-D, §IV-D.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
-  (2013) Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing 7 (6), pp. 1125–1143. Cited by: §IV-A, TABLE V.
-  (2015) Image fusion based on pixel significance using cross bilateral filter. Signal, image and video processing 9 (5), pp. 1193–1204. Cited by: §IV-A, TABLE V.
-  (2019) SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4282–4291.
-  (2019) RGB-T object tracking: benchmark and baseline. Pattern Recognition 96, pp. 106977.
-  (2019) Discriminative dictionary learning-based multiple component decomposition for detail-preserving noisy image fusion. IEEE Transactions on Instrumentation and Measurement. doi: 10.1109/TIM.2019.2912239.
-  (2019) Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Physics & Technology, pp. 103039.
-  (2018) Infrared and visible image fusion using a deep learning framework. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2705–2710.
-  (2020) MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing. doi: 10.1109/TIP.2020.2975984.
-  (2017) Multi-focus image fusion using dictionary learning and low-rank representation. In International Conference on Image and Graphics, pp. 675–686.
-  (2018) DenseFuse: A fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623.
-  (2017) Pixel-level image fusion: A survey of the state of the art. Information Fusion 33, pp. 100–112.
-  (2013) Image fusion with guided filtering. IEEE Transactions on Image Processing 22 (7), pp. 2864–2875.
-  (2020) Laplacian re-decomposition for multimodal medical image fusion. IEEE Transactions on Instrumentation and Measurement. doi: 10.1109/TIM.2020.2975405.
-  (2014) Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755.
-  (2017) Infrared and visible image fusion method based on saliency detection in sparse domain. Infrared Physics & Technology 83, pp. 94–102.
-  (2010) Robust subspace segmentation by low-rank representation. In ICML, Vol. 1, pp. 8.
-  (2017) Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207.
-  (2018) Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173.
-  (2016) Image fusion with convolutional sparse representation. IEEE Signal Processing Letters 23 (12), pp. 1882–1886.
-  (2014) The infrared and visible image fusion algorithm based on target separation and sparse representation. Infrared Physics & Technology 67, pp. 397–407.
-  (2016) Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion 31, pp. 100–109.
-  (2019) FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26.
-  (2017) Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Physics & Technology 82, pp. 8–17.
-  (2004) A wavelet-based image fusion tutorial. Pattern Recognition 37 (9), pp. 1855–1872.
-  (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence (8), pp. 1226–1238.
-  (2017) DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4714–4722.
-  (1997) In-fibre Bragg grating sensors. Measurement Science and Technology 8 (4), pp. 355.
-  (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
-  (2019) Multimodal medical image sensor fusion model using sparse K-SVD dictionary learning in nonsubsampled shearlet domain. IEEE Transactions on Instrumentation and Measurement. doi: 10.1109/TIM.2019.2902808.
-  (2014) TNO Image Fusion Dataset. https://figshare.com/articles/TN_Image_Fusion_Dataset/1008029.
-  (2018) Image fusion using adjustable non-subsampled shearlet transform. IEEE Transactions on Instrumentation and Measurement 68 (9), pp. 3367–3378.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
-  Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2), pp. 210–227.
-  (2019) An accelerated correlation filter tracker. Pattern Recognition (accepted). arXiv preprint arXiv:1912.02854.
-  (2019) Joint group feature selection and discriminative filter learning for robust visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7950–7960.
-  (2019) Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE Transactions on Image Processing.
-  (2018) Unsupervised deep multi-focus image fusion. arXiv preprint arXiv:1806.07272.
-  (2010) Image fusion based on a new contourlet packet. Information Fusion 11 (2), pp. 78–84.
-  (2017) A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation. Neurocomputing 226, pp. 182–191.
-  (2013) Dictionary learning method for joint sparse representation-based image fusion. Optical Engineering 52 (5), pp. 057006.
-  (2020) IFCNN: A general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118.
-  (2018) UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11.