RFN-Nest: An end-to-end residual fusion network for infrared and visible images

03/07/2021 · by Hui Li, et al.

In the image fusion field, the design of deep learning-based fusion methods is far from routine. It is invariably fusion-task specific and requires careful consideration. The most difficult part of the design is to choose an appropriate strategy to generate the fused image for the specific task at hand. Thus, devising a learnable fusion strategy is a very challenging problem in the image fusion community. To address this problem, a novel end-to-end fusion network architecture (RFN-Nest) is developed for infrared and visible image fusion. We propose a residual fusion network (RFN), based on a residual architecture, to replace the traditional fusion approach. A novel detail-preserving loss function and a feature-enhancing loss function are proposed to train the RFN. The fusion model is learned by a novel two-stage training strategy. In the first stage, we train an auto-encoder based on an innovative nest connection (Nest) concept. Next, the RFN is trained using the proposed loss functions. The experimental results on public domain datasets show that our end-to-end fusion network delivers better performance than the state-of-the-art methods in both subjective and objective evaluation. The code of our fusion method is available at https://github.com/hli1221/imagefusion-rfn-nest

1 Introduction

Due to the physical limitations of imaging sensors, it is very difficult to capture an image of a scene that is of uniformly good quality. Image fusion plays an important role in this context. Its aim is to reconstruct a perfect image of the scene from multiple samples that provide complementary information about the visual content. It has many applications, such as object tracking (Li et al., 2019a) (Li et al., 2018a) (Luo et al., 2019), self-driving and video surveillance (Shrinidhi et al., 2018). The fusion task requires algorithms to generate a single image which amalgamates the complementary information conveyed by different source images (Ma et al., 2019a)(Li et al., 2017)(Liu et al., 2018).

Image fusion involves three key processes: feature extraction, fusion strategy and reconstruction. Most of the existing fusion research focuses on one or more of these elements to improve the fusion performance. The existing fusion methods can be classified into two categories: traditional algorithms and deep learning-based methods. In the traditional algorithm category, multi-scale transform methods (Pajares and De La Cruz, 2004)(Ben Hamza et al., 2005)(Yang et al., 2010)(Li et al., 2013) are widely applied to extract multi-scale features from the source images. The feature channels are combined by an appropriate fusion strategy. Finally, the fused image is reconstructed by an inverse multi-scale transform. Obviously, the fusion performance of these algorithms is highly dependent on the feature extraction method used.

Following this direction, sparse representation (SR) (Wright et al., 2008) and low-rank representation (LRR) (Liu et al., 2010)(Liu et al., 2012) have been applied to extract salient features from the source images. In SR and LRR based fusion methods (Zhang et al., 2013)(Liu et al., 2017a)(Gao et al., 2017)(Li and Wu, 2017), the sliding window technique is used to decompose the source images into image patches. A matrix is constructed from these image patches, in which each column is a reshaped image patch. This matrix is fed into SR (or LRR) to calculate SR (or LRR) coefficients, which are considered as image features. By virtue of this operation, the image fusion problem is transformed into one of coefficient fusion. The fused coefficients are generated by an appropriate fusion strategy and used to reconstruct the fused image in the SR (or LRR) framework. Besides the above, other SR-based approaches and other signal processing methods (Lu et al., 2014)(Yin et al., 2017) have been suggested in the literature.
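To make the patch-to-matrix step concrete, the following minimal sketch (not taken from any of the cited methods; the window size and stride are arbitrary) shows how sliding-window patches can be vectorised into the column matrix that an SR or LRR solver consumes:

```python
import numpy as np

def patch_matrix(img: np.ndarray, win: int = 8, stride: int = 4) -> np.ndarray:
    """Stack sliding-window patches of `img` as columns of a matrix.

    Each column is a reshaped win x win patch, which is the form an SR or LRR
    coefficient solver expects as input.
    """
    h, w = img.shape
    cols = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            cols.append(img[y:y + win, x:x + win].reshape(-1))
    return np.stack(cols, axis=1)            # shape: (win*win, num_patches)

# toy usage: each source image yields one matrix whose coefficients are then fused
ir = np.random.rand(64, 64).astype(np.float32)
vis = np.random.rand(64, 64).astype(np.float32)
print(patch_matrix(ir).shape, patch_matrix(vis).shape)   # (64, 225) (64, 225)
```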

Although the traditional fusion methods have achieved good fusion performance, they have drawbacks: (1) The fusion performance highly depends on handcrafted features (Zhang et al., 2013)(Li and Wu, 2017)(Liu et al., 2017b), as it is difficult to find a universal feature extraction method for different fusion tasks; (2) Different fusion strategies may be required to work with different features; (3) For SR and LRR based methods, the dictionary learning is very time-consuming; (4) Complex source images pose a challenge for SR (or LRR) based fusion methods.

To overcome these drawbacks, deep learning based fusion methods have been developed, which can be grouped into three categories according to the three key elements of the fusion process: deep feature extraction, fusion strategy and end-to-end training. In the feature extraction direction, deep learning methods are utilized to extract deep representations of the information conveyed by the source images (Li et al., 2018b)(Li et al., 2019b)(Song and Wu, 2018)(Li and Wu, 2019)(Li et al., 2020b). Different fusion strategies have been suggested to reconstruct the fused image. In other fusion methods (Liu et al., 2016)(Liu et al., 2017c), deep learning is also used to design the fusion strategy. In (Liu et al., 2016)(Liu et al., 2017c), convolutional sparse representation and convolutional neural networks are utilized to generate a decision map for the source images. Using the learned decision map, the fused images are obtained by appropriate post-processing. Although these fusion methods achieve good fusion performance, the fusion strategy and the post-processing are tricky to design. To avoid the limitations of handcrafted solutions, some end-to-end fusion frameworks were presented (FusionGAN (Ma et al., 2019b), FusionGANv2 (Ma et al., 2020a), DDcGAN (Ma et al., 2020b)). These frameworks are based on adversarial learning, which avoids the shortcoming of handcrafted features and fusion strategies. However, even the state-of-the-art methods, FusionGANv2 (Ma et al., 2020a) and DDcGAN (Ma et al., 2020b), face challenges in preserving image detail adequately. To preserve more background detail information from visible images, a nest connection based auto-encoder fusion network (NestFuse (Li et al., 2020a)) was proposed. Although NestFuse obtains good performance in detail information preservation, its fusion strategy is still not learnable.

To address these problems, in this paper we propose a novel end-to-end fusion framework (RFN-Nest). Our network contains three parts: an encoder network, residual fusion networks (RFN), which are designed to produce the fused multi-scale deep features, and a decoder network based on nest connections (Zhou et al., 2018). Although the encoder and decoder architecture of the proposed network is similar to that of NestFuse (Li et al., 2020a), the fusion strategy, the training strategy and the loss function are totally different.

Firstly, instead of fusing handcrafted features as in NestFuse (Li et al., 2020a), several simple yet efficient learnable fusion networks (RFN) have been designed and inserted into the auto-encoder architecture. With the RFN, the auto-encoder based fusion network is upgraded to an end-to-end fusion network. Secondly, as the RFN is a learnable structure, it is important that the encoder and decoder exhibit powerful feature extraction and feature reconstruction abilities, respectively. Thus, we develop a two-stage training strategy to train our fusion network (the encoder, decoder and RFN networks). Thirdly, to train the proposed RFN networks, we design a new loss function (L_RFN) to preserve the detail information from the visible image and, simultaneously, maintain the salient features from the infrared image.

The main contributions of RFN-Nest can be summarized as follows,

(1) A novel residual fusion network (RFN) is proposed to supersede handcrafted fusion strategies. Although many methods (Li et al., 2018b)(Song and Wu, 2018)(Ram Prabhakar et al., 2017)(Li and Wu, 2019) now use deep features to achieve good performance, the heuristic approach to selecting a suitable fusion strategy is their weakness. The proposed RFN is a learnable fusion network that overcomes this weakness.

(2) A two-stage training strategy is developed to design our network. The feature extraction and feature reconstruction abilities are key for the encoder and decoder networks. Using a one-stage training strategy to simultaneously train the whole network (the encoder, decoder and RFN networks) is insufficient. Inspired by (Li and Wu, 2019), the encoder and the decoder network are first trained as an auto-encoder. Then, with the encoder and decoder fixed, the RFN networks are trained using an appropriate loss function.

(3) A loss function capable of preserving the image detail, together with a feature enhancing loss function are designed to train our RFN networks. We show that with these loss functions, more detail information and image salient features are preserved in the fused image.

(4) We show that, compared with the state-of-the-art fusion methods, the proposed RFN-Nest framework exhibits better fusion performance on public datasets in both subjective visual assessment and objective evaluation.

The rest of our paper is structured as follows. In Section 2, we briefly review the related work on deep learning-based fusion. The proposed fusion framework is described in detail in Section 3. The experimental results are presented in Section 4 and Section 5. Finally, we draw conclusions in Section 6.

2 Related Works

Recently, many deep learning methods have been developed for image fusion. Most of them are based on convolutional neural networks (CNN). These methods can be classified into the non end-to-end learning and end-to-end learning categories. In this section, we briefly overview the most representative deep learning based methods from these two categories.

2.1 Non End-to-end Methods

In the early days, deep learning neural networks were used to extract deep features as a bank of “decision” maps (Li et al., 2018b)(Li et al., 2019b)(Song and Wu, 2018). In (Li et al., 2018b), Li et al. proposed a fusion framework based on a pre-trained network (VGG-19 (Simonyan and Zisserman, 2014)). Firstly, the source images are decomposed into salient parts (texture and edges) and base parts (contour and luminance). Then, VGG-19 is used to extract multi-level deep features from the salient parts. At each level, the decision maps are computed from the deep features and a candidate fused salient part is generated. The fused image is reconstructed by combining the fused base parts and the fused salient parts using an appropriate fusion strategy. In (Li et al., 2019b), the pre-trained ResNet-50 (He et al., 2016) is utilized to extract deep features from the source images directly. A decision map is obtained by zero-phase component analysis (ZCA) and the ℓ1-norm. The PCANet-based fusion method (Song and Wu, 2018) also follows this framework to generate the fused image, in which PCANet, instead of VGG-19 or ResNet-50, is used to extract the features.

In addition to pure feature extraction, in (Liu et al., 2016)(Liu et al., 2017c), the two key processes (feature extraction and fusion strategy) are implemented by a single network. In (Liu et al., 2017c), a decision map is generated by a CNN trained on image patches of multiple blurred versions of the input image. In (Liu et al., 2016), the convolutional sparse representation instead of CNN is utilized to extract features and to generate a decision map. From the generated decision map, the fused image can easily be reconstructed.

Figure 1: The framework of RFN-Nest. "RFN_m" denotes the residual fusion network at scale m. The nest connection-based decoder network ("Decoder") is introduced later. "(k×k, 16, 8)" means that the kernel size is k×k, the input channel number is 16 and the output channel number is 8 for a convolutional layer.

Besides the above methods, a deep auto-encoder network based fusion framework was proposed in (Li and Wu, 2019). Inspired by DeepFuse (Ram Prabhakar et al., 2017), the authors proposed a novel network architecture which contains an encoder, a fusion layer and a decoder. The dense block (Huang et al., 2017) based encoder network was adopted as it extracts more complementary deep features from the source images. In their framework, the fusion strategy becomes very important.

Inspired by DenseFuse (Li and Wu, 2019) and the architecture in (Zhou et al., 2018), Li et al. proposed NestFuse (Li et al., 2020a) to preserve more background detail information from visible images, while enhancing the salient features of infrared images. Additionally, novel spatial/channel attention models are designed to fuse the multi-scale deep features. Although these frameworks achieve good fusion performance, it is very difficult to find an effective handcrafted fusion strategy for image fusion.

2.2 End-to-end Methods

To eliminate the arbitrariness of handcrafted features and fusion strategies, several end-to-end fusion frameworks have been suggested (Ma et al., 2019b) (Ma et al., 2020a) (Zhang et al., 2020b) (Zhang et al., 2020a) (Ma et al., 2020b) (Xu et al., 2020a).

In (Ma et al., 2019b), a GAN-based fusion framework (FusionGAN) was introduced to the infrared and visible image fusion field. The generator network is the engine which computes the fused image, while the discriminator network constrains the fused image to contain the detail information from the visible image. The loss function has two terms: content loss and discriminator loss. Due to the content loss, the fused image tends to become similar to the infrared image, failing to preserve the image detail, in spite of the discriminator network.

To preserve more detail information from the visible images, the authors of (Ma et al., 2020a) proposed a new version of FusionGAN which was named FusionGANv2. In this new version, the authors deepen the generator and discriminator networks, endowing them with more powerful feature representation ability. In addition, two new loss functions, namely detail loss and target edge-enhancement loss, were presented to preserve the detail information. With these improvements, the fused images reconstructed more scene details, with clearly highlighted edge-sharpened targets.

A general end-to-end image fusion network (IFCNN) (Zhang et al., 2020b) was also proposed, which is a simple yet effective fusion method. In IFCNN, two convolutional layers are utilized to extract deep features from the source images. Element-wise fusion rules (elementwise-maximum, elementwise-sum, elementwise-mean) are used to fuse the deep features. The fused image is generated from the fused deep features by two convolutional layers. Although IFCNN achieves a satisfactory fusion performance in multiple image fusion tasks, its architecture is too simplistic to extract powerful deep features, and the fusion strategies designed using a traditional way are not optimal.

3 The Proposed Fusion Framework

The proposed fusion network is introduced in this section. Firstly, the architecture of our network is presented in Section 3.1. The advocated two-stage training strategy is described in Section 3.2.

3.1 The Architecture of the Fusion Network

RFN-Nest is an end-to-end fusion network, the architecture of which is shown in Fig.1. RFN-Nest contains three parts: the encoder (left part), the residual fusion networks (RFN_m) and the decoder (right part). For a convolutional layer, "(k×k, c_in, c_out)" means that the kernel size is k×k, the number of input channels is c_in and the number of output channels is c_out.

With the max pooling operation in the encoder network, multi-scale deep features can be extracted from the source images. The RFN is utilized to fuse multi-modal deep features extracted at each scale. While the shallow layer features preserve more detail information, the deeper layer features convey semantic information, which is important for reconstructing the salient features. Finally, the fused image is reconstructed by the nest connection-based decoder network, which fully exploits the multi-scale structure of the features.

As shown in Fig.1, I_ir and I_vi indicate the source images (the infrared image and the visible image). O denotes the output of RFN-Nest, that is, the fused image. "RFN_m" means one residual fusion network for the deep features at scale m. Our framework contains four RFN networks (RFN_1 to RFN_4), one for each scale. These RFN networks share the same architecture but have different weights.
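The following PyTorch sketch illustrates this wiring; the channel widths, layer counts and the trivial stand-in fusion are illustrative assumptions rather than the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k=3):
    """k x k convolution + ReLU; "(k x k, c_in, c_out)" mirrors the notation of Fig.1."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2), nn.ReLU(inplace=True))

class Encoder(nn.Module):
    """Four conv stages separated by max pooling -> deep features at four scales."""
    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        chans = (1,) + tuple(widths)
        self.stages = nn.ModuleList(conv(chans[i], chans[i + 1]) for i in range(4))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x if i == 0 else self.pool(x))
            feats.append(x)
        return feats                                   # [Phi^1, ..., Phi^4]

encoder = Encoder()
I_ir, I_vi = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
phi_ir, phi_vi = encoder(I_ir), encoder(I_vi)
# each scale m has its own RFN_m; a trivial stand-in fusion is used here,
# the actual residual fusion block is sketched in Section 3.1.1 below
phi_f = [a + b for a, b in zip(phi_ir, phi_vi)]
print([f.shape for f in phi_f])    # multi-scale fused features fed to the nest decoder
```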

We now introduce the RFN and the decoder in detail.

3.1.1 Residual Fusion Network (RFN)

The RFN is based on the concept of residual block (He et al., 2016) which has been adapted to the task of image fusion. The RFN architecture is shown in Fig.2.

Figure 2: The architecture of RFN_m.

In Fig.2, Φ_ir^m and Φ_vi^m indicate the m-th scale deep features extracted by the encoder network, with m indicating the index of the corresponding RFN network. "Conv1"–"Conv6" denote the six convolutional layers in the RFN. In this residual architecture, the outputs of "Conv1" and "Conv2" are concatenated to form the input of "Conv3". "Conv6" is the first fusion layer, which generates the initial fused features. With this architecture, the RFN can easily be optimized by our training strategy. The convolutional operations produce the fused deep features Φ_f^m, which are fed into the decoder network.

Thanks to the multi-scale deep features and the proposed learning process, the image detail and the salient structures are preserved by the shallow and the deep RFN networks, respectively.
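A possible PyTorch rendering of this block is given below; the layer widths and the exact tensors consumed by each convolution reflect our reading of Fig.2 and should be treated as an illustrative sketch, not the reference implementation:

```python
import torch
import torch.nn as nn

class RFN(nn.Module):
    """Residual fusion block RFN_m for one scale (an illustrative reading of Fig.2):
    Conv1/Conv2 process the infrared and visible features, their concatenation is
    refined by Conv3-Conv5, Conv6 produces the initial fused features, and the
    refined features are added back residually."""
    def __init__(self, c):
        super().__init__()
        def conv(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU(inplace=True))
        self.conv1, self.conv2 = conv(c, c), conv(c, c)      # per-modality branches
        self.conv3, self.conv4, self.conv5 = conv(2 * c, c), conv(c, c), conv(c, c)
        self.conv6 = conv(2 * c, c)                          # initial fusion layer

    def forward(self, phi_ir, phi_vi):
        initial = self.conv6(torch.cat([phi_ir, phi_vi], dim=1))
        x = torch.cat([self.conv1(phi_ir), self.conv2(phi_vi)], dim=1)
        residual = self.conv5(self.conv4(self.conv3(x)))
        return initial + residual                            # fused features Phi_f^m

rfn = RFN(c=16)
print(rfn(torch.rand(1, 16, 64, 64), torch.rand(1, 16, 64, 64)).shape)
```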

3.1.2 Decoder Network

The decoder network, based on the nest connection architecture, is shown in Fig.3. Compared with UNet++ (Zhou et al., 2018), we simplify the network architecture for the image fusion task, making it light yet effective in reconstructing the fused image; the same architecture was also utilized in (Li et al., 2020a).

Figure 3: The architecture of the decoder.

Φ_f^m (m ∈ {1, 2, 3, 4}) denote the fused multi-scale features obtained by the RFN networks. "DCB" indicates a decoder convolutional block, which contains two convolutional layers. In each row, these blocks are connected by short connections, similar to the dense block architecture (Huang et al., 2017). The cross-layer links connect the multi-scale deep features in the decoder network.

The output of the network is the fused image reconstructed from the fused multi-scale features.
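The sketch below shows the DCB block and a simplified decoder that keeps only the backbone path of the nest topology (the additional short connections within each row are omitted for brevity); the channel widths are the same illustrative values used in the encoder sketch above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCB(nn.Module):
    """Decoder convolutional block ("DCB"): two 3x3 convolutional layers."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class LightDecoder(nn.Module):
    """Simplified decoder: deeper fused features are upsampled and concatenated with
    shallower ones before each DCB; the full nest adds further short connections."""
    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        self.dcb3 = DCB(widths[3] + widths[2], widths[2])
        self.dcb2 = DCB(widths[2] + widths[1], widths[1])
        self.dcb1 = DCB(widths[1] + widths[0], widths[0])
        self.out = nn.Conv2d(widths[0], 1, 1)          # reconstruct the fused image

    def forward(self, feats):                          # feats = [Phi_f^1, ..., Phi_f^4]
        f1, f2, f3, f4 = feats
        up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:], mode="nearest")
        x = self.dcb3(torch.cat([f3, up(f4, f3)], dim=1))
        x = self.dcb2(torch.cat([f2, up(x, f2)], dim=1))
        x = self.dcb1(torch.cat([f1, up(x, f1)], dim=1))
        return torch.sigmoid(self.out(x))              # fused image O in [0, 1]

feats = [torch.rand(1, c, s, s) for c, s in zip((16, 32, 64, 128), (256, 128, 64, 32))]
print(LightDecoder()(feats).shape)                     # torch.Size([1, 1, 256, 256])
```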

3.2 Two-stage Training Strategy

Note that the ability of the encoder in our network to perform feature extraction and that of the decoder to conduct feature reconstruction are absolutely crucial for successful operation. Accordingly, we develop a two-stage training strategy to make sure that each part in our network can achieve the expected performance.

Firstly, the encoder and the decoder are trained as an auto-encoder network to reconstruct the input image. After learning the encoder and decoder networks, in the second training stage, several RFN networks are trained to fuse the multi-scale deep features.

In this section, a novel two-stage training strategy is introduced in detail.

3.2.1 Training of the Auto-encoder Network

Inspired by DenseFuse (Li and Wu, 2019), in the first stage, the encoder network is trained to extract multi-scale deep features. The decoder network is trained to reconstruct the input image with multi-scale deep features. The auto-encoder network training framework is shown in Fig.4.

Figure 4: The training of the auto-encoder network.

In Fig.4, Input and Output denote the input image and the output image (the reconstruction of the same image), respectively. In contrast to (Ram Prabhakar et al., 2017)(Li and Wu, 2019)(Ma et al., 2019b)(Ma et al., 2020a), our feature extraction part includes a down-sampling operation (max pooling), which extracts deep features at four scales. These multi-scale deep features are fed into the decoder network to reconstruct the input image. With short cross-layer connections, the multi-scale deep features are fully exploited to reconstruct the input image.

The auto-encoder network is trained using the loss function L_aut, which is defined as follows,

L_aut = L_pixel + λ L_ssim    (1)

where L_pixel and L_ssim denote the pixel loss and the structure similarity (SSIM) loss between the input image (Input) and the output image (Output). λ is the trade-off parameter between L_pixel and L_ssim.

The pixel loss (L_pixel) is calculated by Eq.2,

L_pixel = ||Output − Input||²_F    (2)

where ||·||_F is the Frobenius norm. L_pixel constrains the reconstructed image to be like the input image at the pixel level.

The SSIM loss (L_ssim) is defined as,

L_ssim = 1 − SSIM(Output, Input)    (3)

where SSIM(·) is the structural similarity measure (Wang et al., 2004), which quantifies the structural similarity of two images (its definition is given in Section 1 of our supplementary material). The structural similarity between Input and Output is constrained by L_ssim.
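A minimal sketch of this stage-1 objective is given below; it uses the third-party pytorch_msssim package for SSIM (any SSIM implementation would do), and the toy inputs are our own choice:

```python
import torch
from pytorch_msssim import ssim   # third-party SSIM implementation (assumed available)

def autoencoder_loss(output, inp, lam=100.0):
    """Stage-1 loss L_aut = L_pixel + lambda * L_ssim (Eqs. 1-3).

    L_pixel is the squared Frobenius norm of the reconstruction error and
    L_ssim = 1 - SSIM(Output, Input); lam balances their magnitudes.
    """
    l_pixel = ((output - inp) ** 2).sum()              # ||Output - Input||_F^2
    l_ssim = 1.0 - ssim(output, inp, data_range=1.0)
    return l_pixel + lam * l_ssim

# toy usage with a single-channel batch in [0, 1]
x = torch.rand(4, 1, 256, 256)
recon = torch.clamp(x + 0.05 * torch.randn_like(x), 0.0, 1.0)
print(autoencoder_loss(recon, x).item())
```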

3.2.2 Training of the RFN

The RFN is proposed to implement a fully learnable fusion strategy. In the second stage, with the encoder and decoder fixed, the RFN is trained with an appropriate loss function. The training process is shown in Fig.5.

Figure 5: The training of the RFN networks. In RFN_m, the inputs Φ_ir^m and Φ_vi^m denote the infrared and visible deep features, respectively. Φ_f^m represents the fused deep features obtained by RFN_m. In our framework, m ∈ {1, 2, 3, 4} indicates the scale of the deep features.

The fixed encoder network is utilized to extract the multi-scale deep features (Φ_ir^m and Φ_vi^m) from the source images. For each scale, an RFN is used to fuse these deep features. Then, the fused multi-scale features (Φ_f^m) are fed into the fixed decoder network.

To train our RFN, we propose a novel loss function L_RFN, which is defined as,

L_RFN = α L_detail + L_feature    (4)

where L_detail and L_feature indicate the background detail preservation loss function and the target feature enhancement loss function, respectively. α is a trade-off parameter.

In the case of infrared and visible image fusion, most of the background detail information comes from the visible image. L_detail aims to preserve the detail information and the structural features of the visible image; it is defined as

L_detail = 1 − SSIM(O, I_vi)    (5)

As the infrared image contains more salient target features than the visible image, the loss function L_feature is designed to constrain the fused deep features to preserve these salient structures. L_feature is defined as,

L_feature = Σ_{m=1}^{M} w1(m) ||Φ_f^m − (w_vi Φ_vi^m + w_ir Φ_ir^m)||²_F    (6)

In Eq.6, M is the number of multi-scale deep feature levels, which is set to 4. Owing to the magnitude difference between the scales, w1 is a trade-off parameter vector for balancing the loss magnitudes; it assumes four values, w1 = {1, 10, 100, 1000}. w_vi and w_ir control the relative influence of the visible and infrared features in the fused feature map Φ_f^m.

As the visible information is already constrained by L_detail and the aim of L_feature is to preserve the salient features of the infrared image, in Eq.6, w_ir is usually greater than w_vi.
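The stage-2 objective can be sketched as follows; the default weights mirror the settings discussed in Sections 4.3 and 4.4, and the SSIM implementation is again the third-party pytorch_msssim package:

```python
import torch
from pytorch_msssim import ssim   # third-party SSIM implementation (assumed available)

def rfn_loss(fused_img, vis_img, phi_f, phi_vi, phi_ir,
             alpha=700.0, w_vi=3.0, w_ir=6.0, w1=(1.0, 10.0, 100.0, 1000.0)):
    """Stage-2 loss L_RFN = alpha * L_detail + L_feature (Eqs. 4-6).

    L_detail = 1 - SSIM(O, I_vi) keeps visible detail in the fused image O, while
    L_feature pulls each fused feature map Phi_f^m towards a weighted mix of the
    visible and infrared features; w1 balances the magnitudes of the M = 4 scales.
    """
    l_detail = 1.0 - ssim(fused_img, vis_img, data_range=1.0)
    l_feature = sum(
        w1[m] * ((f - (w_vi * v + w_ir * r)) ** 2).sum()     # squared Frobenius norm
        for m, (f, v, r) in enumerate(zip(phi_f, phi_vi, phi_ir)))
    return alpha * l_detail + l_feature
```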

Figure 6: Five pairs of source images. The top row contains visible images, and the second row contains infrared images.
Figure 7: The fusion results obtained with different values of w_vi and w_ir when α = 0 (L_detail disabled).

4 Experimental Validation

In this section, we conduct an experimental validation of the proposed fusion method. After detailing the experimental settings in the training phase and the test phase, we present several ablation studies to investigate the effect of different elements of the proposed fusion network. Finally, we compare our fusion framework with other existing algorithms qualitatively. For this purpose, we use several performance metrics to evaluate the fusion performance objectively.

Our network is implemented on the NVIDIA TITAN Xp GPU using PyTorch as a programming environment.

Figure 8: The fusion results obtained without (α = 0) or with (α = 700) the detail preservation loss L_detail.

4.1 Experimental Settings in the Training Phase

In this section, we introduce the training datasets used in our two-stage training strategy.

In the first stage, we use the MS-COCO dataset (Lin et al., 2014) to train our auto-encoder network. 80000 images are chosen to constitute the training set. These images are converted to gray scale and reshaped to 256 × 256. In Eq.1, the parameter λ is set to 100 to balance the magnitude difference between L_pixel and L_ssim. The batch size and the number of epochs are set to 4 and 2, respectively. The learning rate is set to 1 × 10⁻⁴.

For the second training stage, we choose the KAIST dataset (Hwang et al., 2015) to train our RFN networks. It contains almost 90000 pairs of images, of which 80000 pairs of infrared and visible images are chosen for training. These images are also converted to gray scale and resized to 256 × 256. The batch size and the number of epochs are again set to 4 and 2, respectively. The learning rate is also set to 1 × 10⁻⁴, as in the first stage.
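The two-stage optimisation schedule implied by these settings can be sketched as follows; the stand-in modules and the choice of Adam are assumptions, since the text above reports only the batch size, number of epochs and learning rate:

```python
import torch
import torch.nn as nn

# Illustrative two-stage optimisation set-up; the modules below are small stand-ins
# for the real encoder, decoder and RFN blocks described in Section 3.
encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1))
rfn_blocks = nn.ModuleList(nn.Conv2d(16, 16, 3, padding=1) for _ in range(4))

# Stage 1: train encoder + decoder as an auto-encoder (MS-COCO, 256 x 256 gray images,
# batch size 4, 2 epochs, learning rate 1e-4, lambda = 100 in Eq. 1).
stage1_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

# Stage 2: freeze the auto-encoder and train only the RFN blocks (KAIST pairs,
# same batch size, epochs and learning rate, loss L_RFN of Eq. 4).
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)
stage2_opt = torch.optim.Adam(rfn_blocks.parameters(), lr=1e-4)
```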

4.2 Experimental Settings in the Test Phase

Our test images come from two datasets which were collected from TNO (Toet, 2014) and VOT2020-RGBT (Kristan et al., 2020). These images are available at (Li, 2020). Some samples of these images are shown in Fig.6. The first dataset contains 21 pairs of infrared and visible images collected from TNO. The second dataset contains 40 pairs of infrared and visible images, which were collected from TNO and VOT2020-RGBT.

We use six quality metrics to evaluate our fusion algorithm objectively (their definitions are given in our supplementary material). These include: entropy (En) (Roberts et al., 2008); standard deviation (SD) (Rao, 1997); mutual information (MI) (Qu et al., 2002); the modified fusion artifacts measure (Nabf) (Kumar, 2013), which evaluates the noise information in the fused images; the sum of the correlations of differences (SCD) (Aslantas and Bendes, 2015); and the multi-scale structural similarity (MS-SSIM) (Ma et al., 2015). A larger value of each of these six metrics indicates better fusion performance.
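As a concrete illustration, minimal reference implementations of the two simplest metrics (En and SD) are given below; the remaining metrics are defined in the cited papers and in our supplementary material:

```python
import numpy as np

def entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (En) of an 8-bit grayscale fused image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img: np.ndarray) -> float:
    """Standard deviation (SD), reflecting the contrast of the fused image."""
    return float(img.std())

fused = (np.random.rand(256, 256) * 255).astype(np.uint8)   # toy "fused image"
print(entropy(fused), standard_deviation(fused))
```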

4.3 Ablation Study for L_detail and L_feature

In this section, we discuss the effect of L_detail and tune the parameters in L_feature. Then, we investigate the impact of the relative weights of the visible and infrared features on the fusion performance.

Once the auto-encoder network is trained in the first stage, the parameters of the encoder and decoder are fixed and we use L_RFN to train the four RFN networks. As discussed in Section 3.2.2, due to the magnitude difference between L_detail and L_feature, the value of the parameter α should be large. Furthermore, the role of L_detail is to preserve the detail information from the visible image. Based on the above considerations, in this experiment, α is set to 0 and to 700 to analyze its influence on our network.

In L_feature, w1 is a trade-off vector which balances the contributions of the different scales. To preserve the salient features of the infrared image, w_vi and w_ir should be set appropriately. In view of the role of L_detail, w_vi should be relatively small, to reduce any redundancy in reconstructing the image detail information. In contrast, w_ir should be large, to preserve the complementary salient features of the infrared image. However, if w_vi is set to 0, which constrains the fused features to mirror the infrared features, the network fails to converge due to the conflicting constraints of L_detail and L_feature. So, in our experiment, w_vi is set to a non-zero value.

As different combinations of w_vi and w_ir can lead to different fusion results, we analyze the influence of these two parameters, varying their values in the range [0.5, 6.0].

Firstly, when α = 0, which means that only L_feature is utilized to train the RFN networks, some of the fusion results obtained with different w_vi and w_ir are shown in Fig.7.

α | w_ir | w_vi | En (Roberts et al., 2008) | SD (Rao, 1997) | MI (Qu et al., 2002) | Nabf (Kumar, 2013) | SCD (Aslantas and Bendes, 2015) | MS-SSIM (Ma et al., 2015)
0 | 0.5 | 0.5 | 6.71845 | 67.66313 | 13.43690 | 0.09354 | 1.83520 | 0.92903
0 | 2.0 | 2.0 | 6.71557 | 67.63524 | 13.43114 | 0.09252 | 1.83495 | 0.92887
0 | 2.0 | 3.0 | 6.80410 | 74.73724 | 13.60821 | 0.09240 | 1.82712 | 0.92294
0 | 2.0 | 4.0 | 6.83492 | 79.75125 | 13.66983 | 0.09419 | 1.78649 | 0.90543
0 | 3.0 | 3.0 | 6.72263 | 67.83451 | 13.44526 | 0.09339 | 1.83713 | 0.92988
0 | 3.0 | 4.0 | 6.78738 | 72.45840 | 13.57476 | 0.09230 | 1.83518 | 0.92715
0 | 3.0 | 5.0 | 6.81292 | 76.36078 | 13.62583 | 0.09324 | 1.81501 | 0.91696
0 | 4.0 | 4.0 | 6.72150 | 67.53190 | 13.44299 | 0.09355 | 1.83367 | 0.92842
0 | 4.0 | 5.0 | 6.77188 | 70.98434 | 13.54376 | 0.09208 | 1.83538 | 0.92775
0 | 4.0 | 6.0 | 6.80239 | 74.64694 | 13.60478 | 0.09218 | 1.82594 | 0.92263
0 | 5.0 | 5.0 | 6.71684 | 67.48675 | 13.43368 | 0.09218 | 1.83366 | 0.92847
0 | 5.0 | 6.0 | 6.76875 | 70.35820 | 13.53750 | 0.08944 | 1.83707 | 0.92870
0 | 6.0 | 6.0 | 6.72585 | 67.82480 | 13.45170 | 0.09209 | 1.83665 | 0.92949
700 | 5.0 | 0.5 | 6.95916 | 91.41847 | 13.9183 | 0.14375 | 1.58717 | 0.84109
700 | 6.0 | 0.5 | 6.79112 | 68.28532 | 13.58224 | 0.07838 | 1.78391 | 0.88602
700 | 6.0 | 3.0 | 6.84134 | 71.90131 | 13.68269 | 0.07288 | 1.83676 | 0.91456
Table 1: The average values of the objective metrics obtained with different parameters (α, w_ir, w_vi) on 21 pairs of infrared and visible images.

In Fig.7, when w_ir is small, the fused images are similar to the visible image and the salient features of the infrared images are suppressed (as shown in the first two rows). On the contrary, when w_ir is large (greater than 3.0), the salient features of the infrared image are retained, but the detail information of the visible image is not preserved.

To capture both types of information, we choose middle-range values of the weights (yellow and pink boxes in Fig.7) to perform the objective evaluation. The evaluation metrics obtained for the different w_ir and w_vi values are presented in Table 1. The best values are indicated in bold.

When α = 700, the detail information is preserved by L_detail, and the aim of L_feature is to promote the salient features conveyed by the source images. Accordingly, the values of w_vi must be smaller than those of w_ir. We choose different combinations of w_vi and w_ir (the red boxes in Fig.7) to find the best values. The fusion results obtained with (α = 700) or without (α = 0) L_detail, for the same combinations of w_vi and w_ir, are shown in Fig.8.

In Fig.8 (right part), the fusion results in the red boxes contain more detail information from the source images, yet the infrared features are still maintained. Compared with the left part, the fusion results on the right (red boxes) evidently preserve more detail information. The objective values obtained with α = 700 for the different parameter combinations (Fig.8, right part, red boxes) are also presented in Table 1.

From Fig.7 and Table 1, it is clear that the values of α, w_vi and w_ir have a significant influence on the results. If the detail preservation loss function (L_detail) is not used in the training phase (α = 0), the proposed fusion network fails to obtain acceptable fusion results. Although the fusion performance appears comparable in subjective evaluation (Fig.7, yellow and pink boxes), the subjective and objective assessments indicate a notable degradation compared with the optimal parameter combination (α = 700, w_vi = 3.0 and w_ir = 6.0).

When the detail preservation loss function (L_detail) is switched on (α = 700), our RFN-Nest fusion network scores comparable values of the six metrics with w_vi = 3.0 and w_ir = 6.0. Based on this analysis, we set w_vi = 3.0 and w_ir = 6.0 in our next experiments.

In the next section, we analyze the impact of the parameter α in our loss function.

4.4 Ablation Study for α in L_RFN

As discussed in Section 4.3, when the detail preserving loss function is discarded (α = 0), both the subjectively and the objectively measured fusion performance are poor. It is evident from Fig.8 and Table 1 that our fusion network achieves better fusion performance when α is not 0. Thus, choosing an optimal value of α becomes an important issue.

In this study, the parameters w_vi and w_ir are set to 3.0 and 6.0, respectively. w1 is set to {1, 10, 100, 1000} to balance the discrepancy in the orders of magnitude of the different scales. To find the optimal α, we vary its value and compute the results.

Some examples of the fusion results are shown in Fig.9. With the increase of α (to 1000), the salient features (the man in the yellow box) become unclear, even suppressed, which makes the fused image similar to the visible image. When α is set to 500 or 700, the fusion results contain more detail information and the salient features are also maintained.

Based on these observations, we objectively evaluate our fusion method with α set to 10, 100, 200, 500, 700 and 1000. The metric values of the fusion results with different α are shown in Table 2. The best values are indicated in bold.

Figure 9: The fusion results obtained with different values of α.
α | En (Roberts et al., 2008) | SD (Rao, 1997) | MI (Qu et al., 2002) | Nabf (Kumar, 2013) | SCD (Aslantas and Bendes, 2015) | MS-SSIM (Ma et al., 2015)
10 | 6.66878 | 62.83593 | 13.33757 | 0.08012 | 1.77192 | 0.87593
100 | 6.70939 | 63.77416 | 13.41878 | 0.07129 | 1.79329 | 0.90096
200 | 6.75446 | 66.01632 | 13.50893 | 0.06680 | 1.80880 | 0.91177
500 | 6.82103 | 70.34117 | 13.64206 | 0.06768 | 1.83252 | 0.91453
700 | 6.84134 | 71.90131 | 13.68269 | 0.07288 | 1.83676 | 0.91456
Table 2: The average metric values of the proposed fusion network with different α on 21 pairs of infrared and visible images.

As shown in Fig.9 and Table 2, when α is 700, the proposed fusion network achieves better fusion performance in both subjective and objective evaluation. In Table 2, the proposed network obtains the best scores in four of the six metrics with α = 700. Thus, in our next experiments, the parameter α is set to 700.

4.5 Ablation Study for Training Strategy

The proposed two-stage training strategy is a critical operation in our training phase. In this section, we discuss why this strategy is effective, and show its relative merits compared to the one-stage strategy.

The one-stage training strategy means that the encoder, the RFN and the decoder are trained simultaneously. The training framework is shown in Fig.10, where both the encoder and the decoder are free to adapt their weights. The loss function and the parameter settings are the same as in the second stage of our two-stage strategy, that is, L_RFN with α = 700, w_vi = 3.0, w_ir = 6.0 and w1 = {1, 10, 100, 1000}. The fusion results obtained by the two training strategies are shown in Fig.11.

Figure 10: The training framework for the one-stage training strategy.
Figure 11: The fusion results obtained by one-stage and two-stage training strategy. (a) Visible images; (b) Infrared images; (c) Fused images obtained by one-stage strategy; (d) Fused images obtained by two-stage strategy.

In Fig.11 (c), the visible spectrum detail information is enhanced by the one-stage training strategy. However, the salient objects of the infrared image are lost, and the premise of image fusion is not realised. In contrast, the two-stage training strategy (Fig.11 (d)) enables the fused image to preserve the salient infrared objects and to contain more detail information from the visible images.

The reason is that the encoder and the decoder may not have the desirable feature extraction and reconstruction ability when designed using the one-stage training strategy. More importantly, as the RFN is the key in our fusion network, it should be trained carefully to obtain good fusion performance.

In conclusion, we use the two-stage training strategy to train our fusion network. In the first training stage, the encoder is trained to extract powerful multi-scale deep features, to be used by the decoder for image reconstruction. In the second stage, with the fixed encoder and decoder, the RFN networks are trained to fuse the multi-scale deep features, to enhance the detail information from the visible spectrum image and to preserve salient features from the infrared source image.

4.6 Ablation Study for Nest Connection in Decoder

In this section, we discuss the influence of the nest connections in the decoder. Fig.12 shows the decoder network structure without nest connections (the short connections between the "DCB" blocks are removed). We train this new decoder architecture with the same training strategy and the same loss functions as discussed in Section 3.2.

Figure 12: The decoder network without short connection (“No-nest”).
Fusion setting | En (Roberts et al., 2008) | SD (Rao, 1997) | MI (Qu et al., 2002) | Nabf (Kumar, 2013) | SCD (Aslantas and Bendes, 2015) | MS-SSIM (Ma et al., 2015)
No-nest | 6.75935 | 66.48558 | 13.51871 | 0.05278 | 1.80356 | 0.90172
Encoder & Decoder, add | 6.68274 | 67.45593 | 13.36548 | 0.09209 | 1.83367 | 0.92831
Encoder & Decoder, max | 6.71760 | 92.49952 | 13.43519 | 0.21454 | 1.58628 | 0.77823
Encoder & Decoder, ℓ1-norm | 6.83073 | 93.21573 | 13.66146 | 0.20760 | 1.56378 | 0.76769
Encoder & Decoder, nuclear-norm | 6.81192 | 73.66134 | 13.62385 | 0.09010 | 1.80934 | 0.92628
Encoder & Decoder, SCA | 6.91971 | 82.75242 | 13.83942 | 0.13405 | 1.73353 | 0.86248
RFN-Nest | 6.84134 | 71.90131 | 13.68269 | 0.07288 | 1.83676 | 0.91456
Table 3: The quality metric values of the two ablation studies. "No-nest" indicates the decoder without the nest-connection architecture. "Encoder & Decoder" denotes that the encoder and the decoder are fixed and the fusion strategy is "add", "max", "ℓ1-norm", "nuclear-norm" or "SCA". "RFN-Nest" denotes the proposed fusion network.

The values of the six quality metrics are shown in Table 3. "No-nest" denotes the decoder without the nest-connection architecture. The best, second-best and third-best values are indicated in bold, red italic and blue italic, respectively.

Compared with "No-nest", RFN-Nest (the decoder with nest connections) obtains one best value, three second-best values and one third-best value among the metrics. This indicates that the nest connection architecture plays an important role in boosting the reconstruction ability of the decoder network. With the nest connections, the decoder is able to preserve more of the image information conveyed by the multi-scale deep features and to generate more natural and clearer fused images.

4.7 Ablation Study for Fusion Strategy

In this section, we analyze the importance of the RFN as an adaptive fusion mechanism in our fusion network. We choose five classical handcrafted fusion strategies ("add", "max", "ℓ1-norm", "nuclear-norm" and "SCA"), which are used in existing fusion networks (Li and Wu, 2019)(Zhang et al., 2020b)(Li et al., 2020a), for the experiments.

The trained encoder and decoder are utilized to extract the multi-scale deep features and generate the final image from the fused features, respectively.

Let Φ_ir^m and Φ_vi^m denote the multi-scale deep features extracted by the trained encoder from the infrared and the visible image, respectively. Φ_f^m are the fused deep features, and m ∈ {1, 2, 3, 4} indicates the scale of the deep features. The formulas of the five strategies are shown in Table 4.

Fusion strategy | Formula
add (Li and Wu, 2019) | Φ_f^m = Φ_ir^m + Φ_vi^m
max (Zhang et al., 2020b) | Φ_f^m(x, y) = max(Φ_ir^m(x, y), Φ_vi^m(x, y))
ℓ1-norm (Li and Wu, 2019) | Φ_f^m = w_ir^m Φ_ir^m + w_vi^m Φ_vi^m, weights based on the ℓ1-norm of the features
nuclear-norm (Li et al., 2020a) | Φ_f^m = w_ir^m Φ_ir^m + w_vi^m Φ_vi^m, weights based on the nuclear norm of the features
SCA (Li et al., 2020a) | Spatial and channel attention fusion strategy
Table 4: The formulas of the different fusion strategies.

"add" means that the fused features are obtained by adding the source features directly. In the "max" strategy, the fused features are obtained by an element-wise choose-max operation (Zhang et al., 2020b). For the "ℓ1-norm" strategy, the weights are calculated based on the ℓ1-norm of the deep features; for details on how these weights are calculated, please refer to (Li and Wu, 2019). For the "nuclear-norm" strategy, the weight is obtained from the sum of the singular values of the matrix produced by a global pooling operation on the deep features.
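The sketch below gives simplified implementations of four of these strategies (the attention-based SCA strategy is omitted); the ℓ1-norm and nuclear-norm weighting schemes are condensed readings of the cited papers, not their exact implementations:

```python
import torch

def fuse_add(phi_ir, phi_vi):
    """The "add" strategy: element-wise sum of the source features."""
    return phi_ir + phi_vi

def fuse_max(phi_ir, phi_vi):
    """The "max" strategy: element-wise choose-max of the source features."""
    return torch.maximum(phi_ir, phi_vi)

def fuse_l1(phi_ir, phi_vi):
    """The "l1-norm" strategy: soft weights from channel-wise l1 activity maps."""
    a_ir = phi_ir.abs().sum(dim=1, keepdim=True)
    a_vi = phi_vi.abs().sum(dim=1, keepdim=True)
    w_ir = a_ir / (a_ir + a_vi + 1e-8)
    return w_ir * phi_ir + (1.0 - w_ir) * phi_vi

def fuse_nuclear(phi_ir, phi_vi):
    """The "nuclear-norm" strategy: global weights from the sum of singular values
    of each globally pooled feature matrix."""
    def nuc(phi):                                      # (B, C, H, W) -> one value per sample
        mats = phi.flatten(2)                          # B x C x (H*W) matrices
        return torch.linalg.svdvals(mats).sum(dim=-1)  # sum of singular values
    n_ir, n_vi = nuc(phi_ir), nuc(phi_vi)
    w_ir = (n_ir / (n_ir + n_vi + 1e-8)).view(-1, 1, 1, 1)
    return w_ir * phi_ir + (1.0 - w_ir) * phi_vi

a, b = torch.rand(1, 16, 64, 64), torch.rand(1, 16, 64, 64)
print([f(a, b).shape for f in (fuse_add, fuse_max, fuse_l1, fuse_nuclear)])
```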

Figure 13: The fusion results with different fusion strategies. “RFN-Nest” means the adaptive RFN networks are utilized in the fusion operation.

"SCA" indicates the spatial/channel attention fusion strategy utilized in NestFuse (Li et al., 2020a). In this experiment, as in NestFuse, the ℓ1-norm is used for the spatial attention fusion and average pooling is utilized to calculate the channel attention.

Some examples of the images fused using the different fusion strategies are shown in Fig.13. Compared with the handcrafted fusion strategies, the fused image obtained by the RFN-based network preserves more detail information from the visible image (blue boxes) and contains fewer artefacts (red boxes).

The results of fusing the 21 pairs of infrared and visible images have been evaluated in terms of the six quality metrics. The metric values are shown in Table 3, which also reports the results obtained with the other fusion strategies. The RFN-based network (RFN-Nest) achieves the best SCD value and the second-best En and MI values among the compared strategies. This indicates that, when the learnable fusion network is used as the fusion strategy, the image detail information is boosted thanks to the proposed loss function, while the salient feature information is preserved at a level comparable to the best handcrafted strategies.

In Section 5, we adopt this learnable fusion network (RFN) for the object tracking task, to illustrate the effectiveness of the RFN-based fusion strategy in another vision task.

4.8 Fusion Results Analysis on 21 Pairs of Images

To compare the fusion performance of the proposed method with the state-of-the-art algorithms, eleven representative fusion methods are chosen, including discrete cosine harmonic wavelet transform(DCHWT) (Kumar, 2013), gradient transfer and total variation minimization(GTF) (Ma et al., 2016), convolutional sparse representation(ConvSR) (Liu et al., 2016), multi-layer deep features fusion method(VggML) (Li et al., 2018b), DenseFuse (Li and Wu, 2019), FusionGAN (Ma et al., 2019b), IFCNN (Zhang et al., 2020b) (elementwise-maximum), NestFuse (Li et al., 2020a), PMGI (Zhang et al., 2020a), DDcGAN (Ma et al., 2020b) and U2Fusion (Xu et al., 2020a).

Figure 14: The experimental results on “man” images. (a) Visible; (b) Infrared; (c) DCHWT; (d) GTF; (e) ConvSR; (f) VggML; (g) DenseFuse; (h) FusionGAN; (i) IFCNN; (j) NestFuse; (k) PMGI; (l) DDcGAN; (m) U2Fusion; (n) proposed.
Figure 15: The experimental results on “umbrella” images. (a) Visible; (b) Infrared; (c) DCHWT; (d) GTF; (e) ConvSR; (f) VggML; (g) DenseFuse; (h) FusionGAN; (i) IFCNN; (j) NestFuse; (k) PMGI; (l) DDcGAN; (m) U2Fusion; (n) proposed.
Method | En (Roberts et al., 2008) | SD (Rao, 1997) | MI (Qu et al., 2002) | Nabf (Kumar, 2013) | SCD (Aslantas and Bendes, 2015) | MS-SSIM (Ma et al., 2015)
DCHWT (Kumar, 2013) | 6.56777 | 64.97891 | 13.13553 | 0.12295 | 1.60993 | 0.84326
GTF (Ma et al., 2016) | 6.63433 | 67.54361 | 13.26865 | 0.07951 | 1.00488 | 0.80844
ConvSR (Liu et al., 2016) | 6.25869 | 50.74372 | 12.51737 | 0.01958 | 1.64823 | 0.90281
VggML (Li et al., 2018b) | 6.18260 | 48.15779 | 12.36521 | 0.00120 | 1.63522 | 0.87478
DenseFuse (Li and Wu, 2019) | 6.67158 | 67.57282 | 13.34317 | 0.09214 | 1.83502 | 0.92896
FusionGAN (Ma et al., 2019b) | 6.36285 | 54.35752 | 12.72570 | 0.06706 | 1.45685 | 0.73182
IFCNN (Zhang et al., 2020b) | 6.59545 | 66.87578 | 13.19090 | 0.17959 | 1.71375 | 0.90527
NestFuse (Li et al., 2020a) | 6.91971 | 82.75242 | 13.83942 | 0.13405 | 1.73353 | 0.86248
PMGI (Zhang et al., 2020a) | 6.93391 | 71.54806 | 13.86783 | 0.13525 | 1.78242 | 0.88934
DDcGAN (Ma et al., 2020b) | 7.47310 | 100.34809 | 14.94620 | 0.33784 | 1.60926 | 0.76636
U2Fusion (Xu et al., 2020a) | 6.75708 | 64.91158 | 13.51416 | 0.29088 | 1.79837 | 0.92533
proposed | 6.84134 | 71.90131 | 13.68269 | 0.07288 | 1.83676 | 0.91456
Table 5: The values of the six quality metrics averaged over the fused images of 21 pairs of infrared and visible images.
Figure 16: The experimental results on “street” images. (a) Visible; (b) Infrared; (c) DenseFuse; (d) NestFuse; (e) PMGI; (f) DDcGAN; (g) U2Fusion; (h) proposed.

For DenseFuse, we choose the sum strategy and set its trade-off parameter as recommended in the original report. For NestFuse, average pooling is utilized for the channel attention fusion strategy. All these fusion methods are implemented using publicly available code, and their parameters are set by referring to the original reports.

To evaluate the visual effects of the fusion results, two pairs of visible and infrared images are selected, namely "man" and "umbrella" (more examples are shown in our supplementary material). The fused images obtained by the existing methods and our fusion method (RFN-Nest) are shown in Fig.14 and Fig.15, respectively.

In Fig.14 and Fig.15, the fused images obtained by DCHWT are noisier and contain image artefacts. The fused images obtained by GTF and FusionGAN exhibit clear salient features; although they retain more complementary information, they look more like the infrared image, especially in the background. In view of the importance of the background information, ConvSR, VggML, DenseFuse, IFCNN, NestFuse, PMGI and U2Fusion are designed to preserve more detail information from the visible image. These methods appear to reduce the salient infrared features compared with GTF and FusionGAN, producing acceptable fusion results. Although DDcGAN is also designed to maintain more detail information from the visible images, in Fig.14 (l) it injects more noise into the fused image and the infrared targets are blurred.

In the result of the proposed method, although the target features are not strongly enhanced, the contrast is better than in the visible image. Moreover, regarding detail preservation in Fig.14, compared with the other fusion methods, the 'tree' and the 'street lamp' (red box) are clearer in the fused image obtained by our method, and the detail textures of the 'bushes' (green box) are also preserved in the fused image.

In Fig.15, in the green box, many fusion methods are unable to preserve the salient features of the 'pavilion' from the visible image, except IFCNN, NestFuse and the proposed method; this means the other fusion methods fuse too much background information from the infrared image. Compared with all these fusion methods, in the red box, the detail information of the 'tree' reconstructed by the proposed method is clearer in the fused image (Fig.15 (n)).

For most of the compared methods, the background and the context around the salient parts are not very clear and are sometimes even invisible, because of the difficulty of extracting salient features from the source images, as shown in Fig.14 and Fig.15. This drawback causes a performance degradation when such image fusion algorithms are used in other computer vision tasks, such as RGB-T visual object tracking. In contrast, our RFN-Nest fusion network is able to preserve more detail information and to maintain the contrast of the infrared parts.

Compared with all the above fusion methods, the fused image obtained by the proposed method appears to retain a better balance between the visible background information and the infrared features.

We evaluate the fusion performance objectively using the six quality metrics, comparing the eleven existing fusion methods and our proposed fusion framework. The values of these metrics averaged over all the fused images are shown in Table 5. The best, second-best and third-best values are indicated in bold, red italic and blue italic, respectively.

From Table 5, the proposed fusion framework (RFN-Nest) obtains the best SCD value and ranks among the top three for SD and MS-SSIM. The reason why DDcGAN obtains larger values of En, SD and MI is that it introduces more noise and artefacts into the fused image. Our fusion network achieves good fusion performance, producing sharper content and exhibiting more visual information fidelity.

Method | En (Roberts et al., 2008) | SD (Rao, 1997) | MI (Qu et al., 2002) | Nabf (Kumar, 2013) | SCD (Aslantas and Bendes, 2015) | MS-SSIM (Ma et al., 2015)
DenseFuse (Li and Wu, 2019) | 6.77630 | 73.63462 | 13.55261 | 0.06346 | 1.74862 | 0.92944
NestFuse (Li et al., 2020a) | 6.99347 | 90.28951 | 13.98693 | 0.11138 | 1.67540 | 0.88611
PMGI (Zhang et al., 2020a) | 6.96974 | 77.25462 | 13.93948 | 0.11434 | 1.68523 | 0.88830
DDcGAN (Ma et al., 2020b) | 7.50173 | 106.99113 | 15.00346 | 0.30998 | 1.55359 | 0.78419
U2Fusion (Xu et al., 2020a) | 6.94970 | 76.80347 | 13.89939 | 0.28363 | 1.74780 | 0.93141
proposed | 6.92952 | 78.22247 | 13.85904 | 0.06357 | 1.76116 | 0.90894
Table 6: The values of the six quality metrics averaged over the fused images of 40 pairs of infrared and visible images collected from TNO and VOT2020-RGBT.

4.9 Further Analysis on 40 Pairs Images

The previous ablation studies and experiments are conducted on one test dataset which contains 21 pairs of infrared and visible images. To verify the generalization performance of the proposed fusion network, a new test dataset is created. It contains 40 pairs of infrared and visible images which are collected from TNO (Toet, 2014) and VOT2020-RGBT (Kristan et al., 2020).

In this section, we choose several state-of-the-art deep learning based fusion methods to perform comparative experiments. These methods include DenseFuse (Li and Wu, 2019), which is a classical auto-encoder based fusion method, NestFuse (Li et al., 2020a), which has the same backbone (encoder and decoder) as the proposed method, and three recent fusion methods (PMGI (Zhang et al., 2020a), DDcGAN (Ma et al., 2020b) and U2Fusion (Xu et al., 2020a)).

An example of the fused images obtained by these fusion methods and the proposed network is shown in Fig.16. The unnatural textures in the sky, introduced by the infrared image (Fig.16 (b), red box), should look more natural in the fused image. It is observed that DenseFuse, PMGI and the proposed method generate fused images of natural appearance. Moreover, compared with the existing fusion methods, our network also preserves more detail information from both the infrared and the visible images (Fig.16, the house in the yellow box). Note that DDcGAN again introduces noise into the fused image and blurs the salient feature content.

The same six quality metrics are used for comparative evaluation. The average values of these metrics are shown in Table 6. The best values are indicated in bold, the second-best values are denoted in red and italic and the third-best values are denoted in blue and italic.

Compared with the results on the 21 pairs of images, the proposed network exhibits even better relative performance on the 40 image pairs. The method achieves one best value (SCD), one second-best value (Nabf) and two third-best values (SD, MS-SSIM). Even compared with DDcGAN, our fusion network's performance is comparable. This confirms that our fusion network, trained with the two-stage training strategy and the novel loss functions, exhibits better generalization.

5 Experiments on RGBT Object Tracking

Over the past two years, multi-modality object tracking has attracted interest in many vision applications. In the Visual Object Tracking challenge (VOT) 2019 (Kristan et al., 2019), for the first time, the committee introduced two new sub-challenges (RGBD and RGBT), in which each sequence contains two modalities as input (RGB and depth images, or RGB and infrared images). As we focus on the fusion of infrared and visible images, the RGBT sub-challenge data is used to evaluate the performance of the proposed learnable fusion network (RFN) and the novel loss functions.

In VOT2020 (Kristan et al., 2020), the video sequences are the same as in VOT2019 (Kristan et al., 2019), but a new performance evaluation protocol is introduced for short-term tracker evaluation (including the RGBT sub-challenge). The new protocol avoids tracker-dependent resets and reduces the variance of the performance evaluation measures.

A state-of-the-art Siamese tracker, AFAT (Xu et al., 2020b), is chosen as the base tracker. AFAT incorporates a failure-aware system, realized by a Quality Prediction Network (QPN) based on convolutional and LSTM modules, and obtains better tracking performance on many datasets. For RGBT object tracking, the proposed fusion network (RFN) and the proposed loss function are incorporated into AFAT.

5.1 The RFN and The Loss Function For RGBT Tracking

As we discussed, the proposed residual fusion network (RFN) is a learnable fusion strategy. Thus, ideally, when the RFN is applied to AFAT (Xu et al., 2020b), a sufficient quantity of data is needed to train the whole model. However, due to the lack of labeled training data, we were forced to simplify the architecture of the RFN by reducing the number of convolutional layers, as shown in Fig.17. In the training phase, we only train the RFN module; the AFAT modules are fixed to reduce the number of learnable parameters (the framework of the RFN-based AFAT is shown in our supplementary material).

Figure 17: The RFN architecture for RGBT tracking.

Three RGBT datasets are used to train our RFN module, namely GTOT (Li et al., 2016), VT821 (Tang et al., 2019) and VT1000 (Tu et al., 2019). These datasets contain only 17.6k frames in total. GTOT is a dataset for RGBT tracking, while VT821 and VT1000 were built for RGBT salient object detection.

Figure 18: Some tracking results on VOT2020-RGBT. The frames in first row and second row are RGB frames and infrared frames, respectively. Five pairs of RGB and infrared frames in columns are collected from different video sequences (“carnotfar”, “crouch”, “fog6”, “twowoman” and “woman89”). AFAT is the base tracker. In these frames, blue boxes denote the ground truth, green boxes and yellow boxes are the tracking results obtained by AFAT with only RGB frames or infrared frames fed as input. Red boxes indicate the tracking results obtained by the RFN-based tracker.

To train the RFN module, the proposed loss function (L_RFN, Section 3.2.2) is used in the AFAT training. As RGBT tracking does not involve image generation, the background detail preservation loss function (L_detail) inevitably needs to be modified to become applicable to the tracking task. The modified L_detail is defined as follows,

L_detail = ||Φ_f − Φ_norm||²_F    (7)

where Φ_f denotes the fused deep features obtained by the RFN module and Φ_norm denotes the features obtained by the norm-based fusion strategy discussed in Section 4.7. The target feature enhancement loss function (L_feature) is the same as in Section 3.2.2.
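A minimal sketch of this modified loss is shown below, with phi_norm standing for the features produced by the handcrafted norm-based strategy; the tensor shapes are purely illustrative:

```python
import torch

def tracking_detail_loss(phi_f, phi_norm):
    """Modified L_detail for RGBT tracking (Eq. 7): the RFN output is pulled towards
    the features produced by the handcrafted norm-based fusion strategy, since no
    fused image is generated inside the tracking pipeline."""
    return ((phi_f - phi_norm) ** 2).sum()             # squared Frobenius norm

# toy usage with arbitrary feature shapes
print(tracking_detail_loss(torch.rand(1, 256, 22, 22), torch.rand(1, 256, 22, 22)).item())
```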

5.2 The Tracking Results on VOT-RGBT

The video sequences in VOT2020-RGBT are the same as in VOT2019-RGBT. Thus, we only present a few tracking results on VOT2020-RGBT in Fig.18. The ‘RFN’ denotes the RFN-based AFAT.

To evaluate the tracking performance, three measures (Kristan et al., 2019) were selected: Accuracy (A), Robustness (R) and Expected Average Overlap (EAO). (1) Accuracy denotes the average overlap between the ground truth and the predicted bounding boxes; (2) Robustness evaluates how many times the tracker loses the target (fails) during tracking; (3) EAO is an estimator of the average overlap a tracker is expected to attain. For the details of Accuracy, Robustness and EAO, please refer to (Kristan et al., 2015).

In VOT2020-RGBT (Kristan et al., 2020) (toolkit version 0.2.0), Accuracy, Robustness and EAO have the same meaning, but their calculation methods were re-defined by the committee. The higher the values of these measures, the better the tracker.

In addition to the base tracker (AFAT), we choose two further trackers for each dataset (VOT2019-RGBT, VOT2020-RGBT) to analyze the tracking performance of the RFN-based AFAT. In the VOT2019-RGBT competition, mfDiMP and FSRPN won third and fourth place on the public dataset, respectively. For VOT2020-RGBT, DFAT and M2C2Frgbt won third and seventh place on the public dataset, respectively; note that DFAT is the winner of the VOT2020-RGBT challenge. All the metric values are provided by the VOT committee and are available in the VOT reports (Kristan et al., 2019)(Kristan et al., 2020).

The tracking results of the RFN-based AFAT and the other trackers are shown in Table 7 and Table 8. "RGB" and "Infrared" indicate that only one modality (RGB or infrared) is fed into AFAT.

VOT2019-RGBT | EAO | A | R
FSRPN (Kristan et al., 2019) | 0.3553 | 0.6362 | 0.7069
mfDiMP (Kristan et al., 2019) | 0.3879 | 0.6019 | 0.8036
AFAT (Xu et al., 2020b), RGB only | 0.32590 | 0.61130 | 0.5700
AFAT (Xu et al., 2020b), Infrared only | 0.18120 | 0.56740 | 0.1800
AFAT (Xu et al., 2020b), RFN-based | 0.35840 | 0.64470 | 0.6500
Table 7: The tracking results obtained on the VOT2019-RGBT dataset. AFAT is used as the base tracker to evaluate the RFN fusion strategy.
VOT2020-RGBT | EAO | A | R
M2C2Frgbt (Kristan et al., 2020) | 0.332 | 0.636 | 0.722
DFAT (Kristan et al., 2020) | 0.390 | 0.672 | 0.779
AFAT (Xu et al., 2020b), RGB only | 0.329 | 0.635 | 0.669
AFAT (Xu et al., 2020b), Infrared only | 0.265 | 0.573 | 0.588
AFAT (Xu et al., 2020b), RFN-based | 0.371 | 0.668 | 0.726
Table 8: The tracking results obtained on the VOT2020-RGBT dataset. AFAT is used as the base tracker to evaluate the RFN fusion strategy.

From these two tables, compared with feeding just one modality into AFAT, the RFN-based AFAT delivers better tracking performance in all measures, both on VOT2019-RGBT and on VOT2020-RGBT. On VOT2019-RGBT, although mfDiMP achieves the best performance, the results produced by the RFN-based tracker are comparable in terms of EAO and its accuracy is better. On VOT2020-RGBT, even compared with the winning tracker, DFAT, our tracker is competitive.

These experiments demonstrate that even with insufficient training data, the tracker performance is improved by incorporating the proposed residual fusion network (RFN) into the AFAT tracking framework. When more training data becomes available, we believe the RFN-based tracker will achieve even better tracking performance.

6 Conclusions

Motivated by the weakness of the existing fusion methods in preserving image detail, in this paper we proposed a novel end-to-end fusion framework (RFN-Nest), which incorporates residual fusion networks into a nest connection based auto-encoder. To design RFN-Nest, a two-stage training strategy was presented. In the proposed scheme, an auto-encoder network is first trained using the pixel loss function (L_pixel) and the SSIM loss function (L_ssim). The trained encoder is utilized to extract multi-scale features from the source images, and the nest connection based decoder network reconstructs the fused image from the fused multi-scale features. The key component of RFN-Nest is the residual fusion network (RFN). In the second stage of the training strategy, four residual fusion networks (RFN) are trained to preserve the image detail and the salient features, using L_detail and L_feature, respectively. Once the two-stage training is accomplished, the fused image is reconstructed by the encoder, the RFN networks and the decoder. Compared with the existing fusion methods considered in our experiments, RFN-Nest achieves the best fusion performance in both subjective and objective evaluation.

To validate the generality of the fusion network, we also applied the proposed RFN and the novel loss functions to a state-of-the-art tracker to perform a multi-modal tracking task (RGBT tracking). Compared with single-modality input, the RFN-based tracker delivers better tracking performance in all measures on VOT2019 and VOT2020. Even compared with the state-of-the-art RGBT trackers, the RFN-based tracker achieves very good performance. This demonstrates that, with the proposed innovations, the RFN-Nest network has a wide applicability, extending beyond image fusion.

References

  • V. Aslantas and E. Bendes (2015) A new image quality metric for image fusion: the sum of the correlations of differences. Aeu-international Journal of electronics and communications 69 (12), pp. 1890–1896. Cited by: §4.2, Table 1, Table 2, Table 3, Table 5, Table 6.
  • A. Ben Hamza, Y. He, H. Krim, and A. Willsky (2005) A multiscale approach to pixel-level image fusion. Integrated Computer-Aided Engineering 12 (2), pp. 135–146. Cited by: §1.
  • R. Gao, S. A. Vorobyov, and H. Zhao (2017) Image fusion with cosparse analysis operator. IEEE Signal Processing Letters 24 (7), pp. 943–947. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.1, §3.1.1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.1, §3.1.2.
  • S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon (2015) Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1037–1045. Cited by: §4.1.
  • M. Kristan, J. Matas, A. Leonardis, M. Felsberg, et al. (2020) The eighth visual object tracking VOT2020 challenge results. In Proc. 16th Eur. Conf. Comput. Vis. Workshop, Cited by: §4.2, §4.9, §5.2, §5.2, Table 8, §5.
  • M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder (2015) The visual object tracking vot2015 challenge results. In Proceedings of the IEEE international conference on computer vision workshops, pp. 1–23. Cited by: §5.2.
  • M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J. Kamarainen, L. Cehovin Zajc, O. Drbohlav, A. Lukezic, A. Berg, et al. (2019) The seventh visual object tracking vot2019 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1–36. Cited by: §5.2, §5.2, Table 7, §5, §5.
  • B. S. Kumar (2013) Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing 7 (6), pp. 1125–1143. Cited by: §4.2, §4.8, Table 1, Table 2, Table 3, Table 5, Table 6.
  • C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin (2016) Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing 25 (12), pp. 5743–5756. Cited by: §5.1.
  • C. Li, X. Liang, Y. Lu, N. Zhao, and J. Tang (2019a) RGB-T object tracking: benchmark and baseline. Pattern Recognition 96, pp. 106977. Cited by: §1.
  • C. Li, C. Zhu, J. Zhang, B. Luo, X. Wu, and J. Tang (2018a) Learning Local-Global Multi-Graph Descriptors for RGB-T Object Tracking. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §1.
  • H. Li, X. Wu, and T. S. Durrani (2019b) Infrared and Visible Image Fusion with ResNet and zero-phase component analysis. Infrared Physics & Technology 102, pp. 103039. Cited by: §1, §2.1.
  • H. Li, X. Wu, and T. Durrani (2020a) NestFuse: An Infrared and Visible Image Fusion Architecture based on Nest Connection and Spatial/Channel Attention Models. IEEE Transactions on Instrumentation and Measurement. Note: doi: 10.1109/TIM.2020.3005230 Cited by: §1, §1, §1, §2.1, §3.1.2, §4.7, §4.7, §4.8, §4.9, Table 4, Table 5, Table 6.
  • H. Li, X. Wu, and J. Kittler (2018b) Infrared and Visible Image Fusion using a Deep Learning Framework. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2705–2710. Cited by: §1, §1, §2.1, §4.8, Table 5.
  • H. Li, X. Wu, and J. Kittler (2020b) MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing. Note: doi: 10.1109/TIP.2020.2975984 Cited by: §1.
  • H. Li and X. Wu (2017) Multi-focus image fusion using dictionary learning and low-rank representation. In International Conference on Image and Graphics, pp. 675–686. Cited by: §1, §1.
  • H. Li and X. Wu (2019) DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623. Cited by: §1, §1, §1, §2.1, §2.1, §3.2.1, §3.2.1, §4.7, §4.7, §4.8, §4.9, Table 4, Table 5, Table 6.
  • H. Li (2020) Code of RFN-Nest. Note: https://github.com/hli1221/imagefusion-rfn-nest Cited by: §4.2.
  • S. Li, X. Kang, L. Fang, J. Hu, and H. Yin (2017) Pixel-level image fusion: a survey of the state of the art. Information Fusion 33, pp. 100–112. Cited by: §1.
  • S. Li, X. Kang, and J. Hu (2013) Image fusion with guided filtering. IEEE Transactions on Image processing 22 (7), pp. 2864–2875. Cited by: §1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: Common objects in context. In European conference on computer vision, Zurich, Switzerland, pp. 740–755. Cited by: §4.1.
  • C. Liu, Y. Qi, and W. Ding (2017a) Infrared and visible image fusion method based on saliency detection in sparse domain. Infrared Physics & Technology 83, pp. 94–102. Cited by: §1.
  • G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma (2012) Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 171–184. Cited by: §1.
  • G. Liu, Z. Lin, and Y. Yu (2010) Robust subspace segmentation by low-rank representation.. In ICML, Vol. 1, pp. 8. Cited by: §1.
  • Y. Liu, X. Chen, H. Peng, and Z. Wang (2017c) Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207. Cited by: §1, §2.1.
  • Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, and X. Wang (2018) Deep learning for pixel-level image fusion: recent advances and future prospects. Information Fusion 42, pp. 158–173. Cited by: §1.
  • Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang (2016) Image fusion with convolutional sparse representation. IEEE signal processing letters 23 (12), pp. 1882–1886. Cited by: §1, §2.1, §4.8, Table 5.
  • X. Lu, B. Zhang, Y. Zhao, H. Liu, and H. Pei (2014) The infrared and visible image fusion algorithm based on target separation and sparse representation. Infrared Physics & Technology 67, pp. 397–407. Cited by: §1.
  • C. Luo, B. Sun, K. Yang, T. Lu, and W. Yeh (2019) Thermal infrared and visible sequences fusion tracking based on a hybrid tracking framework with adaptive weighting scheme. Infrared Physics & Technology 99, pp. 265–276. Cited by: §1.
  • J. Ma, C. Chen, C. Li, and J. Huang (2016) Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion 31, pp. 100–109. Cited by: §4.8, Table 5.
  • J. Ma, P. Liang, W. Yu, C. Chen, X. Guo, J. Wu, and J. Jiang (2020a) Infrared and visible image fusion via detail preserving adversarial learning. Information Fusion 54, pp. 85–98. Cited by: §1, §2.2, §2.2, §3.2.1.
  • J. Ma, Y. Ma, and C. Li (2019a) Infrared and visible image fusion methods and applications: A survey. Information Fusion 45, pp. 153–178. Cited by: §1.
  • J. Ma, H. Xu, J. Jiang, X. Mei, and X. Zhang (2020b) DDcGAN: A Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion. IEEE Transactions on Image Processing 29, pp. 4980–4995. Cited by: §1, §2.2, §4.8, §4.9, Table 5, Table 6.
  • J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang (2019b) FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: §1, §2.2, §2.2, §3.2.1, §4.8, Table 5.
  • K. Ma, K. Zeng, and Z. Wang (2015) Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing 24 (11), pp. 3345–3356. Cited by: §4.2, Table 1, Table 2, Table 3, Table 5, Table 6.
  • G. Pajares and J. M. De La Cruz (2004) A wavelet-based image fusion tutorial. Pattern recognition 37 (9), pp. 1855–1872. Cited by: §1.
  • G. Qu, D. Zhang, and P. Yan (2002) Information measure for performance of image fusion. Electronics letters 38 (7), pp. 313–315. Cited by: §4.2, Table 1, Table 2, Table 3, Table 5, Table 6.
  • K. Ram Prabhakar, V. Sai Srikar, and R. Venkatesh Babu (2017) Deepfuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4714–4722. Cited by: §1, §2.1, §3.2.1.
  • Y. Rao (1997) In-fibre Bragg grating sensors. Measurement science and technology 8 (4), pp. 355. Cited by: §4.2, Table 1, Table 2, Table 3, Table 5, Table 6.
  • J. W. Roberts, J. A. Van Aardt, and F. B. Ahmed (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522. Cited by: §4.2, Table 1, Table 2, Table 3, Table 5, Table 6.
  • V. Shrinidhi, P. Yadav, and N. Venkateswaran (2018) IR and Visible Video Fusion for Surveillance. In 2018 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 1–6. Cited by: §1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1.
  • X. Song and X. Wu (2018) Multi-focus Image Fusion with PCA Filters of PCANet. In IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction, pp. 1–17. Cited by: §1, §1, §2.1.
  • J. Tang, D. Fan, X. Wang, Z. Tu, and C. Li (2019) RGBT Salient Object Detection: Benchmark and A Novel Cooperative Ranking Approach. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §5.1.
  • A. Toet (2014) TNO Image Fusion Dataset. Note: https://figshare.com/articles/TN_Image_Fusion_Dataset/1008029 Cited by: §4.2, §4.9.
  • Z. Tu, T. Xia, C. Li, X. Wang, Y. Ma, and J. Tang (2019) RGB-T Image Saliency Detection via Collaborative Graph Learning. IEEE Transactions on Multimedia 22 (1), pp. 160–173. Cited by: §5.1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §3.2.1.
  • J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma (2008) Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence 31 (2), pp. 210–227. Cited by: §1.
  • H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling (2020a) U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Transactions on Pattern Analysis and Machine Intelligence. Note: doi: 10.1109/TPAMI.2020.3012548 Cited by: §2.2, §4.8, §4.9, Table 5, Table 6.
  • T. Xu, Z. Feng, X. Wu, and J. Kittler (2020b) AFAT: Adaptive Failure-Aware Tracker for Robust Visual Object Tracking. arXiv preprint arXiv:2005.13708v1. Cited by: §5.1, Table 7, Table 8, §5.
  • S. Yang, M. Wang, L. Jiao, R. Wu, and Z. Wang (2010) Image fusion based on a new contourlet packet. Information Fusion 11 (2), pp. 78–84. Cited by: §1.
  • M. Yin, P. Duan, W. Liu, and X. Liang (2017) A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation. Neurocomputing 226, pp. 182–191. Cited by: §1.
  • H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma (2020a) Rethinking the image fusion: a fast unified image fusion network based on proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12797–12804. Cited by: §2.2, §4.8, §4.9, Table 5, Table 6.
  • Q. Zhang, Y. Fu, H. Li, and J. Zou (2013) Dictionary learning method for joint sparse representation-based image fusion. Optical Engineering 52 (5), pp. 057006. Cited by: §1, §1.
  • Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang (2020b) IFCNN: A general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118. Cited by: §2.2, §2.2, §4.7, §4.7, §4.8, Table 4, Table 5.
  • Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang (2018) Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, pp. 3–11. Cited by: §1, §2.1, §3.1.2.