
TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning

01/19/2022
by   Linhao Qu, et al.
Fudan University

Image fusion is a technique to integrate information from multiple source images with complementary information to improve the richness of a single image. Due to insufficient task-specific training data and corresponding ground truth, most existing end-to-end image fusion methods easily fall into overfitting or tedious parameter optimization processes. Two-stage methods avoid the need for a large amount of task-specific training data by training an encoder-decoder network on large natural image datasets and utilizing the extracted features for fusion, but the domain gap between natural images and different fusion tasks results in limited performance. In this study, we design a novel encoder-decoder based image fusion framework and propose a destruction-reconstruction based self-supervised training scheme to encourage the network to learn task-specific features. Specifically, we propose three destruction-reconstruction self-supervised auxiliary tasks for multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion based on pixel intensity non-linear transformation, brightness transformation and noise transformation, respectively. In order to encourage different fusion tasks to promote each other and increase the generalizability of the trained network, we integrate the three self-supervised auxiliary tasks by randomly choosing one of them to destroy a natural image in model training. In addition, we design a new encoder that combines CNN and Transformer for feature extraction, so that the trained model can exploit both local and global information. Extensive experiments on multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion tasks demonstrate that our proposed method achieves state-of-the-art performance in both subjective and objective evaluations. The code will be publicly available soon.

1 Introduction

By integrating complementary information from multiple source images into one fused image, image fusion techniques can generate high-quality images and compensate for the inherent defects of a single imaging sensor Li et al. (1995). Image fusion has a wide range of applications Li et al. (1995); Goshtasby and Nikolov (2007); Li et al. (2013); Bai et al. (2011). For example, in military applications, the fusion of infrared and visible images can be used for reconnaissance as well as night vision Xue and Blum (2003); Wan et al. (2009); Zhou et al. (2016); Zhang et al. (2017b). In the medical field, fusing images of different modalities (e.g., Computed Tomography (CT) and Magnetic Resonance Imaging (MRI)) can assist clinicians in diagnosis and treatment Bhatnagar et al. (2013); Xu (2014). In the field of consumer electronics, multi-exposure image fusion can be employed to generate high dynamic range images for mobile devices Goshtasby (2005); Shen et al. (2011); Ma et al. (2015), while multi-focus image fusion can be applied in refocusing algorithms Saha et al. (2013); Bai et al. (2015); Zhang and Levine (2016).

Typically, image fusion tasks can be classified into multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion. Although most studies focus on a single fusion task, the design of unified image fusion frameworks that can be applied to different tasks is gradually becoming a significant research direction Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b). The reason lies not only in the generality of a unified framework across multiple tasks, but also in the finding that such frameworks can achieve better performance when jointly trained on different tasks than when trained on a single specific task Xu et al. (2020a).

Existing image fusion methods can be classified into two categories: traditional methods Li et al. (1995, 2013); Huang and Jing (2007); Zhang et al. (2017a); Zhou et al. (2014); Burt and Adelson (1987); Toet (1989); Cao et al. (2014); Quan et al. (2014); Luo et al. (2016); Liu et al. (2017, 2016); Li et al. (2020) and deep learning-based methods Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Li and Wu (2018); Ma et al. (2019, 2020, 2021); Liu et al. (2020). Although traditional image fusion methods achieved promising performance before the deep learning era, their hand-crafted feature extraction approaches limit further performance improvement. Moreover, these methods can only be used for a specific task due to their poor generalization capability. Deep learning-based image fusion methods have alleviated these limitations thanks to their powerful feature extraction capability and have gradually become the mainstream approach. In these methods, the source images are fed into a deep neural network, and the output of the network is the fused image.

According to how the fusion network is trained, deep learning-based methods can be further divided into end-to-end methods Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Ma et al. (2019, 2020) and two-stage methods Li and Wu (2018); Ma et al. (2021); Liu et al. (2020). In end-to-end methods, the fusion network is trained directly, either in a supervised manner using synthetic ground-truth fused images or in an unsupervised manner using loss functions defined on the similarity between the fused image and the input source images. However, end-to-end methods require a large number of task-specific training images, which are difficult or expensive to collect in the image fusion field. Although some training datasets have been constructed for specific fusion tasks Xu et al. (2020a); Cai et al. (2018b); Nejati et al. (2015); Toet (2014), they are not comparable in size to large natural image datasets (e.g., ImageNet Deng et al. (2009), COCO Lin et al. (2014)). Therefore, end-to-end image fusion methods easily fall into overfitting or tedious parameter optimization processes due to insufficient training data. To obtain more training data, most end-to-end methods Xu et al. (2020a); Zhang et al. (2020a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Ma et al. (2019, 2020) divide the source images into small patches during training, which corrupts the semantic information of the whole image and makes it difficult for the network to model global features, leading to inferior fusion performance. Instead, two-stage methods first train an encoder-decoder network on a large natural image dataset through image reconstruction. Then, the trained encoder is used to extract feature maps from the source images, and the feature maps are fused and further decoded by the trained decoder to generate the fused image. The advantage of two-stage methods is that the encoder and the decoder can be trained on large natural image datasets, avoiding the need for a large amount of task-specific training data, which makes them more flexible and stable.

However, the two-stage methods still have several unsolved issues that hinder further performance improvement. First, the current two-stage methods Li and Wu (2018); Ma et al. (2021); Liu et al. (2020) purely focus on the reconstruction of natural images, but the domain gap between natural images and fusion tasks results in poor generalization of the extracted features and inferior performance. In addition, the same natural image dataset is usually used for different fusion tasks without considering task-specific features. Therefore, a prominent issue is how to enable the encoder-decoder network to be trained on large natural image datasets while learning task-specific image features at the same time. Second, recent studies have indicated Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b) that joint training on different tasks can help improve the performance on each single task. Therefore, how to design a joint training scheme for two-stage frameworks is another important issue. Third, all existing deep learning-based image fusion methods utilize CNNs for feature extraction, but it is difficult for CNNs to model long-range dependencies due to their small receptive fields Dosovitskiy et al. (2020).

To address the issues mentioned above, in this paper we propose a novel two-stage image fusion framework based on a new encoder-decoder network, named TransFuse. First, to enable our network to be trained on large natural image datasets and learn task-specific features at the same time, we design three destruction-reconstruction self-supervised auxiliary tasks, one for each of the three image fusion tasks: multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion. Instead of simply inputting a natural image into the encoder and using the decoder to reconstruct it, we destroy the natural image before inputting it into the encoder and design a specific way of destruction for each fusion task. By enforcing the encoder-decoder network to reconstruct a destroyed image, we make the network learn better task-specific image features. Second, in order to encourage different fusion tasks to promote each other and increase the generalizability of the trained network, we integrate the three self-supervised auxiliary tasks by randomly choosing one of them to destroy a natural image in model training. Third, to compensate for the deficiency of CNNs in modeling long-range dependencies, we design a new encoder that combines CNN and Transformer to exploit both local and global information in feature extraction. We conduct extensive experiments to demonstrate the effectiveness of each component of our framework.

The main contributions are summarized as follows.

  • We design a novel encoder-decoder based image fusion framework and propose a destruction-reconstruction based self-supervised training scheme to encourage the network to learn task-specific features.

  • We propose three transformations to destroy a natural image: pixel intensity non-linear transformation, brightness transformation, and noise transformation, designed for multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion, respectively. We integrate the three transformations by randomly selecting one of them to destroy an image in model training, so that the trained model extracts more generalizable features and achieves higher performance on each task.

  • We design a new encoder that combines CNN and Transformer for feature extraction, so that the trained model can exploit both local and global information.

  • Extensive experiments on multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion tasks show that our proposed method achieves new state-of-the-art performance in both subjective and objective evaluations.

2 Related Work

In this section, we briefly review the most representative image fusion methods of recent years, including traditional methods and deep learning-based methods; the latter are further divided into end-to-end methods and two-stage methods. Subsequently, we introduce related work on self-supervised learning and the Transformer in computer vision, as well as their application potential in the image fusion field.

2.1 Image Fusion

2.1.1 Traditional Image Fusion Algorithms

Traditional image fusion methods can be classified into spatial domain-based methods, transform domain-based methods, and sparse representation and dictionary learning-based methods. Spatial domain-based methods usually calculate a weighted average of local or pixel-level saliency of the two source images Li et al. (2013); Huang and Jing (2007); Zhang et al. (2017a); Zhou et al. (2014) to obtain the fused image.

Transform domain-based methods first transform the source images into a transform domain (e.g., the wavelet domain) to obtain different frequency components. The corresponding components are then fused by appropriate fusion rules, and the fused image is finally obtained by the inverse transform. Commonly used transforms include the Laplacian pyramid (LP) Burt and Adelson (1987), the ratio of low-pass pyramid (RP) Toet (1989), the discrete wavelet transform (DWT) Li et al. (1995), the discrete cosine transform (DCT) Cao et al. (2014), the curvelet transform (CVT) Quan et al. (2014) and the shearlet transform (Shearlet) Luo et al. (2016), etc.

New methods based on sparse representation and dictionary learning have also emerged in recent years. For example, Liu et al. Liu et al. (2017) proposed JSR and JSRSD based on joint sparse representation and saliency detection. They first obtained the global and local representative maps of the source images based on sparse coefficients, and then combined them by a representative detection model to generate the overall representative map. Finally, a weighted fusion algorithm was adopted to obtain the fused image based on the overall representative map.

Although the aforementioned methods have achieved good results, their performance is still limited in two aspects. First, the complicated manually designed feature extraction approaches usually fail to effectively preserve important information in the source images and cause artifacts in the fused image. Second, the feature extraction methods are usually designed for a specific task, so it is difficult to adapt them to other tasks.

2.1.2 Deep Learning-based Image Fusion Algorithms

Due to the powerful feature extraction capability of CNN, deep learning-based methods have gradually become the mainstream approaches in the field of image fusion. Deep learning-based methods can be further divided into end-to-end methods and two-stage methods.

End-to-End Image Fusion Scheme

In end-to-end methods, the fusion network is directly trained either in a supervised manner using synthetic ground truth fusion images or in an unsupervised manner using loss functions defined on the similarity between the source images and the fused images.

Zhang et al. Zhang et al. (2020b) proposed IFCNN, a supervised unified image fusion framework. They constructed a large multi-focus image fusion dataset with ground-truth images and utilized a perceptual loss function for supervised training; the trained model was then transferred to other fusion tasks. Prabhakar et al. Ram Prabhakar et al. (2017) proposed DeepFuse, an unsupervised multi-exposure image fusion framework utilizing a no-reference quality metric as the loss function. They designed a novel CNN-based network trained to learn the fusion operation without ground-truth fusion images. Ma et al. Ma et al. (2019) proposed FusionGAN, an unsupervised infrared and visible image fusion framework based on GAN. The generator network produces the fused image, and the discriminator network ensures that the texture details of the visible images together with the thermal radiation information of the infrared images are retained in the fused image. On this basis, they further proposed DDcGAN Ma et al. (2020), which enhances the edges and the saliency of thermal targets by introducing a target-enhanced loss function and a dual-discriminator structure. Zhang et al. Zhang et al. (2020a) proposed PMGI, an unsupervised unified image fusion network. They formulated image fusion as the proportional maintenance of texture and intensity information of the source images, and proposed a new loss function based on the gradient and intensity information between the fused image and the source images for unsupervised training. Xu et al. Xu et al. (2020a, b) proposed U2Fusion, an unsupervised unified image fusion network. They utilized a novel loss function based on an adaptive information preservation degree for unsupervised training. During training, a pre-trained neural network extracts features from the source images, and these features are further used to calculate the adaptive information preservation degree.

The above end-to-end models have achieved promising fusion performance, but both supervised and unsupervised methods require a large number of task-specific training images. Although several studies have constructed training datasets for specific fusion tasks (e.g., RoadScene Xu et al. (2020a) and TNO Toet (2014) for infrared and visible image fusion; Harvard Keith A. Johnson for medical image fusion; SICE Cai et al. (2018b) for multi-exposure image fusion; and Lytro Nejati et al. (2015) for multi-focus image fusion), their sizes are not comparable to those of large natural image datasets (e.g., ImageNet Deng et al. (2009), COCO Lin et al. (2014)). Insufficient training data tends to cause overfitting or complex parameter optimization. Besides, in order to obtain more training data, most end-to-end methods Xu et al. (2020a); Zhang et al. (2020a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Ma et al. (2019, 2020) divide the source images into a large number of small patches during training, which corrupts the semantic information of the whole image and also prevents the network from modeling its global features. These limitations hinder further performance improvement of the end-to-end methods.

Two-stage Image Fusion Scheme

In two-stage methods, an encoder-decoder network is first trained on large natural image datasets by image reconstruction. In the second stage, the trained encoder is used to extract feature maps from the source images, and the feature maps are then fused and decoded by the trained decoder to generate the fused image.

Li et al. Li and Wu (2018) presented DenseFuse, the first two-stage method, for infrared and visible image fusion; in the fusion process, they introduced an l1-norm based fusion rule to fuse the feature maps. Ma et al. Ma et al. (2021) proposed SESF-Fuse, a new two-stage framework for multi-focus image fusion. They utilized spatial frequency to measure the activity level of the feature maps and acquired decision maps based on this activity level; the decision maps were then used to obtain the fusion results. Liu et al. Liu et al. (2020) proposed WaveFuse, a unified image fusion framework based on the multi-scale discrete wavelet transform (DWT). In the fusion stage, the feature maps of the source images are transformed into several components in the wavelet domain and fused component by component. The fused feature maps are then reconstructed using the inverse DWT and input into the trained decoder to generate the fused image.

The advantage of two-stage methods is that training can be performed in a self-supervised manner on readily available large natural image datasets, without the need for scene-specific datasets or ground-truth fusion images. However, current two-stage frameworks simply perform natural image reconstruction and cannot enable the encoder-decoder network to learn task-specific image features. In this paper, we propose a novel two-stage image fusion framework that retains the above advantages of two-stage methods. At the same time, we propose three destruction-reconstruction auxiliary tasks to encourage the network to learn task-specific image features.

2.2 Self-supervised Learning

Self-supervised learning is a branch of unsupervised learning that automatically generates its own supervision labels from large-scale unlabeled data. A network can learn valuable representations for downstream tasks when it is trained to perform auxiliary tasks using the generated labels Liu et al. (2021). Self-supervised learning has been successfully applied in many fields such as computer vision and natural language processing Liu et al. (2021). In computer vision, self-supervised auxiliary tasks include rotation angle prediction Gidaris et al. (2018), image puzzles Noroozi and Favaro (2016), image colorization Zhang et al. (2016) and image reconstruction Pu et al. (2016). Extensive studies Kolesnikov et al. (2019); Achituve et al. (2021); Misra and Maaten (2020); Jing and Tian (2021) have shown that with suitably designed auxiliary tasks, self-supervised learning can produce effective and generalizable image feature representations. In this study, we design three self-supervised reconstruction auxiliary tasks for the three image fusion tasks and integrate them using a random combination strategy to further improve the performance of the trained model.

2.3 Transformer

In the field of computer vision, CNNs and their variants have been widely used due to their powerful feature extraction capability. Nevertheless, CNNs fail to establish long-range dependencies because of their inherently small receptive fields. Unfortunately, almost all existing image fusion architectures are based on CNNs, so global information is not fully exploited. In natural language processing, the Transformer has achieved remarkable results in modeling global dependencies through the self-attention mechanism. A great number of works have since adopted the Transformer as an alternative to CNNs in computer vision and achieved excellent results in different tasks, such as image classification Dosovitskiy et al. (2020); Touvron et al. (2021), object detection Carion et al. (2020); Zhu et al. (2020); Zheng et al. (2020); Dai et al. (2021); Sun et al. (2021), image segmentation Wang et al. (2018, 2021) and image generation Parmar et al. (2018), etc. Extensive studies Xu et al. (2020a); Zhang et al. (2020b); Ram Prabhakar et al. (2017); Li and Wu (2018); Ma et al. (2019, 2020, 2021); Liu et al. (2020) have shown that the ability to extract effective image feature representations is the key to improving image fusion performance. To remedy the deficiency of current CNN-based image fusion architectures in establishing long-range dependencies, we design a new feature extraction module that combines CNN with Transformer, allowing the network to exploit both local and global information more comprehensively.

3 Method

3.1 Framework Overview

The overall architecture of our framework is shown in Fig. 1. We train the encoder-decoder network by image reconstruction on a large natural image dataset. Different from existing two-stage methods, which directly input the original natural images into the encoder-decoder network for reconstruction, we first destroy the original images before inputting them into the encoder. Concretely, given an original training image $I$, we first randomly generate image subregions to form a set $\mathcal{P}$. For each subregion in $\mathcal{P}$, we randomly apply pixel intensity non-linear transformation, brightness transformation and noise transformation to obtain a set of transformed subregions $\hat{\mathcal{P}}$ and the destroyed input image $\hat{I}$. Then, we input the destroyed image into the encoder, which consists of a feature extraction module, TransBlock, and a feature enhancement module, EnhanceBlock. Finally, the extracted features are input to the decoder to reconstruct the original training image. The detailed structures of the encoder and the decoder are described in Section 3.2, and the three task-specific transformations are introduced in detail in Section 3.3.

As shown in Fig. 1 (b), the fusion framework consists of two parameter-shared encoders, a fusion block and a decoder. Note that both the encoder and the decoder are trained in the first stage, and the fusion block has no parameters to be trained, which ensures the simplicity and efficiency of the fusion framework. Specifically, the two source images $I_1$ and $I_2$ are first input to the encoder for feature encoding, and the extracted feature maps $F_1$ and $F_2$ are then fused by the fusion block to obtain the fused feature maps $F_f$. Finally, $F_f$ is decoded by the decoder to generate the fused image $I_f$. Since the important information differs among the fused images of different tasks, different fusion rules are used according to the characteristics of each task. The fusion rules are introduced in detail in Section 3.4.

Figure 1: The two-stage image fusion framework based on an encoder-decoder network. (a) The framework of training a Transformer-based encoder-decoder network via image reconstruction. (b) The architecture of fusing two source images using the trained encoder and decoder. (c) Details of the Transformer block and ConvBlock.

3.2 Transformer-Based Encoder-Decoder Framework for Training

3.2.1 Encoder-Decoder Network via Image Reconstruction

We train an encoder-decoder network via image reconstruction for image fusion, and the network architecture is shown in Fig. 1 (a). Given a training image $I$, we randomly generate image subregions to form a set $\mathcal{P} = \{P_1, \dots, P_n\}$ for transformation. For each subregion $P_i$ in $\mathcal{P}$ to be transformed, we randomly apply three transformations (the three transformations for different fusion tasks are detailed in Section 3.3) to obtain the set of transformed subregions $\hat{\mathcal{P}}$ and the transformed input image $\hat{I}$:

$\hat{P}_i = T(P_i), \quad P_i \in \mathcal{P}$,   (1)
$\hat{I} = \Phi(I, \hat{\mathcal{P}})$,   (2)

where $T(\cdot)$ denotes the transform function for a subregion set, $\hat{P}_i$ is the transformed image subregion, $\hat{I}$ is the transformed input image, and $\Phi(\cdot)$ refers to the image transform function.

We perform a self-supervised image reconstruction task for model training, so that the encoder-decoder network learns an inverse mapping that reconstructs the original image from the transformed input image:

$\tilde{I} = D\big(E(\hat{I})\big) \approx I$,   (3)

where $E(\cdot)$ and $D(\cdot)$ denote the encoder and the decoder, respectively, and $\tilde{I}$ is the reconstructed image. Notice that we do not transform the whole input image but only a random selection of its subregions, while the reconstruction is conducted at the whole-image level.

As shown in Fig. 1 (a), the encoder contains a feature extraction module named TransBlock and a feature enhancement module called EnhanceBlock. TransBlock contains two sub-modules, the CNN-Module and the Transformer-Module, based on CNN and Transformer, respectively. The specific design of TransBlock is described in Section 3.2.2. The EnhanceBlock further aggregates and enhances the feature maps of the CNN-Module and the Transformer-Module in TransBlock, thus allowing the encoder to better integrate local and global features. In particular, we concatenate the encoded features from the CNN-Module and the Transformer-Module and input them into two ConvBlock layers to achieve feature integration and enhancement. As shown in Fig. 1 (c), each ConvBlock consists of two convolutional layers with a kernel size of 3×3, padding of 1, and a ReLU activation layer. The decoder contains two sequentially connected ConvBlock layers followed by a 1×1 convolution to reconstruct the original image.
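To make these building blocks concrete, the following PyTorch sketch shows one possible realization of the ConvBlock and the decoder described in Fig. 1 (c); the channel widths are illustrative assumptions rather than the paper's exact values.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions (padding 1) with ReLU activation, as in Fig. 1 (c)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Decoder(nn.Module):
    """Two sequential ConvBlocks followed by a 1x1 convolution back to one grayscale channel."""
    def __init__(self, in_ch=64, mid_ch=32):  # channel sizes assumed for illustration
        super().__init__()
        self.conv1 = ConvBlock(in_ch, mid_ch)
        self.conv2 = ConvBlock(mid_ch, mid_ch)
        self.out = nn.Conv2d(mid_ch, 1, kernel_size=1)

    def forward(self, f):
        return self.out(self.conv2(self.conv1(f)))
```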

3.2.2 TransBlock: A Powerful Local and Global Feature Extractor

Inspired by ViT Dosovitskiy et al. (2020) and TNT Han et al. (2021), we combine CNN and Transformer to propose a powerful feature extraction module, TransBlock, which exploits both local and global information in the source images. As shown in Fig. 1 (a), TransBlock consists of two feature extraction sub-modules, the CNN-Module and the Transformer-Module. The CNN-Module consists of three sequentially connected ConvBlocks. The Transformer-Module is designed with a Fine-grained Transformer for local feature modeling and a Global Transformer for global feature modeling. Concretely, we first divide the transformed input image $\hat{I}$ into $N$ patches of size $p \times p$ and construct a global sequence $X = \{x_1, \dots, x_N\}$, where $p$ is the size of the divided patches. To capture finer-grained features, we further divide each patch in the global sequence into $M$ smaller sub-patches of size $s \times s$ and construct the local sequence $X_i = \{x_{i,1}, \dots, x_{i,M}\}$, where $s$ is the size of the divided sub-patches.

For the processing of local sequences, we use the Fine-grained Transformer with shared weights to learn fine-grained relative dependencies in images.

First, we input the local sequence $X_i$ into the Linear Projection layer to perform a linear mapping and obtain the encoded feature sequence $Y_i^0 = \{y_{i,1}^0, \dots, y_{i,M}^0\}$:

$y_{i,j}^0 = W_l\,\mathrm{Flatten}(x_{i,j}) + b_l$,   (4)

where $x_{i,j}$ refers to the $j$-th sub-patch of the $i$-th patch; $y_{i,j}^0$ represents the encoded features of the sub-patch; $\mathrm{Flatten}(\cdot)$ flattens the input patch into a one-dimensional vector; and $W_l$, $b_l$ denote the weights and bias of the Linear Projection layer, respectively.

We then input the encoded features of the local sequences into the Fine-grained Transformer. The Fine-grained Transformer adopts a standard Transformer structure similar to ViT Dosovitskiy et al. (2020), as shown in Fig. 1 (c).

$Y_i^{k} = \mathrm{FineTransformer}\big(Y_i^{k-1}\big), \quad k = 1, \dots, K$,   (5)

where $k$ denotes the $k$-th block and $K$ refers to the number of Transformer blocks.

For the global sequence, similar to the local sequence, we first input the global sequence $X$ into the Linear Projection layer. Afterwards, we linearly map the output of the corresponding Fine-grained Transformer of each local sequence and concatenate it with the encoded features of the global sequence:

$z_i^0 = \big[\,W_g\,\mathrm{Flatten}(x_i) + b_g\,;\ \mathrm{Linear}\big(Y_i^K\big)\,\big]$,   (6)

where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, $\mathrm{Linear}(\cdot)$ is the linear mapping of the Fine-grained Transformer output, $\mathrm{Flatten}(\cdot)$ flattens the input patch into a one-dimensional vector, and $W_g$, $b_g$ stand for the weights and bias of the Linear Projection layer, respectively.

Then, we input the interactive encoding of global and local sequences into the Global Transformer. The structure of the Global Transformer utilizes a standard Transformer structure similar to ViT Dosovitskiy et al. (2020), as shown in Fig. 1 (c).

$Z^{k} = \mathrm{GlobalTransformer}\big(Z^{k-1}\big), \quad k = 1, \dots, K$,   (7)

where $Z^0 = \{z_1^0, \dots, z_N^0\}$ is the interactive encoding of the global and local sequences.

In general, the Fine-grained Transformer is designed to model fine-grained relative dependencies within an image patch, and thus extract local semantic features, while the Global Transformer is designed to model global relative dependencies within an image, and thus extract global features.
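As a rough illustration of how the global and local sequences could be constructed, the PyTorch sketch below splits a grayscale image into 16×16 patches and 4×4 sub-patches (the sizes used in Section 4.2); the function name and tensor layout are assumptions for illustration, not the paper's implementation.

```python
import torch

def build_sequences(img, patch=16, sub=4):
    """Split a (B, 1, H, W) image into a global sequence of flattened patch x patch patches
    and, per patch, a local sequence of flattened sub x sub sub-patches (sizes from Sec. 4.2)."""
    B, C, H, W = img.shape
    # (B, C, H/p, W/p, p, p) -> (B, N, C*p*p): global sequence X
    tiles = img.unfold(2, patch, patch).unfold(3, patch, patch).permute(0, 2, 3, 1, 4, 5)
    global_seq = tiles.reshape(B, -1, C * patch * patch)
    # Re-split every patch into (p/s)^2 sub-patches: local sequences X_i
    patches = tiles.reshape(-1, C, patch, patch)
    subs = patches.unfold(2, sub, sub).unfold(3, sub, sub).permute(0, 2, 3, 1, 4, 5)
    local_seq = subs.reshape(B, -1, (patch // sub) ** 2, C * sub * sub)
    return global_seq, local_seq

x = torch.rand(2, 1, 256, 256)
g, l = build_sequences(x)
print(g.shape, l.shape)  # torch.Size([2, 256, 256]) torch.Size([2, 256, 16, 16])
```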

3.2.3 Loss Function

We expect our network to learn more than just pixel-level image reconstruction; it should also sufficiently capture the structural and gradient information in the images. Therefore, we use a loss function with three components,

$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \alpha\,\mathcal{L}_{\mathrm{SSIM}} + \beta\,\mathcal{L}_{\mathrm{TV}}$,   (8)

where $\mathcal{L}_{\mathrm{MSE}}$ is the Mean Square Error (MSE) loss, $\mathcal{L}_{\mathrm{SSIM}}$ denotes the Structural Similarity (SSIM) loss, $\mathcal{L}_{\mathrm{TV}}$ represents the Total Variation (TV) loss Hou et al. (2020), and $\alpha$ and $\beta$ are two coefficients used to balance the loss terms, both empirically set to 20 in our experiments.

The MSE loss is used for pixel-level reconstruction of images and it is defined as

$\mathcal{L}_{\mathrm{MSE}} = \dfrac{1}{HW}\sum_{x=1}^{H}\sum_{y=1}^{W}\big(I_{out}(x,y) - I_{in}(x,y)\big)^2$,   (9)

where $I_{out}$ denotes the output image of the decoder, $I_{in}$ represents the input image, and $H$ and $W$ are the image height and width.

The SSIM loss is used to make the network learn structural information of the images, and it is defined as

$\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(I_{out}, I_{in})$,   (10)
$\mathrm{SSIM}(I_{out}, I_{in}) = \dfrac{(2\mu_{out}\mu_{in} + C_1)(2\sigma_{out,in} + C_2)}{(\mu_{out}^2 + \mu_{in}^2 + C_1)(\sigma_{out}^2 + \sigma_{in}^2 + C_2)}$,   (11)

where $\mu$ and $\sigma$ denote the mean and the standard deviation, respectively, and $\sigma_{out,in}$ is the correlation between $I_{out}$ and $I_{in}$. $C_1$ and $C_2$ are two very small constants, empirically set to 0.02 and 0.06, respectively. The standard deviation of the Gaussian window is empirically set to 1.5.

The TV loss is used to preserve the gradient information in the images and further eliminate the noise during image reconstruction, and it is defined as follows,

$R(x, y) = I_{in}(x, y) - I_{out}(x, y)$,   (12)
$\mathcal{L}_{\mathrm{TV}} = \sum_{x,y}\Big(\big\|R(x, y+1) - R(x, y)\big\|_2 + \big\|R(x+1, y) - R(x, y)\big\|_2\Big)$,   (13)

where $R$ is the difference between the original image and the reconstructed image, $\|\cdot\|_2$ is the $\ell_2$ norm, and $x$, $y$ represent the horizontal and vertical coordinates of the image pixels, respectively.
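The sketch below gives a minimal PyTorch version of this composite loss, assuming images normalized to [0, 1], an external SSIM implementation (e.g., the pytorch_msssim package), and the weights α = β = 20 from Eq. (8); the TV term uses absolute pixel differences as a simple stand-in for the ℓ2 form in Eq. (13).

```python
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation (assumed available)

def reconstruction_loss(out, target, alpha=20.0, beta=20.0):
    """MSE + alpha * (1 - SSIM) + beta * TV of the residual, following Eqs. (8)-(13)."""
    mse = F.mse_loss(out, target)                               # Eq. (9)
    ssim_loss = 1.0 - ssim(out, target, data_range=1.0)         # Eq. (10), images in [0, 1]
    r = target - out                                            # residual image R, Eq. (12)
    tv = (r[:, :, :, 1:] - r[:, :, :, :-1]).abs().mean() + \
         (r[:, :, 1:, :] - r[:, :, :-1, :]).abs().mean()        # horizontal + vertical differences
    return mse + alpha * ssim_loss + beta * tv
```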

Figure 2: The three image transformations used to destroy a source image in the destruction-reconstruction self-supervised auxiliary tasks. (a) Pixel intensity non-linear transformation for multi-modal image fusion. The first column shows the original image and one of its subregions, and the second to fifth columns show the image subregions after different non-linear transformations together with the corresponding Bézier transform curves. (b) Brightness transformation for multi-exposure image fusion. The first column shows the original image and one of its subregions, and the second and third columns show the image subregions after the Gamma transform together with the corresponding Gamma transform curves. (c) Noise transformation for multi-focus image fusion. The first column shows the original image and one of its subregions, and the second column shows the image subregion after the Gaussian blur transform together with the corresponding Gaussian function.

3.3 Task-Specific Self-Supervised Training Scheme

In this section, we introduce the three destruction-reconstruction auxiliary tasks designed for the different image fusion tasks. For multi-modal, multi-exposure and multi-focus image fusion, the auxiliary task is based on pixel intensity non-linear transformation, brightness transformation and noise transformation, respectively. Specifically, for each subregion $P_i$ in the set $\mathcal{P}$ to be transformed, we randomly apply the three transforms to obtain the set of transformed subregions $\hat{\mathcal{P}}$ and then the transformed input image $\hat{I}$. The transformed input image $\hat{I}$ is input into the encoder-decoder network for reconstruction. This kind of destruction-reconstruction auxiliary task enables our network to be trained on large natural image datasets and learn task-specific features at the same time.

3.3.1 Pixel Intensity Non-Linear Transformation for Multi-Modal Image Fusion

We design a novel self-supervised auxiliary task based on pixel intensity non-linear transformation for multi-modal image fusion. In multi-modal image fusion, different source images contain different kinds of modality-specific information, and we hope that the most important information from each modality is retained in the fused image. Following Xu et al. (2020a), we study two scenarios of multi-modal image fusion: infrared and visible image fusion, and multi-modal medical image fusion. In the former scenario, the most significant information is the thermal radiation in infrared images and the structural semantic information in visible images Xu et al. (2020a); Ma et al. (2019). In the latter scenario, the most significant information is the functional response and the structural anatomical information in medical images Buzug (2011); Forbes (2012). All of this important fusion information is reflected in the pixel intensity distribution of the source images Xu et al. (2020a). Therefore, we propose a pixel intensity non-linear transformation that first destroys the pixel intensity distribution of the source images and then trains the network to reconstruct the original pixel intensities. By doing this, our network can effectively learn the pixel intensity information in the source images.

Concretely, we use a smooth monotonic third-order Bézier curve Mortenson (1999) composed of four control points to implement the non-linear transformation. The four control points include two endpoints ($P_0$ and $P_3$) and two midpoints ($P_1$ and $P_2$), and the transformation is defined as

$B(t) = (1-t)^3 P_0 + 3(1-t)^2 t\,P_1 + 3(1-t)\,t^2 P_2 + t^3 P_3, \quad t \in [0, 1]$,   (14)
$\hat{P}_i = \mathrm{interp}(P_i, B)$,   (15)

where $t$ is a fractional value along the length of the line and interp indicates the interpolation function that remaps the pixel intensities of a subregion through the curve. Fig. 2 (a) illustrates an original image subregion and the image subregions transformed by different Bézier transform curves. Specifically, we set the two endpoints $P_0$ and $P_3$ to obtain a monotonically increasing Bézier transform curve, and then randomly flip the curve to obtain a monotonically decreasing one. The midpoints $P_1$ and $P_2$ are generated randomly for more variance. As shown in columns 2 and 4, when the midpoints coincide with the two endpoints, the transformation function is linear. The midpoints in columns 3 and 5 are randomly generated for more variance. Note that with a transformation curve like those in columns 4 and 5, the pixel intensity can be reversed.
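A minimal NumPy sketch of such a Bézier-based intensity destruction is given below, assuming pixel values normalized to [0, 1]; the fixed endpoints, random midpoints and helper names are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def bezier_curve(p0, p1, p2, p3, n_points=1000):
    """Sample a cubic Bezier curve defined by four 2-D control points (Eq. 14)."""
    t = np.linspace(0.0, 1.0, n_points)
    pts = ((1 - t) ** 3)[:, None] * p0 + 3 * ((1 - t) ** 2 * t)[:, None] * p1 \
        + 3 * ((1 - t) * t ** 2)[:, None] * p2 + (t ** 3)[:, None] * p3
    return pts[:, 0], pts[:, 1]

def nonlinear_transform(region, reverse=False):
    """Remap pixel intensities of a [0, 1] subregion through a random monotonic Bezier curve."""
    p0, p3 = np.array([0.0, 0.0]), np.array([1.0, 1.0])            # fixed endpoints
    p1, p2 = np.random.rand(2), np.random.rand(2)                  # random midpoints
    xs, ys = bezier_curve(p0, p1, p2, p3)
    if reverse:                                                    # random flip -> decreasing curve
        ys = ys[::-1]
    xs, order = np.unique(xs, return_index=True)                   # np.interp needs increasing x
    return np.interp(region, xs, ys[order])

region = np.random.rand(16, 16)
transformed = nonlinear_transform(region, reverse=np.random.rand() < 0.5)
```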

3.3.2 Brightness Transformation for Multi-Exposure Image Fusion

For multi-exposure image fusion, we propose a self-supervised auxiliary task based on brightness transformation, encouraging the network to learn content and structural information at different exposure levels. In general, over-exposed images (images captured with long exposure time) have better content and structural information in dark regions, while under-exposed images (images captured with short exposure time) have better information in bright regions. Therefore, for multi-exposure image fusion, it is crucial to maintain appropriate luminance in the fused image while preserving abundant information Xu et al. (2020a); Ram Prabhakar et al. (2017). In this study, we design a brightness transformation to destroy the luminance of the source images and train the encoder-decoder network to reconstruct the original image. In this process, our network can learn well about the content and structural information of the images at different exposure levels, and thus can learn important fusion information for multi-exposure images.

The brightness transformation is implemented using the Gamma transform, a specific non-linear operation which is widely used to encode and decode brightness or trichromatic values in image and video processing Poynton (2012). The Gamma transform is defined as

$V_{out} = V_{in}^{\gamma}$,   (16)

where $V_{out}$ and $V_{in}$ are the transformed and the original pixel values, respectively. Fig. 2 (b) shows an original image subregion and the image subregions after brightness transformations with two different Gamma transform curves. For each pixel in the selected subregion $P_i$, $\gamma$ is empirically set to 0.3 to compress the brightness or to 3 to enlarge the brightness.
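As a small illustration, a hedged NumPy sketch of this brightness destruction (assuming pixel values normalized to [0, 1] and the two empirical gamma values above) could be:

```python
import numpy as np

def brightness_transform(region, gammas=(0.3, 3.0)):
    """Apply the Gamma transform of Eq. (16) to a [0, 1] subregion,
    randomly choosing between the two empirical gamma values."""
    gamma = float(np.random.choice(gammas))
    return np.clip(region, 0.0, 1.0) ** gamma

destroyed = brightness_transform(np.random.rand(16, 16))
```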

3.3.3 Noise Transformation for Multi-Focus Image Fusion

For multi-focus image fusion, we propose a self-supervised auxiliary task based on noise transformation to enable the network to learn the variations of different depths of field (DoF) and maintain clear detail information. Due to the limitation of a camera’s DoF, it is very difficult to obtain an all-in-focus image within one shot. Objects within the DoF can maintain clear detail information, but the scene content outside the DoF is blurry. The main objective of multi-focus image fusion is to retain clear detail information of objects at different DoFs. We propose the noise transformation to generate locally blurred images for the encoder-decoder network to reconstruct, so that the trained model can learn to reconstruct clear images from locally blurred multi-focus source images.

We implement the noise transformation using Gaussian blur. Mathematically, applying Gaussian blur to an image is the same as convolving the image with a Gaussian function Forsyth and Ponce (2011). In a two-dimensional image, the Gaussian function is defined as

$\hat{P}_i = P_i * G$,   (17)
$G(x, y) = \dfrac{1}{2\pi\sigma^2}\exp\!\Big(-\dfrac{x^2 + y^2}{2\sigma^2}\Big)$,   (18)

where $*$ denotes convolution, $x$ and $y$ are the distances of a point from the origin on the horizontal and vertical axes, respectively, and $\sigma$ is the standard deviation of the Gaussian distribution, which we empirically set to three. Fig. 2 (c) shows an original image subregion and the subregion after the Gaussian blur transform.
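The corresponding destruction can be sketched in one call using SciPy's Gaussian filter with σ = 3, as assumed below; the helper name is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_transform(region, sigma=3.0):
    """Blur a subregion with a Gaussian kernel (Eqs. 17-18) to mimic an out-of-focus area."""
    return gaussian_filter(region, sigma=sigma)

blurred = noise_transform(np.random.rand(16, 16))
```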

Figure 3: We integrate the three transformations designed for different fusion tasks using a probability-based combination strategy, enabling the network to learn task-specific features while extracting more generalizable features. The eight possible combinations applied to an image subregion are shown in the figure.

3.3.4 Integrating the Three Transformations in a Unified Framework

We have proposed three task-specific image transformations, and our experiments show that each of them helps improve the fusion performance of the corresponding image fusion task. Here we propose to integrate the three transformations using a probability-based combination strategy, so that the network extracts more generalized features and the fusion performance of each single task can be further improved. Concretely, for a subregion $P_i$ we apply the three transformations in the order of pixel intensity non-linear transformation, brightness transformation and noise transformation, but for each transformation we randomly decide whether to apply or skip it. Thus, we obtain eight possible combinations of transformations applied on an image patch, as illustrated in Fig. 3. In this way, the trained model can handle more diverse input images, including original unchanged images and images transformed by one, two or even three different transformations. The pseudo code for the implementation is shown in Algorithm 1.

Input: subregion $P$ to be transformed; hyperparameter probabilities $p_{nl}$, $p_{br}$, $p_{no}$ for the three task-specific transformations
Output: transformed subregion $P$
Generate $r_1$ uniformly distributed over [0, 1) for the non-linear transformation.
if $r_1$ < $p_{nl}$ then
      apply the pixel intensity non-linear transformation to $P$
end
Generate $r_2$ uniformly distributed over [0, 1) for the brightness transformation.
if $r_2$ < $p_{br}$ then
      apply the brightness transformation to $P$
end
Generate $r_3$ uniformly distributed over [0, 1) for the noise transformation.
if $r_3$ < $p_{no}$ then
      apply the noise transformation to $P$
end
Algorithm 1 Integrating the three task-specific transformations by probability-based combination in a Python-like style
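Putting the pieces together, Algorithm 1 could be realized in Python roughly as follows, reusing the hypothetical helper functions sketched in Sections 3.3.1-3.3.3 and the probability of 0.6 mentioned in Section 4.2.

```python
import numpy as np

def destroy_subregion(region, p_nl=0.6, p_br=0.6, p_no=0.6):
    """Randomly compose the three task-specific destructions on one subregion (Algorithm 1)."""
    if np.random.rand() < p_nl:
        region = nonlinear_transform(region, reverse=np.random.rand() < 0.5)  # pixel intensity non-linear transformation
    if np.random.rand() < p_br:
        region = brightness_transform(region)                                 # brightness (Gamma) transformation
    if np.random.rand() < p_no:
        region = noise_transform(region)                                      # noise (Gaussian blur) transformation
    return region
```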

3.4 Fusion Rule

Due to the strong feature extraction capability of our network, fairly simple fusion rules can achieve very good fusion results. For the multi-exposure and multi-focus image fusion tasks, we directly average the feature maps of the two source images to obtain the fused feature maps. For multi-modal image fusion, we adopt the l1-norm fusion rule used by Li et al. Li and Wu (2018), which adaptively highlights and preserves the critical feature information in the fused feature maps according to the region energy of the feature maps.
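The sketch below illustrates the two fusion rules on encoder feature maps; the window-averaged l1-norm weighting follows the spirit of the DenseFuse rule cited above, and the window size of 3 is an assumption.

```python
import torch.nn.functional as F

def average_fusion(f1, f2):
    """Simple averaging rule used for multi-exposure and multi-focus fusion."""
    return 0.5 * (f1 + f2)

def l1_norm_fusion(f1, f2, window=3):
    """Region-energy (l1-norm) weighted rule used for multi-modal fusion.
    f1, f2: feature maps of shape (B, C, H, W)."""
    a1 = f1.abs().sum(dim=1, keepdim=True)                           # per-pixel l1 activity
    a2 = f2.abs().sum(dim=1, keepdim=True)
    a1 = F.avg_pool2d(a1, window, stride=1, padding=window // 2)     # region energy
    a2 = F.avg_pool2d(a2, window, stride=1, padding=window // 2)
    w1 = a1 / (a1 + a2 + 1e-8)                                       # soft weights
    return w1 * f1 + (1.0 - w1) * f2
```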

4 Experimental Results

4.1 Datasets

We used the large natural image dataset MS-COCO Lin et al. (2014) to train the encoder-decoder network, which contains more than 70,000 natural images of various scenes. All images were resized to 256×256 and converted to grayscale.

We used the following datasets to evaluate our image fusion framework and compare it to other methods on the different image fusion tasks. For multi-modal image fusion, we used the TNO dataset (https://figshare.com/articles/TNOImageFusionDataset/1008029) for infrared and visible image fusion and the Harvard dataset (http://www.med.harvard.edu/AANLIB/home.html) for multi-modal medical image fusion. For multi-exposure image fusion, we used the dataset in Cai et al. (2018a), and for multi-focus image fusion we used the Lytro dataset (https://mansournejati.ece.iut.ac.ir/content/lytro-multi-focus-dataset). From each dataset, we randomly selected 20 pairs of source images for testing.

4.2 Implementation Details

Our model was trained on an NVIDIA RTX 3090 GPU with a batch size of 64 for 70 epochs, using the Adam optimizer and a cosine annealing learning rate schedule with an initial learning rate of 1e-4 and a weight decay of 5e-4. Given a 256×256 training image, we randomly generate four image subregions of size 16×16 to form the set $\mathcal{P}$ to be transformed. In TransBlock, we divide the transformed input image into 256 patches of size 16×16 to construct the global sequence, and further divide each patch into 16 sub-patches of size 4×4 to construct the local sequence. We set all the probabilities $p_{nl}$, $p_{br}$ and $p_{no}$ to 0.6.
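For concreteness, the optimizer and scheduler described above could be set up roughly as follows in PyTorch; the placeholder model and the annealing horizon of 70 epochs are assumptions for illustration.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)  # placeholder for the actual encoder-decoder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=70)

for epoch in range(70):
    # ... iterate over MS-COCO batches, destroy random subregions, minimize the reconstruction loss ...
    scheduler.step()
```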

4.3 Evaluation Metrics

In current image fusion research, evaluating an image fusion algorithm is not a simple task due to the lack of ground-truth fusion results. Two evaluation approaches are widely adopted, namely subjective evaluation and objective evaluation Zhang (2021). Subjective evaluation assesses the fused image in terms of sharpness, luminance, contrast, etc. from the perspective of the observer. Objective evaluation assesses the fused images through objective evaluation metrics, but there is no consensus on the choice of metrics. Therefore, in order to provide a fair and comprehensive comparison with other fusion methods, we selected nine objective evaluation metrics focusing on four different aspects of the fused images and compared our method with state-of-the-art conventional and deep learning methods on each fusion task.

The evaluation metrics include: information theory-based metrics Hou et al. (2020); Parmar et al. (2018); Prakash et al. (2019); image feature-based metrics Buzug (2011); Forbes (2012); image structural similarity-based metrics Poynton (2012); Gidaris et al. (2018); and human perception inspired metrics Forsyth and Ponce (2011); Chen and Varshney (2007). Among these quantitative evaluation metrics, a smaller value indicates better fusion performance for one of the metrics, while a larger value indicates better performance for all the others. For every fusion task, the average of the objective metrics over the 20 test image pairs is reported.

4.4 Results

4.4.1 Comparison with Unified Image Fusion Frameworks

Our model is a unified image fusion framework, so we first compared it with existing state-of-the-art unified image fusion algorithms U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b), and PMGI Zhang et al. (2020a) in all the four image fusion tasks. The objective metrics of unified fusion methods are shown in Table 1, where the four different tasks infrared-visible image fusion, multi-modal medical image fusion, multi-exposure image fusion and multi-focus image fusion are denoted as IV, MED, ME and MF, respectively. It is apparent from Table 1 that our method achieves the best fusion performance in almost all metrics. In subsequent experimental comparisons on each task, the three unified fusion methods are still included, so the comparison with their subjective fusion results is discussed in detail in the following sections.

Task Method
IV U2Fusion 0.3546 0.8055 0.4380 0.4502 0.3331 0.9196 0.7161 0.5637 451.9610
IFCNN 0.3985 0.8050 0.5193 0.4670 0.3212 0.8726 0.7574 0.2701 436.1472
PMGI 0.3510 0.8049 0.4022 0.3461 0.1862 0.7471 0.6164 0.4896 706.8626
ours 0.4223 0.8083 0.5773 0.4986 0.3689 0.8733 0.7605 0.4054 280.1503
MED U2Fusion 0.3471 0.8067 0.1664 0.4658 0.3550 0.8980 0.4174 0.4535 1166.1936
IFCNN 0.4192 0.8071 0.2094 0.5232 0.3797 0.9183 0.4634 0.5134 933.5277
PMGI 0.2993 0.8068 0.1423 0.2756 0.2326 0.8188 0.3469 0.5166 1532.5396
ours 0.3604 0.8094 0.2301 0.5275 0.4620 0.9046 0.5073 0.5201 936.7991
ME U2Fusion 0.4411 0.8146 0.4436 0.6146 0.7055 0.9349 0.6340 1.2437 174.8057
IFCNN 0.5058 0.8141 0.6753 0.6890 0.6926 0.9445 0.7366 0.5889 192.5404
PMGI 0.4157 0.8150 0.4119 0.4578 0.5088 0.9083 0.5981 0.7557 297.7641
ours 0.5340 0.8271 0.6703 0.7322 0.7665 0.9579 0.7118 1.1042 192.2550
MF U2Fusion 0.4653 0.8269 0.4650 0.6637 0.7936 0.9574 0.8487 1.1727 102.8688
IFCNN 0.5438 0.8287 0.5845 0.7103 0.8317 0.9848 0.8971 0.9915 64.0155
PMGI 0.4366 0.8255 0.3860 0.4610 0.5417 0.8532 0.6555 0.9405 373.6547
ours 0.5620 0.8367 0.6666 0.7560 0.8374 0.9878 0.9049 0.9935 56.4455
Table 1: Comparison of objective evaluations for unified image fusion tasks. Bolded red and bolded blue are used to denote the best and the second best values, respectively.

4.4.2 Multi-Modal Image Fusion

Visible and Infrared Image Fusion

We compared our method with nine representative methods in infrared and visible image fusion task, including traditional methods (DWT Li et al. (1995), JSR Liu et al. (2017), JSRSD Liu et al. (2017)) and deep learning-based methods (U2Fusion Xu et al. (2020a), DeepFuse Ram Prabhakar et al. (2017), DenseFuse Li and Wu (2018), FusionGAN Ma et al. (2019), IFCNN Zhang et al. (2020b), PMGI Zhang et al. (2020a)). The fusion results are shown in Fig. 4, and the comparisons of objective evaluation metrics are shown in Table 2.

Figure 4: Infrared and visible source images and their fusion results, where (a1)-(b1), (a2)-(b2), (a3)-(b3), (a4)-(b4) are the infrared and visible source image pairs, respectively, and (c1)-(l1), (c2)-(l2), (c3)-(l3), (c4)-(l4) are the fusion results of the compared methods.
Subjective Evaluation

In Fig. 4, JSR, JSRSD, FusionGAN and PMGI show relatively disappointing fusion results due to noise and artifacts. DWT, DeepFuse and DenseFuse show relatively similar fusion results, where the fused images have low contrast, blurred details and blurred edges. U2Fusion, IFCNN and our method all achieve satisfying fusion results, while our method preserves sharper texture details with higher contrast and sharpness. As can be seen from the red boxes in Fig. 4, our method highlights the key targets (the humans in (a1)-(b1), (a2)-(b2) and (a3)-(b3) and the house in (a4)-(b4)) and has the best visual effect.

Objective Evaluation

As can be seen in Table 2, among all nine objective metrics, our method achieves the best results in four metrics and the second best in another four, which is the best overall among all compared methods. DenseFuse is specially designed for this task and also shows strong performance.

Task Method
IV DWT 0.3544 0.8050 0.4991 0.4005 0.3038 0.8401 0.7051 0.2321 457.8833
JSR 0.2166 0.8055 0.3754 0.4060 0.2121 0.8475 0.5992 0.3448 350.9396
JSRSD 0.1924 0.8056 0.3089 0.3746 0.1524 0.7636 0.5219 0.2778 391.0290
U2Fusion 0.3546 0.8055 0.4380 0.4502 0.3331 0.9196 0.7161 0.5637 451.9610
DeepFuse 0.3718 0.8051 0.4317 0.3746 0.3034 0.8473 0.6782 0.2523 458.7481
DenseFuse 0.4157 0.8083 0.5237 0.4754 0.3740 0.8131 0.7885 0.1978 275.0618
FusionGAN 0.2937 0.8080 0.3355 0.1923 0.0859 0.3963 0.3362 0.0797 2291.7420
IFCNN 0.3985 0.8050 0.5193 0.4670 0.3212 0.8726 0.7574 0.2701 436.1472
PMGI 0.3510 0.8049 0.4022 0.3461 0.1862 0.7471 0.6164 0.4896 706.8626
ours 0.4223 0.8083 0.5773 0.4986 0.3689 0.8733 0.7605 0.4054 280.1503
Table 2: Comparison of objective metrics on the infrared and visible image fusion dataset. The best values are bolded in red and the second best values are bolded in blue.
Medical Image Fusion

We compared our method with several state-of-the-art methods on the medical image fusion task, including the traditional methods DWT Li et al. (1995), NSCT Zhang and Guo (2009) and PAPCNN Yin et al. (2018), and the deep learning-based methods U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b) and PMGI Zhang et al. (2020a). The fusion results are shown in Fig. 5, and the comparison of objective evaluation metrics is shown in Table 3.

Figure 5: Medical source images and their fusion results, where (a1)-(b1) are the medical source image pair in the "CT" and "MRI" modalities, and (a2)-(b2), (a3)-(b3), (a4)-(b4) are the medical source image pairs in the "T1" and "T2" modalities. (c1)-(i1), (c2)-(i2), (c3)-(i3) and (c4)-(i4) are the results of the compared methods.
Subjective Evaluation

The fusion results in Fig. 5 show that DWT, NSCT and PMGI obtain blurred fused images with low contrast and sharpness. PAPCNN introduces severe artifacts and noise. Although U2Fusion and IFCNN enhance the edges and detail information, their fused images have low contrast and inferior visual effect. In contrast, our method retains the best texture and functional information, while showing the optimal contrast and details.

Objective Evaluation

As can be seen in Table 3, our method achieves the best performance in six metrics and the second best performance in three metrics with a small difference from the best ones on the medical Harvard test dataset. In general, our method achieves the best fusion performance.

Task Method
MED DWT 0.3588 0.8072 0.1858 0.4336 0.3409 0.8827 0.4960 0.4684 1143.3950
PAPCNN 0.3518 0.8066 0.1634 0.4821 0.3299 0.8796 0.4250 0.5039 1078.2390
U2Fusion 0.3471 0.8067 0.1664 0.4658 0.3550 0.8980 0.4174 0.4535 1166.1936
IFCNN 0.4192 0.8071 0.2094 0.5232 0.3797 0.9183 0.4634 0.5134 933.5277
PMGI 0.2993 0.8068 0.1423 0.2756 0.2326 0.8188 0.3469 0.5166 1532.5396
ours 0.3604 0.8094 0.2301 0.5275 0.4620 0.9046 0.5073 0.5201 936.7991
Table 3: Comparison of objective metrics on the medical image fusion dataset. The best values are bolded in red and the second best values are bolded in blue.
Multi-Exposure Image Fusion

We compared our method with several state-of-the-art methods on the multi-exposure image fusion task, including traditional methods (DWT Li et al. (1995), JSRSD Liu et al. (2017)) and deep learning-based methods (DeepFuse Ram Prabhakar et al. (2017), U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b), PMGI Zhang et al. (2020a)). The fusion results are shown in Fig. 6, and the comparison of objective evaluation metrics is shown in Table 4.

Figure 6: Multi-exposure source images and their fusion results, where (a1)-(b1), (a2)-(b2), (a3)-(b3), (a4)-(b4) are multi-exposure source image pairs and (c1)-(i1), (c2)-(i2), (c3)-(i3), (c4)-(i4) are fusion results of the comparative methods respectively.
Subjective Evaluation

Fig. 6 shows that JSRSD introduces severe noise and artifacts. DWT, DeepFuse and IFCNN fail to maintain appropriate luminance, losing contrast and details. U2Fusion, PMGI and our method achieve better fusion results. However, U2Fusion over-sharpens the details and edge information, resulting in an unnatural visual effect, and PMGI loses more details. In contrast, our fused images maintain the best luminance with high contrast and sharpness. Overall, our fused images are more natural and achieve the best visual effect.

Objective Evaluation

As can be seen in Table 4, our method achieves the best performance in five metrics and the second best in the other four metrics on the multi-exposure image fusion dataset. Overall, our method achieves the best fusion performance.

Task Method
ME DWT 0.4727 0.8174 0.6077 0.5978 0.6601 0.9227 0.6933 0.4952 223.4195
JSRSD 0.3026 0.8150 0.5428 0.6174 0.4461 0.9001 0.6771 0.8074 307.0123
U2Fusion 0.4411 0.8146 0.4436 0.6146 0.7055 0.9349 0.6340 1.2437 174.8057
DeepFuse 0.4462 0.8163 0.4289 0.5253 0.6330 0.9115 0.6501 0.6306 217.6652
IFCNN 0.5058 0.8141 0.6753 0.6890 0.6926 0.9445 0.7366 0.5889 192.5404
PMGI 0.4157 0.8150 0.4119 0.4578 0.5088 0.9083 0.5981 0.7557 297.7641
ours 0.5340 0.8271 0.6703 0.7322 0.7665 0.9579 0.7118 1.1042 192.2550
Table 4: Comparison of objective metrics on the multi-exposure image fusion dataset. The best values are bolded in red and the second best values are bolded in blue.
Multi-Focus Image Fusion

We compared our method with several representative methods on the multi-focus image fusion task, including traditional methods (DWT Li et al. (1995), JSRSD Liu et al. (2017)) and deep learning-based methods (DeepFuse Ram Prabhakar et al. (2017), SESF-Fuse Ma et al. (2021), U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b), PMGI Zhang et al. (2020a)). The fusion results are shown in Fig. 7, and the comparison of objective evaluation metrics is shown in Table 5.

Figure 7: Multi-focus source images and their fusion results, where (a1)-(b1), (a2)-(b2), (a3)-(b3), (a4)-(b4) are multi-focus source image pairs and (c1)-(j1), (c2)-(j2), (c3)-(j3), (c4)-(j4) are fusion results of the comparative methods.
Subjective Evaluation

From Fig. 7, JSRSD and PMGI introduce noise and obtain unsatisfactory fusion results. Although DWT, DeepFuse and SESF-Fuse successfully fuse the images, they still lose detailed information, such as the flag and the edge contours of the child marked in the red boxes. U2Fusion, IFCNN and our method all achieve satisfying fusion results.

Objective Evaluation

We did not compute the objective metrics for JSRSD owing to the artifacts in its fused images. As can be seen in Table 5, our method achieves the best performance in six metrics and the second best in the other three metrics on the multi-focus image fusion dataset. Overall, our method achieves the best fusion performance.

Task Method
MF U2Fusion 0.4653 0.8269 0.4650 0.6637 0.7936 0.9574 0.8487 1.1727 102.8688
DeepFuse 0.4868 0.8292 0.4960 0.7104 0.7757 0.9812 0.8608 0.9033 71.3387
IFCNN 0.5438 0.8287 0.5845 0.7103 0.8317 0.9848 0.8971 0.9915 64.0155
SESF-Fuse 0.5565 0.8328 0.7138 0.7473 0.8357 0.9877 0.9039 0.8613 49.1273
PMGI 0.4366 0.8255 0.3860 0.4610 0.5417 0.8532 0.6555 0.9405 373.6547
ours 0.5620 0.8367 0.6666 0.7560 0.8374 0.9878 0.9049 0.9935 56.4455
Table 5: Comparison of objective metrics on the multi-focus image fusion dataset. The best values are bolded in red and the second best values are bolded in blue.

5 Ablation Study

5.1 Ablation Study for TransBlock

To address the difficulty of CNN-based image fusion architectures in establishing long-range dependencies, we design TransBlock, which combines CNN with Transformer to enable the network to exploit both local and global information. To verify the effectiveness of TransBlock, we perform ablation experiments on all fusion tasks using 20% of the training data; the results are shown in Table 6. "3 Transformations" in the table denotes training with the proposed three task-specific image transformations and the probability-based combination strategy. As shown in Fig. 1 (a), we remove the Transformer-Module in TransBlock along with the lower branch of the EnhanceBlock and use only the CNN-Module and the upper branch of the EnhanceBlock for comparison. As shown in Table 6, in almost all fusion tasks, no matter whether the three transformations are used or not, adding TransBlock always improves the performance, which demonstrates its effectiveness.

Task TransBlock 3 Transformations
IV 0.2775 0.8075 0.4669 0.4692 0.3265 0.8745 0.7158 0.3873 295.9092
0.3135 0.8085 0.5172 0.4993 0.3606 0.8298 0.7773 0.3088 301.0812
0.3204 0.8079 0.5035 0.4857 0.3592 0.8748 0.7484 0.3879 291.0118
0.4151 0.8086 0.5852 0.5068 0.3693 0.8650 0.7792 0.3714 283.3448
Task TransBlock 3 Transformations
MED 0.2373 0.8087 0.1837 0.5254 0.4156 0.8891 0.4661 0.4998 860.2159
0.2373 0.8087 0.2049 0.3117 0.4500 0.8891 0.4865 0.4998 923.1668
0.2838 0.8084 0.1971 0.5177 0.4491 0.9088 0.4897 0.5347 868.1148
0.3583 0.8093 0.2214 0.5280 0.4665 0.9009 0.4997 0.5147 938.6183
Task TransBlock 3 Transformations
ME 0.3269 0.8178 0.4540 0.6446 0.6345 0.9538 0.6581 1.0803 192.0676
0.3723 0.8185 0.5159 0.6886 0.6973 0.9628 0.7070 1.0048 185.4562
0.3917 0.8201 0.5162 0.6911 0.7107 0.9606 0.6962 1.0959 177.6318
0.5199 0.8262 0.6482 0.7318 0.7642 0.9594 0.7155 1.0930 191.1575
Task TransBlock 3 Transformations
MF 0.3656 0.8305 0.4660 0.6920 0.7464 0.9826 0.8482 0.9903 79.0899
0.4063 0.8315 0.5017 0.7186 0.7950 0.9839 0.8689 0.9950 72.6904
0.4241 0.8324 0.5167 0.7244 0.8032 0.9845 0.8797 1.0108 68.7478
0.5478 0.8361 0.6391 0.7530 0.8357 0.9875 0.9031 0.9928 58.7026
Table 6: Results of the TransBlock ablation tests using 20% of the training data.
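To give a concrete, simplified picture of what this ablation removes, the sketch below pairs a small CNN branch with a ViT-style Transformer branch over non-overlapping patches and includes a switch that drops the Transformer branch, leaving a CNN-only encoder as in the ablated variant. The layer widths, the patch size and the 1x1-convolution fusion of the two branches are illustrative assumptions and are much simpler than the actual TransBlock and EnhanceBlock designs.

```python
import torch
import torch.nn as nn

class TransBlockSketch(nn.Module):
    """Illustrative sketch only: a CNN branch plus a ViT-style Transformer branch.
    Sizes and the branch-fusion step are assumptions, not the exact TransBlock."""

    def __init__(self, in_ch=1, feat_ch=64, patch=16, use_transformer=True):
        super().__init__()
        self.use_transformer = use_transformer
        # CNN-Module: local feature extraction (hypothetical depth/width)
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        if use_transformer:
            # Transformer-Module: global context over non-overlapping patches
            self.patch = patch
            dim = in_ch * patch * patch
            self.embed = nn.Linear(dim, feat_ch)
            enc_layer = nn.TransformerEncoderLayer(d_model=feat_ch, nhead=4,
                                                   batch_first=True)
            self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.unembed = nn.Linear(feat_ch, dim)
            self.fuse = nn.Conv2d(feat_ch + in_ch, feat_ch, 1)

    def forward(self, x):
        local_feat = self.cnn(x)
        if not self.use_transformer:          # ablated variant: CNN-Module only
            return local_feat
        b, c, h, w = x.shape
        p = self.patch
        # (B, C, H, W) -> (B, num_patches, C*p*p) token sequence
        tokens = x.unfold(2, p, p).unfold(3, p, p)
        tokens = tokens.contiguous().view(b, c, -1, p * p)
        tokens = tokens.permute(0, 2, 1, 3).reshape(b, -1, c * p * p)
        glob = self.unembed(self.transformer(self.embed(tokens)))
        # fold the tokens back into an image-shaped global feature map
        glob = glob.reshape(b, h // p, w // p, c, p, p)
        glob = glob.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
        # combine local (CNN) and global (Transformer) information
        return self.fuse(torch.cat([local_feat, glob], dim=1))
```

The CNN-only comparison in Table 6 corresponds to instantiating such a block with use_transformer=False, so that only the local branch contributes to the extracted features.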

To further explain the effectiveness of TransBlock, we visualize the image reconstruction results obtained with the CNN architecture alone (only the CNN-Module in TransBlock) and with TransBlock in Fig. 8. We can see that the image reconstructed with the CNN alone loses some texture and detail of the original image, such as in the typical areas shown in the red boxes. In contrast, the image reconstructed with TransBlock is sharper and preserves more detail, which indicates that the encoding network with TransBlock performs better at extracting both local and global features of the image.

Figure 8: Visualization of the image reconstruction results using solely CNN and using TransBlock in the encoder. The top row is the original input image, the middle row is the image reconstruction results using traditional CNN architecture, and the bottom row is the image reconstruction results using TransBlock. The second and the fourth columns are the enlarged views of the red boxes shown in the first and the third columns, respectively.

5.2 Ablation Study for Task-specific Self-Supervised Training Scheme

We design task-specific self-supervised auxiliary tasks for multi-modal image fusion, multi-exposure image fusion, and multi-focus image fusion, which are respectively based on pixel intensity non-linear transformation, brightness transformation, and noise transformation. In this experiment, we evaluate the effectiveness of each auxiliary task and study whether performance is further improved by combining all three transformations in our proposed model. The results are shown in Table 7. Most of the fusion metrics are improved by using the corresponding self-supervised auxiliary task, which indicates the effectiveness of each of them. The effectiveness of our proposed combination strategy is demonstrated by the fact that most of the objective fusion metrics are further improved by the proposed model.
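For illustration, a minimal sketch of the probability-based combination strategy is given below: one of the three destruction transformations is drawn at random for each training image and later applied to its selected subregions. The gamma range, brightness-shift range, noise level and the equal sampling probabilities are assumed values for this sketch rather than the exact settings used in our experiments.

```python
import random
import numpy as np

def nonlinear_transform(x):
    """Pixel-intensity non-linear (gamma-like) transform, aimed at multi-modal fusion."""
    gamma = random.uniform(0.5, 2.0)                 # assumed range
    return np.clip(255.0 * (x / 255.0) ** gamma, 0, 255)

def brightness_transform(x):
    """Brightness shift, aimed at multi-exposure fusion."""
    shift = random.uniform(-80, 80)                  # assumed range
    return np.clip(x + shift, 0, 255)

def noise_transform(x):
    """Additive Gaussian noise, aimed at multi-focus fusion."""
    return np.clip(x + np.random.normal(0, 25, x.shape), 0, 255)  # assumed sigma

AUX_TRANSFORMS = [nonlinear_transform, brightness_transform, noise_transform]

def pick_transform():
    """Probability-based combination: each training image is destroyed by one
    randomly chosen auxiliary transformation (equal probabilities assumed)."""
    return random.choice(AUX_TRANSFORMS)

# One training pair is (destroyed image, original image); the encoder-decoder is
# trained to reconstruct the original from its destroyed counterpart.
```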

Task Nonlinear Brightness Noise
IV 0.3135 0.8085 0.5172 0.4993 0.3606 0.8298 0.7773 0.3088 301.0812
0.3522 0.8086 0.5404 0.5061 0.3674 0.8489 0.7773 0.3451 296.2987
0.4151 0.8086 0.5852 0.5068 0.3693 0.8650 0.7792 0.3714 283.3448
Task Nonlinear Brightness Noise
MED 0.2373 0.8087 0.2049 0.3117 0.4500 0.8891 0.4865 0.4998 923.1668
0.2810 0.8088 0.2203 0.5329 0.4653 0.8985 0.4961 0.5187 915.8077
0.3583 0.8093 0.2214 0.5280 0.4665 0.9009 0.4997 0.5147 938.6183
Task Nonlinear Brightness Noise
ME 0.3723 0.8185 0.5159 0.6886 0.6973 0.9628 0.7070 1.0048 185.4562
0.4141 0.8208 0.5406 0.7084 0.7298 0.9598 0.6991 1.0903 184.2343
0.5199 0.8262 0.6482 0.7318 0.7642 0.9594 0.7155 1.0930 191.1575
Task Nonlinear Brightness Noise
MF 0.4063 0.8315 0.5017 0.7186 0.7950 0.9839 0.8689 0.9950 72.6904
0.4561 0.8334 0.5539 0.7387 0.8235 0.9857 0.8904 1.0081 64.0137
0.5478 0.8361 0.6391 0.7530 0.8357 0.9875 0.9031 0.9928 58.7026
Table 7: Results of the ablation tests on the auxiliary tasks using 20% of the training data.
Task Method
IV N1_R256 0.1947 0.8054 0.3667 0.3672 0.1800 0.8288 0.3382 0.6367 392.0450
N2_R8 0.3307 0.8087 0.5250 0.5019 0.3639 0.8455 0.3423 0.7716 297.5007
N2_R16 0.3319 0.8085 0.5379 0.5053 0.3625 0.8525 0.3253 0.7667 300.5854
N2_R32 0.3282 0.8085 0.5258 0.5005 0.3618 0.8486 0.3363 0.7826 300.3576
N2_R64 0.3122 0.8081 0.5010 0.4952 0.3551 0.8414 0.3214 0.7747 298.4003
N4_R8 0.3163 0.8083 0.5222 0.5007 0.3563 0.8524 0.3331 0.7764 301.0319
N4_R16 0.3405 0.8088 0.5379 0.5035 0.3642 0.8367 0.3270 0.7872 303.8792
N4_R32 0.3169 0.8083 0.5162 0.4973 0.3538 0.8467 0.3328 0.7821 300.4295
N4_R64 0.2702 0.8060 0.4390 0.4528 0.3021 0.8646 0.3448 0.7247 341.9801
N8_R8 0.2702 0.8060 0.5336 0.4528 0.3613 0.8646 0.3448 0.7811 295.8375
N8_R16 0.3272 0.8086 0.5244 0.5034 0.3637 0.8480 0.3446 0.7776 295.2036
N8_R32 0.2964 0.8075 0.4949 0.4826 0.3348 0.8565 0.3246 0.7694 315.0558
N8_R64 0.2575 0.8053 0.4228 0.4184 0.2559 0.8211 0.2717 0.6960 428.7837
MED N1_R256 0.2075 0.8105 0.3116 0.3584 0.2832 0.8848 0.8795 0.4563 306.8308
N2_R8 0.3695 0.8155 0.4446 0.6374 0.5436 0.8341 1.1028 0.7219 214.4595
N2_R16 0.3753 0.8158 0.4490 0.6450 0.5548 0.8380 1.0856 0.7313 219.3057
N2_R32 0.3663 0.8152 0.4321 0.6313 0.5482 0.8334 1.0987 0.7141 211.9355
N2_R64 0.3365 0.8121 0.4374 0.6065 0.4909 0.8277 0.9498 0.6878 305.5588
N4_R8 0.3536 0.8147 0.4365 0.6300 0.5440 0.8392 1.0878 0.7145 219.4811
N4_R16 0.3823 0.8159 0.4556 0.6402 0.5491 0.8272 1.1091 0.7224 224.6704
N4_R32 0.3503 0.8144 0.4311 0.6209 0.5300 0.8366 1.0330 0.7083 220.9103
N4_R64 0.2873 0.8091 0.4061 0.5632 0.3881 0.8151 0.8239 0.6460 259.3240
N8_R8 0.2873 0.8091 0.4363 0.5632 0.5573 0.8151 0.8239 0.7218 212.2687
N8_R16 0.3646 0.8151 0.4336 0.6261 0.5449 0.8323 1.1038 0.7199 214.0531
N8_R32 0.3195 0.8115 0.4452 0.5948 0.4620 0.8441 0.8939 0.6812 325.3530
N8_R64 0.2722 0.8087 0.4074 0.5372 0.3868 0.8361 0.6873 0.6441 435.6439
ME N1_R256 0.1518 0.8061 0.1305 0.2886 0.1937 0.8387 0.4201 0.3354 1525.4510
N2_R8 0.2634 0.8077 0.1568 0.4223 0.3880 0.8930 0.4913 0.4365 1171.2512
N2_R16 0.2636 0.8078 0.1553 0.4114 0.3853 0.8883 0.4844 0.4286 1186.1036
N2_R32 0.2501 0.8076 0.1536 0.4075 0.3814 0.8884 0.4848 0.4240 1199.2137
N2_R64 0.2349 0.8074 0.1520 0.4068 0.3688 0.8853 0.4868 0.4235 1218.6177
N4_R8 0.2562 0.8076 0.1532 0.4103 0.3833 0.8905 0.4857 0.4235 1161.3674
N4_R16 0.2637 0.8078 0.1553 0.4214 0.3859 0.8983 0.4944 0.4286 1186.1036
N4_R32 0.2405 0.8075 0.1521 0.4061 0.3742 0.8900 0.4908 0.4299 1202.8124
N4_R64 0.2011 0.8068 0.1445 0.3632 0.3224 0.8573 0.4441 0.3914 1452.7672
N8_R8 0.2591 0.8077 0.1535 0.4113 0.3857 0.8908 0.4872 0.4296 1175.6538
N8_R16 0.2578 0.8076 0.1542 0.4103 0.3844 0.8926 0.4884 0.4279 1186.3845
N8_R32 0.2293 0.8072 0.1493 0.3753 0.3565 0.8745 0.4626 0.4047 1376.2388
N8_R64 0.1908 0.8063 0.1393 0.3339 0.2599 0.7804 0.3543 0.3720 2190.5549
MF N1_R256 0.2453 0.8219 0.3296 0.4923 0.4733 0.9256 1.0105 0.6825 173.9629
N2_R8 0.4321 0.8328 0.5291 0.7311 0.8086 0.9801 0.9967 0.8823 72.3116
N2_R16 0.4354 0.8329 0.5324 0.7327 0.8102 0.9845 0.9948 0.8850 70.5472
N2_R32 0.4279 0.8326 0.5208 0.7274 0.8058 0.9843 0.9984 0.8820 74.6641
N2_R64 0.4024 0.8292 0.4937 0.7146 0.7901 0.9833 0.9635 0.8657 81.7729
N4_R8 0.4152 0.8322 0.5117 0.7246 0.7993 0.9839 0.9956 0.8733 73.1741
N4_R16 0.4421 0.8331 0.5338 0.7311 0.8103 0.9850 0.9825 0.8839 74.4669
N4_R32 0.4119 0.8315 0.5038 0.7214 0.7970 0.9839 1.0001 0.8751 76.0166
N4_R64 0.3498 0.8256 0.4420 0.6660 0.7205 0.9742 0.8940 0.8151 101.5177
N8_R8 0.3498 0.8256 0.5293 0.6660 0.8100 0.9742 0.8940 0.8819 72.5735
N8_R16 0.4290 0.8326 0.5216 0.7293 0.8087 0.9841 0.9999 0.8823 73.7230
N8_R32 0.3846 0.8295 0.4725 0.7023 0.7746 0.9817 0.9561 0.8510 86.1646
N8_R64 0.3297 0.8222 0.4106 0.6277 0.6543 0.9413 0.7608 0.7628 153.8210
Table 8: Results of ablation tests on the number and size of the transformed subregions.

5.3 Ablation Study for the Number and Area of Transformed Subregion

In our proposed method, we randomly generate four image subregions of size 16×16 from a 256×256 training image to form the set for transformation. Obviously, both the number and the size of the transformed subregions affect the performance of the trained network. For example, when the subregions are too large (up to transforming the whole image), it becomes very difficult for the network to reconstruct the original image; at the same time, because most of the original image is destroyed, the network cannot learn useful features. On the contrary, when the subregions are too small, the effect of the transformation is also small and may not encourage the network to learn better task-specific features. We conducted ablation experiments on different combinations of the number and size of the subregions, and the results are shown in Table 8. In this experiment, we used 20% of the training data and trained for 30 epochs.

In Table 8, N represents the number of transformed subregions and R represents the side length of each subregion. For example, N1_R256 represents transforming the whole 256×256 image, and N4_R16 represents transforming four randomly selected 16×16 regions. It can be seen that N4_R16 achieves the best performance overall.
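To make the roles of N and R explicit, the following sketch destroys N randomly placed R×R subregions of a training image with a given transformation (for example, one drawn by the selection strategy sketched in Section 5.2). Allowing subregions to overlap is an assumption of this sketch and may differ from the exact sampling used in training.

```python
import random
import numpy as np

def destroy(image, transform, num_regions=4, region_size=16):
    """Apply `transform` to `num_regions` randomly placed square subregions of
    side `region_size` (N and R in Table 8). Overlaps are allowed here."""
    out = image.astype(np.float64).copy()
    h, w = out.shape[:2]
    for _ in range(num_regions):
        y = random.randint(0, h - region_size)
        x = random.randint(0, w - region_size)
        patch = out[y:y + region_size, x:x + region_size]
        out[y:y + region_size, x:x + region_size] = transform(patch)
    return out.astype(np.uint8)

# N1_R256 corresponds to destroy(img, t, num_regions=1, region_size=256),
# i.e., transforming the whole image, while the default N4_R16 setting is
# destroy(img, t, num_regions=4, region_size=16) for a hypothetical transform t.
```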

6 Discussion

In this study, we propose TransFuse, a unified Transformer-based image fusion framework trained with self-supervised learning, which can be effectively applied to different image fusion tasks, including multi-modal, multi-exposure and multi-focus image fusion. We propose three destruction-reconstruction self-supervised auxiliary tasks for multi-modal, multi-exposure and multi-focus image fusion and integrate them by randomly choosing one of them to destroy a natural image during model training, which enables our network to be trained on large, readily available natural image datasets while learning task-specific features. In addition, we design a new encoder that combines CNN and Transformer for feature extraction, so that the model can exploit both local and global information more comprehensively. Extensive experiments show that our framework achieves new state-of-the-art performance in both subjective and objective evaluations on all common image fusion tasks.

Notably, the vital information to be fused varies greatly across fusion tasks because the source images have different characteristics. The three destruction-reconstruction self-supervised auxiliary tasks are specially designed according to the characteristics of the source images in each fusion task, so that our framework can learn task-specific features during reconstruction. Furthermore, we integrate the three tasks in model training to encourage different fusion tasks to promote each other and to increase the generalizability of the trained network, enabling our framework to handle different image fusion tasks in a unified way.

Our framework also has some limitations. First, although we utilize parameter-shared Fine-grained Transformers, introducing the Transformer for feature extraction still makes our model (43.18 MB of parameters) larger than existing methods (generally 0.3-3 MB of parameters). Fortunately, a common GPU such as the NVIDIA GTX 1080 Ti is still sufficient for model training. Second, we do not propose new fusion rules for the different fusion tasks; more effective task-specific fusion rules can be designed in the future to further improve fusion performance.

In future work, more effective Transformer-based feature extraction methods can be explored for image fusion tasks. Moreover, since both CNN-based and Transformer-based architectures have their own advantages, how to further combine the two architectures is another promising research direction. Furthermore, more effective self-supervised auxiliary tasks may be proposed to encourage the network to learn more useful features according to the characteristics of different source images.

In addition, our proposed self-supervised tasks and training scheme help the model learn more robust and generalizable features, and our proposed Transformer-based feature extraction modules exploit both local and global information more comprehensively. We therefore expect our method to have broad application prospects and large development potential in image fusion and related computer vision fields.

References

  • I. Achituve, H. Maron, and G. Chechik (2021) Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 123–133. Cited by: §2.2.
  • X. Bai, Y. Zhang, F. Zhou, and B. Xue (2015) Quadtree-based multi-focus image fusion using a weighted focus-measure. Information Fusion 22, pp. 105–118. Cited by: §1.
  • X. Bai, F. Zhou, and B. Xue (2011) Fusion of infrared and visual images through region extraction by using multi scale center-surround top-hat transform. Optics Express 19 (9), pp. 8444–8457. Cited by: §1.
  • G. Bhatnagar, Q. J. Wu, and Z. Liu (2013) Directive contrast based multimodal medical image fusion in nsct domain. IEEE Transactions on Multimedia 15 (5), pp. 1014–1024. Cited by: §1.
  • P. J. Burt and E. H. Adelson (1987) The laplacian pyramid as a compact image code. In Readings in Computer Vision, pp. 671–679. Cited by: §1, §2.1.1.
  • T. M. Buzug (2011) Computed tomography. In Springer Handbook of Medical Technology, pp. 311–342. Cited by: §3.3.1, §4.3.
  • J. Cai, S. Gu, and L. Zhang (2018a) Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing 27 (4), pp. 2049–2062. Cited by: §4.1.
  • J. Cai, S. Gu, and L. Zhang (2018b) Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing 27 (4), pp. 2049–2062. Cited by: §1, §2.1.2.
  • L. Cao, L. Jin, H. Tao, G. Li, Z. Zhuang, and Y. Zhang (2014) Multi-focus image fusion based on spatial frequency in discrete cosine transform domain. IEEE Signal Processing Letters 22 (2), pp. 220–224. Cited by: §1, §2.1.1.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), pp. 213–229. Cited by: §2.3.
  • H. Chen and P. K. Varshney (2007) A human perception inspired quality metric for image fusion based on regional information. Information Fusion 8 (2), pp. 193–207. Cited by: §4.3.
  • Z. Dai, B. Cai, Y. Lin, and J. Chen (2021) UP-DETR: unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1601–1610. Cited by: §2.3.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §1, §2.1.2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.3, §3.2.2, §3.2.2, §3.2.2.
  • G. B. Forbes (2012) Human body composition: growth, aging, nutrition, and activity. Springer Science & Business Media. Cited by: §3.3.1, §4.3.
  • D. Forsyth and J. Ponce (2011) Computer vision: a modern approach.. Prentice hall. Cited by: §3.3.3, §4.3.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §4.3.
  • A. A. Goshtasby and S. G. Nikolov (2007) Guest editorial: image fusion: advances in the state of the art. Information Fusion: Special Issue on Image Fusion: Advances in the State of the Art 8, pp. 114–118. Cited by: §1.
  • A. A. Goshtasby (2005) Fusion of multi-exposure images. Image and Vision Computing 23 (6), pp. 611–618. Cited by: §1.
  • K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang (2021) Transformer in transformer. arXiv preprint arXiv:2103.00112. Cited by: §3.2.2.
  • R. Hou, D. Zhou, R. Nie, D. Liu, L. Xiong, Y. Guo, and C. Yu (2020) VIF-net: an unsupervised framework for infrared and visible image fusion. IEEE Transactions on Computational Imaging 6, pp. 640–651. Cited by: §3.2.3, §4.3.
  • W. Huang and Z. Jing (2007) Evaluation of focus measures in multi-focus image fusion. Pattern Recognition Letters 28 (4), pp. 493–500. Cited by: §1, §2.1.1.
  • L. Jing and Y. Tian (2021) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (11), pp. 4037–4058. External Links: Document Cited by: §2.2.
  • K. A. Johnson and J. A. Becker. The whole brain atlas. Note: Website http://www.med.harvard.edu/AANLIB/home.html Cited by: §2.1.2, §3.2.3.
  • A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1920–1929. Cited by: §2.2.
  • H. Li, B. Manjunath, and S. K. Mitra (1995) Multisensor image fusion using the wavelet transform. Graphical Models and Image Processing 57 (3), pp. 235–245. Cited by: §1, §1, §2.1.1, §4.4.2, §4.4.2, §4.4.2, §4.4.2.
  • H. Li, X. Wu, and J. Kittler (2020) MDLatLRR: a novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing 29, pp. 4733–4746. Cited by: §1.
  • H. Li and X. Wu (2018) DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623. Cited by: §1, §1, §1, §2.1.2, §2.3, §3.4, §4.4.2.
  • S. Li, X. Kang, and J. Hu (2013) Image fusion with guided filtering. IEEE Transactions on Image processing 22 (7), pp. 2864–2875. Cited by: §1, §1, §2.1.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision (ECCV), pp. 740–755. Cited by: §1, §2.1.2, §4.1.
  • C. Liu, Y. Qi, and W. Ding (2017) Infrared and visible image fusion method based on saliency detection in sparse domain. Infrared Physics & Technology 83, pp. 94–102. Cited by: §1, §2.1.1, §4.4.2, §4.4.2, §4.4.2.
  • S. Liu, M. Wang, and Z. Song (2020) WaveFuse: a unified deep framework for image fusion with discrete wavelet transform. arXiv preprint arXiv:2007.14110. Cited by: §1, §1, §1, §2.1.2, §2.3.
  • X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang (2021) Self-supervised learning: generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, pp. 1–1. Cited by: §2.2.
  • Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang (2016) Image fusion with convolutional sparse representation. IEEE Signal Processing Letters 23 (12), pp. 1882–1886. Cited by: §1.
  • X. Luo, Z. Zhang, B. Zhang, and X. Wu (2016) Image fusion with contextual statistical similarity and nonsubsampled shearlet transform. IEEE Sensors Journal 17 (6), pp. 1760–1771. Cited by: §1, §2.1.1.
  • B. Ma, Y. Zhu, X. Yin, X. Ban, H. Huang, and M. Mukeshimana (2021) SESF-fuse: an unsupervised deep model for multi-focus image fusion. Neural Computing and Applications 33 (11), pp. 5793–5804. Cited by: §1, §1, §1, §2.1.2, §2.3, §4.4.2.
  • J. Ma, H. Xu, J. Jiang, X. Mei, and X. Zhang (2020) DDcGAN: a dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing 29, pp. 4980–4995. Cited by: §1, §1, §2.1.2, §2.1.2, §2.3.
  • J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: §1, §1, §2.1.2, §2.1.2, §2.3, §3.3.1, §4.4.2.
  • K. Ma, K. Zeng, and Z. Wang (2015) Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing 24 (11), pp. 3345–3356. Cited by: §1.
  • I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6707–6717. Cited by: §2.2.
  • M. E. Mortenson (1999) Mathematics for computer graphics applications. Industrial Press Inc.. Cited by: §3.3.1.
  • M. Nejati, S. Samavi, and S. Shirani (2015) Multi-focus image fusion using dictionary-based sparse representation. Information Fusion 25, pp. 72–84. Cited by: §1, §2.1.2.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), pp. 69–84. Cited by: §2.2.
  • N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. In International Conference on Machine Learning (ICML), pp. 4055–4064. Cited by: §2.3, §4.3.
  • C. Poynton (2012) Digital video and hd: algorithms and interfaces. Elsevier. Cited by: §3.3.2, §4.3.
  • O. Prakash, C. M. Park, A. Khare, M. Jeon, and J. Gwak (2019) Multiscale fusion of multimodal medical images using lifting scheme based biorthogonal wavelet transform. Optik 182, pp. 995–1014. Cited by: §4.3.
  • Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin (2016) Variational autoencoder for deep learning of images, labels and captions. Advances in Neural Information Processing Systems 29, pp. 2352–2360. Cited by: §2.2.
  • S. Quan, W. Qian, J. Guo, and H. Zhao (2014) Visible and infrared image fusion based on curvelet transform. In The 2014 2nd International Conference on Systems and Informatics (ICSAI), pp. 828–832. Cited by: §1, §2.1.1.
  • K. Ram Prabhakar, V. Sai Srikar, and R. Venkatesh Babu (2017) Deepfuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4714–4722. Cited by: §1, §1, §2.1.2, §2.1.2, §2.3, §3.3.2, §4.4.2, §4.4.2, §4.4.2.
  • A. Saha, G. Bhatnagar, and Q. J. Wu (2013) Mutual spectral residual approach for multifocus image fusion. Digital Signal Processing 23 (4), pp. 1121–1135. Cited by: §1.
  • R. Shen, I. Cheng, J. Shi, and A. Basu (2011) Generalized random walks for fusion of multi-exposure images. IEEE Transactions on Image Processing 20 (12), pp. 3634–3646. Cited by: §1.
  • Z. Sun, S. Cao, Y. Yang, and K. M. Kitani (2021) Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3611–3620. Cited by: §2.3.
  • A. Toet (1989) Image fusion by a ratio of low-pass pyramid. Pattern Recognition Letters 9 (4), pp. 245–253. Cited by: §1, §2.1.1.
  • A. Toet (2014) TNO image fusion dataset. Note: Website https://doi.org/10.6084/m9.figshare.1008029.v1 Cited by: §1, §2.1.2.
  • H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021) Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), pp. 10347–10357. Cited by: §2.3.
  • T. Wan, N. Canagarajah, and A. Achim (2009) Segmentation-driven image fusion based on alpha-stable modeling of wavelet coefficients. IEEE Transactions on Multimedia 11 (4), pp. 624–633. Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (EMNLP), pp. 353–355. Cited by: §2.3.
  • Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia (2021) End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8741–8750. Cited by: §2.3.
  • H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling (2020a) U2Fusion: a unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document Cited by: §1, §1, §1, §1, §2.1.2, §2.1.2, §2.3, §3.3.1, §3.3.2, §4.4.1, §4.4.2, §4.4.2, §4.4.2, §4.4.2.
  • H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo (2020b) FusionDN: a unified densely connected network for image fusion. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 34, pp. 12484–12491. Cited by: §1, §1, §1, §1, §2.1.2, §2.1.2.
  • Z. Xu (2014) Medical image fusion using multi-level local extrema. Information Fusion 19, pp. 38–48. Cited by: §1.
  • Z. Xue and R. S. Blum (2003) Concealed weapon detection using color image fusion. In Proceedings of the 6th International Conference on Information Fusion (ICIF), Vol. 1, pp. 622–627. Cited by: §1.
  • M. Yin, X. Liu, Y. Liu, and X. Chen (2018) Medical image fusion with parameter-adaptive pulse coupled neural network in nonsubsampled shearlet transform domain. IEEE Transactions on Instrumentation and Measurement 68 (1), pp. 49–64. Cited by: §4.4.2.
  • H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma (2020a) Rethinking the image fusion: a fast unified image fusion network based on proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 34, pp. 12797–12804. Cited by: §1, §1, §1, §1, §2.1.2, §2.1.2, §4.4.1, §4.4.2, §4.4.2, §4.4.2, §4.4.2.
  • Q. Zhang and B. Guo (2009) Multifocus image fusion using the nonsubsampled contourlet transform. Signal Processing 89 (7), pp. 1334–1346. Cited by: §4.4.2.
  • Q. Zhang and M. D. Levine (2016) Robust multi-focus image fusion using multi-task sparse representation and spatial context. IEEE Transactions on Image Processing 25 (5), pp. 2045–2058. Cited by: §1.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European Conference on Computer Vision (ECCV), pp. 649–666. Cited by: §2.2.
  • X. Zhang (2021) Deep learning-based multi-focus image fusion: a survey and a comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. Cited by: §4.3.
  • Y. Zhang, X. Bai, and T. Wang (2017a) Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure. Information Fusion 35, pp. 81–101. Cited by: §1, §2.1.1.
  • Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang (2020b) IFCNN: a general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118. Cited by: §1, §1, §1, §1, §2.1.2, §2.3, §4.4.1, §4.4.2, §4.4.2, §4.4.2, §4.4.2.
  • Y. Zhang, L. Zhang, X. Bai, and L. Zhang (2017b) Infrared and visual image fusion through infrared feature extraction and visual information preservation. Infrared Physics & Technology 83, pp. 227–237. Cited by: §1.
  • M. Zheng, P. Gao, R. Zhang, K. Li, X. Wang, H. Li, and H. Dong (2020) End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv:2011.09315. Cited by: §2.3.
  • Z. Zhou, S. Li, and B. Wang (2014) Multi-scale weighted gradient-based fusion for multi-focus images. Information Fusion 20, pp. 60–72. Cited by: §1, §2.1.1.
  • Z. Zhou, B. Wang, S. Li, and M. Dong (2016) Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with gaussian and bilateral filters. Information Fusion 30, pp. 15–26. Cited by: §1.
  • X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. In International Conference on Learning Representations (ICLR), Cited by: §2.3.