1 Introduction
By integrating complementary information from multiple source images into a single fused image, image fusion can generate high-quality images and compensate for the inherent defects of a single imaging sensor Li et al. (1995). Image fusion has a wide range of applications Li et al. (1995); Goshtasby and Nikolov (2007); Li et al. (2013); Bai et al. (2011). For example, in military applications, the fusion of infrared and visible images can be used for reconnaissance as well as night vision Xue and Blum (2003); Wan et al. (2009); Zhou et al. (2016); Zhang et al. (2017b). In the medical field, fusing images of different modalities (e.g., Computed Tomography (CT) and Magnetic Resonance Imaging (MRI)) can assist clinicians in diagnosis and treatment Bhatnagar et al. (2013); Xu (2014). In the field of consumer electronics, multi-exposure image fusion can be employed to generate high dynamic range images for mobile devices Goshtasby (2005); Shen et al. (2011); Ma et al. (2015), while multi-focus image fusion can be applied to refocusing algorithms Saha et al. (2013); Bai et al. (2015); Zhang and Levine (2016).
Typically, image fusion tasks can be classified into multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion. Although most studies focus on a certain fusion task, the design of unified image fusion frameworks that can be applied to different tasks is gradually becoming a significant research direction
Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b). The reason not only lies in the generality of a unified image fusion framework for multiple tasks, but also rests with the finding that these frameworks can achieve better performance when they are jointly trained on different tasks than when trained on a specific task Xu et al. (2020a). Existing image fusion methods can be classified into two categories: traditional methods Li et al. (1995, 2013); Huang and Jing (2007); Zhang et al. (2017a); Zhou et al. (2014); Burt and Adelson (1987); Toet (1989); Cao et al. (2014); Quan et al. (2014); Luo et al. (2016); Liu et al. (2017, 2016); Li et al. (2020) and deep learning-based methods Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Li and Wu (2018); Ma et al. (2019, 2020, 2021); Liu et al. (2020). Although traditional image fusion methods achieved promising performance before the deep learning era, their hand-crafted feature extraction approaches limit further performance improvement. Moreover, these methods can only be used for a specific task due to their poor generalization capability. Deep learning-based image fusion methods have alleviated this limitation thanks to their powerful feature extraction capability and have gradually become the mainstream approaches. In these methods, the source images are fed into a deep neural network and the output of the network is the fused image.
According to how the fusion network is trained, deep learning-based methods can be further divided into end-to-end methods Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Ma et al. (2019, 2020) and two-stage methods Li and Wu (2018); Ma et al. (2021); Liu et al. (2020). In end-to-end methods, the fusion network is trained directly, either in a supervised manner using synthetic ground truth fused images or in an unsupervised manner using loss functions defined on the similarity between the fused image and the input source images. However, end-to-end methods require a large number of task-specific training images, which are difficult or expensive to collect in the image fusion field. Although some training datasets have been constructed for specific fusion tasks
Xu et al. (2020a); Cai et al. (2018b); Nejati et al. (2015); Toet (2014), they are not comparable in size to large natural image datasets (e.g., ImageNet
Deng et al. (2009), COCO
Lin et al. (2014)). Therefore, end-to-end image fusion methods easily fall into overfitting or tedious parameter optimization due to insufficient training data. In order to obtain more training data, most end-to-end methods Xu et al. (2020a); Zhang et al. (2020a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Ma et al. (2019, 2020) divide the source images into small patches during training, which corrupts the semantic information of the whole image and makes it difficult for the network to model global features, leading to inferior fusion performance. Instead, two-stage methods first train an encoder-decoder network on a large natural image dataset through image reconstruction. Then, the trained encoder is used to extract feature maps from the source images, and the feature maps are fused and further decoded by the trained decoder to generate the fused image. The advantage of two-stage methods is that the encoder and the decoder can be trained on large natural image datasets, avoiding the need for a large amount of task-specific training data, which makes them more flexible and stable. However, the two-stage methods still have several unsolved issues that hinder further performance improvement. First, the current two-stage methods Li and Wu (2018); Ma et al. (2021); Liu et al. (2020) purely focus on the reconstruction of natural images, and the domain gap between natural images and fusion tasks results in poor generalization of the extracted features and inferior performance. In addition, the same natural image dataset is usually used for different fusion tasks without considering task-specific features. Therefore, a prominent issue is how to enable the encoder-decoder network to be trained on large natural image datasets while learning task-specific image features at the same time. Second, recent studies Xu et al. (2020a); Zhang et al. (2020b, a); Xu et al. (2020b) have indicated that joint training on different tasks can help improve the performance on each single task. Therefore, how to design a joint training scheme for two-stage frameworks is another important issue. Third, all existing deep learning-based image fusion methods utilize CNNs for feature extraction, but it is difficult for CNNs to model long-range dependencies due to their small receptive fields Dosovitskiy et al. (2020).
To address the issues mentioned above, in this paper we propose TransFuse, a novel two-stage image fusion framework based on a new encoder-decoder network. First, to enable our network to be trained on large natural image datasets and to learn task-specific features at the same time, we design three destruction-reconstruction self-supervised auxiliary tasks, one for each of the three image fusion tasks: multi-modal, multi-exposure and multi-focus image fusion. Instead of simply inputting a natural image into the encoder and using the decoder to reconstruct it, we destroy the natural image before inputting it into the encoder, with a specific way of destruction designed for each fusion task. By forcing the encoder-decoder network to reconstruct the destroyed image, we make the network learn better task-specific image features. Second, in order to encourage different fusion tasks to promote each other and to increase the generalizability of the trained network, we integrate the three self-supervised auxiliary tasks by randomly combining their transformations when destroying a natural image during model training. Third, to compensate for the deficiency of CNNs in modeling long-range dependencies, we design a new encoder that combines CNN and Transformer to exploit both local and global information in feature extraction. We conduct extensive experiments to demonstrate the effectiveness of each component of our framework.
The main contributions are summarized as follows.
-
We design a novel encoder-decoder based image fusion framework and propose a destruction-reconstruction based self-supervised training scheme to encourage the network to learn task-specific features.
-
We propose three transformations to destroy a natural image: pixel intensity non-linear transformation, brightness transformation and noise transformation, designed for multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion, respectively. We integrate the three transformations by randomly deciding which of them to apply to an image during model training, so that the trained model extracts more generalizable features and achieves higher performance on each task.
-
We design a new encoder that combines CNN and Transformer for feature extraction, so that the trained model can exploit both local and global information.
-
Extensive experiments on multi-modal, multi-exposure and multi-focus image fusion tasks show that our proposed method achieves new state-of-the-art performance in both subjective and objective evaluations.
2 Related Work
In this section, we briefly review the most representative image fusion methods of recent years, including traditional methods and deep learning-based methods; the latter are further divided into end-to-end methods and two-stage methods. We then introduce related work on self-supervised learning and Transformer in computer vision and their application potential in the image fusion field.
2.1 Image Fusion
2.1.1 Traditional Image Fusion Algorithms
Traditional image fusion methods can be classified into spatial domain-based methods, transform domain-based methods, and sparse representation and dictionary learning-based methods. Spatial domain-based methods usually compute a weighted average based on the local or pixel-level saliency of the two source images Li et al. (2013); Huang and Jing (2007); Zhang et al. (2017a); Zhou et al. (2014) to obtain the fused image.
Transform domain-based methods first transform the source images into a transform domain (e.g., the wavelet domain) to obtain different frequency components. The corresponding components are then fused by appropriate fusion rules, and the fused image is finally obtained by the inverse transform. Commonly used transforms include the Laplacian pyramid (LP) Burt and Adelson (1987), the ratio of low-pass pyramid (RP) Toet (1989), the discrete wavelet transform (DWT) Li et al. (1995), the discrete cosine transform (DCT) Cao et al. (2014), the curvelet transform (CVT) Quan et al. (2014) and the shearlet transform Luo et al. (2016), etc.
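As a concrete illustration of this transform-domain pipeline, the sketch below fuses two grayscale images in the DWT domain with the PyWavelets package. The fusion rule shown (averaging the approximation band and keeping the larger-magnitude coefficient in the detail bands) is one common choice used here for illustration, not a rule taken from the cited works.

```python
import numpy as np
import pywt

def dwt_fuse(img_a: np.ndarray, img_b: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Fuse two grayscale images of equal size in the wavelet domain."""
    # 1. Transform both source images into the wavelet domain.
    cA_a, (cH_a, cV_a, cD_a) = pywt.dwt2(img_a, wavelet)
    cA_b, (cH_b, cV_b, cD_b) = pywt.dwt2(img_b, wavelet)

    # 2. Fuse corresponding components: average the approximation band,
    #    keep the coefficient with larger magnitude in the detail bands.
    def pick_max(x, y):
        return np.where(np.abs(x) >= np.abs(y), x, y)

    cA = (cA_a + cA_b) / 2.0
    cH, cV, cD = pick_max(cH_a, cH_b), pick_max(cV_a, cV_b), pick_max(cD_a, cD_b)

    # 3. Inverse transform to obtain the fused image.
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)
```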
New methods based on sparse representation and dictionary learning have also emerged in recent years. For example, Liu et al. Liu et al. (2017) proposed JSR and JSRSD based on joint sparse representation and saliency detection. They first obtained the global and local representative maps of the source images based on sparse coefficients, and then combined them through a representative detection model to generate the overall representative map. Finally, a weighted fusion algorithm was adopted to obtain the fused image based on the overall representative map.
Although the aforementioned methods have achieved good results, their performance is still limited in two respects. First, the complicated, manually designed feature extraction approaches usually fail to effectively preserve important information in the source images and cause artifacts in the fused image. Second, the feature extraction methods are usually designed for a specific task, so it is difficult to adapt them to other tasks.
2.1.2 Deep Learning-based Image Fusion Algorithms
Due to the powerful feature extraction capability of CNN, deep learning-based methods have gradually become the mainstream approaches in the field of image fusion. Deep learning-based methods can be further divided into end-to-end methods and two-stage methods.
End-to-End Image Fusion Scheme
In end-to-end methods, the fusion network is directly trained either in a supervised manner using synthetic ground truth fusion images or in an unsupervised manner using loss functions defined on the similarity between the source images and the fused images.
Zhang et al. Zhang et al. (2020b) proposed IFCNN, a supervised unified image fusion framework. They constructed a large multi-focus image fusion dataset with ground truth images and utilized a perceptual loss function for supervised training. The trained model was then transferred to other fusion tasks. Prabhakar et al. Ram Prabhakar et al. (2017) proposed DeepFuse, an unsupervised multi-exposure image fusion framework utilizing a no-reference quality metric as the loss function. They designed a novel CNN-based network trained to learn the fusion operation without ground truth fusion images. Ma et al. Ma et al. (2019) proposed FusionGAN, an unsupervised infrared and visible image fusion framework based on GAN. The generator network is used to generate the fused image, and the discriminator network ensures that the texture details in visible images together with the thermal radiation information in infrared images are retained in the fused image. On this basis, they further proposed DDcGAN Ma et al. (2020), which enhances the edges and the saliency of thermal targets by introducing a target-enhanced loss function and a dual-discriminator structure. Zhang et al. Zhang et al. (2020a) proposed PMGI, an unsupervised unified image fusion network. They unified the image fusion problem into the proportional maintenance of texture and intensity information of the source images, and proposed a new loss function based on the gradient and intensity information between the fused image and the source images for unsupervised training. Xu et al. Xu et al. (2020a, b) proposed U2Fusion, an unsupervised unified image fusion network. They utilized a novel loss function based on an adaptive information preservation degree for unsupervised training. During training, a pre-trained neural network was used to extract features of the source images, and these features were further used to calculate the adaptive information preservation degree. The above end-to-end models have achieved promising fusion performance, but both supervised and unsupervised methods require a large number of task-specific training images. Although several studies have constructed training datasets for specific fusion tasks (e.g., RoadScene Xu et al. (2020a) and TNO Toet (2014) for infrared and visible image fusion; Harvard Keith A. Johnson for medical image fusion; SICE Cai et al. (2018b) for multi-exposure image fusion; and Lytro Nejati et al. (2015)
for multi-focus image fusion), their sizes are not comparable to that of large natural image datasets (e.g., ImageNet
Deng et al. (2009), COCO Lin et al. (2014)). Insufficient training data tends to cause overfitting or complex parameter optimization. Besides, in order to obtain more training data, most end-to-end methods Xu et al. (2020a); Zhang et al. (2020a); Xu et al. (2020b); Ram Prabhakar et al. (2017); Ma et al. (2019, 2020) divide the source images into a large number of small patches during training, which corrupts the semantic information of the whole image and also prevents the network from modeling global features. These limitations hinder further performance improvement of the end-to-end methods.
Two-stage Image Fusion Scheme
In two-stage methods, an encoder-decoder network is first trained on large natural image datasets by image reconstruction in the first stage. In the second stage, the trained encoder is used to extract feature maps from source images and the features maps are then fused and decoded by the trained decoder to generate the fused image.
Li et al. Li and Wu (2018) presented DenseFuse, the first two-stage method, for infrared and visible image fusion; in the fusion process, they introduced an l1-norm based fusion rule to fuse the feature maps. Ma et al. Ma et al. (2021) proposed SESF-Fuse, a new two-stage framework for multi-focus image fusion. They utilized spatial frequency to measure the activity level of the feature maps and acquired decision maps based on the activity level. Finally, the decision maps were used to obtain the fusion results. Liu et al. Liu et al. (2020) proposed WaveFuse, a unified image fusion framework based on the multi-scale discrete wavelet transform (DWT). In the fusion stage, the feature maps of the source images were transformed into several components in the wavelet domain and fused component by component. Then, the fused feature maps were reconstructed using the inverse DWT and input into the trained decoder to generate the fused image. The advantage of two-stage methods is that training can be performed in a self-supervised manner on accessible large natural image datasets without the need for scene-specific datasets or ground truth fusion images. However, the current two-stage frameworks simply perform natural image reconstruction and cannot enable the encoder-decoder network to learn task-specific image features. In this paper, we propose a novel two-stage image fusion framework which retains the above advantages of two-stage methods. At the same time, we propose three destruction-reconstruction auxiliary tasks to encourage the network to learn task-specific image features.
2.2 Self-supervised Learning
Self-supervised learning is a branch of unsupervised learning which automatically generates its own supervision labels from large-scale unlabeled data. A network can learn valuable representations for downstream tasks when it is trained to perform auxiliary tasks using the generated labels Liu et al. (2021). Self-supervised learning has been successfully applied in many fields such as computer vision and natural language processing Liu et al. (2021). In the field of computer vision, self-supervised auxiliary tasks include angle prediction Gidaris et al. (2018), image puzzles Noroozi and Favaro (2016), image coloring Zhang et al. (2016) and image reconstruction Pu et al. (2016). Extensive studies Kolesnikov et al. (2019); Achituve et al. (2021); Misra and Maaten (2020); Jing and Tian (2021) have shown that with suitably designed auxiliary tasks, self-supervised learning can produce effective and generalizable image feature representations. In this study, we design three self-supervised reconstruction auxiliary tasks for three image fusion tasks and integrate them through random combination to further improve the performance of the trained model.
2.3 Transformer
In the field of computer vision, CNNs and their variants have been widely used due to their powerful feature extraction capability. Nevertheless, CNNs fail to establish long-range dependencies because of their inherently small receptive fields. Unfortunately, almost all existing image fusion architectures are based on CNNs, so global information is not fully exploited. In the field of natural language processing, Transformer has achieved remarkable results in modeling global dependencies through the self-attention mechanism. A great number of works have utilized Transformer as an alternative to CNN in computer vision and achieved impressive results in different tasks, such as image classification Dosovitskiy et al. (2020); Touvron et al. (2021), object detection Carion et al. (2020); Zhu et al. (2020); Zheng et al. (2020); Dai et al. (2021); Sun et al. (2021), image segmentation Wang et al. (2018, 2021) and image generation Parmar et al. (2018). Extensive studies Xu et al. (2020a); Zhang et al. (2020b); Ram Prabhakar et al. (2017); Li and Wu (2018); Ma et al. (2019, 2020, 2021); Liu et al. (2020) have shown that the ability to extract effective image feature representations is the key to improving image fusion performance. To remedy the deficiency of current CNN-based image fusion architectures in establishing long-range dependencies, we design a new feature extraction module that combines CNN with Transformer, allowing the network to exploit both local and global information more comprehensively.
3 Method
3.1 Framework Overview
The overall architecture of our framework is shown in Fig. 1. We train the encoder-decoder network by image reconstruction on a large natural image dataset. Different from existing two-stage methods, which directly input the original natural images into the encoder-decoder network for reconstruction, we first destroy the original images before inputting them into the encoder. Concretely, given an original training image, we first randomly generate several image subregions to form a set to be transformed. For each subregion in the set, we randomly apply pixel intensity non-linear transformation, brightness transformation and noise transformation to obtain a set of transformed subregions and the destroyed input image. Then, we input the destroyed image into the encoder, which consists of a feature extraction module TransBlock and a feature enhancement module EnhanceBlock. Finally, the extracted features are input to the decoder to reconstruct the original training image. The detailed structures of the encoder and the decoder are described in Section 3.2, and the three task-specific transformations are introduced in detail in Section 3.3.
As shown in Fig. 1 (b), the fusion framework consists of two parameter-shared encoders, a fusion block and a decoder. Note that both the encoder and the decoder are trained in the first stage, and the fusion block has no parameters to be trained, which ensures the simplicity and efficiency of the fusion framework. Specifically, the two source images are first input to the encoder for feature encoding; the extracted feature maps are then fused by the fusion block to obtain the fused feature maps, which are finally decoded by the decoder to generate the fused image. Since the important information differs among the fused images of different tasks, different fusion rules are used according to the characteristics of each task. The fusion rules are introduced in detail in Section 3.4.
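A minimal sketch of this second (fusion) stage is given below, assuming PyTorch modules `encoder` and `decoder` trained in the first stage; `fuse_features` stands for the parameter-free fusion rule of Section 3.4 (here simple averaging, used only for illustration).

```python
import torch

@torch.no_grad()
def fuse_images(encoder, decoder, src1: torch.Tensor, src2: torch.Tensor,
                fuse_features=lambda f1, f2: (f1 + f2) / 2.0) -> torch.Tensor:
    """Two-stage fusion: encode both sources, fuse the feature maps, decode."""
    feat1 = encoder(src1)                 # feature maps of source image 1
    feat2 = encoder(src2)                 # feature maps of source image 2 (shared weights)
    fused = fuse_features(feat1, feat2)   # parameter-free fusion block
    return decoder(fused)                 # decode the fused feature maps into the fused image
```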

3.2 Transformer-Based Encoder-Decoder Framework for Training
3.2.1 Encoder-Decoder Network via Image Reconstruction
We train an encoder-decoder network via image reconstruction for image fusion; the network architecture is shown in Fig. 1 (a). Given a training image I, we randomly generate n image subregions to form a set S = {s_1, s_2, ..., s_n} for transformation. For each subregion in S, we randomly apply three transformations (the transformations designed for the different fusion tasks are detailed in Section 3.3) to obtain the set of transformed subregions S' and the transformed input image I':
s'_i = T(s_i), S' = {s'_1, s'_2, ..., s'_n},  (1)
I' = Φ(I, S'),  (2)
where T(·) denotes the transformation function applied to the subregion set, s'_i is the i-th transformed image subregion, I' is the transformed input image, and Φ(·) refers to the image transformation function that replaces the selected subregions of I with their transformed versions.
We perform a self-supervised image reconstruction task for model training so that the encoder-decoder network learns an inverse mapping that reconstructs the original image from the transformed input image.
Î = D(E(I')) ≈ I,  (3)
where E and D denote the encoder and the decoder, respectively, and Î is the reconstructed image.
Notice that we do not transform the whole input image but only a randomly selected set of its subregions, while the reconstruction is conducted at the whole-image level.
As shown in Fig. 1 (a), the encoder contains a feature extraction module named TransBlock and a feature enhancement module called EnhanceBlock. TransBlock contains two sub-modules, the CNN-Module and the Transformer-Module, based on CNN and Transformer, respectively. The specific design of TransBlock is described in Section 3.2.2. The EnhanceBlock further aggregates and enhances the feature maps produced by the CNN-Module and the Transformer-Module in TransBlock, thus allowing the encoder to better integrate local and global features. In particular, we concatenate the encoded features from the CNN-Module and the Transformer-Module, and then input them into two ConvBlock layers to achieve feature integration and enhancement. As shown in Fig. 1 (c), each ConvBlock consists of two convolutional layers with a kernel size of 3×3, padding of 1 and a ReLU activation layer. The decoder contains two sequentially connected ConvBlock layers followed by a 1×1 convolution to reconstruct the original image.
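A minimal PyTorch sketch of the ConvBlock and the decoder described above is given below; the channel widths are illustrative assumptions, since the text specifies only the kernel size, padding and activation, and the single trailing ReLU follows the literal description.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions (padding 1) followed by a ReLU activation, as in Fig. 1 (c)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

# Decoder: two sequentially connected ConvBlocks followed by a 1x1 convolution
# mapping back to a single grayscale channel (the 64/32 channel widths are assumptions).
decoder = nn.Sequential(
    ConvBlock(64, 64),
    ConvBlock(64, 32),
    nn.Conv2d(32, 1, kernel_size=1),
)
```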
3.2.2 TransBlock: A Powerful Local and Global Feature Extractor
Inspired by ViT Dosovitskiy et al. (2020) and TNT Han et al. (2021), we combine CNN and Transformer into a powerful feature extraction module, TransBlock, to exploit both local and global information in the source images. As shown in Fig. 1 (a), TransBlock consists of two feature extraction sub-modules, the CNN-Module and the Transformer-Module. The CNN-Module consists of three sequentially connected ConvBlocks. The Transformer-Module is designed with a Fine-grained Transformer for local feature modeling and a Global Transformer for global feature modeling. Concretely, we first divide the transformed input image I' into patches of size p × p and construct the global sequence X^g = [x^g_1, x^g_2, ..., x^g_N], where N is the number of patches and p is the size of the divided patches. To capture finer-grained features, we further divide each patch in the global sequence into smaller sub-patches and construct the local sequence x^l_i = [x^l_{i,1}, x^l_{i,2}, ..., x^l_{i,M}] for the i-th patch, where M is the number of sub-patches per patch and q is the size of the divided sub-patches.
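A minimal sketch of building the global and local sequences is shown below, using the sizes from Section 4.2 (256×256 input, 16×16 patches, 4×4 sub-patches); the tensor layout is an assumption for illustration.

```python
import torch

def build_sequences(img: torch.Tensor, patch: int = 16, sub_patch: int = 4):
    """Split a (B, C, H, W) image into a global patch sequence and a local sub-patch sequence."""
    B, C, H, W = img.shape
    # Global sequence: (B, N, C*patch*patch) with N = (H/patch) * (W/patch).
    global_seq = (img.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(B, -1, C * patch * patch))
    # Local sequence: each patch is further split into sub-patches,
    # giving (B, N, M, C*sub*sub) with M = (patch / sub_patch) ** 2.
    patches = global_seq.reshape(B, -1, C, patch, patch)
    local_seq = (patches.unfold(3, sub_patch, sub_patch).unfold(4, sub_patch, sub_patch)
                        .permute(0, 1, 3, 4, 2, 5, 6)
                        .reshape(B, patches.shape[1], -1, C * sub_patch * sub_patch))
    return global_seq, local_seq

# A 256x256 grayscale image yields 256 patches, each with 16 sub-patches.
g, l = build_sequences(torch.randn(1, 1, 256, 256))
print(g.shape, l.shape)  # torch.Size([1, 256, 256]) torch.Size([1, 256, 16, 16])
```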
For the processing of local sequences, we use the Fine-grained Transformer with shared weights to learn fine-grained relative dependencies in images.
First, we input the local sequence into the Linear Projection layer to perform a linear mapping to obtain the encoded feature sequence.
Y^l_{i,j} = Flatten(x^l_{i,j}) W_l + b_l,  (4)
where x^l_{i,j} refers to the j-th sub-patch of the i-th patch, Y^l_{i,j} represents the encoded features of that sub-patch, Flatten(·) flattens the input patch into a one-dimensional vector, and W_l and b_l denote the weights and bias of the Linear Projection layer, respectively. We then input the encoded features of the local sequences into the Fine-grained Transformer, which adopts a standard Transformer structure similar to ViT Dosovitskiy et al. (2020), as shown in Fig. 1 (c):
Y^l_i(k) = FineGrainedTransformer_k(Y^l_i(k−1)), k = 1, 2, ..., K,  (5)
where k denotes the k-th block, K refers to the number of transformer blocks, and Y^l_i(0) is the encoded feature sequence of Eq. (4).
For the global sequence, similar to the local sequence, we first input the global sequence into the Linear Projection layer. Afterwards, we linearly map the output of the corresponding Fine-grained Transformer of the local sequence and concatenate it with the encoded features of the global sequence:
Y^g_i = Concat(Flatten(x^g_i) W_g + b_g, Linear(Y^l_i(K))),  (6)
where Flatten(·) flattens the input patch into a one-dimensional vector, and W_g and b_g stand for the weights and bias of the Linear Projection layer, respectively.
Then, we input the interactive encoding of global and local sequences into the Global Transformer. The structure of the Global Transformer utilizes a standard Transformer structure similar to ViT Dosovitskiy et al. (2020), as shown in Fig. 1 (c).
Z(k) = GlobalTransformer_k(Z(k−1)), k = 1, 2, ..., K, with Z(0) = [Y^g_1, Y^g_2, ..., Y^g_N].  (7)
In general, the Fine-grained Transformer is designed to model fine-grained relative dependencies within an image patch, and thus extract local semantic features, while the Global Transformer is designed to model global relative dependencies within an image, and thus extract global features.
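The sketch below condenses the interaction between the two branches (Eqs. (4)-(7)) using standard PyTorch Transformer encoder layers as stand-ins for the Fine-grained and Global Transformer blocks; the embedding dimensions, depth and head counts are assumptions, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerModule(nn.Module):
    """Local (fine-grained) and global branches with local-to-global feature interaction."""
    def __init__(self, sub_dim=16, patch_dim=256, d_local=64, d_global=256,
                 depth=4, heads=4, n_sub=16):
        super().__init__()
        self.local_proj = nn.Linear(sub_dim, d_local)             # Eq. (4)
        self.global_proj = nn.Linear(patch_dim, d_global)
        self.inter_proj = nn.Linear(n_sub * d_local, d_global)    # local -> global mapping, Eq. (6)
        layer = lambda d: nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.fine_grained = nn.TransformerEncoder(layer(d_local), depth)      # Eq. (5)
        self.global_tf = nn.TransformerEncoder(layer(2 * d_global), depth)    # Eq. (7)

    def forward(self, global_seq, local_seq):
        B, N, M, _ = local_seq.shape
        # Fine-grained Transformer over the sub-patches of every patch (shared weights).
        local = self.fine_grained(self.local_proj(local_seq).reshape(B * N, M, -1))
        # Linearly map the local outputs and concatenate them with the projected
        # global sequence before the Global Transformer.
        local_to_global = self.inter_proj(local.reshape(B, N, -1))            # (B, N, d_global)
        tokens = torch.cat([self.global_proj(global_seq), local_to_global], dim=-1)
        return self.global_tf(tokens)                                         # (B, N, 2*d_global)
```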
3.2.3 Loss Function
We expect our network not only to perform pixel-level image reconstruction but also to sufficiently capture the structural and gradient information in the images. Therefore, we use a loss function with three components:
L = L_MSE + α L_SSIM + β L_TV,  (8)
where L_MSE is the Mean Square Error (MSE) loss, L_SSIM denotes the Structural Similarity (SSIM) loss Keith A. Johnson, L_TV represents the Total Variation (TV) loss Hou et al. (2020), and α and β are two coefficients used to balance the loss terms, both empirically set to 20 in the experiments.
The MSE loss is used for pixel-level reconstruction of images and it is defined as
L_MSE = ||I_out − I_in||_2^2,  (9)
where I_out denotes the output image of the decoder and I_in represents the input training image.
The SSIM loss is used to make the network learn structural information of the images, and it is defined as
L_SSIM = 1 − SSIM(I_out, I_in),  (10)
SSIM(x, y) = ((2 μ_x μ_y + C1)(2 σ_xy + C2)) / ((μ_x^2 + μ_y^2 + C1)(σ_x^2 + σ_y^2 + C2)),  (11)
where μ and σ denote the mean and the standard deviation, respectively, and σ_xy is the correlation between x and y. C1 and C2 are two very small constants, empirically set to 0.02 and 0.06, and the standard deviation of the Gaussian window is empirically set to 1.5. The TV loss is used to preserve the gradient information in the images and further eliminate noise during image reconstruction, and it is defined as follows:
R(i, j) = I_in(i, j) − I_out(i, j),  (12)
L_TV = Σ_{i,j} (||R(i, j+1) − R(i, j)|| + ||R(i+1, j) − R(i, j)||),  (13)
where R is the difference between the original image and the reconstructed image, ||·|| is the norm, and i and j represent the horizontal and vertical coordinates of the image pixels, respectively.
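A minimal sketch of the composite loss of Eq. (8) with α = β = 20 is shown below. It uses the third-party pytorch-msssim package for the SSIM term (an assumption; any differentiable SSIM would do) and a simple residual-based total-variation term; averaging instead of summing over pixels is a scaling choice made here for illustration.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation

def reconstruction_loss(output: torch.Tensor, target: torch.Tensor,
                        alpha: float = 20.0, beta: float = 20.0) -> torch.Tensor:
    """L = L_MSE + alpha * L_SSIM + beta * L_TV, cf. Eq. (8)."""
    # Pixel-level reconstruction term.
    l_mse = F.mse_loss(output, target)
    # Structural term: 1 - SSIM, computed with a Gaussian window (sigma = 1.5 by default).
    l_ssim = 1.0 - ssim(output, target, data_range=1.0)
    # Total-variation term on the residual R = target - output, which suppresses
    # noise while preserving gradient information.
    r = target - output
    l_tv = (r[:, :, 1:, :] - r[:, :, :-1, :]).abs().mean() + \
           (r[:, :, :, 1:] - r[:, :, :, :-1]).abs().mean()
    return l_mse + alpha * l_ssim + beta * l_tv
```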

3.3 Task-Specific Self-Supervised Training Scheme
In this section, we introduce the three destruction-reconstruction auxiliary tasks designed for the different image fusion tasks. For multi-modal, multi-exposure and multi-focus image fusion, the auxiliary task is based on pixel intensity non-linear transformation, brightness transformation and noise transformation, respectively. Specifically, for each subregion in the set S to be transformed, we randomly apply the three transformations to obtain the set of transformed subregions S' and then the transformed input image I'. The transformed input image is fed into the encoder-decoder network for reconstruction. This kind of destruction-reconstruction auxiliary task enables our network to be trained on large natural image datasets and to learn task-specific features at the same time.
3.3.1 Pixel Intensity Non-Linear Transformation for Multi-Modal Image Fusion
We design a novel self-supervised auxiliary task based on pixel intensity non-linear transformation for multi-modal image fusion. In multi-modal image fusion, different source images contain different kinds of modality-specific information, and we hope that the most important information from each modality can be retained in the fused image. Following Xu et al. (2020a), we study two scenarios of multi-modal image fusion: infrared and visible image fusion, and multi-modal medical image fusion. In the former scenario, the most significant information is the thermal radiation in infrared images and the structural semantic information in visible images Xu et al. (2020a); Ma et al. (2019). In the latter scenario, the most significant information is the functional response and the structural anatomical information in medical images Buzug (2011); Forbes (2012). This important fusion information is all reflected in the pixel intensity distributions of the source images Xu et al. (2020a). Therefore, we propose a pixel intensity non-linear transformation that first destroys the pixel intensity distribution of the source images and then trains the network to reconstruct the original pixel intensities. By doing so, our network can effectively learn the pixel intensity information in the source images.
Concretely, we use a smooth monotonic third-order Bézier curve Mortenson (1999) defined by four control points to implement the non-linear transformation. The four control points include two endpoints (P_0 and P_3) and two midpoints (P_1 and P_2), and the curve is defined as:
B(t) = (1 − t)^3 P_0 + 3(1 − t)^2 t P_1 + 3(1 − t) t^2 P_2 + t^3 P_3, t ∈ [0, 1],  (14)
s' = interp(s, B_x(t), B_y(t)),  (15)
where t is a fractional value along the length of the line and interp indicates the interpolation function. Fig. 2 (a) illustrates an original image subregion and the image subregions transformed by different Bézier transformation curves. Specifically, we fix the two endpoints to obtain a monotonically increasing Bézier transformation curve, and then randomly flip the curve to obtain a monotonically decreasing one. The midpoints P_1 and P_2 are generated randomly for more variance. As shown in columns 2 and 4, when the midpoints P_1 and P_2 coincide with the two endpoints respectively, the transformation function is linear. The midpoints in columns 3 and 5 are randomly generated for more variance. Please note that with a transformation curve like those in columns 4 and 5, the pixel intensity can be reversed.
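A minimal numpy sketch of this transformation is given below, assuming intensities normalized to [0, 1]: a third-order Bézier curve with fixed endpoints and random midpoints is sampled and applied to the subregion as a look-up mapping, optionally flipped to reverse intensities.

```python
import numpy as np

def bezier_intensity_transform(region: np.ndarray, reverse: bool = False,
                               n_points: int = 1000) -> np.ndarray:
    """Apply a random third-order Bezier intensity curve to a subregion in [0, 1]."""
    # Fixed endpoints give an increasing curve; midpoints are random for more variance.
    p0, p3 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
    p1, p2 = np.random.rand(2), np.random.rand(2)
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    # B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3
    curve = ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
             + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)
    x, y = curve[:, 0], curve[:, 1]
    if reverse:                  # flip the curve to obtain a decreasing mapping
        y = 1.0 - y
    order = np.argsort(x)        # np.interp needs monotonically increasing x values
    return np.interp(region, x[order], y[order])
```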
3.3.2 Brightness Transformation for Multi-Exposure Image Fusion
For multi-exposure image fusion, we propose a self-supervised auxiliary task based on brightness transformation, encouraging the network to learn content and structural information at different exposure levels. In general, over-exposed images (captured with long exposure times) have better content and structural information in dark regions, while under-exposed images (captured with short exposure times) have better information in bright regions. Therefore, for multi-exposure image fusion, it is crucial to maintain appropriate luminance in the fused image while preserving abundant information Xu et al. (2020a); Ram Prabhakar et al. (2017). In this study, we design a brightness transformation to destroy the luminance of the source images and train the encoder-decoder network to reconstruct the original image. In this process, our network learns the content and structural information of images at different exposure levels, and thus learns the information that is important for multi-exposure fusion.
The brightness transformation is implemented using the Gamma transform, a specific non-linear operation which is widely used to encode and decode brightness or trichromatic values in image and video processing Poynton (2012). The Gamma transform is defined as
V_out = V_in^γ,  (16)
where V_out and V_in are the transformed and original pixel values, respectively. Fig. 2 (b) shows an original image subregion and the image subregions after brightness transformations with two different Gamma transformation curves. For each pixel in the selected subregion, γ is empirically set to 0.3 to compress the brightness or to 3 to enlarge the brightness.
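A one-function sketch of the brightness transformation, assuming intensities in [0, 1] and drawing gamma from the two values mentioned above:

```python
from typing import Optional
import numpy as np

def brightness_transform(region: np.ndarray, gamma: Optional[float] = None) -> np.ndarray:
    """Gamma transform V_out = V_in ** gamma on a subregion with values in [0, 1]."""
    if gamma is None:
        gamma = np.random.choice([0.3, 3.0])  # the two settings used in the paper
    return np.power(region, gamma)
```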
3.3.3 Noise Transformation for Multi-Focus Image Fusion
For multi-focus image fusion, we propose a self-supervised auxiliary task based on noise transformation to enable the network to learn the variations of different depths of field (DoF) and maintain clear detail information. Due to the limitation of a camera’s DoF, it is very difficult to obtain an all-in-focus image within one shot. Objects within the DoF can maintain clear detail information, but the scene content outside the DoF is blurry. The main objective of multi-focus image fusion is to retain clear detail information of objects at different DoFs. We propose the noise transformation to generate locally blurred images for the encoder-decoder network to reconstruct, so that the trained model can learn to reconstruct clear images from locally blurred multi-focus source images.
We implement the noise transformation using Gaussian blur. Mathematically, applying Gaussian blur to an image is the same as convolving the image with a Gaussian function Forsyth and Ponce (2011). In a two-dimensional image, the Gaussian function is defined as
G(x, y) = (1 / (2πσ^2)) exp(−(x^2 + y^2) / (2σ^2)),  (17)
s' = s * G,  (18)
where x and y are the distances of a point from the origin on the horizontal and vertical axes, respectively, σ is the standard deviation of the Gaussian distribution, which we empirically set to three, and * denotes convolution of the image subregion s with the Gaussian kernel. Fig. 2 (c) shows an original image subregion and the subregion after the Gaussian blur transformation.
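A minimal sketch of the noise transformation using SciPy's Gaussian filter with the standard deviation of three stated above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_transform(region: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """Blur a subregion with a 2-D Gaussian (sigma = 3), simulating out-of-focus content."""
    return gaussian_filter(region, sigma=sigma)
```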
Fig. 3. We integrate the three transformations designed for different fusion tasks using a probability-based combination strategy, enabling the network to learn task-specific features while extracting more generalizable features; the eight kinds of combinations applied to an image subregion are shown in the figure.
3.3.4 Integrating the Three Transformations in a Unified Framework
We have proposed three task-specific image transformations, and our experiments show that each of them improves the fusion performance of the corresponding image fusion task. Here we integrate the three transformations through a probability-based combination strategy, so that the network can extract more generalized features and the fusion performance of each single task can be further improved. Concretely, for a subregion we apply the three transformations in the order of pixel intensity non-linear transformation, brightness transformation and noise transformation, but for each transformation we randomly decide whether to apply or skip it. Thus, we obtain eight possible combinations of transformations applied to an image patch, as illustrated in Fig. 3. In this way, the trained model can handle more diverse inputs, including the original unchanged images and images transformed by one, two or all three transformations. The pseudo code for the implementation is shown in Algorithm 1; a sketch is given below.
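A minimal sketch of this probability-based combination (Algorithm 1) is given below, reusing the three transformation functions sketched in the previous subsections; the subregion sampling follows the settings in Section 4.2, and all names are illustrative.

```python
import numpy as np

def destroy_subregion(region: np.ndarray, p_nonlinear: float = 0.6,
                      p_brightness: float = 0.6, p_noise: float = 0.6) -> np.ndarray:
    """Apply the three task-specific transformations in a fixed order, each with its own
    probability, yielding the eight possible combinations illustrated in Fig. 3."""
    if np.random.rand() < p_nonlinear:
        region = bezier_intensity_transform(region, reverse=np.random.rand() < 0.5)
    if np.random.rand() < p_brightness:
        region = brightness_transform(region)
    if np.random.rand() < p_noise:
        region = noise_transform(region)
    return region

def destroy_image(image: np.ndarray, n_regions: int = 4, size: int = 16) -> np.ndarray:
    """Destroy n randomly placed size x size subregions of a grayscale image."""
    out = image.copy()
    h, w = image.shape
    for _ in range(n_regions):
        y, x = np.random.randint(0, h - size + 1), np.random.randint(0, w - size + 1)
        out[y:y + size, x:x + size] = destroy_subregion(out[y:y + size, x:x + size])
    return out
```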
Algorithm 1. Probability-based combination of the three transformations, with a hyperparameter probability for each task-specific transformation.
3.4 Fusion Rule
Due to the strong feature extraction capability of our network, fairly simple fusion rules can achieve very good fusion results. For the multi-exposure and multi-focus image fusion tasks, we directly average the feature maps of the two source images to obtain the fused feature maps. For multi-modal image fusion, we adopt the l1-norm fusion rule used by Li et al. Li and Wu (2018), which adaptively highlights and preserves the critical feature information in the fused feature maps according to the region energy of the feature maps.
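Minimal sketches of the two fusion rules are given below: feature-map averaging for the multi-exposure and multi-focus tasks, and an l1-norm/region-energy weighting in the spirit of the DenseFuse rule for multi-modal fusion; the local window size is an assumption.

```python
import torch
import torch.nn.functional as F

def average_fusion(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Multi-exposure / multi-focus rule: element-wise average of the feature maps."""
    return (f1 + f2) / 2.0

def l1_norm_fusion(f1: torch.Tensor, f2: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Multi-modal rule: weight each feature map by its local l1-norm activity (region energy)."""
    def activity(f):
        a = f.abs().sum(dim=1, keepdim=True)                            # l1-norm over channels
        return F.avg_pool2d(a, window, stride=1, padding=window // 2)   # region energy
    a1, a2 = activity(f1), activity(f2)
    w1 = a1 / (a1 + a2 + 1e-8)
    return w1 * f1 + (1.0 - w1) * f2
```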
4 Experimental Results
4.1 Datasets
We used the large natural image dataset MS-COCO Lin et al. (2014), which contains more than 70,000 natural images of various scenes, to train the encoder-decoder network. All images were resized to 256×256 and converted to grayscale.
We used the following datasets to evaluate our image fusion framework and compare it with other methods on different types of image fusion tasks. For multi-modal image fusion, we used the TNO dataset (https://figshare.com/articles/TNOImageFusionDataset/1008029) for infrared and visible image fusion and the Harvard dataset (http://www.med.harvard.edu/AANLIB/home.html) for multi-modal medical image fusion. For multi-exposure image fusion, we used the dataset in Cai et al. (2018a), and for multi-focus image fusion we used the Lytro dataset (https://mansournejati.ece.iut.ac.ir/content/lytro-multi-focus-dataset). From each dataset, we randomly selected 20 pairs of source images for testing.
4.2 Implementation Details
Our model was trained on an NVIDIA RTX 3090 GPU with a batch size of 64 for 70 epochs, using the Adam optimizer and a cosine annealing learning rate schedule with a learning rate of 1e-4 and a weight decay of 5e-4. Given a 256×256 training image, we randomly generate four image subregions of size 16×16 to form the set to be transformed. In TransBlock, we divided the transformed input image into 256 patches of size 16×16 to construct the global sequence. Each patch was further divided into 16 sub-patches of size 4×4 to construct the local sequence. We set the probability of each task-specific transformation to 0.6.
4.3 Evaluation Metrics
In current image fusion research, evaluating an image fusion algorithm is not a simple task due to the lack of ground truth fusion results. There are two widely adopted ways of evaluating fused images, namely subjective evaluation and objective evaluation Zhang (2021). Subjective evaluation assesses the fused image in terms of sharpness, luminance, contrast, etc. from the perspective of the observer. Objective evaluation assesses the fused images through objective evaluation metrics, but there is no consensus on the choice of metrics. Therefore, in order to provide a fair and comprehensive comparison with other fusion methods, we selected nine objective evaluation metrics focusing on four different aspects of the fused images and compared our method with state-of-the-art traditional and deep learning methods on each fusion task.
The evaluation metrics include information theory-based metrics Hou et al. (2020); Parmar et al. (2018); Prakash et al. (2019), image feature-based metrics Buzug (2011); Forbes (2012), image structural similarity-based metrics Poynton (2012); Gidaris et al. (2018), and human perception inspired metrics Forsyth and Ponce (2011); Chen and Varshney (2007). Among these metrics, a smaller value indicates better fusion performance for one metric, while larger values indicate better performance for all the others. For every fusion task, the averages of the objective metrics over the 20 test image pairs are reported.
4.4 Results
4.4.1 Comparison with Unified Image Fusion Frameworks
Our model is a unified image fusion framework, so we first compared it with the existing state-of-the-art unified image fusion algorithms U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b) and PMGI Zhang et al. (2020a) on all four image fusion tasks. The objective metrics of the unified fusion methods are shown in Table 1, where the four tasks, infrared-visible image fusion, multi-modal medical image fusion, multi-exposure image fusion and multi-focus image fusion, are denoted as IV, MED, ME and MF, respectively. It is apparent from Table 1 that our method achieves the best fusion performance on almost all metrics. The three unified fusion methods are also included in the subsequent per-task comparisons, so their subjective fusion results are discussed in detail in the following sections.
Task | Method | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
IV | U2Fusion | 0.3546 | 0.8055 | 0.4380 | 0.4502 | 0.3331 | 0.9196 | 0.7161 | 0.5637 | 451.9610 |
IFCNN | 0.3985 | 0.8050 | 0.5193 | 0.4670 | 0.3212 | 0.8726 | 0.7574 | 0.2701 | 436.1472 | |
PMGI | 0.3510 | 0.8049 | 0.4022 | 0.3461 | 0.1862 | 0.7471 | 0.6164 | 0.4896 | 706.8626 | |
ours | 0.4223 | 0.8083 | 0.5773 | 0.4986 | 0.3689 | 0.8733 | 0.7605 | 0.4054 | 280.1503 | |
MED | U2Fusion | 0.3471 | 0.8067 | 0.1664 | 0.4658 | 0.3550 | 0.8980 | 0.4174 | 0.4535 | 1166.1936 |
IFCNN | 0.4192 | 0.8071 | 0.2094 | 0.5232 | 0.3797 | 0.9183 | 0.4634 | 0.5134 | 933.5277 | |
PMGI | 0.2993 | 0.8068 | 0.1423 | 0.2756 | 0.2326 | 0.8188 | 0.3469 | 0.5166 | 1532.5396 | |
ours | 0.3604 | 0.8094 | 0.2301 | 0.5275 | 0.4620 | 0.9046 | 0.5073 | 0.5201 | 936.7991 | |
ME | U2Fusion | 0.4411 | 0.8146 | 0.4436 | 0.6146 | 0.7055 | 0.9349 | 0.6340 | 1.2437 | 174.8057 |
IFCNN | 0.5058 | 0.8141 | 0.6753 | 0.6890 | 0.6926 | 0.9445 | 0.7366 | 0.5889 | 192.5404 | |
PMGI | 0.4157 | 0.8150 | 0.4119 | 0.4578 | 0.5088 | 0.9083 | 0.5981 | 0.7557 | 297.7641 | |
ours | 0.5340 | 0.8271 | 0.6703 | 0.7322 | 0.7665 | 0.9579 | 0.7118 | 1.1042 | 192.2550 | |
MF | U2Fusion | 0.4653 | 0.8269 | 0.4650 | 0.6637 | 0.7936 | 0.9574 | 0.8487 | 1.1727 | 102.8688 |
IFCNN | 0.5438 | 0.8287 | 0.5845 | 0.7103 | 0.8317 | 0.9848 | 0.8971 | 0.9915 | 64.0155 | |
PMGI | 0.4366 | 0.8255 | 0.3860 | 0.4610 | 0.5417 | 0.8532 | 0.6555 | 0.9405 | 373.6547 | |
ours | 0.5620 | 0.8367 | 0.6666 | 0.7560 | 0.8374 | 0.9878 | 0.9049 | 0.9935 | 56.4455 |
4.4.2 Multi-Modal Image Fusion
Visible and Infrared Image Fusion
We compared our method with nine representative methods in infrared and visible image fusion task, including traditional methods (DWT Li et al. (1995), JSR Liu et al. (2017), JSRSD Liu et al. (2017)) and deep learning-based methods (U2Fusion Xu et al. (2020a), DeepFuse Ram Prabhakar et al. (2017), DenseFuse Li and Wu (2018), FusionGAN Ma et al. (2019), IFCNN Zhang et al. (2020b), PMGI Zhang et al. (2020a)). The fusion results are shown in Fig. 4, and the comparisons of objective evaluation metrics are shown in Table 2.

Subjective Evaluation
In Fig. 4, JSR, JSRSD, FusionGAN and PMGI show relatively disappointing fusion results due to noise and artifacts. DWT, DeepFuse and DenseFuse show relatively similar fusion results, where the fused images have low contrast, blurred details and blurred edges. U2Fusion, IFCNN and our method all achieve satisfactory fusion results, while our method preserves sharper texture details with higher contrast and sharpness. As can be seen from the red boxes in Fig. 4, our method highlights the key targets (the humans in (a1)-(b1), (a2)-(b2) and (a3)-(b3) and the house in (a4)-(b4)) and has the best visual effect.
Objective Evaluation
As can be seen in Table 2, among the nine objective metrics our method achieves the best results in four and the second best results in another four, the strongest showing among all compared methods. DenseFuse, which is specially designed for this task, also shows strong performance.
Task | Method | |||||||||
IV | DWT | 0.3544 | 0.8050 | 0.4991 | 0.4005 | 0.3038 | 0.8401 | 0.7051 | 0.2321 | 457.8833 |
JSR | 0.2166 | 0.8055 | 0.3754 | 0.4060 | 0.2121 | 0.8475 | 0.5992 | 0.3448 | 350.9396 | |
JSRSD | 0.1924 | 0.8056 | 0.3089 | 0.3746 | 0.1524 | 0.7636 | 0.5219 | 0.2778 | 391.0290 | |
U2Fusion | 0.3546 | 0.8055 | 0.4380 | 0.4502 | 0.3331 | 0.9196 | 0.7161 | 0.5637 | 451.9610 | |
DeepFuse | 0.3718 | 0.8051 | 0.4317 | 0.3746 | 0.3034 | 0.8473 | 0.6782 | 0.2523 | 458.7481 | |
DenseFuse | 0.4157 | 0.8083 | 0.5237 | 0.4754 | 0.3740 | 0.8131 | 0.7885 | 0.1978 | 275.0618 | |
FusionGAN | 0.2937 | 0.8080 | 0.3355 | 0.1923 | 0.0859 | 0.3963 | 0.3362 | 0.0797 | 2291.7420 | |
IFCNN | 0.3985 | 0.8050 | 0.5193 | 0.4670 | 0.3212 | 0.8726 | 0.7574 | 0.2701 | 436.1472 | |
PMGI | 0.3510 | 0.8049 | 0.4022 | 0.3461 | 0.1862 | 0.7471 | 0.6164 | 0.4896 | 706.8626 | |
ours | 0.4223 | 0.8083 | 0.5773 | 0.4986 | 0.3689 | 0.8733 | 0.7605 | 0.4054 | 280.1503 |
Medical Image Fusion
We compared our method with six current state-of-the-art methods on the medical image fusion task, including the traditional methods DWT Li et al. (1995), NSCT Zhang and Guo (2009) and PAPCNN Yin et al. (2018) and the deep learning methods U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b) and PMGI Zhang et al. (2020a). The fusion results are shown in Fig. 5, and the comparison of objective evaluation metrics is shown in Table 3.

Subjective Evaluation
The fusion results in Fig. 5 show that DWT, NSCT and PMGI obtain blurred fused images with low contrast and sharpness. PAPCNN introduces severe artifacts and noise. Although U2Fusion and IFCNN enhance the edges and detail information, their fused images have low contrast and inferior visual effect. In contrast, our method retains the best texture and functional information, while showing the optimal contrast and details.
Objective Evaluation
As can be seen in Table 3, our method achieves the best performance in six metrics and the second best performance in three metrics with a small difference from the best ones on the medical Harvard test dataset. In general, our method achieves the best fusion performance.
Task | Method | |||||||||
MED | DWT | 0.3588 | 0.8072 | 0.1858 | 0.4336 | 0.3409 | 0.8827 | 0.4960 | 0.4684 | 1143.3950 |
PAPCNN | 0.3518 | 0.8066 | 0.1634 | 0.4821 | 0.3299 | 0.8796 | 0.4250 | 0.5039 | 1078.2390 | |
U2Fusion | 0.3471 | 0.8067 | 0.1664 | 0.4658 | 0.3550 | 0.8980 | 0.4174 | 0.4535 | 1166.1936 | |
IFCNN | 0.4192 | 0.8071 | 0.2094 | 0.5232 | 0.3797 | 0.9183 | 0.4634 | 0.5134 | 933.5277 | |
PMGI | 0.2993 | 0.8068 | 0.1423 | 0.2756 | 0.2326 | 0.8188 | 0.3469 | 0.5166 | 1532.5396 | |
ours | 0.3604 | 0.8094 | 0.2301 | 0.5275 | 0.4620 | 0.9046 | 0.5073 | 0.5201 | 936.7991 |
4.4.3 Multi-Exposure Image Fusion
We compared our method with six current state-of-the-art methods on the multi-exposure image fusion task, including traditional methods (DWT Li et al. (1995), JSRSD Liu et al. (2017)) and deep learning-based methods (DeepFuse Ram Prabhakar et al. (2017), U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b), PMGI Zhang et al. (2020a)). The fusion results are shown in Fig. 6, and the comparisons of objective evaluation metrics are shown in Table 4.

Subjective Evaluation
Fig. 6 shows that JSRSD introduces severe noise and artifacts. DWT, DeepFuse and IFCNN fail to maintain appropriate luminance and lose contrast and details. U2Fusion, PMGI and our method achieve better fusion results. However, U2Fusion over-sharpens the details and edge information, resulting in an unnatural visual effect, and PMGI loses more details. In contrast, our fused images maintain the best luminance and have high contrast and sharpness. Overall, our fused images are more natural and achieve the best visual effect.
Objective Evaluation
As can be seen in Table 4, our method achieves the best performance in five metrics and the second best performance in the remaining four on the multi-exposure image fusion dataset. Overall, our method achieves the best fusion performance.
Task | Method | |||||||||
ME | DWT | 0.4727 | 0.8174 | 0.6077 | 0.5978 | 0.6601 | 0.9227 | 0.6933 | 0.4952 | 223.4195 |
JSRSD | 0.3026 | 0.8150 | 0.5428 | 0.6174 | 0.4461 | 0.9001 | 0.6771 | 0.8074 | 307.0123 | |
U2Fusion | 0.4411 | 0.8146 | 0.4436 | 0.6146 | 0.7055 | 0.9349 | 0.6340 | 1.2437 | 174.8057 | |
DeepFuse | 0.4462 | 0.8163 | 0.4289 | 0.5253 | 0.6330 | 0.9115 | 0.6501 | 0.6306 | 217.6652 | |
IFCNN | 0.5058 | 0.8141 | 0.6753 | 0.6890 | 0.6926 | 0.9445 | 0.7366 | 0.5889 | 192.5404 | |
PMGI | 0.4157 | 0.8150 | 0.4119 | 0.4578 | 0.5088 | 0.9083 | 0.5981 | 0.7557 | 297.7641 | |
ours | 0.5340 | 0.8271 | 0.6703 | 0.7322 | 0.7665 | 0.9579 | 0.7118 | 1.1042 | 192.2550 |
4.4.4 Multi-Focus Image Fusion
We compared our method with seven representative methods on the multi-focus image fusion task, including traditional methods (DWT Li et al. (1995), JSRSD Liu et al. (2017)) and deep learning-based methods (DeepFuse Ram Prabhakar et al. (2017), SESF-Fuse Ma et al. (2021), U2Fusion Xu et al. (2020a), IFCNN Zhang et al. (2020b), PMGI Zhang et al. (2020a)). The fusion results are shown in Fig. 7, and the comparison of objective evaluation metrics is shown in Table 5.

Subjective Evaluation
From Fig. 7, JSRSD and PMGI introduce some noise and obtain unsatisfactory fusion results. Although DWT, DeepFuse and SESF-Fuse successfully fuse the images, they still lose detailed information, such as the flag and the edge contours of the child marked by the red boxes. U2Fusion, IFCNN and our method all achieve satisfactory fusion results.
Objective Evaluation
We did not report the objective metrics of JSRSD owing to the artifacts in its fused images. As can be seen in Table 5, our method achieves the best performance in six metrics and the second best performance in the remaining three on the multi-focus image fusion dataset. Overall, our method achieves the best fusion performance.
Task | Method | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
MF | U2Fusion | 0.4653 | 0.8269 | 0.4650 | 0.6637 | 0.7936 | 0.9574 | 0.8487 | 1.1727 | 102.8688 |
DeepFuse | 0.4868 | 0.8292 | 0.4960 | 0.7104 | 0.7757 | 0.9812 | 0.8608 | 0.9033 | 71.3387 | |
IFCNN | 0.5438 | 0.8287 | 0.5845 | 0.7103 | 0.8317 | 0.9848 | 0.8971 | 0.9915 | 64.0155 | |
SESF-Fuse | 0.5565 | 0.8328 | 0.7138 | 0.7473 | 0.8357 | 0.9877 | 0.9039 | 0.8613 | 49.1273 | |
PMGI | 0.4366 | 0.8255 | 0.3860 | 0.4610 | 0.5417 | 0.8532 | 0.6555 | 0.9405 | 373.6547 | |
ours | 0.5620 | 0.8367 | 0.6666 | 0.7560 | 0.8374 | 0.9878 | 0.9049 | 0.9935 | 56.4455 |
5 Ablation Study
5.1 Ablation Study for TransBlock
To address the difficulty of CNN-based image fusion architectures in establishing long-range dependencies, we design TransBlock, which combines CNN with Transformer to enable the network to exploit both local and global information. To verify the effectiveness of TransBlock, we perform ablation experiments on all fusion tasks using 20% of the training data; the results are shown in Table 6. "3 Transformations" in the table denotes using the proposed three task-specific image transformations and the probability-based combination strategy for training. As shown in Fig. 1 (a), we remove the Transformer-Module in TransBlock along with the lower branch of the EnhanceBlock and use only the CNN-Module and the upper branch of the EnhanceBlock for comparison. As shown in Table 6, in almost all fusion tasks, no matter whether the three transformations are used or not, adding TransBlock improves the performance, which demonstrates its effectiveness.
Task | TransBlock | 3 Transformations | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
IV | 0.2775 | 0.8075 | 0.4669 | 0.4692 | 0.3265 | 0.8745 | 0.7158 | 0.3873 | 295.9092 | ||
✓ | 0.3135 | 0.8085 | 0.5172 | 0.4993 | 0.3606 | 0.8298 | 0.7773 | 0.3088 | 301.0812 | ||
✓ | 0.3204 | 0.8079 | 0.5035 | 0.4857 | 0.3592 | 0.8748 | 0.7484 | 0.3879 | 291.0118 | ||
✓ | ✓ | 0.4151 | 0.8086 | 0.5852 | 0.5068 | 0.3693 | 0.8650 | 0.7792 | 0.3714 | 283.3448 | |
Task | TransBlock | 3 Transformations | |||||||||
MED | 0.2373 | 0.8087 | 0.1837 | 0.5254 | 0.4156 | 0.8891 | 0.4661 | 0.4998 | 860.2159 | ||
✓ | 0.2373 | 0.8087 | 0.2049 | 0.3117 | 0.4500 | 0.8891 | 0.4865 | 0.4998 | 923.1668 | ||
✓ | 0.2838 | 0.8084 | 0.1971 | 0.5177 | 0.4491 | 0.9088 | 0.4897 | 0.5347 | 868.1148 | ||
✓ | ✓ | 0.3583 | 0.8093 | 0.2214 | 0.5280 | 0.4665 | 0.9009 | 0.4997 | 0.5147 | 938.6183 | |
Task | TransBlock | 3 Transformations | |||||||||
ME | 0.3269 | 0.8178 | 0.4540 | 0.6446 | 0.6345 | 0.9538 | 0.6581 | 1.0803 | 192.0676 | ||
✓ | 0.3723 | 0.8185 | 0.5159 | 0.6886 | 0.6973 | 0.9628 | 0.7070 | 1.0048 | 185.4562 | ||
✓ | 0.3917 | 0.8201 | 0.5162 | 0.6911 | 0.7107 | 0.9606 | 0.6962 | 1.0959 | 177.6318 | ||
✓ | ✓ | 0.5199 | 0.8262 | 0.6482 | 0.7318 | 0.7642 | 0.9594 | 0.7155 | 1.0930 | 191.1575 | |
Task | TransBlock | 3 Transformations | |||||||||
MF | 0.3656 | 0.8305 | 0.4660 | 0.6920 | 0.7464 | 0.9826 | 0.8482 | 0.9903 | 79.0899 | ||
✓ | 0.4063 | 0.8315 | 0.5017 | 0.7186 | 0.7950 | 0.9839 | 0.8689 | 0.9950 | 72.6904 | ||
✓ | 0.4241 | 0.8324 | 0.5167 | 0.7244 | 0.8032 | 0.9845 | 0.8797 | 1.0108 | 68.7478 | ||
✓ | ✓ | 0.5478 | 0.8361 | 0.6391 | 0.7530 | 0.8357 | 0.9875 | 0.9031 | 0.9928 | 58.7026 |
To further explain the effectiveness of TransBlock, we visualize in Fig. 8 the image reconstruction results of the pure CNN architecture (only the CNN-Module in TransBlock) and of TransBlock. We can see that the image reconstructed with CNN alone loses some texture and detail information of the original image, such as the typical areas shown in the red boxes. In contrast, the image reconstructed with TransBlock has higher resolution and more detail, which indicates that the encoding network with TransBlock performs better in extracting local and global features of the image.

5.2 Ablation Study for Task-specific Self-Supervised Training Scheme
We design task-specific self-supervised auxiliary tasks for multi-modal, multi-exposure and multi-focus image fusion, based respectively on pixel intensity non-linear transformation, brightness transformation and noise transformation. In this experiment, we evaluate the effectiveness of each auxiliary task and study whether the performance is further improved by combining all three transformations in our proposed model. The results are shown in Table 7. Most of the fusion metrics are improved by using the corresponding self-supervised auxiliary task, which indicates the effectiveness of every single one of them. The effectiveness of our proposed combination strategy is demonstrated by the fact that most of the objective fusion metrics are further improved by the proposed model.
Task | Nonlinear | Brightness | Noise | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
IV | 0.3135 | 0.8085 | 0.5172 | 0.4993 | 0.3606 | 0.8298 | 0.7773 | 0.3088 | 301.0812 | |||
✓ | 0.3522 | 0.8086 | 0.5404 | 0.5061 | 0.3674 | 0.8489 | 0.7773 | 0.3451 | 296.2987 | |||
✓ | ✓ | ✓ | 0.4151 | 0.8086 | 0.5852 | 0.5068 | 0.3693 | 0.8650 | 0.7792 | 0.3714 | 283.3448 | |
Task | Nonlinear | Brightness | Noise | |||||||||
MED | 0.2373 | 0.8087 | 0.2049 | 0.3117 | 0.4500 | 0.8891 | 0.4865 | 0.4998 | 923.1668 | |||
✓ | 0.2810 | 0.8088 | 0.2203 | 0.5329 | 0.4653 | 0.8985 | 0.4961 | 0.5187 | 915.8077 | |||
✓ | ✓ | ✓ | 0.3583 | 0.8093 | 0.2214 | 0.5280 | 0.4665 | 0.9009 | 0.4997 | 0.5147 | 938.6183 | |
Task | Nonlinear | Brightness | Noise | |||||||||
ME | 0.3723 | 0.8185 | 0.5159 | 0.6886 | 0.6973 | 0.9628 | 0.7070 | 1.0048 | 185.4562 | |||
✓ | 0.4141 | 0.8208 | 0.5406 | 0.7084 | 0.7298 | 0.9598 | 0.6991 | 1.0903 | 184.2343 | |||
✓ | ✓ | ✓ | 0.5199 | 0.8262 | 0.6482 | 0.7318 | 0.7642 | 0.9594 | 0.7155 | 1.0930 | 191.1575 | |
Task | Nonlinear | Brightness | Noise | |||||||||
MF | 0.4063 | 0.8315 | 0.5017 | 0.7186 | 0.7950 | 0.9839 | 0.8689 | 0.9950 | 72.6904 | |||
✓ | 0.4561 | 0.8334 | 0.5539 | 0.7387 | 0.8235 | 0.9857 | 0.8904 | 1.0081 | 64.0137 | |||
✓ | ✓ | ✓ | 0.5478 | 0.8361 | 0.6391 | 0.7530 | 0.8357 | 0.9875 | 0.9031 | 0.9928 | 58.7026 |
Task | Method |  |  |  |  |  |  |  |  |  
---|---|---|---|---|---|---|---|---|---|---
IV | N1_R256 | 0.1947 | 0.8054 | 0.3667 | 0.3672 | 0.1800 | 0.8288 | 0.3382 | 0.6367 | 392.0450
IV | N2_R8 | 0.3307 | 0.8087 | 0.5250 | 0.5019 | 0.3639 | 0.8455 | 0.3423 | 0.7716 | 297.5007
IV | N2_R16 | 0.3319 | 0.8085 | 0.5379 | 0.5053 | 0.3625 | 0.8525 | 0.3253 | 0.7667 | 300.5854
IV | N2_R32 | 0.3282 | 0.8085 | 0.5258 | 0.5005 | 0.3618 | 0.8486 | 0.3363 | 0.7826 | 300.3576
IV | N2_R64 | 0.3122 | 0.8081 | 0.5010 | 0.4952 | 0.3551 | 0.8414 | 0.3214 | 0.7747 | 298.4003
IV | N4_R8 | 0.3163 | 0.8083 | 0.5222 | 0.5007 | 0.3563 | 0.8524 | 0.3331 | 0.7764 | 301.0319
IV | N4_R16 | 0.3405 | 0.8088 | 0.5379 | 0.5035 | 0.3642 | 0.8367 | 0.3270 | 0.7872 | 303.8792
IV | N4_R32 | 0.3169 | 0.8083 | 0.5162 | 0.4973 | 0.3538 | 0.8467 | 0.3328 | 0.7821 | 300.4295
IV | N4_R64 | 0.2702 | 0.8060 | 0.4390 | 0.4528 | 0.3021 | 0.8646 | 0.3448 | 0.7247 | 341.9801
IV | N8_R8 | 0.2702 | 0.8060 | 0.5336 | 0.4528 | 0.3613 | 0.8646 | 0.3448 | 0.7811 | 295.8375
IV | N8_R16 | 0.3272 | 0.8086 | 0.5244 | 0.5034 | 0.3637 | 0.8480 | 0.3446 | 0.7776 | 295.2036
IV | N8_R32 | 0.2964 | 0.8075 | 0.4949 | 0.4826 | 0.3348 | 0.8565 | 0.3246 | 0.7694 | 315.0558
IV | N8_R64 | 0.2575 | 0.8053 | 0.4228 | 0.4184 | 0.2559 | 0.8211 | 0.2717 | 0.6960 | 428.7837
MED | N1_R256 | 0.2075 | 0.8105 | 0.3116 | 0.3584 | 0.2832 | 0.8848 | 0.8795 | 0.4563 | 306.8308
MED | N2_R8 | 0.3695 | 0.8155 | 0.4446 | 0.6374 | 0.5436 | 0.8341 | 1.1028 | 0.7219 | 214.4595
MED | N2_R16 | 0.3753 | 0.8158 | 0.4490 | 0.6450 | 0.5548 | 0.8380 | 1.0856 | 0.7313 | 219.3057
MED | N2_R32 | 0.3663 | 0.8152 | 0.4321 | 0.6313 | 0.5482 | 0.8334 | 1.0987 | 0.7141 | 211.9355
MED | N2_R64 | 0.3365 | 0.8121 | 0.4374 | 0.6065 | 0.4909 | 0.8277 | 0.9498 | 0.6878 | 305.5588
MED | N4_R8 | 0.3536 | 0.8147 | 0.4365 | 0.6300 | 0.5440 | 0.8392 | 1.0878 | 0.7145 | 219.4811
MED | N4_R16 | 0.3823 | 0.8159 | 0.4556 | 0.6402 | 0.5491 | 0.8272 | 1.1091 | 0.7224 | 224.6704
MED | N4_R32 | 0.3503 | 0.8144 | 0.4311 | 0.6209 | 0.5300 | 0.8366 | 1.0330 | 0.7083 | 220.9103
MED | N4_R64 | 0.2873 | 0.8091 | 0.4061 | 0.5632 | 0.3881 | 0.8151 | 0.8239 | 0.6460 | 259.3240
MED | N8_R8 | 0.2873 | 0.8091 | 0.4363 | 0.5632 | 0.5573 | 0.8151 | 0.8239 | 0.7218 | 212.2687
MED | N8_R16 | 0.3646 | 0.8151 | 0.4336 | 0.6261 | 0.5449 | 0.8323 | 1.1038 | 0.7199 | 214.0531
MED | N8_R32 | 0.3195 | 0.8115 | 0.4452 | 0.5948 | 0.4620 | 0.8441 | 0.8939 | 0.6812 | 325.3530
MED | N8_R64 | 0.2722 | 0.8087 | 0.4074 | 0.5372 | 0.3868 | 0.8361 | 0.6873 | 0.6441 | 435.6439
ME | N1_R256 | 0.1518 | 0.8061 | 0.1305 | 0.2886 | 0.1937 | 0.8387 | 0.4201 | 0.3354 | 1525.4510
ME | N2_R8 | 0.2634 | 0.8077 | 0.1568 | 0.4223 | 0.3880 | 0.8930 | 0.4913 | 0.4365 | 1171.2512
ME | N2_R16 | 0.2636 | 0.8078 | 0.1553 | 0.4114 | 0.3853 | 0.8883 | 0.4844 | 0.4286 | 1186.1036
ME | N2_R32 | 0.2501 | 0.8076 | 0.1536 | 0.4075 | 0.3814 | 0.8884 | 0.4848 | 0.4240 | 1199.2137
ME | N2_R64 | 0.2349 | 0.8074 | 0.1520 | 0.4068 | 0.3688 | 0.8853 | 0.4868 | 0.4235 | 1218.6177
ME | N4_R8 | 0.2562 | 0.8076 | 0.1532 | 0.4103 | 0.3833 | 0.8905 | 0.4857 | 0.4235 | 1161.3674
ME | N4_R16 | 0.2637 | 0.8078 | 0.1553 | 0.4214 | 0.3859 | 0.8983 | 0.4944 | 0.4286 | 1186.1036
ME | N4_R32 | 0.2405 | 0.8075 | 0.1521 | 0.4061 | 0.3742 | 0.8900 | 0.4908 | 0.4299 | 1202.8124
ME | N4_R64 | 0.2011 | 0.8068 | 0.1445 | 0.3632 | 0.3224 | 0.8573 | 0.4441 | 0.3914 | 1452.7672
ME | N8_R8 | 0.2591 | 0.8077 | 0.1535 | 0.4113 | 0.3857 | 0.8908 | 0.4872 | 0.4296 | 1175.6538
ME | N8_R16 | 0.2578 | 0.8076 | 0.1542 | 0.4103 | 0.3844 | 0.8926 | 0.4884 | 0.4279 | 1186.3845
ME | N8_R32 | 0.2293 | 0.8072 | 0.1493 | 0.3753 | 0.3565 | 0.8745 | 0.4626 | 0.4047 | 1376.2388
ME | N8_R64 | 0.1908 | 0.8063 | 0.1393 | 0.3339 | 0.2599 | 0.7804 | 0.3543 | 0.3720 | 2190.5549
MF | N1_R256 | 0.2453 | 0.8219 | 0.3296 | 0.4923 | 0.4733 | 0.9256 | 1.0105 | 0.6825 | 173.9629
MF | N2_R8 | 0.4321 | 0.8328 | 0.5291 | 0.7311 | 0.8086 | 0.9801 | 0.9967 | 0.8823 | 72.3116
MF | N2_R16 | 0.4354 | 0.8329 | 0.5324 | 0.7327 | 0.8102 | 0.9845 | 0.9948 | 0.8850 | 70.5472
MF | N2_R32 | 0.4279 | 0.8326 | 0.5208 | 0.7274 | 0.8058 | 0.9843 | 0.9984 | 0.8820 | 74.6641
MF | N2_R64 | 0.4024 | 0.8292 | 0.4937 | 0.7146 | 0.7901 | 0.9833 | 0.9635 | 0.8657 | 81.7729
MF | N4_R8 | 0.4152 | 0.8322 | 0.5117 | 0.7246 | 0.7993 | 0.9839 | 0.9956 | 0.8733 | 73.1741
MF | N4_R16 | 0.4421 | 0.8331 | 0.5338 | 0.7311 | 0.8103 | 0.9850 | 0.9825 | 0.8839 | 74.4669
MF | N4_R32 | 0.4119 | 0.8315 | 0.5038 | 0.7214 | 0.7970 | 0.9839 | 1.0001 | 0.8751 | 76.0166
MF | N4_R64 | 0.3498 | 0.8256 | 0.4420 | 0.6660 | 0.7205 | 0.9742 | 0.8940 | 0.8151 | 101.5177
MF | N8_R8 | 0.3498 | 0.8256 | 0.5293 | 0.6660 | 0.8100 | 0.9742 | 0.8940 | 0.8819 | 72.5735
MF | N8_R16 | 0.4290 | 0.8326 | 0.5216 | 0.7293 | 0.8087 | 0.9841 | 0.9999 | 0.8823 | 73.7230
MF | N8_R32 | 0.3846 | 0.8295 | 0.4725 | 0.7023 | 0.7746 | 0.9817 | 0.9561 | 0.8510 | 86.1646
MF | N8_R64 | 0.3297 | 0.8222 | 0.4106 | 0.6277 | 0.6543 | 0.9413 | 0.7608 | 0.7628 | 153.8210
5.3 Ablation Study for the Number and Area of Transformed Subregions
In our proposed method, we randomly generate four image subregions of size 16×16 from a 256×256 training image to form the set of regions to be transformed. Obviously, both the number and the size of the transformed subregions affect the effectiveness of the trained network. For example, when the subregions are too large (up to transforming the whole image), it becomes very difficult for the network to reconstruct the original image; at the same time, because most of the original image is destroyed, the network cannot learn useful features. Conversely, when the subregions are too small, the effect of the transformation is also small and may not encourage the network to learn better task-specific features. We therefore conducted ablation experiments on different combinations of the number and size of the subregions, and the results are shown in Table 8. In this experiment, we used 20% of the training data and trained for 30 epochs.
In Table 8, N denotes the number of transformed subregions and R denotes the size of each transformed subregion. For example, N1_R256 corresponds to transforming the whole image, and N4_R16 corresponds to transforming four randomly selected 16×16 regions. It can be seen that N4_R16 achieves the best performance overall.
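For reference, the following is a minimal NumPy sketch of this subregion destruction; the defaults correspond to the best setting above (four 16×16 subregions per 256×256 training image), and the transform argument can be any of the auxiliary transformations described in Section 5.2.

```python
# A minimal sketch of subregion destruction: N randomly placed R x R patches of a
# training image are transformed while the rest is kept intact, and the network is
# trained to reconstruct the original. Defaults correspond to N4_R16.
import numpy as np


def destroy_subregions(image, transform, n_regions=4, region_size=16, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    destroyed = image.copy()
    h, w = image.shape[:2]  # e.g. 256 x 256 training crops
    for _ in range(n_regions):
        top = int(rng.integers(0, h - region_size + 1))
        left = int(rng.integers(0, w - region_size + 1))
        region = destroyed[top:top + region_size, left:left + region_size]
        destroyed[top:top + region_size, left:left + region_size] = transform(region)
    return destroyed  # network input; the undamaged original is the reconstruction target
```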
6 Discussion
In this study, we propose TransFuse, a unified Transformer-based image fusion framework trained by self-supervised learning, which can be effectively applied to different image fusion tasks, including multi-modal, multi-exposure and multi-focus image fusion. We propose three destruction-reconstruction self-supervised auxiliary tasks for multi-modal, multi-exposure and multi-focus image fusion, and integrate them by randomly choosing one of the three to destroy a natural image during model training. This enables our network to be trained on readily accessible large natural image datasets while learning task-specific features. In addition, we design a new encoder that combines CNN and Transformer for feature extraction, so that the model can exploit both local and global information more comprehensively. Extensive experiments show that our framework achieves new state-of-the-art performance in both subjective and objective evaluations on all common image fusion tasks.
Notably, the vital information to be fused varies greatly across fusion tasks, since the source images have different characteristics. The three destruction-reconstruction self-supervised auxiliary tasks are designed specifically according to the characteristics of the source images in each fusion task, so that our framework can learn task-specific features during reconstruction. Furthermore, we integrate the three tasks in model training to encourage the different fusion tasks to promote each other and to increase the generalization ability of the trained network, enabling our framework to handle different image fusion tasks in a unified way.
Our framework also has some limitations. First, although we use parameter-shared Fine-grained Transformers, introducing the Transformer for feature extraction still makes our model (43.18 MB of parameters) larger than existing methods (generally 0.3-3 MB of parameters). Fortunately, a common GPU such as the NVIDIA GTX 1080Ti is still sufficient for model training. Second, we do not propose new fusion rules for the different fusion tasks; more effective task-specific fusion rules could be designed in the future to further improve fusion performance.
In future work, more effective Transformer-based feature extraction methods can be explored for image fusion tasks. Moreover, since CNN-based and Transformer-based architectures each have their own advantages, how to further combine the two architectures is another promising research direction. More effective self-supervised auxiliary tasks may also be designed to encourage the network to learn more useful features according to the characteristics of different source images.
In addition, our proposed self-supervised tasks and training scheme help the model learn more robust and generalizable features, and the proposed Transformer-based feature extraction modules exploit both local and global information more comprehensively. We therefore expect our method to have broad application prospects and large development potential in image fusion and related computer vision fields.