
TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning

12/02/2021
by Linhao Qu, et al.
Fudan University

In this paper, we propose TransMEF, a transformer-based multi-exposure image fusion framework that uses self-supervised multi-task learning. The framework is based on an encoder-decoder network, which can be trained on large natural image datasets and does not require ground truth fusion images. We design three self-supervised reconstruction tasks according to the characteristics of multi-exposure images and conduct these tasks simultaneously using multi-task learning; through this process, the network can learn the characteristics of multi-exposure images and extract more generalized features. In addition, to compensate for the defect in establishing long-range dependencies in CNN-based architectures, we design an encoder that combines a CNN module with a transformer module. This combination enables the network to focus on both local and global information. We evaluated our method and compared it to 11 competitive traditional and deep learning-based methods on the latest released multi-exposure image fusion benchmark dataset, and our method achieved the best performance in both subjective and objective evaluations.


1 Introduction

Due to the low dynamic range (LDR) of common imaging sensors, a single image often suffers from underexposure or overexposure and fails to depict the high dynamic range (HDR) of luminance levels in natural scenes. The multi-exposure image fusion (MEF) technique provides an economical and effective solution by fusing LDR images with different exposures into a single HDR image and thus is widely used in HDR imaging for mobile devices Zhang (2021); Reinhard et al. (2010); Hasinoff et al. (2016); Shen et al. (2011).

The study of MEF has a long history, and a series of traditional methods have been proposed Li et al. (1995); Liu and Wang (2015); Lee et al. (2018); Ma and Wang (2015); Ma et al. (2017b, a). However, their performances are limited because weak hand-crafted representations have low generalizability and are not robust to varying input conditions Zhang (2021); Zhang et al. (2020b); Xu et al. (2020a).

Recently, deep learning-based algorithms have gradually become mainstream in the MEF field. In these methods, two source images with different exposures are directly input into a fusion network, and the fused image is obtained from the output of the network. The fusion networks can be trained in a common supervised way using ground truth fusion images Zhang et al. (2020b); Wang et al. (2018); Li and Zhang (2018) or in an unsupervised way by encouraging the fused image to retain different aspects of the important information in the source images Xu et al. (2020a); Ram Prabhakar et al. (2017); Xu et al. (2020b); Zhang et al. (2020a); Ma et al. (2019). However, both supervised and unsupervised MEF methods require a large amount of multi-exposure data for training. Although many researchers Ram Prabhakar et al. (2017); Cai et al. (2018); Zeng et al. (2014) have collected various multi-exposure datasets, their quantities are not comparable to large natural image datasets such as ImageNet Deng et al. (2009) or MS-COCO Lin et al. (2014). The absence of large amounts of training data generally leads to overfitting or tedious parameter optimization. In addition, ground truth is also in high demand for supervised MEF methods but is not commonly available in the field Zhang (2021). Some researchers synthesize ground truth images Wang et al. (2018) or use the fusion results from other methods as ground truth for training Yin et al. (2020); Chen and Chuang (2020); Xu et al. (2020c). However, these ground truth images are not real, and using them leads to inferior performance.

Moreover, all the existing deep learning-based MEF methods utilize convolutional neural networks (CNNs) for feature extraction, but it is difficult for CNNs to model long-range dependencies due to their small receptive field, which is an inherent limitation. In image fusion, the quality of the fused images is related to the pixels within the receptive field as well as to the pixel intensity and texture of the entire image. Therefore, modeling both global and local dependencies is required.

To address the above issues, we propose TransMEF, a transformer-based multi-exposure image fusion framework that uses self-supervised multi-task learning. The framework is based on an encoder-decoder network and is trained on a large natural image dataset using self-supervised image reconstruction to avoid training with multi-exposure images. During the fusion phase, we first apply the trained encoder to extract feature maps from two source images and then apply the trained decoder to generate a fused image from the fused feature maps. We also design three self-supervised reconstruction tasks according to the characteristics of multi-exposure images to train the network so that our network learns these characteristics more effectively.

In addition, we design an encoder that includes both a CNN module and a transformer module so that the encoder can utilize both local and global information. Extensive experiments demonstrate the effectiveness of the self-supervised reconstruction tasks as well as the transformer-based encoder and show that our method outperforms the state-of-the-art MEF methods in both subjective and objective evaluations.

The main contributions of this paper are summarized as follows:

  • We propose three self-supervised reconstruction tasks according to the characteristics of multi-exposure images and train an encoder-decoder network using multi-task learning so that our network is not only able to be trained on large natural image datasets but also learns the characteristics of multi-exposure images.

  • To compensate for the defect in establishing long-range dependencies in CNN-based architectures, we design an encoder that combines a CNN module with a transformer module, which enables the network to utilize both local and global information during feature extraction.

  • To provide a fair and comprehensive comparison with other fusion methods, we used the latest released multi-exposure image fusion benchmark dataset Zhang (2021) as the test dataset. We selected 12 objective evaluation metrics from four perspectives and compared our method to 11 competitive traditional and deep learning-based methods in the MEF field. Our method achieves the best performance in both subjective and objective evaluations. The code will be publicly available.

2 Related Work

2.1 Traditional MEF Algorithms

Traditional MEF methods can be further classified into spatial domain-based fusion methods Liu and Wang (2015); Lee et al. (2018); Ma and Wang (2015); Ma et al. (2017b, a) and transform domain-based fusion methods Li et al. (1995); Burt and Kolczynski (1993); Mertens et al. (2007); Kou et al. (2017). Spatial domain methods calculate the fused image's pixel values directly from the source images' pixel values, and three types of techniques are commonly used to fuse images in the spatial domain, namely, pixel-based methods Liu and Wang (2015); Lee et al. (2018), patch-based methods Ma and Wang (2015); Ma et al. (2017b) and optimization-based methods Ma et al. (2017a).

In transform domain-based fusion algorithms, the source images are first transformed to a specific transform domain (such as the wavelet domain) to obtain different frequency components, and then appropriate fusion rules are used to fuse different frequency components. Finally, the fused images are obtained by inversely transforming the fused frequency components. Commonly used transform methods include pyramid transform Burt and Kolczynski (1993), Laplacian pyramid Mertens et al. (2007), wavelet transform Li et al. (1995), and edge-preserving smoothing Kou et al. (2017), among others.

Although traditional methods achieve promising fusion results, weak hand-crafted representations with low generalizability hinder further improvement.

2.2 Deep-Learning Based MEF Algorithms

In deep learning-based algorithms, two source images with different exposures are directly input into a fusion network, and the network outputs the fused image. The fusion networks can be trained with ground truth fusion images Zhang et al. (2020b); Wang et al. (2018); Li and Zhang (2018) or similarity metric-based loss functions Xu et al. (2020a); Ram Prabhakar et al. (2017); Xu et al. (2020b); Zhang et al. (2020a); Ma et al. (2019).

Due to the lack of real ground truth fusion images in the MEF field, several methods have been proposed to synthesize ground truth. For example, Wang et al. Wang et al. (2018) generated ground truth data by changing the pixel intensity of normal images Deng et al. (2009), while other researchers utilized the fusion results from other MEF methods as ground truth Yin et al. (2020); Chen and Chuang (2020); Xu et al. (2020c). However, these ground truth images are not real, leading to inferior fusion performance.

In addition to training with ground truth images, another research direction is to train the fusion network with similarity metric-based loss functions that encourage the fusion image to retain important information from different aspects of the source images. For example, Prabhakar et al. Ram Prabhakar et al. (2017) applied a no-reference image quality metric (MEF-SSIM) as a loss function. Zhang et al. Zhang et al. (2020a) designed a loss function based on gradient and intensity information to perform unsupervised training. Xu et al. Xu et al. (2020a, b) proposed U2Fusion, in which a fusion network is trained to preserve the adaptive similarity between the fusion result and the source images. Although these methods do not require ground truth images, a large number of multi-exposure images are still in high demand for training. Although several multi-exposure datasets Ram Prabhakar et al. (2017); Cai et al. (2018); Zeng et al. (2014) have been collected, their quantities are incomparable to large natural image datasets such as ImageNet Deng et al. (2009) or MS-COCO Lin et al. (2014). The absence of large amounts of training data leads to overfitting or tedious parameter optimization.

Notably, researchers have already utilized encoder-decoder networks in infrared and visible image fusion Li and Wu (2018) as well as multi-focus image fusion tasks Ma et al. (2021). They trained the encoder-decoder network on a natural image dataset, but such a network cannot effectively learn the characteristics of multi-exposure images due to the domain discrepancy. In contrast, we design three self-supervised reconstruction tasks according to the characteristics of multi-exposure images so that our network can not only be trained on large natural image datasets but will also be able to learn the characteristics of multi-exposure images.

Figure 1: TransMEF image fusion framework. (a) The proposed self-supervised image reconstruction network, which uses multi-task learning. (b) The image fusion architecture. (c) Detailed structures of Transformer and ConvBlock.

3 Method

3.1 Framework Overview

As shown in Figure 1, our framework is an encoder-decoder-based architecture. We train the network via image reconstruction on a large natural image dataset. During the fusion phase, we apply the trained encoder to extract feature maps from a pair of source images, fuse the two feature maps, and input the fused maps into the decoder to generate the fused image.

The framework training process is shown in Figure 1 (a), where we use the network to perform self-supervised image reconstruction tasks, i.e., to reconstruct the original image from the destroyed input image. Concretely, given an original image, three different destroyed images are generated by destroying several subregions using one of three different transformations (a Gamma-based transformation, a Fourier-based transformation, and global region shuffling). The destroyed images are input into the Encoder, which consists of a feature extraction module, TransBlock, and a feature enhancement module, EnhanceBlock. TransBlock uses a CNN module and a transformer module for feature extraction. The destroyed images are directly input into the CNN-Module, and concurrently, they are divided into patches, which are then input into the Transformer-Module. The EnhanceBlock aggregates and enhances the feature maps extracted from the CNN-Module and the Transformer-Module. Finally, the image features extracted by the Encoder are used by the Decoder to obtain the reconstructed image. We utilize a multi-task learning approach that simultaneously performs the three self-supervised reconstruction tasks. The detailed structure of the Encoder and the three reconstruction tasks are introduced in Sections 3.2 and 3.3, respectively.

The trained Encoder and Decoder are then used for image fusion, as shown in Figure 1 (b). Specifically, the two source images are first input to the Encoder for feature encoding, and the extracted feature maps are then fused using the Fusion Rule to obtain the fused feature maps. Finally, the fused image is reconstructed by the Decoder. The fusion rule is described in detail in Section 3.4. Here, we only introduce the framework's pipeline for single-channel grayscale image fusion; the fusion of color images is elaborated in Section 3.5.

3.2 Transformer-Based Encoder-Decoder Framework

Encoder-Decoder Framework for Image Reconstruction

The encoder-decoder network is shown in Figure 1 (a). Using a single self-supervised reconstruction task as an example, given a training image, we first randomly generate 10 image subregions to form the set of subregions to be transformed, where the height and width of each subregion are random values uniformly sampled from the positive integer set [1, 25]. After that, we transform each subregion in the set with an image transform tailored for multi-exposure image fusion (the three different transformations are described in detail in Section 3.3) to obtain the set of transformed subregions, which are then used to replace the original subregions and produce the transformed image. In Figure 1 (a), the three transformations are based on the Gamma transform, the Fourier transform and global region shuffling, respectively.

The Encoder contains a feature extraction module, TransBlock, and a feature enhancement module, EnhanceBlock. The detailed architecture of TransBlock is introduced in the following section. The feature enhancement module, EnhanceBlock, aggregates and enhances the feature maps extracted by TransBlock so that the Encoder can better integrate the global and local features. Concretely, we concatenate the two feature maps from the CNN-Module and the Transformer-Module in TransBlock and input them into two sequentially connected ConvBlock layers to achieve feature enhancement. As shown in Figure 1 (c), each ConvBlock consists of two convolutional layers with a kernel size of 3×3, a padding of 1 and one ReLU activation layer. The Decoder contains two sequentially connected ConvBlock layers and a final 1×1 convolution to reconstruct the original image.
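For concreteness, the following is a minimal PyTorch sketch of a ConvBlock and the Decoder as described above; the channel widths are illustrative assumptions rather than the paper's actual values.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions (padding 1) with one ReLU activation, as described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

class Decoder(nn.Module):
    """Two sequentially connected ConvBlocks followed by a final 1x1 convolution."""
    def __init__(self, in_ch=64, mid_ch=32):
        super().__init__()
        self.body = nn.Sequential(ConvBlock(in_ch, mid_ch), ConvBlock(mid_ch, mid_ch))
        self.out = nn.Conv2d(mid_ch, 1, kernel_size=1)  # single grayscale output channel

    def forward(self, x):
        return self.out(self.body(x))
```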

TransBlock: A Powerful Feature Extractor

Inspired by TransUnet Chen et al. (2021) and ViT Dosovitskiy et al. (2020), we propose a feature extraction module, TransBlock, that combines the CNN and transformer architectures to model both local and global dependencies in images. The architecture of TransBlock is shown in Figure 1 (a). Specifically, the CNN-Module consists of three sequentially connected ConvBlock layers, and its input is the destroyed image. Simultaneously, the destroyed image is divided into M non-overlapping patches, which are flattened to construct the input sequence. The sequence is fed into the Transformer-Module, which starts with a linear patch-embedding projection E to obtain the encoded sequence features. The encoded sequence then passes through L Transformer layers. Figure 1 (c) illustrates the architecture of one Transformer layer, which consists of a multi-head self-attention (MSA) block and a multi-layer perceptron (MLP) block, where layer normalization (LN) is applied before every block and a residual connection is applied after every block. The MLP block consists of two linear layers with a GELU activation function.
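A minimal PyTorch sketch of a TransBlock-style extractor is shown below: a small CNN branch and a ViT-style transformer branch over 16×16 patches, using nn.TransformerEncoderLayer for the pre-LN MSA + GELU-MLP layers. The patch size, embedding dimension, depth, channel widths, and the way token features are folded back into a spatial map for the EnhanceBlock are all assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransBlockSketch(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        # CNN branch (a stand-in for three ConvBlocks): local feature extraction
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Transformer branch: linear patch embedding E + L pre-LN transformer layers
        self.patch = patch
        n_patches = (img_size // patch) ** 2
        self.embed = nn.Linear(patch * patch, dim)               # projection E
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True)  # MSA + MLP, pre-LN
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.unembed = nn.Linear(dim, patch * patch)             # tokens -> pixels

    def forward(self, x):                      # x: (B, 1, H, W) grayscale image
        b, _, h, w = x.shape
        local_feat = self.cnn(x)               # (B, 32, H, W)
        # split into non-overlapping P x P patches and flatten each one
        patches = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.contiguous().view(b, -1, self.patch * self.patch)
        tokens = self.transformer(self.embed(patches) + self.pos)
        # fold token features back into an image-sized single-channel map
        g = self.unembed(tokens).view(b, h // self.patch, w // self.patch,
                                      self.patch, self.patch)
        global_feat = g.permute(0, 1, 3, 2, 4).reshape(b, 1, h, w)
        # the two feature maps are concatenated and passed to the EnhanceBlock
        return torch.cat([local_feat, global_feat], dim=1)
```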

Loss Function

Our architecture applies a multi-task learning approach to simultaneously perform the three self-supervised reconstruction tasks using the following loss function:

$$\mathcal{L}_{total} = \mathcal{L}_{task_1} + \mathcal{L}_{task_2} + \mathcal{L}_{task_3} \qquad (1)$$

where $\mathcal{L}_{total}$ denotes the overall loss function, and $\mathcal{L}_{task_1}$, $\mathcal{L}_{task_2}$ and $\mathcal{L}_{task_3}$ are the loss functions of the three self-supervised reconstruction tasks.

In each reconstruction task, we encourage the network to not only learn the pixel-level image reconstruction, but also capture the structural and gradient information in the image. Therefore, the loss of each reconstruction task contains three parts and is defined as follows:

$$\mathcal{L}_{task} = \mathcal{L}_{mse} + \alpha \, \mathcal{L}_{ssim} + \lambda \, \mathcal{L}_{TV} \qquad (2)$$

where $\mathcal{L}_{mse}$ is the mean square error (MSE) loss function, $\mathcal{L}_{ssim}$ is the structural similarity (SSIM) loss function, and $\mathcal{L}_{TV}$ is the total variation loss function. $\alpha$ and $\lambda$ are two hyperparameters that are empirically set to 20.

The MSE loss is used to ensure pixel-level reconstruction and is defined as follows:

$$\mathcal{L}_{mse} = \| I_{out} - I_{in} \|_2^2 \qquad (3)$$

where $I_{out}$ is the output, the image reconstructed by the network, and $I_{in}$ represents the input, the original image.

The SSIM loss helps the model better learn structural information from images and is defined as:

$$\mathcal{L}_{ssim} = 1 - \mathrm{SSIM}(I_{out}, I_{in}) \qquad (4)$$

The total variation loss introduced in VIFNet Hou et al. (2020) is used to better preserve gradients in the source images and further eliminate noise. It is defined as follows:

$$R(i, j) = I_{in}(i, j) - I_{out}(i, j) \qquad (5)$$
$$\mathcal{L}_{TV} = \sum_{i,j} \left( \| R(i, j+1) - R(i, j) \|_2 + \| R(i+1, j) - R(i, j) \|_2 \right) \qquad (6)$$

where $R$ denotes the difference between the original image and the reconstructed image, $\|\cdot\|_2$ denotes the $\ell_2$ norm, and $i$, $j$ represent the horizontal and vertical coordinates of the image's pixels, respectively.
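An illustrative PyTorch rendering of Eqs. (1)–(6) is sketched below. The SSIM implementation (the third-party pytorch_msssim package) and the mean reduction of the TV term are assumptions; the weights of 20 follow the text above.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # pip install pytorch-msssim (assumed SSIM implementation)

def tv_loss(recon, original):
    r = original - recon                            # R = I_in - I_out
    dh = (r[:, :, :, 1:] - r[:, :, :, :-1]).abs()   # horizontal differences
    dv = (r[:, :, 1:, :] - r[:, :, :-1, :]).abs()   # vertical differences
    return dh.mean() + dv.mean()                    # mean reduction (assumption)

def task_loss(recon, original, alpha=20.0, lam=20.0):
    l_mse = F.mse_loss(recon, original)
    l_ssim = 1.0 - ssim(recon, original, data_range=1.0)
    return l_mse + alpha * l_ssim + lam * tv_loss(recon, original)

def total_loss(net, original, destroyed_gamma, destroyed_fourier, destroyed_shuffle):
    # Eq. (1): the three self-supervised reconstruction tasks are optimized jointly
    return sum(task_loss(net(d), original)
               for d in (destroyed_gamma, destroyed_fourier, destroyed_shuffle))
```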

3.3 Three Specific Self-Supervised Image Reconstruction Tasks

In this section, we introduce three transformations that destroy the original images and generate the input for the image reconstruction encoder-decoder network. An example that shows the image and the corresponding subregions before and after the transformations is presented in Supplementary Material Section 1.

(1) Learning Scene Content and Luminance Information using Gamma-based Transformation. In general, overexposed images contain sufficient content and structural information in dark regions, while underexposed images contain sufficient color and structural information in bright regions. In the fused image, it is desirable to maintain uniform brightness while retaining rich information in all regions Xu et al. (2020a); Ram Prabhakar et al. (2017). We adopt Gamma transform to change the luminance in several subregions of the original image and train the network to reconstruct that original image. In this process, our network learns the content and structural information from the images at different luminance levels.

Gamma transform is defined as:

$$V_{out} = V_{in}^{\gamma} \qquad (7)$$

where $V_{out}$ and $V_{in}$ are the transformed and original pixel values, respectively. For each pixel in a selected subregion, we use a random Gamma transform to change the luminance, where $\gamma$ is a random value uniformly sampled from the interval [0, 3].
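A minimal NumPy sketch of this Gamma-based corruption follows, assuming a grayscale image normalized to [0, 1] and interpreting the subregion-size bounds as pixel sizes.

```python
import numpy as np

def gamma_destroy(img, n_regions=10, max_size=25, rng=np.random.default_rng()):
    """Apply a random gamma to several random subregions of a [0, 1] grayscale image."""
    out = img.copy()
    h, w = img.shape
    for _ in range(n_regions):
        rh, rw = rng.integers(1, max_size + 1, size=2)  # subregion size in [1, 25]
        y = rng.integers(0, h - rh + 1)
        x = rng.integers(0, w - rw + 1)
        gamma = rng.uniform(0.0, 3.0)                   # gamma sampled from [0, 3]
        out[y:y + rh, x:x + rw] = out[y:y + rh, x:x + rw] ** gamma
    return out
```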

(2) Learning Texture and Detail Information using Fourier-based Transformation.

We introduce a self-supervised task based on Fourier transform that enables the network to learn texture and detail information from the frequency domain.

In the discrete Fourier transform (DFT) of an image, the amplitude spectrum determines the image’s intensities, while the phase spectrum primarily determines the high-level semantics of the image and contains information about the image’s content and the location of the objects. (See Supplementary Material Section 1.2 for further descriptions and experiments).

Underexposed images are too dark due to insufficient exposure time, and overexposed images are too bright due to a long exposure time, both of which result in inappropriate image intensity distribution. Therefore, it is critical to encourage the network to learn the proper intensity distribution from the images.

Despite the poor intensity distribution in both underexposed and overexposed images, the shape and content of the objects in the image are still well-contained in the phase spectrum. Hence, it is beneficial to build a network that can capture that shape and content information under such circumstances.

To this end, for the selected image subregions, we first perform the Fourier transform to obtain the amplitude and phase spectra. Then, we destroy the subregions in the frequency domain. Specifically, Gaussian blurring is used to change the amplitude spectrum, and random swapping is performed $k$ times on all phase values in the phase spectrum, where $k$ is a random number in the positive integer set [1, 5].
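One possible NumPy/SciPy sketch of this Fourier-based corruption of a single subregion is given below. The blur strength (sigma) and the reading of "random swapping on all phase values" as a random permutation repeated k times are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fourier_destroy_region(region, rng=np.random.default_rng()):
    spectrum = np.fft.fft2(region)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # destroy the amplitude spectrum with Gaussian blurring (sigma is an assumption)
    amplitude = gaussian_filter(amplitude, sigma=1.0)
    # randomly permute all phase values k times, with k drawn from [1, 5]
    k = int(rng.integers(1, 6))
    flat_phase = phase.reshape(-1)
    for _ in range(k):
        rng.shuffle(flat_phase)
    phase = flat_phase.reshape(spectrum.shape)
    # recombine the destroyed spectra and return the inverse transform
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
```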

(3) Learning Structure and Semantic Information using Global Region Shuffling. We introduce the global region shuffling transform Kang et al. (2017) to destroy the original images, thus enabling the network to learn structure and semantic information through image reconstruction. Specifically, for each image subregion in the set of subregions selected in the original image, we randomly select another image subregion of the same size. After that, the two subregions are swapped, and the process is repeated 10 times to obtain the destroyed image.
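A minimal NumPy sketch of this shuffling transform, under the same assumptions as above (grayscale input, subregion sizes in pixels):

```python
import numpy as np

def shuffle_destroy(img, n_regions=10, max_size=25, rng=np.random.default_rng()):
    """Swap each selected subregion with another randomly chosen subregion of the same size."""
    out = img.copy()
    h, w = img.shape
    for _ in range(n_regions):
        rh, rw = rng.integers(1, max_size + 1, size=2)
        y1, y2 = rng.integers(0, h - rh + 1, size=2)  # two random top-left corners
        x1, x2 = rng.integers(0, w - rw + 1, size=2)
        tmp = out[y1:y1 + rh, x1:x1 + rw].copy()
        out[y1:y1 + rh, x1:x1 + rw] = out[y2:y2 + rh, x2:x2 + rw]
        out[y2:y2 + rh, x2:x2 + rw] = tmp
    return out
```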

3.4 Fusion Rule

Because our network already has a strong feature extraction capability, we simply average the feature maps extracted from the two source images to obtain the fused feature maps, which are then forwarded to the Decoder.
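A sketch of the resulting fusion pipeline in PyTorch; `encoder` and `decoder` stand for the trained TransMEF modules and are placeholders here.

```python
import torch

@torch.no_grad()
def fuse(encoder, decoder, img_a, img_b):
    """Fuse two grayscale source tensors of shape (1, 1, H, W) by averaging their features."""
    feat_a = encoder(img_a)
    feat_b = encoder(img_b)
    fused_feat = (feat_a + feat_b) / 2.0   # simple averaging fusion rule
    return decoder(fused_feat)
```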

3.5 Managing RGB Input

We adopt a strategy commonly applied in previous deep learning-based studies to fuse RGB multi-exposure images Zhang et al. (2020a). The color image’s RGB channels are first converted to the YCbCr color space. Then, the Y (luminance) channel is fused using our network, and the information in the Cb and Cr (chrominance) channels is fused using the traditional weighted average method, defined as:

$$C_f = \frac{C_1 \, |C_1 - \tau| + C_2 \, |C_2 - \tau|}{|C_1 - \tau| + |C_2 - \tau|} \qquad (8)$$

where $C_1$ and $C_2$ represent the Cb (or Cr) channel values from the multi-exposure images, $C_f$ denotes their fused channel result, and $\tau$ is set to 128. Finally, the fused Y channel, Cb channel and Cr channel are converted back to the RGB space.
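A small NumPy sketch of this chrominance fusion, following the weighted average reconstructed in Eq. (8); the zero-denominator guard (falling back to τ when both channels equal 128) is an added safeguard, not part of the paper.

```python
import numpy as np

def fuse_chroma(c1, c2, tau=128.0):
    """Weighted-average fusion of two Cb (or Cr) channels given as uint8-range arrays."""
    w1 = np.abs(c1.astype(np.float64) - tau)
    w2 = np.abs(c2.astype(np.float64) - tau)
    denom = w1 + w2
    fused = (c1 * w1 + c2 * w2) / np.maximum(denom, 1e-8)
    return np.where(denom == 0, tau, fused)  # guard: both channels exactly at tau
```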

Figure 2: Two examples of source image pairs and fusion results from different methods. (a1)-(b1) and (a2)-(b2) are the source image pairs, and (c1)-(n1) and (c2)-(n2) are the fusion results from various methods.
Table 1: Objective evaluation results for the benchmark dataset with the maximum values depicted in red.

4 Experiments and Results

Table 2: Results of the ablation study for TransBlock and three self-supervised reconstruction tasks using 20% of the training data.
Table 3: Results of the ablation study for each self-supervised reconstruction task using 20% of the training data.

4.1 Datasets

We used the large natural image dataset MS-COCO Lin et al. (2014) to train the encoder-decoder network. MS-COCO contains more than 70,000 natural images of various scenes. For convenience, all images were resized to 256×256 and converted into grayscale images. It is worth mentioning that although many competitive MEF algorithms have been proposed, they have not been evaluated on a unified MEF benchmark. We used the latest released multi-exposure image fusion benchmark dataset Zhang (2021) as the test dataset. This benchmark dataset consists of 100 pairs of multi-exposure images covering a variety of scenes and multiple objects.

4.2 Implementation Details

Our network was trained on an NVIDIA RTX 3090 GPU with a batch size of 64 for 70 epochs. We used the Adam optimizer and a cosine annealing learning rate schedule with an initial learning rate of 1e-4 and a weight decay of 0.0005. For each 256×256 training image, we randomly generated 10 subregions of random size to form the set of subregions to be transformed. In TransBlock, we divided the transformed input image into patches of size 16×16 to construct the input sequence.
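The optimizer and schedule configuration stated above could be set up as follows; this is a sketch of the stated hyperparameters, not the authors' released training script.

```python
import torch

def make_optimizer(net, epochs=70, lr=1e-4, weight_decay=5e-4):
    """Adam with cosine annealing over the full training run, per the settings above."""
    opt = torch.optim.Adam(net.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```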

4.3 Evaluation Metrics

We rigorously evaluated our method using both subjective and objective evaluations Zhang (2021). Subjective evaluation is the observer's subjective assessment of the quality of the fused images in terms of sharpness, detail, and contrast, among other factors. In the objective evaluation, to provide a fair and comprehensive comparison with other fusion methods, we selected 12 objective evaluation metrics from four perspectives: information theory-based metrics, image feature-based metrics, image structural similarity-based metrics, and human perception inspired metrics. Details about the metrics can be found in Supplementary Material Section 3. All objective metrics are calculated as the average over the 100 fused images, and a larger value indicates better performance for all metrics.

We compared our method with 11 competitive traditional methods Li et al. (1995); Liu and Wang (2015); Lee et al. (2018); Ma and Wang (2015); Ma et al. (2017b, a) and deep learning-based methods Zhang et al. (2020b); Xu et al. (2020a); Ram Prabhakar et al. (2017); Xu et al. (2020b); Zhang et al. (2020a); Ma et al. (2019) in the MEF field. The traditional methods include DWT Li et al. (1995), DSIFT-EF Liu and Wang (2015), MEFAW Lee et al. (2018), PWA Ma and Wang (2015), SPD-MEF Ma et al. (2017b), and MEFOpt Ma et al. (2017a), and the deep learning-based methods include Deepfuse Ram Prabhakar et al. (2017), MEFNet Ma et al. (2019), U2Fusion Xu et al. (2020a), PMGI Zhang et al. (2020a), and IFCNN Zhang et al. (2020b).

4.4 Subjective Evaluation

Figure 2 shows the fusion results from our method and our competitors in an indoor and outdoor scene. More fusion results are shown in Supplementary Material Section 4.

When fusing the first pair of source images in Figure 2 (a1) and (b1), DSIFT-EF, MEFAW, MEFOpt, SPD-MEF and MEFNet result in disappointing luminance maintenance, and the fused images appear dark. PWA introduces artifacts, and the color is unrealistic. Although DWT, Deepfuse, PMGI, IFCNN and U2Fusion maintain moderate luminance, their fusion results suffer from low contrast and fail to depict the image’s details. In comparison, our method maintains the best luminance and contrast and simultaneously displays excellent details with better visual perception.

When fusing the second pair of source images in Figure 2 (a2) and (b2), most methods fail to maintain appropriate luminance. MEFNet and PMGI maintain relatively better luminance but introduce artifacts and blurring. Clearly, our method maintains optimal luminance and contrast and simultaneously retains more detailed information.

4.5 Objective Evaluation

Table 1 presents the objective evaluation for all comparison methods on the benchmark dataset. Our method achieves the best performance for nine of the 12 metrics, while for the other three metrics, the gap between our method’s results and the best results is small.

5 Ablation Study

5.1 Ablation Study for TransBlock

To verify the effectiveness of TransBlock, we conducted an ablation study using 20% of the training data, and the results of the ablation study are shown in Table 2. Regardless of whether the proposed self-supervised reconstruction tasks are used, adding TransBlock always improves the fusion performance.

To further explain why TransBlock is effective, we visualized the effect of image reconstruction using both the traditional CNN architecture and the model that includes TransBlock. It can be seen that the latter reconstructed better details. More information can be found in Supplementary Material Section 2.

5.2 Ablation Study for Three Specific Self-Supervised Image Reconstruction Tasks

In this ablation study, we demonstrate the effectiveness of each of the self-supervised reconstruction tasks and the superiority of performing them simultaneously in a multi-task manner. This study was performed using 20% of the training data, and the experimental results are shown in Table 3. The results show that each of the self-supervised reconstruction tasks alone can improve the fusion performance, and the overall best performance is achieved by conducting the three tasks simultaneously through multi-task learning.

6 Conclusion

In this paper, we propose TransMEF, a transformer-based multi-exposure image fusion framework via self-supervised multi-task learning. TransMEF is based on an encoder-decoder structure so that it can be trained on large natural image datasets. The TransMEF encoder integrates a CNN module and a transformer module so that the network can focus on both local and global information. In addition, we design three self-supervised reconstruction tasks according to the characteristics of multi-exposure images and conduct these tasks simultaneously using multi-task learning so that the network can learn those characteristics during the process of image reconstruction. Extensive experiments show that our new method achieves state-of-the-art performance when compared with existing competitive methods in both subjective and objective evaluations. The proposed TransBlock and the self-supervised reconstruction tasks have the potential to be applied in other image fusion tasks and other areas of image processing.

References

  • P. J. Burt and R. J. Kolczynski (1993) Enhanced image capture through fusion. In 1993 (4th) International Conference on Computer Vision (ICCV), pp. 173–182. Cited by: §2.1, §2.1.
  • J. Cai, S. Gu, and L. Zhang (2018) Learning a deep single image contrast enhancer from multi-exposure images. IEEE Transactions on Image Processing 27 (4), pp. 2049–2062. Cited by: §1, §2.2.
  • J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou (2021) Transunet: transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306. Cited by: §3.2.
  • S. Chen and Y. Chuang (2020) Deep exposure fusion with deghosting via homography estimation and attention learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1464–1468. Cited by: §1, §2.2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §1, §2.2, §2.2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), Cited by: §3.2.
  • S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics 35 (6), pp. 1–12. Cited by: §1.
  • R. Hou, D. Zhou, R. Nie, D. Liu, L. Xiong, Y. Guo, and C. Yu (2020) VIF-net: an unsupervised framework for infrared and visible image fusion. IEEE Transactions on Computational Imaging 6, pp. 640–651. Cited by: §3.2.
  • G. Kang, X. Dong, L. Zheng, and Y. Yang (2017) Patchshuffle regularization. arXiv preprint arXiv:1707.07103. Cited by: §3.3.
  • F. Kou, Z. Li, C. Wen, and W. Chen (2017) Multi-scale exposure fusion via gradient domain guided image filtering. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 1105–1110. Cited by: §2.1, §2.1.
  • S. Lee, J. S. Park, and N. I. Cho (2018) A multi-exposure image fusion based on the adaptive weights reflecting the relative pixel intensity and global gradient. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1737–1741. Cited by: §1, §2.1, §4.3.
  • H. Li, B. Manjunath, and S. K. Mitra (1995) Multisensor image fusion using the wavelet transform. Graphical Models and Image Processing 57 (3), pp. 235–245. Cited by: §1, §2.1, §2.1, §4.3.
  • H. Li and X. Wu (2018) DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623. Cited by: §2.2.
  • H. Li and L. Zhang (2018) Multi-exposure fusion with cnn features. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1723–1727. Cited by: §1, §2.2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §1, §2.2, §4.1.
  • Y. Liu and Z. Wang (2015) Dense sift for ghost-free multi-exposure fusion. Journal of Visual Communication and Image Representation 31, pp. 208–224. Cited by: §1, §2.1, §4.3.
  • B. Ma, Y. Zhu, X. Yin, X. Ban, H. Huang, and M. Mukeshimana (2021) SESF-fuse: an unsupervised deep model for multi-focus image fusion. Neural Computing and Applications 33 (11), pp. 5793–5804. Cited by: §2.2.
  • K. Ma, Z. Duanmu, H. Yeganeh, and Z. Wang (2017a) Multi-exposure image fusion by optimizing a structural similarity index. IEEE Transactions on Computational Imaging 4 (1), pp. 60–72. Cited by: §1, §2.1, §4.3.
  • K. Ma, Z. Duanmu, H. Zhu, Y. Fang, and Z. Wang (2019) Deep guided learning for fast multi-exposure image fusion. IEEE Transactions on Image Processing 29, pp. 2808–2819. Cited by: §1, §2.2, §4.3.
  • K. Ma, H. Li, H. Yong, Z. Wang, D. Meng, and L. Zhang (2017b) Robust multi-exposure image fusion: a structural patch decomposition approach. IEEE Transactions on Image Processing 26 (5), pp. 2519–2532. Cited by: §1, §2.1, §4.3.
  • K. Ma and Z. Wang (2015) Multi-exposure image fusion: a patch-wise approach. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 1717–1721. Cited by: §1, §2.1, §4.3.
  • T. Mertens, J. Kautz, and F. Van Reeth (2007) Exposure fusion. In 15th Pacific Conference on Computer Graphics and Applications (PCCGA), pp. 382–390. Cited by: §2.1, §2.1.
  • K. Ram Prabhakar, V. Sai Srikar, and R. Venkatesh Babu (2017) Deepfuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In IEEE International Conference on Computer Vision (ICCV), pp. 4714–4722. Cited by: §1, §2.2, §2.2, §3.3, §4.3.
  • E. Reinhard, W. Heidrich, P. Debevec, S. Pattanaik, G. Ward, and K. Myszkowski (2010) High dynamic range imaging: acquisition, display, and image-based lighting. Morgan Kaufmann. Cited by: §1.
  • R. Shen, I. Cheng, J. Shi, and A. Basu (2011) Generalized random walks for fusion of multi-exposure images. IEEE Transactions on Image Processing 20 (12), pp. 3634–3646. Cited by: §1.
  • J. Wang, W. Wang, G. Xu, and H. Liu (2018) End-to-end exposure fusion using convolutional neural network. IEICE Transactions on Information and Systems 101 (2), pp. 560–563. Cited by: §1, §2.2, §2.2.
  • H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling (2020a) U2Fusion: a unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §1, §2.2, §2.2, §3.3, §4.3.
  • H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo (2020b) Fusiondn: a unified densely connected network for image fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12484–12491. Cited by: §1, §2.2, §2.2, §4.3.
  • H. Xu, J. Ma, and X. Zhang (2020c) MEF-gan: multi-exposure image fusion via generative adversarial networks. IEEE Transactions on Image Processing 29, pp. 7203–7216. Cited by: §1, §2.2.
  • J. Yin, B. Chen, Y. Peng, and C. Tsai (2020) Deep prior guided network for high-quality image fusion. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §1, §2.2.
  • K. Zeng, K. Ma, R. Hassen, and Z. Wang (2014) Perceptual evaluation of multi-exposure image fusion algorithms. In 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX), pp. 7–12. Cited by: §1, §2.2.
  • H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma (2020a) Rethinking the image fusion: a fast unified image fusion network based on proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12797–12804. Cited by: §1, §2.2, §2.2, §3.5, §4.3.
  • X. Zhang (2021) Benchmarking and comparing multi-exposure image fusion algorithms. Information Fusion 74, pp. 111–131. Cited by: 3rd item, §1, §1, §1, §4.1, §4.3.
  • Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang (2020b) IFCNN: a general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118. Cited by: §1, §1, §2.2, §4.3.