Deep Convolutional Sparse Coding Networks for Image Fusion (Pytorch)
Image fusion is a significant problem in many fields including digital photography, computational imaging and remote sensing, to name but a few. Recently, deep learning has emerged as an important tool for image fusion. This paper presents three deep convolutional sparse coding (CSC) networks for three kinds of image fusion tasks (i.e., infrared and visible image fusion, multi-exposure image fusion, and multi-modal image fusion). The CSC model and the iterative shrinkage and thresholding algorithm are generalized into dictionary convolution units. As a result, all hyper-parameters are learned from data. Our extensive experiments and comprehensive comparisons reveal the superiority of the proposed networks with regard to quantitative evaluation and visual inspection.
Image fusion is a fundamental topic in image processing, and its aim is to generate a fusion image by combining the complementary information of source images. This technique has been applied to many scenarios. For example, in military applications, infrared and visible image fusion (IVF) is helpful for object detection and recognition. In digital photography, high dynamic range (HDR) imaging can be achieved by multi-exposure image fusion (MEF), which generates high-contrast and informative images.
Over the past few decades, numerous image fusion algorithms have been proposed, among which transform-based algorithms are very popular. They transform source images into a feature domain, detect the activity levels, blend the features and finally apply the inverse transform to obtain the fused image. Recently, deep neural networks have emerged as an effective tool in image fusion. They can be divided into three groups: (1) Autoencoder-based methods. These are deep learning variants of transform-based algorithms, in which the transforms and inverse transforms are replaced by encoders and decoders, respectively. (2) Supervised methods. For multi-focus image fusion, ground-truth images are available in synthetic datasets. For MEF, Cai et al. constructed a large dataset that provides reference images by comparing 13 MEF/HDR algorithms. Owing to their strong fitting ability, supervised learning networks are suitable for these tasks. (3) Human visual system based methods. When no reference image is available, researchers take prior knowledge into account, set proper loss functions, and design regression [44, 27] or adversarial networks so that fusion images satisfy the human visual system. However, many algorithms are evaluated on a limited number of cherry-picked images, so their generalization remains unknown. This leaves room for improvement through reasonable and interpretable formulations.
Convolutional sparse coding (CSC) has been successfully applied to computer vision tasks on account of its high performance and robustness [40, 12]. The CSC model is generally solved by the iterative shrinkage and thresholding algorithm (ISTA), but the results depend significantly on hyper-parameters. To address this problem, the CSC model and ISTA are generalized into dictionary convolutional units (DCUs), which are placed in the hidden layers of neural networks. In this manner, the hyper-parameters (e.g., penalty parameters, dictionary filters and thresholding functions) in DCUs are learnable. Based on this novel unit, we design deep CSC networks for three fusion tasks, namely IVF, MEF, and multi-modal image fusion (MMF). In our experiments, we employ relatively large test datasets to make a comprehensive and convincing evaluation. Experimental results show that the deep CSC networks outperform the state-of-the-art (SOTA) methods in terms of both objective metrics and visual inspection. In addition, our networks are highly reproducible. The remainder of this paper is organized as follows. Section II converts the CSC model and ISTA into a DCU. Then, in Section III we design three DCU-based networks for the IVF, MEF and MMF tasks. The extensive experiments are reported in Section IV. Section V concludes this paper.
In dictionary learning, CSC is a typical method for image processing. Given an image $x \in \mathbb{R}^{b \times h \times w}$ ($b=1$ for gray images and $b=3$ for RGB images) and convolutional filters $d$, CSC can be formulated as the following problem:

$$\min_{z}\ \frac{1}{2}\left\| x - d * z \right\|_2^2 + \lambda g(z), \tag{1}$$

where $\lambda$ is a hyperparameter, $*$ denotes the convolution operator, $z$ is the sparse feature map (or code) and $g(\cdot)$ is a sparse regularizer. This problem can be solved by ISTA, and the updating rule for feature maps is

$$z^{(t+1)} = \mathrm{prox}_{\eta\lambda g}\left( z^{(t)} + \eta\, d^{T} * \left( x - d * z^{(t)} \right) \right), \tag{2}$$

where $\eta$ is the step size and $d^{T}$ is the flipped version of $d$ along horizontal and vertical directions. Note that $\mathrm{prox}$ is the proximal operator of the regularizer $g$. If $g$ is the $\ell_1$-norm, its corresponding proximal operator is the soft shrinkage thresholding (SST) function defined by $\mathrm{SST}_{\theta}(v) = \mathrm{sign}(v)\,\mathrm{ReLU}(|v| - \theta)$, where $\mathrm{ReLU}(\cdot)$ is the rectified linear unit, $\mathrm{sign}(\cdot)$ is the sign function and $\theta$ is the threshold. CSC provides a pipeline to extract features of an image, but its performance highly depends on the configuration of its hyper-parameters. By the principle of algorithm unrolling [33, 39, 9], the ISTA of CSC can be generalized as a unit in neural networks. We employ two convolutional units, $\mathrm{Conv}_0$ and $\mathrm{Conv}_1$, to replace $d$ and $\eta d^{T}$, and the proximal operator is extended to an activation function $f(\cdot)$. Hence, Eq. (2) can be rewritten as

$$z^{(t+1)} = f\left( \mathrm{BN}\left( z^{(t)} + \mathrm{Conv}_1\left( x - \mathrm{Conv}_0\left( z^{(t)} \right) \right) \right) \right), \tag{3}$$

where we also take batch normalization (BN) into account. It is worth pointing out that, besides SST, the activation function can be freely set to alternatives (e.g., ReLU, parametric ReLU (PReLU) and so on) if the regularizer $g$ is not set to the $\ell_1$-norm. In what follows, Eq. (3) is called a dictionary convolutional unit (DCU). By stacking DCUs, the original CSC model can be represented as a deep CSC neural network.
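The classical ISTA iteration that the DCU unrolls can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's code: it uses 1-D signals instead of images, odd-length filters with "same" convolution, an $\ell_1$ regularizer, and a conservative step size derived from the filter norms. The function names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(v, theta):
    """SST: sign(v) * ReLU(|v| - theta), the proximal operator of the l1-norm."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def synthesize(d, z):
    """Reconstruct the signal as sum_k d_k * z_k ('same' 1-D convolution)."""
    return sum(np.convolve(z_k, d_k, mode="same") for d_k, z_k in zip(d, z))

def objective(x, d, z, lam):
    """CSC objective with an l1 regularizer."""
    return 0.5 * np.sum((x - synthesize(d, z)) ** 2) + lam * np.sum(np.abs(z))

def ista(x, d, lam, n_iter=100):
    """Plain ISTA for 1-D convolutional sparse coding (filters of odd length)."""
    # Assumption: step size 1/L, with L bounded by sum_k ||d_k||_1^2.
    eta = 1.0 / sum(np.sum(np.abs(d_k)) ** 2 for d_k in d)
    z = np.zeros((len(d), x.size))
    for _ in range(n_iter):
        residual = x - synthesize(d, z)
        # Convolving with the flipped filter applies the adjoint of 'same' convolution.
        back = np.stack([np.convolve(residual, d_k[::-1], mode="same") for d_k in d])
        z = soft_threshold(z + eta * back, eta * lam)
    return z

# Toy data (hypothetical): three random filters and a two-spike code.
rng = np.random.default_rng(0)
d = rng.standard_normal((3, 5))
z_true = np.zeros((3, 64))
z_true[0, 10] = 1.0
z_true[2, 40] = -2.0
x = synthesize(d, z_true)
z_hat = ista(x, d, lam=0.05)
```

The result depends on the hand-picked `lam`, `eta` and filters `d`, which is the sensitivity that motivates learning them inside a network.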
In addition, stacked DCUs are interpretable from the viewpoint of representation learning. The convolutional unit that replaces the dictionary serves as a decoder, since it maps from feature space to image space. The convolutional unit that replaces the flipped dictionary serves as an encoder, since it maps the residual between the original image and the reconstructed image from image space to feature space. Then, the encoded residual is added to the current code for updating. Eventually, the output passes through BN and an activation function for non-linearity. This process can be regarded as an iterative auto-encoder.
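A DCU of the kind described above might be written in PyTorch as follows. This is a hedged sketch rather than the authors' released implementation: the kernel size, channel counts, the choice of PReLU, the class names, and the initial convolution that produces the first code are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DCU(nn.Module):
    """One unit: z_next = act(BN(z + encoder(x - decoder(z))))."""

    def __init__(self, img_channels=1, code_channels=8, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Decoder: feature space -> image space (plays the role of the dictionary).
        self.decoder = nn.Conv2d(code_channels, img_channels, kernel_size, padding=pad)
        # Encoder: image space -> feature space (plays the role of the flipped dictionary).
        self.encoder = nn.Conv2d(img_channels, code_channels, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(code_channels)
        self.act = nn.PReLU()  # assumption: PReLU; SST or ReLU could be swapped in

    def forward(self, x, z):
        residual = x - self.decoder(z)          # reconstruction error in image space
        return self.act(self.bn(z + self.encoder(residual)))

class DeepCSCEncoder(nn.Module):
    """Stack of DCUs; the first code comes from a plain convolution (assumption)."""

    def __init__(self, n_units=2, img_channels=1, code_channels=8):
        super().__init__()
        self.init_conv = nn.Conv2d(img_channels, code_channels, 3, padding=1)
        self.units = nn.ModuleList(
            [DCU(img_channels, code_channels) for _ in range(n_units)]
        )

    def forward(self, x):
        z = self.init_conv(x)
        for unit in self.units:
            z = unit(x, z)  # every unit sees the original image again
        return z

torch.manual_seed(0)
net = DeepCSCEncoder(n_units=2, img_channels=1, code_channels=8)
codes = net(torch.rand(2, 1, 16, 16))
```

Each unit receives the input image again, so the stack behaves like an unrolled iterative auto-encoder in which every layer corrects the code using the current reconstruction error.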
In this section, we apply deep CSC neural networks to the image fusion problem, and exhibit three paradigms of model formulation for three different image fusion tasks.
By combining autoencoders and the CSC model, we propose a CSC-based IVF network (CSC-IVFN), which can be regarded as a flexible data-driven transform. In the training phase, we train CSC-IVFN in the autoencoder fashion. In the testing phase, features obtained by the encoder of CSC-IVFN are fused and the fusion image is decoded by the decoder.
The architecture is displayed in Fig. 1 (a). Firstly, the input image $x$ (in the training phase, both infrared and visible images are indiscriminately denoted by $x$) is decomposed into a base image $x_B$ containing low-frequency information and a detail image $x_D$ containing high-frequency textures. Similar to [22, 14], $x_B$ is obtained by applying a box-blur filter to $x$, and the detail image is $x_D = x - x_B$. Then, the base and detail images pass through stacked DCUs to yield the final feature maps $Z_B$ and $Z_D$. Next, we feed them into a decoder to decode the base and detail images. Finally, they are combined to reconstruct the input image. Here, the output is activated by a sigmoid function to make sure that the values range from 0 to 1. The loss function is the mean squared error (MSE) plus the structural similarity (SSIM) loss,

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}}(x, \hat{x}) + \lambda\, \mathcal{L}_{\mathrm{SSIM}}(x, \hat{x}), \tag{4}$$

where $\hat{x}$ is the reconstructed image and $\lambda$ is a trade-off parameter to balance the MSE and SSIM terms. Note that MSE is used to keep the spatial consistency and SSIM guarantees local details in terms of structure, contrast and brightness.
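The base/detail decomposition described above can be illustrated with a short NumPy sketch. The window size and the edge-replication padding are assumptions, since the text does not state them, and the function names are hypothetical.

```python
import numpy as np

def box_blur(img, k=3):
    """Mean filter with a k x k window (k odd), using edge replication."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def decompose(img, k=3):
    """Split a 2-D float image into a low-frequency base and a high-frequency detail."""
    base = box_blur(img, k)
    detail = img - base  # detail is whatever the blur removed
    return base, detail

image = np.arange(20.0).reshape(4, 5)
base, detail = decompose(image)
```

By construction the two parts sum back to the input, which is why the decoder outputs can simply be added to reconstruct the image.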
After training a CSC-IVFN, we obtain a transform (encoder) and an inverse transform (decoder). In the test phase, CSC-IVFN is fed with a pair of infrared and visible images. In what follows, we use $Z_B^I$, $Z_D^I$, $Z_B^V$ and $Z_D^V$ to represent the base and detail feature maps of infrared and visible images, respectively. As exhibited in Fig. 1 (b), a fusion layer is inserted between the encoder and the decoder in the test phase. It can be expressed by a unified merging operation,

$$Z_B^F = W_B^I \otimes Z_B^I \oplus W_B^V \otimes Z_B^V, \qquad Z_D^F = W_D^I \otimes Z_D^I \oplus W_D^V \otimes Z_D^V.$$

Here, $\otimes$ and $\oplus$ are element-wise product and addition, and $W$ denotes the fusion weights. There are three popular fusion strategies:
Average strategy: all fusion weights are set to 0.5, so the infrared and visible feature maps are averaged.
Saliency-weighted fusion strategy: To highlight and retain salient targets and information, the fusion weight of this strategy is determined by the saliency degree. We take base weights as an example. Firstly, the saliency value of the infrared base feature map $Z_B^I$ at the $k$th pixel is obtained by $S_B^I(k) = \sum_{i} N_i\,|Z_B^I(k) - i|$, where $Z_B^I(k)$ is the value of the $k$th pixel and $N_i$ is the frequency of pixel value $i$. The initial weights at the $k$th pixel are $\tilde{W}_B^I(k) = S_B^I(k) / (S_B^I(k) + S_B^V(k))$ and $\tilde{W}_B^V(k) = 1 - \tilde{W}_B^I(k)$. To prevent artifacts at region boundaries, the weight maps are refined via the guided filter with the guidance of the base and detail feature maps.
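The saliency-weighted strategy can be sketched as follows, assuming the common histogram-contrast form of saliency (each pixel's saliency is the frequency-weighted sum of absolute differences to all other pixel values) and assuming the maps have been quantized to 8 bits. The guided-filter refinement is omitted, and the function names are hypothetical.

```python
import numpy as np

def histogram_saliency(img_u8):
    """S(k) = sum_i N_i * |I_k - i|, with N_i the count of pixels with value i."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(float)
    levels = np.arange(256, dtype=float)
    # table[v] holds the saliency of any pixel whose value is v.
    table = np.abs(levels[:, None] - levels[None, :]) @ hist
    return table[img_u8]

def saliency_weights(a_u8, b_u8, eps=1e-12):
    """Initial fusion weights for two maps; the two weights sum to one per pixel."""
    s_a = histogram_saliency(a_u8)
    s_b = histogram_saliency(b_u8)
    w_a = s_a / (s_a + s_b + eps)
    return w_a, 1.0 - w_a

infrared = np.array([[0, 0], [0, 255]], dtype=np.uint8)
visible = np.array([[10, 20], [30, 40]], dtype=np.uint8)
w_ir, w_vis = saliency_weights(infrared, visible)
```

In the toy infrared map the single bright pixel differs from every other pixel, so it receives the largest saliency and therefore the largest weight.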
Most MEF algorithms fall under the umbrella of the weighted summation framework, $\hat{x} = \sum_{k=1}^{K} W_k \otimes x_k$, where $x_k$ are source images, $W_k$ are the corresponding weight maps, $\hat{x}$ is the fused image, $\otimes$ is the element-wise product and $K$ denotes the number of exposures. We propose a CSC-based MEF network (CSC-MEFN). Different from CSC-IVFN, CSC-MEFN is an end-to-end network. Here DCUs extract feature maps, which are then used to predict weight maps to generate the fusion image. To avoid chroma distortion, the proposed CSC-MEFN works in the YCbCr space, and its channels are denoted by $Y_k$, $Cb_k$ and $Cr_k$. As shown in Fig. 1 (c), Y channels pass through CSC-MEFN one by one. At first, CSC-MEFN stacks DCUs to code the Y channels. Then, a convolutional unit follows to produce the final code $z_k$. Thereafter, the codes are converted into weight maps $W_k$ by softmax activation. At last, the fused Y channel is obtained by $Y_F = \sum_{k=1}^{K} W_k \otimes Y_k$. As for the Cb channels, we employ the MEF $\ell_1$-norm fusion strategy, and the Cr channels are fused in the same way. After the separate fusion of the three channels, the fusion image is transformed from YCbCr to RGB space. Eventually, we apply a post-processing step: the values at the 0.5% and 99.5% intensity levels are mapped to [0,1], and values out of this range are clipped.
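The fusion steps of this pipeline can be sketched in NumPy. The luma fusion and the percentile stretch follow the description above. The chroma rule is an assumption: it uses the widely adopted form in which each chroma value is weighted by its absolute distance from the neutral value `tau`, which may differ in detail from the paper's formula. All function names are hypothetical.

```python
import numpy as np

def softmax_weights(codes):
    """codes: (K, H, W) array -> weight maps that sum to one over the K exposures."""
    e = np.exp(codes - codes.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def fuse_luma(y_stack, codes):
    """Weighted summation of the K luma channels."""
    return (softmax_weights(codes) * y_stack).sum(axis=0)

def fuse_chroma(c_stack, tau=0.5, eps=1e-8):
    """Assumed l1-norm rule: weight each chroma value by |C_k - tau| (chroma in [0, 1])."""
    w = np.abs(c_stack - tau)
    total = w.sum(axis=0)
    fused = (w * c_stack).sum(axis=0) / np.maximum(total, eps)
    return np.where(total < eps, tau, fused)  # all-neutral pixels stay neutral

def stretch(img, lo=0.5, hi=99.5):
    """Map the lo/hi percentiles to 0 and 1, clipping values outside that range."""
    p_lo, p_hi = np.percentile(img, [lo, hi])
    return np.clip((img - p_lo) / max(p_hi - p_lo, 1e-12), 0.0, 1.0)

y_stack = np.stack([np.full((4, 4), v) for v in (0.1, 0.5, 0.9)])
codes = np.zeros((3, 4, 4))        # equal codes give equal weights
y_fused = fuse_luma(y_stack, codes)
```

With equal codes the softmax yields uniform weights, so the fused luma is just the mean of the exposures. A trained network would instead produce codes that favor well-exposed regions.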
CSC-MEFN is supervised by improved MEFSSIM. It evaluates the similarity between source images and the fusion image in terms of illumination, contrast and structure. Our experimental results show that MEFSSIM often leads to haloes. Essentially, halo artifacts result from pixel fluctuations in the illumination map (i.e., the Y channel). To suppress haloes, we propose a halo loss defined by the $\ell_1$-norm on gradients of the illumination map $Y_F$, $\mathcal{L}_{\mathrm{halo}} = \|\nabla Y_F\|_1$, where $\nabla$ denotes the image gradient operator (see details in supplementary materials). In our experiments, $\nabla$ is implemented by horizontal and vertical Sobel filters. In summary, given the penalty parameter $\lambda$, the loss function of CSC-MEFN is expressed by $\mathcal{L} = \mathcal{L}_{\mathrm{MEFSSIM}} + \lambda\, \mathcal{L}_{\mathrm{halo}}$, where $\mathcal{L}_{\mathrm{MEFSSIM}}$ denotes the MEFSSIM loss.
Owing to the limitations of multispectral imaging devices, multispectral (MS) images contain rich spectral information but have low resolution (LR). One of the promising techniques for acquiring a high resolution (HR) MS image is to fuse the LRMS image with a guidance image (e.g., a panchromatic or RGB image). This problem is a special MMF task. We present a CSC-based MMF network (CSC-MMFN) for the general MMF task. It is assumed that LR and guidance images are represented by and respectively. Given the dictionary of HR images , the HR image is represented by
The symbol denotes the upsampling operator. According to this model, CSC-MMFN separately extracts the codes of the LR and guidance images by two sequences of DCUs, and we utilize the fast guided filter to super-resolve the LR code with the guidance of the guidance code. At last, the HR image is recovered by a convolutional unit. The loss function is set to the MSE between ground-truth and fusion images.
Here we elaborate on the implementation and configuration details of our networks. Experiments are conducted to show the performance of our models and the rationality of the network structures. For each task, our experiments utilize training, validation and test datasets. The hyperparameters are determined on the validation set.
As shown in Table I, IVF experiments use three datasets (FLIR, NIR and TNO). The 180 pairs of images in FLIR compose the training set. Two subsets (Water and OldBuilding) of NIR are used for validation. To comprehensively evaluate the performance of different models, we employ TNO, NIR-Country and the remaining pairs of FLIR as test datasets. To the best of our knowledge, most papers employ only a few cherry-picked pairs from TNO as test sets. In contrast, our test sets contain more than 130 pairs with different illuminations and scenarios. To quantitatively measure the fusion performance, six metrics are employed: entropy (EN), standard deviation (SD), spatial frequency (SF), visual information fidelity (VIF), average gradient (AG) and sum of the correlations of differences (SCD). Larger values indicate that a fusion image is better. In our experiment, the tuning parameter in Eq. (4) is set to 5. The network is optimized over 60 epochs, with the learning rate changed after the first 30 epochs. The number of DCUs, the activation function and the fusion strategy may significantly affect the performance of CSC-IVFN. We determine them on validation sets. Owing to limited space, the validation experiments are exhibited in supplementary materials and the best configuration is reported as follows: the number of DCUs in the base or detail encoder is 7; the activation functions in the base and detail encoders are set as PReLU and SST, respectively; the fusion strategies for base and detail images are saliency-weighted fusion and IVF $\ell_1$-norm fusion, respectively.
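Three of the no-reference metrics listed above (EN, SD and SF) are simple enough to sketch in NumPy. These follow the common textbook definitions and may differ in normalization from the evaluation code used for the tables: for instance, SF here averages over the available pixel differences rather than over all pixels. The function names are hypothetical.

```python
import numpy as np

def entropy(img_u8):
    """Shannon entropy (in bits) of the 8-bit intensity histogram."""
    p = np.bincount(img_u8.ravel(), minlength=256) / img_u8.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """Standard deviation of pixel intensities, a proxy for global contrast."""
    return float(np.std(img))

def spatial_frequency(img):
    """sqrt(RF^2 + CF^2), from horizontal (row) and vertical (column) differences."""
    img = img.astype(float)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))

half = np.zeros((4, 4), dtype=np.uint8)
half[:, 2:] = 255   # left half black, right half white
```

All three grow with the amount of intensity variation in the fused image, which is why larger values are read as better, although none of them checks fidelity to the source images.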
To verify the superiority of our CSC-IVFN, we compare its fusion results with nine popular IVF methods, including ADKT, CSR, DeepFuse, DenseFuse, DLF, FEZL, FusionGAN, SDF and TVAL. Six metrics of all methods are displayed in Table II. Our method achieves the best performance on all test sets with regard to most metrics, which suggests that it is suitable for various scenarios with different kinds of illuminations and object categories. In contrast, the other methods (including DeepFuse, DenseFuse and SDF) achieve good performance only on certain test sets and with regard to some of the metrics. Besides the metric comparison, representative fusion images are displayed in Fig. 2. In the visible image, there are lots of bushes. In the infrared image, we can observe a bunker. However, it is not easy to recognize the bushes in the infrared image or the bunker in the visible image. Our fusion image keeps the details and textures of the visible image, and preserves the objects of interest (i.e., the bushes and the bunker). In addition, its contrast is fairly high. In conclusion, both visible spectrum and thermal radiation information are retained in our fusion image, whereas the images generated by the other methods are less satisfactory.
Three datasets, SICE, TCI2018 and HDRPS (http://markfairchild.org/HDR.html), are employed in our experiments. HDRPS and TCI2018 are used for testing and validation, respectively. SICE is a large and high-quality dataset, and it is divided into two parts for training and validation. The basic information of the datasets is shown in Table I. Many papers use MEFSSIM to evaluate performance, but CSC-MEFN is supervised by MEFSSIM, so this metric would be unfair to the other methods. As an alternative, we utilize four SOTA blind image quality indices, i.e., blind/referenceless image spatial quality evaluator (Brisque), naturalness image quality evaluator (Niqe), perception based image quality evaluator (Piqe) and multi-task end-to-end optimized deep neural network (MEON) based blind image quality assessment. Smaller values indicate that a fusion image is better. Experiments show that a large penalty parameter makes training unstable, so its value is set per iteration, and it is selected so that the halo loss and the MEFSSIM loss have similar magnitudes. The network is optimized by Adam over 50 epochs. The network configuration is determined on the validation datasets. DCUs are utilized to extract codes, and SST is employed as the activation function.
CSC-MEFN is compared with seven classic and recent SOTA methods, including EF, GGIF, DenseFuse, MEF-Net, FMMR, DSIFTEF and Lee18. The metrics are listed in Table III. Our network outperforms the other methods, with Lee18 and EF ranked second and third. Fig. 3 displays the fusion images. GGIF, MEF-Net, FMMR, DSIFTEF and Lee18 suffer from strong halo effects around the edges between the sky and the rocks. For EF the right rock is too dark, and for DenseFuse the sun cannot be recognized. The contrast of local regions for both EF and DenseFuse is low. Our fusion image strikes a balance.
The dataset contains 32 scenes, each of which has a 31-band multispectral image and an RGB image. It is divided into three parts for training, testing and validation. The Wald protocol is used to construct training sets. We employ peak signal-to-noise ratio (PSNR) and SSIM as evaluation indexes. Larger PSNR and SSIM indicate that a fusion image is better. The network is optimized by Adam over 100 epochs. SST is employed as the activation function. The number of DCUs is empirically set to 4 as a trade-off between speed and accuracy.
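For reference, PSNR can be computed as in the sketch below. It assumes images scaled to the range [0, 1] (the `data_range` parameter); the paper's evaluation may use a different range or average the value per band.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; larger means the estimate is closer."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

truth = np.zeros((8, 8))
noisy = truth + 0.1   # constant error of 0.1, so the MSE is 0.01
score = psnr(truth, noisy)
```

Because PSNR is a monotone function of the MSE, a network trained with an MSE loss, as CSC-MMFN is, directly optimizes this metric.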
CSC-MMFN is compared with seven classic and recent SOTA methods, including CNMF, GSA, FUSE, MAPSMM, GLPHS, PNN and PFCN. The metrics listed in Table IV show that our network achieves the largest PSNR and SSIM. GLPHS and PFCN rank second in terms of PSNR and SSIM, respectively. The error maps of the third band of stuffed toys are displayed in Fig. 4. We found that CNMF, GSA and PFCN break down when reconstructing the color checkerboard and stuffed toys, while FUSE, MAPSMM, GLPHS and PNN perform badly at the edges. In summary, CSC-MMFN has the best performance.
By converting the ISTA and the CSC model into a hidden layer of neural networks, this paper proposes three deep CSC networks for the IVF, MEF and MMF tasks. Extensive experiments and comprehensive comparisons demonstrate that our networks outperform the SOTA methods. Furthermore, the experiments in supplementary materials show that our networks are highly reproducible.
Multi-focus image fusion with a deep convolutional neural network. Inf. Fusion 36, pp. 191–207.