Implementation of "A Semantic-based Medical Image Fusion Approach" paper
It is necessary for clinicians to comprehensively analyze patient information from different sources. Medical image fusion is a promising approach to providing overall information from medical images of different modalities. However, existing medical image fusion approaches ignore the semantics of images, making the fused image difficult to understand. In this paper, we put forward a semantic-based medical image fusion methodology, and as an implementation, we propose a Fusion W-Net (FW-Net) for multimodal medical image fusion. The experimental results are promising: the fused image generated by our approach greatly reduces the semantic information loss and has visual effects comparable to those of the state-of-the-art approaches. Our approach and tool have great potential to be applied in the clinical setting. The source code of FW-Net is available at https://github.com/fanfanda/Medical-Image-Fusion.
Medical images of different modalities provide different types of information, and they play an increasingly important role in clinical diagnosis. For example, computed tomography (CT) images display the information of dense structures such as bones and implants, while magnetic resonance (MR) images show high-resolution anatomical information such as soft tissue. Generally, clinicians must thoroughly study medical images of different modalities in order to provide an accurate diagnosis for each patient. The industry is working towards developing devices with hybrid imaging technologies that obtain such images directly, such as MR/PET and SPECT/CT [2, 3]. However, these devices are not only very expensive, but they also make it difficult to obtain mixed medical images of any two modalities. Fortunately, there is a low-cost alternative: for each patient, we can fuse existing medical images of different modalities, e.g., CT and MR-T2 (T2-weighted) images. This alternative is easy to popularize, since it is able to fuse medical images of any modalities with minimal loss of information.
In detail, these low-cost approaches work in a transformation domain [5, 6, 7, 8, 1]: first, they transform the source images into coefficients of different scales; then they fuse the coefficients according to several manually designed rules; and finally they invert the fused coefficients into a fused image. Unfortunately, there are semantic conflicts in medical images of different modalities, which are overlooked by previous approaches. For example, the bone (the blue arrow in Figure 1) appears bright in CT but dark in MR-T2, while the ventricle (the red arrow in Figure 1) appears dark in CT but bright in MR-T2. Without resolving the semantic conflicts, the fused images are difficult to read and hence useless in the clinical setting. Specifically, two significant drawbacks of those approaches are as follows:
The existing approaches overlook semantic conflicts of different source images, which will result in severe semantic loss in the fused images. The brightness of CT image represents the density of tissue, while the brightness of MR-T2 image represents the fluidity and magnetic property of tissue. So the semantics of brightness in different source images are totally different. In Figure 1, the blue arrow points to a high-density, low-flowing skull, and the red arrow points to a low-density, high-flowing cerebrospinal fluid (in the ventricles). The semantics of brightness in source images (a) and (b) are significantly different. However, in the fusion results (c), (d) and (e), there is no difference in the brightness of the skull and cerebrospinal fluid.
The fusion approaches that do not consider semantics of brightness can cause some brain tissue boundaries to blur. In the green frame of Figure 1 (b), we can clearly see the inflammation area of the frontal sinus, which is the focus of clinicians. However, since the corresponding part in Figure 1 (a) is bright, the frontal sinus boundaries in fusion results (c), (d) and (e) become blurred.
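The first drawback can be illustrated with a toy calculation. The sketch below is not any of the baseline algorithms; it applies two simple semantics-blind rules (pixel averaging and maximum selection) to made-up brightness values, chosen only to mirror the bright/dark reversal between CT and MR-T2 described above.

```python
# Toy brightness values in [0, 1]; the numbers are illustrative, not measured.
ct = {"skull": 0.9, "csf": 0.1}      # CT: dense bone is bright, fluid is dark
mr_t2 = {"skull": 0.1, "csf": 0.9}   # MR-T2: fluid is bright, bone is dark


def fuse_average(a, b):
    """Semantics-blind rule: per-tissue mean of the two brightness values."""
    return {tissue: (a[tissue] + b[tissue]) / 2 for tissue in a}


def fuse_max(a, b):
    """Another semantics-blind rule: keep the brighter of the two values."""
    return {tissue: max(a[tissue], b[tissue]) for tissue in a}


avg = fuse_average(ct, mr_t2)
mx = fuse_max(ct, mr_t2)
print("average:", avg)
print("maximum:", mx)
```

Under both rules the skull and the cerebrospinal fluid end up with the same fused brightness, so the fused value no longer tells the reader which tissue property it encodes.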
In this paper, motivated by the above issues, we first propose a semantic-based fusion approach. The semantics in the source images are extracted, then mapped to a new semantic space, and finally a fused medical image is formed in the new semantic space. As a reference implementation, we provide an autoencoder-based framework, which we name Fusion W-Net (FW-Net). Specifically, our FW-Net model consists of two U-Nets. FW-Net extracts the semantics of brightness in the source images and then automatically maps brightness from images of different modalities to the same semantic space for image fusion. In this paper, we focus on medical image fusion of CT and MR-T2. However, our approach can be generalized to other images. Compared with five state-of-the-art approaches, FW-Net not only greatly reduces semantic loss, but also generates visually comparable fused medical images.
Our contributions are as follows:
We reveal the reason why current image fusion approaches are difficult to apply in the clinical setting, that is, semantic conflicts are ignored.
We put forward a semantic-based medical image fusion methodology, under which a novel FW-Net model that combines autoencoder and U-Net is proposed to fuse medical images of different modalities.
A metric is proposed to evaluate the semantic loss in the fused image.
Since the medical image fusion is an image-to-image task, and our work is a hybrid model of autoencoder and U-Net, we first briefly review the image-to-image task, autoencoder and U-Net.
Image-to-image translation tasks can be roughly divided into two categories. In the first, target images are available, and we only need a loss function at the pixel level or the structure level for regression. In the second, only the target domain is known, or even the target domain is unknown. In this case, a metric network is usually trained as a loss to make the generated image meet certain requirements, such as the perceptual loss in image style transfer tasks, and the adversarial loss and cycle-consistency loss in image translation tasks.
Autoencoders can learn effective representations of input data in an unsupervised way. They are mainly used for data compression [13, 14, 15], representation learning [16, 17], and generative modeling [18, 19, 20]. In the framework of an autoencoder [21, 22], an encoder $f$ computes a representation $h = f(x)$ from an input $x$, and a decoder $g$ is expected to recover $x$ well through $g(h)$. The objective of optimizing an autoencoder is to minimize the loss function

$$L\big(x, g(f(x))\big), \tag{1}$$

where $L$ is a measure of the discrepancy between $x$ and its reconstruction $g(f(x))$. $L$ is usually chosen as the $\ell_2$-loss, i.e., $\|x - g(f(x))\|_2^2$.
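As a small worked example of the reconstruction objective, the sketch below evaluates a squared-error loss for a toy encoder and decoder. The functions `f` and `g` are illustrative stand-ins (pair averaging and value repetition), not learned networks and not part of FW-Net.

```python
def l2_loss(x, x_rec):
    """Squared l2 distance between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))


def f(x):
    """Toy encoder: compress by averaging each adjacent pair of values."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def g(h):
    """Toy decoder: expand by repeating each code value twice."""
    return [v for v in h for _ in range(2)]


x = [1.0, 3.0, 2.0, 2.0]
code = f(x)                 # [2.0, 2.0]
x_rec = g(code)             # [2.0, 2.0, 2.0, 2.0]
loss = l2_loss(x, x_rec)    # information lost in the first pair shows up here
print(code, x_rec, loss)
```

The second pair is reconstructed exactly, while the first pair contributes the whole loss, which is the sense in which the loss measures what the code failed to preserve.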
U-Net is a fully convolutional network (FCN) that is used for medical image segmentation. It copies the feature map of layer $i$ to layer $n - i$, where $n$ is the total number of layers. The low-level feature map of the network preserves the fine-grained information of the image, while the high-level feature map retains the higher-level semantic information and the high-frequency portion of the image. This is beneficial to medical image fusion tasks, so we use U-Net in our encoder and decoder.
It is worth noting that this paper is not the first to combine U-Net and the autoencoder framework. W-Net was proposed for the image segmentation task. However, our FW-Net is very different from it in terms of loss function and network structure.
For medical images of different modalities, the semantics of brightness are inconsistent. In order to solve semantic conflicts in medical image fusion, we propose a semantic-based approach, as shown in Figure 2.
First, we extract information from source medical images, such as structural information. Among the extracted information, the semantics of the medical image is one of the most important ones, which determines whether clinicians can understand the medical image, usually expressed by different brightness.
Second, the semantics of the source images of different modalities are mapped to the same semantic space. At the same time, structural information, edge information, etc. are also placed in the same space.
Finally, medical images from different sources are fused in the same semantic space.
The Autoencoder framework has the ability to map medical image information from different sources to the same semantic space for fusion. The encoder is used to extract the semantics from source images, and the decoder is used to reconstruct the source images. By minimizing the mean squared error (MSE) between the reconstructed images and the source images, the goal of minimizing semantic conflicts in the fused image is achieved.
In FW-Net, both the encoder and the decoder use U-Net. The first U-Net is used to generate the fused image, and the second U-Net is used to reconstruct the source images. Finally, our fused image is generated by minimizing the reconstruction error. The traditional autoencoder framework is fully connected, so the vector output from the encoder is not guaranteed to be spatially aligned with the source image. U-Net instead uses a local connection structure, which keeps the output spatially aligned with the source images so that it can be viewed as a fused image.
As shown in Figure 3, our network structure is divided into two parts. One part is the encoder $E$. A pair of aligned and registered images $x_1$ and $x_2$ is stacked and fed to $E$ in order to generate a fused image $y$ of the same size as the source images. The other part is the decoder $D$, which receives the fused image $y$ and generates two reconstructed images $\hat{x}_1$ and $\hat{x}_2$. Our overall objective is

$$\mathcal{L}_{\text{total}} = \mathrm{MSE}(x_1, \hat{x}_1) + \mathrm{MSE}(x_2, \hat{x}_2) + \beta \sum_{j} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) + \lambda \|W\|_2^2. \tag{2}$$

Here, the first two terms are the MSE loss, which measures the pixel-level discrepancy between the input images and the reconstructed images. The third term is the sparse constraint. Its expansion is

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j}. \tag{3}$$

The Kullback-Leibler (KL) divergence measures the difference between the Bernoulli distribution with mean $\rho$ and the Bernoulli distribution with mean $\hat{\rho}_j$, where $\hat{\rho}_j$ is the activation value of pixel $j$ in the fused image. The purpose of adding this term is to make the generated image smoother, with the activation value of each pixel approaching $\rho$. This causes the fused image to lose some edge information, which is a trade-off between the smoothness and the saliency of the generated image. The last term in Formula 2 is the regularization term, which prevents the model from over-fitting. In the medical image fusion experiment of this paper, $\beta$ takes 3, $\lambda$ takes 5e-4, and $\rho$ takes 0.4.
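To make the structure of the objective concrete, here is a minimal pure-Python sketch that combines the two MSE terms, a Bernoulli KL sparsity term, and a weight regularizer, using the values 3, 5e-4 and 0.4 reported above. The names `beta`, `lam` and `rho`, the summation of the KL term over pixels, the squared-norm regularizer, and the clamping constant `eps` are all assumptions made for illustration; the released FW-Net code may differ.

```python
import math


def mse(a, b):
    """Mean squared error between two flattened images."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def fusion_objective(x1, x2, x1_rec, x2_rec, fused, weights,
                     beta=3.0, lam=5e-4, rho=0.4, eps=1e-6):
    """Reconstruction + sparsity + regularization (assumed form)."""
    recon = mse(x1, x1_rec) + mse(x2, x2_rec)
    # Clamp fused activations into (0, 1) so the logarithms stay finite.
    sparse = sum(kl_bernoulli(rho, min(max(p, eps), 1 - eps)) for p in fused)
    reg = sum(w * w for w in weights)
    return recon + beta * sparse + lam * reg


# Perfect reconstruction, fused activations exactly at rho, zero weights.
ideal = fusion_objective([0.2, 0.8], [0.7, 0.3], [0.2, 0.8], [0.7, 0.3],
                         fused=[0.4, 0.4], weights=[0.0, 0.0])
# Same reconstruction, but fused activations far from rho are penalized.
skewed = fusion_objective([0.2, 0.8], [0.7, 0.3], [0.2, 0.8], [0.7, 0.3],
                          fused=[0.95, 0.05], weights=[0.0, 0.0])
print(ideal, skewed)
```

The sparsity term vanishes only when every fused activation equals the target mean, which is why it pulls the fused image toward a smooth mid-level brightness at the cost of some edge contrast.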
In order to evaluate the semantic loss of the fused images produced by different approaches, we train a separate decoder for each fusion approach. During training, the decoder is optimized to reconstruct the two source images from the fused image. After training, the semantic loss of a fusion result is calculated as the reconstruction error of that decoder.
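The evaluation idea can be sketched as follows, assuming the semantic loss is the decoder's summed MSE over both source images, averaged over the evaluated samples. The function `toy_decoder` is a hypothetical stand-in for a trained decoder network, and the pixel values are invented for the example.

```python
def mse(a, b):
    """Mean squared error between two flattened images."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def semantic_loss(decoder, samples):
    """Average reconstruction error of `decoder` over (fused, src1, src2) triples."""
    total = 0.0
    for fused, src1, src2 in samples:
        rec1, rec2 = decoder(fused)
        total += mse(src1, rec1) + mse(src2, rec2)
    return total / len(samples)


def toy_decoder(fused):
    """Hypothetical decoder: source 1 equals the fused image, source 2 is its inverse."""
    return fused, [1.0 - p for p in fused]


src_ct = [0.75, 0.25]   # toy CT pixels (bone bright, fluid dark)
src_mr = [0.25, 0.75]   # toy MR-T2 pixels (reversed brightness)

consistent = semantic_loss(toy_decoder, [([0.75, 0.25], src_ct, src_mr)])
conflicted = semantic_loss(toy_decoder, [([0.5, 0.5], src_ct, src_mr)])
print(consistent, conflicted)
```

A fused image that keeps the two tissues distinguishable can be decoded back to both sources, whereas the averaged image, in which both tissues share one brightness, cannot, so it receives a higher loss.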
The basic framework of our encoder and decoder follows the structure of U-Net. Figure 4 shows the structure of the encoder. Each $3 \times 3$ convolution has a stride of 1 and a padding of 1, so the size of the feature map does not change after a convolution operation. The structure of the decoder is almost identical to that of the encoder, except that its input is the single fused image and its output is the two reconstructed images.
We make two changes to the original U-Net: 1) We add batch normalization. 2) We replace the deconvolution operation with a bilinear interpolation operation.
Batch normalization can make the output of each layer approach zero mean and unit variance, which accelerates the convergence of training and improves the trained model. Although the deconvolution operation can increase the capacity of the model, it degrades the quality of the fused image generated by the encoder: it produces obvious pepper noise and blur, while the bilinear interpolation operation produces clearer and smoother images.
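For readers unfamiliar with the upsampling choice, the sketch below implements bilinear interpolation on a small 2D grid in pure Python. It uses a corner-aligned coordinate mapping, which is an assumption; the text does not state which convention FW-Net uses, and a framework implementation would normally be used in practice.

```python
def bilinear_upsample(img, out_h, out_w):
    """Resize a 2D list of floats to (out_h, out_w) by bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output row onto the input grid so that corners coincide.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


small = [[0.0, 1.0],
         [1.0, 0.0]]
up = bilinear_upsample(small, 3, 3)
for r in up:
    print(r)
```

Every output value is a convex combination of neighbouring inputs, so the upsampled map stays within the input range and has no learned kernel that could introduce isolated noisy pixels.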
In this section, we compare our FW-Net with the state-of-the-art baselines in medical image fusion tasks. To start with, we introduce the experimental settings. Then, we present and analyze the results.
Our FW-Net and the semantic evaluation network are implemented in the PyTorch framework and run on a Tesla M40 GPU. The training time of each of the two networks is no more than 10 minutes.
We compare our approach with five mainstream algorithms: the guided filtering-based (GF) approach, the NSCT-RPCNN approach, the phase congruency and directive contrast in non-subsampled contourlet domain (NSCT-PCDC) approach, the LP-CNN approach, and the NSST-PAPCNN approach.
In our FW-Net, the batch size is set to 1. The optimizer is SGD, with a learning rate of 0.03, a momentum of 0.9, and a weight decay of 5e-4. The comparative approaches use their default parameter values.
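The update implied by these settings can be written out by hand. The sketch below follows one common formulation of SGD with momentum and weight decay (decay added to the gradient, then a momentum buffer, then the step); it is an illustration of the hyperparameters, not the training code of FW-Net.

```python
def sgd_step(w, grad, velocity, lr=0.03, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum and L2 weight decay on lists of floats."""
    # Weight decay acts as an extra gradient term proportional to the weight.
    g = [gi + weight_decay * wi for gi, wi in zip(grad, w)]
    # The momentum buffer accumulates a decaying sum of past gradients.
    new_velocity = [momentum * vi + gi for vi, gi in zip(velocity, g)]
    new_w = [wi - lr * vi for wi, vi in zip(w, new_velocity)]
    return new_w, new_velocity


w0 = [1.0]
v0 = [0.0]
# With a zero data gradient, only weight decay moves the parameter.
w1, v1 = sgd_step(w0, [0.0], v0)
w2, v2 = sgd_step(w1, [0.0], v1)
print(w1, w2)
```

With no data gradient the weight shrinks slightly on each step, and the second step is larger than the first because the momentum buffer carries over the earlier update.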
We obtained the CT and MR-T2 medical images from http://www.med.harvard.edu/AANLIB/home.html. The images come from ten people, each with 13 slices, for a total of 130 pairs of images. All source images have the same pixel dimensions, and each pair of CT and MR-T2 images is aligned and registered. We used 91 image pairs from 7 people as the training set, 26 image pairs from 2 people as the validation set, and 13 image pairs from 1 person as the test set.
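The split sizes follow directly from partitioning by person rather than by slice, as the short sketch below shows. Which individuals go into which subset is not stated in the text, so assigning them in index order here is purely an assumption.

```python
def split_by_patient(n_patients=10, slices_per_patient=13, n_train=7, n_val=2):
    """Assign whole patients to train/val/test and list their (patient, slice) pairs."""
    patients = list(range(n_patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: [(p, s) for p in ids for s in range(slices_per_patient)]
        for name, ids in groups.items()
    }


splits = split_by_patient()
sizes = {name: len(pairs) for name, pairs in splits.items()}
print(sizes)
```

Keeping all slices of a person in a single subset prevents near-identical neighbouring slices from appearing in both the training and the test data.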
To assess the quality of the fused images, we evaluate them with the following five indicators:
An entropy-based evaluation index, which measures how much information the fused image retains from the source images.
A gradient-based evaluation index, which measures the degree to which the edge information of the source images is preserved in the fused image.
An evaluation index based on structural similarity, which measures the structural similarity between the fused image and the source images.
An evaluation index based on the human vision system (HVS) using Daly's filter, which measures the visual differences between the fused image and the source images.
Semantic loss, an evaluation index we propose in Section 3.2.1, which measures the reconstruction error of the fused image to represent semantic conflicts in the fused image.
Two of the evaluation indicators are computed over windows, with window sizes of 16 and 11. The semantic loss evaluation network uses the improved U-Net architecture proposed in Section 3.2.2. We use the fused images generated by each approach to train that approach's own evaluation network. To ensure comparability, all approaches use the same data partition as described in Section 4.1.1, and their training parameters are the same. The optimizer is SGD, with a learning rate of 0.05 and a momentum of 0.9.
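It should be pointed out that lower values are better for the HVS-based index and the semantic loss, while higher values are better for the entropy-based, gradient-based, and structural-similarity-based indices.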
As the experimental results in Table 1 show, our model performs very well compared with the other approaches. First, the fused image generated by our approach has the least semantic loss. Second, our results are the best on the entropy-based index, indicating that our approach retains the information of the source images very well. Finally, on the gradient-based and HVS-based evaluation indices, our approach is comparable to the other approaches, neither the best nor the worst. This means that our approach can also preserve the edge information of the source images and generate a fused image with a small visual difference from the source images. In addition, after training, our approach needs only one forward pass to produce the fused image. Therefore, our approach has the shortest run time.
Moreover, the structural similarity index shows that, compared with the other approaches, the fused image generated by our approach has the highest structural similarity with the source CT image and the lowest structural similarity with the source MR-T2 image. This is because the index is related not only to the structural information, but also to the pixel values. The fused image generated by our approach tends to represent the new semantic space based on the brightness values of the CT image. In the brain, most of the brightness values of CT and MR-T2 images are reversed, for example in bone and ventricles, which leads to a large gap between the two scores of this index.
Figure 5 shows a section of a patient with cerebral toxoplasmosis. The yellow arrow points to the left ventricle, and the blue arrow points to the outer skull. It should be noted that the red arrow indicates calcification, which should be the focus of clinicians. In the source CT image, we can clearly see the bright calcification. Although this information should be the focus of fused images, the existing approaches tend to retain brightness and other salient image information regardless of semantics. Therefore, the bright calcification in CT is mixed with the bright right ventricle in MR-T2, blurring the key information. It can be seen from the figure that several of the baseline results do not retain the information of calcification well. Where a baseline result does highlight the calcification, the left and right boundaries of the ventricles are blurred. However, in the result of our approach, the brain tissue boundaries and the information of calcification are well preserved.
Another fundamental issue is semantic conflicts. For example, the bone (the blue arrow in Figure 5) appears bright in CT while dark in MR-T2, and the ventricle (the yellow arrow in Figure 5) appears dark in CT while bright in MR-T2. The previous approaches do not distinguish the brightness in different source images, which results in the same brightness values of ventricles and skulls in fusion results (c), (d), (e), (f) and (g). However, they have different semantics in fact. The semantic conflicts here are reflected in the high-density, low-flowing skull and low-density, high-flowing cerebrospinal fluid (in the ventricles) that present the same brightness values in the fused image. Our approach resolves the semantic conflicts well, making the brightness of the skull and the cerebrospinal fluid opposite.
It is also worth mentioning that our approach produces "cleaner" images: the brightness is biased towards the source CT image, but the semantics are more abundant. The thin blood vessels that are not present in the source CT image are well presented in the source MR-T2 image. In our fusion result, the thin blood vessels are rendered dark, indicating their low-density, high-flowing characteristics (the same semantics as cerebrospinal fluid). Our FW-Net converts the same part of the source images of different modalities into a new semantic space, eliminating noise caused by inconsistencies between images of different modalities. This makes the fused images sharper and semantically richer.
Medical images play an important role in clinical diagnosis, and it is necessary to analyze medical images of different modalities. However, existing medical image fusion approaches ignore the semantics of medical images, which makes the fused images difficult to understand. In this paper, we propose a semantic-based approach to fuse medical images: extract the semantics of source images of different modalities, map the semantics to a new semantic space, and fuse the images in the new semantic space. Compared with the state-of-the-art approaches, i.e., GF, NSCT-RPCNN, NSCT-PCDC, LP-CNN and NSST-PAPCNN, our approach effectively resolves semantic conflicts and produces visually pleasing images. Our approach is promising and is expected to be applied in the clinical setting in the future.
We thank Mengya He for helpful discussions, and Dr. Fangfang Hu for help with the medical image annotation.
Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.