Learning Cross-Modal Deep Representations for Multi-Modal MR Image Segmentation

08/06/2019, by Cheng Li, et al.

Multi-modal magnetic resonance imaging (MRI) is essential in the clinic for comprehensive diagnosis and surgical planning. Nevertheless, the segmentation of multi-modal MR images tends to be time-consuming and challenging. Convolutional neural network (CNN)-based multi-modal MR image analysis commonly proceeds with multiple down-sampling streams fused at one or several layers. Although inspiring performance has been achieved, the feature fusion is usually conducted through simple summation or concatenation without optimization. In this work, we propose a supervised image fusion method to selectively fuse the useful information from different modalities and suppress the respective noise signals. Specifically, an attention block is introduced to guide the information selection. Among the different modalities, the one that contributes most to the results is selected as the master modality, which supervises the information selection of the other, assistant, modalities. The effectiveness of the proposed method is confirmed through breast mass segmentation in MR images of two modalities, where better segmentation results are achieved than with state-of-the-art methods.







1 Introduction

Multi-modal magnetic resonance imaging (MRI) is an essential tool in the clinic for the screening and diagnosis of different diseases, including breast cancer, prostate cancer, and neurodegenerative disorders. The combination of different imaging modalities can overcome the limitations of the individual modalities. In breast cancer screening, for example, while contrast-enhanced MRI possesses high sensitivity in detecting breast lesions, T2-weighted MRI is effective in reducing false-positive results [1, 2]. Considering different MRI modalities is thus important for acquiring accurate lesion information. Lesion segmentation of MR images is a critical step for subsequent diagnosis and surgical planning. Manual segmentation is both time-consuming and error-prone. Therefore, the development of automatic and reliable segmentation algorithms is of high clinical value.

Learning-based methods, especially those based on convolutional neural networks (CNNs), have seen rapid development in medical image analysis in the last decade [3]. CNNs were originally proposed for the task of image-level classification. An intuitive application of CNNs to image segmentation, which is a pixel-level classification task, is to classify each pixel in a sliding-window manner (R-CNN) [4]. Fully convolutional networks (FCNs) were designed later to avoid this cumbersome and memory-inefficient approach [5]; FCNs segment the input image directly by generating a heatmap output. Following FCNs, U-Net was proposed specifically for biomedical image segmentation [6]. It is the current baseline network for various medical image segmentation tasks and has inspired many subsequent works.

A critical issue in multi-modal image segmentation is the fusion of information from the different imaging modalities. CNN-based multi-modal image fusion can be realized through early fusion, late fusion, or multi-layer fusion. Early fusion happens at the input stage or at low-level feature stages [7, 8]. This strategy may fail to achieve the expected information compensation, especially when the different modal images have complex relationships. Late fusion refers to the fusion of high-level, highly abstract features; multi-stream networks are commonly utilized in this case, with each stream processing images from one modality. Late fusion has been demonstrated to generate better segmentation results than direct early fusion [9, 10]. Multi-layer fusion is a more general strategy. It was first proposed for RGB-D image segmentation, where FuseNet was designed to incorporate depth information into RGB images [11]. Further network optimization over FuseNet confirmed that multi-layer fusion is a more effective approach [12]. Multi-layer fusion has also been successfully applied to multi-modal medical image segmentation [13]. Although inspiring results have been achieved, the feature fusion in these works was conducted through direct pixel-wise summation or channel-wise concatenation. Without supervision and selection, the fusion process may introduce irrelevant and noisy signals into the final outputs.
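The three fusion strategies differ only in where the features of the two streams are combined. A minimal PyTorch sketch (module names and channel widths are our own, chosen for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions, as in a typical U-Net encoder stage."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.net(x)

# Early fusion: stack the modalities at the input and use one stream.
early_stage = ConvBlock(2, 16)  # two single-channel MR modalities

# Multi-layer fusion: two streams whose features are merged at every
# stage; FuseNet merges by pixel-wise summation, concatenation-based
# variants merge channel-wise.
class FusedStage(nn.Module):
    def __init__(self, c_in, c_out, mode="concat"):
        super().__init__()
        self.stream_a = ConvBlock(c_in, c_out)
        self.stream_b = ConvBlock(c_in, c_out)
        self.mode = mode

    def forward(self, x_a, x_b):
        f_a, f_b = self.stream_a(x_a), self.stream_b(x_b)
        if self.mode == "sum":            # FuseNet-style fusion
            return f_a + f_b
        return torch.cat([f_a, f_b], 1)   # channel-wise concatenation
```

Late fusion follows the same pattern, except the merge is applied only once, after the deepest stage of each stream.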

In this study, we propose a novel multi-stream CNN-based feature fusion network for the processing of multi-modal MR images. In accordance with real clinical situations, we pick the MR modality that contributes most to the final segmentation results as the master modality and treat the other modalities as assistant modalities. Inspired by the knowledge distillation concept [14], where a teacher network supervises the training of a student network, our master modal network stream supervises the training of the assistant modal network streams. Specifically, we adopt an attention block to extract supervision information from the master modality and utilize this information to select useful features from both the master and the assistant modalities. The effectiveness of the proposed method is evaluated through mass segmentation in breast MR images of two modalities. Segmentation of breast masses is a challenging task, as the masses have a large range of sizes and shapes, especially spiculated masses with ill-defined borders. The results show that our method achieves better performance than existing feature fusion strategies.

2 Methodology

2.0.1 Breast MRI Dataset.

The breast MR images were collected using an Achieva 1.5T system (Philips Healthcare, Best, Netherlands) with a four-channel phased-array breast coil. All acquisitions of 313 patients in the prone position were conducted between 2011 and 2017. Two MRI sequences were applied. Axial T2-weighted (T2W) images (TR/TE = 3400 ms/90 ms, FOV = 336 mm × 336 mm, section thickness = 1 mm) with fat suppression were obtained before the injection of contrast medium. After the intravenous injection of 0.3 mL/kg of gadodiamide (BeiLu Pharmaceutical, Beijing, China), axial fat-suppressed contrast-enhanced T1-weighted (T1C) images were collected (TR/TE = 5.2 ms/2.3 ms, FOV = 336 mm × 336 mm, section thickness = 1 mm, and flip angle = ). Since manual segmentation of the breast masses in 3D multi-modal MR images is very difficult and time-consuming, only the central slices with the largest cross-section areas were labelled by two experienced radiologists in this retrospective study.

Figure 1: Baseline network architecture. The overall network structure (a) and the implementation details of the modality fusion by the three networks (b).

2.0.2 A Better Feature Fusion Baseline Network Architecture.

Our baseline model is built from FuseNet [11] with two major modifications. First, FuseNet was proposed for the analysis of natural images; its encoder adopts the VGG 16-layer model for the convenience of utilizing ImageNet pre-trained parameters. To better suit medical image processing, we build a FuseNet-like architecture based on U-Net, named FuseOriginUNet (Fig. 1). Second, in FuseNet, the feature fusion of the different imaging modalities is realized by pixel-wise summation, which preserves the VGG 16-layer structure after introducing the fusion module. In FuseOriginUNet, channel-wise concatenation is implemented instead. To make the overall network lightweight, we halve the number of convolution kernels in each encoder layer compared to U-Net, which yields the final baseline model FuseUNet (Fig. 1). Experiments show that FuseUNet achieves better performance than both FuseNet and FuseOriginUNet.
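The effect of halving the encoder width can be checked with a quick parameter count (the 64- and 32-channel layer sizes below are the standard U-Net first-stage widths, used here only for illustration):

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of a k x k convolution layer: weights plus biases."""
    return c_in * c_out * k * k + c_out

# A 64-channel U-Net encoder conv versus its halved 32-channel version:
full = conv_params(64, 64)   # 36,928 parameters
half = conv_params(32, 32)   #  9,248 parameters

# Halving every encoder layer's width cuts each conv's parameter count
# to roughly a quarter, which is consistent with the drop from
# FuseOriginUNet (56.2M) to FuseUNet (26.7M) in Table 1, even though
# both models keep two full encoder streams.
```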

Figure 2: Example breast MR images where T1C images highlight irrelevant regions (organs or dense glandular tissues) and T2W images distinguish these regions from the targeted breast masses. Label images are the manual segmentation results.

2.0.3 Supervised Cross-Modal Deep Representation Learning.

Different imaging modalities contain different sorts of useful information for the targeted task. For breast MR images, the T1C modality has high sensitivity but relatively low specificity in detecting breast masses. Two examples are shown in Fig. 2. It can be observed that the T1C image highlights not only the breast mass area but also irrelevant regions, such as the organs and the dense glandular tissues. In this case, T2W images are important in distinguishing the true masses from all the enhanced areas. Accordingly, the two imaging modalities are treated differently in the proposed method. T1C is chosen as the master modality, having a greater impact on the results, while T2W is regarded as the assistant modality that complements the information of the master modality.

Figure 3: The encoder section of the proposed master–assistant cross-modal learning network (a) and the cross-modal supervision learning module (b), where n is the input feature number, r is the channel reduction factor, and d is the atrous rate of the dilated convolutions.

Inspired by the knowledge distillation between teacher–student networks [14], we propose a supervised master–assistant cross-modal learning framework (Fig. 3a). The master modality generates supervision information that modulates the learning of the assistant modality. Following the activation-based attention transfer strategy [15], a spatial attention (SA) block is designed to extract the supervision information (Fig. 3b). The input of the block is the feature maps from the master modal stream; its output, a weight heatmap, guides the information selection for both the master and the assistant modalities.
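A minimal PyTorch sketch of such an SA block follows. Only the input feature number n, the reduction factor r, and the atrous rate d come from the text; the exact layer composition inside the block is our assumption:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of an SA block in the spirit of Fig. 3b: channel reduction
    by a factor r, a dilated convolution with atrous rate d, and a
    sigmoid producing a one-channel weight heatmap."""
    def __init__(self, n, r=16, d=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n, n // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n // r, n // r, kernel_size=3, padding=d, dilation=d),
            nn.ReLU(inplace=True),
            nn.Conv2d(n // r, 1, kernel_size=1),
            nn.Sigmoid())

    def forward(self, master_features):
        return self.net(master_features)

# The heatmap extracted from the master (T1C) stream re-weights the
# features of both streams before fusion:
sa = SpatialAttention(n=32, r=16, d=4)
f_master = torch.randn(1, 32, 64, 64)
f_assist = torch.randn(1, 32, 64, 64)
att = sa(f_master)                # one-channel map, values in (0, 1)
f_master_sel = f_master * att     # broadcast over the channel axis
f_assist_sel = f_assist * att
```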

2.0.4 Implementation Details.

Five-fold cross-validation experiments were conducted. All images, along with the label images, were resized to the same size and intensity normalized. No further data processing or augmentation was applied. The models were implemented in PyTorch on an NVIDIA TITAN Xp GPU (12 GB) with a batch size of 4. Adam with AMSGrad was applied to train the models. A step-decay learning-rate schedule was used, with an initial learning rate of 1e-4 halved every 30 epochs. The hyperparameters of the SA module were set to r = 16 and d = 4. To tackle the widely recognized class-imbalance problem in medical image analysis, a loss function combining cross-entropy loss and Dice loss was adopted:

L = L_{\mathrm{Dice}} + \lambda L_{\mathrm{CE}},

L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i + \epsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + \epsilon}, \qquad L_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log p_i + (1 - g_i)\log(1 - p_i)\right],

where L is the loss function utilized, L_{\mathrm{Dice}} is the Dice loss, L_{\mathrm{CE}} is the cross-entropy loss, N is the total number of pixels in the image, g_i is the manual segmentation label of the i-th pixel (0 for background and 1 for foreground), p_i is the corresponding predicted probability of the i-th pixel belonging to the foreground class, \epsilon is a constant to keep numerical stability, and \lambda is a weight constant controlling the trade-off between the two losses.
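A direct PyTorch implementation of this combined loss might look as follows; the function name and the default values of the stability constant and trade-off weight are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(pred_logits, target, eps=1.0, weight=0.5):
    """Combined Dice + cross-entropy loss for binary segmentation.

    pred_logits: (B, 1, H, W) raw network outputs.
    target:      (B, 1, H, W) labels in {0, 1}.
    `eps` is the numerical-stability constant and `weight` the
    trade-off between the two terms.
    """
    prob = torch.sigmoid(pred_logits)
    # Soft Dice over all pixels in the batch.
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
    # Pixel-wise binary cross-entropy.
    ce = F.binary_cross_entropy_with_logits(pred_logits, target.float())
    return dice + weight * ce
```

A confidently correct prediction drives both terms toward zero, while a confidently wrong one is penalized by both.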

Three metrics were utilized to quantify the segmentation performance: the Dice similarity coefficient, sensitivity, and relative area difference. Three independent experiments were run, and the results are presented as mean ± standard deviation.
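For binary masks, the three metrics can be computed as below; whether the relative area difference is reported signed or as an absolute value is our assumption (signed here):

```python
import numpy as np

def segmentation_metrics(pred, label):
    """Dice similarity coefficient, sensitivity, and (signed) relative
    area difference for binary masks. Assumes the label mask is
    non-empty."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    tp = np.logical_and(pred, label).sum()       # true-positive pixels
    dice = 2.0 * tp / (pred.sum() + label.sum())
    sensitivity = tp / label.sum()
    rel_area_diff = (pred.sum() - label.sum()) / label.sum()
    return dice, sensitivity, rel_area_diff
```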

Models Number of parameters Dice Sensitivity Relative area difference
U-Net (T1C) 34.5M
U-Net + SA (T1C) 34.7M
FuseNet 29.4M
FuseNetConcate 74.0M
FuseOriginUNet 56.2M
EarlyFuseUNet 34.5M
LateFuseUNet 25.1M
FuseUNet 26.7M
FuseUNet + SA 26.8M
Proposed 26.7M
M: millions. Metric values are in percentages.
Table 1: Segmentation results of different models.

3 Results and Discussion

Table 1 lists the quantitative segmentation results of the different networks. Compared to the pixel-wise summation strategy used in FuseNet, multi-modal feature fusion by channel-wise concatenation (FuseNetConcate) is more effective. Adopting the U-Net blocks (FuseOriginUNet) leads to further performance enhancement. Moreover, the lightweight FuseUNet achieves a comparable or even superior level of segmentation accuracy with only half the parameters of FuseOriginUNet. U-Net trained solely on T1C performs worse than all the two-modal U-Net-based networks, suggesting that T2W images provide useful complementary information for the segmentation task. Among the two-modal U-Net-based networks, the multi-layer fusion of FuseUNet is more effective than both early fusion (EarlyFuseUNet) and late fusion (LateFuseUNet). Introducing our SA block into U-Net (U-Net + SA) and FuseUNet (FuseUNet + SA) before each pooling layer further elevates segmentation performance. Finally, our proposed supervised cross-modal deep representation learning method generates the best segmentation results on all three metrics.

Figure 4: Example results of the different networks. White lines indicate the boundaries of the manual segmentation labels. Green lines are the boundaries of the network segmentation results. The value in each image is the Dice similarity coefficient (%).

The segmentation results of several examples are given in Fig. 4. Overall, models utilizing two modal inputs are more effective than the single-modal U-Net. Except in the last example, the improved baseline model FuseUNet achieves a higher Dice similarity coefficient than FuseNet. The proposed method consistently achieves much better results than the existing methods with decreased false negatives (first example) and decreased false positives (second and third examples).

To demonstrate the mechanism behind the improved performance of the proposed method, the SA maps of all five down-sampling blocks are visualized. One example is presented in Fig. 5. It is clear from Fig. 5 that the T1C modal stream in both the proposed method and the FuseUNet + SA model was able to localize the mass regions through the SA modules (red arrows in Fig. 5), whereas the T2W modal stream could hardly find the regions of interest and even highlighted regions irrelevant to the task (blue arrows in Fig. 5). It is therefore reasonable and necessary to apply the T1C attention maps to the information selection of T2W. For situations where the different modalities generate images with similar sensitivities, our network architecture can still be utilized with an accordingly designed supervision-information extraction strategy. The main idea, supervised feature fusion of the different imaging modalities, should remain beneficial.

Figure 5: SA maps of the proposed method and the FuseUNet + SA model. Blocks 1-5 refer to the attention maps generated at the five blocks before the respective pooling layers. T1C and T2W refer to the feature maps generated by the T1C modal stream and the T2W modal stream.

4 Conclusion

In this work, we presented a novel network for the segmentation of multi-modal MR images. Inspired by knowledge distillation and attention transfer strategies, a supervised cross-modal deep representation learning method was designed that selectively fuses the useful information from the different modalities and suppresses the respective noise signals. Results on an in-vivo breast MR image dataset of two modalities confirmed the effectiveness of the proposed method. The proposed method is extendable to other medical image segmentation scenarios and will be investigated in the future.

4.0.1 Acknowledgements.

This research was partially supported by the National Natural Science Foundation of China (61601450, 61871371, 81830056), Science and Technology Planning Project of Guangdong Province (2017B020227012, 2018B010109009), Youth Innovation Promotion Association Program of Chinese Academy of Sciences (2019351), and the Basic Research Program of Shenzhen (JCYJ20180507182400762).


  • [1] Heywang-Köbrunner, S.H., Viehweg, P., Heinig, A., Küchler, Ch.: Contrast-enhanced MRI of the breast: accuracy, value, controversies, solutions. Eur. J. Radiol. 24(2), 94–108 (1997)
  • [2] Westra, C., Dialani, V., Mehta, T.S., Eisenberg, R.L.: Using T2-weighted sequences to more accurately characterize breast masses seen on MRI. Am. J. Roentgenol. 202(3), 183–190 (2014)
  • [3] Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
  • [4] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on CVPR, pp. 580–587. IEEE (2014)
  • [5] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on CVPR, pp. 3431–3440. IEEE (2015)
  • [6] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015, LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). 10.1007/978-3-319-24574-4_28
  • [7] Zhou, C., Ding, C., Lu, Z., Wang, X., Tao, D.: One-pass multi-task convolutional neural networks for efficient brain tumor segmentation. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds) MICCAI 2018, LNCS, vol. 11072, pp. 637–645, Springer, Cham (2018). 10.1007/978-3-030-00931-1_73
  • [8] Zhang, W., Li, R., Deng, H., Wang, L., Lin, W., Ji, S., Shen, D.: Deep convolutional neural networks for multi-modality isointense infant brain image segmentation. Neuroimage 108, 214–224 (2015)
  • [9] Nie, D., Wang, L., Gao, Y., Shen, D.: Fully convolutional networks for multi-modality isointense infant brain image segmentation. In: IEEE 13th ISBI, pp. 1342–1345, IEEE (2016)
  • [10] Pinto, A., Pereira, S., Meier, R., Alves, V., Wiest, R., Silva, C.A., Reyes, M.: Enhancing clinical MRI perfusion maps with data-driven maps of complementary nature for lesion outcome prediction. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds) MICCAI 2018, LNCS, vol. 11072, pp. 107–115, Springer, Cham (2018). 10.1007/978-3-030-00931-1_13
  • [11] Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In: Lai, S.H., Lepetit, V., Nishino, K., Sato, Y. (eds) ACCV 2016, LNCS, vol. 10111, pp. 213–228, Springer, Cham (2017). 10.1007/978-3-319-54181-5_14
  • [12] Chen, H., Li, Y.: Progressively complementarity-aware fusion network for RGB-D salient object detection. In: 2018 IEEE Conference on CVPR, pp. 3051–3060, IEEE (2018)
  • [13] Dolz, J., Gopinath, K., Yuan, J., Lombaert, H., Desrosiers, C., Ayed, I.B.: HyperDense-Net: A hyper-densely connected CNN for multi-modal image segmentation. IEEE Trans. Med. Imaging (Early Access) (2018)
  • [14] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv:1503.02531, pp. 1–9 (2015)
  • [15] Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: 5th ICLR, pp. 1–13, Microtome Publishing (2017)