Gliomas, the most common primary central nervous system malignancies, consist of various subregions. In medical practice, multimodal images provide different biological information that maps tumor-induced tissue change, such as postcontrast T1-weighted (T1Gd), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (FLAIR) volumes. The same anatomical structures may be observed in multiple modalities despite the different image characteristics. Currently, most deep models for multimodal segmentation rely on paired registered images [6, 9, 7]. However, paired registered multimodal images are difficult to obtain in many cases. Therefore, a model that can segment the target objects in different modalities from unpaired images would be valuable for many clinical applications.
Recently, the problem of multimodal segmentation has been extensively studied. Nie et al. trained a network for three modalities and fused their high-layer features for the final segmentation. Tseng et al. proposed cross-modality convolution layers to better leverage multimodal information. However, these methods were limited because they required paired registered images. An alternative approach is to train different models for different modalities in a shared latent space by extracting a common representation from the different modalities. Kuga et al. trained multimodal encoder-decoder networks with a shared representation. Valindria et al. extracted modality-independent features by sharing the last layers of the encoder. However, these methods were not unified and required more parameters, because each modality needed its own encoder-decoder architecture. To extract modality-invariant features efficiently, Xu et al. presented a multimodal distillation module. Hu et al. performed feature recalibration using SE (squeeze-and-excitation) blocks. Recently, adversarial learning has been regarded as an effective way to transfer knowledge across different image domains. Huo et al. presented an end-to-end synthesis and segmentation network with unpaired MRI and CT images. To address limited scalability and robustness in translating among more than two domains, Choi et al. developed a scalable approach (StarGAN) that can perform image-to-image translation for multiple domains using a unified model.
During the adversarial learning of StarGAN, the generator learns to change the image characteristics while preserving the global features in order to fool the discriminator. Utilizing these global features may improve the performance of multimodal segmentation. Thus, in this work, we take translation as an auxiliary task to help segmentation and propose a two-stream translation and segmentation unified attentional generative adversarial network (UAGAN). In the translation stream, the discriminator is the same as in StarGAN, whereas the backbone of the generator is changed to U-net in order to better leverage the low-level and high-level features. In the segmentation stream, another U-net is adopted, which shares the last layers of the encoder with the translation stream. Because not all features extracted from the translation stream are useful for segmentation, we add attentional blocks to focus on the features that are. Experiments on three-modality brain tumor segmentation indicate that UAGAN outperforms existing methods in most cases.
2 UAGAN for Multimodal Segmentation
Multimodal segmentation using a single model remains very challenging due to the different image characteristics of different modalities. A key challenge is to extract modality-invariant features. Previous multimodal segmentation methods have required paired n-modality images. To address the limitation of requiring paired multimodal images, we present a two-stream UAGAN. The following subsections discuss the details of the proposed framework.
2.1 Method Overview
Fig. 1a illustrates the training strategy of the proposed UAGAN and Fig. 1b shows the architecture of our UAGAN with the translation and segmentation streams. Both streams adopt the U-net architecture. Inspired by previous work, we adopt independent encoders and decoders but share the last layers of the encoders. In what follows, we refer to the network of the translation stream as the translation network and to the network of the segmentation stream as the segmentation network. The adversarial training strategy is similar to that of StarGAN and contains two phases. In the forward phase, we first randomly generate the target-modality label, which is the one-hot encoding of one of the modalities (FLAIR, T1Gd or T2). We expand this label to the size of the input image and perform depth-wise concatenation between the expanded label and an image of an arbitrary known modality (the source image). Given the source image and the target-modality label as input, the translation network learns to translate the source image into a target-modality image. Meanwhile, the segmentation network takes its own input and outputs the segmentation map. In the backward phase, the translation network takes the fake image and the source-modality label as inputs and tries to recover the source-modality image. To preserve the tumor structure, an auxiliary shape-consistency loss is added in the backward phase, which is the cross-entropy loss between the predicted segmentation maps and their manual annotations.
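The label expansion and depth-wise concatenation of the forward phase can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the channel-first `[C][H][W]` layout, the modality order in `MODALITIES`, and all function names are assumptions made for the example.

```python
import random

# ASSUMPTION: the order of the modalities in the one-hot code is illustrative.
MODALITIES = ["FLAIR", "T1Gd", "T2"]


def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def expand_label(label, height, width):
    """Turn a length-C label into C constant planes of size height x width."""
    return [[[value] * width for _ in range(height)] for value in label]


def build_translation_input(image, target_index):
    """Concatenate image channels and expanded label planes along the channel axis.

    `image` is a nested list in [C][H][W] layout (an assumed layout).
    """
    height, width = len(image[0]), len(image[0][0])
    label = one_hot(target_index, len(MODALITIES))
    return image + expand_label(label, height, width)


# A toy single-channel 2 x 2 "slice" and a randomly drawn target modality.
image = [[[0.5, -0.2], [1.0, 0.0]]]
target_index = random.randrange(len(MODALITIES))
x = build_translation_input(image, target_index)
print(MODALITIES[target_index], len(x))  # number of channels is 1 + 3
```

With a single-channel slice and three modalities, the translation input has four channels: the image itself plus one constant plane per modality.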
In order to fool the discriminator, the translation network learns to modify image intensity relative to neighboring tissues while keeping the brain structure unchanged. Exploiting these unchanged features across modalities has the potential to improve segmentation performance. However, directly combining features from the translation task may not be optimal, because some of the features extracted by the translation stream are unrelated to the segmentation task.
To discard these features, we adopt attentional blocks at each upsampling step. The attentional blocks first generate attention maps that highlight related features and suppress unrelated ones. The attentional blocks then fuse the lower-level feature maps from both streams with the attention maps. For more details, please refer to Section 2.2. During inference, UAGAN produces segmentation results given the test images and their modality labels.
2.2 Attentional Blocks
In the U-net architecture, the encoder captures multilevel features ranging from low-level details to high-level semantic knowledge. The decoder gradually combines the low-level and high-level features to construct the final result. The features extracted by the segmentation stream are expected to be more related to the tumor. However, the translation stream is trained to fool the discriminator. Therefore, unrelated information such as the contour and internal structures of the brain may be preserved in the translation encoder. To emphasize tumor-related features and suppress unrelated ones, we propose attentional blocks at each upsampling step in the decoders. As presented in Fig. 1a, the encoder features are indexed by task (translation or segmentation) and by the level of the feature maps. Given the feature maps extracted at a level by the other task, an attention map is first produced by applying a convolution with a learned convolution mask, followed by a sigmoid function.
To reduce the information gap between the features of the two tasks, we apply a convolution with its own parameters to the features and then perform element-wise multiplication with the attention map, so that related information is emphasized automatically. This yields the fused outputs at that level.
The final outputs of the attentional block are the concatenation of the fused outputs and the upsampled features extracted from the decoder of the corresponding task.
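The following pure-Python sketch shows one possible reading of an attentional block on tiny feature maps. The original equations are not available here, so several choices are assumptions rather than the authors' definition: the convolutions are taken to be 1 × 1 (pointwise), the attention map is computed from and applied to the other task's features, and the gated features are fused with the task's own features by addition. All names are hypothetical.

```python
import math


def conv1x1(features, weights):
    """Pointwise convolution. features: [C_in][H][W]; weights: [C_out][C_in]."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [[sum(w * features[c][i][j] for c, w in enumerate(row))
          for j in range(width)]
         for i in range(height)]
        for row in weights
    ]


def elementwise(op, a, b):
    """Apply a binary operation to two feature maps of identical shape."""
    return [[[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(pa, pb)]
            for pa, pb in zip(a, b)]


def attention_map(other, mask_weights):
    """Sigmoid of a convolution over the other task's features."""
    return [[[1.0 / (1.0 + math.exp(-v)) for v in row] for row in plane]
            for plane in conv1x1(other, mask_weights)]


def attentional_block(own, other, decoder_up, mask_weights, fuse_weights):
    """Gate the other task's features and concatenate with decoder features."""
    gate = attention_map(other, mask_weights)
    gated = elementwise(lambda p, g: p * g, conv1x1(other, fuse_weights), gate)
    # ASSUMPTION: additive fusion with the task's own encoder features.
    fused = elementwise(lambda x, y: x + y, own, gated)
    # Channel-wise concatenation with the upsampled decoder features.
    return fused + decoder_up


own = [[[1.0, 1.0]], [[1.0, 1.0]]]      # 2 channels, 1 x 2
other = [[[2.0, 2.0]], [[4.0, 4.0]]]    # features from the other stream
decoder_up = [[[9.0, 9.0]]]             # 1 channel of upsampled decoder features
zero_mask = [[0.0, 0.0], [0.0, 0.0]]    # gives a constant attention of 0.5
identity = [[1.0, 0.0], [0.0, 1.0]]
out = attentional_block(own, other, decoder_up, zero_mask, identity)
```

With a zero convolution mask the gate is 0.5 everywhere, so the example halves the other stream's features before they are added to the task's own features and concatenated with the decoder channel.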
3 Experimental Results
Experiments were carried out on multimodal multisite MRI data released in the Brain Tumors Task of the Medical Segmentation Decathlon (http://medicaldecathlon.com/). We selected 300 patients randomly and divided them into three disjoint partitions to form three independent datasets. Each dataset contained images of different modalities. Experiments were conducted on 2D slices of each dataset separately. In each dataset, 50 patients were used for training and 50 for testing. Each patient volume consisted of 155 slices of size 240 × 240 with a pixel size of 1 × 1 mm in the axial view. To form unpaired data in each dataset, we randomly used only one of the modalities (T1Gd, FLAIR or T2) per patient and applied z-score normalization to each volume individually. To exclude irrelevant regions, we discarded slices containing little brain tissue, cropped the central 200 × 200 region of each slice, and then resized it to 128 × 128 because of limited GPU memory. After preprocessing, the average numbers of training and testing slices per dataset were 5867 and 5861, respectively. To prevent overfitting, several data augmentation techniques (rotation, vertical flipping, horizontal flipping and scaling) were applied on the fly during training. The whole framework was implemented in PyTorch on a computer with an Intel i7 8700K CPU and an NVIDIA GTX 1080Ti GPU.
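Two of the preprocessing steps described above, per-volume z-score normalization and the central 200 × 200 crop, can be sketched in plain Python as below. This is an illustration under assumptions: the text does not say whether the statistics are computed over all voxels or only brain voxels (all voxels and the population standard deviation are used here), the resize to 128 × 128 is omitted because the interpolation method is not stated, and the function names are hypothetical.

```python
import statistics


def zscore_volume(volume):
    """Normalize a [D][H][W] volume to zero mean and unit standard deviation.

    ASSUMPTION: statistics are taken over all voxels of the volume.
    """
    voxels = [v for sl in volume for row in sl for v in row]
    mean = statistics.fmean(voxels)
    std = statistics.pstdev(voxels)
    if std == 0:
        raise ValueError("cannot normalize a constant volume")
    return [[[(v - mean) / std for v in row] for row in sl] for sl in volume]


def center_crop(slice_2d, size):
    """Return the central size x size region of a 2D slice."""
    height, width = len(slice_2d), len(slice_2d[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in slice_2d[top:top + size]]


# A synthetic 240 x 240 slice whose values encode their own position.
slice_240 = [[float(r * 240 + c) for c in range(240)] for r in range(240)]
cropped = center_crop(slice_240, 200)
normalized = zscore_volume([[[1.0, 3.0]]])
```

For a 240 × 240 slice the crop removes a 20-pixel border on every side.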
3.2 Training Strategy
3.2.1 Loss Function.
To generate indistinguishable target-modality images, the objective functions used to optimize the discriminator are the same as in StarGAN. However, the objective function used to optimize the generator differs because it incorporates the new segmentation task: it combines the translation loss defined in StarGAN with a segmentation loss. The segmentation loss is defined as the weighted sum of the cross-entropy losses in the forward and backward phases. We set both weights to 100 to emphasize the segmentation task. The other weights are set as in StarGAN.
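One way to write out the generator objective described in this subsection is given below. The symbols are assumptions introduced for illustration (the original notation is not available), and treating the overall objective as a plain sum of the translation and segmentation terms is an interpretation of the text rather than a confirmed formula.

```latex
% Assumed notation: L_trans is the StarGAN generator (translation) loss,
% L_ce^fwd and L_ce^bwd are the cross-entropy losses of the forward and
% backward phases, and lambda_fwd, lambda_bwd are their weights.
\begin{aligned}
\mathcal{L}_{G} &= \mathcal{L}_{\mathrm{trans}} + \mathcal{L}_{\mathrm{seg}}, \\
\mathcal{L}_{\mathrm{seg}} &= \lambda_{\mathrm{fwd}}\,\mathcal{L}_{\mathrm{ce}}^{\mathrm{fwd}}
  + \lambda_{\mathrm{bwd}}\,\mathcal{L}_{\mathrm{ce}}^{\mathrm{bwd}},
  \qquad \lambda_{\mathrm{fwd}} = \lambda_{\mathrm{bwd}} = 100 .
\end{aligned}
```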
3.2.2 Training Parameters.
We trained all networks end to end with the Adam optimizer and a batch size of 8. All networks were trained for up to 100 epochs; the learning rate was held at its initial value for the first 60 epochs and then reduced linearly. In the early phase, the synthetic images were blurry, so the weight of the cross-entropy loss that depends on them was set to 0 at the beginning and increased linearly to 100 at epoch 60. Finally, we used the model obtained at the 100th epoch for evaluation on the test data.
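The two schedules described above (a learning rate that is constant for 60 epochs and then decays linearly, and a loss weight that ramps linearly from 0 to 100 over the first 60 epochs) can be sketched as follows. The initial and final learning rates are left as parameters because their values are not given here, the values used in the example calls are placeholders, and the zero-based epoch indexing and function names are assumptions.

```python
def linear_ramp(epoch, start_value, end_value, start_epoch, end_epoch):
    """Hold start_value up to start_epoch, end_value from end_epoch on,
    and interpolate linearly in between."""
    if epoch <= start_epoch:
        return start_value
    if epoch >= end_epoch:
        return end_value
    fraction = (epoch - start_epoch) / (end_epoch - start_epoch)
    return start_value + fraction * (end_value - start_value)


def learning_rate(epoch, initial_lr, final_lr, hold_epochs=60, total_epochs=100):
    """Constant for `hold_epochs`, then linear decay until `total_epochs`."""
    return linear_ramp(epoch, initial_lr, final_lr, hold_epochs, total_epochs)


def loss_weight(epoch, max_weight=100.0, ramp_epochs=60):
    """Linear warm-up of a loss weight from 0 to `max_weight`."""
    return linear_ramp(epoch, 0.0, max_weight, 0, ramp_epochs)


# Placeholder learning rates (1.0 -> 0.0) are used only to show the shape.
for epoch in (0, 30, 60, 80, 100):
    print(epoch, learning_rate(epoch, 1.0, 0.0), loss_weight(epoch))
```

In a PyTorch training loop, the same functions could supply the multiplier passed to a scheduler or be evaluated once per epoch before computing the loss.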
3.3 Whole Tumor Segmentation Performance
There are two types of multimodal segmentation methods: unified and non-unified models. In a unified model, images of all modalities are processed by a single stream, whereas in a non-unified model each modality is processed by a separate stream.
| Multi V4 | 57.41 ± 0.69 | 54.18 ± 1.07 | 67.14 ± 3.82 | 98.84 ± 0.12 | 8.82 ± 0.57 |
| Multi V4 | 79.56 ± 1.97 | 73.78 ± 3.62 | 88.86 ± 0.91 | 99.25 ± 0.15 | 3.40 ± 0.19 |
| Multi V4 | 73.07 ± 3.19 | 69.15 ± 8.51 | 81.40 ± 4.08 | 99.28 ± 0.14 | 4.51 ± 0.97 |
As a baseline, we trained one U-net per modality and tested each only on the corresponding modality (Individual). To compare with a unified model, we trained a single U-net for all modalities (Joint). To compare with a comparable non-unified model, we trained a multistream model (Multi V4) and changed the backbone from FCN to U-net for a fair comparison. Ablation studies were conducted on UAGAN with two variants: sharing only the last blocks of the encoders without fusing features from the different streams (UAGAN-fuse), and simply adding the features of both streams without attention (UAGAN-atten). To verify the effect of the translation stream, we also replaced the translation task with a reconstruction task and kept the attention blocks unchanged (denoted as UAGAN-trans).
We employed five metrics to evaluate the performance of whole brain tumor segmentation: Dice score (Dice), Precision, Sensitivity (Sens), Specificity (Spec) and Average Symmetric Surface Distance (ASSD). Because all the models were trained on 2D slices, we concatenated the slices of the same patient and computed all the metrics on 3D volumes. A better model has higher Dice, Sens, Spec and Precision, and lower ASSD.
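The four overlap-based metrics can be computed from voxel counts once the 2D predictions of a patient are stacked into a volume, as in the sketch below. It uses the standard definitions of Dice, precision, sensitivity and specificity for binary masks; ASSD is omitted because it requires surface extraction, and the function names are hypothetical.

```python
def flatten_volume(slices):
    """Stack a patient's 2D masks ([D][H][W]) into one flat list of voxels."""
    return [v for sl in slices for row in sl for v in row]


def overlap_metrics(pred_slices, true_slices):
    """Dice, precision, sensitivity and specificity of binary 3D masks."""
    pred = flatten_volume(pred_slices)
    true = flatten_volume(true_slices)
    tp = sum(1 for p, t in zip(pred, true) if p and t)
    fp = sum(1 for p, t in zip(pred, true) if p and not t)
    fn = sum(1 for p, t in zip(pred, true) if not p and t)
    tn = sum(1 for p, t in zip(pred, true) if not p and not t)

    def ratio(numerator, denominator):
        # Undefined ratios (empty masks) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Two 2 x 2 slices per "patient": one true positive, one false positive,
# one false negative and five true negatives.
pred_slices = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
true_slices = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
metrics = overlap_metrics(pred_slices, true_slices)
```

The values are fractions in [0, 1]; multiplying by 100 gives percentages.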
Experiments were conducted with three disjoint unpaired datasets, and the results are shown in Table 1. We compared the metrics in three modalities (T1Gd, FLAIR and T2). Unified models are marked with a symbol in the table. The boxplot in Fig. 3 shows the Dice scores of the different models over all test cases. Some visual segmentation results are shown in Fig. 4.
Our methods outperformed the unified model and performed better than the non-unified models, except in some cases for T1Gd. The results revealed that similar modalities, such as FLAIR and T2, could improve each other's segmentation performance.
4 Conclusion
In this work, we propose a novel two-stream unified attentional generative adversarial network (UAGAN) for multimodal unpaired medical image segmentation. Owing to its unified structure, our framework is flexible, can be used with more than two modalities, and has fewer parameters than a comparable non-unified model. To capture the modality-invariant features beneficial to segmentation, we fuse the features from both the segmentation and translation streams. Furthermore, feature recalibration is performed with attentional blocks to emphasize useful features. Experiments on brain tumor segmentation indicate that our framework achieves better performance in most cases. The more similar the modalities are, as with FLAIR and T2, the more significant the effect. Our proposed framework can alleviate the scarcity of paired multimodal medical images. In the future, the framework will be applied to other biomedical image segmentation tasks such as multimodal abdominal organ segmentation.
This work is supported by the National Natural Science Foundation of China (61402181, 61502174), the Natural Science Foundation of Guangdong Province (2015A030313215, 2017A030313358, 2017A030313355), the Science and Technology Planning Project of Guangdong Province (2016A040403046), and the Guangzhou Science and Technology Planning Project (201704030051).
-  Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C.: Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific data 4, 170117 (2017)
-  Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8789–8797 (2018)
-  Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)
-  Huo, Y., Xu, Z., Bao, S., Assad, A., Abramson, R.G., Landman, B.A.: Adversarial synthesis learning enables segmentation without target modality ground truth. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). pp. 1217–1220. IEEE (2018)
-  Kuga, R., Kanezaki, A., Samejima, M., Sugano, Y., Matsushita, Y.: Multi-task learning using multi-modal encoder-decoder networks with shared skip connections. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 403–411 (2017)
-  Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34(10), 1993–2024 (2015)
-  Nie, D., Wang, L., Gao, Y., Shen, D.: Fully convolutional networks for multi-modality isointense infant brain image segmentation. In: 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI). pp. 1342–1345. IEEE (2016)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
-  Tseng, K.L., Lin, Y.L., Hsu, W., Huang, C.Y.: Joint sequence learning and cross-modality convolution for 3d biomedical segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6393–6400 (2017)
-  Valindria, V.V., Pawlowski, N., Rajchl, M., Lavdas, I., Aboagye, E.O., Rockall, A.G., Rueckert, D., Glocker, B.: Multi-modal learning from unpaired images: Application to multi-organ segmentation in ct and mri. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 547–556. IEEE (2018)
-  Xu, D., Ouyang, W., Wang, X., Sebe, N.: Pad-net: multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 675–684 (2018)
-  Zhang, Z., Yang, L., Zheng, Y.: Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9242–9251 (2018)