Unsupervised Cross-Modality Domain Adaptation for Segmenting Vestibular Schwannoma and Cochlea with Data Augmentation and Model Ensemble

09/24/2021 ∙ by Hao Li, et al.

Magnetic resonance images (MRIs) are widely used to quantify vestibular schwannoma (VS) and the cochlea. Recently, deep learning methods have shown state-of-the-art performance for segmenting these structures. However, training segmentation models may require manual labels in the target domain, which are expensive and time-consuming to obtain. Domain adaptation overcomes this problem by leveraging information from the source domain to obtain accurate segmentations without requiring manual labels in the target domain. In this paper, we propose an unsupervised learning framework to segment the VS and cochlea. Our framework leverages information from contrast-enhanced T1-weighted (ceT1-w) MRIs and their labels, and produces segmentations for T2-weighted MRIs without any labels in the target domain. We first apply a generator to achieve image-to-image translation. Next, we fuse the outputs of an ensemble of different models to obtain the final segmentations. To cope with MRIs from different sites/scanners, we apply various 'online' augmentations during training to better capture the geometric variability and the variability in image appearance and quality. Our method is easy to build and produces promising segmentations, with mean Dice scores of 0.7930 and 0.7432 for VS and cochlea, respectively, on the validation set.




1 Introduction

Vestibular schwannoma (VS) is a benign tumor of the human hearing system. To better understand disease progression, quantitative analysis of the VS and cochlea from magnetic resonance images (MRIs) is important. Recently, deep learning frameworks have dominated the medical image segmentation field [shapey2019artificial, wang2019automatic, dorent2020scribble, li2021mri, Shapey:SDATA:2021] with state-of-the-art performance. However, supervised learning methods may lack domain generalizability, i.e., the ability to deal with MRIs from different domains, and such methods require a high level of consistency between training and testing data.

Furthermore, in the medical image analysis area, lack of human delineations in one or multiple domains is a common issue, which is problematic for supervised learning. Domain adaptation (DA) is a solution for increasing generalizability of deep learning models to deal with new data from different domains.

In this work, we propose an unsupervised cross-modality domain adaptation framework for segmenting the VS and cochlea. Our framework contains two parts: synthesis and segmentation. For synthesis, we apply a CycleGAN [zhu2017unpaired] to perform unpaired image translation between ceT1-weighted and T2-weighted MRIs. For segmentation, we use the generated T2-weighted MRIs as input and train an ensemble of models with various data augmentations, each of which yields a segmentation of the VS and cochlea. We fuse those segmentations to form the final segmentation.

2 Methods and material

2.1 Dataset

The cross-modality domain adaptation for medical image segmentation challenge dataset (https://crossmoda.grand-challenge.org/CrossMoDA/) contains 2 different MRI modalities: contrast-enhanced T1-w with an in-plane resolution of and slice thickness between , and high-resolution T2-w with an in-plane resolution of and slice thickness between . ceT1-weighted imaging was performed with an MPRAGE sequence and T2-weighted imaging with a 3D CISS or FIESTA sequence. The training set contains 105 ceT1-weighted and 105 T2-weighted MRIs, and the validation set contains 32 T2-weighted MRIs. Associated with all images in each modality are expert-acquired, manual VS and cochlea labels. More detailed information about this dataset can be found at https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70229053.

Figure 1: Proposed overall framework. Our framework contains 2 parts: synthesis (Syn) and segmentation (Seg). We use a CycleGAN model as the generator in the synthesis part. Results from the different models, trained with different augmentations, are fused to obtain the final segmentation.

2.2 Overall framework

Fig. 1 displays our proposed framework. After pre-processing, the 3D ceT1w image is fed to a generator to achieve image-to-image translation (ceT1w to T2w). Data augmentation strategies are applied to the generated T2w images to increase model robustness. We send the augmented data into four different models: two 2.5D models trained with different augmentation schemes, a 3D model with an attention module, and a 3D model without an attention module. Finally, we employ a union operation to fuse the outputs of the models into the final segmentation.
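The union-based fusion step can be illustrated with a minimal sketch. The label values (1 for VS, 2 for cochlea) and the rule that VS takes precedence where the two classes overlap are assumptions made for this illustration; the text does not specify how such conflicts are resolved.

```python
import numpy as np

# Hypothetical label values, assumed for this sketch only.
VS_LABEL, COCHLEA_LABEL = 1, 2

def fuse_by_union(predictions):
    """Fuse integer label maps of identical shape by per-class union."""
    predictions = [np.asarray(p) for p in predictions]
    fused = np.zeros(predictions[0].shape, dtype=np.uint8)
    # Write the cochlea union first, then the VS union, so that VS wins
    # wherever the two classes are both predicted (an assumption).
    for label in (COCHLEA_LABEL, VS_LABEL):
        mask = np.zeros(fused.shape, dtype=bool)
        for p in predictions:
            mask |= (p == label)
        fused[mask] = label
    return fused
```

A voxel is assigned to a class in the fused map if any model in the ensemble predicts that class there, which favors sensitivity when individual models under-segment.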

2.3 Preprocessing and postprocessing

Our pre-processing pipeline contains 5 steps: (1) denoising with a non-local means filter, (2) alignment to the MNI template by rigid registration, (3) bias field correction, (4) image cropping based on the region of interest, and (5) intensity normalization. In postprocessing, we extract the largest connected component for the VS.
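The postprocessing step can be sketched as follows, using `scipy.ndimage.label` to find connected components. The VS label value of 1 and the default face connectivity are assumptions for this sketch, not details taken from the text.

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(seg, label=1):
    """Keep only the largest connected component of one label.

    `label=1` for VS is an assumed convention; other labels are untouched.
    """
    seg = np.asarray(seg)
    mask = seg == label
    # Default structuring element: face connectivity (an assumption).
    components, n_components = ndimage.label(mask)
    if n_components <= 1:
        return seg.copy()
    sizes = np.bincount(components.ravel())
    sizes[0] = 0  # index 0 is the background of the component map
    largest = sizes.argmax()
    out = seg.copy()
    out[mask & (components != largest)] = 0
    return out
```

Removing the smaller components discards isolated false positives, which is reasonable when a single tumor is expected per image.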

2.4 Data augmentation

Data augmentation is widely used in medical image segmentation to help minimize the gap between datasets/domains, producing more robust segmentations. Here, we design an ‘online’ augmentation strategy during training and randomly apply data transformations to input images. These transformations are categorized in 3 different groups:

  • Spatial augmentation. 3 types of random spatial augmentation are used: affine transformation with angle range of [, ], and scale factor from 0.9 to 1.2; elastic deformation with ; and a combination of affine and elastic deformation. The same spatial augmentations and parameters are applied to both MRIs and the labels.

  • Image appearance augmentation. To reduce the differences in image appearance between MRIs from different sites and scanners, we randomly apply multi-channel Contrast Limited Adaptive Histogram Equalization (mCLAHE) and gamma correction with γ from 0.5 to 2 to adjust image contrast.

  • Image quality augmentation. In this context, image quality refers to resolution and noise level. We randomly blur the image using a Gaussian kernel with σ from 0.5 to 1.5, we add Gaussian noise with , and we sharpen the image by , where is the sharpened image, is the image blurred with a Gaussian kernel (), and is the image blurred twice with the same Gaussian kernel. In our case, we set .
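The appearance and quality augmentations above can be sketched roughly as below. Only the gamma range (0.5 to 2) and the blur range (0.5 to 1.5) come from the text. The noise standard deviation, the sharpening strength `alpha`, the sharpening kernel width, the application probability, and the unsharp-masking form `b1 + alpha * (b1 - b2)` are all assumptions for illustration, and mCLAHE and the spatial transformations are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gamma_correction(img, gamma):
    """Apply gamma correction on the min-max normalized image."""
    lo, hi = img.min(), img.max()
    if hi == lo:
        return img.copy()
    norm = (img - lo) / (hi - lo)
    return norm ** gamma * (hi - lo) + lo

def sharpen(img, sigma=1.0, alpha=1.0):
    """Unsharp masking from a once- and a twice-blurred image.

    The formula and the defaults for sigma and alpha are assumptions.
    """
    b1 = gaussian_filter(img, sigma)
    b2 = gaussian_filter(b1, sigma)
    return b1 + alpha * (b1 - b2)

def random_intensity_augment(img, rng, noise_std=0.05, p=0.5):
    """Randomly apply gamma correction and one quality transform."""
    img = img.astype(np.float32)
    if rng.random() < p:
        img = gamma_correction(img, rng.uniform(0.5, 2.0))
    choice = rng.integers(0, 4)  # 0: none, 1: blur, 2: noise, 3: sharpen
    if choice == 1:
        img = gaussian_filter(img, rng.uniform(0.5, 1.5))
    elif choice == 2:
        img = img + rng.normal(0.0, noise_std, img.shape)
    elif choice == 3:
        img = sharpen(img)
    return img
```

Because the transforms are drawn anew for each training sample, the model sees a different appearance of the same image in every epoch, which is the sense in which the augmentation is 'online'.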

2.5 2.5D model and its architecture

We applied a 2.5D CNN to alleviate the impact of the anisotropic image resolution. Network architecture details can be found in [wang2019automatic]. This 2.5D network contains both 2D and 3D convolutions to capture both in-slice and global information. Adapted from U-Net [ronneberger2015u], the 2.5D model contains an encoder and a decoder. 2D convolutions are used in the first 2 levels and 3D convolutions in the remaining levels. Max-pooling and deconvolution operations connect the features between levels. Batch normalization and parametric rectified linear units (pReLU) are used in the network architecture. An attention module is applied to assist in segmenting the small region of interest (ROI). A second 2.5D model with identical architecture was also used in the ensemble, with the only difference being the gamma correction parameter in the augmentation stage (with rather than ).

2.6 3D model and its architecture

We used a fully convolutional neural network for our 3D models. The network architecture details can be found in [li2021mri]. Similar to a 3D U-Net, it consists of an encoder and a decoder. 3D max-pooling and 3D nearest neighbor upsampling are used in the encoder and decoder, respectively. We further reinforce the output by adding the feature maps at the end of each level. In one of our two 3D models, we employ an attention module in the skip connections to emphasize the small ROI and preserve the information from encoder to decoder. Note that the two 3D models are identical except for the attention module.

2.7 Implementation details

The Adam optimizer was used with an L2 penalty of 0.00001, , and an initial learning rate of 0.0002 for the CycleGAN generators [zhu2017unpaired] and 0.001 for the segmentation models. For the generators, the learning rate was kept at the initial value for the first 100 epochs and then decayed to 0 over the following 100 epochs. For the segmentation models, the learning rate was decayed by a factor of 0.5 every 50 epochs. We evaluated our generator and segmentation models every epoch and selected the best ones based on visual quality control of the images and on the Dice score. We defined our loss function as from multiple labels, with equal weight () for all foreground labels and decayed weight () for the background. Training used a batch size of 2, was conducted on NVIDIA GPUs, and was implemented using PyTorch.
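The two learning rate schedules can be written as plain functions of the epoch index, as in the sketch below. The initial rates, the 100-epoch constant phase, and the halving every 50 epochs follow the text; the linear shape of the generator decay is an assumption, since the text only states that the rate reaches 0 after a further 100 epochs.

```python
def generator_lr(epoch, base_lr=2e-4, constant_epochs=100, decay_epochs=100):
    """Constant rate, then decay to 0 (linear decay is an assumption)."""
    if epoch < constant_epochs:
        return base_lr
    remaining = max(0, constant_epochs + decay_epochs - epoch)
    return base_lr * remaining / decay_epochs

def segmentation_lr(epoch, base_lr=1e-3, factor=0.5, step=50):
    """Multiply the rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)
```

In PyTorch, schedules of this kind are commonly expressed with `torch.optim.lr_scheduler.LambdaLR` and `torch.optim.lr_scheduler.StepLR`; whether the authors used these classes is not stated.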

        VS                Cochlea
Dice    0.7930(0.1515)    0.7432(0.0369)
ASSD    0.6343(0.3587)    0.2939(0.0626)
Table 1: Quantitative results in the validation phase. The Dice score and average symmetric surface distance (ASSD) are reported as mean(std).
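For reference, the Dice score in Table 1 measures the overlap between a predicted and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal per-label sketch follows; returning 1.0 when both masks are empty is a convention assumed here, and the challenge's own evaluation code may handle that case differently.

```python
import numpy as np

def dice_score(pred, truth, label):
    """Dice overlap for one label between two integer label maps."""
    p = np.asarray(pred) == label
    t = np.asarray(truth) == label
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * np.logical_and(p, t).sum() / denom
```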

3 Results

Tab. 1 displays the quantitative results of our method in the validation phase, reported as mean(std). Representative qualitative results can be viewed in Fig. 2. In the first row of Fig. 2, we observe that our 2.5D baseline method produces an accurate segmentation. Varying the augmentation parameters allows the 2.5D model to perform better in some challenging cases (fourth row of Fig. 2-(c)). However, in some cases, both 2.5D models under-segment the VS, as seen in the second and third rows of Fig. 2-(b,c). In such cases, the 3D models compensate for this tendency of the 2.5D models. Although the attention module is good at capturing small ROIs, as seen in the second row of Fig. 2-(e), it may also lead to over-fitting (first and third rows of Fig. 2-(e)). Thus, we also include a 3D model without the attention module (first and third rows of Fig. 2-(d)) in the model ensemble for best results. The model fusion balances the strengths of the individual models and produces consistently good segmentations across a variety of images (Fig. 2-(f)).

Figure 2: Segmentation results from 4 different subjects. Red: VS; green: cochlea. (a) Input image. (b) 2.5D model with . (c) 2.5D model with . (d) 3D model without attention. (e) 3D model with attention. (f) Final segmentation. The final model fusion step produces consistently good segmentations.

4 Conclusion

In this work, we proposed an unsupervised cross-modality domain adaptation framework for VS and cochlea segmentation. In the validation stage, our method shows promising results.