CNN-based Prostate Zonal Segmentation on T2-weighted MR Images: A Cross-dataset Study

03/29/2019 ∙ by Leonardo Rundo, et al.

Prostate cancer is the most common cancer among US men. However, prostate imaging is still challenging despite the advances in multi-parametric Magnetic Resonance Imaging (MRI), which provides both morphologic and functional information pertaining to the pathological regions. Along with whole prostate gland segmentation, distinguishing between the Central Gland (CG) and Peripheral Zone (PZ) can guide towards differential diagnosis, since the frequency and severity of tumors differ in these regions; however, their boundary is often weak and fuzzy. This work presents a preliminary study on Deep Learning to automatically delineate the CG and PZ, aiming at evaluating the generalization ability of Convolutional Neural Networks (CNNs) on two multi-centric MRI prostate datasets. Specifically, we compared three CNN-based architectures: SegNet, U-Net, and pix2pix. In this context, the segmentation performances achieved with/without pre-training were compared in 4-fold cross-validation. In general, U-Net outperforms the other methods, especially when training and testing are performed on multiple datasets.




1 Introduction

Prostate cancer (PCa) is expected to be the most common cancer among US men during 2018 [siegel2018]. Several imaging modalities can aid PCa diagnosis—such as Transrectal Ultrasound (TRUS), Computed Tomography (CT), and Magnetic Resonance Imaging (MRI) [rundo2019MedGA]—according to the clinical context. Conventional structural T1-weighted (T1w) and T2-weighted (T2w) MRI sequences can play an important role along with functional MRI, such as Dynamic Contrast Enhanced MRI (DCE-MRI), Diffusion Weighted Imaging (DWI), and Magnetic Resonance Spectroscopic Imaging (MRSI) [lemaitre2015]. Therefore, MRI conveys more information for PCa diagnosis than CT, revealing the internal prostatic anatomy, prostatic margins, and the extent of prostatic tumors [villeirs2007].

The manual delineation of both prostate Whole Gland (WG) and PCa on MR images is a time-consuming and operator-dependent task, which relies on experienced physicians [rundo2017Inf]. Besides WG segmentation, distinguishing between the Central Gland (CG) and Peripheral Zone (PZ) of the prostate can guide towards differential diagnosis, since the frequency and severity of tumors differ in these regions [choi2007, niaf2012]; the PZ harbors the majority of PCa and is a target for prostate biopsy [haffner2009]. Regarding this anatomic division of the prostate, the zonal compartment scheme proposed by McNeal is widely accepted [selman2011]. In this context, T2w MRI is the de facto standard in the clinical routine of prostate imaging thanks to its high resolution, which allows for differentiating the hyper-intense PZ and hypo-intense CG in young male subjects [hoeks2011]. However, the conventional clinical protocol for PCa based on Prostate-Specific Antigen (PSA) and systematic biopsy does not generally obtain reliable diagnostic outcomes; thus, the PZ volume ratio (i.e., the PZ volume divided by the WG volume) was recently integrated for PCa diagnostic refinement [chang2017]; the CG volume ratio can also be useful for monitoring prostate hyperplasia [kirby2002]. Furthermore, for robust clinical applications, generalization among different prostate MRI datasets from multiple institutions is essential.

So, how can we extract the CG and PZ from the WG on different MRI datasets? In this work, we automatically segment the CG and PZ using Deep Learning to evaluate the generalization ability of Convolutional Neural Networks (CNNs) on two different MRI prostate datasets. However, this is challenging since multi-centric datasets are generally characterized by different contrast, visual consistencies, and image characteristics. Therefore, prostate zones on T2w MR images were manually annotated for supervised learning, and then automatically segmented using a mixed scheme by (i) training on either each individual dataset or both datasets and (ii) testing on both datasets, using CNN-based architectures: SegNet [badrinarayanan2017], U-Net [ronneberger2015], and pix2pix [isola2016]. In this context, we compared the segmentation performances achieved with/without pre-training [tajbakhsh2016].

The manuscript is structured as follows: Sect. 2 outlines the state of the art in MRI prostate zonal segmentation methods; Sect. 3 describes the MRI datasets as well as the proposed CNN-based segmentation approach; Sect. 4 shows our experimental results; finally, some concluding remarks and possible future developments are given in Sect. 5.

2 Background

In prostate MR image analysis, WG segmentation is essential, especially on T2w MR images [ghose2012] or both T2w and the corresponding T1w images [rundo2017Inf, rundo2018WIRN]. To this end, previous works used atlas-based methods [klein2008], deformable models, or statistical priors [martin2010]. More recently, Deep Learning [bevilacqua2019] has been successfully applied to this domain, combining deep feature learning with shape models [guo2017] or using CNN-based segmentation approaches [milletari2016].

Among the studies on prostate segmentation, the most representative works are outlined hereafter. The authors of [toth2013] used active appearance models combined with multiple level sets for simultaneously segmenting prostatic zones. Qiu et al. [qiu2014] proposed a zonal segmentation approach introducing a continuous max-flow model based on a convex-relaxed optimization problem with region consistency constraints. Unlike these methods that analyzed T2w images alone, Makni et al. [makni2011] exploited the evidential C-means algorithm to partition the voxels into their respective zones by integrating the information conveyed by T2w, DWI, and CE T1w MRI sequences.

However, these methods do not evaluate the generalization ability on different MRI datasets from multiple institutions, making their clinical applicability difficult [albadawy2018]; thus, we verify the cross-dataset generalization using three CNN-based architectures with/without pre-training. Moreover, differently from a recent CNN-based work on DWI data [clark2017], to the best of our knowledge, this is the first CNN-based prostate zonal segmentation approach on T2w MRI alone.

3 Materials and Methods

For clinical applications with better generalization ability, we evaluate prostate zonal segmentation performances of three CNN-based architectures: SegNet, U-Net, and pix2pix. We also compare the results in 4-fold cross-validation by (i) training on either each individual dataset or both datasets and (ii) testing on both datasets, with/without pre-training on a relatively large prostate dataset.

3.1 MRI Datasets

We segment the CG and PZ from the WG using two completely different multi-parametric prostate MRI datasets, namely:

  • dataset containing patients/ MR slices with prostate, acquired with a whole body Philips Achieva 3T MRI scanner using a phased-array pelvic coil at the Cannizzaro Hospital (Catania, Italy) [rundo2017Inf]. The MRI parameters: matrix size pixels; slice thickness mm; inter-slice spacing mm; pixel spacing mm; number of slices ;

  • Initiative for Collaborative Computer Vision Benchmarking (I2CVB) dataset ( patients/ MR slices with prostate), acquired with a whole body Siemens TIM 3T MRI scanner using a body coil at the Hospital Center Regional University of Dijon-Bourgogne (France) [lemaitre2015]. The MRI parameters: matrix size pixels; slice thickness mm; inter-slice spacing mm; pixel spacing mm; number of slices .

Figure 3: Example input prostate T2w MR axial slices in their original image ratio: (a) dataset ; (b) dataset . The CG and PZ are highlighted with solid and dashed white lines, respectively.

To make the proposed approach clinically feasible [hoeks2011], we analyzed only T2w images (the most commonly used sequence for prostate zonal segmentation) among the available sequences. Fig. 3 shows two example T2w MR images, one from each of the two analyzed datasets. We conducted the following three experiments using a 4-fold cross-validation scheme to confirm the generalization effect under different training/testing conditions in our multi-centric study:

  • Individual dataset : training on dataset alone, and testing on the whole dataset and the rest of dataset separately for each round;

  • Individual dataset : training on dataset alone, and testing on the whole dataset and the rest of dataset separately for each round;

  • Mixed dataset: training on both datasets and , and testing on the rest of datasets and separately for each round.

For 4-fold cross-validation, we partitioned the datasets and using patient indices , , , and , , , , respectively. Finally, the results from the different cross-validation rounds were averaged.
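The patient-wise partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `patient_folds` and `cross_validation_rounds`, the contiguous grouping of sorted patient indices, and the cohort of 10 patients are all hypothetical and do not reproduce the study's actual partition.

```python
import numpy as np

def patient_folds(patient_ids, n_folds=4):
    """Split sorted patient IDs into n_folds contiguous groups (sizes differ by at most 1)."""
    ids = np.asarray(sorted(patient_ids))
    return [fold.tolist() for fold in np.array_split(ids, n_folds)]

def cross_validation_rounds(patient_ids, n_folds=4):
    """Yield (train_ids, test_ids) per round; each patient is held out exactly once."""
    folds = patient_folds(patient_ids, n_folds)
    for k, test_ids in enumerate(folds):
        train_ids = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train_ids, test_ids

# Hypothetical cohort of 10 patients (not the size of either dataset in the study).
rounds = list(cross_validation_rounds(range(1, 11)))
for train_ids, test_ids in rounds:
    print("train:", train_ids, "test:", test_ids)
```

Splitting by patient rather than by slice keeps all slices of one subject on the same side of the train/test boundary, which is the usual motivation for partitioning on patient indices.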

3.2 CNN-based Prostate Zonal Segmentation

This work adopts a selective two-step delineation approach to focus on pathological regions in the CG and PZ denoted with and , respectively. Relying on [villeirs2007, qiu2014], the PZ was obtained by subtracting the CG from the WG () meeting the constraints: and . The overall prostate zonal segmentation method is outlined in Fig. 4.

Figure 4: Work-flow of our CNN-based prostate zonal segmentation approach. The gray and black data blocks denote gray-scale images and binary masks, respectively.

Starting from the whole prostate MR image, a pre-processing phase, comprising image cropping and resizing to deal with the different characteristics of the two datasets, is performed (see Sect. 3.1). Afterwards, the resulting image is masked with the binary WG and fed into the investigated CNN-based models for CG segmentation, which is then refined. Finally, the PZ delineation is obtained by subtracting the CG from the WG according to [qiu2014].
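The final subtraction step can be illustrated with binary masks. The sketch below assumes that masks are boolean NumPy arrays and that any CG pixel falling outside the WG is discarded first; the function name `peripheral_zone` is a hypothetical helper, not part of the original implementation.

```python
import numpy as np

def peripheral_zone(wg_mask, cg_mask):
    """Return (CG, PZ) masks, where PZ is the WG minus the CG."""
    wg = np.asarray(wg_mask, dtype=bool)
    cg = np.asarray(cg_mask, dtype=bool) & wg  # keep only CG pixels inside the WG
    pz = wg & ~cg                              # what remains of the WG is the PZ
    return cg, pz

# Toy 6x6 example: a 4x4 gland with a 2x2 central region.
wg = np.zeros((6, 6), dtype=bool)
wg[1:5, 1:5] = True
cg_pred = np.zeros((6, 6), dtype=bool)
cg_pred[2:4, 2:4] = True
cg, pz = peripheral_zone(wg, cg_pred)
```

By construction the two zones are disjoint and their union equals the WG, so the PZ needs no separate network output.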

3.2.1 Pre-processing

To fit the image resolution of the dataset , we center-cropped the images of the dataset and resized them to pixels. Furthermore, the images of these datasets were masked using the corresponding prostate binary masks to omit the background and only focus on extracting the CG and PZ from the WG. This operation can be performed either by an automated method [rundo2017Inf, rundo2018WIRN] or previously provided manual WG segmentation [lemaitre2015]. For better training, we randomly cropped the input images from to pixels and horizontally flipped them.
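A minimal sketch of the cropping, masking, and augmentation steps follows. All sizes (320, 288, 256) are placeholder values chosen for illustration, the flip probability of 0.5 is an assumption, and the resizing step is omitted; the function names are hypothetical.

```python
import numpy as np

def center_crop(image, size):
    """Crop a (H, W) array to (size, size) around its center."""
    h, w = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def mask_background(image, wg_mask):
    """Zero out every pixel outside the whole-gland mask."""
    return np.where(wg_mask.astype(bool), image, 0)

def random_crop_and_flip(image, target, size, rng):
    """Apply the same random crop and horizontal flip (assumed p = 0.5) to both arrays."""
    h, w = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    image = image[top:top + size, left:left + size]
    target = target[top:top + size, left:left + size]
    if rng.random() < 0.5:
        image, target = image[:, ::-1], target[:, ::-1]
    return image, target

# Placeholder data: a random "slice" and a square whole-gland mask.
rng = np.random.default_rng(0)
mr_slice = rng.random((320, 320))
wg = np.zeros((320, 320), dtype=bool)
wg[100:220, 100:220] = True

wg_c = center_crop(wg, 288)
x = mask_background(center_crop(mr_slice, 288), wg_c)
x_aug, wg_aug = random_crop_and_flip(x, wg_c, 256, rng)
```

Applying the identical crop offsets and flip to the image and its mask keeps the two spatially aligned, which matters whenever the mask serves as a training target.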

3.2.2 Investigated CNN-based Architectures

The following architectures were chosen in our comparative analysis since they cover several aspects regarding the CNN-based segmentation: SegNet [badrinarayanan2017] addresses semantic segmentation (i.e., assigning each pixel of the scene to an object class), U-Net [ronneberger2015] was successfully applied in biomedical image segmentation, while pix2pix [isola2016] exploits an adversarial generative model to perform image-to-image translations.

During the training of all architectures, the loss function (i.e., a continuous version of the Dice Similarity Coefficient) was used [milletari2016]:


where and represent the continuous values of the prediction map (i.e., the result of the final layer of the CNN) and the ground truth at the -th pixel ( is the total number of pixels to be classified), respectively.
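For illustration, one common continuous Dice formulation consistent with this description is sketched below. The symbols s_i (prediction) and r_i (ground truth), the negative sign, and the unsquared sums in the denominator are assumptions; [milletari2016], for instance, squares the terms in the denominator, and the exact variant used here is not recoverable from the text. NumPy is used instead of a Deep Learning framework to keep the example self-contained.

```python
import numpy as np

def soft_dice_loss(pred, truth, eps=1e-7):
    """Negative soft Dice: -2 * sum(s_i * r_i) / (sum(s_i) + sum(r_i)).

    pred  : continuous prediction map values s_i
    truth : ground-truth values r_i
    eps   : small constant (an assumption) that avoids division by zero
    """
    s = np.asarray(pred, dtype=float).ravel()
    r = np.asarray(truth, dtype=float).ravel()
    return -2.0 * np.sum(s * r) / (np.sum(s) + np.sum(r) + eps)

perfect = soft_dice_loss([1, 1, 0, 0], [1, 1, 0, 0])      # close to -1
disjoint = soft_dice_loss([0, 0, 1, 1], [1, 1, 0, 0])     # 0
partial = soft_dice_loss([0.5, 0.5, 0, 0], [1, 1, 0, 0])  # close to -2/3
```

Minimizing this quantity maximizes the overlap between prediction and ground truth, and it stays differentiable because the prediction is not thresholded.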


SegNet is a CNN architecture for semantic pixel-wise segmentation [badrinarayanan2017]. More specifically, it was designed for semantic segmentation of road and traffic scenes, wherein classes represent macro-objects, aiming at smooth segmentation results by preserving boundary information. This non-fully connected architecture, which allows for parameter-efficient implementations suitable for embedded systems, consists of an encoder-decoder network followed by a pixel-wise classification layer. Since our classification task involves only one class, the soft-max operation and Rectified Linear Unit (ReLU) activation function at the final layer were removed for stable training.

We implemented SegNet using PyTorch. During the training phase, we used Stochastic Gradient Descent (SGD) [bottou2010] with a learning rate of , momentum of , weight decay of , and batch size of . It was trained for epochs and the learning rate was multiplied by at the -th and -th epochs.


U-Net is a fully convolutional network capable of stable training with a reduced number of samples [ronneberger2015], combining pooling operators with up-sampling operations. The general architecture is an encoder-decoder with skip connections between mirrored layers in the encoder-decoder stacks. In this way, high-resolution features from the contracting path are combined with the up-sampled output for better localization. We utilized four scaling operations. U-Net achieved outstanding performance in biomedical benchmark problems [falk2019] and has also served as an inspiration for novel Deep Learning models for image segmentation.

U-Net was implemented using Keras on top of TensorFlow. We used SGD with a learning rate of , momentum of , weight decay of , and batch size of . Training was executed for epochs, multiplying the learning rate by at the -th and -th epochs.
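The step-wise learning rate decay used for SegNet and U-Net (multiplying the rate by a constant factor at fixed epochs) can be written framework-independently as below. The base rate, milestones, and factor shown are placeholders for illustration only, not the settings of this study; `step_lr` is a hypothetical helper, similar in spirit to PyTorch's `torch.optim.lr_scheduler.MultiStepLR`.

```python
def step_lr(base_lr, epoch, milestones, gamma):
    """Learning rate at `epoch` (0-based) when it is multiplied by `gamma` at each milestone."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays

# Placeholder values only: base rate 0.1, halved at epochs 3 and 6.
schedule = [step_lr(0.1, e, milestones=(3, 6), gamma=0.5) for e in range(8)]
print(schedule)
```

Keeping the schedule a pure function of the epoch makes it easy to reproduce the same decay in both the PyTorch and the Keras implementations.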


pix2pix is an image-to-image translation method coupled with conditional adversarial networks [isola2016]. As a generator, U-Net is used to translate the original image into the segmented one [ronneberger2015], preserving the highest level of abstraction. The generator and discriminator include and scaling operations, respectively.

We implemented pix2pix using PyTorch. Adam [kingma2014] was used as an optimizer with a learning rate of and for the discriminator and generator, respectively. The learning rate for the generator was multiplied by every epochs. It was trained for epochs with a batch size of .

3.2.3 Post-processing

Two simple morphological steps were applied on the obtained CG binary masks to smooth boundaries and avoid disconnected regions:

  • a hole filling algorithm on the segmented to remove possible holes in the predicted map;

  • a small area removal operation to delete connected components with area less than pixels, where denotes the number of the pixels contained in the WG segmentation. This criterion effectively adapts according to the different dimensions of the .
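The two morphological steps can be sketched with plain NumPy as follows. This is an illustration under assumptions: 4-connectivity is chosen arbitrarily, the area threshold is passed in as `min_area` because the fraction of the WG area is a placeholder here, and all function names are hypothetical. Libraries such as SciPy offer equivalent routines (`scipy.ndimage.binary_fill_holes`, `scipy.ndimage.label`).

```python
from collections import deque
import numpy as np

def label_components(mask):
    """Label 4-connected components of True pixels; returns (labels, count)."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # already reached from an earlier seed
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:  # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count

def fill_holes(mask):
    """Fill background regions that do not touch the image border."""
    mask = np.asarray(mask, dtype=bool)
    labels, _ = label_components(~mask)
    border = np.concatenate([labels[0], labels[-1], labels[:, 0], labels[:, -1]])
    outside = np.isin(labels, border[border > 0])
    return mask | (~mask & ~outside)

def remove_small_components(mask, min_area):
    """Drop connected components with fewer than min_area pixels."""
    labels, count = label_components(mask)
    areas = np.bincount(labels.ravel(), minlength=count + 1)
    keep = np.nonzero(areas >= min_area)[0]
    keep = keep[keep > 0]  # label 0 is the background
    return np.isin(labels, keep)

# Toy CG mask: a 5x5 region with a one-pixel hole plus an isolated speck.
cg = np.zeros((8, 8), dtype=bool)
cg[1:6, 1:6] = True
cg[3, 3] = False
cg[7, 7] = True
wg_area = 40  # hypothetical number of WG pixels
refined = remove_small_components(fill_holes(cg), min_area=0.1 * wg_area)
```

Expressing the threshold relative to the WG area, as the text does, lets the same rule apply to small and large glands alike.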

3.2.4 Evaluation

The accuracy of the achieved segmentation results was quantitatively evaluated with respect to the real measurement (i.e., the gold standard obtained manually by experienced radiologists) using the DSC:
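The DSC between a segmented region and its gold standard is conventionally defined as twice the size of their overlap divided by the sum of the two region sizes. A minimal sketch for binary masks is given below, reported as a percentage to match Table 1; the function name and the handling of two empty masks are assumptions.

```python
import numpy as np

def dice_similarity_coefficient(segmentation, gold_standard):
    """DSC (%) = 2 |A ∩ B| / (|A| + |B|) for two binary masks A and B."""
    a = np.asarray(segmentation, dtype=bool)
    b = np.asarray(gold_standard, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 100.0  # both masks empty: treated here as perfect agreement (a convention)
    return 200.0 * np.logical_and(a, b).sum() / total

identical = dice_similarity_coefficient([[1, 1], [0, 0]], [[1, 1], [0, 0]])
partial = dice_similarity_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
disjoint = dice_similarity_coefficient([[1, 0], [0, 0]], [[0, 0], [0, 1]])
```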


3.3 Influence of Pre-training

In medical imaging, training data are scarce, so it is difficult to make a CNN trained from scratch converge properly. Therefore, pre-training models on a different application and then fine-tuning them is common [tajbakhsh2016].

To evaluate cross-dataset generalization abilities via pre-training, we compared the performances of the three CNN-based architectures with/without pre-training on a similar application. We used a relatively large dataset of manually segmented examples from the Prostate MR Image Segmentation 2012 (PROMISE12) challenge [litjens2014]. Since this competition focuses only on WG segmentation without providing prostate zonal labeling, we pre-trained the architectures on WG segmentation. To adjust this dataset to our experimental setup, the images of this dataset were resized from to pixels and randomly cropped to pixels; because our task only focuses on slices with prostate, we also omitted initial/final slices without prostate, so the number of slices for each sample was fixed to .

4 Results

This section explains how the three CNN-based architectures segmented the prostate zones, evaluating their cross-dataset generalization ability.

Table 1 shows the 4-fold cross-validation results obtained in the different experimental conditions. When training and testing are both performed on the dataset , U-Net outperforms the other architectures on both CG and PZ segmentation; however, it experiences problems with testing on the dataset due to the limited number of training images in the dataset . In such a case, pix2pix generalizes better thanks to its internal generative model. When trained on the dataset alone, U-Net yields the most accurate results both in intra- and cross-dataset testing. This probably derives from the dataset 's relatively larger training data as well as U-Net’s good generalization ability when sufficient data are available. Moreover, SegNet reveals rather unstable results, especially when trained on a limited amount of data.


Architecture       Zone   Testing on Dataset       Testing on Dataset
                          Average    Std. Dev.     Average    Std. Dev.

Training on Dataset
SegNet (w/o PT)    CG     80.20      3.28          74.48      5.82
SegNet (w/o PT)    PZ     80.66      11.51         59.57      12.68
SegNet (w/ PT)     CG     83.38      3.22          72.75      2.80
SegNet (w/ PT)     PZ     87.39      3.90          66.20      5.64
U-Net (w/o PT)     CG     84.33      2.37          74.18      3.77
U-Net (w/o PT)     PZ     88.98      2.98          66.63      1.93
U-Net (w/ PT)      CG     86.88      1.60          70.11      5.31
U-Net (w/ PT)      PZ     90.38      3.38          58.89      7.06
pix2pix (w/o PT)   CG     82.35      2.09          76.61      2.17
pix2pix (w/o PT)   PZ     87.09      2.72          73.20      2.62
pix2pix (w/ PT)    CG     80.38      2.81          76.19      5.77
pix2pix (w/ PT)    PZ     83.53      5.65          73.73      2.40

Training on Dataset
SegNet (w/o PT)    CG     76.04      2.05          87.07      2.41
SegNet (w/o PT)    PZ     77.25      3.09          82.45      1.77
SegNet (w/ PT)     CG     77.99      2.15          87.75      2.83
SegNet (w/ PT)     PZ     76.51      2.70          82.26      2.09
U-Net (w/o PT)     CG     78.88      0.88          88.21      2.10
U-Net (w/o PT)     PZ     74.52      1.85          83.03      2.46
U-Net (w/ PT)      CG     79.82      1.11          88.66      2.28
U-Net (w/ PT)      PZ     74.56      5.12          82.48      2.47
pix2pix (w/o PT)   CG     77.90      0.73          86.95      2.93
pix2pix (w/o PT)   PZ     66.09      3.07          81.33      0.90
pix2pix (w/ PT)    CG     77.21      1.02          85.94      4.31
pix2pix (w/ PT)    PZ     67.39      5.04          80.07      0.84

Training on Mixed Dataset
SegNet (w/o PT)    CG     84.28      3.12          87.92      2.80
SegNet (w/o PT)    PZ     87.74      1.66          82.21      0.79
SegNet (w/ PT)     CG     86.08      1.92          87.78      2.75
SegNet (w/ PT)     PZ     89.53      3.28          82.39      1.50
U-Net (w/o PT)     CG     86.34      2.10          88.12      2.34
U-Net (w/o PT)     PZ     90.74      2.40          83.04      2.30
U-Net (w/ PT)      CG     85.82      1.98          87.42      1.89
U-Net (w/ PT)      PZ     91.44      2.15          82.17      2.11
pix2pix (w/o PT)   CG     83.07      3.39          86.39      3.16
pix2pix (w/o PT)   PZ     83.53      2.36          80.40      1.80
pix2pix (w/ PT)    CG     82.08      4.37          85.96      5.40
pix2pix (w/ PT)    PZ     83.04      4.20          80.60      1.49


Table 1: Prostate zonal segmentation results of the three CNN-based architectures in 4-fold cross-validation assessed by the DSC (presented as the average and standard deviation). The experimental results are calculated on the different setups of (i) training on either each individual dataset or both datasets and (ii) testing on both datasets. Numbers in bold indicate the highest DSC values for each prostate region (i.e., CG and PZ) among all architectures with/without pre-training (PT).

Finally, when trained on the mixed dataset, all three architectures—especially U-Net—achieve good results on both datasets without losing accuracy compared to training on the same dataset alone. Therefore, using mixed MRI datasets during training can considerably improve the performance in cross-dataset generalization towards other clinical applications. Comparing the CG and PZ segmentation, when tested on the dataset , the results on the PZ are generally more accurate, except when trained on the dataset alone; however, for the dataset , segmentations on the CG are generally more accurate.

When training on a single dataset only, fine-tuning after pre-training sometimes leads to slightly better results than training from scratch. However, when training on the mixed dataset, its influence is generally negligible or even negative. This modest impact is probably due to the insufficient size of the pre-training dataset.

For a visual assessment, two examples (one for each dataset) are shown in Fig. 13. Relying on the gold standards in Figs. 13(d) and 13(h), it can be seen that U-Net generally achieves more accurate results compared with SegNet and pix2pix. This finding confirms the trend revealed by the DSC values in Table 1.

Figure 13: Examples of prostate zonal segmentation in pre-training/fine-tuning. The first row concerns testing on dataset , trained on: (a) dataset ; (b) dataset ; (c) mixed dataset. The second row concerns testing on dataset , trained on: (e) dataset ; (f) dataset ; (g) mixed dataset. The segmentation results are represented with magenta, cyan, and yellow solid contours for SegNet, U-Net, and pix2pix, respectively. The dashed green line denotes the boundary. The last column (sub-figures (d) and (h)) shows the gold standard for and with red and blue lines, respectively. The images are zoomed with a factor.

5 Discussion and Conclusions

Our preliminary results show that CNN-based architectures can segment prostate zones on two different MRI datasets to some extent, leading to valuable clinical insights; CNNs suffer when training and testing are performed on different MRI datasets acquired by different devices and protocols, but this can be mitigated by training the CNNs on multiple datasets, even without pre-training. Generally, considering different experimental training and testing conditions, U-Net outperforms SegNet and pix2pix thanks to its good generalization ability. Furthermore, this study suggests that significant performance improvement via fine-tuning may require a remarkably large dataset for pre-training.

As future developments, we plan to improve the results by refining the predicted binary masks for better smoothness and continuity, avoiding disconnected segments; we should also enhance the output delineations by considering the three-dimensional spatial information among slices. Furthermore, relying on the encouraging cross-dataset capability of U-Net, it is worthwhile to devise and test new solutions aiming at improving the performance of the standard U-Net architecture [falk2019]. Finally, for better cross-dataset generalization, additional prostate zonal datasets and domain adaptation using transfer learning with Generative Adversarial Networks (GANs) [goodfellow2014, han2018] and Variational Auto-Encoders (VAEs) [kingma2013] could be useful.


This work was partially supported by the Graduate Program for Social ICT Global Creative Leaders of The University of Tokyo by JSPS.