USE-Net: incorporating Squeeze-and-Excitation blocks into U-Net for prostate zonal segmentation of multi-institutional MRI datasets

04/17/2019 · by Leonardo Rundo, et al.

Prostate cancer is the most common malignant tumor in men, but prostate Magnetic Resonance Imaging (MRI) analysis remains challenging. Besides whole prostate gland segmentation, the capability to differentiate between the blurry boundary of the Central Gland (CG) and Peripheral Zone (PZ) can lead to differential diagnosis, since tumor frequency and severity differ in these regions. To tackle the prostate zonal segmentation task, we propose a novel Convolutional Neural Network (CNN), called USE-Net, which incorporates Squeeze-and-Excitation (SE) blocks into U-Net. Specifically, the SE blocks are added after every Encoder (Enc USE-Net) or Encoder-Decoder block (Enc-Dec USE-Net). This study evaluates the generalization ability of CNN-based architectures on three T2-weighted MRI datasets, each consisting of a different number of patients and heterogeneous image characteristics, collected by different institutions. The following mixed scheme is used for training/testing: (i) training on either each individual dataset or multiple prostate MRI datasets and (ii) testing on all three datasets with all possible training/testing combinations. USE-Net is compared against three state-of-the-art CNN-based architectures (i.e., U-Net, pix2pix, and Mixed-Scale Dense Network), along with a semi-automatic continuous max-flow model. The results show that training on the union of the datasets generally outperforms training on each dataset separately, allowing for both intra-/cross-dataset generalization. Enc USE-Net shows good overall generalization under any training condition, while Enc-Dec USE-Net remarkably outperforms the other methods when trained on all datasets. These findings reveal that the SE blocks' adaptive feature recalibration provides excellent cross-dataset generalization when testing is performed on samples of the datasets used during training.






1 Introduction

According to the American Cancer Society, in 2019 Prostate Cancer (PCa) is expected to be the most common malignant tumor, with the second highest mortality, among American males siegel2019. Given a clinical context, several imaging modalities can be used for PCa diagnosis, such as Transrectal Ultrasound (TRUS), Computed Tomography (CT), and Magnetic Resonance Imaging (MRI). For an in-depth investigation, structural T1-weighted (T1w) and T2-weighted (T2w) MRI sequences can be combined with the functional information from Dynamic Contrast Enhanced MRI (DCE-MRI), Diffusion Weighted Imaging (DWI), and Magnetic Resonance Spectroscopic Imaging (MRSI) lemaitre2015. Recent advancements in MRI scanners, especially those related to magnetic field strengths higher than T, did not decrease the effect of magnetic susceptibility artifacts on prostate MR images, even though the shift from T to T theoretically leads to a doubled Signal-to-Noise Ratio (SNR) rouviere2006. However, T MRI scanners made it possible to obtain high-quality images with less invasive procedures compared with T, thanks to a pelvic coil that reduces prostate gland compression/deformation kim2008; heijmink2007.

Therefore, MRI plays a decisive role in PCa diagnosis and disease monitoring (even in an advanced status padhani2017), revealing the internal prostatic anatomy, prostatic margins, and PCa extent villeirs2007. According to the zonal compartment system proposed by McNeal, the prostate Whole Gland (WG) can be partitioned into the Central Gland (CG) and Peripheral Zone (PZ) selman2011. In prostate imaging, T2w MRI serves as the principal sequence scheenen2015, thanks to its high resolution, which enables differentiation of the hyper-intense PZ and hypo-intense CG in young male subjects hoeks2011.

Besides manual detection/delineation of the WG and PCa on MR images, distinguishing between the CG and PZ is clinically essential, since the frequency and severity of tumors differ in these regions choi2007; niaf2012. As a matter of fact, the PZ harbors - of PCa and represents a target for prostate biopsy haffner2009. Furthermore, the PZ volume ratio (i.e., the PZ volume divided by the WG volume) can be considered for PCa diagnostic refinement chang2017, while the CG volume ratio can help monitor prostate hyperplasia kirby2002. Therefore, according to the Prostate Imaging-Reporting and Data System version 2 (PI-RADS™ v2) weinreb2016, radiologists must perform a zonal partitioning before assessing the suspicion of PCa on multi-parametric MRI. However, an improved PCa diagnosis requires a reliable and automatic zonal segmentation method, since manual delineation is time-consuming and operator-dependent rundo2017Inf; muller2015. Moreover, in clinical practice, the generalization ability among multi-institutional prostate MRI datasets is essential due to large anatomical inter-subject variability and the lack of a standardized pixel intensity representation for MRI (unlike CT-based radiodensity measurements expressed in Hounsfield units) klein2008. Hence, we aim at automatically segmenting the prostate zones on three multi-institutional T2w MRI datasets to evaluate the generalization ability of Convolutional Neural Network (CNN)-based architectures. This task is challenging because images from multi-institutional datasets are characterized by different contrasts, visual consistencies, and heterogeneous characteristics vanOpbroek2015.

In this work, we propose a novel CNN, called USE-Net, which incorporates Squeeze-and-Excitation (SE) blocks hu2017 into U-Net after every Encoder (Enc USE-Net) or Encoder-Decoder block (Enc-Dec USE-Net). The rationale behind the design of USE-Net is to exploit adaptive channel-wise feature recalibration to boost the generalization performance. The proposed USE-Net is conceived to outperform the state-of-the-art CNN-based architectures for segmentation in multi-institutional studies, whilst the SE blocks hu2017 were originally designed to boost performance only for classification and object detection via feature recalibration, by capturing single-dataset characteristics. Unlike the original SE blocks placed in InceptionNet szegedy2016 and ResNet he2016 architectures, we introduced them into U-Net after the encoders and decoders to boost the segmentation performance with increased generalization ability, thanks to the representation of channel-wise relationships in multi-institutional clinical scenarios involving multiple heterogeneous MRI datasets. This study adopted a mixed scheme for cross- and intra-dataset generalization: (i) training on either each individual dataset or multiple datasets, and (ii) testing on all three datasets with all possible training/testing combinations. To the best of our knowledge, this is the first CNN-based prostate zonal segmentation on T2w MRI alone. By relying on both spatial overlap- and distance-based metrics, we compared USE-Net against three CNN-based architectures: U-Net, pix2pix, and Mixed-Scale Dense Network (MS-D Net) pelt2017, along with a semi-automatic continuous max-flow model qiu2014.


Our main contributions are:

  • Prostate zonal segmentation: our novel Enc-Dec USE-Net achieves accurate CG and PZ segmentation results on T2w MR images, remarkably outperforming the other competitor methods when trained on all datasets used for testing in multi-institutional scenarios.

  • Cross-dataset generalization: this first cross-dataset study, investigating all possible training/testing conditions among three different medical imaging datasets, shows that training on the union of multiple datasets generally outperforms training on each dataset during testing, realizing both intra-/cross-dataset generalization—thus, we may train CNNs by feeding samples from multiple different datasets for improving the performance.

  • Deep Learning for medical imaging: this research reveals that SE blocks provide excellent intra-dataset generalization in multi-institutional scenarios, when testing is performed on samples from the datasets used during training. Therefore, adaptive mechanisms (e.g., feature recalibration in CNNs) may be a valuable solution in medical imaging applications involving multi-institutional settings.

The manuscript is structured as follows. Section 2 outlines the background of prostate MRI zonal segmentation, especially related work on CNNs. Section 3 describes the analyzed multi-institutional MRI datasets, the proposed USE-Net architectures, the investigated state-of-the-art CNN- and max-flow-based segmentation approaches, as well as the employed evaluation metrics; the experimental results are presented and discussed in Section 4. Finally, conclusive remarks and future directions of this work are given in Section 5.

2 Related Work

Due to the crucial role of MR image analysis in PCa diagnosis and staging lemaitre2015, researchers have paid specific attention to automatic WG detection/segmentation. Classic methods mainly leveraged atlases klein2008; martin2008 or statistical shape priors martin2010: atlas-based approaches realized accurate segmentation when new prostate instances resemble the atlas, relying on a non-rigid registration algorithm martin2010; toth2013. Unsupervised clustering techniques allowed for segmentation without manual labeling of large-scale MRI datasets rundo2017Inf; rundo2018SIST. In the latest years, Deep Learning techniques litjens2017 have achieved accurate prostate segmentation results by using deep feature learning combined with shape models guo2017 or location-prior maps sun2017. Moreover, CNNs were used with patch-based ensemble learning jia2018 or dense prediction schemes milletari2016. In addition, end-to-end deep neural networks achieved outstanding results in automated PCa detection in multi-parametric MRI yang2017; wang2018.

Unlike WG segmentation and PCa detection, less attention has been paid to CG and PZ segmentation, despite its clinical importance in PCa diagnosis niaf2012. In this context, classic Computer Vision techniques have been mainly exploited on T2w MRI. For instance, early studies combined classifiers with statistical shape models allen2006 or deformable models yin2012; Toth et al. toth2013 employed active appearance models with multiple level sets for simultaneous zonal segmentation; Qiu et al. qiu2014 used a continuous max-flow model—the dual formulation of convex relaxed optimization with region consistency constraints yuan2010; in contrast, Makni et al. makni2011 fused and processed 3D T2w, DWI, and contrast-enhanced T1w MR images by means of an evidential C-means algorithm masson2008. As the first CNN-based method, Clark et al. clark2017 detected DWI MR images containing the prostate relying on the Visual Geometry Group (VGG) net simonyan2015, and then sequentially segmented the WG and CG using U-Net ronneberger2015.

Regarding the most recent computational methods in medical image segmentation, along with traditional Pattern Recognition techniques rundo2018next, significant advances have been proposed in CNN-based architectures. For instance, to overcome the limitations related to accurate image annotations, DeepCut rajchl2017 relies on weak bounding box labeling rundo2017NC. This method aims at learning features for a CNN-based classifier from bounding box annotations. Among the architectures devised for biomedical image segmentation havaei2017; kamnitas2017, U-Net ronneberger2015 proved to be a noticeably successful solution, thanks to the combination of a contracting (i.e., encoding) path, for coarse-grained context detection, and a symmetric expanding (i.e., decoding) path, for fine-grained localization. This fully convolutional network is capable of stable training with reduced samples. The authors of V-Net milletari2016 extended U-Net for volumetric medical image segmentation, also introducing a different loss function based on the Dice Similarity Coefficient (DSC). Schlemper et al. schlemper2019 presented an Attention Gate (AG) model for medical imaging, which aims at focusing on target structures or organs. AGs were introduced into the standard U-Net, so defining Attention U-Net, which achieved high performance in multi-class image segmentation without relying on multi-stage cascaded CNNs. Recently, MS-D Net pelt2017 was shown to yield better segmentation results on biomedical images than U-Net ronneberger2015 and SegNet badrinarayanan2017, by creating dense connections among features at different scales obtained by means of dilated convolutions. By so doing, features at different scales can be contextually extracted using fewer parameters than full CNNs. Finally, image-to-image translation approaches—e.g., pix2pix isola2016, which leverages conditional adversarial neural networks—were also exploited for image segmentation.

However, no method in the literature has so far coped with the generalization ability among multi-institutional MRI datasets, making clinical applicability difficult albadawy2018. In a previous work rundoWIRN2018, we compared existing CNN-based architectures—namely, SegNet badrinarayanan2017, U-Net ronneberger2015, and pix2pix isola2016—on two multi-institutional MRI datasets. According to our results, U-Net generally achieves the most accurate performance. Here, we thoroughly verify the intra-/cross-dataset generalization on three datasets from three different institutions, also proposing a novel architecture based on U-Net ronneberger2015 that incorporates SE blocks hu2017. To the best of our knowledge, this is the first study on CNN-based prostate zonal segmentation on T2w MRI alone.

3 Materials and Methods

This section first describes the analyzed multi-institutional MRI datasets collected by different institutions. Afterwards, we explain the proposed USE-Net, the other investigated CNN-based architectures, as well as a state-of-the-art prostate zonal segmentation method based on a continuous max-flow model qiu2014. Finally, the used spatial overlap- and distance-based evaluation metrics are reported.

3.1 Multi-institutional MRI Datasets

We segment the CG and PZ from the WG on three completely different multi-parametric prostate MRI datasets, namely:

  • dataset ( patients/ MR slices with prostate), acquired with a whole body Philips Achieva 3T MRI scanner at the Cannizzaro Hospital (Catania, Italy) rundo2017Inf. MRI parameters: matrix size pixels; slice thickness mm; inter-slice spacing mm; pixel spacing mm; number of slices per image series (including slices without prostate) . Average patient age: years;

  • Initiative for Collaborative Computer Vision Benchmarking (I2CVB) dataset ( patients/ MR slices with prostate), acquired with a whole body Siemens TIM 3T MRI scanner at the Hospital Center Regional University of Dijon-Bourgogne (Dijon, France) lemaitre2015. MRI parameters: matrix size pixels; slice thickness mm; inter-slice spacing mm; pixel spacing mm; number of slices per image series . Average patient age: years;

  • National Cancer Institute – International Symposium on Biomedical Imaging (NCI-ISBI) 2013 Automated Segmentation of Prostate Structures Challenge dataset ( patients/ MR slices with prostate) via The Cancer Imaging Archive (TCIA) prior2017, acquired with a whole body Siemens TIM 3T MRI scanner at Radboud University Medical Center (Nijmegen, The Netherlands) TCIA. The prostate structures were manually delineated by five experts. MRI parameters: matrix size pixels; slice thickness mm; inter-slice spacing mm; pixel spacing mm; number of slices per image series ranging from to . Average patient age: years.

All the analyzed MR images are encoded in the -bit Digital Imaging and Communications in Medicine (DICOM) format. It is worth noting that even MR images from the same dataset have intra-dataset variations (such as the matrix size, slice thickness, and number of slices). Furthermore, inter-rater variability for the CG and PZ annotations exists, as different physicians delineated them. For clinical feasibility hoeks2011, we analyzed only axial T2w MR slices—the most commonly used sequence for prostate zonal segmentation—among the available sequences. In our multi-centric study, we conducted the following seven experiments resulting from all possible training/testing conditions:

Figure 4: Examples of input prostate T2w MR axial slices in their original image ratio: (a) dataset ; (b) dataset ; (c) dataset . The CG and PZ are highlighted with red and blue transparent regions, respectively. Alpha blending with .
  • Individual dataset , , : training and testing on dataset (, , respectively) alone in -fold cross-validation, and testing also on whole datasets and ( and , and , respectively) separately for each round;

  • Mixed dataset , , : training and testing on both datasets and ( and , and , respectively) in -fold cross-validation, and testing also on whole dataset (, , respectively) separately for each round;

  • Mixed dataset : training and testing on whole datasets , , and in -fold cross-validation.

For clinical applications, such multi-centric research is valuable for analyzing CNNs’ generalization ability among different MRI acquisition settings, e.g., different devices and functioning parameters. In our study, for instance, both intra-/cross-scanner evaluations can be carried out, because dataset ’s scanner is different from those of datasets and . Fig. 4 shows an example image for each analyzed dataset. In the context of generalization among different datasets, Yan et al. yan2018 evaluated the average vessel segmentation performance on three retinal fundus image datasets under the three-dataset training condition, while assessing the cross-dataset performance pair-wise on two datasets under the remaining one-dataset training condition. Yang et al. yang2018 proposed an alternative approach using adversarial appearance rendering to relieve the burden of re-training for Ultrasound imaging datasets. Differently, we thoroughly evaluate all possible training/testing conditions (for a total of configurations) on each dataset to confirm the intra- and cross-dataset generalization ability by incrementally injecting samples from the other datasets at hand.

With regard to the -fold cross-validation, we partitioned the datasets , , and into folds by using the following patient indices: , , , , , , , , and , , , , respectively. Finally, the results from the different cross-validation rounds were averaged to obtain a final descriptive value. These patient indices represent a permutation of the randomly arranged original patient ordering to portray a randomized partition scheme. This allowed us to guarantee a fixed partitioning among the different training/testing conditions with a general notation valid for all datasets, regardless of the number of patients in each dataset.

Cross-validation strategies aim at estimating the generalization ability of a given model; the hold-out method fixedly partitions the dataset into training/test sets, training the model on the first partition alone and testing it only on the unseen test set. Unlike leave-one-out cross-validation, which has high variance and low bias, -fold cross-validation is a natural way to improve the hold-out method: the dataset is divided into mutually exclusive folds of approximately equal size diri2008. The statistical validity increases, with less variance and less dependency on the initial dataset partition, by averaging the results over all the cross-validation rounds. Consequently, -fold cross-validation is the most common choice for reliable generalization results, minimizing the bias associated with the random sampling of the training/test sets diri2008. However, this statistical practice is computationally expensive due to the times-repeated training from scratch gandhi2010. Moreover, the results could underestimate the actual performance, allowing for conservative analyses kohavi1995; thus, we chose -fold cross-validation for reliable and fair training/testing phases, according to the number of patients in each dataset, calculating the evaluation metrics on a statistically significant test set (i.e., of each prostate MRI dataset).
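As an illustration of the fixed-fold partitioning described above (a sketch with hypothetical patient indices and fold count, not the paper's actual partition), a simple round-robin scheme over an already-permuted patient ordering is:

```python
def make_folds(patient_ids, k):
    """Round-robin split of a fixed permutation of patient IDs into k folds."""
    return [patient_ids[i::k] for i in range(k)]

def cross_validation_rounds(patient_ids, k):
    """Yield (train_ids, test_ids) for each of the k cross-validation rounds."""
    folds = make_folds(patient_ids, k)
    for i in range(k):
        test = folds[i]
        # Training set: all patients not in the current test fold
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

Fixing the permutation once guarantees identical fold membership across all training/testing conditions, regardless of the per-dataset patient count.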

3.2 Prostate Zonal Segmentation on Multi-institutional MRI Datasets

This work adopts a selective delineation approach to focus on the internal prostatic anatomy: the CG and PZ, denoted by R_CG and R_PZ, respectively. Let the entire image and the WG region be R_I and R_WG, respectively; the following relationships can be defined:

R_I = R_WG ∪ R_B,    R_WG = R_CG ∪ R_PZ,

where R_B represents the background pixels. Relying on villeirs2007; qiu2014, R_PZ was obtained by subtracting R_CG from R_WG, meeting the constraints:

R_CG ∪ R_PZ = R_WG    and    R_CG ∩ R_PZ = ∅.

Figure 5: Scheme of the proposed USE-Net architecture: Enc USE-Net has only (red-contoured) SE blocks after every encoder, whilst Enc-Dec USE-Net has SE blocks integrated after every encoder/decoder (represented with red/blue contours, respectively).

3.2.1 USE-Net: Incorporating SE Blocks into U-Net

We propose to introduce SE blocks hu2017 following every Encoder (Enc USE-Net) or Encoder-Decoder (Enc-Dec USE-Net) of U-Net ronneberger2015, as shown in Fig. 5. As pointed out before, U-Net allows for a multi-resolution decomposition/composition technique suzuki2006, by combining encoders/decoders with skip connections between them  yao2018; in our implementation, encoders and decoders consist of four pooling operators that capture the context and up-sampling operators that conduct precise localization, respectively.

We introduce SE blocks to enhance image segmentation, expecting an increased representational power from modeling the channel-wise dependencies of convolutional features hu2017. These blocks were originally envisioned for image classification using adaptive feature recalibration to boost informative features and suppress the weak ones at minimal computational burden.

Enc USE-Net and Enc-Dec USE-Net are investigated to evaluate the effect of strengthened feature recalibration. Since the template of the SE blocks is generic, they can be exploited at any depth of any architecture. Considering that SE blocks should be placed after output feature maps for feature recalibration, we have three possible places to integrate them for U-Net, namely: (i) after encoders; (ii) after decoders; (iii) after a classifier. SE blocks are more powerful in the encoding path than in the decoding path and more powerful in the decoding path than after a classifier, as they affect lower-level features in the U-Net architecture and thus increase the overall performance significantly; consequently, instead of placing only a single SE block after the first encoder/decoder, we place SE blocks after each encoder/decoder for both coarse-grained context detection in the earlier layers and fine-grained localization in the deeper layers for the best segmentation performance.

The SE blocks can be formally described as follows. Let U = [u_1, u_2, ..., u_C] be an input feature map, where u_c is a single channel of size H × W. Through the spatial dimensions H × W, a global average pooling layer generates channel-wise statistics z, whose c-th element is given by:

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j).

To limit the model complexity and boost generalization, two fully-connected layers and the Rectified Linear Unit (ReLU) nair2010 function transform z, followed by a sigmoid activation function:

s = σ(W_2 δ(W_1 z)),

where δ denotes the ReLU function, σ the sigmoid activation, W_1 and W_2 are the weights of the two fully-connected layers, and r is the reduction ratio controlling the capacity and computational cost of the SE blocks. Hu et al. hu2017 showed that the SE blocks can overfit to the channel inter-dependencies of the training set despite a lower number of weights with respect to the original architecture; they found the best compromise of r, which guarantees the lowest overall error (in terms of top- and top- errors) with ResNet-50 he2016 for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2017 classification competition ILSVRC15. Therefore, we also selected the same reduction ratio for USE-Net. In order to obtain an adaptive recalibration that ignores less important channels and emphasizes important ones (allowing for non-mutual exclusivity among multiple channels, differently from one-hot encoding), u_c is rescaled into x̃_c by applying Eq. (5):

x̃_c = s_c · u_c,

where · represents the channel-wise multiplication between the feature map u_c and the scalar s_c.
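The squeeze, excitation, and rescaling steps above can be sketched in NumPy (a minimal sketch: the weight matrices W1 and W2 stand in for learned parameters, biases are omitted, and a channel-last layout is assumed):

```python
import numpy as np

def se_block(U, W1, W2):
    """Squeeze-and-Excitation recalibration of a feature map U of shape (H, W, C).

    W1: (C, C // r) weights of the dimensionality-reduction FC layer.
    W2: (C // r, C) weights of the dimensionality-expansion FC layer.
    """
    # Squeeze: global average pooling over the spatial dimensions -> z with C entries
    z = U.mean(axis=(0, 1))
    # Excitation: FC -> ReLU -> FC -> sigmoid, yielding channel weights s in (0, 1)
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0.0) @ W2)))
    # Scale: channel-wise multiplication x_c = s_c * u_c
    return U * s[None, None, :]
```

Because the sigmoid outputs are not mutually exclusive, several channels can be emphasized at once, unlike a softmax/one-hot gating.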

3.2.2 Pre-processing

To fit the image resolution of dataset , we either center-cropped or zero-padded the images of datasets and to resize them to pixels. Afterwards, all images in the three datasets were masked using the corresponding prostate binary masks to omit the background and focus only on extracting the CG and PZ from the WG. This operation can be performed either by an automated method rundo2017Inf or by a previously provided manual WG segmentation lemaitre2015. As a simple form of data augmentation, we randomly cropped the input images from to pixels and horizontally flipped them.
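A minimal sketch of the center-crop/zero-pad resizing step (the target size is illustrative, not the datasets' actual resolution):

```python
import numpy as np

def center_crop_or_pad(img, target):
    """Center-crop (if larger) or zero-pad (if smaller) a 2D slice to target x target."""
    out = np.zeros((target, target), dtype=img.dtype)

    def spans(src_len):
        # Source/destination slices that center-align one axis.
        if src_len >= target:
            off = (src_len - target) // 2
            return slice(off, off + target), slice(0, target)
        off = (target - src_len) // 2
        return slice(0, src_len), slice(off, off + src_len)

    sy, dy = spans(img.shape[0])
    sx, dx = spans(img.shape[1])
    out[dy, dx] = img[sy, sx]
    return out
```

The same routine can be applied to the binary WG masks so that images and masks stay aligned after resizing.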

3.2.3 Post-processing

Two efficient morphological operations were applied to the obtained binary masks to smooth boundaries and deal with disconnected regions:

  • a hole filling algorithm on the segmented regions to remove possible holes in a predicted map;

  • a small area removal operation dealing with connected components smaller than pixels, where denotes the number of pixels contained in the WG segmentation. This adaptive criterion takes into account the different sizes of the WG (ranging from the apical to the basal prostate slices).
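The two operations can be sketched with SciPy (the area-threshold fraction below is a placeholder, not the paper's value):

```python
import numpy as np
from scipy import ndimage

def postprocess(mask, wg_mask, frac=0.05):
    """Fill holes, then drop connected components smaller than an adaptive
    threshold proportional to the WG area on that slice."""
    filled = ndimage.binary_fill_holes(mask)
    labels, n_components = ndimage.label(filled)
    min_area = frac * wg_mask.sum()          # adaptive small-area threshold
    cleaned = np.zeros_like(filled)
    for lab in range(1, n_components + 1):
        component = labels == lab
        if component.sum() >= min_area:
            cleaned |= component
    return cleaned
```

Making the threshold relative to the WG area keeps the criterion meaningful on both the small apical/basal slices and the larger mid-gland slices.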

3.2.4 Comparison against the State-of-the-Art Methods

We compare USE-Net against three supervised CNN-based architectures (i.e., U-Net, pix2pix, and Mixed-Scale Dense Network) and the unsupervised continuous max-flow model qiu2014. All the investigated CNN-based architectures were trained using the DSC loss function (i.e., a continuous version of the DSC milletari2016) computed over the N pixels to classify:

DSC_loss = (2 Σ_{i=1}^{N} p_i g_i) / (Σ_{i=1}^{N} p_i + Σ_{i=1}^{N} g_i),

where p_i and g_i refer to the continuous values in [0, 1] of the prediction map and the Boolean ground truth annotated by experienced radiologists at the i-th pixel, respectively. This loss function was designed by Milletari et al. milletari2016 to deal with the imbalance of the foreground labels in medical image segmentation tasks.
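A NumPy sketch of this continuous Dice computation (the exact smoothing term and sign convention used by the authors may differ; here 1 − DSC is returned so the value can be minimized):

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-7):
    """Continuous Dice loss over a soft prediction map and a Boolean ground truth.

    pred: values in [0, 1]; gt: {0, 1}. Returns 1 - DSC.
    """
    intersection = np.sum(pred * gt)
    dsc = 2.0 * intersection / (np.sum(pred) + np.sum(gt) + eps)
    return 1.0 - dsc
```

Unlike pixel-wise cross-entropy, this loss is normalized by the region sizes, so a small foreground cannot be ignored in favor of the dominant background class.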

USE-Net and U-Net

Using four scaling operations, U-Net and USE-Net were implemented on Keras with a TensorFlow backend. We used the Stochastic Gradient Descent (SGD) method bottou2010 with a learning rate of , momentum of , weight decay of , and batch size of . Training was executed for epochs, multiplying the learning rate by at the -th and -th epochs.


pix2pix

This image-to-image translation method with conditional adversarial networks was used to translate the original image into the segmented one. The generator and discriminator (both U-Nets in our implementation) include eight and five scaling operations, respectively. We developed pix2pix on PyTorch. Adam kingma2014 was used as the optimizer, with a learning rate of for the generator—which was multiplied by every epochs—and of for the discriminator. Training was executed for epochs with a batch size of .

MS-D Net

This dilated convolution-based method, characterized by densely connected feature maps, is designed to capture features at various image scales pelt2017. It was implemented on PyTorch with a depth of and width of . We used Adam kingma2014 with a learning rate of and trained it for epochs with a batch size of .

Continuous Max-flow Model

This model qiu2014 exploits duality-based convex relaxed optimization yuan2010 to achieve better numerical stability (i.e., convergence) than classic graph cut-based methods freedman2005. This semi-automatic approach simultaneously segments both the CG and PZ under the constraints given in Eq. (2), relying on user intervention. The initialization procedure consists of two closed surfaces defined by a thin-plate spline interpolating control points interactively selected by the user (considering both the axial and sagittal views). These 3D partitions estimate the intensity probability density functions associated with the three sub-regions of background, CG, and PZ. This allows for defining the region appearance models for global optimization-based multi-region segmentation.

Since the supervised CNN-based architectures rely on the gold standard for zonal segmentation, we apply the continuous max-flow method to the CG for single-region segmentation, for a fair comparison. Moreover, in our tests, a very accurate slice-by-slice initialization is provided by eroding the gold standard CG with a circular structuring element (radius pixels).

The continuous max-flow model qiu2014 was implemented in MATLAB® R2017a (The MathWorks, Natick, MA, USA).

3.3 Evaluation Metrics

We evaluate the segmentation methods by comparing the segmented MR images () to the corresponding gold standard manual segmentation () using spatial overlap- and distance-based metrics taha2015; fenster2005; zhang2001. Those metrics are calculated using a slice-wise comparison and then averaged per patient; thus, each single result regarding a patient represents an aggregate value.

Overlap-based metrics

These metrics quantify the spatially-overlapping segmented Region of Interest (ROI). Let true positives be , false negatives be , false positives be , and true negatives be . In what follows, we denote the cardinality of the pixels belonging to a region as .

  • The Dice similarity coefficient zou2004 is the most used measure in medical image segmentation to compare the overlap of two regions:

    DSC = 2|TP| / (2|TP| + |FN| + |FP|);

  • Sensitivity measures the correct detection ratio of true positives:

    SEN = |TP| / (|TP| + |FN|);

  • Specificity measures the correct detection ratio of true negatives:

    SPC = |TN| / (|TN| + |FP|).

    However, this formulation is ineffective when data are unbalanced (i.e., the ROI is much smaller than the whole image). Consequently, we use the following definition:

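Assuming binary masks, the overlap metrics above can be computed as follows (a sketch using the standard definitions; the authors' rebalanced specificity variant is not reproduced here):

```python
import numpy as np

def overlap_metrics(seg, gt):
    """DSC, sensitivity, and specificity from binary segmentation/ground-truth masks."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    tp = np.sum(seg & gt)    # true positives
    fp = np.sum(seg & ~gt)   # false positives
    fn = np.sum(~seg & gt)   # false negatives
    tn = np.sum(~seg & ~gt)  # true negatives
    dsc = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return dsc, sensitivity, specificity
```

In a slice-wise protocol these values would be computed per slice and then averaged per patient, as described above.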
Distance-based metrics

As precise boundary tracing plays an important role in clinical practice, overlap-based metrics have limitations in evaluating segmented images. In order to measure the distance between the two ROI boundaries, distance-based metrics can be considered. Let the manual contour consist in a set of vertices A = {a_1, ..., a_n} and the automatically-generated contour consist in a set of vertices B = {b_1, ..., b_m}. We calculate the absolute distance between an arbitrary element a_i ∈ A and all the vertices in B as follows:

    d(a_i, B) = min_{j ∈ {1, ..., m}} ||a_i − b_j||.

  • The average absolute distance measures the average difference between the ROI boundaries of A and B:

    AvgD(A, B) = (1/n) Σ_{i=1}^{n} d(a_i, B);

  • The maximum absolute distance represents the maximum difference between the ROI boundaries of A and B:

    MaxD(A, B) = max_{i ∈ {1, ..., n}} d(a_i, B).
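Given the two vertex sets, the average and maximum absolute distances can be sketched as:

```python
import numpy as np

def boundary_distances(A, B):
    """Average and maximum absolute distance from contour A to contour B.

    A: (n, 2) array of manual-contour vertices; B: (m, 2) array of
    automatically-generated-contour vertices.
    """
    # d(a_i, B) = min over j of ||a_i - b_j||, for every vertex of A
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min(axis=1)
    return d.mean(), d.max()
```

Note that this directed distance is asymmetric: swapping A and B can yield different values, which is why the manual contour is fixed as the reference here.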


4 Experimental Results

This section shows how the CNN-based architectures and the continuous max-flow model segmented the prostate zones, through the evaluation of their cross-dataset generalization ability. Aiming at showing the performance boost achieved by integrating the SE blocks into U-Net, we performed a fair comparison against the state-of-the-art architectures under the same training/testing conditions. In particular, due to the lack of annotated MR images for prostate zonal segmentation, we used three different datasets by composing a multi-institutional dataset. This allowed us to show the SE blocks’ cross-dataset adaptive feature recalibration effect, better capturing each dataset’s peculiar characteristics. Therefore, we exploited all possible training/testing conditions involving the three analyzed datasets (for a total of configurations) on each dataset to overcome the limitation from the small sample size, confirming the intra- and cross-dataset generalization ability of the CNN-based architectures.

Table 1 shows the -fold cross-validation results, as assessed by the DSC metric, obtained under the different training/testing conditions (the values of the other metrics are given in Supplementary Material, Tables S1-S4). For a visual and comprehensive comparison, the Kiviat diagrams (also known as radar or cobweb charts) kolence1973; diri2008 for each CNN-based architecture are also displayed in Fig. 9. Here, we can observe the impact of leaving dataset out of the training set and, at the same time, using it as the test set: the corresponding spokes III, VI, and XII generally show lower performance, probably due to the peculiar image characteristics of dataset (comprising the highest number of patients) that are not learned during the training phase on datasets . In general, Enc USE-Net performs similarly to U-Net, which stably yields satisfactory results. More interestingly, Enc USE-Net obtains considerably better results when trained/tested on multiple datasets. Enc-Dec USE-Net (characterized by a higher number of SE blocks with respect to Enc USE-Net) consistently and remarkably outperforms the other methods on both CG and PZ segmentation when trained on all the investigated datasets, also performing well when trained and tested on the same datasets.


[Table 1 layout: one block of rows per training condition (each individual dataset, each pair of datasets, and all three datasets), with columns reporting the DSC on each of the three test datasets. Within each block, the compared methods are U-Net, pix2pix, MS-D Net, Enc USE-Net, and Enc-Dec USE-Net; the final row reports the unsupervised continuous max-flow model by Qiu et al. qiu2014, which requires no training. The numerical DSC values were lost in extraction and are not reproduced here.]


Table 1: Prostate zonal segmentation results of the CNN-based architectures and the unsupervised continuous max-flow model (proposed by Qiu et al. qiu2014) in cross-validation, assessed by the DSC (presented as mean value ± standard deviation). The supervised experimental results are calculated under the seven different conditions described in Section 3.1. Numbers in bold indicate the best DSC values (the higher the better) for each prostate region (i.e., CG and PZ) among all architectures.
Figure 9: Kiviat diagrams showing the DSC values achieved by each method under different conditions. CG and PZ results are denoted by blue and cyan colors, respectively. Each variable represents a “training-set test-set” condition:
(a) one-dataset training: spokes I-IX cover each single training dataset tested on each of the three datasets;
(b) two-dataset training: spokes X-XVIII cover each pair of training datasets tested on each of the three datasets;
(c) three-dataset training: spokes XIX-XXI cover training on all three datasets tested on each dataset.
Figure 16: Segmentation results obtained by the six investigated methods (under the three-dataset training condition) on two different images from each of the three datasets (a)-(c). Automatic segmentations (solid lines) are compared against the corresponding gold standards (dashed red lines); the remaining region’s segmentation can be derived from the other two (dashed green line) according to the constraints in Eq. (2).

We executed Friedman’s test to quantitatively investigate any statistical performance differences among the tested approaches, computing separate statistics for the CG and PZ both for the three-dataset condition and for all training/testing combinations. Since the resulting p-values allowed us to reject the null hypothesis, we performed the Bonferroni-Dunn’s post hoc test for both the three-dataset condition and all training/testing combinations demvsar2006. To visualize the achieved results, example images segmented by each method are compared in Fig. 16 under the three-dataset training condition. The critical difference diagram (Fig. 19), based on the Bonferroni-Dunn’s post hoc test, also confirms this trend, considering the DSC values for every round of the cross-validation.
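Friedman's test ranks the methods within each cross-validation round and checks whether the rank sums differ more than chance would allow. A minimal sketch of the statistic (the function name is ours; ties receive average ranks, and higher DSC means a better, i.e. lower, rank):

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic.

    scores[i][j] is the DSC of method j in cross-validation round i.
    Ranks are assigned within each round (rank 1 = highest DSC);
    tied scores get the average of the ranks they span.
    """
    n = len(scores)          # number of rounds (blocks)
    k = len(scores[0])       # number of methods (treatments)
    rank_sums = [0.0] * k
    for round_scores in scores:
        order = sorted(range(k), key=lambda j: -round_scores[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # extend j over the group of methods tied with position i
            while j + 1 < k and round_scores[order[j + 1]] == round_scores[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1   # average rank for the tied group
            for m in range(i, j + 1):
                ranks[order[m]] = avg_rank
            i = j + 1
        for m in range(k):
            rank_sums[m] += ranks[m]
    # chi_F^2 = 12 / (n k (k+1)) * sum(R_j^2) - 3 n (k+1)
    return (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
```

A large statistic (compared against the chi-square distribution with k-1 degrees of freedom) rejects the null hypothesis that all methods perform equally, which is the precondition for running the Bonferroni-Dunn post hoc test.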

However, as shown in Fig. 22, Enc-Dec USE-Net exhibits weaker cross-dataset generalization when trained and tested on different datasets, achieving slightly lower average performance than Enc USE-Net (considering all training/testing combinations). This implies that the SE blocks’ adaptive feature recalibration, which boosts informative features and suppresses weak ones, provides excellent intra-dataset generalization when testing is performed on multiple datasets used during training (i.e., when training samples from every testing dataset are fed to the model).
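The recalibration mechanism discussed above follows the squeeze-excitation-scale scheme of hu2017: global average pooling produces a channel descriptor, two fully connected layers with a reduction ratio produce per-channel gates in (0, 1), and each feature map is rescaled by its gate. A minimal numpy sketch (the function and weight names are ours; a trained SE block would learn `w1` and `w2`, and real implementations also include biases):

```python
import numpy as np

def se_recalibrate(feature_maps, w1, w2):
    """Squeeze-and-Excitation channel recalibration (sketch).

    feature_maps: array of shape (C, H, W)
    w1: (C//r, C) weights of the dimensionality-reduction FC layer
    w2: (C, C//r) weights of the dimensionality-expansion FC layer
    """
    # Squeeze: global average pooling over the spatial dimensions
    z = feature_maps.mean(axis=(1, 2))            # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid
    s = np.maximum(w1 @ z, 0.0)                   # shape (C//r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))       # shape (C,), in (0, 1)
    # Scale: reweight each channel by its learned gate
    return feature_maps * gates[:, None, None]
```

Channels whose gate approaches 1 are passed through almost unchanged, while channels with near-zero gates are suppressed; this is the per-dataset feature reweighting that the cross-dataset experiments probe.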

On the contrary, pix2pix achieves good generalization when trained and tested on different datasets, especially under mixed-dataset training conditions, thanks to its internal generative model. MS-D Net generally works better in single-dataset scenarios with a limited amount of training samples, in line with pelt2017. The unsupervised continuous max-flow model achieves results comparable to the supervised ones only when the latter are trained and tested on different datasets. However, this semi-automatic approach is outperformed by the supervised methods trained and tested on the same datasets, as it underestimates the segmented region.

The results also reveal that training on multi-institutional datasets generally outperforms training on each individual dataset when testing on any dataset/zone, achieving both intra- and cross-dataset generalization. For instance, training on two datasets generally outperforms training on a single one of them when testing on all three datasets, without losing accuracy.

Figure 19: Critical Difference (CD) diagram comparing the DSC values achieved by all the investigated CNN-based architectures using the Bonferroni-Dunn’s post hoc test demvsar2006 for the three-dataset training condition. Bold lines indicate groups of methods whose performance difference was not statistically significant.
Figure 22: Critical Difference (CD) diagram comparing the DSC values achieved by all the investigated CNN-based architectures using the Bonferroni-Dunn’s post hoc test demvsar2006, considering all training/testing combinations. Bold lines indicate groups of methods whose performance difference was not statistically significant.
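In a CD diagram, two methods are connected by a bold line when their average ranks differ by less than the critical difference, which for this family of rank-based post hoc tests (following demvsar2006) is computed as shown in this sketch (the function name is ours; the critical value `q_alpha` for the Bonferroni-Dunn test is read from statistical tables):

```python
import math

def critical_difference(q_alpha, k, n):
    """Critical difference for average ranks (Demsar, 2006):
    two methods differ significantly if their average ranks
    differ by more than CD = q_alpha * sqrt(k (k+1) / (6 n)).

    q_alpha: tabulated critical value for the chosen confidence level
    k: number of compared methods
    n: number of measurements (e.g. cross-validation rounds)
    """
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
```

Note that the CD shrinks as the number of measurements n grows, which is why pooling every round of the cross-validation sharpens the comparison.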

Therefore, training schemes with mixed MRI datasets can achieve reliable and excellent performance, potentially useful for other clinical applications. Comparing CG and PZ segmentation, the results on the CG are generally more accurate, except when training and testing on the dataset acquired with a different scanner; this could be due to intra- and cross-scanner generalization, since that dataset’s scanner differs from those of the other two datasets.

The trend characterizing the best DSC performance, especially under the three-dataset training/testing conditions, is reflected by both the SEN and SPC values (Tables S1 and S2). As shown in Tables S3 and S4, the achieved spatial distance-based indices are consistent with the overlap-based metrics. Hence, Enc-Dec USE-Net also obtained high performance in terms of the distance between the automated and the manual boundaries.

Considering more permutations in the random partitioning and running multiple cross-validation instances could increase the robustness of the results by aggregating the outcomes of the multiple executions. However, with particular reference to the three-dataset training/testing condition, where the feature recalibration can effectively capture the dataset characteristics with the most available samples, the Bonferroni-Dunn’s post hoc test showed significant differences in the multiple comparisons among the competing architectures (Fig. 19). On the contrary, no statistically significant difference was detected when considering all training/testing conditions (Fig. 22). The achieved results suggest that cross-validation with a single random permutation is methodologically sound. In addition, we can state that the patterns arising from the cross-validation experiments are neither due to chance nor biased by the increased number of training samples, so USE-Net significantly outperforms the other techniques.
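The repeated cross-validation mentioned above amounts to generating several independent random partitions of the same patient pool. A minimal stdlib sketch (function names are ours; a real medical-imaging setup would split at the patient level to avoid leaking slices of the same patient across folds):

```python
import random

def k_fold_splits(n_samples, k, seed):
    """One random partition of n_samples indices into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # round-robin assignment keeps fold sizes balanced
    return [indices[i::k] for i in range(k)]

def repeated_k_fold(n_samples, k, n_repeats):
    """Multiple k-fold partitions, each with a different permutation."""
    return [k_fold_splits(n_samples, k, seed) for seed in range(n_repeats)]
```

Each repeat yields one full set of cross-validation rounds; aggregating the per-round DSC values across repeats would enlarge n in the statistical tests above.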

To conclude, the comparison of U-Net and the USE-Nets shows the individual contribution of the SE blocks under each of the dataset combinations. Interestingly, USE-Net is not always superior in the one- or two-dataset cases, but it consistently outperforms U-Net under three-dataset training/testing. This is likely because USE-Net has more parameters than U-Net and therefore generally requires more samples for proper tuning.

5 Discussion and Conclusions

The novel CNN architecture introduced in this work, Enc-Dec USE-Net, achieved accurate prostate zonal segmentation results when trained on the union of the available datasets in multi-institutional studies, significantly outperforming the competing CNN-based architectures thanks to the integration of SE blocks hu2017 into U-Net ronneberger2015. This also derives from the presented cross-dataset generalization approach among three prostate MRI datasets, collected by three different institutions, aiming at segmenting the CG and PZ; Enc-Dec USE-Net’s segmentation performance improved considerably when trained on multiple datasets with respect to the individual training conditions. Since training on the multi-institutional datasets analyzed in this work achieved good intra-/cross-dataset generalization, CNNs could be trained on multiple datasets acquired with different devices/protocols to obtain better outcomes in clinically feasible applications. Moreover, our research implies that state-of-the-art CNN architectures, properly combined with innovative concepts such as the feature recalibration provided by the SE blocks hu2017, allow for excellent intra-dataset generalization when tested on samples coming from the datasets used during training. Therefore, we may argue that multi-dataset training and SE blocks are not merely individual options but mutually indispensable strategies that draw out each other’s full potential. In conclusion, such adaptive mechanisms may be a valuable solution in medical imaging applications involving multi-institutional settings.

As future developments, we will refine the output images by considering the 3D spatial information among the prostate MR slices. Finally, for better cross-dataset generalization, we plan to use domain adaptation via transfer learning by maximizing the distribution similarity vanOpbroek2015. In this context, Generative Adversarial Networks (GANs) goodfellow2014; han2018 and Variational Auto-Encoders (VAEs) kingma2013 represent useful solutions.


This work was partially supported by the Graduate Program for Social ICT Global Creative Leaders of The University of Tokyo by JSPS.

We thank the Cannizzaro Hospital, Catania, Italy, for providing one of the imaging datasets analyzed in this study.