Prostate cancer is one of the major causes of cancer death for men in the Western world. Multiparametric MRI (mpMRI) is increasingly being used in the diagnosis of prostate cancer. Widespread implementation of mpMRI-based diagnosis could avoid unnecessary biopsies and enable population screening, allowing early diagnosis of the disease. However, widespread implementation is impeded by a lack of standardisation and by the substantial expertise required to read MRI scans. Automating the detection of prostate tumor lesions in mpMRI scans would be instrumental in overcoming these impediments.
Automatic segmentation of biomedical images has taken flight in recent years thanks to the development of deep learning techniques. Convolutional neural networks (CNNs) significantly outperform classical techniques based on hand-crafted features. The current state of the art is the U-net architecture [Unet], which has been successfully applied to segmentation problems for various organs and imaging modalities. The power of U-net is that it captures both contextual and local information, so that it can segment both full organs and the details of the organ's borders. The method has been successfully applied to segmentation of the entire prostate on mpMRI volumes [Milletari], and in this project it has been extended to segment the peripheral zone (PZ) and transition zone (TZ) of the prostate separately. This is challenging because the border between the zones is harder to identify, so enhancements of the method may be required. The medical significance of segmenting the zones is that the guidelines for mpMRI diagnosis of prostate cancer [PIRADS] differ depending on the zone of the prostate in which the tumor or lesion is located. Therefore any technique for automating the detection of prostate lesions should include a zonal segmentation step.
1.2 Related work
Recent reviews ([Shen2017], [PROMISE12]) have highlighted that deep learning, and convolutional networks in particular [cs231], has been applied to a wide range of medical image analysis tasks (segmentation, classification, detection, registration, image reconstruction, enhancement, etc.) across a wide range of anatomical sites (brain, heart, lung, abdomen, breast, prostate, musculature, etc.).
Full prostate segmentation using a V-net architecture has been reported by Milletari [Milletari], achieving a dice score of 0.87 on the PROMISE12 challenge dataset [PROMISE12]. A 3D U-net architecture [Cicek] was used for prostate segmentation on another set of MRI scans, where the dice score improved from 90.1% to 92.7% when scans of both axial and sagittal orientations were used instead of axial scans only [Meyer2018].
Segmentation of the prostate PZ and TZ separately has been reported using both ATLAS methods [Padgett2016] (dice score 0.83 for prostate and 0.57 for PZ) and a 2-stage 2D U-net method [Clark2017] (dice score 0.82 for prostate and 0.77 for TZ).
1.3 Optimising deep learning for medical images
Here we test ideas for improving the 3D U-net setup for the task of segmenting the prostate PZ and TZ in mpMRI prostate scans.
We hypothesize that certain characteristics of medical images can be exploited to improve deep learning strategies. One important characteristic is that medical images always have the same topology, given by the anatomy of the organ of interest and the surrounding tissues. We speculate that the network can improve segmentation predictions by taking these surroundings into account. This can be achieved by providing the network with a large input image volume that includes the surroundings and by annotating surrounding tissues.
Another characteristic is the dimensions of the 3D volumes. In the case of MRI scans, image volumes are high resolution in one plane (e.g. 0.5x0.5 mm in the axial plane), while in the third direction the slices are much thicker (e.g. 3.6 mm). We suggest that segmentation performance will improve if the anisotropy of the input dimensions is reflected in an anisotropic 3D U-net.
2.1 Data volumes and manual segmentations
The data set used in this project consists of fifty-three 3D T2-weighted MRI volumes of the prostate and surrounding tissues from a large prostate mpMRI Scientific Archive [detection2016].
In all volumes, the prostates were annotated by hand, distinguishing between the prostate PZ and TZ as shown in Figure 1. While manual annotation was done in the axial view, a sagittal view was consulted as well, especially for the apex and the base, which can be hard to identify in the axial view.
The original volumes, with voxel sizes of 0.5x0.5x3.6 mm or 0.3x0.3x3.6 mm, were resampled to 1x1x3.6 mm to fit the full image into GPU memory. The data set is imbalanced, with the PZ under-represented as shown in Figure 2.
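As a rough illustration of this resampling step, the sketch below computes the per-axis zoom factors and the resulting grid size when moving from the original voxel size to 1x1x3.6 mm. The function name and the example volume shape (384x384x19 voxels) are assumptions made for illustration; they are not taken from the data set.

```python
def resampled_shape(shape, spacing, target_spacing):
    """Return (zoom_factors, new_shape) for resampling a volume.

    shape          -- voxel counts per axis, e.g. (x, y, z)
    spacing        -- original voxel size in mm per axis
    target_spacing -- desired voxel size in mm per axis
    """
    # A zoom factor below 1 means the axis is downsampled.
    zoom = tuple(s / t for s, t in zip(spacing, target_spacing))
    new_shape = tuple(max(1, round(n * z)) for n, z in zip(shape, zoom))
    return zoom, new_shape


# Hypothetical 384 x 384 x 19 volume acquired at 0.5 x 0.5 x 3.6 mm.
zoom, new_shape = resampled_shape((384, 384, 19), (0.5, 0.5, 3.6), (1.0, 1.0, 3.6))
print(zoom)       # (0.5, 0.5, 1.0)
print(new_shape)  # (192, 192, 19)
```

In this example the in-plane downsampling reduces the number of voxels by a factor of four, while the slice direction is left untouched.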
In addition the bladder, rectum and femur bones were annotated as shown in Figure 3.
2.2 Network architecture
Our anisotropic network, aniso-3DUNET, follows the 3D U-net architecture by Çiçek [Cicek], with the difference that it starts and ends with two layers of two 2D convolutions and one 2D maxpooling each (see Figure 4). The final step is a softmax to either 3 labels (background, TZ and PZ) or 6 labels (background, TZ, PZ, bladder, rectum and femur bones), preceded by two steps for each voxel that map 64 to 64 and then 64 to 3 or 6 features (depending on the number of tissues that are segmented).
An alternative more isotropic model, iso-3DUNET, is evaluated, in which all the maxpoolings are 3D as in the original 3D U-net. In view of the low dimensionality in the vertical direction, every second convolution in each layer is a 2D convolution instead of a 3D convolution as in the original 3D U-net.
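The effect of the two pooling schemes on the effective voxel size can be sketched with a small calculation. This is only an illustration under stated assumptions: a pooling factor of 2 per pooled axis, three pooling steps in total, and a third pooling step that is 3D in the aniso-3DUNET. The function and variable names are hypothetical.

```python
def spacing_after_pooling(spacing, pool_sizes):
    """Track the effective voxel spacing (mm) after each max-pooling step."""
    history = [tuple(spacing)]
    for pool in pool_sizes:
        spacing = tuple(round(s * p, 3) for s, p in zip(spacing, pool))
        history.append(spacing)
    return history


input_spacing = (1.0, 1.0, 3.6)  # mm, voxel size after resampling

# Assumed pooling factors (not stated in the text): 2 per pooled axis.
aniso_pools = [(2, 2, 1), (2, 2, 1), (2, 2, 2)]  # two in-plane poolings, then 3D
iso_pools = [(2, 2, 2)] * 3                      # 3D pooling throughout

for name, pools in [("aniso", aniso_pools), ("iso", iso_pools)]:
    print(name, spacing_after_pooling(input_spacing, pools))
```

Under these assumptions the anisotropic scheme reaches a nearly isotropic 4x4x3.6 mm grid after two poolings, whereas the isotropic scheme keeps the original 3.6:1 ratio at every level.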
2.3 Training and testing
During training, data augmentation was applied to the images in the form of small translations, rotations, isotropic expansions and contractions, elastic deformations and left-right flips.
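A minimal sketch of two of these augmentations (left-right flip and small translation) is given below, using NumPy only. It is not the augmentation code used in this work: the axis convention (axis 0 is left-right), the maximum shift of 5 voxels, zero filling at the borders, and the function names are all assumptions. The point it illustrates is that the image and its label volume must receive exactly the same transformation.

```python
import numpy as np


def shift_zero_fill(volume, shift, axis):
    """Translate a volume by an integer number of voxels, filling with zeros."""
    result = np.zeros_like(volume)
    src = [slice(None)] * volume.ndim
    dst = [slice(None)] * volume.ndim
    if shift > 0:
        src[axis], dst[axis] = slice(None, -shift), slice(shift, None)
    elif shift < 0:
        src[axis], dst[axis] = slice(-shift, None), slice(None, shift)
    result[tuple(dst)] = volume[tuple(src)]
    return result


def augment(image, labels, rng, max_shift=5):
    """Apply the same random flip and in-plane translation to image and labels."""
    if rng.random() < 0.5:
        # Assumed convention: axis 0 is the left-right direction.
        image, labels = np.flip(image, axis=0), np.flip(labels, axis=0)
    for axis in (0, 1):
        shift = int(rng.integers(-max_shift, max_shift + 1))
        image = shift_zero_fill(image, shift, axis)
        labels = shift_zero_fill(labels, shift, axis)  # zero = background label
    return image, labels
```

Rotations, scaling and elastic deformations would follow the same pattern, with interpolation for the image and nearest-neighbour sampling for the labels.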
Due to the low number of annotated volumes available for training, validation scores were expected to vary depending on the validation split, so a 5-fold cross-validation was carried out to obtain a range of validation scores. Training was aimed at minimizing a multi-label cross-entropy loss, with weight factors that focus the training on the organs of interest and counter the data imbalance shown in Figure 2. Loss contributions from each voxel were given a weight linked to its ground truth label: background = 1, TZ = 2, PZ = 6, and other organs = 1. These weights reflect the direction of the imbalance but are lower than the actual imbalance ratios, because excessive weighting results in over-prediction of the zones. After each cross-validation run, predictions were generated for the validation volumes.
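The voxel weighting can be sketched as follows for the case with background, TZ and PZ labels, using the weights 1, 2 and 6 given above. This is a NumPy illustration rather than the training code: the label order, the normalisation (a plain mean over voxels) and the function name are assumptions.

```python
import numpy as np

# Per-label weights from the text, in the assumed order background, TZ, PZ.
CLASS_WEIGHTS = np.array([1.0, 2.0, 6.0])


def weighted_cross_entropy(probs, labels, class_weights=CLASS_WEIGHTS, eps=1e-7):
    """Mean voxel-wise cross entropy, each voxel weighted by its true label.

    probs  -- softmax output, shape (..., n_labels)
    labels -- integer ground truth labels, shape (...)
    """
    probs = np.clip(probs, eps, 1.0)
    # Probability that the network assigns to the true label of each voxel.
    p_true = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    weights = class_weights[labels]
    return float(np.mean(-weights * np.log(p_true)))


# Three voxels, one per label, each predicted with uniform probabilities.
probs = np.full((3, 3), 1.0 / 3.0)
labels = np.array([0, 1, 2])
print(weighted_cross_entropy(probs, labels))  # mean of (1, 2, 6) times ln(3)
```

With these weights, a misclassified PZ voxel contributes six times as much to the loss as an equally misclassified background voxel.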
After the 5-fold cross-validation runs, one more model training was carried out on all the training volumes and predictions were generated on a set of eight test volumes that had been kept separate.
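The split can be sketched as follows. With fifty-three volumes and eight held-out test volumes, forty-five volumes remain for cross-validation, which gives five folds of nine validation volumes each. The volume identifiers, the choice of which eight volumes form the test set, and the random seed are hypothetical.

```python
import random


def five_fold_splits(volume_ids, n_folds=5, seed=0):
    """Yield (train_ids, validation_ids) pairs for k-fold cross-validation."""
    ids = list(volume_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        yield train, validation


# Hypothetical identifiers; here the last eight volumes are the test set.
all_ids = [f"vol_{i:02d}" for i in range(53)]
test_ids = all_ids[-8:]
trainval_ids = all_ids[:-8]

for train, validation in five_fold_splits(trainval_ids):
    print(len(train), len(validation))  # 36 9 for every fold
```

The final model would then be trained on all forty-five `trainval_ids` and evaluated on `test_ids`.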
The models were trained using the Keras framework with a learning rate of 0.00001, Glorot uniform initialization, L2 kernel regularization, and the Adam optimizer. Each training run consisted of 300 epochs, a number limited by run time: the 6 training runs required to evaluate a scenario together take 5 days on a single GPU.
Two sets of experiments were run:
1. The aniso-3DUNET model shown in Figure 4 was trained on the same input volumes, first to segment only the prostate PZ and TZ (2-label case) and then to segment the prostate PZ and TZ, the bladder, rectum and femur bones (6-label case).
2. The aniso-3DUNET and the iso-3DUNET described in Section 2.2 were compared in terms of dice score (see Section 3.2).
The metric used for scoring predictions for each data input volume was the dice score:
$$\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$
where $P$ is the volume of predicted segmentation probabilities and $G$ is the volume of ground truth labels.
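A minimal NumPy sketch of the dice score for a single label is shown below. It uses the common definition in which the overlap is the voxel-wise sum of products of prediction and ground truth; the exact variant used in this work is not specified, and the function name and the small constant `eps` (which avoids division by zero for empty volumes) are assumptions.

```python
import numpy as np


def dice_score(pred, truth, eps=1e-7):
    """Dice score between a predicted probability volume and a binary ground truth."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    intersection = np.sum(pred * truth)  # soft overlap between the two volumes
    return float(2.0 * intersection / (np.sum(pred) + np.sum(truth) + eps))


# Two predicted voxels, one of which overlaps the single ground truth voxel.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))  # approximately 0.667
```

For multi-label output, the score is computed per label by passing the probability channel of that label and the corresponding binary ground truth mask.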
The training and validation scores are reported as a function of the number of epochs. The distribution and average of the dice scores are also reported for the predictions of each of the fifty-three input volumes after the validation runs or the test run (for whichever run an input volume was not part of the training set).
To find out how the network is learning, the layers of the model are visualized. There are several ways to do this [Chollet2017], and here feature maps are plotted in the form of activations for the second convolution in each layer for a single test input volume.
Here dice scores are reported for the 5-fold cross-validation training runs and a test run, for each of the scenarios that we are evaluating. The dice scores at the end of most runs are still increasing, so longer runs would likely achieve somewhat higher dice scores. The highest scores are obtained by segmenting only TZ and PZ with the aniso-3DUNET shown in Figure 4: dice scores of 0.85 for TZ and 0.60 for PZ. The PZ is segmented most reliably in the middle of the prostate. For a split of the prostates into base (top 1/5), middle (mid 3/5) and apex (bottom 1/5) sections, average dice scores of 0.46, 0.71 and 0.51, respectively, are obtained.
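The split into base, middle and apex sections can be sketched as below. This is an illustration only: it assumes that slices are stacked along axis 2, that the slice index increases from apex to base, and that the one-fifth boundaries are rounded to whole slices. The function name is hypothetical.

```python
import numpy as np


def prostate_sections(mask, axis=2):
    """Return slice ranges (start, stop) for apex, middle and base.

    mask -- boolean ground truth volume of the whole prostate (TZ plus PZ)
    """
    other_axes = tuple(a for a in range(mask.ndim) if a != axis)
    occupied = np.flatnonzero(mask.any(axis=other_axes))
    if occupied.size == 0:
        raise ValueError("mask contains no prostate voxels")
    first, stop = int(occupied[0]), int(occupied[-1]) + 1
    fifth = round((stop - first) / 5)
    # Assumed orientation: low slice index is the apex, high index is the base.
    return {
        "apex": (first, first + fifth),
        "middle": (first + fifth, stop - fifth),
        "base": (stop - fifth, stop),
    }


# Hypothetical prostate that occupies slices 2 to 11 of a 14-slice volume.
mask = np.zeros((4, 4, 14), dtype=bool)
mask[1:3, 1:3, 2:12] = True
print(prostate_sections(mask))
```

A per-section dice score is then obtained by restricting both the prediction and the ground truth to the slices of each range before scoring.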
3.1 Segmenting surrounding organs
The distribution and average dice score for all fifty-three input volumes are compared in Figure 6. Clearly the results contradict our hypothesis in section 1.3, that segmenting more organs would benefit the prostate segmentation. The case where the prostate zones plus surrounding organs are segmented predicts the PZ significantly worse than the case where only the prostate zones are segmented. There is virtually no difference in predicting the TZ.
The same can be seen in Figure 7, which shows that segmenting more organs also slows down the training and increases the range of scores in the cross-validation runs.
3.2 Anisotropic 3D U-net (aniso-3DUNET)
Figure 8 shows that the aniso-3DUNET achieves a higher dice score (dice=0.60) than the iso-3DUNET (dice=0.57). The 0.03 increase in dice score is marginal though, as it is of the same order of magnitude as the standard error shown in Figure 8. Figure 9 shows the variation between the cross-validation runs, which, as in Figure 8, shows a minor improvement.
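For reference, the standard error of an average dice score can be computed as sketched below. The scores in the example are hypothetical and are not the values behind Figure 8; the use of the sample standard deviation is also an assumption.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return the mean and the standard error of the mean for a list of scores."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical per-volume dice scores, for illustration only.
scores = [0.5, 0.6, 0.7]
mean, sem = mean_and_standard_error(scores)
print(round(mean, 3), round(sem, 3))
```

A difference between two average scores that is no larger than this standard error cannot be distinguished from variation between volumes.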
3.3 Feature map analysis
To understand how the networks learn and to further understand differences in performance, we plot the activations that are output by the second convolution in each layer. A slice map through the middle of the prostate is plotted for each of the features per layer, and the complete sets are shown in the Appendix. Figure 10 shows three of the features for each layer of the best performing model (aniso-3DUNET with 2-label segmentation), selected because they provide some insight into how the network performs the segmentation. The input image is fed into the network at the top left layer. The information flow follows the red and green arrows, which are the same as the ones in the network graph shown in Figure 4. The flow ends at the last layer at the top right, which feeds into the softmax layer for final segmentation.
Based on Figure 10 we try to deduce per layer what these features are coding for.
The three feature maps for the first model layer show that, as reported for other networks elsewhere [Chollet2017], this layer codes for local features such as edges and thick black shapes, in the left and middle feature maps respectively. An interesting feature map is the one on the right, in which thick black shapes have been greyed out, but not thin black lines.
The second layer shows feature maps similar to those of the first layer, at the lower resolution of this layer.
The feature maps in the third layer still show some similarity to the ones in the earlier layers, but are now more cartoon-like. The feature map on the left covers parts of the background areas in a white blur. The middle feature map develops the black shape feature maps of the previous layer further, with the outline of the prostate shape thickened. The feature map on the right shows both the bladder and the prostate TZ.
The fourth layer is the most downscaled layer, and its feature maps are very blurry white. Most maps look unstructured, but the three shown here do have some structure: the left feature map blurs the background areas white, the middle feature map blurs most of the image white except the rectum, and the feature map on the right blurs everything white except the centre of the bladder.
The resolution of the feature maps in layer 5 increases back to the resolution of layer 3. Using information from both layers 3 and 4, the left feature map in layer 5 develops the white blurring of the background further, while sharpening the edge of the prostate boundary. The middle feature map highlights the prostate TZ, while the feature map on the right shows a combination of a black PZ and a grey TZ as well as a grey background.
The sixth layer develops the feature maps of layer 5 further and sharpens the borders.
The seventh and final layer builds on the features in layers 5 and 6 and sharpens the borders once more. In addition to the two feature maps that segment the background and the prostate TZ (left and middle feature maps), feature maps appear that segment the prostate PZ separately (feature map on the right).
The final, seventh, layer is preparing well for the segmentation into prostate TZ, PZ, and background in the softmax layer. To find out why the performance of the model in segmenting PZ is better for 2-label segmentation than for 6-label segmentation, we plot in Figure 11 all the feature maps for layer 7 for both cases. The difference observed is that for the 6-label segmentation case, feature maps that were coding for a separate prostate PZ in the 2-label case are replaced by feature maps that are coding for combinations of the other organs (bladder, rectum, and femur bones).
4 Conclusions and discussion
We have evaluated a 3D U-net neural network [Cicek] for automatic segmentation of both prostate TZ and PZ in MRI scans and have found that it could achieve segmentations with dice scores of 0.85 for TZ and 0.60 for PZ. This is a few percent higher than a recently published ATLAS method [Padgett2016] (0.83 for prostate, 0.57 for PZ) and a 2-stage 2D U-net method [Clark2017] (0.82 for prostate and 0.77 for TZ).
We explored two ideas for improving the 3D U-net performance that make use of characteristics specific to MRI images. One characteristic is the anisotropy of the MRI volumes, for which we tested two architectures: an anisotropic network architecture, aniso-3DUNET, that reflects the anisotropy in the MRI volumes (see Figure 4) and a more isotropic architecture, iso-3DUNET (see Figure 5). The aniso-3DUNET performs slightly better (0.60 versus 0.57), but we consider the difference marginal because it is of the same order of magnitude as the standard error in the average dice scores for all volumes.
Another characteristic is that the images always have a fixed topology, given by the anatomy of the organ of interest and the surrounding tissues. We have tested whether training the network to segment additional tissues (bladder, rectum and femur bones) would improve the segmentation, but against our expectations this significantly decreases the PZ dice score, by 0.07.
An explanation for the decreased segmentation performance of the network when segmenting more tissues was found by plotting the activations of the feature maps for each of the 3D U-net layers for one of the test MRI volumes. This clearly showed that when only the prostate PZ and TZ are segmented, the last layer of the network learns features dedicated to each label: background, TZ and PZ. When more tissues are segmented, the features dedicated to PZ are replaced by combination features of other tissues. An increase in the number of filters might allow for dedicated feature maps, but at the cost of significant increase of GPU memory requirements. And the added value compared to the 2-label case may be limited, considering that Figure 10 shows that the network is already taking surrounding shapes into account in earlier layers without explicitly segmenting them in the final layer.
Visualizing the feature map activations also provides insights into how the 3D U-net manages to segment the images. In the first 3 layers different shapes are detected utilizing distinctive local features. These local features are edges and intensities, but also thick versus thin shapes. The coarsest fourth layer combines feature maps that have the area of interest in common and that blur out the background when superimposed. The prostate PZ is too thin to be coded in the fourth layer, and the distinction between TZ and PZ is detected in the third layer. In the last layers 5, 6 and 7, these areas of interest are sharpened by overlaying the coarse feature maps with higher resolution feature maps from earlier layers that detect the edges of the segmented zones. Finally, feature maps are combined to form feature maps that separately detect each label at the same resolution as the original image.