Cylindrical Transform: 3D Semantic Segmentation of Kidneys With Limited Annotated Images

09/24/2018, by Hojjat Salehinejad, et al.

In this paper, we propose a novel technique for sampling sequential images using a cylindrical transform in a cylindrical coordinate system for kidney semantic segmentation in abdominal computed tomography (CT). The images generated by the cylindrical transform augment a limited annotated set of images in three dimensions. This approach enables us to train contemporary classification deep convolutional neural networks (DCNNs) instead of fully convolutional networks (FCNs) for semantic segmentation. Typical semantic segmentation models segment a sequential set of images (e.g., CT or video) by segmenting each image independently. The proposed method, in contrast, considers not only the spatial dependency in the x-y plane, but also the sequential spatial dependency along the z-axis. The results show that classification DCNNs trained on cylindrical transformed images can achieve higher segmentation performance than FCNs when using a limited number of annotated images.




1 Introduction

Object recognition refers to the task of detecting and labeling all objects in a given image; a bounding box is usually used to localize the object(s). In object detection, bounding boxes are used to localize a specific object in the image, and the rest of the image is assigned to the non-object class. Semantic segmentation refers to the classification of each pixel in an image to generate an image mask consisting of a number of labeled regions. Object recognition approaches are generally easier to implement and computationally less expensive than semantic segmentation methods. However, accuracy and pixel-level segmentation can be more important than computational complexity in certain applications such as medical image processing.

Deep fully convolutional networks (FCNs) [1] are popular models for semantic segmentation [2] that use a convolutional decoder with a large annotated training dataset. Since there are limited numbers of annotated images and data samples available for every possible class in real-world problems [3], augmentation methods such as image rotation and synthesis [4] can help increase the diversity of training datasets, and therefore prevent the models from overfitting [2], [5], [6]. In [5], we have proposed a radial transform method in the polar coordinate system as a novel augmentation method for classification problems. This technique is well suited for highly imbalanced datasets, or datasets with a limited number of labeled images.

In this paper, we propose a cylindrical transform in the cylindrical coordinate system as a technique to generate representations from limited annotated sequential images. The cylindrical transform method enables us to train contemporary classification deep convolutional neural networks (DCNNs) instead of FCNs for semantic segmentation. We applied the proposed method to registration-free segmentation of the left kidney, right kidney, and non-kidney data classes in abdominal computed tomography (CT) images by training the AlexNet [7] and GoogLeNet [8] DCNNs. We selected these architectures due to their simplicity of training and relatively high classification performance [9].

2 Proposed Method

In this section, we discuss the proposed cylindrical transform sampling method and the training and inference procedures of a DCNN using cylindrical transform generated images for semantic segmentation.

2.1 Sampling Using Cylindrical Transform in 3D Space

A cylindrical coordinate system is a generalization of a polar coordinate system to 3D space, created by superposing a height coordinate along the z-axis. The objects in a volume of images have spatial dependency not only on the x-y plane, but also along the z-axis. We define a volume X as a sequence of S images along the z-axis, where each image is of size N × M, as presented in Figure 1. We can randomly select a pixel from X in the Cartesian coordinate system, such as p = (x_p, y_p, z_p). This pixel can be mapped onto the cylindrical coordinate system as a pole. The cylindrical transform represents each pixel in the volume as a new image by up-sampling the pole and encoding the spatial information between the pole and the other pixels in the volume.

Figure 1: Sampling a sequence of images using cylindrical transform.
Read p = (x_p, y_p, z_p) // Selected pole in the image volume
Read X // Original image volume of S slices of size N × M
Read s // Sampling step along slices
Initialize C to zero // Cylindrical transform image
Z ← ∅ // Set of slices to sample
for k = -L to L do
       if 0 ≤ z_p + k·s < S then
              Z ← Z ∪ {z_p + k·s} // Set of slices to sample
       end if
end for
for each z_k in Z do
       for i = 1 to R do
              for j = 1 to T do
                     x ← round(x_p + ρ_i cos θ_j); y ← round(y_p + ρ_i sin θ_j)
                     if 0 ≤ x < M and 0 ≤ y < N then
                            C((k + L)·R + i, j) ← X(x, y, z_k)
                     end if
              end for
       end for
end for
Algorithm 1 Sampling via Cylindrical Transform

In the cylindrical coordinate system, a pixel on the plane z = z_k can be represented as (ρ, θ, z_k), where ρ is the radial coordinate from the pole along the z-axis and θ is the counter-clockwise angular coordinate. The angle is measured with respect to an axis drawn horizontally from the pole to the right, as illustrated in Figure 1. For a given volume of images X, we can select L slices above and L slices below the slice containing the pole, at a spacing of s slices, such that z_k = z_p + k·s for k ∈ {-L, …, L}. In the cylindrical coordinate system, we can generate sampling points with respect to the pole p such that

(ρ_i, θ_j, z_k), i ∈ {1, …, R}, j ∈ {1, …, T}, k ∈ {-L, …, L},

where R is the number of radial samples and T is the number of angular samples. By

x = round(x_p + ρ_i cos θ_j), y = round(y_p + ρ_i sin θ_j),

we project the pixels at Cartesian coordinates (x, y, z_k) from the original volume to generate an image with respect to the pole, such that 0 ≤ x < M, 0 ≤ y < N, and 0 ≤ z_k < S, where round(·) is the rounding function to the nearest integer. These conditions guarantee that each sampled point stays spatially within X. A pixel in the constructed image is then defined as C((k + L)·R + i, j) = X(x, y, z_k). The image C is the cylindrical transform image of X with respect to the pole p with sampling step s along the z-axis. Algorithm 1 shows the pseudocode of cylindrical transform sampling.
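As a concrete illustration, the sampling above can be sketched in Python with NumPy. The function name, default parameters, and the row-major layout of the output image (one block of radial rows per sampled slice, one column per angle) are our own assumptions for this sketch, not the authors' implementation:

```python
import numpy as np

def cylindrical_transform(volume, pole, n_radii=8, n_angles=16, n_slices=1, step=1):
    """Sample a (S, N, M) volume around a pole (z_p, y_p, x_p).

    Returns an image with one block of n_radii rows per sampled slice and
    one column per angle. Layout and defaults are illustrative assumptions.
    """
    S, N, M = volume.shape
    z_p, y_p, x_p = pole
    # Set of slices to sample: n_slices above and below the pole, spaced by `step`
    slices = [z_p + k * step for k in range(-n_slices, n_slices + 1)]
    slices = [z for z in slices if 0 <= z < S]
    out = np.zeros((len(slices) * n_radii, n_angles), dtype=volume.dtype)
    radii = np.arange(1, n_radii + 1)
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    for i, z in enumerate(slices):
        for j, rho in enumerate(radii):
            for k, theta in enumerate(thetas):
                # Project the cylindrical sample point back to Cartesian pixels
                x = int(round(x_p + rho * np.cos(theta)))
                y = int(round(y_p + rho * np.sin(theta)))
                if 0 <= x < M and 0 <= y < N:  # keep samples inside the volume
                    out[i * n_radii + j, k] = volume[z, y, x]
    return out
```

Poles near the volume boundary leave out-of-bounds samples at zero, which matches the zero-initialization of the output image in Algorithm 1.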

(a) Geometric shapes.
(b) Slice view of the geometric shapes.
(c) Cylindrical transform of randomly selected poles from (a), at several slice positions along the z-axis.
Figure 2: A volume of geometric shapes in a 3D Cartesian space. The pole is selected randomly inside each geometric shape. In a single slice, the sphere and cylinder appear as two identical circles, although they belong to two different objects in the volume. A slice at a different spatial location along the z-axis, however, reveals the difference between these objects. The cylindrical transform provides a representation of the objects that considers the spatial dependency on the x-y plane as well as along the z-axis. This distinction is not detectable by considering only a single slice on the x-y plane.

Figure 2 shows the advantage of using cylindrical transformed images over independent slices of a volume. A sphere and a cylinder look different in 3D space. However, these objects may look similar on the x-y plane, depending on the slice position along the z-axis. Cylindrical transformed images, in contrast, capture the spatial differences along the z-axis and, by combining them with spatial information on the x-y plane, represent a volume around an arbitrary pixel as a single image suitable for machine learning. This image contains information about spatial dependency on the x-y plane as well as along the z-axis.

2.2 Cylindrical Transform for Semantic Segmentation

Figure 3 shows samples of cylindrical transform generated images from contrast-enhanced abdominal CT. Figure 4 shows the procedure for training a DCNN with cylindrical transformed images. Given a sequence of images X as input, the cylindrical transform generates images for a number of randomly selected poles in X and stores them with their corresponding labels in a pool of images used to train a DCNN. The trained model can later be used for inference, where the cylindrical transform considers every pixel in the original volume as a pole and generates its corresponding cylindrical transform image. The generated images are then passed to the trained DCNN for classification and labeling of a mask template, which represents the predicted data class for each pixel in X.

(a) Left kidney.
(b) Non-kidney.
(c) Right kidney.
Figure 3: Samples of cylindrical transform generated images for left kidney, non-kidney, and right kidney. The images are rotated counter-clockwise for the sake of presentation.
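The per-pixel inference procedure described above can be sketched as follows; `transform` and `classify` are placeholder callables standing in for the cylindrical transform and the trained DCNN, and the function names are our own, not a published API:

```python
import numpy as np

def segment_volume(volume, transform, classify):
    """Label every pixel by classifying its cylindrical transform image.

    transform(volume, pole) -> 2-D image; classify(images) -> array of labels.
    Both callables are placeholders for the trained pipeline.
    """
    S, N, M = volume.shape
    mask = np.zeros((S, N, M), dtype=np.int64)
    for z in range(S):
        for y in range(N):
            for x in range(M):
                # Treat the current pixel as the pole and classify its transform image
                img = transform(volume, (z, y, x))
                mask[z, y, x] = classify(img[np.newaxis, ...])[0]
    return mask
```

In practice the transform images would be classified in mini-batches rather than one pixel at a time, but the per-pixel loop makes the correspondence between pixels and predicted mask labels explicit.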

3 Experiments

3.1 Data

With the approval of the research ethics board, 20 contrast-enhanced normal abdominal CT acquisitions from an equal number of male and female subjects between 25 and 50 years of age were collected [10]. Each acquisition had on average 18 axial slices containing kidneys. The left and right kidneys were outlined manually by trained personnel and stored as images. The boundary delineation was performed using a standard protocol for all kidneys. To avoid inter-rater variability in the dataset, the quality of segmentation was verified by two board-certified radiologists. A fixed sampling step along the z-axis and a fixed cylindrical transform image size were used for all experiments.

Figure 4: Training semantic segmentation using classification DCNNs with cylindrical transformed images. The steps are labeled.

3.2 Technical Details of Training

The FCN models were trained on original images with the setup outlined in [2]. For the experiments with cylindrical transformed images, 7 acquisitions, each containing on average 18 axial slices (totalling 126 slices), were used for training, with 1,000 randomly selected poles per label class per slice used to generate cylindrical transformed images. For all experiments, three acquisitions (on average 54 axial slices) were used for validation, and 10 acquisitions (on average 180 axial slices) were used for testing. The number of training iterations was set to 120. An Adam [11] optimizer with a sigmoid decay adaptive learning rate (LR) and a momentum term of 0.9 was used. The activation function before each max-pooling layer was a ReLU [12]. Regularization and early-stopping (storing network parameters and stopping at the maximum validation performance within a window of 5 iterations) were applied. The training datasets were shuffled in each training epoch. The performance results were collected after 10-fold cross-validation.
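The early-stopping rule above (stop once validation performance has not improved within a window of 5 iterations, keeping the parameters from the best iteration) can be sketched as a small helper; the function name and return convention are our assumptions:

```python
def should_stop(val_history, patience=5):
    """Return True when the best validation score is older than `patience` iterations.

    val_history: list of validation performance values, one per training iteration.
    """
    if len(val_history) <= patience:
        return False
    best = max(val_history)
    # Stop if no score in the most recent `patience` iterations matches the best
    return best not in val_history[-patience:]
```

During training, the network parameters are saved whenever a new best validation score appears, and those saved parameters are restored when the loop stops.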

3.3 Semantic Segmentation of Kidneys in Contrast-Enhanced Abdominal Computed Tomography

Using the definitions of true positives (TP), false positives (FP), and false negatives (FN), precision and recall measure the success of prediction in classification tasks [13]. The Dice similarity coefficient (DSC) is a well-known measure of the accuracy of segmentation methods [2]. By considering a volume as a set of pixels, for a segmented sequence of images A and its corresponding ground-truth B, the DSC is expressed as DSC = 2|A ∩ B| / (|A| + |B|), where |·| is the cardinality of a set. Since we apply the transform to each pixel of the volume, the DSC segmentation accuracy can be interpreted as the top-1 classification accuracy [14].
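Treating the predicted and ground-truth segmentations as sets of pixels, the DSC can be computed directly on binary masks; a minimal sketch:

```python
import numpy as np

def dice_coefficient(pred, truth):
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    # Two empty masks agree perfectly by convention
    return 2.0 * intersection / total if total else 1.0
```

For multi-class segmentation, the per-class DSC values in Table 1 correspond to applying this computation to the binary mask of each class in turn.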





Model                      MB   DSC per class                             Avg.
                                Left Kidney  Non-Kidney  Right Kidney
FCN-GoogLeNet              64   42.17%       43.73%      40.36%          42.08%
FCN-VGG-19                 64   34.52%       35.93%      36.26%          35.57%
FCN-VGG-19P                64   54.26%       58.93%      56.27%          56.48%
FCN-GoogLeNet (augmented)  64   60.62%       65.82%      62.69%          63.04%
FCN-VGG-19 (augmented)     64   62.74%       65.20%      63.36%          63.76%
FCN-VGG-19P (augmented)    64   66.83%       69.26%      67.77%          67.95%
CLT-AlexNet                4    93.68%       92.55%      94.42%          93.52%
CLT-GoogLeNet              4

Table 1: DSC values of classification DCNNs trained on cylindrical transform generated images and FCNs for semantic segmentation of kidneys in contrast-enhanced abdominal CTs. CLT: cylindrical transform; P: pre-trained on ImageNet; MB: size of mini-batch.

In [2], 16,000 original annotated images were used for training a VGG-16 FCN for semantic segmentation of kidneys. In our experiments, the focus was on using a limited number of annotated images and on considering the sequential spatial dependency between images along the z-axis. For the purpose of semantic segmentation, the FCNs require the entire volume of annotated original images (i.e., 126 images) as input for training and inference. However, the cylindrical transform method enables us to train contemporary classification networks for whole-image classification, without the need for an FCN to predict dense outputs for semantic segmentation.

The performance results of FCN-AlexNet [7], FCN-GoogLeNet [8], and FCN-VGG-19 [15] are presented in Table 1. The VGG-19 pre-trained on ImageNet [16] requires square-size input images. Since cylindrical transformed images are not square, we did not use pre-trained models with them. However, we used FCN-VGG-19 [15] trained on original images for the sake of comparison. The experiments were conducted in five schemes: 1) from scratch, end-to-end, in an FCN mode; 2) using weights pre-trained on ImageNet [16] (denoted with P in the tables), end-to-end, in an FCN mode; 3) from scratch, end-to-end, in an FCN mode with augmentation; 4) using weights pre-trained on ImageNet [16], end-to-end, in an FCN mode with augmentation; and 5) from scratch using cylindrical transformed images (denoted with CLT in the tables). The augmentation methods used with the FCNs include rotation (every 36 degrees, 10 variants), scaling (2 variants), shifting an image in the x-y direction (2 variants), and applying an intensity variation (2 variants), similar to [2].

Table 2 shows the precision and recall scores of the DCNNs evaluated in Table 1. The receiver operating characteristic (ROC) plots in Figure 5 show the area under the curve (AUC) of the AlexNet and GoogLeNet classification models trained using cylindrical transformed images. The overall performance results show that FCNs are challenging to train with a limited number of training images. These models achieved a lower DSC performance compared to the GoogLeNet trained with cylindrical transform generated images, which produced the highest DSC value.

Model          Measure    Left Kidney  Non-Kidney  Right Kidney  Avg.
CLT-AlexNet    precision  0.95         0.90        0.97          0.94
               recall     0.94         0.93        0.94          0.94
               f1-score   0.94         0.91        0.96          0.94
CLT-GoogLeNet  precision  0.99         0.98        0.99
               recall     0.98         0.98        0.99
               f1-score   0.98         0.98        0.99

Table 2: Precision, recall, and f1-score of the cylindrical transform (CLT) method for contrast-enhanced abdominal CT.
(a) AlexNet.
(b) GoogLeNet.
Figure 5: ROC curve and AUC of AlexNet and GoogLeNet on the test dataset for contrast CTs.

4 Conclusions

Most of the proposed methods for semantic segmentation of sequential images (i.e., a volume) segment each image of the sequence independently, without considering the sequential spatial dependency between the images. In addition, annotating sequential images is challenging and expensive, which is a drawback for supervised deep learning models, given their need for a massive number of training samples. In this paper, we investigated the semantic segmentation of sequential images in 3D space by proposing a sampling method in the cylindrical coordinate system. The proposed method can generate as many images as there are pixels in the volume, and therefore augments the training dataset. The generated images contain spatial samples from the x-y plane as well as the time (i.e., sequential) dimension along the z-axis. This method enables us to train contemporary classification convolutional neural networks instead of a fully convolutional network (FCN), which helps the network avoid overfitting and boosts its generalization performance.