Automatic Lumbar Spinal CT Image Segmentation with a Dual Densely Connected U-Net

10/21/2019 ∙ by He Tang, et al.

The clinical treatments of degenerative and developmental lumbar spinal stenosis (LSS) are different. Computed tomography (CT) is helpful in distinguishing degenerative from developmental LSS due to its advantage in imaging osseous and calcified tissues. However, the boundaries of the vertebral body, spinal canal and dural sac have low contrast and are hard to identify in a CT image, so the diagnosis depends heavily on the knowledge of expert surgeons and radiologists. In this paper, we develop an automatic lumbar spinal CT image segmentation method to assist LSS diagnosis. The main contributions of this paper are the following: 1) a new lumbar spinal CT image dataset is constructed that contains 2393 axial CT images collected from 279 patients, with ground-truth pixel-level segmentation labels; 2) a dual densely connected U-shaped neural network (DDU-Net) is used to segment the spinal canal, dural sac and vertebral body in an end-to-end manner; 3) DDU-Net is capable of segmenting tissues with large scale variation, inconspicuous edges (e.g., spinal canal) and extremely small size (e.g., dural sac); and 4) DDU-Net is practical, requiring no image preprocessing such as contrast enhancement, registration and denoising, and its running speed reaches 12 FPS. In the experiments, we achieve state-of-the-art performance on the lumbar spinal image segmentation task. We expect that the technique will increase both radiology workflow efficiency and the perceived value of radiology reports for referring clinicians and patients.


I Introduction

Lumbar spinal stenosis (LSS) is one of the most common diseases encountered in spinal surgery practice. Diagnosis of LSS is usually made under the guidance of medical imaging techniques such as magnetic resonance imaging (MRI) and computed tomography (CT). Most previous studies prefer MRI because it is safer and does not involve any radiation. However, the pathogeneses of degenerative LSS and developmental LSS differ [14]. Degeneration of the lumbar intervertebral disc, hypertrophy of the articular process and calcification of the ligamentum flavum are the main causes of degenerative LSS; the treatment for these patients is usually lumbar decompression. Developmental LSS is usually due to vertebral laminae osseous stenosis, and the corresponding treatment is usually laminectomy. Precisely identifying the vertebral body, spinal canal and dural sac is helpful in diagnosing the different types of LSS [17]. Surgeons usually use lumbar spinal CT images to distinguish between degenerative and developmental LSS because CT is better at imaging osseous and calcified tissues than MRI is [6].

However, the boundaries of the spinal canal and dural sac in CT images are not intuitive; segmentation of these two tissues depends heavily on expert surgeons and radiologists, which introduces uncertainty and risk. In this paper, we provide a sufficiently labeled lumbar spinal CT image dataset in which the areas of the spinal canal, dural sac and vertebral body are labeled at the pixel level. We hope this new dataset will promote the automatic diagnosis of LSS. We then propose a multi-scale densely connected neural network that can automatically segment the spinal canal, dural sac and vertebral body from a raw CT image. To the best of our knowledge, this is the first deep learning-based method to simultaneously segment the spinal canal, dural sac and vertebral body from CT images.

Recently, deep convolutional neural networks have been applied in medical image analysis because they provide abundant and discriminative image representations. Han et al. [10] introduced a lung CT imaging signs dataset and proposed a software tool for annotating abnormal regions. Yu et al. [26] presented a melanoma recognition method combining a deep learning method with a local descriptor encoding strategy. Nie et al. [20] used deep convolutional adversarial networks to synthesize medical images. Abbati et al. [18] proposed an automatic treatment decision-making plan for LSS. In practice, medical image collection and labeling are expensive and time-consuming, whereas training deep neural networks usually requires a massive number of training samples. In this paper, we perform data augmentation of the CT images to overcome this limitation. Moreover, we include several dense blocks [12] in the proposed dual densely connected U-shaped network (DDU-Net) to reduce the number of parameters and increase computational efficiency. These two measures alleviate the gradient vanishing problem when training a deep neural network with limited data and improve prediction accuracy as well. In the experiments, we find that some dim, small tissues (e.g., dural sac) are difficult to segment from the original CT image, and the scales of tissues in CT images vary widely. To handle these problems, the proposed DDU-Net contains two U-shaped sub-networks with different receptive field sizes, which allows DDU-Net to extract multi-scale features and to segment tissues of different sizes automatically and precisely.

In the experimental section, we test our method on the proposed new dataset and compare the performance with three state-of-the-art image segmentation methods, i.e., U-Net [21], FCN [19] and DeepLab [3]. Both visual comparison and quantitative comparison show that our method outperforms these state-of-the-art methods.

In summary, this paper makes the following contributions:

  1. A new challenging dataset is collected for further research and evaluation of spinal CT image segmentation;

  2. Unlike previous works that produce a binary segmentation, this is the first work to segment the spinal canal, dural sac and vertebral body from a spinal CT image simultaneously; we hope that this work will promote automatic diagnosis of lumbar spinal stenosis;

  3. The proposed DDU-Net segments spinal CT images in an automatic and end-to-end manner; all parameters are optimized simultaneously. DDU-Net is capable of segmenting tissues with large scale variation, inconspicuous edges (e.g., spinal canal) and extremely small size (e.g., dural sac);

  4. The proposed method is practical; it requires no image pre-processing such as image registration, denoising, or contrast enhancement. The proposed DDU-Net has only 54M parameters, and it outperforms state-of-the-art methods in both visual and quantitative comparisons, with the running speed reaching 12 FPS.

The rest of this paper is organized as follows. In Section II, we introduce previous work related to ours, covering medical image analysis and basic deep learning technology. In Section III, the new lumbar spinal CT image dataset is described. Section IV covers the methodology of this paper, where we introduce the data augmentation method and the architecture of DDU-Net and explain the details of network training. In Section V, visual, qualitative and quantitative comparisons are conducted between the proposed method and state-of-the-art methods. Finally, the conclusion is given in Section VI.

II Related works

Several state-of-the-art methods for spinal image segmentation have been developed over the past ten years. Some methods use traditional machine learning and image processing technology; for example, [4] and [5] developed an automatic method for spinal cord and spinal canal segmentation. Their method is based on multi-resolution propagation of tubular deformable models, coupled with an automatic intervertebral disk identification method.

With the remarkable performance of deep convolutional neural networks (DCNNs) in domains such as natural image classification [16], [23] and segmentation [19, 3], biomedical image segmentation has achieved a breakthrough through the use of U-shaped fully convolutional networks (FCNs). U-Net [21] is an end-to-end architecture for segmenting different semantic regions of images; owing to its skip connections, this method won the ISBI cell tracking challenge 2015 using only 30 training images, outperforming the second-best method by a large margin. Since then, deep convolutional networks have become popular in automatic biomedical image segmentation. Korez et al. [15] proposed an automatic model-based method to segment vertebral bodies from MR images with 3D CNNs. Abbati et al. [1] introduced MRI-based surgical planning for lumbar spinal stenosis, developing an automated algorithm to localize the stenosis causing the patient's symptoms from the MR image; before training the network, the authors manually cropped the original images to obtain the region of interest and trained the network with both labeled and unlabeled images, and the results demonstrated promising performance. Gros et al. [8] segmented both the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks (CNNs).

In contrast with the aforementioned works, in this paper, we introduce a new fully convolutional network to segment the spinal canal, dural sac and vertebral body in parallel. The proposed method is automatic and does not require any image pre-processing, and the performance surpasses that of state-of-the-art methods.

III A new dataset

As this is the first attempt to simultaneously segment the spinal canal, dural sac and vertebral body from CT images, to promote the study of this problem, we have built a new dataset with pixel-level labels. We collected 2393 axial lumbar spinal CT images from 279 patients.

We consider lumbar spinal image segmentation as a pixel-level multi-class classification task, where the input is CT images from different patients of different views and the expected output is a mask with 4 classes, i.e., spinal canal, dural sac, vertebral body and background. Since the spinal canal, dural sac and vertebral body are unique in each CT image, we treat this pixel-level multi-class classification task as a semantic segmentation problem. To ensure label consistency, we asked four radiologists to label four different semantics in all images using a custom designed interactive segmentation tool. We only kept the images that were given very similar labels by all four radiologists. Finally, the proposed dataset contains 1280 images with precise and consistent pixel-level labels. We randomly divided the dataset into three parts, i.e., 50% for training, 20% for validation and 30% for testing.

The first column of Fig. 1 shows 2 sample CT images from our dataset; each image shows a CT scan acquired from an individual patient. The second column shows the corresponding ground truth of each raw image. Each ground truth has the same size as the raw image; the red mask indicates the vertebral body region, the green mask indicates the spinal canal region, and the white mask indicates the dural sac region. These masks are unique within one ground truth. Raw images in our dataset exhibit obvious variations, e.g., in scale, rotation, brightness and noise. As shown in Fig. 1, the scale of the top-left image, Fig. 1(a), is smaller than that of the bottom-left image, Fig. 1(b); that is, the details of Fig. 1(a) are more abundant, while more tissue is visible in Fig. 1(b). Furthermore, Fig. 1(a) is noisier than Fig. 1(b): a vertical line runs through Fig. 1(a), and Fig. 1(b) is slightly rotated from the standard visual angle shown in Fig. 1(a). On the other hand, as shown in the enlarged view of Fig. 1(b), the boundaries of the spinal canal and dural sac have low contrast against the nearby regions, and the size of the dural sac is extremely small, making it difficult to identify in practical CT images. In summary, the variations and low contrast of the raw CT images make lumbar spinal CT image segmentation difficult, and this dataset is challenging. In this paper, we develop a robust method to segment the raw CT images in an end-to-end manner, without image denoising, contrast enhancement, registration, etc.

Fig. 1: Examples from the proposed dataset; from left to right, the columns show the raw CT images, the ground truth labels and our segmentations. The bottommost image shows an enlarged view of the spinal canal and dural sac regions in Fig. 1(b). The red regions denote the vertebral body, the green regions denote the spinal canal, the white regions inside the green regions denote the dural sac, and the black regions denote the background.

Sample imbalance of the dataset. The labels in our dataset are imbalanced; see Fig. 2 for the class distribution: background (black) 95.02%, vertebral body (red) 4.43%, spinal canal (green) 0.37%, and dural sac (white) 0.18%. We handle this problem with a weighted cross-entropy loss function; see Section IV-C for details.

Fig. 2: Class distribution in the proposed dataset. Black: background; red: vertebral body; green: spinal canal; white: dural sac. This class distribution shows heavy sample imbalance.

IV Method

In this section, we will introduce details regarding how we segment spinal images automatically and precisely with limited data and why this method works. First, we augment the image using several image processing approaches. This data augmentation alleviates overfitting when training, and we report the performance comparison with and without data augmentation in an ablation study. Second, we construct a dual densely connected U-shaped network (DDU-Net) to segment the spinal canal, dural sac and vertebral body in parallel. Finally, we introduce how to train this network in detail.

IV-A Data augmentation

For convolutional neural network training, we use the following data augmentation: rotation by a random angle in (0, 2π); horizontal flips of the original images; random crops of the images to the 400×400 network input size from the original image size; and additive Gaussian noise, whose standard deviation is drawn by a random function that produces a float value between 0 and 1. Each image is augmented 100 times by these methods, which alleviates the requirement for a large quantity of labeled data and allows us to train the convolutional neural network successfully.
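As an illustration, the following is a minimal sketch of such an augmentation routine, assuming NumPy/SciPy and a hypothetical `augment` helper operating on one CT slice and its label mask; the 400×400 crop size follows Table I, while the noise scale and the way randomness is drawn are assumptions rather than the authors' exact settings.

```python
import numpy as np
from scipy import ndimage

def augment(image, mask, crop_size=400, rng=None):
    """Return one randomly augmented (image, mask) pair; a hypothetical helper."""
    rng = np.random.default_rng() if rng is None else rng

    # Rotation by a random angle in (0, 360) degrees, i.e., (0, 2*pi) radians;
    # nearest-neighbour interpolation for the mask so class labels stay intact.
    angle = rng.uniform(0.0, 360.0)
    image = ndimage.rotate(image, angle, reshape=False, order=1)
    mask = ndimage.rotate(mask, angle, reshape=False, order=0)

    # Random horizontal flip.
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)

    # Random crop to crop_size x crop_size from the original size.
    h, w = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    image = image[top:top + crop_size, left:left + crop_size]
    mask = mask[top:top + crop_size, left:left + crop_size]

    # Additive Gaussian noise with a randomly drawn standard deviation in [0, 1).
    sigma = rng.random()
    image = image + rng.normal(0.0, sigma, size=image.shape)
    return image, mask
```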

IV-B DDU-Net architecture

Fig. 3: DDU-Net architecture, shown for an input image of size 400×400. Each cuboid corresponds to a multi-channel feature map. The grey cuboids represent feature maps copied from prior layers. Dense blocks are built on the downsampling part of the network. The arrows denote the different operations. The color version provides a better view.

We propose a deep fully convolutional network to segment the CT images. In contrast to U-Net [21] and related medical image segmentation methods such as [25] and [13], the proposed method does not split the original large images into patches; the input of the proposed DDU-Net is the original large image of size 400×400, which avoids the need to recompose image patches. The network architecture is shown in Fig. 3; the cuboids represent feature maps, and the grey cuboids are feature maps copied from prior layers. In Fig. 3, the arrows are connections between layers: solid thin arrows represent the standard batch normalization (BN)-rectified linear unit (ReLU)-convolution (Conv) sequence; green arrows represent upsampling-BN-ReLU-Conv, after which the size of a feature map increases; brown arrows represent BN-ReLU-Conv-average pooling, after which the size of a feature map decreases; black arrows represent max pooling, which also decreases the feature map size; and grey arrows represent copying the source feature maps and concatenating them to the target feature maps, which increases the number of channels of the target layer. Black numbers are the sizes of the feature maps, and blue numbers are the numbers of channels of the layers.

Dual network structure. DDU-Net consists of two sub-networks; we duplicate the input image and feed it into the two sub-networks separately. The upper sub-network upsamples feature maps at the fourth layer and downsamples at the fifth layer, while the lower sub-network downsamples feature maps at the fourth layer and upsamples at the fifth layer. Neurons of the upper sub-network have a smaller receptive field than those of the lower sub-network; consequently, the upper sub-network concentrates on smaller tissues, and the lower sub-network concentrates on larger tissues. We merge the feature maps of the last layer of the two sub-networks and convolve them with a convolution layer to obtain a pixel-level classification. This dual network architecture allows DDU-Net to segment tissues of different sizes robustly; e.g., Fig. 1(c) and Fig. 1(d) show the small dural sac (white) and the large vertebral body (red).
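To make the fusion step concrete, the sketch below shows this duplicate-input, concatenate-and-classify pattern in PyTorch; the `upper` and `lower` modules and the channel counts are placeholders standing in for the actual DDU-Net branches, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DualBranchSegNet(nn.Module):
    def __init__(self, upper: nn.Module, lower: nn.Module,
                 branch_channels: int = 32, num_classes: int = 4):
        super().__init__()
        self.upper = upper  # branch with the smaller receptive field (small tissues)
        self.lower = lower  # branch with the larger receptive field (large tissues)
        # A 1x1 convolution over the concatenated branch outputs gives per-pixel class scores.
        self.classifier = nn.Conv2d(2 * branch_channels, num_classes, kernel_size=1)

    def forward(self, x):
        # The same image is fed to both sub-networks, then their maps are merged.
        fused = torch.cat([self.upper(x), self.lower(x)], dim=1)
        return self.classifier(fused)  # (N, num_classes, H, W) logits

# Example with trivial stand-in branches that each output 32-channel maps.
upper = nn.Conv2d(1, 32, kernel_size=3, padding=1)
lower = nn.Conv2d(1, 32, kernel_size=5, padding=2)
net = DualBranchSegNet(upper, lower)
print(net(torch.randn(1, 1, 400, 400)).shape)  # torch.Size([1, 4, 400, 400])
```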

Skip connections. Inspired by U-Net [21], each sub-network of DDU-Net consists of a downsampling part (left side) and an upsampling part (right side). The downsampling part encodes the input image into lower-dimensional feature maps with richer filters, while the upsampling part performs the inverse process by upsampling and merging the low-dimensional feature maps to produce dense predictions for each pixel. In each sub-network, skip connections copy feature maps from layers in the downsampling part and concatenate them to the corresponding layers in the upsampling part; in Fig. 3, grey arrows indicate the copy directions, and grey cuboids represent the duplicated feature maps.
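A copy-and-concatenate skip connection of this kind can be sketched as follows; the channel counts are illustrative only (chosen so that the concatenated result matches the 2048-channel entry in Table I), and bilinear upsampling is assumed as stated in the text.

```python
import torch
import torch.nn.functional as F

def skip_merge(decoder_feat, encoder_feat):
    """Upsample the decoder feature map 2x and concatenate the copied encoder map."""
    up = F.interpolate(decoder_feat, scale_factor=2, mode="bilinear", align_corners=False)
    return torch.cat([up, encoder_feat], dim=1)  # channel counts add up

# Example: a 25x25 decoder map merged with the 50x50 map saved from the encoder.
dec = torch.randn(1, 512, 25, 25)
enc = torch.randn(1, 1536, 50, 50)
print(skip_merge(dec, enc).shape)  # torch.Size([1, 2048, 50, 50])
```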

Dense blocks. Generally, a deeper network performs better than a shallower network. To balance the depth of network and number of parameters, we insert dense blocks at the downsampling part of DDU-Net. Inside the dense blocks, neurons of each layer connect not only to the next layer but also to all their subsequent layers:

$$x_{\ell} = H_{\ell}\left([x_0, x_1, \ldots, x_{\ell-1}]\right) \qquad (1)$$

where $x_{\ell}$ is the output of the $\ell$-th layer, $[\cdot]$ denotes concatenation of the feature maps, and $H_{\ell}$ is a transition function, usually the 3 consecutive operations BN-ReLU-Conv; thus $x_{\ell}$ is affected by all preceding layers $x_0, \ldots, x_{\ell-1}$.

Fig. 4 shows an illustration of a dense block; between the input and output there are 3 layers, each producing the same number of feature maps (the growth rate), and we call this structure a dense block. In this example, layer 1 connects to layer 2, as well as to layer 3 and the output layer. As shown in Fig. 3, we design 8 dense blocks in DDU-Net in total, i.e., 4 dense blocks for each sub-network.

Fig. 4: A 3-layer dense block with growth rate 2. Each layer has 2 channels, and each layer connects not only to the next layer but also to all subsequent layers.
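For reference, a minimal PyTorch sketch of a dense block following Eq. (1) is given below; the single 3×3 BN-ReLU-Conv composite layer and the default hyperparameters are illustrative rather than the exact DDU-Net configuration.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One BN-ReLU-Conv composite layer producing `growth_rate` feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)

    def forward(self, x):
        # BN-ReLU-Conv applied to the concatenation of all preceding feature maps.
        return self.conv(torch.relu(self.bn(x)))

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer sees every preceding layer's output, as in Eq. (1).
            features.append(layer(torch.cat(features, dim=1)))
        # The block output concatenates the input and every layer's output.
        return torch.cat(features, dim=1)

# Example: 64 input channels + 3 layers x growth rate 32 = 160 output channels.
block = DenseBlock(in_channels=64, growth_rate=32, num_layers=3)
print(block(torch.randn(1, 64, 100, 100)).shape)  # torch.Size([1, 160, 100, 100])
```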

Detailed structure of DDU-Net. Table I gives the detailed structure of the upper sub-network and the lower sub-network of DDU-Net. Please note that in the layer details, "conv" stands for the sequence BN-ReLU-Conv; for example, "7×7, 64 conv, stride 2" corresponds to a BN-ReLU-Conv layer with a 7×7 convolutional kernel, 64 channels and stride 2. The bracketed notation "[Dense Block n]" after an operation means that the layer is skip-connected with (concatenated to) the output of that dense block. The growth rate for all dense blocks is 32; the upsampling is bilinear interpolation, and each transition layer is an operation between two dense blocks. The lower sub-network has one additional upsampling layer to recover the feature map size to 400×400. The results of the upper sub-network and the lower sub-network are concatenated and convolved with a final convolutional layer for dense classification.

| Upper sub-network: Layer name | Layer details | Feature size | Lower sub-network: Layer name | Layer details | Feature size |
| input | - | 400×400 | input | - | 400×400 |
| Convolution 1 | 7×7, 64 conv, stride 2 | 200×200 | Convolution 1 | 7×7, 64 conv, stride 2 | 200×200 |
| Pooling | 3×3, 64 max pool, stride 2 | 100×100 | Pooling | 3×3, 64 max pool, stride 2 | 100×100 |
| Upsampling 1 | 2×2, 64 upsampling | 200×200 | Transition Layer 1 | 2×2, 64 average pool, stride 2 | 50×50 |
| Dense Block 1 | | 200×200 | Dense Block 1 | | 50×50 |
| Transition Layer 1 | 1×1, 128 conv; 2×2, 128 average pool, stride 2 | 200×200 → 100×100 | Upsampling 1 | 2×2, 128 upsampling | 100×100 |
| Dense Block 2 | | 100×100 | Dense Block 2 | | 100×100 |
| Transition Layer 2 | 1×1, 256 conv; 2×2, 256 average pool, stride 2 | 100×100 → 50×50 | Transition Layer 2 | 1×1, 256 conv; 2×2, 256 average pool, stride 2 | 100×100 → 50×50 |
| Dense Block 3 | | 50×50 | Dense Block 3 | | 50×50 |
| Transition Layer 3 | 1×1, 512 conv; 2×2, 512 average pool, stride 2 | 50×50 → 25×25 | Transition Layer 3 | 1×1, 512 conv; 2×2, 512 average pool, stride 2 | 50×50 → 25×25 |
| Dense Block 4 | | 25×25 | Dense Block 4 | | 25×25 |
| Upsampling 2 | 2×2 upsampling - [Dense Block 3], 2048 | 50×50 | Upsampling 2 | 2×2 upsampling - [Dense Block 3], 2048 | 50×50 |
| Convolution 2 | 1×1, 512 conv; 3×3, 512 conv | 50×50 | Convolution 2 | 1×1, 512 conv; 3×3, 512 conv | 50×50 |
| Upsampling 3 | 2×2 upsampling - [Dense Block 2], 1024 | 100×100 | Upsampling 3 | 2×2 upsampling - [Dense Block 2], 1024 | 100×100 |
| Convolution 3 | 1×1, 256 conv; 3×3, 256 conv | 100×100 | Convolution 3 | 1×1, 256 conv; 3×3, 256 conv | 100×100 |
| Upsampling 4 | 2×2 upsampling - [Dense Block 1], 512 | 200×200 | Transition Layer 4 | 1×1, 256 conv; 2×2 average pool, stride 2 - [Dense Block 1], 512 | 100×100 → 50×50 |
| Convolution 4 | 1×1, 16 conv; 3×3, 16 conv | 200×200 | Convolution 4 | 1×1, 16 conv; 3×3, 16 conv | 50×50 |
| Upsampling 5 | 2×2 upsampling | 400×400 | Upsampling 5 | 4×4, 16 upsampling | 200×200 |
| Convolution 5 | 1×1, 32 conv; 3×3, 32 conv | 400×400 | Convolution 5 | 1×1, 32 conv; 3×3, 32 conv | 200×200 |
| - | - | - | Upsampling 6 | 2×2, 32 upsampling | 400×400 |

TABLE I: The architecture of the two sub-networks of DDU-Net.

IV-C Training

We randomly divide the proposed dataset into three parts, i.e., 50% for training, 20% for validation and 30% for testing. In practice, 50% of the images are fed into DDU-Net for training, 20% of the images are used for hyperparameter optimization and prevention of overfitting, and 30% of the images are used to evaluate the performance of the neural networks. DDU-Net is trained in an end-to-end manner, and all parameters in the network are optimized simultaneously.

For a typical lumbar spinal CT image, most pixels belong to the background, and regions such as the dural sac and spinal canal are extremely small (see Fig. 1 and Fig. 2). To address this problem, we introduce a class-balancing weight on a per-pixel term basis; this weight is designed to offset the imbalance between major and minor classes and to encourage the neural network to learn features of small tissues such as the dural sac and spinal canal. Specifically, DDU-Net adopts a weighted cross-entropy function as the loss function, which can be formulated as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i} \log p_{i, y_i} \qquad (2)$$

where $N$ is the number of pixels in the image, $y_i$ is the labeled class of pixel $i$, $p_{i,c}$ is the predicted probability of pixel $i$ belonging to class $c$, i.e., spinal canal, dural sac, vertebral body or background, and $w_c$ denotes the class-balancing weight:

$$w_c = 1 - \frac{N_c}{N} \qquad (3)$$

where $N$ is the number of pixels in the image and $N_c$ is the number of pixels that belong to class $c$.
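A hedged sketch of this weighted loss is shown below; it assumes the complementary-frequency weight form of Eq. (3) and relies on PyTorch's built-in per-class weighting of the cross-entropy loss, so it approximates the idea rather than reproducing the authors' exact implementation.

```python
import torch
import torch.nn as nn

def class_balance_weights(label_maps: torch.Tensor, num_classes: int = 4) -> torch.Tensor:
    """label_maps: integer tensor of shape (N, H, W) with values in [0, num_classes)."""
    counts = torch.bincount(label_maps.flatten(), minlength=num_classes).float()
    # Complementary frequency: rare classes (dural sac, spinal canal) get large weights.
    return 1.0 - counts / counts.sum()

# Dummy tensors with shapes matching 400x400 inputs and 4 classes (for illustration only).
train_labels = torch.randint(0, 4, (8, 400, 400))
logits = torch.randn(2, 4, 400, 400)   # network output: (batch, classes, H, W)
targets = torch.randint(0, 4, (2, 400, 400))

weights = class_balance_weights(train_labels)
criterion = nn.CrossEntropyLoss(weight=weights)  # applies w_{y_i} per pixel
loss = criterion(logits, targets)
```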

The code of DDU-Net is implemented with the PyTorch framework. We train DDU-Net on two NVIDIA GeForce GTX 1080Ti GPUs; due to GPU memory constraints, the model is trained with a mini-batch size of 4. The optimizer is mini-batch stochastic gradient descent (SGD) [24] with momentum 0.95; the learning rate is set to 1e-7, and the weight decay is 5e-4. The parameters of the dense blocks in each sub-network are initialized with DenseNet [12] weights pretrained on ImageNet [22], and the other parameters are initialized with He initialization [11].
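The training configuration can be summarized by the following sketch; the stand-in model and random tensors are placeholders so the snippet runs on its own, and only the optimizer settings (SGD, momentum 0.95, learning rate 1e-7, weight decay 5e-4, mini-batch size 4) come from the text above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data so the snippet is self-contained;
# replace with DDU-Net and the lumbar spinal CT dataset in practice.
model = nn.Conv2d(1, 4, kernel_size=3, padding=1)
dataset = TensorDataset(torch.randn(8, 1, 400, 400), torch.randint(0, 4, (8, 400, 400)))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-7,
                            momentum=0.95, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()  # the class-weighted version of Eq. (2) in practice

model.train()
for epoch in range(2):  # illustrative number of epochs
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```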

V Experiments and results

In this section, we introduce the performance evaluation metrics of this paper, after which ablation studies are conducted to explain why this method works. Finally, we compare the DDU-Net with state-of-the-art methods.

V-A Evaluation metrics

Suppose that we classify image pixels into $k+1$ classes, and let $p_{ij}$ denote the number of pixels of class $i$ that are predicted as class $j$; the total number of pixels in the image is $\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}$, $p_{ii}$ is the number of correctly predicted pixels, and $p_{ij}$ and $p_{ji}$ ($i \neq j$) are the false positive and false negative pixel counts respectively, as shown in Fig. 5. We evaluate our model using several semantic segmentation metrics [7]: pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (mIoU), and frequency weighted intersection over union (fwIoU).

Fig. 5: An illustration of semantic segmentation: $p_{ii}$ is a correctly predicted pixel, $p_{ij}$ is an $i$-class pixel that was predicted as class $j$, and $p_{ji}$ is a $j$-class pixel that was predicted as class $i$.

PA measures the ratio between correctly predicted pixels and total pixels. This metric can be formulated as follows:

$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \qquad (4)$$

MPA is a simple improvement of PA that calculates the percentage of correctly predicted pixels of each class and averages the percentages as a result. This metric can be formulated as follows:

$$\mathrm{MPA} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}} \qquad (5)$$

mIoU is a common metric in semantic segmentation that calculates the ratio between the intersection region (true positives) and the union region (true positives, false positives and false negatives), and averages the ratios over all classes. This metric can be formulated as follows:

$$\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \qquad (6)$$

fwIoU is an improvement of mIoU; it weights the intersection over union of each class by its occurrence rate. This metric can be formulated as follows:

$$\mathrm{fwIoU} = \frac{1}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \qquad (7)$$
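These four metrics can all be computed from a single confusion matrix; the sketch below is one straightforward NumPy implementation of Eqs. (4)-(7), with an arbitrary 4-class confusion matrix as an example (the values are illustrative, not results from the paper).

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] counts pixels of class i predicted as class j (p_ij above)."""
    total = conf.sum()
    correct = np.diag(conf)                 # p_ii
    gt_per_class = conf.sum(axis=1)         # sum_j p_ij
    pred_per_class = conf.sum(axis=0)       # sum_j p_ji
    union = gt_per_class + pred_per_class - correct

    pa = correct.sum() / total                       # Eq. (4)
    mpa = np.mean(correct / gt_per_class)            # Eq. (5)
    iou = correct / union
    miou = iou.mean()                                # Eq. (6)
    fwiou = (gt_per_class * iou).sum() / total       # Eq. (7)
    return pa, mpa, miou, fwiou

# Example with 4 classes (background, vertebral body, spinal canal, dural sac).
conf = np.array([[950, 10, 3, 2],
                 [ 12, 40, 1, 0],
                 [  2,  1, 4, 1],
                 [  1,  0, 1, 2]], dtype=float)
print(segmentation_metrics(conf))
```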

V-B Ablation studies

To investigate the importance of the different options in our method, we conduct an ablation study. The ablation studies include: with/without data augmentation, with/without skip connections, with/without dense blocks, with/without multi-branch networks, and the effect of the growth rate. Table II presents the details of the ablation studies; the last row shows the default DDU-Net as the baseline, that is, the network adopts skip connections, dense blocks and multi-branches, and uses data augmentation.

Data augmentation.

Since the proposed dataset is not a large-scale dataset, we conduct data augmentation to alleviate overfitting when training the network; the data augmentation details are introduced in Section IV-A. To reveal the benefit of data augmentation, we train the model for 100 epochs with and without data augmentation. The "Without data augmentation" row and the default DDU-Net row of Table II compare training without and with data augmentation under the same network architecture (skip connections, dense blocks and the multi-branch structure); with data augmentation, the mIoU improves by approximately 1 point. Fig. 6 shows the training and validation losses during training with and without data augmentation; the blue curves denote training loss, and the orange curves denote validation loss. Fig. 6(a) shows the training process without data augmentation: the validation loss begins to increase again (after an initial fall) partway through training, a phenomenon caused by overfitting. Fig. 6(b) depicts the training process with data augmentation; although the validation curve also increases again late in training, the best mIoU improves by approximately 1 point. These experiments indicate that data augmentation in our method not only alleviates overfitting but also improves performance.

(a) Training without data augmentation
(b) Training with data augmentation
Fig. 6: Training on the proposed dataset. Top: without data augmentation. Bottom: with data augmentation. We observe that training with data augmentation alleviates overfitting and improves the performance.

Network architecture. We design several modules to improve the architecture of DDU-Net. Skip connections in DDU-Net are designed to reuse low-level features and fuse multi-level features. As shown in the "A flat network" row of Table II, when we remove all skip connections from DDU-Net, the network becomes a flat network; the mIoU of the flat network is 0.7989, which is lower than that of the default DDU-Net by 3.42 points. Dense blocks are capable of alleviating the gradient vanishing problem in deep networks, enhancing feature reuse and feature propagation, and also reducing the number of parameters. As shown in Table II, without dense blocks, the mIoU decreases to 0.7636, which is lower than that of the default DDU-Net by 6.95 points. We also separate DDU-Net into two variants: one retains only the upper sub-network, and the other retains only the lower sub-network. The "Upper sub-network" and "Lower sub-network" rows of Table II show the performance of the stand-alone upper and lower sub-networks, respectively; their mIoU values are 0.8186 and 0.8067, which are lower than the mIoU achieved by the default multi-branch DDU-Net by 1.45 and 2.64 points. The evaluation results under the other metrics, i.e., PA, MPA and fwIoU, also indicate that the default DDU-Net, which uses data augmentation and all three modules (SC, DB and MB), performs best.

| DA | SC | DB | MB | mIoU | PA | MPA | fwIoU | Description |
| ✗ | ✓ | ✓ | ✓ | 0.8209 | 0.9909 | 0.8901 | 0.9828 | Without data augmentation |
| ✓ | ✗ | ✓ | ✓ | 0.7989 | 0.9908 | 0.8836 | 0.9827 | A flat network |
| ✓ | ✓ | ✗ | ✓ | 0.7636 | 0.9856 | 0.8599 | 0.9731 | Without dense blocks |
| ✓ | ✓ | ✓ | ✗ | 0.8186 | 0.9905 | 0.8862 | 0.9821 | Upper sub-network |
| ✓ | ✓ | ✓ | ✗ | 0.8067 | 0.9898 | 0.8858 | 0.9807 | Lower sub-network |
| ✓ | ✓ | ✓ | ✓ | 0.8331 | 0.9913 | 0.9099 | 0.9835 | DDU-Net (default) |

TABLE II: Ablation study under different options. Four options of the proposed method: data augmentation (DA), skip connections (SC), dense blocks (DB), and multi-branches (MB).

We visualize the ablation studies in Fig. 7, where the images in the leftmost column are two input CT images; the scale of the top-left image is larger than that of the bottom-left image. The images on the rightmost side are the corresponding ground truth (GT), in which the spinal canal, dural sac, vertebral body and background are marked in green, white, red and black, respectively. The columns from Fig. 7(b) to Fig. 7(f) show the network predictions under the different options: Fig. 7(b) presents results from the flat network without skip connections, Fig. 7(c) presents results from the network without dense blocks, Fig. 7(d) presents results from the upper sub-network, Fig. 7(e) presents results from the lower sub-network, and Fig. 7(f) presents the complete DDU-Net segmentations. We can clearly see that the segmentation results generated by the complete DDU-Net are much closer to the ground truth than the other results.

Fig. 7: Examples from the ablation study. This includes a comparison between the ground truth (rightmost column) and results from different options, which are shown respectively from (b) to (f): w/o SC (without skip connection), w/o DB (without dense blocks), upper-subnet (only upper-subnet), lower-subnet (only lower-subnet), DDU-Net (the complete DDU-Net).

Growth rate. The growth rate is a hyperparameter of a densely connected neural network that specifies how many feature maps each layer inside a dense block produces. Generally, a network with a larger growth rate performs better. However, a larger growth rate brings more parameters, and the running time of the network also increases. In the experiment, we test several growth rates under the same network architecture and data. As shown in Table III, the method performs best when the growth rate is equal to 48, surpassing a growth rate of 32 by a very small margin. On the other hand, when the growth rate is equal to 32, the running time is much shorter than when the growth rate is equal to 48, because the former has far fewer parameters than the latter. Consequently, we set the growth rate to 32 as the optimal hyperparameter, since it offers a good tradeoff between accuracy and efficiency.

| Growth rate | #parameters | FPS | mIoU | PA | MPA | fwIoU |
| 12 | 8.62M | 18.31 | 0.7797 | 0.9873 | 0.8558 | 0.9760 |
| 24 | 31.44M | 16.14 | 0.8183 | 0.9902 | 0.8938 | 0.9815 |
| 32 | 54.65M | 12.35 | 0.8331 | 0.9913 | 0.9099 | 0.9835 |
| 48 | 82.21M | 8.12 | 0.8340 | 0.9934 | 0.9122 | 0.9865 |

TABLE III: The growth rate affects the number of parameters and the running time of DDU-Net; we choose a growth rate of 32 as it offers a good tradeoff between accuracy and efficiency.

V-C Comparison with state-of-the-art methods

Qualitative Comparison. Several state-of-the-art methods are related to spinal image analysis; since the proposed method is the first to simultaneously segment the vertebral body, spinal canal and dural sac, we compare our DDU-Net with five related state-of-the-art methods in the following three aspects: 1) data source type, either MR or CT images; 2) technical details, including the objective of the work and the methodology; and 3) segmentation targets, i.e., what contents are segmented from the images. As shown in Table IV, all methods except the proposed DDU-Net use MR images as training data. [4] and [9] segment images using traditional machine learning algorithms; [1] segments images manually, and the segmented images are fed into CNNs as intermediate results; the other methods segment images with CNNs. The segmentation targets of these methods differ; only the proposed DDU-Net segments the vertebral body, spinal canal and dural sac from spinal CT images.

| Methods | Data source type | Technical details | Segmentation targets |
| Leener 2015 [4] | MRI | Automatically segment spinal cord and spinal canal by vertebral label | Vertebral regions, spinal cord and cerebrospinal fluid |
| Gros 2018 [9] | MRI | Automatically localize spinal cord using global curve optimization | Spinal cord |
| Gros 2019 [8] | MRI | Automatically segment spinal cord and intramedullary multiple sclerosis lesions with CNNs | Spinal cord and intramedullary multiple sclerosis lesions |
| Korez 2016 [15] | MRI | Automatically segment vertebral body by 3D CNNs | Vertebral body |
| Abbati 2017 [1] | MRI | Automatically diagnose lumbar spinal stenosis by CNNs; segmentations are intermediate results | Manual scan chopping and interpolation to four slices |
| DDU-Net (proposed) | CT images | Automatically segment vertebral body, spinal canal and dural sac by a dual densely connected U-shaped CNN | Vertebral body, spinal canal and dural sac |

TABLE IV: Qualitative comparison of different methods.

Visual Comparison. For visual comparison, we select sample segmentation results of three state-of-the-art deep learning-based semantic segmentation methods and DDU-Net. As shown in Fig. 8, the images in the leftmost column are the input images, and the images in the rightmost column are the ground truth. Fig. 8(b) shows the results of FCN-8s [19], which applies per-pixel classification using a fully convolutional network; in this experiment, we adopt the 8-pixel-stride version of FCN since it performs best among all FCN versions. Fig. 8(c) shows the results of U-Net [21], which segments medical images with a U-shaped fully convolutional network with skip connections. Fig. 8(d) shows the results of DeeplabV3 [3], which improves Deeplab [2] with a multi-grid scheme and atrous spatial pyramid pooling. Fig. 8(e) shows the segmentation maps of the proposed DDU-Net. As demonstrated in Fig. 8, FCN-8s [19] and DeeplabV3 [3] fail to segment the extremely small dural sac, and the boundaries of the vertebral body produced by U-Net [21] and DeeplabV3 [3] are not precise; we can clearly see that the segmentation maps of DDU-Net are much closer to the GT than those of the other methods. From the top-left to the bottom-left image, the scale of the images increases; that is, the top-left image concentrates on details of the tissues, where the regions of interest are larger than those of the other images, while the bottom-left image presents a global view of the CT scan, where the regions of interest are smaller. Under these challenging conditions, DDU-Net not only handles images of different scales but also segments semantic regions of different sizes, such as the large vertebral body (denoted in red) and the small dural sac (denoted in white), while FCN-8s, U-Net and DeeplabV3 fail in at least one case.

Fig. 8: Visual comparison between DDU-Net and three state-of-the-art methods. The images in the rightmost column are the ground truth of each row, where the red regions indicate vertebral bodies, green regions indicate the spinal canal, white regions indicate the dural sac, and black regions indicate the background. From (b) to (e) are the results of FCN-8s [19], U-Net [21], DeeplabV3 [3] and the proposed DDU-Net, respectively. Our results are the most similar to the ground truth.

Quantitative Comparison. The quantitative comparison of several methods on our dataset is shown in Table V. For a fair comparison, the parameters of FCN-8s [19], U-Net [21] and DeeplabV3 [3] are fine-tuned on our dataset before comparison. We can see that DDU-Net performs best in terms of all four evaluation metrics. Since over 95% of the labels belong to the background class, the PA and fwIoU metrics are quite saturated, so these two metrics alone are not sufficient to evaluate the performance of a method. On the other hand, the performance of DDU-Net in terms of the mIoU and MPA metrics reaches 0.8331 and 0.9099 respectively, surpassing the state-of-the-art methods by at least 3 points. Consequently, both the qualitative and quantitative comparisons between our method and the state-of-the-art methods indicate that the proposed method can generate promising segmentations on practical lumbar spinal CT images.

| Method | mIoU | PA | MPA | fwIoU |
| FCN-8s [19] | 0.6705 | 0.9869 | 0.7588 | 0.9766 |
| U-Net [21] | 0.8053 | 0.9896 | 0.8718 | 0.9802 |
| DeeplabV3 [3] | 0.6446 | 0.9804 | 0.7316 | 0.9646 |
| DDU-Net (proposed) | 0.8331 | 0.9913 | 0.9099 | 0.9835 |

TABLE V: Quantitative comparison of different methods on our dataset. The best performances are bolded, and the second-best performances are underlined.

VI Conclusion

Precisely identifying and recognizing the vertebral body, spinal canal and dural sac is a key step in diagnosing different types of LSS. In this paper, we first provide a new lumbar spinal CT image segmentation dataset with pixel-level labels and then present a fully automatic method, based on a dual densely connected U-shaped network, for segmenting the vertebral body, spinal canal and dural sac from axial spine CT images. Our method is practical: it requires no image preprocessing such as contrast enhancement, registration or denoising; the input is a raw CT image, and the output is the desired segmentation map; and the running speed is about 12 FPS (see Table III). Our method is also precise: compared with existing state-of-the-art methods on our new dataset, the proposed method is superior in terms of segmentation accuracy (Table V).

Given that we can now automatically segment the vertebral body, spinal canal and dural sac from CT images, there is still one more step before fully automatic diagnosis of the different types of LSS. In future work, we will apply the proposed DDU-Net as an approach for generating regions of interest and will investigate the complete automatic LSS diagnosis pipeline.

References

  • [1] G. Abbati, S. Bauer, S. Winklhofer, P. J. Schüffler, U. Held, J. M. Burgstaller, J. Steurer, and J. M. Buhmann (2017) MRI-based surgical planning for lumbar spinal stenosis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 116–124.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
  • [3] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
  • [4] B. De Leener, J. Cohen-Adad, and S. Kadoury (2015) Automatic segmentation of the spinal cord and spinal canal coupled with vertebral labeling. IEEE Transactions on Medical Imaging 34 (8), pp. 1705–1718.
  • [5] B. De Leener, S. Lévy, S. M. Dupont, V. S. Fonov, N. Stikov, D. L. Collins, V. Callot, and J. Cohen-Adad (2017) SCT: Spinal Cord Toolbox, an open-source software for processing spinal cord MRI data. NeuroImage 145, pp. 24–43.
  • [6] S. Eisenstein (1983) Lumbar vertebral canal morphometry for computerised tomography in spinal stenosis. Spine 8 (2), pp. 187–191.
  • [7] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez (2017) A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857.
  • [8] C. Gros, B. De Leener, A. Badji, J. Maranzano, D. Eden, S. M. Dupont, J. Talbott, R. Zhuoquiong, Y. Liu, T. Granberg, et al. (2019) Automatic segmentation of the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks. NeuroImage 184, pp. 901–915.
  • [9] C. Gros, B. De Leener, S. M. Dupont, A. R. Martin, M. G. Fehlings, R. Bakshi, S. Tummala, V. Auclair, D. G. McLaren, V. Callot, et al. (2018) Automatic spinal cord localization, robust to MRI contrasts using global curve optimization. Medical Image Analysis 44, pp. 215–227.
  • [10] G. Han, X. Liu, F. Han, I. N. T. Santika, Y. Zhao, X. Zhao, and C. Zhou (2014) The LISS: a public database of common imaging signs of lung diseases for computer-aided detection and diagnosis research and medical education. IEEE Transactions on Biomedical Engineering 62 (2), pp. 648–656.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
  • [13] Q. Jin, Z. Meng, T. D. Pham, Q. Chen, L. Wei, and R. Su (2019) DUNet: a deformable network for retinal vessel segmentation. Knowledge-Based Systems 178, pp. 149–162.
  • [14] S. Kitab, B. S. Lee, and E. C. Benzel (2018) Redefining lumbar spinal stenosis as a developmental syndrome: an MRI-based multivariate analysis of findings in 709 patients throughout the 16- to 82-year age spectrum. Journal of Neurosurgery: Spine 29 (6), pp. 654–660.
  • [15] R. Korez, B. Likar, F. Pernuš, and T. Vrtovec (2016) Model-based segmentation of vertebral bodies from MR images with 3D CNNs. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 433–441.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [17] C. C. Kuo, M. Merchant, M. P. Kardile, A. Yacob, K. Majid, and R. S. Bains (2019) In degenerative spondylolisthesis, unilateral laminotomy for bilateral decompression leads to less reoperations at 5 years when compared to posterior decompression with instrumented fusion: a propensity matched retrospective analysis. Spine.
  • [18] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez (2017) A survey on deep learning in medical image analysis. Medical Image Analysis 42, pp. 60–88.
  • [19] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
  • [20] D. Nie, R. Trullo, J. Lian, L. Wang, C. Petitjean, S. Ruan, Q. Wang, and D. Shen (2018) Medical image synthesis with deep convolutional adversarial networks. IEEE Transactions on Biomedical Engineering 65 (12), pp. 2720–2730.
  • [21] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [23] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [24] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147.
  • [25] Y. Wu, Y. Xia, Y. Song, Y. Zhang, and W. Cai (2018) Multiscale network followed network model for retinal vessel segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 119–126.
  • [26] Z. Yu, X. Jiang, F. Zhou, J. Qin, D. Ni, S. Chen, B. Lei, and T. Wang (2018) Melanoma recognition in dermoscopy images via aggregated deep convolutional features. IEEE Transactions on Biomedical Engineering 66 (4), pp. 1006–1016.