In clinical routine, lower back pain (LBP) caused by spinal disorders is reported as a common reason for clinical visits [1]. Both computed tomography (CT) and magnetic resonance (MR) imaging technologies are used in computer-assisted spinal diagnosis and therapy support systems. MR imaging has become the preferred modality for diagnosing various spinal disorders such as degenerative disc diseases, spondylolisthesis and spinal stenosis due to its excellent soft tissue contrast and the absence of ionizing radiation [2]. However, for spine trauma patients with moderate or high risk, CT is superior to all other imaging modalities in the detection of vertebral fractures and unstable injuries [3].
An accurate segmentation of individual vertebrae from CT images is important for many clinical applications. After segmentation, it is possible to determine the shape and condition of individual vertebrae. Segmentation can also assist early diagnosis, surgical planning and the localization of spinal pathologies such as degenerative disorders, deformations, trauma, tumors and fractures. Most computer-assisted diagnosis and planning systems are based on manual segmentation performed by physicians. The disadvantage of manual segmentation is that it is time consuming and poorly reproducible, because image interpretations may vary significantly across human interpreters.
In this paper, we address the challenging problem of automatic segmentation of lumbar vertebrae from 3D CT images acquired with varying fields of view (FOV), which is usually solved with a two-stage method consisting of a localization stage followed by a segmentation stage. Localization aims to identify each lumbar vertebra, while segmentation handles the problem of producing a binary labeling of a given 3D image. For vertebra localization, there exist both semi-automatic and fully automatic methods. For vertebra segmentation, both 2D image-based and 3D image-based methods [6, 7, 8, 9] have been introduced. These methods can be roughly classified as statistical shape model or atlas based methods, and graph theory (GT) based methods. The multi-center milestone study of clinical vertebra segmentation presented in [5] summarized the performance of several state-of-the-art vertebra segmentation algorithms on CT scans.
Recently, machine learning-based methods have gained more and more interest in the medical image analysis community. Most of these methods are based on ensemble learning principles that aggregate the predictions of multiple classifiers, and they have demonstrated superior performance in various challenging medical image analysis problems. For example, Michael Kelm et al. [10] proposed to detect the spine from CT or MR images using iterative marginal space learning. Zhan et al. [11] presented a hierarchical strategy and a local articulated model to detect vertebrae and discs from 3D MR images. Due to the successful applications of Random Forest (RF) regression and classification for the localization and segmentation of organs from 3D CT/MR data, such techniques have also been used for the automatic localization and segmentation of vertebrae from CT/MR images [4, 12].
More recently, with the advance of deep learning techniques [13, 14, 15], many researchers have proposed deep learning based methods for the automatic localization and segmentation of vertebrae from CT images. For example, Chen et al. [16]
proposed a method for automatically locating and identifying vertebrae in 3D CT volumes by exploiting high-level feature representations with deep convolutional neural networks (CNN). To solve the same task, a different approach was presented by Suzani et al. [17],
who parametrized the vertebral localization problem as a multi-variate non-linear regression. They then used a deep feed-forward neural network with hand-crafted features to regress the displacements between the center of each vertebral body and reference voxels, which were selected using a Canny edge detector. The final estimate of the center of each vertebral body was then obtained using an adaptive kernel density estimation method. This idea was later extended by Sekuboyina et al. [18]
to develop a localization-segmentation approach for automatic segmentation of lumbar vertebrae. More specifically, instead of localizing each individual vertebra, they proposed to use a multi-layered perceptron (MLP) with hand-crafted features to perform non-linear regression for locating the lumbar region. After that, a 2D U-net like Fully Convolutional Network (FCN) was used to segment all sagittal slices in order to obtain a volumetric segmentation.
Inspired by [18], in this paper we propose a method to automatically segment lumbar vertebrae from 3D CT images using cascaded 3D FCNs consisting of a localization FCN and a segmentation FCN (see Fig. 1 for an overview). More specifically, in the first step we train a regression 3D FCN (the LocalizationNet in Fig. 1) to find a bounding box which defines the region of interest (ROI) of the lumbar region. After that, a 3D U-net like FCN (the SegmentationNet in Fig. 1) is developed which, after training, performs a voxel-wise multi-class segmentation, mapping the cropped lumbar region volumetric data to its volume-wise labels.
The paper is organized as follows. In the next section, we will describe the method. Section 3 will present the experimental results, followed by discussions and conclusions in Section 4.
Although we require all CT data used in our study to contain the lumbar region, the FOV varies from scan to scan; some scans include vertebrae as high as T1. Thus, it is important to first localize the lumbar region and then to develop a segmentation method that segments each individual lumbar vertebra. This motivated us to develop cascaded FCNs consisting of a LocalizationNet and a SegmentationNet, as shown in Fig. 1. Below we first present the LocalizationNet, followed by a description of the SegmentationNet.
2.1 Lumbar Region Localization
Inspired by previous work [17, 18], we also formulate the lumbar region localization as a multi-variate regression problem. We use a rectangular box, represented by its two diagonal corners, to define the ROI of the lumbar region for each data. The target of the regression is then the two relative displacement vectors between a reference voxel and the two diagonal corners, as shown in Fig. 2. Following [17], we used a Canny edge detector to select voxels with high edge responses as the reference voxels in both the training and testing stages. Unlike previous work, where hand-crafted features computed from a 3D patch sampled around each reference voxel are used to regress the target, here we propose to directly regress the target from a sampled 3D patch using a deep FCN, where the features are automatically learned from the data.
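As a concrete illustration, the following numpy sketch forms the six regression target values for one reference voxel (the function name and box encoding are our own, chosen for illustration, not taken from the paper):

```python
import numpy as np

def regression_targets(reference_voxel, corner_min, corner_max):
    """Concatenated displacement vectors (6 values) from a reference
    voxel to the two diagonal corners of the lumbar ROI."""
    ref = np.asarray(reference_voxel, dtype=float)
    d_min = np.asarray(corner_min, dtype=float) - ref
    d_max = np.asarray(corner_max, dtype=float) - ref
    return np.concatenate([d_min, d_max])

# A reference voxel at (50, 60, 70) voting for an ROI spanning
# the corners (40, 50, 60) and (80, 90, 100):
targets = regression_targets((50, 60, 70), (40, 50, 60), (80, 90, 100))
```

At test time, adding a predicted displacement pair back onto the coordinates of its reference voxel yields one vote for each corner.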
To this end, we developed a localization net to automatically regress the two target displacement vectors from a sampled 3D patch. The architecture of the LocalizationNet is shown in Fig. 3. It consists of three repetitions of two 3D convolutional layers followed by one maximum pooling layer, where each convolutional layer is followed by batch normalization [19]. After the three repetitions, the feature maps are resized by another convolutional layer into a representation with 512 features. This is followed by two further convolutional layers; the last one reduces the number of output channels to the desired 6 values (representing the two displacement vectors to the two corners).
Training. The LocalizationNet was trained in two rounds. In the first round, the mean squared error (L2) loss was optimized with the Adam solver. In the second round, the Intersection over Union (IoU) loss introduced in [20], which has been shown to outperform the L2 loss for object detection, was used to fine-tune the network pretrained in round one. The IoU loss cannot be used from the beginning because it requires the corners to already be estimated with reasonable accuracy; otherwise there is no intersection between the estimated ROI and the ground truth ROI.
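The IoU loss (1 − IoU) is zero for a perfect prediction but carries no useful signal when the boxes are disjoint, which is why the L2 pretraining round is needed. A minimal numpy sketch of the IoU between two axis-aligned 3D boxes (function name and box encoding are illustrative, not the network-internal implementation):

```python
import numpy as np

def box_iou_3d(pred, gt):
    """Intersection over Union of two axis-aligned 3D boxes, each
    given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    lo = np.maximum(pred[:3], gt[:3])           # intersection lower corner
    hi = np.minimum(pred[3:], gt[3:])           # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0, None))  # zero if boxes are disjoint
    union = np.prod(pred[3:] - pred[:3]) + np.prod(gt[3:] - gt[:3]) - inter
    return inter / union if union > 0 else 0.0

# Two 10x10x10 boxes shifted by 5 voxels along x overlap in a third
# of their union:
iou = box_iou_3d((0, 0, 0, 10, 10, 10), (5, 0, 0, 15, 10, 10))
```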
Testing. During testing, the LocalizationNet predicts two displacement vectors from the center of each sampled 3D patch to the two corners of the ROI. After that, we use a kernel density estimation method [21] to obtain a density function over all the voxel votes for each corner of the ROI. The global maximum of this density function is taken as the predicted location of the associated corner.
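The vote aggregation step can be sketched as follows, using scipy's `gaussian_kde` as a stand-in for the kernel density estimation method cited above, and approximating the global maximum by evaluating the density at the votes themselves (names and simplifications are ours):

```python
import numpy as np
from scipy.stats import gaussian_kde

def aggregate_corner_votes(votes):
    """Given per-patch corner votes of shape (N, 3), return the vote
    with the highest estimated density as the predicted corner."""
    kde = gaussian_kde(votes.T)   # fit a density to the 3D votes
    density = kde(votes.T)        # evaluate the density at each vote
    return votes[np.argmax(density)]
```

Taking the density mode rather than, say, the mean makes the estimate robust to a minority of badly mislocated votes.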
2.2 Lumbar Vertebrae Segmentation
The estimated corners of the ROI allow us to extract the lumbar region from an input CT data. The goal of this stage is then to segment each individual lumbar vertebra from the cropped data. To this end, we developed a segmentation net to conduct multi-class segmentation of the cropped lumbar region data. A 3D U-net [15] like FCN was adopted here for our purpose. Fig. 4 shows a schematic drawing of the architecture of the employed SegmentationNet. It consists of two parts, i.e., the encoder part (contracting path) and the decoder part (expansive path). The encoder part focuses on analysis and feature representation learning from the input data, while the decoder part generates segmentation results relying on the features learned by the encoder part. Shortcut connections are established between layers of equal resolution in the encoder and decoder paths.
Previous studies show that small convolutional kernels are more beneficial for training and performance [22]. For our SegmentationNet, all convolutional layers use small kernels with a stride of 1, and all max pooling layers use a stride of 2. In the convolutional and deconvolutional blocks of our network, batch normalization [19] and rectified linear units are adopted.
Data augmentation and training patch generation. For each training sample, besides gray value augmentation and the application of a smooth dense deformation field to both data and ground truth labels as suggested in [15], we also conduct ROI augmentation following the suggestion in [18]. After data augmentation, multiple fixed-size 3D patches at random locations are taken from each image and saved in a list. The patches in the list are shuffled and packed into batches to be fed into the SegmentationNet.
The training of the SegmentationNet is done in two steps. First, the SegmentationNet is trained for binary segmentation of the lumbar spine, where all vertebrae have the same label; the network has to distinguish between spine and background. After that, it is trained for multi-class segmentation, where one class corresponds to one vertebra, starting with L1. The trained weights of the binary segmentation network are used to initialize the multi-class segmentation network. In both steps, a weighted cross-entropy loss was optimized with the Adam solver; since there are far more background voxels than voxels of the other classes, we reduce the weight of the background voxels and increase the weights of the vertebra voxels to balance their influence.
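The class-weighted cross-entropy can be sketched in numpy as follows (shapes and weight values are illustrative; in practice the loss is computed inside the training framework):

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Voxel-wise weighted cross-entropy.
    probs: (N, C) softmax outputs, labels: (N,) integer class ids,
    class_weights: (C,) per-class weights (class 0 = background)."""
    idx = np.arange(len(labels))
    w = class_weights[labels]                    # one weight per voxel
    ce = -np.log(probs[idx, labels] + 1e-12)     # per-voxel cross-entropy
    return np.sum(w * ce) / np.sum(w)            # weighted average
```

Down-weighting the background class keeps the abundant background voxels from dominating the gradient relative to the comparatively rare vertebra voxels.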
Testing. Our trained models can estimate labels of an arbitrary-sized volumetric image. Given a test volumetric image, we extracted overlapping sub-volume patches and fed them to the trained network to get prediction probability maps. For the overlapping voxels, the final probability maps were the average of the probability maps of the overlapping patches, which were then used to derive the final segmentation results. After that, we conducted morphological operations to remove isolated small volumes and fill internal holes.
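The overlap-averaging inference can be sketched as follows; patch size and stride are left as parameters since their values are not given here, and `predict_patch` stands in for the trained SegmentationNet:

```python
import numpy as np

def predict_volume(volume, patch_size, stride, predict_patch, n_classes):
    """Slide a patch over the volume, accumulate per-class probability
    maps, average where patches overlap, and take the argmax label."""
    shape = volume.shape
    prob_sum = np.zeros((n_classes, *shape))
    count = np.zeros(shape)
    # Start positions covering the whole volume; the last patch along
    # each axis is clamped to the border.
    starts = [sorted(set(list(range(0, s - p + 1, st)) + [s - p]))
              for s, p, st in zip(shape, patch_size, stride)]
    for x in starts[0]:
        for y in starts[1]:
            for z in starts[2]:
                sl = (slice(x, x + patch_size[0]),
                      slice(y, y + patch_size[1]),
                      slice(z, z + patch_size[2]))
                prob_sum[(slice(None),) + sl] += predict_patch(volume[sl])
                count[sl] += 1
    return np.argmax(prob_sum / count, axis=0)
```

Averaging the probability maps before the argmax smooths out disagreements between neighboring patches at their seams.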
2.2.1 Implementation Details
3 Experimental Results
3.1 Experimental design
We obtained 15 spine CT images with ground truth segmentation from the MICCAI 2016 xVertSeg challenge (details at http://lit.fe.uni-lj.si/xVertSeg/). Not only are the data acquired with varying FOVs, but they also contain fractured vertebrae, which poses a challenge for automatic localization and segmentation. We conducted a leave-three-out cross-validation study to evaluate the performance of the present method. More specifically, each time we randomly took 3 of the 15 CT data as the test data and the remaining 12 as the training data; this process was repeated for 5 folds. In each fold, the segmentation results of the test data were compared with the associated ground truth segmentation. For each vertebra in a test CT data, we evaluated the Average Symmetric Surface Distance (ASSD) and Hausdorff Distance (HD) between the surface models extracted from the different segmentations, as well as the volume overlap measurements, including the Dice coefficient (DC) and Jaccard coefficient (JC).
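The two volume-overlap measures follow directly from their standard definitions and can be computed from binary masks as below (the surface-distance metrics ASSD and HD require surface extraction and are omitted from this sketch):

```python
import numpy as np

def dice_jaccard(seg, gt):
    """Dice and Jaccard coefficients between two binary masks."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    inter = np.logical_and(seg, gt).sum()
    dc = 2.0 * inter / (seg.sum() + gt.sum())   # 2|A∩B| / (|A|+|B|)
    jc = inter / np.logical_or(seg, gt).sum()   # |A∩B| / |A∪B|
    return dc, jc
```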
3.2 Experimental results
In every fold, the SegmentationNet achieved a successful segmentation from the ROI estimated by our LocalizationNet. Quantitative segmentation results of the cross-validation study are shown in Table 1, where the results on each individual vertebra as well as on the entire lumbar region are presented. Our approach achieves a mean DC of 95.77±0.81% and a mean ASSD of 0.37±0.06 mm on the entire lumbar region.
In each fold, it took about 12 hours and 40 minutes to finish the training of the localization net, and another 23 hours to finish the training of the segmentation net. After training, it took on average about 79 seconds to segment one test CT data.
Table 1. Quantitative segmentation results of the cross-validation study: DC (%), JC (%), HD (mm) and ASSD (mm).
4 Discussions and Conclusions
In this paper, we proposed a method based on cascaded FCNs, consisting of a localization net and a segmentation net, for the automatic segmentation of lumbar vertebrae from spinal CT data. The localization net directs the attention of the segmentation net to the lumbar region in order to obtain an accurate segmentation of each individual lumbar vertebra.
The results achieved by our method are comparable to those of state-of-the-art methods, though a direct comparison of different methods is difficult, as not all of them were evaluated on the same dataset. For example, when evaluating their method on 50 healthy lumbar vertebrae from 10 spinal CT data, Ibragimov et al. [8] reported a mean DC of 93.6%. On the same lumbar vertebral dataset, Korez et al. [9] reported a mean DC of 95.3%. In contrast, even when evaluated on a challenging dataset with fractured vertebrae and varying FOVs, our method achieved a mean DC of 95.77%. In comparison to the method introduced in [18], which reported a mean DC of 92.7%, our approach also achieved superior results. As both methods are based on deep learning techniques, one possible explanation lies in the different localization (MLP vs. our 3D regression FCN) and segmentation (2D U-net like FCN vs. our 3D segmentation FCN) methods used.
In conclusion, we presented a cascaded CNN-based approach for fully automatic segmentation of lumbar vertebrae from 3D CT data. Our method achieved results equivalent or superior to those of state-of-the-art methods.
-  Balmain Sports Medicine, “Common lumbar spine injuries,” https://www.balmainsportsmed.com.au/injury-library/injury-information/lumber-and-spine.html, 2014, [Online; accessed 09-10-2017].
-  T. Emch and M. Modic, “Imaging of lumbar degenerative disk disease: history and current state.,” Skeletal Radiology, vol. 40, no. 9, pp. 1175–1189, 2011.
-  P.M. Parizel, T. van der Zijden, S. Gaudino, M. Spaepen, M.H.J. Voormolen, and et al., “Trauma of the spine and spinal cord: imaging strategies,” Eur Spine J, vol. 19, no. Suppl 1, pp. S8–S17, 2010.
-  C. Chu, D.L. Belavy, G. Armbrecht, M. Bansmann, D. Felsenberg, and G. Zheng, “Fully automatic localization and segmentation of 3d vertebral bodies from ct/mr images via a learning-based method,” PLOS ONE, vol. 10, no. 11, pp. e0143327, 2015.
-  J. Yao, J.E. Burns, D. Forsberg, and et al., “A multi-center milestone study of clinical vertebral ct segmentation,” Computerized Medical Imaging and Graphics, vol. 49, pp. 16–28, 2016.
-  T. Klinder, J. Ostermann, M. Ehm, A. Franz, R. Kneser, and C. Lorenz, “Automated model-based vertebra detection, identification, and segmentation in ct images,” Medical Image Analysis, vol. 13, no. 3, pp. 471–482, 2009.
-  D. Štern, B. Likar, F. Pernuš, and T. Vrtovec, “Parametric modelling and segmentation of vertebral bodies in 3d ct and mr spine images,” Phys. Med. Biol., vol. 56, pp. 7505–7522, 2011.
-  B. Ibragimov, B. Likar, F. Pernuš, and T. Vrtovec, “Shape representation for efficient landmark-based segmentation in 3-d,” IEEE Transactions on Medical Imaging, vol. 33, no. 4, pp. 861–874, 2014.
-  R. Korez, B. Ibragimov, B. Likar, F. Pernuš, and T. Vrtovec, “A framework for automated spine and vertebrae interpolation-based detection and model-based segmentation,” IEEE Transactions on Medical Imaging, vol. 34, pp. 1649–1662, 2015.
-  B. Michael Kelm, M. Wels, S. Zhou, S. Seifert, M. Suehling, Y. Zheng, and D. Comaniciu, “Spine detection in ct and mr using iterative marginal space learning,” Medical Image Analysis, vol. 17, no. 8, pp. 1283–1292, 2013.
-  Y. Zhan, D. Maneesh, M. Harder, and X. Zhou, “Robust mr spine detection using hierarchical learning and local articulated model,” in Proceedings of MICCAI 2012, 2012, pp. 141–148.
-  B. Glocker, J. Feulner, and et al., “Automatic localization and identification of vertebrae in arbitrary field-of-view ct scans,” in Proceedings of MICCAI 2012, 2012, pp. 590–598.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of CVPR, 2015, pp. 3431–3440.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
-  Ö. Çiçek, A. Abdulkadir, S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: Learning dense volumetric segmentation from sparse annotation,” in MICCAI 2016, vol. LNCS 9901, pp. 424–432. Springer, 2016.
-  H. Chen, C. Shen, J. Qin, and et al., “Automatic localization and identification of vertebrae in spine ct via a joint learning model with deep neural network,” in Proceedings of MICCAI 2015, 2015, pp. 515–522.
-  A. Suzani, A. Seitel, Y. Liu, and et al., “Fast automatic vertebrae detection and localization in pathological ct scans - a deep learning approach,” in Proceedings of MICCAI 2015, 2015, pp. 678–686.
-  A. Sekuboyina, A. Valentinitsch, J.S. Kirschke, and B.H. Menze, “A localisation-segmentation approach for multi-label annotation of lumbar vertebrae using deep nets,” arXiv:1703.04347, 2017.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of ICML, 2015, pp. 448–456.
-  Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas S. Huang, “Unitbox: An advanced object detection network,” CoRR, vol. abs/1608.01471, 2016.
-  Z. Botev, J. Grotowski, D. Kroese, and et al, “Kernel density estimation via diffusion,” The Annals of Statistics, vol. 38, no. 5, pp. 2916–2957, 2010.
-  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.