1 Introduction
Automatic localization and labeling of vertebrae in 3D spinal imaging, e.g. computed tomography (CT) or magnetic resonance imaging (MRI), has become an essential tool for clinical tasks, including pathological diagnosis, surgical planning and postoperative assessment. Specific applications such as vertebrae segmentation, fracture detection, tumor detection, registration and statistical shape analysis can also benefit from the effective vertebrae detection and labeling algorithms. However, there are many challenges associated with designing an accurate and automatic algorithm, which arise from pathologies, image artifacts, and the limited fieldofview. For example, as shown in Figure 1, the abnormal spine curvature and surgical metal implants significantly alter the appearance of vertebrae and reduce the image contrast. Spinefocused scans with small fieldofview (FOV) also add difficulty to the identification tasks due to lack of global spatial and contextual information.
To address these challenges, many approaches have been proposed for automatic localization and identification of vertebrae. Glocker et al.[1] presented a method based on regression forests and probabilistic graphic models. However, their method is likely to suffer from the narrow fieldofview because the broad contextual information is not always available. To overcome this limitation, Glocker et al.[2]
proposed a randomized classification forest based approach which achieved reasonable localization and identification performances on pathological cases and those with limited FOV. Recently, deep learning has been employed in the applications of spine detection. Chen
et al.[3]presented a joint convolutional neural network (JCNN). This hybrid approach used a random forest classifier to coarsely localize the candidates before the JCNN scaned the input CT volume for final results. By incorporating the pairwise information of neighboring vertebrae in JCNN, it outperformed other methods
[2]. Suzani et al.[4]proposed a deep feedforward neural network to detect if an input image contained a specific vertebra. Although this work achieved high detection rates, it reported a large mean localization error compared with other works. Besides, instead of the direct 3D volumetric input, this work extracted 1D features based on the local voxel intensities as the input of deep feedforward neural network. In addition, no convolution or pooling operation was applied in the network. Payer
et al.[5] proposed a composite neural network to build up the full connection between response maps of all landmarks with convolutional kernels. The spatial relationship of landmarks were implicitly embedded in the CNN model.In order to overcome these limitations and to take advantage of deep neural networks, we present an approach, shown in Figure 2, with the following contributions:
a) Deep ImagetoImage Network (DI2IN) for VoxelWise Regression
Without extracting features from input images, the proposed deep imagetoimage architecture directly takes a 3D CT volume as input. The training of the proposed network is designed as multichannel voxelwise regression (refer to section 2.1). It generates the multichannel probability maps associated with different vertebra centers, which intuitively illustrate the location and label of vertebrae. Our neural network requires no coarse classifiers to remove the outliers for preprocessing. Instead, it automatically extracts contextual and spatial information by itself. By taking the advantage of fully convolutional implementation, the proposed network is significantly timeefficient, which sets it apart from the sliding window approaches.
b) Response Enhancement with Message Passing
Although the proposed deep imagetoimage network generates confident probability maps, there is no guarantee that it will avoid false positives (outliers) due to the complexity of appearance (shown in Figure 1
). To resolve this problem, we adopt a message passing scheme within the probability maps of vertebra centers, which leverages the mutual relation of vertebrae. A chainstructure graphical model is introduced to depict the spatial relationship. Each node in the model represents a probability distribution of one vertebra center. During the passing scheme, the probability map of each vertebra center iteratively receives messages (encoded in the convolution operation) from all neighboring vertebrae (nodes) and absorbs them for further selfevolvement. The collected messages can not only enhance the response of correct location, but also suppress that of the false positives.
c) Refinement using Sparse Representation
To further refine the coordinates of vertebrae, we incorporate a dictionary learning and sparse representation approach which utilizes the holistic structure of the spine and identifies the important set of coordinates. Instead of learning a regression model to fit the spinal shape, we simply adopt the coordinates of the spine in the training samples to construct a data dictionary and formulate this problem as an norm optimization to learn the best sparse representation. Based on the regularity of the spine shape, ambiguous coordinates are removed and the sparse representation is optimized in a subspace instead of all coordinates (refer to section 2.2). Finally, the refined coordinates in each axis are reconstructed from the same subspace jointly, which further improves the localization and identification performance.
The rest of the paper is organized as follows: In section II, we introduce our deep imagetoimage network architecture with message passing and refinement approach. In section III, the proposed framework is compared to previous stateoftheart methods based on a public spine dataset. In section IV, we present the conclusion and discussion.
2 Methodology
2.1 Deep ImagetoImage Network (DI2IN) for Multiple Landmark Localization
In this section, we present the proposed deep imagetoimage network, which is multilayer convolutional, to localize vertebra centroids. As shown in Figure 3
, the proposed network is deployed in a symmetric manner which can be treated equivalently as a convolutional encoderdecoder network. It is implemented in the fashion of voxelwise endtoend learning to enable efficient inference. The multichannel ground truth data is specially designed with the coordinates of vertebra centroid. A Gaussian distribution
is defined in each channel to represent the vertebra location, . Vector
represents the voxel coordinates in volume, vectoris the ground truth location of verterbra centroid. Variance
is predefined which controls the scale of the Gaussian distribution. Each channel’s prediction corresponds to a unique vertebra centroid. It has the same size as the input image. Therefore, the whole learning problem is formulated as multichannel voxelwise regression. During the training, we apply the square loss for each voxel at the output layer. We define the centroid detection as a regression task instead of classification. Because the highly imbalanced data in classification is inevitable and it causes the misleading classification accuracy.Convolution, rectified linear unit (ReLU), and maxpooling layers are used in the encoder part of the proposed network. Pooling is critical as it helps increase the receptive field of neurons and lower the GPU memory consumption. With the larger receptive field, more contextual information is taken into consideration for each neuron in different layers. Therefore, the relative spatial position of vertebra centroids in prediction would be better interpreted. The decoder part is composed of the convolution, ReLU and upsampling layers. Upsampling layers are implemented with the bilinear interpolation to enlarge and densify the activation. It further enables the endtoend voxelwise training. The convolutional filter size is
in the final output layer and for the other convolution layers. The maxpooling filter size is. The stride in the convolution layers is set as 1 to maintain the same size in each channel. The pooling factor in pooling layers is set as 2 for downsampling by half in each dimension. The number of channels in each layers are marked next to the layers in Figure
3. In upsampling layers, the input features are upsampled by a factor of 2 in directions respectively. The network takes a 3D CT image (volume) as input and directly outputs multiple probability maps, with each map associated with one vertebra landmark (equivalent to vertebra centroid). The framework is more efficient at computing the probability maps as well as the centroid locations than the patchwise classification or regression methods in [3, 4].Our DI2IN adopts several prevailing techniques[6, 7, 8, 10, 11] with necessary modification. We utilize the feature layer concatenation in DI2IN which is analogous with the one described in [7]. The shortcut bridges are built up directly from the encoder layers to decoder layers. It passes forward the feature maps from the encoder and is then concatenated with the decoder feature layers. The concatenated features are used as the input for next convolution layers. Following the concatenation, high and low level features are combined explicitly so that the network benefits from both the local and global contextual information. Deep supervision in neural network during the endtoend training is shown in [8, 10, 11] to achieve excellent boundary detection and segmentation results. In the network, we introduce a more sophisticated deep supervision method to improve the performance. Several branches are bifurcated out from the main network from the intermediate layers of the decoder part. With proper upsampling factors and convolution operations, the output size of each channel of all branches matches the size of the input image. The supervision is introduced at the end of each branch by computing a loss term with the same ground truth data. To further leverage the results from different branches, the final output is determined by the convolution operation of output concatenation of all branches with ReLU. The total loss is a combination of loss terms from all output layers which includes the output layers from all branches and the final output layer, as shown here:
2.2 Probability Map Enhancement with Message Passing Scheme
Given the image , the DI2IN generates one probability map for the center of each individual vertebra with high confidence. The vertebrae will be located at the peak positions of probability maps. However, we find that these probability maps are not perfect yet: some probability maps don’t have response or have very low response at the ground truth locations because of similar image appearances of several vertebrae (e.g. ). In order to handle the problem of missing response, we propose a message passing scheme to effectively enhance the probability maps by utilizing the prior knowledge of the spine structure.
The concept of message passing was first introduced in the context of probabilistic graphical models. It is used in the sumproduct or maxproduct algorithms for exact inference of the marginal probabilities of nodes or the distribution mode in a treestructured graph. Messages are passed iteratively between neighboring nodes to exchange information and optimize the overall probability distribution. Similarly, we introduce an MRFlike model, a chainstructure graph shown in Figure 4, to express the spatial relationship among vertebrae, where each node in the graph represents one vertebra center . Then we propose the following formulation to update the during the iteration of message passing.
(1)  
(2) 
where is the neighbor of vertebra in the graph, is a normalization constant, and is a discounted factor. The messages , defined as , are passed along the chain shown in Figure 4. is the convolution operation. is a single convolution kernel which is learned from the ground truth distribution of vertebra . Multidimensional convolution itself is capable to shift the mass of the probability map to its neighborhood with a fixed orientation (kernel). If is confident at its correct location, then the message would be a strong prior for at the correct location of the vertebra . After several iterations of message passing, the vertebra with missing response can be compensated with the aggregated messages from its neighboring vertebrae. The underlying assumption is that majority of the vertebra probability maps are confident and well distributed around the true locations, which is guaranteed by the powerful DI2IN in our method. The advantage of the proposed scheme is that it can be concatenated into the DI2IN for further endtoend training (finetuning) when the iteration number is fixed. The location of each vertebra centroid can simply be determined by the location of the maximum value in the corresponding probability map.
Several recent works have deployed the messagepassing concept for different landmark detection tasks. Chu et al.[12] proposed the passing scheme between the feature maps instead of landmark probability maps. Yang et al.[9] introduced a fully connected graphical model for message passing between probability maps. The handcrafted features were adopted in the pairwise terms of the messages. Payer et al.[5] also brought up the fully connected graphical model, applying onetime passing with pixelwise dotproduct for noise cancelling. In our proposed method, the passing is directly among the response maps along the chainstructure model. The response maps are gradually enhanced within several passing iterations, since one passing is not enough to make necessary adjustment for probability maps. Compared to the handcraft features, the single convolutional kernel is eligible to generate messages between neighbors because the designed neighborhood is compact. In our framework, the missing response is the major issue instead of the noisy output, so the dotproduct operation is not applicable and may hurt the output probabilities.
2.3 Sparse Representation for Landmark Refinement
As shown in Figure 5, the DI2IN with message passing generates a clear probability map, where the high probability map indicates the potential location of the landmark (centroid of the vertebrae). However, sometimes due to image artifacts and low image resolution, it is still difficult to guarantee there will be no false positive. In [3], a shape regression optimization model was used to refine the predicted vertebral centroids in the vertical axis. By minimizing an energy function, the optimized parameters are learned for each test sample to determine the final coordinates of vertebrae. However, their model assumes that the coordinates distribution can be described in a quadratic form, and it was only applied for coordinates in the vertical axis.
Inspired by the previous works in sparse representation, we propose an norm approach to help refine the coordinates in all , and axes. Given a pregenerated shapebased dictionary and the predicted coordinates vector of all centroids in a testing sample, we adopt the norm optimization to solve the sparse coefficient vector . The refined coordinates is defined as . In particular, the shapebased dictionary is learned from the training samples. For example, the dictionary associated with the vertical axis is constructed by the coordinates of all centroids of each sample in the training database. denotes the predicted coordinates of one sample in the testing database. The dictionaries and indicate the dictionaries associated with other axes and are learned in the same way.
The details are shown in Algorithm 1. First, we use dynamic programming to find the maximum descending subsequence in the predicted coordinates since the vertical axis of the spine produces the most stable results. We define the subspace of dictionary and the predicted coordinates vector based on the indices in the subsequence. For example, we only choose the atoms from dictionary and associated with the indices to generate a subdictionary and subvector . Then we solve the optimization problem in Step 3 for , and axes individually in the subspace instead of the original space . Finally, all coordinates are reconstructed by the original dictionary (i.e., ) and sparse vector (i.e., ). Intuitively, we remove the ambiguous outliers in the preliminary predicted coordinates and then define a subspace without these outliers. Based on the subspace, we find the best sparse combination in the corresponding subdictionary. By taking advantage of the original dictionary, all coordinates are reconstructed and refined simultaneously as shown in Figure 6.
3 Experiments
First, we evaluate the proposed method on the database introduced in [2] which consists of 302 CT scans of patients with varying types of pathologies. There are several unusual appearances in the database, such as the abnormal spine curvature and the bright visual artifacts caused by metal implants from the postoperative procedures. In addition, the fieldofview (FOV) of each CT image varies widely in terms of vertical cropping, image noise and physical resolution[1]. Most cases contain a portion of whole vertebrae while the global spine structure is visible only in a few cases. The large variations in pathologies and the limited FOV increase the complexity of vertebra appearance, and thus raise the difficulties of accurate spine localization and identification task. The ground truth is marked at the centroid of each vertebra, which is annotated by clinical experts. In previous works[1, 3, 4]
, there are two different settings on these 302 CT images: the first one uses 112 of the images as training and another 112 images as testing; the second one takes all images (242) in setting one with extra 18 images as training data and an additional 60 images as testing data. For a fair comparison, we follow the same database settings in our experiments. They are denoted as “Set 1” and “Set 2” respectively. We follow the evaluation metrics described in
[2], in terms of the Euclidean distance error (in mm) and identification rates (Id.Rates) defined in [1]. Table 1 compares our evaluation performance with the number reported by previous approaches[2, 3, 4]. We obtain an overall average mean error of 9.1 mm and 8.6 mm and an identification rates of 80% and 85% on those two sets, respectively. Overall, our method outperforms the stateoftheart methods on the same datasets in terms of mean error and identification rates.It is well known that deep neural networks have the capability to represent the variations of a large amount of data. With large amounts of annotated data in the training, the deep neural network can usually achieve better performance on various tasks. In order to validate if more training data can boost the performance of the proposed method, we introduce additional 1000+ CT scans of patients into the training samples and train our proposed model again from scratch. These data cover large variations in populations and contrast phases which are collected for various purposes. Most cases have a large FOV and include all the vertebrae. Some scans are extended to the knee and head. The testing data is not changed in all experiments. This pipeline is denoted as “Our Method+1000 training data”. As shown in Table 1, the experimental results demonstrate that the large amount of training samples can further improve the performance significantly. Our approach has achieved the best performance in almost all the metrics. On “Set 1”, the Id. Rates of our method is 13 percent higher than the stateoftheart method[2]. We also achieve more than 90% Id. Rates on “Set 2”, which is 6 percent higher than the stateoftheart method[3].
Region  Method  Set 1  Set 2  
Mean  Std  Id.Rates  Mean  Std  Id.Rates  
All  Glocker et al.[2]  12.4  11.2  70%  13.2  17.8  74% 
Suzani et al[4]  18.2  11.4          
Chen et al.[3]        8.8  13.0  84%  
DI2IN  17.0  47.3  74%  13.6  37.5  76%  
DI2IN+MP  11.7  19.7  77%  10.2  13.9  78%  
DI2IN+MP+Sparsity  9.1  7.2  80%  8.6  7.8  85%  
DI2IN+1000  10.6  21.5  80%  7.1  11.8  87%  
DI2IN+MP+1000  9.4  16.2  82%  6.9  8.3  89%  
DI2IN+MP+Sparsity+1000  8.5  7.7  83%  6.4  5.9  90%  
Cervical  Glocker et al.[2]  7.0  4.7  80%  6.8  10.0  89% 
Suzani et al[4]  17.1  8.7          
Chen et al.[3]        5.1  8.2  92%  
DI2IN+MP+Sparsity  6.6  3.9  83%  5.6  4.0  92%  
DI2IN+MP+Sparsity+1000  5.8  3.9  88%  5.2  4.4  93%  
Thoracic 
Glocker et al.[2]  13.8  11.8  62%  17.4  22.3  62% 
Suzani et al[4]  17.2  11.8          
Chen et al.[3]        11.4  16.5  76%  
DI2IN+MP+Sparsity  9.9  7.5  74%  9.2  7.9  81%  
DI2IN+MP+Sparsity+1000  9.5  8.5  78%  6.7  6.2  88%  
Lumbar 
Glocker et al.[2]  14.3  12.3  75%  13.0  12.5  80% 
Suzani et al[4]  20.3  12.2          
Chen et al.[3]        8.4  8.6  88%  
DI2IN+MP+Sparsity  10.9  9.1  80%  11.0  10.8  83%  
DI2IN+MP+Sparsity+1000  9.9  9.1  84%  7.1  7.3  90%  

All experiments are conducted on a workstation equipped with an Intel 3.50 GHz CPU and a 12GB Nvidia Titan X GPU. During the evaluation, the response maps of all output channels are compared with a heuristic threshold constant in an elementwise manner in order to distinguish valid response from random noise. Only the channels whose response maps contain elements with value greater than the threshold are considered. The vertebra centroids associated with these channels are then identified to be present in the image. The landmarks corresponding to the other response maps are considered as nonpresented. The localization and identification of all vertebrae in one case is achieved simultaneously in an efficient way. The testing time of our method is around three seconds per case on average assisted with the GPU. The experimental results demonstrate that our proposed method for spine centroids localization and identification is not only effective in terms of accuracy, but also significantly timeefficient.
4 Conclusion
In this paper, we proposed an effective and fast automatic method to localize and label vertebra centroids in 3D CT volumes. Our method outperforms other stateoftheart methods of spine labeling in terms of various evaluation metrics. For the future study, we plan to investigate various DI2IN architectures (e.g. ResNet) and other sophisticated refinement approaches to further improve the localization and identification performance.
Disclaimer: This feature is based on research, and is not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.
References
 [1] Glocker, B., Feulner, J., Criminisi, A., Haynor, D.R. and Konukoglu, E., 2012, October. Automatic localization and identification of vertebrae in arbitrary fieldofview CT scans. In International Conference on Medical Image Computing and ComputerAssisted Intervention (pp. 590598). Springer Berlin Heidelberg.
 [2] Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R. and Criminisi, A., 2013, September. Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In International Conference on Medical Image Computing and ComputerAssisted Intervention (pp. 262270). Springer Berlin Heidelberg.
 [3] Chen, H., Shen, C., Qin, J., Ni, D., Shi, L., Cheng, J.C. and Heng, P.A., 2015, October. Automatic localization and identification of vertebrae in spine ct via a joint learning model with deep neural networks. In International Conference on Medical Image Computing and ComputerAssisted Intervention (pp. 515522). Springer International Publishing.
 [4] Suzani, A., Seitel, A., Liu, Y., Fels, S., Rohling, R.N. and Abolmaesumi, P., 2015, October. Fast Automatic Vertebrae Detection and Localization in Pathological CT ScansA Deep Learning Approach. In International Conference on Medical Image Computing and ComputerAssisted Intervention (pp. 678686). Springer International Publishing.
 [5] Payer, C., Stern, D., Bischof, H. and Urschler, M., 2016, October. Regressing Heatmaps for Multiple Landmark Localization Using CNNs. In International Conference on Medical Image Computing and ComputerAssisted Intervention (pp. 230238). Springer International Publishing.
 [6] Badrinarayanan, V., Kendall, A. and Cipolla, R., 2015. Segnet: A deep convolutional encoderdecoder architecture for image segmentation. arXiv preprint arXiv:1511.00561.
 [7] Ronneberger, O., Fischer, P. and Brox, T., 2015, October. Unet: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and ComputerAssisted Intervention (pp. 234241). Springer International Publishing.

[8]
Xie, S. and Tu, Z., 2015. Holisticallynested edge detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 13951403).

[9]
Yang, W., Ouyang, W., Li, H. and Wang, X., 2016. Endtoend learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. CVPR.
 [10] Merkow, J., Kriegman, D., Marsden, A. and Tu, Z., 2016. Dense VolumetoVolume Vascular Boundary Detection. arXiv preprint arXiv:1605.08401.
 [11] Dou, Q., Chen, H., Jin, Y., Yu, L., Qin, J. and Heng, P.A., 2016, October. 3d deeply supervised network for automatic liver segmentation from ct volumes. In International Conference on Medical Image Computing and ComputerAssisted Intervention (pp. 149157). Springer International Publishing.
 [12] Chu, X., Ouyang, W., Li, H. and Wang, X., 2016. Structured feature learning for pose estimation. arXiv preprint arXiv:1603.09065.
Comments
There are no comments yet.