[ISBI 2020] Vertebra-Focused Landmark Detection for Scoliosis Assessment
Adolescent idiopathic scoliosis (AIS) is a lifetime disease that arises in children. Accurate estimation of Cobb angles of the scoliosis is essential for clinicians to make diagnosis and treatment decisions. The Cobb angles are measured according to the vertebrae landmarks. Existing regression-based methods for the vertebra landmark detection typically suffer from large dense mapping parameters and inaccurate landmark localization. The segmentation-based methods tend to predict connected or corrupted vertebra masks. In this paper, we propose a novel vertebra-focused landmark detection method. Our model first localizes the vertebra centers, based on which it then traces the four corner landmarks of the vertebra through the learned corner offset. In this way, our method is able to keep the order of the landmarks. The comparison results demonstrate the merits of our method in both Cobb angle measurement and landmark detection on low-contrast and ambiguous X-ray images. Code is available at: <https://github.com/yijingru/Vertebra-Landmark-Detection>.
Adolescent idiopathic scoliosis (AIS) is a lateral deviation and axial rotation of the spine that arises in children at or around puberty. Early detection and bracing treatment of scoliosis would decrease the need for surgery. The Cobb angle is used as a gold standard by clinicians for scoliosis assessment and diagnosis. It is commonly measured on the anterior-posterior (AP) radiograph (X-ray) by selecting the most tilted vertebrae at the top and bottom of the spinal curve [4, 3]. Measurement of the Cobb angles is challenging due to the ambiguity and variability in the scoliosis AP X-ray images (Fig. 1). Generally, clinicians manually annotate the landmarks (the yellow points in Fig. 1) and choose the most tilted vertebrae for the Cobb angle assessment. However, the measurement tends to be affected by the selection of the vertebrae and by inter-observer bias.
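To make the geometric definition concrete, the sketch below computes a Cobb angle as the angle between the endplate lines of the two selected end vertebrae. The helper names are hypothetical, and this is a simplification of the clinical measurement procedure, not the exact protocol clinicians follow.

```python
import numpy as np

def endplate_angle(left, right):
    """Inclination (degrees) of the line through two endplate landmarks."""
    dx, dy = right[0] - left[0], right[1] - left[1]
    return np.degrees(np.arctan2(dy, dx))

def cobb_angle(upper_endplate, lower_endplate):
    """Angle between the superior endplate of the upper end vertebra
    and the inferior endplate of the lower end vertebra."""
    a = endplate_angle(*upper_endplate)
    b = endplate_angle(*lower_endplate)
    return abs(a - b)
```

For example, an upper endplate tilted at 45 degrees against a horizontal lower endplate gives a 45-degree Cobb angle.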
Given that manual assessment of Cobb angles in clinical practice is time-consuming and unreliable, there is a surge of interest in developing automatic methods for accurate spinal curvature estimation in spinal AP X-ray images. Traditional unsupervised methods such as filtering and active contour models are parameter sensitive and typically involve complicated processing stages. To deal with the large anatomical variability and the low tissue contrast in X-ray images, supervised learning-based methods have been developed. SVR uses structured Support Vector Regression to regress the landmarks and the Cobb angles directly from extracted hand-crafted features. BoostNet learns more robust spinal features with convolutional layers. These regression-based methods are able to exploit the global information of the image. However, the dense mapping between the regressed points and the latent features requires significant parameter and computational costs. Consequently, the input image (around 2500×1000 pixels) has to be downsampled to a very small resolution (e.g., 256×128) to enable training and inference. Such an operation limits the performance of these methods due to the loss of fine details from the original high-resolution images. To handle this issue, another line of work uses convolutional layers to segment each vertebra for scoliosis assessment [7, 8]. These methods are mainly based on U-Net, and tend to be sensitive to image quality and to have difficulty separating attached vertebrae.
Recently, keypoint-based methods have achieved remarkable performance in human pose joint localization and object detection [13, 14, 15]. Unlike the regression-based methods, keypoint-based methods localize points without dense mapping. This simplifies the network and makes it possible to consume a higher-resolution input image. In this paper, we propose a vertebra-focused landmark detection method based on keypoint detection. We make the network learn to differentiate vertebrae by localizing the vertebra centers directly. After capturing the vertebrae, we regress the four corner landmarks of each vertebra using convolutional layers. In this way, we keep the order of the landmarks. Experimental results demonstrate the superiority of our method compared to the regression- and segmentation-based methods.
As shown in Fig. 1, the Cobb angles are determined by the locations of landmarks. Each X-ray image we use contains 17 vertebrae from the thoracic and lumbar spine. Each vertebra has 4 corner landmarks (top-left, top-right, bottom-left and bottom-right), so each image has 68 landmarks in total. The relative order of the landmarks is important for accurately localizing the tilted vertebrae. Considering this, we do not localize the 68 points directly from the output feature map, since the model cannot guarantee that the detected points stay at the right positions, especially when there are false positives, which would lead to incorrect landmark ordering. One strategy to address this is to separate the landmarks into different groups, giving an output feature map with 68 channels. However, since each channel of the output feature map then has only one positive point, this strategy suffers from a severe class imbalance between positive and negative points, which hurts model performance.
In this paper, we propose to first localize the 17 vertebrae by detecting their center points. One advantage of this approach is that the center points will not overlap. Therefore, the center points can be used to identify each vertebra without suffering from the touching problem in segmentation-based methods. After the vertebrae are localized, we then capture the 4 corner landmarks of each vertebra from its center point. In this way, we are able to keep the order of landmarks.
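As an illustration of the center-localization step, the following sketch extracts vertebra centers as local maxima of a predicted heatmap. The max-pooling non-maximum-suppression trick is standard in keypoint detection; the function names and the choice of k=17 peaks (one per vertebra) are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def extract_centers(heatmap: torch.Tensor, k: int = 17):
    """Pick the k strongest local maxima of an HxW center heatmap.

    A 3x3 max-pool keeps only positions that equal their local maximum,
    a common non-maximum-suppression trick for keypoint heatmaps.
    """
    pooled = F.max_pool2d(heatmap[None], kernel_size=3, stride=1, padding=1)[0]
    peaks = heatmap * (pooled == heatmap).float()
    scores, idx = peaks.flatten().topk(k)
    w = heatmap.shape[-1]
    return scores, torch.stack([idx % w, idx // w], dim=1)  # (x, y) pairs
```

The returned coordinates live on the downsized feature map and still need the center offset (described below) to map back to input-image coordinates.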
We use the backbone layers conv1-5 to extract high-level semantic features from the input image. We then use skip connections to combine the deep features with the shallow ones, exploiting both high-level semantic information and low-level fine details, similar to [16, 17]. At layer D2, we construct the heatmap, center offset and corner offset maps using convolutional layers for landmark localization.
The keypoint heatmap is widely used in pose joint localization and object detection. For each center point $(\tilde{x}, \tilde{y})$, its ground truth is an unnormalized 2D Gaussian disk (see Fig. 2b), which can be formulated as $\exp\left(-\frac{(x-\tilde{x})^2 + (y-\tilde{y})^2}{2\sigma^2}\right)$, where the radius $\sigma$ is determined by the size of the vertebrae. We use a variant of the focal loss to optimize the parameters, the same as [13, 18]:

$$L_{\text{heat}} = -\frac{1}{N}\sum_{i}\begin{cases} (1-\hat{p}_i)^{\alpha}\log(\hat{p}_i) & \text{if } p_i = 1 \\ (1-p_i)^{\beta}\,\hat{p}_i^{\alpha}\log(1-\hat{p}_i) & \text{otherwise,} \end{cases}$$

where $i$ indexes each position of the feature map, $N$ is the total number of positions on the feature map, and $\hat{p}_i$ and $p_i$ refer to the prediction and ground-truth values, respectively. We set the parameters $\alpha = 2$ and $\beta = 4$ in this paper, following [13, 18].
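A minimal implementation of this loss variant might look as follows; variable names are ours, `pred` is assumed to be the sigmoid output, and we normalize by the number of feature-map positions as in the text (some implementations normalize by the number of positive points instead).

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4):
    """Focal-loss variant for Gaussian heatmaps (CornerNet/CenterNet style).

    pred: predicted probabilities in (0, 1); gt: Gaussian ground truth,
    equal to 1 exactly at the landmark centers.
    """
    eps = 1e-12
    pos = gt.eq(1).float()  # positive positions (Gaussian peaks)
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred).pow(alpha) * torch.log(pred + eps)
    neg_loss = neg * (1 - gt).pow(beta) * pred.pow(alpha) * torch.log(1 - pred + eps)
    return -(pos_loss + neg_loss).sum() / pred.numel()
```

The $(1-p_i)^{\beta}$ factor down-weights the penalty for negatives near a Gaussian peak, so the model is not punished for predicting high scores just beside a center.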
As can be seen from Fig. 2a, the output feature map of the network is downsized compared to the input image. This not only saves computational cost but also alleviates the imbalance between positive and negative points, owing to the reduced output resolution. A position $(x, y)$ on the input image is thus mapped to the location $(\lfloor x/s \rfloor, \lfloor y/s \rfloor)$ on the downsized feature map, where $s$ is the downsampling factor. After extracting the center points from the downsized feature map, we use the center offset to map the points back to the original input image. The center offset is defined as:

$$o = \left(\frac{x}{s} - \left\lfloor \frac{x}{s} \right\rfloor,\; \frac{y}{s} - \left\lfloor \frac{y}{s} \right\rfloor\right).$$
The center offsets at the center points are trained with L1 loss.
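To illustrate, for a downsampling factor s (we assume s = 4 here for concreteness), the quantization error introduced by the floor operation and its recovery can be sketched as:

```python
def center_offset(x, y, s=4):
    """Fractional part lost when (x, y) is mapped to the stride-s feature map."""
    return (x / s - x // s, y / s - y // s)

def recover_position(cx, cy, ox, oy, s=4):
    """Map a feature-map location plus predicted offset back to input coordinates."""
    return ((cx + ox) * s, (cy + oy) * s)
```

For example, an input-image center at (101, 57) maps to feature-map cell (25, 14) with offset (0.25, 0.25); adding the offset back before rescaling recovers (101.0, 57.0) exactly.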
When the center point of each vertebra has been localized, we trace the 4 corner landmarks of the vertebra using corner offsets. The corner offsets are defined as vectors that start from the center and point to the vertebra corners (see Fig. 2b). The corner offset map thus has $4 \times 2 = 8$ channels. We use the L1 loss to train the corner offsets at the center points.
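Reading the 8 offset values at a decoded center then yields the four landmarks directly. A sketch, with the array layout assumed to be (dx, dy) per corner in top-left, top-right, bottom-left, bottom-right order:

```python
import numpy as np

def decode_corners(center, corner_offsets):
    """center: (x, y); corner_offsets: 8 values read from the 8-channel
    corner-offset map at that center. Returns a 4x2 array of landmarks."""
    return np.asarray(center, dtype=float) + np.asarray(corner_offsets, dtype=float).reshape(4, 2)
```

Because all four corners are anchored to one center, the landmarks of each vertebra stay grouped and ordered by construction.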
We use the training data (580 images) of the public AASCE MICCAI 2019 challenge as our dataset. All the images are anterior-posterior X-ray images. Specifically, we use 60% of the dataset for training (348 images), 20% for validation (116 images), and 20% for testing (116 images). Each image contains 17 vertebrae from the thoracic and lumbar spine, and each vertebra is located by 4 corner landmarks. The ground-truth landmarks (68 points per image) are provided by local clinicians. The Cobb angles are calculated using the algorithm provided by AASCE. The input images vary in size (around 2500×1000 pixels).
We implement our method in PyTorch on NVIDIA K40 GPUs. The backbone network is ResNet34, and the decoder produces a downsized output feature map as described above. To reduce overfitting, we adopt standard data augmentation, including random expanding, cropping, and contrast and brightness distortion. The network is optimized with Adam. We train the network for 100 epochs and stop when the validation loss no longer decreases significantly.
Following the AASCE challenge, we use the symmetric mean absolute percentage error (SMAPE) to evaluate the accuracy of the measured Cobb angles:

$$\text{SMAPE} = \frac{1}{M}\sum_{j=1}^{M} \frac{\sum_{k=1}^{3} \left| \hat{a}_{jk} - a_{jk} \right|}{\sum_{k=1}^{3} \left( \hat{a}_{jk} + a_{jk} \right)} \times 100\%,$$

where $k$ indexes the three Cobb angles in the areas of the proximal thoracic (PT), main thoracic (MT) and thoracolumbar (TL) spine, $j$ denotes the $j$-th image, and $M$ is the total number of testing images. $\hat{a}_{jk}$ and $a_{jk}$ refer to the estimated and the ground-truth Cobb angles, respectively. We also report the SMAPE for the PT, MT and TL areas individually, denoted SMAPE$_{\text{PT}}$, SMAPE$_{\text{MT}}$ and SMAPE$_{\text{TL}}$.
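Under one common reading of the challenge metric (per-image ratio of summed absolute differences to summed angles, averaged over images; array names are ours), SMAPE can be computed as:

```python
import numpy as np

def smape(pred, gt):
    """SMAPE (percent) over an M x 3 array of Cobb angles (PT, MT, TL)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    per_image = np.abs(pred - gt).sum(axis=1) / (pred + gt).sum(axis=1)
    return per_image.mean() * 100.0
```

Summing over the three angles before dividing keeps the metric finite even when an individual ground-truth angle is near zero.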
We evaluate the accuracy of the landmarks by comparing the detected landmark locations to the ground-truth landmark locations. The averaged detection error is:

$$E = \frac{1}{N}\sum_{i=1}^{N} \left\| \hat{\ell}_i - \ell_i \right\|_2,$$

where $\hat{\ell}_i$ and $\ell_i$ are the detected and ground-truth landmark locations, respectively, and $N$ is the total number of landmarks over the whole testing set.
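The corresponding computation is a mean Euclidean distance; a sketch assuming N x 2 coordinate arrays:

```python
import numpy as np

def landmark_error(pred, gt):
    """Mean Euclidean distance between N detected and N ground-truth landmarks."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=1).mean()
```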
We compare our method with a regression-based method and a segmentation-based method. The qualitative and quantitative results are shown in Fig. 3 and Table 1. Note that the regression-based method has a smaller input resolution because the parameters of its FC layers are too numerous for the limited GPU memory. The landmarks of the segmentation-based method are decoded from the corner points of the minimum bounding rectangle of each vertebra segmentation mask. We use the same data augmentation and training settings for all baseline methods.
As shown in Fig. 3, the regression-based method performs well in capturing the order of landmarks, owing to the separated channels of the FC layer. However, it fails to capture the landmark locations accurately. One reason is that the small input resolution loses the morphological details of the vertebrae. In addition, our dataset is not large enough for the model to learn well, given the large number of parameters in the FC layers. Unlike the regression-based method, the segmentation-based method captures the landmark locations better with the aid of segmentation masks. We show the overlaid vertebra masks in Fig. 3. However, as can be seen from cases 1 and 6, the segmentation-based method fails to separate connected regions. Moreover, for cases 2-4, it tends to produce corrupted masks due to the ambiguity of the input images. Consequently, the false predictions disrupt the order of the detected landmarks and incur errors in landmark detection and Cobb angle calculation. This is also reflected in Table 1: the segmentation-based method performs worse in the TL area of the spine, where the vertebrae typically appear more ambiguous. In particular, in the TL area, the landmark error of the segmentation-based method is very close to that of the regression-based method.
Compared to the baseline methods, our vertebra-focused method achieves the best performance in both Cobb angle measurement (SMAPE) and landmark detection (Error), as shown in Table 1. We illustrate both the corner offsets and the detected landmarks in Fig. 3. The corner offsets are drawn as colored arrows starting from the decoded center point of each vertebra. From cases 2, 4 and 6, we can see that the vertebra-focused method is robust in localizing vertebrae that have low contrast in the original images. The likely reason is that the model can identify a vertebra from its global morphological features through center localization. We show a failure example in case 5, which suggests that the vertebra-focused network may skip a vertebra whose morphological features are weaker than those of the other vertebrae. However, such a failure does not affect the detection of the remaining vertebrae, indicating that the proposed method has better object reasoning ability.
In this paper, we proposed a vertebra-focused landmark detection method that traces the corner landmarks of a vertebra from its center point. The strategy of predicting center heatmaps enables our model to identify individual vertebrae and to detect landmarks robustly in low-contrast images with ambiguous boundaries. In contrast to the regression- and segmentation-based methods, our vertebra-focused method performs the best in both landmark detection and Cobb angle measurement.
“Cobb angle measurement of spine from X-ray images using convolutional neural network,” Computational and Mathematical Methods in Medicine, 2019.
“Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 483–499.
Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.