Multiple analysis and quantification tasks for evaluating shoulder instability and planning preoperative shoulder diagnosis rely on analyzing the structure and morphology of the human humerus and scapula from volumetric images generated by computed tomography (CT) or magnetic resonance (MR) imaging. In order to prepare an accurate preoperative diagnosis of the humerus and scapula for evaluating pathologies, the volumetric data first needs to be segmented, after which the structural and morphological analysis of the two bones can be applied [chuang2008use, acid2012preoperative]. Manually labeling the humerus and scapula in three-dimensional (3D) space is a tedious, laborious, and time-consuming task. A 3D CT/MR image is composed of 2D slices stacked along a specific scanning direction, so manual labeling of a 3D image is in fact the labeling of many 2D images, which costs considerable time and effort. Moreover, it is hard for humans to take spatial characteristics, such as the relation between slices and the continuity of shape across different axes, into consideration while manually labeling 2D slices. Thus, an effective automatic 3D segmentation algorithm for extracting the humerus and scapula is necessary, and it can offer reliability and repeatability in various medical applications.
Some previous automatic and semi-automatic approaches focus on finely scanned CT shoulder bone images [sharma2013adaptive]. CT scanning offers high-contrast bone imaging, but its moderate-to-high radiation dose may limit its wide use for high-resolution, large-region 3D shoulder imaging. On the other hand, due to the heavy clinical demand for preoperative MR shoulder scanning, obtaining high-resolution (i.e., low slice thickness) MR images is quite impractical, since high-resolution MR scanning is time-consuming in clinical practice. In addition, the scanning parameters may differ from case to case, which increases the variance of the images. Fig. 1 exhibits two 3D MR shoulder scans from the experimental dataset (with a total of 50 scans) of this paper. The two scans have complex and diverse bone structures, and their original MR images also differ in intensity and contrast. Moreover, all the experimental MR images have been coarsely scanned in the axial direction for fast bone region location, and the images have resolutions between and , and sizes between and
. The high divergence of imaging parameters (e.g., resolution, size) and patient conditions (e.g., bone shape) makes the humerus and scapula segmentation problem difficult. Currently, atlas-based automated extraction approaches and statistical shape models (SSMs) of bony structures form the basic frameworks for scapula and humerus segmentation. An atlas is a pair consisting of an image scan and a corresponding manual labeling mask. Atlas-based segmentation is estimated using image registration: single or multiple atlases are registered to a target image, and the propagated labels are fused to extract the final segmented masks. Wang et al. proposed a representative multi-atlas approach using joint label fusion [wang2013multi]
. This approach computes the atlas voting weights by minimizing a total expected fusion error, and it achieves good accuracy. On the development of SSMs, Mutsvangwa et al. reported an improved pipeline to construct automated, unbiased global SSMs by employing an iterative median closest point-Gaussian mixture model (IMCP-GMM) method [mutsvangwa2014an]. Although both methods can tolerate more variation in the shape and appearance of bones, they still depend heavily on the availability and proper selection of training samples for the computation of label fusion weights or the generation of the mean virtual shape. Moreover, under the complicated imaging conditions mentioned above, the manually labeled bone ground truth (GT) masks may suffer from non-continuous boundaries in 3D space, introducing strong biases that degrade the segmentation performance of both types of methods.
In this paper, we formulate the humerus and scapula segmentation as a deep end-to-end network, and its structure is shown in Fig. 2
. The 3D bone extraction is constructed based on the organ prediction branch of the previous work, a fully convolutional network for the inference of the organ probability map [tan2018deep]
. The proposed segmentation architecture exploits deep convolutional neural networks to infer tissue contextual information at different resolution levels of the image and then makes a prediction at every voxel. In each resolution stage, a residual connection is also employed to ease the vanishing-gradient problem in training as the network goes deeper. Thus, without manually selecting proper reference data or setting handcrafted models, the proposed network is able to learn a hierarchical representation of the shape-varied bone data under coarse imaging conditions. To further improve the trained model when the GT labels are imperfect, we introduce a self-reinforced learning strategy that employs the currently trained model to generate more, higher-quality data to support the next round of model training. In this way, the proposed method achieves effective bone marking of the two important bones (i.e., humerus and scapula) in the coarsely scanned 3D MR shoulder data, and the results can be used to derive an initial shoulder preoperative diagnosis.
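The multi-resolution design described above can be illustrated by tracing feature-map sizes through a two-level encoder-decoder. This is a minimal sketch; the input patch size and channel counts are illustrative assumptions, not the values used in the paper.

```python
# Trace feature-map shapes through a two-level encoder-decoder.
# Patch size (32, 128, 128) and base_channels=16 are illustrative assumptions.

def trace_unet_shapes(depth, height, width, base_channels=16, levels=2):
    """Trace (D, H, W, C) through `levels` pooling stages and back up."""
    shapes = []
    d, h, w, c = depth, height, width, base_channels
    # Encoding path: conv block + residual connection, then 2x2x2 pooling
    # halves each spatial dimension while the channel count doubles.
    for level in range(levels):
        shapes.append(("enc_level_%d" % level, (d, h, w, c)))
        d, h, w, c = d // 2, h // 2, w // 2, c * 2
    shapes.append(("bottleneck", (d, h, w, c)))
    # Decoding path: deconvolution doubles the resolution; the skip
    # connection concatenates the equivalent-resolution encoder maps,
    # and a localization block fuses the channels back down.
    for level in reversed(range(levels)):
        d, h, w, c = d * 2, h * 2, w * 2, c // 2
        shapes.append(("dec_level_%d" % level, (d, h, w, c)))
    return shapes

for name, shape in trace_unet_shapes(32, 128, 128):
    print(name, shape)
```

The trace shows why only two pooling stages keep GPU memory manageable on large 3D volumes: each extra level doubles the channel count at an eighth of the voxels, but also deepens the decoder that must be held in memory during backpropagation.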
2.1 Deep end-to-end network for humerus and scapula segmentation
In this section, we present the deep end-to-end network for bone tissue prediction and segmentation. As shown in Fig. 2, the main structure of each task is designed in a symmetric encoding-decoding fashion. Training on the large-size 3D MR shoulder data requires very high GPU memory consumption; hence, we shorten the depth of both the encoding and decoding paths and reduce the number of pooling and deconvolution operations to 2, respectively. In the encoding direction, at each resolution level, a convolution block is utilized for feature abstraction and followed by a residual connection. The successive multi-resolution encoder captures multi-scale contextual information, which helps it perceive the integral bone structure as well as the background surrounding the target tissue. In the decoding part, after the skip connection that concatenates the up-sampled low-level feature maps with the skipped equivalent-resolution maps from the encoding half, we utilize a localization block [isensee2017brain]
to fuse the concatenated features. The multi-class cross entropy is applied and the loss function is
\[
\mathcal{L} = -\frac{1}{|\Omega|} \sum_{v \in \Omega} \sum_{c=1}^{C} g_c(v) \log p_c(v) .
\]
Here $C = 3$ is the number of classes, representing humerus, scapula, and background. For the $c$-th class, $p_c(v)$ and $g_c(v)$ are the predicted probability and the GT at voxel $v$, where $\Omega$ is the volume data space.
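As a concrete reference, the loss can be written in a few lines of NumPy. The flattened (num_voxels, num_classes) layout and the array names are illustrative assumptions; in the actual network this is computed on the GPU by the framework.

```python
import numpy as np

def multiclass_cross_entropy(probs, onehot_gt, eps=1e-12):
    """Mean over voxels of -sum_c g_c(v) * log p_c(v).

    probs, onehot_gt: arrays of shape (num_voxels, num_classes);
    each row of onehot_gt is a one-hot GT label.
    """
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    per_voxel = -np.sum(onehot_gt * np.log(probs), axis=1)
    return float(per_voxel.mean())
```

For a confident, correct prediction the loss approaches zero, while a uniform prediction over the three classes gives log 3, roughly 1.1.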
2.2 Self-reinforced learning
The network in Section 2.1 can obtain good initial bone segmentation masks from the inaccurate GT labels and scarce training data, yet it still has room to improve the segmentation performance. Hence, a self-reinforced learning strategy is proposed, and its flowchart is shown in Fig. 3. After the initial training, the trained model and augmentation techniques (e.g., distortion) are utilized to generate higher-quality labels and extend the training set, which are then used in the subsequent training rounds. The labels produced by the trained model have more continuous boundaries and represent the target regions more precisely. In our experience, two extra training rounds make the model converge and reach its highest performance.
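The strategy can be summarized as a runnable skeleton. Here `train`, `predict_labels`, and `augment` are hypothetical stand-ins for the actual network training, inference, and augmentation (e.g., distortion) steps; only the control flow is meant to match the description above.

```python
def self_reinforced_training(images, initial_labels, rounds=3,
                             train=None, predict_labels=None, augment=None):
    """Round 0 trains on the original (possibly imperfect) GT labels;
    each later round retrains on model-generated, augmented labels.

    train, predict_labels, augment are caller-supplied stand-ins for the
    real training, inference, and augmentation routines.
    """
    labels = list(initial_labels)
    model = None
    history = []
    for r in range(rounds):
        model = train(model, images, labels)
        history.append((r, len(labels)))
        # Use the current model to relabel the data with more continuous
        # boundaries, then extend the training set via augmentation.
        refined = [predict_labels(model, img) for img in images]
        labels = refined + [augment(lbl) for lbl in refined]
        images = images + images  # images paired with the augmented labels
    return model, history
```

With `rounds=3` this performs the initial training plus the two extra rounds that, in our experience, bring the model to convergence.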
3.1 Experiment settings
In this study, we validate the proposed method on a dataset of 50 3D MR shoulder images. Because of the high divergence of imaging parameters and patient conditions, all the images are resampled and cropped so that they have the same voxel spacing (, , ) and size (). In addition, N4 bias field correction is applied to all the data, and the pixel intensity is normalized to [0, 1]. To validate the effectiveness of the self-reinforced learning, we carry out a 5-fold cross-validation on the experimental dataset. The dataset is randomly split into 5 mutually exclusive sets, and each fold uses four of the five sets for training and the remaining set for testing. We then compare the proposed method to the widely used multi-atlas segmentation with joint label fusion (MALF) [wang2013multi]
. The proposed network (using the model trained above) and the MALF (using 15 selected atlases as references) are implemented with the TensorFlow and Matlab libraries, and the two methods are tested on the same testing dataset. For validation, the Dice similarity coefficient (DSC), Hausdorff distance (HD), and average surface distance (ASD) between the GT labels and the segmented results are reported. In the network training, the batch size is set to 1 and the Adam solver (the learning rate is initialized as and multiplied by a factor of every 10 epochs) is employed.
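The 5-fold protocol described above can be sketched with a simple random partition of subject indices; the random seed and index bookkeeping are illustrative assumptions.

```python
import random

def five_fold_splits(num_samples, num_folds=5, seed=0):
    """Randomly partition sample indices into mutually exclusive folds,
    then return (train_indices, test_indices) for each fold."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into mutually exclusive folds.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    splits = []
    for k in range(num_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits

# With 50 subjects, each fold tests on 10 and trains on the other 40.
splits = five_fold_splits(50)
```

Every subject appears in exactly one test set across the five folds, so the reported metrics cover the whole dataset without train/test leakage.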
3.2 Experiment results
Fig. 4 shows 2D visual comparisons on one subject to validate the self-reinforced learning. In Fig. 4 (d) to (f), the initial model from training round 0 (R0) can locate the main bone areas and describe their basic structures, yet it misses some scapula areas. After deploying the self-reinforced learning to optimize the training, as demonstrated in Fig. 4 (g) to (i) and (j) to (l), the trained model progressively segments the whole scapula and reconstructs more complete humerus and scapula shapes. Comprehensive 3D segmentation views of the same subject are shown in Fig. 5. In Fig. 5 (a), the surface and boundary of the GT labels are non-continuous and non-smooth, which even causes some inconsistency errors between the GT slices. After the self-reinforced processing, from Fig. 5 (c) to (d), the recursively trained models refine the humerus and scapula masks, smoothing their surfaces while keeping the details of the bone shapes. The model trained at R2 obtains the best quantitative performance on this subject: for the humerus, DSC (0.918), HD (5.099), and ASD (0.764); for the scapula, DSC (0.734), HD (12.329), and ASD (0.784). Besides the comparisons on this single subject, the overall quantitative measurements obtained with the 5-fold cross-validation are shown in Table 1. In general, R2 of the self-reinforced learning produces the best mean results among the three rounds. Moreover, the shape of the humerus has much higher consistency than that of the scapula, so the prediction of our method on the humerus achieves a higher mean DSC score.
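The DSC values reported above measure the volume overlap between a predicted mask P and the GT mask G, DSC = 2|P ∩ G| / (|P| + |G|). A minimal NumPy version, assuming binary masks of equal shape:

```python
import numpy as np

def dice_similarity(pred, gt):
    """DSC = 2|P ∩ G| / (|P| + |G|) for binary masks of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```

A DSC of 1.0 means identical masks and 0.0 means no overlap, which is why the consistently shaped humerus scores higher than the thin, spiny scapula.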
Having shown the effectiveness of the proposed network and the self-reinforced learning in refining the continuity and smoothness of the segmented structures, we also visually compare the performance of the proposed method with the widely used MALF approach. Comparing the results in Figs. 6 and 7, both methods are able to locate the two bones, while the proposed one overlaps more areas with the GT. The MALF produces some rough “hole” defects on the segmented bone surfaces and also generates several spatially isolated segmentation errors. Moreover, the MALF's 3D masks show obvious leaking on the humerus while missing some bone tissue in the scapula part. All of these problems of the MALF may be caused by improper selection of reference atlas pairs, and the weak appearance information and varied anatomical structures in the experimental data may make the similarity estimation inaccurate in the joint fusion step. In contrast, our method alleviates the isolated labeling errors and preserves the spiny shape of the scapula. In Table 2
, we report the mean evaluation metrics, and the proposed method outperforms the MALF. Thus, the proposed method is more robust for segmenting the small dataset of low-contrast, high-shape-variability 3D MR data.
In the present work, we propose a deep end-to-end network and a self-reinforced learning strategy to segment the humerus and scapula in low-contrast, high-shape-variability 3D MR shoulder data. The network has a U-shaped encoder-decoder structure that formulates the bone extraction as a deep-learning-based semantic segmentation. To further improve the segmentation accuracy and ensure the continuity and smoothness of the segmented structures, we introduce a self-reinforced learning mechanism. Starting from the small dataset with inaccurate GT labels, this process utilizes the initial segmentation model to recursively extend the training set with newly generated, higher-quality labels and improve the next round of training. In the experiments, the proposed method achieves accurate segmentation, as evaluated with a 5-fold cross-validation, and shows superior performance compared with the MALF approach.