Patients suffering from Osteoarthritis (OA) experience pain in their joints due to the degeneration of cartilage and bones. To better understand how OA is affecting the knee joint, it can be analyzed under load, because it then shows different mechanical properties compared to the unloaded case [0000-02]. This analysis can be realized using weight-bearing in-vivo cone-beam computed tomography (CBCT) acquisitions with injected contrast agent visualizing the thin line between femoral and tibial cartilage (Fig. 1a).
A prerequisite for the analysis of cartilage is a prior segmentation of the knee’s structures. The segmentation of cartilage, or of its surface, has mainly been investigated in Magnetic Resonance (MR) acquisitions [0000-04]. Since conventional manual labeling of cartilage in CBCT is very time-consuming, (semi-)automatic approaches using machine learning have been developed. Acetabular cartilage in the hip joint was segmented using a shape-based approach and prior knowledge [0000-05], or by applying a seed-growing algorithm [0000-06], both exploiting the specific shape of the hip joint. Regarding the segmentation of the thin contrast agent line in knee CBCT, Myller et al. [0000-07] were among the first to apply a semiautomatic approach based on model registration and intensity changes. Their approach yielded good results on high-resolution unloaded CT images segmenting the whole femoral and cartilage surface.
In contrast, this work aims to segment only the region of the contrast agent line where femoral and tibial cartilage are in contact. Consequently, the main challenge is the high imbalance in the data between the small contrast agent line and the large background. We propose an automatic segmentation based on a 3D volumetric convolutional neural network. The network is trained and evaluated on manual segmentations of contrast-enhanced knee CBCT volumes. Since the resulting segmentations contain many false positives, as expected given the high class imbalance, a post-processing step that extracts the largest connected point clouds is applied.
2 Materials and methods
The dataset used in this work was acquired under an IRB-approved protocol and contains a total of 40 CBCT scans of 8 subjects in a supine (s) or weight-bearing (w) position. The C-arm (Artis Zeego, Siemens Healthcare GmbH, Erlangen, Germany) acquired 496 (s)/248 (w) projections of size 1240 × 960 pixels with an isotropic pixel size of 0.308 mm on a calibrated vertical (s)/horizontal (w) trajectory. Contrast agent was injected into the knee to visualize the outline of soft tissues. The reconstructions had a size of voxels with an isotropic spacing of 0.2 mm.
The tibia and the thin contrast agent line where tibial and femoral cartilage are in contact were manually segmented slice by slice in the sagittal view by an expert (Fig. 1b, c). In total, only 0.18% of all voxels belonged to the cartilage surface (= positive voxels), resulting in a high imbalance in the annotations.
The dataset was divided into for the training/validation/test group, with no subject represented in both training/validation and test. Due to GPU memory restrictions, the data had to be sub-sampled into smaller volumetric patches of size . To address the high class imbalance, the training and validation data were oversampled such that 70% of the patches had a randomly picked positive voxel in the center and 30% were all-negative patches. Four patches per volume were extracted for training and validation, and data augmentation was applied to the training patches with random rotations of , , and . For testing, the whole volumes were divided into disjoint patches.
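The oversampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_patch_origins`, the per-patch interpretation of the 70%/30% ratio, the clamping of patches at the volume border, and the rejection sampling of all-negative patches are all hypothetical choices. The patch size is passed in as a parameter because its value is not given here.

```python
import random

def sample_patch_origins(shape, positives, patch, n_patches=4,
                         pos_fraction=0.7, seed=None, max_tries=1000):
    """Return n_patches cubic patch origins (z, y, x) inside a volume.

    With probability pos_fraction a patch is centred on a randomly picked
    positive voxel; otherwise an all-negative patch is drawn by rejection
    sampling. Both choices are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    positives = list(positives)

    def contains_positive(origin):
        # True if any positive voxel lies inside the patch at this origin.
        return any(all(o <= c < o + patch for o, c in zip(origin, p))
                   for p in positives)

    origins = []
    for _ in range(n_patches):
        if positives and rng.random() < pos_fraction:
            center = rng.choice(positives)
            # Centre the patch on the voxel, clamped to stay inside the volume
            # (near the border the voxel is then no longer exactly centred).
            origin = tuple(min(max(c - patch // 2, 0), s - patch)
                           for c, s in zip(center, shape))
        else:
            for _attempt in range(max_tries):
                origin = tuple(rng.randrange(s - patch + 1) for s in shape)
                if not contains_positive(origin):
                    break
            else:
                raise RuntimeError("no all-negative patch found")
        origins.append(origin)
    return origins
```

With only four patches per volume, the 70/30 ratio holds in expectation over the dataset rather than exactly per volume in this sketch.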
2.2 Multi-channel volumetric neural network
The architecture we used was a VNet [0000-08], an extension of UNet for volumetric data (Fig. 2). It takes advantage of 3D convolutions, fully convolutional layers, and modified residual connections. Skip connections between the encoder and decoder paths yielded predictions that were both better localized and more confident. The number of convolutions and several of the stages were adapted to the task of cartilage segmentation. SELU was used as the activation function, as it showed stable and relatively fast convergence. Finally, AMSGrad was chosen as the optimization scheme, since it outperformed the Adam optimizer traditionally used with VNet. Its long-term memory of past gradients produced better results and faster convergence on this highly imbalanced dataset [0000-13]. To avoid overfitting, dropout was added with a value of .
2.3 Loss function
The Tversky index [0000-14] described by Equation 1 was chosen as the loss function since it is able to work with highly imbalanced data. represent the negative and positive voxels of the prediction, are the negative and positive voxels in the ground truth annotation. The Tversky index directly takes into account the relation between false positive (FP) and false negative (FN) predictions and provides parameters and to manage the trade-off between both errors. For this specific case, the highest performance was achieved using and .
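In its commonly used form, the Tversky index is TP / (TP + a·FP + b·FN), where a and b are the two trade-off weights, and the loss is one minus this index. The sketch below computes it for flattened soft or binary predictions. The function names, the assignment of `alpha` to false positives and `beta` to false negatives, and the smoothing term `eps` are assumptions made for illustration; the notation may differ from Equation 1, and no particular weight values are implied.

```python
def tversky_index(pred, truth, alpha, beta, eps=1e-7):
    """Tversky index for flat sequences of predictions and labels in [0, 1].

    alpha weights false positives and beta weights false negatives
    (an assumed convention); alpha = beta = 0.5 gives the Dice coefficient.
    """
    tp = sum(p * t for p, t in zip(pred, truth))
    fp = sum(p * (1 - t) for p, t in zip(pred, truth))
    fn = sum((1 - p) * t for p, t in zip(pred, truth))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(pred, truth, alpha, beta):
    """Loss to minimise: one minus the Tversky index."""
    return 1.0 - tversky_index(pred, truth, alpha, beta)
```

Raising the weight on one error type makes that error more costly, which is how the trade-off between false positives and false negatives is steered during training.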
2.4 Connected component analysis
To reduce the high number of false positive predictions caused by the data imbalance, the resulting segmentations were post-processed with a connected component analysis. Since the two surfaces of medial and lateral cartilage are expected to be the largest segmented connected point clouds, all but the two largest connected components were discarded. If this assumption did not hold, the point clouds corresponding to the cartilage surface were selected manually.
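The post-processing step can be sketched as a breadth-first search over the set of predicted positive voxels. This is an illustrative sketch only: the 26-voxel neighbourhood, the representation of the segmentation as a set of (z, y, x) tuples, and the function names are assumptions, since the connectivity used in this work is not stated.

```python
from collections import deque
from itertools import product

# 26-connectivity (assumed): all neighbours differing by at most 1 per axis.
NEIGHBOURS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

def connected_components(voxels):
    """Group a collection of (z, y, x) voxels into connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBOURS:
                neighbour = (z + dz, y + dy, x + dx)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

def keep_largest(voxels, k=2):
    """Keep only the k largest components (k = 2: medial and lateral surface)."""
    components = sorted(connected_components(voxels), key=len, reverse=True)
    return set().union(*components[:k])
```

For full-size volumes, a library routine operating on dense arrays would be considerably faster than this pure-Python version.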
To evaluate the network’s performance, accuracy, precision, recall, and the Dice index were computed. An accuracy of 99% was achieved due to the high number of correctly classified negatives. An average recall of 0.69, precision of 0.24, and Dice index of 0.35 were achieved. Figure 3a shows one slice of the network output containing many false positives. After the connected component analysis, the ground truth labels and the predictions show a high overlap (Fig. 3b and c). The connected component analysis successfully chose the correct patches in most of the test cases, and only one had to be adapted manually.
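The following sketch shows how the four metrics relate to the voxel counts and why accuracy is uninformative at this level of imbalance. The counts in the usage example are hypothetical: they describe an imagined volume of 1,000,000 voxels with 0.18% positives, chosen so that precision and recall match the reported averages, and they are not measured data.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and Dice index from voxel counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "dice": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }

# Hypothetical counts: 1,800 positive voxels out of 1,000,000.
metrics = segmentation_metrics(tp=1242, fp=3933, fn=558, tn=994267)
# precision = 0.24, recall = 0.69, dice ≈ 0.356, accuracy ≈ 0.9955

# Predicting background everywhere would already reach 99.82% accuracy.
all_negative = segmentation_metrics(tp=0, fp=0, fn=1800, tn=998200)
```

In this hypothetical case, the Dice index of about 0.356 lies close to the reported average of 0.35, while accuracy stays above 99% regardless of how well the cartilage surface is found.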
The proposed network shows promising results for the task of knee cartilage surface segmentation. Despite the use of oversampling and Tversky loss, the high imbalance still led to a high false positive rate. Using mainly patches containing positive voxels for training led the model to learn that even patches in the periphery of the knee joint should contain positive voxels (Fig. 3a). Since these peripheral false segmentations are small and closely connected, the connected component analysis applied in post-processing was able to remove them and predict the desired cartilage segmentation in a stable way (Fig. 3b). Only the false positives in the segmentation’s proximity could not be removed (Fig. 3c).
We see the connected component post-processing step as an intermediate solution. In the future, we want to investigate an enhancement of the network that uses the prior knowledge that the segmentation is a continuous 1D line in the sagittal view. This can be achieved following the learning with known operators paradigm [0000-15] by including either the connected component analysis or a polynomial fitting step directly into the network.
An additional reason for the high false positive rate is the current way of dividing the volume into patches, thereby restraining the network from learning the spatial relation of the cartilage contact area between femur and tibia. The border between patches can even be seen in the resulting segmentation (Fig. 3c). The reason for dividing the volume into patches is the hardware limitation due to the large size of medical data. A solution with bigger patches or even the full volume could be achieved using reversible networks as proposed in [0000-16].
Note that the manual segmentations used as ground truth are one-pixel-thin lines in the sagittal view, meaning that a one-pixel shift directly results in false predictions. However, the contrast agent in the cartilage contact area is in most cases multiple pixels thick, leading the network to predict a point cloud instead of only a thin line (Fig. 3b). The consequence is directly observable in our reported metrics: precision is very low due to the many false positives, while recall is good because most of the true labels are contained in the predicted point clouds. As these quantities enter the loss function and therefore guide the training, we hope that enriching the network with prior knowledge or a polynomial fitting step can stabilize the training and overcome this instability.
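A toy 2D example illustrates both effects. The pixel coordinates, the line length of 10 pixels, and the three-pixel thickness are arbitrary values chosen for illustration only.

```python
def precision_recall(pred, truth):
    """Precision and recall for two sets of pixel coordinates."""
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Ground truth: a one-pixel-thin vertical line in column 5.
truth = {(row, 5) for row in range(10)}

# The same line shifted by one pixel no longer overlaps at all.
shifted = {(row, 6) for row in range(10)}

# A three-pixel-thick prediction contains the line but adds false positives.
thick = {(row, col) for row in range(10) for col in (4, 5, 6)}

shifted_scores = precision_recall(shifted, truth)  # (0.0, 0.0)
thick_scores = precision_recall(thick, truth)      # precision 1/3, recall 1.0
```

The thick prediction recovers every labeled pixel yet reaches a precision of only one third, which mirrors the pattern of good recall and low precision described above.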
The presented results confirm the complexity of this highly imbalanced task, but show promising results towards a fully automatic cartilage segmentation in CBCT. Even though there are still many false positives in the final segmentation (Fig. 3c), the proposed method can help to facilitate and accelerate the process of analyzing cartilage thickness in the clinical field.
This work was supported by the Research Training Group 1773 Heterogeneous Image Systems, funded by the German Research Foundation (DFG). Further, the authors acknowledge funding support from NIH 5R01AR065248-03 and NIH Shared Instrument Grant No. S10 RR026714 supporting the zeego@StanfordLab.