Osteoarthritis (OA) is the most prevalent knee joint disease in the United States. OA eventually leads to chronic disability (cdc), making early detection very important. OA is characterized in its early stages by the degeneration of articular cartilage. To assess the integrity of the cartilage, its biochemical composition needs to be measured (jose1; jose2). Several compositional MRI techniques have been introduced that are sensitive either to proteoglycan (delayed gadolinium-enhanced MRI [dGEMRIC], sodium MRI, glycosaminoglycan chemical exchange saturation transfer [gagCEST]) or to a combination of components (T2 relaxation time, relaxation time in the rotating frame [T1ρ], magnetization transfer). Recently, DTI was introduced as a novel biomarker that can capture proteoglycan content and collagen structure simultaneously (jose1; jose2; jose3). A significant limitation to the routine clinical use of these advanced MRI techniques is the lengthy image processing time (jose3). To measure MRI parameters in cartilage, the cartilage must first be segmented. This is usually done by a human expert, a trained radiologist, and takes a few hours to complete for all cartilage plates (tibia, femur, patella) of a single patient. Manual segmentation is slow, error-prone, and does not scale. Deep learning models have successfully performed fast and accurate segmentation of brain tissue (darts; slant), tumors (kamnitsas2015multi), the pancreas (cai2016pancreas), cardiac substructures (avendi2016combined), and other anatomical structures. Here, we propose deep learning methods to automate the segmentation of knee cartilage.
We work with a dataset of diffusion-weighted MRI of knees of OA patients, comprising 71 MRI scans of patients whose OA severity ranges from none (Kellgren-Lawrence [KL] score of 0) to severe (KL=3) (klscore). DTI images were acquired using a radial imaging spin-echo diffusion sequence (RAISED; TR/TE=1500/35 ms, matrix=256×256, 15 slices, resolution=0.6×0.6×3 mm, b value=300 s/mm², 6 directions, 105 spokes/image, acquisition time 17:50 min). For a good segmentation, one must be able to distinguish articular cartilage from fluid, which have similar voxel intensities in the diffusion-weighted images. We compute mean diffusivity (MD) and fractional anisotropy (FA) maps from the seven contrast images, which helps resolve the ambiguity between cartilage and fluid (jose1). Together, the seven contrast images and the MD and FA maps make the input for an individual patient a matrix of size 256×256×15×9. Ground-truth labels for all cartilage plates (lateral and medial tibia, femur, and patella), associated with each diffusion-weighted map, were generated by a musculoskeletal radiologist. A subset of five images was re-segmented by the radiologist to provide a benchmark for agreement with the human expert (a sample map and segmentation are shown in Fig 2).
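As a rough illustration of how the MD and FA maps follow from the diffusion data, both are simple functions of the eigenvalues of the voxel-wise diffusion tensor. The NumPy sketch below assumes the tensor has already been fitted to the contrast images; the function name is illustrative, not from our pipeline:

```python
import numpy as np

def md_fa_from_eigenvalues(evals):
    """Compute mean diffusivity (MD) and fractional anisotropy (FA)
    from the three diffusion-tensor eigenvalues per voxel.

    evals: array of shape (..., 3), eigenvalues of the fitted tensor.
    """
    evals = np.asarray(evals, dtype=float)
    md = evals.mean(axis=-1)                    # MD: mean eigenvalue
    diff = evals - md[..., None]                # deviation from the mean
    norm = np.sqrt((evals ** 2).sum(axis=-1))   # eigenvalue magnitude
    # FA: normalized standard deviation of the eigenvalues, in [0, 1]
    fa = np.sqrt(1.5 * (diff ** 2).sum(axis=-1)) / np.maximum(norm, 1e-12)
    return md, fa

# Isotropic diffusion gives FA = 0; fully anisotropic gives FA = 1
md, fa = md_fa_from_eigenvalues([1.0, 1.0, 1.0])
```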
| Segment | Ave. Voxel Count | % |
We split the dataset into training (80%), validation (10%), and test (10%) sets, making sure that the subjects in these sets are disjoint. We preprocess each channel independently using min-max normalization to scale it to the range [0, 1]. In addition, to complement our limited data, the training set was augmented with random rotations and random horizontal and vertical shifts of the images, perturbations that can occur during MRI acquisition because of small patient movements. We did not use vertical or horizontal flipping, as these are not natural perturbations in our setting.
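The normalization and shift augmentation above can be sketched in NumPy as follows; `max_shift` and the function names are illustrative, not the values used in our pipeline, and the rotation augmentation (typically done with a library routine) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def minmax_normalize(channel):
    """Scale one input channel to the [0, 1] range."""
    lo, hi = channel.min(), channel.max()
    if hi == lo:
        return np.zeros_like(channel, dtype=float)
    return (channel - lo) / (hi - lo)

def random_shift(image, max_shift):
    """Shift a 2D slice by a random (dy, dx), zero-filling the exposed border."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape
    out = np.zeros_like(image)
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out
```

Each of the nine channels is normalized independently, so the contrast images and the MD/FA maps all land on a comparable scale before being stacked.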
We model the segmentation task as a five-class classification problem: four cartilage plates and background. Since our dataset is highly imbalanced (Table 2), we optimize a weighted Dice loss (wcel), following (wcel) to estimate the class weights. We use the Adam optimizer (adam) to optimize the model parameters. We assume that each spatial slice is independent of the others; in other words, the location of a slice within the 3D volume does not matter. We validated this assumption by training a 3D convolutional model, which did not change our results (Table 4).
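A weighted soft Dice loss of the kind described above can be sketched as follows; this NumPy version operates on flattened per-voxel class probabilities, and the weight vector (e.g., inverse class frequency) is an assumption for illustration:

```python
import numpy as np

def weighted_dice_loss(probs, onehot, weights, eps=1e-6):
    """Soft weighted Dice loss.

    probs:   (N, C) predicted class probabilities per voxel.
    onehot:  (N, C) one-hot ground-truth labels.
    weights: (C,)   per-class weights, e.g. inverse class frequency,
                    to counter the heavy background/cartilage imbalance.
    """
    inter = (probs * onehot).sum(axis=0)          # per-class overlap
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    dice_per_class = (2.0 * inter + eps) / (denom + eps)
    w = np.asarray(weights, dtype=float)
    # Weighted average of per-class Dice, turned into a loss
    return 1.0 - (w * dice_per_class).sum() / w.sum()
```

A perfect prediction drives the weighted Dice toward 1 and the loss toward 0; up-weighting the small cartilage classes keeps the background from dominating the gradient.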
We train the original U-Net proposed in (unet) and a variant of U-Net with dilated convolutions (Fig 4). Analyzing the predictions of each trained model, we find that the two models make qualitatively different errors (Fig 8). As our final model, we therefore ensemble the two networks with a three-layered CNN that takes the output probability maps of both networks as input, which achieved better performance (Table 4).
| Model |  |  |  |
| --- | --- | --- | --- |
| 3D U-Net (cciccek20163d) | 0.618 | 0.697 | 0.378 |
| 2D Dilated U-Net | 0.670 | 0.771 | 0.621 |
| Ensemble of 2D U-Net and Dilated U-Net | 0.689 | 0.783 | 0.640 |
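The fusion step can be sketched as below; for brevity this stand-in uses a single learned 1×1 convolution (a per-pixel linear map followed by a softmax) over the concatenated probability maps, rather than the full three-layered CNN used in the paper:

```python
import numpy as np

def ensemble_combine(probs_a, probs_b, w, b):
    """Fuse two networks' class-probability maps with a 1x1 convolution.

    probs_a, probs_b: (H, W, C) softmax outputs of the two U-Nets.
    w: (2C, C) fusion weights; b: (C,) bias. Both are stand-ins for the
    learned three-layered CNN; a 1x1 conv keeps the sketch minimal.
    """
    x = np.concatenate([probs_a, probs_b], axis=-1)       # (H, W, 2C)
    logits = x @ w + b                                    # per-pixel linear map
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)              # softmax over classes
```

Because the two networks make different kinds of errors, even a small learned fusion layer can outperform either network alone, as Table 4 shows for the full ensemble.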
Estimating Human Performance: Due to the low contrast of DW-MRI images, the estimation of the correct label for each pixel is inherently noisy. This is corroborated by the observation that when the same human expert segments some of the images again, the Dice score calculated on these re-segmented images is substantially below 1 (Table 4). Further, we visually observe that the distributions of Dice scores (taking the first segmentation as ground truth) for the human expert and for our model over all re-segmented images are very similar, implying that the human expert and our model make errors of a similar nature (Fig 9).
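The agreement measure used here is the Dice overlap between two label maps, which for a single class can be computed as:

```python
import numpy as np

def dice_score(seg_a, seg_b, label):
    """Dice overlap between two label maps for one class.

    Returns 2|A ∩ B| / (|A| + |B|); defined as 1.0 when the class
    is absent from both maps.
    """
    a, b = (seg_a == label), (seg_b == label)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```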
Confidence Map for Predictions: Fig 5 shows an instance where our model predicted the existence of a tissue that was absent from the ground truth. The radiologist confirmed that this region was mislabelled. Since the ground-truth segmentations are noisy, we calculate a confidence map to convey the certainty of the model's predictions. For each voxel, we compute the log odds ratio, $\log\frac{p}{1-p}$, where $p$ is the maximum probability over all classes, to quantify our confidence in the prediction at that voxel. We present this confidence map along with our segmentation to the radiologists (Fig 6). These confidence maps are particularly useful because diffusion-weighted MRI has poor contrast between cartilage and surrounding tissues, making the labels around cartilage boundaries ambiguous and prone to error.
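The log-odds confidence map can be computed directly from the softmax output; a minimal sketch:

```python
import numpy as np

def confidence_map(probs, eps=1e-6):
    """Per-voxel log-odds confidence: log(p / (1 - p)),
    where p is the maximum class probability at that voxel.

    probs: (..., C) softmax probabilities; eps guards against
    division by zero at p = 1.
    """
    p = np.clip(probs.max(axis=-1), eps, 1 - eps)
    return np.log(p / (1 - p))
```

A voxel where the model is undecided (p = 0.5) gets confidence 0, while confidently labelled voxels get large positive values, which is what makes the map easy to read as an overlay.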
Sensitivity of Inputs for Trained Model: We selectively set one input channel to zero and measured the model's performance (Dice score) to understand how much the model relies on each channel. Performance drops drastically when any channel is removed, and most when the MD or FA map is zeroed out. Further, performance was almost unchanged when we permuted the seven contrast maps, implying that the model gives similar weight to each of the seven contrast maps while treating the computed MD and FA maps distinctly. Giving similar weights to, or averaging over, the contrasts helps the model stay invariant to noise added to the inputs.
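The ablation probe amounts to zeroing one input channel before re-running inference and re-scoring; a sketch with an illustrative function name:

```python
import numpy as np

def ablate_channel(x, ch):
    """Return a copy of the input with one channel zeroed out.

    x: (H, W, C) multi-channel input slice; ch: channel index to ablate.
    The original array is left untouched so the un-ablated baseline
    can be evaluated alongside.
    """
    out = x.copy()
    out[..., ch] = 0.0
    return out
```

Looping `ch` over all nine channels and comparing the resulting Dice scores against the un-ablated baseline yields the per-channel sensitivities reported above.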
We built a deep learning system that performs semi-automatic segmentation of knee cartilage from diffusion-weighted MRI in less than a minute on a non-GPU machine. We showed that the U-Net-based models perform similarly to a human expert. This model has been deployed for clinical trials, and radiologists are currently using it to characterize the progression of OA. We find the confidence maps particularly helpful in directing radiologists to the pixels that need attention. In practice, the segmentation can be used out of the box in most cases without any manual correction.
We thank Dr. Bonnie K Ray, Dr. David Rosenberg, Dr. Narges Razavian, and Dr. Cem Deniz for useful discussions and guidance. Research reported in this manuscript was supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) of the National Institutes of Health (NIH) under award number R01AR067789. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Appendix A Additional Results
In this section, we provide additional results of our computational experiments:
Fig 7 shows specific instances of the model being conservative or aggressive relative to the ground truth.
Fig 9 shows the distributions of Dice scores on re-segmented images for both the human expert and our model. The segmentation produced by the human expert the first time is taken as the ground truth for the Dice score calculations. Visually, the distributions look similar; hence, one can infer that the human expert and the model make errors of a similar nature.