Knee Cartilage Segmentation Using Diffusion-Weighted MRI

12/04/2019 ∙ by Alejandra Duarte, et al. ∙ NYU college 0

The integrity of articular cartilage is a crucial aspect in the early diagnosis of osteoarthritis (OA). Many novel MRI techniques have the potential to assess compositional changes of the cartilage extracellular matrix. Among these techniques, diffusion tensor imaging (DTI) of cartilage provides a simultaneous assessment of the two principal components of the solid matrix: collagen structure and proteoglycan concentration. DTI, like any other compositional MRI technique, requires a human expert to segment the cartilage manually. Manual segmentation is error-prone and time-consuming (a few hours per subject). We use an ensemble of modified U-Nets to automate this segmentation task. We benchmark our model against a human expert's test-retest segmentation and find that, using the Dice score as the comparison metric, our model is superior for patellar and tibial cartilage. We also perform a perturbation analysis to understand the sensitivity of our model to the different components of its input, and we provide confidence maps for the predictions so that radiologists can adjust the model predictions as required. The model has been deployed in practice. In conclusion, cartilage segmentation on DW-MRI images with modified U-Nets achieves accuracy that outperforms the human segmenter. Code is available at




1 Background

Osteoarthritis (OA) is the most prevalent knee joint disease in the United States. OA eventually leads to chronic disability (cdc), which makes its early detection very important. In its early stages, OA is characterized by the degeneration of articular cartilage. To assess the integrity of the cartilage, its biochemical composition needs to be measured (jose1; jose2). Several compositional MRI techniques have been introduced that are sensitive either to proteoglycan (delayed Gadolinium-enhanced MRI [dGEMRIC], Sodium-MRI, glycosaminoglycan chemical exchange saturation transfer [gagCEST]) or to a combination of components (T2 relaxation time, relaxation time in the rotating frame [T1ρ], magnetization transfer). Recently, DTI was introduced as a novel biomarker that can capture proteoglycan content and collagen structure simultaneously (jose1; jose2; jose3). A significant limitation to the use of these advanced MRI techniques in clinical routine is the lengthy image processing time (jose3).

To measure MRI parameters in cartilage, the cartilage needs to be segmented. This is usually done by a human expert, a trained radiologist, and it takes a few hours to segment all cartilage plates (tibia, femur, patella) of one patient. This way of segmenting is not scalable, is error-prone, and is extremely slow. Deep learning-based models have been successful in performing quick and accurate segmentation of brain tissues (darts; slant), tumors (kamnitsas2015multi), the pancreas (cai2016pancreas), cardiac substructures (avendi2016combined), and other anatomical structures. Here, we propose to use deep learning methods to automate the segmentation of knee cartilage.

2 Methodology

We work with a dataset of diffusion-weighted MRI of the knees of OA patients. It contains 71 MRI scans of patients whose OA severity ranges from none (Kellgren-Lawrence [KL] score of 0) to severe (KL=3) (klscore). DTI images were acquired using a radial imaging spin-echo diffusion sequence (RAISED, TR/TE=1500/35 ms, matrix=256×256, 15 slices, resolution=0.6×0.6×3 mm³, b value=300 s/mm², 6 directions, 105 spokes/image, acquisition time 17:50 min). For a good segmentation, one should be able to distinguish between articular cartilage and fluid, which have similar voxel intensities in the diffusion-weighted images. From the seven contrast images we compute mean diffusivity (MD) maps and fractional anisotropy (FA) maps, which help to distinguish cartilage from fluid (jose1). Together with the seven contrast images, the MD and FA maps make the input for an individual patient an array of size 256×256×15×9. The ground truth labels of all cartilage plates (lateral and medial tibia, femur, and patella) associated with each diffusion-weighted map were generated by a musculoskeletal radiologist. A subset of five images was re-segmented by the radiologist to provide a benchmark for expert human agreement (a sample map and segmentation are shown in Fig 1).
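The MD and FA computation described above can be sketched for a single voxel as follows. This is a minimal illustration rather than the pipeline used in the paper: it assumes that the seven contrasts are one image without diffusion weighting plus six diffusion-weighted images, and the gradient directions in `BVECS` are hypothetical, since the source does not list the acquired directions. The tensor is fitted by least squares to the Stejskal-Tanner model, and MD and FA follow from its eigenvalues.

```python
import numpy as np

# Hypothetical gradient scheme: six unit directions (not taken from the paper).
BVECS = np.array([[1, 1, 0], [1, -1, 0], [1, 0, 1],
                  [1, 0, -1], [0, 1, 1], [0, 1, -1]], dtype=float) / np.sqrt(2)


def fit_tensor(s0, dwi, bvecs=BVECS, bval=300.0):
    """Least-squares diffusion tensor for one voxel.

    s0:  signal without diffusion weighting (assumed to be one of the contrasts).
    dwi: six diffusion-weighted signals, one per row of bvecs.
    """
    g = np.asarray(bvecs, dtype=float)
    # Stejskal-Tanner model: ln(S_k / S_0) = -b * g_k^T D g_k
    design = np.column_stack([
        g[:, 0] ** 2, g[:, 1] ** 2, g[:, 2] ** 2,
        2 * g[:, 0] * g[:, 1], 2 * g[:, 0] * g[:, 2], 2 * g[:, 1] * g[:, 2],
    ])
    target = -np.log(np.asarray(dwi, dtype=float) / s0) / bval
    dxx, dyy, dzz, dxy, dxz, dyz = np.linalg.lstsq(design, target, rcond=None)[0]
    return np.array([[dxx, dxy, dxz], [dxy, dyy, dyz], [dxz, dyz, dzz]])


def md_and_fa(tensor):
    """Mean diffusivity and fractional anisotropy from the tensor eigenvalues."""
    evals = np.linalg.eigvalsh(tensor)
    md = evals.mean()
    norm = np.sqrt((evals ** 2).sum())
    fa = np.sqrt(1.5 * ((evals - md) ** 2).sum()) / norm if norm > 0 else 0.0
    return float(md), float(fa)


def simulate_signals(tensor, s0=1.0, bvecs=BVECS, bval=300.0):
    """Forward model, used only to exercise the fit on synthetic data."""
    adc = np.einsum("ki,ij,kj->k", bvecs, tensor, bvecs)
    return s0 * np.exp(-bval * adc)
```

Applying `fit_tensor` and `md_and_fa` voxel by voxel would yield the two extra channels that are stacked with the seven contrasts; a production implementation would vectorize the fit over the whole volume.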

Figure 1: Example of a dataset with the ground truth articular cartilages segmentation marked in colors. Red: Patella, Yellow: Tibia, Pink: Femur
Segment    Avg. Voxel Count    %
Femur      1083                1.659%
Patella    260                 0.397%
Tibia      186                 0.284%
None       64007               97.66%
Total      256×256             100%
Table 2: Distribution of labels for each segment

We split the data set into training (80%), validation (10%), and test (10%) sets, making sure that no subject appears in more than one set. We preprocess each channel independently using min-max normalization. In addition, to complement the limited data, we augment the training set with random rotations and random horizontal and vertical shifts of the images; these perturbations can occur during MRI acquisition because of small patient movements. We did not use vertical or horizontal flipping, as flips are not natural perturbations in our setting.
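A minimal sketch of the per-channel normalization and the subject-disjoint split is given below. The target range of [0, 1], the channel-last array layout, the random seed, and the function names are assumptions made for illustration; the source does not preserve these details.

```python
import numpy as np


def min_max_normalize(volume, eps=1e-8):
    """Rescale each channel of a channel-last array independently to [0, 1].

    The [0, 1] target range is an assumption; eps guards constant channels.
    """
    v = np.asarray(volume, dtype=np.float64)
    axes = tuple(range(v.ndim - 1))  # reduce over every axis except channels
    lo = v.min(axis=axes, keepdims=True)
    hi = v.max(axis=axes, keepdims=True)
    return (v - lo) / np.maximum(hi - lo, eps)


def split_subjects(subject_ids, fractions=(0.8, 0.1), seed=0):
    """Split unique subject ids into disjoint train, validation, and test lists.

    fractions gives the train and validation shares; the rest is the test set.
    """
    ids = sorted(set(subject_ids))
    order = np.random.default_rng(seed).permutation(len(ids))
    shuffled = [ids[i] for i in order]
    n_train = int(round(fractions[0] * len(ids)))
    n_val = int(round(fractions[1] * len(ids)))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Splitting on subject ids rather than on individual slices is what keeps slices of the same knee from leaking between the training and evaluation sets.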

We model the segmentation task as a 4-class classification problem: three tissues of interest (femur, patella, and tibia) and no tissue. Since our dataset is highly imbalanced (Table 2), we optimize the Weighted Dice Loss and follow (wcel) to estimate the class weights. We use the Adam optimizer (adam) to optimize the model parameters. We assume that each spatial slice is independent of the others; in other words, the location of the slice in the 3-D volume does not matter. We validated this assumption by training 3D convolutional models, which did not improve our results (Table 4).
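To make the loss concrete, the following NumPy sketch shows one common form of a weighted Dice loss. The inverse-frequency weighting in `class_weights_from_counts` is an assumption chosen for illustration; the weighting scheme of (wcel) may differ, and a training implementation would use a differentiable framework rather than NumPy.

```python
import numpy as np


def class_weights_from_counts(voxel_counts):
    """Inverse-frequency class weights, normalized to sum to 1 (assumed scheme)."""
    inverse = 1.0 / np.asarray(voxel_counts, dtype=float)
    return inverse / inverse.sum()


def weighted_dice_loss(probs, onehot, weights, eps=1e-7):
    """Weighted soft Dice loss.

    probs:   (N, C) predicted class probabilities for N voxels.
    onehot:  (N, C) one-hot ground truth labels.
    weights: (C,) class weights that sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    onehot = np.asarray(onehot, dtype=float)
    intersection = (probs * onehot).sum(axis=0)
    denominator = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - float((np.asarray(weights) * dice).sum())


# Average voxel counts per slice from the label distribution table,
# in the order femur, patella, tibia, none.
WEIGHTS = class_weights_from_counts([1083, 260, 186, 64007])
```

With these counts the tibia receives the largest weight and the background the smallest, so errors on the rare cartilage classes dominate the loss.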

We train the original U-Net proposed in (unet) and a version of the U-Net with dilated convolutions (Fig 3). We analyze the predictions of each trained model and find that the two models make very different kinds of errors (Fig 8). Therefore, as our final model, we ensemble the two models using a three-layer CNN that takes the outputs of both networks as its input channels; the ensemble achieved better performance (Table 4).

Figure 3: Architecture of dilated UNet. Dilation helps achieve a larger field of view.
Model                                     Femur    Patella    Tibia
3D U-Net (cciccek20163d)                  0.618    0.697      0.378
VNet (vnet)                               0.625    0.567      0.467
2D U-Net                                  0.678    0.773      0.593
2D Dilated U-Net                          0.670    0.771      0.621
Ensemble of 2D U-Net and Dilated U-Net    0.689    0.783      0.640
Human Expert (Re-segmentation)            0.711    0.743      0.629
Table 4: Performance of different models in terms of Dice Score on the validation set. We see that our ensemble model matches the performance of a human expert. Performance of the ensemble model on the test set is as follows: F: , P: , T:
Figure 5: Example of an output where the model correctly predicted cartilage that was missed in the ground truth segmentation by the radiologist. Left: Original Image, Center: Ground Truth, Right: Model Prediction. Femur = Pink, Patella = Red, Tibia = Yellow. See Fig 7 for more examples.
Figure 6: Confidence maps for two sample images. The circled stray pixels are incorrectly classified as cartilage; these incorrect pixels have low confidence.

3 Analysis

Estimating Human Performance: Due to the low contrast of DW-MRI images, the estimation of the correct label for each pixel is inherently noisy. This is corroborated by the observation that when the same human expert segments some of the images again, the Dice score calculated on these re-segmented images is well below 1 (Table 4). Further, we visually observe that the distribution of Dice scores (taking the first segmentation as ground truth) over all re-segmented images is very similar for the human expert and for our model, which suggests that the human expert and our model make similar kinds of errors (Fig 9).
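For reference, the test-retest comparison above amounts to computing a per-class Dice score between two label maps, with the first segmentation treated as ground truth. The sketch below is a minimal version; the integer label encoding and the function name are hypothetical.

```python
import numpy as np


def dice_per_class(reference, candidate, labels):
    """Dice score for each label between two integer label maps.

    Returns NaN for a label that is absent from both maps.
    """
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)
    scores = {}
    for label in labels:
        ref_mask = reference == label
        cand_mask = candidate == label
        total = ref_mask.sum() + cand_mask.sum()
        overlap = np.logical_and(ref_mask, cand_mask).sum()
        scores[label] = 2.0 * overlap / total if total else float("nan")
    return scores
```

The same function can score either the radiologist's second segmentation or the model prediction against the first segmentation, which is what makes the two score distributions directly comparable.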

Confidence Map for Predictions: Fig 5 shows an instance where our model predicted the existence of a tissue that was absent in the ground truth. The radiologist confirmed that this was mislabelled. Since the ground truth segmentations are noisy, we calculate a confidence map that indicates the certainty of each model prediction. For each voxel, we calculate the log of the odds ratio, $\log\left(\frac{p}{1-p}\right)$, where $p$ is the maximum predicted probability over all classes, to quantify our confidence in the prediction for that voxel. We present this confidence map along with our segmentation to the radiologists (Fig 6). These confidence maps are particularly useful because we are working with diffusion-weighted MRI, which has poor contrast between cartilage and surrounding tissues; this makes the labels around cartilage boundaries ambiguous and prone to error.
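The confidence computation can be sketched in a few lines of NumPy. The channel-last layout of the probability array and the clipping constant `eps` are assumptions added for numerical safety; they are not taken from the paper.

```python
import numpy as np


def confidence_map(probs, eps=1e-7):
    """Predicted labels and log-odds confidence for each voxel.

    probs: (..., C) array of class probabilities (for example, softmax outputs).
    """
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=-1)
    # Clip so that a probability of exactly 1 does not give an infinite log-odds.
    p_max = np.clip(probs.max(axis=-1), eps, 1.0 - eps)
    return labels, np.log(p_max / (1.0 - p_max))
```

A voxel whose winning probability is below 0.5 gets a negative value, so thresholding the map at zero is one simple way to flag predictions that deserve a second look.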

Sensitivity of Inputs for Trained Model: We selectively set one of the channels to zero and studied model performance (Dice score) to understand how much the model relies on each of the input channels. The performance drops drastically with the removal of any channel, but more when MD and FA maps are zeroed out. Further, we found that the performance was almost unchanged when we permute the seven contrast maps, implying that the model gave similar weights to all the seven contrast maps and treated the calculated MD and FA maps independently. Similar weights or averaging over contrasts help the model to be invariant to the addition of noise to inputs.
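The channel-zeroing experiment can be sketched as a loop over input channels. Here `predict` is a hypothetical stand-in for the trained model (any callable that maps a channel-last image array to a label map), and `binary_dice` is a simplified foreground score used only to keep the example self-contained.

```python
import numpy as np


def binary_dice(truth, pred):
    """Dice score between two boolean masks (1.0 when both are empty)."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    total = truth.sum() + pred.sum()
    return 1.0 if total == 0 else 2.0 * np.logical_and(truth, pred).sum() / total


def channel_ablation(predict, images, truth, score_fn=binary_dice):
    """Zero one input channel at a time and record the drop in score.

    predict: hypothetical callable standing in for the trained model.
    images:  (N, H, W, C) channel-last input array.
    """
    images = np.asarray(images, dtype=float)
    baseline = float(score_fn(truth, predict(images)))
    drops = {}
    for channel in range(images.shape[-1]):
        perturbed = images.copy()
        perturbed[..., channel] = 0.0
        drops[channel] = baseline - float(score_fn(truth, predict(perturbed)))
    return baseline, drops
```

A large entry in `drops` marks a channel the model depends on heavily; the permutation test described above could be written the same way by shuffling the contrast channels instead of zeroing them.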

4 Conclusion

We built a deep learning system that performs semi-automatic segmentation of knee cartilage from diffusion-weighted MRI in less than a minute on a non-GPU machine. We showed that the U-Net-based models perform similarly to a human expert. Further, this model is deployed for clinical trials, and radiologists are currently using it to characterize the progression of OA. We find that the confidence maps are particularly helpful in showing radiologists which pixels to concentrate on. In practice, we see that the segmentation can be used out of the box in most cases without any manual corrections.


We thank Dr. Bonnie K Ray, Dr. David Rosenberg, Dr. Narges Razavian and Dr. Cem Deniz for useful discussions and guidance. Research reported in this manuscript was supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) of the National Institutes of Health (NIH) under award number and RO1AR067789. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.


Appendix A Additional Results

In this section, we provide additional results of our computational experiments:

  • Fig 7 shows specific instances of the model being conservative or aggressive relative to the ground truth segmentation.

  • Fig 8 shows particular examples in which the U-Net and Dilated U-Net models discussed in Section 2 produce different kinds of errors. An ensemble of these two networks can correct for such errors and produce better segmentation results.

  • Fig 9 shows the distribution of Dice scores on re-segmented images for both the human expert and our model. The first segmentation produced by the human expert is used as the ground truth for the Dice score calculations. Visually, the distributions look similar, and hence one can infer that the human expert and the model make similar kinds of errors.

Figure 7: Examples of outputs where (a) the model correctly predicted the ground truth cartilage, (b) the model was more conservative than the radiologist, and (c) the radiologist was more conservative than the model. Left: Original Image, Center: Ground Truth, Right: Ensemble Model Prediction. Femur = Pink, Patella = Red, Tibia = Yellow
Figure 8: Each image shows the ground truth segmentation (blue) overlaid on the prediction from the model (red). We see that the Dilated UNet model (left) and the UNet (center) make different kinds of errors, and that ensembling them (right) can correct such errors.
Figure 9: Distribution of Dice scores from model predictions (left) and from the human expert (right) on re-segmented images (Section 3). The initial segmentation produced by the human expert is used as the ground truth for calculating the Dice score. Visually, the distributions of the Dice score on the re-segmented images are very similar for the model and the human expert, and hence one could infer that the human expert and the model make similar kinds of errors.