Coarse-to-fine volumetric segmentation of teeth in Cone-Beam CT

10/24/2018 ∙ by Matvey Ezhov, et al. ∙ 0

We consider the problem of localizing and segmenting individual teeth inside 3D Cone-Beam Computed Tomography (CBCT) images. To handle large image sizes we approach this task with a coarse-to-fine framework, where the whole volume is first analyzed as a 33-class semantic segmentation (adults have up to 32 teeth) in coarse resolution, followed by binary semantic segmentation of the cropped region of interest in original resolution. To improve the performance of the challenging 33-class segmentation, we first train the Coarse step model on a large weakly labeled dataset, then fine-tune it on a smaller precisely labeled dataset. The Fine step model is trained with precise labels only. Experiments using our in-house dataset show significant improvement for both weakly-supervised pretraining and for the addition of the Fine step. Empirically, this framework yields precise teeth masks with low localization errors sufficient for many real-world applications.


1 Introduction

In recent years, machine learning has been successfully applied to various medical imaging problems, but the use of it within the field of dental radiography remains limited, especially with 3D CBCT scans. In this paper, we present methods for 3D tooth segmentation. We start by training a model to predict coarse (downscaled) segmentation using a large weakly labeled dataset (it should be noted that it is currently impossible to process typical CBCT scans in original resolution with sufficiently large networks). Then, we fine-tune this model on a smaller, precisely labeled dataset while still predicting coarse masks. Finally, we train a separate model to predict high-resolution segmentation inside cropped regions of interest (RoI) based on coarse masks of individual teeth, using only the precisely labeled dataset.

The main contribution of our work is two-fold. First, we show how using a combination of a large dataset with weak labels and a small dataset with precise labels can yield substantial improvement in performance compared to the use of a small precisely labeled dataset only. Second, we show that it is possible to train effective segmentation pipelines that are able to localize and precisely segment at least distinct anatomical structures.

2 Related work

Machine learning is rapidly transforming the medical imaging application landscape [1]. Convolutional neural network architectures such as U-Net [2] for 2D images and V-Net [3] and 3D U-Net [4] for 3D images have proven effective for anatomic structure segmentation on medical images, as have the approaches in [5, 6, 7]. Generative Adversarial Networks [8] are applied to medical image synthesis [9] and domain adaptation [10, 11]. In [12], the authors propose an approach that uses auto-context to perform semantic segmentation at high resolutions in a multi-scale pyramid of stacked 3D FCNs. A network architecture called the label refinement network is shown in [13] to predict segmentation labels in a coarse-to-fine fashion at several resolutions.

To our knowledge, applications of deep learning to dental imaging have been relatively sparse. [14] evaluates different deep learning methods for dental X-ray analysis. In [15, 16], the authors used deep convolutional neural networks for tooth type classification and labeling on dental CBCT images. [17, 18] apply deep neural networks to caries detection. The U-Net architecture was trained to segment dental X-ray images in the Grand Challenge for Computer-Automated Detection of Caries in Bitewing and won by a large margin. [19] presents an overview of machine learning and deep learning techniques in dental image analysis.

3 Datasets

We used a dataset of depersonalized 3D CBCT head scans with isotropic voxel spacing from mm to mm and fields of view varying from just a few teeth to capturing the whole head. Typical image size is . Voxel values represent approximate Hounsfield Units, which measure radiodensity inside a voxel volume (in CBCTs it is not guaranteed that intensity values fully correspond to the Hounsfield scale). Care was taken to include scans produced by devices manufactured by different companies. We believe this dataset is representative of the majority of dental CBCT images obtained "in the wild". For those scans, two sets of labels were collected: a sparse axial bounding box annotation set and a 3D voxel-wise mask set.

A set of studies was annotated by specialists using axial bounding boxes. The process consisted of the specialist selecting to axial slices per upper and lower jaw, drawing a bounding box around the tooth axial profile, and entering a tooth number. Although bounding boxes are sufficient for learning tooth detection on 2D axial slices, we consider them weak labels for 3D segmentation in the sense that they provide hints to location and size of the tooth, but not its precise boundary. Approximate time to create such annotation is 30 minutes per CBCT volume.

Then, a set of studies was annotated by specialists using per-voxel label assignment. Specialists used MITK software [20], which allows quick delineation of tooth view in three planes and exploits subsequent 3D contour interpolation. Two different specialists then validated the segmentation results using ITK-SNAP software [21], which allows detailed 2D and 3D inspection of the previously obtained mask and provides its own manual and semi-automatic segmentation tools. We found MITK to be more labor-efficient, while ITK-SNAP is easier to use, more performant, and more memory-efficient. Approximate time to annotate a single volume this way is hours.

Voxel-wise segmentation annotation requires a lot of time, attention, and proficiency with 3D software from a specialist, making it difficult and expensive to build a large dataset. However, it can yield very precise masks. Bounding box annotation on axial slices, on the other hand, is a very fast and simple procedure, but to get volumetric masks from sparse axial bounding boxes we had to use a set of heuristic procedures, which led to artifacts and incorrect labels on the resulting masks. It takes approximately minutes for a specialist to segment one 3D CBCT scan with bounding boxes, while voxel-wise segmentation takes about 6 hours and requires powerful hardware and software training.

(a) Weak manual segmentation and prediction of the Coarse model trained on that data. (b) Downsampled precisely labeled masks and predictions of the fine-tuned Coarse model. (c) Original precisely labeled masks and results of the Fine model.
Figure 1: Examples of the model trained on weakly labeled data, the Coarse model and the Fine model on 2D axial, sagittal and frontal slices. Ground truth labels are on the right and corresponding model predictions are on the left. We propose a pipeline for training a precise 3D tooth segmentation model: (a) predicting coarse (downscaled) segmentation using a large weakly labeled dataset, (b) fine-tuning this model on a smaller, precisely labeled downscaled dataset while still predicting coarse masks, (c) high-resolution segmentation inside cropped RoIs based on coarse masks of individual teeth, using the small precisely labeled dataset only.
| Model | Weak dataset % used for Coarse step | Precise dataset % used for Coarse step | Precise dataset % used for Fine step | Time cost, hours | ASD, mm | IoU |
|---|---|---|---|---|---|---|
| (1) Coarse-only | 100 | 0 | 0 | 380 | 0.66 | 0.61 |
| (2) Coarse-only | 0 | 100 | 0 | 558 | 1.72 | 0.58 |
| (3.1) Coarse-only | 100 | 5 | 0 | 410 | 0.43 | 0.67 |
| (3.2) Coarse-only | 100 | 10 | 0 | 440 | 0.36 | 0.70 |
| (3.3) Coarse-only | 100 | 25 | 0 | 524 | 0.33 | 0.72 |
| (3.4) Coarse-only | 100 | 50 | 0 | 662 | 0.28 | 0.75 |
| (3.5) Coarse-only | 100 | 100 | 0 | 938 | 0.24 | 0.78 |
| (4.1) Coarse-to-Fine | 100 | 5 | 5 | 410 | 0.28 | 0.91 |
| (4.2) Coarse-to-Fine | 100 | 10 | 10 | 440 | 0.25 | 0.92 |
| (4.3) Coarse-to-Fine | 100 | 25 | 25 | 524 | 0.20 | 0.93 |
| (4.4) Coarse-to-Fine | 100 | 50 | 50 | 662 | 0.19 | 0.93 |
| (4.5) Coarse-to-Fine | 100 | 100 | 100 | 938 | 0.17 | 0.94 |
Table 1: Results. Here, we present the evaluation of the Coarse-only model and the Coarse-to-Fine pipeline on different types and amounts of data. (1) evaluates the Coarse-only model on full weakly labeled dataset ( cases); (2) evaluates the Coarse-only model on full precisely labeled dataset (93 cases); (3) evaluates the Coarse-only model pretrained on full weakly labeled dataset, then fine-tuned on increasing amounts of precisely labeled data; (4) evaluates a regime similar to (3) with the addition of the Fine model (separate model working in original resolution), where the Fine model is trained on the same part of precisely labeled data as the Coarse model is. The time cost is an estimated amount of annotator-hours required to obtain a dataset of this size (excluding development and test sets). All evaluations were performed on the hold-out test set of precisely labeled cases.

4 Methods

Our approach consists of the following steps: (1) Preprocessing incoming volumetric image; (2) Coarse step weakly-supervised pretraining; (3) Coarse step fine-tuning on downscaled precise masks; (4) Fine step training in original resolution on original precise masks (Figure 1).

For preprocessing we tried different methods, including removing outlier intensities by clipping to high and low percentiles, normalizing inside to range, histogram equalization, applying a fixed intensity window width and window level, or using raw Hounsfield Units. We found that training is not sensitive to the choice of preprocessing: all methods led to approximately the same results, with histogram equalization being slightly worse. For the experiments described in this work, we clipped the intensities to be inside the to percentile range, then subtracted the mean and divided by the standard deviation. For the Coarse step, we rescale the whole image to have mm isotropic voxel resolution using linear interpolation, which translates into a x volume reduction when starting from typical mm isotropic voxels. Then, we crop a volume of size randomly from the rescaled image for each iteration.
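The clipping, standardization and random-crop steps described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the percentile bounds, voxel spacings and crop size are placeholder values chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def clip_and_standardize(volume, low_pct=0.5, high_pct=99.5):
    """Clip intensities to a percentile range, then subtract mean and divide by std.

    low_pct and high_pct are placeholder values, not the ones used in the paper.
    """
    lo, hi = np.percentile(volume, [low_pct, high_pct])
    clipped = np.clip(volume.astype(np.float64), lo, hi)
    std = clipped.std()
    return (clipped - clipped.mean()) / (std if std > 0 else 1.0)

def volume_reduction_factor(source_spacing_mm, target_spacing_mm):
    """How many times fewer voxels an isotropic rescale produces."""
    return (target_spacing_mm / source_spacing_mm) ** 3

def random_crop(volume, crop_shape, rng):
    """Cut a randomly positioned sub-volume of crop_shape out of volume."""
    starts = [rng.integers(0, dim - c + 1) for dim, c in zip(volume.shape, crop_shape)]
    slices = tuple(slice(s, s + c) for s, c in zip(starts, crop_shape))
    return volume[slices]

rng = np.random.default_rng(0)
scan = rng.normal(loc=300.0, scale=500.0, size=(40, 48, 56))  # synthetic stand-in for a CBCT volume
normalized = clip_and_standardize(scan)
patch = random_crop(normalized, (32, 32, 32), rng)            # placeholder crop size
```

The cubic dependence in `volume_reduction_factor` is why a modest increase in voxel spacing shrinks the Coarse step input so sharply: doubling the spacing, for instance, leaves one eighth of the voxels.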

For evaluation we selected studies from the precisely labeled dataset at random, stratified by manufacturers, as a test set for all experiments (including the one trained on weakly labeled dataset only). The rest of the data was split between training and development sets with approximately of data going into the training set.

4.1 Weak label preprocessing

Manually annotated coarse axial masks are sparse and collected in the form of axial bounding boxes, which is non-standard for segmentation. To create a single continuous mask for each tooth, we apply a modification of the distance transform function: we use the average distance from each voxel to the centerline of the individual teeth. We perform a linear combination of voxel intensities and distances as:

where is a parameter. We have energy masks afterward and use the function to label each voxel with a number from to , where is the background. We use unique value for each of the manufacturers of CT scans presented in our dataset. That parameter was chosen heuristically after visual validation.

This preprocessing results in continuous teeth masks, but with lots of artifacts and mistakes.

4.2 Model

We formulate the problem as a -way semantic segmentation, where the background and each of the possible teeth is interpreted as a separate class.

We use a V-Net [3] fully convolutional network for both Coarse and Fine models. The Coarse model has an output width of , interpreted as a softmax distribution over each voxel, assigning it either to the background or one of the teeth. The Fine model has output width of 1, interpreted as the probability of assignment of a voxel to a tooth of interest.
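To make the two output formats concrete, here is a small NumPy sketch of how such outputs could be turned into masks. It assumes 33 output channels for the Coarse model (the 33-class formulation from the abstract), that channel 0 is the background, and a 0.5 decision threshold for the Fine model; the last two are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

NUM_CLASSES = 33  # background plus up to 32 teeth, as stated in the abstract

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def coarse_labels(logits):
    """(NUM_CLASSES, D, H, W) logits -> integer label map (D, H, W).

    Channel 0 is assumed to be the background class.
    """
    return softmax(logits, axis=0).argmax(axis=0)

def fine_mask(logit, threshold=0.5):
    """Single-channel (D, H, W) logits -> boolean tooth mask via a sigmoid.

    The 0.5 threshold is an assumption, not a value from the paper.
    """
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob > threshold

rng = np.random.default_rng(0)
coarse_logits = rng.normal(size=(NUM_CLASSES, 4, 5, 6))  # random stand-in for network output
labels = coarse_labels(coarse_logits)
tooth_mask = fine_mask(rng.normal(size=(8, 9, 10)))
```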

4.3 Loss function

Let be the ground truth segmentation with voxel value ( or for each class), and the predicted probabilistic map for each class with voxel value . As a loss function, we use soft negative multiclass Jaccard similarity defined as:

where is the number of classes, which in our case is for the Coarse model and for the Fine model, and is a loss function stability coefficient that helps to avoid dividing by zero.
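For readers who want something executable, the following NumPy sketch shows one common way to write a soft negative multiclass Jaccard loss with a stability term. It is an illustration of the general technique and may differ in detail from the authors' formula; the value of `eps` is a placeholder and the function name is hypothetical.

```python
import numpy as np

def soft_jaccard_loss(pred, target, eps=1.0):
    """Negative soft Jaccard similarity averaged over classes.

    pred:   (N, ...) predicted probabilities in [0, 1], one map per class
    target: (N, ...) ground truth with values 0 or 1, one map per class
    eps:    stability term that avoids division by zero (placeholder value)
    """
    n_classes = pred.shape[0]
    p = pred.reshape(n_classes, -1).astype(np.float64)
    g = target.reshape(n_classes, -1).astype(np.float64)
    intersection = (p * g).sum(axis=1)
    union = p.sum(axis=1) + g.sum(axis=1) - intersection
    return -np.mean((intersection + eps) / (union + eps))

truth = np.array([[1, 1, 0, 0]])
perfect = soft_jaccard_loss(truth.astype(float), truth)       # best case
disjoint = soft_jaccard_loss(1.0 - truth, truth)              # no overlap at all
```

With this sign convention the loss approaches -1 as predictions match the ground truth, and a class that is absent from both prediction and target contributes a ratio of 1 rather than a division by zero.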

4.4 Training

Training was performed in distinct phases. First, we train the Coarse model on a weakly labeled dataset with target dense masks inferred from bounding boxes according to . We trained for epochs using Adam optimizer [22] with a learning rate of decayed by a factor of 10 after 50 and 75 epochs, and a batch size of 3. For testing, we use the checkpoint with the lowest recorded validation loss.
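The step schedule and checkpoint selection described above can be written down as a short sketch. The base learning rate, the total number of epochs and the validation losses below are placeholders for illustration only; the milestones (50 and 75) and the factor of 10 come from the text.

```python
def step_decay_lr(epoch, base_lr, milestones=(50, 75), factor=10.0):
    """Learning rate for a 0-indexed epoch: divided by `factor` once per milestone passed."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)

BASE_LR = 1e-3     # placeholder, not the value used in the paper
TOTAL_EPOCHS = 100  # placeholder

schedule = [step_decay_lr(e, BASE_LR) for e in range(TOTAL_EPOCHS)]

# Checkpoint selection: keep the epoch with the lowest validation loss.
val_losses = [0.9, 0.7, 0.65, 0.68, 0.66]  # made-up values for illustration
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
```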

Then, we fine-tune the same model on a precisely labeled dataset, where original dense masks were downscaled to mm resolution. Training setup was the same except for increasing learning rate to .

As we transition to the Fine step, we use the Coarse model to prepare the training dataset. First, we obtain predicted voxel-tooth assignment in coarse mm resolution. Then, we select the largest connected component for each tooth's mask. Next, we find a minimal bounding rectangle that contains the tooth volume and extend it by mm in each direction to account for possible Coarse model errors (the value of mm was decided by hyperparameter search from mm to mm). Finally, we crop the resulting rectangle from the image and precise mask. In this way, we obtain both model input and target. We apply the same procedure to obtain validation and test sets.
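The bounding-box and crop steps above can be sketched as follows. This is an illustrative NumPy version under several assumptions: the coarse label map is taken to already contain only the largest connected component of each tooth, the spacings and margin are placeholder values, and `tooth_roi` is a hypothetical helper name.

```python
import numpy as np

def tooth_roi(coarse_labels, tooth_id, coarse_spacing_mm, fine_spacing_mm,
              margin_mm, fine_shape):
    """Slices into the original-resolution volume covering one tooth plus a margin.

    The box is found in the coarse label map, converted to millimeters,
    extended by margin_mm on every side, then converted to original-resolution
    voxel indices and clamped to the volume bounds.
    Returns None when the tooth is absent from the coarse prediction.
    """
    coords = np.argwhere(coarse_labels == tooth_id)
    if coords.size == 0:
        return None
    lo_mm = coords.min(axis=0) * coarse_spacing_mm - margin_mm
    hi_mm = (coords.max(axis=0) + 1) * coarse_spacing_mm + margin_mm
    lo = np.clip(np.floor(lo_mm / fine_spacing_mm), 0, fine_shape).astype(int)
    hi = np.clip(np.ceil(hi_mm / fine_spacing_mm), 0, fine_shape).astype(int)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))

# Toy example: a 10^3 coarse map at 1.0 mm and a 20^3 original volume at 0.5 mm
# (both spacings and the 1.0 mm margin are placeholders).
coarse = np.zeros((10, 10, 10), dtype=int)
coarse[4:6, 3:7, 2:5] = 7                       # tooth number 7
image = np.zeros((20, 20, 20), dtype=np.float32)
precise_mask = np.zeros_like(image, dtype=bool)

roi = tooth_roi(coarse, 7, 1.0, 0.5, 1.0, image.shape)
model_input = image[roi]          # Fine model input
model_target = precise_mask[roi]  # Fine model target
```

Applying the same slices to both the image and the precise mask keeps input and target aligned, which is what lets each cropped tooth volume have its own size.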

Finally, we train the Fine model on prepared datasets. Since we use the original resolution without rescaling or additional crops, every tooth volume has a different size. We use a batch of size 1 and a learning rate of .

4.5 Metrics

For evaluation, we use an average surface distance (ASD) score in millimeters, defined as the average of all distances from all points on the boundary of the predicted mask to the closest point on the boundary of the ground truth mask, and vice versa [23]. To measure whole-tooth localization performance, we also measure binary voxel-wise intersection over union (IoU) between the ground truth volumetric mask and the model prediction.
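Both metrics can be illustrated on small binary masks with the brute-force NumPy sketch below. Two choices here are assumptions rather than details from the paper: boundary voxels are defined through 6-connectivity, and the two directed distance sets are pooled into a single average. The pairwise distance matrix makes this suitable only for small volumes, and both masks are assumed to be non-empty.

```python
import numpy as np

def iou(pred, truth):
    """Binary voxel-wise intersection over union."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

def surface_points(mask, spacing_mm=1.0):
    """Coordinates (in mm) of mask voxels with at least one 6-connected background neighbor."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    inner = (slice(1, -1),) * mask.ndim
    interior = np.ones_like(mask)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            interior &= np.roll(padded, shift, axis=axis)[inner]
    return np.argwhere(mask & ~interior) * spacing_mm

def average_surface_distance(pred, truth, spacing_mm=1.0):
    """Symmetric ASD: pooled mean of nearest-boundary distances in both directions."""
    a = surface_points(pred, spacing_mm)
    b = surface_points(truth, spacing_mm)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    total = dists.min(axis=1).sum() + dists.min(axis=0).sum()
    return total / (len(a) + len(b))

truth = np.zeros((10, 10, 10), dtype=bool)
truth[2:6, 2:6, 2:6] = True
pred = np.zeros_like(truth)
pred[3:7, 2:6, 2:6] = True   # same cube shifted by one voxel

score_iou = iou(pred, truth)
score_asd = average_surface_distance(pred, truth)
```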

5 Results

As shown in Table 1, for the Coarse step, using the full precisely labeled dataset (2) leads to significantly worse results than using the full weakly labeled dataset (1), while being more expensive in terms of annotation time. This indicates that 97 CBCT scans in the training set are insufficient to train a good tooth segmentation model. At the same time, using only precisely labeled samples while fine-tuning from a model trained on the full weakly labeled dataset (3.1) significantly improves results compared to both (2) and (1), while requiring only more annotation time than (1) and less annotation time than (2). Adding a Fine step model to the same setup (4.1) further improves performance by as much as in terms of ASD and by in terms of IoU without additional annotation time cost. Curiously, using only of precise labels with the Fine step (4.1) leads to better performance than using all precise labels without it (3.5).
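The annotation-time column and the relative gains discussed here can be recomputed from Table 1 with a few lines of arithmetic. The sketch below takes 380 hours for the weak dataset (row 1), 93 precisely labeled cases (Table 1 caption) and about 6 hours per precise annotation (Section 3); rounding the number of cases up is an assumption that happens to reproduce the tabulated hours. The percentages it yields are derived from the table and are not quoted from the text.

```python
import math

WEAK_HOURS = 380        # time cost of row (1) in Table 1
PRECISE_CASES = 93      # precisely labeled cases, from the Table 1 caption
HOURS_PER_PRECISE = 6   # approximate voxel-wise annotation time per volume

def annotation_hours(precise_percent):
    """Weak-label hours plus precise-label hours for a given share of precise data."""
    cases = math.ceil(PRECISE_CASES * precise_percent / 100)  # rounding up is an assumption
    return WEAK_HOURS + HOURS_PER_PRECISE * cases

def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before

hours = {p: annotation_hours(p) for p in (5, 10, 25, 50, 100)}

# Effect of adding the Fine step at 5% precise data: rows (3.1) -> (4.1).
asd_change = relative_change(0.43, 0.28)   # negative means lower (better) ASD
iou_change = relative_change(0.67, 0.91)

# Annotation time of (3.1) relative to (1) and to (2).
extra_vs_weak_only = relative_change(380, 410)
saved_vs_precise_only = relative_change(558, 410)
```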

Using full datasets with the Coarse-to-Fine pipeline (4.5) we can achieve ASD and IoU. At that point, performance keeps improving, indicating that the pipeline is not yet dataset-saturated.

The major problem of training on weak labels only (1) is poor delineation of teeth from the background and nearby bones. These problems are clearly visible in root and bitewing areas. Moreover, this model often mislabels voxels of one tooth as belonging to a neighboring tooth. The Coarse model with fine-tuning on precise labels (3) gets accurate results in such difficult cases. The Coarse model trained only on the small precise dataset (2) achieves significantly worse results than both the weak and fine-tuned models, indicating insufficient dataset size. The Fine model (4) produces masks that are visibly very close to the ground truth.

6 Conclusion

In this work, we present two datasets of CBCTs labeled for tooth segmentation and numbering. We also present the Coarse-to-Fine segmentation pipeline with weakly supervised pretraining, achieving mm ASD and IoU and showing significant improvements over several typically used baseline setups.

We show that using a coarse-to-fine framework is effective for handling large volumetric images, even when localizing and segmenting as many as small anatomical structures, at least for tooth segmentation. Our results also indicate that the strategy of collecting weak, cheap labels first is a viable approach for problems where precise labels are expensive (such as volumetric segmentation of medical images), or where proof of concept is required. These weak labels can be reused for training precise models afterwards.

References