An improved 3D region detection network: automated detection of the 12th thoracic vertebra in image guided radiation therapy

03/26/2020 ∙ by Yunhe Xie, et al. ∙ Harvard University

Abstract. Image guidance is widely used in radiation therapy. Correctly identifying anatomical landmarks, such as the 12th thoracic vertebra (T12), is the key to success. The detection of those landmarks still requires tedious manual inspection and annotation, and superior-inferior misalignment to the wrong vertebral body is still relatively common in image guided radiation therapy. It is necessary to develop an automated approach to detect those landmarks from images. There are three major challenges in identifying the T12 vertebra automatically: 1) subtle differences between structures with high similarity, 2) limited annotated training data, and 3) the high memory usage of 3D networks. In this study, we propose a novel 3D fully convolutional network (FCN) that is trained to detect anatomical structures from 3D volumetric data and requires only a small amount of training data. Compared with existing approaches, the network architecture, target generation and loss functions were significantly improved to address the challenges specific to medical images. In our experiments, the proposed network, which was trained on a small number of annotated images, demonstrated the capability of accurately detecting structures with high similarity. Furthermore, the trained network showed the capability of cross-modality learning. This is meaningful in situations where image annotations in one modality are easier to obtain than in others. The cross-modality learning ability also indicated that the learned features were robust to noise in different image modalities. In summary, our approach has great potential to be integrated into the clinical workflow to improve the safety of image guided radiation therapy.




1 Introduction

Volumetric data (e.g. computed tomography (CT), cone-beam computed tomography (CBCT), positron emission tomography, magnetic resonance imaging, etc.) are commonly used in radiation oncology. Image-guided radiation therapy (IGRT) uses image information to ensure that the radiation dose is delivered precisely and effectively with reduced side effects. Correctly identifying anatomical landmarks, such as the T12 vertebra, is the key to success. The detection of such landmarks still requires tedious manual inspection and annotation in a slice-by-slice manner, and superior-inferior misalignment to the wrong landmark is still relatively common in IGRT, especially in applications such as CBCT, which contains high levels of noise and suffers from a limited field of view. It is necessary to develop an automated approach to detect those landmarks from images.

There are pioneering studies on using the Convolutional Neural Network (CNN) and its variant, the fully convolutional neural network (FCN), to detect disease structures [Haskins2019DeepLI]: for example, colonic polyps in CT images [Huang2017Nodule], cerebral microbleeds in MRI scans [Dou2016Microbleeds], and breast and lung cancer in ultrasound images [Li2015Cancer]. Given the superior performance of object detection networks [Ren2017FasterRCNN, poudel2019fastscnn, Liu2016SSD, Lin2017FPN] and their applications in self-driving [poudel2019fastscnn, Iglovikov2018TernausNet], face/ID recognition [Farfade2015Face], document analysis [Zhong2019Doc], etc., it makes sense to bring those recent technical developments to the medical field. However, two main challenges prevent us from doing so. First, those computer vision approaches are designed for 2D applications and lack support for 3D tomography images, which is crucial in the medical field. Also, in our experiments, such a network could not effectively distinguish a structure from other highly similar structures. Second, those approaches are trained on (or based on models pre-trained on) millions of annotated images. For example, in a typical computer vision setup, like the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), there are roughly 1.2 million annotated training images and 50,000 validation images. Access to such a large annotated dataset in the medical field is difficult, if not impossible. It is well known that a successful FCN based algorithm needs to be significantly modified in order to address the difficulties specific to medical images [Greenspan16DL, Ronneberger15UNet, Cicek16UNet3D]: 1) the scarcity of available data, 2) the high cost of data annotation, 3) the high imbalance of classes, and 4) the high memory demands of networks.

In this paper, we propose a deep network that is trained to detect anatomical structures from volumetric data but requires only a small amount of training data. Our approach differs from existing approaches in four aspects: 1) the combination of a U-Net like network and a one-shot detection network to distinguish anatomical structures with high similarity; 2) a novel detection network that improves generalization with limited training data; 3) a novel pseudo-probability function that addresses the optimization plateau; 4) improved loss functions that handle class imbalance without hyperparameters.

To evaluate our proposed approach, we trained the network to detect the T12 vertebra, the last thoracic vertebra in the human spine, from CT. We chose T12 because it is 1) technically challenging to distinguish from other vertebrae because of their high similarity; 2) widely used as a landmark in IGRT and various other radiation oncology clinical applications.

2 Network Architecture

Our network was inspired by both Faster R-CNN [Ren2017FasterRCNN] and U-Net [Ronneberger15UNet]. As shown in Fig. 1, it consists of two components: a feature extraction network (FEN) and a region detection network (RDN). Through the training samples, FEN learns the relevant features of the structure to be detected, and RDN is attached to the top of FEN to output the bounding box field. Note that our FEN's architecture is more similar to U-Net than to a feature pyramid network (FPN) [Lin2017FPN] because it has no multiple prediction layers in the skip connections. In the following subsections, we describe both FEN and RDN in detail.

2.1 Feature Extraction Network

U-Net, one of the most popular deep-learning architectures in the field of biomedical image segmentation, is well known for its effectiveness with limited annotated training data. The network is a symmetric Encoder-Decoder architecture, which, as its name suggests, consists of two parts: an encoder and a decoder. The encoder part is also called the down-path: it compresses the information contained in the input and produces a coarse representation of the input. The decoder part is also called the up-path: it receives the compressed information from the encoder and generates the final output. Besides the down-path and up-path of a typical encoder and decoder architecture, U-Net, as well as ResNet [He16ResNet], adds skip connections between the feature maps of the same spatial scale. The down-path reduces the spatial resolution, resulting in a semantically stronger feature map, and the up-path recovers the spatial resolution by an up-sampling operation. Although semantically meaningful feature maps are obtained, the fine features are lost during these down-sampling and up-sampling operations. By introducing the skip connections, the fine details before the bottleneck are added back into the semantically stronger feature maps.

In order to achieve good generalization and a strong semantic output, our FEN was designed to have the same architecture as a U-Net, but with fewer layers in the up-path, as shown in the dashed box of Fig. 1a. The input is a volumetric image of dimension . And the output is a tensor storing the output feature map, where is the number of features and is the dimension of the coarse-level spatial grid after pooling and up-sampling operators inside FEN. In our experiments, , and used in training were , , and 64, respectively. The intuition behind this architecture is that the Encoder-Decoder architecture forces the FEN to learn the major features to achieve good generalization. The shortened up-path is not only necessary for extracting semantically meaningful information, but is also needed to decrease GPU memory usage and reduce the number of parameters to be optimized.
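The relation between the input dimension and the coarse-level spatial grid can be sketched with a little shape arithmetic. The helper below is illustrative only: the name `coarse_grid_shape`, the input size, the pool sizes and the number of retained up-sampling layers are assumed values chosen for the example, not the settings reported in this paper.

```python
def coarse_grid_shape(input_shape, pool_sizes, n_upsample):
    """Spatial grid shape after the down-path and a shortened up-path.

    input_shape : (depth, height, width) of the input volume, in voxels
    pool_sizes  : one (pd, ph, pw) tuple per pooling layer on the down-path
    n_upsample  : number of up-sampling layers kept on the up-path
    """
    if n_upsample > len(pool_sizes):
        raise ValueError("cannot up-sample more often than the input was pooled")
    shape = list(input_shape)
    # Down-path: every pooling layer divides each axis by its pool size.
    for pool in pool_sizes:
        shape = [s // p for s, p in zip(shape, pool)]
    # Shortened up-path: undo only the most recent poolings, last one first.
    kept = pool_sizes[len(pool_sizes) - n_upsample:]
    for pool in reversed(kept):
        shape = [s * p for s, p in zip(shape, pool)]
    return tuple(shape)


# Hypothetical example: anisotropic first pooling, three isotropic poolings,
# and a single up-sampling layer kept on the up-path.
pools = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]
print(coarse_grid_shape((64, 256, 256), pools, n_upsample=1))  # (16, 32, 32)
```

In this made-up configuration the grid is 4 times coarser than the input along the first axis and 8 times coarser along the other two, which is the kind of per-axis down-sampling rate the box targets have to be scaled by.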

2.2 Region Detection Network

RDN is attached to the layer on the up-path where the structure of interest still has at least 3 voxels in each dimension. It can detect multiple anatomical structures simultaneously, although our experiments demonstrate the detection of a single structure. The input is a feature map of tensor. RDN output is a tensor, where is the number of structures to be detected and 7 is the number of bounding box parameters (one for probability, three for location and three for box size). The output tensor can be thought of as a field of box parameters, including the probability of box existence, defined on the coarse-level spatial grid. For simplicity, we use bounding boxes aligned to the image coordinate axes in this paper. The true center location (or offset) estimation is needed in order to recover the true box center from the coarse spatial grid.

Figure 1: The schematic view of our network. It consists of two parts: FEN and RDN. a) FEN has the same architecture as U-Net, but with fewer layers in the up-path. Each blue box represents a U-Net block which contains two convolution layers, either a maximum pooling layer for the down-path or an up-sampling layer for the up-path, and a dropout layer. The tensor dimensions before pooling or up-sampling are marked in / under the box. RDN is attached to a layer on the decoder path, where the bounding box contains at least 3 voxels. Please note, FEN is different from FPN [Lin2017FPN] because it has no multiple prediction layers in the skip connections. b) RDN was designed to detect multiple anatomical structures. Although for simplicity a -vector is shown at the output end of the network, its output is a tensor, where is the number of structures to be detected. The horizontal bars represent the output vectors from convolution layers with -kernels. The blue circles, except for , represent output nodes from convolution layers with -kernel. The blue circle of probability, , is the output from a multiplication layer with inputs of , and .

The architecture of the RDN is shown in Fig. 1b. First, the input tensor is convolved with three small kernels to generate three low-dimensional feature maps (three 32-vectors on each spatial grid , shown as narrow bars in Fig. 1b). Then, those three vectors are convolved by -kernels to generate all the -, - and -components of the probability , location and box size . Finally, the joint probability, , is the product of its three axis components. This architecture was implemented naturally using convolution and multiplication layers. To reduce the effects of image scanning intensity variations, a sigmoid function is used as the activation function in the convolution layer with -kernels. Compared to its counterpart in [Ren2017FasterRCNN], our RDN decomposes the box parameter regressions into multiple independent regressions on each image axis. Therefore, far fewer weight parameters are needed in the network, which makes it possible to train it with limited data.
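The per-axis decomposition of the probability can be illustrated with a minimal sketch in plain Python. It assumes that each image axis contributes one scalar pre-activation per grid location and that a sigmoid is applied to each before the multiplication; the function names and the scalar inputs are hypothetical stand-ins for the convolution outputs of Fig. 1b.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def joint_probability(pre_x, pre_y, pre_z):
    """Joint box probability at one grid location.

    pre_x, pre_y, pre_z are assumed per-axis pre-activations; each is
    squashed to (0, 1) and the three axis components are multiplied.
    """
    return sigmoid(pre_x) * sigmoid(pre_y) * sigmoid(pre_z)


# The product is large only when all three axes agree.
print(joint_probability(4.0, 4.0, 4.0) > joint_probability(4.0, 4.0, -4.0))  # True
```

Because each factor lies between 0 and 1, a low response on any single axis is enough to suppress the joint probability.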

3 Target Labeling

3.1 Augmentation

Besides the network architecture, data augmentation is another important step in training deep networks with limited training data. A properly designed augmentation teaches the network to focus on robust features for good generalization. In our study, each input image and its corresponding annotated masks were augmented twenty-five times using an elastic deformation algorithm based on [Ronneberger15UNet]. In each augmentation, a grid of random displacements was drawn from a Gaussian distribution ( pixels and pixels). Then the displacements were interpolated to the pixel level. All the slices of the volumetric image were deformed using the same displacements and a spline interpolation. We also experimented with 3D deformations, which showed no advantage for model convergence but took longer to generate each augmented sample.
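The displacement step of such an elastic deformation can be sketched as follows. This is a simplified illustration rather than the augmentation code used in this study: the grid size of 3, the standard deviation of 10 pixels and the function name are assumptions, and bilinear interpolation is swapped in for brevity where a smoother interpolation would normally be used. The sketch produces one displacement component; a full deformation needs one such field per in-plane axis and a separate resampling of the image.

```python
import random


def displacement_field(height, width, grid=3, sigma=10.0, seed=None):
    """Coarse Gaussian displacements interpolated to every pixel.

    Returns (coarse, field): the grid x grid random displacements and the
    height x width field obtained from them by bilinear interpolation.
    """
    if grid < 2:
        raise ValueError("grid must have at least 2 points per axis")
    rng = random.Random(seed)
    coarse = [[rng.gauss(0.0, sigma) for _ in range(grid)] for _ in range(grid)]
    field = []
    for r in range(height):
        # Row position expressed in coarse-grid coordinates.
        gy = r * (grid - 1) / max(height - 1, 1)
        y0 = min(int(gy), grid - 2)
        ty = gy - y0
        row = []
        for c in range(width):
            gx = c * (grid - 1) / max(width - 1, 1)
            x0 = min(int(gx), grid - 2)
            tx = gx - x0
            top = (1 - tx) * coarse[y0][x0] + tx * coarse[y0][x0 + 1]
            bottom = (1 - tx) * coarse[y0 + 1][x0] + tx * coarse[y0 + 1][x0 + 1]
            row.append((1 - ty) * top + ty * bottom)
        field.append(row)
    return coarse, field


# One field per in-plane axis; the same pair would be reused for every slice.
_, dx = displacement_field(5, 5, seed=1)
_, dy = displacement_field(5, 5, seed=2)
```

Reusing the same pair of fields for all slices keeps neighbouring slices geometrically consistent, which matches the description above.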

3.2 RDN Output Target

We teach both the RDN and FEN networks what to learn by constructing the appropriate ground truth for the RDN output. Although there are standard target generation methods used in the computer vision community [Ross2015FastRCNN, Ren2017FasterRCNN], they did not work well in our experiments. For example, the intersection over union (IoU), a popular choice of probability target in computer vision, suffers from the problem of an optimization plateau [Rezatofighi2019GIoU], which threw off the structure center estimation in our experiments. To overcome those problems, we replaced IoU with a hand-crafted probability function which has a global peak value at the box center.

Given an annotated image, the target box parameters (box center locations and box sizes) are computed from the structure contours drawn by experienced radiation physicists. The target box parameters are then used to compute a 7-vector target for every spatial grid location. The target is defined as follows: inside a target box, the probability target decreases linearly from the box center along the three axes; the center offset targets increase linearly from the box center; and the box size target is a constant vector. Outside the box, all targets are set to zero. The reason for setting the targets outside the box to zero is that we teach the network to focus on the features of the structure and to ignore those outside the structure. All those targets are scaled by the down-sampling rate caused by the pooling operators inside FEN to match the coarse spatial resolution of the RDN output.

where , , and are the box width, length and height, respectively.
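Since the target is defined piecewise, a small sketch may help. The function below is one possible reading of the description above, not the exact definition: it assumes that the probability target is the product of three per-axis linear ramps equal to 1 at the box center and 0 at the box faces, and that the offset target is the signed distance from the grid point to the box center in coarse-grid units. The name `target_vector` and all numbers are hypothetical.

```python
def target_vector(grid_pos, box_center, box_size, rate):
    """7-vector target [p, ox, oy, oz, sx, sy, sz] at one coarse grid point.

    grid_pos   : (i, j, k) index on the coarse grid
    box_center : box center in voxels
    box_size   : box extent along each axis in voxels
    rate       : per-axis down-sampling rate between voxels and grid points
    """
    # Scale the annotated box to the coarse grid resolution.
    center = [c / r for c, r in zip(box_center, rate)]
    size = [s / r for s, r in zip(box_size, rate)]
    offsets = [c - g for c, g in zip(center, grid_pos)]
    inside = all(abs(o) <= s / 2 for o, s in zip(offsets, size))
    if not inside:
        return [0.0] * 7  # every target is zero outside the box
    prob = 1.0
    for o, s in zip(offsets, size):
        prob *= 1.0 - abs(o) / (s / 2)  # assumed linear ramp along each axis
    return [prob] + offsets + size


# Hypothetical box: center (80, 160, 160) voxels, size (16, 48, 48) voxels.
box = dict(box_center=(80, 160, 160), box_size=(16, 48, 48), rate=(4, 8, 8))
print(target_vector((20, 20, 20), **box))  # [1.0, 0.0, 0.0, 0.0, 4.0, 6.0, 6.0]
print(target_vector((21, 20, 20), **box)[0])  # 0.5
```

With this convention, adding the offset to the grid index recovers the box center on the coarse grid, and the probability has a single peak at the center, which is the property the hand-crafted function is meant to provide.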

4 Loss Functions

We assign a total loss to measure the discrepancy between the RDN output tensor and the target tensor. The total loss, , consists of the losses of the probability , the location offset , and the box size . Here , , and are the estimated probability, center offset and box size, and , and are their ground-truth counterparts. The loss functions are defined as

where , and are spatial grid index of the RDN output tensor and is the . The total loss is defined as . The summation over all the spatial grid index is needed in order to measure the overall discrepancy of the tensor.

With the above definitions, RDN and FEN networks can be trained jointly end-to-end. The trained network then computes the output tensor when a tomography image is given. The detection was performed by finding the maximum probability inside the output tensor and extracting its associated box parameters.
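The detection step can be sketched in a few lines of plain Python. The layout below is an assumption made for illustration: the output for one structure is a nested list indexed by coarse grid position, each entry is a 7-vector ordered as probability, three center offsets and three box sizes, the offsets are added to the grid index to obtain the center, and `rate` converts coarse-grid units back to voxels. The default threshold of 0.1 is the value given in the implementation details.

```python
def detect(output, rate=(1, 1, 1), threshold=0.1):
    """Pick the most probable box from an RDN-style output field.

    output[i][j][k] is assumed to be [p, ox, oy, oz, sx, sy, sz].
    Returns None when the maximum probability is below the threshold.
    """
    best_index, best_vec = None, None
    for i, plane in enumerate(output):
        for j, row in enumerate(plane):
            for k, vec in enumerate(row):
                if best_vec is None or vec[0] > best_vec[0]:
                    best_index, best_vec = (i, j, k), vec
    if best_vec is None or best_vec[0] < threshold:
        return None  # no structure detected
    # Grid index plus offset gives the center on the coarse grid;
    # multiplying by the down-sampling rate maps it back to voxels.
    center = tuple((g + o) * r for g, o, r in zip(best_index, best_vec[1:4], rate))
    size = tuple(s * r for s, r in zip(best_vec[4:7], rate))
    return {"probability": best_vec[0], "center": center, "size": size}


# Hypothetical 2 x 2 x 2 output field with a single confident location.
field = [[[[0.0] * 7 for _ in range(2)] for _ in range(2)] for _ in range(2)]
field[1][0][1] = [0.8, 0.25, 0.5, 0.0, 4.0, 6.0, 6.0]
print(detect(field, rate=(4, 8, 8)))
```

Returning `None` below the threshold lets a caller distinguish "structure not in the field of view" from a low-confidence box.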

5 Implementation Details

Our code was implemented in Python using Keras with TensorFlow as the back end. All of the training images were cropped to the size of in order to fit into the 12GB GPU memory. To handle the anisotropic voxel size of CT and CBCT, the first pooling layer has a pool size of . All the other pooling layers have a pool size of . Our model was trained end-to-end by back propagation and the Adam optimizer with a learning rate of . It takes about 16 hours to finish training on a single NVIDIA Titan Xp GPU card. The model is trained with a mini-batch size of one. In the prediction phase, the network declares no structure detected if the maximum probability of the structure is smaller than 0.1.

Figure 2: Examples of T12 detection in CT and CBCT images. T12 regions were correctly detected and outlined as the red boxes overlaid on patient images in axial, sagittal and coronal views (from left to right). Top row: T12 detection from CT using the model trained on thirty-five CT scans. Bottom row: T12 detection in CBCT images using the same model as in the top row. Although CBCT images are noisier than CT images, T12 was still correctly detected from CBCT images of a lung patient.

6 Experiment Results

The model was trained using thirty-five abdominal CT images of liver patients acquired with the same imaging protocol. The trained model was tested to detect T12 on another twenty data sets - fourteen CT scans and six CBCT scans. To achieve the goal of IGRT safety check in our experiments, we measured the detection error using the distance between the manually contoured center and the model predicted center. Fig. 2 shows examples of detected T12 vertebrae from CT and CBCT scans using the same model trained as described above.
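The error measure described above can be written down directly. The sketch below assumes that both centers are given in millimetres in the same coordinate frame; the function name and the sample coordinates are made up for illustration.

```python
import math


def detection_error(predicted_center, reference_center):
    """Signed per-axis differences and Euclidean distance between two centers."""
    diffs = tuple(p - r for p, r in zip(predicted_center, reference_center))
    distance = math.sqrt(sum(d * d for d in diffs))
    return diffs, distance


# Hypothetical centers in millimetres.
diffs, distance = detection_error((13.0, 24.0, 50.0), (10.0, 20.0, 50.0))
print(diffs, distance)  # (3.0, 4.0, 0.0) 5.0
```

Note that the mean of per-scan distances is generally larger than the distance computed from the mean signed per-axis errors, because signed errors of opposite sign cancel in an average while distances do not. This is consistent with per-axis means close to zero alongside a total distance of several millimetres, as in Table 1.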

On all fourteen test CT images, T12 was accurately identified, with a mean detection error of 3.97 mm. In the superior-inferior direction, the mean detection error is 0.20 mm. Of the six CBCT images, the trained model failed to detect T12 in only one scan (). All the others were identified correctly, with a mean detection error of 4.49 mm. The details are summarized in Table 1. The detection took about 3.30 seconds on a dataset with dimensions of voxels. With its small detection error and fast detection speed, the network can be used in real time as part of IGRT to reduce the risk of human error. Even with the relatively larger detection error of 4.49 mm in CBCT images, the correct vertebral body is still located, which is beneficial as an IGRT safety check.

modality  Left-Right (mm)  Anterior-Posterior (mm)  Superior-Inferior (mm)  Total Distance (mm)
CT        -0.49            -0.24                    0.20                    3.97
          2.50             2.64                     3.30                    2.70
CBCT      1.12             0.41                     0.95                    4.49
          1.00             2.46                     6.09                    3.28

Table 1: The detection error is defined as the center distance between the predicted region and its ground truth. The average overall distance differences on CT and CBCT are 3.97 mm and 4.49 mm, respectively. Given that a typical voxel resolution in the superior-inferior direction is 1.99 mm, the predicted center is about 2 slices away from the manually labeled center, which is accurate enough to be integrated into the workflow to improve the safety of IGRT.

7 Conclusion

In our experiments, our proposed network, which was trained on a small number of annotated images, demonstrates the capability of accurately detecting structures with high similarity. Furthermore, our network is capable of cross-modality learning: the network trained on CT images detects the structure well in CBCT images. This is meaningful in situations where image annotations in one modality are easier to obtain than in others. The cross-modality learning capability also indicates that the features learned by the network are robust to the noise of different image modalities. In summary, our approach has great potential to be integrated into the clinical workflow to improve the safety of IGRT.

8 Acknowledgment

We would also like to thank Dr. Yuliang Guo from Oppo US Research Center for the technical discussions on Faster R-CNN.