Volumetric data (e.g., computed tomography (CT), cone-beam computed tomography (CBCT), positron emission tomography, and magnetic resonance imaging) are commonly used in radiation oncology. Image-guided radiation therapy (IGRT) utilizes image information to ensure that the radiation dose is delivered precisely and effectively with reduced side effects. Correctly identifying anatomical landmarks, such as the T12 vertebra, is key to success. Until recently, the detection of such landmarks required tedious manual inspection and annotation in a slice-by-slice manner, and superior-inferior misalignment to the wrong landmark remains relatively common in IGRT, especially in applications like CBCT, which contains a high level of noise and suffers from a limited field of view. It is therefore necessary to develop an automated approach to detect these landmarks from images.
Given the recent success of deep networks in computer vision tasks such as classification, object detection, and segmentation, it makes sense to bring those technical developments to the medical field. However, two main challenges prevent us from doing so directly. First, those computer vision approaches are designed for general photographic applications and lack support for the tomographic images that are crucial in the medical field. Also, in our experiments, such a network could not effectively distinguish a structure from other structures with high similarity. Second, those approaches are trained (or based on models pre-trained) on millions of annotated
images. For example, in a typical computer vision setup such as the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), there are roughly 1.2 million annotated training images and 50,000 validation images. Accessing such a large annotated dataset in the medical field is difficult, if not impossible. It is well known that a successful FCN-based algorithm needs to be significantly modified to address the difficulties specific to medical images [Greenspan16DL, Ronneberger15UNet, Cicek16UNet3D]: 1) the scarcity of available data, 2) the high cost of data annotation, 3) the high imbalance of classes, and 4) the high memory demands of networks.
In this paper, we propose a deep network that is trained to detect anatomical structures from volumetric data but requires only a small amount of training data. Our approach differs from existing approaches in four aspects: 1) the combination of a U-Net-like network and a
one-shot detection network to distinguish anatomical structures with high similarity; 2) a novel detection network to improve generalization with limited training data; 3) a novel pseudo-probability function to address the optimization plateau; and 4) improved loss functions that handle class imbalance without hyperparameters.
To evaluate the proposed approach, we trained the network to detect from CT the T12 vertebra, which is the last thoracic vertebra in the human spine. We chose T12 because it is 1) technically challenging to distinguish from other vertebrae owing to their high similarity, and 2) widely used as a landmark in IGRT and various other radiation oncology clinical applications.
2 Network Architecture
Our network was inspired by both Faster R-CNN [Ren2017FasterRCNN] and U-Net [Ronneberger15UNet]. As shown in Fig. 1, it consists of two components: a feature extraction network (FEN) and a region detection network (RDN). From the training samples, FEN learns the relevant features of the structure to be detected, and RDN is attached to the top of FEN to output the bounding-box field. Note that our FEN's architecture is closer to U-Net than to the feature pyramid network (FPN) [Lin2017FPN] because there are no multiple prediction layers on the skip connections. In the following subsections, we describe both FEN and RDN in detail.
2.1 Feature Extraction Network
U-Net, one of the most popular deep-learning architectures in biomedical image segmentation, is well known for its effectiveness with limited annotated training data. The network is basically a symmetric encoder-decoder architecture consisting of two parts. The encoder part, also called the down-path, compresses the information contained in the input into a coarse representation. The decoder part, also called the up-path, receives the compressed information from the encoder and generates the final output. Beyond the down-path and up-path of a typical encoder-decoder architecture, U-Net, like ResNet [He16ResNet], adds skip connections between feature maps of the same spatial scale. The down-path reduces the spatial resolution, yielding a semantically stronger feature map, and the up-path recovers the spatial resolution through up-sampling. Although semantically meaningful feature maps are obtained, fine details are lost during these down-sampling and up-sampling operations. The skip connections add the fine details from before the bottleneck back into the semantically stronger feature maps.
To achieve good generalization and a strong semantic output, our FEN was designed with the same architecture as a U-Net but with fewer layers in the up-path, as shown in the dashed box of Fig. 1a. The input is a volumetric image of dimension , and the output is a tensor storing the output feature map, where is the number of features and is the dimension of the coarse-level spatial grid after the pooling and up-sampling operators inside FEN. In our experiments, , and used in training were , , and 64, respectively. The intuition behind this architecture is that the encoder-decoder structure forces the FEN to learn the major features, which yields good generalization. The shortened up-path is not only necessary for extracting semantically meaningful information but is also a must-have for decreasing GPU memory usage and reducing the number of parameters to be optimized.
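As a concrete illustration, a U-Net-like FEN with a shortened up-path can be sketched in PyTorch as below. The channel counts, depth, and the single up-sampling step are our assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Two 3x3x3 convolutions with ReLU, the usual U-Net building block."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv3d(c_out, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class FEN(nn.Module):
    """U-Net-like feature extractor with a shortened up-path (a sketch)."""
    def __init__(self, c=16, n_features=64):
        super().__init__()
        self.enc1 = conv_block(1, c)          # full resolution
        self.enc2 = conv_block(c, 2 * c)      # 1/2 resolution
        self.enc3 = conv_block(2 * c, 4 * c)  # 1/4 resolution (bottleneck)
        self.pool = nn.MaxPool3d(2)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear",
                              align_corners=False)
        # Shortened up-path: only one up-sampling step, so the output
        # feature map stays on a coarse grid (1/2 of input resolution here).
        self.dec2 = conv_block(4 * c + 2 * c, n_features)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        # Skip connection: concatenate the up-sampled bottleneck with e2.
        return self.dec2(torch.cat([self.up(e3), e2], dim=1))
```

For a 16x16x16 input volume, this sketch produces a 64-channel feature map on an 8x8x8 coarse grid, matching the role described for the FEN output.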
2.2 Region Detection Network
RDN is attached to the layer on the up-path where the structure of interest still spans at least 3 voxels in each dimension. It can detect multiple anatomical structures simultaneously, although our experiments demonstrate single-structure detection. The input is a feature map of tensor. The RDN output is a tensor, where
is the number of structures to be detected and 7 is the number of bounding-box parameters (one for probability, three for locations, and three for box sizes). The output tensor can be thought of as a field of box parameters, including the probability of box existence, defined on the coarse-level spatial grid. For simplicity, we use bounding boxes aligned to the image coordinate axes in this paper. The true center location (or offset) must be estimated in order to recover the true box center from the coarse spatial grid.
Fig. 1b: A -vector is shown at the output end of the network; its output is a tensor, where is the number of structures to be detected. The horizontal bars represent the output vectors from convolution layers with -kernels. The blue circles, except for , represent output nodes from convolution layers with a -kernel. The blue circle of probability, , is the output of a multiplication layer with inputs , and .
The architecture of the RDN is shown in Fig. 1b. First, the input tensor is convolved with three small kernels to generate three low-dimensional feature maps (three 32-vectors at each spatial grid point , shown as narrow bars in Fig. 1b). Those three vectors are then convolved with -kernels to generate the -, - and -components of the probability , location and box size . Finally, the joint probability, , is the product of its three axis components. This architecture is implemented naturally using -kernels. Compared with its counterpart in [Ren2017FasterRCNN], our RDN decomposes the box-parameter regression into multiple independent regressions along each image axis. Far fewer weight parameters are therefore needed, which makes the network trainable with limited data.
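The per-axis decomposition can be sketched in PyTorch as follows. The branch width (32), the use of sigmoid for per-axis probabilities, and the channel layout of the output are our assumptions; only the overall structure (three 1x1x1-convolution branches whose probability outputs are multiplied) follows the text:

```python
import torch
import torch.nn as nn

class RDN(nn.Module):
    """Region detection network sketch: independent per-axis regressions."""
    def __init__(self, n_features=64, n_structures=1, c=32):
        super().__init__()
        # One small feature branch per image axis (the narrow bars in Fig. 1b),
        # implemented with 1x1x1 convolutions over the FEN feature map.
        self.branch = nn.ModuleList(
            [nn.Conv3d(n_features, c, kernel_size=1) for _ in range(3)]
        )
        # Per-axis heads: probability, center offset, and box size.
        self.prob = nn.ModuleList([nn.Conv3d(c, n_structures, 1) for _ in range(3)])
        self.loc = nn.ModuleList([nn.Conv3d(c, n_structures, 1) for _ in range(3)])
        self.size = nn.ModuleList([nn.Conv3d(c, n_structures, 1) for _ in range(3)])

    def forward(self, f):
        h = [torch.relu(b(f)) for b in self.branch]
        # Joint probability is the product of the three axis probabilities.
        p_axis = [torch.sigmoid(self.prob[i](h[i])) for i in range(3)]
        p = p_axis[0] * p_axis[1] * p_axis[2]
        loc = torch.cat([self.loc[i](h[i]) for i in range(3)], dim=1)
        size = torch.cat([self.size[i](h[i]) for i in range(3)], dim=1)
        return torch.cat([p, loc, size], dim=1)  # 7 channels per structure
```

Because each axis head regresses only a scalar per grid cell from a 32-vector, this design has far fewer parameters than a joint 3D box regression head.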
3 Target Labeling
3.1 Data Augmentation
Besides the network architecture, data augmentation is another important step in training deep networks with limited training data. A properly designed augmentation teaches the network to focus on robust features for good generalization. In our study, each input image and its corresponding annotated masks were augmented twenty-five times using an elastic deformation algorithm based on [Ronneberger15UNet]. In each augmentation, a
grid of random displacements was drawn from a Gaussian distribution (pixels and
pixels). The displacements were then interpolated to the pixel level. All the slices of the volumetric image were deformed using the same displacements and spline interpolation. We also experimented with deformations, which showed no advantage for model convergence but took longer to generate the data.
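The augmentation above can be sketched with NumPy and SciPy as below. The grid size and displacement standard deviation are illustrative placeholders for the elided values, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import zoom, map_coordinates

def elastic_deform(volume, grid=4, sigma=3.0, seed=None):
    """Apply the same 2D elastic deformation to every slice of a volume.

    A coarse grid x grid field of random displacements (std. dev. sigma
    pixels; both values are assumptions) is drawn from a Gaussian,
    interpolated to pixel resolution, and used to warp each slice with
    spline interpolation.
    """
    rng = np.random.default_rng(seed)
    n_slices, h, w = volume.shape
    # Coarse displacement field: one (dy, dx) vector per grid node.
    coarse = rng.normal(0.0, sigma, size=(2, grid, grid))
    # Interpolate the displacements to full in-plane resolution.
    dy = zoom(coarse[0], (h / grid, w / grid), order=3)
    dx = zoom(coarse[1], (h / grid, w / grid), order=3)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([yy + dy, xx + dx])
    out = np.empty_like(volume)
    for k in range(n_slices):  # same displacement field for all slices
        out[k] = map_coordinates(volume[k], coords, order=3, mode="nearest")
    return out
```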
3.2 RDN Output Target
We teach both the RDN and FEN networks what to learn by constructing an appropriate ground truth for the RDN output. Although there are standard target-generation methods in the computer vision community [Ross2015FastRCNN, Ren2017FasterRCNN], they did not work well in our experiments. For example, the intersection over union (IoU), a popular choice of probability target in computer vision, suffers from an optimization plateau [Rezatofighi2019GIoU], which threw off the structure-center estimation in our experiments. To overcome these problems, we replaced IoU with a hand-crafted probability function that has a global peak at the box center.
Given an annotated image, the target box parameters (box center locations and box sizes) are computed from the structure contours drawn by experienced radiation physicists. The target box parameters are then used to compute a 7-vector target, , for every spatial grid location . The target is defined as follows: inside a target box , the probability target decreases linearly from the box center, , along the three axes; the center offset targets, , increase linearly from the box center; and the box size target, , is a constant vector. Outside the box , , , and are set to zero. The consideration behind zeroing the targets outside the box is to teach the network to focus on the features of the structure and to ignore everything outside it. All targets are scaled by the down-sampling rate, , caused by the pooling operators inside FEN to match the coarse spatial resolution of the RDN output.
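One form consistent with this description is the following reconstruction (not necessarily the paper's exact formula), with the grid coordinate denoted \(\mathbf{g}\) and the box center \(\mathbf{c}\):

```latex
\[
\bar p(\mathbf{g}) =
\begin{cases}
\left(1-\dfrac{2|g_x-c_x|}{w}\right)
\left(1-\dfrac{2|g_y-c_y|}{l}\right)
\left(1-\dfrac{2|g_z-c_z|}{h}\right), & \mathbf{g}\in B,\\[6pt]
0, & \mathbf{g}\notin B,
\end{cases}
\]
\[
\bar{\mathbf{l}}(\mathbf{g}) =
\begin{cases}
\mathbf{c}-\mathbf{g}, & \mathbf{g}\in B,\\
\mathbf{0}, & \mathbf{g}\notin B,
\end{cases}
\qquad
\bar{\mathbf{s}}(\mathbf{g}) =
\begin{cases}
(w,\,l,\,h), & \mathbf{g}\in B,\\
\mathbf{0}, & \mathbf{g}\notin B,
\end{cases}
\]
```

with all quantities expressed on the coarse grid, i.e., divided by the down-sampling rate.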
where , , and are the box width, length and height, respectively.
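A minimal sketch of this target construction in NumPy, assuming the linear-decay form described above (the function and parameter names are illustrative):

```python
import numpy as np

def make_targets(grid_shape, center, size, r):
    """Build a 7-channel RDN target field on a coarse grid (a sketch).

    center, size: box center and (width, length, height) in image voxels.
    r: down-sampling rate of the FEN output grid.
    Channel 0 is the probability, 1-3 the center offsets, 4-6 the box size.
    """
    t = np.zeros((7,) + tuple(grid_shape), dtype=np.float32)
    idx = np.indices(grid_shape).astype(np.float32)  # grid coordinates
    c = np.asarray(center, np.float32) / r           # center on coarse grid
    s = np.asarray(size, np.float32) / r             # size on coarse grid
    # Per-axis linear decay from the box center; clipped to zero outside.
    diff = np.abs(idx - c[:, None, None, None])
    decay = np.clip(1.0 - 2.0 * diff / s[:, None, None, None], 0.0, None)
    inside = np.all(decay > 0.0, axis=0)             # True inside the box
    t[0] = np.prod(decay, axis=0) * inside           # probability target
    for i in range(3):
        t[1 + i] = (c[i] - idx[i]) * inside          # center offset target
        t[4 + i] = s[i] * inside                     # box size target
    return t
```

The probability channel peaks at exactly 1 at the box center and falls linearly to 0 at the box faces, which is what removes the IoU optimization plateau.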
4 Loss Functions
We assign a total loss to measure the discrepancy between the RDN output tensor and the target tensor. The total loss, , consists of the losses of the probability , the location offset , and the box size . Here , , and are the estimated probability, center offset and box size, and , and are their ground-truth counterparts. The loss functions are defined as
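One set of losses consistent with this description is the following reconstruction (not necessarily the paper's exact losses); weighting the location and size terms by the target probability confines them to the box, which mitigates class imbalance without extra hyperparameters:

```latex
\[
L_p = \sum_{i,j,k}\left(p_{ijk}-\bar p_{ijk}\right)^2,\qquad
L_{loc} = \sum_{i,j,k}\bar p_{ijk}\left\lVert \mathbf{l}_{ijk}-\bar{\mathbf{l}}_{ijk}\right\rVert^2,\qquad
L_{size} = \sum_{i,j,k}\bar p_{ijk}\left\lVert \mathbf{s}_{ijk}-\bar{\mathbf{s}}_{ijk}\right\rVert^2,
\]
\[
L = L_p + L_{loc} + L_{size},
\]
```

where the bars denote the ground-truth targets.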
where , and are the spatial grid indices of the RDN output tensor and is the . The total loss is defined as . The summation over all spatial grid indices is needed in order to measure the overall discrepancy of the tensor.
With the above definitions, the RDN and FEN networks can be trained jointly, end-to-end. Given a tomography image, the trained network computes the output tensor. Detection is performed by finding the maximum probability inside the output tensor and extracting its associated box parameters.
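The detection step can be sketched as below, assuming the 7-channel layout used throughout (channel 0 probability, 1-3 center offsets, 4-6 box sizes) and a down-sampling rate r between the input image and the coarse grid:

```python
import numpy as np

def detect(output, r):
    """Extract the detected box from a (7, D', H', W') RDN output array."""
    p = output[0]
    # Grid cell with the maximum probability of box existence.
    i, j, k = np.unravel_index(np.argmax(p), p.shape)
    offset = output[1:4, i, j, k]  # per-axis center offset on the grid
    size = output[4:7, i, j, k]    # per-axis box size on the grid
    # Recover the true center: add the offset to the grid cell, then
    # scale by the down-sampling rate back to image voxels.
    center = (np.array([i, j, k], dtype=np.float32) + offset) * r
    return p[i, j, k], center, size * r
```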
5 Implementation Details
6 Experiment Results
The model was trained on thirty-five abdominal CT images of liver patients acquired with the same imaging protocol. The trained model was tested on T12 detection using another twenty datasets: fourteen CT scans and six CBCT scans. To match the IGRT safety-check goal of our experiments, we measured the detection error as the distance between the manually contoured center and the model-predicted center. Fig. 2 shows examples of detected T12 vertebrae from CT and CBCT scans using the same trained model.
On all fourteen test CT images, T12 was accurately identified with a mean detection error of mm. In the superior-inferior direction, the mean detection error is 0.20 mm. Of the six CBCT images, the trained model failed to detect T12 in only one scan (); in all the others, T12 was identified correctly with a mean detection error of mm. The details are summarized in Table 1. Detection took about 3.30 seconds on a dataset with dimensions of voxels. With its small detection error and fast detection speed, the method can be used in real time as part of IGRT to reduce the risk of human error. Even with the relatively larger detection error of 4.49 mm in CBCT images, the correct vertebral body is still located, which is beneficial for an IGRT safety check.
In our experiments, the proposed network, trained on a small number of annotated images, demonstrated the capability to accurately detect structures with high similarity. Furthermore, the network is capable of cross-modality learning: trained on CT images, it detects the structure well in CBCT images. This is meaningful in situations where image annotations in one modality are easier to obtain than in others. The cross-modality learning capability also indicates that the features learned by the network are robust to the noise of different image modalities. In summary, our approach has great potential to be integrated into the clinical workflow to improve the safety of IGRT.
We would also like to thank Dr. Yuliang Guo from Oppo US Research Center for the technical discussions on Faster R-CNN.