The treatment course of fractionated radiotherapy usually lasts weeks or months, during which the shape, position and size of tumor targets and organs at risk (OARs) may vary to different extents. It is essential to account for these changes for accurate dose delivery [RN2, RN1]. Cone-beam computed tomography (CBCT) is commonly used in image-guided radiotherapy for patient setup and treatment evaluation prior to beam delivery. Compared to the treatment planning CT, CBCT images usually have poor image quality and soft-tissue contrast due to scattering and beam hardening. Since a CBCT scan usually takes longer than the planning CT, CBCT images are also susceptible to motion artifacts. As a result, CBCTs are primarily used to verify alignment with the planning CT, and their ability to track patients' anatomic changes throughout the treatment course has been under-explored. CBCT could enable detailed evaluation of inter-fractional anatomic changes on each treatment day throughout the treatment course. Inter-fractional target and OAR variation could then be modeled to predict anatomic variation in future treatments or to guide treatment planning, such as margin definition, to provide sufficient target coverage and OAR sparing. A CBCT-CBCT image registration tool that quantifies inter-fractional anatomic changes is therefore important to facilitate such modeling.
Conventional intensity-based deformable image registration (DIR) methods, such as optical flow [RN3], demons [RN4] and the viscous fluid model [RN5], are iterative and generally very slow, especially for large datasets. They also tend to produce over-smoothed deformation vector fields (DVFs) because spatial filters are applied repeatedly throughout the iterative process [RN7, RN6]. The large appearance variance and low image contrast of CBCT pose additional challenges for accurate registration. Landmarks identified either automatically or manually have been used to guide planning CT-CBCT and MRI-CBCT registration [RN9, RN8]. However, landmark identification can be challenging and laborious in the presence of severe artifacts, which in turn degrades landmark-guided DIR [RN10]. Partly due to the poor CBCT image quality, very few papers have been published on CBCT-CBCT image registration [RN13, RN11, RN3, RN12]. Noe et al. [RN3] accelerated an optical flow method [RN14] on a graphics processing unit (GPU); their test on CBCT-CBCT registration achieved a run time of 64 s for an image size of 512 × 512 × 55 and target registration errors (TREs) of 1.8 ± 1.0 mm after rigid registration and 1.6 ± 0.8 mm after deformable registration. Nithiananthan et al. [RN11] studied the accuracy and convergence of multiscale demons image registration and achieved a run time of 270 s (image size not specified) and a TRE of 1.6 ± 0.9 mm on the CBCT images of ten head-and-neck cancer patients. Zachiu et al. [RN12] implemented and evaluated an Evolution method [RN15], which estimates the deformation between two images by matching similar contrast patterns instead of pixel intensities, for both CT-CBCT and CBCT-CBCT image registration; they achieved a run time of approximately 60 s for images of size 256 × 256 × 256. Jiang et al. [RN13] proposed a multi-scale DIR framework with unsupervised joint training of convolutional neural networks (MJ-CNN) for 4D-CT inter-phase registration. Although trained on a 4D-CT dataset, the MJ-CNN framework also performed well on both CT-CBCT and CBCT-CBCT registration without re-training or fine-tuning, with a run time of about 1.4 s for an image size of 256 × 256 × 96.
Recently, deep learning (DL)-based medical image registration has become an active research topic and achieved promising performance. Two thorough review papers on DL-based image registration were recently published by Fu et al. [RN16] and Haskins et al. [RN17]. Generally, DL-based image registration methods can be divided into three categories: deep iterative registration, supervised transformation prediction and unsupervised transformation prediction. Deep iterative registration methods inherit the iterative nature of conventional DIR methods [RN18], which slows the registration process. Supervised transformation prediction methods rely on a 'ground truth DVF', usually obtained from other DIR methods or artificially generated and then quality-checked by experts, to supervise the network training [RN19]. This quality control of the 'ground truth DVF' fidelity is subjective and may introduce inter-observer variability into the network performance. Unsupervised transformation prediction enables large numbers of datasets to be used in training since no 'ground truth DVF' is needed. However, without ground truth transformations, it is difficult to define proper loss functions for the networks. The spatial transformer network (STN) [RN20] was proposed to generate the deformed image, which enables image similarity loss calculation during the training process. In this study, a novel unsupervised deep learning framework for DIR of inter-fraction CBCT images is proposed. The network design has three main strengths: I) the workflow is based on the STN, which explicitly enables the loss function to be defined without any manually aligned or pre-registered image pairs [RN20, RN21]; loss functions of image similarity and DVF regularization are used in the training stage. II) Generative adversarial network (GAN) architectures are incorporated into the STN to improve the realism of the predicted DVFs. III) A multi-scale framework is adopted to capture both coarse-scale and fine-scale motion.
2 Methods and Materials
2.1 The proposed workflow of Spatial Transformation-Based Network
The schematic flowchart of the proposed method is outlined in Fig. 1. The training and inference stages follow the same feed-forward path through an STN, which consists of a global generative adversarial network (GlobalGAN) and a local GAN (LocalGAN). The GlobalGAN is trained with the whole volumes of the moving and target images to capture the global geometric deformation and to generate a global deformation vector field (DVF), which facilitates coarse alignment. The global DVF and the moving CBCT fraction are then fed into the spatial transformation module to generate the globally deformed fractional CBCT images. However, the global DVF may fail to provide accurate local registration in the presence of non-rigid geometry and anatomic motion. To improve local registration accuracy, the LocalGAN is designed to capture the local deformation between the globally deformed CBCT images and the target image. In the training of the LocalGAN, three-dimensional (3D) image patches with a size of 64 × 64 × 64 are extracted from the globally deformed CBCT images and the target CBCT images with an overlap of 32 × 32 × 48 voxels. An image similarity loss, a regularization loss, and an adversarial loss are included in the loss functions of the GlobalGAN and LocalGAN. During the training stage, the GlobalGAN and LocalGAN are trained without the supervision of ground truth DVFs.
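The overlapping patch extraction for the LocalGAN stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stride along each axis is taken as the patch size minus the stated overlap (32, 32 and 16 voxels), and a final patch is placed flush against the image border when needed.

```python
import numpy as np

def patch_starts(dim_size, patch, stride):
    """Start indices of overlapping patches along one axis,
    with a final patch flush against the image border."""
    starts = list(range(0, max(dim_size - patch, 0) + 1, stride))
    if starts[-1] + patch < dim_size:   # make sure the border is covered
        starts.append(dim_size - patch)
    return starts

def extract_patches(volume, patch=(64, 64, 64), overlap=(32, 32, 48)):
    """Slice a 3D volume into overlapping patches (LocalGAN training input)."""
    strides = tuple(p - o for p, o in zip(patch, overlap))
    coords = [patch_starts(s, p, st)
              for s, p, st in zip(volume.shape, patch, strides)]
    patches = []
    for x in coords[0]:
        for y in coords[1]:
            for z in coords[2]:
                patches.append(volume[x:x+patch[0], y:y+patch[1], z:z+patch[2]])
    return np.stack(patches)
```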
2.1.1 Architectures of the Networks
The GlobalGAN and LocalGAN share similar GAN architectures with different learnable parameters. The structures of the generator and discriminator are shown in Fig. 2. In the generator network, the input image pairs are progressively down-sampled while being encoded through 11 convolutional layers. To up-sample the generated DVFs to the same matrix size as the input images, bilinear interpolation is applied. The discriminators in the two GANs are implemented as conventional fully convolutional networks (FCNs) [RN22] to provide the regularization necessary to generate realistic DVFs. Since the discriminators are trained to distinguish the deformed images from real CBCT images, they encourage the GlobalGAN and LocalGAN to predict realistic DVFs by penalizing unrealistic deformed images.
2.1.2 Attention Gates
To force the model to focus on learning motion information, a self-attention mechanism is integrated into the GAN architectures of both the GlobalGAN and the LocalGAN. It is constructed by integrating two attention gates into the generator between the convolutional layers before and after the max pooling layers. With the attention gates, feature maps at different scales from adjacent convolutional layers are combined and gated right before concatenation. Attention gates have previously been explored in the context of semantic segmentation and were able to capture the most relevant semantic contextual information without using a very large receptive field [RN24, RN23].
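As a sketch of how such an attention gate can operate, the additive formulation in the style of [RN23] is shown below with voxels flattened to rows. The projection matrices `W_x`, `W_g` and the vector `psi` are hypothetical stand-ins for learned 1 × 1 convolutions; this is an illustration, not the paper's implementation.

```python
import numpy as np

def attention_gate(x, g, W_x, W_g, psi):
    """Additive attention gate sketch.

    x    : skip-connection features, shape (N, C_x)  (N flattened voxels)
    g    : gating features from the coarser scale, shape (N, C_g)
    W_x, W_g : projections to a common intermediate dimension C_int
    psi  : weights mapping intermediate features to one scalar per voxel
    Returns x scaled voxel-wise by attention coefficients in [0, 1].
    """
    q = np.maximum(0.0, x @ W_x + g @ W_g)     # joint features (ReLU), (N, C_int)
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))   # sigmoid gating, (N, 1)
    return x * alpha, alpha
```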
2.1.3 Loss Functions
The loss functions of the GlobalGAN and LocalGAN consist of image similarity, regularization and adversarial losses. The difference between the losses of the GlobalGAN and LocalGAN is that the GlobalGAN loss is calculated on whole images while the LocalGAN loss is calculated on patches.
The DVF is predicted from a pair of moving and target images, and the deformed image is obtained by warping the moving image (or image patch, for the LocalGAN) with the predicted DVF [RN25]. The image similarity loss is computed between the deformed and target images.
where NCC() and GD() denote the normalized cross-correlation (NCC) loss and the gradient difference (GD) loss, respectively, between the deformed and target images, and MIND() denotes the modality independent neighbourhood descriptor (MIND) [RN26]. Due to scattering and image artifacts, inter-fraction CBCT HU values are inconsistent, which may deteriorate the effectiveness of similarity metrics such as mean squared error and NCC. The MIND descriptor normalizes the image intensity in a modality-independent manner and is used here to pre-process the images before similarity measurement with NCC and GD.
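The NCC and GD terms can be illustrated with a minimal NumPy sketch. The MIND pre-processing and the authors' exact weighting are omitted; `similarity_loss` and its weights are illustrative names, not the paper's implementation.

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between two images (1 = identical)."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def gradient_difference(a, b):
    """Mean absolute difference of spatial gradients (0 = identical edges)."""
    ga, gb = np.gradient(a), np.gradient(b)
    return float(np.mean([np.abs(da - db).mean() for da, db in zip(ga, gb)]))

def similarity_loss(deformed, target, w_ncc=1.0, w_gd=1.0):
    """Combined similarity loss: minimize (1 - NCC) plus the gradient difference."""
    return (w_ncc * (1.0 - ncc(deformed, target))
            + w_gd * gradient_difference(deformed, target))
```

Note that a constant intensity offset between the two images, such as an inter-fraction HU shift, leaves both the NCC and GD terms unchanged, which is part of the motivation for preferring them over mean squared error here.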
ADV() denotes the adversarial loss, which is computed as the binary cross-entropy loss of the discriminator on the deformed and target images. The purpose of the adversarial loss is to encourage the deformed image to approach a realistic CBCT image by penalizing unreasonable DVFs and unrealistic deformed images.
R(DVF) denotes the regularization term.
Weighted first and second derivatives of the DVF are included in the regularization term to enforce general smoothness of the predicted DVF. The weights of the first- and second-derivative terms are set to 1 and 0.5, respectively, in this study.
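A minimal sketch of such a regularizer is given below, assuming mean absolute first derivatives and pure (non-mixed) second derivatives per displacement component; the exact derivative formulation used by the authors is not specified, so this is illustrative only.

```python
import numpy as np

def dvf_regularization(dvf, w1=1.0, w2=0.5):
    """Smoothness regularizer R(DVF): weighted mean absolute first and
    second spatial derivatives of the displacement field.

    dvf : array of shape (3, X, Y, Z), one displacement component per axis.
    The weights 1 and 0.5 follow the values stated in the text.
    """
    first = second = 0.0
    for comp in dvf:                       # each displacement component
        grads = np.gradient(comp)          # first derivatives along x, y, z
        first += np.mean([np.abs(g).mean() for g in grads])
        # pure second derivatives d2/dx2, d2/dy2, d2/dz2
        second += np.mean([np.abs(np.gradient(g, axis=i)).mean()
                           for i, g in enumerate(grads)])
    return w1 * first + w2 * second
```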
The four loss-weighting hyperparameters are empirically set to 200, 1, 10 and 5, respectively, according to numerical experiments.
2.2 Datasets and Experiments
One hundred fractional CBCT images of 20 abdominal cancer patients who underwent radiotherapy were retrospectively investigated. The CBCT images have a voxel size of 0.90 mm × 0.90 mm × 2.0 mm and a matrix size of 512 × 512 × 88. These images were acquired before each fraction during each patient's five-fraction treatment course. Fiducial markers were implanted in the patients for tumor localization and external beam treatment planning.
The overall performance of the proposed method was investigated via five-fold cross-validation. Specifically, the CBCT image data of the 20 patients were first randomly and equally separated into five groups, of which four were used for training and the remaining one for testing. The training and testing experiments were repeated five times, rotating each group as the testing group.
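The fold rotation can be sketched as follows; the patient indices and random seed are illustrative.

```python
import numpy as np

def five_fold_splits(n_patients=20, seed=0):
    """Randomly partition patient indices into five equal groups and
    rotate each group as the held-out test fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_patients)
    groups = np.array_split(order, 5)
    for k in range(5):
        test = groups[k]
        train = np.concatenate([groups[j] for j in range(5) if j != k])
        yield train, test
```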
In a holdout test, 105 fractional CBCTs from a cohort of 21 different abdominal cancer patients were investigated to evaluate the proposed method. The trained STN was tested on the holdout dataset without re-training or parameter fine-tuning.
Qualitative evaluations of the proposed method were performed by visually assessing the alignment between the target and deformed CBCT images. Both the fusion images and the absolute difference images between the target and deformed images were generated for the visual assessment. The absolute intensity difference profiles along a line in the anterior-posterior direction are also plotted to demonstrate the accuracy of image alignment. To demonstrate the efficacy of the integrated attention gates, DIR results with and without the attention gates were compared.
For quantitative evaluations, mean absolute errors (MAEs) and normalized cross correlations (NCCs) between the target and the deformed CBCT images were calculated. The target registration errors (TREs) and dice similarity coefficients (DSCs) were also calculated.
The TRE was calculated as the Euclidean distance between landmark positions in the target and deformed CBCT images. The implanted fiducial markers, together with two or more additional landmarks identified by an experienced medical physicist, were used for the TRE calculation. Four examples of the selected landmarks are shown in Fig. 3(a1) as red '×' marks, of which two were implanted fiducials and the other two were the tips of the spinous process and a rib. These landmarks were selected because physicians usually prescribe to "match to target" or "match to bone" for on-treatment patient setup. With the position of the i-th landmark of patient K in the moving fraction and the position of its corresponding landmark in the target fraction, the TRE for patient K was calculated as the mean Euclidean (L2) distance over all landmark pairs.
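Assuming landmark positions are given in voxel coordinates with the voxel spacing stated in Sec. 2.2, the TRE computation can be sketched as:

```python
import numpy as np

def tre(landmarks_a, landmarks_b, spacing=(0.90, 0.90, 2.0)):
    """Target registration error: mean Euclidean (L2) distance, in mm,
    between corresponding landmarks in two fractions.

    Landmark arrays have shape (n_landmarks, 3) in voxel coordinates;
    the voxel spacing converts the distances to millimetres."""
    diff = (np.asarray(landmarks_a) - np.asarray(landmarks_b)) * np.asarray(spacing)
    return float(np.linalg.norm(diff, axis=1).mean())
```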
The MAE for patient K was calculated as the mean absolute HU difference between the target and deformed images of patient K, evaluated within the image mask of the bounding box of the patient body; this mask was determined by all tissues/organs with HU values higher than -300 HU in this study.
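A minimal sketch of the masked MAE, assuming images in HU and the -300 HU body threshold stated above:

```python
import numpy as np

def body_mask(image, threshold=-300.0):
    """Binary mask of the patient body: voxels with HU above -300."""
    return image > threshold

def mae(deformed, target, mask=None):
    """Mean absolute HU error between deformed and target images inside the mask."""
    if mask is None:
        mask = body_mask(target)
    return float(np.abs(deformed[mask] - target[mask]).mean())
```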
The NCC for patient K was calculated as the normalized cross-correlation between the mean-subtracted deformed and target images, with the element-wise summation taken over the 3D CBCT volumes within the same body mask as in the MAE calculation of Eq. 5.
The DSC for patient K was calculated as the Dice similarity coefficient between the binary masks of the bony structures in the deformed and target images, i.e., twice the element-wise overlap of the two masks divided by the sum of their volumes. The bony-structure masks for the DSC calculation in this study were determined by tissues/organs with HU values higher than 300 HU.
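A minimal sketch of the DSC computation, using the 300 HU bone threshold stated above (the small epsilon guarding against empty masks is an illustrative addition):

```python
import numpy as np

def bone_mask(image, threshold=300.0):
    """Binary mask of bony structures: voxels with HU above 300."""
    return image > threshold

def dsc(deformed, target, threshold=300.0):
    """Dice similarity coefficient between the bony-structure masks."""
    a, b = bone_mask(deformed, threshold), bone_mask(target, threshold)
    inter = np.logical_and(a, b).sum()
    return float(2.0 * inter / (a.sum() + b.sum() + 1e-8))
```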
3 Results
Fractional CBCT images of an abdominal cancer patient during the treatment course are shown in Fig. 3 to demonstrate the overall registration results of the proposed method. In Fig. 3, the first fraction is shown in the upper left corner as the target CBCT image, and the subsequent four fractions (the following four columns), registered with on-treatment manual rigid registration, the proposed method without attention gates in the generators of the GlobalGAN and LocalGAN, and the proposed method, are shown in the first to third rows, respectively. Three observations can be drawn from Fig. 3: 1) the appearance of the stomach region can differ substantially in each treatment fraction (orange arrows); 2) the overall body (anterior-posterior) location closely resembles the target image with the proposed method, but not with manual rigid registration or the proposed method without attention gates (yellow arrows); 3) the anatomical morphology of the internal organs, fiducial markers, and even the streak artifacts is also changed, which is believed to be induced by the LocalGAN. Overall, the proposed method achieves better global and local alignment due to the multi-scale registration scheme realized by the GlobalGAN and LocalGAN.
For better visibility of the registration quality, fusion images of the registered fractions with the target fraction of the same patient shown in Fig. 3, at a different slice location, are shown in Fig. 4. In the fusion images, the deformed images and the target image are shown in the red and green channels, respectively, and they are labeled with the same subfigure labels as in Fig. 3 for consistency. Since the appearance of the stomach region differed from fraction to fraction, suboptimal registration quality can be found there (orange arrows). As shown by the yellow arrows in Fig. 4, the liver is well registered to the target region with the proposed method, but not with the other two methods. Bony structures are also commonly used as indicators of registration quality; for the ribs and the spine (red arrows), the proposed method (c2 – c5) outperformed the other two approaches (a2 – a5, b2 – b5).
The difference images between the registered images (a2 – a5, b2 – b5 and c2 – c5) and the target image (a1) in Fig. 4 are shown in Fig. 5. The difference images carry the same subfigure labels and the same arrows as Fig. 4. The proposed method shows better alignment both globally and locally, evidenced by the lower HU differences, than the other two methods. Despite the challenges in the stomach region, the proposed method shows the lowest HU difference there (orange arrows). Consistent with the previous observations, the proposed method outperformed the other two methods at the liver and spine boundaries, as indicated by the yellow and red arrows.
Shown in Fig. 6 are the absolute HU difference profiles along the black dotted line in Fig. 5. The large HU differences anteriorly (voxel index < 20) and posteriorly (voxel index > 300) were caused by body contour misalignment, and those near the spine (240 < voxel index < 300) were caused by spine misalignment. Compared with the other two methods, the proposed method performs best, with the lowest HU differences.
The average TREs over all patients are listed in Table 1. The average TRE is less than 2 mm for the proposed method, much lower than for the other two methods. The smaller TRE standard deviations of the proposed method also demonstrate better registration robustness. Two-sample t-tests on the TRE results of the three investigated methods showed that the TRE improvement of the proposed method over the other two is statistically significant. In contrast, the TRE improvement of the proposed method without attention gates over on-treatment manual rigid registration was not statistically significant, demonstrating the efficacy of the attention gates.
The average MAEs between the deformed CBCT images and the target images are listed in Table 2. The same conclusion as in the TRE analysis can be drawn, which strengthens the argument that the proposed method with attention gates performs best among the three methods.
The NCCs and DSCs, two further important metrics of image registration quality, were calculated and are shown in Tables 3 and 4, respectively. From these two tables, the same conclusion can be drawn: the proposed method substantially outperformed the other two methods with statistical significance.
Table 5 shows the average TREs, MAEs, NCCs and DSCs for all patients in the holdout dataset. The uniform improvement of the proposed method over the other two methods across all four metrics demonstrates its robustness and generalizability on the holdout dataset.
Table 1. Average TREs per fraction.
| Fraction | Manual Rigid | Proposed w/o AG | Proposed |

Table 2. Average MAEs per fraction.
| Fraction | Manual Rigid | Proposed w/o AG | Proposed |

Table 3. Average NCCs per fraction.
| Fraction | Manual Rigid | Proposed w/o AG | Proposed |

Table 4. Average DSCs per fraction.
| Fraction | Manual Rigid | Proposed w/o AG | Proposed |

Table 5. Average TREs, MAEs, NCCs and DSCs on the holdout dataset.
| Manual Rigid | Proposed w/o AG | Proposed |
4 Discussion
Over the course of radiotherapy treatment, a method that provides fast and accurate inter-fraction CBCT image registration is essential for evaluating geometric and anatomic changes. With the proposed registration tool, quantitative anatomic changes could be calculated for inter-fractional variation modeling and prediction, which could in turn inform the physician in future treatment planning, such as target margin definition, image guidance frequency, and the tradeoff between target coverage and OAR sparing. The proposed inter-fraction CBCT image registration could also enable many applications such as image segmentation [RN27], motion estimation [RN7, RN28], image fusion [RN29] and treatment response evaluation [RN30, RN31]. Deep learning-based DIR is promising for the online DIR task of large-volume CBCT images in radiotherapy. In this work, an unsupervised deep learning-based inter-fraction CBCT registration method, which takes less than 3 seconds to perform a CBCT-CBCT registration, is proposed and its feasibility and performance are investigated through qualitative and quantitative evaluations. The proposed method also performs well on the holdout dataset without re-training. The major contributions of the proposed workflow can be summarized as:
a) A multi-scale unsupervised deep learning-based DIR method is proposed for inter-fraction CBCT DIR in image-guided radiotherapy. The unsupervised training of the proposed STN-based network overcomes the challenge of collecting large amounts of ground truth data, whether by manual alignment, which is labor-intensive, or by artificial synthesis, which is error-prone. The integration of the GlobalGAN and LocalGAN networks captures image misalignment in a multi-scale manner; its effect on the DIR can be observed in Figs. 3 – 6.
b) The self-attention mechanism enabled by the attention gates in the generator is essential for learning the differences between major and minor motion regions and for avoiding deformation of the bony structures. The attention gate operation helps to retain only relevant activations and to remove irrelevant or noisy responses. Neuron activations during both the forward and backward passes are filtered by the attention gates, and gradients originating from the image background are down-weighted during the backward pass. The self-attention mechanism allows model parameters in shallow layers to be updated based on the spatial regions most relevant to the task at hand, i.e. motion estimation, so that the generator network can highlight the features from previous layers and better represent the image motion. Thus, the integration of attention gates in the generator network substantially improves the capture of structural differences between the moving and target images.
c) An adversarial network is integrated into the proposed framework to enforce additional DVF regularization by penalizing unrealistic deformed images. Since the networks are trained in an unsupervised manner, DVF regularization is necessary to generate realistic results. A smoothness constraint has been commonly used in the literature for DVF regularization [RN7]; however, the smoothness constraint alone is insufficient for realistic DVF prediction, especially when the network is trained in a completely unsupervised manner. Therefore, a discriminator is integrated into the STN for additional DVF regularization. Because the discriminator learns to differentiate deformed images from target images, unrealistic deformed images are penalized and the generator is encouraged to predict realistic DVFs. The speed of the inference stage is not affected, as the discriminator is only used in the training stage.
Because the input image pairs are down-sampled while being encoded through 11 convolutional layers in the generator network, bilinear interpolation is used to up-sample the predicted DVFs to the same matrix size as the input images. As an alternative, transpose-convolution layers with trainable parameters could be used to up-sample the DVFs [RN32]. However, we found that bilinear interpolation, which contains no trainable parameters, performs much better than transpose-convolution layers in predicting accurate DVFs. A likely reason is that bilinear interpolation tends to generate smooth DVFs, which are desirable in medical image registration, whereas transpose-convolution layers often generate unrealistic DVFs even with a heavily weighted DVF smoothness regularization term [RN21, RN33].
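A minimal sketch of such separable linear up-sampling, applied to one DVF component at a time, is shown below; it assumes an integer up-sampling factor per axis and is illustrative rather than the network's actual layer.

```python
import numpy as np

def upsample_linear(volume, factor):
    """Separable linear up-sampling of a 3D array by an integer factor per
    axis, applied one axis at a time (trilinear interpolation overall).
    Because the output is a convex combination of neighbouring samples,
    the result is inherently smooth and bounded by the input values."""
    out = np.asarray(volume, dtype=float)
    for axis in range(3):
        n = out.shape[axis]
        x_old = np.arange(n)
        x_new = np.linspace(0.0, n - 1.0, n * factor)
        out = np.apply_along_axis(lambda v: np.interp(x_new, x_old, v), axis, out)
    return out
```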
No preprocessing was applied to the fractional CBCT images to improve image quality before deformable image registration with the proposed method. Therefore, suboptimal CBCT image quality, such as inter-fraction variations in HU values and streak artifacts, could impact registration accuracy. Deep learning-based image synthesis approaches have been investigated and have shown promising results in improving image quality [RN34]; incorporating such approaches into the proposed method might improve the registration results. Further investigation of this topic is needed in future work.
It was observed that the inter-fraction shape and position of the organs in some abdominal patients may vary significantly due to gas filling, bowel movement and/or respiratory motion. This is an extremely challenging situation for both unsupervised and supervised deep learning image registration, because such large shape and position variations may exceed the modeling capacity of the proposed method and may also make manual labeling of ground truth impossible. The proposed method may fail to accurately register the inter-fraction CBCTs of these patients; future research is needed for this situation.
In this study, only a limited set of qualitative (visual inspection of the deformed, fusion and difference images and of the intensity profiles) and quantitative (TRE, MAE, NCC and DSC) evaluations was performed. These metrics are not directly related to clinical outcomes. More clinical investigations, e.g. dose-volume histograms and patient follow-up, are needed to conclude whether the proposed method can be applied effectively in image-guided radiotherapy.
5 Conclusions
An unsupervised deep learning-based CBCT-CBCT registration method was developed, and its feasibility and performance were investigated. The proposed method accurately registers images between the moving and target CBCT fractions within three seconds in a single forward network prediction; as such, it is expected to be a fast and straightforward image registration tool for motion management and treatment planning in image-guided radiotherapy.
This research is supported in part by the National Institutes of Health under Award Numbers R01CA215718 and R01EB028324.
The authors declare no conflicts of interest.