Radiation therapy treatment planning is a complex process, as the target dose prescription and normal tissue sparing are conflicting objectives. The lowest achievable dose for each individual organ-at-risk (OAR) is unknown a priori. Multiple iterations between the planner and the physician may be required to reach the optimal balance between target coverage and OAR sparing. Recent planning tool developments have focused on improving the efficiency of the planning iteration process. Interactive multi-criteria optimization (MCO) allows the planner or the physician to explore the trade-offs among the possible solutions and eliminates the communication lag in the plan iteration process Craft2012; Monz08; osti_20853459; havij; BREEDVELD20191. Knowledge-based planning (KBP) takes a data-driven approach: it learns from past high-quality clinical plans and predicts a patient-specific dose-volume histogram (DVH) and OAR dose constraints for new patients by using the relationship between geometric and dosimetric information derived from historical patient plans in the library Wu_2009; Wu_2011; Yuan_2012; Appenzoller_2012; Wu_2014; Moore_2011; Shiraishi_2016; Wu_2013.
Advancements in the field of artificial intelligence (AI), especially in data-driven machine learning (ML) and deep learning (DL) algorithms for many challenging computer vision problems, have inspired many researchers in radiation oncology. Deep learning and, more specifically, convolutional neural network (CNN) architectures have significantly improved imaging and vision tasks. In particular, UNet Ronneberger_2015, a complex architecture initially designed for biomedical image segmentation, has notably improved the performance of predicting the radiation dose distribution in the body without going through a real planning process Nguyen_2019; Siri2019; Nguyen2019; Montero2019. The predicted dose distribution provides both visual input and DVH metrics to assist the physician’s trade-off decision-making up front, so that more achievable planning objectives can be set prior to the planning process.
While DL models hold promise as accurate predictors of the expected dose distribution prior to planning, the heterogeneity of data is a common challenge for AI modeling. Despite nationally accepted practice guidelines that set the baseline of patient care, actual clinical practice is rarely clearly defined in black and white. Patient-specific medical reasons such as hip prosthesis and previous radiation can lead to deviation from the guideline. Strategies for solving a specific dosimetric trade-off problem will vary among different physicians and planners Chan2018.
Treatment planning systems and optimization algorithms also introduce variations into clinical practice. Plans generated by different practice styles that meet the national guidelines, in terms of plan quality, can end up with different spatial dose distributions. A DL dose prediction model built based on a particular dataset from one institution may not work well for different treatment planning styles even within the same institution. Therefore, the ability to adapt a pre-built DL dose prediction model to a given planning/practice style is desirable.
Furthermore, expertise in AI modeling is not widely accessible in various clinical settings in the field of radiation oncology. Although some AI solutions are commercially available and the vendor provides AI expertise, building an initial model still requires a clean and large dataset of treated patient cases. This requires a tremendous effort from the user to collect and curate enough data for modeling. Sharing models between institutions could alleviate the challenges associated with building an initial AI model, but the heterogeneity in practice mentioned above would still yield unsatisfactory performance. Thus, the ability to easily adapt a pre-built dose prediction model to a different practice style could lead to practical clinical implementation of DL dose prediction models in real-world clinical settings.
The goals of this work are to investigate the problem of generalizing DL-based dose prediction models and to utilize transfer learning to adapt a DL prostate VMAT dose prediction model to various planning/practice styles with minimal data from each individual style. We built a source model based on 108 patients treated with VMAT for prostate cancer in a large institution. This work tests two types of practice heterogeneity: first, different planning styles in the same institution where we built the source model; and second, planning practices at a different institution. The source model was adapted to three internal planning styles and one external planning style with 14-29 training cases for each model. To our knowledge, this is the first work to study the DL-based dose prediction model generalizability problem and to utilize transfer learning to provide a practical solution.
2.1 Data Groups
For this study, we selected a total of 248 cases of prostate cancer treated with the VMAT technique. Of these, 188 cases came from one institution and were planned with the Eclipse treatment planning system (TPS) utilizing the progressive resolution optimizer for VMAT optimization. The remaining 60 cases came from an external institution and were planned with the Erasmus-iCycle TPS van_Haveren_2019, a system for automated multi-criteria treatment planning. Table 1 summarizes the datasets and shows the dose distributions of individual styles.
The source model was trained with the “Source” dataset, which consists of 108 plans in the ‘conformal’ dose style, the most representative style of planning in our institution. In the same institution, we found three additional planning styles for prostate cancer VMAT treatments, represented by three target datasets: “Internal-A,” “Internal-B,” and “Internal-C”. The Internal-A and Internal-B styles are slightly more aggressive in OAR sparing than the “Source” style. The Internal-C style is a more extreme approach that allows a higher dose to femoral heads to spare the bladder and rectum. We use these three target datasets to demonstrate the problem of model generalizability and to test the feasibility of using a small dataset for transfer learning by using 14-29 cases from each target dataset. We use 2-5 test cases to evaluate the target model’s performance.
The “External” style represents plans from a different institution. As shown in Table 1, the dose distribution of the External style has many intermediate dose spikes in both the lateral and the anterior directions. Given the large differences between the External and the Source styles, we trained the target model with 20 cases and thoroughly evaluated it with 40 cases.
In this work, we implemented a 3D UNet Ronneberger_2015 as the model architecture, as UNet has been extensively used in radiation therapy for dose prediction Nguyen_2019; Siri2019; Nguyen2019; Montero2019. The inputs to this architecture are the contours of the planning target volume (PTV) and OARs, including bladder, rectum, left and right femoral heads, and body, which were formatted as binary masks with a size of each. Due to the memory limits of a 16 GB GPU (NVIDIA K80), we implemented patch-based training, with a patch size of . At each training iteration, the patches were randomly selected from the binary mask data based on a Gaussian sampling scheme proposed by Nguyen17. We also implemented group normalization wu2018group after each set of convolution and rectified linear unit (ReLU) operations, which helps accelerate the convergence of the network. The model is trained to learn the mapping between the binary masks of the PTV and OARs and the clinical radiation dose distribution in the body. Mean squared error (MSE) was the loss function, and the Adam optimizer with a learning rate of was used to minimize the MSE. The final network consists of 85 layers with 7,870,177 trainable parameters.
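The patch-based training step can be illustrated with a minimal NumPy sketch. It is not the sampling scheme of Nguyen17, whose details are not given here: it simply assumes that patch centres are drawn from a Gaussian centred on the PTV and then clipped so that the patch stays inside the volume. The function names, the volume size, the patch size, and the standard deviations are all hypothetical values chosen for illustration.

```python
import numpy as np

def sample_patch_origin(volume_shape, patch_shape, center, sigma, rng):
    """Draw the corner index of a patch whose centre is Gaussian-distributed around `center`."""
    origin = []
    for size, patch, c, s in zip(volume_shape, patch_shape, center, sigma):
        # Sample a patch centre along this axis, then convert it to a start index.
        start = int(round(rng.normal(loc=c, scale=s))) - patch // 2
        # Clip so that the whole patch stays inside the volume.
        origin.append(int(np.clip(start, 0, size - patch)))
    return tuple(origin)

def extract_patch(channels, origin, patch_shape):
    """Crop the same spatial window from every channel of a (C, D, H, W) array."""
    z, y, x = origin
    d, h, w = patch_shape
    return channels[:, z:z + d, y:y + h, x:x + w]

rng = np.random.default_rng(0)
# Hypothetical sizes: 6 mask channels (PTV, bladder, rectum, two femoral heads, body).
masks = np.zeros((6, 32, 48, 48), dtype=np.float32)
masks[0, 12:20, 20:28, 20:28] = 1.0  # toy PTV mask
# Assumption: the Gaussian is centred on the PTV centroid.
ptv_center = [float(np.mean(idx)) for idx in np.nonzero(masks[0])]
origin = sample_patch_origin(masks.shape[1:], (16, 16, 16), ptv_center, (4.0, 8.0, 8.0), rng)
patch = extract_patch(masks, origin, (16, 16, 16))
print(origin, patch.shape)
```

In a training loop, the same origin would be used to crop the dose volume so that each input patch stays aligned with its target patch.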
2.3 Transfer Learning
For both the internal target datasets and the external institution dataset, the transfer learning method implemented for style transfer involves three components: 1) freezing the weights of the first half of the model, 2) training only the second half of the model, and 3) randomly reinitializing the weights of the very last layer of the model. Assuming that the major variations in dose distributions are due to different planning styles, we can freeze the first half of the UNet, which extracts features from the anatomy alone. We can then train just the second half of the network to learn the dose distributions of a different planning style. Finally, to help the network converge and to prevent it from falling into a local minimum, we reinitialize the weights in the final layer of the network. The hyperparameters remain unchanged, except for the learning rate, which was reduced to one-tenth of the original value.
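The three components above can be sketched in a framework-agnostic way. The snippet below is an illustration under stated assumptions, not the implementation used in this work: the `Layer` container, the function name, the four-layer toy network, the initialization scale of 0.02, and the source learning rate of 1e-3 are all hypothetical, and how "first half" maps onto the actual 85-layer UNet is not specified here.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Layer:
    name: str
    weights: np.ndarray
    trainable: bool = True

def prepare_for_style_transfer(layers, source_lr, rng):
    """Apply the three steps to a list of layers ordered from input to output."""
    half = len(layers) // 2
    for i, layer in enumerate(layers):
        # Steps 1 and 2: freeze the first half, keep the second half trainable.
        layer.trainable = i >= half
    # Step 3: randomly reinitialize the weights of the very last layer.
    last = layers[-1]
    last.weights = rng.normal(0.0, 0.02, size=last.weights.shape)  # scale is an assumption
    # The learning rate is reduced to one-tenth of the source value.
    return source_lr / 10.0

rng = np.random.default_rng(0)
# Toy four-layer stand-in for the network; names and shapes are hypothetical.
layers = [Layer(f"conv{i}", np.ones((3, 3))) for i in range(4)]
target_lr = prepare_for_style_transfer(layers, source_lr=1e-3, rng=rng)
print([(layer.name, layer.trainable) for layer in layers], target_lr)
```

In a deep learning framework, the `trainable` flag would correspond to excluding the frozen parameters from gradient updates, for example by disabling gradient tracking on them or by passing only the second-half parameters to the optimizer.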
2.4 Model Evaluation
We first applied the Source model to the Source test cases to evaluate the model’s performance on its own planning style. We then applied the Source model to the test cases of the three internal and one external target styles to investigate the model’s performance on other planning styles and to set the baseline prediction quality before transfer learning. Individual target models trained via transfer learning were each applied to the test cases of their own planning style to investigate the prediction quality after adapting the Source model to the given target style. To further evaluate the improvement achieved by the transfer learning, we cross-compared each of the doses predicted by the Source and target models with the respective clinical plan dose.
We compared the absolute doses predicted by the DL models to the clinical plan doses without further normalization. We evaluated the spatial dose distributions and the dose-volume histograms (DVH). We calculated DVH metrics including structure mean and maximum doses (i.e., D2), PTV D98, and PTV D95. The clinical plan dose served as the baseline for the dose comparison. We calculated the differences in the DVH metrics between the clinical plan dose and the predicted doses to quantify the DVH agreement. In addition, we calculated the Dice similarity coefficient (DSC), $\mathrm{DSC} = 2|A \cap B| / (|A| + |B|)$, where $A$ and $B$ are the isodose volumes of the clinical plan dose and the predicted dose, for isodose volumes from 10% to 100% of the prescribed dose to quantify the agreement of the spatial dose distribution between the clinical plan dose and the predicted doses. We used a paired t-test to calculate the statistical significance of the results for the External style, which had enough test cases for a meaningful test.
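As a rough illustration of these evaluation metrics, the sketch below computes DVH metrics and isodose DSC values with NumPy. It assumes the usual conventions: D_x is the dose received by at least x% of the structure volume (the (100 - x)th percentile of the voxel doses), and the DSC of two volumes A and B is 2|A ∩ B| / (|A| + |B|). The function names, the 10% step between isodose levels, and the toy dose arrays are assumptions for illustration, not study data.

```python
import numpy as np

def dvh_metrics(dose, mask):
    """Mean dose and D2/D95/D98 for the voxels inside a binary structure mask."""
    d = dose[mask.astype(bool)]
    # D_x is the minimum dose to the hottest x% of the volume,
    # i.e. the (100 - x)th percentile of the voxel doses.
    return {
        "Dmean": float(d.mean()),
        "D2": float(np.percentile(d, 98)),
        "D95": float(np.percentile(d, 5)),
        "D98": float(np.percentile(d, 2)),
    }

def isodose_dsc(clinical, predicted, prescription, levels=range(10, 101, 10)):
    """Dice similarity coefficient of isodose volumes at each percentage level."""
    result = {}
    for pct in levels:
        threshold = prescription * pct / 100.0
        a = clinical >= threshold   # isodose volume of the clinical plan
        b = predicted >= threshold  # isodose volume of the prediction
        denom = a.sum() + b.sum()
        # DSC = 2|A ∩ B| / (|A| + |B|); undefined when both volumes are empty.
        result[pct] = float(2.0 * np.logical_and(a, b).sum() / denom) if denom else float("nan")
    return result

# Toy volume with hypothetical numbers (not study data).
clinical = np.arange(1.0, 101.0).reshape(4, 5, 5)
predicted = clinical - 10.0
body = np.ones_like(clinical)
metrics = dvh_metrics(clinical, body)
dsc = isodose_dsc(clinical, predicted, prescription=100.0)
print(metrics)
print(dsc)
```

Differences in DVH metrics between the clinical and predicted doses can then be obtained by calling `dvh_metrics` on both dose arrays with the same structure mask and subtracting the results.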
Figure 1 shows a typical example of the dose agreement between the clinical plan and the Source model prediction for the Source style test cases. Upon visually inspecting the dose distribution and the DVH, one can see that the Source model predicts the PTV dose correctly with minimal differences in the OAR doses. The violin plots present the differences in the PTV and the OAR DVH metrics between the clinical plan and the Source model prediction. The differences in the PTV DVH metrics are mostly within 3%, as reflected by the width of the individual data clusters. The medians of the PTV dose differences are all within 2%. Similarly, the medians of the mean and maximum OAR dose differences are within 1% and 2%, respectively.
Figure 2 illustrates the Source model and target model predictions on the test cases of individual target styles. The improvement in the model performance after transfer learning can be seen in the dose comparisons among the clinical plan, the Source model prediction and the target model prediction. The Source model fails to predict the style-specific dose distribution features. However, the target models have learned those dose distribution features via transfer learning, so they predict a distribution similar to the clinical plan and exhibit a better agreement in the DVH, especially for the Internal-C and External styles.
We compared each of the Source and target model predictions to the respective clinical plan dose and calculated the DSC of the individual isodose volumes. DSC increases with dose agreement, and a value of 1 indicates a perfect match. As presented in Figure 3, the target models improved upon the Source model in terms of DSC regardless of the planning style. However, the Internal and the External styles exhibit different trends. For the Internal styles, both the Source and the target models predict the high dose bath (70% to 100%) fairly well. The main improvements appear in the intermediate dose bath (30% to 60%). The mean DSC ranges between 0.81-0.94 for the Source model applied to the three Internal styles and between 0.82-0.91 for the corresponding target models. The largest improvement is seen in the Internal-C style, where the target model achieves a 7% improvement over the Source model after transfer learning (mean DSC of 0.80 for the Source model and 0.87 for the Internal-C target model). For the External style, the target model improves significantly upon the Source model throughout all isodose volumes (), with the improvement increasing as the isodose level decreases. Average mean DSC improvements of 5% and 8% were achieved in the high and the intermediate dose volumes, respectively. A systematic 10% improvement was seen in the low dose volume.
Figure 4 illustrates the performance of the Source and the External target models in predicting the DVH metrics of the External style test cases. Overall, as indicated by the length of the data clusters, the Source model demonstrates a large variation in the prediction quality, while the External target model demonstrates more consistent and accurate predictions. The Source model overestimates the PTV doses, especially for the Dmean and the D2. The External target model significantly improved the predictions and resulted in mean differences within 1.6% (Table 2). In terms of the OAR dose predictions, the target model achieved an agreement within 1.5%. In summary, the External target model improved upon the Source model by up to 6.4% and 6.0% in the PTV and the OAR DVH predictions, respectively ().
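The significance statements for the External style rest on the paired t-test described in the evaluation section, in which each test case contributes one value from the Source model and one from the target model. A minimal standard-library sketch of the paired t statistic follows. The function name and the four sample values are hypothetical, and the numbers are synthetic rather than taken from this study.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    # t = mean difference divided by its standard error.
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Synthetic per-case DSC values for illustration only (not study data).
target_dsc = [0.88, 0.90, 0.86, 0.91]
source_dsc = [0.80, 0.82, 0.79, 0.85]
t_value, dof = paired_t_statistic(target_dsc, source_dsc)
print(t_value, dof)
```

The p-value is then obtained from the t distribution with `dof` degrees of freedom, which statistical packages provide directly.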
4 Discussion and Conclusion
There has been growing attention on leveraging AI-based decision support tools (DST) to improve the treatment planning process. Incorporating AI DST as a part of the clinical pathway Craft2012 could standardize evidence-based practices to ensure high quality and cost-effective medical care. AI algorithms, including the DL dose prediction model in this work, require large repositories of high quality data. This becomes one of the barriers to widespread deployment of AI-based solutions in the radiation oncology field. In this study, we utilized transfer learning to solve the data size problem and demonstrated the ability to adapt a source model to three different internal planning styles and one external planning style with minimal data input.
Heterogeneity in clinical practices, whether intra- or inter-institutional, is commonly seen. The clinical protocols and guidelines set the floor of treatment plan acceptance, but users set the ceiling for their own practice. Table 1 clearly shows that, even within the same institution, the individual trade-off preferences of different physicians and planners in combination result in a variety of dose distribution styles. On top of these differences in practice, the different treatment planning system employed in the External institution in this study yielded dramatically different dose distributions from the Source style. The treatment plans in the Source style were optimized with a specific objective function (normal tissue objective, NTO) to shape the dose fall-off conformally, whereas plans from the External institution were optimized with a unique 23-beam starting condition of the VMAT optimization that led to distinct spikes in the low dose bath. Regardless of the dose distribution style, they are all clinically accepted plans that comply with the clinical guideline. The AI model’s adaptation to the user’s practice style can achieve standardization with respect to the user’s prior experience. This would be practical and beneficial for implementing AI-guided planning in practice, because the user can precisely interpret the predicted dose for meaningful clinical use.
Directly deploying a model built in a different practice style may lead to unsatisfactory predictions. As shown in the Source model predictions on the cases planned with target styles in Figure 2, the predicted doses inherit the conformal dose distribution of the Source style. They meet the planning objectives, but they fail to represent the dose distribution features of the target styles. This reflects a common frustration of institutions that are trying to clinically implement a model provided by a vendor or developed by a different institution. In this work, we demonstrated that transfer learning can adapt a source model to various practice styles.
Transfer learning with an additional 14-29 cases in the target style allowed the target models to learn the features of the new planning style quickly. Taking the External style as an example, the distinct low dose spikes were precisely predicted by the External target model. We saw this improvement in all target styles, and it increased as the variation between the Source and target styles increased (Figure 2). For example, the Internal-A and Internal-B styles pull the dose from the rectum only slightly more aggressively than the Source style, so the inherent differences are small. Therefore, the Source model still achieved a mean DSC of more than 0.78, which the target models only slightly outperformed. In contrast, the Internal-C style turns off the NTO and intentionally trades femur doses for lower rectum and bladder doses, so it differs more substantially from the Source style. Accordingly, the Source model predictions on the Internal-C style cases demonstrated a significant dip in DSC in the intermediate dose levels, but the Internal-C target model improved the DSC to a satisfactory level. For the External style, which is fundamentally different from the Source style, the External target model improved upon the Source model’s DSC by up to 10% and had a DSC higher than 0.87 among all test cases and isodose volumes.
The intermediate dose bath of a VMAT plan is harder to predict because of its complex nature. The variations in trade-offs and planning approaches can strongly polarize the spatial dose distribution in this dose range. Moreover, with the large number of beam angles in VMAT optimization, the optimizer has more options to pull the dose off a specific area, which makes the dose harder to predict. Traditionally, DL dose predictors have had a hard time handling these style-specific features. This is reflected in Figure 3, which shows that the DSC of the Source model prediction ranges from 0.80 to 0.94. However, with transfer learning, the target models can represent these dose features, and the DSC of the target model predictions increased to 0.87-0.99, which is satisfactory.
In light of the barriers to accessing the expertise of AI modeling, collecting sufficient data to train a model, and sharing data among institutions, a broad AI tool that can easily adapt a maturely built model to different practice styles could allow more practices to access and clinically implement an AI model for dose prediction. We used prostate VMAT as an example to demonstrate the feasibility of leveraging transfer learning for model adaptation. However, this methodology can be employed for other AI tools in the field of radiation oncology and allow for widespread application of such tools.
We would like to thank the National Institutes of Health (NIH) for supporting this study through a research grant (R01CA-237269) and Dr. Jonathan Feinberg for editing the manuscript.