The importance of evaluating the complete automated knowledge-based planning pipeline

10/31/2019 ∙ by Aaron Babier, et al. ∙ 0

We determine how prediction methods combine with optimization methods in two-stage knowledge-based planning (KBP) pipelines to produce radiation therapy treatment plans. We trained two dose prediction methods, a generative adversarial network (GAN) and a random forest (RF) with the same 130 treatment plans. The models were applied to 87 out-of-sample patients to create two sets of predicted dose distributions that were used as input to two optimization models. The first optimization model, inverse planning (IP), estimates weights for dose-objectives from a predicted dose distribution and generates new plans using conventional inverse planning. The second optimization model, dose mimicking (DM), minimizes the sum of one-sided quadratic penalties between the predictions and the generated plans using several dose-objectives. Altogether, four KBP pipelines (GAN-IP, GAN-DM, RF-IP, and RF-DM) were constructed and benchmarked against the corresponding clinical plans using clinical criteria; the error of both prediction methods was also evaluated. The best performing plans were GAN-IP plans, which satisfied the same criteria as their corresponding clinical plans (78 However, GAN did not necessarily provide the best prediction for the second-stage optimization models. Specifically, both the RF-IP and RF-DM plans satisfied all clinical criteria 25 worst performing planning), respectively. GAN predictions also had a higher mean absolute error (3.9 Gy) than those from RF (3.6 Gy). We find that state-of-the-art prediction methods when paired with different optimization algorithms, produce treatment plans with considerable variation in quality.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Automated knowledge-based planning (KBP) is a data-driven approach that uses previous radiation therapy treatments to generate high quality plans for patients diagnosed with cancer. KBP is typically conceptualized as a two-stage pipeline (see Figure 1

). In the first stage, a machine learning (ML) model uses contoured CT images to predict the dose that should be delivered to a patient. In the second stage, an optimization model uses the dose prediction from the first stage to generate fluence maps or a set of beam apertures. In the past decade, there has been significant research in improving KBP, focusing either on advancing the machine learning or the optimization stage independently of each other

[Wu:2009aa, PCAZhu:2011aa, PCAYuan:2012aa, appenzoller:2012predicting, Yang:2013aa, Shiraishi:2015aa, Shiraishi:2016ab, Kearney:2018aa, Fan:2018aa, Nguyen:2019aa]. In this work, we explore whether the interaction between the prediction and optimization model affects the overall quality of the final plans.

Figure 1: Overview of the automated knowledge-based planning pipeline.

Several prediction models have been developed that can accurately predict aspects of a clinical dose distribution from a patient’s anatomy. Originally, these prediction models required the engineering of useful features from patient information and only predicted simple summaries (e.g., desirable DVHs) that could be used to design objectives for inverse planning software [Wu:2009aa, PCAZhu:2011aa, PCAYuan:2012aa, appenzoller:2012predicting, Yang:2013aa, Shiraishi:2015aa, Shiraishi:2016ab, Babier:2018b]

. However, modern deep learning techniques can learn useful features in order to predict dose directly from CT images

[GANCER, Kearney:2018aa, Fan:2018aa, Nguyen:2019aa, Babier:2019]. These high-dimensional predictions contain more information and permit better integration into more sophisticated automated KBP pipelines [GANCER].

The dominant optimization models for KBP are inverse planning (IP) [Babier:2018a] and dose mimicking (DM) [petersson:2016]. The choice of the parameters, objectives, and constraints in these models can lead to final treatment plans with characteristics that differ significantly from the initial predictions. For example, a prediction model may produce dose distributions that consistently predict excess dose to an OAR, but an optimization model with an objective to minimize the dose to that OAR may be able to correct for this bias. As a result, prediction models that produce dose distributions with good criteria satisfaction may not necessarily produce final plans with the same properties. Constructing effective automated KBP pipelines, therefore, requires careful selection of both the prediction and optimization model.

In this paper, we perform the first comparison between different combinations of prediction and optimization models in KBP; each model was previously validated in a KBP pipeline [McIntosh:2017ab, Babier:2019]. In total, we consider two dose prediction methods—a generative adversarial network [Babier:2019] and a random forest [McIntosh:2017aa]—and two optimization methods—inverse planning [Babier:2018a] and dose mimicking [petersson:2016]. We then evaluate the four corresponding KBP pipelines (see Figure 2) using a large dataset of 217 patients with oropharyngeal cancer. We observe that the choice of both the prediction and optimization model can significantly affect the quality of the final plans generated by a KBP pipeline.

Figure 2: Overview of the automated knowledge-based planning pipelines evaluated in this paper. Solid lines connect the prediction and optimization methods that have been tested together in (a) Babier:2019 and (b) McIntosh:2017aa; dashed lines connect the methods that have not been tested in the extant literature.

2 Methods and Material

We used CT images with contours, which highlight the regions-of-interest (ROIs), and dose distributions from clinically accepted treatment plans to train two dose prediction models that were then tested on out-of-sample clinical plans. The resulting predicted dose distributions were then passed through each optimization model to generate fluence-based treatment plans. Figure 2 gives an overview of the pipelines, which were then evaluated in terms of the quality of plans they produced.

2.1 Data

For this research ethics board approved study, we obtained plans for 217 oropharyngeal cancer treatments delivered at a single institution with 6 MV, step-and-shoot, intensity-modulated radiation therapy. All plans were prescribed 70 Gy and 56 Gy in 35 fractions to the gross disease (PTV70) and elective target volumes (PTV56), respectively; in 130 plans there was also a prescription of 63 Gy to the intermediate-risk target volume (PTV63). The organs-at-risk (OARs) were the brainstem, spinal cord, right parotid, left parotid, larynx, esophagus, mandible, and the limPostNeck, which is an artificial structure used to limit dose to the posterior neck.

2.2 Prediction models

We trained two state-of-the-art dose prediction models with the same 130 plans from our dataset and used the remaining 87 for out-of-sample testing.

2.2.1 Generative adversarial network

Our conditional generative adversarial network (GAN) model [isola:2017image] is based on Babier:2019

and uses two convolutional neural networks: (1) a generator that produces a dose distribution from a contoured CT image; and (2) a discriminator that tries to differentiate between the artificially generated dose and the actual clinical dose (see Figure 

3). The generator is trained to minimize the mean absolute difference between the artificially generated image and the ground truth (i.e., clinical dose). The objective is regularized by the discriminator to make the output of the generator indistinguishable from a real clinical dose distribution. We then normalize the resulting dose generated by GAN so that it satisfies all target criteria.

Figure 3: Overview of GAN training and testing phases.

2.2.2 Random forest

The random forest model is a slight variation of the RF from McIntosh:2017ab. It uses the 148 features summarized in Table 1 to predict the dose delivered to each voxel independently. Of these features, 122 were generated by applying Gaussian filters (GFs) to the grayscale CT images. One of the GFs was isotropic () and a second was the Laplacian of the Gaussian (). The remaining 120 filters were made from all combinations of the following four parameters: (a) first and second order GFs; (b) and ; (c) rotations of and degrees; and (d) rotations in each of the three axes. RF was trained to minimize the mean squared difference between the prediction and the ground truth using the default settings of randomForestRegressor from scikit-learn.

Feature Quantity Description
Structure 11

Structure voxel is classified as (one-hot-encoded)

-coordinate 1 Voxel’s positions on the -axis in a slice
-coordinate 1 Voxel’s positions on the -axis in a slice
-coordinate 1 Plane of voxel’s slice
ROI distance 11 Voxel’s distance from surface of each ROI
CT gray-scale 1 Voxel’s gray-scale in the CT image
GF CT gray-scale 122 Voxel’s gray-scale in CT image post GF
Table 1: The features used in RF to predict the dose for each voxel.

2.3 Optimization models

For all out-of-sample patients, the output from each prediction model was passed to both optimization models which were solved using Gurobi 7.5. The complexity of all generated treatment plans was constrained to a sum-of-positive-gradients (SPG) value of 55 [Craft:2007aa]

. SPG was used since it is a convex surrogate for the physical deliverability of a plan and the parameter 55 was chosen as it is two standard deviations above the average clinical SPG 

[Babier:2018b]. Both optimization models used the same set of targets and healthy structures . Each target was a planning target volume (PTV) with a prescribed dose . The healthy structures contained in were the brainstem, spinal cord, right parotid, left parotid, larynx, esophagus, mandible, and limPostNeck. Each target structure and healthy structure was divided into a set of voxels and , respectively.

The KBP-generated plans were delivered from nine equidistant coplanar beams at angles 0, 40, …, 320. Those beams were divided into a set of beamlets , which make up one fluence map at each beam angle. The relationship between the intensity of beamlet and dose deposited to voxel was determined using the influence matrix generated by the IMRTP library from A Computational Environment for Radiotherapy Research [CERR], and it is given by

2.3.1 Inverse planning

We followed a previously developed two-stage approach to inverse planning [Babier:2018a]. In the first step, we estimate the objective weights for a conventional inverse planning model that makes a predicted dose distribution optimal. In the second step, the estimated weights are used to re-solve the conventional inverse planning optimization model and construct a treatment plan. The objective to be minimized was a sum of 65 functions: seven per OAR and three per target. The objectives for the OARs were the mean dose, maximum dose, and the average dose above 0.25, 0.50, 0.75, 0.90, and 0.975 of the maximum predicted dose to the OAR. The objectives for the target were the maximum dose, average dose below prescription, and average dose above prescription.

2.3.2 Dose mimicking

Our dose mimicking (DM) model minimized the sum of one-sided penalties to generate a plan that performs as close as possible to the predicted dose on several voxel- and structure-based objectives. Two types of OAR objectives were used. The first was a voxel-based objective that minimizes the dose that exceeds the predicted dose for each voxel :


The second was a structure-based objective that minimizes the maximum dose until it no longer exceeds the maximum predicted dose:


Three types of target objectives were also used. The first was a voxel-based objective to minimize the average deviation below the prescribed target dose , which is the average underdose to target . Specifically, the objective function penalizes dose until the plan underdose is no worse than what was predicted for each voxel . The objective is formulated as:


Similarly, the second objective was also voxel-based, however, it minimizes the average deviation above the prescribed target dose , which is the average overdose to target . Specifically, the objective function penalizes dose until the plan overdose is no worse than what was predicted for each voxel . The objective is formulated as:


The final target objective was structure-based. It maximizes the minimum dose to the target until it exceeds the minimum dose that was predicted for the target:


To form the dose mimicking optimization problem, we constructed linearized forms of equations (1)-(5) by introducing appropriate auxiliary variables and constraints and summed those terms in the objective function. We divided each voxel-based objective by the number of voxels in its respective structure. The conceptual DM model is

subject to

2.4 Performance analysis

We evaluated four distinct KBP pipelines based on the plans they produced; predictions were also evaluated because they are an important intermediate step. We refer to the four sets of KBP plans as GAN-IP, RF-IP, GAN-DM, and RF-DM. We evaluated the predicted and plan dose distributions in terms clinical criteria, the difference in the performance of each optimization model when the same set of predictions is used as input, and the prediction error. Details of these performance metrics are presented below.

Criteria Satisfaction

We quantified the quality of KBP plans by how often they satisfied the same clinical criteria presented in Table 2. Specifically, we examined how often the plans satisfied the same criteria as the clinical plans in each class of criteria (i.e., OARs, targets, and all ROIs, which includes both OARs and targets). Finally, we evaluated the quality of the prediction models to determine whether criteria satisfaction in the predicted dose distribution is an early indicator of final plan quality.

Structure Criteria
Brainstem 54 Gy
Spinal Cord 48 Gy
Right Parotid 26 Gy
Left Parotid 26 Gy
Larynx 45 Gy
Esophagus 45 Gy
Mandible 73.5 Gy
PTV56 53.2 Gy
PTV63 59.9 Gy
PTV70 66.5 Gy
Table 2: The planning criteria used for evaluation: is the minimum dose to 99% of the structure volume, is the mean dose to a structure, and is the maximum dose to a structure.

Optimization performance differences

For each clinical planning criterion (Table 2

), we evaluated the difference in dose between plans generated with an identical set of predictions but a different optimization model; the differences between the two optimization models (i.e., IP and DM) are visualized with a box plot. We then used a two-sided Mann-Whitney U test to determine if plans generated by IP were the same (null hypothesis) or different (alternative hypothesis) from those generated by DM for the population of plans generated from each set of predictions. For these and all subsequent hypothesis tests,

was considered significant.

Prediction performance differences

We evaluated the error of each prediction method by evaluating the median absolute difference between the predicted and clinical dose distributions across each ROI for every out-of-sample plan. The error is visualized with a box plot and we used a two-sided Mann-Whitney U test to determine if GAN had the same (null hypothesis) or a different (alternative hypothesis) prediction error than RF.

3 Results

Criteria Satisfaction

Table LABEL:clinCrit summarizes the performance of the predicted and plan dose distributions. RF-DM plans achieved similar OAR criteria satisfaction to the clinical plans most often (83.9%). However, GAN-IP plans satisfied the target criteria 28.8% more often than RF-DM plans, and achieved close to RF-DM performance on the OAR criteria (4.6 percentage points less). Across all ROIs, the proportion of GAN-IP plans satisfied the same criteria as the corresponding clinical plans 17% more often than its closest competitor (RF-IP). Additionally, while GAN-IP performed better than RF-IP, GAN-DM performed worse than RF-DM, which suggests that there is an interaction effect between the prediction and optimization model that must be accounted for.

In Table LABEL:clinCrit, we also compare the predictions to the clinical plans. We emphasize that unlike the generated plans, i.e., IP and DM plans, the predictions are only an intermediate step in the KBP pipeline. Here, we found that GAN predictions exhibited poor performance on all OAR criteria (24.1%) which we attribute to the poor performance on the mandible criteria (13.6%). The performance of RF and GAN predictions over all target criteria was similar. Overall, RF predicted that plans could satisfy the same criteria as the clinical plans in 78.2% of cases, which far exceeded GAN predictions (24.1%). Most importantly, however, these results do not carry through to the final plans. That is, only GAN-IP plans achieved the same proportion of All ROI criteria (78.2%) that was predicted by RF.

Predictions IP plans DM plans


Brainstem 100.0 98.9 100.0 100.0 100.0 98.9
Spinal Cord 97.7 100.0 100.0 100.0 90.8 97.7
Right Parotid 64.7 64.7 94.1 76.5 76.5 82.4
Left Parotid 81.8 54.5 81.8 54.5 81.8 90.9
Larynx 71.4 89.8 91.8 87.8 81.6 93.9
Esophagus 100.0 100.0 100.0 100.0 100.0 100.0
Mandible 13.6 100.0 81.8 74.2 36.4 93.9


PTV56 100.0 97.8 97.8 95.7 100.0 93.5
PTV63 100.0 98.0 100.0 98.0 100.0 100.0
PTV70 100.0 100.0 100.0 98.3 98.3 58.6


OARs 24.1 80.5 79.3 69.0 41.4 83.9
Targets 100.0 97.7 98.9 95.4 98.9 70.1
ROIs 24.1 78.2 78.2 66.7 41.4 56.3
Table 3: The percentage of plans that satisfied the same clinical criteria as the clinical plans. Only IP and DM plans use the full KBP pipeline

Optimization performance differences

In Figure 4, we present a box plot to compare the quality of plans from different optimization models when the same prediction model was used as input. The plot shows how the plans generated by IP differ from those generated by DM in terms of the dose delivered to each clinical planning criterion relative to the dose threshold of that criterion. On average, IP was better than DM by 2% when GAN predictions were used as input. However, we found no difference between plans generated by IP and DM when RF predictions were used as input. We also found that IP performed better than DM in 69.5% and 50.8% of all evaluation criteria when the inputs were from GAN and RF, respectively. Statistically, when the GAN predictions were used as input, the plans generated by IP and DM performed differently on the clinical criteria (). However, we observed no difference () when RF predictions were used as input to the optimization models. Overall, we observed that the performance of each optimization model was dependent on the prediction model that was used.

(a)  GAN plan differences
(b)  RF plan differences
Figure 4:

The difference in terms of clinical planning criteria between plans generated by IP and DM where the input to both models are (a) GAN predictions and (b) RF predictions. Positive differences imply that the IP plan was better than the DM plan in that criterion. The boxes indicate median and interquartile range (IQR). Whiskers extend to the minimum of 1.5 times the IQR and the most extreme outlier.

Prediction performance differences

In Figure 5, we present the distribution of mean absolute differences between the predicted and clinical dose over the regions of interest (i.e., the mean absolute error between the predictions and clinical plans). Although both models had the same median prediction error across all OARs (4.3 Gy), RF error across targets (1.3 Gy) was much lower than GAN error (3.0 Gy). Overall, GAN predictions had higher median error across all ROIs (3.9 Gy) than RF predictions (3.6 Gy), and these predictions errors were significantly different ().

GAN predictions RF predictions

Figure 5: The distribution of average dose differences between KBP prediction and clinical dose over all ROIs. The boxes indicate median and IQR. Whiskers extend to the minimum of 1.5 times the IQR and the most extreme outlier.

4 Discussion

Historically, each stage of KBP has been developed in isolation with a focus on improving the prediction stage. In this paper, we show that there are interaction effects between the prediction and optimization stages of KBP that significantly affect the quality of the generated plans. Our experimental setup consists of four KBP pipelines that were assembled from two existing KBP methods, i.e., Babier:2019 and McIntosh:2017aa (see Figure 2). Overall, the best performing combination of prediction and optimization methods was the GAN and IP. However, we also demonstrate that predictions that produce good plans with one optimization model (e.g., GAN-IP) do not always produce good plans with another optimization model (e.g., GAN-DM).

Although both RF and GAN predict 3D dose distributions, they differ in their approach. RF predicts the dose to each voxel independently of every other voxel. In contrast, GAN predicts the dose to all voxels simultaneously, thereby making predictions that are conditioned on the predictions of neighboring voxels. RF generally produces predictions that are more similar to clinical plans on summary statistics like mean absolute dose difference (Figure 5

). This is likely because GAN optimizes a regularized loss function that encourages realistic looking images. This results in predictions that have worse performance on summary statistics as compared to a model like RF that minimizes the squared difference between predictions and the ground truth without regularization.

The quality of deliverable plans depends heavily on the combination of the prediction and optimization components used to construct the KBP pipeline. For example, combining GAN and IP results in plans that perform well on average in terms of satisfying clinical criteria. Interestingly, the KBP pipelines that perform the best contain stages that use same order loss and objective functions (i.e., linear-linear or quadratic-quadratic). Namely, GAN (trained with mean absolute loss) and IP (optimized with a linear objective function) produce the best IP plans. Similarly, RF (trained with mean squared loss) and DM (optimized with a quadratic objective function) produce the best DM plans.

When considering OAR and target criteria satisfaction as the two key components, we observe that there is no single two-stage KBP pipeline that dominates all others. While GAN-IP performs at least as well as RF-IP and GAN-DM on both metrics, RF-DM outperforms GAN-IP on OAR criteria satisfaction; we conjecture that this is because GAN predictions perform poorly on the mandible criterion. Inverse planning includes a specific objective that minimizes the dose to the mandible, so even if the predictions (incorrectly) assume that mandible criterion satisfaction is unimportant, the mandible objective in IP improves mandible criterion performance. In contrast, dose mimicking attempts to construct dose distributions that are no worse than the predictions (in terms of Equations (1) - (5)), which generally leads to less improvement on the mandible criterion. Due to the biased nature of GAN predictions towards the mandible, IP can help to improve the single weak criterion with minimal expense to the criteria. On the other hand, IP is unable to exploit any significant under-performance in RF predictions, which generally perform well across most criteria.

A limitation of our work is that, although we identified that the prediction and optimization stages of KBP affect the overall quality of the plans they generate, we were unable to isolate the root cause of those effects. A second limitation is that the computational resources required for this analysis scales exponentially with the number of prediction and optimization models considered. As a result, it is computationally intensive to determine what existing optimization model should be paired with a new approach to dose prediction (and vice versa).

5 Conclusion

This study demonstrates that the performance of an automated KBP pipeline is dependent on how well the prediction and optimization models perform together. As a result, we recommend that new prediction methods should be tested with multiple optimization models before they are considered to be state-of-the-art (and vice versa).

6 Acknowledgments

This research was supported in part by the Natural Sciences and Engineering Research Council of Canada.