Patients undergoing radiotherapy (RT) are prone to develop radiation-related Adverse Effects (AEs) [2, 24, 6]. To improve the design of future multi-modality treatments, clinicians are interested in better understanding the relationship between radiation dose and onset of AEs. Modern research efforts in this direction delve into dosimetric details, employing dose distribution metrics to a specific organ (or sub-volume) as explanatory variables. Such rich information is obtained by simulating the RT plan on 3D imaging of the patient (i.e., CT scans) with organ segmentations in a Treatment Planning System (TPS) [9, 11, 4].
Unfortunately, when so-called late AEs (onset can be decades after RT) need to be studied, it is not always possible to straightforwardly obtain detailed information on dose distributions . For patients who underwent RT before the use of planning CTs became commonplace (in the following, historical patients), 2D radiographs were used for treatment planning (e.g., this was the case until the 1990s in the Netherlands 
), meaning no 3D anatomical imaging is available. Consequently, no simulations can be performed in a TPS to estimate 3D dose distributions for these patients[21, 25, 19]. The information available for historical patients normally consists of what was reported in treatment records, e.g., features of the patient such as age and gender, and features of the plan such as prescribed dose, geometry of the plan, and the use of blocks. Additionally, 2D radiographs can be available, from which information can be gathered on the internal anatomy (mainly bony anatomy, as internal organs are normally not clearly distinguishable), and on the plan configuration with respect to the patient’s anatomy [17, 24].
To improve the understanding of late AEs, recent research is striving to develop increasingly accurate dose reconstruction methods, i.e., methods to estimate the 3D dose distribution received by historical patients [21, 19, 35, 16]. State-of-the-art approaches employ phantoms, i.e., 3D surrogates of the human anatomy upon which the RT plan can be simulated, to compute the dose distribution. Phantoms exist in different forms: physical or virtual, made by simple geometrical shapes or by adopting and morphing actual CT scans and organ segmentations [21, 35, 16]. Generally, phantoms are built to represent average anatomies, for categories of patients (e.g., for a certain age range), and are collected into so-called phantom libraries [5, 20, 12]. Whenever dose reconstruction for a historical patient is needed, the phantom that represents the category that the patient belongs to is retrieved from the library and used as surrogate for simulation of the RT plan.
As the largest source of error related to phantom-based dose reconstruction comes from the mismatch between the anatomy of the phantom and the true anatomy of the patient 
, it is important to define the best way to match phantoms to patients. This issue is still under research, and different approaches employ different heuristic matching criteria that are normally hand-crafted and based upon statistics and guidelines drawn from large population studies (e.g., ICRP89, NANTHES)[22, 5, 20, 12]. However, the use of heuristic matching criteria has been hypothesized to be too simplistic to capture the high variability of internal human anatomy [8, 12, 35, 29, 32]. For example, a popular phantom-based dose reconstruction approach uses solely age and gender for surrogate matching . Our group’s recent work focusing on Wilms’ tumor (the most common type of kidney cancer for childhood cancer patients) irradiation for pediatric patients showed that utilizing surrogate CTs using age- and gender-based matching can lead to poor dose reconstruction quality in individual cases .
To improve the resemblance of a surrogate phantom, there have been efforts to replace the normally hand-crafted heuristic matching criteria with data-driven decisions. For example, statistical models inferred from CTs and 3D organ segmentations of adult patients have been used to drive a deformable image registration procedure that adapted 3D organ segmentations to the 2D anatomy of a specific patient, given features of the latter as measurable from 2D radiographs [19, 18]. Using a state-of-the-art Machine Learning (ML) algorithm, it has been shown that features typically available for historical patients treated for Wilms’ tumor can be linked to different 3D anatomy similarity metrics based on organ segmentations and CTs . Our group recently proposed an automatic pipeline that uses ML to steer the assembling of a new original anatomy based on 3D CTs and organ segmentations of multiple patients using the features of a historical patient [30, 31]. However, it is important to realize that maximizing some form of overall anatomical resemblance is difficult. Moreover, from the standpoint of optimizing dose reconstruction accuracy, it can be considered sub-optimal for RT dosimetry purposes. This is because in RT dosimetry, what part of anatomy is most meaningful largely depends on the particular RT plan .
To the best of our knowledge, although both patient anatomy and plan geometry play a key role in determining dose-volume metrics for Organs At Risk (OARs), existing dose reconstruction approaches focused solely on patient anatomy information, to obtain a representative surrogate. Plan information is used only later, to calculate the dose on the surrogate. The purpose of this article is to develop and validate an ML approach to predict dose-volume metrics for OARs based on patient anatomy and plan geometry information. Specifically, we propose to use ML to directly learn what dose-volume metrics for an OAR are likely given information on the patient and on the plan, without the need to select or craft any surrogate anatomy. We argue that this is a sensible choice because ML can directly be trained upon what ultimately matters, i.e., dose reconstruction accuracy. In this article we present our ML-based organ dose reconstruction approach and its validation, that was performed on a relatively large dataset of artificial plans, as well as on a smaller dataset of clinical plans.
2 Materials & Methods
We considered pediatric flank RT, and in particular RT for Wilms’ tumor, as an application for our dose reconstruction method, in continuity with our previous work. The choice to focus on pediatrics is because children are the most prone to develop late AEs , and are typically underrepresented in existing phantom libraries . Moreover, more than 85% of pediatric patients survives Wilms’ tumor five years or longer, but considerable chances of the onset of late AEs remain .
2.1 Patient data
To be able to create a ground-truth to learn dose-volume metrics from, CT scans were needed. Hence, a total of 142 pediatric planning CTs were collected by involving the following institutes (number of CTs in brackets): Amsterdam University Medical Centers / Emma Children’s Hospital (), University Medical Center Utrecht / Princess Máxima Center for Pediatric Oncology (), The Christie NHS Foundation Trust (), Princess Margaret Cancer Centre (), and Institute of Oncology Ljubljana (). Five further CTs were collected from the Amsterdam University Medical Centers and kept aside to be used exclusively for an additional validation step (Sec. 2.5).
The inclusion criteria were: patient age at scan acquisition between 1 to 8 years; the CT field of view including a common abdominal region from the tenth thoracic (T10) vertebral body to the first sacral (S1) vertebral body; presence of five lumbar vertebrae (rare cases of patients with six exist); patient scanned in supine position; quality of CT sufficient to perform organ segmentations. The patients underwent RT between 2002 and 2018, mostly but not exclusively for abdominal cancers. The median CT slice resolution was mm, the median slice thickness was mm.
As we focused on Wilms’ tumor treatment, four OARs were considered: the liver, the spleen, the contralateral kidney (left or right, depending on the side of the tumor), and the spinal cord (between T10 and S1). To provide accurate and consistent OAR segmentations, we carefully prepared the OAR segmentations in all CTs (): for 60 CTs, pre-existing clinical segmentations of OARs were manually improved and approved (I.W.E.M. van Dijk, checked by B.V. Balgobind only for difficult cases). To aid the manual segmentation of the OARs for the remaining CTs (n=87), the software ADMIRE (research version 2.3.0) from Elekta (Elekta AB, Stockholm, Sweden) was used to generate multi-atlas based automatic segmentations using the previous 60 CTs as atlas. These segmentations were further manually checked slice-by-slice and possibly adapted (Z. Wang, I.W.E.M. van Dijk), and finally checked and approved (I.W.E.M. van Dijk, checked by B.V. Balgobind only for difficult cases). Some patients did not have both kidneys intact, due to nephrectomy prior to RT. The number of CTs that had only a complete right kidney, only a complete left kidney, and two complete kidneys were 36, 40, and 71, respectively.
2.2 Automatic generation of artificial Wilms’ tumor plans
A method to automatically generate historical-like abdominal flank irradiation plans (i.e., artificial plans) for Wilms’ tumor treatment based on information visible on 2D radiographs was created, in order to obtain large plan variations.
Figures 1(a) and 1(c) illustrate examples of actual historical plans on respective historical radiographs. As can be observed from the examples, a typical historical flank irradiation field is a rectangular area, with possible shielding blocks, that is located on the right or on the left flank. Flank irradiation is done by beams from anterior-posterior (AP) and posterior-anterior (PA) direction. Along right-left (RL), one field border is located at the edge of the patient’s body contour, while the other is located as to include the vertebral column . In some cases, blocks were placed to protect OARs from irradiation (Fig. 1(c)). In historical plans the isocenter was positioned in the center of the treatment field that is projected on the coronal plane (Fig. 1) and at the middle of the patient’s AP abdominal diameter.
To generate artificial plans, two reference digitally reconstructed radiographs (DRRs) were considered, randomly selected from the data. One DRR was derived from a CT of a 5-year old female patient without nephrectomy (ref 1 in Fig. 2), and the other was derived from a CT of a 4-year old female patient with nephrectomy of the left kidney (ref 2 in Fig. 2). Upon these two DRRs, boundaries defining plan variability were identified by an experienced pediatric radiation oncologist (B. V. Balgobind), to ensure that generated plans would appear to be reasonable according to historical clinical guidelines. Note that historical clinical guidelines are slightly different from current ones (e.g., currently the iliac crests should be safeguarded, unlike in Fig. 1(c)). Figure 2 shows two examples of landmark locations identifying possible plan variations, on the two reference DRRs. Specifically, given the boundaries of possible isocenter positions and field borders, plans with a rectangular field were generated by sampling uniformly within those boundaries. For each plan generated, an additional version of that plan including one block was generated as well. A block was simulated as the area in the upper lateral corner enclosed by the border of the rectangular field and a line crossing two randomly sampled endpoints. The endpoints were sampled from two regions roughly covering the start and end points of rib 9 and rib 12 on the DRRs (regions indicated by the green boxes in Fig. 2). This way, a sampled block covered part of the liver (in right-sided plans) or part of the spleen (in left-sided plans). All plans consist of two opposing and symmetrical beams in AP-PA directions irradiating one side of the abdominal flank. Figures 1(b) and 1(d) illustrate two examples of sampled artificial plans (without or with a block) on respective DRRs.
A total of 300 artificial plans were generated automatically, of which 150 without a block, and 150 with a block. The random sampling of the plan side led to 142 left-sided plans and 158 right-sided plans (roughly half-half). The same set of plan features used in our previous work was considered to generate plans in DICOM RTPLAN format (e.g., gantry and collimator angles, isocenter location, field sizes) .
2.3 Generation of the dataset for ML
Figure 3 summarizes the pipeline used to generate the dataset for ML. Firstly, we emulated each of the 300 artificial plans on each of the 142 CT scans by the automatic plan emulation method proposed in our previous work , leading to a total of 42,600 emulations. The method automatically transfers a plan prepared on one CT to another CT (with quality comparable to human experts), using landmark detection upon the respective DRRs. Secondly, for each of the 42,600 plan emulations, dose-volume metrics of interest (see Sec. 2.3.1) were collected for the different OARs by use of our automatic dose computation pipeline . The pipeline used the collapsed cone dose calculation algorithm of Oncentra TPS (version 4.3, Elekta AB, Stockhom, Sweden). Thirdly, features that are plausible to be available for typical historical cases were collected from the anatomy of the included CTs as visible in the respective DRRs, from the artificial plans, and from the relationship between anatomy and plan geometry.
2.3.1 Response variables: dose-volume metrics
To select dose-volume metrics to use as response variables for ML, we considered metrics typically used to validate state-of-the-art dose reconstruction approaches [19, 16], and typically found to be of clinical relevance in studies of AEs in adults (e.g., QUANTEC) [17, 4, 10]. Studies on dose-volume response relationships for pediatric patients (so-called PENTEC studies ) are currently limited.
This reasoning led us to consider mean organ dose (), two levels of percentage of OAR volume receiving at least Gy (), and the minimum dose received by the maximally exposed 2 cubic centimeters of an OAR (), the latter being similar but more robust than the maximum dose to a single point. Typically, and are considered. However, instead of , we decided to use in this work since our plans have a prescribed dose of 14.4 Gy (thus was always 0 for the OARs). Regarding , we decided to include this metric because peak dose values to a small OAR portion may be relevant to explain late AEs related to OARs that work in a serial fashion (e.g., the spinal cord).
2.3.2 Explanatory variables: features of patients and plans
To assess what information can be available for historical patients, we considered the Dutch records of the Emma Children’s Hospital/Academic Medical Center childhood cancer survivor cohort, who underwent RT between 1966 and 1996 . For this cohort, along with historical patient records and treatment plan details, 2D coronal radiographs were consistently taken, hence providing partial information on the anatomy.
). For 12% of the patients, height and/or weight data were missing and preliminary experiments using automatic imputation methods showed no benefit in including them.
For the features related to anatomical geometry, anatomical landmarks from DRRs were detected automatically using the landmark detection method in our previous work . Note that the landmarks concerned only bony anatomy because other internal anatomy tissues are not reliably visible in historical radiographs. Importantly, we normalized features related to measurements of anatomy and anatomy-plan geometry configuration (e.g., rib-cage width, field sizes, distances between landmarks and the isocenter) by the width and height of the respective DRR they were measured from (after the DRRs were cropped to a same region of interest between T10 and S1). This was done because when plans are emulated, they are scaled based on proportions derived from the landmarks [33, 34]. Since differences in anatomy solely due to overall anatomy scaling do not result in different dose-volume metric values, these differences should not be accounted for by the explanatory variables (confirmed in preliminary experiments).
The abdominal diameter in AP () is the only anatomical feature not measurable from DRRs generated along AP/PA direction. In historical RT, it was measured using a ruler to determine the isocenter position along the AP axis, and was subsequently reported in the records. For our cohort, was not reported in the records because a CT scan was used for RT planning. We therefore measured automatically on the CT scans, by using a pre-determined isocenter position of typical abdominal flank irradiation plans. In particular, the intervertebral disk between the 1st and the 2nd lumbar vertebra (L1 and L2) was used to determine the isocenter position along the inferior-superior (IS) axis, and the center of mass of the kidney was used to determine the isocenter position along the RL axis (as the aim of Wilms’ tumor plans is to irradiate the renal fossa). For CTs including both kidneys, two were measured, and the average was taken. Conversely, only one was measured for CTs including a single kidney.
In our simulations we used for all artificial plans the same fractionation scheme (8 1.8 Gy), beam energy (6 MV), and prescribed dose (14.4 Gy at isocenter). These settings are the most common in historical records, and are still valid in the current Wilms’ tumor RT protocol . Moreover, choosing a specific prescribed dose (e.g., 14.4 Gy) does not limit generalizability, since the dose distribution over the entire anatomy depends linearly on the prescribed dose. Thus, if dose reconstruction for a historical case using a different prescription is needed, the dose-volume metrics predicted by the models trained on plans with a 14.4 Gy prescribed dose can be re-scaled.
For both fields with and without a block, the features representing field sizes in RL and IS directions ( and ) were set by simply considering the full rectangular area (i.e., irrespective of blocking). For fields with a block, the slope of the block (note that the block is formed by Multi-Leaf Collimators (MLCs)) and the ratio between the blocked region and non-blocked region of the field () was computed. In addition, we considered features that relate to how the plan was configured with respect to the patient’s anatomy, based on the position of the isocenter and of the bony landmarks. For instance, links the bottom of the T10 vertebra to the position of the isocenter in RL direction. Figure 4 shows an example of the anatomical landmarks and plan geometrical borders used to calculate features describing plan configuration with respect to the patient’s anatomy.
|Age||Records||years||patient age at CT scanning|
|ArmsUp||DRR||yes/no||whether the patient had arms in a raised position during scanning|
|Records||cm||patient AP diameter measured at isocenter|
|Nephrectomy||Records||yes/no||whether the patient underwent nephrectomy|
|DRR||cm||width (in RL) of right-part of the rib cage (from vertebral column to location of right-most rib)|
|DRR||cm||width (in RL) of left-part of the rib cage (from vertebral column to location of left-most rib)|
|DRR||cm||average vertebral column width|
|DRR||cm||length (in IS) of the vertebral column from T11 to L4|
|Plan||cm||field width (in RL)|
|Plan||cm||field length (in IS)|
|FieldSide||Plan||right/left||whether the plan concerns left-sided or right-sided flank irradiation|
|Plan||cm||distance (in RL) between isocenter and block endpoint of the top field border|
|Plan||%||, 0 for block-free plans|
|Plan||-||of the block (see Fig. 4); 0 for block-free plans|
|Plan||angle of collimator system with respect to gantry system|
|Plan DRR||cm||RL distance between bottom of T10 and isocenter|
|Plan DRR||cm||IS distance between bottom of T10 and isocenter|
|plan DRR||cm||RL distance between right border of T12 and isocenter|
|Plan DRR||cm||IS distance between right border of T12 and isocenter|
|Plan DRR||cm||RL distance between left border of T12 and isocenter|
|Plan DRR||cm||IS distance between left border of T12 and isocenter|
|Plan DRR||cm||RL distance between bottom of L1 and isocenter|
|Plan DRR||cm||IS distance between bottom of L1 and isocenter|
|Plan DRR||cm||RL distance between right border of L2 and isocenter|
|Plan DRR||cm||IS distance between right border of L2 and isocenter|
|Plan DRR||cm||RL distance between left border of L2 and isocenter|
|Plan DRR||cm||IS distance between left border of L2 and isocenter|
|Plan DRR||cm||RL distance between bottom of L4 and isocenter|
|Plan DRR||cm||IS distance between bottom of L4 and isocenter|
|Plan DRR||cm||RL distance between location of right-most rib and isocenter|
|Plan DRR||cm||IS distance between location of right-most rib and isocenter|
|Plan DRR||cm||RL distance between location of left-most rib and isocenter|
|Plan DRR||cm||IS distance between location of left-most rib and isocenter|
Abbreviations: R (in superscript): right, L (in superscript): left, RL: right-left, AP: anterior-posterior, IS: inferior-superior, IC: isocenter, VC: vertebral column, W: width, L in and : length.
2.3.3 Dataset for supervised learning
Features and dose-volume metrics were finally collected in a dataset. The dataset corresponded to a 2D matrix, where the rows represented patient-plan combinations, i.e., examples (), and the columns represented features () and response variables ( for each OAR).
2.4 Machine learning
In the following sections we describe how ML was performed in terms of training and validation on the artificial plans. We further introduce the ML algorithm adopted, and describe an independent validation on clinical cases.
2.4.1 Training and evaluation of ML models
Since dose metrics are scalars, we treated the learning problem as a regression problem. We trained a separate ML model for each combination of dose-volume metric and OAR.
Preliminary analysis showed that right-sided plans and left-sided plans led to markedly different distributions of possible dose-volume metric values for all OARs except for the spinal cord. Thus, ML models were set to be composed of two sub-models, each to be trained independently on a particular sub-set of the data based on plan side (right or left).
The quality of the models was estimated with a 5-fold cross-validation. This means that a random partition of 1/5th of the total number of patients and plans was held out (test set), and training was performed on the remaining data. Then, the prediction error was measured on the test set. This process was repeated five times, each time considering a different data partition for the test set. No patient nor plan that was in the test set was included in the data at training time.
Each training step included hyper-parameter tuning by grid-search with internal 5-fold cross-validation (upon the training set), as well as feature selection (which resulted in eight features being systematically discarded, see the supplementary material A). For each dose-volume metric, the Root Mean Square Error (RMSE) loss was used, i.e.,
where are the ground truth values and are the model predictions for the dose-volume metric , and is the total number of rows in the training set. The RMSE was chosen to regularize ML, i.e., to penalize larger errors more .
To account for the stochastic nature of the ML algorithm employed (see Sec. 2.4.2
) and for the random partitioning of the data, the 5-fold cross-validation was repeated ten times. The averages and standard deviations over thevalidation results (five folds repeated ten times) were considered.
To put the results of ML into perspective, for each cross-validation, a baseline prediction was considered that simply used the average dose-volume metric observed in the training data. The Wilcoxon signed-rank test was used to assess whether the results obtained by ML and by this baseline are significantly different (-value ).
2.4.2 Machine learning algorithm
The recently introduced Genetic Programming version of the Gene-pool Optimal Mixing Evolutionary Algorithm (GP-GOMEA) was considered as ML algorithm, as it was found to achieve competitive performance on a variety of benchmark problems[28, 27], as well as in previous work concerning radiotherapy [26, 30].
In addition, GP-GOMEA can perform symbolic regression, i.e., it can generate a regression ML model in the form of a (symbolic) mathematical expression. GP-GOMEA incorporates Information Theory methods that enable it to synthesize fairly accurate expressions of particularly compact size, an aspect that makes these expressions lightweight and fast to execute, and that can aid human-interpretability. Details on the hyper-parameters (and their tuning) adopted in GP-GOMEA are described in the supplementary material B.
2.5 Independent evaluation on clinical plans
As aforementioned, the 300 plans used to cross-validate our approach were generated with an automatic sampling procedure (Sec. 2.2). To assess whether our results on artificial plans can be valid for clinically-used plans, we further evaluated our approach on an independent dataset for which clinical plans were crafted manually.
For this validation, we trained ML models (as a reminder, one model per OAR - dose-volume metric combination) on the dataset using the 142 CTs and the 300 artificial plans, and evaluated their prediction accuracy on a separate set of five CTs each associated with two clinical plans. We gathered five clinical plans (three right-sided, two left-sided) for these five CTs. Under the supervision of an experienced pediatric radiation oncologist (B.V. Balgobind), two adapted versions of each plan were manually created that both had the isocenter in the middle of the fields. In one plan no block was used and in the other plan a block was introduced to protect part of the liver or spleen, depending on the plan side. Training (including 5-fold cross-validation to determine the best hyper-parameter settings) and validation were repeated ten times to account for the stochastic nature of GP-GOMEA. Averages and standard deviations were computed over these ten repetitions.
3.1 Dose-volume metric data distribution
Among the 300 artificial plans, plan side and OAR type was found to influence the distribution of a dose-volume metric considerably. To illustrate the effect of OAR type and plan side on the dose, Figure 5 shows the distributions found for and for the liver and the spleen, in case of left- and right-sided plans. For
for the liver, distributions approximately resembling the normal distribution were obtained (in case of right-sided plans with particular high variance and long left tail). The distribution in case of right-sided plans had a mean of 9.5 Gy (typically a major part of the liver was in-field), the distribution in case of left-sided plans had a mean of 3.4 Gy (typically a minor part of the liver was in-field). In terms of, for the liver we observed values close to the prescribed dose (14.4 Gy) both in case of left- and right-sided plans. The distributions of and for the spleen associated with the different plan sides had more marked differences than the ones for the liver. In case of right-sided plans, values close to 0 Gy were obtained for both metrics (typically the spleen was outside the field). For left-sided plans, large values of were found to be much more frequent than low values. The distribution of exhibited a peak around the prescribed dose. For the contralateral kidney and for the spinal cord the distributions are similar for both plan sides, as the contralateral kidney should be outside the field and the spinal cord should be included within the field (according to protocol).
For all OARs, distributions obtained for and largely resembled the ones obtained for . In fact, Pearson correlation coefficients above 98% were found when comparing with and for almost all OARs. Smaller (yet still large) correlation coefficients were found between and for the left and right kidney, with values of 96% and 91% respectively.
3.2 Validation on artificial plans
For each considered OAR, the Mean Absolute Errors (MAEs) (and standard deviation) at validation time for , , , and from the ten repetitions of the 5-fold cross-validation procedure using the artificial plans are reported in Table 2. We further present the average decrease in MAE when using ML compared to when using the baseline (i.e., the effect size), as well as the outcome of statistical significance tests (Wilcoxon signed-rank) comparing ML to the baseline.
The errors for and were generally below 2 Gy, which corresponds to approximately 14% of the prescribed dose of 14.4 Gy. For all OARs but for the spinal cord, the plan side had considerable impact on the magnitude of the errors. As the spinal cord in RL direction was in-field no matter the plan side, the MAEs of dose-volume metrics predictions were found to be small: Gy for and , for and . For OARs that were almost out-of-field, e.g., the spleen in case of right-sided plans, small MAEs of were found ( Gy), as very low values were obtained across all patient-plan combinations (see Fig. 5). Note that in this case (and also for the for the liver in both left- and right-sided plans), ML performs significantly but not substantially better than the baseline.
For the liver in case of right-sided plans, and for the spleen in case of left-sided plans, larger MAEs were found (liver: 1.7 Gy for , 12.1% for , 12.6% for ; spleen: 1.5 Gy for , 9.3% for , 10.7% for ). These errors can be attributed to the particular configuration of the position of these OARs and the field of the plans.
Among the dose-volume metrics, for the (partly) in-field OARs had low variability, with a close to the prescribed dose (14.4 Gy). For example, small errors were obtained for the for the liver ( Gy), as we consistently obtained a large value for both left- and right-sided plans (see Fig. 5). In contrast, was harder to predict when the OAR was contralateral to the plan side. The MAEs obtained for for the spleen in case of right-sided plans was 1.6 Gy. This was 2.9 Gy for the left kidney, and 1.4 Gy for the right kidney.
The largest average error was found for for the left kidney, amounting to 20% of the prescribed dose. For all dose-volume metrics for the right kidney, and for for the spleen, we found that ML predictions were slightly worse compared to using the baseline (note the negative effect sizes), but not significantly so. Lastly regarding the kidneys, although errors in were relatively large, errors in were relatively small (compared with for the other OARs). In fact, only a small percentage of the contralateral kidney, from to less than typically received at least 10 Gy.
Although not reported in Table 2, we remark that the errors were found to be unbiased: no systematic over- nor under-estimations of dose-volume metrics were found on average, with the mean (non-absolute) error being close to zero for all metrics.
|Side||OAR||Dmean [Gy]||D2cc [Gy]||V5Gy [%]||V10Gy [%]|
Right ( plans)
Left ( plans)
3.3 Independent validation on clinical plans
Figure 6 and Figure 7 show, for each clinical case, the ground truth dose-volume metric values and the predictions obtained by the ML models (trained on the artificial plans). Results for and are presented in Figure 6, and results for and are presented in Figure 7.
The errors in between predictions and ground truth values were generally low, totalling an average of 1.0 Gy (with a range of 0.0-4.9 Gy) across all OARs. Compared to the results obtained in the cross-validation using the artificial plans (Table 2), for the liver in case of right-sided plans, the average error on the clinical plans was found to be smaller (1.2 Gy vs. 1.7 Gy). Similar average errors in were found for the kidneys (0.8 Gy vs. 0.6 Gy for the left kidney and 0.5 Gy vs. 0.6 Gy for the right kidney), the liver in case of right-sided plans (1.0 Gy vs. 1.7 Gy), and the spinal cord (0.4 Gy vs. 0.4 Gy). Larger average errors in were found for the spleen (0.5 Gy vs. 0.1 Gy in case of the right-sided plans and 3.6 Gy vs. 1.5 Gy in case of left-sided plans). The largest error of 4.9 Gy was found for the spleen of case (1.9 Gy error found for the spleen of case ), indicating that the impact of the block on the plan was not well modeled. Furthermore, the error in spleen of and was large (2.8 Gy and 4.7 Gy, respectively), indicating both field types (without and with a block, respectively) were not well modeled for this case (see the discussion in Sec. 4).
Regarding , similar results to the ones obtained on artificial plans were found for the spinal cord, where an average error of 0.1 Gy was obtained in (with a range of 0.0-0.4 Gy). For the spleen of cases and , large errors in were found (12.3 Gy on average), as ML predictions essentially wrongly represented the spleen to be out-of-field. Whereas this was the case for and , and for and , where small errors were obtained (1.1 Gy on average). Similarly, large errors in were found in some cases for the contralateral kidneys (e.g., the left kidney: 5.1 Gy on average, with a range of 0.5-7.6 Gy).
Results for (average error 8% with a range of 0-35%) and (average error 7% with a range of 0-49%) mostly followed the trend of the errors for , as these metrics were found to be correlated for most OARs.
In this article we presented a new and different paradigm in organ dose reconstruction. By leveraging the modeling power of ML, we showed how patient and plan features can be used to predict organ dose-volume metrics directly, without the need of adopting a surrogate anatomy. Once the ML models are trained, they can readily be used to compute dose-volume metric predictions for a new historical patient and plan, by using their features as input.
Key to obtaining a decent amount of data to perform ML were the collaboration of five international institutes to gather pediatric patient CTs (147), the development of a new automatic sampling procedure yielding artificial Wilms’ tumor RT plans, and the creation of an automatic dose reconstruction pipeline to calculate the dose for all patient and plan combinations. We validated our approach on 300 automatically generated artificial plans, and on ten manually created clinical plans, to assess whether the results of the validation on the artificial plans generalize well in practice. Our approach showed promising levels of accuracy in dose reconstruction in both settings.
Errors were found to be overall similar between the validation on the ten clinical plans and the validation on the artificial plans. However, for some metrics, errors were larger for the ten clinical plans. This may be due to chance, because ten is a small number to validate upon. Another possibility is that the artificial plan generation method needs to be improved. Artificial plans were generated by sampling geometry properties uniformly within predefined boundaries on two reference DRRs. Uniform sampling might not be representative of the distribution clinical plans have. Moreover, we consulted a single radiation oncologist to define clinically acceptable boundaries to use in the sampling of artificial plans. Consulting multiple experts and allowing for a larger variation might better help covering the extent of variation that is present in historical plans (Sec. 2.2). For example, the isocenter locations of artificial left-sided plans were never sampled below the 1st lumbar vertebra (see Fig. 2) and approximately half of the values for the spleen in case of the artificial left-sided plans were close to the prescribed dose (14.4 Gy, see Fig. 5), which means that the spleen was often almost completely in-field in our artificially generated set of left-sided plans. When a block was applied, only a small part of the spleen was spared. However, in clinical practice, isocenter locations can be lower, and a larger part of the spleen might actually be outside the field (see Fig. 8). This might explain the relatively large errors observed in Figure 6 for and where the isocenter location is lower than the sampled range. Ultimately, effort should be done to improve the sampling of artificial plans.
In the validation performed upon artificial plans as well as in the one performed upon clinical plans, a main result that emerges is that dose-volume metrics for an organ are hard to predict when, due to the field setup, it is unclear whether the OAR is (partially) included in the field or not. For example, consider the as opposed to the for the spleen for and in Figure 6: a tiny part of the spleen being inside the field causes a large (wrongly predicted to be small), and small (correctly predicted to be small). As experimentally observed in prior work [29, 30], 2D bony anatomy provides only coarse information on OAR shape and position even for ML algorithms (e.g., an MAE of 6.4 mm for the prediction of the liver position along the IS axis was reported ). Yet, because bony anatomy is the only structure that is reliably visible in historical radiographs, most of the anatomical features rely on it. Patients with similar anatomical features derived from bony anatomy may have different OAR shape and position, and thus different dose-volume metrics. Furthermore, impreciseness in feature values due to e.g., uncertainties in landmark detection and plan emulation, aggravate the situation.
Compared to conventional dose reconstruction methods (that use surrogate anatomies and heuristics to decide what surrogate to use), we considered a relatively large number of features: 33. Phantom-based methods consider, e.g., only age and gender [21, 13], or gender together with height and weight percentiles . However, if a 2D radiograph is available, the added value of this information should be exploited. In our work, the majority of the features we considered, i.e., 23 out of 33 (minus eight due to automatic feature selection, see supplementary A), regarded patient anatomy as visible on a 2D radiograph, which we simulated with DRRs. Our DRRs were generated in a conformal fashion, e.g., the abdomen was always fully included in RL direction. The automatic landmark detection that was used to generate features expects this conformity to achieve precise detection . When dealing with actual historical radiographs, however, several challenges need to be taken into account. For example, our automatic landmark detection method requires further development to account for noise in the radiograph (e.g., the presence of hand-writing on the radiograph). Moreover, educated guesses of landmark locations may be needed in some cases, as some historical radiographs do not include the entire abdomen (see Fig. 1(c)). Nevertheless, as long as the features are somehow collected (e.g., manually), they can be used as input for the ML models to get respective dose-volume metric predictions.
There are disadvantages of our approach compared to conventional dose reconstruction methods that use surrogate anatomies beyond the need for patient radiographs, which are not always available in retrospective data. In particular, a key limitation is that ML models do not predict the entire 3D dose distribution an organ receives, but only the metrics they were trained for. Potentially useful information to link to AEs may be contained in 3D dose distributions. To predict 3D dose distributions, the ML models would need to be trained to predict a 3D output. Surrogate-based methods do allow to obtain the entire 3D dose distribution to an organ, since the distribution can be visualized on the organ of the surrogate anatomy, after plan simulation. However, considering the magnitude and variations of the errors of organ mean dose obtained by conventional approaches , it is questionable whether the full 3D distribution will be sufficiently reliable. Our approach, as currently proposed, can straightforwardly be extended to predict any (scalar) dose-volume metric that is suspected to be useful to study AEs (it suffices to train ML on that metric).
Another limitation of our approach is that it does not take into consideration uncertainties related to OAR motion. For validation, we aimed at reconstructing the dose based on the particular snapshot of anatomy at the moment the CT (ground-truth) / respective DRR (to simulate historical radiographs) was taken. Yet, OAR motion plays a key role in the uncertainty of organ positioning at the edge of the field, which can lead to a discrepancy between the planned dose and the actual delivered dose. In RT practice, radiation delivery is performed over a number of days, with fractionation schemes. The OAR position can therefore vary (i.e., inter-fractional position variation). Intra-fractional organ motion due to, e.g., respiration variation, contributes to the difference between planned dose and delivered dose as well.
Lastly, a main limitation of our approach is that the ML models we generated are specific to pediatric patients (1 to 8 years of age) and Wilms’ tumor RT plans: they can only predict reliable dose-volume metrics of specific OARs they were explicitly trained for. The RT plans we have sampled were also restricted to a standard AP-PA setup without considering wedges, boost fields, or other radiation sources such as Cobalt-60. Moreover, the predictions of the ML models (as well as the validation performed in this study) are based on the dose calculation algorithm we adopted when preparing training data, which has inaccuracies. Specifically, we used a collapsed cone algorithm available in Oncentra TPS. Though good accuracy was reported in the in-field and near-field region ( 5 cm from the field borders, achieves an error of 1-2% of the prescribed dose), in low dose regions (10-15 cm from the field border) an underestimation of 10% of the dose in the region was reported . We remark that the OARs we considered in this study were mostly within 5 cm near the field border (except for the spleen in case of the right-sided plans). To make the method more general for OARs far from field borders, more advanced Monte Carlo dose calculation algorithms should be applied in future implementations. However, we believe that the core ideas of our work can be replicated for other cohorts and other types of plans. Essentially, as long as a sufficient number of anatomies and plans are collected or generated, and a large number of dose reconstructions are performed to be used as examples, new ML models can be trained to predict how the dose-volume metrics are linked to anatomy-plan configurations. As was the case in our study, the collection and preparation of sufficient data for ML is likely to be the largest required effort.
Our proposed approach presents several advantages compared to traditional dose reconstruction methods. First of all, we found our validation results to compare favorably with respect to our recent work considering dose reconstruction for a similar childhood cancer cohort . The work considered 31 patients aged 2 to 6, 12 Wilms’ tumor clinical plans, and a total of 50 dose reconstruction combinations, which were performed by matching a surrogate CT based on age and gender. The work reported an MAE for the for the liver of 1.6 Gy (average across both left- and right-sided plans), and an MAE for the
for the spleen of 2.6 Gy. For the liver, we obtained an MAE of 1.3 Gy when validating on artificial plans, and of 1.1 Gy when validating on ten clinical plans. For the spleen, we obtained an MAE of 0.8 Gy on artificial plans, and of 1.7 Gy for the clinical plans. Furthermore, our ML-based predictions resulted in much smaller variations. The inter-quantile range (25th to 75th percentile) of the (non-absolute) prediction error of our previous work was 3.6 Gy for the liver, and 4.7 Gy for the spleen. On the artificial plans, we obtained a range of 2.0 Gy for the liver, and of 1.2 Gy for the spleen. On the clinical plans, the range for the liver was 1.9 Gy, and the one for the spleen was 2.2 Gy. We remark that since the dose reconstruction accuracy is largely influenced by the particular plans considered, these values may not be a fair comparison. We are currently working on a multi-institute study to compare our approach with two state-of-the-art, phantom-based dose reconstruction approaches [16, 13]. In that study, a same set of patients and plans will be used for validation.
Finally, a benefit of having ML models is that, once features are collected, they can be used as inputs for the model to obtain the prediction of a dose-volume metric immediately. Running a model on a computer simply means to follow the steps encoded by the formula the model represents, which takes a few milliseconds. Conversely, in a surrogate-based approach, the features are used to craft or select a surrogate anatomy. Then, effort and time must be put to emulate the plan on the surrogate anatomy, calculate the dose, and obtain the dose-volume metrics [16, 13, 34].
We presented the first surrogate-free organ dose reconstruction method based on ML. Our method was enabled by the collection of large amounts of patient and CT data, and the automatic generation of artificial plans and of dose distribution data. We assembled a dataset of dose-volume metrics corresponding to features of patient anatomy and plan geometry, and subsequently trained ML models to predict how features of patient anatomy and of treatment plans influence dose-volume metrics. The predictions were validated upon both artificial and clinical RT plans, and achieved good accuracy in both cases.
The authors acknowledge Stichting Kinderen Kankervrij (KiKa) for financial support (project #187), and the Maurits and Anna de Kock foundation for financing a high performance computing system. Elekta is acknowledged for providing the research software ADMIRE for automatic segmentation. The authors thank Abigail Bryce-Atkinson for her help in data preparation and feature extraction, and Dr. Cécile M. Ronckers for sharing her expertise.
-  (2017) A review of uncertainties in radiotherapy dose reconstruction and their impacts on dose–response relationships. J. Radiol. Prot. 37 (1), pp. R1. Cited by: §1.
-  (2005) Adverse effects of preoperative radiation therapy for rectal cancer: long-term follow-up of the Swedish Rectal Cancer Trial. J. Clin. Oncol. 23 (34), pp. 8697–8705. Cited by: §1, §1.
-  (2006) Pattern recognition and machine learning. Springer. Cited by: §2.4.1.
-  (2011) Dose–volume analysis of radiation nephropathy in children: preliminary report of the risk consortium. Int. J. Radiat. Oncol. Biol. Phys. 80 (3), pp. 840–844. Cited by: §1, §2.3.1.
-  (2011-05) Standing adult human phantoms based on 10th, 50th and 90th mass and height percentiles of male and female Caucasian populations. Phys. Med. Biol. 56 (13), pp. 3749–3772. Cited by: §1, §1.
-  (2017) Chronic health conditions and neurocognitive function in aging survivors of childhood cancer: a report from the childhood cancer survivor study. J. Nat. Cancer. Inst. 110 (4), pp. 411–419. Cited by: §1, §2.
-  (2019) Pediatric Normal Tissue Effects in the Clinic (PENTEC): An international collaboration to analyse normal tissue radiation dose–volume response relationships for paediatric cancer patients. Clin. Oncol. 31 (3), pp. 199–207. Cited by: §2.3.1.
-  (2001) Organ weight in 684 adult autopsies: new tables for a Caucasoid population. Forensic Sci. Int. 119 (2), pp. 149–154. Cited by: §1.
-  (2007) Randomised trial of standard 2D radiotherapy (RT) versus intensity modulated radiotherapy (IMRT) in patients prescribed breast radiotherapy. Radiother. Oncol. 82 (3), pp. 254–264. Cited by: §1.
-  (1991-05-15) Tolerance of normal tissue to therapeutic irradiation. Int. J. Radiat. Oncol. Biol. Phys. 21 (1), pp. 109–122. Cited by: §2.3.1.
-  (2007) Intensity-modulated radiotherapy of head and neck cancer aiming to reduce dysphagia: early dose–effect relationships for the swallowing structures. Int. J. Radiat. Oncol. Biol. Phys. 68 (5), pp. 1289–1298. Cited by: §1.
-  (2014-08) The UF/NCI family of hybrid computational phantoms representing the current US population of male and female children, adolescents, and adults—application to CT dosimetry. Phys. Med. Biol. 59 (18), pp. 5225–5242. Cited by: §1, §1, §2.3.2, §4.
-  (2019) Adaptations to a generalized radiation dose reconstruction methodology for use in epidemiologic studies: an update from the MD Anderson late effect group. Radiat. Res. 192 (2), pp. 169–188. Cited by: §1, §4, §4, §4.
-  (2015) Quantification of renal and diaphragmatic interfractional motion in pediatric image-guided radiation therapy: a multicenter study. Radiother. Oncol. 117 (3), pp. 425–431. Cited by: §4.
-  (2005) Monte carlo-versus pencil-beam-/collapsed-cone-dose calculation in a heterogeneous multi-layer phantom. Phys. Med. Biol. 50 (5), pp. 859. Cited by: §4.
-  (2015-02) Reconstruction of organ dose for external radiotherapy patients in retrospective epidemiologic studies. Phys. Med. Biol. 60 (6), pp. 2309–2324. Cited by: §1, §2.3.1, §4, §4.
-  (2009) Pediatric cancer survivorship research: experience of the childhood cancer survivor study. J. Clin. Oncol. 27 (14), pp. 2319. Cited by: §1, §2.3.1.
-  (2013-01) Evaluation of 3D fluoroscopic image generation from a single planar treatment image on patient data with a modified XCAT phantom. Phys. Med. Biol. 58 (4), pp. 841–858. Cited by: §1.
-  (2012) Individualized 3D reconstruction of normal tissue dose for patients with long-term follow-up: a step toward understanding dose risk for late toxicity. Int. J. Radiat. Oncol. Biol. Phys. 84 (4), pp. e557–e563. Cited by: §1, §1, §1, §2.3.1.
-  (2013) Population of anatomically variable 4D XCAT adult phantoms for imaging research and optimization. Med. Phys. 40 (4), pp. 043701. Cited by: §1, §1.
-  (2006) Dose reconstruction for therapeutic and diagnostic radiation exposures: use in epidemiological studies. Radiat. Res. 166 (1), pp. 141–157. Cited by: §1, §1, §4.
-  (2002) Basic anatomical and physiological data for use in radiological protection: reference values: ICRP publication 89. Annals of the ICRP 32 (3-4), pp. 1–277. Cited by: §1.
-  (2017) Position paper: rationale for the treatment of Wilms tumour in the UMBRELLA SIOP–RTSG 2016 protocol. Nat. Rev. Urol. 14 (12), pp. 743–752. Cited by: §2.2, §2.3.2.
-  (2010) Evaluation of late adverse events in long-term Wilms’ tumor survivors. Int. J. Radiat. Oncol. Biol. Phys. 78 (2), pp. 370–378. Cited by: §1, §1, §2.3.2, §2.
-  (2008) A (short) history of image-guided radiotherapy. Radiother. Oncol. 86 (1), pp. 4–13. Cited by: §1.
Symbolic regression and feature construction with GP-GOMEA applied to radiotherapy dose reconstruction of childhood cancer survivors.
Proc. Genetic and Evolutionary Computation Conference, GECCO ’18, New York, NY, USA, pp. 1395–1402. Cited by: §2.4.2.
-  (2020) Improving model-based genetic programming for symbolic regression of small expressions. Note: Preprint arXiv:1904.02050. (Accepted for publication with minor revisions in Evol. Comput.) Cited by: §2.4.2.
-  (2017) Scalable genetic programming by gene-pool optimal mixing and input-space entropy-based building-block learning. In Proc. Genetic and Evolutionary Computation Conference, GECCO ’17, New York, NY, USA, pp. 1041–1048. Cited by: §2.4.2.
-  (2018) On the feasibility of automatically selecting similar patients in highly individualized radiotherapy dose reconstruction for historic data of pediatric cancer survivors. Med. Phys. 45 (4), pp. 1504–1517. Cited by: §1, §1, §4.
-  (2019) Machine learning for automatic construction of pseudo-realistic pediatric abdominal phantoms. Note: Preprint arXiv:1909.03723 Cited by: §1, §2.4.2, §4.
-  (2020) Machine learning for automatic construction of pediatric abdominal phantoms for radiation dose reconstruction. Note: (To appear in Proc. SPIE) Cited by: §1.
-  (2019) How do patient characteristics and anatomical features correlate to accuracy of organ dose reconstruction for Wilms’ tumor radiation treatment plans when using a surrogate patient’s CT scan?. J. Radiol. Prot. 39 (2), pp. 598–619. Cited by: §1, §1.
-  (2018) Are age and gender suitable matching criteria in organ dose reconstruction using surrogate childhood cancer patients’ CT scans?. Med. Phys. 45 (6), pp. 2628–2638. Cited by: §1, §2.3.2, §4, §4.
-  (2020) Automatic generation of three-dimensional dose reconstruction data for two-dimensional radiotherapy plans for historically treated patients. J. Med. Imaging 7 (1), pp. 015001. Cited by: §2.2, §2.3.2, §2.3, §4, §4.
-  (2014-08) An exponential growth of computational phantom research in radiation protection, imaging, and radiotherapy: a review of the fifty-year history. Phys. Med. Biol. 59 (18), pp. R233–R302. Cited by: §1, §1, §2.