I. Introduction
Radiation therapy is one of the main modalities to cure cancer, and is used in over half of cancer treatments, either standalone or in conjunction with another modality, such as surgery or chemotherapy. For intensitymodulated radiation therapy (IMRT), the patient body is irradiated from fixed beam locations around patient body, and the radiation field is modulated at each beam position using multileaf collimators (MLC). In IMRT, the optimal choice of beam orientations has a direct impact on the treatment plan quality, influencing the final treatment outcome, hence patient quality of life. Current clinical protocols either have the beam orientations selected by protocol or manually by the treatment planner. Beam orientation optimization (BOO) methods solve for a suitable set of beam angles by solving an objective function to a local minimum. BOO has been studied extensively in radiation therapy procedures, for both coplanar ^{6}, ^{39}, ^{33}, ^{15}, ^{22}, ^{23}, ^{38}, ^{42}, ^{1}, ^{25}, ^{7}, ^{24}, ^{14}, ^{4}, ^{8}, ^{34}, ^{48}, ^{2}, ^{26}, ^{11}, ^{35}, ^{29}, ^{12}, ^{10}, and noncoplanar ^{33}, ^{15}, ^{32}, ^{27}, ^{4}, ^{8}, ^{26}, ^{47}, ^{46}, ^{29}, ^{49}, ^{36}, ^{5}, ^{45} IMRT, or intensitymodulated proton therapy^{30}, ^{17}, ^{43}, ^{18} (IMPT) by researchers in the past three decades. However, BOO has not been widely adopted due to their high computational cost and complexity, since it is a largescale NP hard combinatorial problem^{3}, ^{49}. Despite the extensive research, the lack of practical clinically beam orientation selection algorithms still exists due to the computational and time intensive procedure, as well as the suboptimality of the final solution, and BOO remains a challenging step of the treatment planning process. To measure the quality of the BOO solution, it is necessary to calculate dose influence matrices of each potential beam orientation. Dose influence matrices for one beam associates all the individual beamlets in the fluence map with the voxels of the patient body. This calculation is time consuming and requires a large amount of memory to use in optimization. To mange the limited capacity of computational resources, the treatment planning process, after defining the objective function, is divided into two major steps: 1) find a suitable set of beam orientations, and 2) solve the fluence map optimization problem (FMO)^{10} of those selected beams. However, these two steps are not independent of each other–the quality of BOO solution can be evaluated only after FMO is solved, and FMO can be defined only after BOO is solved. Due to the nonconvexity and large scale of the problem, researchers consider dynamic programming methods by breaking the problem into a sequence of smaller problems. One of the successful algorithms specially for solving complex problems such as BOO is a method known as Column Generation (CG). In the original application of CG into radiotherapy, Romeijn et al. ^{37} solved a direct aperture optimization (DAO) problem by using CG. Dong et al. ^{16} then proposed a greedy algorithm based on column generation, which iteratively adds beam orientations until the desired number of beams are reached. Rwigema et al. ^{40} use CG to find a set of 30 beam orientations to be used in treatment planning of stereotactic body radiation therapy (SBRT) for patients with recurrent, locally advanced, or metastatic headandneck cancers, to show the superiority of treatment plans to those created by volumetric modulated arc therapy (VMAT). Nguyen et al. ^{28} used CG to solve the triplet beam orientation selection problem specific to MRI guided Co60 radiotherapy. Yu et al. ^{47} used an inhouse CG algorithm to solve an integrated problem of beam orientation and fluence map optimization. However, CG is a greedy algorithm that has no optimality guarantee, and typically yields a suboptimal problem. In addition, CG still takes as much as 10 minutes to suggest a 5 beam plan for prostate IMRT. The aim of this work is to find a method to explore a larger area of the decision space of BOO in order to find higher quality solutions than that of CG in a short amount of time. The proposed method starts with a deep neural network that has been trained using CG as a supervisor. This network can mimic the behavior of CG by directly learning CG’s fitness evaluations of the beam orientations in a supervised learning manner. The efficiency of this supervised network, which can propose a set of beam angles that are noninferior to that of CG, within less than two seconds, is presented in our previous work^{41}
. Given a set of already selected beams, this network will predict the fitness value of each beam, which is how much the beam will improve the objective function when added in the next iteration. In this study, we extend our previous work, and combine this trained supervised learning (SL) network with a reinforcement learning method, called Monte Carlo tree search. We use these fitness values from the SL network as a guidance to efficiently navigate action space of the reinforcement learning tree. Specifically, it provides the probability of selecting a beam in the search space of the tree at each iteration, so that the beam with the better likelihood to improve the objective function has the higher chance of being selected at each step. To evaluate our proposed method, we compare its performance against the stateoftheart CG. We developed three additional combinations of the guided and random search tree approaches for comparison.
Ii. Methods
The proposed method has a reinforcement learning structure involving a supervised learning network to guide Monte Carlo tree search to explore the beam orientation selection decision space. This method, guided Monte Carlo tree search (GTS), consists of two main phases: 1) Supervised training a deep neural network (DNN) to predict the probability distribution of adding the next beam, based on patient anatomy, and 2), using this network for a guided Monte Carlo tree search method to explore a larger decision space more efficiently to find better solutions. For the first phase we use the CG implementation for BOO problem, where CG iteratively solves a sequence of Fluence Map Optimization (FMO) problems
^{10} by using GPUbased ChambollePock algorithm ^{13}, the results of the CG method are used to trained a supervised neural network. For the second phase, which is the main focus of this work, we present a Monte Carlo Tree Search algorithm, using the trained DNN. Each of these phases are presented in the following sections.Ii.A. Supervised Learning of the Deep Neural Network
We develop a deep neural network (DNN) model that learns from column generation how to find fitness values for each beam based on the anatomical features of a patient and a set of structure weights for the planning target volume (PTV) and organsatrisk (OAR). The CG greedy algorithm starts with an empty set of selected beams, calculates the fitness values of each beam based on the optimality condition of the objective function shown in Equation 1.
(1) 
where is the weight for structure s, which are pseudo randomly generated between zero and one during the training process to generate many different scenarios. The value, , is the prescription dose for each structure, which is assigned 1 for the PTV and 0 for OARs. At each iteration of CG, fitness values are calculated based on Karush–Kuhn–Tucker (KKT) conditions^{21}, ^{20} of a master problem, and they represent how much improvement each beam can make in the objective function value. The beam with the highest fitness value is selected to be added to the selected beam set, . Then, FMO for the selected beams is performed, which affects the fitness value calculations for the next iteration. The process is repeated until the desired number of beams are selected. The supervised DNN learns to mimic this behavior through the training of the DNN is shown in figure 1. Once trained, this DNN is capable of efficiently providing a suitable set of beam angles in less than 2 seconds, as opposed to the 360 seconds required to solve the same problem using CG. The details of the DNN structure and its training process is described in our previous work ^{41}.
Patient anatomical features include the contoured structures (organs at risk) of the images from patients with prostate cancer and the treatment planning weights assigned to each structure. The images of 70 prostate cancer patients are used for this research, each with 6 contours: planning target volume (PTV), body, bladder, rectum, and left and right femoral heads. From 70 patients, 50 was randomly selected to train the network and 7 for its validation. The remaining 13 patients images is used for testing and applying the Monte Carlo tree search method.
Ii.B. Monte Carlo Tree Search
The pretrained DNN probabilistically guides the traversal of the branches on the Monte Carlo decision tree to add a new beam to the plan. Each branch of the tree starts from root as an empty set of beams, and continues until it reaches the terminal state. After the exploration of each complete plan (selection of 5 beams in our case), the fluence map optimization problem is solved and and based on that, the probability distribution to select next beam will be updated, using the reward function, in the backpropagation stage. Then, starting from root the exploration of the next plan will begin until the stopping criteria is met.
Figure 2 shows an example of a tree search, which has discovered seven plans so far.Ii.B.1. Basics of Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) uses the decision tree to explore the decision space, by randomly sampling from it^{9}. The search process of MCTS consists of four steps: 1) node selection, 2) expansion, 3) simulation, and 4) backpropagation on the simulation result. To explain these processes in detail, there are some properties that need to be defined first, these definitions are as follows:
 State of the problem:

include patient’s anatomical features and a set of selected beam orientations (). At the beginning of the planning, this set has no member, and it is updated throughout the solution procedure.
 Actions:

the selection of the next beam orientation to be added to set B, given the state of the problem.
 Solution or terminal state:

state of the problem in which the number of selected beam orientations (size of ) is the same as a predefined number of beams (), chosen by user. At this point, a feasible solution for the problem is generated.
 Decision Tree:

The solution space of a set of discrete numbers–in this work discrete numbers are the beam orientations–specially with iterative structures, can be defined as a tree, where each node and branch represent the selection of one number or a subset of available numbers, respectively.
 Node ():

selection of one potential beam orientations is a node.
 Root ():

a node with empty set of beam orientations, every solution start from the root.
 Path:

a unique connected sequence of nodes in a decision tree.
 Branch ():

a path originated from Root node. Each branch represents the iterative structure of the proposed method. The length of a branch is the number of nodes in that branch. In this work solution is a branch with size . There is only one branch from each root to any node in a tree.
 Leaf:

last node of a branch. There is no exploration of the tree after a leaf is discovered.
 Internal node:

any node in a tree except for root and leaves.
The selection process in the proposed method is guided by a pretrained DNN as described in subsection II.A.. This DNN is used to probabilistically guide the traversal of the branches on the Monte Carlo decision tree to add a new beam to the plan. At each node–starting by root note–the DNN is called to predict an array of fitness values for each beam orientation(). An element of this array represents the likelihood of the selection of the beam orientation. For example, if the number of potential beam orientations is 180, with separation, would be an array of size 180, and is the likelihood of selecting beam orientation in position of the potential beam orientations, . The expansion process happens after selection process at internal nodes, to further explore the tree and create children nodes. The traversal approach in the proposed method is depth first, which means that the branch of a node, that is visited or created for the first time, continues expanding until there are nodes in a branch. In this case, selection and expansion processes are overlapping because only one child node is created or visited at a time, although a node can be visited multiple times and several children can be generated from one node, except for leaf. The leaf node does not have any children. Nodes in a branch must be unique, it means that a branch of each external node () can be expanded only to nodes that are not already in the branch. In fact, beam orientation optimization problem can be defined as a dynamic programming problem with the following formula:
(2) 
where is a set of indices for previously selected beams, is index of currently selected beam and is a subset of beams to be selected that has the highest reward value. Each consists of iteratively selecting a predefined number of beams (), in this work . After the exploration of each complete plan, the fluence map optimization problem is solved and used for the backpropagation step, which is used to update the probability distribution for beam selection.
Ii.B.2. Main Algorithm
The detailed of the guided Monte Carlo tree search algorithm in the form of a pseudo code is provided in Algorithm 1. Several properties of each node in the proposed tree are being updated after the exploration of final states. To simplify the algorithm, these properties are addressed as variables and the following is a list of them:
 Cost ():

After a leaf is discovered, an FMO problem associated with the beams of that branch will be solved, the value of the FMO cost function is the value associated with its corresponding leaf. The cost value of all other nodes (other than leaves) in a tree is the average cost of its subbranches. For example in Figure 2 the cost value of node is the average cost of nodes and .
 probability distribution ():

an array of size 180 (the number of potential beam orientations), where element of this array represents the chance of improvement in the current cost value if tree branches out by selecting beams. After a node is discovered in the tree, this distribution is populated by using DNN. After the first discovery of a node, is updated based on the reward values.
 Reward ():
 Depth ():

is simply the number of beam orientations selection after node is discovered.
 Name ():

a unique string value as id for each node, this value is the path from root to node .
 Beam Set ():

the set of beams selected for a branch started from root and ended in node .
 Parent ():

the immediate node before node in a branch from root to , except for root node, all other nodes in a tree have one parent.
 Children ()

: the immediate node(s) of the subbranches from the node , except for leaves, all other nodes in a tree have atleast one children.
Ii.C. Algorithms for performance comparison
In general, four frameworks were designed to show the efficiency of the proposed GTS method compared to others. These models are defined as:
 Guided Tree Search (GTS)

: As presented in Algorithm 1, used a pretrained policy network to guild a MonteCarlo decision tree.
 Guided Search (GuidS)

: Used the pretrained network to search the decision space by iteratively choosing one beam based on the predicted probabilities from the policy network. Unlike GTS, the search policy is not updated as the search progresses. This process is detailed in Algorithm 2.
 Randomly sample Tree Search (RTS)

: This algorithm is simple MonteCarlo tree search method which starts with a uniform distribution to select beam orientations (randomly select them), and then update the search policy as the tree search progresses. Note that all of tree operations used in GTS is also used in this algorithm, except for having a policy network to guide the tree. This method is proposed to show the impact of using DNN to guild the decision tree.
 Random Search (RandS)

: This method searches the decision space with uniformly random probability until stopping criteria is met. It randomly selects 5 beam orientations and solves the corresponding FMO problem. The search policy is not updated. Its procedure is close to Algorithm 2 where the “Select using DNN” procedure is replaced by randomly selecting 5 unique beams.
Ii.D. Data
We used images from 70 patients with prostate cancer, each with 6 contours: PTV, body, bladder, rectum, left femoral head, and right femoral head. Additionally, the skin and ring tuning structures were added during the fluence map optimization process to control high dose spillage and conformity in the body. The patients were divided randomly into two exclusive sets: 1) a model development set, which includes training and validation data, consisting of 57 patients, 50 for training and 7 for validation, for crossvalidation method, and 2) a test data set consisting of 13 patients. Each patient’s data contains 6 contours: PTV, body, bladder, rectum, and left and right femoral heads. Column generation was implemented with a GPUbased ChambollePock algorithm^{13}
, a firstorder primaldual proximalclass algorithm, to create 6270 training and validation scenarios (22 5beam plans for each of 57 patients) and 130 test scenarios (10 5beam plans for each of 13 test patients). The DNN trained over 400 epochs, each with 2500 steps and batch size of one. The performances of four methods GTS, GuidS, RTS, and RandS, explained in section
subsection II.C., are evaluated. Two of these methods, GTS and GuidS, use the pretrained DNN as a guidance network. We originally had the images of 70 patients with prostate cancer, and used the images of the 57 of them to train and validate DNN and therefore cannot be used for the testing in this project^{1}^{1}1 To keep the proposed methods completely independent from the dataset used for training DNN.. There are 13 patients that DNN has never seen before and the images of those patients are used in this project as test set. Multiple scenarios can be generated for each patient, based on the weights assigned to patient’s structures for planning their treatments. We semirandomly generated 10 sets of weights for each patients. In total, we have a total of 130 test plans among the 13 test patients for the comparison. All the tests in this paper were performed on a computer with an Intel Core I7 processor@3.6 GHz, 64 GB memory, and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB video memory. The structure weight selection scheme is outlined by the following process:
In of the times, a uniform distribution in the range of 0 to 0.1 is used to generate a weight for each OAR separately.

In of the times, the smaller range of 0 to 0.05 is used to select weights for OARs separately, with uniform distribution.

And finally in specific ranges were used for each OAR: Bladder: [0,0.2], Rectum: [0,0.2], Right Femoral Head: [0,0.1], Left Femoral Head: [0,0.1], Shell: [0,0.1] and Skin: [0,0.3]
The weights range from 0 to 1. This weighting scheme was found to give a clinically reasonable dose, however, the dose itself may not be approved by the physician for that patient. Finally, considering only test scenarios, FMO solutions of beam sets generated by CG and by the 4 tree search methods were compared with the following metrics:
 PTV , PTV :

The dose that and , respectively, of the PTV received
 PTV :

Maximum dose received by PTV, the value of is considered for this metric
 PTV Homogeneity:

where and are the dose received by and , respectively, of PTV
 Paddick Conformity Index () ^{44}, ^{31}:

where is the volumne of the PTV and is the volume of the isodose region that received of the dose
 High Dose Spillage ():

where is the volume of the isodose region that received of the dose
Iii. Results
At each attempt to solve a test scenario, each method is given 1000 seconds to search the solution space. Whenever a method finds a solution better than that of CG, the solution and its corresponding time stamp and the number of total solutions visited by this method are saved. We use these values to analyze the performance of each method. The best solution that is found in each attempt to solve the problem is used as the final solution of that attempt and is used for PTV metrics calculation. The average objective function value of final solutions in five attempts are used for the comparison of the performance of four methods with CG solution. At first we compare the efficiency of the four methods of GTS, GuidS, RTS and RandS. Although the main purpose of these methods are to find a solution better than CG, there were some cases that none of these method could beat CG solution, either CG solution was very close to optimal, or there were several local optimums with wide search space which makes it very difficult to explore it efficiently, this is especially true for RTS and RandS methods. The percentages of total number of attempts that each method could successfully find a solution better than CG in at least one of five attempts and averaged over all attempts to solve the problem are presented in (a) and 2(b), respectively. Note that for this test the stopping criteria was 1000 seconds of computational time. As we expected, GTS and GuidS that are using the pretrained DNN performed better than the other two methods. However, there are still cases that they are not able to find a better solution than CG. The maximum number of scenarios that all four methods were successfully find a solution with objective function value better than that of CG solution is 102 out of total 130 test cases (78.46%).
The domain of the objective function value varies for different testcase scenarios, therefore we introduce Distance measure as the normalized version of the objective values compared to CG solutions for further comparison. Distance measure is the difference between the objective function value of each method and CG, divided by CG objective value (). If a method finds a solution better than CG, measure will be positive, and for cases that a method was not successful to find a solution better than CG solution, this value will be negative. It means a method with largest measure found solutions with better qualities–with an objective value smaller than that of CG–and therefore more efficient in the limited time of 1000 seconds. Figure 4 shows the box plot of Distance measure for GTS, GuidS, RTS, and RandS methods. Based on this figure, the average measure for GTS method is 2.48, which is the highest compared to 1.79 for GuidS, 0.67 for RTS, and 0.81 for RandS.
For the computational time evaluation and comparison methods, only the 102 out of 130 test cases, where all four methods had successfully found a better solution than CG within the 1000 seconds time limit, were considered. The boxplot of the best time needed to beat CG solution is presented in (a). It represents the first time that a method finds a solution better than CG solution within 1000 seconds among all five attempts. The boxplot of the average time to beat CG among all attempts to solve each testcases is provided in (b). To measure the average time to beat CG, all testcases and all attempts are considered. The fastest method considering this measure is GTS, the average time to beat CG for GTS method is 195 seconds in best case and 237 seconds in total cases. The second fast method is GuidS with the average time of 227 seconds for best and 268 seconds for total cases. RandS outperforms RTS with best time of 337 seconds compared to 364 seconds of RTS. Interestingly, the average total time of RTS and RandS is less than the average of best cases.
Objective Value  Distance()  

Tested methods  tstatistic  pvalue  tstatistic  pvalue 
CG vs GTS  6.267  2.54E09  7.683  1.72E12 
CG vs GuidS  4.940  1.19E06  6.393  1.37E09 
CG vs RTS  0.885  1.89E01  2.014  2.30E02 
CG vs RandS  1.670  4.87E02  2.340  1.04E02 
RandS vs RTS  1.496  6.85E02  1.096  1.38E01 
RandS vs GTS  6.826  1.53E10  11.843  1.18E22 
RandS vs GuidS  3.873  8.51E05  4.839  1.83E06 
RTS vs GTS  8.245  8.18E14  15.271  5.25E31 
RTS vs GuidS  5.257  2.95E07  5.832  2.08E08 
GuidS vs GTS  2.412  8.64E03  3.945  6.52E05 
Onetailed paired sample ttest to test the average objective function value and
measure for every pairs of CG, GTS, GuidS, RTS, and RandS methods, with confidence intervals. All values in red have pvalues greater than 0.01.)To study the statistical significant of GTS method compared to other four methods (GuidS, RTS, RandS, and CG), we use onetailed paired sample ttest to compare the objective function values of each pair of methods, and Distance measure. The null hypothesis is that the average objective function and
measure of all methods are the same. If we show the null hypothesis as . The alternative hypothesis can be described as for the objective value parameters and for measure. Ten paired sample ttest are performed for objective function values and measure.These statistics are presented in Table 1. The distribution of Objective value and measures are provided in appendix section VI.. As highlighted by red, all pairs of the three methods of CG, RTS, and RandS have pvalues greater than 0.01, and are not significantly different, while the average and objective value measures of GuidS and GTS are significantly different. Based on these results GTS outperforms all other methods significantly while in the second position is GuidS as was expected.Method  PTV  PTV  PTV  PTV Homogeneity  

RandS  0.9770.012  0.9600.019  1.0710.040  0.0890.046  0.8670.070  4.6731.022 
RTS  0.9770.011  0.9590.020  1.0700.039  0.0880.045  0.8630.085  4.7141.312 
GTS  0.9770.011  0.9600.019  1.0700.040  0.0890.045  0.8740.061  4.5690.994 
GuidS  0.9760.012  0.9600.019  1.0710.040  0.0890.046  0.8740.068  4.4870.948 
CG  0.9770.011  0.9610.020  1.0720.041  0.0900.046  0.8840.059  4.4780.963 
PTV statistics, Paddick Conformity Index() and dose spillage() of plans generated by CG, GTS, GuidS, RTS, and RandS are presented in Table 2. Note that PTV is used to measure PTV , as recommended by the ICRU Report 83 ^{19}. The plans generated by all methods have very similar PTV coverage. CG plans have the highest followed by GTS and GuidS plans. While CG and GuidS plans have the lowest dose spillage value followed by GTS.
Methods  

Structures  GTS  GuidS  RTS  RandS  CG  
Mean Dose 
PTV  1.039 0.025  1.039 0.025  1.038 0.024  1.039 0.025  1.040 0.025 
Body  0.037 0.012  0.037 0.012  0.037 0.012  0.037 0.012  0.038 0.013  
Bladder  0.207 0.125  0.207 0.122  0.207 0.126  0.206 0.122  0.204 0.116  
Rectum  0.317 0.109  0.321 0.111  0.319 0.110  0.322 0.115  0.334 0.116  
Lfemoral  0.213 0.105  0.201 0.103  0.217 0.115  0.212 0.111  0.222 0.112  
Rfemoral  0.214 0.101  0.221 0.110  0.227 0.124  0.224 0.124  0.217 0.109  
Max Dose 
PTV  1.113 0.055  1.113 0.055  1.113 0.055  1.114 0.057  1.116 0.058 
Body  1.190 0.131  1.195 0.148  1.200 0.147  1.199 0.143  1.173 0.130  
Bladder  1.094 0.046  1.094 0.048  1.096 0.050  1.094 0.045  1.095 0.048  
Rectum  1.072 0.045  1.071 0.044  1.073 0.045  1.074 0.048  1.071 0.040  
Lfemoral  0.609 0.193  0.596 0.209  0.619 0.215  0.609 0.220  0.613 0.193  
Rfemoral  0.639 0.242  0.650 0.249  0.625 0.236  0.628 0.244  0.617 0.245 
The average and maximum dose received by each structure are provided in Table 3, these values reflect the fractional dose in plans generated by each method with the assumption that the prescription dose is one– e.g. if the prescription dose is 70 Gy, the average dose of 0.207 in the table means 14.47Gy () in the prescribed plan. The minimum values in each row are shown as bold numbers for easier interpretation of the table. On average plans generated by GTS have lower mean dose to OARs compared to other methods, while plans generated by CG have the lowest maximum dose to OARs. GTS plans spare rectum and right femoral head better than other methods. Although the average fractional dose to bladder by by GTS plans (0.207) is more than CG plans (0.204), GTS plans have lower maximum fractional dose to bladder (1.094) compared to CG (with maximum of 1.095). GuidS plans have minimum average fractional dose to leftfemoral head (0.201) with considerable difference compared to the second best (0.212) of RandS plans. As an example, Figure 6 shows the dosevolume and dosewash of plans generated by GTS and CG for one testcase scenario.
Iv. Discussion
In this research, we propose an efficient beam orientation optimization framework capable of finding a improved solution over CG, in a similar amount time, by utilizing a reinforcement learning structure involving a supervised learning network, DNN, to guide Monte Carlo tree search to explore the beam orientation selection decision space. Although CG is a powerful optimization tool, it is a greedy approach that not only is computationally expensive and time consuming, but it also may get stuck in a local optimum. This is particularly true for highly nonlinear optimization problems with lots of local optima, such as BOO. In this work, we tried 4 different approaches: 1) Guided Tree Search (GTS), 2) Guided Search (GuidS), 3) Random Tree Seach (RTS), and 4) Random Search (RandS). It is shown that although the quality of solutions using RandS, RTS and CG were not significantly different, in of testcases both RandS and RTS, which have no knowledge of the problem at the beginning of the search, can find solutions better than CG. This shows the high potential of improving the solution found by CG. We saw that GTS and GuidS, both performs better than other methods, which is expected because both of these methods are using a prior knowledge (trained DNN) to explore the solution space. GTS even outperforms GuidS on average, since GTS is a combination of GuidS and RTS, means adding a search method to GuidS can improve the quality of the solution. But considering the insignificant difference in the performance of RTS and RandS, adding any search method to GuidS may results in better solutions and it may not be directly related to RTS. This issue will be studied in future researches. The poor performance of RTS may also suggest that using uniform tree search is too slow to converge to the optimal selection of beams. Although GTS performs better than others to find solutions with better objective function values, but the dose spillage metrics, and specifically the average dose received by bladder in GTS plans can be improved further. Considering the success of GTS in reducing the objective function and its potential for further improvement, we will continue exploring new methods and techniques to upgrade the quality of treatment planning with the help of artificial intelligence. We should note that CG is a greedy and deterministic algorithm, therefor using CG on the same problem always results in the same solution. This is of completely different nature from our search methods and it may not be fair to compare its performance with the four search procedures, which, given infinite time and resources, can act as brute forced approach and guarantee finding the optimal solution. However, our main goal is to find the best possible solution for BOO problem, and in this work we try to see which search algorithms can find a better solutions and faster. Hence we expect to see search algorithms outperform greedy algorithms. The results showed us that the objective function value of GTS and GuidS CG, RTS and RandS perform similarly. Even though, plans generated by DNN solutions may not be superior to CG, but it can mimic the CG algorithm very efficiently ^{41} and is a very successful tool to explore the search space as shown by GuidS and GTS, especially when compared to CG , which can easily exhaust the computational resources and is very slow to find one solution for a problems. The good performance of CG compared to RandS and RTS shows how powerful the CG method can be to find a solution, and the success of using DNN to explore the decision space represents the proper knowledge that can be achieved by learning from CG. Finally, GTS is a problem specific search method that needs to be applied on each testcases separately. To use the knowledge that we can get from the GTS performance, more advanced reinforcement algorithms can be trained to create a single general knowledgebased method that is not only very powerful to find the best possible solutions, but also very fast for doing so. The advanced reinforcement learning method then can be easily applied on more sophisticated and challenging problems such as Proton and radiation therapy. For future studies, we are working to develop a smart, fast and powerful tool to be applied in these problems.
V. Conclusion
In this study, we proposed a method combined of two main components. First, a supervised deep neural network (DNN) to learn column generation (CG) decisionmaking process, and predict the fitness values of all candidate beams, beam with maximum fitness value will be chosen to add to the solution. CG, although powerful, is a heuristic, greedy algorithm that cannot guarantee the global optimality of the final solution, and leaves room for improvement. A Monte Carlo guided tree search (GTS) is proposed to see if finding a solution with better objective function in reasonable amount of time is feasible. After the DNN is trained, it is used to generate the beams fitness values for nodes in the decision tree, where each node represents a set of selected beams. Fitness values in each node are normalized and used as probability mass function to help deciding decision tree extension. Later probability distribution of beam selection will be modified by reward function, which is based on the solution of FMO problem FMO for every five selected beams. GTS continues to explore the decision tree for 1000 seconds. Along with GTS, three other approaches are also tested, GuidS which is also using DNN to select beams iteratively, but it does not update the probability distribution of beams during the search process. RTS which is a simple tree search algorithm, starts by randomly sampled from a uniform distribution of beam orientations for each node and continues to update beams probability distribution based on the tree search approach presented for GTS. And finally RandS which is randomly select beams, the most trivial and simple approach.
Vi. Appendix
Although statistically with 130 number of test cases we can assume that approximately our metrics follow normal distribution, but based on the graphs of
(a), this assumption may not be practical. Because of this, we introduce measures to normalize our metrics. The probability distribution and cumulative probability mass function of measure are presented in (b). By this graph we can verify that the this measure approximately normally distributed with similar standard deviation.References
 Aleman et al. 2008 Dionne M. Aleman, Arvind Kumar, Ravindra K. Ahuja, H. Edwin Romeijn, and James F. Dempsey. Neighborhood search approaches to beam orientation optimization in intensity modulated radiation therapy treatment planning. Journal of Global Optimization, 42(4):587–607, 12 2008. ISSN 09255001. doi: 10.1007/s108980089286x.
 Amit et al. 2015 Guy Amit, Thomas G. Purdie, Alex Levinshtein, Andrew J. Hope, Patricia Lindsay, Andrea Marshall, David A. Jaffray, and Vladimir Pekar. Automatic learningbased beam angle selection for thoracic IMRT. Medical Physics, 42(4), 4 2015. ISSN 00942405. doi: 10.1118/1.4908000.
 AziziSultan 2006 Ahmad Saher AziziSultan. Optimization of Beam Orientations in Intensity Modulated Radiation Therapy. PhD thesis, Technische Universität Kaiserslautern, 2006.

Bangert and Oelfke 2010
Mark Bangert and Uwe Oelfke.
Spherical cluster analysis for beam angle optimization in intensitymodulated radiation therapy treatment planning.
Physics in Medicine and Biology, 55(19):6023–6037, 10 2010. ISSN 00319155. doi: 10.1088/00319155/55/19/025.  Bedford et al. 2019 James L. Bedford, Peter Ziegenhein, Simeon Nill, and Uwe Oelfke. Beam selection for stereotactic ablative radiotherapy using Cyberknife with multileaf collimation. Medical Engineering and Physics, 64:28–36, 2 2019. ISSN 18734030. doi: 10.1016/j.medengphy.2018.12.011. URL https://www.sciencedirect.com/science/article/pii/S1350453318301802.
 Bortfeld and Schlegel 1993 T. Bortfeld and W. Schlegel. Optimization of beam orientations in radiation therapy: Some theoretical considerations. Physics in Medicine and Biology, 38(2):291–304, 1993. ISSN 00319155. doi: 10.1088/00319155/38/2/006.
 Breedveld et al. 2009 Sebastiaan Breedveld, Pascal R.M. Storchi, and Ben J.M. Heijmen. The equivalence of multicriteria methods for radiotherapy plan optimization. Physics in medicine and biology, 54(23):7199–7209, 2009. ISSN 13616560. doi: 10.1088/00319155/54/23/011.
 Breedveld et al. 2012 Sebastiaan Breedveld, Pascal R.M. Storchi, Peter W.J. Voet, and Ben J.M. Heijmen. ICycle: Integrated, multicriterial beam angle, and profile optimization for generation of coplanar and noncoplanar IMRT plans. Medical Physics, 39(2):951–963, 2012. ISSN 00942405. doi: 10.1118/1.3676689.
 Browne et al. 2012 Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods, 3 2012. ISSN 1943068X.
 Cabrera G. et al. 2018 Guillermo Cabrera G., Matthias Ehrgott, Andrew J. Mason, and Andrea Raith. A matheuristic approach to solve the multiobjective beam angle optimization problem in intensitymodulated radiation therapy. International Transactions in Operational Research, 25(1):243–268, 1 2018. ISSN 14753995. doi: 10.1111/itor.12241.
 CabreraGuerrero et al. 2018a Guillermo CabreraGuerrero, Carolina Lagos, Enrique Cabrera, Franklin Johnson, Jose M. Rubio, and Fernando Paredes. Comparing Local Search Algorithms for the Beam Angles Selection in Radiotherapy. IEEE Access, 6:23701–23710, 4 2018a. ISSN 21693536. doi: 10.1109/ACCESS.2018.2830646.
 CabreraGuerrero et al. 2018b Guillermo CabreraGuerrero, Andrew J. Mason, Andrea Raith, and Matthias Ehrgott. Pareto local search algorithms for the multiobjective beam angle optimisation problem. Journal of Heuristics, 24(2):205–238, 4 2018b. ISSN 15729397. doi: 10.1007/s1073201893651.
 Chambolle and Pock 2010 Antonin Chambolle and Thomas Pock. A firstorder primaldual algorithm for convex problems with applications to imaging, 2010. URL https://hal.archivesouvertes.fr/hal00490826.
 Craft and Monz 2010 David Craft and Michael Monz. Simultaneous navigation of multiple Pareto surfaces, with an application to multicriteria IMRT planning with multiple beam angle configurations. Medical Physics, 37(2):736–741, 2010. ISSN 00942405. doi: 10.1118/1.3292636.
 Djajaputra et al. 2003 David Djajaputra, Qiuwen Wu, Yan Wu, and Radhe Mohan. Algorithm and performance of a clinical IMRT beamangle optimization system, 2003. ISSN 00319155.
 Dong et al. 2013 Peng Dong, Percy Lee, Dan Ruan, Troy Long, Edwin Romeijn, Yingli Yang, Daniel Low, Patrick Kupelian, and Ke Sheng. 4 noncoplanar liver SBRT: A novel delivery technique. International Journal of Radiation Oncology Biology Physics, 85(5):1360–1366, 4 2013. ISSN 03603016. doi: 10.1016/j.ijrobp.2012.09.028.
 Gu et al. 2018 Wenbo Gu, Daniel O’Connor, Dan Nguyen, Victoria Y. Yu, Dan Ruan, Lei Dong, and Ke Sheng. Integrated beam orientation and scanningspot optimization in intensitymodulated proton therapy for brain and unilateral head and neck tumors. Medical Physics, 45(4):1338–1350, 4 2018. ISSN 00942405. doi: 10.1002/mp.12788.
 Gu et al. 2019 Wenbo Gu, Ryan Neph, Dan Ruan, Wei Zou, Lei Dong, and Ke Sheng. Robust Beam Orientation Optimization for Intensity Modulated Proton Therapy. Medical Physics, 46(8):mp.13641, 6 2019. ISSN 00942405. doi: 10.1002/mp.13641. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/mp.13641.
 Hodapp 2012 N. Hodapp. The ICRU Report No. 83: Prescribing, recording and reporting photonbeam intensitymodulated radiation therapy (IMRT), 1 2012. ISSN 01797158.
 Karush 2014 William Karush. Minima of Functions of Several Variables with Inequalities as Side Conditions, pages 217–245. Springer Basel, Basel, 2014. ISBN 9783034804394. doi: 10.1007/9783034804394˙10. URL https://doi.org/10.1007/9783034804394_10.
 Kuhn and Tucker 1951 Harold W Kuhn and Albert W Tucker. Nonlinear programming, in (j. neyman, ed.) proceedings of the second berkeley symposium on mathematical statistics and probability, 1951.

Li et al. 2004
Yongjie Li, Jonathan Yao, and Dezhong Yao.
Automatic beam angle selection in IMRT planning using genetic algorithm.
Physics in Medicine and Biology, 49(10):1915–1932, 5 2004. ISSN 00319155. doi: 10.1088/00319155/49/10/007. 
Li et al. 2005
Yongjie Li, Dezhong Yao, Jonathan Yao, and Wufan Chen.
A particle swarm optimization algorithm for beam angle selection in intensitymodulated radiotherapy planning.
Physics in Medicine and Biology, 50(15):3491–3514, 8 2005. ISSN 00319155. doi: 10.1088/00319155/50/15/002.  Lim et al. 2009 Gino Lim, Allen Holder, and Josh Reese. A clustering approach for optimizing beam angles in IMRT planning. Mathematical Sciences Technical Reports (MSTR), 8 2009. URL https://scholar.rosehulman.edu/math_mstr/14.
 Lim et al. 2008 Gino J. Lim, Jaewon Choi, and Radhe Mohan. Iterative solution methods for beam angle and fluence map optimization in intensity modulated radiation therapy planning. OR Spectrum, 30(2):289–309, 4 2008. ISSN 01716468. doi: 10.1007/s0029100700961.
 Liu et al. 2017 Hongcheng Liu, Peng Dong, and Lei Xing. A new sparse optimization scheme for simultaneous beam angle and fluence map optimization in radiotherapy planning. Physics in Medicine and Biology, 62(16):6428–6445, 7 2017. ISSN 13616560. doi: 10.1088/13616560/aa75c0.
 Llacer et al. 2009 Jorge Llacer, Sicong Li, Nzhde Agazaryan, Claus Promberger, and Timothy D. Solberg. Noncoplanar automatic beam orientation selection in cranial IMRT: A practical methodology. Physics in Medicine and Biology, 54(5):1337–1368, 2009. ISSN 00319155. doi: 10.1088/00319155/54/5/016.
 Nguyen et al. 2016 Dan Nguyen, David Thomas, Minsong Cao, Daniel O’Connor, James Lamb, and Ke Sheng. Computerized triplet beam orientation optimization for MRIguided Co60 radiotherapy. Medical Physics, 43(10):5667–5675, 10 2016. ISSN 00942405. doi: 10.1118/1.4963212.
 O’Connor et al. 2018 Daniel O’Connor, Victoria Yu, Dan Nguyen, Dan Ruan, and Ke Sheng. Fractionvariant beam orientation optimization for noncoplanar IMRT. Physics in Medicine and Biology, 63(4), 2 2018. ISSN 13616560. doi: 10.1088/13616560/aaa94f.
 Oelfke and Bortfeld 2001 U. Oelfke and T. Bortfeld. Inverse planning for photon and proton beams. Medical Dosimetry, 26(2):113–124, 2001. URL http://www.sciencedirect.com/science/article/pii/S0958394701000577.
 Paddick 2000 Ian Paddick. A simple scoring ratio to index the conformity of radiosurgical treatment plans. Journal of neurosurgery, 93(supplement_3):219–222, 2000.
 Potrebko et al. 2008 Peter S. Potrebko, Boyd M.C. McCurdy, James B. Butler, and Adel S. ElGubtan. Improving intensitymodulated radiation therapy using the anatomic beam orientation optimization algorithm. Medical Physics, 35(5):2170–2179, 2008. ISSN 00942405. doi: 10.1118/1.2905026.
 Pugachev et al. 2001 Andrei Pugachev, Jonathan G. Li, Arthur L. Boyer, Steven L. Hancock, Quynh Thu Le, Sarah S. Donaldson, and Lei Xing. Role of beam orientation optimization in intensitymodulated radiation therapy. International Journal of Radiation Oncology Biology Physics, 50(2):551–560, 6 2001. ISSN 03603016. doi: 10.1016/S03603016(01)015024.
 Rocha et al. 2013 Humberto Rocha, Joana M. Dias, BrÃgida C. Ferreira, and Maria C. Lopes. Beam angle optimization for intensitymodulated radiation therapy using a guided pattern search method. Physics in Medicine and Biology, 58(9):2939–2953, 5 2013. ISSN 00319155. doi: 10.1088/00319155/58/9/2939.
 Rocha et al. 2018 Humberto Rocha, Joana Dias, Tiago Ventura, BrÃgida Ferreira, and Maria do Carmo Lopes. Comparison of combinatorial and continuous frameworks for the beam angle optimization problem in IMRT. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 10961 LNCS, pages 593–606. Springer Verlag, 2018. ISBN 9783319951645. doi: 10.1007/9783319951652–_˝42.
 Rocha et al. 2019 Humberto Rocha, Joana Matos Dias, Tiago Ventura, BrÃgida da Costa Ferreira, and Maria do Carmo Lopes. Beam angle optimization in IMRT: are we really optimizing what matters? International Transactions in Operational Research, 26(3):908–928, 5 2019. doi: 10.1111/itor.12587.
 Romeijn et al. 2005a H. Edwin Romeijn, Ravindra K. Ahuja, James F. Dempsey, and Arvind Kumar. A Column Generation Approach to Radiation Therapy Treatment Planning Using Aperture Modulation. SIAM Journal on Optimization, 15(3):838–862, 1 2005a. ISSN 10526234. doi: 10.1137/040606612. URL http://epubs.siam.org/doi/10.1137/040606612.
 Romeijn et al. 2005b H Edwin Romeijn, Arvind Kumar, Ravindra K Ahuja, and James F Dempsey. A Column Generation Approach to Radiation Therapy Treatment Planning Using Aperture Modulation Railroad Scheduling View project Weapon Target Assignment View project A COLUMN GENERATION APPROACH TO RADIATION THERAPY TREATMENT PLANNING USING APERTURE MODULATION *. Article in SIAM Journal on Optimization, 2005b. doi: 10.1137/040606612. URL http://www.siam.org/journals/siopt/153/60661.html.
 Rowbottom et al. 1999 Carl Graham Rowbottom, Steve Webb, and Mark Oldham. Beamorientation customization using an artificial neural network. Phys. Med. Biol, 44:2251–2262, 1999.
 Rwigema et al. 2015 Jean Claude M. Rwigema, Dan Nguyen, Dwight E. Heron, Allen M. Chen, Percy Lee, Pin Chieh Wang, John A. Vargo, Daniel A. Low, M. Saiful Huq, Stephen Tenn, Michael L. Steinberg, Patrick Kupelian, and Ke Sheng. 4 noncoplanar stereotactic body radiation therapy for headandneck cancer: Potential to improve tumor control and late toxicity. International Journal of Radiation Oncology Biology Physics, 91(2):401–409, 2015. ISSN 1879355X. doi: 10.1016/j.ijrobp.2014.09.043.

Sadeghnejad Barkousaraie et al. 2019
Azar Sadeghnejad Barkousaraie, Olalekan Ogunmolu, Steve Jiang, and Dan Nguyen.
A Fast Deep Learning Approach for Beam Orientation Optimization for Prostate Cancer Treated with Intensity Modulated Radiation Therapy.
Medical Physics, 12 2019. ISSN 00942405. doi: 10.1002/mp.13986.  Schreibmann and Xing 2005 Eduard Schreibmann and Lei Xing. Dosevolume based ranking of incident beam direction and its utility in facilitating IMRT beam placement. International Journal of Radiation Oncology Biology Physics, 63(2):584–593, 10 2005. ISSN 03603016. doi: 10.1016/j.ijrobp.2005.06.008.
 Shirato et al. 2018 Hiroki Shirato, Quynh Thu Le, Keiji Kobashi, Anussara Prayongrat, Seishin Takao, Shinichi Shimizu, Amato Giaccia, Lei Xing, and Kikuo Umegaki. Selection of external beam radiotherapy approaches for precise and accurate cancer treatment. Journal of Radiation Research, 59:i2–i10, 3 2018. ISSN 13499157. doi: 10.1093/jrr/rrx092.
 Van’t Riet et al. 1997 Arie Van’t Riet, Ad CA Mak, Marinus A Moerland, Leo H Elders, and Wiebe Van Der Zee. A conformation number to quantify the degree of conformality in brachytherapy and external beam irradiation: application to the prostate. International Journal of Radiation Oncology* Biology* Physics, 37(3):731–736, 1997.
 Ventura et al. 2019 Tiago Ventura, Humberto Rocha, Brigida da Costa Ferreira, Joana Dias, and Maria do Carmo Lopes. Comparison of two beam angular optimization algorithms guided by automated multicriterial IMRT. Physica Medica, 64:210–221, 8 2019. ISSN 11201797. doi: 10.1016/J.EJMP.2019.07.012. URL https://www.sciencedirect.com/science/article/pii/S1120179719301656.
 Yarmand and Craft 2018 Hamed Yarmand and David Craft. Effective heuristics for beam angle optimization in radiation therapy. Simulation, 94(11):1041–1049, 11 2018. ISSN 17413133. doi: 10.1177/0037549718761108.
 Yu et al. 2018 Victoria Y. Yu, Angelia Landers, Kaley Woods, Dan Nguyen, Minsong Cao, Dongsu Du, Robert K. Chin, Ke Sheng, and Tania B. Kaprealian. A Prospective 4 Radiation Therapy Clinical Study in Recurrent HighGrade Glioma Patients. International Journal of Radiation Oncology Biology Physics, 101(1):144–151, 5 2018. ISSN 1879355X. doi: 10.1016/j.ijrobp.2018.01.048.
 Yuan et al. 2015 Lulin Yuan, Q. Jackie Wu, Fangfang Yin, Ying Li, Yang Sheng, Christopher R. Kelsey, and Yaorong Ge. Standardized beam bouquets for lung IMRT planning. Physics in Medicine and Biology, 60(5):1831–1843, 2 2015. ISSN 13616560. doi: 10.1088/00319155/60/5/1831.
 Yuan et al. 2018 Lulin Yuan, Wei Zhu, Yaorong Ge, Yuliang Jiang, Yang Sheng, Fang Fang Yin, and Q. Jackie Wu. Lung IMRT planning with automatic determination of beam angle configurations. Physics in Medicine and Biology, 63(13), 7 2018. ISSN 13616560. doi: 10.1088/13616560/aac8b4.
Comments
There are no comments yet.