Monte Carlo Tree Search (MCTS) has been successfully applied to many games and problems.
Nested Monte Carlo Search (NMCS) is an algorithm that works well for puzzles and optimization problems. It biases its playouts using lower-level playouts. At level zero, NMCS adopts a uniform random playout policy. Online learning of playout strategies combined with NMCS has given good results on optimization problems. Other applications of NMCS include Single-Player General Game Playing, Cooperative Pathfinding, Software Testing, Heuristic Model-Checking, the Pancake problem, Games, and the RNA Inverse Folding problem.
Online learning of a playout policy in the context of nested searches has been further developed for puzzles and optimization with Nested Rollout Policy Adaptation (NRPA). NRPA has found new world records in Morpion Solitaire and crossword puzzles. NRPA has been applied to multiple problems: the Traveling Salesman Problem with Time Windows (TSPTW) [7, 11], 3D Packing with Object Orientation, the Physical Traveling Salesman Problem, the Multiple Sequence Alignment problem, and Logistics. The principle of NRPA is to adapt the playout policy so as to learn the best sequence of moves found so far at each level.
We now give the outline of the paper. The second section describes NRPA. The third section gives a theoretical analysis of NRPA. The fourth section describes the generalization of NRPA. The fifth section details optimizations of GNRPA. The sixth section gives experimental results for SameGame and TSPTW.
NRPA learns a rollout policy by adapting weights on each action. During the playout phase, each action is sampled with a probability proportional to the exponential of its associated weight. The playout algorithm is given in Algorithm 1. The algorithm starts by initializing the sequence of moves that it will play (line 2). Then it performs a loop until it reaches a terminal state (lines 3-6). At each step of the playout it calculates the sum of the exponentials of the weights of all the possible moves (lines 7-10) and chooses a move with the probability given by the softmax function (line 11). Then it plays the chosen move and adds it to the sequence of moves (lines 12-13).
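This playout step can be illustrated with a minimal Python sketch (not the paper's Algorithm 1; `legal_moves`, `play`, `terminal` and `score` are hypothetical problem-specific callbacks):

```python
import math
import random

def softmax_sample(policy, moves, rng=random.random):
    """Sample one move with probability proportional to exp(weight)."""
    z = sum(math.exp(policy.get(m, 0.0)) for m in moves)
    r = rng() * z
    acc = 0.0
    for m in moves:
        acc += math.exp(policy.get(m, 0.0))
        if acc >= r:
            return m
    return moves[-1]  # guard against floating-point rounding

def playout(state, policy, legal_moves, play, terminal, score):
    """Level-zero rollout: sample moves from the softmax policy until a
    terminal state is reached, returning the score and the sequence."""
    sequence = []
    while not terminal(state):
        move = softmax_sample(policy, legal_moves(state))
        state = play(state, move)
        sequence.append(move)
    return score(state), sequence
```

An unknown move defaults to weight 0, so an empty policy yields uniform random playouts, matching the level-zero behavior of NMCS.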
Then, the policy is adapted on the best sequence found so far, by increasing the weights of the moves of the best sequence and decreasing the weights of all the possible moves proportionally to their probabilities of being played. The Adapt algorithm is given in Algorithm 2. For every state of the sequence passed as a parameter, it adds $\alpha$ to the weight of the move of the sequence (lines 3-5). Then it decreases the weights of all the possible moves proportionally to their probabilities of being played, so as to keep the sum of all probabilities equal to one (lines 6-12).
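A simplified sketch of this adaptation step (not the paper's Algorithm 2; moves are dictionary keys, and `best_moves`/`legal_per_state` are hypothetical names):

```python
import math

def adapt(policy, best_moves, legal_per_state, alpha=1.0):
    """NRPA Adapt step: increase the weight of each move of the best
    sequence by alpha and decrease every legal move of the same state
    proportionally to its probability under the unmodified policy.

    best_moves[s] is the move played in the s-th state of the best
    sequence; legal_per_state[s] lists the legal moves in that state."""
    new_policy = dict(policy)  # gradients are computed with the unmodified policy
    for best, moves in zip(best_moves, legal_per_state):
        new_policy[best] = new_policy.get(best, 0.0) + alpha
        z = sum(math.exp(policy.get(m, 0.0)) for m in moves)
        for m in moves:
            p = math.exp(policy.get(m, 0.0)) / z
            new_policy[m] = new_policy.get(m, 0.0) - alpha * p
    return new_policy
```

Note that the probabilities are always evaluated on the policy as it was before the call, which is what makes the step a single batched gradient update.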
In NRPA, each nested level takes a policy as input and returns a sequence. At each step, the algorithm makes a recursive call to the lower level and gets a sequence as a result. It adapts the policy to the best sequence of the level at each step. At level zero it makes a playout.
The NRPA algorithm is given in Algorithm 3. At level zero it simply performs a playout (lines 2-3). At greater levels it performs $N$ iterations, and for each iteration it calls itself recursively to get a score and a sequence (lines 4-7). If it finds a new best sequence for the level, it keeps it as the best sequence (lines 8-11). Then it adapts the policy using the best sequence found so far at the current level (line 12).
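The recursion can be sketched as follows (a simplified view of Algorithm 3; `playout` and `adapt` stand for problem-specific callbacks of the reader's choosing):

```python
import copy

def nrpa(level, policy, playout, adapt, n_iter):
    """Recursive NRPA: level 0 is a playout; each higher level runs
    n_iter recursive calls and adapts its policy toward the best
    sequence found so far at that level."""
    if level == 0:
        return playout(policy)
    best_score, best_sequence = float('-inf'), None
    for _ in range(n_iter):
        score, sequence = nrpa(level - 1, copy.deepcopy(policy),
                               playout, adapt, n_iter)
        if score >= best_score:
            best_score, best_sequence = score, sequence
        policy = adapt(policy, best_sequence)
    return best_score, best_sequence
```

Each level keeps its own copy of the policy, so the adaptations performed at one level do not leak into its siblings.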
NRPA balances exploitation by adapting the probabilities of playing moves toward the best sequence of the level, and exploration by using Gibbs sampling at the lowest level. It is a general algorithm that has proven to work well for many optimization problems.
3 Theoretical Analysis of NRPA
In NRPA each move is associated to a weight. The goal of the algorithm is to learn these weights so as to produce a playout policy that generates good sequences of moves. At each level of the algorithm the best sequence found so far is memorized. Let $s_1, \ldots, s_n$ be the sequence of states of the best sequence. Let $n_s$ be the number of possible moves in a state $s$. Let $m_{is}$ be the possible moves in state $s$ and $m_{bs}$ be the move of the best sequence in state $s$. The goal is to learn to play the move $m_{bs}$ in state $s$.
The playouts use Gibbs sampling. Each move $m_{is}$ is associated to a weight $w_{is}$. The probability of choosing the move $m_{is}$ in a playout is the softmax function:

$$p_{is} = \frac{e^{w_{is}}}{\sum_k e^{w_{ks}}}$$
The cross-entropy loss for learning to play the move $m_{bs}$ is $C_s = -\log(p_{bs})$. In order to apply the gradient we calculate the partial derivative of the loss: $\frac{\partial C_s}{\partial p_{bs}} = -\frac{1}{p_{bs}}$. We then calculate the partial derivative of the softmax with respect to the weights:

$$\frac{\partial p_{bs}}{\partial w_{is}} = p_{bs}(\delta_{bi} - p_{is})$$

where $\delta_{bi} = 1$ if $i = b$ and $\delta_{bi} = 0$ otherwise. Thus the gradient is:

$$\nabla w_{is} = \frac{\partial C_s}{\partial p_{bs}} \times \frac{\partial p_{bs}}{\partial w_{is}} = p_{is} - \delta_{bi}$$

If we use $\alpha$ as a learning rate we update the weights with:

$$w_{is} \leftarrow w_{is} + \alpha(\delta_{bi} - p_{is})$$
This is the formula used in the NRPA algorithm to adapt weights.
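The gradient above can be checked numerically; the sketch below (hypothetical helper names) compares a central finite difference of the loss against $p_{is} - \delta_{bi}$:

```python
import math

def softmax(weights):
    z = sum(math.exp(w) for w in weights)
    return [math.exp(w) / z for w in weights]

def cross_entropy(weights, b):
    """Loss -log p_b for learning to play the move of index b."""
    return -math.log(softmax(weights)[b])

def numeric_partial(weights, b, i, eps=1e-6):
    """Central finite-difference estimate of the partial derivative
    of the cross-entropy loss with respect to w_i."""
    up, down = list(weights), list(weights)
    up[i] += eps
    down[i] -= eps
    return (cross_entropy(up, b) - cross_entropy(down, b)) / (2 * eps)
```

For any weight vector, `numeric_partial(w, b, i)` agrees with the analytic value `softmax(w)[i] - (1 if i == b else 0)` up to the finite-difference error.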
4 Generalization of NRPA
We propose to generalize the NRPA algorithm by generalizing the way the probability is calculated, using a temperature $\tau$ and a bias $\beta_{is}$:

$$p_{is} = \frac{e^{\frac{w_{is}}{\tau} + \beta_{is}}}{\sum_k e^{\frac{w_{ks}}{\tau} + \beta_{ks}}}$$
4.1 Theoretical Analysis
The formula for the derivative of $f(x) = e^{\frac{x}{\tau}}$ is:

$$f'(x) = \frac{1}{\tau} e^{\frac{x}{\tau}}$$

So the derivative of $p_{bs}$ relative to $w_{bs}$ is:

$$\frac{\partial p_{bs}}{\partial w_{bs}} = \frac{1}{\tau} p_{bs} (1 - p_{bs})$$

The derivative of $p_{bs}$ relative to $w_{is}$ with $i \neq b$ is:

$$\frac{\partial p_{bs}}{\partial w_{is}} = -\frac{1}{\tau} p_{is} p_{bs}$$

We then derive the cross-entropy loss and the softmax to calculate the gradient:

$$\nabla w_{is} = \frac{1}{\tau}(p_{is} - \delta_{bi})$$

If we use $\alpha\tau$ as a learning rate we update the weights with:

$$w_{is} \leftarrow w_{is} + \alpha(\delta_{bi} - p_{is})$$

This is a generalization of NRPA since when we set $\tau = 1$ and $\beta_{is} = 0$ we get NRPA.
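A minimal sketch of the generalized probability and update (hypothetical function names; weights and biases are plain lists):

```python
import math

def gnrpa_probs(weights, biases, tau):
    """GNRPA playout distribution: p_i proportional to exp(w_i/tau + beta_i)."""
    logits = [w / tau + b for w, b in zip(weights, biases)]
    z = sum(math.exp(x) for x in logits)
    return [math.exp(x) / z for x in logits]

def gnrpa_update(weights, biases, tau, best, alpha=1.0):
    """One gradient step with learning rate alpha*tau:
    w_i <- w_i + alpha * (delta_bi - p_i)."""
    p = gnrpa_probs(weights, biases, tau)
    return [w + alpha * ((1.0 if i == best else 0.0) - p[i])
            for i, w in enumerate(weights)]
```

Setting `tau=1` and all biases to zero recovers exactly the NRPA sampling and update of the previous section.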
4.2 Equivalence of Algorithms
Let the weights and probabilities of playing moves be indexed by the iteration of the GNRPA level. Let $w_{isk}$ be the weight $w_{is}$ at iteration $k$, $p_{isk}$ be the probability of playing move $i$ at step $s$ at iteration $k$, and $\delta_{bik}$ be the $\delta_{bi}$ at iteration $k$.

By recurrence we get:

$$w_{isk} = w_{is0} + \alpha \sum_{j=0}^{k-1} (\delta_{bij} - p_{isj})$$

From this equation we can deduce the equivalence between different algorithms. For example, GNRPA with temperature $\tau$ and learning rate $\alpha$ is equivalent to GNRPA with temperature $1$ and learning rate $\frac{\alpha}{\tau}$, provided we set the initial weights of the second algorithm to $\frac{w_{is0}}{\tau}$. It means we can always use $\tau = 1$ provided we correspondingly set $\alpha$ and $w_{is0}$.

Another deduction we can make is that we can set $\beta_{is} = 0$ provided we set $w_{is0} = \tau \beta_{is}$. Conversely, we can set $w_{is0} = 0$ and use only $\beta_{is}$, which is easier.

These equivalences mean that GNRPA is equivalent to NRPA with the appropriate $\alpha$ and $w_{is0}$. However, it can be more convenient to use $\beta_{is}$ than to initialize the weights, as we will see for SameGame.
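The bias/weight-initialization equivalence can be verified numerically; the sketch below (hypothetical names) iterates the GNRPA update and checks that a bias $\beta$ with zero initial weights produces the same playout distributions as zero bias with the weights initialized to $\tau\beta$:

```python
import math

def probs(weights, biases, tau):
    """GNRPA playout distribution: p_i proportional to exp(w_i/tau + beta_i)."""
    logits = [w / tau + b for w, b in zip(weights, biases)]
    z = sum(math.exp(x) for x in logits)
    return [math.exp(x) / z for x in logits]

def run(weights, biases, tau, best, alpha, steps):
    """Iterate the GNRPA update, recording the distribution at each step."""
    w = list(weights)
    history = []
    for _ in range(steps):
        p = probs(w, biases, tau)
        history.append(p)
        w = [wi + alpha * ((1.0 if i == best else 0.0) - p[i])
             for i, wi in enumerate(w)]
    return history
```

Since the update only depends on the probabilities, and the logits $\frac{w}{\tau} + \beta$ and $\frac{w + \tau\beta}{\tau}$ coincide, the two runs stay identical at every iteration.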
5 Optimizations of GNRPA
5.1 Avoid Calculating Again the Possible Moves
In problems such as SameGame the computation of the possible moves is costly. In this case it is important to avoid recomputing the possible moves of the best playout in the Adapt function: they have already been calculated during the playout that found the best sequence. The optimized playout algorithm memorizes the codes of the possible moves during a playout in a matrix $code$, where the cell $code[s][i]$ contains the code of the possible move of index $i$ at state number $s$ of the best sequence. State number 0 is the initial state of the problem. The array $index$ memorizes the index of the code of the best move for each state number: $length$ is the length of the best sequence and $index[s]$ is the index of the best move for state number $s$.
5.2 Avoid the Copy of the Policy
The Adapt algorithm of NRPA and GNRPA treats the states of the sequence to learn as a batch: the sum of the gradients is calculated for the entire sequence and then applied. The way it is done in NRPA is by copying the policy to a temporary policy, modifying the temporary policy while computing the gradient with the unmodified policy, and then copying the modified temporary policy back to the policy.
When the number of possible codes is large, copying the policy can be costly. We propose to change the Adapt algorithm so as to avoid copying the policy twice at each Adapt call. We also use the memorized codes and indices so as to avoid recomputing the possible moves of the best sequence.
The way to avoid copying the policy is to make a first loop that computes the probabilities of each move of the best sequence (lines 2-8 of Algorithm 6): the matrix $p$ stores in $p[s][i]$ the probability of the move of index $i$ in state number $s$, and the array $z$ stores the normalizing sum for state number $s$. The second step is to apply the gradient directly to the policy for each state number and each code (lines 9-14).
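A sketch of this two-pass Adapt (hypothetical names; the `code` and `index` structures are those of Section 5.1):

```python
import math

def adapt_in_place(policy, code, index, length, alpha=1.0):
    """Two-pass Adapt that avoids copying the policy.

    code[s] lists the codes of the legal moves of state number s of the
    best sequence and index[s] is the position of the best move in it."""
    # First pass: probabilities of every move, all taken before any update.
    probabilities = []
    for s in range(length):
        exps = [math.exp(policy.get(c, 0.0)) for c in code[s]]
        z = sum(exps)
        probabilities.append([e / z for e in exps])
    # Second pass: apply the whole gradient directly to the policy.
    for s in range(length):
        for i, c in enumerate(code[s]):
            gradient = (1.0 if i == index[s] else 0.0) - probabilities[s][i]
            policy[c] = policy.get(c, 0.0) + alpha * gradient
    return policy
```

Because every probability is computed before any weight changes, the result is the same batched gradient step as the copy-based Adapt, without the two policy copies.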
6 Experimental Results
We now give experimental results for SameGame and TSPTW.
The first algorithm we test is the standard NRPA algorithm, with the codes of the moves computed using Zobrist hashing of the cells of the moves [17, 10, 9]. The selective policy used is to avoid the moves of the dominant color, except for moves of size two after move number ten. The codes of the possible moves of the best playout are recorded so as to avoid recomputing the possible moves in the Adapt function. It is called NRPA.
Using Zobrist hashing of the moves and biasing the policy with $\beta_{is}$ is better at SameGame than initializing the weights, since there are too many possible moves and weights. We tried to reduce the number of possible codes for the moves but it gave worse results. The second algorithm we test uses Zobrist hashing and the selective policy expressed as a bias. It is GNRPA with $\tau = 1$ and a bias $\beta_{is}$ that takes one value if the move is of size 2 and of the tabu color, and another value depending on $n$ otherwise, where $n$ is the number of cells of the move. The algorithm is called GNRPA.beta.
The third algorithm we test is to use Zobrist hashing, the selective policy, and the optimized Adapt function. The algorithm is called GNRPA.beta.opt.
All algorithms are run 200 times for 655.36 seconds and average scores are recorded each time the search time is doubled.
The evolution of the average score of the algorithms is given in figure 1. We can see that GNRPA.beta is better than NRPA but that for scores close to the current record of the problem the difference is small. GNRPA.beta.opt is the best algorithm as it searches more than GNRPA.beta for the same time.
Table 1 gives the average scores for the three algorithms with the 95% confidence interval in parentheses.
|Time (s)|NRPA|GNRPA.beta|GNRPA.beta.opt|
|40.96|2435.12 (49.26)|2513.35 (53.57)|2591.46 (52.50)|
|81.92|2676.39 (47.16)|2749.33 (47.82)|2777.83 (48.05)|
|163.84|2838.99 (41.82)|2887.78 (39.50)|2907.23 (38.45)|
|327.68|2997.74 (21.39)|3024.68 (18.27)|3057.78 (13.52)|
|655.36|3081.25 (10.66)|3091.44 (10.96)|3116.54 (7.42)|
The Traveling Salesman Problem with Time Windows (TSPTW) is a practical problem that has everyday applications. NRPA can be used to efficiently solve practical logistics problems faced by large companies such as EDF.
In NRPA, paths with violated constraints can be generated. As presented in previous work on the TSPTW, a new score of a path $p$ can be defined as follows:

$$score(p) = -(cost(p) + 10^6 \times \Omega(p))$$

with $cost(p)$ the sum of the distances of the path and $\Omega(p)$ the number of violated constraints. $10^6$ is a constant chosen high enough so that the algorithm first optimizes the constraints.
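As a minimal sketch of this scoring (hypothetical function name; the $10^6$ penalty per violation is the one used in this section's experiments):

```python
def tsptw_score(tour_length, violations, penalty=10**6):
    """Score of a TSPTW path: the tour length plus a large penalty per
    violated time-window constraint, negated so higher scores are better."""
    return -(tour_length + penalty * violations)
```

With a penalty this large, any path with fewer violations always scores above any path with more, so the search satisfies the constraints before shortening the tour.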
The problem we use to experiment with the TSPTW is the most difficult problem from the set of Potvin and Bengio.
In order to initialize $\beta$ we normalize the distances and multiply the result by ten, with the sign chosen so that shorter distances get the larger bias: $\beta_{ij} = -10 \times \frac{d_{ij} - d_{min}}{d_{max} - d_{min}}$, where $d_{min}$ is the smallest possible distance and $d_{max}$ the greatest possible one.
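A sketch of this initialization (hypothetical function name, assuming the sign convention that shorter edges receive the larger bias):

```python
def edge_bias(d, d_min, d_max):
    """Bias for an edge of length d: the normalized distance scaled by
    ten and negated, so the shortest edge gets bias 0 and the longest
    gets bias -10 (assumed sign convention)."""
    return -10.0 * (d - d_min) / (d_max - d_min)
```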
All algorithms are run 200 times for 655.36 seconds and average scores are recorded each time the search time is doubled.
Figure 2 gives the curves for the three GNRPA algorithms we have tested, with a logarithmic time scale on the x axis.
We could not represent the curve for NRPA in figure 2 since its average values are too low; they are given in table 2. It is possible to improve considerably on standard NRPA by initializing the weights with the distances between cities [11, 5]. However this solution is not practical for all problems, as we have seen with SameGame, and using a bias is more convenient and general. We also tried initializing the weights with $\tau \beta_{is}$ instead of using the bias, and we got similar results to the use of $\beta_{is}$.
We can see in figure 2 that using a temperature of 1.4 improves on a temperature of 1.0. Using the optimized Adapt function does not improve GNRPA for TSPTW since in the TSPTW problem the policy array and the number of possible moves are very small and copying the policy is fast.
The curve of the best algorithm is asymptotic toward the best value found by all algorithms. It reaches better scores faster.
Table 2 gives the average values for NRPA and the three GNRPA algorithms we have tested. As there is a penalty of 1 000 000 for each constraint violation, NRPA has very low scores compared to GNRPA; this is why NRPA is not depicted in figure 2. For a search time of 655.36 seconds, and not taking the constraints into account, NRPA usually reaches tour scores between -900 and -930, much worse than GNRPA. We can observe that using a temperature is beneficial until we use 655.36 seconds and approach the asymptotic score, when both algorithms have similar scores. The numbers in parentheses in the table are the 95% confidence intervals.
|Time (s)|NRPA|GNRPA τ=1.0|GNRPA τ=1.4|GNRPA τ=1.4 opt|
|40.96|-3745986.46 (245766.53)|-897.60 (1.32)|-892.89 (0.96)|-892.17 (1.04)|
|81.92|-1750959.11 (243210.68)|-891.04 (1.05)|-886.97 (0.87)|-886.52 (0.83)|
|163.84|-1030946.86 (212092.35)|-888.44 (0.98)|-883.87 (0.71)|-884.07 (0.70)|
|327.68|-285933.63 (108975.99)|-883.61 (0.63)|-880.76 (0.40)|-880.83 (0.32)|
|655.36|-45918.97 (38203.97)|-880.42 (0.30)|-879.35 (0.16)|-879.45 (0.17)|
We presented a theoretical analysis and a generalization of NRPA named GNRPA. GNRPA uses a temperature $\tau$ and a bias $\beta_{is}$.
We have theoretically shown that using a bias is equivalent to initializing the weights. For SameGame, initializing the weights can be difficult if we initialize all the weights at the start of the program since there are too many possible weights, whereas using a bias is easier and improves search at SameGame. A lazy initialization of the weights would also be possible in this case and would solve the weight initialization problem for SameGame. For some other problems the bias could be more specific than the code of the move, i.e. a move could be associated to different biases depending on the state. In this case different biases could be used in different states for the same move, which would not be possible with weight initialization.
We have also theoretically shown that the learning rate and the temperature can replace each other. Tuning the temperature and using a bias has been very beneficial for the TSPTW.
The remaining work is to apply the algorithm to other domains and to improve the way to design formulas for the bias $\beta_{is}$.
- C. Boutilier (Ed.) (2009) IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11-17, 2009. Cited by: 8.
-  (2013) Monte-carlo fork search for cooperative path-finding. In Computer Games - Workshop on Computer Games, CGW 2013, Held in Conjunction with the 23rd International Conference on Artificial Intelligence, IJCAI 2013, Beijing, China, August 3, 2013, Revised Selected Papers, pp. 1–15. Cited by: §1.
-  (2016) Burnt pancake problem: new lower bounds on the diameter and new experimental optimality ratios. In Proceedings of the Ninth Annual Symposium on Combinatorial Search, SOCS 2016, Tarrytown, NY, USA, July 6-8, 2016, pp. 119–120. Cited by: §1.
-  (2012-03) A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4 (1), pp. 1–43. External Links: Cited by: §1.
-  (2020) Monte carlo vehicle routing. Submitted. Cited by: §6.2, §6.2.
-  (2016) Nested monte carlo search for two-player games. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 687–693. External Links: Cited by: §1.
-  (2012) Application of the nested rollout policy adaptation algorithm to the traveling salesman problem with time windows. In Learning and Intelligent Optimization - 6th International Conference, LION 6, Paris, France, January 16-20, 2012, Revised Selected Papers, pp. 42–54. Cited by: §1.
-  (2009) Nested Monte-Carlo Search. See IJCAI 2009, proceedings of the 21st international joint conference on artificial intelligence, pasadena, california, usa, july 11-17, 2009, Boutilier, pp. 456–461. Cited by: §1.
-  (2016) Nested rollout policy adaptation with selective policies. In Computer Games, pp. 44–56. Cited by: §6.1.
-  (2016) Improved diversity in nested rollout policy adaptation. In Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz), pp. 43–55. Cited by: §6.1.
-  (2013) Algorithm and knowledge engineering for the TSPTW problem. In Computational Intelligence in Scheduling (SCIS), 2013 IEEE Symposium on, pp. 44–51. Cited by: §1, §6.2.
-  (2016) Monte-carlo tree search for logistics. In Commercial Transport, pp. 427–440. Cited by: §1.
-  (2014) Monte-carlo tree search for 3d packing with object orientation. In KI 2014: Advances in Artificial Intelligence, pp. 285–296. Cited by: §1.
-  (2014) Solving physical traveling salesman problems with policy adaptation. In Computational Intelligence and Games (CIG), 2014 IEEE Conference on, pp. 1–8. Cited by: §1.
-  (2015) Monte-carlo tree search for the multiple sequence alignment problem. In Eighth Annual Symposium on Combinatorial Search, Cited by: §1.
-  (2010) Combining UCT and Nested Monte Carlo Search for single-player general game playing. IEEE Transactions on Computational Intelligence and AI in Games 2 (4), pp. 271–277. Cited by: §1.
-  (2017) Distributed nested rollout policy for samegame. In Workshop on Computer Games, pp. 108–120. Cited by: §6.1.
-  (2018) An unexpectedly effective monte carlo technique for the rna inverse folding problem. bioRxiv, pp. 345587. Cited by: §1.
-  (1996) The vehicle routing problem with time windows part ii: genetic search. INFORMS journal on Computing 8 (2), pp. 165–172. Cited by: §6.2.
-  (2014) Generating structured test data with specific properties using nested monte-carlo search. In Genetic and Evolutionary Computation Conference, GECCO '14, Vancouver, BC, Canada, July 12-16, 2014, pp. 1279–1286. Cited by: §1.
-  (2015) Heuristic model checking using a monte-carlo tree search algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2015, Madrid, Spain, July 11-15, 2015, pp. 1359–1366. Cited by: §1.
-  (2011) Optimization of the Nested Monte-Carlo algorithm on the traveling salesman problem with time windows. In Applications of Evolutionary Computation - EvoApplications 2011: EvoCOMNET, EvoFIN, EvoHOT, EvoMUSART, EvoSTIM, and EvoTRANSLOG, Torino, Italy, April 27-29, 2011, Proceedings, Part II, Lecture Notes in Computer Science, Vol. 6625, pp. 501–510. Cited by: §1, §6.2.
-  (2011) Nested rollout policy adaptation for Monte Carlo Tree Search. In IJCAI, pp. 649–654. Cited by: §1.