Nested Monte Carlo Search (NMCS) uses multiple levels of search, memorizing the best sequence of each level. It has been applied to many single-player games and optimization problems [32, 26, 24, 3, 11, 30, 31, 4] and also to two-player games.
Nested Rollout Policy Adaptation (NRPA) also uses multiple levels of search, memorizing the best sequences. It additionally learns a playout policy from the best sequences. It has also been applied to many problems [8, 19, 21, 22, 23, 20, 18] and games.
The RNA design problem, also called the RNA Inverse Folding problem, is computationally hard. This problem is important for scientific fields such as bioengineering, pharmaceutical research, biochemistry, synthetic biology and RNA nanostructures. NMCS has been successfully applied to the RNA Inverse Folding problem with the NEMO program. As a follow-up to NEMO, we propose to investigate different Monte Carlo Search algorithms for this problem.
The paper is organized as follows. The second section describes the Inverse Folding problem, the NEMO program by Fernando Portela and the domain knowledge used in NEMO. The third section describes the Monte Carlo Search algorithms we have used for solving Inverse Folding problems of the Eterna100 benchmark. We present a new algorithm that performs well on this problem, the GNRPA algorithm with restarts. The fourth section details experimental results.
2 Inverse Folding
2.1 Presentation of the RNA inverse folding problem
An RNA strand is a molecule composed of a sequence of nucleotides. This strand folds back on itself to form what is called its secondary structure (see Figure 1). The folded structure of a given sequence can be computed in polynomial time. The opposite direction, the RNA inverse folding problem, is much harder and is believed to be NP-complete. It still resists algorithmic approaches, which fail to match the performance of human experts. This is partly due to the chaotic changes that a small change in the sequence can cause in the secondary structure (see the difference between Figure 3 and 2.b).
These performances are evaluated on the Eterna100 benchmark, which contains 100 RNA secondary structure puzzles of varying difficulty. A puzzle consists of a target structure given in dot-bracket notation. This notation describes a structure as a sequence of parentheses and dots, each representing a base. Matching parentheses denote paired bases and dots denote unpaired ones. The puzzle is solved when a sequence over the four nucleotides A, U, G and C that folds into the target structure is found. In some puzzles, the value of certain bases is imposed.
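To make the dot-bracket notation concrete, the following minimal Python sketch converts a structure string into a table of pairing partners. The function name `parse_dot_bracket` and the convention of using -1 for unpaired bases are illustrative assumptions, not part of NEMO or of the benchmark tooling.

```python
def parse_dot_bracket(structure: str) -> list[int]:
    """Return partner[i], the index paired with base i, or -1 if i is unpaired."""
    partner = [-1] * len(structure)
    stack = []  # indices of opening parentheses not yet closed
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partner


print(parse_dot_bracket("((..))"))  # [5, 4, -1, -1, 1, 0]
```

A partner table of this kind is enough to tell, for each position, whether a move assigns a single base or both bases of a pair.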
While human experts have managed to solve all 100 problems of the benchmark, no program has so far achieved such a score. The best score so far is 95/100 by NEMO, the NEsted MOnte Carlo RNA Puzzle Solver.
2.2 NEMO
NEMO works by performing several iterations of NMCS-B, a slightly modified version of NMCS. Between iterations of NMCS-B, NEMO retains part of the best current solution. It uses stochastic heuristics to identify a part of the sequence on which to perform mutations and restarts NMCS-B on it. A candidate is modeled as a sequence in which assigned bases are represented by the corresponding letter (A/U/G/C) and the others by the letter N. This is how NMCS-B identifies the bases on which it must work.
NEMO uses a level 1 NMCS for its NMCS-B. At level 1, in each state of the problem, NMCS performs a certain number of playouts for each possible move. It then plays the move that led to the best playout and moves to the next state, until it reaches a final state. In the context of NEMO, each state corresponds to a sequence, initially the candidate sequence. Each move consists in taking the first N in the sequence and assigning a value to it, working on paired bases first. When the base is paired in the target structure, the move assigns a value to both bases of the pair simultaneously. Indeed, only the three combinations AU, GC and GU can be paired, so it is more convenient to consider both bases at the same time. To perform playouts, NEMO also uses heuristics with biased weights that depend on the location in the target structure. In addition, unlike the classic NMCS, NMCS-B retains the best playout achieved so far throughout the execution. A final state is reached when the sequence is fully completed. The playouts are evaluated according to the function:
This function depends on three quantities: the number of pairs that differ between the secondary structure of the sequence and the target structure, the number of pairs in the target structure, and the difference between the Minimum Free Energy of the secondary structure and the free energy that the sequence would have in the target structure.
2.3 Domain Knowledge and Heuristics
The heuristics used for the sampling of the NMCS-B are based on domain knowledge and personal experience and are chosen without computational optimization.
The paired bases are generally chosen according to the same rule. With the exception of adjacent stacks in multi-loops, the closing pairs of the left-most and rightmost stacks are chosen with the weights given in the following table. The notion of right and left in this case is defined from the point of view of the inside of the loop.
| Left-Most in Junction | 82% | 11% | 7% |
| Right-Most in Junction | 37% | 56% | 7% |
Various rules are applied for unpaired bases. The weights used to choose between A/U/G/C in the general case are 93%, 1%, 5% and 1%. Mismatches are treated differently depending on the case.
Since NEMO first assigns a value to the paired bases, the weights for the bases with a paired mismatch depend on the value of this mismatch.
| Mismatch with a Paired Base | A | U | G | C |
| Mismatch is a paired A | 63% | 0% | 25% | 12% |
| Mismatch is a paired U | 0% | 55% | 9% | 36% |
| Mismatch is a paired G | 25% | 12% | 63% | 0% |
| Mismatch is a paired C | 55% | 36% | 0% | 9% |
Furthermore, in internal loops, the weights also depend on the mismatch value if it has already been assigned, otherwise a more general rule applies.
| Mismatch in Internal Loops | A | U | G | C |
| Mismatch is not assigned | 18% | 4% | 74% | 4% |
| Mismatch is A | 44% | 0% | 44% | 12% |
| Mismatch is U | 0% | 67% | 11% | 22% |
| Mismatch is G | 67% | 11% | 22% | 0% |
| Mismatch is C | 66% | 17% | 0% | 17% |
Finally, mismatches in junctions and external loops are drawn according to the distribution 97%, 1%, 1% and 1%.
Much stronger and more deterministic rules are applied with high probabilities (more than 80%) in specific cases, especially for triloops and internal loops. For both 1-1 and 2-2 internal loops for instance, only one mismatch pair is possible, and there are only three possibilities in the general case. This is part of a process of reproducing a "boosting" strategy. Depending on the type of loop, certain combinations of nucleotides at specific locations called "boosting points", especially terminal mismatches, can be used to reduce the energy of the structure. However, the most difficult puzzles may require less conventional solutions, hence the need not to apply these rules 100% of the time.
Therefore, when we use this heuristic we do not apply these last rules, and when we mention the weights of the NEMO heuristic we refer to the values given above.
In addition, between iterations of NMCS-B, if the problem has not been solved, NEMO keeps part of the best current solution and restarts the algorithm on it. The set of bases that are not kept contains those that do not fold correctly, their neighborhood and randomly selected bases. This principle has not been applied to the presented algorithms.
3 Monte Carlo Search
3.1 Presentation of the NRPA algorithm
The Nested Rollout Policy Adaptation (NRPA) algorithm is a Monte Carlo Tree Search based algorithm that adapts its rollout policy during execution. It is a recursive algorithm. At level 0 it generates a playout according to the current policy. At level n, it calls the level n-1 algorithm for a given number of iterations, adapting the policy each time toward the best solution found so far. NRPA is given in Algorithms 1, 2 and 3.
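The recursive scheme described above can be sketched in a few lines of Python. This is a minimal illustration on a toy problem, not the implementation used in the experiments: the move code `(index, move)`, the learning rate `alpha = 1.0`, the fixed number of moves per position and the toy score function are all assumptions made for the example.

```python
import math
import random


def playout(policy, length, n_moves, score_fn, rng):
    """Level 0: sample a full sequence with softmax probabilities over weights."""
    seq = []
    for i in range(length):
        weights = [math.exp(policy.get((i, b), 0.0)) for b in range(n_moves)]
        seq.append(rng.choices(range(n_moves), weights=weights, k=1)[0])
    return score_fn(seq), seq


def adapt(policy, seq, n_moves, alpha=1.0):
    """Shift the policy toward seq; probabilities use the unmodified policy."""
    new = dict(policy)
    for i, best in enumerate(seq):
        z = sum(math.exp(policy.get((i, b), 0.0)) for b in range(n_moves))
        for b in range(n_moves):
            p = math.exp(policy.get((i, b), 0.0)) / z
            new[(i, b)] = new.get((i, b), 0.0) - alpha * p
        new[(i, best)] = new.get((i, best), 0.0) + alpha
    return new


def nrpa(level, policy, n_iter, length, n_moves, score_fn, rng):
    """Level n calls level n-1 n_iter times, adapting to the best sequence so far."""
    if level == 0:
        return playout(policy, length, n_moves, score_fn, rng)
    best_score, best_seq = -math.inf, None
    pol = dict(policy)  # each level works on its own copy of the policy
    for _ in range(n_iter):
        score, seq = nrpa(level - 1, pol, n_iter, length, n_moves, score_fn, rng)
        if score >= best_score:
            best_score, best_seq = score, seq
        pol = adapt(pol, best_seq, n_moves)
    return best_score, best_seq


# Toy problem: recover a hidden target, scoring the number of matching positions.
target = [2, 0, 1, 3, 1, 0]
count_matches = lambda seq: sum(a == b for a, b in zip(seq, target))
print(nrpa(2, {}, 20, len(target), 4, count_matches, random.Random(0)))
```

For Inverse Folding the score function would be replaced by a folding-based evaluation, and the number of moves would depend on whether the position is paired in the target structure.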
3.2 Application of NRPA to Inverse Folding
In the Inverse Folding problem, a solution is a chain of bases. A playout runs through the target structure of the chain, and each move assigns a value to one of the missing links. We distinguish between two cases. Unpaired bases in the target structure have four possible moves, one for each nucleic base (A/U/G/C). Paired bases have six possible moves, one for each possible ordered combination with their partner (GC/CG/AU/…). Each move is therefore defined by its position in the chain and whether it is a pair or not. Thus, there is a fixed number of moves, which are always ordered in the same way. Solutions are evaluated with the same score function as in the NEMO algorithm, which is a combination of the fitness of the chain with the target structure and the difference between the target structure and the folded chain structure.
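3.3 GNRPA
Let $w_{ib}$ be the weight associated to move $b$ at index $i$ in the sequence. In NRPA the probability of choosing move $b$ at index $i$ is:
$$p_{ib} = \frac{e^{w_{ib}}}{\sum_{k} e^{w_{ik}}}$$
We propose to try GNRPA for Inverse Folding and to replace it with:
$$p_{ib} = \frac{e^{w_{ib} + \beta_{ib}}}{\sum_{k} e^{w_{ik} + \beta_{ik}}}$$
where we use for $\beta_{ib}$ the logarithm of the probabilities used in NEMO.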
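The effect of using log-probabilities as a bias can be checked with a small Python sketch. The function names `log_bias` and `gnrpa_probabilities` are hypothetical, and mapping a zero prior probability to a bias of minus infinity is an assumption of this example. With all weights at zero, the resulting distribution is exactly the prior, so the search starts from the heuristic and then learns deviations from it.

```python
import math


def log_bias(prior):
    """Bias = log of the prior probability; zero probability maps to -inf."""
    return [math.log(p) if p > 0.0 else -math.inf for p in prior]


def gnrpa_probabilities(weights, biases):
    """Softmax over weight + bias, shifted by the maximum for numerical stability."""
    logits = [w + b for w, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Prior taken from the row "Mismatch is a paired A" (A/U/G/C order).
prior = [0.63, 0.0, 0.25, 0.12]
print(gnrpa_probabilities([0.0, 0.0, 0.0, 0.0], log_bias(prior)))
print(gnrpa_probabilities([0.0, 0.0, 2.0, 0.0], log_bias(prior)))
```

In the second call the learned weight raises the probability of G, while the forbidden move U keeps probability zero whatever its weight.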
3.4 Stabilized GNRPA
Stabilized NRPA is a simple improvement of NRPA. The principle is to play P playouts at level 1 before each call to the adapt function. The number of calls to the adapt function at level 1 is still N, the number of iterations of the upper levels. So at level 1, $P \times N$ playouts are performed.
3.5 Beam GNRPA
Beam NRPA has already been applied successfully to the TSPTW and to Morpion Solitaire. The best results were obtained using a beam at level 1. Similarly to Stabilized GNRPA, at level 1, playouts are performed for a beam of size B. However, the algorithm differs from Stabilized NRPA since it memorizes the B best sequences and B policies and plays the B playouts with different policies.
Like Stabilized GNRPA, it is embarrassingly parallel at level 1 and can be very efficient on a parallel machine.
When using Beam NRPA it can be beneficial to ensure the diversity of the beam. The diversity criterion we have used is to keep in the beam only sequences that have different scores. It is simple and efficient, as it ensures diversity while keeping enough sequences.
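The score-based diversity criterion can be sketched as follows in Python. The function `diverse_beam` and the representation of a beam entry as a `(score, sequence)` tuple are assumptions for illustration; a real beam entry would also carry its policy.

```python
def diverse_beam(candidates, beam_size):
    """Keep the best candidates, at most one per distinct score."""
    beam, seen_scores = [], set()
    # Stable sort: among equal scores, the earliest candidate is kept.
    for score, seq in sorted(candidates, key=lambda c: c[0], reverse=True):
        if score in seen_scores:
            continue
        seen_scores.add(score)
        beam.append((score, seq))
        if len(beam) == beam_size:
            break
    return beam


candidates = [(0.5, "AAGC"), (0.9, "GGCC"), (0.9, "GCGC"), (0.7, "AUAU")]
print(diverse_beam(candidates, 2))  # [(0.9, 'GGCC'), (0.7, 'AUAU')]
```

Comparing scores is much cheaper than comparing sequences, and two sequences with different scores are necessarily different.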
3.6 Coding Moves
The natural way to code moves for Inverse Folding is to use the index of the base or of the pair of bases in the string (m.index) and the index of the base in the list of bases or of the pair of bases in the list of pairs of bases (m.number). The formula is then:
For example if the move is to put the fourth base at index 10 the code is , provided strings always have less than 2000 characters and therefore .
It may be interesting to include the previously chosen bases in the code of a move, for example if a base has meaning only if following another base. We call the history of a code the number of bases in the history included in the code. The previous formula holds for a code history of 0. The code for a code history of 1 is:
Six is the maximum value for a move number, since the maximum number of legal moves is 6. The code for a code history of 2 includes the two previous moves in the code.
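One possible encoding consistent with this description is sketched below in Python. It is an assumption, not necessarily the formula used in the experiments: the string index occupies the low part of the code (bounded by 2000), and the move number together with the numbers of the previous moves is packed above it in base 7, which covers move numbers from 0 to 6. The names `MAX_LENGTH`, `RADIX` and `move_code` are hypothetical.

```python
MAX_LENGTH = 2000  # strings are assumed to have fewer than 2000 characters
RADIX = 7          # assumed: move numbers lie in 0..6


def move_code(index, number, history=()):
    """Pack (index, number, previous move numbers) into one integer code."""
    if not 0 <= index < MAX_LENGTH:
        raise ValueError("index out of range")
    packed = number
    for previous in history:  # empty for a code history of 0
        if not 0 <= previous < RADIX:
            raise ValueError("move number out of range")
        packed = packed * RADIX + previous
    return packed * MAX_LENGTH + index


print(move_code(10, 3))        # history 0: 3 * 2000 + 10 = 6010
print(move_code(10, 3, (2,)))  # history 1: (3 * 7 + 2) * 2000 + 10 = 46010
```

With a fixed history length, two different combinations of index, move number and history never receive the same code, so each combination gets its own policy weight.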
3.7 Start Learning
In order to wait for better sequences before learning, it is possible to delay learning until a given number of sequences have been found.
3.8 Zobrist Hashing
Each state is associated to a different hash code. This is done with Zobrist Hashing. Each move at each index is associated to a random number. The hash code of a state is the XOR of all the random numbers corresponding to the moves that have been played to reach this state. Zobrist hashing is used in games to build a transposition table. We will use it in the UCT variants. Another use of Zobrist Hashing is to detect playouts that have already been evaluated. As most of the time in Inverse Folding is spent scoring the playouts, it is advisable to avoid reevaluating an already evaluated playout. This is done with a score hash table that contains the hashes of the terminal states already encountered, associated to their computed scores. If a terminal state is met again, the score need not be recomputed and can simply be read from the table. The number of playouts that have already been evaluated varies with the sequence: some sequences have very few while others have many.
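The score hash table can be sketched as follows in Python. The class name `ZobristScoreCache`, the use of 64-bit random keys and the `(index, number)` representation of a move are assumptions for illustration. With 64-bit keys, two distinct terminal states can in principle collide, but this is very unlikely for the number of playouts considered here.

```python
import random


class ZobristScoreCache:
    """Caches scores of terminal states, keyed by a Zobrist hash of the moves."""

    def __init__(self, length, n_moves, seed=0):
        rng = random.Random(seed)
        # One random 64-bit key per (index, move number).
        self.keys = [[rng.getrandbits(64) for _ in range(n_moves)]
                     for _ in range(length)]
        self.scores = {}
        self.hits = 0

    def hash_moves(self, moves):
        """XOR the keys of all moves played to reach the state."""
        h = 0
        for index, number in moves:
            h ^= self.keys[index][number]
        return h

    def score(self, moves, evaluate):
        """Return the cached score, calling evaluate only for unseen states."""
        h = self.hash_moves(moves)
        if h in self.scores:
            self.hits += 1
        else:
            self.scores[h] = evaluate(moves)
        return self.scores[h]


cache = ZobristScoreCache(length=4, n_moves=6)
expensive = lambda moves: len(moves) / 4  # stand-in for folding and scoring
moves = [(0, 1), (1, 2)]
print(cache.score(moves, expensive), cache.score(moves, expensive), cache.hits)
```

Because XOR is commutative, the hash does not depend on the order in which the moves were played, so the same final sequence reached through two different move orders maps to the same table entry.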
3.9 Restarts
There can be large variations in the solving times of some problems. Sometimes the search algorithm takes a wrong direction and stays stuck on a suboptimal sequence without making any progress. A way to deal with this behavior is to periodically stop and restart the search. For some difficult problems, however, a long search is required to find the solution. There are multiple ways to use restarts. For example, the algorithm can double the search time at each restart. We call this method iterative doubling. We have observed that a level 2 search is able to solve many problems, so a restart strategy can also be to repeatedly call GNRPA at level 2 until the thinking time has elapsed. Another way to deal with a stuck search is to stop a level when the best sequence has not changed for a given number of recursive calls.
It is difficult to set a static restart strategy for all problems. Long sequences are much more difficult than short ones and the progress on long sequences is slower. In order to cope with this property we use a restart threshold. It is set to the length of the sequence divided by 5. When using this restart strategy there is no limit on the length of a level.
Algorithm 4 gives the GNRPA algorithm with restarts.
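The two restart ideas discussed in this section, iterative doubling and a stagnation threshold proportional to the sequence length, can be illustrated with the short Python sketch below. It is not Algorithm 4: the function names, the abstract notion of a search budget (which could be time or playouts) and the reading of the threshold as a count of recursive calls without improvement are assumptions made for the example.

```python
def iterative_doubling(search, is_solved, initial_budget, total_budget):
    """Restart search with a doubled budget until solved or out of budget."""
    budget, used, best = initial_budget, 0, None
    while used < total_budget:
        budget = min(budget, total_budget - used)
        result = search(budget)  # assumed to return (score, sequence)
        used += budget
        if best is None or result[0] > best[0]:
            best = result
        if is_solved(best):
            break
        budget *= 2
    return best, used


def stagnated(calls_without_improvement, sequence_length, divisor=5):
    """True once a level has gone sequence_length / divisor calls without progress."""
    return calls_without_improvement >= sequence_length // divisor


# Toy search that only succeeds when given a budget of at least 4.
toy_search = lambda budget: (1.0 if budget >= 4 else 0.0, "GGGAAACCC")
print(iterative_doubling(toy_search, lambda r: r[0] == 1.0, 1, 100))
print(stagnated(20, 100), stagnated(19, 100))
```

In the toy run the budgets are 1, 2 and 4, so the search succeeds after a total budget of 7. A threshold that grows with the sequence length gives long puzzles, which progress more slowly, more calls before a level is abandoned.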
3.10 Parallelization
The multiple playouts of stabilized NRPA and the loop over the elements of the beam are embarrassingly parallel. We simply parallelized with OpenMP a common loop including the beam and the stabilized playouts at level 1. If we have 4 stabilized playouts and a beam of 8, the 32 resulting playouts are played in parallel. This is a form of leaf parallelization [6, 16].
We also experiment with root parallelization [6, 16]. The principle is to perform multiple independent GNRPA in parallel and to stop as soon as one has found a solution or when the allocated time is elapsed.
Leaf parallelization is more difficult to scale than root parallelization. For the same wall clock time, leaf parallelization runs more iterations for a single policy than root parallelization, which optimizes many more different policies but with fewer iterations each. On the other hand, root parallelization scales very well and has a built-in restart strategy. Root parallelization works well for problems that converge relatively rapidly on a suboptimal solution, as they benefit from restarts, while leaf parallelization works better for problems that converge slowly but steadily towards the best solutions.
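The control flow of root parallelization (independent searches, stop as soon as one succeeds) can be sketched in Python as below. This only illustrates the principle: it uses a thread pool from the standard library, which is an assumption of the example and differs from a multi-process setup, and the `search` callable is a stand-in for an independent GNRPA run with its own random seed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def root_parallel(search, seeds, is_solved, max_workers=4):
    """Run independent searches and stop collecting once one is solved."""
    best = None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(search, seed) for seed in seeds]
        for future in as_completed(futures):
            result = future.result()  # assumed to be (score, payload)
            if best is None or result[0] > best[0]:
                best = result
            if is_solved(result):
                for pending in futures:
                    pending.cancel()  # drops searches that have not started yet
                break
    return best


# Toy search: only the run started with seed 3 reaches the maximal score.
toy_search = lambda seed: (1.0 if seed == 3 else 0.5, seed)
print(root_parallel(toy_search, range(6), lambda r: r[0] == 1.0))  # (1.0, 3)
```

Searches that are already running are not interrupted in this sketch; a real implementation would also check a shared stop flag or a time limit inside the search loop.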
4 Experimental Results
We now detail experiments with the different Monte Carlo Search algorithms on the Eterna100 benchmark.
Table 1 gives the number of problems solved with different parameters for the GNRPA algorithm. At level 1, using GNRPA instead of NRPA solves 30 problems instead of 3. Similarly, at level 2 it solves 73 problems instead of 49. Using Stabilized GNRPA with P=4 and Beam GNRPA with a beam of 8 at level 1 also substantially increases the number of problems solved at levels 1 and 2. Interestingly, at level 2 this configuration is 32 times slower than a regular level 2 search and solves 85 problems, whereas a search at level 3 is 100 times slower and still solves 85 problems.
Table 2 gives the median time over 3 runs of a level 1 search with different numbers of threads on problem 64 which is difficult. The algorithm is GNRPA with P = 4, a beam of 8 and N = 100. We can see that using 8 threads gives the best results. We optimized memory in order to avoid cache misses but we were not able to have better results with more threads.
Table 3 gives the number of problems solved within a fixed time limit for different algorithms. The time limits range from 1 minute to 64 minutes per problem. The parallel algorithm runs on a multicore machine. The results of the parallel program are given in the last line of Table 3. The times used to stop the parallel program are wall clock times. The last line is leaf parallel GNRPA with restarts and gives the best results within 1 hour of wall clock time, solving 92 of the 100 problems in one run. Start is 4, meaning that it starts learning after 4 playouts. H is 1, meaning that the code includes the previous move. R is 3, meaning that the restart threshold is set to the length of the string divided by 3.
The problems solved by the different 1 hour runs we made are not always the same. Some hard problems are solved only in some runs. So 92 solved problems is not the limit of the algorithm. The problems that were never solved in 1 hour are problems 100, 99, 97, 91, 90 and 78. With a two-hour limit, problem 90 is solved, thus reaching 95 solved problems, the same number of solved problems as NEMO.
For the sake of completeness we also tested other popular Monte Carlo Search algorithms. The results are given in Table 4 for UCT, Nested Monte Carlo Search [10, 29] and Diversity NRPA. The UCT constant is set to 0.4, and NMCS is tested for repeated calls to level 1 and level 2. Diversity GNRPA is called with a set of 5 sequences at level 1 and 1 sequence at level 2. The improved GNRPA algorithm gives better results than these algorithms.
Table 5 gives the results over time of the best parallel algorithm using different options. The Correction option fixes a discrepancy in the heuristic between the NEMO paper and the NEMO code. The Order option chooses between ordering moves as NEMO does and using the order of the string. Given the results in the table, the two options do not seem to matter much.
Table 6 gives the number of problems solved over time using the root parallel GNRPA algorithm with 20 processes. The results are slightly worse than when using leaf parallelization with 8 threads. This is due to problems that converge slowly and do not benefit from restarts, for which leaf parallelization keeps improving the best policy for much longer than root parallelization. Given enough resources, the best algorithm might be the combination of root and leaf parallelization as in . The parallelization of NRPA proposed by Nagorko is also appealing.
We experimented with various Monte Carlo Search algorithms for the Inverse Folding problem. We have used very limited domain knowledge, essentially using a small part of the NEMO heuristics for the bias. By applying general Monte Carlo Search heuristics we were able to solve as many problems as NEMO in comparable times.
Thanks to Fernando Portela for his NEMO program. Tristan Cazenave is supported by the PRAIRIE institute.
[1] (2020) Designing RNA secondary structures is hard. Journal of Computational Biology 27 (3).
[2] C. Boutilier (Ed.) (2009) IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11-17, 2009.
[3] (2013) Monte-Carlo fork search for cooperative path-finding. In Computer Games - Workshop on Computer Games, CGW 2013, Held in Conjunction with the 23rd International Conference on Artificial Intelligence, IJCAI 2013, Beijing, China, August 3, 2013, Revised Selected Papers, pp. 1–15.
[4] (2016) Burnt pancake problem: new lower bounds on the diameter and new experimental optimality ratios. In Proceedings of the Ninth Annual Symposium on Combinatorial Search, SOCS 2016, Tarrytown, NY, USA, July 6-8, 2016, pp. 119–120.
[5] (2012-03) A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4 (1), pp. 1–43.
[6] (2007-06) On the Parallelization of UCT. In Computer Games Workshop, Amsterdam, Netherlands.
[7] (2020) Stabilized nested rollout policy adaptation. Submitted.
[8] (2012) Application of the nested rollout policy adaptation algorithm to the traveling salesman problem with time windows. In Learning and Intelligent Optimization - 6th International Conference, LION 6, Paris, France, January 16-20, 2012, Revised Selected Papers, pp. 42–54.
[9] (2012) Beam nested rollout policy adaptation. In Computer Games Workshop, ECAI 2012, pp. 1–12.
[10] (2009) Nested Monte-Carlo Search. In [2], pp. 456–461.
[11] (2013) Monte-Carlo expression discovery. International Journal on Artificial Intelligence Tools 22 (1).
[12] (2016) Nested rollout policy adaptation with selective policies. In Computer Games, pp. 44–56.
[13] (2016) Nested rollout policy adaptation with selective policies. In CGW at IJCAI 2016.
[14] (2016) Playout policy adaptation with move features. Theoretical Computer Science 644, pp. 43–52.
[15] (2020) Generalized nested rollout policy adaptation. CoRR abs/2003.10024.
[16] (2008) Parallel Monte-Carlo tree search. In International Conference on Computers and Games, pp. 60–71.
[17] (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, 5th International Conference, CG 2006, Turin, Italy, May 29-31, 2006. Revised Papers, H. J. van den Herik, P. Ciancarini, and H. H. L. M. Donkers (Eds.), Lecture Notes in Computer Science, Vol. 4630, pp. 72–83.
[18] (2016) Improved diversity in nested rollout policy adaptation. In KI 2016: Advances in Artificial Intelligence - 39th Annual German Conference on AI, Klagenfurt, Austria, September 26-30, 2016, Proceedings, pp. 43–55.
[19] Algorithm and knowledge engineering for the TSPTW problem. In Computational Intelligence in Scheduling (SCIS), 2013 IEEE Symposium on, pp. 44–51.
[20] (2016) Monte-Carlo tree search for logistics. In Commercial Transport, pp. 427–440.
[21] (2014) Monte-Carlo tree search for 3D packing with object orientation. In KI 2014: Advances in Artificial Intelligence, pp. 285–296.
[22] (2014) Solving physical traveling salesman problems with policy adaptation. In Computational Intelligence and Games (CIG), 2014 IEEE Conference on, pp. 1–8.
[23] (2015) Monte-Carlo tree search for the multiple sequence alignment problem. In Eighth Annual Symposium on Combinatorial Search.
[24] (2012) A new approach to the snake-in-the-box problem. In ECAI, Vol. 242, pp. 462–467.
[25] Bandit based Monte-Carlo planning. 17th European Conference on Machine Learning (ECML’06), LNCS, Vol. 4212, pp. 282–293.
[26] (2010) Combining UCT and Nested Monte Carlo Search for single-player general game playing. IEEE Transactions on Computational Intelligence and AI in Games 2 (4), pp. 271–277.
[27] (2019) Parallel nested rollout policy adaptation. In IEEE Conference on Games.
[28] (2017) Distributed nested rollout policy for SameGame. In Workshop on Computer Games, pp. 108–120.
[29] (2018) An unexpectedly effective Monte Carlo technique for the RNA inverse folding problem. BioRxiv, pp. 345587.
[30] Generating structured test data with specific properties using nested Monte-Carlo search. Genetic and Evolutionary Computation Conference, GECCO ’14, Vancouver, BC, Canada, July 12-16, 2014, pp. 1279–1286.
[31] (2015) Heuristic model checking using a Monte-Carlo tree search algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2015, Madrid, Spain, July 11-15, 2015, pp. 1359–1366.
[32] (2011) Optimization of the Nested Monte-Carlo algorithm on the traveling salesman problem with time windows. In Applications of Evolutionary Computation - EvoApplications 2011: EvoCOMNET, EvoFIN, EvoHOT, EvoMUSART, EvoSTIM, and EvoTRANSLOG, Torino, Italy, April 27-29, 2011, Proceedings, Part II, Lecture Notes in Computer Science, Vol. 6625, pp. 501–510.
[33] (2011) Nested rollout policy adaptation for Monte Carlo Tree Search. In IJCAI, pp. 649–654.