1 Introduction
Retrosynthetic planning is one of the fundamental problems in organic chemistry. Given a target product, the goal of retrosynthesis is to identify a series of reactions that can lead to the synthesis of the product, by searching backwards and iteratively applying chemical transformations to unavailable molecules. As thousands of theoretically possible transformations can be applied at each reaction step, the search space is huge, which makes the problem challenging even for experienced chemists.
One-step retrosynthesis prediction, which predicts a list of possible direct reactants given a product, serves as the foundation for multi-step retrosynthetic planning. Existing methods roughly fall into two categories, template-based and template-free. Each chemical reaction is associated with a reaction template that encodes how atoms and bonds change during the reaction. Given a target product, template-based methods predict the possible reaction templates and subsequently apply the predicted templates to the target molecule to get the corresponding reactants. Existing methods include retrosim (coley2017computer), neuralsym (segler2017neural) and GLN (dai2019retrosynthesis). Though conceptually straightforward, template-based methods need to deal with tens or even hundreds of thousands of possible reaction templates, making the classification task hard. Besides, templates are not always available for chemical reactions. For these reasons, people have also been developing template-free methods that directly predict reactants. Most existing methods employ seq2seq models such as LSTM (liu2017retrosynthetic) or Transformer (karpov2019transformer) from the neural machine translation literature.
While one-step methods are continuously being improved, most molecules in the real world cannot be synthesized within one step. The number of synthesis steps can go up to 60 or even more. Since each molecule could be synthesized from hundreds of different possible reactants, the possible synthesis routes become countless for a single product. Such a huge space poses challenges for efficient searching and planning, even with advanced one-step approaches.
Besides the huge search space, another challenge is the ambiguity in performance measurement and benchmarking. It has been extremely hard to quantitatively analyze the performance of multi-step retrosynthesis algorithms, because the definition of ‘good synthesis routes’ is ambiguous and there are no benchmark datasets for analyzing the designed algorithms. The most common way of quantitative analysis is to employ domain experts and let them judge whether one synthesis route is better than another based solely on their experience, which is both time-consuming and costly.
Due to the aforementioned challenges, fewer works have been proposed in the field of multi-step retrosynthetic planning. Previous works using Monte Carlo Tree Search (MCTS) (segler2018planning; segler2017towards) have achieved superior results over neural or heuristic-based Breadth First Search (BFS). However, MCTS-based methods have several limitations in this setting:

Each tree node corresponds to a set of molecules instead of a single molecule. This additional combinatorial aspect makes the representation of a tree node, and the estimation of its value, even harder. Furthermore, reactions do not explicitly appear as nodes in the tree, which prevents the algorithm from exploiting the structure of subproblems.

As the algorithm depends on online value estimation, the full rollout from vanilla MCTS may not be efficient enough for planning. Furthermore, the algorithm cannot exploit historical data, even though many good retrosynthesis plans may have been found previously and “intuitions” on how to plan efficiently may be learned from these histories.
For quantitative evaluation, they have employed numerous domain experts to conduct A/B tests over routes proposed by their algorithm and other baselines.
In this paper, we present a novel neural-guided tree search method, called Retro* (available at https://github.com/binghongml/retro_star), for chemical retrosynthesis planning. In our method,

We explicitly maintain information about reactions as nodes in an AND-OR tree, where a node of “AND” type corresponds to a reaction, and a node of “OR” type corresponds to a molecule. The tree captures the relations between candidate reactions and reactant molecules, which allows us to exploit the structure of subproblems corresponding to a single molecule.

Based on the AND-OR tree representation, we propose an A*-like planning algorithm which is guided by a neural network learned from past retrosynthesis planning experiences. More specifically, the neural network learns a synthesis cost for each molecule, and it helps the search algorithm pick the most promising molecule node to expand.
Furthermore, we also propose a method for constructing benchmark synthesis route data given reactions and chemical building blocks. Based on this, we construct a synthesis route dataset from the benchmark reaction dataset USPTO. The route dataset is not only useful for quantitative analysis of predicted synthesis routes, but also serves as training data for the neural network components in our method.
Below we summarize our contributions:

We propose a novel learning-based retrosynthetic planning algorithm that learns from previous planning experience. The proposed algorithm outperforms state-of-the-art methods by a large margin on a real-world benchmark dataset.

Our algorithm framework can induce a search algorithm that guarantees the optimal solution.

We propose a method for constructing synthesis route datasets for quantitative analysis of multi-step retrosynthetic planning methods.
Our planning algorithm is general in the sense that it can also be applied to other machine learning problems such as theorem proving (yang2019learning) and hierarchical task planning (erol1996hierarchical). A synthetic task planning experiment is included in Appendix D to demonstrate the idea. Most related works have been mentioned in the first two sections. For more related works, please refer to Appendix E.
2 Background
In this section, we first state the problem we are tackling and its background in Section 2.1. Then in Section 2.2 and Section 2.3 we describe how MCTS and proof number search fit in the problem setting.
2.1 Problem Statement
One-step retrosynthesis: Denote the space of all molecules as $\mathcal{M}$. One-step retrosynthesis takes a target molecule $t \in \mathcal{M}$ as input and predicts a set of source reactants $\mathcal{S} \subset \mathcal{M}$ that can be used to synthesize $t$. This is the reverse problem of reaction outcome prediction. In our paper, we assume the existence of such a one-step retrosynthesis model (or one-step model for simplicity in the rest of the paper) $B$,
$$B(\cdot): t \to \{R_i, \mathcal{S}_i, c(R_i)\}_{i=1}^{k}, \tag{1}$$
which outputs at most $k$ reactions $R_i$, the corresponding reactant sets $\mathcal{S}_i$ and costs $c(R_i)$. The cost can be the actual price of the reaction $R_i$, or simply the negative log-likelihood of this reaction under model $B$. A one-step retrosynthesis model can be learned from a dataset of chemical reactions which have already been discovered by chemists in the past (coley2017computer; segler2017neural; liu2017retrosynthetic; dai2019retrosynthesis; karpov2019transformer). For simplicity, we follow the common practice of ignoring the reagents and other chemical reaction conditions.
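The interface of the one-step model can be illustrated with a minimal sketch. The lookup table `TOY_MODEL`, the molecule and reaction names, and the probabilities are all hypothetical stand-ins for a trained model; the only point is the shape of the output, at most k triples of (reaction, reactant set, cost), with the cost taken here as the negative log-likelihood.

```python
import math
from typing import Dict, List, NamedTuple, Tuple

class Proposal(NamedTuple):
    reaction: str                # reaction identifier (hypothetical)
    reactants: Tuple[str, ...]   # reactant set of the proposed reaction
    cost: float                  # cost, here -log(probability)

# Hypothetical table standing in for a trained one-step model:
# product -> list of (reaction id, reactants, predicted probability).
TOY_MODEL: Dict[str, List[Tuple[str, Tuple[str, ...], float]]] = {
    "t": [("R1", ("a", "b"), 0.7), ("R2", ("c",), 0.2), ("R3", ("d", "e"), 0.1)],
}

def one_step(target: str, k: int = 2) -> List[Proposal]:
    """Return at most k proposals for `target`, most likely first."""
    ranked = sorted(TOY_MODEL.get(target, []), key=lambda x: -x[2])
    return [Proposal(rid, s, -math.log(p)) for rid, s, p in ranked[:k]]
```

Calling `one_step("t", k=2)` keeps the two most likely proposals; a molecule unknown to the toy table simply yields an empty list.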
Retrosynthesis planning: Given a single target molecule $t \in \mathcal{M}$ and an initial set of molecules $\mathcal{I} \subset \mathcal{M}$, we are interested in synthesizing $t$ via a sequence of chemical reactions using reactants that are from $\mathcal{I}$ or can be synthesized from $\mathcal{I}$. In this case, $\mathcal{I}$ corresponds to a set of molecules that are commercially available. The goal of retrosynthesis planning is to predict a sequence of reactions whose starting reactants are in $\mathcal{I}$ and which ultimately arrive at the product $t$.
Instead of performing forward-chaining-like reasoning that starts from $\mathcal{I}$, a more efficient and commonly used method is to perform backward chaining that starts from the molecule $t$ and performs a series of one-step retrosynthesis predictions until all the reactants required are from $\mathcal{I}$. Beyond just finding such a synthesis route, our goal is to find a retrosynthesis plan that is:

High-quality:

The entire retrosynthesis plan should be chemically sound with high probability;

The reactants or chemical reactions required should have as low cost as possible;


Efficient: Due to the synthesis effort, the number of retrosynthesis steps should be limited.
Our proposed Retro* aims at finding the best retrosynthesis plan with respect to the above criteria in limited time. To achieve this, we also assume that the quality of a solution can be measured by the reaction cost, where such cost is known to our model.
2.2 Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) has achieved groundbreaking successes in two-player games, such as Go (silver2016mastering; silver2017mastering). Its variant, UCT (kocsis2006bandit), is especially powerful for balancing exploration and exploitation in the online learning setting, and has been employed in segler2018planning for retrosynthesis planning. Specifically, as illustrated in fig:mctsandorstump, the tree search starts from the target molecule $t$. Each node $u$ in the current search tree $T$ represents a set of molecules $\mathcal{M}_u$. Each child node $v$ of $u$ is obtained by selecting one molecule $m \in \mathcal{M}_u$ and a one-step retrosynthesis reaction $(R, \mathcal{S}, c(R)) \in B(m)$, where the resulting node $v$ contains the molecule set $\mathcal{M}_v = (\mathcal{S} \cup \mathcal{M}_u) \setminus (\{m\} \cup \mathcal{I})$.
Despite its good performance, the MCTS formulation for retrosynthesis planning has several limitations. First, the rollout needed in MCTS makes it time-consuming, and unlike two-player zero-sum games, retrosynthesis planning is essentially a single-player game where the return estimated by random rollouts can be highly inaccurate. Second, since each tree node is a set of molecules instead of a single molecule, the combinatorial nature of this representation brings sparsity into the variance estimation.
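The set-valued node state of this MCTS formulation can be sketched as follows. This is an illustrative reading of the transition described above, not code from segler2018planning; the function names and the example molecules are hypothetical. Applying a reaction to a molecule in the state replaces that molecule by its reactants and drops anything that is already available.

```python
from typing import FrozenSet, Iterable, Set

def child_state(state: FrozenSet[str], m: str,
                reactants: Iterable[str], available: Set[str]) -> FrozenSet[str]:
    """Replace molecule m by its reactants and drop available molecules."""
    if m not in state:
        raise ValueError(f"{m!r} is not in the current state")
    return frozenset((set(state) | set(reactants)) - ({m} | set(available)))

def is_terminal(state: FrozenSet[str]) -> bool:
    """A state with nothing left to synthesize is a solved leaf."""
    return not state
```

Because a state is a set of molecules, two different orders of expanding the same molecules reach the same state through different tree paths, which is one way to see the combinatorial overhead mentioned above.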
2.3 Proof Number Search and Variants
Proof-number search (PNS) (allis1994proof) is a game tree search designed for two-player games with a binary goal. It tries to either prove or disprove the root node as fast as possible. In the retrosynthesis planning scenario, this corresponds to either proving the target molecule by finding a feasible planning path, or concluding that it is not synthesizable.
AND-OR Tree: The search tree of PNS is an AND-OR tree $T$, where each AND node needs all its children to be proved, while an OR node requires at least one to be satisfied. Each node $v$ is associated with a proof number $pn(v)$ that defines the minimum number of leaf nodes to be proved in order to prove $v$. Similarly, the disproof number $dn(v)$ is the minimum number of leaf nodes needed to disprove $v$. With such definitions, we can recursively define these numbers for internal nodes. Specifically, for an AND node $v$ with children $ch(v)$,
$$pn(v) = \sum_{u \in ch(v)} pn(u), \qquad dn(v) = \min_{u \in ch(v)} dn(u), \tag{2}$$
and for an OR node $v$, we have
$$pn(v) = \min_{u \in ch(v)} pn(u), \qquad dn(v) = \sum_{u \in ch(v)} dn(u). \tag{3}$$
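The recursive definition of proof and disproof numbers can be made concrete with a short sketch. The leaf initialization used here (1 and 1 for an unknown leaf, 0 and infinity for a proved leaf, infinity and 0 for a disproved leaf) is the usual PNS convention and is an assumption, since the text above does not spell it out; the `Node` class is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

INF = float("inf")

@dataclass
class Node:
    kind: str                                   # "AND" or "OR"
    children: List["Node"] = field(default_factory=list)
    status: str = "unknown"                     # used for leaves only

def proof_numbers(v: Node) -> Tuple[float, float]:
    """Return (proof number, disproof number) of node v."""
    if not v.children:                          # leaf initialization (assumed)
        if v.status == "proved":
            return (0, INF)
        if v.status == "disproved":
            return (INF, 0)
        return (1, 1)
    pns, dns = zip(*(proof_numbers(u) for u in v.children))
    if v.kind == "AND":                         # all children must be proved
        return (sum(pns), min(dns))
    return (min(pns), sum(dns))                 # OR: one proved child suffices
```

In the retrosynthesis setting, a proved leaf would correspond to an available building block and an unknown leaf to a molecule that has not been expanded yet.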
Representing retrosynthesis planning using an AND-OR tree: As illustrated in fig:mctsandorstump, the application of the one-step retrosynthesis model $B$ on molecule $m$ can be represented using one block of an AND-OR tree (denoted as an AND-OR stump), with the molecule node as an ‘OR’ node and each reaction node as an ‘AND’ node. This is because a molecule can be synthesized using any one of its child reactions (or-relation), and each reaction node requires all of its child molecules (and-relation) to be ready.
The search of PNS starts from the root node every time, and selects the child node with either minimum proof number or minimum disproof number, depending on whether the current node is an OR node or an AND node, respectively. The process ends when a leaf node is reached, which can be either a reaction or a molecule node to be expanded. After one step of retrosynthesis expansion, the $pn$ and $dn$ of all nodes along the path back to the root are updated. The two-player game in this sense comes from the interleaving behavior of selecting proof and disproof numbers, where the first ‘player’ tries to prove the root while the second ‘player’ tries to disprove it. As both players behave optimally when the proof/disproof numbers are accurate, such a perspective brings efficiency in finding a feasible synthesis path or proving that the target is not synthesizable.
Variant: There have been several variants that improve different aspects of PNS, including different traversal strategies and different initialization methods of $pn$ and $dn$ for newly added nodes. The most recent work, DFPN-E (kishimoto2019depth), builds on top of the depth-first variant of PNS with an additive cost in addition to the classical update rule in Eq (3). Specifically, for an unsolved OR node $v$,
$$pn(v) = \min_{u \in ch(v)} \left( h(v, u) + pn(u) \right). \tag{4}$$
Here $h(v, u)$ is a function of the cost of the corresponding one-step retrosynthesis. Together with manually defined thresholds, this method addresses the lopsided problem in retrosynthesis planning, i.e., the imbalance of branching factor between AND and OR nodes.
The variants of PNS have shown promising results over MCTS for retrosynthesis planning. However, the two-player game formulation is designed for the speed of a proof, not necessarily the overall solution quality. Moreover, existing works rely on human experts to design $h$ and the thresholds used during search. This makes it not only time-consuming to tune, but also hard to generalize when solving a new target molecule or dealing with a new one-step model or new reaction data.
3 Retro* Search Algorithm
Our proposed Retro* is a retrosynthetic planning algorithm that works on the AND-OR search tree. It differs significantly from PNS, which is also based on the AND-OR tree, and from MCTS-based methods in the following ways:

Retro* utilizes the AND-OR tree for a single-player game which only utilizes the global value estimation. This is different from PNS, which models the problem as a two-player game with both proof numbers and disproof numbers. The distinction in the objective makes Retro* advantageous in finding the best retrosynthetic routes.

Retro* estimates the future value of frontier nodes with a neural network that can be trained using historical retrosynthesis planning data. This is different from the expensive rollouts used in segler2018planning, or the human-designed heuristics in kishimoto2019depth. This not only enables more accurate prediction during expansion, but also generalizes the knowledge learned from existing planning paths.
3.1 Overview of Retro*
Retro* (alg:main) is a best-first search algorithm, which exploits neural priors to directly optimize for the quality of the solution. The search tree $T$ is an AND-OR tree, with molecule nodes as ‘OR’ nodes and reaction nodes as ‘AND’ nodes. It starts the search tree with a single root molecule node that is the target molecule $t$. At each step, it selects a node $u$ in the frontier of $T$ (denoted as $\mathcal{F}(T)$) according to the value function. Then it expands $u$ with the one-step model $B(u)$ and grows $T$ with one AND-OR stump. Finally the nodes with potential dependency on $u$ will be updated. Below we first provide a big picture of the algorithm by explaining these steps one by one, then we look into the details of the value function design and its update in Section 3.2 and Section 3.3, respectively. fig:algframework summarizes these steps at a high level.
Selection: Given a search tree $T$, we denote the molecule nodes as $\mathcal{V}^m(T)$ and the reaction nodes as $\mathcal{V}^r(T)$, where the total nodes in $T$ will be $\mathcal{V}(T) = \mathcal{V}^m(T) \cup \mathcal{V}^r(T)$. The frontier $\mathcal{F}(T) \subseteq \mathcal{V}^m(T)$ contains all the molecule nodes in $T$ that haven’t been expanded before. Since we want to minimize the total cost of the final solution, an ideal option to expand next would be the molecule node which is part of the best synthesis plan.
Suppose we already have a value function oracle $V_t(m|T)$ which tells us, under the current search tree $T$, the cost of the best plan that contains $m$ for synthesizing target $t$. We can use it to select the next node to expand:
$$m_{next} = \arg\min_{m \in \mathcal{F}(T)} V_t(m|T). \tag{5}$$
A proper design of such $V_t(m|T)$ would not only improve search efficiency, but can also bring theoretical guarantees.
Expansion: After picking the node $m$ with minimum cost estimation $V_t(m|T)$, we expand the search tree with the $k$ one-step retrosynthesis proposals $\{R_i, \mathcal{S}_i, c(R_i)\}_{i=1}^{k}$ from $B(m)$. Specifically, for each proposed retrosynthesis reaction $R_i$, we create a reaction node $R$ under node $m$, and for each molecule $m' \in \mathcal{S}_i$, we create a molecule node under the reaction node $R$. This will create an AND-OR stump under node $m$. Unlike in MCTS (segler2018planning), where multiple calls to $B(\cdot)$ are needed until a terminal state is reached during rollout, here the expansion only requires a single call to the one-step model.
Update: Denote the search tree $T$ after expansion of node $m$ as $T'$. Such expansion obtains the corresponding cost information $c(R)$ for one-step retrosynthesis. We utilize this more direct information to update $V_t(\cdot|T')$ of all other relevant nodes to provide a more accurate estimation of total cost.
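The selection, expansion and update steps can be put together in a compact sketch. It is a simplified illustration under stated assumptions rather than the authors' implementation: the `Mol` and `Rxn` classes, the `plan` function and its arguments are hypothetical; the one-step model is any callable returning (reaction id, reactants, cost) triples; `value` is any estimate of the cost of synthesizing a molecule; the update step is replaced by recomputing all quantities from scratch at each iteration; and the search halts at the first route found.

```python
import math

class Mol:                                   # OR node: a molecule
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children, self.expanded = [], False

class Rxn:                                   # AND node: a reaction with a cost
    def __init__(self, cost, parent):
        self.cost, self.parent, self.children = cost, parent, []

def plan(target, one_step, available, value, max_calls=100):
    """Return the cost of the first route found, or None if none is found."""
    root = Mol(target)

    def rn(n):                               # minimum estimated cost below n
        if isinstance(n, Rxn):
            return n.cost + sum(rn(m) for m in n.children)
        if n.name in available:
            return 0.0
        if not n.expanded:
            return value(n.name)
        return min((rn(r) for r in n.children), default=math.inf)

    def v_t(m):                              # estimated cost of best plan via m
        total, node = rn(m), m
        while node.parent is not None:
            r = node.parent                  # reaction on the path to the root
            total += r.cost + sum(rn(s) for s in r.children if s is not node)
            node = r.parent
        return total

    def solved_cost(n):                      # cost of best solved subtree or inf
        if isinstance(n, Rxn):
            return n.cost + sum(solved_cost(m) for m in n.children)
        if n.name in available:
            return 0.0
        return min((solved_cost(r) for r in n.children), default=math.inf)

    def frontier(n):                         # unexpanded, unavailable molecules
        if isinstance(n, Mol) and not n.expanded:
            return [] if n.name in available else [n]
        return [f for c in n.children for f in frontier(c)]

    for _ in range(max_calls):
        best = solved_cost(root)
        if best < math.inf:
            return best
        candidates = frontier(root)
        if not candidates:
            return None
        m = min(candidates, key=v_t)         # selection
        m.expanded = True                    # expansion: one model call
        for _rid, reactants, cost in one_step(m.name):
            r = Rxn(cost, m)
            m.children.append(r)
            r.children.extend(Mol(s, r) for s in reactants)
    best = solved_cost(root)
    return best if best < math.inf else None
```

Recomputing from scratch keeps the sketch short but is wasteful; the value of caching and incremental updates only shows up on larger trees.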
3.2 Design of $V_t(m|T)$
To properly design $V_t(m|T)$, we borrow the idea from the A* algorithm. A* is a best-first search algorithm which uses the cost from the start together with an estimate of the future cost to select the next move. When such an estimate is admissible, A* is guaranteed to return the optimal solution. Inspired by the A* algorithm, we decompose the value function into two parts:
$$V_t(m|T) = g_t(m|T) + h_t(m|T), \tag{6}$$
where $g_t(m|T)$ is the cost of current reactions that have happened in $T$, if $m$ should be in the final route, and $h_t(m|T)$ is the estimated cost for future reactions needed to complete such planning. Instead of explicitly calculating these two separately, we show an equivalent but simpler way to calculate $V_t(m|T)$ directly.
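Specifically, we first define $V_m(m|\emptyset)$, which is a boundary case of the value function oracle $V$ that simply tells how much cost is needed to synthesize molecule $m$. For simplicity of notation, we denote it as $V_m$. Then we define the reaction number function $rn(\cdot|T)$, which is inspired by the proof number but serves a different purpose:
$$rn(R|T) = c(R) + \sum_{m \in ch(R)} rn(m|T), \qquad rn(m|T) = \begin{cases} V_m, & m \in \mathcal{F}(T) \\ \min_{R \in ch(m)} rn(R|T), & \text{otherwise} \end{cases} \tag{7}$$
where $rn(R|T)$ and $rn(m|T)$ are computed for reaction nodes and molecule nodes, respectively. The reaction number tells the minimum estimated cost needed for a molecule or reaction to happen in the current tree. We further define $pr(u|T)$ to get the parent node of $u$, and let $\mathcal{A}(u|T)$ be all the ancestors of node $u$. Note that $pr(m|T) \in \mathcal{V}^r(T)$ for a non-root molecule node $m$, and vice versa. Then the function $V_t(m|T)$ will be:
$$V_t(m|T) = \sum_{r \in \mathcal{A}(m|T) \cap \mathcal{V}^r(T)} c(r) + \sum_{\substack{m' \in \mathcal{V}^m(T) \setminus \mathcal{A}(m|T) \\ pr(m'|T) \in \mathcal{A}(m|T)}} rn(m'|T). \tag{8}$$
The first summation calculates all the reaction costs that have happened along the path from node $m$ to the root. Additionally, $\forall r \in \mathcal{A}(m|T) \cap \mathcal{V}^r(T)$, the child nodes $m' \in ch(r)$ that are not themselves on the path should also be synthesized, as each such reaction node is an AND node. This requirement is captured in the second summation of Eq (8). We can see that $g_t(m|T)$ implicitly sums up the cost associated with the reaction nodes in this route related to $m$, and $h_t(m|T)$ takes all the terms related to $V_{m'}$, $m' \in \mathcal{F}(T)$, in Eq (7).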
In fig:algframework we demonstrate the calculation of $V_t(m|T)$ with a simple example. Notice that we can compute the parts that are relevant to $g_t(m|T)$ with existing information, but we can only estimate the part of $h_t(m|T)$ since the required reactions are not in the search tree yet. We will show how to learn this future estimation in Section 4.
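As a small worked instance of the reaction number and the value function, consider a hypothetical tree (made up for illustration, not the one in the figure). The target $t$ has two candidate reactions $P$ and $Q$; $P$ needs molecules $a$ and $b$, $Q$ needs $c$; $a$ has been expanded with a single reaction $R$ that needs $d$ and $e$. The frontier is $\{b, c, d, e\}$. Writing $c(\cdot)$ for a reaction cost, $V_x$ for the estimated cost of synthesizing a frontier molecule $x$, $rn$ for the reaction number and $V_t(x|T)$ for the value of the best plan through $x$:

```latex
% Hypothetical tree: t -> {P, Q};  P -> {a, b};  Q -> {c};  a -> {R};  R -> {d, e}.
\begin{align*}
rn(R|T) &= c(R) + V_d + V_e, \\
rn(a|T) &= rn(R|T), \\
rn(P|T) &= c(P) + rn(a|T) + V_b, \\
rn(Q|T) &= c(Q) + V_c, \\
rn(t|T) &= \min\{\, rn(P|T),\; rn(Q|T) \,\}, \\
V_t(d|T) &= \underbrace{c(P) + c(R)}_{g_t(d|T)}
          + \underbrace{V_b + V_d + V_e}_{h_t(d|T)}, \\
V_t(c|T) &= \underbrace{c(Q)}_{g_t(c|T)} + \underbrace{V_c}_{h_t(c|T)} .
\end{align*}
```

In this example $V_t(b|T) = V_t(d|T) = V_t(e|T)$, since all three frontier molecules belong to the same candidate plan through $P$, so the selection step in effect chooses between the plan through $P$ and the plan through $Q$.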
3.3 Updating
After a node $m$ is expanded, several components need to be updated to maintain the state of the search tree.
Update $rn(\cdot|T)$: Following Eq (7), the reaction number of each newly created molecule node $u$ under the subtree rooted at $m$ will be $V_u$, and each new reaction node will have its cost $c(R)$ added to the sum of the reaction numbers of its children. After that, all the ancestor nodes $\mathcal{A}(m|T)$ may have their reaction numbers updated following Eq (7), so in the worst case the cost of this process grows with the depth of $m$ in the tree. In our implementation, however, we update these nodes in a bottom-up fashion that starts from $m$, and stop as soon as the value of an ancestor node doesn’t change. This speeds up the update.
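Update $V_t(\cdot|T)$: Let $\mathcal{U}$ be the set of molecule nodes whose reaction numbers were updated in the stage above, and let $T'$ denote the tree after the expansion. From Eq (8) we can see that, for any molecule node $u \in \mathcal{F}(T')$, $V_t(u|T')$ needs to be recalculated if some node in $\mathcal{U}$ is a child of a reaction node in $\mathcal{A}(u|T')$.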
Remark: The expansion of a node can potentially affect all other nodes in $\mathcal{F}(T)$ in the worst case. However, the expansion of a single molecule node only affects another node in the frontier when it is on the current best synthesis solution. In the actual implementation, we use efficient caching and a lazy propagation mechanism, which guarantees that $V_t(m|T)$ is updated only when necessary. The implementation details of both updates above can be found in Appendix A.
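The bottom-up refresh of cached reaction numbers with early stopping, as described above, can be sketched as follows. The `Node` class and function names are hypothetical, and the sketch covers only the reaction-number part of the update, not the lazy propagation of the value function.

```python
import math

class Node:
    def __init__(self, kind, cost=0.0, value=0.0, parent=None):
        self.kind = kind              # "mol" (OR node) or "rxn" (AND node)
        self.cost = cost              # reaction cost; unused for molecules
        self.rn = value               # cached reaction number
        self.parent, self.children = parent, []
        if parent is not None:
            parent.children.append(self)

def recompute(n):
    """Reaction number of n from the cached values of its children."""
    if n.kind == "rxn":
        return n.cost + sum(c.rn for c in n.children)
    return min((c.rn for c in n.children), default=math.inf)

def propagate_up(expanded):
    """Refresh cached values from a just-expanded molecule node upwards.

    Returns the number of nodes visited; stops at the first unchanged node.
    """
    visited, node = 0, expanded
    while node is not None:
        visited += 1
        new = recompute(node)
        if new == node.rn:
            break                     # ancestors cannot change either
        node.rn = new
        node = node.parent
    return visited
```

When the expansion confirms the previous estimate of the expanded molecule, the loop stops after a single visit, which is the common cheap case this scheme is meant to exploit.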
3.4 Guarantees on Finding the Optimal Solution
Theorem 1. Assuming $V_m$ or its lower bound is known for all encountered molecules $m$, alg:main is guaranteed to return an optimal solution if the halting condition is changed to “the total cost of a found route is no larger than $\min_{m \in \mathcal{F}(T)} V_t(m|T)$”.
The proof can be found in Appendix B.
Remark 1: If we define the cost of a reaction to be its negative log-likelihood, then $0$ is a lower bound of $V_m$ for any molecule $m$. The induced algorithm is guaranteed to find the optimal solution.
Remark 2: In practice, due to the limited time budget, we prefer the algorithm to return once a solution is found.
3.5 Extension: Retro* on Graph Search Space
We have mainly illustrated the technique on a tree-structured space. As retrosynthesis planning is essentially performed on a directed graph (e.g., certain intermediate molecules may share the same reactants, which may further reduce the actual cost), the above calculation can be extended to the general bipartite graph with edges connecting $\mathcal{V}^m$ and $\mathcal{V}^r$. Due to the potential existence of loops, the calculation of Eq (7) will be performed using a shortest path algorithm instead. As there are no negative loops, the shortest path algorithm will still converge. By viewing the search space as a tree rather than a graph, we may find a suboptimal solution due to the repetition in state representation. However, as loopy synthesis is rare in the real world, we mainly focus on tree-structured search in this paper, and will investigate the extension to bipartite graph space search in future work.
4 Estimating $V_m$ from Planning Solutions
Retro* requires the value function oracle $V_m$ to compute $V_t(m|T)$ for expansion node selection. However, in practice it is impossible to obtain the exact value of $V_m$ for every molecule $m$. Therefore we try to estimate it from previous planning data.
4.1 Representation of $V_m$
To parameterize $V_m$ for any molecule $m$, we first compute its Morgan fingerprint (rogers2010extended) with a fixed radius and number of bits, and feed it into a single-layer fully connected neural network with a fixed hidden dimension, which then outputs a scalar representing $V_m$.
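A forward pass of such a value network can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the fingerprint is taken as given, as a list of indices of set bits (computing it would need a cheminformatics toolkit); the sizes, the random initialization and the ReLU hidden activation are arbitrary; and the softplus on the output is an illustrative choice to keep the estimate positive, not something stated in the text.

```python
import math
import random

def init_mlp(n_bits, hidden, seed=0):
    """Randomly initialize a one-hidden-layer network (illustrative sizes)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(n_bits)]
    b1 = [0.0] * hidden
    w2 = [rng.gauss(0, 0.1) for _ in range(hidden)]
    return w1, b1, w2, 0.0

def value(on_bits, params):
    """Estimate the synthesis cost from the indices of set fingerprint bits."""
    w1, b1, w2, b2 = params
    # A binary input means the first layer is a sum of selected weight rows.
    hidden = [max(0.0, b + sum(w1[i][j] for i in on_bits))
              for j, b in enumerate(b1)]
    out = b2 + sum(h * w for h, w in zip(hidden, w2))
    return math.log1p(math.exp(out))      # softplus (assumed output activation)
```

For example, `value([1, 5, 40], init_mlp(64, 8))` returns a positive scalar for a toy 64-bit fingerprint with three bits set.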
4.2 Offline Learning of $V_m$
Previous work has either used random rollouts or human-designed heuristics for estimating $V_m$, which may not be accurate enough to guide the search. Instead of learning it online during planning (silver2017mastering), we utilize the existing reactions in the training set to train it.
Specifically, we construct retrosynthesis routes for feasible molecules in the training reactions, where the available set of molecules $\mathcal{I}$ is also given beforehand. The specific construction strategy will be covered in Section 5.1.2. The resulting dataset will be $\mathcal{R}_{train} = \{(t_i, r_{t_i}, B(t_i), R_{t_i})\}$, where each tuple contains the target molecule $t_i$, the best entire route cost $r_{t_i}$, and the one-step retrosynthesis candidates $B(t_i)$, which also contain the true one-step retrosynthesis $R_{t_i}$ used in the planning solution.
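The learning of $V_m$ consists of two parts, namely the value fitting, which is a regression loss $\mathcal{L}_{reg}$, and the consistency learning, which maintains the partial order relationship between the best one-step solution $R_t$ and other solutions $(R_j, \mathcal{S}_j, c(R_j)) \in B(t)$ through a loss $\mathcal{L}_{con}$:
$$\mathcal{L}_{reg}(t) = (V_t - r_t)^2, \qquad \mathcal{L}_{con}(t, R_j) = \max\Big\{0,\; r_t + \epsilon - c(R_j) - \sum_{m' \in \mathcal{S}_j} V_{m'}\Big\}, \tag{9}$$
where $V_t$ denotes the estimate $V_m$ at $m = t$, and $\epsilon$ is a positive constant margin to ensure $R_t$ has higher priority for expansion than its alternatives even if the value estimates have tolerable noise. The overall objective is:
$$\min_{V} \; \mathbb{E}_{(t, r_t, B(t), R_t) \sim \mathcal{R}_{train}} \Big[ \mathcal{L}_{reg}(t) + \lambda\, \mathbb{E}_{R_j \sim B(t) \setminus \{R_t\}} \big[ \mathcal{L}_{con}(t, R_j) \big] \Big], \tag{10}$$
where $\lambda$ balances these two losses. In experiments we set it to be 1 by default.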
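The two loss terms can be written out for a single training tuple as a rough sketch. The exact functional forms are an interpretation of the description above (a squared error for value fitting and a hinge with a positive margin for consistency), the inner expectation over alternatives is taken as a plain average, and all function and argument names are hypothetical.

```python
from typing import List, Sequence, Tuple

def regression_loss(v_pred: float, r_true: float) -> float:
    """Squared error between the predicted value and the best route cost."""
    return (v_pred - r_true) ** 2

def consistency_loss(r_true: float, cost_j: float,
                     reactant_values: Sequence[float], margin: float) -> float:
    """Hinge penalty when an alternative reaction looks cheaper than it should.

    The alternative's estimated total (its cost plus the values of its
    reactants) should exceed the best route cost by at least `margin`.
    """
    return max(0.0, r_true + margin - cost_j - sum(reactant_values))

def total_loss(v_pred: float, r_true: float,
               alternatives: List[Tuple[float, Sequence[float]]],
               margin: float, lam: float = 1.0) -> float:
    """Regression loss plus lam times the mean consistency loss."""
    reg = regression_loss(v_pred, r_true)
    if not alternatives:
        return reg
    cons = [consistency_loss(r_true, c, vals, margin) for c, vals in alternatives]
    return reg + lam * sum(cons) / len(cons)
```

An alternative whose estimated total already exceeds the best route cost by the margin contributes nothing, so the term only acts on candidates that the current value estimates would wrongly prefer.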
5 Experiments
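| Algorithm | Retro* | Retro*-0 | DFPN-E+ | DFPN-E | MCTS+ | MCTS | Greedy DFS |
|---|---|---|---|---|---|---|---|
| Success rate | 86.84% | 79.47% | 53.68% | 55.26% | 35.79% | 33.68% | 22.63% |
| Time | 156.58 | 208.58 | 289.42 | 279.67 | 365.21 | 370.51 | 388.15 |
| Shorter routes | 50 | 52 | 59 | 59 | 18 | 14 | 11 |
| Better routes | 112 | 102 | 22 | 25 | 46 | 41 | 26 |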
5.1 Creating Benchmark Dataset
5.1.1 USPTO Reaction Dataset
We use the publicly available reaction dataset extracted from the United States Patent Office (USPTO) to train the one-step model and extract synthesis routes. The whole dataset consists of chemical reactions published up to September 2016. For reactions with multiple products, we duplicate them into multiple ones with one product each. After removing duplicates and reactions with wrong atom mappings, we further extract reaction templates with RDChiral (https://github.com/connorcoley/rdchiral) for all reactions and discard those whose reactants cannot be obtained by applying reaction templates to their products. The remaining reactions are further split randomly into train/val/test sets following fixed proportions.
With the reaction data, we train a template-based MLP model (segler2017neural) for one-step retrosynthesis. Following the literature, we formulate one-step retrosynthesis as a multi-class classification problem, where given a molecule as product, the goal is to predict possible reaction templates. Reactants are obtained by applying the predicted templates to the product molecule. There are in total distinct templates. Throughout all experiments, we take the top templates predicted by the MLP model and apply them on each product to get the corresponding reactant lists.
5.1.2 Extracting Synthesis Routes
To train our value function and quantitatively analyze the predicted routes, we construct synthesis routes based on the USPTO reaction dataset and a list of commercially available building blocks from eMolecules (http://downloads.emolecules.com/free/20191101/). eMolecules consists of commercially available molecules that can serve as ending points for our search algorithm.
Given the list of building blocks, we take each molecule that has appeared in the USPTO reaction data and analyze whether it can be synthesized by existing reactions within the USPTO training data. For each synthesizable molecule, we choose the shortest possible synthesis routes whose ending points are available building blocks in eMolecules.
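One way to carry out this kind of extraction is a fixed-point relaxation over the known reactions, sketched below. This is not the authors' procedure: the function name and data layout are hypothetical, and the route length is assumed to be the total number of reactions in the route tree, with a shared intermediate counted once per use.

```python
import math
from typing import Dict, Iterable, List, Set, Tuple

Reaction = Tuple[str, Tuple[str, ...]]       # (product, reactants)

def shortest_route_lengths(reactions: Iterable[Reaction],
                           building_blocks: Set[str]) -> Dict[str, int]:
    """Map each synthesizable molecule to the length of its shortest route."""
    reactions = list(reactions)
    length: Dict[str, int] = {m: 0 for m in building_blocks}
    changed = True
    while changed:                           # relax until nothing improves
        changed = False
        for product, reactants in reactions:
            if product in building_blocks:
                continue
            if all(r in length for r in reactants):
                cand = 1 + sum(length[r] for r in reactants)
                if cand < length.get(product, math.inf):
                    length[product] = cand
                    changed = True
    return length
```

Molecules that never receive an entry cannot be reached from the building blocks with the given reactions, which corresponds to the non-synthesizable case in the text.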
We obtain the validation and test route datasets with a slightly different process. For the validation dataset, we first combine the train and validation reaction datasets, and then repeat the aforementioned extraction procedure on the combined dataset. Since we extract routes with more reactions, the synthesizable molecules will include those that could not be synthesized with the original reactions and those that now have shorter routes. We exclude molecules with routes of the same length as in the training data, and pack the remaining ones as the validation route dataset. We apply a similar procedure to the test data but make sure that there is no overlap between the test and training/validation sets.
We further clean the test route dataset by only keeping the routes whose reactions are all covered by the top predictions of the one-step model. To make the test set more challenging, we filter out the easier molecules by running a heuristic-based BFS planning algorithm and discarding the molecules solved within a fixed time limit. After processing, we obtain training routes, validation routes, test routes and the corresponding target molecules.
5.2 Results
We compare Retro* against DFPN-E (kishimoto2019depth), MCTS (segler2018planning) and greedy Depth First Search (DFS) on product molecules in the test route dataset described in Section 5.1.2. Greedy DFS always prioritizes the reaction with the highest likelihood. MCTS is implemented with PUCT, where we use the reaction probability provided by the one-step model as the prior to bias the search.
We measure both route quality and planning efficiency to evaluate the algorithms. To measure the quality of a solution route, we compare its total cost as well as its length, i.e., the number of reactions in the route. The cost function is defined as the negative log-likelihood of the reaction. Therefore, minimizing the total cost is equivalent to maximizing the likelihood of the route. To measure planning efficiency, we use the number of calls to the one-step model as a surrogate for time, since these calls account for most of the running time, and compare the success rate under the same time limit.
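The equivalence between minimizing total cost and maximizing route likelihood follows from the sum of negative logarithms being the negative logarithm of the product. A minimal numerical check, assuming the route likelihood is the product of the per-reaction likelihoods and using made-up probabilities:

```python
import math
from typing import Sequence

def route_cost(likelihoods: Sequence[float]) -> float:
    """Total cost of a route: sum of negative log-likelihoods."""
    return sum(-math.log(p) for p in likelihoods)

def route_likelihood(likelihoods: Sequence[float]) -> float:
    """Likelihood of a route, assumed to be the product over its reactions."""
    return math.prod(likelihoods)
```

Since `exp(-route_cost(ps))` equals `route_likelihood(ps)` up to rounding, ranking routes by increasing cost is the same as ranking them by decreasing likelihood. A longer route of likely reactions can therefore still be cheaper than a short route with one unlikely reaction.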
Performance summary: The performances of all algorithms are summarized in Table 1. Under the same limit on one-step calls, Retro* solves 86.84% of the test molecules, compared with 55.26% for the second best method, DFPN-E. Among all the solutions given by Retro*, 50 of them are shorter than expert routes, and 112 of them are better in terms of the total costs. We also conduct an ablation study to understand the importance of the learning component in Retro* by evaluating its non-learning version Retro*-0. Retro*-0 is obtained from Retro* by setting $V_m$ to $0$, which is a lower bound of any valid values. Compared with the baseline methods, Retro*-0 also shows promising results. However, it is outperformed by Retro* by 7.37 percentage points in terms of success rate (86.84% versus 79.47%), demonstrating the performance gain brought by learning from previous planning experience.
To find out whether MCTS and DFPN-E can benefit from the learned value function oracle in Retro*, we replace the reward estimation by rollout in MCTS and the proof number initialization in DFPN-E by the same $V_m$, calling the strengthened algorithms MCTS+ and DFPN-E+. The value function helps MCTS as expected, because it provides a value estimate with less variance than rollout. The performance of DFPN-E is not improved because we don’t have a good initialization of the disproof number.
Influence of time limit: To show the influence of the time limit on performance, we plot the success rate against the number of one-step model calls in fig:succrate. We can see that Retro* not only outperforms the baseline algorithms by a large margin at the beginning, but also improves faster than the baselines, enlarging the performance gap as the time limit increases.
Solution quality: To evaluate the overall solution quality, for each test molecule, we collect solutions from all algorithms, and compare the route lengths and costs (see fig:routequalityleft). We only keep the best routes (could be multiple) for each test molecule, and count the number of best routes in total for each method. We find that in terms of total costs, Retro* produces more best routes than the second best method. Even for the length metric, which is not the objective Retro* is optimizing for, it still achieves about the same performance as the best method.
As a demonstration of Retro*’s ability to find high-quality routes, we illustrate a sample solution in fig:routequalitymid, where each node represents a molecule. The target molecule corresponds to the root node, and the building blocks are in yellow. The numbers on the edges indicate the likelihoods of successfully producing the corresponding reactions in the real world. The expert route shares exactly the same first reaction and the same right branch with the route found by our algorithm. However, the left branch of the expert route (fig:routequalityright) is much longer and less probable than the corresponding part of the solution route, shown in the dotted box region in fig:routequalitymid. Please refer to Appendix C for more sample solution routes and search tree visualizations.
6 Conclusion
In this work, we propose Retro*, a learning-based retrosynthetic planning algorithm for efficiently finding high-quality routes. Retro* is able to utilize previous planning experience to bias the search on unseen molecules towards promising directions. We also propose a systematic approach for creating a retrosynthesis dataset from publicly available reaction datasets, together with novel metrics for evaluating solution routes without involving human experts. Experiments on a real-world benchmark dataset demonstrate our algorithm’s significant improvement over existing methods in both planning efficiency and solution quality.
Acknowledgements
We thank Junhong Liu, Wei Yang and Yong Liu for helpful discussions. This work is supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, CAREER IIS-1350983, CNS-1704701, and an ONR MURI grant to L.S.
References
Appendix A Implementation details
In this section we describe the algorithm details in the update phase of Retro*. The goal of the update phase is to compute the up-to-date $V_t(m|T)$ for every molecule node $m \in \mathcal{F}(T)$. To implement an efficient update, we need to cache $V_t(m|T)$ for all $m \in \mathcal{V}^m(T)$. Note that from Eq (8), we can observe that sibling molecule nodes have the same $V_t(m|T)$, i.e., $V_t(m_a|T) = V_t(m_b|T)$ if $pr(m_a|T) = pr(m_b|T)$. Therefore, instead of storing the value of $V_t(m|T)$ in every molecule node $m$, we store the value in their common parent by defining $V_t(R|T) = V_t(m|T)$ if $R = pr(m|T)$, for every reaction node $R \in \mathcal{V}^r(T)$.
In our implementation, we cache $V_t(R|T)$ for all reaction nodes $R \in \mathcal{V}^r(T)$ and cache $rn(v|T)$ for all nodes $v \in \mathcal{V}(T)$. Caching values in this way allows us to visit each related node only once for a minimal update.
The update function is summarized in alg:update and illustrated in fig:update. It takes in the expanded node and the expansion result , and performs updates to the affected nodes. We first compute the values for new reactions according to Eq (7) and (8) in lines ln:start to ln:newvalues. Then we update the ancestor nodes of in a bottom-up fashion in lines ln:bottomupstart to ln:end. We also update the molecule nodes in the sibling subtrees in line ln:sib and alg:updatesibling.
Our implementation visits a node only when necessary. When updating along the ancestor path, it stops as soon as the influence of the expansion vanishes (line ln:stopcriteria). When updating a single node, we use a delta update that leverages the relations derived from Eq (7) and (8), avoiding a direct computation which may require or summations.
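The early stopping and the delta update can be illustrated on a toy AND-OR tree. The sketch below is an illustration under simplifying assumptions and is not alg:update itself: it caches a single cost value per node, assumes that a reaction node's value is its own cost plus the sum of its children's values and that a molecule node's value is the minimum over its child reactions, and omits the sibling-subtree updates. The names `Node` and `propagate` are hypothetical and introduced only for this example.

```python
class Node:
    """A node of a toy AND-OR tree (hypothetical helper, not from the paper)."""

    def __init__(self, value, parent=None):
        self.value = value      # cached cost estimate for this node
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def propagate(molecule, new_value):
    """Push a changed molecule value up the tree.

    Returns the number of ancestor nodes whose cached value was modified.
    Exact float comparison is acceptable here because the toy values are
    exactly representable; real code may prefer a tolerance.
    """
    delta = new_value - molecule.value
    molecule.value = new_value
    modified = 0
    node = molecule
    while delta != 0 and node.parent is not None:
        reaction = node.parent
        reaction.value += delta          # delta update: siblings are not re-summed
        modified += 1
        parent_molecule = reaction.parent  # assumed: every reaction hangs under a molecule
        best = min(r.value for r in parent_molecule.children)
        delta = best - parent_molecule.value  # zero when the minimum did not change
        if delta != 0:
            parent_molecule.value = best
            modified += 1
        node = parent_molecule
    return modified


# Toy tree: the root molecule has two candidate reactions.
root = Node(3.0)
r1 = Node(6.0, root)   # reaction cost 1 + children 2 + 3
m1 = Node(2.0, r1)
m2 = Node(3.0, r1)
r2 = Node(3.0, root)   # reaction cost 2 + child 1
m3 = Node(1.0, r2)

steps_first = propagate(m1, 5.0)   # r1 gets worse; root keeps using r2, so we stop early
steps_second = propagate(m3, 0.5)  # r2 improves; the change reaches the root
```

In the first call only one ancestor is modified, because the root's minimum is still attained by the other reaction; this is the situation in which the influence of an expansion vanishes.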
Appendix B Guarantees on finding the optimal solution
Since Retro* is a variant of the A* algorithm, we can leverage existing results to prove the theoretical guarantees for Retro*. In this section, we first state the assumptions we make, and then prove the admissibility (thm:admissibility) of Retro*.
The theoretical results in this paper build upon the assumption that we can access , which is a lower bound for for all molecules . Note that this is a weak assumption, since we know is a universal lower bound for .
As we describe in Eq (6), can be decomposed into and , where is the exact cost of the partial route through which is already in the tree, and is the future cost of the frontier nodes in the route, which is a summation of a series of s. In practice we use in the summation, and arrive at , which is a lower bound of , as stated in the following lemma.
Lemma. Assuming or its lower bound is known for all encountered molecules , the approximated future cost in Retro* is a lower bound of the true .
We restate the admissibility result (thm:admissibility) from the main text and prove it with existing results in the A* literature.
Theorem (Admissibility). Assuming or its lower bound is known for all encountered molecules , alg:main is guaranteed to return an optimal solution if the halting condition is changed to “the total cost of a found route is no larger than ”.
Proof. Combine lm:admissibility and Theorem 1 in the original A* paper (hart1968formal).
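To illustrate why the lower-bound assumption matters, the following sketch runs textbook A* on an ordinary three-node graph rather than on the AND-OR tree used by Retro*. It is a toy illustration of the cited A* property, not an implementation of alg:main; the graph, the heuristics and the function name `a_star` are assumptions made for this example.

```python
import heapq


def a_star(graph, h, start, goal):
    """Textbook A*: return the cost of the first goal node popped from the queue."""
    frontier = [(h(start), 0.0, start)]   # entries are (f = g + h, g, node)
    best_g = {start: 0.0}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        if g > best_g.get(node, float("inf")):
            continue                      # stale queue entry
        for nxt, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt))
    return None


# Two routes from S to G: S -> A -> G costs 2, the direct edge S -> G costs 3.
graph = {"S": [("A", 1.0), ("G", 3.0)], "A": [("G", 1.0)]}


def lower_bound(node):
    return 0.0                            # never overestimates the remaining cost


def overestimate(node):
    return 5.0 if node == "A" else 0.0    # claims 5 although the true remaining cost is 1


cost_admissible = a_star(graph, lower_bound, "S", "G")
cost_inadmissible = a_star(graph, overestimate, "S", "G")
```

With the lower-bound heuristic the search returns the optimal cost 2, while the overestimating heuristic makes the cheaper route look unattractive and the search halts with cost 3.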
Appendix C Sample search trees and solution routes
In this section, we present two examples of the solution routes and the corresponding search trees for target molecules and produced by Retro*.
The solution route for target molecule is illustrated in the top/bottom subfigure of fig:route_ex12, where a set of edges pointing from the same product molecule to reactant molecules represents a one-step chemical reaction. Molecules on the leaf nodes are all available.
The search trees for molecule and are illustrated in fig:search_tree_ex1 and fig:search_tree_ex2. We use rectangular boxes to represent molecules. Yellow/grey/blue boxes indicate available/unexpanded/solved molecules. Rectangular arrows are used to represent reactions. The numbers on the edges pointing from a molecule to a reaction are the probabilities produced by the one-step model. Due to space limits, we only present the minimal tree which leads to a solution.
Appendix D Retro* for hierarchical task planning
As a general planning algorithm, Retro* can be applied to other machine learning problems as well, including theorem proving (yang2019learning) and hierarchical task planning (HTP) (erol1996hierarchical). Below, we conduct a synthetic experiment on HTP to demonstrate the idea. In the experiment, we search for a plan to complete a target task. Each task (OR node) can be completed with different methods, and each method (AND node) requires a sequence of subtasks to be completed. Furthermore, each method is associated with a nonnegative cost. The goal is to find a plan with minimum total cost to realize the target task by decomposing it recursively until all the leaf task nodes represent primitive tasks that we know how to execute directly. As an example, to travel from home in city to hotel in city , we can take either a flight, a train or a ship, each with its own cost. For each method, we have subtasks such as home airport , flight(), and airport hotel. These subtasks can be further realized by several methods.
As usual, we want to find a plan with small cost in limited time, which is measured by the number of expansions of task nodes. We use the optimal halting condition as stated in theorem 3.4. We compare our algorithms against DFPNE, the best-performing baseline. The results are summarized in tbl:htnsucc and Table 3.
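The task structure described above can be sketched as a small AND-OR decomposition. The code below is a hedged illustration only: it finds the minimum-cost plan by exhaustive recursion rather than by Retro*, it assumes the decomposition is acyclic and that primitive tasks cost nothing beyond the cost of the method that uses them, and all task names and costs are invented for the example.

```python
import math

# Each task (OR node) maps to a list of methods (AND nodes).
# A method is a pair (method cost, list of subtasks). All values are made up.
methods = {
    "travel": [
        (200.0, ["home_to_airport", "airport_to_hotel"]),   # flight
        (120.0, ["home_to_station", "station_to_hotel"]),   # train
    ],
    "home_to_airport": [(40.0, ["ride_taxi"]), (5.0, ["ride_bus"])],
    "airport_to_hotel": [(30.0, ["ride_taxi"])],
    "home_to_station": [(10.0, ["walk"])],
    "station_to_hotel": [(15.0, ["ride_bus"])],
}
primitive = {"ride_taxi", "ride_bus", "walk"}


def min_plan_cost(task):
    """Minimum total cost to realize `task`; math.inf if it cannot be realized."""
    if task in primitive:
        return 0.0                       # primitive tasks can be executed directly
    best = math.inf
    for cost, subtasks in methods.get(task, []):
        # AND node: every subtask of the chosen method must be completed.
        total = cost + sum(min_plan_cost(s) for s in subtasks)
        best = min(best, total)          # OR node: keep the cheapest method
    return best


best_cost = min_plan_cost("travel")
```

Here the flight plan costs 200 + 5 + 30 = 235 and the train plan costs 120 + 10 + 15 = 145, so the cheapest plan uses the train. Exhaustive recursion is only feasible for tiny instances, which is why a guided search is needed under an expansion budget.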
Time Limit   15     20     25     30     35
--------------------------------------------
Retro*       .67    .91    .96    .98    1.
Retro*0      .50    .86    .95    .98    .99
DFPNE        .02    .33    .74    .93    .97

Algorithm    Retro*   Retro*0   DFPNE
-------------------------------------
Avg. AR      1        1         1.5
Max. AR      1        1         3.9
As we can see, in terms of success rate, Retro* is slightly better than Retro*0, and both of them are significantly better than DFPNE. In terms of solution quality, we compute the approximation ratio (AR = solution cost / ground truth best solution cost) for every solution. Both Retro* and Retro*0 attain an average and maximum approximation ratio of 1, which verifies the theoretical guarantee in theorem 3.4 on finding the best solution, whereas DFPNE reaches 1.5 on average and 3.9 in the worst case.
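For concreteness, the average and maximum approximation ratios can be computed as in the following sketch. The cost values are placeholders chosen for illustration and are not the experimental data; the function name `approximation_ratios` is likewise an assumption.

```python
def approximation_ratios(solution_costs, optimal_costs):
    """Per-instance ratio of the found solution cost to the best possible cost."""
    return [found / best for found, best in zip(solution_costs, optimal_costs)]


# Placeholder costs for three solved instances (not from the paper).
found_costs = [10.0, 12.0, 30.0]
best_costs = [10.0, 8.0, 10.0]

ratios = approximation_ratios(found_costs, best_costs)
avg_ar = sum(ratios) / len(ratios)
max_ar = max(ratios)
```

A ratio of 1 means the planner found an optimal solution for that instance, so an algorithm with the optimality guarantee should report both an average and a maximum of 1.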
Appendix E Related Works
Reinforcement learning algorithms (without planning) have also been considered for the retrosynthesis problem. schreck2019learning leverages self-play experience to fit a value function and uses policy iteration to learn an expansion policy. It is possible to combine it with a planning algorithm to achieve better performance in practice.
Learning to search from previous planning experiences has been well studied and applied to Go (silver2016mastering; silver2017mastering), Sokoban (guez2018learning) and path planning (chen2020learning). Existing methods cannot be directly applied to the retrosynthesis problem since the search space is more complicated, and the traditional representation where a node corresponds to a state is highly inefficient, as we mentioned in the discussion on MCTS in previous sections.