Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search

by Binghong Chen, et al.

Retrosynthetic planning is a critical task in organic chemistry: it identifies a series of reactions that can lead to the synthesis of a target product. The vast number of possible chemical transformations makes the search space enormous, and retrosynthetic planning is challenging even for experienced chemists. Existing methods, however, either require expensive, high-variance return estimation by rollout, or optimize for search speed rather than solution quality. In this paper, we propose Retro*, a neural-based A*-like algorithm that finds high-quality synthetic routes efficiently. It maintains the search as an AND-OR tree and learns a neural search bias from off-policy data. Guided by this neural network, it then performs best-first search efficiently during new planning episodes. Experiments on benchmark USPTO datasets show that our proposed method outperforms the existing state of the art in both success rate and solution quality, while also being more efficient.



1 Introduction

Retrosynthetic planning is one of the fundamental problems in organic chemistry. Given a target product, the goal of retrosynthesis is to identify a series of reactions that can lead to the synthesis of the product, by searching backwards and iteratively applying chemical transformations to unavailable molecules. Because thousands of theoretically possible transformations can be applied at each step, the search space is huge, which makes the problem challenging even for experienced chemists.

One-step retrosynthesis prediction, which predicts a list of possible direct reactants for a given product, is the foundation of multistep retrosynthetic planning. Existing methods roughly fall into two categories: template-based and template-free. Each chemical reaction is associated with a reaction template that encodes how atoms and bonds change during the reaction. Given a target product, template-based methods predict the possible reaction templates and then apply them to the target molecule to obtain the corresponding reactants. Existing methods include retrosim (coley2017computer), neuralsym (segler2017neural) and GLN (dai2019retrosynthesis). Though conceptually straightforward, template-based methods must deal with tens or even hundreds of thousands of possible reaction templates, which makes the classification task hard. Besides, templates are not always available for chemical reactions. For these reasons, template-free methods that directly predict reactants have also been developed. Most of them employ seq2seq models such as LSTM (liu2017retrosynthetic) or Transformer (karpov2019transformer) from the neural machine translation literature.

While one-step methods are continuously being improved, most real-world molecules cannot be synthesized in one step. The number of synthesis steps can reach 60 or even more. Since each molecule could be synthesized from hundreds of different possible reactants, the number of possible synthesis routes for a single product becomes countless. Such a huge space poses challenges for efficient searching and planning, even with advanced one-step approaches.

Besides the huge search space, another challenge is the ambiguity in performance measurement and benchmarking. It has been extremely hard to quantitatively analyze the performance of multi-step retrosynthesis algorithms, both because the definition of ‘good synthesis routes’ is ambiguous and because there are no benchmark datasets for analyzing the designed algorithms. The most common approach to quantitative analysis is to employ domain experts and let them judge whether one synthesis route is better than another based solely on their experience, which is both time-consuming and costly.

Due to the aforementioned challenges, relatively little work has been proposed in the field of multi-step retrosynthetic planning. Previous works using Monte Carlo Tree Search (MCTS) (segler2018planning; segler2017towards) have achieved superior results over neural- or heuristic-based Breadth First Search (BFS). However, MCTS-based methods have several limitations in this setting:

  • Each tree node corresponds to a set of molecules instead of a single molecule. This additional combinatorial aspect makes the representation of a tree node, and the estimation of its value, even harder. Furthermore, reactions do not explicitly appear as nodes in the tree, which prevents the algorithm from exploiting the structure of subproblems.

  • As the algorithm depends on online value estimation, the full rollout of vanilla MCTS may not be efficient enough for planning. Furthermore, the algorithm cannot exploit historical data: many good retrosynthesis plans may have been found previously, and “intuitions” on how to plan efficiently may be learned from these histories.

For quantitative evaluation, they have employed numerous domain experts to conduct A-B tests over methods proposed by their algorithm and other baselines.

In this paper, we present a novel neural-guided tree search method, called Retro*, for chemical retrosynthesis planning. In our method:

  • We explicitly maintain information about reactions as nodes in an AND-OR tree, where a node with “AND” type corresponds to a reaction, and a node with “OR” type corresponds to a molecule. The tree captures the relations between candidate reactions and reactant molecules, which allows us to exploit the structure of subproblems corresponding to a single molecule.

  • Based on the AND-OR tree representation, we propose an A*-like planning algorithm which is guided by a neural network learned from past retrosynthesis planning experiences. More specifically, the neural network learns a synthesis cost for each molecule, which helps the search algorithm pick the most promising molecule node to expand.

Furthermore, we propose a method for constructing benchmark synthesis route data given reactions and chemical building blocks. Based on this, we construct a synthesis route dataset from the benchmark reaction dataset USPTO. The route dataset is useful not only for quantitative analysis of predicted synthesis routes, but also as training data for the neural network components in our method.

Below we summarize our contributions:

  • We propose a novel learning-based retrosynthetic planning algorithm that learns from previous planning experience. The proposed algorithm outperforms state-of-the-art methods by a large margin on a real-world benchmark dataset.

  • Our algorithm framework can induce a search algorithm that guarantees the optimal solution.

  • We propose a method for constructing synthesis route datasets for quantitative analysis of multistep retrosynthetic planning methods.

Our planning algorithm is general in the sense that it can also be applied to other machine learning problems such as theorem proving (yang2019learning) and hierarchical task planning (erol1996hierarchical). A synthetic task planning experiment is included in Appendix D to demonstrate the idea. Most related works have been mentioned in the first two sections. For more related works, please refer to Appendix E.

2 Background

In this section, we first state the problem we are tackling and its background in Section 2.1. Then, in Sections 2.2 and 2.3, we describe how MCTS and proof number search fit into the problem setting.

2.1 Problem Statement

One-step retrosynthesis: Denote the space of all molecules as $\mathcal{M}$. One-step retrosynthesis takes a target molecule $t \in \mathcal{M}$ as input, and predicts a set of source reactants $\mathcal{S} \subset \mathcal{M}$ that can be used to synthesize $t$. This is the reverse problem of reaction outcome prediction. In our paper, we assume the existence of such a one-step retrosynthesis model (or one-step model for simplicity in the rest of the paper) $B$,

$$B(\cdot): t \rightarrow \{R_i, \mathcal{S}_i, c(R_i)\}_{i=1}^{k} \qquad (1)$$

which outputs at most $k$ reactions $R_i$, the corresponding reactant sets $\mathcal{S}_i$ and costs $c(R_i)$. The cost can be the actual price of the reaction $R_i$, or simply the negative log-likelihood of this reaction under model $B$. A one-step retrosynthesis model can be learned from a dataset of chemical reactions $\mathcal{D}_{train} = \{(\mathcal{S}_i, t_i)\}$ (for simplicity, we follow the common practice of ignoring the reagents and other chemical reaction conditions) which have already been discovered by chemists in the past (coley2017computer; segler2017neural; liu2017retrosynthetic; dai2019retrosynthesis; karpov2019transformer).
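The one-step model interface described above can be sketched as follows. This is a minimal illustration, not the paper's model: the names `OneStepProposal`, `TOY_PREDICTIONS` and `toy_one_step_model`, the molecule labels and the probabilities are all hypothetical, and the cost is taken to be the negative log-likelihood, one of the two options mentioned in the text.

```python
import math
from typing import Dict, FrozenSet, List, NamedTuple, Tuple


class OneStepProposal(NamedTuple):
    reaction: str               # identifier of a proposed reaction
    reactants: FrozenSet[str]   # reactant set of that reaction
    cost: float                 # reaction cost, here the negative log-likelihood


# Hypothetical lookup table standing in for a learned model:
# product -> list of (reaction id, reactants, probability).
TOY_PREDICTIONS: Dict[str, List[Tuple[str, Tuple[str, ...], float]]] = {
    "t": [("P", ("a", "b"), 0.6), ("Q", ("c",), 0.3), ("X", ("d", "e"), 0.1)],
}


def toy_one_step_model(target: str, k: int = 2) -> List[OneStepProposal]:
    """Return at most k proposals for `target`, most likely first."""
    ranked = sorted(TOY_PREDICTIONS.get(target, []), key=lambda p: -p[2])
    return [
        OneStepProposal(rid, frozenset(reactants), -math.log(prob))
        for rid, reactants, prob in ranked[:k]
    ]


proposals = toy_one_step_model("t", k=2)
```

With `k=2`, only the two most likely proposals are kept, and a molecule the table does not know yields an empty list, which a planner would treat as a dead end.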

Retrosynthesis planning: Given a single target molecule $t \in \mathcal{M}$ and an initial set of molecules $\mathcal{I} \subset \mathcal{M}$, we are interested in synthesizing $t$ via a sequence of chemical reactions using reactants that are from $\mathcal{I}$ or can be synthesized from $\mathcal{I}$. In this case, $\mathcal{I}$ corresponds to a set of molecules that are commercially available. The goal of retrosynthesis planning is to predict a sequence of reactions whose reactants are in $\mathcal{I}$ and which ultimately arrives at product $t$.

Instead of performing forward-chaining-like reasoning that starts from $\mathcal{I}$, a more efficient and commonly used method is to perform backward chaining that starts from the molecule $t$ and performs a series of one-step retrosynthesis predictions until all the required reactants are from $\mathcal{I}$. Beyond just finding such a synthesis route, our goal is to find a retrosynthesis plan that is:

  • High-quality:

    • The entire retrosynthesis plan should be chemically sound with high probability;

    • The reactants or chemical reactions required should have as low cost as possible;

  • Efficient: Due to the synthesis effort, the number of retrosynthesis steps should be limited.

Our proposed Retro* aims to find the best retrosynthesis plan with respect to the above criteria in limited time. To achieve this, we also assume that the quality of a solution can be measured by the reaction cost, where such cost is known to our model.

2.2 Monte Carlo Tree Search

Figure 1: Left: MCTS (segler2018planning) for retrosynthesis planning. Each node represents a set of molecules. Orange nodes/molecules are available building blocks. Right: illustration of an AND-OR stump for a target molecule with two candidate reactions. One reaction requires two reactant molecules and the other requires a single reactant molecule; either reaction can be used to synthesize the target.

The Monte Carlo Tree Search (MCTS) has achieved groundbreaking successes in two-player games, such as Go (silver2016mastering; silver2017mastering). Its variant, UCT (kocsis2006bandit), is especially powerful for balancing exploration and exploitation in the online learning setting, and has been employed in segler2018planning for retrosynthesis planning. Specifically, as illustrated in Figure 1, the tree search starts from the target molecule $t$. Each node $u$ in the current search tree $T$ represents a set of molecules $\mathcal{M}_u$. Each child node of $u$ is obtained by selecting one molecule $m \in \mathcal{M}_u$ and a one-step retrosynthesis reaction $R \in B(m)$, where the resulting node contains the molecule set obtained from $\mathcal{M}_u$ by replacing $m$ with the reactants of $R$.

Despite its good performance, the MCTS formulation for retrosynthesis planning has several limitations. First, the rollouts needed in MCTS make it time-consuming, and unlike two-player zero-sum games, retrosynthesis planning is essentially a single-player game in which the return estimated by random rollouts can be highly inaccurate. Second, since each tree node is a set of molecules instead of a single molecule, the combinatorial nature of this representation brings sparsity to the variance estimation.

2.3 Proof Number Search and Variants

The proof-number search (PNS) (allis1994proof) is a game tree search designed for two-player games with a binary goal. It tries to either prove or disprove the root node as fast as possible. In the retrosynthesis planning scenario, this corresponds to either proving the target molecule $t$ by finding a feasible planning path, or concluding that it is not synthesizable.

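AND-OR Tree: The search tree of PNS is an AND-OR tree $T$, where each AND node needs all its children to be proved, while an OR node requires at least one to be satisfied. Each node $u \in T$ is associated with a proof number $pn(u)$ that defines the minimum number of leaf nodes to be proved in order to prove $u$. Similarly, the disproof number $dn(u)$ is the minimum number of leaf nodes needed to disprove $u$. With these definitions, we can recursively define these numbers for internal nodes, writing $ch(u)$ for the children of $u$. Specifically, for an AND node $A$,

$$pn(A) = \sum_{n \in ch(A)} pn(n), \qquad dn(A) = \min_{n \in ch(A)} dn(n) \qquad (2)$$

and for an OR node $O$, we have

$$pn(O) = \min_{n \in ch(O)} pn(n), \qquad dn(O) = \sum_{n \in ch(O)} dn(n) \qquad (3)$$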

Represent retrosynthesis planning using AND-OR tree: As illustrated in Figure 1, the application of the one-step retrosynthesis model $B$ on molecule $t$ can be represented using one block of an AND-OR tree (denoted as an AND-OR stump), with the molecule node as an ‘OR’ node and each reaction node as an ‘AND’ node. This is because a molecule can be synthesized using any one of its children reactions (or-relation), and each reaction node requires all of its children molecules (and-relation) to be ready.
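The recursive proof and disproof numbers on such an AND-OR stump can be sketched as below. This is a minimal illustration of the standard PNS rules (sum over children for the number that needs all children, minimum for the number that needs one), not code from the paper; the `Node` class, the leaf initialization of 1 for unknown leaves, and the molecule names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

INF = float("inf")


@dataclass
class Node:
    kind: str                      # "AND" (reaction) or "OR" (molecule)
    children: List["Node"] = field(default_factory=list)
    pn: float = 1.0                # proof number (unknown leaf: 1)
    dn: float = 1.0                # disproof number (unknown leaf: 1)


def proved_leaf() -> Node:
    # A leaf that is already proved, e.g. an available building block.
    return Node("OR", pn=0.0, dn=INF)


def update(node: Node) -> None:
    """Recompute pn/dn of `node` bottom-up from its children."""
    if not node.children:
        return
    for child in node.children:
        update(child)
    pns = [c.pn for c in node.children]
    dns = [c.dn for c in node.children]
    if node.kind == "AND":         # all children must be proved
        node.pn, node.dn = sum(pns), min(dns)
    else:                          # one proved child suffices
        node.pn, node.dn = min(pns), sum(dns)


# Stump: target t can be made by reaction P (needs a and b) or Q (needs c).
a, b, c = Node("OR"), proved_leaf(), Node("OR")
P, Q = Node("AND", [a, b]), Node("AND", [c])
t = Node("OR", [P, Q])
update(t)
```

In this toy stump, proving `t` needs one more leaf (either `a` or `c`), while disproving it requires disproving both reactions, so its disproof number is 2.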

The search of PNS starts from the root node every time, and selects the child node with either the minimum proof number or the minimum disproof number, depending on whether the current node is an OR node or an AND node, respectively. The process ends when a leaf node is reached, which can be either a reaction or a molecule node to be expanded. After one step of retrosynthesis expansion, the $pn(\cdot)$ and $dn(\cdot)$ of all nodes along the path back to the root are updated. The two-player game in this sense comes from the interleaving behavior of selecting proof and disproof numbers, where the first ‘player’ tries to prove the root while the second ‘player’ tries to disprove it. As both players behave optimally when the proof/disproof numbers are accurate, this perspective brings efficiency in finding a feasible synthesis path or proving that the target is not synthesizable.

Variant: There have been several variants that improve different aspects of PNS, including different traversal strategies and different initialization methods of $pn(\cdot)$ and $dn(\cdot)$ for newly added nodes. The most recent work, DFPN-E (kishimoto2019depth), builds on top of the depth-first variant of PNS with an additive cost in addition to the classical update rule in Eq (3). Specifically, for an unsolved OR node $O$,

$$pn(O) = \min_{n \in ch(O)} \big( h(O, n) + pn(n) \big) \qquad (4)$$

Here $h(O, n)$ is a function of the cost of the corresponding one-step retrosynthesis. Together with manually defined thresholds, this method addresses the lopsided problem in retrosynthesis planning, i.e., the imbalance of branching factor between AND and OR nodes.

The variants of PNS have shown some promising results over MCTS for retrosynthesis planning. However, the two-player game formulation is designed for the speed of a proof, not necessarily the overall solution quality. Moreover, existing works rely on human experts to design the heuristics and thresholds used during search. This makes the method not only time-consuming to tune, but also hard to generalize when solving new target molecules or dealing with a new one-step model or new reaction data.

3 Retro* Search Algorithm

Figure 2: Retro* algorithm framework. We use circles to represent molecule nodes, and squares to represent reaction nodes. An iteration consists of three phases. In the selection phase, one of the frontier molecule nodes is selected according to the cost estimation $V_t(m|T)$. Then an AND-OR stump is expanded from the selected node, and all the new reactions and molecules are added to the tree. Finally, the values inside the tree are updated using the $V_m$ estimates of the newly added molecules. The left-most figure also serves as an illustration of how $V_t(m|T)$ is computed for a frontier node.


Our proposed Retro* is a retrosynthetic planning algorithm that works on the AND-OR search tree. It differs significantly from PNS, which is also based on an AND-OR tree, and from MCTS-based methods in the following ways:

  • Retro* uses the AND-OR tree for a single-player game and relies only on the global value estimation. This is different from PNS, which models the problem as a two-player game with both proof numbers and disproof numbers. This difference in objective makes Retro* advantageous in finding the best retrosynthetic routes.

  • Retro* estimates the future value of frontier nodes with a neural network that can be trained using historical retrosynthesis planning data. This is different from the expensive rollouts used in segler2018planning, or the human-designed heuristics in kishimoto2019depth. It not only enables more accurate prediction during expansion, but also generalizes the knowledge learned from existing planning paths.

3.1 Overview of Retro*

Retro* (Algorithm 1) is a best-first search algorithm, which exploits neural priors to directly optimize for the quality of the solution. The search tree $T$ is an AND-OR tree, with molecule nodes as ‘OR’ nodes and reaction nodes as ‘AND’ nodes. It starts the search tree $T$ with a single root molecule node that is the target molecule $t$. At each step, it selects a node $u$ in the frontier of $T$ (denoted as $\mathcal{F}(T)$) according to the value function. Then it expands $u$ with the one-step model $B(u)$ and grows $T$ with one AND-OR stump. Finally, the nodes with potential dependency on $u$ are updated. Below we first provide a big picture of the algorithm by explaining these steps one by one, then we look into the details of the value function design and its update in Section 3.2 and Section 3.3, respectively. Figure 2 summarizes these steps at a high level.

Selection: Given a search tree $T$, we denote the molecule nodes as $\mathcal{V}^m(T)$ and the reaction nodes as $\mathcal{V}^r(T)$, so that the full node set of $T$ is $\mathcal{V}(T) = \mathcal{V}^m(T) \cup \mathcal{V}^r(T)$. The frontier $\mathcal{F}(T) \subseteq \mathcal{V}^m(T)$ contains all the molecule nodes in $T$ that haven’t been expanded before. Since we want to minimize the total cost of the final solution, an ideal option to expand next would be the molecule node which is part of the best synthesis plan.

Suppose we already have a value function oracle $V_t(m|T)$ which tells us, under the current search tree $T$, the cost of the best plan that contains $m$ for synthesizing target $t$. We can use it to select the next node to expand:

$$m_{next} = \arg\min_{m \in \mathcal{F}(T)} V_t(m|T) \qquad (5)$$

A proper design of such $V_t(m|T)$ would not only improve search efficiency, but can also bring theoretical guarantees.

Expansion: After picking the node $m$ with minimum cost estimation $V_t(m|T)$, we expand the search tree with the one-step retrosynthesis proposals $\{(R_i, \mathcal{S}_i, c(R_i))\}_{i=1}^{k}$ from $B(m)$. Specifically, for each proposed retrosynthesis reaction $R_i$, we create a reaction node $R_i$ under node $m$, and for each molecule $m' \in \mathcal{S}_i$, we create a molecule node $m'$ under the reaction node $R_i$. This creates an AND-OR stump under node $m$. Unlike in MCTS (segler2018planning), where multiple calls to $B(\cdot)$ are needed until a terminal state is reached during rollout, here the expansion only requires a single call to the one-step model.

Update: Denote the search tree after expansion of node $m$ as $T'$. Such expansion obtains the corresponding cost information $c(R_i)$ for one-step retrosynthesis. We utilize this more direct information to update $V_t(\cdot|T')$ of all other relevant nodes to provide a more accurate estimation of total cost.
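The three phases above can be put together in a minimal, non-optimized sketch. Everything here is an assumption made for illustration: the class and function names, the toy reaction table `TOY`, and the choice to recompute reaction numbers and frontier values recursively on every query instead of caching them. The frontier value follows the design given in Section 3.2, the value estimate of unexpanded molecules is fixed to 0, and the search stops at the first solution found.

```python
import math


class Mol:
    """Molecule (OR) node."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = []          # reaction nodes
        self.expanded = False


class Rxn:
    """Reaction (AND) node."""
    def __init__(self, name, cost, parent):
        self.name, self.cost, self.parent = name, cost, parent
        self.children = []          # reactant molecule nodes


def rn(node, value, available):
    """Reaction number: estimated minimum cost to realize `node`."""
    if isinstance(node, Rxn):
        return node.cost + sum(rn(m, value, available) for m in node.children)
    if node.name in available:
        return 0.0
    if not node.expanded:
        return value(node.name)
    return min((rn(r, value, available) for r in node.children), default=math.inf)


def v_t(m, value, available):
    """Cost estimate of the best plan through frontier node `m`."""
    total, node = rn(m, value, available), m
    while node.parent is not None:
        r = node.parent
        # Add the ancestor reaction's cost and its other reactants' estimates.
        total += r.cost + sum(rn(s, value, available)
                              for s in r.children if s is not node)
        node = r.parent
    return total


def solved(node, available):
    if isinstance(node, Rxn):
        return all(solved(m, available) for m in node.children)
    return node.name in available or any(solved(r, available) for r in node.children)


def retro_star(target, one_step, value, available, max_calls=100):
    root = Mol(target)
    frontier, calls = [root], 0
    while frontier and calls < max_calls and not solved(root, available):
        m = min(frontier, key=lambda u: v_t(u, value, available))  # selection
        frontier.remove(m)
        m.expanded = True
        calls += 1
        for rid, reactants, cost in one_step(m.name):              # expansion
            r = Rxn(rid, cost, m)
            m.children.append(r)
            for name in reactants:
                child = Mol(name, r)
                r.children.append(child)
                if name not in available:
                    frontier.append(child)
        # Update: rn / v_t are recomputed on demand in this sketch.
    return solved(root, available), calls


# Hypothetical reaction table: product -> [(reaction id, reactants, cost)].
TOY = {
    "t": [("P", ("a", "b"), 1.0), ("Q", ("c",), 0.5)],
    "b": [("S", ("e",), 0.2)],
    "c": [("R", ("d",), 3.0)],
}
ok, calls = retro_star("t", lambda m: TOY.get(m, []), lambda m: 0.0, {"a", "e"})
```

In this toy run the search first tries the cheaper-looking branch through `c`, finds that it leads to an expensive reaction with an unavailable reactant, and then switches to `b`, solving the target after three one-step calls.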

3.2 Design of $V_t(m|T)$

To properly design $V_t(m|T)$, we borrow the idea from the A* algorithm. A* is a best-first search algorithm which uses the cost from the start together with an estimate of the future cost to select the next move. When this estimate is admissible, A* is guaranteed to return the optimal solution. Inspired by the A* algorithm, we decompose the value function into two parts:

$$V_t(m|T) = g_t(m|T) + h_t(m|T) \qquad (6)$$

where $g_t(m|T)$ is the cost of the current reactions that have happened in $T$, if $m$ should be in the final route, and $h_t(m|T)$ is the estimated cost of the future reactions needed to complete such a plan. Instead of explicitly calculating these two separately, we show an equivalent but simpler way to calculate $V_t(m|T)$ directly.

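Specifically, we first define $V_m(m|\emptyset)$, which is a boundary case of the value function oracle $V$ that simply tells how much cost is needed to synthesize molecule $m$. For simplicity of notation, we denote it as $V_m$. Then we define the reaction number function $rn(\cdot|T): \mathcal{V}(T) \rightarrow \mathbb{R}$, which is inspired by the proof number but serves a different purpose:

$$rn(R|T) = c(R) + \sum_{m \in ch(R)} rn(m|T), \qquad rn(m|T) = \begin{cases} V_m, & m \in \mathcal{F}(T) \\ \min_{R \in ch(m)} rn(R|T), & \text{otherwise} \end{cases} \qquad (7)$$

where $R \in \mathcal{V}^r(T)$ and $m \in \mathcal{V}^m(T)$, so that the two cases calculate the reaction number for a reaction node and a molecule node, respectively. The reaction number tells the minimum estimated cost needed for a molecule or reaction to happen in the current tree. We further define $pr(u)$ to be the parent node of $u$, and $\mathcal{A}(u|T)$ to be the set of all ancestors of node $u$. Note that the parent of a molecule node is a reaction node and vice versa. Then the function $V_t(m|T)$ is:

$$V_t(m|T) = \sum_{r \in \mathcal{A}(m|T) \cap \mathcal{V}^r(T)} c(r) \; + \sum_{m' \in \mathcal{V}^m(T) \setminus \mathcal{A}(m|T),\; pr(m') \in \mathcal{A}(m|T)} rn(m'|T) \qquad (8)$$

The first summation calculates all the reaction costs that have happened along the path from node $m$ to the root. Additionally, for every reaction node $r \in \mathcal{A}(m|T)$, each child node $m'$ of $r$ that is not itself on the path (including $m$) should also be synthesized, as each such reaction node is an AND node. This requirement is captured in the second summation of Eq (8). We can see that $g_t(m|T)$ implicitly sums up the costs associated with the reaction nodes in this route related to $m$, and $h_t(m|T)$ takes all the terms related to $V_{m'}$ in Eq (7).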

In Figure 2 we demonstrate the calculation of $V_t(m|T)$ with a simple example. Notice that we can compute the parts relevant to $g_t(m|T)$ with existing information, but we can only estimate the part of $h_t(m|T)$, since the required reactions are not in the search tree yet. We will show how to learn this future estimation in Section 4.
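As a worked illustration of Eqs (7) and (8), consider a hypothetical tree (chosen for this example, not the one in the paper's figure) in which the target $t$ has two candidate reactions, $P$ with reactants $\{a, b\}$ and $Q$ with reactant $\{c\}$, and in which $b$ has already been expanded by a single reaction $S$ with reactants $\{e, f\}$. The frontier is then $\{a, c, e, f\}$, and under these assumptions:

$$
\begin{aligned}
rn(b|T) &= c(S) + V_e + V_f, \\
rn(t|T) &= \min\{\, c(P) + V_a + rn(b|T),\; c(Q) + V_c \,\}, \\
V_t(e|T) &= \underbrace{c(P) + c(S)}_{g_t(e|T)} + \underbrace{V_a + V_e + V_f}_{h_t(e|T)}, \\
V_t(c|T) &= \underbrace{c(Q)}_{g_t(c|T)} + \underbrace{V_c}_{h_t(c|T)} .
\end{aligned}
$$

In this example $V_t(a|T) = V_t(e|T) = V_t(f|T)$: frontier nodes that belong to the same candidate plan share one value, so the selection rule effectively chooses between the plan through $P$ and the plan through $Q$.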

3.3 Updating $V_t(m|T)$

After a node $m$ is expanded, several components need to be updated to maintain the search tree state.

Update $rn(\cdot|T)$: Following Eq (7), the reaction number of each newly created molecule node $u$ under the subtree rooted at $m$ is $V_u$, and each new reaction node has its cost $c(R)$ added to the sum of the reaction numbers of its children. After that, all the ancestor nodes $u \in \mathcal{A}(m|T)$ would potentially have their reaction numbers updated following Eq (7). Thus this process requires computation of complexity $O(\mathrm{depth}(T))$. However, in our implementation, we update these nodes in a bottom-up fashion that starts from $m$, and stop as soon as an ancestor node's value doesn’t change. This speeds up the update.

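Update $V_t(\cdot|T)$: Let $\mathcal{U}$ be the set of molecule nodes whose reaction numbers were updated in the stage above. From Eq (8) we can see that, for any molecule node $u \in \mathcal{F}(T)$, $V_t(u|T)$ is recalculated if $\{m' : pr(m') \in \mathcal{A}(u|T)\} \cap \mathcal{U} \neq \emptyset$.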

Remark: The expansion of a node can potentially affect all other nodes in $\mathcal{F}(T)$ in the worst case. However, the expansion of a single molecule node $m$ will only affect another node $v$ in the frontier when $m$ is on the current best synthesis solution that composes $V_t(v|T)$. In the actual implementation, we use efficient caching and a lazy propagation mechanism, which guarantees that $V_t(v|T)$ is only updated when necessary. The implementation details of both updates above can be found in Appendix A.
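The bottom-up update with early stopping can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation from Appendix A: the `Node` class, the cached `rn` field and the numbers in the toy tree are all hypothetical.

```python
import math


class Node:
    def __init__(self, kind, cost=0.0, parent=None, rn=0.0):
        self.kind = kind            # "mol" (OR) or "rxn" (AND)
        self.cost = cost            # reaction cost, unused for molecules
        self.parent = parent
        self.children = []
        self.rn = rn                # cached reaction number
        if parent is not None:
            parent.children.append(self)


def recompute(node):
    """Reaction number of an internal node from its children's cached values."""
    if node.kind == "rxn":
        return node.cost + sum(c.rn for c in node.children)
    return min((c.rn for c in node.children), default=math.inf)


def propagate(node):
    """Update cached rn from the expanded `node` upwards; return #changes."""
    changed = 0
    while node is not None:
        new_rn = recompute(node)
        if new_rn == node.rn:
            break                   # ancestors cannot change either
        node.rn, changed = new_rn, changed + 1
        node = node.parent
    return changed


# Tree before expansion: t <- P(a, b) or Q(c); cached values are consistent.
t = Node("mol", rn=1.5)
P = Node("rxn", cost=1.0, parent=t, rn=3.0)
Q = Node("rxn", cost=0.5, parent=t, rn=1.5)
a = Node("mol", parent=P, rn=0.0)
b = Node("mol", parent=P, rn=2.0)   # frontier node, rn is its value estimate
c = Node("mol", parent=Q, rn=1.0)

# Expand b with one reaction S (cost 2.5) whose only reactant e is available.
S = Node("rxn", cost=2.5, parent=b, rn=2.5)
e = Node("mol", parent=S, rn=0.0)
n_changed = propagate(b)
```

Here the expansion raises the reaction numbers of `b` and `P`, but the root keeps its value because the route through `Q` is still the cheapest, so the propagation stops after two updates.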

3.4 Guarantees on Finding the Optimal Solution

Theorem 1: Assuming $V_m$ or its lower bound is known for all encountered molecules $m$, Algorithm 1 is guaranteed to return an optimal solution, if the halting condition is changed to “the total cost of a found route is no larger than $\min_{m \in \mathcal{F}(T)} V_t(m|T)$”.

The proof can be found in Appendix B.

Remark 1: If we define the cost of a reaction to be its negative log-likelihood, then $0$ is a lower bound of $V_m$ for any molecule $m$. The induced algorithm is guaranteed to find the optimal solution.

Remark 2: In practice, due to the limited time budget, we prefer the algorithm to return once a solution is found.

3.5 Extension: Retro* on Graph Search Space

We have been mainly illustrating the technique on a tree-structured space. As retrosynthesis planning is essentially performed on a directed graph (i.e., certain intermediate molecules may share the same reactants, which may further reduce the actual cost), the above calculation can be extended to the general bipartite graph $G$ with edges connecting $\mathcal{V}^m(G)$ and $\mathcal{V}^r(G)$. Due to the potential existence of loops, the calculation of Eq (7) will be performed using a shortest path algorithm instead. As there will be no negative loops, the shortest path algorithm will still converge. By viewing the search space as a tree rather than a graph, we may find a sub-optimal solution due to the repetition in state representation. However, as loopy synthesis is rare in the real world, we mainly focus on tree-structured search in this paper, and will investigate the extension to bipartite graph search in future work.

4 Estimating $V_m$ from Planning Solutions

Retro* requires the value function oracle $V_m$ to compute $V_t(m|T)$ for expansion node selection. However, in practice it is impossible to obtain the exact value of $V_m$ for every molecule $m$. Therefore we try to estimate it from previous planning data.

4.1 Representation of $V_m$

To parameterize $V_m$ for any molecule $m$, we first compute its fixed-size Morgan fingerprint (rogers2010extended) and feed it into a single-layer fully connected neural network, which then outputs a scalar representing $V_m$.

4.2 Offline Learning of $V_m$

Previous work has used either random rollouts or human-designed heuristics for estimating $V_m$, which may not be accurate enough to guide the search. Instead of learning it online during planning (silver2017mastering), we utilize the existing reactions in the training set $\mathcal{D}_{train}$ to train it.

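Specifically, we construct retrosynthesis routes for feasible molecules in $\mathcal{D}_{train}$, where the set of available molecules $\mathcal{I}$ is also given beforehand. The specific construction strategy will be covered in Section 5.1.2. The resulting dataset will be $\mathcal{R}_{train} = \{rt_i = (m_i, v_i, R_i, B(m_i))\}$, where each tuple $rt_i$ contains the target molecule $m_i$, the best entire route cost $v_i$, and the one-step retrosynthesis candidates $B(m_i)$, which also contain the true one-step retrosynthesis $R_i$ used in the planning solution.

The learning of $V_m$ consists of two parts, namely the value fitting, which is a regression loss $\mathcal{L}_{reg}(rt_i) = (V_{m_i} - v_i)^2$, and the consistency learning, which maintains the partial order relationship between the best one-step solution $R_i$ and the other solutions $(R_j, \mathcal{S}_j, c(R_j)) \in B(m_i)$:

$$\mathcal{L}_{con}(rt_i, R_j) = \max\Big\{0,\; v_i + \epsilon - c(R_j) - \sum_{m' \in \mathcal{S}_j} V_{m'}\Big\} \qquad (9)$$

where $\epsilon$ is a positive constant margin to ensure $R_i$ has higher priority for expansion than its alternatives even if the value estimates have tolerable noise. The overall objective is:

$$\min_{V_{(\cdot)}} \; \mathbb{E}_{rt_i \sim \mathcal{R}_{train}} \Big[ \mathcal{L}_{reg}(rt_i) + \lambda \, \mathbb{E}_{R_j \sim B(m_i) \setminus \{R_i\}} \big[ \mathcal{L}_{con}(rt_i, R_j) \big] \Big] \qquad (10)$$

where $\lambda$ balances these two losses. In our experiments we set it to 1 by default.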
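The two loss terms described above can be sketched for a single training tuple as follows. This is one plausible reading, not the authors' training code: the hinge form of the consistency term, the averaging over alternatives, the function names, and the toy values are assumptions made for illustration, and a real implementation would backpropagate through a neural value function.

```python
from typing import Callable, Sequence, Tuple

Candidate = Tuple[float, Sequence[str]]   # (reaction cost, reactant names)


def regression_loss(v_pred: float, v_target: float) -> float:
    """Value fitting: squared error against the best route cost."""
    return (v_pred - v_target) ** 2


def consistency_loss(v_target: float, candidate: Candidate,
                     value: Callable[[str], float], margin: float) -> float:
    """Hinge penalty when an alternative reaction looks too cheap."""
    cost, reactants = candidate
    alternative = cost + sum(value(m) for m in reactants)
    return max(0.0, v_target + margin - alternative)


def total_loss(mol: str, v_target: float, alternatives: Sequence[Candidate],
               value: Callable[[str], float],
               margin: float = 1.0, lam: float = 1.0) -> float:
    reg = regression_loss(value(mol), v_target)
    if not alternatives:
        return reg
    con = sum(consistency_loss(v_target, c, value, margin) for c in alternatives)
    return reg + lam * con / len(alternatives)


# Hypothetical value estimates and one training tuple.
estimates = {"t": 2.0, "x": 0.5, "y": 0.25}
value_fn = lambda m: estimates.get(m, 0.0)
loss = total_loss("t", 1.5, [(1.0, ["x"]), (2.0, ["x", "y"])], value_fn)
```

In this example the first alternative is estimated at 1.5, which is inside the margin of the best route cost and is therefore penalized, while the second is estimated at 2.75 and contributes nothing.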

5 Experiments

| Algorithm      | Retro* | Retro*-0 | DFPN-E+ | DFPN-E | MCTS+  | MCTS   | Greedy DFS |
|----------------|--------|----------|---------|--------|--------|--------|------------|
| Success rate   | 86.84% | 79.47%   | 53.68%  | 55.26% | 35.79% | 33.68% | 22.63%     |
| Time           | 156.58 | 208.58   | 289.42  | 279.67 | 365.21 | 370.51 | 388.15     |
| Shorter routes | 50     | 52       | 59      | 59     | 18     | 14     | 11         |
| Better routes  | 112    | 102      | 22      | 25     | 46     | 41     | 26         |

Table 1: Performance summary. Time is measured by the number of one-step model calls, with a hard limit of 500. The numbers of shorter and better routes are obtained from the comparison against the expert routes, in terms of the number of reactions and the total cost, respectively.

5.1 Creating Benchmark Dataset

5.1.1 USPTO Reaction Dataset

We use the publicly available reaction dataset extracted from the United States Patent Office (USPTO) to train the one-step model and extract synthesis routes. The whole dataset consists of chemical reactions published up to September 2016. For reactions with multiple products, we duplicate them into multiple ones with one product each. After removing the duplicates and the reactions with wrong atom mappings, we further extract reaction templates with RDChiral for all reactions and discard those whose reactants cannot be obtained by applying the reaction templates to their products. The remaining reactions are further split randomly into train/val/test sets.

With the reaction data, we train a template-based MLP model (segler2017neural) for one-step retrosynthesis. Following the literature, we formulate one-step retrosynthesis as a multi-class classification problem, where given a molecule as product, the goal is to predict the possible reaction templates. Reactants are obtained by applying the predicted templates to the product molecule. Throughout all experiments, we take the top-$k$ templates predicted by the MLP model and apply them to each product to get the corresponding reactant lists.

5.1.2 Extracting Synthesis Routes

To train our value function and quantitatively analyze the predicted routes, we construct synthesis routes based on the USPTO reaction dataset and a list of commercially available building blocks from eMolecules. eMolecules consists of commercially available molecules that can serve as end points for our search algorithm.

Given the list of building blocks, we take each molecule that has appeared in the USPTO reaction data and analyze whether it can be synthesized by existing reactions within the USPTO training data. For each synthesizable molecule, we choose the shortest possible synthesis route whose end points are available building blocks in eMolecules.
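One way to carry out this kind of extraction is a fixed-point relaxation over the known reactions, sketched below. The paper does not spell out its procedure at this level of detail, so this is an assumption-laden illustration: the function name, the toy reactions, and the choice to measure route length as the total number of reactions in the route are all hypothetical.

```python
import math
from typing import Dict, Iterable, Set, Tuple

Reaction = Tuple[str, Tuple[str, ...]]     # (product, reactants)


def shortest_route_lengths(reactions: Iterable[Reaction],
                           building_blocks: Set[str]) -> Dict[str, int]:
    """Map each synthesizable molecule to the length of its shortest route."""
    reactions = list(reactions)
    length = {m: 0 for m in building_blocks}   # building blocks need no reaction
    changed = True
    while changed:                             # relax until nothing improves
        changed = False
        for product, reactants in reactions:
            if product in building_blocks:
                continue
            if all(r in length for r in reactants):
                candidate = 1 + sum(length[r] for r in reactants)
                if candidate < length.get(product, math.inf):
                    length[product] = candidate
                    changed = True
    return length


# Hypothetical reaction data: t can be made from (a, b) or from (c,).
toy_reactions = [
    ("t", ("a", "b")),
    ("b", ("e",)),
    ("t", ("c",)),
    ("c", ("d",)),
    ("x", ("t", "y")),
]
lengths = shortest_route_lengths(toy_reactions, {"a", "e"})
```

Molecules that never receive an entry, such as `c` and `x` here, are not synthesizable from the given building blocks with the given reactions and would be left out of the route dataset.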

We obtain the validation and test route datasets with a slightly different process. For the validation dataset, we first combine the train and validation reaction datasets, and then repeat the aforementioned extraction procedure on the combined dataset. Since we extract routes from more reactions, the synthesizable molecules will include those that could not be synthesized with the original reactions and those that now have shorter routes. We exclude molecules whose routes have the same length as in the training data, and pack the remaining ones as the validation route dataset. We apply a similar procedure to the test data but make sure that there is no overlap between the test and training/validation sets.

We further clean the test route dataset by only keeping the routes whose reactions are all covered by the top-$k$ predictions of the one-step model. To make the test set more challenging, we filter out the easier molecules by running a heuristic-based BFS planning algorithm and discarding the molecules it solves within a fixed time limit. After processing, we obtain the training, validation and test route datasets and the corresponding target molecules.

5.2 Results

Figure 3: Left: Counts of the best solutions among all algorithms in terms of length/cost; Mid: Sample solution route from Retro*. Numbers on the edges are the likelihoods of the reactions. Yellow nodes are building blocks; Right: The corresponding dotted box part in the expert route, much longer and less probable than the solution.

We compare Retro* against DFPN-E (kishimoto2019depth), MCTS (segler2018planning) and greedy Depth First Search (DFS) on the product molecules in the test route dataset described in Section 5.1.2. Greedy DFS always prioritizes the reaction with the highest likelihood. MCTS is implemented with PUCT, where we use the reaction probability provided by the one-step model as the prior to bias the search.

We measure both route quality and planning efficiency to evaluate the algorithms. To measure the quality of a solution route, we compare its total cost as well as its length, i.e., the number of reactions in the route. The cost function is defined as the negative log-likelihood of the reaction. Therefore, minimizing the total cost is equivalent to maximizing the likelihood of the route. To measure planning efficiency, we use the number of calls to the one-step model as a surrogate for time, since these calls account for most of the running time, and compare the success rate under the same time limit.
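The equivalence between minimizing total cost and maximizing route likelihood can be checked numerically with a small sketch. The function names and the likelihood values are hypothetical, and the route likelihood is taken to be the product of the per-reaction likelihoods, which is the assumption under which the sum of negative log-likelihoods is a monotone transform of it.

```python
import math


def route_cost(likelihoods):
    """Total cost of a route: sum of per-reaction negative log-likelihoods."""
    return sum(-math.log(p) for p in likelihoods)


def route_likelihood(likelihoods):
    """Likelihood of a route as the product of its reaction likelihoods."""
    return math.prod(likelihoods)


route_a = [0.9, 0.8]   # two fairly likely reactions (hypothetical values)
route_b = [0.5]        # one less likely reaction (hypothetical value)

cost_a, cost_b = route_cost(route_a), route_cost(route_b)
```

Here the longer route has the lower cost, because its overall likelihood (0.72) exceeds that of the one-step route (0.5). This is why the length and cost metrics can rank routes differently.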

Performance summary: The performances of all algorithms are summarized in tbl:summary. Under the time limit of one-step calls, Retro* solves more test molecules than the second best method, DFPN-E. Among all the solutions given by Retro*, of them are shorter than expert routes, and of them are better in terms of the total costs. We also conduct an ablation study to understand the importance of the learning component in Retro* by evaluating its non-learning version Retro*-0. Retro*-0 is obtained from Retro* by setting to , which is a lowerbound of any valid values. Comparing to baseline methods, Retro*-0 is also showing promising results. However, it is outperformed by Retro* by in terms of success rate, demonstrating the performance gain brought by learning from previous planning experience.

To find out whether MCTS and DFPN-E can benefit from the learned value function oracle in Retro*, we replace the reward estimation by rollout in MCTS and the proof number initialization in DFPN-E by the same , calling the strengthened algorithms MCTS+ and DFPN-E+. As expected, the value function helps MCTS, because it provides a value estimate with lower variance than rollout. The performance of DFPN-E is not improved because we do not have a good initialization of the disproof number.

Figure 4: Influence of time limit on performance.

Influence of time limit: To show the influence of time limit on performance, we plot the success rate against the number of one-step model calls in fig:succ-rate. We can see that Retro* not only outperforms baseline algorithms by a large margin at the beginning, but also is improving faster than the baselines, enlarging the performance gap as the time limit increases.
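A success-rate curve of this kind can be computed from per-molecule records of how many one-step calls were needed before the first route was found. The sketch below is only illustrative: the function name, the record format and the numbers are assumptions, not data from the experiments.

```python
def success_rate_curve(calls_to_solve, limits):
    """Fraction of molecules solved within each limit on one-step model calls.

    `calls_to_solve` holds, per test molecule, the number of calls used when
    the first route was found, or None if the molecule was never solved.
    """
    total = len(calls_to_solve)
    return [
        sum(1 for c in calls_to_solve if c is not None and c <= limit) / total
        for limit in limits
    ]


# Hypothetical records for five test molecules.
records = [12, 40, None, 130, 75]
curve = success_rate_curve(records, limits=[50, 100, 150])
print(curve)
```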

Solution quality: To evaluate the overall solution quality, for each test molecule, we collect solutions from all algorithms, and compare the route lengths and costs (see fig:route-quality-left). We only keep the best routes (could be multiple) for each test molecule, and count the number of best routes in total for each method. We find that in terms of total costs, Retro* produces more best routes than the second best method. Even for the length metric, which is not the objective Retro* is optimizing for, it still achieves about the same performance as the best method.
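The counting procedure can be sketched as follows, with ties credited to every algorithm that attains the best value. The data layout and the numbers are hypothetical and serve only to show the bookkeeping.

```python
from collections import Counter


def count_best_routes(results):
    """Count how often each algorithm attains the best (lowest) metric.

    `results` maps molecule -> {algorithm: metric}. Algorithms that failed
    on a molecule are simply absent from its inner dictionary.
    """
    wins = Counter()
    for per_algorithm in results.values():
        if not per_algorithm:
            continue                       # nobody solved this molecule
        best = min(per_algorithm.values())
        for algorithm, value in per_algorithm.items():
            if value == best:              # ties count for every winner
                wins[algorithm] += 1
    return wins


# Hypothetical route lengths for three test molecules.
lengths = {
    "mol_1": {"Retro*": 3, "DFPN-E": 3, "MCTS": 5},
    "mol_2": {"Retro*": 4, "MCTS": 6},
    "mol_3": {"DFPN-E": 2, "Retro*": 3},
}
wins = count_best_routes(lengths)
print(wins)
```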

As a demonstration of Retro*’s ability to find high-quality routes, we illustrate a sample solution in fig:route-quality-mid, where each node represents a molecule. The target molecule corresponds to the root node, and the building blocks are in yellow. The numbers on the edges indicate the likelihoods of successfully carrying out the corresponding reactions in the real world. The expert route shares exactly the same first reaction and the same right branch with the route found by our algorithm. However, the left branch (fig:route-quality-right) is much longer and less probable than the corresponding part of the solution route, as shown in the dotted box region in fig:route-quality-mid. Please refer to Appendix C for more sample solution routes and search tree visualizations.

6 Conclusion

In this work, we propose Retro*, a learning-based retrosynthetic planning algorithm for efficiently finding high-quality routes. Retro* is able to utilize previous planning experience to bias the search on unseen molecules towards promising directions. We also propose a systematic approach for creating a retrosynthesis dataset from publicly available reaction datasets, as well as novel metrics for evaluating solution routes without involving human experts. Experiments on a real-world benchmark dataset demonstrate our algorithm’s significant improvement over existing methods in both planning efficiency and solution quality.


We thank Junhong Liu, Wei Yang and Yong Liu for helpful discussions. This work is supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, CAREER IIS-1350983, CNS-1704701, ONR MURI grant to L.S.


Appendix A Implementation details

Figure 5: Illustration for the update process. Three phases correspond to line LABEL:ln:start-LABEL:ln:new-values, line LABEL:ln:r-update-start-LABEL:ln:r-update-end, and line LABEL:ln:m-update-start-LABEL:ln:end in alg:update.

In this section we describe the algorithm details in the update phase of Retro*. The goal of the update phase is to compute the up-to-date for every molecule node . To implement efficient update, we need to cache for all . Note that from Eq (8), we can observe the fact that sibling molecule nodes have the same ,   if . Therefore instead of storing the value of in every molecule node , we store the value in their common parent via defining if for every reaction node .

In our implementation, we cache for all reaction nodes and cache for all nodes . Caching values in this way would allow us to visit each related node only once for minimal update.


Footnote 5: For clarity, we omit the condition on in the notations.

The update function is summarized in alg:update and illustrated in fig:update, which takes in the expanded node and the expansion result , and performs updates to affected nodes. We first compute the values for new reactions according to Eq (7) and (8) in line LABEL:ln:start-LABEL:ln:new-values. Then we update the ancestor nodes of in a bottom-up fashion in line LABEL:ln:bottom-up-start-LABEL:ln:end. We also update the molecule nodes in the sibling sub-trees in line LABEL:ln:sib and alg:update-sibling.


Our implementation visits a node only when necessary. When updating along the ancestor path, it immediately stops when the influence of the expansion vanishes (line LABEL:ln:stop-criteria). When updating a single node, we use a delta update by leveraging the relations derived from Eq (7) and (8), avoiding a direct computation which may require or summations.
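The idea of a delta update with early stopping can be sketched on a small AND-OR tree. This is an illustration under stated assumptions, not the paper's alg:update: we assume a reaction node's cached value is its own cost plus the sum of its child molecule values, and a molecule node's cached value is the minimum over its child reactions. The `Node` class and `propagate` function are hypothetical names, and the minimum is recomputed by a plain scan for brevity.

```python
class Node:
    """Tree node; kind is "mol" (OR node) or "rxn" (AND node)."""

    def __init__(self, kind, value, parent=None):
        self.kind = kind
        self.value = value                 # cached value (assumed definition above)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def propagate(mol, new_value):
    """Set a molecule's value and update its ancestors; return nodes touched."""
    touched = 0
    delta = new_value - mol.value
    mol.value = new_value
    node = mol
    # Stop as soon as the change no longer affects the ancestor (delta == 0).
    while delta != 0 and node.parent is not None:
        rxn = node.parent
        rxn.value += delta                 # delta update: siblings are not re-summed
        parent_mol = rxn.parent
        best = min(r.value for r in parent_mol.children)
        delta = best - parent_mol.value
        parent_mol.value = best
        node = parent_mol
        touched += 2
    return touched


# Toy tree: root has reactions r1 (cost 1; children a, b) and r2 (cost 2; child c).
root = Node("mol", 0.0)
r1 = Node("rxn", 1.0, root)
a, b = Node("mol", 1.0, r1), Node("mol", 1.0, r1)
r2 = Node("rxn", 2.0, root)
c = Node("mol", 3.0, r2)
for r in (r1, r2):
    r.value += sum(m.value for m in r.children)   # full summation, done once
root.value = min(r.value for r in root.children)

touched = propagate(c, 4.0)   # r2 gets worse, but root still prefers r1
print(root.value, r2.value, touched)
```

In the toy tree, raising the value of `c` changes `r2` but leaves the root untouched, so the upward pass halts after one level. The example uses exactly representable values, so comparing `delta` with zero is safe here; a tolerance would be needed for general floating-point costs.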

Appendix B Guarantees on finding the optimal solution

Since Retro* is a variant of the A* algorithm, we can leverage existing results to prove the theoretical guarantees for Retro*. In this section, we first state the assumptions we make, and then prove the admissibility (thm:admissibility) of Retro*.

The theoretical results in this paper build upon the assumption that we can access , which is a lowerbound for for all molecules . Note that this is a weak assumption, since we know is a universal lowerbound for .

As we describe in Eq (6), can be decomposed into and , where is the exact cost of the partial route through which is already in the tree, and is the future costs for frontier nodes in the route which is a summation of a series of s. In practice we use in the summation, and arrive at , which is a lowerbound of ,  the following lemma.

Lemma. Assuming or its lowerbound is known for all encountered molecules , then the approximated future costs in Retro* is a lowerbound of true .

We re-state the admissibility result (thm:admissibility) in the main text and prove it with existing results in A* literature.

Theorem (Admissibility). Assuming or its lowerbound is known for all encountered molecules , alg:main is guaranteed to return an optimal solution, if the halting condition is changed to “the total cost of a found route is no larger than ”.

Proof. Combine lm:admissibility and Theorem 1 in the original A* paper (hart1968formal).

Appendix C Sample search trees and solution routes

In this section, we present two examples of the solution routes and the corresponding search trees for target molecule and produced by Retro*.

Solution route for target molecule is illustrated in the top/bottom sub-figure of fig:route_ex12, where a set of edges pointing from the same product molecule to reactant molecules represents a one-step chemical reaction. Molecules on the leaf nodes are all available.

Figure 6: Top/bottom: solution route produced by Retro* for molecule . Edges pointing from the same product molecule to the reactant molecules represent a one-step chemical reaction.

The search trees for molecule and are illustrated in fig:search_tree_ex1 and fig:search_tree_ex2. We use rectangular boxes to represent molecules. Yellow/grey/blue boxes indicate available/unexpanded/solved molecules. Rectangular arrows are used to represent reactions. The numbers on the edges pointing from a molecule to a reaction are the probabilities produced by the one-step model. Due to space limits, we only present the minimal tree which leads to a solution.

Figure 7: Search tree produced by Retro* for molecule . Rectangular boxes/arrows represent molecules/reactions. Yellow/grey/blue indicate available/unexpanded/solved molecules. Numbers on the edges are the probabilities produced by the one-step model.
Figure 8: Search tree produced by Retro* for molecule . Rectangular boxes/arrows represent molecules/reactions. Yellow/grey/blue indicate available/unexpanded/solved molecules. Numbers on the edges are the probabilities produced by the one-step model.

Appendix D Retro* for hierarchical task planning

As a general planning algorithm, Retro* can be applied to other machine learning problems as well, including theorem proving (yang2019learning) and hierarchical task planning (erol1996hierarchical) (or HTP), etc. Below, we conduct a synthetic experiment on HTP to demonstrate the idea. In the experiment, we are trying to search for a plan to complete a target task. The tasks (OR nodes) can be completed with different methods, and each method (AND nodes) requires a sequence of subtasks to be completed. Furthermore, each method is associated with a nonnegative cost. The goal is to find a plan with minimum total cost to realize the target task by decomposing it recursively until all the leaf task nodes represent primitive tasks that we know how to execute directly. As an example, to travel from home in city to hotel in city , we can take either flight, train or ship, each with its own cost. For each method, we have subtasks such as home airport , flight(), and airport hotel. These subtasks can be further realized by several methods.
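The AND-OR structure of such a problem can be illustrated with a small, entirely hypothetical travel domain. The sketch below computes the minimum plan cost by exhaustive memoized recursion, which is not Retro* itself but yields the ground-truth optimum against which a planner's solution can be compared. Task names, method costs and the assumption that primitive tasks cost nothing beyond the method cost are made up for illustration, and the task graph is assumed to be acyclic.

```python
import math
from functools import lru_cache

# Hypothetical domain: task (OR node) -> list of methods (AND nodes),
# each method given as (method cost, required subtasks).
METHODS = {
    "travel": [
        (300.0, ("to_airport", "from_airport")),   # take a flight
        (120.0, ("to_station", "from_station")),   # take a train
    ],
    "to_airport": [(40.0, ("ride_taxi",)), (15.0, ("ride_bus",))],
    "from_airport": [(30.0, ("ride_taxi",))],
    "to_station": [(10.0, ("ride_bus",))],
    "from_station": [(200.0, ("ride_taxi",))],
}
PRIMITIVE = {"ride_taxi", "ride_bus"}              # tasks executable directly


@lru_cache(maxsize=None)
def best_cost(task):
    """Minimum total cost to realize `task` (math.inf if it cannot be realized)."""
    if task in PRIMITIVE:
        return 0.0
    options = [
        cost + sum(best_cost(sub) for sub in subtasks)   # AND: all subtasks needed
        for cost, subtasks in METHODS.get(task, [])
    ]
    return min(options, default=math.inf)                # OR: pick the cheapest method


print(best_cost("travel"))
```

Here the cheaper top-level method (train) wins only narrowly, because one of its subtasks is expensive. A planner that commits to a method using its immediate cost alone can therefore be misled.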

As usual, we want to find a plan with small cost in limited time which is measured by the number of expansions of task nodes. We use the optimal halting condition as stated in theorem 3.4. We compare our algorithms against DFPN-E, the best performing baseline. The results are summarized in tbl:htn-succ and 3.

Time limit   15    20    25    30    35
Retro*       .67   .91   .96   .98   1.
Retro*-0     .50   .86   .95   .98   .99
DFPN-E       .02   .33   .74   .93   .97
Table 2: Success rate (higher is better) vs time limit.

Alg       Retro*   Retro*-0   DFPN-E
Avg. AR   1        1          1.5
Max. AR   1        1          3.9
Table 3: AR = Approximation ratio (lower is better), time limit=35.

As we can see, in terms of success rate, Retro* is slightly better than Retro*-0, and both of them are significantly better than DFPN-E. In terms of solution quality, we compute the approximation ratio (= solution cost / ground truth best solution cost) for every solution, and verify the theoretical guarantee in theorem 3.4 on finding the best solution.
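For reference, the approximation ratio and its summary statistics can be computed as in the short sketch below. The cost values are hypothetical and are not the ones behind Table 3.

```python
def approximation_ratios(solution_costs, optimal_costs):
    """Per-instance ratio of found solution cost to ground-truth best cost."""
    return [found / best for found, best in zip(solution_costs, optimal_costs)]


# Hypothetical costs for three planning instances.
found_costs = [10.0, 12.0, 30.0]
best_costs = [10.0, 8.0, 10.0]

ratios = approximation_ratios(found_costs, best_costs)
avg_ar = sum(ratios) / len(ratios)
max_ar = max(ratios)
print(ratios, avg_ar, max_ar)
```

A ratio of exactly 1 on every instance, as reported for Retro* and Retro*-0, means that every returned plan is optimal.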

Appendix E Related Works

Reinforcement learning algorithms (without planning) have also been considered for the retrosynthesis problem. schreck2019learning leverages self-play experience to fit a value function and uses policy iteration for learning an expansion policy. It is possible to combine it with a planning algorithm to achieve better performance in practice.

Learning to search from previous planning experiences has been well studied and applied to Go (silver2016mastering; silver2017mastering), Sokoban (guez2018learning) and path planning (chen2020learning). Existing methods cannot be directly applied to the retrosynthesis problem since the search space is more complicated, and the traditional representation where a node corresponds to a state is highly inefficient, as we mentioned in the discussion on MCTS in previous sections.