RetroXpert: Decompose Retrosynthesis Prediction like a Chemist

11/04/2020 ∙ by Chaochao Yan, et al. ∙ The University of Texas at Arlington Tsinghua University Tencent SUN YAT-SEN UNIVERSITY 7

Retrosynthesis is the process of recursively decomposing target molecules into available building blocks. It plays an important role in solving problems in organic synthesis planning. To automate or assist in the retrosynthesis analysis, various retrosynthesis prediction algorithms have been proposed. However, most of them are cumbersome and lack interpretability about their predictions. In this paper, we devise a novel template-free algorithm for automatic retrosynthetic expansion inspired by how chemists approach retrosynthesis prediction. Our method disassembles retrosynthesis into two steps: i) identify the potential reaction center of the target molecule through a novel graph neural network and generate intermediate synthons, and ii) generate the reactants associated with synthons via a robust reactant generation model. While outperforming the state-of-the-art baselines by a significant margin, our model also provides chemically reasonable interpretation.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Retrosynthesis of the desired compound is commonly constructed by recursively decomposing it into a set of available reaction building blocks. This analysis mode was formalized in the pioneering work Corey and Wipke (1969); Corey (1991) and now have become one of the fundamental paradigms in the modern chemical society. Retrosynthesis is challenging, in part due to the huge size of the search space. The reported synthetic-organic knowledge consists of in the order of reactions and compounds Gothard et al. (2012). On the other hand, the incomplete understanding of the reaction mechanism also increases the difficulty of retrosynthesis, which is typically undertaken by human experts. Therefore, it is a subjective process and requires considerable expertise and experience. However, molecules may have multiple possible retrosynthetic routes and it is challenging even for experts to select the most appropriate route since the feasibility of a route is often determined by multiple factors, such as the availability of potential reactants, reaction conditions, reaction yield, and potential toxic byproducts.

In this work, we focus on the single-step version (predict possible reactants given the product) of retrosynthesis following previous methods Liu et al. (2017); Dai et al. (2019); Zheng et al. (2020). Our method can be decomposed into two sub-tasks Corey and Wipke (1969); Pensak and Corey (1977): i) Breaking down the given target molecule into a set of synthons which are hypothetical units representing potential starting reactants in the retrosynthesis of the target, and ii) Calibrating the obtained synthons into a set of reactants, each of which corresponds to an available molecule.

Various computational methods Corey (1967); Christ et al. (2012); Bogevig et al. (2015); Szymkuć et al. (2016); Coley et al. (2017); Liu et al. (2017); Segler and Waller (2017b); Segler et al. (2018); Dai et al. (2019); Zheng et al. (2020); Lin et al. (2020); Shi et al. (2020) have been developed to assist in designing synthetic routes for novel molecules, and these methods can be broadly divided into two template-based and template-free categories. Template-based methods plan retrosynthesis based on hand-encoded rules or reaction templates. Synthia (formerly Chematica) relies on hand-encoded reaction transformation rules Szymkuć et al. (2016), and it has been experimentally validated as an efficient software for retrosynthesis Klucznik et al. (2018). However, it is infeasible to manually encode all the synthesis routes in practice considering the exponential growth in the number of reactions Segler et al. (2018). Reaction templates are often automatically extracted from the reaction databases and appropriate templates are selected to apply to the target Coley et al. (2017); Segler and Waller (2017b); Segler et al. (2018); Dai et al. (2019). The key process of these approaches is to select relevant templates for the given target. An obvious limitation is that these methods can only infer reactions within the chemical space covered by the template database, preventing them from discovering novel reactions Segler and Waller (2017a).

On the other hand, template-free methods Liu et al. (2017); Zheng et al. (2020); Lin et al. (2020)

treat the retrosynthesis as a neural machine translation problem, since molecules can be represented as SMILES

111 strings. Although simple and expressive, these models do not fit into the chemists’ analytical process and lack interpretability behind their predictions. Besides, such approaches fail to consider rich chemistry knowledge within the chemical reactions. For example, the generation order of reactants is undetermined in Liu et al. (2017); Zheng et al. (2020); Lin et al. (2020) since they ignore the correlation between synthons and reactants, resulting in slower and inferior model convergence. Similar to our method, the concurrent work G2Gs Shi et al. (2020) also presents a decomposition and generation two-step framework. G2Gs proposes to incrementally generate reactants from the associated synthons with a variational graph translation model. However, G2Gs can predict at most one bond disconnection which is not universal. Besides, G2Gs independently generates multiple reactants, which ignores the relationship between multiple reactants.

To overcome these challenges, inspired by the expert experience from chemists, we devise a two-step framework named as RetroXpert (Retrosynthesis eXpert) to automate the retrosynthesis prediction. Our model tackles it in two steps as shown in Figure 1. Firstly, we propose to identify the potential reaction center within the target molecule using a novel Edge-enhanced Graph Attention Network (EGAT). The reaction center is referred to as the set of bonds that will be disconnected in the retrosynthesis process. Synthons can be obtained by splitting the target molecule according to the reaction center. Secondly, the Reactant Generation Network (RGN) predicts associated reactants given the target molecule and synthons. Different from previous methods Liu et al. (2017); Zheng et al. (2020); Lin et al. (2020), the reactant generation order can be uniquely decided in our method, thanks to the intermediate synthons. What is more, we notice that the robustness of the RGN plays an important role. To robustify the RGN, we propose to augment the training data of RGN by incorporating unsuccessful predicted synthons. Our main contributions can be summarized as follows:

  • We propose to identify the potential reaction center with a novel Edge-enhanced Graph Attention Network (EGAT) which is strengthened with chemical knowledge.

  • By splitting the target molecule into synthons, the RGN is able to determine the generation order of reactants. We further propose to augment training data by introducing unsuccessfully predicted synthons, which makes RGN robust and achieves significant improvement.

  • On the standard USPTO-50K dataset Schneider et al. (2016), our method achieves 70.4% and 65.5% Top-1 accuracy when w/ and wo/ reaction type, respectively, which outperforms SOTA accuracy 63.2% (w/) and 52.6% (wo/) reported in Dai et al. (2019) by a large margin.

Figure 1: Pipeline overview. We conduct retrosynthesis in two closely dependent steps reaction center identification and reactant generation. The first step aims to identify the reaction center of the target molecule and generates intermediate synthons accordingly. The second step is to generate the desired set of reactants. Note that a molecule can be represented in two equivalent representations: molecule graph and SMILES string.

2 Methodology

Given a molecule graph G with nodes (atoms), we denote the matrix representation of node features as

, the tensor representation of edge features as

, and the adjacency matrix as . and are feature dimensions of atoms and bonds, respectively. We denote as the product, synthons, and reactants in the reaction formulation, respectively. The single-step retrosynthesis problem can be described as given the desired product P, seeking for a set of reactants that can produce the major product P through a valid chemical reaction. It is denoted as (predict R given P), which is the reverse process of the forward reaction prediction problem Bradshaw et al. (2018); Jin et al. (2017) that predicts the outcome products given a set of reactants.

As illustrated in Figure 1, our method decomposes the retrosynthesis task () into two closely dependent steps reaction center identification () and reactant generation (). The first step is to identify the potential reaction bonds which will be disconnected during the retrosynthesis, and then the product P can be split into a set of intermediate synthons . Note that each synthon can be regarded as the substructure of a reactant . The second step is to transform synthons into associated reactants . Although the intermediate synthons are not needed in retrosynthesis, decomposing the original retrosynthesis task () into two dependent procedures can have multiple benefits, which will be elaborated thoroughly in the following sections.

2.1 EGAT for reaction center identification

We treat the reaction center identification as a graph-to-graph transformation problem which is similar to the forward reaction outcome prediction Jin et al. (2017). To achieve this, we propose a graph neural network named Edge-enhanced Graph Attention Network (EGAT) which takes the molecule graph G

as input and predicts disconnection probability for each bond, and this is the main task. Since a product may be produced by different reactions, there can be multiple reaction centers for a given product and each reaction center corresponds to a different reaction. While current message passing neural networks

Gilmer et al. (2017) are shallow and capture only local structure information for each node, and it is difficult to distinguish multiple reaction centers without global information. To alleviate the problem, we add a graph-level auxiliary task to predict the total number of disconnection bonds.

As shown in Figure 2, distinct from the Graph Attention Network (GAT) Veličković et al. (2018) which is designed to learn node and graph-level embeddings, our proposed EGAT also learns edge embedding. It identifies the reaction center by predicting the disconnection probability for each bond taking its edge embedding as input. Given the target , the EGAT layer computes node embedding and edge embedding from previous layer’s embeddings and by following equations:


Figure 2: Embedding computation flows of GAT and the proposed EGAT.

where , , , and are trainable parameters, means concatenation operation, is all neighbor nodes of the node , is the attention weight between the node and its neighbor node , and as well as are the output node and edge representations, respectively. Initial input embeddings

are the input node and edge feature vectors

, respectively, which will be detailed later, and in this special case the dimensions and equals to the dimensions of associated features, respectively.

After stacking multiple EGAT layers, we obtain the final edge representation for the chemical bond between nodes and , as well as the node representation for each node . To predict the disconnection probability for a bond, we perform a fully-connected layer parameterized by and a activation layer to and its disconnection probability is Note that the multi-head attention mechanism can also be applied like the original GAT. The optimization goal for bond disconnection prediction is to minimize the negative log-likelihood between prediction and ground-truth

through the binary cross entropy loss function:


where is the total number of training reactions and bond exists if the associated adjacency element is nonzero. The ground truth means the bond is disconnected otherwise remaining the same during the reaction. Bond disconnection labels can be obtained by comparing molecule graphs of target and reactants.

The input of the auxiliary task is the graph-level representation , which is the output of the READOUT operation over all learned node representations. We adopts an arithmetic mean as the READOUT function and it works well in practice.

Similarly, a fully-connected layer parameterized by and a activation function are applied to to predict the total number of disconnected bonds, which is solved as a classification problem here. Each category represents the exact number of disconnected bonds, so there are 1+ classification categories. is the maximum number of possible disconnected bonds in the retrosynthesis. We denote the output as . The total number of disconnected bonds for each target molecule is predicted as:


The ground truth number of disconnections for molecule is denoted as , the indicator function is 1 if equals to otherwise it is 0, and the cross entropy loss for the auxiliary task:


Finally, the overall loss function for the EGAT is where is fixed to 1 in our study since we empirically find that is not a sensitive hype-parameter.

Atom and bond features.

The atom feature consists of a series of general atom information such as atom type, hybridization, and formal charge, while the bond feature is composed of chemical bond information like bond type and conjugation (see Appendix B for details). These features are similar to those used in Yang et al. (2019)

which is for chemical property prediction. We compute these features using the open-source toolkit RDKit

222 To fully utilize the provided rich atom-mapping information of the USPTO datasets Schneider et al. (2016) Lowe (2018), we add a semi-templates indicator to atom feature. For retrosynthesis dataset with given reaction type, a type indicator is also added to the atom feature.


For atom-mapped USPTO datasets, reaction templates are extracted from reaction data like previous template-based methods Coley et al. (2017); Segler et al. (2018); Dai et al. (2019). However, we are not interested in full reaction templates since these templates are often too specific. There are as many as 11,647 templates for the USPTO-50K train data Dai et al. (2019). Only the product side of templates are kept instead, which we name as semi-templates. Since reaction templates are closely related to the exact reaction, the semi-templates indicator expected to play a significant role in reaction center identification.

The semi-templates can be considered as subgraph patterns within molecules. We build a database of semi-templates from training data and find all appeared semi-templates within each molecule. For each atom, we mark the indicator bits associated with appeared semi-templates. Note that each atom within a molecule may belong to several semi-templates since these semi-templates are not mutually exclusive. Although reaction templates are introduced, our method is still template-free since i) only semi-templates are incorporated and our method does not rely on full templates to plan the retrosynthesis, and ii) our EGAT still works well in the absence of semi-templates, with only slight performance degradation (Appendix D.2).

2.2 Reactant generation network

Once the reaction center has been identified, synthons can be obtained by applying bond disconnection to decompose the target graph. Since each synthon is basically a substructure within the reactant, we are informed of the total number of reactants and substructures of these reactants. The remaining task is much simpler than the original in which even the number of reactants is unknown.

Specifically, task is to generate the set of desired reactants given obtained synthons. Based on commonsense knowledge of chemical reaction, we propose that the ideal RGN should meet following three requirements: R1) be permutation invariant and generate the same set of reactants no matter the order of synthons, R2) all given information should be considered when generating any reactant, and R3) the generation of each reactant also depends on those previously generated reactants.

To fulfill these requirements, we represent molecules in SMILES and formulate as a sequence-to-sequence prediction problem. We convert synthon graphs to SMILES representations using RDKit, though these synthons may be chemically invalid. As in Figure 3, source sequence is the concatenation of possible reaction types, canonical SMILES of the product, and associated synthons. The target sequence is the desired reactants arranged according to synthons.

Figure 3: Illustration of source and target sequences. <rxn_k> is the th reaction type if applicable. The product and synthons are separated with a special <link> token. The order of reactants is arranged according to synthons. SMILES strings are joined with a dot following RDkit.

We approximate the requirement R1 by augmenting train samples with reversely arranged synthons and reactants as shown in Figure 3. Our empirical studies demonstrate that such approximation works pretty well in practice. To satisfy the requirement R2, the encoder-decoder attention mechanism Bahdanau et al. (2014) Vaswani et al. (2017) is employed, which allows each position in the target sequence attends to all positions in the source sequence. A similar masked self-attention mechanism Vaswani et al. (2017), which masks future positions in the decoder, is adopted to make the RGN meet the requirement R3.

Motivated by the success of Transformer Vaswani et al. (2017) in natural machine translation, we build the RGN based on the Transformer module. Transformer is a sequence-to-sequence model equipped with two types of attention mechanisms: self-attention and encoder-decoder attention Vaswani et al. (2017). Transformer is also adapted for reaction outcome prediction Schwaller et al. (2019) and retrosynthesis Zheng et al. (2020), in which both products and reactants are represented in SMILES. We include a brief description of Transformer in Appendix C.

Determine the generation order of reactants.

For the first time, the generation order of reactants can be determined by aligning reactants in the target with synthons in the source, thanks to intermediate synthons which are associated with reactants uniquely. While the generation order of reactants is undetermined in previous methods Liu et al. (2017); Zheng et al. (2020); Lin et al. (2020), which naively treats the sequence-to-sequence model as a black box. The uncertainty of the generation order makes their models hard to train.

Robustify the RGN.

We find the EGAT suffers from distinguishing multiple coexisting reaction centers, which is the major bottleneck of our method. As a result of the failure of identifying the reaction center, the generated synthons are different from the ground truth. To make our RGN robust enough and able to predict the desired reactants even if the EGAT fails to recognize the reaction center, we further augment RGN training data by including those unsuccessfully predicted synthons on training data. We do not reverse the order of synthons for these augmentation samples like in Figure 3. The intuition behind is that EGAT tends to make similar mistakes on training and test datasets since both datasets follow the same distribution. This method can make our RGN able to correct reaction center prediction error and generate the desired set of reactants.

3 Experiments

Dataset and preprocessing.

We evaluate our method on USPTO-50K Schneider et al. (2016) and USPTO-full Lowe (2018) to verify its effectiveness and scalability. USPTO-50K consists of 50K reactions annotated with 10 reaction types (see appendix A for type distribution), which is derived from USPTO granted patents Lowe (2012). It is widely used in previous retrosynthesis work. We adopt the same training/validation/test splits in 8:1:1 as Coley et al. (2017); Dai et al. (2019). For RGN training data, we add an extra 28K samples of which synthons are reversed as shown in Figure 3 if there are at least two synthons. There are 68K training samples for RGN, which is still denoted as USPTO-50K in the following content. The USPTO-full consists of 950K cleaned reactions from the USPTO 1976-2016 Lowe (2018), which has 1,808,937 raw reactions without reaction types. Reactions with multiple products are duplicated into multiple single-product ones. After removing invalid reactions (empty reactant and missing atom mappings) and deduplication, we can obtain 950K reactions 333Code and processed USPTO-full data are available at, which are randomly partitioned into training/validation/test sets in 8:1:1.

For the EGAT, we build molecule graphs using DGL Wang et al. (2019) and extract atom and bond features with RDkit. By comparing molecule graphs of product and reactants, we can identify disconnection bonds within the product graph and obtain training labels for both main and auxiliary tasks. This comparison can be easily done for atom-mapped reactions. For reactions without atom-mapping, a substructure matching algorithm in RDKit can be utilized to accomplish the comparison. We use RDChiral Coley et al. (2019) to extract super general reaction templates, and obtain 1859 semi-templates for USPTO-50K training data. Semi-templates that appear less than twice are filtered and finally 654 semi-templates are obtained. As for the RGN, the product molecule graph is divided into synthon graphs according to the ground truth reaction center, then are converted into SMILES strings. The input sequence of RGN is the concatenation of the possible reaction type, product SMILES string, and synthon SMILES strings as illustrated in Figure 3.


All reactions are represented in canonical SMILES, which are tokenized with the regular expression in Schwaller et al. (2018). We use DGL Wang et al. (2019) and OpenNMT Klein et al. (2017) to implement our EGAT and RGN models, respectively. As for the EGAT, we stack three identical four-head attentive layers of which the hidden dimension is 128. All embedding sizes in EGAT are set to 128, such as , , and . The

is set to be two to cover 99.97% training samples. We train the EGAT on USPTO-50K for 80 epochs. EGAT parameters are optimized with Adam

Kingma and Ba (2014) with default settings, and the initial learning rate is and it is scheduled to multiply 0.2 every 20 epochs. We train the RNG for time steps, and it takes about 30 hours on two GTX 1080 Ti GPUs. We save a checkpoint of RGN parameters every steps and average the last 10 checkpoints as the final model. We run all experiments for three times and report the means of their performance in default.

Evaluation metric.

The Top-

accuracy is used as the evaluation metric for retrosynthesis. Beam search

Tillmann and Ney (2003) strategy is adopted to keep top K predictions throughout the reactant generation process. K is set to 50 in all experiments. The generated reactants are represented in canonical SMILES. A correct predicted set of reactants must be exactly the same as the ground truth reactants.

3.1 Reaction center identification results

To verify the effectiveness of edge-enhanced attention mechanism, we also include the ablation study by removing edge embedding when computing the coefficient . Results are reported in Table 1. The auxiliary task (Aux) can successfully predict the number of disconnection bonds for 99.2% test molecules given the reaction type (Type) while 86.4% if not given.

Type EAtt Accuracy (%)
Main Aux EGAT
73.9 99.1 85.7
74.4 99.2 86.0
50.0 86.1 64.3
51.5 86.4 64.9
Table 1: Results of EGAT on USPTO-50K dataset. EAtt and Aux are the short for edge-enhanced attention and auxiliary task, respectively. EGAT

consists of both main and auxiliary tasks. The prediction is binarized with a threshold of 0.5 if main task alone.

As for the main task (Main) alone, its prediction accuracy is 74.4% w/ reaction type and 51.5% wo/ reaction type. However, if we adopt the prediction from the auxiliary task as the prior of the number of disconnection bonds, and select the most probable disconnection bonds (EGAT), then the prediction accuracy can be boosted to 86.0% (w/) and 64.9% (wo/), respectively. The edge-enhanced attention (EAtt) can consistently improve the model’s performance in all tasks. The improvement is more significant when the reaction type is unknown, so our EGAT is more practical in real world applications without reaction types. This demonstrates that the reaction type information plays an important role in the retrosynthesis. The reactions of the same type usually share similar reaction patterns (involved atoms, bonds, and functional groups), it is much easier to recognize the reaction center if the reaction type is given as the prior knowledge. We also verify the importance of semi-templates in Appendix D.2.

3.2 Reactant prediction results

To robustify the RGN as described in the paragraph Robustify the RGN, we also conduct the prediction on the EGAT training data for USPTO-50K (40K), and the prediction accuracy is 89.0% for the reaction type conditional setting. We can obtain about 4K unsuccessful synthon predictions as augmentation samples (Aug), adding the original 68K RGN training data, the total RGN training data size is 72K. For the unconditional setting, the EGAT accuracy is 70.0% and there are 12K augmentation samples, and the total RGN training size is 80K in this case. We train RGN models on the USPTO-50K with/without the augmentation (Aug), and report results in Table 2.

RGN evaluation

For the RGN evaluation, the RGN input consists of the ground truth synthons. Therefore the results in Table 2 indicate the upper bound of our method’s overall retrosynthesis performance. The proposed augmentation strategy does not always improve the upper bound. Without given reaction type, the RGN generally performs worse with the augmentation due to the introduced dirty training samples. However, when given reaction type, this augmentation boosts its prediction accuracy. We presume that it is because the reaction type plays a significant role. The RGN learns to put more attention on the reaction type and product instead of synthons to generate the reactants.

Type Aug Training size Top- accuracy (%)
1 3 5 10 20 50
68K 72.9 86.5 88.3 89.5 90.4 91.6
72K 73.4 86.7 88.5 89.7 90.9 92.1
68K 71.9 85.7 87.5 88.9 90.0 91.0
80K 70.9 84.6 86.4 88.2 89.4 90.6
Table 2: prediction results. Aug denotes training data augmentation. Evaluation results are based on ground-truth synthons as the RGN input.

Retrosynthesis evaluation

To evaluate the overall retrosynthesis prediction accuracy, the generated synthons from instead of the ground truth are input into the RGN. In this way, we only need to compare the predicted reactants with the ground truth ones, without considering if the reaction center predictions correct or not. We report the retrosynthesis results in Tables 3. Our method RetroXpert achieves impressive performance on the test data. Specifically, when given reaction types, our proposed method achieves 70.4% Top-1 accuracy, which outperforms the SOTA Top-1 accuracy 63.2% Dai et al. (2019) by a large margin. Note that our Top-1 accuracy 70.4% is quite close to the upper bound 73.4% in Table 2, which indicates the proposed augmentation strategy in Robustify the RGN is considerably effective. As for results wo/ given reaction type, our model improves the SOTA Top-1 accuracy from 52.6% Dai et al. (2019) to 65.6%. To verify the effectiveness of augmentation, we conduct ablation study and report results in Appendix D.3.

While our method outperforms in Top-1, Top-3, and Top-5 accuracy, template-based methods GLN Dai et al. (2019) and RetroSim Coley et al. (2017) are better at Top-20 and Top-50 predictions since they enumerate multiple different reaction templates for each product to increase the hit rate. While our RetroXpert is currently designed to find the best set of reactants. To increase the diversity, we can design new strategies to enumerate multiple reaction centers for each product. This is left as the feature work.

We notice that the gap between Top-2 and Top-1 accuracy is around 10%. After investigating these 10% predictions by experienced chemists from the synthetic chemistry perspective, we find about 9/10 these Top-1 predictions are actually reasonable (see Appendix E for details). This indicates that our method can learn general chemical reaction knowledge, which is beyond the given ground truth.

Methods Top- accuracy (%)
1 3 5 10 20 50
Reaction types given as prior on USPTO-50K
Seq2Seq Liu et al. (2017) 37.4 52.4 57.0 61.7 65.9 70.7
RetroSim Coley et al. (2017) 52.9 73.8 81.2 88.1 91.8 92.9
NeuralSym Segler and Waller (2017b) 55.3 76.0 81.4 85.1 86.5 86.9
SCROP Zheng et al. (2020) 59.0 74.8 78.1 81.1 - -
GLN Dai et al. (2019) 63.2 77.5 83.4 89.1 92.1 93.2
RetroXpert 70.4 83.4 85.3 86.8 88.1 89.3
Reaction type unknown on USPTO-50K
RetroSim Coley et al. (2017) 37.3 54.7 63.3 74.1 82.0 85.3
NeuralSym Segler and Waller (2017b) 44.4 65.3 72.4 78.9 82.2 83.1
SCROP Zheng et al. (2020) 43.7 60.0 65.2 68.7 - -
GLN Dai et al. (2019) 52.6 68.0 75.1 83.1 88.5 92.1
RetroXpert 65.6 78.7 80.8 83.3 84.6 86.0
Retrosynthesis results on USPTO-full.
GLN* Dai et al. (2019) 39.0 50.1 55.3 61.3 65.9 69.1
SCROP* Zheng et al. (2020) 45.7 60.7 65.3 70.1 73.3 76.0
RetroXpert 49.4 63.6 67.6 71.6 74.6 77.0
Table 3: Retrosynthesis results compared with the existing methods. NeuralSym Segler and Waller (2017b) results are copied from Dai et al. (2019). *We run the self-implemented SCROP Zheng et al. (2020) and official implementation of GLN Dai et al. (2019) on the USPTO-full dataset.

4 Large scale experiments

To demonstrate the scalability of our method, we also experiment on the USPTO-full dataset, which consists of 760K training data. We extract 75,129 semi-templates and keep only 3,788 ones that appear at least 10 times. We set as 5 to cover 99.87% training data. We obtain 1.35M training data after reversing synthons. The final accuracy of the on training set is 60.5%, and there are 0.3M unsuccessful synthon data and the total RNG training data size is 1.65M. We train the RNG for 500,000 time steps on USPTO-full while keeping the other settings the same as those in section 3. We run the official implementation of GLN following their instructions Dai et al. (2019), as well as the self-implemented SCROP Zheng et al. (2020) on the USPTO-full dataset. Experimental results are reported at the bottom of Table 3. Our method again significantly outperforms the SCROP and GLN, which demonstrates that our model scales well to the large real-world dataset. Note that both template-free methods SCROP and RetroXpert outperform the GLN significantly, which may indicate the scalability of template-based methods is very limited.

5 Prediction visualization

For EGAT, how the auxiliary task helps to identify the reaction center is illustrated in Figure 4. Note that in the first example the two colored bonds and their surrounding structures are very similar. Current shallow GNNs consider only local information and fails to distinguish the true reaction center. Under the guidance of the auxiliary task, EGAT is able to identify the true reaction center. Figure 5 demonstrates the robustness of our method. Even if the predicted synthons are different from the ground truth, the RGN still successfully generates desired reactants.

Figure 4: Importance of the auxiliary task. Pink indicates the reaction center along with disconnection probability predicted by the EGAT main task. Blue cross indicates the ground truth disconnection. Our EGAT successfully finds the desired reaction center under the guidance of the auxiliary task.
Figure 5: Robustness of the RGN. Top-1 prediction is the same to ground truth.

6 Discussion

One major common limitation of current retrosynthesis work is the lack of reasonable evaluation metrics. There may be multiple valid ways to synthesize a product, while the current evaluation metric considers only the given reaction. More evaluation metrics should be proposed in the future.

Broader Impact

Our proposed new retrosynthesis method RetroXpert solves the retrosynthesis prediction in two steps like chemists do, and it achieves impressive performance. It is template-free and is very scalable to the large real-world dataset. We believe that our work will greatly inspire and advance related research, such as forward reaction prediction and drug discovery. The researchers and industry experts in drug discovery will benefit most from this research since the retrosynthesis prediction is an important part of drug discovery. We are not aware anyone may be put at disadvantage from this research. Our method does not take advantage of the data bias, it is general and scalable.

We would like to thank Hanjun Dai for providing the source implementation of GLN. This work was partially supported by US National Science Foundation IIS-1718853, the CAREER grant IIS-1553687 and Cancer Prevention and Research Institute of Texas (CPRIT) award (RP190107).


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.2.
  • [2] A. Bogevig, H. Federsel, F. Huerta, M. G. Hutchings, H. Kraut, T. Langer, P. Low, C. Oppawsky, T. Rein, and H. Saller (2015) Route design in the 21st century: the icsynth software tool as an idea generator for synthesis prediction. Organic Process Research & Development 19 (2), pp. 357–368. Cited by: §1.
  • [3] J. Bradshaw, M. J. Kusner, B. Paige, M. H. Segler, and J. M. Hernández-Lobato (2018) A generative model for electron paths. arXiv preprint arXiv:1805.10970. Cited by: §2.
  • [4] C. D. Christ, M. Zentgraf, and J. M. Kriegl (2012) Mining electronic laboratory notebooks: analysis, retrosynthesis, and reaction based enumeration. Journal of chemical information and modeling 52 (7), pp. 1745–1756. Cited by: §1.
  • [5] C. W. Coley, W. H. Green, and K. F. Jensen (2019) RDChiral: an rdkit wrapper for handling stereochemistry in retrosynthetic template extraction and application. Journal of chemical information and modeling 59 (6), pp. 2529–2537. Cited by: §3.
  • [6] C. W. Coley, L. Rogers, W. H. Green, and K. F. Jensen (2017) Computer-assisted retrosynthesis based on molecular similarity. ACS central science 3 (12), pp. 1237–1245. Cited by: §1, §2.1, §3, §3.2, Table 3.
  • [7] E. J. Corey and W. T. Wipke (1969) Computer-assisted design of complex organic syntheses. Science 166 (3902), pp. 178–192. Cited by: §1, §1.
  • [8] E. J. Corey (1967) General methods for the construction of complex molecules. Pure and Applied chemistry 14 (1), pp. 19–38. Cited by: §1.
  • [9] E. J. Corey (1991) The logic of chemical synthesis: multistep synthesis of complex carbogenic molecules (nobel lecture). Angewandte Chemie International Edition in English 30 (5), pp. 455–465. Cited by: §1.
  • [10] H. Dai, C. Li, C. Coley, B. Dai, and L. Song (2019) Retrosynthesis prediction with conditional graph logic network. In Advances in Neural Information Processing Systems, pp. 8870–8880. Cited by: item 3), §1, §1, §2.1, §3, §3.2, §3.2, Table 3, §4.
  • [11] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In

    Proceedings of the 34th International Conference on Machine Learning-Volume 70

    pp. 1263–1272. Cited by: §2.1.
  • [12] C. M. Gothard, S. Soh, N. A. Gothard, B. Kowalczyk, Y. Wei, B. Baytekin, and B. A. Grzybowski (2012) Rewiring chemistry: algorithmic discovery and experimental validation of one-pot reactions in the network of organic chemistry. Angewandte Chemie International Edition 51 (32), pp. 7922–7927. Cited by: §1.
  • [13] W. Jin, C. Coley, R. Barzilay, and T. Jaakkola (2017) Predicting organic reaction outcomes with weisfeiler-lehman network. In Advances in Neural Information Processing Systems, Cited by: §2.1, §2.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.
  • [15] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL, External Links: Link, Document Cited by: §3.
  • [16] T. Klucznik, B. Mikulak-Klucznik, M. P. McCormack, H. Lima, S. Szymkuć, M. Bhowmick, K. Molga, Y. Zhou, L. Rickershauser, E. P. Gajewska, et al. (2018) Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory. Chem 4 (3), pp. 522–532. Cited by: §1.
  • [17] K. Lin, Y. Xu, J. Pei, and L. Lai (2020) Automatic retrosynthetic route planning using template-free models. Chemical Science 11 (12), pp. 3355–3364. Cited by: §1, §1, §1, §2.2.
  • [18] B. Liu, B. Ramsundar, P. Kawthekar, J. Shi, J. Gomes, Q. Luu Nguyen, S. Ho, J. Sloane, P. Wender, and V. Pande (2017) Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS central science 3 (10), pp. 1103–1113. Cited by: §1, §1, §1, §1, §2.2, Table 3.
  • [19] D. M. Lowe (2012) Extraction of chemical structures and reactions from the literature. Ph.D. Thesis, University of Cambridge. Cited by: §3.
  • [20] D. Lowe (2018) Chemical reactions from us patents (1976-sep2016). Cited by: §2.1, §3.
  • [21] D. A. Pensak and E. J. Corey (1977)

    LHASA—logic and heuristics applied to synthetic analysis

    ACS Publications. Cited by: §1.
  • [22] N. Schneider, N. Stiefl, and G. A. Landrum (2016) What’s what: the (nearly) definitive guide to reaction role assignment. Journal of chemical information and modeling 56 (12), pp. 2336–2346. Cited by: item 3), §2.1, §3.
  • [23] P. Schwaller, T. Gaudin, D. Lanyi, C. Bekas, and T. Laino (2018) “Found in translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical science 9 (28), pp. 6091–6098. Cited by: §3.
  • [24] P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas, and A. A. Lee (2019) Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS central science 5 (9), pp. 1572–1583. Cited by: §C.1, §2.2.
  • [25] M. H. Segler, M. Preuss, and M. P. Waller (2018) Planning chemical syntheses with deep neural networks and symbolic ai. Nature 555 (7698), pp. 604–610. Cited by: §1, §2.1.
  • [26] M. H. Segler and M. P. Waller (2017) Modelling chemical reasoning to predict and invent reactions. Chemistry–A European Journal 23 (25), pp. 6118–6128. Cited by: §1.
  • [27] M. H. Segler and M. P. Waller (2017) Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry–A European Journal 23 (25), pp. 5966–5971. Cited by: §1, Table 3.
  • [28] C. Shi, M. Xu, H. Guo, M. Zhang, and J. Tang (2020) A graph to graphs framework for retrosynthesis prediction. arXiv preprint arXiv:2003.12725. Cited by: §1, §1.
  • [29] S. Szymkuć, E. P. Gajewska, T. Klucznik, K. Molga, P. Dittwald, M. Startek, M. Bajczyk, and B. A. Grzybowski (2016) Computer-assisted synthetic planning: the end of the beginning. Angewandte Chemie International Edition 55 (20), pp. 5904–5937. Cited by: §1.
  • [30] C. Tillmann and H. Ney (2003) Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational linguistics 29 (1), pp. 97–133. Cited by: §3.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Appendix C, §2.2, §2.2.
  • [32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph Attention Networks. International Conference on Learning Representations. Cited by: §2.1.
  • [33] M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma, et al. (2019)

    Deep graph library: towards efficient and scalable deep learning on graphs

    arXiv preprint arXiv:1909.01315. Cited by: §3, §3.
  • [34] K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, A. Guzman-Perez, T. Hopper, B. Kelley, M. Mathea, et al. (2019) Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling 59 (8), pp. 3370–3388. Cited by: Table 7, §2.1.
  • [35] S. Zheng, J. Rao, Z. Zhang, J. Xu, and Y. Yang (2020)

    Predicting retrosynthetic reactions using self-corrected transformer neural networks

    Journal of Chemical Information and Modeling. Cited by: §1, §1, §1, §1, §2.2, §2.2, Table 3, §4.

Appendix A Dataset information

The USPTO-50K dataset is annotated with 10 reaction types, the distribution of reaction types is displayed in Table 4. The distribution is extremely unbalanced. We also report the statistics of the number of disconnection bonds for training reactions in Tables 5 and 6.

Reaction type Reaction type name # Examples
1 Heteroatom alkylation and arylation 15204
2 Acylation and related processes 11972
3 C-C bond formation 5667
4 Heterocycle formation 909
5 Protections 672
6 Deprotections 8405
7 Reductions 4642
8 Oxidations 822
9 Functional group interconversion (FGI) 1858
10 Functional group addition (FGA) 231
Table 4: Distribution of 10 recognized reaction types.
# Disconnection bonds 0 1 2 3
# Reactions 11296 27851 849 12
Accumulative percent 28.23% 97.85% 99.97% 100.00%
Table 5: Statistics of the number of disconnection bonds for the USPTO-50K training reactions.
# Disconnection bonds 0 1 2 3 4 5 6
# Reactions 161500 485449 88146 19303 5687 2032 1000
Accumulative percent 21.16% 84.77% 96.33% 98.86% 99.60% 99.87% 100.00%
Table 6: Statistics of the number of disconnection bonds for the USPTO-full training reactions.

Appendix B Atom and bond features

Feature Description Size
Atom type Type of atom (ex. C, N, O), by atomic number. 100
# Bonds Number of bonds the atom is involved in. 6
Formal charge Integer electronic charge assigned to atom. 5
Chirality Unspecified, tetrahedral CW/CCW, or other. 4
# Hs Number of bonded Hydrogen atom. 5
Hybridization sp, sp2, sp3, sp3d, or sp3d2. 5
Aromaticity Whether this atom is part of an aromatic system. 1
Atomic mass Mass of the atom, divided by 100. 1
Semi-templates Semi-templates that the atom is within. 654
Reaction type The specified reaction type if it exists. 10
Table 7:

Atom Features used in EGAT. All features are one-hot encoding, except the atomic mass is a real number scaled to be on the same order of magnitude. The upper part is general atom feature following

Yang et al. (2019), the lower part is specifically extended for the retrosynthesis prediction. Semi-templates size is 654 for the USPTO-50K dataset.
Feature Description Size
Bond type Single, double, triple, or aromatic. 4
Conjugation Whether the bond is conjugated. 1
In ring Whether the bond is part of a ring. 1
Stereo None, any, E/Z or cis/trans. 6
Table 8: Bond features used in EGAT. All features are one-hot encoding.

Appendix C Transformer

The transformer Vaswani et al. (2017) is an autoregressive encoder-decoder model built with multi-head attention layers and position-wise feed-forward layers. As illustrated in Figure 6, the encoder is composed of stacked multi-head self-attention layers and position-wise feed-forward layers. The encoder self-attention layers attend the full input sequence and iteratively transform it into a latent representation with the self-attention mechanism. The decoder is similar to the encoder. In addition to multi-head self-attention layers and position-wise feed-forward layers, the multi-head encoder-decoder attention layers are inserted to perform cross attention over the encoder output. Different from the encoder self-attention layers, the decoder adopts the masked self-attention which prevents the decoder positions from attending future positions. The encoder-decoder attention and masked self-attention layers enable the decoder to combine the information from the source sequence and the target sequence that has been produced to make the output prediction. We refer readers to Vaswani et al. (2017) and The Illustrated Transformer for comprehensive details about the Transformer.

Figure 6:

Transformer model architecture. The residual connection and layer normalization layer are omitted in the illustration for simplification.

The transformer removes all recurrent units and introduces a positional encoding to account for the order information of the sequence. Positional encoding adds a position-dependent signal to the token embedding of size to discriminate the position of different tokens in the sequence:


where is the token position and is the dimension of the positional encoding.

The transformer adopts a scale dot-product attention as the attention formulation, which compute the attention weighted output by taking as input the matrix represented keys K, values V, and queries Q:


where the is the dimension of Q and K.

c.1 Parameters setting

We compose both the encoder and decoder of four layers of size 256. The label smoothing parameter is set to 0 since a nonzero label smoothing parameter will deteriorate the model’s discrimination Schwaller et al. (2019). We adopt eight attention heads as suggested. We set the batch size to 4096 tokens and accumulate gradients over four batches.

Appendix D More experimental results

d.1 Per category performance

Figure 7: Top-10 retrosynthesis accuracy per reaction category with given reaction type.
Figure 8: Top-10 retrosynthesis accuracy per reaction category with unknown reaction type.

We investigate the retrosynthesis accuracy per reaction category on USPTO-50K to have a better understanding of our model’s capability in different types of reaction. We report Top-10 accuracy for a fair comparison following baseline methods, though our method is not designed for diverse predictions. Although the reaction types are highly imbalanced as shown in Table 4, our method performs well in all reaction categories as displayed in Figure 7, which indicates our method is not that sensitive to the number of samples of the same reaction type. Particularly, our method achieves comparable or better performance in 9 out 10 reaction categories compared with the template-based methods RetroSim and GLN. For rare categories like class 4 and 10, our methods is much better than the GLN. Similar conclusion also applies to the unknown reaction type scenario as illustrated in Figure 8. Note that for the most common type 1 and 2, the type prior does not help much to our method’s performance, which suggests that our method may exploit training reactions to the maximum extent given enough reaction data.

d.2 Ablation study of atom features

Our method can also work without semi-templates. When removing semi-templates, the EGAT performance drops slightly as listed in Table 9. The semi-templates feature is not a must component of our method, but it is definitely helpful for finding the reaction center.

Type Semi-templates Accuracy (%)
Main Aux EGAT
70.0 99.2 84.0
74.4 99.2 86.0
43.3 83.8 59.9
51.5 86.4 64.9
Table 9: Results of atom features ablation study. Aux is the short for auxiliary. EGAT consists of both main and auxiliary tasks. The prediction is binarized with a threshold of 0.5 if the main task alone.

d.3 Ablation study of augmentation in RGN

Training Aug Test Aug Top- accuracy (%)
1 3 5 10 20 50
Reaction type given as prior on USPTO-50K
63.5 75.2 76.7 77.6 78.3 79.3
64.0 75.7 77.5 78.3 79.2 80.2
63.8 75.1 76.5 77.4 78.4 79.3
70.4 83.4 85.3 86.8 88.1 89.3
Reaction type unknown on USPTO-50K
48.1 56.3 57.3 58.0 58.7 59.1
48.4 56.9 57.9 58.8 59.6 60.2
47.6 56.1 57.0 57.9 58.5 59.2
65.6 78.7 80.8 83.3 84.6 86.0
Table 10: Retrosynthesis results of augmentation ablation study.

We robustify the RGN by including unsuccessfully predicted synthons by EGAT as the RGN training data augmentation (Training Aug). When evaluating the retrosynthesis on test data, we also gather predicted synthons from the EGAT to form the RGN input sequences without considering if the reaction center identification successful or not, and we denote this evaluation strategy as the test augmentation (Test Aug) with a little abused use of augmentation. Without the test augmentation, we must first evaluate reaction center identification results, and only try to predict reactants for the product whose reaction center is successfully identified. In this case, the overall retrosynthesis performance will be capped by the EGAT. The ablation study results are listed in Table 10.

Generally, applying only training or test augmentation makes only tiny influence on the retrosynthesis performance in both cases (w/ or wo/ reaction type). While the retrosynthesis performance will be boosted significantly if both training and test augmentation are adopted. This is not unexpected. Exploiting training data augmentation makes the RGN robust and also assigns a correction ability to the RGN. The correction ability will take effect only if the test augmentation is also employed.

Appendix E Top-1 and Top-2 predictions

About 10% Top-1 predictions by our model have been considered as wrong predictions while the associated Top-2 predictions are the same to the ground-truth. However, 9 in 10 of these Top-1 predictions are re-considered as reasonable and valid predictions checked by experienced chemists from the synthetic chemistry perspective. As Figure 9 shows, the major retro-predictions that both Top-1 and Top-2 can be thought correct, are among metal-catalyzed cross-coupling reactions, N- and O-alkylation reactions, saponification of ethyl esters and methyl esters, different sources of reactants, esterification of alcohol with acyl chlorides or carboxylic acid, and deprotection of different protecting groups to same alcohols.

There are some deprotection reactions with different protecting groups, such as deprotecting O-THP ether and O-Bn ether to free alcohol in Figure 9(a). They are prevalent strategies in chemistry utilizing different protecting groups. In Figure 9(b), both bromoarenes and iodoarenes are reactive enough to initiate Suzuki coupling reactions, similar to N- and O-alkylation of propargyl like or benzyl chloride and bromide in Figure 9(c). In Figure 9(d), hydrolysis of ethyl ester and methyl ester to corresponding carboxylic acid can both occur under certain conditions, although saponification of methyl ester is faster than ethyl ester. Real reactants that participated in the reactions are predicted in our Top-1 predictions, such as allyl Grignard reagent and acyl chloride in cases shown in Figure 9(e). Last but not least, in Figure 9(f), methyl boronic acid or its trimer form and trimethyl borate are very common reagents used by chemists in Suzuki coupling reaction to introduce methyl group.

Figure 9: Top-1 and Top-2 predictions are both reasonable reactants.