1 Introduction
Retrosynthesis, which aims to find a set of reactants that can synthesize a target molecule, is of crucial importance to synthesis planning and drug discovery. The problem is challenging because the search space of all possible transformations is huge by nature. For decades, people have sought to aid retrosynthesis analysis with modern computer techniques (Corey and Wipke, 1969). Among them, machine learning plays a vital role, and significant progress has been made recently (Szymkuć et al., 2016; Coley et al., 2018).
Existing machine learning works on retrosynthesis prediction mainly fall into two categories: template-based (Coley et al., 2017; Segler and Waller, 2017; Dai et al., 2019) and template-free models (Liu et al., 2017; Karpov et al., 2019). Template-based approaches match the target molecule against a large set of reaction templates, which define the subgraph patterns of a set of chemical reactions. For example, Coley et al. (2017) proposed a similarity-based approach to select reaction templates for the target molecule, while Segler and Waller (2017) and Baylon et al. (2019) cast rule selection as a multi-class classification problem. Recently, Dai et al. (2019) treated chemistry knowledge as logic rules and directly modeled the joint probability of rules and reactants, achieving a new state of the art. Despite their great potential for synthesis planning, template-based methods not only require expensive computation but also generalize poorly to new target structures and reaction types. Another line of research (Liu et al., 2017; Karpov et al., 2019) bypasses reaction templates and formulates retrosynthesis prediction as a sequence-to-sequence problem, leveraging recent advances in neural machine translation (Bahdanau et al., 2014; Vaswani et al., 2017) and the SMILES (Weininger, 1988) representation of molecules. However, the SMILES representation imposes a sequential order on the atoms in a molecule and thus cannot effectively reflect the complex relationships among them. As a result, these approaches fail to capture the rich chemical contexts of molecules and their interplay, yielding unsatisfactory predictive performance.
To address the aforementioned issues, in this paper we represent each molecule as a graph and formulate retrosynthesis prediction as a graph-to-graphs translation problem. The proposed G2Gs is a novel template-free approach that leverages the representation power of graphs for molecules and is trained on an extensive collection of molecular reaction data. Specifically, it consists of two key components: (1) a reaction center identification module, which splits synthons from the target molecule and reduces the one-to-many graph translation problem into multiple one-to-one translation processes; and (2) a variational graph translation module, which translates a synthon into a final reactant graph. We aim to model the probability of a reactant graph $G_r$ conditioned on a synthon $S$, i.e., $p(G_r \mid S)$. As a synthon can be translated into different reactants under different reaction contexts, a latent code $z$ is introduced to handle the uncertainty of reactant prediction, i.e., $p(G_r \mid S, z)$. Following existing work on graph generation (You et al., 2018a; Liu et al., 2018; Shi et al., 2020), we formulate reactant graph generation as a sequential decision process, more specifically, sequentially generating nodes and edges. The whole graph translation process can be efficiently and effectively optimized with the variational inference approach (Kingma and Welling, 2014).
We evaluate our model on the benchmark data set USPTO-50k, derived from a patent database (Lowe, 2012), and compare it with both template-based and template-free approaches. We experimentally show that G2Gs significantly outperforms existing template-free baselines, by up to 63% in terms of top-1 accuracy. These numbers also approach those obtained by state-of-the-art template-based strategies, while our method excludes the need for domain knowledge and scales well to larger data sets, making it particularly attractive in practice.
2 Related Work
Retrosynthesis Prediction Prior works on retrosynthesis prediction are primarily based on reaction templates, which are either hand-crafted by human experts (Hartenfeller et al., 2011; Szymkuć et al., 2016) or extracted from large chemical databases automatically (Coley et al., 2017). Since subgraph matching typically yields hundreds of qualified templates for a target molecule, selecting templates that lead to chemically feasible reactions remains a crucial challenge for these approaches. To cope with this challenge, Coley et al. (2017) proposed to select templates based on similar reactions in the data set, and Segler and Waller (2017) and Baylon et al. (2019) further employed neural models for rule selection. The state-of-the-art method (Dai et al., 2019) leverages the idea that templates and reactants can be hierarchically sampled from their conditional joint distribution. Such template-based approaches, however, still suffer from poor generalization to unseen structures and reaction types, and the computational cost of subgraph isomorphism in these methods is often prohibitively expensive.
To overcome the limitations of template-based methods, template-free approaches have recently been actively investigated. Liu et al. (2017) and Karpov et al. (2019) formulated the task as a sequence-to-sequence problem on the SMILES representation of molecules, and the same idea was adopted in (Schwaller et al., 2017, 2018) for its dual problem, i.e., organic reaction prediction. However, these approaches tend to ignore the rich chemical contexts contained in the graph structures of molecules, and the validity of the generated SMILES strings cannot be guaranteed, resulting in unsatisfactory performance. In contrast, our G2Gs framework, operating directly on the graph structures of molecules, is able to generate 100% chemically valid predictions with high accuracy, while excluding the need for reaction templates and computationally expensive subgraph isomorphism.
Our work is also related to the chemical reaction prediction method presented in (Jin et al., 2017) in terms of learning a neural model for reaction center identification. Nevertheless, both the definition of the reaction center and the targeted task, namely retrosynthesis prediction, distinguish our strategy from their algorithm.
Molecular Graph Generation Various deep generative models for molecular graph generation have recently been introduced (Segler et al., 2017; Olivecrona et al., 2017; Samanta et al., 2018; Neil et al., 2018; You et al., 2018a; Liu et al., 2018; Shi et al., 2020). Our G2Gs framework is closely related to the state-of-the-art models that decompose generation into a sequence of graph transformation steps (You et al., 2018a; Liu et al., 2018; Shi et al., 2020). In these approaches, the generation procedure is formulated as a sequential decision-making process that dynamically adds new nodes and edges based on the current subgraph structure. For example, You et al. (2018a) introduced a policy network trained with reinforcement learning for the decision making; Shi et al. (2020) presented an alternative sequential generation algorithm based on autoregressive flow, called GraphAF; and Liu et al. (2018) combined sequential generation with a variational autoencoder to enable continuous graph optimization in the latent space.
Unlike the aforementioned approaches, we here leverage graph generation for retrosynthesis prediction, for which novel components have to be devised. For example, in the retrosynthesis scenario, given a specified product molecule, the reactants can differ under diverse reaction conditions such as reagents and temperature. To cope with this intrinsic multimodality problem, G2Gs explicitly introduces low-dimensional latent variables to modulate the sequential graph translation process, aiming to capture the diverse output distributions.
3 The Graph to Graphs Framework
In this paper, we formulate the retrosynthesis task as a one-to-many graph-to-graphs translation problem. Specifically, we first employ a graph neural network to estimate the reactivity score of all atom pairs of the product graph; the atom pair with the highest reactivity score above a threshold is selected as the reaction center. We then split the product graph into synthons by disconnecting the bonds of the resulting reaction center. Finally, based on the obtained synthons, the reactants are generated via a series of graph transformations, where a latent vector encourages the model to capture the transformation uncertainty and generate diverse predictions. The proposed framework is illustrated in Figure 1 and discussed in detail next.
3.1 Preliminaries
Notation In this paper, a molecule is represented as a labeled graph $G = (A, X)$, where $A$ is the adjacency tensor and $X$ the matrix of node features. Let the number of atoms be $n$, the number of bond types be $b$, and the dimension of the node features be $d$; then $A \in \{0, 1\}^{n \times n \times b}$ and $X \in \mathbb{R}^{n \times d}$. $A_{ijk} = 1$ indicates a bond of type $k$ between the $i$-th and $j$-th nodes.
Retrosynthesis Prediction Formally, a chemical reaction can be represented as a pair of two sets of molecules $(\{G_r^i\}_{i=1}^{m}, \{G_p^j\}_{j=1}^{u})$, where each $G_r^i$ denotes a reactant graph and each $G_p^j$ a product graph. In retrosynthesis prediction, given the set of products, the goal is to precisely predict the set of reactants. Following existing work, in this paper we focus on the standard single-outcome reaction case, i.e., $u = 1$, and thus simplify the notation of a reaction to $(\{G_r^i\}_{i=1}^{m}, G_p)$.
Reaction Center and Synthon Unlike in (Jin et al., 2017), the term reaction center is here used to denote an atom pair $(i, j)$ that satisfies two conditions: (1) there is a bond between the $i$-th and $j$-th nodes in the product graph; and (2) there is no bond between the $i$-th and $j$-th nodes in the reactant graphs. Synthons are the subgraphs extracted from the products by breaking the bonds of the reaction centers. These synthons are later transformed into reactants; note that a synthon may not itself be a valid molecule.
Molecular Graph Representation Learning Graph Convolutional Networks (GCNs) (Duvenaud et al., 2015; Kearnes et al., 2016; Kipf and Welling, 2016; Gilmer et al., 2017; Schütt et al., 2017) have achieved great success in representation learning for computational chemistry. Given a molecular graph $G$, we adopt a variant of the Relational GCN (RGCN) (Schlichtkrull et al., 2018) to compute both the node embeddings and the graph-level embedding. Let $d_h$ be the embedding dimension and $H^{(l)} \in \mathbb{R}^{n \times d_h}$ the node embeddings computed by the RGCN at the $l$-th layer; $H^{(l)}_i$ represents the embedding of the $i$-th atom. At each layer, the RGCN calculates the embedding of each node by aggregating messages from different edge types:
$H^{(l+1)} = \mathrm{Agg}\big(\mathrm{ReLU}\big(\{E_i H^{(l)} W_i^{(l)}\}_{i=1}^{b}\big)\big)$  (1)
where $E_i$ denotes the adjacency matrix of the $i$-th edge type with self-loops, and $W_i^{(l)}$ is the trainable weight matrix for the $i$-th edge type. $\mathrm{Agg}(\cdot)$ denotes an aggregation function chosen from summation, averaging, and concatenation. We stack $L$ RGCN layers to compute the final node embeddings $H = H^{(L)}$. The entire graph-level embedding can also be calculated by applying a $\mathrm{Readout}$ function to $H$ (Hamilton et al., 2017), e.g., summation.
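As a concrete illustration, one layer of this relational message passing can be sketched in a few lines of numpy. The function and variable names below are ours, not those of the paper's implementation, and a real model would learn the weights rather than sample them randomly:

```python
import numpy as np

def rgcn_layer(E, H, W, agg="sum"):
    """One relational GCN layer, a sketch of Eq. (1).

    E: (b, n, n) adjacency matrices, one per bond type, with self-loops.
    H: (n, d) node embeddings from the previous layer.
    W: (b, d, d) one trainable weight matrix per bond type.
    """
    # one message per edge type: propagate, transform, then ReLU
    messages = [np.maximum(E[i] @ H @ W[i], 0.0) for i in range(E.shape[0])]
    if agg == "sum":
        return np.sum(messages, axis=0)
    if agg == "mean":
        return np.mean(messages, axis=0)
    return np.concatenate(messages, axis=-1)  # "concat" aggregation

# toy molecule: 3 atoms, 2 bond types, embedding size 4
rng = np.random.default_rng(0)
E = np.stack([np.eye(3), np.eye(3)])  # self-loops for both edge types
E[0, 0, 1] = E[0, 1, 0] = 1.0         # one type-0 bond between atoms 0 and 1
H0 = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4, 4))
H1 = rgcn_layer(E, H0, W)
assert H1.shape == (3, 4)
```

Stacking $L$ such calls and summing the rows of the final `H` gives the graph-level readout used throughout the paper.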
3.2 Reaction Center Identification
Given a chemical reaction $(\{G_r^i\}_{i=1}^{m}, G_p)$, we first derive a binary label matrix $Y \in \{0, 1\}^{n \times n}$ over the atom pairs (i.e., bonds) in the product $G_p$, indicating the reaction centers as defined in Section 3.1; $Y_{ij} = 1$ indicates that the bond between the $i$-th and $j$-th nodes in $G_p$ is a reaction center. It is worth noting that both the reactants and the product are atom-indexed; that is, each atom in the reactant set is associated with a unique index. This property enables us to identify the reaction centers by simply comparing each pair of atoms in the product $G_p$ with the corresponding pair in the reactants, forming the binary label matrix $Y$. With this label matrix, the reaction center identification procedure in G2Gs is formulated as a binary link prediction task as follows.
We use an $L$-layer RGCN, as defined in Section 3.1, to compute the node embeddings $H^p$ and the entire graph embedding $h_p$ of the product $G_p$:
$H^p = \mathrm{RGCN}(G_p), \qquad h_p = \mathrm{Readout}(H^p)$  (2)
To estimate the reactivity score of an atom pair $(i, j)$, the edge embedding $e_{ij}$ is formed by concatenating the node embeddings of the $i$-th and $j$-th nodes as well as the one-hot bond type feature. In addition, since an $L$-layer RGCN can only gather information within $L$ hops of the center node, while the reactivity of a reaction center may also be affected by remote atoms, we further let the edge embedding take the graph embedding (i.e., the global structure) into account, so as to leverage the knowledge of remote atoms. Formally, we have:
$e_{ij} = H^p_i \,\|\, H^p_j \,\|\, y_{ij} \,\|\, h_p$  (3)
where $\|$ denotes the vector concatenation operation, $H^p_i$ denotes the $i$-th row of $H^p$, and $y_{ij}$ is the one-hot encoding of the edge type of the atom pair $(i, j)$ in $G_p$. The final reactivity score $s_{ij}$ of the atom pair $(i, j)$ is calculated as:
$s_{ij} = \sigma\big(m_r(e_{ij})\big)$  (4)
where $m_r$ is a feedforward network that maps $e_{ij}$ to a scalar and $\sigma$ denotes the sigmoid function. In the case that a certain reaction type is known during retrosynthesis analysis, we can also include this information by concatenating its embedding $c$ to the input of the feedforward network, i.e., $s_{ij} = \sigma\big(m_r(e_{ij} \,\|\, c)\big)$.
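The pairwise scoring above amounts to concatenating per-node, per-bond, and graph-level features and passing them through a small network with a sigmoid output. A minimal numpy sketch, with a single linear layer standing in for the feedforward network and invented variable names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reactivity_scores(H, h_g, bond_onehot, w, b=0.0):
    """Score every atom pair (i, j) as sigmoid of a linear map over the
    concatenated edge embedding [H_i || H_j || bond_type || h_graph].
    A single linear layer (w, b) stands in for the feedforward network."""
    n, d = H.shape
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_ij = np.concatenate([H[i], H[j], bond_onehot[i, j], h_g])
            S[i, j] = sigmoid(e_ij @ w + b)
    return S

rng = np.random.default_rng(1)
n, d, b_types = 4, 8, 3
H = rng.normal(size=(n, d))
h_g = H.sum(axis=0)                       # readout by summation
bond_onehot = rng.normal(size=(n, n, b_types))
w = rng.normal(size=(2 * d + b_types + d,))
S = reactivity_scores(H, h_g, bond_onehot, w)
assert S.shape == (n, n)
```

During inference, the highest-scoring pair above a threshold would then be picked as the reaction center.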
For learning, the reaction center identification module is optimized by minimizing the cross entropy against the binary label matrix $Y$:
$\mathcal{L}_c = -\sum_{i \ne j} \lambda\, Y_{ij} \log s_{ij} + (1 - Y_{ij}) \log(1 - s_{ij})$  (5)
where $\lambda$ is a weighting hyperparameter with the following purpose: in practice, only a few of all atom pairs are reaction centers, so the output logits of the feedforward network are often extremely small; the weighting $\lambda$ alleviates this issue caused by the imbalanced class distribution.
During inference, we first calculate the reactivity scores of all atom pairs and then select the highest one above a threshold as the reaction center. Alternatively, one can select the top-$k$ atom pairs with the highest scores above a threshold as reaction centers. This scenario yields more diverse synthesis routes, at the cost of inference time.
After collecting the reaction centers, we disconnect the corresponding bonds in $G_p$ and treat each connected subgraph of $G_p$ as a synthon. Note that in the case where no reaction center is identified, the product itself is treated as a synthon. In this way, we extract all synthons from the product $G_p$ and then formulate a one-to-one graph translation procedure that translates each resulting synthon into a final reactant graph. We discuss this translation process in detail next.
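Splitting a product into synthons is just a connected-components computation on the bond graph after deleting the reaction-center bonds. A small pure-Python sketch, with data structures that are our own simplification:

```python
def split_synthons(n_atoms, bonds, reaction_centers):
    """Delete reaction-center bonds and return the connected components
    (synthons) of the remaining product graph.

    bonds / reaction_centers: sets of frozenset atom-index pairs."""
    kept = {b for b in bonds if b not in reaction_centers}
    adj = {i: set() for i in range(n_atoms)}
    for b in kept:
        i, j = tuple(b)
        adj[i].add(j)
        adj[j].add(i)
    seen, synthons = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # DFS over the bond-deleted graph
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        synthons.append(sorted(comp))
    return synthons

# toy product: a 5-atom chain 0-1-2-3-4 with the bond (2, 3) as reaction center
bonds = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3), (3, 4)]}
centers = {frozenset((2, 3))}
assert split_synthons(5, bonds, centers) == [[0, 1, 2], [3, 4]]
```

If `reaction_centers` is empty, the whole product comes back as a single synthon, matching the fallback behavior described above.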
3.3 Reactants Generation via Variational Graph Translation
Given a chemical reaction $(\{G_r\}, G_p)$, we denote the set of synthons extracted from $G_p$ (as discussed in Section 3.2) as $\{S_i\}$. For simplicity, we omit the subscript and denote a translation pair as $(S, G_r)$. In this setting, our goal is to learn a conditional generative model that recovers the reactant molecule domain ($G_r$) from the synthon molecule domain ($S$). An intrinsic issue is that the same synthon can be translated to different reactants, which is known as the multimodality problem. To mitigate this problem and encourage the module to model the diverse distribution of reactants, we incorporate a low-dimensional latent vector $z$ to capture the uncertainty of the graph translation process. Details are discussed next.
3.3.1 The Generative Model
We build upon previous work on graph generative models (Li et al., 2018; You et al., 2018b). The generation of the reactant graph $G_r$ is conditioned on both the synthon $S$ and the latent vector $z$. In detail, we autoregressively generate a sequence of graph transformation actions $a_{1:T} = (a_1, \ldots, a_T)$ and apply them to the initial synthon graph, where each $a_t$ is a one-step graph transformation (i.e., an action) acting as a modification to the graph. Formally, let $\mathcal{T}(G_r)$ be the collection of all trajectories that translate the synthon $S$ to the target reactant $G_r$, and let $\tau \in \mathcal{T}(G_r)$ be one such trace. We then factorize modeling $p(G_r \mid z, S)$ into modeling the joint distribution over the sequence of graph transformation actions, $p(\tau \mid z, S)$. The connection between $p(G_r \mid z, S)$ and $p(\tau \mid z, S)$ is detailed in Section 3.3.2. With such a generative model, a synthon can be translated into a reactant by sampling action sequences from the learned distribution. Next, we describe the generative procedure in detail.
Let $G_t$ denote the graph after applying the action sequence $(a_1, \ldots, a_t)$ to the synthon, with $G_0 = S$; then $G_T = G_r$. In previous graph generation models (Li et al., 2018; You et al., 2018b), each decision is conditioned on the full history of the generation sequence ($p(a_t \mid a_{1:t-1}, z, S)$), which leads to stability and scalability problems in the optimization process. To alleviate these issues, our graph translation dynamics leverages the assumption of a Markov Decision Process (MDP), which satisfies the Markov property that $p(a_t \mid a_{1:t-1}, z, S) = p(a_t \mid G_{t-1}, z)$. The MDP formulation means that each action is conditioned only on the graph modified so far. Hence the graph translation model can be naturally factorized as follows:
$p(\tau \mid z, S) = \prod_{t=1}^{T} p(a_t \mid G_{t-1}, z)$  (6)
Before we detail the parameterization of the distribution $p(a_t \mid G_{t-1}, z)$, we formally introduce the definition of an action. An action $a_t$ is a tuple with four elements:
$a_t = (a_t^{stop}, a_t^{first}, a_t^{second}, a_t^{edge})$  (7)
Assume the number of possible atom types is $c$ and the current graph has $n_t$ nodes. Then $a_t^{stop} \in \{0, 1\}$ predicts the termination of the graph translation procedure; $a_t^{first} \in \{1, \ldots, n_t\}$ indicates the first node to focus on; $a_t^{second} \in \{1, \ldots, n_t + c\}$ indicates the second node to focus on; and $a_t^{edge} \in \{1, \ldots, b\}$ predicts the type of the bond between the two nodes. The distribution $p(a_t \mid G_{t-1}, z)$ can then be decomposed into three parts: (1) termination prediction $p(a_t^{stop} \mid G_{t-1}, z)$; (2) node selection $p(a_t^{first} \mid G_{t-1}, z)$ and $p(a_t^{second} \mid a_t^{first}, G_{t-1}, z)$; and (3) edge labeling $p(a_t^{edge} \mid a_t^{first}, a_t^{second}, G_{t-1}, z)$. We discuss the parameterization of these components in detail next.
Termination Prediction Denote by $\mathrm{RGCN}(\cdot)$ an $L$-layer RGCN (Section 3.1) that computes the node embeddings of an input graph, and let $h_{G_{t-1}}$ be the readout of $\mathrm{RGCN}(G_{t-1})$. We parameterize the distribution as:
$p(a_t^{stop} \mid G_{t-1}, z) = \tau\big(m_{stop}(h_{G_{t-1}} \,\|\, z)\big)$  (8)
where $\tau$ denotes the softmax function and $m_{stop}$ is a feedforward network. At each transformation step, we sample $a_t^{stop} \sim p(a_t^{stop} \mid G_{t-1}, z)$. If $a_t^{stop}$ indicates termination, the graph translation process stops and the current graph is treated as the final reactant generated by the module.
Nodes Selection Let $\mathcal{A}$ be the set of possible atoms (e.g., carbon, oxygen) that may be added during graph translation. We first extend the graph $G_{t-1}$ to $\tilde{G}_{t-1}$ by adding the atoms of $\mathcal{A}$ as isolated nodes, i.e., $\tilde{G}_{t-1} = G_{t-1} \cup \mathcal{A}$, and let $\tilde{H}$ denote the node embeddings of $\tilde{G}_{t-1}$. The first node is selected from the atoms in $G_{t-1}$, while the second node is selected from $\tilde{G}_{t-1}$ conditioned on the first node, by concatenating the first node's embedding with the embedding of each atom in $\tilde{G}_{t-1}$:
$p(a_t^{first} \mid G_{t-1}, z) = \tau\big(m_{first}(\tilde{H} \,\|\, z)\big) \odot M_{first},$
$p(a_t^{second} \mid a_t^{first}, G_{t-1}, z) = \tau\big(m_{second}(\tilde{H}_{a_t^{first}} \,\|\, \tilde{H} \,\|\, z)\big) \odot M_{second}$  (9)
where $m_{first}$ and $m_{second}$ are feedforward networks, and $M_{first}$ and $M_{second}$ are masks that zero out the probability of certain atoms being selected. Specifically, $M_{first}$ forces the module to select the first node from $G_{t-1}$, and $M_{second}$ prevents the first node from being selected again. In this case, only the second node can be selected from $\mathcal{A}$, which corresponds to adding a new atom to $G_{t-1}$. The distribution of node selection is the product of the two conditional distributions above.
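In practice, zeroing out disallowed choices and renormalizing can be implemented as a masked softmax; a tiny numpy sketch (names and values are ours, for illustration only):

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax that zeroes out disallowed choices (mask = 0) and
    renormalizes, as used for the two node-selection distributions."""
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    z = z * mask
    return z / z.sum()

# 4 candidate nodes: the last two are isolated atoms from the candidate set,
# which must not be chosen as the *first* node
logits = np.array([2.0, 1.0, 0.5, 0.1])
mask_first = np.array([1.0, 1.0, 0.0, 0.0])
p = masked_softmax(logits, mask_first)
assert abs(p.sum() - 1.0) < 1e-9
assert p[2] == 0.0 and p[3] == 0.0
```

The second-node mask would be built the same way, with a zero only at the position of the already-chosen first node.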
Edge Labeling The distribution for edge labeling is parameterized as:
$p(a_t^{edge} \mid a_t^{first}, a_t^{second}, G_{t-1}, z) = \tau\big(m_{edge}(\tilde{H}_{a_t^{first}} \,\|\, \tilde{H}_{a_t^{second}} \,\|\, z)\big)$  (10)
where $m_{edge}$ is a feedforward network. The knowledge of reaction classes can be incorporated in the same way as in Section 3.2.
3.3.2 Learning
To learn the parameters of our variational graph translation module, we aim to maximize the log likelihood of the observed translation pair, i.e., $\log p(G_r \mid S)$, where
$p(G_r \mid z, S) = \sum_{\tau \in \mathcal{T}(G_r)} p(\tau \mid z, S)$  (11)
Directly optimizing this objective involves marginalizing the latent variable $z$, which is computationally intractable. To this end, we turn to standard amortized variational inference (Kingma and Welling, 2014) by introducing an approximate posterior $q(z \mid G_r, S)$, which is modeled as a Gaussian distribution to allow effective sampling of $z$ via the reparameterization trick. Specifically, the mean and the log variance of $q(z \mid G_r, S)$ are parameterized as follows:
$\mu = m_\mu(h_{G_r} \,\|\, h_S), \qquad \log \sigma^2 = m_\sigma(h_{G_r} \,\|\, h_S)$  (12)
where $h_{G_r}$ and $h_S$ are the graph embeddings of $G_r$ and $S$, respectively, computed by the same RGCN, and $m_\mu$ and $m_\sigma$ are feedforward networks. The evidence lower bound (ELBO) is then defined as:
$\mathcal{L}_{ELBO} = \mathbb{E}_{q(z \mid G_r, S)}\big[\log p(G_r \mid z, S)\big] - D_{KL}\big(q(z \mid G_r, S) \,\|\, p(z)\big)$  (13)
where $D_{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence between $q(z \mid G_r, S)$ and the prior $p(z)$, which we take to be a standard Gaussian $\mathcal{N}(0, I)$.
Efficient Training The computation of $p(G_r \mid z, S)$ (eq. 11) is expensive, as it requires summation over all possible graph transformation sequences that translate $S$ to $G_r$. Here we introduce two strategies that perform well empirically. We first show that $\log p(G_r \mid z, S)$ is lower bounded by the expected log likelihood of the trajectories that translate $S$ to $G_r$, using Jensen's inequality:
$\log p(G_r \mid z, S) = \log \sum_{\tau \in \mathcal{T}(G_r)} p(\tau \mid z, S) \ge \log |\mathcal{T}(G_r)| + \mathbb{E}_{\tau \sim q(\tau)}\big[\log p(\tau \mid z, S)\big]$  (14)
where $q(\tau)$ is the uniform distribution over $\mathcal{T}(G_r)$ and $|\mathcal{T}(G_r)|$ is the number of different action traces that translate $S$ to $G_r$. In practice, we can drop the constant term and evaluate the expectation using Monte Carlo estimation. We further adopt the breadth-first-search (BFS) node ordering, a widely used technique in sequential graph generation (You et al., 2018b; Popova et al., 2019; Shi et al., 2020), to reduce the number of valid transformation traces during sampling.
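A canonical BFS node ordering of the kind used to restrict the set of valid generation traces can be sketched in pure Python (the adjacency-list format is our own simplification):

```python
from collections import deque

def bfs_order(adj, start):
    """Canonical BFS ordering of nodes, used to cut down the number of
    valid generation traces considered during sampling."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(adj[u]):  # sorted neighbors => deterministic ordering
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

# toy graph: 0-1, 0-2, 1-3
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
assert bfs_order(adj, 0) == [0, 1, 2, 3]
```

Fixing one such ordering per graph means the model only ever needs to match a single trace per (synthon, reactant) pair instead of all permutations.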
3.3.3 Generation
To generate a reactant graph, a natural way is to first sample a latent vector $z$ from the prior $p(z)$, then sample a trace of graph transformations from $p(\tau \mid z, S)$ (Section 3.3.1), and finally apply these transformations to the synthon $S$. However, in our generative model, the probability of invalid actions is nonzero even when the model is well trained; as a result, arbitrarily long sampling can produce any reactant molecule, including invalid ones. This process also suffers from the non-trivial exposure bias problem (Bengio et al., 2015). To overcome these obstacles during sampling, we design a beam search procedure as follows.
Consider a beam search of size $k$. At the $t$-th transformation step, we maintain a candidate set $C_{t-1}$ of size $k$. We first calculate and sort the probabilities of all possible actions, and then select the top-$k$ ranked valid actions for each candidate graph in $C_{t-1}$. Once this is done for all graphs in $C_{t-1}$, the top-$k$ graphs among all the generated graphs are selected as the candidate set for the next transformation step. During this beam search, a translation branch stops when $t$ reaches the predefined maximum number of transformation steps or the sampled action indicates termination; the current graph is then added to a set $F$, and the whole beam search terminates once all translation branches have stopped. Finally, the top-$k$ graphs in $F$, ranked by their likelihoods, are collected as the final predicted graphs.
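Structurally, this is a standard beam search over action sequences. The following pure-Python sketch captures the bookkeeping (keeping the highest-likelihood partial traces, carrying finished branches forward); the toy `expand` function is an invented stand-in for the model's action distribution:

```python
import math

def beam_search(expand, init, beam_size=3, max_steps=5):
    """Generic beam search over action sequences.

    expand(state) -> list of (next_state, log_prob, done) candidates.
    Returns finished states sorted by total log likelihood."""
    beam = [(0.0, init, False)]  # (cumulative log prob, state, done)
    for _ in range(max_steps):
        candidates = []
        for logp, state, done in beam:
            if done:
                candidates.append((logp, state, True))  # carry finished branch
            else:
                candidates.extend(
                    (logp + lp, nxt, d) for nxt, lp, d in expand(state)
                )
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_size]
        if all(d for _, _, d in beam):
            break
    return [(lp, s) for lp, s, d in beam if d]

# toy "translation": append 'a' (p=0.7) or 'b' (p=0.3); stop at length 2
def expand(s):
    if len(s) >= 2:
        return [(s, 0.0, True)]
    return [(s + "a", math.log(0.7), False), (s + "b", math.log(0.3), False)]

results = beam_search(expand, "", beam_size=2, max_steps=4)
assert results[0][1] == "aa"  # most likely trace survives the beam
```

In the actual module, `expand` would enumerate valid graph-editing actions ranked by the learned action distribution, and states would be partial reactant graphs rather than strings.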
3.4 Scalability Analysis
Both the Reaction Center Identification and the Variational Graph Translation modules in our G2Gs framework bypass the deployment of reaction templates and take advantage of the representation power of molecular graph embeddings. As a result, the model size of G2Gs scales linearly with the maximum number of atoms in the molecules and is invariant to the number of rules and reactions in the given data sets, making it highly scalable to larger data sets.
4 Empirical Studies
4.1 Experiment Setup
Data We evaluate our approach on the widely used benchmark data set USPTO-50k, which contains 50k atom-mapped reactions with 10 reaction types. Following (Liu et al., 2017), we randomly select 80% of the reactions as the training set and divide the rest into validation and test sets of equal size.
Baselines We compare our strategy against five baselines, two template-free and three template-based. Specifically, Seq2seq (Liu et al., 2017) is a template-free approach that trains an LSTM (Hochreiter and Schmidhuber, 1997) to translate the SMILES strings of target molecules into those of reactants. Transformer (Karpov et al., 2019) is also a neural sequence-to-sequence model, but it leverages the Transformer architecture (Vaswani et al., 2017) for better sequential modeling. Retrosim (Coley et al., 2017) is a data-driven method that selects templates for target molecules based on similar reactions in the data set. Neuralsym (Segler and Waller, 2017) employs a neural model to rank templates for target molecules. GLN (Dai et al., 2019) is the state-of-the-art method, which samples templates and reactants jointly from a distribution learned by a conditional graphical model.
Evaluation Metrics Following existing works (Liu et al., 2017; Dai et al., 2019), we use the top-$k$ exact match accuracy as our evaluation metric. For comparison purposes, we used $k \in \{1, 3, 5, 10\}$ in our experiments. The accuracy was computed by matching the canonical SMILES strings of the predicted molecules with those of the ground truth.
Implementation Details G2Gs is implemented in PyTorch (Paszke et al., 2017). We use the open-source chemistry toolkit RDKit (Landrum, 2016) to preprocess molecules for training and to generate canonical SMILES strings for evaluation. The RGCN in G2Gs has 4 layers, and the embedding size is set to 512 for both modules; the latent codes are low-dimensional. We train G2Gs for 100 epochs with a batch size of 128 and a learning rate of 0.0001, using the Adam (Kingma and Ba, 2014) optimizer on a single GTX 1080Ti GPU. The weighting hyperparameter in Eq. (5) is set to 20 for the reaction center identification module, and the beam size is 10 during inference. The maximum number of transformation steps is set to 20. We selected these values heuristically based on the validation set.
Table 1: Top-k exact match accuracy (%) when the reaction class is given as a prior.

Methods          Top-1   Top-3   Top-5   Top-10
Template-free
  Seq2seq         37.4    52.4    57.0    61.7
  G2Gs            61.0    81.3    86.0    88.7
Template-based
  Retrosim        52.9    73.8    81.2    88.1
  Neuralsym       55.3    76.0    81.4    85.1
  GLN             64.2    79.1    85.2    90.0
Table 2: Top-k exact match accuracy (%) when the reaction class is unknown.

Methods          Top-1   Top-3   Top-5   Top-10
Template-free
  Transformer     37.9    57.3    62.7    /
  G2Gs            48.9    67.6    72.5    75.5
Template-based
  Retrosim        37.3    54.7    63.3    74.1
  Neuralsym       44.4    65.3    72.4    78.9
  GLN             52.5    69.0    75.6    83.7
4.2 Predictive Performance
We evaluate the top-$k$ exact match accuracy of the proposed approach in both the reaction-class-known and reaction-class-unknown settings, with results presented in Tables 1 and 2, respectively. The sets of reactant molecules are generated via beam search (Section 3.3.3) and ranked by their log likelihoods.
Table 3: Top-k accuracy (%) of the reaction center identification module.

Setting                   Top-1   Top-2   Top-3   Top-5
reaction class known       90.2    94.5    94.9    95.0
reaction class unknown     75.8    83.9    85.3    85.6
Table 4: Top-k exact match accuracy (%) of the variational graph translation module, given the ground-truth synthons.

Setting                   Top-1   Top-3   Top-5   Top-10
reaction class known       66.8    87.2    91.5    93.9
reaction class unknown     61.1    81.5    86.7    90.0
When compared with template-free approaches, the results in Tables 1 and 2 demonstrate that G2Gs achieves competitive results in all cases with different $k$. In particular, G2Gs consistently outperforms the template-free baselines by a large margin, with up to 63% relative improvement in top-1 exact match accuracy when the reaction class is known (the second column in Table 1) and up to 29% relative improvement when the reaction class is unknown (the second column in Table 2).
When compared with template-based baselines, the results in Tables 1 and 2 indicate that G2Gs approaches or outperforms the state-of-the-art method GLN (Dai et al., 2019), especially when $k$ is small. For example, when the reaction class is given as a prior, G2Gs outperforms GLN in terms of top-3 and top-5 exact match accuracy.
4.3 Ablation Study
To gain insights into the working behaviours of G2Gs, we conduct ablation studies in this section.
Effectiveness of the Reaction Center Identification Module We evaluate the top-$k$ accuracy of the reaction center identification module by comparing the true reaction center against the top-$k$ atom pairs with the highest reactivity scores above a threshold. The results in Table 3 indicate that when the reaction class is given as a prior, G2Gs can precisely pinpoint the reaction center in most cases, even for $k = 1$. When the reaction class is unknown, the performance is slightly worse, as a target molecule can usually be synthesized via different routes depending on the reaction type a chemist chooses, and the model thus makes predictions with lower certainty.
Impact of the Variational Graph Translation Module To examine the graph translation module in isolation, we first split synthons from products based on the true reaction centers (i.e., the label matrix in Section 3.2), and then use the same strategy as in Section 4.2 to compute the top-$k$ exact match accuracies. As shown in Table 4, the graph translation module achieves high accuracy in translating the given synthon graphs to reactant graphs, which is an important contributor to the superior performance of the proposed G2Gs framework.
Diverse Reactant Graph Generation To observe the benefits brought by the latent vector used in G2Gs, Figure 3 visualizes the top-1 translation results for a product molecule randomly selected from the test set. In the figure, the four molecules on the right are generated from the same synthon (the leftmost molecule) by randomly sampling the latent vector from its prior distribution. The results suggest that, by sampling the learned latent vector, an intrinsic process in G2Gs, our model can create diverse and valid reactant graphs, which represents an appealing feature for retrosynthesis prediction.
4.4 Case Study via Visualization
In Figure 5, we show cases where G2Gs successfully identifies the reaction centers and translates the product graph to a set of reactant graphs that match the ground truth. The synthesis routes shown in Figure 5 can be divided into two groups, each of which corresponds to a reaction template presented at the bottom of the figure. These figures suggest that our model does learn domain knowledge from the data set. Such property of our method makes it an appealing solution to practical problems with limited template knowledge.
In Figure 5, we also present a case where none of the predictions matches the ground truth. However, we note that this does not necessarily mean that our model fails to predict a synthesis route for the target molecule. This is because a molecule can be synthesized in multiple ways and the ground truth in the data set is not the only answer. To verify this hypothesis, we employ a forward reaction prediction model (Jin et al., 2017) to predict the product molecules based on the reactants generated by our model. As shown at the bottom of Figure 5, the predicted product exactly matches the target molecule of the retrosynthesis problem. This confirms that the predictions made by G2Gs are indeed potentially valid.
5 Conclusion and Outlook
We formulated retrosynthesis prediction as a novel graph-to-graphs translation task and proposed a template-free approach to attain the prediction goal. In addition, we devised a variational graph translation module to capture the uncertainty in, and encourage the diversity of, the graph translation process. We also empirically verified the superior performance of the proposed method: our strategy outperformed the state-of-the-art template-free counterpart by up to 63% in top-1 accuracy and approached the performance of the state-of-the-art template-based strategy. Our method excludes domain knowledge and scales well to large data sets, making it particularly attractive in practice.
Our future work will include extending our G2Gs approach to embrace an endtoend training paradigm and leveraging it to cope with multistep retrosynthesis tasks (Corey, 1991).
References
 Neural machine translation by jointly learning to align and translate. arXiv eprints abs/1409.0473. Cited by: §1.

Enhancing retrosynthetic reaction prediction with deep learning using multiscale reaction classification
. Journal of Chemical Information and Modeling 59, pp. . External Links: Document Cited by: §1, §2. 
Bengio et al. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179.
Coley et al. (2017). Prediction of organic reaction outcomes using machine learning. ACS Central Science 3.
Coley et al. (2018). Machine learning in computer-aided synthesis planning. Accounts of Chemical Research 51.
Coley et al. (2017). Computer-assisted retrosynthesis based on molecular similarity. ACS Central Science 3.
Corey (1991). The logic of chemical synthesis: multistep synthesis of complex carbogenic molecules (Nobel lecture). Angewandte Chemie International Edition in English 30(5), pp. 455–465.
Corey and Wipke (1969). Computer-assisted design of complex organic syntheses. Science 166, pp. 178–192.
Dai et al. (2019). Retrosynthesis prediction with conditional graph logic network. In Advances in Neural Information Processing Systems 32, pp. 8870–8880.
Duvenaud et al. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232.
Gilmer et al. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pp. 1263–1272.
Hamilton et al. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
Hartenfeller et al. (2011). A collection of robust organic synthesis reactions for in silico molecule design. Journal of Chemical Information and Modeling 51, pp. 3093–3098.
Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9, pp. 1735–1780.
Jin et al. (2017). Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems 30, pp. 2607–2616.
Karpov et al. (2019). A transformer model for retrosynthesis.
Kearnes et al. (2016). Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design 30(8), pp. 595–608.
Kingma and Ba (2015). Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations.
Kingma and Welling (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
Kipf and Welling (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
Landrum. RDKit: open-source cheminformatics software.
Li et al. (2018). Learning deep generative models of graphs. arXiv:1803.03324.
Liu et al. (2017). Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science 3.
Liu et al. (2018). Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems, pp. 7795–7804.
Lowe. Extraction of chemical structures and reactions from the literature. Ph.D. thesis.
Neil et al. (2018). Exploring deep recurrent models with reinforcement learning for molecule design. In 6th International Conference on Learning Representations, Workshop Track Proceedings.
Olivecrona et al. (2017). Molecular de novo design through deep reinforcement learning. Journal of Cheminformatics 9(1), pp. 48.
Paszke et al. (2017). Automatic differentiation in PyTorch. In NIPS-W.
Popova et al. (2019). MolecularRNN: generating realistic molecular graphs with optimized properties. arXiv:1905.13372.
Samanta et al. (2018). Designing random graph models using variational autoencoders with applications to chemical design. arXiv:1802.05283.
Schlichtkrull et al. (2018). Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
Schütt et al. (2017). Quantum-chemical insights from deep tensor neural networks. Nature Communications 8, pp. 13890.
Schwaller et al. (2018). "Found in translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science 9.
Schwaller et al. (2018). Molecular transformer for chemical reaction prediction and uncertainty estimation. arXiv:1811.02633.
Segler et al. (2018). Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science 4(1), pp. 120–131.
Segler and Waller (2017). Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry – A European Journal 23.
Shi et al. (2020). GraphAF: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations.
Szymkuć et al. (2016). Computer-assisted synthetic planning: the end of the beginning. Angewandte Chemie International Edition 55.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
Weininger (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 28(1), pp. 31–36.
You et al. (2018a). Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pp. 6410–6421.
You et al. (2018b). GraphRNN: generating realistic graphs with deep autoregressive models. arXiv:1802.08773.