A Model to Search for Synthesizable Molecules

06/12/2019 · by John Bradshaw, et al. · BenevolentAI · University of Cambridge · The Alan Turing Institute

Deep generative models are able to suggest new organic molecules by generating strings, trees, and graphs representing their structure. While such models allow one to generate molecules with desirable properties, they give no guarantees that the molecules can actually be synthesized in practice. We propose a new molecule generation model, mirroring a more realistic real-world process, where (a) reactants are selected, and (b) combined to form more complex molecules. More specifically, our generative model proposes a bag of initial reactants (selected from a pool of commercially-available molecules) and uses a reaction model to predict how they react together to generate new molecules. We first show that the model can generate diverse, valid and unique molecules due to the useful inductive biases of modeling reactions. Furthermore, our model allows chemists to interrogate not only the properties of the generated molecules but also the feasibility of the synthesis routes. We conclude by using our model to solve retrosynthesis problems, predicting a set of reactants that can produce a target product.


1 Introduction

The ability of machine learning to generate structured objects has progressed dramatically in the last few years. One particularly successful example is the flurry of developments devoted to generating small molecules (Gómez-Bombarelli et al., 2018; Segler et al., 2017; Kusner et al., 2017; Dai et al., 2018; Simonovsky and Komodakis, 2018; De Cao and Kipf, 2018; Jin et al., 2018; Liu et al., 2018; You et al., 2018). These models have been shown to be extremely effective at finding molecules with desirable properties: drug-like molecules (Gómez-Bombarelli et al., 2018), molecules active against biological targets (Segler et al., 2017), and soluble molecules (De Cao and Kipf, 2018).

However, these improvements in molecule discovery come at a cost: these methods do not describe how to synthesize the proposed molecules, a prerequisite for experimental testing. Traditionally, in computer-aided molecular design, this has been addressed by virtual screening (Shoichet, 2004), where molecule data sets are first generated via the expensive combinatorial enumeration of molecular fragments, stitched together using hand-crafted bonding rules, and are then scored in a subsequent step.

In this paper we propose a generative model for molecules (shown in Figure 1) that describes how to make such molecules from a set of commonly-available reactants. Our model first generates a set of reactant molecules, and second maps them to a predicted product molecule via a reaction prediction model. It allows one to simultaneously search for better molecules and describe how such molecules can be made. By closely mimicking the real-world process of designing new molecules, we show that our model: 1. Is able to generate a wide range of molecules not seen in the training data; 2. Addresses practical synthesis concerns such as reaction stability and toxicity; and 3. Allows us to propose new reactants for given target molecules that may be more practical to manage.

Figure 1: An overview of approaches used to find molecules with desirable properties. Left: Virtual screening (Shoichet, 2004) aims to find novel molecules by the (computationally expensive) enumeration over all possible combinations of fragments. Center: More recent ML approaches, e.g. Gómez-Bombarelli et al. (2018), aim to find useful, novel molecules by optimizing in a continuous latent space; however, there is no indication of whether (and how) these molecules can be synthesized. Right: We approach the generation of molecules through a multistage process mirroring how complex molecules are created in practice, whilst maintaining a continuous latent space to use for optimization. Our model, Molecule Chef, first finds suitable reactants, which then react together to create a final molecule.

2 Background

We start with an overview of traditional computational techniques for discovering novel molecules with desirable properties. We then review recent work in machine learning (ML) that seeks to improve parts of this process, identify aspects of molecule discovery that we believe deserve more attention from the ML community, and lay out our contributions to address these concerns.

2.1 Virtual Screening

To discover new molecules with certain properties, one popular technique is virtual screening (VS) (Shoichet, 2004; Hu et al., 2011; Pyzer-Knapp et al., 2015; Chevillard and Kolb, 2015; Nicolaou et al., 2016). VS works by (a) enumerating all combinations of a set of building-block molecules, which are combined via virtual chemical bonding rules, (b) calculating the desired properties of each molecule via simulations or prediction models, and (c) filtering out the most interesting molecules to synthesize in the lab. While VS is general, it has the important downside that the generation process is not targeted: VS needs to get lucky to find molecules with desirable properties; it does not search for them. Given that the number of possible drug-like compounds is estimated to lie between 10^23 and 10^60 (van Hilten et al., 2019), the chemical space usually screened in VS is tiny. Search in combinatorial fragment spaces has been proposed, but is limited to simpler similarity queries (Rarey and Stahl, 2001).

2.2 The Molecular Search Problem

To address these downsides, one idea is to replace this full enumeration with a search algorithm, an idea called de novo design (DND) (Schneider and Schneider, 2016). Instead of generating a large set of molecules with small variations, DND searches for molecules with particular properties, recomputes the properties of the newfound molecules, and searches again. We call this the molecular search problem. Early work on the molecular search problem used genetic algorithms, ant-colony optimization, or other discrete search techniques to make local changes to molecules (Hartenfeller and Schneider, 2011). While more directed than library generation, these approaches still explored locally, limiting the diversity of discovered molecules.

The first work to apply current ML techniques to this problem was Gómez-Bombarelli et al. (2018) (in a late 2016 preprint). Their idea was to search by learning a mapping from molecular space to continuous space and back. With this mapping it is possible to leverage well-studied optimization techniques to do search: local search can be done via gradient descent and global search via Bayesian optimization (Snoek et al., 2012; Gardner et al., 2014). For such a mapping, the authors chose to represent molecules as SMILES strings (Weininger, 1988) and leverage advances in generative models for text (Bowman et al., 2016) to learn a character variational autoencoder (CVAE) (Kingma and Welling, 2014). Shortly after this work, in an early 2017 preprint, Segler et al. (2017) trained recurrent neural networks (RNNs) to take properties as input and output SMILES strings with these properties, with molecular search done using reinforcement learning (RL).

In Search of Molecular Validity.

However, the SMILES string representation is very brittle: if individual characters are changed or swapped, the string may no longer represent any molecule (an invalid molecule). Thus, the CVAE often produced invalid molecules (in one experiment, sampling from the continuous space produced valid molecules only 0.7% of the time (Kusner et al., 2017)). To address this validity problem, there has been a flurry of recent work using alternative molecular representations such as parse trees (Kusner et al., 2017; Dai et al., 2018) or graphs (Simonovsky and Komodakis, 2018; De Cao and Kipf, 2018; Li et al., 2018; Jin et al., 2018; Liu et al., 2018; You et al., 2018; Jin et al., 2019), where Jin et al. (2018), Liu et al. (2018) and You et al. (2018) guarantee validity. In parallel, there has been RL-based work that aims to learn a validity function directly during training (Guimaraes et al., 2017; Janz et al., 2018).

2.3 The Molecular Recipe Problem

Crucially, all of the works in the previous section solving the molecular search problem focus purely on optimizing molecules towards desirable properties. These works, in addressing the downsides of VS, removed a benefit of it: knowledge of the synthesis pathway of each molecule. Without this we do not know how practical ML-generated molecules are to make.

Addressing this concern means addressing the molecular recipe problem: what molecules are we able to make, given a set of readily-available starting molecules? So far, this problem has been addressed independently of the molecular search problem through synthesis planning (SP) (Segler et al., 2018). SP works by recursively deconstructing a molecule. This deconstruction is done via (reversed) reaction predictors: models that predict how reactant molecules produce a product molecule. More recently, novel ML models have been designed for reaction prediction (Wei et al., 2016; Segler and Waller, 2017; Jin et al., 2017; Schwaller et al., 2018a; Bradshaw et al., 2019; Schwaller et al., 2018b).

2.4 This Work

In this paper, we propose to address both the molecular search problem and the molecular recipe problem jointly. To do so, we propose a generative model over molecules using the following map: First, a mapping from continuous space to a set of known, reliable, easy-to-obtain reactant molecules. Second a mapping from this set of reactant molecules to a final product molecule, based on a reaction prediction model (Wei et al., 2016; Segler and Waller, 2017; Jin et al., 2017; Schwaller et al., 2018b; Bradshaw et al., 2019). Thus our generative model not only generates molecules, but also a synthesis route using available reactants. This addresses the molecular recipe problem, and also the molecular search problem, as the learned continuous space can also be used for search. Compared to previous work, in this work we are searching for new molecules through virtual chemical reactions, more directly simulating how molecules are actually discovered in the lab.

Concretely, we argue that our model, which we shall introduce in the next section, has several advantages over the current deep generative models of molecules reviewed previously:

Better extrapolation properties

Generating molecules through graph-editing operations that represent reactions gives us, we hope, strong inductive biases for extrapolating well.

Validity of generated molecules

Naive generation of molecular SMILES strings or graphs can lead to molecules that are invalid. Although the syntactic validity can be fixed by using masking (Kusner et al., 2017; Liu et al., 2018), the molecules generated can often still be semantically invalid. By generating molecules from chemically stable reactants by means of reactions, our model proposes more semantically valid molecules.

Provide synthesis routes

Molecules proposed by other methods often cannot be evaluated in practice, as chemists do not know how to synthesize them. As a byproduct, our model suggests synthetic routes, which have useful, practical value.

3 Model

In this section we describe our model. We define the set of all possible valid molecular graphs as $\mathcal{G}$, with an individual graph $g \in \mathcal{G}$ representing the atoms of a molecule as its nodes and the types of bonds between these atoms (we consider single, double and triple bonds) as its edge types. The set of common reactant molecules, easily procurable by a chemist, which we want to act as building blocks for any final molecule, is a subset of this: $\mathcal{R} \subset \mathcal{G}$.

As discussed in the previous section (and shown in Figure 1), our generative model for molecules is the composition of two parts: (1) a decoder from a continuous latent space, $z \in \mathbb{R}^d$, to a bag (i.e. multiset) $\mathcal{B}$ of easily procurable reactants (note that we allow molecules to be present multiple times as reactants, although in practice many reactions only have one instance of a particular reactant); and (2) a reaction predictor model that transforms this bag of molecules into a multiset of product molecules.

The benefit of this approach is that for step (2) we can pick from several existing reaction predictor models, including recently proposed methods that have used ML techniques (Kayala et al., 2011; Segler and Waller, 2017; Schwaller et al., 2018b; Bradshaw et al., 2019; Coley et al., 2019). In this work we use the Molecular Transformer (MT) of Schwaller et al. (2018b), as it has recently been shown to provide state-of-the-art performance in this task (Schwaller et al., 2018b, Table 4).

This leaves us with task (1): learning to decode to (and encode from) a bag of reactants, using a parameterized encoder $q_\phi(z \mid \mathcal{B})$ and decoder $p_\theta(\mathcal{B} \mid z)$. We call this co-occurrence model Molecule Chef; by moving around in the latent space, Molecule Chef selects different “bags of reactants”.
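To make the composition concrete, the following is a minimal sketch of how the two stages fit together at sampling time. Here `decoder` and `reaction_predictor` are hypothetical stand-ins for the trained Molecule Chef decoder and the reaction prediction model, and the latent dimensionality is illustrative.

```python
import numpy as np

def generate_product(decoder, reaction_predictor, latent_dim=25, rng=None):
    """Sample z from the prior, decode a bag of reactants, then react them."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(latent_dim)          # standard normal prior over z
    reactant_bag = decoder(z)                    # multiset of reactant SMILES
    products = reaction_predictor(reactant_bag)  # predicted product molecule(s)
    return reactant_bag, products
```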

Again there are several viable options for how to learn Molecule Chef. For instance, one could use a VAE for this task (Kingma and Welling, 2014; Rezende et al., 2014). However, when paired with a complex decoder these models are often difficult to train (Bowman et al., 2016; Alemi et al., 2018), such that much of the previous work on generating graphs has tuned down the KL regularization term in these models (Liu et al., 2018; Kusner et al., 2017). We therefore instead propose using the WAE objective (Tolstikhin et al., 2018), which involves minimizing

$$\mathbb{E}_{\mathcal{B} \sim p_{\text{data}}}\, \mathbb{E}_{z \sim q_\phi(z \mid \mathcal{B})}\big[c(\mathcal{B}, \hat{\mathcal{B}}(z))\big] \;+\; \lambda\, D\big(q_\phi(z),\, p(z)\big),$$

where $c$ is a cost function that enforces the reconstructed bag $\hat{\mathcal{B}}$ to be similar to the encoded one, and $D$ is a divergence measure which forces the marginalized distribution of all encodings, $q_\phi(z) = \mathbb{E}_{p_{\text{data}}}[q_\phi(z \mid \mathcal{B})]$, to match the prior $p(z)$ on the latent space, weighted in relative importance by $\lambda$. Following Tolstikhin et al. (2018) we use the maximum mean discrepancy (MMD) divergence measure and a standard normal prior over the latents. We choose $c$ so that the first term matches the reconstruction term we would obtain in a VAE, i.e. $c(\mathcal{B}, \hat{\mathcal{B}}) = -\log p_\theta(\mathcal{B} \mid z)$. This means that the objective differs from a VAE only in the second, regularization term: rather than matching each individual encoding to the prior, we match the marginalized distribution over all datapoints. Empirically, we find that this trains well and does not suffer from the same local-optimum issues as the VAE.
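As a rough illustration, the following sketch shows a WAE-style training loss with an RBF-kernel MMD estimator in PyTorch. The kernel bandwidth and the regularization weight `lam` are illustrative defaults, not values reported in the paper, and `recon_nll` stands for the decoder's negative log-likelihood of the encoded bag.

```python
import torch

def rbf_mmd2(z_q, z_p, sigma2=1.0):
    """Estimate MMD^2 between encoded latents z_q and prior samples z_p
    (both [batch, d], batch > 1) using an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma2))
    k_qq, k_pp, k_qp = k(z_q, z_q), k(z_p, z_p), k(z_q, z_p)
    n, m = z_q.shape[0], z_p.shape[0]
    return ((k_qq.sum() - k_qq.diag().sum()) / (n * (n - 1))
            + (k_pp.sum() - k_pp.diag().sum()) / (m * (m - 1))
            - 2 * k_qp.mean())

def wae_loss(recon_nll, z_q, lam=10.0):
    """Reconstruction cost plus lambda times the MMD divergence to the
    standard normal prior, in the style of the WAE-MMD objective."""
    z_p = torch.randn_like(z_q)  # samples from the prior p(z) = N(0, I)
    return recon_nll + lam * rbf_mmd2(z_q, z_p)
```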

3.1 Encoder and Decoder

We can now describe the structure of our encoder and decoder. In these functions it is often convenient to work with $d$-dimensional vector embeddings of graphs, $\mathbf{h}_g \in \mathbb{R}^d$. Again we are faced with a series of possible alternatives for computing these embeddings. For instance, we could ignore the structure of the molecule and learn a free embedding for each, or use fixed molecular fingerprints, such as Morgan fingerprints (Morgan, 1965). We instead choose to use deep graph neural networks (Merkwirth and Lengauer, 2005; Duvenaud et al., 2015; Battaglia et al., 2018), which produce representations that are invariant to graph isomorphism.

Deep graph neural networks have been shown to perform well on a variety of tasks involving small organic molecules, and their advantages over the previously mentioned alternatives are that (1) they take the structure of the graph into account and (2) they can learn which characteristics are important when forming higher-level representations. In particular, in this work we use 4-layer Gated Graph Neural Networks (GGNNs) (Li et al., 2016), which compute higher-level representations for each node. These node-level representations are in turn combined by a weighted sum to form a graph-level representation invariant to the order of the nodes, in an operation referred to as an aggregation transformation (Johnson, 2017, §3).
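A sketch of such an aggregation transformation, assuming a sigmoid-gated weighted sum over node embeddings; the layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

class GatedGraphReadout(nn.Module):
    """Order-invariant graph embedding from node embeddings via a gated,
    weighted sum (an aggregation transformation in the sense of Johnson, 2017)."""
    def __init__(self, node_dim=101, graph_dim=50):
        super().__init__()
        self.gate = nn.Linear(node_dim, graph_dim)  # how much each node counts
        self.proj = nn.Linear(node_dim, graph_dim)  # what each node contributes

    def forward(self, node_embeddings):  # [num_nodes, node_dim]
        weights = torch.sigmoid(self.gate(node_embeddings))
        return (weights * self.proj(node_embeddings)).sum(dim=0)  # [graph_dim]
```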

Encoder
Figure 2: The encoder of Molecule Chef. This maps from a multiset of reactants to a distribution over latent space. There are three main steps: (1) the reactant molecules are embedded into a continuous space using GGNNs (Li et al., 2016) to form molecule embeddings; (2) the molecule embeddings in the multiset are summed to form one order-invariant embedding for the whole multiset; (3) this embedding is used as input to a neural network which parameterizes a Gaussian distribution over $z$.

The structure of Molecule Chef’s encoder, $q_\phi(z \mid \mathcal{B})$, is shown in Figure 2. For the $i$th data point, the encoder takes as input the multiset of reactants $\mathcal{B}_i$. It first computes the representation of each individual reactant molecule graph using the GGNN, before summing these representations to obtain a representation that is invariant to the order of the multiset. A feed-forward network is then used to parameterize the mean and variance of a Gaussian distribution over $z$.
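A minimal sketch of these three steps, assuming a `ggnn` callable that maps a molecular graph to a fixed-size vector; all dimensionalities are illustrative.

```python
import torch
import torch.nn as nn

class MoleculeChefEncoder(nn.Module):
    """Embed each reactant graph, sum the embeddings (order invariance over
    the multiset), then parameterize a Gaussian over the latent space."""
    def __init__(self, ggnn, graph_dim=50, latent_dim=25):
        super().__init__()
        self.ggnn = ggnn
        self.to_stats = nn.Linear(graph_dim, 2 * latent_dim)
        self.latent_dim = latent_dim

    def forward(self, reactant_graphs):
        embeddings = torch.stack([self.ggnn(g) for g in reactant_graphs])
        bag_embedding = embeddings.sum(dim=0)  # invariant to reactant order
        stats = self.to_stats(bag_embedding)
        mean, log_var = stats[:self.latent_dim], stats[self.latent_dim:]
        return mean, log_var
```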

Decoder

The decoder, $p_\theta(\mathcal{B} \mid z)$ (Figure 3), maps from the latent space to a multiset of reactant molecules. These reactants are typically small molecules, which means we could fit a deep generative model that produces them from scratch. However, to better mimic the process of selecting reactant molecules from an easily obtainable set, we instead restrict the output of the decoder to pick molecules from the fixed set of reactants, $\mathcal{R}$.

This happens sequentially using a recurrent neural network (RNN), with the full process described in Algorithm 1. The latent vector $z$ is used to parameterize the initial hidden state of the RNN, and each selected reactant is fed back as input to the RNN at the next generation step. During training we randomly sample the ordering of the reactants and use teacher forcing.

Require: z (latent space sample), GGNN (for embedding molecules), RNN (recurrent neural network), R (set of easy-to-obtain reactant molecules), h_halt (learnt “halt” embedding), W (learnt matrix that projects from the size of the latent space to the size of the RNN’s hidden space)
  h_0 ← W z;  x_0 ← START  {start symbol}
  B ← ∅  {empty reactant bag}
  for t = 1 to T_max do
    h_t ← RNN(h_{t-1}, embed(x_{t-1}))
    M ← STACK([GGNN(g) for all g in R] + [h_halt])
    logits_t ← M h_t
    x_t ← sample (or argmax) from softmax(logits_t)
    if x_t = HALT then
      break  {if the logit corresponding to the halt embedding is selected, we stop early}
    else
      B ← B ∪ {x_t}
    end if
  end for
  return B
Algorithm 1: Molecule Chef’s Decoder
Figure 3: The decoder of Molecule Chef. The decoder generates the multiset of reactants sequentially through calls to an RNN. At each step the model either picks one reactant from the pool or halts, finishing the sequence. The latent vector, $z$, is used to parameterize the initial hidden state of the RNN. Selected reactants are fed back into the RNN at the next step. The resulting reactant bag is later fed through a reaction predictor to form a final product.

3.2 Adding a predictive penalty loss to the latent space

As discussed in Section 2.2, we are interested in using and evaluating our model’s performance on the molecular search problem, that is, using the learnt latent space to find new molecules with desirable properties. In reality we would wish to optimize some complex chemical property that can only be measured experimentally. As a surrogate, following Gómez-Bombarelli et al. (2018), we instead optimize the QED (Quantitative Estimate of Drug-likeness (Bickerton et al., 2012)) score of a molecule, since a deterministic mapping from molecules to this score exists in RDKit (RDKit, online).

To this end, in a similar manner to Liu et al. (2018, §4.3) and Jin et al. (2018, §3.3), we simultaneously train a two-hidden-layer property-predictor network for use in local optimization tasks. This network tries to predict the QED score of the final product from the latent encoding of the associated bag of reactants. The use of this property-predictor network for local optimization is described in Section 4.2.
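For reference, the QED score is indeed available as a deterministic function in RDKit:

```python
from rdkit import Chem
from rdkit.Chem import QED

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(QED.qed(mol))  # drug-likeness score in [0, 1]
```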

4 Evaluation

In this section we evaluate Molecule Chef on (1) its ability to generate a diverse set of valid molecules; (2) how useful its learnt latent space is when optimizing product molecules for some property; and (3) whether, by training a regressor back from product molecules to the latent space, Molecule Chef can be used as part of a setup to perform retrosynthesis.

In order to train our model we need a dataset of reactant bags. For this we use the USPTO dataset (Lowe, 2012), processed and cleaned up by Jin et al. (2017). We filter out reagents (molecules that provide the context under which the reaction occurs but do not contribute atoms to the final products) following the approach of Schwaller et al. (2018a, §3.1).

We wish to use as possible reactant molecules only popular molecules that a chemist would have easy access to. To this end, we filter our training dataset so that each reaction contains only reactants that occur at least 15 times across different reactions in the original, larger USPTO training dataset. This leaves a dataset of 34426 unique reactant bags for training Molecule Chef, covering 4344 unique reactants in total. For training the baselines, we combine these 4344 unique reactants with the products of their different combinations; even though Molecule Chef does not see the products during training, the reaction predictor has.
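A sketch of this filtering step, assuming the reactions are available as bags (lists) of canonical reactant SMILES; the function name is hypothetical.

```python
from collections import Counter

def build_reactant_vocab(reaction_bags, min_count=15):
    """Keep reactants occurring at least `min_count` times, and keep only
    reactant bags whose members all survive the cut."""
    counts = Counter(r for bag in reaction_bags for r in bag)
    vocab = {r for r, c in counts.items() if c >= min_count}
    kept = [bag for bag in reaction_bags if all(r in vocab for r in bag)]
    return vocab, kept
```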

4.1 Generation

Model Name Validity Uniqueness Novelty Quality FCD
Molecule Chef + MT 99.15 94.92 87.41 95.87 0.76
AAE (Kadurin et al., 2017; Polykovskiy et al., 2018) 85.91 98.54 93.21 94.46 1.06
CGVAE (Liu et al., 2018) 100.00 93.51 95.80 44.26 11.29
CVAE (Gómez-Bombarelli et al., 2018) 12.02 56.28 85.57 52.64 37.14
GVAE (Kusner et al., 2017) 12.92 70.07 87.80 46.65 28.81
LSTM (Segler et al., 2017) 91.19 93.43 73.72 99.68 0.47
Table 1: Validity, uniqueness, novelty and normalized quality (all as %, higher is better) of the products (or molecules) generated by decoding 20k random samples from the prior $p(z)$. Quality is the proportion of molecules that pass the quality filters proposed in Brown et al. (2018, §3.3), normalized such that the score on the training set is 100. FCD is the Fréchet ChemNet Distance (Preuer et al., 2018), capturing a notion of distance between the generated molecules and the training dataset (lower is better). MT stands for the Molecular Transformer (Schwaller et al., 2018b).

We begin by analyzing our model using the metrics favored by previous work (Jin et al., 2018; Liu et al., 2018; Li et al., 2018; Kusner et al., 2017): validity, uniqueness and novelty. (Note that we have extended the definition of these metrics to a bag (multiset) of products, given that our model can output multiple molecules per reaction. However, when sampling 20000 times from the prior of our model, we generate single-product bags 96.3% of the time, so in practice we are mostly using the same definitions as the previous work, which always generated single molecules.) Validity is defined as requiring that at least one of the molecules in the bag of products can be parsed by RDKit. For a bag of products to be unique, we require it to have at least one valid molecule that the model has not generated before in any previously seen bag. Finally, for computing novelty we require that the valid molecules not be present in the same training set we use for the baseline generative models.

In addition, we compute the Fréchet ChemNet Distance (FCD) (Preuer et al., 2018) between the valid molecules generated by each method and our baseline training set. Finally, to assess the quality of the generated molecules, we record the (train-normalized) proportion of valid molecules that pass the quality filters proposed by Brown et al. (2018, §3.3); these filters aim to remove molecules that are “potentially unstable, reactive, laborious to synthesize, or simply unpleasant to the eye of medicinal chemists”.
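For the common single-product case, these metrics reduce to roughly the following sketch, using RDKit's parser as the validity check:

```python
from rdkit import Chem

def generation_metrics(sampled_smiles, training_smiles):
    """Validity, uniqueness and novelty for a list of sampled SMILES."""
    valid = [s for s in sampled_smiles if Chem.MolFromSmiles(s) is not None]
    unique = set(valid)
    novel = unique - set(training_smiles)
    return {
        "validity": len(valid) / len(sampled_smiles),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }
```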

For the baselines we consider the character VAE (CVAE) (Gómez-Bombarelli et al., 2018), the grammar VAE (GVAE) (Kusner et al., 2017), the AAE (adversarial autoencoder) (Kadurin et al., 2017), the constrained graph VAE (CGVAE) (Liu et al., 2018), and a stacked LSTM generator with no latent space (Segler et al., 2017). Further details about the baselines can be found in the appendix.

The results are shown in Table 1. As Molecule Chef decodes to a bag made up from a predefined set of molecules, the reactants going into the reaction predictor are always valid. The validity of the final product is not 100%, as the reaction predictor can make invalid edits to these molecules, but we see that the products are valid in a high proportion of cases. Furthermore, it is very encouraging that the generated molecules often pass the quality filters, giving evidence that building molecules up from stable reactant building blocks often leads to stable products.

4.2 Local Optimization

As discussed in Section 3.2, when training Molecule Chef we can simultaneously train a property predictor network, mapping from the latent space of Molecule Chef to the QED score of the final product. In this section we look at using the gradient information obtainable from this network to do local optimization to find a molecule created from our reactant pool that has a high QED score.

Figure 4: KDE plot showing that the distribution of the best QEDs found through local optimization, using our trained property predictor for QED, has more mass on higher QED scores than the best found by a random walk. The distribution of starting locations (sampled from the training data) is shown in green. The final products for a given reactant bag are predicted using the MT (Schwaller et al., 2018b).

Figure 5: Having learnt a latent space which can map to products through reactants, we can learn a regressor back from the suggested products to latent space (orange arrow shown) and couple this with Molecule Chef’s decoder to see if we can do retrosynthesis: the act of computing the reactants that create a particular product.

We evaluate the local optimization of molecular properties by taking 250 bags of reactants, encoding them into the latent space of Molecule Chef, and then repeatedly moving in the latent space along the gradient direction of the property predictor until we have decoded ten different reactant bags. As a comparison, we instead move in a random walk until we have also decoded ten different reactant bags. In Figure 4 we look at the distribution of the best QED score found among these ten reactant bags, and how this compares to the QEDs we started with.

When looking at individual optimization runs, we see that the QEDs vary considerably between different products, even when made from similar reactants. However, Figure 4 shows that overall the distribution of the best QED scores found improves when purposefully optimizing for this property. This is encouraging, as it gives evidence of the utility of these models for the molecular search problem.
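A sketch of this procedure, with `property_net` and `decoder` standing in for the trained property predictor and Molecule Chef decoder; the step size and iteration cap are illustrative.

```python
import torch

def local_search(z0, property_net, decoder, n_bags=10, step_size=0.01,
                 max_steps=1000):
    """Hill-climb in latent space on the predicted property until `n_bags`
    distinct reactant bags have been decoded."""
    z = z0.clone().requires_grad_(True)
    bags = []
    for _ in range(max_steps):
        score = property_net(z).sum()          # scalar predicted QED
        grad, = torch.autograd.grad(score, z)  # d(score)/dz
        z = (z + step_size * grad).detach().requires_grad_(True)
        bag = decoder(z.detach())              # multiset of reactant SMILES
        if bag not in bags:
            bags.append(bag)
        if len(bags) == n_bags:
            break
    return bags
```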

4.3 Retrosynthesis

A unique feature of our approach is that we learn a decoder from latent space to a bag of reactants. This gives us the ability to do retrosynthesis: by training a model to map from products to their associated reactants’ representation in latent space, and combining it with Molecule Chef’s decoder, we can generate a bag of reactants for a given product. This process is highlighted in Figure 5. Retrosynthesis is a difficult task: there are often multiple possible ways to create the same product, and current state-of-the-art approaches are built on large reaction databases and can handle multi-step reactions (Segler et al., 2018). Nevertheless, we believe that our model could open up interesting and exciting new approaches to this task. We therefore train a small network, based on the same graph neural network structure used in Molecule Chef followed by four fully connected layers, to regress from products to latent space.
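A sketch of such a regressor, reusing a GGNN-style graph embedder followed by four fully connected layers; the hidden sizes are illustrative.

```python
import torch.nn as nn

class ProductToLatent(nn.Module):
    """Regress from a product molecule's graph to Molecule Chef's latent space."""
    def __init__(self, ggnn, graph_dim=50, latent_dim=25, hidden=100):
        super().__init__()
        self.ggnn = ggnn  # maps a molecular graph to a [graph_dim] vector
        self.mlp = nn.Sequential(
            nn.Linear(graph_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, product_graph):
        return self.mlp(self.ggnn(product_graph))
```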

Figure 6: An example of performing retrosynthesis prediction using a trained regressor from products to latent space. This reactant-product pair has not been seen in the training set of Molecule Chef. Further examples are shown in the appendix.

Figure 7 (a) Reachable Products, (b) Unreachable Products: Assessing the correlation between the QED scores for the original product and its reconstruction (see text for details). We assess on two portions of the test set: products made up of only reactants in Molecule Chef’s vocabulary are called ‘Reachable Products’; those that have at least one reactant that is not are called ‘Unreachable Products’.

A few examples of the predicted reactants corresponding to products from reactions in the USPTO test set, but which can be made in one step from the predefined possible reactants, are shown in Figure 6 and the appendix. We see that this approach, although not always able to suggest the correct whole reactant bag, often chooses similar reactants that on reaction produce structures similar to the original product we were trying to synthesize. While we would not expect this approach to retrosynthesis to be competitive with complex planning tools, we think it provides a promising new approach, which could be used to identify bags of reactants that produce molecules similar to a desired target molecule. In practice, it would be valuable to be pointed directly to molecules with similar properties to a target if they are easier to make than the target, since it is the properties of the molecules, and not the actual molecules themselves, that we are after.

With this in mind, we assess our approach as follows: (1) we take a product and perform retrosynthesis on it to produce a bag of reactants; (2) we transform this bag of reactants using the Molecular Transformer to produce a new reconstructed product; and finally (3) we plot the resulting reconstructed product molecule’s QED score against the QED score of the initial product. We evaluate on the test set of USPTO, which we split into two sets: ‘Reachable Products’, which can be made fully from Molecule Chef’s reactant vocabulary, and ‘Unreachable Products’, which have at least one reactant that is not in the vocabulary. The results are shown in Figure 7; overall we see that there is some correlation between the properties of products and the properties of their reconstructions.

4.4 Qualitative Quality of Samples

Figure 8: Random walk in latent space. See text for details.

In Figure 8 we show molecules generated from a random walk starting from the encoding of a particular molecule (shown in the left-most column). We compare the CVAE, GVAE, and Molecule Chef (for Molecule Chef we encode the reactant bag known to generate the same molecule). We showed all generated molecules to a domain expert and asked them to evaluate their properties in terms of stability, toxicity, oxidizing power, and corrosiveness. Many molecules produced by the CVAE and GVAE show undesirable features, unlike the molecules generated by Molecule Chef.

5 Discussion

In this work we have introduced Molecule Chef, a model that generates synthesizable molecules. By constructing molecules through selecting reactants and running chemical reactions, while performing optimization in continuous latent space, we can combine the strengths of previous VAE-based models and classical discrete de-novo design algorithms based on virtual reactions.

Acknowledgements

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. JB also acknowledges support from an EPSRC studentship.

References

  • Alemi et al. (2018) Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In ICML, pages 159–168, 2018.
  • Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • Bickerton et al. (2012) G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90, 2012.
  • Bowman et al. (2016) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 2016.
  • Bradshaw et al. (2019) John Bradshaw, Matt J Kusner, Brooks Paige, Marwin HS Segler, and José Miguel Hernández-Lobato. A generative model for electron paths. In ICLR, 2019.
  • Brown et al. (2018) Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher. Guacamol: Benchmarking models for de novo molecular design. arXiv preprint arXiv:1811.09621, 2018.
  • Chevillard and Kolb (2015) Florent Chevillard and Peter Kolb. Scubidoo: A large yet screenable and easily searchable database of computationally created chemical compounds optimized toward high likelihood of synthetic tractability. J. Chem. Inf. Mod., 55(9):1824–1835, 2015.
  • Coley et al. (2019) Connor W Coley, Wengong Jin, Luke Rogers, Timothy F Jamison, Tommi S Jaakkola, William H Green, Regina Barzilay, and Klavs F Jensen. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science, 10(2):370–377, 2019.
  • Dai et al. (2018) Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. In ICLR, 2018.
  • De Cao and Kipf (2018) Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. In ICML Deep Generative Models Workshop, 2018.
  • Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
  • Gardner et al. (2014) Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. Bayesian optimization with inequality constraints. In ICML, pages 937–945, 2014.
  • Gómez-Bombarelli et al. (2018) Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a Data-Driven continuous representation of molecules. ACS Cent Sci, 4(2):268–276, February 2018.
  • Guimaraes et al. (2017) Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, and Alán Aspuru-Guzik. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843, 2017.
  • Hartenfeller and Schneider (2011) Markus Hartenfeller and Gisbert Schneider. Enabling future drug discovery by de novo design. Wiley Interdisc. Rev. Comp. Mol. Sci., 1(5):742–759, 2011.
  • Hu et al. (2011) Qiyue Hu, Zhengwei Peng, Jaroslav Kostrowicki, and Atsuo Kuki. Leap into the Pfizer global virtual library (PGVL) space: creation of readily synthesizable design ideas automatically. In Chemical Library Design, pages 253–276. Springer, 2011.
  • Irwin et al. (2012) John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. ZINC: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757–1768, 2012.
  • Janz et al. (2018) David Janz, Jos van der Westhuizen, Brooks Paige, Matt J Kusner, and José Miguel Hernández-Lobato. Learning a generative model for validity in complex discrete structures. In ICLR, 2018.
  • Jin et al. (2017) Wengong Jin, Connor W Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems, 2017.
  • Jin et al. (2018) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In ICML, 2018.
  • Jin et al. (2019) Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola. Learning multimodal graph-to-graph translation for molecular optimization. In ICLR, 2019.
  • Johnson (2017) Daniel D Johnson. Learning graphical state transitions. In ICLR, 2017.
  • Kadurin et al. (2017) Artur Kadurin, Alexander Aliper, Andrey Kazennov, Polina Mamoshina, Quentin Vanhaelen, Kuzma Khrabrov, and Alex Zhavoronkov. The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget, 8(7):10883, 2017.
  • Kayala et al. (2011) Matthew A Kayala, Chloé-Agathe Azencott, Jonathan H Chen, and Pierre Baldi. Learning to predict chemical reactions. Journal of chemical information and modeling, 51(9):2209–2222, 2011.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • Kusner et al. (2017) Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ICLR, 2016.
  • Li et al. (2018) Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, March 2018.
  • Liu et al. (2018) Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L Gaunt. Constrained graph variational autoencoders for molecule design. In Advances in neural information processing systems, 2018.
  • Lowe (2012) Daniel Mark Lowe. Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge, 2012.
  • Merkwirth and Lengauer (2005) Christian Merkwirth and Thomas Lengauer. Automatic generation of complementary descriptors with molecular graph networks. Journal of chemical information and modeling, 45(5):1159–1168, 2005.
  • Morgan (1965) HL Morgan. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of Chemical Documentation, 5(2):107–113, 1965.
  • Nicolaou et al. (2016) Christos A Nicolaou, Ian A Watson, Hong Hu, and Jibo Wang. The proximal lilly collection: Mapping, exploring and exploiting feasible chemical space. Journal of chemical information and modeling, 56(7):1253–1266, 2016.
  • Polykovskiy et al. (2018) Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, Artur Kadurin, Sergey Nikolenko, Alan Aspuru-Guzik, and Alex Zhavoronkov. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. arXiv preprint arXiv:1811.12823, 2018.
  • Preuer et al. (2018) Kristina Preuer, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter, and Günter Klambauer. Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736–1741, 2018.
  • Pyzer-Knapp et al. (2015) Edward O Pyzer-Knapp, Changwon Suh, Rafael Gómez-Bombarelli, Jorge Aguilera-Iparraguirre, and Alán Aspuru-Guzik. What is high-throughput virtual screening? a perspective from organic materials discovery. Annual Review of Materials Research, 45:195–216, 2015.
  • Rarey and Stahl (2001) Matthias Rarey and Martin Stahl. Similarity searching in large combinatorial chemistry spaces. Journal of Computer-Aided Molecular Design, 15(6):497–520, 2001.
  • (38) RDKit, online. RDKit: Open-source cheminformatics. http://www.rdkit.org. [Online; accessed 01-February-2018].
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
  • Schneider and Schneider (2016) Petra Schneider and Gisbert Schneider. De novo design at the edge of chaos: Miniperspective. J. Med. Chem., 59(9):4077–4086, 2016.
  • Schwaller et al. (2018a) Philippe Schwaller, Théophile Gaudin, Dávid Lányi, Costas Bekas, and Teodoro Laino. “Found in Translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci., 9:6091–6098, 2018a. doi: 10.1039/C8SC02339E.
  • Schwaller et al. (2018b) Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Costas Bekas, and Alpha A Lee. Molecular transformer for chemical reaction prediction and uncertainty estimation. arXiv preprint arXiv:1811.02633, 2018b.
  • Segler and Waller (2017) Marwin HS Segler and Mark P Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry–A European Journal, 23(25):5966–5971, 2017.
  • Segler et al. (2017) Marwin HS Segler, Thierry Kogej, Christian Tyrchan, and Mark P Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci., 4(1):120–131, 2017.
  • Segler et al. (2018) Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
  • Shoichet (2004) Brian K Shoichet. Virtual screening of chemical libraries. Nature, 432(7019):862, 2004.
  • Simonovsky and Komodakis (2018) Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational autoencoders. In Věra Kůrková, Yannis Manolopoulos, Barbara Hammer, Lazaros Iliadis, and Ilias Maglogiannis, editors, Artificial Neural Networks and Machine Learning – ICANN 2018, pages 412–422, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01418-6.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • Tolstikhin et al. (2018) Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018.
  • van Hilten et al. (2019) Niek van Hilten, Florent Chevillard, and Peter Kolb. Virtual compound libraries in computer-assisted drug discovery. Journal of chemical information and modeling, 2019.
  • Wei et al. (2016) Jennifer N Wei, David Duvenaud, and Alán Aspuru-Guzik. Neural networks for the prediction of organic chemistry reactions. ACS central science, 2(10):725–732, 2016.
  • Weininger (1988) David Weininger. SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
  • You et al. (2018) Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, 2018.

Appendix A Appendix

A.1 Generation Benchmarks on ZINC

We also ran the baselines for the generation task on the ZINC dataset [Irwin et al., 2012]. The results are shown in Table 2.

Model Name Validity Uniqueness Novelty Quality FCD
AAE [Kadurin et al., 2017, Polykovskiy et al., 2018] 87.69 100.00 99.99 95.69 7.32
CGVAE [Liu et al., 2018] 100.00 95.39 96.39 43.29 14.93
CVAE [Gómez-Bombarelli et al., 2018] 0.31 40.98 24.59 127.99 39.68
GVAE [Kusner et al., 2017] 3.69 85.35 94.98 38.38 26.83
LSTM [Segler et al., 2017] 95.72 99.98 99.93 108.20 7.97
Table 2: Generation results for the baseline models when trained on the ZINC dataset [Irwin et al., 2012]. The first four result columns show the validity, uniqueness, novelty and normalized quality (all as %, higher is better) of the products (or molecules) generated by decoding 20k random samples from the prior $p(z)$. Quality is the proportion of molecules that pass the quality filters proposed in Brown et al. [2018, §3.3], normalized such that the score on the USPTO-derived training dataset (used in the main paper) is 100. FCD is the Fréchet ChemNet Distance [Preuer et al., 2018], capturing a notion of distance between the generated molecules and the USPTO-derived training dataset used in the main paper.

A.2 Further Random Walk Example

Figure 9: Another example random walk in latent space. See §4.4 for further details.

A.3 Further Retrosynthesis Results

In this section we first provide more retrosynthesis examples, before describing an extra experiment in which we assess how good the retrosynthesis pipeline is at finding molecules with similar properties, even when it does not reconstruct the correct reactants themselves.

A.3.1 Further Examples

Figure 10 (a)-(d): Further examples of the predicted reactants associated with a given product, for product molecules not in Molecule Chef’s training dataset but whose reactants all belong to Molecule Chef’s vocabulary (i.e. the reachable dataset).
Figure 11 (a)-(b): Further examples of the predicted reactants associated with a given product, for product molecules not in Molecule Chef’s training dataset and with at least one reactant not in Molecule Chef’s vocabulary (i.e. the unreachable dataset).

A.3.2 ChemNet Distances between Products and their Reconstructions

We also consider an experiment in which we analyze the Euclidean distance between the ChemNet embeddings of the product and the reconstructed product (found by feeding the original product through our retrosynthesis pipeline and then the Molecular Transformer). ChemNet embeddings are used when calculating the FCD score between molecule distributions [Preuer et al., 2018], and so should capture various properties of the molecule. While learning Molecule Chef we include a one-layer NN regressor from the latent space to the associated ChemNet embeddings, whose MSE loss is minimized during training.

To establish how randomly chosen pairs of molecules in our dataset differ from each other under this metric, we provide a distribution of the distances of random pairs. This distribution is formed by taking each of the molecules in our dataset (consisting of all the reactants and their associated products), matching it with another randomly chosen molecule from this set, and measuring the Euclidean distance between the embeddings of each pair.
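A sketch of how such a random-pairs baseline can be formed from a matrix of embeddings (under this simple scheme a molecule may occasionally be paired with itself, which is negligible for large sets):

```python
import numpy as np

def random_pair_distances(embeddings, rng=None):
    """Euclidean distance from each molecule's embedding to that of
    another randomly chosen molecule. `embeddings` is [n, d]."""
    rng = rng or np.random.default_rng()
    partners = rng.permutation(len(embeddings))
    return np.linalg.norm(embeddings - embeddings[partners], axis=1)
```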

The results are shown in Figure 12. We see that the distribution of distances between the products and their reconstructions has greater mass on smaller distances compared to the random pairs baseline, even when evaluated on the unreachable dataset.

Figure 12 (a) Reachable dataset, (b) Unreachable dataset: KDE plots showing the distribution of the Euclidean distances between the ChemNet embeddings [Preuer et al., 2018] of our products and their reconstructed products.

A.4 Details about our Dataset

In this section we provide further details about the molecules used in training our model and the baselines. We also describe details of the molecules used in the retrosynthesis experiments.

We extract reactants that occur at least 15 times in the USPTO train dataset [Lowe, 2012], as processed by Jin et al. [2017], to use as possible reactants output by the Molecule Chef. In total we have 4344 reactants, and a training set of 34426 unique reactant bags for which these reactants co-occur. Each reactant bag is associated with a product.

For the baselines we train on these reactants and the associated products. This results in a dataset of 37686 unique molecules, consisting of the following heavy elements: ’Ag’, ’Al’, ’B’, ’Bi’, ’Br’, ’C’, ’Ca’, ’Cl’, ’Cr’, ’Cu’, ’F’, ’I’, ’K’, ’La’, ’Li’, ’Mg’, ’Mn’, ’N’, ’Na’, ’O’, ’P’, ’Pd’, ’S’, ’Se’, ’Si’, ’Sn’, ’Zn’. Some examples of the molecules found in the training set are shown in Figure 13. Note that the large number of heavy elements present, as well as the small overall dataset size, makes this a challenging learning task compared to some of the more common benchmark datasets used elsewhere, such as ZINC [Irwin et al., 2012].

Figure 13: Examples of molecules found in our dataset for training the baselines. This is a subset of the molecules found in USPTO [Lowe, 2012]. It consists of the reactants that the Molecule Chef can produce along with their corresponding products. It contains complex molecules with challenging structures to learn.

We use examples from the USPTO test dataset when performing the retrosynthesis experiments. We split this set into two subsets. The first, which we refer to as the reachable dataset, contains only reactants in Molecule Chef’s vocabulary. The second, which we refer to as the unreachable dataset, contains reactions with at least one reactant not in the vocabulary. When producing the KDE distribution plots we use a subset of 10000 items from this unreachable dataset for computational reasons.

A.5 Implementation Details

Implementation Details for the Baselines in Section 4.1 of the Main Paper

For the baselines in the generation section in the main paper we use the following implementations:

The LSTM baseline implementation follows Segler et al. [2017], which has as its alphabet a list of all individual element symbols, plus the special characters used in SMILES strings. This differs from the alphabet used by the decoder in the Molecular Transformer [Schwaller et al., 2018b], which instead extracts “bracketed” atoms directly from the training set; this means that a portion of a SMILES string such as [OH+] or [NH2-] is represented as a single symbol, rather than as a sequence of five symbols. A regular expression can be used to extract a list of all such sequences from the training data. Effectively, this makes the trade-off of increasing the alphabet size (from 47 to 203 items) while reducing the chance of making syntax errors or suggesting invalid charges. In practice we found very little qualitative or quantitative difference in the performance of the LSTM model between the two alphabets; for the sake of consistency with Molecule Chef we report the baseline using the larger alphabet.
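For illustration, a regular expression of roughly the following form recovers such bracketed-atom tokens from training SMILES (a sketch; the exact expression used is not given here):

```python
import re

BRACKET_ATOM = re.compile(r"\[[^\]]+\]")  # matches e.g. [OH+] or [NH2-]

def extract_bracket_tokens(smiles_list):
    tokens = set()
    for s in smiles_list:
        tokens.update(BRACKET_ATOM.findall(s))
    return tokens

print(extract_bracket_tokens(["C[NH2-]CC", "[OH+]c1ccccc1"]))
# e.g. {'[NH2-]', '[OH+]'}
```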

For the CGVAE we include element-charge-valence triplets that occur at least 10 times across all the molecules in the training data. At generation time we pick one starting node at random.

Other Details

The majority of experiments for Molecule Chef were run on NVIDIA Tesla K80. For running the Molecular Transformer and CGVAE, we used NVIDIA P100 and P40 GPUs, as the latter in particular required a GPU with large memory for training on the larger datasets.

For Molecule Chef we have not tried a wide range of hyperparameters. For the latent dimensionality we initially tried a dimension of 100 before settling on 25. Initially we did not anneal the learning rate, but found slightly improved performance by annealing it by a factor of 10 after 40 epochs. These changes were made after considering the reconstruction error of the model on the validation set (the validation dataset of USPTO restricted to the reactants in Molecule Chef’s vocabulary).