Constrained Bayesian Optimization for Automatic Chemical Design

09/16/2017 ∙ by Ryan-Rhys Griffiths, et al. ∙ University of Cambridge 0

Automatic Chemical Design leverages recent advances in deep generative modelling to provide a framework for performing continuous optimization of molecular properties. Although the provision of a continuous representation for prospective lead drug candidates has opened the door to hitherto inaccessible tools of mathematical optimization, some challenges remain for the design process. One known pathology is the model's tendency to decode invalid molecular structures. The goal of this thesis is to test the hypothesis that the origin of this pathology is rooted in the current formulation of Bayesian optimization. Recasting the optimization procedure as a constrained Bayesian optimization problem results in novel drug compounds produced by the model consistently ranking in the 100th percentile of the distribution over training set scores.



There are no comments yet.


page 4

page 5

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

There are two fundamental ways in which Machine Learning can be leveraged in chemistry:

  1. To evaluate a molecule for a given application.

  2. To find a promising molecule for a given application.

There has been much progress in the first use-case through the development of quantitative structure activity relationship (QSAR) models using deep learning

[Ma et al., 2015]. These models have achieved state-of-the-art results in predicting properties of known molecules.

The second use-case, finding new molecules that are useful for a given application, is arguably more important however. One existing approach for finding molecules that maximise an application-specific metric involves searching a large library of compounds, either physically or virtually Pyzer-Knapp et al. [2015]. This has the disadvantage that the search is not open-ended; if the molecule is not in the library you specify, the search won’t find it.

A second method involves the use of genetic algorithms. In this approach, a known molecule acts as a seed and a local search is performed over a discrete space of molecules. Although these methods have enjoyed success in producing biologically active compounds, an approach featuring a search over an open-ended, continuous space would be beneficial. The use of geometrical cues such as gradients to guide the search in continuous space could accelerate both drug

[Pyzer-Knapp et al., 2015, Gómez-Bombarelli et al., 2016a] and material [Hachmann et al., 2011, 2014] discovery by functioning as a high-throughput virtual screen of unpromising candidates.

Recently, Gómez-Bombarelli et al. [Gómez-Bombarelli et al., 2016b] presented Automatic Chemical Design, a VAE architecture capable of encoding continuous representations of molecules. In continuous latent space, gradient-based optimization is leveraged to find molecules that maximize a design metric.

Although a strong proof of concept, Automatic Chemical Design possesses a deficiency in so far as it fails to generate a high proportion of valid molecular structures. Gómez-Bombarelli et al. [2016b] hypothesise that molecules selected by Bayesian Optimization lie in dead regions of the latent space far away from any data that the VAE has seen in training, yielding invalid structures when decoded. Although there have been many attempts to address the issue of generating valid molecules, they all rely on model assumptions that can negatively impact either the complexity [Jaques et al., 2017, Janz et al., 2017, Kusner et al., 2017] or the diversity [Jin et al., 2018] of the generated molecules.

The principle contribution of this paper will be to present an approach based on constrained Bayesian optimization that generates a high proportion of valid sequences without relying on model assumptions that affect the complexity or the diversity of the generated molecules.

2 Methods

2.1 SMILES Representation

SMILES strings Weininger [1988]

are a means of representing molecules as a character sequence. This text-based format facilitates the use of tools from natural language processing for applications such as organic chemistry reaction prediction

Schwaller et al. [2017], Jin et al. [2017]

. For use with the variational autoencoder, the SMILES strings in turn are converted into one-hot vectors indicating the presence or absence of a particular character within a sequence as illustrated in

Figure 1.

Figure 1:

The SMILES and one-hot encoding for benzene. For simplicity only the characters present in Benzene are shown in the one-hot encoding. In reality there would be a column for each character in the SMILES alphabet.

2.2 Variational Autoencoder

Variational autoencoders Kingma and Welling [2013], Kingma et al. [2014] allow us to map molecules to and from continuous values in a latent space. The encoding is interpreted as a latent variable in a probabilistic generative model over which there is a prior distribution . The probabilistic decoder is defined by the likelihood function . The posterior distribution is interpreted as the probabilistic encoder. The parameters of the likelihood as well as the parameters of the approximate posterior distribution are learned by maximizing the evidence lower bound (ELBO)

Variational autoencoders have been coupled with recurrent neural networks by

Bowman et al. [2015] to encode sentences into a continuous latent space. This approach is followed for the SMILES format both by Gómez-Bombarelli et al. [2016b] and here. The SMILES variational autoencoder is shown in Figure 2.

Figure 2: The SMILES Variational Autoencoder.

2.3 Bayesian Optimization of Molecules

Bayesian Optimization is performed in the latent space of the variational autoencoder in order to find molecules that score highly under a specified objective function. We assess molecular quality on the following objectives:

denotes a molecule, is the water-octanol partition coefficient,

is the quantitative estimate of drug-likeness

[Bickerton et al., 2012] and is the synthetic accessibility [Ertl and Schuffenhauer, 2009]. The ring penalty term is as featured in [Gómez-Bombarelli et al., 2016b]. The comp subscript is designed to indicate that the objective function is a composite of standalone metrics.

It is important to note, that the first objective, a common metric of comparison in this area Gómez-Bombarelli et al. [2016b], Kusner et al. [2017], Jin et al. [2018], is mis-specified. From a chemical standpoint it is undesirable to maximize the logP score as is being done here. Rather it is preferable to optimize logP to be in a range that is in accordance with the Lipinski Rule of Five Lipinski et al. [1997]. We use the penalized logP objective here because regardless of its relevance for chemistry, it serves as a point of comparison against other methods.

2.4 Constrained Bayesian Optimization of Molecules

We now describe our extension to the Bayesian Optimization procedure followed by Gómez-Bombarelli et al. [2016b]. Expressed formally, the constrained Bayesian optimization problem is

where is a black-box objective function,

denotes the probability that a boolean constraint

is satisfied and is some user-specified minimum confidence that the constraint is satisfied [Gelbart et al., 2014]. The constraint is that a latent point must decode successfully a large fraction of the times decoding is attempted. The black-box objective function is noisy because a single latent point may decode to multiple molecules when the model makes a mistake, obtaining different values under the objective. In practice, is one of the objectives described in section 2.3

2.5 Expected Improvement with Constraints (EIC)

EIC may be thought of as expected improvement (EI) that offers improvement only when a set of constraints are satisfied [Schonlau et al., 1998]:

The implicit incumbent solution suppressed in , may be set in an analogous way to vanilla expected improvement [Gelbart, 2015] as either:

  1. The best observation in which all constraints are observed to be satisfied.

  2. The minimum of the posterior mean such that all constraints are satisfied.

The latter approach is adopted for the experiments performed in this paper. If at the stage in the Bayesian optimization procedure where a feasible point has yet to be located, the form of acquisition function used is that defined by [Gelbart, 2015]

with the intuition being that if the probabilistic constraint is violated everywhere, the acquisition function selects the point having the highest probability of lying within the feasible region. The algorithm ignores the objective until it has located the feasible region. It would also have been possible to adopt the methodology introduced by Rainforth et al. [2016] under the assumption that the constraint is cheap to evaluate.

3 Related Work

Gómez-Bombarelli et al. [2016b], Segler et al. [2017] constructed character-level generative models of SMILES strings using recurrent decoders. These models both suffered from the problem that they produced a high proportion of invalid SMILES strings. Blaschke et al. [2017] compared a range of autoencoder architectures in terms of their ability to produce valid molecules. Kusner et al. [2017], Dai et al. [2018] addressed the issue of generating valid molecules explicitly by imposing syntactic and semantic constraints using context free and attribute grammars while Janz et al. [2017], Guimaraes et al. [2017] leveraged additional training signal to enforce molecular validity.

A potential drawback of these methods is that the constraints may favour the generation of simple molecules. Simonovsky and Komodakis [2018], Li et al. [2018] build generative models of molecular graphs but do not impose any constraints for molecular validity. Jin et al. [2018] build a generative model of molecular graphs with constraints in order to achieve 100% validity. One possible disadvantage of this approach is that each molecule is assumed to be constructed from a fixed vocabulary of subgraphs and so it will be impossible to generate a molecule comprised of subgraphs outside this vocabulary. Direct comparison with the aforementioned methods is complicated due to the orthogonality of the assumptions made for each model. As such, we compare out method against Gómez-Bombarelli et al. [2016b] only.

Our principle contribution is in showing that valid molecules can be generated without encoding assumptions into the design model that limit the complexity and diversity of the generated molecules.

4 Experiment II: Drug Design

In this section we conduct an empirical test of the hypothesis from [Gómez-Bombarelli et al., 2016b] that the decoder’s lack of efficiency is due to data point collection in dead regions of the latent space far from the data on which the VAE was trained. We use this information to construct a binary classification BNN to serve as a constraint function that outputs the probability of a latent point being valid. Secondly, we compare the performance of our constrained Bayesian optimization implementation against the original model (baseline) in terms of the numbers of valid and drug-like molecules generated. Thirdly, we compare the quality of the molecules produced by constrained Bayesian optimization with those of the baseline model.

4.1 Implementation

The implementation details of the encoder-decoder network as well as the sparse GP for modelling the objective remain unchanged from [Gómez-Bombarelli et al., 2016b]. For the constrained Bayesian optimization algorithm, the BNN is constructed with hidden layers each

units wide with ReLU activation functions and a logistic output. Minibatch size is set to

and the network is trained for epochs with a learning rate of 0.0005. iterations of parallel Bayesian optimization are performed using the Kriging-Believer algorithm in all cases. Data is collected in batch sizes of . The same training set as [Gómez-Bombarelli et al., 2016b] is used, namely drug-like molecules drawn at random from the ZINC database [Irwin et al., 2012].

4.2 Diagnostic Experiments and Labelling Criteria

These experiments were designed to test the hypothesis that points collected by Bayesian optimization lie far away from the training data in latent space. In doing so, they also serve as labelling criteria for the data collected to train the BNN acting as the constraint function. The resulting observations are summarized in Figure 3.

(a) % Valid Molecules
(b) % Methane Molecules
(c) % Drug-like Molecules
Figure 3: Experiments on disjoint sets comprising latent points each. Very small (VS) Noise are training data latent points with approximately noise added to their values, Small (S) Noise have noise added to their values and Big (B) Noise have noise added to their values. All latent points underwent decode attempts and the results are averaged over the points in each set. The percentage of decodings to: a) valid molecules b) methane molecules. c) drug-like molecules.

There is a noticeable decrease in the percentage of valid molecules decoded as one moves further away from the training data in latent space. Points collected by Bayesian optimization do the worst in terms of the percentage of valid decodings. This would suggest that these points lie farthest from the training data. The decoder over-generates methane molecules when far away from the data. One hypothesis for why this is the case is that methane is represented as ’C’ in the SMILES syntax and is by far the most common character. Hence far away from the training data, combinations such as ’C’ followed by a stop character may have high probability under the distribution over sequences learned by the decoder.

Given that methane has far too low a molecular weight to be a suitable drug candidate, a third plot in Figure 3(c), shows the percentage of decoded molecules such that the molecules are both valid and have a tangible molecular weight. The definition of a tangible molecular weight was interpreted somewhat arbitrarily as a SMILES length of or greater. Henceforth, molecules that are both valid and have a SMILES length greater than will be loosely referred to as drug-like. This is not to imply that a molecule comprising five SMILES characters is likely to be drug-like, but rather this SMILES length serves the purpose of determining whether the decoder has been successful or not.

As a result of these diagnostic experiments, it was decided that the criteria for labelling latent points to initialize the binary classification neural network for the constraint would be the following: if the latent point decodes into drug-like molecules in more than

of decode attempts, it should be classified as drug-like and non drug-like otherwise.

4.3 Molecular Validity

The BNN for the constraint was initialized with positive class points and negative class points. The positive points were obtained by running the training data through the decoder assigning them positive labels if they satisfied the criteria outlined in the previous section. The negative class points were collected by decoding points sampled uniformly at random across the latent dimensions of the design space. Each latent point undergoes decode attempts and the most probable SMILES string is retained. is the choice of objective function. The relative performance of constrained Bayesian optimization and unconstrained Bayesian optimization (baseline) [Gómez-Bombarelli et al., 2016b] is compared in (a).

(a) Drug-like Molecules
(b) New Drug-like Molecules
Figure 4: a)

The percentage of latent points decoded to drug-like molecules. The results are from 20 iterations of Bayesian optimization with batches of 50 data points collected at each iteration (1000 latent points decoded in total). The standard error is given for 5 separate train/test set splits of 90/10.

The results show that greater than of the latent points decoded by constrained Bayesian optimization produce drug-like molecules compared to less than for unconstrained Bayesian optimization. One must account however, for the fact that the constrained approach may be decoding multiple instances of the same novel molecules. Constrained and unconstrained Bayesian optimization are compared on the metric of the percentage of unique novel molecules produced in (b).

One may observe that constrained Bayesian optimization outperforms unconstrained Bayesian optimization in terms of the generation of unique molecules, but not by a large margin. A manual inspection of the SMILES strings collected by the unconstrained optimization approach showed that there were many strings with lengths marginally larger than the cutoff point, which is suggestive of partially decoded molecules. As such, a fairer metric for comparison should be the quality of the new molecules produced as judged by the scores from the black-box objective function. This is examined next.

4.4 Molecular Quality

(a) Composite LogP
(b) Composite QED
(c) QED
Figure 5: The best scores for new molecules generated from the baseline model (blue) and the model with constrained Bayesian optimization (red). The vertical lines show the best scores averaged over 5 separate train/test splits of 90/10. For reference, the histograms are presented against the backdrop of the top 10% of the training data in the case of Composite LogP and QED, and the top 20% of the training data in the case of Composite QED.

The results of Figure 5 indicate that constrained Bayesian optimization is able to generate higher quality molecules relative to unconstrained Bayesian optimization across the three drug-likeness metrics introduced in section 2.3. Over the independent runs, the constrained optimization procedure in every run produced new drug-like molecules ranked in the 100th percentile of the distribution over training set scores for the objective and over the 90th percentile for the remaining objectives. Table 1 gives the percentile that the averaged score of the new molecules found by each process occupies in the distribution over training set scores.

Objective Baseline Constrained
LogP Composite 36 14 92 4
QED Composite 14 3 72 10
QED 11 2 79 4
Table 1: Percentile of the averaged new molecule score relative to the training data. The results of 5 separate train/test set splits of 90/10 are provided.

5 Experiment 3: Material Design

In order to show that the constrained Bayesian optimization approach is extensible beyond the realm of drug design, we trained the model on data from the Harvard Clean Energy Project [Hachmann et al., 2011, 2014] to generate molecules optimized for power conversion efficiency (PCE).

5.1 Implementation

A neural network was trained on molecules drawn at random from the Harvard Clean Energy Project dataset using 512-bit Morgan circular fingerprints Rogers and Hahn [2010] as input features with bond radius of using the RDKit package RDKit .

5.2 Results

The results are given in Figure 6. The averaged score of the new molecules generated lies above the 90th percentile in the distribution over training set scores. Given that the objective function in this instance was learned using a neural network, advances in predicting chemical properties from data Duvenaud et al. [2015], Ramsundar et al. [2015] are liable to yield concomitant improvements in the optimized molecules generated through this approach.

Figure 6: The best scores for novel molecules generated by the constrained Bayesian optimization model optimizing for PCE. The results are averaged over 3 separate runs with train/test splits of 90/10.

6 Conclusion and Future Work

6.1 Contributions

The reformulation of the search procedure in the Automatic Chemical Design model as a constrained Bayesian optimization problem has led to concrete improvements on two fronts:

  1. Validity - The number of valid molecules produced by the constrained optimization procedure offers a marked improvement over the original model. Notably, by applying constraints solely to the latent space of the variational autoencoder, the method does not require model assumptions that compromise either the complexity or the diversity of the generated molecules.

  2. Quality - For five independent train/test splits, the scores of the best molecules generated by the constrained optimization procedure consistently ranked above the 90th percentile of the distribution over training set scores for all objectives considered. The ability to find new molecules that are competitive with the very best of a training set of already drug-like molecules is a powerful demonstration of the model’s capabilities. As a further point, the generality of the approach should be emphasised. This approach is liable to work for a large range of objectives encoding countless desirable molecular properties.

6.2 Future Work

Future work could investigate whether performance gains can be achieved through the implementation of a more accurate constraint model. Recent work by Blaschke et al. [2017] has featured a more targeted search for novel compounds with predicted activity against dopamine receptor type 2. This represents a move towards more industrially-relevant objective functions for Bayesian Optimization which should ultimately replace the chemically mis-specified objectives, such as the penalized logP score, identified here.