ChemTS: An Efficient Python Library for de novo Molecular Generation

09/29/2017 ∙ by Xiufeng Yang, et al. ∙ The University of Tokyo

Automatic design of organic materials requires black-box optimization in a vast chemical space. In conventional molecular design algorithms, a molecule is built as a combination of predetermined fragments. Recently, deep neural network models such as variational autoencoders (VAEs) and recurrent neural networks (RNNs) have been shown to be effective in de novo design of molecules without any predetermined fragments. This paper presents a novel Python library, ChemTS, that explores the chemical space by combining Monte Carlo tree search (MCTS) and an RNN. In a benchmarking problem of optimizing the octanol-water partition coefficient and synthesizability, our algorithm showed superior efficiency in finding high-scoring molecules. ChemTS is available at https://github.com/tsudalab/ChemTS.


1 Introduction

In modern society, a variety of organic molecules are used as important materials in applications such as solar cells [1], organic light-emitting diodes (OLEDs) [2], conductors [3], sensors [4] and ferroelectrics [5]. At the highest level of abstraction, the design of organic molecules is formulated as a combinatorial optimization problem of finding the best solutions in a vast chemical space. Most computer-aided methods for molecular design build a molecule as a combination of predefined fragments (e.g., [6]). Recently, Ikebata et al. [7] succeeded in de novo molecular design using an engineered language model of the SMILES representation of molecules [8]. It is increasingly evident, however, that engineered models often perform worse than neural networks in text and image generation [9, 10]. Gómez-Bombarelli et al. [11] were the first to employ a neural network called a variational autoencoder (VAE) to generate molecules. Later, Kusner et al. enhanced it to the grammar variational autoencoder (GVAE) [12]. SMILES strings created by VAEs are mostly invalid (i.e., they do not translate to chemical structures), so generation steps have to be repeated many times to obtain a molecule. Segler et al. [13] showed that a recurrent neural network (RNN) using long short-term memory (LSTM) [14] achieves a high probability of valid SMILES generation. In their algorithm, a large number of candidates are generated randomly and a black-box optimization algorithm is employed to choose high-scoring molecules. A very large number of candidates must be generated to ensure that desirable molecules are included in the candidate set, and optimization over too large a candidate space can be prohibitively slow.

In this paper, we present a novel Python library, ChemTS, to offer materials scientists a versatile tool for de novo molecular design. The space of SMILES strings is represented as a search tree where the $i$-th level corresponds to the $i$-th symbol. A path from the root to a terminal node corresponds to a complete SMILES string. Initially, only the root node exists, and the search tree is gradually generated by Monte Carlo tree search (MCTS) [15]. MCTS is a randomized best-first search method that showed exceptional performance in computer Go [16]. Recently, it has been successfully applied to alloy design [17]. MCTS constructs only a shallow tree, and downstream paths are generated by a rollout procedure. In ChemTS, an RNN trained on a large database of SMILES strings is used as the rollout procedure. In a benchmarking experiment, ChemTS showed better efficiency than VAEs, creating about 40 molecules per minute. As a result, high-scoring molecules were generated within several hours.

2 Method

ChemTS requires a database of SMILES strings and a reward function $r(S)$, where $S$ is an input SMILES string. Our definition of SMILES strings contains the following symbols representing atoms, bonds, ring numbers and branches: {C, c, o, O, N, F, [C@@H], n, -, S, Cl, [O-], [C@H], [NH+], [C@], s, Br, [nH], [NH3+], [NH2+], [C@@], [N+], [nH+], [S@], [N-], [n+], [S@@], [S-], I, [n-], P, [OH+], [NH-], [P@@H], [P@@], [PH2], [P@], [P+], [S+], [o+], [CH2-], [CH-], [SH+], [O+], [s+], [PH+], [PH], [S@@+], /, =, #, 1, 2, 3, 4, 5, 6, 7, 8, 9, (, )}. In addition, we have a terminal symbol $. The reward function involves first-principles or semi-empirical calculations and describes the quality of the molecule described by $S$. If $S$ does not correspond to a valid molecule, $r(S)$ is set to an exceptionally small value. We employ rdkit (www.rdkit.org) to check whether $S$ is valid. Before starting the search, an RNN is trained on the database, and we obtain the conditional probability $P(s_t \mid s_1, \ldots, s_{t-1})$ as a result. The architecture of our RNN is similar to that in [13] and is detailed in Section 2.1.
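The symbol set above treats bracket atoms and two-letter halogens as single symbols, so a SMILES string has to be split accordingly before it can be one-hot coded. The following is a minimal sketch of such a tokenizer; the names `tokenize`, `one_hot` and `TOKEN_PATTERN` are hypothetical and are not taken from the ChemTS code base.

```python
import re

# Bracket atoms such as [C@@H] or [NH3+] count as one symbol, as do the
# two-letter halogens Cl and Br. Every other symbol is a single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|.")
TERMINAL = "$"  # terminal symbol appended to every string


def tokenize(smiles):
    """Split a SMILES string into symbols and append the terminal symbol."""
    return TOKEN_PATTERN.findall(smiles) + [TERMINAL]


def one_hot(tokens, vocabulary):
    """Encode each token as a one-hot list over a fixed vocabulary."""
    index = {symbol: i for i, symbol in enumerate(vocabulary)}
    size = len(vocabulary)
    return [[1 if index[tok] == i else 0 for i in range(size)] for tok in tokens]


tokens = tokenize("c1cc(Cl)ccc1[NH3+]")
vocabulary = sorted(set(tokens))
vectors = one_hot(tokens, vocabulary)
```

The sketch does not check chemical validity; it only fixes the alphabet over which the RNN and the search tree operate.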

Figure 1: Monte Carlo Tree Search. (a) Selection step. The search tree is traversed from the root to a leaf by choosing the child with the largest UCB score. (b) Expansion step. Children nodes are created by sampling from the RNN 30 times. (c) Simulation step. Paths to terminal nodes are created by the rollout procedure using the RNN. Rewards of the corresponding molecules are computed. (d) Backpropagation step. The internal parameters of upstream nodes are updated.

MCTS creates a search tree in which each node corresponds to one symbol. Nodes with the terminal symbol are called terminal nodes. Starting with the root node, the search tree grows gradually by repeating four steps: Selection, Expansion, Simulation and Backpropagation (Figure 1). Each intermediate node has a UCB score that evaluates the merit of the node [15]. The distinctive feature of MCTS is the use of rollout in the simulation step. Whenever a new node is added, paths from the node to terminal nodes are built by a random process. In computer games, it is known that uniformly random rollout does not perform well, and designing a better rollout procedure based on available knowledge is essential for achieving high performance [15]. Our idea is to employ a trained RNN for rollout. A node at level $t-1$ has a partial SMILES string $s_1 \cdots s_{t-1}$ corresponding to the path from the root to the node. Given the partial string, the RNN allows us to compute the distribution of the next letter $s_t$. By sampling from this distribution, the string is elongated by one symbol. Elongation by the RNN is repeated until the terminal symbol occurs. After elongation is done, the reward of the generated string is computed. In the backpropagation step, the reward is propagated backwards and the UCB scores of the traversed nodes are updated. See [17] for details about MCTS.
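The four steps can be sketched in a few dozen lines of plain Python. This is an illustrative sketch under stated assumptions, not the ChemTS implementation: `next_symbol_probs` is a hypothetical stand-in for the trained RNN (it maps a prefix to a dictionary of symbol probabilities), the UCB constant `c` and the UCB1 form are assumed, and only one randomly chosen new child is simulated per iteration.

```python
import math
import random

TERMINAL = "$"


class Node:
    def __init__(self, symbol, parent=None):
        self.symbol = symbol
        self.parent = parent
        self.children = {}        # symbol -> Node
        self.visits = 0
        self.total_reward = 0.0

    def prefix(self):
        """Symbols on the path from the root (excluded) to this node."""
        node, symbols = self, []
        while node.parent is not None:
            symbols.append(node.symbol)
            node = node.parent
        return symbols[::-1]

    def ucb(self, c=1.0):
        """UCB1-style score; unvisited nodes are tried first (assumed form)."""
        if self.visits == 0:
            return float("inf")
        mean = self.total_reward / self.visits
        return mean + c * math.sqrt(2.0 * math.log(self.parent.visits) / self.visits)


def mcts(next_symbol_probs, reward, iterations, n_expand=30, max_len=80, rng=None):
    rng = rng or random.Random(0)

    def sample(prefix):
        probs = next_symbol_probs(prefix)
        symbols = list(probs)
        return rng.choices(symbols, weights=[probs[s] for s in symbols])[0]

    root = Node(None)
    best = (float("-inf"), None)
    for _ in range(iterations):
        # Selection: descend by the largest UCB score until a leaf is reached.
        node = root
        while node.children:
            node = max(node.children.values(), key=Node.ucb)
        # Expansion: create children by sampling from the rollout policy.
        if node.symbol != TERMINAL:
            for _ in range(n_expand):
                s = sample(node.prefix())
                node.children.setdefault(s, Node(s, node))
            node = rng.choice(list(node.children.values()))
        # Simulation: elongate the string until the terminal symbol occurs.
        string = node.prefix()
        while string[-1] != TERMINAL and len(string) < max_len:
            string.append(sample(string))
        r = reward(string)
        if r > best[0]:
            best = (r, "".join(string).rstrip(TERMINAL))
        # Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    return best
```

Calling `mcts(policy, reward, iterations=200)` returns the best reward found and the corresponding string. Strings that reach `max_len` without a terminal symbol are passed to `reward` unchanged, so the reward function is expected to penalize them.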

2.1 Recurrent Neural Network

Our recurrent neural network (RNN) has a non-deterministic output: an input string $s_1, \ldots, s_T$ is mapped to probability distributions of output symbols $P(y_1), \ldots, P(y_T)$. The RNN represents the function $h_t = f(h_{t-1}, x_t)$, where $h_t$ is a hidden state at position $t$ and $x_t$ is the one-hot coded vector of input symbol $s_t$. The function $f$ is implemented by two stacked gated recurrent units (GRUs) [14], each with 256-dimensional hidden states. The input vector is fed to the lower GRU, and the hidden state of the lower GRU is fed to the upper GRU. The distribution of the output symbol is computed as $P(y_t) = g(h_t)$, where $g$ is a softmax activation function depending only on the hidden state of the upper GRU.
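For readers unfamiliar with GRUs, one common formulation of a single layer, following the bias-free form in Cho et al. [14], is given below, together with the stacking described above. The weight names $W$, $U$, $V$, the bias $c$ and the superscripts $(1)$, $(2)$ for the lower and upper layers are notation introduced here for illustration. Implementations differ in the sign convention of the update gate and in whether biases are included, so this is a sketch rather than the exact parameterization used in ChemTS.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\bigl(W x_t + U (r_t \odot h_{t-1})\bigr) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \\[4pt]
h^{(1)}_t &= \mathrm{GRU}_1\bigl(h^{(1)}_{t-1},\, x_t\bigr), \qquad
h^{(2)}_t = \mathrm{GRU}_2\bigl(h^{(2)}_{t-1},\, h^{(1)}_t\bigr) \\
P(y_t) &= \mathrm{softmax}\bigl(V h^{(2)}_t + c\bigr)
\end{aligned}
```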

Given $N$ strings in the training set, we train the network such that it outputs a right-shifted version of the input. Denote by $x^{(i)}_t$ the one-hot coded vector of the $t$-th symbol in the $i$-th training string. The parameters $\theta$ of the network are trained to minimize the following loss function,

$L(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} D\left(x^{(i)}_{t+1} \,\|\, P(y^{(i)}_t)\right)$,

where $D$ denotes the relative entropy. Our RNN was implemented using the Keras library (github.com/fchollet/keras) and trained with ADAM [18] using a batch size of 256. After the training is finished, one can compute $P(s_t \mid s_1, \ldots, s_{t-1})$ from $P(y_{t-1})$. This allows us to perform rollout by sampling the next symbol repeatedly.
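Because each target is one-hot, the relative entropy between the target and the predicted distribution reduces to the negative log-probability that the model assigns to the true next symbol. The sketch below illustrates this shifted-target loss for one string and the repeated sampling used for rollout. The function names and the dictionary-based representation of a distribution are assumptions made for illustration; a real model would supply the probabilities.

```python
import math
import random

TERMINAL = "$"


def shifted_loss(predicted, targets):
    """Loss for one string under right-shifted targets.

    predicted[t] is the model output at position t, given as a dict that maps
    each symbol to its probability. targets is the symbol sequence including
    the terminal symbol. For a one-hot target, the relative entropy equals
    -log of the probability given to the true next symbol.
    """
    return -sum(
        math.log(predicted[t][targets[t + 1]]) for t in range(len(targets) - 1)
    )


def sample_string(next_symbol_probs, prefix, max_len=80, rng=random):
    """Elongate a prefix one sampled symbol at a time until the terminal symbol."""
    string = list(prefix)
    while (not string or string[-1] != TERMINAL) and len(string) < max_len:
        probs = next_symbol_probs(string)
        symbols = list(probs)
        weights = [probs[s] for s in symbols]
        string.append(rng.choices(symbols, weights=weights)[0])
    return string
```

The `max_len` cutoff is a safeguard added in this sketch so that sampling always stops, even if the terminal symbol is never drawn.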

3 Experiments

Following [11], we generate molecules that jointly optimize the octanol-water partition coefficient logP and two other properties: synthetic accessibility [19] and a ring penalty that penalizes unrealistically large rings. The score of molecule $S$ is described as

$J(S) = \log P(S) - SA(S) - \mathrm{RingPenalty}(S)$. (1)

The reward function of ChemTS is defined as

$r(S) = \dfrac{J(S)}{1 + |J(S)|}$ if $S$ is a valid SMILES string, and $r(S) = -1.0$ otherwise. (2)
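The sketch below shows one way to wire a score and a validity check into a bounded reward. The helper names `make_reward`, `score_fn` and `is_valid` are hypothetical, and the squashing $J/(1+|J|)$ with a reward of $-1$ for invalid strings should be read as an assumption of this sketch.

```python
def make_reward(score_fn, is_valid, invalid_reward=-1.0):
    """Build a reward function from a score J and a validity check.

    score_fn:  callable mapping a SMILES string to its score J (hypothetical).
    is_valid:  callable returning True if the string is a valid molecule.
    """
    def reward(smiles):
        if not is_valid(smiles):
            return invalid_reward        # invalid strings get a small reward
        j = score_fn(smiles)
        return j / (1.0 + abs(j))        # squash J into the interval (-1, 1)
    return reward


# Toy stand-ins: "validity" is balanced parentheses, "score" is string length.
toy_reward = make_reward(
    score_fn=lambda s: float(len(s)),
    is_valid=lambda s: s.count("(") == s.count(")"),
)
```

With RDKit installed, `is_valid` could be implemented as `Chem.MolFromSmiles(s) is not None`, since `MolFromSmiles` returns `None` for strings it cannot parse. Keeping the reward bounded makes rewards from different molecules comparable when they are averaged in the UCB score.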

ChemTS was compared with two existing methods based on variational autoencoders, CVAE [11] and GVAE [12]. Their implementation is available at https://github.com/mkusner/grammarVAE. Both methods perform molecular generation by Bayesian optimization (BO) in the latent space of a VAE. The RNN, CVAE and GVAE were trained on approximately 250,000 molecules from the ZINC database [20]. All methods were trained for 100 epochs. Training took 3.8, 9.4 and 33.5 hours for the RNN, CVAE and GVAE, respectively, on a CentOS 6.7 server with a GeForce GTX Titan X GPU. To evaluate the efficiency of MCTS, we prepared two alternative methods using the RNN. One is simple random sampling using the RNN, where the first symbol is chosen randomly and the string is elongated until the terminal symbol occurs. The other is the combination of the RNN and Bayesian optimization [21], where 4,000 molecules are generated a priori and Bayesian optimization is applied to find the best-scoring molecule.

Method 2h 4h 6h 8h Molecules/Min
ChemTS
RNN+BO
Only RNN
CVAE+BO
GVAE+BO
Table 1: Maximum score $J$ at time points 2, 4, 6 and 8 hours achieved by different molecular generation methods. The rightmost column shows the number of generated molecules per minute. The average values and standard deviations over 10 trials are shown.

As shown in Table 1, the effectiveness of each method is quantified by the maximum score among all generated molecules at 2, 4, 6 and 8 hours and by the speed of molecule generation (i.e., the number of generated molecules per minute). VAE methods were substantially slower than RNN-based methods, which reflects their low probability of generating valid SMILES strings. ChemTS performed best in finding high-scoring molecules, while its speed of molecular generation (40.89 molecules per minute) was only slightly lower than that of random generation by the RNN (41.33 molecules per minute). The combination of the RNN and BO could not find high-scoring molecules. Preparing more candidate molecules may improve the best score, but it would further slow down the molecular generation. In general, it is difficult to design a correct reward function when there are multiple objectives. It is therefore important to generate many good molecules in a given time frame so that the user can browse them and select preferred molecules afterwards. See Figure 2 for the best molecules generated by ChemTS.

SMILES representation, score J
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2Cl)c2ccccc2c1OC(F)F)c1cccc2ccccc12 6.56
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2Cl)ccc1C1=CCCCC1)c1cc(F)cc(Cl)c1 6.43
O=C(Nc1cc(Nc2c(Cl)cccc2N=C(SC2CCCCC2)c2ccccc2)cc(Cl)c1Cl)c1ccc2ccccc2n1 6.34
O=C(Nc1cc(Oc2ccc(Cl)cc2Cl)ccc1Nc1cc(Cl)ccc1Cl)c1ccc(Cl)cc1 6.33
O=C(Nc1cc(Nc2c(Cl)cccc2Cl)c(Cl)cc1Br)N(c1ccccc1)c1ccc(Cl)cc1 6.26
O=C(Nc1cc(Oc2c(Cl)cccc2Oc2ccc(-c3ccccc3)cc2)ccc1Cl)c1ccccc1 6.19
O=C(Nc1cc(Nc2c(Cl)cccc2Cl)c(Cl)c(C(=O)N(Cc2ccccc2)c2ccccc2)c1Cl)c1ccccc1F 6.08
O=C(Nc1cc(Oc2ccc(Cl)cc2Cl)cc(Cl)c1Cl)c1ncoc1-c1ccc(Sc2ccccc2)cc1 6.007
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2Cl)c2ncccc2c1Cl)c1ccc(Cl)cc1 6.0067
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2)c(Cl)cc1Cl)c1cc(F)ccc1Cl 6.0062
O=C(Nc1cc(Oc2c(Cl)cccc2Oc2ccccc2)nnc1-c1ccccc1)c1sc2ccccc2c1Cc1ccccc1 6.004
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2Cl)c2ncccc2c1Cl)c1ccccc1Cl 5.97
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2Cl)c(Cl)cc1Cl)c1ccc(F)cc1F 5.958
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccccc2)ccc1C(F)(F)F)c1ccc(Cl)c2ccccc12 5.952
O=C(Nc1cc(Nc2c(Cl)cccc2Cl)c(Cl)cc1OC(F)F)N(Cc1ccccc1)c1ccccc1C(F)(F)F 5.94
O=C(Nc1cc(Oc2c(Cl)cccc2Oc2ccccc2C2=CCCCC2)cc(Cl)c1)c1ccccc1 5.93
O=C(Nc1cc(Nc2c(Cl)cccc2[N+](=O)[O-])cs1)c1sc2ccc(Br)cc2c1N(c1ccccc1)c1ccccc1 5.92
O=C(Nc1cc(Nc2c(Cl)cccc2Cl)c(C(=O)c2ccc(Cl)cc2F)c(Cl)c1)Nc1cccc(Cl)c1 5.87
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2Cl)cc(F)c1F)c1cccc2ccccc12 5.84
O=C(Nc1cc(Nc2c(Cl)cccc2NCc2ccc(Cl)cc2Cl)c(Cl)cc1Cl)c1cccs1 5.82
Figure 2: Best 20 molecules by ChemTS. Blue parts in SMILES strings indicate prefixes made in the search tree. The remaining parts are made by the rollout procedure.

4 Conclusion

In this paper, we presented a new Python package for molecular generation. It will be further extended to include more sophisticated tree search methods and neural networks. The use of additional packages for computational physics, such as pymatgen [22], allows users to implement their own reward functions easily. We look forward to seeing ChemTS become a part of the open-source ecosystem for organic materials development.

Acknowledgement(s)

We would like to thank Hou Zhufeng, Diptesh Das, Masato Sumita and Thaer M. Dieb for their fruitful discussions.

Disclosure statement

The authors declare no conflict of interest.

Funding

This work was supported by the “Materials research by Information Integration” Initiative (MI2I) project and CREST Grant No. JPMJCR1502 from Japan Science and Technology Agency (JST). It was also supported by Grant-in-Aid for Scientific Research on Innovative Areas “Nano Informatics” (Grant No. 25106005) from the Japan Society for the Promotion of Science (JSPS). In addition, it was supported by MEXT as “Priority Issue on Post-K computer” (Building Innovative Drug Discovery Infrastructure Through Functional Control of Biomolecular Systems).

Notes on contributors

K. Tsuda proposed the research idea, supported the experimental design and helped draft the manuscript. K. Yoshizoe supported the design of ChemTS. J. Zhang and K. Terayama helped analyze the experimental data. X. Yang designed and implemented ChemTS, analyzed the data, and compiled the manuscript. All of the authors have read and approved the final manuscript.

References

  • [1] Niu G, Guo X, Wang L. Review of recent progress in chemical stability of perovskite solar cells. J Mater Chem A. 2015;3(17):8970–8980.
  • [2] Kaji H, Suzuki H, Fukushima T, et al. Purely organic electroluminescent material realizing 100% conversion from electricity to light. Nat Commun. 2015;6:8476.
  • [3] Ueda A, Yamada S, Isono T, et al. Hydrogen-bond-dynamics-based switching of conductivity and magnetism: A phase transition caused by deuterium and electron transfer in a hydrogen-bonded purely organic conductor crystal. J Am Chem Soc. 2014;136(34):12184–12192.
  • [4] Yeung MCL, Yam VWW. Luminescent cation sensors: from host–guest chemistry, supramolecular chemistry to reaction-based mechanisms. Chem Soc Rev. 2015;44(13):4192–4202.
  • [5] Horiuchi S, Tokura Y. Organic ferroelectrics. Nat Mater. 2008;7(5):357.
  • [6] Podlewska S, Czarnecki WM, Kafel R, et al. Creating the new from the old: Combinatorial libraries generation with machine-learning-based compound structure optimization. J Chem Inf Model. 2017;57(2):133–147.
  • [7] Ikebata H, Hongo K, Isomura T, et al. Bayesian molecular design with a chemical language model. J Comput Aided Mol Des. 2017;31(4):379–391.
  • [8] Weininger D. SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28(1):31–36.
  • [9] Bowman SR, Vilnis L, Vinyals O, et al. Generating sentences from a continuous space. In: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016; 2016. p. 10–21.
  • [10] Oord Avd, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks. In: Proceedings of 33rd International Conference on Machine Learning, ICML 2016; 2016. p. 1747–1756.
  • [11] Gómez-Bombarelli R, Duvenaud D, Hernández-Lobato JM, et al. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415. 2016.
  • [12] Kusner MJ, Paige B, Hernández-Lobato JM. Grammar variational autoencoder. In: Proceedings of 34th International Conference on Machine Learning, ICML 2017; 2017. p. 1945–1954.
  • [13] Segler MH, Kogej T, Tyrchan C, et al. Generating focussed molecule libraries for drug discovery with recurrent neural networks. arXiv preprint arXiv:1701.01329. 2017.
  • [14] Cho K, van Merrienboer B, Gülçehre Ç, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014; 2014. p. 1724–1734.
  • [15] Browne C, Powley E, Whitehouse D, et al. A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI in Games. 2012;4(1):1–43.
  • [16] Silver D, Huang A, Maddison CJ, et al. Mastering the game of go with deep neural networks and tree search. Nature. 2016;529(7587):484–489.
  • [17] Dieb TM, Ju S, Yoshizoe K, et al. MDTS: automatic complex materials design using Monte Carlo tree search. Sci Tech Adv Mater. 2017;18(1):498–503.
  • [18] Kingma D, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
  • [19] Ertl P, Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminf. 2009;1(1):8.
  • [20] Irwin JJ, Sterling T, Mysinger MM, et al. ZINC: a free tool to discover chemistry for biology. J Chem Inf Model. 2012;52(7):1757–1768.
  • [21] Ueno T, Rhone T, Hou Z, et al. COMBO: an efficient Bayesian optimization library for materials science. Mater Discov. 2016;4:18–21.
  • [22] Ong SP, Richards WD, Jain A, et al. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Comp Mater Sci. 2013;68:314–319.