Although machine learning has been successfully used to propose novel molecules that satisfy desired properties, it is still challenging to explore a large chemical space efficiently. In this paper, we present a conditional molecular design method that facilitates generating new molecules with desired properties. The proposed model, which simultaneously performs both property prediction and molecule generation, is built as a semi-supervised variational autoencoder trained on a set of existing molecules with only a partial annotation. We generate new molecules with desired properties by sampling from the generative distribution estimated by the model. We demonstrate the effectiveness of the proposed model by evaluating it on drug-like molecules. The model improves the performance of property prediction by exploiting unlabeled molecules, and efficiently generates novel molecules fulfilling various target conditions.
The primary goal of molecular design is to propose novel molecules that satisfy desired properties, which has been challenging due to the difficulty in efficiently exploring a large chemical space. In the past, molecular design has been largely driven by human experts. They would suggest candidate molecules that are then evaluated through computer simulations and subsequent experimental syntheses.^{1} This is time-consuming and costly, and is inadequate when many candidate molecules must be considered. In recent decades, machine learning based approaches have been actively studied as efficient alternatives to expedite the molecular design processes.^{2, 3, 4}
A conventional approach is to build a prediction model that estimates the properties of a given molecule, as shown in Figure 1(a). Molecules with desired properties are then chosen after screening a set of candidate molecules with this prediction model.^{5} The candidate molecules for screening need to be manually obtained from such sources as combinatorial enumerations of possible fragments^{6, 7, 8} and public databases.^{9, 10} It is necessary to secure a sufficient number of molecules that are properly labeled with properties, as prediction accuracy typically depends on the number of labeled molecules and the quality of the labels. Early work attempted to transform a molecule into a hand-engineered feature representation, a so-called molecular fingerprint, and to use it as an input to predict the properties.^{11, 12, 13}
With recent advances in deep learning,^{14} prediction quality has improved by employing deep neural networks.^{15, 16} Moreover, various recent studies have extracted features directly from graph representations of molecules to better predict the properties.^{17, 18, 19, 20, 21, 22}
Another approach aims to automatically generate new molecules by building a molecule generation model, as in Figure 1(b). This approach learns to map a molecule onto a latent space. From this space, it randomly generates new molecules that are analogous to those in the original set. Molecules randomly generated by this model can be used as candidates for screening with a separate property prediction model. Most existing studies on this approach have represented a molecule as a simplified molecular-input line-entry system (SMILES)^{23} string modeled with a recurrent neural network (RNN). They include RNN language models,^{24, 25, 26} variational autoencoders (VAE),^{27, 28, 29, 30} and generative adversarial networks (GAN)^{31, 32} with RNN decoders. More recently, models that directly generate the graph structures of molecules have also been proposed.^{33, 34, 35, 36}
The molecule generation approach has been extended to conditional molecular design, generating a new molecule whose properties are close to a predetermined target condition. This is often done by finding a latent representation that closely reflects the target condition using a property prediction model, as in Figure 1(c). Previous work has proposed to use recursive fine tuning,^{24} Bayesian optimization,^{28} and reinforcement learning.^{25, 26, 32} These methods generate molecules with intended properties not directly but through an additional optimization procedure, often in the latent space. This is inefficient, especially when multiple target conditions are considered.
Here we present a novel approach to efficiently and accurately generating new molecules satisfying designated properties. We build a conditional molecular design model that simultaneously performs both property prediction and molecule generation, as illustrated in Figure 1(d), using a semi-supervised variational autoencoder (SSVAE).^{37} Given a set of specific properties, conditional molecular design is done by directly sampling new molecules from a conditional generative distribution without any extra optimization procedure. The semi-supervised model can effectively exploit unlabeled molecules. This is particularly advantageous when only a small portion of the molecules in the data are labeled with their properties, which is usual given the high cost of labeling molecules.
We adapt the original SSVAE^{37} to incorporate continuous output variables. SSVAE is a directed probabilistic graphical model that captures the data distribution in a semi-supervised manner. In the generative process of the SSVAE model, the input variable $\mathbf{x}$ is generated from a generative distribution $p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z})$, which is parameterized by $\theta$ and conditioned on the output variable $\mathbf{y}$ and latent variable $\mathbf{z}$. $\mathbf{y}$ is treated as an additional latent variable when $\mathbf{x}$ is not labeled, which necessitates introducing the distribution over $\mathbf{y}$. The prior distributions over $\mathbf{y}$ and $\mathbf{z}$ are assumed to be $p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I})$. We use variational inference to address the intractability of the exact posterior inference of the model. We approximate the posterior distributions over $\mathbf{y}$ and $\mathbf{z}$ by $q_\phi(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x})))$ and $q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_\phi(\mathbf{x}, \mathbf{y}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}, \mathbf{y})))$, both of which are parameterized with $\phi$. For the semi-supervised learning scenario where some values of $\mathbf{y}$ are missing, the missing values are predicted by $q_\phi(\mathbf{y} \mid \mathbf{x})$.
In our framework for conditional molecular design, $\mathbf{x}$ and $\mathbf{y}$ denote a molecule and its continuous-valued properties, respectively. In this study, we consider molecules that can be represented by SMILES, which has been commonly used in recent related work.^{24, 25, 26, 28, 29, 32} SMILES encodes the graph structure of a molecule in a compact line notation by depth-first traversal into a sequence with a simple vocabulary and grammar rules.^{23} For example, benzene is written in SMILES as c1ccccc1. A molecule representation $\mathbf{x}$ is then formed as a sequence of one-hot vectors describing a SMILES string, $\mathbf{x} = (x^{(1)}, \ldots, x^{(j)}, \ldots, x^{(l)})$, where each one-hot vector $x^{(j)}$ corresponds to the index of a symbol in a predefined vocabulary and $l$ is the length of the sequence. The vocabulary consists of all unique characters in the data, except for atoms represented by two characters (e.g., Si, Cl, Br, and Sn), which are considered single symbols. A vector $\mathbf{y}$ consists of scalar values $(y_1, \ldots, y_m)$, where $m$ is the number of properties of a molecule.
We use RNNs to model $p_\theta$ and $q_\phi$. The SSVAE model is composed of three RNNs, which are the predictor network $q_\phi(\mathbf{y} \mid \mathbf{x})$, the encoder network $q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})$, and the decoder network $p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z})$. We use bidirectional RNNs^{38} for the predictor and encoder networks, while the decoder network is a unidirectional RNN. The input to the encoder network at each time step $j$ contains $x^{(j)}$ and $\mathbf{y}$. The decoder network, which generates sequences, takes the output of the current time step $x^{(j)}$, $\mathbf{y}$, and $\mathbf{z}$ as the input at time $j+1$.
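We define two loss functions $\mathcal{L}(\mathbf{x}, \mathbf{y})$ and $\mathcal{U}(\mathbf{x})$ corresponding to labeled and unlabeled instances, respectively. The variational lower bound of the log-probability of a labeled instance $(\mathbf{x}, \mathbf{y})$, written as an upper bound on the negative log-probability so that it can be minimized as a loss, is:
$$-\log p(\mathbf{x}, \mathbf{y}) \le \mathcal{L}(\mathbf{x}, \mathbf{y}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})}\left[-\log p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) - \log p(\mathbf{y}) - \log p(\mathbf{z}) + \log q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})\right] \qquad (1)$$
For an unlabeled instance $\mathbf{x}$, $\mathbf{y}$ is considered as a latent variable. The variational lower bound is then:
$$-\log p(\mathbf{x}) \le \mathcal{U}(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{y}, \mathbf{z} \mid \mathbf{x})}\left[-\log p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) - \log p(\mathbf{y}) - \log p(\mathbf{z}) + \log q_\phi(\mathbf{y}, \mathbf{z} \mid \mathbf{x})\right] \qquad (2)$$
where $q_\phi(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) = q_\phi(\mathbf{y} \mid \mathbf{x})\, q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})$.
Given the data distributions of labeled cases $\tilde{p}_l(\mathbf{x}, \mathbf{y})$ and unlabeled cases $\tilde{p}_u(\mathbf{x})$, the full loss function for the entire dataset is defined as:
$$\mathcal{J} = \sum_{(\mathbf{x}, \mathbf{y}) \sim \tilde{p}_l} \mathcal{L}(\mathbf{x}, \mathbf{y}) + \sum_{\mathbf{x} \sim \tilde{p}_u} \mathcal{U}(\mathbf{x}) + \beta \sum_{(\mathbf{x}, \mathbf{y}) \sim \tilde{p}_l} \left\lVert \mathbf{y} - \mathbb{E}_{q_\phi(\mathbf{y} \mid \mathbf{x})}[\mathbf{y}] \right\rVert^2 \qquad (3)$$
where the last term is the mean squared error for supervised learning. Without the last term, the distribution $q_\phi(\mathbf{y} \mid \mathbf{x})$ is not estimated from the labeled cases, because $q_\phi(\mathbf{y} \mid \mathbf{x})$ does not contribute to $\mathcal{L}(\mathbf{x}, \mathbf{y})$.^{37} The last term encourages $q_\phi(\mathbf{y} \mid \mathbf{x})$ to be predictive of the observed properties based on the labeled instances. As $q_\phi(\mathbf{y} \mid \mathbf{x})$ is assumed to follow a normal distribution, $\mathbb{E}_{q_\phi(\mathbf{y} \mid \mathbf{x})}[\mathbf{y}]$ is equivalent to $\boldsymbol{\mu}_\phi(\mathbf{x})$. The hyper-parameter $\beta$ controls the trade-off between generative and supervised learning. It becomes fully generative learning when $\beta = 0$, while it focuses more on supervised learning with a larger $\beta$.
Once the SSVAE model is trained, property prediction is performed using the predictor network $q_\phi(\mathbf{y} \mid \mathbf{x})$. Given an unlabeled instance $\mathbf{x}$, the corresponding properties $\hat{\mathbf{y}}$ are predicted as below.
$$\hat{\mathbf{y}} \sim \mathcal{N}\left(\boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\right) \qquad (4)$$
The point estimate of $\hat{\mathbf{y}}$ can be obtained by maximizing the probability, which is equivalent to $\boldsymbol{\mu}_\phi(\mathbf{x})$.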
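As a worked step for the labeled loss, the two log terms that involve only the latent variable can be combined into a Kullback-Leibler divergence. The sketch below assumes the usual VAE choices of a diagonal-Gaussian approximate posterior over $\mathbf{z}$ and a standard normal prior; $d$ (the latent dimension) and $\mu_k$, $\sigma_k^2$ (the posterior mean and variance of the $k$-th latent dimension) are symbols introduced here for illustration only.

```latex
\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})}
  \left[ \log q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) - \log p(\mathbf{z}) \right]
= \mathrm{KL}\left( q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{z}) \right)
= \frac{1}{2} \sum_{k=1}^{d} \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right)
```

Under these assumptions, only the reconstruction term $-\log p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z})$ needs to be estimated by sampling $\mathbf{z}$; the divergence is available in closed form.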
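We use the decoder network $p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z})$ to generate a molecule. A molecule representation $\hat{\mathbf{x}}$ is obtained from $\mathbf{y}$ and $\mathbf{z}$ by
$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} \log p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) \qquad (5)$$
At each time step $j$ of the decoder, the output $x^{(j)}$ is predicted by conditioning on all the previous outputs $(x^{(1)}, \ldots, x^{(j-1)})$, $\mathbf{y}$, and $\mathbf{z}$, because we decompose $p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z})$ as
$$p_\theta(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) = \prod_{j=1}^{l} p_\theta\left(x^{(j)} \mid x^{(1)}, \ldots, x^{(j-1)}, \mathbf{y}, \mathbf{z}\right) \qquad (6)$$
The optimal decoding solution can be obtained by maximizing the autoregressive distribution of $\mathbf{x}$. This is, however, computationally intractable, because the search space grows exponentially with the length of the sequence. Sampling from the autoregressive distribution is simple and fast, but is vulnerable to noise in sequence generation. Therefore, we use beam search to find an approximate solution efficiently, which has been successfully used to generate sequences with RNNs.^{39, 40, 41} Beam search generates a sequence from left to right based on a breadth-first tree search mechanism. At each time step $j$, the top-$K$ candidates are maintained.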
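The beam search described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: `beam_search`, `toy_model`, and the `<end>` symbol name are hypothetical, and the toy next-symbol distribution stands in for a trained decoder that would also condition on the properties and the latent vector.

```python
import math


def beam_search(step_log_probs, beam_width, max_len, end_symbol):
    """Approximate the most probable sequence under an autoregressive model.

    step_log_probs(prefix) must return a dict mapping each candidate next
    symbol to its log-probability given the tuple `prefix`.
    """
    beams = [((), 0.0)]   # unfinished (sequence, cumulative log-prob) pairs
    finished = []         # sequences that have emitted the end symbol
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for symbol, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (symbol,), score + logp))
        # Keep only the top-K extensions at this time step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_symbol:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])


def toy_model(prefix):
    """Hypothetical next-symbol distribution used only for illustration."""
    if not prefix:
        probs = {"C": 0.6, "N": 0.4}
    elif prefix == ("C",):
        probs = {"O": 0.55, "<end>": 0.45}
    elif prefix == ("N",):
        probs = {"<end>": 0.9, "O": 0.1}
    else:
        probs = {"<end>": 1.0}
    return {s: math.log(p) for s, p in probs.items()}


greedy_seq, greedy_score = beam_search(toy_model, 1, 4, "<end>")
beam_seq, beam_score = beam_search(toy_model, 2, 4, "<end>")
```

In this toy setting a beam width of 1 (greedy decoding) commits to the locally best first symbol, whereas a width of 2 keeps the alternative alive long enough to find a sequence with higher overall probability.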
To generate an arbitrary molecule unconditionally, $\mathbf{y}$ and $\mathbf{z}$ are sampled from their prior distributions $p(\mathbf{y})$ and $p(\mathbf{z})$, respectively. For conditional molecular design given a target value for a property, $\mathbf{z}$ is sampled from $p(\mathbf{z})$, while the corresponding element of $\mathbf{y}$ is set to the target value and the other elements are sampled from the conditional prior distribution given the target value. For example, if we want to generate a new molecule whose first property is close to 0.5, the first element $y_1$ is set to 0.5 while the other elements are sampled from $p(y_2, \ldots, y_m \mid y_1 = 0.5)$.
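Sampling the remaining properties from the conditional prior can be sketched with the standard formulas for conditioning a multivariate normal on one of its elements. The code below is an illustration under stated assumptions: the function names are hypothetical, the mean and covariance are built from the summary statistics and correlation coefficients reported for the training set rather than from the fitted prior, and raw units are used instead of normalized values.

```python
import math
import random


def conditional_gaussian(mean, cov, index, target):
    """Mean and covariance of the other elements given y[index] = target."""
    rest = [i for i in range(len(mean)) if i != index]
    var_t = cov[index][index]
    cond_mean = [mean[i] + cov[i][index] / var_t * (target - mean[index])
                 for i in rest]
    cond_cov = [[cov[i][j] - cov[i][index] * cov[index][j] / var_t
                 for j in rest] for i in rest]
    return rest, cond_mean, cond_cov


def cholesky(a):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_y(mean, cov, index, target, rng=random):
    """Fix one property at its target and sample the rest conditionally."""
    rest, cond_mean, cond_cov = conditional_gaussian(mean, cov, index, target)
    low = cholesky(cond_cov)
    eps = [rng.gauss(0.0, 1.0) for _ in rest]
    y = [0.0] * len(mean)
    y[index] = target
    for r, i in enumerate(rest):
        y[i] = cond_mean[r] + sum(low[r][k] * eps[k] for k in range(r + 1))
    return y


# Order: MolWt, LogP, QED (reported training-set statistics, used as a stand-in).
MEAN = [359.019, 2.911, 0.696]
STD = [67.669, 1.179, 0.158]
CORR = [[1.0, 0.434, -0.548],
        [0.434, 1.0, -0.298],
        [-0.548, -0.298, 1.0]]
COV = [[CORR[i][j] * STD[i] * STD[j] for j in range(3)] for i in range(3)]

rest, cond_mean, cond_cov = conditional_gaussian(MEAN, COV, 0, 250.0)
y_sample = sample_y(MEAN, COV, 0, 250.0, random.Random(0))
```

Because MolWt is positively correlated with LogP and negatively correlated with QED, fixing MolWt below its mean shifts the conditional mean of LogP downward and that of QED upward.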
We collect 310,000 SMILES strings of drug-like molecules randomly sampled from the ZINC database.^{9} We use 300,000 molecules for training and the remaining 10,000 molecules for testing the property prediction performance. The SMILES strings of the molecules are canonicalized using the RDKit package,^{42} and are then transformed into sequences of symbols occurring in the training set. The vocabulary contains 35 different symbols: {1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, =, #, (, ), [, ], H, B, C, N, O, F, Si, P, S, Cl, Br, Sn, I, c, n, o, p, s}. The minimum, median, and maximum lengths of a SMILES string are 8, 42, and 86, respectively. A special symbol indicating the end of a sequence is appended at the end of each sequence.
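The symbol-level encoding can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in the experiments: the function names and the `<end>` token name are hypothetical, and matching Si and Sn only inside square brackets is an assumption made here so that a sulfur atom followed by an aromatic nitrogen is not mistaken for tin.

```python
END = "<end>"  # hypothetical name for the end-of-sequence symbol


def tokenize(smiles):
    """Split a SMILES string into symbols, keeping two-character atoms whole."""
    tokens, i, in_bracket = [], 0, False
    while i < len(smiles):
        pair = smiles[i:i + 2]
        # Cl and Br are unambiguous; Si and Sn are only matched inside [...].
        if pair in ("Cl", "Br") or (in_bracket and pair in ("Si", "Sn")):
            tokens.append(pair)
            i += 2
            continue
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        tokens.append(ch)
        i += 1
    return tokens


def build_vocab(token_lists):
    """Map every symbol seen in the data (plus END) to an integer index."""
    symbols = sorted({t for toks in token_lists for t in toks} | {END})
    return {s: k for k, s in enumerate(symbols)}


def one_hot(tokens, vocab):
    """Encode a token sequence, with END appended, as one-hot vectors."""
    vectors = []
    for t in tokens + [END]:
        v = [0] * len(vocab)
        v[vocab[t]] = 1
        vectors.append(v)
    return vectors


benzene = tokenize("c1ccccc1")
vocab = build_vocab([benzene, tokenize("ClC[Si](C)Br")])
encoded = one_hot(benzene, vocab)
```

Building the vocabulary from the training data, as the text describes, means that a symbol never seen during training has no index; how such symbols are handled at test time is not specified in the source.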
It is time-consuming and costly to directly obtain the chemical properties of numerous newly generated molecules by performing first-principles calculations or experimental syntheses. In order to efficiently evaluate the proposed approach, we use three properties that can be readily calculated using the RDKit package:^{42} molecular weight (MolWt), Wildman-Crippen partition coefficient (LogP),^{43} and quantitative estimation of drug-likeness (QED).^{44} Figure 2 shows the distributions of these properties in the training set. In this figure, the histograms plot the distributions of individual properties, and the scatterplots represent the pairwise distributions of the properties on 3,000 randomly selected molecules. MolWt ranges from 200 to 500 g/mol and LogP has the range of [-2, 5], according to the drug-like criteria of the ZINC database. QED is valued from 0 to 1 by definition. The averages of MolWt, LogP, and QED are 359.019, 2.911, and 0.696, and their standard deviations are 67.669, 1.179, and 0.158, respectively. There is a positive correlation between MolWt and LogP with a correlation coefficient of 0.434, whereas QED is negatively correlated with both MolWt and LogP, with correlation coefficients of -0.548 and -0.298, respectively.
We evaluate the SSVAE model against baseline models in terms of prediction performance. We vary the fraction of labeled molecules (5%, 10%, 20%, and 50% of the training set) to investigate its effect on property prediction. 95% of the training set is used for training, while the remaining 5% is used for early stopping. During training, we normalize each output variable to have a mean of 0 and a standard deviation of 1. We use backpropagation with the Adam optimizer.^{45} We set the default learning rate to 0.001 and use a batch size of 200. Training is terminated if the validation error fails to decrease by 1% over ten consecutive epochs or the number of epochs reaches 300. The property prediction performance is evaluated by mean absolute error (MAE) on the test set.
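The stopping rule can be read in more than one way. The sketch below takes one plausible interpretation: training stops when the best validation error of the last ten epochs is not at least 1% lower than the best error seen before them, or when the epoch limit is reached. The function name and its parameters are hypothetical, and the released code may implement the criterion differently.

```python
def should_stop(val_errors, patience=10, min_rel_improvement=0.01,
                max_epochs=300):
    """Decide whether to stop, given one validation error per finished epoch."""
    n = len(val_errors)
    if n >= max_epochs:
        return True          # hard limit on the number of epochs
    if n <= patience:
        return False         # not enough history to judge yet
    best_before = min(val_errors[:-patience])
    best_recent = min(val_errors[-patience:])
    # Stop when the recent window failed to improve the best error by 1%.
    return best_recent > best_before * (1.0 - min_rel_improvement)


steady = [0.95 ** k for k in range(20)]      # keeps improving by 5% per epoch
plateau = [1.0] + [0.995] * 10               # improves by only 0.5%
late_gain = [1.0] + [0.995] * 9 + [0.9]      # improves on the tenth epoch
```

Comparing against the best earlier error, rather than the immediately preceding epoch, keeps a single noisy epoch from either triggering or postponing the stop.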
For the SSVAE model, the predictor, encoder, and decoder networks each consist of three hidden layers, each having 250 gated recurrent units (GRU).^{46} The dimension of $\mathbf{z}$ is set to 100. For the prior distribution $p(\mathbf{y})$, we estimate the mean vector and covariance matrix from the labeled molecules in the training set. Figure 3 shows the architectural detail of the SSVAE model. We train the model with both the labeled and unlabeled molecules to minimize the objective function in Equation 3. We set the hyper-parameter $\beta$ through a preliminary experiment that minimizes the average validation error of property prediction over the varying numbers of labeled molecules.
As baseline models, we use the extended-connectivity fingerprint (ECFP),^{11} molecular graph convolutions (GraphConv),^{19} an independently trained predictor network $q_\phi(\mathbf{y} \mid \mathbf{x})$, and a VAE model jointly trained with a property prediction model that predicts the properties from the latent representation (referred to below as the VAE model with joint property prediction).^{28} In the cases of the ECFP and GraphConv models, a molecule is processed by three hidden layers consisting of 2000, 500, and 500 sigmoid units, respectively, with a dropout rate of 0.2.^{47} These are followed by a final output layer that predicts the three output variables. The baseline models except the VAE model are trained only with the labeled molecules in the training set to minimize the mean squared error between the actual and predicted properties. For the implementation of the VAE model with joint property prediction,^{28} its VAE part is trained with all the molecules without their labels, and the joint prediction model is trained only with the labeled molecules using the same objective function as that of the SSVAE model.
To demonstrate conditional molecular design, we use the SSVAE model trained on the training set in which 50% of the molecules were labeled with their properties. New molecules are generated under various target conditions, each of which sets one property to a specific target value and samples the others from the corresponding conditional prior distribution.
We compare the SSVAE model with the unsupervised VAE model and the VAE model with joint property prediction, both trained on the same training set. In the case of the VAE model with joint property prediction, we use a Gaussian process to smoothly approximate the property prediction given a latent representation and perform Bayesian optimization.^{28} The objective function for Bayesian optimization is set as the normalized absolute difference between the target value and the value predicted from the latent representation by the joint property prediction model. For generating a molecule, Bayesian optimization is terminated when the value of the objective function is below 0.01.
Molecules are generated from the decoder network of the model using beam search, where the beam width is set to 5. The target values for MolWt, LogP, and QED are set as {250, 350, 450}, {1.5, 3.0, 4.5}, and {0.5, 0.7, 0.9}, respectively. We also test generating new molecules unconditionally without specifying any target value. During the generation procedure given each target condition, we check the validity of each generated molecule using the SMILES grammar rules (e.g., the number of open/close parentheses and the existence of unclosed rings) and pre-conditions (e.g., kekulizability) using the RDKit package.^{42} We discard molecules that are identified as invalid, already exist in the training set, or are duplicates. The generation procedure continues until 3,000 novel unique molecules are obtained or the number of trials reaches 10,000. Then, these molecules are labeled with MolWt, LogP, and QED to confirm whether their properties are distributed around their respective target values.
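The filtering procedure can be sketched as a simple loop. This is an illustration under stated assumptions, not the released code: `collect_novel_molecules`, `generate`, and `is_valid` are hypothetical names, `generate` is assumed to return one canonical SMILES string per call, and `is_valid` stands in for the grammar and RDKit checks described above.

```python
def collect_novel_molecules(generate, is_valid, training_set,
                            n_target=3000, max_trials=10000):
    """Generate until n_target novel unique molecules or max_trials attempts."""
    novel, seen = [], set()
    counts = {"generated": 0, "invalid": 0,
              "in_training_set": 0, "duplicated": 0}
    while len(novel) < n_target and counts["generated"] < max_trials:
        smiles = generate()
        counts["generated"] += 1
        if not is_valid(smiles):
            counts["invalid"] += 1           # fails grammar or sanity checks
        elif smiles in training_set:
            counts["in_training_set"] += 1   # not novel
        elif smiles in seen:
            counts["duplicated"] += 1        # already generated in this run
        else:
            seen.add(smiles)
            novel.append(smiles)
    counts["new_unique"] = len(novel)
    return novel, counts


# Toy usage with a fixed stream of candidate strings instead of a decoder.
stream = iter(["CC", "CC", "bad", "CO", "CN", "CCC"])
novel, counts = collect_novel_molecules(
    generate=lambda: next(stream),
    is_valid=lambda s: s != "bad",
    training_set={"CO"},
    n_target=3,
    max_trials=100,
)
```

Checking validity first, then membership in the training set, then duplication makes the four outcome counts mutually exclusive, so they add up to the number of generated molecules.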
All the experiments are implemented based on GPU-accelerated TensorFlow in Python.^{48} The source code of the SSVAE model used in the experiments is available at https://github.com/nyu-dl/conditional-molecular-design-ssvae.
Table 1 shows the results in terms of MAE with varying fractions of labeled molecules in the training set. We report the average and standard deviation over ten repetitions for each setting. Among the baseline models, the GraphConv model was superior to the ECFP model in every case. The GraphConv model yielded performance comparable to the predictor model with fewer labeled molecules, while the predictor model was superior with more labeled molecules. The predictor model significantly outperformed the ECFP and GraphConv models on predicting MolWt, which is almost identical to the task of simply counting atoms in a SMILES string. Compared to the predictor model, the VAE model with joint property prediction performed worse for MolWt but was superior in predicting LogP and QED with fewer labeled molecules.
The SSVAE model outperformed the baseline models in most cases. The SSVAE model yielded better prediction performance than the predictor model did with a lower fraction of labeled molecules. On the other hand, the difference between the SSVAE model and the predictor model narrowed as the fraction of labeled molecules increased. The results demonstrate the effectiveness of this semi-supervised learning scheme in improving property prediction.
Table 1. Property prediction MAE on the test set (mean ± standard deviation over ten repetitions). The VAE column refers to the VAE model with joint property prediction.
frac. labeled | property | ECFP | GraphConv | predictor network | VAE | SSVAE |
---|---|---|---|---|---|---|
5% | MolWt | 17.713 ± 0.396 | 6.723 ± 2.116 | 2.582 ± 0.288 | 3.463 ± 0.971 | 1.639 ± 0.577 |
5% | LogP | 0.380 ± 0.009 | 0.187 ± 0.015 | 0.162 ± 0.006 | 0.125 ± 0.013 | 0.120 ± 0.006 |
5% | QED | 0.053 ± 0.001 | 0.034 ± 0.004 | 0.037 ± 0.002 | 0.029 ± 0.002 | 0.028 ± 0.001 |
10% | MolWt | 15.057 ± 0.358 | 5.255 ± 0.767 | 1.986 ± 0.470 | 2.464 ± 0.581 | 1.444 ± 0.618 |
10% | LogP | 0.335 ± 0.005 | 0.148 ± 0.016 | 0.116 ± 0.006 | 0.097 ± 0.008 | 0.090 ± 0.004 |
10% | QED | 0.045 ± 0.001 | 0.028 ± 0.003 | 0.027 ± 0.002 | 0.021 ± 0.002 | 0.021 ± 0.001 |
20% | MolWt | 12.047 ± 0.168 | 4.597 ± 0.419 | 1.228 ± 0.229 | 1.748 ± 0.266 | 1.008 ± 0.370 |
20% | LogP | 0.249 ± 0.004 | 0.112 ± 0.015 | 0.070 ± 0.007 | 0.074 ± 0.006 | 0.071 ± 0.007 |
20% | QED | 0.033 ± 0.001 | 0.021 ± 0.002 | 0.017 ± 0.002 | 0.015 ± 0.001 | 0.016 ± 0.001 |
50% | MolWt | 9.012 ± 0.184 | 4.506 ± 0.279 | 1.010 ± 0.250 | 1.350 ± 0.319 | 1.050 ± 0.164 |
50% | LogP | 0.180 ± 0.003 | 0.086 ± 0.012 | 0.045 ± 0.005 | 0.049 ± 0.008 | 0.047 ± 0.003 |
50% | QED | 0.023 ± 0.000 | 0.018 ± 0.001 | 0.011 ± 0.001 | 0.009 ± 0.002 | 0.010 ± 0.001 |
Table 2 shows the statistics of generated molecules given each target condition in order to investigate the efficacy of molecule generation. For the SSVAE model, the fraction of invalid molecules was generally less than 1%, and was slightly higher when the target value had a lower density in the distribution of the training set. There were a few duplicated molecules from unconditional generation, and the fraction of new unique molecules was 92.7%. There were more duplicates when molecules were conditionally generated. In particular, the fraction of duplicated molecules for a target condition was higher when the prediction of the property for the condition was more accurate. As the normalized MAEs of MolWt, LogP, and QED by the SSVAE model were 0.016, 0.038, and 0.058, MolWt yielded the lowest fraction of new unique molecules and was followed by LogP and QED.
Both the unsupervised VAE model and the VAE model with joint property prediction were less efficient than the SSVAE model, as is evident from the higher number of duplicated molecules they generated. When we tried sampling from the unsupervised VAE model without beam search, the model rarely generated duplicates, but the majority of the generated molecules were invalid.
Table 2. Statistics of generated molecules for each target condition.
model | target condition | no. generated | no. invalid | no. in training set | no. duplicated | no. new unique |
---|---|---|---|---|---|---|
VAE (unsupervised) | uncond. gen. (sampling) | 10000 (100%) | 8771 (87.7%) | 5 (0.1%) | 65 (0.7%) | 1159 (11.6%) |
VAE (unsupervised) | uncond. gen. (beam search) | 10000 (100%) | 2 (0.0%) | 1243 (12.4%) | 6802 (68.0%) | 1953 (19.5%) |
VAE (joint property prediction) | unconditional generation | 5940 (100%) | 2 (0.0%) | 486 (8.2%) | 2452 (41.3%) | 3000 (50.5%) |
VAE (joint property prediction) | MolWt=250 | 10000 (100%) | 7 (0.1%) | 1136 (11.4%) | 6400 (64.0%) | 2457 (24.6%) |
VAE (joint property prediction) | MolWt=350 | 8618 (100%) | 1 (0.0%) | 647 (7.5%) | 4970 (57.7%) | 3000 (34.8%) |
VAE (joint property prediction) | MolWt=450 | 10000 (100%) | 9 (0.1%) | 1120 (11.2%) | 6626 (66.3%) | 2245 (22.5%) |
VAE (joint property prediction) | LogP=1.5 | 9521 (100%) | 10 (0.1%) | 575 (6.0%) | 5936 (62.3%) | 3000 (31.5%) |
VAE (joint property prediction) | LogP=3.0 | 7628 (100%) | 4 (0.1%) | 560 (7.3%) | 4064 (53.3%) | 3000 (39.3%) |
VAE (joint property prediction) | LogP=4.5 | 10000 (100%) | 13 (0.1%) | 862 (8.6%) | 6563 (65.6%) | 2562 (25.6%) |
VAE (joint property prediction) | QED=0.5 | 9643 (100%) | 20 (0.2%) | 764 (7.9%) | 5859 (60.8%) | 3000 (31.1%) |
VAE (joint property prediction) | QED=0.7 | 6888 (100%) | 3 (0.0%) | 617 (9.0%) | 3268 (47.4%) | 3000 (43.6%) |
VAE (joint property prediction) | QED=0.9 | 10000 (100%) | 6 (0.1%) | 851 (8.5%) | 6476 (64.8%) | 2667 (26.7%) |
SSVAE | unconditional generation | 3236 (100%) | 23 (0.7%) | 163 (5.0%) | 50 (1.5%) | 3000 (92.7%) |
SSVAE | MolWt=250 | 4079 (100%) | 16 (0.4%) | 177 (4.3%) | 886 (21.7%) | 3000 (73.5%) |
SSVAE | MolWt=350 | 3629 (100%) | 17 (0.5%) | 137 (3.8%) | 475 (13.1%) | 3000 (82.7%) |
SSVAE | MolWt=450 | 4181 (100%) | 31 (0.7%) | 277 (6.6%) | 873 (20.9%) | 3000 (71.8%) |
SSVAE | LogP=1.5 | 3457 (100%) | 26 (0.8%) | 127 (3.7%) | 304 (8.8%) | 3000 (86.8%) |
SSVAE | LogP=3.0 | 3433 (100%) | 21 (0.6%) | 166 (4.8%) | 246 (7.2%) | 3000 (87.4%) |
SSVAE | LogP=4.5 | 3507 (100%) | 30 (0.9%) | 186 (5.3%) | 291 (8.3%) | 3000 (85.5%) |
SSVAE | QED=0.5 | 3456 (100%) | 49 (1.4%) | 171 (4.9%) | 236 (6.8%) | 3000 (86.8%) |
SSVAE | QED=0.7 | 3308 (100%) | 19 (0.6%) | 168 (5.1%) | 121 (3.7%) | 3000 (90.7%) |
SSVAE | QED=0.9 | 3233 (100%) | 12 (0.4%) | 125 (3.9%) | 96 (3.0%) | 3000 (92.8%) |
Table 3. Summary statistics (mean ± standard deviation) of the training set and of newly generated molecules for each target condition.
model | target condition | no. molecules | sequence length | MolWt (g/mol) | LogP | QED |
---|---|---|---|---|---|---|
training set | all molecules | 300000 | 42.375 ± 9.316 | 359.019 ± 67.669 | 2.911 ± 1.179 | 0.696 ± 0.158 |
training set | labeled molecules | 150000 | 42.402 ± 9.313 | 359.381 ± 67.666 | 2.912 ± 1.177 | 0.696 ± 0.158 |
training set | 240 ≤ MolWt ≤ 260 | 4868 | 28.818 ± 3.707 | 250.237 ± 5.642 | 2.116 ± 1.074 | 0.762 ± 0.118 |
training set | 340 ≤ MolWt ≤ 360 | 18799 | 41.094 ± 3.865 | 348.836 ± 5.751 | 2.793 ± 1.094 | 0.759 ± 0.123 |
training set | 440 ≤ MolWt ≤ 460 | 8546 | 53.562 ± 4.586 | 448.959 ± 5.631 | 3.569 ± 0.989 | 0.540 ± 0.122 |
training set | 1.4 ≤ LogP ≤ 1.6 | 4591 | 38.223 ± 8.679 | 320.299 ± 61.256 | 1.503 ± 0.057 | 0.757 ± 0.130 |
training set | 2.9 ≤ LogP ≤ 3.1 | 9657 | 42.214 ± 9.044 | 357.686 ± 62.747 | 2.999 ± 0.058 | 0.722 ± 0.150 |
training set | 4.4 ≤ LogP ≤ 4.6 | 6040 | 47.228 ± 8.404 | 404.425 ± 56.184 | 4.496 ± 0.059 | 0.589 ± 0.149 |
training set | 0.49 ≤ QED ≤ 0.51 | 3336 | 49.028 ± 8.415 | 409.733 ± 63.362 | 3.467 ± 1.130 | 0.500 ± 0.006 |
training set | 0.69 ≤ QED ≤ 0.71 | 5961 | 42.614 ± 8.571 | 362.448 ± 63.359 | 2.929 ± 1.255 | 0.700 ± 0.006 |
training set | 0.89 ≤ QED ≤ 0.91 | 5466 | 36.910 ± 5.219 | 316.128 ± 32.789 | 2.529 ± 0.865 | 0.900 ± 0.006 |
VAE (unsupervised) | uncond. gen. (sampling) | 1159 | 34.965 ± 8.123 | 305.756 ± 65.817 | 2.900 ± 1.262 | 0.725 ± 0.143 |
VAE (unsupervised) | uncond. gen. (beam search) | 1953 | 43.911 ± 8.384 | 366.502 ± 60.865 | 2.987 ± 0.995 | 0.707 ± 0.149 |
VAE (joint property prediction) | unconditional generation | 3000 | 43.853 ± 7.477 | 362.037 ± 54.528 | 2.986 ± 0.994 | 0.716 ± 0.132 |
VAE (joint property prediction) | MolWt=250 | 2457 | 30.284 ± 4.261 | 255.676 ± 25.457 | 2.230 ± 0.965 | 0.789 ± 0.093 |
VAE (joint property prediction) | MolWt=350 | 3000 | 40.718 ± 4.534 | 338.858 ± 27.478 | 3.023 ± 0.982 | 0.766 ± 0.111 |
VAE (joint property prediction) | MolWt=450 | 2245 | 53.738 ± 4.636 | 443.253 ± 23.950 | 3.477 ± 0.946 | 0.573 ± 0.108 |
VAE (joint property prediction) | LogP=1.5 | 3000 | 40.296 ± 8.019 | 329.754 ± 58.057 | 1.478 ± 0.537 | 0.744 ± 0.117 |
VAE (joint property prediction) | LogP=3.0 | 3000 | 42.975 ± 7.352 | 353.535 ± 53.487 | 2.740 ± 0.643 | 0.728 ± 0.127 |
VAE (joint property prediction) | LogP=4.5 | 2562 | 46.198 ± 7.871 | 389.465 ± 53.514 | 4.290 ± 0.435 | 0.636 ± 0.136 |
VAE (joint property prediction) | QED=0.5 | 3000 | 49.955 ± 6.749 | 409.021 ± 47.190 | 3.386 ± 1.023 | 0.544 ± 0.096 |
VAE (joint property prediction) | QED=0.7 | 3000 | 45.331 ± 7.382 | 375.083 ± 53.888 | 3.079 ± 1.002 | 0.688 ± 0.111 |
VAE (joint property prediction) | QED=0.9 | 2667 | 37.441 ± 5.573 | 310.396 ± 38.871 | 2.515 ± 0.918 | 0.860 ± 0.062 |
SSVAE | unconditional generation | 3000 | 42.093 ± 9.010 | 359.135 ± 65.534 | 2.873 ± 1.117 | 0.695 ± 0.148 |
SSVAE | MolWt=250 | 3000 | 28.513 ± 3.431 | 250.287 ± 6.742 | 2.077 ± 1.072 | 0.796 ± 0.094 |
SSVAE | MolWt=350 | 3000 | 41.401 ± 4.393 | 349.599 ± 7.345 | 2.782 ± 1.060 | 0.723 ± 0.129 |
SSVAE | MolWt=450 | 3000 | 53.179 ± 4.760 | 449.593 ± 8.901 | 3.544 ± 1.016 | 0.563 ± 0.122 |
SSVAE | LogP=1.5 | 3000 | 38.709 ± 8.669 | 323.336 ± 60.288 | 1.539 ± 0.301 | 0.750 ± 0.127 |
SSVAE | LogP=3.0 | 3000 | 42.523 ± 8.919 | 361.264 ± 61.524 | 2.984 ± 0.295 | 0.701 ± 0.149 |
SSVAE | LogP=4.5 | 3000 | 45.566 ± 8.698 | 397.609 ± 61.436 | 4.350 ± 0.309 | 0.624 ± 0.147 |
SSVAE | QED=0.5 | 3000 | 48.412 ± 7.904 | 404.159 ± 56.788 | 3.288 ± 1.069 | 0.527 ± 0.094 |
SSVAE | QED=0.7 | 3000 | 41.737 ± 7.659 | 356.672 ± 55.629 | 2.893 ± 1.093 | 0.719 ± 0.088 |
SSVAE | QED=0.9 | 3000 | 36.243 ± 7.689 | 312.985 ± 56.270 | 2.444 ± 1.079 | 0.840 ± 0.070 |
Table 3 presents the summary statistics for newly generated molecules under each condition, and Figures 4 and 5 compare the histograms representing the distributions of MolWt, LogP, and QED between different target conditions for the SSVAE model and the VAE model with joint property prediction, respectively. For the SSVAE model, the unconditionally generated molecules without any target value followed the property distributions of the training set, as is evident from comparing Figure 4(a–c) with Figure 2(a–c). When a target condition was set, the SSVAE model successfully generated new molecules fulfilling the condition. In Figure 4(d–f), we observe that the distributions of the molecules conditionally generated by the SSVAE model were centered around the corresponding target values with much smaller standard deviations. The conditionally generated molecules followed the property distributions of those molecules in the training set whose property was around the target value, as shown in Table 3. The accuracy of conditional molecular design for a target condition tended to be proportional to the prediction accuracy of the corresponding property. For MolWt, which yielded the lowest normalized MAE, the distributions were relatively narrow and distinctly separated by target value. On the other hand, LogP and QED exhibited larger overlap between target values. The VAE model with joint property prediction also generated new molecules satisfying the target conditions, but the distributions were relatively dispersed and farther from the corresponding target values than those of the SSVAE model, as shown in Figure 5(d–f).
We present some sample molecules generated from the SSVAE model under each target condition in Figure 6. Looking at the sample molecules generated with three different target values of MolWt, we observe that the SSVAE model generated smaller molecules when the target MolWt was set to 250. On the other hand, when MolWt was set to a higher value, relatively larger molecules were generated.
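Table 4. Training time and inference time per generation (mean ± standard deviation).
model | training time | inference time, unconditional gen. | inference time, conditional gen. |
---|---|---|---|
VAE (unsupervised) | 7.4 ± 1.9 hrs | 4.9 ± 0.7 s | - |
VAE (joint property prediction) | 10.2 ± 2.1 hrs | 4.7 ± 0.7 s | 46.6 ± 135.5 s |
SSVAE | 20.3 ± 5.3 hrs | 4.6 ± 1.0 s | 4.5 ± 1.1 s |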
Table 4 compares training and inference time between the models. It took longer to train the SSVAE model than the other models, because it has one more RNN, the predictor network, than the other models. For unconditional generation, there was only a slight difference in generation speed between the models. Conditional generation with the VAE model with joint property prediction, which involves Bayesian optimization, was time-consuming. On the other hand, conditional generation with the SSVAE model, which simply uses the decoder network without any extra optimization procedure, was as fast as unconditional generation.
We have presented a novel approach to conditionally generating molecules efficiently and accurately using the regression version of SSVAE. We designed and trained the SSVAE model on a partially labeled training set in which only a small portion of molecules were labeled with their properties. New molecules with desired properties were generated from the generative distribution of the SSVAE model given a target condition of properties. The experiments using drug-like molecules sampled from the ZINC database have successfully demonstrated the effectiveness in terms of both property prediction and conditional molecular design. The SSVAE model efficiently generates novel molecules satisfying the target conditions without any extra optimization procedure. Moreover, the conditional design procedure works by automatically learning implicit knowledge from data without necessitating any explicit knowledge.
The proposed approach can serve as an efficient tool for designing new chemical structures fulfilling a specified target condition. These structures generated as SMILES strings are to be examined further to obtain realistic molecules with desired properties. In this study, the application is limited to only a part of the chemical space that SMILES can represent. To broaden its applicability, we should investigate other alternatives to SMILES that provide higher coverage of the chemical space and are able to represent molecules more comprehensively.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT; Ministry of Science and ICT) (No. NRF-2017R1C1B5075685). K.C. thanks support by AdeptMind, eBay, TenCent, NVIDIA and CIFAR. K.C. was partly supported for this work by Samsung Electronics (Improving Deep Learning using Latent Structure).
The authors declare no competing financial interest.
Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-Chemical Insights from Deep Tensor Neural Networks. Nat. Commun. 2017, 8, 13890:1–8.
Schuster, M.; Paliwal, K. K. Bidirectional Recurrent Neural Networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014; pp 1724–1734.