Multiple-objective Reinforcement Learning for Inverse Design and Identification

by Haoran Wei, et al.
University of Delaware

The aim of inverse chemical design is to develop new molecules with given optimized molecular properties or objectives. Recently, generative deep learning (DL) networks have come to be considered the state of the art in inverse chemical design and have achieved early success in generating molecular structures with desired properties in the pharmaceutical and material chemistry fields. However, satisfying a large number of molecular objectives (more than 10) remains a limitation of current generative models. To improve the model's ability to handle a large number of molecule design objectives, we developed a Reinforcement Learning (RL) based generative framework to optimize chemical molecule generation. Using Curriculum Learning (CL) to fine-tune the pre-trained generative network allowed the model to satisfy up to 21 objectives and increased the generative network's robustness. The experiments show that the proposed multiple-objective RL-based generative model can correctly identify unknown molecules with an 83 to 100 percent success rate, compared to 0 percent for the baseline approach. Additionally, the proposed generative model is not limited to chemistry research challenges; we anticipate that other problems that use RL with multiple objectives will benefit from this framework.







Introduction

Designing a chemical for a specific target use, such as a new drug compound, traditionally relies on high-throughput screening experiments or simulations. Given the enormous number of potential targets in the chemical space, it comes as no surprise that this brute-force approach is highly iterative with low success rates. Recent advances in deep learning (DL) have demonstrated preliminary success with the inverse design paradigm, where desired properties are used as input for the DL models to generate chemical structures that satisfy the design requirements. To date, various generative approaches have been proposed and deployed in the chemical design domain, including models based on variational autoencoders [12], generative RNN models [21], and GANs [10].

Besides chemical design, generative models can also be used for chemical identification, which has applications in fields such as the production of biofuels, where complex mixtures of compounds are generated from biomass. Current approaches to identifying chemicals are limited, often relying on database matching of Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) spectra to known chemicals [3, 4]. However, given the extent of the chemical space, this approach is effective only for identifying chemicals already available in the database. On the other hand, optimizing a target compound structure against a set of constraints (i.e., fingerprints), such as molecular weight (MW), elemental composition, and the presence of specific functional groups (FG), can also constrain the search space. With enough constraints, one may arrive at a unique solution (chemical structure) that satisfies all listed constraints and, in doing so, identify the unknown chemical. The drawback is that as the number of constraints grows, the performance of generative DL networks drops dramatically. This decrease in performance was observed in our experiments. Designing valid chemicals that satisfy a large number of chemical property objectives remains a big challenge.

Figure 1: Inverse chemical design and identification (the identified constraints limit the chemical search space)

Our approach for inverse molecule design and identification is illustrated in Fig 1. The target molecule is the unknown, which needs to satisfy a list of input constraints that are the desired molecular properties. Note that “constraints” and “objectives” are used interchangeably in this paper. We have two specific goals in this work: 1) propose a valid DL framework for inverse molecule design and identification; 2) increase the number of input constraints that the model can satisfy simultaneously, considering the high computational and time complexity of current strategies. The ultimate goal is to generate, with high accuracy, a molecule that is identical or close to the unknown target based solely on a sequence of constraints. We believe this work could contribute significantly to the various fields mentioned above.

We formulate this molecule design and identification as a text generation and multi-constraint optimization problem. It is solved with the combination of RL and CL. Our contributions are as follows:

  • We developed training heuristics for multiple-objective (20+) RL using a modified curriculum training approach.

  • We developed the first multiple-objective RL-based generative DL model for chemical identification of simple organic molecules that are relevant to biofuels applications.

Related Work

One of the first molecular generative models developed was based on conditional variational autoencoders (CVAE) [8, 16], which are used to convert molecules, represented as SMILES (a representation of molecules with ASCII characters), into a continuous vector representation. A major issue in early CVAE models is the low accuracy in generating valid SMILES, although recent work has made progress towards this goal [5]. Other generative models have been reported based on generative RNN models, reinforcement learning (RL) [21], and GANs [11], and these models tend to produce a higher proportion of valid SMILES than CVAE-based models.

All prior work thus far has used generative models for chemical design, typically optimizing for a single objective [22, 21]. More recent work has included a few objectives, with typically no more than 5 properties optimized simultaneously [16]. In addition, the use of generative models for chemical identification has yet to be reported.

In this work, we combine RL with CL to extend the number of property constraints that can be optimized in the molecule generation context. RL, a machine learning method that learns to conduct complex tasks through interaction with a real environment, has received much attention in many domains, such as game playing [17] and robotic control [15]. RL has also been widely used in chemical design, such as molecule optimization with a Deep Q-network [26], where actions are handcrafted molecular properties associated with the input constraints. In our work, we minimized human participation and treated the desired constraints as part of the RL reward function instead of merging them into the actions. Our work is also related to the work of Popova et al. [22], where a predictive model is trained separately from the generation model to forecast and bias the generation of new chemical structures towards the desired properties. Our model also consists of two parts: a prior and an agent. The prior is a general generative model, while the agent is used to optimize the constraints. In our effort, we particularly emphasized increasing the number of constraints that the model can optimize. Lastly, our work is also inspired by related work [27, 1] where a Recurrent Neural Network is used to address the long-term dependencies between local molecular functional groups. To our knowledge, this is the first work that applies CL to the molecular inverse design problem. Previously, the application of CL to improve RL learning efficiency on difficult or multiple tasks has achieved much success in other domains, such as game playing [19, 18] and robotic control [6].

Model Design

In this section, we give an overview of the system design, including a prior model for generating valid SMILES and an agent model fine-tuned through curriculum-based RL to optimize multiple objectives/constraints.

Figure 2: RNN learning the SMILES syntax

Prior Model

The prior model is designed and trained to generate valid molecule SMILES sequences starting from only a start token. SMILES [25] is a molecular structure representation in ASCII format, in which letters and symbols represent atoms and molecular structures. For example, “cyclopropene” is written as “C1=CC1”. We consider the valid formulations of molecules to be encoded in SMILES sequences with the right chemical syntax. Previous research has demonstrated that the SMILES syntax can be learned efficiently with generative RNN models [21, 22]. Using a similar approach, we use an RNN (GRU)-based model as the prior model to learn the SMILES syntax. With the RNN layers, the training data is processed in a recurrent loop in which information flows from one step to the next while selectively remembering past information. In this way, the long-term dependencies inside a sequence can be learned.

The input for the prior model is the tokenized SMILES, and the output is the corresponding sequence-shifted one-hot vector. This means that when the first $t$ characters in a SMILES sequence are fed into the prior, the $(t+1)$-th character is predicted as a conditional probability distribution across the whole vocabulary, as shown in Fig 2. Assume a single input/output pair is $(x_{1:t}, x_{t+1})$, where $x_{1:t}$ represents the first $t$ characters and $x_{t+1}$ is the $(t+1)$-th character as well as the prediction target. The output of the prior model is a conditional probability distribution $P(\hat{x}_{t+1} \mid x_{1:t}; \theta)$, where $\hat{x}_{t+1}$ is the predicted character. The loss for a single generated SMILES is represented with the cross-entropy equation, shown as Eq 1:

$$L(\theta) = -\sum_{t=1}^{T} \log P(x_{t+1} \mid x_{1:t}; \theta) \qquad (1)$$

where $\theta$ are the trainable parameters in the prior model and $T$ is the fixed length of the prior model's output. Here we used 140 as the desired length, with zero post-padding for SMILES outputs whose lengths are shorter.
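The cross-entropy loss described above can be illustrated with a minimal sketch. The function name, the three-token toy vocabulary, and the probability values are assumptions made for illustration; a real prior would produce one distribution over all vocabulary tokens at each step.

```python
import math

def sequence_cross_entropy(step_distributions, target_ids):
    """Sum of -log p(target token) over the steps of one sequence."""
    loss = 0.0
    for probs, target in zip(step_distributions, target_ids):
        loss -= math.log(probs[target])
    return loss

# Toy vocabulary of 3 tokens; each row is the predicted distribution at one step.
step_distributions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1, 2]  # index of the true next token at each step
loss = sequence_cross_entropy(step_distributions, target_ids)
```

The loss shrinks toward zero as the model assigns more probability to each true next token. Handling of padded positions is left out of this sketch.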

Agent Model

After being trained on the 1.5 million valid SMILES sequences (training batch size 128) for 20 epochs, the prior model reaches around 98% SMILES generation validity. However, the generated SMILES do not necessarily satisfy any specific design or identification criteria; they are only valid SMILES representations of plausible molecules. Finding the desired SMILES that satisfies all constraints is nearly intractable for the prior model alone: our model has a maximum output length of 140 and a vocabulary of 87 valid SMILES tokens, which translates to a search space of up to $87^{140}$ sequences. Therefore, it is necessary to build an agent model that generates a higher fraction of desired SMILES more efficiently. The agent model is initialized from the prior model and tuned with multiple target molecular constraints. The agent model tuning is a distribution-shifting process, as shown in Fig 3. The SMILES sequences generated by the well-trained prior model follow a “default” distribution over the chemical space; with the target molecular constraints, however, the distribution is expected to change. Therefore, tuning the agent model aims to make it generate SMILES with high validity that also follow a new target distribution. In this study, we use RL as the tuning approach.
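To give a sense of scale for the search space mentioned above, the following sketch counts the token sequences of maximum length using the vocabulary size (87) and maximum output length (140) stated in the text. It is an upper bound only, since most such sequences are not valid SMILES.

```python
vocab_size = 87     # tokens in the SMILES vocabulary (from the text)
max_length = 140    # maximum output length (from the text)

# Upper bound on distinct token sequences of exactly max_length tokens.
search_space = vocab_size ** max_length
num_digits = len(str(search_space))
print(f"vocab_size ** max_length has {num_digits} decimal digits")
```

Python integers are arbitrary precision, so the count is exact rather than a floating-point estimate.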

Figure 3: Shift in generated SMILES distribution to fit the desired molecular distribution due to reinforcement learning

Tuning agent model with RL

RL is a commonly used machine learning method for sequential decision-making problems; it learns by interacting with the environment and maximizing the cumulative reward. In this study, we show that SMILES generative model optimization can be studied as an RL problem, modeled as a Markov Decision Process (MDP). The MDP consists of the following components:

  • The state space is the collection of all possible SMILES sub-sequences generated so far. For example, one state is a sub-sequence of characters, each of which is one character from the vocabulary. In our setting, states can have varying dimensions, rather than the fixed dimension used in other MDPs.

  • The action space is the entire vocabulary used to represent SMILES, and each action is one character in the vocabulary. Therefore, the size of the action space equals the size of the vocabulary.

  • The transition model represents the stochasticity of the generative model. Given a current state and an action, the probability of the next state is given by the output of the generative model. It can also be used to balance RL exploration and exploitation: sampling an action according to this probability distribution is exploration, and taking the single action with the highest probability is exploitation.

  • The reward function is aggregated and measures both the sequence validity and the distance to the desired SMILES w.r.t. the desired constraints.

    The first term represents the syntax validity of a sequence generated by the agent model compared to the pretrained prior model. It uses the output of the prior model given the same input as the agent model, expressed as the cross-entropy loss calculated with Eq 1. The constraint score measures how well the generated sequence satisfies the desired constraints (details are given in the following sections). The reward function is defined at the SMILES sequence level, so a reward is only provided at the end of each RL trajectory. Each trajectory is a SMILES sequence that is iteratively predicted by the agent model until a terminal token is reached.

  • The initial state is the same initial token “G" for all SMILES sequences.
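The MDP components listed above can be sketched as a single trajectory rollout. This is a minimal illustration, not the authors' implementation: `toy_policy` is a hypothetical stand-in for the agent network, and its three-token vocabulary and probabilities are invented for the example. The start token “G”, the terminal token “E”, and the maximum length of 140 come from the text.

```python
import random

START, END = "G", "E"

def toy_policy(state):
    """Hypothetical stand-in for the agent: maps a sub-sequence (the state)
    to a probability distribution over the next character (the action)."""
    if len(state) >= 6:
        return {END: 1.0}
    return {"C": 0.6, "O": 0.3, END: 0.1}

def rollout(policy, explore, rng, max_length=140):
    state = START                      # every trajectory starts from "G"
    while len(state) < max_length:
        probs = policy(state)
        tokens = list(probs)
        if explore:                    # exploration: sample from the distribution
            action = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        else:                          # exploitation: take the most probable action
            action = max(tokens, key=probs.get)
        state += action                # next state = sub-sequence extended by the action
        if action == END:
            break
    return state

rng = random.Random(0)
greedy = rollout(toy_policy, explore=False, rng=rng)
sampled = rollout(toy_policy, explore=True, rng=rng)
```

In this sketch the state grows by one character per step, which is why states have varying dimensions, and a reward would only be computed on the completed string returned by `rollout`.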

The agent model follows the RL policy to make a valid prediction, and the goal is to maximize the objective function given in Eq 3.

The agent model's trainable parameters are optimized by gradient ascent on the objective function, as in Eq 4.

The difference between the cross-entropy losses is an inverse Kullback–Leibler (KL) divergence [13], shown in Eq 5. Maximizing the inverse KL divergence minimizes the difference between the agent model's policy and the prior model's policy and thus maintains valid SMILES syntax. Here the well-trained prior is used as the target since it has near 100% generative validity.
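The link between a difference of cross-entropy losses and a KL divergence can be checked numerically. In the sketch below, the two three-outcome distributions are hypothetical. Because the negative log-probability of a sequence is its cross-entropy loss, the expected gap between the prior's loss and the agent's loss, taken over sequences sampled from the agent, equals KL(agent || prior). Reading the text's “inverse KL” as the negative of this quantity is an assumption.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

prior = [0.5, 0.3, 0.2]   # hypothetical prior probabilities of three sequences
agent = [0.2, 0.3, 0.5]   # hypothetical agent probabilities of the same sequences

# Expected (prior loss - agent loss) under the agent's own samples,
# where loss = -log probability.
expected_gap = sum(a * (math.log(a) - math.log(p)) for a, p in zip(agent, prior))

kl_agent_prior = kl_divergence(agent, prior)
kl_same = kl_divergence(prior, prior)
```

The divergence is zero only when the agent matches the prior, so a reward term that penalizes it pulls the agent back toward the prior's learned SMILES syntax.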


In related work, when there is more than one desired constraint, a naive (baseline) approach is to use an equally weighted sum as the constraint score [16], as Eq 12:

$$S(x) = \frac{1}{n}\sum_{i=1}^{n} S_i(x)$$

where $n$ is the number of constraints and $S_i(x)$ is the score of the $i$-th constraint for a generated sequence $x$. However, the performance of this naive approach is unlikely to scale well to many constraints. We hypothesize that part of the issue is caused by the equal weights, where the influences of different constraints are unequal or even cancel one another out. Another issue is the inability to find a good local minimum that satisfies all constraints in an extensive search space across multiple objectives. To address these limitations, we develop heuristics inspired by CL.
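A small sketch can illustrate how equal weights let constraints cancel one another. It assumes each per-constraint score is +1 when the constraint is met and -1 otherwise, and the two four-constraint score vectors are hypothetical.

```python
def equal_weight_score(constraint_scores):
    """Mean of per-constraint scores, each assumed to be +1 (met) or -1 (missed)."""
    return sum(constraint_scores) / len(constraint_scores)

# Two hypothetical molecules scored on the same 4 constraints.
meets_first_two = [+1, +1, -1, -1]
meets_last_two = [-1, -1, +1, +1]

score_a = equal_weight_score(meets_first_two)
score_b = equal_weight_score(meets_last_two)
```

Both molecules receive the same aggregate score even though they satisfy disjoint sets of constraints, so the reward gives the agent no signal about which constraints to work on next.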

Curriculum Learning (CL)

Figure 4: Average inherent reward score of the 20 FG constraints

CL is a strategy for multi-task learning that draws parallels to how humans acquire knowledge. A sequence of subtasks is ranked in ascending order of difficulty and conducted in different phases [2]. Its success in improving learning performance can be seen in many domains, such as language modeling and pattern recognition [2, 9]. The difficulty level of subtasks is usually defined with respect to the training data. In this study, we design a measurement (difficulty score) to differentiate task difficulties and tune the agent model to satisfy a larger number of constraints over multiple tuning phases. The goal for the agent model is to generate valid SMILES satisfying a molecular mass constraint and 20 fingerprinting (FP) objectives. A fingerprinting objective is a quantitative molecular structural or functional property.

Figure 5: Agent model is trained with RL and curriculum learning

The difficulty score measures the difficulty of each constraint. For a given constraint, the difficulty score is the rate at which it is captured in the prior distribution of the pretrained prior model, as calculated with Eq 7, where the per-sequence indicator is 1 if the sequence contains the target constraint and -1 otherwise, and the average is taken over the size of the SMILES test dataset. Using 1000 SMILES strings generated by the prior model, we report the average inherent difficulty score in Fig 4. The inherent difficulty score can be interpreted as the “default” probability that a certain constraint is satisfied. For example, the score of the FG constraint for benzene is about 0.8, which means that satisfying this constraint is relatively easy since it can be generated 80% of the time. With the difficulty score, the constraints can be ranked from the easiest to the most difficult. Progressively more difficult tasks are those with a greater divergence from the inherent distribution of the prior model.
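The difficulty ranking described above might be sketched as follows. The sample strings are hypothetical, and the substring checks are toy stand-ins for real functional-group detection (which the paper performs with RDKit); only the +1/-1 scoring and the easiest-first ordering follow the text.

```python
def difficulty_score(samples, has_constraint):
    """Average of +1 (constraint present) / -1 (absent) over prior samples."""
    return sum(1 if has_constraint(s) else -1 for s in samples) / len(samples)

# Hypothetical prior samples and toy substring checks (not real chemistry).
samples = ["c1ccccc1O", "CCO", "c1ccccc1C=O", "c1ccccc1", "CC(=O)O"]
checks = {
    "aromatic_ring": lambda s: "c1ccccc1" in s,
    "carbonyl": lambda s: "=O" in s,
    "triple_bond": lambda s: "#" in s,
}

scores = {name: difficulty_score(samples, fn) for name, fn in checks.items()}
# Higher score = satisfied more often by the prior = easier; rank easiest first.
curriculum_order = sorted(scores, key=scores.get, reverse=True)
```

If the +1/-1 convention is taken literally, a mean score of m corresponds to the constraint being present in a fraction (m + 1) / 2 of the samples.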

In this work, we propose a novel approach of combining CL heuristics in an RL context. We hypothesize that the agent model can be trained more effectively to satisfy multiple constraints using this approach. This RL-based CL process is illustrated in Fig 5. There are multiple curriculum training phases. In each phase, the agent model is initialized from the prior model at the beginning, and the prior model is synchronized with the agent model at the end of the phase to retain the previously learned constraints. As training progresses through the phases, new constraints are merged with the previous ones as part of the RL reward for agent model tuning. The constraint score at each phase is updated accordingly, with a constant weight that balances the previously fine-tuned constraints against the newly added one(s).

Method and Experiment

Dataset and Pre-processing

We used the ChEMBL [7] database, which has 1.5 million syntactically valid SMILES, as our training dataset. In the ChEMBL vocabulary, the two-character symbols ‘Cl’ and ‘Br’ are replaced with the single-character tokens ‘R’ and ‘L’. A start token (‘G’) and a termination token (‘E’) are also added. Including the start and termination tokens, the vocabulary of unique characters and symbols used in the SMILES training dataset has size 87. The input to the prior is the tokenized SMILES. For example, the molecule ‘[nH]1cnc2cncnc21’ is first wrapped with the tokens ‘G’ and ‘E’, and then its tokens are mapped to an integer vector [1,78,10,82,83,82,11,82,83,82,83,82,11,10,2]. The output is the one-hot encoding of the same SMILES offset by +1. Zero post-padding is applied to the encoded input to obtain a uniform length of 140.
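The encoding step can be reproduced for the worked example in the text. In this sketch the token-to-id map is partial and read off that example only, since the full 87-token vocabulary is not given. The regular expression, which keeps bracketed atoms such as `[nH]` as single tokens, is inferred from the length of the example vector, and the halogen substitution is omitted.

```python
import re

# Partial token-to-id map, read off the worked example in the text;
# the full 87-token vocabulary is not given, so this is illustrative only.
TOKEN_IDS = {"G": 1, "E": 2, "1": 10, "2": 11, "[nH]": 78, "c": 82, "n": 83}
MAX_LENGTH = 140

def tokenize(smiles):
    """Split a SMILES string, keeping bracketed atoms such as [nH] whole."""
    return re.findall(r"\[[^\]]+\]|.", smiles)

def encode(smiles):
    tokens = ["G"] + tokenize(smiles) + ["E"]      # wrap with start/end tokens
    ids = [TOKEN_IDS[t] for t in tokens]
    return ids + [0] * (MAX_LENGTH - len(ids))     # zero post-padding

encoded = encode("[nH]1cnc2cncnc21")
```

The first 15 entries of `encoded` match the integer vector quoted in the text, and the remaining 125 entries are padding zeros.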

Training Prior Model

The prior model network architecture starts with a single embedding layer, followed by a stacked 3-layer GRU with 512 cells per layer, and ends with a single fully connected layer. The model was trained on mini-batches (size 128) for 20 epochs with early stopping. RMSprop is used as the optimizer with an initial learning rate of 0.001, and gradient values are clipped. The prior model generates the next token for a given input, advances time by one step, and recurrently extends the input with the newly generated token until it generates the terminal token ‘E’. After training, the prior model is able to generate around 98% valid SMILES.

Prospective Test Set

For the purpose of prospectively identifying unknown chemicals, we limit the search space to the chemical space relevant to biomass-derived liquids, defined as low molecular weight (MW < 200 g/mol) organic compounds containing only the elements C, H, and O. Using these criteria, we extracted a subset of 25,901 relevant chemicals from the original ChEMBL database. We then used RDKit [14] to compute descriptors for molecular weight (MW) and functional groups, represented as FG constraints, to simulate experimental characterization data from Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) sources. These descriptors are used as the constraints in our reward function.

Designing the Constraint Score

The constraint score for molecular mass (MW) is adapted from [20] and is based on the real mass calculated for a given SMILES. The other 20 FG constraints are Boolean: “True” means that the FG is desired to be present in a given SMILES, and “False” means that it is desired to be absent.

The MW and the presence of FGs can be easily evaluated with the Python package RDKit [14]. MW approximately controls the length of the SMILES string, which is a significant factor in determining the extent of the chemical space search. Therefore, we designated a constant weighting factor for MW in the curriculum reinforcement reward function. Combined with Eq 8, this gives the final constraint score at each phase.

Tuning Agent Model

We use the machine learning approach from previous work as the baseline, where the fine-tuning objective is the mean value over all constraints with equal weights and the model is fine-tuned with all constraints together without subtasks, as shown in Eq 12.

Method                                    No. of bins   No. of retrains (at each phase)   No. of phases
Baseline                                  1             0                                 1
Retrain Fine-tuning (RF_2)                1             1                                 2
Curriculum Fine-tuning (CF_2)             2             0                                 2
Curriculum Retrain Fine-tuning (CRF_2)    2             1                                 2
Retrain Fine-tuning (RF_4)                1             1                                 4
Curriculum Fine-tuning (CF_4)             4             0                                 4
Curriculum Retrain Fine-tuning (CRF_4)    4             1                                 4
Retrain Fine-tuning (RF_6)                1             1                                 6
Curriculum Fine-tuning (CF_6)             6             0                                 6
Curriculum Retrain Fine-tuning (CRF_6)    6             1                                 6

Table 1: Agent Model Fine-Tuning Approaches

With CL methods, we investigated 3 different approaches to fine-tune the agent model with RL:

  1. Retrain-based Fine-tune (RF)

  2. Curriculum Fine-tune (CF)

  3. Curriculum Retrain-based Fine-tune (CRF)

RF is an extension of the naive baseline, included for a better comparison with CL. Instead of a one-time tuning, it retrains the model over multiple phases. However, the reward function (as in Eq 12) remains the same in all phases. In the CF method, the model is trained gradually, with new constraints added sequentially at each phase without model retraining. Compared with RF, CF is designed to test the significance of separating the constraints into different difficulty levels. One observation is that the agent model's validity drops dramatically with too many tuning phases, because each round of tuning on constraints sacrifices some generation validity. To balance model performance and CL, we group the constraints into multiple bins according to their difficulty so that the number of tuning phases is reduced. A new bin of constraints is added at each phase. For example, with 2 bins there are about 10 constraints in each bin, and the constraints in the first bin are easier than those in the second. The CRF method combines and alternates between the RF and CF approaches: the constraints are introduced into the reward function in a curriculum-based manner, but after each phase the agent model is retrained once more with the same reward function. The method settings for different numbers of constraint bins (denoted by the suffix “_n”, where n is the number of bins) are summarized in Table 1.
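The binning and phase schedule of the CF method might be sketched as follows. The helper names and the constraint labels are hypothetical, and the near-equal split into consecutive bins is an assumption; the text only states that constraints are grouped by difficulty and that one new bin is added per phase.

```python
def split_into_bins(ranked_constraints, num_bins):
    """Split constraints (ranked easiest first) into near-equal consecutive bins."""
    size, extra = divmod(len(ranked_constraints), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ranked_constraints[start:end])
        start = end
    return bins

def phase_schedule(bins):
    """Constraints active in each phase: every phase adds one new bin."""
    active, schedule = [], []
    for new_bin in bins:
        active = active + new_bin
        schedule.append(active)
    return schedule

# 20 hypothetical FG constraints, already ranked from easiest to hardest.
ranked = [f"fg_{i:02d}" for i in range(20)]
schedule = phase_schedule(split_into_bins(ranked, num_bins=4))
```

With 4 bins, the phases carry 5, 10, 15, and 20 active constraints, which corresponds to the CF_4 setting in Table 1.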

As with the baseline approach, the agent is updated with mini-batch gradient descent (batch size 128) with gradient clipping. Each phase also uses early stopping: training terminates if the reward does not improve after 50 iterations, and the last checkpoint is saved.
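The early-stopping rule can be sketched as a simple patience counter. The function name and the synthetic reward sequence are assumptions for illustration; only the patience of 50 iterations comes from the text, and checkpoint saving is omitted.

```python
def train_with_early_stopping(reward_per_iteration, patience=50):
    """Stop once the reward has not improved for `patience` consecutive iterations.

    `reward_per_iteration` is a hypothetical iterable standing in for the
    reward observed after each agent update.
    """
    best_reward = float("-inf")
    stale = 0
    iterations_run = 0
    for reward in reward_per_iteration:
        iterations_run += 1
        if reward > best_reward:
            best_reward, stale = reward, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_reward, iterations_run

# Synthetic rewards: rise for 10 iterations, then plateau.
rewards = [min(i, 10) for i in range(1, 1001)]
best, stopped_at = train_with_early_stopping(rewards, patience=50)
```

Here training stops 50 iterations after the last improvement instead of running through all 1000 synthetic iterations.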

Experiment Results and Analysis

Figure 6: Agent fine-tuning performance across 5 chemical targets: (a) top-5 constraint scores; (b) top-5 average similarities. Adaptive weight refinement of the reward function improves results: (c) refined top-5 constraint scores; (d) refined top-5 average similarity improvement.

We randomly picked 5 biofuel-relevant molecules as the targets (shown in Fig 7) and formed the corresponding constraints as the framework input. We evaluated the performance of CF with 2, 4, and 6 constraint bins. Controlling for total training time and number of updates, we also evaluated RF and CRF with 2, 4, and 6 retraining phases. In total, the baseline method and 9 CF/RF/CRF variations were tested for each of the 5 selected target molecules.

We evaluate the final performance of each model with two metrics: 1) the constraint score over all desired constraints; 2) the similarity between the generated molecule and the target molecule, measured with the Tanimoto similarity [24], which can also be calculated with RDKit. The similarity score is used only as a post hoc evaluation once the model has been trained. It has no bearing on the model's constraint score function.
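For reference, the Tanimoto similarity between two binary fingerprints is the size of their intersection divided by the size of their union. The sketch below works on plain Python sets of “on” bit indices. The bit values are hypothetical, and the paper computes the metric with RDKit rather than with this helper.

```python
def tanimoto_similarity(bits_a, bits_b):
    """|A ∩ B| / |A ∪ B| for two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0          # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

# Hypothetical on-bit indices for a target and a generated molecule.
target_bits = {1, 4, 7, 9}
generated_bits = {1, 4, 8, 9}
similarity = tanimoto_similarity(target_bits, generated_bits)
```

A value of 1.0 means the two fingerprints are identical, which is necessary but not always sufficient for the molecules themselves to be identical.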

The performance measured with the total constraint score is shown in Fig 6(a). The total constraint score is calculated on 256 randomly sampled SMILES generated by the tuned agent model. For each target and each method, all generated SMILES are sorted by constraint score in descending order, and we show the 5 SMILES with the highest scores (the top-5 constraint score). An underlying assumption is that a model that generates SMILES with high constraint scores (thus satisfying most if not all constraints) will exhibit high similarity to the target molecule. The corresponding similarity to the target molecules is shown in Fig 6(b).
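The top-5 selection described above can be sketched as follows. The 256 scores are randomly generated placeholders, and summarizing the five best scores by their mean is an assumption made for this example.

```python
import heapq
import random

def top_k_mean(scores, k=5):
    """Return the k highest scores (descending) and their mean."""
    best = heapq.nlargest(k, scores)
    return best, sum(best) / len(best)

# 256 hypothetical constraint scores, capped at the maximum of 21.
rng = random.Random(0)
sampled_scores = [rng.randint(0, 21) for _ in range(256)]
top5, top5_mean = top_k_mean(sampled_scores, k=5)
```

Selecting only the best 5 of 256 samples measures what the tuned agent can reach, not its average sample quality.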

The proposed approaches all outperform the baseline approach in the top-5 constraint score, with several models achieving close to the maximum reward score of 21. Overall, the approaches with curriculum fine-tuning perform consistently well, and there is no significant difference between curriculum fine-tuning with model retraining (CRF) and without it (CF). With 4 constraint bins (5 constraints in each), the agent model showed better and more robust performance than with other bin numbers. This observation matches the earlier related finding that 5 is the upper bound on the number of equally weighted constraints that a generative model can handle in the RL reward while maintaining good performance. In addition, the approaches with 4 bins outperform the ones with more bins, which indicates that the number of phases in CL also matters for model performance. When evaluated against similarity, our proposed methods all perform better than the baseline on the first three targets.

However, there is an apparent disconnect for targets 4 and 5: even though the top-5 reward score is higher than the baseline, this does not translate into a higher similarity score. To explain this, we have two hypotheses: (1) the model gets stuck at a local optimum during the early RL phases and hence is unable to arrive at a global optimum in the last phase, i.e., the agent learns early objectives “too well” and is unable to learn subsequent objectives; (2) certain constraints are mutually exclusive. For example, we observed that the presence of the constraint "fr_allylic_oxid" limits the appearance of "fr_aldehyde" in the same molecule. The latter hypothesis is an inherent disadvantage and may be addressed by human experts. In this work, we focus on addressing the first one.

To avoid the model getting stuck at a “bad” local optimum in early phases, we explored a heuristic that reshapes the constraint function so that some difficult constraints can be addressed after the curriculum tuning, as shown in Eq 13. A weakly learned constraint is selected by comparing the ratio of its constraint score to the maximum constraint score, which here is 21, against a pre-defined threshold. If there is more than one weakly learned constraint, the average value over all weakly learned constraints is used instead. Using this heuristic, we refined the CRF_2 and CF_4 models on targets 4 and 5 with the reshaped constraint function, and we add ‘_R’ in the middle of the name to denote the refinement approach. For example, ‘CF_R_4’ denotes the refined CL heuristic with 4 bins. Using this refined approach, the resulting top-5 constraint score comes close to the maximum value (shown in Fig 6(c)). Furthermore, we also observed that similarity to the targets was significantly improved (Fig 6(d)).

Figure 7: Agent-generated molecules compared with targets

Lastly, we visualized the top-5 generated SMILES and compared them to the target molecule (summarized in Fig 7), with the similarity labeled below. The generated SMILES correctly identified the unknown targets 2, 3, and 5. The identification of target 1 was sufficiently close, as structural isomers of the target were generated. We note that the prediction for target 4 is not the same as its target, and we hypothesize that this is because there may not be sufficient or appropriate constraints for that structure, so the search space remains large (i.e., it is an underdetermined search problem).


In conclusion, building on earlier work [21, 23, 20] on RL-based generative RNN models, we extend such methods to include multiple objectives in the reward function. In this study, we proposed and evaluated several curriculum-based reinforcement learning heuristics and showed that at least 21 different objectives, incorporating both MW and FG constraints, can be optimized simultaneously.

The naive approach of an equally weighted reinforcement learning heuristic with single-pass training achieves poor results, both in its ability to achieve the maximum reward score and in similarity to the unknown target. In general, our results indicate that a curriculum-learning approach consistently outperforms the baseline. However, there are complications associated with having too many learning phases, possibly because the model gets trapped in a bad local optimum at the beginning and so is unable to learn later objectives effectively. To address these limitations, we developed further heuristics that adaptively overweight unlearned objectives. Our results show that the maximum top-5 reward score can be obtained for all five molecules tested. In addition, a maximum top-5 reward score also translates well to chemical similarity, as 4 out of 5 molecules were identified correctly. Lastly, the curriculum-learning based heuristics developed in this work are also a significant improvement over the baseline model, which does not identify a single molecule correctly.


  • [1] J. Arús-Pous, T. Blaschke, S. Ulander, J. Reymond, H. Chen, and O. Engkvist (2019) Exploring the GDB-13 chemical space using deep generative models. Journal of Cheminformatics 11 (1), pp. 20.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48.
  • [3] K. Bingol (2018) Recent advances in targeted and untargeted metabolomics by NMR and MS/NMR methods. High-Throughput 7 (2), pp. 9.
  • [4] R. M. Boiteau, D. W. Hoyt, C. D. Nicora, H. A. Kinmonth-Schultz, J. K. Ward, and K. Bingol (2018) Structure elucidation of unknown metabolites in metabolomics by combined NMR and MS/MS prediction. Metabolites 8 (1), pp. 8.
  • [5] H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song (2018) Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786.
  • [6] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel (2017) Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300.
  • [7] A. Gaulton, A. Hersey, M. Nowotka, A. P. Bento, J. Chambers, D. Mendez, P. Mutowo, F. Atkinson, L. J. Bellis, E. Cibrián-Uhalte, et al. (2016) The ChEMBL database in 2017. Nucleic Acids Research 45 (D1), pp. D945–D954.
  • [8] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science 4 (2), pp. 268–276.
  • [9] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003.
  • [10] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, and A. Aspuru-Guzik (2017) Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843.
  • [11] A. Kadurin, A. Aliper, A. Kazennov, P. Mamoshina, Q. Vanhaelen, K. Khrabrov, and A. Zhavoronkov (2017) The cornucopia of meaningful leads: applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget 8 (7), pp. 10883.
  • [12] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • [13] S. Kullback and R. A. Leibler (1951) On information and sufficiency. The Annals of Mathematical Statistics 22 (1), pp. 79–86.
  • [14] G. Landrum et al. (2006) RDKit: open-source cheminformatics. Cited by: Prospective Test Set, Designing the Constraint Score.
  • [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: Related Work.
  • [16] J. Lim, S. Ryu, J. W. Kim, and W. Y. Kim (2018) Molecular generative model based on conditional variational autoencoder for de novo molecular design. arXiv preprint arXiv:1806.05805. Cited by: Related Work, Related Work, Tuning agent model with RL.
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: Related Work.
  • [18] S. Narvekar and P. Stone (2019) Learning curriculum policies for reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 25–33. Cited by: Related Work.
  • [19] S. Narvekar (2017) Curriculum learning in reinforcement learning.. In IJCAI, pp. 5195–5196. Cited by: Related Work.
  • [20] D. Neil, M. Segler, L. Guasch, M. Ahmed, D. Plumbley, M. Sellwood, and N. Brown (2018) Exploring deep recurrent models with reinforcement learning for molecule design. Cited by: Designing the Constraint Score, Conclusion.
  • [21] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen (2017) Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics 9 (1), pp. 48. Cited by: Introduction, Related Work, Related Work, Prior Model, Conclusion.
  • [22] M. Popova, O. Isayev, and A. Tropsha (2018) Deep reinforcement learning for de novo drug design. Science advances 4 (7), pp. eaap7885. Cited by: Related Work, Related Work, Prior Model.
  • [23] M. H. Segler, T. Kogej, C. Tyrchan, and M. P. Waller (2017) Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS central science 4 (1), pp. 120–131. Cited by: Conclusion.
  • [24] T. T. Tanimoto (1958) Elementary mathematical theory of classification and prediction. Cited by: Experiment Results and Analysis.
  • [25] D. Weininger, A. Weininger, and J. L. Weininger (1989) SMILES. 2. algorithm for generation of unique smiles notation. Journal of Chemical Information and Computer Sciences 29 (2), pp. 97–101. Cited by: Prior Model.
  • [26] Z. Zhou, S. Kearnes, L. Li, R. N. Zare, and P. Riley (2019) Optimization of molecules via deep reinforcement learning. Scientific reports 9 (1), pp. 10752. Cited by: Related Work.
  • [27] Z. Zhou, X. Li, and R. N. Zare (2017) Optimizing chemical reactions with deep reinforcement learning. ACS central science 3 (12), pp. 1337–1344. Cited by: Related Work.