Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models

by Gabriel Lima Guimaraes, et al.

In unsupervised data generation tasks, besides the generation of a sample based on previous observations, one would often like to give hints to the model in order to bias the generation towards desirable metrics. We propose a method that combines Generative Adversarial Networks (GANs) and reinforcement learning (RL) in order to accomplish exactly that. While RL biases the data generation process towards arbitrary metrics, the GAN component of the reward function ensures that the model still remembers information learned from data. We build upon previous results that incorporated GANs and RL in order to generate sequence data and test this model in several settings for the generation of molecules encoded as text sequences (SMILES) and in the context of music generation, showing for each case that we can effectively bias the generation process towards desired metrics.



1 Introduction

Unsupervised generation of data is a dynamic area of machine learning and a very active research frontier in areas ranging from language processing and music generation to materials and drug discovery.

In any of these fields, it is often advantageous to guide the generative model towards some desirable characteristics, while ensuring that the samples resemble the initial distribution. In music generation, for example, it might be expected that pleasant melodic patterns prevail over more dissonant ones [Jaques et al.2016]. In natural language processing, a given sentiment might be emphasized, for instance when producing movie reviews [Radford et al.2017]. Finally, in materials discovery, the aim is often to optimize some properties for a particular application, for example in organic solar cells [Hachmann et al.2011], OLEDs [Gómez-Bombarelli et al.2016a] or new drugs.

The generation of discrete data using Recurrent Neural Networks (RNNs), in particular Long Short-Term Memory cells [Hochreiter and Schmidhuber1997], trained with maximum likelihood estimation has been shown to work well in practice. However, this approach often suffers from the so-called exposure bias, and might lack some of the multi-scale structures or salient features of the data. Meanwhile, Generative Adversarial Networks (GANs) [Goodfellow et al.2014] are an approach in which a generative model competes against a discriminative model, one trying to generate likely data while the other tries to distinguish false from real data. GANs have shown remarkable results at generating data that imitates a data distribution; however, they can suffer from several issues, among them mode collapse [Arjovsky and Bottou2017], where the generator learns to produce samples with low variety.

Although GANs were not initially applicable to discrete data due to non-differentiability, approaches such as SeqGAN [Yu et al.2017], MaliGAN [Che et al.2017] and BGAN [Hjelm et al.2017] have arisen to deal with this issue.

Furthermore, methods from Reinforcement Learning (RL) have shown great success at solving problems where continuous feedback from an environment is needed [Hjelm et al.2017].

In this paper, we introduce a novel approach to optimize the properties of a distribution of sequences and increase the diversity of the samples while maintaining the likeliness of the data distribution. In our approach, the generator is trained to maximize a weighted average of two types of rewards: the objective, given by domain-specific metrics, and the discriminator, which is trained along with the generator in an adversarial fashion. While the objective component of the reward function ensures that the model selects for traits that maximize the specified heuristic, the discriminator incentivizes the samples to stay within the boundaries of the initial data distribution. Diversity is additionally promoted by reducing rewards of non-unique and less diverse sequences.

In order to implement the above idea, we build on SeqGAN, a recent work that successfully combines GANs and RL to apply the GAN framework to sequential data [Yu et al.2017] and extend it towards domain-specific rewards. To increase the stability of the adversarial training, we test Wasserstein-GANs [Arjovsky et al.2017] in this framework.

We test our model in the context of molecular and music generation, optimizing several domain-specific metrics. Our results show that ORGAN is able to tune the quality and structure of samples. We compare our results with maximum likelihood estimation (MLE), SeqGAN and an RL approach.

2 Related work

Previous work has relied on specific modifications of the objective function to reach the desired properties. For example, [Jaques et al.2016] introduce penalties for unrealistic sequences, in the absence of which RL can easily get stuck around local maxima that can be very far from the global maximum reward. Related applications by [Ranzato et al.2015] and [Li et al.2016] apply reinforcement learning to sequence generation in an NLP setting.

In the last two years, many methodologies have been proposed for de novo molecular generation. [Ertl et al.2017] and [Segler et al.2017] trained recurrent neural networks to generate drug-like molecules. [Gómez-Bombarelli et al.2016b] employed a variational autoencoder to build a latent, continuous space where property optimization can be made through surrogate optimization. Finally, [Kadurin et al.2017] presented a GAN model for drug generation. Additionally, the approach presented in this paper has recently been applied to molecular design [Sanchez-Lengeling et al.2017].

In the field of music generation, [Lee et al.2017] built a SeqGAN model employing an efficient representation of multi-channel MIDI to generate polyphonic music. [Chen et al.2017] presented Fusion GAN, a dual-learning GAN model that can fuse two data distributions. [Jaques et al.2017] employ deep Q-learning with a cross-entropy reward to optimize the quality of melodies generated from an RNN.

In adversarial training, [Pfau and Vinyals2016] recontextualize GANs in the actor-critic setting. This connection is also explored with the Wasserstein-1 distance in WGANs [Arjovsky et al.2017]. Minibatch discrimination and feature mapping were used to promote diversity in GANs [Salimans et al.2016]. Another approach to avoid mode collapse was shown with Unrolled GANs [Metz et al.2016]. Issues with, and the convergence of, GANs have been studied in [Mescheder et al.2017].

3 Background

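In this section, we elaborate on the GAN and RL setting based on SeqGAN [Yu et al.2017].

$G_\theta$ is a generator parametrized by $\theta$, that is trained to produce high-quality sequences $Y_{1:T} = (y_1, \dots, y_T)$ of length $T$, and $D_\phi$ is a discriminator model parametrized by $\phi$, trained to classify real and generated sequences. $G_\theta$ is trained to deceive $D_\phi$, and $D_\phi$ to classify correctly. Both models are trained in alternation, following a minimax game:

$$\min_\theta \max_\phi \; \mathbb{E}_{Y \sim p_{data}(Y)}\left[\log D_\phi(Y)\right] + \mathbb{E}_{Y \sim p_{G_\theta}(Y)}\left[\log\left(1 - D_\phi(Y)\right)\right]$$

For discrete data, the sampling process is not differentiable. However, $G_\theta$ can be trained as an agent in a reinforcement learning context using the REINFORCE algorithm [Williams1992]. Let $R(Y_{1:T})$ be the reward function defined for full length sequences. Given an incomplete sequence $Y_{1:t}$, also to be referred to as state $s_t$, $G_\theta$ must produce an action $a$, along with the next token $y_{t+1}$.

The agent’s stochastic policy is given by $G_\theta(y_t \mid Y_{1:t-1})$ and we wish to maximize its expected long term reward

$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_0, \theta\right] = \sum_{y_1 \in Y} G_\theta(y_1 \mid s_0) \cdot Q(s_0, y_1)$$

where $s_0$ is a fixed initial state. $Q(s, a)$ is the action-value function that represents the expected reward at state $s$ of taking action $a$ and following our current policy $G_\theta$ to complete the rest of the sequence. For any full sequence $Y_{1:T}$, we have $Q(s = Y_{1:T-1}, a = y_T) = R(Y_{1:T})$, but we also wish to calculate $Q$ for partial sequences at intermediate timesteps, considering the expected future reward when the sequence is completed. In order to do so, we perform $N$-time Monte Carlo search with the canonical rollout policy $G_\theta$ represented as

$$\mathrm{MC}^{G_\theta}(Y_{1:t}; N) = \{Y^1_{1:T}, \dots, Y^N_{1:T}\}$$

where $Y^n_{1:t} = Y_{1:t}$ and $Y^n_{t+1:T}$ is stochastically sampled via the policy $G_\theta$. Now $Q(s, a)$ becomes

$$Q(Y_{1:t-1}, y_t) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y^n_{1:T}), \quad Y^n_{1:T} \in \mathrm{MC}^{G_\theta}(Y_{1:t}; N), & \text{if } t < T \\ R(Y_{1:T}), & \text{if } t = T \end{cases}$$

An unbiased estimation of the gradient of $J(\theta)$ can be derived as

$$\nabla_\theta J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_t \sim G_\theta(y_t \mid Y_{1:t-1})}\left[\nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_t)\right]$$

Finally, in SeqGAN the reward function is provided by $D_\phi$.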
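As a rough illustration of the Monte Carlo search described above, the following sketch estimates the action value of a partial sequence by completing it several times with a rollout policy and averaging the rewards of the completed sequences. The toy vocabulary, the uniform policy and the `carbon_fraction` reward are hypothetical stand-ins chosen only to keep the example self-contained; they are not the generator or reward used in the paper.

```python
import random

def rollout_q(prefix, policy, reward, seq_len, n_rollouts, rng):
    """Estimate the action value of a partial sequence.

    Full-length sequences are scored directly; shorter prefixes are
    completed n_rollouts times with the policy and the rewards averaged.
    """
    if len(prefix) == seq_len:
        return reward(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:
            seq.append(policy(seq, rng))  # sample the next token
        total += reward(seq)
    return total / n_rollouts

# Hypothetical toy setting (assumption, not from the paper).
VOCAB = ["C", "O", "N"]

def uniform_policy(prefix, rng):
    """Stand-in rollout policy: ignores the prefix, samples uniformly."""
    return rng.choice(VOCAB)

def carbon_fraction(seq):
    """Stand-in reward: fraction of 'C' tokens in a complete sequence."""
    return seq.count("C") / len(seq)

rng = random.Random(0)
q_full = rollout_q(list("CCCC"), uniform_policy, carbon_fraction, 4, 8, rng)
q_partial = rollout_q(list("CC"), uniform_policy, carbon_fraction, 4, 200, rng)
```

In a policy-gradient update, an estimate of this kind would weight the gradient of the log-probability of the token chosen at each timestep.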

4 ORGAN

Figure 1: Schema for ORGAN. Left: $D_\phi$ is trained as a classifier receiving as input a mix of real data and data generated by $G_\theta$. Right: $G_\theta$ is trained by RL where the reward is a combination of $D_\phi$ and the objectives, and is passed back to the policy function via Monte Carlo sampling. We penalize non-unique sequences.

Figure 1 illustrates the main idea of ORGAN. To take into account domain-specific desired objectives $O_i$, we extend the reward function for a particular sequence $Y_{1:T}$ to a linear combination of $D_\phi$ and $O_i$, parametrized by $\lambda$:

$$R(Y_{1:T}) = \lambda \cdot D_\phi(Y_{1:T}) + (1 - \lambda) \cdot O_i(Y_{1:T})$$

If $\lambda = 0$ the model ignores $D_\phi$ and becomes a "naive" RL algorithm, whereas if $\lambda = 1$ it is simply a SeqGAN model. It should be noted that, if chosen, the objective function can vary based on the current iteration of adversarial training, leading to alternating rewards between several objectives and the discriminator.
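The linear combination of discriminator and objective rewards can be sketched in a few lines. The function name `organ_reward` and the argument names are illustrative, and the sketch assumes that the weight `lam` multiplies the discriminator term, so that `lam = 0` yields the naive RL reward and `lam = 1` the SeqGAN reward.

```python
def organ_reward(d_score, objective_score, lam=0.5):
    """Weighted average of a discriminator score and an objective score.

    Assumption: both scores lie in [0, 1] and lam weights the
    discriminator, so lam=0 ignores it and lam=1 ignores the objective.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * d_score + (1.0 - lam) * objective_score

naive_rl = organ_reward(0.8, 0.2, lam=0.0)   # objective only
seqgan = organ_reward(0.8, 0.2, lam=1.0)     # discriminator only
balanced = organ_reward(0.8, 0.2, lam=0.5)   # equal weighting
```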

An additional mechanism to prevent mode collapse is to penalize non-unique sequences by dividing the reward of a repeated sequence by its number of copies. The more a sequence gets repeated, the more its reward diminishes. Alternatively, domain-specific similarity metrics could be used as a penalty.
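A minimal sketch of the duplicate penalty follows, assuming the count is taken over the current batch of generated sequences (the scope of the count is an assumption; the helper name `penalize_duplicates` is illustrative).

```python
from collections import Counter

def penalize_duplicates(sequences, rewards):
    """Divide each reward by the number of copies of its sequence in the batch."""
    counts = Counter(sequences)
    return [r / counts[s] for s, r in zip(sequences, rewards)]

batch = ["CCO", "CCO", "N"]
penalized = penalize_duplicates(batch, [1.0, 1.0, 0.6])
```

Here the two copies of the same string each keep half of their reward, while the unique string is unaffected.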

To improve the stability of learning and avoid GAN convergence problems such as the "perfect discriminator", we also implemented the Wasserstein-1 distance, also known as the earth mover’s distance, for $D_\phi$ [Arjovsky et al.2017]. Although the computation of this distance is intractable due to an infimum, it can be transformed via the Kantorovich-Rubinstein duality:

$$W(p_{data}, p_{G_\theta}) = \frac{1}{K} \sup_{\|D_\phi\|_L \le K} \mathbb{E}_{Y \sim p_{data}}\left[D_\phi(Y)\right] - \mathbb{E}_{Y \sim p_{G_\theta}}\left[D_\phi(Y)\right]$$

Under $W$, $D_\phi$ is no longer meant to classify data samples, but is now trained to convergence to learn $\phi$ such that $D_\phi$ is K-Lipschitz continuous and can be used to compute the Wasserstein distance, intuitively the cost of moving the generated distribution to the data distribution. In this context, $D_\phi$ can now be considered as a critic in an actor-critic setting.
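To make the dual form more concrete, the sketch below computes the empirical quantity a critic would maximize: the mean critic score on real samples minus the mean score on generated samples. The weight-clipping helper follows the heuristic proposed in the WGAN paper for keeping the critic approximately Lipschitz; whether ORGAN uses clipping, and with which threshold, is an assumption here.

```python
def wasserstein_estimate(critic_real, critic_fake):
    """Empirical mean critic score on real data minus that on generated data."""
    mean_real = sum(critic_real) / len(critic_real)
    mean_fake = sum(critic_fake) / len(critic_fake)
    return mean_real - mean_fake

def clip_weights(weights, c=0.01):
    """Clamp every weight to [-c, c] (WGAN-style heuristic; c is an assumed value)."""
    return [max(-c, min(c, w)) for w in weights]

gap = wasserstein_estimate([1.0, 3.0], [0.0, 2.0])
clipped = clip_weights([0.5, -0.5, 0.001])
```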

4.1 Implementation Details

$G_\theta$ is an RNN with LSTM cells, while $D_\phi$ is a Convolutional Neural Network (CNN) designed specifically for text classification tasks [Kim2014].

To avoid over-fitting with the CNN, we optimized its architecture on a classification task between different datasets for each experiment. In the molecule generation task, we utilized a set of drug-like and non-drug-like molecules from the ZINC database [Irwin and Shoichet2005]. In the music task, we discriminated between a set of folk and videogame tunes scraped from the internet. We utilize a dropout layer and also regularization on the network weights. All the gradient descent steps are done using the Adam algorithm [Kingma and Ba2014].

Molecular metrics are implemented using the RDKit chemoinformatics package [Landrum2016]. Music metrics employ the MIDI frequencies. The code for ORGAN, including metrics for each experiment, can be found at http://github.com/gablg1/ORGAN (repo soon to be updated, May’18).

5 Experimental results

In this section, we will test the performance of ORGAN in two scenarios: the generation of molecules encoded as text sequences and musical melodies. Our objective is to show that ORGAN can generate samples that fulfill some desired objectives while promoting diversity. For purposes of interpretation, the range of each objective has been mapped to the $[0, 1]$ range, where $0$ corresponds to an undesirable property and $1$ to a very desirable property. Each generator model was pre-trained for 250 epochs using MLE, and the discriminator was trained for 10 epochs.

To measure diversity we use domain-specific measures. In both fields, there are multiple ways of quantifying the notion of diversity, so we opted for the more widely used metrics.

We compare ORGAN and the Wasserstein variant (OR(W)GAN) with three other methods of training RNNs: SeqGAN, Naive RL, and Maximum Likelihood Estimation (MLE). Unless specified, $\lambda$ is assumed to be 0.5. All training methods involve a pre-training step of 250 epochs of MLE for $G_\theta$, and 10 epochs for $D_\phi$. The MLE baseline simply stops right after pre-training, while the other methods proceed to further train the model using the different approaches, up to 100 epochs.

For each dataset, we first build a dictionary mapping the vocabulary, that is, the set of all characters present in the dataset, to integers. The dataset is then preprocessed by transforming each sequence into a fixed-size integer sequence of length $T$, where $T$ is the maximum length of a string present in the dataset (in the case of molecules, along with some additional characters to increase flexibility and allow generation of larger samples of data). Every string with a length smaller than $T$ is padded with “_” characters. Thus the input to our model becomes a list of fixed-size integer sequences.
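The preprocessing described above can be sketched as follows. The helper names `build_vocab` and `encode` are illustrative, and the integer assigned to each character (sorted order) is an arbitrary choice rather than the mapping used in the released code.

```python
PAD = "_"

def build_vocab(strings):
    """Map every character in the dataset, plus the padding symbol, to an integer."""
    chars = sorted(set("".join(strings)) | {PAD})
    return {ch: i for i, ch in enumerate(chars)}

def encode(strings, vocab, length=None):
    """Pad each string to a fixed length and convert it to a list of integers."""
    if length is None:
        length = max(len(s) for s in strings)  # longest string in the dataset
    return [[vocab[ch] for ch in s.ljust(length, PAD)] for s in strings]

data = ["CCO", "C1=CC=CC=C1"]
vocab = build_vocab(data)
encoded = encode(data, vocab)
```

Passing a larger `length` leaves room for the model to generate sequences longer than any training example.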

5.1 Experiment: Molecules

Here we test the effectiveness of ORGAN for generating molecules with desirable properties in a pharmaceutical context of drug discovery.

Molecules can be encoded as text sequences by using the SMILES representation [Weininger1988] of a molecule. This representation encodes the topological information of a molecule based on common chemical bonding rules. For example, the 6-carbon ringed molecule benzene can be encoded as ’C1=CC=CC=C1’. Each C represents a carbon atom, the ’=’ symbolizes a double bond and ’1’ marks the start and closing of a cycle/ring; hydrogen atoms can be deduced via simple rules.

The SMILES representation has predefined grammar rules, and as such, it is possible to have invalid expressions that cannot be decoded back to a valid molecule. Therefore, a desired property of a generative algorithm is a high percentage of valid expressions. Invalid expressions get penalized. Additionally, we also penalize the generation of duplicate molecules.

Recent generative models ([Gómez-Bombarelli et al.2016a], [Kusner et al.2017]) have reported a range of valid expression rates. It should be noted that there are common, uninteresting ways to generate valid expressions by combining "C" and "O" characters, such as ’CCCCCCCC’ and ’COCCCCOC’; the number of combinatorial possibilities of such permutations is already huge.

For training, we utilized a random subset of 5k molecules from the set of 134 thousand stable small molecules [Ramakrishnan et al.2014]. This is a subset of all molecules with up to nine heavy atoms (C, O, N, F) out of the GDB-17 universe of 166 billion organic molecules [Ramakrishnan et al.2014]. The maximum sequence length is 51 and the alphabet size is 43.

When choosing objectives we picked qualities that are normally desired for small molecule drug discovery:

Solubility.

A property that measures how likely a molecule is able to mix with water, also known as the water-octanol partition coefficient (LogP). Computed via RDKit’s Crippen function [Landrum2016].

Synthesizability.

Estimates how hard (0) or how easy (1) it is to synthesize a given molecule [Ertl and Schuffenhauer2009].

Druglikeliness.

How likely a molecule is a viable candidate for a drug, an estimate that captures the abstract notion of aesthetics in medicinal chemistry [Bickerton et al.2012]. This property is correlated to the previous two metrics.

To estimate the diversity of our generated samples we can utilize the notion of molecular similarity to construct a measure of how similar or dissimilar a molecule is with respect to a dataset. This measure is based on molecular fingerprints and their Jaccard distance [Sanchez-Lengeling et al.2017]. More concretely, Diversity measures the average dissimilarity of a molecule with respect to a set, in this case, a random subset of molecules from the training set. A value of 1 would indicate the molecule is likely to be considered a diverse member of this set, while 0 would indicate it has many repeated sub-structures with respect to the set.
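The idea behind this diversity measure can be illustrated with plain Python sets. In the sketch below, character bigrams of the SMILES string serve as a crude stand-in for molecular fingerprints; this substitution is an assumption made only to keep the example free of external dependencies, and a chemically meaningful implementation would use fingerprints from a cheminformatics package.

```python
def ngram_fingerprint(smiles, n=2):
    """Stand-in fingerprint: the set of character n-grams in a SMILES string."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}

def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 0 for identical sets, 1 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def diversity(smiles, reference_set, n=2):
    """Average Jaccard distance between one molecule and a reference set."""
    fp = ngram_fingerprint(smiles, n)
    dists = [jaccard_distance(fp, ngram_fingerprint(r, n)) for r in reference_set]
    return sum(dists) / len(dists)

same = diversity("CCCC", ["CCCC"])
different = diversity("NNNN", ["CCCC", "CCOC"])
mixed = diversity("CCOC", ["CCCC", "NNNN"])
```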

| Objective | Algorithm | Validity (%) | Diversity | Druglikeliness | Synthesizability | Solubility |
|---|---|---|---|---|---|---|
| | MLE | 75.9 | 0.64 | 0.48 (0%) | 0.23 (0%) | 0.30 (0%) |
| | SeqGAN | 80.3 | 0.61 | 0.49 (2%) | 0.25 (6%) | 0.31 (3%) |
| Druglikeliness | ORGAN | 88.2 | 0.55 | 0.52 (8%) * | 0.32 (38%) | 0.35 (18%) |
| Druglikeliness | OR(W)GAN | 85.0 | 0.95 | 0.60 (25%) * | 0.54 (130%) | 0.47 (57%) |
| Druglikeliness | Naive RL | 97.1 | 0.8 | 0.57 (19%) * | 0.53 (126%) | 0.50 (67%) |
| Synthesizability | ORGAN | 96.5 | 0.92 | 0.51 (6%) | 0.83 (255%) * | 0.45 (52%) |
| Synthesizability | OR(W)GAN | 97.6 | 1.00 | 0.20 (-59%) | 0.75 (223%) * | 0.84 (184%) |
| Synthesizability | Naive RL | 97.7 | 0.96 | 0.52 (8%) | 0.83 (256%) * | 0.46 (54%) |
| Solubility | ORGAN | 94.7 | 0.76 | 0.50 (4%) | 0.63 (171%) | 0.55 (85%) * |
| Solubility | OR(W)GAN | 94.1 | 0.90 | 0.42 (-12%) | 0.66 (185%) | 0.54 (81%) * |
| Solubility | Naive RL | 92.7 | 0.75 | 0.49 (3%) | 0.70 (200%) | 0.78 (162%) * |
| All/Alternated | ORGAN | 96.1 | 92.3 | 0.52 (9%) * | 0.71 (206%) * | 0.53 (79%) * |

Table 1: Evaluation of metrics on several generative algorithms, optimized for different objectives, for molecules. Reported values are mean values over valid generated molecules. The percentage of improvement over the MLE baseline is reported in parentheses. Values shown in bold indicate significant improvement. Cells marked with * (shaded in the original) indicate directly optimized objectives.

Table 1 shows quantitative results comparing ORGAN to other methods in three different optimization scenarios. MLE and SeqGAN are able to capture the distribution of properties of the training set with minimal alteration in their metrics. By contrast, the metric-optimized methods scored above the non-optimized methods on nearly all metrics, effectively showing that they are able to bias the generation process. The Wasserstein variant of ORGAN also seemed to give better diversity properties.

In our experiments, we also noted that naive RL has different failure scenarios. For instance, this approach excelled particularly in the task of Solubility, because this particular task positively rewards very simple sequences, such as the single-atom molecule “N” or monotonous patterns like “CCCCCCC” or “CCOCOCCCC”. For the other approaches, the GAN/WGAN setting seems to enforce more diversity and so punishes these types of patterns, providing highly soluble molecules with more complex features.

Capacity ceiling

We noticed a capacity ceiling in our generation tasks, which took two forms. First, the GAN models tended to generate sequences that had the same average sequence length as the training set (15.42). With RL we did not observe this constraint: the average length either went quite low with synthesizability (9.4) or high with druglikeliness (21.3). This might be advantageous or detrimental, depending on the setting. Optimizing a property that relates to sequence length, for example molecular size, might change this.

The other ceiling is illustrated in Figure 2, where the upper limits in Druglikeliness for the data and the best performing approach match. While OR(W)GAN tends to generate more druglike molecules, these do not reach the highest value of 1. This might be property and dataset dependent.

Figure 2: Violin plots of Druglikeliness for molecules from the baseline dataset (n=5000) and optimized OR(W)GAN (n=5440).

Multi-objective training programs

We also experimented with alternating objectives during training. Training each objective for one epoch in rotation, up to 99 epochs (33 epochs per objective), yields the results in Figure 3.

Surprisingly, when alternating the objectives, as seen in the last row of Table 1, the gains in each metric are quite high and almost comparable with the best models trained on each objective individually. However, the slightly fluctuating behavior of the graphs suggests that there might be limits to the gains that can be achieved. Further work is warranted in this direction.
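The rotation scheme can be written as a simple round-robin schedule. The order in which the three objectives are visited is an assumption; the text only specifies that each objective is trained for one epoch at a time.

```python
from collections import Counter

# Assumed rotation order; the paper does not state which objective comes first.
OBJECTIVES = ["druglikeliness", "synthesizability", "solubility"]

def objective_for_epoch(epoch, objectives=OBJECTIVES):
    """Return the objective used as reward in a given (0-indexed) epoch."""
    return objectives[epoch % len(objectives)]

schedule = [objective_for_epoch(e) for e in range(99)]
epochs_per_objective = Counter(schedule)
```

With 99 epochs and three objectives, every objective is selected exactly 33 times.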

Figure 3: Plots of each objective across the training epochs. Objectives were trained for one epoch, and then switched for another.

5.2 Experiment: Musical melodies

To further demonstrate the applicability of ORGAN, we extend our study to music sequences. We employ the notation introduced by [Jaques et al.2017], where each token corresponds to a sixteenth of a bar of music. The first two tokens are reserved as 0, which is silent, and 1, which means no event; the other 36 tokens encode three octaves of music, from C3 (MIDI pitch 48) to B5. We use a 1k random sample from the Essen Associative Code (EsAC) folk dataset as processed by [Chen et al.2017], where every melody has a duration of 36 tokens (2.25 music bars). We generate songs optimizing two different metrics:
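The token scheme above can be decoded with a small helper. The sketch assumes that the 36 pitch tokens are numbered consecutively from 2, so that token 2 is C3 (MIDI pitch 48) and token 37 is B5 (MIDI pitch 83); the function name is illustrative.

```python
SILENCE, NO_EVENT = 0, 1   # the two reserved tokens
LOWEST_PITCH = 48          # C3
NUM_PITCHES = 36           # three octaves

def token_to_midi(token):
    """Return the MIDI pitch for a pitch token, or None for the reserved tokens."""
    if token in (SILENCE, NO_EVENT):
        return None
    if not 2 <= token < 2 + NUM_PITCHES:
        raise ValueError(f"token out of range: {token}")
    return LOWEST_PITCH + (token - 2)

lowest = token_to_midi(2)
highest = token_to_midi(37)
bars = 36 / 16  # 36 tokens at sixteen tokens per bar
```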


Tonality.

This measures how many perfect fifths are in the music that is generated. A perfect fifth is defined as a musical interval whose frequencies have a ratio of approximately 3:2. Perfect fifths produce what are generally considered pleasant note sequences due to their high consonance.

Ratio of Steps.

A step is an interval between two consecutive notes of a scale. An interval from C to D, for example, is a step. A skip, on the other hand, is a longer interval. An interval from C to G, for example, is a skip. By maximizing the ratio of steps in our music, we are adhering to the conjunct melodic motion. Our rationale here is that by increasing the number of steps in our songs we make our melodic leaps rarer and more memorable [Bonds2013].
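One plausible way to compute the two melody metrics from a list of MIDI pitches is sketched below. It assumes that a perfect fifth corresponds to an interval of 7 semitones between consecutive notes, that a step is an interval of 1 or 2 semitones, and that both metrics are normalized by the number of intervals; the released code may define them differently.

```python
def intervals(pitches):
    """Absolute semitone distances between consecutive notes."""
    return [abs(b - a) for a, b in zip(pitches, pitches[1:])]

def tonality(pitches):
    """Fraction of consecutive intervals that are perfect fifths (7 semitones)."""
    ivs = intervals(pitches)
    if not ivs:
        return 0.0
    return sum(1 for iv in ivs if iv == 7) / len(ivs)

def ratio_of_steps(pitches):
    """Fraction of consecutive intervals that are steps (1 or 2 semitones)."""
    ivs = intervals(pitches)
    if not ivs:
        return 0.0
    return sum(1 for iv in ivs if iv in (1, 2)) / len(ivs)

melody = [60, 67, 60, 62, 64]  # C4, G4, C4, D4, E4
fifths = tonality(melody)
steps = ratio_of_steps(melody)
```

Because an interval cannot be both a fifth and a step under these definitions, raising one ratio necessarily leaves less room for the other.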

Moreover, we calculate diversity as the average pairwise edit distance of the generated data [Habrard et al.2008]. We do not attempt to maximize this metric explicitly, but we keep track of it to shed light on the trade-off between metric optimization and sample diversity in the ORGAN framework. Table 2 shows quantitative results comparing ORGAN to other baseline methods across three metrics. ORGAN outperforms SeqGAN and MLE in all three metrics. Naive RL achieves a higher score than ORGAN for the Ratio of Steps metric, but it under-performs in terms of diversity, as Naive RL would likely generate very simple rather than diverse songs. In this sense, similar to the molecule case, although the Naive RL ratio of steps score is higher than ORGAN’s, the actual generated songs can be deemed much less interesting.
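A minimal sketch of an average pairwise edit distance is given below, using the standard Levenshtein distance. Dividing each distance by the length of the longer sequence, so that the result lies in [0, 1], is an assumption; the learned edit distances of the cited work are not reproduced here.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def pairwise_edit_diversity(seqs):
    """Mean length-normalized edit distance over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    total = sum(edit_distance(a, b) / max(len(a), len(b), 1) for a, b in pairs)
    return total / len(pairs)

identical = pairwise_edit_diversity([[1, 2, 3], [1, 2, 3]])
disjoint = pairwise_edit_diversity([[1, 2], [3, 4]])
```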

| Objective | Algorithm | Diversity | Tonality | Ratio of Steps |
|---|---|---|---|---|
| | MLE | 0.221 | 0.007 | 0.010 |
| | SeqGAN | 0.187 | 0.005 | 0.010 |
| Tonality | Naive RL | 0.100 | 0.478 * | 2.9E-05 |
| Tonality | ORGAN | 0.268 | 0.372 * | 1.78E-04 |
| Tonality | OR(W)GAN | 0.268 | 0.177 * | 2.4E-04 |
| Ratio of Steps | Naive RL | 0.321 | 0.001 | 0.829 * |
| Ratio of Steps | ORGAN | 0.433 | 0.001 | 0.632 * |
| Ratio of Steps | OR(W)GAN | 0.134 | 5.95E-05 | 0.622 * |

Table 2: Evaluation of metrics on several generative algorithms, optimized for different objectives, for melodies. Each measure is averaged over a set of 1000 generated songs. Values shown in bold indicate significant improvement over the MLE baseline. Cells marked with * (shaded in the original) indicate directly optimized objectives.

We note that the Ratio of Steps and Tonality have an inverse relationship. This is because two consecutive notes that qualify as a step do not have the frequency ratio of a perfect fifth, which is what increases tonality. In addition, although the usage of the Wasserstein metric seems to decrease the value of the metrics, this can be explained as the result of slower training.

Figure 4: Plots of Diversity and Tonality rewards (the latter re-scaled to the [0, 1] interval) after 80 epochs of training on the music generation task. The upper plot employs the classical GAN loss, while the lower displays a WGAN. The values have been averaged over 1000 samples.

Effect of $\lambda$

By tweaking $\lambda$, the ORGAN approach allows one to explore the trade-off between maximizing the desired objective and maintaining likelihood to the data distribution.

Figure 4 shows the distribution of tonality and diversity sampled from ORGAN and OR(W)GAN for several $\lambda$ values. This showcases that there exists an optimal value for $\lambda$ which simultaneously maximizes the reward and diversity. This value is dependent on the model, dataset and metric; therefore, a parameter search would be advantageous to maximize objectives.

6 Conclusions and future work

In this work, we have presented ORGAN, a novel framework to optimize an arbitrary objective in a sequence generation task. We have built on recent advances in GANs, particularly SeqGAN, and extended them with reinforcement learning to control properties of generated samples.

We have shown that ORGAN can improve certain desired metrics, achieving better results than recurrent neural networks trained via either MLE or SeqGAN. Even more importantly, data generation can be made subject to a domain-specific reward function while still using the adversarial setting to encourage the production of non-repetitious samples. Moreover, ORGAN possesses a natural advantage as a black box compared to similar objective-optimization models, since it is not necessary to introduce multiple domain-specific penalties to the reward function: many times a simple objective "hint" will suffice.

As evidenced by the experiments, the RL component is one of the major drivers of property optimization and the promotion of diversity. Values tended to be higher when RL was present in the architecture of the models.

Future work should investigate how the choice of heuristic can affect the performance of the model. There are also other formulations of GANs for discrete sequences [Che et al.2017], [Hjelm et al.2017] that could be extended with an RL component in order to fine-tune the generation processes.

One area of improvement, as seen from Figure 2, is to push the boundaries of the datasets in certain properties. In some domains, such as drug and materials discovery, outliers might be more valuable.

Finally, forthcoming research should extend ORGANs to work with non-sequential data, such as images or audio. This requires framing the GAN setup as a reinforcement learning problem in order to add an arbitrary (not necessarily differentiable) objective function. We believe this extension to be quite promising since real-valued GANs are currently better understood than sequence data GANs.


  • [Arjovsky and Bottou2017] Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. 2017.
  • [Arjovsky et al.2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. 2017.
  • [Bickerton et al.2012] G Richard Bickerton et al. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2), 2012.
  • [Bonds2013] Mark Evan Bonds. A history of music in Western culture. Pearson Education, 2013.
  • [Che et al.2017] Tong Che et al. Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. 2017.
  • [Chen et al.2017] Zhiqian Chen et al. Learning to fuse music genres with generative adversarial dual learning. arXiv preprint arXiv:1712.01456, 2017.
  • [Ertl and Schuffenhauer2009] Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules. J. Cheminform., 1(1), 2009.
  • [Ertl et al.2017] Peter Ertl et al. In silico generation of novel, drug-like chemical matter using the lstm neural network. arXiv preprint arXiv:1712.07449, 2017.
  • [Gómez-Bombarelli et al.2016a] Rafael Gómez-Bombarelli et al. Design of efficient molecular OLEDs by a high-throughput virtual screening and experimental approach. Nat. Mater., 15(10), 2016.
  • [Gómez-Bombarelli et al.2016b] Rafael Gómez-Bombarelli et al. Automatic chemical design using a data-driven continuous representation of molecules. arXiv, 2016.
  • [Goodfellow et al.2014] Ian J. Goodfellow et al. Generative Adversarial Networks. arXiv, 2014.
  • [Habrard et al.2008] Amaury Habrard et al. Melody Recognition with Learned Edit Distances. Springer, Berlin, Heidelberg, 2008.
  • [Hachmann et al.2011] Johannes Hachmann et al. Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid. J. Phys. Chem. Lett., 2(17), 2011.
  • [Hjelm et al.2017] R Devon Hjelm et al. Boundary-Seeking Generative Adversarial Networks. 2017.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8), 1997.
  • [Irwin and Shoichet2005] John J Irwin and Brian K Shoichet. J. Chem. Inf. Model, (December 2004), 2005.
  • [Jaques et al.2016] Natasha Jaques et al. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control. arXiv, 2016.
  • [Jaques et al.2017] Natasha Jaques et al. Tuning recurrent neural networks with reinforcement learning. 2017.
  • [Kadurin et al.2017] Artur Kadurin et al. drugan: de novo generation of new molecules with desired molecular properties in silico. Molecular pharmaceutics, 14(9), 2017.
  • [Kim2014] Yoon Kim. Convolutional Neural Networks for Sentence Classification. Proc. 2014 Conf. Empir. Methods Nat. Lang. Process. (EMNLP 2014), 2014.
  • [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ICLR, 2014.
  • [Kusner et al.2017] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar Variational Autoencoder. 2017.
  • [Landrum2016] Greg Landrum. RDKit, 2016.
  • [Lee et al.2017] Sang-gil Lee et al. A seqgan for polyphonic music generation. arXiv preprint arXiv:1710.11418, 2017.
  • [Li et al.2016] Jiwei Li et al. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
  • [Mescheder et al.2017] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The Numerics of GANs. 2017.
  • [Metz et al.2016] Luke Metz et al. Unrolled Generative Adversarial Networks. 2016.
  • [Pfau and Vinyals2016] David Pfau and Oriol Vinyals. Connecting Generative Adversarial Networks and Actor-Critic Methods. 2016.
  • [Radford et al.2017] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
  • [Ramakrishnan et al.2014] Raghunathan Ramakrishnan et al. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data, 1, 2014.
  • [Ranzato et al.2015] Marc’Aurelio Ranzato et al. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
  • [Salimans et al.2016] Tim Salimans et al. Improved Techniques for Training GANs. NIPS, 2016.
  • [Sanchez-Lengeling et al.2017] Benjamin Sanchez-Lengeling et al. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry. 2017.
  • [Segler et al.2017] Marwin HS Segler et al. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 2017.
  • [Weininger1988] David Weininger. SMILES, a chemical language and information system. J. Chem. Inf. Model., 28(1), 1988.
  • [Williams1992] R J Williams. Simple Statistical Gradient-Estimating Algorithms for Connectionist Reinforcement Learning. Mach. Learn., 8(3), 1992.
  • [Yu et al.2017] Lantao Yu et al. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.