1 Introduction
Unsupervised generation of data is a dynamic area of machine learning and a very active research frontier in areas ranging from language processing and music generation to materials and drug discovery.
In any of these fields, it is often advantageous to guide the generative model towards some desirable characteristics, while ensuring that the samples resemble the initial distribution. In music generation, for example, it might be expected that pleasant melodic patterns prevail over more dissonant ones [Jaques et al.2016]
. In natural language processing, a given sentiment might be emphasized, maybe for producing movie reviews
[Radford et al.2017]. Finally, in materials discovery, the aim is often to optimize some properties for a particular application, for example in organic solar cells [Hachmann et al.2011], OLEDs [GómezBombarelli et al.2016a]or new drugs. The generation of discrete data using Recurrent Neural Networks (RNNs), in particular, Long ShortTerm Memory cells
[Hochreiter and Schmidhuber1997]and maximum likelihood estimation has been shown to work well in practice. However, this often suffers from the socalled
exposure bias, and might lack some of the multiscale structures or salient features of the data. Meanwhile Generative Adversarial Networks (GANs) [Goodfellow et al.2014], an approach where a generative model competes against a discriminate model, one trying to generate likely data while the other trying to distinguish false from real data. GANs have shown remarkable results at generation of data that imitates a data distribution, however they can suffer from several issues, among these modecollapse [Arjovsky and Bottou2017]. Where the generator learns to produce samples with low variety.Although GANs were not initially applicable to discrete data due to nondifferentiability, approaches such as SeqGAN [Yu et al.2017], MaliGAN [Che et al.2017] and BGAN [Hjelm et al.2017] have arisen to deal with this issue.
Furthermore methods from Reinforcement Learning (RL) have shown great success at solving problems where continuous feedback from an environment is needed [Hjelm et al.2017].
In this paper, we introduce a novel approach to optimize the properties of a distribution of sequences, increase the diversity of the samples while maintaining the likeliness of the data distribution. In our approach, the generator is trained to maximize a weighted average of two types of rewards: the objective, domainspecific metrics, and the discriminator
, which is trained along with the generator in an adversarial fashion. While the objective component of the reward function ensures that the model selects for traits that maximize the specified heuristic, the discriminator incentives the samples to stay within boundaries of the initial data distribution. Diversity is additionally promoted by reducing rewards of nonunique and less diverse sequences.
In order to implement the above idea, we build on SeqGAN, a recent work that successfully combines GANs and RL to apply the GAN framework to sequential data [Yu et al.2017] and extend it towards domainspecific rewards. To increase the stability of the adversarial training, we test WassersteinGANs [Arjovsky et al.2017] in this framework.
We test our model in the context of molecular and music generation, optimizing several domainspecific metrics. Our results show that ORGAN is able to tune the quality and structure of samples. We compare our results with the maximum likelihood estimation (MLE), SeqGAN and a RL approach.
2 Related work
Previous work has relied on specific modifications of the objective function to reach the desired properties. For example, [Jaques et al.2016] introduce penalties to unrealistic sequences, in absence of which RL can easily get stuck around local maxima which can be very far from the global maximum reward. Related applications by [Ranzato et al.2015] and [Li et al.2016] apply reinforcement learning to sequence generation in a NLP setting.
In the last two years, many methodologies have been proposed for de novo molecular generation. [Ertl et al.2017] and [Segler et al.2017] trained recurrent neural networks to generate druglike molecules. [GómezBombarelli et al.2016b]
employed a variational autoencoder to build a latent, continuous space where property optimization can be made through surrogate optimization. Finally,
[Kadurin et al.2017] presented a GAN model for drug generation. Additionally, the approach presented in this paper has recently been applied to molecular design [SanchezLengeling et al.2017].In the field of music generation, [Lee et al.2017] built a SeqGAN model employing an efficient representation of multichannel MIDI to generate polyphonic music. [Chen et al.2017] presented Fusion GAN, a duallearning GAN model that can fuse two data distributions. [Jaques et al.2017] employ deep Qlearning with a crossentropy reward to optimize the quality of melodies generated from an RNN.
In adversarial training, [Pfau and Vinyals2016] recontextualizes GANs in the actorcritic setting. This connection is also explored with the Wasserstein1 distance in WGANs [Arjovsky et al.2017]. Minibatch discrimination and feature mapping were used to promote diversity in GANs [Salimans et al.2016]. Another approach to avoid mode collapse was shown with Unrolled GANs [Metz et al.2016]. Issues and convergence of GANs has been studied in [Mescheder et al.2017].
3 Background
In this section, we elaborate on the GAN and RL setting based on SeqGAN [Yu et al.2017]
is a generator parametrized by , that is trained to produce highquality sequences of length and a discriminator model parametrized by
, trained to classify real and generated sequences.
is trained to deceive , and to classify correctly. Both models are trained in alternation, following a minimax game:(1) 
For discrete data, the sampling process is not differentiable. However, can be trained as an agent in a reinforcement learning context using the REINFORCE algorithm [Williams1992]. Let be the reward function defined for full length sequences. Given an incomplete sequence , also to be referred to as state , must produce an action , along with the next token .
The agent’s stochastic policy is given by and we wish to maximize its expected long term reward
(2) 
where is a fixed initial state. is the actionvalue function that represents the expected reward at state of taking action and following our current policy to complete the rest of the sequence. For any full sequence , we have but we also wish to calculate for partial sequences at intermediate timesteps, considering the expected future reward when the sequence is completed. In order to do so, we perform time Monte Carlo search with the canonical rollout policy represented as
(3) 
where and is stochastically sampled via the policy . Now becomes
(4) 
An unbiased estimation of the gradient of
can be derived as(5) 
Finally in SeqGAN the reward function is provided by .
4 Organ
Figure 1 illustrates the main idea of ORGAN. To take into account domainspecific desired objectives , we extend the reward function for a particular sequence to a linear combination of and , parametrized by :
(6) 
If the model ignores and becomes a "naive" RL algorithm, whereas if it is simply a SeqGAN model. It should be noted that, if chosen, the objective function can vary based on the current iteration of adversarial training, leading to alternating rewards between several objectives and the discriminator.
An additional mechanism to prevent mode collapse is to penalize nonunique sequences by dividing the reward of a repeated sequence by it’s the number of copies. The more a sequence gets repeated, the more it will have diminishing rewards. Alternatively, domainspecific similarity metrics could be used to penalize.
To improve the stability of learning, and avoid of problems of GAN convergence like "perfect discriminator", we also implemented the Wasserstein1 distance, also known as earth mover’s distance, for [Arjovsky et al.2017]. Although the computation of this distance is intractable due to an infimum, it can be transformed via the KantorovichRubinstein duality:
Under , is no longer meant to classify data samples, but now trained and converged to learn such that is KLipschitz continuous and used to compute the Wasserstein distance. Intuitively the cost of moving the generated distribution to the data. In this context, can now be considered as a critic in an actorcritic setting.
4.1 Implementation Details
is a RNN with LSTM cells, while
is Convolutional Neural Network (CNN) designed specifically for text classification tasks
[Kim2014].To avoid overfitting with the CNN, we optimized its architecture on classification task between different datasets for each experiment. In the molecule generation task, we utilized a set of druglike and nondruglike molecules from the ZINC database [Irwin and Shoichet2005]. In the music task, we discriminated between a set of folk and videogame tunes scraped from the internet. We utilize a dropout layer at and also regularization on the network weights. All the gradient descent steps are done using the Adam algorithm [Kingma and Ba2014].
Molecular metrics are implemented using the RDKit chemoinformatics package [Landrum2016]. Music metrics employ the MIDI frequencies. The code for ORGAN, including metrics for each experiment, can be found at http://github.com/gablg1/ORGAN^{1}^{1}1Repo soon to be updated (May’18).
5 Experimental results
In this section, we will test the performance of ORGAN in two scenarios: the generation of molecules encoded as text sequences and musical melodies. Our objective is to show that ORGAN can generate samples that fulfill some desired objectives while promoting diversity. For purposes of interpretation, the range of each objective has been mapped to range, where corresponds to an undesirable property and
to a very desirable property. Each generator model was pretrained for 250 epochs using MLE, and the discriminator was trained for 10 epochs.
To measure diversity we use domainspecific measures. In both fields, there are multiple ways of quantifying the notion of diversity so we tried utilizing more widely used metrics.
We compare ORGAN and the Wasserstein variant () with three other methods of training RNNs: SeqGAN, Naive RL, and Maximum Likelihood Estimation (MLE). Unless specified, is assumed to be 0.5. All training methods involve a pretraining step of 250 epochs of MLE for , and 10 epochs for . The MLE baseline simply stops right after pretraining, while the other methods proceed to further train the model using the different approaches, up to 100 epochs.
For each dataset, we first build a dictionary mapping the vocabulary  the set of all characters present in the dataset  to integers. The dataset is then preprocessed by transforming each sequence into a fixed sized integer sequence of length where is the maximum length of a string present in the dataset (in the case of molecules, along with around more characters to increase flexibility and allow generation of larger samples of data). Every string with a length smaller than
is padded with “_" characters. Thus the input to our model becomes a list of fixed sized integer sequences.
5.1 Experiment: Molecules
Here we test the effectiveness of ORGAN for generating molecules with desirable properties in a pharmaceutical context of drug discovery.
Molecules can be encoded as text sequences by using the SMILES representation [Weininger1988] of a molecule. This representation encodes the topological information of a molecule based on common chemical bonding rules. For example, the 6carbon ringed molecule benzene can be encoded as ’C1=CC=CC=C1’. Each C represents a carbon atom, the ’=’ symbolizes a double bond and ’1’ the start and closing of a cycle/ring, hydrogen atoms can be deduced via simple rules.
The SMILES representation has predefined grammar rules, and as such, it is possible to have invalid expressions that cannot be decoded back to a valid molecule. Therefore desired property on a generative algorithm is to have a high percentage of valid expression. Invalid expressions get penalized. Additionally, we also penalize the generation of duplicate molecules.
Recent generative models ([GómezBombarelli et al.2016a],[Kusner et al.2017]) have reported valid expression rates between up to . It should be noted that there are common uninteresting ways to generate valid expressions by alternating "C" and "O" characters such as ’CCCCCCCC’ and ’COCCCCOC’, the combinatorial possibilities of such permutations is already huge.
For training, we utilized a random subset of 5k molecules from the set of 134 thousand stable small molecules [Ramakrishnan et al.2014]. This is a subset of all molecules with up to nine heavy atoms (CONF) out of the GDB17 universe of 166 billion organic molecules [Ramakrishnan et al.2014]. The maximum sequence length is 51 and the alphabet size is 43.
When choosing objectives we picked qualities that are normally desired for small molecule drug discovery:
 Solubility:

a property that measures how likely a molecule is able to mix with water, also known as the wateroctanol partition coefficient (LogP). Computed via RDKit’s Crippen function [Landrum2016].
 Synthetizability:

estimates how hard (0) or how easy (1) it is to synthesize a given molecule [Ertl and Schuffenhauer2009].
 Druglikeness:

how likely a molecule is a viable candidate for a drug, an estimate that captures the abstract notion of aesthetics in medicinal chemistry [Bickerton et al.2012]. This property is correlated to the previous two metrics.
To estimate the diversity of our generated samples we can utilize the notion of molecular similarity to construct a measure of how similar or dissimilar a molecule is with respect to a dataset. This measure is based on molecular fingerprints and their Jaccard distance [SanchezLengeling et al.2017]. More concretely, Diversity measures the average similarity of a molecule with respect to a set, in this case, a random subset of molecules from the training set. A value of 1 would indicate the molecule is likely to be considered a diverse member of this set, 0 would indicate it has many repeated substructures with respect to the set.
Objective  Algorithm  Validity (%)  Diversity  Druglikeliness  Synthesizability  Solubility  
MLE  75.9  0.64  0.48  (0%)  0.23  (0%)  0.30  (0%)  
SeqGAN  80.3  0.61  0.49  (2%)  0.25  (6%)  0.31  (3%)  
Druglikeliness  ORGAN  88.2  0.55  gray!250.52  gray!25(8%)  0.32  (38%)  0.35  (18%) 
OR(W)GAN  85.0  0.95  gray!250.60  gray!25(25%)  0.54  (130%)  0.47  (57%)  
Naive RL  97.1  0.8  gray!250.57  gray!25(19%)  0.53  (126%)  0.50  (67%)  
Synthesizability  ORGAN  96.5  0.92  0.51  (6%)  gray!250.83  gray!25(255%)  0.45  (52%) 
OR(W)GAN  97.6  1.00  0.20  (59%)  gray!250.75  gray!25(223%)  0.84  (184%)  
Naive RL  97.7  0.96  0.52  (8%)  gray!25 0.83  gray!25(256%)  0.46  (54%)  
Solubility  ORGAN  94.7  0.76  0.50  (4%)  0.63  (171%)  gray!250.55  gray!25(85%) 
OR(W)GAN  94.1  0.90  0.42  (12%)  0.66  (185%)  gray!250.54  gray!25(81%)  
Naive RL  92.7  0.75  0.49  (3%)  0.70  (200%)  gray!250.78  gray!25(162 %)  
All/Alternated  ORGAN  96.1  92.3  gray!250.52  gray!25(9%)  gray!250.71  gray!25(206%)  gray!250.53  gray!25(79%) 
Table 1 shows quantitative results comparing ORGAN to other methods and three different optimization scenarios. MLE and SeqGAN are able to capture the distribution of properties of the training set with minimal alteration in their metrics. While the metric optimized methods excelled in all metrics above the nonoptimized methods, effectively showing that they are able to bias the generation process. The Wasserstein variant of ORGAN also seemed to give better diversity properties.
In our experiments, we also noted that naive RL has different failure scenarios. For instance, this approach excelled particularly in the task of Solubility, this particular task rewards very simple sequences such as for the single atom molecule “N" or monotonous patterns like “CCCCCCC" or “CCOCOCCCC" positively. It seems for the other approaches, the GAN/WGAN setting is enforcing more diversity and so punishes these types of patterns, providing highly soluble molecules with more complex features.
Capacity ceiling
We did notice a form of capacity ceiling in our generation tasks in two forms. The GAN models tended to generate sequences that had the same average sequence length as the training set (15.42). With RL we did not observe this constraint, either it went quite low with synthesizability (9.4) or high (21.3) with druglikeliness. This might be advantageous or detrimental based, on the setting. Optimizing a property that relates to sequence length, for example, molecular size might change this.
The other ceiling is illustrated in figure 2, where the upper limits in Druglikeliness for the data and the best performing approach match. While OR(W)GAN tends to generate more druglike molecules, they do not reach the highest value of 1. This might be property and dataset dependent.
Multiobjective training programs
We also experimented with alternating objectives during training. By training for one epoch each objective in rotation until 99 epochs (33 epochs per objective) we arrive to figure 3.
Surprisingly by alternating the objectives, as seen in the last row of table 1, the gains in each metric are quite high and almost comparable with the best models in each individually trained objective. Although it can also be appreciated in the slight fluctuating behavior of the graphs that there might be limits to the gains that can be achieved. Further work is warranted in this direction.
5.2 Experiment: Musical melodies
To further demonstrate the applicability of ORGAN, we extend our study to music sequences. We employ the notation introduced by [Jaques et al.2017], where each token corresponds to a sixteenth of a bar of music. The first two tokens are reserved as 0, which is silent, and 1, which means no event; the other 36 tokens encode three octaves of music, from C3 (MIDI pitch 48) to B5. We use a 1k random sample from the Essen Associative Code (EsAC) folk dataset as processed by [Chen et al.2017], where every melody has a duration of 36 tokens (2.25 music bars). We generate songs optimizing two different metrics:
 Tonality.

This measures how many perfect fifths are in the music that is generated. A perfect fifth is defined as a musical interval whose frequencies have a ratio of approximately 3:2. These provide what is generally considered pleasant note sequences due to their high consonance.
 Ratio of Steps.

A step is an interval between two consecutive notes of a scale. An interval from C to D, for example, is a step. A skip, on the other hand, is a longer interval. An interval from C to G, for example, is a skip. By maximizing the ratio of steps in our music, we are adhering to the conjunct melodic motion. Our rationale here is that by increasing the number of steps in our songs we make our melodic leaps rarer and more memorable [Bonds2013].
Moreover, we calculate diversity as the average pairwise edit distance of the generated data [Habrard et al.2008]. We do not attempt to maximize this metric explicitly but we keep track of it to shed light on the tradeoff between metric optimization and sample diversity in the ORGAN framework. Table 2 shows quantitative results comparing ORGAN to other baseline methods optimizing for three different metrics. ORGAN outperforms SeqGAN and MLE in all of the three metrics. Naive RL achieves a higher score than ORGAN for the Ratio of Steps metric, but it underperforms in terms of diversity, as Naive RL would likely generate very simple rather than diverse songs. In this sense, similar to the molecule case, although the Naive RL ratio of steps score is higher than ORGAN’s, the actual generated songs can be deemed much less interesting.
Objective  Algorithm  Diversity  Tonality  Ratio of Steps 
MLE  0.221  0.007  0.010  
SeqGAN  0.187  0.005  0.010  
Tonality  Naive RL  0.100  gray!250.478  2.9E05 
ORGAN  0.268  gray!250.372  1.78E04  
OR(W)GAN  0.268  gray!250.177  2.4E04  
Ratio of Steps  Naive RL  0.321  0.001  gray!250.829 
ORGAN  0.433  0.001  gray!250.632  
OR(W)GAN  0.134  5.95E05  gray!250.622 
We note that the Ratio of Steps and Tonality have an inverse relationship. This is because two consecutive notes  what qualifies as a step  do not have the frequency ratio of a perfect fifth, which are responsible for increasing tonality. In addition, although the usage of the Wasserstein metric seems to decrease the metrics value, this can be explained as the result of slower training.
Effect of
By tweaking , the ORGAN approach allows one to explore the tradeoff between maximizing the desired objective and maintaining likelihood to the data distribution.
Figure 4 shows the distribution of tonality and diversity sampled from ORGAN and OR(W)GAN for several values. This showcases that there exists an optimal value for which maximizes simultaneously the reward and diversity. This value is dependent on the model, dataset and metric, therefore a parameter search would be advantageous to maximize objectives.
6 Conclusions and future work
In this work, we have presented ORGAN, a novel framework to optimize an arbitrary object in a sequence generation task. We have built on recent advances in GANs, particularly SeqGAN, and extended them with reinforcement learning to control properties of generated samples.
We have shown that ORGAN can improve certainly desired metrics, achieving better results than recurrent neural networks trained via either MLE or SeqGAN. Even more importantly, data generation can be made subject to a domainspecific reward function while still using the adversarial setting to guarantee the production of nonrepetitious samples. Moreover, ORGAN possesses a natural advantage as a black box compared to similar objectiveoptimization models, since it is not necessary to introduce multiple domainspecific penalties to the reward function: many times a simple objective "hint" will suffice.
As evidenced with the experiments, the RL component is the one of the major drivers for the property optimization and promotion of diversity. Values tended to be higher when RL was present in the architecture of the models.
Future work should investigate how the choice of heuristic can affect the performance of the model. There are also other formulations of GANs for discrete sequences [Che et al.2017],[Hjelm et al.2017] that could be extended with a RL component in order to finetune the generation processes.
One area of improvement as seen from figure 2
is to push the boundaries of the datasets in certain properties. In some domains, outliers might be more valuable such as in the case of drug and materials discovery.
Finally, forthcoming research should extend ORGANs to work with nonsequential data, such as images or audio. This requires framing the GAN setup as a reinforcement learning problem in order to add an arbitrary (not necessarily differentiable) objective function. We believe this extension to be quite promising since realvalued GANs are currently better understood than sequence data GANs.
References
 [Arjovsky and Bottou2017] Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. 2017.
 [Arjovsky et al.2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. 2017.
 [Bickerton et al.2012] G Richard Bickerton et al. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2), 2012.
 [Bonds2013] Mark Evan. Bonds. A history of music in Western culture. Pearson Education, 2013.
 [Che et al.2017] Tong Che et al. MaximumLikelihood Augmented Discrete Generative Adversarial Networks. 2017.
 [Chen et al.2017] Zhiqian Chen et al. Learning to fuse music genres with generative adversarial dual learning. arXiv preprint arXiv:1712.01456, 2017.
 [Ertl and Schuffenhauer2009] Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of druglike molecules. J. Cheminform., 1(1), 2009.
 [Ertl et al.2017] Peter Ertl et al. In silico generation of novel, druglike chemical matter using the lstm neural network. arXiv preprint arXiv:1712.07449, 2017.
 [GómezBombarelli et al.2016a] Rafael GómezBombarelli et al. Design of efficient molecular OLEDs by a highthroughput virtual screening and experimental approach. Nat. Mater., 15(10), 2016.
 [GómezBombarelli et al.2016b] Rafael GómezBombarelli et al. Automatic chemical design using a datadriven continuous representation of molecules. arXiv, 2016.
 [Goodfellow et al.2014] Ian J. Goodfellow et al. Generative Adversarial Networks. arXiv, 2014.
 [Habrard et al.2008] Amaury Habrard et al. Melody Recognition with Learned Edit Distances. Springer, Berlin, Heidelberg, 2008.
 [Hachmann et al.2011] Johannes Hachmann et al. LargeScale Computational Screening and Design of Organic Photovoltaics on the World Community Grid. J. Phys. Chem. Lett., 2(17), 2011.
 [Hjelm et al.2017] R Devon Hjelm et al. BoundarySeeking Generative Adversarial Networks. 2017.
 [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural Comput., 9(8), 1997.
 [Irwin and Shoichet2005] John J Irwin and Brian K Shoichet. J. Chem. Inf. Model, (December 2004), 2005.
 [Jaques et al.2016] Natasha Jaques et al. Sequence Tutor: Conservative FineTuning of Sequence Generation Models with KLcontrol. arXiv, 2016.
 [Jaques et al.2017] Natasha Jaques et al. Tuning recurrent neural networks with reinforcement learning. 2017.
 [Kadurin et al.2017] Artur Kadurin et al. drugan: de novo generation of new molecules with desired molecular properties in silico. Molecular pharmaceutics, 14(9), 2017.
 [Kim2014] Yoon Kim. Convolutional Neural Networks for Sentence Classification. Proc. 2014 Conf. Empir. Methods Nat. Lang. Process. (EMNLP 2014), 2014.
 [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ICLR, 2014.
 [Kusner et al.2017] Matt J. Kusner, Brooks Paige, and José Miguel HernándezLobato. Grammar Variational Autoencoder. 2017.
 [Landrum2016] Greg Landrum. RDKit, 2016.
 [Lee et al.2017] Sanggil Lee et al. A seqgan for polyphonic music generation. arXiv preprint arXiv:1710.11418, 2017.
 [Li et al.2016] Jiwei Li et al. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
 [Mescheder et al.2017] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The Numerics of GANs. 2017.
 [Metz et al.2016] Luke Metz et al. Unrolled Generative Adversarial Networks. 2016.
 [Pfau and Vinyals2016] David Pfau and Oriol Vinyals. Connecting Generative Adversarial Networks and ActorCritic Methods. 2016.
 [Radford et al.2017] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
 [Ramakrishnan et al.2014] Raghunathan Ramakrishnan et al. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data, 1, 2014.
 [Ranzato et al.2015] Marc’Aurelio Ranzato et al. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
 [Salimans et al.2016] Tim Salimans et al. Improved Techniques for Training GANs. NIPS, 2016.
 [SanchezLengeling et al.2017] Benjamin SanchezLengeling et al. Optimizing distributions over molecular space. An ObjectiveReinforced Generative Adversarial Network for Inversedesign Chemistry. 2017.
 [Segler et al.2017] Marwin HS Segler et al. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 2017.
 [Weininger1988] David Weininger. SMILES, a chemical language and information system. J. Chem. Inf. Model., 28(1), 1988.
 [Williams1992] R J Williams. Simple Statistical GradientEstimating Algorithms for Connectionist Reinforcement Learning. Mach. Learn., 8(3), 1992.
 [Yu et al.2017] Lantao Yu et al. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.