## 1 Introduction

Generative adversarial networks (GANs) are methods for generating synthetic data with statistical properties similar to those of real data [5]. In the GAN methodology, a discriminative neural network D is trained to distinguish whether a given data instance is synthetic or real, while a generative network G is jointly trained to confuse D by generating high-quality data. This approach has been very successful in computer vision tasks for generating samples of natural images [2, 3, 11].

GANs work by propagating gradients back from the discriminator D through the generated samples to the generator G. This is perfectly feasible when the generated data is continuous, as in the image examples mentioned above. However, a lot of data exists in the form of sequences of discrete items: text sentences [1], molecules encoded in the SMILES language [4], etc. In these cases, the sampling of discrete data is not differentiable and the backpropagated gradients are always zero.

Discrete data, encoded using a one-hot representation, can be sampled from a multinomial distribution with probabilities given by the output of a softmax function. The resulting sampling process is not differentiable. However, we can obtain a differentiable approximation by sampling from the Gumbel-softmax distribution [8]. This distribution has previously been used to train variational autoencoders with discrete latent variables [8]. Here, we propose to use it to train GANs on sequences of discrete tokens, and we evaluate its performance in this setting.

An alternative approach to training GANs on discrete sequences is described in [14]. This method models the generation of the discrete sequence as a stochastic policy in reinforcement learning and bypasses the generator differentiation problem by directly performing policy gradient updates.

## 2 Gumbel-softmax distribution

The softmax function can be used to parameterize a multinomial distribution on a one-hot-encoded $k$-dimensional vector $y$ in terms of a continuous $k$-dimensional vector $h$. Let $p$ be a $k$-dimensional vector of probabilities specifying the multinomial distribution on $y$, with $p_i = p(y_i = 1)$ for $i = 1, \ldots, k$. Then

$$p = \mathrm{softmax}(h)\,, \qquad (1)$$

where $\mathrm{softmax}$ returns here a $k$-dimensional vector with the output of the softmax function:

$$\mathrm{softmax}(h)_i = \frac{\exp(h_i)}{\sum_{j=1}^{k} \exp(h_j)}\,, \qquad i = 1, \ldots, k. \qquad (2)$$

It can be shown that sampling $y$ according to the previous multinomial distribution with probability vector given by (1) is the same as sampling according to

$$y = \mathrm{one\_hot}\!\left(\arg\max_i \, (h_i + g_i)\right), \qquad (3)$$

where the $g_i$ are independent and follow a Gumbel distribution with zero location and unit scale.

The sample $y$ generated in (3) has gradient zero with respect to $h$ because the $\arg\max$ operator is not differentiable. We propose to approximate this operator with a differentiable function based on the softmax transformation [8]. In particular, we approximate $y$ with

$$\tilde{y} = \mathrm{softmax}(\beta\,(h + g))\,, \qquad (4)$$

where $\beta$ is an inverse temperature parameter. When $\beta \to \infty$, the samples generated by (4) have the same distribution as those generated by (3), and when $\beta = 0$, the samples are always the uniform probability vector. For positive and finite values of $\beta$, the samples generated by (4) are smooth and differentiable with respect to $h$.
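As a small illustration (not part of the paper), the hard sampling scheme of (3) and its Gumbel-softmax relaxation (4) can be sketched in NumPy; the logits below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

h = np.log(np.array([0.1, 0.6, 0.3]))       # arbitrary logits for a 3-way multinomial
g = rng.gumbel(loc=0.0, scale=1.0, size=3)  # independent Gumbel(0, 1) noise

# Eq. (3): a hard one-hot sample (zero gradient with respect to h)
y_hard = np.eye(3)[np.argmax(h + g)]

# Eq. (4): the differentiable relaxation; a large beta pushes it close to one-hot
y_soft = softmax(100.0 * (h + g))
```

Because the softmax is monotone, the relaxed sample always places its largest probability on the same index that the hard sample selects under the same noise draw.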

The probability distribution for $\tilde{y}$ in (4), which is parameterized by $\beta$ and $h$, is called the Gumbel-softmax distribution [8]. A GAN on discrete data can then be trained by using (4), starting with a relatively large temperature $1/\beta$ and then annealing it towards zero during training.

## 3 A recurrent neural network for discrete sequences

In this section we describe how to construct a generative adversarial network (GAN) that is able to generate text from random noise samples. We also give a simple algorithm to train our model, inspired by recent work in adversarial modeling.

### An example

Consider the problem of learning to generate simple one-variable arithmetic sequences that can be described by the following context-free grammar:

$$S \to x \;|\; S + S \;|\; S - S \;|\; S * S \;|\; S / S$$

where $|$ divides possible productions of the grammar. The above grammar generates sequences of characters such as $x+x$ and $x*x-x$.
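As an illustration (not part of the paper's setup), strings can be drawn from a grammar of this kind with a small recursive sampler; the production table below assumes a one-variable arithmetic grammar:

```python
import random

# Assumed productions: nonterminal -> list of alternatives,
# each alternative a sequence of symbols.
GRAMMAR = {
    "S": [["x"], ["S", "+", "S"], ["S", "-", "S"], ["S", "*", "S"], ["S", "/", "S"]],
}

def sample(symbol="S", depth=0, max_depth=4, rng=random):
    """Expand `symbol` by picking a random production; once max_depth is
    reached, force the terminal production so sampling always terminates."""
    if symbol not in GRAMMAR:
        return symbol  # terminal character
    rules = GRAMMAR[symbol]
    rule = rules[0] if depth >= max_depth else rng.choice(rules)
    return "".join(sample(s, depth + 1, max_depth, rng) for s in rule)

random.seed(0)
strings = [sample() for _ in range(5)]
```

A depth limit of this kind is one simple way to bound the maximum sequence length when building a training set.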

Our generative model is based on a Long Short-Term Memory (LSTM) recurrent neural network [6], shown in the top of Figure 1. The LSTM is trained to predict a hidden-state vector at every time-step (i.e., for every character). The softmax operator is then applied to this vector as in equations (2) and (1), which gives a distribution over all possible generated characters. After training, the network generates data by sampling from the softmax distribution at each time-step.

One way to train the LSTM model to predict future characters is by matching the softmax distribution to a one-hot encoding of the input data via maximum likelihood estimation (MLE). In this work, we are interested in constructing a generative model for discrete sequences, which we will accomplish by sampling through the LSTM, as shown in the bottom of Figure 1. Our generative model takes as input a sample-pair that effectively replaces the initial cell and hidden states. From this sample our generator constructs a sequence by successively feeding its predictions as input to the following LSTM unit. Our primary contribution is designing a method to train this generator to generate realistic-looking discrete sequences.

### Generative adversarial modeling

Given a set of data points $x_1, \ldots, x_n$ independently and identically drawn from a $d$-dimensional distribution $p(x)$ (in our case each $x_i$ is a one-hot encoding of a character), the goal of generative modeling is to learn a distribution $q(x)$ that accurately approximates $p(x)$. The framework of generative adversarial modeling has been shown to yield models that generate remarkably realistic data points. The adversarial training idea is straightforward. First, we learn a so-called *generator* $G$ that transforms samples $z$ from a simple, known distribution (e.g., a uniform or Gaussian distribution) into samples that approximate those drawn from $p(x)$. Specifically, we define $\tilde{x} = G(z)$, where $z$ is drawn from an $m$-dimensional uniform distribution on the interval $[0, 1]$. Second, to learn $G$ we introduce a classifier we call the *discriminator* $D$. The discriminator takes as input any real $d$-dimensional vector (this could be a generated input $\tilde{x}$ or a real one $x$) and predicts the probability that the input is actually drawn from the real distribution $p(x)$. It is trained to take samples $\tilde{x}$ and real inputs $x$ and accurately distinguish them. At the same time, the generator is trained so that it can fool the discriminator into thinking that a fake point it generated is real with high probability. Initially, the discriminator can easily tell the fake points from the real ones because the generator is poor. However, as training progresses the generator uses this signal from the discriminator to determine how to generate more realistic samples. Eventually, the generator will generate samples so realistic that the discriminator has only a random chance of guessing whether a generated point is real.
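The adversarial game described above is commonly formalized as the minimax objective of [5]:

$$\min_G \max_D \; \mathbb{E}_{x \sim p(x)}\!\left[\log D(x)\right] + \mathbb{E}_{z}\!\left[\log\left(1 - D(G(z))\right)\right].$$

The discriminator maximizes its accuracy at separating real points from generated ones, while the generator minimizes the probability that its samples are flagged as fake.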

### Using the Gumbel-softmax distribution

In our case $G$ and $D$ are both LSTMs with parameters $\theta$ and $\phi$, respectively. Our aim is to learn $\theta$ and $\phi$ by sampling inputs $x$ and generated points $\tilde{x}$, and minimizing differentiable loss functions for $G$ and $D$ to update $\theta$ and $\phi$. Unfortunately, sampling generated points from the softmax distribution given by the LSTM, eq. (1), is not differentiable with respect to the hidden states (and thus $\theta$). However, the Gumbel-softmax distribution, eq. (4), is. Equipped with this trick we can take any differentiable loss function and optimize $\theta$ and $\phi$ using gradient-based techniques. We describe our adversarial training procedure in Algorithm 1, inspired by recent work on GANs [12]. This algorithm can be shown, in expectation, to minimize the KL-divergence between $q(x)$ and $p(x)$.

Figure 2 shows a schematic of the adversarial training procedure for discrete sequences.

## 4 Experiments

We now show the power of our adversarial modeling framework for generating discrete sequences. To illustrate this we consider modeling the context-free grammar introduced in Section 3. We generate samples with a fixed maximum number of characters from the context-free grammar (CFG) for our training set. We pad all shorter sequences with spaces.

### Optimization details

We train both the discriminator and generator using ADAM [9] with a fixed learning rate and mini-batch size. Inspired by the work of [12], who use input noise to stabilize GAN training, for every input $x$ we form a vector $h$ such that its softmax (instead of being one-hot) places most of the probability on the correct character and spreads the remainder over the other characters. We then apply the Gumbel-softmax trick to generate a vector as in equation (4), and use this vector instead of $x$ throughout training. We train the generator and discriminator for a fixed number of mini-batch iterations. During training we linearly anneal the temperature of the Gumbel-softmax distribution from a large value (i.e., a very flat distribution) towards a smaller one (a more peaked distribution), and then keep it fixed until training ends.
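A minimal NumPy sketch of this input-noise trick, using an illustrative smoothing probability of 0.9 (the exact value used is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)

def soften(index, k, p_correct=0.9):
    """Return logits h whose softmax places p_correct on the true character
    `index` and spreads the rest uniformly (p_correct = 0.9 is illustrative)."""
    probs = np.full(k, (1.0 - p_correct) / (k - 1))
    probs[index] = p_correct
    return np.log(probs)  # softmax(log(probs)) recovers probs exactly

k = 5
h = soften(index=2, k=k)
g = rng.gumbel(size=k)                   # Gumbel(0, 1) noise
y = np.exp(h + g) / np.exp(h + g).sum()  # Gumbel-softmax sample, eq. (4), beta = 1
```

The resulting vector `y` is a noisy, fully continuous stand-in for the one-hot input, which keeps the discriminator's inputs on the same smooth support as the generator's outputs.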

### Learning a CFG

Figure 3 (a) shows the generator and discriminator losses throughout training for this setting. We experimented with increasing the size of the generated samples, as this has been reported to improve GAN modeling [7]; the result is shown in Figure 3 (b). We also experimented with varying the temperature only for the input vectors while fixing the generator temperature (Figure 3 (c)). Finally, we tried introducing random noise only into the hidden state and allowing the network to learn an initial cell state (Figure 3 (d)).

Figure 4 shows the text generated by the MLE and GAN models. Each row is a sample from one of the models (we include the blank space character, as training inputs shorter than the maximum length are padded with spaces). While the MLE LSTM is not strictly a generative model in the sense of drawing a discrete sequence from a distribution, we include it for reference. We can see that our GAN models learn to generate alternating sequences of $x$'s and operators, similar to the MLE result. Specifically, the 4th, 10th, and 17th rows of plot (a) show samples that are very close to the training data, and many such examples exist for the remaining plots as well.

We believe that these results, as a proof of concept, show strong promise for training GANs to generate discrete sequence data. Further, we believe that incorporating recent advances in GANs, such as training via variational divergence minimization [10] or via density ratio estimation [13], could yield further improvements. We aim to experiment with these in future work.

## References

- (1) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2016.
- (2) Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
- (3) Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
- (4) Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415, 2016.
- (5) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- (6) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- (7) Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.
- (8) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- (9) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (10) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
- (11) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- (12) Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
- (13) Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
- (14) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473, 2016.