1 Introduction
Recurrent neural network (RNN) based techniques such as language models are the most popular approaches for text generation. These RNN-based text generators rely on maximum likelihood estimation (MLE) solutions such as teacher forcing [11] (i.e. the model is trained to predict the next item given all previous observations); however, it is well known in the literature that MLE is a simplistic objective for this complex NLP task [15]. MLE-based methods suffer from exposure bias [20], which means that at training time the model is exposed to gold data only, but at test time it observes its own predictions.
However, GANs, which are based on an adversarial loss function and consist of a generator network and a discriminator network, suffer less from these problems. GANs provide a better image generation framework than traditional MLE-based methods and have achieved substantial success in computer vision for generating realistic and sharp images. This success motivated researchers to apply the framework to NLP applications as well.
GANs have been exploited recently in various NLP applications such as machine translation [24, 25], dialogue models [15], question answering [26], and natural language generation [7, 20, 19, 13, 28, 29]. However, applying GANs to NLP is challenging due to the discrete nature of text: backpropagation is not feasible for discrete outputs, and it is not straightforward to pass gradients through the discrete output words of the generator. The existing GAN-based solutions can be categorized according to the technique they use to handle the discrete nature of text: reinforcement learning (RL) based methods, latent space based solutions, and approaches based on a continuous approximation of discrete sampling. Several versions of the RL-based techniques have been introduced in the literature, including SeqGAN [27], MaskGAN [5], and LeakGAN [8]. However, they often need pretraining and are computationally more expensive than the methods of the other two categories. Latent space based solutions derive a latent space representation of the text using an autoencoder (AE) and attempt to learn the data manifold of that space [13]. Another approach for generating text with GANs is to find a continuous approximation of the discrete sampling by using the Gumbel-Softmax technique [14] or by approximating the non-differentiable argmax operator [28] with a continuous function.

In this work, we introduce TextKDGAN as a new solution, based on knowledge distillation, for the main bottleneck of using GANs for text generation. Knowledge distillation is a technique that transfers the knowledge of the softened output of a teacher model to a student model [9]. Our solution uses an AE (Teacher) to derive a smooth representation of the real text. This smooth representation is fed to the TextKDGAN discriminator instead of the conventional one-hot representation. The generator (Student) tries to learn the manifold of the softened smooth representation of the AE. We show that TextKDGAN outperforms the conventional GAN-based text generators that do not need pretraining.

The remainder of the paper is organized as follows. The next two sections review preliminary background on generative adversarial networks and related work in the literature. The proposed method is presented in Section 4. Section 5 discusses the experimental details. Finally, Section 6 concludes the paper.
2 Background
Generative adversarial networks include two separate deep networks: a generator and a discriminator. The generator takes in a random variable $z$ following a distribution $P_z(z)$ and attempts to map it to the data distribution $P_x(x)$. The output distribution of the generator is expected to converge to the data distribution during training. The discriminator, on the other hand, is expected to discern real samples from generated ones by outputting ones and zeros, respectively. During training, the generator produces samples and the discriminator classifies them, each adversarially affecting the performance of the other. To this end, an adversarial loss function is employed for training [6]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_x(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim P_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$
This is a two-player minimax game for which a Nash equilibrium point should be derived. Finding the solution of this game is non-trivial, and a large body of literature is dedicated to it [22].
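As a minimal sketch of the objective in Eq. (1), the value function of [6], E[log D(x)] + E[log(1 - D(G(z)))], can be estimated from discriminator outputs on a batch. The function name `gan_value` and the numbers below are illustrative assumptions, not part of the original work.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Made-up discriminator outputs for illustration only.
confident = gan_value([0.99, 0.98], [0.01, 0.02])  # D separates real from fake
confused = gan_value([0.5, 0.5], [0.5, 0.5])       # D cannot tell them apart
```

The discriminator tries to push this value up (toward 0), while the generator tries to pull it down. When the discriminator outputs 0.5 everywhere, the value equals -2 log 2.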
As stated, using GANs for text generation is challenging because of the discrete nature of text. To clarify the issue, Figure 1 depicts a simplistic architecture for GAN-based text generation. The main bottleneck of the design is the argmax operator, which is not differentiable and blocks the gradient flow from the discriminator to the generator.
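$$\min_G \; \mathbb{E}_{z \sim P_z(z)}\left[\log\left(1 - D\left(\arg\max(G(z))\right)\right)\right] \tag{2}$$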
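The effect of the argmax operator can be illustrated numerically. The sketch below is an assumption-laden toy (three made-up logits, helper names `softmax` and `one_hot_argmax` chosen here): a small change in the generator logits moves the softmax output, but leaves the one-hot argmax output unchanged, so no useful gradient can flow through it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_argmax(probs):
    """Discrete output: 1 at the most probable index, 0 elsewhere."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.1]                 # made-up generator logits
eps = 1e-3
bumped = [logits[0] + eps] + logits[1:]  # perturb the first logit slightly

# Change in the continuous (softmax) output: non-zero.
soft_delta = [b - a for a, b in zip(softmax(logits), softmax(bumped))]
# Change in the discrete (argmax) output: exactly zero.
hard_delta = [b - a for a, b in zip(one_hot_argmax(softmax(logits)),
                                    one_hot_argmax(softmax(bumped)))]
```

Because `hard_delta` is zero for any sufficiently small perturbation, the derivative of the discrete output with respect to the logits is zero almost everywhere.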
2.1 Knowledge Distillation
Knowledge distillation has been studied in model compression, where the knowledge of a large, cumbersome model is transferred to a small model for easy deployment. Several studies have examined this knowledge transfer technique [9, 21]. It starts by training a big teacher model (or ensemble model) and then trains a small student model that tries to mimic characteristics of the teacher model, such as its hidden representations [21], its output probabilities [9], or, in neural machine translation, the sentences generated by the teacher model [12]. The first teacher-student framework for knowledge distillation was proposed in [9] by introducing the softened teacher's output. In this paper, we propose a GAN framework for text generation where the generator (Student) tries to mimic the reconstructed output representation of an autoencoder (Teacher) instead of mapping to a conventional one-hot representation.

2.2 Improved WGAN
Generating text with pure GANs is inspired by the improved Wasserstein GAN (IWGAN) work [7]. In IWGAN, a character-level language model is developed based on adversarial training of a generator and a discriminator, without using any extra element such as policy gradient reinforcement learning [23]. The generator produces a softmax vector over the entire vocabulary. The discriminator is responsible for distinguishing between the one-hot representations of the real text and the softmax vector of the generated text. The IWGAN method is described in Figure 2. A disadvantage of this technique is that the discriminator can tell the one-hot input apart from the softmax input very easily. Hence, the generator has a hard time fooling the discriminator, and a vanishing gradient problem is highly probable.
3 Related Work
A new version of Wasserstein GAN for text generation, using a gradient penalty for the discriminator, was proposed in [7]. Their generator is a CNN generating fixed-length texts. The discriminator is another CNN receiving 3D tensors as input sentences. It determines whether the tensor is coming from the generator or sampled from the real data. The real sentences and the generated ones are represented using one-hot and softmax representations, respectively.
A similar approach was proposed in [20] with an RNN-based generator. They used a curriculum learning strategy [2] to produce sequences of gradually increasing lengths as training progresses. In [19], an RNN is trained to generate text with a GAN using curriculum learning. The authors proposed a procedure called teacher helping, which helps the generator to produce long sequences by conditioning on shorter ground-truth sequences.
All these approaches use a discriminator to distinguish the generated softmax output from one-hot real data, as in Figure 2, which is a clear downside. The reason is that the discriminator receives inputs of different representations: a one-hot vector for real data and a probabilistic vector output from the generator. This makes the discrimination rather trivial.
AEs have been exploited along with GANs in different architectures for computer vision applications, such as AAE [17], ALI [4], and HALI [1]. Similarly, AEs can be used with GANs for generating text. For instance, an adversarially regularized AE (ARAE) was proposed in [13]. The generator is trained in parallel with an AE to learn a continuous version of the code space produced by the AE encoder. A discriminator is then responsible for distinguishing between the encoded hidden code and the continuous code of the generator. In essence, this approach generates a continuous distribution corresponding to the encoded code of the text.
4 Methodology
AEs can be useful for denoising text, transferring it to a code space (encoding), and then reconstructing the original text from the code. AEs can be combined with GANs in order to improve the generated text. In this section, we introduce a technique that uses AEs to replace the conventional one-hot representation [7] with a continuous softmax representation of real data for discrimination.
4.1 Distilling output probabilities of AE to TextKDGAN generator
As stated, in the conventional text-based discrimination approach [7], the real and generated inputs of the discriminator have different types (one-hot and softmax), and it can simply tell them apart. One way to avoid this issue is to derive a continuous smooth representation of words rather than their one-hot representation, and to train the discriminator to differentiate between the continuous representations. In this work, we use a conventional AE (Teacher) to replace the one-hot representation with the softmax reconstructed output, which is a smooth representation that yields smaller variance in gradients [9]. The proposed model is depicted in Figure 3. As seen, instead of the one-hot representation of the real words, we feed the softened reconstructed output of the AE to the discriminator. This technique makes the discrimination much harder for the discriminator. The GAN generator (Student) with softmax output tries to mimic the AE output distribution instead of the conventional one-hot representations used in the literature.

4.2 Why TextKDGAN should Work Better than IWGAN
Suppose we apply IWGAN to a language vocabulary of size two: words $w_1$ and $w_2$. The one-hot representations of these two words (as two points in Cartesian coordinates) and the span of the generated softmax outputs (as a line segment connecting them) are depicted in the left panel of Figure 4. As is evident graphically, the task of the discriminator is to discriminate the points from the line connecting them, which is a rather simple task.
Now, let us consider the TextKDGAN idea using the two-word language example. As depicted in Figure 4 (right panel), the output locus of the TextKDGAN decoder would be two red line segments instead of two points (as in the one-hot case). The two line segments lie on the output locus of the generator, which makes the generator more successful in fooling the discriminator.
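The two-word argument can be made concrete with a small sketch. Everything here is an illustrative assumption: the logits are made-up numbers rather than outputs of trained models, and `looks_one_hot` is a hypothetical stand-in for a discriminator that only checks whether its input is a vertex of the simplex.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def looks_one_hot(p, tol=1e-9):
    """Trivial 'discriminator': flags inputs whose largest entry equals 1."""
    return abs(max(p) - 1.0) < tol

# IWGAN setting: real data are the two one-hot points.
real_one_hot = [[1.0, 0.0], [0.0, 1.0]]
# Generator outputs lie strictly inside the segment between them.
generated = [softmax([1.5, -0.5]), softmax([-2.0, 1.0])]
# TextKDGAN setting: real data are softened AE reconstructions
# (hypothetical logits standing in for a trained decoder).
softened_real = [softmax([3.0, 0.0]), softmax([0.0, 3.0])]

iwgan_separable = (all(looks_one_hot(p) for p in real_one_hot)
                   and not any(looks_one_hot(p) for p in generated))
kdgan_separable = any(looks_one_hot(p) for p in softened_real)
```

In the IWGAN setting the trivial rule separates real from generated inputs perfectly, whereas with softened real inputs the same rule no longer applies, because both kinds of input lie on the same line segment.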
4.3 Model Training
We train the AE and TextKDGAN simultaneously. In order to do so, we break down the objective function into three terms: (1) a reconstruction term for the AE, (2) a discriminator loss function with gradient penalty, (3) an adversarial cost for the generator. Mathematically,
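$$
\begin{aligned}
&\min_{\phi, \psi} L_{AE}(\phi, \psi) = \min_{\phi, \psi} \left\lVert x - \mathrm{softmax}\left(\mathrm{dec}_\psi(\mathrm{enc}_\phi(x))\right) \right\rVert^2 \\
&\min_{w} L_{D}(w) = \min_{w} \; -\mathbb{E}_{x \sim P_x}\left[f_w\left(\mathrm{softmax}(\mathrm{dec}_\psi(\mathrm{enc}_\phi(x)))\right)\right] + \mathbb{E}_{z \sim P_z}\left[f_w(G_\theta(z))\right] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} f_w(\hat{x}) \rVert_2 - 1\right)^2\right] \\
&\min_{\theta} L_{G}(\theta) = \min_{\theta} \; -\mathbb{E}_{z \sim P_z}\left[f_w(G_\theta(z))\right]
\end{aligned}
\tag{3}
$$

where $\mathrm{enc}_\phi$ and $\mathrm{dec}_\psi$ denote the AE encoder and decoder, $f_w$ the discriminator, $G_\theta$ the generator, and $\lambda$ the gradient penalty coefficient.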
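These losses are minimized alternately to optimize different parts of the model. We employ the gradient penalty approach of IWGAN [7] for training the discriminator. The gradient penalty term requires the gradient norm of the discriminator at random samples $\hat{x} \sim P_{\hat{x}}$. Following the proposal in [7], these random samples can be obtained by sampling uniformly along the line connecting pairs of generated and real data samples:

$$\hat{x} = \epsilon\, \tilde{x} + (1 - \epsilon)\, G(z), \qquad \epsilon \sim U[0, 1] \tag{4}$$

where $\tilde{x}$ is the real-data input of the discriminator (here, the softened AE reconstruction) and $G(z)$ is a generated sample.

The complete training algorithm is described in Algorithm 1.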
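The interpolation and gradient penalty steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the training code of the paper: the critic `toy_critic` is a hypothetical scalar function, gradients are approximated by central finite differences instead of automatic differentiation, the sample vectors are made up, and the coefficient `lam=10.0` follows the default reported for IWGAN [7] rather than a value stated in this section.

```python
import math
import random

def interpolate(real, fake, eps):
    """x_hat = eps * real + (1 - eps) * fake, elementwise."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]

def grad_norm(f, x, h=1e-6):
    """L2 norm of the gradient of f at x, via central finite differences."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += h
        lo[i] -= h
        grads.append((f(hi) - f(lo)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grads))

def critic_loss(f, real, fake, x_hat, lam=10.0):
    """Single-sample critic loss: -f(real) + f(fake) + lam * (||grad f(x_hat)|| - 1)^2."""
    penalty = (grad_norm(f, x_hat) - 1.0) ** 2
    return -f(real) + f(fake) + lam * penalty

def toy_critic(x):
    """Hypothetical critic used only for illustration."""
    return math.tanh(2.0 * x[0] - 1.0 * x[1])

rng = random.Random(0)
softened_real = [0.9, 0.1]   # made-up AE output over a two-word vocabulary
generated = [0.4, 0.6]       # made-up generator output
x_hat = interpolate(softened_real, generated, rng.random())
loss = critic_loss(toy_critic, softened_real, generated, x_hat)
```

Since both endpoints are probability vectors, every interpolated sample is also a probability vector. In practice the gradient would be obtained by backpropagation through the discriminator network and the loss averaged over a batch.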
5 Experiments
5.1 Dataset and Experimental Setup
We carried out our experiments on two different datasets: the Google 1 billion benchmark language modeling data (http://www.statmt.org/lmbenchmark/) and the Stanford Natural Language Inference (SNLI) corpus (https://nlp.stanford.edu/projects/snli/). Our text generation is performed at character level with a sentence length of 32. For the Google dataset, we used the first 1 million sentences and extracted the 100 most frequent characters to build our vocabulary. For the SNLI dataset, we used the entire preprocessed training data (https://github.com/aboev/araetf/tree/master/data_snli), which contains 714,667 sentences in total, and the resulting vocabulary has 86 characters. The AE uses one layer with 512 LSTM cells [10] for both the encoder and the decoder. We train the autoencoder using the Adam optimizer with learning rate 0.001, $\beta_1 = 0.9$, and $\beta_2 = 0.9$. For decoding, the output from the previous time step is used as the input to the next time step. The hidden code is also used as an additional input at each time step of decoding. Greedy search is applied to get the best output [13]. We keep the same CNN-based generator and discriminator with residual blocks as in [7]. The discriminator is trained for 5 iterations per generator iteration. We train the generator and the discriminator using the Adam optimizer with learning rate 0.0001, $\beta_1 = 0.5$, and $\beta_2 = 0.9$.
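The character-level preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the special tokens `<pad>` and `<unk>`, the helper names, and the two-sentence corpus are assumptions, and whether padding or unknown symbols count toward the reported vocabulary sizes is not stated in the text.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_char_vocab(sentences, max_chars):
    """Map the max_chars most frequent characters to integer ids."""
    counts = Counter(ch for s in sentences for ch in s)
    chars = [ch for ch, _ in counts.most_common(max_chars)]
    return {tok: i for i, tok in enumerate([PAD, UNK] + chars)}

def encode(sentence, vocab, length=32):
    """Truncate or pad a sentence to a fixed length and map it to ids."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in sentence[:length]]
    ids += [vocab[PAD]] * (length - len(ids))
    return ids

def one_hot(ids, vocab_size):
    """One-hot rows, the conventional real-data input of the discriminator."""
    return [[1.0 if j == i else 0.0 for j in range(vocab_size)] for i in ids]

corpus = ["a man", "a dog"]                      # toy corpus
vocab = build_char_vocab(corpus, max_chars=3)
ids = encode("a cat", vocab, length=8)
rows = one_hot(ids, len(vocab))
```

With the settings reported above, `max_chars` would be 100 for the Google data and `length` would be 32.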
We use the BLEU-N score to evaluate our techniques. The BLEU-N score is calculated according to the following equation [16, 3, 18]:

$$\text{BLEU-N} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{5}$$

where $p_n$ is the probability of $n$-gram, $w_n = \frac{1}{N}$, and BP is the brevity penalty. We calculate BLEU-n scores for n-grams without a brevity penalty [29]. We train all the models for 200,000 iterations and report the results with the best BLEU-N scores in the generated texts. To calculate the BLEU-N scores, we generate ten batches of sentences as candidate texts, i.e. 640 sentences (32-character sentences), and use the entire test set as reference texts.

5.2 Experimental Results
The results of the experiments are shown in Tables 1 and 2. As seen in these tables, the proposed TextKDGAN approach yields significant improvements in terms of BLEU-2, BLEU-3, and BLEU-4 scores over the IWGAN [7] and ARAE [13] approaches. This suggests that the softened smooth output of the decoder is more useful for learning a better discriminator than the traditional one-hot representation. Moreover, the BLEU scores are lower, and the improvement is smaller, for the Google dataset than for the SNLI dataset. The reason might be that the sentences in the Google dataset are more diverse and complicated. Finally, note that both text-based discrimination approaches, the one-hot discrimination in IWGAN and our proposed method, perform better than the traditional code-based ARAE technique [13].
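Table 1: BLEU-N scores on the Google dataset.

| Model | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|
| IWGAN | 0.50 | 0.27 | 0.11 |
| ARAE | 0.13 | 0.02 | 0.00 |
| TextKDGAN | 0.51 | 0.29 | 0.13 |

Table 2: BLEU-N scores on the SNLI dataset.

| Model | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|
| IWGAN | 0.57 | 0.44 | 0.30 |
| ARAE | 0.37 | 0.27 | 0.17 |
| TextKDGAN | 0.62 | 0.50 | 0.38 |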
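For reference, a BLEU-N score without a brevity penalty, i.e. the geometric mean of clipped n-gram precisions with uniform weights, can be sketched as below. This is a simplified illustration rather than the evaluation code behind the reported numbers: tokenization by whitespace, corpus-level pooling of counts, and the function names are all assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidates, references, n):
    """Clipped n-gram precision of the candidates against all references."""
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped, total = 0, 0
    for cand in candidates:
        counts = Counter(ngrams(cand, n))
        clipped += sum(min(c, max_ref[g]) for g, c in counts.items())
        total += sum(counts.values())
    return clipped / total if total else 0.0

def bleu_n(candidates, references, N):
    """exp(sum_n (1/N) log p_n), with the brevity penalty fixed to 1."""
    precisions = [modified_precision(candidates, references, n)
                  for n in range(1, N + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / N)

refs = ["a man is standing on the beach".split()]
cand = ["a man is walking on the beach".split()]
score = bleu_n(cand, refs, 2)
```

In this toy case the unigram precision is 6/7 and the bigram precision is 4/6, so the BLEU-2 score is their geometric mean.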
Some examples of generated text from the SNLI experiment are listed in Table 3. As seen, the text generated by the proposed TextKDGAN approach is more meaningful and contains more correct words than that of IWGAN [7].
Table 3: Examples of generated sentences from the SNLI experiment.

| IWGAN | TextKDGAN |
|---|---|
| The people are laying in angold | Two people are standing on the s |
| A man is walting on the beach | A woman is standing on a bench . |
| A man is looking af tre walk aud | People have a ride with the comp |
| A man standing on the beach | A woman is sleeping at the brick |
| The man is standing is standing | Four people eating food . |
| A man is looking af tre walk aud | The dog is in the main near the |
| The man is in a party . | A black man is going to down the |
| Two members are walking in a hal | These people are looking at the |
| A boy is playing sitting . | the people are running at some l |
We also provide the training curves of the Jensen-Shannon distances (JSD) between the n-grams of the generated sentences and those of the training (real) sentences in Figure 5. The distances are derived from the SNLI experiments and calculated as in [7], that is, from the log-probabilities of the n-grams of the generated and the real sentences. As depicted in the figure, the TextKDGAN approach further minimizes the JSD compared to the literature methods [7, 13]. In conclusion, our approach learns a more powerful discriminator, which in turn drives the generator toward a distribution close to the real data distribution.
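A Jensen-Shannon divergence between n-gram distributions can be sketched as follows. This is an illustrative approximation of the measure described above, not the exact procedure of [7]: the use of character n-grams, the natural logarithm, the helper names, and the toy strings are assumptions.

```python
import math
from collections import Counter

def ngram_distribution(sentences, n):
    """Empirical distribution of character n-grams over non-empty sentences."""
    counts = Counter(s[i:i + n] for s in sentences
                     for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def kl(a, m):
    """KL(a || m), summed over the support of a."""
    return sum(p * math.log(p / m[g]) for g, p in a.items() if p > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: 0.5 * KL(p || m) + 0.5 * KL(q || m)."""
    support = set(p) | set(q)
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = ngram_distribution(["abab"], 2)
same = ngram_distribution(["abab"], 2)
disjoint = ngram_distribution(["cdcd"], 2)

jsd_same = jsd(real, same)          # identical distributions
jsd_disjoint = jsd(real, disjoint)  # no shared n-grams
```

With the natural logarithm the divergence ranges from 0 for identical distributions to log 2 for distributions with no n-grams in common, so lower values indicate generated text whose n-gram statistics are closer to the real data.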
5.3 Discussion
The results of our experiments show the superiority of our TextKDGAN method over other conventional GAN-based techniques. We compared our technique with GAN-based generators that do not need pretraining, which explains why we have not included the RL-based techniques in the results. We showed the power of continuous smooth representations over the well-known tricks for working around the discontinuity of text in GANs. Using AEs in TextKDGAN adds another important dimension to our technique, namely the latent space, which can be modeled and exploited as a separate signal for discriminating the generated text from the real data. It is worth mentioning that our observations during the experiments show that training text-based generators is much easier than training code-based techniques such as ARAE. Moreover, we observed that the gradient penalty term plays a significant part in reducing mode collapse in the text generated by the GAN. Furthermore, in this work, we focused on character-based techniques; however, TextKDGAN is applicable to word-based settings as well. Bear in mind that pure GAN-based text generation techniques are still at an early stage and are not very powerful in terms of learning the semantics of complex datasets and long sentences. This might be because CNNs lack the capacity to capture long-term information. To address this problem, RL can be employed as a next step to empower pure GAN-based techniques such as TextKDGAN.
6 Conclusion and Future Work
In this work, we introduced TextKDGAN as a new solution, based on knowledge distillation, for the main bottleneck of using GANs for generating text, which is the discontinuity of text. Our solution uses an AE (Teacher) to derive a continuous smooth representation of the real text. This smooth representation is distilled to the GAN discriminator instead of the conventional one-hot representation. We demonstrated the rationale behind this approach, which is to make the task of discriminating between real and generated texts more difficult and consequently to provide a richer signal to the generator. At training time, the TextKDGAN generator (Student) tries to learn the manifold of the smooth representation, which can later be mapped to the real data distribution by applying the argmax operator. We evaluated TextKDGAN on two benchmark datasets using BLEU-N scores, JSD measures, and the quality of the generated text. The results showed that the proposed TextKDGAN approach outperforms traditional GAN-based text generation methods that do not need pretraining, such as IWGAN and ARAE. Finally, we summarize our plan for future work as follows:

- We evaluated TextKDGAN at the character level. However, the performance of our approach at the word level needs to be investigated.
- The current TextKDGAN is implemented with a CNN-based generator. We might be able to improve TextKDGAN by using RNN-based generators.
- TextKDGAN is a core technique for text generation and, similar to other pure GAN-based techniques, it is not very powerful in generating long sentences. RL can be used as a tool to address this weakness.
References
 [1] Belghazi, M.I., Rajeswar, S., Mastropietro, O., Rostamzadeh, N., Mitrovic, J., Courville, A.: Hierarchical adversarially learned inference. arXiv preprint arXiv:1802.01071 (2018)

 [2] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning. pp. 41–48. ACM (2009)
 [3] Cer, D., Manning, C.D., Jurafsky, D.: The best lexical metric for phrase-based statistical MT system optimization. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 555–563. Association for Computational Linguistics (2010)
 [4] Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A.: Adversarially learned inference. arXiv preprint arXiv:1606.00704 (2016)
 [5] Fedus, W., Goodfellow, I., Dai, A.M.: Maskgan: Better text generation via filling in the _. arXiv preprint arXiv:1801.07736 (2018)
 [6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
 [7] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028 (2017)
 [8] Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., Wang, J.: Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624 (2017)
 [9] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

 [10] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation, 9(8):1735–1780 (1997)
 [11] Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280 (1989)
 [12] Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In: EMNLP. pp. 1317–1327 (2016)
 [13] Kim, Y., Zhang, K., Rush, A.M., LeCun, Y., et al.: Adversarially regularized autoencoders for generating discrete structures. arXiv preprint arXiv:1706.04223 (2017)
 [14] Kusner, M.J., Hernández-Lobato, J.M.: Gans for sequences of discrete elements with the gumbel-softmax distribution. arXiv preprint arXiv:1611.04051 (2016)
 [15] Li, J., Monroe, W., Shi, T., Ritter, A., Jurafsky, D.: Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547 (2017)
 [16] Liu, C.W., Lowe, R., Serban, I.V., Noseworthy, M., Charlin, L., Pineau, J.: How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023 (2016)
 [17] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015)
 [18] Papineni, K., Roukos, S., Ward, T., Zhu, W.: Bleu: a method for automatic evaluation of machine translation. In: ACL. pp. 311–318 (2002)
 [19] Press, O., Bar, A., Bogin, B., Berant, J., Wolf, L.: Language generation with recurrent generative adversarial networks without pretraining. arXiv preprint arXiv:1706.01399 (2017)
 [20] Rajeswar, S., Subramanian, S., Dutil, F., Pal, C., Courville, A.: Adversarial generation of natural language. arXiv preprint arXiv:1705.10929 (2017)
 [21] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. In: ICLR (2015)
 [22] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Advances in Neural Information Processing Systems. pp. 2234–2242 (2016)
 [23] Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: NIPS. pp. 1057–1063 (1999)
 [24] Wu, L., Xia, Y., Zhao, L., Tian, F., Qin, T., Lai, J., Liu, T.Y.: Adversarial neural machine translation. arXiv preprint arXiv:1704.06933 (2017)
 [25] Yang, Z., Chen, W., Wang, F., Xu, B.: Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887 (2017)
 [26] Yang, Z., Hu, J., Salakhutdinov, R., Cohen, W.W.: Semi-supervised qa with generative domain-adaptive nets. arXiv preprint arXiv:1702.02206 (2017)
 [27] Yu, L., Zhang, W., Wang, J., Yu, Y.: Seqgan: Sequence generative adversarial nets with policy gradient. In: AAAI. pp. 2852–2858 (2017)
 [28] Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., Carin, L.: Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850 (2017)
 [29] Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., Yu, Y.: Texygen: A benchmarking platform for text generation models. arXiv preprint arXiv:1802.01886 (2018)