Codes for Category-aware Generative Adversarial Networks (AAAI 2020)
Generating multiple categories of texts is a challenging task and draws more and more attention. Since generative adversarial nets (GANs) have shown competitive results on general text generation, they are extended for category text generation in some previous works. However, the complicated model structures and learning strategies limit their performance and exacerbate the training instability. This paper proposes a category-aware GAN (CatGAN) which consists of an efficient category-aware model for category text generation and a hierarchical evolutionary learning algorithm for training our model. The category-aware model directly measures the gap between real samples and generated samples on each category; reducing this gap then guides the model to generate high-quality category samples. The Gumbel-Softmax relaxation further frees our model from complicated learning strategies for updating CatGAN on discrete data. Moreover, focusing only on the sample quality normally leads to the mode collapse problem, thus a hierarchical evolutionary learning algorithm is introduced to stabilize the training procedure and obtain the trade-off between quality and diversity while training CatGAN. Experimental results demonstrate that CatGAN outperforms most of the existing state-of-the-art methods.
Nowadays, category text generation has received more and more attention. Generating coherent and meaningful text with different categories will bring great benefits to many natural language processing applications, such as sentiment analysis and dialogue generation. Recently, the generative adversarial net (GAN), which adopts the discriminator to guide the generator, has been combined with reinforcement learning (RL) algorithms to generate discrete text data for general text generation, and some competitive results have been reported in previous works [28, 7, 2]. Compared with general text generation, which only focuses on obtaining high-quality text, category text generation aims at automatically generating a variety of controllable category text to fit task-specific applications. However, the category information of sentences cannot be easily controlled, and it is also difficult to design an appropriate training objective for different categories. Thus, category text generation is a more challenging task. A few works [26, 15] try to extend the general text generation models for category text generation. They mostly employ a long short-term memory (LSTM) network as the generator and combine auxiliary components (e.g., classifiers) with the RL algorithms on GANs to generate category text. The auxiliary components can help the model to focus on the category information.
The existing category text generation models have shown some positive results, but RL algorithms and auxiliary components complicate the learning strategy and the model, respectively, which may exacerbate some fundamental problems of GANs, including training instability and mode collapse. Firstly, most of the existing models [26, 15] heavily rely on RL algorithms, and some strategies, such as Monte Carlo search, are adopted to guide the discriminator for providing reward signals. These complicated strategies further increase the training difficulty of GANs. The auxiliary components may add further burden to the adversarial training, which also makes the training procedure more unstable. Secondly, the mode collapse problem is serious in the existing models, because the LSTM-based generator may lack enough expressive power, and category text generation, as a sequential decision process, also easily leads the generator to focus on a limited set of samples in the target distribution. For generating diversified samples, the temperature variable [2, 7] is employed to make GANs focus on either the quality or the diversity, but an improvement of diversity always leads to significant degradation of quality.
In this paper, a new category text generation framework, category-aware GAN (CatGAN), is proposed to deal with the above problems. CatGAN provides a category-aware model for category text generation and a hierarchical evolutionary learning algorithm for training the model and balancing the sample quality and diversity. Firstly, a novel category-aware model is proposed, which includes the category-wise relativistic objective to estimate the gap between the generated samples of a specific category and the corresponding real samples. The generator wants to make the generated samples as realistic as the real samples, while the discriminator is eager to enlarge this gap. The relativistic relation can guide our model to update more easily than strict ground-truth labels. A relational memory core (RMC) based generator, which promises a larger memory capacity and a better ability to capture long-term dependencies, is adopted to replace LSTM. Further, instead of RL algorithms, CatGAN employs the Gumbel-Softmax relaxation [10, 17] to generate the continuous approximation of the discrete generated samples. The continuous data allow the generator and the discriminator to be optimized directly during the adversarial procedure. Without any auxiliary components, the architecture of CatGAN is as concise as the classical GAN framework and only consists of one generator and one discriminator. Secondly, a hierarchical evolutionary learning algorithm is developed to train the category-aware model. The adversarial training can be seen as an evolutionary problem, and the discriminator provides the environment for a population of generators to evolve. To adapt to the category text generation task, the evolution procedure is designed with two stages. In the first temperature-oriented stage, the temperature is subtly controlled to maintain the category text quality during the improvement of diversity. In the second objective-oriented stage, various training objectives are adopted to narrow the distances between the generated data and the real data from different perspectives on each category. Only the well-performing generator is preserved, so the generated samples remain diverse and high-quality. Finally, although evaluation metrics for quality are well established, evaluation metrics for diversity have not been well explored. This paper proposes a new evaluation metric of diversity based on the repeatability of the generated samples.
In summary, our contributions are as follows:
A category-aware model is proposed for generating category text, which accurately takes the gap between real samples and generated samples on each category as an efficient learning signal.
A hierarchical evolutionary learning algorithm is designed to train the category-aware model, and it specializes in text generation, making the generated samples more diverse and high-quality.
An effective metric is presented to evaluate the sample diversity. Experimental results on synthetic and real data demonstrate that our model achieves a new state-of-the-art performance on both category text generation and general text generation.
Traditional recurrent neural network (RNN) based text generation models always suffer from the exposure bias problem [9, 1]. Different from these RNN-based models, which are trained by maximum likelihood estimation (MLE), GAN introduces a minimax game between the generator and the discriminator. However, GAN is designed to output differentiable data, which conflicts with discrete text generation.
The same RL algorithm is adopted in SeqGAN and LeakGAN to solve the above problem, and the discriminator can guide the generator by the reward signal. However, LeakGAN shows that the reward signal is not sufficiently informative. MaskGAN adopts the actor-critic algorithm for filling in missing text conditioned on the surrounding context. RankGAN replaces the original binary classifier with a ranking model as the discriminator. Approximation methods are another way to handle the non-differentiability of discrete data. TextGAN and FM-GAN apply an annealed softmax to approximate the argmax operation. Gu et al. and RelGAN adopt the Gumbel-Softmax relaxation to approximate the categorical distribution, and this relaxation method helps to train GANs and improve the generation quality.
The above methods focus on general text generation, and category text generation is drawing more attention. CSGAN proposes a descriptor which consists of a discriminator and an auxiliary classifier; the classifier distinguishes the sentence category to guide the generator. The adversarial procedure of CSGAN is similar to SeqGAN, and the less-informative reward signal limits the model performance. SentiGAN contains multiple generators, and each generator aims at generating the samples of a specific sentiment label. However, as the number of categories grows, the multiple generators will significantly raise the number of trainable parameters, which may reduce the efficiency and amplify the training instability. Experiments will show that the proposed CatGAN is more effective than the previous methods using auxiliary components.
Recently, the evolutionary learning algorithm was first introduced to optimize the adversarial model for image generation. For generating better category text, both the quality and the diversity should be considered. CatGAN makes the first attempt to solve category text generation with the evolutionary learning algorithm. Our hierarchical evolutionary learning algorithm is designed with two stages, the temperature-oriented stage and the objective-oriented stage, to explore the possible solutions of the generator for improving the performance on the sample quality and diversity.
The category text generation task is denoted as follows. Given a dataset with multiple categories, suppose we want to generate a sentence of a specific category. A parameterized generator is trained to generate this sentence token by token, where each token is drawn from the vocabulary of candidate tokens. In order to guide the generator effectively, a parameterized discriminator also needs to be trained to provide a learning signal for updating the generator once the whole sentence has been generated.
The overall framework of CatGAN is shown in Fig. 1 (a). CatGAN consists of two core parts, the category-aware model and the hierarchical evolutionary learning algorithm.
With the help of the category-wise relativistic objective, the proposed category-aware model employs an RMC-based generator to generate texts with a specific category to fool the discriminator, while the discriminator is trained to discriminate between the real samples and the generated samples for each category. In our model, the Gumbel-Softmax relaxation enables the gradients to pass back to the generator from the discriminator directly. For training the model and boosting the performance, this paper proposes the hierarchical evolutionary learning algorithm, which evolves a population of generators via combining various mutation strategies in a given environment (the discriminator). At the end of the evolutionary learning algorithm, the best-performing generator is preserved to generate realistic sentences with the given category.
The category-aware model is shown in Fig. 1 (b), which is guided by a novel category-wise relativistic objective to generate category samples. It includes a generator and a basic CNN-based discriminator.
In the standard GAN, the discriminator is trained on the ground-truth labels to predict the probability that the input data are real. By this training method, the discriminator cannot provide an informative signal to update the generator. Thus, this paper proposes a novel training objective based on the relativistic relation between real data and generated data on each category.
Formally, and denote the real data sampled from the real data distribution and the generated data sampled from the generated data distribution on the category , respectively. The category-wise relativistic objective contains the summed category loss and the enhanced real-fake loss. For the discriminator objective, it is defined as follows:
where and are sampled from the real data distribution and the generated data distribution on all categories, respectively. On the right-hand side, the first term measures the distance between the real data and the generated data on each category, while the second term measures the overall distance on all categories. Similar to the form of RaGAN , is defined by:
where the relativistic relation is measured by:
Intuitively, the relativistic relation shows the gap between the probabilities of being real on the real samples and that on the generated samples. For each category, the generator wants to reduce this gap for making generated samples as realistic as real samples, while the discriminator wants to increase the probability that real samples are more realistic than generated samples. Generally, the generator objective can be set to . Compared with the standard GAN objective, the category-wise relativistic objective can efficiently train our model for category text generation.
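The structure described above (a summed per-category term plus one overall real-fake term) can be illustrated with a small numerical sketch. It assumes a RaGAN-style relativistic average loss on raw discriminator scores; the function names, the exact averaging form, and the toy scores are assumptions made for illustration and are not taken from the paper's equations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean(xs):
    return sum(xs) / len(xs)

def relativistic_d_loss(real_scores, fake_scores):
    """Assumed RaGAN-style discriminator loss on raw (pre-sigmoid) scores.

    Real samples should score above the average fake sample,
    and fake samples below the average real sample.
    """
    avg_real, avg_fake = mean(real_scores), mean(fake_scores)
    real_term = mean([-math.log(sigmoid(r - avg_fake)) for r in real_scores])
    fake_term = mean([-math.log(1.0 - sigmoid(f - avg_real)) for f in fake_scores])
    return real_term + fake_term

def category_wise_d_loss(real_by_cat, fake_by_cat):
    """Sum of per-category losses plus one loss over all categories pooled."""
    per_cat = sum(relativistic_d_loss(real_by_cat[c], fake_by_cat[c])
                  for c in real_by_cat)
    all_real = [s for c in real_by_cat for s in real_by_cat[c]]
    all_fake = [s for c in fake_by_cat for s in fake_by_cat[c]]
    return per_cat + relativistic_d_loss(all_real, all_fake)

# Toy scores for two categories (illustrative values only).
real = {0: [2.0, 1.5], 1: [1.0, 2.5]}
fake_bad = {0: [-2.0, -1.0], 1: [-1.5, -0.5]}   # easy to tell apart from real
fake_good = {0: [1.8, 1.6], 1: [1.2, 2.0]}      # close to real
loss_easy = category_wise_d_loss(real, fake_bad)
loss_hard = category_wise_d_loss(real, fake_good)
```

In this sketch the discriminator loss is small when generated samples score far below real ones and grows as the generator closes the gap, which is the behaviour the relativistic relation is meant to capture.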
Since the LSTM-based generator may lack enough expressive power for text generation, the relational memory core (RMC) is employed as the generator. The basic concept of RMC is to consider a fixed set of memory slots (e.g., a memory matrix) and allow a self-attention mechanism to interact across these memories. The increased memory capacity boosts the expressive power and the ability to capture the category information. Given a new vocabulary observation at each time step, it is represented by an embedded token, and an embedded category is built to control the category information. Then, the input vector of the generator is obtained by a linear transformation on the concatenation of the embedded token and the embedded category:
where denotes the row-wise concatenation.
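As a rough illustration of this input construction, the sketch below concatenates a token embedding with a category embedding and applies a linear map. The dimensions, weights, and function names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def linear(weights, bias, vec):
    """Apply y = W vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def generator_input(token_emb, category_emb, weights, bias):
    """Concatenate the token and category embeddings, then project."""
    concat = token_emb + category_emb   # list concatenation
    return linear(weights, bias, concat)

token_emb = [0.5, -1.0, 2.0]    # hypothetical 3-d token embedding
category_emb = [1.0, 0.0]       # hypothetical 2-d category embedding
W = [[1, 0, 0, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1]]           # hypothetical 3x5 projection matrix
b = [0.0, 0.0, 0.1]
x_t = generator_input(token_emb, category_emb, W, b)
```

Because the category embedding enters every time step through the same projection, the category signal is available to the generator throughout the sentence.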
Considering a memory matrix , Fig. 1 (b) shows how is updated from by incorporating at time . As implied by the name of the multi-head dot product attention (MHA), a -heads RMC contains groups of linear transformation weights for query , key and value . The updated memory can be obtained by:
where denotes the softmax function on each row, and is the column dimension of . Then, the next memory and the generator output are obtained by:
where the two parameterized functions both represent combinations of skip connections, a multi-layer perceptron (MLP), and gated operations.
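The attention step of the memory update can be sketched with a single head. The snippet assumes the usual scaled dot-product form in which the memory slots act as queries while the keys and values come from the memory with the new input appended as an extra row; the helper names and the identity weights are illustrative assumptions, and the skip connections, MLP, and gating are omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attend(memory, x, wq, wk, wv):
    """One attention head: memory slots attend over the memory plus the input."""
    mem_and_input = memory + [x]                 # append x as an extra row
    q = matmul(memory, wq)
    k = matmul(mem_and_input, wk)
    v = matmul(mem_and_input, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]         # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0]]                # two memory slots
x = [1.0, 1.0]                                   # new input vector
new_memory, weights = attend(memory, x, identity, identity, identity)
```

A multi-head version would repeat this with separate weight groups per head and concatenate the results; the sketch keeps one head to make the shapes visible.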
Directly sampling from the multinomial distribution causes the non-differentiability problem, so the Gumbel-Softmax relaxation is employed with the generator to approximate the samples. The Gumbel-Max trick and the softmax function are used to sample discrete sentences and approximate the argmax function, respectively. The Gumbel-Max trick samples the discrete token at each time step by:
where is the value of the -th dimension of , and is sampled from the Gumbel distribution, where with . The differentiable approximation of argmax is obtained by:
where is the temperature variable. Since the softmax-like token is differentiable with respect to , it is used as the input of the discriminator instead of the discrete token .
The temperature can adjust the bias and variance of the approximation. A larger temperature brings lower bias but higher variance, allowing the generator to obtain samples of higher diversity but poorer quality.
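The two sampling steps can be sketched as follows. This is a minimal illustration using the textbook Gumbel-Softmax form, in which the noisy logits are divided by a temperature `tau` before the softmax; how the temperature enters the paper's own equation is not recoverable from this text, so that placement, along with the function names, is an assumption.

```python
import math
import random

def sample_gumbel(rng):
    """Draw g = -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep both logarithms finite
    return -math.log(-math.log(u))

def gumbel_max(logits, rng):
    """Return the sampled token index and the Gumbel-perturbed logits."""
    noisy = [o + sample_gumbel(rng) for o in logits]
    token = max(range(len(noisy)), key=noisy.__getitem__)
    return token, noisy

def relaxed_sample(noisy, tau):
    """Softmax of the noisy logits divided by tau (assumed convention)."""
    scaled = [v / tau for v in noisy]
    mx = max(scaled)
    exps = [math.exp(v - mx) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [1.0, 2.0, 0.5, 0.1]
token, noisy = gumbel_max(logits, rng)
soft_sharp = relaxed_sample(noisy, tau=0.1)    # close to a one-hot vector
soft_smooth = relaxed_sample(noisy, tau=10.0)  # close to uniform
```

Both relaxed vectors are differentiable functions of the logits, which is what lets the discriminator's gradient reach the generator; under this convention a smaller `tau` gives a tighter approximation of the discrete sample.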
Unlike previous text generation methods, which adopt the fixed temperature strategy and one adversarial objective to train a generator and a discriminator, our hierarchical evolutionary learning algorithm evolves a population of generators, with various temperatures and objectives, to play the adversarial game with the discriminator.
During the variation procedure, the individuals are mutated from the parents via asexual reproduction based on the combination of two kinds of strategies, the temperature mutation strategy (TMS) and the objective mutation strategy (OMS). TMS aims to maintain high sample quality as the diversity improves (i.e., as the temperature increases). To explore the possible solutions of the generator in the parameter space, OMS further stabilizes the model training process by leveraging various training objectives.
Previous works adopt a monotonically increasing function to raise the temperature over training iterations, determined by the target temperature, the current iteration, and the maximum number of iterations. The temperature obtains a subtle increment after each iteration. Although the monotone increase brings diversity, it leads to quality degradation. In one iteration, the subtle change of the temperature only affects the tightness of the relaxation for one batch of training samples, which cannot fully represent all training samples. Thus, various subtle changes have the potential to improve the sample quality. While the temperature increases overall, TMS aims at finding the optimal temperature change direction according to the quality in each iteration. The comparison between the evolutionary temperature and the monotone increasing temperature is given in Appendix C. Formally, the TMS set includes the temperatures with various change directions:
Besides, denotes the OMS set which contains several relativistic training objectives for the generator as follows:
where is another way to measure the relativistic relation for all categories, similar to the form of :
It is worth noting that adding more temperatures and objectives into the TMS and OMS sets is feasible. The Cartesian product of the two sets constitutes all mutation directions on each round of evolution. Each individual is mutated by one mutation direction. That is, the generator is updated by a specific training objective under a certain temperature.
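The enumeration of mutation directions can be sketched as below. The exponential schedule, the three-element temperature set (one step back, unchanged, one step ahead), and the placeholder objective names are all assumptions for illustration; the paper's actual sets are given by its equations, which are not reproduced here.

```python
from itertools import product

def exponential_temperature(tau_target, iteration, max_iterations):
    """Assumed monotone schedule: tau_target ** (n / N)."""
    return tau_target ** (iteration / max_iterations)

def temperature_candidates(tau_target, iteration, max_iterations, step=1):
    """Hypothetical TMS set: the schedule one step back, in place, one step ahead."""
    candidates = []
    for delta in (-step, 0, step):
        n = min(max(iteration + delta, 0), max_iterations)
        candidates.append(exponential_temperature(tau_target, n, max_iterations))
    return candidates

objectives = ["objective_a", "objective_b", "objective_c"]   # placeholder names
temps = temperature_candidates(tau_target=100.0, iteration=500, max_iterations=1000)

# Every (temperature, objective) pair is one mutation direction.
directions = list(product(temps, objectives))
```

With three temperatures and three objectives, each round of evolution produces nine offspring from one parent, and adding an element to either set enlarges the search accordingly.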
Since the goals of TMS and OMS are different, two stages, the temperature-oriented stage and the objective-oriented stage, are designed. Both the evaluation procedure and the selection procedure are divided into these two stages, where the temperature-oriented stage preliminarily filters the individuals for the objective-oriented stage. In the temperature-oriented stage, individuals with different temperature changes can only be compared under the same objective. For each objective, the individual with the optimal temperature is preserved for further selection. Then, the objective-oriented stage selects the best individual considering overall performance. Thus, the proposed learning algorithm, comprising these two stages, is considered a hierarchical evolutionary learning algorithm. Two properties, the sample diversity and quality, are mainly considered to measure the performance of each individual in the whole hierarchical evolutionary learning algorithm.
For evaluating the diversity, a new metric named is proposed. calculates the negative log-likelihood of generated samples on the generator by:
where is the generated sample distribution. The metric can capture the repeatability of the generated samples, which better reflects the mode collapse issue. When the generator can only learn some limited patterns from the real data or assigns all its probability mass to a small region, the value of the metric will become extremely low.
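The behaviour of this diversity metric can be illustrated with two toy generators. The sketch computes the negative log-likelihood of a generator's own samples under that generator; the per-token averaging, the `token_probs_fn` interface, and the two toy generators are assumptions for illustration.

```python
import math

def nll_div(samples, token_probs_fn):
    """Average per-token negative log-likelihood of samples under the generator."""
    total, count = 0.0, 0
    for sentence in samples:
        for t, token in enumerate(sentence):
            probs = token_probs_fn(sentence[:t])   # distribution given the prefix
            total -= math.log(probs[token])
            count += 1
    return total / count

def uniform_generator(prefix):
    """Toy generator that spreads probability evenly over four tokens."""
    return {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

def collapsed_generator(prefix):
    """Toy generator that puts almost all probability on one token."""
    return {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}

diverse_samples = [["a", "b", "c"], ["d", "c", "a"], ["b", "d", "d"]]
collapsed_samples = [["a", "a", "a"], ["a", "a", "a"], ["a", "a", "a"]]
score_diverse = nll_div(diverse_samples, uniform_generator)
score_collapsed = nll_div(collapsed_samples, collapsed_generator)
```

The collapsed generator repeats a pattern it assigns very high probability to, so its score is close to zero, while the spread-out generator scores much higher; this is the sense in which a very low value signals mode collapse.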
For evaluating the quality, in Eq. 3 can accurately measure the gap between the generated samples and the real samples. The higher , the better quality sentences that the generator can generate. Therefore, the evaluation scores in and are respectively defined as:
where can be tuned to balance the quality and diversity. aims at maintaining the quality when increases in , and wants to stabilize the training process and further balance the sample quality and diversity. Then, we expect to maximize and hierarchically.
The evaluation procedure of each stage corresponds to a selection process, which selects the individuals with larger evaluation scores. Firstly, for each objective in the OMS set, the individual with the largest temperature-oriented evaluation score is preserved together with its selected direction in the TMS set. Secondly, the surviving individuals are further filtered based on the objective-oriented evaluation score to obtain the best-performing generators as the new parents, which will participate in future adversarial training. Following the principle of “survival of the fittest”, the optimal temperature and training objective are selected for the generator, allowing the whole model to be trained as expected.
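The two-stage selection can be sketched as follows. The snippet assumes that the temperature-oriented score is the quality score alone and that the objective-oriented score is quality plus a weighted diversity term; the dictionary keys, the `gamma` parameter name, and the toy population are hypothetical.

```python
def hierarchical_select(individuals, gamma):
    """Two-stage selection over candidate generators.

    Each individual is a dict with hypothetical keys:
    'objective', 'temperature', 'quality', and 'diversity'.
    """
    # Stage 1 (temperature-oriented): per objective, keep the individual
    # whose temperature change gave the best quality.
    best_per_objective = {}
    for ind in individuals:
        current = best_per_objective.get(ind["objective"])
        if current is None or ind["quality"] > current["quality"]:
            best_per_objective[ind["objective"]] = ind

    # Stage 2 (objective-oriented): among the survivors, keep the one
    # with the best combined quality and diversity score.
    def combined(ind):
        return ind["quality"] + gamma * ind["diversity"]

    return max(best_per_objective.values(), key=combined)

population = [
    {"objective": "a", "temperature": 9.9, "quality": 0.60, "diversity": 1.0},
    {"objective": "a", "temperature": 10.1, "quality": 0.55, "diversity": 1.4},
    {"objective": "b", "temperature": 9.9, "quality": 0.58, "diversity": 3.0},
    {"objective": "b", "temperature": 10.1, "quality": 0.59, "diversity": 2.0},
]
best_low = hierarchical_select(population, gamma=0.001)   # quality dominates
best_high = hierarchical_select(population, gamma=1.0)    # diversity matters more
```

With a small weight the survivor is the highest-quality individual, while a large weight shifts the choice toward the more diverse one, which mirrors the role of the balance weight in the evaluation score.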
Some evaluation metrics have been widely used to measure the performance of text generation models from various aspects . Generally, the negative log-likelihood  is used to measure the quality on synthetic data. Two evaluation metrics, and , are adopted to measure the diversity, where is defined by Eq. 12, and  is the reversed direction of . and are defined as follows:
where is the generated data distribution and is the real data distribution. is sensitive to the quality, while and are sensitive to the diversity.
Since this negative log-likelihood cannot evaluate the quality on real data, the BLEU scores are adopted. For category text generation, the harmonic mean values of the metrics on each category are obtained to evaluate the performance. The repeatable experiment code is made publicly available for further research (https://github.com/williamSYSU/CatGAN).
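The harmonic-mean aggregation across categories mentioned above can be written in a few lines. The per-category scores below are made-up values used only to show the computation, not results from the paper.

```python
def harmonic_mean(values):
    """Harmonic mean of strictly positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty list of positive values")
    return len(values) / sum(1.0 / v for v in values)

# Illustrative per-category scores (not taken from the paper).
bleu_by_category = {"book": 0.8, "application": 0.4}
overall = harmonic_mean(list(bleu_by_category.values()))
```

Unlike the arithmetic mean (0.6 here), the harmonic mean is pulled toward the weaker category, so a model cannot score well overall by doing well on only one category.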
Both synthetic and real data are employed to test CatGAN, as in previous works. For category text generation, synthetic data include 20,000 samples, with each set of 10,000 samples obtained from a different oracle-LSTM, and real data include movie reviews (MR) and Amazon reviews (AR). MR has two sentiment classes (negative and positive), and AR includes two types of product reviews (book and application). For general text generation, synthetic data include 10,000 training samples generated by an oracle-LSTM, and real data contain EMNLP2017 WMT News (EN). All real data employ the same preprocessing as in LeakGAN. MR has 4,503 samples, including 3,152 samples for training and 1,351 samples for testing. For AR, each category of reviews includes 100,000 samples for training and 10,000 samples for testing, and each sample may have multiple sentences. EN contains 200,000 training samples and 10,000 test samples.
Several state-of-the-art methods are set as baselines in the experiments. For category text generation, SentiGAN and CSGAN are compared with CatGAN. For general text generation, four models are compared with CatGAN, including SeqGAN, RankGAN, LeakGAN, and RelGAN. The standard MLE training is used for all models before the adversarial training. For the models which need the temperature, the exponential function is adopted to increase the temperature, and the target temperature is set to 1 on synthetic data and 100 on real data. Adam is employed to optimize our model. CatGAN is run with 6 random seeds on all experiments, and the final scores are presented with means and standard deviations (see Appendix A for more detailed settings).
The synthetic data experiments are run with sequence lengths 20 and 40. The negative log-likelihood is used to measure the sample quality, and the ground-truth scores are 5.748 and 4.015 for the two sequence lengths, respectively. In Table 1, multiple generators help SentiGAN to obtain competitive results, and CatGAN outperforms SentiGAN by 0.327 and 0.323 on this metric, which illustrates that our model can obtain better quality on all categories.
| Method | CatGAN w/o H | CatGAN w/o T | CatGAN w/o O | CatGAN |
The real data experiments are conducted on MR and AR. After the same preprocessing, MR consists of 6,216 unique words with the maximum sentence length 15, and AR contains 6,416 unique words with the maximum sentence length 40. The results over generated samples are shown in Table 2 and Table 3. On MR, since SentiGAN is designed to generate sentiment text, it shows better results than CSGAN on the BLEU scores. On AR, the sufficient training samples improve the performance of all methods. CSGAN heavily relies on the auxiliary classifier and the RL algorithm, and it shows a significant quality degradation on BLEU compared with CatGAN when generating long sentences on AR. Compared with the baselines, CatGAN is not limited by the category type of data and obtains better BLEU scores on both MR and AR, which also shows that CatGAN can capture the dependencies in short and long sentences. Besides, on AR, CatGAN gets 3.104 on and 1.539 on , which illustrates that our model can maintain good diversity while significantly improving quality. Actually, optimizing based score function can lead to a better , yet CatGAN still consistently bests and an existing metric, .
The impact of is investigated on AR. In the right-hand side of Eq. 14, the first term used to measure the quality lies in , while the second term usually lies in . To balance the sample quality and diversity, is set to increase from 0 to 1. Table 4 shows that the increase of triggers the increase of diversity but the degradation of quality, especially for BLEU-4 and BLEU-5. In practice, is set to 0.001 for CatGAN on all experiments, since it shows a good trade-off between quality and diversity.
| a tired, talky a hole in the worst movie. | goes interesting, and the comedy were interesting. (Wrong category) | the premise is intriguing but quickly becomes distasteful and creepy. |
| a touching, and politically potent piece of work, a film. | it ’s a treat. (Short) | one of the greatest family-oriented, fantasy-adventure movies ever. |
| i have read as the series. i could recommend it to my other walker series. (Short) | this book had an hard time for what’s a nice fast read. good character is great reading. what works worth the money. | this is a really good book in a series. the characters are great and they are so easy to read. it is a good read, can’t wait for the next book. |
| i love it. i love it. i would recommend a great game. (Short) | i got this game from amazon it is that i have even it said weather tries. (Unreadable) | great game. i play it until i get to level 3. it ’s a nice game for the whole family. my kids too, and it does a great job. |
For illustrating the effectiveness of TMS and OMS, the ablation study is conducted on AR. The whole hierarchical evolutionary learning algorithm is removed as CatGAN w/o H, TMS is removed from CatGAN as CatGAN w/o T, and OMS is replaced with to form CatGAN w/o O. The results are shown in Table 5. Compared with CatGAN, CatGAN w/o T shows the degradation on all BLEU scores and only increases by 0.016, which means the worse quality and the similar diversity, respectively. Although CatGAN w/o O achieves competitive sample diversity over CatGAN, it shows a significant degradation on BLEU, which means OMS can effectively guide our model. CatGAN w/o H gets the worse sample quality than CatGAN w/o O, but it still outperforms SentiGAN. The ablation study illustrates that combining TMS and OMS facilitates generating diversified and high-quality samples on real data.
The experiments on general text generation further show the contribution of the hierarchical evolutionary learning algorithm. General text generation can be considered as the special case of category text generation with a single category.
The synthetic data experiments run with sequence lengths 20 and 40, and the ground-truth scores are 5.750 and 4.071, respectively. The results are presented in Table 6. Compared with all baselines, CatGAN outperforms the best of them by 0.303 and 0.521 at the two sequence lengths, respectively, which verifies the better sample quality. In particular, with sequence length 40, CatGAN significantly improves the metric by 0.956 and 3.723 over LeakGAN and RankGAN, respectively, which illustrates that our model is better at capturing long-term dependencies. With the help of the hierarchical evolutionary learning algorithm, CatGAN outperforms RelGAN, which also employs the temperature and the Gumbel-Softmax relaxation.
The trade-off between quality and diversity under different temperatures is shown in Fig. 2. It illustrates that the improvement of quality is always accompanied by the decline in diversity, and higher brings more diversity. Compared with RelGAN, CatGAN shows better quality under the same diversity. With the help of TMS, CatGAN greatly improves the quality over RelGAN under the temperature . The results validate that the hierarchical evolutionary learning algorithm can reduce the impact of mode collapse.
The real data experiments are conducted on the EN dataset. After the preprocessing, EN contains 5,255 unique words with the maximum sentence length 51. Table 7 shows that CatGAN consistently outperforms other methods on BLEU, which illustrates its power to generate high-quality long sentences. Under the premise of achieving better BLEU scores, our model improves and by 0.04 and 0.021 than the best performance of the baselines, which shows CatGAN can maintain the higher sample quality and diversity simultaneously.
The human evaluation is conducted for further evaluating the sample quality of generated sentences on AR and EN. The sample quality is measured based on grammatical and semantic correctness, and the detailed protocol is provided in the Appendix D. For category text generation, each model randomly generates 100 samples for each category, then these samples with category information are rated by five graduate students with the score from 1 to 5, where 1 means the worst quality and 5 means the best. The harmonic mean values of the average score on each category are shown in Fig. 3. For general text generation, the average score over 100 generated sentences from each model is reported. The human evaluation results demonstrate that CatGAN can generate better quality samples than other baselines.
After training on MR and AR, the generated sentences from SentiGAN, CSGAN, and CatGAN are listed in Table 8. As shown by the examples, CSGAN shows some problems, such as short length, wrong category, and unreadable sentences. Especially on AR, CSGAN lacks the ability to capture long-term dependencies and generates many unreadable sentences. SentiGAN is capable of obtaining sentiment category text, but it also cannot generate high-quality long sentences on AR, which may be because the gap between the distributions of the two products from AR is larger than the gap between different sentiments. To summarize, CatGAN produces samples which are longer, more readable, and more accurate on different categories (see Appendix E for more samples).
This paper proposes CatGAN for category text generation. In order to guide the category-aware model to obtain category samples accurately, an informative updating signal is provided by measuring the relativistic relation between generated samples and the corresponding real samples on each category. Besides, a hierarchical evolutionary learning algorithm is developed to train CatGAN and improve generation performance. It allows the model to preserve the well-performing offspring, so that the generated category samples remain diverse and high-quality after each training iteration. Experimental results on several datasets demonstrate that CatGAN achieves better performance than most of the existing state-of-the-art methods on both category text generation and general text generation.
This work is supported by the National Key R&D Program of China (2018AAA0101203), and the National Natural Science Foundation of China (61673403, U1611262).
The concrete distribution: a continuous relaxation of discrete random variables. In International Conference on Learning Representations.
IEEE Transactions on Evolutionary Computation.