Currently, neural generation techniques have powered many inspiring applications, e.g., poem generation Yang:18, neural machine translation (NMT) Bahdanau:14 and chatbots Zhao:17. Conditional text generation is an important sub-task of text generation. It aims to generate realistic text that carries a specific attribute (e.g., positive or negative sentiment).
The existing neural models for conditional text generation are mainly developed in a fully supervised or semi-supervised manner. The typical idea is to encode the condition into a vector representation and then integrate it with the text generation process Kingma:14a; Hu:17; Mirza:14. These end-to-end neural models have achieved encouraging performance. However, when a new condition appears (e.g., a new topic for categorical generation), they require full re-training because the text generation module and the condition representation module are tightly coupled in an end-to-end training fashion.
Inspired by the recent success of the Variational Auto-Encoder (VAE) Kingma:13 based “post-hoc” conditional image generation strategy Engel:17, we provide a new perspective for flexible conditional text generation. That is, we assume that text under each condition forms a unique subspace which is part of a larger union probability space. Following this assumption, we propose the Hierarchical Neural Auto-Encoder (HAE), which decouples the text generation module from the condition representation module. As illustrated in Figure 1, HAE is composed of two VAEs: (1) Global VAE (GlobVAE), which derives the latent space for texts with its encoder (global encoder) and learns to generate text from an easily accessible large unlabeled dataset with its decoder (global decoder); (2) Conditional VAE (CondVAE), a lightweight conditional network that derives the conditional subspace matching the specified condition in the global latent space. We utilize CondVAE to identify the conditional subspace, which can be easily learned with limited conditional training samples. Once the subspace is found, the global decoder already learned in the first step is directly adopted for generation. In this way, whenever a new condition emerges, we only need to train a CondVAE and plug it into the framework.
In essence, our proposed HAE can be regarded as a combination of a post-hoc strategy, a hierarchical model and transfer learning. Unlike the existing end-to-end neural models Mirza:14; Sohn:15; Kingma:14a, HAE focuses on post-processing the latent space. Once trained, GlobVAE is fixed for text generation under different conditions. This post-hoc strategy decouples conditional subspace learning from text generation, endowing HAE with more flexibility when handling emerging conditions. Also, word-level generation is converted to subspace learning, which eases the learning process. Furthermore, any large unlabeled corpus can be used for training GlobVAE, and the rich semantic knowledge it captures can be easily transferred to different conditional generation tasks. Transfer learning has also been shown to lead to more robust and effective results Howard:18; Devlin:18. To summarize, the key contributions of this work are as follows:
We provide a new perspective for conditional text generation following the assumption that text under each condition forms a unique subspace and belongs to a larger union space.
We propose HAE, which is able to flexibly accommodate one-to-many conditions without re-training the entire model. Our extensive experimental results demonstrate that our framework can generate more realistic text than state-of-the-art alternatives under different semantic and syntactic conditions with far fewer parameters.
2 Related work
Boosted by the recent success of deep learning, Natural Language Generation (NLG) has recently become a new trend in the NLP community. Many works have attempted to solve various subtasks such as dialogue generation Li:16, poetry generation Yi:18 and story generation Fan:18. However, due to the black-box nature of neural networks, the recently proposed generic models all lack interpretability and controllability.
To handle this problem and generate more controllable text, conditional text generation, also called controllable text generation Hu:17, has recently attracted extensive attention. Current research in this direction mainly falls into two groups: supervised methods and semi-supervised methods. Among supervised methods, Mirza:14; Sohn:15 first converted the condition information to a one-hot vector, then integrated it into the generator and discriminator. To enhance the correlation between the structured conditional code and generated samples, Chen:16 adopted an extra adversarial classifier to infer the structured code from generated samples. Wang:18 used multiple generators for multiple conditions and a multi-class classifier to provide diverse training signals for the generator.
However, given only a limited number of conditional samples, semi-supervised methods become necessary. To utilize the implicit conditional distribution behind unlabeled texts, Kingma:14a introduced a classifier into the VAE architecture. Hu:17 further added two independent regularization terms to enhance the disentanglement between the structured code and the unstructured code.
Our work differs from the existing works in the following ways: (1) Our model decouples the text generation module from the condition representation module, which are tightly fused as a single module in previous studies. (2) Our model allows single-condition generation, which could inspire new applications such as a polite speech generator niu2018polite. (3) Our model can handle emerging conditions and achieve satisfactory performance with far fewer parameters and less training time.
Conditional text generation refers to a series of tasks in which generic models are conditioned on human-defined factors (e.g., sentiment, category). Following this broad setting, many similar but different tasks are defined, including style transfer, story continuation and so on. In this work, we follow the problem setting proposed by Hu:17. First, a latent code is sampled from a prior distribution (e.g., Gaussian distribution). Then, given a specific condition, together with the sampled latent code, the model needs to generate a text sample under the desired condition. We use factor and task to refer to different tasks (e.g., sentiment, category) and use condition for different targets in a single task (e.g., positive and negative for sentiment conditional generation). We list the common notations used in this paper in Table 1.
|Latent code for a VAE|
|Latent space derived by a VAE|
|Unlabelled text samples|
|Conditional text samples|
|Generator for a VAE|
|Encoder for a VAE|
|Aggregated posterior distribution|
3.0.1 Conditional Text Generation
Given a set of conditions $C$ and conditional text samples $Y$, where each $y$ in $Y$ matches a single condition $c$ in $C$, the goal is to learn a generator $G$ that models the relation between text samples and their corresponding conditions. Thus, when we specify the condition $c$ and a latent code $z$, the model can generate realistic text samples matching the given condition. Formally, the generator is defined as:

$$\hat{y} = G(z, c)$$

where the latent code $z$ is modeled as a continuous variable with a standard Gaussian prior distribution.
Variational Auto-Encoder (VAE) Kingma:13 consists of an encoder $q(z|x)$ and a generator (decoder) $p(x|z)$. The encoder aims to encode input data $x$ into the latent space. The generator is to reconstruct the original input $x$. The loss function of VAE is:

$$\mathcal{L}_{VAE}(x) = -\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] + \mathrm{KL}\left(q(z|x) \,\|\, p(z)\right)$$

where $z$ is the latent code for text $x$, $\mathrm{KL}(\cdot \| \cdot)$ is the Kullback-Leibler (KL) divergence, and $p(z)$ is the prior distribution (e.g., Gaussian). The first term ensures that VAE can distill compact code in latent space for reconstruction. The second term pushes the posterior distribution to be close to the prior distribution, which guarantees the mutual information between original data and the latent space Dupont:18.
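The two terms of the VAE loss can be illustrated with a minimal NumPy sketch. It assumes a diagonal Gaussian posterior and a standard Gaussian prior, for which the KL term has a closed form; the function names and the precomputed reconstruction term `recon_nll` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def vae_loss(recon_nll, mu, logvar):
    """Per-example loss: reconstruction negative log-likelihood plus the KL term."""
    return recon_nll + gaussian_kl(mu, logvar)

# Toy batch of two examples with a 4-dimensional latent code.
mu = np.zeros((2, 4))
logvar = np.zeros((2, 4))
recon_nll = np.array([1.5, 2.0])  # hypothetical reconstruction terms
loss = vae_loss(recon_nll, mu, logvar)
```

When the posterior equals the prior (zero mean, unit variance), the KL term vanishes and the loss reduces to the reconstruction term alone.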
Wasserstein Auto-Encoder (WAE) Tolstikhin:17 is a variant of VAE. The main difference is that WAE optimizes the distance between the aggregated posterior distribution and the prior distribution, which effectively alleviates the reconstruction problem underlying vanilla VAE Tolstikhin:17. The loss function of WAE is formulated as:

$$\mathcal{L}_{WAE} = -\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] + \lambda D\left(q(z) \,\|\, p(z)\right)$$

where $q(z)$ is the aggregated posterior distribution, $D$ is any divergence measure between two distributions, and $\lambda$ is the coefficient hyper-parameter ($\lambda > 0$). There are two variants of WAE, WAE-GAN and WAE-MMD Tolstikhin:17. The former uses an adversarial classifier as $D$, while the latter uses the non-adversarial Maximum Mean Discrepancy (MMD) as $D$. Tolstikhin:17 found that WAE-GAN performs better than WAE-MMD in their experiments.
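To make the MMD variant of the divergence term concrete, the following is a minimal sketch of a biased squared-MMD estimate between encoded samples and prior samples. The RBF kernel and its bandwidth are assumptions chosen for brevity, not the kernel used in Tolstikhin:17, and this paper itself adopts the adversarial variant rather than MMD.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Pairwise RBF kernel values between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between sample sets x and y."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
prior = rng.standard_normal((500, 8))          # samples from the prior
matched = rng.standard_normal((500, 8))        # codes that follow the prior
shifted = rng.standard_normal((500, 8)) + 2.0  # codes that drift from the prior
```

Codes that drift away from the prior yield a larger estimate than codes that match it, which is the signal the divergence term penalizes.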
4 Hierarchical Neural Auto-Encoder
As shown in Figure 2, the entire framework is based on the assumption that text under each condition forms a conditional subspace in a larger global space. Under this setting, we devise a Global VAE (GlobVAE) and a Conditional VAE (CondVAE) to derive the global space and the conditional space, respectively. GlobVAE is composed of a global encoder for text representation and a global decoder (i.e., generator) for text generation. For each condition, we use a condition-specific CondVAE, with its own conditional encoder and decoder, to derive the conditional subspace.
4.1.1 Global VAE
GlobVAE is trained on a large number of unlabeled text samples to derive the global latent space for the latent code $z$, a real-valued vector whose dimension is the space size. We utilize the Wasserstein Auto-Encoder (WAE) Tolstikhin:17 for GlobVAE. Previous studies usually use a common VAE for text representation and generation. However, as pointed out by Bowman:16, VAE suffers from the notorious “posterior collapse” problem. Different from VAE, WAE encourages the aggregated posterior distribution to be close to the prior, which often alleviates the reconstruction problem of VAE Tolstikhin:17. Specifically, we adopt a variant of WAE-GAN that uses WGAN-DIV, proposed in wu:18, due to its effective adversarial learning of the discriminator. The loss function of GlobVAE is formulated as:

$$\mathcal{L}_{GlobVAE} = -\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] + \lambda D\left(q(z) \,\|\, p(z)\right)$$

where $q(z)$ is the aggregated posterior distribution, $p(z)$ is the prior Gaussian distribution, $D$ is the adversarial classifier (i.e., discriminator), and $\lambda$ is the coefficient hyper-parameter ($\lambda > 0$).
4.1.2 Conditional VAE
We devise a conditional VAE (CondVAE) to derive the conditional subspace for each condition. Specifically, for each condition, we use a small number of text samples encoded by the global encoder to train a VAE Bowman:16. Note that the encoded text samples for a single condition are unlikely to be densely clustered in the global text space, since the learning of that space is condition-independent and the unlabeled corpus contains diverse text samples. Thus, CondVAE is necessary for capturing the text distribution patterns in the conditional subspace. Specifically, the CondVAE for each condition consists of an encoder and a generator, and it learns a latent space with its own space size. When this size is relatively small, CondVAE can abstract the succinct conditional information contained in the actual latent space and approximate it. The loss function of CondVAE for a given condition is:
where is the prior distribution of CondVAE; is the latent code of the conditional vector which is encoded by the encoder (). To enhance the diversity of generated texts, we introduce an extra constant term $C$ to control the amount of encoded information in the VAE Dupont:18; ChenLGD:18; Kim:18. By setting $C$ to an appropriate value, CondVAE can extract compact conditional information without sacrificing fluency or accuracy.
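One common way to realize such a constant term is to penalize the absolute gap between the KL divergence and a target capacity. The sketch below assumes that form, together with a diagonal Gaussian posterior and a standard Gaussian prior; it is an illustration of the idea rather than the exact CondVAE objective, and all names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def capacity_vae_loss(recon_err, mu, logvar, capacity):
    """Reconstruction error plus |KL - capacity| (assumed form of the constant term)."""
    return recon_err + np.abs(gaussian_kl(mu, logvar) - capacity)

mu = np.ones((1, 4))        # this posterior has a KL of exactly 2.0 to the prior
logvar = np.zeros((1, 4))
recon_err = np.array([0.3])  # hypothetical reconstruction error

at_target = capacity_vae_loss(recon_err, mu, logvar, capacity=2.0)
off_target = capacity_vae_loss(recon_err, mu, logvar, capacity=5.0)
```

Under this form, the penalty is zero when the code carries exactly the target amount of information, so a larger capacity lets the latent code encode more and a smaller one forces a more compact code.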
In this section, we provide the details of training and generation process.
During training, each is encoded as by the global encoder :
Note that although the reparameterization trick Kingma:13 involves both a mean vector and a variance vector, we directly use the mean vector here. Accordingly, we obtain the conditional vector set.
To handle multiple conditions, we first define a CondVAE-like model for each condition , which also has an encoder and a generator . Then, to further exploit the multiple-condition information, we treat samples under other conditions as negative samples. By forcing the model to fail at reconstructing negative samples, we encourage it to extract more distinctive conditional information. The loss function of HAE is formulated as:
where is the vectorized sample under condition , and is a negative sample of , is the hyper-parameter to control the importance of negative samples. For different conditions, we may need different due to the different natures of them. Intuitively, the larger the difference between the conditions is, the smaller should be.
For conditional text generation (see Figure 2(c)), we firstly sample a vector . Then we utilize CondVAE’s generator to convert to :
Finally, we directly adopt the WAE generator for text generation:
where is the generated text under condition .
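The generation procedure above (sample a conditional code, map it into the global latent space, then decode) can be sketched with stand-in components. Everything here is a toy assumption: the dimensions, the random linear maps standing in for the trained CondVAE generator and global decoder, and the five-word vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
COND_DIM, GLOBAL_DIM = 10, 128                    # hypothetical space sizes
VOCAB = ["the", "food", "was", "great", "<eos>"]  # toy vocabulary

# Random weights standing in for trained parameters (assumption).
W_cond = rng.standard_normal((COND_DIM, GLOBAL_DIM)) * 0.1
W_out = rng.standard_normal((GLOBAL_DIM, len(VOCAB)))

def cond_generator(z_c):
    """Stand-in CondVAE generator: maps conditional codes to global codes."""
    return np.tanh(z_c @ W_cond)

def global_decoder(z_g, max_len=6):
    """Stand-in global decoder: greedy token choice from a toy state."""
    tokens, state = [], z_g
    for _ in range(max_len):
        token = VOCAB[int(np.argmax(state @ W_out))]
        if token == "<eos>":
            break
        tokens.append(token)
        state = np.roll(state, 1)  # toy state update, not a real decoder step
    return tokens

def generate(num_samples):
    z_c = rng.standard_normal((num_samples, COND_DIM))  # sample from the CondVAE prior
    z_g = cond_generator(z_c)                           # move into the global space
    return [global_decoder(z) for z in z_g]             # reuse the fixed global decoder

samples = generate(3)
```

The point of the sketch is the data flow: only `cond_generator` is condition-specific, while `global_decoder` is shared and stays fixed across conditions.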
Overall, since no classifier is needed for each condition, the training of CondVAE does not involve the unlabeled data, which makes it more flexible than the existing approaches. Additionally, although we have to train one CondVAE per condition, the total number of parameters is still much smaller than that of existing methods, and we can resort to parallel computing to easily speed up the training process.
5 Experimental Settings
5.1 Data and Task
We use Yelp Shen:17 and News Titles Fu:2018 for experiments. Yelp is a collection of restaurant reviews. We directly adopt the pre-processed version used in Shen:17, where two polarity sentiment labels are provided. For News Titles, we choose the titles belonging to the Business, Entertainment and Health categories for our experiments.
Both Yelp and News Titles are datasets with relatively short texts. We filter out texts longer than 15 words, and choose top 8,900 and 10,000 words as vocabulary for Yelp and News Titles, respectively. The statistics of the two datasets are listed in Table 2. We use the training set to pre-train WAE, the validation set to select better pre-trained models, and the test set to sample conditional text.
Based on the Yelp dataset, we define two tasks: (1) Sentiment. This task aims at generating text samples either positive or negative. The ratio of positive/negative texts in Yelp is roughly . We randomly sample 200 positive and 200 negative texts for training CondVAE. (2) Length. This task aims at generating text samples with a specific length. We define () as short text, () as long text and () as medium text. We respectively sample 200 texts for short, medium, and long text for model training.
Based on the News Titles dataset, we define the categorical text generation task: Category. This task aims at generating text samples under a certain topic. The ratio of business/health/entertainment in News Title is , which is more imbalanced than Yelp. We randomly sample 200 texts for each category to train CondVAE.
5.2 Evaluation Settings
We evaluate the results with two metrics, accuracy and diversity. For accuracy, we pre-train a sentiment classifier and a categorical classifier similar to Kim:14, which achieve accuracies of 90% and 97% on the validation set, respectively. The accuracy of the length task can be directly calculated by counting the number of words in the generated text. For diversity, we adopt Distinct-1 and Distinct-2 Li:16 as metrics. Distinct-1 and Distinct-2 are the ratios of unique 1-grams and 2-grams, respectively. A higher value indicates better diversity. For all tasks and models, the reported results are averaged over 10K randomly generated texts produced by greedy decoding. Besides, for language generation, fluency is another important metric. We leave this part for human evaluation, which is detailed in Section 6.
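The Distinct-n metric can be sketched as follows. This version assumes whitespace tokenization and pools n-grams over the whole set of generated texts before taking the ratio of unique to total n-grams; both choices are assumptions, since the exact normalization is not spelled out here.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams, pooled over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

samples = ["the food was great", "the food was bad"]
d1 = distinct_n(samples, 1)  # 5 unique unigrams out of 8
d2 = distinct_n(samples, 2)  # 4 unique bigrams out of 6
```

In this toy case the two texts share most of their words, so both ratios are well below 1; a set of texts with no repeated n-grams would score exactly 1.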
We use two semi-supervised methods, S-VAE Kingma:14a and Ctrl-gen Hu:17 as our baselines. S-VAE incorporates a classifier to provide conditional distribution for unlabeled data. Ctrl-gen further exploits several regularization terms to enhance the disentanglement between the structured code and the unstructured code.
As an essential step in our framework, we pre-train GlobVAE on unlabeled text data. For Yelp, we find that when the reconstruction loss of GlobVAE reaches around 1.41 on the validation set, GlobVAE achieves a good balance between sample quality and reconstruction capacity. A similar balance is obtained when the reconstruction loss reaches 2.85 on News Titles.
5.2.4 Special Tasks.
In particular, we define two special settings, HAE-single and HAE-incomplete, to demonstrate the flexibility of HAE under some restricted situations. (1) Single-condition generation is a task where the number of conditions is one. This setting reflects the situation when a new condition emerges. We denote the HAE that addresses single-condition generation as HAE-single. Previous models cannot be adapted to this setting since they rely on multiple conditional labels for training. (2) Incomplete conditional generation is another tricky setting where the given condition set is only a subset of all the potential conditions. For example, when contains and while is not included, this is the incomplete setting for the task of sentiment generation. Previous semi-supervised methods cannot handle this incomplete situation due to their inflexible integrated classifier. We call the HAE trained on incomplete conditional samples HAE-incomplete. We conduct experiments on (short+long) and (health+entertainment) with the incomplete setting and test the single-condition setting on all the tasks.
5.3 Hyper-parameter Settings
For GlobVAE, we use a one-layer Bidirectional Gated Recurrent Unit (Bi-GRU) with hidden units in each direction as its encoder. Two linear Fully-Connected (FC) layers are used for the reparameterization trick Kingma:13. The size of latent space is set to . The word embedding is in dimensions and randomly initialized. For the global decoder, we use a three-layer stack of eight-head self-attention blocks similar to Transformer Vaswani:17. Additionally, we add an extra positional embedding after each block, and the linearly transformed encoded vector is provided as input to each block, similar to the method proposed in Brock:18. The output softmax matrix is tied with the embedding layer. For the adversarial classifier, we adopt two 128D hidden FC layers with LeakyReLU activation and one 1D output linear layer without bias. The balance coefficient is 20 for Yelp and 15 for News Titles. The coefficient and power in WGAN-DIV wu:18 are set to 2 and 6 respectively. During pre-training, the batch size is set to 512. Adam Kingma:14b with is used as the optimizer. Learning rate is set to .
For the CondVAE, we set the size of latent space . The encoder is two hidden FC layers with LeakyRelu activation with 64/32 units and one 20D output linear layer. The decoder is two hidden FC layers with LeakyRelu activation with 32/64 units and one 128D output linear layer. is set to 0.1 for sentiment tasks, 0.05 for categorical tasks, and for length tasks. Batch size is set to 128. Adam Kingma:14b with is used as the optimizer, learning rate is for 20K iterations. linearly increases from 0 to 5 in first 10K iterations.
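The linear warm-up described above (from 0 to 5 over the first 10K of 20K iterations) can be sketched as a simple schedule. The sketch assumes the scheduled quantity is the constant term introduced for CondVAE and that it stays at its final value after the warm-up; the function name is hypothetical.

```python
def capacity_at(step, c_max=5.0, warmup_steps=10_000):
    """Linearly increase the constant from 0 to c_max, then hold it (assumed)."""
    return c_max * min(step, warmup_steps) / warmup_steps

# Values at a few training iterations within the 20K-iteration run.
schedule = [capacity_at(s) for s in (0, 5_000, 10_000, 20_000)]
```

Starting from zero keeps the latent code compact early in training, and the gradual increase then lets it encode more information as training proceeds.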
For fair comparison, we use the same encoder-decoder architecture for both S-VAE and Ctrl-gen. The hyper-parameters are set as suggested in Kingma:14a; Hu:17.
6 Experimental Results
6.1 Conditional Text Generation
The results of conditional text generation are listed in Table 3. We find that neither S-VAE nor Ctrl-gen achieves a good balance between accuracy and diversity. Specifically, Ctrl-gen suffers from poor diversity, i.e., it tends to produce “dull texts” Li:16. Moreover, the accuracy of Ctrl-gen fluctuates heavily across tasks, which can probably be attributed to its GAN-like training procedure. S-VAE is more robust than Ctrl-gen in terms of accuracy, but its diversity fluctuates heavily. We suspect that the classifier in S-VAE may not be trained well on such limited labeled text. In six out of eight tasks, HAE achieves better accuracy and diversity than both S-VAE and Ctrl-gen. As shown in Table 5, HAE has fewer than of the parameters of the baselines and is able to handle one-to-many conditions in a more flexible way, so the effectiveness of HAE shown in the experiments is rather encouraging.
For the special settings, we find that although HAE-single achieves a slightly lower accuracy than HAE, its performance is competitive compared to S-VAE and Ctrl-gen, which incorporate multiple-condition information. Meanwhile, HAE-single achieves a degree of diversity similar to HAE. As for HAE-incomplete, it achieves performance comparable to HAE. Note that for all the settings, the same GlobVAE is used and only 200 samples are leveraged for training each CondVAE. This means that HAE can flexibly handle different situations with very few conditional text samples.
6.2 Impact of Hyper-parameter C
Since is an important hyper-parameter for HAE, we test on the long text generation task. From the results in Table 4, we find that controls the balance between diversity and accuracy. Specifically, when is too large, more diverse samples could be generated, but the accuracy may be sacrificed slightly. On the contrary, when is too small, the accuracy could climb to a higher value, but meanwhile the diversity drops dramatically. Empirically, we find that is an appropriate value for all tasks. In fact, the value of can be tuned to meet different practical needs.
6.3 Impact of Training Size
In real applications, the number of given conditional text samples may be fewer than 200. Thus, we measure the robustness of HAE and the baselines by using of conditional text samples on the positive text generation task. The results are shown in Figure 3. We find that HAE, HAE-single and S-VAE are all robust to smaller training sets, but HAE and HAE-single are consistently more effective than S-VAE in all settings. The performance of Ctrl-gen drops notably with less training data.
To measure the efficiency of the proposed methods, we report the training time and number of parameters of S-VAE, Ctrl-gen and HAE in Table 5. Here we measure the whole process of training until convergence. We find that HAE is 200 times faster than Ctrl-gen and 80 times faster than S-VAE. Moreover, HAE only has 22k parameters, which is about and of the parameter numbers of S-VAE and Ctrl-gen, respectively.
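The 22k figure can be roughly cross-checked against the CondVAE layer sizes given in the hyper-parameter settings (64/32 hidden units with a 20D output for the encoder, 32/64 hidden units with a 128D output for the decoder). The sketch below rests on two assumptions that are not stated explicitly: the global latent code has 128 dimensions (matching the decoder output), and the 20D encoder output holds a mean and a variance for a 10-dimensional CondVAE latent code.

```python
def fc_params(sizes):
    """Weights plus biases for a chain of fully-connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

GLOBAL_DIM = 128  # assumption: equals the decoder's 128D output
COND_DIM = 10     # assumption: 20D encoder output = mean + variance of 10 dims

encoder = fc_params([GLOBAL_DIM, 64, 32, 2 * COND_DIM])
decoder = fc_params([COND_DIM, 32, 64, GLOBAL_DIM])
total = encoder + decoder
```

Under these assumptions a single CondVAE has 21,780 parameters, which is consistent with the reported figure of about 22k.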
|Positive||The services are friendly, fast.|
|Long||And this made me feel uncomfortable and the prices aren’t right.|
|Health||Ebola : Virus Spreads in the US|
|Business||FDA Approves New Case of E-cigarettes|
|Grammatical||Eat the service!|
| ||In addition, this location sucks it is.|
| ||Star Wars 7 will include US production on set|
|Conditional||I was shocked that this is what I needed. (Negative)|
| ||Are you actually drunk outside? (Long)|
| ||Michael Jackson’s New Album ‘Xscape’ (Business)|
6.5 Human Evaluation
We conduct human evaluation experiments as a complementary measurement beyond automatic metrics. Specifically, eight judges are asked to rate 200 conditional samples generated by each model under each condition. That is, 4,800 text samples are annotated for our model alone, and more than 10K samples are annotated across all the models. A judge needs to rate fluency and conditionality on a scale from to for each sample (the scores , , represent not fluent / conditional at all, slightly fluent / conditional, and fluent / conditional, respectively). Fluency measures whether the text samples are as natural and fluent as real text. Conditionality indicates whether the generated text adheres to the given condition. From the results provided in Table 6, we find that HAE achieves better fluency scores than S-VAE and Ctrl-gen. The conditionality scores match the automatic metrics, together indicating that HAE can generate more accurate conditional samples than S-VAE and Ctrl-gen.
6.6 Qualitative Analysis
For qualitative analysis, we list some randomly chosen generated conditional samples in Table 7 and failed examples in Table 8. As shown in Table 7, our proposed HAE is capable of generating accurate and fluent conditional text. As for the failed examples, there are two major types of mistakes: grammatical and conditional. The conditional mistakes occur because the CondVAE in HAE does not perfectly fit the conditional subspace, so some noise is incorporated into the generation. The grammatical mistakes occur for two reasons. First, the sampled conditional vectors may fall outside the high-probability region of the global space. Second, the learned global generator is prone to local mistakes due to exposure bias and its imperfect generation capacity.
7 Conclusions and Future Work
In this paper, we present a novel HAE framework for flexible conditional text generation. HAE decouples the text generation module from the condition representation module, allowing more flexibility in conditional text generation. Extensive experiments demonstrate the superiority of the proposed HAE over the existing alternatives while requiring much less training time and far fewer model parameters. For future work, we are interested in whether similar ideas can generalize to more generation tasks, e.g., unsupervised text style transfer and dialogue generation.
We would like to thank Daya Guo from Microsoft Research Asia for proofreading. Chenliang Li is the corresponding author.