Transformer-based Conditional Variational Autoencoder for Controllable Story Generation
We investigate large-scale latent variable models (LVMs) for neural story generation, an under-explored application for open-domain long text, with objectives in two threads: generation effectiveness and controllability. LVMs, especially the variational autoencoder (VAE), have achieved both effective and controllable generation through exploiting flexible distributional latent representations. Recently, Transformers and their variants have achieved remarkable effectiveness without explicit latent representation learning, and thus lack satisfactory controllability in generation. In this paper, we advocate reviving latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build a conditional variational autoencoder (CVAE). Model components such as the encoder, the decoder, and the variational posterior are all built on top of pre-trained language models, specifically GPT2 in this paper. Experiments demonstrate the state-of-the-art conditional generation ability of our model, as well as its excellent representation learning capability and controllability.
Neural text generation has achieved remarkable success, enabling both effective and controllable generation at a level where computational models can write like humans and satisfy practical needs. Among various research objectives, the most significant are the effectiveness and the controllability of generation, where there are always emerging opportunities and challenges.
Deep latent variable models (LVMs), especially the variational autoencoder (VAE) Kingma and Welling (2013); Rezende et al. (2014), have been a significant class of methods to achieve both effective and controllable generation Bowman et al. (2015); Miao et al. (2016); Zhao et al. (2017, 2018b); Zhou and Neubig (2017); Hu et al. (2017); Bao et al. (2019b); Shah and Barber (2018), typically built on recurrent architectures such as long short-term memory (LSTM) networks Hochreiter and Schmidhuber (1997) and gated recurrent unit (GRU) networks Cho et al. (2014). The advantage of LVMs is that they learn and exploit flexible distributional latent representations to capture holistic features of the input and further guide the generation of sentences. Such powerful representation learning can address both the effectiveness and the controllability of generation.
In recent years, Transformers Vaswani et al. (2017) and their variants have become the mainstream workhorses and boosted previous generation effectiveness by large margins. Models based on similar self-attention architectures Devlin et al. (2018); Radford et al. (2018, 2019) can leverage both big models and big training data. A dominant paradigm has emerged: "pre-training + fine-tuning" across a number of natural language processing tasks. Even without explicitly learning latent representations, Transformer-based models can learn effectively from training data and generate high-quality text. It is thrilling to witness computational models generate consistent long text of thousands of words with ease. However, given state-of-the-art generation effectiveness, the controllability of these models, especially when generating long text, is still under-explored. The emerging challenge is: how can we achieve controllable generation in the era of Transformers, and in a long-text setting?
In this paper, we advocate reviving latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a self-attention based pre-trained architecture to build a conditional variational autoencoder (CVAE). Model components such as the encoder, the decoder, and the variational posterior are all built on top of pre-trained language models, specifically GPT2 Radford et al. (2019). We demonstrate the excellent representation learning capability and controllability of our Transformer-based LVMs through learning and manipulating the latent representation vectors.
On the application side, we emphasize a much more challenging and under-explored task, i.e., neural story generation, which creatively writes open-domain long text of hundreds or thousands of words conditioned on very short and abstract prompts Fan et al. (2018). The task, featuring much longer output, leads to higher complexity and more flexibility in a broader space than short text generation. Previous literature Fan et al. (2018); Mao et al. (2019); See et al. (2019); Ziegler et al. (2019) has at most studied how to effectively learn the mapping between prompt and story through explicit end-to-end (end2end) training. However, controllability in such a setting has rarely been studied. For instance, how can one control story development and semantic transition during the spanning of long text? Pure end2end learning seems quite rigid, and could miss flexible and controllable mechanisms inside a black box. A reasonable solution for this issue is to introduce latent representation vectors, which is the treatment we consider in this paper.
To summarize, our paper is, to our knowledge, among the first works to build Transformer-based latent variable models to solve the controllability issue in the setting of long text generation. Recently, we noticed an independent parallel work, Li et al. (2020), which proposes a similar Transformer-based architecture to incorporate latent representation vectors. We note that there are a number of differences between that work and ours. Most significantly, we consider both VAE and CVAE in a long text setting, while Li et al. (2020) considers a pre-trained VAE model in the traditional short text setting. Our datasets and source code are available at https://github.com/fangleai/TransformerCVAE.
Conditional story generation Fan et al. (2018) refers to generating open-domain long text based on a short prompt, which provides either a starting point or an abstract summary for the writing. In this paper, we propose a Transformer-based conditional variational autoencoder to learn the generative process from prompt to story.
Figure 1 illustrates the graphical model of VAE, an unsupervised learning method for unconditional generation. VAE consists of a generative network (decoder) and an inference network (encoder). We are given a language dataset $\mathcal{X} = \{x_i\}_{i=1}^{N}$, where $x_i$ represents the $i$th sentence of length $T_i$. With a prior distribution $p(z)$ over a latent vector $z$, VAE generates a sentence $x$ using the deep generative network $p_\theta(x|z)$ parameterized by $\theta$. The prior is typically assumed to be a standard multivariate Gaussian. The decoder typically takes an auto-regressive form $p_\theta(x|z) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z)$. In this paper, we build the decoder on top of the pre-trained GPT2 rather than traditional recurrent neural networks.
The goal of VAE training is to maximize the marginal data log-likelihood $\log p_\theta(x)$. However, posterior inference is generally intractable. Consequently, a $\phi$-parameterized encoder is introduced to approximate the true posterior $p_\theta(z|x)$ with a variational distribution $q_\phi(z|x)$. Variational inference is employed for VAE learning, yielding the following evidence lower bound (ELBO):
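With these definitions, the bound takes the standard form:

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \;-\; \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big). \tag{1}
```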
Figure 1 also illustrates CVAE, an adaptation of VAE for supervised learning and conditional generation. We are given a training dataset of pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i$ represents the $i$th target sentence of length $T_i$. In controllable story generation, $x$ and $y$ refer to a prompt and a story, respectively. Given an input $x$, CVAE encodes the prior knowledge of the latent code as $p_\theta(z|x)$, and generates the target $y$ using the deep generative network $p_\theta(y|x,z)$ parameterized by $\theta$.
The goal of CVAE is to maximize the conditional data log-likelihood $\log p_\theta(y|x)$. Similarly, variational inference is employed for CVAE learning, yielding the following evidence lower bound (ELBO):
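With the learnable prior $p_\theta(z|x)$ and variational posterior $q_\phi(z|x,y)$, the bound takes the standard form:

```latex
\log p_\theta(y|x) \;\ge\; \mathbb{E}_{q_\phi(z|x,y)}\big[\log p_\theta(y|x,z)\big] \;-\; \mathrm{KL}\big(q_\phi(z|x,y)\,\|\,p_\theta(z|x)\big). \tag{2}
```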
Note that both the prior $p_\theta(z|x)$ and the posterior $q_\phi(z|x,y)$ are learnable in CVAE.
Our model architecture is illustrated in Figure 3. It consists of a prior, a posterior, and a conditional generator, all based on the multi-layer self-attention architecture Vaswani et al. (2017) and, more specifically, built on top of pre-trained models.
In order to exploit the power of pre-trained models, we propose to reuse the GPT2 model Radford et al. (2019) as our decoder. For ease of computation, we adopt the smallest public version with 12 layers, 12 heads per layer, a model dimension of 768 units, and 117M total parameters. The encoder has 6 unmasked/bi-directional self-attention layers, whose parameters are initialized from the first 6 layers of the GPT2 model (initialized but not shared afterwards). Moreover, the word embedding and positional embedding tables in the encoder and decoder are shared.
Compared with the masked/uni-directional structure in the decoder for auto-regressive generation, the key point is to have an unmasked/bi-directional structure in the encoder to allow a full information scope. In this sense, our design is comparable with Li et al. (2020), which reuses BERT in the encoder and GPT2 in the decoder. However, we advocate two main design differences: 1) BERT uses WordPiece embeddings for tokenization while GPT-2 uses Byte Pair Encoding (BPE), leading to entirely different vocabularies; Li et al. (2020) resorts to keeping both tokenizations for all inputs and outputs, while our design has no such issue. 2) Li et al. (2020) only works with short sentences, typically less than 64 words, while our model works with hundreds or thousands of words in a minimal run. In our case, a model with a full 12-layer encoder is empirically too large to fit in a single GPU's memory. To save memory, and considering that the first several layers of GPT2 may implicitly serve to encode features, our model uses only 6 layers in the encoder (our experiments confirm that using all 12 layers in the encoder brings limited improvement over using 6 layers).
Traditional RNN/LSTM encoders typically use only the last hidden state to produce a latent space, which is insufficient to summarize sequential data and keep long-term knowledge. In our model, the representations from the self-attention layers are a sequence of vectors whose number equals the number of input tokens. To utilize all the information, we define an attention-average block right afterwards that merges the variable-length sequence of vectors into a single vector. The attention-average block performs multi-head self-attention as in Figure 2(a), using a single learnable query $q$, with keys and values taken from the variable-length sequence of vectors output by the last blocked self-attention layer. The single-vector representation is then passed to linear layers to predict the prior and posterior distributions, respectively.
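As an illustration, the pooling step can be sketched in a single-head numpy form (a minimal sketch: the weights `Wk`, `Wv` and query `q` stand in for learned parameters, and the real block is multi-head):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_average(H, q, Wk, Wv):
    """Merge a variable-length sequence of hidden states H (T, d)
    into one summary vector using a single learnable query q (d,)."""
    K, V = H @ Wk, H @ Wv                        # keys/values from the sequence
    w = softmax(q @ K.T / np.sqrt(K.shape[-1]))  # attention weights over T positions
    return w @ V                                 # (d,) single-vector summary

rng = np.random.default_rng(0)
T, d = 7, 16
H = rng.standard_normal((T, d))
q = rng.standard_normal(d)
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
z = attention_average(H, q, Wk, Wv)  # one vector regardless of sequence length T
```

The same query and weights handle any input length, which is what allows stories of very different lengths to be summarized into one latent vector.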
In terms of model components, we define both the prior and the variational posterior as isotropic Gaussian distributions, i.e., $\mathcal{N}(\mu, \sigma^2 I)$, with a learnable mean vector $\mu$ and a learnable "$\log\sigma$" vector. The KL divergence between the prior and the posterior in Eq. 2 is therefore analytically solvable. The traditional reparameterization trick Kingma and Welling (2013) is used to allow gradients to pass through the Gaussian sampling. Note that in our model, the prior distribution $p_\theta(z|x)$ and the variational posterior distribution $q_\phi(z|x,y)$ share all parameters except the linear layers predicting their means and variances, which promotes prior-posterior alignment.
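Both ingredients are standard and can be sketched as follows (a minimal numpy sketch under the diagonal-Gaussian parameterization; function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z ~ N(mu, diag(exp(logvar))) via z = mu + sigma * eps,
    so gradients can flow through mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) for two diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

mu = np.zeros(4)
kl0 = kl_diag_gaussians(mu, np.zeros(4), mu, np.zeros(4))  # identical q, p -> 0.0
```

The closed-form KL is what makes the KL term in Eq. 2 computable without sampling.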
In contrast to Transformers Vaswani et al. (2017) and other models that learn a sequence of encoding vectors, our model is dedicated to learning a single vector as an explicit latent representation.
With a single latent code representation $z$ (when the latent code's dimension differs from the embedding dimension, linear projection layers are needed before feeding it to the decoder to ensure identical dimensions) and a GPT2 decoder, we investigate three mainstream ways of latent code injection, inspired by previous literature Cheng et al. (2019); Ziegler et al. (2019); Wang and Wan (2019).
Input: $z$ is added to each input token embedding during decoding, i.e., added element-wise to the word embeddings and positional embeddings.
PSA: inject the latent code on a per-layer basis. (In recent literature, Li et al. (2020) also studied a way of latent injection described as a "Memory" vector; essentially, "Memory" is identical or equivalent to our "PSA".) Specifically, we first project $z$ through a linear layer so that it can be split into $L$ vectors $z^{(1)}, \ldots, z^{(L)}$, with $z^{(l)}$ being fed into the $l$-th blocked self-attention layer. As presented in Ziegler et al. (2019) and shown in Figure 2(b), pseudo self-attention can absorb extra encoding embeddings into a pre-trained GPT2 self-attention structure through $\mathrm{PSA}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\,[z_K; K]^\top}{\sqrt{d}}\right)[z_V; V]$, where $Q$, $K$, $V$ are projections of the original input embeddings participating in self-attention; $[z_K; K]$ and $[z_V; V]$ are augmented key and value matrices with the projected latent codes $z_K$ and $z_V$ filling the first row; and $[\cdot\,;\cdot]$ means concatenation by rows. Here, we abbreviate the per-layer code $z^{(l)}$ to $z$ for notational simplicity.
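A single-head numpy sketch of this augmented attention (assuming the latent code has already been projected to a key row `z_k` and a value row `z_v`; the projection matrices here are random stand-ins, and the causal mask used in the actual decoder is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_self_attention(X, z_k, z_v, Wq, Wk, Wv):
    """Self-attention over tokens X (T, d) with the latent code prepended
    as an extra key/value row, so every token can attend to the latent."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    K_aug = np.vstack([z_k, K])   # (T+1, d): latent fills the first row
    V_aug = np.vstack([z_v, V])
    A = softmax(Q @ K_aug.T / np.sqrt(K.shape[-1]), axis=-1)  # (T, T+1)
    return A @ V_aug              # (T, d)

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.standard_normal((T, d))
out = pseudo_self_attention(
    X,
    rng.standard_normal(d), rng.standard_normal(d),           # z_k, z_v
    rng.standard_normal((d, d)), rng.standard_normal((d, d)),  # Wq, Wk
    rng.standard_normal((d, d)),                               # Wv
)
```

Because only one extra row is added per layer, the pre-trained attention weights remain usable unchanged.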
Softmax: in the original GPT2, an embedding vector $h$ from the last blocked attention layer is projected to a pre-softmax logit vector $w \in \mathbb{R}^{|V|}$ through a linear head, where $|V|$ is the vocabulary size used in tokenization. When the latent code is injected at this position, a new, shared linear head is initialized and learned in order to project $z$ into a logit vector $w_z \in \mathbb{R}^{|V|}$. Finally, we send the combined logits $w + w_z$ to the softmax and output.
We empirically study all three ways of latent code injection into the decoder, and present a comparison in the experiment section.
We train our CVAE model by minimizing the negative of the ELBO objective in (2). For conditional story generation, the input to the prior distribution is the prompt alone, and the input to the posterior distribution is the concatenated sequence of prompt and story, split by a special token. The conditional generative distribution is implemented as decoding with the text prefix (the prompt followed by the special token) while feeding in the latent code.
To avoid learning deviation caused by randomly initialized parameters, we freeze the parameters initialized from the pre-trained GPT2 for the first several iterations of training, i.e., 10K iterations, and unfreeze them afterwards.
To alleviate the notorious posterior collapse issue, we adopt a cyclic annealing schedule Fu et al. (2019), adjusting the coefficient $\beta$ on the KL divergence term in (2). Specifically, $\beta$ is kept close to zero in the first half of each cycle, linearly annealed to 1 in the next quarter of the cycle, and kept at 1 in the remaining quarter. The purpose of such a schedule is to exploit the period when $\beta$ is close to zero, which pushes the model towards a pure autoencoder Bourlard and Kamp (1988) that learns informative latent codes.
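The schedule described above can be sketched as (a minimal sketch; `cycle_len` and the function name are illustrative, not from the paper's code):

```python
def kl_weight(step, cycle_len):
    """Cyclic KL coefficient: ~0 in the first half of each cycle,
    a linear ramp to 1 in the third quarter, held at 1 in the last quarter."""
    t = (step % cycle_len) / cycle_len
    if t < 0.5:
        return 0.0
    if t < 0.75:
        return (t - 0.5) / 0.25  # linear ramp 0 -> 1
    return 1.0
```

During training the KL term in the loss is multiplied by `kl_weight(step, cycle_len)`, so early in each cycle the model behaves like an autoencoder and later pays the full KL penalty.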
During generation, a short prompt text is fed to the encoder, and a latent code is sampled from the prior distribution to guide the decoding. This procedure is the same as how traditional CVAE works.
Most previous works on text generation consider a short-text setting. For controllable generation, they mainly consider certain global aspects of text, the most common being sentiment and topic Shen et al. (2017); Zhao et al. (2018a); Hu et al. (2017); Fang et al. (2019); Dathathri et al. (2019); Keskar et al. (2019); Li et al. (2020); Wang et al. (2020). Researchers have attempted short story generation with fine-grained control through plots, plans, or so-called storylines Peng et al. (2018); Yao et al. (2019), leading to wide usage of, and benchmarking on, the 5-line story dataset of Mostafazadeh et al. (2016).
In recent years, Fan et al. (2018) proposes story generation as a test bed for open-domain long text generation. Ziegler et al. (2019) initiates research on conditional story generation based on a pre-trained GPT2.
Though achieving promising results, very few works have been presented to improve controllability in the setting of long text. This work is, to our knowledge, the first to build a Transformer-based latent variable model to improve controllable open-domain long text generation.
Recently, there are several works building latent variable models on top of the Transformer. One main class of work is conditional VAEs with Transformer-based components in non-autoregressive sequence generation, especially non-autoregressive machine translation Shu et al. (2020); Ma et al. (2019); Kasai et al. (2020); Han et al. (2020). Another class of work is on dialogue generation with conditional VAEs Bao et al. (2019a); Lin et al. (2020) for response diversity.
A separate line of work proposes a semi-supervised method using a Transformer-based VAE to solve the aspect-term sentiment analysis problem; the method also disentangles the latent space into aspect-specific sentiment and lexical context, respectively. A recently released work, Li et al. (2020), proposes a large-scale VAE as a pre-trained model. The model is first pre-trained on a massive unlabelled corpus and then fine-tuned on various downstream generation tasks that traditional RNN/LSTM-based VAEs have attempted.
Our paper indeed draws inspiration from previous Transformer-based latent variable models on the architecture side. However, our model is motivated to enhance the controllability of generation and deals with the challenging setting of long-text generation.
In order to verify our idea of Transformer-based latent variable models, we first conduct a pre-experiment with the VAE architecture on two small datasets. The VAE implemented is a simpler version of our CVAE model shown in Figure 3, where the prior is defined as the standard spherical Gaussian $\mathcal{N}(0, I)$. Moreover, the VAE conducts pure unsupervised learning, where unlabelled language texts are encoded into a distributional representation space and then decoded back.
The two relatively small datasets used for VAE learning are introduced in the following, with statistics shown in Table 1.
The Arxiv dataset is an online dataset Sergio (2019) that extracts abstracts from arXiv articles. Specifically, a topic query is searched among arXiv abstracts and the matched ones are collected. The three topic queries we used are "artificial intelligence", "computer vision" and "language generation", leading to around 12K article abstracts containing each topic phrase. The Yelp dataset is a public dataset Yang et al. (2017); He et al. (2018) with restaurant reviews collected from the "Yelp" website. Reviews are associated with user ratings from one to five stars. We binarize reviews with a user rating above three as positive, and below three as negative, leading to a binary sentiment dataset.
Figure 4 visualizes the posterior of texts in the test dataset in 2D space using t-SNE Maaten and Hinton (2008). As we can see, meaningful latent spaces can be learned, which cluster high-dimensional data according to proximity between their latent codes. Interestingly, for the Arxiv dataset, the cluster of "artificial intelligence" lies between the clusters of "computer vision" and "language generation", which coincides with our understanding of these topics. Such visualization shows encouraging signs of the representation learning power of our model.
We conduct conditional story generation on two datasets, WritingPrompts and WikiPlots, with statistics shown in Table 2. The datasets are publicly available and meet our target of open-domain long text corpora. We also investigated several other commonly used public datasets for conditional text generation. Another dataset commonly used in story-plot generation is ROCStories Mostafazadeh et al. (2016); Yao et al. (2019), which consists of 5-line stories and is thus too short to use in our task.
Of the two datasets adopted, WritingPrompts is a dedicated large-scale hierarchical story generation dataset collected from Reddit's "WritingPrompts" forum Fan et al. (2018); Mao et al. (2019). Given a prompt as a rough guide or starting point, stories are multi-paragraph short novels written by human users. WikiPlots (https://github.com/markriedl/WikiPlots) contains story plots of books, novels, films, etc., extracted from English-language Wikipedia. Each story plot is paired with a short title, which is used similarly to the prompt in WritingPrompts.
Each of our benchmark models serves designated purposes. Note that we don’t benchmark with other pre-trained language model bases which may be much more powerful than GPT2. We also don’t choose some popular controllable text generators such as Hu et al. (2017); Dathathri et al. (2019) since they either only work in a short text setting or discuss a different notion of control.
By comparing with a state-of-the-art specialized-architecture, task-specific story generation model Fan et al. (2018), we evaluate models' in-domain generation performance. The fusion models in Fan et al. (2018) take a convolutional seq2seq structure with a fusion training mechanism. Although similar self-attention mechanisms are used, the fusion model still differs from our Transformer-based architectures in the design of key and value vectors.
By comparing with a state-of-the-art transfer learning method based on GPT-2 models, pseudo self-attention (PSA) Ziegler et al. (2019), we compare our CVAE model with a purely supervised training method for conditional generation. Pseudo self-attention introduces new projection matrices to absorb a sequence of embedding vectors from the input into the self-attention computation. Note that we may use pseudo self-attention as one way of latent code injection (2⃝), but there are key differences: our model injects only a single encoded vector, rather than the sequence of encoding vectors in the original PSA; and our CVAE model exploits the notion of distributional representation, learning a representation space that enables flexible controllability. In other words, our CVAE learns the encoded vector together with a posterior distribution, whereas pure PSA does not.
By comparing with a simple transfer learning baseline, "fine-tuning with special tokens" (FIST), we investigate the effect of incorporating a latent code into decoding. FIST does not learn a latent code; it only fine-tunes a pre-trained GPT2 model on augmented language texts, i.e., the prompt directly concatenated with the story, with a special token in between.
By comparing different ways of latent code injection, we evaluate their effectiveness accordingly. We label our CVAE model with the latent code injections 1⃝, 2⃝ and 3⃝ as CVAE-1⃝, CVAE-2⃝ and CVAE-3⃝, respectively, as is reflected in Figure 3.
We implement our models using the "Huggingface Transformers" library in PyTorch Wolf et al. (2019). In evaluation, we generate stories using the top-k, top-p (nucleus) random sampling scheme Holtzman et al. (2019); Keskar et al. (2019) with fixed $k$ and $p$, and temperature smoothing is also applied. Considering the two relatively large test datasets, we randomly decode one story per test input, rather than sampling several stories per test prompt and selecting the best one.
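The filtering step of that sampling scheme can be sketched as follows (a minimal numpy sketch; the paper's exact `k`, `p`, and temperature values are not reproduced here, and the function name is ours):

```python
import numpy as np

def top_k_top_p_probs(logits, k, p, temperature=1.0):
    """Return a renormalized next-token distribution restricted to the
    top-k tokens, then truncated to the smallest nucleus with mass >= p."""
    logits = np.asarray(logits, dtype=float) / temperature
    kth = np.sort(logits)[-k]
    logits = np.where(logits < kth, -np.inf, logits)  # top-k filter
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                        # tokens by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1              # nucleus (top-p) filter
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

probs = top_k_top_p_probs([1.0, 2.0, 3.0, 4.0], k=2, p=1.0)
```

A token is then drawn from the returned distribution, e.g. with `np.random.default_rng().choice(len(probs), p=probs)`.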
We evaluate the following automatic metrics towards target stories:
Perplexity (PPL) is used to evaluate language models and is often regarded as a proxy for generation quality. All models based on GPT-2 use the BPE tokenization scheme, whose PPL values are not directly comparable with some previous models, such as Fan et al. (2018), whose PPLs are computed at the natural word level. Similar to See et al. (2019), we additionally compute the word-level perplexity of GPT-2 models to enable comparison with previous models. That is, we normalize the total negative log-probability of the target text by the number of word-level tokens.
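Concretely, only the normalizer changes between BPE-level and word-level perplexity (a minimal sketch with illustrative numbers; the function name is ours):

```python
import math

def word_level_ppl(total_nll_nats, num_word_tokens):
    """Perplexity with the total negative log-likelihood (accumulated over
    BPE tokens) normalized by the number of natural-word tokens instead."""
    return math.exp(total_nll_nats / num_word_tokens)

# e.g. a 50-word target whose BPE-level NLL sums to 50 * ln(100) nats
ppl = word_level_ppl(50 * math.log(100), 50)  # ~ 100.0
```

Because word tokens are fewer than BPE tokens, word-level PPL is higher than BPE-level PPL for the same text, which is why the two are not directly comparable.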
The results are presented in Table 3.
Overall, our CVAE model achieves generally better, or at least comparable, metrics in terms of lower PPL and higher ROUGE scores, demonstrating state-of-the-art conditional story generation performance.
Methods based on pre-trained models (PSA / FIST / CVAE) show better overall performance with relatively less task-specific effort than the fusion models, demonstrating the power of large-scale pre-training with Transformer-based architectures. The fusion models still show relatively high precision scores, owing to their dedicated design for story generation.
When comparing CVAE-2⃝ with PSA, we observe a performance improvement due to the flexible learned representation space. Note that CVAE merges a sequence of encoding representation vectors into a single latent vector, which is the key difference from the original PSA.
When comparing CVAE variants with FIST, we observe the benefit of latent representation modeling as a powerful addition to pure occurrence modeling.
When comparing different ways of latent code injection in CVAE, we observe that it is hard to make option 3⃝ work empirically, while options 1⃝ and 2⃝ perform comparably well. (We also observe that using both 1⃝ and 2⃝ does not consistently improve performance.) Our observation differs from Li et al. (2020), which claims that 2⃝ works significantly better than 1⃝. We suspect this is due to an inherently different experimental setting, as we work with significantly longer text.
When training on the WikiPlots dataset, we observe similar representation learning results, as shown in Figure 5. A story prompt in WikiPlots may contain extractable keywords revealing the item type, such as TV series, film, music, manga, novel, or game. We observe that item types from test story prompts are clearly clustered in the latent code space, which implies effective representation learning that captures inherent characteristics of the prompts.
We further present qualitative generation examples on the test datasets in Tables 4–11. We observe that stories are semantically and grammatically sound, and more importantly, highly conditioned on and consistent with given prompts. A large scale human evaluation is underway, which is quite expensive due to text length and evaluation scale.
To verify the effect of the learned latent representation vectors in generation, we conduct an interesting "control" experiment: given two prompts $x_1$ and $x_2$, we generate a story conditioned on the prefix of $x_1$ while feeding in the latent code obtained from $x_2$ during decoding. In this way, the generated story lies in the combined semantic space of the two prompts $x_1$ and $x_2$, especially after the latent code takes effect and dominates.
Generation examples from the two test datasets are presented in Tables 7 and 11. We colorize key words in the generated stories that coincide with the given prompts $x_1$ and $x_2$ accordingly. Such examples confirm the effect of latent codes in generation, indicating that our model is a principled way to enhance controllability.
In this paper, we propose Transformer-based latent variable models to enhance story controllability while maintaining state-of-the-art generation effectiveness. Our test bed is a much more challenging and under-explored long-text application compared with traditional short-text generation. Our results indicate the superiority of Transformer-based latent variable models and call for more effort to be invested in this domain.
Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017). Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1587–1596.
C.-Y. Lin and E. Hovy (2002). Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization, Volume 4, pp. 45–51.
D. J. Rezende, S. Mohamed, and D. Wierstra (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286.
H. Shah and D. Barber (2018). Generative neural machine translation. In Advances in Neural Information Processing Systems 31, pp. 1346–1355.