Outline to Story: Fine-grained Controllable Story Generation from Cascaded Events

01/04/2021 ∙ by Le Fang, et al. ∙ JD.com, Inc. ∙ University at Buffalo

Large-scale pretrained language models have shown thrilling generation capabilities, especially when they generate consistent long text of thousands of words with ease. However, users of these models can only control the prefix of sentences or certain global aspects of the generated text. It is challenging to simultaneously achieve fine-grained controllability and preserve state-of-the-art unconditional text generation capability. In this paper, we first propose a new task named "Outline to Story" (O2S) as a test bed for fine-grained controllable generation of long text, which generates a multi-paragraph story from cascaded events, i.e., a sequence of outline events that guide subsequent paragraph generation. We then create dedicated datasets for future benchmarks, built with state-of-the-art keyword extraction techniques. Finally, we propose an extremely simple yet strong baseline method for the O2S task, which fine-tunes pre-trained language models on augmented sequences of outline-story pairs with a simple language modeling objective. Our method introduces no new parameters and performs no architecture modification, except for several special tokens used as delimiters to build the augmented sequences. Extensive experiments on various datasets demonstrate state-of-the-art conditional story generation performance with our model, achieving better fine-grained controllability and user flexibility. To our knowledge, our paper is among the first to propose a model and create datasets for the task of "outline to story". Our work also instantiates the research interest in fine-grained controllable generation of open-domain long text, where controlling inputs are represented by short text.


Introduction

Large-scale pretrained language models have shown thrilling generation capabilities to compose coherent and meaningful long text Radford et al. (2019); Keskar et al. (2019); Zellers et al. (2019). However, in these models, users can only control the prefix or certain global aspects of the generated text. Generation is also prone to deviating from the topic and wandering off freely. For more coherent generation, one may wish to pre-define the semantics flowing in each part, or even more explicitly control the words appearing in each paragraph. To this end, researchers face an open challenge: generate long text with fine-grained control while simultaneously preserving state-of-the-art unconditional text generation capabilities.

Input: Prompt (Optional) Event_1 Event_2 Event_3
Output: Paragraph_1 Paragraph_2 Paragraph_3
Table 1: Outline to Story. The prompt is optional but a useful addition for global control; each event is given to control the corresponding paragraph.

Given the success of pre-training on a broad range of language processing tasks Peters et al. (2018); Radford et al. (2018); Devlin et al. (2018), a dominant paradigm has emerged: pre-training a transformer-based language model Vaswani et al. (2017) on a very large unlabeled corpus, then fine-tuning the model on task-specific supervised data. The paradigm sheds light on transfer learning that inherits the capabilities of pre-trained models. Researchers See et al. (2019); Ziegler et al. (2019) have accordingly studied the strength of massively pre-trained language models as long-text generators and demonstrated their unparalleled advantages in context conditioning and generation quality. However, the need for fine-grained controllable generation is still not fulfilled: the control handle remains a single prefix or a certain global aspect such as sentiment.

In this paper, we propose a new task called "Outline to Story" (O2S) as a test bed for fine-grained controllable generation of long text. Here, a story refers to open-domain multi-paragraph long text that is self-contained and semantically sound. Given a story outline as cascaded events, i.e., a sequence of events with one event per paragraph, where each event consists of a set of key words or phrases to appear in the corresponding paragraph, the task is to generate a multi-paragraph story that is highly conditioned on and consistent with the outline (Table 1). Fine-grained controllable generation is therefore performed through designated outlines. Compared with the simpler generation of short text, the task is challenging for the following reasons.

  • Firstly, we require the generated text to be much longer, i.e., from hundreds to a thousand words, to leverage state-of-the-art text generation ability. Longer text entails higher complexity and more flexibility in a broader output space.

  • Secondly, the generated story should be highly conditioned on the given cascaded events, while the events themselves lack explicit connections and details. The generation is more constrained than with a single prompt or prefix, since more control signals are imposed.

  • Lastly, a successful model should not only generate stories that serve the purpose, but also remain highly flexible and controllable during generation. For instance, humans can write stories from partial or incomplete outlines. Accordingly, an ideal O2S model should easily adapt to writing with only a fixed beginning event, with future events supplied on demand.

In order to tackle the proposed task, we also create dedicated datasets. While multi-paragraph narratives are generally available, large-scale human-annotated outlines are expensive and hard to obtain. We propose to use state-of-the-art keyword extraction techniques Mihalcea and Tarau (2004); Rose et al. (2010) to extract a set of keywords from each paragraph and pair them with the paragraph accordingly. Note that obtaining higher-quality outlines is an independent and parallel line of research, which is out of the scope of this paper.

Furthermore, we propose an extremely simple yet strong baseline method for the O2S task. Inspired by the usage of control codes and artificial tokens in recent literature Keskar et al. (2019); Tsutsui and Crandall (2017) to specify desired features of generated text, we use special tokens as delimiters to connect outline events with paragraphs in an interleaving manner. This leads to an augmented sequence for each outline-story pair. We fine-tune pre-trained language models with the original language modeling objective on the augmented sequences to learn the functionality of the special tokens and the co-occurrence structures between events and stories. Our method, "fine-tuning with special tokens" (FIST), has the following features.

  • Utilizing pre-trained language models provides cheap yet powerful long-text generation capability once transfer learning is performed.

  • Fine-tuning on augmented sequences naturally grasps the conditioning relationship from outline to story and preserves the generation capabilities of the pre-trained model. The method introduces no new parameters except the latent representations of several special tokens, and performs no architecture modification. Therefore, FIST has minimal architectural prerequisites, learning effort, and model adaptation cost, while still capturing strong conditional dependencies.

  • During the generation stage, the model itself is able to generate candidate outline events, and users retain full flexibility to modify, extend, and manipulate those events right before generating the succeeding paragraphs. Users are free to start with an incomplete outline; the level of human supervision is entirely up to the user's practical needs.

To summarize, our paper is, to our knowledge, among the first to study the "outline to story" task and to create new datasets for it. It fulfills the specific need for fine-grained controllable generation where conditions are short text. Recently, we noticed a concurrent work Rashkin et al. (2020) that proposes a similar task and a dedicated architecture with memory mechanisms; their method is compared empirically with ours in this paper. Extensive experiments demonstrate that, firstly, FIST achieves state-of-the-art conditional story generation performance; secondly, FIST shows outstanding flexibility and controllability in generation; lastly, FIST achieves comparable and even better metrics with a much simpler design than Rashkin et al. (2020). Our datasets and source code are publicly available at https://github.com/fangleai/Outline2Story.

The Task and Data

New Task: Outline to Story

The "Outline to Story" (O2S) task aims to build a strong conditional dependency between a given outline and a generated story. An outline consists of cascaded events, i.e., a sequence of events, each corresponding to a paragraph in the story. An event is formulated as a set of keywords or key phrases that summarize the main content of a paragraph. During training and generation, an outline is expected to be self-contained and follow a latent semantic flow. The challenge of the O2S task is to connect the cascaded events and fill their gaps with semantically and grammatically sound details. A high-quality story output is expected to condition strongly on the outline and be rich in content. However, it need not use all the key phrases in the outline, since a novel, creative, and coherent story following the given high-level semantic flow is of interest, rather than a unique particular story.

Our Proposed Datasets

Raw stories

Our task requires datasets of paired human-written outlines and contentful long narratives. Unfortunately, existing public datasets rarely have such a property. Although some tasks such as "data to text" Chan et al. (2019) and "table to text" Liu et al. (2019); Wang et al. (2020) consider similar conditional generation scenarios, the text in their datasets is generally too short and rigid. Those tasks emphasize faithful output from input data, rather than the generation capability of a model.

The neural story generation community has contributed several candidate datasets, such as WritingPrompts Fan et al. (2018); Mao et al. (2019), WikiPlots (https://github.com/markriedl/WikiPlots), and ROCStories Mostafazadeh et al. (2016); Yao et al. (2019). However, ROCStories consists only of five-line stories that are too short for our task. For the other two datasets:

  • WritingPrompts Fan et al. (2018) is a dedicated large-scale hierarchical story generation dataset collected from Reddit's "WritingPrompts" forum. Based on a prompt as a rough guide or starting point, stories are multi-paragraph narratives written by human users.

  • WikiPlots contains plots of books, movies, etc., extracted from English-language Wikipedia. Each plot is paired with a short title and given as one sentence per line without paragraphing. Therefore, the first pre-processing step is to segment each plot into paragraphs. We map each sentence to a fixed-length vector representation using BERT Devlin et al. (2018); Xiao (2018) and allocate adjacent sentences with low cosine proximity between their representations into different paragraphs (a minimal sketch of this heuristic follows the list). We also set a lower limit of 20 words per paragraph, since an English sentence averages about 20 words Cutts (2020); Campaign (2004); Vincent (2014).
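For concreteness, here is a minimal sketch of the similarity-based segmentation heuristic described above. The `embed` callable stands in for any sentence encoder returning a vector (the paper uses BERT via bert-as-service), and the 0.5 similarity threshold is an illustrative value, not the paper's setting.

```python
# Minimal sketch of the paragraph segmentation heuristic, assuming `embed`
# is any sentence encoder returning a 1-D vector; the 0.5 threshold and the
# 20-word minimum paragraph length are illustrative settings.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def segment_plot(sentences, embed, sim_threshold=0.5, min_words=20):
    """Group consecutive sentences into paragraphs; start a new paragraph when
    adjacent sentence embeddings are dissimilar and the current paragraph
    already contains at least `min_words` words."""
    if not sentences:
        return []
    vecs = [embed(s) for s in sentences]
    paragraphs, current = [], [sentences[0]]
    for prev_vec, vec, sent in zip(vecs, vecs[1:], sentences[1:]):
        long_enough = sum(len(s.split()) for s in current) >= min_words
        if cosine(prev_vec, vec) < sim_threshold and long_enough:
            paragraphs.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    paragraphs.append(" ".join(current))
    return paragraphs
```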

Keyword Extraction

Without human-annotated outlines, we use state-of-the-art keyword extraction techniques, such as TextRank Mihalcea and Tarau (2004) and RAKE Rose et al. (2010), to automatically extract an outline from a multi-paragraph story. Specifically, we choose RAKE (Rapid Automatic Keyword Extraction, https://pypi.org/project/rake-nltk/), an unsupervised, domain-independent, and language-independent method, to extract key phrases from each paragraph. At least for English documents, RAKE is shown to be more computationally efficient than TextRank while achieving higher precision and comparable recall Rose et al. (2010). We set the minimum and maximum phrase lengths to 1 and 4, respectively. We keep the number of extracted phrases linearly dependent on the paragraph length: one more key phrase is extracted for every additional two sentences, i.e., roughly 40 words on average. These settings can be tuned as hyperparameters; the guideline is to balance information completeness against generation creativity.
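A minimal sketch of this extraction step with the rake-nltk package referenced above; the exact phrase-count scaling rule shown here is our approximation of the description, not the released code.

```python
# Sketch of per-paragraph outline extraction with rake-nltk
# (pip install rake-nltk; requires the NLTK stopwords and punkt data).
# Phrase lengths are capped at 1-4 words as described above; the phrase-count
# rule below is an illustrative approximation of "one more phrase per ~40 words".
from rake_nltk import Rake

def extract_event(paragraph: str):
    n_words = len(paragraph.split())
    n_phrases = max(1, 1 + n_words // 40)  # roughly one more phrase per ~2 sentences
    rake = Rake(min_length=1, max_length=4)
    rake.extract_keywords_from_text(paragraph)
    return rake.get_ranked_phrases()[:n_phrases]

def extract_outline(paragraphs):
    """An outline is a list of events, one event (list of key phrases) per paragraph."""
    return [extract_event(p) for p in paragraphs]
```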

To summarize, an automatically extracted outline is expected to capture the key content and semantics flowing in a multi-paragraph story. Note that for both WritingPrompts and WikiPlots, we keep the given prompt or title of a story as an optional global control signal and refer to it as "the prompt" in the rest of the paper. Detailed dataset statistics are summarized in Table 2. The processed datasets will be released together with our source code for the community.

Dataset          Num. stories   Data split   Prompt avg. len.   Story avg. len.   Story avg. paragraphs   Event avg. phrases   Phrase avg. len.
WritingPrompts   303 K          90-5-5       25.4               674.5             6.3                     2.8                  2.8
WikiPlots        113 K          90-5-5       3.4                332.9             3.1                     3.3                  3.1

Table 2: Statistics of the two datasets.

Methodology

Figure 1: The architecture and an example of the augmented sequence for language modeling in FIST. Abbreviations: ST, SC, SEC, EC, and ET denote the special delimiter tokens (marking the prompt, the start of an event, the separator between key phrases, the end of an event, and the end of the text, respectively); E: an event; P: a paragraph; phr: a key phrase. Special tokens, colored in light red, are not split in tokenization and are learned together with normal text.
The interleaving manner:

ST Prompt

SC E_1 EC P_1

SC E_2 EC P_2

SC E_3 EC P_3

SC E_n EC P_n ET
The prepending manner:

ST Prompt

SC E_1 EC

SC E_2 EC

SC E_3 EC

SC E_n EC

P_1 P_2 P_3 … P_n ET
Table 3: Two ways to build augmented sequence for each outline-story pair. Abbreviations follow that in Figure 1.

At the core of our approach is the language modeling task, which is formulated as unsupervised distribution estimation over sequences of tokens. Given a set of language examples $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$, each composed of a variable-length sequence of tokens $x = (s_1, s_2, \dots, s_n)$, it is common to factorize the joint probability using the chain rule of conditional probabilities Bengio et al. (2003):

$$p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \dots, s_{i-1}) \qquad (1)$$

This decomposition allows both tractable sampling from and estimation of $p(x)$, as well as any conditionals of the form $p(s_i, \dots, s_n \mid s_1, \dots, s_{i-1})$. Various models are trained as distribution approximators $p_\theta$ to minimize the negative log-likelihood over a dataset $D = \{x^{(1)}, \dots, x^{(N)}\}$:

$$\mathcal{L}(D) = -\sum_{k=1}^{N} \log p_\theta\big(x^{(k)}\big) \qquad (2)$$

In recent years, large transformer-based models Vaswani et al. (2017), especially the GPT-2 model Radford et al. (2019), have shown superb capabilities to estimate these conditional probabilities, owing to their self-attention architecture as a powerful feature extractor. Considering the sequential ordering and inherent co-occurrence structure of natural language, language modeling has been a fundamental task in the field.

In this paper, we highlight that language modeling provides a flexible way to cast various conditional distribution estimations over sequences into a unified estimation framework. A general supervised estimation task, expressed as estimating p(output | input), can be solved by language modeling on the augmented sequence of (input, output) connected by delimiters. The key requirement is the sequential nature of both input and output, which is exactly the case for the "outline to story" task.
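To make this concrete, here is a short sketch in our own notation: let c denote the condition tokens (the outline plus its delimiters) and y = (y_1, ..., y_n) the story tokens that follow them in the augmented sequence. Modeling the augmented sequence with Eq. (1) then directly yields the conditional of interest:

```latex
% Sketch in our notation: c = condition (outline + delimiter) tokens,
% y = (y_1, ..., y_n) = story tokens placed after c in the augmented sequence.
% Applying Eq. (1) to the augmented sequence gives, for the story part,
\begin{equation*}
  p_\theta(y \mid c) \;=\; \prod_{j=1}^{n} p_\theta\!\left(y_j \mid c,\, y_{<j}\right),
\end{equation*}
% so decoding the continuation of the augmented sequence performs conditional
% generation without any architectural change.
```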

Fine-tune with Special Tokens

We adopt the common two-stage "pre-training + fine-tuning" paradigm. The pre-training exploits an abundant unlabeled language corpus to model open-domain unconditional sequences. Keeping language modeling as the fine-tuning task leads to minimal modification of the architecture and distribution space, which consequently ensures efficient model adaptation. Inspired by the usage of control codes and artificial tokens in recent literature Keskar et al. (2019); Tsutsui and Crandall (2017) to specify desired features of generated text, we use special tokens as delimiters to connect outline events with paragraphs. Consequently, each outline-story pair yields an augmented sequence for language modeling during fine-tuning. Our method of "fine-tuning with special tokens" (FIST) naturally grasps the conditional relationship from outline to story and preserves the generation capabilities of the pre-trained model.

Take the GPT-2 Radford et al. (2019) pre-trained model as an example. Like the "<|endoftext|>" token that indicates the end of the current article and the start of the next one, special tokens are not split during tokenization. Instead, they are learned together with normal tokens and encode certain semantic meanings in a profound way. For each outline-story pair, FIST builds an augmented sequence as follows:

  1. For each outline event, i.e., a set of key phrases, FIST builds an event sequence by separating the key phrases with a separator token (SEC in Figure 1), prepending a start-of-event token (SC) to initiate the event, and appending an end-of-event token (EC) to close it. The optional global prompt is initiated with its own special token (ST) and placed at the beginning of the first event sequence so that it can interact with the whole content of the story.

  2. For a list of event sequences and paragraphs, FIST joins them in an interleaving manner.

The architecture and an example of the augmented sequence are shown in Figure 1.
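As an illustration, a minimal sketch of building the interleaved augmented sequence. The literal token strings below are placeholders, since the paper specifies only the roles of the delimiters (ST, SC, SEC, EC, ET in Figure 1), not their spellings.

```python
# Minimal sketch of building a FIST augmented sequence in the interleaving
# manner of Figure 1 / Table 3. The token strings are placeholders standing in
# for the roles ST, SC, SEC, EC, ET; they are not the paper's literal tokens.
ST, SC, SEC, EC, ET = "_prompt_", "_event_", "_sep_", "_endevent_", "_endtext_"

def build_augmented_sequence(prompt, events, paragraphs):
    """events: list of key-phrase lists; paragraphs: list of strings."""
    parts = [ST, prompt] if prompt else []
    for phrases, paragraph in zip(events, paragraphs):
        parts += [SC, f" {SEC} ".join(phrases), EC, paragraph]
    parts.append(ET)
    return " ".join(parts)

# Example:
# build_augmented_sequence(
#     "A lonely lighthouse keeper",
#     [["storm", "broken lamp"], ["stranded ship", "rescue"]],
#     ["Paragraph one ...", "Paragraph two ..."])
```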

There are alternative ways to join a list of event sequences and paragraphs. As shown in Table 3, FIST advocates joining them in an interleaving manner. Another possible way is to prepend all event sequences at the beginning, before all paragraphs. This has the advantage of exposing all outline information from the very first paragraph. However, the disadvantages are also obvious: first, it weakens the paired, sequential correspondence between events and paragraphs; second, it implicitly requires all outline events to be fixed before even the first word is generated, hurting generation flexibility and controllability. We demonstrate the benefits of FIST's interleaving co-occurrence structure in the experiments section.

To summarize, special tokens can be flexibly designed to accommodate various co-occurrence structures; their representations are learned from scratch during fine-tuning and play a significant role in modeling the conditional dependence.

Training and Generation

With a pre-trained model, only light fine-tuning on relatively small supervised data is needed. At the training stage, FIST performs language modeling on the outline-story augmented sequences; at generation time, the story is sequentially decoded following the same format as the augmented sequences.

Since event sequences are inherently modeled as part of "the language", the FIST model is able to generate succeeding event sequences itself. During generation, users retain full flexibility to modify, extend, and manipulate those events right before generating the succeeding paragraphs. For instance, users can replace a generated event sequence, from its start token (SC) to its end token (EC), in place with their own event sequence. The FIST model may also wrap up a story with fewer or more paragraphs than the provided outline length, i.e., the number of events. We demonstrate through experiments the quality and characteristics of model-generated outline events. The level of human supervision is entirely up to the user's practical needs. Overall, the FIST model keeps the highest level of generation controllability and flexibility to fulfill the promise of fine-grained controllable generation.
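A minimal sketch of such an interactive decoding loop, assuming a fine-tuned model and tokenizer from the Huggingface library; the token strings, sampling values, and helper function are illustrative placeholders rather than the released implementation.

```python
# Sketch of the interactive decoding loop enabled by the interleaving format:
# the model proposes the next event, the user may edit it in place, and the
# next paragraph is decoded conditioned on the (possibly edited) event.
# `model` and `tokenizer` are assumed to be the fine-tuned GPT-2 pair;
# token strings and sampling settings are illustrative placeholders.
import torch

def continue_text(model, tokenizer, context, stop, max_new_tokens=300):
    """Sample a continuation of `context` and cut it at the first `stop` marker."""
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids, do_sample=True, top_k=50, top_p=0.95,
            max_length=input_ids.shape[1] + max_new_tokens,
            pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(output[0, input_ids.shape[1]:])
    return text.split(stop)[0].strip()

def interactive_story(model, tokenizer, prompt, user_edit=lambda e: e, n_paragraphs=3):
    context, story = f"_prompt_ {prompt}", []
    for _ in range(n_paragraphs):
        event = continue_text(model, tokenizer, context + " _event_", stop="_endevent_")
        event = user_edit(event)  # the user may rewrite, extend, or keep the event
        context += f" _event_ {event} _endevent_"
        paragraph = continue_text(model, tokenizer, context, stop="_event_")
        story.append(paragraph)
        context += " " + paragraph
    return story
```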

Related Work

Controllable Text Generation

A number of previous studies of controllable text generation have focused on certain global aspects of the text, most commonly sentiment and topic Shen et al. (2017); Zhao et al. (2018); Hu et al. (2017); Fang et al. (2019); Dathathri et al. (2019); Keskar et al. (2019). Researchers have also attempted fine-grained control with plots, plans, or so-called storylines Peng et al. (2018); Yao et al. (2019), leading to wide usage of and benchmarking on the five-line story dataset Mostafazadeh et al. (2016) and similar datasets with relatively short text. Later on, long story generation came to the frontier of conditional and controllable generation Fan et al. (2018, 2019). The task, named "neural story generation", is still under development and relatively under-explored.

So far, controllable generation with increasingly longer text and finer-grained attributes has not been well studied. This paper proposes the "outline to story" task to instantiate such research interest, with control conditions given as evolving short text. Recently, a concurrent work Rashkin et al. (2020) proposed a similar task and a dedicated architecture with memory mechanisms integrated with pre-trained language models. This paper benchmarks both models and promotes the application from outlines to open-domain long stories.

Transfer Learning on Language Models

Language modeling Bengio et al. (2003) has played an important role in natural language processing and understanding, especially when used as a pre-training technique. Word embeddings Mikolov et al. (2013) and contextual word vectors McCann et al. (2017); Howard and Ruder (2018); Peters et al. (2018) are early but significant products. Recent works on large Transformer architectures Vaswani et al. (2017); Radford et al. (2019); Devlin et al. (2018) have leveraged the power of both big models and big data to further improve language representation. A number of works also study different pre-training model bases Song et al. (2019); Dong et al. (2019); Keskar et al. (2019); Lample and Conneau (2019); Zellers et al. (2019). In terms of generation tasks, the GPT-2 model attracts huge attention due to its dedicated design for unconditional language generation. Researchers See et al. (2019); Mao et al. (2019); Ziegler et al. (2019) have studied transfer learning on GPT-2 models to solve conditional generation tasks. For instance, Ziegler et al. (2019) introduce a modification of the self-attention architecture for adapting a pre-trained model to arbitrary conditional input beyond text. However, we note that how to leverage pre-trained language models for fine-grained controllable generation of long text is still an open problem.

Experiments and Discussions

Experimental Settings

We first evaluate various models' story generation capability; we then evaluate and emphasize controllability and flexibility at generation (i.e., inference) time.

We conduct experiments on WritingPrompts and WikiPlots, as introduced in the "Task and Data" section, which meet our target of open-domain long-text corpora with paired outlines.

We note that our model, which generates a story from an outline, does not necessarily need a prompt as part of its input. However, since conventional story generation does use a prompt as input, and the concurrent work Rashkin et al. (2020) also uses the prompt as part of its outlines, we always conduct experiments with the prompt as a useful additional global control, for a fair comparison between models.

We use the smallest public version of GPT-2 as the pre-trained model base, a large auto-regressive transformer-based language model trained on 40 GB of non-Wikipedia text Radford et al. (2019). It is, to our knowledge, the most widely used model base in the relevant literature. Note that there are many other pre-trained language models that may be larger and more powerful than the one we use Song et al. (2019); Dong et al. (2019); Keskar et al. (2019); Lample and Conneau (2019); Zellers et al. (2019). However, the purpose of our experiments is not to compare the power of different pre-trained model bases; investigating them is orthogonal to this work.

Benchmark Models

We compare models with designated purposes. By comparing with a specialized-architecture, task-specific story generation model Fan et al. (2018), we evaluate in-domain generation performance. The Fusion model of Fan et al. (2018) uses a convolutional seq2seq structure with a fusion training mechanism.

By comparing with the concurrent work Rashkin et al. (2020), we compare FIST with PlotMachines, a dedicated architecture that uses memory mechanisms for outline-conditioned story generation. Note that the benchmarked PlotMachines uses the same GPT-2 model base to construct its architecture.

By comparing with a state-of-the-art transfer learning method based on GPT-2 models Ziegler et al. (2019), we evaluate different ways to absorb input for conditional generation. The encoder-agnostic model adaptation of Ziegler et al. (2019) has the advantage of absorbing arbitrary conditional input beyond text. Its key component is pseudo self-attention (PSA), which introduces new projection matrices to absorb input embeddings into the self-attention framework.

By comparing different inputs, we perform an ablation study on the input side to evaluate the effect of outlines in conditional generation. One input source is only the prompt; the other is "prompt + outline" connected with the necessary delimiters.

By comparing with another way to build the augmented sequence for fine-tuning, we perform an ablation study on the FIST model side. The alternative is the prepending manner shown in Table 3, whereas FIST joins cascaded events and paragraphs in an interleaving manner.

Columns: Perplexity (Word / BPE), BLEU-4, ROUGE-1 (F1 / P / R), ROUGE-2 (F1 / P / R), ROUGE-L (F1 / P / R)

Dataset: WritingPrompts
Fusion (only prompt)       36.0 / -      0.372   0.223 / 0.386 / 0.157   0.038 / 0.074 / 0.026   0.206 / 0.358 / 0.145
Fusion (prompt + outline)  33.3 / -      0.375   0.268 / 0.455 / 0.190   0.063 / 0.119 / 0.043   0.250 / 0.424 / 0.177
PSA (only prompt)          31.6 / 21.3   0.387   0.265 / 0.316 / 0.228   0.047 / 0.054 / 0.041   0.248 / 0.296 / 0.213
PSA (prompt + outline)     30.9 / 20.9   0.389   0.265 / 0.327 / 0.223   0.048 / 0.056 / 0.042   0.249 / 0.307 / 0.209
PlotMachines               -    / -      -       -     / -     / 0.311   -     / -     / 0.067   -     / -     / 0.261
FIST (only prompt)         30.2 / 25.9   0.356   0.181 / 0.339 / 0.123   0.023 / 0.046 / 0.015   0.170 / 0.321 / 0.116
FIST (both, prepend)       18.9 / 16.6   0.382   0.299 / 0.324 / 0.277   0.069 / 0.070 / 0.069   0.283 / 0.308 / 0.262
FIST                       20.3 / 17.6   0.377   0.294 / 0.347 / 0.255   0.070 / 0.078 / 0.063   0.279 / 0.330 / 0.242

Dataset: WikiPlots
Fusion (only prompt)       108.2 / -     0.333   0.185 / 0.185 / 0.185   0.026 / 0.027 / 0.025   0.150 / 0.149 / 0.151
Fusion (prompt + outline)  79.1  / -     0.342   0.232 / 0.244 / 0.221   0.057 / 0.059 / 0.056   0.185 / 0.194 / 0.177
PSA (only prompt)          79.5  / 47.8  0.326   0.188 / 0.188 / 0.189   0.026 / 0.025 / 0.027   0.172 / 0.171 / 0.173
PSA (prompt + outline)     79.2  / 47.7  0.344   0.185 / 0.186 / 0.185   0.024 / 0.023 / 0.026   0.167 / 0.168 / 0.167
PlotMachines               -     / -     -       -     / -     / 0.228   -     / -     / 0.065   -     / -     / 0.175
FIST (only prompt)         38.9  / 26.5  0.265   0.166 / 0.253 / 0.124   0.018 / 0.032 / 0.013   0.150 / 0.231 / 0.111
FIST (both, prepend)       26.0  / 18.5  0.346   0.281 / 0.284 / 0.279   0.083 / 0.081 / 0.086   0.254 / 0.257 / 0.252
FIST                       26.7  / 18.9  0.333   0.275 / 0.289 / 0.262   0.081 / 0.084 / 0.078   0.248 / 0.261 / 0.237

Table 4: Automatic metrics for conditional story generation evaluated on two datasets.

Implementation Details

We implement FIST using the "Huggingface Transformers" library in PyTorch Wolf et al. (2019). Special tokens are conveniently added to the original 50K-BPE GPT-2 vocabulary. All GPT-2 model settings remain the same. Other models are re-trained and re-evaluated on the same datasets using their own implementation repositories where needed. In evaluation, we generate stories using the top-k, top-p random sampling scheme Holtzman et al. (2019); Keskar et al. (2019), together with temperature smoothing.
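For reference, a minimal sketch of how special tokens can be registered with the Huggingface library as described above; the token strings are placeholders, and the new embedding rows are learned during fine-tuning.

```python
# Minimal sketch of registering FIST's special delimiter tokens with the
# Huggingface Transformers library. The token strings are placeholders; their
# embeddings are learned during fine-tuning, which otherwise runs standard
# causal language modeling on the augmented sequences.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

special_tokens = ["_prompt_", "_event_", "_sep_", "_endevent_", "_endtext_"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))  # add trainable rows for the new tokens
```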

Considering the two relatively large test sets, we decode only one story per test input. When testing an outline-story pair with FIST, we replace each model-generated outline event in place with the corresponding automatically extracted outline event from the test story.

Columns: Story avg. paragraphs, Event avg. phrases, Phrase avg. len., BLEU-4 (story / event), ROUGE-1 (F1 / P / R), ROUGE-2 (F1 / P / R), ROUGE-L (F1 / P / R)

Dataset: WritingPrompts
FIST-   2.9   2.4   3.0   0.377 / 0.13   0.220 / 0.333 / 0.200   0.044 / 0.061 / 0.044   0.184 / 0.318 / 0.190
FIST-   2.9   2.4   3.0   0.371 / 0.12   0.210 / 0.323 / 0.191   0.039 / 0.055 / 0.038   0.173 / 0.308 / 0.181
FIST-   3.0   2.4   2.9   0.358 / 0.04   0.210 / 0.328 / 0.187   0.038 / 0.055 / 0.037   0.172 / 0.313 / 0.178

Dataset: WikiPlots
FIST-   2.9   2.6   3.3   0.333 / 0.21   0.239 / 0.263 / 0.246   0.058 / 0.062 / 0.065   0.203 / 0.240 / 0.225
FIST-   2.9   2.6   3.3   0.325 / 0.20   0.229 / 0.254 / 0.239   0.053 / 0.057 / 0.061   0.191 / 0.232 / 0.218
FIST-   2.9   2.5   3.2   0.321 / 0.06   0.229 / 0.257 / 0.234   0.053 / 0.058 / 0.059   0.192 / 0.234 / 0.213

Table 5: Automatic metrics for event analysis.

Automatic and Qualitative Evaluation

Automatic Metrics

We basically evaluate the following automatic metrics towards target stories:

  • Perplexity (PPL) is used to evaluate language models and is often regarded as a proxy for generation quality. All models based on GPT-2 use the BPE tokenization scheme, whose PPL values are not directly comparable with previous models such as Fan et al. (2018), which compute PPL at the natural-word level. Similar to See et al. (2019), we additionally compute the word-level perplexity of GPT-2 models to enable this comparison: we normalize the total negative log probability of the target text by the number of word-level tokens (see the sketch after this list). To ensure fairness, when evaluating perplexity on augmented sequences with FIST, we count only story tokens, not the tokens of the event sequences, since the other models count only story tokens.

  • BLEU and ROUGE scores are computed as n-gram overlap between generated and target stories. We compute 4-gram BLEU using the "NLTK" library and ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L) following Lin and Hovy (2002). The ROUGE scores include precision (P), recall (R), and F1, with ROUGE precision having an interpretation similar to BLEU.
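A minimal sketch of the word-level perplexity conversion described in the first bullet; it assumes the total negative log-likelihood has already been accumulated (in nats) over the story's BPE tokens only.

```python
# Sketch of the word-level perplexity conversion: the total negative
# log-likelihood (in nats), accumulated over the story's BPE tokens only,
# is normalized by the number of natural-word tokens instead of BPE tokens.
import math

def word_level_perplexity(total_story_nll_nats: float, story_text: str) -> float:
    n_words = len(story_text.split())
    return math.exp(total_story_nll_nats / n_words)
```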

Automatic Evaluation Results

The automatic evaluation results are presented in Table 4. Overall, the proposed FIST with full input source (prompt and outline) generally achieves better results, with lower PPL, higher BLEU and higher ROUGE scores, demonstrating state-of-the-art conditional story generation performance.

Methods based on pre-trained models, i.e., PSA, PlotMachines, and FIST, show better overall performance with relatively less task-specific effort than the Fusion models, demonstrating the power of large-scale pre-training. The Fusion model still shows relatively high precision scores on WritingPrompts, due to its dedicated design for this dataset.

FIST achieves comparable or even better results than PlotMachines, with a much simpler approach based solely on language modeling rather than a dedicated architecture. The comparison on automatic metrics thus poses an interesting, open-ended question: does a dedicated architecture really help as much as expected?

With full input sources, the FIST model generally outperforms PSA, showing that simple language modeling on naturally co-occurring text conditions can capture conditional dependence to a considerable extent. In other words, architecture modification is not always necessary when the conditions are textual. With only the prompt, FIST cannot beat PSA, showing that co-occurrence structures marked by special tokens work poorly with insufficient information and structural exposure. Therefore, the FIST method may be better suited to scenarios where text conditions are strong and rich, for instance, longer and finer-grained input conditions.

The performance gap caused by the two different input sources varies between models: both the Fusion model and PSA show limited improvement when provided with full inputs, while the FIST model shows a relatively larger gap. This is reasonable, since FIST works directly and uniformly on augmented sequences and therefore more readily reflects the enrichment of the input information, whereas the other models "separate" input and story and are less sensitive to the input.

The FIST model with the interleaving joining mechanism shows a slight performance drop compared with the prepending manner, because the latter exposes more information throughout decoding. We will see that this drop is an acceptable sacrifice, as the interleaving mechanism affords much greater decoding flexibility and controllability.

Qualitative Evaluation

We also present generation examples on the test datasets in Tables 6 and 9 in the Appendix. The stories are semantically and grammatically sound and, moreover, highly conditioned on and consistent with the given outlines. Due to the paired nature of extracted outlines and stories, the model shows a clear grasp of the outline at least at the lexical level. Tables 7, 8, 10, and 11 provide further examples. A large-scale human evaluation is underway; it is quite expensive due to the text length and scale.

Analysis on Automatically Generated Events

Since event sequences are inherently modeled as part of "the language", the FIST model can generate succeeding event sequences on its own, on which future story paragraphs and events are conditioned. We perform event analysis under various decoding scenarios that involve different levels of event supervision and utilization.

  • FIST-: exactly the test setting of the previous evaluation section; the prompt and all automatically extracted outline events are given in chronological order.

  • FIST-: the prompt and the first event sequence are given; the remaining test events are also given, but randomly shuffled in order.

  • FIST-: only the prompt and the first event sequence are given; the model relies on its own generated event sequences from the second paragraph to the end.

We compare automatic metrics of stories generated under the different decoding scenarios against target stories in Table 5. Metrics include the statistics of Tables 2 and 4, plus an extra BLEU-4 score between model-generated and target extracted event sequences.

We observe that statistics about paragraphs and phrases are quite uniform across decoding scenarios, implying that the model generates events in the correct format. Since the GPT-2 model allows a limited context window of at most 1,024 tokens, generated stories in all decoding scenarios consist of roughly three paragraphs. From FIST- to FIST- and finally FIST-, less overlap between generated and target event sequences is observed, along with a steady but minor drop in metrics. We conclude that, in terms of capturing the content and semantics flowing in target stories, events automatically extracted from the target stories have a certain advantage, while model-generated events show close and considerable semantic consistency. Note that the comparison may be biased, since the limited context window may narrow the metric gap.

In the Appendix, generation examples from the two newly introduced decoding scenarios are presented in Tables 7, 8, 10, and 11. Target extracted events are colored red and model-generated events blue. A reasonable quality drop is observed from FIST- to FIST- and finally FIST-.

To emphasize, with the ability to feed in events in chronological order and to generate events automatically, the FIST model has a unique advantage in flexibility and controllability over other models, including PlotMachines.

Conclusion

In this paper, we propose the task of "outline to story" and create datasets for it, to instantiate the research interest in fine-grained controllable generation of open-domain long text with short-text control conditions. By leveraging pre-trained language models, our method FIST provides an extremely simple yet strong baseline for the task. More effort toward the challenging case of open-domain long text should be invested in future work.

References

  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of Machine Learning Research 3 (Feb), pp. 1137–1155. Cited by: Methodology, Transfer Learning on Language Models.
  • P. E. Campaign (2004) How to write in plain english. Plain English Campaign. Cited by: 2nd item.
  • Z. Chan, X. Chen, Y. Wang, J. Li, Z. Zhang, K. Gai, D. Zhao, and R. Yan (2019) Stick to the facts: learning towards a fidelity-oriented e-commerce product description generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4960–4969. Cited by: Raw stories.
  • M. Cutts (2020) Oxford guide to plain english. Oxford University Press, USA. Cited by: 2nd item.
  • S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2019) Plug and play language models: a simple approach to controlled text generation. In International Conference on Learning Representations, Cited by: Controllable Text Generation.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Introduction, 2nd item, Transfer Learning on Language Models.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13042–13054. Cited by: Transfer Learning on Language Models, Experimental Settings.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 889–898. Cited by: Raw stories, Controllable Text Generation, 1st item, Benchmark Models.
  • A. Fan, M. Lewis, and Y. Dauphin (2019) Strategies for structuring story generation. arXiv preprint arXiv:1902.01109. Cited by: Controllable Text Generation.
  • L. Fang, C. Li, J. Gao, W. Dong, and C. Chen (2019) Implicit deep latent variable models for text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3937–3947. Cited by: Controllable Text Generation.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. In International Conference on Learning Representations, Cited by: Implementation Details.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Cited by: Transfer Learning on Language Models.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596. Cited by: Controllable Text Generation.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) Ctrl: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: Introduction, Introduction, Fine-tune with Special Tokens, Controllable Text Generation, Transfer Learning on Language Models, Implementation Details, Experimental Settings.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: Transfer Learning on Language Models, Experimental Settings.
  • C. Lin and E. Hovy (2002) Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization-Volume 4, pp. 45–51. Cited by: 2nd item.
  • T. Liu, F. Luo, P. Yang, W. Wu, B. Chang, and Z. Sui (2019) Towards comprehensive description generation from factual attribute-value tables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5985–5996. Cited by: Raw stories.
  • H. H. Mao, B. P. Majumder, J. McAuley, and G. Cottrell (2019) Improving neural story generation by targeted common sense grounding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5990–5995. Cited by: Raw stories, Transfer Learning on Language Models.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305. Cited by: Transfer Learning on Language Models.
  • R. Mihalcea and P. Tarau (2004) Textrank: bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 404–411. Cited by: Introduction, Keyword Extraction.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: Transfer Learning on Language Models.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696. Cited by: Raw stories, Controllable Text Generation.
  • N. Peng, M. Ghazvininejad, J. May, and K. Knight (2018) Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, pp. 43–49. Cited by: Controllable Text Generation.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237. Cited by: Introduction, Transfer Learning on Language Models.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: Introduction.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: Introduction, Fine-tune with Special Tokens, Methodology, Transfer Learning on Language Models, Experimental Settings.
  • H. Rashkin, A. Celikyilmaz, Y. Choi, and J. Gao (2020) PlotMachines: outline-conditioned generation with dynamic plot state tracking. arXiv preprint arXiv:2004.14967. Cited by: Introduction, Controllable Text Generation, Benchmark Models, Experimental Settings.
  • S. Rose, D. Engel, N. Cramer, and W. Cowley (2010) Automatic keyword extraction from individual documents. Text mining: applications and theory 1, pp. 1–20. Cited by: Introduction, Keyword Extraction.
  • A. See, A. Pappu, R. Saxena, A. Yerukola, and C. D. Manning (2019) Do massively pretrained language models make better storytellers?. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 843–861. Cited by: Introduction, Transfer Learning on Language Models, 1st item.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In Advances in neural information processing systems, pp. 6830–6841. Cited by: Controllable Text Generation.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pp. 5926–5936. Cited by: Transfer Learning on Language Models, Experimental Settings.
  • S. Tsutsui and D. Crandall (2017) Using artificial tokens to control languages for multilingual image caption generation. arXiv preprint arXiv:1706.06275. Cited by: Introduction, Fine-tune with Special Tokens.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Introduction, Methodology, Transfer Learning on Language Models.
  • S. Vincent (2014) Sentence length: why 25 words is our limit. Inside GOV.UK blog. Retrieved from: https://insidegovuk.blog.gov.uk/2014/08/04/sentence-length-why-25-words-is-our-limit. Cited by: 2nd item.
  • Z. Wang, X. Wang, B. An, D. Yu, and C. Chen (2020) Towards faithful neural table-to-text generation with content-matching constraints. External Links: 2005.00969 Cited by: Raw stories.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: Implementation Details.
  • H. Xiao (2018) Bert-as-service. Note: https://github.com/hanxiao/bert-as-service Cited by: 2nd item.
  • L. Yao, N. Peng, R. Weischedel, K. Knight, D. Zhao, and R. Yan (2019) Plan-and-write: towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7378–7385. Cited by: Raw stories, Controllable Text Generation.
  • R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2019) Defending against neural fake news. In Advances in Neural Information Processing Systems, pp. 9051–9062. Cited by: Introduction, Transfer Learning on Language Models, Experimental Settings.
  • J. Zhao, Y. Kim, K. Zhang, A. Rush, and Y. LeCun (2018) Adversarially regularized autoencoders. In International Conference on Machine Learning, pp. 5902–5911. Cited by: Controllable Text Generation.
  • Z. M. Ziegler, L. Melas-Kyriazi, S. Gehrmann, and A. M. Rush (2019) Encoder-agnostic adaptation for conditional language generation. arXiv preprint arXiv:1908.06938. Cited by: Introduction, Transfer Learning on Language Models, Benchmark Models.