Strategies for Structuring Story Generation

02/04/2019 ∙ by Angela Fan, et al. ∙ Google, Facebook

Writers generally rely on plans or sketches to write long stories, but most current language models generate word by word from left to right. We explore coarse-to-fine models for creating narrative texts of several hundred words, and introduce new models which decompose stories by abstracting over actions and entities. The model first generates the predicate-argument structure of the text, where different mentions of the same entity are marked with placeholder tokens. It then generates a surface realization of the predicate-argument structure, and finally replaces the entity placeholders with context-sensitive names and references. Human judges prefer the stories from our models to a wide range of previous approaches to hierarchical text generation. Extensive analysis shows that our methods can help improve the diversity and coherence of events and entities in generated stories.




1 Introduction

Stories exhibit structure at multiple levels. While existing language models can generate stories with good local coherence, they struggle to coalesce individual phrases into coherent plots or even maintain consistency of the characters throughout the story. One reason for this failure is that classical language models generate the whole story at the word level, which makes it difficult to capture the high-level interactions between the plot points.

Figure 1: Proposed Model. Conditioned upon the prompt, we generate sequences of predicates and arguments. Then, a story is generated with placeholder entities such as ent0. Finally we replace the placeholders with specific references.

To address this, we investigate novel decompositions of the story generation process that break down the problem into a series of easier coarse-to-fine generation problems. These decompositions can offer three advantages:

  • They allow more abstract representations to be generated first, in which challenging long-range dependencies may be more apparent.

  • They allow specialized modelling techniques for the different stages, which exploit the structure of the specific sub-problem.

  • They are applicable to any textual dataset and require no manual labelling.

Several hierarchical models for story generation have recently been proposed Xu et al. (2018); Yao et al. (2019), but it is not well understood which properties characterize a good decomposition. We therefore implement and evaluate several representative approaches based on keyword extraction, sentence compression, and summarization.

We build on this understanding to devise the proposed decomposition (Figure 1). Our approach breaks down the generation process into three steps: modelling the action sequence, the narrative, and then entities (such as story characters). To model action sequences, we first generate the predicate-argument structure of the story by generating a sequence of verbs and arguments. This representation is more structured than free text, making it easier for the model to learn dependencies across events. To model entities, we initially generate a version of the story where different mentions of the same entity are replaced with placeholder tokens. Finally, we re-write these tokens into different references for the entity, based on both its previous mentions and global story context.

The models are trained on a large dataset of 300k stories, and we evaluate quality both in terms of human judgments and using automatic metrics. We find that our novel approach leads to significantly better story generation. Specifically, we show that generating the action sequence first makes the model less prone to generating generic events, leading to a much greater diversity of verbs. We also find that by using sub-word modelling for the entities, our model can produce novel names for locations and characters that are appropriate given the story context.

2 Models

The crucial challenge of long story generation lies in maintaining coherence across a large number of generated sentences—in terms of both the logical flow of the story, and the characters and entities. While there has been much recent progress in left-to-right text generation, particularly using self-attentive architectures Dai et al. (2018); Liu et al. (2018), we find that models still struggle to maintain coherence to produce interesting stories on par with human writing. We therefore introduce strategies to decompose neural story generation into coarse-to-fine steps to make modelling high-level dependencies easier.

2.1 Tractable Decompositions

In general, we can decompose the generation process by converting a story $x$ into a more abstract representation $z$. The negative log likelihood of the decomposed problem is given by

$$\mathcal{L} = -\log \sum_{z} p(x \mid z)\, p(z)$$

We can generate from this model by first sampling from $p(z)$ and then sampling from $p(x \mid z)$. However, the marginalization over $z$ is in general intractable, except in special cases where every $x$ can only be generated by a single $z$ (for example, if the transformation removed all occurrences of certain tokens). Instead, we minimize a variational upper bound of the loss by constructing a deterministic posterior $q(z \mid x) = 1_{z = z^*}$, where $z^*$ can be given by running a semantic role labeller or coreference resolution system on $x$. Put together, we optimize the following loss:

$$z^* = \arg\max_{z} p(z \mid x)$$
$$\mathcal{L} \le -\log p(x \mid z^*) - \log p(z^*)$$

This approach allows models $p(z^*)$ and $p(x \mid z^*)$ to be trained tractably and separately.
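The upper bound used here can be checked in one step. As a sketch, write $x$ for the story, $z$ for an abstract representation, and $z^*$ for the single representation extracted from $x$ (these symbol names are assumed for illustration). Every term of the marginal sum is non-negative, so keeping only the $z^*$ term can only shrink the sum, and therefore can only increase its negative log:

```latex
\begin{align*}
-\log \sum_{z} p(x \mid z)\, p(z)
  &\le -\log \big( p(x \mid z^*)\, p(z^*) \big) \\
  &= -\log p(x \mid z^*) \;-\; \log p(z^*)
\end{align*}
```

The two terms on the last line share no parameters, which is why the model over representations and the model over stories can be fitted independently.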

2.2 Model Architectures

We build upon the convolutional sequence-to-sequence architecture Gehring et al. (2017). Deep convolutional networks are used as the encoder and decoder, connected with an attention module Bahdanau et al. (2015) that performs a weighted sum of the encoder output. The decoder uses a gated multi-head self-attention mechanism Vaswani et al. (2017); Fan et al. (2018) to allow the model to refer to previously generated words and improve the ability to model long-range context.

2.3 Modelling Action Sequences

Figure 2: Verb-Attention. To improve the model’s ability to condition upon past verbs, one head of the decoder’s self-attention mechanism is specialized to only attend to previously generated verbs.

To decompose a story into a structured form that emphasizes logical sequences of actions, we use Semantic Role Labeling (SRL). SRL identifies predicates and arguments in sentences, and assigns each argument a semantic role. This representation abstracts over different ways of expressing the same semantic content. For example, John ate the cake and the cake that John ate would receive identical semantic representations.

Conditioned upon the prompt, we generate an SRL decomposition of the story by concatenating the predicates and arguments identified by a pretrained model He et al. (2017); Tan et al. (2018) (separate tools are used for predicate identification and for SRL given predicates) and separating sentences with delimiter tokens. We place the predicate verb first, followed by its arguments in canonical order. To focus on the main narrative, we retain only core arguments.
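The linearization described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the frame format, the role labels in `CORE_ROLES`, the `<V>`-style markers, and the `</s>` delimiter are all assumptions chosen for the example.

```python
# Hypothetical SRL output: one list of frames per sentence; each frame has a
# "verb" and a mapping from role label to argument text.
CORE_ROLES = ("ARG0", "ARG1", "ARG2", "ARG3", "ARG4")  # assumed core roles
SENT_DELIM = "</s>"  # illustrative sentence delimiter token


def linearize_frame(frame):
    """Emit the predicate first, then core arguments in a fixed order."""
    parts = ["<V> " + frame["verb"]]
    for role in CORE_ROLES:
        if role in frame["args"]:
            parts.append("<" + role + "> " + frame["args"][role])
    return " ".join(parts)


def build_action_plan(sentences):
    """Concatenate frames per sentence and join sentences with a delimiter."""
    plan = [" ".join(linearize_frame(f) for f in frames) for frames in sentences]
    return (" " + SENT_DELIM + " ").join(plan)


story_frames = [
    [{"verb": "ate", "args": {"ARG0": "ent0", "ARG1": "the cake", "ARGM-TMP": "yesterday"}}],
    [{"verb": "left", "args": {"ARG0": "ent0"}}],
]
print(build_action_plan(story_frames))
```

Non-core arguments such as the temporal modifier in the first frame are dropped, which keeps the plan focused on who did what.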

Verb Attention Mechanism

SRL parses are more structured than free text, allowing scope for more structured models. To encourage the model to consider sequences of verbs, we designate one of the heads of the decoder’s multihead self-attention to be a verb-attention head (see Figure 2). By masking the self-attention appropriately, this verb-attention head can only attend to previously generated verbs. When the text does not yet have a verb, the model attends to a padding token. We show that focusing on verbs with a specific attention head generates a more diverse array of verbs and reduces repetition in generation.
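One way to build such a mask is sketched below. It is an assumption-laden illustration rather than the paper's implementation: the function name, the boolean-list input, and the choice of reserving slot 0 for padding are hypothetical, and "previously generated" is read as strictly earlier positions.

```python
def verb_attention_mask(is_verb):
    """Return mask[i][j] == True when target position i may attend to slot j.

    Slot 0 is a padding slot; slot j + 1 corresponds to token j. Position i
    may attend only to verbs strictly before it, and falls back to the
    padding slot when no verb has been generated yet.
    """
    n = len(is_verb)
    mask = []
    for i in range(n):
        row = [False] * (n + 1)
        for j in range(i):
            if is_verb[j]:
                row[j + 1] = True
        if not any(row):
            row[0] = True  # no earlier verb: attend to padding
        mask.append(row)
    return mask


# Tokens: "ent0 walked home and slept"
flags = [False, True, False, False, True]
mask = verb_attention_mask(flags)
```

In a real decoder the boolean rows would be converted to additive masks (zero for allowed slots, a large negative value otherwise) before the softmax.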

Figure 3: Input for Coreferent entity reference generation. The model has a representation of the entity context in a bag of words form, all previous predicted values for the same anonymized entity token, and the full text story. The green circle represents the entity mention the model is attempting to fill.

2.4 Modelling Entities

The challenge of modelling characters throughout a story is twofold. First, entities such as character names are rare tokens, which makes them hard to model for neural language models. Human stories often feature novel character or location names. Second, maintaining the consistency of a specific set of characters is difficult, as the same entity may be referenced by many different strings throughout a story; for example, Bilbo Baggins, he, and the hobbit may refer to the same entity. It is challenging for existing language models to track which words refer to which entity purely from a language modelling objective.

We address both problems by first generating a form of the story with different mentions of the same entity replaced by a placeholder token (e.g. ent0), similar to Hermann et al. (2015). We then use a sub-word seq2seq model trained to replace each mention with a reference, based on its context. The sub-word model is better equipped to model rare words and the placeholder tokens make maintaining consistency easier.

2.4.1 Generating Entity Anonymized Stories

We explore two approaches to identifying and clustering entities:

  • NER Entity Anonymization: We use a named entity recognition (NER) model (specifically, spaCy with en_core_web_lg) to identify all people, organizations, and locations. We replace these spans with placeholder tokens (e.g. ent0). If any two entity mentions have an identical string, we replace them with the same placeholder. For example, all mentions of Bilbo Baggins will be abstracted to the same entity token, but Bilbo would be a separate abstract entity.

  • Coreference-based Entity Anonymization: The above approach cannot detect different mentions of an entity that use different strings. Instead, we use the Coreference Resolution model from Lee et al. (2018) to identify clusters of mentions. All spans in the same cluster are then replaced with the same entity placeholder string. Coreference models do not detect singleton mentions, so we also replace non-coreferent named entities with unique placeholders.
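The string-matching variant can be sketched in a few lines. This is an illustration only: the span format (token offsets with an exclusive end) and the function name are hypothetical, and the spans are supplied by hand where a real pipeline would obtain them from an NER system.

```python
def anonymize_by_string(tokens, entity_spans):
    """Replace entity spans with placeholders; identical strings share one.

    tokens: list of str. entity_spans: non-overlapping (start, end) token
    offsets, end exclusive, as a hypothetical NER system might return.
    """
    placeholder_for = {}
    out = []
    cursor = 0
    for start, end in sorted(entity_spans):
        out.extend(tokens[cursor:start])
        surface = " ".join(tokens[start:end])
        if surface not in placeholder_for:
            placeholder_for[surface] = "ent" + str(len(placeholder_for))
        out.append(placeholder_for[surface])
        cursor = end
    out.extend(tokens[cursor:])
    return out, placeholder_for


tokens = "Bilbo Baggins met Gandalf . Bilbo smiled at Gandalf .".split()
spans = [(0, 2), (3, 4), (5, 6), (8, 9)]
anonymized, mapping = anonymize_by_string(tokens, spans)
print(" ".join(anonymized))
```

Because matching is by exact string, "Bilbo Baggins" and "Bilbo" receive different placeholders here; the coreference-based variant would instead key the placeholder on a cluster identifier.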

2.4.2 Generating Entity References in a Story

We train models to replace placeholder entity mentions with the correct surface form, for both NER-based and coreference-based entity anonymised stories. Both our models use a seq2seq architecture that generates an entity reference based on its placeholder and the story. To better model the specific challenges of entity generation, we also make use of a pointer mechanism and sub-word modelling.

Pointer Mechanism

Generating multiple consistent mentions of rare entity names is challenging. To make it easier for the model to re-use previous names for an entity, we augment the standard seq2seq decoder with a pointer-copy mechanism Vinyals et al. (2015). To generate an entity reference, the decoder can either generate a new entity string or choose to copy an already generated entity reference, which encourages the model to use consistent naming for the entities.

To train the pointer mechanism, the final hidden state of the model $h$ is used as input to a classifier $p_{\text{copy}}(h) = \sigma(w_{\text{copy}} \cdot h)$, where $w_{\text{copy}}$ is a fixed dimension parameter vector. When the model classifier predicts to copy, the previously decoded entity token with the maximum attention value is copied. One head of the decoder multi-head self-attention mechanism is used as the pointer attention head, to allow the heads to specialize.
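The copy decision can be illustrated with a small sketch. It assumes a logistic classifier over a dot product between the hidden state and a parameter vector, with a 0.5 decision threshold; the function names, the threshold, and the list-based inputs are hypothetical choices for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def copy_or_generate(hidden, w_copy, pointer_attention, previous_refs, threshold=0.5):
    """Decide between copying an earlier reference and generating a new one.

    hidden: final decoder hidden state (list of floats).
    w_copy: parameter vector of the same dimension (assumed linear classifier).
    pointer_attention: one attention weight per entry of previous_refs.
    Returns ("copy", reference) or ("generate", None).
    """
    p_copy = sigmoid(sum(h * w for h, w in zip(hidden, w_copy)))
    if previous_refs and p_copy > threshold:
        # Copy the earlier reference that receives the most pointer attention.
        best = max(range(len(previous_refs)), key=lambda i: pointer_attention[i])
        return "copy", previous_refs[best]
    return "generate", None


refs = ["Bilbo", "the hobbit", "he"]
attention = [0.1, 0.7, 0.2]
decision = copy_or_generate([1.0, 2.0], [0.5, 0.5], attention, refs)
```

When no reference has been produced yet for the entity, the sketch always falls back to generation, since there is nothing to copy.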

Sub-word Modelling

Entities are often rare or novel words, so word-based vocabularies can be inadequate. We compare entity generation using word-based, byte-pair encoding (BPE) Sennrich et al. (2015), and character-level models.

NER-based Entity Reference Generation

Here, each placeholder string should map onto one (possibly multiword) surface form; e.g. all occurrences of the placeholder ent0 should map to only a single string, such as Bilbo Baggins. We train a simple model that maps a combination of placeholder token and story (with anonymized entities) to the surface form of the placeholder. While the placeholder can appear multiple times, we only make one prediction for each placeholder as they all correspond to the same string.

Coreference-based Entity Reference Generation

Generating entities based on coreference clusters is more challenging than for our NER entity clusters, because different mentions of the same entity may use different surface forms. We generate a separate reference for each mention by adding the following inputs to the above model:

  • A bag-of-words context window around the specific entity mention, which allows local context to determine if an entity should be a name, pronoun or nominal reference.

  • Previously generated references for the same entity placeholder. For example, if the model is filling in the third instance of ent0, it receives that the previous two generations for ent0 were Bilbo, him. Providing the previous entities allows the model to maintain greater consistency between generations.
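The two additional inputs listed above can be assembled as in the following sketch. The dictionary layout, the window size, and the function name are assumptions made for illustration; how these features are actually encoded for the seq2seq model is not specified here.

```python
from collections import Counter


def build_mention_input(tokens, mention_index, placeholder, previous_refs, window=2):
    """Assemble the inputs for filling one mention of a placeholder entity."""
    lo = max(0, mention_index - window)
    hi = min(len(tokens), mention_index + window + 1)
    # Words around the mention, excluding the mention itself.
    context = tokens[lo:mention_index] + tokens[mention_index + 1:hi]
    return {
        "placeholder": placeholder,
        "context_bow": Counter(context),       # order-free bag of words
        "previous_refs": list(previous_refs),  # earlier fills for this entity
        "story": tokens,                       # full anonymized story
    }


story = "ent0 opened the door and ent0 smiled".split()
features = build_mention_input(story, 5, "ent0", ["Bilbo"])
```

For the first mention of an entity, `previous_refs` is simply empty, so the model has to rely on the local window and the story.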

3 Experimental Setup

3.1 Data

We use the WritingPrompts dataset of 300k story premises paired with long stories. Stories are on average 734 words, making the generation significantly longer compared to related work on storyline generation. In this work, we focus on the prompt to story generation aspect of this task. We follow the previous preprocessing of limiting stories to 1000 words and fixing the vocabulary size to 19,025 for prompts and 104,960 for stories.

3.2 Baselines

Figure 4: Human evaluations of different decomposed models for story generation. We find that using SRL action plans and coreference-resolution to build entity clusters generates stories that are preferred by human judges.

We compare our results to the Fusion model from Fan et al. (2018) which generates the full story directly from the prompt. We also implement various decomposition strategies as baselines:

  • Summarization: We propose a new baseline that generates a summary conditioned upon the prompt and then a story conditioned upon the summary. Story summaries are obtained with a multi-sentence summarization model Wu et al. (2019) trained on full-text CNN-Dailymail and applied to stories.

  • Keyword Extraction: We generate a series of keywords conditioned upon the prompt and then a story conditioned upon the keywords, based on Yao et al. (2019). Following Yao et al., we extract keywords with the rake algorithm Rose et al. (2010). Yao et al. extract one word per sentence, but we find that extracting keyword phrases per story worked well as our stories are much longer.

  • Sentence Compression: Inspired by Xu et al. (2018), we generate a story with compressed sentences conditioned upon the prompt and then a story conditioned upon the compressed shorter story. We use the same deletion-based compression data as Xu et al., from Filippova and Altun (2013). We train a seq2seq model to compress all non-dialog story sentences. The compressed sentences are concatenated to generate the compressed story.

Decomposition                 Stage 1   Stage 2
Summary                       4.20      5.09
Keyword                       6.92      4.23
Compression                   5.05      3.64
SRL Action Plan               2.72      3.95
NER Entity Anonymization      3.32      4.75
Coreference Anonymization     3.15      4.55

Table 1: Negative log likelihood of generating stories using different decompositions (lower is better). Stage 1 is the generation of the intermediate representation, and Stage 2 is the generation of the story conditioned upon that representation. Entity generation is with a word-based vocabulary to be consistent with the other models.

3.3 Training

We implement models using fairseq-py Gehring et al. (2017) in PyTorch and train Fan et al. (2018)’s convolutional architecture. We tune all hyperparameters on validation data.

3.4 Generation

We suppress the generation of unknown tokens. We require stories to be at least 150 words and cut off the story at the nearest sentence for stories longer than 250 words (to ease human evaluation). We generate stories with temperature 0.8 and random top-k sampling, where next words are sampled from the top k candidates rather than the entire vocabulary distribution.
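Sampling from the top candidates with a temperature can be sketched as below. The function name and the cutoff parameter `k` are hypothetical (no cutoff value is given here), and the sketch works on a plain list of scores rather than on model outputs.

```python
import math
import random


def sample_top_k(logits, k, temperature=0.8, rng=random):
    """Sample a token index from the k highest-scoring candidates."""
    # Keep only the k best-scoring indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution over the kept candidates.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.1, 1.5, -3.0]
token = sample_top_k(logits, k=2, rng=random.Random(0))
```

Restricting sampling to a short candidate list prevents the long tail of the vocabulary from being drawn, while still leaving more variety than greedy decoding.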

4 Experiments

4.1 Comparing Decomposition Strategies

Automated Evaluation

We compare the relative difficulty of modelling through each decomposition strategy by measuring the log loss of the different stages in Table 1. We observe that generating the SRL structure has a lower negative log-likelihood and so is much easier than generating either summaries, keywords, or compressed sentences, a benefit of its more structured form. We find keyword generation is especially difficult as the identified keywords are often the more salient, rare words appearing in the story, which are challenging for neural seq2seq models. This suggests that rare words should appear mostly at the last levels of the decomposition. Finally, we compare models with entity-anonymized stories as an intermediate representation, either with NER-based or coreference-based entity anonymization. Entity references are then filled using a word-based model, so that likelihoods are comparable across models. The entity fill is the more difficult stage.

Figure 5: Our decomposition can generate more coherent stories than previous work.

Human Evaluation

To compare overall story quality using various decomposition strategies, we conduct human evaluation. Judges mark which of 2 stories they prefer. 100 stories are evaluated for each model by 3 judges.

Figure 5 shows that human evaluators prefer our novel decompositions over a carefully tuned Fusion model from Fan et al. (2018) by about 60% in a blind comparison. We see additive gains from modelling actions and entities.

In a second study, evaluators compared baselines against stories generated by our strongest model, which uses SRL-based action plans and coreference-based entity anonymization. In all cases, our full decomposition is preferred.

Qualitatively, we find that many poor generations result from mistakes in early stages. Subsequent models were not exposed to errors during training, so are not able to recover.

4.2 Effect of SRL Decomposition

Human-written stories feature a wide variety of events, while neural models are plagued by generic generations and repetition.

Table 2 quantifies model performance on two metrics to assess action diversity: (1) the number of unique verbs generated, averaged across all stories, and (2) the percentage of diverse verbs, measured by the percent of all verbs generated in the test set that are not one of the top 5 most frequent verbs. A higher percentage indicates more diverse events. We identify verbs using spaCy.
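Both metrics are simple to compute once verbs have been extracted. The sketch below assumes the verbs of each story are already available as lists of strings (for example from a part-of-speech tagger); the function name and the `top_n` parameter are illustrative.

```python
from collections import Counter


def verb_diversity(stories_verbs, top_n=5):
    """Compute the two action-diversity metrics.

    stories_verbs: one list of verb strings per story.
    Returns (average number of unique verbs per story,
             percent of verb tokens not among the top_n most frequent verbs).
    """
    avg_unique = sum(len(set(verbs)) for verbs in stories_verbs) / len(stories_verbs)
    counts = Counter(v for verbs in stories_verbs for v in verbs)
    total = sum(counts.values())
    frequent = sum(c for _, c in counts.most_common(top_n))
    pct_diverse = 100.0 * (total - frequent) / total
    return avg_unique, pct_diverse


toy_stories = [["be", "be", "go", "say"], ["be", "run", "see", "say"]]
avg_unique, pct_diverse = verb_diversity(toy_stories, top_n=2)
```

In the toy input, "be" and "say" are the two most frequent verbs and account for 5 of the 8 verb tokens, so the remaining 3 tokens count as diverse.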

Our decomposition using the SRL predicate-argument structure improves the model’s ability to generate diverse verbs. Adding verb attention leads to further improvement. Qualitatively, the model can often outline clear action sequences, as shown in Figure 6. However, all models remain far from matching the diversity of human stories.

Figure 6: Example generated action plan. It shows a plausible sequence of actions for a character.

Model               # Unique Verbs   % Diverse Verbs
Human Stories       34.0             76.5
Fusion              10.3             61.1
Summary             12.4             60.6
Keyword             9.1              58.2
Compression         10.3             54.3
SRL                 14.4             62.5
+ verb-attention    15.9             64.9

Table 2: Action Generation. Generating the SRL structure improves verb diversity and reduces repetition.

                     First Mentions                  Subsequent Mentions
Model                Rank 10   Rank 50   Rank 100    Rank 10   Rank 50   Rank 100
Word-Based           42.3      25.4      17.2        48.1      38.4      28.8
BPE                  48.1      20.3      25.5        52.5      50.7      48.8
Character-level      64.2      51.0      35.6        66.1      55.0      51.2
No story             50.33     40.0      26.7        54.7      51.3      30.4
Left story context   59.1      49.6      33.3        62.9      53.2      49.4
Full story           64.2      51.0      35.6        66.1      55.0      51.2

Table 3: Accuracy at choosing the correct reference string for a mention, discriminating against 10, 50 and 100 random distractors. We break out results for the first mention of an entity (requiring novelty to produce an appropriate name in the context) and subsequent references (typically pronouns, nominal references, or shorter forms of names). We compare the effect of sub-word modelling and providing longer contexts.

Model                              # Unique Entities
Human Stories                      2.99
Fusion                             0.47
Summary                            0.67
Keyword                            0.81
Compression                        0.21
SRL + NER Entity Anonymization     2.16
SRL + Coreference Anonymization    1.59

Table 4: Diversity of entity names. Baseline models generate few unique entities per story. Our decompositions generate more, but still fewer than human stories. Using coreference resolution to build entity clusters reduces diversity here, partly due to re-using existing names more, and partly due to greater use of pronouns.

4.3 Comparing Entity Reference Models

We explored a variety of different ways to generate the full text of abstracted entities, using different amounts of context and different granularities of subword generation. To compare these models, we calculated their accuracy at predicting the correct reference in Table 3. Each model scores a set of candidate entities from the test set (the Rank 10, 50 and 100 settings in Table 3), consisting of 1 real reference and randomly sampled distractors. Models must give the true mention the highest likelihood. We analyze accuracy on the first mention of an entity, an assessment of novelty, and subsequent references, which measures coherence.
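This ranking evaluation can be sketched as follows. The `score` callable stands in for a model's likelihood of a reference given its context; it, the function name, and the way distractors are drawn from a flat `vocabulary` list are assumptions made for the example.

```python
import random


def ranking_accuracy(examples, score, vocabulary, num_distractors, seed=0):
    """Fraction of mentions whose true reference outscores every distractor.

    examples: list of (context, true_reference) pairs.
    score(context, reference) -> float, higher meaning more likely.
    vocabulary: pool of reference strings from which distractors are drawn.
    """
    rng = random.Random(seed)
    correct = 0
    for context, truth in examples:
        pool = [ref for ref in vocabulary if ref != truth]
        distractors = rng.sample(pool, num_distractors)
        truth_score = score(context, truth)
        # Ties count as errors: the true reference must score strictly highest.
        if all(score(context, d) < truth_score for d in distractors):
            correct += 1
    return correct / len(examples)


vocab = ["Bilbo", "John", "the Queen", "he", "Mary"]
toy_examples = [("Bilbo", "Bilbo"), ("he", "he")]
oracle = lambda context, ref: 1.0 if ref == context else 0.0
accuracy = ranking_accuracy(toy_examples, oracle, vocab, num_distractors=3)
```

Accuracy naturally falls as the number of distractors grows, which is why results are reported separately for each candidate-set size.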

Effect of Sub-word Modelling

Table 3 shows that modelling a character-level vocabulary for entity generation significantly outperforms BPE and word-based models, because of the diversity of entity names. This result highlights a key advantage of multi-stage modelling: it allows the use of specialized modelling techniques for each sub-task.

Effect of Additional Context

Entity references should be contextual. Firstly, names must be appropriate for the story setting: Bilbo Baggins is more appropriate for a fantasy novel than one set in the present day. Subsequent references to the character may be briefer, depending on context. For example, in the next sentence the character is more likely to be referred to as he or Bilbo than by his full name.

We compare three models' ability to name entities based on context (using the coreference-anonymization): a model that does not receive the story, a model that uses only leftward context (as in Clark et al. (2018)), and a model that receives the full story. We show in Table 3 that having access to the full story provides the best performance. Having no access to the story significantly decreases ranking accuracy, even though the model still receives the local context window of the entity as input. The left story context model performs significantly better, but looking at the complete story provides additional gains. We note that full-story context can only be provided in a multi-stage generation approach.

Qualitative Examples

Figure 7 shows examples of entity naming in three stories of different genres. The models adapt to the context—for example generating The princess and The Queen when the context includes monarchy.

Model                              # Coref Chains   Unique Names per Chain
Human Stories                      4.77             3.41
Fusion                             2.89             2.42
Summary                            3.37             2.08
Keyword                            2.34             1.65
Compression                        2.84             2.09
SRL + NER Entity Anonymization     4.09             2.49
SRL + Coreference Anonymization    4.27             3.15

Table 5: Analysis of non-singleton coreference clusters. Baseline models generate very few different coreference chains, and repetitive mentions within clusters. Our models generate larger and more diverse clusters.

4.4 Effect of Entity Anonymization

To understand the effectiveness of the entity generation models, we examine their performance by analyzing generation diversity.

Diversity of Entity Names

Human-written stories often contain many diverse, novel names for people and places. However, these tokens are rare and consequently difficult for standard neural models to generate. Table 4 shows that the fusion model and baseline decomposition strategies generate very few unique entities in each story. Generated entities are often generic names such as John.

Figure 7: Generating entity references for different genres, using entity-anonymized human written stories. Models use the story context to fill in relevant entities.

In contrast, our proposed decompositions generate substantially more unique entities. We found that using coreference resolution for entity anonymization led to fewer unique entity names than generating the names independently. This result can be explained by the coreference-based model re-using previous names more frequently and using more pronouns.

Coherence of Entity Clusters

Well-structured stories will refer back to previously mentioned characters and events in a consistent manner. To evaluate if the generated stories have these characteristics, we examine the coreference properties in Table 5. We quantify the average number of coreference clusters and the diversity of entities within each cluster (e.g. the cluster Bilbo, he, the hobbit is more diverse than the cluster he, he, he).
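The two cluster statistics can be computed as in this sketch. It assumes coreference clusters have already been extracted for each story and are given as lists of mention strings; the function name and the case-insensitive comparison of mentions are illustrative choices.

```python
def cluster_stats(stories_clusters):
    """Summarize coreference structure across stories.

    stories_clusters: per story, a list of clusters; each cluster is a list
    of mention strings.
    Returns (average number of non-singleton chains per story,
             average number of unique mention strings per chain).
    """
    chains_per_story = []
    unique_per_chain = []
    for clusters in stories_clusters:
        chains = [c for c in clusters if len(c) > 1]  # drop singletons
        chains_per_story.append(len(chains))
        unique_per_chain.extend(len({m.lower() for m in c}) for c in chains)
    avg_chains = sum(chains_per_story) / len(chains_per_story)
    avg_unique = sum(unique_per_chain) / len(unique_per_chain) if unique_per_chain else 0.0
    return avg_chains, avg_unique


toy_clusters = [
    [["Bilbo", "he", "the hobbit"], ["he", "he", "he"], ["Gandalf"]],
    [["Mary", "she"]],
]
avg_chains, avg_unique_names = cluster_stats(toy_clusters)
```

The chain "Bilbo, he, the hobbit" contributes 3 unique names while "he, he, he" contributes 1, which matches the intuition that the former is the more diverse cluster.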

Our full model produces more non-singleton coreference chains, suggesting greater coherence, and also gives different mentions of the same entity more diverse names. However, both numbers are still lower than for human generated stories, indicating potential for future work.

Qualitative Example

Figure 8 displays a sentence constructed to require the generation of an entity as the final word. The fusion model does not perform any implicit coreference to associate the allergy with his dog. In contrast, coreference entity fill produces a high quality completion.

5 Related Work

5.1 Story Generation with Planning

Story generation by first composing a plan has been explored using many different techniques. Traditional approaches organized sequences of character actions using hand crafted models Riedl and Young (2010); Porteous and Cavazza (2009). Recent work has extended this to modelling story events Martin et al. (2017); Mostafazadeh et al. (2016), plot graphs Li et al. (2013), or generating stories by conditioning upon sequences of images Huang et al. (2016) or descriptions Jain et al. (2017).

We build on previous work that decomposes generation into separate steps. For example, Xu et al. (2018) learn a story skeleton extraction model and a generative model conditioned upon the skeleton, using reinforcement learning to train the two together.

Zhou et al. (2018) train a storyline extraction model for news article generation, but require additional supervision from manually annotated storylines. Yao et al. (2019) use the rake Rose et al. (2010) algorithm to extract storylines, and condition upon the storyline to write the full story using dynamic and static schemas that govern if the storyline is allowed to change during the writing process.

Figure 8: Constructed sentence where the last word refers to an entity. The coreference model is able to track the entities, whereas the fusion model relies heavily on local context to generate the next words.

5.2 Entity Language Models

An outstanding challenge in text generation is modelling and tracking entities through a document. Centering (Grosz et al., 1995) gives a theoretical account of how referring expressions for entities are chosen in discourse context. Named entity recognition has been incorporated into language models since at least Gotoh et al. (1999), and has been shown to improve domain adaptation Liu and Liu (2007). Language models have been extended to model entities based on additional information, such as entity type Parvez et al. (2018). Recent work has incorporated learning representations of entities and other unknown words Kobayashi et al. (2017), as well as explicitly modelling entities by dynamically updating these representations Ji et al. (2017). Dynamic updates to entity representations are used in other story generation models Clark et al. (2018).

6 Conclusion

We proposed an effective method for writing short stories by separating the generation of actions and entities. We show through human evaluation and automated metrics that our novel decomposition significantly improves story quality.