TopNet: Learning from Neural Topic Model to Generate Long Stories

Long story generation (LSG) is one of the coveted goals in natural language processing. Unlike most text generation tasks, LSG requires producing a long, content-rich story from a much shorter text input, and it often suffers from information sparsity. In this paper, we propose TopNet to alleviate this problem by leveraging recent advances in neural topic modeling to obtain high-quality skeleton words that complement the short input. In particular, instead of directly generating a story, we first learn to map the short text input to a low-dimensional topic distribution (which is pre-assigned by a topic model). Based on this latent topic distribution, we can use the reconstruction decoder of the topic model to sample a sequence of inter-related words as a skeleton for the story. Experiments on two benchmark datasets show that our proposed framework is highly effective in skeleton word selection and significantly outperforms the state-of-the-art models in both automatic evaluation and human evaluation.




1. Introduction

Long story generation (LSG) is one of the desired goals for artificial intelligence systems and has many real-world applications such as automatic news generation, tutoring systems, etc. Given a short text description or even a single word, the task of LSG is to teach the machine to generate a long narrative story (Table 1). Recently introduced high-capacity language models (e.g., GPT-2) (Radford et al., 2019) have shown the ability to generate stylistically coherent text, but they also suffer from uncontrollability in topics and content, which limits their use in industrial or commercial settings.

Short Description: sunflower seeds
Story: Misty took a bag of sunflower seeds downstairs without asking. This made her father irritated, but he allowed her to do this. Then, she began spilling sunflower seeds. She spilled sunflower seeds even after her father said to be careful. Now, sunflower seeds are banned from the entire house.
Table 1. An example from the ROCStories dataset.

Most state-of-the-art works (Fan et al., 2018; Xu et al., 2018b; Yao et al., 2019) tackle the challenges of LSG with a hierarchical structure, which first generates several skeleton keywords indicating the topic of the story and then generates the story based on this skeleton. However, almost all of them generate the keywords with a supervised sequence-to-sequence model, which relies heavily on the maximum likelihood estimation objective and often produces dull, generic, and repetitive sequences (Serban et al., 2016; Li et al., 2017). Moreover, at the training stage, the labeled keywords are often created by word-frequency-based methods or initialized from other sentence compression datasets, and therefore suffer from a lack of diversity and from bias across domains.

Probabilistic topic modeling, which infers latent topics from documents, is a well-established dimensionality reduction technique. Given a corpus, the topic model assigns a topic distribution vector to each document and also builds a decoder that can reconstruct words in the document from the vector. This topic distribution contains the major information and latent features of the long document in the form of a low-dimensional vector; it is thus an ideal distillation of the document and has been widely applied to various tasks such as text analysis and information retrieval (Hofmann, 1999; Blei et al., 2003; Teh et al., 2005). Recently, neural topic models have attracted much attention (Kingma and Welling, 2014; Gan et al., 2015; Miao et al., 2017). They typically approximate the posterior with a variational distribution given by an inference model parameterized by a neural network, permitting unbiased and low-variance estimates of the gradients, and can provide a robust, scalable and theoretically sound foundation for long text modeling.

Inspired by the recent success of neural topic models, we focus on exploring whether they can further help provide knowledge for long story generation. Intuitively, the topic distribution assigned for each document by the topic model is a low-dimensional vector, which is easy for the given short description to map to; with the reconstruction decoder, it is also informative enough to generate diverse and inter-related topic words. Moreover, since the topic models are unsupervised learning methods, there is no need to annotate labels and they can be applied to large-scale datasets.

Given the above discussion, we propose TopNet, a long story generation framework that leverages neural variational inference (Kingma and Welling, 2014), as in a topic model, to tackle the information sparsity challenge in LSG (i.e., lack of information in the short input). Our key ideas are twofold: (1) Given a short text input, predict the low-dimensional topic distribution of the to-be-generated story, rather than directly generate a word sequence. More specifically, we first compress each story in the training set into a topic distribution by the topic model and then train a Topic Generator to map the short input text to the topic distribution. Since we employ the Gaussian distribution as the prior in the topic model, our variational topic distribution should also approach the same pattern, which is easier for the input text to map to. Now, given a new input text, we can first use the Topic Generator to predict a topic distribution of its potential corresponding story and then use the reconstruction decoder of the topic model to transform the topic distribution into the word distribution, from which we can sample skeleton words to complement the short text. (2) Instead of randomly sampling from the decoded word distribution, we train a language model on the stories as an auto-regressive Word Sampler to pick more inter-related words as the skeleton. Finally, we concatenate these skeleton words with the given short text as input to the Transformer (Vaswani et al., 2017), one of the state-of-the-art generative architectures, to generate a long story.

To verify the effectiveness of our approach, we conduct experiments on two long story generation tasks, which differ in terms of the input text length and are representatives of two realistic application scenarios.

(1) Title-to-article. Given a title that usually consists of fewer than three words, the task is to create a narrative article about it. On the ROCStories corpus (Mostafazadeh et al., 2016), both automatic evaluation and human evaluation show that our TopNet significantly improves the performance over previous state-of-the-art approaches.

(2) Summary Expansion. This task can be thought of as the reverse of the text summarization task, and aims to expand the summary text by adding more details. We evaluate our model on the CNN/DailyMail (Hermann et al., 2015) dataset. The experimental results demonstrate that our generated stories perform much better in relevance, fluency and diversity than competitive baselines. Moreover, we show that our framework not only improves the performance of normal-size neural networks, but also improves large-scale language models such as GPT-2.

2. Neural Topic Model

In this section, we train a neural variational inference framework (Kingma and Welling, 2014; Miao et al., 2016, 2017) on the stories from the dataset with the goal of obtaining their latent topic distributions and a well-trained reconstruction decoder. In the next section, we will learn a map between the input text and the latent topic distribution and utilize the reconstruction decoder to sample informative keywords from the predicted topic distribution as a skeleton for long story generation.

2.1. Parameterizing Topic Distributions

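Let $d \in \mathbb{Z}_{+}^{D}$ denote the bag-of-words representation of a story, with $\mathbb{Z}_{+}$ denoting nonnegative integers. $D$ is the vocabulary size, and each element of $d$ reflects the frequency of the corresponding word in the story. Following (Miao et al., 2017), we use a Gaussian random vector passed through a softmax function as the prior to parameterize the multinomial topic distribution. The generative process is:

$$x \sim \mathcal{N}(\mu_0, \sigma_0^2), \quad \theta = g(x), \quad z_n \sim \mathrm{Multi}(\theta), \quad w_n \sim \mathrm{Multi}(\beta_{z_n}) \tag{1}$$

where $\mathcal{N}(\mu_0, \sigma_0^2)$ is an isotropic Gaussian distribution, with mean $\mu_0$ and variance $\sigma_0^2$ in each dimension; $\theta \in \mathbb{R}^{K}$ is the topic distribution of the story, where $K$ is the number of topics; $z_n$ is the topic assignment for the observed word $w_n$; $g(x) = \mathrm{softmax}(W x + b)$, where $W$ and $b$ are trainable parameters; and $\beta_{z_n}$ represents the word distribution given topic assignment $z_n$.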
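The generative process described above can be illustrated with a small sampling routine. This is only a sketch under stated assumptions: the names `W`, `b`, `beta` and `generate_story_words` are hypothetical, the affine-plus-softmax form of the topic mapping is an assumption, and the toy matrices are made up for illustration rather than learned.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_categorical(probs, rng):
    """Draw an index from a categorical distribution."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def generate_story_words(W, b, beta, n_words, rng, mu0=0.0, sigma0=1.0):
    """Sample a Gaussian vector, map it to a topic distribution, then draw words.

    W (K x latent_dim) and b (K) are the assumed trainable parameters of the
    softmax mapping; beta (K x vocab) holds one word distribution per topic.
    """
    latent_dim = len(W[0])
    x = [rng.gauss(mu0, sigma0) for _ in range(latent_dim)]
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    theta = softmax(logits)
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)            # topic assignment
        words.append(sample_categorical(beta[z], rng))  # word id given topic
    return theta, words
```

Each word is drawn independently given the topic distribution, which is why the result is a bag of words rather than an ordered sentence.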

Figure 1. Overview of our model, comprising the neural topic model (upper) and TopNet framework (bottom).

2.2. Neural Variational Inference

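Neural variational inference is a simple instance of unsupervised learning in which a continuous hidden variable $\theta$, which generates all the words in a document independently, is introduced to represent its semantic content. Inspired by (Miao et al., 2017), we calculate the word distribution over topics by:

$$\beta_k = \mathrm{softmax}(v \cdot t_k^{\top}) \tag{2}$$

where $t \in \mathbb{R}^{K \times H}$ is the trainable topic vectors, and $v \in \mathbb{R}^{D \times H}$ is the word vectors. Therefore, $\beta \in \mathbb{R}^{K \times D}$ is the topic-to-word probability distribution matrix, which can be regarded as a decoder. The marginal likelihood for story $d$ is:

$$p(d \mid \mu_0, \sigma_0, \beta) = \int_{\theta} p(\theta \mid \mu_0, \sigma_0^2) \prod_{n} \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid \theta)\, \mathrm{d}\theta \tag{3}$$

To parameterize the latent variable $\theta$, we construct an inference network $q(\theta \mid \mu(d), \sigma(d))$ to approximate the posterior $p(\theta \mid d)$, where $\mu(d)$ and $\sigma(d)$ are functions of $d$ that are implemented as multi-layer perceptrons (MLPs). We optimize the variational objective function, also called the evidence lower bound (ELBO):

$$\mathcal{L}_d = \mathbb{E}_{q(\theta \mid d)}\Big[\sum_{n} \log \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid \theta)\Big] - D_{KL}\big[q(\theta \mid d) \,\|\, p(\theta \mid \mu_0, \sigma_0^2)\big] \tag{4}$$

where $q(\theta \mid d) = q(\theta \mid \mu(d), \sigma(d))$. In practice, we re-parameterize $\hat{\theta} = \mu(d) + \hat{\epsilon} \cdot \sigma(d)$ with the sample $\hat{\epsilon} \sim \mathcal{N}(0, I)$ to reduce the variance in stochastic estimation (Kingma and Welling, 2014). Since $q(\theta \mid d)$ is conditioned on a standard Gaussian prior, the KL term in Equation 4 can be easily integrated as a Gaussian KL-divergence. Given a sampled $\hat{\theta}$, the topic $z_n$ can be integrated out as:

$$\log p(w_n \mid \beta, \hat{\theta}) = \log \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid \hat{\theta}) = \log(\hat{\theta} \cdot \beta) \tag{5}$$

Hence we obtain a $K$-dimensional topic distribution $\theta$ for each story and a shared reconstruction decoder matrix $\beta$ where each row corresponds to one topic.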
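To make the variational objective concrete, the following sketch estimates the ELBO of one story from a single re-parameterized sample. It rests on several assumptions: the function names are hypothetical, the inference network outputs (`mu`, `log_sigma`) are passed in directly instead of being computed by an MLP, the prior is a standard Gaussian, and the softmax is applied straight to the Gaussian sample without a learned linear map.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_kl_to_standard(mu, log_sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(
        m * m + math.exp(2.0 * ls) - 1.0 - 2.0 * ls
        for m, ls in zip(mu, log_sigma)
    )


def elbo_single_sample(bow, mu, log_sigma, beta, rng):
    """One-sample Monte Carlo estimate of the ELBO for a bag-of-words vector.

    bow[w] is the count of word w; beta[k][w] is p(word w | topic k).
    """
    # Re-parameterization: sample = mu + eps * sigma with eps ~ N(0, 1).
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    x = [m + math.exp(ls) * e for m, ls, e in zip(mu, log_sigma, eps)]
    theta = softmax(x)
    # Integrate out the topic assignment: p(w) = sum_k theta_k * beta[k][w].
    vocab_size = len(beta[0])
    word_probs = [
        sum(theta[k] * beta[k][w] for k in range(len(theta)))
        for w in range(vocab_size)
    ]
    log_lik = sum(c * math.log(word_probs[w]) for w, c in enumerate(bow) if c > 0)
    return log_lik - gaussian_kl_to_standard(mu, log_sigma)
```

Because the topic assignment is summed out analytically, the only sampling noise comes from the Gaussian draw, which is what keeps the gradient estimate low-variance.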

3. TopNet Framework

To generate a long story, one crucial challenge we have to tackle is the lack of information on the input side (e.g., the input is much shorter than the target). With little input information, the neural model degenerates into a language model (Fan et al., 2018; Devlin et al., 2018; Radford et al., 2019; Joshi et al., 2020) that generates a story without taking the input into consideration. Such degeneration harms the fidelity of the generated story to its input as well as the story diversity. Intuitively, the above neural topic model captures what humans tend to write in a low-dimensional topic distribution and a knowledgeable reconstruction decoder $\beta$; hence, given a short text, we will first map it to its potential topic distribution and then use $\beta$ to decode interesting skeleton words to complement the given short input, before generating a long story.

Our proposed TopNet framework is shown in Figure 1. Its high-level idea is as follows: once a neural topic model has been trained on a story corpus, we train a Topic Generator to map the short input text to a topic distribution. With this generator and the decoder matrix $\beta$, we can map a new input text to a topic vector and then decode it into a word distribution from which skeleton words are sampled for story generation.

3.1. Topic Generator

Given the short input text $s = \{s_1, s_2, \ldots, s_M\}$, where $M$ is the number of words, we use pre-trained GloVe word embeddings (Pennington et al., 2014) to transform the words into vectors $\{e_1, e_2, \ldots, e_M\}$. We use the average of these vectors, $\bar{e} = \frac{1}{M}\sum_{m=1}^{M} e_m$, to represent the input text and approximate its topic distribution via:

$$\tilde{\theta} = G(\bar{e}) \tag{6}$$

where $G$ denotes the Topic Generator, whose weights and biases are trainable parameters. For training, we use the short description text from the training set and the mean of their topic distributions computed by the topic model in Section 2.2 to train the Topic Generator $G$, and we adopt the cross-entropy loss as our objective function. Note that given the low dimensionality of $\tilde{\theta}$ and the short input text, it is more practical to predict its topic distribution than, e.g., to directly predict the word distribution of the to-be-generated story, as the latter is over the entire vocabulary. Moreover, the Gaussian prior of the variational distribution can also improve the robustness of the topic approximation.
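A minimal sketch of the Topic Generator idea follows: average the word embeddings of the short input, map the average to a topic distribution, and fit it to a target topic distribution with cross-entropy. The exact architecture is not recoverable from the text, so a single affine layer followed by a softmax is used here purely as an assumption; the function names, the toy two-dimensional embeddings, and the plain gradient step are likewise illustrative.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_embedding(tokens, embeddings):
    """Mean of the embedding vectors of the known tokens (zeros if none)."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def predict_topics(e_bar, W, b):
    """Assumed generator: softmax(W e_bar + b), W is K x dim and b has K entries."""
    logits = [sum(w * e for w, e in zip(row, e_bar)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def cross_entropy(target, pred):
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)


def sgd_step(e_bar, target, W, b, lr=0.5):
    """One gradient step in place; returns the loss before the update."""
    pred = predict_topics(e_bar, W, b)
    for k, (p, t) in enumerate(zip(pred, target)):
        grad = p - t  # gradient of cross-entropy w.r.t. the k-th logit
        b[k] -= lr * grad
        for j, e in enumerate(e_bar):
            W[k][j] -= lr * grad * e
    return cross_entropy(target, pred)
```

Predicting a vector with as many entries as there are topics keeps the output layer tiny compared with predicting a distribution over the full vocabulary.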

3.2. Auto-regressive Word Sampler

Since $\beta$ is a shared topic-to-word matrix for all the stories, it stores the knowledge about the common constituents for various types of topics. We compute the word distribution for a to-be-generated story by decoding its estimated topic distribution $\tilde{\theta}$:

$$p(w \mid \tilde{\theta}) = \tilde{\theta} \cdot \beta \tag{7}$$

To complement the short text input for long story generation, we aim to select skeleton words that follow the content of the short input and are also interrelated with each other. Hence, instead of independently sampling a number of words from $p(w \mid \tilde{\theta})$, we pre-train a forward language model on the collection of stories in the dataset as an auto-regressive Word Sampler. Specifically, we use 3 layers of bi-directional Gated Recurrent Unit (GRU) (Chung et al., 2014) to form the language model, and the top layer of the GRU output is used to predict the next token with a softmax function. We adopt the same vocabulary as the neural topic model in this language model and remove all the stop words from the stories. This means our language model does not focus on syntactic structure, but aims to capture the intrinsic semantic coherence of a story.

When sampling, we use the short description $s$ as the initial input. Formally, the language model computes the probability of a sampled word sequence by modeling the conditional probability:

$$p(y_1, \ldots, y_T \mid s_1, \ldots, s_M) = \prod_{t=1}^{T} p(y_t \mid s_1, \ldots, s_M, y_1, \ldots, y_{t-1}) \tag{8}$$

where $s_m$ is the m-th word in the original short text input, $y_t$ is the sampled complementary word, and $T$ is a fixed number. We confine the language model to select words that are ranked in the top $N$ of $p(w \mid \tilde{\theta})$, by which we aim to obtain the skeleton words that have a strong topical connection with the given short description and are also interrelated with each other so as to form a coherent story.
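The constrained sampling procedure can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the GRU language model is replaced by a hypothetical callable `lm_next_probs(context)` that returns a probability for every vocabulary id, and all function and argument names are made up for the example.

```python
import random


def decode_word_distribution(theta, beta):
    """Word distribution from a topic distribution: p(w) = sum_k theta_k * beta[k][w]."""
    vocab_size = len(beta[0])
    return [sum(t * row[w] for t, row in zip(theta, beta)) for w in range(vocab_size)]


def top_n_candidates(word_probs, n):
    """Ids of the n highest-probability words, returned in ascending id order."""
    ranked = sorted(range(len(word_probs)), key=lambda w: word_probs[w], reverse=True)
    return sorted(ranked[:n])


def sample_skeleton(input_ids, theta, beta, lm_next_probs, num_words, top_n, rng):
    """Auto-regressively sample skeleton words restricted to the top-n topical words.

    lm_next_probs is a stand-in for the pre-trained language model: it takes the
    current context (list of word ids) and returns next-word probabilities.
    """
    candidates = top_n_candidates(decode_word_distribution(theta, beta), top_n)
    context = list(input_ids)  # the short description is the initial input
    skeleton = []
    for _ in range(num_words):
        probs = lm_next_probs(context)
        weights = [probs[w] for w in candidates]
        if sum(weights) <= 0.0:
            weights = [1.0] * len(candidates)  # fall back to uniform
        word = rng.choices(candidates, weights=weights, k=1)[0]
        skeleton.append(word)
        context.append(word)  # the sampled word conditions the next step
    return skeleton
```

The topic model decides which words are eligible, while the language model decides which eligible word best continues the words chosen so far.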

| Dataset | Training Set | Validation Set | Testing Set | Source Length | Target Length |
|---|---|---|---|---|---|
| ROCStories | 78529 | 9816 | 9816 | 2.2 | 52.0 |
| CNN/DailyMail | 287113 | 13368 | 11490 | 66.1 | 778.3 |

Table 2. Statistics of the ROCStories and CNN/DailyMail datasets.

3.3. Story Generation

To generate content-rich and coherent stories, we adopt the Transformer (Vaswani et al., 2017) model, which is known to be good at capturing long-range dependencies and is one of the state-of-the-art architectures for text generation. The Transformer is a sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. We concatenate the given short text description with the skeleton as the input of the model, and the output is the story text. Note that our framework is not limited to the Transformer; other, more advanced generative models can be applied as well.

| Topic Model | ROCStories | CNN/DM |
|---|---|---|
| LDA | 617.13 | 1325.1 |
| NTM | 344.54 | 952.93 |

Table 3. Perplexity of different topic models. “NTM” denotes neural topic model.

4. Experiments

4.1. Datasets

We apply our approach to two long story generation tasks, which differ in terms of the input text length and are representatives of two realistic application scenarios, and conduct comprehensive analyses to show its effectiveness.

(1) Title-to-article: Given a title that usually consists of fewer than three words, the task aims to create a narrative article about it. In this task, we use ROCStories (Mostafazadeh et al., 2016), a popular dataset whose input text is a short title and whose target is a five-sentence article. It captures a rich set of causal and temporal commonsense relations between daily events, making it a good resource for evaluating story generation models (Peng et al., 2018; Yao et al., 2019).
(2) Summary Expansion: This task can be thought of as the reverse of the text summarization task, and aims to expand the summary text by adding more details. We evaluate our model on the CNN/DailyMail (Hermann et al., 2015) dataset, which is widely studied for text summarization tasks (See et al., 2017; Paulus et al., 2018); in this paper we use it to evaluate models for expanding a short summary to a long story. Table 2 shows the statistics of the datasets.

4.2. Implementation Details

4.2.1. Neural Topic Model

We train our neural topic model on the stories in a given dataset. The stories are preprocessed by stemming and stopword filtering, and we choose the 5000 most frequent words to compose the vocabulary. We use glove.840B.300d (Pennington et al., 2014) as the word vectors (Eq. 2). For the inference network, we use an MLP with 2 layers of 500-dimensional rectified linear units. A dropout of 0.8 is applied to the output of the MLP before parameterizing the diagonal Gaussian distribution. The model is trained with Adam (Kingma and Ba, 2014) and tuned by hold-out validation perplexity. We follow (Miao et al., 2016) to alternately optimize the generative model and the inference network by fixing the parameters of one while updating the parameters of the other.

4.2.2. TopNet Framework

For the Topic Generator, we set the dimension of the weight matrix to 512. We optimize the model with Adamax (Kingma and Ba, 2014) with a learning rate of 0.002. Our batch size is set to 128, and the dropout rate to 0.2. For the language model sampler, the dimension of the GRU is set to 512 and the model is trained with Adadelta (Zeiler, 2012) with a learning rate of 0.001. The batch size is set to 20 and the dropout rate to 0.15. For the title-to-article task, we set the number of topics $K$ to 50, set the number of sampled skeleton words $T$ and the top-ranked candidate cutoff $N$ in Section 3.2 to 10 and 100 respectively, and train the Transformer for 220k steps. For the summary expansion task, since its target length is relatively longer as shown in Table 2, we set the number of topics to 80, set $T$ and $N$ to 60 and 200 respectively, and train the Transformer for 340k steps. For the Transformer model, we set the hyperparameters to be the same as in (Vaswani et al., 2017), and use the base model in its official implementation for this work.

| Models | Inter-S | Intra-S | Dist-2 | Ent-4 | Dist-2 (SW) |
|---|---|---|---|---|---|
| Inc-Seq2seq | 0.95 | 0.16 | 0.074 | 7.929 | - |
| Skeleton Model | 0.89 | 0.09 | 0.082 | 8.573 | 0.285 |
| Static Planning | 0.82 | 0.06 | 0.093 | 9.238 | 0.473 |
| Fusion Model | 0.71 | 0.05 | 0.101 | 11.558 | 0.604 |
| Transformer | 0.88 | 0.09 | 0.091 | 8.623 | - |
| TopNet (LDA) | 0.69 | 0.05 | 0.124 | 11.593 | 0.828 |
| TopNet | 0.65 | 0.04 | 0.151 | 11.682 | 0.856 |

Table 4. Quantitative evaluation on the ROCStories dataset shows that our model achieves the state-of-the-art performance. “Dist-2 (SW)” denotes the Dist-2 score for the generated skeleton words.
| Choice % | TopNet | vs. Fusion | TopNet | vs. Random | TopNet | vs. Top Words |
|---|---|---|---|---|---|---|
| Coherence | 55.1 | 44.9 | 73.5 | 26.5 | 56.4 | 45.6 |
| Meaningfulness | 54.6 | 45.4 | 64.1 | 35.9 | 55.2 | 44.8 |
| Fidelity | 52.0 | 48.0 | 78.3 | 21.7 | 47.5 | 52.5 |
| Richness | 66.7 | 33.3 | 51.6 | 48.4 | 66.1 | 33.9 |

Table 5. Human comparison with the state-of-the-art method and ablation study on the ROCStories dataset.

4.3. Baselines

To demonstrate the effectiveness of our model, we choose several representative and state-of-the-art models from the current open-source works:

Inc-Seq2seq (Bahdanau et al., 2015) denotes the incremental sentence-to-sentence generation baseline, which is built upon the Long Short-Term Memory network (LSTM) (Cheng et al., 2016) along with an attention mechanism.
Skeleton Model (Xu et al., 2018b) is also an LSTM-based model, and trains its skeleton extraction module on other sentence compression datasets (Filippova and Altun, 2013).
Fusion Model (Fan et al., 2018) uses the human-annotated skeletons for training and adopts the gated self-attention along with Convolutional Sequence-to-Sequence Model (Gehring et al., 2017) to learn the long-range context.
Static Planning (Yao et al., 2019) uses word-frequency methods to train a skeleton extraction model and generates each sentence using each word of the skeleton.

4.4. Topic Models

Table 3 presents the comparison of LDA and the neural topic model. As we can see, the neural topic model (NTM) performs much better than the traditional LDA on both datasets. The improvement over LDA indicates that the neural topic model can keep the information more accurately after compressing the original document, thus has potential to provide a better learning signal for the TopNet framework to generate a satisfying story.
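For readers unfamiliar with how topic-model perplexity is obtained, the sketch below uses the per-document, length-normalised definition that is common in the neural topic model literature. The text does not state which exact formula produced Table 3, so this definition is an assumption, as are the function name and its arguments.

```python
import math


def corpus_perplexity(doc_log_likelihoods, doc_lengths):
    """exp of the negative mean per-word log-likelihood, averaged over documents.

    doc_log_likelihoods[i] is (a bound on) log p(d_i) in nats and
    doc_lengths[i] is the number of word tokens in document i.
    """
    if not doc_log_likelihoods:
        raise ValueError("at least one document is required")
    per_word = [ll / n for ll, n in zip(doc_log_likelihoods, doc_lengths)]
    return math.exp(-sum(per_word) / len(per_word))
```

Lower values mean the model assigns higher probability to held-out words. When a variational bound stands in for the true log-likelihood, the resulting figure is an upper bound on the perplexity.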

4.5. Title-to-article Task

In this task, we conduct experiments on the ROCStories dataset. For evaluation, we follow (Yao et al., 2019) and use inter-story and intra-story repetition scores, which denote, respectively, the repetition rate of trigrams between stories at the sentence level and the average trigram repetition of each sentence compared with the preceding sentences in a story. We also follow (Li et al., 2016) to calculate Dist-2, which is the proportion of unique bigrams over the total number of bigrams in the generated stories, and (Zhang et al., 2018) to use the Ent-4 metric, which reflects how evenly distributed the 4-grams are over all generated stories. Since the input titles usually contain fewer than three words, many different outputs can all be relevant, so evaluation metrics such as ROUGE (Lin, 2004), which target the generation of one particular piece of text, are not suitable for this task.
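The two diversity metrics can be computed in a few lines. The sketch below assumes whitespace tokenisation and the natural logarithm for the entropy; neither choice is specified in the text, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dist_n(stories, n):
    """Dist-n: number of unique n-grams divided by the total number of n-grams."""
    grams = [g for story in stories for g in ngrams(story.split(), n)]
    return len(set(grams)) / len(grams) if grams else 0.0


def ent_n(stories, n):
    """Ent-n: entropy of the empirical n-gram frequency distribution."""
    counts = Counter(g for story in stories for g in ngrams(story.split(), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Dist-n only counts how many distinct n-grams occur, whereas Ent-n also penalises outputs in which a few n-grams dominate, so the two metrics are complementary.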


As shown in Table 4, our TopNet outperforms all the baseline models and achieves state-of-the-art results, and the skeleton words generated by our method are more diverse than those of all other hierarchical models. In the bottom part, we conduct an ablation study: we replace the neural topic model with LDA (Blei et al., 2003), and the performance is better than the Transformer’s but still lags behind that of our TopNet. We conjecture that this is because the predicted top words for many stories overlap, as we observe that high-frequency words tend to be selected as the skeleton under the LDA model, while the neural inference network is more capable of learning complicated non-linear distributions.

For human evaluation in Table 5, we randomly sample 120 titles along with the generated stories and consider 4 aspects: Coherence (whether the story as a whole is coherent in meaning and theme), Meaningfulness (whether the story conveys a certain message), Fidelity (the relevance between the generated story and the title), and Richness (the amount of information in the story). We compare our TopNet with the Fusion Model (Fan et al., 2018), Transformer with random words (randomly select words from the vocabulary as the skeleton), and Transformer with top words (directly select the top words from the probability (Eq. 7) instead of using our Word Sampler). For each sample, 5 people are asked to decide which of the two stories is better in the above aspects, and we show the average scores across the five annotators on all samples. As we can see, similar to the previous quantitative results, our TopNet outperforms its counterpart baselines in almost all evaluation aspects, thus demonstrating the effectiveness of the proposed framework.

Figure 2. Human accuracy at pairing the skeleton with the correct short input text (candidates consist of the original one and 4 randomly selected ones).
Title: Allergies
Skeleton (Static Planning): anna day doctor sick medicine
Skeleton (TopNet): seasonal summer test medication sick symptom pill lunch need happy
Human: Kia had a runny nose and headache for weeks. She finally went to her doctor. He told her she had allergies. He gave her an antihistamine to take. Soon Kia was feeling much better.
Static Planning: Anna had a bad cough. She went to the doctor. The doctor told Anna she had a fever. Anna was very sick. She decided to go to the doctor.
TopNet: Will had seasonal allergies. He sneezed and sniffled throughout the summer. He took allergy medication daily, which helped his symptoms. One morning he forgot to take a pill, and his allergies were terrible. Will made sure to take his pill with breakfast so he wouldn’t forget.
Table 6. Examples of skeletons and stories generated by human, Static Planning, and TopNet on the ROCStories dataset.
| Models | RG-1 (Relevance) | RG-L (Relevance) | Dist-2 (Diversity) | Ent-4 (Diversity) | Perplexity (Fluency) | Dist-2 (Skeleton) |
|---|---|---|---|---|---|---|
| Inc-Seq2seq | 22.4 | 13.8 | 0.051 | 10.024 | 38.73 | - |
| Fusion Model | 28.3 | 20.5 | 0.076 | 13.021 | 36.12 | 0.179 |
| Transformer | 25.1 | 16.9 | 0.060 | 11.565 | 38.15 | - |
| Transformer + Random Words | 24.4 | 15.8 | 0.076 | 13.024 | 38.29 | 0.246 |
| Transformer + Top Words | 27.9 | 19.4 | 0.070 | 12.754 | 36.38 | 0.127 |
| TopNet (LDA) | 28.5 | 20.8 | 0.075 | 13.011 | 36.25 | 0.204 |
| TopNet | 29.3 | 21.6 | 0.078 | 13.033 | 35.73 | 0.219 |
| GPT-2 | 11.2 | 8.3 | 0.121 | 14.380 | 30.25 | - |
| GPT-2 + TopNet | 12.8 | 9.6 | 0.114 | 14.133 | 30.21 | 0.219 |

Table 7. Quantitative evaluation on the CNN/DailyMail dataset, where the summary is the input and the article is the target. “GPT-2 + TopNet” means we use the GPT-2 as the story generation module of the TopNet.

In Figure 2, we show the correlation between the generated skeleton words and their short input text. We randomly pick 100 (title, generated skeleton) pairs, and for each pair we randomly sample 4 other titles from the dataset and mix them with the original title. For each skeleton and its 5 candidate titles, 5 people are asked to select the most relevant title according to the skeleton. The annotators obtain the highest average accuracy with TopNet’s skeletons, which shows that our skeleton words are topical, meaningful, and faithful to the input text.

In Table 6, we present a randomly selected title from ROCStories and the stories generated by humans (original story), Static Planning (Yao et al., 2019), and TopNet, and we also compare the skeletons generated by the latter two methods. In this example, our TopNet generates a more complicated and interesting story than Static Planning and even the human-written one. While Static Planning produces redundant and repetitive content such as “She decided to go to the doctor”, our method generates a story that is much more informative and has more complex sentence structures. Since the number of skeleton words in our TopNet is adjustable, we can produce more keywords than Static Planning, and thus provide more detailed information to support story generation. We also observe that some sampled words are not directly used in the generated story (e.g., lunch, happy), but in general the complementary words are interrelated and pertinent to the given title, which can be attributed to the latent feature extraction of the topic model.

4.6. Summary Expansion Task

We further apply our model to another task named summary expansion, which typically has a relatively longer input than the previous title-to-article task. In particular, we use the summaries in the CNN/DailyMail dataset as the input to predict the corresponding stories. For evaluation, we measure the relevance of the generated story to the summary by ROUGE-1 and ROUGE-L, since a summary is able to express some informative messages; we still use Dist-2 and Ent-4 to evaluate the diversity of the generated stories; we also measure the fluency of the stories by perplexity, which reflects how fluently the model can produce the correct next word given the preceding words. In addition to the ablation baselines, we select Inc-Seq2seq and Fusion Model as two representative baselines based on our previous experiments in Table 4, and we also compare our TopNet with GPT-2 (Radford et al., 2019) (345M version), which is a large-scale language model trained on external data and is known as one of the most powerful text generation models.


Table 7 shows the performance of the models on the CNN/DailyMail dataset. Our TopNet achieves the best results among the models other than GPT-2 on almost all metrics, and our generated skeleton words also perform well in diversity compared with other results (except random sampling). These results demonstrate that our TopNet is able to generate relevant, diverse, and fluent stories, which further verifies the effectiveness of our approach. For GPT-2, we see that the results for diversity and fluency are much better than those of the other approaches, since it is pre-trained on an extremely large corpus and the model itself is much more complex than the others. However, we observe that the articles it generates are almost irrelevant to the original ones, and many of them even deviate from the topic of the summaries. This indicates that although large-scale language models show impressive improvements under various metrics, it is still challenging to control their generated text content, which is to some extent improved by our TopNet according to the relevance results. This phenomenon also reflects that a neural model tends to ignore its input when the input contains little information. With the informative supplement, our TopNet can theoretically remedy this problem.

Table 8 presents an example of the generated skeleton and the stories produced by a human, GPT-2, and TopNet on the CNN/DailyMail dataset. As we can see, most of our generated skeleton words have a strong correlation with the given summary, such as “power”, “energy”, “facility”, “collapse”, etc. These words together with the summary support a long and coherent story. However, some words such as “lab” and “foot” seem to be less relevant to the given summary, and thus do not play significant roles in story generation. Moreover, some words, such as “Beijing”, induce the model to generate the sentence “The plant is located in the nearby city of Beijing, about 240 kilometers (150 miles) north of tokyo”, which does not make sense because it is factually wrong.

As for the generated stories, both our method (TopNet) and GPT-2 produce a long and interesting story. We notice that the story generated by GPT-2 is longer and has more complex word choices, and GPT-2 is very good at making up specific named entities such as addresses, dates, names and organizations. However, we observe that the story generated by GPT-2 deviates from the given summary, let alone the original story in the CNN/DailyMail dataset. It seems that GPT-2 is more skillful at continuing a text than at expanding a given short text by adding details. We also find that the story generated by our TopNet has repetitive phrases such as “since the march 11 earthquake and tsunami”, indicating that regular generators (e.g., the Transformer) still have a lot of room for improvement in content selection, even when provided with sufficient knowledge on the input side.

Summary: The operator of the fukushima nuclear plant said it has abandoned a robotic probe inside one of the damaged reactors. A report stated that a fallen object has left the robot stranded. The robot collected data on radiation levels and investigated the spread of debris.
Skeleton: possible power energy inch facility core light announce improve identify tsunami lab expert quake unit safe confirm meter strike range emergency effort foot probe problem atmosphere electronic damage beijing image locate screen earth produce battery investigate fuel firm worker collapse staff source billion statement operate zone contribute conference since recover surface incident locate storm human heat current weather increase mile
Human: The operator of the crippled Fukushima daiichi nuclear plant has given up trying to recover a robotic probe after it stopped moving inside one of the reactors. Tokyo electric power company deployed the remote-controlled robot on Friday inside one of the damaged reactors that had suffered a meltdown following a devastating earthquake and tsunami in 2011. It was the first time the probe had been used. The robot, set out to collect data on radiation levels and investigate the spread of debris, stalled after moving about 10 meters, according to a statement released by tepco. A newly released report and footage from the robot shows that a fallen object had blocked its path and left it stranded. Tepco decided to cut off the cable connected to the device sunday as it had already covered two-thirds of the originally planned route. It managed to collect data on radiation levels in 14 of the 18 targeted locations. Four years after the devastating nuclear crisis, the radiation levels inside the three damaged reactors are still extremely high and remain unsafe for people to enter. Decommissioning work is estimated to cost $50 billion and will take years to complete. Tepco called the robotic probe an “unprecedented” experiment. CNN’s Yoko Wakatsuki reported from Tokyo, Japan and Maomi Ng wrote from Hong Kong.
GPT-2: A senior researcher at the Medecins Sans Frontieres study said southern regions of the Fukushima Daiichi nuclear power plant had past radiation warnings from Japanese health authorities. ”The elderly had two 30mm fukushima radiators installed on top of them. In October last year the Vienna air and ground report Socio-Economic Department adopted most comprehensive rules on how the elderly should measure radiation levels if they use the radiators.” In Japan technicians received an experimental Japanese Shishii add-on for Onsen fire bomblets proclaiming King Abe will rise up with 256 Great Komeito Attackers The Japan Meteorological Agency released this special report accuses Japan of having chemical weapons in spring 1986. Many Inland Lobsters were loaded with Anbak bombs in ”standard package”. a report described the outbreak as a ”heavy blow, which operated simultaneously. A barrage of weapons and chemicals inside the nineties generation should have been spent. There was no evidence of evidence of contaminated equipment here. Everyone seems to talk about this as evidence of old bombs. In Texas, when Nichols received a shipment of $40,000, he started recording the airport log of security personnel for weightlifting authorities. every flight time is recorded; most of the time, the official touch-screen window is detached from the flight HUD. Richard Yates interviews employees of Honda, Kentucky steel company, about retired attack helicopters being ”rendered ”ignorant”, and that, after contracting Elvira for combat helicopter for close-up spraying in 2002, …
TopNet: The japanese operator of the nuclear power plant said it has abandoned a robotic probe inside one of the damaged reactors at the Fukushima daiichi nuclear power plant. The operator of the plant said it has abandoned the power plant since the march 11 earthquake and tsunami that resulted in the worst nuclear accident since Chernobyl, as a result of the heavy pressure on the plant ’s surface. “This is not the first time we have seen from outside the plant,” said Tom Biden, director of the plant’s mission, in a written statement. “The nuclear power plant has not had any such problems with the power plant, as the cores of reactors at the Fukushima daiichi nuclear plant overheated and spewed radioactive material into the atmosphere.” The plant is located in the nearby city of Beijing, about 240 kilometers (150 miles) north of tokyo. “It has since been called off,” the company said, ”Adding that it does not have any reports of radiation contamination from the plant.” The plant has been operating since the March 11 earthquake and tsunami, according to the U.S. environmental protection agency. The agency said in a statement that the robot collected data on radiation levels and investigated the spread of the reactors. ”we are working closely with the engineers to determine the cause of the accident, ”the agency said.
Table 8. Examples of the generated skeleton and stories generated by human, GPT-2, and TopNet on the CNN/DailyMail dataset.
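The repetition noted in the TopNet story can be checked mechanically. The following is a minimal sketch, not part of the evaluation reported in this paper: the function name `repeated_ngrams`, the tokenization rule, and the toy text are all illustrative assumptions.

```python
import re
from collections import Counter


def repeated_ngrams(text, n):
    """Return {n-gram tuple: count} for every word n-gram that occurs more than once.

    Hypothetical helper: lowercases the text and keeps only alphanumeric tokens.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c > 1}


# Toy text (an assumption, not the model output) that reuses one long phrase.
toy_story = (
    "The plant closed since the march 11 earthquake and tsunami. "
    "It reopened since the March 11 earthquake and tsunami."
)

for gram, count in repeated_ngrams(toy_story, 7).items():
    print(" ".join(gram), count)
```

A larger `n` flags only long verbatim repeats, while a smaller `n` would also pick up ordinary collocations, so the choice of `n` is a judgment call.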

5. Related Work

5.1. Long Story Generation

Recently, deep learning models have been shown to be effective for LSG (Kiros et al., 2015; Roemmele, 2016; Jain et al., 2017; Fan et al., 2019; Tambwekar et al., 2019; Goldfarb-Tarrant et al., 2019). Most state-of-the-art methods (Fan et al., 2018; Xu et al., 2018b; Yao et al., 2019) decompose story generation with a hierarchical strategy that first produces a skeleton and then generates a story based on the skeleton. However, these works either do not produce skeletons informative enough to support a long story, or they require human annotators to label skeletons for training. Unlike previous work, we integrate topic modeling techniques into the story generation task by projecting an input text to its latent topic space and leveraging the reconstruction decoder of a neural topic model to obtain abundant inter-related skeleton words in an unsupervised fashion.

Large-scale language models (e.g., GPT-2) have also achieved impressive performance on long text generation (Radford et al., 2019; Mao et al., 2019). However, training such a language model can cost in excess of $10,000 and also requires a huge amount of external data (Wang et al., 2019a), making a fully-fledged model exploration prohibitively expensive and time-consuming. Moreover, large-scale language models are usually uncontrollable because the local dependencies within the story itself are easier to model than the subtle dependencies between the input text and the story (Fan et al., 2018). Different from tasks such as abstractive text summarization (See et al., 2017) or Neural Machine Translation (NMT) (Gu et al., 2016), where the semantics of the target are fully specified by the source, the generation of stories from the model input is far more open-ended. Tackling the degeneration problem, in which neural models ignore the input and focus purely on the previously generated sequence, is therefore crucial when generating long stories. In such a situation, it is hard for GPT-2 to produce stories that are faithful to the input. In this paper, we make full use of the given corpus through unsupervised learning while eliminating the need for large-scale external data or human interaction, and show that generating an informative skeleton for LSG is still important and challenging.

5.2. Topic Models

Topic models have been studied for a variety of applications in document modeling and information retrieval. Beyond LDA (Blei et al., 2003), which consists of multiple layers of Bayesian networks, various extensions have been explored to discover topics (Teh et al., 2005) and model temporal dependencies (Blei and Lafferty, 2006), among many others. Recently, neural topic models have gained much attention (Hinton and Salakhutdinov, 2009; Larochelle and Lauly, 2012; Gan et al., 2015; Miao et al., 2016, 2017) and have improved on the performance of traditional methods. Different from prior approaches that sought closed-form derivations on the discrete text to train the topic model, Miao et al. (2016) presented a generic variational inference framework for the intractable distributions over latent variables. Given the discrete text, they build an inference network to obtain the variational distribution. They further proposed parameterisable distributions over topics (Miao et al., 2017), which permit training with back-propagation in the framework of neural variational inference and make it easy to obtain powerful topic models. In this work, we also construct our topic model with a variational inference framework (Miao et al., 2016, 2017), in which the latent topic vector is constrained by a prior distribution (e.g., a Gaussian distribution) and is easier to approximate from the input text.
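To make the role of the Gaussian prior concrete, the following is a minimal sketch of the sampling step in a neural variational topic model. It assumes a diagonal Gaussian posterior and a standard normal prior, and it passes the sample directly through a softmax to obtain topic proportions; any learned projection is omitted. The function names and the toy values of `mu` and `log_var` are illustrative assumptions, not the configuration used in this paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), so z is differentiable in mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def softmax(scores):
    """Numerically stable softmax that maps a real vector to topic proportions."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), the term that ties the posterior to the prior."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


# Toy posterior parameters for three topics (assumed values, normally produced by an encoder).
rng = random.Random(0)
mu = [0.5, -0.2, 0.1]
log_var = [-1.0, -0.5, 0.0]

z = reparameterize(mu, log_var, rng)
theta = softmax(z)  # latent topic proportions
kl = kl_to_standard_normal(mu, log_var)
print(theta, kl)
```

Because the KL term is zero only when the posterior equals the prior, it acts as the constraint on the latent topic vector mentioned above.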

5.3. Topic Learning in NLP

The idea of using learned topics to improve NLP tasks has been explored previously, including methods that combine topic and neural language models (Mikolov and Zweig, 2012; Ahn et al., 2016; Dieng et al., 2016; Lau et al., 2017; Wang et al., 2018), leverage topic and word embeddings (Liu et al., 2015; Xu et al., 2018a), and use topic-guided VAEs (Wang et al., 2019b). Wang et al. (2018) presented a Topic Compositional Neural Language Model (TCNLM), which not only maintains the overall topic representing the global information but also learns the local semantic information of a document. To disambiguate the ubiquitous homonymy and polysemy, Liu et al. (2015) used a topic model to assign a topic vector to each word embedding and thereby obtained topical word embeddings. Xu et al. (2018a) proposed jointly training word embeddings and a topic model to obtain topic-aware word embeddings. Wang et al. (2019b) extended the VAE by specifying a Gaussian prior parametrized by the latent topic distribution. Unlike previous works that mainly focus on modeling the latent topics, we propose to also leverage the topic-to-word decoder of a topic model, which is knowledgeable and robust, to help obtain skeleton words based on the approximated latent topics.

6. Conclusion

In this paper, we propose a novel framework, TopNet, which leverages the document modeling of the neural topic model for long story generation. To tackle the information sparsity challenge, we learn to map the short input text to a low-dimensional topic distribution and use the decoder of the topic model to transform it into a word distribution, based on which we develop a language model to sample skeleton words for story generation. Experimental results on different tasks show that our model significantly outperforms state-of-the-art approaches and can produce interesting and coherent stories. We also demonstrate that our method has the potential to improve large-scale language models by providing appropriate topic words. Future work includes further research into controllable long sequence generation as well as exploring unsupervised or self-supervised learning in other NLP tasks.
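The topic-to-word step summarized above can be illustrated with a small sketch. It assumes the decoder is a topic-word matrix `beta` whose rows are per-topic word distributions, and it replaces the language-model-based sampler described in this paper with simple independent draws without replacement. The vocabulary, `theta`, `beta`, and all function names are toy assumptions.

```python
import random


def word_distribution(theta, beta):
    """Mix per-topic word distributions: p(w) = sum_k theta[k] * beta[k][w]."""
    vocab_size = len(beta[0])
    return [sum(t * row[w] for t, row in zip(theta, beta)) for w in range(vocab_size)]


def sample_skeleton(theta, beta, vocab, n_words, rng):
    """Draw up to n_words distinct skeleton words in proportion to p(w).

    Simplification (assumption): words are drawn independently rather than
    by a learned language model over the word distribution.
    """
    probs = word_distribution(theta, beta)
    candidates = list(range(len(vocab)))
    skeleton = []
    for _ in range(min(n_words, len(vocab))):
        weights = [probs[i] for i in candidates]
        if sum(weights) <= 0:
            break  # no probability mass left to sample from
        idx = rng.choices(candidates, weights=weights, k=1)[0]
        skeleton.append(vocab[idx])
        candidates.remove(idx)
    return skeleton


# Toy setup: two topics over a six-word vocabulary (all values are assumptions).
vocab = ["reactor", "tsunami", "robot", "probe", "goal", "match"]
beta = [
    [0.30, 0.30, 0.20, 0.15, 0.03, 0.02],  # a "nuclear plant" topic
    [0.02, 0.03, 0.05, 0.05, 0.45, 0.40],  # a "sports" topic
]
theta = [0.9, 0.1]  # topic proportions predicted from the short input

skeleton = sample_skeleton(theta, beta, vocab, n_words=4, rng=random.Random(0))
print(skeleton)
```

With `theta` concentrated on the first topic, most of the probability mass falls on words from that topic, which is the intuition behind using the approximated topic distribution to enrich a short input.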

This work was supported in part by The National Key Research and Development Program of China (Grant No. 2018AAA0101400), The National Natural Science Foundation of China (Grant Nos. 62036009, 61936006), the Alibaba-Zhejiang University Joint Institute of Frontier Technologies, and The Innovation Capability Support Program of Shaanxi (Program No. 2021TD-05). Boyuan Pan (while visiting the Ohio State University) and Huan Sun were sponsored in part by the Army Research Office under cooperative agreements W911NF-17-1-0412, NSF Grant IIS1815674, NSF CAREER #1942980, Fujitsu gift grant, and Ohio Supercomputer Center (Center, 1987). The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.


  • S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio (2016) A neural knowledge language model. arXiv preprint arXiv:1608.00318. Cited by: §5.3.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. ICLR. Cited by: §4.3.
  • D. M. Blei and J. D. Lafferty (2006) Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pp. 113–120. Cited by: §5.2.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. Journal of machine Learning research 3 (Jan), pp. 993–1022. Cited by: §1, §4.5, §5.2.
  • O. S. Center (1987) Ohio supercomputer center. Cited by: §6.
  • J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733. Cited by: §4.3.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.
  • A. B. Dieng, C. Wang, J. Gao, and J. Paisley (2016) Topicrnn: a recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702. Cited by: §5.3.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 889–898. Cited by: §1, §3, §4.3, §4.5, §5.1, §5.1.
  • A. Fan, M. Lewis, and Y. Dauphin (2019) Strategies for structuring story generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2650–2660. Cited by: §5.1.
  • K. Filippova and Y. Altun (2013) Overcoming the lack of parallel data in sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1481–1491. Cited by: §4.3.
  • Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin (2015) Scalable deep poisson factor analysis for topic modeling. In International Conference on Machine Learning, pp. 1823–1832. Cited by: §1, §5.2.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In International Conference on Machine Learning, pp. 1243–1252. Cited by: §4.3.
  • S. Goldfarb-Tarrant, H. Feng, and N. Peng (2019) Plan, write, and revise: an interactive system for open-domain story generation. arXiv preprint arXiv:1904.02357. Cited by: §5.1.
  • J. Gu, Z. Lu, H. Li, and V. O. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393. Cited by: §5.1.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In NIPS, pp. 1693–1701. Cited by: §1, §4.1.
  • G. E. Hinton and R. R. Salakhutdinov (2009) Replicated softmax: an undirected topic model. In Advances in neural information processing systems, pp. 1607–1614. Cited by: §5.2.
  • T. Hofmann (1999) Probabilistic latent semantic analysis. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pp. 289–296. Cited by: §1.
  • P. Jain, P. Agrawal, A. Mishra, M. Sukhwani, A. Laha, and K. Sankaranarayanan (2017) Story generation from sequence of independent short descriptions. arXiv preprint arXiv:1707.05501. Cited by: §5.1.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2020) Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77. Cited by: §3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.1, §4.2.2.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. ICLR. Cited by: §1, §1, §2.2, §2.
  • R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302. Cited by: §5.1.
  • H. Larochelle and S. Lauly (2012) A neural autoregressive topic model. In Advances in Neural Information Processing Systems, pp. 2708–2716. Cited by: §5.2.
  • J. H. Lau, T. Baldwin, and T. Cohn (2017) Topically driven neural language model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 355–365. Cited by: §5.3.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT, pp. 110–119. Cited by: §4.5.
  • J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2157–2169. Cited by: §1.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.5.
  • Y. Liu, Z. Liu, T. Chua, and M. Sun (2015) Topical word embeddings. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §5.3.
  • H. H. Mao, B. P. Majumder, J. McAuley, and G. Cottrell (2019) Improving neural story generation by targeted common sense grounding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5990–5995. Cited by: §5.1.
  • Y. Miao, E. Grefenstette, and P. Blunsom (2017) Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2410–2419. Cited by: §1, §2.1, §2.2, §2, §5.2.
  • Y. Miao, L. Yu, and P. Blunsom (2016) Neural variational inference for text processing. In International conference on machine learning, pp. 1727–1736. Cited by: §2, §4.2.1, §5.2.
  • T. Mikolov and G. Zweig (2012) Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234–239. Cited by: §5.3.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of NAACL-HLT, pp. 839–849. Cited by: §1, §4.1.
  • R. Paulus, C. Xiong, and R. Socher (2018) A deep reinforced model for abstractive summarization. ICLR. Cited by: §4.1.
  • N. Peng, M. Ghazvininejad, J. May, and K. Knight (2018) Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, pp. 43–49. Cited by: §4.1.
  • J. Pennington, R. Socher, C. D. Manning, et al. (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §3.1, §4.2.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1, §3, §4.6, §5.1.
  • M. Roemmele (2016) Writing stories with help from recurrent neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §5.1.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §4.1, §5.1.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §1.
  • P. Tambwekar, M. Dhuliawala, L. J. Martin, A. Mehta, B. Harrison, and M. O. Riedl (2019) Controllable neural story plot generation via reward shaping. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 5982–5988. Cited by: §5.1.
  • Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei (2005) Sharing clusters among related groups: hierarchical dirichlet processes. In Advances in neural information processing systems, pp. 1385–1392. Cited by: §1, §5.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §3.3, §4.2.2.
  • C. Wang, M. Li, and A. J. Smola (2019a) Language models with transformers. arXiv preprint arXiv:1904.09408. Cited by: §5.1.
  • W. Wang, Z. Gan, W. Wang, D. Shen, J. Huang, W. Ping, S. Satheesh, and L. Carin (2018) Topic compositional neural language model. In International Conference on Artificial Intelligence and Statistics, pp. 356–365. Cited by: §5.3.
  • W. Wang, Z. Gan, H. Xu, R. Zhang, G. Wang, D. Shen, C. Chen, and L. Carin (2019b) Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137. Cited by: §5.3.
  • H. Xu, W. Wang, W. Liu, and L. Carin (2018a) Distilled wasserstein learning for word embedding and topic modeling. In Advances in Neural Information Processing Systems, pp. 1716–1725. Cited by: §5.3.
  • J. Xu, X. Ren, Y. Zhang, Q. Zeng, X. Cai, and X. Sun (2018b) A skeleton-based model for promoting coherence among sentences in narrative story generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4306–4315. Cited by: §1, §4.3, §5.1.
  • L. Yao, N. Peng, W. Ralph, K. Knight, D. Zhao, and R. Yan (2019) Plan-and-write: towards better automatic storytelling. AAAI 2019. Cited by: §1, §4.1, §4.3, §4.5, §4.5, §5.1.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.2.2.
  • Y. Zhang, M. Galley, J. Gao, Z. Gan, X. Li, C. Brockett, and B. Dolan (2018) Generating informative and diverse conversational responses via adversarial information maximization. In Advances in Neural Information Processing Systems, pp. 1810–1820. Cited by: §4.5.