Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries

10/14/2020 ∙ by Xiaofei Sun, et al. ∙ Zhejiang University, Peking University

Long-text generation remains a challenge. The difficulty of generating coherent long texts lies in the fact that existing models overwhelmingly focus on local word prediction, and can neither make high-level plans on what to generate nor capture the high-level discourse dependencies between chunks of text. Inspired by how humans write, where a list of bullet points or a catalog is first outlined and then each bullet point is expanded to form the whole article, we propose SOE, a pipelined system that involves summarizing, outlining and elaborating for long-text generation: the model first outlines the summaries for different segments of long texts, and then elaborates on each bullet point to generate the corresponding segment. To avoid the labor-intensive process of soliciting summaries, we propose the reconstruction strategy, which extracts segment summaries in an unsupervised manner by selecting the most informative part of each segment to reconstruct it. The proposed generation system comes with the following merits: (1) the summaries provide high-level guidance for text generation and avoid the local minima of individual word predictions; (2) the high-level discourse dependencies are captured in the conditional dependencies between summaries and are preserved during the summary expansion process; and (3) the model is able to consider significantly more context by representing contexts as concise summaries. Extensive experiments demonstrate that SOE produces long texts with significantly better quality, along with faster convergence speed.




1 Introduction

Although recent large-scale pretrained language models (PLMs) (Peters et al., 2018; Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Clark et al., 2020; Radford et al., 2019; Li et al., 2020a; Brown et al., 2020) are able to produce high-quality passages that can hardly be distinguished from human writing (Zellers et al., 2019), most of the generated "good" texts are of very limited length, e.g., hundreds of tokens in most cases (Guo et al., 2017; Bao et al., 2020; Yan et al., 2020), and generating coherent long texts remains a challenge Radford et al. (2019); Tan et al. (2020). The difficulty lies in the fact that existing models generate texts in a word-by-word manner: predicting each subsequent token given its preceding context using the softmax objective. This word-by-word strategy overwhelmingly focuses on the prediction of local words, and cannot make high-level plans on what to generate. As a result, long texts generated by current models are usually repetitive, generic and self-contradictory Shen et al. (2019).

To address these issues, the coarse-to-fine generation strategy has been proposed Fan et al. (2018); Xu et al. (2018); Yao et al. (2019); Mao et al. (2019). In coarse-to-fine generation, a list of keywords or a short prompt is first generated, serving as a summary of the original text. The prompt is then fed to a seq2seq model as an input to output the complete text. The coarse-to-fine strategy significantly improves generation over the word-by-word strategy, but still suffers from the following shortcomings: (a) limited capacity of the prompt: a single keyword list or prompt does not have enough capacity to summarize all the text of a long passage, since long texts usually consist of several parts, each of which focuses on a specific aspect or topic (Zhou et al., 2018a; Narayan et al., 2018; Guan et al., 2019). The usage of the coarse-to-fine generation strategy is thus limited to texts that can be summarized by a single prompt (e.g., short stories). This explains why the length of texts generated by such models is still limited, e.g., the writingprompts dataset introduced in Fan et al. (2018) has an average story length of around 735 tokens, and an average prompt length of 28 tokens; (b) ignorance of high-level discourse dependency: the coarse-to-fine generation strategy does not capture discourse-level dependencies Li and Jurafsky (2016); Jernite et al. (2017), which govern the high-level information flow and interactions between segments of text. Ignoring discourse-level dependencies results in texts lacking coherence.

Humans write in a hierarchical top-down manner: before writing a thousand-word-long essay, a human usually first prepares a list of bullet points or a catalogue, and then expands them to form the whole article. The sentence-level coherence between these bullet points is preserved when the bullet points are expanded, providing guarantees that the full text is coherent.

To mimic this top-down manner of human writing, in this paper, we propose SOE, a pipelined system that involves summarizing, outlining and expanding for long-text generation: the model first outlines the summaries for different segments of long texts, which mimics the process of humans outlining bullet points; next, the model elaborates on each bullet point to generate the corresponding segment. The proposed strategy comes with the following merits: (a) since each segment is associated with its own summary rather than the entire text sharing a single prompt, the capacity of the summaries to reconstruct the full text can be guaranteed; (b) the conditional generation probability between summaries captures high-level discourse dependencies, and these dependencies are preserved when the summaries are expanded to segments, which naturally resolves the incapability of modeling discourse-level dependencies in the coarse-to-fine generation approach; (c) the model is able to consider a significantly larger amount of context by representing chunks of context as concise summaries.

In practice, we do not readily have summaries for segments at hand. The model thus needs to learn to summarize in an unsupervised manner. Inspired by (Gehrmann et al., 2018b; Zhou et al., 2018b; Liu et al., 2018; Zhong et al., 2020), we propose the reconstruction strategy, which extracts each segment's summary by selecting its most informative part to reconstruct the segment. Extensive experiments demonstrate that SOE produces long texts with significantly better quality than existing baselines.

The rest of this paper is organized as follows: Section 2 presents related work, followed by Section 3, which reviews background. Section 4 introduces our proposed approach in detail. Sections 5 and 6 respectively present the experimental results and ablation studies. Finally, we conclude in Section 7.

2 Related Work

2.1 Generating Long Texts

There are two lines of work on generating long text. The first line tackles the problem from the model perspective. New model structures Kitaev et al. (2020); Child et al. (2019); Dai et al. (2019); Ye et al. (2019); Guo et al. (2019); Sukhbaatar et al. (2019); Correia et al. (2019); Beltagy et al. (2020); Zaheer et al. (2020); Li et al. (2020b) are designed to give the model the ability to ingest more context given limited memory or computing power. For example, Transformer-XL (Dai et al., 2019), a modification of the Transformer (Vaswani et al., 2017), uses a segment-level recurrence mechanism to enable learning long-term dependencies; Child et al. (2019); Correia et al. (2019); Kitaev et al. (2020); Beltagy et al. (2020); Zaheer et al. (2020) proposed to sparsify transformers by focusing only on a fraction of attention connections; Tay et al. (2020) replaced the dot-product self-attention with learned synthetic attention weights; Li et al. (2020b) used an LSTM predictor to automatically learn attention connections adapted to downstream tasks.

The second line of research focuses on developing new generation strategies. Efforts have been devoted to the idea of planning-then-generation or coarse-to-fine generation (Wiseman et al., 2017; Sha et al., 2017; Gehrmann et al., 2018a; Wiseman et al., 2019; Moryossef et al., 2019; Puduppully et al., 2019; Hua and Wang, 2019; Shen et al., 2020; Fu et al., 2020), which greatly inspires this work. In coarse-to-fine generation, a list of keywords or a short sentence is first generated, providing guidance for generating the full text. A recent work from Tan et al. (2020) takes a multi-step strategy, which progressively refines the generated incomplete text until reaching a specified stage. Similar ideas have also been applied to text summarization, where Gehrmann et al. (2018b) proposed a bottom-up method that first identifies phrases within a document that are likely to be included in its summary. Our work is also inspired by the strategy of hierarchical generation Li et al. (2015b); Yu et al. (2016); Nallapati et al. (2016); Liu and Lapata (2019a), which considers text units of coarser granularity: Li et al. (2015b) proposed hierarchical LSTMs that arrange tokens, sentences and paragraphs in a hierarchical structure, with different levels of LSTMs capturing compositionality. Shen et al. (2019) used multi-level structures to learn a VAE model for generating long coherent text. Similar strategies have been applied to the video captioning problem, where Yu et al. (2016) exploited hierarchical RNNs for video caption generation.

2.2 Extractive Summarization

Extractive summarization refers to the problem of selecting part of the input text as its summary. A fundamental problem in extractive summarization is to score constituent text units (e.g., phrases, sentences or paragraphs) and select highly-ranked one(s) as the summary. Haghighi and Vanderwende (2009) used word frequencies in the input text to assign scores to words, which are then in turn used to score sentences. Higher-ranked sentences are selected as the summary of the input text. Liu et al. (2018) presented a two-stage extractive-abstractive framework, which first coarsely identifies salient information, followed by a generation model used to refine it. Neural models have been widely used for scoring Cao et al. (2015); Ren et al. (2017); Zhou et al. (2018b). Liu and Lapata (2019b) finetuned BERT (Devlin et al., 2018) to score each sentence for extractive summarization; Zhang et al. (2019) computed token similarity in each sentence using BERT contextual embeddings to serve as an automatic evaluation metric for text generation.

3 Background

We begin by reviewing the task of text generation.

Language Modeling

refers to the process of calculating the probability of a sequence x = (x_1, x_2, ..., x_n), where each x_t denotes a constituent token of x. The probability can be computed by decomposing the joint distribution p(x) into a product of conditional distributions over tokens:

p(x) = prod_{t=1}^{n} p(x_t | x_{<t})   (1)

where x_{<t} = (x_1, ..., x_{t-1}) is the partial sequence of tokens generated previously. During training, the model is optimized to minimize the negative log-likelihood (NLL) -log p(x). During inference, the model decodes a token at each time step according to p(x_t | x_{<t}) based on the softmax function p(x_t | x_{<t}) = softmax(W h_t), where W is the output word embedding matrix and h_t is the hidden state at time step t. Various smoothing models have been proposed to avoid overfitting Xie et al. (2017); Pereyra et al. (2017).
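As a toy illustration of this chain-rule factorization, the snippet below computes a sequence probability as a product of conditional distributions; the conditional-probability table is hypothetical, standing in for a trained model's softmax outputs.

```python
# Toy illustration of the chain-rule factorization p(x) = prod_t p(x_t | x_{<t}).
# The conditional-probability table is hypothetical, standing in for a model's
# softmax outputs given each prefix.
import math

cond_probs = {
    ("<s>",): {"the": 0.6, "a": 0.4},
    ("<s>", "the"): {"cat": 0.5, "dog": 0.5},
    ("<s>", "the", "cat"): {"sat": 0.9, "ran": 0.1},
}

def sequence_logprob(tokens):
    """log p(x): sum of log conditional probabilities over the sequence."""
    prefix, total = ("<s>",), 0.0
    for tok in tokens:
        total += math.log(cond_probs[prefix][tok])
        prefix = prefix + (tok,)
    return total

# The NLL training objective is simply the negative of this quantity.
nll = -sequence_logprob(["the", "cat", "sat"])
```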

Sequence-to-Sequence (Seq2Seq) Generation

models generate a target sequence y = (y_1, ..., y_m) conditioning on a given source sequence x, which differs from language models (LMs) in terms of whether or not the generation conditions on another input sequence. Similar to LMs, the probability of the target sequence can typically be factorized as:

p(y | x) = prod_{t=1}^{m} p(y_t | y_{<t}, x)   (2)

Seq2seq models are also optimized to minimize the NLL -log p(y | x). In the rest of this paper, we unify the notation of LMs and seq2seq models by setting x to the empty sequence for LMs. Different architectures have been proposed to model p(y | x), including transformers (Vaswani et al., 2017), LSTMs (Luong et al., 2015) and CNNs (Dauphin et al., 2017). At test time, sequences are usually generated using beam search, or its variants that promote diversity (Vijayakumar et al., 2016).

4 Model Details for SOE

In this section, we describe the details of SOE.

4.1 Notations

A long sequence of tokens is first sliced into a series of snippets x^1, x^2, ..., x^K, where K denotes the number of constituent snippets. Here we use the bold font to denote snippets, and the normal font to denote tokens. The number of tokens within each snippet is a hyper-parameter. We also use superscripts to denote the index of a snippet, and subscripts to denote the index of a token. Each snippet x^k consists of a sequence of tokens (x^k_1, x^k_2, ..., x^k_{n_k}), where n_k denotes the length of x^k. Our goal is to generate a subset of the snippets, denoted by x^k, given the preceding snippets, denoted by x^{<k} = (x^1, ..., x^{k-1}). Each snippet x^k is associated with a short summary s^k = (s^k_1, ..., s^k_{m_k}), where s^k_t denotes a token and m_k is the number of tokens in s^k.

Figure 1: An overview of the proposed method. Given the preceding tokens, we first sequentially generate the summaries s^1, ..., s^K for each snippet. Next we expand each summary s^k to form the full text x^k.

4.2 Pipeline Overview

Instead of generating all the constituent words of x^k one by one, we adopt a hierarchical strategy. The process of generating x^k is decoupled into the following two stages.

(1) Outlining Segment Summaries: We sequentially generate the summary s^k for each snippet conditioning on the summaries for the previous snippets. This mimics the process of catalog generation when humans write.

(2) Expanding Summaries to Texts: We expand each summary s^k to the full segment x^k by sequentially generating its constituent words.

An overview of the proposed method is shown in Figure 1.
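The two-stage pipeline can be sketched as follows. `generate_summary` and `expand_summary` are hypothetical stand-ins for the two (parameter-shared) transformer models described later; the stub bodies only make the control flow runnable.

```python
def generate_summary(prev_summaries):
    # Stand-in for the outlining model: conditions on all preceding summaries.
    return f"summary-{len(prev_summaries) + 1}"

def expand_summary(summary, prev_segments):
    # Stand-in for the expansion model: conditions on the summary (and context).
    return f"[segment expanded from {summary}]"

def soe_generate(num_segments):
    summaries, segments = [], []
    # Stage 1 (outline): sequentially generate each segment summary,
    # conditioned on the summaries of the preceding segments.
    for _ in range(num_segments):
        summaries.append(generate_summary(summaries))
    # Stage 2 (elaborate): expand each summary into its full segment.
    for k, s in enumerate(summaries):
        segments.append(expand_summary(s, segments[:k]))
    return summaries, segments
```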

4.3 Extracting Golden Summaries

At training time, we need to learn to generate summaries. But this is not straightforward because the golden summary for each snippet is not readily at hand. Manually soliciting summaries as in Fan et al. (2018) is both costly and slow. We thus propose to adopt the idea of unsupervised extractive summarization: for each snippet x^k, we extract its summary s^k in an unsupervised manner, and use the extracted s^k as the golden summary for learning.

We investigate the following extractive methods to assess the importance of candidate summary sentences, the first three of which are similar to Liu et al. (2018).


Random: For comparison purposes, we use a random sentence as the summary.


TF-IDF: We take the sentence with the highest average TF-IDF score (Ramos, 2003) as the golden summary s^k. A word is assigned a TF-IDF score that scales proportionally to the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word, which can be expressed as c(w) * log(N / N_w), where c(w) is the word count, N is the total number of documents and N_w is the total number of documents containing the word.
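A minimal sketch of this scoring scheme, assuming whitespace tokenization and a tiny illustrative corpus (the real method computes document frequencies over the training corpus):

```python
import math
from collections import Counter

# Tiny illustrative "corpus" of documents for computing document frequencies.
corpus = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "stock prices rose sharply today",
]

def tfidf_scores(doc_words, corpus_docs):
    """Score each word by count(w) * log(N / N_w)."""
    counts = Counter(doc_words)
    n_docs = len(corpus_docs)
    scores = {}
    for w, c in counts.items():
        # max(1, ...) guards against unseen words (df = 0).
        df = max(1, sum(1 for d in corpus_docs if w in d.split()))
        scores[w] = c * math.log(n_docs / df)
    return scores

def pick_summary_sentence(sentences, corpus_docs):
    """Select the sentence with the highest average word-level TF-IDF score."""
    doc_words = " ".join(sentences).split()
    scores = tfidf_scores(doc_words, corpus_docs)
    def avg(sent):
        words = sent.split()
        return sum(scores[w] for w in words) / len(words)
    return max(sentences, key=avg)
```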


TextRank: TextRank (Mihalcea and Tarau, 2004) builds a weighted graph with text units as nodes and the similarities between nodes as edges. We use BERT (Devlin et al., 2018) to compute the similarities between sentences and then rank them based on the TextRank algorithm.


Reconstruction: A summary should be more informative than non-summary sentences; that is, a summary should have the greatest ability to reconstruct the full text. To measure a sentence's reconstruction ability, we use a seq2seq model to predict the original text given the summary sentence, and regard the resulting probability as the reconstruction score. Suppose there are T sentences in x^k, i.e., x^k = (sent_1, ..., sent_T), and sent_j denotes the j-th sentence in x^k. The reconstruction score for sent_j, denoted by r(sent_j), is given as follows:

r(sent_j) = log p(x^k | sent_j)   (3)

To obtain p(x^k | sent_j), we train another seq2seq model, where the input is sent_j for each j, and the output is x^k, obtained by sequentially predicting the tokens in x^k. Given the trained model, we rank all sentences in x^k and use the one with the highest score as the golden summary s^k.
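The ranking loop can be sketched as below. The paper scores each candidate with a trained seq2seq model's log-likelihood of the full snippet; the word-overlap scorer here is only a cheap, hypothetical stand-in so the example runs without a trained model.

```python
def reconstruction_score(sentence, snippet_sentences):
    # Stand-in for log p(snippet | sentence); the real score comes from a
    # trained seq2seq model. Here: total word overlap with the snippet.
    sent_words = set(sentence.split())
    return sum(len(sent_words & set(s.split())) for s in snippet_sentences)

def extract_summary(snippet_sentences):
    """Pick the sentence that best 'reconstructs' the snippet."""
    return max(snippet_sentences,
               key=lambda s: reconstruction_score(s, snippet_sentences))
```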

4.4 Outlining Segment Summaries

In the summary generation stage, we cannot observe the snippet x^k itself, and our goal is to sequentially generate s^k given the preceding summaries s^{<k} and the preceding snippets x^{<k}:

s^k ~ p(s^k | s^{<k}, x^{<k})   (4)

The generation of summary s^k can be factorized into sequentially generating the constituent words within it:

p(s^k | s^{<k}, x^{<k}) = prod_{t=1}^{m_k} p(s^k_t | s^k_{<t}, s^{<k}, x^{<k})   (5)

This process ends when a special end-of-sequence token <EOS> is generated or a specified maximum summary length is reached. We use the Transformer-base architecture (Vaswani et al., 2017) as the backbone. To take into account more contexts, we adopt the segment-level recurrence strategy, similar to Dai et al. (2019), where the hidden states computed for far-away snippets are fixed and cached to be reused for the next new snippet. Gradients are not propagated to these far-away snippets for memory and computation efficiency. This strategy allows the model to exploit information in the history to the largest extent.

4.5 Expanding Summaries to Texts

Next, we expand each summary s^k to the full text x^k for each segment by sequentially generating its constituent words:

p(x^k | s^k, x^{<k}) = prod_{t=1}^{n_k} p(x^k_t | x^k_{<t}, s^k, x^{<k})   (6)

which has the same termination conditions as in summary generation.

4.6 Training and Inference


For summary generation, the transformer model takes (x^{<k}, s^{<k}) as the input and is optimized by minimizing the NLL loss -log p(s^k | s^{<k}, x^{<k}). Due to the memory limitation, we limit x^{<k} to the preceding 384 tokens, and s^{<k} to 128 tokens at training. It is worth noting that the 384 tokens of x^{<k} mostly come from the segment right before, i.e., x^{k-1}, while s^{<k} comes from multiple preceding segments since the summaries are more concise.

For the summary expanding stage, the transformer model takes (s^k, x^{<k}) as input and is optimized by minimizing the NLL loss -log p(x^k | s^k, x^{<k}). The two models, i.e., the summary generation model and the summary expansion model, share parameters, with a task-specific token appended to the start to notify the model of what to generate, summaries or segments.


At test time, we first use beam search to generate summaries. Given the generated summary, beam search is used again to generate the corresponding segment. We consider more contexts at test time, where x^{<k} is limited to 1,156 tokens and s^{<k} is limited to 512 tokens.

Additionally, we augment the vanilla beam search with the strategy of mutual information reranking Li et al. (2015a); Fang et al. (2015). The key point of mutual information is that, instead of merely modeling the uni-directional dependency from the source to the target based on the forward probability p(target | source), it models the mutual dependency between the source and target in sequence-to-sequence generation, i.e., the combination of the forward probability and the backward probability p(source | target). Specifically, in our case, for summary generation, s^k is generated as follows:

s^k = argmax { log p(s^k | s^{<k}, x^{<k}) + λ log p(s^{<k} | s^k) }   (7)

where p(s^{<k} | s^k) is the backward probability of predicting the preceding summaries given s^k. Since direct decoding from Eq.7 is infeasible, we follow the practical solution in Li et al. (2015a), where we first generate an N-best list based on the forward probability (we simplify p(s^{<k} | s^k) as p(s^{k-1} | s^k), where we train a seq2seq model to predict the preceding summary given the current summary), and then rerank the N-best list by combining the forward probability and the backward probability.

A similar strategy can also be applied to the summary expanding stage, where x^k is obtained as follows:

x^k = argmax { log p(x^k | s^k, x^{<k}) + λ log p(x^{k-1} | x^k) }   (8)

where p(x^{k-1} | x^k) is the backward probability of predicting the preceding segment given the current segment. Again, beam search is combined with reranking to approximately find the optimal result.
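The reranking step itself is simple once an N-best list and a backward model are available. The sketch below uses hypothetical candidate log-probabilities and an interpolation weight `lam`; it illustrates the rerank-by-combined-score idea, not the paper's exact decoder.

```python
def mi_rerank(nbest, forward_lp, backward_lp, lam=0.5):
    """Rerank an N-best list by forward log-prob + lam * backward log-prob."""
    return max(nbest, key=lambda c: forward_lp[c] + lam * backward_lp[c])

# Hypothetical scores: the generic candidate has higher forward probability,
# but the backward model strongly prefers the specific one.
nbest = ["generic reply", "specific reply"]
forward = {"generic reply": -1.0, "specific reply": -1.2}
backward = {"generic reply": -3.0, "specific reply": -1.0}
best = mi_rerank(nbest, forward, backward, lam=0.5)
```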

4.7 Slicing Texts based on Coherence Scores

One more thing we need to take care of is how to slice the text into segments. The simplest way is to slice the full text equally. But this is sub-optimal, since a break point could fall in the middle of two closely related sentences, and one segment might contain multiple aspects.

We thus propose a slicing strategy based on sentence-level coherence scores. Using the Next Sentence Prediction (NSP) module from BERT (Devlin et al., 2018), we are able to measure the coherence score between two consecutive sentences with indexes i and i+1, denoted by Score(i, i+1). Given a full text x, let T denote the number of sentences in x, and sent_i denote the i-th sentence. Given a fixed value K for the number of sliced segments, x will be sliced into K segments, i.e., (x^1, ..., x^K), where each x^k consists of a group of consecutive sentences from x. Let I^k denote the list of indexes of the sentences of x^k in the original x, where I^k_1 denotes the index of the first sentence in x^k, I^k_2 the index of the second sentence, etc. Let |I^k| denote the number of sentences in x^k.

We wish to maximize the coherence scores between two consecutive sentences within the same segment and minimize the scores between two consecutive sentences belonging to different segments, giving the following objective to optimize:

max sum_{k=1}^{K} sum_{j=1}^{|I^k|-1} Score(I^k_j, I^k_{j+1}) - sum_{k=1}^{K-1} Score(I^k_{|I^k|}, I^{k+1}_1)   (9)

where Score(I^k_{|I^k|}, I^{k+1}_1) is the coherence score between the ending sentence of a segment and the starting sentence of the next segment. Given K, Eq.9 can be readily solved using linear programming.
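Because every pair of consecutive sentences is either within a segment or on a segment boundary, the objective equals the total coherence minus twice the sum of boundary scores, so one simple way to optimize it is to cut at the K-1 lowest-coherence boundaries. The sketch below assumes precomputed NSP scores; it is our illustration, not the paper's linear-programming solver.

```python
def slice_text(coherence, num_segments):
    """coherence[i] = Score between sentence i and i+1.
    Returns the K-1 cut boundaries (a cut after sentence index i)."""
    # Cutting at the lowest-coherence boundaries minimizes the boundary sum,
    # which maximizes the within-minus-between objective above.
    boundaries = sorted(range(len(coherence)), key=lambda i: coherence[i])
    return sorted(boundaries[:num_segments - 1])

# Hypothetical NSP scores between 6 consecutive sentences (5 boundaries).
scores = [0.9, 0.2, 0.8, 0.1, 0.95]
cuts = slice_text(scores, 3)  # cut after sentences 1 and 3
```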

5 Experiments

WikiText-103 BookCorpus
Model Perplexity # Parameters Perplexity # Parameters
Vanilla 25.0 130M 29.0 130M
WritingPrompts-Keyword 23.8 135M 28.3 135M
WritingPrompts-Sentence 24.1 135M 28.6 135M
Progressive WritingPrompts 23.3 150M 27.7 150M
SOE 22.2 132M 25.7 132M
Vanilla 20.0 220M 24.8 220M
SOE 17.4 224M 22.5 224M
Table 1: Perplexity of different models on WikiText-103 and BookCorpus. Vanilla stands for our implementation of Transformer-XL (Dai et al., 2019).
MSJ Diversity Adversarial Success S-Level Coherence
Model MSJ-2 MSJ-3 MSJ-4 D-1 D-2 Adversarial Success NSP
Vanilla 62.6 41.5 16.9 7.4 19.8 0.037 0.812
WritingPrompts-Keyword 63.0 42.2 17.5 8.9 22.0 0.057 0.836
WritingPrompts-Sentence 63.1 42.2 17.7 8.5 21.0 0.046 0.834
Progressive WritingPrompts 63.9 42.5 18.0 10.7 25.9 0.055 0.854
SOE 64.8 43.9 19.4 16.4 34.3 0.072 0.870
SOE+MI 65.2 44.4 20.0 20.6 40.8 0.103 0.881
Table 2: Results of different models in terms of diversity, adversarial success, MSJ and sentence-level coherence on the BookCorpus corpus. Vanilla stands for our implementation of Transformer-XL (Dai et al., 2019). "D-" stands for "Distinct-", and MI stands for mutual information reranking.
Model Distinct-1 Distinct-2
Vanilla 11.7 25.5
SOE 24.1 45.0
SOE+MI 29.3 48.8
Table 3: Results of different models with large volumes in terms of diversity on the BookCorpus dataset.

In this section, we present the experimental results. Among the different methods of generating summaries, we find that Reconstruction consistently outperforms the rest in our preliminary results. We thus only report results from Reconstruction in this section, and return to the analysis of the different summary extraction methods in the ablation study section.

5.1 Datasets

We need a corpus of contiguous and long text to test SOE. We use two word-level datasets, WikiText-103 (Merity et al., 2016) and the BookCorpus dataset (Zhu et al., 2015).

WikiText-103 contains 103M training words from 28K articles, with an average length of 3.6K words per article. WikiText-103 can be used to test the ability of modeling long-term dependencies.

The BookCorpus dataset is a more suitable dataset for our purpose, with much longer and more contiguous texts. It contains a total number of roughly 1 billion words and 74 million sentences from 11k books, with an average length of 89K words for each book. The average number of words per sentence is 13. For both datasets, we predict the last 2,000 tokens at test time.

5.2 Baselines


Vanilla: Transformers with the segment-level recurrence strategy Dai et al. (2019) naturally constitute a baseline. The model sequentially generates texts in a word-by-word fashion.


WritingPrompts: This model first predicts a list of keywords or a single prompt, and then generates the full text given the prompt Fan et al. (2018). Different from Fan et al. (2018), where golden prompts for stories are available, we do not readily have golden prompts. We thus use the extractive strategies described in Section 4.3, i.e., the TF-IDF method to pick the keyword list as the prompt (denoted by WritingPrompts-Keyword) and the reconstruction method to select the highest-ranking sentence as the prompt (denoted by WritingPrompts-Sentence).

Progressive WritingPrompts

This is the progressive strategy proposed in Tan et al. (2020), which involves multiple stages of prompt generation. Each stage produces a more fine-grained sequence than the stage before, which is used as the input to generate the prompt for the next stage. We follow the protocols in Tan et al. (2020) and use the TF-IDF score to obtain golden prompts for each stage. The number of stages is set to 4.

For all models, we use Adam (Kingma and Ba, 2014) with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, learning-rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout rate of 0.1 on all layers (including the softmax layer).

5.3 Evaluations

We use the following evaluation metrics to evaluate the quality of different generation models from different perspectives.

Perplexity (PPL)

Perplexity measures how fluent a piece of generated text is (Dai et al., 2019). We use PPL as the basic evaluation metric in our experiments.
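Concretely, perplexity is the exponentiated mean negative log-likelihood over evaluated tokens; the token log-probabilities below are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """ppl = exp(mean NLL) over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 1/4 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
```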


Diversity

Perplexity cannot measure how diverse the generated text is. We thus use the proportion of distinct unigrams (Distinct-1) and bigrams (Distinct-2) to measure the diversity of generated texts (Li et al., 2016).
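Distinct-n is simply the number of distinct n-grams divided by the total number of n-grams in the generated text; a minimal sketch:

```python
def distinct_n(tokens, n):
    """Fraction of n-grams in `tokens` that are distinct (Distinct-n)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

tokens = "the cat the cat".split()
d1 = distinct_n(tokens, 1)  # 2 distinct unigrams out of 4
d2 = distinct_n(tokens, 2)  # 2 distinct bigrams out of 3
```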

Adversarial Success

Inspired by adversarial evaluations (Bowman et al., 2016; Kannan and Vinyals, 2017; Li et al., 2017), we use the adversarial success metric, which is defined as the fraction of machine-generated texts that successfully fool a trained evaluator into believing they are written by humans. The evaluator is a binary classification model. At training time, it takes as inputs machine-generated texts and original texts, and is trained to discriminate between them. At test time, adversarial success is the value 1 - acc, where acc denotes the accuracy of the trained evaluator in predicting machine-generated texts as machine-generated. Higher values of adversarial success denote better text quality.

MS-Jaccard (MSJ)

MSJ measures the similarity of the n-gram frequencies between the generated texts and the golden texts Montahaei et al. (2019). We report MSJ-2, -3 and -4.

Sentence-Level Coherence

PPL, MSJ and diversity scores do not reflect the sentence-level coherence of generated texts. We adopt the strategy in Tan et al. (2020) where Next Sentence Prediction (NSP) from pretrained BERT model (Devlin et al., 2018) is used as a metric to measure the coherence between each sentence and its next sentence. We report average NSP scores for all consecutive sentence pairs within the generated text.

5.4 Results

Table 1 shows the perplexity of different models on the WikiText-103 and BookCorpus datasets. On both datasets, SOE achieves the lowest PPL compared to the baselines Transformer-XL (Dai et al., 2019), WritingPrompts (Fan et al., 2018) and Progressive (Tan et al., 2020). In particular, for WikiText-103, we obtain PPL decreases of 2.8, 1.6 and 1.1 against our implemented Transformer-XL, WritingPrompts-Keyword and Progressive WritingPrompts, while having the same or even fewer parameters. A similar trend can be observed on BookCorpus.

Table 2 and Table 3 show the results for MSJ, diversity, adversarial success and sentence-level coherence scores. As can be seen, the WritingPrompts-based models generally outperform the Transformer-XL model, which adopts the word-by-word generation strategy. This validates the superiority of the two-step generation strategy over the naive word-by-word strategy for long-text generation. The Progressive WritingPrompts model, which involves multiple steps of generation and expansion, outperforms the one-step WritingPrompts-Keyword and WritingPrompts-Sentence models, which is in accord with our expectation. SOE achieves significantly better results than the Vanilla, WritingPrompts and Progressive models in terms of all evaluation metrics, showing that the proposed method can produce more fluent, coherent and diverse texts. The consistent performance boosts on all metrics demonstrate the importance of modeling discourse-level dependencies and the necessity of the summary expanding strategy for long-text generation.

Additionally, enhanced by mutual information (MI), we observe additional performance boosts, especially for diversity and adversarial success. This is in accord with our expectation: since mutual information is able to build bidirectional dependencies between the source and the target, models enhanced with mutual information can generate better summaries, and the phenomenon of generic and repetitive generation can be alleviated (Li et al., 2016), leading to more diverse results.

6 Ablation Studies

6.1 The Effect of Segment Length

The size of the segment can be neither too big nor too small: extremely long segments might contain too many aspects or topics for a single summary to cover, in which case the model degenerates into the WritingPrompts model Fan et al. (2018); for too short segments, the summary cannot provide high-level guidance. We thus need to find the sweet spot for the segment length. Figure 2 shows results on the BookCorpus dataset. It is clear from the figure that both too short and too long segments lead to inferior performance.

Figure 2: PPL on the BookCorpus dataset w.r.t. different segment lengths.

6.2 The Effect of Summary Generation Strategies

It is worthwhile to explore how different summary extraction methods affect the final performance. To this end, we conduct experiments on the BookCorpus dataset using the different summary extraction methods, i.e., Random, TextRank, TF-IDF and Reconstruction. Table 4 shows the results. We first compare the PPL for summary generation, where the Reconstruction model achieves the lowest PPL and thus produces summaries that are the easiest to predict given the preceding contexts. It is also interesting to see that across all summary generation strategies, the PPL for summary generation is significantly larger than that for text prediction, which is reasonable since (1) generating summaries for the upcoming segment requires more generalization ability; and (2) there are more diverse options for what the next segment should talk about than the local choices for what the next sentence should talk about. For the final text generation, Reconstruction achieves the best results in terms of PPL, MJ-3 and MJ-4. TextRank and TF-IDF are better than Vanilla. Interestingly, the strategy of using random sentences as summaries performs worse than using no summaries at all, which can be explained by the fact that no guidance is better than incorrect guidance.

Method Summary PPL Text PPL MJ-4
Vanilla - 29.0 16.9
Random 40.1 30.2 15.5
TextRank 30.7 26.2 17.8
TF-IDF 33.0 26.9 17.3
Reconstruction 30.4 25.7 19.4
Table 4: Performance of different summary extraction methods described in Section 4.3. Vanilla is the plain model that generates tokens one by one without summaries.

6.3 The Effect of Coherence-based Text Slicing

We replace the coherence-based text slicing strategy with the naive equal slicing strategy to see how this negatively affects performance. On the BookCorpus dataset, we observe an increase in summary generation PPL from 30.4 to 30.9, and a +0.7 increase in token generation PPL from 25.7 to 26.4, which demonstrates the importance of slicing text into coherent segments for generation. It is also worth noting that, even with the naive equal slicing strategy, SOE still performs significantly better than the other baseline models.

6.4 Decoupling The Effects of Summaries

The positive effects of summaries are two-fold: (1) they provide high-level guidance for segment generation; and (2) with far-away segments concisely represented by summaries, they give the model the ability to consider longer contexts. To quantitatively measure the influence of both aspects, we conduct the following experiment: at test time, when computing the summary and segment probabilities, the model can only access summaries for segments that are used as contexts. In other words, only summaries within the 1,156 tokens of preceding context can be fed as inputs. This is different from the original version of SOE, in which summaries can extend to preceding contexts beyond this window, until the limit of 512 summary tokens is reached. We did not retrain the model, but added this limitation at test time. On the BookCorpus dataset, this leads to an increase of 0.8 in PPL (25.7 vs 26.5), and a decrease of 0.4 in MJ-3 (43.5 vs 43.9) and 0.8 in MJ-4 (18.6 vs 19.4).

6.5 Simplifying Summary Generation

Here we explore simplifications of summary generation. In SOE, the current summary s^k is generated based on both the previous summaries and the previous segment tokens, i.e., p(s^k | s^{<k}, x^{<k}). We can simplify this to p(s^k | s^{<k}), where previous segment tokens are not fed as inputs to predict the summary, which significantly decreases computational complexity. On the BookCorpus dataset, we observe an increase in summary generation PPL from 30.4 to 31.2, which subsequently leads to a +0.9 increase in token generation PPL from 25.7 to 26.6.

6.6 Convergence Speed

Figure 3: Convergence speed for different models.

Finally, we investigate how quickly different models converge. Results are shown in Figure 3. With the guidance of extracted summaries, SOE converges conspicuously faster: at about 200K training steps it has approximately reached its best result, while the other two models, Vanilla and WritingPrompts, do not converge until 1,000K training steps. The WritingPrompts model converges faster than Vanilla because of the high-level guidance from prompts.

7 Conclusion

In this paper, we propose a two-step hierarchical generation strategy for long-text generation: the model first generates the summary for each segment conditioning on the previous summaries, and next, each summary is expanded to form the full text segment. The proposed strategy provides high-level guidance for local text generation, and enables high-level discourse dependencies to be captured. Extensive experiments demonstrate that SOE produces long texts with significantly better quality, along with faster convergence speed.