Although recent large-scale pretrained language models (PLMs) (Peters et al., 2018; Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Clark et al., 2020; Radford et al., 2019; Li et al., 2020a; Brown et al., 2020) are able to produce high-quality passages that can hardly be distinguished from human writing (Zellers et al., 2019), most of the generated "good" texts are of very limited length, e.g., hundreds of tokens in most cases (Guo et al., 2017; Bao et al., 2020; Yan et al., 2020), and generating coherent long texts remains a challenge (Radford et al., 2019; Tan et al., 2020). The difficulty lies in the fact that existing models generate texts in a word-by-word manner: predicting each subsequent token given its preceding context using the softmax objective. This word-by-word strategy focuses overwhelmingly on the prediction of local words, and cannot make high-level plans about what to generate. As a result, long texts generated by current models are usually repetitive, generic and self-contradictory (Shen et al., 2019).
To address these issues, the coarse-to-fine generation strategy has been proposed (Fan et al., 2018; Xu et al., 2018; Yao et al., 2019; Mao et al., 2019). In coarse-to-fine generation, a list of keywords or a short prompt is first generated, serving as a summary of the original text. The prompt is then fed to a seq2seq model as input to produce the complete text. The coarse-to-fine strategy significantly improves generation over the word-by-word strategy, but still suffers from the following shortcomings: (a) limited capacity of the prompt: a single keyword list or prompt does not have enough capacity to summarize the entirety of a long passage, since long texts usually consist of several parts, each of which focuses on a specific aspect or topic (Zhou et al., 2018a; Narayan et al., 2018; Guan et al., 2019). The usage of the coarse-to-fine strategy is thus limited to texts that can be summarized by a single prompt (e.g., short stories). This explains why the length of texts generated by such models is still limited: e.g., in the writingprompts dataset introduced by Fan et al. (2018), the average length of stories is around 735 tokens and the average length of prompts is 28; (b) ignorance of high-level discourse dependency: the coarse-to-fine generation strategy does not capture discourse-level dependencies (Li and Jurafsky, 2016; Jernite et al., 2017), which handle the high-level information flow and interactions between segments of text. Ignoring discourse-level dependencies results in texts lacking coherence.
Humans write in a hierarchical top-down manner: before writing a thousand-word-long essay, a human usually first prepares a list of bullet points or a catalogue, and then expands them to form the whole article. The sentence-level coherence between these bullet points is preserved when the bullet points are expanded, providing guarantees that the full text is coherent.
To mimic this top-down manner of human writing, in this paper we propose SOE, a pipelined system that involves summarizing, outlining and expanding for long text generation: the model first outlines the summaries for different segments of long texts, which mimics the process of humans outlining bullet points; next, the model elaborates on each bullet point to generate the corresponding segment. The proposed strategy comes with the following merits: (a) since each segment is associated with its own summary rather than the entire text sharing a single prompt, the capacity of the summaries to reconstruct the full text can be guaranteed; (b) the conditional generation probability between summaries captures high-level discourse dependencies, and these dependencies are preserved when the summaries are expanded to segments. This naturally resolves the inability of the coarse-to-fine approach to model discourse-level dependencies; (c) the model is able to take into account a significantly larger amount of context by representing chunks of context as concise summaries.
In practice, we do not have summaries for segments readily at hand. The model thus needs to learn to summarize in an unsupervised manner. Inspired by Gehrmann et al. (2018b); Zhou et al. (2018b); Liu et al. (2018); Zhong et al. (2020), we propose the reconstruction strategy, which extracts the summary of a segment by selecting its most informative part, i.e., the part best able to reconstruct the segment. Extensive experiments demonstrate that SOE produces long texts with significantly better quality than existing baselines.
The rest of this paper is organized as follows: Section 2 presents related work, followed by Section 3, which reviews background. Section 4 introduces our proposed approach in detail. Sections 5 and 6 respectively present the experimental results and ablation studies. Finally, we conclude in Section 7.
2 Related Work
2.1 Generating Long Texts
There are two lines of work on generating long text. The first line tackles the problem from the model perspective. New model structures (Kitaev et al., 2020; Child et al., 2019; Dai et al., 2019; Ye et al., 2019; Guo et al., 2019; Sukhbaatar et al., 2019; Correia et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Li et al., 2020b) are designed to give the model the ability to process more context given limited memory or computing power. For example, Transformer-XL (Dai et al., 2019), a modification of the Transformer (Vaswani et al., 2017), uses a segment-level recurrence mechanism to enable learning long-term dependencies; Child et al. (2019); Correia et al. (2019); Kitaev et al. (2020); Beltagy et al. (2020); Zaheer et al. (2020) proposed to sparsify transformers by focusing only on a fraction of attention connections; Tay et al. (2020) replaced the dot-product self-attention with learned synthetic attention weights; Li et al. (2020b) used an LSTM predictor to automatically learn attention connections adapted to downstream tasks.
The second line of research focuses on developing new generation strategies. Efforts have been devoted to the idea of planning-then-generation or coarse-to-fine generation (Wiseman et al., 2017; Sha et al., 2017; Gehrmann et al., 2018a; Wiseman et al., 2019; Moryossef et al., 2019; Puduppully et al., 2019; Hua and Wang, 2019; Shen et al., 2020; Fu et al., 2020), which greatly inspires this work. In coarse-to-fine generation, a list of keywords or a short sentence is first generated, providing guidance for generating the full text. A recent work by Tan et al. (2020) takes a multi-step strategy, which progressively refines the generated incomplete text until a specified stage is reached. Similar ideas have also been applied to text summarization, where Gehrmann et al. (2018b) proposed a bottom-up method that first identifies phrases within a document that are likely to be included in its summary. Our work is also inspired by the strategy of hierarchical generation (Li et al., 2015b; Yu et al., 2016; Nallapati et al., 2016; Liu and Lapata, 2019a), which considers text units of larger granularity: Li et al. (2015b) proposed hierarchical LSTMs that arrange tokens, sentences and paragraphs in a hierarchical structure, with different levels of LSTMs capturing compositionality; Shen et al. (2019) used multi-level structures to learn a VAE model for generating long coherent text. Similar strategies have been applied to video captioning, where Yu et al. (2016) exploited hierarchical RNNs for video caption generation.
2.2 Extractive Summarization
Extractive summarization refers to the problem of selecting part of the input text as its summary. A fundamental problem in extractive summarization is to score constituent text units (e.g., phrases, sentences or paragraphs) and select highly-ranked one(s) as the summary. Haghighi and Vanderwende (2009) used word frequencies in the input text to assign scores to words, which in turn are used to score sentences; higher-ranked sentences are selected as the summary of the input text. Liu et al. (2018) presented a two-stage extractive-abstractive framework, which first coarsely identifies salient information, followed by a generation model that refines it. Neural models have been widely used for scoring (Cao et al., 2015; Ren et al., 2017; Zhou et al., 2018b). Liu and Lapata (2019b) finetuned BERT (Devlin et al., 2018) to score each sentence for extractive summarization; Zhang et al. (2019) computed token similarity in each sentence using BERT contextual embeddings to serve as an automatic evaluation metric for text generation.
3 Background

We begin by reviewing the task of text generation.

Language Modeling (LM) refers to the process of calculating the probability of a sequence $y = \{y_1, y_2, ..., y_T\}$, where each $y_t$ denotes a constituent token of $y$. The probability can be computed by decomposing the joint distribution into a product of conditional distributions over tokens:

$$p(y) = \prod_{t=1}^{T} p(y_t \mid y_{<t})$$

where $y_{<t}$ is the partial sequence of tokens generated previously. During training, the model is optimized to minimize the negative log-likelihood (NLL) $-\log p(y)$. During inference, the model decodes a token at each time step according to $p(y_t \mid y_{<t})$ based on the softmax function $p(y_t \mid y_{<t}) = \mathrm{softmax}(W h_t)$, where $W$ is the output word embedding matrix and $h_t$ is the hidden state at time step $t$. Various smoothing methods have been proposed to avoid overfitting (Xie et al., 2017; Pereyra et al., 2017).
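As a concrete illustration, the softmax normalization and the NLL objective above can be sketched in a few lines of pure Python; the toy logits and variable names here are ours, not from the paper:

```python
import math

def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sequence_nll(step_logits, target_ids):
    """Negative log-likelihood of a token sequence: -sum_t log p(y_t | y_<t)."""
    nll = 0.0
    for logits, t in zip(step_logits, target_ids):
        probs = softmax(logits)
        nll -= math.log(probs[t])
    return nll

# Toy vocabulary of 4 tokens; two decoding steps.
step_logits = [[2.0, 0.5, 0.1, -1.0],
               [0.2, 3.0, 0.0, 0.0]]
target = [0, 1]
loss = sequence_nll(step_logits, target)
```

Training minimizes this quantity summed over the corpus; at inference, a token is sampled or selected from each per-step distribution.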
Sequence-to-Sequence (Seq2Seq) Generation models generate a target sequence $y$ conditioned on a given source sequence $x$; they differ from language models (LMs) in whether generation is conditioned on another input sequence. Similar to LMs, the probability of the target sequence can be factorized as:

$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)$$

Seq2seq models are also optimized to minimize the NLL $-\log p(y \mid x)$. In the rest of this paper, we unify the notation of LMs and seq2seq models by setting $x = \varnothing$ for LMs. Different architectures have been proposed to model $p(y \mid x)$, including Transformers (Vaswani et al., 2017), LSTMs (Luong et al., 2015) and CNNs (Dauphin et al., 2017). At test time, sequences are usually generated using beam search or its variants that promote diversity (Vijayakumar et al., 2016).
4 Model Details for SOE
In this section, we describe the details of SOE.
4.1 Notations

A long sequence of tokens is first sliced into a series of snippets $\mathbf{y} = \{\mathbf{y}^1, ..., \mathbf{y}^K\}$, where $K$ denotes the number of constituent snippets. Here we use bold font to denote snippets, and normal font to denote tokens. The number of tokens within each snippet is a hyper-parameter. We also use superscripts to denote the index of a snippet, and subscripts to denote the index of a token. Each $\mathbf{y}^k$ consists of a sequence of tokens $\mathbf{y}^k = \{y^k_1, ..., y^k_{n_k}\}$, where $n_k$ denotes the length of $\mathbf{y}^k$. Our goal is to generate a subset of $\mathbf{y}$, denoted by $\mathbf{y}^k$, given its preceding snippets, denoted by $\mathbf{y}^{<k}$. Each snippet $\mathbf{y}^k$ is associated with a short summary $\mathbf{s}^k = \{s^k_1, ..., s^k_{m_k}\}$, where $s^k_t$ denotes a token and $m_k$ is the number of tokens in $\mathbf{s}^k$.
4.2 Pipeline Overview
Instead of generating all constituent words in $\mathbf{y}^k$ one by one, we adopt a hierarchical strategy. The process of generating $\mathbf{y}^k$ is decoupled into the following two stages.
(1) Outlining Segment Summaries: We sequentially generate the summary $\mathbf{s}^k$ for each snippet, conditioning on the summaries for previous snippets. This mimics the process of catalogue generation when humans write.
(2) Expanding Summaries to Texts: we expand each summary to the full segment by sequentially generating its constituent words.
An overview of the proposed method is shown in Figure 1.
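The two-stage pipeline above can be sketched as follows; the generator functions are trivial stand-ins for the summary and expansion models described above, and all names here are ours, not the paper's:

```python
# Illustrative two-stage SOE pipeline (a sketch, not the trained models).

def generate_summary(prev_summaries):
    # Stand-in for the autoregressive summary model p(s^k | s^<k, y^<k).
    return f"summary-{len(prev_summaries) + 1}"

def expand_summary(summary, prev_segments):
    # Stand-in for the expansion model p(y^k | s^<=k, y^<k).
    return f"segment expanded from {summary}"

def soe_generate(num_segments):
    summaries, segments = [], []
    for _ in range(num_segments):
        s = generate_summary(summaries)                # (1) outline
        summaries.append(s)
        segments.append(expand_summary(s, segments))   # (2) expand
    return summaries, segments

summaries, segments = soe_generate(3)
```

The key property is that each segment is conditioned on its own summary and on the summaries of all preceding segments, so discourse-level dependencies are carried through the summary chain.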
4.3 Extracting Golden Summaries
At training time, we need to learn to generate summaries. This is not straightforward because the golden summary for a snippet $\mathbf{y}^k$ is not readily at hand, and manually soliciting summaries as in Fan et al. (2018) is both costly and slow. We thus take the idea of unsupervised extractive summarization: for each snippet $\mathbf{y}^k$, we extract its summary $\hat{\mathbf{s}}^k$ in an unsupervised manner, and use the extracted $\hat{\mathbf{s}}^k$ as the golden summary for learning.
We investigate the following extractive methods to assess the importance of candidate summary sentences, the first three of which are similar to Liu et al. (2018).
Random: For comparison purposes, we use a random sentence as the summary.
TF-IDF: We take the sentence with the highest average TF-IDF score (Ramos, 2003) as the golden summary $\hat{\mathbf{s}}^k$. TF-IDF assigns each word a score that scales proportionally with the number of times the word appears in the document, offset by the number of documents in the corpus that contain the word: $\text{tf-idf} = c_w \times \log(N / N_w)$, where $c_w$ is the word count, $N$ is the total number of documents and $N_w$ is the number of documents containing the word.
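A minimal sketch of this TF-IDF selection rule in pure Python; the +1 document-frequency smoothing is our choice for the sketch, not necessarily the paper's:

```python
import math
from collections import Counter

def tfidf_summary(document_sentences, corpus):
    """Pick the sentence with the highest average TF-IDF score.

    document_sentences: list of sentences, each a list of lowercase tokens.
    corpus: list of documents (token lists) used for document frequencies.
    """
    n_docs = len(corpus)

    def idf(word):
        n_w = sum(1 for doc in corpus if word in doc)
        return math.log(n_docs / (1 + n_w))  # +1 smoothing (our choice)

    # Term counts over the whole document being summarized.
    counts = Counter(w for sent in document_sentences for w in sent)

    def avg_score(sent):
        return sum(counts[w] * idf(w) for w in sent) / len(sent)

    return max(document_sentences, key=avg_score)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "rare", "word"]]
doc = [["the", "cat"], ["rare", "word"]]
summary = tfidf_summary(doc, corpus)
```

Frequent-but-common words like "the" get an IDF near zero, so sentences dominated by corpus-rare words win.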
Reconstruction: A summary should be more informative than non-summary sentences; that is, a summary should have the greatest ability to reconstruct the full text. To measure a sentence's reconstruction ability, we use a seq2seq model to predict the original text given the summary sentence, and regard the resulting probability as the reconstruction score. Suppose there are $m$ sentences in $\mathbf{y}^k$, and let $\text{sent}_i$ denote the $i$-th sentence in $\mathbf{y}^k$. The reconstruction score for $\text{sent}_i$, denoted by $r_i$, is given as follows:

$$r_i = p(\mathbf{y}^k \mid \text{sent}_i)$$
To obtain $p(\mathbf{y}^k \mid \text{sent}_i)$, we train another seq2seq model, where the input is $\text{sent}_i$ for each $i$, and the output is $\mathbf{y}^k$, obtained by sequentially predicting the tokens in $\mathbf{y}^k$. Given the trained model, we rank all sentences in $\mathbf{y}^k$ and use the one with the highest score as the golden summary $\hat{\mathbf{s}}^k$.
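The selection step can be sketched as follows. The learned seq2seq reconstruction probability is abstracted as a callable; the toy word-overlap scorer is only a stand-in for that trained model, not the paper's method:

```python
def select_summary_by_reconstruction(sentences, reconstruction_score):
    """Return the sentence whose reconstruction score is highest.

    reconstruction_score(sentence, segment) should approximate
    log p(segment | sentence) from a trained seq2seq model; it is passed
    in as a callable so the selection logic stays self-contained.
    """
    segment = " ".join(sentences)
    return max(sentences, key=lambda s: reconstruction_score(s, segment))

# Toy stand-in scorer: favors the sentence sharing the most words with
# the rest of the segment (NOT the paper's learned model).
def overlap_score(sentence, segment):
    sent_words = set(sentence.split())
    return sum(1 for w in segment.split() if w in sent_words)

sents = ["the king ruled the land",
         "he liked tea",
         "the land prospered under the king"]
best = select_summary_by_reconstruction(sents, overlap_score)
```

With a real scorer, one would batch-score all $(\text{sent}_i, \mathbf{y}^k)$ pairs with the trained seq2seq model and take the argmax in the same way.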
4.4 Outlining Segment Summaries
In the summary generation stage, we cannot observe $\mathbf{y}^k$, and our goal is to sequentially generate $\mathbf{s}^k$ given the preceding summaries and snippets:

$$p(\mathbf{s}^k \mid \mathbf{s}^{<k}, \mathbf{y}^{<k})$$
The generation of summary $\mathbf{s}^k$ can be factorized into sequentially generating the constituent words within it:

$$p(\mathbf{s}^k \mid \mathbf{s}^{<k}, \mathbf{y}^{<k}) = \prod_{t=1}^{m_k} p(s^k_t \mid s^k_{<t}, \mathbf{s}^{<k}, \mathbf{y}^{<k})$$
This process continues until a special end-of-sequence token <EOS> is generated or a specified maximum summary length is reached. We use the Transformer-base architecture (Vaswani et al., 2017) as the backbone. To take more context into account, we adopt a segment-level recurrence strategy similar to Dai et al. (2019), where the hidden states computed for far-away snippets are fixed and cached to be reused for the next new snippet. Gradients are not propagated to these far-away snippets, for memory and computation efficiency. This strategy allows the model to exploit information in the history to the largest extent.
4.5 Expanding Summaries to Texts
Next, we expand each summary $\mathbf{s}^k$ to the full text $\mathbf{y}^k$ of the corresponding segment by sequentially generating its constituent words:

$$p(\mathbf{y}^k \mid \mathbf{s}^{\le k}, \mathbf{y}^{<k}) = \prod_{t=1}^{n_k} p(y^k_t \mid y^k_{<t}, \mathbf{s}^{\le k}, \mathbf{y}^{<k})$$

which has the same termination conditions as summary generation.
4.6 Training and Inference
For summary generation, the transformer model takes $\{\mathbf{s}^{<k}, \mathbf{y}^{<k}\}$ as input and is optimized by minimizing the NLL loss $-\log p(\mathbf{s}^k \mid \mathbf{s}^{<k}, \mathbf{y}^{<k})$. Due to memory limitations, at training time we limit $\mathbf{y}^{<k}$ to the preceding 384 tokens and $\mathbf{s}^{<k}$ to 128 tokens. It is worth noting that the 384 tokens of $\mathbf{y}^{<k}$ mostly come from the segment immediately before, i.e., $\mathbf{y}^{k-1}$, while $\mathbf{s}^{<k}$ covers multiple preceding segments since summaries are more concise.
For the summary expanding stage, the transformer model takes $\{\mathbf{s}^{\le k}, \mathbf{y}^{<k}\}$ as input and is optimized by minimizing the NLL loss $-\log p(\mathbf{y}^k \mid \mathbf{s}^{\le k}, \mathbf{y}^{<k})$. The two models, i.e., the summary generation model and the summary expansion model, share parameters, with a task-specific token appended to the start to signal what the model should generate: summaries or segments.
At test time, we first use beam search to generate summaries. Given the generated summary, beam search is used again to generate the corresponding segment. We consider more context at test time, where $\mathbf{y}^{<k}$ is limited to 1,156 tokens and $\mathbf{s}^{<k}$ is limited to 512 tokens.
Additionally, we augment vanilla beam search with the strategy of mutual information reranking (Li et al., 2015a; Fang et al., 2015). The key point of mutual information is that, instead of merely modeling the uni-directional dependency from source to target through the forward probability, it models the mutual dependency between source and target in sequence-to-sequence generation, i.e., a combination of the forward probability and the backward probability. Specifically, in our case, for summary generation, $\mathbf{s}^k$ is generated as follows:

$$\mathbf{s}^k = \arg\max_{\mathbf{s}} \big[ \log p(\mathbf{s} \mid \mathbf{s}^{<k}, \mathbf{y}^{<k}) + \lambda \log p(\mathbf{s}^{k-1} \mid \mathbf{s}) \big] \quad (7)$$

where $p(\mathbf{s}^{k-1} \mid \mathbf{s}^k)$ is the backward probability of predicting the preceding summary given $\mathbf{s}^k$. Since direct decoding from Eq. 7 is infeasible, we follow the practical solution of Li et al. (2015a): we first generate an $N$-best list based on the forward probability (we simplify $p(\mathbf{s}^{<k} \mid \mathbf{s}^k)$ as $p(\mathbf{s}^{k-1} \mid \mathbf{s}^k)$, training a seq2seq model to predict the preceding summary given the current summary), and then rerank the $N$-best list by combining the forward probability and the backward probability.
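The reranking step itself can be sketched as follows; the interpolation weight `lam` and the example log-probabilities are illustrative values of ours, not from the paper:

```python
def mmi_rerank(nbest, forward_scores, backward_scores, lam=0.5):
    """Rerank an N-best list by a weighted sum of forward log-prob
    log p(target | source) and backward log-prob log p(source | target)."""
    combined = [f + lam * b for f, b in zip(forward_scores, backward_scores)]
    ranked = sorted(zip(nbest, combined), key=lambda x: -x[1])
    return [candidate for candidate, _ in ranked]

nbest = ["generic reply", "specific reply", "odd reply"]
forward = [-1.0, -1.2, -3.0]   # generic candidates often score high forward...
backward = [-4.0, -0.5, -2.0]  # ...but low backward (they fit any source)
reranked = mmi_rerank(nbest, forward, backward)
```

This is why mutual information reranking tends to demote generic outputs: they are likely under the forward model but give little evidence about the source under the backward model.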
A similar strategy can also be applied to the summary expanding stage, where $\mathbf{y}^k$ is obtained as follows:

$$\mathbf{y}^k = \arg\max_{\mathbf{y}} \big[ \log p(\mathbf{y} \mid \mathbf{s}^{\le k}, \mathbf{y}^{<k}) + \lambda \log p(\mathbf{y}^{k-1} \mid \mathbf{y}) \big]$$

The backward probability $p(\mathbf{y}^{k-1} \mid \mathbf{y}^k)$ predicts the preceding segment given the current segment. Again, beam search is combined with reranking to approximately find the optimal result.
4.7 Slicing Texts based on Coherence Scores
One more issue we need to address is how to slice the text into segments. The simplest way is to slice the full text into equal-length segments, but this is sub-optimal since a break point could fall in the middle of two closely related sentences, and one segment might contain multiple aspects.
We thus propose a slicing strategy based on sentence-level coherence scores. Using the Next Sentence Prediction (NSP) module of BERT (Devlin et al., 2018), we measure the coherence between two consecutive sentences with indexes $i$ and $i+1$, denoted by $\text{Score}(i, i+1)$. Given a full text $\mathbf{y}$, let $m$ denote the number of sentences in $\mathbf{y}$, and $\text{sent}_i$ denote the $i$-th sentence. Given a fixed value $K$ for the number of sliced segments, $\mathbf{y}$ will be sliced into $K$ segments, i.e., $\mathbf{y} = \{\mathbf{y}^1, ..., \mathbf{y}^K\}$, where each $\mathbf{y}^k$ consists of a group of consecutive sentences from $\mathbf{y}$. Let $I^k$ denote the list of indexes in the original $\mathbf{y}$ of the sentences in $\mathbf{y}^k$, where $I^k_1$ denotes the index of the first sentence in $\mathbf{y}^k$, $I^k_2$ the second, etc. Let $|I^k|$ denote the number of sentences in $\mathbf{y}^k$.
We wish to maximize the coherence scores between two consecutive sentences within the same segment and minimize the scores between two consecutive sentences belonging to different segments, giving the following objective to optimize:

$$\max \sum_{k=1}^{K} \sum_{j=1}^{|I^k|-1} \text{Score}(I^k_j, I^k_{j+1}) - \sum_{k=1}^{K-1} \text{Score}(I^k_{|I^k|}, I^{k+1}_1) \quad (9)$$

where $\text{Score}(I^k_{|I^k|}, I^{k+1}_1)$ is the coherence score between the ending sentence of a segment and the starting sentence of the next segment. Given $K$, Eq. 9 can be readily solved using linear programming.
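Since the objective amounts to the total consecutive-sentence coherence minus the scores at the $K-1$ cut points, an exact solution can also be obtained by simply cutting at the $K-1$ lowest-scoring sentence boundaries. The sketch below uses this observation as an illustrative alternative to the linear-programming formulation in the text:

```python
def slice_by_coherence(scores, m, K):
    """Slice m sentences into K contiguous segments.

    scores[i] is the NSP coherence score between sentence i and i+1
    (so len(scores) == m - 1). Maximizing within-segment coherence is
    equivalent to cutting at the K-1 lowest-scoring boundaries.
    Returns a list of K lists of sentence indexes.
    """
    boundaries = sorted(range(m - 1), key=lambda i: scores[i])[:K - 1]
    cuts = sorted(b + 1 for b in boundaries)
    segments, start = [], 0
    for c in cuts + [m]:
        segments.append(list(range(start, c)))
        start = c
    return segments

# Six sentences: boundaries 1 and 3 are the least coherent, so cut there.
segments = slice_by_coherence([0.9, 0.1, 0.8, 0.2, 0.7], m=6, K=3)
```

Note this simple rule does not constrain segment sizes; if roughly balanced segments are required, a dynamic-programming or LP formulation with length constraints is needed instead.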
[Table 1: Perplexity and number of parameters for each model on the two datasets.]
[Tables 2 and 3: MSJ, diversity, adversarial success and sentence-level coherence results.]
5 Experiments

In this section, we present the experimental results. Among the different methods for generating summaries, we find in preliminary results that Reconstruction consistently outperforms the rest. We thus only report results from Reconstruction in this section, and return to the analysis of the different summary generation methods in the ablation studies.
WikiText-103 contains 103M training words from 28K articles, with an average length of 3.6K words per article. WikiText-103 can be used to test the ability of modeling long-term dependencies.
The BookCorpus dataset is a more suitable dataset for our purpose, with much longer and more contiguous texts. It contains a total number of roughly 1 billion words and 74 million sentences from 11k books, with an average length of 89K words for each book. The average number of words per sentence is 13. For both datasets, we predict the last 2,000 tokens at test time.
Transformers with segment-level recurrence strategy Dai et al. (2019) naturally constitutes a baseline. The model sequentially generates texts in a word-by-word fashion.
WritingPrompts first predicts a list of keywords or a single prompt, and then generates the full text given the prompt (Fan et al., 2018). Different from Fan et al. (2018), where golden prompts for stories are available, we do not readily have golden prompts. We thus use the extractive strategies described in Section 4.3, i.e., the TF-IDF method to pick a keyword list as the prompt (denoted by WritingPrompts-keyword) and the reconstruction method to select the highest-ranking sentence as the prompt (denoted by WritingPrompts-sentence).
Progressive is the strategy proposed by Tan et al. (2020), which involves multiple stages of prompt generation. Each stage produces a more fine-grained sequence than the preceding stage, which is used as the input to generate the prompt for the next stage. We follow the protocols of Tan et al. (2020) and use the TF-IDF score to obtain golden prompts for each stage. The number of stages is set to 4.
We use the following evaluation metrics to evaluate the quality of different generation models from different perspectives.
Perplexity measures how fluent a piece of generated text is (Dai et al., 2019). We use PPL as the basic evaluation metric in our experiments.
Perplexity cannot measure how diverse the generated text is. We thus use the scaled number of distinct unigrams (Distinct-1) and bigrams (Distinct-2) to demonstrate the degree of diversity (Li et al., 2016) for generated texts.
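Distinct-n is simply the ratio of unique n-grams to total n-grams in the generated text, e.g.:

```python
def distinct_n(tokens, n):
    """Fraction of unique n-grams among all n-grams (Distinct-n)."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

generated = "the cat sat on the mat the cat sat".split()
d1 = distinct_n(generated, 1)  # unique unigrams / total unigrams
d2 = distinct_n(generated, 2)  # unique bigrams / total bigrams
```

Repetitive generations repeat n-grams and thus drive both ratios down, which is why these scores complement perplexity.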
Inspired by adversarial evaluations (Bowman et al., 2016; Kannan and Vinyals, 2017; Li et al., 2017), we use the adversarial success metric, defined as the fraction of cases in which a model successfully fools a trained evaluator into believing that machine-generated texts are written by humans. The evaluator is a binary classification model: at training time, it takes machine-generated texts and original texts as inputs and is trained to discriminate between them. At test time, adversarial success is the value $1 - \text{acc}$, where $\text{acc}$ denotes the accuracy of the trained evaluator in predicting machine-generated texts as machine-generated. Higher values of adversarial success denote better text quality.
MSJ measures the similarity of the $n$-gram frequencies between generated texts and golden texts (Montahaei et al., 2019). We report MSJ-2, -3 and -4.
PPL, MSJ and diversity scores do not reflect the sentence-level coherence of generated texts. We adopt the strategy of Tan et al. (2020), where Next Sentence Prediction (NSP) from a pretrained BERT model (Devlin et al., 2018) is used as a metric to measure the coherence between each sentence and its next sentence. We report average NSP scores over all consecutive sentence pairs within the generated text.
Table 1 shows the perplexity results for different models on the WikiText-103 and BookCorpus datasets. On both datasets, SOE achieves the lowest PPL compared to the baselines Transformer-XL (Dai et al., 2019), WritingPrompts (Fan et al., 2018) and Progressive (Tan et al., 2020). In particular, on WikiText-103 we obtain PPL decreases of 2.8, 1.6 and 1.1 against our implemented Transformer-XL, WritingPrompts-sentence and Progressive respectively, while using the same number of or even fewer parameters. A similar trend is observed on BookCorpus.
Table 2 and Table 3 show the results for MSJ, diversity, adversarial success and sentence-level coherence. As can be seen, WritingPrompts-based models generally outperform the Transformer-XL model, which adopts the word-by-word generation strategy. This validates the superiority of the two-step generation strategy over the naive word-by-word strategy for long-text generation. The Progressive model, which involves multiple steps of generation and expansion, outperforms the one-step WritingPrompts-keyword and WritingPrompts-sentence models, in accord with our expectation. SOE achieves significantly better results than the Vanilla, WritingPrompts and Progressive models in terms of all evaluation metrics, showing that the proposed method produces more fluent, coherent and diverse texts. The consistent performance boosts on all metrics demonstrate the importance of modeling discourse-level dependencies and the necessity of the summary expanding strategy for long-text generation.
Additionally, enhanced by mutual information (MI), we observe additional performance boosts, especially for diversity and adversarial success. This is in accord with our expectation: since mutual information is able to build bidirectional dependencies between the source and the target, models enhanced with mutual information can generate better summaries, and the phenomenon of generic and repetitive generation can be alleviated (Li et al., 2016), leading to more diverse results.
6 Ablation Studies
6.1 The Effect of Segment Length
The size of a segment can be neither too big nor too small: extremely long segments might contain too many aspects or topics for a single summary to cover, in which case the model degenerates into the WritingPrompts model (Fan et al., 2018); for segments that are too short, the summary cannot provide high-level guidance. We thus need to find the sweet spot for the segment length. Figure 2 shows results on the BookCorpus dataset. It is clear from the figure that segments that are too short or too long both lead to inferior performance.
6.2 The Effect of Summary Generation Strategies
It is worthwhile to explore how different summary extraction methods affect the final performance. To this end, we conduct experiments on the BookCorpus dataset using different summary extraction methods, i.e., Random, TextRank, TF-IDF and Reconstruction. Table 4 shows the results. We first compare the PPL for summary generation, where the Reconstruction model achieves the lowest PPL and thus produces summaries that are the easiest to predict given preceding contexts. It is also interesting to see that, across all summary generation strategies, the PPL for summary generation is significantly larger than for text prediction. This is reasonable since (1) generating summaries for the upcoming segment requires more generalization ability; and (2) there are more diverse options for what the next segment should talk about than the local choices for what the next sentence should say. For the final text-generation results, Reconstruction achieves the best scores in terms of PPL, MJ-3 and MJ-4. TextRank and TF-IDF are better than Vanilla. Interestingly, the strategy of using random sentences as summaries performs worse than using no summaries at all, which can be explained by the fact that no guidance is better than incorrect guidance.
[Table 4: Summary PPL, text PPL and MJ-4 for the different summary extraction methods.]
6.3 The Effect of Coherence-based Text Slicing
We replace the coherence-based text slicing strategy with the naive equal slicing strategy to see how this negatively affects performance. On the BookCorpus dataset, we observe an increase in summary generation PPL from 30.4 to 30.9, and a +0.7 increase in token generation PPL from 25.7 to 26.4, which demonstrates the importance of slicing text into coherent segments for generation. It is also worth noting that, even with the naive equal slicing strategy, SOE still performs significantly better than the other baseline models.
6.4 Decoupling The Effects of Summaries
The positive effects of summaries are two-fold: (1) they provide high-level guidance for segment generation; and (2) with far-away segments concisely represented by their summaries, they give the model the ability to consider longer contexts. To quantitatively decouple these two aspects, we conduct the following experiment: at test time, for the computation of the summary and segment probabilities, the model can only access summaries for segments that are already within the token-level context. In other words, only summaries within the 1,156 tokens of preceding context can be fed as inputs. This differs from the original version of SOE, in which $\mathbf{s}^{<k}$ can extend over preceding contexts until the limit of 512 tokens is reached. We do not retrain the model, but add this limitation at test time. On the BookCorpus dataset, this leads to an increase of 0.8 in PPL (25.7 vs. 26.5), and decreases of 0.4 and 0.8 in MJ-3 (43.5 vs. 43.9) and MJ-4 (18.6 vs. 19.4).
6.5 Simplifying Summary Generation

Here we explore simplifications of the summary generation model. In the full model, the current summary is generated based on both previous summaries and segment tokens, i.e., $p(\mathbf{s}^k \mid \mathbf{s}^{<k}, \mathbf{y}^{<k})$. We can simplify this to $p(\mathbf{s}^k \mid \mathbf{s}^{<k})$, where previous segment tokens are not fed as inputs to predict the summary, which significantly decreases computational complexity. On the BookCorpus dataset, we observe an increase of PPL in summary generation from 30.4 to 31.2, which subsequently leads to a +0.9 increase in token generation PPL from 25.7 to 26.6.
6.6 Convergence Speed
Finally, we investigate how quickly different models converge. Results are shown in Figure 3. With the guidance of extracted summaries, SOE converges conspicuously faster: at about 200K training steps it has approximately reached its best result, while the other two models, Vanilla and WritingPrompts, do not converge until 1000K training steps. The WritingPrompts model converges faster than Vanilla because of the high-level guidance from prompts.
7 Conclusion

In this paper, we propose a two-step hierarchical generation strategy for long-text generation: the model first generates the summary for each segment conditioning on previous summaries, and next, each summary is expanded to form the full text segment. The proposed strategy provides high-level guidance for local text generation, and enables high-level discourse dependencies to be captured. Extensive experiments demonstrate that SOE produces long texts with significantly better quality than existing baselines.
- Bao et al. (2020) Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. Unilmv2: Pseudo-masked language models for unified language model pre-training.
- Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.
- Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space.
- Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Cao et al. (2015) Ziqiang Cao, Furu Wei, Li Dong, Sujian Li, and Ming Zhou. 2015. Ranking with recursive neural networks and its application to multi-document summarization. In AAAI, pages 2153–2159.
- Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
- Correia et al. (2019) Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2174–2184, Hong Kong, China. Association for Computational Linguistics.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
- Dauphin et al. (2017) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
- Fang et al. (2015) Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, et al. 2015. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Fu et al. (2020) Yao Fu, Yansong Feng, and John P. Cunningham. 2020. Paraphrase generation with latent bag of words.
- Gehrmann et al. (2018a) Sebastian Gehrmann, Falcon Dai, Henry Elder, and Alexander Rush. 2018a. End-to-end content and plan selection for data-to-text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 46–56, Tilburg University, The Netherlands. Association for Computational Linguistics.
- Gehrmann et al. (2018b) Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018b. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
- Guan et al. (2019) Jian Guan, Yansen Wang, and Minlie Huang. 2019. Story ending generation with incremental encoding and commonsense knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6473–6480.
- Guo et al. (2017) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624.
- Guo et al. (2019) Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1315–1325, Minneapolis, Minnesota. Association for Computational Linguistics.
- Haghighi and Vanderwende (2009) Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370.
- Hua and Wang (2019) Xinyu Hua and Lu Wang. 2019. Sentence-level content planning and style specification for neural text generation.
- Jernite et al. (2017) Yacine Jernite, Samuel R Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.
- Kannan and Vinyals (2017) Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kitaev et al. (2020) Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.
- Li et al. (2020a) Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, and Jianfeng Gao. 2020a. Optimus: Organizing sentences via pre-trained modeling of a latent space. arXiv preprint arXiv:2004.04092.
- Li et al. (2015a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015a. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
- Li and Jurafsky (2016) Jiwei Li and Dan Jurafsky. 2016. Neural net models for open-domain discourse coherence. arXiv preprint arXiv:1606.01545.
- Li et al. (2015b) Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015b. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.
- Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
- Li et al. (2020b) Xiaoya Li, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020b. Sac: Accelerating and structuring self-attention via sparse adaptive connection. arXiv preprint arXiv:2003.09833.
- Liu et al. (2018) Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.
- Liu and Lapata (2019a) Yang Liu and Mirella Lapata. 2019a. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy. Association for Computational Linguistics.
- Liu and Lapata (2019b) Yang Liu and Mirella Lapata. 2019b. Text summarization with pretrained encoders.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- Mao et al. (2019) Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian McAuley, and Garrison W Cottrell. 2019. Improving neural story generation by targeted common sense grounding. arXiv preprint arXiv:1908.09451.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
- Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, pages 404–411.
- Montahaei et al. (2019) Ehsan Montahaei, Danial Alihosseini, and Mahdieh Soleymani Baghshah. 2019. Jointly measuring diversity and quality in text generation models. arXiv preprint arXiv:1904.03971.
- Moryossef et al. (2019) Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
- Narayan et al. (2018) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
- Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
- Ramos (2003) J. Ramos. 2003. Using tf-idf to determine word relevance in document queries.
- Ren et al. (2017) Pengjie Ren, Zhumin Chen, Zhaochun Ren, Furu Wei, Jun Ma, and Maarten de Rijke. 2017. Leveraging contextual sentence relations for extractive summarization using a neural attention model. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 95–104.
- Sha et al. (2017) Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, and Zhifang Sui. 2017. Order-planning neural text generation from structured data. arXiv preprint arXiv:1709.00155.
- Shen et al. (2019) Dinghan Shen, Asli Celikyilmaz, Yizhe Zhang, Liqun Chen, Xin Wang, Jianfeng Gao, and Lawrence Carin. 2019. Towards generating long and coherent text with multi-level latent variable models. arXiv preprint arXiv:1902.00154.
- Shen et al. (2020) Tianxiao Shen, Victor Quach, Regina Barzilay, and Tommi Jaakkola. 2020. Blank language models. arXiv preprint arXiv:2002.03079.
- Sukhbaatar et al. (2019) Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335, Florence, Italy. Association for Computational Linguistics.
- Tan et al. (2020) Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric P. Xing, and Zhiting Hu. 2020. Progressive generation of long text.
- Tay et al. (2020) Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
- Wiseman et al. (2017) Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in data-to-document generation.
- Wiseman et al. (2019) Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2019. Learning neural templates for text generation.
- Xie et al. (2017) Ziang Xie, Sida I Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.
- Xu et al. (2018) Jingjing Xu, Xuancheng Ren, Yi Zhang, Qi Zeng, Xiaoyan Cai, and Xu Sun. 2018. A skeleton-based model for promoting coherence among sentences in narrative story generation. arXiv preprint arXiv:1808.06945.
- Yan et al. (2020) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5753–5763.
- Yao et al. (2019) Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling.
- Ye et al. (2019) Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. 2019. Bp-transformer: Modelling long-range context via binary partitioning. arXiv preprint arXiv:1911.04070.
- Yu et al. (2016) Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4584–4593.
- Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9054–9065.
- Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
- Zhong et al. (2020) Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. arXiv preprint arXiv:2004.08795.
- Zhou et al. (2018a) Deyu Zhou, Linsen Guo, and Yulan He. 2018a. Neural storyline extraction model for storyline generation from news articles. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1727–1736, New Orleans, Louisiana. Association for Computational Linguistics.
- Zhou et al. (2018b) Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018b. Neural document summarization by jointly learning to score and select sentences. arXiv preprint arXiv:1807.02305.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.