An ideal summarizer should provide the flexibility to generate summaries with varying proportions of reused text. Such summaries are required to cater to diverse usage scenarios. E.g., system abstracts may not contain excessive copied content without proper permission: under EU standards, 11 or more consecutive words are considered the author’s intellectual creation and are thus protected by copyright law. Without proper control over copying, commercial summarizers can be held liable for copyright infringement. Moreover, system abstracts with an appropriate amount of copied content are more desirable than highly abstractive ones, as they are less likely to suffer from content hallucination and better at preserving the meaning of the original text.
To date, it remains poorly understood whether modern abstractive summarization can provide the needed flexibility to control copying and generate diverse abstracts. Abstractive summarizers using encoder-decoder architectures can either copy words from the source text or generate new words unseen in the source [41, 5, 11]. Recent work further attempted to increase the use of unseen words in summaries [46, 19]. However, in all cases, the summarizers are trained on single-reference abstracts to produce single outputs with a fixed (corpus-level) copy rate. It can take multiple reference abstracts, created for the same input text with varying degrees of copying, to teach the system to generate abstracts with similar amounts of copying. However, not only is it time-consuming and costly to create human abstracts, but this is also unlikely to be how humans learn to exercise control over copying. Without an understanding of the copy mechanism of neural abstractive models, producing abstracts with varying degrees of copying can prove daunting at best and a “mission impossible” at worst.
|Question: What is the most probable next word?|
|Hint: the word is seen in the source text.|
|A 23-month-old toddler who was reportedly abducted in|
|Pennsylvania has been found dead, a district attorney said.|
|Missing Pennsylvania ___?___|
|Missing Pennsylvania toddler ___?___|
|Missing Pennsylvania toddler found ___?___|
|Reference Summary: Missing Pennsylvania toddler found dead|
|Question: What is the most probable next word?|
|Hint: the word is unseen in the source text.|
|Rescuers have suspended their search off the coast of Santa|
|Cruz Island for passengers who were trapped aboard the|
|Conception when the diving boat caught fire and sank.|
|Search has ___?___|
|Search has been suspended ___?___|
|Search has been suspended in the ___?___|
|Search has been suspended in the dive boat fire off ___?___|
|Reference Summary: Search has been suspended in the dive|
|boat fire off California coast|
In this paper, our goal is to generate abstractive summaries with varying amounts of reused text by developing a general framework that learns from single reference summaries. We define copy rate as the percentage of summary $n$-grams appearing in the source text. A high copy rate suggests that the summary is generated largely by copying verbatim from the source text. Conversely, a low copy rate indicates that more text shortening, word reordering, paraphrasing and abstraction are involved in the generation process. We argue that abstractive summarizers need not be trained on every word of reference summaries; rather, they ought to separate the prediction of summary words that are seen in the source text from those unseen. The underlying principle is simple and intuitively appealing. If a summarizer is trained to predict only seen words, it learns to copy them from the source text, producing extractive summaries. As more unseen words are used for training, the summarizer gradually transforms from copying only to both copying and generating new words not present in the source text. By employing a “mix-and-match” strategy, we enable an abstractive summarizer to generate summaries with more, or less, copying.
We frame abstractive summarization as a language modeling task and present a decoder-only framework for it. It uses the same Transformer architecture to both encode the source text and decode the summary. All network parameters are warm-started using pretrained deep representations. In contrast, in a typical encoder-decoder architecture, only parameters of the encoder and decoder can be warm-started but not those of the attention/copy mechanism. Further, our method allows for control over copying during both training and decoding stages of the neural model. We experiment with varying proportions of seen and unseen summary words in training to teach the summarizer to favor, or not to favor, copying. At decoding time, we compare different search strategies (best-first search vs. beam search) and reranking methods to encourage system abstracts to use wording similar to the original. Although only single reference summaries are available in benchmark evaluations, we are able to evaluate summary quality along multiple dimensions, using automatic metrics based on lexical similarity (ROUGE; Lin, 2004) and semantic similarity (BERTScore; Zhang et al., 2019), and through human assessment of grammaticality, informativeness, and whether system abstracts remain true-to-original. Our method demonstrates strong performance, either outperforming or performing on par with the best published results. The research contributions are summarized as follows:
we introduce a new summarization method that provides the needed flexibility to produce a spectrum of summaries for the same input and with a varying amount of copied content. Such summaries are highly desirable to cater to diverse real-world scenarios;
Footnote 1: We make our implementation and models publicly available at
our method emphasizes in-depth analysis of the copy behavior in summarization. It frames abstractive summarization as a language modeling task and exploits multiple strategies at training and decoding stages to generate diverse summary hypotheses. We show competitive results and demonstrate the effectiveness of the proposed method in exercising control over copying.
The significance of controlling the copying behavior in summarization should not be underestimated. Human editors often reuse the text in the original article to produce a summary, but they can adjust the degree of copying to produce a wide spectrum of summaries. E.g., human-written summaries for newswire [33, 16], meetings [3, 26], scientific articles and online forums contain varying amounts of reused text. Moreover, the degree of copying can have a direct impact on scores of automatic evaluation metrics. ROUGE was reported to favor summaries that use the same wording as the original. If reference summaries are made by copying, system summaries with less copying and perhaps more abstraction, compression, and paraphrasing will be disadvantaged when compared against other system summaries with substantial copying. There is thus an urgent need for, and this paper makes a first attempt to present, a summarization framework that is capable of producing summaries with varying amounts of reused text.
To date, various extractive and abstractive summarization techniques have been investigated. However, rarely has one technique been utilized to produce both extractive and abstractive summaries for any given text. Extractive summarization selects important and non-redundant sentences from the original document(s). The sentences can be optionally compressed to remove inessential phrases, leading to compressive summaries [28, 21, 45, 10, 9]. Abstractive summarization distills the source text into its essential meanings, then performs language generation from the representation to produce an abstract [1, 25, 23, 15]. These systems rarely provide the flexibility for an end user to indicate the desired amount of reused text in the summary. To eliminate the need to develop multiple systems for extractive and abstractive summarization, we attempt to introduce control into the copying behavior of a neural abstractive summarization system.
Neural abstractive summarization has demonstrated considerable recent success. It often utilizes an encoder-decoder architecture [40, 41, 5, 20, 4]; more recently, studies have attempted to use deep contextualized representations such as BERT and ELMo to give it a further boost. An encoder network converts the source text to a fixed-length vector, conditioned on which a decoder network unrolls the summary one word at a time. While it is tempting to use pretrained deep representations to “warm-start” the encoder/decoder, Khandelwal et al. (2019) find that results can be less satisfying as the attention weights are still not pretrained. In this paper we adopt a decoder-only framework where the same Transformer architecture is used for both encoding the source text and decoding the summary.
Copying can help produce unseen words. It was originally introduced to the seq2seq framework for neural machine translation and later for abstractive summarization. Particularly, Knowles and Koehn (2018) examine the influence of context and sub-words on the copying behavior of an NMT system. To suppress copying, Kryściński et al. (2018) introduce a novelty metric which is to be optimized during policy learning, and Weber et al. (2018) modify the scoring function of the summary sequence at decoding time. Fan, Grangier, and Auli (2018) attempt to control summary length, entities, source style and portions, but they do not address copying. In this paper, we focus on better understanding the copying behavior of a summarization system and present effective mechanisms to control the amount of reused text. We discuss what it takes for a summarizer to copy a word without an explicit copying mechanism, and how we may control the behavior to produce summaries with more, or less, copying. In the following we describe our model in detail.
We frame abstractive summarization as a language modeling task and present a decoder-only framework for it. It uses the same Transformer architecture to both encode the source text and decode the summary. Let $\mathbf{x}$ be a sequence of source tokens and $\mathbf{y}$ be a sequence of summary tokens. Our goal is to model the conditional probability distribution $P(\mathbf{y}|\mathbf{x})$ using a Transformer-inspired architecture.
We use byte-pair-encoding (BPE; Sennrich et al., 2016) for tokenization, with a vocabulary size of $V$ tokens. BPE has been shown to improve the robustness and accuracy of neural model training. We use parameter tying, allowing the same token embeddings to be used in both the input layer and final softmax layer of the Transformer model. Our method also includes three special tokens: Start, End, and Mask, which respectively denote the start/end of a sequence and a “masked out” token. An illustration of our system architecture is provided in Figure 1.
We construct the source sequence by prepending ‘Start’ and appending ‘End’ to the input text. E.g., $\mathbf{x}$ = START Elizabeth was taken to the hospital END, illustrated in Figure 1. Similarly, the target sequence is constructed by appending ‘End’ to the summary. E.g., $\mathbf{y}$ = Elizabeth was hospitalized END. Our system learns to predict the target sequence one word at a time until the ‘End’ token has been reached. The conditional probability is shown in Eq. (1-2).
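As a hedged sketch, a standard left-to-right factorization consistent with the description above can be written as follows. The indexing, the hidden state $\mathbf{h}_j$ used to predict $y_j$, and the tied token-embedding matrix $\mathbf{W}_e$ are assumptions made for illustration and may differ from the exact form of Eq. (1-2).

```latex
P(\mathbf{y} \mid \mathbf{x}) = \prod_{j=1}^{|\mathbf{y}|} P(y_j \mid \mathbf{y}_{<j}, \mathbf{x}),
\qquad
P(y_j \mid \mathbf{y}_{<j}, \mathbf{x}) = \operatorname{softmax}(\mathbf{W}_e \mathbf{h}_j)_{y_j}
```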
However, at training time, we argue that the system need not be trained to predict every word of the target sequences; a selected collection might suffice. Using selected target tokens provides important potential to steer the system to be more extractive than abstractive, or vice versa. We divide all tokens in the sequence into three categories: (a) summary tokens seen in the source text, (b) summary tokens unseen in the source, and (c) source tokens, with the expectation that training the system to predict only seen summary tokens may reinforce the copying behavior, unseen tokens allow for generation, and source words enable the system to learn better token representations. By mixing and matching target tokens from the three categories, we enable a summarizer to generate summaries with more, or less, copying.
We randomly sample a set of tokens from each category using a Bernoulli distribution with probability $p$. The value of $p$ varies by category and more analysis is provided in the experiments section. Let $m_i$ denote whether the $i$-th token of the concatenated source-and-summary sequence $\mathbf{z}$ is selected; it is drawn as $m_i \sim \mathrm{Bernoulli}(p)$.
A selected token is replaced by ‘Mask’ 80% of the time, meaning that the token has been ‘masked out’ from the sequence. For 10% of the time, it is replaced by a random token from the vocabulary. It remains unchanged for the final 10%. In the following, we use $\hat{\mathbf{z}}$ to represent the masked sequence, whose selected tokens are to be predicted during model training. Our loss term is the negative log-likelihood of the selected tokens given the masked sequence.
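The selection and corruption procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the `rates` dictionary, and the toy vocabulary are hypothetical, tokens are matched by exact string identity, and special tokens are treated like ordinary tokens (so a summary-final END counts as "seen" because END also closes the source).

```python
import random

MASK = "[MASK]"  # placeholder string for the 'Mask' token (assumed spelling)

def categorize(source, summary):
    """Label each summary token 'seen' or 'unseen' relative to the source."""
    source_vocab = set(source)
    return ["seen" if tok in source_vocab else "unseen" for tok in summary]

def select_and_mask(source, summary, rates, vocab, rng):
    """Return (masked_sequence, selected_flags) for source + summary.

    rates maps each category ('source', 'seen', 'unseen') to the Bernoulli
    probability of selecting a token of that category for prediction.
    """
    tokens = list(source) + list(summary)
    categories = ["source"] * len(source) + categorize(source, summary)
    masked, selected = [], []
    for tok, cat in zip(tokens, categories):
        chosen = rng.random() < rates[cat]   # Bernoulli draw for this token
        selected.append(chosen)
        if not chosen:
            masked.append(tok)               # unselected tokens stay as they are
            continue
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)              # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.choice(vocab)) # 10%: replace with a random token
        else:
            masked.append(tok)               # 10%: keep the original token
    return masked, selected

source = "START elizabeth was taken to the hospital END".split()
summary = "elizabeth was hospitalized END".split()
rates = {"source": 0.1, "seen": 0.9, "unseen": 0.9}
masked, selected = select_and_mask(source, summary, rates, sorted(set(source)), random.Random(0))
```

Setting the "unseen" rate to zero would correspond to training on seen summary tokens only, which the text expects to reinforce copying.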
It is important to note that we apply a binary mask to the self-attention mechanism of the Transformer architecture to allow (a) a source token to attend to all source tokens including itself, and (b) a summary token to attend to all source tokens, summary tokens prior to it, as well as the current token (‘Mask’) in order to learn deep contextualized representations. The formulation is similar to Dong et al. (2019). Our binary mask $\mathbf{M}^{att}$ is defined by Eq. (5). It is a square matrix whose $i$-th row represents the mask of the $i$-th token of $\mathbf{z}$. If it is a source token ($i \le |\mathbf{x}|$), the mask allows it to attend to all source tokens ($\mathbf{M}^{att}_{i,j} = 1$ for $j \le |\mathbf{x}|$). If it is a summary token ($i > |\mathbf{x}|$), it can attend to all tokens prior to it as well as the current token ($\mathbf{M}^{att}_{i,j} = 1$ for $j \le i$).
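The masking rule above can be illustrated with a short Python sketch. The function name and the 0-based indexing are assumptions for illustration; a real implementation would typically build this as a tensor rather than nested lists.

```python
def build_attention_mask(num_source, num_summary):
    """Build a square 0/1 mask; entry [i][j] == 1 means token i may attend to token j.

    Source tokens (rows 0 .. num_source-1) attend to all source tokens.
    Summary tokens attend to every token up to and including themselves.
    """
    n = num_source + num_summary
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_source:
                mask[i][j] = 1 if j < num_source else 0
            else:
                mask[i][j] = 1 if j <= i else 0
    return mask

# Two source tokens followed by two summary tokens.
for row in build_attention_mask(2, 2):
    print(row)
```

The source block is fully bidirectional, while the summary block is lower triangular, so a summary token never sees tokens that come after it.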
The input of the Transformer consists of embedding matrices: $\mathbf{W}_e$, $\mathbf{W}_p$, and $\mathbf{W}_s$ respectively denote the token, position, and segment embeddings. $\mathbf{Z}$, $\mathbf{P}$, and $\mathbf{S}$ are one-hot matrices used to retrieve embeddings for tokens in sequence $\hat{\mathbf{z}}$. The token, position, and segment embeddings for the $i$-th token are then added up element-wise.
Our Transformer model takes as input the embeddings and the binary mask to produce a sequence of deep contextualized representations $\mathbf{h}_1, \ldots, \mathbf{h}_{|\hat{\mathbf{z}}|}$. Particularly, $\mathbf{h}_i$ is used to predict the $i$-th ‘missing’ token in the sequence. We use parameter tying, allowing the same token embeddings to be used in both the input layer (Eq. (6)) and final softmax layer of the model (Eq. (8)).
Given a trained model and an input text, the decoding stage searches for a summary sequence that maximizes $P(\mathbf{y}|\mathbf{x})$. We present two search algorithms for this stage.
Best-first search uses a priority heap to keep partial summaries, which are scored according to a heuristic function. At each iteration, the search algorithm takes the highest-scoring partial summary, extends it by one word, then pushes new summary sequences back to the priority heap. We generate new summary sequences by selecting the $k$ words that give the highest probability of $P(y_j|\mathbf{y}_{<j},\mathbf{x})$ (Eq. (9)), then appending each of these words to the partial summary. If the highest-scoring summary in the heap concludes with an end-of-sentence symbol, it is moved to a pool of “completed summaries” for later reranking. The heap thus keeps a collection of partial summaries of varying lengths, which are visited according to their scores. (Footnote 2: The size of the priority heap is capped at 1e5. If the heap has reached capacity and a new summary sequence needs to be pushed in, the lowest-scoring one will be removed from the heap.) We provide an illustration of our best-first search algorithm in Algorithm 1.
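The best-first procedure described above can be sketched in Python with the standard `heapq` module. This is a minimal illustration, not the paper's Algorithm 1: `toy_model`, its probability table, and the parameter names are hypothetical stand-ins for the trained model, and the score is the plain sum of log-probabilities.

```python
import heapq
import math

def toy_model(prefix):
    """Hypothetical next-word distribution standing in for the trained model."""
    table = {
        (): {"search": 0.6, "rescuers": 0.4},
        ("search",): {"suspended": 0.7, "END": 0.3},
        ("search", "suspended"): {"END": 1.0},
        ("rescuers",): {"END": 1.0},
    }
    return table[prefix]

def best_first_search(next_probs, k=2, max_completed=3, heap_cap=100000, end="END"):
    """Pop the highest-scoring partial summary, extend it by its k best words, repeat."""
    # heapq is a min-heap, so entries store the NEGATIVE log-likelihood.
    heap = [(0.0, ())]
    completed = []
    while heap and len(completed) < max_completed:
        neg_score, prefix = heapq.heappop(heap)
        if prefix and prefix[-1] == end:
            completed.append((-neg_score, prefix))  # move to the completed pool
            continue
        probs = next_probs(prefix)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        for token, p in top:
            heapq.heappush(heap, (neg_score - math.log(p), prefix + (token,)))
        if len(heap) > heap_cap:
            # Drop the lowest-scoring entries once the cap is exceeded.
            heap = heapq.nsmallest(heap_cap, heap)
            heapq.heapify(heap)
    return completed

for score, summary in best_first_search(toy_model, k=2, max_completed=2):
    print(round(score, 4), " ".join(summary))
```

Because partial summaries of different lengths share one heap, a long but probable hypothesis can be expanded before a short, less probable one.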
In contrast, beam search is essentially breadth-first search. It maintains a beam of size $k$ at any time step, containing partial summaries of the same length. For each partial summary, the algorithm extends it by one word, producing $k$ new sequences by appending each of the $k$ words that give the highest probability of $P(y_j|\mathbf{y}_{<j},\mathbf{x})$ to the partial summary. This process generates a total of $k \times k$ new summary sequences by extending each of the $k$ partial summaries. The algorithm then selects the k-best candidates, which are put in the beam for the next iteration. If a candidate summary concludes with the end-of-sentence symbol, it is moved to the pool of “completed summaries”.
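For comparison, the beam search just described can be sketched as follows. As before, `toy_model` and the parameter names are hypothetical, and the scoring is the plain sum of log-probabilities; the sketch only illustrates the control flow.

```python
import math

def toy_model(prefix):
    """Hypothetical next-word distribution standing in for the trained model."""
    table = {
        (): {"search": 0.6, "rescuers": 0.4},
        ("search",): {"suspended": 0.7, "END": 0.3},
        ("search", "suspended"): {"END": 1.0},
        ("rescuers",): {"END": 1.0},
    }
    return table[prefix]

def beam_search(next_probs, k=2, max_len=10, end="END"):
    """Extend all partial summaries in lockstep, keeping the k best per step."""
    beam = [(0.0, ())]          # (log-likelihood, partial summary)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            probs = next_probs(prefix)
            top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
            candidates.extend((score + math.log(p), prefix + (tok,)) for tok, p in top)
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, seq in candidates[:k]:
            if seq[-1] == end:
                completed.append((score, seq))  # finished: move to the pool
            else:
                beam.append((score, seq))       # unfinished: keep in the beam
        if not beam:
            break
    return completed

for score, summary in beam_search(toy_model, k=2):
    print(round(score, 4), " ".join(summary))
```

Unlike best-first search, every hypothesis in the beam has the same length at each step, so completed summaries are discovered in order of length rather than score.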
Both best-first search and beam search employ the same scoring function that scores a candidate summary by the sum of log-likelihoods (Eq. (9)). However, the two differ in their search strategies—beam search visits candidate summaries according to the summary length, whereas best-first search favors candidates attaining higher scores.
We compute $P(y_j|\mathbf{y}_{<j},\mathbf{x})$ using our trained CopyTrans model. Importantly, the ‘Mask’ token is used as a prompt for the model to predict the next word. E.g., “START Elizabeth was taken to the hospital END Elizabeth was MASK” is a concatenation of the source text, partial summary and ‘Mask’ token; it is fed to the CopyTrans model where the contextualized representation of ‘Mask’ is used as input to a softmax layer to predict the next token. In experimental results, we demonstrate that a dynamic, contextualized representation of ‘Mask’ performs reliably at predicting the next token. This represents an important distinction from shifting the target sequence by one position for prediction, which is common in encoder-decoder models.
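The construction of the decoding input can be shown with a few lines of Python. The function name is hypothetical, and the sketch works on whitespace tokens rather than the BPE units the model actually uses.

```python
def build_decoding_input(source_tokens, partial_summary,
                         start="START", end="END", mask="MASK"):
    """Concatenate source text, partial summary, and a trailing mask prompt."""
    return [start] + list(source_tokens) + [end] + list(partial_summary) + [mask]

tokens = build_decoding_input("Elizabeth was taken to the hospital".split(),
                              ["Elizabeth", "was"])
print(" ".join(tokens))
```

At each decoding step, the predicted word replaces the trailing mask and a fresh mask is appended for the next prediction.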
Reranking A reranking step is necessary, in part because candidate summaries decoded using beam search or best-first search do not always meet the length requirement. E.g., an overly short summary containing only two words is rarely an informative summary, even though it may give a high log-likelihood score. Below we compare three reranking strategies to offset this limitation.
Length normalization is adopted by See et al. (2017) and it is frequently used in many other systems. It divides the original log-likelihood score, denoted as $\mathcal{S}(\mathbf{x}, \mathbf{y})$, by the total number of tokens in the summary to effectively prevent a long summary from being penalized.
BP-norm introduces a brevity penalty to summaries that do not meet the length expectation. As illustrated in Eq. (11), BP-norm performs length normalization, then adds a penalty term to the scoring function. We modify the original penalty term to make it favor summaries using more copying. In Eq. (12), the penalty depends on the copy rate, i.e., the percentage of summary tokens seen in the source text, scaled by a constant factor. When the copy rate is set to 1, the penalty drops to 0. Yang, Huang, and Ma (2018) provide a proof showing that this penalty term can directly translate to a coefficient multiplied to the log-likelihood score (Eq. (13)).
Soft-bounded word reward (SBWR) is a method we introduce that assigns a per-word reward to the summary. If the decoded summary is longer than expected ($|\mathbf{y}| > \mathcal{L}_{pred}$), the added words receive a diminishing reward. If the summary is shorter ($|\mathbf{y}| \le \mathcal{L}_{pred}$), every word of it will receive a reward. The method thus promotes summaries of similar length to the predicted $\mathcal{L}_{pred}$. A sigmoid function is used to smooth the reward values. A coefficient scales the total reward and is tuned on the validation data.
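To make the two simplest rerankers concrete, the sketch below implements length normalization and one plausible form of SBWR. The SBWR formula here is an assumption: the per-word reward is taken to be a sigmoid of (predicted length minus word position), scaled by a coefficient `r`, which matches the verbal description above but may not match the original equation term by term. All function and parameter names are hypothetical.

```python
import math

def length_normalized(log_likelihood, summary_len):
    """Divide the summed log-likelihood by the number of summary tokens."""
    return log_likelihood / summary_len

def sbwr(log_likelihood, summary_len, predicted_len, r=1.0):
    """Add a soft-bounded per-word reward to the log-likelihood (assumed form).

    Word i (1-based) earns sigmoid(predicted_len - i): close to 1 well before
    the predicted length, 0.5 at the predicted length, and fading after it.
    """
    reward = sum(1.0 / (1.0 + math.exp(-(predicted_len - i)))
                 for i in range(1, summary_len + 1))
    return log_likelihood + r * reward

# Two hypothetical candidates with the same predicted length of 8 words.
print(length_normalized(-6.0, 3))
print(sbwr(-6.0, 3, predicted_len=8), sbwr(-9.0, 8, predicted_len=8))
```

Under this assumed form, a very short candidate forgoes most of the available reward, which counteracts the bias of raw log-likelihood toward short outputs.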
We obtain the predicted length using greedy search, then empirically offset the predicted length by three words according to the validation set. In all cases, we force the decoder to never output the same trigram more than once during testing, which is a common practice to avoid repetitions.
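The trigram constraint can be checked with a small helper like the one below. This is an illustrative sketch with a hypothetical function name; a decoder would call it for each candidate word and discard candidates for which it returns `True`.

```python
def repeats_trigram(prefix, candidate):
    """Return True if appending `candidate` would repeat a trigram already in `prefix`."""
    tokens = list(prefix) + [candidate]
    if len(tokens) < 3:
        return False
    new_trigram = tuple(tokens[-3:])
    # All trigrams that end before the newly appended token.
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 3)}
    return new_trigram in seen

print(repeats_trigram("a b c a b".split(), "c"))  # would repeat ("a", "b", "c")
print(repeats_trigram("a b c a b".split(), "d"))
```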
|Source Text: Premier Chang Chun-hsiung said Thursday he is|
|enraged and saddened by the snail-paced progress of the|
|reconstruction of areas hardest hit by a disastrous earthquake|
|that rattled Taiwan on Sept. 21 , 1999 .|
|1: premier expresses condolences for taiwan quake victims|
|2: premier angry over reconstruction of quake - hit areas|
|3: premier enraged and saddened by earthquake reconstruction|
|4: premier enraged by slow progress of post-quake reconstruction|
|Source Text: A blue-ribbon panel of experts said on Wednesday|
|that German economic growth will grind to a halt next year ,|
|raising doubts about Berlin ’s plans to shield Europe ’s biggest|
|economy from the global turmoil .|
|1: german experts raise doubts about economic recovery|
|2: experts say german growth will grind to a halt next year|
|3: german experts to grind to halt next year|
|4: german economy will grind to halt in 2009 , say experts|
Data and Evaluation Metrics
We evaluate our proposed method on the sentence summarization task. The goal is to condense a lengthy source sentence to a title-like summary. Compared to single-document summarization, sentence summarization deals less with content selection; its ground-truth summaries also contain more paraphrasing and abstraction. We conduct experiments on the Gigaword and Newsroom datasets. Gigaword articles were collected during 1995-2010 and Newsroom spans the range of 1998-2017. We pair the first sentence of each article with its title to form an instance. The train/valid/test splits contain 4 million/10k/1951 instances for Gigaword and 199k/21k/21k instances for Newsroom. We experiment with both datasets to understand not only the copying behavior, but also domain adaptation effects for various models. Although only single reference summaries are available in benchmark evaluations, we are able to evaluate summary quality along multiple dimensions, using automatic metrics based on lexical similarity (ROUGE; Lin, 2004) and semantic similarity (BERTScore; Zhang et al., 2019), and through human assessment of grammaticality, informativeness, and whether system abstracts remain true-to-original.
We initialize the model parameters using the pretrained BERT-Base (uncased) model. The model is fine-tuned on the training split of the Gigaword (or Newsroom) dataset for abstractive summarization. Our model uses a 12-layer Transformer architecture with a hidden state size of 768 and 12 attention heads. We use the Adam optimizer. The learning rate is set to 4e-5 and is halved whenever the validation loss does not change after 40,000 training steps. We apply weight decay to regular layers and no weight decay to dropout and layer-normalization. The sampling rate $p$ is set to 0.1 for source words and 0.9 for summary words, both seen and unseen. Each model is fine-tuned for 6 epochs; an epoch takes about 5 hours on a Tesla V100 GPU. Our batch size is set to 32.
|Multi-Task w/ Entailment||32.75||15.35||30.82||–|
Control over copying Could we bias a summarizer to produce summaries that are more extractive than abstractive, or vice versa? If the summarizer is trained solely on summary words seen in the source text, will it only learn to copy words during testing but not generate new words? We seek to answer these questions in this section. Particularly, we divide all tokens selected for training into three categories: (a) summary tokens seen in the source text, (b) summary tokens unseen in the source, and (c) source tokens, with the expectation that training the system to predict only seen summary tokens may reinforce the copying behavior, unseen tokens allow for generation, and source words enable the system to learn richer representations. By mixing and matching tokens, we enable a summarizer to copy more, or less.
We analyze the copy rate of various summarization models in Table 3. Copy rate is defined as the percentage of summary $n$-grams appearing in the source text. We set $n$=1/2/3/4 and also report their average. A high copy rate suggests that the summary is generated largely by copying verbatim from the source text. We experiment with selecting varying amounts of seen summary tokens, unseen summary tokens, and source tokens for training, where the number of circles shown in Table 3 is proportional to the number of tokens used in computing the loss term. All summaries in Table 3 are decoded using beam search (k=5) without reranking.
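The copy rate defined above is straightforward to compute. The sketch below is a minimal illustration with hypothetical function names; it operates on lowercased whitespace tokens and counts every summary $n$-gram occurrence, which are assumptions, since tokenization and duplicate handling are not specified here.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def copy_rate(source, summary, n):
    """Fraction of summary n-grams that also appear in the source."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    source_ngrams = set(ngrams(source, n))
    return sum(g in source_ngrams for g in summary_ngrams) / len(summary_ngrams)

def average_copy_rate(source, summary, orders=(1, 2, 3, 4)):
    """Average the copy rate over several n-gram orders."""
    return sum(copy_rate(source, summary, n) for n in orders) / len(orders)

source = ("a 23-month-old toddler who was reportedly abducted in "
          "pennsylvania has been found dead").split()
summary = "missing pennsylvania toddler found dead".split()
for n in (1, 2, 3, 4):
    print(n, copy_rate(source, summary, n))
print("average", average_copy_rate(source, summary))
```

In this example four of the five summary words occur in the source, yet only one bigram ("found dead") is copied verbatim, which shows why higher-order $n$-grams are the more telling indicator of verbatim copying.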
Our findings suggest that the factor that makes the most impact on the copying behavior of a summarizer is the proportion of seen and unseen summary words used for training the model. If the summarizer is trained on purely seen words (case a. in Table 3), it only reuses source words during testing, even though there is nothing to prevent the system from generating new words. The 1-gram copy rate for case a. is about 99% for both datasets, with the minor gap due to tokenization discrepancies. As more unseen words are used for training, the summarizer gradually transforms from copying only to both copying and generating new words not present in the source text. We observe that the ratio of seen vs. unseen words in ground-truth summaries is about 2:1 in both datasets, and Newsroom is slightly more extractive than Gigaword. Our analysis reveals that it is important to maintain a similar ratio during training in order to achieve high ROUGE scores. Pure extracts do not attain high ROUGE scores, as ground-truth summaries themselves are abstracts. Our analysis further suggests that training on source words has little impact on the copying behavior of the system, but it improves representation learning and has led to consistently improved ROUGE-2 F-scores.
System comparison Table 4 shows results on benchmark summarization data containing 1951 testing instances from Gigaword. We contrast our system with summarization baselines developed in recent years. They include lvt5k-1sent, Multi-Task w/ Entailment, SEASS (Zhou et al., 2017), DRGD, EntailGen+QuesGen, PG Networks, Struct+2Way+Relation, R3Sum, and BiSET. Output summaries from the last four systems were graciously provided to us by the authors. We evaluate summary quality using two automatic metrics: ROUGE (Lin, 2004; run with options “-c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a -m”), which measures n-gram overlap between system and reference summaries, and BERTScore (Zhang et al., 2019), which quantifies their semantic similarity using BERT-based contextualized representations.
Results show that our system achieves competitive performance, surpassing strong systems having reported results on this dataset, as judged by both metrics. These results demonstrate the effectiveness of our Transformer-based decoder-only architecture for abstractive summarization. We observe that using beam search with reranking yields the highest results (using case g. in Table 3 for training). Both BP-Norm and SBWR appear to be outstanding reranking methods, better than length normalization. Our observation also suggests that best-first search and beam search can produce similar outcomes, even though the two differ in their search strategies, with beam search visiting candidates according to summary length and best-first search favoring candidates having high log-likelihood scores. We suggest that future work explore other search methods such as A* search.
Domain adaptation We investigate the effect of domain adaptation by training the model on Gigaword then testing it on the Newsroom test set. Results are reported in Table 5. Not surprisingly, there is a performance degradation when testing the model in a cross-domain setting. We observe that the model with more copying (pure-extract, case e.) seems to degrade more gracefully than its counterpart (best-abstract, case f.), with a smaller performance gap in cross-domain settings. Both of our models perform competitively compared to other baseline methods.
To thoroughly analyze the quality of summaries, we ask human annotators to assess system outputs along three dimensions, including informativeness (Has the summary covered important content of the source text?), grammaticality (Is the summary sentence grammatically correct?), and truthfulness (Has the summary successfully preserved the meaning of the original text?). Both system and human summaries are scored according to these criteria using a Likert scale from 1 (worst) to 5 (best). We compare variants of our method generating (a) pure extracts (case e.) and (b) best abstracts (case g.), baselines of (c) PG networks, (d) R3Sum, (e) BiSET, and (f) human abstracts. Following prior work, we perform Best-Worst Scaling where a human selects the best and worst summary among six candidates. The final rating of the system is computed as the percentage of times it was selected as the best minus that of the worst. We sample 200 instances from the Gigaword test set for evaluation. Each instance was assessed by five human evaluators from Amazon Mechanical Turk, where low-quality annotations are manually removed. The results are presented in Table 6. We observe that human summaries (article titles) are imperfect. They can contain details that are nonexistent in the source (see Table 2), although they provide a means for researchers to train neural models without re-annotating reference summaries. In contrast, both of our systems perform slightly but consistently better than other baselines.
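The Best-Worst Scaling rating described above reduces to simple counting. The sketch below is illustrative only: the function name, the judgment format, and the system labels are hypothetical, and the judgments are made-up toy data rather than results from the study.

```python
from collections import Counter

def best_worst_scores(judgments, systems):
    """Rating = fraction of times chosen best minus fraction of times chosen worst.

    Each judgment is a dict with the keys 'best' and 'worst' naming one system each.
    Assumes at least one judgment is provided.
    """
    best = Counter(j["best"] for j in judgments)
    worst = Counter(j["worst"] for j in judgments)
    total = len(judgments)
    return {s: (best[s] - worst[s]) / total for s in systems}

# Toy judgments over three hypothetical systems.
judgments = [
    {"best": "A", "worst": "B"},
    {"best": "A", "worst": "C"},
    {"best": "B", "worst": "C"},
    {"best": "A", "worst": "C"},
]
print(best_worst_scores(judgments, ["A", "B", "C"]))
```

The resulting ratings lie in the range from -1 (always worst) to 1 (always best), and a system that is never singled out scores 0.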
In this paper we present a Transformer-based, decoder-only framework to generate summaries with more, or less, copying. The proposed method can be used to generate both extractive and abstractive summaries. Our method emphasizes in-depth analysis of the copy behavior in summarization. It exploits multiple strategies at training and decoding stages to generate diverse summary hypotheses. We show competitive results and demonstrate the effectiveness of the proposed method in exercising control over copying.
We are grateful to the reviewers for their helpful comments. The work was performed in part while Kaiqiang Song was an intern at Bosch Research. This research was supported in part by the National Science Foundation grant IIS-1909603.
-  (2005) Sentence fusion for multidocument news summarization. Computational Linguistics 31 (3). Cited by: Related Work.
-  (2018) Retrieve, rerank and rewrite: Soft template based neural summarization. In ACL, Cited by: Summarization Results.
-  (2005) The AMI meeting corpus. In MLMI, Cited by: Related Work.
-  (2018) Deep communicating agents for abstractive summarization. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), External Links: Cited by: Related Work.
-  (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), External Links: Cited by: Introduction, Related Work.
A legal perspective on training models for natural language processing. In Proc. of LREC, Cited by: Introduction.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. External Links: Cited by: Related Work, Training.
-  (2019) Unified language model pre-training for natural language understanding and generation. https://arxiv.org/abs/1905.03197. Cited by: Related Work, Training.
-  (2016) Learning-based single-document summarization with compression and anaphoricity constraints. In Proc. of ACL, External Links: Cited by: Related Work.
-  (2015) Sentence compression by deletion with lstms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: Related Work.
-  (2018) Bottom-up abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), External Links: Cited by: Introduction.
-  (2018) NEWSROOM: a dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), External Links: Cited by: Data and Evaluation Metrics.
-  (2016) Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: Related Work.
-  (2018) Soft, layer-specific multi-task summarization with entailment and question generation. In Proc. of ACL, External Links: Cited by: Summarization Results.
-  (2018) Guided neural language generation for abstractive summarization using abstract meaning representation. In Proc. of EMNLP, External Links: Cited by: Related Work.
-  (2015) Teaching machines to read and comprehend. In Proceedings of Neural Information Processing Systems (NIPS), External Links: Cited by: Related Work.
-  (1999) The decomposition of human-written summary sentences. In Proc. of SIGIR, Cited by: Related Work.
-  (2019) Sample efficient text summarization using a single pre-trained transformer. https://arxiv.org/abs/1905.08836. Cited by: Introduction.
-  (2018) Improving abstraction in text summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: Introduction.
-  (2018) Adapting the neural encoder-decoder framework from single to multi-document summarization. In EMNLP, Cited by: Related Work.
-  (2013) Document summarization via guided sentence compression. In EMNLP, Cited by: Related Work.
-  (2017) Deep recurrent generative decoder for abstractive text summarization. In EMNLP, Cited by: Summarization Results.
-  (2018) Abstract meaning representation for multi-document summarization. In COLING, Cited by: Related Work.
-  (2004) ROUGE: a package for automatic evaluation of summaries. In Wksp. on Text Summarization Branches Out, Cited by: Introduction, Data and Evaluation Metrics, Summarization Results.
-  (2015) Toward abstractive summarization using semantic representations. In Proc. of NAACL, Cited by: Related Work.
-  (2013) Towards abstractive speech summarization: Exploring unsupervised and supervised approaches for spoken utterance compression. IEEE Trans. ASLP 21 (7), pp. 1469–1480. Cited by: Related Work.
-  (2019) Hierarchical transformers for multi-document summarization. In ACL, Cited by: Human Evaluation.
-  Summarization with a joint model for sentence extraction and compression. In Workshop on Integer Linear Programming for Natural Language Processing, Cited by: Related Work.
-  (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proc. of SIGNLL, Cited by: Summarization Results.
-  (2011) Automatic summarization. Foundations and Trends in Information Retrieval. Cited by: Related Work.
-  (2015) Better summarization evaluation with word embeddings for ROUGE. In EMNLP, Cited by: Related Work.
-  (2017) Crowd-sourced iterative annotation for narrative summarization corpora. In EACL, Cited by: Related Work.
-  (2004) An introduction to DUC-2004. NIST. Cited by: Related Work.
-  (2011) English Gigaword fifth edition LDC2011T07. Philadelphia: Linguistic Data Consortium. Cited by: Data and Evaluation Metrics.
-  (2018) Multi-reward reinforced summarization with saliency and entailment. In NAACL, Cited by: Summarization Results.
-  (2017) A deep reinforced model for abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: Decoding.
-  (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: Related Work.
-  (2013) Generating extractive summaries of scientific paradigms. JAIR. Cited by: Related Work.
-  (2018) A structured review of the validity of BLEU. Computational Linguistics 44 (3), pp. 393–401. Cited by: Introduction.
-  A neural attention model for sentence summarization. In Proceedings of EMNLP, Cited by: Related Work.
-  (2017) Get to the point: Summarization with pointer-generator networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: Introduction, Related Work, Related Work, Summarization Results.
-  (2018) Structure-infused copy mechanisms for abstractive summarization. In Proc. of COLING, Cited by: Summarization Results.
-  (2017) Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Cited by: Introduction, Our Approach.
-  (2019) BiSET: bi-directional selective encoding with template for abstractive summarization. In Proc. of ACL, Cited by: Summarization Results.
-  (2013) A sentence compression based framework to query-focused multi-document summarization. In ACL, Cited by: Related Work.
-  (2018) Controlling decoding for more abstractive summaries with copy-based networks. https://arxiv.org/abs/1803.07038. Cited by: Introduction.
-  (2018) Breaking the beam search curse: a study of (re-)scoring methods and stopping criteria for neural machine translation. In EMNLP, Cited by: Decoding.
-  (2019) BERTScore: Evaluating text generation with BERT. https://arxiv.org/abs/1904.09675. Cited by: Introduction, Data and Evaluation Metrics, Summarization Results.