Recent work in large-scale language models [19; 20; 5] has allowed pretrained contextual representations to be easily adapted for a variety of downstream tasks, yielding improvements on many benchmarks evaluating natural language understanding. Less explored, however, has been the effect of these pretrained representations on text production tasks, such as abstractive summarization, where state of the art performance is still achieved with sequence to sequence (seq2seq) models [1; 6].
These sequence-to-sequence methods typically use an encoder and decoder model with separate parameters to represent the input article and produce the output summary, and the most successful solutions [1; 6; 22] use attention mechanisms that learn an alignment between encoder and decoder states. Pretrained language models, however, do not learn the parameters for such a task-specific alignment, making it challenging to integrate their learned representations into a summarization architecture at a higher level of abstraction than the word embedding.
In this work, we adapt full transformer language models for abstractive summarization. Building on the work of Liu et al., who first proposed concatenating input and output text into a joint sequence and using a common transformer to encode both, we use a language model as a summarizer (rather than an encoder-decoder). With this approach, representations from a pretrained transformer language model (in this case, GPT) can be used to fully initialize the parameters of the summarization model, allowing it to leverage the representational power of a model trained at much larger scale.
To accomplish this effectively, we outline two strategies for adapting pretrained representations for abstractive summarization. In the first, we augment the input representation of the summarization model by instantiating source embeddings that encode the token type of the text being read. This change allows the model to recognize whether a given token belongs to the input article or the output summary, thereby learning how to distinguish both types of text when encoding. In the second, we introduce a domain-adaptive training procedure that fine-tunes the transformer toward understanding general newswire text before training on the summarization end task directly, allowing the model to learn the general structure and language distribution of newswire text before being fine-tuned to produce summaries.
A comprehensive empirical study across three datasets, CNN/DailyMail, XSum, and Newsroom, shows that transformer language models can be used to train abstractive summarizers, producing summaries that are more concise and focused than state of the art baselines. Our investigation also empirically validates several observations about the abstractive summarization task. First, echoing the results of Sun et al., the most common summarization evaluation metric, ROUGE, is highly sensitive to summary length, providing an advantage to methods that produce longer summaries, whether through learning or through minimum summary length constraints. Second, achieving higher ROUGE scores is not strongly consistent with human assessments of abstractive summary quality. Finally, despite being conceived as abstractive summarizers, most current state of the art models are highly extractive, copying phrases and even sentences verbatim from the document.
In this paper, we focus on a variant of the Transformer that has been pretrained on a large corpus of natural language stories: the GPT model. As our architecture is practically identical to the one proposed in Radford et al., we point readers to that work for background on the architecture of the model, and focus below on the enhancements our approach makes to the input representation.
2.1 Input Representation
Each article is represented as a sequence of tokens $X = (x_1, \dots, x_n)$ and its corresponding summary is a sequence of tokens $Y = (y_1, \dots, y_m)$. As outlined in Figure 1, each training example is an article and its corresponding summary concatenated into a single sequence, similar to Liu et al.:

$$S = X \circ D \circ Y \circ E$$

where $\circ$ denotes concatenation, and $D$ and $E$ are special tokens identifying the delimitation and end of the sequence. Below, we define the process of encoding these sequences as inputs to the transformer.
First, each token $t$ in the concatenated sequence indexes a word embedding $w_t$ from a joint vocabulary for the article and summary (and special tokens).
Second, since the transformer (a self-attention model) has no inherent notion of token ordering, a position embedding $p_i$ is initialized for each absolute position $i$ in the sequence. The embedding for each position is added to the word embedding of the token occupying that position, augmenting the final representation of the input; for example, token $x_i$ of the article would be represented as $w_{x_i} + p_i$. Once the delimitation token $D$ is reached, the position counter is reset. For example, the first token of the article, $x_1$, and the first token of the summary, $y_1$, both receive $p_1$ as a positional embedding to augment their representations.
Finally, because the transformer must recognize pragmatic differences between the text of the article it reads and the text of the summary it learns to produce, an additional source-specific embedding is initialized: $s_X$ for tokens from the article portion of the concatenated input, and $s_Y$ for tokens from the summary portion. For any article token $x_i$ (Eq. 2) or summary token $y_j$ (Eq. 3), then, the final encoding is:

$$e_{x_i} = w_{x_i} + p_i + s_X \qquad (2)$$
$$e_{y_j} = w_{y_j} + p_j + s_Y \qquad (3)$$
In contrast to the other embeddings in the model, the source embeddings are not pretrained, introducing the risk that they could dominate the pretrained word and position representations when summed (Eq. 2, 3). To avoid this, we normalize the random initialization of the source embeddings to have a norm equal to half the average norm of the word embeddings.
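The encoding described above can be sketched as follows. This is a minimal numpy illustration, not the paper's code: `build_input_embeddings`, its arguments, and the lookup tables are hypothetical stand-ins for the pretrained GPT word and position embedding matrices.

```python
import numpy as np

def build_input_embeddings(article_ids, summary_ids, word_emb, pos_emb, seed=0):
    """Sketch of the final token encoding: word + position + source embedding.

    `word_emb` (vocab x d) and `pos_emb` (max_len x d) stand in for the
    pretrained GPT tables; `article_ids` and `summary_ids` are token indices.
    """
    d = word_emb.shape[1]
    rng = np.random.default_rng(seed)

    # Source embeddings are newly (randomly) initialized, then rescaled so
    # their norm equals half the average norm of the word embeddings, to keep
    # them from dominating the pretrained representations when summed.
    target = 0.5 * np.linalg.norm(word_emb, axis=1).mean()
    s_art, s_sum = rng.standard_normal((2, d))
    s_art *= target / np.linalg.norm(s_art)
    s_sum *= target / np.linalg.norm(s_sum)

    # The position counter resets at the delimiter, so the first article
    # token and the first summary token share the same position embedding.
    art = np.stack([word_emb[t] + pos_emb[i] + s_art
                    for i, t in enumerate(article_ids)])
    summ = np.stack([word_emb[t] + pos_emb[i] + s_sum
                     for i, t in enumerate(summary_ids)])
    return np.concatenate([art, summ], axis=0)
```

Because word and position components are shared, the encodings of the same token at the same (reset) position differ between article and summary by exactly the constant vector $s_X - s_Y$.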
The model is initialized with pretrained parameters from the GPT model, which was trained on the BooksCorpus. Following this initialization, we pursue two additional training procedures: domain-adaptive training and end task training.
3.1 Domain-adaptive Training
Despite the benefit of using pretrained representations from the GPT model to initialize a summarizer, there is a language shift between the storybooks data on which the GPT model was trained and the type of language found in newswire summarization datasets [8; 16; 7]. Additionally, there are structural differences between how articles are written (usually expressing salient points early on, followed by details later) and how stories unfold (less front-loading of key information).
To address this discrepancy, we propose domain-adaptive training (DAT) to adapt the transformer summarization model to the language distribution of newswire text by maximizing the conditional loglikelihood of the article tokens and summary tokens given all previous tokens in their concatenated input representation (see Figure 1):

$$\mathcal{L}_{dat} = \sum_{i=1}^{n} \log P(x_i \mid x_{<i}) + \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, X)$$

where $n$ is the length of the article, $m$ is the length of the summary, $x_{<i}$ is the set of all tokens in the article that precede $x_i$, $y_{<j}$ is the set of all tokens in the summary that precede $y_j$, and $X$ is the set of all article tokens. In this framework, the model is adapted to produce newswire-like language before being trained on the summarization end task, which focuses only on learning summary production.
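A minimal sketch of this objective (as a loss to minimize, i.e., the negative of the loglikelihood above) follows. It is illustrative, not the paper's implementation: `logits` is assumed to be the model's output for every position of the concatenated sequence, and `targets` the correct next token at each position.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dat_loss(logits, targets):
    """Domain-adaptive training sketch: negative log-likelihood of *every*
    token in the concatenated (article + summary) sequence, so the model
    learns to reproduce article language as well as summary language.

    logits: (seq_len, vocab) array of next-token scores.
    targets: (seq_len,) array of the correct token at each position.
    """
    lp = log_softmax(logits)
    return -lp[np.arange(len(targets)), targets].sum()
```

With uniform logits over a vocabulary of size $V$, each position contributes $\log V$ to the loss, which makes the function easy to sanity-check.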
3.2 End Task Training
During end task training (ETT), the model is trained specifically to produce a summary given a document, constraining the loss function to maximize the conditional loglikelihood of producing only the correct summary tokens given the set of article tokens:

$$\mathcal{L}_{ett} = \sum_{j=1}^{m} \log P(y_j \mid y_{<j}, X)$$

where $y_{<j}$ is the set of tokens in the summary that precede $y_j$.
4 Experimental Setup
The CNN/Daily Mail dataset consists of articles from CNN and Daily Mail. Each article is associated with several descriptive bullet point highlights. Similar to previous work, we concatenate the highlights to create a target summary for each article in the dataset and use the same dataset splits. The Extreme Summarization (XSum) dataset consists of 230k article-summary pairs taken from the BBC. Each summary is a single sentence long and is professionally written (usually by the author), making the dataset exhibit more abstractive content than typical summarization datasets, such as CNN/DailyMail. The Newsroom dataset consists of 1.2M article-summary pairs scraped from the Internet Archive. The articles come from a set of 38 publishers and cover diverse topics. We provide statistics about each dataset in Table 1.
We used a byte-pair encoding (BPE) for tokenization. For each summarization dataset, we use the BPE to tokenize each article and summary, and then truncate the articles to a maximum length of 512 tokens and each summary to a maximum length of 110 tokens. We then format each article-summary pair into the format outlined in Figure 1.
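The preprocessing step can be sketched as below. This assumes tokenization has already happened (a real pipeline would use the GPT BPE vocabulary); the function name and special-token strings are illustrative placeholders.

```python
def format_example(article_tokens, summary_tokens,
                   max_article=512, max_summary=110,
                   delim="<delim>", end="<end>"):
    """Truncate a tokenized article/summary pair and concatenate them around
    the delimiter and end tokens, as in Figure 1.

    Limits (512 article tokens, 110 summary tokens) follow the text; the
    special-token strings here are placeholders for vocabulary entries.
    """
    article = article_tokens[:max_article]
    summary = summary_tokens[:max_summary]
    return article + [delim] + summary + [end]
```

The resulting sequence has at most 512 + 110 + 2 = 624 tokens per training example.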
We used a transformer decoder with 12 blocks and 12 masked self-attention heads in each block, with each head of dimensionality 64, matching the GPT architecture. Unless stated otherwise, we use the pretrained weights of Radford et al. to initialize the parameters of the model. Special tokens that are added to the vocabulary (i.e., the end token, start token, and delimiter token) are initialized by sampling from the standard normal distribution. Our full model with source embeddings (§2.1) is denoted Transformer-SM, and we also train an ablation, Transformer-LM, that does not use source embeddings.
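Extending the pretrained vocabulary with special tokens can be sketched as appending freshly sampled rows to the embedding matrix; a hypothetical numpy illustration:

```python
import numpy as np

def extend_embeddings(pretrained, n_special, seed=0):
    """Append rows for n_special new tokens (e.g., end, start, delimiter) to
    a pretrained embedding matrix, initialized from the standard normal
    distribution as described in the text. Names are illustrative."""
    rng = np.random.default_rng(seed)
    new_rows = rng.standard_normal((n_special, pretrained.shape[1]))
    return np.concatenate([pretrained, new_rows], axis=0)
```

The pretrained rows are left untouched, so the special tokens pick up useful representations only during fine-tuning.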
All models were trained with a fixed learning rate and a minibatch size of 64. When domain-adaptive training (DAT) is used, we train for 10 epochs using DAT and then for an additional 10 epochs using end task training (ETT). Without DAT, we train on the end task for 20 epochs. Unless specified otherwise, the final model trained for each dataset uses both domain-adaptive training and end task training. We did not tune hyperparameters. All models were trained using PyTorch (https://pytorch.org/) and the HuggingFace implementation of GPT (https://github.com/huggingface/pytorch-openai-transformer-lm). We trained each model on 8 Tesla V100-SXM2 GPUs. Training for a total of 20 epochs took approximately 1 day of clock time for the XSum and CNN/Daily Mail datasets, and 3 days for the Newsroom dataset. Our source code is publicly available at https://github.com/Andrew03/transformer-abstractive-summarization.
We perform generation using beam search with a beam size of 3, applying the trigram trick during beam search to block repeated trigrams. Each summary token is generated by decoding from the distribution yielded by the model when processing an input that is the concatenation of the article tokens, the delimiter token, and any previously generated summary tokens.
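The paper decodes with beam search (beam size 3); for brevity, the sketch below applies the trigram filter to a greedy decoding loop. `step_fn` is an assumed callable returning candidate tokens in descending model score, and the delimiter string is a placeholder; neither is from the paper's code.

```python
def blocked_trigram(prev_tokens, candidate):
    """True if appending `candidate` would repeat a trigram already present
    in `prev_tokens` (the 'trigram trick' used during search)."""
    if len(prev_tokens) < 2:
        return False
    new_tri = tuple(prev_tokens[-2:]) + (candidate,)
    seen = {tuple(prev_tokens[i:i + 3]) for i in range(len(prev_tokens) - 2)}
    return new_tri in seen

def greedy_decode(step_fn, article, end_token, max_len=110):
    """Greedy stand-in for the paper's beam search: at each step the model
    scores the concatenation of the article, delimiter, and the summary
    generated so far; candidates that repeat a trigram are skipped."""
    summary = []
    while len(summary) < max_len:
        for tok in step_fn(article + ["<delim>"] + summary):
            if not blocked_trigram(summary, tok):
                summary.append(tok)
                break
        else:
            break  # every candidate would repeat a trigram
        if summary[-1] == end_token:
            break
    return summary
```

In a full beam search, the same filter is applied when expanding each hypothesis on the beam.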
We evaluate our system with common summarization metrics: ROUGE-1 (R-1), a measure of unigram overlap between the generated and reference summaries; ROUGE-2 (R-2), a similar measure of bigram overlap; and ROUGE-L (R-L), a measure based on the longest common subsequence between the generated and reference summaries. We also report the length of the summary in terms of tokens produced. For each dataset, for evaluation on the test set, we selected models with the largest ROUGE-1 score on a subset of 500 samples from the validation set.
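For concreteness, ROUGE-L can be sketched as precision, recall, and F1 derived from the longest common subsequence. This is a toy re-implementation for illustration, not the official ROUGE package used for the reported scores:

```python
def rouge_l(candidate, reference):
    """ROUGE-L sketch: LCS-based precision/recall/F1 over token lists."""
    m, n = len(candidate), len(reference)
    # Standard O(m*n) dynamic program for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    p = lcs / m if m else 0.0
    r = lcs / n if n else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Because recall divides by the reference length, longer candidates can only help recall, which is one mechanism behind the length sensitivity discussed later.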
5.1 CNN/Daily Mail
We report the results from various models previously trained and evaluated on the CNN/Daily Mail dataset. The PGen and PGen + Coverage models consist of attentive RNN encoder-decoders that integrate the ability to directly copy from the article when generating tokens. Pasunuru and Bansal extend this work by adding policy gradient training with a mixture of rewards that promote saliency and entailment. Bottom-up summarization and the Copy Transformer also extend See et al. by using the copy mechanism to compress the article to only relevant content before summarizing it. Chen and Bansal also perform content selection, but extract full sentences from the document with a novel extractor model. Finally, the DCA model uses multiple separate communicating encoders over different parts of the document to produce representations that are more focused on salient details.
| Model | R-1 | R-2 | R-L | Length |
|---|---|---|---|---|
| PGen + Coverage | 39.53 | 17.28 | 36.38 | 59.75 |
| RougeSal + Ent RL | 40.43 | 18.00 | 37.10 | - |
| Bottom-Up Summ | 41.22 | 18.68 | 38.34 | 55.25 |
| rnn-ext + RL | 41.47 | 18.72 | 37.76 | 77.44 |
We report our results using automatic metrics in Table 2. On this dataset, our main model, Transformer-SM, performs slightly worse than other state of the art models. We note that our model tends to generate considerably shorter summaries than the gold summaries, which could lower ROUGE recall performance.
In Figure 2, we investigate the correlation of ROUGE-L scores with summary length, and note that a minimum decoding length used by state-of-the-art algorithms places baseline generated summaries in length bins of higher average ROUGE-L performance. When Transformer-SM produces summaries in these same length bins (i.e., more than 30 tokens), its performance is only consistently beaten by the DCA model, which was fine-tuned with RL.
While ROUGE scores are negatively influenced by the shorter average length of the summaries produced by our model, it is not clear that shorter summaries are of lower quality. To test this hypothesis, we perform a human evaluation on 125 (article, summary) pairs randomly sampled from the test set. The articles and model-generated summaries were presented to three workers from Amazon Mechanical Turk (AMT).
Each worker was presented two model-generated summaries, one produced by the Transformer-SM model, and one from the DCA model  or the PGen+Cov model . Workers were asked to select the better summary for four different quality metrics from Celikyilmaz et al. : non-redundancy (fewer of the same ideas are repeated), coherence (ideas are expressed clearly), focus (the main ideas of the document are shared while avoiding superfluous details), and overall.
The results are presented in Table 3. Interestingly, the summaries from Transformer-SM are consistently preferred by humans across all 4 evaluation dimensions compared to those from the DCA and PGen+Coverage models, indicating that the Transformer-SM’s lower ROUGE scores observed in Table 2 are not necessarily correlated with human judgments of quality.
| Model Name | R-L P | R-L R | Length |
|---|---|---|---|
| rnn-ext + RL | 99.05 | 12.77 | 77.44 |
Due to the large improvements over the baseline models in the human evaluation categories of non-redundancy and focus, and the generally shorter summaries produced by Transformer-SM, we investigate whether Transformer-SM is able to more efficiently express key ideas of the document. To evaluate the efficiency of each model, we remove non-content words from the model-generated summaries and articles, and compute the ROUGE score between them. This measure serves as a proxy for the rate at which ideas expressed in the summary can be found in the document.
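The efficiency measure above can be sketched with a unigram stand-in for the ROUGE computation: strip non-content words from both texts, then measure what fraction of the summary's content words appear in the article. The function name and the tiny stopword list are illustrative, not the ones used for the reported numbers.

```python
# Toy stopword list standing in for a full non-content-word filter.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was"}

def content_recall(summary_tokens, article_tokens, stopwords=STOPWORDS):
    """Fraction of the summary's content unigrams that occur in the article,
    a proxy for how many ideas in the summary can be found in the document."""
    summ = [t for t in summary_tokens if t.lower() not in stopwords]
    art = {t.lower() for t in article_tokens if t.lower() not in stopwords}
    if not summ:
        return 0.0
    return sum(t.lower() in art for t in summ) / len(summ)
```

A shorter summary with the same score as a longer one expresses the document's content more efficiently per token.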
We report these results in Table 5 and observe that Transformer-SM achieves comparable ROUGE-L recall scores to other baselines when evaluated with respect to the article, despite producing summaries that are, on average, 27% shorter. Meanwhile, ROUGE-L precision is also very similar across models, suggesting that the summaries of all models reflect a similar degree of information relevance. (The high precision scores across all models confirm that, despite being conceived as abstractive generators, these models display highly extractive behavior.) Combined with the results from Table 3, we conjecture that Transformer-SM is able to more efficiently express key ideas from the document. While other models may be producing longer summaries that yield higher ROUGE performance (Table 2), the additional tokens may reflect redundant and unsalient information, which human evaluators penalize.
Analysis of Domain-Adaptive Training and Source Embeddings
Our approach involved two strategies for efficiently using transformer language models for abstractive summarization: domain-adaptive training and source embeddings. To assess their individual impact, we evaluate multiple training schedule permutations (e.g., various combinations of using pretrained representations from the GPT model and using domain-adaptive training), as well as the impact of source embeddings. Our results in Table 5 yield multiple interesting conclusions. First, in general, domain-adaptive training (+DAT in Table 5) provides a clear improvement over training directly on the end task, irrespective of whether pretrained representations are used. Similarly, using source embeddings (T-SM in Table 5) provides a repeated improvement over the T-LM ablation. Surprisingly, when pretrained initializations, DAT, and source embeddings are used in tandem, performance drops slightly compared to not using DAT or not using source embeddings. We note, however, that this observation does not hold true for the XSum dataset (§5.2), and conjecture that the extractive nature of the CNN/DailyMail dataset may make these approaches have redundant effects in this setting.
A study on the quality of abstractive summaries is best performed on the XSum dataset , which is specifically designed with gold summaries that are less extractive than the other datasets (Table 1).
We report the performance of Transformer-SM on this dataset in comparison to baselines originally reported in Narayan et al. : an attention-based sequence to sequence model (AttnS2S), a pointer-generator model capable of generating words and copying directly from the input (PGen), a second pointer-generator model with a coverage mechanism to prevent repetition (PGen+Cov), and the top performing variant of the topic aware convolutional sequence to sequence model (T-ConvS2S), in which the encoder and decoder are provided with word topic and document topic distributions obtained using LDA as additional inputs. Our final baseline is the Multi-Level Memory Network (MMN) , which applies attention over multiple memory layers for varying levels of abstraction.
We report our results in Table 6. Our models significantly outperform the comparison baselines across all three variants of the ROUGE metric. Interestingly, the Transformer-SM achieves noticeable improvement over the Transformer-LM model, suggesting that both source embeddings and domain adaptive training are helpful when target summaries are more abstractive. Examples of model-generated summaries from the XSum dataset illustrate the improvement over baselines qualitatively in Table 7. In support of results presented earlier, the model produces abstractive summaries that provide focused information about the main points of the articles.
|Article snippet||Officials said the attack happened at the Europa shopping centre in the capital Minsk. … Police later arrested the 18-year-old suspect. … "He cut one woman with the chainsaw and hit her with a hammer. She died. He also attacked others." The injured woman was taken to a local hospital. The attacker had brought the chainsaw and the axe to the shopping centre …|
|T-ConvS2S||A man has been arrested on suspicion of attempted murder by after a knife attack on a shopping centre in central London.|
|Transformer-SM||A teenage girl has been killed by a chainsaw attack at a shopping centre in central Russia, police say.|
|Gold||A young man has attacked people with a chainsaw and an axe at a shopping centre in Belarus, killing one woman and injuring another.|
|Article snippet||The 34-year-old Sweden striker’s contract with the french champions expires in the summer, and he has been linked with Manchester United, LA Galaxy and AC Milan. … PSG said Ibrahimovic leaves as "the greatest striker and one of the very best players in the club’s history" . …|
|T-ConvS2S||Paris St-Germain have completed the signing of Zlatan Ibrahimovic from Paris St-Germain for an undisclosed fee.|
|Transformer-SM||Zlatan Ibrahimovic says he will leave Paris St-Germain at the end of the season to return to the club.|
|Gold||Zlatan Ibrahimovic will leave Paris St-Germain at the end of the season.|
|Article snippet||… The animal was taken from Lathom pets and aquatics in Ormskirk on Tuesday afternoon, Lancashire police said. The shop’s owner said CCTV showed a man taking the tortoise - which needs calcium supplements - out of the tank. …|
|T-ConvS2S||A puppy’s pet shop has been stolen from a shop in Lancashire.|
|Transformer-SM||A tortoise has been stolen from a pet shop.|
|Gold||A baby tortoise has been stolen from a pet shop in Lancashire.|
Finally, we report the performance of our model on the Newsroom dataset , the largest of the evaluation datasets. Due to the large cost of training, only the Transformer-SM model was evaluated.
As baselines, we report the performance of models released by the authors of the Newsroom dataset . These models included an attentive encoder-decoder (Attn-S2S) and a pointer-generator network (PGen). We also compared against C10110 , a complex encoder-decoder that uses LSTMs, encoder attention, intra-decoder attention, and pointer-generation to produce summaries. We also compare against the Multi-Level Memory Network (MMN)  mentioned earlier. The authors of this baseline only evaluated on the abstractive subset of the Newsroom dataset.
We report our results with ROUGE-style automatic metrics in Table 8, showing that Transformer-SM outperforms the previous best model, C10110, across all metrics. Interestingly, our model achieves its highest performance increase over baseline models on ROUGE-L, the metric usually considered most strongly correlated with strong summaries. Furthermore, an analysis of different validation subsets of the Newsroom dataset in Table 9 (split on the level of extractiveness of the gold summaries) shows that Transformer-SM performs better than baseline approaches on all varieties of summary types.
6 Related Work
There has been a large variety of work exploring different methods for neural abstractive document summarization. Attention mechanisms have been shown to improve a variety of models [14; 25; 3], and are one of the motivating factors for this work. Pointer-generator networks, introduced in See et al., have been shown to increase summary veracity, and inspired the tangential usage of copy mechanisms in Transformers for document summarization (Gehrmann et al.). Other works have also explored the use of reinforcement learning to directly optimize summarization models on the ROUGE metric [17; 18; 2].
Our approach is also relevant to recent work on contextualized language representations that are pretrained on large-scale language corpora. These representations can then be simply integrated or fine-tuned for improved performance on many downstream tasks. SSL, CoVe, and ELMo all learned contextualized representations through training RNN language models and encoder-decoders. Follow-up work extended these ideas, but replaced the RNN with a deep transformer that was trained to learn language patterns on a large story dataset. BERT further extended the idea of using Transformers for language modeling by making the encoded representations bidirectional and adding two new loss functions: a masked token loss and a next sentence prediction loss for more accurate discourse representations. More recently, GPT2 expanded the scale of pretrained language models, and showed promising results on zero-shot tasks.
In this work, we introduce two approaches for effectively adapting pretrained language model representations to abstractive summarization: domain-adaptive training and source embeddings. We evaluate the effect of both approaches across three abstractive summarization testbeds, CNN/DailyMail, XSum, and Newsroom, and achieve state of the art ROUGE-L results on two of them, while showing superior human evaluation performance on the third. In the process, we show that the ROUGE-L metric often used for abstractive summarization evaluation is quite sensitive to summary length, making it exploitable by approaches that use heuristics to control summary length.
- Celikyilmaz et al.  Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In NAACL.
- Chen and Bansal  Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL.
- Cheng and Lapata  Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
- Dai and Le  Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.
- Devlin et al.  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Gehrmann et al.  Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive summarization. In EMNLP.
- Grusky et al.  Max Grusky, Mor Naaman, and Yoav Artzi. 2019. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In NAACL.
- Hermann et al.  Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
- Kim et al.  Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2018. Abstractive summarization of reddit posts with multi-level memory networks. arXiv preprint arXiv:1811.00783.
- Lin [2004a] Chin-Yew Lin. 2004a. Looking for a few good metrics: Automatic summarization evaluation-how many samples are enough? In NTCIR.
- Lin [2004b] Chin-Yew Lin. 2004b. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out.
- Liu et al.  Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. In ICLR.
- McCann et al.  Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
- Nallapati et al.  Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
- Nallapati et al.  Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In CoNLL.
- Narayan et al.  Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP.
- Pasunuru and Bansal  Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In ACL.
- Paulus et al.  Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In ICLR.
- Peters et al.  Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Radford et al.  Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Radford et al.  Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- See et al.  Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
- Shi et al.  Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K Reddy. 2018. Neural abstractive text summarization with sequence-to-sequence models. arXiv preprint arXiv:1812.02303.
- Sun et al.  Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to compare summarizers without target length? pitfalls, solutions and re-examination of the neural summarization literature. In Proceedings of NAACL 2019 Workshop on Optimizing and Evaluating Neural Language Generation (NeuralGen).
- Tan et al.  Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1171–1181.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Wang et al.  Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint 1804.07461.
- Zhu et al.  Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan R. Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.
Appendix A Reproducibility
We provide additional details relevant to the experimental environment here.
The CNN/Daily Mail dataset consists of articles from CNN and Daily Mail (https://www.cnn.com; https://www.dailymail.co.uk). Each article is associated with several descriptive bullet point highlights. Similar to previous work, we concatenate the highlights to create a target summary for each article in the dataset. The Newsroom dataset consists of 1.2M article-summary pairs scraped from the Internet Archive (https://archive.org/). The articles come from a set of 38 publishers and cover diverse topics. Finally, the Extreme Summarization (XSum) dataset consists of 230k article-summary pairs taken from the BBC (https://www.bbc.com/). Each summary is a single sentence long and is professionally written (usually by the author). For all datasets, we use the splits defined in the original works that proposed them. Because the datasets are too large to provide as supplementary material, we provide pointers in the source code README for acquiring them.
Details about important hyperparameters can be found in Section 4 of the paper. Additional training hyperparameters can be found as the default parameters in the training script of the source code (https://github.com/Andrew03/transformer-abstractive-summarization). Most hyperparameter values were the same ones suggested by previous work on transformer language models. The only hyperparameter we varied that is not measured as an ablation (i.e., training schedules and whether to include source embeddings) was the initialization of the source embeddings (when they were included). For this hyperparameter, we explored three initializations: 1) both source embeddings as zero vectors; 2) both source embeddings sampled from the standard normal distribution; and 3) both source embeddings sampled from a normal distribution with mean 0 and rescaled so their norm equals half the average norm of the pretrained embeddings from the GPT language model. We report this last initialization in all experiments.
Each experiment was run as follows for any given model and dataset. First, we trained the model as described in the paper. After every 1000 minibatches, we computed ROUGE on a random but persistent 500-example subset of the validation set. When the ROUGE-1 score of the model stopped rising, we used the previous checkpoint to generate summaries for all articles in the test set, decoding with beam search with a beam width of 3. We ran exactly one evaluation run for each result we include in our paper.