Text summarization is the task of generating a concise text sequence as a summary of a relatively longer source document. A high-quality summary conveys the most important points of its source. The task is generally performed in two ways: 1) extractive, in which salient sentences are identified and concatenated to form the final summary Nallapati et al. (2017); Dong et al. (2018); Sotudeh et al. (2021a); Narayan et al. (2020); Cho et al. (2020); and 2) abstractive, which produces a paraphrase of the main contents of the given text See et al. (2017); Gehrmann et al. (2018); MacAvaney et al. (2019); Zhang et al. (2019); Sotudeh et al. (2020a); Lewis et al. (2020); Lebanoff et al. (2020). Abstractive summarization is considered more challenging, as the model must generate novel words rather than only extract sentences.
Over the past few years, different neural models, including RNN Hochreiter and Schmidhuber (1997) and Transformer-based Vaswani et al. (2017) networks, have been proposed for the summarization task. While promising, the performance of such models depends on the abundance of training data because of their massive model complexity Ying (2019). A lack of sufficient training data weakens a model’s ability to generalize patterns in the training data to unseen data Althnian et al. (2021). In addition, overfitting becomes likely when the model is forced to learn from a limited set of data, which further hinders generalization. This justifies the need for large-scale corpora for training large and complex models.
The prevalence of social media platforms has given communities an opportunity to exchange different types of data while interacting with each other. Reddit (https://www.reddit.com/) is one such popular platform, where users post content of interest in a variety of domains. Tldr, the acronym for “Too Long; Didn’t Read”, is a common practice that removes unnecessary information from a lengthy post and presents its gist in a few words. Figure 1 shows a sample Reddit post with its Tldr, which abstracts the post with extreme compression. The abundance of posts containing such Tldrs in recent years has given rise to data collections that can be used for training deep neural networks, hence addressing the scarcity of large-scale datasets. Despite the possibility of acquiring large-scale datasets from social media platforms, training deep neural networks on such datasets remains challenging. This might be due to the specific writing style of social media content, such as informal language and massive noise Sotudeh et al. (2020b).
|Dataset||Domain||Instances|
|Reddit TIFU||Social Media||120K|
|TldrHQ (ours)||Social Media||1.7M|
|Tldr9+ (ours)||Social Media||9.2M|
Table 1 shows some of the existing summarization datasets in social and non-social media domains. These datasets are specifically proposed for the extreme summarization task, where the aim is to produce one to two summary sentences with extreme compression and high abstraction. In this paper, we introduce our dataset, Tldr9+, with over 9 million instances, which is more than twice as large as the previous dataset Völske et al. (2017). We further sample high-quality instances from Tldr9+ with the help of human annotations to construct TldrHQ, yielding 1.7 million instances, in the hope of providing firm grounds for future work. Owing to the extremely short length of Tldr summaries (less than 40 words), our datasets are better suited to extreme summarization than to longer-form summarization.
[Figure panels: (a) submission-Tldr; (b) comment-Tldr]
In this research, we harvest instances that include Tldrs written by Reddit users over the period 2005-2021. Our initial collection yields over 9 million instances with Tldrs (i.e., Tldr9+). Since social media posts are inherently noisy, we apply a heuristic method to cut low-quality instances from the initial set, which ultimately results in 1.7 million high-quality instances (i.e., TldrHQ). To decide on this heuristic, we employ human annotators. Furthermore, we benchmark various state-of-the-art extractive and abstractive summarization models on our proposed datasets. Finally, we analyze the results on both datasets to shed light on future directions. We believe that our datasets can pave the path for future research. Our miner code and data are made publicly available at https://github.com/sajastu/reddit_collector, along with the licensing details.
2 Related work
Over the past few years, the summarization community has witnessed a variety of summarization datasets in different domains See et al. (2017); Cohan et al. (2018); Kornilova and Eidelman (2019); Grusky et al. (2018); Sotudeh et al. (2021b). While these collections have provided a fair basis for training different neural text summarization models, the need for large-scale collections, on the order of over 4 million instances, has not been much explored.
Among the first attempts on this track, Rush et al. (2015) gathered the English Gigaword corpus Graff et al. (2003), which contains around 4 million article-headline pairs for the task of news headline generation. Researchers have noted that lead bias is a common phenomenon in most news datasets, where the early parts of the article generally include the most important information Kedzie et al. (2018); Zhu et al. (2019); Grenander et al. (2019). To alleviate lead bias when training summarization models, recent efforts have proposed summarization datasets in which the lead bias phenomenon is mitigated and summaries are sampled from diverse source regions. Among those, Sharma et al. (2019) proposed BigPatent, consisting of 1.3 million patent documents collected from Google Patents Public Datasets, with human-written abstractive summaries. Kim et al. (2019) proposed Reddit TIFU, in which the abstractive gold summaries are sampled from diverse regions of the source document rather than lead regions.
Our proposed datasets are more suited to the task of extreme summarization Narayan et al. (2018); Cachola et al. (2020), where the task is to create a short one-sentence summary. To this end, Narayan et al. (2018) proposed the XSUM dataset, a real-world dataset compiled from online articles of the British Broadcasting Corporation (BBC). The TLDR generation task is also a new form of extreme summarization. Kim et al. (2019) collected the Reddit TIFU dataset, consisting of 120K posts from online discussions on Reddit. Völske et al. (2017) mined around 4 million Reddit posts along with their Tldr summaries, which resulted in the Webis-TLDR-17 dataset. While our work is similar to theirs, our collected dataset is more than twice as large as the one previously proposed.
3 The Reddit Collection
3.1 Data Collection
Reddit is a social news aggregation and discussion platform that was officially launched in June 2005. It supports features specific to social platforms, such as web content rating through up-voting and discussion topics via subreddits. The user-created content can be of any domain, such as News, Politics, Science, and Sport. Users can post or comment on a specific topic that falls into a specific subreddit. Within subreddits, users submit their posts as submissions, and others can react by commenting under the posted submission. Each submission and comment has a text body/selftext that reflects the users’ information exchange regarding a specific topic. The existence of social platforms such as Reddit gives the research community an opportunity to experiment with resources that use informal language, rather than the formal language of news, scientific, or legal documents.
Tldr (Too Long; Didn’t Read) is a common practice on Reddit that often appears at the end of long posts. It is an extremely short summary that lets users read a shorter version of a longer text when they do not have time to read the entire post. Figure 2 shows the ratio of posts containing such Tldr summaries over all submitted posts (and comments) across different years. Although we see an ascending trend since 2005, the number of Tldrs remains fixed (see Section 3.4) while the number of posts increases drastically.
Pushshift (https://files.pushshift.io/) is a social media data repository that has recently been made available to NLP researchers Baumgartner et al. (2020). It contains recent and historical dumps of Reddit posts that are updated in real time. To create the Tldr dataset, we downloaded the full data dumps (submissions and comments) covering the period 2005-2021 and extracted instances that contain Tldrs within the posted source text. This mining process resulted in the Tldr9+ dataset, which contains over 9 million instances. To acquire a more fine-grained dataset, we used human annotations to obtain the TldrHQ dataset, consisting of 1.7 million high-quality instances. The datasets’ construction details are discussed in what follows.
3.2 Datasets Construction: Tldr9+ and TldrHQ
Tldr9+. After downloading the Reddit data dumps, we extract posts in which a mention of a Tldr-style keyword is found. To find Tldr-style keywords within a given text, we use a regular expression that matches words starting with “TL” and ending with “DR”, allowing up to three characters in between, as also done by Völske et al. (2017). This stage yields the Tldr9+ dataset as the full corpus. At the next filtering stage, we use a heuristic method along with human supervision to narrow it down to a more fine-grained dataset that contains high-quality instances.
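The keyword matching described above can be sketched as follows. The exact expression used for mining is not given in the text, so `TLDR_PATTERN` is an assumption that follows the verbal description (a word starting with "TL", ending with "DR", with up to three characters in between), and `split_post` is a hypothetical helper that splits a post at the last match.

```python
import re

# Assumed pattern: "TL", up to three arbitrary characters, then "DR",
# matched case-insensitively as a whole word (e.g. "TL;DR", "tldr", "tl/dr").
TLDR_PATTERN = re.compile(r"\bTL.{0,3}DR\b", re.IGNORECASE)


def split_post(text):
    """Split a post into (source, tldr) at the last Tldr-style keyword.

    Returns None when no keyword is found or when either side is empty.
    """
    matches = list(TLDR_PATTERN.finditer(text))
    if not matches:
        return None
    last = matches[-1]
    source = text[:last.start()].strip()
    # Drop separators that commonly follow the keyword, such as ":" or "-".
    tldr = text[last.end():].lstrip(" :;,-\n").strip()
    if not source or not tldr:
        return None
    return source, tldr
```

Splitting at the last match reflects the observation that Tldrs usually appear at the end of a post; other placements would need separate handling.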
TldrHQ. A few studies have noted that user-generated content on social media platforms is noisy Liu and Inkpen (2015), containing spam, bad grammar, and spelling errors. To filter out such noisy instances from the Tldr9+ dataset, we use a heuristic method that drops low-quality instances while retaining high-quality ones. Specifically, given a post-Tldr pair, we first identify the source sentence with the highest mean of Rouge-2 and Rouge-L scores (i.e., the oracle sentence). We choose the oracle sentence because it is the sentence in the longer post with the highest similarity to the Tldr summary, which we treat as the gold standard. We then retain the instance if the oracle sentence’s score surpasses a pre-defined threshold and drop it otherwise. We experiment with thresholds of 0.15, 0.17, 0.20, 0.22, and 0.25, and choose one based on the annotations made by human annotators. The details of the human annotation process are discussed in what follows.
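A minimal sketch of the oracle-sentence heuristic is given below. It uses a simplified Rouge (lowercased whitespace tokens, F1 variant, no stemming) as a stand-in for an official Rouge implementation, so absolute scores will differ from those in the paper. All function names are hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(cand, ref, n=2):
    """Simplified Rouge-N F1 based on clipped n-gram overlap."""
    c, r = _ngrams(cand, n), _ngrams(ref, n)
    return _f1(sum((c & r).values()), sum(c.values()), sum(r.values()))


def rouge_l(cand, ref):
    """Simplified Rouge-L F1 based on the longest common subsequence."""
    prev = [0] * (len(ref) + 1)
    for tok in cand:
        cur = [0]
        for j, ref_tok in enumerate(ref, 1):
            cur.append(prev[j - 1] + 1 if tok == ref_tok else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(cand), len(ref))


def rg(sentence, tldr):
    """Mean of Rouge-2 and Rouge-L between a source sentence and the Tldr."""
    s, t = sentence.lower().split(), tldr.lower().split()
    return (rouge_n(s, t, 2) + rouge_l(s, t)) / 2


def oracle_sentence(sentences, tldr):
    """Return (index, score) of the highest-scoring source sentence."""
    scores = [rg(s, tldr) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    return best, scores[best]


def keep_instance(sentences, tldr, threshold=0.22):
    """Retain the instance only if the oracle score surpasses the threshold."""
    _, score = oracle_sentence(sentences, tldr)
    return score > threshold
```

The default of 0.22 is one of the five thresholds listed above; any of the others can be passed as `threshold`.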
3.3 Human Annotation
As mentioned earlier, we first define 5 fixed thresholds (0.15, 0.17, 0.20, 0.22, and 0.25) to create 5 data subsets from the Tldr9+ dataset. Specifically, we take Tldr9+ as the initial seed, from which the 5 subsets are created as follows. For each pre-defined threshold, we check whether the oracle sentence’s score in a given instance surpasses the threshold. If it does, we add the instance to the subset; otherwise, it is dropped. We then randomly sample 20 cases from each of these subsets with their oracle sentences and Tldr summaries, yielding 100 cases for annotation in total. We have four human annotators from our NLP group either confirm (label 1) or reject (label 0) that the oracle sentence validates the Tldr summary. By definition, the sentence validates the Tldr summary if at least one fragment can be found within the sentence that semantically occurs in the Tldr summary.
We further provide the instance’s text (i.e., the source) as the “Context” for the oracle sentence, and ask the annotators to confirm or reject whether the context also validates the Tldr summary. Context is especially important for cases where the oracle sentence does not validate the Tldr summary. By providing context, we aim to verify whether an ideal summarizer could generate the Tldr from the context when the oracle sentence is not very informative. For tie cases (e.g., two annotators confirm with label 1 while the other two reject with label 0), we employ a fifth annotator to make the final decision.
[Figure 3 panels: (a) Annotation with the oracle sentence; (b) Annotation with context]
[Table 2 columns: Threshold; score w/o context; score w/ context]
[Figure panels: (a) submission-Tldr; (b) comment-Tldr]
Table 2 presents the average decision score assigned to the samples for each threshold. The decision score for a given sample is defined as the annotators’ average confidence in giving label 1 to that sample. If the average confidence score surpasses 0.50, we assign 1; if it is below 0.50, the sample is annotated with 0. Otherwise, the fifth annotator decides the label. As shown, threshold 0.22 attains the full score both with and without the context. Overall, this shows that most of the annotators believe the Tldr can be distilled from both the oracle sentence and the entire source.
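The per-sample labeling rule can be sketched as below, assuming the average confidence is simply the mean of the four binary labels. The function name and the `tie_breaker` argument (the fifth annotator's label) are hypothetical.

```python
def decide_label(labels, tie_breaker=None):
    """Aggregate binary annotator labels into one final label.

    labels: list of 0/1 labels from the annotators.
    tie_breaker: label from an extra annotator, used only when the
    average is exactly 0.5.
    """
    score = sum(labels) / len(labels)  # decision score for this sample
    if score > 0.5:
        return 1
    if score < 0.5:
        return 0
    if tie_breaker is None:
        raise ValueError("tie: a fifth annotator's label is required")
    return tie_breaker
```

For example, labels `[1, 1, 0, 0]` produce a tie, so the final label is whatever the fifth annotator assigns.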
Figure 3 shows the pair-wise inter-rater agreement scores Bennet et al. (1954) for the annotation process on threshold 0.22, indicating that annotators mostly have slight or fair agreement in the labeling process. Specifically, when the context is not provided (i.e., only the oracle sentence is considered), raters (2, 4), (2, 3), and (1, 3) have quite a high rate of agreement. On the other hand, most pairs of annotators, including (1, 2), (1, 4), and (2, 4), achieve a high agreement rate when the context is given. As the decision scores, either with the oracle sentence alone or with the provided context, sum up to 1.0, and considering the moderately high agreement rate between the annotators, we choose the human-decided threshold of 0.22 and sample High-Quality Tldrs from that threshold’s subset to construct the TldrHQ dataset.
[Figure 5: Plots showing (a) the oracle sentence’s importance with respect to its relative position; (b) percentage of novel n-grams; and (c) n-gram abstractiveness. The heat extent shows the number of instances within the specific bin.]
3.4 Dataset Analysis
In this section, we give statistics, along with analyses on the proposed datasets.
Table 3 shows general statistics of the datasets in terms of post and Tldr length. As shown, the compression rate is 8.7 and 12.5 in the Tldr9+ and TldrHQ datasets, respectively. This shows that authors generally write Tldrs that are much shorter than the post’s text, which is expected given the nature of Tldr summaries.
|Dataset Size||1,671,099 posts|
|Train Set Size||1,590,132 posts|
|Mean Sentence Length||21.1 tokens|
|Min Sentence Length||1 token|
|Max Sentence Length||4,370 tokens|
|Total Vocabulary Size||1,582,436|
|Occurring 10+ Times||226,754|
|Train Vocabulary Size||1,138,415|
|Validation Vocabulary Size||248,148|
|Test Vocabulary Size||249,079|
|Train/Test Vocabulary Overlap||72.5%|
Figure 4 demonstrates the number of Tldr pairs in Tldr9+ across different years. As observed, 83.65% of these Tldrs occur after 2013 which shows the popularity of this writing style among Reddit users. We also see a similar trend for years after 2013, each of which constitutes a fixed amount (10%-12%) of the dataset. Table 4 demonstrates the detailed information including data size, sentence length and vocabulary statistics of TldrHQ dataset.
As mentioned earlier, we define the oracle sentence to be the one within the longer post that has the highest overlap with the Tldr summary in terms of the mean of Rouge-2 and Rouge-L scores. The oracle sentence’s relative position in the post’s text, along with its importance, is shown in Figure 5 (a). We define the oracle importance score as follows:

$$\mathrm{importance}(s_o) = \frac{\mathrm{RG}(s_o)}{\sum_{s_i \in S} \mathrm{RG}(s_i)}$$

where $S$ is the set of all sentences within the post, $s_i$ denotes the $i$th sentence, and $s_o$ is the oracle sentence. $\mathrm{RG}(\cdot)$ is a function that takes in a post’s sentence and outputs the mean of its Rouge-2 and Rouge-L scores with respect to the Tldr summary. Intuitively, the oracle importance score can be framed as the attention score over the oracle sentences when the scoring function is Rouge. Observing Figure 5, while more of the oracle sentences occur in early parts of the post’s text with importance scores of less than 0.30, the oracle sentences are spread out across the post’s text overall. This observation is substantial, justifying the usability of this dataset for the extractive summarization task.
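As a rough illustration, the sketch below computes an importance score under the assumption that it is the oracle sentence's share of the summed per-sentence Rouge scores, a normalization consistent with the "attention score" reading above. The relative-position definition (index divided by the last index) and the function name are also assumptions.

```python
def oracle_importance(rg_scores):
    """Return (relative_position, importance) of the oracle sentence.

    rg_scores: per-sentence scores, each the mean of Rouge-2 and Rouge-L
    against the Tldr, in source order.
    """
    oracle_idx = max(range(len(rg_scores)), key=rg_scores.__getitem__)
    total = sum(rg_scores)
    # Assumed normalization: oracle score divided by the sum over all sentences.
    importance = rg_scores[oracle_idx] / total if total > 0 else 0.0
    # Assumed position: 0.0 for the first sentence, 1.0 for the last.
    last = len(rg_scores) - 1
    position = oracle_idx / last if last > 0 else 0.0
    return position, importance
```

With scores `[0.1, 0.3, 0.1, 0.1]`, the oracle is the second of four sentences and carries half of the total score mass.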
To analyze the abstraction level of the TldrHQ dataset, we plot the percentage of novel n-grams within the Tldr summary See et al. (2017) in Figure 5 (b), as well as the Tldr’s n-gram abstractiveness Gehrmann et al. (2019) in Figure 5 (c), over all instances in the TldrHQ dataset. As indicated, a large proportion of novel n-grams appear in the Tldr summaries, as the heat is mostly concentrated in the upper half of the y-axis. These plots show both the promise and the challenge of using this dataset for abstractive summarization models.
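The novel n-gram percentage can be sketched as the share of summary n-grams that never appear in the source. The version below assumes pre-tokenized input and counts repeated summary n-grams individually; the function name is hypothetical, and n-gram abstractiveness is not covered here.

```python
def novel_ngram_percentage(source_tokens, summary_tokens, n):
    """Percentage of summary n-grams that do not occur in the source."""
    source_ngrams = {
        tuple(source_tokens[i:i + n]) for i in range(len(source_tokens) - n + 1)
    }
    summary_ngrams = [
        tuple(summary_tokens[i:i + n]) for i in range(len(summary_tokens) - n + 1)
    ]
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return 100.0 * novel / len(summary_ngrams)
```

For the source "the cat sat on the mat" and the summary "the cat slept", one of three unigrams and one of two bigrams are novel.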
4 Experimental Setup
We benchmark several extractive and abstractive summarization baselines over our two proposed datasets.
BertSumExt. Liu and Lapata (2019) The BertSumExt model is the extractive variant of BertSum, which is the Bert model fine-tuned on the text summarization task. Bert [CLS] tokens are inserted at the start of each input sentence, and their associated representations are used to predict whether the sentence should be included in the final summary.
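The input layout can be illustrated with a small sketch. Whitespace splitting stands in for Bert's WordPiece tokenizer, the trailing [SEP] after each sentence follows the BertSum input format of Liu and Lapata (2019), and the function name is hypothetical.

```python
def build_bertsum_input(sentences):
    """Lay out sentences as [CLS] sent [SEP] [CLS] sent [SEP] ...

    Returns the token list and the index of each [CLS] token; the encoder
    output at those indices serves as the per-sentence representation.
    """
    tokens, cls_positions = [], []
    for sentence in sentences:
        cls_positions.append(len(tokens))
        tokens.extend(["[CLS]"] + sentence.split() + ["[SEP]"])
    return tokens, cls_positions
```

A classifier over the vectors at `cls_positions` then scores each sentence for inclusion in the summary.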
BertSumAbs. Liu and Lapata (2019) BertSumAbs is the abstractive variant of BertSum, where a Transformer-based decoder is added to the Bert encoder.
[Table 5 rows: BertSumExt Liu and Lapata (2019); BertSumAbs Liu and Lapata (2019); Bart Lewis et al. (2020)]
Bart. Lewis et al. (2020) Bart is a denoising autoencoder model that is pre-trained by first corrupting the text with an arbitrary noising function and then trying to reconstruct the original input text. Bart is particularly effective when fine-tuned on text generation tasks such as summarization. As Bart has both its encoder and decoder pre-trained, it can be perceived as an extension of general Bert models, in which only the encoder is pre-trained.
We randomly split our datasets to construct training, validation, and test sets. Specifically, for Tldr9+, we use a 99-0.5-0.5 split, which results in 9,139,935 (train), 43,753 (validation), and 43,749 (test) instances. To split TldrHQ, we use a 95-2.5-2.5 division, yielding 1,590,132 (train), 40,481 (validation), and 40,486 (test) pairs.
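A random split of this kind can be sketched as follows. The seed, the rounding of set sizes, and the function name are assumptions, so the resulting counts will not exactly reproduce the figures reported above.

```python
import random


def split_dataset(instances, ratios=(0.95, 0.025, 0.025), seed=0):
    """Shuffle instances and split them into train, validation, and test sets."""
    items = list(instances)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible shuffle
    n_train = round(len(items) * ratios[0])
    n_valid = round(len(items) * ratios[1])
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test
```

Passing `ratios=(0.99, 0.005, 0.005)` gives the 99-0.5-0.5 layout used for the larger dataset.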
4.3 Training and Hyper-parameters
To train the summarization models, we use HuggingFace’s Transformers Wolf et al. (2020) for Bart and the open implementation (https://github.com/nlpyang/PreSumm) of BertSumExt and BertSumAbs. We use warm-up steps of 32K and 20K for Bart and the BertSum variants, respectively. The AdamW optimizer Loshchilov and Hutter (2019) is used with learning rate of , beta parameter of , and weight decay of for the Bart model. For the BertSum variants, we use the default Adam Kingma and Ba (2015) optimizer with learning rates of for the encoder, and for the decoder, as suggested by the main paper Liu and Lapata (2019). For all models, we use the cross-entropy loss function. We train the models on 8 Nvidia Tesla V100 GPUs for 5 epochs, stopping training early when the validation loss does not decrease for 3 consecutive validation steps. Validation is performed every 25K training steps. To visualize and keep track of the learning process, we use the Weights and Biases Biewald (2020) toolkit.
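The early-stopping rule can be sketched as a small framework-independent helper. It assumes that "does not decrease" is measured against the best validation loss seen so far rather than the immediately preceding one; the class name is hypothetical.

```python
class EarlyStopping:
    """Signal a stop after `patience` validation steps without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0

    def update(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_steps = 0  # improvement resets the counter
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience
```

In a training loop, `update` would be called once per validation step (every 25K training steps in the setup above), and training ends the first time it returns True.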
5 Experimental Results
Table 5 presents the performance of the state-of-the-art summarization models on our proposed datasets in terms of Rouge-1, Rouge-2, and Rouge-L scores. As indicated, Bart outperforms all other models across all Rouge variants on both datasets. This is expected, as both Bart’s encoder and decoder have been pre-trained on a large amount of unlabelled data, unlike the BertSum variants, which only have pre-trained encoders.
Comparing the abstractive models with BertSumExt, we observe a relatively large performance gap. This might be because Tldrs in both the Tldr9+ and TldrHQ datasets are more abstractive than extractive, as also shown in Section 3.4. Despite this gap, the Oracle-Ext scores (i.e., the upper bound of an extractive summarizer) suggest that more developed extractive summarizers could narrow it. The performance gap on Tldr9+ poses various challenges for developing summarization models that better fit the larger dataset, which includes noisy data Kumar et al. (2020). This noise might be handled via methods such as noise-aware training Namysl et al. (2020), while enabling the models to benefit from the large-scale Tldr9+ dataset. We leave this for future work. It should be mentioned that automatic evaluation of summarization continues to be an open issue; while this dataset does not solve it, the dataset can be used with any evaluation metric as metrics evolve.
To gain insight into the qualities of the summarization models, we analyze the outputs they generate. The n-gram abstractiveness and percentage of novel n-grams of the summaries generated by Bart and BertSumAbs are plotted in Figure 6. As observed, the Bart model shows a trend similar to the ground-truth Tldrs. On the other hand, the BertSumAbs model shows increasing n-gram abstractiveness and novel n-gram percentage with increasing n. Interestingly, the BertSumAbs model reaches a plateau after 6-grams when generating novel n-grams, whereas we see a drop after 3-grams for Bart and the ground-truth Tldrs. This shows that from 1-grams to 3-grams, an increasing number of novel words appear in the ground truth and the Bart outputs, but beyond that, both tend to copy n-grams rather than generate them.
To understand the limitations and qualities of current state-of-the-art summarization models, we conduct a qualitative analysis on several samples from the TldrHQ dataset, one of which is shown in Figure 7. In this sample, Bart generates a better summary in terms of faithfulness to the ground-truth Tldr. BertSumAbs, while able to identify the important region of the source document, produces a longer Tldr with additional information that is present in the source but not in the ground-truth Tldr summary. The BertSumExt model identifies a source sentence that is partly connected to the ground-truth Tldr, but it leaves out the most important sentence, the oracle, which should have been extracted. Considering the upper-bound performance of extractive summarizers (i.e., the Oracle-Ext score in Table 5), we believe that there is large room for improvement on this dataset. Investigation of more advanced models remains for future work.
In this paper, we proposed two large-scale summarization datasets called Tldr9+ and TldrHQ. The Tldr9+ dataset contains over 9 million Reddit post-Tldr instances. To distill a more fine-grained dataset out of Tldr9+, we sample high-quality instances with the help of human annotations to construct TldrHQ. Our analyses of the Tldr9+ and TldrHQ datasets show their usability for both extractive and abstractive summarization tasks. We further establish extractive and abstractive baseline results using state-of-the-art summarization models on both datasets. We hope our datasets can pave the path for future studies in this direction.
We warmly thank the anonymous reviewers as well as Tracy King for their helpful feedback and suggestions.
- Althnian et al. (2021). Impact of dataset size on classification performance: an empirical evaluation in the medical domain. Applied Sciences 11, pp. 796.
- Baumgartner et al. (2020). The pushshift reddit dataset. In ICWSM.
- Bennet et al. (1954). Communications Through Limited-Response Questioning. Public Opinion Quarterly 18 (3), pp. 303–308.
- Biewald (2020). Experiment tracking with weights and biases. Software available from wandb.com.
- Cachola et al. (2020). TLDR: extreme summarization of scientific documents. Findings of the Association for Computational Linguistics: EMNLP 2020 abs/2004.15011.
- Cho et al. (2020). Better highlighting: creating sub-sentence summary highlights. In EMNLP.
- Cohan et al. (2018). A discourse-aware attention model for abstractive summarization of long documents. In NAACL-HLT.
- Dong et al. (2018). BanditSum: extractive summarization as a contextual bandit. In EMNLP.
- Gehrmann et al. (2018). Bottom-up abstractive summarization. In EMNLP.
- Gehrmann et al. (2019). Generating abstractive summaries with finetuned language models. In INLG.
- Graff et al. (2003). English gigaword. Linguistic Data Consortium, Philadelphia 4 (1), pp. 34.
- Grenander et al. (2019). Countering the effects of lead bias in news summarization via multi-stage training and auxiliary losses. In EMNLP.
- Grusky et al. (2018). Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In NAACL.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9, pp. 1735–1780.
- Kedzie et al. (2018). Content selection in deep learning models of summarization. In EMNLP.
- Kim et al. (2019). Abstractive summarization of reddit posts with multi-level memory networks. In NAACL.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- Kornilova and Eidelman (2019). BillSum: a corpus for automatic summarization of us legislation. ArXiv abs/1910.00523.
- Kumar et al. (2020). Noisy text data: achilles’ heel of bert. In WNUT.
- Lebanoff et al. (2020). Learning to fuse sentences with transformers for summarization. In EMNLP.
- Lewis et al. (2020). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.
- Liu and Inkpen (2015). Estimating user location in social media with stacked denoising auto-encoders. In VS@HLT-NAACL.
- Liu and Lapata (2019). Text summarization with pretrained encoders. In EMNLP/IJCNLP.
- Loshchilov and Hutter (2019). Decoupled weight decay regularization. In ICLR.
- MacAvaney et al. (2019). Ontology-aware clinical abstractive summarization. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Nallapati et al. (2017). SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In AAAI.
- Namysl et al. (2020). NAT: noise-aware training for robust neural sequence labeling. In ACL.
- Narayan et al. (2018). Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In EMNLP.
- Narayan et al. (2020). Stepwise extractive summarization and planning with structured transformers. In EMNLP.
- Rush et al. (2015). A neural attention model for sentence summarization.
- See et al. (2017). Get to the point: summarization with pointer-generator networks. In ACL.
- Sharma et al. (2019). BIGPATENT: a large-scale dataset for abstractive and coherent summarization. In ACL.
- Sotudeh et al. (2021a). On generating extended summaries of long documents. The AAAI-21 Workshop on Scientific Document Understanding (SDU).
- Sotudeh et al. (2021b). On generating extended summaries of long documents. SDU@AAAI abs/2012.14136.
- Sotudeh et al. (2020a). Attend to medical ontologies: content selection for clinical abstractive summarization. In ACL.
- Sotudeh et al. (2020b). GUIR at semeval-2020 task 12: domain-tuned contextualized models for offensive language detection. SemEval2020.
- Vaswani et al. (2017). Attention is all you need. ArXiv abs/1706.03762.
- Völske et al. (2017). TL;dr: mining reddit to learn automatic summarization. In NFiS@EMNLP.
- Wolf et al. (2020). Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45.
- Ying (2019). An overview of overfitting and its solutions. Journal of Physics: Conference Series 1168, pp. 022022.
- Zhang et al. (2019). PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In ICML.
- Zhu et al. (2019). Make lead bias in your favor: zero-shot abstractive news summarization. arXiv: Computation and Language.