TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts

10/04/2021
by Sajad Sotudeh, et al.

Recent summarization models consist of millions of parameters, and their performance depends heavily on the abundance of training data. While most existing summarization corpora contain on the order of thousands to one million instances, building large-scale summarization datasets on the order of a couple of million instances is yet to be explored. In practice, more training data helps a model generalize the patterns it learns to unseen data. In this paper, we introduce TLDR9+, a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum (https://github.com/sajastu/reddit_collector). This dataset is specifically gathered to perform extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. Going one step further, with the help of human annotations, we distill a more fine-grained dataset by sampling high-quality instances from TLDR9+, which we call the TLDRHQ dataset. We further benchmark different state-of-the-art summarization models on our proposed datasets.
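
To make the notion of an extreme-summarization instance concrete, the following is a minimal sketch of how a (source, summary) pair might be pulled from a single Reddit post. It assumes that the summary is whatever the author wrote after a "TL;DR" marker, which the abstract does not state explicitly. The names `TLDR_MARKER`, `split_post`, and `compression_ratio` are hypothetical and are not taken from the linked collector repository.

```python
import re

# Hypothetical marker pattern (an assumption, not the authors' rule):
# matches "TL;DR", "tldr", "TL DR", optionally followed by ":" or "-".
TLDR_MARKER = re.compile(r"\btl\s*;?\s*dr\b\s*[:\-]?\s*", re.IGNORECASE)


def split_post(text):
    """Split a post at its last TL;DR marker.

    Returns (source, summary), or None when the post has no marker
    or when either side of the marker is empty.
    """
    matches = list(TLDR_MARKER.finditer(text))
    if not matches:
        return None
    last = matches[-1]
    source = text[:last.start()].strip()
    summary = text[last.end():].strip()
    if not source or not summary:
        return None
    return source, summary


def compression_ratio(source, summary):
    """Source length divided by summary length, in whitespace tokens."""
    return len(source.split()) / len(summary.split())


if __name__ == "__main__":
    post = ("I adopted a dog last week and it chewed my shoes. "
            "TL;DR: new dog ate my shoes")
    pair = split_post(post)
    if pair is not None:
        source, summary = pair
        print(summary)
        print(compression_ratio(source, summary))
```

A higher ratio means a more aggressively compressed summary. A real pipeline would also need filtering for deleted posts, very short summaries, and noisy markers, none of which is sketched here.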

