Two Huge Title and Keyword Generation Corpora of Research Articles

by   Erion Çano, et al.

Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, stipulating the need for even bigger training corpora. Metadata of research articles are usually easy to find online and can be used to perform research on various tasks. In this paper, we introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research, containing 34 million and 23 million records, respectively. The data were retrieved from the Open Academic Graph which is a network of research profiles and publications. We carefully processed each record and also tried several extractive and abstractive methods of both tasks to create performance baselines for other researchers. We further illustrate the performance of those methods previewing their outputs. In the near future, we would like to apply topic modeling on the two sets to derive subsets of research articles from more specific disciplines.



page 1

page 2

page 3

page 4


Topic Segmentation of Research Article Collections

Collections of research article data harvested from the web have become ...

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

We present NEWSROOM, a summarization dataset of 1.3 million articles and...

TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising

Text summarization aims to extract essential information from a piece of...

Structured Summarization of Academic Publications

We propose SUSIE, a novel summarization method that can work with state-...

Unsupervised Learning Algorithms for Keyword Extraction in an Undergraduate Thesis

The amount of data managed in many academic institutions has increased i...

Keyphrase Generation: A Multi-Aspect Survey

Extractive keyphrase generation research has been around since the ninet...

Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study

Using data-driven models for solving text summarization or similar tasks...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.